E-Book Enlightenment

Donating E-Books To The Internet Archive

The Internet Archive is attempting to create an electronic version of the Library of Alexandria, preserving the public domain, including books, audio recordings, movies, and even websites.  Most of the books on the site were scanned by the Archive itself, using custom-built book scanning machines called Scribe workstations.  The easiest way to donate an e-book to their collection would be to send them the actual book and let them do the scanning.  You might not be able to get the book back afterward, though.

In addition to scanning public domain texts they also scan copyrighted titles.  These are not available for download by the general public, but if you are "print disabled" you can get them in a format known as DAISY, which supports text to speech.  You need to be vetted by the Library of Congress to get access to these copyrighted works.

If you prefer you can scan your own book and submit it.  I have gone through this process with several books, so I can tell you what to expect if you go this route.

The first thing you need to do is go to the Internet Archive website and apply for a "virtual library card", by clicking on the Patron Info tab and following the instructions.  This is a different kind of library card, because you don't need one to download books from the site but you do need one to donate books.

Once you've gotten your card and logged into the site you can donate materials using the Upload/Share button in the upper right corner of the site.  You will be donating your book to the Community Texts collection.  Other collections are possible, but unless you represent a library you won't need to have your own collection.

Generally your text donations will be one of two possibilities, although other options are available.  The possibilities are:

Whichever one you have, you should pay attention to how you name the file if you want it to be downloadable by the Get Books or Get Internet Archive Books Activities.  Name your PDF after the identifier you want to use on the site, which is the title without spaces.  For example:

  • You want to use "Big Aviation Book For Boys" as your title on IA.
  • IA will create an identifier for your entry named "BigAviationBookForBoys".
  • Your content will be posted in a sub directory named "BigAviationBookForBoys".
  • If you want your items to be downloadable by Get Internet Archive Books you'll want to follow the standard used internally by IA and name your PDF "BigAviationBookForBoys.pdf".
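
The naming rule in the list above can be sketched in a few lines of Python.  This is only an illustration of the convention described here (drop the spaces from the title and add ".pdf").  The function names are hypothetical, and the Internet Archive may treat identifiers in ways this sketch does not cover.

```python
def title_to_identifier(title: str) -> str:
    """Remove all whitespace from a title, keeping its capitalization."""
    return "".join(title.split())


def pdf_name_for(title: str) -> str:
    """File name to give the PDF before uploading it (hypothetical helper)."""
    return title_to_identifier(title) + ".pdf"


print(title_to_identifier("Big Aviation Book For Boys"))
print(pdf_name_for("Big Aviation Book For Boys"))
```

Running this prints the identifier and the PDF name used throughout this chapter.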

On one of my submissions I didn't do that.  I named my file "AncientMannersOriginalPages.pdf", so all the file names that were created were based on that file name.  I was able to rename most of them afterward, but not the EPUB file.  As a result you can't download that EPUB with Get Internet Archive Books.

When you upload your submission the website will run a derive job which will convert your PDF into several other formats:

Index of /17/items/BigAviationBookForBoys/


BigAviationBookForBoys.djvu               4036000
BigAviationBookForBoys.gif                 319668
BigAviationBookForBoys.pdf              182866779
BigAviationBookForBoys_abbyy.gz           6944391
BigAviationBookForBoys_djvu.txt            456551
BigAviationBookForBoys_djvu.xml           4319378
BigAviationBookForBoys_files.xml             3235
BigAviationBookForBoys_jp2.zip           83198619
BigAviationBookForBoys_meta.xml              1473
BigAviationBookForBoys_scandata.xml         96124
BigAviationBookForBoys_text.pdf          29897117

BigAviationBookForBoys.pdf is my original submission, and all the others were derived.  You can ignore the .xml files.  The website has its own uses for these, but they are not something you are likely to want to download.  The rest of the files are:

  • BigAviationBookForBoys.djvu:  A DjVu version of the book.  This e-book is only 3.8 megabytes yet it looks as good as my original 174 megabyte PDF!
  • BigAviationBookForBoys.gif: An animated GIF file that shows sample pages from the book and is shown on the web page for the book.
  • BigAviationBookForBoys.pdf:  My original monster PDF.
  • BigAviationBookForBoys_abbyy.gz:  A gzip-compressed XML file created by or for ABBYY FineReader, which you can ignore.
  • BigAviationBookForBoys_djvu.txt:  A text file created by ABBYY FineReader out of the PDF.  With a little proofreading this could be used as the basis for a Project Gutenberg E-text.
  • BigAviationBookForBoys_jp2.zip:  A zip archive containing page images extracted from your PDF and saved in the JPEG2000 format.  You're not likely to use this for your own submissions, but if you wanted to do a Project Gutenberg submission of a text in the Internet Archive this might be helpful.
  • BigAviationBookForBoys_text.pdf:  The PDF that the Internet Archive created from my original 174 megabyte monster.  It is 28.5 megabytes, which is quite an improvement!  The word "text" in the name refers to the layer of OCR'd text in the PDF that makes it searchable and supports copying text to the clipboard.
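
The megabyte figures quoted above come from the byte counts in the file listing.  Here is a small sketch of the arithmetic, assuming a megabyte means 1,048,576 bytes (1024 x 1024); the variable and function names are only illustrative.

```python
BYTES_PER_MEGABYTE = 1024 * 1024  # assumption: binary megabytes

# Byte counts copied from the file listing earlier in this chapter.
sizes_in_bytes = {
    "BigAviationBookForBoys.djvu": 4036000,
    "BigAviationBookForBoys.pdf": 182866779,
    "BigAviationBookForBoys_text.pdf": 29897117,
}


def to_megabytes(size_in_bytes: int) -> float:
    """Convert a size in bytes to megabytes."""
    return size_in_bytes / BYTES_PER_MEGABYTE


for name, size in sizes_in_bytes.items():
    print(f"{name}: {to_megabytes(size):.1f} MB")
```

By this measure the DjVu file is roughly one forty-fifth the size of the original PDF.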

In addition to these there is an EPUB file that you can download from the main page.  It looks like this:

[Screenshot: Boys Aviation book EPUB.jpg]

If you're a glass-half-empty kind of person you'll note that the EPUB needs a lot of proofreading before you could really give it to anyone.  If you're a glass-half-full kind of person you'll note that 90% of the text is right and the program that generated the EPUB has done a great job with the illustrations.  It has found them, cropped and resized them, and placed them in the EPUB nearly where you'd like them to be.  You can use this EPUB as the basis for a hand-crafted version and save yourself some work.

The first thing you're going to want to do after your book has "derived" is rename BigAviationBookForBoys.pdf to BigAviationBookForBoysLarge.pdf and rename BigAviationBookForBoys_text.pdf to BigAviationBookForBoys.pdf.  The Get Internet Archive Books Activity will download the BigAviationBookForBoys.pdf file when you specify that you want to download a PDF, so it's important that that name points to the smaller file.
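
The order of the two renames matters, because the second one reuses the name freed up by the first.  The sketch below only simulates the renames on a set of file names; it does not talk to the Internet Archive, and the function names are hypothetical.

```python
from __future__ import annotations


def rename_plan(identifier: str) -> list[tuple[str, str]]:
    """Return (old_name, new_name) pairs in the order they must be applied."""
    return [
        # First move the large original out of the way...
        (f"{identifier}.pdf", f"{identifier}Large.pdf"),
        # ...then give the smaller derived PDF the standard name.
        (f"{identifier}_text.pdf", f"{identifier}.pdf"),
    ]


def apply_plan(filenames: set[str], plan: list[tuple[str, str]]) -> set[str]:
    """Simulate the renames, refusing to overwrite an existing name."""
    result = set(filenames)
    for old, new in plan:
        if old not in result:
            raise KeyError(f"{old} is not in the file list")
        if new in result:
            raise ValueError(f"{new} already exists")
        result.remove(old)
        result.add(new)
    return result


files = {"BigAviationBookForBoys.pdf", "BigAviationBookForBoys_text.pdf"}
print(sorted(apply_plan(files, rename_plan("BigAviationBookForBoys"))))
```

If you swapped the two steps, the simulation would raise an error, because the name BigAviationBookForBoys.pdf would still be taken by the original.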

It sometimes happens that the derive job fails.  You'll know this because a day later your original posting is the only file available for download on the page.  There are only two ways to deal with this that I know of.  The first is to post in the Community Texts forum on the website.  As it happens the first two books I donated to the Internet Archive did not derive successfully.  My post on the forum was never answered.  The second thing I tried was to send an email to info@archive.org.  I didn't get immediate action, but one of the staff did rerun the derive job for both books and both were processed successfully.
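
If you want a quick way to tell whether the derive job has finished, you can compare the file names on your book's page with the names you expect.  The sketch below assumes the derived files follow the naming pattern shown in the listing earlier in this chapter.  The suffix list is my own selection from that listing, not an official one, and you would paste in the file names by hand.

```python
# Suffixes taken from the example listing in this chapter (an assumption,
# not an official list of what every derive job produces).
DERIVED_SUFFIXES = (".djvu", "_djvu.txt", "_jp2.zip", "_text.pdf")


def missing_derivatives(identifier, filenames):
    """Return the expected derived file names that are not in filenames."""
    present = set(filenames)
    expected = [identifier + suffix for suffix in DERIVED_SUFFIXES]
    return [name for name in expected if name not in present]


# A page where only the original upload is available: the derive job
# has probably failed or has not run yet.
print(missing_derivatives("BigAviationBookForBoys",
                          ["BigAviationBookForBoys.pdf"]))
```

An empty list means every expected file is there.  Anything else, a day after uploading, suggests it is time to write to the forum or send an email.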

For the two books where the "derive" job did not run the first time, I had uploaded the PDF from a Linux box.  On Linux the normal "Share" button does not work, so you need to use an alternate method of uploading texts that does not use Flash in the web browser.  For my other books I used Internet Explorer on MS Windows with the normal Flash-based "Share" button, and for these books the "derive" jobs ran just fine.  For this reason, if you are a Linux user you may find it worthwhile to use a Windows box or a Macintosh to upload your donations to the Internet Archive.

EPUBs and MOBIs

When you look at the list of derived files you won't see any files with the suffix of .mobi or .epub.  Yet the Internet Archive does support downloading books in those formats, and you may well want to replace the automatically generated MOBI and EPUB files with versions that you have corrected and improved by hand.  I wanted to do this with my own book, Make Your Own Sugar Activities!  If you can't see files named .epub and .mobi in the file list, how is it possible to delete the automatically generated versions and replace them with your own?

It turns out that the EPUB and MOBI versions are not generated as part of the "derive" job.  Instead, they are generated on the fly when you click on the download link.  If the program that generates the EPUB or MOBI sees that a file with the correct name has been uploaded to the server it will give that file to the person doing the download instead of generating one. 
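
The behavior described above can be modeled in a few lines.  This is only a sketch of the logic as I understand it from the outside, not the Internet Archive's actual code, and the function name is hypothetical.

```python
def choose_download(identifier, extension, uploaded_files):
    """Decide whether a request gets an uploaded file or a generated one.

    Returns a (source, filename) pair, where source is "uploaded" if a file
    with the expected name is already on the server, else "generated".
    """
    wanted = f"{identifier}.{extension}"
    if wanted in uploaded_files:
        return ("uploaded", wanted)
    return ("generated", wanted)


on_server = {"BigAviationBookForBoys.pdf", "BigAviationBookForBoys.epub"}
print(choose_download("BigAviationBookForBoys", "epub", on_server))
print(choose_download("BigAviationBookForBoys", "mobi", on_server))
```

The point to notice is that the match is made on the file name alone, which is why your hand-made EPUB or MOBI must use the same identifier as the rest of the files.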

Here is the file list for Ancient Manners, another book where I donated both a PDF and handcrafted EPUB and MOBI versions:

As you can see, on May 31st I uploaded my EPUB and MOBI, making sure that the filename was consistent with the PDF I uploaded originally.  As a result of this anyone who requests an EPUB or Kindle version will get my beautifully handcrafted version, not the generated version.

You can only update books whose entries you created yourself, so you might wonder what happens if you use Booki to create a corrected EPUB of a book you did not donate.  So far the answer is that you'll need to submit the corrected EPUB as a brand new book.  There is no way to replace the uncorrected version in this case.

It makes a lot of sense to handcraft EPUBs and MOBIs for the books you donate to the Internet Archive.  There are books like The Big Book Of Aviation For Boys that should be in the public domain (because there is no evidence that its copyright was renewed), but which Project Gutenberg would not accept because they felt that the copyright status was still in doubt.  The Internet Archive is a good home for books like that.

If you do donate corrected EPUBs, make sure you change the description of your book to call attention to it.

Examples

When considering how you will create your submission to the Internet Archive you might find it useful to look at my submissions first.  For reasons I will not attempt to explain my user id on the Internet Archive website is nicestep.  If you enter that in the Search box on the website it will list out all my donations.  Some of these donations were done twice.  The first version was done using manual cropping and clean up, and the second was done with Scan Tailor.  Some of my more recent submissions were only done with Scan Tailor.

It is a natural impulse when seeing a beautifully crafted e-book like Abroad to want to try and do something just as well.  There's nothing wrong with wanting this, but getting results like that is not easy.  You need well-lit pages when taking the photos and lots and lots of patience.  Your first attempts will probably look worse than what you get with Scan Tailor, and will be a great deal more work.  Not every book is worth that kind of effort, and not every book will benefit from that effort.

The books Thirteen Women and Out Of This World are not masterpieces of the book maker's art.  They were cheaply made and show signs of wear.  The e-books I made of them using Scan Tailor look better than the originals.

My newer submissions are in the public domain because of Rule 6, at least as far as I am able to determine.  IA does not require copyright clearance before posting, but they will remove copyrighted material from the index when they find it.  They do distribute copyrighted works in DAISY format to qualified persons, so even if your donation turns out to not be in the public domain it is not necessarily lost.  I look up my submissions in the Stanford copyright renewal database to verify that they are eligible.

The Internet Archive And The Nook Store

After I had published several books on the Kindle Store I decided to check out the Nook Store, only to find that most of the books I had donated to the Internet Archive had already found their way to the Nook Store.  Barnes and Noble had taken the automatically generated, unproofed EPUBs from the Internet Archive and was offering them as free books in the Nook Store.  I can't figure out why they took some and not others.  There definitely is a human element involved, although nobody is correcting the texts.

Another outfit called Kessinger Publishing has taken Image Container PDFs from the Internet Archive and will print paperback books on demand from them.  Both Barnes and Noble and Amazon sell these books, and a couple of them might be based on page images I made.  These reprinted books are not cheap either!

None of this is illegal.  I only mention it so you'll know that most of the "free" books for the Nook and Kindle are free on the XO laptop as well, and you don't need to have a credit card to get them!