E-Book Enlightenment

Donating E-Books To The Internet Archive


The Internet Archive is attempting to create an electronic version of the Library of Alexandria, preserving the public domain including books, audio recordings, movies, and even websites.  They scanned most of the books on the site themselves, using custom-built book scanning machines called Scribe workstations.  The easiest way to donate an e-book to their collection would be to send them the actual book and let them do the scanning.  You might not be able to get the book back afterward, though.  In addition to scanning public domain texts they also scan copyrighted titles.  These are not available for download by the general public, but if you are "print disabled" you can get them in a format known as DAISY (used to support text to speech).  You need to be vetted by the Library of Congress to get access to these copyrighted works.

If you prefer, you can scan your own book and submit it.  I have gone through this process with several books, so I can tell you what to expect if you go this route.

The first thing you need to do is go to the Internet Archive website and apply for a "virtual library card", by clicking on the Patron Info tab and following the instructions.  This is a different kind of library card, because you don't need one to download books from the site but you do need one to donate books.

Once you've gotten your card and logged into the site you can donate materials using the Upload/Share button in the upper right corner of the site.  You will be donating your book to the Community Texts collection.  Other collections are possible, but unless you represent a library you won't need to have your own collection.

Generally your text donations will be one of two possibilities, although other options are available.  The possibilities are:

Whichever one you have, pay attention to how you name the file if you want it to be downloadable by Get Books or Get Internet Archive Books.  Give your PDF the name you want to use as your identifier on the site, but without spaces.  For example:

  • You want your entry on IA to have the title "Big Aviation Book For Boys".
  • IA will create an identifier for your entry named "BigAviationBookForBoys".
  • Your content will be posted in a subdirectory named "BigAviationBookForBoys".
  • If you want your items to be downloadable by Get Internet Archive Books you'll want to follow the standard used internally by IA and name your PDF "BigAviationBookForBoys.pdf".
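The naming convention in the list above can be sketched in a few lines of Python.  This is only an illustration: the function name ia_names is made up for this example, and it assumes that removing whitespace is all that is needed.  How IA treats punctuation or other special characters in a title is not covered here.

```python
def ia_names(title):
    """Return (identifier, pdf_name) for a title.

    Assumption: the identifier is simply the title with all whitespace
    removed, as in the "Big Aviation Book For Boys" example.
    """
    identifier = "".join(title.split())
    return identifier, identifier + ".pdf"


if __name__ == "__main__":
    # prints ('BigAviationBookForBoys', 'BigAviationBookForBoys.pdf')
    print(ia_names("Big Aviation Book For Boys"))
```

Naming the PDF this way before you upload saves you from having to rename the derived files later.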

On one of my submissions I didn't do that.  I named my file "AncientMannersOriginalPages.pdf", so all the file names that were created were based on that file name.  I was able to rename most of them afterward, but not the EPUB file.  As a result you can't download that EPUB with Get Internet Archive Books.

When you upload your submission the website will run a derive job which will convert your PDF into several other formats:

Index of /17/items/BigAviationBookForBoys/

BigAviationBookForBoys.djvu                        17-Jun-2010 01:10             4036000
BigAviationBookForBoys.gif                         16-Jun-2010 23:39              319668
BigAviationBookForBoys.pdf                         25-May-2010 01:27           182866779
BigAviationBookForBoys_abbyy.gz                    17-Jun-2010 00:45             6944391
BigAviationBookForBoys_djvu.txt                    17-Jun-2010 01:18              456551
BigAviationBookForBoys_djvu.xml                    17-Jun-2010 00:51             4319378
BigAviationBookForBoys_files.xml                   17-Jun-2010 01:18                3235
BigAviationBookForBoys_jp2.zip                     16-Jun-2010 23:38            83198619
BigAviationBookForBoys_meta.xml                    17-Jun-2010 01:18                1473
BigAviationBookForBoys_scandata.xml                17-Jun-2010 01:10               96124
BigAviationBookForBoys_text.pdf                    17-Jun-2010 01:18            29897117
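The last column of the listing above is the file size in bytes.  As a rough sketch, the following Python converts three of those lines to megabytes, assuming one megabyte means 1024 x 1024 bytes.  The function name sizes_in_megabytes is made up for this example.

```python
# Three lines copied from the listing above: name, date, time, size in bytes.
LISTING = """\
BigAviationBookForBoys.djvu                        17-Jun-2010 01:10             4036000
BigAviationBookForBoys.pdf                         25-May-2010 01:27           182866779
BigAviationBookForBoys_text.pdf                    17-Jun-2010 01:18            29897117
"""


def sizes_in_megabytes(listing):
    """Map each file name in the listing to its size in megabytes."""
    sizes = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip anything that is not a name/date/time/size line
        name, _date, _time, size = fields
        # Assumption: 1 megabyte = 1024 * 1024 bytes.
        sizes[name] = round(int(size) / (1024 * 1024), 1)
    return sizes


if __name__ == "__main__":
    for name, megabytes in sizes_in_megabytes(LISTING).items():
        print(name, megabytes)
```

Under that assumption the DjVu file comes to about 3.8 megabytes, the original PDF to about 174.4, and the derived PDF to about 28.5.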

BigAviationBookForBoys.pdf is my original submission, and all the others were derived.  You can ignore the .xml files.  The website has its own uses for these, but they are not something you are likely to want to download.  The rest of the files are:

  • BigAviationBookForBoys.djvu:  A DjVu version of the book.  This e-book is only 3.8 megabytes yet it looks as good as my original 174 megabyte PDF!
  • BigAviationBookForBoys.gif: An animated GIF file that shows sample pages from the book and is shown on the web page for the book.
  • BigAviationBookForBoys.pdf:  My original monster PDF.
  • BigAviationBookForBoys_abbyy.gz:  A gzip-compressed XML file created by or for ABBYY FineReader, which you can ignore.
  • BigAviationBookForBoys_djvu.txt:  A text file created by ABBYY FineReader out of the PDF.  With a little proofreading this could be used as the basis for a Project Gutenberg E-text.
  • BigAviationBookForBoys_jp2.zip:  A zip archive containing page images extracted from your PDF and saved in the JPEG2000 format.  You're not likely to use this for your own submissions, but if you wanted to do a Project Gutenberg submission of a text in the Internet Archive this might be helpful.
  • BigAviationBookForBoys_text.pdf:  The PDF that the Internet Archive created from my original 174 megabyte monster.  28.5 megabytes -- quite an improvement!  The word text in the name refers to the layer of OCR'd text in the PDF that makes it searchable and supports copying text to the clipboard.

In addition to these there is an EPUB file that you can download from the main page.  It looks like this:

[Image: Boys Aviation book EPUB.jpg]

If you're a glass-half-empty kind of person you'll note that the EPUB needs a lot of proofreading before you could really give it to anyone.  If you're a glass-half-full kind of person you'll note that 90% of the text is right and the program that generated the EPUB has done a great job with the illustrations.  It has found them, cropped and resized them, and placed them in the EPUB close to where you'd like them to be.  You can use this EPUB as the basis for a hand-crafted version and save yourself some work.

The first thing you're going to want to do after your book has "derived" is rename BigAviationBookForBoys.pdf to BigAviationBookForBoysLarge.pdf and rename BigAviationBookForBoys_text.pdf to BigAviationBookForBoys.pdf.  The Get Internet Archive Books Activity will download the BigAviationBookForBoys.pdf file when you specify that you want to download a PDF, so it's important that that name points to the smaller file.
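The order of those two renames matters, because the second one reuses the name that the first one frees up.  Here is a small Python sketch of the plan.  It only works on a list of names in memory and does not talk to the Internet Archive; the function names rename_plan and apply_plan are made up for this example.

```python
def rename_plan(identifier):
    """Return the two renames described above, in the order they must happen."""
    return [
        # First move the big original out of the way...
        (identifier + ".pdf", identifier + "Large.pdf"),
        # ...then give the smaller derived PDF the standard name.
        (identifier + "_text.pdf", identifier + ".pdf"),
    ]


def apply_plan(file_names, plan):
    """Apply renames to a set of names, refusing to overwrite an existing name."""
    names = set(file_names)
    for old, new in plan:
        if old not in names:
            raise ValueError("no such file: " + old)
        if new in names:
            raise ValueError("name already in use: " + new)
        names.remove(old)
        names.add(new)
    return names


if __name__ == "__main__":
    before = ["BigAviationBookForBoys.pdf", "BigAviationBookForBoys_text.pdf"]
    print(sorted(apply_plan(before, rename_plan("BigAviationBookForBoys"))))
```

If you tried the renames in the opposite order, the second file would collide with the original, which is why the sketch raises an error in that case.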

It sometimes happens that the derive job fails.  You'll know this because a day later your original posting is the only file available for download on the page.  There are only two ways to deal with this that I know of.  The first is to post in the Community Texts forum on the website.  As it happens the first two books I donated to the Internet Archive did not derive successfully.  My post on the forum was never answered.  The second thing I tried was to send an email to info@archive.org.  I didn't get immediate action, but one of the staff did rerun the derive job for both books and both were processed successfully.
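If you keep a list of the file names shown on your book's page, a check like the following Python sketch can tell you whether the derive job seems to have failed.  It is only a heuristic under my own assumptions: the suffixes are taken from the BigAviationBookForBoys listing earlier in this chapter, and the function name derive_seems_to_have_failed is made up for this example.

```python
# Assumption: a successful derive produces at least one of these files,
# as seen in the BigAviationBookForBoys listing.
DERIVED_SUFFIXES = (".djvu", "_djvu.txt", "_text.pdf")


def derive_seems_to_have_failed(identifier, file_names):
    """Return True if none of the expected derived files are present."""
    expected = {identifier + suffix for suffix in DERIVED_SUFFIXES}
    return not (expected & set(file_names))


if __name__ == "__main__":
    # Only the original upload is present, so this prints True.
    print(derive_seems_to_have_failed(
        "BigAviationBookForBoys", ["BigAviationBookForBoys.pdf"]))
```

Give the website a day before you draw any conclusions, since the derived files do not appear immediately.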

In the case of the two books where the "derive" job did not run the first time, I was uploading the PDF from a Linux box, and on Linux the normal "Share" button does not work.  You need to use an alternate method of uploading texts that does not use Flash in the web browser.  For my other books I used Internet Explorer on MS Windows with the normal Flash-using "Share" button.  For these books the "derive" jobs ran just fine.  If you are a Linux user you may find it worthwhile to use a Windows box or a Macintosh to upload your donations to the Internet Archive.


When considering how you will create your submission to the Internet Archive you might find it useful to look at my submissions first.  For reasons I will not attempt to explain my user id on the Internet Archive website is nicestep.  If you enter that in the Search box on the website it will list out all my donations.  Some of these donations were done twice.  The first version was done using manual cropping and clean up, and the second was done with Scan Tailor.  Some of my more recent submissions were only done with Scan Tailor.

It is a natural impulse when seeing a beautifully crafted e-book like Abroad to want to try to do something just as well.  There's nothing wrong with wanting this, but getting results like that is not easy.  You need well-lit pages when taking the photos and lots and lots of patience.  Your first attempts will probably look worse than what you get with Scan Tailor, and will be a great deal more work.  Not every book is worth that kind of effort, and not every book will benefit from that effort.

The books Thirteen Women and Out Of This World are not masterpieces of the book maker's art.  They were cheaply made and show signs of wear.  The e-books I made of them using Scan Tailor look better than the originals.

My newer submissions are in the public domain because of Rule 6, at least as far as I am able to determine.  IA does not require copyright clearance before posting, but they will remove copyrighted material from the index when they find it.  They do distribute copyrighted works in DAISY format to qualified persons, so even if your donation turns out to not be in the public domain it is not necessarily lost.  I look up my submissions in the Stanford copyright renewal database to verify that they are eligible.