Most of my time in the last week (apart from organising my land/house) has been taken up by Project Gutenberg. From the website:
“Project Gutenberg is the oldest producer of free electronic books (eBooks or etexts) on the Internet. Our collection of more than 12.000 eBooks was produced by hundreds of volunteers. Most of the Project Gutenberg eBooks are older literary works that are in the public domain in the United States. All may be freely downloaded and read, and redistributed for non-commercial use…”
After reading Bram Stoker’s Dracula, I managed to find Project Gutenberg of Australia, which, although it might not have quite as many texts as its American counterpart, sure has an ace up its sleeve in its copyright laws.
Copyright!
The copyright on books in Australia ends fifty years after the author’s death (except for books released after their death, where it’s the release year + 50 years). The USA has a few different rules: anything published before 1923 is okay, but anything published from 1923 to 1977 gets 95 years of copyright.
This means, for example, that George Orwell’s Nineteen Eighty-Four can’t be put online in the USA until 2044!! Thankfully, Australia’s laws allowed it to be online in 2000. Quite a difference!
This obviously means we have more opportunities than our American counterparts, and so we can include authors such as George Orwell and many others. The list of ones that are done can be found here. As you’ll notice, there’s a lot more to be done.
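If you want to check the dates yourself, here is a rough Python sketch of the arithmetic above. It only encodes the simplified rules as I’ve described them in this post (it is not legal advice, and the real laws have more cases). The function names are made up for illustration, and the two Orwell dates (he died in 1950, and Nineteen Eighty-Four was published in 1949) are facts I’ve added that aren’t stated above.

```python
from typing import Optional


def australia_copyright_end(death_year: int,
                            posthumous_release_year: Optional[int] = None) -> int:
    """Year copyright ends under the simplified Australian rule in this post:
    author's death + 50, or release year + 50 for books released after death."""
    if posthumous_release_year is not None:
        return posthumous_release_year + 50
    return death_year + 50


def usa_copyright_end(publication_year: int) -> Optional[int]:
    """Year copyright ends under the simplified USA rule in this post.
    Returns None for works published before 1923 (already okay to host)."""
    if publication_year < 1923:
        return None
    if publication_year <= 1977:
        return publication_year + 95
    # The post says nothing about later works, so don't guess.
    raise ValueError("this sketch does not cover works published after 1977")


# Assumed dates for the Orwell example (not stated in the post itself).
ORWELL_DEATH_YEAR = 1950
NINETEEN_EIGHTY_FOUR_PUBLISHED = 1949

print(australia_copyright_end(ORWELL_DEATH_YEAR))         # 2000
print(usa_copyright_end(NINETEEN_EIGHTY_FOUR_PUBLISHED))  # 2044
```

Those two numbers are where the 2000 and 2044 above come from: a 44 year gap for the same book.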
Where do they all come from?
I wondered just how all these texts end up online. After looking around for a while I stumbled upon Project Gutenberg’s Distributed Proofreaders – a kind of SETI@home for Gutenberg texts. It works like this:
1) You register – this takes all of ten seconds, no fancy details required.
2) You pick out a book, read any special instructions (a lot don’t have any), then start proofreading.
It’s surprisingly easy to do! You have the original scanned page and the text of that page to edit right in front of you. The text was gathered using OCR (optical character recognition), which (if you are unsure) is basically a program that takes an image, grabs all the words out of it and turns them into normal text.
As you’d guess, the OCR is never perfect, or at least can’t perfectly format things the way they want it to, so we humans step in here and check it after the OCR does its work.
All you have to do really (and there are a few ways of doing it, but this is my method) is read the text that the OCR has prepared, and if you see something odd, look at the same spot on the original scanned page and make a decision. There’s even a guide for pretty much everything tricky you might come across.
They say that if you can do a page a day (which would take you not even two minutes to proof) you are making a good effort.
Instead of playing games, I now proof (I’m trying to hit 33 pages a day for as long as I can), because once these texts are digitised they will last forever, long after I die. The games, well, they are a fun waste of time and that’s about all. Gutenberg seems a much more worthwhile cause to me =)
Links of interest…
Project Gutenberg – The USA site that started it all. Very old interface which makes it hard to casually browse, but if you know what you are looking for it’s very good.
Mazarin – A much better interface for finding interesting things to read. It’s a search engine that not only searches for authors and titles, but searches every word in every text. Wow!
Project Gutenberg of Australia – Also a somewhat confusing design, and still a baby compared to the USA collection. (Note: it only hosts files that couldn’t be hosted legally on the USA site; nothing is doubled up.)
Project Gutenberg’s Distributed Proofreaders – (As their slogan says) Preserving history one page at a time. Go join, add yourself to the Aussie team, and proof a page!






