Fixing ebooks with errors – A personal challenge
I have wholly embraced the ebook revolution. As a longtime traveler and SciFi aficionado, I have assembled a large collection of books that I keep reading through to mark them off my to-do list.
Being a fan of science fiction, I have been forced to acquire some of my books by extra-legal means. Since many of the classic tomes of the golden era of SciFi are out of print, and have no official ebook release to buy, I turn to the internet.
With few exceptions, these books are scanned and OCR’d from print, and then stuffed into a file to read. Lots of early Heinlein and many obscure authors exist only this way.
The problem: OCR still sucks. Even the best algorithms barf on a lot of text, so there are spots of garbage in many of these books.
I sometimes make it a personal mitzvah to clean up a book.
A classic example was the “To the Stars” trilogy by Harry Harrison (his real name, not a nom de plume). It was a rather poor scan and conversion to an RTF file. It was painful to fix, but totally worth it, because it made the book completely readable.
However, if your book is in ePub or PDF format, you have fewer options.
The program I go to is Sigil. Provided there is no DRM, you can open and inspect the book, and fix small things. If you are savvy, you can also dive into the CSS stylesheet and alter fonts, indents, and other text properties (but be warned, some readers ignore much of the CSS rules and classes; I’m looking at you, Sony Reader).
Sigil lets you look at the text as it renders, in a split screen with the code below the rendered text, or as pure code. You can fix a lot of errors and glitches by searching and editing the code, then saving back to the original file.
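For hunting OCR glitches outside of Sigil, here is a minimal sketch in Python. It relies on the fact that an ePub is a ZIP container holding (X)HTML files. The function name `find_suspect_words` and the "digit stuck inside a word" pattern are my own assumptions for illustration, not anything Sigil does; tune the pattern to whatever garbage your scan actually contains.

```python
import re
import zipfile

# Assumed heuristic: a digit wedged between two letters (e.g. "He1lo")
# is a common OCR slip. This is an illustrative pattern, not a standard.
SUSPECT = re.compile(r"[A-Za-z][0-9][A-Za-z]")
TAG = re.compile(r"<[^>]+>")  # crude tag stripper, good enough for a scan


def find_suspect_words(epub_file, pattern=SUSPECT):
    """Return {member_name: [matched fragments]} for the HTML files in an ePub.

    epub_file may be a path or a file-like object; the book must be DRM-free.
    """
    hits = {}
    with zipfile.ZipFile(epub_file) as book:
        for name in book.namelist():
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue  # skip CSS, images, metadata
            text = book.read(name).decode("utf-8", errors="replace")
            text = TAG.sub(" ", text)  # look at the prose, not the markup
            found = pattern.findall(text)
            if found:
                hits[name] = found
    return hits
```

This only reports where the likely trouble is; the actual fixing is still best done by eye in an editor.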
A future series of posts will go into depth on how to better structure the ebook.
Another good program, and one that is widely used, is Calibre. A library and file manipulation program, it is open source and extensible. It makes it easy to convert from one format to another (Kindle to ePub, or LRF to ePub, and many other options).
A nice touch is that in Calibre you can better set up the ISBN and the cover images, and pull data on the book from public databases. I used Calibre to convert a collection of Doc Savage stories from the LRF format (the original Sony Reader format) to ePub, and to add good cover pictures.
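If you have a pile of files to convert, Calibre also ships a command-line tool, `ebook-convert`, that takes an input file and an output file and picks the formats from the extensions. The wrapper below is a rough sketch, assuming Calibre is installed and on your PATH; the helper names `build_convert_command` and `convert` are mine, not part of Calibre.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target_ext="epub"):
    """Build the ebook-convert command line for one file."""
    source = Path(source)
    # Same file name, new extension; ebook-convert infers formats from these.
    target = source.with_suffix("." + target_ext.lstrip("."))
    return ["ebook-convert", str(source), str(target)]


def convert(source, target_ext="epub"):
    """Run the conversion and return the output path."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is Calibre installed?")
    cmd = build_convert_command(source, target_ext)
    subprocess.run(cmd, check=True)  # raises if the conversion fails
    return cmd[2]
```

Calling `convert("book.mobi")` would, under those assumptions, write `book.epub` next to the original.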
In fact, most of the ebook files I look at in Sigil have signs of being converted/cleaned by Calibre, even some commercial books.
Doing this work, you find some things like:
- Files which came from Microsoft Word, littered with the “class=msonormal” attribute. Ugh. I don’t usually curse too much about Microsoft Office, but the HTML it outputs, once converted into an ebook, is a crime against humanity.
- Most ebooks, even commercial, professionally edited and assembled ones, have horrible structure: no proper links to the chapters, no proper tables of contents. Commercial books are much more likely to get this right, but it is a disaster on the community sourced works. I am working up a process to fix that.
- There are some truly shitty OCR engines out there. Even high priced, high performance engines have trouble, and the second tier is atrocious. Someone once grumbled on Slashdot about why there weren’t any good (free) open source OCR engines. The answer is that it is friggin hard, and it often becomes a lifetime’s work to tune and improve the algorithm, so the good ones are not in a hurry to be given away.
I rarely make it a mission to fix an ebook, but when I do, I want to leave something that is a better experience to read.
(For the record, if there is a place to buy a book, I will always buy it, but much of what I read is esoteric, or out of print, so I am forced into alternatives.)
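The Word leftovers from the first item above are the easiest of these to clean mechanically. Here is a small sketch in Python, assuming the junk shows up as a lone `class` attribute whose value starts with "Mso" (quoted or not). The function name `strip_mso_classes` is hypothetical, and a regex is a blunt tool for HTML, so check the result before saving over your only copy.

```python
import re

# Matches a class attribute holding a single Word-style name such as
# class="MsoNormal" or class=msonormal. Attributes with several class
# names are deliberately left alone.
MSO_CLASS = re.compile(r'\s+class\s*=\s*(["\']?)Mso\w+\1', re.IGNORECASE)


def strip_mso_classes(html):
    """Remove Word-generated Mso* class attributes from an HTML string."""
    return MSO_CLASS.sub("", html)
```

For example, `strip_mso_classes('<p class="MsoNormal">Hi</p>')` gives back a plain `<p>Hi</p>`, while a paragraph with your own class names is untouched.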
All the books at Project Gutenberg are in the public domain, and their standards are very high. I currently volunteer at Distributed Proofreaders, which does the bulk of their content creation. Individual pages go through three rounds of proofreading and two of formatting before being stitched together by a Post-Processor. Then THAT is double-checked by someone at PG before ultimately being posted.
Here is a link to their Science Fiction selections:
I stand corrected. The quality of PG books has improved immensely. I stopped grabbing them a couple of years ago, because they were not much better than the scans I could find, but the few I sampled are pretty kick ass in formatting and quality.
However, their website is atrocious: difficult to navigate and cumbersome to download from. I do like the ability to push the files to my Google Drive, but it takes too many steps to get a book.
Some whiz kids could probably turn this into a great experience.
The real problem with PG is not their fault. The copyright laws in this country are so warped that much of the golden era of SciFi will probably never be in the public domain. The problem goes far beyond printed works, but the exceptions that keep getting added to protect Mickey Mouse are really damaging the dissemination of information.
Much of what I like to read is still covered by copyright, but is out of print, and has little chance of ever being properly formatted for the ebook world. There is just no legal way to get these works, and that makes the baby Jesus cry.
I get many of my books from purveyors in Canada, which still has reasonable copyright protections, not a forever-extending rolling door of exclusions that hurt the public good.
I have many Project Gutenberg files, and some are great, but many of them are not much better than the “cut the binding and scan” files I find in the wild.
Glad to hear that there is some editorial process though.