Fixing ebooks with errors – A personal challenge

I have wholly embraced the eBook revolution. As a longtime traveler and SciFi aficionado, I have assembled a large collection of books that I am steadily reading my way through, marking them off my to-do list.

Being a fan of science fiction, I have been forced to acquire some of my books by extra-legal means. Since many of the classic tomes of the golden era of SciFi are out of print, and have no official ebook release to buy, I turn to the internet.

With few exceptions, these books are scanned and OCR’d from print, and then stuffed into a file to read. Lots of early Heinlein, and many obscure authors, exist only this way.

The problem: OCR still sucks. Even the best algorithms barf a lot on text, so there are spots of garbage in many of these books.

I sometimes make it a personal mitzvah to clean up a book.

A classic example was the “To the Stars” trilogy, by Harry Harrison (his real name, not a nom de plume). It was a rather poor scan and conversion to an RTF file. It was a painful process to fix, but totally worth it, because it made the book completely readable.

However, if your book is in ePub or PDF format, you have fewer options.

Sigil, a pretty awesome open source ePub editor

The program I go to is Sigil. Provided there is no DRM, you can open and inspect the book, and fix small things. If you are savvy, you can also dive into the CSS stylesheet and alter fonts, indents, and other text properties (but be warned, some readers ignore many of the CSS rules and classes – I’m looking at you, Sony Reader).

Sigil lets you view the text as it renders, as a split screen with the code below the rendered text, or as just the pure code. You can fix a lot of errors and glitches by searching and editing the code, then saving back to the original file.
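For the glitches that repeat all through a book, the same search-and-fix idea can be scripted. This is only a rough sketch in Python, not anything Sigil does internally: the function name apply_fixes and the patterns in OCR_FIXES are my own made-up examples, and you would build your own list from the garbage you actually see in a given book.

```python
import re

# Hypothetical fix list: (regex pattern, replacement) pairs.
# These patterns are illustrative assumptions, not a real catalog of OCR errors.
OCR_FIXES = [
    (r"\btbe\b", "the"),      # 'h' misread as 'b'
    (r"\bwbich\b", "which"),  # same confusion in another common word
    (r" {2,}", " "),          # runs of spaces collapsed to one
]

def apply_fixes(text, fixes=OCR_FIXES):
    """Apply each (pattern, replacement) pair in order.

    Returns the cleaned text and the total number of replacements made.
    """
    total = 0
    for pattern, replacement in fixes:
        text, count = re.subn(pattern, replacement, text)
        total += count
    return text, total

if __name__ == "__main__":
    cleaned, changed = apply_fixes("tbe  ship, wbich left tbe dock")
    print(changed, "fixes:", cleaned)
```

Run it on a copy of the text first and look at the replacement count; a pattern that is too greedy can do more damage than the OCR did.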

A future series of posts will go into depth on how to better structure the ebook.

Another good program, and one that is widely used, is Calibre. A library and file manipulation program, it is open source and extensible. It makes it easy to convert from one format to another (Kindle to ePub, LRF to ePub, and many other options).

A nice touch is that in Calibre you can better set up the ISBN and the cover images, and pull data on the book from public databases. I used Calibre to convert a collection of Doc Savage stories from the LRF format (the original Sony Reader format) to ePub, and to add good cover pictures.

In fact, most of the ebook files I look at in Sigil have signs of being converted/cleaned by Calibre, even some commercial books.

Doing this work, you find some things like:

  • Files which came from Microsoft Word – littered with the “class=MsoNormal” attribute. Ugh. I don’t usually curse too much about Microsoft Office, but the HTML it outputs, once converted into an ebook, is a crime against humanity.
  • Many ebooks, even some commercial, professionally edited and assembled ones, have horrible structure: no proper links to the chapters, and no proper tables of contents. Commercial books are much more likely to get this right, but it is a disaster on the community-sourced works. I am working up a process to fix that.
  • There are some truly shitty OCR engines out there. Even high-priced, high-performance engines have trouble, and the second tier is atrocious. Someone once grumbled on Slashdot about why there weren’t any good (free) open source OCR engines. The answer is that it is friggin hard, and it often becomes a lifetime’s work to tune and improve the algorithm, so the good ones are not in a hurry to be given away.
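The Word debris from the first item is the easiest of these to script away. Here is a minimal sketch in Python, assuming the attribute shows up as class="MsoNormal", class='MsoNormal', or the unquoted class=MsoNormal; the function name strip_msonormal is mine. It deliberately leaves alone other Word classes and multi-class attributes like class="MsoNormal foo", since the stylesheet may still depend on those.

```python
import re

# Matches a whole class attribute whose only value is MsoNormal
# (double-quoted, single-quoted, or unquoted; any letter case),
# including the whitespace in front of it.
MSO_CLASS = re.compile(
    r'\s+class\s*=\s*(["\']?)MsoNormal\1(?=[\s/>])',
    re.IGNORECASE,
)

def strip_msonormal(xhtml):
    """Remove class=MsoNormal attributes, keeping the tag and other attributes."""
    return MSO_CLASS.sub("", xhtml)

if __name__ == "__main__":
    sample = '<p class="MsoNormal" style="text-indent:1em">Call me Ishmael.</p>'
    print(strip_msonormal(sample))
```

If the book's stylesheet actually defines a rule for MsoNormal, check how the paragraphs render after stripping, because they will fall back to the plain paragraph styling.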

I rarely make it a mission to fix an ebook, but when I do, I want to leave behind something that is a better experience to read.

(For the record, if there is a place to buy a book, I will always buy it, but much of what I read is esoteric, or out of print, so I am forced into alternatives.)