The magical alphabet of OCR: a trap for authors and proofreaders


Optical character recognition (OCR) is a blessing for publishers and authors: it converts an image of text into an editable document, such as a Word file. Magic! A lost manuscript restored from hard copy!

The OCR process is somewhat similar to the way a human harvests words from page or screen.

We look.
We see text as a series of shapes. See that "W"? It has a shape. See that word "shape"? It has a shape.
We gather handfuls, or rather eyefuls, of shapes, one at a time: phrases or half-lines.

At this point, the similarity between human reading and OCR "reading" ends. For us, the harvest is only step one. Humans attribute meaning to these shapes, and assemble meanings into a narrative. We interpret, we evaluate, we follow, we think. From a page full of symbols, we construct something bigger.

OCR is kind of dyslexic, or perhaps is still a pre-schooler. It thinks in pictures. It sees a shape, compares that shape with known symbols, and regurgitates its best guess. (It finds some fonts much clearer than others, by the way.)
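
As a rough illustration of that compare-and-guess step, here is a toy sketch in Python. The tiny 5x5 bitmaps and the names TEMPLATES, score and best_guess are all invented for this example. Real OCR software is far more sophisticated, so treat this as a cartoon of the idea, not a description of any actual product.

```python
# Toy 5x5 bitmaps: '#' is ink, '.' is blank paper.
# These templates are invented for illustration only.
TEMPLATES = {
    "I": ["..#..", "..#..", "..#..", "..#..", "..#.."],
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
}


def score(shape, template):
    """Count the pixels where the scanned shape and the template agree."""
    return sum(
        s == t
        for shape_row, template_row in zip(shape, template)
        for s, t in zip(shape_row, template_row)
    )


def best_guess(shape):
    """Return the known symbol whose template agrees with the shape most."""
    return max(TEMPLATES, key=lambda symbol: score(shape, TEMPLATES[symbol]))


# An "I" with a smudge (or some other stray mark) along its top edge.
smudged = ["####.", "..#..", "..#..", "..#..", "..#.."]

print(best_guess(TEMPLATES["I"]))  # prints: I
print(best_guess(smudged))         # prints: T
```

With these made-up templates, the smudged "I" agrees with the "T" template on 24 of 25 pixels and with the "I" template on only 22, so the best guess is wrong. The matcher has no idea what the word is supposed to mean; it only knows which shape is closest.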

I've been reformatting a novel from such a document. Sometimes you can literally see why OCR software might reach a certain conclusion...

hanclful = handful
manrnade = manmade
groyvth = growth
Tlie = The
niorning = morning
Tm = "I'm
$aicl = said.

And sometimes it's hard to spot the logic...

qjr = or
\vas = was
w£re = were
greksy = greasy.
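
The first list is regular enough that a script could flag some of it before a human takes over. Below is a minimal sketch in Python, assuming a small table of shape confusions read off those examples and a stand-in word list. CONFUSIONS, KNOWN_WORDS and suggest are hypothetical names made up for this example; a real project would use a proper dictionary or spell-checker.

```python
# Shape confusions read off the first list above: (misread, intended).
CONFUSIONS = [
    ("cl", "d"),
    ("rn", "m"),
    ("yv", "w"),
    ("li", "h"),
    ("ni", "m"),
    ("$", "s"),
]

# Hypothetical stand-in for a real dictionary or spell-checker word list.
KNOWN_WORDS = {"handful", "manmade", "growth", "the", "morning", "said"}


def suggest(word):
    """Return known words reachable from `word` by undoing shape confusions."""
    start_word = word.lower()
    seen = {start_word}
    frontier = [start_word]
    while frontier:
        current = frontier.pop()
        for wrong, right in CONFUSIONS:
            # Try undoing the confusion at every position where it occurs.
            start = current.find(wrong)
            while start != -1:
                candidate = current[:start] + right + current[start + len(wrong):]
                if candidate not in seen:
                    seen.add(candidate)
                    frontier.append(candidate)
                start = current.find(wrong, start + 1)
    return sorted(seen & KNOWN_WORDS)


for misread in ["hanclful", "manrnade", "Tlie", "$aicl", "qjr"]:
    print(misread, "->", suggest(misread))
```

Under these assumptions the script recovers "handful", "manmade", "the" and "said", but it returns an empty list for "qjr". The misreadings with no visible logic are exactly the ones a table like this cannot anticipate, so a human still has to read every line.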

So what's the moral of this story? Never skimp on proofreading.
