The magical alphabet of OCR: a trap for authors and proofreaders

Optical character recognition (OCR) is a blessing for publishers and authors: it converts an image of text into an editable document, for example a Word file. Magic! A lost manuscript restored from hard copy!

The OCR process is somewhat similar to the way a human harvests words from page or screen.

  • We look.
  • We see text as a series of shapes. See that "W"? It has a shape. See that word "shape"? It has a shape.
  • We gather handfuls or rather eyefuls of shapes, one at a time — phrases or half-lines.

At this point, the similarity between human reading and OCR "reading" ends. For us, the harvest is only step one. Humans attribute meaning to these shapes, and assemble meanings into a narrative. We interpret, we evaluate, we follow, we think. From a page full of symbols, we construct something bigger.

OCR is kind of dyslexic, or perhaps is still a pre-schooler. It thinks in pictures. It sees a shape, compares that shape with known symbols, and regurgitates its best guess. (It finds some fonts much clearer than others, by the way.)
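
For the curious, here is a toy sketch in Python of that "compare the shape, take the best guess" step. It is only an illustration: the tiny 5x3 bitmaps, the `TEMPLATES` table and the function names are all made up for this example, and real OCR engines use far more sophisticated methods.

```python
# Hypothetical 5x3 pixel templates ("#" = ink, "." = blank).
# These are invented for illustration, not taken from any real OCR engine.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
    "L": ["#..", "#..", "#..", "#..", "###"],
}


def distance(a, b):
    """Count the pixels where two same-sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def best_guess(glyph):
    """Return the template character whose bitmap differs least from glyph."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


# A badly printed "I": the bottom serif has faded and a speck sits mid-stem.
smudged_i = ["###", ".#.", ".##", ".#.", ".#."]

print(best_guess(TEMPLATES["I"]))  # a clean "I" matches its own template
print(best_guess(smudged_i))       # the smudged one is closer to "T"
```

The smudged letter is one pixel away from the "T" template and three away from "I", so the best guess is simply wrong. No meaning is consulted at any point, which is the whole problem.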

I've been reformatting a novel from such a document. Sometimes you can literally see why OCR software might reach a certain conclusion...

  • hanclful = handful
  • manrnade = manmade
  • groyvth = growth
  • Tlie = The
  • niorning = morning
  • Tm = "I'm
  • $aicl = said.
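
Misreadings like these follow a pattern (two narrow letters standing in for one wide one), so a script can at least flag likely candidates for the proofreader. Below is a minimal Python sketch of the idea. The `CONFUSIONS` table is drawn from the examples above, while `KNOWN_WORDS` is a hypothetical stand-in for a real dictionary; neither comes from any actual OCR tool.

```python
# Hypothetical confusion table: (what OCR produced, what was probably printed).
CONFUSIONS = [
    ("cl", "d"), ("rn", "m"), ("yv", "w"),
    ("li", "h"), ("ni", "m"), ("$", "s"),
]

# Hypothetical stand-in for a proper word list.
KNOWN_WORDS = {"handful", "manmade", "growth", "the", "morning", "said"}


def candidates(word):
    """Return every spelling reachable by undoing one or more confusions."""
    seen = {word}
    queue = [word]
    while queue:
        current = queue.pop()
        for wrong, right in CONFUSIONS:
            start = current.find(wrong)
            while start != -1:
                fixed = current[:start] + right + current[start + len(wrong):]
                if fixed not in seen:
                    seen.add(fixed)
                    queue.append(fixed)
                start = current.find(wrong, start + 1)
    return seen


def suggest(word):
    """Return the known words that the OCR output could be a misreading of."""
    return sorted(c for c in candidates(word.lower()) if c in KNOWN_WORDS)


for ocr_word in ["hanclful", "manrnade", "groyvth", "Tlie", "niorning", "$aicl"]:
    print(ocr_word, "->", suggest(ocr_word))
```

A script like this only offers suggestions for a human to accept or reject. It can do nothing with a word whose misreading follows no listed pattern, and it will happily "correct" a legitimate word that happens to contain one of the pairs.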

And sometimes it's hard to spot the logic...

  • qjr = or
  • \vas = was
  • w£re = were
  • greksy = greasy.

So what's the moral of this story? Never skimp on proofreading.
