Optical character recognition (OCR) is a blessing for publishers and authors: it converts an image of text into an editable document, for example a Word file. Magic! A lost manuscript restored from hard copy!
The OCR process is somewhat similar to the way a human harvests words from page or screen.
- We look.
- We see text as a series of shapes. See that "W"? It has a shape. See that word "shape"? It has a shape.
- We gather handfuls, or rather eyefuls, of shapes, one at a time: phrases or half-lines.
At this point, the similarity between human reading and OCR "reading" ends. For us, the harvest is only step one. Humans attribute meaning to these shapes, and assemble meanings into a narrative. We interpret, we evaluate, we follow, we think. From a page full of symbols, we construct something bigger.
OCR is kind of dyslexic, or perhaps is still a pre-schooler. It thinks in pictures. It sees a shape, compares that shape with known symbols, and regurgitates its best guess. (It finds some fonts much clearer than others, by the way.)
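To make "compares that shape with known symbols" concrete, here is a toy sketch in Python. It is my own illustration, not how any particular OCR product works: the tiny bitmaps, the `TEMPLATES` table and the `best_guess` function are all made up for the example, and real engines are far more sophisticated. The idea is the same, though: measure how far a shape is from each known symbol and pick the closest.

```python
# Toy 3x5 bitmaps standing in for the "known symbols" ('#' is ink, '.' is blank).
# These shapes are invented for illustration only.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
}


def distance(a, b):
    """Count the pixels where two equally sized bitmaps disagree."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(a, b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def best_guess(glyph):
    """Return the known letter whose bitmap differs from the glyph in the fewest pixels."""
    return min(TEMPLATES, key=lambda letter: distance(glyph, TEMPLATES[letter]))


# A "T" with one stray speck of ink: still closest to the T template.
smudged_t = ["###", ".#.", ".##", ".#.", ".#."]
print(best_guess(smudged_t))  # T

# A "T" whose foot has bled sideways: pixel for pixel, it is now an "I".
heavy_foot_t = ["###", ".#.", ".#.", ".#.", "###"]
print(best_guess(heavy_foot_t))  # I
```

The second guess is wrong, but you can see exactly how the software got there. It has no idea what the word is supposed to say, only what the shape looks like.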
I've been reformatting a novel from just such an OCR-generated document. Sometimes you can literally see why OCR software might reach a certain conclusion...
- hanclful = handful
- manrnade = manmade
- groyvth = growth
- Tlie = The
- niorning = morning
- Tm = "I'm
- $aicl = said.
And sometimes it's hard to spot the logic...
- qjr = or
- \vas = was
- w£re = were
- greksy = greasy.
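The look-alike errors are regular enough that a script can help a proofreader find them. Below is a minimal sketch in Python, assuming a hand-made table of shape confusions and a tiny word list. Both `CONFUSIONS` and `KNOWN_WORDS` are hypothetical and built only from the examples above; a real pass would need a full dictionary and a much longer table.

```python
# Hypothetical confusion table: (what the OCR produced, what was probably printed).
CONFUSIONS = [("cl", "d"), ("rn", "m"), ("ni", "m"), ("yv", "w"), ("li", "h")]

# Tiny stand-in word list; a real proofreading pass would load a full dictionary.
KNOWN_WORDS = {"handful", "manmade", "growth", "the", "morning", "said"}


def suggest(word):
    """Return known words (lowercased) reachable by undoing one confusion, or []."""
    lower = word.lower()
    if lower in KNOWN_WORDS:
        return [lower]
    candidates = set()
    for wrong, right in CONFUSIONS:
        # Try the replacement at every position where the confused pair occurs.
        start = lower.find(wrong)
        while start != -1:
            fixed = lower[:start] + right + lower[start + len(wrong):]
            if fixed in KNOWN_WORDS:
                candidates.add(fixed)
            start = lower.find(wrong, start + 1)
    return sorted(candidates)


for odd_word in ["hanclful", "manrnade", "groyvth", "Tlie", "niorning", "qjr", "greksy"]:
    print(odd_word, "->", suggest(odd_word) or "no idea, ask a human")
```

The first five words come back with the right suggestion. The last two come back empty, and those still need a human eye.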
So what's the moral of this story? Never skimp on proofreading.