Thank you all for the feedback and questions you’ve sent us!
Many of you have had trouble getting the OCR to recognize text properly, and some found that the app didn’t produce the results they expected.
Issues with low-res photos
The biggest issue is clearly the low quality of the original photos that users have tried to run the OCR on. We asked some users to send us the photos they were trying to process. Just as we expected, the resolution has been around 72 dpi, which is not good enough for the Tesseract OCR.
DocScanner Mac’s OCR feature is meant for high-resolution photos, taken at 5 Mpix or higher.
In short, if you want good results from the OCR feature, please run it on high-quality photos of at least 5 Mpix. You might want to use the edit panel to fine-tune the photo before running the OCR. The OCR’d text is reset if you edit the photo in DocScanner Mac after the initial run, so you’ll just need to click the “OCR” button again.
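If you want to check a photo against the 5 Mpix recommendation before importing it, the arithmetic is simple: multiply the width by the height in pixels and divide by one million. Here is a minimal sketch in Python. The function names are our own for illustration and are not part of DocScanner.

```python
MIN_MEGAPIXELS = 5.0  # the minimum recommended above


def megapixels(width_px: int, height_px: int) -> float:
    """Return the image size in megapixels (millions of pixels)."""
    return width_px * height_px / 1_000_000


def is_ocr_ready(width_px: int, height_px: int,
                 minimum: float = MIN_MEGAPIXELS) -> bool:
    """Hypothetical helper: True if the photo meets the recommended size."""
    return megapixels(width_px, height_px) >= minimum


print(is_ocr_ready(2592, 1944))  # True  (about 5.04 Mpix)
print(is_ocr_ready(1024, 768))   # False (about 0.79 Mpix)
```

Keep in mind that the pixel count is only a rough guide: a large photo that is blurry or badly lit can still give poor results.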
Different fonts on OCR
Some fonts and font sizes make it difficult for the Tesseract OCR to give correct results. Most commonly, users have had issues when running the OCR on text with a really small font size. The best results come from clear text with a font size of approximately 10-12 pt.
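To see why font size and resolution go hand in hand, it helps to work out how many pixels a line of text actually gets. One point is 1/72 of an inch, so the nominal height in pixels is the point size times the dpi, divided by 72. The sketch below is only illustrative: the function name is our own, and the nominal size is the full font height, so the letters themselves come out somewhat smaller.

```python
POINTS_PER_INCH = 72  # typographic definition: 1 pt = 1/72 inch


def text_height_px(font_size_pt: float, dpi: float) -> float:
    """Nominal height of the font in pixels at the given resolution."""
    return font_size_pt * dpi / POINTS_PER_INCH


for dpi in (72, 150, 300):
    print(dpi, "dpi:", round(text_height_px(10, dpi), 1), "px")
# 72 dpi: 10.0 px
# 150 dpi: 20.8 px
# 300 dpi: 41.7 px
```

At 72 dpi, 10 pt text is only about 10 pixels tall, which leaves very little detail for the OCR to work with.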
The easiest fonts for the Tesseract OCR to recognize are the grotesque (sans-serif) font families, such as Arial, Helvetica or Futura. Although Antiqua (serif) fonts are easier for the human eye to read, they’re not as good for the OCR.
How does it work?
DocScanner uses the Tesseract OCR to find characters in images. The system recognizes the fixed, static shape of each character; more accurately, it “guesses” the character from its shape. OCR in its current state isn’t 100% accurate in general, even on the clearest images.
DocScanner enhances the images before running the OCR, to make it easier for the Tesseract OCR to recognize the text correctly. Even so, the text in some photos is too difficult to recognize correctly, and the OCR produces gibberish instead. We’re constantly developing the image pre-treatment process to get more accurate OCR results.
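To give an idea of what image enhancement before OCR can look like, here is a minimal sketch of two common steps: converting the photo to grayscale and then turning it into pure black and white. Please note that this is not DocScanner’s actual pre-treatment process. The function names, the standard luma weights and the choice of the mean brightness as the threshold are all assumptions made for illustration.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def binarize(gray_rows):
    """Pixels darker than the mean become 0 (ink), the rest 255 (paper)."""
    pixels = [p for row in gray_rows for p in row]
    threshold = sum(pixels) / len(pixels)  # assumes a non-empty image
    return [[0 if p < threshold else 255 for p in row] for row in gray_rows]


# A tiny 3x2 "photo": light paper with a dark stroke down the middle.
photo = [
    [(200, 198, 190), (60, 55, 50), (205, 200, 195)],
    [(190, 188, 180), (70, 60, 58), (198, 196, 190)],
]
print(binarize(to_grayscale(photo)))
# [[255, 0, 255], [255, 0, 255]]
```

A single threshold like this works on evenly lit pages. Photos with shadows or uneven lighting usually need something more adaptive, which is one reason why some photos still come out as gibberish.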
When using DocScanner Mac, users can run the OCR on individual pages or choose to run it automatically on every imported page (from the preference panel). Once the OCR has been run, users can view the results by clicking the “OCR” button again, or by right-clicking and choosing the “View OCR” option. You may also copy the text from this view to your favourite text editor and make changes to it.