How to get the best results with the OCR feature

On 01/18/2011, in How to..., Mac, News, by Tuomas Rasila

Thank you all for the feedback and questions you’ve sent us!
Many of you have had trouble getting the OCR to recognize text properly, and some users found that the app didn’t produce what they expected.

For good OCR results, take the photos with a camera of at least 5 Mpix.

Issues with low-res photos

The biggest issue is clearly the low quality of the original photos that users have tried to run the OCR on. We’ve asked some users to send us the photos they’re trying to run the OCR on. Just as we expected, the resolution has been more or less 72 dpi, which is not good enough for the Tesseract OCR.

DocScanner Mac’s OCR feature is meant for high-resolution photos, taken at 5 Mpix or greater.

In short, if you want good results with the OCR feature, please run it on high-quality photos of at least 5 Mpix. You might want to use the edit panel to fine-tune the photo before running the OCR. The OCR’d text resets if you edit the photo in DocScanner Mac after the initial run, so you’ll just need to click the “OCR” button again.
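
If you are unsure whether a photo reaches the 5 Mpix mark, you can work it out from its pixel dimensions. The following is a minimal sketch in Python; the function names and the sample dimensions are ours, chosen for illustration, and are not part of DocScanner.

```python
# Minimum size recommended in this post for good OCR results.
MIN_MEGAPIXELS = 5.0


def megapixels(width_px: int, height_px: int) -> float:
    """Return the image size in megapixels (millions of pixels)."""
    return (width_px * height_px) / 1_000_000


def meets_ocr_minimum(width_px: int, height_px: int,
                      minimum: float = MIN_MEGAPIXELS) -> bool:
    """True if the photo has at least `minimum` megapixels."""
    return megapixels(width_px, height_px) >= minimum


if __name__ == "__main__":
    # Hypothetical photo sizes, for illustration only.
    for width, height in [(1024, 768), (2592, 1944), (4000, 3000)]:
        verdict = "ok" if meets_ocr_minimum(width, height) else "too small"
        print(f"{width} x {height}: {megapixels(width, height):.2f} Mpix, {verdict}")
```

A 1024 x 768 photo comes out well under 1 Mpix, while 2592 x 1944 is just over 5 Mpix.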

Different fonts in OCR

Some fonts and font sizes make it difficult for the Tesseract OCR to give correct results. Most commonly, users have had issues when trying to run the OCR on text with a really small font size. The best results come from clear text with a font size of approximately 10-12 pt.

The easiest fonts for Tesseract OCR to recognize are the grotesque (Roman or sans-serif) font families, such as Arial, Helvetica or Futura. Although the Antiqua (serif) fonts are easier for the human eye to read, they’re not as good for the OCR.
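
To see why font size and resolution go hand in hand, it helps to convert a point size into pixels. One typographic point is 1/72 of an inch, so the nominal height of a font in pixels is the point size times the dpi, divided by 72. The sketch below is a rough illustration only: the function name is ours, the 300 dpi figure is an assumed comparison value, and the result is the nominal font height, not the exact height of any single letter.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def font_height_px(point_size: float, dpi: float) -> float:
    """Nominal font height in pixels for a given point size and resolution."""
    return point_size * dpi / POINTS_PER_INCH


if __name__ == "__main__":
    # 72 dpi is the resolution mentioned in this post;
    # 300 dpi is an assumed value for comparison.
    for dpi in (72, 300):
        for pt in (10, 12):
            print(f"{pt} pt at {dpi} dpi is about {font_height_px(pt, dpi):.1f} px tall")
```

At 72 dpi, a 10 pt font is only about 10 pixels tall, which leaves very little detail for the OCR to work with.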

How does it work?

DocScanner uses Tesseract OCR to find characters in images. The system recognizes the fixed, static shape of each character; more accurately, it “guesses” the character from its shape. OCR in its current state isn’t 100% accurate in general, even on the clearest images.

DocScanner enhances the images before running the OCR, so that it is easier for Tesseract OCR to recognize the text correctly. Even so, the text in some photos is too difficult to recognize correctly, and the OCR gives gibberish instead. We’re constantly developing the image pre-treatment process to get more accurate OCR results.

When using DocScanner Mac, users can run the OCR on individual pages or choose (from the preference panel) to run OCR automatically on every imported page. Once the OCR has been run, users may view the results by clicking the OCR button again, or by right-clicking and choosing the “View OCR” option. You may also copy the text from this view to your favourite text editor and make changes to it.
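
To give a feel for what image pre-treatment can mean, here is a minimal sketch of one common technique, global thresholding, which turns a grayscale image into pure black and white. This is an assumption for illustration only: this post does not describe which steps DocScanner actually applies, and the function names and sample pixel values below are made up.

```python
def mean_threshold(gray):
    """Pick a threshold as the mean brightness of all pixels (0-255)."""
    pixels = [px for row in gray for px in row]
    return sum(pixels) / len(pixels)


def binarize(gray, threshold):
    """Map pixels at or above the threshold to white (255), the rest to black (0)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


if __name__ == "__main__":
    # A tiny made-up grayscale "image": light paper with a few dark ink pixels.
    sample = [
        [200, 210, 60],
        [190, 50, 205],
    ]
    threshold = mean_threshold(sample)
    for row in binarize(sample, threshold):
        print(row)
```

The light paper pixels become white and the dark ink pixels become black, which gives the character shapes a sharper outline.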

2 Responses to How to get the best results with the OCR feature

Please note: If you have a support request, please contact support

  1. Mathias says:

    How do I even USE the OCR? Tried dragging a PDF in, nothing happened. Tried exporting it, nothing happened. It might be “as simple as drag and drop”, but I’m starting to believe that it’s because nothing else actually happens. If it does, some kind of status bar would be good, just to give a little feedback.

    So, if I want some text recognized, how do I get it to work?

  2. Gretchen says:

    Need to ask a question and I’m not sure I’m in the right place, but here goes…

    I am interested in scanning photos from old magazines.

    From what I have read, the process would be photograph the image with a digital camera, have Docscanner magically convert image to pdf.

    Then would I be able to select and copy image from the pdf and convert to a jpg?
