Saturday, 7 September 2013

Train Tesseract for specific words - possible?

I want to use Tesseract to extract about 10-20 keywords from a document.
The document will contain only English characters and words. What I am
interested in is something like "Age: 23". Here "Age" is the keyword I am
interested in, and I want to extract its value (23) as well.
The first approach that comes to mind is to OCR the whole page into
text and then look for the keywords in the recognized text. But in terms of
training Tesseract, is there a better approach, given that I know the keywords
in advance, that might result in better accuracy?
I am more or less aware of the limitations of Tesseract OCR and am trying to
get the most out of it within those limitations. Thanks for all your expert advice.
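
For reference, a minimal sketch of that first approach (recognize the whole page, then search the text for keywords) might look like the following. It assumes the page text has already been obtained from Tesseract as a plain string and that each field appears on one line as "Keyword: value". The keyword list, the function name and the sample text are hypothetical placeholders, not part of Tesseract.

```python
import re

# Hypothetical keyword list; the real 10-20 keywords would go here.
KEYWORDS = ["Age", "Name", "Height"]


def extract_fields(ocr_text, keywords=KEYWORDS):
    """Return {keyword: value} for each 'Keyword: value' line found in ocr_text."""
    fields = {}
    for keyword in keywords:
        # Match the keyword as a whole word, an optional gap, a colon,
        # then capture the rest of that line as the value.
        pattern = re.compile(
            r"\b" + re.escape(keyword) + r"\s*:\s*(.+)", re.IGNORECASE
        )
        match = pattern.search(ocr_text)
        if match:
            fields[keyword] = match.group(1).strip()
    return fields


# Stand-in for the text Tesseract would return for one page (assumed data).
sample = "Name: Jane Doe\nAge: 23\nSome unrelated line\n"
print(extract_fields(sample))
```

This only post-processes the recognized text; it does not change how Tesseract itself recognizes the keywords, so it says nothing about whether training would improve accuracy.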
