Recently we had to build a tool for one of our clients on the West Coast. The inputs were an image file and an XML file holding the configuration of the text to be extracted from the image.
For this purpose we chose the tess4j jar, which can be downloaded from http://sourceforge.net/projects/tess4j/, and the StAX jar (StAX stands for Streaming API for XML), since we had to write the extracted text back out into a new XML file.
There were several steps involved in getting the code to work:
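To give an idea of the StAX side, here is a minimal sketch of writing one piece of extracted text into an XML string with XMLStreamWriter. The class name, the element names (result, field) and the name attribute are made up for illustration; they are not the client's actual output format.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class OcrResultXml {
    // Serializes one extracted field as <result><field name="...">text</field></result>.
    // Element and attribute names are hypothetical, chosen only for this sketch.
    static String toXml(String fieldName, String extractedText) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writer.writeStartElement("result");
            writer.writeStartElement("field");
            writer.writeAttribute("name", fieldName);
            writer.writeCharacters(extractedText); // escapes characters such as & and <
            writer.writeEndElement(); // closes field
            writer.writeEndElement(); // closes result
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("Could not write OCR result", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml("invoiceNumber", "A&B <123>"));
    }
}
```

In a real tool the writer would be created over a file output stream rather than a StringWriter, and the text passed in would be whatever the OCR call returned.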
Make sure the DLLs (libtesseract302.dll and liblept168.dll, the Tesseract and Leptonica native libraries) are part of your project; otherwise you may start seeing linkage errors. If you decide to create runnable jars or exe files, these DLLs need to be added to the system32 folder (on Windows).
The manifest file of one of the jars associated with tess4j may be missing the vendor name. In that case, explicitly add the vendor name to the manifest (this error had us scratching our heads).
Do not forget to install Tesseract itself (having only the jar will not do!). It usually ends up in your Program Files folder unless you choose otherwise. The language data can be added here. If you decide to use Cube (another recognition engine from the Tesseract family), make sure the Cube files are inside the Tesseract folder. Also create an environment variable named TESSDATA_PREFIX that points to this folder.
There is very limited and scattered support available online – understand your setup well.
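Since a wrong TESSDATA_PREFIX is easy to miss, a small pre-flight check can save some head scratching. The sketch below is our own illustration, not part of tess4j. It assumes that TESSDATA_PREFIX points at the Tesseract install folder and that language files sit in a tessdata subfolder as <lang>.traineddata, which is the layout we are assuming for Tesseract 3.02. The class and method names are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class TessdataCheck {
    // Returns human-readable problems; an empty list means the layout looks fine.
    // Assumption: language data lives in <prefix>/tessdata/<lang>.traineddata.
    static List<String> findProblems(String prefix, String lang) {
        List<String> problems = new ArrayList<>();
        if (prefix == null || prefix.isEmpty()) {
            problems.add("TESSDATA_PREFIX is not set");
            return problems;
        }
        File tessdata = new File(prefix, "tessdata");
        if (!tessdata.isDirectory()) {
            problems.add("No tessdata folder under " + prefix);
            return problems;
        }
        File trained = new File(tessdata, lang + ".traineddata");
        if (!trained.isFile()) {
            problems.add("Missing language file " + trained.getPath());
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> problems = findProblems(System.getenv("TESSDATA_PREFIX"), "eng");
        if (problems.isEmpty()) {
            System.out.println("Tesseract data folder looks fine");
        } else {
            for (String problem : problems) {
                System.out.println(problem);
            }
        }
    }
}
```

Running this once before the first OCR call gives a readable message instead of a failure deep inside the native library.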
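Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
instance.setLanguage("eng"); // English (already the default)
instance.setOcrEngineMode(2); // 0 = Tesseract only, 1 = Cube only, 2 = both
String result = instance.doOCR(imageFile);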
Once you are all set, running Tesseract is very straightforward. The first line initializes the instance, the second line tells Tesseract to use English (you can drop this line, as English is the default), and the third line sets the engine mode, where 0 stands for Tesseract only, 1 for Cube only and 2 for both. The final line runs OCR on the image file and returns the extracted text. We did not see better results using Cube, but to be on the safe side we decided to go ahead with mode 2.
The results from Tesseract are quite decent out of the box, and with a little tweaking you can end up performing better than most commercial OCR products. Multiple variants of the training data are available, and using a combination of them can improve your results and take you closer to 99%+ accuracy. Using Tesseract on our data, we have been able to achieve more than 99.5% accuracy.
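If you want to put a number on your own results, one common way is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the length of the reference. The sketch below shows that calculation. It is only an illustration of the metric; it is not necessarily how the figures above were measured, and the class and method names are made up.

```java
class OcrAccuracy {
    // Classic Levenshtein distance: minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Character accuracy relative to the reference text, clamped to [0, 1].
    static double charAccuracy(String truth, String ocr) {
        if (truth.isEmpty()) {
            return ocr.isEmpty() ? 1.0 : 0.0;
        }
        double accuracy = 1.0 - (double) editDistance(truth, ocr) / truth.length();
        return Math.max(0.0, accuracy);
    }

    public static void main(String[] args) {
        // One wrong character (I instead of 1) in a 12-character reference.
        System.out.println(charAccuracy("Invoice 1234", "Invoice I234"));
    }
}
```

Averaging this over a set of documents with known text gives a figure you can track while trying different training data or engine modes.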