Sunday, 8. October 2006, 13:17:43

It was completely by accident that I found out Google was interested in OCR. I know you might have the same response I had at first: What? Google? OCR? But if you can spare a minute, I'll tell you a not-that-long story about this OCR digging.
It all began with an assignment my project manager gave me this morning. He handed me a huge address list (on paper!) and asked me to transfer it into an MS Office document by the end of the day. My first response was very clear: there was no way I was going to type this damn stupid address list character by character. I had to find a useful OCR kit to help me out.
I was completely right. OCR was necessary, but which one? As you guys know, Chinese is an extremely complicated language in its written form. It is what people call ideographic: it has more than 10,000 characters, each of which is composed of many different strokes. Okay, let's get things clear: there are tons of English OCR programs out there, but for Chinese there are very few candidates.
Fortunately, few means there are still some. In the beginning I turned to the freeware field for help (no open source Chinese OCR exists, to the best of my knowledge), and found a tiny kit called Mini OCR. Unfortunately this software disappointed me a lot: its recognition success rate was probably less than 20%. It was commendable of the author to release such a complicated piece of software free of charge, but he really needs to rewrite more of its kernel to make it work.
Then came plan B: commercial software. There are some Chinese OCR manufacturers out there, and Hanwang (literally "King of the Chinese Language") is one of the most famous. Very unexpectedly, when I tried to find a working download link for any of its products, they all turned out to be dead links. Fortunately there were some other well-known Chinese OCR manufacturers, and Qinghua TH-OCR was one that I had used before. I managed to find a web link to its latest version and installed it. TH-OCR worked just fine: it successfully identified most of the characters on that address list. I was saved! Viva!
Now, here's where the story of digging into Google & OCR comes in. My habit is, when I find something interesting and useful (for current or future use), there's no way I just let it go without doing some research. That was exactly what I did with this OCR subject. When I used Google to find the download link, I noticed an article on Blogspot that looked like it was telling a story about Google & OCR. I kept it in mind. When I had finished my work, I quickly turned to that page, and it turned out to be pretty much an official announcement of an open source OCR project called Tesseract OCR.
This announcement stated that Tesseract OCR was commercial software started by HP more than ten years ago and was a major player in the OCR field at that time. Unfortunately it was abandoned years later when HP decided to quit the OCR business. After ten years of waiting, it was brought back to life with the support of the University of Nevada, Las Vegas and (mainly) Google. This time Google decided to make it an open source project. Tesseract OCR now hosts its project pages on SourceForge.net and is released only in source code form. Needless to say, Tesseract OCR has a long way to go to catch up with the major commercial kits on today's OCR market, but it's a very good start anyway, especially if you consider that it's backed by a giant company like Google.
And there's another article which is even more interesting. 'OCR & Google' analyses why Google supports the Tesseract OCR project, and it concludes that it's because of the needs of Google's online book project. This article is definitely worth reading.
For more details about the story of Google & OCR, please visit these pages:
Announcing Tesseract OCR (from Blogspot, by Luc Vincent, Uber Tech Lead)
http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html
OCR & Google (from Windows Live Space, by search-science)
http://search-science.spaces.live.com/Blog/cns!9D2D609139C8C9EF!3145.entry
Tesseract OCR Project's website
http://sourceforge.net/projects/tesseract-ocr