I was in Portland, Oregon, a couple of weeks ago, and one of the things I do when I visit PDX is drink beer with my little (big) brother Michael (@m_thelander). He is a programmer in Portland, working diligently away at Rentrak. Unlike me, Michael is a classically trained programmer, and someone you want as your employee. ;-) He’s a rock-solid guy.
Anyhoo. Michael and I were drinking beer in downtown Portland, and talking about a project he had worked on during an internal hackathon at Rentrak. I won’t give away the details, as I didn’t ask him if I could write this. :-) The project involved the programmatic analysis of thousands of PDFs, so I asked him what tools he was using to work with PDFs.
He said they kept stumbling on the formatting differences from one PDF to the next and couldn’t get consistent results, so they decided to just save each page as an image and use the open source tesseract OCR engine to read each image. Doing this essentially flattened the differences between PDF types, and gave him the additional details tesseract provides along with the text.
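I haven't seen Michael's code, so what follows is only my rough sketch of what a pipeline like that might look like, not what his team actually built. I'm assuming the pdftoppm tool from Poppler and the tesseract command line program are both installed and on the PATH. The function names and the 300 DPI default are mine, made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that saves every page of the PDF as a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_command(image_path):
    """Build the tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout"]


def pdf_to_text(pdf_path, dpi=300, run=subprocess.run):
    """Return a list holding the OCR text of each page, in page order.

    `run` defaults to subprocess.run; it is a parameter only so the
    external tools can be swapped out when experimenting.
    """
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Step 1: flatten the PDF into one image per page.
        run(render_command(pdf_path, prefix, dpi), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        images = sorted(Path(tmp).glob("page-*.png"))
        # Step 2: read each page image with tesseract.
        texts = []
        for image in images:
            result = run(ocr_command(image), check=True,
                         capture_output=True, text=True)
            texts.append(result.stdout)
        return texts
```

Calling pdf_to_text("some-file.pdf") would then hand back one string per page, no matter how the original PDF was produced. The trade-off is that OCR is slower than pulling text straight out of the file, and it can misread characters, so the result is consistent rather than perfect.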
It may not seem like much, but ultimately it is a very interesting approach, and as I continue doing big data projects around things like patents, I’m always faced with the question: what do I do with a PDF? I will have to steal (borrow) from my smart little brother’s work and build a tesseract API prototype.