Reading text from a PDF in ColdFusion 8

Yesterday I blogged about ColdFusion and DDX, a way to do some fancy-pants neato transformations of PDF documents. One of the cooler examples was that DDX could be used to grab the text from a PDF file. For those who thought it might be too difficult to use DDX directly, I've wrapped up the code in a new ColdFusion Component I'm calling PDF Utils. (Coming to a RIAForge near you soon. Watch the skies...)

Right now the CFC has one method, getText. You pass in the path to a PDF and you get back an array of pages. Each item in the array is the text on that particular page. I've attached two sample PDFs to this blog post. One is a normal PDF with simple text. As you can imagine, the function works great with it. The other one is a highly graphical, wacky-looking PDF. OK, it isn't wacky looking per se, but it isn't a simple letter. When the method is run on this PDF, the text does come back, but it is a bit crazy looking. I think this is to be expected though. And what's cool is that if your intent is to get the text out for searching/indexing purposes, the result is still useful.
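For the curious, here is a rough sketch of how a getText-style method could be built on top of DDX. This is not the actual PDF Utils source. The function name getTextSketch, the In1/Out1 keys, and the temp file handling are my own placeholders, and I'm assuming the XML that DocumentText writes has a TextPerPage element with one child per page, so check that against your own output before relying on it.

```cfml
<!--- Hypothetical sketch only; not the code inside PDF Utils --->
<cffunction name="getTextSketch" returntype="array" output="false">
    <cfargument name="pdfPath" type="string" required="true">

    <cfset var ddx = "">
    <cfset var inputs = structNew()>
    <cfset var outputs = structNew()>
    <cfset var ddxResult = "">
    <cfset var doc = "">
    <cfset var pages = "">
    <cfset var result = arrayNew(1)>
    <cfset var i = 0>
    <!--- Assumption: write the DDX output to a uniquely named temp XML file --->
    <cfset var outFile = getTempDirectory() & createUUID() & ".xml">

    <!--- DDX that asks for the text of the input document --->
    <cfsavecontent variable="ddx"><?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
    <DocumentText result="Out1">
        <PDF source="In1"/>
    </DocumentText>
</DDX></cfsavecontent>

    <!--- Struct keys must match the source/result names used in the DDX --->
    <cfset inputs.In1 = arguments.pdfPath>
    <cfset outputs.Out1 = outFile>

    <cfpdf action="processddx" ddxfile="#trim(ddx)#"
           inputfiles="#inputs#" outputfiles="#outputs#" name="ddxResult">

    <!--- Read the generated XML, then clean up the temp file --->
    <cfset doc = xmlParse(fileRead(outFile))>
    <cfset fileDelete(outFile)>

    <!--- Assumed layout: root element > TextPerPage > one child per page --->
    <cfset pages = doc.xmlRoot.TextPerPage.xmlChildren>
    <cfloop index="i" from="1" to="#arrayLen(pages)#">
        <cfset arrayAppend(result, pages[i].xmlText)>
    </cfloop>

    <cfreturn result>
</cffunction>
```

The point of wrapping it this way is that the caller never has to see the DDX or the intermediate XML file; they hand over a path and get an array back.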

Anyway, here is a sample:

<cfset pdf = createObject("component", "pdfutils")>

<cfset mypdf = expandPath("./paristoberead.pdf")>

<cfset results = pdf.getText(mypdf)>
<cfdump var="#results#">

Which gives this result:

The zip includes 2 PDFs, the component, and my test script.
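Since the result is just an array with one string per page, a quick page-level search is only a few lines. The following is a minimal sketch, assuming results is the array returned by getText in the sample above; searchTerm and its value are made up for illustration.

```cfml
<!--- Assumes "results" is the array returned by pdf.getText(mypdf) --->
<!--- searchTerm is a hypothetical value used only for this example --->
<cfset searchTerm = "Paris">

<cfloop index="pageNum" from="1" to="#arrayLen(results)#">
    <!--- findNoCase returns 0 when the term is not on the page --->
    <cfif findNoCase(searchTerm, results[pageNum])>
        <cfoutput>Found "#htmlEditFormat(searchTerm)#" on page #pageNum#<br></cfoutput>
    </cfif>
</cfloop>
```

Even with the messier text that comes back from the graphical PDF, a simple substring match like this can still turn up hits.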

Download attached file.

