Reading text from a PDF in ColdFusion 8

Yesterday I blogged about ColdFusion and DDX, a way to do some fancy-pants neato transformations of PDF documents. One of the cooler examples was that DDX could be used to grab the text from a PDF file. For those who thought it might be too difficult to use the DDX, I've wrapped up the code in a new ColdFusion Component I'm calling PDF Utils. (Coming to a RIAForge near you soon. Watch the skies...)

Right now the CFC has one method, getText. You pass in the path to a PDF and you get an array of pages. Each item in the array is the text on that particular page. I've included on this blog post two sample PDFs. One is a normal PDF with simple text. As you can imagine, the function works great with it. The other one is a highly graphical, wacky looking PDF. Ok it isn't wacky looking per se, but it isn't a simple letter. When the method is run on this PDF, the text does come back, but it is a bit crazy looking. I think this is to be expected though. And what's cool is that if your intent is to get the text out for searching/indexing purposes, you can still find it useful.
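For anyone curious what the DDX side of this looks like, here is a minimal sketch of a DocumentText request run through cfpdf. It is not the CFC's actual source: the output file name (paristoberead.xml) and the doc1/Out1 names are assumptions picked for the example, and the layout of the resulting XML is best checked by dumping it rather than taken from this sketch.

```cfml
<!--- DDX instructions: ask for the text of the PDF registered as "doc1". --->
<cfsavecontent variable="ddx"><?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
  <DocumentText result="Out1">
    <PDF source="doc1"/>
  </DocumentText>
</DDX>
</cfsavecontent>
<cfset ddx = trim(ddx)>

<!--- Map the names used in the DDX to real files (paths are assumptions). --->
<cfset inputFiles = structNew()>
<cfset inputFiles["doc1"] = expandPath("./paristoberead.pdf")>
<cfset outputFiles = structNew()>
<cfset outputFiles["Out1"] = expandPath("./paristoberead.xml")>

<cfif isDDX(ddx)>
  <cfpdf action="processddx" ddxfile="#ddx#"
         inputfiles="#inputFiles#" outputfiles="#outputFiles#"
         name="ddxResult">
  <!--- ddxResult reports how each output fared; the text itself lands in the XML file. --->
  <cfdump var="#ddxResult#">
  <cffile action="read" file="#outputFiles['Out1']#" variable="rawXml">
  <cfdump var="#xmlParse(rawXml)#">
<cfelse>
  <cfoutput>The DDX did not validate.</cfoutput>
</cfif>
```

Wrapping this in a CFC method means callers only ever deal with a PDF path going in and an array coming out.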

Anyway, here is a sample:


<cfset pdf = createObject("component", "pdfutils")>

<cfset mypdf = expandPath("./paristoberead.pdf")>

<cfset results = pdf.getText(mypdf)>
<cfdump var="#results#">

The dump shows an array with one item per page, each holding that page's text.

The zip includes 2 PDFs, the component, and my test script.

Download attached file.
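Since getText returns an array of pages, a common next step is to loop over it instead of dumping it. The sketch below assumes each array item is a plain string, as described above, and reuses the sample file name from the post.

```cfml
<cfset pdf = createObject("component", "pdfutils")>
<cfset pages = pdf.getText(expandPath("./paristoberead.pdf"))>

<cfoutput>
<cfloop index="i" from="1" to="#arrayLen(pages)#">
  <h3>Page #i#</h3>
  <!--- pre keeps any line breaks that survive extraction; htmlEditFormat escapes markup. --->
  <pre>#htmlEditFormat(pages[i])#</pre>
</cfloop>
</cfoutput>
```

The pre tags matter only for display: without them the browser collapses whitespace, which can make it look as if line breaks were lost.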

Archived Comments

Comment 1 by Ben Nadel posted on 7/26/2007 at 12:58 AM

A PDF utility component... that has all the makings of a massively popular CFC :)

Comment 2 by Raymond Camden posted on 7/26/2007 at 1:01 AM

My next utility will be pageGet. Right now you can easily delete pages, but what if you want just one page? Well the answer is to obviously delete everything else, but I'll have a utility function to make it easier.

I've actually got an entire online PDF Editor I worked on for the CF8 tour. I'm going to load it up when I wrap the CFPDF series.

Comment 3 by Ben Nadel posted on 7/26/2007 at 1:07 AM

That's gonna be sweet! I work with a lot of legal clients and they all but worship the PDF as a medium.

Comment 4 by Lola LB posted on 7/26/2007 at 3:55 PM

Cool! I can see a lot of use for those PDF utils.

Comment 5 by Oğuz Demirkapı posted on 9/10/2007 at 9:56 PM

Any update for pdfutils? :)

Comment 6 by Raymond Camden posted on 9/10/2007 at 10:00 PM

I haven't done anything special with it - but it probably is time to release it at RIAForge. Oguz, if I don't by EOD, yell at me.

Comment 7 by Mark posted on 4/5/2008 at 7:52 AM

The demo doesn't seem to just work.
test.cfm seems OK
test2.cfm dumps the PDFDocument structure? Is the cfpdf write failing silently?
genpdf.cfm works (just try paristoread_new.pdf or whatever)
xmptest.cfm returns [empty string] no matter what I try...

Comment 8 by Raymond Camden posted on 4/5/2008 at 9:36 PM

If you look at test2, you will see it does indeed dump out the pdf structure. That is expected. And it should overwrite page2.pdf.

As for xmptest.cfm, it will be empty if your PDF doesn't use XMP. Not all PDFs do. If you think yours does and it doesn't work, email me the PDF.

Comment 9 by Johnny posted on 7/22/2008 at 7:07 PM

Nice component.
Can I do the same in MX7?
Thanks

Comment 10 by Raymond Camden posted on 7/24/2008 at 6:39 PM

Not that I know of.

Comment 11 by Tim posted on 9/29/2008 at 10:37 PM

Great tool! But it looks like it replaces newline characters with spaces. Is there any way to figure out where they used to be?

Comment 12 by Raymond Camden posted on 9/29/2008 at 10:43 PM

I just checked, and my code isn't doing anything to the text. It must be how it comes from DDX.

Comment 13 by Raymond Camden posted on 9/29/2008 at 11:09 PM

Dumb question - but if you are outputting it into HTML, don't forget newlines won't be displayed unless you use PRE tags.

Comment 14 by Armando posted on 2/13/2009 at 3:53 AM

When I use getText I just get back a blank page. I tried PDFUtils because all my other attempts to use DDX to get text from a pdf also failed - all return a blank page (unless there is an error).

Processing seems to stop as soon as the cfpdf tag calls processddx.

This is on a shared ColdFusion 8 server at CrystalTech.

Comment 15 by Raymond Camden posted on 2/14/2009 at 9:08 PM

Are you sure processing ends? Do you get an error? What happens when you try DDX by itself?

Comment 16 by Virginia Neal posted on 5/29/2010 at 3:24 AM

As you probably know, in CF9 the cfpdf tag can now use action="extracttext" to pull text out of the pdf document. According to the documentation you should also be able to use useStructure="true" along with honourspaces="true" in order to return the structure. However, I have been unable to get this to work. I simply need to get line breaks, centering, and indents/tabs, but no luck thus far. Any ideas?
Thanks

Comment 17 by Raymond Camden posted on 5/30/2010 at 5:54 PM

How is it _not_ working? Do you get a syntax error? Something else?

Comment 18 by Virginia Neal posted on 6/1/2010 at 6:18 PM

The cfpdf tag is extracting the content of the pdf file and it is readable; however, the basic structure of the document is lost. There are no line breaks, no indenting, no centering.
I have tried both .doc and .docx. Code is simple
<cfpdf useStructure="true" addquads="false" honourspaces="true" type="string" action="extracttext" source="test.docx.pdf" name="pdfToText" />
The document is simple and contains about 7 lines, just enough to test line breaks, indents, and centering.
Thanks for any help you can provide.

Comment 19 by Raymond Camden posted on 6/1/2010 at 6:21 PM

Just to be dumb - how are you testing that line breaks were removed? Remember that if you just output it to the browser, since it is an HTML environment, you won't see the line breaks unless you view source.

Comment 20 by Virginia Neal posted on 6/1/2010 at 6:34 PM

I wrote the results to a text file and checked the content of the file. BTW, I also tried the extract using type='xml' and that did not work any better.

Comment 21 by Raymond Camden posted on 6/1/2010 at 6:35 PM

Interesting. Do you feel comfortable sharing your PDF with me?

Comment 22 by Virginia Neal posted on 6/1/2010 at 6:38 PM

No problem, it is a simple file. How do you want me to send it to you? I don't see a place on your blog where I can upload the file.

Comment 23 by Raymond Camden posted on 6/1/2010 at 6:40 PM

Email it to me. ray at camdenfamily dot com

Comment 24 by Raymond Camden posted on 6/1/2010 at 7:21 PM

Well I wish I had something nice to report, but I don't. I'm seeing the same as you. I also tried my pdfutils component (even though I assumed Adobe was using the exact same code (ddx) for theirs) and I got the same result. If you use the quads option you DO seem to get the proper data, but you would need to 'draw' it on screen to get it which would be overkill.

This may be a bit much - but you could use Google Docs. I have a wrapper CFC for it. You could upload to Google Docs, then download as HTML or text. It would be slow (well, slowish), but if you needed it for one-time conversions, it would be acceptable I think.

Comment 25 by Virginia Neal posted on 6/1/2010 at 7:30 PM

Thanks much for trying!!
I had also tried the DDX approach and the quads option. I had briefly considered using the coordinates, but the docs I need to convert are gigantic and I get as many as 40 a day. Given that, I don't think the Google approach will work either. I have 3rd party software that currently converts Word to PDF and to Text, but I had hoped to get rid of the software and let CF do both. With CF 9, I am now able to convert the docs to PDF, but thus far no easy way to go to text. Maybe the POI will work.

Comment 26 by Raymond Camden posted on 6/1/2010 at 7:32 PM

Well, 40 isn't so much. If you assume 1 minute of network time to do the 'work' (really, Google is), I'd call this fair. I'd just do it behind the scenes - ie not make the user wait for it.

As to your 3rd party tool - shoot, if it works, use it! I'd brush my teeth with ColdFusion if I could, but at the end of the day, you want to use what works best.

Comment 27 by Virginia Neal posted on 6/1/2010 at 7:42 PM

That is 40 gigantic docs all in one batch with processing handled, as you said, "behind the scenes". Going to a complete CF solution, without the 3rd party conversion software, would save us some bucks, and small government agencies are all about saving money. :-)

Anyway, thanks again for your help!! It is good to know it is not just me and banging my head for another week won't change the fact that the tag won't work!

Comment 28 by Corey Christensen posted on 4/29/2011 at 2:11 AM

Hey guys, it's AOC's finest! How u doing? I'm trying to fix a problem with old skool Verity collections that requires me to build a new search interface for about 500+ PDFs. We are only running CF8 (boo), so I'm wondering if you have any pointers for searching inside PDFs with CF!!

Comment 29 by Raymond Camden posted on 4/29/2011 at 2:12 AM

Verity can index PDFs. :)

Comment 30 by Virginia Neal posted on 4/29/2011 at 2:58 AM

Hi Corey. As you may remember we do not use Verity, so I can't help. Wondered what happened to you - send me an email one day and we will catch up.

Comment 31 by Corey Christensen posted on 4/29/2011 at 11:34 PM

Correct, we are having a problem with our Verity collections. We are adding new PDF documents to the Verity collections, but Verity isn't picking up the new PDFs. We have attempted to repair and even re-create the collection, but the new PDFs still aren't being collected. That's why I was starting to look at other routes to resolve this issue. Any ideas?

Comment 32 by Raymond Camden posted on 4/29/2011 at 11:38 PM

Just to be clear, you do know that when you add new PDFs that you have to update the index, right? You can and should also check the status result to ensure that items got inserted or updated.

Comment 33 by Corey Christensen posted on 4/30/2011 at 12:04 AM

We're using action="refresh", which should delete all docs in the collection and then add keys to the index. Problem is, it's still not picking up newly added files. It's been so long since I've used Verity... will need to continue troubleshooting.

Comment 34 by Raymond Camden posted on 4/30/2011 at 12:09 AM

You should check the status result from cfindex. See if it matches the #s you expect.
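A minimal sketch of that check follows. The collection name (pdfdocs) and the folder path are assumptions for illustration; the status attribute names the struct that cfindex fills in.

```cfml
<!--- Collection name and folder are assumptions for illustration. --->
<cfindex collection="pdfdocs"
         action="refresh"
         type="path"
         key="#expandPath('./pdfs')#"
         extensions=".pdf"
         recurse="true"
         status="indexStatus">

<!--- Compare the counts in this struct against the number of files you expect. --->
<cfdump var="#indexStatus#" label="cfindex status">
```

Dumping the struct first, rather than relying on specific key names, shows exactly what the server reports before any logic is built on top of it.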

Comment 35 by Corey Christensen posted on 5/2/2011 at 8:33 PM

The reason new PDFs weren't being added to the collection was that the user had been upgraded to Acrobat 9.1 and the distiller was saving the PDFs in v1.5, which cfindex doesn't like. The fix was:
1. use cffile to read file contents.
2. cfif version gt 1.4; notify user to upload compatible pdf.
3. always use the status for cfindex to catch future issues. Thanks so much!
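The version check in steps 1 and 2 can be sketched as below. This is an illustration rather than the code actually used: the file name upload.pdf is hypothetical, and it relies on a PDF starting with a header such as %PDF-1.5.

```cfml
<!--- upload.pdf is a hypothetical file name standing in for the uploaded document. --->
<cfset pdfPath = expandPath("./upload.pdf")>
<cffile action="read" file="#pdfPath#" variable="pdfContents" charset="iso-8859-1">

<!--- The header looks like %PDF-1.5, so the version starts at character 6. --->
<cfset pdfVersion = val(mid(pdfContents, 6, 3))>

<cfif left(pdfContents, 5) neq "%PDF-">
  <cfoutput>This does not look like a PDF file.</cfoutput>
<cfelseif pdfVersion gt 1.4>
  <cfoutput>This PDF is version #pdfVersion#. Please upload a PDF saved as version 1.4 or earlier.</cfoutput>
<cfelse>
  <!--- Safe to hand the file to cfindex here. --->
</cfif>
```

Note that cffile reads the whole file into memory, which may matter for very large documents.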

Comment 36 by Raymond Camden posted on 5/2/2011 at 8:35 PM

To be clear, this was Verity, right? I assume Solr wouldn't have an issue with 1.5.

Comment 37 by Corey Christensen posted on 5/2/2011 at 8:38 PM

yes, we are using verity (cf8)

Comment 38 by Phil Evans posted on 11/21/2011 at 12:41 PM

Brilliant - thanks Ray!

This has enabled me to do a workaround fix for the problem where cfdocument does not support
style="page-break-inside: avoid;"

It is not overly efficient, in that it recreates the entire report for each instance where
a page break needs to be repositioned, but the preliminary testing is good. Others may like to embellish the following.
It creates hidden fields in the pdf, formatted as
keep_together_start_001, keep_together_end_001, keep_together_start_002, keep_together_end_002 etc
If the matching start and end tags are not on the same page, then the pdf is recreated with an appropriate page break prior to the section to be kept together.

In the style sheet
p.hidden { height: 1px; width: 1px; overflow: hidden; top: -10px; margin: 0px;}

On initialisation of the page that creates the pdf
<cfif not isdefined("url.keep_together")>
<cfset session.keep_together_list = "">
</cfif>
<cfset keep_together_count = 0>

...

At the start of a section of output that I want kept together, put the following

<cfset keep_together_count += 1>
<cfif ListFind(session.keep_together_list,"#NumberFormat(keep_together_count,"000")#","|") gt 0>
<cfdocumentitem type="pagebreak"></cfdocumentitem>
<cfelse>
<cfoutput>
<p class="hidden">keep_together_start_#NumberFormat(keep_together_count,"000")#</p>
</cfoutput>
</cfif>

...

At the end of the section to be kept together

<cfif ListFind(session.keep_together_list,"#NumberFormat(keep_together_count,"000")#","|") eq 0>
<cfoutput>
<p class="hidden">keep_together_end_#NumberFormat(keep_together_count,"000")#</p>
</cfoutput>
</cfif>

Once the PDF file has been created, use Raymond's pdfutils to check each page for a start without a matching end.
If found, indicate which section of code needs to have a page break inserted before it, and then reproduce the report.

<!--- Check for keep_togethers that were split --->
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("my_report.pdf")>
<cfset pdf_struct = pdf.getText(mypdf)>

<cfloop index="idx" from="1" to="#ArrayLen(pdf_struct)#">
<cfset start_offset = find("keep_together_start_",pdf_struct[idx].text)>
<cfif start_offset neq 0>
<cfset keep_together_id = Mid(pdf_struct[idx].text,start_offset + 20,3)>
<cfset end_offset = findnocase("keep_together_end_#keep_together_id#",pdf_struct[idx].text)>
<cfif end_offset eq 0>
<cfset session.keep_together_list = session.keep_together_list & "|" & keep_together_id>
<cflocation addtoken="No" url="my_report.cfm?keep_together">
</cfif>
</cfif>
</cfloop>

Comment 39 by misty posted on 8/26/2012 at 5:02 PM

Hi Ray, I currently have a PDF where a field shows "%firstname% the king". The text can be removed or replaced with other text. I am not sure if it is a form field inside the PDF or something else, but I can actually remove it.

The thing is, I want to replace it with some name like misty. How should I do this, and what is the better approach?

Comment 40 by Raymond Camden posted on 8/26/2012 at 6:59 PM

I'd build a HTML template with the tokens. You can then use CF string functions to replace the tokens and pass it to cfdocument to create the PDF.
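A minimal sketch of that template approach follows. The template markup, the output file name (letter.pdf), and the variable names are assumptions for illustration; only the %firstname% token comes from the question.

```cfml
<!--- The template text is an assumption; %firstname% is the token to replace. --->
<cfsavecontent variable="template">
<h1>Welcome</h1>
<p>%firstname% the king</p>
</cfsavecontent>

<!--- Swap every occurrence of the token for the real value. --->
<cfset firstName = "misty">
<cfset html = replace(template, "%firstname%", htmlEditFormat(firstName), "all")>

<!--- Render the finished HTML as a PDF on disk. --->
<cfdocument format="pdf" filename="#expandPath('./letter.pdf')#" overwrite="true">
<cfoutput>#html#</cfoutput>
</cfdocument>
```

This builds a fresh PDF from HTML each time rather than editing the existing PDF in place.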

Comment 41 by Lee Wissmiller posted on 4/11/2013 at 12:06 AM

I am working on a search engine for a website, and I noticed your code would create an array of the PDF files. I would like to modify and expand on this code to search any newly posted PDF files at a set scheduled time (hourly, daily etc) and store the results in a database, then I can use SQL to search the database and return result to the users. What do you think of something like this as a simple PDF search engine for a website?

Comment 42 by Raymond Camden posted on 4/11/2013 at 1:33 AM

Unless I'm misunderstanding you, couldn't you just take the array my stuff returns, iterate over it, and do db inserts? Or if you want to insert all the text, just arrayToList it with a " " delimiter to get one big blog.

Comment 43 by Raymond Camden posted on 4/11/2013 at 1:33 AM

"one big blob" I meant.

Comment 44 by Chris posted on 6/5/2013 at 7:45 PM

Hi Ray, have you run into an issue parsing certain PDFs that are in landscape orientation? A department receives a large monthly billing PDF that contains an employee name, department number, etc. I'm trying to pull out this information to automate splitting the PDF and emailing the relevant pages to each employee. But when I try using either your example or the read action of the CFPDF tag, the data returned is jumbled... like it's trying to parse the PDF as if it was in portrait orientation, moving from top to bottom. Have you seen this before? I have no problems with other PDFs in portrait format.

Comment 45 by Raymond Camden posted on 6/6/2013 at 12:16 PM

I have seen this - or something similar. My understanding, and this is NOT scientific, is that when a PDF has text in a "non straight top to bottom" format, the text will be jumbled. I think this kind of makes sense. I mean humans can look at something like an invoice and mentally grok the right way to parse the text, but a computer cannot. As far as I know, there isn't anything you can do about this. (But again, I could be wrong.)

Comment 46 by Chris posted on 6/6/2013 at 6:02 PM

Oops, I meant to say other landscaped PDFs parse fine in my post above.

This is the only report I'm dealing with that is generated by some software called DocumentBurster, so it must be something screwy that software is doing to the report that CF can't understand.

If I figure it out, I'll post back as I haven't found anyone else having this issue in my searches. Thanks Ray.

Comment 47 by Eric posted on 11/7/2013 at 9:15 PM

How do you use the cfdump to dump the results of the PDF text to a database table?

Comment 48 by Raymond Camden posted on 11/7/2013 at 9:16 PM

You don't. CFDUMP is a debugging tool. Instead of cfdumping the variable, you would use it in a cfquery statement using an INSERT statement.

Comment 49 by Eric posted on 11/7/2013 at 9:25 PM

I did this

<cfset pdf = createObject("component", "pdfutils")>

<cfset mypdf = expandPath("PSIAppendices.pdf")>

<cfset results = pdf.getText(mypdf)>

<!---<cfdump var="#results#">--->

<cfquery datasource="xx" name="ins">
INSERT INTO [raw_data]
(data_value)
VALUES
(#pdf.getText(mypdf)#)
</cfquery>

got this

Complex object types cannot be converted to simple values.

The expression has requested a variable or an intermediate expression result as a simple value. However, the result cannot be converted to a simple value. Simple values are strings, numbers, boolean values, and date/time values. Queries, arrays, and COM objects are examples of complex values.
The most likely cause of the error is that you tried to use a complex value as a simple one. For example, you tried to use a query variable in a cfif tag.

The error occurred in C:\ColdFusion9\wwwroot\wrestleworx\testpdfread.cfm: line 14
12 : (data_value)
13 : VALUES
14 : (#pdf.getText(mypdf)#)
15 : </cfquery>
16 :

Comment 50 by Raymond Camden posted on 11/7/2013 at 9:29 PM

Sorry - if you look above, you can see it is an array of strings. If you want to insert all the text, you need to convert the array into a string. The simplest way would be to use arrayToList with " " as a delimiter.
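A minimal sketch of that conversion, reusing the file, datasource, table, and column names from the code above. It assumes data_value is a large text column; cfqueryparam is used so the extracted text is passed as a bound value rather than pasted into the SQL.

```cfml
<cfset pdf = createObject("component", "pdfutils")>
<cfset pages = pdf.getText(expandPath("PSIAppendices.pdf"))>

<!--- Join the per-page strings into one value, separated by a space. --->
<cfset allText = arrayToList(pages, " ")>

<cfquery datasource="xx" name="ins">
INSERT INTO [raw_data] (data_value)
VALUES (<cfqueryparam cfsqltype="cf_sql_longvarchar" value="#allText#">)
</cfquery>
```

To store one row per page instead, the same INSERT can go inside a loop over the array, using pages[i] as the value.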

Comment 51 by Eric posted on 11/7/2013 at 9:32 PM

Ok, I have never worked with arrays before so I will give it a try

Comment 52 by Skiff Blankenship posted on 5/4/2016 at 2:07 PM

"Download attached file" throws a 404 error.

Comment 53 (In reply to #52) by Raymond Camden posted on 5/4/2016 at 2:22 PM

You should be able to find it at https://static.raymondcamde.... The link will be fixed the next time I update the site.