Yesterday I blogged about ColdFusion and DDX, a way to do some fancy-pants neato transformations of PDF documents. One of the cooler examples was that DDX could be used to grab the text from a PDF file. For those who thought it might be too difficult to use DDX, I've wrapped up the code in a new ColdFusion Component I'm calling PDF Utils. (Coming to a RIAForge near you soon. Watch the skies...)
Right now the CFC has one method, getText. You pass in the path to a PDF and you get an array of pages. Each item in the array is the text on that particular page. I've included on this blog post two sample PDFs. One is a normal PDF with simple text. As you can imagine, the function works great with it. The other one is a highly graphical, wacky looking PDF. Ok it isn't wacky looking per se, but it isn't a simple letter. When the method is run on this PDF, the text does come back, but it is a bit crazy looking. I think this is to be expected though. And what's cool is that if your intent is to get the text out for searching/indexing purposes, you can still find it useful.
Anyway, here is a sample:
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("./paristoberead.pdf")>
<cfset results = pdf.getText(mypdf)>
<cfdump var="#results#">
Which gives this result:

The zip includes 2 PDFs, the component, and my test script.
Archived Comments
A PDF utility component... that just has all the makings of a massively popular CFC :)
My next utility will be pageGet. Right now you can easily delete pages, but what if you want just one page? Well the answer is to obviously delete everything else, but I'll have a utility function to make it easier.
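A rough sketch of the "delete everything else" idea, not the eventual pageGet utility. It assumes ColdFusion 8's cfpdf getinfo and deletepages actions; the output file name and the pageToKeep variable are made up for illustration.

```cfml
<!--- Hypothetical inputs: source PDF, output PDF, and the one page to keep --->
<cfset source = expandPath("./paristoberead.pdf")>
<cfset dest = expandPath("./onepage.pdf")>
<cfset pageToKeep = 2>

<!--- getinfo returns a struct that includes TotalPages --->
<cfpdf action="getinfo" source="#source#" name="info">

<!--- Build a comma-separated list of every page number except the one we keep --->
<cfset toDelete = "">
<cfloop index="i" from="1" to="#info.TotalPages#">
    <cfif i neq pageToKeep>
        <cfset toDelete = listAppend(toDelete, i)>
    </cfif>
</cfloop>

<cfif len(toDelete)>
    <cfpdf action="deletepages" source="#source#" pages="#toDelete#" destination="#dest#" overwrite="true">
<cfelse>
    <!--- Single-page PDF: nothing to delete, so just copy the file --->
    <cffile action="copy" source="#source#" destination="#dest#">
</cfif>
```

Listing pages one by one is clumsy for very large documents, but it avoids any edge cases with ranges at the first or last page.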
I've actually got an entire online PDF Editor I worked on for the CF8 tour. I'm going to load it up when I wrap the CFPDF series.
That's gonna be sweet! I work with a lot of legal clients and they all but worship the PDF as a medium.
Cool! I can see a lot of use for those PDF utils.
Any update for pdfutils? :)
I haven't done anything special with it - but it probably is time to release it at RIAForge. Oguz, if I don't by EOD, yell at me.
The demo doesn't seem to just work.
test.cfm seems OK
test2.cfm dumps the PDFDocument structure? Is the cfpdf write failing silently?
genpdf.cfm works (just try paristoread_new.pdf or whatever)
xmptest.cfm returns [empty string] no matter what I try. . ...
If you look at test2, you will see it does indeed dump out the pdf structure. That is expected. And it should overwrite page2.pdf.
As for xmptest.cfm, it will be empty if your PDF doesn't use XMP. Not all PDFs do. If you think yours does and it doesn't work, email me the PDF.
Nice component.
Can I do the same in MX7?
Thanks
Not that I know of.
Great tool! But it looks like it replaces newline characters with spaces. Is there any way to figure out where they used to be?
I just checked, and my code isn't doing anything to the text. It must be how it comes from DDX.
Dumb question - but if you are outputting it into HTML, don't forget newlines won't be displayed unless you use PRE tags.
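As a quick illustration of the PRE point, here is a minimal sketch that assumes results is the array of page strings returned by getText in the sample from the post:

```cfml
<!--- results is the array returned by pdf.getText(), as in the sample in the post --->
<cfoutput>
<cfloop index="i" from="1" to="#arrayLen(results)#">
    <h3>Page #i#</h3>
    <!--- PRE keeps whatever line breaks are in the text; htmlEditFormat escapes any markup --->
    <pre>#htmlEditFormat(results[i])#</pre>
</cfloop>
</cfoutput>
```

If the text still runs together inside the PRE block, the line breaks really are missing from what DDX returned.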
When I use getText I just get back a blank page. I tried PDFUtils because all my other attempts to use DDX to get text from a pdf also failed - all return a blank page (unless there is an error).
Processing seems to stop as soon as the cfpdf tag calls processddx.
This is on a shared coldfusion 8 server at crystaltech.
Are you sure processing ends? Do you get an error? What happens when you try DDX by itself?
As you probably know, in CF9 the cfpdf tag can now use action="extracttext" to pull text out of the pdf document. According to the documentation you should also be able to use useStructure="true" along with honourspaces="true" in order to return the structure. However, I have been unable to get this to work. I simply need to get line breaks, centering, and indents/tabs, but no luck thus far. Any ideas?
Thanks
How is it _not_ working? Do you get a syntax error? Something else?
The cfpdf tag is extracting the content of the pdf file and it is readable; however, the basic structure of the document is lost. There are no line breaks, no indenting, no centering.
I have tried both .doc and .docx. Code is simple
<cfpdf useStructure="true" addquads="false" honourspaces="true" type="string" action="extracttext" source="test.docx.pdf" name="pdfToText" />
The document is simple and contains about 7 lines, just enough to test line breaks, indents, and centering.
Thanks for any help you can provide.
Just to be dumb - how are you testing that line breaks were removed? Remember that if you just output it to the browser, since it is an HTML environment, you won't see the line breaks unless you view source.
I wrote the results to a text file and checked the content of the file. BTW, I also tried the extract using type='xml' and that did not work any better.
Interesting. Do you feel comfortable sharing your PDF with me?
No problem, it is a simple file. How do you want me to send it to you. I don't see a place on your blog where I can upload the file.
Email it to me. ray at camdenfamily dot com
Well I wish I had something nice to report, but I don't. I'm seeing the same as you. I also tried my pdfutils component (even though I assumed Adobe was using the exact same code (ddx) for theirs) and I got the same result. If you use the quads option you DO seem to get the proper data, but you would need to 'draw' it on screen to get it which would be overkill.
This may be a bit much - but you could use Google Docs. I have a wrapper CFC for it. You could upload to Google Docs then download as HTML or text. It would be slow (well, slowish), but if you needed it for one time conversions, it would be acceptable I think.
Thanks much for trying!!
I had also tried the DDX approach and the quads option. I had briefly considered using the coordinates, but the docs I need to convert are gigantic and I get as many as 40 a day. Given that, I don't think the Google approach will work either. I have 3rd party software that currently converts Word to PDF and to Text, but I had hoped to get rid of the software and let CF do both. With CF 9, I am now able to convert the docs to PDF, but thus far no easy way to go to text. Maybe the POI will work.
Well, 40 isn't so much. If you assume 1 minute of network time to do the 'work' (really, Google is), I'd call this fair. I'd just do it behind the scenes - ie not make the user wait for it.
As to your 3rd party tool - shoot, if it works, use it! I'd brush my teeth with ColdFusion if I could, but at the end of the day, you want to use what works best.
That is 40 gigantic docs all in one batch with processing handled, as you said, "behind the scenes". Going to a complete CF solution, without the 3rd party conversion software, would save us some bucks, and small government agencies are all about saving money. :-)
Anyway, thanks again for your help!! It is good to know it is not just me and banging my head for another week won't change the fact that the tag won't work!
Hey guys. its AOC's finest! How u doing? I'm trying to fix a prob with old skool verity collections that is requiring me to build a new search interface for about 500+ pdfs. we are only running cf8 (boo) so wondering if you have any pointers to allow for cf searching in pdfs!!
Verity can index PDFs. :)
Hi Corey. As you may remember we do not use Verity, so I can't help. Wondered what happened to you - send me an email one day and we will catch up.
Correct, we are having a problem with our verity collections. We are adding new pdf documents to the verity collections, but verity isn't picking up the new pdfs. We have attempted to repair and even re-create the collection, but the new pdfs still aren't being collected. That's why I was starting to look at other routes to resolve this issue. Any ideas?
Just to be clear, you do know that when you add new PDFs that you have to update the index, right? You can and should also check the status result to ensure that items got inserted or updated.
We're using action="refresh", which should delete all docs in the collection and then add keys to the index. Prob is, it's still not picking up newly added files. It's been so long since I've used verity... will need to continue troubleshooting.
You should check the status result from cfindex. See if it matches the #s you expect.
The reason new pdfs weren't being added to the collection was because the user had been upgraded to Acrobat 9.1 and the distiller was saving the pdfs in v1.5 which cfindex doesn't like. fix was
1. use cffile to read file contents.
2. cfif version gt 1.4; notify user to upload compatible pdf.
3. always use the status for cfindex to catch future issues. Thanks so much!
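The three steps above might look roughly like this. It is only a sketch: the upload path and the collection name pdfdocs are hypothetical, it reads just the first line with fileOpen/fileReadLine rather than the whole file with cffile, and it only looks at the %PDF-x.y header (a PDF can also declare its version elsewhere in the file).

```cfml
<!--- Hypothetical path to a newly uploaded PDF --->
<cfset pdfPath = expandPath("./uploads/new.pdf")>

<!--- A PDF normally starts with a header such as %PDF-1.5, so the first line is enough --->
<cfset f = fileOpen(pdfPath, "read")>
<cfset header = fileReadLine(f)>
<cfset fileClose(f)>

<!--- Pull the version number out of the header; 0 means no header was found --->
<cfset pdfVersion = 0>
<cfif left(header, 5) eq "%PDF-">
    <cfset pdfVersion = val(mid(header, 6, 3))>
</cfif>

<cfif pdfVersion eq 0>
    <cfoutput>That file does not look like a PDF.</cfoutput>
<cfelseif pdfVersion gt 1.4>
    <cfoutput>This PDF is version #pdfVersion#. Please upload one saved as PDF 1.4 or earlier.</cfoutput>
<cfelse>
    <!--- status names a struct that receives the result of the index operation --->
    <cfindex action="update" collection="pdfdocs" type="file" key="#pdfPath#" status="indexStatus">
    <cfdump var="#indexStatus#">
</cfif>
```

Dumping the status struct after each update makes it obvious when a file was skipped instead of indexed.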
To be clear, this was Verity, right? I assume Solr wouldn't have an issue with 1.5.
yes, we are using verity (cf8)
Brilliant - thanks Ray!
This has enabled me to do a workaround fix for the problem where cfdocument does not support
style="page-break-inside: avoid;"
It is not overly efficient, in that it recreates the entire report for each instance where
a page break needs to be repositioned, but the preliminary testing is good. Others may like to embellish the following.
It creates hidden fields in the pdf, formatted as
keep_together_start_001, keep_together_end_001, keep_together_start_002, keep_together_end_002 etc
If the matching start and end tags are not on the same page, then the pdf is recreated with an appropriate page break prior to the section to be kept together.
In the style sheet
p.hidden { height: 1px; width: 1px; overflow: hidden; top: -10px; margin: 0px;}
On initialisation of the page that creates the pdf
<cfif not isdefined("url.keep_together")>
<cfset session.keep_together_list = "">
</cfif>
<cfset keep_together_count = 0>
...
At the start of a section of output that I want kept together, put the following
<cfset keep_together_count += 1>
<cfif ListFind(session.keep_together_list,"#NumberFormat(keep_together_count,"000")#","|") gt 0>
<cfdocumentitem type="pagebreak"></cfdocumentitem>
<cfelse>
<cfoutput>
<p class="hidden">keep_together_start_#NumberFormat(keep_together_count,"000")#</p>
</cfoutput>
</cfif>
...
At the end of the section to be kept together
<cfif ListFind(session.keep_together_list,"#NumberFormat(keep_together_count,"000")#","|") eq 0>
<cfoutput>
<p class="hidden">keep_together_end_#NumberFormat(keep_together_count,"000")#</p>
</cfoutput>
</cfif>
Once the PDF file has been created, use Raymond's pdfutils to check each page for a start without a matching end.
If found, indicate which section of code needs to have a page break inserted before it, and then reproduce the report.
<!--- Check for keep_togethers that were split --->
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("my_report.pdf")>
<cfset pdf_struct = pdf.getText(mypdf)>
<cfloop index="idx" from="1" to="#ArrayLen(pdf_struct)#">
<cfset start_offset = find("keep_together_start_",pdf_struct[idx].text)>
<cfif start_offset neq 0>
<cfset keep_together_id = Mid(pdf_struct[idx].text,start_offset + 20,3)>
<cfset end_offset = findnocase("keep_together_end_#keep_together_id#",pdf_struct[idx].text)>
<cfif end_offset eq 0>
<cfset session.keep_together_list = session.keep_together_list & "|" & keep_together_id>
<cflocation addtoken="No" url="my_report.cfm?keep_together">
</cfif>
</cfif>
</cfloop>
hi ray, i currently have a pdf where the field shows %firstname% the king, where it can be removed or replaced with other text. i am not sure if that is a form inside or what, but i can actually remove it.
the thing is i want to replace it with some name like misty. how do i do that, and what is the better approach for this?
I'd build a HTML template with the tokens. You can then use CF string functions to replace the tokens and pass it to cfdocument to create the PDF.
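A minimal sketch of that approach, assuming a made-up template and output file name (only the %firstname% token comes from the question):

```cfml
<!--- Hypothetical HTML template containing the %firstname% token --->
<cfsavecontent variable="template">
<html>
<body>
<h1>%firstname% the king</h1>
</body>
</html>
</cfsavecontent>

<!--- Swap every occurrence of the token for the real value --->
<cfset html = replace(template, "%firstname%", htmlEditFormat("Misty"), "all")>

<!--- Render the finished HTML to a PDF file --->
<cfdocument format="pdf" filename="#expandPath('./king.pdf')#" overwrite="true">
<cfoutput>#html#</cfoutput>
</cfdocument>
```

With more than one token, the same replace call can simply be repeated for each one before the HTML is handed to cfdocument.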
I am working on a search engine for a website, and I noticed your code would create an array of the PDF files. I would like to modify and expand on this code to search any newly posted PDF files at a set scheduled time (hourly, daily etc) and store the results in a database, then I can use SQL to search the database and return result to the users. What do you think of something like this as a simple PDF search engine for a website?
Unless I'm misunderstanding you, couldn't you just take the array my stuff returns, iterate over it, and do db inserts? Or if you want to insert all the text, just arrayToList it with a " " delimiter to get one big blog.
"one big blob" I meant.
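For the one-row-per-page version, a sketch along these lines should do. The datasource, table, and column names are hypothetical, and it assumes getText returns an array of strings as described in the post.

```cfml
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("./paristoberead.pdf")>
<cfset results = pdf.getText(mypdf)>

<!--- One row per page; datasource, table, and column names are hypothetical --->
<cfloop index="i" from="1" to="#arrayLen(results)#">
    <cfquery datasource="mydsn">
        INSERT INTO pdf_pages (filename, page_number, page_text)
        VALUES (
            <cfqueryparam value="#getFileFromPath(mypdf)#" cfsqltype="cf_sql_varchar">,
            <cfqueryparam value="#i#" cfsqltype="cf_sql_integer">,
            <cfqueryparam value="#results[i]#" cfsqltype="cf_sql_longvarchar">
        )
    </cfquery>
</cfloop>
```

Storing the page number alongside the text means a search hit can point the user at the right page, not just the right file.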
Hi Ray, have you run into an issue parsing certain PDFs that are in landscape orientation? A department receives a large monthly billing PDF that contain an employee name, department number, etc. I'm trying to pull out this information to automate splitting the PDF and emailing the relevant pages to each employee. But when I try using either your example or the read action of the CFPDF tag, the data returned is jumbled... like it's trying to parse the PDF as if it was in portrait orientation, moving from top to bottom. Have you seen this before? I have no problems with other PDFs in portrait format.
I have seen this - or something similar. My understanding, and this is NOT scientific, is that when a PDF has text in a "non straight top to bottom" format, the text will be jumbled. I think this kind of makes sense. I mean humans can look at something like an invoice and mentally grok the right way to parse the text, but a computer cannot. As far as I know, there isn't anything you can do about this. (But again, I could be wrong.)
Oops, I meant to say other landscaped PDFs parse fine in my post above.
This is the only report I'm dealing with that is generated by some software called DocumentBurster, so it must be something screwy that software is doing to the report that CF can't understand.
If I figure it out, I'll post back as I haven't found anyone else having this issue in my searches. Thanks Ray.
How do you use the cfdump to dump the results of the PDF text to a database table?
You don't. CFDUMP is a debugging tool. Instead of cfdumping the variable, you would use it in a cfquery statement using an INSERT statement.
I did this
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("PSIAppendices.pdf")>
<cfset results = pdf.getText(mypdf)>
<!---<cfdump var="#results#">--->
<cfquery datasource="xx" name="ins">
INSERT INTO [raw_data]
(data_value)
VALUES
(#pdf.getText(mypdf)#)
</cfquery>
got this
Complex object types cannot be converted to simple values.
The expression has requested a variable or an intermediate expression result as a simple value. However, the result cannot be converted to a simple value. Simple values are strings, numbers, boolean values, and date/time values. Queries, arrays, and COM objects are examples of complex values.
The most likely cause of the error is that you tried to use a complex value as a simple one. For example, you tried to use a query variable in a cfif tag.
The error occurred in C:\ColdFusion9\wwwroot\wrestleworx\testpdfread.cfm: line 14
12 : (data_value)
13 : VALUES
14 : (#pdf.getText(mypdf)#)
15 : </cfquery>
16 :
Sorry - if you look above, you can see it is an array of strings. If you want to insert all the text, you need to convert the array into a string. The simplest way would be to use arrayToList with " " as a delimiter.
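Roughly, that change to the code above would look like this. It is a sketch that reuses the datasource and table from the question and adds cfqueryparam, which also takes care of quoting the text.

```cfml
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("PSIAppendices.pdf")>
<cfset results = pdf.getText(mypdf)>

<!--- Join the array of page strings into one string, separated by spaces --->
<cfset allText = arrayToList(results, " ")>

<cfquery datasource="xx">
INSERT INTO [raw_data] (data_value)
VALUES (<cfqueryparam value="#allText#" cfsqltype="cf_sql_longvarchar">)
</cfquery>
```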
Ok, I have never worked with arrays before so I will give it a try
"Download attached file" throws a 404 error.
You should be able to find it at https://static.raymondcamde.... The link will be fixed the next time I update the site.