Posted in ColdFusion | Posted on 07-25-2007 | 7,747 views
Yesterday I blogged about ColdFusion and DDX, a way to do some fancy-pants neato transformations of PDF documents. One of the cooler examples was that DDX could be used to grab the text from a PDF file. For those who thought it might be too difficult to use the DDX, I've wrapped up the code in a new ColdFusion Component I'm calling PDF Utils. (Coming to a RIAForge near you soon. Watch the skies...)
Right now the CFC has one method, getText. You pass in the path to a PDF and you get an array of pages. Each item in the array is the text on that particular page. I've included on this blog post two sample PDFs. One is a normal PDF with simple text. As you can imagine, the function works great with it. The other one is a highly graphical, wacky looking PDF. Ok it isn't wacky looking per se, but it isn't a simple letter. When the method is run on this PDF, the text does come back, but it is a bit crazy looking. I think this is to be expected though. And what's cool is that if your intent is to get the text out for searching/indexing purposes, you can still find it useful.
Anyway, here is a sample:
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("./paristoberead.pdf")>

<cfset results = pdf.getText(mypdf)>
<cfdump var="#results#">
Which gives this result:

The zip includes 2 PDFs, the component, and my test script.
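For anyone curious what a DDX-based getText might look like under the hood, here is a minimal sketch. It is not the actual pdfutils source: the variable names, the textresult.xml output path, and the XPath used to pull out the pages are all assumptions made for illustration. It relies on the DDX DocumentText element and on cfpdf's processddx action.

```cfml
<!--- Hypothetical sketch of DDX text extraction; NOT the real pdfutils code. --->
<cfset mypdf = expandPath("./paristoberead.pdf")>

<!--- DDX instructions: extract the text of the input named "doc1" into the result named "Out1". --->
<cfsavecontent variable="ddx"><?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/">
  <DocumentText result="Out1">
    <PDF source="doc1"/>
  </DocumentText>
</DDX>
</cfsavecontent>

<!--- Map the DDX names to real files. The output path is an assumption. --->
<cfset inputStruct = structNew()>
<cfset inputStruct.doc1 = mypdf>
<cfset outputStruct = structNew()>
<cfset outputStruct.Out1 = expandPath("./textresult.xml")>

<cfpdf action="processddx" ddxfile="#trim(ddx)#"
       inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxResult">

<!--- The result is an XML file; collect the text of each Page element into an array. --->
<cffile action="read" file="#outputStruct.Out1#" variable="xmlText">
<cfset pages = xmlSearch(xmlParse(xmlText), "//*[local-name()='Page']")>
<cfset results = arrayNew(1)>
<cfloop index="i" from="1" to="#arrayLen(pages)#">
  <cfset arrayAppend(results, pages[i].xmlText)>
</cfloop>

<cfdump var="#results#">
```

The local-name() XPath is only there to sidestep the namespace on the result XML; a real implementation would probably also clean up the temporary XML file afterwards.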


I've actually got an entire online PDF Editor I worked on for the CF8 tour. I'm going to load it up when I wrap the CFPDF series.
test.cfm seems OK
test2.cfm dumps the PDFDocument structure? Is the cfpdf write failing silently?
genpdf.cfm works (just try paristoread_new.pdf or whatever)
xmptest.cfm returns [empty string] no matter what I try...
As for xmptest.cfm, it will be empty if your PDF doesn't use XMP. Not all PDFs do. If you think yours does and it doesn't work, email me the PDF.
Can I do the same in MX7?
Thanks
Processing seems to stop as soon as the cfpdf tag calls processddx.
This is on a shared ColdFusion 8 server at CrystalTech.
Thanks
I have tried both .doc and .docx. The code is simple:
<cfpdf useStructure="true" addquads="false" honourspaces="true" type="string" action="extracttext" source="test.docx.pdf" name="pdfToText" />
The document is simple and contains about 7 lines, just enough to test line breaks, indents, and centering.
Thanks for any help you can provide.
This may be a bit much, but you could use Google Docs. I have a wrapper CFC for it. You could upload to Google Docs, then download as HTML or text. It would be slow (well, slowish), but if you needed it for one-time conversions, it would be acceptable I think.
I had also tried the DDX approach and the quads option. I had briefly considered using the coordinates, but the docs I need to convert are gigantic and I get as many as 40 a day. Given that, I don't think the Google approach will work either. I have 3rd party software that currently converts Word to PDF and to Text, but I had hoped to get rid of the software and let CF do both. With CF 9, I am now able to convert the docs to PDF, but thus far no easy way to go to text. Maybe the POI will work.
As to your 3rd party tool - shoot, if it works, use it! I'd brush my teeth with ColdFusion if I could, but at the end of the day, you want to use what works best.
Anyway, thanks again for your help!! It is good to know it is not just me and banging my head for another week won't change the fact that the tag won't work!
1. use cffile to read file contents.
2. cfif version gt 1.4; notify user to upload compatible pdf.
3. always use the status for cfindex to catch future issues. Thanks so much!
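Those three steps might look something like the sketch below. It assumes that "version" in step 2 means the PDF version declared in the file header (a PDF normally begins with a line such as %PDF-1.4), and the upload path and the collection name pdfCollection are made-up placeholders.

```cfml
<!--- Hypothetical sketch; the file path and collection name are placeholders. --->
<cfset uploadedPdf = expandPath("./uploads/report.pdf")>

<!--- 1. Read the file contents. A single-byte charset keeps the header bytes intact. --->
<cffile action="read" file="#uploadedPdf#" variable="pdfContents" charset="iso-8859-1">

<!--- 2. Assumes the file starts with "%PDF-x.y", so characters 6 through 8 hold the version. --->
<cfset pdfVersion = val(mid(pdfContents, 6, 3))>

<cfif pdfVersion gt 1.4>
  <cfoutput>This PDF is version #pdfVersion#. Please upload a PDF saved as version 1.4 or lower.</cfoutput>
<cfelse>
  <!--- 3. Ask cfindex for a status structure so indexing problems are visible. --->
  <cfindex action="update" collection="pdfCollection" type="file"
           key="#uploadedPdf#" status="indexStatus">
  <cfdump var="#indexStatus#">
</cfif>
```

Reading the whole file just to look at the first few bytes is wasteful for large PDFs, so treat this as a starting point rather than a finished check.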
This has enabled me to do a workaround fix for the problem where cfdocument does not support
style="page-break-inside: avoid;"
It is not overly efficient, in that it recreates the entire report for each instance where
a page break needs to be repositioned, but the preliminary testing is good. Others may like to embellish the following.
It creates hidden fields in the pdf, formatted as
keep_together_start_001, keep_together_end_001, keep_together_start_002, keep_together_end_002 etc
If the matching start and end tags are not on the same page, then the pdf is recreated with an appropriate page break prior to the section to be kept together.
In the style sheet
p.hidden { height: 1px; width: 1px; overflow: hidden; top: -10px; margin: 0px;}
On initialisation of the page that creates the pdf
<cfif not isdefined("url.keep_together")>
<cfset session.keep_together_list = "">
</cfif>
<cfset keep_together_count = 0>
...
At the start of a section of output that I want kept together, put the following
<cfset keep_together_count += 1>
<cfif ListFind(session.keep_together_list,"#NumberFormat(keep_together_count,"000")#","|") gt 0>
<cfdocumentitem type="pagebreak"></cfdocumentitem>
<cfelse>
<cfoutput>
<p class="hidden">keep_together_start_#NumberFormat(keep_together_count,"000")#</p>
</cfoutput>
</cfif>
...
At the end of the section to be kept together
<cfif ListFind(session.keep_together_list,"#NumberFormat(keep_together_count,"000")#","|") eq 0>
<cfoutput>
<p class="hidden">keep_together_end_#NumberFormat(keep_together_count,"000")#</p>
</cfoutput>
</cfif>
Once the PDF file has been created, use Raymond's pdfutils to check each page for a start without a matching end.
If found, indicate which section of code needs to have a page break inserted before it, and then reproduce the report.
<!--- Check for keep_togethers that were split --->
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("my_report.pdf")>
<cfset pdf_struct = pdf.getText(mypdf)>
<cfloop index="idx" from="1" to="#ArrayLen(pdf_struct)#">
<cfset start_offset = find("keep_together_start_",pdf_struct[idx].text)>
<cfif start_offset neq 0>
<cfset keep_together_id = Mid(pdf_struct[idx].text,start_offset + 20,3)>
<cfset end_offset = findnocase("keep_together_end_#keep_together_id#",pdf_struct[idx].text)>
<cfif end_offset eq 0>
<cfset session.keep_together_list = session.keep_together_list & "|" & keep_together_id>
<cflocation addtoken="No" url="my_report.cfm?keep_together">
</cfif>
</cfif>
</cfloop>