In today's entry I'll be discussing the processDDX action of the CFPDF tag. I have to admit that I wasn't looking forward to this entry. Every time I had looked at the documentation, it just didn't make sense. I didn't see the point. But now that I've looked at it again more in depth, I'm almost in awe at how cool this feature is. I'm definitely just scratching the surface in this blog post, but hopefully it will encourage others to look into DDX and how it works with ColdFusion.
So as you can probably guess, CFPDF's processDDX action lets ColdFusion work with DDX. Ok, so what in the heck is DDX? DDX stands for Document Description XML. You can think of it like a template for a PDF file. At a basic level, it lets you lay out PDF files (like the Merge option does) and add special commands (generate a table of contents for example). DDX is used by Adobe's LiveCycle Assembler product. ColdFusion ships with a stripped down version of this product. The exact XML tags not allowed in ColdFusion are listed in the documentation. As far as I can see, there is no way to enter a serial and enable the full power of LiveCycle Assembler. But even with the restrictions there is an incredible amount of power that you have built in. As I mentioned above, this entry is only going to talk at a high level about DDX. You can find the DDX reference here. Also as Charlie Arehart has mentioned in a comment in my PDF series, the ColdFusion documentation is excellent. I want to credit them for my examples below as all are either direct copies or modified versions of their examples. Also note that this is a very complex topic. There is a good chance I will screw something up so please let me know if I do.
Let's begin by talking about how you use DDX in ColdFusion. ColdFusion 8 adds an isDDX() function. This function takes either a relative/absolute path to a filename or an actual string of DDX tags. Don't worry too much about the XML just yet, but here is a simple example of checking a string to see if it is valid DDX:
<cfsavecontent variable="myddx">
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<PDF result="Out1">
<PDF source="Title"/>
<TableOfContents/>
<PDF source="Doc1"/>
<PDF source="Doc2"/>
</PDF>
</DDX>
</cfsavecontent>
<cfset myddx = trim(myddx)>
<cfif isDDX(myddx)>
Yes, it's DDX.
<cfelse>
No, it's not.
</cfif>
In this example I've just used the CFSAVECONTENT tag to wrap my DDX XML. I trim it and then check to see if it is DDX. Now that I've shown you a bit of DDX, let me talk a bit about what that example does. Ignoring the DDX tag, there are 2 XML tags in use here, PDF and TableOfContents. The first PDF tag uses result="Out1" and wraps the other tags. This basically says the result of everything on the inside should be put into a result named Out1. On the inside there are 3 PDF tags with a source. You can think of this like a merge. The tags specify an order based on names: Title, Doc1, and Doc2. So far so good. But then note that a TableOfContents tag exists right after the Title PDF. This particular tag can do a lot - but at a basic level, it just says, "Create a table of contents using the PDFs following me."
So let me repeat what I said above. This is partially for my sake to ensure I'm describing it right (remember what I said, I'm new to this!). What we have is a template that takes 3 PDFs. It puts the Title PDF first. It defines a page as a Table of Contents. It then lays down two more PDFs. Let's take a look at how ColdFusion can work with this DDX.
First, note that the DDX works with PDF names. Notice I don't have any real file names. Nor do I have ColdFusion variables. Instead I have labels like Out1, Title, Doc1, and Doc2. So we need a way to pass real values so that LiveCycle Assembler can use them when processing the DDX. The CFPDF tag takes two related attributes, inputFiles and outputFiles. Each of these is a structure that maps the names used in the DDX to file names. So using our sample DDX above, I can define my 3 input PDFs like so:
<cfset inputStruct=StructNew()>
<cfset inputStruct.Title="title.pdf">
<cfset inputStruct.Doc1="paris.pdf">
<cfset inputStruct.Doc2="booger.pdf">
Defining the output file is also struct based:
<cfset outputStruct=StructNew()>
<cfset outputStruct.Out1="output1.pdf">
Ok, so at this point I've defined all the names used in the DDX file. Now let's use CFPDF to run the process:
<cfpdf action="processddx" ddxfile="#myddx#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar">
Pretty trivial I think. I passed in my structs and DDX. At this point I now have a result. If I dump ddxVar, I will see a structure. Each key of the structure maps to the output key from my DDX. I had used this tag:
<PDF result="Out1">
So ddxVar.out1 will contain a status message for my result. It will either be "successful" or "failed" followed by a reason. One quick note. You will notice I used paths for my PDFs. In order to use DDX, you have to work with real files. You can't pass in a PDF created in memory. Obviously you can make the PDF on the fly and save it in the same request.
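Since each key in the result structure holds a status string, it can be handy to check them all in one pass. Here is a minimal sketch, assuming the ddxVar structure returned by the CFPDF call above and assuming the status text starts with "successful" or "failed" as described. The variable name resultName is my own. I escape the message before printing because a failure reason can contain tag names such as <TableOfContents>.

```cfml
<!--- Sketch: report the status of every result key returned by processDDX. --->
<cfloop collection="#ddxVar#" item="resultName">
    <cfif findNoCase("successful", ddxVar[resultName]) eq 1>
        <cfoutput><p>#resultName#: created successfully.</p></cfoutput>
    <cfelse>
        <!--- htmlEditFormat() keeps any angle brackets in the reason visible in the browser. --->
        <cfoutput><p>#resultName#: #htmlEditFormat(ddxVar[resultName])#</p></cfoutput>
    </cfif>
</cfloop>
```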
If you view your PDF now (remember it was named output1.pdf), you may notice that you don't have a table of contents. Turns out that the TableOfContents tag looks for a bookmark. I had to switch this code:
<cfdocument format="pdf" filename="paris.pdf" overwrite="true">
<h2>Paris Hilton</h2>
<p>
Here is the collected wisdom of Paris Hilton.
</p>
</cfdocument>
To this:
<cfdocument format="pdf" filename="paris.pdf" overwrite="true" bookmark="true">
<cfdocumentsection name="Paris Section">
<h2>Paris Hilton</h2>
<p>
Here is the collected wisdom of Paris Hilton.
</p>
</cfdocumentsection>
</cfdocument>
Note the use of bookmark="true" and a cfdocumentsection that wraps the entire page. That was slightly confusing at first, but the end result is perfect. What is great is that my ColdFusion Cookbook site will be able to benefit from this. Right now I have something like 120+ pages in a PDF with no real easy way to navigate. By using DDX I'll be able to add a real table of contents to the document!
So what else can you do with DDX? As I mentioned some features were removed from the bundled product, but what is left is still pretty awesome. Charlie Arehart added a comment to another of my blog articles saying that he wished it were simpler to add a watermark to a PDF. I.e., just add "Foo" to the PDF without needing to make a new PDF or an image. Turns out DDX supports that as well. Here is some sample DDX that demonstrates how to apply a watermark. Again - check the LiveCycle Assembler DDX documentation for explicit documentation on each tag.
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<PDF result="Out1">
<PDF source="Doc1">
<Watermark rotation="30" opacity="50%">
<StyledText><p font-size="85pt" font-weight="bold" color="gray" font="Arial">FINAL</p></StyledText>
</Watermark>
</PDF>
</PDF>
</DDX>
Nothing too terribly complex here. Frankly I find this a bit easier than the approach in my earlier blog article on PDFs and watermarks. Maybe not easier per se - but I find it to be more direct. And in case it isn't obvious - since the DDX is completely abstracted, you can pass any PDF in that you want and specify any output. One thing I'm not sure about is whether the value of the watermark, the text, can be dynamic as well. Obviously I can generate my DDX in ColdFusion, so yes, it can be dynamic, but I'm curious to know if DDX itself supports variables for values like the text between the P tags.
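To make the "generate the DDX in ColdFusion" idea concrete, here is a rough sketch. It reuses the watermark DDX from above and the same CFPDF attributes shown earlier. The url.stamp parameter and the output file name paris_watermarked.pdf are hypothetical names I made up for illustration. xmlFormat() is used so that characters like & or < in the text don't break the XML.

```cfml
<!--- Hypothetical input: the watermark text, defaulting to DRAFT. --->
<cfparam name="url.stamp" default="DRAFT">

<!--- Build the DDX as a string; only the text between the p tags is dynamic. --->
<cfsavecontent variable="watermarkDDX"><cfoutput><?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<PDF result="Out1">
<PDF source="Doc1">
<Watermark rotation="30" opacity="50%">
<StyledText><p font-size="85pt" font-weight="bold" color="gray" font="Arial">#xmlFormat(url.stamp)#</p></StyledText>
</Watermark>
</PDF>
</PDF>
</DDX></cfoutput></cfsavecontent>
<cfset watermarkDDX = trim(watermarkDDX)>

<!--- Map the DDX names to real files, as in the earlier examples. --->
<cfset inputStruct = structNew()>
<cfset inputStruct.Doc1 = "paris.pdf">
<cfset outputStruct = structNew()>
<cfset outputStruct.Out1 = "paris_watermarked.pdf">

<!--- Only run the process if the generated string is valid DDX. --->
<cfif isDDX(watermarkDDX)>
    <cfpdf action="processddx" ddxfile="#watermarkDDX#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar">
    <cfdump var="#ddxVar#">
</cfif>
```

This only shows the DDX being built by CFML. It doesn't answer whether DDX has its own variable support.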
One more example. I always wondered why there wasn't a way to read the text of a PDF. Turns out there is - DDX. Consider this simple DDX example:
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="doc1"/>
</DocumentText>
</DDX>
Here is the source PDF I used:
<cfdocument format="pdf" filename="paristoberead.pdf" overwrite="true">
<h2>Paris Hilton</h2>
<p> <cfoutput> This is the text of a PDF. It has a bit of randomness (#randRange(1,100)#) in it. </cfoutput> </p>
<cfdocumentitem type="pagebreak" />
<h2>Fetch Adams</h2>
<p> <cfoutput> This is the second page. It has a bit of randomness (#randRange(1,100)#) in it. </cfoutput> </p>
</cfdocument>
When processed, you get an XML file. The result will look something like so:
<?xml version="1.0" encoding="UTF-8"?>
<DocText xmlns="http://ns.adobe.com/DDX/DocText/1.0/">
<TextPerPage>
<Page pageNumber="1">Paris Hilton This is the text of a PDF . It has a bit of randomness ( 67 ) in it .</Page>
<Page pageNumber="2">Fetch Adams This is the second page . It has a bit of randomness ( 7 ) in it .</Page>
</TextPerPage>
</DocText>
Notice how the HTML was removed. What's cool about this is that if you need to index PDF data and you don't want to use Verity, you could use this instead. (I think tonight I'll write a quick UDF just for this.)
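As a starting point for that kind of UDF, here is a minimal sketch that parses a DocText result like the one above and returns the text of each page as an array. The function name getDocTextPages and the file name paristext.xml are hypothetical. It assumes the XML was written to disk by pointing the Out1 key of the outputFiles structure at an .xml file. The XPath uses local-name() because the Page elements live in the DocText default namespace.

```cfml
<!--- Sketch of a UDF: return an array with the text of each page from a DocText XML file. --->
<cffunction name="getDocTextPages" returntype="array" output="false">
    <cfargument name="xmlPath" type="string" required="true" hint="Absolute path to the XML file written by processDDX">
    <!--- xmlParse() accepts a file path as well as an XML string. --->
    <cfset var doc = xmlParse(arguments.xmlPath)>
    <!--- Match Page elements regardless of their namespace. --->
    <cfset var nodes = xmlSearch(doc, "//*[local-name()='Page']")>
    <cfset var pages = arrayNew(1)>
    <cfset var i = 0>
    <cfloop from="1" to="#arrayLen(nodes)#" index="i">
        <cfset arrayAppend(pages, trim(nodes[i].xmlText))>
    </cfloop>
    <cfreturn pages>
</cffunction>

<!--- Usage: paristext.xml is a made-up name for the DocumentText output file. --->
<cfset pages = getDocTextPages(expandPath("./paristext.xml"))>
<cfdump var="#pages#">
```

From there, each array entry could be stored in a database column and searched with whatever full-text tooling you already have.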
That's it for this blog entry. I want to remind folks - DDX is a big topic and I didn't cover much at all. I also used a lot of code in this example so I've taken all my test CFMs and PDFs and packaged them as a zip attached to this article.
Archived Comments
Ray,
Do you know, are the CF8 PDF functions still using iText at all, or is it all Adobe technology now?
Sorry, I don't. Hopefully an Adobe reader can chime in here.
Is it possible to use this to enter positioned text into an existing PDF document or tabular data?
I'm not sure. I'd check the DDX docs.
Great series Ray. I've scoured the documentation and for the life of me can't see how I can read a pdf file's table of contents. The ddx documentation appears to only show how to inject ddx info and not how to extract it from an existing pdf file. Am I just missing something?
Well, I know DDX can do extractions, as that is how I got the text out. In theory you could get the text from the page that had the TOC. This text wouldn't be structured though.
It appears that CF8 does not support the Bookmarks DDX element that allows you to extract bookmarks from a pdf. In the CF8 docs, it lists the restricted DDX elements but "Bookmarks" is not in the list.
Too bad, I was really hoping for this functionality.
If it isn't listed in the restricted list, can you please file a bug report?
Anyone ever seen this error message:
failed: DDXM_S18005: An error occurred in the PrepareTOC phase while building <TableOfContents>. Cause given.
I only get this error when using the TableOfContents element in the DDX:
<TableOfContents maxBookmarkLevel="infinite" bookmarkTitle="Table of Contents" includeInTOC="false">
<Footer styleReference="CatalogueFooter" />
</TableOfContents>
Any thoughts?
Reading through the livecycle docs I found a neat little parameter that can be added to the <header> and <footer> tags called replaceExisting=true SEE: http://livedocs.adobe.com/l...
Unfortunately I have not been able to get it to work yet. I have added a comment to the CF8 docs but would also love to hear if anybody else has used this successfully.
do most people use verity on cf8 to search though pdf's?
or do they parse out pdf's into text files and search through those?
what are the differences in resources?
I don't know if there is a right answer to that. I don't think a lot of people use Verity in CF, even though they should.
@my own comment about replaceExisting="true"
I HAVE been able to use this with the PDFs I have created with ColdFusion 8. My initial confusion was dealing with (existing) PDFs that "looked" as if they had a header and footer BUT when I converted those PDFs to text I found that it was actually body text stretched out to the edge of the page.
@Verity
I have been reluctant to use Verity because a) it does consume quite a bit of RAM and b)Databases like MySql come with Full Text Searching built in. Fair enough MySql doesn't search pdf documents though.
I agree with Ray, I would really like to use Verity more, but I also agree with Martin and I tend to shy away from it when it comes to the resources used. I tend to use it more when I absolutely need full search capabilities with document context and scoring.
I tried using the Adobe docs to split Verity off onto its own server, but I was never able to make it work successfully on CF7 or CF8. If anyone has ever completed it successfully, I certainly would be interested.
I've got an odd error when running this ddx watermark example. The text "FINAL" sometimes appears as random nonsense characters like #&$&^%*. This happens maybe 1 out of 5 times I run the code. I'm using CF8 on OS X Leopard and opening the files in Apple's Preview program. I have not seen this problem with PDF's generated from my OS X server and viewing on Windows. Have any other Mac users out there noticed this intermittent watermark text problem? Has anyone solved the issue?
Thanks!
Did you change the hard coded font to something else?
Also - note that in 8.0.1, you can now supply HTML for watermarks. This means you don't need to use DDX for it anymore.
Thanks Ray. I used your ddxpdf.zip example as is. In my last test I generated one copy of output2.pdf from ddx3.cfm. When I browse to the ddxpdf folder to open and close output2.pdf, at least 1 in 5 times, Preview renders FINAL as junk text.
I've confirmed this same behavior on another Mac running Leopard too. You've got a Mac right? You don't see this behavior?
I appreciate the heads up on 8.0.1 allowing watermark text outside of ddx, I'll give that a shot.
Tell me this - in the other 4 times, do you see font changes? I mean - still readable, but random fonts?
When I see the word FINAL, it's always Arial. In fact when it doesn't read FINAL but something like &^&*%&* it looks like Arial as well.
You got me on that one. _Are_ you using 8.0.1 yet?
Thanks for bouncing around some ideas Ray. I am using 8.0.1. I tried using the new addwatermark text functionality and ran into the same problem. It looks like the problem though is with the Arial font. I don't see an issue on my main machine or others when using Verdana or Courier.
Looks like it's time to log a bug. :)
http://www.adobe.com/go/wish
Hi Ray,
I'm running your example code for PDF generation using DDX files. Specifically, ddx2.cfm.
I'm getting the same error as Brian above.
failed: DDXM_S18005: An error occurred in the PrepareTOC phase while building <TableOfContents>. Cause given.
I've narrowed it down. When you add bookmark="true" to cfdocument you get the error. If you don't have bookmark="true" it works but no TOC. But I saw your output2.pdf HAS a TOC. Any idea why your code won't run on my copy of CF8? I've tried it on the developer edition and a standard version.
Thanks!
Are you running 801 along with the cumulative hot fix?
Hi Ray,
Installing the 8.01 update solved the problem. That will teach me to not run the latest version of CF.
Did a short post (http://www.designovermatter...) for anyone who does a Google search on the error message. (Which is what I did.)
Cheers
Hi Ray, Obviously being a bit stupid here, but it appears the only way I can create a List of Contents is by setting bookmark to true and giving the bookmark a name using the <cfdocument> tag. Is there no other way?
Terry - the short answer is yes. The long answer is that it appears, MAYBE, that you can do it via DDX as well:
http://livedocs.adobe.com/l...
But certainly it's easier doing it in CFML (imho).
Thanks. I've been trying to use processddx to grab the text of a PDF which is set up in a 2 column display format.
The OutXML seems like it's reading across the lines instead of down the columns.
The docs (http://livedocs.adobe.com/l...) say "the order in which the words are listed is not guaranteed to be the reading order." I've also tried mode="WithQuads", but it seems like it would be really tough to reconstruct the text from the coordinates.
Any ideas?
Well, if you knew your pdfs were 2 columns, then you could expect certain types of results when using WithQuads. That would make reconstructing the text a bit simpler. That's all I can suggest.
is it possible to create a dynamic watermark? i was looking for ways to pass a variable to the ddx file.
This article is a bit outdated. CF801 added support for creating watermarks from simple HTML.
I have a quick question. I have developed an app that captures info from a web form and injects that information into a PDF. Everything works there. I have done tons of research and I'm trying to find a solution to the electronic signature capture portion. Seems that you would have a java applet to write the signature and you could draw that with an interpreter CGI or PERL script or with one of Coldfusion's imagedraw functions. Is there a way to inject that image drawn onto a PDF where the signature line would be or place that image into signature field?
Is there a way to get the summary attribute of a PDF, for example, and split the output to insert a paragraph character for each paragraph?
When you get this field attribute for searching and print the result, you get one large paragraph that contains a lot of sentences from different positions in the PDF, but nothing separates these sentences. You know what I mean?
Not quite sure I get what you mean, but if the summary has line breaks in between stuff then obviously - in html - it would all run together. You can use ColdFusion's paragraphFormat function to add in paragraphs.
Hi Raymond,
In the text of your article about ColdFusion 8 and DDX, you include a link to some in-depth documentation on the LiveCycle DDX 'language'. Unfortunately the link is now broken.
Is there any chance that you might be able to inform me of the current link to the same information.
You see, I am implementing some code to create a compound PDF document that could do with a Table of Contents (which I have working in a basic sense); however, the formatting could do with some tweaking. Someone else's blog mentioned how to do it, but their code throws some errors, so I want to check all of the syntax.
Thanks,
Bryn Parrott
Google is your friend. ;)
http://help.adobe.com/en_US... (PDF obviously)
Great article. I was looking at adding a TOC dynamically and your example was spot on. :-)
Just wondering if CF(9) will let us set multiple levels in the TOC. It appears that DDX would have no problem with that (maxBookmarkLevel="infinite"), but the CF documentation explicitly states that all bookmarks are placed directly under the root.
Any ideas on how to get this done?
Thnx in advance!
Jasper
No idea... but give it a try and let us know. :)
Ok, will do. :)
I went down this path a few weeks ago. I also wanted to produce multiple levels of bookmarks. I achieved a modicum of success, but spent way too much time doing it. The best I could manage was 2 levels of bookmarks. The basic problem was that CF would not allow nested cfdocumentsections, and that is the only way they provide to create them.
Now if you were to consolidate documents that were created some other way then it might work. PDFLib perhaps.
To be honest it would have been far better had Adobe provided the capability to create bookmarks based on HTML H tags e.g. h1, h2, h3 etc. or even something like <cfdocumentbookmark /> perhaps !!. Far more intuitive, flexible and easy to work with than documentsections.
Don't get me started on DDX by the way. No fun at all.