Have you ever needed to get information about Office documents with ColdFusion? In this blog entry I'll talk about how we can use some open source tools along with ColdFusion to read from various types of Microsoft Office documents. I'll be focusing on retrieving the text of the documents only. You can use this code to allow users to upload Word documents and provide basic search features against the actual contents of the documents.
Now you may ask - doesn't the built in Verity engine do this? It does, but if you have any other need for the text, you are out of luck. Verity sucks in the text and stores it in it's own format. You can't ask Verity to make a copy of the complete text. Let's get started!
Our code will make use of two open source products. The first is the Apache POI product. This is a set of Java libraries that allow deep integration into Office documents, both reading and writing. Our use of the library will be fairly simple - just reading. I downloaded the latest 3.2 version which provides support for Office formats from the old 97 versions up to the Office 2007 release. For OOXML you can use POI 3.5, which is in beta. I had trouble playing with this so decided to focus on the non-OOXML version.
After you download the zip, open up the archive and copy the 3 JAR files you find in the root. I copied these to a folder called 'jars' under my web root.
The next thing we need is JavaLoader. JavaLoader is a ColdFusion project created by Mark Mandel (the creator of Transfer). This code lets you load any random JAR file on the fly. Normally you have to copy new JARs to a specific location under your ColdFusion install and you have to follow this up with a server restart. JavaLoader is a much simpler way to handle this. It also helps with another problem - class conflicts. ColdFusion itself makes use of POI (you can see a few POI jar files in the lib folder) but the version bundled is older than what we need to use to extra our text. By using JavaLoader, we can make sure everything comes directly from the JARs in play. I downloaded JavaLoader and extracted the code to a folder named javaloader under my web root.
Alright, so let's look at some code. The first thing I want to do is get JavaLoader initialized with the JAR files from POI. When you initialize JavaLoader you pass in an array of JAR files. Since I put everything in a folder, I wrote some code to simply iterate over the directory and create a list of all the JARs. I then pass this to JavaLoader:
<!--- where the poi files are --->
<cfset jarpath = expandPath("./jars")>
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
<cfloop query="files">
<cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>
<!--- load javaloader --->
<cfset variables.loader = createObject("component", "javaloader.JavaLoader").init(paths)>
Now I'm going to create a set of Java classes. This will be used by the extraction code and comes from my reading of the POI docs. Each extraction will typically involve one Java object for the file itself and one for the extractor:
<!--- generic file reader doohicky --->
<cfset myfile = createObject("java","java.io.FileInputStream")>
<!--- get our required things loaded --->
<!--- Word --->
<cfset doc = loader.create("org.apache.poi.hwpf.HWPFDocument")>
<cfset wordext = loader.create("org.apache.poi.hwpf.extractor.WordExtractor")>
<!--- Excel --->
<cfset excel = loader.create("org.apache.poi.hssf.usermodel.HSSFWorkbook")>
<cfset xlsext = loader.create("org.apache.poi.hssf.extractor.ExcelExtractor")>
<!--- Powerpoint --->
<cfset ppt = loader.create("org.apache.poi.hslf.HSLFSlideShow")>
<cfset pptext = loader.create("org.apache.poi.hslf.extractor.PowerPointExtractor")>
Ok, now to get parsing. I created a new folder called testdocs and dumped a few Word, Excel, and Powerpoint files inside. Our code will now loop over each file and attempt to extract it based on the extension:
<!--- get files --->
<cfset filePath = expandPath("./testdocs")>
<cfdirectory action="list" name="files" directory="#filePath#">
<cfoutput query="files">
<cfset theFile = filePath & "/" & name>
<cfset myfile.init(theFile)>
Reading: #theFile#<br/>
<cfswitch expression="#listLast(name,".")#">
<cfcase value="doc">
<cfset doc = doc.init(myfile)>
<cfset wordext.init(doc)>
<cfoutput>
<pre>
#wordext.getText()#
</pre>
</cfoutput>
</cfcase>
<cfcase value="xls">
<cfset excel = excel.init(myfile)>
<cfset xlsext = xlsext.init(excel)>
<cfoutput>
<pre>
#xlsext.getText()#
</pre>
</cfoutput>
</cfcase>
<cfcase value="ppt">
<cfset ppt = ppt.init(myfile)>
<cfset pptext = pptext.init(ppt)>
<cfoutput>
<pre>
#pptext.getText(true,true)#
</pre>
</cfoutput>
</cfcase>
</cfswitch>
<p><hr/></p>
</cfoutput>
As before we use cfdirectory and then we loop over each file. Notice that I'm initializing a FileInputStream object, myfile, which will be used to seed the specific object representations of the Office documents.
For each file we check the extension, and based on the extension, use the appropriate extractor object to get the text.
That's it! If you download the zip attached to this blog entry and then run the code, you will see that the code does a reasonable good job of representing the text. For example, the text from the Excel document is laid out much like how you see it visually in MS Excel.
Sample output from an Excel sheet
Once you have the text you can store it in a database, email it, or do anything else you would like with it.
Pretty simple and powerful! Also note that you can get other data then just the text. Each extractor has it's own options and each Office object itself can do it's own unique thing. For example, with the Powerpoint support you can choose to get the text of slides or the notes, or both. Check the docs at Apache for more information.
How are people using POI in the wild? Ben Nadel has his POI Utility which enables for reading/writing Excel docs. Todd Sharp used to use it for SlideSix but has since moved on to using OpenOffice.
Archived Comments
So folks know that I have nothing against POI, the only reason I moved away from using it with SlideSix is because the PPT support is still far from being as mature as the DOC and XLS stuff. As far as I know it is still in the 'scratchpad' area of the project. I actually really liked using it at first, it just didn't grow along with the needs of the site.
Good post Ray!
This is very cool stuff ...
I see envision lot's of opportunities with this ...
Great post Ray ... :)
FWIW, that's not (or hasn't been) just a problem with POI. PowerPoint has always been the forgotten child in the Office family. In my former life as a VB developer, I found the PowerPoint object model to be vastly underdeveloped through 2000. (I do much less work in VBA now and haven't poked around with the object models in 2003 and 2007.) We could run with Access, Excel, and Word, but PowerPoint was much more difficult, and Outlook was nearly out of the question (and of course 2000 brought in a whole other set of problems, the Object Model Guard and all that).
So it may simply be that there isn't that much for POI to hook into.
@Dave - interesting. You may have a point.
Nice post. Like I told you on twitter, I'm using POI to extract text in word fiel and check the language http://cyrilhanquez.com/blo....
Now I'm trying to switch to Tika (http://lucene.apache.org/ti... to support more types of document.
Awesome post Ray! Wish I had known this stuff earlier :)
> For OOXML you can use POI 3.5, which is in beta.
> I had trouble playing with this
Just curious, but what problems did you have? I just started using the beta this week and have had pretty good results so far. Though, it does require more jars than previous versions.
But I am very glad you posted this - because it made discover there is a handy factory class for extracting text. You can pass in a file and the factory returns the correct type of extractor. I have not tried it with PPT, but it seems to be working well with excel and word.
ExtractorFactory = javaLoader.create("org.apache.poi.extractor.ExtractorFactory");
inputFile = createObject("java", "java.io.File").init( "c:\myFiles\testExcel2007.xlsx");
excelExt = ExtractorFactory.createExtractor(inputFile);
WriteOutput("excelExt text=<hr>"& excelExt.getText() &"<hr>");
Leigh
> it made ^me^ discover
Darn fat fingers.
Let me try that Leigh. I had Java issues with the input specifically, but my code was different from yours.
Well kick butt Leigh. Thank you. I was definitely doing it wrong. I'll follow up later today with an update. I also have another update showing getting some basic metadata.
I wrote a new version and the code is a LOT slimmer. I'm now digging to see if there is an easy way to get the Doc object so I can perform metadata stuff on the code. Getting text works fine, but I'd also like to get title, etc. But if you just want text, this is a LOT slimmer.
So my next step is to delete your comment, ban your IP, and take credit! Muahahahahahahahahhahahahaha!
(Sorry! ;)
Cool stuff!
Reading Office 2007 files is very usefull. Especially when Verity does not handle docx and xlsx files.
Now I can read the text from the Office 2007 files and index the text with CFINDEX.
- Mikkel
Interesting. So the DOCX Extractor allows for getDocument and the DOC Extractor allows for getSummaryInfo. So to get the MD it would be slightly complex, but certainly doable.
Mikkel - interesting point there. Didn't know Verity wasn't updated for Office 2007.
So question: Does it make sense for me to build a generic reader CFC for Office files. Something that will get text and get MD properties? I don't want to duplicate Ben's work, but his is focused on Excel only. Would a CFC to just read various office formats be worth the time/effort? Should I ping Ben about possibly expanding his POIUtils to cover more Office formats?
Thanks for this post. It's an important first step for me. What are other people using to convert word documents to PDF via coldfusion?
Ray - the CF8 documentation has a fine list of wich filetypes Verity supports (http://livedocs.adobe.com/c...
And to me it is out-dated. Latest Word format is Word 2002, PDF version is 1.4 (acrobat 5).
If Adobe wants to use Verity as a sales parameter. Then they have to update the supported file types.
Mark, it would be trivial to convert the Word text to PDF using CFDOCUMENT. The issue would be layout. getText() returns just the text. It "kinda" looks the same, spacing wise, ie, paragraphs are vertically separated, but you won't get headers and stuff like that.
You could wrap it in <pre> and put that in cfdocument to get started.
Mikkel: Ah ok. Well, I'm a big fan of Lucene now. One thing I forgot to mention in the blog post is that I use code like this for Seeker (http://seeker.riaforge.org), my Lucene wrapper. I need to update it (when I have time) to support Powerpoint though.
You could get funky with POI to extract everything (text, layout, imgs, etc) and 're-construct' it as HTML to make a PDF. But I'm sure it would get ugly real quick (code-wise).
You could also use OpenOffice to do it, but that requires OpenOffice running as a service on the server.
I've been meaning to blog on that topic for about 6 months now. Maybe some day...
My last comment was directed @Mark
Thanks Todd,
I would be very interested in the OpenOffice method.
I am also interested in how you are using OpenOffice as a service on the server.
Very much so.
It should be noted that although the folks creating the POI jars have done amazing work, they don't necessarily condone MS Office. The "H" in HWPF, HSSF, and HSLF stands for "Horrible". So you have Horrible Word Processor Format, Horrible Spread Sheet Format and Horrible Slide Layout format.
Heh yeah, good point, although I think they were mainly being silly. ;) Of course, I've felt that Google has been horrible from time to time w/ their API. ("Hey, let's make this API really complex just because we want to!")
Now wouldn't it be nice if the upcoming CF 9 had the feature to send a Word document to PDF? That's what I want for Xmas.
@Raymond,
> I also have another update showing getting some basic metadata.
Cool. I look forward it. I have not gotten around to exploring metadata with the newer format yet.
PS: My response was delayed because I think my ip was banned (okay, I forgot to subscribe .. but being banned makes me sounds dangerous, like an outlaw ;-)
I posted part 2 today using your code. Thank you again. I did NOT take credit though. ;)
Leigh: I notice your blog has -0- info about you. Name, job, etc. Is that intentional? If not, would you mind introducing yourself?
Yes, it is intentional ;-)
This article helped a lot to get me started in reading Word documents with CF/POIFS. But, I also need to write them.
I am trying to use a Word document as a template, read it in using POIFS, modify some content, and create a new Word document WITH the styles from the original document.
Can you clue me in as to how to do this or where there is good example code for ColdFusion/POIFS Word document creation?
Thanks!
I can't, maybe someone else on the thread can. The closest I've done to this is working with RTF files.
@Ray: Leigh is the ghost in the machine or <cf_heWhoShallNotBeNamed>. I tried previously to get him involved in the alpha/beta, but no go. I can only assume he's a rock star moonlighting as a CF Developer.
@Other Todd -- how do you know Leigh is a he? I know females named Leigh too. Just sayin...
@Other Todd: I don't, but at this point I'm using 'he' as a generic gender since I don't believe Leigh is an 'it' - that being said, it's quite possible that Leigh is a nickname for skynet. But, you're right, I should have said he/she.
Hello,
I am trying to convert PPTX files to PPT files because CF doesn't like to convert PPTX files to a PDF. I am using the Open Office ODFConverter to convert my PPTX files to an ODP file before I convert it to a PDF. This works great.
When I try to use CFDocument to convert that ODP file to a PDF I get a bunch of gibberish on the page.
Do you know how to convert these ODP files to a PDF?
<cfdocument
format="pdf"
srcfile="#directory#\sourceFile.odp"
filename="#directory#\destinationFile.pdf"
overwrite="yes">
</cfdocument>
Thanks for the help
Are you using CF9?
Josh - The ODFConverter may work well on your test document, but trust me, it is far (*far*) from perfect and you'll eventually get some serious jibberish looking ODP's.... that being said if you really want to try converting them to PDF's you can check out JODConverter which should help you convert the ODP's to PDF. Technically it can also convert the PPTX to PDF, but again, it'll look like shit....trust me, I've tried it.
@Raymond... Yes, I am using CF9.
@Todd... I'll check out the JODConverter and see how it goes. Thank you for the suggestion.
I'm trying to extract a docx on Windows 7 and it just hangs on
<cfset docxExt = extractorFactory.createExtractor(fileIo)>
Any output after this tag doesn't show, I can confirm everything works until then, and doc files work fine.
fileIo is a fileInputStream, I've tried a java.io.file too, no difference.
After running my test script, the docx file is still locked since any attempts to change the filename returns an error saying Jrun still has it.
So basically it goes into this method and never returns, any ideas?
Very interesting. A regular ppt file is processed correctly, however, a slideshow ppt file runs into the following problem with Railo 3.x for Windows:
"
No matching Method/Function for org.apache.poi.hslf.HSLFSlideShow.INIT(java.io.FileInputStream) found
"
Line 66:
: <cfset ppt = ppt.init(myfile)>