Posted in ColdFusion | Posted on 02-06-2009 | 4,560 views
I've had a chance to play a bit more with the Apache POI project and I thought I'd share a bit of code demonstrating how to read Office document metadata. Unfortunately I was not able to get this working with Office 2007, but maybe I'll get lucky with reader Leah again!
Anyway, the code:
<cfset jarpath = expandPath("./jars2")>
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">

<cfloop query="files">
	<cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>

<!--- load javaloader --->
<cfset variables.loader = createObject("component", "javaloader.JavaLoader").init(paths)>
This is the exact same code I've used in my previous two blog entries so I won't explain it again.
<cfset myfile = createObject("java","java.io.FileInputStream").init(expandPath("./testdocs/Testing Reading Word Docs.doc"))>
Unlike my previous entries where I looped over a folder of documents, in this example I'm just using one specific Word document.
<cfset doc = loader.create("org.apache.poi.hwpf.HWPFDocument")>
<!--- init it with my java file input stream set to my test file --->
<cfset doc = doc.init(myfile)>
Next we create an instance of HWPFDocument. This is the specific class used for Word documents. You would use something different for PPT or Excel files. Once I create the class I pass in the file I specified earlier.
Ok, now for the fun part:
<cfset summary = doc.getSummaryInformation()>
This code retrieves a set of summary data from the document. This is another Java object itself with a set of methods to get, set, and remove document metadata. As an example:
<cfoutput>
Title=#summary.getTitle()#<br/>
Page Count=#summary.getPageCount()#<br/>
Word Count=#summary.getWordCount()#<br/>
Application=#summary.getApplicationName()#<br/>
Author=#summary.getAuthor()#<br/>
Comments=#summary.getComments()#<br/>
CreateDateTime=#summary.getCreateDateTime()#<br/>
Edit Time=#summary.getEditTime()#<br/>
Keywords=#summary.getKeywords()#<br/>
Last Author=#summary.getLastAuthor()#<br/>
Last Printed=#summary.getLastPrinted()#<br/>
Last SaveDateTime=#summary.getLastSaveDateTime()#<br/>
Revision Number=#summary.getRevNumber()#<br/>
Security=#summary.getSecurity()#<br/>
Subject=#summary.getSubject()#<br/>
Template=#summary.getTemplate()#<br/>
</cfoutput>
Pretty much all of those methods should be self-explanatory, but I'll point out some interesting ones. The getEditTime() function actually returns how long the document has been edited. I assume that is related to how long I have the document open in my application. Not sure how I'd use that but it's cool nonetheless. getSecurity returns an integer that defines what type of security the document has (duh), and is documented in the POI API docs. (I've copied the values to my test CFM file attached to this entry.) Another method that I didn't actually demonstrate above is getThumbnail(). This returns binary data for a thumbnail. The data is either in WMF or BMP format. CF can work with BMP, but my test document must have had a WMF thumbnail. I was able to save the bits to the file system but wasn't able to actually do anything with it. My Mac wanted to use Adobe Illustrator to view it, but AI complained that the file wasn't valid. If we can get that working, it would be cool!
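To make the getSecurity() integer a little more readable, here is a minimal sketch in plain Java that turns the code into labels. The values (0, 1, 2, 4, 8) are the ones listed in the POI javadoc for SummaryInformation.getSecurity() as best recalled here, so check them against your POI version. Treating them as combinable bit flags is an assumption, and the class and method names (SecurityFlags, describe) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SecurityFlags {

    // Translate a getSecurity() code into human-readable labels.
    // Assumption: the non-zero values can be combined as bit flags.
    static List<String> describe(int code) {
        List<String> labels = new ArrayList<>();
        if (code == 0) {
            labels.add("none");
            return labels;
        }
        if ((code & 1) != 0) labels.add("password protected");
        if ((code & 2) != 0) labels.add("read-only recommended");
        if ((code & 4) != 0) labels.add("read-only enforced");
        if ((code & 8) != 0) labels.add("locked for annotations");
        if (labels.isEmpty()) {
            // A value outside the documented set; report it rather than guess.
            labels.add("unknown (" + code + ")");
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(describe(0));
        System.out.println(describe(6));
    }
}
```

From ColdFusion you would pass summary.getSecurity() to something like this, or port the same handful of bitAnd() checks into a small UDF.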
So how hard is it to update the metadata?
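<!--- write the Word doc back out to the same file --->
<cfset myfile2 = createObject("java","java.io.FileOutputStream").init(expandPath("./testdocs/Testing Reading Word Docs.doc"))>
<cfset doc.write(myfile2)>
<!--- close the stream so the file handle is released --->
<cfset myfile2.close()>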
I set a new, random title. I then created a FileOutputStream using the same file name and simply ran the write method of the doc object. This works because the summary object points back to the original document, so even though I only modified summary, the doc object was updated as well.
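The reason the write picks up the new title is ordinary Java reference semantics: the getter hands back the same object the document holds, not a copy. Here is a minimal sketch of that idea. Doc and Summary are hypothetical stand-ins invented for this example; they are not POI classes and say nothing about how POI is implemented internally.

```java
public class SharedReferenceDemo {

    // Hypothetical stand-in for a summary/metadata object.
    static class Summary {
        private String title;
        Summary(String title) { this.title = title; }
        String getTitle() { return title; }
        void setTitle(String title) { this.title = title; }
    }

    // Hypothetical stand-in for a document that owns a Summary.
    static class Doc {
        private final Summary summary = new Summary("Old title");

        // Returns the very same object the document holds, not a copy.
        Summary getSummary() { return summary; }

        // "Writing" the document reads the title from the held object.
        String write() { return "title=" + summary.getTitle(); }
    }

    public static void main(String[] args) {
        Doc doc = new Doc();
        Summary summary = doc.getSummary();
        summary.setTitle("New title");    // only the summary is touched...
        System.out.println(doc.write());  // ...yet the doc sees the change
    }
}
```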
Pretty simple I think. One could wrap this up into a nice CFC and make it even simpler of course.


The OOXML format has different property types: core, extended and custom. So to determine the file format, I used Java's isInstance method. CF's IsInstanceOf function did not seem to work with the objects I created using the JavaLoader.
ExtractorFactory = javaLoader.create("org.apache.poi.extractor.ExtractorFactory");
inputFile = createObject("java", "java.io.File").init( "c:\myFiles\testExcel2007.xlsx" );
extractor = ExtractorFactory.createExtractor( inputFile );
// Determine the format
POIXMLTextExtractor = javaLoader.create("org.apache.poi.POIXMLTextExtractor");
isFormatOOXML = POIXMLTextExtractor.getClass().isInstance( extractor );
if (isFormatOOXML)
{
    // OOXML: extract core properties (author, title, etcetera...)
    coreProp = extractor.getCoreProperties().getUnderlyingProperties();
    WriteOutput("Creator = " & coreProp.getCreatorProperty().getValue() & "<br/>");
    WriteOutput("Title = " & coreProp.getTitleProperty().getValue() & "<br/>");
    // ...
}
else
{
    // Binary (OLE2) format: use the summary information instead
    summary = extractor.getSummaryInformation();
    WriteOutput("Title=" & summary.getTitle() & "<br/>");
    WriteOutput("Author=" & summary.getAuthor() & "<br/>");
    // ....
}
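For anyone unfamiliar with it, Class.isInstance is the runtime counterpart of Java's instanceof operator. It is handy when the type is only available as a Class object, which is the situation with classes loaded through a separate class loader. A minimal sketch using JDK classes only (the IsInstanceDemo class and its isA helper are made up for this example):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class IsInstanceDemo {

    // Runtime type check: true when candidate is non-null and assignable to type.
    static boolean isA(Class<?> type, Object candidate) {
        return type.isInstance(candidate);
    }

    public static void main(String[] args) {
        Object stream = new ByteArrayInputStream(new byte[0]);

        // A ByteArrayInputStream is a kind of InputStream.
        System.out.println(isA(InputStream.class, stream));

        // A String is not an InputStream, and null is never an instance of anything.
        System.out.println(isA(InputStream.class, "plain text"));
        System.out.println(isA(InputStream.class, null));
    }
}
```

One thing to keep in mind: two classes with the same name loaded by different class loaders are different types as far as isInstance is concerned, so the Class object and the object being tested should come from the same loader.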
<!--- where the poi files are --->
<cfset jarpath = ABSPath & "\jars">
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
<!--- Get all the jar files to load --->
<cfloop query="files">
<cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>
<!--- Load javaloader --->
<cfset server.loader = createObject("component", "#CFHome#/components.javaloader.JavaLoader").init(paths)>
<!--- Generic file reader --->
<cfset myfile = createObject("java","java.io.File")>
<cfset myfile.init("#UploadPath##currFileName#")>
<!--- Init the extractor factory --->
<cfset extractorFactory = server.loader.create("org.apache.poi.extractor.ExtractorFactory")>
<!--- Create Extractor --->
<cfset extractor = extractorFactory.createExtractor(myFile)>
<!--- Get Summary Info
<cfset summary = extractor.getSummaryInformation()>
<cfoutput><pre>#summary.getPageCount()#</pre></cfoutput>
--->
<!--- Get our page count --->
<cfset PagesFound = REMatch(Session.KeyPhrase,extractor.getText())>
<cfset PageCounter = ArrayLen(PagesFound)>
NOTE: I added the lines below to see whether releasing the objects would let go of the file, but it made no difference.
<!--- Close the file? --->
<cfset extractor = "">
<cfset extractorFactory = "">
<cfset myfile = "">
...
<cfset fis = createObject("java", "java.io.FileInputStream").init(myFile)>
<cfset extractorFactory = server.loader.create("org.apache.poi.extractor.ExtractorFactory")>
<cfset extractor = extractorFactory.createExtractor(fis)>
...etcetera ..
<cfset fis.close()>
Thanks for your help! Changing to the FileInputStream did the trick and it was able to purge the file after reading it.
Thanks again!
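The pattern that resolved this thread, in plain Java terms, is: open the stream yourself, finish reading, close it, and only then remove the file. Below is a minimal sketch using only the JDK. The class and method names (CloseThenDelete, writeTemp, readThenDelete) and the temp file are invented for illustration, and no POI calls are involved.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CloseThenDelete {

    // Create a throwaway file standing in for an uploaded document.
    static Path writeTemp(String content) {
        try {
            Path tmp = Files.createTempFile("upload-", ".tmp");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the whole file, close the stream, then delete the file.
    // Returns the number of bytes read.
    static int readThenDelete(Path file) {
        int total = 0;
        try {
            // try-with-resources closes the stream even if reading fails
            try (InputStream in = new FileInputStream(file.toFile())) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
            }
            // The handle is released by now, so the delete is not blocked by us.
            Files.delete(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        Path tmp = writeTemp("placeholder");
        int bytes = readThenDelete(tmp);
        System.out.println(bytes + " bytes read; file still exists: " + Files.exists(tmp));
    }
}
```

In CFML the equivalent is the explicit fis.close() call, ideally inside a try/finally so the stream is closed even when the extractor throws.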
<cfset myfile = createObject("java","java.io.FileInputStream").init(expandPath("UPLOAD_DOCS/#DOCMT_URL_TXT#"))>
<cfset extractorFactory = createObject("java","org.apache.poi.extractor.ExtractorFactory")>
<cfset extractor = extractorFactory.createExtractor(myFile)>
<cfset summary = extractor.getSummaryInformation()>
<cfoutput><pre>#summary.getCreateDateTime()#</pre></cfoutput>
Not quite, but they are close enough (3.5-beta). The metadata is slightly different for binary and ooxml files. Take a look at the first comment above. It utilizes different properties for the two formats.
For ooxml documents, try using the extractor's core properties.
i.e.
core = extractor.getCoreProperties();
created = core.getCreated();
title = core.getTitle();
etcetera ...
HTH
-Leigh
Thanks in advance
http://pastebin.com/bpff6cmY
http://www.rbrasier.com/2010/08/adding-custom-prop...