Working with Office Metadata

I've had a chance to play a bit more with the Apache POI project and I thought I'd share a bit of code demonstrating how to read Office document metadata. Unfortunately I was not able to get this working with Office 2007, but maybe I'll get lucky with reader Leigh again!

Anyway, the code:

<!--- where the poi files are --->
<cfset jarpath = expandPath("./jars2")>
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">

<cfloop query="files">
<cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>

<!--- load javaloader --->
<cfset variables.loader = createObject("component", "javaloader.JavaLoader").init(paths)>

This is the exact same code I've used in my previous two blog entries so I won't explain it again.

<!--- read in my Word doc --->
<cfset myfile = createObject("java","java.io.FileInputStream").init(expandPath("./testdocs/Testing Reading Word Docs.doc"))>

Unlike my previous entries where I looped over a folder of documents, in this example I'm just using one specific Word document.

<!--- Word Support --->
<cfset doc = loader.create("org.apache.poi.hwpf.HWPFDocument")>
<!--- init it with my java file input stream set to my test file --->
<cfset doc = doc.init(myfile)>

Next we create an instance of HWPFDocument. This is the specific class used for Word documents. You would use something different for PPT or Excel files. Once I create the class I pass in the file I specified earlier.
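As a rough sketch of what "something different" could look like for Excel, the same pattern should work with HSSFWorkbook, POI's class for binary .xls workbooks. This assumes the loader variable created above; the file name example.xls is made up for illustration.

```cfml
<!--- Excel support: same pattern, different POI class --->
<!--- example.xls is a hypothetical file name, not one from this post --->
<cfset xlsStream = createObject("java", "java.io.FileInputStream").init(expandPath("./testdocs/example.xls"))>
<cfset workbook = loader.create("org.apache.poi.hssf.usermodel.HSSFWorkbook").init(xlsStream)>
```

The PowerPoint class has moved between packages across POI releases, so check the javadoc for the version of the jars you loaded.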

Ok, now for the fun part:

<cfset summary = doc.getSummaryInformation()>

This code retrieves a set of summary data from the document. The result is itself another Java object, with methods to get, set, and remove document metadata. As an example:

<cfoutput>
Title=#summary.getTitle()#<br/>
Page Count=#summary.getPageCount()#<br/>
Word Count=#summary.getWordCount()#<br/>
Application=#summary.getApplicationName()#<br/>
Author=#summary.getAuthor()#<br/>
Comments=#summary.getComments()#<br/>
CreateDateTime=#summary.getCreateDateTime()#<br/>
Edit Time=#summary.getEditTime()#<br/>
Keywords=#summary.getKeywords()#<br/>
Last Author=#summary.getLastAuthor()#<br/>
Last Printed=#summary.getLastPrinted()#<br/>
Last SaveDateTime=#summary.getLastSaveDateTime()#<br/>
Revision Number=#summary.getRevNumber()#<br/>
Security=#summary.getSecurity()#<br/>
Subject=#summary.getSubject()#<br/>
Template=#summary.getTemplate()#<br/>
</cfoutput>

Pretty much all of those methods should be self-explanatory, but I'll point out some interesting ones. The getEditTime() function actually returns how long the document has been edited. I assume that is related to how long I have the document open in my application. Not sure how I'd use that but it's cool nonetheless. getSecurity returns an integer that defines what type of security the document has (duh), and is documented in the POI API docs. (I've copied the values to my test CFM file attached to this entry.)

Another method that I didn't actually demonstrate above is getThumbnail(). This returns binary data for a thumbnail. The data is either in WMF or BMP format. CF can work with BMP, but my test document must have had a WMF thumbnail. I was able to save the bits to the file system but wasn't able to actually do anything with it. My Mac wanted to use Adobe Illustrator to view it, but AI complained that the file wasn't valid. If we can get that working, it would be cool!
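Going back to getSecurity() for a moment, here is a minimal sketch of turning the integer into something readable. The code-to-label mapping is an assumption based on the values listed in the POI javadoc for SummaryInformation.getSecurity(), so verify it against the docs for your POI version. The securityLabels and securityCode names are just for illustration.

```cfml
<!--- Map the getSecurity() code to a label (values assumed from the POI javadoc) --->
<cfset securityLabels = {}>
<cfset securityLabels["0"] = "None (or no security field present)">
<cfset securityLabels["1"] = "Password protected">
<cfset securityLabels["2"] = "Read-only recommended">
<cfset securityLabels["4"] = "Read-only enforced">
<cfset securityLabels["8"] = "Locked for annotations">

<cfset securityCode = summary.getSecurity()>
<cfif structKeyExists(securityLabels, securityCode)>
    <cfoutput>Security=#securityLabels[securityCode]#<br/></cfoutput>
<cfelse>
    <!--- anything not in the list above is reported as-is --->
    <cfoutput>Security=Unknown code #securityCode#<br/></cfoutput>
</cfif>
```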

So how hard is it to update the metadata?

<cfset summary.setTitle("Ray changed this doc #randRange(1,100)#")>

<!--- write the Word doc back out --->
<cfset myfile2 = createObject("java","java.io.FileOutputStream").init(expandPath("./testdocs/Testing Reading Word Docs.doc"))>
<cfset doc.write(myfile2)>
<!--- close the stream so the file handle is released --->
<cfset myfile2.close()>

I set a new, random title, create a FileOutputStream using the same file name, and then simply run the write method of the doc object. This works because the summary object points back to the original document, so even though I modified summary, the doc object was updated as well.
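A slightly more defensive variation is to wrap the write in cftry/cffinally so the output stream gets closed even if write() throws. This is only a sketch: it assumes ColdFusion 9 or later (cffinally is not available before that) and reuses the doc variable from above. The outStream name is made up for illustration.

```cfml
<!--- open the output stream on the same file we read from --->
<cfset outStream = createObject("java", "java.io.FileOutputStream").init(expandPath("./testdocs/Testing Reading Word Docs.doc"))>
<cftry>
    <cfset doc.write(outStream)>
    <cfcatch type="any">
        <!--- nothing to handle here; pass the error along --->
        <cfrethrow>
    </cfcatch>
    <cffinally>
        <!--- always release the file handle --->
        <cfset outStream.close()>
    </cffinally>
</cftry>
```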

Pretty simple I think. One could wrap this up into a nice CFC and make it even simpler of course.

Download attached file.

About Raymond Camden

Raymond is a senior developer evangelist for Adobe. He focuses on document services, JavaScript, and enterprise cat demos. If you like this article, please consider visiting my Amazon Wishlist or donating via PayPal to show your support. You can even buy me a coffee!

Lafayette, LA https://www.raymondcamden.com

Archived Comments

Comment 1 by Leigh posted on 2/8/2009 at 5:50 AM

The shortest route I found was to use the ExtractorFactory. Since it can handle both formats, I used it as a generic shortcut to grab the underlying properties. (Unfortunately, I could not find a ReaderFactory.)

The ooxml format has different property types: core, extended and custom. So to determine the file format, I used java's isInstance method. CF's IsInstanceOf method did not seem to work with the objects I created using the javaLoader.

ExtractorFactory = javaLoader.create("org.apache.poi.extractor.ExtractorFactory");
inputFile = createObject("java", "java.io.File").init( "c:\myFiles\testExcel2007.xlsx" );
extractor = ExtractorFactory.createExtractor( inputFile );

// Determine the format
POIXMLTextExtractor = javaLoader.create("org.apache.poi.POIXMLTextExtractor");
isFormatOOXML = POIXMLTextExtractor.getClass().isInstance( extractor );

if (isFormatOOXML)
{
// extract core properties (author, title, etcetera...)
coreProp = extractor.getCoreProperties().getUnderlyingProperties();

WriteOutput("Creator = " & coreProp.getCreatorProperty().getValue() & "<br>");
WriteOutput("Title = " & coreProp.getTitleProperty().getValue() & "<br/>");
// ...
}
else
{
summary = extractor.getSummaryInformation();
WriteOutput("Title="& summary.getTitle() &"<br/>");
WriteOutput("Author="& summary.getAuthor() &"<br/>");
// ....
}

Comment 2 by Raymond Camden posted on 2/8/2009 at 7:31 AM

Nice, I tried like heck to figure out how to get props for 2007 and I just couldn't figure it out. Thank you!

Comment 3 by Leigh posted on 2/8/2009 at 9:48 PM

You are welcome. Sometimes the api only gets you so far... Eclipse's debug mode for java code often fills in the gaps. (It definitely did this time ;-)

Comment 4 by Raymond Camden posted on 2/8/2009 at 9:52 PM

So I asked this question on the earlier post. Does it make sense to turn this into a CFC? There is a metadata project at RIAForge, but I believe it hasn't been updated since 06. Also, I don't want to step on Ben Nadel's toes. His code does everything Excel related. But maybe a CFC that _just_ does text and metadata read/writes would be useful?

Comment 5 by Leigh posted on 2/8/2009 at 10:57 PM

No question there would be some overlap. But a separate cfc might be useful. I could see cases where you might want to extract the text or metadata from _any_ office document, not Excel specifically.

Comment 6 by Raymond Camden posted on 2/9/2009 at 5:18 PM

I've spoken with Ben and he agrees. I'll see what I can whip up. I assume I have full rights to steal I mean innovate from you, Leah? ;) I'll credit you by URL to help respect your privacy.

Comment 7 by Leigh posted on 2/9/2009 at 11:32 PM

Well, it is okay with me. But you will have to ask Leah too ;-)

Comment 8 by Anne posted on 4/26/2009 at 4:28 AM

Ray, thanks so much for the post. I'm trying to batch extract about 4,000 Word docs and POI is doing a great job until it hits a *.doc file that it determines is actually an RTF file. I know POI doesn't support RTF but I've run into a wall trying to determine the best way to ignore/possibly re-classify the file and continue processing. Any ideas?

Comment 9 by Raymond Camden posted on 4/26/2009 at 4:44 PM

Why not just try/catch the call?

Comment 10 by Marc posted on 8/3/2009 at 11:18 PM

I'm using the POI tools to successfully read the text in from a Word 2007 (docx) file, but for some reason when it completes reading the file, it will not release the document so that I can delete it. Even when I try to manually browse to the file, Windows tells me that it cannot delete the file because jrun.exe is still using it. My Word 2003 (doc) files do not exhibit this behaviour. Any clues? Some code is below:

<!--- where the poi files are --->
<cfset jarpath = ABSPath & "\jars">
<cfset paths = []>
<cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">

<!--- Get all the jar files to load --->
<cfloop query="files">
<cfset arrayAppend(paths, directory & "/" & name)>
</cfloop>

<!--- Load javaloader --->
<cfset server.loader = createObject("component", "#CFHome#/components.javaloader.JavaLoader").init(paths)>

<!--- Generic file reader --->
<cfset myfile = createObject("java","java.io.File")>
<cfset myfile.init("#UploadPath##currFileName#")>

<!--- Init the extractor factory --->
<cfset extractorFactory = server.loader.create("org.apache.poi.extractor.ExtractorFactory")>

<!--- Create Extractor --->
<cfset extractor = extractorFactory.createExtractor(myFile)>

<!--- Get Summary Info
<cfset summary = extractor.getSummaryInformation()>
<cfoutput><pre>#summary.getPageCount()#</pre></cfoutput>
--->

<!--- Get our page count --->
<cfset PagesFound = REMatch(Session.KeyPhrase,extractor.getText())>
<cfset PageCounter = ArrayLen(PagesFound)>

NOTE: I added this to see whether releasing the objects would let the file go, but it made no difference.
<!--- Close the file? --->
<cfset extractor = "">
<cfset extractorFactory = "">
<cfset myfile = "">

Comment 11 by Raymond Camden posted on 8/3/2009 at 11:34 PM

Weird - I looked around for a method that would possibly close the connection, but I do not see one, and I see no reason why the extractor would even need to keep it open. Maybe it is an Office 2007 versus earlier issue?

Comment 12 by Leigh posted on 8/4/2009 at 1:09 AM

It seems like it has something to do with the underlying java code. It creates a PushbackInputStream to read the first few bytes of the file to determine if it is binary or ooxml. Try using the createExtractor() method that accepts an InputStream:

...
<cfset fis = createObject("java", "java.io.FileInputStream").init(myFile)>
<cfset extractorFactory = server.loader.create("org.apache.poi.extractor.ExtractorFactory")>
<cfset extractor = extractorFactory.createExtractor(fis)>
...etcetera ..
<cfset fis.close()>

Comment 13 by Marc posted on 8/4/2009 at 3:40 AM

Ray/Leigh,

Thanks for your help! Changing to the FileInputStream did the trick and it was able to purge the file after reading it.

Thanks again!

Comment 14 by Tad posted on 5/18/2010 at 10:59 PM

Hey there - great post, and hoping this is the answer to what I'm looking to do. Trying to extract summary information from DOCX files on CF9, using (what I think are) the latest POI jars. Getting "The getSummaryInformation method was not found" error when trying to get summary info. I'm specifically trying to extract the create date/time from the docs. Any insight? I'm currently saying:

<cfset myfile = createObject("java","java.io.FileInputStream").init(expandPath("UPLOAD_DOCS/#DOCMT_URL_TXT#"))>
<cfset extractorFactory = createObject("java","org.apache.poi.extractor.ExtractorFactory")>
<cfset extractor = extractorFactory.createExtractor(myFile)>
<cfset summary = extractor.getSummaryInformation()>
<cfoutput><pre>#summary.getCreateDateTime()#</pre></cfoutput>

Comment 15 by Leigh posted on 5/19/2010 at 2:13 AM

> using (what I think are) the latest POI jars.

Not quite, but they are close enough (3.5-beta). The metadata is slightly different for binary and ooxml files. Take a look at the first comment above. It utilizes different properties for the two formats.

For ooxml documents, try using the extractor's core properties.

ie
core = extractor.getCoreProperties();
created = core.getCreated();
title = core.getTitle();
etcetera ...

HTH
-Leigh

Comment 16 by Richard Brasier posted on 7/6/2010 at 2:48 AM

I'm having trouble setting custom metadata on a Word document using the ExtractorFactory method, after not much luck with the. I thought the following would work, and it's not throwing any errors, but I can't help thinking I'm missing something. Anyone have any ideas?

Thanks in advance

http://pastebin.com/bpff6cmY

Comment 17 by Richard Brasier posted on 8/23/2010 at 8:23 AM

In case anyone is wondering. I have managed to solve my earlier problem. I posted a solution on my blog:

http://www.rbrasier.com/201...

Comment 18 by Raymond Camden posted on 8/23/2010 at 3:22 PM

Very cool - thanks for sharing your solution.

Comment 19 by BeekerMD03 posted on 2/25/2011 at 8:11 PM

Anyone find a solution to updating properties for Office 07 documents yet?

Comment 20 by Raz posted on 9/5/2012 at 8:15 AM

I've read about org.apache.poi.xwpf.usermodel.XWPFDocument and tried to implement it, but I'm having trouble initializing it.

<cfset javaloader = createObject("component", "home.cfcs.javaloader.JavaLoader").init(paths)>
<cfset fis = createObject("java","java.io.FileInputStream")>
<cfset theFile = fileQuery.directory & "/" & fileQuery.name>
<cfset fis.init(theFile)>
<cfset docx = javaloader.create("org.apache.poi.xwpf.usermodel.XWPFDocument").init(fis)>

the last code gives me an "Object Instantiation Exception" error

Comment 21 by Raymond Camden posted on 9/5/2012 at 2:47 PM

The last code - so you mean the last line?

Comment 22 by Raz posted on 9/5/2012 at 8:02 PM

yes, the last line. I also tried
<cfset docx = createObject("java","org.apache.poi.xwpf.usermodel.XWPFDocument").init(createObject("java","java.io.FileInputStream").init(theFileJava))>

but it gives me an "Unable to find a constructor for class org.apache.poi.xssf.usermodel.XSSFWorkbook that accepts parameters of type ( java.io.FileInputStream )." I'm stuck on this because I know it should accept the FileInputStream because it's a child of InputStream. I'm using CF9, not sure if it matters.

Thanks in advance.

Comment 23 by Leigh posted on 9/5/2012 at 8:25 PM

Did you verify you are passing in a valid file path ie FileExists(theFile)?

> I know it should accept the FileInputStream because it's a child of InputStream

True, but it was created by a different class loader ie createObject versus javaLoader which can cause problems in some cases. Try creating both objects with the javaLoader.

Comment 24 by Raz posted on 9/6/2012 at 8:55 PM

Yes, it's a valid file.

I tried this:
<cfset docx = javaloader.create("org.apache.poi.xwpf.usermodel.XWPFDocument").init(fis)>

but it's giving me an "Object Instantiation Exception"

Comment 25 by Leigh posted on 9/6/2012 at 9:03 PM

That is just a generic error you get whenever something goes wrong with a java object. Look in the stack trace. That is where you will find the real exception.

Comment 26 by Raz posted on 9/7/2012 at 7:31 AM

Thank you so much for a very prompt reply. I do appreciate it.

I modified the code a bit to make it clear:

<cfscript>
theFileCF = fileQuery.directory & "\" & fileQuery.name;
docx = createObject("java","org.apache.poi.xwpf.usermodel.XWPFDocument").init(createObject("java","java.io.FileInputStream").init(theFileCF));
</cfscript>

this still gives me an "Unable to find a constructor for class org.apache.poi.xwpf.usermodel.XWPFDocument that accepts parameters of type ( java.io.FileInputStream ). "

tried to use the javaloader:
<cfscript>
theFileCF = fileQuery.directory & "\" & fileQuery.name;
docx = javaloader.create("org.apache.poi.xwpf.usermodel.XWPFDocument").init(javaloader.create("java.io.FileInputStream").init(theFileCF));
</cfscript>

but it gives me an error "An exception occurred while instantiating a Java object. The class must not be an interface or an abstract class."

Comment 27 by Raz posted on 9/7/2012 at 7:33 AM

I use the poi jars from poi-bin-3.8-20120326

Comment 28 by Leigh posted on 9/7/2012 at 7:49 AM

Forget about the "..class must not be an interface or an abstract class" error. It is usually meaningless. Just a throwaway header to say "oops, something went wrong". To find out what that "something" is, you have to review the stack trace.

http://help.adobe.com/en_US...

Comment 29 by Raz posted on 9/7/2012 at 11:09 AM

here's the stack trace:

Stack Trace
at cffilename_search_action2ecfm454045983.runPage(D:\Websites\Home\Employee Services\Policies & Procedures\filename_search_action.cfm:153)

coldfusion.runtime.java.JavaProxy$NoSuchConstructorException: Unable to find a constructor for class org.apache.poi.xwpf.usermodel.XWPFDocument that accepts parameters of type ( java.io.FileInputStream ).
at coldfusion.runtime.java.JavaProxy.CreateObject(JavaProxy.java:178)
at coldfusion.runtime.java.JavaProxy.invoke(JavaProxy.java:80)
at coldfusion.runtime.CfJspPage._invoke(CfJspPage.java:2360)
at cffilename_search_action2ecfm454045983.runPage(D:\Websites\Home\Employee Services\Policies & Procedures\filename_search_action.cfm:153)
at coldfusion.runtime.CfJspPage.invoke(CfJspPage.java:231)
at coldfusion.tagext.lang.IncludeTag.doStartTag(IncludeTag.java:416)
at coldfusion.filter.CfincludeFilter.invoke(CfincludeFilter.java:65)
at coldfusion.filter.ApplicationFilter.invoke(ApplicationFilter.java:363)
at coldfusion.filter.RequestMonitorFilter.invoke(RequestMonitorFilter.java:48)
at coldfusion.filter.MonitoringFilter.invoke(MonitoringFilter.java:40)
at coldfusion.filter.PathFilter.invoke(PathFilter.java:87)
at coldfusion.filter.LicenseFilter.invoke(LicenseFilter.java:27)
at coldfusion.filter.ExceptionFilter.invoke(ExceptionFilter.java:70)

Comment 30 by Leigh posted on 9/7/2012 at 7:52 PM

No, sorry I meant the stack trace from the other example. (If the full trace is too long for comments, use pastebin).

Comment 31 by Raz posted on 9/8/2012 at 11:02 PM

Comment 32 by Leigh posted on 9/9/2012 at 1:40 AM

Perfect, thanks. When you create the javaLoader, try setting the loadColdFusionClassPath property to true:

javaLoader.init(loadPaths=paths, loadColdFusionClassPath=true);

Comment 33 by Raz posted on 9/9/2012 at 7:18 PM

Hello, I tried the code but I'm still getting the same error

here's my code http://pastebin.com/RReHJ77m

and here's the stack trace http://pastebin.com/24zXEcQQ

Thanks in advance for your patience.

Comment 34 by Leigh posted on 9/10/2012 at 1:03 AM

Basically the problem is the dom4j jar which is used by POI and CF. It is notorious for causing problems when multiple class loaders are used. In short, omit that jar and use the loadColdFusionClassPath=true and it should work. See example here: http://pastebin.com/iEafTZqq

Though if all you need is the properties, it is simpler to use the ExtractorFactory. Then you will not need separate code for the different types of POI files. There is an example above in the initial comments.

Comment 35 by Raz posted on 9/10/2012 at 7:06 AM

Thank you so much, Leigh!!! It worked!

I omitted the dom4j jar and used loadPaths=paths, loadColdFusionClassPath=true, and ExtractorFactory.

Comment 36 by Leigh posted on 9/10/2012 at 7:25 AM

Great! Glad to hear it is all sorted out :)