January 26, 2010

Some criticisms on Solr in ColdFusion 9

coldfusion

Recently Ryan Stille (one of the new ColdFusion ACPs) posted a comment on my blog entry, Some Basic Solr/Verity Differences. In that comment he pointed out that he was noticing differences in results returned by Verity and Solr. No big surprise there - but what was surprising was the lack of data returned by Solr. Spurred on by his comment I did some testing of my own and I have to say - I'm pretty disappointed. What follows are some findings from testing file based collections in Solr and Verity. I'll point out that all of this has been brought to Adobe, so I'm not just complaining but actively trying to get the problem fixed for ColdFusion 9.X.X (i.e., whatever comes next).

Before going further, please be sure you note the qualification I made above. These issues refer to file based collections of data. In other words, cases where you ask Verity/Solr to index files, like Word Docs, PDFs, and other binary formats. It does not refer to a collection that is built from your database.

For my testing I used Windows XP and a folder of 8 documents. This folder included 1 MP3, 4 PDFs, 2 Word docs, and one text file. I indexed the folder into both a Verity collection and a Solr collection using the ColdFusion Administrator. My tests were done using CF Admin Searcher, a ColdFusion Admin extension that lets you perform ad hoc queries against Verity and Solr collections. I basically opened the tool up in two tabs and performed the same search in both to do my comparisons. Here is what I found:
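The same side-by-side comparison can be run from code instead of the Admin extension. This is only a rough sketch: the collection names `docs_verity` and `docs_solr` and the search term are hypothetical, and both collections are assumed to already index the same folder.

```cfml
<!--- Hypothetical collection names; both are assumed to index the same folder. --->
<cfset criteria = "coldfusion">

<!--- Run the identical search against each engine. --->
<cfsearch collection="docs_verity" criteria="#criteria#" name="verityResults" status="verityStatus">
<cfsearch collection="docs_solr" criteria="#criteria#" name="solrResults" status="solrStatus">

<!--- Dump results and status so title, summary, context, and type can be compared. --->
<cfdump var="#verityResults#" label="Verity results">
<cfdump var="#verityStatus#" label="Verity status">
<cfdump var="#solrResults#" label="Solr results">
<cfdump var="#solrStatus#" label="Solr status">
```

The status struct should include the time the search took, which is where a rough speed comparison would come from.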

The Summary field in results for Solr contained binary "junk". Verity cleaned this up. Example:

That result came from an MP3, which you expect to be 'dirty', but Verity correctly clears this from the summary. I see these chars in PDF files as well. Word docs seemed fine though.

Solr failed to pick up the TITLE value for any binary file. Verity got them all. I also saw this in other metadata fields. For example, Solr missed the author field too.
Solr failed to return any context values for results while Verity had no trouble. You should note that you have to perform a file edit to enable context with Solr (details here: http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSe9cbe5cf462523a0-5bf1c839123792503fa-8000.html) but even with this change there was no additional text in my context.
While I wouldn't classify this as serious testing in any way, in every test, Verity searches were about twice as fast. Now we are talking about 15 versus 30 ms, which is not something to be concerned about, but Solr was supposed to be quite a bit faster. (To be fair, my test suites are so small as to not really be relevant.)
The TYPE value for results is correct for Verity, but comes back as application/octet-stream for Solr. You don't need this column of course, you can sniff the extension, but still...
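Two of the items above (the junk in the summary and the generic TYPE value) can be papered over after the search. The following is a minimal sketch, not a fix for the underlying issue. Both function names are hypothetical, the MIME map only covers the file types in the test folder, and the cleanup may also strip legitimate non-ASCII text.

```cfml
<cfscript>
// Hypothetical helper: drop anything that is not a printable character.
// Note that this may also remove legitimate non-ASCII text.
function cleanSummary(summary) {
    return trim(reReplace(arguments.summary, "[^[:print:]]", "", "all"));
}

// Hypothetical helper: guess a MIME type from the file extension.
// The map only covers the file types used in the test folder.
function guessType(filePath) {
    var types = {pdf="application/pdf", doc="application/msword", txt="text/plain", mp3="audio/mpeg"};
    var ext = lcase(listLast(arguments.filePath, "."));
    if (structKeyExists(types, ext)) {
        return types[ext];
    }
    // Fall back to the same generic value Solr returned.
    return "application/octet-stream";
}
</cfscript>

<cfoutput>
#guessType("C:\docs\report.pdf")#<br>
#htmlEditFormat(cleanSummary("Some summary text"))#
</cfoutput>
```

In practice you would call these on the `key` and `summary` columns of each row while looping over the search results.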

All in all, this is disappointing. I don't think I can recommend Solr (specifically, Solr as bundled in ColdFusion 9) for production use... at least for a collection that is heavily file based. You can, of course, do post-processing of search results to get metadata. ColdFusion 9 supports getting metadata from Word docs now, but you have to convert them to PDF first. That's something you would definitely want to cache though.
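As a rough illustration of that post-processing idea, the sketch below reads the title and author from each PDF result with `cfpdf` and caches the metadata in the application scope. The collection name `docs_solr`, the search term, and the cache variable `pdfInfoCache` are all hypothetical. It assumes the application scope is enabled and that `key` holds the full file path, and it skips Word docs entirely since those would need the PDF conversion step first.

```cfml
<!--- Assumes an Application.cfc/cfm that enables the application scope. --->
<cfif not structKeyExists(application, "pdfInfoCache")>
    <cfset application.pdfInfoCache = structNew()>
</cfif>

<!--- "docs_solr" is a hypothetical file based collection. --->
<cfsearch collection="docs_solr" criteria="coldfusion" name="results">

<cfloop query="results">
    <!--- For file based collections, key should hold the full path to the file. --->
    <cfif lcase(listLast(results.key, ".")) eq "pdf">
        <cfif not structKeyExists(application.pdfInfoCache, results.key)>
            <!--- Read the metadata once and keep it for later searches. --->
            <cfpdf action="getInfo" source="#results.key#" name="pdfInfo">
            <cfset application.pdfInfoCache[results.key] = pdfInfo>
        </cfif>
        <cfset meta = application.pdfInfoCache[results.key]>
        <cfoutput>#htmlEditFormat(meta.title)# (#htmlEditFormat(meta.author)#)<br></cfoutput>
    </cfif>
</cfloop>
```

A cache this simple never notices when a file changes on disk, so a real version would want to clear the entry whenever the collection is reindexed.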

As I said, both Ryan and I shared our findings with Adobe so I'm sure it will get corrected in the future. People making a decision about search support should consider carefully though. I don't think anyone thinks Verity will last much longer. Solr is definitely the future. But we've got a few bumps in the road to get past first.


Archived Comments

Comment 1 by Josh Curtiss posted on 1/27/2010 at 12:23 AM

Isn't it true that Verity doesn't support the new Office 2007 file formats? This was a deal-breaker for me. It made Solr wonderful for me compared to Verity despite other faults.

Comment 2 by Raymond Camden posted on 1/27/2010 at 12:26 AM

Fire me off an example and I'll add it to my test. Be sure to send me a snippet of text in case I can't open it.

Comment 3 by John Sieber posted on 1/27/2010 at 2:00 AM

Thanks for taking the time to test and post your findings. On one of my sites I have a large .pdf file based collection that I planned to move from verity to solr. I think I will now wait as these issues will cause headaches for me that I would rather not have right now.

Comment 4 by Raymond Camden posted on 1/27/2010 at 2:30 AM

@Josh - Thanks for the test file. As you expected, Verity did NOT parse the DOCX. Solr did. Like before though it missed all the metadata. It did, though, correctly match inside. Oh - one change - it DID do context this time.

Comment 5 by Shannon Hicks posted on 1/27/2010 at 4:02 AM

I will point out that most of the problems you've encountered are with Adobe's CF implementation of Solr, not with Solr itself. Ditching cfindex/cfsearch and using Solr's API directly ends up being far more useful, though nowhere near as easy.

I *really* need to clean up my cfsolr library and get that released :(

Comment 6 by Raymond Camden posted on 1/27/2010 at 4:12 AM

Why not just build off of my Seeker OS project? ;) Although it is Lucene, not Solr.

Comment 7 by Peter Harris posted on 1/27/2010 at 4:37 AM

Second what Shannon says, this is ColdFusion, not Solr.

We've gone the standalone solr route and couldn't be happier. We use Tika for binary text extraction and don't have the issues described above.

Comment 8 by Raymond Camden posted on 1/27/2010 at 4:50 AM

Good points on the "It isn't Solr's fault" thread. Do you guys think I don't make that clear enough? The comments should clear it up of course, but if it needs clarifying even more, I can add a note to the end of the entry.

Comment 9 by Todd Rafferty posted on 1/27/2010 at 5:40 PM

@Ray: I think when you read the blog and come across statements like "I don't think I can recommend Solr for production user" - it means you're blaming Solr. :)

Comment 10 by Raymond Camden posted on 1/27/2010 at 5:43 PM

Good point - I added a qualifier and I fixed the typo.

Comment 11 by Sid Maestre posted on 1/28/2010 at 7:24 PM

Is there any problem continuing to use Verity when upgrading to ColdFusion 9?

Comment 12 by Raymond Camden posted on 1/28/2010 at 7:28 PM

@Sid: I don't believe so. Again - note that Adobe says Solr is much faster. As I said above, I do not think my speed tests are very accurate. I think, most likely, you will find Solr faster. That being said, Verity should be "good enough". I do think Verity will be removed soon. But it's not like you will be surprised by this. If Verity is removed in CF10, you will have plenty of time to migrate.

Comment 13 by Aaron Longnion posted on 6/23/2010 at 5:57 PM

Hi @Shannon or @Peter (or anyone else) - do you have any documentation and/or CFC wrappers to share for a Standalone SOLR implementation with CF? We're on CF9, but have over 700,000 documents (think CV's and resumes), so we don't want to break old search functionality when upgrading from Verity to SOLR.

Comment 14 by Shannon Hicks posted on 6/23/2010 at 6:32 PM

@aaron

http://cfsolrlib.riaforge.org/

Comment 15 by alex posted on 7/16/2010 at 12:04 AM

It seems like some of your issues were addressed in 9.0.1 according to the doc: http://www.adobe.com/suppor...

I'd be interested in a followup post.

Comment 16 by Raymond Camden posted on 7/16/2010 at 12:04 AM

Yep, if you saw my post on the 901 release, I mentioned this specifically. A follow up is planned.

CF901 blog post: http://www.coldfusionjedi.c...

Comment 17 by Ed posted on 8/31/2010 at 12:12 AM

Is the lack of MS Office 2007 support in Verity new to CF9? We are currently on CF8 with a pretty large doc store 280k+ docs indexed and working fine with Word 2007 files. As one of the other comments said... think CV's indexed and search-able so I know I would hear about it if Word 2007 files did not show up. :-)

Comment 18 by Raymond Camden posted on 8/31/2010 at 12:14 AM

Basically Verity hasn't been updated in 100 years. (Well the embedded version.) I'd consider testing in Solr and only Solr.

Comment 19 by Bobbytuck posted on 6/21/2011 at 1:24 AM

I realize this is pretty old, but is there any easy way to include a PDF or Excel or Word file creation date in the initial solr CFINDEX? Is it possible to do this with a custom field?

Comment 20 by Raymond Camden posted on 6/21/2011 at 6:13 AM

If you create the index manually - sure - but then you don't get the auto text extraction. Actually wait - if you do a file/dir based index, and then follow it up with manual updates, then yes, that should work fine.

Comment 21 by Ronnie posted on 9/30/2011 at 6:47 PM

In reference to Ray's comments regarding the context field not displaying any additional information, I found the same situation. I altered the xml files as described but no new information showed up in the context field. Verity had useful information to display on search results page. Any suggestions as to retrieve the same sort of information in context field?

Comment 22 by Raymond Camden posted on 9/30/2011 at 7:07 PM

Ronnie, a few things:

a) Are you running 901+the latest hot fixes?
b) Did you remember to _ask_ for context with the contextPassages attribute?
c) Did you reindex your data?

Comment 23 by Ronnie posted on 10/5/2011 at 11:03 PM

Ray,

1. /opt/coldfusion9/lib/updates/hf901-00002.jar
2. <cfsearch collection="#coll#" name="Getresults" startrow="#startRow#" maxrows="#maxRow#" contextpassages="1"
contextbytes="500" contexthighlightbegin="<strong>" contextHighlightEnd="</strong>"
criteria="#searchFor#" suggestions="Always" status="info" language="English" >
3. Not explicitly. Each time the scheduled task that builds the collection (weekly) runs, it purges the collection and then rebuilds it. I assumed that indexed it so that it would never need to be reindexed per se.

Comment 24 by Raymond Camden posted on 10/7/2011 at 8:16 PM

So yeah, if you purge and reindex it should be refreshed. Odd. At this time I'll have to punt and suggest calling Adobe for support. You mentioned you altered the XML files. If you try with a new collection, and don't change the XML, do you see anything different?

Comment 25 by Ronnie posted on 11/19/2011 at 1:16 AM

Ray, first off, I was able to fix the previous problem by making the changes to the config files. It didn't happen right away but I think that after our SAs bounced our server and I reran the collection build all was well. Now for the next problem. I posted this to another of your blog entries, so forgive me if it's a duplicate.

Our team has been in the process of moving our site search from Verity to Solr. We are running CF9. I am able to build the collections and search them but have run into a problem. Hopefully, someone can shed some light on a solution. We are using cfincludes which can be either cfm or html files. When the search results are displayed and we click on a link that happens to point to one of the included files, none of the CSS is applied, since that normally comes from the file containing the cfinclude. Is there any way to have Solr display the link to the file that contains the cfinclude, with all the CSS displayed properly, as opposed to the included file?

Comment 26 by Raymond Camden posted on 11/19/2011 at 1:21 AM

I responded to your question on the other comment you posted.

Comment 27 by Garth Beebe posted on 8/28/2012 at 5:30 PM

Is there an issue with indexing .zip files and password encrypted pdfs?
I thought i had read a few months ago (most likely longer) that password encrypted pdfs were an issue.

Garth

Comment 28 by Raymond Camden posted on 8/28/2012 at 5:48 PM

Not sure. The easiest thing to do though is to write a quick test and report back to us. :)

Comment 29 by Garth Beebe posted on 8/28/2012 at 6:38 PM

I was unable to index password encrypted pdfs. Wasn't sure if there was a setting or solr config file i could change.

In the mean time i found on http://wiki.apache.org/solr...
some information regarding this. :)

It appears that you can not with the version (1.3) that comes with CF 9.

I tried to index zip files via the CFadmin interface. This errored out.
My thought was same as above.... maybe i needed to add something to the config file.

I haven't found anything yet on http://wiki.apache.org/solr, but will keep looking.

Thanks
