Back in January of this year I wrote a blog entry about just how poorly Solr was integrated into ColdFusion 9. I'll take partial blame for that as I didn't test it during the betas, but at the end of the day, the support for binary files was so poor there was no way I could recommend it. Database content - sure. But for any collection of files the integration failed miserably.
This is one of the areas that was touched in the 901 updater, and I'm very happy to see that Adobe did indeed do a great job improving their support for binary files. I did a quick test where I dumped some MP3, PDF, and Word docs into a folder and then indexed it. (I also tried an AVI for the heck of it - it didn't work though.) After the collection was created I did a few quick tests. Here is the result of a search for e*:

As you can see, it picked up quite a bit of information. Check out the summary on the MP3. It isn't clear unless you view source, but the data is line-delimited and could be parsed if you wanted to.
I guess it shouldn't come as any surprise that, yes, it did get fixed, but I'm pretty darn happy. Let me leave you with two quick notes.
First - the application I used to test searching against the collection is my CF Admin extension, CF Admin Searcher. It is a great tool for quickly testing ad hoc queries against your collections.
Secondly - on my Mac I ran into an issue the first time I tried to use Solr after my 901 update. Vinu (from Adobe) suggested checking ColdFusion9/solr/multicore/solr.xml to see if I had any collections there that didn't exist on the file system. I simply removed them all (except core0!), restarted Solr, and everything worked after that.
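For reference, that file is just a registry of cores. It looks roughly like this (a sketch based on the standard Solr 1.4 multicore format, so the exact attributes in your install may differ; "oldcollection" is an invented name):

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <!-- remove any entry whose instanceDir no longer exists on disk -->
    <core name="oldcollection" instanceDir="oldcollection" />
  </cores>
</solr>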
Archived Comments
@Ray: Been testing SOLR against our old application that uses verity and very impressed so far. One quick question, what is the core0 collection? Is it important or can I remove it?
From what I know, you can't delete it. Let me try to find out why though.
Ray,
Have you had any issues with Solr and large files? I was doing some tests similar to yours yesterday and ran into a situation where Solr seems to be choking on one particular PDF file. The browser cranks for a while and eventually it stops and the center frame in the CF admin is blank. The document count on the collection screen remains the same (zero if it is a new collection). The PDF file in question is the coldfusion_9_cfmlref.pdf.
Note that verity indexes the file with no problem.
Thanks for the CF Admin Searcher extension. It's been very handy!
Ray
Ouch. I can confirm that as well. I indexed a small folder of PDFs of which that was one. I'll raise this with the Adobe team.
Thanks for passing it along to the Adobe Team! I may have to go with Verity for a project I'm working on rather than Solr, at least until this gets resolved.
More info: Nothing is logged to the Solr logs. However, the exception log shows:
"Error","jrpp-744","07/20/10","08:28:43","cfadmin","GC overhead limit exceeded The specific sequence of files included or processed is: C:\ColdFusion9\wwwroot\CFIDE\administrator\verity\indexcollection.cfm, line: 111 "
java.lang.OutOfMemoryError: GC overhead limit exceeded
I noticed that when I updated to CF901, my JVM min/max memory settings were reset to 512M as the max. I just changed that to 512/1024. I'm running the index again and will let you know. It's still ongoing.
I aborted the request after 15 minutes. Hmpth.
So as an FYI, I tried Verity. I first ran into an issue which prevented me from using Verity at all. I fixed that here: http://www.stillnetstudios....
When I got my collection created I tried an index. It took 60 seconds. So I've got an open request w/ Adobians to see what the heck is going on. There is no reason why Solr should take so much longer than that.
@RayB: What platform are you on?
Windows 7, 64-bit...using the 64-bit version of CF (Developer Edition).
Ok this is VERY interesting. So am I - and someone on Twitter (not you I assume? @dyakovich) said the exact same thing.
I just confirmed that this issue occurs on Windows 7 32-bit too, so it's not specific to Win7 64-bit.
I've confirmed it on the Mac too. I got a reply from an Adobian too and he is looking at my source files.
Multiple updates from Adobe, specifically Vinu.
First, to the core0 question: core0 was the default collection shipped earlier with CF9. You should be able to delete it now. The restriction of having at least one collection is no longer there.
Second, to the PDF: I think the issue is with that particular PDF. I am able to repro this. I was able to index 55 PDFs (excluding the CFML ref), totaling 90MB, in 3 minutes with the default Solr memory config of 256M. For indexing PDFs, we use the same APIs as CFPDF, and running CFPDF extracttext causes the OOM (in ColdFusion and not Solr). We will investigate the issue with this particular PDF. Please exclude this file and let me know if you are able to index the other files.
Also, to tune Solr, you can increase the Solr buffer limit (which defaults to 40) based on the Solr memory parameters, e.g. 80 for 512MB and so on. This will speed up the whole process.
Yep....I definitely think the problem is with that particular PDF. I tried to delete half the pages in the PDF and Acrobat crashed.
I also just indexed 341 PDFs in approximately 5 to 8 minutes.
Woot. Let me just say - I am VERY happy to see it's just a problem PDF. Still though - it may be something to watch for in the future.
I'm glad to hear there have been improvements with Solr, however, I am still waiting for Adobe (or Apache) to integrate a web-crawler (Nutch?) that works with Solr before I make the switch from Verity...hoping it happens in CF10! :-)
Ray....thanks for running this down. It's much appreciated!
My Windows 7 64-bit/Solr issue happens with a PDF file, but the file is only 35KB and two pages in length. I'm able to index several PDF files before Solr hits this one PDF and chokes.
I will bring this up with Vinu.
@Dave Yakovich: Can you send the PDF file to me at kvinu-at-adobe.com?
E-mail sent with the PDF. Thanks!
I just emailed Vinu, but figured I should post here as well. If you have a PDF with an uppercase .PDF extension, it will choke and give you an org/apache/pdfbox/pdmodel/PDDocument null error.
Renaming the extension to lowercase .pdf will fix the problem.
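If you have a folder full of mixed-case extensions, a quick loop can normalize them before indexing. A rough sketch (the directory is a placeholder):

<cfdirectory action="list" directory="C:\docs" name="qFiles">
<cfloop query="qFiles">
	<!--- compare() is case-sensitive, so this catches .PDF, .Pdf, etc.
	      but skips files already ending in lowercase .pdf --->
	<cfif qFiles.type EQ "file"
			AND lCase(listLast(qFiles.name, ".")) EQ "pdf"
			AND compare(listLast(qFiles.name, "."), "pdf") NEQ 0>
		<cffile action="rename"
			source="#qFiles.directory#\#qFiles.name#"
			destination="#qFiles.directory#\#left(qFiles.name, len(qFiles.name) - 3)#pdf">
	</cfif>
</cfloop>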
I am having an interesting problem doing a cfindex to a Solr collection on a large group of documents (html, pdf, txt, etc). If I run the cfindex on my XP dev machine, indexing, for example, 100 records of the same type (html), the routine runs flawlessly. If I run the same on my Windows 2003 production server, I get the following errors:
Error opening new_searcher exceeded limit of maxWarmingSearchers4_try_again_later Error_opening_new_searcher_exceeded_limit_of_maxWarmingSearchers4_try_again_later request: http://localhost:8983/solr/casecollection/update?commit=true&waitFlush=false&waitSearcher=false&wt=javabin&version=1,
If I reduce the number to 4 records at a time, no errors. If I increase it to 5 records, I start to generate the errors.
I need to index 60k+ files, so this is a bit of a concern.
Any suggestions?
Vinu is on this thread so hopefully he will have an answer. Outside of that, your best option is to contact Adobe support. Sorry!
@Anthony:
Open solrconfig.xml under <collectiondirectory>/conf and look for maxWarmingSearchers. You can increase the value of that, but make sure Solr has enough memory allocated. Restart Solr. The default may not be sufficient for Solr to handle load in production. (http://help.adobe.com/en_US...
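For anyone hunting for it in the file, the element looks something like this (a sketch; 4 is just an example value):

<!-- in <collectiondirectory>/conf/solrconfig.xml -->
<maxWarmingSearchers>4</maxWarmingSearchers>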
Thanks Vinu, but that did not do it. I increased memory to 1024 and changed maxWarmingSearchers to various values; however, the issue simply followed the maxWarmingSearchers number, failing at 8, then 10, then 20, etc. as I moved the number around.
It's odd because the issue is on a Windows 2003 server, yet it performs as expected on my XP dev machine, running the same set of documents against the index.
I guess next is open a ticket with Adobe.
For the record, this was due to a corrupted collection. I deleted the collection, re-created it and indexing is working as expected. Now I need to figure out why indexing some .doc files is throwing errors to the server.log, but not to the cftry/cfcatch tags.
Has there been any update to CF901 for SOLR and the issue of indexing .PDF extensions?
I did change the extension to lower case as stated in an earlier post. It worked, but seems odd that you would need to do this.
The latest 'build' of CF is 901+CHF.
@Anthony. I also hit the maxWarmingSearchers issue (ColdFusion 9.0.1) using the cfindex tag. I was looping over files to index each one with Solr.
I put a try/catch around the cfindex tag and found that about 2 out of 100 files generated the issue, so I put a cfthread sleep after each index to delay the calls to cfindex, and it fixed the issue; it's almost as if Solr couldn't handle so many 'index' requests hitting it one after the other.
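For reference, Barry's loop-and-delay approach looks roughly like this (a sketch; qFiles and the collection name are placeholders):

<cfloop query="qFiles">
	<cftry>
		<cfindex action="update"
			collection="mycollection"
			type="file"
			key="#qFiles.directory#\#qFiles.name#">
		<cfcatch type="any">
			<!--- log the failure and move on --->
			<cflog file="solrindexing" text="Failed on #qFiles.name#: #cfcatch.message#">
		</cfcatch>
	</cftry>
	<!--- give Solr a moment to catch up between index calls --->
	<cfthread action="sleep" duration="500"/>
</cfloop>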
Turns out (at least in my case) you can just index multiple files from a query by specifying type="file" in cfindex, and it doesn't seem to have the issues of the method I was using in my previous post (it would be nice if this were in the ColdFusion docs):
<cfindex
action="update"
collection="#arguments.collection#"
type="file"
query="tmpQuery"
key="file"
URLpath="urlpath"
categoryTree="categorytree"
custom1="custom1"
>
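For anyone trying this, tmpQuery just needs columns matching the attribute values above, with one row per file. A sketch of building it by hand (the directory, URL path, and category values are placeholders):

<cfdirectory action="list" directory="C:\docs" name="qDir" filter="*.pdf">
<cfset tmpQuery = queryNew("file,urlpath,categorytree,custom1", "varchar,varchar,varchar,varchar")>
<cfloop query="qDir">
	<cfset queryAddRow(tmpQuery)>
	<cfset querySetCell(tmpQuery, "file", "#qDir.directory#\#qDir.name#")>
	<cfset querySetCell(tmpQuery, "urlpath", "/docs/#qDir.name#")>
	<cfset querySetCell(tmpQuery, "categorytree", "docs")>
	<cfset querySetCell(tmpQuery, "custom1", "")>
</cfloop>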
I'm also having a LOT of problems with the warming searchers error. My app runs a cfindex whenever a particular type of data is added, and when my unit tests run (several of which create this data type), I invariably get this error. I expect I'd get it when the app is in production as well. The problem appears to be that ColdFusion commits after every cfindex tag, which causes Solr to open a new searcher and "auto warm" it (load from cache). If you are committing more frequently than the warming process can finish, these searchers pile up and you hit the error.
Problem is, I'm not sure how we can work around this easily. I tried setting the autoWarmCount settings to 0 to see if I could reduce the startup time (and since my app does a lot more indexing at runtime than searching, slow startup is not a big issue), but I still hit the error when the number of searchers exceeded the maximum. So I'm kind of at a loss as to how to fix this. I could try setting the limit really high, but that supposedly is a bit of a performance killer.
Bug report added for this, for anyone that wants to vote/add comments. Hopefully they can fix this in CF10.
http://cfbugs.adobe.com/cfb...
I was also receiving this error. I tried Barry's suggestion and added a half-second cfthread sleep in the loop, which worked for me.
Okay, this gets better all the time. I've had the same problem with having to index a HUGE number of documents (HTML files). The problem was temporarily resolved by assigning more heap space to SOLR and then by upping the maxWarmingSearchers setting in solrconfig.xml. However, it appears that when SOLR rebuilds the index, it re-writes the solrconfig.xml file, overwriting our change to this setting! This happened when we reindexed last night; the indexing process then failed with the same maxWarmingSearchers error. So we've got a chicken-and-egg problem going on here: in order for the indexing to work, we need to up the maxWarmingSearchers setting, but the process of reindexing overwrites the value of that setting!
Is there a 'master prototype' version of solrconfig.xml that any future copies of solrconfig.xml created by SOLR (and put in the individual collection directories) will be based on? If there is...what's its name and where is it located?
I believe - stress - believe - it's copied from here:
C:\ColdFusion9\solr\solr\conf
Yeah Ray, I'd noticed that and wondered. Beliefs are often good, and on occasion, correct :)
Vinu - do you *know* if this is the source for the solrconfig.xml that winds up in your collection directory? If not...what is the source? Is it accessible to developers somewhere?
Ray - The default for maxWarmingSearchers in the /solr/solr/conf/solrconfig.xml is 2. I just set it to 6 and reindexed. I figured I could test your theory about this being the "prototype" for the config files created in the individual collections. The reindexing process over-wrote the solrconfig.xml in our collection's conf/ directory as expected. But the value of maxWarmingSearchers in the new config file was set to 4. So I don't think it's using the config file under /solr/solr/conf as a prototype...
You truly got me there. Punt to support?
Our team has been in the process of moving our site search from Verity to Solr. We are running CF9. I am able to build the collections and search them but have run into a problem. Hopefully, someone can shed some light on a solution. We are using cfincludes, which can be either cfm or html files. When the search results are displayed and you click a link that happens to point at one of the included files, none of the CSS is applied, since the CSS normally comes from the page containing the cfinclude. Is there any way to have Solr link to the file that contains the cfinclude, with the CSS applied properly, instead of the included file?
If I read you right, you are telling Solr to index CFM files. Normally you don't do that. Solr isn't CF, so it can't really 'grok' what is going on. If you really want to do this, then the best I can suggest is logic in App.cfc/cfm that a) recognizes a direct request to a file that is normally included and b) redirects the request to the proper top-level document that includes the file.
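A minimal sketch of that idea for Application.cfc (the "/includes/" convention and the target URL are invented; adjust to however you identify your include-only files):

<cffunction name="onRequestStart" returntype="boolean" output="false">
	<cfargument name="targetPage" type="string" required="true">
	<cfif findNoCase("/includes/", arguments.targetPage)>
		<!--- send direct hits on an include to the page that wraps it --->
		<cflocation url="/index.cfm?inc=#urlEncodedFormat(listLast(arguments.targetPage, '/'))#" addtoken="false">
	</cfif>
	<cfreturn true>
</cffunction>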
Just thought I'd update my status on my above-mentioned dilemma. We used Verity in the past and found that our nightly rebuilds worked best by deleting the collection and then recreating it. But with SOLR you get into this chicken-and-egg problem using that approach, due to SOLR's "insistence" on overwriting solrconfig.xml with its own default settings.
So it turns out that with SOLR, by setting cfindex action="refresh", we can work around this. It has the same effect as emptying out the collection and rebuilding it, but this way it doesn't stomp on its own configuration file. Hope this proves helpful to anyone else with this problem.
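In tag form, that approach is roughly this (collection name and path are placeholders):

<cfindex action="refresh"
	collection="mycollection"
	type="path"
	key="C:\content\"
	extensions=".htm, .html"
	recurse="true">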
Just for my own clarification and future development, why don't you normally index cfm files when using Solr? We currently use cfm and html files to display content. By eliminating the former, wouldn't we be excluding the content in the cfm page that might want to be searched by a user?
CFM files, much like PHP files, etc, are dynamic files that typically spit out database data. When you tell Solr to index them, though, it is NOT "running" them, but rather reading the file in. So you are - basically - asking Solr to index
<cfoutput>#name#</cfoutput>
which is meaningless.
So then is there no way to index dynamic content with Solr?
Of course there is. cfindex allows you to index queries. So you would simply pass a query to cfindex instead.
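A minimal sketch of query-based indexing (the table, column, and collection names are invented):

<cfquery name="qContent" datasource="mydsn">
	SELECT id, title, body
	FROM articles
</cfquery>
<cfindex action="update"
	collection="mycollection"
	type="custom"
	query="qContent"
	key="id"
	title="title"
	body="body">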
First off, I love Solr, especially the performance of searches. That said, I am fighting a major battle with using CFIndex to index files.
The maxWarmingSearchers issue continues to be a major problem. Due to the design of our system, we can't simply index the documents in a directory path; rather, we must call CFIndex for each individual file based upon language, etc., from the database.
Solr has a lot of configuration options; however, it appears that when CFIndex invokes Solr, it specifies that each invocation should perform a commit, not wait to flush (wish I could teach the 13-year-old that), and not wait for a searcher. See the line below:
Error_opening_new_searcher_exceeded_limit_of_maxWarmingSearchers32_try_again_later Error_opening_new_searcher_exceeded_limit_of_maxWarmingSearchers32_try_again_later request: http://localhost:8983/solr/indure132/update?commit=true&waitFlush=false&waitSearcher=false&wt=javabin&version=1
Since it appears a commit is being forced on each index operation, the larger the collection, the worse the problem.
We have built a very poor workaround by sleeping for 1000ms between each document and trapping for the error, then sleeping for 60 seconds, but this is not a good long-term solution.
Any other ideas besides giving Adobe a swift kick in the #@$ to get a viable solution?
Cheers!
Well, certainly you can skip cfindex and use Solr directly via cfhttp. Outside of that, you want to make sure you file a bug report.
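A rough sketch of hitting Solr's standard update handler yourself via cfhttp (the port and URL pattern match the defaults seen in the error messages above; the core name is a placeholder, and the field names must match your collection's conf/schema.xml - the ones here are guesses):

<cfsavecontent variable="docXml"><add><doc>
	<field name="uid">example-1</field>
	<field name="contents">Some body text to index</field>
</doc></add></cfsavecontent>
<cfhttp url="http://localhost:8983/solr/mycollection/update?commit=true" method="post">
	<cfhttpparam type="header" name="Content-Type" value="text/xml">
	<cfhttpparam type="body" value="#docXml#">
</cfhttp>

The point is that you control the commit here: batch your adds and only append commit=true (or post a bare <commit/>) when you actually want one.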
@Kevin: if you do choose to use Solr via cfhttp, keep in mind that CF's implementation doesn't include all of the file-extraction jars that the Solr Cell interface uses to extract content from files (PDF, Word, Excel). Adobe utilizes its own PDF engine, as well as its own handling of the MS Office formats, from what I can gather. I'm using CFIndex with type="file" similar to you, where I'm populating some of the collection information from my database and then the rest of the content from the "rich content files". Just thought I'd give you a heads-up that CF's Solr implementation is different from the one "out of the box" from Lucene.
We're bumping into the same problem - CF10 provides an option to suppress the autocommit (<cfindex autocommit="no"....>) - has anyone had any experience with this? How vulnerable is uncommitted data (i.e. is it in memory or in a staging area on disk? - or, more to the point, is it critical to commit before restarting CF and/or the Solr engine??)
One thought we have is to wrap the collection in a CFC and commit every N updates (where N is probably in the 10-20 range) and/or when a CFSearch request comes in - any thoughts?
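For what it's worth, a sketch of that wrapper idea using the CF10 autocommit attribute mentioned above (untested; a Solr commit applies to all pending adds, so letting every Nth update commit should flush everything queued before it):

<cfcomponent output="false">
	<cfset variables.pending = 0>
	<cfset variables.commitEvery = 15>

	<cffunction name="indexFile" returntype="void" output="false">
		<cfargument name="collection" type="string" required="true">
		<cfargument name="filePath" type="string" required="true">
		<cfset var ac = "no">
		<cfset variables.pending = variables.pending + 1>
		<cfif variables.pending GTE variables.commitEvery>
			<cfset ac = "yes">
			<cfset variables.pending = 0>
		</cfif>
		<!--- variables.pending is shared state - add a cflock if several
		      requests index through the same instance --->
		<cfindex action="update"
			collection="#arguments.collection#"
			type="file"
			key="#arguments.filePath#"
			autocommit="#ac#">
	</cffunction>
</cfcomponent>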
Thanks for your Admin Searcher add-on to the administrator, Ray. I found it because I was googling why my simple searches were taking so long. The following is all run on my developer version under CF 9,0,2,282541
Firstly, I needed to remove the references to the Verity engine in the _inc9.cfm file - CF threw an error about Verity being unsupported. Easy fix.
Now: timing....
I am only indexing db data. As you can see below there are only 32 records in the collection. I noticed that the actual execution time of the test request was way longer than the time reported by cfsearcher (as indeed it was when I ran the same search in my app). So I used getTickCount() around the cfsearch tag in your results.cfm page and this is what I get (as a typical example):
"Your search for +policy returned 11 item(s) out of 32 records. It took 1076ms to process and 5372ms overall. "
In cfsearcher I am only setting the collection and the criteria. The collection does use categories but the timing is the same whether or not I specify these in the search.
So, my question is, what would account for the extra 4 seconds? I recreated the collection from scratch before I tested in case it was "corrupted" in some way. No change. Any suggestions for places to optimise would be greatly appreciated.
Thanks Ray,
Murray
I'm not quite sure I understand what you are saying. Are you saying your code takes 4 seconds longer to run than my admin tool?
Thanks Ray. No, what I am saying is that the timing reported by the tool, based on the metadata returned from cfsearch, is (in the example above) 1076ms, but the actual elapsed time for the same cfsearch, measured by wrapping it in getTickCount() calls, is 5372ms.
I amended your results.cfm as follows to time and report:
<cfset variables.startTime = getTickCount()>
<cfsearch collection="#form.collection#" criteria="#form.search#" name="result" status="meta" startrow="#url.start#" maxrows="#max#"
contextpassages="3" contexthighlightbegin="____B____" contexthighlightend="____BE____"
categorytree="#form.categorytree#" category="#form.category#" suggestions="always">
<cfset variables.totalTicks = getTickCount() - variables.startTime>
<cfoutput>
<p>
Your search for #form.search# returned #meta.found# item(s) out of #meta.searched# records. It took #meta.time#ms to process
and #variables.totalTicks#ms overall. <-- I added that last phrase
{etc}
I am curious about the 4 second difference. And, when I do the same search via my app I get a similar difference. I want my searches to be faster and was investigating the slowness and came upon your tool.
Thanks,
Murray
You got me there. All I can think of is that the time reported is how long it took the search engine to respond, *but* CF does some work after that to get the results ready. Maybe the context passages stuff is on the CF side, not the search engine's. Try removing some of those features, get it down to just the search, and see if the numbers match.
Thanks again Ray. I tried a couple of things:
1. Minimising all settings - 0 context, 1 row, etc etc. No difference
2. I searched against the empty bookclub collection. This resulted in:
Your search for book returned 0 item(s) out of 0 records. It took 1060ms to process and 7455ms overall.
I tried the bookclub multiple times and it always took between 7.4 and 7.9 seconds.
At the risk of hijacking your post, I would be interested to know if others are getting the same results. It is hard to understand why a search of an empty collection would take 7 seconds. To confirm these results I rebooted my computer (Win7 Pro, 8GB RAM) and restarted CF.
In my original testing against the same small collection on a shared host live server using my app I was getting similar delays. i.e. it was taking about 5 seconds (for just the cfsearch). So, it seems that it isn't just my computer, and I am at a loss.
Thanks again for looking at this.
Murray
PS: I tried the same search of the empty bookclub collection using the cfsearcher on a different computer with CF10 developer version and got:
Your search for book returned 0 item(s) out of 0 records. It took 437ms to process and 10858ms overall.
Multiple tests were all in the 10-11 second range.
I haven't used Solr in a few months, but I know it has never been that slow. I honestly don't know what to tell you outside of trying the main CF support line.
Thanks for looking at it Ray. :-)
If I get it resolved I will post back here.
Go well,
Murray