Two ColdFusion/Solr issues I discovered

As I prepare for my Solr presentation at RIAUnleashed I've run into two interesting issues that may trip people up. One is a bug proper and the other is simply misleading. Let's start with the bug.

Status is incorrect.
If you run a cfindex tag with the update action in a folder with files in it, the status result is supposed to tell you how many items were added versus how many were updated. In my testing, I had a folder with 4 files in it. I added these to my index and correctly saw that 4 were inserts. I then dropped a new file in, ran the same script, and expected to see 4 updated files and 1 inserted file. Instead I saw that 5 were updates.

This isn't a huge issue but if you rely on the status result for reporting or validation then you need to be aware of this.
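
If you want to see this for yourself, something like the following should do it. Treat it as a rough sketch: the collection name demo_docs and the ./docs folder are made up for illustration, and it assumes the Solr collection already exists.

```cfml
<!--- Sketch only: "demo_docs" and the ./docs folder are hypothetical. --->
<cfset docsFolder = expandPath("./docs")>

<!--- Index (or re-index) every matching file in the folder. --->
<cfindex action="update"
         collection="demo_docs"
         type="path"
         key="#docsFolder#"
         extensions=".txt,.pdf"
         status="indexStatus">

<!--- Dump the status struct to compare the reported insert and update counts. --->
<cfdump var="#indexStatus#" label="cfindex status">
```

Run it once, drop a new file into the folder, and run it again to compare the two dumps.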

You can see the bug report here: http://cfbugs.adobe.com/cfbugreport/flexbugui/cfbugtracker/main.html#bugId=84938

Misleading error message.
When you use cfindex to index stuff into an index (wow, say that 3 times fast), if you provide a collection name that does not exist, you get a misleading error message. Here is what you get:

Unable to connect to the ColdFusion Search service.

On Windows, you may need to start the ColdFusion Search Server from the services control panel. On Unix, you may need to run the search startup script in the ColdFusion bin directory. Error: java.io.IOException: unable to obtain from connection pool: cannot make connection to server at: k2://localhost:9953

As you can see, the error implies that ColdFusion was trying to connect to a server - and based on the "k2" at the bottom - a Verity server. If you make the same mistake with cfsearch, you get an error clearly stating that the collection doesn't exist.
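
One way to sidestep the confusing message is to check that the collection exists before you index. The following is a sketch under assumptions, not an official workaround: demo_docs and the ./docs folder are hypothetical names, and the error type in the cfthrow is one I invented for the example.

```cfml
<!--- Sketch only: "demo_docs" and the ./docs folder are hypothetical. --->
<cfset targetCollection = "demo_docs">

<!--- Ask ColdFusion for the Solr collections it knows about. --->
<cfcollection action="list" name="solrCollections" engine="solr">

<cfif listFindNoCase(valueList(solrCollections.name), targetCollection)>
    <!--- The collection exists, so indexing is safe to attempt. --->
    <cfindex action="update"
             collection="#targetCollection#"
             type="path"
             key="#expandPath('./docs')#"
             extensions=".txt,.pdf">
<cfelse>
    <!--- Fail with a message that names the real problem. --->
    <cfthrow type="Search.MissingCollection"
             message="The collection '#targetCollection#' does not exist.">
</cfif>
```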

You can see the bug report here: http://cfbugs.adobe.com/cfbugreport/flexbugui/cfbugtracker/main.html#bugId=84939

About Raymond Camden

Raymond is a senior developer evangelist for Adobe. He focuses on document services, JavaScript, and enterprise cat demos.

Lafayette, LA https://www.raymondcamden.com

Archived Comments

Comment 1 by Rick Smith posted on 11/11/2010 at 6:00 AM

Ray, will you please post your presentation/slides after the conference? Although I've found a tremendous amount of information on [CF] Solr, I've been setting up a scenario for a client with 10k documents and have been spending quite a bit of time familiarizing myself with it. Having used Verity extensively in the past, I'm still tweaking Solr to get my results exactly how I want them. Thanks!

Comment 2 by Raymond Camden posted on 11/12/2010 at 12:14 AM

I will - and if it goes well - I'll offer to do it for the Online Meetup. To be honest though, it sounds like you have Solr up and running but haven't tuned it to your needs yet. This class is more of an intro-level discussion of Solr within CF.

Comment 3 by John Wells posted on 12/13/2010 at 8:58 PM

Ray,
Is it clear that SOLR is the way to go vice Verity? Any clear performance gotchas that aren't well known?

We are about to index 4 million files and are looking for writeups on the kinds of performance and usability details that could make a big difference at that level.

The set of files has some logical breakouts, with the largest being 750,000.

Comment 4 by Raymond Camden posted on 12/13/2010 at 9:11 PM

Yes - Verity is officially deprecated. I bet it will be gone in CF10 (but can't promise it will).

Performance issues? None that I know of. Wait - I do know there is an issue with some types of PDFs - but I'm not aware of it being critical. For the most part the answer is no. If you are going to be indexing a large number of binary files, expect it to be slow. There is no way around that. But - that should typically be a one-time operation - or an operation you don't do often.

Looking at your second paragraph, I'd assume the initial index of your 4 million files will be slow. You may want to consider breaking it up into chunks (like files 1-250K, etc). Oh heh - just saw paragraph 3. :) So I think you will be ok. The initial index will be painful, but going forward, your atomic operations (updating, removing, adding) should be zippy. Search should be zippy too. I've seen searches on indexes of 20M records and it was fast.

Obviously if you see different, please let me know!

Comment 5 by Daren Valentine posted on 12/19/2010 at 7:46 PM

Ray,

Have you had any problems where Solr either does not index a document properly or a search does not return an expected result? I have a PDF document that contains the word "infrared". Solr will return the document if the search criteria is "infrared*", but not if it is "infrared". Also, using "infrared OR infrared*" as the criteria does not work either!

Comment 6 by Raymond Camden posted on 12/19/2010 at 8:07 PM

I haven't had my coffee yet - but isn't that the same word?

Comment 7 by Daren Valentine posted on 12/20/2010 at 2:08 AM

Yes, same word. That's what is so freakin weird. If I just search for "infrared" in the PDFs, one client's document is not returned. If I search for "infrared*" using the wildcard at the end, it is returned. Likewise, other clients with infrared in their documents disappear with the wildcard. So, thinks I, just take the user's search input and create a new search string with both iterations: "infrared OR infrared*". No good! My duct tape solution was to search the collection twice, once with and once without the wildcard, and coalesce the results. Very inelegant, but it was the only thing that worked.

Comment 8 by Raymond Camden posted on 12/20/2010 at 6:26 AM

You see, this is where the coffee would have helped. I swear I did not see the * at the end of the second example. Could you share the PDF where infrared existed but did NOT return when you didn't have the *?

Comment 9 by Raymond Camden posted on 12/24/2010 at 9:01 PM

Thought I'd share an update. Daren and I shared an email and he sent me his test files. Unfortunately I wasn't able to see anything so I recommended he visit the CF bug site and report it.

Comment 10 by Raphael Sbegue posted on 2/11/2011 at 8:32 PM

Hello Ray ..

I'm having a serious problem using Solr to search binary documents like MS Word and PDF files.

In theory it works perfectly for searching within the documents, but when I use the context to display the information, it only shows text from the start of the document, never from the middle or end.

Is there somewhere in the Solr configuration to increase the number of bytes used for the context?

Code:
<cfsearch name="results" collection="myCollection" criteria="#form.search#" status="sCollection" contextPassages="3" suggestions="always" contextHighlightBegin="<b>" contextHighlightEnd="</b>" />

Thanks.

Comment 11 by Raymond Camden posted on 2/12/2011 at 12:14 AM

I'm not quite sure I get your meaning. Are you saying the context portion is incorrect?

Comment 12 by Raphael Sbegue posted on 2/12/2011 at 12:21 AM

It is as if the collection does not index the file correctly.

The search returns the correct file and summary, but when the search terms appear near the end of the file, the context comes back empty, as if there were a byte limit on what it can display.

Comment 13 by Raymond Camden posted on 2/12/2011 at 2:10 AM

And you are running CF901 _and_ the CHF?

Comment 14 by Raphael Sbegue posted on 2/12/2011 at 3:27 AM

I think it's CF 9 without the hotfix. I'll check and come back here to say whether it worked or not.

Comment 15 by Fernando posted on 3/3/2011 at 12:58 AM

I am having the same issue where no matter how big I set contextpassages or contextbytes, it always caps off at a couple of words. I want to bring back at least 500 characters (or whatever that translates to in bytes) of a description field from a collection (created from a query, not documents), but it seems to always bring back the same small amount no matter what I set those 2 attributes to.

Comment 16 by Mike Lerley posted on 9/13/2014 at 12:29 AM

I know this is old, but what Fernando points out is still happening in CF10. If you look at the SOLR logs at the URL generated by CF, it is always setting contextbytes to 100. Infuriating...

Comment 17 by Raymond Camden posted on 9/13/2014 at 12:30 AM

Is there a bug filed?

Comment 18 by Mike Lerley posted on 9/13/2014 at 12:33 AM

Don't know... and honestly not sure where to look.

Comment 19 by Raymond Camden posted on 9/13/2014 at 12:34 AM

Bugs can be searched and added here: https://bugbase.adobe.com/

Comment 20 by Mike Lerley posted on 9/13/2014 at 12:38 AM

Thanks, I'll check it out. Not on Friday at 4:38 though :)

Comment 21 by Mike Lerley posted on 9/15/2014 at 11:02 PM

Submitted. Bug #3824890

https://bugbase.adobe.com/i...