As I prepare for my Solr presentation at RIAUnleashed I've run into two interesting issues that may trip people up. One is a bug proper and the other is simply misleading. Let's start with the bug.
Status is incorrect.
If you run a cfindex tag with the update action against a folder of files, the status result is supposed to tell you how many items were added versus how many were updated. In my testing, I had a folder with 4 files in it. I added these to my index and correctly saw that 4 were inserts. I then dropped a new file in, ran the same script, and expected to see 4 updated files and 1 inserted file. Instead I saw that 5 were updates.
This isn't a huge issue but if you rely on the status result for reporting or validation then you need to be aware of this.
You can see the bug report here: http://cfbugs.adobe.com/cfbugreport/flexbugui/cfbugtracker/main.html#bugId=84938
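If you want to check this behavior on your own server, here is a minimal sketch of the kind of script described above. The collection name, folder path, and extension list are hypothetical placeholders, and the collection is assumed to exist already. Run it once, drop a new file into the folder, run it again, and compare the two dumps.

```cfml
<!--- Hypothetical values: change these to match your own collection and folder. --->
<cfset collectionName = "docs">
<cfset folderPath = expandPath("./files")>

<!--- Index (or re-index) the files in the folder and capture the status struct. --->
<!--- The extensions list is an assumption; list whatever file types you store. --->
<cfindex action="update"
         collection="#collectionName#"
         type="path"
         key="#folderPath#"
         extensions=".pdf,.txt"
         recurse="false"
         status="indexStatus">

<!--- Dump the whole struct instead of assuming which keys it contains. --->
<cfdump var="#indexStatus#" label="cfindex status">
```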
Misleading error message.
When you use cfindex to index stuff into an index (wow, say that 3 times fast), if you provide a collection name that does not exist, you get a misleading error message. Here is what you get:
Unable to connect to the ColdFusion Search service.
On Windows, you may need to start the ColdFusion Search Server from the services control panel. On Unix, you may need to run the search startup script in the ColdFusion bin directory. Error: java.io.IOException: unable to obtain from connection pool: cannot make connection to server at: k2://localhost:9953
As you can see, the error implies that ColdFusion was trying to connect to a server - and based on the "k2" at the bottom - a Verity server. If you make the same mistake with cfsearch, you get an error clearly stating that the collection doesn't exist.
You can see the bug report here: http://cfbugs.adobe.com/cfbugreport/flexbugui/cfbugtracker/main.html#bugId=84939
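Until the message is fixed, one way to sidestep it is to confirm the collection exists before calling cfindex. The sketch below is only one possible approach; the collection name, folder path, and custom exception type are made up for illustration.

```cfml
<!--- Hypothetical values for illustration. --->
<cfset collectionName = "docs">
<cfset folderPath = expandPath("./files")>

<!--- Ask ColdFusion for the collections it knows about. --->
<cfcollection action="list" name="allCollections">
<cfset knownNames = valueList(allCollections.name)>

<cfif listFindNoCase(knownNames, collectionName)>
    <!--- Safe to index: the collection is registered. --->
    <cfindex action="update"
             collection="#collectionName#"
             type="path"
             key="#folderPath#"
             status="indexStatus">
<cfelse>
    <!--- Fail with a message that names the real problem. --->
    <cfthrow type="App.MissingCollection"
             message="The collection '#collectionName#' does not exist.">
</cfif>
```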
Archived Comments
Ray, will you please post your presentation/slides after the conference? Although I've found a tremendous amount of information on [CF] solr, I've been setting up a scenario for a client with 10k documents and been spending quite a bit of time familiarizing myself with it. Having used Verity extensively in the past, I'm still tweaking solr to get my results exactly how I want them. Thanks!
I will - and if it goes well - I'll offer to do it for the Online Meetup. To be honest though, it sounds like you have Solr up and running but haven't yet tuned it to your needs. This class is more an intro level discussion of Solr within CF.
Ray,
Is it clear that SOLR is the way to go versus Verity? Any clear performance gotchas that aren't well known?
We are about to index 4 million files and are looking for writeups on the kinds of performance and usability details that could make a big difference at that level.
The set of files has some logical breakouts, with the largest being 750,000.
Yes - Verity is officially deprecated. I bet it will be gone in CF10 (but can't promise it will).
Performance issues? None that I know of. Wait - I do know there is an issue with some types of PDFs - but I'm not aware of it being critical. For the most part the answer is no. If you are going to be indexing a large number of binary files, expect it to be slow. There is no way around that. But - that should typically be a one-time operation - or an operation you don't do often.
Looking at your second paragraph, I'd assume the initial index of your 4 million files will be slow. You may want to consider breaking it up into chunks (like files 1-250K, etc). Oh heh - just saw paragraph 3. :) So I think you will be ok. The initial index will be painful, but going forward, your atomic operations (updating, removing, adding) should be zippy. Search should be zippy too. I've seen searches on indexes of 20M records and it was fast.
Obviously if you see different, please let me know!
Ray,
Have you had any problems where Solr either does not index a document properly or does not return an expected search result? I have a PDF document that contains the word "infrared". Solr will return the document if the search criteria is "infrared*", but not if it is "infrared". Also, using "infrared OR infrared*" as the criteria does not work either!
I haven't had my coffee yet - but isn't that the same word?
Yes, same word. That's what is so freakin weird. If I just search for "infrared" in the pdfs, one client's document is not returned. If I search for "infrared*" using the wildcard at the end, it is returned. Likewise, other clients with infrared in their documents disappear with the wildcard. So, thinks I, just take the user's search input and create a new search string with both iterations: "infrared OR infrared*". No good! My duct tape solution was to search the collection twice, once with and once without the wildcard, and coalesce the results. Very inelegant, but it was the only thing that worked.
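For anyone who needs the same stopgap, the two-pass search described above might look roughly like the sketch below. This is not the commenter's actual code; the collection name and search term are placeholders, and the results are merged by document key so each document is counted once.

```cfml
<!--- Hypothetical values for illustration. --->
<cfset collectionName = "docs">
<cfset term = "infrared">

<!--- Pass 1: the plain term. Pass 2: the same term with a trailing wildcard. --->
<cfsearch name="exactHits" collection="#collectionName#" criteria="#term#">
<cfsearch name="wildcardHits" collection="#collectionName#" criteria="#term#*">

<!--- Merge both result sets into a struct keyed by document key to drop duplicates. --->
<cfset merged = structNew()>
<cfloop query="exactHits">
    <cfset merged[exactHits.key] = exactHits.title>
</cfloop>
<cfloop query="wildcardHits">
    <cfif not structKeyExists(merged, wildcardHits.key)>
        <cfset merged[wildcardHits.key] = wildcardHits.title>
    </cfif>
</cfloop>

<cfoutput>#structCount(merged)# unique documents matched.</cfoutput>
```

Note that the struct only keeps the key and title; if you need score or summary you would store a nested struct per document instead.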
You see, this is where the coffee would have helped. I swear I did not see the * at the end of the second example. Could you share the PDF where infrared existed but did NOT return when you didn't have the *?
Thought I'd share an update. Daren and I shared an email and he sent me his test files. Unfortunately I wasn't able to see anything so I recommended he visit the CF bug site and report it.
Hello Ray ..
I'm having a serious problem using Solr with binary documents like MS Word and PDF files.
In theory the search within the document works perfectly, but when I use the context to display the information, it only shows the start of the document, not anything from the middle or end.
Is there somewhere in the Solr configuration to increase the number of bytes for the context?
Code:
<cfsearch name="results" collection="myCollection" criteria="#form.search#" status="sCollection" contextpassages="3" suggestions="always" contextHighlightBegin="<b>" contextHighlightEnd="</b>" />
Thanks.
I'm not quite sure I get your meaning. Are you saying the context portion is incorrect?
It is as if the collection does not index the file correctly.
The search returns the correct file and summary, but when the search terms appear near the end of the file, the context comes back empty, as if there were a byte limit on what it can display.
And you are running CF901 _and_ the CHF?
I think it's CF 9 without the hot fix. I'll check and come back here to say whether it worked or not.
I am having the same issue where no matter how big I set contextpassages or contextbytes, it always caps off at a couple of words. I want to bring back at least 500 characters (or whatever amount that translates to in bytes) of a description field from a collection (created from a query, not documents), but it seems to always return the same small amount no matter what I set those 2 attributes to.
I know this is old, but what Fernando points out is still happening in CF10. If you look at the SOLR logs at the URL generated by CF, it is always setting contextbytes to 100. Infuriating...
Is there a bug filed?
Don't know... and honestly not sure where to look.
Bugs can be searched and added here: https://bugbase.adobe.com/
Thanks, I'll check it out. Not on Friday at 4:38 though :)
Submitted. Bug #3824890
https://bugbase.adobe.com/i...