For the past week or so I've been working on the updated full-text searching chapter of the ColdFusion Web Application Construction Kit. We are totally removing all mention of Verity (outside of the fact that it used to be the only engine supported) and focusing entirely on Solr. While working through the chapter I ran across a couple of "gotchas" that I thought I'd share for anyone considering migrating to ColdFusion 9 and taking the initiative to also update to Solr. This is by no means a complete list - it's just what I encountered. Comments, corrections, and additions are welcome.
Edited: Let me stress - Solr and Verity are very different products. The point of this article is to focus on the differences at the ColdFusion level only.
- Speed: Ok so this isn't something to worry about per se, but it bears repeating. Solr in CF9 is four times faster than Verity. Sweet.
- NO LIMIT! Ditto the above statement about not being something to worry about, but just in case you didn't like the index limits in Verity, you will be happy to know that there are no limits to the size of your Solr based collections. Thank you Henry Ho for reminding me of this.
- When creating a Solr collection, you do not need to specify a language. But you can specify it for cfindex and cfsearch.
- When creating a Solr collection, categories are automatically supported. The option is even disabled in the admin UI. It just plain works out of the box.
- In Verity, a search for an all-uppercase or all-lowercase term resulted in a case-insensitive search. If you supplied a mixed-case term, though, the search was case sensitive. Solr is case insensitive no matter what the case.
- Edited October 31, 2009. Solr does support AND and OR. The docs, however, seem to imply it only supports + and - (for not). This is not the case. Both formats are supported.
Original text, corrected by the edit above: Solr doesn't support AND or OR; instead, you have to use the + operator to require words. So to require A and B, you would use: +A +B. To support A or B you would use: A B. I will say that one example in the docs (url linked below) shows OR. That may be a typo.
- To support NOT (i.e., include something but preclude any result that has something else) you use the - operator. Example: A -B.
- Solr supports wildcards like Verity (* and ?) but they cannot be used as the first character of a search term. You may want to consider writing a UDF to handle "fixing" search terms for users in case they make a mistake like this. CFLib has one for Verity already.
- For more details on operators and search examples, see this documentation page: Solr search examples
- Solr does not support previousCriteria. This is a feature where you can search within the result set of another search. While it may not be the same, you could mimic this with query of query.
- Verity scores range from 0 to 1, with 0 meaning it sucks and 1 meaning it rules. In Solr it's a bit different. There is no (as far as I can tell) upper range. The higher the number though the better the result. Basically you probably don't want to bother displaying the score.
- With Verity, you can ask for suggestions. This was used along with the status attribute. It would return a structure of which one key was named suggestedquery. Solr still returns this, but the value is just the corrected spelling of one search term. So if you searched for nmber one, it may tell you "number" as the correct spelling. But - if you want to provide a better search term - use collatedResult. This key would have both the fixed spelling and the rest of the query. So it would show something like "number one" as the value.
- With Verity, to use a field during a search, you would do cf_custom1 something. In Solr, all these fields drop the cf_ in front, so cf_custom1 becomes custom1.
- The cfcollection docs state that if you leave engine off when using the list operation, you will get all collections. Under OSX at least this is not the case. You must specify engine="solr" to get Solr collections.
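As a rough illustration of the "fixing" idea from the wildcard item above, here is a minimal sketch of the logic. It is written in Python purely for illustration (a CFML UDF would follow the same steps), and the function name `fix_leading_wildcards` is hypothetical, not something from CFLib or ColdFusion. It assumes terms are separated by whitespace and that a leading + or - should be preserved.

```python
import re


def fix_leading_wildcards(criteria: str) -> str:
    """Remove * and ? from the start of each whitespace-separated term.

    Hypothetical helper: a leading + or - (require / exclude) is kept,
    wildcards in the middle or at the end of a term are left alone.
    """
    cleaned = []
    for term in criteria.split():
        # Optional +/- prefix, then one or more leading wildcards, then the rest.
        match = re.match(r"^([+-]?)[*?]+(.*)$", term)
        if match:
            term = match.group(1) + match.group(2)
        # Drop terms that end up empty or are only a bare operator.
        if term and term not in {"+", "-"}:
            cleaned.append(term)
    return " ".join(cleaned)
```

You would run the user's input through something like this before handing it to the search, so a term such as `*fusion` is searched as `fusion` instead of causing an error.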
That's all I have for now. I plan on updating my Verity CF Admin to work with Solr. This tool allows you to perform ad-hoc searches against collections via the CF Admin. I'm then going to follow up with an index of the ColdFusion HTML documentation and compare some searches in Verity versus Solr.
Archived Comments
Strange that the cfsearch tag won't change the AND/OR/NOT operators to the solr equivalent for you. This would help with migration to the new platform. Less legacy code to change.
If you use Verity's "internet" search type, then it already works like Solr. The "internet" search type uses + to force an inclusion and - to exclude.
Solr should certainly be supporting AND/OR searches. We've done extensive Solr integration, and the schema.xml for Solr defines what your default operator is. It defaults to OR but can be modified. With the default changed to AND, a search string of 'Jim Jones' would be interpreted as 'Jim AND Jones' and Solr would look for documents with both terms in them.
Personally I'd use Solr standalone and tweak the configs - but another thing you can do is find the IP/port Solr is running on, execute your search through the browser, and add debugQuery=true to the query string to see exactly how your search string is being parsed. It's difficult at times to interpret the response without this debug information (it can be found at a URL like 127.0.0.1:8983/solr/select?q=Jim Jones&debugQuery=true).
I did a cfug topic on this stuff at colderfusion.com - Long live Solr!
And maybe the CF implementation doesn't support the 'filter query' stuff, but solr does - i.e. q=search string&fq=filtered search.
I'm not a fan of the implementation, as they just tried to be 'verity-like' with the results.
http://wiki.apache.org/solr...|%28filter%29#fq
I'd go outside CF's implementation and go directly to Solr and parse the XML files returned.
@kevin: To your first point - are you saying Solr should support the actual words "AND" and "OR", or the style? If so - I definitely didn't say it didn't support and/or, just not the actual words.
@Kevin: Nice to know. I'm not sure I'd agree with you about 'going' native. It's nice that you can for sure, but for most users, I don't think it will be an option they are comfortable with, or would need per se.
According to the Apache Solr site and the Lucene parsing syntax, both AND and OR (upper case) as well as the + - operators work just fine. I've spent less time trying to see how CF9 implements it than actually using Solr syntax directly, but here's where the syntax is stated on the Lucene (Solr's Java engine) site:
http://lucene.apache.org/ja...
Here's a raw search string that can be executed once a solr 'collection' in cf9 has been created called 'core0'
http://127.0.0.1:8983/solr/core0/select/?q=Belkin%20AND%20ipod&version=2.2&start=0&rows=10&indent=on&debugQuery=true
Where I'm searching for both 'Belkin' AND 'ipod' together in the same document - from the debugQuery output it states:
+contents:belkin +contents:ipod which essentially converts the AND to plus signs. This is what solr is doing under the hood.
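If you want to try a few criteria strings this way without hand-encoding them, a debug URL like the one above could be assembled along these lines. This is only a sketch in Python: the function name `solr_debug_url` is made up for illustration, and the host, port, and 'core0' collection name are taken from the example URL, so adjust them to wherever your Solr is actually running.

```python
from urllib.parse import quote, urlencode


def solr_debug_url(criteria, core="core0", host="127.0.0.1", port=8983, rows=10):
    """Build a Solr select URL with debugQuery turned on.

    Hypothetical helper; defaults mirror the example URL, not any CF setting.
    """
    params = {
        "q": criteria,          # the raw search string, e.g. "Belkin AND ipod"
        "start": 0,
        "rows": rows,
        "indent": "on",
        "debugQuery": "true",   # asks Solr to include the parsed query in the response
    }
    # quote_via=quote encodes spaces as %20, matching the URL style used above.
    return f"http://{host}:{port}/solr/{core}/select/?" + urlencode(params, quote_via=quote)


print(solr_debug_url("Belkin AND ipod"))
```

Pasting the printed URL into a browser should show the parsed form of the query in the debug section of the response, which is where the +contents:belkin +contents:ipod output above came from.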
I confirmed this myself. So wtf - why would the docs specifically say to use +, -, etc? Does NOT work as well? Going to ping my Adobe contact to find out more.
Best of Luck there - I didn't think Solr was covered well, nor given the attention it deserves - I'm glad they moved in that direction and I will take it as a 1.0 implementation of it (hence the move direct to the solr xml as you can do what you like). I'd also mention the ria forge solColdFusion component that works nicely with your existing Solr Collection is certainly worth checking out.
It's possible Adobe just specified the shorthand, since it's what the major search engines use--like Google. Users are probably more familiar with the +coldfusion +solr syntax.
I'm not sure I agree with that. If I had to show someone +X +Y or X AND Y, I think they would prefer X AND Y.
Agreed Ray - our users certainly prefer 'AND' syntax, as our search logging attests to.
Ray,
Solr has a lot of features that aren't exposed by CF. But there is a javascript library that does expose many more called Ajax Solr:
http://asserttrue.blogspot....
While I haven't used the Ajax Solr yet - javascript is a wee bit easier for me to grok than java.
@Rick: Cool. I would like to do another blog later about using the web service stuff, the solr admin, etc.
Heard back from Adobe. The use of + and not "AND" (or any other keyword except OR once) was a doc mistake. They said the same as above, that AND is certainly supported. I've got to go update my WACK chapter now and restore some of the text I had taken out. I'm also going to go edit the blog entry right now (since some people don't read comments).
Verity in CF7 / CF8 has a limit specified in the documentation, but it really isn't enforced in any way whatsoever, at least not on Enterprise. We're well exceeding the limits and have been for awhile and nothing happens. Unless the limit is "per collection" which isn't really clear. We're exceeding it across 4-5 collections collectively.
In any case, Solr sounds promising. I work for the government so we'll probably have it when CF 10 is out. We just recently (3 months or so) upgraded to CF 8.
Thanks for the post comparing the 2 Ray!
Anyone notice the data that Solr is indexing looks quite a bit different than what Verity is storing? I just did a test with two collections, a Verity one and a Solr one, indexing a single Word document. When I search against the two and dump the results, I see the summary from Verity is much longer than the one returned by Solr, and is cleaner. By cleaner I mean that the Solr result has what I can only assume is MS Word formatting codes. It's pretty ugly and not nearly as useful.
I was getting ready to migrate from Verity to Solr but this really makes me rethink that decision.
I had not seen that, but will test it out myself as well.
Ugh it gets worse. I threw an additional document into my two test collections, the iPhone User's Guide PDF from Apple's site. I searched for "memory", which I can easily see is in there. The Verity search found it just fine, but when I search for it in the Solr collection I get zero results!
I'm also seeing that the title isn't being picked up in Solr. Verity gets the title. I'm not happy. Going to bug some people.
The Verity K2 server can be installed independently on a different server, and then a remote ColdFusion server can be configured to connect to it. This is a great solution for scalability. However, the license agreement with Verity in ColdFusion MX 7, 8, and 9 indicates that only one host is allowed to connect to the K2 server (and there is a config file which enforces this limitation).
If you have multiple CF instances installed on the same server then they can all connect to a single K2 server. But if you have a remotely distributed ColdFusion cluster where clustered instances exist on different servers, then they cannot all connect to the same K2 server.
This is a serious problem limiting scalability. In that scenario, all instances on a particular server host must connect to their own K2 server, and manage its collections there. A CF instance from each server must manage creating, optimizing, and indexing collections, and since this must also be performed by CF instances on the other servers, it becomes a redundant task that is prone to headaches if the different K2 servers get out of sync.
Sooooo, my question is, if I install CF9 with Solr and decide to put Solr on its own server, can I have a CF9 cluster across multiple servers where all instances can connect to a single Solr server?
I suspect that this will be possible. If so, then this is a major advantage because it reduces maintenance efforts, eliminates synchronization problems, and most importantly lets you better scale ColdFusion applications.
I could install and try it out, but I thought I'd share my thoughts on this potential advantage here.
With the Solr 1.4 release, there's a new token filter called the ReversedWildcardFilter which makes it possible to query with a leading wildcard. See http://wiki.apache.org/solr...
I've absolutely no idea about what the Solr schema and queries look like so I cannot comment on the relevancy issues.
There are utilities that you should consider which address these file formats: (a) Tika, at http://www.lucidimagination... ; and (b) Solr Cell, which is part of the LucidWorks Certified Distribution for Solr, documented at http://www.lucidimagination...
I indexed the same collection with Verity and with Solr. All files are .html with normal text content. Verity index size -> 600 Mb Solr index size -> 30 Mb . Is it normal?
No idea. ;) But remember that the Verity code (included with CF) is quite old now - so the index size being large could simply be a reflection of an older engine.
Hey Ray - did you ever learn anything from Adobe on this last issue that Ryan pointed out, where Verity was finding the word, but SolR wasn't, and it's clearly listed in the document?
I'm experiencing something similar and I remember you mentioned this on your blog a year ago so here I am.
My problem is this: I have a document that has these words in it somewhere: Alex and Alexander
When I search (using SolR):
Alex == 0 results
Alexander == 1 result
Alexanders == 1 result (note the "S" at the end of this)
So the problem is that it IS returning a result for a hit that doesn't exist, and it's NOT returning a result for a word that's clearly in there.
I think these might be related, so I'm pinging this thread in case someone else has seen this or has a fix.
G'night and thank you!
Solr support was improved in 9.0.1. Are you running that version?
This is a really nice feature and a good explanation of why to use Solr over Verity search.
Thanks
Hello Everyone
I'm facing one issue with NOT operator in Lucene Solr.
Issue: uppercase NOT is treated as an operator and gives me correct results, but lowercase 'not' is not treated as an operator, whereas AND/OR/and/or are all treated as operators.
Can you please help me here ?
I'm not sure you're right. When I test here, for example, I see a different # of results for "coldfusion and phonegap" versus "coldfusion AND phonegap".
Actually - check the blog post itself (did you read it? :) I cover NOT. You are supposed to use -, not NOT.