Posted in ColdFusion | Posted on 08-20-2009 | 12,531 views
Last night I decided to whip together a simple example of how to add Solr search indexing to an application. Luckily, for the most part, this is the exact same process we've been using for years now with Verity. I know many people avoided Verity due to the document size limits so with that in mind, I thought a simple ColdFusion 9 example would help introduce the feature. To start off with, let me show you a simple application that has no search capability at all. This will be the first draft application that I'll modify to add Solr support.
My application is a Press Release viewer. The public page consists of a list of press releases. You click on a press release to view the details. The admin folder (and for this proof of concept it won't have any security) allows for basic CRUD operations. I won't show most of the code as it's rather boring, but I'll demonstrate my Application.cfc and the model layer. First, the Application.cfc file:
component {

    this.name = "pressreleases";
    this.ormenabled = true;
    this.datasource = this.name;
    this.ormsettings = {
        dialect="MySQL",
        dbcreate="update",
        eventhandling="true"
    };
    this.mappings["/model"] = getDirectoryFromPath(getCurrentTemplatePath()) & "model";

    public boolean function onApplicationStart() {
        application.prService = new model.prService();
        return true;
    }

    public boolean function onRequestStart(string page) {
        // ?init=1 on any URL reloads ORM and restarts the application
        if(structKeyExists(url, "init")) { ormReload(); applicationStop(); location('index.cfm?reloaded=true'); }
        return true;
    }
}
Nothing too fancy here - I've enabled ORM, allowed for easy restarts, and created a grand total of one CFC in the application scope, the prService. The prService is simply a component to abstract access to my press release model. The press release entity is just:
component persistent="true" {

    property name="id" generator="native" sqltype="integer" fieldtype="id";
    property name="title" ormtype="string";
    property name="author" ormtype="string";
    property name="body" ormtype="text";
    property name="published" ormtype="date";

}
And the service provides an abstraction layer to it:
component {

    public function deletePressRelease(id) {
        entityDelete(getPressRelease(id));
        // flush so the delete is visible to a list call in the same request
        ormFlush();
    }

    public function getPressRelease(id) {
        // an empty id means "give me a new, unsaved press release"
        if(id == "") return new pressrelease();
        else return entityLoad("pressrelease", id, true);
    }

    public function getPressReleases() {
        return entityLoad("pressrelease");
    }

    public function getReleasedPressReleases() {
        return ormExecuteQuery("from pressrelease where published < ? order by published desc", [now()]);
    }

    public function savePressRelease(id,string title,string author,date published,string body) {
        var pr = getPressRelease(id);
        pr.setTitle(title);
        pr.setAuthor(author);
        pr.setPublished(published);
        pr.setBody(body);
        entitySave(pr);
    }
}
I assume most of this makes sense. Note that I have both a getPressReleases function and a getReleasedPressReleases function. The latter handles the public view and only gets press releases where the published date is in the past. Notice that savePressRelease is kind of nice - it just plain works whether you have a new press release or an existing one. Also make note of delete. In order to handle calling a delete operation followed by a list, I force a flush on the ORM stuff. If I didn't, the deleted item would show in the list during the same request.
You can download all of this code at the bottom, and again, I don't want to waste too much time on basic list/edit forms. What I want to talk about instead is the process of enabling Solr searching support for this application.
When you work with Solr (and Verity as well), you work with an index of your data. This index, much like an index in a book, represents all the data that you want to be searchable. However, and this is the critical point, it is your responsibility to keep the index up to date. That means every time you add, edit, or delete content, you have to update the index. The maintenance aspect then is typically the most complex part of the process. Searching really just comes down to one tag.
I normally create a "Ground Zero" type script that handles creating my collection and index from scratch. (Think of the collection just as the folder or name of the index.) This is useful to run during testing or if you encounter a bug where your index gets out of date. I created the following script for that purpose:
<cfcollection action="list" name="collections" engine="solr">

<!--- collection check --->
<cfif not listFindNoCase(valueList(collections.name), application.collection)>
    <cfoutput>
    <p>
    Need to create collection #application.collection#.
    </p>
    </cfoutput>

    <cfcollection action="create" collection="#application.collection#" engine="solr" path="#application.collection#">
</cfif>

<!--- nuke old data --->
<cfindex collection="#application.collection#" action="purge">

<!--- get data --->
<cfset prs = application.prService.getPressReleases()>
<cfoutput><p>Total of #arraylen(prs)# press releases.</p></cfoutput>

<!--- convert to a query --->
<cfset data = entityToQuery(prs)>

<!--- add to collection --->
<cfindex collection="#application.collection#" action="update" body="body,title" custom1="author" title="title" key="id" query="data">

<p>Done.</p>
I begin by getting a list of collections. The ColdFusion 9 docs say that if you leave the engine attribute off the cfcollection tag it will return everything. I did not see that, so I filed a bug on it. For now, I've just added the engine attribute. This returns a query of collections. If I don't find my collection in there (I created an application variable to store the name) then I create one. In theory, this will only happen one time.
Next I remove all data from the collection with the purge. Again, I'm thinking that this script would be useful both for a first time seeding of the index as well as a 'recovery' type action.
Once we have an empty index, I get all of my press releases and convert them to a query with the entityToQuery function.
Lastly, I simply pass that query to the cfindex tag. Now, here is an important part. When you pass data into the index, you get to decide what gets stored in the body and what, if anything, gets stored in the 4 custom fields. I decided that the body and title made sense for the searchable information. I repeated title again for the title attribute. This will let me get the title in search results. For the custom field I used the author. Again, this was totally up to me and what made sense for my application.
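If entityToQuery isn't an option for your data, the same index call can be fed by a query built by hand. Here is a rough sketch of that idea, not code from the demo. It assumes every press release has a non-null title, author, and body, and it reuses the column names from the script above:

```cfml
<!--- sketch only: build the query by hand instead of calling entityToQuery --->
<cfset prs = application.prService.getPressReleases()>
<cfset data = queryNew("id,title,author,body", "integer,varchar,varchar,varchar")>

<cfloop array="#prs#" index="pr">
    <!--- one row per entity; assumes none of these getters returns null --->
    <cfset queryAddRow(data)>
    <cfset querySetCell(data, "id", pr.getId())>
    <cfset querySetCell(data, "title", pr.getTitle())>
    <cfset querySetCell(data, "author", pr.getAuthor())>
    <cfset querySetCell(data, "body", pr.getBody())>
</cfloop>

<!--- same single cfindex call as before, fed by the hand-built query --->
<cfindex collection="#application.collection#" action="update" body="body,title" custom1="author" title="title" key="id" query="data">
```

Passing one query to a single cfindex call tends to be faster than calling cfindex once per record.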
Alright, so at this point we can run the script to create our collection and populate the index. I then switched gears and worked on the front end. I created a new search template to handle that:
<cfparam name="url.search" default="">
<cfparam name="form.search" default="#url.search#">
<cfset form.search = trim(form.search)>

<form action="search.cfm" method="post">
<cfoutput><input type="text" name="search" value="#form.search#"> <input type="submit" value="Search"></cfoutput>
</form>

<cfif len(form.search)>
    <cfsearch collection="#application.collection#" criteria="#form.search#" name="results" status="r" suggestions="always" contextPassages="2">

    <cfif results.recordCount>

        <cfoutput>
        <p>There were #results.recordCount# result(s).</p>
        <cfloop query="results">
        <p>
        <a href="detail.cfm?id=#key#">#title#</a><br/>
        #context#
        </p>
        </cfloop>
        </cfoutput>

    <cfelse>

        <p>
        Sorry, but there were no results.
        <!--- trim is in relation to bug 79509 --->
        <cfif len(trim(r.suggestedQuery))>
        <cfoutput>Try a search for <a href="search.cfm?search=#urlEncodedFormat(r.suggestedQuery)#">#r.suggestedQuery#</a>.</cfoutput>
        </cfif>
        </p>

    </cfif>

</cfif>
Going line by line, we begin with some simple parameterizing of a search variable, along with a basic form. If the user actually searched for something, we use cfsearch. As you can see, it works pretty simply. Pass in a criteria and a name for the results and you are done. The status attribute is not necessary but provides some cool functionality I'll describe in a bit.
If we have any results, I simply loop over them like any other query. The context is created by Solr based on your matches. So if you searched for enlightenment (don't we all), then the context will show you where it was found in the data.
The cool part is the else block. Solr (and Verity before it) provided a nice feature for searches called suggestions. Let's say a user wanted to search for Dharma but accidentally entered Dhrma. In some cases, the Solr engine can recognize the typo and will actually return a suggested query: Dharma. Pretty cool, right? Please note that the trim in there is due to another bug I found. In cases where Solr could not find a suggestion, it returned a single space character. I'm sure this will be fixed for the final release. If we do get a suggested query then we simply provide a link to allow the user to try that instead.
So far so good. Now let's talk about keeping the index up to date. If you remember, I had built a simple service component, prService, to handle all CRUD operations for my data. Because I did that, it was rather simple to handle the changes necessary for my index. First, my Application.cfc onApplicationStart was modified to support passing in the collection name:
public boolean function onApplicationStart() {
    application.collection = "pressreleases";
    application.prService = new model.prService(application.collection);
    return true;
}
And then prService was modified to support it. Unfortunately, there are no script based alternatives for Solr/Verity support. To be honest, it would probably be trivial to create such a component. (In case you didn't know, the ColdFusion 9 script based support for mail, and other things, was done this way.) I ended up simply rewriting my component into tags:
<cfcomponent output="false">

    <cffunction name="init" output="false">
        <cfargument name="collection">
        <cfset variables.collection = arguments.collection>
    </cffunction>

    <cffunction name="deletePressRelease" output="false">
        <cfargument name="id">

        <cfset entityDelete(getPressRelease(id))>
        <cfset ormFlush()>

        <!--- update collection --->
        <cfindex collection="#variables.collection#" action="delete" key="#id#" type="custom">

    </cffunction>

    <cffunction name="getPressRelease" output="false">
        <cfargument name="id">

        <cfif id is "">
            <cfreturn new pressrelease()>
        <cfelse>
            <cfreturn entityLoad("pressrelease", id, true)>
        </cfif>
    </cffunction>

    <cffunction name="getPressReleases" output="false">
        <cfreturn entityLoad("pressrelease")>
    </cffunction>

    <cffunction name="getReleasedPressReleases" output="false">
        <cfreturn ormExecuteQuery("from pressrelease where published < ? order by published desc", [now()])>
    </cffunction>

    <cffunction name="savePressRelease" output="false">
        <cfargument name="id">
        <cfargument name="title">
        <cfargument name="author">
        <cfargument name="published">
        <cfargument name="body">

        <cfset var pr = getPressRelease(id)>
        <cfset pr.setTitle(title)>
        <cfset pr.setAuthor(author)>
        <cfset pr.setPublished(published)>
        <cfset pr.setBody(body)>
        <cfset entitySave(pr)>

        <!--- update collection --->
        <cfindex collection="#variables.collection#" action="update" key="#pr.getId()#" body="#pr.getBody()#,#pr.getTitle()#" title="#pr.getTitle()#" custom1="#pr.getAuthor()#" type="custom">

    </cffunction>

</cfcomponent>
If we ignore the tags, the only changes are the cfindex tags in deletePressRelease and savePressRelease. In both cases it isn't too difficult. The key attribute refers to the primary key in the index. We used the database ID record so it's what we use when updating/deleting. The update action works for both additions and updates, so that is pretty simple as well.
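As for the idea of a small component that lets script-based code talk to the index, here is a minimal sketch of what that might look like. The component name (searchIndex.cfc) and its method names are made up for illustration and are not part of the download; the cfindex attributes are the same ones used in the service above.

```cfml
<!--- model/searchIndex.cfc: hypothetical tag-based wrapper around cfindex --->
<cfcomponent output="false">

    <cffunction name="init" output="false">
        <cfargument name="collection" type="string" required="true">
        <cfset variables.collection = arguments.collection>
        <cfreturn this>
    </cffunction>

    <!--- add or refresh a single record in the index --->
    <cffunction name="update" output="false" returntype="void">
        <cfargument name="key" type="string" required="true">
        <cfargument name="body" type="string" required="true">
        <cfargument name="title" type="string" default="">
        <cfargument name="custom1" type="string" default="">
        <cfindex collection="#variables.collection#" action="update" type="custom"
            key="#arguments.key#" body="#arguments.body#"
            title="#arguments.title#" custom1="#arguments.custom1#">
    </cffunction>

    <!--- remove a single record from the index by its key --->
    <cffunction name="remove" output="false" returntype="void">
        <cfargument name="key" type="string" required="true">
        <cfindex collection="#variables.collection#" action="delete" type="custom" key="#arguments.key#">
    </cffunction>

</cfcomponent>
```

With something like this in place, the script version of prService could stay in script and call, for example, searchIndex.update(pr.getId(), pr.getBody() & "," & pr.getTitle(), pr.getTitle(), pr.getAuthor()) right after entitySave(pr).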
Unfortunately, I ran into an issue with deletes. Delete operations are 100% broken in the current release of ColdFusion 9, at least on the Mac (and I bet it works ok in Verity). Keep this in mind as you play with the demo code. I've been told this is fixed already.
So what do folks think? Will you use this when you upgrade to ColdFusion 9? Also, have you noticed the slight logic bug with search? I won't say what it is - but I'll tackle it in the next post.


Isn't a "slight logic bug" illogical?
@Shannon: To be honest, that isn't too exciting. I was able to skip quite a bit with one call: entityToQuery, but if I couldn't do that, then I'd simply loop over and make a query by hand. I _could_ do N cfindex calls as well, but that tends to be slow. If folks do feel a more complex example would be warranted, then I can definitely consider it for the next post.
So does anyone see the security error with the search?
When I saw Ben Forta speak with Adam Lehman at NYCFUG, it was my understanding that every tag but 1 (I forget which one) would be available in CFScript syntax.
how well does this work with verity/k2 ?
From my experience there is a massive performance hit when calling cfindex using Verity and updating only 1 record. Since moving from CF6 to CF7/8 we have had to restructure our apps to run cfindex via a schedule and pass a query with a large number of records to it, i.e. cfindex with 1 record = 10 seconds, cfindex passing 100 records = 10 seconds.
Does CF9 bring back the vdk style of updating your index on save? From your example, it looks like it.
To your last question, I'm not sure what you mean by 'vdk style of updating' - but - as far as I know, this code should just plain "work" if you switched from Solr to Verity - in fact, I'm willing to bet the delete bug doesn't exist on the Verity side.
K2 in 7/8 would take seconds to update just 1 record in the index. It was so slow for us we had to defer indexing out of the save operation on records and into a scheduled task.
anyways im excited to see this solr example in CF9
Have you tried to use custom1 in cfsearch with solr?
<cfsearch collection="#arguments.collection#" type="simple" criteria="CF_CUSTOM2 <matches> #newResId#" name="qTemp" />
Throws an error for me.
CF9 doc say custom1 .. 4 are for verity only
http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef...
However, one corrupt PDF threw an error,
but another one just hangs the systems???
1) A bad PDF that gets indexed causes the entire collection to stop working. Right?
2) It also sounds like another bad PDF caused the server to hang. Right?
Please confirm as to me it sounds like 2 separate issues.
There are (at least) two corrupt PDF files to be added to the collection. One throws an error (which can be handled with try/catch). The other is hanging CF - no error thrown, not even a timeout.
I took out cfsearch as I thought that was causing the error - but still hangs.
I added cfpdf getinfo (removing cfindex) and that hangs the system on the same file.
I will do more testing tomorrow when I am back in the office and let you know what I discover.
http://cfbugs.adobe.com/cfbugreport/flexbugui/cfbu...
but not sure how to supply my test code and files
So, instead of:
<cfset Application.prService = CreateObject("component","Model.prService")>
we now say:
this.mappings["/model"] = getDirectoryFromPath(getCurrentTemplatePath()) & "model";
application.prService = new model.prService();
?
Make sense?
To be clear, CF ORM doesn't really DEMAND you follow any particular type of way of coding. So don't take what I say as the One True Way.
Thank you!
custom1:NNNNN
I am indexing hundreds of .htm docs with a 'question' in the <title> and the 'answer' in the <body>.
1) Will Solr prioritize the title in my cfsearch? I mean, is the title more important for the engine, isn't it?
2) And how can I add one or more categories/tags to my doc in the index, e.g. based on the argument? I mean, how can I read a html tag (ex. <h1>) and put its content into an index field? is it possible?
2) You can use categorization when you index data. If you are indexing files, it means you have to switch to a more manual process, but you can do it. The cfindex tag supports the category and categorytree arguments. You also have 4 custom fields.
I knew that categorization would let me achieve my goal but... no reference on how to do it actually. I mean how can I assign a document to the 'red' category and another to the 'blue' one?
To be honest, here's what I'm trying to achieve:
my .htm doc contains <title>What color are your eyes?</title><body>They are blue.</body> . I want to search "are your eyes blue?" or "tell me your eyes' color" and get that as a result. Solr doesn't seem to get the more relevant word 'eyes'.. It highlights 'what color' or 'me your' .. wtf?!? Also, I am working in italian language (cfcollection cfindex cfsearch using language='italian').
Indeed, for categorizing, do I have to put some category tag (e.g. <h1>Blue</h1>) and tell solr CUSTOM1="h1" ?
Feel free to send me an e-mail if you can help me, please.
Thanks in advance!
http://help.adobe.com/en_US/ColdFusion/9.0/Develop...
However, in terms of file based indexing the category/categoryTree you use is assigned to every thing you index. In order to apply a unique value to each file, you would need to a) decide on your business logic (ie, WHAT cat goes with what file) and b) index one file at a time.
Thank you.
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_c...
By using the "custom" attributes for my DB fields (custom1="title" custom2="description" etc), I was able to "boost" title in the criteria as follows:
criteria="custom1:#searchStr#^2 custom2:#searchStr#"...
Thank you once again!
I can get it to work by using +custom1:(+hello +world) but that seems wrong, I would have thought +custom1:"hello world" would work.
Use custom1:"hello world"~1000000 , I nearly had it but I'd left the 1000000 (slop?) off of it when testing.
This obviously will not work for a production environment since anytime the server restarts, searching will not work in any applications that have implemented it.
ColdFusion 9.0.1 Standard on Linux - web root is /www, but web sites are hosted in /www/sitename.com.
When I index /www/sitename.com, and search the index, I get hits from /www/CFIDE. I have no symlink, only the mapping in the CF administrator.
My "common sense" tells me Solr shouldn't index anything not physically present in the recursed directories. Is there some known setting that I need to flip to prevent this behavior, or am I completely misunderstanding the issue here? I swear I'm not a complete idiot. Mostly.
But since I've gone and revived this old thread, I'd love to hear any recommendations for a good solution to search CFML content. Is there a site spider plugin that might integrate with Solr somehow?
I need to boost scoring for title and by using this snippet:
criteria="custom1:#searchStr#^2 custom2:#searchStr#".
(from above) the search doesn't return anything.
This is my statement:
<cfindex collection="myCol" action="update" body="title,description" custom1="description" title="title" key="ID" query="myQ">
<cfsearch collection="myCol" criteria="title:#searchStr#^10 custom1:#searchStr#^5" name="searchResult" >
Thanks
Mike
I searched all over the place and can't find a reason for it.
Very strange.
Can somebody test and see if they can use title: in the search criteria to boost the score?
Thanks
Mike
Have you tested with the Solr search index with more than 10,000 accessions of contents and 5000 hits a day in it?
I'm presenting on SOLR at RIAUnleashed, and will have a small sample app then. I can try running JMeter against the site and see how it holds up. But to be honest, 10K hits in a day isn't a whole heck of a lot.
Thanks,
Destroyed the collection and rebuilt. Reindexed. Still no PDFs.
Thanks for the help.
David
The collection was created from the website via the CF ADMIN pages.
The index was created from a webpage using CFINDEX. I will post the code below, just in case...
<CFSET IndexCollection = "psolr">
<CFSET IndexDirectory = "d:\prelude-printed\docs">
<CFSET IndexRecurse = "YES">
<CFSET IndexExtensions = ".*">
<CFSET IndexLanguage = "english">
<CFINDEX
collection="#IndexCollection#"
action="update"
type="PATH"
key="#IndexDirectory#\"
extensions="#IndexExtensions#"
recurse="#IndexRecurse#"
language="#IndexLanguage#"
URLPATH="http://newdev.hipco.com/preludedocuments/docs/&quo...;
>
I have contacted Adobe support in regards to this. Waiting to hear from them. The server is a new install of W2K3, CF9 with the update and CHF applied. No other applications are on the server, other than IIS.
Thanks to your suggestions, I thought it was very weird that it was not indexing the PDFs that Verity has previously indexed (different collection). I threw in the status attribute as you suggested. I was able to see the txt and cfm files that I had also added to that directory to be indexed. Those all indexed just fine, but none of the PDFs in same directory.
I then decided to put in some other documents (XLS, BMP, DOC, etc...) into the directory to see if the indexing was going to work. It did!
I then found some "other" PDFs, put them in the directory, and they indexed!
Background: The PDFs are created by a system called DCS, which we use for invoice printing for our ERP system. It is odd to me that Verity can index these PDFs while Solr cannot. I took a look at the PDF, it is compatible with 5.x and greater. There must be something with these PDFs that Solr does not like.
Also - you may want to try using CFPDF to read the text from them. If you can, you can index them manually.
It seems that a PDF is not a PDF. Adobe called the PDF corrupt, even though it can be indexed by Verity, open/modified by Acrobat, and modified by CFPDF.
We are going to create a PDF from a PDF (using the CFPDF tag) and then letting Solr index that PDF. We tested the process and it works!
Thanks to all that helped in this.
David
thanks
I'm sure the problem is you have it locked down so solr is only accessible to localhost. You need to open it up to allow access from your web server. You might need to do this both at the solr level, and at the machine or network firewall level, depending on your setup.
For solr, you basically have to do the opposite of this tech article, either allowing from any IP, or just from your CF server's IP: http://kb2.adobe.com/cps/807/cpsid_80719.html
Thanks again
I am new to ColdFusion and I am reading up on how to use Solr. I am reading through your Solr example on http://www.coldfusionjedi.com/index.cfm/2009/8/20/... and I have downloaded the zip. However, when I launched it locally, I get the error: 'Datasource pressreleases could not be found.' Can you tell me what I must do for this?
Thanks for your help!
Conor
Using
<cfsearch collection="mycollection" criteria="#searchString# custom2:#searchString#" name="q" status="r" suggestions="always" contextPassages="0">
I get a big fat error:
here was a problem while attempting to perform a search.
Error executing query : orgapachelucenequeryParserParseException_Cannot_parse_custom_Encountered_EOF_at_line_1_column_7__Was_expecting_one_of____________________QUOTED_______TERM_______PREFIXTERM_______WILDTERM_____________________NUMBER_______
Thanks
Mike
I thought I have the syntax wrong, but didn't find anywhere a different way of doing it.
BTW I'm on CF 9.01.
Thanks Ray for your quick response
Not the ideal solution but it's working. I will try on a different server and let you know.
eagerly waiting for CF10. I hope solr will finally have all the features enabled and working.
Thanks again
Mike
I can say that in my CF/Solr preso next week I'm revealing one of the new things in CF10. You should attend. :)
Is it on the cfmeetup or something else?
Please provide URL if it's available.
Thanks
http://www.adobe.com/cfusion/event/index.cfm?event...
Also, I really hope that some advanced features are shown including the ability to access solr directly as you mention in the previous message.
Thanks
Basically you just need to make HTTP requests with the right url parameters. Check the Solr docs for information on that.
If you want to search for a specific item, remember you can search against the key field. I'm not having luck now using a file based key, but I'm pretty sure it _does_ work.
Did you have a chance to see if ranking works in CF implementation.
See my previous post (41).
Thx
Mike
cffeed title:cfthread
Which means: cffeed in the body or cfthread in the title.
I then did
cffeed title:cfthread^100
And the result with cfthread in the title popped to the top rank wise.
Maybe I'm doing something wrong, even though if I would have it wrong (from a syntax point of view) I would expect an error, But I get no results.
If I do a normal search without ranking I get results
Thx for the quick reply
Mike
Unfortunately it's me again, For the life of me I can't get how to boost certain fields so they show at the top.
CF version: 9,0,1,274733
Update Level chf9010002.jar
I'm using the following code:
<cfquery name="getBooks">
select bookID, title, url, datePublished, author, description
from books
</cfquery>
<cfindex collection="books" action="update" body="description"
custom1="datePublished" custom2="url"
custom3="author" title="title"
type="custom" key="bookID" query="getBooks">
<cfsearch collection="books" criteria="title:#searchCriteria#^10" name="results" status="r" suggestions="always"
ContextBytes="1000" ContextPassages="4">
When I run this I get 0 results.
If I removed the title from the criteria:
<cfsearch collection="books" criteria="#searchCriteria#" name="results" status="r" suggestions="always"
ContextBytes="1000" ContextPassages="4">
I get all the hits but entries without the search query in the title have higher ranking than the ones with the criteria.
Is this the correct syntax? do I have to make changes to the solrconfig.xml file?
Based on all I have read it should work, but it doesn't (for me).
Please help.
Thanks
Mike
Is your installation standard?
Do I have to update the installation of solr?
Can you please send me your solr config xml file.
I have read that I can play with that.
Thanks Ray for the quick response on a Saturday.
Mike
My solr config wasn't modified. I just made an index of the cfdocs that ship with CF. If you make one it should be the same.
If it's not too much to ask can you please (when you have some time) post the exact statements to create the collection, index and search with boost.
Definitely I am missing something.
Thanks again
Mike
I searched for rss and number one was: Adobe ColdFusion * cffeed
I then did
rss OR title:RSS
and then an item with RSS in the title went to the top. When I boosted the score by 10, it stayed at number one, but oddly the score went down by 10. So... um... not sure.
Thanks
and have a great week-end
And while the order changed there were more entries at the top without the search word in the title.
Thanks
Mike