After my presentation last week I had a few ColdFusion/Solr questions to follow up on. Here are two of them.
- Can you use Solr with content indexed on Amazon S3?
Yes and no. The main answer is no. The code below is what I used to test:
<cfset s3dir = "s3://myaccess:mysecret@s3.coldfusionjedi.com">
<cfdirectory directory="#s3dir#" name="files">
<cfdump var="#files#">
<cfoutput>Indexing #s3dir#<p></cfoutput>
<cfindex action="update" collection="indextest1" type="path" key="#s3dir#" recurse="true" status="result" extensions=".txt,.pdf">
<cfdump var="#result#" label="Result of update operation">
When run, you get: The key specified is not a directory: s3://myaccess:mysecret@s3.coldfusionjedi.com. The path in the key attribute must be a directory when type="path".
Obviously "myaccess" and "mysecret" were real values in my test, but nonetheless, this isn't supported. I'm not terribly surprised by this. ColdFusion speaks to Solr and asks it to index a folder, but in this case the folder is only 'reachable' via ColdFusion.
However, you can still make use of S3 and Solr indexing together. Whenever you move a file to S3, simply run the index operation first: let Solr index the local file, and then push it off to S3.
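As a rough sketch of that "index first, then push" idea, something like the following could work. The local path and the S3 file name are made up for illustration, and I'm assuming the server's file tags accept s3:// paths the same way the cfdirectory call above does.

```cfml
<!--- Hypothetical paths: adjust both for your own setup. --->
<cfset localFile = expandPath("./uploads/report.pdf")>
<cfset s3Target = "s3://myaccess:mysecret@s3.coldfusionjedi.com/report.pdf">

<!--- 1. Index the file while it is still on the local file system. --->
<cfindex action="update" collection="indextest1" type="file"
         key="#localFile#" status="result">
<cfdump var="#result#" label="Result of update operation">

<!--- 2. Copy the file to S3, then remove the local copy. --->
<cffile action="copy" source="#localFile#" destination="#s3Target#">
<cffile action="delete" file="#localFile#">
```

Keep in mind that the key stored in the collection is still the old local path, so when you display search results you would need to map that key to the file's S3 location yourself.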
- Can you index a file and a db record together in the same search "row"? I know Solr can handle it if you roll the code manually, but can this be done with the CF tags?
Again - yes and no. The tag that indexes file-based data and query-based data (cfindex) can only do one type at a time, so with just one tag you couldn't do this. However, if you read and parse the file yourself (for example, using cfpdf to read in the text of a PDF), you can then merge that textual data with any other database data when you add it to the index. I'm not sure how useful this would be. I could see merging file data with database information being stored in the custom fields, though.
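Here is a minimal sketch of what that merge might look like. The datasource (myDSN), the documents table, and its columns are all hypothetical, and the tag stripping is deliberately crude: cfpdf's extracttext action returns XML by default, so the regex just throws the markup away and leaves any XML entities in place.

```cfml
<!--- Hypothetical inputs: a document id, a datasource, and a table layout. --->
<cfset docId = 42>
<cfquery name="doc" datasource="myDSN">
    SELECT id, title, author, filepath
    FROM documents
    WHERE id = <cfqueryparam value="#docId#" cfsqltype="cf_sql_integer">
</cfquery>

<!--- Read the text of the PDF ourselves instead of letting cfindex do it. --->
<cfpdf action="extracttext" source="#doc.filepath#" name="pdfText">

<!--- Crude cleanup: replace every XML tag with a space. --->
<cfset plainText = reReplace(toString(pdfText), "<[^>]+>", " ", "all")>

<!--- Add one custom record that carries both the file text and the db data. --->
<cfindex action="update" collection="indextest1" type="custom"
         key="#doc.id#" title="#doc.title#"
         body="#doc.title# #doc.author# #plainText#"
         custom1="#doc.author#" custom2="#doc.filepath#">
```

With type="custom" and no query attribute, the body value is indexed as literal text, so a search for a word matches whether it appears in the PDF or in the metadata, and the custom fields carry the database values back in the search results.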
Archived Comments
First off, thanks for your great resources on this site. Now to Q#2 above: this is exactly my issue in the system I'm building.
I am creating a searchable document repository where a significant amount of metadata about a binary doc is stored in a database record. I was planning on using MSSql2008 fulltext BLOB search, which works great, actually, but I'm not able to get my corporate webhost to comply so I'm looking into storing the docs in the filesystem. I then need to link the DB metadata with its file.
What I need to be able to do is search for a word -- let's say "elbow" -- and get one record returned whether "elbow" shows up in a doc or in the title (or any other) metadata field.
I don't fully understand your description. Are you saying that I should parse the text of the uploaded file into a database field and use that for search, and then in my search results just point to the actual file in the filesystem instead?
Well, you could do a query against the metadata table, and then do a cfsearch. This gives you two queries. You could loop over the first, then loop over the second, and for each row of the second check whether it already exists in the first; if so, skip it. valueList() is a quick way to get a 'column' of values out of a query.
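Roughly, that two-query approach might look like the sketch below. The datasource, table, and column names are invented for the example, and it assumes the collection was indexed by file path so that the key column of the search results lines up with a filepath column in the metadata table.

```cfml
<cfset term = "elbow">

<!--- Query 1: metadata matches. Table and column names are hypothetical. --->
<cfquery name="metaHits" datasource="myDSN">
    SELECT id, title, filepath
    FROM documents
    WHERE title LIKE <cfqueryparam value="%#term#%" cfsqltype="cf_sql_varchar">
</cfquery>

<!--- Query 2: Solr matches against the file contents. --->
<cfsearch collection="indextest1" criteria="#term#" name="fileHits">

<!--- valueList() pulls the filepath 'column' into a pipe-delimited list. --->
<cfset seenPaths = valueList(metaHits.filepath, "|")>

<cfoutput>
<cfloop query="metaHits">
    #htmlEditFormat(metaHits.title)# (#htmlEditFormat(metaHits.filepath)#)<br>
</cfloop>
<cfloop query="fileHits">
    <!--- Skip any file that was already listed by the metadata query. --->
    <cfif not listFindNoCase(seenPaths, fileHits.key, "|")>
        #htmlEditFormat(fileHits.key)#<br>
    </cfif>
</cfloop>
</cfoutput>
```

The pipe delimiter is there because file paths can contain commas, which would break a default comma-delimited list. The result is one line per document, whether the term matched the title or the file contents.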