I updated Seeker a few minutes ago. This is my code that wraps Lucene functionality. If that sounds like a type of mouthwash to you - just think of Lucene as a search engine, much like Verity, except that Lucene is free and open source. It also runs just fine on OS X.
The updates I included in Seeker are just bug fixes, but pretty critical bug fixes. Later this week I hope to have the ColdFusion Administrator pages built in to make it even easier to use. I'll be mimicking the Verity admin UI (pretty much) but will also include a search tool (like my Verity one) that will let you search indexes directly from the administrator.
p.s. And while I have your attention - my work on wrapping SVNKit as a possible replacement for the front end SVN stuff for RIAForge is close to being done. I'll be releasing that code as well (most likely).
Archived Comments
Fantastic! I hadn't heard of Lucene before and it comes at just the right time. I am working on a question and answer site which requires a fast and efficient search engine (with relevancy etc). I was set on Verity but I'm concerned about the search limits (you can only index a certain amount under the normal license can't you?). Lucene looks like it could be a good alternative.
So does it work in a very similar way, and would you recommend it as a good Verity alternative?
Verity does have a limit - 250k - which I think is pretty reasonable. Don't get me wrong - I love Verity - and I think people don't give it enough credit, nor thank Adobe enough for shipping a -very- expensive search server w/ the product for free.
Does it work in a similar way: Kinda. :) Like Verity, you have two main parts. Part one is creating and maintaining the index. Part two is the searching. I tried to make things very much like the Verity API in CF.
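As a rough illustration of that two-part split (this is not Seeker's code or the Lucene API, just a toy sketch in Python, with hypothetical names such as `build_index` and `search`), part one builds an inverted index from documents and part two looks terms up in it:

```python
from collections import defaultdict


def build_index(docs):
    """Part one: map each lowercased word to the set of document keys containing it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for word in text.lower().split():
            index[word].add(key)
    return index


def search(index, term):
    """Part two: look up a single term and return the matching keys, sorted."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample data, invented for this sketch.
docs = {
    "post-1": "Seeker wraps Lucene for ColdFusion",
    "post-2": "Verity ships with ColdFusion",
}
index = build_index(docs)
print(search(index, "coldfusion"))
print(search(index, "lucene"))
```

A real engine adds tokenizing, relevancy scoring and on-disk storage on top of this, but the index-then-search shape is the same.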
Would I recommend it? My code has had VERY little usage. I think about 2 people have used it. To me - that's a bit scary. But - we got to start someplace. ;)
While I think Lucene is a great open source solution, the lack of support for most common file formats is problematic.
There are ways of dealing with this, to a degree, but it sure is nice that Verity handles the document conversions out of the box.
On your RIA Forge page, you say that Lucene is a good candidate for people who can't run Java. I'm guessing that you meant Verity? :)
This is definitely something that I will be looking into using, along with your FeedBurner CFC!
@Chris - Oops, fixed. Thanks.
@Gus - There is another project at Apache that helps with this, but I haven't worked much with it. I built Seeker though so that it is easy to extend. Download it and look at how I built the readers. To add support for format X, you just add a CFC. Todd Sharp is going to share some PPT code with me soon.
Sweet. Thanks Ray!
We host our sites on OS X Xserves running CF8, so Verity is not an option. Ray's Seeker code has been a life saver for our query based searches. We have about 6 production sites using it!!
The latest updates are looking promising for file based searches, which were not working properly in the previous version. I really look forward to seeing the cfadmin stuff and will continue to test the code. Adding some more file readers will be really useful, although PDF and HTML files are covered and these are the most common ones we index.
Keep up the excellent work Ray, I seriously don't know how you find the time.
Thanks Ray - I'm still going to consider Verity, but the limit is a little worrying as the site I'm working on has the potential to smash through those limits in time. Is the limit based on all collections or per collection?
I will give Lucene a test anyway - I'm only interested in query based indexing so it looks ideal.
The limit in Verity is per box, not per collection.
I've got a new release of Seeker coming out later today. It just adds the ability to search N fields (thanks to AJ Mercer) and cleans up the zip a bit.
I also need to look into index operations like update/delete. It's going to suck if you have to blow away your index for every update.
I have just been playing with Seeker, and for those of us working on a Mac it provides a superb alternative to Verity.
What file formats can currently be read and indexed?
Thanks
PDF, DOC, txt, html. It will also try to read any other file as plain text and attempt to get something out of it.
brilliant, thanks.
Would it be possible to index metadata from images? Or would that be something that could be added?
One of the things I'm proud of is how easy it is to add 'indexers' to Seeker. You basically just write the CFC. So if you were to write the CFC for gif, jpg, tiff, whatever, and you used CF8 image funcs to get the metadata, your job is basically done. If you do so and share it with me, I'd most likely add it to the core project.
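To give a feel for that kind of design, here is a minimal sketch in Python rather than CFML. It is not how Seeker is actually written: the `READERS` registry, the `reader` decorator and `extract_text` are hypothetical names. Each format gets one small reader, and anything unrecognized falls back to being read as plain text:

```python
import re
import tempfile
from pathlib import Path

# Hypothetical registry: file extension -> function that returns indexable text.
READERS = {}


def reader(*extensions):
    """Register a reader function for one or more file extensions."""
    def decorator(func):
        for ext in extensions:
            READERS[ext.lower()] = func
        return func
    return decorator


def read_plain(path):
    """Fallback reader: treat the file as text and ignore undecodable bytes."""
    return Path(path).read_text(encoding="utf-8", errors="ignore")


@reader(".html", ".htm")
def read_html(path):
    """Crude HTML reader: drop tags and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", read_plain(path))
    return " ".join(without_tags.split())


def extract_text(path):
    """Pick a reader by extension, falling back to plain text."""
    return READERS.get(Path(path).suffix.lower(), read_plain)(path)


# Small demo using temporary files.
with tempfile.TemporaryDirectory() as folder:
    page = Path(folder) / "page.html"
    page.write_text("<p>Hello <b>world</b></p>", encoding="utf-8")
    other = Path(folder) / "notes.xyz"
    other.write_text("just some text", encoding="utf-8")
    print(extract_text(page))
    print(extract_text(other))
```

Supporting a new format then means adding one more decorated function, which mirrors the "just write the CFC" idea.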
thanks Ray
I'm using the query index tag from Seeker, and was wondering what happens when I run different queries and save the resulting indexes. Do they overwrite the previous files in there, or are they appended?
I basically need to create indexes for multiple tables.
Right now Seeker does not support adding, updating, or deleting stuff from an index. I've been meaning to do that for a while now but I haven't found the time. You would need to do it all at once for multiple tables. What I would recommend is using Query of Query to join the multiple record sets, and then index that query.
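The combine-then-index idea can be sketched outside of ColdFusion too. The Python below is only an illustration: it does not use Query of Query or Seeker, and the table names, column names and `combine_record_sets` helper are all hypothetical. The point is to give every row a key that stays unique across tables, then hand the single combined set to the indexer in one pass:

```python
def combine_record_sets(record_sets):
    """Flatten {table_name: [rows]} into one list with a unique key per row."""
    combined = []
    for table, rows in record_sets.items():
        for row in rows:
            combined.append({
                # Prefix the id with the table name so keys never collide.
                "key": f"{table}:{row['id']}",
                "title": row["title"],
                "body": row["body"],
            })
    return combined


# Hypothetical record sets standing in for two database queries.
record_sets = {
    "questions": [
        {"id": 1, "title": "Lucene or Verity?", "body": "Which should I use?"},
    ],
    "answers": [
        {"id": 1, "title": "Re: Lucene or Verity?", "body": "Try both."},
        {"id": 2, "title": "Re: Lucene or Verity?", "body": "Lucene is free."},
    ],
}

combined = combine_record_sets(record_sets)
for row in combined:
    print(row["key"], "-", row["title"])
```

Because both tables contain a row with id 1, the table-name prefix is what keeps the keys distinct in the combined index.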
thanks again ray.
I assume that would be true for file indexing too?
Yep. A bit of a pain, I know. :) Luckily things index rather quickly (as far as I know). Adding add/edit/delete support for indexes is #1 on my list for Seeker. Now to find the time...
It's a life saver considering I don't have access to Verity.
Thanks for creating it!
Hi.
I am searching for a search engine product to replace Verity (apparently it has been bought by a company called Autonomy, and the licences are not cheap).
What I require is simply a search engine that will allow me to spider or catalogue external sites.
Where do I start looking?
P.S. I have come here because I have read a little about Lucene and the name seems to appear quite often.
Well, you can always Google. Your question is a bit broad. You haven't said what your requirements are or what you have considered already.
I will point out that if you upgrade to ColdFusion 9, you get Solr/Lucene support built in. No need to make use of my Seeker project at all.
Hi Ray.
First, thanks for your response (to be honest I wasn't actually expecting one, seeing as you seem to be a bit of a CF guru).
About me:
OK, I am relatively new to both ColdFusion and search engines.
About the business:
I work for a UK-based local government organization, hence we don't have deep pockets.
The project:
The project is a web based portal/observatory for the local district. The idea is to have one place for all the information on the district, with a primary focus on geographic information.
The project was initially developed as a pilot, which was forced to go live by the powers that be (hence the lack of accessibility and bad design).
The search engine:
Until now I have been using the free Verity search engine built into ColdFusion.
I use ColdFusion 8 on my external server and development servers (I have just been informed I already have ColdFusion 9, which just needs installing).
The search engine has served its purpose for the preliminary release; however, I now need a product that will allow me to search multiple sites of partner organizations (i.e. police, national government, statistical sites, national health service etc). Also, my hunt for a new product has become more urgent seeing as I cannot do Boolean searches in it (I am sure you are supposed to be able to).
The primary aspects i require:
1. scalability
2. ability to do cross site searching
3. ranking / scoring
4. potential advanced search features (to narrow the search)?
My questions are:
1. Given the above, what products would you recommend? (Purchased products are fine too, providing the cost isn't astronomical like Autonomy's licensing of Verity or its own products.)
2. Any other hints, tips and ideas you can provide for a complete newbie?
Thanks again
Saff
P.S. Sorry for the email, I just replied to what landed in my inbox. Also, thus far I have enjoyed reading your blog, so I am now adding one to your subscribers list :D
Hmm, I think the big issue here is your #2. You want the ability to search against _other_ sites. I assume those sites aren't on your server. As far as I know, your only option would probably be a custom Google search engine. http://www.google.com/cse/
This is a Google feature that allows you to build a custom search engine limited to certain URLs.
Give that a shot first I'd say.
Does Seeker need any updating since its last release, now that Lucene Java 3 is out?
I honestly have no idea. If you see that it does and want to perform the updates, I'll gladly take in any submissions.