Here is a simple question. Given the "typical" search scenario (user searches for X, your SQL searches for "like %x%"), how do you broaden the user's input when multiple words are used? Let me begin with an example of what I mean and then I'll demonstrate a few solutions. Also, note that in this blog entry I'm focusing only on SQL searching, not free text searching ala Lucene (or full text searching like you get in some databases).
I began with an incredibly simple search form tied to the cfartgallery sample database.
<cfparam name="form.search" default="">
<form method="post">
<cfoutput>
<input type="text" name="search" value="#form.search#">
<input type="submit" value="Search">
</cfoutput>
</form>
<cfif len(trim(form.search))>
<cfset term = "%" & trim(form.search) & "%">
<cfset term = ucase(term)>
<cfquery name="results" datasource="cfartgallery">
select artname, description
from art
where ucase(artname) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#term#">
or ucase(description) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#term#">
</cfquery>
<cfdump var="#results#">
</cfif>
I assume nothing here is too crazy or unique. I basically take the form input, trim it, and wrap it with % characters. (Note the use of ucase here as the Derby databases were case sensitive.) I then check this input against the artname and description columns. While this works, it fails in a few cases. A good example of this involves two pieces of art: "Man in Jeans" and "Man on Stool." If I knew a piece of art existed that was named "man something something jeans", a search for "man jeans" would fail to work. My search code here is treating the input as a phrase, so it must match the literal 'man jeans' set of characters anywhere in the two columns.
One possible way to fix this would be to treat the input like a list of options and allow for a match against any of them. Let's look at that modification first.
<cfparam name="form.search" default="">
<form method="post">
<cfoutput>
<input type="text" name="search" value="#form.search#">
<input type="submit" value="Search">
</cfoutput>
</form>
<cfif len(trim(form.search))>
<cfquery name="results" datasource="cfartgallery">
select artname, description
from art
where 1=0
<cfloop index="word" list="#trim(form.search)#" delimiters=" ">
<cfset word = "%" & ucase(word) & "%">
or
ucase(artname) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(description) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
</cfloop>
</cfquery>
<cfdump var="#results#">
</cfif>
Notice I've modified the where clause. I begin with a 1=0 to act as a simple placeholder that will match nothing. Then for each "word" in the input I output OR clauses for each thing. Given the input "man jeans", the SQL you will end up with is:
where 1=1
or ucase(artname) like '%MAN%'
or ucase(description) like '%MAN%'
or ucase(artname) like '%JEANS%'
or ucase(description) like '%JEANS%'
This works well, but could a bit too loose. My search for "man jeans" ends up matching both "Man in Jeans" and "Man on Stool." That isn't horrible - and is better than matching nothing. But if we wanted to be a bit more strict, we can use an AND search. In this example, we will require all the words to exist, but still allow them to match in any column.
<cfparam name="form.search" default="">
<form method="post">
<cfoutput>
<input type="text" name="search" value="#form.search#">
<input type="submit" value="Search">
</cfoutput>
</form>
<cfif len(trim(form.search))>
<cfquery name="results" datasource="cfartgallery">
select artname, description
from art
where
<cfloop index="x" from="1" to="#listLen(form.search, " ")#">
<cfset word = listGetAt(form.search, x, " ")>
<cfset word = "%" & ucase(word) & "%">
<cfif x neq 1>
and
</cfif>
(
ucase(artname) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(description) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
)
</cfloop>
</cfquery>
<cfdump var="#results#">
</cfif>
The where clause here is different. This time I loop over each word but include the AND prefix for each set of searches. The AND prefix is used only after the first word. So given the "man jeans" example we used earlier, we end up with this clause:
where
(ucase(artname) like '%MEN%'
or
ucase(description) like '%MEN%')
and
(ucase(artname) like '%JEANS%'
or
ucase(description) like '%JEANS%')
This provides a tighter result set than the previous example. So which is best? It depends. :) What I recommend is - log your searches and spend some time trying them yourselves. See how successful they are. Take that data and then decide what kind of modifications you should make to your search form.
Archived Comments
I think the most difficult part of this would be trying to order them in order of most relevant. I can think of several ways and it can get complicated fast.
Here are some ways you can try to order them:
-How many words are actually matched out of the total words searched.
-Order of the words in relation to the search string.
-How many times the words show up in the document.
-In this case, possibly an artname has more relevance than the description?
-If the first word had a match it would have more relevance than if the last word matched.
-Can you weight results in the database that would be most popular or more sought after?
I don't know if this adds to what you were getting at but I think for better use it does deserve to be in the conversation. I think just finding matches would be a simple problem to actually giving the users the best result first.
Agreed - providing a score of some kind would be useful. Mine simply finds them all.
This highlights the many complexities in building a search app. You want the results to be as precise as possible, but at the same time you want it to accommodate the possibilities that exist for any one search.
For instance, what if you liked the painting but had mixed up the title and thought it was "Dude in Jeans" or "Guy in Jeans?" Using the AND prefix, you don't get that result.
I did something like this t'other day:
<cfquery name="search" datasource="#application.datasource#">
select etc.
from etc.
where 1=1
<cfloop index="i" list="#form.keywords#" delimiters=" ">
and (product_description1 like '%#i#%' or
product_description2 like '%#i#%')
</cfloop>
</cfquery>
which removes the need for the listlen and listgetat...
...for what it's worth
This actually sounds like a fun project. Maybe once I get my current work done I will give this a shot and see what I can come up with. I would be interested in ideas from other people on how they think the rankings should be done.
I know I studied Google's PageRank algorithm before and I think this would be great for me to use my experience from that.
One of my favorite searches I wrote, probably 10 years ago, took all the permutations of the search terms and did separate searches for each combination of words, the unioned them all together. It was a bit processor intensive, but after a bit of caching, that wasn't a big deal.
For example, search for: "red clown shoes"
would search for:
red and clown and shoes
red and clown
red and shoes
clown and shoes
red
clown
shoes
So:
select * from table where description like '%red%' and description like '%clown%' and description like '%shoes%'
union
select * from table where description like '%red%' and '%clown%'
union
select * from table where description like '%red%' and '%shoes%'
union
select * from table where description like '%clown%' and '%shoes%'
union
select * from table where description like '%red%'
union
select * from table where description like '%clown%'
union
select * from table where description like '%shoes%'
@Steve
I think that would give you anything related to what you are searching for and does limited amount of relevance sorting. What I am interested is coming up with more of a function or plugin that would allow dynamic tables and column names to be search, such as the name and description in the example. I think any good search needs to balance performance and quality results. I feel most people would rather see something a little less quality and get it quicker rather than wait to see the best quality. As the best quality is only relative if you can truly understand what they are searching for.
This permutation search is neat in that it puts the most relevant (i.e. the results with the most search terms) at the top of the list while still not ignoring the semi-relevant.
It makes it much harder to do when including multiple fields though. Well not harder, but more queries. Of course you could just put ORs in between each field, but AND filters tend to give you better results. One possible solution (I'm not sure if this is allowed with a like search, try it) is to combine multiple fields together on the left side of a like statement. i.e.:
where name + ' ' + description like '%red%' and name + ' ' + description like '%clown%' and name + ' ' + description like '%shoes%'
Anyone know if that is that legal SQL?
Ray, Nice search that works perfectly with just two tables, but when I add in the third table (v_ac_srv_address) as below, my search just runs forever and never returns any results?
When I replace your WHERE/CFLOOP clause with just a simple
WHERE ac_srv_phoneno.SERVICE_NUMBER LIKE <cfqueryparam cfsqltype="cf_sql_varchar" value="#url.search#"> it works fine?
---------------------------------------
<cfparam name="url.search" default="">
<cfparam name="form.search" default="#url.search#">
<cfparam name="url.start" default="1">
<cfparam name="getCust.recordcount" default="0">
<cfset perpage = 10>
<cfif len(trim(form.search))>
<cfset search = "%" & trim(form.search) & "%">
<cfquery name="getCust" datasource="mbs">
SELECT AC_ACCT.SUBSCRIBER_ID AS SubID
, AC_ACCT.ACCT_ID AS ActID
, DATE_FORMAT(AC_ACCT.ACTIVATION_DATE,'%m-%d-%Y') AS ActDT
, AC_ACCT.ACCT_TYPE AS ActType
, AC_ACCT.ACCOUNT_DESC AS ActDes
, v_ac_srv_address.ADDRESS AS PhyAdd
, v_ac_srv_address.CITY_NAME AS PhyCty
, AC_ACCT.ACTIVE_SERVICES AS ActSta
, v_ac_srv_address.POSTAL_CODE AS AddZip
, v_ac_srv_address.COUNTY_NAME AS AddCnty
, ac_srv_phoneno.SERVICE_NUMBER AS PhoNum
FROM ac_srv_phoneno RIGHT OUTER JOIN ac_acct ON ac_srv_phoneno.ACCT_ID = ac_acct.ACCT_ID LEFT OUTER JOIN v_ac_srv_address ON v_ac_srv_address.ACCT_ID = ac_acct.ACCT_ID
WHERE
<cfloop index="x" from="1" to="#listLen(form.search, " ")#">
<cfset word = listGetAt(form.search, x, " ")>
<cfset word = "%" & ucase(word) & "%">
<cfif x neq 1>
and
</cfif>
(
ucase(AC_ACCT.ACCT_ID) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(AC_ACCT.ACCOUNT_DESC) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(v_ac_srv_address.ADDRESS) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(v_ac_srv_address.CITY_NAME) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(v_ac_srv_address.POSTAL_CODE) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(ac_srv_phoneno.SERVICE_NUMBER) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
)
</cfloop>
GROUP BY AC_ACCT.ACCT_ID
ORDER BY AC_ACCT.ACCOUNT_DESC ASC
</cfquery>
-------------------------------------------
What I'd do is check the generated SQL. See if something looks wrong there.
SQL is correct.. I use ColdFusion Report builder to generate & test my SQL visually and it runs fast & returns correct results, for example this SQL works just fine: http://cerberus.clearwave.c...
It is correct SQL, but when you _run_ it, how many results do you get?
I tested the WHERE to search for 'jabber' and it returns 4 results (see here) http://cerberus.clearwave.c...
It's interesting that it works in CFRB for a single criteria, but not at all in a CFM page with multiple criteria?
But what happens, in your SQL Server, when you multiple criteria. Your test was for single. I want you to get the SQL from CF when multiple are used and try that. Make sense?
Here is multiple criteria in CFRB (works fine): http://cerberus.clearwave.c...
- On the MySQL server, it shows the query executing forever..
- When I use EXPLAIN, it shows me the query w/variables
USE `mbs`;
EXPLAIN
SELECT AC_ACCT.SUBSCRIBER_ID AS SubID
, AC_ACCT.ACCT_ID AS ActID
, DATE_FORMAT(AC_ACCT.ACTIVATION_DATE,'%m-%d-%Y') AS ActDT
, AC_ACCT.ACCT_TYPE AS ActType
, AC_ACCT.ACCOUNT_DESC AS ActDes
, v_ac_srv_address.ADDRESS AS PhyAdd
, v_ac_srv_address.CITY_NAME AS PhyCty
, AC_ACCT.ACTIVE_SERVICES AS ActSta
, v_ac_srv_address.POSTAL_CODE AS AddZip
, v_ac_srv_address.COUNTY_NAME AS AddCnty
, ac_srv_phoneno.SERVICE_NUMBER AS PhoNum
FROM ac_srv_phoneno RIGHT OUTER JOIN ac_acct ON ac_srv_phoneno.ACCT_ID = ac_acct.ACCT_ID LEFT OUTER JOIN v_ac_srv_address ON v_ac_srv_address.ACCT_ID = ac_acct.ACCT_ID
WHERE
(
ucase(AC_ACCT.ACCT_ID) like '%JA%'
or
ucase(AC_ACCT.ACCOUNT_DESC) like '%JA%'
or
ucase(v_ac_srv_address.ADDRESS) like '%JA%'
or
ucase(v_ac_srv_address.CITY_NAME) like '%JA%'
or
ucase(v_ac_srv_address.POSTAL_CODE) like '%JA%'
or
ucase(ac_srv_phoneno.SERVICE_NUMBER) like '%JA%'
)
and
(
ucase(AC_ACCT.ACCT_ID) like '%618%'
or
ucase(AC_ACCT.ACCOUNT_DESC) like '%618%'
or
ucase(v_ac_srv_address.ADDRESS) like '%618%'
or
ucase(v_ac_srv_address.CITY_NAME) like '%618%'
or
ucase(v_ac_srv_address.POSTAL_CODE) like '%618%'
or
ucase(ac_srv_phoneno.SERVICE_NUMBER) like '%618%'
)
GROUP BY AC_ACCT.ACCT_ID
ORDER BY AC_ACCT.ACCOUNT_DESC ASC
Not sure what CFRB is. Hmm. Nothing looks terribly odd here. I'd start by removing stuff. Like maybe the entire second clause. See if you can get it down to the basics.
CFRB = ColdFusion Report Builder (those image links I reference before are snapshots from the Query Builder)
Anyway, some observances..
- If I remove all WHERE criteria referencing the 3rd table, it works?
- If I add back even one of the criteria for that 3rd table, it doesn't work?
- The issue is definitely in the WHERE clause..
Looks like there is an issue with the OR/AND/variables code when using more than two tables since it works with only two, (and all three in CFRB?)
Thanks again for cool code..
I may just forgo the fancy "multiple word search"
Sorry I couldn't help here. Got to be something simple. If you do figure it out, please post back.
Ray, I'm using your query above with my own table & fields (see below)
- This search works great except in the case of an apostrophe
- It will not find any records that contain apostrophe's
- How do I handle cases when there might be an apostrophe in the AC_ACCT.ACCOUNT_DESC field?
- I do not have the option to change the data fed into this query, so I have to deal with the apostrophe in my query.
------------------------------------------
<cfparam name="url.search" default="">
<cfparam name="form.search" default="#url.search#">
<cfparam name="url.start" default="1">
<cfparam name="getCust.recordcount" default="0">
<cfset perpage = 10>
<cfif len(trim(form.search))>
<cfset search = "%" & trim(form.search) & "%">
<cfquery name="getCust" datasource="mbs">
SELECT ac_acct.ACCT_ID AS ActID
, AC_ACCT.SUBSCRIBER_ID AS SubID
, DATE_FORMAT(AC_ACCT.ACTIVATION_DATE,'%m-%d-%Y') AS ActDT
, ac_acct.NUM_ACTIVE_SRV
, AC_ACCT.ACTIVE_SERVICES AS ActSta
, AC_ACCT.ACCT_TYPE AS ActType
, AC_ACCT.ACCOUNT_DESC AS ActDes
, v_ac_srv_address.ADDR_TYPE
, v_ac_srv_address.ADDRESS AS PhyAdd
, v_ac_srv_address.CITY_NAME AS PhyCty
, v_ac_srv_address.STATE_PROV_ABBR
, v_ac_srv_address.POSTAL_CODE AS AddZip
, v_ac_srv_address.COUNTY_NAME AS AddCnty
, v_ac_srv_address.EFF_FROM_DATE
, v_ac_srv_address.EFF_THRU_DATE
FROM ac_acct LEFT OUTER JOIN v_ac_srv_address ON ac_acct.ACCT_ID = v_ac_srv_address.ACCT_ID
WHERE
<cfloop index="x" from="1" to="#listLen(form.search, " ")#">
<cfset word = listGetAt(form.search, x, " ")>
<cfset word = "%" & ucase(word) & "%">
<cfif x neq 1>
and
</cfif>
(
ucase(AC_ACCT.ACCT_ID) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(AC_ACCT.SUBSCRIBER_ID) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(AC_ACCT.ACCOUNT_DESC) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(v_ac_srv_address.ADDRESS) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(v_ac_srv_address.CITY_NAME) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(v_ac_srv_address.POSTAL_CODE) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
or
ucase(v_ac_srv_address.COUNTY_NAME) like <cfqueryparam cfsqltype="cf_sql_varchar" value="#word#">
)
</cfloop>
GROUP BY AC_ACCT.ACCT_ID
ORDER BY AC_ACCT.ACCOUNT_DESC ASC
</cfquery>
<cfelse>
<cfset noSearch = true>
</cfif>
-----------------------------------
Um, wow, you got me. An apostrophe shouldn't cause an issue.