Martin asks a pretty interesting question about comparing strings:
Is there a way to compare the similarities of variable strings in ColdFusion? For instance, lets say variable.foo = "15 St Mungo Place" and variable.coo = "15 St. Mungo Place"
Sql will not return a match but you and I both know that these two strings mean the same thing AND that there is only a one character difference.
My question is - Is there a way to get Coldfusion to recognise that there is only a one character difference (similarities) between the two strings. Obviously the potential applications for search queries etc.. could be enormous here so I'm VERY curious to see if this is possible.
Right off the bat, there is nothing baked into ColdFusion that will do this automatically, so we have to look at other possible solutions.
For some cases, you could use straight substitutions. You could have a list of common alternative spellings ("st. for st") and loop over them and do replacements. That will work for simple things but probably isn't going to scale up.
You could also compare strings and see how close they are. I did some Googling on that, and kept coming across the Levenshtein distance formula. I was actually working to rebuild the pseudo-code on Wikipedia when I remembered CFLib - turns out it has a udf, levDistance, that can use this logic.
This is still somewhat of a slow process as you would need to compare your search against all the words in the database. This is probably a case where something like Verity, or the full text searching feature of your database, may be more handy.
Anyone done anything like this in ColdFusion?
Archived Comments
just subscribing for comments - good question.
Not totally the same but if it's a word your looking to do this against you can use soundex, basically phonetic algorithms http://en.wikipedia.org/wik...
For employee name management, I did set up a first-name substitution table that worked like you mentioned (Bob=Robert, Sue=Susan, etc)... Provided alternative spellings, and all alternatives would be tried when running comparisons until a match was found. But that was a simple case that was a bit more realistic to take this approach.
Using verity is a great idea...Flag it as a possible match if one of Verity's suggested phrases is a match...
I'm just shooting in the dark here, and there's no way it's scalable, but along the lines of the Levenshtein distance, you could do a diff comparison of the two strings (search for "LCS algorithm", or see my blog) and see if the difference(s) are significant or not. It would really only be appropriate for two specific strings, so it wouldn't scale if given a list of strings.
Note that I am in no way saying that this is a *good* idea, just something that might be possible.
Soundex checks up to first 4 sounds only, and not numbers or any non alpha characters. So that won't really work.
--- Ben
Sounds like a "diff" problem. There's a Java library at http://www.incava.org/proje... that could possibly be wrapped in a CF function.
Most database systems have some kind of fulltext index you could put on that table or field.
I think Ray's suggestion regarding the Levenshtein (edit distance) is probably the right way to go. The terminology to search on that provides the best results I found were "fuzzy search". The most accessible implementation for Coldfusion is probably Lucene: see the information about Fuzzy Searching
Query Syntax: http://lucene.apache.org/ja...
JavaDocs: http://lucene.apache.org/ja...
This would require making a Lucene index of values to compare against. In my own experience, Lucene is fast, and scales well as the record count of an index increases.
If you are using MS SQL Server, there is also an editable thesaurus that will allow you to equate things like 'ST' and 'ST.' and 'Street'.
I have used Levenshtein to accomplish what you describe for both Postgres and ms-sql(which requires an external com object to be installed, don't have the details on hand).
the results are reasonably fast, however a full table scan is requires for each individual search. So performance could be an issue if run quite often.
Theo
If you're looking to do address comparison specifically, you could use the Address Validation web service from ServiceObjects:
Web Service info:
http://www.serviceobjects.c...
Demo:
http://www.serviceobjects.c...
It standardizes the address, similar to how usps.com does it (but server-side):
http://zip4.usps.com/zip4/w...
If addresses are standardized before saving them to the database, it's an easy SQL Query to compare a new address to the other addresses in the database.
(Not to mention all the other benefits of having standardized, validated addresses -- minimal returned mail, professional looking labels, etc.)
WOW - thanks for posting this so quickly Ray, I too like the Levenshtein distance formula, I didn't know someone had had posted some code on CFlib! Also - thanks to everyone who has responded - I have been looking at ALL the suggestions.
My reason for asking is that my boss has asked me to write a back-end WEB BASED app to compare closely matching addresses for manual verification. We get two sets of data from different sources and the address is the only key we can use to compare.
Saint (St. St) is a good example because it's easy to confuse with Street (St St.) so look ups aren't a perfect solution.
MySql also has "Full Text Searching" so you could add FULLTEXT(Address) to your table and do a reReplace on the address string to transform "15 St Mungo Place" into a Sql statement such as :
SELECT foo.Address, coo.Address
FROM Properties as foo, Muggles as coo
WHERE MATCH(foo.Address) AGAINST('+15 +St +Mungo +Place')
This is fine for comparing one address at a time (as you would get ranked multiple results) but what I really want to do is end up with a single table of ONLY the most likely matches in descending order.
Levenshtein could do this but could score poorly with flats: eg. Flat 1, 16 St Mungo Place & 1/16 St. Mungo Place (Levenshtein score of 9).
Using an Address Validation Service like Adam suggested (I will need to find a UK one but excellent idea Adam) will certainly help this.
And finally... I could also use some proprietary software to do this and just press a button but then.... a) I might not be able to afford that and b)what would I actually learn from doing that?
Obviously a combination of methods will have to be used for matching and filtering, I'm just glad I "Asked a Jedi." I will definitely reply here again if I find anything else!!
Thanks again to all.
Martin
@Ben's comment on SOUNDEX
I don't know if this has been changed in MySql 5 but
SELECT SOUNDEX("10 St Andrews"),SOUNDEX("10 St. Andrews"),SOUNDEX("10 St Andrew's"),SOUNDEX("10, St-Andrews") ALL return the same result.
This is VERY cool!
I know this is 4 years later, but just for posterity:
Martin - so would 68570 Standards of the Jedi Republic Boulevard Apartment 26842.
That's why Ben pointed out these limitations. Soundex was invented for comparisons of first names in census records like 80 years ago, before computers. It's still pretty handy for catching typoed first names in something non-critical like an employee directory search, but don't rely on it for too much else.