Ask a Jedi: Flexible string comparisons

This post is more than 2 years old.

Martin asks a pretty interesting question about comparing strings:

Is there a way to compare the similarities of variable strings in ColdFusion? For instance, lets say variable.foo = "15 St Mungo Place" and variable.coo = "15 St. Mungo Place"

Sql will not return a match but you and I both know that these two strings mean the same thing AND that there is only a one character difference.

My question is - Is there a way to get Coldfusion to recognise that there is only a one character difference (similarities) between the two strings. Obviously the potential applications for search queries etc.. could be enormous here so I'm VERY curious to see if this is possible.

Right off the bat, there is nothing baked into ColdFusion that will do this automatically, so we have to look at other possible solutions.

For some cases, you could use straight substitutions. You could have a list of common alternative spellings ("st. for st") and loop over them and do replacements. That will work for simple things but probably isn't going to scale up.

You could also compare strings and see how close they are. I did some Googling on that, and kept coming across the Levenshtein distance formula. I was actually working to rebuild the pseudo-code on Wikipedia when I remembered CFLib - turns out it has a udf, levDistance, that can use this logic.

This is still somewhat of a slow process as you would need to compare your search against all the words in the database. This is probably a case where something like Verity, or the full text searching feature of your database, may be more handy.

Anyone done anything like this in ColdFusion?

Raymond Camden's Picture

About Raymond Camden

Raymond is a senior developer evangelist for Adobe. He focuses on document services, JavaScript, and enterprise cat demos. If you like this article, please consider visiting my Amazon Wishlist or donating via PayPal to show your support. You can even buy me a coffee!

Lafayette, LA https://www.raymondcamden.com

Archived Comments

Comment 1 by Scott P posted on 1/4/2008 at 1:00 AM

just subscribing for comments - good question.

Comment 2 by David Herman posted on 1/4/2008 at 1:22 AM

Not totally the same but if it's a word your looking to do this against you can use soundex, basically phonetic algorithms http://en.wikipedia.org/wik...

Comment 3 by Joshua Curtiss posted on 1/4/2008 at 1:34 AM

For employee name management, I did set up a first-name substitution table that worked like you mentioned (Bob=Robert, Sue=Susan, etc)... Provided alternative spellings, and all alternatives would be tried when running comparisons until a match was found. But that was a simple case that was a bit more realistic to take this approach.

Using verity is a great idea...Flag it as a possible match if one of Verity's suggested phrases is a match...

Comment 4 by Rick O posted on 1/4/2008 at 1:59 AM

I'm just shooting in the dark here, and there's no way it's scalable, but along the lines of the Levenshtein distance, you could do a diff comparison of the two strings (search for "LCS algorithm", or see my blog) and see if the difference(s) are significant or not. It would really only be appropriate for two specific strings, so it wouldn't scale if given a list of strings.

Note that I am in no way saying that this is a *good* idea, just something that might be possible.

Comment 5 by Ben Forta posted on 1/4/2008 at 2:23 AM

Soundex checks up to first 4 sounds only, and not numbers or any non alpha characters. So that won't really work.

--- Ben

Comment 6 by joel posted on 1/4/2008 at 2:32 AM

Sounds like a "diff" problem. There's a Java library at http://www.incava.org/proje... that could possibly be wrapped in a CF function.

Comment 7 by Scott Fitchet posted on 1/4/2008 at 2:34 AM

Most database systems have some kind of fulltext index you could put on that table or field.

Comment 8 by orangepips posted on 1/4/2008 at 7:04 AM

I think Ray's suggestion regarding the Levenshtein (edit distance) is probably the right way to go. The terminology to search on that provides the best results I found were "fuzzy search". The most accessible implementation for Coldfusion is probably Lucene: see the information about Fuzzy Searching

Query Syntax: http://lucene.apache.org/ja...

JavaDocs: http://lucene.apache.org/ja...

This would require making a Lucene index of values to compare against. In my own experience, Lucene is fast, and scales well as the record count of an index increases.

Comment 9 by Gus posted on 1/4/2008 at 7:20 AM

If you are using MS SQL Server, there is also an editable thesaurus that will allow you to equate things like 'ST' and 'ST.' and 'Street'.

Comment 10 by theo posted on 1/4/2008 at 8:18 AM

I have used Levenshtein to accomplish what you describe for both Postgres and ms-sql(which requires an external com object to be installed, don't have the details on hand).

the results are reasonably fast, however a full table scan is requires for each individual search. So performance could be an issue if run quite often.

Theo

Comment 11 by Adam Fairbanks posted on 1/4/2008 at 11:55 AM

If you're looking to do address comparison specifically, you could use the Address Validation web service from ServiceObjects:

Web Service info:
http://www.serviceobjects.c...

Demo:
http://www.serviceobjects.c...

It standardizes the address, similar to how usps.com does it (but server-side):

http://zip4.usps.com/zip4/w...

If addresses are standardized before saving them to the database, it's an easy SQL Query to compare a new address to the other addresses in the database.

(Not to mention all the other benefits of having standardized, validated addresses -- minimal returned mail, professional looking labels, etc.)

Comment 12 by Martin posted on 1/4/2008 at 6:21 PM

WOW - thanks for posting this so quickly Ray, I too like the Levenshtein distance formula, I didn't know someone had had posted some code on CFlib! Also - thanks to everyone who has responded - I have been looking at ALL the suggestions.

My reason for asking is that my boss has asked me to write a back-end WEB BASED app to compare closely matching addresses for manual verification. We get two sets of data from different sources and the address is the only key we can use to compare.

Saint (St. St) is a good example because it's easy to confuse with Street (St St.) so look ups aren't a perfect solution.

MySql also has "Full Text Searching" so you could add FULLTEXT(Address) to your table and do a reReplace on the address string to transform "15 St Mungo Place" into a Sql statement such as :
SELECT foo.Address, coo.Address
FROM Properties as foo, Muggles as coo
WHERE MATCH(foo.Address) AGAINST('+15 +St +Mungo +Place')
This is fine for comparing one address at a time (as you would get ranked multiple results) but what I really want to do is end up with a single table of ONLY the most likely matches in descending order.

Levenshtein could do this but could score poorly with flats: eg. Flat 1, 16 St Mungo Place & 1/16 St. Mungo Place (Levenshtein score of 9).
Using an Address Validation Service like Adam suggested (I will need to find a UK one but excellent idea Adam) will certainly help this.
And finally... I could also use some proprietary software to do this and just press a button but then.... a) I might not be able to afford that and b)what would I actually learn from doing that?

Obviously a combination of methods will have to be used for matching and filtering, I'm just glad I "Asked a Jedi." I will definitely reply here again if I find anything else!!

Thanks again to all.
Martin

Comment 13 by Martin posted on 10/31/2008 at 8:53 PM

@Ben's comment on SOUNDEX
I don't know if this has been changed in MySql 5 but

SELECT SOUNDEX("10 St Andrews"),SOUNDEX("10 St. Andrews"),SOUNDEX("10 St Andrew's"),SOUNDEX("10, St-Andrews") ALL return the same result.

This is VERY cool!

Comment 14 by JC posted on 7/27/2012 at 10:51 PM

I know this is 4 years later, but just for posterity:

Martin - so would 68570 Standards of the Jedi Republic Boulevard Apartment 26842.

That's why Ben pointed out these limitations. Soundex was invented for comparisons of first names in census records like 80 years ago, before computers. It's still pretty handy for catching typoed first names in something non-critical like an employee directory search, but don't rely on it for too much else.