This is one of those times when I discover something that I probably should have known for a long time, and I certainly wish I had. And if you’re one of my friends and you knew about this and didn’t tell me, may you rot in hell lying in a bed of nails covered with black flies. Now that I got this out of the way, on with the article…
I decided to look for some kind of algorithm that would allow for matching words that are similar. This would inevitably be to retrieve records from a database based on some search criteria. So I search google and find something about a Soundex algorithm. The algorithm goes as follows (copied straight from Wikipedia):
- Retain the first letter of the string
- Remove all occurrences of the following letters, unless it is the first letter: a, e, h, i, o, u, w, y
-
Assign numbers to the remaining letters (after the first) as follows:
- b, f, p, v = 1
- c, g, j, k, q, s, x, z = 2
- d, t = 3
- l = 4
- m, n = 5
- r = 6
- If two or more letters with the same number were adjacent in the original name (before step 1), or adjacent except for any intervening h and w (American census only), then omit all but the first.
- Return the first four characters, right-padding with zeroes if there are fewer than four.
But that’s not the good part… The good part is that this algorithm is implemented in some DBMS systems. And apparently, you guessed it, it’s implemented in the most popular ones, SQL Server, Oracle, and MySQL. How does it work? It’s oh so difficult… Check out the code below:
SELECT *
FROM address
WHERE SOUNDEX(city) = SOUNDEX('Washgton')
If you have any records in the database for Washington (note in the query it’s missing the “i”) it will be returned. Wonderful! I could probably have used this before. And for those who have the possibility of adding UDFs to their DB server, there are implementations of other algorithms such as Metaphone and Similar_text.
Note that Soundex is a phonetic algorithm, so it looks for words that would sound similar. So you might not always get the results you want. When I searched my person table using SOUNDEX(first_name) = SOUNDEX(‘Tomas’) I did get a bunch of “Thomas” records back, but if I use SOUNDEX(‘Thomas’) I did not, I got a bunch of “Tom”, “Tommy”, and even “Tony” records, but no “Thomas”. Oh well, still better than nothing. I bet that using a combination of different algorithms you can probably get some good results. More research to be done…


March 26th, 2007 at 10:17 pm
hehe, I knew about this. A while back I wanted to do a spell check thing and soundex algorithms are one of the ways you can figure out what the person wanted to say when they misspelt a word. I deliberately decided not to tell you just hoping for this very day. Oh revenge is such a sweet and tasty dish when served cold.
I did not know that dbms’s have it integrated natively though. Will definitely use a hybrid of this in order to improve my search returns.