Finding Similar Text and Words
Friday, March 23rd, 2007

This is one of those times when I discover something that I probably should have known for a long time, and I certainly wish I had. And if you’re one of my friends and you knew about this and didn’t tell me, may you rot in hell lying in a bed of nails covered with black flies. Now that I got this out of the way, on with the article…

I decided to look for some kind of algorithm that would allow for matching words that are similar. This would inevitably be to retrieve records from a database based on some search criteria. So I search google and find something about a Soundex algorithm. The algorithm goes as follows (copied straight from Wikipedia):

  1. Retain the first letter of the string
  2. Remove all occurrences of the following letters, unless it is the first letter: a, e, h, i, o, u, w, y
  3. Assign numbers to the remaining letters (after the first) as follows:

    • b, f, p, v = 1
    • c, g, j, k, q, s, x, z = 2
    • d, t = 3
    • l = 4
    • m, n = 5
    • r = 6
  4. If two or more letters with the same number were adjacent in the original name (before step 1), or adjacent except for any intervening h and w (American census only), then omit all but the first.
  5. Return the first four characters, right-padding with zeroes if there are fewer than four.

But that’s not the good part… The good part is that this algorithm is implemented in some DBMS systems. And apparently, you guessed it, it’s implemented in the most popular ones, SQL Server, Oracle, and MySQL. How does it work? It’s oh so difficult… Check out the code below:

SELECT *
FROM address
WHERE SOUNDEX(city) = SOUNDEX('Washgton')

If you have any records in the database for Washington (note in the query it’s missing the “i”) it will be returned. Wonderful! I could probably have used this before. And for those who have the possibility of adding UDFs to their DB server, there are implementations of other algorithms such as Metaphone and Similar_text.

Note that Soundex is a phonetic algorithm, so it looks for words that would sound similar. So you might not always get the results you want. When I searched my person table using SOUNDEX(first_name) = SOUNDEX(’Tomas’) I did get a bunch of “Thomas” records back, but if I use SOUNDEX(’Thomas’) I did not, I got a bunch of “Tom”, “Tommy”, and even “Tony” records, but no “Thomas”. Oh well, still better than nothing. I bet that using a combination of different algorithms you can probably get some good results. More research to be done…

8 Responses to “Finding Similar Text and Words”
  1. Andrew Says:

    hehe, I knew about this. A while back I wanted to do a spell check thing and soundex algorithms are one of the ways you can figure out what the person wanted to say when they misspelt a word. I deliberately decided not to tell you just hoping for this very day. Oh revenge is such a sweet and tasty dish when served cold.

    I did not know that dbms’s have it integrated natively though. Will definitely use a hybrid of this in order to improve my search returns.

  2. osobo Says:

    Новый способ давления на кандидата на пост Главы г. Химки

    Новый способ “наказать” тех, кто посмел участвовать в выборной кампании не на стороне действующей власти изобрели правоохранительные органы г.о. Химки.
    Руководствуясь не нормой закона, а чьей-то “волей” сотрудники милиции решили “проверить” все фирмы, внесшие денежные средства в избирательный фонд неудобных кандидатов.
    Начались “проверки” с телефонных звонков – где директор, сколько человек работает на фирме. После чего последовали “письма счастья” с просьбой предоставить всю бухгалтерскую документацию, учредительные документы фирмы, и даже, план экспликации БТИ.
    Такие запросы химкинским фирмам рассылает 1 отдел Оперативно-розыскной части № 9 Управления по налоговым преступлениям ГУВД Московской области за подписью начальника подполковника милиции Д.В. Языкова.
    И всё это в то время, когда Президент дал прямое указание правоохранительным органам о прекращении всех незаконных проверок малого и среднего бизнеса. С это целью внесены изменения в Федеральный закон “О милиции” – из статьи 11 этого закона исключены пункты 25 и 35, на основании которых ранее правоохранительные органы имели право проверять финансово-хозяйственную деятельность предприятий.
    Видно, об изменениях действующего законодательства местные правоохранительные органы не уведомлены. И не смотрят телепередачи с выступлениями Президента.
    Может быть, эта публикация подвигнет их к исполнению указаний Президента, а также к изучению и соблюдению действующего законодательства

  3. Rentator Says:

    Привет! Мы предлогаем аренда квартиры в Москве без комиссии!
    Звоните, все расскажут и покажут. Большой выбор квартир на сайте :)

  4. nevkursi Says:

    Film online ;)

  5. Artur Says:

    Хмм… а можно ли у вас брать посты с рсс канала? Ссылку на вас обязательно поставлю.

  6. Incedgeundete Says:

    Looking for Azithromycin ?

    Check our best Azithromycin online drugstores!
    http://drugsovernight.org/thumbs/pharma3.jpg

    **CLICK TO BUY CHEAP Azithromycin PILLS**

    http://iwebimg.net/ifeed/img/74/fimg/Azithromycin/1
    http://iwebimg.net/ifeed/img/74/fimg/Azithromycin/2
    http://iwebimg.net/ifeed/img/74/fimg/Azithromycin/3
    http://iwebimg.net/ifeed/img/74/fimg/Azithromycin/4
    http://iwebimg.net/ifeed/img/74/fimg/Azithromycin/5
    http://iwebimg.net/ifeed/img/74/fimg/Azithromycin/6

    Full information you can find at Wikipedia

    All these mechanical methods are based on a selective sample of research in this area.Alternate treatments also be a sign of hyper-thyroidism, depression, your doctor in the context of her health Your history of the illness.Azithromycin.Depression, low self esteem, body stores up to four fused rings. azithromycin for eye infections

    Related topics:

  7. Gaxphoppy Says:

    Hey everyone just wanna say hello and introduce myself!

  8. internet pharmacies reviews Says:

    Hello, I came across your web site a little while ago and have ploughed through all the posts and comments quietly. I decided I might make my firstcomment. Not really sure what to say but anyway. Informative blog. Will visit soon to see more of what you have to offer.

Leave a Reply