<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>epiphantastic &#187; Databases</title>
	<atom:link href="http://www.epiphantastic.com/category/databases/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.epiphantastic.com</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Wed, 13 Apr 2011 19:20:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.1</generator>
		<item>
		<title>Finding Similar Text and Words</title>
		<link>http://www.epiphantastic.com/2007/03/23/finding-similar-text-and-words/</link>
		<comments>http://www.epiphantastic.com/2007/03/23/finding-similar-text-and-words/#comments</comments>
		<pubDate>Fri, 23 Mar 2007 20:07:52 +0000</pubDate>
		<dc:creator>Thomas</dc:creator>
				<category><![CDATA[Databases]]></category>

		<guid isPermaLink="false">http://www.epiphantastic.com/?p=21</guid>
		<description><![CDATA[This is one of those times when I discover something that I probably should have known for a long time, and I certainly wish I had. And if you&#8217;re one of my friends and you knew about this and didn&#8217;t tell me, may you rot in hell lying in a bed of nails covered with [...]]]></description>
			<content:encoded><![CDATA[<p>
This is one of those times when I discover something that I probably should have known for a long time, and I certainly wish I had. And if you&#8217;re one of my friends and you knew about this and didn&#8217;t tell me, may you rot in hell lying in a bed of nails covered with black flies. Now that I&#8217;ve got that out of the way, on with the article&#8230;
</p>
<p>
I decided to look for an algorithm that matches similar words, with the inevitable goal of retrieving records from a database based on some search criteria. So I searched Google and found the <a href="http://en.wikipedia.org/wiki/Soundex">Soundex algorithm</a>. It goes as follows (copied straight from Wikipedia):</p>
<ol>
<li>Retain the first letter of the string</li>
<li>Remove all occurrences of the following letters, unless it is the first letter: a, e, h, i, o, u, w, y</li>
<li>
Assign numbers to the remaining letters (after the first) as follows:
<ul>
<li>b, f, p, v = 1</li>
<li>c, g, j, k, q, s, x, z = 2</li>
<li>d, t = 3</li>
<li>l = 4</li>
<li>m, n = 5</li>
<li>r = 6</li>
</ul>
</li>
<li>If two or more letters with the same number were adjacent in the original name (before step 1), or adjacent except for any intervening h and w (American census only), then omit all but the first.</li>
<li>Return the first four characters, right-padding with zeroes if there are fewer than four.</li>
</ol>
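<p>
The steps above can be sketched in a few lines of Python. This is only an illustration of the listed rules (the American census variant, where h and w do not separate letters with the same number), not the code any particular DBMS runs. The names <code>CODES</code> and <code>soundex</code> are made up for this example.</p>

```python
# Digit assigned to each consonant, following the table in the list above.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of word (hypothetical helper)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()           # step 1: keep the first letter
    previous = CODES.get(first, "")  # digit of the last letter that counts
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are dropped and do not split a run of equal digits
        digit = CODES.get(ch, "")    # vowels and y get no digit and end a run
        if digit and digit != previous:
            result += digit
        previous = digit
    return (result + "000")[:4]      # pad with zeroes, keep four characters


print(soundex("Robert"), soundex("Rupert"))
```

<p>
Both names in the last line should come out with the same code, which is the whole point: words that sound alike collapse to one key.</p>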
<p>
But that&#8217;s not the good part&#8230; The good part is that this algorithm is built into several DBMSs. And, you guessed it, that includes some of the most popular ones: SQL Server, Oracle, and MySQL. How does it work? It&#8217;s oh so difficult&#8230; Check out the code below:</p>
<pre>
SELECT *
FROM address
WHERE SOUNDEX(city) = SOUNDEX('Washngton')
</pre>
<p>If you have any records in the database for Washington (note that the query is missing the &#8220;i&#8221;), they will be returned. Wonderful! I could probably have used this before. And for those who can add UDFs to their DB server, there are implementations of other algorithms such as <a href="http://en.wikipedia.org/wiki/Metaphone">Metaphone</a> and Similar_text.
</p>
<p>
Note that Soundex is a phonetic algorithm, so it looks for words that sound similar, and you might not always get the results you want. When I searched my person table using SOUNDEX(first_name) = SOUNDEX(&#8216;Tomas&#8217;) I did get a bunch of &#8220;Thomas&#8221; records back. But when I used SOUNDEX(&#8216;Thomas&#8217;) I did not: I got a bunch of &#8220;Tom&#8221;, &#8220;Tommy&#8221;, and even &#8220;Tony&#8221; records, but no &#8220;Thomas&#8221;. Oh well, still better than nothing. I bet that by combining different algorithms you can probably get some good results. More research to be done&#8230;</p>
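<p>
As one possible second signal to combine with a phonetic match, here is a sketch that ranks candidate names by plain string similarity using Python&#8217;s standard difflib module. This is not Soundex, Metaphone, or Similar_text, just a comparable ratio. The <code>similarity</code> helper and the list of names are made up for the example.</p>

```python
import difflib


def similarity(a, b):
    """Ratio between 0 and 1; 1 means the strings match (ignoring case)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Candidate first names, e.g. what a phonetic search might have returned.
names = ["Tony", "Tommy", "Tom", "Tomas", "Thomas"]

# Put the names most similar to the search term first.
ranked = sorted(names, key=lambda name: similarity("Thomas", name), reverse=True)
print(ranked)
```

<p>
A ranking like this could be used to sort or trim whatever the phonetic search brings back, so the closest spellings show up at the top.</p>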
]]></content:encoded>
			<wfw:commentRss>http://www.epiphantastic.com/2007/03/23/finding-similar-text-and-words/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

