>From: Johan De Meersman [mailto:vegivamp@stripped]
>Sent: Friday, April 29, 2011 5:56 AM
>To: Jerry Schwartz
>Cc: mysql mailing list
>Subject: Re: Join based upon LIKE
>----- Original Message -----
>> From: "Jerry Schwartz" <jerry@stripped>
>> [JS] This isn't the only place I have to deal with fuzzy data. :-(
>> Discretion prohibits further comment.
>Heh. What you *really* need is a LART. Preferably one of the spiked variety.
[JS] Unless a LART is a demon of some kind, I don't know what it is.
>> A full-text index would work if I were only looking for one title at
>> a time, but I don't know if that would be a good idea if I have a list of
>> 10000 titles. That would pretty much require either 10000 separate queries
>> or a very, very long WHERE clause.
>Yes, unfortunately. You should see if you can introduce a form of data
>normalisation - say, shadow fields with corrected entries, or functionality
>in the application that suggests correct entries based on what the user typed.
[JS] Except for obvious misspellings and non-ASCII characters, I do not have
the freedom to muck with the text. If the data were created in-house, I could
correct it on the way in; but it comes from myriad other companies.
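
For reference, here is a minimal sketch of the shadow-field idea that leaves the publishers' titles untouched. It is only an illustration under assumptions, not anything either poster described in detail: it uses Python's sqlite3 module as a stand-in for MySQL, and the names product, wanted, title_norm, normalise and match_titles are all hypothetical. The list of titles to look up is loaded into its own table, so the comparison is a single join rather than 10000 queries or one very long WHERE clause. It only absorbs differences in case, punctuation and spacing; genuinely fuzzy matches would still need LIKE or a full-text search.

```python
import re
import sqlite3


def normalise(title: str) -> str:
    """Build the shadow value: lower-case, drop apostrophes, and turn every
    run of characters other than a-z and 0-9 into a single space."""
    lowered = re.sub(r"['’]", "", title.lower())
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()


def match_titles(catalogue, wanted):
    """Return (wanted_title, catalogue_title) pairs whose shadow values agree.

    The original titles are stored exactly as received; only the
    hypothetical shadow column title_norm holds the cleaned-up text.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (title TEXT, title_norm TEXT)")
    conn.execute("CREATE TABLE wanted (title TEXT, title_norm TEXT)")
    # Index the shadow column so the join does not scan every product row.
    conn.execute("CREATE INDEX product_norm ON product (title_norm)")
    conn.executemany(
        "INSERT INTO product VALUES (?, ?)",
        [(t, normalise(t)) for t in catalogue],
    )
    conn.executemany(
        "INSERT INTO wanted VALUES (?, ?)",
        [(t, normalise(t)) for t in wanted],
    )
    # One statement for the whole list of wanted titles.
    rows = conn.execute(
        "SELECT w.title, p.title FROM wanted AS w "
        "JOIN product AS p ON p.title_norm = w.title_norm "
        "ORDER BY w.title, p.title"
    ).fetchall()
    conn.close()
    return rows
```

Calling match_titles(["Ain't Misbehavin'"], ["AIN'T MISBEHAVIN"]) pairs the two spellings while the stored title keeps its apostrophes. In MySQL the same shape would presumably be a temporary table for the incoming list plus an indexed extra column on the product table.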
>Or, if the money's there, you could have a look at Amazon Mechanical Turk
>(really) for cheap-ish data correction.
[JS] Again, I can't change the data. The titles are assigned by the
publishers. Think what would happen if Amazon decided to "fix" the titles of
books. "Ain't Misbehavin" would, at best, turn into "I am not misbehaving".
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032
860.674.8796 / FAX: 860.674.8341
Web site: www.the-infoshop.com
>Beer with grenadine
>Is like mustard with wine
>She who drinks it is a prude
>He who drinks it is soon an ass