>From: Nuno Tavares [mailto:nuno.tavares@stripped]
>Sent: Tuesday, May 03, 2011 6:21 PM
>Subject: Re: Join based upon LIKE
>I've been silently following this discussion because I've missed the
>But from your last explanation, now it really looks you have a "data
>quality" kind of issue, which is by far related with MySQL.
[JS] Definitely -- but I have to work with the tools available. This is only
one part of the process, there is more trouble further on that is not related
to our database at all.
>Indeed, in Data Quality, there is *never* a ready solution, because the
>source is tipically chaotic....
>May I suggest you to explore Google Refine? It seems to be able to
>address all those issues quite nicely, and the clustering might solve
>your problem at once. You shall know, however, how to export the tables
>(or a usable JOIN) as a CSV, see SELECT ... INTO OUTFILE for that.
[JS] I never heard of Google Refine. Thanks for bringing to my attention.
>Hope it helps,
[JS] Thank you.
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032
860.674.8796 / FAX: 860.674.8341
Web site: www.the-infoshop.com
>Em 03-05-2011 21:34, Jerry Schwartz escreveu:
>> My situation is sounds rather simple. All I am doing is matching a
>> of products against our database. My job is to find any matches against
>> existing products and determine which ones are new, which ones are
>> replacements for older products, and which ones just need to have the
>> publication date (and page count, price, whatever) refreshed.
>> Publisher is no problem. What I have for each "feed" is a title and (most
>> the time) an ISBN or other identification assigned by the publisher.
>> Matching by product ID is easy (assuming there aren't any mistakes in the
>> current or previous feeds); but the publisher might or might not change the
>> product ID when they update a report. That's why I also run a match by
>> and that's where all the trouble comes from.
>> The publisher might or might not include a mix of old and new products in a
>> feed. The publisher might change the title of an existing product, either
>> purpose or by accident; they might simply be sloppy about their spelling;
>> (and this is where it is critical) the title might include a reference to
>> time period such as a year or a quarter.
>> I think we'd better pull the plug on this discussion. It doesn't seem like
>> there's a ready solution. Fortunately our database is small, and most feeds
>> are only a few hundred products.
>> Jerry Schwartz
>> Global Information Incorporated
>> 195 Farmington Ave.
>> Farmington, CT 06032
>> 860.674.8796 / FAX: 860.674.8341
>> E-mail: jerry@stripped
>> Web site: www.the-infoshop.com
>>> -----Original Message-----
>>> From: shawn wilson [mailto:ag4ve.us@stripped]
>>> Sent: Tuesday, May 03, 2011 4:08 PM
>>> Cc: mysql mailing list
>>> Subject: Re: Join based upon LIKE
>>> I'm actually enjoying this discussion because I have the same type of
>>> However, I have done away with trying to do a full text search in favor of
>>> making a table with unique fields where all fields should uniquely
>>> the group. If I get a dupe, I can clean it up.
>>> However, like you, they don't want me to mess with the original data. So,
>>> what I have is another table with my good data that my table with my
>>> data refers to. If a bad record is creased, I don't care I just create my
>>> relationship to the table of data I know (read think - I rarely look at
>>> stuff) is good.
>>> So, I have 4 fields that should be unique for a group. Two chats and two
>>> ints. If three of these match a record in the 'good data' table - there's
>>> relationship. If two or less match, I create a new record in my 'good
>>> table and log the event. (I haven't gotten to the logging part yet though,
>>> easy enough just to look sense none of the fields in 'good data' should
>>> I'm thinking you might have to dig deeper than me to find 'good data' but
>>> think its there. Maybe isbn, name, publisher + address, price, average
>>> pages, name of sales person, who you guys pay for the material, etc etc
>>> On May 3, 2011 10:59 AM, "Johan De Meersman" <vegivamp@stripped>
>>>> ----- Original Message -----
>>>>> From: "Jerry Schwartz" <jerry@stripped>
>>>>> I'm not sure that I could easily build a dictionary of non-junk
>>>>> words, since
>>>> The traditional way is to build a database of junk words. The list tends
>>> to be shorter :-)
>>>> Think and/or/it/the/with/like/...
>>>> Percentages of mutual and non-mutual words between two titles should be
>>> reasonable indicator of likeness. You could conceivably even assign value
>>> individual words, so "polypropylbutanate" is more useful than "synergy"
>>> comparison purposes.
>>>> All very theoretical, though, I haven't actually done much of it to this
>>> level. My experience in data mangling is limited to mostly
>>> should-be-fixed-format data like sports results.
>>>> Bier met grenadyn
>>>> Is als mosterd by den wyn
>>>> Sy die't drinkt, is eene kwezel
>>>> Hy die't drinkt, is ras een ezel
>>>> MySQL General Mailing List
>>>> For list archives: http://lists.mysql.com/mysql
>>>> To unsubscribe: http://lists.mysql.com/mysql?unsub=1
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/mysql?unsub=1