I've been silently following this discussion because I've missed the
But from your last explanation, now it really looks you have a "data
quality" kind of issue, which is by far related with MySQL.
Indeed, in Data Quality, there is *never* a ready solution, because the
source is tipically chaotic....
May I suggest you to explore Google Refine? It seems to be able to
address all those issues quite nicely, and the clustering might solve
your problem at once. You shall know, however, how to export the tables
(or a usable JOIN) as a CSV, see SELECT ... INTO OUTFILE for that.
Hope it helps,
Em 03-05-2011 21:34, Jerry Schwartz escreveu:
> My situation is sounds rather simple. All I am doing is matching a spreadsheet
> of products against our database. My job is to find any matches against
> existing products and determine which ones are new, which ones are
> replacements for older products, and which ones just need to have the
> publication date (and page count, price, whatever) refreshed.
> Publisher is no problem. What I have for each "feed" is a title and (most of
> the time) an ISBN or other identification assigned by the publisher.
> Matching by product ID is easy (assuming there aren't any mistakes in the
> current or previous feeds); but the publisher might or might not change the
> product ID when they update a report. That's why I also run a match by title,
> and that's where all the trouble comes from.
> The publisher might or might not include a mix of old and new products in a
> feed. The publisher might change the title of an existing product, either on
> purpose or by accident; they might simply be sloppy about their spelling; or
> (and this is where it is critical) the title might include a reference to some
> time period such as a year or a quarter.
> I think we'd better pull the plug on this discussion. It doesn't seem like
> there's a ready solution. Fortunately our database is small, and most feeds
> are only a few hundred products.
> Jerry Schwartz
> Global Information Incorporated
> 195 Farmington Ave.
> Farmington, CT 06032
> 860.674.8796 / FAX: 860.674.8341
> E-mail: jerry@stripped
> Web site: www.the-infoshop.com
>> -----Original Message-----
>> From: shawn wilson [mailto:ag4ve.us@stripped]
>> Sent: Tuesday, May 03, 2011 4:08 PM
>> Cc: mysql mailing list
>> Subject: Re: Join based upon LIKE
>> I'm actually enjoying this discussion because I have the same type of issue.
>> However, I have done away with trying to do a full text search in favor of
>> making a table with unique fields where all fields should uniquely identify
>> the group. If I get a dupe, I can clean it up.
>> However, like you, they don't want me to mess with the original data. So,
>> what I have is another table with my good data that my table with my unique
>> data refers to. If a bad record is creased, I don't care I just create my
>> relationship to the table of data I know (read think - I rarely look at this
>> stuff) is good.
>> So, I have 4 fields that should be unique for a group. Two chats and two
>> ints. If three of these match a record in the 'good data' table - there's my
>> relationship. If two or less match, I create a new record in my 'good data'
>> table and log the event. (I haven't gotten to the logging part yet though,
>> easy enough just to look sense none of the fields in 'good data' should
>> I'm thinking you might have to dig deeper than me to find 'good data' but I
>> think its there. Maybe isbn, name, publisher + address, price, average
>> pages, name of sales person, who you guys pay for the material, etc etc etc.
>> On May 3, 2011 10:59 AM, "Johan De Meersman" <vegivamp@stripped> wrote:
>>> ----- Original Message -----
>>>> From: "Jerry Schwartz" <jerry@stripped>
>>>> I'm not sure that I could easily build a dictionary of non-junk
>>>> words, since
>>> The traditional way is to build a database of junk words. The list tends
>> to be shorter :-)
>>> Think and/or/it/the/with/like/...
>>> Percentages of mutual and non-mutual words between two titles should be a
>> reasonable indicator of likeness. You could conceivably even assign value to
>> individual words, so "polypropylbutanate" is more useful than "synergy" for
>> comparison purposes.
>>> All very theoretical, though, I haven't actually done much of it to this
>> level. My experience in data mangling is limited to mostly
>> should-be-fixed-format data like sports results.
>>> Bier met grenadyn
>>> Is als mosterd by den wyn
>>> Sy die't drinkt, is eene kwezel
>>> Hy die't drinkt, is ras een ezel
>>> MySQL General Mailing List
>>> For list archives: http://lists.mysql.com/mysql
>>> To unsubscribe: http://lists.mysql.com/mysql?unsub=1