List:General Discussion« Previous MessageNext Message »
From:Jerry Schwartz Date:April 29 2011 5:09pm
Subject:RE: Join based upon LIKE
View as plain text  
>-----Original Message-----
>From: Jim McNeely [mailto:jim@stripped]
>Sent: Thursday, April 28, 2011 6:43 PM
>To: Jerry Schwartz
>Subject: Re: Join based upon LIKE
>
>It just smells wrong, a nicer system would have you joining on ID's of some
>kind so that spelling wouldn't matter. I don't know the full situation for 
>you
>though.
>
[JS] That would be nice, wouldn't it.

In a nutshell, we sell publications. Publishers send us lists of publications. 
Some are new, some replace previous editions. (Think of books, almanacs, and 
newsletters.) Some publishers make do without any product IDs at all, but most 
do use product IDs of some kind.

The problem is that the March edition of a publication might or might not have 
the same product ID as the February edition. I try to match them both by 
product ID and by title. Sometimes the title will "fuzzy match", but the ID 
won't; sometimes the ID will match but the title won't; sometimes (if I'm 
really lucky) they both match; and sometimes the ID matches one product and 
the title matches another.

It's the fuzzy match by title that gives me fits:

- The title might have a date in it ("Rain in Spain in 2010 Q2"), but not 
necessarily in a uniform way ("Rain in Spain Q3 2010").
- The title might have differences in wording or punctuation ("Rain in Spain - 
2010Q2").
- The title might have simple misspellings ("Rain in Spian - Q2 2010").

I've written code that looks for troublesome constructs and replaces them with 
"%": " in ", "-", " to ", "Q2", "2Q", and more and more. So "Rain in Spain - 
2010 Q2" becomes "Rain%Spain%".

I shove those modified titles into a table and do a JOIN ON `prod_title` LIKE 
`wild_title`.

This will miss actual misspellings ("Spain", "Spian"). It will also produce a 
large number of false positives.

On the back end, I have other code that compares the new titles against the 
titles retrieved by that query and decides if they are exact matches, 
approximate matches (here I do use regular expressions, as well as lists of 
known "bad boys"), or false positives. From there on, it's all hand work.

Pretty big nut, eh?

So that's why I need to use LIKE in my JOIN.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@stripped
Web site: www.the-infoshop.com




>Jim McNeely
>
>On Apr 28, 2011, at 12:28 PM, Jerry Schwartz wrote:
>
>> No takers?
>>
>> -----Original Message-----
>> From: Jerry Schwartz [mailto:jerry@stripped]
>> Sent: Monday, April 25, 2011 2:34 PM
>> To: 'Mailing-List mysql'
>> Subject: Join based upon LIKE
>>
>> I have to match lists of new publications against our database, so that I 
>> can
>> replace the existing publications in our catalog. For example,
>>
>> "The UK Market for Puppies in February 2011"
>>
>> would be a replacement for
>>
>> "The UK Market for Puppies in December 2010"
>>
>> Unfortunately, the publishers aren't particularly careful with their 
>> titles.
>> One might even say they are perverse. I am likely to get
>>
>> "UK Market: Puppies - Feb 2011"
>>
>> as replacement for
>>
>> "The UK Market for Puppies in December 2010"
>>
>> You can see that a straight match by title is not going to work.
>>
>> Here's what I've been doing:
>>
>> =====
>>
>> SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');
>>
>> CREATE TEMPORARY TABLE new_titles (
>> 	new_title VARCHAR(255), INDEX (new_title),
>> 	new_title_like VARCHAR(255), INDEX (new_title_like)
>> 	);
>>
>> INSERT INTO new_titles
>> VALUES
>>
>> ('Alternative Energy Monthly Deal Analysis - M&A and Investment Trends, 
>> April
>> 2011', 'Alternative Energy Monthly Deal Analysis%M&A%Investment Trends%'),
>> ('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company
>> Share, Price Trends, Capacity Forecasts of All Active and Planned Plants',
>> 'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price
>> Trends%Capacity Forecasts of All Active%Planned Plants'),
>> ...
>> ('Underground Gas Storage Industry Outlook in North America, 2011 - Details
>of
>> All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas
>Storage
>> Industry Outlook%North America%Details of All Operating%Planned Gas Storage
>> Sites to%'),
>> ('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017',
>> 'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');
>>
>> SELECT prod.prod_title AS `Title IN Database`,
>>        new_titles.new_title AS `Title IN Feed`,
>>        prod.prod_num AS `ID`
>> FROM new_titles JOIN prod ON prod.prod_title LIKE 
>> (new_titles.new_title_like)
>> 	AND prod.pub_id = @PUBID AND prod.prod_discont = 0
>> ORDER BY new_titles.new_title;
>> ======
>>
>> (I've written code that substitutes "%" for certain strings that I specify,
>> and there is some trial and error involved.)
>>
>> Here's how MySQL handles that SELECT:
>>
>> *************************** 1. row ***************************
>>           id: 1
>>  select_type: SIMPLE
>>        table: new_titles
>>         type: ALL
>> possible_keys: NULL
>>          key: NULL
>>      key_len: NULL
>>          ref: NULL
>>         rows: 47
>>        Extra: Using filesort
>> *************************** 2. row ***************************
>>           id: 1
>>  select_type: SIMPLE
>>        table: prod
>>         type: ref
>> possible_keys: pub_id
>>          key: pub_id
>>      key_len: 48
>>          ref: const
>>         rows: 19607
>>        Extra: Using where
>> =====
>>
>> Here's the important part of the table `prod`:
>>
>> =====
>>
>>       Table: prod
>> Create Table: CREATE TABLE `prod` (
>>  `prod_id` varchar(15) NOT NULL DEFAULT '',
>>  `prod_num` mediumint(6) unsigned DEFAULT NULL,
>>  `prod_title` varchar(255) DEFAULT NULL,
>>  `prod_type` varchar(2) DEFAULT NULL,
>>  `prod_vat_pct` decimal(5,2) DEFAULT NULL,
>>  `prod_discont` tinyint(1) DEFAULT NULL,
>>  `prod_replacing` mediumint(6) unsigned DEFAULT NULL,
>>  `prod_replaced_by` mediumint(6) unsigned DEFAULT NULL,
>>  `prod_ready` tinyint(1) DEFAULT NULL,
>>  `pub_id` varchar(15) DEFAULT NULL,
>> ...
>>  PRIMARY KEY (`prod_id`),
>>  UNIQUE KEY `prod_num` (`prod_num`),
>>  KEY `prod_pub_prod_id` (`prod_pub_prod_id`),
>>  KEY `pub_id` (`pub_id`),
>>  KEY `prod_title` (`prod_title`),
>>  FULLTEXT KEY `prod_title_fulltext` (`prod_title`)
>> ) ENGINE=MyISAM DEFAULT CHARSET=utf8
>>
>> =====
>>
>> This works reasonably well for a small number (perhaps 200-300) of new
>> products; but now I've been handed a list of over 15000 to stuff into the
>> table `new_titles`! This motivates me to wonder if there is a better way,
>> since I expect this to take a very long time.
>>
>> Suggestions?
>>
>> Regards,
>>
>> Jerry Schwartz
>> Global Information Incorporated
>> 195 Farmington Ave.
>> Farmington, CT 06032
>>
>> 860.674.8796 / FAX: 860.674.8341
>> E-mail: jerry@stripped
>> Web site: www.the-infoshop.com
>>
>>
>>
>>
>>
>> --
>> MySQL General Mailing List
>> For list archives: http://lists.mysql.com/mysql
>> To unsubscribe: 
>> http://lists.mysql.com/mysql?unsub=1
>>




Thread
FW: Join based upon LIKEJerry Schwartz28 Apr
  • Re: Join based upon LIKEJohan De Meersman28 Apr
    • RE: Join based upon LIKEJerry Schwartz29 Apr
      • Re: Join based upon LIKEJohan De Meersman29 Apr
        • RE: Join based upon LIKEJerry Schwartz29 Apr
  • Re: FW: Join based upon LIKEhsv30 Apr
RE: Join based upon LIKEJerry Schwartz29 Apr
  • Re: Join based upon LIKEJohan De Meersman1 May
    • RE: Join based upon LIKEJerry Schwartz2 May
      • Re: Join based upon LIKEJohan De Meersman3 May
        • RE: Join based upon LIKEJerry Schwartz3 May
          • Re: Join based upon LIKEJohan De Meersman3 May
            • Re: Join based upon LIKEshawn wilson3 May
              • RE: Join based upon LIKEJerry Schwartz3 May
                • Re: Join based upon LIKENuno Tavares4 May
                  • RE: Join based upon LIKEJerry Schwartz5 May
Re: FW: Join based upon LIKEhsv30 Apr