List:General Discussion« Previous MessageNext Message »
From:Jerry Schwartz Date:April 28 2011 7:28pm
Subject:FW: Join based upon LIKE
View as plain text  
No takers?

-----Original Message-----
From: Jerry Schwartz [mailto:jerry@stripped]
Sent: Monday, April 25, 2011 2:34 PM
To: 'Mailing-List mysql'
Subject: Join based upon LIKE

I have to match lists of new publications against our database, so that I can 
replace the existing publications in our catalog. For example,

"The UK Market for Puppies in February 2011"

would be a replacement for

"The UK Market for Puppies in December 2010"

Unfortunately, the publishers aren't particularly careful with their titles. 
One might even say they are perverse. I am likely to get

"UK Market: Puppies - Feb 2011"

as replacement for

"The UK Market for Puppies in December 2010"

You can see that a straight match by title is not going to work.

Here's what I've been doing:

=====

SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');

CREATE TEMPORARY TABLE new_titles (
	new_title VARCHAR(255), INDEX (new_title),
	new_title_like VARCHAR(255), INDEX (new_title_like)
	);

INSERT INTO new_titles
VALUES

('Alternative Energy Monthly Deal Analysis - M&A and Investment Trends, April 
2011', 'Alternative Energy Monthly Deal Analysis%M&A%Investment Trends%'),
('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company 
Share, Price Trends, Capacity Forecasts of All Active and Planned Plants', 
'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price 
Trends%Capacity Forecasts of All Active%Planned Plants'),
...
('Underground Gas Storage Industry Outlook in North America, 2011 - Details of 
All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas Storage 
Industry Outlook%North America%Details of All Operating%Planned Gas Storage 
Sites to%'),
('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017', 
'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');

SELECT prod.prod_title AS `Title IN Database`,
        new_titles.new_title AS `Title IN Feed`,
        prod.prod_num AS `ID`
FROM new_titles JOIN prod ON prod.prod_title LIKE (new_titles.new_title_like)
	AND prod.pub_id = @PUBID AND prod.prod_discont = 0
ORDER BY new_titles.new_title;
======

(I've written code that substitutes "%" for certain strings that I specify, 
and there is some trial and error involved.)

Here's how MySQL handles that SELECT:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: new_titles
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 47
        Extra: Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: prod
         type: ref
possible_keys: pub_id
          key: pub_id
      key_len: 48
          ref: const
         rows: 19607
        Extra: Using where
=====

Here's the important part of the table `prod`:

=====

       Table: prod
Create Table: CREATE TABLE `prod` (
  `prod_id` varchar(15) NOT NULL DEFAULT '',
  `prod_num` mediumint(6) unsigned DEFAULT NULL,
  `prod_title` varchar(255) DEFAULT NULL,
  `prod_type` varchar(2) DEFAULT NULL,
  `prod_vat_pct` decimal(5,2) DEFAULT NULL,
  `prod_discont` tinyint(1) DEFAULT NULL,
  `prod_replacing` mediumint(6) unsigned DEFAULT NULL,
  `prod_replaced_by` mediumint(6) unsigned DEFAULT NULL,
  `prod_ready` tinyint(1) DEFAULT NULL,
  `pub_id` varchar(15) DEFAULT NULL,
...
  PRIMARY KEY (`prod_id`),
  UNIQUE KEY `prod_num` (`prod_num`),
  KEY `prod_pub_prod_id` (`prod_pub_prod_id`),
  KEY `pub_id` (`pub_id`),
  KEY `prod_title` (`prod_title`),
  FULLTEXT KEY `prod_title_fulltext` (`prod_title`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

=====

This works reasonably well for a small number (perhaps 200-300) of new 
products; but now I've been handed a list of over 15000 to stuff into the 
table `new_titles`! This motivates me to wonder if there is a better way, 
since I expect this to take a very long time.

Suggestions?

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@stripped
Web site: www.the-infoshop.com




Thread
FW: Join based upon LIKEJerry Schwartz28 Apr
  • Re: Join based upon LIKEJohan De Meersman28 Apr
    • RE: Join based upon LIKEJerry Schwartz29 Apr
      • Re: Join based upon LIKEJohan De Meersman29 Apr
        • RE: Join based upon LIKEJerry Schwartz29 Apr
  • Re: FW: Join based upon LIKEhsv30 Apr
RE: Join based upon LIKEJerry Schwartz29 Apr
  • Re: Join based upon LIKEJohan De Meersman1 May
    • RE: Join based upon LIKEJerry Schwartz2 May
      • Re: Join based upon LIKEJohan De Meersman3 May
        • RE: Join based upon LIKEJerry Schwartz3 May
          • Re: Join based upon LIKEJohan De Meersman3 May
            • Re: Join based upon LIKEshawn wilson3 May
              • RE: Join based upon LIKEJerry Schwartz3 May
                • Re: Join based upon LIKENuno Tavares4 May
                  • RE: Join based upon LIKEJerry Schwartz5 May
Re: FW: Join based upon LIKEhsv30 Apr