List: General Discussion
From: John Stoffel
Date: September 28 2010 3:02pm
Subject: Howto optimize Distinct query over 1.3mil rows?
Hi,

I'm running MySQL 5.0.51a-24+lenny3-log on a Debian Lenny box with
2GB of RAM and a pair of dual-core 2.6GHz CPUs.  I'm using a pair of
40GB disks mirrored using MD (Linux software RAID) for both the OS and
the storage of the MySQL DBs.

My problem child is doing this simple query:

   mysql> select distinct Call_No from Newspaper_Issues
       -> WHERE BIB_ID = 464;
   +----------+
   | Call_No  |
   +----------+
   | News     | 
   | NewsD CT | 
   +----------+
   2 rows in set (2.98 sec)

The Newspaper_Issues table has 1.3 million rows, and has a structure
like this:

   mysql> describe  Newspaper_Issues;
   +----------------+-------------+------+-----+---------+----------------+
   | Field          | Type        | Null | Key | Default | Extra          |
   +----------------+-------------+------+-----+---------+----------------+
   | Record_No      | int(11)     | NO   | PRI | NULL    | auto_increment | 
   | BIB_ID         | varchar(38) | NO   | MUL | NULL    |                | 
   | Issue_Date     | date        | NO   | MUL | NULL    |                | 
   | Type_Code      | char(1)     | NO   |     | r       |                | 
   | Condition_Code | char(1)     | NO   |     | o       |                | 
   | Format_Code    | char(1)     | NO   |     | p       |                | 
   | Edition_Code   | char(1)     | NO   |     | n       |                | 
   | Date_Type_Code | char(1)     | NO   |     | n       |                | 
   | Ed_Type        | tinyint(1)  | NO   |     | 1       |                | 
   | RS_Code        | char(1)     | NO   |     | c       |                | 
   | Call_No        | varchar(36) | YES  | MUL | NULL    |                | 
   | Printed_Date   | varchar(10) | YES  |     | NULL    |                | 
   | Update_Date    | date        | NO   |     | NULL    |                | 
   +----------------+-------------+------+-----+---------+----------------+
   13 rows in set (0.00 sec)
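
In case the DESCRIBE output is awkward to read, the definition is roughly the
following (I reconstructed this by hand from the DESCRIBE, so details like the
storage engine and charset are omitted, and the indexes are listed separately
below):

   CREATE TABLE Newspaper_Issues (
       Record_No      INT(11)     NOT NULL AUTO_INCREMENT,
       BIB_ID         VARCHAR(38) NOT NULL,
       Issue_Date     DATE        NOT NULL,
       Type_Code      CHAR(1)     NOT NULL DEFAULT 'r',
       Condition_Code CHAR(1)     NOT NULL DEFAULT 'o',
       Format_Code    CHAR(1)     NOT NULL DEFAULT 'p',
       Edition_Code   CHAR(1)     NOT NULL DEFAULT 'n',
       Date_Type_Code CHAR(1)     NOT NULL DEFAULT 'n',
       Ed_Type        TINYINT(1)  NOT NULL DEFAULT 1,
       RS_Code        CHAR(1)     NOT NULL DEFAULT 'c',
       Call_No        VARCHAR(36) DEFAULT NULL,
       Printed_Date   VARCHAR(10) DEFAULT NULL,
       Update_Date    DATE        NOT NULL,
       PRIMARY KEY (Record_No)
   );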


I've tried adding various indexes and reading up on how to optimize
DISTINCT or GROUP BY queries, but I'm hitting a wall here.  My
current indexes are:

mysql> show index from Newspaper_Issues;
+------------------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table            | Non_unique | Key_name             | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+------------------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Newspaper_Issues |          0 | PRIMARY              |            1 | Record_No   | A         |     1333298 |     NULL | NULL   |      | BTREE      |         |
| Newspaper_Issues |          1 | BIB_ID               |            1 | BIB_ID      | A         |       14980 |     NULL | NULL   |      | BTREE      |         |
| Newspaper_Issues |          1 | Call_No              |            1 | Call_No     | A         |         927 |     NULL | NULL   | YES  | BTREE      |         |
| Newspaper_Issues |          1 | Issue_Date           |            1 | Issue_Date  | A         |       49381 |     NULL | NULL   |      | BTREE      |         |
| Newspaper_Issues |          1 | BIB_ID_Issue_Date    |            1 | BIB_ID      | A         |       14980 |     NULL | NULL   |      | BTREE      |         |
| Newspaper_Issues |          1 | BIB_ID_Issue_Date    |            2 | Issue_Date  | A         |     1333298 |     NULL | NULL   |      | BTREE      |         |
| Newspaper_Issues |          1 | call_no_short        |            1 | Call_No     | A         |          30 |        6 | NULL   | YES  | BTREE      |         |
| Newspaper_Issues |          1 | BIB_ID_call_no_short |            1 | BIB_ID      | A         |       14980 |     NULL | NULL   |      | BTREE      |         |
| Newspaper_Issues |          1 | BIB_ID_call_no_short |            2 | Call_No     | A         |       15503 |     NULL | NULL   | YES  | BTREE      |         |
| Newspaper_Issues |          1 | call_no_bib_id       |            1 | Call_No     | A         |         927 |     NULL | NULL   | YES  | BTREE      |         |
| Newspaper_Issues |          1 | call_no_bib_id       |            2 | BIB_ID      | A         |       15503 |     NULL | NULL   |      | BTREE      |         |
+------------------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
11 rows in set (0.00 sec)
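
For reference, those indexes were created with statements roughly like the
following (reconstructed from the SHOW INDEX output, so the exact original
DDL may have differed slightly):

   CREATE INDEX BIB_ID               ON Newspaper_Issues (BIB_ID);
   CREATE INDEX Call_No              ON Newspaper_Issues (Call_No);
   CREATE INDEX Issue_Date           ON Newspaper_Issues (Issue_Date);
   CREATE INDEX BIB_ID_Issue_Date    ON Newspaper_Issues (BIB_ID, Issue_Date);
   -- 6-character prefix, per the Sub_part column above
   CREATE INDEX call_no_short        ON Newspaper_Issues (Call_No(6));
   CREATE INDEX BIB_ID_call_no_short ON Newspaper_Issues (BIB_ID, Call_No);
   CREATE INDEX call_no_bib_id       ON Newspaper_Issues (Call_No, BIB_ID);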


So now when I do an explain on my query I get:

    mysql> explain select distinct(Call_No) from Newspaper_Issues WHERE BIB_ID = 464;
    +----+-------------+------------------+-------+-----------------------------------------------+----------------+---------+------+---------+--------------------------+
    | id | select_type | table            | type  | possible_keys                                 | key            | key_len | ref  | rows    | Extra                    |
    +----+-------------+------------------+-------+-----------------------------------------------+----------------+---------+------+---------+--------------------------+
    |  1 | SIMPLE      | Newspaper_Issues | index | BIB_ID,BIB_ID_Issue_Date,BIB_ID_call_no_short | call_no_bib_id | 227     | NULL | 1333298 | Using where; Using index |
    +----+-------------+------------------+-------+-----------------------------------------------+----------------+---------+------+---------+--------------------------+
    1 row in set (0.00 sec)

Interestingly enough, I get much better performance if I just drop the
WHERE clause, but that doesn't give me what I want.  *grin*

    mysql> explain select distinct(Call_No) from Newspaper_Issues;
    +----+-------------+------------------+-------+---------------+---------+---------+------+------+--------------------------+
    | id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
    +----+-------------+------------------+-------+---------------+---------+---------+------+------+--------------------------+
    |  1 | SIMPLE      | Newspaper_Issues | range | NULL          | Call_No | 111     | NULL |  928 | Using index for group-by |
    +----+-------------+------------------+-------+---------------+---------+---------+------+------+--------------------------+
    1 row in set (0.00 sec)
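
One thing I haven't properly tried yet is hinting the optimizer toward the
(BIB_ID, Call_No) composite index instead of the call_no_bib_id one it keeps
picking, something along these lines (untested, so just a sketch):

    mysql> select distinct Call_No
        -> from Newspaper_Issues FORCE INDEX (BIB_ID_call_no_short)
        -> WHERE BIB_ID = 464;

I don't know yet whether that would actually help or just trade one index
scan for another.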



Would it make sense to split the Call_No data off into its own table and put
in a proper numeric ID, instead of the current VARCHAR(36) it uses?  So in
Newspaper_Issues I'd just have a Call_No_ID, and a new Call_No table with:

      Call_No_ID    INT
      Call_No       CHAR(36)

which would hopefully index better?  I only have 928 distinct Call_No strings,
so I'm not afraid of wasting space or anything.
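
Roughly, what I'm picturing is something like this (just a sketch, and the
names/types are placeholders):

      CREATE TABLE Call_No (
          Call_No_ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
          Call_No    CHAR(36) NOT NULL,
          UNIQUE KEY (Call_No)
      );

      -- Newspaper_Issues would then carry a Call_No_ID column instead of
      -- the VARCHAR(36), indexed together with BIB_ID, and the query
      -- would become:
      SELECT DISTINCT c.Call_No
      FROM Newspaper_Issues n
      JOIN Call_No c ON c.Call_No_ID = n.Call_No_ID
      WHERE n.BIB_ID = 464;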

Thanks,
John