List:General Discussion« Previous MessageNext Message »
From:Michael Stassen Date:September 17 2004 3:35pm
Subject:Re: Query with group by
View as plain text  
Jose Miguel Pérez wrote:

> Hi Michael!
> 
>     Yes, you're right, thanks for extending and clarifying my message.
> However, I'm not confident about your comments on the inefficiency for the
> following query:

I'll explain my reasoning below.

>>>        SELECT c1.date, c1.location, c1.version
>>>        FROM cities c1
>>>        LEFT JOIN cities c2
>>>            ON c1.location=c2.location AND c1.content=c2.content
>>>                AND c2.date>c1.date
>>>        WHERE c2.id IS NULL AND c1.content = 'ALPHA'
>>>
>>This will work, but it is also somewhat inefficient.  The LEFT JOIN is
>>creating numerous extra, unwanted rows, only to throw them away with the
>>WHERE c2.id IS NULL.  Assuming n rows for a particular location value, you
> 
>     [...] and then [...]
> 
>>The most efficient way is probably to use a temporary table.
>>
>>   CREATE TEMPORARY TABLE max_dates
>>   SELECT location, MAX(date) AS max_date
>>   FROM temp
>>   WHERE content = 'ALPHA'
>>   GROUP BY location;
>>
>>   SELECT t.*
>>   FROM temp t, max_dates m
>>   WHERE t.location = m.location
>>   AND t.date = m.max_date;

I notice I missed something here.  We need to add "AND content = 'ALPHA', in 
case there is a row in t with the wrong content which matches the location 
and max_date from m.  (There isn't one in the sample data.)

>>   DROP TABLE max_dates;
> 
>     I don't think the temporary table is such an efficient way of doing
> this. Pardon me, I'm provably wrong, but let me explain to see if I think
> correctly. First, I assume as true this table have an index on "location",
> "content" and "date", apart from the PK on ID. Given that, on my query we
> are using the keys at full, I mean, although you say "the left join is
> creating numerous extra, unwanted rows", this is not true. We could apply
> the standard algebra here, but the real world query optimizers are smart
> enough to not retrieve unwanted data. (What about joining four or more
> tables! Multiply then).

You are certainly right that indexes, or their lack, will make a difference. 
  I assumed proper indexing in my answer, but I probably should have made 
that explicit.

I do not claim that joins are not optimized.  Mysql is certainly smart 
enough, in general, to only fetch matching rows for joins, and to use 
indexes to do so when they are available.  But there are limits.  The 
problem in this case is the condition on the right side of the LEFT JOIN.

 From the manual, section "7.2.8 How MySQL Optimizes LEFT JOIN and RIGHT 
JOIN" <http://dev.mysql.com/doc/mysql/en/LEFT_JOIN_optimization.html>:

   A LEFT JOIN B join_condition is implemented in MySQL as follows:
   ...
   * The LEFT JOIN condition is used to decide how to retrieve rows from
     table B. (In other words, any condition in the WHERE clause is not
     used.)
   ...

So, the "WHERE c2.id IS NULL" cannot be applied until after the rows which 
match the ON clause (and the NULL rows) have been fetched.  To illustrate, 
here's the LEFT JOIN query without the right side condition (I added a few 
columns to make it clear what's happening):

   SELECT c1.id id1, c1.date, c1.location, c1.version, c2.id id2, c2.date
   FROM temp c1
   LEFT JOIN temp c2
     ON c1.location=c2.location AND c1.content=c2.content
        AND c2.date>c1.date
   WHERE c1.content = 'ALPHA';
+------+------------+----------+---------+------+------------+
| id1  | date       | location | version | id2  | date       |
+------+------------+----------+---------+------+------------+
|    1 | 2004-09-14 | PARIS    |      10 |    2 | 2004-09-15 |
|    1 | 2004-09-14 | PARIS    |      10 |    3 | 2004-09-16 |
|    2 | 2004-09-15 | PARIS    |      11 |    3 | 2004-09-16 |
|    3 | 2004-09-16 | PARIS    |      10 | NULL | NULL       |
|    4 | 2004-09-14 | NEW-YORK |      11 |    5 | 2004-09-15 |
|    4 | 2004-09-14 | NEW-YORK |      11 |    6 | 2004-09-16 |
|    5 | 2004-09-15 | NEW-YORK |      11 |    6 | 2004-09-16 |
|    6 | 2004-09-16 | NEW-YORK |      10 | NULL | NULL       |
|    7 | 2004-09-14 | TOKYO    |      10 |    8 | 2004-09-15 |
|    8 | 2004-09-15 | TOKYO    |      11 | NULL | NULL       |
+------+------------+----------+---------+------+------------+
10 rows in set (0.07 sec)

There are actually 2 more rows than the original table has with 
content='ALPHA'!  For each group (location value), we get 1 + (n(n-1)/2) 
results, where n is the number of rows in the group.  In this case, 10 rows 
to get the 3 we want.

Now, there is a special case where a condition on the right side can help 
the optimizer (same manual page as above):

   * If you use LEFT JOIN to find rows that don't exist in some table and
     you have the following test: col_name IS NULL in the WHERE part, where
     col_name is a column that is declared as NOT NULL, MySQL stops
     searching for more rows (for a particular key combination) after it has
     found one row that matches the LEFT JOIN condition.

So, if id has been declared NOT NULL, which seems likely, mysql should stop 
looking at a given date-location combination as soon as a right side match 
is found.  That reduces the rows to be filtered to:

+------+------------+----------+---------+------+------------+
| id1  | date       | location | version | id2  | date       |
+------+------------+----------+---------+------+------------+
|    1 | 2004-09-14 | PARIS    |      10 |    2 | 2004-09-15 |
|    2 | 2004-09-15 | PARIS    |      11 |    3 | 2004-09-16 |
|    3 | 2004-09-16 | PARIS    |      10 | NULL | NULL       |
|    4 | 2004-09-14 | NEW-YORK |      11 |    5 | 2004-09-15 |
|    5 | 2004-09-15 | NEW-YORK |      11 |    6 | 2004-09-16 |
|    6 | 2004-09-16 | NEW-YORK |      10 | NULL | NULL       |
|    7 | 2004-09-14 | TOKYO    |      10 |    8 | 2004-09-15 |
|    8 | 2004-09-15 | TOKYO    |      11 | NULL | NULL       |
+------+------------+----------+---------+------+------------+

That's a little better, 8 rows to get 3.  In general, you'll get one row for 
each location-date combination in this case.  That is, one row for every row 
in the table on the LEFT which matches the WHERE location = 'ALPHA'.

>     Your query is creating a temporary table, doing a full scan of it
> (thanks to the MAX(date) function), etc. If you do a EXPLAIN SELECT for your
> query, you'll notice there is an Extra of: "Using where; Using temporary;
> Using FILESORT". Reading the MySQL documentation, one can see "If you want
> to make your queries as fast as possible, you should look out for Extra
> values of Using filesort and Using temporary.". (Chapter 7.2.1 EXPLAIN
> Syntax).

In reverse order:

You are certainly right that "temporary" and "filesort" are to be avoided. 
And they will be, if the table is properly indexed.  Single column indexing 
won't help much here, because the WHERE condition, the GROUP BY column, and 
the MAX column are all different.  A multi-column index on (content, 
location, date), however, will allow mysql to use the index to find the 
matching rows, find the groups, and calculate the MAX date.  In that case, 
EXPLAIN shows "Using where; Using index".

Yes, a full table scan of the temporary table is done in step 2, but that's 
OK, because the temporary table has precisely one row per desired result.

   SELECT * FROM max_dates;
   +----------+------------+
   | location | max_date   |
   +----------+------------+
   | NEW-YORK | 2004-09-16 |
   | PARIS    | 2004-09-16 |
   | TOKYO    | 2004-09-15 |
   +----------+------------+
   3 rows in set (0.00 sec)

>     If I'm not wrong, maybe the first LEFT JOIN is worse from a mathematical
> point of view, but the temporary one maybe is the worst from a practical
> perspective.

I meant my comments to be practical, rather than theoretical.  In general, I 
am wary of solutions which select extra rows then filter for the desired 
results.  Sometimes that's necessary, but my expectation is that a solution 
which selects only what you need should be faster most of the time.  For 
this small sample table, the difference is irrelevant.  For a large, busy, 
production db, a difference in speed may be a big deal.  In that case, one 
could certainly try it both ways to see which is faster.

>     And you'll see I'm very cautious because I'm not such a SQL guru, but
> I'd like to know other opinions.
> 
>     Anyway, I don't know if one can program an agregate UDF called something
> like EXTERNAL_MAX(...) or something, so that we could do like:
> 
>         SELECT EXTERNAL_MAX(date, version)  ---> i.e: Returns the "version"
> value for the row with MAX(date).
> 
>     This, for sure, will be the best solution. ;-)

That would have to do the same thing behind the scenes.

>     Cheers,
>     Jose Miguel.

Michael
Thread
Query with group byVincent.Badier16 Sep
  • Re: Query with group byRhino16 Sep
    • Re: Query with group byVincent.Badier16 Sep
  • Re: Query with group byJose Miguel Pérez16 Sep
    • Re: Query with group byMichael Stassen16 Sep
  • Re: Query with group byRhino16 Sep
    • Re: Query with group byMichael Stassen17 Sep
  • Re: Query with group byJose Miguel Pérez16 Sep
    • Re: Query with group byMichael Stassen17 Sep
      • RE: Query with group byJose Miguel Pérez22 Sep
  • Re: Query with group byRhino17 Sep
    • Re: Query with group byMichael Stassen18 Sep
  • Re: Query with group byRhino18 Sep