On 02.10.10 17.17, Guilhem Bichot wrote:
> Hello Roy,
>
> Thanks for a clearly commented patch.
>
>> === modified file 'mysql-test/r/subquery_sj_mat.result'
>> --- a/mysql-test/r/subquery_sj_mat.result 2010-09-20 14:06:02 +0000
>> +++ b/mysql-test/r/subquery_sj_mat.result 2010-09-29 14:33:39 +0000
>> @@ -2176,7 +2176,7 @@ create table t3 (a int);
>> insert into t3 select A.a + 10*B.a from t0 A, t0 B;
>> explain select * from t3 where a in (select kp1 from t1 where kp1<20);
>> id select_type table type possible_keys key key_len ref rows Extra
>> -1 PRIMARY t3 ALL NULL NULL NULL NULL 100 Using where
>> +1 PRIMARY t3 ALL NULL NULL NULL NULL 100
>> 1 PRIMARY t1 range kp1 kp1 5 NULL 48 Using where; Using index; Materialize
>
> This one looks worrying, performance-wise. The disappearing "Using where" was
> probably a condition "a<20"; having it attached to t3 allowed to do less lookups
> in the materialized table.
>
> Looking at "handler%" values in "show status" says that after the patch, the
> SELECT used 105 Handler_read_key, instead of 25 before. Other statistics are
> unchanged. Handler_write stays at 40: there are 40 row writes into the tmp table
> (logical, there are 40 rows having kp1<20 in t1)
> Thus: after the patch, t3's rows are not filtered anymore with a<20, so the
> number of key lookups into the materialized table grows (handler_read_key).
> Handler_write is unchanged so nothing has changed for t1. Also confirmed by the
> unchanged EXPLAIN row for t1 ("range" and "using where").
>
> The commit comment says
>
> "Some tests using semijoin materialization show that where clause
> has moved from the outer query into the materialized inner query."
>
> but I think here it's not a move, only a loss.
I am a bit concerned about this, sometimes there will be a loss, and sometimes a
gain.
Consider:
select * from ot where a in (select b from it where b<20);
It is correct that the optimizer may freely move the predicate "b<20" into the
outer query, but the price to pay is increased cost when materializing the
table. Thus, the number of rows being filtered away in the inner vs. the outer
query will tell whether there is a loss or gain.
However, an intelligent programmer will instead write this query:
select * from ot where a in (select b from it where b<20) and a<20;
(An intelligent programmer will assume a not-so-intelligent optimizer).
But the optimizer will actually move both the predicates into the outer query
(just validate with EXPLAIN EXTENDED).
Thus, I would say that we should keep the predicate within the query it is
defined in, if the materialized semijoin algorithm is chosen.
The best option would however be to generate a duplicate predicate so that it
can be attached to both queries.
Thanks,
Roy