List:Commits« Previous MessageNext Message »
From:Guilhem Bichot Date:August 23 2010 10:07am
Subject:Re: bzr commit into mysql-next-mr-opt-backporting branch (jorgen.loland:3228)
Bug#49129
View as plain text  
Hello Jorgen,

I'm not reviewer for this bug, but as I had spent a bit of time on it 
I'm casting thoughts anyway:

Jorgen Loland a écrit, Le 13.08.2010 14:49:
> #At file:///export/home/jl208045/mysql/mysql-next-mr-opt-backporting-49129/ based on
> revid:guilhem.bichot@stripped
> 
>  3228 Jorgen Loland	2010-08-13
>       Bug#49129 - Wrong result with IN-subquery with join_cache_level=6 
>                   and firstmatch=off
>       
>       Patch based on contribution from Sergey Petrunia.
>       
>       Consider the query:
>       
>       SELECT * FROM t0 WHERE t0.a IN (
>          SELECT t1.a FROM t1, t2 WHERE t2.a=t0.a AND t1.b=t2.b);
>       
>       With join cache level 6, this query only returns the first tuple
>       from t0 that has a match in the subquery. Consider the relevant part
>       of EXPLAIN:
>       
>       t0    | Using where                                   |
>       t1    | Start temporary; Using join buffer            |
>       t2    | Using where; End temporary; Using join buffer |
>       
>       When the optimizer decides to use join buffers, temporary tables
>       created for duplicate weedout should extend to the first table
>       after const tables. I.e., Start temporary should be printed for
>       t0 above.

here we could explain why the temporary table should extend to the first 
table after const tables. I think it's for the following reason.
If join buffering is not used, we have a usual nested loop join. It can 
be like this:
* read a first record from t0
** read a first record from t1
*** read a first record from t2
*** store (t1 || t2) ( || stands for concatenation of records) into tmp 
table
*** read a second record from t2
*** store (t1 || t2) into tmp table; to not produce duplicates, we need 
to recognize that this is about the same first record of t1 as in the 
previous iteration; this works as we have stored the rowid of t1's 
record in the tmp table, so that it can be eliminated as duplicate.
Or it can be like this:
* read a first record from t0
** read a first record from t1
*** read a first record from t2
*** store (t1 || t2) into tmp table
** no more records in t2, read a second record from t1
*** read a first record from t2
*** store (t1 || t2) into tmp table; to not produce duplicates, we need 
to recognize that this is about the same first record of t2 as in the 
previous iteration; this works as we have stored the rowid of t2's 
record in the tmp table, so that it can be eliminated as duplicate.
Anyway, when we get to the second record of t0, we will never have a 
chance to produce duplicates of what we produced for the first record of 
t0 (we will never get back to this first record of t0, this is nested 
loop join): thus t0's record does not need to be stored in the tmp table.
Now, if using join buffering, for example block nested loop join.
We read records from t0, we buffer many of them (because t1 does join 
buffering), then we read records of t1, we also buffer many of them 
(because t2 does join buffering), then it's time to flush all those 
buffers and produce a result set. Flushing happens in the t2-t1 order: 
we read a record from table t2, and for this record, we join it with all 
buffered records from t1 and all buffered records from t0. So you get 
this order of output records:
first-row-of-t2 || first-row-of-t1 || first-row-of-t0
first-row-of-t2 || first-row-of-t1 || second-row-of-t0
etc
If we don't store t0 into the tmp table, the two first rows reduce to:
first-row-of-t2 || first-row-of-t1
first-row-of-t2 || first-row-of-t1
which are then seen as duplicates, so the second-row-of-t0 will be 
wrongly excluded from the result. On the opposite, if we store t0 into 
the tmp table, the two first are correctly non-duplicate (t0's rowid 
helps distinguishing them).

>       The reason for the bug is that setup_semijoin_dups_elimination()
>       is called before the final decision is made in
>       check_join_cache_usage() on whether or not to use join buffering.
>       In this case, use_join_buffer==false for t1 and t2 during
>       setup_semijoin_dups_elimination(), and the range of tables to
>       buffer is therefore not extended to t0.
>       
>       Since check_join_cache_usage() needs to know if duplicate weedout
>       is used, so moving setup_semijoin_dups_elimination() from before
>       check_join_cache_usage() to after it is not possible. 
>       
>       The temporary fix of this patch is to use a rough estimate of
>       whether join buffering will be used in
>       setup_semijoin_dups_elimination(). This rough test covers more 
>       cases than actually end up with join buffering, and in these cases 
>       we now extend the temporary table to store rowids for more tables 
>       than strictly required, i.e., the first table up to the start of 
>       the semijoin. A proper (but much more costly to do) fix would be to 
>       merge the loops of setup_semijoin_dups_elimination() and 
>       make_join_readinfo() (which calls check_join_cache_usage()).

good comment

> === modified file 'sql/sql_join_cache.cc'
> --- a/sql/sql_join_cache.cc	2010-07-23 17:51:11 +0000
> +++ b/sql/sql_join_cache.cc	2010-08-13 12:49:50 +0000
> @@ -612,7 +612,12 @@ int JOIN_CACHE_BKA::init()
>        copy_end= cache->field_descr+cache->fields;
>        for (copy= cache->field_descr+cache->flag_fields; copy < copy_end;
> copy++)
>        {
> -        if (copy->field->table == tab->table &&
> +        /*
> +          (1) - when we store rowids for DuplicateWeedout, they have
> +                copy->field==NULL

do you know why?

> +        */
> +        if (copy->field &&  // (1)
> +            copy->field->table == tab->table &&
>              bitmap_is_set(key_read_set, copy->field->field_index))
>          {
>            *copy_ptr++= copy; 
> 
> === modified file 'sql/sql_select.cc'
> --- a/sql/sql_select.cc	2010-08-12 09:02:11 +0000
> +++ b/sql/sql_select.cc	2010-08-13 12:49:50 +0000
> @@ -1430,10 +1430,38 @@ int setup_semijoin_dups_elimination(JOIN
>            forwards, but do not destroy other duplicate elimination methods.
>          */
>          uint first_table= i;
> +        uint join_cache_level=
> join->thd->variables.optimizer_join_cache_level;
>          for (uint j= i; j < i + pos->n_sj_tables; j++)
>          {
> -          if (join->best_positions[j].use_join_buffer && j <=
> no_jbuf_after)
> +          /*
> +            The final decision on whether or not join buffering will
> +            be used is done in check_join_cache_usage(), which is
> +            called from make_join_readinfo()'s main loop.
> +            check_join_cache_usage() needs to know if duplicate
> +            weedout is used, so moving
> +            setup_semijoin_dups_elimination() from before the main
> +            loop to after it is not possible. I.e.,
> +            join->best_positions[j].use_join_buffer is not
> +            trustworthy at this point.
> +
> +            TODO: merge make_join_readinfo() and

for doxygen, you could use /** @todo */

> +            setup_semijoin_dups_elimination() loops and change the
> +            following 'if' to
> +
> +            "if (join->best_positions[j].use_join_buffer && 
> +                 j <= no_jbuf_after)".
> +
> +            For now, use a rough criteria:
> +          */
> +          JOIN_TAB *sj_tab=join->join_tab + j; 
> +          if (j != join->const_tables && 
> +              sj_tab->use_quick != 2 && 
> +              j <= no_jbuf_after &&
> +              ((sj_tab->type == JT_ALL && join_cache_level != 0) ||
> +               (join_cache_level > 4 && (tab->type == JT_REF || 
> +                                         tab->type == JT_EQ_REF))))

I would like to protect against the case of a developer, in the future, 
who would relax the conditions in check_join_cache_usage() and would 
forget to update the if() above. Imagine that person would make 
check_join_cache_usage() use BKA for some more types than 
JT_REF/JT_EQ_REF and forget to update the if() above: the bug would be 
re-introduced.
To prevent this, a suggestion:
- put the if() condition into a function, named like
bool might_do_join_buffering() const (or a better name :-)
- in check_join_cache_usage(), each time we decide to use a join buffer 
(JOIN_CACHE_(BNL|BKA|BKA_UNIQUE)), do
DBUG_ASSERT(might_do_join_buffering());
(or add this assertion into the constructor of JOIN_CACHE, if 
technically possible).
This would ensure that the if() above is always a superset of 
check_join_cache_usage()'s decisions.

>            {
> +            /* Looks like we'll be using join buffer */

I would change to "it might be that we'll be using join buffer" to 
insist on the uncertainty.

>              first_table= join->const_tables;
>              break;
>            }
> @@ -8080,6 +8108,10 @@ void calc_used_field_length(THD *thd, JO
>  			     (join_tab->table->s->reclength- rec_length));
>      rec_length+=(uint) max(4,blob_length);
>    }
> +  /*
> +    psergey-todo: why we don't count here rowid that we might need to store
> +    when using DuplicateElimination?
> +  */

How about a different type of todo-tag (/** @todo */)? If you have time: 
maybe this todo asks an interesting question, maybe it's possible to 
create a testcase which shows a real bug? Or it's possible to muse a bit 
and find that this todo is actually a non-problem? Maybe that would be 
significant work, though?

>    join_tab->used_fields=fields;
>    join_tab->used_fieldlength=rec_length;
>    join_tab->used_blobs=blobs;
Thread
bzr commit into mysql-next-mr-opt-backporting branch (jorgen.loland:3228)Bug#49129Jorgen Loland13 Aug
  • Re: bzr commit into mysql-next-mr-opt-backporting branch (jorgen.loland:3228)Bug#49129Jorgen Loland13 Aug
    • Re: bzr commit into mysql-next-mr-opt-backporting branch (jorgen.loland:3228)Bug#49129Roy Lyseng26 Aug
  • Re: bzr commit into mysql-next-mr-opt-backporting branch (jorgen.loland:3228)Bug#49129Guilhem Bichot23 Aug
    • Re: bzr commit into mysql-next-mr-opt-backporting branch (jorgen.loland:3228)Bug#49129Jorgen Loland27 Aug