From: Jorgen Loland Date: August 27 2010 12:42pm Subject: Re: bzr commit into mysql-next-mr-opt-backporting branch (jorgen.loland:3228) Bug#49129 List-Archive: http://lists.mysql.com/commits/117012 Message-Id: <4C77B2C1.20700@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit Guilhem, Roy Thank you both for the review. I have done some improvements in my latest patch, most noticeably: * Added bool extend_duplicate_generating_range() to use in if and ASSERT as suggested * Added comment on why rowid is needed for join buffering. I added this to the code instead of the commit comment so that it's easily accessible for later. I had to dig a lot into WLs and code for this, but I think I got it right. * Changed tbl_no and tbl names as suggested. * Fixed to doxygen @todo's Two questions remain: Why is copy->field NULL for duplicate weedout Sergeys todo on counting rowids I do not have a good answer to these :( On 08/23/2010 12:07 PM, Guilhem Bichot wrote: > Hello Jorgen, > > I'm not reviewer for this bug, but as I had spent a bit of time on it > I'm casting thoughts anyway: > > Jorgen Loland a écrit, Le 13.08.2010 14:49: >> #At >> file:///export/home/jl208045/mysql/mysql-next-mr-opt-backporting-49129/ based >> on revid:guilhem.bichot@stripped >> >> 3228 Jorgen Loland 2010-08-13 >> Bug#49129 - Wrong result with IN-subquery with join_cache_level=6 and >> firstmatch=off >> Patch based on contribution from Sergey Petrunia. >> Consider the query: >> SELECT * FROM t0 WHERE t0.a IN ( >> SELECT t1.a FROM t1, t2 WHERE t2.a=t0.a AND t1.b=t2.b); >> With join cache level 6, this query only returns the first tuple >> from t0 that has a match in the subquery. Consider the relevant part >> of EXPLAIN: >> t0 | Using where | >> t1 | Start temporary; Using join buffer | >> t2 | Using where; End temporary; Using join buffer | >> When the optimizer decides to use join buffers, temporary tables >> created for duplicate weedout should extend to the first table >> after const tables. I.e., Start temporary should be printed for >> t0 above. > > here we could explain why the temporary table should extend to the first > table after const tables. I think it's for the following reason. > If join buffering is not used, we have a usual nested loop join. It can > be like this: > * read a first record from t0 > ** read a first record from t1 > *** read a first record from t2 > *** store (t1 || t2) ( || stands for concatenation of records) into tmp > table > *** read a second record from t2 > *** store (t1 || t2) into tmp table; to not produce duplicates, we need > to recognize that this is about the same first record of t1 as in the > previous iteration; this works as we have stored the rowid of t1's > record in the tmp table, so that it can be eliminated as duplicate. > Or it can be like this: > * read a first record from t0 > ** read a first record from t1 > *** read a first record from t2 > *** store (t1 || t2) into tmp table > ** no more records in t2, read a second record from t1 > *** read a first record from t2 > *** store (t1 || t2) into tmp table; to not produce duplicates, we need > to recognize that this is about the same first record of t2 as in the > previous iteration; this works as we have stored the rowid of t2's > record in the tmp table, so that it can be eliminated as duplicate. > Anyway, when we get to the second record of t0, we will never have a > chance to produce duplicates of what we produced for the first record of > t0 (we will never get back to this first record of t0, this is nested > loop join): thus t0's record does not need to be stored in the tmp table. > Now, if using join buffering, for example block nested loop join. > We read records from t0, we buffer many of them (because t1 does join > buffering), then we read records of t1, we also buffer many of them > (because t2 does join buffering), then it's time to flush all those > buffers and produce a result set. Flushing happens in the t2-t1 order: > we read a record from table t2, and for this record, we join it with all > buffered records from t1 and all buffered records from t0. So you get > this order of output records: > first-row-of-t2 || first-row-of-t1 || first-row-of-t0 > first-row-of-t2 || first-row-of-t1 || second-row-of-t0 > etc > If we don't store t0 into the tmp table, the two first rows reduce to: > first-row-of-t2 || first-row-of-t1 > first-row-of-t2 || first-row-of-t1 > which are then seen as duplicates, so the second-row-of-t0 will be > wrongly excluded from the result. On the opposite, if we store t0 into > the tmp table, the two first are correctly non-duplicate (t0's rowid > helps distinguishing them). > >> The reason for the bug is that setup_semijoin_dups_elimination() >> is called before the final decision is made in >> check_join_cache_usage() on whether or not to use join buffering. >> In this case, use_join_buffer==false for t1 and t2 during >> setup_semijoin_dups_elimination(), and the range of tables to >> buffer is therefore not extended to t0. >> Since check_join_cache_usage() needs to know if duplicate weedout >> is used, so moving setup_semijoin_dups_elimination() from before >> check_join_cache_usage() to after it is not possible. The temporary >> fix of this patch is to use a rough estimate of >> whether join buffering will be used in >> setup_semijoin_dups_elimination(). This rough test covers more cases >> than actually end up with join buffering, and in these cases we now >> extend the temporary table to store rowids for more tables than >> strictly required, i.e., the first table up to the start of the >> semijoin. A proper (but much more costly to do) fix would be to merge >> the loops of setup_semijoin_dups_elimination() and >> make_join_readinfo() (which calls check_join_cache_usage()). > > good comment > >> === modified file 'sql/sql_join_cache.cc' >> --- a/sql/sql_join_cache.cc 2010-07-23 17:51:11 +0000 >> +++ b/sql/sql_join_cache.cc 2010-08-13 12:49:50 +0000 >> @@ -612,7 +612,12 @@ int JOIN_CACHE_BKA::init() >> copy_end= cache->field_descr+cache->fields; >> for (copy= cache->field_descr+cache->flag_fields; copy < copy_end; >> copy++) >> { >> - if (copy->field->table == tab->table && >> + /* >> + (1) - when we store rowids for DuplicateWeedout, they have >> + copy->field==NULL > > do you know why? > >> + */ >> + if (copy->field && // (1) >> + copy->field->table == tab->table && >> bitmap_is_set(key_read_set, copy->field->field_index)) >> { >> *copy_ptr++= copy; >> === modified file 'sql/sql_select.cc' >> --- a/sql/sql_select.cc 2010-08-12 09:02:11 +0000 >> +++ b/sql/sql_select.cc 2010-08-13 12:49:50 +0000 >> @@ -1430,10 +1430,38 @@ int setup_semijoin_dups_elimination(JOIN >> forwards, but do not destroy other duplicate elimination methods. >> */ >> uint first_table= i; >> + uint join_cache_level= join->thd->variables.optimizer_join_cache_level; >> for (uint j= i; j < i + pos->n_sj_tables; j++) >> { >> - if (join->best_positions[j].use_join_buffer && j <= no_jbuf_after) >> + /* >> + The final decision on whether or not join buffering will >> + be used is done in check_join_cache_usage(), which is >> + called from make_join_readinfo()'s main loop. >> + check_join_cache_usage() needs to know if duplicate >> + weedout is used, so moving >> + setup_semijoin_dups_elimination() from before the main >> + loop to after it is not possible. I.e., >> + join->best_positions[j].use_join_buffer is not >> + trustworthy at this point. >> + >> + TODO: merge make_join_readinfo() and > > for doxygen, you could use /** @todo */ > >> + setup_semijoin_dups_elimination() loops and change the >> + following 'if' to >> + >> + "if (join->best_positions[j].use_join_buffer && + j <= no_jbuf_after)". >> + >> + For now, use a rough criteria: >> + */ >> + JOIN_TAB *sj_tab=join->join_tab + j; + if (j != join->const_tables >> && + sj_tab->use_quick != 2 && + j <= no_jbuf_after && >> + ((sj_tab->type == JT_ALL && join_cache_level != 0) || >> + (join_cache_level > 4 && (tab->type == JT_REF || + tab->type == >> JT_EQ_REF)))) > > I would like to protect against the case of a developer, in the future, > who would relax the conditions in check_join_cache_usage() and would > forget to update the if() above. Imagine that person would make > check_join_cache_usage() use BKA for some more types than > JT_REF/JT_EQ_REF and forget to update the if() above: the bug would be > re-introduced. > To prevent this, a suggestion: > - put the if() condition into a function, named like > bool might_do_join_buffering() const (or a better name :-) > - in check_join_cache_usage(), each time we decide to use a join buffer > (JOIN_CACHE_(BNL|BKA|BKA_UNIQUE)), do > DBUG_ASSERT(might_do_join_buffering()); > (or add this assertion into the constructor of JOIN_CACHE, if > technically possible). > This would ensure that the if() above is always a superset of > check_join_cache_usage()'s decisions. > >> { >> + /* Looks like we'll be using join buffer */ > > I would change to "it might be that we'll be using join buffer" to > insist on the uncertainty. > >> first_table= join->const_tables; >> break; >> } >> @@ -8080,6 +8108,10 @@ void calc_used_field_length(THD *thd, JO >> (join_tab->table->s->reclength- rec_length)); >> rec_length+=(uint) max(4,blob_length); >> } >> + /* >> + psergey-todo: why we don't count here rowid that we might need to store >> + when using DuplicateElimination? >> + */ > > How about a different type of todo-tag (/** @todo */)? If you have time: > maybe this todo asks an interesting question, maybe it's possible to > create a testcase which shows a real bug? Or it's possible to muse a bit > and find that this todo is actually a non-problem? Maybe that would be > significant work, though? > >> join_tab->used_fields=fields; >> join_tab->used_fieldlength=rec_length; >> join_tab->used_blobs=blobs; -- Jørgen Løland | Senior Software Engineer | +47 73842138 Oracle MySQL Trondheim, Norway