From: Roy Lyseng  Date: April 21 2010 9:31am
Subject: Re: patches for BUG#50595 (diff -u format)
Hi Guilhem,

thanks for a well-documented bug fix; it's a reviewer's delight!

Patch approved, see some comments below.

Guilhem Bichot wrote:
> [resending identical but with all patches in "diff -u" format]
> 
> Hello,
> 
> Because the fix for BUG#50595 is better understood as 6 small patches,
> and because I need to share some analysis with reviewers, I'm not
> using the ordinary commit mail but some text with patches in
> attachments.
> 
> In BUG#50595 we have a case where an EXPLAIN (or the equivalent
> SELECT) takes 2 seconds in a 6.0-codebase, versus 1.4 seconds in
> next-mr (both optimized builds, BUILD/compile-pentium64 with
> gcc, on Linux 64-bit; this was also observed on Solaris 32-bit
> SPARC and was originally seen on pushbuild2).
> The case is a bit artificial, a natural join between 20 tables:
> 
> EXPLAIN SELECT CAST(f1 AS SIGNED INTEGER) AS f1,
> CAST(f2 AS CHAR) AS f2 FROM test1.v20;
> 
> v20 is a view which is a join between a base table and a view v19:
> CREATE VIEW test1.v20 AS SELECT f1, f2
> FROM test3.t1 tab1 NATURAL JOIN test1.v19 tab2;
> 
> v19 is a view which is a join between a base table and a view v18,
> etc: when all views are replaced by their definition ("merged" in the
> top-level query) it makes a 20-table join.
> The test and result are attached (bug.*).
> 
> So the goal is to go from 2 secs to 1.4 secs.
> I observe that the compiler flags are identical. next-mr is built with
> gcc and 6.0-codebase with g++, but that's not the cause (building
> next-mr with g++ doesn't make it slower).
> The produced query plan is identical. Because of the 20 tables, the
> EXPLAIN query spends 95%+ of its time in greedy_search(), and that's
> where the slowdown happens. There is no semijoin in the query.
> 
> So the culprits are greedy_search() and what it calls:
> best_extension_by_limited_search(), which itself calls
> best_access_path(). An analysis of those functions in next-mr vs
> 6.0-codebase:
> - in best_extension_by_limited_search(), 6.0-codebase calls semijoin
> functions (advance_sj_state() and restore_prev_sj_state()). This will
> prove to be a cause.
> - best_extension_by_limited_search() does memcpy's of plans, i.e. of
> POSITION structures. In 6.0-codebase the size of this struct is
> 184 bytes vs 40 in next-mr (due at least to semijoin-specific
> members), but this isn't a cause (padding next-mr's POSITION to 184
> bytes doesn't slow it down).
> - the number of calls to best_extension_by_limited_search() (6
> million) and best_access_path() (14 million) is the same in
> 6.0-codebase and next-mr, so the number isn't the cause.
> - In best_access_path(): 6.0-codebase uses a Loose_scan_opt object and
> calls its member functions (to evaluate the Loosescan semijoin
> strategy). This will prove to be a cause.
> 
> My testcase has no index, and the best_access_path() function above
> has a lot of index-specific code, so to verify that I wasn't slowing
> down that index-specific code, I made a variant of the testcase with
> indices (just adding "INDEX(f1),INDEX(f2)" to each CREATE TABLE).
> With indices, EXPLAIN didn't finish after 55 minutes at 100% CPU (you
> remember it took 2 secs without indices). The explanation is that
> without indices, the heuristic of optimizer_prune_level eliminates many
> plans early (don't ask me why, I don't understand this heuristic, and
> the comments inside it:
>                /* TODO: What is the reasoning behind this condition? */
>                (!(s->key_dependent & remaining_tables) ||
> are not encouraging). If I turn it off (optimizer_prune_level=0) then
> EXPLAIN with no indices starts taking a very long time too.
> I couldn't research whether this is normal (off-topic for my assigned
> bug), so for tables with indices (and only in this case), I limited to
>    set optimizer_search_depth=5;
> Then EXPLAIN takes 1-2 seconds which is workable for my studies.
> So I have two testcases: one without index (with search_depth=default)
> and one with index (with search_depth=5), and I benchmark my patches
> against both testcases.
> 
> To benchmark, I edit greedy_search() (which is called once in the
> query): when the function starts I record the time with my_getsystime(),
> when it returns I record the time again and print the difference; the
> timer function seems to have a resolution of 100ns, which is enough.
> Below, timings will be given in milliseconds (ms). When I benchmark, I
> run the query 20 times, eliminate the first two runs (which seem to have
> some jitter), then compute the average, and verify that the standard
> deviation is below 5 ms.
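The averaging protocol above can be sketched as a small helper (hypothetical code, not part of the server; the server-side timing uses my_getsystime() as described):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch of the measurement protocol described above: drop the first two
// warm-up runs, average the rest, and report the standard deviation so it
// can be checked against the 5 ms bound.
struct RunStats { double mean_ms; double stddev_ms; };

RunStats summarize(const std::vector<double> &runs_ms) {
    // Discard the first two runs, which showed jitter.
    std::vector<double> kept(runs_ms.begin() + 2, runs_ms.end());
    double sum = 0.0;
    for (double r : kept) sum += r;
    const double mean = sum / kept.size();
    double var = 0.0;
    for (double r : kept) var += (r - mean) * (r - mean);
    var /= kept.size();
    return RunStats{mean, std::sqrt(var)};
}
```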
> 
> 
>                     no index | % faster than van. 6.0 | index | % faster
> vanilla next-mr    1427     | -28                    | 1280  | -28
> vanilla 6.0        1979     | 0                      | 1770  | 0
> 
> So 28% is our reduction goal.
> 
> Patch "if_around_sj_state"
> ==========================
> 
> Functions advance_sj_state()/restore_prev_sj_state() (credits to Roy
> for early finger-pointing at those functions):
> - do nothing useful for our particular test (no semijoin)
> - but still take time: oprofile for 6.0-codebase:
> 474720   67.6149  mysqld                   best_access_path(JOIN*, etc)
> 87981    12.5312  mysqld                   best_extension_by_limited_etc
> 65706     9.3586  mysqld                   advance_sj_state(JOIN*, etc)
> 34818     4.9592  libm-2.10.1.so           floor
> 16248     2.3142  no-vmlinux               kernel
> 9015      1.2840  mysqld                   handler::scan_time()
> 9.35% for nothing useful...
> So the patch puts the calls to advance_sj_state() and its
> companion restore_prev_sj_state() behind an "if (no semijoin in this
> query)", so that they are not called when there are no semijoins.
> Results (with a reminder of previous results):
> 
>                     no index | % faster than van. 6.0 | index | % faster
> vanilla next-mr    1427     | -28                    | 1280  | -28
> vanilla 6.0        1979     | 0                      | 1770  | 0
> if_around_sj_state 1656     | -16                    | 1524  | -14
> 
> I tried putting the if() inside advance_sj_state() (early in that
> function); that would have been a more "encapsulated", cleaner change
> in my view, but then the results were not as good (1712 ms in "no
> index"), so I didn't go that route. In such intensively-called
> functions, even the overhead of a function call/return makes
> a difference.
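The difference between the two placements can be sketched like this (hypothetical function standing in for advance_sj_state(); not server code):

```cpp
#include <cassert>

// Sketch contrasting the two if() placements. With the guard inside the
// callee, the call/return overhead is paid even for queries with no
// semijoins; with the guard at the call site, the call is skipped entirely.
static int sj_calls = 0;

void advance_sj_state_sketch(bool query_has_semijoin) {
    ++sj_calls;                   // call overhead has already been paid here
    if (!query_has_semijoin)
        return;                   // callee-side guard: the slower variant
    // ... semijoin bookkeeping would go here ...
}

void optimizer_step(bool query_has_semijoin) {
    if (query_has_semijoin)       // caller-side guard: the variant chosen
        advance_sj_state_sketch(query_has_semijoin);
}
```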
> 
> My code change accelerates queries without semijoins while potentially
> slowing down queries with semijoins, as those have to go through the
> extra if(). There is some logic to this: our goal should be that
> - queries without semijoins should not be slower in 6.0-codebase
> than in next-mr, so the semijoin-specific code is put behind an if();
> - queries with semijoins should be faster in 6.0-codebase than in
> next-mr, thanks mostly to the new *semijoin strategies* (FirstMatch etc.,
> which I hope give a gain largely exceeding the time lost in an
> additional if()).
> 
> After that fix, oprofile's outputs for 6.0-codebase and next-mr list
> the same functions. The number of calls to floor() and
> handler::scan_time() is identical between both versions, so not a
> cause. The code of handler::scan_time() is identical too, so not a
> cause.

Approved.

Another possible solution is to consolidate both "restore" functions into e.g. 
"backout_join_state()" ("backout" because it is the opposite of "advance").
The idx argument to restore_prev_sj_state() is redundant and can be removed, and 
you might even make the entire function inline.

> 
> Patch "loose_scan_inits"
> ========================
> 
> Loose_scan_opt's constructor does some unneeded initializations.
>    Loose_scan_opt():
>      try_loosescan(FALSE),
>      bound_sj_equalities(0),
>      quick_uses_applicable_index(FALSE)
>    {
>      UNINIT_VAR(quick_max_loose_keypart);
>      UNINIT_VAR(best_loose_scan_key);
>      UNINIT_VAR(best_loose_scan_records);
>      UNINIT_VAR(best_max_loose_keypart);
>      UNINIT_VAR(best_loose_scan_start_key);
>    }
> bound_sj_equalities needn't be initialized here, it is read only if
> try_loosescan is true, and when we set try_loosescan to true we make
> sure to initialize bound_sj_equalities.
> All UNINIT_VAR(x) actually do "x=0;" because of this (include/my_global.h):
> /*
>     Suppress uninitialized variable warning without generating code.
> 
>     The _cplusplus is a temporary workaround for C++ code pending a fix
>     for a g++ bug (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34772).
> */
> #if defined(_lint) || defined(FORCE_INIT_OF_VARS) || defined(__cplusplus) || \
>    !defined(__GNUC__)
> #define UNINIT_VAR(x) x= 0
> #else
> #define UNINIT_VAR(x) x= x
> #endif
> i.e. because of a bug in g++, the x=x (which has no cost and just
> serves to silence the "uninitialized" warning) is replaced by x=0.
> Removing those initializations, which is ok code-wise, does
> reintroduce the warnings, but improves speed:
> 
>                       no index | % faster than van. 6.0 | index | % faster
> vanilla next-mr      1427     | -28                    | 1280  | -28
> vanilla 6.0          1979     | 0                      | 1770  | 0
> if_around_sj_state   1656     | -16                    | 1524  | -14
> loose_scan_inits     1941     | -2                     | 1755  | -1
> 
> Just the initialization of a few doubles/ints makes a difference.
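A sketch of the trimmed constructor (simplified, hypothetical member subset, not the full class):

```cpp
#include <cassert>

// Simplified sketch of the patched constructor. bound_sj_equalities is read
// only after try_loosescan has been set to true, and every code path that
// sets try_loosescan also sets bound_sj_equalities, so the constructor no
// longer needs to store a zero into it.
struct Loose_scan_opt_sketch {
    bool try_loosescan;
    unsigned long long bound_sj_equalities;  // deliberately left uninitialized

    Loose_scan_opt_sketch() : try_loosescan(false) {}

    void start_loosescan(unsigned long long bound) {
        bound_sj_equalities = bound;  // always initialized before any read
        try_loosescan = true;
    }
};
```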
> 
> Note: patches are not applied cumulatively, but tested independently
> (each patch is applied to the vanilla, not stacked on the previous
> patch).
> 
> By the way, Loose_scan_opt::init() could be eliminated and its job
> done in the constructor, but this didn't bring any gain, and I
> imagine it's written this way for the case where we would want to
> re-use an existing Loose_scan_opt object (construct it, init it, use
> it, init it again, use it, etc.), so I'm not proposing a change.

Approved - but reconsider if the change causes warnings.

I tried consolidating the constructor and the init() function, but it had no 
effect on performance.

> 
> Patch "loose_scan_reverse_if"
> =============================
> 
> Loose_scan_opt::init() has this if():
>      if (!join->emb_sjm_nest && s->emb_sj_nest &&
>          s->emb_sj_nest->sj_in_exprs < 64 &&
>          ((remaining_tables & s->emb_sj_nest->sj_inner_tables) ==  etc
>      {
>        /* This table is an LooseScan scan candidate */
> 
> I'm following the logic of making queries without semijoins no slower
> in 6.0-codebase than in next-mr: in a query without a semijoin,
> !join->emb_sjm_nest is true and s->emb_sj_nest is false, so we evaluate
> two conditions. It is faster to reverse the order, so that the false
> condition is tested first and short-circuits the true one. I reverse the
> order: s->emb_sj_nest && !join->emb_sjm_nest. Results:
> 
>                        no index | % faster than van. 6.0 | index | % faster
> vanilla next-mr       1427     | -28                    | 1280  | -28
> vanilla 6.0           1979     | 0                      | 1770  | 0
> if_around_sj_state    1656     | -16                    | 1524  | -14
> loose_scan_inits      1941     | -2                     | 1755  | -1
> loose_scan_reverse_if 1943     | -2                     | 1762  | 0
> 
> Even a single condition test makes a difference.
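The short-circuit effect can be sketched with a counter (hypothetical booleans standing in for the JOIN/JOIN_TAB members; not server code):

```cpp
#include <cassert>

// Sketch of the reordering: && short-circuits left to right, so putting
// the condition that is false for non-semijoin queries first skips the
// remaining tests entirely.
static int later_checks = 0;

bool later_conditions() { ++later_checks; return true; }

bool is_loosescan_candidate(bool emb_sj_nest, bool emb_sjm_nest) {
    // Reordered form: s->emb_sj_nest (false without semijoins) comes first.
    return emb_sj_nest && !emb_sjm_nest && later_conditions();
}
```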
> 
> We have covered changes to 6.0-specific code pieces.
> Then I propose changes to code which isn't 6.0-specific. I propose to
> do those changes in 6.0 only though.

Approved - but questionable: it gave a negative effect on my computer (tyr67).
> 
> Patch "unlikely_keyuse"
> =======================
> 
> In best_access_path(), code looks like
>    if (s->keyuse)
>    {
>      code which studies index access paths;
>    }
>    code which studies table scan path;
> 
> The generated assembler (gcc -dA -dP -save-temps) looks like
>    compare s->keyuse and 0
>    if they are equal jump to "study table scan path"
>    study index paths
> 
> In my query there is no index on the table, so we take the jump, which
> can be slow (the jump's target may be far away: not present in the CPU's
> buffers, maybe?).
> 
> The patch changes from
>    if (s->keyuse)
> to
>    if (unlikely(s->keyuse))
> and produces assembler:
> 
>    compare s->keyuse and 0
>    if different jump to "study index paths"
>    study table scan path
> 
> which has no jump *in the no-index case* and improves speed. Results:
> 
>                        no index | % faster than van. 6.0 | index | % faster
> vanilla next-mr       1427     | -28                    | 1280  | -28
> vanilla 6.0           1979     | 0                      | 1770  | 0
> if_around_sj_state    1656     | -16                    | 1524  | -14
> loose_scan_inits      1941     | -2                     | 1755  | -1
> loose_scan_reverse_if 1943     | -2                     | 1762  | 0
> unlikely_keyuse       1861     | -6                     | 1762  | 0
> 
> Surprisingly, it appears that the unlikely() doesn't make the "index"
> case any slower. One possible cause: because the search depth is 5, there
> are fewer calls to best_access_path() (9 million, instead of 14 million
> for "no indices"), so the penalty of a jump is less visible. But as
> there is really no penalty for those 9 million, there must be another
> cause. The "study index paths" block is much bigger than the "study
> table scan path" block (300 lines of code without comments, including
> loops, vs 60 lines), so one jump added to the "index" case is
> probably not significant. It's good news anyway.

Approved.

Please add a comment that says that hitting an index in join analysis is not at 
all "unlikely", and that the macro is there for performance reasons only.
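For reference, unlikely() typically expands to GCC's __builtin_expect; the sketch below shows the common idiom (the server's own definition may differ):

```cpp
#include <cassert>

// Common GCC idiom behind unlikely(): tell the compiler which branch to
// lay out as the fall-through path. Hitting an index is not actually
// "unlikely" in general; the hint exists purely for code layout.
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

int pick_path(const void *keyuse) {
    if (unlikely(keyuse != nullptr))
        return 1;   // "study index paths": compiled as the out-of-line jump
    return 0;       // "study table scan path": now the fall-through
}
```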

Again, this had no effect on my computer, but that was a rather unscientific test.

> 
> Patch "keyuse_scope_vars"
> =========================
> 
> Also in the if (s->keyuse) branch of best_access_path(), 'start_key'
> needn't be initialized to 0, and 'rec' can be made local to an inner code
> block. Such changes neither improve nor degrade speed; I propose them
> because I find them cleaner.

Approved.

> 
> Patch "remove_floor"
> ====================
> 
> As you can see in the oprofile output:
> 34818     4.9592  libm-2.10.1.so           floor
> 5% of time is spent in floor(). It's this in best_access_path():
> 
>          /* We read the table as many times as join buffer becomes full*/
>          tmp*= (1.0 + floor((double) cache_record_length(join,idx) *
>                             record_count /
>                             (double) thd->variables.join_buff_size));
> See: the record size, multiplied by the record count, divided by our join
> buffer's size, rounded down to an integer, says how many times we
> need to read the table.
> 
> My change is to eliminate the floor():
>          tmp*= (1.0 + ((double) cache_record_length(join,idx) *
>                         record_count /
>                         (double) thd->variables.join_buff_size));
> 
> Results:
>                        no index | % faster than van. 6.0 | index | % faster
> vanilla next-mr       1427     | -28                    | 1280  | -28
> vanilla 6.0           1979     | 0                      | 1770  | 0
> if_around_sj_state    1656     | -16                    | 1524  | -14
> loose_scan_inits      1941     | -2                     | 1755  | -1
> loose_scan_reverse_if 1943     | -2                     | 1762  | 0
> unlikely_keyuse       1861     | -6                     | 1762  | 0
> remove_floor          1827     | -8                     | 1612  | -9
> 
> Without floor(), the formula for 'tmp' does become less exact: it
> over-estimates. But given that 'tmp' serves only in a cost
> calculation, that it is common wisdom that costs needn't be absolutely
> exact (the Garcia-Molina book goes as far as saying that a cost need
> only be accurate to within an order of magnitude), that our costs are
> already quite wrong, and given the gain from removing floor(), I propose
> this change.
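A worked example of the over-estimate, with hypothetical numbers (record size, row count and buffer size are made up for illustration): dropping floor() can only over-estimate, and by strictly less than one extra read, since x - floor(x) is always in [0, 1).

```cpp
#include <cassert>
#include <cmath>

// Number of times the table is read because the join buffer fills up,
// with and without the floor() call (hypothetical standalone functions
// mirroring the two formulas quoted above).
double reads_with_floor(double rec_len, double rec_count, double buf_size) {
    return 1.0 + std::floor(rec_len * rec_count / buf_size);
}

double reads_without_floor(double rec_len, double rec_count, double buf_size) {
    return 1.0 + rec_len * rec_count / buf_size;
}
```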
> It does cause a few query plan changes in the testsuite. It makes
> subselect3.test crash (segmentation fault in
> optimize_semijoin_nests()) but I don't think it's my change which is
> wrong per se, likely it causes a plan change which happens to take
> some untested route (I'll debug this and fix this for a next round of
> review, of course).
> 
> Now a last test where all patches are applied:
> 
>                        no index | % faster than van. 6.0 | index | % faster
> vanilla next-mr       1427     | -28                    | 1280  | -28
> vanilla 6.0           1979     | 0                      | 1770  | 0
> if_around_sj_state    1656     | -16                    | 1524  | -14
> loose_scan_inits      1941     | -2                     | 1755  | -1
> loose_scan_reverse_if 1943     | -2                     | 1762  | 0
> unlikely_keyuse       1861     | -6                     | 1762  | 0
> remove_floor          1827     | -8                     | 1612  | -9
> all patches           1294     | -35                    | 1312  | -26
> 
> So with all patches applied, we are 9% faster than next-mr in the "no
> index" case, and 3% slower than next-mr in the "index" case; to me
> the goal is reasonably reached.
> 
> If we take away the remove_floor patch, because it changes some
> behaviours, we have:
>                        1446     | -27                    | 1490  | -16
> But I'd recommend the remove_floor patch (at least, until I find proof
> that it does real wrong in the testsuite).

Approved.

I had the same idea as Olav, i.e. rewriting the sub-expression to use integers.
But for some reason, this does not add any gain at all.

My combined performance gain is ca. 16%. CPU time is reduced from ca. 1820 ms to 
ca. 1530 ms.