Guilhem,
thank you for another round of useful comments.
Guilhem Bichot wrote:
> Hello,
>
> Roy Lyseng a écrit, Le 26.04.2010 20:44:
>> Hi Guilhem,
>>
>> thanks for commenting.
>>
>> Guilhem Bichot wrote:
>>> Hello Roy,
>>>
>>> Roy Lyseng a écrit, Le 26.04.2010 12:54:
>>>> #At file:///home/rl136806/mysql/repo/mysql-6.0-work1/ based on
>>>> revid:alfranio.correia@stripped
>>>>
>>>> 3836 Roy Lyseng 2010-04-26
>>>> WL#5347: Refactor optimizer function at_sjmat_pos()
>>>
>>>> @@ -7241,58 +7270,79 @@ optimize_straight_join(JOIN *join, table
>>>> }
>>>>
>>>>
>>>> -/*
>>>> - Check if the last tables of the partial join order allow to use
>>>> - sj-materialization strategy for them
>>>> +/**
>>>> + Check whether a semijoin materialization strategy is allowed for
>>>> + the current (semi)join table order.
>>>>
>>>> - SYNOPSIS
>>>> - at_sjmat_pos()
>>>> - join - remaining_tables
>>>> - tab the last table's join tab
>>>> - idx last table's index
>>>> - loose_scan OUT TRUE <=> use LooseScan
>>>> + @param join Join object
>>>> + @param remaining_tables Tables that have not yet been added to
>>>> the join plan
>>>> + @param tab Join tab of the table being considered
>>>> + @param idx Index of table with join tab "tab"
>>>> +
>>>> + @retval SJ_OPT_NONE - Materialization not applicable
>>>> + @retval SJ_OPT_MATERIALIZE_LOOKUP - Materialization with lookup
>>>> applicable
>>>> + @retval SJ_OPT_MATERIALIZE_SCAN - Materialization with scan
>>>> applicable
>>>>
>>>> - RETURN
>>>> - TRUE Yes, can apply sj-materialization
>>>> - FALSE No, some of the requirements are not met
>>>> + @details
>>>> + The function checks applicability of both MaterializeLookup and
>>>> + MaterializeScan strategies.
>>>> + No checking is made until "tab" is pointing to the last inner table
>>>> + of a semijoin nest that can be executed using materialization -
>>>> + for all other cases SJ_OPT_NONE is returned.
>>>> + If both MaterializeLookup and MaterializeScan are applicable
>>>> + (ie if there are no correlated outer tables),
>>>
>>> I don't understand the equivalence ("i.e. if there are" etc)
>>
>> I mean that both strategies are applicable only when there are no
>> correlated outer tables, ie all outer tables are non-correlated. If
>> there are correlated tables in the suffix, MaterializeScan must be
>> used. If there are correlated tables only in the prefix, then
>> MaterializeLookup must be used.
>
> From the text it's not clear: it says that MatScan can handle:
>
> + MaterializeScan strategy
> + ~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> + (ot|nt)* [ it (it)* ] (ot|nt)*
> + +------+ +==========+ +-----+
> + (1) (2) (3)
> +
>
> This picture accepts the case where the suffix is only "nt", doesn't it?
> But then, this particular MatScan is like a MatLookup...
> I wonder what's the difference between MatScan and MatLookup when there
> are no "ot" in the suffix?
It is quite likely that the two materialization strategies can be consolidated
into one at some point in time. But that will be a separate WL...
>
>> That leaves the case of no correlated outer tables either places,
>> which could, at least in theory, be executed by both strategies. The
>> materialized table would degenerate to a boolean scalar, then.
>
> yes
>
>>>> + MaterializeLookup is returned.
>>>> + This is a purely heuristic decision.
>>>
>>> Is this decision something which is introduced by the patch?
>>
>> No, but the decision is moved into this function. In the existing
>> version, the caller could make the decision itself. It would be easy
>> to revert this however, e.g by implementing the return value as a set
>> of applicable strategies, if somebody wanted it.
>
> no, it's ok, thanks
>
>>>> + MaterializeScan allows correlated outer tables both before and after
>>>> + the materialized inner tables. In this case, the materialized
>>>> table must
>>>> + contain keys from the prefix outer tables followed by keys from the
>>>> + suffix outer tables, called prefix-key and suffix-key, respectively.
>>>
>>> What do you mean by "keys" here... I think the materialized tables
>>> has a single index.
>>
>> I should state that as key columns.
>
> ok
>
>>>> + First, a scan is set up over the prefix outer tables. Then a
>>>> range scan
>>>> + is done over the prefix-key of the materialized table.
>>>> + Finally, the suffix-key from the materialized table is used for a
>>>> scan
>>>> + over the suffix outer tables.
>>>
>>> mmmm the suffix outer tables are not always scanned, if they have an
>>> index then we will do a index lookup on them (searching for a value
>>> which we found when scanning the materialized table).
>>
>> I thought that I could use "scan" as a common term for table scan and
>> index scan/lookup. Do you want me to be more specific?
>
> in fact, it's hard to be accurate. Even this:
> "First, a scan is set up over the prefix outer tables"
> can be misleading: if the prefix outer tables are preceded by other
> tables, then the prefix outer tables may be accessed via "ref" or
> "unique" index accesses (no scan) based on rows found in preceding other
> tables.
> Maybe this:
> "first, data is read from the prefix outer tables, then matches are
> located in the materialized table, then matches are located in the
> suffix outer tables. If there are no prefix outer tables, the
> materialized table is scanned."
> I used "data" instead of "rows" because the optimizer might be able to
> find all data from the indices (so, not even read "rows").
> Maybe this makes a too vague sentence though.
Thanks, I tried to work out a new text based on your suggestion. I object to
using "data", though, because I think it is too general. I prefer "rows", or
"data rows" and "index rows" when I need to make the distinction.
>
>>> Is the comment a description of how things work now?
>>> I was surprised to see that materialization-scan is expected work
>>> with the inner table not first, so I tested it and filed BUG#53172.
>>> It may or not be related to BUG#52068 - semijoin materialization has
>>> a bit too many bugs it seems :-(
>>
>> If there are so many bugs here, we could perhaps disable the "mixed
>> scan" method. It is not very significant, I think... Øystein has a bit
>> more information about this strategy.
>
> Yes, let's see when we start attacking those bugs. All you did is
> describing in greater detail than the existing comments, that's already
> good reading for us all.
>
>>>> -SJ_MATERIALIZATION_INFO *
>>>> -at_sjmat_pos(const JOIN *join, table_map remaining_tables, const
>>>> JOIN_TAB *tab,
>>>> - uint idx, bool *loose_scan)
>>>> +static int
>
> Good that you made it static.
>
>>>> +semijoin_order_allows_materialization(const JOIN *join,
>>>> + table_map remaining_tables,
>>>> + const JOIN_TAB *tab, uint idx)
>>>> {
>>>> /*
>>>> Check if 1. We're in a semi-join nest that can be run with
>>>> SJ-materialization
>>>> - 2. All the tables correlated through the IN subquery are in the
>>>> prefix
>>>> + 2. All the tables from the subquery are in the prefix
>>>> */
>>>> - TABLE_LIST *emb_sj_nest= tab->emb_sj_nest;
>>>> - table_map suffix= remaining_tables & ~tab->table->map;
>>>> - if (emb_sj_nest && emb_sj_nest->sj_mat_info &&
>>>> - !(suffix & emb_sj_nest->sj_inner_tables))
>>>> + const TABLE_LIST *emb_sj_nest= tab->emb_sj_nest;
>>>> + if (!emb_sj_nest ||
>>>> + !emb_sj_nest->sj_mat_info ||
>>>> + (remaining_tables & emb_sj_nest->sj_inner_tables))
>>>
>>> The old code would use ~tab->table->map here...?
>>
>> If you go back to advance_sj_state(), you see this code line:
>>
>> remaining_tables &= ~new_join_tab->table->map;
>>
>> so, suffix is identical to remaining_tables, and there is no need for it.
>
> ok. Though then the property "tab is not part of remaining_tables"
> becomes an assumption of semijoin_order_allows_materialization(), I'd
> DBUG_ASSERT() it.
Done.
>
>>>> + return SJ_OPT_NONE;
>>>> +
>>>> + /* + Walk back and check if all immediately preceding tables
>>>> are from
>>>> + this semi-join.
>>>> + */
>>>> + const uint n_tables= my_count_bits(emb_sj_nest->sj_inner_tables);
>>>> + for (uint i= 1; i < n_tables ; i++)
>>>> {
>>>> - /* - Walk back and check if all immediately preceding
>>>> tables are from
>>>> - this semi-join.
>>>> - */
>>>> - uint n_tables=
> my_count_bits(tab->emb_sj_nest->sj_inner_tables);
>>>> - for (uint i= 1; i < n_tables ; i++)
>>>> - {
>>>> - if (join->positions[idx - i].table->emb_sj_nest !=
>>>> tab->emb_sj_nest)
>>>> - return NULL;
>>>> - }
>>>> - *loose_scan= test(remaining_tables & ~tab->table->map
> &
>>>> - (emb_sj_nest->sj_inner_tables |
>>>> -
>>>> emb_sj_nest->nested_join->sj_depends_on));
>>>> - if (*loose_scan &&
> !emb_sj_nest->sj_subq_pred->sjm_scan_allowed)
>>>> - return NULL;
>>>> - else
>>>> - return emb_sj_nest->sj_mat_info;
>>>> + if (join->positions[idx - i].table->emb_sj_nest !=
> emb_sj_nest)
>>>> + return SJ_OPT_NONE;
>>>> }
>>>> - return NULL;
>>>> -}
>>>>
>>>> + /*
>>>> + Must use MaterializeScan strategy if there are outer correlated
>>>> tables
>>>> + among the remaining tables, otherwise use MaterializeLookup.
>>>> + */
>>>> + if (remaining_tables &
> emb_sj_nest->nested_join->sj_depends_on)
>>>> + {
>>>> + if (!emb_sj_nest->sj_subq_pred->sjm_scan_allowed)
>>>
>>> There seems to be code changes too (I don't see ~tab->table->map,
>>> emb_sj_nest->sj_inner_tables). They may well be fine, but I'd really
>>> need an explanation, please. I don't fully understand what all those
>>> bitmaps are about and why they are not needed in fact...
>>
>> The table->map thing is moot, see above.
>>
>> We have also tested above that if
>> "remaining_tables & emb_sj_nest->sj_inner_tables",
>> we have returned SJ_OPT_NONE.
>>
>> This is the reason why I could simplify the code.
>>
>> It's a bit difficult to annotate every code change with a reason. I
>> cannot write that into the comment section, because it should document
>> the existing code. I could write it into the commit comments, but that
>> means that I need to copy the diffs into the commit section. So
>> refactoring code actually means harder work than writing the initial
>> code :(
>
> Maybe (an idea for a next time) make a commit with verbose comments
> inside the code like "To reviewer: in the above line I could eliminate
> variable X because Y", and when it's time to push, re-commit without
> comments...?
Good idea...
>
> Ok to push. Thanks!
>
>
Thanks,
Roy