List:Internals« Previous Message
From:shawn l.green Date:August 10 2017 1:39pm
Subject:Re: How was this records/fanout logic derived for the
“no statistics” case in MySQL's Query Planner?
View as plain text  

On 8/9/2017 7:51 AM, Sergei Golubchik wrote:
> Hi, Jacky!
>
> On Aug 07, YuFeng Shen wrote:
>> Hi Experts,
>>
>> In the MySQL Server 5.7 source code, the formula records = (x * (b-a) +
> a*c-b)/(c-1) is used in the query planner to calculate the number of records when key
> distribution statistics are not available.
>> Where is this formula coming from, how was it derived, or why is this
>> specific formula the formula that's being used? Does it have an
>> established theoretical underpinning, and if so, what is its basis?
>>   https://github.com/mysql/mysql-server/blob/5.7/sql/sql_planner.cc#L529
>>
>>            Assume that the first key part matches 1% of the file
>>            and that the whole key matches 10 (duplicates) or 1
>>            (unique) records.
>>            Assume also that more key matches proportionally more
>>            records
>>            This gives the formula:
>>            records = (x * (b-a) + a*c-b)/(c-1)
>>            b = records matched by whole key
>>            a = records matched by first key part (1% of all records?)
>>            c = number of key parts in key
>>            x = used key parts (1 <= x <= c)
>
> According to git history, this was present even in the very first
> revision, when MySQL moved from CVS to BitKeeper in July 2000.
>
> There's only one person in Oracle that was in MySQL back than, and I
> don't think he's reading this mailing list.
>
> So, I'm afraid, only Monty can know that.
>

I can see why there is a formula at all in lines 551-565 in that same file.

             if (tab->records() == 0)
               tmp_fanout= 0.0;
             else if (rec_per_key / tab->records() >= 0.01)
               tmp_fanout= rec_per_key;
             else
             {
               const double a= tab->records() * 0.01;
               if (keyinfo->user_defined_key_parts > 1)
                 tmp_fanout=
                   (cur_used_keyparts * (rec_per_key - a) +
                    a * keyinfo->user_defined_key_parts - rec_per_key) /
                   (keyinfo->user_defined_key_parts - 1);
               else
                 tmp_fanout= a;
               set_if_bigger(tmp_fanout, 1.0);

*in pseudo logic*
If <there are no records in the table>
   then tmp_fanout=0
If <we might match more than 1% of table>
   then tmp_fanout= # of records per key (the Cardinality)
otherwise
   create a non-zero value of tmp_fanout based on a different estimate

That estimate is based on a some initial conditions
a) there are rows in the table
b) Each indexed tuple is likely to match less than 1% of the entire table
c) At least one of the left-most columns of an index will be used for 
searching the table.

I don't have time (right now) to reverse engineer the entire derivation 
but try to imagine a multicolumn index as if it were a value tree. Let's 
say you have a 3-column index (a,b,c). For each unique value of a (a1, 
a2, a3,...) , there will be a subset of the b values that are associated 
with it (for a1, that might be the set {b65, b90, b205}) and that for 
each {a,b} tuple there is a set of c values associated with those.  At 
the end of that tree is a set of row identifiers that identify which 
rows had those combinations of values.  Using some crude ASCII art, it 
might look like this... (only diagramming the first few values)

a1+ b65 + c1 -> (row104, row107, )
   |     + c2 -> (row5, row305, row400)
   |     + c4 -> (row799)
   |
   + b90 + c2 -> (row4,row6,row15)
   |     + c3 -> (row19)
   |
   + b205 + c2 -> (row75)
a2 ...

Does that help you to visualize how the formula can provide a non-zero 
estimate of what a "fanout factor" could be?  As Sergei said, the 
originator of the formula in our code (and their notes) is not available 
for direct comment.

I also wasn't around when it was created so I could not have any emails 
containing a discussion about its derivation.

If you think you may have a better formula for the estimation, we would 
be happy to discuss it.

Yours,
-- 
Shawn Green
MySQL Senior Principal Technical Support Engineer
Oracle USA, Inc. - Integrated Cloud Applications & Platform Services
Office: Blountville, TN

Become certified in MySQL! Visit https://www.mysql.com/certification/ 
for details.
Thread
How was this records/fanout logic derived for the “no statistics” case in MySQL's Query Planner?YuFeng Shen7 Aug 2017
  • Re: How was this records/fanout logic derived for the “no statistics ”case in MySQL's Query Planner?Sergei Golubchik9 Aug 2017
    • Re: How was this records/fanout logic derived for the “no statistics” case in MySQL's Query Planner?shawn l.green10 Aug 2017