On 8/9/2017 7:51 AM, Sergei Golubchik wrote:
> Hi, Jacky!
>
> On Aug 07, YuFeng Shen wrote:
>> Hi Experts,
>>
>> In the MySQL Server 5.7 source code, the formula records = (x * (b-a) +
> a*c-b)/(c-1) is used in the query planner to calculate the number of records when key
> distribution statistics are not available.
>> Where is this formula coming from, how was it derived, or why is this
>> specific formula the formula that's being used? Does it have an
>> established theoretical underpinning, and if so, what is its basis?
>> https://github.com/mysql/mysql-server/blob/5.7/sql/sql_planner.cc#L529
>>
>> Assume that the first key part matches 1% of the file
>> and that the whole key matches 10 (duplicates) or 1
>> (unique) records.
>> Assume also that more key matches proportionally more
>> records
>> This gives the formula:
>> records = (x * (b-a) + a*c-b)/(c-1)
>> b = records matched by whole key
>> a = records matched by first key part (1% of all records?)
>> c = number of key parts in key
>> x = used key parts (1 <= x <= c)
>
> According to git history, this was present even in the very first
> revision, when MySQL moved from CVS to BitKeeper in July 2000.
>
> There's only one person in Oracle that was in MySQL back than, and I
> don't think he's reading this mailing list.
>
> So, I'm afraid, only Monty can know that.
>
I can see why there is a formula at all in lines 551-565 in that same file.
if (tab->records() == 0)
tmp_fanout= 0.0;
else if (rec_per_key / tab->records() >= 0.01)
tmp_fanout= rec_per_key;
else
{
const double a= tab->records() * 0.01;
if (keyinfo->user_defined_key_parts > 1)
tmp_fanout=
(cur_used_keyparts * (rec_per_key - a) +
a * keyinfo->user_defined_key_parts - rec_per_key) /
(keyinfo->user_defined_key_parts - 1);
else
tmp_fanout= a;
set_if_bigger(tmp_fanout, 1.0);
*in pseudo logic*
If <there are no records in the table>
then tmp_fanout=0
If <we might match more than 1% of table>
then tmp_fanout= # of records per key (the Cardinality)
otherwise
create a non-zero value of tmp_fanout based on a different estimate
That estimate is based on a some initial conditions
a) there are rows in the table
b) Each indexed tuple is likely to match less than 1% of the entire table
c) At least one of the left-most columns of an index will be used for
searching the table.
I don't have time (right now) to reverse engineer the entire derivation
but try to imagine a multicolumn index as if it were a value tree. Let's
say you have a 3-column index (a,b,c). For each unique value of a (a1,
a2, a3,...) , there will be a subset of the b values that are associated
with it (for a1, that might be the set {b65, b90, b205}) and that for
each {a,b} tuple there is a set of c values associated with those. At
the end of that tree is a set of row identifiers that identify which
rows had those combinations of values. Using some crude ASCII art, it
might look like this... (only diagramming the first few values)
a1+ b65 + c1 -> (row104, row107, )
| + c2 -> (row5, row305, row400)
| + c4 -> (row799)
|
+ b90 + c2 -> (row4,row6,row15)
| + c3 -> (row19)
|
+ b205 + c2 -> (row75)
a2 ...
Does that help you to visualize how the formula can provide a non-zero
estimate of what a "fanout factor" could be? As Sergei said, the
originator of the formula in our code (and their notes) is not available
for direct comment.
I also wasn't around when it was created so I could not have any emails
containing a discussion about its derivation.
If you think you may have a better formula for the estimation, we would
be happy to discuss it.
Yours,
--
Shawn Green
MySQL Senior Principal Technical Support Engineer
Oracle USA, Inc. - Integrated Cloud Applications & Platform Services
Office: Blountville, TN
Become certified in MySQL! Visit https://www.mysql.com/certification/
for details.