List: | Internals | « Previous Message | |

From: | shawn l.green | Date: | August 10 2017 1:39pm |

Subject: | Re: How was this records/fanout logic derived for the “no statistics” case in MySQL's Query Planner? | ||

View as plain text |

On 8/9/2017 7:51 AM, Sergei Golubchik wrote: > Hi, Jacky! > > On Aug 07, YuFeng Shen wrote: >> Hi Experts, >> >> In the MySQL Server 5.7 source code, the formula records = (x * (b-a) + > a*c-b)/(c-1) is used in the query planner to calculate the number of records when key > distribution statistics are not available. >> Where is this formula coming from, how was it derived, or why is this >> specific formula the formula that's being used? Does it have an >> established theoretical underpinning, and if so, what is its basis? >> https://github.com/mysql/mysql-server/blob/5.7/sql/sql_planner.cc#L529 >> >> Assume that the first key part matches 1% of the file >> and that the whole key matches 10 (duplicates) or 1 >> (unique) records. >> Assume also that more key matches proportionally more >> records >> This gives the formula: >> records = (x * (b-a) + a*c-b)/(c-1) >> b = records matched by whole key >> a = records matched by first key part (1% of all records?) >> c = number of key parts in key >> x = used key parts (1 <= x <= c) > > According to git history, this was present even in the very first > revision, when MySQL moved from CVS to BitKeeper in July 2000. > > There's only one person in Oracle that was in MySQL back than, and I > don't think he's reading this mailing list. > > So, I'm afraid, only Monty can know that. > I can see why there is a formula at all in lines 551-565 in that same file. if (tab->records() == 0) tmp_fanout= 0.0; else if (rec_per_key / tab->records() >= 0.01) tmp_fanout= rec_per_key; else { const double a= tab->records() * 0.01; if (keyinfo->user_defined_key_parts > 1) tmp_fanout= (cur_used_keyparts * (rec_per_key - a) + a * keyinfo->user_defined_key_parts - rec_per_key) / (keyinfo->user_defined_key_parts - 1); else tmp_fanout= a; set_if_bigger(tmp_fanout, 1.0); *in pseudo logic* If <there are no records in the table> then tmp_fanout=0 If <we might match more than 1% of table> then tmp_fanout= # of records per key (the Cardinality) otherwise create a non-zero value of tmp_fanout based on a different estimate That estimate is based on a some initial conditions a) there are rows in the table b) Each indexed tuple is likely to match less than 1% of the entire table c) At least one of the left-most columns of an index will be used for searching the table. I don't have time (right now) to reverse engineer the entire derivation but try to imagine a multicolumn index as if it were a value tree. Let's say you have a 3-column index (a,b,c). For each unique value of a (a1, a2, a3,...) , there will be a subset of the b values that are associated with it (for a1, that might be the set {b65, b90, b205}) and that for each {a,b} tuple there is a set of c values associated with those. At the end of that tree is a set of row identifiers that identify which rows had those combinations of values. Using some crude ASCII art, it might look like this... (only diagramming the first few values) a1+ b65 + c1 -> (row104, row107, ) | + c2 -> (row5, row305, row400) | + c4 -> (row799) | + b90 + c2 -> (row4,row6,row15) | + c3 -> (row19) | + b205 + c2 -> (row75) a2 ... Does that help you to visualize how the formula can provide a non-zero estimate of what a "fanout factor" could be? As Sergei said, the originator of the formula in our code (and their notes) is not available for direct comment. I also wasn't around when it was created so I could not have any emails containing a discussion about its derivation. If you think you may have a better formula for the estimation, we would be happy to discuss it. Yours, -- Shawn Green MySQL Senior Principal Technical Support Engineer Oracle USA, Inc. - Integrated Cloud Applications & Platform Services Office: Blountville, TN Become certified in MySQL! Visit https://www.mysql.com/certification/ for details.

Thread | ||
---|---|---|

• How was this records/fanout logic derived for the “no statistics” case in MySQL's Query Planner? | YuFeng Shen | 7 Aug 2017 |

• Re: How was this records/fanout logic derived for the “no statistics ”case in MySQL's Query Planner? | Sergei Golubchik | 9 Aug 2017 |

• Re: How was this records/fanout logic derived for the “no statistics” case in MySQL's Query Planner? | shawn l.green | 10 Aug 2017 |