Hello,
I have been wondering if somehow on can get a huge enough set of real-life
SQL queries so that it can be determined what percentage of those queries
get benefit by a given optimization and what percentage do not. In other
words, when considering a given optimization, how does MySQL decide if they
are going to implement it or not (apart from constant nagging by paying
customers)?
I have been thinking about that because I either hallucinated or read in
some presentation that eventually MySQL will optimize away statements like
WHERE LEFT(string, 3) = 4_letter_string or WHERE varchar_2_column =
4_letter_string (both conditions being impossible because the strings being
compared are of different lengths). Such an optimization appears to be
relatively easy to implement, and at the same time would probably only
benefit a very small set of all potential queries in the world.
Will there be a benefit to construct a tool that gets a real-life list of
queries as input (e.g. from a query log, from listening to tcp streams,
being a slave (for UPDATEs), an ODBC hook, or whatever) and converts all
query expressions into some standard form that allows data to be aggregated
from many installations across many companies into useful reports that show
universal query patterns that are worth optimizing?
For example, the input query as-is from the query log
SELECT is_informant FROM california_cia_contacts WHERE LEFT(company, 4) =
'ACME'
is too particular for the application and at the same time too revealing
(even without knowing the actual data) to ship outside of the company for
aggregation and analysis. However, given some schema knowledge, it can be
transformed into a string like:
full_table_scan for SELECT bool_col FROM single_myisam_table WHERE
Item_eq_func(Item_func_left(indexed_char_col, 4),4_letter_string)
which is reasonably anonymous and general to be both exported and aggregated
with other reports. If it turns out that many applications across many
companies tend to use LEFT() on an indexed_column, it may be wise to
implement an optimization which will convert "LEFT(company, 4) = 'ACME'"
into "company LIKE 'ACME%'". It may even be possible, given information on
key cardinality, to estimate the total time saved across all monitored
installations.
Some years ago I ran into a sub-optimal special case of the MySQL optimizer.
I prepared a bug report which required that I obfuscate my original
proprietary data and revealing field names into no-secrets-revealing sample
table and query and post a bug in MySQL, which was in turn demoted to either
Won't fix or Feature request. Instead, what we can have is a tool that
obfuscates all (such) queries and sends them to MySQL in an automated
manner, with every such submission a sort of "vote" for the prevalence in
the wild of a particular construction, with the most common constructions
submitted warranting either a special optimizer handling or at least a
better description in the manual on how to rewrite it or an item for
someone's next "MySQL common pitfalls" conference talk.
If it turns out that having the original data is still important to
replicate the setup, a tool can be created that uses SHOW TABLE STATUS to
create a fake duplicate table containing hashed versions of the original
data so that the copy has the same key distribution, size and other
characteristics as the original.
Philip Stoev
| Thread |
|---|
| • How do you know which optimizations are worth it? | Philip Stoev | 18 Dec |