* marc.alff@stripped <marc.alff@stripped> [07/08/21 22:07]:
> ChangeSet@stripped, 2007-08-21 12:03:44-06:00, malff@weblab.(none) +4 -0
> Bug#30237 (Performance regression in boolean expressions)
The patch is OK to push.
Please see some minor comments inline.
> This is a performance bug, related to the parsing or 'OR' and
> 'AND' boolean expressions.
>
> Let N be the number of expressions involved in a OR (respectively AND).
>
> When N=1
>
> For example, "select 1" involve only 1 term: there is no OR operator.
>
> In 4.0 and 4.1, parsing expressions not involving OR had no overhead.
> In 5.0, parsing adds some overhead, with Select->expr_list.
>
> With this patch, the overhead introduced in 5.0 has been removed,
> so that performances for N=1 should be identical to the 4.0 performances,
> which are optimal (there is no code executed at all)
>
> The overhead in 5.0 was in fact affecting significantly some operations.
> For example, loading 1 Million rows into a table with INSERTs,
> for a table that has 100 columns, leads to parsing 100 Millions of
> expressions, which means that the overhead related to Select->expr_list
> is executed 100 Million times ...
>
> Considering that N=1 is by far the most probable expression,
> this case should be optimal.
>
> When N=2
>
> For example, "select a OR b" involves 2 terms in the OR operator.
>
> In 4.0 and 4.1, parsing expressions involving 2 terms created 1 Item_cond_or
> node, which is the expected result.
> In 5.0, parsing these expression also produced 1 node, but with some extra
> overhead related to Select->expr_list : creating 1 list in Select->expr_list
> and another in Item_cond::list is inefficient.
>
> With this patch, the overhead introduced in 5.0 has been removed
> so that performances for N=2 should be identical to the 4.0 performances.
> Note that the memory allocation uses the new (thd->mem_root) syntax
> directly.
> The cost of "is_cond_or" is estimated to be neglectable: the real problem
> of the performance degradation comes from unneeded memory allocations.
>
> When N>=3
>
> For example, "select a OR b OR c ...", which involves 3 or more terms.
>
> In 4.0 and 4.1, the parser had no significant cost overhead, but produced
> an Item tree which is difficult to evaluate / optimize during runtime.
> In 5.0, the parser produces a better Item tree, using the Item_cond
> constructor that accepts a list of children directly, but at an extra cost
> related to Select->expr_list.
>
> With this patch, the code is implemented to take the best of the two
> implementations:
> - there is no overhead with Select->expr_list
> - the Item tree generated is optimized and flattened.
>
> This is achieved by adding children nodes into the Item tree directly,
> with Item_cond::add(), which avoids the need for temporary lists and memory
> allocation
>
> Note that this patch also provide an extra optimization, that the previous
> code in 5.0 did not provide: expressions are flattened in the Item tree,
> based on what the expression already parsed is, and not based on the order
> in which rules are reduced.
>
> For example : "(a OR b) OR c", "a OR (b OR c)" would both be represented
> with 2 Item_cond_or nodes before this patch, and with 1 node only with this
> patch. The logic used is based on the mathematical properties of the OR
> operator (it's associative), and produces a simpler tree.
I found this comment useful.
However, let's try to keep the main server change log comment
concise -- and move the main bulk of explanation to sql_yacc.yy
comment.
It would also be nice to mention in the comment that the parser
stack should be preferred to custom stacks maintained by hand
during parsing, such as, for example, Select->expr_list, and this
patch moves us forward in following this general vision.
> Performance tests:
>
> Tests have been executed for N=1, that show an improvement.
>
> When executing repeatedly INSERT <100 values> into <blackhole table>,
> the timing is as follows (time mysqlslap ...)
>
> See the bug report for 30327 for the scripts used, these are not part of
> the patch.
>
> 100,000 inserts of 100 values (10 Million expressions)
> Before
> real 0m57.767s
> user 0m3.088s
> sys 0m3.188s
>
> After
> real 0m50.722s
> user 0m3.332s
> sys 0m3.352s
>
>
> 1,000,000 inserts of 100 values (100 Million expressions)
> Before
> real 9m14.565s
> user 0m33.390s
> sys 0m35.194s
>
> After
> real 8m30.286s
> user 0m33.410s
> sys 0m35.438s
OK, as agreed on the phone, let's replace the exact numbers with
improvement ratio or similar.
> Performance measures for N=2 or N>=3 are not available.
> The code is expected to be faster (per code analysis).
>
> The precedence of INTERVAL_SYM expr has been explicitely defined,
> to resolve shift/reduce conflicts in the grammar around interval_expr.
>
> mysql-test/r/precedence.result@stripped, 2007-08-21 12:03:40-06:00, malff@weblab.(none)
> +356 -0
> Added test cases to cover boolean operator precedence
>
>
> mysql-test/r/precedence.result@stripped, 2007-08-21 12:03:40-06:00, malff@weblab.(none)
> +0 -0
>
> mysql-test/t/precedence.test@stripped, 2007-08-21 12:03:40-06:00, malff@weblab.(none)
> +93 -0
> Added test cases to cover boolean operator precedence
>
>
> mysql-test/t/precedence.test@stripped, 2007-08-21 12:03:40-06:00, malff@weblab.(none) +0
> -0
>
> sql/item_cmpfunc.h@stripped, 2007-08-21 12:03:40-06:00, malff@weblab.(none) +18 -0
> Improved performances for parsing boolean expressions
>
> sql/sql_yacc.yy@stripped, 2007-08-21 12:03:40-06:00, malff@weblab.(none) +92 -42
> Improved performances for parsing boolean expressions
>
> diff -Nrup a/mysql-test/r/precedence.result b/mysql-test/r/precedence.result
> --- /dev/null Wed Dec 31 16:00:00 196900
> +++ b/mysql-test/r/precedence.result 2007-08-21 12:03:40 -06:00
OK.
As discussed on IRC, suggest to rename to
parser_operator_precedence.test to group all parser-related tests
into one group, and be able to run with ./mtr --do=parser
A similar test that would benefit from such a rename is
keywords.test - you could do it in a separate patch.
> --- a/sql/item_cmpfunc.h 2007-07-15 11:49:58 -06:00
> +++ b/sql/item_cmpfunc.h 2007-08-21 12:03:40 -06:00
> @@ -1361,6 +1361,7 @@ public:
> Item_cond(List<Item> &nlist)
> :Item_bool_func(), list(nlist), abort_on_null(0) {}
> bool add(Item *item) { return list.push_back(item); }
> + bool add_at_head(Item *item) { return list.push_front(item); }
> void add_at_head(List<Item> *nlist) { list.prepand(nlist); }
Since add_at_head methods now do not provide extra encapsulation,
you could perhaps remove them.
> Currently there is 251 shift/reduce conflict. We should not introduce
> new conflicts any more.
> */
Please fix the comment to match the number of conflicts :)
> -%expect 251
> +%expect 245
>
> /* all possible expressions */
> -expr:
> - bool_term { Select->expr_list.push_front(new List<Item>); }
> - bool_or_expr
> - {
> - List<Item> *list= Select->expr_list.pop();
> - if (list->elements)
> +expr:
> + bool_factor
> + | expr or expr %prec OR_SYM
> + {
> + Item_cond_or *item1;
> + Item_cond_or *item3;
> + if (is_cond_or($1))
> {
> - list->push_front($1);
> - $$= new Item_cond_or(*list);
> - /* optimize construction of logical OR to reduce
> - amount of objects for complex expressions */
> + item1= (Item_cond_or*) $1;
> + if (is_cond_or($3))
> + {
> + item3= (Item_cond_or*) $3;
> + /*
> + (X1 OR X2) OR (Y1 OR Y2) ==> OR (X1, X2, Y1, Y2)
> + */
> + item3->add_at_head(item1->argument_list());
> + $$ = $3;
> + }
> + else
> + {
> + /*
> + (X1 OR X2) OR Y ==> OR (X1, X2, Y)
> + */
> + item1->add($3);
> + $$ = $1;
> + }
> }
> else
> - $$= $1;
> - delete list;
> + {
> + if (is_cond_or($3))
Using else if (is_cond_or($3)) here could improve readability.
> + }
> }
> else
> - $$= $1;
> - delete list;
> + {
> + if (is_cond_and($3))
Same here, else if could improve readability.
> + item3= (Item_cond_and*) $3;
> + /*
> + X AND (Y1 AND Y2) ==> AND (X, Y1, Y2)
> + */
> + item3->add_at_head($1);
> + $$ = $3;
> + }
> + else
> + {
> + /* X AND Y */
> + $$ = new (YYTHD->mem_root) Item_cond_and($1, $3);
> + }
> + }
--
-- Konstantin Osipov Software Developer, Moscow, Russia
-- MySQL AB, www.mysql.com The best DATABASE COMPANY in the GALAXY