Hi Sergei.
Thanks for the review, see some comments below.
Sergei Golubchik wrote:
> Hi!
>
> On Mar 28, marc.alff@stripped wrote:
>
>> Below is the list of changes that have just been committed into a local
>> 5.0 repository of malff. When malff does a push these changes
>> will be propagated to the main repository and, within 24 hours after the
>> push, to the public repository.
>> For information on how to access the public repository
>> see http://dev.mysql.com/doc/mysql/en/installing-source-tree.html
>>
>> ChangeSet@stripped, 2008-03-28 11:32:08-06:00, malff@stripped.
> +12 -0
>> Bug#35577 (CREATE PROCEDURE causes either crash or syntax error depending on
>> build)
>>
>> The crash was caused by freeing the internal parser stack during the parser
>> execution.
>> This occured only for complex stored procedures, after reallocating the parser
>> stack using my_yyoverflow(), with the following C call stack:
>> - MYSQLparse()
>> - any rule calling sp_head::restore_lex()
>> - lex_end()
>> - x_free(lex->yacc_yyss), xfree(lex->yacc_yyvs)
>>
>> The root cause is the implementation of stored procedures, which breaks the
>> assumption from 4.1 that there is only one LEX structure per parser call.
>>
>> The solution is to separate the LEX structure into:
>> - attributes that represent a statement (the current LEX structure),
>> - attributes that relate to the parser itself (Parser_state),
>> so that parsing multiple statements in stored programs can create multiple
>> LEX structures while not changing the unique Parser_state.
>>
>> mysql-test/r/parser_stack.result@stripped, 2008-03-28 11:32:04-06:00,
> malff@stripped. +191 -0
>> Bug#35577 (CREATE PROCEDURE causes either crash or syntax error depending on
>> build)
>>
>> diff -Nrup a/sql/sp.cc b/sql/sp.cc
>> --- a/sql/sp.cc 2008-02-19 08:27:17 -07:00
>> +++ b/sql/sp.cc 2008-03-28 11:32:03 -06:00
>> @@ -443,6 +443,8 @@ db_load_routine(THD *thd, int type, sp_n
>> {
>> Lex_input_stream lip(thd, defstr.c_ptr(), defstr.length());
>> thd->m_lip= &lip;
>> + Parser_state parser_state;
>> + thd->m_parser_state= &parser_state;
>>
>
> Add a comment here that thd->m_parser_state will be set to NULL in the parser
> END_OF_INPUT rule.
>
>
What this points to is that, in fact, the interface exposed by the
parser to the caller is awkward.
The caller should not have to set *2* attributes in THD to invoke the
parser,
and since the parser will work even when thd->parser_state is NULL in
most cases
(when no overflow occur), the code around END_OF_INPUT was changed to
prevent bugs,
should another place in the code invoke MYSQLparse() ... which happens
as the code evolves (partitions, events).
This in turn is awkward, since THD::m_parser_state is set in one place
and cleared in another,
as you have noted by asking for a comment.
I changed the code to enforce that:
- the caller only needs to set *one* "thing" in THD to invoke MYSQLparse(),
- the code around END_OF_INPUT is not needed any more,
since all the required parts are guaranteed to be present and properly
initialized.
>> lex_start(thd);
>> thd->spcont= NULL;
>> ret= MYSQLparse(thd);
>> diff -Nrup a/sql/sql_lex.h b/sql/sql_lex.h
>> --- a/sql/sql_lex.h 2007-11-19 09:59:43 -07:00
>> +++ b/sql/sql_lex.h 2008-03-28 11:32:03 -06:00
>> @@ -1280,6 +1279,39 @@ typedef struct st_lex : public Query_tab
>> return FALSE;
>> }
>> } LEX;
>> +
>> +
>> +/**
>> + The internal state of the parser.
>> + This object is only available during parsing,
>> + and is private to the parser implementation.
>> +*/
>> +class Parser_state
>> +{
>> +public:
>> + Parser_state()
>> + : yacc_yyss(NULL), yacc_yyvs(NULL)
>> + {}
>> +
>> + ~Parser_state();
>> +
>> + /**
>> + Bison internal state stack, yyss, when dynamically allocated using
>> + my_yyoverflow().
>> + */
>> + gptr yacc_yyss;
>> +
>> + /**
>> + Bison internal semantic value stack, yyvs, when dynamically allocated using
>> + my_yyoverflow().
>> + */
>> + gptr yacc_yyvs;
>> +
>> + /*
>> + TODO: move more attributes from the LEX structure here.
>> + */
>> +};
>>
>
> Ok, I agree that in principle is the correct approach, the parser state
> should not be in the Lex.
Good.
> But why did you create a second class for the
> parser state ? There is already Lex_input_stream class - which contain
> parser state variables, such as tok_start, tok_end, next_state, yylval,
> etc.
These variables (yylval, tok_start and friends) are by definition the
interface exposed by the lexer to the parser,
so it's normal to have both sql_lex.cc and sql_yacc.yy dependencies to
these attributes in Lex_input_stream.
> Surely, there's not much sense in splitting parser state in two
> classes - yyss and yyvs belong to the same class where yylval and
> next_state do.
>
Actually, I think there is, and this is far reaching.
The names are all "yy<something>" for historical reasons in lex/yacc,
but this does not mean all "yy<something>" belong to the same component.
yyss and yyvs are really private to the yacc implementation,
and are not supposed to be exposed -- not to mention changed -- in the
lexer.
I agree that, *today*, the content of:
(a) - Lex_input_stream
(b) - yyss / yysv
has exactly the same life cycle, and that there is a 1-1 relationship
between the two.
However, that does not mean that (a) and (b) are the same and should be
in the same class,
since they are conceptually very different.
Looking at the history of this bug, this is exactly how we end up
causing severe design flaw bugs
(I view 35577 as a severe design flaw -- the data structures are wrong
--, luckily the fix is easy).
In 4.1, it was "ok" to put (c) yyss / yysv in (d) LEX, because:
- (c) and (d) have (almost) the same life cycle,
- (c) and (d) have a 1-1 relationship.
Then in 5.0, the implementation of stored programs changed things so that:
1 invocation of the parser could generate N LEX structures.
The end result is that it took 66 -- yes, sixty-six -- releases of 5.0
to finally understand that it was wrong,
and that the parser has a big hole in it that can cause a crash at will.
Just because today the lexer state (Lex_input_stream) and the yacc state
(yyss, yysv) are used in a very similar way
is not a reason sufficient (to me) to put the yacc state inside
Lex_input_stream.
My preference is to define data structures based on what the
structure/class *is*, not based on how it's *used*:
how it's used is likely to change, what it is is likely to be more stable.
For example, consider the impact of implementing the equivalent of
"#include <my_file.sql>" in the parser.
This change will most likely involve having 1 Lex_input_stream *per*
included file (yylineno, buffer pointer, etc),
but that does not mean that the internal stack of the parser should be
duplicated ...
If we put yyss / yysv in Lex_input_stream now, how many releases of 6.x
will it take to discover that some enhancement in 6.x
broke a (wrong) design assumption in the parser ?
Also, beside these two "yyss / yysv" attributes, I think there is some
value in separating the
lexer and parser state: many attributes that exist today in LEX or
elsewhere are in the wrong place,
and should eventually end up either in Lex_input_stream or in the yacc
state with yyss,
depending on where they are used (hacks in sql_lex.cc or in sql_yacc.yy).
For example:
- LEX::current_select ?
- LEX::spcont ?
- LEX::length ?
- LEX::dec ?
- LEX::udf ?
- sp_head::m_param_begin ?
- sp_head::m_param_end ?
- sp_head::m_backpatch ?
- sp_head::m_cont_backpatch ?
(surely we don't need to keep all that once parsing is complete, so this
does not belong here)
and probably many more if we look closely at how things are parsed ...
Most of these attributes should not be in Lex_input_stream, they just
don't belong there:
doing this would obfuscate the implementation of the lexer itself
(sql_lex.cc).
While moving these attributes will take time for the existing code,
it's necessary if we want to have the parser produce:
- a clean tree, that can be evaluated in parallel in multiple threads at
runtime (Opps, current_lex ?),
- a small tree (sizeof(LEX)), that does not bloat every statement in a
stored procedure (LEX looks like a junk-yard right now).
Having a separate yacc_state makes it much easier to implement properly new
syntax constructs, when the code in sql_yacc.yy needs a temporary place
to store data during parsing
(normally, this is what $$ is for, but it might not always be practical).
I hope that this (rather long) reply will convince you that yyss and
yysv don't belong *in* Lex_input_stream,
even when I agree that yyss and yysv are really *close* to Lex_input_stream.
Thanks for your review comments, as they exposed a weakness in the first
patch, which is now fixed:
please see the new patch for Bug#35577.
Regards,
-- Marc