List:Commits« Previous MessageNext Message »
From:Alfranio Correia Date:April 2 2009 9:57am
Subject:Re: bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319
Bug#38205
View as plain text  
Hi Andrei,

Great work, sorry for the delay but this was a busy week.

Find some comments in-line.


Andrei Elkin wrote:
> #At file:///home/andrei/MySQL/BZR/FIXES/5.1-bt-bug38205-stop_slave_trans/ based on
> revid:luis.soares@stripped
>
>  2836 Andrei Elkin	2009-03-26
>       Bug#38205 Row-based Replication (RBR) causes inconsistencies: HA_ERR_FOUND_DUP
>       Bug#319  if while a non-transactional slave is replicating a transaction
> possible problem 
>       
>       It is impossible to roll back a mixed engines transaction when one of the
> engine is
>       non-transaction. In replication that fact is crucial because the slave can not
> safely
>       re-apply a transction that was interrupted with STOP SLAVE.
>       
>       Fixed with making STOP SLAVE not be effective immediately in the case the
> current
>       group of replication events has modified a non-transaction table. In order for
> slave to leave
>       either the group needs finishing or the user issues KILL QUERY|CONNECTION
> slave_thread_id.
> added:
>   mysql-test/suite/bugs/r/rpl_bug38205.result
>   mysql-test/suite/bugs/t/rpl_bug38205.test
>   mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt
> modified:
>   mysql-test/suite/rpl/r/rpl_start_stop_slave.result
>   mysql-test/suite/rpl/t/rpl_start_stop_slave.test
>   sql/log_event.cc
>   sql/slave.cc
>
> per-file messages:
>   mysql-test/suite/bugs/r/rpl_bug38205.result
>     bug#38205 non-deterministic part of tests results.
>   mysql-test/suite/bugs/t/rpl_bug38205.test
>     bug#38205 non-deterministic part of tests.
>   mysql-test/suite/rpl/r/rpl_start_stop_slave.result
>     bug#38205 deterministic part of tests results.
>   mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt
>     increasing `innodb_lock_wait_timeout' to make the test pass on slow env w/o
>     timeout expired issue.
>   mysql-test/suite/rpl/t/rpl_start_stop_slave.test
>     bug#38205 deterministic part of tests.
>   sql/log_event.cc
>     Augmenting row-based events applying with the notion of 
>     thd->transaction.{all,stmt}.modified_non_trans_table.
>     The pair is set and reset according to its specification
>     for the mixed transaction processing.
>     Particualry, once `modified_non_trans_table' is set in the row-events
>     processing loop, it will remain till the commit of the transaction.
>   sql/slave.cc
>     Consulting `thd->transaction.all.modified_non_trans_table' to decide
>     whether to terminate by the sql thread or to continue even though
>     the sql thread might have been STOP-ed (rli->abort_slave).
> === added file 'mysql-test/suite/bugs/r/rpl_bug38205.result'
> --- a/mysql-test/suite/bugs/r/rpl_bug38205.result	1970-01-01 00:00:00 +0000
> +++ b/mysql-test/suite/bugs/r/rpl_bug38205.result	2009-03-26 08:25:06 +0000
> @@ -0,0 +1,56 @@
> +stop slave;
> +drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
> +reset master;
> +reset slave;
> +drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
> +start slave;
> +create table t1i(n int primary key) engine=innodb;
> +create table t2m(n int primary key) engine=myisam;
> +begin;
> +insert into t1i values (1);
> +insert into t1i values (2);
> +insert into t1i values (3);
> +commit;
> +begin;
> +insert into t1i values (5);
> +begin;
> +insert into t1i values (4);
> +insert into t2m values (1);
> +update t1i set n = 5 where n = 4;
> +commit;
> +zero
> +0
> +*** kill sql thread ***
> +rollback;
> +*** sql thread is *not* running: No ***
> +*** the prove: the killed slave has not finished the current transaction ***
> +three
> +3
> +one
> +1
> +zero
> +0
> +delete from t2m;
> +start slave sql_thread;
> +delete from t1i;
> +delete from t2m;
> +begin;
> +insert into t1i values (5);
> +begin;
> +insert into t1i values (4);
> +update t1i set n = 5 where n = 4;
> +commit;
> +zero
> +0
> +stop slave sql_thread;
> +rollback;
> +*** sql thread is running: No ***
> +*** the prove: the stopped slave has rolled back the current transaction ***
> +zero
> +0
> +zero
> +0
> +one
> +1
> +start slave sql_thread;
> +drop table t1i, t2m;
>
> === added file 'mysql-test/suite/bugs/t/rpl_bug38205.test'
> --- a/mysql-test/suite/bugs/t/rpl_bug38205.test	1970-01-01 00:00:00 +0000
> +++ b/mysql-test/suite/bugs/t/rpl_bug38205.test	2009-03-26 08:25:06 +0000
> @@ -0,0 +1,165 @@
> +#
> +# Bug #38205 Row-based Replication (RBR) causes inconsistencies:
> HA_ERR_FOUND_DUPP_KEY
> +#
> +# Verifying the fact that STOP SLAVE in the middle of a group execution waits
> +# for the end of the group before the slave sql thread will stop.
> +# The patch refines STOP SLAVE to not interrupt a transaction or other type of 
> +# the replication events group (the part I).
> +# Killing the sql thread continues to provide a "hard" stop (the part II).
> +#
> +# Non-deterministic tests
> +#
> +
> +source include/master-slave.inc;
> +source include/have_innodb.inc;
> +
> +
> +#
> +# Part II, killed sql slave leaves instantly
> +#
> +
> +# A. multi-statement transaction as the replication group
> +
> +connection master;
> +
> +create table t1i(n int primary key) engine=innodb;
> +create table t2m(n int primary key) engine=myisam;
> +
> +sync_slave_with_master;
> +
> +connection master;
> +
> +begin;
> +insert into t1i values (1);
> +insert into t1i values (2);
> +insert into t1i values (3);
> +commit;
> +
> +sync_slave_with_master;
> +
> +#
> +# todo: first challenge is to find out the SQL thread id
> +# the following is not fully reliable
> +#
> +
> +let $id=`SELECT id from information_schema.processlist where user like 'system user'
> and state like '%Has read all relay log%' or user like 'system user' and state like
> '%Reading event from the relay log%'`;
> +connection slave;
> +begin;
> +insert into t1i values (5);
> +
> +connection master;
> +let $pos0_master= query_get_value(SHOW MASTER STATUS, Position, 1);
> +begin;
> +insert into t1i values (4);
> +insert into t2m values (1); # non-ta update
> +update t1i set n = 5 where n = 4; # to block at. can't be played with killed
> +commit;
> +let $pos1_master= query_get_value(SHOW MASTER STATUS, Position, 1);
> +
> +connection slave;
> +# slave sql thread must be locked out by the conn `slave' explicit lock
> +let $pos0_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
> +--disable_query_log
> +eval select $pos0_master - $pos0_slave as zero;
> +--enable_query_log
> +
> +connection slave1;
> +
> +let $count= 1;
> +let $table= t2m;
> +source include/wait_until_rows_count.inc;
> +#
> +# todo: may fail as said above
> +#
> +--echo *** kill sql thread ***
> +--disable_query_log
> +eval kill connection $id;
> +--enable_query_log
> +
> +connection slave;
> +rollback; # release the sql thread
> +
> +connection slave1;
> +
> +source include/wait_for_slave_sql_to_stop.inc;
> +let $sql_status= query_get_value(SHOW SLAVE STATUS, Slave_SQL_Running, 1);
> +--echo *** sql thread is *not* running: $sql_status ***
> +let $pos1_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
> +
> +connection slave;
> +--echo *** the prove: the killed slave has not finished the current transaction ***
> +
> +--disable_query_log
> +select count(*) as three from t1i;
> +eval select $pos1_master > $pos1_slave as one;
> +eval select $pos1_slave - $pos0_slave as zero;
> +--enable_query_log
> +
> +delete from t2m; # remove the row to be able to replay
> +start slave sql_thread;
> +
> +#
> +# Part I: B The homogenous transaction remains interuptable in between
> +#
> +
> +connection master;
> +delete from t1i;
> +delete from t2m;
> +
> +sync_slave_with_master;
> +begin;
> +insert into t1i values (5);
> +
> +connection master;
> +let $pos0_master= query_get_value(SHOW MASTER STATUS, Position, 1);
> +begin;
> +insert into t1i values (4);
> +update t1i set n = 5 where n = 4; # to block at. not to be played
> +commit;
> +let $pos1_master= query_get_value(SHOW MASTER STATUS, Position, 1);
> +
> +
> +connection slave1;
> +# slave sql can't advance as must be locked by the conn `slave' trans
> +let $pos0_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
> +--disable_query_log
> +eval select $pos0_master - $pos0_slave as zero;
> +--enable_query_log
> +
> +#
> +# the replicated trans is blocked by the slave's local.
> +# However, it's not easy to catch the exact moment when it happens.
> +# The test issues sleep which makes the test either non-deterministic or
> +# wasting too much time.
> +#
> +--sleep 3
> +
> +send stop slave sql_thread;
> +
> +connection slave;
> +rollback; # release the sql thread
> +
> +connection slave1;
> +reap;
> +source include/wait_for_slave_sql_to_stop.inc;
> +let $sql_status= query_get_value(SHOW SLAVE STATUS, Slave_SQL_Running, 1);
> +--echo *** sql thread is running: $sql_status ***
> +
> +let $pos1_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
> +
> +--echo *** the prove: the stopped slave has rolled back the current transaction ***
> +
> +--disable_query_log
> +select count(*) as zero from t1i;
> +eval select $pos0_master - $pos0_slave as zero;
> +eval select $pos1_master > $pos0_slave as one;
> +--enable_query_log
> +
> +start slave sql_thread;
> +
> +# clean-up
> +
> +connection master;
> +drop table t1i, t2m;
> +
> +sync_slave_with_master;
>
> === modified file 'mysql-test/suite/rpl/r/rpl_start_stop_slave.result'
> --- a/mysql-test/suite/rpl/r/rpl_start_stop_slave.result	2007-12-06 15:27:10 +0000
> +++ b/mysql-test/suite/rpl/r/rpl_start_stop_slave.result	2009-03-26 08:25:06 +0000
> @@ -10,3 +10,31 @@ start slave;
>  stop slave io_thread;
>  start slave io_thread;
>  drop table t1;
> +create table t1i(n int primary key) engine=innodb;
> +create table t2m(n int primary key) engine=myisam;
> +begin;
> +insert into t1i values (1);
> +insert into t1i values (2);
> +insert into t1i values (3);
> +commit;
> +begin;
> +insert into t1i values (5);
> +begin;
> +insert into t1i values (4);
> +insert into t2m values (1);
> +insert into t1i values (5);
> +commit;
> +zero
> +0
> +stop slave;
> +rollback;
> +*** sql thread is running: No ***
> +*** the prove: the stopped slave has finished the current transaction ***
> +five
> +5
> +zero
> +0
> +one
> +1
> +start slave;
> +drop table t1i, t2m;
>
> === added file 'mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt'
> --- a/mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt	1970-01-01 00:00:00
> +0000
> +++ b/mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt	2009-03-26 08:25:06
> +0000
> @@ -0,0 +1 @@
> +--innodb_lock_wait_timeout=60
>
> === modified file 'mysql-test/suite/rpl/t/rpl_start_stop_slave.test'
> --- a/mysql-test/suite/rpl/t/rpl_start_stop_slave.test	2008-12-12 11:25:36 +0000
> +++ b/mysql-test/suite/rpl/t/rpl_start_stop_slave.test	2009-03-26 08:25:06 +0000
> @@ -1,4 +1,5 @@
>  source include/master-slave.inc;
> +source include/have_innodb.inc;
>  
>  #
>  # Bug#6148 ()
> @@ -35,4 +36,88 @@ save_master_pos;
>  connection slave;
>  sync_with_master;
>  
> -# End of 4.1 tests
> +
> +#
> +# Bug #38205 Row-based Replication (RBR) causes inconsistencies...
> +#
> +# Verifying that STOP SLAVE does not interrupt excution of a group
> +# execution of events if the group can not roll back.
> +# Killing the sql thread continues to provide a "hard" stop (the
> +# part II, moved to the bugs suite as it's hard to make it
> +# deterministic with KILL).
> +#
> +
> +#
> +# Part I. The being stopped sql thread finishes first the current group of 
> +#         events if the group contains an event on a non-transaction table.
> +
> +connection master;
> +create table t1i(n int primary key) engine=innodb;
> +create table t2m(n int primary key) engine=myisam;
> +begin;
> +insert into t1i values (1);
> +insert into t1i values (2);
> +insert into t1i values (3);
> +commit;
> +
> +sync_slave_with_master;
> +connection slave;
> +begin;
> +insert into t1i values (5);
> +
> +connection master;
> +let $pos0_master= query_get_value(SHOW MASTER STATUS, Position, 1);
> +begin;
> +insert into t1i values (4);
> +insert into t2m values (1); # non-ta update to process
> +insert into t1i values (5); # to block at. to be played with stopped
> +commit;
> +
> +connection slave;
> +# slave sql thread must be locked out by the conn `slave' explicit lock
> +let $pos0_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
> +--disable_query_log
> +eval select $pos0_master - $pos0_slave as zero;
> +--enable_query_log
> +
> +connection slave1;
> +let $count= 1;
> +let $table= t2m;
> +source include/wait_until_rows_count.inc;
> +send stop slave;
> +
> +connection slave;
> +rollback; # release the sql thread
> +
>   

How many slaves do you have? Two?
If you have two slave, please describe the architecture of your test.
Otherwise, please, replace slave1 by slave.

I think you have just one master and one slave.
> +connection slave1;
> +reap;
> +source include/wait_for_slave_to_stop.inc;
> +let $sql_status= query_get_value(SHOW SLAVE STATUS, Slave_SQL_Running, 1);
> +--echo *** sql thread is running: $sql_status ***
> +
> +
> +connection master;
> +let $pos1_master= query_get_value(SHOW MASTER STATUS, Position, 1);
>   
If you have just one slave, please, this second wait is unnecessary.
> +
> +connection slave;
> +source include/wait_for_slave_sql_to_stop.inc;
>   
above.
> +
> +let $pos1_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
> +
> +--echo *** the prove: the stopped slave has finished the current transaction ***
>   
above.

> +
> +--disable_query_log
> +select count(*) as five from t1i;
> +eval select $pos1_master - $pos1_slave as zero;
> +eval select $pos1_slave > $pos0_slave as one;
> +--enable_query_log
> +
> +start slave;
> +
> +# clean-up
> +connection master;
> +drop table t1i, t2m;
> +
> +sync_slave_with_master;
> +
> +# End of tests
>
> === modified file 'sql/log_event.cc'
> --- a/sql/log_event.cc	2009-03-05 10:23:46 +0000
> +++ b/sql/log_event.cc	2009-03-26 08:25:06 +0000
> @@ -7114,7 +7114,12 @@ int Rows_log_event::do_apply_event(Relay
>      */
>      lex_start(thd);
>      mysql_reset_thd_for_next_command(thd);
> -
> +    /*
> +      The current statement is just about to begin and 
> +      has not yet modified anything. Note, all.modified is reset
> +      by mysql_reset_thd_for_next_command.
> +    */
>   
Change this comment because this is not completely true.
The all.modified is just reset when the following conditions are
true:

If in autocommit mode and not in a transaction, reset
OPTION_STATUS_NO_TRANS_UPDATE | OPTION_KEEP_LOG to not get warnings
in ha_rollback_trans() about some tables couldn't be rolled back.
*/
if (!(thd->options & (OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN)))
{
thd->options&= ~OPTION_KEEP_LOG;
thd->transaction.all.modified_non_trans_table= FALSE;
}

Otherwise, your patch would be wrong. Maybe you can copy part of the
comment.

Can you point out where this flag is set by the replication?
> +    thd->transaction.stmt.modified_non_trans_table= FALSE;
>      /*
>        Check if the slave is set to use SBR.  If so, it should switch
>        to using RBR until the end of the "statement", i.e., next
> @@ -7217,6 +7222,7 @@ int Rows_log_event::do_apply_event(Relay
>  
>    if (table)
>    {
> +    bool transactional_table= table->file->has_transactions();
>      /*
>        table == NULL means that this table should not be replicated
>        (this was set up by Table_map_log_event::do_apply_event()
> @@ -7348,6 +7354,9 @@ int Rows_log_event::do_apply_event(Relay
>    
>        m_curr_row= m_curr_row_end;
>   
> +      if (error == 0 && !transactional_table)
> +        thd->transaction.all.modified_non_trans_table=
> +          thd->transaction.stmt.modified_non_trans_table= TRUE;
>      } // row processing loop
>  
>      DBUG_EXECUTE_IF("STOP_SLAVE_after_first_Rows_event",
>
> === modified file 'sql/slave.cc'
> --- a/sql/slave.cc	2009-01-09 12:49:24 +0000
> +++ b/sql/slave.cc	2009-03-26 08:25:06 +0000
> @@ -687,6 +687,9 @@ static bool sql_slave_killed(THD* thd, R
>    DBUG_ASSERT(rli->slave_running == 1);// tracking buffer overrun
>    if (abort_loop || thd->killed || rli->abort_slave)
>    {
> +    if (rli->abort_slave && rli->is_in_group() &&
> +        thd->transaction.all.modified_non_trans_table)
> +      DBUG_RETURN(0);
>      /*
>        If we are in an unsafe situation (stopping could corrupt replication),
>        we give one minute to the slave SQL thread of grace before really
>
>
>   

Thread
bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319Bug#38205Andrei Elkin26 Mar
  • Re: bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319Bug#38205Alfranio Correia2 Apr
    • Re: bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319Bug#38205Andrei Elkin2 Apr
      • Re: bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319Bug#38205Alfranio Correia2 Apr
  • Re: bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319Bug#38205He Zhenxing8 Apr
    • Re: bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319Bug#38205Andrei Elkin9 Apr
      • Re: bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319Bug#38205Andrei Elkin9 Apr
        • Re: bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319Bug#38205He Zhenxing10 Apr
      • Re: bzr commit into mysql-5.1-bugteam branch (aelkin:2836) Bug#319Bug#38205He Zhenxing10 Apr