He Zhenxing, hello.
> Hi Andrei,
>
> Nice work, I only have some comments on the test, please see in line!
>
thanks!
I agreed with of your them. A new patch is following in minutes.
> Andrei Elkin wrote:
>> #At file:///home/andrei/MySQL/BZR/FIXES/5.1-bt-bug38205-stop_slave_trans/ based
> on revid:luis.soares@stripped
>>
>> 2836 Andrei Elkin 2009-03-26
>> Bug#38205 Row-based Replication (RBR) causes inconsistencies:
> HA_ERR_FOUND_DUP
>> Bug#319 if while a non-transactional slave is replicating a transaction
> possible problem
>>
>> It is impossible to roll back a mixed engines transaction when one of the
> engine is
>> non-transaction. In replication that fact is crucial because the slave can
> not safely
>> re-apply a transction that was interrupted with STOP SLAVE.
>>
>> Fixed with making STOP SLAVE not be effective immediately in the case the
> current
>> group of replication events has modified a non-transaction table. In order
> for slave to leave
>> either the group needs finishing or the user issues KILL QUERY|CONNECTION
> slave_thread_id.
>> added:
>> mysql-test/suite/bugs/r/rpl_bug38205.result
>> mysql-test/suite/bugs/t/rpl_bug38205.test
>> mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt
>> modified:
>> mysql-test/suite/rpl/r/rpl_start_stop_slave.result
>> mysql-test/suite/rpl/t/rpl_start_stop_slave.test
>> sql/log_event.cc
>> sql/slave.cc
>>
>> per-file messages:
>> mysql-test/suite/bugs/r/rpl_bug38205.result
>> bug#38205 non-deterministic part of tests results.
>> mysql-test/suite/bugs/t/rpl_bug38205.test
>> bug#38205 non-deterministic part of tests.
>> mysql-test/suite/rpl/r/rpl_start_stop_slave.result
>> bug#38205 deterministic part of tests results.
>> mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt
>> increasing `innodb_lock_wait_timeout' to make the test pass on slow env w/o
>> timeout expired issue.
>> mysql-test/suite/rpl/t/rpl_start_stop_slave.test
>> bug#38205 deterministic part of tests.
>> sql/log_event.cc
>> Augmenting row-based events applying with the notion of
>> thd->transaction.{all,stmt}.modified_non_trans_table.
>> The pair is set and reset according to its specification
>> for the mixed transaction processing.
>> Particualry, once `modified_non_trans_table' is set in the row-events
>> processing loop, it will remain till the commit of the transaction.
>> sql/slave.cc
>> Consulting `thd->transaction.all.modified_non_trans_table' to decide
>> whether to terminate by the sql thread or to continue even though
>> the sql thread might have been STOP-ed (rli->abort_slave).
>> === added file 'mysql-test/suite/bugs/r/rpl_bug38205.result'
>> --- a/mysql-test/suite/bugs/r/rpl_bug38205.result 1970-01-01 00:00:00 +0000
>> +++ b/mysql-test/suite/bugs/r/rpl_bug38205.result 2009-03-26 08:25:06 +0000
>> @@ -0,0 +1,56 @@
>> +stop slave;
>> +drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
>> +reset master;
>> +reset slave;
>> +drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
>> +start slave;
>> +create table t1i(n int primary key) engine=innodb;
>> +create table t2m(n int primary key) engine=myisam;
>> +begin;
>> +insert into t1i values (1);
>> +insert into t1i values (2);
>> +insert into t1i values (3);
>> +commit;
>> +begin;
>> +insert into t1i values (5);
>> +begin;
>> +insert into t1i values (4);
>> +insert into t2m values (1);
>> +update t1i set n = 5 where n = 4;
>> +commit;
>> +zero
>> +0
>> +*** kill sql thread ***
>> +rollback;
>> +*** sql thread is *not* running: No ***
>> +*** the prove: the killed slave has not finished the current transaction ***
>> +three
>> +3
>> +one
>> +1
>> +zero
>> +0
>> +delete from t2m;
>> +start slave sql_thread;
>> +delete from t1i;
>> +delete from t2m;
>> +begin;
>> +insert into t1i values (5);
>> +begin;
>> +insert into t1i values (4);
>> +update t1i set n = 5 where n = 4;
>> +commit;
>> +zero
>> +0
>> +stop slave sql_thread;
>> +rollback;
>> +*** sql thread is running: No ***
>> +*** the prove: the stopped slave has rolled back the current transaction ***
>> +zero
>> +0
>> +zero
>> +0
>> +one
>> +1
>> +start slave sql_thread;
>> +drop table t1i, t2m;
>>
>> === added file 'mysql-test/suite/bugs/t/rpl_bug38205.test'
>> --- a/mysql-test/suite/bugs/t/rpl_bug38205.test 1970-01-01 00:00:00 +0000
>> +++ b/mysql-test/suite/bugs/t/rpl_bug38205.test 2009-03-26 08:25:06 +0000
>> @@ -0,0 +1,165 @@
>> +#
>> +# Bug #38205 Row-based Replication (RBR) causes inconsistencies:
> HA_ERR_FOUND_DUPP_KEY
>> +#
>> +# Verifying the fact that STOP SLAVE in the middle of a group execution waits
>> +# for the end of the group before the slave sql thread will stop.
>> +# The patch refines STOP SLAVE to not interrupt a transaction or other type of
>> +# the replication events group (the part I).
>> +# Killing the sql thread continues to provide a "hard" stop (the part II).
>> +#
>> +# Non-deterministic tests
>> +#
>> +
>> +source include/master-slave.inc;
>> +source include/have_innodb.inc;
>> +
>> +
>> +#
>> +# Part II, killed sql slave leaves instantly
>> +#
>
> I think you do not need to test killing SQL slave thread, because this
> is not addressed by the problem, and the main problem is the test is not
> deterministic (as you stated and I also found that it will fail
> sometimes), so I'd suggest to remove this part of test.
I have to disagree and even twice :-)!
First, killing relates to the bug, not immediately though but rathar
as a solution to provide a hard stop.
For your convinience I have attached a support from the support.
Second, even though I appreciate your wise advice not to do what's
unnecessary a reason to reject a test like I did can be only if
the test is really redundant. We are better to write tests over and
over again without really having any bugs.
I am serious to claim that. It's really helpful. Considering test
coverage three years back, I could be certain with doing changes to
the level that is possible now, thanks to having tests.
I hope you will share either my point.
>
>> +
>> +# A. multi-statement transaction as the replication group
>> +
>> +connection master;
>> +
>> +create table t1i(n int primary key) engine=innodb;
>> +create table t2m(n int primary key) engine=myisam;
>> +
>> +sync_slave_with_master;
>> +
>> +connection master;
>> +
>> +begin;
>> +insert into t1i values (1);
>> +insert into t1i values (2);
>> +insert into t1i values (3);
>> +commit;
>> +
>> +sync_slave_with_master;
>> +
>> +#
>> +# todo: first challenge is to find out the SQL thread id
>> +# the following is not fully reliable
>> +#
>> +
>> +let $id=`SELECT id from information_schema.processlist where user like 'system
> user' and state like '%Has read all relay log%' or user like 'system user' and state like
> '%Reading event from the relay log%'`;
>> +connection slave;
>> +begin;
>> +insert into t1i values (5);
>> +
>> +connection master;
>> +let $pos0_master= query_get_value(SHOW MASTER STATUS, Position, 1);
>> +begin;
>> +insert into t1i values (4);
>> +insert into t2m values (1); # non-ta update
>> +update t1i set n = 5 where n = 4; # to block at. can't be played with killed
>> +commit;
>> +let $pos1_master= query_get_value(SHOW MASTER STATUS, Position, 1);
>> +
>> +connection slave;
>> +# slave sql thread must be locked out by the conn `slave' explicit lock
>> +let $pos0_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
>> +--disable_query_log
>> +eval select $pos0_master - $pos0_slave as zero;
>> +--enable_query_log
>> +
>> +connection slave1;
>> +
>> +let $count= 1;
>> +let $table= t2m;
>> +source include/wait_until_rows_count.inc;
>> +#
>> +# todo: may fail as said above
>> +#
>> +--echo *** kill sql thread ***
>> +--disable_query_log
>> +eval kill connection $id;
>> +--enable_query_log
>> +
>> +connection slave;
>> +rollback; # release the sql thread
>> +
>> +connection slave1;
>> +
>> +source include/wait_for_slave_sql_to_stop.inc;
>> +let $sql_status= query_get_value(SHOW SLAVE STATUS, Slave_SQL_Running, 1);
>> +--echo *** sql thread is *not* running: $sql_status ***
>> +let $pos1_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
>> +
>> +connection slave;
>> +--echo *** the prove: the killed slave has not finished the current transaction
> ***
>> +
>> +--disable_query_log
>> +select count(*) as three from t1i;
>> +eval select $pos1_master > $pos1_slave as one;
>> +eval select $pos1_slave - $pos0_slave as zero;
>> +--enable_query_log
>> +
>> +delete from t2m; # remove the row to be able to replay
>> +start slave sql_thread;
>> +
>> +#
>> +# Part I: B The homogenous transaction remains interuptable in between
>> +#
>> +
>
> I think this part of test can be merged into rpl_start_stop_slave.test.
Alas, it's not. I marked that with `Non-deterministic tests' warning
at the beginning of the file.
>
>> +connection master;
>> +delete from t1i;
>> +delete from t2m;
>> +
>> +sync_slave_with_master;
>> +begin;
>> +insert into t1i values (5);
>> +
>> +connection master;
>> +let $pos0_master= query_get_value(SHOW MASTER STATUS, Position, 1);
>> +begin;
>> +insert into t1i values (4);
>> +update t1i set n = 5 where n = 4; # to block at. not to be played
>> +commit;
>> +let $pos1_master= query_get_value(SHOW MASTER STATUS, Position, 1);
>> +
>> +
>> +connection slave1;
>> +# slave sql can't advance as must be locked by the conn `slave' trans
>> +let $pos0_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
>> +--disable_query_log
>> +eval select $pos0_master - $pos0_slave as zero;
>> +--enable_query_log
>> +
>> +#
>> +# the replicated trans is blocked by the slave's local.
>
> local => local transaction
that would be correct too.
In my version `local' is not the adjective. It stands for the noun.
>
>> +# However, it's not easy to catch the exact moment when it happens.
>> +# The test issues sleep which makes the test either non-deterministic or
>> +# wasting too much time.
>> +#
>> +--sleep 3
>> +
>
"The homogenous transaction remains interuptable in between" - that's
the purpose of the test.
If you agree on that, STOP SLAVE has to come when the transaction is
ongoing, has taken a lock for instance.
> Hmm, I think it's not necessary to sleep here, because it's OK to issue
> the STOP SLAVE command before SQL thread reached the point of waiting
> for lock on table t1i
Otherwise, if STOP SLAVE enters before the transaction has grabbed any
lock we would not test the claimed interruptibility.
>, we just need to make sure the STOP SLAVE command
> is issued before the transaction finishes (which is guarranteed by the
> lock on table t1i).
`Before the transaction finishes' that's true. However, lines above
must have explained why it's not `just'.
>> +send stop slave sql_thread;
>> +
>> +connection slave;
>> +rollback; # release the sql thread
>> +
>> +connection slave1;
>> +reap;
>> +source include/wait_for_slave_sql_to_stop.inc;
>> +let $sql_status= query_get_value(SHOW SLAVE STATUS, Slave_SQL_Running, 1);
>> +--echo *** sql thread is running: $sql_status ***
>> +
>
> should be *** sql thread is *NOT* running: ...
this can be changed, if you got confused...
I thought that the colon sign specifies how exactly the
thread is running. `No' is important and it takes place.
>
>> +let $pos1_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
>> +
>> +--echo *** the prove: the stopped slave has rolled back the current transaction
> ***
>> +
>> +--disable_query_log
>> +select count(*) as zero from t1i;
>> +eval select $pos0_master - $pos0_slave as zero;
>> +eval select $pos1_master > $pos0_slave as one;
>> +--enable_query_log
>> +
>> +start slave sql_thread;
>> +
>
> source include/wait_for_slave_sql_to_start.inc;
That's a possiblity, i'd rather use it.
Still the patch's way `start slave' does not make
the test incorrect.
I mean `START SLAVE sql_thread' is fully synchronous and if it returns OK
>
>> +# clean-up
>> +
>> +connection master;
>> +drop table t1i, t2m;
>> +
>> +sync_slave_with_master;
there is no way to miss synchronization down here.
>>
>> === modified file 'mysql-test/suite/rpl/r/rpl_start_stop_slave.result'
>> --- a/mysql-test/suite/rpl/r/rpl_start_stop_slave.result 2007-12-06 15:27:10
> +0000
>> +++ b/mysql-test/suite/rpl/r/rpl_start_stop_slave.result 2009-03-26 08:25:06
> +0000
>> @@ -10,3 +10,31 @@ start slave;
>> stop slave io_thread;
>> start slave io_thread;
>> drop table t1;
>> +create table t1i(n int primary key) engine=innodb;
>> +create table t2m(n int primary key) engine=myisam;
>> +begin;
>> +insert into t1i values (1);
>> +insert into t1i values (2);
>> +insert into t1i values (3);
>> +commit;
>> +begin;
>> +insert into t1i values (5);
>> +begin;
>> +insert into t1i values (4);
>> +insert into t2m values (1);
>> +insert into t1i values (5);
>> +commit;
>> +zero
>> +0
>> +stop slave;
>> +rollback;
>> +*** sql thread is running: No ***
>> +*** the prove: the stopped slave has finished the current transaction ***
>> +five
>> +5
>> +zero
>> +0
>> +one
>> +1
>> +start slave;
>> +drop table t1i, t2m;
>>
>> === added file 'mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt'
>> --- a/mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt 1970-01-01 00:00:00
> +0000
>> +++ b/mysql-test/suite/rpl/t/rpl_start_stop_slave-slave.opt 2009-03-26 08:25:06
> +0000
>> @@ -0,0 +1 @@
>> +--innodb_lock_wait_timeout=60
>>
>> === modified file 'mysql-test/suite/rpl/t/rpl_start_stop_slave.test'
>> --- a/mysql-test/suite/rpl/t/rpl_start_stop_slave.test 2008-12-12 11:25:36 +0000
>> +++ b/mysql-test/suite/rpl/t/rpl_start_stop_slave.test 2009-03-26 08:25:06 +0000
>> @@ -1,4 +1,5 @@
>> source include/master-slave.inc;
>> +source include/have_innodb.inc;
>>
>> #
>> # Bug#6148 ()
>> @@ -35,4 +36,88 @@ save_master_pos;
>> connection slave;
>> sync_with_master;
>>
>> -# End of 4.1 tests
>> +
>> +#
>> +# Bug #38205 Row-based Replication (RBR) causes inconsistencies...
>> +#
>
> Please put reference to BUG#319 here too
ok
>
>> +# Verifying that STOP SLAVE does not interrupt excution of a group
>> +# execution of events if the group can not roll back.
>> +# Killing the sql thread continues to provide a "hard" stop (the
>> +# part II, moved to the bugs suite as it's hard to make it
>> +# deterministic with KILL).
>> +#
>> +
>> +#
>> +# Part I. The being stopped sql thread finishes first the current group of
>> +# events if the group contains an event on a non-transaction table.
>> +
>> +connection master;
>> +create table t1i(n int primary key) engine=innodb;
>> +create table t2m(n int primary key) engine=myisam;
>> +begin;
>> +insert into t1i values (1);
>> +insert into t1i values (2);
>> +insert into t1i values (3);
>> +commit;
>> +
>> +sync_slave_with_master;
>> +connection slave;
>> +begin;
>> +insert into t1i values (5);
>> +
>> +connection master;
>> +let $pos0_master= query_get_value(SHOW MASTER STATUS, Position, 1);
>> +begin;
>> +insert into t1i values (4);
>> +insert into t2m values (1); # non-ta update to process
>> +insert into t1i values (5); # to block at. to be played with stopped
>> +commit;
>> +
>> +connection slave;
>> +# slave sql thread must be locked out by the conn `slave' explicit lock
>> +let $pos0_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
>> +--disable_query_log
>> +eval select $pos0_master - $pos0_slave as zero;
>> +--enable_query_log
>> +
>> +connection slave1;
>> +let $count= 1;
>> +let $table= t2m;
>> +source include/wait_until_rows_count.inc;
>> +send stop slave;
>> +
>> +connection slave;
>> +rollback; # release the sql thread
>> +
>> +connection slave1;
>> +reap;
>> +source include/wait_for_slave_to_stop.inc;
>> +let $sql_status= query_get_value(SHOW SLAVE STATUS, Slave_SQL_Running, 1);
>> +--echo *** sql thread is running: $sql_status ***
>> +
>
> should be *** sql thread is *NOT* running: ...
>
ok
>> +
>> +connection master;
>> +let $pos1_master= query_get_value(SHOW MASTER STATUS, Position, 1);
>> +
>> +connection slave;
>> +source include/wait_for_slave_sql_to_stop.inc;
>> +
>
> This is redundant, there is already a wait for stop above.
>
right, will remove.
>> +let $pos1_slave= query_get_value(SHOW SLAVE STATUS, Exec_Master_Log_Pos, 1);
>> +
>> +--echo *** the prove: the stopped slave has finished the current transaction
> ***
>> +
>> +--disable_query_log
>> +select count(*) as five from t1i;
>> +eval select $pos1_master - $pos1_slave as zero;
>> +eval select $pos1_slave > $pos0_slave as one;
>> +--enable_query_log
>> +
>> +start slave;
>> +
>
> use source include/start_slave.inc;
> or add
> source include/wait_for_slave_to_start.inc;
agreed, thanks!
>
>> +# clean-up
>> +connection master;
>> +drop table t1i, t2m;
>> +
>> +sync_slave_with_master;
>> +
>> +# End of tests
>>
>> === modified file 'sql/log_event.cc'
>> --- a/sql/log_event.cc 2009-03-05 10:23:46 +0000
>> +++ b/sql/log_event.cc 2009-03-26 08:25:06 +0000
>> @@ -7114,7 +7114,12 @@ int Rows_log_event::do_apply_event(Relay
>> */
>> lex_start(thd);
>> mysql_reset_thd_for_next_command(thd);
>> -
>> + /*
>> + The current statement is just about to begin and
>> + has not yet modified anything. Note, all.modified is reset
>> + by mysql_reset_thd_for_next_command.
>> + */
>> + thd->transaction.stmt.modified_non_trans_table= FALSE;
>> /*
>> Check if the slave is set to use SBR. If so, it should switch
>> to using RBR until the end of the "statement", i.e., next
>> @@ -7217,6 +7222,7 @@ int Rows_log_event::do_apply_event(Relay
>>
>> if (table)
>> {
>> + bool transactional_table= table->file->has_transactions();
>> /*
>> table == NULL means that this table should not be replicated
>> (this was set up by Table_map_log_event::do_apply_event()
>> @@ -7348,6 +7354,9 @@ int Rows_log_event::do_apply_event(Relay
>>
>> m_curr_row= m_curr_row_end;
>>
>> + if (error == 0 && !transactional_table)
>> + thd->transaction.all.modified_non_trans_table=
>> + thd->transaction.stmt.modified_non_trans_table= TRUE;
>> } // row processing loop
>>
>> DBUG_EXECUTE_IF("STOP_SLAVE_after_first_Rows_event",
>>
>> === modified file 'sql/slave.cc'
>> --- a/sql/slave.cc 2009-01-09 12:49:24 +0000
>> +++ b/sql/slave.cc 2009-03-26 08:25:06 +0000
>> @@ -687,6 +687,9 @@ static bool sql_slave_killed(THD* thd, R
>> DBUG_ASSERT(rli->slave_running == 1);// tracking buffer overrun
>> if (abort_loop || thd->killed || rli->abort_slave)
>> {
>> + if (rli->abort_slave && rli->is_in_group() &&
>> + thd->transaction.all.modified_non_trans_table)
>> + DBUG_RETURN(0);
>> /*
>> If we are in an unsafe situation (stopping could corrupt replication),
>> we give one minute to the slave SQL thread of grace before really
>>
>>
>
>
====================================================================================
From: Sinisa Milivojevic <Sinisa.Milivojevic@stripped>
Subject: Re: "delayed" STOP SLAVE bug#38205
To: Andrei Elkin <Andrei.Elkin@stripped>
Cc: support-private@stripped, dev-replication@stripped
Date: Thu, 26 Mar 2009 19:49:50 +0200
Organization: Sun Microsystems
Reply-to: Sinisa.Milivojevic@stripped
Andrei Elkin wrote:
> Hello.
>
> Two bugs bug#38205 (row-based) + bug#319 (stmt) deal with fact
> STOP SLAVE could interrupt ongoing mixed with a not-trans table
> transaction.
>
> The patch http://lists.mysql.com/commits/70478
>
> defers the slave stopped status until the mixed engine transaction
> or a group of row-based events on a non-trans table has
> committed/completed.
>
> The existing `KILL sql_thread' remains to be a hard stop for the
> slave, the same for `mysqladmin shutdown'.
>
> In my view there is no need for a new option requesting the former
> "eager" reaction on STOP SLAVE.
> This view found understating on #support today.
>
> <susanne> andrei: Valeriy: FYI i only gave my ok when it is proper
> documented
>
> [11:02]<sveta> andrei: regarding to stop slave behavior I think fix is good
>
> [14:16]<Valeriy> andrei: please, send your request for the feedback to
> support-private@stripped. Let's have it all documnted in the
> email thread
> *** ray|away (~ray@stripped) is now known as ray
> [14:17]<Valeriy> andrei: please, explain how kill sql_tread would be different
> from current stop slave
> <Valeriy> andrei: for both sbr and rbr cases
> <Valeriy> andrei: then you
> [14:18]<Valeriy> andrei: 'll get feedback from all experts who cares :)
>
>
> Welcome with your suggestion!
>
> Andrei
Andrei,
Fix fine for me and I don't think we need option here.