I am approving the patch, but I need you to improve the
description of symptom #1, and perhaps symptom #2, which
I am having a hard time following (maybe I missed something).
Overall, though, I think the change you propose fixes
both issues: the timeout and the extra error in the error log.
On 12/09/2010 08:33 AM, Dao-Gang.Qu@stripped wrote:
> #At file:///home/daogang/bzrwork/bug57918/mysql-trunk-bugfixing/ based on
> 3427 Dao-Gang.Qu@stripped 2010-12-09
> Bug #57918 rpl_get_master_version_and_clock times out sporadically
> The reason is that sometimes the slave is unblocked before the
> master server is really shutdown.
> Symptom 1, found warnings/errors in server log file.
> In the situation, the 'get_master_version_and_clock' func
> will return normally inster of network error expected. Then
Typo: 'inster' should be 'instead'.
> the 'handle_slave_io' func will go to invoke 'get_master_uuid'
> func, which will return normally if the master server is
> still alive. Then the process will go on untill invoke
> 'request_dump' func, which will print the "Error on
> COM_BINLOG_DUMP: 2003 Can't connect to MySQL server on
> '127.0.0.1' (111), will retry in 1 secs" error as symptom
> reported if the server is really shutdown in the moment.
> while the test is waiting for the network error.
Please rewrite at least the last sentence. It seems you typed a
'.' instead of a ','?
So... As I understand it, the slave will go as far as 'request_dump'
and then abruptly stop because the master was finally shut down.
When that happens, an error stating that the DUMP failed is written to
the error log.
> Symptom 2, timeout after 900 seconds
> In the situation, the 'get_master_version_and_clock',
> get_master_uuid, and 'request_dump' funcs will retrurn
> normally if the master server is alive during the period,
> then the 'handle_slave_io' func will enter
> 'while (!io_slave_killed(thd,mi))' block and try to connect
> master while the master sever is really shutdown and the
> test will not have a chance to start master, because during
> the period the test will stall there to wait for a network
> error until it exits with a timeout error.
Isn't it possible that the master is restarted successfully *before*
the slave notices that it was down in the first place? Then it would
almost certainly wait forever for the master to restart...
In the scenario you describe here, isn't it the case that the slave
will eventually notice the 2003 error, well before the 900-second timeout?
> To fix these problems to source wait_until_disconnected.inc
Perhaps "we source" instead of "to source".
> after shutdown master server for guaranteeing the master
> server is really shutdown before the slave is unblocked.
> @ mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test
> Update test to guarantee the master server is really
> shutdown before the slave is unblocked.
> === modified file 'mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test'
> --- a/mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test 2010-11-26 13:39:15 +0000
> +++ b/mysql-test/extra/rpl_tests/rpl_get_master_version_and_clock.test 2010-12-09 08:33:39 +0000
> @@ -58,6 +58,7 @@ connection master;
> # Send shutdown to the connected server and give
> # it 10 seconds to die before zapping it
> shutdown_server 10;
> +--source include/wait_until_disconnected.inc
> connection slave;
> --echo slave is unblocked
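The fix works by enforcing an order: the slave is unblocked only after the master has actually stopped accepting connections. My understanding is that include/wait_until_disconnected.inc does this by polling the server until connection attempts fail. Below is a rough Python sketch of that polling idea. It is not the mysqltest code. `wait_until_disconnected`, its `timeout`/`interval` parameters and the plain TCP probe are hypothetical stand-ins.

```python
import socket
import time


def wait_until_disconnected(host: str, port: int,
                            timeout: float = 60.0,
                            interval: float = 0.1) -> bool:
    """Poll host:port until connections are refused.

    Hypothetical Python analogue of the idea behind
    include/wait_until_disconnected.inc, not the mysqltest code.
    Returns True once the server is gone, False if it is still
    accepting connections after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Success means the server is still alive: drop the probe
            # connection immediately and try again later.
            with socket.create_connection((host, port), timeout=interval):
                pass
        except ConnectionRefusedError:
            # Nothing listens on the port any more, so the server is
            # really down and it is now safe to unblock the slave.
            return True
        except OSError:
            # Other errors (e.g. a probe that timed out) are inconclusive.
            pass
        time.sleep(interval)
    return False
```

In a harness, the call would sit between sending the shutdown and unblocking the slave. It does not make the shutdown any faster. It only guarantees that the slave never races ahead of it.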