Hi Ramil,
Thanks for implementing my suggestions and especially adding documentation to
try_to_reconnect(). Below I propose how to extend it, hopefully making it a bit
more clear what the function actually does.
I have also one more suggestion (implement it if you like it). I'd move handling
of master_retry_count outside try_to_reconnect() function. I.e., make the
function to reconnect unconditionally and simply not call it if we don't want
to. I know it will cause code repetition but I think it will make the code
easier to read and understand.
Best,
Rafal
ramil@stripped wrote:
> Below is the list of changes that have just been committed into a local
> 5.1 repository of ram. When ram does a push these changes will
> be propagated to the main repository and, within 24 hours after the
> push, to the public repository.
> For information on how to access the public repository
> see http://dev.mysql.com/doc/mysql/en/installing-source-tree.html
>
> ChangeSet@stripped, 2007-06-28 17:18:29+05:00, ramil@stripped +1 -0
> Fix for bug #19328: Slave timeout with COM_REGISTER_SLAVE error causing stop
>
> Problem: "Under high load, the slave registering to the master can timeout
> during the COM_REGISTER_SLAVE execution. This causes an error, which
> prevents the slave from connecting at all."
>
> Fix: Do not abort immediately, but retry registering on master.
>
> sql/slave.cc@stripped, 2007-06-28 17:18:27+05:00, ramil@stripped +203 -100
> Fix for bug #19328: Slave timeout with COM_REGISTER_SLAVE error causing stop
> - retry connecting after a failed slave registration on master.
> Reconnecting has been factored out (see try_to_reconnect() and
> check_io_slave_killed()) to get rid of similar code used during
> registering on master/dump requesting/event reading.
> All messages have been moved to reconnect_messages[] array
> for easy usage and maintenance.
> - now one can force slave to reconnect using
> ./mtr --mysqld=--loose-debug=d,FORCE_SLAVE_TO_RECONNECT_{REG, DUMP, EVENT}
>
> diff -Nrup a/sql/slave.cc b/sql/slave.cc
> --- a/sql/slave.cc 2007-06-12 01:15:29 +05:00
> +++ b/sql/slave.cc 2007-06-28 17:18:27 +05:00
(...)
> @@ -1879,6 +1942,89 @@ on this slave.\
> }
>
>
> +static bool check_io_slave_killed(THD *thd, MASTER_INFO *mi, const char *info)
> +{
> + if (io_slave_killed(thd, mi))
> + {
> + if (global_system_variables.log_warnings)
> + sql_print_information(info);
> + return TRUE;
> + }
> + return FALSE;
> +}
> +
> +
> +/*
Doxygen comment should start with /**
> + @brief Try to reconnect slave IO thread.
> +
I propose to add this to detailed description:
Terminates current connection to master, sleeps for @c mi->connect_retry msecs
and initiates new connection with @c safe_reconnect(). Variable pointed by @c
retry_count is increased - if it exceeds @c master_retry_count then connection
is not re-established and function signals error.
Unless suppres_warnings is TRUE, a warning is put in the server error log when
reconnecting. The warning message and messages used to report errors are taken
from @c messages array. In case @c master_retry_count is exceeded, no messages
are added to the log.
> + @details Under high load, the slave registering on the master/
> + requesting a dump/reading an event can timeout during execution.
> + If so, don't abort immediately, but retry reconnecting up to
> + master_retry_count times.
> +
> + @param[in] thd Thread context.
> + @param[in] mysql MySQL connection.
> + @param[in] mi Master connection information.
> + @param[in,out] retry_count Number of attempts to reconnect.
> + @param[in] suppress_warnings TRUE when a normal net read timeout
> + has caused to reconnecting.
> + @param[in] messages Messages to print/log, see
> + reconnect_messages[] array.
> +
> + @retval 0 OK.
> + @retval 1 There was an error.
> +*/
> +
> +static int try_to_reconnect(THD *thd, MYSQL *mysql, MASTER_INFO *mi,
> + uint *retry_count, bool suppress_warnings,
> + const char *messages[SLAVE_RECON_MSG_MAX])
> +{
> + mi->slave_running= MYSQL_SLAVE_RUN_NOT_CONNECT;
> + thd->proc_info= messages[SLAVE_RECON_MSG_WAIT];
> +#ifdef SIGNAL_WITH_VIO_CLOSE
> + thd->clear_active_vio();
> +#endif
> + end_server(mysql);
> + if ((*retry_count)++)
> + {
> + if (*retry_count > master_retry_count)
> + return 1; // Don't retry forever
> + safe_sleep(thd, mi->connect_retry, (CHECK_KILLED_FUNC) io_slave_killed,
> + (void *) mi);
> + }
> + if (check_io_slave_killed(thd, mi, messages[SLAVE_RECON_MSG_KILLED_WAITING]))
> + return 1;
> + thd->proc_info = messages[SLAVE_RECON_MSG_AFTER];
> + if (!suppress_warnings)
> + {
> + char buf[256], llbuff[22];
> + my_snprintf(buf, sizeof(buf), messages[SLAVE_RECON_MSG_FAILED],
> + IO_RPL_LOG_NAME, llstr(mi->master_log_pos, llbuff));
> + /*
> + Raise a warining during registering on master/requesting dump.
> + Log a message reading event.
> + */
> + if (messages[SLAVE_RECON_MSG_COMMAND][0])
> + {
> + mi->report(WARNING_LEVEL, ER_SLAVE_MASTER_COM_FAILURE,
> + ER(ER_SLAVE_MASTER_COM_FAILURE),
> + messages[SLAVE_RECON_MSG_COMMAND], buf);
> + }
> + else
> + {
> + sql_print_information(buf);
> + }
> + }
> + if (safe_reconnect(thd, mysql, mi, 1) || io_slave_killed(thd, mi))
> + {
> + if (global_system_variables.log_warnings)
> + sql_print_information(messages[SLAVE_RECON_MSG_KILLED_AFTER]);
> + return 1;
> + }
> + return 0;
> +}
> +