3825 Andrei Elkin 2012-03-30
BUG#13893310 checkpoint_group size wrong at recovery after cold restart
The issue with checkpoint_group at MTS recovery is that after the server restart
MTS recovery gaps collecting algorithm initialized the recovery bitmap with the default 512 size
rather than with a correct one with size of not less than of Worker group_executed of the last
slave session.
That is corrected. The max possible size is used in the gaps collecting.
opt_mts_checkpoint_group 's update step is made as 8 (bits).
Some refactoring is done as well.
Also MTS recovery gaps collecting is deployed on a common to
START-SLAVE and --skip-start-slave=0 execution path.
@ mysql-test/suite/rpl/r/rpl_mts_debug.result
results updated.
@ mysql-test/suite/rpl/t/rpl_mts_debug.test
cleanup: making the test to looking uniformally (@save).
@ sql/rpl_info_dummy.cc
Rpl handler interface is extended with reset_info().
@ sql/rpl_info_dummy.h
Rpl handler interface is extended with reset_info().
@ sql/rpl_info_factory.cc
submerging worker->init_info() into create_worker().
Signature of both changed to propagate more of the caller context such
as recovery mode.
@ sql/rpl_info_factory.h
signature is changed.
@ sql/rpl_info_file.cc
Rpl handler interface is extended with reset_info().
@ sql/rpl_info_file.h
Rpl handler interface is extended with reset_info().
@ sql/rpl_info_handler.h
Rpl handler interface is extended with reset_info().
@ sql/rpl_info_table.cc
Rpl handler interface is extended with reset_info().
@ sql/rpl_info_table.h
Rpl handler interface is extended with reset_info().
@ sql/rpl_rli.cc
recovery_groups bitmap life time is defined depending whether the previous parallel slave
session left gaps. It is initialized in mts_recovery_group() if that's the case;
making mts_recovery_groups() to be called in a common for START-SLAVE and init_slave()
execution path;
Relay_log_info::reset_workers_recovered() is added to be called at the end of
mts-recovery execution phase (aka gaps filling).
@ sql/rpl_rli.h
a new flag is added to indicate recovery_groups is initialized;
reset_workers_recovered() is added.
@ sql/rpl_rli_pdb.cc
Relocating bitmap_init() for Worker::group_* bitmaps into
Worker::init_info();
Two possible value for the size depends on the last argument of
create_worker();
In the gaps collecting phase of recovery the size must be set to the max
possible value;
Slave_worker::reset_info() is added for RLI::reset_workers_recovered().
@ sql/rpl_rli_pdb.h
Worker::init_info() signature accepts a new argument;
reset_info() to the set of methods is added.
@ sql/rpl_slave.cc
making mts_recovery_groups() to be called in a common for START-SLAVE and init_slave()
execution path, notice Change-Master forces to have it in a separate invocation point
in RLI::init_info();
changing mts_recovery_groups() to corresponds to error code convention for func:s
called by its caller;
initialization of rli->recovery_groups bitmap to the max size of the Worker group_executed;
a little refinement of asserts and logics in mts_recovery_groups() is done;
fixing too early (so incorrect) updating for rli->recovery_parallel_workers in the error branch
of slave_start_workers();
Cleanup for rli->recovery_groups at the end of the Coordinator (SQL) thread to complete
its life time;
remove_workers() is replaced by Rpl_info_factory::reset_workers();
deploying RLI::reset_workers_recovered() at two points corresponding to
the end of mts-recovery.
@ sql/rpl_slave.h
changing signature for mts_recovery_groups().
@ sql/share/errmsg-utf8.txt
A new error message is added to correspond to an error branch of impossible MTS recovery
at time of START SLAVE.
@ sql/sys_vars.cc
making mts_checkpoint_group to vary by 8 bits to correspond
to the Worker info table schema that contains the number of bytes (not bits)
field.
modified:
mysql-test/suite/rpl/r/rpl_mts_debug.result
mysql-test/suite/rpl/t/rpl_mts_debug.test
sql/rpl_info_dummy.cc
sql/rpl_info_dummy.h
sql/rpl_info_factory.cc
sql/rpl_info_factory.h
sql/rpl_info_file.cc
sql/rpl_info_file.h
sql/rpl_info_handler.h
sql/rpl_info_table.cc
sql/rpl_info_table.h
sql/rpl_rli.cc
sql/rpl_rli.h
sql/rpl_rli_pdb.cc
sql/rpl_rli_pdb.h
sql/rpl_slave.cc
sql/rpl_slave.h
sql/share/errmsg-utf8.txt
sql/sys_vars.cc
3824 Joerg Bruehe 2012-03-30 [merge]
Null upmerge of a version number change from 5.5 to 5.6
=== modified file 'mysql-test/suite/rpl/r/rpl_mts_debug.result'
--- a/mysql-test/suite/rpl/r/rpl_mts_debug.result 2012-03-23 20:11:19 +0000
+++ b/mysql-test/suite/rpl/r/rpl_mts_debug.result 2012-03-30 11:15:17 +0000
@@ -41,6 +41,9 @@ include/stop_slave.inc
call mtr.add_suppression("option 'slave_checkpoint_group': unsigned value 524281 adjusted to 524280");
call mtr.add_suppression("Failed during slave worker thread create");
call mtr.add_suppression("Slave SQL: Failed during slave workers initialization, Error_code: 1593");
+call mtr.add_suppression("Mismatch between the number of bytes configured to store checkpoint information and the previously stored information");
+set @save.slave_checkpoint_group= @@global.slave_checkpoint_group;
+set @save.slave_parallel_workers= @@global.slave_parallel_workers;
SET GLOBAL slave_parallel_workers= 2;
SET GLOBAL slave_checkpoint_group=524281;
Warnings:
@@ -54,8 +57,8 @@ SET GLOBAL debug= "d,inject_init_worker_
START SLAVE SQL_THREAD;
include/wait_for_slave_sql_error.inc [errno=1593]
SET GLOBAL debug="";
-SET GLOBAL slave_checkpoint_group= 512;
-SET GLOBAL slave_parallel_workers= 0;
+set @@global.slave_checkpoint_group= @save.slave_checkpoint_group;
+set @@global.slave_parallel_workers= @save.slave_parallel_workers;
include/start_slave.inc
include/rpl_reset.inc
include/rpl_end.inc
=== modified file 'mysql-test/suite/rpl/t/rpl_mts_debug.test'
--- a/mysql-test/suite/rpl/t/rpl_mts_debug.test 2012-03-05 16:35:12 +0000
+++ b/mysql-test/suite/rpl/t/rpl_mts_debug.test 2012-03-30 11:15:17 +0000
@@ -106,9 +106,10 @@ source include/start_slave.inc;
call mtr.add_suppression("option 'slave_checkpoint_group': unsigned value 524281 adjusted to 524280");
call mtr.add_suppression("Failed during slave worker thread create");
call mtr.add_suppression("Slave SQL: Failed during slave workers initialization, Error_code: 1593");
+call mtr.add_suppression("Mismatch between the number of bytes configured to store checkpoint information and the previously stored information");
---let $saved_slave_checkpoint_group= `SELECT @@global.slave_checkpoint_group`
---let $saved_slave_parallel_workers= `SELECT @@global.slave_parallel_workers`
+set @save.slave_checkpoint_group= @@global.slave_checkpoint_group;
+set @save.slave_parallel_workers= @@global.slave_parallel_workers;
SET GLOBAL slave_parallel_workers= 2;
SET GLOBAL slave_checkpoint_group=524281;
@@ -140,8 +141,8 @@ START SLAVE SQL_THREAD;
--eval SET GLOBAL debug="$saved_debug"
## clean up
---eval SET GLOBAL slave_checkpoint_group= $saved_slave_checkpoint_group
---eval SET GLOBAL slave_parallel_workers= $saved_slave_parallel_workers
+set @@global.slave_checkpoint_group= @save.slave_checkpoint_group;
+set @@global.slave_parallel_workers= @save.slave_parallel_workers;
--source include/start_slave.inc
--source include/rpl_reset.inc
=== modified file 'sql/rpl_info_dummy.cc'
--- a/sql/rpl_info_dummy.cc 2011-10-13 14:01:50 +0000
+++ b/sql/rpl_info_dummy.cc 2012-03-30 11:15:17 +0000
@@ -64,6 +64,12 @@ void Rpl_info_dummy::do_end_info(const u
return;
}
+int Rpl_info_dummy::do_reset_info()
+{
+ if (abort) DBUG_ASSERT(0);
+ return 0;
+}
+
int Rpl_info_dummy::do_remove_info(const ulong *uidx __attribute__((unused)),
const uint nidx __attribute__((unused)))
{
=== modified file 'sql/rpl_info_dummy.h'
--- a/sql/rpl_info_dummy.h 2011-10-13 14:01:50 +0000
+++ b/sql/rpl_info_dummy.h 2012-03-30 11:15:17 +0000
@@ -40,6 +40,7 @@ private:
int do_flush_info(const ulong *uidx, const uint nidx,
const bool force);
int do_remove_info(const ulong *uidx, const uint nidx);
+ int do_reset_info();
int do_prepare_info_for_read(const uint nidx);
int do_prepare_info_for_write(const uint nidx);
=== modified file 'sql/rpl_info_factory.cc'
--- a/sql/rpl_info_factory.cc 2012-03-16 15:04:13 +0000
+++ b/sql/rpl_info_factory.cc 2012-03-30 11:15:17 +0000
@@ -277,8 +277,49 @@ err:
DBUG_RETURN(TRUE);
}
+bool Rpl_info_factory::reset_workers(Relay_log_info *rli, bool cold)
+{
+ Rpl_info_handler *handler_file= NULL;
+ Rpl_info_handler *handler_table= NULL;
+ char search_fname[FN_REFLEN];
+ bool error= true;
+
+ DBUG_ENTER("Rpl_info_factory::create_workers");
+
+ if (rli->recovery_parallel_workers == 0)
+ DBUG_RETURN(0);
+
+ char *pos= strmov(search_fname, relay_log_info_file);
+ strmov(pos, ".");
+
+ if (!(handler_file= new Rpl_info_file(Slave_worker::get_number_worker_fields(),
+ search_fname)))
+ goto err;
+
+ if (!(handler_table= new Rpl_info_table(Slave_worker::get_number_worker_fields() + 2,
+ MYSQL_SCHEMA_NAME.str, WORKER_INFO_NAME.str)))
+ goto err;
+
+ if (handler_file->reset_info())
+ goto err;
+
+ if (handler_table->reset_info())
+ goto err;
+
+ error= false;
+
+err:
+ if (cold)
+ rli->recovery_parallel_workers= 0; // no mts_recovery_groups() next time
+
+ delete handler_file;
+ delete handler_table;
+ DBUG_RETURN(error);
+}
+
Slave_worker *Rpl_info_factory::create_worker(uint worker_option, uint worker_id,
- Relay_log_info *rli)
+ Relay_log_info *rli,
+ bool gaps_collecting)
{
char info_fname[FN_REFLEN];
char info_name[FN_REFLEN];
@@ -326,6 +367,12 @@ Slave_worker *Rpl_info_factory::create_w
if (decide_repository(worker, worker_option, &handler_src, &handler_dest,
&msg))
goto err;
+
+ if (worker->init_info(gaps_collecting))
+ {
+ worker->end_info();
+ goto err;
+ }
DBUG_RETURN(worker);
=== modified file 'sql/rpl_info_factory.h'
--- a/sql/rpl_info_factory.h 2012-01-11 12:23:17 +0000
+++ b/sql/rpl_info_factory.h 2012-03-30 11:15:17 +0000
@@ -44,7 +44,8 @@ class Rpl_info_factory
static bool change_rli_repository(Relay_log_info *mi, const uint mi_option,
const char **msg);
static Slave_worker *create_worker(uint rli_option, uint worker_id,
- Relay_log_info *rli);
+ Relay_log_info *rli, bool gaps_collecting);
+ static bool reset_workers(Relay_log_info *rli, bool cold);
private:
static bool decide_repository(Rpl_info *info,
uint option,
=== modified file 'sql/rpl_info_file.cc'
--- a/sql/rpl_info_file.cc 2012-01-06 20:00:48 +0000
+++ b/sql/rpl_info_file.cc 2012-03-30 11:15:17 +0000
@@ -192,6 +192,37 @@ int Rpl_info_file::do_remove_info(const
DBUG_RETURN(error);
}
+int Rpl_info_file::do_reset_info()
+{
+ uint i= 0;
+ struct st_my_dir *dir_info= NULL;
+ struct fileinfo *file_info= NULL;
+ char dir_name[FN_REFLEN];
+ size_t dir_size= 0;
+ char* file_name= NULL;
+ size_t file_size= 0;
+ int error= FALSE;
+
+ DBUG_ENTER("Rpl_info_file::do_reset_info");
+
+ file_name= info_fname + dirname_part(dir_name, info_fname, &dir_size);
+ file_size= strlen(file_name);
+
+ if (!(dir_info= my_dir(dir_name, MYF(MY_DONT_SORT))))
+ DBUG_RETURN(TRUE);
+
+ file_info= dir_info->dir_entry;
+ for (i= dir_info->number_off_files ; i-- ; file_info++)
+ {
+ if (!strncmp(file_info->name, file_name, file_size) &&
+ my_delete(file_info->name, MYF(MY_WME)))
+ error= TRUE;
+ }
+ my_dirend(dir_info);
+
+ DBUG_RETURN(error);
+}
+
bool Rpl_info_file::do_set_info(const int pos, const char *value)
{
return (my_b_printf(&info_file, "%s\n", value) > (size_t) 0 ?
=== modified file 'sql/rpl_info_file.h'
--- a/sql/rpl_info_file.h 2011-10-13 14:01:50 +0000
+++ b/sql/rpl_info_file.h 2012-03-30 11:15:17 +0000
@@ -50,6 +50,7 @@ private:
int do_flush_info(const ulong *uidx, const uint nidx,
const bool force);
int do_remove_info(const ulong *uidx, const uint nidx);
+ int do_reset_info();
int do_prepare_info_for_read(const uint nidx);
int do_prepare_info_for_write(const uint nidx);
=== modified file 'sql/rpl_info_handler.h'
--- a/sql/rpl_info_handler.h 2011-10-13 14:01:50 +0000
+++ b/sql/rpl_info_handler.h 2012-03-30 11:15:17 +0000
@@ -89,7 +89,7 @@ public:
}
/**
- Deletes any information in the repository.
+ Deletes a specific information in the repository.
@param[in] uidx Array of fields that identifies an object
@param[in] nidx Number of fields in the array
@@ -103,6 +103,17 @@ public:
}
/**
+ Deletes all information in the repository.
+
+ @retval FALSE No error
+ @retval TRUE Failure
+ */
+ int reset_info()
+ {
+ return do_reset_info();
+ }
+
+ /**
Closes access to the repository.
@param[in] uidx Array of fields that identifies an object
@@ -346,6 +357,7 @@ private:
virtual int do_flush_info(const ulong *uidx, const uint nidx,
const bool force)= 0;
virtual int do_remove_info(const ulong *uidx, const uint nidx)= 0;
+ virtual int do_reset_info()= 0;
virtual void do_end_info(const ulong *uidx, const uint nidx)= 0;
virtual int do_prepare_info_for_read(const uint nidx)= 0;
virtual int do_prepare_info_for_write(const uint nidx)= 0;
=== modified file 'sql/rpl_info_table.cc'
--- a/sql/rpl_info_table.cc 2012-01-31 15:16:16 +0000
+++ b/sql/rpl_info_table.cc 2012-03-30 11:15:17 +0000
@@ -219,6 +219,50 @@ end:
DBUG_RETURN(error);
}
+int Rpl_info_table::do_reset_info()
+{
+ int error= 1;
+ TABLE *table= NULL;
+ ulong saved_mode;
+ Open_tables_backup backup;
+
+ DBUG_ENTER("Rpl_info_table::do_reset_info");
+
+ THD *thd= access->create_thd();
+
+ saved_mode= thd->variables.sql_mode;
+ tmp_disable_binlog(thd);
+
+ /*
+ Opens and locks the rpl_info table before accessing it.
+ */
+ if (access->open_table(thd, str_schema, str_table,
+ get_number_info(), TL_WRITE,
+ &table, &backup))
+ goto end;
+
+ /*
+ Deletes a row in the rpl_info table.
+ */
+ if ((error= table->file->truncate()))
+ {
+ table->file->print_error(error, MYF(0));
+ goto end;
+ }
+
+ error= 0;
+
+end:
+ /*
+ Unlocks and closes the rpl_info table.
+ */
+ access->close_table(thd, table, &backup, error);
+ reenable_binlog(thd);
+ thd->variables.sql_mode= saved_mode;
+ access->drop_thd(thd);
+ DBUG_RETURN(error);
+}
+
int Rpl_info_table::do_remove_info(const ulong *uidx, const uint nidx)
{
int error= 1;
=== modified file 'sql/rpl_info_table.h'
--- a/sql/rpl_info_table.h 2011-10-13 14:01:50 +0000
+++ b/sql/rpl_info_table.h 2012-03-30 11:15:17 +0000
@@ -65,6 +65,7 @@ private:
int do_flush_info(const ulong *uidx, const uint nidx,
const bool force);
int do_remove_info(const ulong *uidx, const uint nidx);
+ int do_reset_info();
int do_prepare_info_for_read(const uint nidx);
int do_prepare_info_for_write(const uint nidx);
=== modified file 'sql/rpl_rli.cc'
--- a/sql/rpl_rli.cc 2012-03-28 18:01:14 +0000
+++ b/sql/rpl_rli.cc 2012-03-30 11:15:17 +0000
@@ -89,7 +89,8 @@ Relay_log_info::Relay_log_info(bool is_s
rows_query_ev(NULL), last_event_start_time(0),
slave_parallel_workers(0),
recovery_parallel_workers(0), checkpoint_seqno(0),
- checkpoint_group(opt_mts_checkpoint_group), mts_recovery_group_cnt(0),
+ checkpoint_group(opt_mts_checkpoint_group),
+ recovery_groups_inited(false), mts_recovery_group_cnt(0),
mts_recovery_index(0), mts_recovery_group_seen_begin(0),
mts_group_status(MTS_NOT_IN_GROUP), reported_unsafe_warning(false),
sql_delay(0), sql_delay_end(0), m_flags(0), row_stmt_start_timestamp(0),
@@ -108,7 +109,6 @@ Relay_log_info::Relay_log_info(bool is_s
group_master_log_name[0]= 0;
until_log_name[0]= ign_master_log_name_end[0]= 0;
set_timespec_nsec(last_clock, 0);
- bitmap_init(&recovery_groups, NULL, checkpoint_group, FALSE);
memset(&cache_buf, 0, sizeof(cache_buf));
cached_charset_invalidate();
@@ -152,7 +152,6 @@ Relay_log_info::~Relay_log_info()
{
DBUG_ENTER("Relay_log_info::~Relay_log_info");
- bitmap_free(&recovery_groups);
mysql_mutex_destroy(&log_space_lock);
mysql_cond_destroy(&log_space_cond);
mysql_mutex_destroy(&pending_jobs_lock);
@@ -249,6 +248,16 @@ void Relay_log_info::reset_notified_chec
}
}
+void Relay_log_info::reset_workers_recovered()
+{
+ for (uint i= 0; i < workers.elements; i++)
+ {
+ Slave_worker *w= *(Slave_worker **) dynamic_array_ptr(&workers, i);
+ w->reset_info();
+ }
+ recovery_parallel_workers= slave_parallel_workers;
+}
+
static inline int add_relay_log(Relay_log_info* rli,LOG_INFO* linfo)
{
MY_STAT s;
@@ -1531,9 +1540,13 @@ int Relay_log_info::init_info()
if (hot_log)
mysql_mutex_unlock(log_lock);
-
- DBUG_RETURN(recovery_parallel_workers && !is_mts_recovery() ?
- mts_recovery_groups(this, &recovery_groups) : 0);
+ DBUG_RETURN((recovery_parallel_workers && !mi->rli->is_mts_recovery()) ?
+ /*
+ though mts_recovery_groups() is reentrant don't call
+ it again if a previous run was not followed (completed)
+ with the gaps filling.
+ */
+ mts_recovery_groups(this) : 0);
}
cur_log_fd = -1;
=== modified file 'sql/rpl_rli.h'
--- a/sql/rpl_rli.h 2012-03-28 15:24:17 +0000
+++ b/sql/rpl_rli.h 2012-03-30 11:15:17 +0000
@@ -512,6 +512,7 @@ public:
uint checkpoint_seqno; // counter of groups executed after the most recent CP
uint checkpoint_group; // cache for ::opt_mts_checkpoint_group
MY_BITMAP recovery_groups; // bitmap used during recovery
+ bool recovery_groups_inited;
ulong mts_recovery_group_cnt; // number of groups to execute at recovery
ulong mts_recovery_index; // running index of recoverable groups
bool mts_recovery_group_seen_begin;
@@ -623,6 +624,11 @@ public:
*/
void reset_notified_checkpoint(ulong, time_t, bool);
+ /**
+ Called when gaps execution is ended to it is crash-safe
+ to reset the last session info which the method does for each Worker.
+ */
+ void reset_workers_recovered();
/*
* End of MTS section ******************************************************/
=== modified file 'sql/rpl_rli_pdb.cc'
--- a/sql/rpl_rli_pdb.cc 2012-03-05 16:35:12 +0000
+++ b/sql/rpl_rli_pdb.cc 2012-03-30 11:15:17 +0000
@@ -86,8 +86,6 @@ Slave_worker::Slave_worker(Relay_log_inf
checkpoint_master_log_name[0]= 0;
my_init_dynamic_array(&curr_group_exec_parts, sizeof(db_worker_hash_entry*),
SLAVE_INIT_DBS_IN_GROUP, 1);
- bitmap_init(&group_executed, NULL, c_rli->checkpoint_group, FALSE);
- bitmap_init(&group_shifted, NULL, c_rli->checkpoint_group, FALSE);
mysql_mutex_init(key_mutex_slave_parallel_worker, &jobs_lock,
MY_MUTEX_INIT_FAST);
mysql_cond_init(key_cond_slave_parallel_worker, &jobs_cond, NULL);
@@ -96,10 +94,14 @@ Slave_worker::Slave_worker(Relay_log_inf
Slave_worker::~Slave_worker()
{
delete_dynamic(&curr_group_exec_parts);
- bitmap_free(&group_executed);
- bitmap_free(&group_shifted);
mysql_mutex_destroy(&jobs_lock);
mysql_cond_destroy(&jobs_cond);
+ if (inited)
+ {
+ bitmap_free(&group_executed);
+ bitmap_free(&group_shifted);
+ }
+ inited= 0;
}
/**
@@ -119,7 +121,7 @@ int Slave_worker::init_worker(Relay_log_
Slave_job_item empty= {NULL};
c_rli= rli;
- if (init_info() ||
+ if (init_info(false) ||
DBUG_EVALUATE_IF("inject_init_worker_init_info_fault", true, false))
DBUG_RETURN(1);
@@ -153,7 +155,21 @@ int Slave_worker::init_worker(Relay_log_
DBUG_RETURN(0);
}
-int Slave_worker::init_info()
+/**
+ A part of Slave worker iitializer that provides a
+ minimum context for MTS recovery.
+
+ @param gaps_collecting clarifies what state the caller
+ executes this method from. When it's @c true
+ that is @c mts_recovery_groups() and Worker should
+ restore the last time info.
+ Whet it's @c false Worker should not read the last
+ session time stale info. It will be reset once
+ recovery execution gets done.
+
+ @return 0 on success, non-zero for a failure
+*/
+int Slave_worker::init_info(bool gaps_collecting)
{
enum_return_check return_check= ERROR_CHECKING_REPOSITORY;
@@ -163,23 +179,35 @@ int Slave_worker::init_info()
DBUG_RETURN(0);
/*
+ Worker bitmap size depends on recovery mode.
+ If it is gaps collecting the bitmaps must be capable to accept
+ up to MTS_MAX_BITS_IN_GROUP of bits.
+ */
+ size_t num_bits= gaps_collecting ?
+ MTS_MAX_BITS_IN_GROUP : c_rli->checkpoint_group;
+ /*
This checks if the repository was created before and thus there
will be values to be read. Please, do not move this call after
the handler->init_info().
*/
return_check= check_info();
- if (return_check == ERROR_CHECKING_REPOSITORY)
+ if (return_check == ERROR_CHECKING_REPOSITORY ||
+ (return_check == REPOSITORY_DOES_NOT_EXIST && gaps_collecting))
goto err;
if (handler->init_info(uidx, nidx))
goto err;
- if (return_check == REPOSITORY_EXISTS && read_info(handler))
+ bitmap_init(&group_executed, NULL, num_bits, FALSE);
+ bitmap_init(&group_shifted, NULL, num_bits, FALSE);
+
+ if (gaps_collecting && read_info(handler))
+ {
+ bitmap_free(&group_executed);
+ bitmap_free(&group_shifted);
goto err;
-
+ }
inited= 1;
- if (flush_info(TRUE))
- goto err;
DBUG_RETURN(0);
@@ -269,6 +297,9 @@ bool Slave_worker::read_info(Rpl_info_ha
from->get_info((ulong *) &temp_checkpoint_seqno,
(ulong) 0) ||
from->get_info(&nbytes, (ulong) 0) ||
+
+ (DBUG_ASSERT(nbytes <= (group_executed.n_bits + 7) / 8), 0) ||
+
from->get_info(buffer, (size_t) nbytes,
(uchar *) 0))
DBUG_RETURN(TRUE);
@@ -289,6 +320,8 @@ bool Slave_worker::write_info(Rpl_info_h
ulong nbytes= (ulong) no_bytes_in_map(&group_executed);
uchar *buffer= (uchar*) group_executed.bitmap;
+ DBUG_ASSERT(nbytes <= c_rli->checkpoint_group / 8);
+
if (to->prepare_info_for_write(nidx) ||
to->set_info(group_relay_log_name) ||
to->set_info((ulong) group_relay_log_pos) ||
@@ -306,6 +339,20 @@ bool Slave_worker::write_info(Rpl_info_h
DBUG_RETURN(FALSE);
}
+/**
+ Crean up valueable for gaps collecting info.
+ This worker won't contribute to recovery bitmap at future
+ slave restart (see @c mts_recovery_groups).
+*/
+bool Slave_worker::reset_info()
+{
+ set_group_master_log_name("");
+ set_group_master_log_pos(0);
+ flush_info(true);
+
+ return FALSE;
+}
+
size_t Slave_worker::get_number_worker_fields()
{
return sizeof(info_slave_worker_fields)/sizeof(info_slave_worker_fields[0]);
=== modified file 'sql/rpl_rli_pdb.h'
--- a/sql/rpl_rli_pdb.h 2011-11-21 17:46:02 +0000
+++ b/sql/rpl_rli_pdb.h 2012-03-30 11:15:17 +0000
@@ -345,12 +345,13 @@ public:
en_running_state volatile running_status;
int init_worker(Relay_log_info*, ulong);
- int init_info();
+ int init_info(bool);
void end_info();
int flush_info(bool force= FALSE);
- size_t get_number_worker_fields();
+ static size_t get_number_worker_fields();
void slave_worker_ends_group(Log_event*, int);
bool commit_positions(Log_event *evt, Slave_job_group *ptr_g, bool force);
+ bool reset_info();
protected:
=== modified file 'sql/rpl_slave.cc'
--- a/sql/rpl_slave.cc 2012-03-28 15:24:17 +0000
+++ b/sql/rpl_slave.cc 2012-03-30 11:15:17 +0000
@@ -194,7 +194,6 @@ static int terminate_slave_thread(THD *t
static bool check_io_slave_killed(THD *thd, Master_info *mi, const char *info);
int slave_worker_exec_job(Slave_worker * w, Relay_log_info *rli);
static int mts_event_coord_cmp(LOG_POS_COORD *id1, LOG_POS_COORD *id2);
-static bool remove_workers(Relay_log_info* rli);
/*
Find out which replications threads are running
@@ -398,7 +397,7 @@ int init_recovery(Master_info* mi, const
We need to improve this. /Alfranio.
*/
- error= mts_recovery_groups(rli, &rli->recovery_groups);
+ error= mts_recovery_groups(rli);
if (rli->mts_recovery_group_cnt)
{
error= 1;
@@ -513,7 +512,7 @@ int remove_info(Master_info* mi)
mi->end_info();
mi->rli->end_info();
- if (mi->remove_info() || remove_workers(mi->rli) ||
+ if (mi->remove_info() || Rpl_info_factory::reset_workers(mi->rli, true) ||
mi->rli->remove_info())
goto err;
@@ -559,41 +558,6 @@ int flush_master_info(Master_info* mi, b
DBUG_RETURN (err);
}
-/*
- Remove worker's entries from the repositories.
-*/
-bool remove_workers(Relay_log_info* rli)
-{
- Slave_worker *worker= NULL;
-
- for (uint id= 0; id < rli->recovery_parallel_workers; id++)
- {
- if (!(worker=
- Rpl_info_factory::create_worker(opt_rli_repository_id, id, rli)))
- goto err;
-
- if (worker->init_info())
- {
- delete worker;
- goto err;
- }
-
- worker->end_info();
-
- if (worker->remove_info())
- {
- delete worker;
- goto err;
- }
-
- delete worker;
- }
- return false;
-
-err:
- return true;
-}
-
/**
Convert slave skip errors bitmap into a printable string.
*/
@@ -1102,14 +1066,21 @@ int start_slave_threads(bool need_slave_
mi);
if (!error && (thread_mask & SLAVE_SQL))
{
- error= start_slave_thread(
+ /*
+ MTS-recovery gaps gathering is placed onto common execution path
+ for either START-SLAVE and --skip-start-slave= 0
+ */
+ if (mi->rli->recovery_parallel_workers != 0 && !mi->rli->is_mts_recovery())
+ error= mts_recovery_groups(mi->rli);
+ if (!error)
+ error= start_slave_thread(
#ifdef HAVE_PSI_INTERFACE
- key_thread_slave_sql,
+ key_thread_slave_sql,
#endif
- handle_slave_sql, lock_sql, lock_cond_sql,
- cond_sql,
- &mi->rli->slave_running, &mi->rli->slave_run_id,
- mi);
+ handle_slave_sql, lock_sql, lock_cond_sql,
+ cond_sql,
+ &mi->rli->slave_running, &mi->rli->slave_run_id,
+ mi);
if (error)
terminate_slave_threads(mi, thread_mask & SLAVE_IO, !need_slave_mutex);
}
@@ -3237,13 +3208,16 @@ int apply_event_and_update_pos(Log_event
{
reason= ev->shall_skip(rli);
}
-
- DBUG_PRINT("mts", ("Mts is recovering %d, number of bits set %d, "
- "bitmap is set %d, index %lu.\n",
- rli->is_mts_recovery(), bitmap_bits_set(&rli->recovery_groups),
- bitmap_is_set(&rli->recovery_groups, rli->mts_recovery_index),
- rli->mts_recovery_index));
-
+ if (rli->is_mts_recovery())
+ {
+ DBUG_PRINT("mts", ("Mts is recovering %d, number of bits set %d, "
+ "bitmap is set %d, index %lu.\n",
+ rli->is_mts_recovery(),
+ bitmap_bits_set(&rli->recovery_groups),
+ bitmap_is_set(&rli->recovery_groups,
+ rli->mts_recovery_index),
+ rli->mts_recovery_index));
+ }
if (reason == Log_event::EVENT_SKIP_COUNT)
{
sql_slave_skip_counter= --rli->slave_skip_counter;
@@ -3412,12 +3386,20 @@ int apply_event_and_update_pos(Log_event
rli->mts_recovery_index++;
if (--rli->mts_recovery_group_cnt == 0)
{
- rli->recovery_parallel_workers= rli->slave_parallel_workers;
rli->mts_recovery_index= 0;
+ sql_print_information("Slave: MTS Recovery has completed at "
+ "relay log %s, position %llu "
+ "master log %s, position %llu.",
+ rli->get_group_relay_log_name(),
+ rli->get_group_relay_log_pos(),
+ rli->get_group_master_log_name(),
+ rli->get_group_master_log_pos());
+ // reset the Worker tables to remove last slave session time info
+ rli->reset_workers_recovered();
}
rli->mts_recovery_group_seen_begin= false;
-
- error= rli->flush_info(TRUE);
+ if (!error)
+ error= rli->flush_info(TRUE);
}
}
@@ -4353,7 +4335,7 @@ int mts_event_coord_cmp(LOG_POS_COORD *i
(poscmp < 0 ? -1 : (poscmp > 0 ? 1 : 0))));
}
-bool mts_recovery_groups(Relay_log_info *rli, MY_BITMAP *groups)
+int mts_recovery_groups(Relay_log_info *rli)
{
Log_event *ev= NULL;
const char *errmsg= NULL;
@@ -4367,10 +4349,11 @@ bool mts_recovery_groups(Relay_log_info
File file;
LOG_INFO linfo;
my_off_t offset= 0;
+ MY_BITMAP *groups= &rli->recovery_groups;
DBUG_ENTER("mts_recovery_groups");
- DBUG_ASSERT(rli->recovery_parallel_workers > 0);
+ DBUG_ASSERT(rli->slave_parallel_workers == 0);
/*
Save relay log position to compare with worker's position.
*/
@@ -4390,12 +4373,13 @@ bool mts_recovery_groups(Relay_log_info
above_lwm_jobs in asc ordered by the master binlog coordinates.
*/
my_init_dynamic_array(&above_lwm_jobs, sizeof(Slave_job_group),
- rli->recovery_parallel_workers, rli->recovery_parallel_workers);
+ rli->recovery_parallel_workers,
+ rli->recovery_parallel_workers);
for (uint id= 0; id < rli->recovery_parallel_workers; id++)
{
Slave_worker *worker=
- Rpl_info_factory::create_worker(opt_rli_repository_id, id, rli);
+ Rpl_info_factory::create_worker(opt_rli_repository_id, id, rli, true);
if (!worker)
{
@@ -4403,7 +4387,6 @@ bool mts_recovery_groups(Relay_log_info
goto err;
}
- worker->init_info();
LOG_POS_COORD w_last= { const_cast<char*>(worker->get_group_master_log_name()),
worker->get_group_master_log_pos() };
if (mts_event_coord_cmp(&w_last, &cp) > 0)
@@ -4447,8 +4430,12 @@ bool mts_recovery_groups(Relay_log_info
while(!eof);
continue;
*/
-
- bitmap_clear_all(groups);
+ if (above_lwm_jobs.elements != 0)
+ {
+ bitmap_init(groups, NULL, MTS_MAX_BITS_IN_GROUP, FALSE);
+ rli->recovery_groups_inited= true;
+ bitmap_clear_all(groups);
+ }
rli->mts_recovery_group_cnt= 0;
for (uint it_job= 0; it_job < above_lwm_jobs.elements; it_job++)
{
@@ -4458,14 +4445,14 @@ bool mts_recovery_groups(Relay_log_info
w->get_group_master_log_pos() };
bool checksum_detected= FALSE;
- sql_print_information("Recovery relay log info based on Worker-Id %lu, "
- "group_relay_log_name %s, group_relay_log_pos %lu "
- "group_master_log_name %s, group_master_log_pos %lu",
+ sql_print_information("Slave: MTS group recovery relay log info based on Worker-Id %lu, "
+ "group_relay_log_name %s, group_relay_log_pos %llu "
+ "group_master_log_name %s, group_master_log_pos %llu",
w->id,
w->get_group_relay_log_name(),
- (ulong) w->get_group_relay_log_pos(),
+ w->get_group_relay_log_pos(),
w->get_group_master_log_name(),
- (ulong) w->get_group_master_log_pos());
+ w->get_group_master_log_pos());
recovery_group_cnt= 0;
not_reached_commit= true;
@@ -4548,10 +4535,10 @@ bool mts_recovery_groups(Relay_log_info
flag_group_seen_begin= false;
recovery_group_cnt++;
- sql_print_information("Group Recoverying relay log info "
- "group_master_log_name %s, event_master_log_pos %llu.",
- rli->get_group_master_log_name(),
- ev->log_pos);
+ sql_print_information("Slave: MTS group recovery relay log info "
+ "group_master_log_name %s, "
+ "event_master_log_pos %llu.",
+ rli->get_group_master_log_name(), ev->log_pos);
if ((ret= mts_event_coord_cmp(&ev_coord, &w_last)) == 0)
{
#ifndef DBUG_OFF
@@ -4600,8 +4587,7 @@ bool mts_recovery_groups(Relay_log_info
recovery_group_cnt : rli->mts_recovery_group_cnt);
}
- DBUG_ASSERT(rli->mts_recovery_group_cnt < groups->n_bits);
- DBUG_ASSERT(rli->mts_recovery_group_cnt < rli->checkpoint_group);
+ DBUG_ASSERT(rli->mts_recovery_group_cnt <= groups->n_bits);
err:
@@ -4614,7 +4600,7 @@ err:
delete_dynamic(&above_lwm_jobs);
- DBUG_RETURN(error);
+ DBUG_RETURN(error ? ER_MTS_RECOVERY_FAILURE : 0);
}
/**
@@ -4775,7 +4761,7 @@ int slave_start_single_worker(Relay_log_
Slave_worker *w= NULL;
if (!(w=
- Rpl_info_factory::create_worker(opt_rli_repository_id, i, rli)))
+ Rpl_info_factory::create_worker(opt_rli_repository_id, i, rli, false)))
{
sql_print_error("Failed during slave worker thread create");
error= 1;
@@ -4900,12 +4886,14 @@ int slave_start_workers(Relay_log_info *
goto err;
}
-err:
rli->slave_parallel_workers= rli->workers.elements;
// end recovery right now if mts_recovery_groups() did not find any gaps
if (rli->mts_recovery_group_cnt == 0)
- rli->recovery_parallel_workers= rli->slave_parallel_workers;
+ {
+ rli->reset_workers_recovered();
+ }
+err:
return error;
}
@@ -5360,6 +5348,8 @@ llstr(rli->get_group_master_log_pos(), l
err:
slave_stop_workers(rli, &mts_inited); // stopping worker pool
+ if (rli->recovery_groups_inited)
+ bitmap_free(&rli->recovery_groups);
/*
Some events set some playgrounds, which won't be cleared because thread
@@ -8094,7 +8084,7 @@ err:
if (ret == FALSE)
{
if (mts_remove_workers)
- remove_workers(mi->rli);
+ Rpl_info_factory::reset_workers(mi->rli, true);
my_ok(thd);
}
DBUG_RETURN(ret);
=== modified file 'sql/rpl_slave.h'
--- a/sql/rpl_slave.h 2012-03-26 19:30:00 +0000
+++ b/sql/rpl_slave.h 2012-03-30 11:15:17 +0000
@@ -250,7 +250,7 @@ extern char *master_ssl_cipher, *master_
extern I_List<THD> threads;
-bool mts_recovery_groups(Relay_log_info *rli, MY_BITMAP *groups);
+int mts_recovery_groups(Relay_log_info *rli);
bool mts_checkpoint_routine(Relay_log_info *rli, ulonglong period,
bool force, bool locked);
#endif /* HAVE_REPLICATION */
=== modified file 'sql/share/errmsg-utf8.txt'
--- a/sql/share/errmsg-utf8.txt 2012-03-28 15:24:17 +0000
+++ b/sql/share/errmsg-utf8.txt 2012-03-30 11:15:17 +0000
@@ -6742,6 +6742,9 @@ ER_UNKNOWN_ALTER_LOCK
ER_MTS_CHANGE_MASTER_CANT_RUN_WITH_GAPS
eng "CHANGE MASTER cannot be executed when the slave was stopped with an error or killed in MTS mode. Consider using RESET SLAVE or START SLAVE UNTIL."
+ER_MTS_RECOVERY_FAILURE
+ eng "Cannot recover after SLAVE errored out in parallel execution mode. Additional error messages can be found in the MySQL error log."
+
#
# End of 5.6 error messages.
#
=== modified file 'sql/sys_vars.cc'
--- a/sql/sys_vars.cc 2012-03-28 15:24:17 +0000
+++ b/sql/sys_vars.cc 2012-03-30 11:15:17 +0000
@@ -3772,9 +3772,9 @@ static Sys_var_uint Sys_checkpoint_mts_g
"before a checkpoint operation is called to update progress status.",
GLOBAL_VAR(opt_mts_checkpoint_group), CMD_LINE(REQUIRED_ARG),
#ifndef DBUG_OFF
- VALID_RANGE(1, MTS_MAX_BITS_IN_GROUP), DEFAULT(512), BLOCK_SIZE(1));
+ VALID_RANGE(1, MTS_MAX_BITS_IN_GROUP), DEFAULT(512), BLOCK_SIZE(8));
#else
- VALID_RANGE(512, MTS_MAX_BITS_IN_GROUP), DEFAULT(512), BLOCK_SIZE(1));
+ VALID_RANGE(512, MTS_MAX_BITS_IN_GROUP), DEFAULT(512), BLOCK_SIZE(8));
#endif /* DBUG_OFF */
#endif /* HAVE_REPLICATION */
No bundle (reason: useless for push emails).
| Thread |
|---|
| • bzr push into mysql-trunk branch (andrei.elkin:3824 to 3825) Bug#13893310 | Andrei Elkin | 31 Mar |