List:Replication« Previous MessageNext Message »
From:Michael Bunzel Date:July 22 2008 2:12pm
Subject:Re: Best way to recover from Replication Failure
View as plain text  
As you already guessed, the answer is: it all depends.

Normally you should see the source of the error with SHOW SLAVE STATUS
or at least find it in the error log, but IIRC you had issues with
that. If it's a duplicate key then stopping the replication, deleting
the offending data and resuming the replication does work sometimes.
Of course this does depend on how sure you are that you want to do
this and how much you value your data integrity. You could also
re-initialize the replication from a new or a known-good dump and
replication position. There's nothing wrong with either solution, as
long as you know what you are doing and don't necessarily try this for
the first time on a production server with billions of dollars worth
of company data that hasn't been backed up in months ;-)

About your problem: are you sure that your data dump was actually in
sync with the master log position you used to initiate the
replication? Are you sure that you don't have an application that
accidentally writes to the slave?

As a solution, you could try resetting the slave and starting the
replication anew with either your previous dump and master position or
with a new one or alternatively you could try resetting both the
master and slave. I've performed each of those and they all worked the
way I intended, although I wouldn't recommend completely resetting the
master if this is a production system or if you have multiple slaves.

Here's a step-by-step guide to resetting the slave and starting the
replication with a new dump and master position:

- On the slave, run STOP SLAVE to stop all slave activity, then run
RESET SLAVE to make the slave forget its replication position and
delete its relay log amongst others.

- Make sure that the collation-server and default-collation
configuration options are the same on both the master and the slave.

- On the master, run FLUSH TABLES WITH READ LOCK and leave that
session open. If you close it, the tables will be unlocked and your
data dump and log position will be out of sync.

- Now, while having that session still opened, make a dump of the data
you want to replicate using mysqldump or whatever tool fits your
purpose.

- When it's finished, run SHOW MASTER STATUS, save the output
somewhere, free the read lock with UNLOCK TABLES and just to be safe
close the session.

- Import your dump on the slave and make sure that the data imported
successfully. Then issue a full CHANGE MASTER TO statement using the
values from the statement above.

CHANGE MASTER TO
     MASTER_HOST='master_host_name',
     MASTER_USER='replication_user_name',
     MASTER_PASSWORD='replication_password',
     MASTER_LOG_FILE='recorded_log_file_name',
     MASTER_LOG_POS=recorded_log_position;

I've experienced issues when using only MASTER_LOG_POS and
MASTER_LOG_FILE, so it's probably a good idea to always include
hostname, user and password when you want to make sure that the master
data is changed.

- Start the slave with START SLAVE, then run SHOW SLAVE STATUS to see
whether everything worked as expected.

In case you really want to reset the master too you can run RESET
MASTER before resetting the slave. You might also want to read section
12.6 in the manual before doing that.

Hope that helps,

Michael Bunzel
LintraNex Systems
Thread
Best way to recover from Replication FailureJesse22 Jul
  • Re: Best way to recover from Replication FailureMichael Bunzel22 Jul
  • Re: Best way to recover from Replication FailureJesse22 Jul
    • Re: Best way to recover from Replication FailureMarcus Bointon22 Jul
    • Re: Best way to recover from Replication FailureMichael Bunzel22 Jul
  • Re: Best way to recover from Replication FailureJesse23 Jul