List:General Discussion« Previous MessageNext Message »
From:Todd Lyons Date:July 26 2011 2:18pm
Subject:Chronicle of fixing broken replication, and a question
View as plain text  
Yeah it happens to all of us.  Some master->slave replication system
breaks and goes unnoticed.  In my case, I had a monitoring script, but
it was being called with a --verbose flag, and internally that
"verbose" was being mapped to "debug", and the routine to notify
sysadmins of broken replication doesn't fire if debug is set, so
warning emails were not being sent out.  You might ask, why is this a
problem?  Just restart replication, skipping problem statements.
Well, the big issue was that I have automatic purge of binlogs set.
Over the course of a few days, the purging of binlogs surpassed where
the replication stopped.  Damn...

This is a mysql 5.0.77 (CentOS 5.x) installation with maatkit and
xtrabackup utilities available.  So there's really only one thing to
do.  xtrabackup to the rescue!

1.  I did a full copy of the running master database server using
xtrabackup to a backup server via nfs.  It took 2 hours, of which the
last 15 minutes did a write lock of the entire server as it copied
over the *.frm files and the few myisam tables.  This was the most
troublesome part as it was visible to both users and website owners
2.  On the slave, I did the apply-logs step (via nfs to the backup
server), taking another 2 hours.
3.  I renamed the old directory on the slave server and created a new
mysql directory, owned by the correct user.
4.  I copied the backup data from the backup server to this new directory.
5.  I deleted the iblogfile* files so that mysql would create them on
startup based on the sizes set in the my.cnf.
6.  I started it up.  It still wanted to do a crash recovery since the
ibdata and iblog files didn't match, but that was expected.  After a
few minutes of scanning (we use innodb_table_per_file, so it had a lot
of tablespace spread out over lots of files), it started up and was
accepting connections.  Looking good so far.
7.  I did 'cat xtrabackup_slave_info'.  For some reason it had this:

*UH OH*  *CRAP*   *$#^@%&!*

** See note below

8.  Calm down, think about things a bit.  Obviously I didn't
compensate for the fact that I was doing the original backup on the
master, but the apply-log on the slave (and --slave-info option
doesn't have the data available on the
9.  I know about what time I started the backup.  I know that this is
a database server dedicated to ecommerce shopping carts.  As a result,
there is a steady flow in of data as customers and web_robots access
the shopping carts.
10.  Knowing the data in #9, I look at the statements being replicated
with mysqlbinlog on the master with --start-datetime="2011-07-25
12:00:00".  A quick query of the data in that database on the
newly-recovered-but-not-yet-replicating slave found me a last
timestamp of 13:16.49.
11.  I adjust my timestamp and trace data from that point until I find
one that didn't get inserted into the database by the replication
process (meaning, this was when I started the backup).  I determined
that the timestamp of data not yet in the database was 13:16:52.
12.  Comments in mysqlbinlog output at that timestamp indicated it was
at position 487727085.
13.  I repeated the mysqlbinlog using --start-position 487727085 and
verified that it was the same output as #12.
14.  I did a change master statement on the slave, setting the
position to 487727085, I started replication, and it caught up after
about 20 minutes.
15.  If #14 would have had issues, I did have mk-slave-restart
available to force it to skip problems and restart replication.

So everything is alright.  There are no issues.  But there is a question.

Is there a better way to do this?  Is there something automated that
will "compare THIS binlog to THAT server and tell me where the data
was no longer being inserted" ?  I'm looking to see how others deal
with a complete restart of a running system like this.

** It turns out that xtrabackup gave me confidence of the log position
I had determined.  Running the backup on the master, it also made this
        # cat xtrabackup_binlog_info
        mysqld-bin.000259	487727085
which of course confirmed the log position I determined in step #12.

To be honest, the biggest step in the sequence above is likely #8.
The ability to assess a situation without freaking out (for too long)
is a big part of a sysadmin job IMHO.

Regards...         Todd
If Americans could eliminate sugary beverages, potatoes, white bread,
pasta, white rice and sugary snacks, we would wipe out almost all the
problems we have with weight and diabetes and other metabolic
diseases. -- Dr. Walter Willett, Harvard School of Public Health
Chronicle of fixing broken replication, and a questionTodd Lyons26 Jul
  • Re: Chronicle of fixing broken replication, and a questionReindl Harald26 Jul
    • Re: Chronicle of fixing broken replication, and a questionTodd Lyons26 Jul
      • Re: Chronicle of fixing broken replication, and a questionReindl Harald26 Jul