List:General Discussion« Previous MessageNext Message »
From:Bruce Dembecki Date:September 22 2005 4:02pm
Subject:Re: Circular Replication
View as plain text  
On Sep 21, 2005, at 5:23 AM, Jeff wrote:

> I am interested in how you go about doing a "delayed replication" to
> protect against operator error.  We've already fallen victim to that
> situation here.

The long story short is we use the fact that MySQL has the ability to  
run the SQL thread and the IO thread of replication separately, and  
control them individually. In practice we use cron and a whole bunch  
of scripts to stop the I/O thread (the one reading from the master)  
most of the time, and manage when the SQL thread replicates... eg at  
4:00 cron stops the SQL thread. At 4:01 we start the I/O thread (this  
can read a lot of changes very quickly from the master, so only need  
a short time to catch up with all the changes). At 4:05 we stop the I/ 
O thread. Then we wait a few minutes to give ourselves a buffer...  
then finally at 4:15 we start the SQL thread.... and repeat the cycle  
every two hours.

The upshot is at the small end we are 10 minutes behind (the time  
between we stop I/O at 4:05 and the time when we start SQL at 4:15),  
and at the long end we are 2 hours behind (at 4:07 for example the  
last query that the SQL thread could have executed came from the  
master at 2:05).

Our scripts are a little more complicated to marry into our  
monitoring system without setting off alerts that replication has  
stopped and so on (and of course the machine that runs this speaks to  
many masters using many instances of MySQL, so we need to manage this  
for every instance of MySQL). We also manage things to allow an  
emergency stop by having the scripts do an existence check on a  
specific file, and if the file isn't there don't start any  
replication processes. We then have a stop script which tells the  
instances to stop whatever they are doing and deletes the file. At  
that point replication can't resume until we replace the file  
manually - we tie that emergency script to a TCP port and hey  
presto... in the event of an emergency all someone needs to do is hit  
the right tcp port on the server (telnet to it, hit it with a  
browser, anything that will cause the port to see some activity) and  
all the replication comes to a stop.

Also as part of our 2 hourly cycle we do a lot of binary log flushing  
on the slave and the masters, so if we ever need to roll back we can  
roll back to a specific point in time and only have to deal with  
fixing problems in the logs form that point in time onwards. if an  
operator error gets by before we can stop we can go to yesterdays  
backup and only execute those binary logs from before the incident,  
and then deal with the issue in question.

This process has reduced our downtime in the event of a total  
database corruption from four hours to recover from yesterdays data  
and be missing everything since, to 30 minutes and be only missing  
the data since the last 2 hourly roll over. And it doesn't take long  
to dump the last set of binary logs to a text file, find and fix/ 
remove the corrupting command and apply that whole log into the  
database, effectively giving us almost zero lost data and back online  
in no time (although when clients are screaming even 30 minutes feels  
like an eternity). This is all of course so much better than the four  
hour downtime we had before this system.

And there are side benefits... for example backups are easier to do  
because the data isn't being changed except for a few minutes every 2  
hours. Instead of co-ordinating timing scripts and locking tables and  
doing dumps and so on we can do simple file system duplication of the  
data directories.

Best Regards, Bruce
Circular ReplicationJeff16 Sep
  • Re: Circular ReplicationDan Nelson16 Sep
    • RE: Circular ReplicationJeff16 Sep
      • Re: Circular ReplicationDevananda16 Sep
  • Re: Circular ReplicationStefan Kuhn16 Sep
    • Re: Circular ReplicationRaphaĆ«l 'SurcouF' Bordet27 Oct
      • Re: Circular ReplicationStefan Kuhn27 Oct
        • Re: Circular ReplicationJigal van Hemert27 Oct
          • Re: Circular ReplicationDevananda28 Oct
RE: Circular ReplicationJeff McKeon16 Sep
  • RE: Circular ReplicationJeff16 Sep
    • Re: Circular ReplicationDevananda16 Sep
      • Re: Circular ReplicationSid Lane19 Sep
        • Re: Circular ReplicationAlec.Cawley19 Sep
          • Re: Circular ReplicationSid Lane19 Sep
          • RE: Circular ReplicationJeff20 Sep
            • Re: Circular ReplicationDevananda20 Sep
    • Re: Circular ReplicationBruce Dembecki21 Sep
RE: Circular ReplicationJeff20 Sep
RE: Circular ReplicationJeff21 Sep
  • Re: Circular ReplicationBruce Dembecki22 Sep
    • Re: Circular ReplicationHank26 Sep
      • Re: Circular ReplicationHank26 Sep