On Sep 21, 2005, at 5:23 AM, Jeff wrote:
> I am interested in how you go about doing a "delayed replication" to
> protect against operator error. We've already fallen victim to that
> situation here.
To make a long story short, we take advantage of the fact that MySQL can
run the SQL thread and the I/O thread of replication separately and
control them individually. In practice we use cron and a whole bunch of
scripts to keep the I/O thread (the one reading from the master) stopped
most of the time, and to manage when the SQL thread replicates. For
example, at 4:00 cron stops the SQL thread. At 4:01 we start the I/O
thread (it can read a lot of changes from the master very quickly, so it
only needs a short time to catch up). At 4:05 we stop the I/O thread.
Then we wait a few minutes to give ourselves a buffer... and finally at
4:15 we start the SQL thread... and we repeat the cycle every two hours.
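
(If it helps to picture it, here's a minimal sketch of the cron side.
The crontab entries, the even-hour schedule and the use of the plain
mysql command-line client are illustrative; our real scripts wrap the
same statements in the monitoring and safety checks described below.)

    # illustrative two-hour cycle, one slave instance
    0  0-22/2 * * *  mysql -e "STOP SLAVE SQL_THREAD"    # freeze applying changes
    1  0-22/2 * * *  mysql -e "START SLAVE IO_THREAD"    # pull changes from the master
    5  0-22/2 * * *  mysql -e "STOP SLAVE IO_THREAD"     # stop pulling
    15 0-22/2 * * *  mysql -e "START SLAVE SQL_THREAD"   # apply what was pulled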
The upshot is that at the short end we are 10 minutes behind (the gap
between stopping the I/O thread at 4:05 and starting the SQL thread at
4:15), and at the long end we are two hours behind (at 4:07, for
example, the last query the SQL thread could have executed came from the
master at 2:05).
Our actual scripts are a little more complicated, so that they marry
into our monitoring system without setting off alerts that replication
has stopped, and so on (and of course the machine that runs this speaks
to many masters using many instances of MySQL, so we have to manage all
of this for every instance).

We also allow for an emergency stop by having the scripts do an
existence check on a specific file: if the file isn't there, they don't
start any replication threads. A separate stop script tells the
instances to stop whatever they are doing and deletes the file. At that
point replication can't resume until we replace the file manually. We
tie that emergency script to a TCP port and hey presto... in the event
of an emergency all someone needs to do is hit the right TCP port on the
server (telnet to it, hit it with a browser, anything that will cause
the port to see some activity) and all the replication comes to a stop.
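
(As a rough sketch of that safety valve, assuming a lock file at
/var/run/replication-ok and a handful of made-up instance ports; the
real thing is per-instance and tied into our monitoring:)

    #!/bin/sh
    # start-guard.sh (hypothetical) -- wrapped around each cron step above;
    # if the "all clear" file is missing, do nothing at all.
    [ -f /var/run/replication-ok ] || exit 0
    exec mysql -e "START SLAVE IO_THREAD"

    #!/bin/sh
    # emergency-stop.sh (hypothetical) -- tied to a TCP port via inetd, so any
    # connection to that port runs it. It halts every instance and deletes the
    # file, which turns the cron jobs into no-ops until the file is restored.
    rm -f /var/run/replication-ok
    for port in 3306 3307 3308; do            # assumed instance ports
        mysql -h 127.0.0.1 -P "$port" -e "STOP SLAVE"
    done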
Also, as part of our two-hour cycle we do a lot of binary log flushing
on the slave and the masters, so if we ever need to roll back we can
roll back to a specific point in time and only have to deal with fixing
problems in the logs from that point onwards. If an operator error gets
through before we can stop, we can go to yesterday's backup, execute
only the binary logs from before the incident, and then deal with the
issue in question.
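
(The flushing itself is just the standard FLUSH LOGS statement run
against the slave and each master at the rollover; the restore below is
the same idea with made-up file names:)

    # rotate the binary logs so each two-hour window lives in its own file
    mysql -e "FLUSH LOGS"

    # roll back to a point in time: restore yesterday's backup, then replay
    # only the logs written before the incident
    mysql < yesterday.sql
    mysqlbinlog host-bin.000101 host-bin.000102 | mysql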
This process has reduced our downtime in the event of a total database
corruption from four hours (recovering from yesterday's data and missing
everything since) to 30 minutes (missing only the data since the last
two-hour rollover). And it doesn't take long to dump the last set of
binary logs to a text file, find and fix or remove the corrupting
command, and apply that whole log back into the database, effectively
giving us almost zero lost data and getting us back online in no time
(although when clients are screaming, even 30 minutes feels like an
eternity). This is all, of course, so much better than the four-hour
downtime we had before this system.
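
(Concretely, that fix-up step can be as small as this; the log file name
and the hand edit are of course specific to whatever went wrong:)

    # turn the suspect binary log into editable SQL
    mysqlbinlog host-bin.000103 > suspect.sql
    # hand-edit suspect.sql to remove or correct the bad statement,
    # then apply everything else so almost nothing is lost
    mysql < suspect.sql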
And there are side benefits... for example, backups are easier to do
because the data isn't being changed except for a few minutes every two
hours. Instead of co-ordinating timing scripts, locking tables, doing
dumps and so on, we can do a simple file system duplication of the
database files.
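
(For instance, during the quiet part of the cycle something as blunt as
copying the data directory gives a usable snapshot; the paths are
assumptions, and you'd still want the tables flushed to disk first:)

    # nothing is writing outside the few catch-up minutes, so flush and copy
    mysqladmin flush-tables
    cp -a /var/lib/mysql /backups/mysql-$(date +%Y%m%d-%H%M)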
Best Regards, Bruce