Tompkins Neil wrote:
> We are looking to upgrade our version of mySQL to the latest version of
> mySQL 5. One of the main features we are going to think about using is
> replication for our website data. Basically we have 2 websites located in
> the UK and US which share similar information, and we are going to be using
> replication as a way of keeping the data up to date and in sync.
> Based on your experiences, is there anything we should be aware of before
> investigating this route and putting it into practice ?
I've had to take servers out for bad raid-controllers, bad ram, bad
mobos. Disks have been the least of my problems. So make sure your
architecture tolerates the ability to take members of your pool out
without load-spiking the remaining members. And if you're doing
filesystem snapshots from a master to a replicant, you will have to
either have policy or extra servers available to maintain your uptime
when you interrupt the master to flush all the tables, sync the
filesystem and do an LVM snapshot. Innodb would require a shutdown.
Don't forget that LVM snapshots are copy-on-write, so when that master
comes back up and starts processing modifying tables, you'll get amazing
system load on a busy system as your file system starts madly copying
extents into the snapshot volume.
Define a procedure for junior staff how to properly down and up a pool
member. Like, if you get a disk-full on one member, and it borks
replication, what's the step-by-step for a) determining if replication
can re-establish after you do a FLUSH LOGS, b) under what conditions do
you have to re-copy all data from one master to another because your
replication window has expired and your logs have gotten flushed. Your
replication binlogs get really big if you're pushing large materialized
views regularly via replication, or your servers have fast disks, not
enough size to handle a more than a weekend or whole day (for example)
Define a procedure for checking your my.cnf files for correct
auto-increment-* settings and server-id settings. Junior staff, and even
senior staff rarely add more members to the pool, so these settings are
often mistaken during a midnight maintenance hour. Procedure for adding
members and changing master replication settings is very important.
Often your DBA is not racking and changing the equipment.
Make sure that you have a good understanding of what kind of capacity
you're growing at. I started a project with two four-core boxes with
plenty of 15krpm disk and when they got into production, they regularly
spiked to load 20 and 30. Not pretty. Not only had my old architecture
refused traffic to lighten the load, my new architecture didn't. My data
set was growing so fast my sort-buffer settings for the old servers were
too small for new servers. I ended up with four DL380s with 8 cores per
box. I really had to scramble to get more servers in there. The addition
of two more read-only members really helped, and backups handled by
replication to an off-site replicant.
Another load capacity warning: if your traffic is very spiky, and you
get high-load conditions, I've seen reset/dropped connections and also
plain old connection timeouts. So if you have RAM for 1024 connections,
you prolly can't service 1024 connections when you've got table
contention and connections from your web-nodes just start failing. If
they fail for too long, then you have to do some FLUSH HOSTS to reset
connection attempt counters.
I don't know what your application does, but I certainly monitor
replication lag. Load spikes can certainly increase lag. I've had to
move from single instances of mysql to mysqld_multi and separate
databases by replication rate. Your monitoring should also track sql
threads. You might need to define procedure on how to deal with
pooling-out members that fall too far behind in replication.
I've written an iptables script to block webnode connections but allow
sql pool member connections. I use this to take a member out to run
table repairs or to lighten the load while it does replication catch-up.
WAN connectivity for replication is interesting! I did site-to-site
transfer using stunnel. I had to negotiate weird Cisco 5502 VPN
behavior. Copying gigs of myisam files between sites would knock over my
vpn so I had to rate-limit using rsync --bwlimit. Bursting bandwidth
charges were still brutal, though. Later, we ended up configuring CBQ
(search freshmeat.net for cbq-init) on my backup replicant to limit
bandwidth so it wouldn't provoke bursting charges.