From: Luis Motta Campos
Date: October 23 2011 9:40am
Subject: 5.1.51 Database Replica Slows Down Suddenly, Lags For Days, and Recovers Without Intervention
List-Archive: http://lists.mysql.com/mysql/226156

Fellow DBAs and MySQL Users

[apologies for any duplicates - I've also posted this to percona-discussion@stripped]

I've been hunting an issue with my database cluster for several months now without much success. Maybe I'm overlooking something here.

I've been observing the database slowing down and lagging behind by thousands of seconds (sometimes over the course of several days), even without any query load besides replication itself.

I am running Percona MySQL 5.1.51 (InnoDB plug-in version 1.12) on a Dell R710 (6 x 3.5 inch 15K RPM disks in RAID10; 24GB RAM; 2x quad-core Intel processors) running Debian Lenny. MySQL data, binary logs, relay logs, and InnoDB log files are each on a separate partition, on a RAID system separate from the operating system disks.

The default storage engine is InnoDB, and the usual InnoDB memory structures are stable and look healthy.

I have about 500 (read) queries per second on average, and about 10% of this as writes on the master.

My Cacti graphs show what looks like a steady 6 to 10 pending reads per second.

The issue is characterized by the server suddenly slowing down writes without any prior warning or change, and lagging behind by several thousand seconds (triggering all sorts of alerts on my monitoring system). I don't observe extra CPU activity, just a reduced disk access rate (from about 5-6MB/s to 500KB/s) and replication lagging. I could not correlate it with InnoDB hashing activity, with long-running queries, or with background read/write thread activity.
I don't have any clue what is causing this behavior, and I'm unable to reproduce it under controlled conditions. I've observed the issue on servers both with and without workload (apart from the usual replication load). I am sure no changes were applied to the server or to the cluster.

I'm looking forward to suggestions and theories on the issue - all ideas are welcome.

Thank you for your time and attention,
Kind regards,
--
Luis Motta Campos is a DBA, Foodie, and Photographer