List: Internals
From: Sasha Pachev
Date: December 16 2000 5:26pm
Subject: Re: replication algorithm question
On Friday 15 December 2000 12:27, Aaron Ingram wrote:
>When replicating a query having an error code, the slave will fail if that
>same error is not encountered.  Assuming there's some reason to log failed
>queries, why should the slave try to execute them?  If they were not
>actually written to the master, even attempting the command on the slave
>seems incorrect.

Here is an example:

create table foo(n int not null primary key);
insert into foo values (1);
insert into foo values (1),(2),(3);

The last query will return an error but still modify the table, so it needs 
to be replicated.

>
>-Aaron
>
>-----Original Message-----
>From: Aaron Ingram 
>Sent: Sunday, December 10, 2000 10:20 PM
>To: 'mysql@stripped'
>Subject: Replication fails when expecting error
>
>
>I've run across the following replication error on the slave:
>	001209  2:08:36  Slave: did not get the expected error running query
>from master - expected: 'Got an error writing communication packets', got
>'no error'
>	001209  2:08:36  Slave:  error running query 'delete from
>table_name'
>	001209  2:08:36  Error running query, slave aborted. Fix the
>problem, and re-start the slave thread with mysqladmin start-slave
>I've confirmed the "Got an error writing communication packets" error
>appears on the master.  However, whatever problem caused that error on the
>master is not occurring on the slave.  Hence the slave failure.  Should a
>slave fail when encountering a mismatched error code of this type? 
>
>That aside, even if I can stop the communication problem from recurring on
>the master, I still have the existing log events to deal with.  How do I
>work around this problem?  I really would like to avoid restarting the log
>from a fresh dump&load, especially since there's no guarantee I can stop
>this error from happening again.
>
>I'm running MySQL 3.23.28 on RedHat Linux 6.1.

You have found a rather rare bug that would be nearly impossible to repeat 
"at will". The query on the master actually succeeded, but the errno in 
the thread structure was set because the client dropped the connection while 
the thread was trying to tell it that everything was fine. The patch below 
moves send_ok() after the log writes, so the query is logged before a failed 
send can set the error:

--- 1.28/sql/sql_delete.cc      Fri Dec  8 08:04:53 2000
+++ edited/sql/sql_delete.cc    Sat Dec 16 09:54:11 2000
@@ -106,13 +106,13 @@
   }
   if (!error)
   {
-    send_ok(&thd->net);                // This should return record count
     mysql_update_log.write(thd,thd->query,thd->query_length);
     if (mysql_bin_log.is_open())
     {
       Query_log_event qinfo(thd, thd->query);
       mysql_bin_log.write(&qinfo);
     }
+    send_ok(&thd->net);                // This should return record count
   }
   DBUG_RETURN(error ? -1 : 0);
 }                                                                            

Resuming replication is a rather tricky task.

On the slave, run:

SHOW SLAVE STATUS;

and note the name of the master log and the position. Then, on the master:

od -c -j offset_on_the_slave /path/to/datadir/binlog_name

Then count the bytes and try to guess where the next log entry starts. It 
should be about 100 bytes ahead of the current position, and the 5th byte of 
the entry (4 bytes from its start) will most likely be 0x02 (the code for a 
query log event). For the exact offset, look at the header of the entry before 
it, the one at the current slave position. Here is the header format:

offset  size  meaning
0       4     timestamp
4       1     event code (0x02 for query)
5       4     originating server id
9       4     event size

All integers are little endian.
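
To make the layout above concrete, here is a minimal Python sketch that decodes such a header. It assumes exactly the 13-byte layout listed in the table and nothing more; the names HEADER, QUERY_EVENT and parse_header are made up for illustration and are not part of MySQL.

```python
import struct

# Layout taken from the table above: 4-byte timestamp, 1-byte event code,
# 4-byte originating server id, 4-byte event size, all little endian.
# "<IBII" is 4 + 1 + 4 + 4 = 13 bytes with no padding.
HEADER = struct.Struct("<IBII")
QUERY_EVENT = 0x02  # event code for a query log event, per the table


def parse_header(buf):
    """Decode the first 13 bytes of a log entry (hypothetical helper)."""
    when, type_code, server_id, event_size = HEADER.unpack_from(buf)
    return {
        "when": when,
        "type_code": type_code,
        "server_id": server_id,
        "event_size": event_size,
    }
```

Feeding it the bytes that od prints at the slave offset should give a type_code equal to QUERY_EVENT and a plausible event_size; if it does not, the offset is probably wrong.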

And here is the code that creates it ( the ultimate reference :-) ):

int Log_event::write_header(IO_CACHE* file)
{
  // make sure to change this when the header gets bigger
  char buf[LOG_EVENT_HEADER_LEN];
  char* pos = buf;
  int4store(pos, when); // timestamp
  pos += 4;
  *pos++ = get_type_code(); // event type code
  int4store(pos, server_id);
  pos += 4;
  long tmp=get_data_size() + LOG_EVENT_HEADER_LEN;
  int4store(pos, tmp);
  pos += 4;
  return (my_b_write(file, (byte*) buf, (uint) (pos - buf)));
}



So after you have figured out the event size of the trouble query, add it to 
the current slave offset and do:

mysqlbinlog -j new_offset /path/to/datadir/binlog_name | head -1

You should see the next query printed out in plain text. If you do, you have 
the offset right; if not, check your arithmetic and try again.
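
The arithmetic can also be sketched in a few lines of Python instead of counting bytes by hand. This assumes the 13-byte header described earlier in this message, and that the event size field includes the header itself (as write_header suggests, since it stores the data size plus LOG_EVENT_HEADER_LEN). The function name next_event_offset is hypothetical.

```python
import io
import struct

# Assumed header layout from this message: timestamp, event code,
# server id, event size (little endian, 13 bytes in total).
HEADER = struct.Struct("<IBII")


def next_event_offset(f, offset):
    """Return (offset of the following entry, event code of this entry).

    f is a binary file object positioned anywhere; offset is the start
    of the entry to skip, e.g. the position reported on the slave.
    """
    f.seek(offset)
    raw = f.read(HEADER.size)
    if len(raw) < HEADER.size:
        raise ValueError("no complete event header at this offset")
    _when, type_code, _server_id, event_size = HEADER.unpack(raw)
    if event_size < HEADER.size:
        raise ValueError("implausible event size, offset is probably wrong")
    # The size field covers header plus data, so adding it to the start
    # of this entry lands on the start of the next one.
    return offset + event_size, type_code
```

Opening the binlog with open(path, "rb") and passing the slave's position should give the candidate new_offset; it is still worth confirming with the mysqlbinlog check above before changing anything on the slave.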

Once you have the offset right, run on the slave:

CHANGE MASTER TO MASTER_LOG_POS=new_offset;
SLAVE START;
SHOW SLAVE STATUS;

It should now be running, and your data on the slave should be OK, because 
the delete query that we skipped by adjusting the offset has already been 
executed there.

In 3.23.30, I will change the code on the slave to print the next offset in 
case of a query error, so one could skip the trouble query in case something 
really terrible happens without having to do the binlog magic.






-- 
MySQL Development Team
   __  ___     ___ ____  __ 
  /  |/  /_ __/ __/ __ \/ /   Sasha Pachev <sasha@stripped>
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, http://www.mysql.com/
/_/  /_/\_, /___/\___\_\___/  Provo, Utah, USA
       <___/                  