List: Bugs
From:sasha Date:September 13 2000 12:21am
Subject:Re: More on replication ..
Weslee Bilodeau wrote:
> 
> This one's an odd one, not fatal, just annoying ..
> 
> On the master server I created a table in a database that didn't exist
> on the slaves, so the slaves exited with the error below. That's not the
> bug, though:
> ( But, speaking of which - could there be an option to automatically
> create a database that doesn't exist on the slave when a table is
> created in it during replication? )

This option would not be a good idea, as it would make it easier to make a
mess - say you were setting up a slave and forgot to mirror a replicated
database. With this option enabled, it would take you longer to notice the
problem. The correct/recommended way to set up replication is (to be
documented in the manual shortly :-)):

 - identify all databases that will be replicated
 - on the master, add binlog-do-db=db for each database to be replicated, and
on the slave add replicate-do-db=db for each (you can use binlog-ignore-db and
replicate-ignore-db if that is more convenient, or omit these options entirely
if every database is to be replicated)
 - lock all tables in those databases, make a copy of the data, flush the
master logs, and unlock the tables
 - restore the tables on the slave
 - start the slave
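As a sketch of the option lines above (db1 and db2 are hypothetical database
names, not anything from the report), the master and slave configuration
files might contain:

```ini
# master my.cnf -- write binlog entries only for the replicated databases
[mysqld]
log-bin
binlog-do-db = db1
binlog-do-db = db2

# slave my.cnf -- apply only updates for those same databases
[mysqld]
replicate-do-db = db1
replicate-do-db = db2
```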

> 
> mysql> show processlist;
> +----+-------------+-----------+------+---------+------+-----------------------+------------------+
> | Id | User        | Host      | db   | Command | Time | State                 | Info             |
> +----+-------------+-----------+------+---------+------+-----------------------+------------------+
> |  5 | root        | localhost | test | Query   | 0    | NULL                  | show processlist |
> | 48 | system user | none      | NULL | Connect | 3960 | reading master update | NULL             |
> +----+-------------+-----------+------+---------+------+-----------------------+------------------+
> 
> 2 rows in set (0.00 sec)
> 
> The interesting part is that I had just started the slave and immediately
> ran 'show processlist'. The Time is 3960.
> 
> The slave's been connected less than a few seconds and the counter is
> saying an hour.

I've fixed it in 3.23.25 - in case you have the source lying around and really
want this time to be correct before 3.23.25 is released, here is the patch:

===== slave.cc 1.36 vs edited =====
--- 1.36/sql/slave.cc   Sat Sep  9 21:31:22 2000
+++ edited/slave.cc     Mon Sep 11 14:01:50 2000
@@ -812,6 +812,7 @@
   my_thread_init(); // needs to be up here, otherwise we get a coredump
   // trying to use DBUG_ stuff
   thd = new THD; // note that contructor of THD uses DBUG_ !
+  thd->set_time();
   DBUG_ENTER("handle_slave");

   pthread_detach_this_thread();


> 
> The second one goes back to my original 'slave stop' not removing the thread.
> 
> I did further testing and issued a 'slave stop' on one of the slaves.
> The thread still didn't go away on the master, but no updates had been
> done yet either ( Should the slave at least advise the master that it's
> no longer listening? ). I updated the master (meaning I inserted data in
> a table that was being replicated) and the 'zombie' slave was still
> there. Did another processlist and it showed two slave threads from the
> same machine, only one of which was actually connected.
> 
>
> +----+-----------+----------------------+-------------+-------------+------+--------------------+------------------+
> | Id | User      | Host                 | db          | Command     | Time | State              | Info             |
> +----+-----------+----------------------+-------------+-------------+------+--------------------+------------------+
> |  1 | slave_rep | crashme.local.domain | NULL        | Binlog Dump | 4457 | Waiting for update | NULL             |
> |  2 | slave_rep | 192.168.1.34         | NULL        | Binlog Dump | 4457 | Waiting for update | NULL             |
> |  3 | root      | localhost            | replication | Query       | 0    | NULL               | show processlist |
> |  6 | slave_rep |                      | NULL        | Binlog Dump | 393  | Waiting for update | NULL             |
> |  7 | slave_rep | crashme.local.domain | NULL        | Binlog Dump | 338  | Waiting for update | NULL             |
> |  9 | slave_rep |                      | NULL        | Binlog Dump | 8    | Waiting for update | NULL             |
> +----+-----------+----------------------+-------------+-------------+------+--------------------+------------------+
> 
> After doing a few more inserts, the zombie threads disappeared (it seems
> that if only one update is done they stick around; after 2+ they
> disappear).
> 
> More strangeness: threads 6 and 9 are actually newer 'zombie' threads of
> '192.168.1.34', but for some reason they don't have a Host entry (unlike
> thread 2, which does). The 'crashme' server always has a Host entry (but
> it is also the only one with a reverse DNS entry).
> 
> If further information is needed, I have on each server a second binary
> compiled with '--with-debug=full' and can send the logs and reproduce
> the problems if needed.

The reason for the fluctuations in how long a zombie Binlog Dump thread
survives after SLAVE STOP is net buffering in mysqld - the server does not
touch the network socket until there is enough data in the net buffer to
actually require writing it out. The zombie will not know the client died
until it tries to write to the socket and gets an error. The issue here is
that COM_BINLOG_DUMP was meant to be an infinite command that never returns
unless there is an error.

Unfortunately, I do not see any way to kill the zombies sooner without either
slowing down replication or having the slave establish another connection to
the master just to kill a thread that has already exited. We could, of course,
try a read with a short timeout on the socket after each update to check for
COM_QUIT, but this would slow things down too much - you would pay the penalty
on each update. Another possibility is to open another connection and kill the
zombie, but this is not that good of an idea either - if you decided to stop
the slave during network congestion, for example, and establishing another
connection took a long time, you would sit there for a couple of minutes
waiting for SLAVE STOP to return.

Monty, do you see any better/workable alternatives?
 

-- 
Sasha Pachev

+------------------------------------------------------------------+
|      ____  __     _____   _____  ___     http://www.mysql.com    |
|     /*/\*\/\*\   /*/ \*\ /*/ \*\ |*|     Sasha Pachev            |
|    /*/ /*/ /*/   \*\_   |*|   |*||*|     sasha@stripped         |
|   /*/ /*/ /*/\*\/*/  \*\|*|   |*||*|     Provo, Utah, USA        |
|  /*/     /*/  /*/\*\_/*/ \*\_/*/ |*|____                         |
|  ^^^^^^^^^^^^/*/^^^^^^^^^^^\*\^^^^^^^^^^^                        |
|             /*/             \*\ Developers Team                  |
+------------------------------------------------------------------+