List: Internals
From: Baron Schwartz
Date: July 17 2007 3:50pm
Subject: I don't understand how SHOW SLAVE HOSTS works

I have been trying to understand the behavior of SHOW SLAVE HOSTS, and it didn't seem 
to match the documentation, so I went to the source and got even more confused :-)

I'm sure I am committing several sins here, including looking at 5.1 code while running 
5.0.40, but the code I'm looking at is the Doxygen-ized 5.1 code of 
sql/repl_failsafe.cc at 
http://dev.mysql.com/sources/doxygen/mysql-5.1/repl__failsafe_8cc-source.html

So far, I find there is a hash table called slave_list, which is populated by 
register_slave() and read by the function (whose name I have now forgotten and can't 
see) called by SHOW SLAVE HOSTS.
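
A toy model may make that mechanism concrete. The sketch below is Python, not the server's C++, and it is not taken from repl_failsafe.cc: the class name, the method names, the unregister step, and the assumption that entries are keyed by server_id are all hypothetical and only illustrate "written on registration, read by the listing command".

```python
class SlaveRegistry:
    """Toy stand-in for the slave_list hash table (hypothetical, not MySQL code)."""

    def __init__(self):
        # Assumption: one entry per server_id, so re-registering overwrites.
        self._slaves = {}

    def register_slave(self, server_id, host, port, master_id):
        """Insert or overwrite the entry for a slave that has just connected."""
        self._slaves[server_id] = {
            "server_id": server_id,
            "host": host,
            "port": port,
            "master_id": master_id,
        }

    def unregister_slave(self, server_id):
        """Drop the entry for a slave that has disconnected, if it exists."""
        self._slaves.pop(server_id, None)

    def show_slave_hosts(self):
        """Return the current entries, ordered by server_id for readability."""
        return sorted(self._slaves.values(), key=lambda row: row["server_id"])


# Usage: the two slaves that portland reports in the output further down.
portland = SlaveRegistry()
portland.register_slave(40, "fresno", 3306, 21)
portland.register_slave(20, "fries", 3306, 21)
for row in portland.show_slave_hosts():
    print(row["server_id"], row["host"], row["port"], row["master_id"])
```

In this model an entry only disappears if something explicitly removes it, so a registry that never sees a disconnect keeps the entry indefinitely.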

The command doesn't seem to work quite the way it is supposed to.  As I read the code, 
each slave should always know which other slaves are connected, because each slave 
reports to and reads from its master, and the master updates the slave whenever another 
slave connects or disconnects (I think -- I am not very good at reading the source).  
Yet on my 5.0.40 setup, I have the following replication topology:

portland
   => fries
   => fresno
        => nepal

And on these servers, I see the following:

on portland:

+-----------+--------+------+-------------------+-----------+
| Server_id | Host   | Port | Rpl_recovery_rank | Master_id |
+-----------+--------+------+-------------------+-----------+
|        40 | fresno | 3306 |                 0 |        21 |
|        20 | fries  | 3306 |                 0 |        21 |
+-----------+--------+------+-------------------+-----------+

on fries:

+-----------+----------+------+-------------------+-----------+
| Server_id | Host     | Port | Rpl_recovery_rank | Master_id |
+-----------+----------+------+-------------------+-----------+
|        21 | portland | 3306 |                 0 |        11 |
|        40 | fresno   | 3306 |                 0 |        21 |
|        20 | fries    | 3306 |                 0 |        21 |
+-----------+----------+------+-------------------+-----------+

on fresno:

+-----------+----------+------+-------------------+-----------+
| Server_id | Host     | Port | Rpl_recovery_rank | Master_id |
+-----------+----------+------+-------------------+-----------+
|         9 | nepal    | 3306 |                 0 |        40 |
|        40 | fresno   | 3306 |                 0 |        21 |
|        20 | fries    | 3306 |                 0 |        21 |
|        21 | portland | 3306 |                 0 |        11 |
+-----------+----------+------+-------------------+-----------+

on nepal:

+-----------+----------+------+-------------------+-----------+
| Server_id | Host     | Port | Rpl_recovery_rank | Master_id |
+-----------+----------+------+-------------------+-----------+
|        21 | portland | 3306 |                 0 |        11 |
|        20 | fries    | 3306 |                 0 |        21 |
|        40 | fresno   | 3306 |                 0 |        21 |
|        42 | portland | 3306 |                 0 |        11 |
|         9 | nepal    | 3306 |                 0 |        40 |
+-----------+----------+------+-------------------+-----------+

I'm sure you have guessed some of these servers have swapped roles at various times. 
For example, portland used to be a slave of usa, which it replaced (after an OS 
rebuild) and which is no longer in use.  Likewise, I think nepal used to be a slave of 
portland, a very long time ago -- probably six months ago.  But all of these servers 
have surely been restarted, if not given a new OS, during the swapping.  Why the 
obsolete entry for portland (currently server_id 21) on nepal?
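
One way to spot entries like this mechanically is to look for a host name that appears under more than one server_id. The sketch below is a hypothetical helper in Python, fed by hand with the rows from the nepal output above; it does not query a server, and a repeated host name is only a hint of a stale entry, not proof.

```python
from collections import defaultdict

# Rows copied from the SHOW SLAVE HOSTS output on nepal:
# (server_id, host, port, master_id)
NEPAL_ROWS = [
    (21, "portland", 3306, 11),
    (20, "fries", 3306, 21),
    (40, "fresno", 3306, 21),
    (42, "portland", 3306, 11),
    (9, "nepal", 3306, 40),
]


def hosts_with_multiple_ids(rows):
    """Return {host: [server_ids]} for hosts listed under more than one id."""
    ids_by_host = defaultdict(set)
    for server_id, host, _port, _master_id in rows:
        ids_by_host[host].add(server_id)
    return {
        host: sorted(ids)
        for host, ids in ids_by_host.items()
        if len(ids) > 1
    }


print(hosts_with_multiple_ids(NEPAL_ROWS))  # prints {'portland': [21, 42]}
```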

What should this command really show in my setup?  Should each of the four machines 
show the same thing?  (I think they are meant to.)  Should a server unregister itself 
when it is stopped, and is the old entry for portland on nepal therefore a bug?

Finally, a question on the code itself, from the file linked above:

  Asks the master for the list of its other connected slaves.
  This is for failsafe replication:
  in order for failsafe replication to work, the servers involved in
  replication must know of each other. We accomplish this by having each
  slave report to the master how to reach it, and on connection, each
  slave receives information about where the other slaves are.

Shouldn't each slave also receive information about the other slaves whenever a new 
slave connects?  I realize this becomes one of those O(n(n-1)) kinds of problems but it 
seems like the only way to get correct behavior -- unless only one server (the topmost 
master in the replication tree) ever stores any information about which slaves are 
connected.  But then I imagine this isn't exactly failsafe.
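
To put a number on the n(n-1) figure: the Python sketch below counts how many slave entries a single master would send if every join were broadcast. It assumes a hypothetical scheme, not one confirmed by the source, in which a newly connected slave receives one entry per slave already connected and each already connected slave receives the newcomer's entry.

```python
def entries_sent(n_slaves):
    """Count slave entries the master sends while n_slaves join one by one.

    Assumed (hypothetical) scheme: on each join the newcomer gets the
    existing list, and every existing slave gets the newcomer's entry.
    """
    sent = 0
    connected = 0
    for _ in range(n_slaves):
        sent += connected  # newcomer receives one entry per existing slave
        sent += connected  # each existing slave receives the newcomer's entry
        connected += 1
    return sent


for n in (2, 4, 10):
    print(n, entries_sent(n), n * (n - 1))
```

Under that assumption the total for n slaves comes to exactly n(n-1), for example 12 entries for 4 slaves, and disconnect notifications would add to it.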

Thanks for reading my disjointed thoughts and questions!

Baron

-- 
Baron Schwartz
http://www.xaprb.com/