List:Commits« Previous MessageNext Message »
From:Luís Soares Date:September 24 2010 9:56am
Subject:Re: bzr commit into mysql-5.1-bugteam branch (luis.soares:3489) Bug#55263
View as plain text  
See: http://lists.mysql.com/commits/119015

On 09/22/2010 05:10 PM, Alfranio Correia wrote:
> Hi Luis,
> 
> Thank you for the feedback.

No problem.

> Please, add the explanation you wrote here to the code.

Done.

> I would go for solution 1 and file a bug report and a worklog
> to handle the refactory.

Done solution #1.
Will file worklog later.

Regards,
Luís Soares

> Cheers.
> 
> On 09/22/2010 02:41 PM, Luís Soares wrote:
>> Hi Alfranio,
>>    This is somewhat hard to trace/debug. That code is not very appealing
>> (to say the least), so sorry if I miss something, or got something
>> wrong!
>>
>> On 09/21/2010 09:04 PM, Alfranio Correia wrote:
>>> Hi Luis,
>>>
>>> First of all, sorry for the delay.
>>> Let's discuss this tomorrow in order to close this bug asap
>>
>> OK.
>>
>>> Excellent description on the problem but I still have some questions
>>> on the decision you have made to fix the bug.
>>
>> OK.
>>
>>> 1 - Why don't you set the cur_log to a valid value before starting the
>>> sql thread?
>>
>> The cur_log is set to a valid value befor the sql thread starts.
>>
>> It's the one set when CHANGE MASTER is issued. However, while the SQL
>> thread moves processes the events and moves on to the next log,
>> eventually it reaches the log that it was reading the first time it
>> was stopped. Then, there is no my_b_seek done (as it happens in several
>> other places in the code), before check_binlog_magic is called. See,
>> next_event, especially this part:
>>
>> 4340       if (rli->relay_log.is_active(rli->linfo.log_file_name))
>> 4341       {
>> 4342 #ifdef EXTRA_DEBUG
>> 4343         if (global_system_variables.log_warnings)
>> 4344           sql_print_information("next log '%s' is currently active",
>> 4345                                 rli->linfo.log_file_name);
>> 4346 #endif
>> 4347         rli->cur_log= cur_log= rli->relay_log.get_log_file();
>> (...)  /*** There is no my_b_seek(cur_log, 0) from here to: ***/
>> 4358         if (check_binlog_magic(cur_log,&errmsg))
>> (...)
>>
>>> 2 - Why is this not repeatable when you stop and start the slave?
>>
>> Because, when the SQL thread is stopped and restarted the
>> following may happen:
>>
>> 1. Log was hot at stop time and remains hot at restart
>>
>>     SQL thread reads again from hot_log (SQL thread was reading
>>     from the active log when it was stopped and the very same log
>>     is still active on SQL thread restart).
>>
>>     In this case, my_b_seek is performed on cur_log, while cur_log
>>     points to relay_log.get_log_file();
>>
>>     Result: check_binlog_magic *succeeds*
>>
>> 2. Log was hot at stop time but got cold before restart
>>
>>     The log was hot when SQL thread stopped, but it is not anymore
>>     when the SQL thread restarts.
>>
>>     In this case, the SQL thread reopens the log, using cache_buf,
>>     ie, cur_log points to&cache_buf, and thence its coordinates
>>     are reset.
>>
>>     Result: check_binlog_magic *succeeds*
>>
>> 3. Log was already cold at stop time
>>
>>     The log was not hot when the SQL thread stopped, and, of
>>     course, it will not be hot when it restarts.
>>
>>     In this case, the SQL thread opens the cold log again, using
>>     cache_buf, ie, cur_log points to&cache_buf, and thence its
>>     coordinates are reset.
>>
>>     Result: check_binlog_magic *succeeds*
>>
>> 4. Log was hot at stop time, DBA changes to previous cold log and
>>     restarts SQL thread
>>
>>     The log was hot when the SQL thread was stopped, but the user
>>     changed the coordinates of the SQL thread to restart from a
>>     previous cold log.
>>
>>     In this case, at start time, cur_log points to a cold log,
>>     opened using&cache_buf as cache, and coordinates are
>>     reset. However, as it moves on to the next logs, it will
>>     eventually reach the hot log. If the hot log is the same at
>>     the time the SQL thread was stopped, then coordinates were not
>>     reset - the cur_log will point to relay_log.get_log_file(),
>>     and not a freshly opened IO_CACHE through cache_buf,
>>     triggering the assertion (see that in next_event, after
>>     checking whether the log is hot or not there is no my_b_seek
>>     before calling check_binlog_magic - see snippet above).
>>
>>     Result: check_binlog_magic *FAILS*
>>
>> So I considered two alternatives:
>>
>>    1. to deploy yet another my_b_seek in the snippet I pasted
>>       above.
>>
>>    2. or make the my_b_seek at SQL (re)start time, always reset
>>       the hot_log IO_CACHE, the cold logs will have to be
>>       reopened, thence, they should be safe.
>>
>> I went for the second one, as I prefer not to had one more
>> my_b_seek.
>>
>> Finally, I think this code needs to be restructured/refactored
>> sometime in the comming future.
>>
>>> Finally, I have a requested. Please, add a better explanation on the
>>> variables
>>> cur_log, cur_log_fd, etc... It is very hard to understand why they are
>>> defined
>>> without using gdb and reading several pieces of the code.
>>
>> /**
>>    Pointer that either points at relay_log.get_log_file() or
>>    &rli->cache_buf, depending on whether the log is hot or there
>>    was the need to open a cold relay_log.
>> */
>> cur_log : IO_CACHE*
>>
>> /**
>>    File descriptor of the current read relay log.
>>   */
>> rli->cur_log_fd : File
>>
>> /**
>>    IO_CACHE used when opening cold relay logs.
>>   */
>> rli->cache_buf : IO_CACHE
>>
>> /**
>>    Gets the IO_CACHE* to the log file referenced at the super
>>    class MYSQL_LOG.
>>   */
>> MYSQL_BIN_LOG::get_log_file
>>
>>
>> cur_log is set in:
>>
>>   1. init_relay_log_pos
>>
>>      a) my_b_seek((rli->cur_log=rli->relay_log.get_log_file()),
>> (off_t)0);
>>         if we are reading from the binary log that the IO thread is using
>>
>>      b) rli->cur_log =&rli->cache_buf;
>>         if we are reading from a cold binary log
>>
>>   2. reopen_relay_log
>>
>>      a) rli->cur_log=&rli->cache_buf;
>>
>>   3. next_event
>>
>>      a) rli->cur_log= cur_log= rli->relay_log.get_log_file()
>>         if we are reading from the binary log that the IO thread is using
>>
>>
>> Regards,
>> Luís Soares
>>
>>> Cheers.
>>>
>>> On 07/23/2010 05:32 PM, Luis Soares wrote:
>>>> #At
>>>>
> file:///home/lsoares/Workspace/bzr/work/bugfixing/55263/mysql-5.1-bugteam/
>>>>
>>>> based on revid:ramil@stripped
>>>>
>>>>    3489 Luis Soares    2010-07-23
>>>>         BUG#55263: assert in check_binlog_magic
>>>>
>>>>         The procedure for setting up a fake binary log, by changing the
>>>>         relay log files manually, is run twice when we issue mtr with
>>>>         --repeat=2. However, when setting it up for the second time,
>>>>         neither the sql thread is reset nor the server is restarted.
>>>> This
>>>>         means that previous stale relay log IO_CACHE - from first run -
>>>>         is left around. As a consequence, when the test is run for the
>>>>         second time, the IO_CACHE for the relay log has position in
>>>> file
>>>>         different than zero, triggering the assertion.
>>>>
>>>>         We fix this by changing the call to my_b_seek at the very
>>>>         beginning of init_master_info. Instead of doing a seek on the
>>>>         cur_log we now do a seek on the currently active relay log.
>>>> This
>>>>         makes sure that even if the SQL thread had read the log
>>>> while it
>>>>         was hot in the past, the position will be set to 0 if it
>>>>         continues hot when the SQL thread is restarted and switches
>>>> back
>>>>         to it. If, on the other hand, the log became cold while the SQL
>>>>         was stopped and restarted, then when the thread switches back
>>>> to it
>>>>         the log will have to be reopened, thus the position will start
>>>>         from the 0, meaning that there is no need to do a seek.
>>>>
>>>>       modified:
>>>>         sql/rpl_mi.cc
>>>> === modified file 'sql/rpl_mi.cc'
>>>> --- a/sql/rpl_mi.cc    2010-05-24 16:43:27 +0000
>>>> +++ b/sql/rpl_mi.cc    2010-07-23 16:32:55 +0000
>>>> @@ -123,7 +123,9 @@ int init_master_info(Master_info* mi, co
>>>>        */
>>>>        if (thread_mask&   SLAVE_SQL)
>>>>        {
>>>> -      my_b_seek(mi->rli.cur_log, (my_off_t) 0);
>>>> +      pthread_mutex_lock(mi->rli.relay_log.get_log_lock());
>>>> +      my_b_seek(mi->rli.relay_log.get_log_file(), (my_off_t) 0);
>>>> +      pthread_mutex_unlock(mi->rli.relay_log.get_log_lock());
>>>>        }
>>>>        DBUG_RETURN(0);
>>>>      }
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
> 

Thread
bzr commit into mysql-5.1-bugteam branch (luis.soares:3489) Bug#55263Luis Soares23 Jul
  • Re: bzr commit into mysql-5.1-bugteam branch (luis.soares:3489) Bug#55263Alfranio Correia21 Sep
    • Re: bzr commit into mysql-5.1-bugteam branch (luis.soares:3489) Bug#55263Luís Soares22 Sep
      • Re: bzr commit into mysql-5.1-bugteam branch (luis.soares:3489) Bug#55263Alfranio Correia22 Sep
        • Re: bzr commit into mysql-5.1-bugteam branch (luis.soares:3489) Bug#55263Luís Soares24 Sep
          • Re: bzr commit into mysql-5.1-bugteam branch (luis.soares:3489) Bug#55263Alfranio Correia24 Sep