See: http://lists.mysql.com/commits/119015
On 09/22/2010 05:10 PM, Alfranio Correia wrote:
> Hi Luis,
>
> Thank you for the feedback.
No problem.
> Please, add the explanation you wrote here to the code.
Done.
> I would go for solution 1 and file a bug report and a worklog
> to handle the refactory.
Done solution #1.
Will file worklog later.
Regards,
Luís Soares
> Cheers.
>
> On 09/22/2010 02:41 PM, Luís Soares wrote:
>> Hi Alfranio,
>> This is somewhat hard to trace/debug. That code is not very appealing
>> (to say the least), so sorry if I miss something, or got something
>> wrong!
>>
>> On 09/21/2010 09:04 PM, Alfranio Correia wrote:
>>> Hi Luis,
>>>
>>> First of all, sorry for the delay.
>>> Let's discuss this tomorrow in order to close this bug asap
>>
>> OK.
>>
>>> Excellent description on the problem but I still have some questions
>>> on the decision you have made to fix the bug.
>>
>> OK.
>>
>>> 1 - Why don't you set the cur_log to a valid value before starting the
>>> sql thread?
>>
>> The cur_log is set to a valid value befor the sql thread starts.
>>
>> It's the one set when CHANGE MASTER is issued. However, while the SQL
>> thread moves processes the events and moves on to the next log,
>> eventually it reaches the log that it was reading the first time it
>> was stopped. Then, there is no my_b_seek done (as it happens in several
>> other places in the code), before check_binlog_magic is called. See,
>> next_event, especially this part:
>>
>> 4340 if (rli->relay_log.is_active(rli->linfo.log_file_name))
>> 4341 {
>> 4342 #ifdef EXTRA_DEBUG
>> 4343 if (global_system_variables.log_warnings)
>> 4344 sql_print_information("next log '%s' is currently active",
>> 4345 rli->linfo.log_file_name);
>> 4346 #endif
>> 4347 rli->cur_log= cur_log= rli->relay_log.get_log_file();
>> (...) /*** There is no my_b_seek(cur_log, 0) from here to: ***/
>> 4358 if (check_binlog_magic(cur_log,&errmsg))
>> (...)
>>
>>> 2 - Why is this not repeatable when you stop and start the slave?
>>
>> Because, when the SQL thread is stopped and restarted the
>> following may happen:
>>
>> 1. Log was hot at stop time and remains hot at restart
>>
>> SQL thread reads again from hot_log (SQL thread was reading
>> from the active log when it was stopped and the very same log
>> is still active on SQL thread restart).
>>
>> In this case, my_b_seek is performed on cur_log, while cur_log
>> points to relay_log.get_log_file();
>>
>> Result: check_binlog_magic *succeeds*
>>
>> 2. Log was hot at stop time but got cold before restart
>>
>> The log was hot when SQL thread stopped, but it is not anymore
>> when the SQL thread restarts.
>>
>> In this case, the SQL thread reopens the log, using cache_buf,
>> ie, cur_log points to&cache_buf, and thence its coordinates
>> are reset.
>>
>> Result: check_binlog_magic *succeeds*
>>
>> 3. Log was already cold at stop time
>>
>> The log was not hot when the SQL thread stopped, and, of
>> course, it will not be hot when it restarts.
>>
>> In this case, the SQL thread opens the cold log again, using
>> cache_buf, ie, cur_log points to&cache_buf, and thence its
>> coordinates are reset.
>>
>> Result: check_binlog_magic *succeeds*
>>
>> 4. Log was hot at stop time, DBA changes to previous cold log and
>> restarts SQL thread
>>
>> The log was hot when the SQL thread was stopped, but the user
>> changed the coordinates of the SQL thread to restart from a
>> previous cold log.
>>
>> In this case, at start time, cur_log points to a cold log,
>> opened using&cache_buf as cache, and coordinates are
>> reset. However, as it moves on to the next logs, it will
>> eventually reach the hot log. If the hot log is the same at
>> the time the SQL thread was stopped, then coordinates were not
>> reset - the cur_log will point to relay_log.get_log_file(),
>> and not a freshly opened IO_CACHE through cache_buf,
>> triggering the assertion (see that in next_event, after
>> checking whether the log is hot or not there is no my_b_seek
>> before calling check_binlog_magic - see snippet above).
>>
>> Result: check_binlog_magic *FAILS*
>>
>> So I considered two alternatives:
>>
>> 1. to deploy yet another my_b_seek in the snippet I pasted
>> above.
>>
>> 2. or make the my_b_seek at SQL (re)start time, always reset
>> the hot_log IO_CACHE, the cold logs will have to be
>> reopened, thence, they should be safe.
>>
>> I went for the second one, as I prefer not to had one more
>> my_b_seek.
>>
>> Finally, I think this code needs to be restructured/refactored
>> sometime in the comming future.
>>
>>> Finally, I have a requested. Please, add a better explanation on the
>>> variables
>>> cur_log, cur_log_fd, etc... It is very hard to understand why they are
>>> defined
>>> without using gdb and reading several pieces of the code.
>>
>> /**
>> Pointer that either points at relay_log.get_log_file() or
>> &rli->cache_buf, depending on whether the log is hot or there
>> was the need to open a cold relay_log.
>> */
>> cur_log : IO_CACHE*
>>
>> /**
>> File descriptor of the current read relay log.
>> */
>> rli->cur_log_fd : File
>>
>> /**
>> IO_CACHE used when opening cold relay logs.
>> */
>> rli->cache_buf : IO_CACHE
>>
>> /**
>> Gets the IO_CACHE* to the log file referenced at the super
>> class MYSQL_LOG.
>> */
>> MYSQL_BIN_LOG::get_log_file
>>
>>
>> cur_log is set in:
>>
>> 1. init_relay_log_pos
>>
>> a) my_b_seek((rli->cur_log=rli->relay_log.get_log_file()),
>> (off_t)0);
>> if we are reading from the binary log that the IO thread is using
>>
>> b) rli->cur_log =&rli->cache_buf;
>> if we are reading from a cold binary log
>>
>> 2. reopen_relay_log
>>
>> a) rli->cur_log=&rli->cache_buf;
>>
>> 3. next_event
>>
>> a) rli->cur_log= cur_log= rli->relay_log.get_log_file()
>> if we are reading from the binary log that the IO thread is using
>>
>>
>> Regards,
>> Luís Soares
>>
>>> Cheers.
>>>
>>> On 07/23/2010 05:32 PM, Luis Soares wrote:
>>>> #At
>>>>
> file:///home/lsoares/Workspace/bzr/work/bugfixing/55263/mysql-5.1-bugteam/
>>>>
>>>> based on revid:ramil@stripped
>>>>
>>>> 3489 Luis Soares 2010-07-23
>>>> BUG#55263: assert in check_binlog_magic
>>>>
>>>> The procedure for setting up a fake binary log, by changing the
>>>> relay log files manually, is run twice when we issue mtr with
>>>> --repeat=2. However, when setting it up for the second time,
>>>> neither the sql thread is reset nor the server is restarted.
>>>> This
>>>> means that previous stale relay log IO_CACHE - from first run -
>>>> is left around. As a consequence, when the test is run for the
>>>> second time, the IO_CACHE for the relay log has position in
>>>> file
>>>> different than zero, triggering the assertion.
>>>>
>>>> We fix this by changing the call to my_b_seek at the very
>>>> beginning of init_master_info. Instead of doing a seek on the
>>>> cur_log we now do a seek on the currently active relay log.
>>>> This
>>>> makes sure that even if the SQL thread had read the log
>>>> while it
>>>> was hot in the past, the position will be set to 0 if it
>>>> continues hot when the SQL thread is restarted and switches
>>>> back
>>>> to it. If, on the other hand, the log became cold while the SQL
>>>> was stopped and restarted, then when the thread switches back
>>>> to it
>>>> the log will have to be reopened, thus the position will start
>>>> from the 0, meaning that there is no need to do a seek.
>>>>
>>>> modified:
>>>> sql/rpl_mi.cc
>>>> === modified file 'sql/rpl_mi.cc'
>>>> --- a/sql/rpl_mi.cc 2010-05-24 16:43:27 +0000
>>>> +++ b/sql/rpl_mi.cc 2010-07-23 16:32:55 +0000
>>>> @@ -123,7 +123,9 @@ int init_master_info(Master_info* mi, co
>>>> */
>>>> if (thread_mask& SLAVE_SQL)
>>>> {
>>>> - my_b_seek(mi->rli.cur_log, (my_off_t) 0);
>>>> + pthread_mutex_lock(mi->rli.relay_log.get_log_lock());
>>>> + my_b_seek(mi->rli.relay_log.get_log_file(), (my_off_t) 0);
>>>> + pthread_mutex_unlock(mi->rli.relay_log.get_log_lock());
>>>> }
>>>> DBUG_RETURN(0);
>>>> }
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>