List: Commits
From:hiu Date:September 28 2011 9:00am
Subject:a workaround patch for #62100
Hi all,

Over the last two years, we have hit the DDL lost-table issue more than 3
times on our online products.
The background is described in
#62100 <http://bugs.mysql.com/bug.php?id=62100>,
and a workaround patch <http://bugs.mysql.com/file.php?id=17465> has been
submitted.

My earlier explanation in the ticket may contain some mistakes, which I
noticed when I looked into this issue again, and some of the logic is
still unclear to me.

After looking into the stack info
<http://bugs.mysql.com/file.php?id=17415> captured when the DDL table alter
log is printed, we found that the IO handler threads
are all waiting for signals and the master thread is stuck in
fil_mutex_enter_and_prepare_for_io until timeout,
while another DDL worker thread is stuck in fil_rename_tablespace until
timeout.


The patch does two things:
1. Force the master thread to wake the io threads before it enters the
waiting loop. As a matter of fact, the master would wake the
    simulated io threads shortly afterwards anyway.

   I made a mistake in the ticket about how the master/io_handler
threads are signaled:
   they use different slots, and since each has its own condition
variable, a broadcast reaches only one thread.
   That means there is no race between threads over which one is woken.

   With this patch, the master wakes the io_handler threads to do the aio
for the intermediate file of the renamed table,
   just before the master thread gets stuck because space->stop_ios = TRUE
has been set by fil_rename_tablespace.

The key logic of these two functions in MySQL-5.1.48 is listed below:

fil_rename_tablespace:

retry:
        mutex_enter(&fil_system->mutex);
        count++;
        if (count > 25000) {
                space->stop_ios = FALSE;
                mutex_exit(&fil_system->mutex);
                return(FALSE);
        }

        space = fil_space_get_by_id(id);
        space->stop_ios = TRUE;
        if (node->n_pending > 0 || node->n_pending_flushes > 0) {
                mutex_exit(&fil_system->mutex);
                os_thread_sleep(20000);
                goto retry;
        }


fil_mutex_enter_and_prepare_for_io:

retry:
        mutex_enter(&fil_system->mutex);
        if (space != NULL && space->stop_ios) {
                mutex_exit(&fil_system->mutex);
                os_thread_sleep(20000);
                goto retry;
        }

2. Even when the DDL fails, it should be rolled back successfully, but it
was not. #62146 <http://bugs.mysql.com/bug.php?id=62146> describes that.
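
The interaction between the two retry loops quoted under item 1 can be
sketched as a small single-threaded model. This is only an illustration
under assumptions: sim_space_t, sim_io_handler_step, sim_rename and
sim_can_start_io are hypothetical names, not InnoDB functions; the mutex
and the sleep are left out; and n_pending_flushes is folded into n_pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fil_space_t / fil_node_t fields that the
   quoted loops look at. Not the real InnoDB structures. */
typedef struct {
    bool stop_ios;  /* set by the renaming thread to hold back new I/O */
    int  n_pending; /* I/O requests already queued but not yet completed */
} sim_space_t;

enum { SIM_MAX_RETRIES = 25000 }; /* same bound as in the quoted loop */

/* One step of an I/O handler: completes one pending request, but only if
   the handler has actually been woken up. */
static void sim_io_handler_step(sim_space_t *space, bool handler_awake)
{
    if (handler_awake && space->n_pending > 0) {
        space->n_pending--;
    }
}

/* Models the retry loop of fil_rename_tablespace. Returns true when the
   rename could proceed, false when it gives up after too many retries.
   *passes_out receives the number of passes through the loop. */
static bool sim_rename(sim_space_t *space, bool handler_awake,
                       int *passes_out)
{
    int count = 0;

    assert(space != NULL && passes_out != NULL);

    for (;;) {
        count++;
        if (count > SIM_MAX_RETRIES) {
            space->stop_ios = false; /* give up and let I/O continue */
            *passes_out = count;
            return false;
        }

        space->stop_ios = true;
        if (space->n_pending > 0) {
            /* The real code sleeps here; the model instead gives the
               I/O handler one chance to run before retrying. */
            sim_io_handler_step(space, handler_awake);
            continue;
        }

        *passes_out = count;
        return true;
    }
}

/* Models the check in fil_mutex_enter_and_prepare_for_io: a thread that
   wants to start new I/O has to keep retrying while stop_ios is set. */
static bool sim_can_start_io(const sim_space_t *space)
{
    return !space->stop_ios;
}
```

In this model, a space with 3 pending requests lets sim_rename succeed on
its 4th pass when the handler is awake, and makes it give up on pass 25001
when the handler never runs. Assuming os_thread_sleep takes microseconds,
25000 retries of 20000 each in the real loop amount to roughly 500 seconds,
during which any thread asking to start I/O on that space is held back.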


Part 2) is easy to accept because it is easy to reproduce with GDB, but 1)
is very hard to reproduce, with either
DEBUG_SYNC or gdb non-stop mode. There are still some open questions on 1):

In the normal case, the io_handler threads should do fil_io, and the master
thread also helps with it under heavy workload.
In the backtrace, we can see the master thread doing fil_io for the
intermediate table file, although that should
be done by an io_handler thread. I am puzzled why the io_handler threads do
not work in this situation. Why do they not
receive the wake-up signal?

Therefore, for 1), the patch is only a workaround. However, it is
worth applying to avoid lost
tables until the root cause is found and fixed.

Looking forward to a root-cause fix for the lost tables on DDL.


regards,
hickey
