List:Commits« Previous MessageNext Message »
From:Mattias Jonsson Date:April 1 2008 9:38pm
Subject:Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479
View as plain text  
Hi,

In the bottom I have a proposal for handling the multi row insert 
statement with statement replication problem.

It will not release the auto_increment_lock taken at the first 
get_auto_increment until release_auto_increment is called.
I have read through the code in sql_insert.cc, sql_load.cc and 
sql_table.cc and as I read it, ha_release_auto_increment is always 
called after an insert select, insert replace or load data. (can create 
select be a problem here?)

It is all in the new patch I committed.

Regards
Mattias

Guilhem Bichot wrote:
> On Thu, Mar 27, 2008 at 05:16:37PM +0100, Mattias Jonsson wrote:
>> Guilhem Bichot wrote:
>>> On Fri, Mar 14, 2008 at 01:54:08AM +0100, Mattias Jonsson wrote:
>>>> Guilhem Bichot wrote:
>>>>> On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote:
>>>>>> ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3
> -0
>>>>>> Bug#33479: auto_increment failures in partitioning
>>>>>>
>>>>>> Several problems with auto_increment in partitioning
>>>>>> (with MyISAM, InnoDB and Falcon)
>>>>>>
>>>>>> Updated code only patch (the test cases will be in a separate
> patches
>>>>>> 5.1+ (InnoDB+MyISAM) and 6.0 (Falcon)
>>>>>> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc

>>>>>> @@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows()
> 
>>>>>> +        ha_data->next_auto_inc_val= next_insert_id;
>>>>>> +      unlock_auto_increment();
>>>>>> +    }
>>>>>> +    DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id:
> %llu",
>>>>>> +                        ha_data->next_auto_inc_val,
> next_insert_id));
>>>>>>  }
>>>>>>  DBUG_VOID_RETURN;
>>>>>> }
>>>> Can there be any problems with concurrent multi row insert and statement
> 
>>>> replications?
>>> For statement-level replication to work, a multi-row INSERT (INSERT
>>> VALUES or INSERT SELECT) must use consecutive values.
>>> Let's assume that two concurrent statements are doing
>>> INSERT INTO t SELECT * FROM u;
>>> where t is a partitioned table, having an autoinc column first in the
>>> index.
>>> Each statement will call get_auto_increment() with
>>> nb_desired_values==1 (see handler::update_auto_increment();
>>> estimation_rows_to_insert is always 0 in INSERT SELECT):
>>> statement1 moves next_auto_inc_val to 2
>>>  statement2 moves next_auto_inc_val to 3
>>> statement1 inserts 1
>>>  statement2 inserts 2
>>> statement1 notices that it has more rows to insert because SELECT
>>> returned more than one row (see handler::update_auto_increment()),
>>> calls get_auto_increment() with nb_desired_values==2 (twice the
>>> previous nb_desired_values):
>>> statement1 moves next_auto_inc_val to 5
>>> statement1 inserts 3
>>> and then we have the problem: statement1 has inserted 1 and 3, it's
>>> not consecutive and will not replicate well.
>>>
>>> If the underlying engine of the partitioned table has table-level
>>> locking, in fact there is no problem (statement2 will run only after
>>> statement1 is done).
>>> This is also the case of InnoDB:
>>> - if innodb_auto_inc_lock_mode==2, it's documented to not work with
>>> statement-based replication
>>> - if ==0, it does table-level locking
>>> - if ==1, it does table-level locking except if the statement is
>>> INSERT/REPLACE VALUES (and the discussed case above is INSERT
>>> SELECT).
>>> But InnoDB does this fine logic only in
>>> ha_innobase::get_auto_increment() (called when the table is not
>>> partitioned), which is not called by
>>> ha_partition::get_auto_increment().
>>>
>>> So I suspect that yes, there are problems. This is a suspicion. A
>>> multi-threaded test or code inspection could help know the truth.
>>> Possible solutions:
>>> - make ha_partition::get_auto_increment() imitate
>>> ha_innobase::get_auto_increment(): let it keep the autoinc mutex if
>>> (thd->lex->sql_command is != SQLCOM_INSERT && !row_based).
>>> - let ha_partition switch to row-based replication if inserting into
>>> an auto_increment column.
>>> - make ha_partition::get_auto_increment() call engine's
>>> get_auto_increment() to ask it to reserve [max,
>>> max+nb_desired_values*increment]. The problem is that engine's
>>> get_auto_increment() API does not tell the engine the lower bound of
>>> the interval (*first_value is not supposed to be read by the engine),
>>> only the interval's width (nb_desired_values). InnoDB does read
>>> *first_value, violating the API, using that as a lower bound. We could
>>> make InnoDB's violation the new norm: *first_value would become an
>>> INOUT parameter (see the email I sent to Innobase recently, cc:you).
>>> - have statement-based binlogging of multiple autoinc intervals
>>> http://forge.mysql.com/worklog/task.php?id=3404 implemented.
>>> - something better?
>>>
>>> In all cases, this is a significant problem, a discussion with the
>>> other reviewer (Serg) would be needed.
>>>
>>> Falcon supports only row-based replication right now.
>>>
>> How about the easy way for start:
>> Rely on the partitioned storage engines support for this (if it work
>> without partitioning, then it will work with partitioning). I have to
>> check that it actually will work if you have auto_increment_offset =
>> no_of_partitions and partition by hash(ai_col), if the table lock is on
>> the table_share, then it will be OK, but it might be a problem if is not.
>> Do this by adding a comment, and add it to the bug report for the
>> doc-team for adding it in the manual.
>>
>>
>> keep the autoinc mutex, where should it be released? in
>> release_auto_increment? would probably work.
>>
>> switch to row-based replication, this is new grounds for me, but is that
>> always possible? (is it used often?)
>>
>> Forward the call of get_auto_increment to all the partitions, I think
>> that *first_value as an INOUT would be good, but since nb_desired_values
>> is not forced, Falcon always uses 1 as example, it will not help in all
>> cases. And I do not really see this as a solution, since it would lower 
>> the performance, but it might be used for the first auto_increment value 
>> for each partition just to be sure...
> 
> We agreed on IRC to evaluate a solution where get_auto_increment() is
> forwarded to all partitions, and helped by holding the mutex. Could
> you please sketch it? I am not absolutely sure that it would work,
> especially if while we are computing the max of reserved intervals, an
> INSERT VALUES(some_fixed_value) comes, where some_fixed_value == the
> max.
> 

(for insert values (),(),()... it always tries to reserve as many values 
as given).

Instead of forwarding the get_auto_increment, I changed the patch to not 
release the mutex until release_auto_increment is called if there is a 
multi row insert statement and statement based replication.

This proposed change will work with all statement based replication 
since it is prohibiting concurrent multi row insert statements, by 
serializing them using a mutex.

I added this in get_auto_increment (where auto_increment_lock already 
was called):

     /*
       in a multi row insert statement like insert ... select, load data
       we must hold a lock/mutex for the whole statement if we have 
statement
       based replication.
     */
     if (!auto_increment_multi_row_stmt_lock &&
         thd->variables.binlog_format == BINLOG_FORMAT_STMT &&
         (thd->lex->sql_command == SQLCOM_REPLACE_SELECT ||
          thd->lex->sql_command == SQLCOM_INSERT_SELECT ||
          thd->lex->sql_command == SQLCOM_LOAD))
     {
       DBUG_PRINT("info", ("locking auto_increment_multi_row_stmt_lock"));
       auto_increment_multi_row_stmt_lock= TRUE;
     }


Would it also need thd->variables.binlog_format == BINLOG_FORMAT_MIXED, 
or does that always do row based with auto_increment inserts? (or maybe 
thd->lex->sql_command == SQLCOM_CREATE for create .. select?)

also added this in release_auto_increment:
       /* Unlock the multi row statement lock taken in get_auto_increment */
       if (auto_increment_multi_row_stmt_lock)
       {
         auto_increment_multi_row_stmt_lock= FALSE;
         DBUG_PRINT("info", ("unlocking 
auto_increment_multi_row_stmt_lock"));
       }

       unlock_auto_increment();

and changed so that lock_auto_increment returns directly if it is 
holding the multi_row lock:
   virtual void lock_auto_increment()
   {
     if (auto_increment_multi_row_stmt_lock)
       return;
     DBUG_ASSERT(table_share->ha_data && !auto_increment_lock);
     if(table_share->tmp_table == NO_TMP_TABLE)
     {
       auto_increment_lock= TRUE;
       pthread_mutex_lock(&table_share->mutex);
     }
   }

and will only release it if it does not have the multi_row lock:
   virtual void unlock_auto_increment()
   {
     DBUG_ASSERT(table_share->ha_data);
     if(auto_increment_lock && !auto_increment_multi_row_stmt_lock)
     {
       pthread_mutex_unlock(&table_share->mutex);
       auto_increment_lock= FALSE;
     }
   }


-- 
Mattias Jonsson, Software Engineer
MySQL AB, www.mysql.com

Are you MySQL certified?  www.mysql.com/certification

Thread
bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479mattiasj5 Mar
  • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Sergei Golubchik7 Mar
    • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Sergei Golubchik7 Mar
      • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Mattias Jonsson8 Mar
        • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Sergei Golubchik10 Mar
  • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Guilhem Bichot13 Mar
    • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Mattias Jonsson14 Mar
      • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Guilhem Bichot18 Mar
        • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Mattias Jonsson27 Mar
          • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Guilhem Bichot28 Mar
            • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Mattias Jonsson1 Apr
              • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Guilhem Bichot10 Apr