Hi,
In the bottom I have a proposal for handling the multi row insert
statement with statement replication problem.
It will not release the auto_increment_lock taken at the first
get_auto_increment until release_auto_increment is called.
I have read through the code in sql_insert.cc, sql_load.cc and
sql_table.cc and as I read it, ha_release_auto_increment is always
called after an insert select, insert replace or load data. (can create
select be a problem here?)
It is all in the new patch I committed.
Regards
Mattias
Guilhem Bichot wrote:
> On Thu, Mar 27, 2008 at 05:16:37PM +0100, Mattias Jonsson wrote:
>> Guilhem Bichot wrote:
>>> On Fri, Mar 14, 2008 at 01:54:08AM +0100, Mattias Jonsson wrote:
>>>> Guilhem Bichot wrote:
>>>>> On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote:
>>>>>> ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3
> -0
>>>>>> Bug#33479: auto_increment failures in partitioning
>>>>>>
>>>>>> Several problems with auto_increment in partitioning
>>>>>> (with MyISAM, InnoDB and Falcon)
>>>>>>
>>>>>> Updated code only patch (the test cases will be in a separate
> patches
>>>>>> 5.1+ (InnoDB+MyISAM) and 6.0 (Falcon)
>>>>>> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc
>>>>>> @@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows()
>
>>>>>> + ha_data->next_auto_inc_val= next_insert_id;
>>>>>> + unlock_auto_increment();
>>>>>> + }
>>>>>> + DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id:
> %llu",
>>>>>> + ha_data->next_auto_inc_val,
> next_insert_id));
>>>>>> }
>>>>>> DBUG_VOID_RETURN;
>>>>>> }
>>>> Can there be any problems with concurrent multi row insert and statement
>
>>>> replications?
>>> For statement-level replication to work, a multi-row INSERT (INSERT
>>> VALUES or INSERT SELECT) must use consecutive values.
>>> Let's assume that two concurrent statements are doing
>>> INSERT INTO t SELECT * FROM u;
>>> where t is a partitioned table, having an autoinc column first in the
>>> index.
>>> Each statement will call get_auto_increment() with
>>> nb_desired_values==1 (see handler::update_auto_increment();
>>> estimation_rows_to_insert is always 0 in INSERT SELECT):
>>> statement1 moves next_auto_inc_val to 2
>>> statement2 moves next_auto_inc_val to 3
>>> statement1 inserts 1
>>> statement2 inserts 2
>>> statement1 notices that it has more rows to insert because SELECT
>>> returned more than one row (see handler::update_auto_increment()),
>>> calls get_auto_increment() with nb_desired_values==2 (twice the
>>> previous nb_desired_values):
>>> statement1 moves next_auto_inc_val to 5
>>> statement1 inserts 3
>>> and then we have the problem: statement1 has inserted 1 and 3, it's
>>> not consecutive and will not replicate well.
>>>
>>> If the underlying engine of the partitioned table has table-level
>>> locking, in fact there is no problem (statement2 will run only after
>>> statement1 is done).
>>> This is also the case of InnoDB:
>>> - if innodb_auto_inc_lock_mode==2, it's documented to not work with
>>> statement-based replication
>>> - if ==0, it does table-level locking
>>> - if ==1, it does table-level locking except if the statement is
>>> INSERT/REPLACE VALUES (and the discussed case above is INSERT
>>> SELECT).
>>> But InnoDB does this fine logic only in
>>> ha_innobase::get_auto_increment() (called when the table is not
>>> partitioned), which is not called by
>>> ha_partition::get_auto_increment().
>>>
>>> So I suspect that yes, there are problems. This is a suspicion. A
>>> multi-threaded test or code inspection could help know the truth.
>>> Possible solutions:
>>> - make ha_partition::get_auto_increment() imitate
>>> ha_innobase::get_auto_increment(): let it keep the autoinc mutex if
>>> (thd->lex->sql_command is != SQLCOM_INSERT && !row_based).
>>> - let ha_partition switch to row-based replication if inserting into
>>> an auto_increment column.
>>> - make ha_partition::get_auto_increment() call engine's
>>> get_auto_increment() to ask it to reserve [max,
>>> max+nb_desired_values*increment]. The problem is that engine's
>>> get_auto_increment() API does not tell the engine the lower bound of
>>> the interval (*first_value is not supposed to be read by the engine),
>>> only the interval's width (nb_desired_values). InnoDB does read
>>> *first_value, violating the API, using that as a lower bound. We could
>>> make InnoDB's violation the new norm: *first_value would become an
>>> INOUT parameter (see the email I sent to Innobase recently, cc:you).
>>> - have statement-based binlogging of multiple autoinc intervals
>>> http://forge.mysql.com/worklog/task.php?id=3404 implemented.
>>> - something better?
>>>
>>> In all cases, this is a significant problem, a discussion with the
>>> other reviewer (Serg) would be needed.
>>>
>>> Falcon supports only row-based replication right now.
>>>
>> How about the easy way for start:
>> Rely on the partitioned storage engines support for this (if it work
>> without partitioning, then it will work with partitioning). I have to
>> check that it actually will work if you have auto_increment_offset =
>> no_of_partitions and partition by hash(ai_col), if the table lock is on
>> the table_share, then it will be OK, but it might be a problem if is not.
>> Do this by adding a comment, and add it to the bug report for the
>> doc-team for adding it in the manual.
>>
>>
>> keep the autoinc mutex, where should it be released? in
>> release_auto_increment? would probably work.
>>
>> switch to row-based replication, this is new grounds for me, but is that
>> always possible? (is it used often?)
>>
>> Forward the call of get_auto_increment to all the partitions, I think
>> that *first_value as an INOUT would be good, but since nb_desired_values
>> is not forced, Falcon always uses 1 as example, it will not help in all
>> cases. And I do not really see this as a solution, since it would lower
>> the performance, but it might be used for the first auto_increment value
>> for each partition just to be sure...
>
> We agreed on IRC to evaluate a solution where get_auto_increment() is
> forwarded to all partitions, and helped by holding the mutex. Could
> you please sketch it? I am not absolutely sure that it would work,
> especially if while we are computing the max of reserved intervals, an
> INSERT VALUES(some_fixed_value) comes, where some_fixed_value == the
> max.
>
(for insert values (),(),()... it always tries to reserve as many values
as given).
Instead of forwarding the get_auto_increment, I changed the patch to not
release the mutex until release_auto_increment is called if there is a
multi row insert statement and statement based replication.
This proposed change will work with all statement based replication
since it is prohibiting concurrent multi row insert statements, by
serializing them using a mutex.
I added this in get_auto_increment (where auto_increment_lock already
was called):
/*
in a multi row insert statement like insert ... select, load data
we must hold a lock/mutex for the whole statement if we have
statement
based replication.
*/
if (!auto_increment_multi_row_stmt_lock &&
thd->variables.binlog_format == BINLOG_FORMAT_STMT &&
(thd->lex->sql_command == SQLCOM_REPLACE_SELECT ||
thd->lex->sql_command == SQLCOM_INSERT_SELECT ||
thd->lex->sql_command == SQLCOM_LOAD))
{
DBUG_PRINT("info", ("locking auto_increment_multi_row_stmt_lock"));
auto_increment_multi_row_stmt_lock= TRUE;
}
Would it also need thd->variables.binlog_format == BINLOG_FORMAT_MIXED,
or does that always do row based with auto_increment inserts? (or maybe
thd->lex->sql_command == SQLCOM_CREATE for create .. select?)
also added this in release_auto_increment:
/* Unlock the multi row statement lock taken in get_auto_increment */
if (auto_increment_multi_row_stmt_lock)
{
auto_increment_multi_row_stmt_lock= FALSE;
DBUG_PRINT("info", ("unlocking
auto_increment_multi_row_stmt_lock"));
}
unlock_auto_increment();
and changed so that lock_auto_increment returns directly if it is
holding the multi_row lock:
virtual void lock_auto_increment()
{
if (auto_increment_multi_row_stmt_lock)
return;
DBUG_ASSERT(table_share->ha_data && !auto_increment_lock);
if(table_share->tmp_table == NO_TMP_TABLE)
{
auto_increment_lock= TRUE;
pthread_mutex_lock(&table_share->mutex);
}
}
and will only release it if it does not have the multi_row lock:
virtual void unlock_auto_increment()
{
DBUG_ASSERT(table_share->ha_data);
if(auto_increment_lock && !auto_increment_multi_row_stmt_lock)
{
pthread_mutex_unlock(&table_share->mutex);
auto_increment_lock= FALSE;
}
}
--
Mattias Jonsson, Software Engineer
MySQL AB, www.mysql.com
Are you MySQL certified? www.mysql.com/certification