Hi,
Thanks for you comments.
I must say that this is the best review I ever got (both for cheering me
up and as well as good advice and thorough code AND test review!) Thanks!
My replies marked as MJ.
I will probably commit an update to the previous patch tomorrow. Then it
will be easier to see the changes.
Regards
Mattias
Guilhem Bichot wrote:
> Hello,
>
> As this is a big patch, I marked my comments with "GB" for easier
> searching.
> By the way this is an excellent patch.
>
> On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote:
>> ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3 -0
>> Bug#33479: auto_increment failures in partitioning
>>
>> Several problems with auto_increment in partitioning
>> (with MyISAM, InnoDB and Falcon)
>>
>> Updated code only patch (the test cases will be in a separate patches
>> 5.1+ (InnoDB+MyISAM) and 6.0 (Falcon)
>> This is a full code only patch which I have updated according to sergs
>> comments
>> Changed the auto_increment handling for partitioning:
>> Added a ha_data variable in table_share for storage engine specific data such
> as
>> auto_increment value handling in partitioning
>> and using the ha_data->mutex to lock around read + update.
>>
>> The idea is this:
>> Store the table's reserved auto_increment value in
>> the TABLE_SHARE and use a mutex to lock it for reading and updateing it,
>> only accessing all partitions when it is not initialized.
>> Also allow reservations of ranges, and if no one has done a reservation
>> afterwards, lower the reservation to what was actually used after
>> the statement is done (via release_auto_increment from WL 3146).
>>
>> This should also lead to better concurrancy
>> and work with any local storage engine.
>>
>> sql/ha_partition.cc@stripped, 2008-03-05 23:56:03+01:00, mattiasj@witty. +261 -88
>> Bug#33479: Failures using auto_increment and partitioning
>>
>> Changed ha_partition::get_auto_increment from file->get_auto_increment
>> to file->info(HA_AUTO_STATUS), since it is works better with InnoDB
>>
>> Using the new table_share->ha_data for keeping the auto_increment
>> value, shared by all instances of the same table.
>> It is read+updated when holding a auto_increment specific mutex.
>> Also added release_auto_increment to decrease gaps if possible.
>>
>> sql/ha_partition.h@stripped, 2008-03-05 23:56:03+01:00, mattiasj@witty. +43 -1
>> Bug#33479: Failures using auto_increment and partitioning
>>
>> Added a new struct HA_DATA_PARTITION to be used in table_share->ha_data
>> Added a private function to set auto_increment values if needed
>> Removed the restore_auto_increment (the hander version is better)
>> Added lock/unlock functions for auto_increment handling.
>> Missed some special case initializations
>>
>> sql/table.h@stripped, 2008-03-05 23:56:03+01:00, mattiasj@witty. +4 -0
>> Bug#33479: Failures using auto_increment and partitioning
>>
>> Added a variable in table_share: ha_data for storage of storage engine
>> specific data (such as auto_increment handling in partitioning).
>>
>> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc
>> --- a/sql/ha_partition.cc 2007-12-20 19:16:51 +01:00
>> +++ b/sql/ha_partition.cc 2008-03-05 23:56:03 +01:00
>> @@ -160,7 +160,7 @@ const uint ha_partition::NO_CURRENT_PART
>>
>> ha_partition::ha_partition(handlerton *hton, TABLE_SHARE *share)
>> :handler(hton, share), m_part_info(NULL), m_create_handler(FALSE),
>> - m_is_sub_partitioned(0), is_clone(FALSE)
>> + m_is_sub_partitioned(0), is_clone(FALSE), auto_increment_lock(FALSE)
>> {
>> DBUG_ENTER("ha_partition::ha_partition(table)");
>> init_handler_variables();
>> @@ -182,7 +182,8 @@ ha_partition::ha_partition(handlerton *h
>> ha_partition::ha_partition(handlerton *hton, partition_info *part_info)
>> :handler(hton, NULL), m_part_info(part_info),
>> m_create_handler(TRUE),
>> - m_is_sub_partitioned(m_part_info->is_sub_partitioned()), is_clone(FALSE)
>> + m_is_sub_partitioned(m_part_info->is_sub_partitioned()), is_clone(FALSE),
>> + auto_increment_lock(FALSE)
>> {
>> DBUG_ENTER("ha_partition::ha_partition(part_info)");
>> init_handler_variables();
>> @@ -1570,13 +1571,13 @@ int ha_partition::copy_partitions(ulongl
>> table since it doesn't fit into any partition any longer due to
>> changed partitioning ranges or list values.
>> */
>> - deleted++;
>> + (*deleted)++;
>
> GB This bug would have been caught by the compiler if the method was
> declared as
> int ha_partition::copy_partitions(ulonglong * const copied, ulonglong * const
> deleted)
>
> Could you please do this change?
MJ: I will change this. (and also handler::ha_change_partitions and
handler::change_partitions)
>
>> }
>> else
>> {
>> THD *thd= ha_thd();
>> /* Copy record to new handler */
>> - copied++;
>> + (*copied)++;
>> tmp_disable_binlog(thd); /* Do not replicate the low-level changes. */
>> result= m_new_file[new_part]->ha_write_row(m_rec0);
>> reenable_binlog(thd);
>> @@ -2354,6 +2355,24 @@ int ha_partition::open(const char *name,
>> 0, key_rec_cmp, (void*)this)))
>> goto err_handler;
>>
>> +
>> + if (table_share->tmp_table == NO_TMP_TABLE)
>> + pthread_mutex_lock(&table->s->mutex);
>> + if (!table_share->ha_data)
>> + {
>> + HA_DATA_PARTITION *ha_data;
>> + /* currently only needed for auto_increment */
>> + table_share->ha_data= alloc_root(&table_share->mem_root,
>> + sizeof(HA_DATA_PARTITION));
>> + if (!table_share->ha_data)
>> + goto err_handler;
>> + ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> + bzero(ha_data, sizeof(HA_DATA_PARTITION));
>> + if (table_share->tmp_table == NO_TMP_TABLE)
>> + pthread_mutex_init(&ha_data->mutex, MY_MUTEX_INIT_FAST);
>
> GB you need to destroy this mutex, and do so before
> table_share->mem_root is freed.
>
MJ: Where should I do that? I cannot find any good spot. Should I maybe
use the table_share->mutex and remove the ha_data->mutex instead?
I added the ha_data->mutex for finer granuality, and allow more
concurrancy, maybe we can add this first when serg has made a worklog
about the table_share->ha_data as he said in a mail 2008-02-19:
When you're done, I'll create a WL about making a proper use of
TABLE_SHARE::ha_data (which would include adding constructor/destructor
methods to the handlerton, moving all storage engines to the new API,
moving existing partition-only fields in the TABLE_SHARE to the new
structure that you'll create).
>> + }
>> + if (table_share->tmp_table == NO_TMP_TABLE)
>> + pthread_mutex_unlock(&table->s->mutex);
>
> GB you may want to have a bool on the stack which keeps the value of
> table_share->tmp_table == NO_TMP_TABLE, to avoid doing the test 3
> times in the worst case.
> To have maybe less pointer dereference of table_share (in
> table_share->ha_data) you could do
> table_share->ha_data= ha_data= (HA_DATA_PARTITION*) alloc_root(...);
> if (!ha_data) // or ha_data==NULL, as you like
> goto err_handler;
> bzero(ha_data, sizeof(HA_DATA_PARTITION));
> This would be one dereference instead of three.
>
MJ: I added:
bool is_not_tmp_table= (table_share->tmp_table == NO_TMP_TABLE)
and changed the ha_data as you proposed.
>> /*
>> Some handlers update statistics as part of the open call. This will in
>> some cases corrupt the statistics of the partition handler and thus
>> @@ -2695,8 +2714,9 @@ int ha_partition::write_row(uchar * buf)
>> uint32 part_id;
>> int error;
>> longlong func_value;
>> - bool autoincrement_lock= FALSE;
>> + bool have_auto_increment= table->next_number_field && buf ==
> table->record[0];
>> my_bitmap_map *old_map;
>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> THD *thd= ha_thd();
>> #ifdef NOT_NEEDED
>> uchar *rec0= m_rec0;
>> @@ -2712,31 +2732,30 @@ int ha_partition::write_row(uchar * buf)
>> If we have an auto_increment column and we are writing a changed row
>> or a new row, then update the auto_increment value in the record.
>> */
>> - if (table->next_number_field && buf == table->record[0])
>> + if (have_auto_increment)
>> {
>> /*
>> - Some engines (InnoDB for example) can change autoincrement
>> - counter only after 'table->write_row' operation.
>> - So if another thread gets inside the ha_partition::write_row
>> - before it is complete, it gets same auto_increment value,
>> - which means DUP_KEY error (bug #27405)
>> - Here we separate the access using table_share->mutex, and
>> - use autoincrement_lock variable to avoid unnecessary locks.
>> - Probably not an ideal solution.
>> + must initialize it before write if it explicity set,
>> + since the partitions can have higher auto_increment_value
>> + which should be used.
>> */
>> - if (table_share->tmp_table == NO_TMP_TABLE)
>> + if (!ha_data->auto_inc_initialized &&
>> + !table->s->next_number_keypart &&
>> + table_share->tmp_table == NO_TMP_TABLE)
>> {
>> /*
>> - Bug#30878 crash when alter table from non partitioned table
>> - to partitioned.
>> - Checking if tmp table then there is no need to lock,
>> - and the table_share->mutex may not be initialised.
>> + If auto_increment in table_share is not initialized, we must
>> + have a mutex around update_auto_increment,
>> + get_auto_increment and write_row.
>
> GB why must we have a mutex around etc? Could you please show a
> problematic scenario if we don't have this mutex for so long?
>
MJ: This is a relic from the Bug#27405...
But after rethinking, it should be enough just initializing the
ha_data->next_auto_inc_val and then let new code do its work. This since
innobase::get_auto_increment is never called and can not give any
duplicats any more.
I think the problematic scenario was when concurrent inserting into a
partitioned innodb table, it could lead to a duplicat keys.
>> + (get_auto_increment can be called from update_auto_increment
>> + and update_auto_increment can be called from write_row).
>> + Not needed for tmp-tables, or auto_increment in secondary
>> + columns in multi-column index.
>> + Also check ha_partition::get_auto_increment().
>> */
>> - autoincrement_lock= TRUE;
>> - pthread_mutex_lock(&table_share->mutex);
>> + lock_auto_increment();
>> }
>> error= update_auto_increment();
>> -
>> /*
>> If we have failed to set the auto-increment value for this row,
>> it is highly likely that we will not be able to insert it into
>> @@ -2771,10 +2790,11 @@ int ha_partition::write_row(uchar * buf)
>> DBUG_PRINT("info", ("Insert in partition %d", part_id));
>> tmp_disable_binlog(thd); /* Do not replicate the low-level changes. */
>> error= m_file[part_id]->ha_write_row(buf);
>> + if (have_auto_increment && !table->s->next_number_keypart)
>> + set_auto_increment_if_higher();
>> reenable_binlog(thd);
>> exit:
>> - if (autoincrement_lock)
>> - pthread_mutex_unlock(&table_share->mutex);
>> + unlock_auto_increment();
>> DBUG_RETURN(error);
>> }
>>
>> @@ -2838,17 +2858,24 @@ int ha_partition::update_row(const uchar
>> goto exit;
>> }
>>
>> - /*
>> - TODO:
>> - set_internal_auto_increment=
>> - max(set_internal_auto_increment, new_data->auto_increment)
>> - */
>> m_last_part= new_part_id;
>> if (new_part_id == old_part_id)
>> {
>> DBUG_PRINT("info", ("Update in partition %d", new_part_id));
>> tmp_disable_binlog(thd); /* Do not replicate the low-level changes. */
>> error= m_file[new_part_id]->ha_update_row(old_data, new_data);
>> + /*
>> + if updating a auto_increment row, update
>
> "auto_increment column" would be more accurate.
>
MJ: Point taken, will change that.
>> + auto_increment_value in table_share if needed
>> + (not to be used if auto_increment on secondary field in a multi-
>> + column index)
>> + */
>> + if (table->next_number_field && new_data == table->record[0]
> &&
>> + !table->s->next_number_keypart)
>> + {
>> + set_auto_increment_if_higher();
>> + unlock_auto_increment();
>> + }
>> reenable_binlog(thd);
>> goto exit;
>> }
>> @@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows()
>> {
>> int error;
>> handler **file;
>> + THD *thd= current_thd;
>
> GB In handler code like this, ha_thd() is to be used instead of
> current_thd, because we think it could be faster:
> - it's a function call like pthread_getspecific() is (current_thd uses
> pthread_getspecific())
> - it is in mysqld whereas pthread_getspecific() is in libc (farther so
> slower to access, probably one indirection more).
> - ha_thd() has good chances of just returning table->in_use, which is
> probably not more work than pthread_getspecific() does, at least this
> one:
>
> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/threads/tsd.c#_thr_getspecific
> does more work (opensolaris is what popped first in Google - sign of
> the times?).
>
MJ: Thanks for showing me, I used it since it was used in several places
in the partitioning code. Should I change all current_thd to ha_thd in
ha_partition.cc?
>> DBUG_ENTER("ha_partition::delete_all_rows");
>>
>> + if (thd->lex->sql_command == SQLCOM_TRUNCATE)
>> + {
>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> + lock_auto_increment();
>> + ha_data->next_auto_inc_val= 0;
>> + ha_data->auto_inc_initialized= FALSE;
>> + unlock_auto_increment();
>> + }
>> file= m_file;
>> do
>> {
>> if ((error= (*file)->ha_delete_all_rows()))
>> DBUG_RETURN(error);
>> + /* must reset the auto_increment for some engines (eg MyISAM) */
>> + if (thd->lex->sql_command == SQLCOM_TRUNCATE)
>> + (*file)->ha_reset_auto_increment(0);
>
> GB This piece moved to another bug report, I understood.
>
MJ: I will remove those tree lines. Should it not be a comment in
handler.h for delete_all_rows() that it should reset auto_increment if
thd->lex->sql_command == SQLCOM_TRUNCATE?
>> } while (*(++file));
>> DBUG_RETURN(0);
>> }
>> @@ -4412,20 +4451,56 @@ int ha_partition::handle_ordered_prev(uc
>> int ha_partition::info(uint flag)
>> {
>> handler *file, **file_array;
>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>
> GB this declaration should be moved to the block which uses this variable:
> the if(flag&HA_STATUS_AUTO) block.
>
(I have reorganized the ::info function regaring HA_STATUS_AUTO, see below.)
MJ: Yes. I moved it down to ->
>> DBUG_ENTER("ha_partition:info");
>>
>> if (flag & HA_STATUS_AUTO)
>> {
MJ: -> here.
>> - ulonglong auto_increment_value= 0;
>> DBUG_PRINT("info", ("HA_STATUS_AUTO"));
>> - file_array= m_file;
>> - do
>> + if (ha_data->auto_inc_initialized &&
> !table_share->next_number_keypart)
>
> GB this is reading ha_data->auto_inc_initialized without mutex: there may
> be another thread changing this variable at this moment. Then we may
> go into the 'else' branch though next_auto_inc_val has already just
> been set to true by the other thread, but this is ok as we will then
> only do a useless setting of next_auto_inc_val.
MJ: Yes this is a start of a race, but since I will acquire a lock
before set_if_bigger(next_auto_inc_val,ai..) (in the branch where it is
not initialized) it will not get updated if has changed. So I guess I
will add a comment about it. Actually this is the thing with the
ha_data->auto_inc_initialized, since it is a bool, then I assume I do
not have to aquire any lock/mutex for reading it and it will only be
written once during the life of the table_share. And the few times it
actually are more than one thread running after a check of
auto_inc_initialized == 0, there will be no harm, since the handling of
ha_data->next_auto_inc_val always should be locked before reading and
updating.
>> {
>> - file= *file_array;
>> - file->info(HA_STATUS_AUTO);
>> - set_if_bigger(auto_increment_value, file->stats.auto_increment_value);
>> - } while (*(++file_array));
>> - stats.auto_increment_value= auto_increment_value;
>> + DBUG_PRINT("info", ("Using ha_data->next_auto_inc_val: %llu",
>> + ha_data->next_auto_inc_val));
>
> GB From what I see in MSDN, %ll[u] is not supported by VS 2003 (it is in
> C99 but Microsoft apparently needed more than 4 years to implement it
> - now let's see how many flame emails I get for writing that
> gratuitous sarcasm). It is supported starting from VS2005. There is a
> VS2003 machine in pushbuild, so we may have to do some debugging on
> it, thus we should avoid %ll[u] in the patch (the rest of MySQL code
> already avoids it and uses instead [u]llstr/%s). I agree this is a
> hassle.
>
MJ: OK, I will probably remove all DBUG_PRINT. the handler.cc code uses
"%lu", (ulong) nr instead, is that ok? (else should I do #ifdef DEBUG;
char buf[22] #endif before?)
>> +
>> + lock_auto_increment();
>> + stats.auto_increment_value= ha_data->next_auto_inc_val;
>> + unlock_auto_increment();
>> + }
>> + else
>> + {
>> + ulonglong auto_increment_value= 0;
>> + file_array= m_file;
>> + DBUG_PRINT("info", ("checking all partitions for auto_increment_value"));
>> + do
>> + {
>> + file= *file_array;
>> + file->info(HA_STATUS_AUTO);
>> + set_if_bigger(auto_increment_value,
> file->stats.auto_increment_value);
>> + DBUG_PRINT("info", ("file->stats.auto_increment_value: %llu",
>> + file->stats.auto_increment_value));
>> + } while (*(++file_array));
>> +
>> + /*
>> + This should not occur unless bugs in the other storage engine.
>> + If locked, it is called via get_auto_increment as a part of
>> + initializing and is handled there.
>> + */
>
> GB Could you please expand on "This should not occur unless bugs in the
> other storage engine?"? If the if() block below is useless except if
> bugs, we could change it to DBUG_ASSERT(). It sounds weird that:
> SHOW TABLE STATUS on a table of the buggy engine would return 0 in the
> Auto_increment column, but SHOW TABLE STATUS on a partitioned table
> and whose partitions are of the buggy engine, would return 1 in the
> Auto_increment column.
> And I prefer to not do defensive programming when dealing with storage
> engines but rather let engines' developers fixed their bugs.
>
MJ: There is one more possible bug in InnoDB which we now cover up, and
that is the ha_partition::get_auto_increment call to
innobase::get_auto_increment for a second insert, could return a
untouched *first_value. (this was what first drove me to using the
info() call instead...)
debugging the following test (in the original code):
create table t1 (a int not null auto_increment primary key)
engine=innodb partition by hash(a) partitions 2;
insert into t1 values (null);
insert into t1 values (null); # <-- in this call, the *first_value in
ha_innobase::get_auto_increment will not be touched! (i.e. not used as
an OUT variable).
My idea was: this is the case when the first insert to the table is
VALUES (0) and sql_mode='NO_AUTO_VALUE_ON_ZERO' especially on InnoDB.
And there are two bugs in Falcon that also affect this (bug#33661 and
bug#33662) but I now rely on those patches.
I have now rewritten this into:
if (flag & HA_STATUS_AUTO)
{
bool auto_inc_is_first_in_idx= (table_share->next_number_keypart == 0);
HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
DBUG_PRINT("info", ("HA_STATUS_AUTO"));
if (!table->found_next_number_field)
stats.auto_increment_value= 0;
else if (ha_data->auto_inc_initialized && auto_inc_is_first_in_idx)
{
lock_auto_increment();
stats.auto_increment_value= ha_data->next_auto_inc_val;
unlock_auto_increment();
}
else
{
lock_auto_increment();
/* to avoid two concurrent initializations, check again when
locked */
if (ha_data->auto_inc_initialized)
stats.auto_increment_value= ha_data->next_auto_inc_val;
else
{
ulonglong auto_increment_value= 0;
file_array= m_file;
DBUG_PRINT("info",
("checking all partitions for auto_increment_value"));
do
{
file= *file_array;
file->info(HA_STATUS_AUTO);
set_if_bigger(auto_increment_value,
file->stats.auto_increment_value);
} while (*(++file_array));
DBUG_ASSERT(auto_increment_value);
stats.auto_increment_value= auto_increment_value;
if (auto_inc_is_first_in_idx)
{
set_if_bigger(ha_data->next_auto_inc_val, auto_increment_value);
ha_data->auto_inc_initialized= TRUE;
}
}
unlock_auto_increment();
}
}
>> + if (!auto_increment_lock && auto_increment_value == 0)
>> + auto_increment_value= 1;
>
> GB In spite of the comment, I don't understand why the fact of being
> locked has an influence of if we should set to 1 or not.
> When can auto_increment_value==0 be true?
>
MJ: I rewrote it...
I think my idea was something like: if auto_increment_lock, it is called
from get_auto_increment and it could be an insert of VALUES (0) and
sql_mode = 'NO_AUTO_VALUE_ON_ZERO' and thus allowed.
>
>> + stats.auto_increment_value= auto_increment_value;
>> + if (!table_share->next_number_keypart)
>> + {
>> + bool is_already_locked= auto_increment_lock;
>> + if (!is_already_locked)
>> + lock_auto_increment();
>> + set_if_bigger(ha_data->next_auto_inc_val, auto_increment_value);
>> + ha_data->auto_inc_initialized= TRUE;
>> + if (!is_already_locked)
>> + unlock_auto_increment();
>> + }
>> + }
>> + DBUG_PRINT("info", ("stats.auto_increment_value: %llu",
>> + stats.auto_increment_value));
>> }
>> if (flag & HA_STATUS_VARIABLE)
>> {
>> @@ -5579,19 +5654,37 @@ int ha_partition::cmp_ref(const uchar *r
>> MODULE auto increment
>> ****************************************************************************/
>>
>> -void ha_partition::restore_auto_increment(ulonglong)
>> -{
>> - DBUG_ENTER("ha_partition::restore_auto_increment");
>>
>> - DBUG_VOID_RETURN;
>> +int ha_partition::reset_auto_increment(ulonglong value)
>> +{
>> + handler **pos, **end;
>> + int res;
>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> + DBUG_ENTER("ha_partition::reset_auto_increment");
>> + DBUG_PRINT("info", ("value: %llu", value));
>> + lock_auto_increment();
>> + ha_data->auto_inc_initialized= FALSE;
>> + ha_data->next_auto_inc_val= 0;
>> + for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>
> GB all occurences of "m_file+ m_tot_parts" should have a space before +.
>
MJ: I will change this, thanks for noticing (although it was not me who
wrote it, bk revtool says monty :)
>> + {
>> + if ((res= (*pos)->ha_reset_auto_increment(value)) != 0)
>> + {
>> + unlock_auto_increment();
>> + DBUG_RETURN(res);
>> + }
>> + }
>> + unlock_auto_increment();
>> + DBUG_RETURN(0);
>
> GB or maybe (smaller code, different behaviour in case if error but not
> clear if that itself is important):
> int res= 0;
> ...
> for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
> res|= (*pos)->ha_reset_auto_increment(value);
> unlock_auto_increment();
> DBUG_RETURN(res != 0);
>
MJ: I can rewrite it as this, but do really res != 0 give a correct
error code? (I could add a new error and handle it in
ha_partition::print_error)
in sql_delete.cc row 331:
int error2= table->file->ha_reset_auto_increment(0);
if (error2 && (error2 != HA_ERR_WRONG_COMMAND))
{
table->file->print_error(error2, MYF(0));
>
>> /*
>> This method is called by update_auto_increment which in turn is called
>> - by the individual handlers as part of write_row. We will always let
>> - the first handler keep track of the auto increment value for all
>> - partitions.
>> + by the individual handlers as part of write_row. We use the
>> + table_share->ha_data->next_auto_inc_val, or search all
>> + partitions for the highest auto_increment_value if not initialized or
>> + if auto_increment field is a secondary part of a key, we must search
>> + every partition with a mutex to be sure of correctness.
>> */
>
> GB While you are changing the comment, you could format it for doxygen.
>
MJ: I will try to update the comments for doxygen
>>
>> void ha_partition::get_auto_increment(ulonglong offset, ulonglong increment,
>> @@ -5599,59 +5692,120 @@ void ha_partition::get_auto_increment(ul
>> ulonglong *first_value,
>> ulonglong *nb_reserved_values)
>> {
>> - ulonglong first_value_part, last_value_part, nb_reserved_values_part,
>> - last_value= ~ (ulonglong) 0;
>> - handler **pos, **end;
>> - bool retry= TRUE;
>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> DBUG_ENTER("ha_partition::get_auto_increment");
>> -
>> -again:
>> - for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>> + DBUG_PRINT("info", ("offset: %llu inc: %llu desired_values: %llu "
>> + "first_value: %llu", offset, increment,
>> + nb_desired_values, *first_value));
>> + DBUG_ASSERT(increment && nb_desired_values);
>> + *first_value= 0;
>> + if (table->s->next_number_keypart)
>> {
>> - first_value_part= *first_value;
>> - (*pos)->get_auto_increment(offset, increment, nb_desired_values,
>> - &first_value_part,
> &nb_reserved_values_part);
>> - if (first_value_part == ~(ulonglong)(0)) // error in one partition
>> - {
>> - *first_value= first_value_part;
>> - sql_print_error("Partition failed to reserve auto_increment value");
>> - DBUG_VOID_RETURN;
>> - }
>> /*
>> - Partition has reserved an interval. Intersect it with the intervals
>> - already reserved for the previous partitions.
>> + next_number_keypart is != 0 if the auto_increment column is a secondary
>> + column in the index (it is allowed in MyISAM)
>> */
>
> GB In this branch, the concept of reserving multiple value does not
> make sense, as the comment in ha_myisam.cc says:
>
> /*
> MySQL needs to call us for next row: assume we are inserting ("a",null)
> here, we return 3, and next this statement will want to insert ("b",null):
> there is no reason why ("b",3+1) would be the good row to insert: maybe it
> already exists, maybe 3+1 is too large...
> */
> *nb_reserved_values= 1;
>
> So ha_partition::get_auto_increment() should not look for an
> intersection.
> The intersection was my idea, my code and it's wrong, Mikael and
> I had identified that already. When I coded it it was not a problem
> because all engines, even InnoDB, did table-level locking for the
> duration of the whole INSERT statement, so they were all reserving to
> +inf, thus the intersection was always doing a max in fact.
> Now that InnoDB reserves finite intervals, we have to get rid of the
> intersection and use a max.
> It does not make sense to think about intervals in this case.
>
> In the branch above, you just want a max: you can just call
> get_auto_increment(), then use the max of all
> first_value_part. last_value_part is not needed.
>
MJ: I thought of changing it (see my comments to serg's review), but I
could not really get the scope of the use, so I let it be. But now, when
I have some history behind it, I will change it to just be simple and
loop over every partition and just use the highest value and always only
reserve one value.
>> - last_value_part= (nb_reserved_values_part == ULONGLONG_MAX) ?
>> - ULONGLONG_MAX : (first_value_part + nb_reserved_values_part * increment);
>> - set_if_bigger(*first_value, first_value_part);
>> - set_if_smaller(last_value, last_value_part);
>> - }
>> - if (last_value < *first_value) /* empty intersection, error */
>> - {
>> + DBUG_PRINT("info", ("next_number_keypart != 0"));
>> + ulonglong first_value_part, last_value_part, nb_reserved_values_part,
>> + last_value= ~ (ulonglong) 0;
>> + handler **pos, **end;
>> /*
>> - When we have an empty intersection, it means that one or more
>> - partitions may have a significantly different autoinc next value.
>> - We should not fail here - it just means that we should try to
>> - find a new reservation making use of the current *first_value
>> - wbich should now be compatible with all partitions.
>> + Must lock and find highest value among all partitions
>> */
>> - if (retry)
>> + lock_auto_increment();
>> + for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>> {
>> - retry= FALSE;
>> - last_value= ~ (ulonglong) 0;
>> - release_auto_increment();
>> - goto again;
>> + first_value_part= 0;
>> + (*pos)->get_auto_increment(offset, increment, nb_desired_values,
>> + &first_value_part,
> &nb_reserved_values_part);
>> + if (first_value_part == ~(ulonglong)(0)) // error in one partition
>> + {
>> + *first_value= first_value_part;
>> + sql_print_error("Partition failed to reserve auto_increment value");
>> + unlock_auto_increment();
>> + DBUG_VOID_RETURN;
>> + }
>> + DBUG_PRINT("info", ("partition loop, first_value_part: %llu",
> first_value_part));
>> + /*
>> + Partition has reserved an interval. Intersect it with the intervals
>> + already reserved for the previous partitions.
>> + */
>> + last_value_part= (nb_reserved_values_part == ULONGLONG_MAX) ?
>> + ULONGLONG_MAX : (first_value_part + nb_reserved_values_part *
> increment);
>> + set_if_bigger(*first_value, first_value_part);
>> + set_if_smaller(last_value, last_value_part);
>> + }
>> + DBUG_PRINT("info", ("*first_value: %llu last_value: %llu", *first_value,
>> + last_value));
>> + if (last_value < *first_value)
>> + {
>> + /*
>> + Set last_value to *first_value. This is safe because of the mutex.
>> + */
>> + last_value= *first_value;
>> }
>> + if (increment) // If not check for values
>
> GB the function starts with a DBUG_ASSERT() which checks that
> "increment" is true.
MJ: I will simplify as you say below.
>
>> + {
>> + *nb_reserved_values= (last_value == ULONGLONG_MAX) ?
>> + ULONGLONG_MAX : ((last_value - *first_value) / increment);
>> + }
>> + unlock_auto_increment();
>
> GB I think you should simplify this whole branch:
> - call get_auto_increment() in a loop, with nb_desired_values=1
> - compute the max into *first_value, set *nb_reserved_values to 1
> - and be done with it.
>
MJ: I will change it like this, thanks for clearing up things for me.
>> + }
>> + else if (!ha_data->auto_inc_initialized)
>> + {
>> /*
>> - We should not get here.
>> + Not initialized, but should be locked in ha_partition::write_row().
>> + (Or is a temporary table).
>
> GB how about adding
> if (this is not tmp table)
> safe_mutex_assert_owner(the expected mutex);
> ?
>
MJ: Thanks, did not now of that function.
>> + Initialize to highest value among all partitions
>> */
>> - sql_print_error("Failed to calculate auto_increment value for partition");
>> -
>> - *first_value= ~(ulonglong)(0);
>> + info(HA_STATUS_AUTO);
>> + *first_value= ha_data->next_auto_inc_val;
>> + if (*first_value == 0 &&
>
> GB When can *first_value be 0?
>
MJ: this must be some dead code left from my experiements with
sql_mode='NO_AUTO_ON_ZERO' and insert into t values (0) as the first
insert... But this does not seem to be any need for handling here
anymore, it is handled in handler::update_auto_increment(). I will
remove this.
>> + !(table->auto_increment_field_not_null &&
>> + table->next_number_field->val_int() == 0 &&
>> + current_thd->variables.sql_mode & MODE_NO_AUTO_VALUE_ON_ZERO))
>> + {
>> + /*
>> + First auto_increment, set to 1
>> + (if not explicit number 0 and not null and sql_mode set)
>> + */
>> + DBUG_PRINT("info", ("Setting *first_value to 1"));
>> + *first_value= 1;
>> + }
>> + DBUG_PRINT("info", ("*first_value: %llu", *first_value));
>> +
>> + ha_data->next_auto_inc_val= *first_value + nb_desired_values *
> increment;
>> + ha_data->auto_inc_initialized= TRUE;
>> + *nb_reserved_values= nb_desired_values;
>> }
>> - if (increment) // If not check for values
>> - *nb_reserved_values= (last_value == ULONGLONG_MAX) ?
>> - ULONGLONG_MAX : ((last_value - *first_value) / increment);
>> + else
>> + {
>> + /*
>> + This is an optimized solution that not require a mutex around
>> + write_row for auto_increment handling, instead only using a mutex
>> + for picking the next auto_increment_value.
>> +
>> + Get a lock for handling the auto_increment in table_share->ha_data
>> + for avoiding two concurrent statements getting the same number.
>> + */
>> +
>> + DBUG_PRINT("info", ("reserving %llu auto_inc values", nb_desired_values));
>> + lock_auto_increment();
>> +
>> + DBUG_PRINT("info", ("ha_data->next_auto_inc_val: %llu",
>> + ha_data->next_auto_inc_val));
>> + /* this gets corrected (for offset/increment) in update_auto_increment */
>> + *first_value= ha_data->next_auto_inc_val;
>> + ha_data->next_auto_inc_val= *first_value + nb_desired_values *
> increment;
>> +
>> + unlock_auto_increment();
>> + *nb_reserved_values= nb_desired_values;
>> + }
>> + DBUG_PRINT("info", ("first_value: %llu next_auto_inc_val: %llu "
>> + "nb_reserved_values: %llu", *first_value,
>> + ha_data->next_auto_inc_val,
>> + *nb_reserved_values));
>> +
>> DBUG_VOID_RETURN;
>> }
>>
>> @@ -5659,9 +5813,28 @@ void ha_partition::release_auto_incremen
>> {
>> DBUG_ENTER("ha_partition::release_auto_increment");
>>
>> - for (uint i= 0; i < m_tot_parts; i++)
>> + if (table->s->next_number_keypart)
>> {
>> - m_file[i]->ha_release_auto_increment();
>> + for (uint i= 0; i < m_tot_parts; i++)
>> + m_file[i]->ha_release_auto_increment();
>> + }
>> + else
>> + {
>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> + DBUG_PRINT("info", ("next_auto_inc_val: %llu cur_row_max: %llu min: %llu",
>> + ha_data->next_auto_inc_val,
>> + auto_inc_interval_for_cur_row.maximum(),
>> + auto_inc_interval_for_cur_row.minimum()));
>> + if (next_insert_id)
>> + {
>> + lock_auto_increment();
>> + if (next_insert_id < ha_data->next_auto_inc_val &&
>> + auto_inc_interval_for_cur_row.maximum() >=
> ha_data->next_auto_inc_val)
>
> GB too long line (80 chars).
>
MJ: OK, I thought that it was not more than 80 chars... should I remove
one space in front of the line, or break it up like this:
if (next_insert_id < ha_data->next_auto_inc_val &&
auto_inc_interval_for_cur_row.maximum() >=
ha_data->next_auto_inc_val)
?
>> + ha_data->next_auto_inc_val= next_insert_id;
>> + unlock_auto_increment();
>> + }
>> + DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id: %llu",
>> + ha_data->next_auto_inc_val, next_insert_id));
>> }
>> DBUG_VOID_RETURN;
>> }
>> diff -Nrup a/sql/ha_partition.h b/sql/ha_partition.h
>> --- a/sql/ha_partition.h 2007-09-24 15:30:28 +02:00
>> +++ b/sql/ha_partition.h 2008-03-05 23:56:03 +01:00
>> @@ -37,6 +37,13 @@ typedef struct st_partition_share
>> } PARTITION_SHARE;
>> #endif
>>
>> +/* TODO: move all partition specific data from TABLE_SHARE here */
>
> GB /** @todo
>
MJ: OK, thanks, I'm new with the doxygen comments.
>> +typedef struct st_ha_data_partition
>> +{
>> + ulonglong next_auto_inc_val;
>> + bool auto_inc_initialized;
>> + pthread_mutex_t mutex;
>> +} HA_DATA_PARTITION;
>
> GB if you add a comment for the struct, and maybe for each of its members
> unless its use is easily guessable, it will produce useful info in
> doxygen-generated documentation.
>
MJ: I will add these comments:
..next_auto_inc_val; .. /**< first non reserved value */
..mutes; .. /**< mutex for auto_increment */
>> #define PARTITION_BYTES_IN_POS 2
>> class ha_partition :public handler
>> @@ -141,6 +148,7 @@ private:
>> "own" the m_part_info structure.
>> */
>> bool is_clone;
>> + bool auto_increment_lock; /* lock reading/updating auto_inc */
>> public:
>> handler *clone(MEM_ROOT *mem_root);
>> virtual void set_part_info(partition_info *part_info)
>> @@ -828,12 +836,46 @@ public:
>> auto_increment_column_changed
>> -------------------------------------------------------------------------
>> */
>> - virtual void restore_auto_increment(ulonglong prev_insert_id);
>> virtual void get_auto_increment(ulonglong offset, ulonglong increment,
>> ulonglong nb_desired_values,
>> ulonglong *first_value,
>> ulonglong *nb_reserved_values);
>> virtual void release_auto_increment();
>> +private:
>> + virtual int reset_auto_increment(ulonglong value);
>> + inline virtual void lock_auto_increment()
>
> GB methods declared inside the class' declaration are automatically
> inline (C++ standard says so), but no harm in adding "inline".
>
MJ: OK, I will remove the "inline".
>> + {
>
> GB I think we should have
> DBUG_ASSERT(!auto_increment_lock) here.
> The reason is: auto_increment_lock seems to be used as "remember if
> lock is already acquired or not; if already acquired I don't need to
> acquire it". But this code:
>
> lock_auto_increment();
> f();
> something;
> unlock_auto_increment();
>
> where f() is:
> f()
> {
> lock_auto_increment(); $
> ...;
> unlock_auto_increment(); *
> }
>
> will not work: the line marked with $ will do nothing, but the line
> with * will unlock the mutex, too early if 'something;' expected to be
> protected by the mutex.
> With the assertion we prevent against such wrong double-lock use.
>
> Please also add some description to lock_auto_increment() (like "locks
> mutex if not already locked") and unlock_auto_increment() (like
> "unlocks mutex if it was locked").
>
> Any function which calls unlock_auto_increment() should in theory have
> a comment in its description saying that it unlocks the mutex.
>
> I worry a bit about this design where in a function, the lock may or
> not be unlocked (like in get_auto_increment(), where we unlock or not
> depending on the branch), functions call each other with the lock
> acquired by caller or callee, and it all relies on the proper branches
> being taken by callee. But to be more specific I need to first
> understand why in some cases the mutex has to be held in write_row +
> update_auto_increment + get_auto_increment (a previous question).
>
MJ: I made it this way to reduce the code, (when I thought that I had to
have a mutex for the whole write_row block) I have now removed this and
will always lock, read/alter and unlock within the same function, and I
have added a DBUG_ASSERT(!auto_increment_lock) in lock_auto_increment().
>> + if(!auto_increment_lock && table_share->tmp_table ==
> NO_TMP_TABLE)
>> + {
>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> + auto_increment_lock= TRUE;
>> + pthread_mutex_lock(&ha_data->mutex);
>> + }
>> + }
>> + inline virtual void unlock_auto_increment()
>> + {
>> + if(auto_increment_lock)
>> + {
>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> + pthread_mutex_unlock(&ha_data->mutex);
>> + auto_increment_lock= FALSE;
>> + }
>> + }
>> + inline virtual void set_auto_increment_if_higher()
>> + {
>> + ulonglong nr= table->next_number_field->val_int();
>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> + if (nr >= ha_data->next_auto_inc_val)
>
> GB We are doing a read without mutex above. Why is that correct?
> I have doubts: for example, on Pentium and newer Intel CPUs, a read of
> a ulonglong variable may not be atomic if the ulonglong is not aligned
> on a 8-byte boundary (and looking at how next_auto_inc_val is
> allocated I see no guarantee of such alignment).
> So if two threads are executing the above code at the same time, one
> may see a corrupted ha_data->next_auto_inc_val (a value which never
> really existed) and go wrong.
>
MJ: I thought I had it all covered :) this was one of the first worries
I had. I will change this.
>> + {
>> + lock_auto_increment();
>> + /* must check again when the mutex is taken */
>> + if (nr >= ha_data->next_auto_inc_val)
>> + ha_data->next_auto_inc_val= nr + 1;
>> + ha_data->auto_inc_initialized= TRUE;
>> + }
>> + }
>> +
>> +public:
>>
>> /*
>> -------------------------------------------------------------------------
>> diff -Nrup a/sql/table.h b/sql/table.h
>> --- a/sql/table.h 2007-12-20 21:24:07 +01:00
>> +++ b/sql/table.h 2008-03-05 23:56:03 +01:00
>> @@ -358,6 +358,7 @@ typedef struct st_table_share
>> */
>> int cached_row_logging_check;
>>
>> + /* TODO: Move into *ha_data for partitioning */
>> #ifdef WITH_PARTITION_STORAGE_ENGINE
>> bool auto_partitioned;
>> const char *partition_info;
>> @@ -367,6 +368,9 @@ typedef struct st_table_share
>> uint part_state_len;
>> handlerton *default_part_db_type;
>> #endif
>> +
>> + /* place to store storage engine specific data */
>
> GB /** for doxygen
>
MJ: OK
>> + void *ha_data;
>
Thanks again for your thorough review!
Can there be any problems with concurrent multi row insert and statement
replications?
--
Mattias Jonsson, Software Engineer
MySQL AB, www.mysql.com
Are you MySQL certified? www.mysql.com/certification