List: Commits
From: Mattias Jonsson  Date: March 14 2008 12:54am
Subject:Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479
Hi,

Thanks for your comments.

I must say that this is the best review I have ever gotten (both for 
cheering me up and for the good advice and thorough code AND test 
review!) Thanks!

My replies marked as MJ.

I will probably commit an update to the previous patch tomorrow. Then it 
will be easier to see the changes.

Regards
Mattias

Guilhem Bichot wrote:
> Hello,
> 
> As this is a big patch, I marked my comments with "GB" for easier
> searching.
> By the way this is an excellent patch.
> 
> On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote:
>> ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3 -0
>>   Bug#33479: auto_increment failures in partitioning
>>   
>>   Several problems with auto_increment in partitioning
>>   (with MyISAM, InnoDB and Falcon)
>>   
>>   Updated code only patch (the test cases will be in a separate patches
>>   5.1+ (InnoDB+MyISAM) and 6.0 (Falcon)
>>   This is a full code only patch which I have updated according to sergs
>>   comments
>>   Changed the auto_increment handling for partitioning:
>>   Added a ha_data variable in table_share for storage engine specific data such
> as
>>   auto_increment value handling in partitioning
>>   and using the ha_data->mutex to lock around read + update.
>>   
>>   The idea is this:
>>   Store the table's reserved auto_increment value in
>>   the TABLE_SHARE and use a mutex to lock it for reading and updateing it,
>>   only accessing all partitions when it is not initialized.
>>   Also allow reservations of ranges, and if no one has done a reservation
>>   afterwards, lower the reservation to what was actually used after
>>   the statement is done (via release_auto_increment from WL 3146).
>>   
>>   This should also lead to better concurrancy
>>   and work with any local storage engine.
>>
>>   sql/ha_partition.cc@stripped, 2008-03-05 23:56:03+01:00, mattiasj@witty. +261 -88
>>     Bug#33479: Failures using auto_increment and partitioning
>>     
>>     Changed ha_partition::get_auto_increment from file->get_auto_increment
>>     to file->info(HA_AUTO_STATUS), since it is works better with InnoDB
>>     
>>     Using the new table_share->ha_data for keeping the auto_increment
>>     value, shared by all instances of the same table.
>>     It is read+updated when holding a auto_increment specific mutex.
>>     Also added release_auto_increment to decrease gaps if possible.
>>
>>   sql/ha_partition.h@stripped, 2008-03-05 23:56:03+01:00, mattiasj@witty. +43 -1
>>     Bug#33479: Failures using auto_increment and partitioning
>>     
>>     Added a new struct HA_DATA_PARTITION to be used in table_share->ha_data
>>     Added a private function to set auto_increment values if needed
>>     Removed the restore_auto_increment (the hander version is better)
>>     Added lock/unlock functions for auto_increment handling.
>>     Missed some special case initializations
>>
>>   sql/table.h@stripped, 2008-03-05 23:56:03+01:00, mattiasj@witty. +4 -0
>>     Bug#33479: Failures using auto_increment and partitioning
>>     
>>     Added a variable in table_share: ha_data for storage of storage engine
>>     specific data (such as auto_increment handling in partitioning).
>>
>> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc
>> --- a/sql/ha_partition.cc	2007-12-20 19:16:51 +01:00
>> +++ b/sql/ha_partition.cc	2008-03-05 23:56:03 +01:00
>> @@ -160,7 +160,7 @@ const uint ha_partition::NO_CURRENT_PART
>>  
>>  ha_partition::ha_partition(handlerton *hton, TABLE_SHARE *share)
>>    :handler(hton, share), m_part_info(NULL), m_create_handler(FALSE),
>> -   m_is_sub_partitioned(0), is_clone(FALSE)
>> +   m_is_sub_partitioned(0), is_clone(FALSE), auto_increment_lock(FALSE)
>>  {
>>    DBUG_ENTER("ha_partition::ha_partition(table)");
>>    init_handler_variables();
>> @@ -182,7 +182,8 @@ ha_partition::ha_partition(handlerton *h
>>  ha_partition::ha_partition(handlerton *hton, partition_info *part_info)
>>    :handler(hton, NULL), m_part_info(part_info),
>>     m_create_handler(TRUE),
>> -   m_is_sub_partitioned(m_part_info->is_sub_partitioned()), is_clone(FALSE)
>> +   m_is_sub_partitioned(m_part_info->is_sub_partitioned()), is_clone(FALSE),
>> +   auto_increment_lock(FALSE)
>>  {
>>    DBUG_ENTER("ha_partition::ha_partition(part_info)");
>>    init_handler_variables();
>> @@ -1570,13 +1571,13 @@ int ha_partition::copy_partitions(ulongl
>>             table since it doesn't fit into any partition any longer due to
>>             changed partitioning ranges or list values.
>>          */
>> -        deleted++;
>> +        (*deleted)++;
> 
> GB This bug would have been caught by the compiler if the method was
> declared as
> int ha_partition::copy_partitions(ulonglong * const copied, ulonglong * const
> deleted)
> 
> Could you please do this change?

MJ: I will change this. (and also handler::ha_change_partitions and 
handler::change_partitions)
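The compile-time catch GB describes can be sketched like this (copy_one_row is a hypothetical stand-in for the inner loop of copy_partitions):

```cpp
#include <cassert>

typedef unsigned long long ulonglong;

/*
  With "ulonglong * const", the pointer itself is read-only, so the
  original bug -- writing "deleted++" instead of "(*deleted)++" --
  becomes a compile error instead of silently stepping the pointer.
*/
static int copy_one_row(ulonglong *const copied, ulonglong *const deleted,
                        bool fits_in_a_partition)
{
  if (fits_in_a_partition)
    (*copied)++;                      // increments the counter, as intended
  else
    (*deleted)++;                     // "deleted++" here would not compile
  return 0;
}
```

The `const` costs nothing at runtime; it only forbids reseating the parameter inside the function body.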

> 
>>        }
>>        else
>>        {
>>          THD *thd= ha_thd();
>>          /* Copy record to new handler */
>> -        copied++;
>> +        (*copied)++;
>>          tmp_disable_binlog(thd); /* Do not replicate the low-level changes. */
>>          result= m_new_file[new_part]->ha_write_row(m_rec0);
>>          reenable_binlog(thd);
>> @@ -2354,6 +2355,24 @@ int ha_partition::open(const char *name,
>>                           0, key_rec_cmp, (void*)this)))
>>      goto err_handler;
>>  
>> +
>> +  if (table_share->tmp_table == NO_TMP_TABLE)
>> +    pthread_mutex_lock(&table->s->mutex);
>> +  if (!table_share->ha_data)
>> +  {
>> +    HA_DATA_PARTITION *ha_data;
>> +    /* currently only needed for auto_increment */
>> +    table_share->ha_data= alloc_root(&table_share->mem_root,
>> +                                     sizeof(HA_DATA_PARTITION));
>> +    if (!table_share->ha_data)
>> +      goto err_handler;
>> +    ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +    bzero(ha_data, sizeof(HA_DATA_PARTITION));
>> +    if (table_share->tmp_table == NO_TMP_TABLE)
>> +      pthread_mutex_init(&ha_data->mutex, MY_MUTEX_INIT_FAST);
> 
> GB you need to destroy this mutex, and do so before
> table_share->mem_root is freed.
> 

MJ: Where should I do that? I cannot find any good spot. Should I maybe 
use the table_share->mutex and remove the ha_data->mutex instead?

I added the ha_data->mutex for finer granularity, to allow more 
concurrency; maybe we can add this later, once serg has made a worklog 
about the table_share->ha_data, as he said in a mail 2008-02-19:

When you're done, I'll create a WL about making a proper use of
TABLE_SHARE::ha_data (which would include adding constructor/destructor
methods to the handlerton, moving all storage engines to the new API,
moving existing partition-only fields in the TABLE_SHARE to the new
structure that you'll create).
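Wherever the destroy ends up, the pairing GB asks for looks roughly like this sketch (names are hypothetical; plain new/delete stands in for table_share->mem_root, which never runs destructors on its own):

```cpp
#include <pthread.h>

/*
  Hypothetical sketch of the lifetime pairing under discussion: the
  mutex living inside the share-level ha_data block must be destroyed
  in a teardown path that runs before the memory holding it is freed.
*/
struct HA_DATA_PARTITION_SKETCH
{
  pthread_mutex_t mutex;
  unsigned long long next_auto_inc_val;
  bool auto_inc_initialized;
};

static HA_DATA_PARTITION_SKETCH *ha_data_create()
{
  HA_DATA_PARTITION_SKETCH *d = new HA_DATA_PARTITION_SKETCH();
  pthread_mutex_init(&d->mutex, NULL);  // paired with the destroy below
  return d;
}

static void ha_data_destroy(HA_DATA_PARTITION_SKETCH *d)
{
  pthread_mutex_destroy(&d->mutex);     // must run before the memory goes away
  delete d;
}
```

With mem_root allocation the destroy call would have to be hooked into whatever path frees the TABLE_SHARE, which is exactly the spot that is hard to find here.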

>> +  }
>> +  if (table_share->tmp_table == NO_TMP_TABLE)
>> +    pthread_mutex_unlock(&table->s->mutex);
> 
> GB you may want to have a bool on the stack which keeps the value of
> table_share->tmp_table == NO_TMP_TABLE, to avoid doing the test 3
> times in the worst case.
> To have maybe less pointer dereference of table_share (in
> table_share->ha_data) you could do
> table_share->ha_data= ha_data= (HA_DATA_PARTITION*) alloc_root(...);
> if (!ha_data) // or ha_data==NULL, as you like
>   goto err_handler;
> bzero(ha_data, sizeof(HA_DATA_PARTITION));
> This would be one dereference instead of three.
> 

MJ: I added:
bool is_not_tmp_table= (table_share->tmp_table == NO_TMP_TABLE);
and changed the ha_data as you proposed.

>>    /*
>>      Some handlers update statistics as part of the open call. This will in
>>      some cases corrupt the statistics of the partition handler and thus
>> @@ -2695,8 +2714,9 @@ int ha_partition::write_row(uchar * buf)
>>    uint32 part_id;
>>    int error;
>>    longlong func_value;
>> -  bool autoincrement_lock= FALSE;
>> +  bool have_auto_increment= table->next_number_field && buf ==
> table->record[0];
>>    my_bitmap_map *old_map;
>> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>>    THD *thd= ha_thd();
>>  #ifdef NOT_NEEDED
>>    uchar *rec0= m_rec0;
>> @@ -2712,31 +2732,30 @@ int ha_partition::write_row(uchar * buf)
>>      If we have an auto_increment column and we are writing a changed row
>>      or a new row, then update the auto_increment value in the record.
>>    */
>> -  if (table->next_number_field && buf == table->record[0])
>> +  if (have_auto_increment)
>>    {
>>      /*
>> -      Some engines (InnoDB for example) can change autoincrement
>> -      counter only after 'table->write_row' operation.
>> -      So if another thread gets inside the ha_partition::write_row
>> -      before it is complete, it gets same auto_increment value,
>> -      which means DUP_KEY error (bug #27405)
>> -      Here we separate the access using table_share->mutex, and
>> -      use autoincrement_lock variable to avoid unnecessary locks.
>> -      Probably not an ideal solution.
>> +      must initialize it before write if it explicity set,
>> +      since the partitions can have higher auto_increment_value
>> +      which should be used.
>>      */
>> -    if (table_share->tmp_table == NO_TMP_TABLE)
>> +    if (!ha_data->auto_inc_initialized &&
>> +        !table->s->next_number_keypart &&
>> +        table_share->tmp_table == NO_TMP_TABLE)
>>      {
>>        /*
>> -        Bug#30878 crash when alter table from non partitioned table
>> -        to partitioned.
>> -        Checking if tmp table then there is no need to lock,
>> -        and the table_share->mutex may not be initialised.
>> +        If auto_increment in table_share is not initialized, we must
>> +        have a mutex around update_auto_increment,
>> +        get_auto_increment and write_row.
> 
> GB why must we have a mutex around etc? Could you please show a
> problematic scenario if we don't have this mutex for so long?
> 

MJ: This is a relic from Bug#27405...
But after rethinking, it should be enough to just initialize 
ha_data->next_auto_inc_val and then let the new code do its work, since 
innobase::get_auto_increment is never called and cannot give any 
duplicates any more.

I think the problematic scenario was concurrent inserts into a 
partitioned InnoDB table, which could lead to duplicate keys.

>> +        (get_auto_increment can be called from update_auto_increment
>> +        and update_auto_increment can be called from write_row).
>> +        Not needed for tmp-tables, or auto_increment in secondary
>> +        columns in multi-column index.
>> +        Also check ha_partition::get_auto_increment().
>>        */
>> -      autoincrement_lock= TRUE;
>> -      pthread_mutex_lock(&table_share->mutex);
>> +      lock_auto_increment();
>>      }
>>      error= update_auto_increment();
>> -
>>      /*
>>        If we have failed to set the auto-increment value for this row,
>>        it is highly likely that we will not be able to insert it into
>> @@ -2771,10 +2790,11 @@ int ha_partition::write_row(uchar * buf)
>>    DBUG_PRINT("info", ("Insert in partition %d", part_id));
>>    tmp_disable_binlog(thd); /* Do not replicate the low-level changes. */
>>    error= m_file[part_id]->ha_write_row(buf);
>> +  if (have_auto_increment && !table->s->next_number_keypart)
>> +    set_auto_increment_if_higher();
>>    reenable_binlog(thd);
>>  exit:
>> -  if (autoincrement_lock)
>> -    pthread_mutex_unlock(&table_share->mutex);
>> +  unlock_auto_increment();
>>    DBUG_RETURN(error);
>>  }
>>  
>> @@ -2838,17 +2858,24 @@ int ha_partition::update_row(const uchar
>>      goto exit;
>>    }
>>  
>> -  /*
>> -    TODO:
>> -      set_internal_auto_increment=
>> -        max(set_internal_auto_increment, new_data->auto_increment)
>> -  */
>>    m_last_part= new_part_id;
>>    if (new_part_id == old_part_id)
>>    {
>>      DBUG_PRINT("info", ("Update in partition %d", new_part_id));
>>      tmp_disable_binlog(thd); /* Do not replicate the low-level changes. */
>>      error= m_file[new_part_id]->ha_update_row(old_data, new_data);
>> +    /*
>> +      if updating a auto_increment row, update
> 
> "auto_increment column" would be more accurate.
> 

MJ: Point taken, will change that.

>> +      auto_increment_value in table_share if needed
>> +      (not to be used if auto_increment on secondary field in a multi-
>> +      column index)
>> +    */
>> +    if (table->next_number_field && new_data == table->record[0]
> &&
>> +        !table->s->next_number_keypart)
>> +    {
>> +      set_auto_increment_if_higher();
>> +      unlock_auto_increment();
>> +    }
>>      reenable_binlog(thd);
>>      goto exit;
>>    }
>> @@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows()
>>  {
>>    int error;
>>    handler **file;
>> +  THD *thd= current_thd;
> 
> GB In handler code like this, ha_thd() is to be used instead of
> current_thd, because we think it could be faster:
> - it's a function call like pthread_getspecific() is (current_thd uses
> pthread_getspecific())
> - it is in mysqld whereas pthread_getspecific() is in libc (farther so
> slower to access, probably one indirection more).
> - ha_thd() has good chances of just returning table->in_use, which is
> probably not more work than pthread_getspecific() does, at least this
> one:
>
> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/threads/tsd.c#_thr_getspecific
> does more work (opensolaris is what popped first in Google - sign of
> the times?).
> 

MJ: Thanks for showing me; I used it since it was used in several places 
in the partitioning code. Should I change all occurrences of current_thd 
to ha_thd() in ha_partition.cc?

>>    DBUG_ENTER("ha_partition::delete_all_rows");
>>  
>> +  if (thd->lex->sql_command == SQLCOM_TRUNCATE)
>> +  {
>> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +    lock_auto_increment();
>> +    ha_data->next_auto_inc_val= 0;
>> +    ha_data->auto_inc_initialized= FALSE;
>> +    unlock_auto_increment();
>> +  }
>>    file= m_file;
>>    do
>>    {
>>      if ((error= (*file)->ha_delete_all_rows()))
>>        DBUG_RETURN(error);
>> +    /* must reset the auto_increment for some engines (eg MyISAM) */
>> +    if (thd->lex->sql_command == SQLCOM_TRUNCATE)
>> +      (*file)->ha_reset_auto_increment(0);
> 
> GB This piece moved to another bug report, I understood.
> 

MJ: I will remove those three lines. Should there not be a comment in 
handler.h for delete_all_rows() saying that it should reset 
auto_increment if thd->lex->sql_command == SQLCOM_TRUNCATE?

>>    } while (*(++file));
>>    DBUG_RETURN(0);
>>  }
>> @@ -4412,20 +4451,56 @@ int ha_partition::handle_ordered_prev(uc
>>  int ha_partition::info(uint flag)
>>  {
>>    handler *file, **file_array;
>> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> 
> GB this declaration should be moved to the block which uses this variable:
> the if(flag&HA_STATUS_AUTO) block.
> 
(I have reorganized the ::info function regarding HA_STATUS_AUTO, see below.)
MJ: Yes. I moved it down to ->

>>    DBUG_ENTER("ha_partition:info");
>>  
>>    if (flag & HA_STATUS_AUTO)
>>    {

MJ: -> here.

>> -    ulonglong auto_increment_value= 0;
>>      DBUG_PRINT("info", ("HA_STATUS_AUTO"));
>> -    file_array= m_file;
>> -    do
>> +    if (ha_data->auto_inc_initialized &&
> !table_share->next_number_keypart)
> 
> GB this is reading ha_data->auto_inc_initialized without mutex: there may
> be another thread changing this variable at this moment. Then we may
> go into the 'else' branch though next_auto_inc_val has already just
> been set to true by the other thread, but this is ok as we will then
> only do a useless setting of next_auto_inc_val.

MJ: Yes, this is the start of a race, but since I will acquire a lock 
before set_if_bigger(next_auto_inc_val, ...) (in the branch where it is 
not initialized), it will not get updated if it has changed. So I guess 
I will add a comment about it. Actually, this is the point of 
ha_data->auto_inc_initialized: since it is a bool, I assume I do not 
have to acquire any lock/mutex for reading it, and it will only be 
written once during the life of the table_share. And in the few cases 
where more than one thread actually runs after a check of 
auto_inc_initialized == 0, there will be no harm, since 
ha_data->next_auto_inc_val should always be locked before reading and 
updating it.
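The check/re-check pattern described here can be sketched as follows (names hypothetical; this mirrors the pre-C++11 assumption in the thread that an unlocked read of a bool flag is harmless as long as the slow path re-checks under the mutex):

```cpp
#include <cassert>
#include <pthread.h>

/*
  Sketch of the double-checked initialization: the bool flag is read
  without the mutex as a cheap fast-path test; the slow initialization
  path re-checks it under the lock, so two racing threads at worst both
  enter the slow path and the loser only does a redundant
  set_if_bigger-style update.
*/
struct SharedAutoInc
{
  pthread_mutex_t mutex;
  bool initialized;
  unsigned long long next_val;
};

static unsigned long long get_start_value(SharedAutoInc *s,
                                          unsigned long long scanned_max)
{
  if (!s->initialized)                  // unlocked fast check
  {
    pthread_mutex_lock(&s->mutex);
    if (!s->initialized)                // re-check while holding the lock
    {
      if (scanned_max > s->next_val)    // set_if_bigger() equivalent
        s->next_val = scanned_max;
      s->initialized = true;
    }
    pthread_mutex_unlock(&s->mutex);
  }
  pthread_mutex_lock(&s->mutex);        // normal path: update under the lock
  unsigned long long v = ++s->next_val;
  pthread_mutex_unlock(&s->mutex);
  return v;
}
```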

>>      {
>> -      file= *file_array;
>> -      file->info(HA_STATUS_AUTO);
>> -      set_if_bigger(auto_increment_value, file->stats.auto_increment_value);
>> -    } while (*(++file_array));
>> -    stats.auto_increment_value= auto_increment_value;
>> +      DBUG_PRINT("info", ("Using ha_data->next_auto_inc_val: %llu",
>> +                          ha_data->next_auto_inc_val));
> 
> GB From what I see in MSDN, %ll[u] is not supported by VS 2003 (it is in
> C99 but Microsoft apparently needed more than 4 years to implement it
> - now let's see how many flame emails I get for writing that
> gratuitous sarcasm). It is supported starting from VS2005. There is a
> VS2003 machine in pushbuild, so we may have to do some debugging on
> it, thus we should avoid %ll[u] in the patch (the rest of MySQL code
> already avoids it and uses instead [u]llstr/%s). I agree this is a
> hassle.
> 

MJ: OK, I will probably remove all DBUG_PRINTs. The handler.cc code uses 
"%lu", (ulong) nr instead, is that OK? (Otherwise, should I do #ifdef 
DEBUG; char buf[22] #endif before?)

>> +
>> +      lock_auto_increment();
>> +      stats.auto_increment_value= ha_data->next_auto_inc_val;
>> +      unlock_auto_increment();
>> +    }
>> +    else
>> +    {
>> +      ulonglong auto_increment_value= 0;
>> +      file_array= m_file;
>> +      DBUG_PRINT("info", ("checking all partitions for auto_increment_value"));
>> +      do
>> +      {
>> +        file= *file_array;
>> +        file->info(HA_STATUS_AUTO);
>> +        set_if_bigger(auto_increment_value,
> file->stats.auto_increment_value);
>> +        DBUG_PRINT("info", ("file->stats.auto_increment_value: %llu",
>> +                            file->stats.auto_increment_value));
>> +      } while (*(++file_array));
>> +
>> +      /*
>> +        This should not occur unless bugs in the other storage engine.
>> +        If locked, it is called via get_auto_increment as a part of
>> +        initializing and is handled there.
>> +      */
> 
> GB Could you please expand on "This should not occur unless bugs in the
> other storage engine?"? If the if() block below is useless except if
> bugs, we could change it to DBUG_ASSERT(). It sounds weird that:
> SHOW TABLE STATUS on a table of the buggy engine would return 0 in the
> Auto_increment column, but SHOW TABLE STATUS on a partitioned table
> and whose partitions are of the buggy engine, would return 1 in the
> Auto_increment column.
> And I prefer to not do defensive programming when dealing with storage
> engines but rather let engines' developers fixed their bugs.
> 

MJ: There is one more possible bug in InnoDB which we now cover up, and 
that is that the ha_partition::get_auto_increment call to 
innobase::get_auto_increment for a second insert could return an 
untouched *first_value. (This was what first drove me to using the 
info() call instead...)

debugging the following test (in the original code):
create table t1 (a int not null auto_increment primary key) 
engine=innodb partition by hash(a) partitions 2;
insert into t1 values (null);
insert into t1 values (null); # <-- in this call, the *first_value in 
ha_innobase::get_auto_increment will not be touched! (i.e. not used as 
an OUT variable).



My idea was: this is the case when the first insert to the table is 
VALUES (0) and sql_mode='NO_AUTO_VALUE_ON_ZERO' especially on InnoDB.

And there are two bugs in Falcon that also affect this (bug#33661 and 
bug#33662) but I now rely on those patches.

I have now rewritten this into:
   if (flag & HA_STATUS_AUTO)
   {
     bool auto_inc_is_first_in_idx= (table_share->next_number_keypart == 0);
     HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
     DBUG_PRINT("info", ("HA_STATUS_AUTO"));
     if (!table->found_next_number_field)
       stats.auto_increment_value= 0;
     else if (ha_data->auto_inc_initialized && auto_inc_is_first_in_idx)
     {
       lock_auto_increment();
       stats.auto_increment_value= ha_data->next_auto_inc_val;
       unlock_auto_increment();
     }
     else
     {
       lock_auto_increment();
      /* to avoid two concurrent initializations, check again when locked */
       if (ha_data->auto_inc_initialized)
         stats.auto_increment_value= ha_data->next_auto_inc_val;
       else
       {
         ulonglong auto_increment_value= 0;
         file_array= m_file;
         DBUG_PRINT("info",
                    ("checking all partitions for auto_increment_value"));
         do
         {
           file= *file_array;
           file->info(HA_STATUS_AUTO);
           set_if_bigger(auto_increment_value,
                         file->stats.auto_increment_value);
         } while (*(++file_array));

         DBUG_ASSERT(auto_increment_value);
         stats.auto_increment_value= auto_increment_value;
         if (auto_inc_is_first_in_idx)
         {
           set_if_bigger(ha_data->next_auto_inc_val, auto_increment_value);
           ha_data->auto_inc_initialized= TRUE;
         }
       }
       unlock_auto_increment();
     }
   }


>> +      if (!auto_increment_lock && auto_increment_value == 0)
>> +        auto_increment_value= 1;
> 
> GB In spite of the comment, I don't understand why the fact of being
> locked has an influence of if we should set to 1 or not.
> When can auto_increment_value==0 be true?
> 

MJ: I rewrote it...

I think my idea was something like: if auto_increment_lock is set, it is 
called from get_auto_increment, and it could be an insert of VALUES (0) 
with sql_mode = 'NO_AUTO_VALUE_ON_ZERO', and thus allowed.

> 
>> +      stats.auto_increment_value= auto_increment_value;
>> +      if (!table_share->next_number_keypart)
>> +      {
>> +        bool is_already_locked= auto_increment_lock;
>> +        if (!is_already_locked)
>> +          lock_auto_increment();
>> +        set_if_bigger(ha_data->next_auto_inc_val, auto_increment_value);
>> +        ha_data->auto_inc_initialized= TRUE;
>> +        if (!is_already_locked)
>> +          unlock_auto_increment();
>> +      }
>> +    }
>> +    DBUG_PRINT("info", ("stats.auto_increment_value: %llu",
>> +                        stats.auto_increment_value));
>>    }
>>    if (flag & HA_STATUS_VARIABLE)
>>    {
>> @@ -5579,19 +5654,37 @@ int ha_partition::cmp_ref(const uchar *r
>>                  MODULE auto increment
>>  ****************************************************************************/
>>  
>> -void ha_partition::restore_auto_increment(ulonglong)
>> -{
>> -  DBUG_ENTER("ha_partition::restore_auto_increment");
>>  
>> -  DBUG_VOID_RETURN;
>> +int ha_partition::reset_auto_increment(ulonglong value)
>> +{
>> +  handler **pos, **end;
>> +  int res;
>> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +  DBUG_ENTER("ha_partition::reset_auto_increment");
>> +  DBUG_PRINT("info", ("value: %llu", value));
>> +  lock_auto_increment();
>> +  ha_data->auto_inc_initialized= FALSE;
>> +  ha_data->next_auto_inc_val= 0;
>> +  for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
> 
> GB all occurrences of "m_file+ m_tot_parts" should have a space before +.
> 

MJ: I will change this, thanks for noticing (although it was not me who 
wrote it; bk revtool says monty :)

>> +  {
>> +    if ((res= (*pos)->ha_reset_auto_increment(value)) != 0)
>> +    {
>> +      unlock_auto_increment();
>> +      DBUG_RETURN(res);
>> +    }
>> +  }
>> +  unlock_auto_increment();
>> +  DBUG_RETURN(0);
> 
> GB or maybe (smaller code, different behaviour in case if error but not
> clear if that itself is important):
>   int res= 0;
>   ...
>   for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>     res|= (*pos)->ha_reset_auto_increment(value);
>   unlock_auto_increment();
>   DBUG_RETURN(res != 0);
> 

MJ: I can rewrite it like this, but does res != 0 really give a correct 
error code? (I could add a new error and handle it in 
ha_partition::print_error.)

In sql_delete.cc, line 331:
     int error2= table->file->ha_reset_auto_increment(0);

     if (error2 && (error2 != HA_ERR_WRONG_COMMAND))
     {
       table->file->print_error(error2, MYF(0));
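The concern can be illustrated with a small sketch: OR-ing the codes and returning res != 0 loses the real handler error, while remembering the first nonzero code keeps something print_error() could map (reset_fn is a hypothetical stand-in for ha_reset_auto_increment):

```cpp
#include <cassert>

/*
  Visit every partition, but return the first real error code instead
  of a bitwise OR of unrelated codes collapsed to 0/1.
*/
static int reset_all_partitions(int (*reset_fn)(int part), int num_parts)
{
  int first_error = 0;
  for (int part = 0; part < num_parts; part++)
  {
    int res = reset_fn(part);
    if (res != 0 && first_error == 0)
      first_error = res;                // keep the first meaningful code
  }
  return first_error;
}
```

Whether continuing after the first failure is even desirable is a separate question; GB's variant stops behaving differently only in the error case, as he notes.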

>   
>>  /*
>>    This method is called by update_auto_increment which in turn is called
>> -  by the individual handlers as part of write_row. We will always let
>> -  the first handler keep track of the auto increment value for all
>> -  partitions.
>> +  by the individual handlers as part of write_row. We use the
>> +  table_share->ha_data->next_auto_inc_val, or search all
>> +  partitions for the highest auto_increment_value if not initialized or
>> +  if auto_increment field is a secondary part of a key, we must search
>> +  every partition with a mutex to be sure of correctness.
>>  */
> 
> GB While you are changing the comment, you could format it for doxygen.
> 

MJ: I will try to update the comments for doxygen

>>  
>>  void ha_partition::get_auto_increment(ulonglong offset, ulonglong increment,
>> @@ -5599,59 +5692,120 @@ void ha_partition::get_auto_increment(ul
>>                                        ulonglong *first_value,
>>                                        ulonglong *nb_reserved_values)
>>  {
>> -  ulonglong first_value_part, last_value_part, nb_reserved_values_part,
>> -    last_value= ~ (ulonglong) 0;
>> -  handler **pos, **end;
>> -  bool retry= TRUE;
>> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>>    DBUG_ENTER("ha_partition::get_auto_increment");
>> -
>> -again:
>> -  for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>> +  DBUG_PRINT("info", ("offset: %llu inc: %llu desired_values: %llu "
>> +                      "first_value: %llu", offset, increment,
>> +                      nb_desired_values, *first_value));
>> +  DBUG_ASSERT(increment && nb_desired_values);
>> +  *first_value= 0;
>> +  if (table->s->next_number_keypart)
>>    {
>> -    first_value_part= *first_value;
>> -    (*pos)->get_auto_increment(offset, increment, nb_desired_values,
>> -                               &first_value_part,
> &nb_reserved_values_part);
>> -    if (first_value_part == ~(ulonglong)(0)) // error in one partition
>> -    {
>> -      *first_value= first_value_part;
>> -      sql_print_error("Partition failed to reserve auto_increment value");
>> -      DBUG_VOID_RETURN;
>> -    }
>>      /*
>> -      Partition has reserved an interval. Intersect it with the intervals
>> -      already reserved for the previous partitions.
>> +      next_number_keypart is != 0 if the auto_increment column is a secondary
>> +      column in the index (it is allowed in MyISAM)
>>      */
> 
> GB In this branch, the concept of reserving multiple value does not
> make sense, as the comment in ha_myisam.cc says:
> 
>   /*
>     MySQL needs to call us for next row: assume we are inserting ("a",null)
>     here, we return 3, and next this statement will want to insert ("b",null):
>     there is no reason why ("b",3+1) would be the good row to insert: maybe it
>     already exists, maybe 3+1 is too large...
>   */
>   *nb_reserved_values= 1;
> 
> So ha_partition::get_auto_increment() should not look for an
> intersection.
> The intersection was my idea, my code and it's wrong, Mikael and
> I had identified that already. When I coded it it was not a problem
> because all engines, even InnoDB, did table-level locking for the
> duration of the whole INSERT statement, so they were all reserving to
> +inf, thus the intersection was always doing a max in fact.
> Now that InnoDB reserves finite intervals, we have to get rid of the
> intersection and use a max.
> It does not make sense to think about intervals in this case.
> 
> In the branch above, you just want a max: you can just call
> get_auto_increment(), then use the max of all
> first_value_part. last_value_part is not needed.
> 

MJ: I thought of changing it (see my comments on serg's review), but I 
could not really get the scope of the use, so I let it be. But now that 
I have some history behind it, I will change it to be simple: just loop 
over every partition, use the highest value, and always reserve only 
one value.
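The max-only scheme GB proposes boils down to something like this sketch (assuming each partition's get_auto_increment call has already produced a first value):

```cpp
#include <cassert>

/*
  Sketch of the max-only reservation for the secondary-keypart branch:
  take the highest first value any partition returned, and reserve
  exactly one value, as the quoted ha_myisam.cc comment argues.
*/
static void max_of_partitions(const unsigned long long *part_first_values,
                              int num_parts,
                              unsigned long long *first_value,
                              unsigned long long *nb_reserved_values)
{
  *first_value = 0;
  for (int i = 0; i < num_parts; i++)
    if (part_first_values[i] > *first_value)
      *first_value = part_first_values[i];   // set_if_bigger() equivalent
  *nb_reserved_values = 1;              // one value per row in this mode
}
```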

>> -    last_value_part= (nb_reserved_values_part == ULONGLONG_MAX) ?
>> -      ULONGLONG_MAX : (first_value_part + nb_reserved_values_part * increment);
>> -    set_if_bigger(*first_value, first_value_part);
>> -    set_if_smaller(last_value, last_value_part);
>> -  }
>> -  if (last_value < *first_value) /* empty intersection, error */
>> -  {
>> +    DBUG_PRINT("info", ("next_number_keypart != 0"));
>> +    ulonglong first_value_part, last_value_part, nb_reserved_values_part,
>> +      last_value= ~ (ulonglong) 0;
>> +    handler **pos, **end;
>>      /*
>> -      When we have an empty intersection, it means that one or more
>> -      partitions may have a significantly different autoinc next value.
>> -      We should not fail here - it just means that we should try to
>> -      find a new reservation making use of the current *first_value
>> -      wbich should now be compatible with all partitions.
>> +      Must lock and find highest value among all partitions
>>      */
>> -    if (retry)
>> +    lock_auto_increment();
>> +    for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>>      {
>> -      retry= FALSE;
>> -      last_value= ~ (ulonglong) 0;
>> -      release_auto_increment();
>> -      goto again;
>> +      first_value_part= 0;
>> +      (*pos)->get_auto_increment(offset, increment, nb_desired_values,
>> +                                 &first_value_part,
> &nb_reserved_values_part);
>> +      if (first_value_part == ~(ulonglong)(0)) // error in one partition
>> +      {
>> +        *first_value= first_value_part;
>> +        sql_print_error("Partition failed to reserve auto_increment value");
>> +        unlock_auto_increment();
>> +        DBUG_VOID_RETURN;
>> +      }
>> +      DBUG_PRINT("info", ("partition loop, first_value_part: %llu",
> first_value_part));
>> +      /*
>> +        Partition has reserved an interval. Intersect it with the intervals
>> +        already reserved for the previous partitions.
>> +      */
>> +      last_value_part= (nb_reserved_values_part == ULONGLONG_MAX) ?
>> +        ULONGLONG_MAX : (first_value_part + nb_reserved_values_part *
> increment);
>> +      set_if_bigger(*first_value, first_value_part);
>> +      set_if_smaller(last_value, last_value_part);
>> +    }
>> +    DBUG_PRINT("info", ("*first_value: %llu last_value: %llu", *first_value,
>> +                        last_value));
>> +    if (last_value < *first_value)
>> +    {
>> +      /*
>> +        Set last_value to *first_value. This is safe because of the mutex.
>> +      */
>> +      last_value= *first_value;
>>      }
>> +    if (increment)                                // If not check for values
> 
> GB the function starts with a DBUG_ASSERT() which checks that
> "increment" is true.

MJ: I will simplify as you say below.

> 
>> +    {
>> +      *nb_reserved_values= (last_value == ULONGLONG_MAX) ?
>> +        ULONGLONG_MAX : ((last_value - *first_value) / increment);
>> +    }
>> +    unlock_auto_increment();
> 
> GB I think you should simplify this whole branch:
> - call get_auto_increment() in a loop, with nb_desired_values=1
> - compute the max into *first_value, set *nb_reserved_values to 1
> - and be done with it.
> 

MJ: I will change it like this, thanks for clearing things up for me.

>> +  }
>> +  else if (!ha_data->auto_inc_initialized)
>> +  {
>>      /*
>> -      We should not get here.
>> +      Not initialized, but should be locked in ha_partition::write_row().
>> +      (Or is a temporary table).
> 
> GB how about adding
> if (this is not tmp table)
>    safe_mutex_assert_owner(the expected mutex);
> ?
> 

MJ: Thanks, I did not know of that function.

>> +      Initialize to highest value among all partitions
>>      */
>> -    sql_print_error("Failed to calculate auto_increment value for partition");
>> -    
>> -    *first_value= ~(ulonglong)(0);
>> +    info(HA_STATUS_AUTO);
>> +    *first_value= ha_data->next_auto_inc_val;
>> +    if (*first_value == 0 &&
> 
> GB When can *first_value be 0?
> 

MJ: this must be some dead code left from my experiments with
sql_mode='NO_AUTO_VALUE_ON_ZERO' and insert into t values (0) as the first
insert... But there does not seem to be any need for handling it here
anymore; it is handled in handler::update_auto_increment(). I will
remove this.

>> +        !(table->auto_increment_field_not_null &&
>> +          table->next_number_field->val_int() == 0 &&
>> +          current_thd->variables.sql_mode & MODE_NO_AUTO_VALUE_ON_ZERO))
>> +    {
>> +      /*
>> +        First auto_increment, set to 1
>> +        (if not explicit number 0 and not null and sql_mode set)
>> +      */
>> +      DBUG_PRINT("info", ("Setting *first_value to 1"));
>> +      *first_value= 1;
>> +    }
>> +    DBUG_PRINT("info", ("*first_value: %llu", *first_value));
>> +
>> +    ha_data->next_auto_inc_val= *first_value + nb_desired_values *
>> +      increment;
>> +    ha_data->auto_inc_initialized= TRUE;
>> +    *nb_reserved_values= nb_desired_values;
>>    }
>> -  if (increment)                                // If not check for values
>> -    *nb_reserved_values= (last_value == ULONGLONG_MAX) ?
>> -      ULONGLONG_MAX : ((last_value - *first_value) / increment);
>> +  else
>> +  {
>> +    /*
>> +      This is an optimized solution that not require a mutex around
>> +      write_row for auto_increment handling, instead only using a mutex
>> +      for picking the next auto_increment_value.
>> +
>> +      Get a lock for handling the auto_increment in table_share->ha_data
>> +      for avoiding two concurrent statements getting the same number.
>> +    */ 
>> +
>> +    DBUG_PRINT("info", ("reserving %llu auto_inc values", nb_desired_values));
>> +    lock_auto_increment();
>> +
>> +    DBUG_PRINT("info", ("ha_data->next_auto_inc_val: %llu",
>> +                        ha_data->next_auto_inc_val));
>> +    /* this gets corrected (for offset/increment) in update_auto_increment */
>> +    *first_value= ha_data->next_auto_inc_val;
>> +    ha_data->next_auto_inc_val= *first_value + nb_desired_values *
>> +      increment;
>> +
>> +    unlock_auto_increment();
>> +    *nb_reserved_values= nb_desired_values;
>> +  }
>> +  DBUG_PRINT("info", ("first_value: %llu next_auto_inc_val: %llu "
>> +                      "nb_reserved_values: %llu", *first_value,
>> +                      ha_data->next_auto_inc_val,
>> +                      *nb_reserved_values));
>> +
>>    DBUG_VOID_RETURN;
>>  }
>>  
>> @@ -5659,9 +5813,28 @@ void ha_partition::release_auto_incremen
>>  {
>>    DBUG_ENTER("ha_partition::release_auto_increment");
>>  
>> -  for (uint i= 0; i < m_tot_parts; i++)
>> +  if (table->s->next_number_keypart)
>>    {
>> -    m_file[i]->ha_release_auto_increment();
>> +    for (uint i= 0; i < m_tot_parts; i++)
>> +      m_file[i]->ha_release_auto_increment();
>> +  }
>> +  else
>> +  {
>> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +    DBUG_PRINT("info", ("next_auto_inc_val: %llu cur_row_max: %llu min: %llu",
>> +                        ha_data->next_auto_inc_val,
>> +                        auto_inc_interval_for_cur_row.maximum(),
>> +                        auto_inc_interval_for_cur_row.minimum()));
>> +    if (next_insert_id)
>> +    {
>> +      lock_auto_increment();
>> +      if (next_insert_id < ha_data->next_auto_inc_val &&
>> +          auto_inc_interval_for_cur_row.maximum() >=
>> +          ha_data->next_auto_inc_val)
> 
> GB too long line (80 chars).
> 

MJ: OK, I thought that it was not more than 80 chars... should I remove 
one space in front of the line, or break it up like this:
       if (next_insert_id < ha_data->next_auto_inc_val &&
           auto_inc_interval_for_cur_row.maximum() >=
           ha_data->next_auto_inc_val)
?

>> +        ha_data->next_auto_inc_val= next_insert_id;
>> +      unlock_auto_increment();
>> +    }
>> +    DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id: %llu",
>> +                        ha_data->next_auto_inc_val, next_insert_id));
>>    }
>>    DBUG_VOID_RETURN;
>>  }
>> diff -Nrup a/sql/ha_partition.h b/sql/ha_partition.h
>> --- a/sql/ha_partition.h	2007-09-24 15:30:28 +02:00
>> +++ b/sql/ha_partition.h	2008-03-05 23:56:03 +01:00
>> @@ -37,6 +37,13 @@ typedef struct st_partition_share
>>  } PARTITION_SHARE;
>>  #endif
>>  
>> +/* TODO: move all partition specific data from TABLE_SHARE here */
> 
> GB /** @todo
> 

MJ: OK, thanks, I'm new to the doxygen comment style.

>> +typedef struct st_ha_data_partition
>> +{
>> +  ulonglong next_auto_inc_val;
>> +  bool auto_inc_initialized;
>> +  pthread_mutex_t mutex;
>> +} HA_DATA_PARTITION;
> 
> GB if you add a comment for the struct, and maybe for each of its members
> unless its use is easily guessable, it will produce useful info in
> doxygen-generated documentation.
> 

MJ: I will add these comments:
..next_auto_inc_val; .. /**< first non-reserved value */
..mutex; ..             /**< mutex for auto_increment */

>>  #define PARTITION_BYTES_IN_POS 2
>>  class ha_partition :public handler
>> @@ -141,6 +148,7 @@ private:
>>      "own" the m_part_info structure.
>>    */
>>    bool is_clone;
>> +  bool auto_increment_lock;             /* lock reading/updating auto_inc */
>>  public:
>>    handler *clone(MEM_ROOT *mem_root);
>>    virtual void set_part_info(partition_info *part_info)
>> @@ -828,12 +836,46 @@ public:
>>      auto_increment_column_changed
>>       -------------------------------------------------------------------------
>>    */
>> -  virtual void restore_auto_increment(ulonglong prev_insert_id);
>>    virtual void get_auto_increment(ulonglong offset, ulonglong increment,
>>                                    ulonglong nb_desired_values,
>>                                    ulonglong *first_value,
>>                                    ulonglong *nb_reserved_values);
>>    virtual void release_auto_increment();
>> +private:
>> +  virtual int reset_auto_increment(ulonglong value);
>> +  inline virtual void lock_auto_increment()
> 
> GB methods declared inside the class' declaration are automatically
> inline (C++ standard says so), but no harm in adding "inline".
> 

MJ: OK, I will remove the "inline".

>> +  {
> 
> GB I think we should have
> DBUG_ASSERT(!auto_increment_lock) here.
> The reason is: auto_increment_lock seems to be used as "remember if
> lock is already acquired or not; if already acquired I don't need to
> acquire it". But this code:
> 
> lock_auto_increment();
> f();
> something;
> unlock_auto_increment();
> 
> where f() is:
> f()
> {
>   lock_auto_increment(); $
>   ...;
>   unlock_auto_increment(); *
> }
> 
> will not work: the line marked with $ will do nothing, but the line
> with * will unlock the mutex, too early if 'something;' expected to be
> protected by the mutex.
> With the assertion we prevent against such wrong double-lock use.
> 
> Please also add some description to lock_auto_increment() (like "locks
> mutex if not already locked") and unlock_auto_increment() (like
> "unlocks mutex if it was locked").
> 
> Any function which calls unlock_auto_increment() should in theory have
> a comment in its description saying that it unlocks the mutex.
> 
> I worry a bit about this design where in a function, the lock may or
> not be unlocked (like in get_auto_increment(), where we unlock or not
> depending on the branch), functions call each other with the lock
> acquired by caller or callee, and it all relies on the proper branches
> being taken by callee. But to be more specific I need to first
> understand why in some cases the mutex has to be held in write_row +
> update_auto_increment + get_auto_increment (a previous question).
> 

MJ: I made it this way to reduce the code (back when I thought I had to
hold a mutex around the whole write_row block). I have now removed this and
will always lock, read/alter and unlock within the same function, and I
have added a DBUG_ASSERT(!auto_increment_lock) in lock_auto_increment().

>> +    if(!auto_increment_lock && table_share->tmp_table ==
>> +        NO_TMP_TABLE)
>> +    {
>> +      HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +      auto_increment_lock= TRUE;
>> +      pthread_mutex_lock(&ha_data->mutex);
>> +    }
>> +  }
>> +  inline virtual void unlock_auto_increment()
>> +  {
>> +    if(auto_increment_lock)
>> +    {
>> +      HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +      pthread_mutex_unlock(&ha_data->mutex);
>> +      auto_increment_lock= FALSE;
>> +    }
>> +  }
>> +  inline virtual void set_auto_increment_if_higher()
>> +  {
>> +    ulonglong nr= table->next_number_field->val_int();
>> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +    if (nr >= ha_data->next_auto_inc_val)
> 
> GB We are doing a read without mutex above. Why is that correct?
> I have doubts: for example, on Pentium and newer Intel CPUs, a read of
> a ulonglong variable may not be atomic if the ulonglong is not aligned
> on a 8-byte boundary (and looking at how next_auto_inc_val is
> allocated I see no guarantee of such alignment).
> So if two threads are executing the above code at the same time, one
> may see a corrupted ha_data->next_auto_inc_val (a value which never
> really existed) and go wrong.
> 

MJ: I thought I had it all covered :) This was one of the first worries
I had. I will change this.

>> +    {
>> +      lock_auto_increment();
>> +      /* must check again when the mutex is taken */
>> +      if (nr >= ha_data->next_auto_inc_val)
>> +        ha_data->next_auto_inc_val= nr + 1;
>> +      ha_data->auto_inc_initialized= TRUE;
>> +    }
>> +  }
>> +
>> +public:
>>  
>>    /*
>>       -------------------------------------------------------------------------
>> diff -Nrup a/sql/table.h b/sql/table.h
>> --- a/sql/table.h	2007-12-20 21:24:07 +01:00
>> +++ b/sql/table.h	2008-03-05 23:56:03 +01:00
>> @@ -358,6 +358,7 @@ typedef struct st_table_share
>>    */
>>    int cached_row_logging_check;
>>  
>> +  /* TODO: Move into *ha_data for partitioning */
>>  #ifdef WITH_PARTITION_STORAGE_ENGINE
>>    bool auto_partitioned;
>>    const char *partition_info;
>> @@ -367,6 +368,9 @@ typedef struct st_table_share
>>    uint part_state_len;
>>    handlerton *default_part_db_type;
>>  #endif
>> +
>> +  /* place to store storage engine specific data */
> 
> GB /** for doxygen
> 

MJ: OK

>> +  void *ha_data;
> 

Thanks again for your thorough review!

Can there be any problems with concurrent multi-row inserts and
statement-based replication?

-- 
Mattias Jonsson, Software Engineer
MySQL AB, www.mysql.com

Are you MySQL certified?  www.mysql.com/certification