From: Mattias Jonsson  Date: March 3 2008 8:29am
Subject: Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479
Hi,

Thank you very much for your comments (I am still a novice to the code
base and happily learning, and I appreciate your notes!)

I will commit a new full patch with the changes.

I have one concern, and that is how this will work with concurrency,
statement-based replication and multi-row insert statements (like
'insert into ... select' in several threads). I have tried it manually
and it seems to work (both InnoDB and MyISAM currently seem to take a
table lock to handle this), but it would be good if there were a test
case that proves that it works.

My comments are below yours.

Regards
Mattias

Sergei Golubchik wrote:
> Hi!
> 
> On Feb 21, mattiasj@stripped wrote:
>> ChangeSet@stripped, 2008-02-21 10:09:43+01:00, mattiasj@witty. +3 -0
>>   Bug#33479: auto_increment failures in partitioning
>>   
>>   Several problems with auto_increment in partitioning
>>   (with MyISAM, InnoDB and Falcon)
>>   
>>   Code only patch (the test cases will be in a separate patches
>>   5.1+ (InnoDB+MyISAM) and 6.0 (Falcon)
>>   
>>   Changed the auto_increment handling for partitioning:
>>   Added a ha_data variable in table_share for storage engine specific data
>>   such as auto_increment value handling in partitioning
>>   and using the ha_data->mutex to lock around read + update.
>>   
>>   The idea is this:
>>   Store the table's reserved auto_increment value in
>>   the TABLE_SHARE and use a mutex to lock it for reading and updateing it,
>>   only accessing all partitions when it is not initialized.
>>   Also allow reservations of ranges, and if no one has done a reservation
>>   afterwards, lower the reservation to what was actually used after
>>   the statement is done (via release_auto_increment from WL#3146).
>>   
>>   This should be quite fast and work with any local storage engine.
> 
> Generally, I like it, it's certainly on the right track.

Thanks, I appreciate that!

> Thanks for adding TABLE_SHARE::ha_data.
> Still I have some comments and questions, see below:
>  
>> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc
>> --- a/sql/ha_partition.cc	2007-12-20 19:16:51 +01:00
>> +++ b/sql/ha_partition.cc	2008-02-21 10:09:42 +01:00
>> @@ -4412,20 +4443,46 @@ int ha_partition::handle_ordered_prev(uc
>>  int ha_partition::info(uint flag)
>>  {
>>    handler *file, **file_array;
>> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>>    DBUG_ENTER("ha_partition:info");
>>  
>>    if (flag & HA_STATUS_AUTO)
>>    {
>> +    bool auto_increment_lock= FALSE;
>>      DBUG_PRINT("info", ("HA_STATUS_AUTO"));
>> +    if (ha_data->auto_inc_initialized && !table_share->next_number_keypart)
>>      {
>> +      DBUG_PRINT("info", ("Using ha_data->next_auto_inc_val: %llu",
>> +                          ha_data->next_auto_inc_val));
>> +
>> +      lock_auto_increment(&auto_increment_lock);
>> +      stats.auto_increment_value= ha_data->next_auto_inc_val;
>> +      unlock_auto_increment(&auto_increment_lock);
>> +    }
>> +    else
>> +    {
>> +      ulonglong auto_increment_value= 0;
>> +      file_array= m_file;
>> +      DBUG_PRINT("info", ("checking all partitions for auto_increment_value"));
>> +      do
>> +      {
>> +        file= *file_array;
>> +        file->info(HA_STATUS_AUTO);
>> +        set_if_bigger(auto_increment_value, file->stats.auto_increment_value);
>> +        DBUG_PRINT("info", ("file->stats.auto_increment_value: %llu",
>> +                            file->stats.auto_increment_value));
>> +      } while (*(++file_array));
>> +      stats.auto_increment_value= auto_increment_value;
>> +      if (!table_share->next_number_keypart)
>> +      {
>> +        lock_auto_increment(&auto_increment_lock);
>> +        set_if_bigger(ha_data->next_auto_inc_val, auto_increment_value);
> 
> why set_if_bigger here ?

In case of an (unlikely) race condition where another thread manages to
initialize it and set it to a higher value (if, for example, this thread
switches context and another runs a query with an explicit value). Can
this happen? (It is also in line with your comment below: re-check
ha_data->next_auto_inc_val under the mutex.)
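
To illustrate the race I have in mind, here is a minimal standalone sketch (hypothetical names and types, plain pthreads rather than the actual server code) of why the initializing thread should never lower a value another thread may already have raised:

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical stand-in for the shared per-TABLE_SHARE auto_increment state.
struct SharedAutoInc {
  pthread_mutex_t mutex;
  bool initialized;
  unsigned long long next_val;
};

// Initialization path: set_if_bigger semantics. If another thread took the
// mutex first and already raised next_val (e.g. an INSERT with an explicit
// value), we must not overwrite it with our possibly stale maximum.
static void init_from_partitions(SharedAutoInc *s,
                                 unsigned long long max_seen)
{
  pthread_mutex_lock(&s->mutex);
  if (max_seen > s->next_val)          // set_if_bigger, never lower
    s->next_val = max_seen;
  s->initialized = true;
  pthread_mutex_unlock(&s->mutex);
}
```

A plain assignment here would silently rewind the counter if the race occurs; the conditional store makes initialization idempotent and monotonic.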

>> +        ha_data->auto_inc_initialized= TRUE;
>> +        unlock_auto_increment(&auto_increment_lock);
>> +      }
>> +    }
>> +    DBUG_PRINT("info", ("stats.auto_increment_value: %llu",
>> +                        stats.auto_increment_value));
>>    }
>>    if (flag & HA_STATUS_VARIABLE)
>>    {
>> @@ -5579,19 +5636,29 @@ int ha_partition::cmp_ref(const uchar *r
>>                  MODULE auto increment
>>  ****************************************************************************/
>>  
>> -void ha_partition::restore_auto_increment(ulonglong)
>> -{
>>  
>> +int ha_partition::ha_reset_auto_increment(ulonglong value)
> 
> Never. You can only override reset_auto_increment(), and never ha_xxx
> methods. By the way, ha_reset_auto_increment() (and all other ha_xxx
> methods) are not even virtual.
> 
> Something must be wrong with your test, if you didn't notice that you
> "override" a non-virtual method, didn't notice that your
> ha_reset_auto_increment() is never called.

Thank you very much for pointing this out to me. I will change it and
test again. (I must have made some kind of typo, and you are correct
that I have not tested this function; I will add a test with
TRUNCATE TABLE.)

When testing, I found that I had to use reset_auto_increment in
delete_all_rows for SQLCOM_TRUNCATE. Is it OK to file a bug that MyISAM
does not support this function and does not work properly with
TRUNCATE TABLE when the table is partitioned? (Or could you guide me on
how to implement reset_auto_increment properly in ha_myisam.cc?)

Could I also report a bug that 'ALTER TABLE t1 AUTO_INCREMENT = n'
actually rewrites the whole table instead of only setting the storage
engine's internal AUTO_INCREMENT value (and about the lack of an API to
set this without inserting a row)? Or is this intentional?

Is there any documentation, or are there notes, about the naming
conventions for functions, defines, variables and such?

>> +{
>> +  handler **pos, **end;
>> +  int res;
>> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +  DBUG_ENTER("ha_partition::ha_reset_auto_increment");
>> +  DBUG_PRINT("info", ("value: %llu", value));
>> +  for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>> +    if ((res= (*pos)->ha_reset_auto_increment(value)) != 0)
>> +      DBUG_RETURN(res);
>> +  ha_data->auto_inc_initialized= FALSE;
>> +  DBUG_RETURN(0);
>>  }
>>  
>>  
>>  /*
>>    This method is called by update_auto_increment which in turn is called
>> +  by the individual handlers as part of write_row. We use the
>> +  table_share->ha_data->next_auto_inc_val, or search all
>> +  partitions for the highest auto_increment_value if not initialized or
>> +  if auto_increment field is a secondary part of a key, we must search
>> +  every partition with a mutex to be sure of correctness.
>>  */
>>  
>>  void ha_partition::get_auto_increment(ulonglong offset, ulonglong increment,
>>                                        ulonglong *nb_desired_values,
>>                                        ulonglong *first_value,
>>                                        ulonglong *nb_reserved_values)
>>  {
>> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>>    DBUG_ENTER("ha_partition::get_auto_increment");
>> -
>> -again:
>> +  DBUG_PRINT("info", ("offset: %llu inc: %llu desired_values: %llu "
>> +                      "first_value: %llu", offset, increment,
>> +                      nb_desired_values, *first_value));
>> +  DBUG_ASSERT(increment && nb_desired_values);
>> +  *first_value= 0;
>> +  if (table->s->next_number_keypart)
>>    {
>>      /*
>> +      next_number_keypart is != 0 if the auto_increment column is a secondary
>> +      column in the index (it is allowed in MyISAM)
>>      */
>> +    DBUG_PRINT("info", ("next_number_keypart != 0"));
>> +    ulonglong first_value_part, last_value_part, nb_reserved_values_part,
>> +      last_value= ~ (ulonglong) 0;
>> +    handler **pos, **end;
>> +    bool auto_increment_lock= FALSE;
>>      /*
>> +      Must lock and find highest value among all partitions
>>      */
>> +    lock_auto_increment(&auto_increment_lock);
>> +    for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>>      {
>> +      first_value_part= 0;
>> +      (*pos)->get_auto_increment(offset, increment, nb_desired_values,
>> +                                 &first_value_part, &nb_reserved_values_part);
>> +      if (first_value_part == ~(ulonglong)(0)) // error in one partition
>> +      {
>> +        *first_value= first_value_part;
>> +        sql_print_error("Partition failed to reserve auto_increment value");
>> +        unlock_auto_increment(&auto_increment_lock);
>> +        DBUG_VOID_RETURN;
>> +      }
>> +      DBUG_PRINT("info", ("partition loop, first_value_part: %llu", first_value_part));
>> +      /*
>> +        Partition has reserved an interval. Intersect it with the intervals
>> +        already reserved for the previous partitions.
>> +      */
>> +      last_value_part= (nb_reserved_values_part == ULONGLONG_MAX) ?
>> +        ULONGLONG_MAX : (first_value_part + nb_reserved_values_part * increment);
>> +      set_if_bigger(*first_value, first_value_part);
>> +      set_if_smaller(last_value, last_value_part);
>> +    }
>> +    DBUG_PRINT("info", ("*first_value: %llu last_value: %llu", *first_value,
>> +                        last_value));
>> +    if (last_value < *first_value)
>> +    {
>> +      /*
>> +        Set last_value to *first_value. This is safe because of the mutex.
> 
> I don't understand the comment

This was a problem in the original implementation, where last_value
could be lower than *first_value; finding no intersection, it then
tried again, without luck, and failed. (And InnoDB had to hold a mutex
over write_row, and Falcon could deadlock if mutexed.)

But since we always hold the auto_increment_lock here, there is no race
condition when looping over every partition. And we cannot reserve a
negative number of values.

I believe that in this specific case (which is not supported by InnoDB
or Falcon) there will always be only one reserved value, so this could
probably be simplified (since the auto_increment column is not the
first part of a multi-column index). But I think it is better, for now,
to leave it to the partitions' storage engine:
if inserting (3,NULL) -> (3,1), then when inserting (4,NULL), producing
(4,2) would probably be wrong.
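
The intersection logic can be sketched standalone like this (hypothetical names; a simplification of what the quoted loop does with set_if_bigger/set_if_smaller and the clamp, ignoring the increment scaling for clarity):

```cpp
#include <cassert>

const unsigned long long ULL_MAX = ~(unsigned long long) 0;

// One partition's reserved [first, last] interval of auto_increment values.
struct Interval { unsigned long long first, last; };

// Intersect the reservations from all partitions: the common range starts
// at the highest first value and ends at the lowest last value. If the
// intervals do not overlap, clamp to a single value at *first* -- safe
// here because the surrounding mutex rules out a real conflict.
static Interval intersect_reservations(const Interval *parts, int n)
{
  Interval res = { 0, ULL_MAX };
  for (int i = 0; i < n; i++)
  {
    if (parts[i].first > res.first) res.first = parts[i].first; // set_if_bigger
    if (parts[i].last  < res.last)  res.last  = parts[i].last;  // set_if_smaller
  }
  if (res.last < res.first)
    res.last = res.first;       // no common range: fall back to one value
  return res;
}
```

The clamp replaces the old retry loop: instead of looping again (and failing) when the intervals miss each other, the caller just gets a single usable value.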

>> +      */
>> +      last_value= *first_value;
>> +    }
>> +    if (increment)                                // If not check for values
>> +    {
>> +      *nb_reserved_values= (last_value == ULONGLONG_MAX) ?
>> +        ULONGLONG_MAX : ((last_value - *first_value) / increment);
>>      }
>> +    unlock_auto_increment(&auto_increment_lock);
>> +  }
>> +  else if (!ha_data->auto_inc_initialized)
>> +  {
>> +    ulonglong first_value_part;
>> +    handler **pos, **end;
>>      /*
>> +      Not initialized, but should be locked in ha_partition::write_row().
>> +      (Or is a temporary table).
>> +      Initialize to highest value among all partitions
>>      */
>> +    DBUG_PRINT("info", ("Initializing auto_increment in table_share"));
>> +    for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>> +    {
>> +      first_value_part= 0;
>> +      /*
>> +        Use (*pos)->info() and (*pos)->stats.auto_increment_value instead of
>> +        (*pos)->get_auto_increment() since InnoDB does not handle it correctly
>> +        for partitioned tables.
>> +      (*pos)->info(HA_STATUS_AUTO);
>> +      first_value_part= (*pos)->stats.auto_increment_value;
>> +      if (first_value_part == ~(ulonglong)(0)) // error in one partition
>> +      {
>> +        *first_value= first_value_part;
>> +        sql_print_error("Partition failed to reserve auto_increment value");
>> +        DBUG_VOID_RETURN;
>> +      }
>> +      DBUG_PRINT("info", ("partition loop, first_value_part: %llu", first_value_part));
>> +      set_if_bigger(*first_value, first_value_part);
>> +    }
> 
> eh, why don't you just call ::info(HA_STATUS_AUTO) ?

Because of the auto_increment_lock: it is already taken in write_row,
and info() would try to lock it again.

I will add a 'bool auto_increment_lock' member to the ha_partition
class. A flag in the class would simplify the code and allow a call to
info().
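
The per-handler flag could look roughly like this (a minimal standalone sketch; the class and method names are hypothetical, not the actual ha_partition API):

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical sketch: a per-handler 'auto_increment_lock' flag makes the
// lock helpers idempotent, so a nested call path such as
// write_row -> get_auto_increment -> info can call lock() again without
// deadlocking on the non-recursive mutex.
class PartitionAutoInc
{
public:
  PartitionAutoInc() : auto_increment_lock(false)
  {
    pthread_mutex_init(&mutex, NULL);
  }
  ~PartitionAutoInc() { pthread_mutex_destroy(&mutex); }

  void lock()
  {
    if (!auto_increment_lock)          // skip if this handler already holds it
    {
      pthread_mutex_lock(&mutex);
      auto_increment_lock = true;
    }
  }
  void unlock()
  {
    if (auto_increment_lock)
    {
      pthread_mutex_unlock(&mutex);
      auto_increment_lock = false;
    }
  }
  bool holds_lock() const { return auto_increment_lock; }

private:
  pthread_mutex_t mutex;
  bool auto_increment_lock;  // per-handler state, not the shared counter
};
```

Since each handler instance is used by one thread at a time, the flag itself needs no synchronization; only the mutex is shared.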

>> +    if (*first_value == 0 &&
>> +        !(table->auto_increment_field_not_null &&
>> +          table->next_number_field->val_int() == 0 &&
>> +          current_thd->variables.sql_mode & MODE_NO_AUTO_VALUE_ON_ZERO))
>> +    {
>> +      /*
>> +        First auto_increment, set to 1
>> +        (if not explicit number 0 and not null and sql_mode set)
>> +      */
>> +      DBUG_PRINT("info", ("Setting *first_value to 1"));
>> +      *first_value= 1;
>> +    }
>> +    DBUG_PRINT("info", ("*first_value: %llu", *first_value));
>> +
>> +    ha_data->next_auto_inc_val= *first_value + nb_desired_values * increment;
>> +    ha_data->auto_inc_initialized= TRUE;
>> +    *nb_reserved_values= nb_desired_values;
>>    }
>> +  else
>> +  {
>> +    /*
>> +      This is an optimized solution that not require
>> +      a mutex around write_row for auto_increment handling 
>> +      Only for a picking the next auto_increment_value.
>> +
>> +      Get a lock for handling the auto_increment in table_share->ha_data
>> +      for avoiding two concurrent statements getting the same
>> +      number.
>> +    */ 
>> +    bool auto_increment_lock= FALSE;
>> +
>> +    DBUG_PRINT("info", ("reserving %llu auto_inc values", nb_desired_values));
>> +    lock_auto_increment(&auto_increment_lock);
>> +
>> +    DBUG_PRINT("info", ("ha_data->next_auto_inc_val: %llu",
>> +                        ha_data->next_auto_inc_val));
>> +    /* this gets corrected (for offset) in update_auto_increment */
>> +    *first_value= ha_data->next_auto_inc_val;
>> +    ha_data->next_auto_inc_val= *first_value + nb_desired_values * increment;
>> +
>> +    unlock_auto_increment(&auto_increment_lock);
>> +    *nb_reserved_values= nb_desired_values;
>> +  }
>> +  DBUG_PRINT("info", ("first_value: %llu next_auto_inc_val: %llu "
>> +                      "nb_reserved_values: %llu", *first_value,
>> +                      ha_data->next_auto_inc_val,
>> +                      *nb_reserved_values));
>> +
>>    DBUG_VOID_RETURN;
>>  }
>>  
>> diff -Nrup a/sql/ha_partition.h b/sql/ha_partition.h
>> --- a/sql/ha_partition.h	2007-09-24 15:30:28 +02:00
>> +++ b/sql/ha_partition.h	2008-02-21 10:09:42 +01:00
>> @@ -828,12 +835,43 @@ public:
>>      auto_increment_column_changed
>>       -------------------------------------------------------------------------
>>    */
>>    virtual void get_auto_increment(ulonglong offset, ulonglong increment,
>>                                    ulonglong nb_desired_values,
>>                                    ulonglong *first_value,
>>                                    ulonglong *nb_reserved_values);
>>    virtual void release_auto_increment();
>> +  virtual int ha_reset_auto_increment(ulonglong value);
>> +private:
>> +  inline virtual void lock_auto_increment(bool *is_locked)
>> +  {
>> +    if(!(*is_locked) && table_share->tmp_table == NO_TMP_TABLE)
>> +    {
>> +      HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +      *is_locked= TRUE;
>> +      pthread_mutex_lock(&ha_data->mutex);
>> +    }
>> +  }
>> +  inline virtual void unlock_auto_increment(bool *is_locked)
>> +  {
>> +    if(*is_locked)
>> +    {
>> +      HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +      pthread_mutex_unlock(&ha_data->mutex);
>> +      *is_locked= FALSE;
>> +    }
>> +  }
>> +  inline virtual void set_auto_increment_if_higher(bool *is_locked)
>> +  {
>> +    ulonglong nr= table->next_number_field->val_int();
>> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>> +    if (nr >= ha_data->next_auto_inc_val)
>> +    {
>> +      lock_auto_increment(is_locked);
> 
> as you check ha_data->next_auto_inc_val without a mutex, you need to
> re-check it under a mutex.

Thanks for finding this glitch! I will add
'if (nr >= ha_data->next_auto_inc_val)' here, after taking the mutex.
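
That is the classic double-checked pattern: the unlocked read is only a fast-path filter and must be repeated under the mutex. A minimal standalone sketch (hypothetical names, plain pthreads rather than the actual server types):

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical stand-in for the shared HA_DATA_PARTITION state.
struct AutoIncShare {
  pthread_mutex_t mutex;
  unsigned long long next_auto_inc_val;
};

// Corrected set_auto_increment_if_higher: check without the mutex to keep
// the common case cheap, then re-check under the mutex because another
// thread may have raised next_auto_inc_val in between.
static void set_if_higher(AutoIncShare *share, unsigned long long nr)
{
  if (nr >= share->next_auto_inc_val)        // unlocked fast-path check
  {
    pthread_mutex_lock(&share->mutex);
    if (nr >= share->next_auto_inc_val)      // re-check under the mutex
      share->next_auto_inc_val = nr + 1;
    pthread_mutex_unlock(&share->mutex);
  }
}
```

Without the second check, a thread that lost the race would overwrite a higher value written by the winner, handing out a duplicate auto_increment value later.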

>> +      ha_data->next_auto_inc_val= nr + 1;
>> +    }
>> +  }
>> +
>> +public:
>>  
> Regards / Mit vielen Grüssen,
> Sergei
> 

-- 
Mattias Jonsson, Software Engineer
MySQL AB, www.mysql.com

Are you MySQL certified?  www.mysql.com/certification