List: Commits
From: Guilhem Bichot
Date: March 13 2008 1:26pm
Subject: Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479
Hello,

As this is a big patch, I marked my comments with "GB" for easier
searching.
By the way, this is an excellent patch.

On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote:
> ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3 -0
>   Bug#33479: auto_increment failures in partitioning
>   
>   Several problems with auto_increment in partitioning
>   (with MyISAM, InnoDB and Falcon)
>   
>   Updated code-only patch (the test cases will be in separate patches for
>   5.1+ (InnoDB+MyISAM) and 6.0 (Falcon)).
>   This is a full code-only patch which I have updated according to Serg's
>   comments.
>   Changed the auto_increment handling for partitioning:
>   Added a ha_data variable in table_share for storage engine specific data,
>   such as the auto_increment value handling in partitioning,
>   using ha_data->mutex to lock around read + update.
>   
>   The idea is this:
>   Store the table's reserved auto_increment value in
>   the TABLE_SHARE and use a mutex to lock it for reading and updating,
>   only accessing all partitions when it is not initialized.
>   Also allow reservations of ranges, and if no one has done a reservation
>   afterwards, lower the reservation to what was actually used after
>   the statement is done (via release_auto_increment from WL 3146).
>   
>   This should also lead to better concurrency
>   and work with any local storage engine.
> 
>   sql/ha_partition.cc@stripped, 2008-03-05 23:56:03+01:00, mattiasj@witty. +261 -88
>     Bug#33479: Failures using auto_increment and partitioning
>     
>     Changed ha_partition::get_auto_increment from file->get_auto_increment
>     to file->info(HA_STATUS_AUTO), since it works better with InnoDB
>     
>     Using the new table_share->ha_data for keeping the auto_increment
>     value, shared by all instances of the same table.
>     It is read+updated when holding an auto_increment specific mutex.
>     Also added release_auto_increment to decrease gaps if possible.
> 
>   sql/ha_partition.h@stripped, 2008-03-05 23:56:03+01:00, mattiasj@witty. +43 -1
>     Bug#33479: Failures using auto_increment and partitioning
>     
>     Added a new struct HA_DATA_PARTITION to be used in table_share->ha_data
>     Added a private function to set auto_increment values if needed
>     Removed the restore_auto_increment (the handler version is better)
>     Added lock/unlock functions for auto_increment handling.
>     Missed some special case initializations
> 
>   sql/table.h@stripped, 2008-03-05 23:56:03+01:00, mattiasj@witty. +4 -0
>     Bug#33479: Failures using auto_increment and partitioning
>     
>     Added a variable in table_share: ha_data for storage of storage engine
>     specific data (such as auto_increment handling in partitioning).
> 
> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc
> --- a/sql/ha_partition.cc	2007-12-20 19:16:51 +01:00
> +++ b/sql/ha_partition.cc	2008-03-05 23:56:03 +01:00
> @@ -160,7 +160,7 @@ const uint ha_partition::NO_CURRENT_PART
>  
>  ha_partition::ha_partition(handlerton *hton, TABLE_SHARE *share)
>    :handler(hton, share), m_part_info(NULL), m_create_handler(FALSE),
> -   m_is_sub_partitioned(0), is_clone(FALSE)
> +   m_is_sub_partitioned(0), is_clone(FALSE), auto_increment_lock(FALSE)
>  {
>    DBUG_ENTER("ha_partition::ha_partition(table)");
>    init_handler_variables();
> @@ -182,7 +182,8 @@ ha_partition::ha_partition(handlerton *h
>  ha_partition::ha_partition(handlerton *hton, partition_info *part_info)
>    :handler(hton, NULL), m_part_info(part_info),
>     m_create_handler(TRUE),
> -   m_is_sub_partitioned(m_part_info->is_sub_partitioned()), is_clone(FALSE)
> +   m_is_sub_partitioned(m_part_info->is_sub_partitioned()), is_clone(FALSE),
> +   auto_increment_lock(FALSE)
>  {
>    DBUG_ENTER("ha_partition::ha_partition(part_info)");
>    init_handler_variables();
> @@ -1570,13 +1571,13 @@ int ha_partition::copy_partitions(ulongl
>             table since it doesn't fit into any partition any longer due to
>             changed partitioning ranges or list values.
>          */
> -        deleted++;
> +        (*deleted)++;

GB This bug would have been caught by the compiler if the method was
declared as
int ha_partition::copy_partitions(ulonglong * const copied, ulonglong * const deleted)

Could you please do this change?
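As a minimal standalone illustration (copy_rows is a made-up name, not the
real method), the const pointer turns the old typo into a compile error
while the intended increment still works:

  typedef unsigned long long ulonglong;

  static int copy_rows(ulonglong *const copied, ulonglong *const deleted)
  {
    /* deleted++; */     /* would not compile: the pointer itself is const */
    (*deleted)++;        /* fine: the pointed-to value is still writable */
    (*copied)++;
    return 0;
  }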

>        }
>        else
>        {
>          THD *thd= ha_thd();
>          /* Copy record to new handler */
> -        copied++;
> +        (*copied)++;
>          tmp_disable_binlog(thd); /* Do not replicate the low-level changes. */
>          result= m_new_file[new_part]->ha_write_row(m_rec0);
>          reenable_binlog(thd);
> @@ -2354,6 +2355,24 @@ int ha_partition::open(const char *name,
>                           0, key_rec_cmp, (void*)this)))
>      goto err_handler;
>  
> +
> +  if (table_share->tmp_table == NO_TMP_TABLE)
> +    pthread_mutex_lock(&table->s->mutex);
> +  if (!table_share->ha_data)
> +  {
> +    HA_DATA_PARTITION *ha_data;
> +    /* currently only needed for auto_increment */
> +    table_share->ha_data= alloc_root(&table_share->mem_root,
> +                                     sizeof(HA_DATA_PARTITION));
> +    if (!table_share->ha_data)
> +      goto err_handler;
> +    ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +    bzero(ha_data, sizeof(HA_DATA_PARTITION));
> +    if (table_share->tmp_table == NO_TMP_TABLE)
> +      pthread_mutex_init(&ha_data->mutex, MY_MUTEX_INIT_FAST);

GB you need to destroy this mutex, and do so before
table_share->mem_root is freed.

> +  }
> +  if (table_share->tmp_table == NO_TMP_TABLE)
> +    pthread_mutex_unlock(&table->s->mutex);

GB you may want to have a bool on the stack which keeps the value of
table_share->tmp_table == NO_TMP_TABLE, to avoid doing the test 3
times in the worst case.
To have fewer pointer dereferences of table_share (in
table_share->ha_data) you could do
table_share->ha_data= ha_data= (HA_DATA_PARTITION*) alloc_root(...);
if (!ha_data) // or ha_data==NULL, as you like
  goto err_handler;
bzero(ha_data, sizeof(HA_DATA_PARTITION));
This would be one dereference instead of three.
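Putting both remarks together (plus the earlier one about destroying the
mutex), a sketch of how the block in open() could look; this is only a
suggestion, and where exactly to call pthread_mutex_destroy() before
table_share->mem_root is freed is still to be decided:

  HA_DATA_PARTITION *ha_data;
  const bool is_not_tmp= (table_share->tmp_table == NO_TMP_TABLE);

  if (is_not_tmp)
    pthread_mutex_lock(&table_share->mutex);
  if (!table_share->ha_data)
  {
    /* currently only needed for auto_increment */
    table_share->ha_data= ha_data= (HA_DATA_PARTITION*)
      alloc_root(&table_share->mem_root, sizeof(HA_DATA_PARTITION));
    if (!ha_data)
      goto err_handler;            /* note: remember to unlock on this path */
    bzero(ha_data, sizeof(HA_DATA_PARTITION));
    if (is_not_tmp)
      pthread_mutex_init(&ha_data->mutex, MY_MUTEX_INIT_FAST);
    /* a matching pthread_mutex_destroy() is needed before mem_root is freed */
  }
  if (is_not_tmp)
    pthread_mutex_unlock(&table_share->mutex);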

>    /*
>      Some handlers update statistics as part of the open call. This will in
>      some cases corrupt the statistics of the partition handler and thus
> @@ -2695,8 +2714,9 @@ int ha_partition::write_row(uchar * buf)
>    uint32 part_id;
>    int error;
>    longlong func_value;
> -  bool autoincrement_lock= FALSE;
> +  bool have_auto_increment= table->next_number_field && buf == table->record[0];
>    my_bitmap_map *old_map;
> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>    THD *thd= ha_thd();
>  #ifdef NOT_NEEDED
>    uchar *rec0= m_rec0;
> @@ -2712,31 +2732,30 @@ int ha_partition::write_row(uchar * buf)
>      If we have an auto_increment column and we are writing a changed row
>      or a new row, then update the auto_increment value in the record.
>    */
> -  if (table->next_number_field && buf == table->record[0])
> +  if (have_auto_increment)
>    {
>      /*
> -      Some engines (InnoDB for example) can change autoincrement
> -      counter only after 'table->write_row' operation.
> -      So if another thread gets inside the ha_partition::write_row
> -      before it is complete, it gets same auto_increment value,
> -      which means DUP_KEY error (bug #27405)
> -      Here we separate the access using table_share->mutex, and
> -      use autoincrement_lock variable to avoid unnecessary locks.
> -      Probably not an ideal solution.
> +      must initialize it before write if it explicity set,
> +      since the partitions can have higher auto_increment_value
> +      which should be used.
>      */
> -    if (table_share->tmp_table == NO_TMP_TABLE)
> +    if (!ha_data->auto_inc_initialized &&
> +        !table->s->next_number_keypart &&
> +        table_share->tmp_table == NO_TMP_TABLE)
>      {
>        /*
> -        Bug#30878 crash when alter table from non partitioned table
> -        to partitioned.
> -        Checking if tmp table then there is no need to lock,
> -        and the table_share->mutex may not be initialised.
> +        If auto_increment in table_share is not initialized, we must
> +        have a mutex around update_auto_increment,
> +        get_auto_increment and write_row.

GB why must we have a mutex around etc? Could you please show a
problematic scenario if we don't have this mutex for so long?

> +        (get_auto_increment can be called from update_auto_increment
> +        and update_auto_increment can be called from write_row).
> +        Not needed for tmp-tables, or auto_increment in secondary
> +        columns in multi-column index.
> +        Also check ha_partition::get_auto_increment().
>        */
> -      autoincrement_lock= TRUE;
> -      pthread_mutex_lock(&table_share->mutex);
> +      lock_auto_increment();
>      }
>      error= update_auto_increment();
> -
>      /*
>        If we have failed to set the auto-increment value for this row,
>        it is highly likely that we will not be able to insert it into
> @@ -2771,10 +2790,11 @@ int ha_partition::write_row(uchar * buf)
>    DBUG_PRINT("info", ("Insert in partition %d", part_id));
>    tmp_disable_binlog(thd); /* Do not replicate the low-level changes. */
>    error= m_file[part_id]->ha_write_row(buf);
> +  if (have_auto_increment && !table->s->next_number_keypart)
> +    set_auto_increment_if_higher();
>    reenable_binlog(thd);
>  exit:
> -  if (autoincrement_lock)
> -    pthread_mutex_unlock(&table_share->mutex);
> +  unlock_auto_increment();
>    DBUG_RETURN(error);
>  }
>  
> @@ -2838,17 +2858,24 @@ int ha_partition::update_row(const uchar
>      goto exit;
>    }
>  
> -  /*
> -    TODO:
> -      set_internal_auto_increment=
> -        max(set_internal_auto_increment, new_data->auto_increment)
> -  */
>    m_last_part= new_part_id;
>    if (new_part_id == old_part_id)
>    {
>      DBUG_PRINT("info", ("Update in partition %d", new_part_id));
>      tmp_disable_binlog(thd); /* Do not replicate the low-level changes. */
>      error= m_file[new_part_id]->ha_update_row(old_data, new_data);
> +    /*
> +      if updating a auto_increment row, update

"auto_increment column" would be more accurate.

> +      auto_increment_value in table_share if needed
> +      (not to be used if auto_increment on secondary field in a multi-
> +      column index)
> +    */
> +    if (table->next_number_field && new_data == table->record[0] &&
> +        !table->s->next_number_keypart)
> +    {
> +      set_auto_increment_if_higher();
> +      unlock_auto_increment();
> +    }
>      reenable_binlog(thd);
>      goto exit;
>    }
> @@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows()
>  {
>    int error;
>    handler **file;
> +  THD *thd= current_thd;

GB In handler code like this, ha_thd() is to be used instead of
current_thd, because we think it could be faster:
- it's a function call like pthread_getspecific() is (current_thd uses
pthread_getspecific())
- it is in mysqld whereas pthread_getspecific() is in libc (farther so
slower to access, probably one indirection more).
- ha_thd() has good chances of just returning table->in_use, which is
probably not more work than pthread_getspecific() does, at least this
one:
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/threads/tsd.c#_thr_getspecific
does more work (opensolaris is what popped first in Google - sign of
the times?).
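For reference, handler::ha_thd() is roughly the following (quoting from
memory, see handler.cc for the authoritative version):

  THD *handler::ha_thd(void) const
  {
    DBUG_ASSERT(!table || !table->in_use || table->in_use == current_thd);
    return (table && table->in_use) ? table->in_use : current_thd;
  }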

>    DBUG_ENTER("ha_partition::delete_all_rows");
>  
> +  if (thd->lex->sql_command == SQLCOM_TRUNCATE)
> +  {
> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +    lock_auto_increment();
> +    ha_data->next_auto_inc_val= 0;
> +    ha_data->auto_inc_initialized= FALSE;
> +    unlock_auto_increment();
> +  }
>    file= m_file;
>    do
>    {
>      if ((error= (*file)->ha_delete_all_rows()))
>        DBUG_RETURN(error);
> +    /* must reset the auto_increment for some engines (eg MyISAM) */
> +    if (thd->lex->sql_command == SQLCOM_TRUNCATE)
> +      (*file)->ha_reset_auto_increment(0);

GB This piece moved to another bug report, I understood.

>    } while (*(++file));
>    DBUG_RETURN(0);
>  }
> @@ -4412,20 +4451,56 @@ int ha_partition::handle_ordered_prev(uc
>  int ha_partition::info(uint flag)
>  {
>    handler *file, **file_array;
> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;

GB this declaration should be moved to the block which uses this variable:
the if(flag&HA_STATUS_AUTO) block.

>    DBUG_ENTER("ha_partition:info");
>  
>    if (flag & HA_STATUS_AUTO)
>    {
> -    ulonglong auto_increment_value= 0;
>      DBUG_PRINT("info", ("HA_STATUS_AUTO"));
> -    file_array= m_file;
> -    do
> +    if (ha_data->auto_inc_initialized && !table_share->next_number_keypart)

GB this is reading ha_data->auto_inc_initialized without the mutex: there may
be another thread changing this variable at this moment. Then we may
go into the 'else' branch even though auto_inc_initialized has already just
been set to TRUE by the other thread, but this is ok as we will then
only do a useless setting of next_auto_inc_val.
>      {
> -      file= *file_array;
> -      file->info(HA_STATUS_AUTO);
> -      set_if_bigger(auto_increment_value, file->stats.auto_increment_value);
> -    } while (*(++file_array));
> -    stats.auto_increment_value= auto_increment_value;
> +      DBUG_PRINT("info", ("Using ha_data->next_auto_inc_val: %llu",
> +                          ha_data->next_auto_inc_val));

GB From what I see in MSDN, %ll[u] is not supported by VS 2003 (it is in
C99 but Microsoft apparently needed more than 4 years to implement it
- now let's see how many flame emails I get for writing that
gratuitous sarcasm). It is supported starting from VS2005. There is a
VS2003 machine in pushbuild, so we may have to do some debugging on
it, thus we should avoid %ll[u] in the patch (the rest of the MySQL code
already avoids it and uses [u]llstr/%s instead). I agree this is a
hassle.
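For example, assuming ullstr() takes the value plus a caller-provided
buffer like llstr() does, the DBUG_PRINT above would become:

  char buf[22];
  DBUG_PRINT("info", ("Using ha_data->next_auto_inc_val: %s",
                      ullstr(ha_data->next_auto_inc_val, buf)));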

> +
> +      lock_auto_increment();
> +      stats.auto_increment_value= ha_data->next_auto_inc_val;
> +      unlock_auto_increment();
> +    }
> +    else
> +    {
> +      ulonglong auto_increment_value= 0;
> +      file_array= m_file;
> +      DBUG_PRINT("info", ("checking all partitions for auto_increment_value"));
> +      do
> +      {
> +        file= *file_array;
> +        file->info(HA_STATUS_AUTO);
> +        set_if_bigger(auto_increment_value, file->stats.auto_increment_value);
> +        DBUG_PRINT("info", ("file->stats.auto_increment_value: %llu",
> +                            file->stats.auto_increment_value));
> +      } while (*(++file_array));
> +
> +      /*
> +        This should not occur unless bugs in the other storage engine.
> +        If locked, it is called via get_auto_increment as a part of
> +        initializing and is handled there.
> +      */

GB Could you please expand on "This should not occur unless bugs in the
other storage engine"? If the if() block below is useless except in case
of bugs, we could change it to a DBUG_ASSERT(). It sounds weird that
SHOW TABLE STATUS on a table of the buggy engine would return 0 in the
Auto_increment column, but SHOW TABLE STATUS on a partitioned table
whose partitions are of the buggy engine would return 1 in the
Auto_increment column.
And I prefer not to do defensive programming when dealing with storage
engines, but rather let the engines' developers fix their bugs.

> +      if (!auto_increment_lock && auto_increment_value == 0)
> +        auto_increment_value= 1;

GB In spite of the comment, I don't understand why the fact of being
locked has an influence on whether we should set the value to 1 or not.
When can auto_increment_value==0 be true?


> +      stats.auto_increment_value= auto_increment_value;
> +      if (!table_share->next_number_keypart)
> +      {
> +        bool is_already_locked= auto_increment_lock;
> +        if (!is_already_locked)
> +          lock_auto_increment();
> +        set_if_bigger(ha_data->next_auto_inc_val, auto_increment_value);
> +        ha_data->auto_inc_initialized= TRUE;
> +        if (!is_already_locked)
> +          unlock_auto_increment();
> +      }
> +    }
> +    DBUG_PRINT("info", ("stats.auto_increment_value: %llu",
> +                        stats.auto_increment_value));
>    }
>    if (flag & HA_STATUS_VARIABLE)
>    {
> @@ -5579,19 +5654,37 @@ int ha_partition::cmp_ref(const uchar *r
>                  MODULE auto increment
>  ****************************************************************************/
>  
> -void ha_partition::restore_auto_increment(ulonglong)
> -{
> -  DBUG_ENTER("ha_partition::restore_auto_increment");
>  
> -  DBUG_VOID_RETURN;
> +int ha_partition::reset_auto_increment(ulonglong value)
> +{
> +  handler **pos, **end;
> +  int res;
> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +  DBUG_ENTER("ha_partition::reset_auto_increment");
> +  DBUG_PRINT("info", ("value: %llu", value));
> +  lock_auto_increment();
> +  ha_data->auto_inc_initialized= FALSE;
> +  ha_data->next_auto_inc_val= 0;
> +  for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)

GB all occurrences of "m_file+ m_tot_parts" should have a space before +.

> +  {
> +    if ((res= (*pos)->ha_reset_auto_increment(value)) != 0)
> +    {
> +      unlock_auto_increment();
> +      DBUG_RETURN(res);
> +    }
> +  }
> +  unlock_auto_increment();
> +  DBUG_RETURN(0);

GB or maybe (smaller code, different behaviour in case of error, but not
clear whether that in itself is important):
  int res= 0;
  ...
  for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
    res|= (*pos)->ha_reset_auto_increment(value);
  unlock_auto_increment();
  DBUG_RETURN(res != 0);

  
>  /*
>    This method is called by update_auto_increment which in turn is called
> -  by the individual handlers as part of write_row. We will always let
> -  the first handler keep track of the auto increment value for all
> -  partitions.
> +  by the individual handlers as part of write_row. We use the
> +  table_share->ha_data->next_auto_inc_val, or search all
> +  partitions for the highest auto_increment_value if not initialized or
> +  if auto_increment field is a secondary part of a key, we must search
> +  every partition with a mutex to be sure of correctness.
>  */

GB While you are changing the comment, you could format it for doxygen.
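Something along these lines, for example (only a sketch of the doxygen
style, the wording is up to you):

/**
  Reserve an interval of auto_increment values from the handler.

  If the table's next auto_increment value is already initialized in
  table_share->ha_data, reserve from there under the mutex; otherwise
  (or when the auto_increment column is a secondary part of the key)
  ask every partition and use the highest value.

  @param[in]   offset              offset of the auto_increment sequence
  @param[in]   increment           increment of the auto_increment sequence
  @param[in]   nb_desired_values   how many values we would like to reserve
  @param[out]  first_value         first value of the reserved interval
  @param[out]  nb_reserved_values  how many values were actually reserved
*/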

>  
>  void ha_partition::get_auto_increment(ulonglong offset, ulonglong increment,
> @@ -5599,59 +5692,120 @@ void ha_partition::get_auto_increment(ul
>                                        ulonglong *first_value,
>                                        ulonglong *nb_reserved_values)
>  {
> -  ulonglong first_value_part, last_value_part, nb_reserved_values_part,
> -    last_value= ~ (ulonglong) 0;
> -  handler **pos, **end;
> -  bool retry= TRUE;
> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>    DBUG_ENTER("ha_partition::get_auto_increment");
> -
> -again:
> -  for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
> +  DBUG_PRINT("info", ("offset: %llu inc: %llu desired_values: %llu "
> +                      "first_value: %llu", offset, increment,
> +                      nb_desired_values, *first_value));
> +  DBUG_ASSERT(increment && nb_desired_values);
> +  *first_value= 0;
> +  if (table->s->next_number_keypart)
>    {
> -    first_value_part= *first_value;
> -    (*pos)->get_auto_increment(offset, increment, nb_desired_values,
> -                               &first_value_part, &nb_reserved_values_part);
> -    if (first_value_part == ~(ulonglong)(0)) // error in one partition
> -    {
> -      *first_value= first_value_part;
> -      sql_print_error("Partition failed to reserve auto_increment value");
> -      DBUG_VOID_RETURN;
> -    }
>      /*
> -      Partition has reserved an interval. Intersect it with the intervals
> -      already reserved for the previous partitions.
> +      next_number_keypart is != 0 if the auto_increment column is a secondary
> +      column in the index (it is allowed in MyISAM)
>      */

GB In this branch, the concept of reserving multiple values does not
make sense, as the comment in ha_myisam.cc says:

  /*
    MySQL needs to call us for next row: assume we are inserting ("a",null)
    here, we return 3, and next this statement will want to insert ("b",null):
    there is no reason why ("b",3+1) would be the good row to insert: maybe it
    already exists, maybe 3+1 is too large...
  */
  *nb_reserved_values= 1;

So ha_partition::get_auto_increment() should not look for an
intersection.
The intersection was my idea and my code, and it's wrong; Mikael and
I had already identified that. When I coded it, it was not a problem
because all engines, even InnoDB, did table-level locking for the
duration of the whole INSERT statement, so they were all reserving up to
+inf, thus the intersection was in fact always doing a max.
Now that InnoDB reserves finite intervals, we have to get rid of the
intersection and use a max.
It does not make sense to think about intervals in this case.

In the branch above, you just want a max: you can just call
get_auto_increment(), then use the max of all
first_value_part. last_value_part is not needed.

> -    last_value_part= (nb_reserved_values_part == ULONGLONG_MAX) ?
> -      ULONGLONG_MAX : (first_value_part + nb_reserved_values_part * increment);
> -    set_if_bigger(*first_value, first_value_part);
> -    set_if_smaller(last_value, last_value_part);
> -  }
> -  if (last_value < *first_value) /* empty intersection, error */
> -  {
> +    DBUG_PRINT("info", ("next_number_keypart != 0"));
> +    ulonglong first_value_part, last_value_part, nb_reserved_values_part,
> +      last_value= ~ (ulonglong) 0;
> +    handler **pos, **end;
>      /*
> -      When we have an empty intersection, it means that one or more
> -      partitions may have a significantly different autoinc next value.
> -      We should not fail here - it just means that we should try to
> -      find a new reservation making use of the current *first_value
> -      wbich should now be compatible with all partitions.
> +      Must lock and find highest value among all partitions
>      */
> -    if (retry)
> +    lock_auto_increment();
> +    for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>      {
> -      retry= FALSE;
> -      last_value= ~ (ulonglong) 0;
> -      release_auto_increment();
> -      goto again;
> +      first_value_part= 0;
> +      (*pos)->get_auto_increment(offset, increment, nb_desired_values,
> +                                 &first_value_part, &nb_reserved_values_part);
> +      if (first_value_part == ~(ulonglong)(0)) // error in one partition
> +      {
> +        *first_value= first_value_part;
> +        sql_print_error("Partition failed to reserve auto_increment value");
> +        unlock_auto_increment();
> +        DBUG_VOID_RETURN;
> +      }
> +      DBUG_PRINT("info", ("partition loop, first_value_part: %llu",
> first_value_part));
> +      /*
> +        Partition has reserved an interval. Intersect it with the intervals
> +        already reserved for the previous partitions.
> +      */
> +      last_value_part= (nb_reserved_values_part == ULONGLONG_MAX) ?
> +        ULONGLONG_MAX : (first_value_part + nb_reserved_values_part * increment);
> +      set_if_bigger(*first_value, first_value_part);
> +      set_if_smaller(last_value, last_value_part);
> +    }
> +    DBUG_PRINT("info", ("*first_value: %llu last_value: %llu", *first_value,
> +                        last_value));
> +    if (last_value < *first_value)
> +    {
> +      /*
> +        Set last_value to *first_value. This is safe because of the mutex.
> +      */
> +      last_value= *first_value;
>      }
> +    if (increment)                                // If not check for values

GB the function starts with a DBUG_ASSERT() which checks that
"increment" is true.

> +    {
> +      *nb_reserved_values= (last_value == ULONGLONG_MAX) ?
> +        ULONGLONG_MAX : ((last_value - *first_value) / increment);
> +    }
> +    unlock_auto_increment();

GB I think you should simplify this whole branch:
- call get_auto_increment() in a loop, with nb_desired_values=1
- compute the max into *first_value, set *nb_reserved_values to 1
- and be done with it.
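In other words, the whole branch could shrink to something like this
(only a sketch, keeping the names of the patch):

    ulonglong first_value_part, nb_reserved_values_part, max_value= 0;
    handler **pos, **end;
    lock_auto_increment();
    for (pos= m_file, end= m_file + m_tot_parts; pos != end; pos++)
    {
      first_value_part= 0;
      (*pos)->get_auto_increment(offset, increment, 1,
                                 &first_value_part, &nb_reserved_values_part);
      if (first_value_part == ~(ulonglong) 0)     /* error in one partition */
      {
        *first_value= first_value_part;
        sql_print_error("Partition failed to reserve auto_increment value");
        unlock_auto_increment();
        DBUG_VOID_RETURN;
      }
      set_if_bigger(max_value, first_value_part);
    }
    *first_value= max_value;
    *nb_reserved_values= 1;
    unlock_auto_increment();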

> +  }
> +  else if (!ha_data->auto_inc_initialized)
> +  {
>      /*
> -      We should not get here.
> +      Not initialized, but should be locked in ha_partition::write_row().
> +      (Or is a temporary table).

GB how about adding
if (this is not tmp table)
   safe_mutex_assert_owner(the expected mutex);
?
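i.e. at the start of this branch, something like:

    if (table_share->tmp_table == NO_TMP_TABLE)
      safe_mutex_assert_owner(&ha_data->mutex);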

> +      Initialize to highest value among all partitions
>      */
> -    sql_print_error("Failed to calculate auto_increment value for partition");
> -    
> -    *first_value= ~(ulonglong)(0);
> +    info(HA_STATUS_AUTO);
> +    *first_value= ha_data->next_auto_inc_val;
> +    if (*first_value == 0 &&

GB When can *first_value be 0?

> +        !(table->auto_increment_field_not_null &&
> +          table->next_number_field->val_int() == 0 &&
> +          current_thd->variables.sql_mode & MODE_NO_AUTO_VALUE_ON_ZERO))
> +    {
> +      /*
> +        First auto_increment, set to 1
> +        (if not explicit number 0 and not null and sql_mode set)
> +      */
> +      DBUG_PRINT("info", ("Setting *first_value to 1"));
> +      *first_value= 1;
> +    }
> +    DBUG_PRINT("info", ("*first_value: %llu", *first_value));
> +
> +    ha_data->next_auto_inc_val= *first_value + nb_desired_values * increment;
> +    ha_data->auto_inc_initialized= TRUE;
> +    *nb_reserved_values= nb_desired_values;
>    }
> -  if (increment)                                // If not check for values
> -    *nb_reserved_values= (last_value == ULONGLONG_MAX) ?
> -      ULONGLONG_MAX : ((last_value - *first_value) / increment);
> +  else
> +  {
> +    /*
> +      This is an optimized solution that not require a mutex around
> +      write_row for auto_increment handling, instead only using a mutex
> +      for picking the next auto_increment_value.
> +
> +      Get a lock for handling the auto_increment in table_share->ha_data
> +      for avoiding two concurrent statements getting the same number.
> +    */ 
> +
> +    DBUG_PRINT("info", ("reserving %llu auto_inc values", nb_desired_values));
> +    lock_auto_increment();
> +
> +    DBUG_PRINT("info", ("ha_data->next_auto_inc_val: %llu",
> +                        ha_data->next_auto_inc_val));
> +    /* this gets corrected (for offset/increment) in update_auto_increment */
> +    *first_value= ha_data->next_auto_inc_val;
> +    ha_data->next_auto_inc_val= *first_value + nb_desired_values * increment;
> +
> +    unlock_auto_increment();
> +    *nb_reserved_values= nb_desired_values;
> +  }
> +  DBUG_PRINT("info", ("first_value: %llu next_auto_inc_val: %llu "
> +                      "nb_reserved_values: %llu", *first_value,
> +                      ha_data->next_auto_inc_val,
> +                      *nb_reserved_values));
> +
>    DBUG_VOID_RETURN;
>  }
>  
> @@ -5659,9 +5813,28 @@ void ha_partition::release_auto_incremen
>  {
>    DBUG_ENTER("ha_partition::release_auto_increment");
>  
> -  for (uint i= 0; i < m_tot_parts; i++)
> +  if (table->s->next_number_keypart)
>    {
> -    m_file[i]->ha_release_auto_increment();
> +    for (uint i= 0; i < m_tot_parts; i++)
> +      m_file[i]->ha_release_auto_increment();
> +  }
> +  else
> +  {
> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +    DBUG_PRINT("info", ("next_auto_inc_val: %llu cur_row_max: %llu min: %llu",
> +                        ha_data->next_auto_inc_val,
> +                        auto_inc_interval_for_cur_row.maximum(),
> +                        auto_inc_interval_for_cur_row.minimum()));
> +    if (next_insert_id)
> +    {
> +      lock_auto_increment();
> +      if (next_insert_id < ha_data->next_auto_inc_val &&
> +          auto_inc_interval_for_cur_row.maximum() >= ha_data->next_auto_inc_val)

GB too long line (80 chars).

> +        ha_data->next_auto_inc_val= next_insert_id;
> +      unlock_auto_increment();
> +    }
> +    DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id: %llu",
> +                        ha_data->next_auto_inc_val, next_insert_id));
>    }
>    DBUG_VOID_RETURN;
>  }
> diff -Nrup a/sql/ha_partition.h b/sql/ha_partition.h
> --- a/sql/ha_partition.h	2007-09-24 15:30:28 +02:00
> +++ b/sql/ha_partition.h	2008-03-05 23:56:03 +01:00
> @@ -37,6 +37,13 @@ typedef struct st_partition_share
>  } PARTITION_SHARE;
>  #endif
>  
> +/* TODO: move all partition specific data from TABLE_SHARE here */

GB /** @todo

> +typedef struct st_ha_data_partition
> +{
> +  ulonglong next_auto_inc_val;
> +  bool auto_inc_initialized;
> +  pthread_mutex_t mutex;
> +} HA_DATA_PARTITION;

GB if you add a comment for the struct, and maybe for each of its members
unless its use is easily guessable, it will produce useful info in
doxygen-generated documentation.
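For example (a sketch only, adjust the wording as you see fit):

/**
  Partition-specific data shared by all open instances of the same table,
  stored in TABLE_SHARE::ha_data.
*/
typedef struct st_ha_data_partition
{
  ulonglong next_auto_inc_val;        /**< first non-reserved value */
  bool auto_inc_initialized;          /**< whether next_auto_inc_val is valid */
  pthread_mutex_t mutex;              /**< protects the two members above */
} HA_DATA_PARTITION;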

>  #define PARTITION_BYTES_IN_POS 2
>  class ha_partition :public handler
> @@ -141,6 +148,7 @@ private:
>      "own" the m_part_info structure.
>    */
>    bool is_clone;
> +  bool auto_increment_lock;             /* lock reading/updating auto_inc */
>  public:
>    handler *clone(MEM_ROOT *mem_root);
>    virtual void set_part_info(partition_info *part_info)
> @@ -828,12 +836,46 @@ public:
>      auto_increment_column_changed
>       -------------------------------------------------------------------------
>    */
> -  virtual void restore_auto_increment(ulonglong prev_insert_id);
>    virtual void get_auto_increment(ulonglong offset, ulonglong increment,
>                                    ulonglong nb_desired_values,
>                                    ulonglong *first_value,
>                                    ulonglong *nb_reserved_values);
>    virtual void release_auto_increment();
> +private:
> +  virtual int reset_auto_increment(ulonglong value);
> +  inline virtual void lock_auto_increment()

GB methods declared inside the class' declaration are automatically
inline (C++ standard says so), but no harm in adding "inline".

> +  {

GB I think we should have
DBUG_ASSERT(!auto_increment_lock) here.
The reason is: auto_increment_lock seems to be used as "remember if
lock is already acquired or not; if already acquired I don't need to
acquire it". But this code:

lock_auto_increment();
f();
something;
unlock_auto_increment();

where f() is:
f()
{
  lock_auto_increment(); $
  ...;
  unlock_auto_increment(); *
}

will not work: the line marked with $ will do nothing, but the line
with * will unlock the mutex, too early if 'something;' is expected to be
protected by the mutex.
With the assertion we protect against such incorrect double-lock use.

Please also add some description to lock_auto_increment() (like "locks
mutex if not already locked") and unlock_auto_increment() (like
"unlocks mutex if it was locked").

Any function which calls unlock_auto_increment() should in theory have
a comment in its description saying that it unlocks the mutex.

I worry a bit about this design where, in a function, the lock may or
may not be unlocked (like in get_auto_increment(), where we unlock or not
depending on the branch), functions call each other with the lock
acquired by the caller or the callee, and it all relies on the proper
branches being taken by the callee. But to be more specific I first need
to understand why in some cases the mutex has to be held across write_row +
update_auto_increment + get_auto_increment (a previous question).
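To make the suggestion concrete, here is a sketch of the two methods with
the assertion and the comments; note that the assertion also requires that
no caller invokes lock_auto_increment() while already holding the lock,
which is exactly the pattern discussed above:

  /** Lock the auto_increment mutex (must not already be locked by us). */
  inline virtual void lock_auto_increment()
  {
    DBUG_ASSERT(!auto_increment_lock);
    if (table_share->tmp_table == NO_TMP_TABLE)
    {
      HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
      auto_increment_lock= TRUE;
      pthread_mutex_lock(&ha_data->mutex);
    }
  }
  /** Unlock the auto_increment mutex if this handler locked it. */
  inline virtual void unlock_auto_increment()
  {
    if (auto_increment_lock)
    {
      HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
      pthread_mutex_unlock(&ha_data->mutex);
      auto_increment_lock= FALSE;
    }
  }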

> +    if(!auto_increment_lock && table_share->tmp_table == NO_TMP_TABLE)
> +    {
> +      HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +      auto_increment_lock= TRUE;
> +      pthread_mutex_lock(&ha_data->mutex);
> +    }
> +  }
> +  inline virtual void unlock_auto_increment()
> +  {
> +    if(auto_increment_lock)
> +    {
> +      HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +      pthread_mutex_unlock(&ha_data->mutex);
> +      auto_increment_lock= FALSE;
> +    }
> +  }
> +  inline virtual void set_auto_increment_if_higher()
> +  {
> +    ulonglong nr= table->next_number_field->val_int();
> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +    if (nr >= ha_data->next_auto_inc_val)

GB We are doing a read without mutex above. Why is that correct?
I have doubts: for example, on Pentium and newer Intel CPUs, a read of
a ulonglong variable may not be atomic if the ulonglong is not aligned
on an 8-byte boundary (and looking at how next_auto_inc_val is
allocated I see no guarantee of such alignment).
So if two threads are executing the above code at the same time, one
may see a corrupted ha_data->next_auto_inc_val (a value which never
really existed) and go wrong.
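One way out, sketched below, is to take the mutex before the comparison so
the read is always protected (at the cost of taking the mutex also when nr
is lower); unlocking stays with the callers as in the current patch:

  inline virtual void set_auto_increment_if_higher()
  {
    ulonglong nr= table->next_number_field->val_int();
    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
    lock_auto_increment();                  /* no-op if we already hold it */
    if (nr >= ha_data->next_auto_inc_val)
      ha_data->next_auto_inc_val= nr + 1;
    ha_data->auto_inc_initialized= TRUE;
    /* unlocking is left to the caller, as in the current patch */
  }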

> +    {
> +      lock_auto_increment();
> +      /* must check again when the mutex is taken */
> +      if (nr >= ha_data->next_auto_inc_val)
> +        ha_data->next_auto_inc_val= nr + 1;
> +      ha_data->auto_inc_initialized= TRUE;
> +    }
> +  }
> +
> +public:
>  
>    /*
>       -------------------------------------------------------------------------
> diff -Nrup a/sql/table.h b/sql/table.h
> --- a/sql/table.h	2007-12-20 21:24:07 +01:00
> +++ b/sql/table.h	2008-03-05 23:56:03 +01:00
> @@ -358,6 +358,7 @@ typedef struct st_table_share
>    */
>    int cached_row_logging_check;
>  
> +  /* TODO: Move into *ha_data for partitioning */
>  #ifdef WITH_PARTITION_STORAGE_ENGINE
>    bool auto_partitioned;
>    const char *partition_info;
> @@ -367,6 +368,9 @@ typedef struct st_table_share
>    uint part_state_len;
>    handlerton *default_part_db_type;
>  #endif
> +
> +  /* place to store storage engine specific data */

GB /** for doxygen

> +  void *ha_data;

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /    Mr. Guilhem Bichot <guilhem@stripped>
 / /|_/ / // /\ \/ /_/ / /__   MySQL AB, Lead Software Engineer
/_/  /_/\_, /___/\___\_\___/   Bordeaux, France
       <___/   www.mysql.com   