From:Guilhem Bichot Date:April 17 2008 1:28pm
Subject:Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479
Hello Mattias,

My comments prefixed with GB.
Please also see the ones I sent in a previous mail:
 Date: Thu, 10 Apr 2008 15:59:41 +0200
 From: Guilhem Bichot <guilhem@stripped>
 To: Mattias Jonsson <mattiasj@stripped>

Generally I think your patch is approaching perfection, so some of my
comments below are just ideas for how to present the code and its comments.
Regarding testing, given that your patch touches concurrency (a longer
lock for statement-based replication, etc.), I'd say it needs
"additional QA testing" (you know, this checkbox in the bug
report). Before the patch is rolled out to our customers, I think a QA
person should run a stress test like this: multiple threads all
inserting into the same partitioned table, using row-based and then
statement-based replication, checking that the slave does not go wrong
(it could go wrong without the lock specific to statement-based
replication), and checking that INSERTs don't receive duplicate key
errors (as was the case in the past for InnoDB+partition, fixed by
some code which you had to remove in your patch). Doing it for
MyISAM, InnoDB, and Falcon at least.

On Tue, Apr 01, 2008 at 11:04:45PM +0200, mattiasj@stripped wrote:
> ChangeSet@stripped, 2008-04-01 23:04:40+02:00, mattiasj@witty. +11 -0
>   Bug#33479: auto_increment failures in partitioning
>   
>   Several problems with auto_increment in partitioning
>   (with MyISAM, InnoDB and Falcon. Locking issues, not handling
>   multi row inserts properly etc.)
>   
>   This is a full patch which I have updated according to sergs and
>   guilhems comments.
>   Changed the auto_increment handling for partitioning:
>   Added a ha_data variable in table_share for storage engine specific data such as
>   auto_increment value handling in partitioning
>   and using the ha_data->mutex to lock around read + update.
>   
>   The idea is this:
>   Store the table's reserved auto_increment value in
>   the TABLE_SHARE and use a mutex to, lock it for reading and updating it
>   and unlocking it, in one block. Only accessing all partitions
>   when it is not initialized.
>   Also allow reservations of ranges, and if no one has done a reservation
>   afterwards, lower the reservation to what was actually used after
>   the statement is done (via release_auto_increment from WL 3146).
>   The lock is kept from the first reservation if it is statement based
>   replication and a multi row insert statement

GB "multi row" is a bit vague, see below.

>   (insert select, load data etc.)

GB We don't know what 'etc' is.
InnoDB has descriptions and names here:
http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html
("simple inserts", "bulk inserts", "mixed inserts"), but it's not easy
to guess from the name alone what a "simple insert" is.
I'd say
"multi-row INSERT statement where the number of candidate rows to
insert is not known in advance (like INSERT SELECT, LOAD DATA; unlike
INSERT VALUES(row1),(row2),...,(rowN))"
Such verbose description is even more important in the code, which has
a longer life than the changeset comment.

>   sql/ha_partition.cc@stripped, 2008-04-01 23:04:38+02:00, mattiasj@witty. +227 -103
>     Bug#33479: Failures using auto_increment and partitioning
>     
>     Changed ha_partition::get_auto_increment from file->get_auto_increment
>     to file->info(HA_AUTO_STATUS), since it is works better with InnoDB

GB Please add why it works better.

> diff -Nrup a/mysql-test/extra/rpl_tests/rpl_auto_increment.test
> b/mysql-test/extra/rpl_tests/rpl_auto_increment.test
> --- a/mysql-test/extra/rpl_tests/rpl_auto_increment.test	2007-06-06 19:48:51 +02:00
> +++ b/mysql-test/extra/rpl_tests/rpl_auto_increment.test	2008-04-01 23:04:38 +02:00
> @@ -12,7 +12,7 @@
>  -- source include/not_ndb_default.inc
>  -- source include/master-slave.inc
>  
> -eval create table t1 (a int not null auto_increment,b int, primary key (a))
> engine=$engine_type2 auto_increment=3;
> +eval create table t1 (a int not null auto_increment,b int, primary key (a))
> auto_increment=3 engine=$engine_type2;
>  insert into t1 values (NULL,1),(NULL,2),(NULL,3);
>  select * from t1;
>  
> @@ -76,7 +76,7 @@ insert into t1 values (NULL),(5),(NULL),
>  insert into t1 values (500),(NULL),(502),(NULL),(NULL);
>  select * from t1;
>  set @@insert_id=600;
> ---error ER_DUP_ENTRY
> +--error ER_DUP_ENTRY, ER_DUP_KEY

GB do you know why two errors can happen now and not one?
As for the tests and results, I haven't re-checked them since one of your
previous commits, so I don't know whether you have changed anything in
them since then or whether the results changed.

> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc
> --- a/sql/ha_partition.cc	2007-12-20 19:16:51 +01:00
> +++ b/sql/ha_partition.cc	2008-04-01 23:04:38 +02:00

> @@ -2270,8 +2273,10 @@ int ha_partition::open(const char *name,
>    uint alloc_len;
>    handler **file;
>    char name_buff[FN_REFLEN];
> +  bool is_not_tmp_table= (table_share->tmp_table == NO_TMP_TABLE);
>    DBUG_ENTER("ha_partition::open");
>  
> +  DBUG_ASSERT(table->s == table_share);

GB sounds logical but I hope there is no corner case... Are you sure
that the '==' is always true?

> @@ -4412,20 +4439,53 @@ int ha_partition::handle_ordered_prev(uc
>  int ha_partition::info(uint flag)
>  {
>    handler *file, **file_array;

GB please move those two variables down to the else{} which uses them.

> -  DBUG_ENTER("ha_partition:info");
> +  DBUG_ENTER("ha_partition::info");
>  
>    if (flag & HA_STATUS_AUTO)
>    {
> -    ulonglong auto_increment_value= 0;
> +    bool auto_inc_is_first_in_idx= (table_share->next_number_keypart == 0);
> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>      DBUG_PRINT("info", ("HA_STATUS_AUTO"));
> -    file_array= m_file;
> -    do
> +    if (!table->found_next_number_field)
> +      stats.auto_increment_value= 0;
> +    else if (ha_data->auto_inc_initialized && auto_inc_is_first_in_idx)

GB I suspect that you can remove '&& auto_inc_is_first_in_idx' above.

> +    {
> +      lock_auto_increment();
> +      stats.auto_increment_value= ha_data->next_auto_inc_val;
> +      unlock_auto_increment();
> +    }
> +    else
>      {
> -      file= *file_array;
> -      file->info(HA_STATUS_AUTO);
> -      set_if_bigger(auto_increment_value, file->stats.auto_increment_value);
> -    } while (*(++file_array));
> -    stats.auto_increment_value= auto_increment_value;
> +      lock_auto_increment();
<cut>

> @@ -5579,19 +5639,36 @@ int ha_partition::cmp_ref(const uchar *r
>                  MODULE auto increment
>  ****************************************************************************/
>  
> -void ha_partition::restore_auto_increment(ulonglong)
> -{
> -  DBUG_ENTER("ha_partition::restore_auto_increment");
>  
> -  DBUG_VOID_RETURN;
> +int ha_partition::reset_auto_increment(ulonglong value)
> +{
> +  handler **pos, **end;
> +  int res;
> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +  DBUG_ENTER("ha_partition::reset_auto_increment");
> +  lock_auto_increment();
> +  ha_data->auto_inc_initialized= FALSE;
> +  ha_data->next_auto_inc_val= 0;
> +  for (pos=m_file, end= m_file + m_tot_parts; pos != end ; pos++)

GB one might write it like this:
  for (handler **pos=m_file, **end= m_file + m_tot_parts; pos != end ; pos++)
to keep "pos" and "end" local to the loop where they are used.
This is a mere suggestion.

By the way, this m_file+m_tot_parts is not your invention, but there
is another way to write the loop, which is more common throughout the
ha_partition.cc file, it's
do { ...  } while (*(++file));
This is also a mere suggestion.

> +  {
> +    if ((res= (*pos)->ha_reset_auto_increment(value)) != 0)
> +    {
> +      unlock_auto_increment();
> +      DBUG_RETURN(res);
> +    }

GB I think the code could be smaller if it just did "break" and then
the DBUG_RETURN(0) below were changed to DBUG_RETURN(res),
with a LINT_INIT(res) at the start if a compiler complains.
This could make the code smaller because it would remove one call to the
inlined function unlock_auto_increment() and a DBUG_RETURN(), which is a
function call in debug builds (which don't really matter, I agree).
This is a mere suggestion.

> +  }
> +  unlock_auto_increment();
> +  DBUG_RETURN(0);
>  }
>  
>  
> -/*
> +/**
>    This method is called by update_auto_increment which in turn is called
> -  by the individual handlers as part of write_row. We will always let
> -  the first handler keep track of the auto increment value for all
> -  partitions.
> +  by the individual handlers as part of write_row. We use the
> +  table_share->ha_data->next_auto_inc_val, or search all
> +  partitions for the highest auto_increment_value if not initialized or
> +  if auto_increment field is a secondary part of a key, we must search
> +  every partition when holding a mutex to be sure of correctness.
>  */
>  
>  void ha_partition::get_auto_increment(ulonglong offset, ulonglong increment,
> @@ -5599,59 +5676,82 @@ void ha_partition::get_auto_increment(ul
>                                        ulonglong *first_value,
>                                        ulonglong *nb_reserved_values)
>  {
> -  ulonglong first_value_part, last_value_part, nb_reserved_values_part,
> -    last_value= ~ (ulonglong) 0;
> -  handler **pos, **end;
> -  bool retry= TRUE;
>    DBUG_ENTER("ha_partition::get_auto_increment");
> -
> -again:
> -  for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
> +  DBUG_PRINT("info", ("offset: %lu inc: %lu desired_values: %lu "
> +                      "first_value: %lu", (ulong) offset, (ulong) increment,
> +                      (ulong) nb_desired_values, (ulong) *first_value));
> +  DBUG_ASSERT(increment && nb_desired_values);
> +  *first_value= 0;
> +  if (table->s->next_number_keypart)
>    {
> -    first_value_part= *first_value;
> -    (*pos)->get_auto_increment(offset, increment, nb_desired_values,
> -                               &first_value_part,
> &nb_reserved_values_part);
> -    if (first_value_part == ~(ulonglong)(0)) // error in one partition
> -    {
> -      *first_value= first_value_part;
> -      sql_print_error("Partition failed to reserve auto_increment value");
> -      DBUG_VOID_RETURN;
> -    }
>      /*
> -      Partition has reserved an interval. Intersect it with the intervals
> -      already reserved for the previous partitions.
> +      next_number_keypart is != 0 if the auto_increment column is a secondary
> +      column in the index (it is allowed in MyISAM)
>      */
> -    last_value_part= (nb_reserved_values_part == ULONGLONG_MAX) ?
> -      ULONGLONG_MAX : (first_value_part + nb_reserved_values_part * increment);
> -    set_if_bigger(*first_value, first_value_part);
> -    set_if_smaller(last_value, last_value_part);
> -  }
> -  if (last_value < *first_value) /* empty intersection, error */
> -  {
> +    DBUG_PRINT("info", ("next_number_keypart != 0"));
> +    ulonglong first_value_part, nb_reserved_values_part;
> +    handler **pos, **end;
>      /*
> -      When we have an empty intersection, it means that one or more
> -      partitions may have a significantly different autoinc next value.
> -      We should not fail here - it just means that we should try to
> -      find a new reservation making use of the current *first_value
> -      wbich should now be compatible with all partitions.
> +      Must lock and find highest value among all partitions.

GB the comment would fit on one line together with its /* */ (advantage:
more lines on the screen when reading this complex function).

>      */
> -    if (retry)
> +    lock_auto_increment();
> +    for (pos=m_file, end= m_file + m_tot_parts; pos != end ; pos++)

GB pos and end could be local to the loop.

>      {
> -      retry= FALSE;
> -      last_value= ~ (ulonglong) 0;
> -      release_auto_increment();
> -      goto again;
> +      first_value_part= 0;

GB please add a comment to explain the first_value_part=0 above: "hack
for InnoDB which considers *first_value_part as an INOUT parameter
though it's only an OUT parameter per the API". This is important.

> +      /* Only nb_desired_values = 1 makes sense */
> +      (*pos)->get_auto_increment(offset, increment, 1,
> +                                 &first_value_part,
> &nb_reserved_values_part);
> +      if (first_value_part == ~(ulonglong)(0)) // error in one partition
> +      {
> +        *first_value= first_value_part;
> +        sql_print_error("Partition failed to reserve auto_increment value");
> +        unlock_auto_increment();
> +        DBUG_VOID_RETURN;
> +      }
> +      DBUG_PRINT("info", ("first_value_part: %lu", (ulong) first_value_part));
> +      set_if_bigger(*first_value, first_value_part);


GB this get_auto_increment() was partly written by me, so this is more
a criticism of myself: it dereferences first_value too much, once per
partition of the table.
I'd try to remove most dereferences (error cases matter less) by
using a local variable and, at the end of the function, doing
*first_value= the_local_variable.

>      }
> +    *nb_reserved_values= 1;
> +    unlock_auto_increment();
> +  }
> +  else
> +  {
> +    THD *thd= ha_thd();
> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>      /*
> -      We should not get here.
> +     This is initialized in the beginning of the first write_row call.

GB indentation of "This is"

>      */
> -    sql_print_error("Failed to calculate auto_increment value for partition");
> -    
> -    *first_value= ~(ulonglong)(0);
> +    DBUG_ASSERT(ha_data->auto_inc_initialized);
> +    /*
> +      Get a lock for handling the auto_increment in table_share->ha_data
> +      for avoiding two concurrent statements getting the same number.
> +    */ 
> +
> +    lock_auto_increment();
> +
> +    /*
> +      in a multi row insert statement like insert ... select, load data 
> +      we must hold a lock/mutex for the whole statement if we have statement
> +      based replication.

GB I'd mention the key fact that the number of candidate rows to
insert is not known in advance, and also the reason why we have to
hold the lock/mutex: because the statement-based binary log contains
only the first generated value used by the statement, and the slave
assumes all other generated values used by this statement were
consecutive to this first one, we must exclusively lock the generator
until the statement is done. Something like that; maybe I'm too
verbose. Well, something which a new dev could understand.

> +    */
> +    if (!auto_increment_multi_row_stmt_lock &&
> +        thd->variables.binlog_format == BINLOG_FORMAT_STMT &&
> +        (thd->lex->sql_command == SQLCOM_REPLACE_SELECT ||
> +         thd->lex->sql_command == SQLCOM_INSERT_SELECT ||
> +         thd->lex->sql_command == SQLCOM_LOAD))

GB As said in a previous email, this test should also test other
variables related to the binlog.
About sql_command values: InnoDB, in
ha_innobase::innobase_autoinc_lock(), tests in another way:
if (SQLCOM_INSERT) don't use the lock, else use the lock.
That is more future-safe if somebody later adds a new command to MySQL
and does not update your test above:
the InnoDB way will maybe be too conservative for the new command (no
data corruption but non-optimal performance), while the way of your
patch will maybe be too unsafe for the new command (optimal
performance but data corruption).

> +    {
> +      DBUG_PRINT("info", ("locking auto_increment_multi_row_stmt_lock"));
> +      auto_increment_multi_row_stmt_lock= TRUE;
> +    }
> +
> +    /* this gets corrected (for offset/increment) in update_auto_increment */
> +    *first_value= ha_data->next_auto_inc_val;
> +    ha_data->next_auto_inc_val= *first_value + nb_desired_values * increment;

GB I'd change the line above to 
ha_data->next_auto_inc_val+= nb_desired_values * increment;

> +
> +    unlock_auto_increment();
> +    DBUG_PRINT("info", ("*first_value: %lu", (ulong) *first_value));
> +    *nb_reserved_values= nb_desired_values;
>    }
> -  if (increment)                                // If not check for values
> -    *nb_reserved_values= (last_value == ULONGLONG_MAX) ?
> -      ULONGLONG_MAX : ((last_value - *first_value) / increment);
>    DBUG_VOID_RETURN;
>  }
> @@ -5659,9 +5759,33 @@ void ha_partition::release_auto_incremen
>  {
>    DBUG_ENTER("ha_partition::release_auto_increment");

GB I tested that bypassing this function makes your tests fail - good!

> -  for (uint i= 0; i < m_tot_parts; i++)
> +  if (table->s->next_number_keypart)
> +  {
> +    for (uint i= 0; i < m_tot_parts; i++)
> +      m_file[i]->ha_release_auto_increment();
> +  }
> +  else
>    {
> -    m_file[i]->ha_release_auto_increment();
> +    if (next_insert_id)

GB we could glue 'else' and 'if' and remove a pair of {}:
 else if (next_insert_id)

> +    {
> +      HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +      lock_auto_increment();
> +      if (next_insert_id < ha_data->next_auto_inc_val &&
> +          auto_inc_interval_for_cur_row.maximum() >=
> +          ha_data->next_auto_inc_val)

GB you may want to cache ha_data->next_auto_inc_val in a local
variable to avoid dereferencing it twice.

> +        ha_data->next_auto_inc_val= next_insert_id;
> +      DBUG_PRINT("info", ("ha_data->next_auto_inc_val: %lu",
> +                          (ulong) ha_data->next_auto_inc_val));
> +
> +      /* Unlock the multi row statement lock taken in get_auto_increment */
> +      if (auto_increment_multi_row_stmt_lock)
> +      {
> +        auto_increment_multi_row_stmt_lock= FALSE;
> +        DBUG_PRINT("info", ("unlocking auto_increment_multi_row_stmt_lock"));

GB If there wasn't this DBUG_PRINT(), I'd suggest removing the if() and
unconditionally doing
auto_increment_multi_row_stmt_lock= FALSE;
because "read + test + jump" is probably more costly than "write".

> +      }
> +
> +      unlock_auto_increment();
> +    }
>    }
>    DBUG_VOID_RETURN;
>  }
> diff -Nrup a/sql/ha_partition.h b/sql/ha_partition.h
> --- a/sql/ha_partition.h	2007-09-24 15:30:28 +02:00
> +++ b/sql/ha_partition.h	2008-04-01 23:04:38 +02:00
> @@ -37,6 +37,15 @@ typedef struct st_partition_share
>  } PARTITION_SHARE;
>  #endif
>  
> +/**
> +  Partition specific ha_data struct.
> +  @todo: move all partition specific data from TABLE_SHARE here.
> +*/
> +typedef struct st_ha_data_partition
> +{
> +  ulonglong next_auto_inc_val;                 /**< first non reserved value */
> +  bool auto_inc_initialized;
> +} HA_DATA_PARTITION;
>  
>  #define PARTITION_BYTES_IN_POS 2
>  class ha_partition :public handler
> @@ -141,6 +150,8 @@ private:
>      "own" the m_part_info structure.
>    */
>    bool is_clone;
> +  bool auto_increment_lock;             /* lock reading/updating auto_inc */
> +  bool auto_increment_multi_row_stmt_lock; /* used with auto_increment_lock */

GB Please make both comments start with /**< so that Doxygen picks
them up.
Is auto_increment_multi_row_stmt_lock a good name? I mean, it's not
used for INSERT VALUES(),(),() which is a multi-row INSERT too.
This variable is true when we don't know how many rows are to be
inserted, so maybe we could find a better name for it.
Also, "used with auto_increment_lock" is not really a description of
the member; we should specify what it serves for. If the comment
becomes too long to sit at the right of the declaration, it can come
on top:
/**
  the comment
*/
bool auto_increment_multi_row_stmt_lock;

>  public:
>    handler *clone(MEM_ROOT *mem_root);
>    virtual void set_part_info(partition_info *part_info)

> @@ -828,12 +839,46 @@ public:
>      auto_increment_column_changed
>       -------------------------------------------------------------------------
>    */
> -  virtual void restore_auto_increment(ulonglong prev_insert_id);
>    virtual void get_auto_increment(ulonglong offset, ulonglong increment,
>                                    ulonglong nb_desired_values,
>                                    ulonglong *first_value,
>                                    ulonglong *nb_reserved_values);
>    virtual void release_auto_increment();
> +private:
> +  virtual int reset_auto_increment(ulonglong value);
> +  virtual void lock_auto_increment()
> +  {
> +    if (auto_increment_multi_row_stmt_lock)
> +      return;

GB I'd add a comment above, like
/* lock is already locked */

> +    DBUG_ASSERT(table_share->ha_data && !auto_increment_lock);
> +    if(table_share->tmp_table == NO_TMP_TABLE)
> +    {
> +      auto_increment_lock= TRUE;
> +      pthread_mutex_lock(&table_share->mutex);
> +    }
> +  }
> +  virtual void unlock_auto_increment()
> +  {
> +    DBUG_ASSERT(table_share->ha_data);
> +    if(auto_increment_lock && !auto_increment_multi_row_stmt_lock)


GB I'd add a comment:
/* if auto_increment_multi_row_stmt_lock is true, we have to keep the
lock; it will be released only at end of statement */

> +    {
> +      pthread_mutex_unlock(&table_share->mutex);
> +      auto_increment_lock= FALSE;
> +    }
> +  }
> +  virtual void set_auto_increment_if_higher()
> +  {
> +    ulonglong nr= table->next_number_field->val_int();
> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> +    lock_auto_increment();
> +    /* must check when the mutex is taken */
> +    if (nr >= ha_data->next_auto_inc_val)
> +      ha_data->next_auto_inc_val= nr + 1;
> +    ha_data->auto_inc_initialized= TRUE;
> +    unlock_auto_increment();
> +  }

GB I'd add
DBUG_ASSERT(!auto_increment_lock && !auto_increment_multi_row_stmt_lock);
to handler::ha_external_lock() (it already contains something similar
for next_insert_id==0).

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /    Mr. Guilhem Bichot <guilhem@stripped>
 / /|_/ / // /\ \/ /_/ / /__   Sun Microsystems, Lead Software Engineer
/_/  /_/\_, /___/\___\_\___/   Bordeaux, France
       <___/   www.mysql.com   