From:Guilhem Bichot Date:March 18 2008 4:45pm
Subject:Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479
Hello,

I marked my comments with GB2.

On Fri, Mar 14, 2008 at 01:54:08AM +0100, Mattias Jonsson wrote:
> Hi,
> 
> Thanks for your comments.
> 
> I must say that this is the best review I ever got (both for cheering me 
> up and for the good advice and thorough code AND test review!) Thanks!

GB2 My pleasure :)

> My replies marked as MJ.
> 
> I will probably commit an update to the previous patch tomorrow. Then it 
> will be easier to see the changes.

GB2 Ok. Please collapse the update with the original csets (code and
test), so that we have one single cset for the entire bugfix.

> Guilhem Bichot wrote:
> >Hello,
> >
> >As this is a big patch, I marked my comments with "GB" for easier
> >searching.
> >By the way this is an excellent patch.
> >
> >On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote:
> >>ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3 -0
> >>  Bug#33479: auto_increment failures in partitioning
> >>  
> >>  Several problems with auto_increment in partitioning
> >>  (with MyISAM, InnoDB and Falcon)
> >>  
> >>  Updated code-only patch (the test cases will be in separate patches:
> >>  5.1+ (InnoDB+MyISAM) and 6.0 (Falcon))

> >>diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc

> >>@@ -2354,6 +2355,24 @@ int ha_partition::open(const char *name,
> >>                          0, key_rec_cmp, (void*)this)))
> >>     goto err_handler;
> >> 
> >>+
> >>+  if (table_share->tmp_table == NO_TMP_TABLE)
> >>+    pthread_mutex_lock(&table->s->mutex);
> >>+  if (!table_share->ha_data)
> >>+  {
> >>+    HA_DATA_PARTITION *ha_data;
> >>+    /* currently only needed for auto_increment */
> >>+    table_share->ha_data= alloc_root(&table_share->mem_root,
> >>+                                     sizeof(HA_DATA_PARTITION));
> >>+    if (!table_share->ha_data)
> >>+      goto err_handler;
> >>+    ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> >>+    bzero(ha_data, sizeof(HA_DATA_PARTITION));
> >>+    if (table_share->tmp_table == NO_TMP_TABLE)
> >>+      pthread_mutex_init(&ha_data->mutex, MY_MUTEX_INIT_FAST);
> >
> >GB you need to destroy this mutex, and do so before
> >table_share->mem_root is freed.
> >
> 
> MJ: Where should I do that? I cannot find any good spot. Should I maybe 
> use the table_share->mutex and remove the ha_data->mutex instead?

GB2 You can keep ha_data->mutex, and destroy it in ha_partition::close()
(assuming such a close always happens before free_table_share(), which
sounds logical but I haven't verified).
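To illustrate the lifecycle I mean, here is a minimal standalone sketch
(share_t, ha_data_partition_t and the ref_count are stand-ins I invented,
not the real TABLE_SHARE; the outer table->s->mutex locking from open() is
omitted for brevity): the first open() creates the mutex, the last close()
destroys it before the share's memory would be freed.

```cpp
#include <pthread.h>
#include <cstring>

// Hypothetical stand-ins for the share-level data discussed above.
struct ha_data_partition_t {
  pthread_mutex_t mutex;
  unsigned long long next_auto_inc_val;
  bool auto_inc_initialized;
};

struct share_t {
  ha_data_partition_t *ha_data;   // lazily set up on first open
  unsigned ref_count;             // handler instances using this share
};

// First open() allocates and initializes the share-level mutex.
int open_handler(share_t *share) {
  if (!share->ha_data) {
    static ha_data_partition_t storage;  // stands in for alloc_root()
    std::memset(&storage, 0, sizeof(storage));
    pthread_mutex_init(&storage.mutex, nullptr);
    share->ha_data = &storage;
  }
  share->ref_count++;
  return 0;
}

// close() destroys the mutex when the last handler instance goes away,
// i.e. before the share's memory would be freed.
int close_handler(share_t *share) {
  if (--share->ref_count == 0 && share->ha_data) {
    pthread_mutex_destroy(&share->ha_data->mutex);
    share->ha_data = nullptr;
  }
  return 0;
}
```

The point is only the pairing: every pthread_mutex_init() has a matching
pthread_mutex_destroy() that runs while the storage is still valid.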

> I added the ha_data->mutex for finer granularity, to allow more 
> concurrency; maybe we can add this first when serg has made a worklog 
> about the table_share->ha_data, as he said in a mail 2008-02-19:
> 
> When you're done, I'll create a WL about making a proper use of
> TABLE_SHARE::ha_data (which would include adding constructor/destructor
> methods to the handlerton, moving all storage engines to the new API,
> moving existing partition-only fields in the TABLE_SHARE to the new
> structure that you'll create).

GB2 I don't see how this WL would help with destroying the mutex.
 
> >>+  }
> >>+  if (table_share->tmp_table == NO_TMP_TABLE)
> >>+    pthread_mutex_unlock(&table->s->mutex);
> >
> >GB you may want to have a bool on the stack which keeps the value of
> >table_share->tmp_table == NO_TMP_TABLE, to avoid doing the test 3
> >times in the worst case.
> >To have maybe less pointer dereference of table_share (in
> >table_share->ha_data) you could do
> >table_share->ha_data= ha_data= (HA_DATA_PARTITION*) alloc_root(...);
> >if (!ha_data) // or ha_data==NULL, as you like
> >  goto err_handler;
> >bzero(ha_data, sizeof(HA_DATA_PARTITION));
> >This would be one dereference instead of three.
> >
> 
> MJ: I added:
> bool is_not_tmp_table= (table_share->tmp_table == NO_TMP_TABLE)
> and changed the ha_data as you proposed.

GB2 Good. I just remembered an idea: in
    table_share->ha_data= ha_data= (HA_DATA_PARTITION*)
                                   alloc_root(&table_share->mem_root,
                                              sizeof(HA_DATA_PARTITION));
you could consider replacing the C-style cast with
reinterpret_cast<HA_DATA_PARTITION*>.
For the compiler it changes nothing; for the developer it makes the
kind of cast explicit (and one day we would be able to run
gcc -Wold-style-cast on our C++ code).
As our existing C++ code has lots of C-style casts, it is up to you.
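Putting the suggestions together (the bool on the stack, the single
dereference, and the C++-style cast), a compilable sketch — with stand-in
names: TABLE_SHARE_LIKE and malloc() replace the real TABLE_SHARE and
alloc_root(), and the mutex calls are elided:

```cpp
#include <cstring>
#include <cstdlib>

enum tmp_table_kind { NO_TMP_TABLE, TMP_TABLE };

struct HA_DATA_PARTITION {
  unsigned long long next_auto_inc_val;
  bool auto_inc_initialized;
};

struct TABLE_SHARE_LIKE {   // stand-in for TABLE_SHARE
  tmp_table_kind tmp_table;
  void *ha_data;
};

// Test tmp_table once, dereference table_share->ha_data once,
// and use a C++-style cast instead of a C-style one.
int open_share_data(TABLE_SHARE_LIKE *table_share) {
  const bool is_not_tmp_table = (table_share->tmp_table == NO_TMP_TABLE);
  (void)is_not_tmp_table;   // would guard pthread_mutex_lock/unlock here

  HA_DATA_PARTITION *ha_data;
  table_share->ha_data = ha_data =
      reinterpret_cast<HA_DATA_PARTITION*>(
          std::malloc(sizeof(HA_DATA_PARTITION)));  // alloc_root() in real code
  if (!ha_data)
    return 1;                                       // "goto err_handler"
  std::memset(ha_data, 0, sizeof(HA_DATA_PARTITION));  // bzero()
  return 0;
}
```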

> >>@@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows()
> >> {
> >>   int error;
> >>   handler **file;
> >>+  THD *thd= current_thd;
> >
> >GB In handler code like this, ha_thd() is to be used instead of
> >current_thd, because we think it could be faster:
> >- it's a function call like pthread_getspecific() is (current_thd uses
> >pthread_getspecific())
> >- it is in mysqld whereas pthread_getspecific() is in libc (farther so
> >slower to access, probably one indirection more).
> >- ha_thd() has good chances of just returning table->in_use, which is
> >probably not more work than pthread_getspecific() does, at least this
> >one:
>
> >http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/threads/tsd.c#_thr_getspecific
> >does more work (opensolaris is what popped first in Google - sign of
> >the times?).
> >
> 
> MJ: Thanks for showing me, I used it since it was used in several places 
> in the partitioning code. Should I change all current_thd to ha_thd in 
> ha_partition.cc?

GB2 Yes, you could do this change. If it breaks anything, it means
table->in_use is incorrect, so it's a bug; RC is a good time for
finding bugs.

> >>   DBUG_ENTER("ha_partition::delete_all_rows");
> >> 
> >>+  if (thd->lex->sql_command == SQLCOM_TRUNCATE)
> >>+  {
> >>+    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) 
> >>table_share->ha_data;
> >>+    lock_auto_increment();
> >>+    ha_data->next_auto_inc_val= 0;
> >>+    ha_data->auto_inc_initialized= FALSE;
> >>+    unlock_auto_increment();
> >>+  }
> >>   file= m_file;
> >>   do
> >>   {
> >>     if ((error= (*file)->ha_delete_all_rows()))
> >>       DBUG_RETURN(error);
> >>+    /* must reset the auto_increment for some engines (eg MyISAM) */
> >>+    if (thd->lex->sql_command == SQLCOM_TRUNCATE)
> >>+      (*file)->ha_reset_auto_increment(0);
> >
> >GB This piece moved to another bug report, I understood.
> >
> 
> MJ: I will remove those three lines. Shouldn't there be a comment in 
> handler.h for delete_all_rows() saying that it should reset auto_increment 
> if thd->lex->sql_command == SQLCOM_TRUNCATE?

GB2 Yes, it would be good.

> >>@@ -4412,20 +4451,56 @@ int ha_partition::handle_ordered_prev(uc
> >> int ha_partition::info(uint flag)
> >> {
> >>   handler *file, **file_array;
> >>+  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
> >
> >GB this declaration should be moved to the block which uses this variable:
> >the if(flag&HA_STATUS_AUTO) block.
> >
> (I have reorganized the ::info function regarding HA_STATUS_AUTO, see below.)
> MJ: Yes. I moved it down to ->
> 
> >>   DBUG_ENTER("ha_partition:info");
> >> 
> >>   if (flag & HA_STATUS_AUTO)
> >>   {
> 
> MJ: -> here.
> 
> >>-    ulonglong auto_increment_value= 0;
> >>     DBUG_PRINT("info", ("HA_STATUS_AUTO"));
> >>-    file_array= m_file;
> >>-    do
> >>+    if (ha_data->auto_inc_initialized && 
> >>!table_share->next_number_keypart)
> >
> >GB this is reading ha_data->auto_inc_initialized without mutex: there may
> >be another thread changing this variable at this moment. Then we may
> >go into the 'else' branch though next_auto_inc_val has already just
> >been set to true by the other thread, but this is ok as we will then
> >only do a useless setting of next_auto_inc_val.
> 
> MJ: Yes, this is the start of a race, but since I will acquire a lock 
> before set_if_bigger(next_auto_inc_val, ai..) (in the branch where it is 
> not initialized) it will not get updated if it has changed. So I guess I 
> will add a comment about it. Actually this is the point of 
> ha_data->auto_inc_initialized: since it is a bool, I assume I do not 
> have to acquire any lock/mutex for reading it, and it will only be 
> written once during the life of the table_share. And the few times 
> more than one thread actually runs after a check of 
> auto_inc_initialized == 0, there will be no harm, since 
> ha_data->next_auto_inc_val should always be locked before reading and 
> updating.

GB2 Those assumptions are safe provided that in concurrent situations,
the bool changes only from 0 to 1 and not from 1 to 0. If it could
change from 1 to 0, the tests without mutex could go wrong like this:

if (ha_data->auto_inc_initialized)
{
  use ha_data->next_auto_inc_val;
}
If at this moment such code runs:
lock_auto_increment();
ha_data->auto_inc_initialized= 0;
ha_data->next_auto_inc_val= 0;
unlock_auto_increment();

then the if() could be true, but when we use
ha_data->next_auto_inc_val it has become 0.
Fortunately, going from 1 to 0 happens only during TRUNCATE TABLE
which first closes all open instances and prevents opening new
instances, so there cannot be concurrency.
You may want to add a comment about this.
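To make the safe pattern concrete, here is a standalone sketch of the
0-to-1-only initialization (plain pthread names replace
lock_auto_increment()/unlock_auto_increment(); in the real code the
unlocked read of the bool additionally relies on the TRUNCATE exception
described above):

```cpp
#include <pthread.h>

// Stand-ins for the ha_data fields discussed above.
static pthread_mutex_t auto_inc_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool auto_inc_initialized = false;  // concurrently only goes 0 -> 1
static unsigned long long next_auto_inc_val = 0;

template <class T> static void set_if_bigger_val(T &a, T b) {
  if (b > a) a = b;
}

// What each racing opener does: the unlocked check may be stale, but
// the locked set_if_bigger makes the stale path merely redundant work,
// never wrong.
void init_auto_inc_from_partition(unsigned long long part_max) {
  if (!auto_inc_initialized) {              // cheap unlocked check
    pthread_mutex_lock(&auto_inc_mutex);    // lock_auto_increment()
    set_if_bigger_val(next_auto_inc_val, part_max + 1);
    auto_inc_initialized = true;
    pthread_mutex_unlock(&auto_inc_mutex);  // unlock_auto_increment()
  }
}
```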

> 
> >>     {
> >>-      file= *file_array;
> >>-      file->info(HA_STATUS_AUTO);
> >>-      set_if_bigger(auto_increment_value, 
> >>file->stats.auto_increment_value);
> >>-    } while (*(++file_array));
> >>-    stats.auto_increment_value= auto_increment_value;
> >>+      DBUG_PRINT("info", ("Using ha_data->next_auto_inc_val: %llu",
> >>+                          ha_data->next_auto_inc_val));
> >
> >GB From what I see in MSDN, %ll[u] is not supported by VS 2003 (it is in
> >C99 but Microsoft apparently needed more than 4 years to implement it
> >- now let's see how many flame emails I get for writing that
> >gratuitous sarcasm). It is supported starting from VS2005. There is a
> >VS2003 machine in pushbuild, so we may have to do some debugging on
> >it, thus we should avoid %ll[u] in the patch (the rest of MySQL code
> >already avoids it and uses instead [u]llstr/%s). I agree this is a
> >hassle.
> >
> 
> MJ: OK, I will probably remove all DBUG_PRINT. The handler.cc code uses 
> "%lu", (ulong) nr instead; is that ok? (Or should I do #ifdef DEBUG; 
> char buf[22] #endif before?)

GB2 %lu is ok


> >>+  {
> >>+    if ((res= (*pos)->ha_reset_auto_increment(value)) != 0)
> >>+    {
> >>+      unlock_auto_increment();
> >>+      DBUG_RETURN(res);
> >>+    }
> >>+  }
> >>+  unlock_auto_increment();
> >>+  DBUG_RETURN(0);
> >
> >GB or maybe (smaller code, different behaviour in case if error but not
> >clear if that itself is important):
> >  int res= 0;
> >  ...
> >  for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
> >    res|= (*pos)->ha_reset_auto_increment(value);
> >  unlock_auto_increment();
> >  DBUG_RETURN(res != 0);
> >
> 
> MJ: I can rewrite it like this, but does res != 0 really give a correct 
> error code?

GB2 Ah, right, I misread the parentheses in the original code. Ok, keep it
like it is.
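For the record, a tiny demonstration of what I had misread: OR-ing the
results and returning res != 0 collapses any real handler error code to 1,
while the original early return keeps it (reset_all() and its parameters
are invented for the demo, not the real handler API):

```cpp
// part_results simulates what ha_reset_auto_increment() returns per
// partition: 0 on success, a handler error code otherwise.
int reset_all(const int *part_results, int nparts, bool collapse) {
  if (collapse) {
    int res = 0;
    for (int i = 0; i < nparts; i++)
      res |= part_results[i];
    return res != 0;                  // error identity is lost: 120 -> 1
  }
  for (int i = 0; i < nparts; i++)    // original behaviour: first error wins
    if (part_results[i] != 0)
      return part_results[i];
  return 0;
}
```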

> (I could add a new error and handle it in 
> ha_partition::print_error)
> 
> in sql_delete.cc row 331:
>     int error2= table->file->ha_reset_auto_increment(0);
> 
>     if (error2 && (error2 != HA_ERR_WRONG_COMMAND))
>     {
>       table->file->print_error(error2, MYF(0));
> 
> >  
> >> /*
> >>   This method is called by update_auto_increment which in turn is called
> >>-  by the individual handlers as part of write_row. We will always let
> >>-  the first handler keep track of the auto increment value for all
> >>-  partitions.
> >>+  by the individual handlers as part of write_row. We use the
> >>+  table_share->ha_data->next_auto_inc_val, or search all
> >>+  partitions for the highest auto_increment_value if not initialized or
> >>+  if auto_increment field is a secondary part of a key, we must search
> >>+  every partition with a mutex to be sure of correctness.
> >> */
> >
> >GB While you are changing the comment, you could format it for doxygen.
> >
> 
> MJ: I will try to update the comments for doxygen

GB2 Please fix that in the next version of the patch (the latest version
does not have it yet).
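As a starting point, the comment quoted above could look like this in
doxygen style (purely a formatting suggestion, keeping the wording of the
patch):

```cpp
/**
  Reserve auto_increment values for write_row().

  This method is called by update_auto_increment(), which in turn is
  called by the individual handlers as part of write_row(). We use
  table_share->ha_data->next_auto_inc_val, or search all partitions for
  the highest auto_increment_value if it is not initialized. If the
  auto_increment field is a secondary part of a key, we must search
  every partition under a mutex to be sure of correctness.
*/
```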


> 
> >>+  }
> >>+  else if (!ha_data->auto_inc_initialized)
> >>+  {
> >>     /*
> >>-      We should not get here.
> >>+      Not initialized, but should be locked in ha_partition::write_row().
> >>+      (Or is a temporary table).
> >
> >GB how about adding
> >if (this is not tmp table)
> >   safe_mutex_assert_owner(the expected mutex);
> >?
> >
> 
> MJ: Thanks, I did not know of that function.

GB2 Do you still need to add such an assertion, or do the code changes
make it unnecessary?
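In case it helps, the idea behind safe_mutex_assert_owner() can be
imitated in a few lines; this standalone sketch (owned_mutex_t and the
om_* names are invented, and the real safe_mutex does considerably more)
shows the shape of the "I hold this lock" assertion:

```cpp
#include <pthread.h>

// A mutex wrapper that records its owner, so callers can assert that
// they hold the lock before touching protected data.
struct owned_mutex_t {
  pthread_mutex_t m;
  pthread_t owner;
  bool locked;
};

void om_init(owned_mutex_t *om) {
  pthread_mutex_init(&om->m, nullptr);
  om->locked = false;
}

void om_lock(owned_mutex_t *om) {
  pthread_mutex_lock(&om->m);
  om->owner = pthread_self();
  om->locked = true;
}

void om_unlock(owned_mutex_t *om) {
  om->locked = false;   // clear before releasing, while still exclusive
  pthread_mutex_unlock(&om->m);
}

// The predicate one would assert before using next_auto_inc_val in the
// "not initialized" branch (for non-temporary tables).
bool om_owned_by_me(const owned_mutex_t *om) {
  return om->locked && pthread_equal(om->owner, pthread_self());
}
```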


> >>@@ -5659,9 +5813,28 @@ void ha_partition::release_auto_incremen
> >> {
> >>   DBUG_ENTER("ha_partition::release_auto_increment");
> >> 
> >>-  for (uint i= 0; i < m_tot_parts; i++)
> >>+  if (table->s->next_number_keypart)
> >>   {
> >>-    m_file[i]->ha_release_auto_increment();
> >>+    for (uint i= 0; i < m_tot_parts; i++)
> >>+      m_file[i]->ha_release_auto_increment();
> >>+  }
> >>+  else
> >>+  {
> >>+    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) 
> >>table_share->ha_data;
> >>+    DBUG_PRINT("info", ("next_auto_inc_val: %llu cur_row_max: %llu min: 
> >>%llu",
> >>+                        ha_data->next_auto_inc_val,
> >>+                        auto_inc_interval_for_cur_row.maximum(),
> >>+                        auto_inc_interval_for_cur_row.minimum()));
> >>+    if (next_insert_id)
> >>+    {
> >>+      lock_auto_increment();
> >>+      if (next_insert_id < ha_data->next_auto_inc_val &&
> >>+          auto_inc_interval_for_cur_row.maximum() >= 
> >>ha_data->next_auto_inc_val)
> >
> >GB too long line (80 chars).
> >
> 
> MJ: OK, I thought that it was not more than 80 chars... should I remove 
> one space in front of the line, or break it up like this:
>       if (next_insert_id < ha_data->next_auto_inc_val &&
>           auto_inc_interval_for_cur_row.maximum() >=
>           ha_data->next_auto_inc_val)
> ?

GB2 I'd break it up.
Right, our coding guidelines say "The maximum line width is 80
characters"... I took my habit for the rule then; I think I picked 79
after getting angry notes that I had lines 80-char long :)
So, ignore my comment.

> 
> >>+        ha_data->next_auto_inc_val= next_insert_id;
> >>+      unlock_auto_increment();
> >>+    }
> >>+    DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id: %llu",
> >>+                        ha_data->next_auto_inc_val, next_insert_id));
> >>   }
> >>   DBUG_VOID_RETURN;
> >> }

> Can there be any problems with concurrent multi-row inserts and 
> statement-based replication?

For statement-level replication to work, a multi-row INSERT (INSERT
VALUES or INSERT SELECT) must use consecutive values.
Let's assume that two concurrent statements are doing
INSERT INTO t SELECT * FROM u;
where t is a partitioned table, having an autoinc column first in the
index.
Each statement will call get_auto_increment() with
nb_desired_values==1 (see handler::update_auto_increment();
estimation_rows_to_insert is always 0 in INSERT SELECT):
statement1 moves next_auto_inc_val to 2
  statement2 moves next_auto_inc_val to 3
statement1 inserts 1
  statement2 inserts 2
statement1 notices that it has more rows to insert because SELECT
returned more than one row (see handler::update_auto_increment()),
calls get_auto_increment() with nb_desired_values==2 (twice the
previous nb_desired_values):
statement1 moves next_auto_inc_val to 5
statement1 inserts 3
and then we have the problem: statement1 has inserted 1 and 3, it's
not consecutive and will not replicate well.
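The interleaving can be replayed single-threaded; in this sketch,
reserve() and the doubling caller are drastic simplifications of
handler::update_auto_increment(), not the real code, but statement1 ends
up with 1 and 3 exactly as above:

```cpp
#include <vector>

// Shared counter, as in ha_data->next_auto_inc_val.
static unsigned long long next_auto_inc_val_sim = 1;

// Each "statement" grabs nb_desired values and advances the counter.
unsigned long long reserve(unsigned long long nb_desired) {
  unsigned long long first = next_auto_inc_val_sim;
  next_auto_inc_val_sim += nb_desired;
  return first;
}

// The values statement1 inserts for a 2-row INSERT SELECT when
// statement2 takes a value between statement1's two reservations.
std::vector<unsigned long long> statement1_values() {
  std::vector<unsigned long long> out;
  out.push_back(reserve(1));  // stmt1: takes 1, counter -> 2
  (void)reserve(1);           // stmt2: takes 2, counter -> 3
  out.push_back(reserve(2));  // stmt1: nb doubled to 2, takes 3, counter -> 5
  return out;                 // {1, 3}: not consecutive
}
```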

If the underlying engine of the partitioned table has table-level
locking, in fact there is no problem (statement2 will run only after
statement1 is done).
This is also the case of InnoDB:
- if innodb_auto_inc_lock_mode==2, it's documented to not work with
statement-based replication
- if ==0, it does table-level locking
- if ==1, it does table-level locking except if the statement is
INSERT/REPLACE VALUES (and the discussed case above is INSERT
SELECT).
But InnoDB does this fine logic only in
ha_innobase::get_auto_increment() (called when the table is not
partitioned), which is not called by
ha_partition::get_auto_increment().

So I suspect that yes, there are problems. This is a suspicion. A
multi-threaded test or code inspection could help know the truth.
Possible solutions:
- make ha_partition::get_auto_increment() imitate
ha_innobase::get_auto_increment(): let it keep the autoinc mutex if
(thd->lex->sql_command is != SQLCOM_INSERT && !row_based).
- let ha_partition switch to row-based replication if inserting into
an auto_increment column.
- make ha_partition::get_auto_increment() call engine's
get_auto_increment() to ask it to reserve [max,
max+nb_desired_values*increment]. The problem is that engine's
get_auto_increment() API does not tell the engine the lower bound of
the interval (*first_value is not supposed to be read by the engine),
only the interval's width (nb_desired_values). InnoDB does read
*first_value, violating the API, using that as a lower bound. We could
make InnoDB's violation the new norm: *first_value would become an
INOUT parameter (see the email I sent to Innobase recently, cc:you).
- have statement-based binlogging of multiple autoinc intervals
http://forge.mysql.com/worklog/task.php?id=3404 implemented.
- something better?
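To sketch the third option, here is what an INOUT *first_value could look
like; the signature and names are hypothetical (modelled on, but not
identical to, the current get_auto_increment() API), and the "engine" is
just a counter:

```cpp
// *first_value is IN (the caller's lower bound, today read only by
// InnoDB in violation of the API) and OUT (the start of the interval
// the engine actually reserved).
static unsigned long long engine_next_val = 1;

void get_auto_increment_inout(unsigned long long increment,
                              unsigned long long nb_desired_values,
                              unsigned long long *first_value,        /* INOUT */
                              unsigned long long *nb_reserved_values) /* OUT */ {
  // IN: never hand out values below the caller's lower bound.
  if (*first_value < engine_next_val)
    *first_value = engine_next_val;   // OUT: actual start of the interval
  // Reserve [*first_value, *first_value + nb_desired_values*increment)
  // atomically (a real engine would hold its autoinc mutex here).
  engine_next_val = *first_value + nb_desired_values * increment;
  *nb_reserved_values = nb_desired_values;
}
```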

In all cases, this is a significant problem, a discussion with the
other reviewer (Serg) would be needed.

Falcon supports only row-based replication right now.

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /    Mr. Guilhem Bichot <guilhem@stripped>
 / /|_/ / // /\ \/ /_/ / /__   MySQL AB, Lead Software Engineer
/_/  /_/\_, /___/\___\_\___/   Bordeaux, France
       <___/   www.mysql.com   