From: Mattias Jonsson
Date: March 27 2008 4:16pm
Subject: Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479
Hi,

I have a concern about destroying the ha_data->mutex.

It is not possible to destroy the table_share->ha_data->mutex in
ha_partition::close(), since another handler instance sharing the same
table_share can be in the middle of a write_row/get_auto_increment;
destroying the mutex under it would crash the server.


I think I will go back and use the table_share->mutex instead, and
suggest a method/function in WL4305 through which the handler can
destroy and free engine-specific data and objects if needed.
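
For reference, a minimal sketch of what the open() path would then look
like (assuming the lazy init stays in ha_partition::open() and only the
lock changes; note that the mutex must be released before the error exit):

  bool is_not_tmp_table= (table_share->tmp_table == NO_TMP_TABLE);
  if (is_not_tmp_table)
    pthread_mutex_lock(&table_share->mutex);
  if (!table_share->ha_data)
  {
    HA_DATA_PARTITION *ha_data;
    /* currently only needed for auto_increment */
    table_share->ha_data= ha_data= (HA_DATA_PARTITION*)
      alloc_root(&table_share->mem_root, sizeof(HA_DATA_PARTITION));
    if (!ha_data)
    {
      if (is_not_tmp_table)
        pthread_mutex_unlock(&table_share->mutex);
      goto err_handler;
    }
    bzero(ha_data, sizeof(HA_DATA_PARTITION));
    /* no ha_data->mutex any more, so nothing to destroy in close() */
  }
  if (is_not_tmp_table)
    pthread_mutex_unlock(&table_share->mutex);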

I will commit a new full patch soon.

My answers to your questions are below.

Regards Mattias

Guilhem Bichot wrote:
> Hello,
> 
> I marked my comments with GB2.
> 
> On Fri, Mar 14, 2008 at 01:54:08AM +0100, Mattias Jonsson wrote:
>> Hi,
>>
>> Thanks for your comments.
>>
>> I must say that this is the best review I have ever got (both for 
>> cheering me up and for the good advice and thorough code AND test 
>> review!) Thanks!
> 
> GB2 My pleasure :)
> 
>> My replies marked as MJ.
>>
>> I will probably commit an update to the previous patch tomorrow. Then it 
>> will be easier to see the changes.
> 
> GB2 Ok. Please collapse the update with the original csets (code and
> test), so that we have one single cset for the entire bugfix.
> I marked my replies with GB2.
> 
OK, I will collapse all changes, including the test, into one patch (and
then add a new patch for the 6.0/Falcon-specific test).

>> Guilhem Bichot wrote:
>>> Hello,
>>>
>>> As this is a big patch, I marked my comments with "GB" for easier
>>> searching.
>>> By the way this is an excellent patch.
>>>
>>> On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote:
>>>> ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3 -0
>>>>  Bug#33479: auto_increment failures in partitioning
>>>>  
>>>>  Several problems with auto_increment in partitioning
>>>>  (with MyISAM, InnoDB and Falcon)
>>>>  
>>>>  Updated code only patch (the test cases will be in a separate patches
>>>>  5.1+ (InnoDB+MyISAM) and 6.0 (Falcon)
> 
>>>> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc
> 
>>>> @@ -2354,6 +2355,24 @@ int ha_partition::open(const char *name,
>>>>                          0, key_rec_cmp, (void*)this)))
>>>>     goto err_handler;
>>>>
>>>> +
>>>> +  if (table_share->tmp_table == NO_TMP_TABLE)
>>>> +    pthread_mutex_lock(&table->s->mutex);
>>>> +  if (!table_share->ha_data)
>>>> +  {
>>>> +    HA_DATA_PARTITION *ha_data;
>>>> +    /* currently only needed for auto_increment */
>>>> +    table_share->ha_data= alloc_root(&table_share->mem_root,
>>>> +                                     sizeof(HA_DATA_PARTITION));
>>>> +    if (!table_share->ha_data)
>>>> +      goto err_handler;
>>>> +    ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>>>> +    bzero(ha_data, sizeof(HA_DATA_PARTITION));
>>>> +    if (table_share->tmp_table == NO_TMP_TABLE)
>>>> +      pthread_mutex_init(&ha_data->mutex, MY_MUTEX_INIT_FAST);
>>> GB you need to destroy this mutex, and do so before
>>> table_share->mem_root is freed.
>>>
>> MJ: Where should I do that? I cannot find any good spot. Should I maybe 
>> use the table_share->mutex and remove the ha_data->mutex instead?
> 
> GB2 You can keep ha_data->mutex, and destroy it in ha_partition::close()
> (assuming such close always happens before free_table_share(), which
> sounds logical but I haven't verified).
> 

See my question/answer above.

>> I added the ha_data->mutex for finer granularity, to allow more 
>> concurrency; maybe we can add it back once serg has made a worklog 
>> about table_share->ha_data, as he said in a mail on 2008-02-19:
>>
>> When you're done, I'll create a WL about making a proper use of
>> TABLE_SHARE::ha_data (which would include adding constructor/destructor
>> methods to the handlerton, moving all storage engines to the new API,
>> moving existing partition-only fields in the TABLE_SHARE to the new
>> structure that you'll create).
> 
> GB2 I don't see how this WL would help with destroying the mutex.
>  

I hoped that it would provide an interface through which the storage
engine can initialize and destroy its engine-specific parts, like
mutexes etc.
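
To make it concrete, something like this hypothetical pair of hooks is
what I had in mind (the names are invented here for illustration only;
they are not part of any existing API):

  /* hypothetical WL4305-style hooks, names made up */
  struct st_share_ha_data_hooks
  {
    /* called once when the TABLE_SHARE is set up */
    int  (*init_share_ha_data)(TABLE_SHARE *share);
    /* called from free_table_share() before its mem_root is freed,
       so a mutex stored in ha_data could be destroyed here */
    void (*destroy_share_ha_data)(TABLE_SHARE *share);
  };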

>>>> +  }
>>>> +  if (table_share->tmp_table == NO_TMP_TABLE)
>>>> +    pthread_mutex_unlock(&table->s->mutex);
>>> GB you may want to have a bool on the stack which keeps the value of
>>> table_share->tmp_table == NO_TMP_TABLE, to avoid doing the test 3
>>> times in the worst case.
>>> To have maybe less pointer dereference of table_share (in
>>> table_share->ha_data) you could do
>>> table_share->ha_data= ha_data= (HA_DATA_PARTITION*) alloc_root(...);
>>> if (!ha_data) // or ha_data==NULL, as you like
>>>  goto err_handler;
>>> bzero(ha_data, sizeof(HA_DATA_PARTITION));
>>> This would be one dereference instead of three.
>>>
>> MJ: I added:
>> bool is_not_tmp_table= (table_share->tmp_table == NO_TMP_TABLE)
>> and changed the ha_data as you proposed.
> 
> GB2 Good. I just remembered an idea: in
>     table_share->ha_data= ha_data= (HA_DATA_PARTITION*)
>                                    alloc_root(&table_share->mem_root,
>                                               sizeof(HA_DATA_PARTITION));
> you could consider replacing the C-style cast with
> reinterpret_cast<HA_DATA_PARTITION*>.
> For the compiler it changes nothing; for the developer it makes casts
> more classified (and one day we would be able to run
> gcc -Wold-style-cast on our C++ code).
> As our existing C++ code has lots of C-style casts, it is up to you.
> 

I could not get this to compile, so I will not change it.
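
(For the archives, the line I tried looked roughly like the following;
if someone wants to retry, static_cast is the usual C++ cast for a void*
such as the return value of alloc_root, so that may be the variant that
actually compiles:)

  table_share->ha_data= ha_data=
    static_cast<HA_DATA_PARTITION*>(alloc_root(&table_share->mem_root,
                                               sizeof(HA_DATA_PARTITION)));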

>>>> @@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows()
>>>> {
>>>>   int error;
>>>>   handler **file;
>>>> +  THD *thd= current_thd;
>>> GB In handler code like this, ha_thd() is to be used instead of
>>> current_thd, because we think it could be faster:
>>> - it's a function call like pthread_getspecific() is (current_thd uses
>>> pthread_getspecific())
>>> - it is in mysqld whereas pthread_getspecific() is in libc (farther so
>>> slower to access, probably one indirection more).
>>> - ha_thd() has good chances of just returning table->in_use, which is
>>> probably not more work than pthread_getspecific() does, at least this
>>> one:
>>>
>>> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/threads/tsd.c#_thr_getspecific
>>> does more work (opensolaris is what popped first in Google - sign of
>>> the times?).
>>>
>> MJ: Thanks for showing me; I used current_thd since it was already used 
>> in several places in the partitioning code. Should I change all 
>> current_thd to ha_thd() in ha_partition.cc?
> 
> GB2 Yes you could do this change. If it breaks anything, it means
> table->in_use is incorrect so it's a bug, RC is a good time for
> finding bugs.
> 

I have changed it in ha_partition.cc (uses of current_thd also exist in
sql_partition.cc, but since that is not handler-specific code, I have
not changed them).
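
The change itself is mechanical, for each occurrence (sketch):

  THD *thd= ha_thd();    /* was: THD *thd= current_thd; */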

>>>>   DBUG_ENTER("ha_partition::delete_all_rows");
>>>>
>>>> +  if (thd->lex->sql_command == SQLCOM_TRUNCATE)
>>>> +  {
>>>> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>>>> +    lock_auto_increment();
>>>> +    ha_data->next_auto_inc_val= 0;
>>>> +    ha_data->auto_inc_initialized= FALSE;
>>>> +    unlock_auto_increment();
>>>> +  }
>>>>   file= m_file;
>>>>   do
>>>>   {
>>>>     if ((error= (*file)->ha_delete_all_rows()))
>>>>       DBUG_RETURN(error);
>>>> +    /* must reset the auto_increment for some engines (eg MyISAM) */
>>>> +    if (thd->lex->sql_command == SQLCOM_TRUNCATE)
>>>> +      (*file)->ha_reset_auto_increment(0);
>>> GB This piece moved to another bug report, I understood.
>>>
>> MJ: I will remove those three lines. Shouldn't there be a comment in 
>> handler.h for delete_all_rows() saying that it should reset the 
>> auto_increment if thd->lex->sql_command == SQLCOM_TRUNCATE?
> 
> GB2 yes it would be good.
> 

I'll add the comment in handler.h.
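
Roughly along these lines (draft wording, not the committed text):

  /*
    delete_all_rows() is used by TRUNCATE TABLE among others.
    NOTE: If thd->lex->sql_command == SQLCOM_TRUNCATE, the engine
    should also reset its auto_increment counter, so that the next
    insert starts over from the beginning.
  */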


>>>> @@ -4412,20 +4451,56 @@ int ha_partition::handle_ordered_prev(uc
>>>> int ha_partition::info(uint flag)
>>>> {
>>>>   handler *file, **file_array;
>>>> +  HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>>> GB this declaration should be moved to the block which uses this variable:
>>> the if(flag&HA_STATUS_AUTO) block.
>>>
>> (I have reorganized the ::info function regarding HA_STATUS_AUTO, see below.)
>> MJ: Yes. I moved it down to ->
>>
>>>>   DBUG_ENTER("ha_partition:info");
>>>>
>>>>   if (flag & HA_STATUS_AUTO)
>>>>   {
>> MJ: -> here.
>>
>>>> -    ulonglong auto_increment_value= 0;
>>>>     DBUG_PRINT("info", ("HA_STATUS_AUTO"));
>>>> -    file_array= m_file;
>>>> -    do
>>>> +    if (ha_data->auto_inc_initialized && !table_share->next_number_keypart)
>>> GB this is reading ha_data->auto_inc_initialized without mutex: there may
>>> be another thread changing this variable at this moment. Then we may
>>> go into the 'else' branch even though auto_inc_initialized has already
>>> just been set to true by the other thread, but this is ok as we will
>>> then only do a useless setting of next_auto_inc_val.
>> MJ: Yes, this is the start of a race, but since I will acquire a lock 
>> before set_if_bigger(next_auto_inc_val, ai..) (in the branch where it is 
>> not initialized), it will not get updated if it has changed. So I guess 
>> I will add a comment about it. Actually, this is the point of 
>> ha_data->auto_inc_initialized: since it is a bool, I assume I do not 
>> have to acquire any lock/mutex for reading it, and it will only be 
>> written once during the life of the table_share. And the few times more 
>> than one thread is actually running after a check of 
>> auto_inc_initialized == 0, there will be no harm, since the handling of 
>> ha_data->next_auto_inc_val should always be locked before reading and 
>> updating.
> 
> GB2 Those assumptions are safe provided that in concurrent situations,
> the bool changes only from 0 to 1 and not from 1 to 0. If it could
> change from 1 to 0, the tests without mutex could go wrong like this:
> 
> if (ha_data->auto_inc_initialized)
> {
>   use ha_data->next_auto_inc_val;
> }
> If at this moment such code runs:
> lock_auto_increment();
> ha_data->auto_inc_initialized= 0;
> ha_data->next_auto_inc_val= 0;
> unlock_auto_increment();
> 
> then the if() could be true, but when we use
> ha_data->next_auto_inc_val it has become 0.
> Fortunately, going from 1 to 0 happens only during TRUNCATE TABLE
> which first closes all open instances and prevents opening new
> instances, so there cannot be concurrency.
> You may want to add a comment about this.
> 

Yes, those were my worries about using ha_data->mutex: I wanted it to be
initialized when initializing the share and destroyed when freeing the
share, but I do not know where to destroy it (see my question above).
So I will use table_share->mutex instead.
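
So the initialization path would look roughly like this (a sketch only;
lock_auto_increment()/unlock_auto_increment() would now map onto
table_share->mutex, and the bool only ever goes from 0 to 1 while the
share is open):

  if (!ha_data->auto_inc_initialized && !table_share->next_number_keypart)
  {
    lock_auto_increment();
    if (!ha_data->auto_inc_initialized)      /* re-check under the mutex */
    {
      /* scan all partitions for the highest auto_increment value */
      for (uint i= 0; i < m_tot_parts; i++)
      {
        m_file[i]->info(HA_STATUS_AUTO);
        set_if_bigger(ha_data->next_auto_inc_val,
                      m_file[i]->stats.auto_increment_value);
      }
      ha_data->auto_inc_initialized= TRUE;
    }
    unlock_auto_increment();
  }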

>>>>     {
>>>> -      file= *file_array;
>>>> -      file->info(HA_STATUS_AUTO);
>>>> -      set_if_bigger(auto_increment_value, file->stats.auto_increment_value);
>>>> -    } while (*(++file_array));
>>>> -    stats.auto_increment_value= auto_increment_value;
>>>> +      DBUG_PRINT("info", ("Using ha_data->next_auto_inc_val: %llu",
>>>> +                          ha_data->next_auto_inc_val));
>>> GB From what I see in MSDN, %ll[u] is not supported by VS 2003 (it is in
>>> C99 but Microsoft apparently needed more than 4 years to implement it
>>> - now let's see how many flame emails I get for writing that
>>> gratuitous sarcasm). It is supported starting from VS2005. There is a
>>> VS2003 machine in pushbuild, so we may have to do some debugging on
>>> it, thus we should avoid %ll[u] in the patch (the rest of MySQL code
>>> already avoids it and uses instead [u]llstr/%s). I agree this is a
>>> hassle.
>>>
>> MJ: OK, I will probably remove all DBUG_PRINTs. The handler.cc code uses 
>> "%lu", (ulong) nr instead, is that ok? (Otherwise, should I put #ifdef 
>> DEBUG; char buf[22] #endif before?)
> 
> GB2 %lu is ok
> 
> 

I'll use %lu if I add any DBUG_PRINT (I removed them in the updated patch).
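
I.e. in the style already used by handler.cc (sketch):

  DBUG_PRINT("info", ("next_auto_inc_val: %lu",
                      (ulong) ha_data->next_auto_inc_val));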

>>>> +  {
>>>> +    if ((res= (*pos)->ha_reset_auto_increment(value)) != 0)
>>>> +    {
>>>> +      unlock_auto_increment();
>>>> +      DBUG_RETURN(res);
>>>> +    }
>>>> +  }
>>>> +  unlock_auto_increment();
>>>> +  DBUG_RETURN(0);
>>> GB or maybe (smaller code, different behaviour in case if error but not
>>> clear if that itself is important):
>>>  int res= 0;
>>>  ...
>>>  for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>>>    res|= (*pos)->ha_reset_auto_increment(value);
>>>  unlock_auto_increment();
>>>  DBUG_RETURN(res != 0);
>>>
>> MJ: I can rewrite it like this, but does res != 0 really give a correct 
>> error code?
> 
> GB2 ah, right, I misread parenthesis in the original code. Ok, keep it
> like it is.
> 

Kept it unchanged.

>> (I could add a new error and handle it in 
>> ha_partition::print_error)
>>
>> in sql_delete.cc, line 331:
>>     int error2= table->file->ha_reset_auto_increment(0);
>>
>>     if (error2 && (error2 != HA_ERR_WRONG_COMMAND))
>>     {
>>       table->file->print_error(error2, MYF(0));
>>
>>>  
>>>> /*
>>>>   This method is called by update_auto_increment which in turn is called
>>>> -  by the individual handlers as part of write_row. We will always let
>>>> -  the first handler keep track of the auto increment value for all
>>>> -  partitions.
>>>> +  by the individual handlers as part of write_row. We use the
>>>> +  table_share->ha_data->next_auto_inc_val, or search all
>>>> +  partitions for the highest auto_increment_value if not initialized or
>>>> +  if auto_increment field is a secondary part of a key, we must search
>>>> +  every partition with a mutex to be sure of correctness.
>>>> */
>>> GB While you are changing the comment, you could format it for doxygen.
>>>
>> MJ: I will try to update the comments for doxygen
> 
> GB2 Please fix that in the next version of the patch (latest version
> does not have it yet).
> 
> 

Arrg, missed that comment...
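
Will fix in the next version; the comment would become something like
(draft wording only):

  /**
    Get an auto_increment interval.

    This method is called by update_auto_increment() which in turn is
    called by the individual handlers as part of write_row(). We use
    table_share->ha_data->next_auto_inc_val, or search all partitions
    for the highest auto_increment value if it is not initialized. If
    the auto_increment field is a secondary part of a key, we must
    search every partition under a mutex to be sure of correctness.
  */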

>>>> +  }
>>>> +  else if (!ha_data->auto_inc_initialized)
>>>> +  {
>>>>     /*
>>>> -      We should not get here.
>>>> +      Not initialized, but should be locked in ha_partition::write_row().
>>>> +      (Or is a temporary table).
>>> GB how about adding
>>> if (this is not tmp table)
>>>   safe_mutex_assert_owner(the expected mutex);
>>> ?
>>>
>> MJ: Thanks, I did not know of that function.
> 
> GB2 do you still need to add such assertion, or the code changes make
> it unnecessary?
> 
> 

There was no such assertion in my latest version, since I rewrote it.
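
For completeness, the assertion would have looked like this (sketch;
safe_mutex_assert_owner() compiles away when SAFE_MUTEX is not defined):

  if (table_share->tmp_table == NO_TMP_TABLE)
    safe_mutex_assert_owner(&table_share->mutex);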

>>>> @@ -5659,9 +5813,28 @@ void ha_partition::release_auto_incremen
>>>> {
>>>>   DBUG_ENTER("ha_partition::release_auto_increment");
>>>>
>>>> -  for (uint i= 0; i < m_tot_parts; i++)
>>>> +  if (table->s->next_number_keypart)
>>>>   {
>>>> -    m_file[i]->ha_release_auto_increment();
>>>> +    for (uint i= 0; i < m_tot_parts; i++)
>>>> +      m_file[i]->ha_release_auto_increment();
>>>> +  }
>>>> +  else
>>>> +  {
>>>> +    HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>>>> +    DBUG_PRINT("info", ("next_auto_inc_val: %llu cur_row_max: %llu min: %llu",
>>>> +                        ha_data->next_auto_inc_val,
>>>> +                        auto_inc_interval_for_cur_row.maximum(),
>>>> +                        auto_inc_interval_for_cur_row.minimum()));
>>>> +    if (next_insert_id)
>>>> +    {
>>>> +      lock_auto_increment();
>>>> +      if (next_insert_id < ha_data->next_auto_inc_val &&
>>>> +          auto_inc_interval_for_cur_row.maximum() >= ha_data->next_auto_inc_val)
>>> GB too long line (80 chars).
>>>
>> MJ: OK, I thought that it was not more than 80 chars... should I remove 
>> one space in front of the line, or break it up like this:
>>       if (next_insert_id < ha_data->next_auto_inc_val &&
>>           auto_inc_interval_for_cur_row.maximum() >=
>>           ha_data->next_auto_inc_val)
>> ?
> 
> GB2 I'd break it up.
> Right, our coding guidelines say "The maximum line width is 80
> characters"... I took my habit for the rule then; I think I picked 79
> after getting angry notes that I had lines 80-char long :)
> So, ignore my comment.
> 

I already broke it up in the latest version :)

>>>> +        ha_data->next_auto_inc_val= next_insert_id;
>>>> +      unlock_auto_increment();
>>>> +    }
>>>> +    DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id: %llu",
>>>> +                        ha_data->next_auto_inc_val, next_insert_id));
>>>>   }
>>>>   DBUG_VOID_RETURN;
>>>> }
> 
>> Can there be any problems with concurrent multi-row inserts and 
>> statement-based replication?
> 
> For statement-level replication to work, a multi-row INSERT (INSERT
> VALUES or INSERT SELECT) must use consecutive values.
> Let's assume that two concurrent statements are doing
> INSERT INTO t SELECT * FROM u;
> where t is a partitioned table, having an autoinc column first in the
> index.
> Each statement will call get_auto_increment() with
> nb_desired_values==1 (see handler::update_auto_increment();
> estimation_rows_to_insert is always 0 in INSERT SELECT):
> statement1 moves next_auto_inc_val to 2
>   statement2 moves next_auto_inc_val to 3
> statement1 inserts 1
>   statement2 inserts 2
> statement1 notices that it has more rows to insert because SELECT
> returned more than one row (see handler::update_auto_increment()),
> calls get_auto_increment() with nb_desired_values==2 (twice the
> previous nb_desired_values):
> statement1 moves next_auto_inc_val to 5
> statement1 inserts 3
> and then we have the problem: statement1 has inserted 1 and 3, it's
> not consecutive and will not replicate well.
> 
> If the underlying engine of the partitioned table has table-level
> locking, in fact there is no problem (statement2 will run only after
> statement1 is done).
> This is also the case of InnoDB:
> - if innodb_auto_inc_lock_mode==2, it's documented to not work with
> statement-based replication
> - if ==0, it does table-level locking
> - if ==1, it does table-level locking except if the statement is
> INSERT/REPLACE VALUES (and the discussed case above is INSERT
> SELECT).
> But InnoDB does this fine logic only in
> ha_innobase::get_auto_increment() (called when the table is not
> partitioned), which is not called by
> ha_partition::get_auto_increment().
> 
> So I suspect that yes, there are problems. This is a suspicion. A
> multi-threaded test or code inspection could help know the truth.
> Possible solutions:
> - make ha_partition::get_auto_increment() imitate
> ha_innobase::get_auto_increment(): let it keep the autoinc mutex if
> (thd->lex->sql_command is != SQLCOM_INSERT && !row_based).
> - let ha_partition switch to row-based replication if inserting into
> an auto_increment column.
> - make ha_partition::get_auto_increment() call engine's
> get_auto_increment() to ask it to reserve [max,
> max+nb_desired_values*increment]. The problem is that engine's
> get_auto_increment() API does not tell the engine the lower bound of
> the interval (*first_value is not supposed to be read by the engine),
> only the interval's width (nb_desired_values). InnoDB does read
> *first_value, violating the API, using that as a lower bound. We could
> make InnoDB's violation the new norm: *first_value would become an
> INOUT parameter (see the email I sent to Innobase recently, cc:you).
> - have statement-based binlogging of multiple autoinc intervals
> http://forge.mysql.com/worklog/task.php?id=3404 implemented.
> - something better?
> 
> In all cases, this is a significant problem, a discussion with the
> other reviewer (Serg) would be needed.
> 
> Falcon supports only row-based replication right now.
> 

How about the easy way for a start:
Rely on the partitioned storage engine's support for this (if it works
without partitioning, then it will work with partitioning). I have to
check that it actually will work if you have auto_increment_offset =
no_of_partitions and partition by hash(ai_col); if the table lock is on
the table_share, then it will be OK, but it might be a problem if it is
not. Do this by adding a comment, and add it to the bug report for the
doc-team to add it to the manual.


Keeping the autoinc mutex: where should it be released? In
release_auto_increment()? That would probably work; see the sketch below.
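
A sketch of that idea (imitating ha_innobase::get_auto_increment(); the
member auto_increment_lock_held is invented here for illustration, and
"row_based" stands for the row-based-binlog check you described):

  void ha_partition::get_auto_increment(ulonglong offset, ulonglong increment,
                                        ulonglong nb_desired_values,
                                        ulonglong *first_value,
                                        ulonglong *nb_reserved_values)
  {
    THD *thd= ha_thd();
    lock_auto_increment();
    /* ... reserve the interval from ha_data->next_auto_inc_val ... */
    if (thd->lex->sql_command != SQLCOM_INSERT && !row_based)
      auto_increment_lock_held= TRUE;   /* hold for the whole statement */
    else
      unlock_auto_increment();
  }

  void ha_partition::release_auto_increment()
  {
    if (auto_increment_lock_held)
    {
      unlock_auto_increment();
      auto_increment_lock_held= FALSE;
    }
    /* ... existing per-partition release ... */
  }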

Switching to row-based replication is new ground for me, but is that
always possible? (Is it used often?)

Forwarding the call of get_auto_increment() to all the partitions: I
think *first_value as an INOUT parameter would be good, but since
nb_desired_values is not enforced (Falcon, for example, always uses 1),
it will not help in all cases. And I do not really see this as a
solution, since it would lower performance, but it might be used for the
first auto_increment value of each partition just to be sure... (see the
sketch below)
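
A sketch of the INOUT idea, just to show the parameter flow (hypothetical
API change; the engine would read *first_value as a lower bound and write
back what it actually reserved):

  /* hypothetical, inside ha_partition::get_auto_increment() */
  lock_auto_increment();
  for (uint i= 0; i < m_tot_parts; i++)
  {
    *first_value= ha_data->next_auto_inc_val;  /* IN: lower bound */
    m_file[i]->get_auto_increment(offset, increment, nb_desired_values,
                                  first_value, nb_reserved_values);
    /* OUT: remember the highest value any partition reserved */
    set_if_bigger(ha_data->next_auto_inc_val,
                  *first_value + nb_desired_values * increment);
  }
  unlock_auto_increment();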



-- 
Mattias Jonsson, Software Engineer
MySQL AB, www.mysql.com

Are you MySQL certified?  www.mysql.com/certification