Hi,
I have a concern about destroying the ha_data->mutex.
It is not possible to destroy the table_share->ha_data->mutex in
ha_partition::close() since there can be another handler instance with
the same table_share, which is in the middle of a
write_row/get_auto_increment, which will crash the server if the mutex
is destroyed.
I think I will go back and use the table_share->mutex instead, and
suggest a method/function in WL4305 where the handler can destroy and
free engine specific data and objects if needed.
I will commit a new full patch soon.
My answers to your questions are below.
Regards Mattias
Guilhem Bichot wrote:
> Hello,
>
> I marked my comments with GB2.
>
> On Fri, Mar 14, 2008 at 01:54:08AM +0100, Mattias Jonsson wrote:
>> Hi,
>>
>> Thanks for you comments.
>>
>> I must say that this is the best review I ever got (both for cheering me
>> up and as well as good advice and thorough code AND test review!) Thanks!
>
> GB2 My pleasure :)
>
>> My replies marked as MJ.
>>
>> I will probably commit an update to the previous patch tomorrow. Then it
>> will be easier to see the changes.
>
> GB2 Ok. Please collapse the update with the original csets (code and
> test), so that we have one single cset for the entire bugfix.
> I marked my replies with GB2.
>
OK, I will collapse all changes including the test into one patch (and
then add a new patch for 6.0/Falcon specific test.
>> Guilhem Bichot wrote:
>>> Hello,
>>>
>>> As this is a big patch, I marked my comments with "GB" for easier
>>> searching.
>>> By the way this is an excellent patch.
>>>
>>> On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote:
>>>> ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3 -0
>>>> Bug#33479: auto_increment failures in partitioning
>>>>
>>>> Several problems with auto_increment in partitioning
>>>> (with MyISAM, InnoDB and Falcon)
>>>>
>>>> Updated code only patch (the test cases will be in a separate patches
>>>> 5.1+ (InnoDB+MyISAM) and 6.0 (Falcon)
>
>>>> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc
>
>>>> @@ -2354,6 +2355,24 @@ int ha_partition::open(const char *name,
>>>> 0, key_rec_cmp, (void*)this)))
>>>> goto err_handler;
>>>>
>>>> +
>>>> + if (table_share->tmp_table == NO_TMP_TABLE)
>>>> + pthread_mutex_lock(&table->s->mutex);
>>>> + if (!table_share->ha_data)
>>>> + {
>>>> + HA_DATA_PARTITION *ha_data;
>>>> + /* currently only needed for auto_increment */
>>>> + table_share->ha_data= alloc_root(&table_share->mem_root,
>>>> + sizeof(HA_DATA_PARTITION));
>>>> + if (!table_share->ha_data)
>>>> + goto err_handler;
>>>> + ha_data= (HA_DATA_PARTITION*) table_share->ha_data;
>>>> + bzero(ha_data, sizeof(HA_DATA_PARTITION));
>>>> + if (table_share->tmp_table == NO_TMP_TABLE)
>>>> + pthread_mutex_init(&ha_data->mutex, MY_MUTEX_INIT_FAST);
>>> GB you need to destroy this mutex, and do so before
>>> table_share->mem_root is freed.
>>>
>> MJ: Where should I do that? I cannot find any good spot. Should I maybe
>> use the table_share->mutex and remove the ha_data->mutex instead?
>
> GB2 You can keep ha_data->mutex, and destroy it ha_partition::close()
> (assuming such close always happens before free_table_share(), which
> sounds logical but I haven't verified).
>
See my question/answer above.
>> I added the ha_data->mutex for finer granuality, and allow more
>> concurrancy, maybe we can add this first when serg has made a worklog
>> about the table_share->ha_data as he said in a mail 2008-02-19:
>>
>> When you're done, I'll create a WL about making a proper use of
>> TABLE_SHARE::ha_data (which would include adding constructor/destructor
>> methods to the handlerton, moving all storage engines to the new API,
>> moving existing partition-only fields in the TABLE_SHARE to the new
>> structure that you'll create).
>
> GB2 I don't see how this WL would help with destroying the mutex.
>
I hoped that it would provide an interface for the storage engine to
initialize and destroy the storage engine specific parts, like mutexes etc.
>>>> + }
>>>> + if (table_share->tmp_table == NO_TMP_TABLE)
>>>> + pthread_mutex_unlock(&table->s->mutex);
>>> GB you may want to have a bool on the stack which keeps the value of
>>> table_share->tmp_table == NO_TMP_TABLE, to avoid doing the test 3
>>> times in the worst case.
>>> To have maybe less pointer dereference of table_share (in
>>> table_share->ha_data) you could do
>>> table_share->ha_data= ha_data= (HA_DATA_PARTITION*) alloc_root(...);
>>> if (!ha_data) // or ha_data==NULL, as you like
>>> goto err_handler;
>>> bzero(ha_data, sizeof(HA_DATA_PARTITION));
>>> This would be one dereference instead of three.
>>>
>> MJ: I added:
>> bool is_not_tmp_table= (table_share->tmp_table == NO_TMP_TABLE)
>> and changed the ha_data as you proposed.
>
> GB2 Good. I just remembered an idea: in
> table_share->ha_data= ha_data= (HA_DATA_PARTITION*)
> alloc_root(&table_share->mem_root,
> sizeof(HA_DATA_PARTITION));
> you could consider replacing the C-style cast with
> reinterpret_cast<HA_DATA_PARTITION*>.
> For the compiler it changes nothing; for the developer it makes casts
> more classified (and one day we would be able to run
> gcc -Wold-style-cast on our C++ code).
> As our existing C++ code has lots of C-style casts, it is up to you.
>
I did not get this compiling, so I will not change it.
>>>> @@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows()
>>>> {
>>>> int error;
>>>> handler **file;
>>>> + THD *thd= current_thd;
>>> GB In handler code like this, ha_thd() is to be used instead of
>>> current_thd, because we think it could be faster:
>>> - it's a function call like pthread_getspecific() is (current_thd uses
>>> pthread_getspecific())
>>> - it is in mysqld whereas pthread_getspecific() is in libc (farther so
>>> slower to access, probably one indirection more).
>>> - ha_thd() has good chances of just returning table->in_use, which is
>>> probably not more work than pthread_getspecific() does, at least this
>>> one:
>>>
> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/threads/tsd.c#_thr_getspecific
>>> does more work (opensolaris is what popped first in Google - sign of
>>> the times?).
>>>
>> MJ: Thanks for showing me, I used it since it was used in several places
>> in the partitioning code. Should I change all current_thd to ha_thd in
>> ha_partition.cc?
>
> GB2 Yes you could do this change. If it breaks anything, it means
> table->in_use is incorrect so it's a bug, RC is a good time for
> finding bugs.
>
Have changed it in ha_partition.cc (they also exists in
sql_partition.cc, but since that is not handler specific code, I have
not changed it)
>>>> DBUG_ENTER("ha_partition::delete_all_rows");
>>>>
>>>> + if (thd->lex->sql_command == SQLCOM_TRUNCATE)
>>>> + {
>>>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*)
>>>> table_share->ha_data;
>>>> + lock_auto_increment();
>>>> + ha_data->next_auto_inc_val= 0;
>>>> + ha_data->auto_inc_initialized= FALSE;
>>>> + unlock_auto_increment();
>>>> + }
>>>> file= m_file;
>>>> do
>>>> {
>>>> if ((error= (*file)->ha_delete_all_rows()))
>>>> DBUG_RETURN(error);
>>>> + /* must reset the auto_increment for some engines (eg MyISAM) */
>>>> + if (thd->lex->sql_command == SQLCOM_TRUNCATE)
>>>> + (*file)->ha_reset_auto_increment(0);
>>> GB This piece moved to another bug report, I understood.
>>>
>> MJ: I will remove those tree lines. Should it not be a comment in
>> handler.h for delete_all_rows() that it should reset auto_increment if
>> thd->lex->sql_command == SQLCOM_TRUNCATE?
>
> GB2 yes it would be good.
>
I'll added the comment in handler.h
>>>> @@ -4412,20 +4451,56 @@ int ha_partition::handle_ordered_prev(uc
>>>> int ha_partition::info(uint flag)
>>>> {
>>>> handler *file, **file_array;
>>>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*)
> table_share->ha_data;
>>> GB this declaration should be moved to the block which uses this variable:
>>> the if(flag&HA_STATUS_AUTO) block.
>>>
>> (I have reorganized the ::info function regaring HA_STATUS_AUTO, see below.)
>> MJ: Yes. I moved it down to ->
>>
>>>> DBUG_ENTER("ha_partition:info");
>>>>
>>>> if (flag & HA_STATUS_AUTO)
>>>> {
>> MJ: -> here.
>>
>>>> - ulonglong auto_increment_value= 0;
>>>> DBUG_PRINT("info", ("HA_STATUS_AUTO"));
>>>> - file_array= m_file;
>>>> - do
>>>> + if (ha_data->auto_inc_initialized &&
>>>> !table_share->next_number_keypart)
>>> GB this is reading ha_data->auto_inc_initialized without mutex: there may
>>> be another thread changing this variable at this moment. Then we may
>>> go into the 'else' branch though next_auto_inc_val has already just
>>> been set to true by the other thread, but this is ok as we will then
>>> only do a useless setting of next_auto_inc_val.
>> MJ: Yes this is a start of a race, but since I will acquire a lock
>> before set_if_bigger(next_auto_inc_val,ai..) (in the branch where it is
>> not initialized) it will not get updated if has changed. So I guess I
>> will add a comment about it. Actually this is the thing with the
>> ha_data->auto_inc_initialized, since it is a bool, then I assume I do
>> not have to aquire any lock/mutex for reading it and it will only be
>> written once during the life of the table_share. And the few times it
>> actually are more than one thread running after a check of
>> auto_inc_initialized == 0, there will be no harm, since the handling of
>> ha_data->next_auto_inc_val always should be locked before reading and
>> updating.
>
> GB2 Those assumptions are safe provided that in concurrent situations,
> the bool changes only from 0 to 1 and not from 1 to 0. If it could
> change from 1 to 0, the tests without mutex could go wrong like this:
>
> if (ha_data->auto_inc_initialized)
> {
> use ha_data->next_auto_inc_val;
> }
> If at this moment such code runs:
> lock_auto_increment();
> ha_data->auto_inc_initialized= 0;
> ha_data->next_auto_inc_val= 0;
> unlock_auto_increment();
>
> then the if() could be true, but when we use
> ha_data->next_auto_inc_val it has become 0.
> Fortunately, going from 1 to 0 happens only during TRUNCATE TABLE
> which first closes all open instances and prevents opening new
> instances, so there cannot be concurrency.
> You may want to add a comment about this.
>
Yes, those was my worries about using ha_data->mutex, since I wanted it
to be initialized when initalizing the share, and destroy it when
freeing the share, but I do not know where to destroy it. (see my
question above). So I will use table_share->mutex instead.
>>>> {
>>>> - file= *file_array;
>>>> - file->info(HA_STATUS_AUTO);
>>>> - set_if_bigger(auto_increment_value,
>>>> file->stats.auto_increment_value);
>>>> - } while (*(++file_array));
>>>> - stats.auto_increment_value= auto_increment_value;
>>>> + DBUG_PRINT("info", ("Using ha_data->next_auto_inc_val: %llu",
>>>> + ha_data->next_auto_inc_val));
>>> GB From what I see in MSDN, %ll[u] is not supported by VS 2003 (it is in
>>> C99 but Microsoft apparently needed more than 4 years to implement it
>>> - now let's see how many flame emails I get for writing that
>>> gratuitous sarcasm). It is supported starting from VS2005. There is a
>>> VS2003 machine in pushbuild, so we may have to do some debugging on
>>> it, thus we should avoid %ll[u] in the patch (the rest of MySQL code
>>> already avoids it and uses instead [u]llstr/%s). I agree this is a
>>> hassle.
>>>
>> MJ: OK, I will probably remove all DBUG_PRINT. the handler.cc code uses
>> "%lu", (ulong) nr instead, is that ok? (else should I do #ifdef DEBUG;
>> char buf[22] #endif before?)
>
> GB2 %lu is ok
>
>
I'll use %lu if I add any DBUG_PRINT (I removed it in the updated patch).
>>>> + {
>>>> + if ((res= (*pos)->ha_reset_auto_increment(value)) != 0)
>>>> + {
>>>> + unlock_auto_increment();
>>>> + DBUG_RETURN(res);
>>>> + }
>>>> + }
>>>> + unlock_auto_increment();
>>>> + DBUG_RETURN(0);
>>> GB or maybe (smaller code, different behaviour in case if error but not
>>> clear if that itself is important):
>>> int res= 0;
>>> ...
>>> for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++)
>>> res|= (*pos)->ha_reset_auto_increment(value);
>>> unlock_auto_increment();
>>> DBUG_RETURN(res != 0);
>>>
>> MJ: I can rewrite it as this, but do really res != 0 give a correct
>> error code?
>
> GB2 ah, right, I misread parenthesis in the original code. Ok, keep it
> like it is.
>
Kept it unchanged.
>> (I could add a new error and handle it in
>> ha_partition::print_error)
>>
>> in sql_delete.cc row 331:
>> int error2= table->file->ha_reset_auto_increment(0);
>>
>> if (error2 && (error2 != HA_ERR_WRONG_COMMAND))
>> {
>> table->file->print_error(error2, MYF(0));
>>
>>>
>>>> /*
>>>> This method is called by update_auto_increment which in turn is called
>>>> - by the individual handlers as part of write_row. We will always let
>>>> - the first handler keep track of the auto increment value for all
>>>> - partitions.
>>>> + by the individual handlers as part of write_row. We use the
>>>> + table_share->ha_data->next_auto_inc_val, or search all
>>>> + partitions for the highest auto_increment_value if not initialized or
>>>> + if auto_increment field is a secondary part of a key, we must search
>>>> + every partition with a mutex to be sure of correctness.
>>>> */
>>> GB While you are changing the comment, you could format it for doxygen.
>>>
>> MJ: I will try to update the comments for doxygen
>
> GB2 Please fix that in the next version of the patch (latest version
> does not have it yet).
>
>
Arrg, missed that comment...
>>>> + }
>>>> + else if (!ha_data->auto_inc_initialized)
>>>> + {
>>>> /*
>>>> - We should not get here.
>>>> + Not initialized, but should be locked in
> ha_partition::write_row().
>>>> + (Or is a temporary table).
>>> GB how about adding
>>> if (this is not tmp table)
>>> safe_mutex_assert_owner(the expected mutex);
>>> ?
>>>
>> MJ: Thanks, did not now of that function.
>
> GB2 do you still need to add such assertion, or the code changes make
> it unnecessary?
>
>
There was no such assertion in my latest version, since I rewrote it.
>>>> @@ -5659,9 +5813,28 @@ void ha_partition::release_auto_incremen
>>>> {
>>>> DBUG_ENTER("ha_partition::release_auto_increment");
>>>>
>>>> - for (uint i= 0; i < m_tot_parts; i++)
>>>> + if (table->s->next_number_keypart)
>>>> {
>>>> - m_file[i]->ha_release_auto_increment();
>>>> + for (uint i= 0; i < m_tot_parts; i++)
>>>> + m_file[i]->ha_release_auto_increment();
>>>> + }
>>>> + else
>>>> + {
>>>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*)
>>>> table_share->ha_data;
>>>> + DBUG_PRINT("info", ("next_auto_inc_val: %llu cur_row_max: %llu min:
>
>>>> %llu",
>>>> + ha_data->next_auto_inc_val,
>>>> + auto_inc_interval_for_cur_row.maximum(),
>>>> + auto_inc_interval_for_cur_row.minimum()));
>>>> + if (next_insert_id)
>>>> + {
>>>> + lock_auto_increment();
>>>> + if (next_insert_id < ha_data->next_auto_inc_val &&
>>>> + auto_inc_interval_for_cur_row.maximum() >=
>>>> ha_data->next_auto_inc_val)
>>> GB too long line (80 chars).
>>>
>> MJ: OK, I thought that it was not more than 80 chars... should I remove
>> one space in front of the line, or break it up like this:
>> if (next_insert_id < ha_data->next_auto_inc_val &&
>> auto_inc_interval_for_cur_row.maximum() >=
>> ha_data->next_auto_inc_val)
>> ?
>
> GB2 I'd break it up.
> Right, our coding guidelines say "The maximum line width is 80
> characters"... I took my habit for the rule then; I think I picked 79
> after getting angry notes that I had lines 80-char long :)
> So, ignore my comment.
>
I already broke it up in the latest version :)
>>>> + ha_data->next_auto_inc_val= next_insert_id;
>>>> + unlock_auto_increment();
>>>> + }
>>>> + DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id: %llu",
>>>> + ha_data->next_auto_inc_val,
> next_insert_id));
>>>> }
>>>> DBUG_VOID_RETURN;
>>>> }
>
>> Can there be any problems with concurrent multi row insert and statement
>> replications?
>
> For statement-level replication to work, a multi-row INSERT (INSERT
> VALUES or INSERT SELECT) must use consecutive values.
> Let's assume that two concurrent statements are doing
> INSERT INTO t SELECT * FROM u;
> where t is a partitioned table, having an autoinc column first in the
> index.
> Each statement will call get_auto_increment() with
> nb_desired_values==1 (see handler::update_auto_increment();
> estimation_rows_to_insert is always 0 in INSERT SELECT):
> statement1 moves next_auto_inc_val to 2
> statement2 moves next_auto_inc_val to 3
> statement1 inserts 1
> statement2 inserts 2
> statement1 notices that it has more rows to insert because SELECT
> returned more than one row (see handler::update_auto_increment()),
> calls get_auto_increment() with nb_desired_values==2 (twice the
> previous nb_desired_values):
> statement1 moves next_auto_inc_val to 5
> statement1 inserts 3
> and then we have the problem: statement1 has inserted 1 and 3, it's
> not consecutive and will not replicate well.
>
> If the underlying engine of the partitioned table has table-level
> locking, in fact there is no problem (statement2 will run only after
> statement1 is done).
> This is also the case of InnoDB:
> - if innodb_auto_inc_lock_mode==2, it's documented to not work with
> statement-based replication
> - if ==0, it does table-level locking
> - if ==1, it does table-level locking except if the statement is
> INSERT/REPLACE VALUES (and the discussed case above is INSERT
> SELECT).
> But InnoDB does this fine logic only in
> ha_innobase::get_auto_increment() (called when the table is not
> partitioned), which is not called by
> ha_partition::get_auto_increment().
>
> So I suspect that yes, there are problems. This is a suspicion. A
> multi-threaded test or code inspection could help know the truth.
> Possible solutions:
> - make ha_partition::get_auto_increment() imitate
> ha_innobase::get_auto_increment(): let it keep the autoinc mutex if
> (thd->lex->sql_command is != SQLCOM_INSERT && !row_based).
> - let ha_partition switch to row-based replication if inserting into
> an auto_increment column.
> - make ha_partition::get_auto_increment() call engine's
> get_auto_increment() to ask it to reserve [max,
> max+nb_desired_values*increment]. The problem is that engine's
> get_auto_increment() API does not tell the engine the lower bound of
> the interval (*first_value is not supposed to be read by the engine),
> only the interval's width (nb_desired_values). InnoDB does read
> *first_value, violating the API, using that as a lower bound. We could
> make InnoDB's violation the new norm: *first_value would become an
> INOUT parameter (see the email I sent to Innobase recently, cc:you).
> - have statement-based binlogging of multiple autoinc intervals
> http://forge.mysql.com/worklog/task.php?id=3404 implemented.
> - something better?
>
> In all cases, this is a significant problem, a discussion with the
> other reviewer (Serg) would be needed.
>
> Falcon supports only row-based replication right now.
>
How about the easy way for start:
Rely on the partitioned storage engines support for this (if it work
without partitioning, then it will work with partitioning). I have to
check that it actually will work if you have auto_increment_offset =
no_of_partitions and partition by hash(ai_col), if the table lock is on
the table_share, then it will be OK, but it might be a problem if is not.
Do this by adding a comment, and add it to the bug report for the
doc-team for adding it in the manual.
keep the autoinc mutex, where should it be released? in
release_auto_increment? would probably work.
switch to row-based replication, this is new grounds for me, but is that
always possible? (is it used often?)
Forward the call of get_auto_increment to all the partitions, I think
that *first_value as an INOUT would be good, but since nb_desired_values
is not forced, Falcon always uses 1 as example, it will not help in all
cases. And I do not really see this as a solution, since it would lower
the performance, but it might be used for the first auto_increment value
for each partition just to be sure...
--
Mattias Jonsson, Software Engineer
MySQL AB, www.mysql.com
Are you MySQL certified? www.mysql.com/certification