From: Date: March 27 2008 5:16pm Subject: Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479 List-Archive: http://lists.mysql.com/commits/44520 Message-Id: <47EBC865.9070404@mysql.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Hi, I have a concern about destroying the ha_data->mutex. It is not possible to destroy the table_share->ha_data->mutex in ha_partition::close() since there can be another handler instance with the same table_share, which is in the middle of a write_row/get_auto_increment, which will crash the server if the mutex is destroyed. I think I will go back and use the table_share->mutex instead, and suggest a method/function in WL4305 where the handler can destroy and free engine specific data and objects if needed. I will commit a new full patch soon. My answers to your questions are below. Regards Mattias Guilhem Bichot wrote: > Hello, > > I marked my comments with GB2. > > On Fri, Mar 14, 2008 at 01:54:08AM +0100, Mattias Jonsson wrote: >> Hi, >> >> Thanks for you comments. >> >> I must say that this is the best review I ever got (both for cheering me >> up and as well as good advice and thorough code AND test review!) Thanks! > > GB2 My pleasure :) > >> My replies marked as MJ. >> >> I will probably commit an update to the previous patch tomorrow. Then it >> will be easier to see the changes. > > GB2 Ok. Please collapse the update with the original csets (code and > test), so that we have one single cset for the entire bugfix. > I marked my replies with GB2. > OK, I will collapse all changes including the test into one patch (and then add a new patch for 6.0/Falcon specific test. >> Guilhem Bichot wrote: >>> Hello, >>> >>> As this is a big patch, I marked my comments with "GB" for easier >>> searching. >>> By the way this is an excellent patch. >>> >>> On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote: >>>> ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3 -0 >>>> Bug#33479: auto_increment failures in partitioning >>>> >>>> Several problems with auto_increment in partitioning >>>> (with MyISAM, InnoDB and Falcon) >>>> >>>> Updated code only patch (the test cases will be in a separate patches >>>> 5.1+ (InnoDB+MyISAM) and 6.0 (Falcon) > >>>> diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc > >>>> @@ -2354,6 +2355,24 @@ int ha_partition::open(const char *name, >>>> 0, key_rec_cmp, (void*)this))) >>>> goto err_handler; >>>> >>>> + >>>> + if (table_share->tmp_table == NO_TMP_TABLE) >>>> + pthread_mutex_lock(&table->s->mutex); >>>> + if (!table_share->ha_data) >>>> + { >>>> + HA_DATA_PARTITION *ha_data; >>>> + /* currently only needed for auto_increment */ >>>> + table_share->ha_data= alloc_root(&table_share->mem_root, >>>> + sizeof(HA_DATA_PARTITION)); >>>> + if (!table_share->ha_data) >>>> + goto err_handler; >>>> + ha_data= (HA_DATA_PARTITION*) table_share->ha_data; >>>> + bzero(ha_data, sizeof(HA_DATA_PARTITION)); >>>> + if (table_share->tmp_table == NO_TMP_TABLE) >>>> + pthread_mutex_init(&ha_data->mutex, MY_MUTEX_INIT_FAST); >>> GB you need to destroy this mutex, and do so before >>> table_share->mem_root is freed. >>> >> MJ: Where should I do that? I cannot find any good spot. Should I maybe >> use the table_share->mutex and remove the ha_data->mutex instead? > > GB2 You can keep ha_data->mutex, and destroy it ha_partition::close() > (assuming such close always happens before free_table_share(), which > sounds logical but I haven't verified). > See my question/answer above. >> I added the ha_data->mutex for finer granuality, and allow more >> concurrancy, maybe we can add this first when serg has made a worklog >> about the table_share->ha_data as he said in a mail 2008-02-19: >> >> When you're done, I'll create a WL about making a proper use of >> TABLE_SHARE::ha_data (which would include adding constructor/destructor >> methods to the handlerton, moving all storage engines to the new API, >> moving existing partition-only fields in the TABLE_SHARE to the new >> structure that you'll create). > > GB2 I don't see how this WL would help with destroying the mutex. > I hoped that it would provide an interface for the storage engine to initialize and destroy the storage engine specific parts, like mutexes etc. >>>> + } >>>> + if (table_share->tmp_table == NO_TMP_TABLE) >>>> + pthread_mutex_unlock(&table->s->mutex); >>> GB you may want to have a bool on the stack which keeps the value of >>> table_share->tmp_table == NO_TMP_TABLE, to avoid doing the test 3 >>> times in the worst case. >>> To have maybe less pointer dereference of table_share (in >>> table_share->ha_data) you could do >>> table_share->ha_data= ha_data= (HA_DATA_PARTITION*) alloc_root(...); >>> if (!ha_data) // or ha_data==NULL, as you like >>> goto err_handler; >>> bzero(ha_data, sizeof(HA_DATA_PARTITION)); >>> This would be one dereference instead of three. >>> >> MJ: I added: >> bool is_not_tmp_table= (table_share->tmp_table == NO_TMP_TABLE) >> and changed the ha_data as you proposed. > > GB2 Good. I just remembered an idea: in > table_share->ha_data= ha_data= (HA_DATA_PARTITION*) > alloc_root(&table_share->mem_root, > sizeof(HA_DATA_PARTITION)); > you could consider replacing the C-style cast with > reinterpret_cast. > For the compiler it changes nothing; for the developer it makes casts > more classified (and one day we would be able to run > gcc -Wold-style-cast on our C++ code). > As our existing C++ code has lots of C-style casts, it is up to you. > I did not get this compiling, so I will not change it. >>>> @@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows() >>>> { >>>> int error; >>>> handler **file; >>>> + THD *thd= current_thd; >>> GB In handler code like this, ha_thd() is to be used instead of >>> current_thd, because we think it could be faster: >>> - it's a function call like pthread_getspecific() is (current_thd uses >>> pthread_getspecific()) >>> - it is in mysqld whereas pthread_getspecific() is in libc (farther so >>> slower to access, probably one indirection more). >>> - ha_thd() has good chances of just returning table->in_use, which is >>> probably not more work than pthread_getspecific() does, at least this >>> one: >>> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/threads/tsd.c#_thr_getspecific >>> does more work (opensolaris is what popped first in Google - sign of >>> the times?). >>> >> MJ: Thanks for showing me, I used it since it was used in several places >> in the partitioning code. Should I change all current_thd to ha_thd in >> ha_partition.cc? > > GB2 Yes you could do this change. If it breaks anything, it means > table->in_use is incorrect so it's a bug, RC is a good time for > finding bugs. > Have changed it in ha_partition.cc (they also exists in sql_partition.cc, but since that is not handler specific code, I have not changed it) >>>> DBUG_ENTER("ha_partition::delete_all_rows"); >>>> >>>> + if (thd->lex->sql_command == SQLCOM_TRUNCATE) >>>> + { >>>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) >>>> table_share->ha_data; >>>> + lock_auto_increment(); >>>> + ha_data->next_auto_inc_val= 0; >>>> + ha_data->auto_inc_initialized= FALSE; >>>> + unlock_auto_increment(); >>>> + } >>>> file= m_file; >>>> do >>>> { >>>> if ((error= (*file)->ha_delete_all_rows())) >>>> DBUG_RETURN(error); >>>> + /* must reset the auto_increment for some engines (eg MyISAM) */ >>>> + if (thd->lex->sql_command == SQLCOM_TRUNCATE) >>>> + (*file)->ha_reset_auto_increment(0); >>> GB This piece moved to another bug report, I understood. >>> >> MJ: I will remove those tree lines. Should it not be a comment in >> handler.h for delete_all_rows() that it should reset auto_increment if >> thd->lex->sql_command == SQLCOM_TRUNCATE? > > GB2 yes it would be good. > I'll added the comment in handler.h >>>> @@ -4412,20 +4451,56 @@ int ha_partition::handle_ordered_prev(uc >>>> int ha_partition::info(uint flag) >>>> { >>>> handler *file, **file_array; >>>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) table_share->ha_data; >>> GB this declaration should be moved to the block which uses this variable: >>> the if(flag&HA_STATUS_AUTO) block. >>> >> (I have reorganized the ::info function regaring HA_STATUS_AUTO, see below.) >> MJ: Yes. I moved it down to -> >> >>>> DBUG_ENTER("ha_partition:info"); >>>> >>>> if (flag & HA_STATUS_AUTO) >>>> { >> MJ: -> here. >> >>>> - ulonglong auto_increment_value= 0; >>>> DBUG_PRINT("info", ("HA_STATUS_AUTO")); >>>> - file_array= m_file; >>>> - do >>>> + if (ha_data->auto_inc_initialized && >>>> !table_share->next_number_keypart) >>> GB this is reading ha_data->auto_inc_initialized without mutex: there may >>> be another thread changing this variable at this moment. Then we may >>> go into the 'else' branch though next_auto_inc_val has already just >>> been set to true by the other thread, but this is ok as we will then >>> only do a useless setting of next_auto_inc_val. >> MJ: Yes this is a start of a race, but since I will acquire a lock >> before set_if_bigger(next_auto_inc_val,ai..) (in the branch where it is >> not initialized) it will not get updated if has changed. So I guess I >> will add a comment about it. Actually this is the thing with the >> ha_data->auto_inc_initialized, since it is a bool, then I assume I do >> not have to aquire any lock/mutex for reading it and it will only be >> written once during the life of the table_share. And the few times it >> actually are more than one thread running after a check of >> auto_inc_initialized == 0, there will be no harm, since the handling of >> ha_data->next_auto_inc_val always should be locked before reading and >> updating. > > GB2 Those assumptions are safe provided that in concurrent situations, > the bool changes only from 0 to 1 and not from 1 to 0. If it could > change from 1 to 0, the tests without mutex could go wrong like this: > > if (ha_data->auto_inc_initialized) > { > use ha_data->next_auto_inc_val; > } > If at this moment such code runs: > lock_auto_increment(); > ha_data->auto_inc_initialized= 0; > ha_data->next_auto_inc_val= 0; > unlock_auto_increment(); > > then the if() could be true, but when we use > ha_data->next_auto_inc_val it has become 0. > Fortunately, going from 1 to 0 happens only during TRUNCATE TABLE > which first closes all open instances and prevents opening new > instances, so there cannot be concurrency. > You may want to add a comment about this. > Yes, those was my worries about using ha_data->mutex, since I wanted it to be initialized when initalizing the share, and destroy it when freeing the share, but I do not know where to destroy it. (see my question above). So I will use table_share->mutex instead. >>>> { >>>> - file= *file_array; >>>> - file->info(HA_STATUS_AUTO); >>>> - set_if_bigger(auto_increment_value, >>>> file->stats.auto_increment_value); >>>> - } while (*(++file_array)); >>>> - stats.auto_increment_value= auto_increment_value; >>>> + DBUG_PRINT("info", ("Using ha_data->next_auto_inc_val: %llu", >>>> + ha_data->next_auto_inc_val)); >>> GB From what I see in MSDN, %ll[u] is not supported by VS 2003 (it is in >>> C99 but Microsoft apparently needed more than 4 years to implement it >>> - now let's see how many flame emails I get for writing that >>> gratuitous sarcasm). It is supported starting from VS2005. There is a >>> VS2003 machine in pushbuild, so we may have to do some debugging on >>> it, thus we should avoid %ll[u] in the patch (the rest of MySQL code >>> already avoids it and uses instead [u]llstr/%s). I agree this is a >>> hassle. >>> >> MJ: OK, I will probably remove all DBUG_PRINT. the handler.cc code uses >> "%lu", (ulong) nr instead, is that ok? (else should I do #ifdef DEBUG; >> char buf[22] #endif before?) > > GB2 %lu is ok > > I'll use %lu if I add any DBUG_PRINT (I removed it in the updated patch). >>>> + { >>>> + if ((res= (*pos)->ha_reset_auto_increment(value)) != 0) >>>> + { >>>> + unlock_auto_increment(); >>>> + DBUG_RETURN(res); >>>> + } >>>> + } >>>> + unlock_auto_increment(); >>>> + DBUG_RETURN(0); >>> GB or maybe (smaller code, different behaviour in case if error but not >>> clear if that itself is important): >>> int res= 0; >>> ... >>> for (pos=m_file, end= m_file+ m_tot_parts; pos != end ; pos++) >>> res|= (*pos)->ha_reset_auto_increment(value); >>> unlock_auto_increment(); >>> DBUG_RETURN(res != 0); >>> >> MJ: I can rewrite it as this, but do really res != 0 give a correct >> error code? > > GB2 ah, right, I misread parenthesis in the original code. Ok, keep it > like it is. > Kept it unchanged. >> (I could add a new error and handle it in >> ha_partition::print_error) >> >> in sql_delete.cc row 331: >> int error2= table->file->ha_reset_auto_increment(0); >> >> if (error2 && (error2 != HA_ERR_WRONG_COMMAND)) >> { >> table->file->print_error(error2, MYF(0)); >> >>> >>>> /* >>>> This method is called by update_auto_increment which in turn is called >>>> - by the individual handlers as part of write_row. We will always let >>>> - the first handler keep track of the auto increment value for all >>>> - partitions. >>>> + by the individual handlers as part of write_row. We use the >>>> + table_share->ha_data->next_auto_inc_val, or search all >>>> + partitions for the highest auto_increment_value if not initialized or >>>> + if auto_increment field is a secondary part of a key, we must search >>>> + every partition with a mutex to be sure of correctness. >>>> */ >>> GB While you are changing the comment, you could format it for doxygen. >>> >> MJ: I will try to update the comments for doxygen > > GB2 Please fix that in the next version of the patch (latest version > does not have it yet). > > Arrg, missed that comment... >>>> + } >>>> + else if (!ha_data->auto_inc_initialized) >>>> + { >>>> /* >>>> - We should not get here. >>>> + Not initialized, but should be locked in ha_partition::write_row(). >>>> + (Or is a temporary table). >>> GB how about adding >>> if (this is not tmp table) >>> safe_mutex_assert_owner(the expected mutex); >>> ? >>> >> MJ: Thanks, did not now of that function. > > GB2 do you still need to add such assertion, or the code changes make > it unnecessary? > > There was no such assertion in my latest version, since I rewrote it. >>>> @@ -5659,9 +5813,28 @@ void ha_partition::release_auto_incremen >>>> { >>>> DBUG_ENTER("ha_partition::release_auto_increment"); >>>> >>>> - for (uint i= 0; i < m_tot_parts; i++) >>>> + if (table->s->next_number_keypart) >>>> { >>>> - m_file[i]->ha_release_auto_increment(); >>>> + for (uint i= 0; i < m_tot_parts; i++) >>>> + m_file[i]->ha_release_auto_increment(); >>>> + } >>>> + else >>>> + { >>>> + HA_DATA_PARTITION *ha_data= (HA_DATA_PARTITION*) >>>> table_share->ha_data; >>>> + DBUG_PRINT("info", ("next_auto_inc_val: %llu cur_row_max: %llu min: >>>> %llu", >>>> + ha_data->next_auto_inc_val, >>>> + auto_inc_interval_for_cur_row.maximum(), >>>> + auto_inc_interval_for_cur_row.minimum())); >>>> + if (next_insert_id) >>>> + { >>>> + lock_auto_increment(); >>>> + if (next_insert_id < ha_data->next_auto_inc_val && >>>> + auto_inc_interval_for_cur_row.maximum() >= >>>> ha_data->next_auto_inc_val) >>> GB too long line (80 chars). >>> >> MJ: OK, I thought that it was not more than 80 chars... should I remove >> one space in front of the line, or break it up like this: >> if (next_insert_id < ha_data->next_auto_inc_val && >> auto_inc_interval_for_cur_row.maximum() >= >> ha_data->next_auto_inc_val) >> ? > > GB2 I'd break it up. > Right, our coding guidelines say "The maximum line width is 80 > characters"... I took my habit for the rule then; I think I picked 79 > after getting angry notes that I had lines 80-char long :) > So, ignore my comment. > I already broke it up in the latest version :) >>>> + ha_data->next_auto_inc_val= next_insert_id; >>>> + unlock_auto_increment(); >>>> + } >>>> + DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id: %llu", >>>> + ha_data->next_auto_inc_val, next_insert_id)); >>>> } >>>> DBUG_VOID_RETURN; >>>> } > >> Can there be any problems with concurrent multi row insert and statement >> replications? > > For statement-level replication to work, a multi-row INSERT (INSERT > VALUES or INSERT SELECT) must use consecutive values. > Let's assume that two concurrent statements are doing > INSERT INTO t SELECT * FROM u; > where t is a partitioned table, having an autoinc column first in the > index. > Each statement will call get_auto_increment() with > nb_desired_values==1 (see handler::update_auto_increment(); > estimation_rows_to_insert is always 0 in INSERT SELECT): > statement1 moves next_auto_inc_val to 2 > statement2 moves next_auto_inc_val to 3 > statement1 inserts 1 > statement2 inserts 2 > statement1 notices that it has more rows to insert because SELECT > returned more than one row (see handler::update_auto_increment()), > calls get_auto_increment() with nb_desired_values==2 (twice the > previous nb_desired_values): > statement1 moves next_auto_inc_val to 5 > statement1 inserts 3 > and then we have the problem: statement1 has inserted 1 and 3, it's > not consecutive and will not replicate well. > > If the underlying engine of the partitioned table has table-level > locking, in fact there is no problem (statement2 will run only after > statement1 is done). > This is also the case of InnoDB: > - if innodb_auto_inc_lock_mode==2, it's documented to not work with > statement-based replication > - if ==0, it does table-level locking > - if ==1, it does table-level locking except if the statement is > INSERT/REPLACE VALUES (and the discussed case above is INSERT > SELECT). > But InnoDB does this fine logic only in > ha_innobase::get_auto_increment() (called when the table is not > partitioned), which is not called by > ha_partition::get_auto_increment(). > > So I suspect that yes, there are problems. This is a suspicion. A > multi-threaded test or code inspection could help know the truth. > Possible solutions: > - make ha_partition::get_auto_increment() imitate > ha_innobase::get_auto_increment(): let it keep the autoinc mutex if > (thd->lex->sql_command is != SQLCOM_INSERT && !row_based). > - let ha_partition switch to row-based replication if inserting into > an auto_increment column. > - make ha_partition::get_auto_increment() call engine's > get_auto_increment() to ask it to reserve [max, > max+nb_desired_values*increment]. The problem is that engine's > get_auto_increment() API does not tell the engine the lower bound of > the interval (*first_value is not supposed to be read by the engine), > only the interval's width (nb_desired_values). InnoDB does read > *first_value, violating the API, using that as a lower bound. We could > make InnoDB's violation the new norm: *first_value would become an > INOUT parameter (see the email I sent to Innobase recently, cc:you). > - have statement-based binlogging of multiple autoinc intervals > http://forge.mysql.com/worklog/task.php?id=3404 implemented. > - something better? > > In all cases, this is a significant problem, a discussion with the > other reviewer (Serg) would be needed. > > Falcon supports only row-based replication right now. > How about the easy way for start: Rely on the partitioned storage engines support for this (if it work without partitioning, then it will work with partitioning). I have to check that it actually will work if you have auto_increment_offset = no_of_partitions and partition by hash(ai_col), if the table lock is on the table_share, then it will be OK, but it might be a problem if is not. Do this by adding a comment, and add it to the bug report for the doc-team for adding it in the manual. keep the autoinc mutex, where should it be released? in release_auto_increment? would probably work. switch to row-based replication, this is new grounds for me, but is that always possible? (is it used often?) Forward the call of get_auto_increment to all the partitions, I think that *first_value as an INOUT would be good, but since nb_desired_values is not forced, Falcon always uses 1 as example, it will not help in all cases. And I do not really see this as a solution, since it would lower the performance, but it might be used for the first auto_increment value for each partition just to be sure... -- Mattias Jonsson, Software Engineer MySQL AB, www.mysql.com Are you MySQL certified? www.mysql.com/certification