List:Commits« Previous MessageNext Message »
From:Guilhem Bichot Date:March 28 2008 8:30am
Subject:Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479
View as plain text  
Hello,

On Thu, Mar 27, 2008 at 05:16:37PM +0100, Mattias Jonsson wrote:
> Hi,
> 
> I have a concern about destroying the ha_data->mutex.
> 
> It is not possible to destroy the table_share->ha_data->mutex in 
> ha_partition::close() since there can be another handler instance with 
> the same table_share, which is in the middle of a 
> write_row/get_auto_increment, which will crash the server if the mutex 
> is destroyed.
> 
> 
> I think I will go back and use the table_share->mutex instead, and 
> suggest a method/function in WL4305 where the handler can destroy and 
> free engine specific data and objects if needed.

GB3 Ok

> Guilhem Bichot wrote:
> >On Fri, Mar 14, 2008 at 01:54:08AM +0100, Mattias Jonsson wrote:
> >>Guilhem Bichot wrote:
> >>>On Wed, Mar 05, 2008 at 11:56:16PM +0100, mattiasj@stripped wrote:
> >>>>ChangeSet@stripped, 2008-03-05 23:56:06+01:00, mattiasj@witty. +3 -0
> >>>> Bug#33479: auto_increment failures in partitioning
> >>>> 
> >>>> Several problems with auto_increment in partitioning
> >>>> (with MyISAM, InnoDB and Falcon)
> >>>> 
> >>>> Updated code only patch (the test cases will be in a separate
> patches
> >>>> 5.1+ (InnoDB+MyISAM) and 6.0 (Falcon)
> >
> >>>>diff -Nrup a/sql/ha_partition.cc b/sql/ha_partition.cc

> >>>>+  }
> >>>>+  if (table_share->tmp_table == NO_TMP_TABLE)
> >>>>+    pthread_mutex_unlock(&table->s->mutex);
> >>>GB you may want to have a bool on the stack which keeps the value of
> >>>table_share->tmp_table == NO_TMP_TABLE, to avoid doing the test 3
> >>>times in the worst case.
> >>>To have maybe less pointer dereference of table_share (in
> >>>table_share->ha_data) you could do
> >>>table_share->ha_data= ha_data= (HA_DATA_PARTITION*) alloc_root(...);
> >>>if (!ha_data) // or ha_data==NULL, as you like
> >>> goto err_handler;
> >>>bzero(ha_data, sizeof(HA_DATA_PARTITION));
> >>>This would be one dereference instead of three.
> >>>
> >>MJ: I added:
> >>bool is_not_tmp_table= (table_share->tmp_table == NO_TMP_TABLE)
> >>and changed the ha_data as you proposed.
> >
> >GB2 Good. I just remembered an idea: in
> >    table_share->ha_data= ha_data= (HA_DATA_PARTITION*)
> >                                   alloc_root(&table_share->mem_root,
> >                                              sizeof(HA_DATA_PARTITION));
> >you could consider replacing the C-style cast with
> >reinterpret_cast<HA_DATA_PARTITION*>.
> >For the compiler it changes nothing; for the developer it makes casts
> >more classified (and one day we would be able to run
> >gcc -Wold-style-cast on our C++ code).
> >As our existing C++ code has lots of C-style casts, it is up to you.
> >
> 
> I did not get this compiling, so I will not change it.

Yes, -Wold-style-cast is not used in our builds. It's just that I like
to use it from time to time, on C++ code I write, to be sure I have
used only C++ casts.
The common benefits of using C++ casts (static_cast, const_cast,
reinterpret_cast - dynamic_cast is not possible because we use
-fno-rtti) can be found here:
http://www.acm.org/crossroads/xrds3-1/ovp3-1.html

It is absolutely not enforced at MySQL, only certain pieces use it
(like log_event.cc) so this is a mere suggestion.

> >>>>@@ -2953,13 +2980,25 @@ int ha_partition::delete_all_rows()

> >>>>+        ha_data->next_auto_inc_val= next_insert_id;
> >>>>+      unlock_auto_increment();
> >>>>+    }
> >>>>+    DBUG_PRINT("info", ("next_auto_inc_val: %llu next_ins_id:
> %llu",
> >>>>+                        ha_data->next_auto_inc_val,
> next_insert_id));
> >>>>  }
> >>>>  DBUG_VOID_RETURN;
> >>>>}
> >
> >>Can there be any problems with concurrent multi row insert and statement 
> >>replications?
> >
> >For statement-level replication to work, a multi-row INSERT (INSERT
> >VALUES or INSERT SELECT) must use consecutive values.
> >Let's assume that two concurrent statements are doing
> >INSERT INTO t SELECT * FROM u;
> >where t is a partitioned table, having an autoinc column first in the
> >index.
> >Each statement will call get_auto_increment() with
> >nb_desired_values==1 (see handler::update_auto_increment();
> >estimation_rows_to_insert is always 0 in INSERT SELECT):
> >statement1 moves next_auto_inc_val to 2
> >  statement2 moves next_auto_inc_val to 3
> >statement1 inserts 1
> >  statement2 inserts 2
> >statement1 notices that it has more rows to insert because SELECT
> >returned more than one row (see handler::update_auto_increment()),
> >calls get_auto_increment() with nb_desired_values==2 (twice the
> >previous nb_desired_values):
> >statement1 moves next_auto_inc_val to 5
> >statement1 inserts 3
> >and then we have the problem: statement1 has inserted 1 and 3, it's
> >not consecutive and will not replicate well.
> >
> >If the underlying engine of the partitioned table has table-level
> >locking, in fact there is no problem (statement2 will run only after
> >statement1 is done).
> >This is also the case of InnoDB:
> >- if innodb_auto_inc_lock_mode==2, it's documented to not work with
> >statement-based replication
> >- if ==0, it does table-level locking
> >- if ==1, it does table-level locking except if the statement is
> >INSERT/REPLACE VALUES (and the discussed case above is INSERT
> >SELECT).
> >But InnoDB does this fine logic only in
> >ha_innobase::get_auto_increment() (called when the table is not
> >partitioned), which is not called by
> >ha_partition::get_auto_increment().
> >
> >So I suspect that yes, there are problems. This is a suspicion. A
> >multi-threaded test or code inspection could help know the truth.
> >Possible solutions:
> >- make ha_partition::get_auto_increment() imitate
> >ha_innobase::get_auto_increment(): let it keep the autoinc mutex if
> >(thd->lex->sql_command is != SQLCOM_INSERT && !row_based).
> >- let ha_partition switch to row-based replication if inserting into
> >an auto_increment column.
> >- make ha_partition::get_auto_increment() call engine's
> >get_auto_increment() to ask it to reserve [max,
> >max+nb_desired_values*increment]. The problem is that engine's
> >get_auto_increment() API does not tell the engine the lower bound of
> >the interval (*first_value is not supposed to be read by the engine),
> >only the interval's width (nb_desired_values). InnoDB does read
> >*first_value, violating the API, using that as a lower bound. We could
> >make InnoDB's violation the new norm: *first_value would become an
> >INOUT parameter (see the email I sent to Innobase recently, cc:you).
> >- have statement-based binlogging of multiple autoinc intervals
> >http://forge.mysql.com/worklog/task.php?id=3404 implemented.
> >- something better?
> >
> >In all cases, this is a significant problem, a discussion with the
> >other reviewer (Serg) would be needed.
> >
> >Falcon supports only row-based replication right now.
> >
> 
> How about the easy way for start:
> Rely on the partitioned storage engines support for this (if it work
> without partitioning, then it will work with partitioning). I have to
> check that it actually will work if you have auto_increment_offset =
> no_of_partitions and partition by hash(ai_col), if the table lock is on
> the table_share, then it will be OK, but it might be a problem if is not.
> Do this by adding a comment, and add it to the bug report for the
> doc-team for adding it in the manual.
> 
> 
> keep the autoinc mutex, where should it be released? in
> release_auto_increment? would probably work.
> 
> switch to row-based replication, this is new grounds for me, but is that
> always possible? (is it used often?)
> 
> Forward the call of get_auto_increment to all the partitions, I think
> that *first_value as an INOUT would be good, but since nb_desired_values
> is not forced, Falcon always uses 1 as example, it will not help in all
> cases. And I do not really see this as a solution, since it would lower 
> the performance, but it might be used for the first auto_increment value 
> for each partition just to be sure...

We agreed on IRC to evaluate a solution where get_auto_increment() is
forwarded to all partitions, and helped by holding the mutex. Could
you please sketch it? I am not absolutely sure that it would work,
especially if while we are computing the max of reserved intervals, an
INSERT VALUES(some_fixed_value) comes, where some_fixed_value == the
max.

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /    Mr. Guilhem Bichot <guilhem@stripped>
 / /|_/ / // /\ \/ /_/ / /__   Sun Microsystems, Lead Software Engineer
/_/  /_/\_, /___/\___\_\___/   Bordeaux, France
       <___/   www.mysql.com   
Thread
bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479mattiasj5 Mar
  • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Sergei Golubchik7 Mar
    • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Sergei Golubchik7 Mar
      • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Mattias Jonsson8 Mar
        • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Sergei Golubchik10 Mar
  • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Guilhem Bichot13 Mar
    • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Mattias Jonsson14 Mar
      • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Guilhem Bichot18 Mar
        • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Mattias Jonsson27 Mar
          • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Guilhem Bichot28 Mar
            • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Mattias Jonsson1 Apr
              • Re: bk commit into 5.1 tree (mattiasj:1.2518) BUG#33479Guilhem Bichot10 Apr