List:Commits« Previous MessageNext Message »
From:Konstantin Osipov Date:January 10 2008 10:21pm
Subject:Re: bk commit into 5.1 tree (kostja:1.2653) BUG#12713
View as plain text  
Hello Sergei,

Thank you for a code review.

* Sergei Golubchik <serg@stripped> [08/01/10 18:33]:
> > --- /dev/null	Wed Dec 31 16:00:00 196900
> > +++ b/mysql-test/include/commit.inc	2007-12-29 22:18:25 +03:00
> > @@ -0,0 +1,750 @@
> >  .....
> > +
> > +insert into t2 (a) values (1001);
> > +--error ER_DUP_ENTRY
> > +insert into t1 (a) values (f2(1));
> > +select * from t2;
> > +rollback;
> > +select * from t2;
> > +commit;
> 
> why a commit ?

Is unnecessary. Will remove.

> > +--echo # We need at least one procedure to make sure the WHERE clause is
> > +--echo # evaluated
> > +create procedure dummy() begin end;
> 
> you probably need 'drop procedure if exists dummy' in the beginning

Will add.
 
> > +--echo # 4. Read-write statement: UPDATE, update 0 rows, 1 row matches WHERE 
> > +--echo #
> > +update t1 set a=2;
> > +call p_verify_status_increment(2, 2, 2, 2);
> > +commit;
> > +call p_verify_status_increment(2, 2, 2, 2);
> 
> why this one isn't optimized ?

When I first time encountered this strange result seeing that
ha_write_row is invoked was good enough to not investigate
further. This time I looked to find out why it is invoked.

I turns out it is a bug, an InnoDB specific one. table->write_set
and table->read_set are not the same, so can_compare_record is
evaluated to 'false' in mysql_update.

Here's the test case:

MyISAM:

mysql> create table t1 (a int unique);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 (a) values (2);
Query OK, 1 row affected (0.00 sec)

mysql> update t1 set a=2;
Query OK, 0 rows affected (0.00 sec)
Rows matched: 1  Changed: 0  Warnings: 0

Note the 'Changed' count.

InnoDB:

mysql> create table t1 (a int unique) engine=innodb;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 (a) values (2);
Query OK, 1 row affected (0.02 sec)

mysql> update t1 set a=2;
Query OK, 1 row affected (0.02 sec)
Rows matched: 1  Changed: 1  Warnings: 0


I will report this bug.

> > +
> > +--echo # 5. Read-write statement: UPDATE, update 0 rows, 0 rows match WHERE 
> > +--echo #
> > +update t1 set a=3 where a=1;
> > +call p_verify_status_increment(2, 0, 1, 0);
> > +commit;
> > +call p_verify_status_increment(2, 0, 1, 0);
> 
> why different results for mixed/row ?

In statement mode, there is a read-only transaction in InnoDB
and a write to the binary log. So we have two commits but no 2pc,
since the first engine is read-only.

In the row level replication mode, we only have the read-only
transaction.

I will add this as a comment, right?

> >  .....
> > +--echo # 8. Read-write statement: unqualified DELETE
> > +--echo #
> > +delete from t1;
> > +call p_verify_status_increment(2, 2, 1, 0);
> > +commit;
> > +call p_verify_status_increment(2, 2, 1, 0);
> 
> oh, row/mixed are even optimized differently. why ?

There is a call to ha_delete_all_rows in both cases. This marks
the innodb transaction read-write. Plus there is a read-write
transaction in the binary log in statement mode.
Thus we have 2 prepare + 2 commits.

But no rows are deleted, so there is no write to the binary log in
the row level replication mode. This is why in the second case we
have no 2pc.

> > diff -Nrup a/mysql-test/suite/binlog/r/binlog_stm_mix_innodb_myisam.result
> b/mysql-test/suite/binlog/r/binlog_stm_mix_innodb_myisam.result
> > --- a/mysql-test/suite/binlog/r/binlog_stm_mix_innodb_myisam.result	2007-10-14
> 00:12:46 +04:00
> > +++ b/mysql-test/suite/binlog/r/binlog_stm_mix_innodb_myisam.result	2007-12-29
> 22:18:20 +03:00
> > @@ -349,11 +348,10 @@ master-bin.000001	#	Query	#	#	use `test`
> >  master-bin.000001	#	Query	#	#	use `test`; INSERT INTO t1 values (8,8)
> >  master-bin.000001	#	Query	#	#	use `test`; INSERT INTO t1 values (9,9)
> >  master-bin.000001	#	Query	#	#	use `test`; TRUNCATE table t2
> > -master-bin.000001	#	Xid	#	#	COMMIT /* XID */
> >  master-bin.000001	#	Query	#	#	use `test`; INSERT INTO t1 values (10,10)
> >  master-bin.000001	#	Query	#	#	use `test`; BEGIN
> >  master-bin.000001	#	Query	#	#	use `test`; INSERT INTO t2 values (100,100)
> > -master-bin.000001	#	Xid	#	#	COMMIT /* XID */
> > +master-bin.000001	#	Query	#	#	use `test`; COMMIT
> >  master-bin.000001	#	Query	#	#	use `test`; DROP TABLE t1,t2
> 
> ok, I understand why "COMMIT /* XID */" can be changed to "use `test`; COMMIT".
> but how it can simlpy dissapear ?

I don't know, this started to show up after the merge.
I will need to investigate.

> > diff -Nrup a/mysql-test/suite/rpl_ndb/t/disabled.def
> b/mysql-test/suite/rpl_ndb/t/disabled.def
> > --- a/mysql-test/suite/rpl_ndb/t/disabled.def	2007-12-11 13:51:00 +03:00
> > +++ b/mysql-test/suite/rpl_ndb/t/disabled.def	2007-12-29 22:18:20 +03:00
> > @@ -21,3 +21,5 @@ rpl_ndb_mix_innodb       : Bug #32720   
> >  # the below testcase have been reworked to avoid the bug, test contains
> comment, keep bug open
> >  
> >  #rpl_ndb_dd_advance	 : Bug#25913 rpl_ndb_dd_advance fails randomly
> > +rpl_ndb_circular         : DBUG ASSERT fails
> > +rpl_ndb_circular_simplex : DBUG ASSERT fails
> 
> what DBUG_ASSERT ?
> are you going to do anything about it ?

I am going to report a bug.

Here is my email to dev-replication on this matter:

----- Forwarded message from Konstantin Osipov <konstantin@stripped> -----

Date: Mon, 24 Dec 2007 05:29:06 +0300
From: Konstantin Osipov <konstantin@stripped>
To: mats@stripped
Cc: dev-replication@stripped
Subject: rpl_ndb_circular failure

Hi,

I spent today's night trying to nail down the problem but failed.

How the failure occurs is highly timing dependent, but it does
occur most of the time.

Although I was debugging in my local tree, which has a patch for
12713, the failure becomes almost 100% repeatable with this simple
diff to the main tree:

===== sql_parse.cc 1.714 vs edited =====
--- 1.714/sql/sql_parse.cc      2007-12-17 16:28:05 +03:00
+++ edited/sql_parse.cc 2007-12-24 05:12:58 +03:00
@@ -3737,6 +3737,7 @@ end_with_restore_list:
     send_ok(thd);
     break;
   case SQLCOM_COMMIT:
+    DBUG_ASSERT(thd->lock == NULL);
     if (end_trans(thd, lex->tx_release ? COMMIT_RELEASE :
                               lex->tx_chain ? COMMIT_AND_CHAIN : COMMIT))
       goto error;


But not quite enough repeatable to be able to debug in gdb :((.

What I was able to establish so far is:

 * the failure happens on the master, in the slave SQL thread,
   when it is executing the binary log which has records injected
   from the slave through ndb injector thread. gdb window is master_0.
 * Write_rows_log_event doesn't close ndb_apply_status and t1
   tables because STMT_END_F is missing from the last Write_rows_log_event
 * how it happens: since the flag is missing,
   Rows_log_event::do_update_pos does not call rli->cleanup_context()
 * this leads to tables being left unlocked for the next binlog event, 
    which happens to be Query_log_event("COMMIT");
 * since there is close_thread_tables(thd) in the end of
   Query_log_event::do_apply_event, the bug used to go unnoticed
   because any Query_log_event that doesn't use tables would
   implicitly close the any tables left from the previous event


As far as I was able to figure out, the reason for the missing
flag is that a call to injector::transaction::commit() is
sporadically missing from the ndbcluster binlog thread execution
flow. Why and how it is missing depends on timing, depends whether
I do or do not insert a breakpoint.  So far I wasn't able to
establish exactly why, but it seems that the bypass of the call is
in correlation with trans.good(), which is always "not good" when
it's bypassed.

Regardless of the fix for this particular issue I find reliance on
a flag sent over network being necessary to close the thread tables a 
vulnerable design: the binary log may get corrupt, and a failure
to close the thread tables may easily lead to a server crash.
The server must not crash, regardless of what is in the binary
log, even if an entire event is missing.

Your help is very much appreciated, this is currently blocking me
from committing the last part of 12713, it's one of the few
remaining failing tests, and it fails only because the new assert that
I added as part of the patch. Besides, after having spent 6 hours
debugging the thing from various view points, it has become a sort
of sports interest to me to nail it down. One case of the many
when the whole thing would have benefited a great deal from a
different form of testing than the one that is used now.

Thanks,

-- 
-- Konstantin Osipov              Software Developer, Moscow, Russia
-- MySQL AB, www.mysql.com   The best DATABASE COMPANY in the GALAXY

----- End forwarded message -----

> > diff -Nrup a/sql/handler.cc b/sql/handler.cc
> > --- a/sql/handler.cc	2007-12-20 21:16:51 +03:00
> > +++ b/sql/handler.cc	2007-12-29 22:18:20 +03:00
> > @@ -577,6 +577,218 @@ void ha_close_connection(THD* thd)
> >  
> >  /* ========================================================================
> >   ======================= TRANSACTIONS ===================================*/
> > +/**
> > +  Transaction handling in the server
> > +  ==================================
> > +
> > +  In each client connection, MySQL maintains two transactions:
> > +  - a statement transaction,
> 
> I'd explain here what a "statement transaction" is, because the term is
> non-standard, it came from BDB - it was the first transactional engine,
> and it used nested transactions for every statement. InnoDB uses
> savepoints, and Heikki objects to the term "statement commit", for
> example.  And he's right, in a sense, as it's not a commit strictly
> speaking, it doesn't make any persistent changes - it simply marks end
> of statement.  So, I would define here what is "statement transaction",
> and "statement commit" in convensional language, without using BDB-isms.

OK, will do.

> 
> > +  - a standard, also called "global", transaction.
> 
> global transaction is most often used in the context of distributed
> transaction processing, and it means a distributed transaction managed
> by the transaction coordinator, as opposite to the local transaction in
> every particular resource manager.
> 
> Best is to avoid saying "global transaction" when you simply
> mean a normal, local to one MySQL instance, transaction.

OK.

> > +  In autocommit mode thd->transaction.all is not
> > +  maintained. Instead data of thd->transaction.stmt is
> 
> "not maintained" or "is empty" ?

Both. "is empty" is more accurate. Will clarify.
> 
> > +  - if the user has requested so, by issuing ROLLBACK SQL
> > +  statement
> > +  - if one of the storage engines requested a rollback
> > +  by setting thd->transaction_rollback_request. In this case
> 
> add "(for example in case of internal deadlock)" or something else you
> may find appropriate to explain why we need
> transaction_rollback_request and how it can ever happen.

OK.

> > +  Normally the engine registers itself whenever handler::store_lock()
> 
> external_lock, not store_lock

OK.

> > -    if (!trans->no_2pc && trans->nht > 1)
> > +    must_2pc= ha_check_and_coalesce_trx_read_only(thd, ha_info, all);
> 
> do you need a call to ha_check_and_coalesce_trx_read_only() here ?
> can you check and coalesce in the for() loop below
> that does ht->prepare() ?

If I inline the check into an existing loop, there will be a
redundant 'prepare' for all cases when the first engine in the
list has a read-write transaction.
I think this optimization is unnecessary - in most cases the
size of the list is 2, so an additional loop is very low cost.
Keeping the code in a separate function also allows for clear
comments and good isolation. Oh how many times did I change the
contents of the function before I got to the final verison!

> > +
> > +    if (!trans->no_2pc && must_2pc)
> >      {
> > -      for (; *ht && !error; ht++)
> > +      for (; ha_info && !error; ha_info= ha_info->next())
> >        {
> >          int err;
> > -        if ((err= (*(*ht)->prepare)(*ht, thd, all)))
> > +        handlerton *ht= ha_info->ht();
> > +        /*
> > +          Call two-phase commit even if this particular
> > +          transaction is read-only, just in case. It should not
> > +          produce any overhead, and normally should be a no-op,
> > +          but gives the engine a chance to abort the distributed
> > +          commit.
> 
> why it could be useful (I mean "chance to abort ...") ?

I have no strong opinion frankly. I just thought it's better to
have it this way -- this makes it easy to count 'prepapre' in the
test. Basically, if we need 2pc, we prepare all engines, so
handler_prepare count is always equal to handler_commit count.
Other than that it's just a gut feeling. What do you think?

> > +  /*
> > +    We must not commit a global transaction if statement
> 
> s/commit/rollback/
Ack.

> > +inline
> > +void
> > +handler::mark_trx_read_write()
> > +{
> > +  Ha_trx_info *ha_info= &ha_thd()->ha_data[ht->slot].ha_info[0];
> > +  /*
> > +    When a storage engine method is called, the transaction must
> > +    have been started, unless it's a DDL call, for which the
> > +    storage engine starts the transaction internally, and commits
> > +    it internally, without registering in the ha_list.
> > +    Unfortunately here we can't know know for sure if the engine
> > +    has registered the transaction or not, so we must check.
> > +  */
> > +
> > +  if (ha_info->is_started())
> > +  {
> > +    DBUG_ASSERT(has_transactions());
> > +    if (table_share == NULL || table_share->tmp_table == NO_TMP_TABLE)
> 
> when can table_share be NULL ?

I know of only one case - in handler::ha_delete_table.

Remember the tweak with dummy_share in global ha_delete_table()?

Should I add this in a comment?

> > +  void coalesce_trx_with(const Ha_trx_info *stmt_trx)
> > +  {
> > +    /*
> > +      Must be called only after the transaction has been started.
> > +      Can be called twice, e.g. when we have two read-write
> 
> better (less confusing): twice -> many times, two -> many

Will do.

> > +  /**
> > +    Delete a table in the engine. Called for temporary tables, intermediate
> > +    tables of CREATE TABLE ... SELECT.
> 
> do you mean it's not called for normal tables ?

Hmm, no I don't. The comment is misleading. Will fix.
> 
> > +  */
> >    virtual int delete_table(const char *name);
> >  private:
> > +  /* Private helpers */
> > +  inline void mark_trx_read_write();
> > +private:
> 
> why private: twice ?

Just wanted to stress the border between two groups of
declarations: private helpers and interface definition for 
SE writers. No strong opinion, can remove.

> >    /*
> >      Low-level primitives for storage engines.  These should be
> >      overridden by the storage engine class. To call these methods, use
> > diff -Nrup a/sql/log.cc b/sql/log.cc
> > --- a/sql/log.cc	2007-12-13 23:58:53 +03:00
> > +++ b/sql/log.cc	2007-12-29 22:18:21 +03:00
> > @@ -3321,6 +3321,16 @@ THD::binlog_start_trans_and_stmt()
> >      if (options & (OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN))
> >        trans_register_ha(this, TRUE, binlog_hton);
> >      trans_register_ha(this, FALSE, binlog_hton);
> > +    /*
> > +      Mark statement transaction as read/write. We never start
> > +      a binary log transaction and keep it read-only,
> > +      therefore it's best to mark the transaction read/write just
> > +      at the same time we start it.
> 
> why it's best ? shouldn't it be marked read-write automatically
> on the first ha_write/delete/update_row ?

This is the binlog engine. First, it does not have ha_write_row.
Second, I still think it's better to mark the trasnaction
read-write once per transaction, rather than once per each row
written. I was considering doing the same for other engines even
-- e.g. Dima suggested once to mark the transaction read-write in 
store_lock, if the lock value is > TL_WRITE_ONLY, but I abandoned
this idea since sometimes a handler method is called without 
store_lock.

> I agree that it will be marked read-write sooner or later, and
> that forcing read-write here doesn't hurt. But why it's necessary ?

I think you are not taking into account that there is no
ha_write_row for the binlog engine.

> > diff -Nrup a/sql/rpl_injector.cc b/sql/rpl_injector.cc
> > --- a/sql/rpl_injector.cc	2007-04-12 18:13:46 +04:00
> > +++ b/sql/rpl_injector.cc	2007-12-29 22:18:23 +03:00
> > @@ -61,7 +61,9 @@ injector::transaction::~transaction()
> >  int injector::transaction::commit()
> >  {
> >     DBUG_ENTER("injector::transaction::commit()");
> > +   printf("\n\n\n\n\nHERE WE GO\n\n\n\n");
> 
> :)

Oops, will fix.

> 
> >     m_thd->binlog_flush_pending_rows_event(true);
> > +   ha_autocommit_or_rollback(m_thd, 0);
> 
> ok, you've added it, I guess, to ensure that a statement
> transaction is committed before main transaction.
> 
> but why did you need it ? every insert/update/delete ends
> with ha_autocommit_or_rollback(), and a normal statements don't
> need an extra ha_autocommit_or_rollback() right before end_trans().
> do you mean that in rbr this invariant broken ?

There is a pending statement transaction in the binary log as far
as I remember. I will look more into this question tomorrow.

> >     end_trans(m_thd, COMMIT);
> >     DBUG_RETURN(0);
> >  }
> > diff -Nrup a/sql/sp.cc b/sql/sp.cc
> > --- a/sql/sp.cc	2007-10-30 20:08:11 +03:00
> > +++ b/sql/sp.cc	2007-12-29 22:18:23 +03:00
> > @@ -669,6 +669,13 @@ sp_returns_type(THD *thd, String &result
> >                (TYPE_ENUM_PROCEDURE or TYPE_ENUM_FUNCTION).
> >    @param sp   Stored routine object to store.
> >  
> > +  @note Opens and closes the thread tables. Therefore assumes
> > +  that there are no locked tables in this thread on invokation.
> > +  Unlike some other DDL statements, does close the tables
> > +  in the end, since the caller normally issues the implicit
> > +  grant to the definer of the created routine, which leads
> > +  to open/modify/close of mysql.procs_priv table.
> > +
> 
> Why does it need to close ? Shouldn't close_thread_tables() that is
> called after mysql_parse() be enough ?

Yes, I thought so too at first. But there is this "implicit grant" after
create/drop routine, which opens the grant tables.  Will add a comment.

> >    @return Error code. SP_OK is returned on success. Other SP_ constants are
> >    used to indicate about errors.
> >  */
> > diff -Nrup a/sql/sql_acl.cc b/sql/sql_acl.cc
> > --- a/sql/sql_acl.cc	2007-12-12 18:36:03 +03:00
> > +++ b/sql/sql_acl.cc	2007-12-29 22:18:23 +03:00
> > @@ -3862,7 +3862,7 @@ bool check_grant(THD *thd, ulong want_ac
> >      of other queries). For simple queries first_not_own_table is 0.
> >    */
> >    for (i= 0, table= tables;
> > -       table != first_not_own_table && i < number;
> > +       table && table != first_not_own_table && i < number;
> 
> can that happen (table==0, I mean) ?
> shouldn't "i < number" take care of that ?

OK, this is needed to fix a crash of this test case:

show tables where (f2(11) = 11);
select * from t2;
rollback;
select * from t2;

I haven't looked further. I will talk to Dima what is wrong with this
loop/information schema so that this additional check is needed.

> >      /*
> >        Let us commit transaction for statement. Since in 5.0 we only have
> >        one statement transaction and don't allow several nested statement
> >        transactions this call will do nothing if we are inside of stored
> >        function or trigger (i.e. statement transaction is already active and
> >        does not belong to statement for which we do close_thread_tables()).
> >        TODO: This should be fixed in later releases.
> >      */
> > -    ha_commit_stmt(thd);
> > +  if (!(thd->state_flags & Open_tables_state::BACKUPS_AVAIL))
> > +  {
> > +    thd->main_da.can_overwrite_status= TRUE;
> > +    ha_autocommit_or_rollback(thd, thd->is_error());
> > +    thd->main_da.can_overwrite_status= FALSE;
> >  
> > +    /*
> > +      Reset transaction state, but only if we're not inside a
> > +      sub-statement of a prelocked statement.
> > +    */
> 
> how can we get here if we are inside a sub-statement of a prelocked
> statement ?

In particular it's called in 
sp_lex_keeper::reset_lex_and_exec_core, which may be called for a
sub-statement of a pre-locked statement.
Before my patch it of course used to be called from a whole bunch of DML
statements as well. 

> > -  if (transactional_table)
> > -  {
> > -    if (ha_autocommit_or_rollback(thd,error >= 0))
> > -      error=1;
> > -  }
> > -
> > -  if (thd->lock)
> > -  {
> > -    mysql_unlock_tables(thd, thd->lock);
> > -    thd->lock=0;
> > -  }
> 
> Hmm, if here (and below, and for insert/update/load) you don't call
> ha_autocommit_or_rollback, then how can you ensure that
> commit will succeed before sending an ok packet ?

send_ok is no more *sending* ok. It's instead recording an OK message in the
diagnostics area. I will rename it to my_ok soon. 

Later on I have this hack:


  /* If commit fails, we should be able to reset the OK status. */
  thd->main_da.can_overwrite_status= TRUE;
  ha_autocommit_or_rollback(thd, thd->is_error());
  thd->main_da.can_overwrite_status= FALSE;

That is, I give the commit an exclusive right to reset the
diagnostics area if there is an error.

I hope I missed no question. I will investigate the remaining
difficult ones. 
Hope the above is of help.

-- 
-- Konstantin Osipov              Software Developer, Moscow, Russia
-- MySQL AB, www.mysql.com   The best DATABASE COMPANY in the GALAXY
Thread
bk commit into 5.1 tree (kostja:1.2653) BUG#12713konstantin29 Dec
  • Re: bk commit into 5.1 tree (kostja:1.2653) BUG#12713Sergei Golubchik10 Jan
    • Re: bk commit into 5.1 tree (kostja:1.2653) BUG#12713Konstantin Osipov10 Jan
      • Re: bk commit into 5.1 tree (kostja:1.2653) BUG#12713Sergei Golubchik12 Jan