List: Commits
From: jonas oreland  Date: July 8 2011 12:30pm
Subject: bzr push into mysql-5.1-telco-7.0 branch (jonas.oreland:4445 to 4446)
 4446 jonas oreland	2011-07-08 [merge]
      ndb - merge 70 to 71

 4445 Frazer Clement	2011-07-07
      WL5353 Primary Cluster Conflict Detection : NDB$EPOCH() row based function
      Ndb handler is modified to set the NDB$AUTHOR pseudo column to 1 when writing or updating a row from the slave.
      The NDB$AUTHOR pseudo column defaults to zero on every update, so with this change, it acts as an indicator of whether 
      a row was last updated by the Slave (NDB$AUTHOR == 1), or not (NDB$AUTHOR == 0).
      To support > 2 clusters replicating, we should support > 1 bit NDB$AUTHOR,
      with some scheme for mapping from ServerId to Author bits.
      Reflected GCI/Max_rep_epoch
      The Ndb_slave_max_replicated_epoch status variable tracks the
      highest local-cluster epoch which was applied *before* the
      replicated remote-cluster updates currently being applied by 
      the slave.
      This acts as a kind of 'confirmation' of application of 
      each epoch on the remote cluster.
      Additionally, it indicates the closure of the window of
      risk of replicated updates arriving from the remote cluster
      which conflict with updates committed in a given epoch.  Once
      a particular epoch has been applied remotely, later updates
      received from the remote slave are by definition applied at the
      remote slave *after* that epoch was applied, and therefore are
      *not* in conflict.
      This information is used by NDB$EPOCH().  When applying a replicated
      remote update, the existing (if any) row's commit-epoch is checked
      to determine whether it is less than or equal to the current
      Ndb_slave_max_rep_epoch value.  If it is less than or equal to the
      current value, then the received update is not in conflict.  If it
      is greater than the current value, then it is in conflict.
      When replicated remote updates are committed, the row's commit epoch
      naturally changes to the current system epoch.  This could cause
      further updates from the remote cluster to appear to be in-conflict,
      when in reality they are not.  To avoid this case, rows written by 
      the slave are marked (via their NDB$AUTHOR column).  The value of the
      NDB$AUTHOR column is also checked when looking for conflicts.  If the
      NDB$AUTHOR column indicates that the row was last written by the slave,
      then *by definition*, it is not in conflict with a locally sourced update.
      Therefore, the pushed program for conflict detection has the following
      pseudo code :
        if (NDB$AUTHOR == 0) // Last update was local
          if (NDB$GCI64 > Ndb_slave_max_rep_epoch)
            return row_in_conflict;
        return row_not_in_conflict;
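As an illustrative C++ restatement of that pseudo code (the real check is pushed to the data nodes as an interpreted program; the function name here is hypothetical):

```cpp
#include <cstdint>

// Hypothetical restatement of the pushed conflict-detection program.
// ndb_author : NDB$AUTHOR pseudo column (1 if last write came from the slave)
// row_gci64  : NDB$GCI64 commit epoch of the existing row
// max_rep    : current Ndb_slave_max_rep_epoch value
bool row_in_conflict(uint32_t ndb_author, uint64_t row_gci64, uint64_t max_rep)
{
  if (ndb_author == 0)             // last update was local
    return row_gci64 > max_rep;    // not yet known replicated => conflict
  return false;                    // last written by the slave => no conflict
}
```

For example, a row committed in an epoch that has already been reflected back (row_gci64 <= max_rep) is never reported as conflicting.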
      Epoch resolution
      A Cluster epoch is a 64 bit number comprised of 2 32-bit components,
      GCI_hi and GCI_lo.  GCI_hi starts at 1 after an initial cluster start,
      and increments periodically, according to the configured Global Checkpoint
      period, normally every 2s.
      GCI_lo also increments periodically, according to the configured epoch
      timeout period, normally every 100 millis.
      GCI_lo is reset when GCI_hi increments.
      For the default configuration, this would suggest that normally there could
      be up to 2000/100 = 20 GCI_lo values per GCI_hi value.
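A minimal sketch of how the two 32-bit components compose into the 64-bit epoch (helper names are illustrative, not taken from the NDB source):

```cpp
#include <cstdint>

// GCI_hi occupies the upper 32 bits of the 64-bit epoch, GCI_lo the lower 32.
inline uint64_t make_epoch(uint32_t gci_hi, uint32_t gci_lo)
{
  return (static_cast<uint64_t>(gci_hi) << 32) | gci_lo;
}
inline uint32_t epoch_hi(uint64_t epoch) { return static_cast<uint32_t>(epoch >> 32); }
inline uint32_t epoch_lo(uint64_t epoch) { return static_cast<uint32_t>(epoch); }
```

Since GCI_lo resets when GCI_hi increments, an epoch with a higher GCI_hi always compares greater than any epoch with a lower GCI_hi under plain 64-bit comparison.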
      Note that GCI_hi is incremented by a Global-checkpoint-save round completing,
      where Redo log files for a particular global checkpoint are successfully 
      flushed to disk.  Where disk IO is in short supply, this can take some time,
      and the period between increments can be extended.  In this case, there can
      be more GCI_lo increments per GCI_hi increment than the division of the configured
      periods would suggest.
      The ROW_GCI64 pseudo column is implemented using the existing 32-bit ROW_GCI to 
      retrieve the GCI_hi component, and using a variable bit-width column to store 
      the n lsbs of GCI_lo.  As described above, storing the full 64 bits for each row 
      would be wasteful.  The number of bits to use to store the GCI_lo component can 
      be controlled by users as a conflict function parameter.  Where it is not supplied,
      it defaults to 6, giving 2^6 = 64 values.
      If GCP Save is too slow, it could be that there are more than 2^n GCI_lo values per
      GCI_hi.  GCI_lo values above 2^n -1 cannot be recorded in the database.  This is
      problematic as if we just saturate at 2^n -1 then we cannot tell the difference 
      between rows updated at 2^n -1 and 2^n + x.  When the highest replicated epoch reached
      2^n - 1, we will assume that everything committed in the current GCI_hi value has been
      replicated, and may miss some conflicts.
      To avoid this scenario, we treat the max value of GCI_lo (2^n -1) as if it were 2^32 -1, e.g.
      0xffffffff.  So for a 3-bit GCI_lo, the values can be :
        stored value 0..6 : GCI_lo was 0..6
        stored value 7    : GCI_lo was 7 or above (reads back as 0xffffffff)
      Note that the value 7 is not available - it is used to convey 'saturation'.
      Where a row is updated with GCI_lo at 7 or above, when GCI_64 is read back, the bottom word
      will be 0xffffffff.  This indicates a 'later' epoch than the actual epoch, but has the 
      safety property that we need.  When ndb_apply_status reflection causes the Max Replicated
      Epoch to reach GCI_lo of 7 (ndb_apply_status is not affected by this bit limitation), we
      will not consider these rows to be 'replicated'.  It is only when we see that the next GCI_hi 
      value has been replicated, that these rows are considered to be replicated.
      In this way we avoid the risk of missing a conflict, but risk false conflicts.
      Another risk here is that if the system quiesces after this saturation, then
      we may not get the replication traffic necessary to determine that the next GCI_hi
      value has been replicated, and so the window of conflict stays open until some further 
      replication traffic occurs.
      For these reasons, the number of bits used should not be pared too low, as it will result
      in more conflicts. 
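The saturation scheme described above can be sketched as follows (function names are hypothetical; n_bits is the per-row GCI_lo bit width, default 6):

```cpp
#include <cstdint>

// Store the n LSBs of GCI_lo, saturating at 2^n - 1.
uint32_t encode_gci_lo(uint32_t gci_lo, unsigned n_bits)
{
  const uint32_t max_stored = (1u << n_bits) - 1;  // e.g. 7 for n_bits = 3
  return (gci_lo >= max_stored) ? max_stored : gci_lo;
}

// Read back: the saturated value is treated as the latest possible GCI_lo,
// so saturated rows err towards appearing in-conflict (safe, but may yield
// false conflicts).
uint32_t decode_gci_lo(uint32_t stored, unsigned n_bits)
{
  const uint32_t max_stored = (1u << n_bits) - 1;
  return (stored == max_stored) ? 0xffffffffu : stored;
}
```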
      Existing conflict detection algorithms assume that conflicts are
      symmetric.  For example, with NDB$MAX() if Cluster 1 updates a row 
      with timestamp 6 and Cluster 2 updates it with timestamp 7, then, 
      when applying the replicated updates :
        At Cluster 1
          Received timestamp (7) > Existing timestamp (6), APPLY
        At Cluster 2
          Received timestamp (6) < Existing timestamp (7), DISCARD
       So  Cluster 2 wins.
      NDB$EPOCH does not have the same guaranteed symmetry.  Two Clusters
      replicating circularly may update the same row(s) 'semi-simultaneously', 
      one may detect the updates as in-conflict, while the other may not.  One 
      reason for this is imprecision in the conflict detection, due to the
      limited ordering in the Cluster Binlog, as described below.  The effect of
      this is that to get a deterministic result, we need to have only one cluster
      determining when conflicts have occurred, and taking action.  As only one
      Cluster is detecting conflicts, other clusters are accepting changes in
      the order they arrive, and may contain divergent data.  The conflict-detecting
      Cluster (The Primary Cluster) must therefore take steps to realign other
      Clusters in the case of conflicts.  
      Realigning the Secondary Cluster
      Realignment is achieved by logging events in the Primary Cluster Binlog 
      which will force the Secondary into alignment when applied by the Secondary's
      Slave.  The NdbApi refreshTuple() mechanism is used to generate these events 
      in the Binlog.  refreshTuple() has the following semantics :
        - If row exists
            Lock exclusively
            Produce Insert NdbApi event with all column contents as after image
        - If row does not exist
            Lock exclusively (by pretending it does exist!)
            Produce Delete NdbApi event with primary key only
      This results in WRITE_ROW and DELETE_ROW events for conflicting rows appearing
      in the Binlog.  These events are Binlogged in the Primary epoch transaction where
      the results of applying the Secondary epoch transaction (if any) are Binlogged.
      As the realignment is a 'force' operation, it is important that the WRITE_ROW and
      DELETE_ROW events are applied at the Secondary as 'idempotent' operations.  
      The WRITE_ROW should act as an Insert if the row does not exist, and as an update
      otherwise.  The Delete row should ignore the case where the row does not exist.
      This behaviour is the default for replication, except where a conflict algorithm
      is defined for a table, so it is important that the Secondary have no conflict 
      algorithm defined for tables which can be realigned by this mechanism.  (This
      could be circumvented if necessary).
      Within the Primary Cluster Slave, the algorithm is as follows :
           while (rows_remain_in_secondary_epoch_transaction)
             while (space_left_in_batch)
               define conflict op with interpreted program
             if conflict_detected
               define refresh_row operation
      Conflict handling occurs at the end of each Slave batch - detected conflicts
      result in refresh row operations being executed, before continuing with
      the next batch of operations.  Commit occurs once all conflict handling has
      completed.
      Handling multiple conflicts
      It is important that once a conflict is detected between a Primary Cluster update
      and a Secondary Cluster update, any further Secondary Cluster updates to the same
      row are also regarded as in-conflict, whether they occur within the same Secondary 
      Cluster epoch transaction as the initial conflict, or in some later transaction.
      Within the application of a Secondary Cluster epoch transaction in the Primary Cluster
      Slave, there can be multiple batches of execution, each with its own post-execute
      conflict handling stage.
      Where a Secondary epoch transaction has multiple conflicting updates to a row, they
      can be applied in the same or separate batches.
         update row 1 set a='First';
         update row 1 set b='Second';
         #execute(NoCommit) + handle conflicts
         update row 2 set a='Third';
         #execute(NoCommit) + handle conflicts
      In this case, the first and second conflicting operations will detect the conflict 
      based on the committed epoch of the row, and the author column.  They will each
      issue a refreshTuple operation at the end of the batch.
      The first refreshTuple operation will result in Binlog event generation, and the
      row's epoch being updated to the current epoch at commit time.
      The second refreshTuple operation will fail, as it is not allowed for a refreshTuple
      to be followed by any other operation on a row in a transaction.
      This 'operation after refresh' error is used to indicate a conflict - the row is 
      already being refreshed, so no need to take any further steps.
      When the third operation is executed, it will also hit the 'operation after refresh'
      scenario, although it will be an update operation rather than a refresh operation.
      Using the 'no operation after refresh' mechanism in this way avoids the need for the
      slave to 'remember' which rows in the current transaction have suffered conflicts.
      Delete conflict issues
      Delete conflicts are problematic as by definition, a row delete removes the information 
      necessary to detect further conflicts (epoch, author etc).
      Normally, a DELETE from the Secondary which finds the row missing would be considered 
      a conflict, and result in a refresh.  However, this is not effective, as the fact that
      the row has been refreshed, and should therefore record conflicts for any further 
      Secondary modifications until the refresh has replicated, cannot be recorded in the
      deleted row.
      Additionally, if a refresh is issued for a deleted row, then there are different 
      behaviours within the transaction depending on batching.  If we receive a DELETE followed
      by an INSERT from the Secondary, and the row does not exist on the Primary, then the
      DELETE indicates a conflict, but the behaviour varies depending on whether the INSERT is
      executed in the same batch as the DELETE.  If it's in the same batch then it will succeed.
      If it's in a later batch then it will hit the no-operation-after-refresh check.  If it's
      in a following transaction, then it will succeed, as the delete conflict was not recorded.
      To avoid this strange inconsistency, DELETE-DELETE conflicts do not result in a refreshTuple
      operation being issued, and we state that we cannot currently avoid divergence where 
      DELETE-DELETE conflicts are possible. 
      Between epoch transactions
      Where the same conflicting row is updated by the Secondary Slave in separate Secondary
      Slave epoch transactions, the normal epoch/author conflict detection mechanism is used
      to detect the conflict.  The commit of the refreshTuple operation from the first 
      transaction updates the row's epoch to the current system epoch, so that any further
      updates from the Secondary, occurring before the refreshTuple has been replicated, are
      detected as in-conflict.
      In this way, detecting a conflict, and refreshing the row, extends the window of potential
      conflicts, until the refresh operation has also been 'reflected' from the Secondary. 
      Conflict Detection Imprecision
      The Ndb Binlog is recorded as a result of events received over the
      NdbApi event Api.  The event Api supports limited ordering within
      an epoch, only guaranteeing that events for multiple transactions 
      affecting the same row will be received in order.
      Events for different rows can be received in different orders on
      different Api nodes.
      A consequence of this is that when a Slave MySQLD applies an epoch
      transaction, the generated events for its ndb_apply_status update
      and other row updates may arrive in any order (within the epoch) at
      Binlogging MySQLDs.
      The ndb_apply_status event is used to flag that an epoch has been
      applied by including it in the Slave Cluster's Binlog (as part of
      an epoch) which is 'reflected' back to the 'Master'.  However, due
      to the vague Binlog ordering, we can only know that our epoch can be
      considered 'applied' once we have fully applied the Slave's 'reflected'
      epoch transaction, as some row update from our epoch may be the last
      event recorded in the Slave's epoch transaction.
      This problem is handled by the mechanism for updating Ndb_slave_max_rep_epoch 
      in the Slave, where it is only updated to the highest seen value after committing 
      a slave transaction (containing a slave epoch).  This ensures that when
      Ndb_slave_max_rep_epoch advances, all subsequently applied events occurred
      after the epoch in question was applied.
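A sketch of that deferred update (struct and member names are illustrative, not from the NDB source):

```cpp
#include <algorithm>
#include <cstdint>

// Ndb_slave_max_rep_epoch is only advanced after the slave transaction
// containing the reflected ndb_apply_status row has committed.
struct SlaveEpochTracker
{
  uint64_t published_max = 0;  // value visible to conflict detection
  uint64_t pending_max   = 0;  // highest epoch seen in the in-flight transaction

  void on_apply_status_row(uint64_t epoch)
  { pending_max = std::max(pending_max, epoch); }

  void on_commit()
  {
    published_max = std::max(published_max, pending_max);
    pending_max = 0;
  }
};
```

Staging into pending_max and publishing only at commit ensures that any event applied after the published value advanced really did occur after the epoch in question was applied.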
      The main downside of this imprecision is that it can cause false conflicts
      to be detected.  Where a local epoch was applied at the slave before a slave-local
      update to the same row, if both are binlogged in the same slave epoch then
      the slave-local update will be detected as in-conflict, even though it occurred
      after the applied update.
      Example :
        Primary              Primary Binlog          Secondary                  Secondary Binlog
        'Epoch 44 start'
        Set T1 row 1 to A     
        'Epoch 44 end'
                               WR apply_status
                               WR T1 (1, A)
                                                     'Epoch 222 start'                                     
                                                     Set T1 row 1 to B
                                                     BEGIN # Slave
                                                       Set apply_status row 
                                                         to 44
                                                       Write (1,A) into T1
                                                     Set T1 row 1 to C
                                                     'Epoch 222 end'
                                                                                  WR apply_status
                                                                                     (Secondary, 222)
                                                                                  WR T1 (1,B)
                                                                                  WR T1 (1,A)
                                                                                  WR apply_status
                                                                                     (Primary, 44)
                                                                                  WR T1 (1,C)
        In this example, looking at the order of events on the Secondary, we can say
        that in theory, the write of B to T1 row 1 occurred before the application
        of the replicated epoch 44, and the write of C to T1 row 1 occurred afterwards.
        However, the Binlog contents only guarantee order between updates to the same
        row, so the relative order of the T1 and apply_status updates is not guaranteed.
        In this case, the apply_status update is logged *after* the corresponding T1 update.
        Due to this uncertainty, we can only be *sure* that Primary epoch 44 has been fully
        applied, at the end of the Secondary epoch (222).  This 'rounding up' or imprecision, 
        causes us to consider the second write by the Secondary to T1 row 1 as also being
        in conflict, as we cannot tell whether it occurred before or after the application
        of the primary epoch.
        Note that other orders are possible within the Secondary Binlog.  The limitations are :
         1) The locally generated apply status write row event will be first.
         2) Updates to the same row (Table, PK pair) will be in-order.
        Possible Binlog order combinations :
                       ('In order') (As above)
          WR AS S 222  WR AS S 222  WR AS S 222  WR AS S 222
          WR AS P 44   WR T1 1 B    WR T1 1 B    WR T1 1 B
          WR T1 1 B    WR AS P 44   WR T1 1 A    WR T1 1 A
          WR T1 1 A    WR T1 1 A    WR AS P 44   WR T1 1 C
          WR T1 1 C    WR T1 1 C    WR T1 1 C    WR AS P 44
         As the number of writes, rows and epochs involved goes up, the possible combinations 
         multiply.
         When applying the above Binlog transaction in the Primary, the Primary Slave will note
         the WR AS P 44 row as a new 'Max replicated epoch' for the Primary as it applies it.
         However, as it cannot be sure of the Binlog ordering, it is only safe to use this 
         as the Max replicated epoch once the received Binlog transaction has been committed.
      The NDB$EPOCH function does not support tables with Blob columns.
      Two reasons currently :
        1) Not clear how well adding an interpreted program to a Blob update/delete operation works in practice
           (This affects the other conflict algorithms as well)
        2) No implementation of refreshTuple() for Blobs yet
           Normal refreshTuple will refresh the main table row only, not the part table(s) rows.

=== modified file 'sql/'
--- a/sql/	2011-07-07 14:48:06 +0000
+++ b/sql/	2011-07-08 12:28:37 +0000
@@ -2264,7 +2264,8 @@ int ha_ndbcluster::get_metadata(THD *thd
     my_free(pack_data, MYF(MY_ALLOW_ZERO_PTR));
+  ndb->setDatabaseName(m_dbname);
   Ndb_table_guard ndbtab_g(dict, m_tabname);
   if (!(tab= ndbtab_g.get_table()))
@@ -10214,26 +10215,47 @@ int ha_ndbcluster::open(const char *name
     m_key_fields[i]= NULL;
-  // Init table lock structure 
-  /* ndb_share reference handler */
-  if (!(m_share=get_share(name, table)))
+  set_dbname(name);
+  set_tabname(name);
+  if ((res= check_ndb_connection(thd)) != 0)
     local_close(thd, FALSE);
-    DBUG_RETURN(1);
+    DBUG_RETURN(res);
+  }
+  // Init table lock structure
+  /* ndb_share reference handler */
+  if ((m_share=get_share(name, table, FALSE)) == 0)
+  {
+    /**
+     * No share present...we must create one
+     */
+    if (opt_ndb_extra_logging > 19)
+    {
+      sql_print_information("Calling ndbcluster_create_binlog_setup(%s) in ::open",
+                            name);
+    }
+    Ndb* ndb= check_ndb_in_thd(thd);
+    ndbcluster_create_binlog_setup(thd, ndb, name, strlen(name),
+                                   m_dbname, m_tabname, FALSE);
+    if ((m_share=get_share(name, table, FALSE)) == 0)
+    {
+      local_close(thd, FALSE);
+      DBUG_RETURN(1);
+    }
   DBUG_PRINT("NDB_SHARE", ("%s handler  use_count: %u",
                            m_share->key, m_share->use_count));
   thr_lock_data_init(&m_share->lock,&m_lock,(void*) 0);
-  set_dbname(name);
-  set_tabname(name);
-  if ((res= check_ndb_connection(thd)) ||
-      (res= get_metadata(thd, name)))
+  if ((res= get_metadata(thd, name)))
     local_close(thd, FALSE);
   if ((res= update_stats(thd, 1, true)) ||
       (res= info(HA_STATUS_CONST)))

=== modified file 'sql/'
--- a/sql/	2011-07-07 14:48:06 +0000
+++ b/sql/	2011-07-08 12:28:37 +0000
@@ -5240,10 +5240,15 @@ int ndbcluster_create_binlog_setup(THD *
                       "FAILED CREATE (DISCOVER) EVENT OPERATIONS Event: %s",
       /* a warning has been issued to the client */
-      DBUG_RETURN(0);
+      break;
+  if (share)
+  {
+    free_share(&share);
+  }

No bundle (reason: useless for push emails).