4446 jonas oreland 2011-07-08 [merge]
ndb - merge 70 to 71
4445 Frazer Clement 2011-07-07
WL5353 Primary Cluster Conflict Detection : NDB$EPOCH() row based function
The Ndb handler is modified to set the NDB$AUTHOR pseudo column to 1 when writing or updating a row from the slave.
The NDB$AUTHOR pseudo column defaults to zero on every update, so with this change, it acts as an indicator of whether
a row was last updated by the Slave (NDB$AUTHOR == 1), or not (NDB$AUTHOR == 0).
To support > 2 clusters replicating, we should support > 1 bit NDB$AUTHOR,
with some scheme for mapping from ServerId to Author bits.
The Ndb_slave_max_replicated_epoch status variable tracks the
highest local-cluster epoch which was applied *before* the
replicated remote-cluster updates currently being applied by the Slave.
This acts as a kind of 'confirmation' of the application of
each epoch on the remote cluster.
Additionally, it indicates the closure of the window of
risk of replicated updates arriving from the remote cluster
which conflict with updates committed in a given epoch. Once
a particular epoch has been applied remotely, later updates
received from the remote slave are by definition applied at the
remote slave *after* that epoch was applied, and therefore are
*not* in conflict.
This information is used by NDB$EPOCH(). When applying a replicated
remote update, the existing (if any) row's commit-epoch is checked
to determine whether it is less than or equal to the current
Ndb_slave_max_rep_epoch value. If it is less than or equal to the
current value, then the received update is not in conflict. If it
is greater than the current value, then it is in conflict.
When replicated remote updates are committed, the row's commit epoch
naturally changes to the current system epoch. This could cause
further updates from the remote cluster to appear to be in-conflict,
when in reality they are not. To avoid this case, rows written by
the slave are marked (via their NDB$AUTHOR column). The value of the
NDB$AUTHOR column is also checked when looking for conflicts. If the
NDB$AUTHOR column indicates that the row was last written by the slave,
then *by definition*, it is not in conflict with a locally sourced
update.
Therefore, the pushed program for conflict detection has the following
pseudo code :
if (NDB$AUTHOR == 0)                        // Last update was local
  if (NDB$GCI64 > Ndb_slave_max_rep_epoch)  // Epoch not yet confirmed as replicated
    CONFLICT
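The decision above can be sketched as a plain function, simulating what the pushed interpreted program evaluates per row. `RowMeta` and the field names are illustrative stand-ins, not NdbApi types:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for the per-row metadata the pushed program reads.
struct RowMeta {
  uint64_t gci64;   // NDB$GCI64 : commit epoch of the existing row
  int      author;  // NDB$AUTHOR: 1 if last written by the slave, else 0
};

// Returns true when a replicated remote update conflicts with the row:
// the row was last written locally, and its commit epoch has not yet been
// confirmed as replicated back from the remote cluster.
bool in_conflict(const RowMeta& row, uint64_t max_replicated_epoch) {
  if (row.author != 0)
    return false;  // last write came from the slave: by definition no conflict
  return row.gci64 > max_replicated_epoch;
}
```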
A Cluster epoch is a 64 bit number comprised of 2 32-bit components,
GCI_hi and GCI_lo. GCI_hi starts at 1 after an initial cluster start,
and increments periodically, according to the configured Global Checkpoint
period, normally every 2s.
GCI_lo also increments periodically, according to the configured epoch
timeout period, normally every 100 millis.
GCI_lo is reset when GCI_hi increments.
For the default configuration, this would suggest that normally there could
be up to 2000/100 = 20 GCI_lo values per GCI_hi value.
Note that GCI_hi is incremented by a Global-checkpoint-save round completing,
where Redo log files for a particular global checkpoint are successfully
flushed to disk. Where disk IO is in short supply, this can take some time,
and the period between increments can be extended. In this case, there can
be more GCI_lo increments per GCI_hi increment than the division of the configured
periods would suggest.
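The two 32-bit components combine into the 64-bit epoch in the obvious way; a small sketch (helper names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Compose and decompose a 64-bit Cluster epoch from its GCI_hi / GCI_lo parts.
inline uint64_t make_epoch(uint32_t gci_hi, uint32_t gci_lo) {
  return (static_cast<uint64_t>(gci_hi) << 32) | gci_lo;
}
inline uint32_t epoch_hi(uint64_t epoch) { return static_cast<uint32_t>(epoch >> 32); }
inline uint32_t epoch_lo(uint64_t epoch) { return static_cast<uint32_t>(epoch & 0xffffffffu); }
```

Because GCI_hi occupies the high word, comparing two epochs as plain 64-bit integers orders them correctly: a later GCI_hi always wins, with GCI_lo breaking ties.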
The ROW_GCI64 pseudo column is implemented using the existing 32-bit ROW_GCI to
retrieve the GCI_hi component, and using a variable bit-width column to store
the n lsbs of GCI_lo. As described above, storing the full 64 bits for each row
would be wasteful. The number of bits to use to store the GCI_lo component can
be controlled by users as a conflict function parameter. Where it is not supplied,
it defaults to 6, giving 2^6 = 64 values.
If GCP Save is too slow, it could be that there are more than 2^n GCI_lo values per
GCI_hi. GCI_lo values above 2^n -1 cannot be recorded in the database. This is
problematic as if we just saturate at 2^n -1 then we cannot tell the difference
between rows updated at 2^n - 1 and 2^n + x. When the highest replicated epoch reaches
2^n - 1, we will assume that everything committed in the current GCI_hi value has been
replicated, and may miss some conflicts.
To avoid this scenario, we treat the max value of GCI_lo (2^n - 1) as if it were 2^32 - 1, i.e.
0xffffffff. So for a 3-bit GCI_lo, the values can be :
  Actual GCI_lo : 0  1  2  3  4  5  6  7           8           ...
  Stored (3 bit): 0  1  2  3  4  5  6  7           7           ...
  Read back     : 0  1  2  3  4  5  6  0xffffffff  0xffffffff  ...
Note that the value 7 is not available - it is used to convey 'saturation'.
Where a row is updated with GCI_lo at 7 or above, when GCI_64 is read back, the bottom word
will be 0xffffffff. This indicates a 'later' epoch than the actual epoch, but has the
safety property that we need. When ndb_apply_status reflection causes the Max Replicated
Epoch to reach GCI_lo of 7 (ndb_apply_status is not affected by this bit limitation), we
will not consider these rows to be 'replicated'. It is only when we see that the next GCI_hi
value has been replicated, that these rows are considered to be replicated.
In this way we avoid the risk of missing a conflict, but risk false conflicts.
Another risk here is that if the system quiesces after this saturation, then
we may not get the replication traffic necessary to determine that the next GCI_hi
value has been replicated, and so the window of conflict stays open until some further
replication traffic occurs.
For these reasons, the number of bits used should not be pared too low, as it will result
in more conflicts.
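The saturation scheme can be sketched as an encode/decode pair over n stored bits (illustrative helpers, not the server implementation):

```cpp
#include <cassert>
#include <cstdint>

// Store GCI_lo in n bits: values at or above the max stored value saturate.
inline uint32_t store_gci_lo(uint32_t gci_lo, unsigned n_bits) {
  uint32_t max_stored = (1u << n_bits) - 1;  // e.g. 7 for n_bits == 3
  return gci_lo >= max_stored ? max_stored : gci_lo;
}

// Read back: the saturated value is reported as 0xffffffff, a 'later' epoch
// than the real one, which keeps the conflict window safely open (risking
// false conflicts rather than missed ones).
inline uint32_t load_gci_lo(uint32_t stored, unsigned n_bits) {
  uint32_t max_stored = (1u << n_bits) - 1;
  return stored == max_stored ? 0xffffffffu : stored;
}
```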
Existing conflict detection algorithms assume that conflicts are
symmetric. For example, with NDB$MAX() if Cluster 1 updates a row
with timestamp 6 and Cluster 2 updates it with timestamp 7, then,
when applying the replicated updates :
At Cluster 1
Received timestamp (7) > Existing timestamp (6), APPLY
At Cluster 2
Received timestamp (6) < Existing timestamp (7), DISCARD
So Cluster 2 wins.
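The NDB$MAX-style decision at each cluster reduces to a single comparison; because '>' picks the same winner wherever it is evaluated, both clusters converge (a sketch):

```cpp
#include <cassert>
#include <cstdint>

// NDB$MAX-style decision: apply a replicated update only if its timestamp
// exceeds the existing row's timestamp. The same comparison at both clusters
// yields the same winner, so the conflict resolves symmetrically.
inline bool apply_replicated(uint64_t received_ts, uint64_t existing_ts) {
  return received_ts > existing_ts;
}
```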
NDB$EPOCH does not have the same guaranteed symmetry. Two Clusters
replicating circularly may update the same row(s) 'semi-simultaneously',
one may detect the updates as in-conflict, while the other may not. One
reason for this is imprecision in the conflict detection, due to the
limited ordering in the Cluster Binlog, as described below. The effect of
this is that to get a deterministic result, we need to have only one cluster
determining when conflicts have occurred, and taking action. As only one
Cluster is detecting conflicts, other clusters are accepting changes in
the order they arrive, and may contain divergent data. The conflict-detecting
Cluster (The Primary Cluster) must therefore take steps to realign other
Clusters in the case of conflicts.
Realigning the Secondary Cluster
Realignment is achieved by logging events in the Primary Cluster Binlog
which will force the Secondary into alignment when applied by the Secondary's
Slave. The NdbApi refreshTuple() mechanism is used to generate these events
in the Binlog. refreshTuple() has the following semantics :
- If row exists
Produce Insert NdbApi event with all column contents as after image
- If row does not exist
Lock exclusively (by pretending it does exist!)
Produce Delete NdbApi event with primary key only
This results in WRITE_ROW and DELETE_ROW events for conflicting rows appearing
in the Binlog. These events are Binlogged in the Primary epoch transaction where
the results of applying the Secondary epoch transaction (if any) are Binlogged.
As the realignment is a 'force' operation, it is important that the WRITE_ROW and
DELETE_ROW events are applied at the Secondary as 'idempotent' operations.
The WRITE_ROW should act as an Insert if the row does not exist, and as an update
otherwise. The DELETE_ROW should ignore the case where the row does not exist.
This behaviour is the default for replication, except where a conflict algorithm
is defined for a table, so it is important that the Secondary have no conflict
algorithm defined for tables which can be realigned by this mechanism. (This
could be circumvented if necessary).
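Assuming a simple key/value picture of a table, the idempotent apply semantics can be sketched as follows (this is an illustration of the required behaviour, not the slave code):

```cpp
#include <cassert>
#include <map>
#include <string>

using Table = std::map<int, std::string>;  // pk -> row after image (illustrative)

// WRITE_ROW applied idempotently: insert if absent, overwrite if present.
inline void apply_write_row(Table& t, int pk, const std::string& after_image) {
  t[pk] = after_image;
}

// DELETE_ROW applied idempotently: a missing row is silently ignored.
inline void apply_delete_row(Table& t, int pk) {
  t.erase(pk);  // erase() on a missing key is a no-op
}
```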
Within the Primary Cluster Slave, the algorithm is as follows :
define conflict op with interpreted program
define refresh_row operation
Conflict handling occurs at the end of each Slave batch - detected conflicts
result in refresh row operations being executed, before continuing with
the next batch of operations. Commit occurs once all conflict handling has
completed.
Handling multiple conflicts
It is important that once a conflict is detected between a Primary Cluster update
and a Secondary Cluster update, any further Secondary Cluster updates to the same
row are also regarded as in-conflict, whether they occur within the same Secondary
Cluster epoch transaction as the initial conflict, or in some later transaction.
Within the application of a Secondary Cluster epoch transaction in the Primary Cluster
Slave, there can be multiple batches of execution, each with its own post-execute
conflict handling stage.
Where a Secondary epoch transaction has multiple conflicting updates to a row, they
can be applied in the same or separate batches.
update row 1 set a='First';
update row 1 set b='Second';
#execute(NoCommit) + handle conflicts
update row 2 set a='Third';
#execute(NoCommit) + handle conflicts
In this case, the first and second conflicting operations will detect the conflict
based on the committed epoch of the row, and the author column. They will each
issue a refreshTuple operation at the end of the batch.
The first refreshTuple operation will result in Binlog event generation, and the
row's epoch being updated to the current epoch at commit time.
The second refreshTuple operation will fail, as it is not allowed for a refreshTuple
to be followed by any other operation on a row in a transaction.
This 'operation after refresh' error is used to indicate a conflict - the row is
already being refreshed, so no need to take any further steps.
When the third operation is executed, it will also hit the 'operation after refresh'
scenario, although it will be an update operation rather than a refresh operation.
Using the 'no operation after refresh' mechanism in this way avoids the need for the
slave to 'remember' which rows in the current transaction have suffered conflicts.
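The 'no operation after refresh' rule means the slave need not track conflicted rows itself; the transaction's own per-row constraint carries that state. A sketch of the idea (`TxnRowState` is illustrative):

```cpp
#include <cassert>
#include <set>

// Sketch: within one transaction, a row that has been refreshed rejects any
// further operation with an 'operation after refresh' error. Hitting that
// error is itself the signal that the row is already being realigned.
struct TxnRowState {
  std::set<int> refreshed;  // pks refreshed in this transaction (illustrative)

  bool refresh(int pk) {          // false -> 'operation after refresh' error
    return refreshed.insert(pk).second;
  }
  bool update(int pk) const {     // any later op on a refreshed row fails
    return refreshed.count(pk) == 0;
  }
};
```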
Delete conflict issues
Delete conflicts are problematic as by definition, a row delete removes the information
necessary to detect further conflicts (epoch, author etc).
Normally, a DELETE from the Secondary which finds the row missing would be considered
a conflict, and result in a refresh. However, this is not effective, as the fact that
the row has been refreshed, and should therefore record conflicts for any further
Secondary modifications until the refresh has replicated, cannot be recorded in the
non-existent row.
Additionally, if a refresh is issued for a deleted row, then there are different
behaviours within the transaction depending on batching. If we receive a DELETE followed
by an INSERT from the Secondary, and the row does not exist on the Primary, then the
DELETE indicates a conflict, but the behaviour varies depending on whether the INSERT is
executed in the same batch as the DELETE. If it's in the same batch then it will succeed.
If it's in a later batch then it will hit the no-operation-after-refresh check. If it's
in a following transaction, then it will succeed, as the delete conflict was not recorded.
To avoid this strange inconsistency, DELETE-DELETE conflicts do not result in a refreshTuple
operation being issued, and we state that we cannot currently avoid divergence where
DELETE-DELETE conflicts are possible.
Between epoch transactions
Where the same conflicting row is updated by the Secondary Slave in separate Secondary
Slave epoch transactions, the normal epoch/author conflict detection mechanism is used
to detect the conflict. The commit of the refreshTuple operation from the first
transaction updates the row's epoch to the current system epoch, so that any further
updates from the Secondary, occurring before the refreshTuple has been replicated, are
detected as in-conflict.
In this way, detecting a conflict, and refreshing the row, extends the window of potential
conflicts, until the refresh operation has also been 'reflected' from the Secondary.
Conflict Detection Imprecision
The Ndb Binlog is recorded as a result of events received over the
NdbApi event Api. The event Api supports limited ordering within
an epoch, only guaranteeing that events for multiple transactions
affecting the same row will be received in order.
Events for different rows can be received in different orders on
different Api nodes.
A consequence of this is that when a Slave MySQLD applies an epoch
transaction, the generated events for its ndb_apply_status update,
and other row updates may arrive in any order (within the epoch) relative to one another.
The ndb_apply_status event is used to flag that an epoch has been
applied by including it in the Slave Cluster's Binlog (as part of
an epoch) which is 'reflected' back to the 'Master'. However, due
to the vague Binlog ordering, we can only know that our epoch can be
considered 'applied' once we have fully applied the Slave's 'reflected'
epoch transaction, as some row update from our epoch may be the last
event recorded in the Slave's epoch transaction.
This problem is handled by the mechanism for updating Ndb_slave_max_rep_epoch
in the Slave, where it is only updated to the highest seen value after committing
a slave transaction (containing a slave epoch). This ensures that when
Ndb_slave_max_rep_epoch advances, all subsequently applied events occurred
after the epoch in question was applied.
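This deferred update can be sketched as tracking a pending maximum that is only published at commit time (names are illustrative):

```cpp
#include <cassert>
#include <algorithm>
#include <cstdint>

// Sketch of Ndb_slave_max_rep_epoch maintenance: epochs observed while
// applying a slave epoch transaction only become visible to conflict
// detection once that transaction commits.
struct MaxRepEpoch {
  uint64_t committed = 0;  // value visible to conflict detection
  uint64_t pending   = 0;  // highest value seen in the open transaction

  void observe(uint64_t epoch) { pending = std::max(pending, epoch); }
  void commit()                { committed = std::max(committed, pending); }
  void rollback()              { pending = committed; }
};
```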
The main downside of this imprecision is that it can cause false conflicts
to be detected. Where a local epoch was applied at the slave before a slave-local
update to the same row, if both are binlogged in the same slave epoch then
the slave-local update will be detected as in-conflict, even though it occurred
after the applied update.
Primary             Primary Binlog   Secondary                Secondary Binlog
'Epoch 44 start'
Set T1 row 1 to A
'Epoch 44 end'
                    WR T1 (1, A)
                                     'Epoch 222 start'
                                     Set T1 row 1 to B
                                     BEGIN  # Slave
                                     Set apply_status row
                                     Write (1,A) into T1
                                     Set T1 row 1 to C
                                     'Epoch 222 end'
                                                              WR T1 (1,B)
                                                              WR T1 (1,A)
                                                              WR T1 (1,C)
In this example, looking at the order of events on the Secondary, we can say
that in theory, the write of B to T1 row 1 occurred before the application
of the replicated epoch 44, and the write of C to T1 row 1 occurred afterwards.
However, the Binlog contents only guarantee order between updates to the same
row, so the relative order of the T1 and apply_status updates is not guaranteed.
In this case, the apply_status update is logged *after* the corresponding T1 update.
Due to this uncertainty, we can only be *sure* that Primary epoch 44 has been fully
applied, at the end of the Secondary epoch (222). This 'rounding up' or imprecision,
causes us to consider the second write by the Secondary to T1 row 1 as also being
in conflict, as we cannot tell whether it occurred before or after the application
of the primary epoch.
Note that other orders are possible within the Secondary Binlog. The limitations are :
1) The locally generated apply status write row event will be first.
2) Updates to the same row (Table, PK pair) will be in-order.
Possible Binlog order combinations :
('In order')                            (As above)
WR AS S 222 WR AS S 222 WR AS S 222 WR AS S 222
WR AS P 44 WR T1 1 B WR T1 1 B WR T1 1 B
WR T1 1 B WR AS P 44 WR T1 1 A WR T1 1 A
WR T1 1 A WR T1 1 A WR AS P 44 WR T1 1 C
WR T1 1 C WR T1 1 C WR T1 1 C WR AS P 44
As the number of writes, rows and epochs involved goes up, the possible combinations
increase.
When applying the above Binlog transaction in the Primary, the Primary Slave will note
the WR AS P 44 row as a new 'Max replicated epoch' for the Primary as it applies it.
However, as it cannot be sure of the Binlog ordering, it is only safe to use this
as the Max replicated epoch once the received Binlog transaction has been committed.
The NDB$EPOCH function does not support tables with Blob columns.
Two reasons currently :
1) Not clear how well adding an interpreted program to a Blob update/delete operation works in practice
(This affects the other conflict algorithms as well)
2) No implementation of refreshTuple() for Blobs yet
Normal refreshTuple will refresh the main table row only, not the part table(s) rows.
=== modified file 'sql/ha_ndbcluster.cc'
--- a/sql/ha_ndbcluster.cc 2011-07-07 14:48:06 +0000
+++ b/sql/ha_ndbcluster.cc 2011-07-08 12:28:37 +0000
@@ -2264,7 +2264,8 @@ int ha_ndbcluster::get_metadata(THD *thd
Ndb_table_guard ndbtab_g(dict, m_tabname);
if (!(tab= ndbtab_g.get_table()))
@@ -10214,26 +10215,47 @@ int ha_ndbcluster::open(const char *name
- // Init table lock structure
- /* ndb_share reference handler */
- if (!(m_share=get_share(name, table)))
+ if ((res= check_ndb_connection(thd)) != 0)
+ // Init table lock structure
+ /* ndb_share reference handler */
+ if ((m_share=get_share(name, table, FALSE)) == 0)
+ * No share present...we must create one
+ if (opt_ndb_extra_logging > 19)
+ sql_print_information("Calling ndbcluster_create_binlog_setup(%s) in ::open",
+ Ndb* ndb= check_ndb_in_thd(thd);
+ ndbcluster_create_binlog_setup(thd, ndb, name, strlen(name),
+ m_dbname, m_tabname, FALSE);
+ if ((m_share=get_share(name, table, FALSE)) == 0)
+ local_close(thd, FALSE);
DBUG_PRINT("NDB_SHARE", ("%s handler use_count: %u",
- if ((res= check_ndb_connection(thd)) ||
- (res= get_metadata(thd, name)))
+ if ((res= get_metadata(thd, name)))
if ((res= update_stats(thd, 1, true)) ||
=== modified file 'sql/ha_ndbcluster_binlog.cc'
--- a/sql/ha_ndbcluster_binlog.cc 2011-07-07 14:48:06 +0000
+++ b/sql/ha_ndbcluster_binlog.cc 2011-07-08 12:28:37 +0000
@@ -5240,10 +5240,15 @@ int ndbcluster_create_binlog_setup(THD *
"FAILED CREATE (DISCOVER) EVENT OPERATIONS Event: %s",
/* a warning has been issued to the client */
+ if (share)
No bundle (reason: useless for push emails).
bzr push into mysql-5.1-telco-7.0 branch (jonas.oreland:4445 to 4446) - jonas oreland - 10 Jul