Author: jstephens
Date: 2007-06-29 14:36:26 +0200 (Fri, 29 Jun 2007)
New Revision: 6954
Log:
Deploy Start Phases doc in API Guide.
Modified:
trunk/ndbapi/ndb-internals-start-phases.xml
trunk/ndbapi/ndb-internals.xml
Modified: trunk/ndbapi/ndb-internals-start-phases.xml
===================================================================
--- trunk/ndbapi/ndb-internals-start-phases.xml 2007-06-29 12:34:35 UTC (rev 6953)
+++ trunk/ndbapi/ndb-internals-start-phases.xml 2007-06-29 12:36:26 UTC (rev 6954)
Changed blocks: 1, Lines Added: 1, Lines Deleted: 1; 514 bytes
@@ -2,7 +2,7 @@
<!DOCTYPE section SYSTEM "http://www.docbook.org/xml/4.3/docbookx.dtd">
<section id="ndb-internals-start-phases">
- <title>Cluster Start Phases</title>
+ <title>MySQL Cluster Start Phases</title>
<para></para>
Modified: trunk/ndbapi/ndb-internals.xml
===================================================================
--- trunk/ndbapi/ndb-internals.xml 2007-06-29 12:34:35 UTC (rev 6953)
+++ trunk/ndbapi/ndb-internals.xml 2007-06-29 12:36:26 UTC (rev 6954)
Changed blocks: 1, Lines Added: 1406, Lines Deleted: 0; 53443 bytes
@@ -11740,6 +11740,1412 @@
</section>
+ <section id="ndb-internals-start-phases">
+
+ <title>MySQL Cluster Start Phases</title>
+
+ <para></para>
+
+ <section id="ndb-internals-start-phases-read-config">
+
+ <title>Read Configuration Phase (Phase -1)</title>
+
+ <para>
+ Before the data node actually starts, a number of other setup
+ and initialization tasks must be done for the block objects,
+ transporters, and watchdog checks, among others.
+ </para>
+
+ <para>
+ This initialization process begins in
+ <filename>storage/ndb/src/kernel/main.cpp</filename> with a
+ series of calls to
+ <literal>globalEmulatorData.theThreadConfig->doStart()</literal>.
+ When starting <command>ndbd</command> with the
+ <option>-n</option> or <option>--nostart</option> option there
+ is only one call to this method; otherwise, there are two, with
+ the second call actually starting the data node. The first
+ invocation of <literal>doStart()</literal> sends the
+ <literal>START_ORD</literal> signal to the
+ <literal>CMVMI</literal> block (see
+ <xref
+ linkend="ndb-internals-kernel-blocks-cmvmi"/>); the
+ second call to this method sends a <literal>START_ORD</literal>
+ signal to <literal>NDBCNTR</literal> (see
+ <xref linkend="ndb-internals-kernel-blocks-ndbcntr"/>).
+ </para>
+
+ <para>
+ When <literal>START_ORD</literal> is received by the
+ <literal>NDBCNTR</literal> block, the signal is immediately
+ transferred to <literal>NDBCNTR</literal>'s
+ <literal>MISSRA</literal> sub-block, which handles the start
+ process by sending a <literal>READ_CONFIG_REQ</literal> signal
+ to all blocks, in the order given in the array
+ <literal>readConfigOrder</literal>:
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ <literal>NDBFS</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBTUP</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBACC</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBTC</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBLQH</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBTUX</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBDICT</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBDIH</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>NDBCNTR</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>QMGR</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>TRIX</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>BACKUP</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBUTIL</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>SUMA</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>TSMAN</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>LGMAN</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>PGMAN</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>RESTORE</literal>
+ </para>
+ </listitem>
+
+ </orderedlist>
+
+ <literal>NDBFS</literal> is allowed to run before any of the
+ remaining blocks are contacted, in order to make sure that it
+ can start the <literal>CMVMI</literal> block's threads.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-config-read">
+
+ <title>Configuration Read Phase (<literal>STTOR</literal> Phase -1)</title>
+
+ <para>
+ The <literal>READ_CONFIG_REQ</literal> signal provides all
+ kernel blocks an opportunity to read the configuration data,
+ which is stored in a global object accessible to all blocks. All
+ memory allocation in the data nodes takes place during this
+ phase.
+ </para>
+
+ <note>
+ <para>
+ Connections between the kernel blocks and the
+ <literal>NDB</literal> filesystem are also set up during Phase
+ 0. This is necessary to enable the blocks to communicate
+ easily which parts of a table structure are to be written to
+ disk.
+ </para>
+ </note>
+
+ <para>
+ <literal>NDB</literal> performs memory allocations in two
+ different ways. The first of these is by using the
+ <literal>allocRecord()</literal> method (defined in
+ <filename>storage/ndb/src/kernel/vm/SimulatedBlock.hpp</filename>).
+ This is the traditional method whereby records are accessed
+ using the <literal>ptrCheckGuard</literal> macros (defined in
+ <filename>storage/ndb/src/kernel/vm/pc.hpp</filename>). The
+ other method is to allocate memory using the
+ <literal>setSize()</literal> method defined with the help of the
+ template found in
+ <filename>storage/ndb/src/kernel/vm/CArray.hpp</filename>.
+ </para>
+
+ <para>
+ These methods sometimes also initialize the memory, ensuring
+ that both memory allocation and initialization are done with
+ watchdog protection.
+ </para>
+
+ <para>
+ Many blocks also perform block-specific initialization, which
+ often entails building linked lists or doubly-linked lists (and
+ in some cases hash tables).
+ </para>
+
+ <para>
+ Many of the sizes used in allocation are calculated in the
+ <literal>Configuration::calcSizeAlt()</literal> method, found in
+ <filename>storage/ndb/src/kernel/vm/Configuration.cpp</filename>.
+ </para>
+
+ <para>
+ Some preparations for more intelligent pooling of memory
+ resources have been made. <literal>DataMemory</literal> and disk
+ records already belong to this global memory pool.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-0">
+
+ <title><literal>STTOR</literal> Phase 0</title>
+
+ <para>
+ Most <literal>NDB</literal> kernel blocks begin their start
+ phases at <literal>STTOR</literal> Phase 1, with the exception
+ of <literal>NDBFS</literal> and <literal>NDBCNTR</literal>,
+ which begin with Phase 0, as can be seen by inspecting the first
+ value for each element in the <literal>ALL_BLOCKS</literal>
+ array (defined in
+ <filename>storage/ndb/src/kernel/blocks/ndbcntr/NdbcntrMain.cpp</filename>).
+ In addition, when the <literal>STTOR</literal> signal is sent to
+ a block, the return signal <literal>STTORRY</literal> always
+ contains a list of the start phases in which the block has an
+ interest. Only in those start phases does the block actually
+ receive a <literal>STTOR</literal> signal.
+ </para>
+
+ <para>
+ <literal>STTOR</literal> signals are sent out in the order in
+ which the kernel blocks are listed in the
+ <literal>ALL_BLOCKS</literal> array. While
+ <literal>NDBCNTR</literal> goes through start phases 0 to 255,
+ most of these are empty.
+ </para>
+
+ <para>
+ Both activities in Phase 0 have to do with initialization of the
+ <literal>NDB</literal> filesystem. First, if necessary,
+ <literal>NDBFS</literal> creates the filesystem directory for
+ the data node. In the case of an initial start,
+ <literal>NDBCNTR</literal> clears any existing files from the
+ directory of the data node to ensure that the
+ <literal>DBDIH</literal> block does not subsequently discover
+ any system files (if <literal>DBDIH</literal> were to find any
+ system files, it would not interpret the start correctly as an
+ initial start). (See also
+ <xref linkend="ndb-internals-kernel-blocks-dbdih"/>.)
+ </para>
+
+ <para>
+ Each time that <literal>NDBCNTR</literal> completes the sending
+ of one start phase to all kernel blocks, it sends a
+ <literal>NODE_STATE_REP</literal> signal to all blocks, which
+ effectively updates the <literal>NodeState</literal> in all
+ blocks.
+ </para>
+
+ <para>
+ Each time that <literal>NDBCNTR</literal> completes a non-empty
+ start phase, it reports this to the management server; in most
+ cases this is recorded in the cluster log.
+ </para>
+
+ <para>
+ Finally, after completing all start phases,
+ <literal>NDBCNTR</literal> updates the node state in all blocks
+ via a <literal>NODE_STATE_REP</literal> signal; it also sends an
+ event report advising that all start phases are complete. In
+ addition, all other cluster data nodes are notified that this
+ node has completed all its start phases to ensure all nodes are
+ aware of one another's state. Each data node sends a
+ <literal>NODE_START_REP</literal> to all blocks; however, this
+ is significant only for <literal>DBDIH</literal>, so that it
+ knows when it can unlock the lock for schema changes on
+ <literal>DBDICT</literal>.
+ </para>
+
+ <note>
+ <para>
+ In the following table, and throughout this text, we sometimes
+ refer to <literal>STTOR</literal> start phases simply as
+ <quote>start phases</quote> or <quote>Phase
+ <replaceable>N</replaceable></quote> (where
+ <replaceable>N</replaceable> is some number).
+ <literal>NDB_STTOR</literal> start phases are always qualified
+ as such, and so referred to as
+ <quote><literal>NDB_STTOR</literal> start phases</quote> or
+ <quote><literal>NDB_STTOR</literal> phases</quote>.
+ </para>
+ </note>
+
+ <informaltable>
+ <tgroup cols="2">
+ <colspec colwidth="20*"/>
+ <colspec colwidth="80*"/>
+ <thead>
+ <row>
+ <entry>Kernel Block</entry>
+ <entry>Receptive Start Phases</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>NDBFS</literal></entry>
+ <entry>0</entry>
+ </row>
+ <row>
+ <entry><literal>DBTC</literal></entry>
+ <entry>1</entry>
+ </row>
+ <row>
+ <entry><literal>DBDIH</literal></entry>
+ <entry>1</entry>
+ </row>
+ <row>
+ <entry><literal>DBLQH</literal></entry>
+ <entry>1, 4</entry>
+ </row>
+ <row>
+ <entry><literal>DBACC</literal></entry>
+ <entry>1</entry>
+ </row>
+ <row>
+ <entry><literal>DBTUP</literal></entry>
+ <entry>1</entry>
+ </row>
+ <row>
+ <entry><literal>DBDICT</literal></entry>
+ <entry>1, 3</entry>
+ </row>
+ <row>
+ <entry><literal>NDBCNTR</literal></entry>
+ <entry>0, 1, 2, 3, 4, 5, 6, 8, 9</entry>
+ </row>
+ <row>
+ <entry><literal>CMVMI</literal></entry>
+ <entry>1 (prior to <literal>QMGR</literal>), 3, 8</entry>
+ </row>
+ <row>
+ <entry><literal>QMGR</literal></entry>
+ <entry>1, 7</entry>
+ </row>
+ <row>
+ <entry><literal>TRIX</literal></entry>
+ <entry>1</entry>
+ </row>
+ <row>
+ <entry><literal>BACKUP</literal></entry>
+ <entry>1, 3, 7</entry>
+ </row>
+ <row>
+ <entry><literal>DBUTIL</literal></entry>
+ <entry>1, 6</entry>
+ </row>
+ <row>
+ <entry><literal>SUMA</literal></entry>
+ <entry>1, 3, 5, 7, 100 (empty), 101</entry>
+ </row>
+ <row>
+ <entry><literal>DBTUX</literal></entry>
+ <entry>1, 3, 7</entry>
+ </row>
+ <row>
+ <entry><literal>TSMAN</literal></entry>
+ <entry>1, 3 (both ignored)</entry>
+ </row>
+ <row>
+ <entry><literal>LGMAN</literal></entry>
+ <entry>1, 2, 3, 4, 5, 6 (all ignored)</entry>
+ </row>
+ <row>
+ <entry><literal>PGMAN</literal></entry>
+ <entry>1, 3, 7 (Phase 7 currently empty)</entry>
+ </row>
+ <row>
+ <entry><literal>RESTORE</literal></entry>
+ <entry>1, 3 (only in Phase 1 is any real work done)</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+
+ <note>
+ <para>
+ This table was current at the time this text was written, but
+ is likely to change over time. The latest information can be
+ found in the source code.
+ </para>
+ </note>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-1">
+
+ <title><literal>STTOR</literal> Phase 1</title>
+
+ <para>
+ This is one of the phases in which most kernel blocks
+ participate (see the table in
+ <xref linkend="ndb-internals-start-phases-sttor-0"/>).
+ Most of these blocks are involved primarily in the
+ initialization of data; for example, this is all that
+ <literal>DBTC</literal> does.
+ </para>
+
+ <para>
+ Many blocks initialize references to other blocks in Phase 1.
+ <literal>DBLQH</literal> initializes block references to
+ <literal>DBTUP</literal>, and <literal>DBACC</literal>
+ initializes block references to <literal>DBTUP</literal> and
+ <literal>DBLQH</literal>. <literal>DBTUP</literal> initializes
+ references to the blocks <literal>DBLQH</literal>,
+ <literal>TSMAN</literal>, and <literal>LGMAN</literal>.
+ </para>
+
+ <para>
+ <literal>NDBCNTR</literal> initializes some variables and sets
+ up block references to <literal>DBTUP</literal>,
+ <literal>DBLQH</literal>, <literal>DBACC</literal>,
+ <literal>DBTC</literal>, <literal>DBDIH</literal>, and
+ <literal>DBDICT</literal>; these are needed in the special start
+ phase handling of these blocks using
+ <literal>NDB_STTOR</literal> signals, where the bulk of the node
+ startup process actually takes place.
+ </para>
+
+ <para>
+ If the cluster is configured to lock pages (that is, if the
+ <literal>LockPagesInMainMemory</literal> configuration parameter
+ has been set), <literal>CMVMI</literal> handles this locking.
+ </para>
+
+ <para>
+ The <literal>QMGR</literal> block calls the
+ <literal>initData()</literal> method (defined in
+ <filename>storage/ndb/src/kernel/blocks/qmgr/QmgrMain.cpp</filename>)
+ whose output is handled by all other blocks in the
+ <literal>READ_CONFIG_REQ</literal> phase (see
+ <xref linkend="ndb-internals-start-phases-read-config"/>).
+ Following these initializations, <literal>QMGR</literal> sends
+ the <literal>DIH_RESTARTREQ</literal> signal to
+ <literal>DBDIH</literal>, which determines whether a proper
+ system file exists; if it does, an initial start is not being
+ performed. After the reception of this signal comes the process
+ of integrating the node among the other data nodes in the
+ cluster, where data nodes enter the cluster one at a time. The
+ first one to enter becomes the master; whenever the master dies,
+ the new master is always the node that has been running the
+ longest among those remaining.
+ </para>
+
+ <para>
+ <literal>QMGR</literal> sets up timers to ensure that inclusion
+ in the cluster does not take longer than what the cluster's
+ configuration is set to allow (see
+ <link linkend="mysql-cluster-timeouts-intervals-disk-paging">Controlling
+ Timeouts, Intervals, and Disk Paging</link> for the relevant
+ configuration parameters), after which communication to all
+ other data nodes is established. At this point, a
+ <literal>CM_REGREQ</literal> signal is sent to all data nodes.
+ Only the president of the cluster responds to this signal; the
+ president allows one node at a time to enter the cluster. If no
+ node responds within 3 seconds then the president becomes the
+ master. If several nodes start up simultaneously, then the node
+ with the lowest node ID becomes president. The president sends
+ <literal>CM_REGCONF</literal> in response to this signal, but
+ also sends a <literal>CM_ADD</literal> signal to all nodes that
+ are currently alive.
+ </para>
+
+ <para>
+ Next, the starting node sends a
+ <literal>CM_NODEINFOREQ</literal> signal to all current
+ <quote>live</quote> data nodes. When these nodes receive that
+ signal they send a <literal>NODE_VERSION_REP</literal> signal to
+ all API nodes that have connected to them. Each data node also
+ sends a <literal>CM_ACKADD</literal> to the president to inform
+ the president that it has heard the
+ <literal>CM_NODEINFOREQ</literal> signal from the new node.
+ Finally, each of the current data nodes sends the
+ <literal>CM_NODEINFOCONF</literal> signal in response to the
+ starting node. When the starting node has received all these
+ signals, it also sends the <literal>CM_ACKADD</literal> signal
+ to the president.
+ </para>
+
+ <para>
+ When the president has received all of the expected
+ <literal>CM_ACKADD</literal> signals, it knows that all data
+ nodes (including the newest one to start) have replied to the
+ <literal>CM_NODEINFOREQ</literal> signal. When the president
+ receives the final <literal>CM_ACKADD</literal>, it sends a
+ <literal>CM_ADD</literal> signal to all current data nodes (that
+ is, except for the node that just started). Upon receiving this
+ signal, the existing data nodes enable communication with the
+ new node; they begin sending heartbeats to it and including it in
+ the list of neighbors used by the heartbeat protocol.
+ </para>
+
+ <para>
+ The <literal>start</literal> struct is reset, so that it can
+ handle new starting nodes, and then each data node sends a
+ <literal>CM_ACKADD</literal> to the president, which then sends
+ a <literal>CM_ADD</literal> to the starting node after all such
+ <literal>CM_ACKADD</literal> signals have been received. The new
+ node then opens all of its communication channels to the data
+ nodes that were already connected to the cluster; it also sets
+ up its own heartbeat structures and starts sending heartbeats.
+ It also sends a <literal>CM_ACKADD</literal> message in response
+ to the president.
+ </para>
+
+ <para>
+ As a final step, <literal>QMGR</literal> also starts the timer
+ handling for which it is responsible. This means that it
+ generates a signal to blocks that have requested it. This signal
+ is sent 100 times per second even if any one instance of the
+ signal is delayed.
+ </para>
+
+ <para>
+ The <literal>BACKUP</literal> kernel block also begins sending a
+ signal periodically. This is to ensure that excessive amounts of
+ data are not written to disk, and that data writes are kept
+ within the limits of what has been specified in the cluster
+ configuration file during and after restarts. The
+ <literal>DBUTIL</literal> block initializes the transaction
+ identity, and <literal>DBTUX</literal> creates a reference to
+ the <literal>DBTUP</literal> block, while
+ <literal>PGMAN</literal> initializes pointers to the
+ <literal>LGMAN</literal> and <literal>DBTUP</literal> blocks.
+ The <literal>RESTORE</literal> kernel block creates references
+ to the <literal>DBLQH</literal> and <literal>DBTUP</literal>
+ blocks to enable quick access to those blocks when needed.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-2">
+
+ <title><literal>STTOR</literal> Phase 2</title>
+
+ <para>
+ The only kernel block that participates in this phase to any
+ real effect is <literal>NDBCNTR</literal>.
+ </para>
+
+ <para>
+ In this phase <literal>NDBCNTR</literal> obtains the current
+ state of each configured cluster data node. Messages are sent to
+ <literal>NDBCNTR</literal> from <literal>QMGR</literal>
+ reporting the changes in status of any of the nodes.
+ <literal>NDBCNTR</literal> also sets timers corresponding to the
+ <literal>StartPartialTimeout</literal>,
+ <literal>StartPartitionedTimeout</literal>, and
+ <literal>StartFailureTimeout</literal> configuration parameters.
+ </para>
+
+ <para>
+ The next step is for a <literal>CNTR_START_REQ</literal> signal
+ to be sent to the proposed master node. Normally the president
+ is also chosen as master. However, during a system restart in
+ which the starting node has a newer global checkpoint than the one
+ that has survived on the president, the starting node takes over as
+ master node, even though it is not recognized as the president
+ by <literal>QMGR</literal>. If the starting node is chosen as
+ the new master, then the other nodes are informed of this via a
+ <literal>CNTR_START_REF</literal> signal.
+ </para>
+
+ <para>
+ The master withholds the <literal>CNTR_START_REQ</literal>
+ signal until it is ready to start a new node, or to start the
+ cluster for an initial restart or system restart.
+ </para>
+
+ <para>
+ When the starting node receives
+ <literal>CNTR_START_CONF</literal>, it starts the
+ <literal>NDB_STTOR</literal> phases, in the following order:
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ <literal>DBLQH</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBDICT</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBTUP</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBACC</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBTC</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>DBDIH</literal>
+ </para>
+ </listitem>
+
+ </orderedlist>
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-ndb-sttor-1">
+
+ <title><literal>NDB_STTOR</literal> Phase 1</title>
+
+ <para>
+ <literal>DBDICT</literal>, if necessary, initializes the schema
+ file. <literal>DBDIH</literal>, <literal>DBTC</literal>,
+ <literal>DBTUP</literal>, and <literal>DBLQH</literal>
+ initialize variables. <literal>DBLQH</literal> also initializes
+ the sending of statistics on database operations.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-3">
+
+ <title><literal>STTOR</literal> Phase 3</title>
+
+ <para>
+ <literal>DBDICT</literal> initializes a variable that keeps
+ track of the type of restart being performed.
+ </para>
+
+ <para>
+ <literal>NDBCNTR</literal> executes the second of the
+ <literal>NDB_STTOR</literal> start phases, with no other
+ <literal>NDBCNTR</literal> activity taking place during this
+ <literal>STTOR</literal> phase.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-ndb-sttor-2">
+
+ <title><literal>NDB_STTOR</literal> Phase 2</title>
+
+ <para>
+ The <literal>DBLQH</literal> block enables its exchange of
+ internal records with <literal>DBTUP</literal> and
+ <literal>DBACC</literal>, while <literal>DBTC</literal> allows
+ its internal records to be exchanged with
+ <literal>DBDIH</literal>. The <literal>DBDIH</literal> kernel
+ block creates the mutexes used by the <literal>NDB</literal>
+ kernel and reads nodes using the
+ <literal>READ_NODESREQ</literal> signal. With the data from the
+ response to this signal, <literal>DBDIH</literal> can create
+ node lists, node groups, and so forth. For node restarts and
+ initial node restarts, <literal>DBDIH</literal> also asks the
+ master for permission to perform the restart. The master will
+ ask all <quote>live</quote> nodes if they are prepared to permit
+ the new node to join the cluster. If an initial node restart is
+ to be performed, then all LCPs are invalidated as part of this
+ phase.
+ </para>
+
+ <para>
+ LCPs from nodes that are not part of the cluster at the time of
+ the initial node restart are not invalidated. The reason for
+ this is that there is never any chance for a node to become
+ master of a system restart using any of the LCPs that have been
+ invalidated, since this node must complete a node restart
+ — including a local checkpoint — before it can join
+ the cluster and possibly become a master node.
+ </para>
+
+ <para>
+ The <literal>CMVMI</literal> kernel block activates the sending
+ of packed signals, which occurs only as part of database
+ operations. Packing must be enabled prior to beginning any such
+ operations during the execution of the redo log or node recovery
+ phases.
+ </para>
+
+ <para>
+ The <literal>DBTUX</literal> block sets the type of start
+ currently taking place, while the <literal>BACKUP</literal>
+ block sets the type of restart to be performed, if any (in each
+ case, the block actually sets a variable whose value reflects
+ the type of start or restart). The <literal>SUMA</literal> block
+ remains inactive during this phase.
+ </para>
+
+ <para>
+ The <literal>PGMAN</literal> kernel block starts the generation
+ of two repeated signals. The first handles cleanup and is sent
+ every 200 milliseconds; the other handles statistics and is sent
+ once per second.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-4">
+
+ <title><literal>STTOR</literal> Phase 4</title>
+
+ <para>
+ Only the <literal>DBLQH</literal> and <literal>NDBCNTR</literal>
+ kernel blocks are directly involved in this phase.
+ <literal>DBLQH</literal> allocates a record in the
+ <literal>BACKUP</literal> block, used in the execution of local
+ checkpoints via the <literal>DEFINE_BACKUP_REQ</literal> signal.
+ <literal>NDBCNTR</literal> causes <literal>NDB_STTOR</literal>
+ phase 3 to be executed; there is no other
+ <literal>NDBCNTR</literal> activity during this
+ <literal>STTOR</literal> phase.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-ndb-sttor-3">
+
+ <title><literal>NDB_STTOR</literal> Phase 3</title>
+
+ <para>
+ The <literal>DBLQH</literal> block initiates checking of the log
+ files here. Then it obtains the states of the data nodes using
+ the <literal>READ_NODESREQ</literal> signal. Unless an initial
+ start or an initial node restart is being performed, the
+ checking of log files is handled in parallel with a number of
+ other start phases. For initial starts, the log files must be
+ initialized; this can be a lengthy process and should have some
+ progress status attached to it.
+ </para>
+
+ <note>
+ <para>
+ From this point, there are two parallel paths, one continuing
+ restart and another reading and determining the state of the
+ redo log files.
+ </para>
+ </note>
+
+ <para>
+ The <literal>DBDICT</literal> block requests information about
+ the cluster data nodes via the <literal>READ_NODESREQ</literal>
+ signal. <literal>DBACC</literal> resets the system restart flag
+ if this is not a system restart; this is used only to verify
+ that no requests are received from <literal>DBTUX</literal>
+ during system restart. <literal>DBTC</literal> requests
+ information about all nodes by means of the
+ <literal>READ_NODESREQ</literal> signal.
+ </para>
+
+ <para>
+ <literal>DBDIH</literal> sets an internal master state and makes
+ other preparations exclusive to initial starts. In the case of
+ an initial start, the non-master nodes perform some initial
+ tasks, the master node doing so once all non-master nodes have
+ reported that their tasks are completed. (This delay is actually
+ unnecessary since there is no reason to wait while initializing
+ the master node.)
+ </para>
+
+ <para>
+ For node restarts and initial node restarts no more work is done
+ in this phase. For initial starts the work is done when all
+ nodes have created the initial restart information and
+ initialized the system file.
+ </para>
+
+ <para>
+ For system restarts this is where most of the work is performed,
+ initiated by sending the <literal>NDB_STARTREQ</literal> signal
+ from <literal>NDBCNTR</literal> to <literal>DBDIH</literal> in
+ the master. This signal is sent when all nodes in the system
+ restart have reached this point in the restart. This we can mark
+ as our first synchronization point for system restarts,
+ designated <literal>WAITPOINT_4_1</literal>.
+ </para>
+
+ <para>
+ For a description of the system restart version of Phase 4, see
+ <xref linkend="ndb-internals-start-phases-system-restart-phase-4"/>.
+ </para>
+
+ <para>
+ After completing execution of the
+ <literal>NDB_STARTREQ</literal> signal, the master sends a
+ <literal>CNTR_WAITREP</literal> signal with
+ <literal>WAITPOINT_4_2</literal> to all nodes. This ends
+ <literal>NDB_STTOR</literal> phase 3 as well as
+ (<literal>STTOR</literal>) Phase 4.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-5">
+
+ <title><literal>STTOR</literal> Phase 5</title>
+
+ <para>
+ All that takes place in Phase 5 is the delivery by
+ <literal>NDBCNTR</literal> of <literal>NDB_STTOR</literal> phase
+ 4; the only block that acts on this signal is
+ <literal>DBDIH</literal>, which controls most of the
+ database-related part of a data node start.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-ndb-sttor-4">
+
+ <title><literal>NDB_STTOR</literal> Phase 4</title>
+
+ <para>
+ Some initialization of local checkpoint variables takes place in
+ this phase, and for initial restarts, this is all that happens
+ in this phase.
+ </para>
+
+ <para>
+ For system restarts, all required takeovers are also performed.
+ Currently, this means that all nodes whose states could not be
+ recovered using the redo log are restarted by copying to them
+ all the necessary data from the <quote>live</quote> data nodes.
+
+ <remark role="NOTE">
+ [js] Commented out until Mikael supplies the material on node
+ takeovers.
+ </remark>
+
+<!-- For a
+ description of this process, see
+ <xref linkend="ndb-internals-start-phases-takeovers"/>.
+ </para>
+ <para>-->
+
+ For node restarts and initial node restarts, the master node
+ performs a number of services when asked to do so by means of the
+ <literal>START_MEREQ</literal> signal. This phase is
+ complete when the master responds with a
+ <literal>START_MECONF</literal> message, and is described in
+ <xref linkend="ndb-internals-start-phases-start-mereq-handling"/>.
+ </para>
+
+ <para>
+ After ensuring that the tasks assigned to
+ <literal>DBDIH</literal> in <literal>NDB_STTOR</literal> phase 4
+ are complete, <literal>NDBCNTR</literal> performs some work on its
+ own. For initial starts, it creates the system table that keeps
+ track of unique identifiers such as those used for
+ <literal>AUTO_INCREMENT</literal>. Following the
+ <literal>WAITPOINT_4_1</literal> synchronization point, all system
+ restarts proceed immediately to <literal>NDB_STTOR</literal> phase
+ 5, which is handled by the <literal>DBDIH</literal> block. See
+ <xref linkend="ndb-internals-start-phases-ndb-sttor-5"/> for
+ more information.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-ndb-sttor-5">
+
+ <title><literal>NDB_STTOR</literal> Phase 5</title>
+
+ <para>
+ For initial starts and system restarts this phase means
+ executing a local checkpoint. This is handled by the master so
+ that the other nodes will return immediately from this phase.
+ Node restarts and initial node restarts perform the copying of
+ the records from the primary replica to the starting replicas in
+ this phase. Local checkpoints are enabled before the copying
+ process is begun.
+ </para>
+
+ <para>
+ Copying the data to a starting node is part of the node takeover
+ protocol. As part of this protocol, the node status of the
+ starting node is updated; this is communicated using the global
+ checkpoint protocol. Waiting for these events to take place
+ ensures that the new node status is communicated to all nodes
+ and their system files.
+ </para>
+
+ <para>
+ After the node's status has been communicated, all nodes are
+ signaled that we are about to start the takeover protocol for
+ this node. Part of this protocol consists of Steps 3 - 9 during
+ the system restart phase as described below. This means that
+ restoration of all the fragments, preparation for execution of
+ the redo log, execution of the redo log, and finally reporting
+ back to <literal>DBDIH</literal> when the execution of the redo
+ log is completed, are all part of this process.
+ </para>
+
+ <para>
+ After preparations are complete, the copy phase for each fragment in
+ the node must be performed. The process of copying a fragment
+ involves the following steps:
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ The <literal>DBLQH</literal> kernel block in the starting
+ node is informed that the copy process is about to begin
+ by sending it a <literal>PREPARE_COPY_FRAGREQ</literal>
+ signal.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ When <literal>DBLQH</literal> acknowledges this request a
+ <literal>CREATE_FRAGREQ</literal> signal is sent to all
+ nodes to notify them of the preparation being made to copy
+ data to this replica for this table fragment.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ After all nodes have acknowledged this, a
+ <literal>COPY_FRAGREQ</literal> signal is sent to the node
+ from which the data is to be copied to the new node. This
+ is always the primary replica of the fragment. The node
+ indicated copies all the data over to the starting node in
+ response to this message.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ After copying has been completed and a
+ <literal>COPY_FRAGCONF</literal> message has been sent, all
+ nodes are notified of the completion through an
+ <literal>UPDATE_TOREQ</literal> signal.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ After all nodes have updated to reflect the new state of
+ the fragment, the <literal>DBLQH</literal> kernel block of
+ the starting node is informed of the fact that the copy
+ has been completed, and that the replica is now up-to-date
+ and any failures should now be treated as real failures.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The new replica is transformed into a primary replica if
+ this is the role it had when the table was created.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ After completing this change another round of
+ <literal>CREATE_FRAGREQ</literal> messages is sent to all
+ nodes informing them that the takeover of the fragment is
+ now committed.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ After this, the process is repeated with the next fragment
+ if another one exists.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ When there are no more fragments for takeover by the node,
+ all nodes are informed of this by means of an
+ <literal>UPDATE_TOREQ</literal> signal sent to all of
+ them.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Wait for the next complete local checkpoint to occur,
+ running from start to finish.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The node states are updated, using a complete global
+ checkpoint. As with the local checkpoint in the previous
+ step, the global checkpoint must be allowed to start and
+ then to finish.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ When the global checkpoint has completed, it will
+ communicate the successful local checkpoint of this node
+ restart by sending an <literal>END_TOREQ</literal> signal
+ to all nodes.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A <literal>START_COPYCONF</literal> is sent back to the
+ starting node informing it that the node restart has been
+ completed.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Receiving the <literal>START_COPYCONF</literal> signal
+ ends <literal>NDB_STTOR</literal> phase 5. This provides
+ another synchronization point for system restarts,
+ designated as <literal>WAITPOINT_5_2</literal>.
+ </para>
+ </listitem>
+
+ </orderedlist>
+ </para>
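+
+ <para>
+ The ordering of the steps just listed can be summarized as a
+ trace. The following is a minimal illustrative sketch in Python,
+ not code taken from the <literal>NDB</literal> kernel. The
+ signal names come from the list above; the function name
+ <literal>takeover_trace</literal>, the two list names, and the
+ parenthesized entries (used for steps whose signals are not
+ named here) are assumptions made for the example.
+ </para>
+
+<programlisting><![CDATA[
```python
from typing import List

# Signal names are taken from the text. Parenthesized entries are descriptive
# placeholders (assumptions) for steps whose signals the text does not name.
PER_FRAGMENT_STEPS = [
    "PREPARE_COPY_FRAGREQ",                # step 1: inform DBLQH on starting node
    "CREATE_FRAGREQ",                      # step 2: announce the copy to all nodes
    "COPY_FRAGREQ",                        # step 3: ask the primary replica to copy
    "COPY_FRAGCONF",                       # step 4: copying has completed
    "UPDATE_TOREQ",                        # step 4: all nodes are notified
    "(inform DBLQH: replica up to date)",  # step 5
    "(restore primary role if needed)",    # step 6
    "CREATE_FRAGREQ",                      # step 7: fragment takeover committed
]

FINAL_STEPS = [
    "UPDATE_TOREQ",                           # step 9: no more fragments
    "(wait for complete local checkpoint)",   # step 10
    "(wait for complete global checkpoint)",  # step 11
    "END_TOREQ",                              # step 12
    "START_COPYCONF",                         # steps 13-14: ends NDB_STTOR phase 5
]


def takeover_trace(num_fragments: int) -> List[str]:
    """Return the ordered steps for taking over num_fragments fragments."""
    if num_fragments < 0:
        raise ValueError("num_fragments must not be negative")
    trace: List[str] = []
    for _ in range(num_fragments):  # step 8: repeat for each fragment
        trace.extend(PER_FRAGMENT_STEPS)
    trace.extend(FINAL_STEPS)
    return trace
```
+]]></programlisting>
+
+ <para>
+ Under these assumptions, <literal>takeover_trace(2)</literal>
+ yields 21 entries: eight for each fragment, followed by the five
+ closing steps. The sketch models only the ordering; it does not
+ model acknowledgements or failure handling.
+ </para>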
+
+ <note>
+ <para>
+ The copy process in this phase can in theory be performed in
+ parallel by several nodes. However, all messages from the
+ master to all nodes are currently sent to a single node at a
+ time, although this could be made completely parallel. This is
+ likely to be done in the not too distant future.
+ </para>
+ </note>
+
+ <para>
+ In an initial start and an initial node restart, the
+ <literal>SUMA</literal> block requests the subscriptions from
+ the <literal>SUMA</literal> master node.
+ <literal>NDBCNTR</literal> executes <literal>NDB_STTOR</literal>
+ phase 6. No other <literal>NDBCNTR</literal> activity takes
+ place.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-ndb-sttor-6">
+
+ <title><literal>NDB_STTOR</literal> Phase 6</title>
+
+ <para>
+ In this <literal>NDB_STTOR</literal> phase, both
+ <literal>DBLQH</literal> and <literal>DBDICT</literal> clear
+ their internal variables representing the current restart type. The
+ <literal>DBACC</literal> block resets the system restart flag;
+ <literal>DBACC</literal> and <literal>DBTUP</literal> start a
+ periodic signal for checking memory usage once per second.
+ <literal>DBTC</literal> sets an internal variable indicating
+ that the system restart has been completed.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-6">
+
+ <title><literal>STTOR</literal> Phase 6</title>
+
+ <para>
+ The <literal>NDBCNTR</literal> block defines the cluster's node
+ groups, and the <literal>DBUTIL</literal> block initializes a
+ number of data structures to facilitate the sending of keyed
+ operations to the system tables.
+ <literal>DBUTIL</literal> also sets up a single connection to
+ the <literal>DBTC</literal> kernel block.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-7">
+
+ <title><literal>STTOR</literal> Phase 7</title>
+
+ <para>
+ In <literal>QMGR</literal> the president starts an arbitrator
+ (unless this feature has been disabled by setting the value of
+ the <literal>ArbitrationRank</literal> configuration parameter
+ to 0 for all nodes; see
+ <xref linkend="mysql-cluster-mgm-definition"/> and
+ <xref linkend="mysql-cluster-api-definition"/> for more
+ information. Note that this currently can be done only when
+ using MySQL Cluster Carrier Grade Edition). In addition,
+ checking of API nodes through heartbeats is activated.
+ </para>
+
+ <para>
+ Also during this phase, the <literal>BACKUP</literal> block sets
+ the disk write speed to the value used following the completion
+ of the restart. The master node during initial start also
+ inserts the record keeping track of which backup ID is to be
+ used next. The <literal>SUMA</literal> and
+ <literal>DBTUX</literal> blocks set variables indicating that
+ start phase 7 has been completed, and that requests to
+ <literal>DBTUX</literal> that occur when running the redo log
+ should no longer be ignored.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-8">
+
+ <title><literal>STTOR</literal> Phase 8</title>
+
+ <para>
+ <literal>NDBCNTR</literal> executes
+ <literal>NDB_STTOR</literal> phase 7; no other
+ <literal>NDBCNTR</literal> activity takes place.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-ndb-sttor-7">
+
+ <title><literal>NDB_STTOR</literal> Phase 7</title>
+
+ <para>
+ If this is a system restart, the master node initiates a rebuild
+ of all indexes from <literal>DBDICT</literal> during this phase.
+ </para>
+
+ <para>
+ The <literal>CMVMI</literal> kernel block opens communication
+ channels to the API nodes (including MySQL servers acting as SQL
+ nodes). The node is then marked as started in
+ <literal>globalData</literal>.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-9">
+
+ <title><literal>STTOR</literal> Phase 9</title>
+
+ <para>
+ <literal>NDBCNTR</literal> resets some start variables.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-sttor-101">
+
+ <title><literal>STTOR</literal> Phase 101</title>
+
+ <para>
+ This is the <literal>SUMA</literal> handover phase.
+ </para>
+
+ </section>
+
+ <section id="ndb-internals-start-phases-system-restart-phase-4">
+
+ <title>System Restart Handling in Phase 4</title>
+
+ <para>
+ This consists of the following steps:
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ The master sets the latest GCI as the restart GCI, and
+ then synchronizes its system file to all other nodes
+ involved in the system restart.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The next step is to synchronize the schema of all the
+ nodes in the system restart. This is performed in 15
+ passes. The problem we are trying to solve here occurs
+ when a schema object has been created while the node was
+ up but was dropped while the node was down, and possibly a
+ new object was even created with the same schema ID while
+ that node was unavailable. In order to handle this
+ situation, it is necessary first to re-create all objects
+ that are supposed to exist from the viewpoint of the
+ starting node. After this, any objects that were dropped
+ by other nodes in the cluster while this node was
+ <quote>dead</quote> are dropped; this also applies to any
+ tables that were dropped during the outage. Finally, any
+ tables that have been created by other nodes while the
+ starting node was unavailable are re-created on the
+ starting node. All these operations are local to the
+ starting node. As part of this process, it is also
+ necessary to ensure that all tables that need to be
+ re-created have been created locally and that the proper
+ data structures have been set up for them in all kernel
+ blocks.
+ </para>
+
+ <para>
+ After the procedure described previously has been performed
+ for the master node, the new schema file is sent to all other
+ participants in the system restart, and they perform the
+ same synchronization.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ All fragments involved in the restart must have proper
+ parameters as derived from <literal>DBDIH</literal>. This
+ causes a number of <literal>START_FRAGREQ</literal>
+ signals to be sent from <literal>DBDIH</literal> to
+ <literal>DBLQH</literal>. This also starts the restoration
+ of the fragments, which are restored one by one and one
+ record at a time: the restore data is read from disk and,
+ in parallel, applied to main memory. This restores only the
+ main memory parts of the tables.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Once all fragments have been restored, a
+ <literal>START_RECREQ</literal> message is sent to all
+ nodes in the starting cluster, and then all undo logs for
+ any Disk Data parts of the tables are applied.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ After applying the undo logs in <literal>LGMAN</literal>,
+ it is necessary to perform some restore work in
+ <literal>TSMAN</literal> that requires scanning the extent
+ headers of the tablespaces.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Next, it is necessary to prepare for execution of the redo
+ log, which can be performed in up to four phases. For
+ each fragment, execution of redo logs from several
+ different nodes may be required. This is handled by
+ executing the redo logs in different phases for a specific
+ fragment, as decided in <literal>DBDIH</literal> when
+ sending the <literal>START_FRAGREQ</literal> signal. An
+ <literal>EXEC_FRAGREQ</literal> signal is sent for each
+ phase and fragment that requires execution in this phase.
+ After these signals are sent, an
+ <literal>EXEC_SRREQ</literal> signal is sent to all nodes
+ to tell them that they can start executing the redo log.
+
+ <note>
+ <para>
+ Before starting execution of the first redo log, it is
+ necessary to make sure that the setup which was
+ started earlier (in Phase 4) by
+ <literal>DBLQH</literal> has finished, or to wait
+ until it does before continuing.
+ </para>
+ </note>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Prior to executing the redo log, it is necessary to
+ calculate where to start reading and where the end of the
+ redo log should be reached. The end of the redo log is
+ found once the last GCI to restore has been reached.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ After completing the execution of the redo logs, all redo
+ log pages that have been written beyond the last GCI to be
+ restored are invalidated. Given the cyclic nature of the
+ redo logs, this could carry the invalidation into new redo
+ log files past the last one executed.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ After the completion of the previous step,
+ <literal>DBLQH</literal> reports this back to
+ <literal>DBDIH</literal> using a
+ <literal>START_RECCONF</literal> message.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ When the master has received this message back from all
+ starting nodes, it sends an
+ <literal>NDB_STARTCONF</literal> signal back to
+ <literal>NDBCNTR</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <literal>NDB_STARTCONF</literal> message signals the
+ end of <literal>STTOR</literal> phase 4 to
+ <literal>NDBCNTR</literal>, which is the only block
+ involved to any significant degree in this phase.
+ </para>
+ </listitem>
+
+ </orderedlist>
+ </para>
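+
+ <para>
+ The schema synchronization described in Step 2 can be
+ illustrated with a small model. This is a simplified sketch in
+ Python under stated assumptions, not the
+ <literal>DBDICT</literal> implementation: it reduces the
+ procedure to its three logical stages and does not model the 15
+ passes mentioned above. The function name
+ <literal>plan_schema_sync</literal> and the use of a
+ per-object version number to detect a reused schema ID are
+ assumptions made for the example.
+ </para>
+
+<programlisting><![CDATA[
```python
from typing import Dict, List, Tuple

# Maps schema ID to an object version. The version number is an assumption
# used here to tell apart two different objects that share a schema ID.
Schema = Dict[int, int]


def plan_schema_sync(
    local: Schema, current: Schema
) -> Tuple[List[int], List[int], List[int]]:
    """Return (recreate, drop, create) lists of schema IDs, each sorted.

    local   -- objects that exist from the starting node's viewpoint
    current -- objects that exist in the rest of the cluster
    """
    # Stage 1: re-create every object the starting node believes should exist.
    recreate = sorted(local)
    # Stage 2: drop objects that were dropped while the node was down. This
    # includes objects whose schema ID has since been reused by a new object.
    drop = sorted(i for i, v in local.items() if current.get(i) != v)
    # Stage 3: create objects that appeared while the node was unavailable.
    create = sorted(i for i, v in current.items() if local.get(i) != v)
    return recreate, drop, create
```
+]]></programlisting>
+
+ <para>
+ In this model, an object whose schema ID was reused while the
+ node was down appears in all three lists: it is first
+ re-created in its old form, then dropped, and finally created
+ again in its new form.
+ </para>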
+
+ </section>
+
+<!--
+ <section id="ndb-internals-start-phases-takeovers">
+ <title>Handling of Node Takeovers</title>
+
+ <para>
+ <remark role="NOTE">
+ [js] Commented out until Mikael supplies the material on node
+ takeovers.
+ </remark>
+
+ </para>
+ </section>
+-->
+
+ <section id="ndb-internals-start-phases-start-mereq-handling">
+
+ <title>START_MEREQ Handling</title>
+
+ <para>
+ The first step in handling <literal>START_MEREQ</literal> is to
+ ensure that no local checkpoint is currently taking place;
+ otherwise, it is necessary to wait until it is completed. The
+ next step is to copy all distribution information from the
+ master <literal>DBDIH</literal> to the starting
+ <literal>DBDIH</literal>. After this, all metadata is
+ synchronized in <literal>DBDICT</literal> (see
+ <xref linkend="ndb-internals-start-phases-system-restart-phase-4"/>).
+ </para>
+
+ <para>
+ After local checkpoints have been blocked, and distribution
+ information and metadata have been synchronized, global
+ checkpoints are blocked.
+ </para>
+
+ <para>
+ The next step is to integrate the starting node in the global
+ checkpoint protocol, local checkpoint protocol, and all other
+ distributed protocols. As part of this the node status is also
+ updated.
+ </para>
+
+ <para>
+ After this step is completed, the global checkpoint protocol is
+ permitted to start again, and the <literal>START_MECONF</literal>
+ signal is sent to indicate to the starting node that the next
+ phase may proceed.
+ </para>
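+
+ <para>
+ The order of operations in this section can be sketched as
+ follows. This is an illustrative Python model only; the function
+ name <literal>handle_start_mereq</literal> and the step
+ descriptions are assumptions made for the example and do not
+ correspond to identifiers in the <literal>NDB</literal> kernel.
+ </para>
+
+<programlisting><![CDATA[
```python
from typing import List


def handle_start_mereq(lcp_running: bool) -> List[str]:
    """Return the ordered steps for handling START_MEREQ.

    lcp_running -- True if a local checkpoint is in progress on arrival
    """
    steps: List[str] = []
    if lcp_running:
        # A running local checkpoint must finish before anything else happens.
        steps.append("wait for running local checkpoint to complete")
    steps.append("block local checkpoints")
    steps.append("copy distribution information from master DBDIH")
    steps.append("synchronize metadata in DBDICT")
    # Global checkpoints are blocked only after the synchronization above.
    steps.append("block global checkpoints")
    steps.append("integrate starting node into distributed protocols")
    steps.append("permit global checkpoints again")
    steps.append("send START_MECONF")
    return steps
```
+]]></programlisting>
+
+ <para>
+ The sketch records only the sequence described here; it does not
+ state when local checkpoints are permitted again, since this
+ section does not say.
+ </para>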
+
+ </section>
+
+ </section>
+
<section id="ndb-internals-glossary">
<title><literal>NDB</literal> Internals Glossary</title>