List:Commits
From:jon Date:June 29 2007 12:36pm
Subject:svn commit - mysqldoc@docsrva: r6954 - trunk/ndbapi
Author: jstephens
Date: 2007-06-29 14:36:26 +0200 (Fri, 29 Jun 2007)
New Revision: 6954

Log:

Deploy Start Phases doc in API Guide.



Modified:
   trunk/ndbapi/ndb-internals-start-phases.xml
   trunk/ndbapi/ndb-internals.xml


Modified: trunk/ndbapi/ndb-internals-start-phases.xml
===================================================================
--- trunk/ndbapi/ndb-internals-start-phases.xml	2007-06-29 12:34:35 UTC (rev 6953)
+++ trunk/ndbapi/ndb-internals-start-phases.xml	2007-06-29 12:36:26 UTC (rev 6954)
Changed blocks: 1, Lines Added: 1, Lines Deleted: 1; 514 bytes

@@ -2,7 +2,7 @@
 <!DOCTYPE section SYSTEM "http://www.docbook.org/xml/4.3/docbookx.dtd">
 <section id="ndb-internals-start-phases">
 
-  <title>Cluster Start Phases</title>
+  <title>MySQL Cluster Start Phases</title>
 
   <para></para>
 


Modified: trunk/ndbapi/ndb-internals.xml
===================================================================
--- trunk/ndbapi/ndb-internals.xml	2007-06-29 12:34:35 UTC (rev 6953)
+++ trunk/ndbapi/ndb-internals.xml	2007-06-29 12:36:26 UTC (rev 6954)
Changed blocks: 1, Lines Added: 1406, Lines Deleted: 0; 53443 bytes

@@ -11740,6 +11740,1412 @@
 
   </section>
 
+  <section id="ndb-internals-start-phases">
+
+    <title>MySQL Cluster Start Phases</title>
+
+    <para></para>
+
+    <section id="ndb-internals-start-phases-read-config">
+
+      <title>Read Configuration Phase (Phase -1)</title>
+
+      <para>
+        Before the data node actually starts, a number of other setup
+        and initialization tasks must be done for the block objects,
+        transporters, and watchdog checks, among others.
+      </para>
+
+      <para>
+        This initialization process begins in
+        <filename>storage/ndb/src/kernel/main.cpp</filename> with a
+        series of calls to
+        <literal>globalEmulatorData.theThreadConfig->doStart()</literal>.
+        When starting <command>ndbd</command> with the
+        <option>-n</option> or <option>--nostart</option> option there
+        is only one call to this method; otherwise, there are two, with
+        the second call actually starting the data node. The first
+        invocation of <literal>doStart()</literal> sends the
+        <literal>START_ORD</literal> signal to the
+        <literal>CMVMI</literal> block (see
+        <xref linkend="ndb-internals-kernel-blocks-cmvmi"/>); the
+        second call to this method sends a <literal>START_ORD</literal>
+        signal to <literal>NDBCNTR</literal> (see
+        <xref linkend="ndb-internals-kernel-blocks-ndbcntr"/>).
+      </para>
+
+      <para>
+        When <literal>START_ORD</literal> is received by the
+        <literal>NDBCNTR</literal> block, the signal is immediately
+        transferred to <literal>NDBCNTR</literal>'s
+        <literal>MISSRA</literal> sub-block, which handles the start
+        process by sending <literal>READ_CONFIG_REQ</literal> signals
+        to all blocks in order as given in the array
+        <literal>readConfigOrder</literal>:
+
+        <orderedlist>
+
+          <listitem>
+            <para>
+              <literal>NDBFS</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBTUP</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBACC</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBTC</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBLQH</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBTUX</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBDICT</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBDIH</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>NDBCNTR</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>QMGR</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>TRIX</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>BACKUP</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBUTIL</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>SUMA</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>TSMAN</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>LGMAN</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>PGMAN</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>RESTORE</literal>
+            </para>
+          </listitem>
+
+        </orderedlist>
+
+        <literal>NDBFS</literal> is allowed to run before any of the
+        remaining blocks are contacted, in order to make sure that it
+        can start the <literal>CMVMI</literal> block's threads.
+      </para>
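+
+      <para>
+        The ordering described above can be modelled with a minimal
+        sketch. It is an illustration only, not kernel code: the
+        <literal>deliver</literal> callback and the function names are
+        hypothetical, and the rule that each block confirms before the
+        next one is contacted is an assumption of the sketch.
+      </para>
+

```python
# Illustrative model only; this is not NDB kernel code.
# Block order copied from the readConfigOrder list in the text.
READ_CONFIG_ORDER = [
    "NDBFS", "DBTUP", "DBACC", "DBTC", "DBLQH", "DBTUX",
    "DBDICT", "DBDIH", "NDBCNTR", "QMGR", "TRIX", "BACKUP",
    "DBUTIL", "SUMA", "TSMAN", "LGMAN", "PGMAN", "RESTORE",
]


def run_read_config_phase(order, deliver):
    """Send READ_CONFIG_REQ to each block in turn.

    `deliver` is a hypothetical callback standing in for signal
    delivery; it returns True once the block has confirmed. The next
    block is contacted only after the previous one has confirmed
    (an assumption of this sketch).
    """
    confirmed = []
    for block in order:
        if not deliver(block, "READ_CONFIG_REQ"):
            raise RuntimeError(f"{block} did not confirm READ_CONFIG_REQ")
        confirmed.append(block)
    return confirmed


def demo():
    """Run the phase with a callback that records and accepts everything."""
    seen = []

    def deliver(block, signal):
        seen.append((block, signal))
        return True

    return run_read_config_phase(READ_CONFIG_ORDER, deliver), seen
```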
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-config-read">
+
+      <title>Configuration Read Phase (<literal>STTOR</literal> Phase -1)</title>
+
+      <para>
+        The <literal>READ_CONFIG_REQ</literal> signal provides all
+        kernel blocks an opportunity to read the configuration data,
+        which is stored in a global object accessible to all blocks. All
+        memory allocation in the data nodes takes place during this
+        phase.
+      </para>
+
+      <note>
+        <para>
+          Connections between the kernel blocks and the
+          <literal>NDB</literal> filesystem are also set up during Phase
+          0. This is necessary to enable the blocks to communicate
+          easily which parts of a table structure are to be written to
+          disk.
+        </para>
+      </note>
+
+      <para>
+        <literal>NDB</literal> performs memory allocations in two
+        different ways. The first of these is by using the
+        <literal>allocRecord()</literal> method (defined in
+        <filename>storage/ndb/src/kernel/vm/SimulatedBlock.hpp</filename>).
+        This is the traditional method whereby records are accessed
+        using the <literal>ptrCheckGuard</literal> macros (defined in
+        <filename>storage/ndb/src/kernel/vm/pc.hpp</filename>). The
+        other method is to allocate memory using the
+        <literal>setSize()</literal> method defined with the help of the
+        template found in
+        <filename>storage/ndb/src/kernel/vm/CArray.hpp</filename>.
+      </para>
+
+      <para>
+        These methods sometimes also initialize the memory, ensuring
+        that both memory allocation and initialization are done with
+        watchdog protection.
+      </para>
+
+      <para>
+        Many blocks also perform block-specific initialization, which
+        often entails building linked lists or doubly-linked lists (and
+        in some cases hash tables).
+      </para>
+
+      <para>
+        Many of the sizes used in allocation are calculated in the
+        <literal>Configuration::calcSizeAlt()</literal> method, found in
+        <filename>storage/ndb/src/kernel/vm/Configuration.cpp</filename>.
+      </para>
+
+      <para>
+        Some preparations for more intelligent pooling of memory
+        resources have been made. <literal>DataMemory</literal> and disk
+        records already belong to this global memory pool.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-0">
+
+      <title><literal>STTOR</literal> Phase 0</title>
+
+      <para>
+        Most <literal>NDB</literal> kernel blocks begin their start
+        phases at <literal>STTOR</literal> Phase 1, with the exception
+        of <literal>NDBFS</literal> and <literal>NDBCNTR</literal>,
+        which begin with Phase 0, as can be seen by inspecting the first
+        value for each element in the <literal>ALL_BLOCKS</literal>
+        array (defined in
+        <filename>storage/ndb/src/kernel/blocks/ndbcntr/NdbcntrMain.cpp</filename>).
+        In addition, when the <literal>STTOR</literal> signal is sent to
+        a block, the return signal <literal>STTORRY</literal> always
+        contains a list of the start phases in which the block has an
+        interest. Only in those start phases does the block actually
+        receive a <literal>STTOR</literal> signal.
+      </para>
+
+      <para>
+        <literal>STTOR</literal> signals are sent out in the order in
+        which the kernel blocks are listed in the
+        <literal>ALL_BLOCKS</literal> array. While
+        <literal>NDBCNTR</literal> goes through start phases 0 to 255,
+        most of these are empty.
+      </para>
+
+      <para>
+        Both activities in Phase 0 have to do with initialization of the
+        <literal>NDB</literal> filesystem. First, if necessary,
+        <literal>NDBFS</literal> creates the filesystem directory for
+        the data node. In the case of an initial start,
+        <literal>NDBCNTR</literal> clears any existing files from the
+        directory of the data node to ensure that the
+        <literal>DBDIH</literal> block does not subsequently discover
+        any system files (if <literal>DBDIH</literal> were to find any
+        system files, it would not interpret the start correctly as an
+        initial start). (See also
+        <xref linkend="ndb-internals-kernel-blocks-dbdih"/>.)
+      </para>
+
+      <para>
+        Each time that <literal>NDBCNTR</literal> completes the sending
+        of one start phase to all kernel blocks, it sends a
+        <literal>NODE_STATE_REP</literal> signal to all blocks, which
+        effectively updates the <literal>NodeState</literal> in all
+        blocks.
+      </para>
+
+      <para>
+        Each time that <literal>NDBCNTR</literal> completes a non-empty
+        start phase, it reports this to the management server; in most
+        cases this is recorded in the cluster log.
+      </para>
+
+      <para>
+        Finally, after completing all start phases,
+        <literal>NDBCNTR</literal> updates the node state in all blocks
+        via a <literal>NODE_STATE_REP</literal> signal; it also sends an
+        event report advising that all start phases are complete. In
+        addition, all other cluster data nodes are notified that this
+        node has completed all its start phases to ensure all nodes are
+        aware of one another's state. Each data node sends a
+        <literal>NODE_START_REP</literal> to all blocks; however, this
+        is significant only for <literal>DBDIH</literal>, so that it
+        knows when it can unlock the lock for schema changes on
+        <literal>DBDICT</literal>.
+      </para>
+
+      <note>
+        <para>
+          In the following table, and throughout this text, we sometimes
+          refer to <literal>STTOR</literal> start phases simply as
+          <quote>start phases</quote> or <quote>Phase
+          <replaceable>N</replaceable></quote> (where
+          <replaceable>N</replaceable> is some number).
+          <literal>NDB_STTOR</literal> start phases are always qualified
+          as such, and so referred to as
+          <quote><literal>NDB_STTOR</literal> start phases</quote> or
+          <quote><literal>NDB_STTOR</literal> phases</quote>.
+        </para>
+      </note>
+
+      <informaltable>
+        <tgroup cols="2">
+          <colspec colwidth="20*"/>
+          <colspec colwidth="80*"/>
+          <thead>
+            <row>
+              <entry>Kernel Block</entry>
+              <entry>Receptive Start Phases</entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry><literal>NDBFS</literal></entry>
+              <entry>0</entry>
+            </row>
+            <row>
+              <entry><literal>DBTC</literal></entry>
+              <entry>1</entry>
+            </row>
+            <row>
+              <entry><literal>DBDIH</literal></entry>
+              <entry>1</entry>
+            </row>
+            <row>
+              <entry><literal>DBLQH</literal></entry>
+              <entry>1, 4</entry>
+            </row>
+            <row>
+              <entry><literal>DBACC</literal></entry>
+              <entry>1</entry>
+            </row>
+            <row>
+              <entry><literal>DBTUP</literal></entry>
+              <entry>1</entry>
+            </row>
+            <row>
+              <entry><literal>DBDICT</literal></entry>
+              <entry>1, 3</entry>
+            </row>
+            <row>
+              <entry><literal>NDBCNTR</literal></entry>
+              <entry>0, 1, 2, 3, 4, 5, 6, 8, 9</entry>
+            </row>
+            <row>
+              <entry><literal>CMVMI</literal></entry>
+              <entry>1 (prior to <literal>QMGR</literal>), 3, 8</entry>
+            </row>
+            <row>
+              <entry><literal>QMGR</literal></entry>
+              <entry>1, 7</entry>
+            </row>
+            <row>
+              <entry><literal>TRIX</literal></entry>
+              <entry>1</entry>
+            </row>
+            <row>
+              <entry><literal>BACKUP</literal></entry>
+              <entry>1, 3, 7</entry>
+            </row>
+            <row>
+              <entry><literal>DBUTIL</literal></entry>
+              <entry>1, 6</entry>
+            </row>
+            <row>
+              <entry><literal>SUMA</literal></entry>
+              <entry>1, 3, 5, 7, 100 (empty), 101</entry>
+            </row>
+            <row>
+              <entry><literal>DBTUX</literal></entry>
+              <entry>1, 3, 7</entry>
+            </row>
+            <row>
+              <entry><literal>TSMAN</literal></entry>
+              <entry>1, 3 (both ignored)</entry>
+            </row>
+            <row>
+              <entry><literal>LGMAN</literal></entry>
+              <entry>1, 2, 3, 4, 5, 6 (all ignored)</entry>
+            </row>
+            <row>
+              <entry><literal>PGMAN</literal></entry>
+              <entry>1, 3, 7 (Phase 7 currently empty)</entry>
+            </row>
+            <row>
+              <entry><literal>RESTORE</literal></entry>
+              <entry>1, 3 (only in Phase 1 is any real work done)</entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+
+      <note>
+        <para>
+          This table was current at the time this text was written, but
+          is likely to change over time. The latest information can be
+          found in the source code.
+        </para>
+      </note>
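+
+      <para>
+        As a rough illustration of how the table can be read, the
+        following sketch looks up which blocks would receive a
+        <literal>STTOR</literal> signal in a given phase. The phase
+        lists are copied from the table; the function names are
+        hypothetical, and treating the table's row order as the
+        <literal>ALL_BLOCKS</literal> order is an assumption.
+      </para>
+

```python
# Receptive STTOR phases per block, copied from the table in the text.
# Dict order follows the table rows; whether that matches the
# ALL_BLOCKS array order is an assumption of this sketch.
RECEPTIVE_PHASES = {
    "NDBFS": [0],
    "DBTC": [1],
    "DBDIH": [1],
    "DBLQH": [1, 4],
    "DBACC": [1],
    "DBTUP": [1],
    "DBDICT": [1, 3],
    "NDBCNTR": [0, 1, 2, 3, 4, 5, 6, 8, 9],
    "CMVMI": [1, 3, 8],
    "QMGR": [1, 7],
    "TRIX": [1],
    "BACKUP": [1, 3, 7],
    "DBUTIL": [1, 6],
    "SUMA": [1, 3, 5, 7, 100, 101],
    "DBTUX": [1, 3, 7],
    "TSMAN": [1, 3],
    "LGMAN": [1, 2, 3, 4, 5, 6],
    "PGMAN": [1, 3, 7],
    "RESTORE": [1, 3],
}


def blocks_receiving_sttor(phase):
    """Blocks that listed `phase` as one in which they have an interest."""
    return [block for block, phases in RECEPTIVE_PHASES.items()
            if phase in phases]


def phases_with_recipients():
    """All phases in which at least one block receives STTOR."""
    return sorted({p for phases in RECEPTIVE_PHASES.values() for p in phases})
```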
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-1">
+
+      <title><literal>STTOR</literal> Phase 1</title>
+
+      <para>
+        This is one of the phases in which most kernel blocks
+        participate (see the table in
+        <xref linkend="ndb-internals-start-phases-sttor-0"/>).
+        Even so, most blocks are involved primarily in the
+        initialization of data &mdash; for example, this is all that
+        <literal>DBTC</literal> does.
+      </para>
+
+      <para>
+        Many blocks initialize references to other blocks in Phase 1.
+        <literal>DBLQH</literal> initializes block references to
+        <literal>DBTUP</literal>, and <literal>DBACC</literal>
+        initializes block references to <literal>DBTUP</literal> and
+        <literal>DBLQH</literal>. <literal>DBTUP</literal> initializes
+        references to the blocks <literal>DBLQH</literal>,
+        <literal>TSMAN</literal>, and <literal>LGMAN</literal>.
+      </para>
+
+      <para>
+        <literal>NDBCNTR</literal> initializes some variables and sets
+        up block references to <literal>DBTUP</literal>,
+        <literal>DBLQH</literal>, <literal>DBACC</literal>,
+        <literal>DBTC</literal>, <literal>DBDIH</literal>, and
+        <literal>DBDICT</literal>; these are needed in the special start
+        phase handling of these blocks using
+        <literal>NDB_STTOR</literal> signals, where the bulk of the node
+        startup process actually takes place.
+      </para>
+
+      <para>
+        If the cluster is configured to lock pages (that is, if the
+        <literal>LockPagesInMainMemory</literal> configuration parameter
+        has been set), <literal>CMVMI</literal> handles this locking.
+      </para>
+
+      <para>
+        The <literal>QMGR</literal> block calls the
+        <literal>initData()</literal> method (defined in
+        <filename>storage/ndb/src/kernel/blocks/qmgr/QmgrMain.cpp</filename>)
+        whose output is handled by all other blocks in the
+        <literal>READ_CONFIG_REQ</literal> phase (see
+        <xref linkend="ndb-internals-start-phases-read-config"/>).
+        Following these initializations, <literal>QMGR</literal> sends
+        the <literal>DIH_RESTARTREQ</literal> signal to
+        <literal>DBDIH</literal>, which determines whether a proper
+        system file exists; if it does, an initial start is not being
+        performed. After the reception of this signal comes the process
+        of integrating the node among the other data nodes in the
+        cluster, where data nodes enter the cluster one at a time. The
+        first one to enter becomes the master; whenever the master dies,
+        the new master is always the remaining node that has been
+        running for the longest time.
+      </para>
+
+      <para>
+        <literal>QMGR</literal> sets up timers to ensure that inclusion
+        in the cluster does not take longer than what the cluster's
+        configuration is set to allow (see
+        <link linkend="mysql-cluster-timeouts-intervals-disk-paging">Controlling
+        Timeouts, Intervals, and Disk Paging</link> for the relevant
+        configuration parameters), after which communication to all
+        other data nodes is established. At this point, a
+        <literal>CM_REGREQ</literal> signal is sent to all data nodes.
+        Only the president of the cluster responds to this signal; the
+        president allows one node at a time to enter the cluster. If no
+        node responds within 3 seconds then the president becomes the
+        master. If several nodes start up simultaneously, then the node
+        with the lowest node ID becomes president. The president sends
+        <literal>CM_REGCONF</literal> in response to this signal, but
+        also sends a <literal>CM_ADD</literal> signal to all nodes that
+        are currently alive.
+      </para>
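+
+      <para>
+        Two of the selection rules given in this section can be written
+        down as a small sketch: the lowest node ID wins among nodes
+        starting simultaneously, and the longest-running survivor
+        succeeds a failed master. The function and parameter names are
+        hypothetical, and representing running time by the order in
+        which nodes joined is an assumption.
+      </para>
+

```python
def choose_president(starting_node_ids):
    """Among nodes starting simultaneously, the lowest node ID wins."""
    return min(starting_node_ids)


def next_master(join_order, failed):
    """Pick the surviving node that has been running the longest.

    `join_order` lists node IDs in the order they entered the cluster
    (earliest first); `failed` is the set of node IDs that have died.
    Returns None if no node survives.
    """
    for node_id in join_order:
        if node_id not in failed:
            return node_id
    return None
```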
+
+      <para>
+        Next, the starting node sends a
+        <literal>CM_NODEINFOREQ</literal> signal to all current
+        <quote>live</quote> data nodes. When these nodes receive that
+        signal they send a <literal>NODE_VERSION_REP</literal> signal to
+        all API nodes that have connected to them. Each data node also
+        sends a <literal>CM_ACKADD</literal> to the president to inform
+        the president that it has heard the
+        <literal>CM_NODEINFOREQ</literal> signal from the new node.
+        Finally, each of the current data nodes sends the
+        <literal>CM_NODEINFOCONF</literal> signal in response to the
+        starting node. When the starting node has received all these
+        signals, it also sends the <literal>CM_ACKADD</literal> signal
+        to the president.
+      </para>
+
+      <para>
+        When the president has received all of the expected
+        <literal>CM_ACKADD</literal> signals, it knows that all data
+        nodes (including the newest one to start) have replied to the
+        <literal>CM_NODEINFOREQ</literal> signal. When the president
+        receives the final <literal>CM_ACKADD</literal>, it sends a
+        <literal>CM_ADD</literal> signal to all current data nodes (that
+        is, except for the node that just started). Upon receiving this
+        signal, the existing data nodes enable communication with the
+        new node; they begin sending heartbeats to it and include it in
+        the list of neighbors used by the heartbeat protocol.
+      </para>
+
+      <para>
+        The <literal>start</literal> struct is reset, so that it can
+        handle new starting nodes, and then each data node sends a
+        <literal>CM_ACKADD</literal> to the president, which then sends
+        a <literal>CM_ADD</literal> to the starting node after all such
+        <literal>CM_ACKADD</literal> signals have been received. The new
+        node then opens all of its communication channels to the data
+        nodes that were already connected to the cluster; it also sets
+        up its own heartbeat structures and starts sending heartbeats.
+        It also sends a <literal>CM_ACKADD</literal> message in response
+        to the president.
+      </para>
+
+      <para>
+        As a final step, <literal>QMGR</literal> also starts the timer
+        handling for which it is responsible. This means that it
+        generates a signal to blocks that have requested it. This signal
+        is sent 100 times per second even if any one instance of the
+        signal is delayed.
+      </para>
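+
+      <para>
+        One way to keep a rate of 100 signals per second despite
+        individual delays is to schedule each signal against a fixed
+        deadline rather than against the time of the previous send. The
+        sketch below shows that technique only; it is an assumption
+        about how such a timer could work, not a description of the
+        <literal>QMGR</literal> code, and all names in it are
+        hypothetical.
+      </para>
+

```python
class FixedRateTimer:
    """Hypothetical fixed-rate tick source (not the QMGR implementation)."""

    def __init__(self, start_ms, period_ms=10):
        # 10 ms period corresponds to 100 signals per second.
        self.period_ms = period_ms
        self.next_due_ms = start_ms + period_ms

    def poll(self, now_ms):
        """Return how many signals are due at time `now_ms`.

        Deadlines advance by a fixed period, so a late poll catches up
        on every missed deadline instead of dropping it.
        """
        ticks = 0
        while self.next_due_ms <= now_ms:
            ticks += 1
            self.next_due_ms += self.period_ms
        return ticks


def ticks_in_first_second(poll_times_ms):
    """Total signals emitted when polling at the given times (ms)."""
    timer = FixedRateTimer(start_ms=0)
    return sum(timer.poll(t) for t in poll_times_ms)
```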
+
+      <para>
+        The <literal>BACKUP</literal> kernel block also begins sending a
+        signal periodically. This is to ensure that excessive amounts of
+        data are not written to disk, and that data writes are kept
+        within the limits of what has been specified in the cluster
+        configuration file during and after restarts. The
+        <literal>DBUTIL</literal> block initializes the transaction
+        identity, and <literal>DBTUX</literal> creates a reference to
+        the <literal>DBTUP</literal> block, while
+        <literal>PGMAN</literal> initializes pointers to the
+        <literal>LGMAN</literal> and <literal>DBTUP</literal> blocks.
+        The <literal>RESTORE</literal> kernel block creates references
+        to the <literal>DBLQH</literal> and <literal>DBTUP</literal>
+        blocks to enable quick access to those blocks when needed.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-2">
+
+      <title><literal>STTOR</literal> Phase 2</title>
+
+      <para>
+        The only kernel block that participates in this phase to any
+        real effect is <literal>NDBCNTR</literal>.
+      </para>
+
+      <para>
+        In this phase <literal>NDBCNTR</literal> obtains the current
+        state of each configured cluster data node. Messages are sent to
+        <literal>NDBCNTR</literal> from <literal>QMGR</literal>
+        reporting changes in the status of any of the nodes.
+        <literal>NDBCNTR</literal> also sets timers corresponding to the
+        <literal>StartPartialTimeout</literal>,
+        <literal>StartPartitionedTimeout</literal>, and
+        <literal>StartFailureTimeout</literal> configuration parameters.
+      </para>
+
+      <para>
+        The next step is for a <literal>CNTR_START_REQ</literal> signal
+        to be sent to the proposed master node. Normally the president
+        is also chosen as master. However, during a system restart in
+        which the starting node has a newer global checkpoint than the
+        one that survived on the president, the starting node takes
+        over as master, even though it is not recognized as the president
+        by <literal>QMGR</literal>. If the starting node is chosen as
+        the new master, then the other nodes are informed of this via a
+        <literal>CNTR_START_REF</literal> signal.
+      </para>
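+
+      <para>
+        The master selection rule just described can be expressed as a
+        short sketch. The function and its parameters are hypothetical;
+        in particular, representing each node's newest global
+        checkpoint as a single comparable integer is an assumption.
+      </para>
+

```python
def choose_master(president_id, president_gci, starting_id, starting_gci,
                  system_restart):
    """Return the node ID that acts as master.

    The president is normally the master. Only during a system restart,
    and only if the starting node holds a newer global checkpoint than
    the president, does the starting node take over.
    """
    if system_restart and starting_gci > president_gci:
        return starting_id
    return president_id
```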
+
+      <para>
+        The master withholds the <literal>CNTR_START_REQ</literal>
+        signal until it is ready to start a new node, or to start the
+        cluster for an initial restart or system restart.
+      </para>
+
+      <para>
+        When the starting node receives
+        <literal>CNTR_START_CONF</literal>, it starts the
+        <literal>NDB_STTOR</literal> phases, in the following order:
+
+        <orderedlist>
+
+          <listitem>
+            <para>
+              <literal>DBLQH</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBDICT</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBTUP</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBACC</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBTC</literal>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>DBDIH</literal>
+            </para>
+          </listitem>
+
+        </orderedlist>
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-ndb-sttor-1">
+
+      <title><literal>NDB_STTOR</literal> Phase 1</title>
+
+      <para>
+        <literal>DBDICT</literal>, if necessary, initializes the schema
+        file. <literal>DBDIH</literal>, <literal>DBTC</literal>,
+        <literal>DBTUP</literal>, and <literal>DBLQH</literal>
+        initialize variables. <literal>DBLQH</literal> also initializes
+        the sending of statistics on database operations.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-3">
+
+      <title><literal>STTOR</literal> Phase 3</title>
+
+      <para>
+        <literal>DBDICT</literal> initializes a variable that keeps
+        track of the type of restart being performed.
+      </para>
+
+      <para>
+        <literal>NDBCNTR</literal> executes the second of the
+        <literal>NDB_STTOR</literal> start phases, with no other
+        <literal>NDBCNTR</literal> activity taking place during this
+        <literal>STTOR</literal> phase.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-ndb-sttor-2">
+
+      <title><literal>NDB_STTOR</literal> Phase 2</title>
+
+      <para>
+        The <literal>DBLQH</literal> block enables its exchange of
+        internal records with <literal>DBTUP</literal> and
+        <literal>DBACC</literal>, while <literal>DBTC</literal> allows
+        its internal records to be exchanged with
+        <literal>DBDIH</literal>. The <literal>DBDIH</literal> kernel
+        block creates the mutexes used by the <literal>NDB</literal>
+        kernel and reads nodes using the
+        <literal>READ_NODESREQ</literal> signal. With the data from the
+        response to this signal, <literal>DBDIH</literal> can create
+        node lists, node groups, and so forth. For node restarts and
+        initial node restarts, <literal>DBDIH</literal> also asks the
+        master for permission to perform the restart. The master will
+        ask all <quote>live</quote> nodes if they are prepared to permit
+        the new node to join the cluster. If an initial node restart is
+        to be performed, then all LCPs are invalidated as part of this
+        phase.
+      </para>
+
+      <para>
+        LCPs from nodes that are not part of the cluster at the time of
+        the initial node restart are not invalidated. The reason for
+        this is that there is never any chance for a node to become
+        master of a system restart using any of the LCPs that have been
+        invalidated, since this node must complete a node restart
+        &mdash; including a local checkpoint &mdash; before it can join
+        the cluster and possibly become a master node.
+      </para>
+
+      <para>
+        The <literal>CMVMI</literal> kernel block activates the sending
+        of packed signals, which occurs only as part of database
+        operations. Packing must be enabled prior to beginning any such
+        operations during the execution of the redo log or node recovery
+        phases.
+      </para>
+
+      <para>
+        The <literal>DBTUX</literal> block sets the type of start
+        currently taking place, while the <literal>BACKUP</literal>
+        block sets the type of restart to be performed, if any (in each
+        case, the block actually sets a variable whose value reflects
+        the type of start or restart). The <literal>SUMA</literal> block
+        remains inactive during this phase.
+      </para>
+
+      <para>
+        The <literal>PGMAN</literal> kernel block starts the generation
+        of two repeated signals. The first handles cleanup and is sent
+        every 200 milliseconds; the other handles statistics and is
+        sent once per second.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-4">
+
+      <title><literal>STTOR</literal> Phase 4</title>
+
+      <para>
+        Only the <literal>DBLQH</literal> and <literal>NDBCNTR</literal>
+        kernel blocks are directly involved in this phase.
+        <literal>DBLQH</literal> allocates a record in the
+        <literal>BACKUP</literal> block, used in the execution of local
+        checkpoints via the <literal>DEFINE_BACKUP_REQ</literal> signal.
+        <literal>NDBCNTR</literal> causes <literal>NDB_STTOR</literal>
+        phase 3 to be executed; there is no other
+        <literal>NDBCNTR</literal> activity during this
+        <literal>STTOR</literal> phase.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-ndb-sttor-3">
+
+      <title><literal>NDB_STTOR</literal> Phase 3</title>
+
+      <para>
+        The <literal>DBLQH</literal> block initiates checking of the log
+        files here. Then it obtains the states of the data nodes using
+        the <literal>READ_NODESREQ</literal> signal. Unless an initial
+        start or an initial node restart is being performed, the
+        checking of log files is handled in parallel with a number of
+        other start phases. For initial starts, the log files must be
+        initialized; this can be a lengthy process and should have some
+        progress status attached to it.
+      </para>
+
+      <note>
+        <para>
+          From this point, there are two parallel paths, one continuing
+          restart and another reading and determining the state of the
+          redo log files.
+        </para>
+      </note>
+
+      <para>
+        The <literal>DBDICT</literal> block requests information about
+        the cluster data nodes via the <literal>READ_NODESREQ</literal>
+        signal. <literal>DBACC</literal> resets the system restart flag
+        if this is not a system restart; this is used only to verify
+        that no requests are received from <literal>DBTUX</literal>
+        during system restart. <literal>DBTC</literal> requests
+        information about all nodes by means of the
+        <literal>READ_NODESREQ</literal> signal.
+      </para>
+
+      <para>
+        <literal>DBDIH</literal> sets an internal master state and makes
+        other preparations exclusive to initial starts. In the case of
+        an initial start, the non-master nodes perform some initial
+        tasks, the master node doing so once all non-master nodes have
+        reported that their tasks are completed. (This delay is actually
+        unnecessary since there is no reason to wait while initializing
+        the master node.)
+      </para>
+
+      <para>
+        For node restarts and initial node restarts no more work is done
+        in this phase. For initial starts the work is done when all
+        nodes have created the initial restart information and
+        initialized the system file.
+      </para>
+
+      <para>
+        For system restarts this is where most of the work is performed,
+        initiated by sending the <literal>NDB_STARTREQ</literal> signal
+        from <literal>NDBCNTR</literal> to <literal>DBDIH</literal> in
+        the master. This signal is sent when all nodes in the system
+        restart have reached this point in the restart. This we can mark
+        as our first synchronization point for system restarts,
+        designated <literal>WAITPOINT_4_1</literal>.
+      </para>
+
+      <para>
+        For a description of the system restart version of Phase 4, see
+        <xref linkend="ndb-internals-start-phases-system-restart-phase-4"/>.
+      </para>
+
+      <para>
+        After completing execution of the
+        <literal>NDB_STARTREQ</literal> signal, the master sends a
+        <literal>CNTR_WAITREP</literal> signal with
+        <literal>WAITPOINT_4_2</literal> to all nodes. This ends
+        <literal>NDB_STTOR</literal> phase 3 as well as
+        (<literal>STTOR</literal>) Phase 4.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-5">
+
+      <title><literal>STTOR</literal> Phase 5</title>
+
+      <para>
+        All that takes place in Phase 5 is the delivery by
+        <literal>NDBCNTR</literal> of <literal>NDB_STTOR</literal> phase
+        4; the only block that acts on this signal is
+        <literal>DBDIH</literal>, which controls most of the
+        database-related part of a data node start.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-ndb-sttor-4">
+
+      <title><literal>NDB_STTOR</literal> Phase 4</title>
+
+      <para>
+        Some initialization of local checkpoint variables takes place in
+        this phase, and for initial restarts, this is all that happens
+        in this phase.
+      </para>
+
+      <para>
+        For system restarts, all required takeovers are also performed.
+        Currently, this means that all nodes whose states could not be
+        recovered using the redo log are restarted by copying to them
+        all the necessary data from the <quote>live</quote> data nodes.
+
+        <remark role="NOTE">
+          [js] Commented out until Mikael supplies the material on node
+          takeovers.
+        </remark>
+
+<!-- For a
+      description of this process, see 
+      <xref linkend="ndb-internals-start-phases-takeovers"/>. 
+    </para>
+    <para>-->
+
+        For node restarts and initial node restarts, the master node
+        performs a number of services when requested to do so by a
+        <literal>START_MEREQ</literal> signal sent to it. This phase is
+        complete when the master responds with a
+        <literal>START_MECONF</literal> message, and is described in
+        <xref linkend="ndb-internals-start-phases-start-mereq-handling"/>.
+      </para>
+
+      <para>
+        After ensuring that the tasks assigned to
+        <literal>DBDIH</literal> in <literal>NDB_STTOR</literal> phase 4
+        are complete, <literal>NDBCNTR</literal> performs some work on
+        its own. For initial starts, it creates the system table that
+        keeps track of unique identifiers such as those used for
+        <literal>AUTO_INCREMENT</literal>. Following the
+        <literal>WAITPOINT_4_1</literal>
+        synchronization point, all system restarts proceed immediately
+        to <literal>NDB_STTOR</literal> phase 5, which is handled by the
+        <literal>DBDIH</literal> block. See
+        <xref linkend="ndb-internals-start-phases-ndb-sttor-5"/>, for
+        more information.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-ndb-sttor-5">
+
+      <title><literal>NDB_STTOR</literal> Phase 5</title>
+
+      <para>
+        For initial starts and system restarts this phase means
+        executing a local checkpoint. This is handled by the master so
+        that the other nodes will return immediately from this phase.
+        Node restarts and initial node restarts perform the copying of
+        the records from the primary replica to the starting replicas in
+        this phase. Local checkpoints are enabled before the copying
+        process is begun.
+      </para>
+
+      <para>
+        Copying the data to a starting node is part of the node takeover
+        protocol. As part of this protocol, the node status of the
+        starting node is updated; this is communicated using the global
+        checkpoint protocol. Waiting for these events to take place
+        ensures that the new node status is communicated to all nodes
+        and their system files.
+      </para>
+
+      <para>
+        After the node's status has been communicated, all nodes are
+        signaled that we are about to start the takeover protocol for
+        this node. Part of this protocol consists of Steps 3 - 9 during
+        the system restart phase as described below. This means that
+        restoration of all the fragments, preparation for execution of
+        the redo log, execution of the redo log, and finally reporting
+        back to <literal>DBDIH</literal> when the execution of the redo
+        log is completed, are all part of this process.
+      </para>
+
+      <para>
+        After preparations are complete, the copy phase for each
+        fragment in the node must be performed. The process of copying a
+        fragment involves the following steps:
+
+        <orderedlist>
+
+          <listitem>
+            <para>
+              The <literal>DBLQH</literal> kernel block in the starting
+              node is informed that the copy process is about to begin
+              by sending it a <literal>PREPARE_COPY_FRAGREQ</literal>
+              signal.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              When <literal>DBLQH</literal> acknowledges this request, a
+              <literal>CREATE_FRAGREQ</literal> signal is sent to all
+              nodes to notify them of the preparation being made to copy
+              data to this replica for this table fragment.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              After all nodes have acknowledged this, a
+              <literal>COPY_FRAGREQ</literal> signal is sent to the node
+              from which the data is to be copied to the new node. This
+              is always the primary replica of the fragment. The node
+              indicated copies all the data over to the starting node in
+              response to this message.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              After copying has been completed, and a
+              <literal>COPY_FRAGCONF</literal> message is sent, all
+              nodes are notified of the completion through an
+              <literal>UPDATE_TOREQ</literal> signal.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              After all nodes have updated to reflect the new state of
+              the fragment, the <literal>DBLQH</literal> kernel block of
+              the starting node is informed of the fact that the copy
+              has been completed, and that the replica is now up-to-date
+              and any failures should now be treated as real failures.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              The new replica is transformed into a primary replica if
+              this is the role it had when the table was created.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              After completing this change another round of
+              <literal>CREATE_FRAGREQ</literal> messages is sent to all
+              nodes informing them that the takeover of the fragment is
+              now committed.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              After this, the process is repeated with the next fragment
+              if another one exists.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              When there are no more fragments for takeover by the node,
+              all nodes are informed of this by means of an
+              <literal>UPDATE_TOREQ</literal> signal sent to all of
+              them.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              Wait for the next complete local checkpoint to occur,
+              running from start to finish.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              The node states are updated, using a complete global
+              checkpoint. As with the local checkpoint in the previous
+              step, the global checkpoint must be allowed to start and
+              then to finish.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              When the global checkpoint has completed, it will
+              communicate the successful local checkpoint of this node
+              restart by sending an <literal>END_TOREQ</literal> signal
+              to all nodes.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              A <literal>START_COPYCONF</literal> is sent back to the
+              starting node informing it that the node restart has been
+              completed.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              Receiving the <literal>START_COPYCONF</literal> signal
+              ends <literal>NDB_STTOR</literal> phase 5. This provides
+              another synchronization point for system restarts,
+              designated as <literal>WAITPOINT_5_2</literal>.
+            </para>
+          </listitem>
+
+        </orderedlist>
+      </para>
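+
+      <para>
+        The order of the named signals in the list above can be
+        summarized in a short Python sketch. It is a reading aid, not
+        kernel code: the function name and tuple layout are
+        hypothetical, only signals named in the text are listed, and
+        the two entries beginning with <literal>wait:</literal> are
+        descriptive placeholders rather than signal names.
+      </para>

```python
def fragment_takeover_signals(fragments):
    """Return the ordered (signal, fragment) steps named in the text.

    Hypothetical helper for illustration; it does not send anything.
    """
    steps = []
    for frag in fragments:
        steps.append(("PREPARE_COPY_FRAGREQ", frag))  # starting node's DBLQH
        steps.append(("CREATE_FRAGREQ", frag))        # all nodes: copy is being prepared
        steps.append(("COPY_FRAGREQ", frag))          # node holding the primary replica
        steps.append(("COPY_FRAGCONF", frag))         # copying has been completed
        steps.append(("UPDATE_TOREQ", frag))          # all nodes: completion notice
        steps.append(("CREATE_FRAGREQ", frag))        # all nodes: takeover committed
    # Node-level steps once no fragments remain.
    steps.append(("UPDATE_TOREQ", None))
    steps.append(("wait: complete local checkpoint", None))
    steps.append(("wait: complete global checkpoint", None))
    steps.append(("END_TOREQ", None))
    steps.append(("START_COPYCONF", None))
    return steps


steps = fragment_takeover_signals(["frag0", "frag1"])
```

+      <para>
+        With two fragments, the sketch yields six entries per fragment
+        followed by the five node-level entries, ending with
+        <literal>START_COPYCONF</literal>.
+      </para>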
+
+      <note>
+        <para>
+          The copy process in this phase can in theory be performed in
+          parallel by several nodes. However, messages from the master
+          to all nodes are currently sent to a single node at a time,
+          although this could be made completely parallel. This is
+          likely to be done in the not too distant future.
+        </para>
+      </note>
+
+      <para>
+        In an initial start and an initial node restart, the
+        <literal>SUMA</literal> block requests the subscriptions from
+        the <literal>SUMA</literal> master node.
+        <literal>NDBCNTR</literal> executes <literal>NDB_STTOR</literal>
+        phase 6. No other <literal>NDBCNTR</literal> activity takes
+        place.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-ndb-sttor-6">
+
+      <title><literal>NDB_STTOR</literal> Phase 6</title>
+
+      <para>
+        In this <literal>NDB_STTOR</literal> phase, both
+        <literal>DBLQH</literal> and <literal>DBDICT</literal> clear
+        their internal variables representing the current restart type. The
+        <literal>DBACC</literal> block resets the system restart flag;
+        <literal>DBACC</literal> and <literal>DBTUP</literal> start a
+        periodic signal for checking memory usage once per second.
+        <literal>DBTC</literal> sets an internal variable indicating
+        that the system restart has been completed.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-6">
+
+      <title><literal>STTOR</literal> Phase 6</title>
+
+      <para>
+        The <literal>NDBCNTR</literal> block defines the cluster's node
+        groups, and the <literal>DBUTIL</literal> block initializes a
+        number of data structures to facilitate the sending of keyed
+        operations to the system tables.
+        <literal>DBUTIL</literal> also sets up a single connection to
+        the <literal>DBTC</literal> kernel block.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-7">
+
+      <title><literal>STTOR</literal> Phase 7</title>
+
+      <para>
+        In <literal>QMGR</literal> the president starts an arbitrator
+        (unless this feature has been disabled by setting the value of
+        the <literal>ArbitrationRank</literal> configuration parameter
+        to 0 for all nodes &mdash; see
+        <xref linkend="mysql-cluster-mgm-definition"/>, and
+        <xref linkend="mysql-cluster-api-definition"/>, for more
+        information; note that this currently can be done only when
+        using MySQL Cluster Carrier Grade Edition). In addition,
+        checking of API nodes through heartbeats is activated.
+      </para>
+
+      <para>
+        Also during this phase, the <literal>BACKUP</literal> block sets
+        the disk write speed to the value used following the completion
+        of the restart. The master node during initial start also
+        inserts the record keeping track of which backup ID is to be
+        used next. The <literal>SUMA</literal> and
+        <literal>DBTUX</literal> blocks set variables indicating that
+        start phase 7 has been completed, and that requests to
+        <literal>DBTUX</literal> that occur when running the redo log
+        should no longer be ignored.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-8">
+
+      <title><literal>STTOR</literal> Phase 8</title>
+
+      <para>
+        <literal>NDBCNTR</literal> executes
+        <literal>NDB_STTOR</literal> phase 7; no other
+        <literal>NDBCNTR</literal> activity takes place.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-ndb-sttor-7">
+
+      <title><literal>NDB_STTOR</literal> Phase 7</title>
+
+      <para>
+        If this is a system restart, the master node initiates a rebuild
+        of all indexes from <literal>DBDICT</literal> during this phase.
+      </para>
+
+      <para>
+        The <literal>CMVMI</literal> kernel block opens communication
+        channels to the API nodes (including MySQL servers acting as SQL
+        nodes). It also indicates in <literal>globalData</literal> that
+        the node is started.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-9">
+
+      <title><literal>STTOR</literal> Phase 9</title>
+
+      <para>
+        <literal>NDBCNTR</literal> resets some start variables.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-sttor-101">
+
+      <title><literal>STTOR</literal> Phase 101</title>
+
+      <para>
+        This is the <literal>SUMA</literal> handover phase.
+      </para>
+
+    </section>
+
+    <section id="ndb-internals-start-phases-system-restart-phase-4">
+
+      <title>System Restart Handling in Phase 4</title>
+
+      <para>
+        This consists of the following steps:
+
+        <orderedlist>
+
+          <listitem>
+            <para>
+              The master sets the latest GCI as the restart GCI, and
+              then synchronizes its system file to all other nodes
+              involved in the system restart.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              The next step is to synchronize the schema of all the
+              nodes in the system restart. This is performed in 15
+              passes. The problem we are trying to solve here occurs
+              when a schema object has been created while the node was
+              up but was dropped while the node was down, and possibly a
+              new object was even created with the same schema ID while
+              that node was unavailable. In order to handle this
+              situation, it is necessary first to re-create all objects
+              that are supposed to exist from the viewpoint of the
+              starting node. After this, any objects that were dropped
+              by other nodes in the cluster while this node was
+              <quote>dead</quote> are dropped; this also applies to any
+              tables that were dropped during the outage. Finally, any
+              tables that have been created by other nodes while the
+              starting node was unavailable are re-created on the
+              starting node. All these operations are local to the
+              starting node. As part of this process, it is also
+              necessary to ensure that all tables that need to be
+              re-created have been created locally and that the proper
+              data structures have been set up for them in all kernel
+              blocks.
+            </para>
+
+            <para>
+              After performing the procedure described previously for
+              the master node, the new schema file is sent to all other
+              participants in the system restart, and they perform the
+              same synchronization.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              All fragments involved in the restart must have proper
+              parameters as derived from <literal>DBDIH</literal>. This
+              causes a number of <literal>START_FRAGREQ</literal>
+              signals to be sent from <literal>DBDIH</literal> to
+              <literal>DBLQH</literal>. This also starts the restoration
+              of the fragments, which are restored one by one and one
+              record at a time in the course of reading the restore data
+              from disk and applying in parallel the restore data read
+              from disk into main memory. This restores only the main
+              memory parts of the tables.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              Once all fragments have been restored, a
+              <literal>START_RECREQ</literal> message is sent to all
+              nodes in the starting cluster, and then all undo logs for
+              any Disk Data parts of the tables are applied.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              After applying the undo logs in <literal>LGMAN</literal>,
+              it is necessary to perform some restore work in
+              <literal>TSMAN</literal> that requires scanning the extent
+              headers of the tablespaces.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              Next, it is necessary to prepare for execution of the redo
+              log, which can be performed in up to four phases. For
+              each fragment, execution of redo logs from several
+              different nodes may be required. This is handled by
+              executing the redo logs in different phases for a specific
+              fragment, as decided in <literal>DBDIH</literal> when
+              sending the <literal>START_FRAGREQ</literal> signal. An
+              <literal>EXEC_FRAGREQ</literal> signal is sent for each
+              phase and fragment that requires execution in this phase.
+              After these signals are sent, an
+              <literal>EXEC_SRREQ</literal> signal is sent to all nodes
+              to tell them that they can start executing the redo log.
+
+              <note>
+                <para>
+                  Before starting execution of the first redo log, it is
+                  necessary to make sure that the setup which was
+                  started earlier (in Phase 4) by
+                  <literal>DBLQH</literal> has finished, or to wait
+                  until it does before continuing.
+                </para>
+              </note>
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              Prior to executing the redo log, it is necessary to
+              calculate where to start reading and where the end of the
+              redo log should be reached. The end of the redo log
+              should be found when the last GCI to restore has been
+              reached.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              After completing the execution of the redo logs, all redo
+              log pages that have been written beyond the last GCI to be
+              restored are invalidated. Given the cyclic nature of the
+              redo logs, this could carry the invalidation into new redo
+              log files past the last one executed.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              After the completion of the previous step,
+              <literal>DBLQH</literal> reports this back to
+              <literal>DBDIH</literal> using a
+              <literal>START_RECCONF</literal> message.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              When the master has received this message back from all
+              starting nodes, it sends an
+              <literal>NDB_STARTCONF</literal> signal back to
+              <literal>NDBCNTR</literal>.
+            </para>
+          </listitem>
+
+          <listitem>
+            <para>
+              The <literal>NDB_STARTCONF</literal> message signals the
+              end of <literal>STTOR</literal> phase 4 to
+              <literal>NDBCNTR</literal>, which is the only block
+              involved to any significant degree in this phase.
+            </para>
+          </listitem>
+
+        </orderedlist>
+      </para>
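+
+      <para>
+        The schema synchronization described in the second step above
+        can be illustrated with a small Python sketch. It collapses the
+        process into three conceptual passes rather than the 15 passes
+        mentioned in the text, and it assumes a hypothetical layout in
+        which each schema is a dictionary mapping a schema ID to a
+        (name, version) pair. None of these names come from the NDB
+        source.
+      </para>

```python
def synchronize_schema(local, master):
    """Return the ordered actions that bring `local` in line with `master`.

    Both arguments map schema ID to a (name, version) tuple. This layout is
    an assumption made for the sketch, not the real schema file format.
    """
    actions = []
    # Pass 1: re-create every object the starting node believes should exist.
    for schema_id in sorted(local):
        actions.append(("recreate", schema_id, local[schema_id]))
    # Pass 2: drop objects that were dropped (or replaced under the same
    # schema ID) while the node was down.
    for schema_id in sorted(local):
        if master.get(schema_id) != local[schema_id]:
            actions.append(("drop", schema_id, local[schema_id]))
    # Pass 3: create objects that appeared while the node was unavailable.
    for schema_id in sorted(master):
        if local.get(schema_id) != master[schema_id]:
            actions.append(("create", schema_id, master[schema_id]))
    return actions


# Schema ID 2 was dropped and reused for a new table during the outage,
# and schema ID 3 was created during the outage.
local = {1: ("t1", 1), 2: ("t2", 1)}
master = {1: ("t1", 1), 2: ("t2_new", 2), 3: ("t3", 1)}
actions = synchronize_schema(local, master)
```

+      <para>
+        In this example the old object with schema ID 2 is first
+        re-created and then dropped before the new object with the same
+        ID is created, which is the situation the ordering is meant to
+        handle.
+      </para>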
+
+    </section>
+
+<!--  
+  <section id="ndb-internals-start-phases-takeovers">
+    <title>Handling of Node Takeovers</title>
+    
+    <para>
+      <remark role="NOTE">
+        [js] Commented out until Mikael supplies the material on node
+        takeovers.
+      </remark>
+      
+    </para>
+  </section>
+-->
+
+    <section id="ndb-internals-start-phases-start-mereq-handling">
+
+      <title>START_MEREQ Handling</title>
+
+      <para>
+        The first step in handling <literal>START_MEREQ</literal> is to
+        ensure that no local checkpoint is currently taking place;
+        otherwise, it is necessary to wait until it is completed. The
+        next step is to copy all distribution information from the
+        master <literal>DBDIH</literal> to the starting
+        <literal>DBDIH</literal>. After this, all metadata is
+        synchronized in <literal>DBDICT</literal> (see
+        <xref linkend="ndb-internals-start-phases-system-restart-phase-4"/>).
+      </para>
+
+      <para>
+        After blocking local checkpoints, and then synchronizing
+        distribution information and metadata information, global
+        checkpoints are blocked.
+      </para>
+
+      <para>
+        The next step is to integrate the starting node in the global
+        checkpoint protocol, local checkpoint protocol, and all other
+        distributed protocols. As part of this the node status is also
+        updated.
+      </para>
+
+      <para>
+        After completing this step, the global checkpoint protocol is
+        permitted to start again, and the <literal>START_MECONF</literal>
+        signal is sent to indicate to the starting node that the next
+        phase may proceed.
+      </para>
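+
+      <para>
+        The ordering described in this section can be written down as a
+        short Python sketch. The function name and the action strings
+        are hypothetical; the sketch only records the order of the
+        steps given in the text and does not model how the master
+        carries them out.
+      </para>

```python
def handle_start_mereq(local_checkpoint_running):
    """Return the ordered actions the master takes for one START_MEREQ.

    Hypothetical helper; the action strings paraphrase the text above.
    """
    actions = []
    if local_checkpoint_running:
        # A local checkpoint in progress must finish first.
        actions.append("wait for running local checkpoint to complete")
    actions.append("block local checkpoints")
    actions.append("copy distribution information to starting DBDIH")
    actions.append("synchronize metadata in DBDICT")
    actions.append("block global checkpoints")
    actions.append("integrate starting node into distributed protocols")
    actions.append("update node status")
    actions.append("permit global checkpoints again")
    actions.append("send START_MECONF to starting node")
    return actions


actions_idle = handle_start_mereq(False)
actions_busy = handle_start_mereq(True)
```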
+
+    </section>
+
+  </section>
+
   <section id="ndb-internals-glossary">
 
     <title><literal>NDB</literal> Internals Glossary</title>

