From:
Date: January 12 2006 5:15am
Subject: bk commit into 5.0 tree (stewart:1.1997) BUG#15695
List-Archive: http://lists.mysql.com/commits/944
X-Bug: 15695
Message-Id: <20060112041508.4BFCF140EA9E@localhost.localdomain>

Below is the list of changes that have just been committed into a local
5.0 repository of stewart. When stewart does a push these changes will
be propagated to the main repository and, within 24 hours after the
push, to the public repository.
For information on how to access the public repository
see http://dev.mysql.com/doc/mysql/en/installing-source-tree.html

ChangeSet
  1.1997 06/01/12 15:15:03 stewart@stripped +1 -0
  Bug#15695 starting nodes hang in phase 2 forever on temporary network failure

  Fix so that:
  - if --initial is given, we can get a warning that there may be network
    partitioning
  - if no --initial, but we're doing an initial start, you can get an error
    when the nodes can talk to each other.

  This is because we're getting the problems at a very early stage of
  startup - we have not yet inferred if it's an initial start (among other
  things).

  ndb/src/kernel/blocks/qmgr/QmgrMain.cpp
    1.24 06/01/12 15:14:58 stewart@stripped +36 -0
    - allow reception of CONNECT_REP when ZRUNNING
    - in CM_REGCONF, check that we both agree on who the president is.
    - if we disagree, then there probably was network partitioning at some
      point.
    - in CM_REGREF, on an initial start, check that we can see everybody.
      If not, warn the user that they should check network connections.

# This is a BitKeeper patch.  What follows are the unified diffs for the
# set of deltas contained in the patch.  The rest of the patch, the part
# that BitKeeper cares about, is below these diffs.
# User: stewart
# Host: willster.(none)
# Root: /home/stewart/Documents/MySQL/5.0/bug15695

--- 1.23/ndb/src/kernel/blocks/qmgr/QmgrMain.cpp 2005-12-06 21:25:49 +11:00
+++ 1.24/ndb/src/kernel/blocks/qmgr/QmgrMain.cpp 2006-01-12 15:14:58 +11:00
@@ -288,6 +288,8 @@
     jam();
     break;
   case ZRUNNING:
+    jam();
+    break;
   case ZPREPARE_FAIL:
   case ZFAIL_CLOSING:
     jam();
@@ -619,6 +621,19 @@
     return;
   }
 
+  if(cpresident != ZNIL && cpresident != cmRegConf->presidentNodeId)
+  {
+    jam();
+    char buf[256];
+    BaseString::snprintf(buf,sizeof(buf),"Disagreement on who the president is"
+                         ". We think it's %u, but somebody else thinks %u."
+                         " This probably means there was network partitioning "
+                         "when trying to start the cluster and you ended up "
+                         "with two nodes trying to control cluster startup.",
+                         cpresident, cmRegConf->presidentNodeId);
+    systemErrorLab(signal, __LINE__, buf);
+    return;
+  }
   cpdistref = cmRegConf->presidentBlockRef;
   cpresident = cmRegConf->presidentNodeId;
 
@@ -782,6 +797,27 @@
     return;
   }
 
+  if(theConfiguration.getInitialStart())
+  {
+    NodeRecPtr nodePtr;
+
+    for (nodePtr.i = 1; nodePtr.i < MAX_NDB_NODES; nodePtr.i++) {
+      jam();
+      ptrAss(nodePtr, nodeRec);
+      if(getNodeInfo(nodePtr.i).getType() != NodeInfo::DB)
+        continue;
+
+      if(c_start.m_nodes.isWaitingFor(nodePtr.i) &&
+         !c_connectedNodes.get(nodePtr.i))
+      {
+        warningEvent("Initial start without all nodes present.");
+        warningEvent("Waiting until we can communicate with other nodes"
+                     " before attempting to start the cluster.");
+        warningEvent("If other nodes are starting, check network connection.");
+        return;
+      }
+    }
+  }
   /**
    * All configured nodes has agreed
    */
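
For readers without the NDB kernel tree at hand, the two checks this patch adds can be sketched in isolation. This is only an illustration under stated assumptions, not the Qmgr code: kNoPresident stands in for ZNIL (its value here is made up), plain std::set replaces the node bitmasks, and the function names checkPresident and allAwaitedNodesConnected are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical stand-in for ZNIL: "no president recorded yet".
// The real constant lives in the NDB kernel headers; this value is invented.
constexpr std::uint32_t kNoPresident = 0xFFFFu;

// Mirrors the CM_REGCONF check: if we already believe in a president and the
// reply names a different one, report the disagreement. An empty string
// means there is nothing to complain about.
std::string checkPresident(std::uint32_t known, std::uint32_t claimed) {
  if (known != kNoPresident && known != claimed) {
    return "Disagreement on who the president is. We think it's " +
           std::to_string(known) + ", but somebody else thinks " +
           std::to_string(claimed) + ".";
  }
  return std::string();
}

// Mirrors the CM_REGREF check on an initial start: every node we are still
// waiting for must also be connected, otherwise the caller should warn and
// keep waiting instead of starting the cluster.
bool allAwaitedNodesConnected(const std::set<std::uint32_t>& waitingFor,
                              const std::set<std::uint32_t>& connected) {
  for (std::uint32_t nodeId : waitingFor) {
    if (connected.count(nodeId) == 0) {
      return false;  // an awaited node is unreachable
    }
  }
  return true;
}
```

In the patch itself the first case ends in systemErrorLab() and the second in three warningEvent() calls followed by an early return; the sketch only returns values so that the decision logic can be read on its own.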