From: umapathi b
Date: December 13 2011 7:23am
Subject: Cluster down without any proper reason ...
List-Archive: http://lists.mysql.com/cluster/8229

Hi,

My cluster, with 1 management node, 2 data nodes and 2 SQL nodes, went down without any apparent reason. After restarting ndbd, all went well and the cluster is up again. But how can I handle this so that it does not happen again?

I came across one parameter, StopOnError, whose default value is 1. If I set it to 0, will this resolve my issue?

Here is my config file
--------------------------------
[ndbd default]
# Options affecting ndbd processes on all data nodes:
NoOfReplicas=2    # Number of replicas
DataMemory=20G    # How much memory to allocate for data storage
IndexMemory=5G    # How much memory to allocate for index storage
                  # For DataMemory and IndexMemory, we have used the
                  # default values. Since the "world" database takes up
                  # only about 500KB, this should be more than enough for
                  # this example Cluster setup.
StringMemory=5M
MaxNoOfConcurrentTransactions=100000
MaxNoOfConcurrentOperations=110000
MaxNoOfLocalOperations=250000
MaxNoOfConcurrentIndexOperations=81920
MaxNoOfConcurrentScans=256
MaxNoOfLocalScans=10000
MaxNoOfOpenFiles=10000
MaxNoOfAttributes=500000
ODirect=1
MaxNoOfTables=20320
MaxNoOfOrderedIndexes=20480
MaxNoOfUniqueHashIndexes=20480
LockPagesInMainMemory=1
NoOfFragmentLogFiles=300
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
DiskCheckpointSpeed=10M
DiskCheckpointSpeedInRestart=100M
RedoBuffer=32M
SchedulerSpinTimer=400
SchedulerExecutionTimer=100
RealTimeScheduler=1
LockExecuteThreadToCPU=1
LockMaintThreadsToCPU=0

[tcp default]
# TCP/IP options:
portnumber=2202   # This the default; however, you can use any
                  # port that is free for all the hosts in the cluster
                  # Note: It is recommended that you do not specify the port
                  # number at all and simply allow the default value to be used
                  # instead

[ndb_mgmd]
# Management process options:
hostname=10.10.90.65            # Hostname or IP address of MGM node
datadir=/var/lib/mysql-cluster  # Directory for MGM node log files

[ndbd]
# Options for data node "A":
# (one [ndbd] section per data node)
hostname=10.10.90.57            # Hostname or IP address
datadir=/usr/local/mysql/data   # Directory for this data node's data files

[ndbd]
# Options for data node "B":
hostname=10.10.90.58            # Hostname or IP address
datadir=/usr/local/mysql/data   # Directory for this data node's data files

[mysqld]
# SQL node options:
hostname=10.10.90.57            # Hostname or IP address
                                # (additional mysqld connections can be
                                # specified for this node for various
                                # purposes such as running ndb_restore)

[mysqld]
# SQL node options:
hostname=10.10.90.58            # Hostname or IP address
                                # (additional mysqld connections can be
                                # specified for this node for various
                                # purposes such as running ndb_restore)

Error logs
--------------

Management Node
---------------------------
(ndb_1_cluster.log)

2011-12-12 06:03:41 [MgmtSrvr] INFO -- Node 2: Backup 486 started from node 1 completed.
    StartGCP: 2781465 StopGCP: 2781505 #Records: 20593993 #LogRecords: 6 Data: 862174712 bytes Log: 520 bytes
2011-12-12 06:25:07 [MgmtSrvr] INFO -- Node 2: Local checkpoint 822 started. Keep GCI = 2779146 oldest restorable GCI = 2779422
2011-12-12 06:25:38 [MgmtSrvr] ALERT -- Node 1: Node 2 Disconnected
2011-12-12 06:25:38 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2011-12-12 06:26:53 [MgmtSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 1, 2, 4 and 5.
2011-12-12 06:26:53 [MgmtSrvr] INFO -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 1, 4 and 5.

Data Node 1
-------------------
Error log ( houdb01.err )
-------------

111212 6:25:44 [Note] NDB Binlog: Node: 2, down, Subscriber bitmask 00
111212 6:25:44 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00
111212 6:25:44 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: ndb tables initially read only on reconnect.
111212 6:25:44 [Note] NDB Binlog: cluster failure for ./usi/user_cert_duplicate at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for ./usi/user_exam_sequence at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for ./usi_test/regulator_court_fee at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for ./usi/coupon_per_custom at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for ./usi_test/coupon_per_hiddenfee at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for ./usi/county_id_seq at epoch 2782761/0.

ndb_2_out.log
----------------------

2011-12-12 06:25:39 [ndbd] INFO -- findNeighbours from: 4891 old (left: 3 right: 3) new (65535 65535)
2011-12-12 06:25:39 [ndbd] INFO -- Arbitrator decided to shutdown this node
2011-12-12 06:25:39 [ndbd] INFO -- QMGR (Line: 6005) 0x00000002
2011-12-12 06:25:39 [ndbd] INFO -- Error handler shutting down system
2011-12-12 06:25:40 [ndbd] INFO -- Error handler shutdown completed - exiting
2011-12-12 06:25:47 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2011-12-12 06:26:24 [ndbd] WARNING -- Unable to report shutdown reason to '10.10.90.65:1186'(error: Could not connect to socket - Unable to connect with connect string: nodeid=0,10.10.90.65:1186)

ndb_2_error.log
-----------------------

Time: Monday 12 December 2011 - 06:25:39
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 6005) 0x00000002
Program: /usr/local/mysql/bin/ndbd
Pid: 3082
Version: mysql-5.1.56 ndb-7.1.15a
Trace: /usr/local/mysql/data/ndb_2_trace.log.12
***EOM***

Almost the same errors are logged on the other data node too.

Help is highly appreciated in this regard.
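
P.S. To be clear about the change I am asking about, it would look roughly like the fragment below in config.ini. This is only a sketch of what I am considering, not something I have applied. Placing it under [ndbd default] is my assumption, and as far as I understand it only controls whether a failed ndbd restarts automatically, so I do not know whether it addresses the arbitration error (2305) shown above.

```ini
[ndbd default]
# All existing settings stay as they are; only the line below would be added.
# My understanding (an assumption, not verified on this cluster):
#   1 (the default) = the data node process stops when it hits an error
#   0               = the data node process is restarted automatically
StopOnError=0
```

As I understand it, the data nodes would need to be restarted after this edit for the new value to take effect.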