Hi,
My cluster, with 1 management node, 2 data nodes and 2 SQL nodes, went down
without any apparent reason.
After restarting ndbd, all went well and the cluster is up again. But how
can I handle this so that it does not happen again?
I came across one parameter, StopOnError, whose default value is 1.
If I set it to 0, will this resolve my issue?
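To be concrete, the change I am considering is the fragment below, added to the existing [ndbd default] section of config.ini. This is only a sketch of the proposal and has not been applied. My understanding (an assumption on my part, please correct me if it is wrong) is that with StopOnError=0 the angel process restarts ndbd automatically after an error, instead of leaving the data node stopped.

```ini
[ndbd default]
# Proposed change only (not yet applied to the cluster).
# StopOnError=1 (current default): ndbd stays down after an error.
# StopOnError=0: the angel process restarts ndbd after an error.
StopOnError=0
```

I realise this would only automate the restart I did by hand; it would not explain why both data nodes were shut down by the arbitrator in the first place.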
Here is my config file
--------------------------------
[ndbd default]
# Options affecting ndbd processes on all data nodes:
NoOfReplicas=2 # Number of replicas
DataMemory=20G # How much memory to allocate for data storage
IndexMemory=5G # How much memory to allocate for index storage
# For DataMemory and IndexMemory, we have used the
# default values. Since the "world" database takes up
# only about 500KB, this should be more than enough for
# this example Cluster setup.
StringMemory=5M
MaxNoOfConcurrentTransactions=100000
MaxNoOfConcurrentOperations=110000
MaxNoOfLocalOperations=250000
MaxNoOfConcurrentIndexOperations=81920
MaxNoOfConcurrentScans=256
MaxNoOfLocalScans=10000
MaxNoOfOpenFiles=10000
MaxNoOfAttributes=500000
ODirect=1
MaxNoOfTables=20320
MaxNoOfOrderedIndexes=20480
MaxNoOfUniqueHashIndexes=20480
LockPagesInMainMemory=1
NoOfFragmentLogFiles=300
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
DiskCheckpointSpeed=10M
DiskCheckpointSpeedInRestart=100M
RedoBuffer=32M
SchedulerSpinTimer=400
SchedulerExecutionTimer=100
RealTimeScheduler=1
LockExecuteThreadToCPU=1
LockMaintThreadsToCPU=0
[tcp default]
# TCP/IP options:
portnumber=2202 # This is the default; however, you can use any
# port that is free for all the hosts in the cluster
# Note: It is recommended that you do not specify the port
# number at all and simply allow the default value to be used
# instead
[ndb_mgmd]
# Management process options:
hostname=10.10.90.65 # Hostname or IP address of MGM node
datadir=/var/lib/mysql-cluster # Directory for MGM node log files
[ndbd]
# Options for data node "A":
# (one [ndbd] section per data node)
hostname=10.10.90.57 # Hostname or IP address
datadir=/usr/local/mysql/data # Directory for this data node's data files
[ndbd]
# Options for data node "B":
hostname=10.10.90.58 # Hostname or IP address
datadir=/usr/local/mysql/data # Directory for this data node's data files
[mysqld]
# SQL node options:
hostname=10.10.90.57 # Hostname or IP address
# (additional mysqld connections can be
# specified for this node for various
# purposes such as running ndb_restore)
[mysqld]
# SQL node options:
hostname=10.10.90.58 # Hostname or IP address
# (additional mysqld connections can be
# specified for this node for various
# purposes such as running ndb_restore)
Error logs
--------------
Management Node
---------------------------
(ndb_1_cluster.log)
2011-12-12 06:03:41 [MgmtSrvr] INFO -- Node 2: Backup 486 started from
node 1 completed. StartGCP: 2781465 StopGCP: 2781505 #Records: 20593993
#LogRecords: 6 Data: 862174712 bytes Log: 520 bytes
2011-12-12 06:25:07 [MgmtSrvr] INFO -- Node 2: Local checkpoint 822
started. Keep GCI = 2779146 oldest restorable GCI = 2779422
2011-12-12 06:25:38 [MgmtSrvr] ALERT -- Node 1: Node 2 Disconnected
2011-12-12 06:25:38 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2011-12-12 06:26:53 [MgmtSrvr] INFO -- Mgmt server state: nodeid 3
freed, m_reserved_nodes 1, 2, 4 and 5.
2011-12-12 06:26:53 [MgmtSrvr] INFO -- Mgmt server state: nodeid 2
freed, m_reserved_nodes 1, 4 and 5.
Data Node 1
-------------------
Error log ( houdb01.err )
-------------
111212 6:25:44 [Note] NDB Binlog: Node: 2, down, Subscriber bitmask 00
111212 6:25:44 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00
111212 6:25:44 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema
at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: ndb tables initially read only on
reconnect.
111212 6:25:44 [Note] NDB Binlog: cluster failure for
./usi/user_cert_duplicate at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for
./usi/user_exam_sequence at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for
./usi_test/regulator_court_fee at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for
./usi/coupon_per_custom at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for
./usi_test/coupon_per_hiddenfee at epoch 2782761/0.
111212 6:25:44 [Note] NDB Binlog: cluster failure for ./usi/county_id_seq
at epoch 2782761/0.
ndb_2_out.log
----------------------
2011-12-12 06:25:39 [ndbd] INFO -- findNeighbours from: 4891 old (left:
3 right: 3) new (65535 65535)
2011-12-12 06:25:39 [ndbd] INFO -- Arbitrator decided to shutdown this
node
2011-12-12 06:25:39 [ndbd] INFO -- QMGR (Line: 6005) 0x00000002
2011-12-12 06:25:39 [ndbd] INFO -- Error handler shutting down system
2011-12-12 06:25:40 [ndbd] INFO -- Error handler shutdown completed -
exiting
2011-12-12 06:25:47 [ndbd] ALERT -- Node 2: Forced node shutdown
completed. Caused by error 2305: 'Node lost connection to other nodes and
can not form a unpartitioned cluster, please investigate if there are
error(s) on other node(s)(Arbitration error). Temporary error, restart
node'.
2011-12-12 06:26:24 [ndbd] WARNING -- Unable to report shutdown reason to
'10.10.90.65:1186'(error: Could not connect to socket - Unable to connect
with connect string: nodeid=0,10.10.90.65:1186)
ndb_2_error.log
-----------------------
Time: Monday 12 December 2011 - 06:25:39
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a
unpartitioned cluster, please investigate if there are error(s) on other
node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 6005) 0x00000002
Program: /usr/local/mysql/bin/ndbd
Pid: 3082
Version: mysql-5.1.56 ndb-7.1.15a
Trace: /usr/local/mysql/data/ndb_2_trace.log.12
***EOM***
Almost the same errors are logged on the other data node too.
Any help in this regard is highly appreciated.