List: Cluster
From: umapathi b
Date: December 14 2011 7:19am
Subject: Fwd: Cluster down without any proper reason ...
Hello, can anybody help?

- Umapathi.

---------- Forwarded message ----------
From: umapathi b <umapathi.b@stripped>
Date: Tue, Dec 13, 2011 at 12:53 PM
Subject: Cluster down without any proper reason ...
To: cluster@stripped
Cc: Johan Andersson <johan@stripped>, sureshkumarilu@stripped


Hi,

My cluster, with 1 management node, 2 data nodes and 2 SQL nodes, went down
without any apparent reason.
After restarting ndbd on the data nodes, all went well and the cluster is up
again. But how can I prevent this from happening again?
I came across one parameter, StopOnError, whose default value is 1.
If I set it to 0, will this resolve my issue?
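
For reference, the change I am considering would look roughly like the
fragment below. This is only a sketch of my understanding: with
StopOnError=0 the data node process should be restarted automatically after
an error instead of staying down. I am assuming it belongs in the
[ndbd default] section like the other data node options. I have not applied
or tested it, and I do not know whether it addresses the actual cause of the
shutdown.

```ini
[ndbd default]
# Hypothetical change, not yet applied to the cluster.
# StopOnError=1 (the default): the ndbd process halts when it hits an error.
# StopOnError=0: the ndbd process is restarted automatically after an error.
StopOnError=0
```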

Here is my config file
--------------------------------

[ndbd default]
# Options affecting ndbd processes on all data nodes:
NoOfReplicas=2    # Number of replicas
DataMemory=20G    # How much memory to allocate for data storage
IndexMemory=5G   # How much memory to allocate for index storage
                  # For DataMemory and IndexMemory, we have used the
                  # default values. Since the "world" database takes up
                  # only about 500KB, this should be more than enough for
                  # this example Cluster setup.
StringMemory=5M
MaxNoOfConcurrentTransactions=100000
MaxNoOfConcurrentOperations=110000
MaxNoOfLocalOperations=250000
MaxNoOfConcurrentIndexOperations=81920
MaxNoOfConcurrentScans=256
MaxNoOfLocalScans=10000
MaxNoOfOpenFiles=10000
MaxNoOfAttributes=500000
ODirect=1
MaxNoOfTables=20320
MaxNoOfOrderedIndexes=20480
MaxNoOfUniqueHashIndexes=20480
LockPagesInMainMemory=1
NoOfFragmentLogFiles=300
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
DiskCheckpointSpeed=10M
DiskCheckpointSpeedInRestart=100M
RedoBuffer=32M
SchedulerSpinTimer=400
SchedulerExecutionTimer=100
RealTimeScheduler=1
LockExecuteThreadToCPU=1
LockMaintThreadsToCPU=0


[tcp default]
# TCP/IP options:
portnumber=2202   # This is the default; however, you can use any
                  # port that is free for all the hosts in the cluster
                  # Note: It is recommended that you do not specify the port
                  # number at all and simply allow the default value to be
                  # used instead

[ndb_mgmd]
# Management process options:
hostname=10.10.90.65           # Hostname or IP address of MGM node
datadir=/var/lib/mysql-cluster  # Directory for MGM node log files

[ndbd]
# Options for data node "A":
                                # (one [ndbd] section per data node)
hostname=10.10.90.57           # Hostname or IP address
datadir=/usr/local/mysql/data   # Directory for this data node's data files

[ndbd]
# Options for data node "B":
hostname=10.10.90.58           # Hostname or IP address
datadir=/usr/local/mysql/data   # Directory for this data node's data files

[mysqld]
# SQL node options:
hostname=10.10.90.57           # Hostname or IP address
                                # (additional mysqld connections can be
                                # specified for this node for various
                                # purposes such as running ndb_restore)

[mysqld]
# SQL node options:
hostname=10.10.90.58           # Hostname or IP address
                                # (additional mysqld connections can be
                                # specified for this node for various
                                # purposes such as running ndb_restore)

Error logs
--------------

Management Node
---------------------------

(ndb_1_cluster.log)

2011-12-12 06:03:41 [MgmtSrvr] INFO     -- Node 2: Backup 486 started from
node 1 completed. StartGCP: 2781465 StopGCP: 2781505 #Records: 20593993
#LogRecords: 6 Data: 862174712 bytes Log: 520 bytes
2011-12-12 06:25:07 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 822
started. Keep GCI = 2779146 oldest restorable GCI = 2779422
2011-12-12 06:25:38 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2011-12-12 06:25:38 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
2011-12-12 06:26:53 [MgmtSrvr] INFO     -- Mgmt server state: nodeid 3
freed, m_reserved_nodes 1, 2, 4 and 5.
2011-12-12 06:26:53 [MgmtSrvr] INFO     -- Mgmt server state: nodeid 2
freed, m_reserved_nodes 1, 4 and 5.

Data Node 1
-------------------

Error log ( houdb01.err )
-------------

111212  6:25:44 [Note] NDB Binlog: Node: 2, down, Subscriber bitmask 00
111212  6:25:44 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00
111212  6:25:44 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema
at epoch 2782761/0.
111212  6:25:44 [Note] NDB Binlog: ndb tables initially read only on
reconnect.
111212  6:25:44 [Note] NDB Binlog: cluster failure for
./usi/user_cert_duplicate at epoch 2782761/0.
111212  6:25:44 [Note] NDB Binlog: cluster failure for
./usi/user_exam_sequence at epoch 2782761/0.
111212  6:25:44 [Note] NDB Binlog: cluster failure for
./usi_test/regulator_court_fee at epoch 2782761/0.
111212  6:25:44 [Note] NDB Binlog: cluster failure for
./usi/coupon_per_custom at epoch 2782761/0.
111212  6:25:44 [Note] NDB Binlog: cluster failure for
./usi_test/coupon_per_hiddenfee at epoch 2782761/0.
111212  6:25:44 [Note] NDB Binlog: cluster failure for ./usi/county_id_seq
at epoch 2782761/0.

ndb_2_out.log
----------------------

2011-12-12 06:25:39 [ndbd] INFO     -- findNeighbours from: 4891 old (left:
3 right: 3) new (65535 65535)
2011-12-12 06:25:39 [ndbd] INFO     -- Arbitrator decided to shutdown this
node
2011-12-12 06:25:39 [ndbd] INFO     -- QMGR (Line: 6005) 0x00000002
2011-12-12 06:25:39 [ndbd] INFO     -- Error handler shutting down system
2011-12-12 06:25:40 [ndbd] INFO     -- Error handler shutdown completed -
exiting
2011-12-12 06:25:47 [ndbd] ALERT    -- Node 2: Forced node shutdown
completed. Caused by error 2305: 'Node lost connection to other nodes and
can not form a unpartitioned cluster, please investigate if there are
error(s) on other node(s)(Arbitration error). Temporary error, restart
node'.
2011-12-12 06:26:24 [ndbd] WARNING  -- Unable to report shutdown reason to
'10.10.90.65:1186'(error: Could not connect to socket - Unable to connect
with connect string: nodeid=0,10.10.90.65:1186)

ndb_2_error.log
-----------------------

Time: Monday 12 December 2011 - 06:25:39
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a
unpartitioned cluster, please investigate if there are error(s) on other
node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 6005) 0x00000002
Program: /usr/local/mysql/bin/ndbd
Pid: 3082
Version: mysql-5.1.56 ndb-7.1.15a
Trace: /usr/local/mysql/data/ndb_2_trace.log.12
***EOM***

Almost the same errors are logged on the other data node too.

Any help in this regard is highly appreciated.
