List: Cluster
From: Johan Andersson
Date: December 14 2011 7:21am
Subject: Re: Fwd: Cluster down without any proper reason ...
What happened on node 3?


Best regards,
Johan
On 2011-12-14 08.19, umapathi b wrote:
> Hello, can anybody help?
>
> - Umapathi.
>
> ---------- Forwarded message ----------
> From: umapathi b<umapathi.b@stripped>
> Date: Tue, Dec 13, 2011 at 12:53 PM
> Subject: Cluster down without any proper reason ...
> To: cluster@stripped
> Cc: Johan Andersson<johan@stripped>, sureshkumarilu@stripped
>
>
> Hi,
>
> My cluster, with 1 management node, 2 data nodes and 2 SQL nodes, went down
> without any apparent reason.
> After restarting ndbd, all went well and the cluster is up. But how
> can I handle this so that it does not happen again?
> I came across one parameter, StopOnError, whose default value is 1.
> If I set it to 0, will this resolve my issue?
>
> Here is my config file
> --------------------------------
>
> [ndbd default]
> # Options affecting ndbd processes on all data nodes:
> NoOfReplicas=2    # Number of replicas
> DataMemory=20G    # How much memory to allocate for data storage
> IndexMemory=5G   # How much memory to allocate for index storage
>                    # For DataMemory and IndexMemory, we have used the
>                    # default values. Since the "world" database takes up
>                    # only about 500KB, this should be more than enough for
>                    # this example Cluster setup.
> StringMemory=5M
> MaxNoOfConcurrentTransactions=100000
> MaxNoOfConcurrentOperations=110000
> MaxNoOfLocalOperations=250000
> MaxNoOfConcurrentIndexOperations=81920
> MaxNoOfConcurrentScans=256
> MaxNoOfLocalScans=10000
> MaxNoOfOpenFiles=10000
> MaxNoOfAttributes=500000
> ODirect=1
> MaxNoOfTables=20320
> MaxNoOfOrderedIndexes=20480
> MaxNoOfUniqueHashIndexes=20480
> LockPagesInMainMemory=1
> NoOfFragmentLogFiles=300
> TimeBetweenGlobalCheckpoints=1000
> TimeBetweenEpochs=200
> DiskCheckpointSpeed=10M
> DiskCheckpointSpeedInRestart=100M
> RedoBuffer=32M
> SchedulerSpinTimer=400
> SchedulerExecutionTimer=100
> RealTimeScheduler=1
> LockExecuteThreadToCPU=1
> LockMaintThreadsToCPU=0
>
>
> [tcp default]
> # TCP/IP options:
> portnumber=2202   # This is the default; however, you can use any
>                    # port that is free for all the hosts in the cluster
>                    # Note: It is recommended that you do not specify the port
>                    # number at all and simply allow the default value to be
>                    # used instead
>
> [ndb_mgmd]
> # Management process options:
> hostname=10.10.90.65           # Hostname or IP address of MGM node
> datadir=/var/lib/mysql-cluster  # Directory for MGM node log files
>
> [ndbd]
> # Options for data node "A":
>                                  # (one [ndbd] section per data node)
> hostname=10.10.90.57           # Hostname or IP address
> datadir=/usr/local/mysql/data   # Directory for this data node's data files
>
> [ndbd]
> # Options for data node "B":
> hostname=10.10.90.58           # Hostname or IP address
> datadir=/usr/local/mysql/data   # Directory for this data node's data files
>
> [mysqld]
> # SQL node options:
> hostname=10.10.90.57           # Hostname or IP address
>                                  # (additional mysqld connections can be
>                                  # specified for this node for various
>                                  # purposes such as running ndb_restore)
>
> [mysqld]
> # SQL node options:
> hostname=10.10.90.58           # Hostname or IP address
>                                  # (additional mysqld connections can be
>                                  # specified for this node for various
>                                  # purposes such as running ndb_restore)
>
> Error logs
> --------------
>
> Management Node
> ---------------------------
>
> (ndb_1_cluster.log)
>
> 2011-12-12 06:03:41 [MgmtSrvr] INFO     -- Node 2: Backup 486 started from
> node 1 completed. StartGCP: 2781465 StopGCP: 2781505 #Records: 20593993
> #LogRecords: 6 Data: 862174712 bytes Log: 520 bytes
> 2011-12-12 06:25:07 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 822
> started. Keep GCI = 2779146 oldest restorable GCI = 2779422
> 2011-12-12 06:25:38 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
> 2011-12-12 06:25:38 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
> 2011-12-12 06:26:53 [MgmtSrvr] INFO     -- Mgmt server state: nodeid 3
> freed, m_reserved_nodes 1, 2, 4 and 5.
> 2011-12-12 06:26:53 [MgmtSrvr] INFO     -- Mgmt server state: nodeid 2
> freed, m_reserved_nodes 1, 4 and 5.
>
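To see which nodes dropped and in what order, the management log can be scanned for the ALERT lines. The following is only a minimal sketch: it assumes each entry sits on a single line in the real ndb_1_cluster.log (the excerpt above is wrapped by the mail client), and `find_disconnects` and the regular expression are illustrative names, not part of any NDB tool.

```python
import re
from datetime import datetime

# Matches management-log lines such as:
# 2011-12-12 06:25:38 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
DISCONNECT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] ALERT\s+-- "
    r"Node (?P<reporter>\d+): Node (?P<node>\d+) Disconnected"
)

def find_disconnects(lines):
    """Return (timestamp, reporting node id, disconnected node id) tuples."""
    events = []
    for line in lines:
        m = DISCONNECT.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, int(m.group("reporter")), int(m.group("node"))))
    return events

# Sample entries taken from the excerpt above, joined back onto single lines.
sample = [
    "2011-12-12 06:25:07 [MgmtSrvr] INFO     -- Node 2: Local checkpoint 822 "
    "started. Keep GCI = 2779146 oldest restorable GCI = 2779422",
    "2011-12-12 06:25:38 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected",
    "2011-12-12 06:25:38 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected",
]

events = find_disconnects(sample)
for ts, reporter, node in events:
    print(f"{ts} node {node} disconnected (reported by node {reporter})")
```

On the sample lines this reports both data nodes (2 and 3) as disconnected in the same second, so the log of node 3 is the next place to look.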
> Data Node 1
> -------------------
>
> Error log ( houdb01.err )
> -------------
>
> 111212  6:25:44 [Note] NDB Binlog: Node: 2, down, Subscriber bitmask 00
> 111212  6:25:44 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00
> 111212  6:25:44 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema
> at epoch 2782761/0.
> 111212  6:25:44 [Note] NDB Binlog: ndb tables initially read only on
> reconnect.
> 111212  6:25:44 [Note] NDB Binlog: cluster failure for
> ./usi/user_cert_duplicate at epoch 2782761/0.
> 111212  6:25:44 [Note] NDB Binlog: cluster failure for
> ./usi/user_exam_sequence at epoch 2782761/0.
> 111212  6:25:44 [Note] NDB Binlog: cluster failure for
> ./usi_test/regulator_court_fee at epoch 2782761/0.
> 111212  6:25:44 [Note] NDB Binlog: cluster failure for
> ./usi/coupon_per_custom at epoch 2782761/0.
> 111212  6:25:44 [Note] NDB Binlog: cluster failure for
> ./usi_test/coupon_per_hiddenfee at epoch 2782761/0.
> 111212  6:25:44 [Note] NDB Binlog: cluster failure for ./usi/county_id_seq
> at epoch 2782761/0.
>
> ndb_2_out.log
> ----------------------
>
> 2011-12-12 06:25:39 [ndbd] INFO     -- findNeighbours from: 4891 old (left:
> 3 right: 3) new (65535 65535)
> 2011-12-12 06:25:39 [ndbd] INFO     -- Arbitrator decided to shutdown this
> node
> 2011-12-12 06:25:39 [ndbd] INFO     -- QMGR (Line: 6005) 0x00000002
> 2011-12-12 06:25:39 [ndbd] INFO     -- Error handler shutting down system
> 2011-12-12 06:25:40 [ndbd] INFO     -- Error handler shutdown completed -
> exiting
> 2011-12-12 06:25:47 [ndbd] ALERT    -- Node 2: Forced node shutdown
> completed. Caused by error 2305: 'Node lost connection to other nodes and
> can not form a unpartitioned cluster, please investigate if there are
> error(s) on other node(s)(Arbitration error). Temporary error, restart
> node'.
> 2011-12-12 06:26:24 [ndbd] WARNING  -- Unable to report shutdown reason to
> '10.10.90.65:1186'(error: Could not connect to socket - Unable to connect
> with connect string: nodeid=0,10.10.90.65:1186)
>
> ndb_2_error.log
> -----------------------
>
> Time: Monday 12 December 2011 - 06:25:39
> Status: Temporary error, restart node
> Message: Node lost connection to other nodes and can not form a
> unpartitioned cluster, please investigate if there are error(s) on other
> node(s) (Arbitration error)
> Error: 2305
> Error data: Arbitrator decided to shutdown this node
> Error object: QMGR (Line: 6005) 0x00000002
> Program: /usr/local/mysql/bin/ndbd
> Pid: 3082
> Version: mysql-5.1.56 ndb-7.1.15a
> Trace: /usr/local/mysql/data/ndb_2_trace.log.12
> ***EOM***
>
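To compare this record with the one on the other data node, each error-log entry can be parsed into fields. This is a rough sketch under assumptions: the field names are taken only from the excerpt above, wrapped lines are treated as continuations of the previous field, and `parse_error_record` is a hypothetical helper rather than an existing NDB utility.

```python
# Field names as they appear in the ndb_2_error.log excerpt above.
FIELDS = {"Time", "Status", "Message", "Error", "Error data",
          "Error object", "Program", "Pid", "Version", "Trace"}

def parse_error_record(text):
    """Parse one error-log record into a dict, joining wrapped lines."""
    record = {}
    current = None  # field that continuation lines are appended to
    for raw in text.splitlines():
        line = raw.strip()
        if line == "***EOM***":  # end-of-record marker
            break
        if not line:
            continue
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            current = key
            record[current] = value.strip()
        elif current is not None:
            # No known field name at the start: treat as a wrapped line.
            record[current] += " " + line
    return record

sample = """\
Time: Monday 12 December 2011 - 06:25:39
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a
unpartitioned cluster, please investigate if there are error(s) on other
node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 6005) 0x00000002
Program: /usr/local/mysql/bin/ndbd
Pid: 3082
Version: mysql-5.1.56 ndb-7.1.15a
Trace: /usr/local/mysql/data/ndb_2_trace.log.12
***EOM***
"""

record = parse_error_record(sample)
print(record["Time"], record["Error"], record["Error data"])
```

Running the same function over the error log of node 3 would show whether it stopped with the same error code at the same time or failed first for a different reason.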
> Almost the same errors are logged on the other Data Node too ...
>
> Help is highly appreciated in this regard.
>
