List: Cluster
From: Jim Hoadley
Date: July 29 2004 4:33pm
Subject: nightly crashing

I've been running a 2-node cluster for 10 days or so: 1 API and 1 DB node on
each computer, with the MGM on a separate computer. Each night one of the nodes
dies, always the same one. The cluster stays intact since the second node
survives, and when I restart the crashed node it rejoins the cluster with no
fuss. Obviously I can't have this behaviour in production, and I would like to
find the cause.

These are just test boxes with 512MB RAM, so it could be underpowered hardware,
but I was wondering whether there are logs or trace files I could provide that
would help determine the source of the nightly crash.

What the console shows on BOX2:

[root@BOX2 2.ndb_db]# Ndb kernel is stuck in: Job Handling
Ndb kernel is stuck in: Job Handling
Error handler shutting down system
Error handler shutdown completed - exiting

What ndb/cluster.log says:

<...>
2004-07-28 17:59:02 [MgmSrvr] INFO     -- Node 3: Local checkpoint 174 started. Keep GCI = 260315 oldest restorable GCI = 248165
2004-07-28 18:58:55 [MgmSrvr] INFO     -- Node 3: Local checkpoint 175 started. Keep GCI = 262051 oldest restorable GCI = 248165
2004-07-28 19:58:47 [MgmSrvr] INFO     -- Node 3: Local checkpoint 176 started. Keep GCI = 263786 oldest restorable GCI = 248165
2004-07-28 20:55:03 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 2
2004-07-28 20:55:04 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 3
2004-07-28 20:55:05 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2004-07-28 20:55:05 [MgmSrvr] INFO     -- Lost connection to node 2
2004-07-28 20:55:06 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 4
2004-07-28 20:55:06 [MgmSrvr] ALERT    -- Node 3: Node 2 declared dead due to missed heartbeat
2004-07-28 20:55:06 [MgmSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2004-07-28 20:55:06 [MgmSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2004-07-28 20:55:06 [MgmSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2004-07-28 20:55:06 [MgmSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=1eab00020908f20c]
2004-07-28 20:55:07 [MgmSrvr] WARNING  -- Node 3: Node 12 missed heartbeat 2
2004-07-28 20:55:08 [MgmSrvr] WARNING  -- Node 3: Node 12 missed heartbeat 3
2004-07-28 20:55:10 [MgmSrvr] WARNING  -- Node 3: Node 12 missed heartbeat 4
2004-07-28 20:55:10 [MgmSrvr] ALERT    -- Node 3: Node 12 declared dead due to missed heartbeat
2004-07-28 20:58:27 [MgmSrvr] INFO     -- Node 3: Local checkpoint 177 started. Keep GCI = 265522 oldest restorable GCI = 248165
2004-07-28 21:53:06 [MgmSrvr] INFO     -- Node 3: Local checkpoint 178 started. Keep GCI = 267250 oldest restorable GCI = 248165
<...>
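
To compare crash times across nights, the WARNING and ALERT entries can be
pulled out of cluster.log on their own. The following is only a minimal Python
sketch, assuming the log keeps the "timestamp [source] LEVEL -- message" shape
seen in the excerpt above. The names parse_cluster_log and problem_events are
made up for illustration and are not part of MySQL Cluster.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt above:
# "2004-07-28 20:55:05 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (\w+)\s+-- (.*)$"
)


def parse_cluster_log(text):
    """Return a list of (timestamp, level, message) tuples.

    A line that does not start with a timestamp is treated as a wrapped
    continuation of the previous entry and is appended to its message.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            entries.append([stamp, match.group(3), match.group(4)])
        elif entries:
            entries[-1][2] += " " + line
    return [tuple(entry) for entry in entries]


def problem_events(entries, levels=("WARNING", "ALERT")):
    """Keep only the entries whose level is in `levels`."""
    return [entry for entry in entries if entry[1] in levels]


# Hypothetical usage (the path is an assumption):
# with open("ndb/cluster.log") as f:
#     for stamp, level, message in problem_events(parse_cluster_log(f.read())):
#         print(stamp, level, message)
```

If the filtered entries land at about the same time each night, that time can
be checked against whatever else is scheduled on the box.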

Any ideas? Any other places to look?

Thanks in advance.

-- Jim




		