I've been running a 2-node cluster for 10 days or so, with 1 API node and
1 DB node on each computer and the MGM server on a separate computer. Each
night one of the nodes dies, and it is always the same node. The cluster
stays intact since the second node survives, and when I restart the crashed
node it rejoins the cluster with no fuss. Obviously I can't have this
behaviour in production, and I would like to find the cause.
These are just test boxes with 512MB RAM, so it could be underpowered hardware,
but I was wondering if there were logs or trace files I could provide that
would help determine the source of the nightly crash.
What the console shows on BOX2:
[root@BOX2 2.ndb_db]# Ndb kernel is stuck in: Job Handling
Ndb kernel is stuck in: Job Handling
Error handler shutting down system
Error handler shutdown completed - exiting
What ndb/cluster.log says:
<...>
2004-07-28 17:59:02 [MgmSrvr] INFO -- Node 3: Local checkpoint 174 started. Keep GCI = 260315 oldest restorable GCI = 248165
2004-07-28 18:58:55 [MgmSrvr] INFO -- Node 3: Local checkpoint 175 started. Keep GCI = 262051 oldest restorable GCI = 248165
2004-07-28 19:58:47 [MgmSrvr] INFO -- Node 3: Local checkpoint 176 started. Keep GCI = 263786 oldest restorable GCI = 248165
2004-07-28 20:55:03 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2004-07-28 20:55:04 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 3
2004-07-28 20:55:05 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2004-07-28 20:55:05 [MgmSrvr] INFO -- Lost connection to node 2
2004-07-28 20:55:06 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 4
2004-07-28 20:55:06 [MgmSrvr] ALERT -- Node 3: Node 2 declared dead due to missed heartbeat
2004-07-28 20:55:06 [MgmSrvr] ALERT -- Node 3: Network partitioning - arbitration required
2004-07-28 20:55:06 [MgmSrvr] INFO -- Node 3: President restarts arbitration thread [state=7]
2004-07-28 20:55:06 [MgmSrvr] ALERT -- Node 3: Arbitration won - positive reply from node 1
2004-07-28 20:55:06 [MgmSrvr] INFO -- Node 3: Started arbitrator node 1 [ticket=1eab00020908f20c]
2004-07-28 20:55:07 [MgmSrvr] WARNING -- Node 3: Node 12 missed heartbeat 2
2004-07-28 20:55:08 [MgmSrvr] WARNING -- Node 3: Node 12 missed heartbeat 3
2004-07-28 20:55:10 [MgmSrvr] WARNING -- Node 3: Node 12 missed heartbeat 4
2004-07-28 20:55:10 [MgmSrvr] ALERT -- Node 3: Node 12 declared dead due to missed heartbeat
2004-07-28 20:58:27 [MgmSrvr] INFO -- Node 3: Local checkpoint 177 started. Keep GCI = 265522 oldest restorable GCI = 248165
2004-07-28 21:53:06 [MgmSrvr] INFO -- Node 3: Local checkpoint 178 started. Keep GCI = 267250 oldest restorable GCI = 248165
<...>
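In case it helps with comparing one night's crash against another, here is a rough sketch of how the WARNING and ALERT entries could be pulled out of cluster.log. It is not an official tool. The entry layout in the regular expression is only inferred from the excerpt above, and the names parse_entries and problems are made up for illustration. Lines that do not start with a timestamp are treated as wrapped continuations of the previous entry.

```python
import re

# Assumed entry layout, inferred from the excerpt above (not from any spec):
#   "YYYY-MM-DD HH:MM:SS [MgmSrvr] LEVEL -- message"
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] (\w+)\s+--\s+(.*)$"
)


def parse_entries(lines):
    """Yield (timestamp, level, message) tuples.

    A line that does not match ENTRY is appended to the message of the
    entry before it, which undoes the wrapping seen in the excerpt.
    """
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match:
            if current is not None:
                yield tuple(current)
            current = [match.group(1), match.group(2), match.group(3).strip()]
        elif current is not None:
            current[2] += " " + line
    if current is not None:
        yield tuple(current)


def problems(lines):
    """Return only the WARNING and ALERT entries, in file order."""
    return [e for e in parse_entries(lines) if e[1] in ("WARNING", "ALERT")]
```

Calling problems(open("ndb/cluster.log")) and printing the result for each night would show whether the missed heartbeats start at about the same time every night, which might point at a scheduled job on BOX2.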
Any ideas? Are there any other places to look?
Thanks in advance.
-- Jim