List: Cluster
From: Mikael Ronström
Date: July 29 2004 8:01pm
Subject: Re: nightly crashing
Hi Jim,
The ndbd node dies because the watchdog thread kills it. Most likely
the ndbd main thread has entered an infinite loop. Check error.log to
see which trace file was produced, and send the most interesting parts
of that trace file. If the code is properly instrumented, the trace
file should point to where the ndbd process got stuck.

The trace file starts with the jump address memory, which records the
line numbers in the modules where the code executed. Send this part
together with the first few signals recorded, which are the last
signals executed before the crash.

Rgrds Mikael
PS: You'll find those files in the directory where the ndbd process 
executes.
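
If it helps, here is a minimal Python sketch for pulling out the start of the most recent trace file. It rests on assumptions that are mine, not taken from ndbd itself: that trace files have "trace" somewhere in their file name, and that the most recently modified one belongs to the latest crash. error.log remains the authoritative place to see which trace file was written. The names newest_trace_file and head_of_file are made up for this example.

```python
from itertools import islice
from pathlib import Path


def newest_trace_file(ndbd_dir):
    """Return the most recently modified file whose name contains 'trace'.

    Assumption: trace files can be recognised by name. Check error.log
    for the file that was actually produced by the crash.
    """
    candidates = [
        p for p in Path(ndbd_dir).iterdir()
        if p.is_file() and "trace" in p.name.lower()
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def head_of_file(path, max_lines=200):
    """Return the first max_lines lines of a file, without line endings."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, max_lines)]


if __name__ == "__main__":
    # Run this from the directory where the ndbd process executes.
    trace = newest_trace_file(".")
    if trace is None:
        print("no trace file found here")
    else:
        print("newest trace file:", trace)
        for line in head_of_file(trace):
            print(line)
```

The 200-line default is arbitrary. Raise it if the first few recorded signals are cut off.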

On 2004-07-29 at 18:33, Jim Hoadley wrote:

> I've been running a 2-node cluster for 10 days or so. 1 API and 1 DB on
> each computer, MGM on a separate computer. Each night one of the nodes
> dies. Always the same node. The cluster is intact since the second node
> survives, and I restart the crashed node and it rejoins the cluster with
> no fuss. Obviously, I can't have this behaviour in production and would
> like to find the cause.
>
> These are just test boxes with 512MB RAM, so it could be underpowered
> hardware, but I was wondering if there were logs or trace files I could
> provide that would help determine the source of the nightly crash.
>
>
> What the console shows on BOX2:
>
>
> [root@BOX2 2.ndb_db]# Ndb kernel is stuck in: Job Handling
> Ndb kernel is stuck in: Job Handling
> Error handler shutting down system
> Error handler shutdown completed - exiting
>
>
> What ndb/cluster.log says:
>
> <...>
> 2004-07-28 17:59:02 [MgmSrvr] INFO     -- Node 3: Local checkpoint 174 started. Keep GCI = 260315 oldest restorable GCI = 248165
> 2004-07-28 18:58:55 [MgmSrvr] INFO     -- Node 3: Local checkpoint 175 started. Keep GCI = 262051 oldest restorable GCI = 248165
> 2004-07-28 19:58:47 [MgmSrvr] INFO     -- Node 3: Local checkpoint 176 started. Keep GCI = 263786 oldest restorable GCI = 248165
> 2004-07-28 20:55:03 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 2
> 2004-07-28 20:55:04 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 3
> 2004-07-28 20:55:05 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
> 2004-07-28 20:55:05 [MgmSrvr] INFO     -- Lost connection to node 2
> 2004-07-28 20:55:06 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 4
> 2004-07-28 20:55:06 [MgmSrvr] ALERT    -- Node 3: Node 2 declared dead due to missed heartbeat
> 2004-07-28 20:55:06 [MgmSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
> 2004-07-28 20:55:06 [MgmSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
> 2004-07-28 20:55:06 [MgmSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
> 2004-07-28 20:55:06 [MgmSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=1eab00020908f20c]
> 2004-07-28 20:55:07 [MgmSrvr] WARNING  -- Node 3: Node 12 missed heartbeat 2
> 2004-07-28 20:55:08 [MgmSrvr] WARNING  -- Node 3: Node 12 missed heartbeat 3
> 2004-07-28 20:55:10 [MgmSrvr] WARNING  -- Node 3: Node 12 missed heartbeat 4
> 2004-07-28 20:55:10 [MgmSrvr] ALERT    -- Node 3: Node 12 declared dead due to missed heartbeat
> 2004-07-28 20:58:27 [MgmSrvr] INFO     -- Node 3: Local checkpoint 177 started. Keep GCI = 265522 oldest restorable GCI = 248165
> 2004-07-28 21:53:06 [MgmSrvr] INFO     -- Node 3: Local checkpoint 178 started. Keep GCI = 267250 oldest restorable GCI = 248165
> <...>
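
To compare the crash times across several nights, the heartbeat entries can be pulled out of cluster.log with a short script. The Python sketch below is only an illustration. It assumes each entry sits on one line in the file on disk (the excerpt above was wrapped by the mail client) and that the entries follow the layout shown in the excerpt. The function name heartbeat_events is made up for this example.

```python
import re

# Layout taken from the excerpt: "<date> <time> [<source>] <LEVEL>  -- <message>"
ENTRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<source>\w+)\] (?P<level>\w+)\s+-- (?P<msg>.*)$"
)
MISSED_RE = re.compile(r"Node \d+: Node (?P<node>\d+) missed heartbeat (?P<count>\d+)")
DEAD_RE = re.compile(r"Node \d+: Node (?P<node>\d+) declared dead")


def heartbeat_events(lines):
    """Return (timestamp, node_id, kind, count) tuples for heartbeat entries.

    kind is "missed" or "dead"; count is None for "dead" entries.
    Lines that do not match the assumed layout are skipped.
    """
    events = []
    for line in lines:
        entry = ENTRY_RE.match(line.strip())
        if entry is None:
            continue
        msg = entry.group("msg")
        missed = MISSED_RE.search(msg)
        if missed:
            events.append((entry.group("ts"), int(missed.group("node")),
                           "missed", int(missed.group("count"))))
            continue
        dead = DEAD_RE.search(msg)
        if dead:
            events.append((entry.group("ts"), int(dead.group("node")), "dead", None))
    return events


if __name__ == "__main__":
    # The path is an assumption; point it at your own cluster.log.
    with open("ndb/cluster.log", errors="replace") as log:
        for event in heartbeat_events(log):
            print(event)
```

If the "dead" timestamps land at the same time every night, that would be a hint to look at what else runs on that box at that hour.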
>
> Any ideas? Any other places to look?
>
>
> Thanks in advance.
>
>
> -- Jim
Mikael Ronström, Senior Software Architect
MySQL AB, www.mysql.com

Clustering:
http://www.infoworld.com/article/04/04/14/HNmysqlcluster_1.html

http://www.eweek.com/article2/0,1759,1567546,00.asp

