From: Mikael Ronström
Date: July 29 2004 8:01pm
Subject: Re: nightly crashing
List-Archive: http://lists.mysql.com/cluster/217
Message-Id: <17A50B6A-E19A-11D8-9C16-000A959312A2@mysql.com>

Hi Jim,

The ndbd node dies because the watchdog thread kills it. Most likely this
is due to the ndbd main thread entering an infinite loop.

Check error.log to see which trace file was produced, and send the most
interesting parts of that trace file. The trace file should basically
point out where the ndbd process got stuck, if the code was written
properly.

The trace file starts with the jump address memory, which records the
line numbers in the modules where the code executed. Send this part
together with the first few signals recorded, which are the last signals
executed before the crash.

Rgrds Mikael

PS: You'll find those files in the directory where the ndbd process
executes.

On 2004-07-29 at 18:33, Jim Hoadley wrote:

> I've been running a 2-node cluster for 10 days or so. 1 API and 1 DB on
> each computer, MGM on a separate computer. Each night one of the nodes
> dies. Always the same node. The cluster is intact since the second node
> survives, and I restart the crashed node and it rejoins the cluster with
> no fuss. Obviously, can't have this behaviour in production and would
> like to find the cause.
>
> These are just test boxes with 512MB RAM, so it could be underpowered
> hardware, but I was wondering if there were logs or trace files I could
> provide that would help determine the source of the nightly crash.
>
> What the console shows on BOX2:
>
> [root@BOX2 2.ndb_db]# Ndb kernel is stuck in: Job Handling
> Ndb kernel is stuck in: Job Handling
> Error handler shutting down system
> Error handler shutdown completed - exiting
>
> What ndb/cluster.log says:
>
> <...>
> 2004-07-28 17:59:02 [MgmSrvr] INFO -- Node 3: Local checkpoint 174 started.
> Keep GCI = 260315 oldest restorable GCI = 248165
> 2004-07-28 18:58:55 [MgmSrvr] INFO -- Node 3: Local checkpoint 175 started.
> Keep GCI = 262051 oldest restorable GCI = 248165
> 2004-07-28 19:58:47 [MgmSrvr] INFO -- Node 3: Local checkpoint 176 started.
> Keep GCI = 263786 oldest restorable GCI = 248165
> 2004-07-28 20:55:03 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
> 2004-07-28 20:55:04 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 3
> 2004-07-28 20:55:05 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
> 2004-07-28 20:55:05 [MgmSrvr] INFO -- Lost connection to node 2
> 2004-07-28 20:55:06 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 4
> 2004-07-28 20:55:06 [MgmSrvr] ALERT -- Node 3: Node 2 declared dead due to
> missed heartbeat
> 2004-07-28 20:55:06 [MgmSrvr] ALERT -- Node 3: Network partitioning -
> arbitration required
> 2004-07-28 20:55:06 [MgmSrvr] INFO -- Node 3: President restarts
> arbitration thread [state=7]
> 2004-07-28 20:55:06 [MgmSrvr] ALERT -- Node 3: Arbitration won - positive
> reply from node 1
> 2004-07-28 20:55:06 [MgmSrvr] INFO -- Node 3: Started arbitrator node 1
> [ticket=1eab00020908f20c]
> 2004-07-28 20:55:07 [MgmSrvr] WARNING -- Node 3: Node 12 missed heartbeat 2
> 2004-07-28 20:55:08 [MgmSrvr] WARNING -- Node 3: Node 12 missed heartbeat 3
> 2004-07-28 20:55:10 [MgmSrvr] WARNING -- Node 3: Node 12 missed heartbeat 4
> 2004-07-28 20:55:10 [MgmSrvr] ALERT -- Node 3: Node 12 declared dead due to
> missed heartbeat
> 2004-07-28 20:58:27 [MgmSrvr] INFO -- Node 3: Local checkpoint 177 started.
> Keep GCI = 265522 oldest restorable GCI = 248165
> 2004-07-28 21:53:06 [MgmSrvr] INFO -- Node 3: Local checkpoint 178 started.
> Keep GCI = 267250 oldest restorable GCI = 248165
> <...>
>
> Any ideas? Any other places to look?
>
> Thanks in advance.
>
> -- Jim

Mikael Ronström, Senior Software Architect
MySQL AB, www.mysql.com

Clustering:
http://www.infoworld.com/article/04/04/14/HNmysqlcluster_1.html
http://www.eweek.com/article2/0,1759,1567546,00.asp
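
As a rough aid for narrowing down the crash window, the WARNING and ALERT entries in a cluster.log like the one quoted in this thread can be pulled out with a few lines of Python. This is only a sketch: the regular expression is inferred from the excerpt in this message, not from any documented log format, and the names parse_cluster_log and failure_events are hypothetical helpers, not part of MySQL Cluster.

```python
import re
from datetime import datetime

# Pattern inferred from the cluster.log excerpt in this thread only
# (assumption); other server versions may format lines differently.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<src>\w+)\] (?P<level>[A-Z]+)\s+-- (?P<msg>.*)$"
)


def parse_cluster_log(lines):
    """Yield (timestamp, level, message) for each line matching the pattern.

    Lines that do not start with a timestamp (for example the wrapped
    "Keep GCI = ..." continuation lines) are skipped.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            yield ts, m.group("level"), m.group("msg")


def failure_events(lines):
    """Return only the WARNING and ALERT entries, in log order."""
    return [e for e in parse_cluster_log(lines) if e[1] in ("WARNING", "ALERT")]


# A few entries copied from the excerpt in this thread, one per line.
SAMPLE = """\
2004-07-28 19:58:47 [MgmSrvr] INFO -- Node 3: Local checkpoint 176 started.
Keep GCI = 263786 oldest restorable GCI = 248165
2004-07-28 20:55:03 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2004-07-28 20:55:04 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 3
2004-07-28 20:55:05 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2004-07-28 20:55:05 [MgmSrvr] INFO -- Lost connection to node 2
""".splitlines()

if __name__ == "__main__":
    for ts, level, msg in failure_events(SAMPLE):
        print(ts.isoformat(sep=" "), level, msg)
```

The timestamp of the first WARNING gives a starting point for matching the crash against the trace file and error.log in the directory where the ndbd process executes.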