From:
Date: March 9 2006 9:09am
Subject: Re: Rolling restart full Crash 5.0.15
List-Archive: http://lists.mysql.com/cluster/3273
Message-Id: <440FE2D7.2020308@mysql.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Adam Dixon wrote:
> This crash applies to 5.0.15. I know I should upgrade, as a lot of bugs
> have been fixed in newer releases, but I thought I would post my occurrence
> here in the hope that this 'bug' or situation gets fixed in future
> releases.
>
> The purpose of this rolling stop/start was to increase
> MaxNoOfConcurrentOperations. From what I have read, this should
> only have involved a stop and start, so no initial restart required. (I hope)
>
> There were active connections and queries running on the cluster
> during this process as well (in case that is supposed to affect any failing
> node behaviour).

Hi,

I analyzed what happened... and found an "embarrassing" bug.

When a master node fails, the node taking over will print a lot of
output into the cluster log. In your case it wrote so much that the
internal buffers overflowed.

Please file a bug report, and include the trace files from nodes 17 & 18, ok?

/Jonas

>
> # diff config.ini config.ini.20050609
> 5c5
> < MaxNoOfConcurrentOperations=250000
> ---
> > MaxNoOfConcurrentOperations=200000
>
>
> This configuration doesn't seem to have caused a problem, as the cluster
> is running fine after starting all the nodes again.
>
> I have an 8 NDBD node (8 machines), 2-replica setup. Each node has 6 GB of memory.
> I proceeded to go from nodeid 18 through to 11:
> X stop (and wait for it to stop)
> X# ndbd (and wait for it to start)
> Where X is 18, 17, 16, 15, 14, 13, 12, 11*
> *11: This was the last node, and triggered the full crash.
>
> <<< Here is a paste from inside the mgm console >>>
> ndb_mgm> 12 stop
> Node 12: Node shutdown initiated
> Node 12 has shutdown.
>
> ndb_mgm> Node 12: Node shutdown completed.
>
> ndb_mgm> all status;
> Node 11: started (Version 5.0.15)
> Node 12: not connected
>
> ndb_mgm> Node 12: Started (version 5.0.15)
>
> ndb_mgm> 11 stop
> Node 11: Node shutdown initiated
> Node 11 has shutdown.
>
> ndb_mgm> Node 11: Node shutdown completed.
> Node 18: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2334: 'Job buffer congestion(Internal error, programming error or missing error
> message, please report a bug). Temporary error, restart node'.
> Node 17: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2334: 'Job buffer congestion(Internal error, programming error or missing error
> message, please report a bug). Temporary error, restart node'.
> Node 16: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
> Node 15: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
> Node 14: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
> Node 13: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
> Node 12: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
>
> <<< end >>>
>
> I have uploaded all the log, out and trace files available in the ndbd
> directories to http://adixon.adam.com.au/mysql/crash20060309/
> The ndb_mgm.log file is there too; it includes node 12's restart,
> then node 11's stop, then the forced shutdown of all other nodes.
>
> Most notably, node 11 has no error log entry or trace file, as it was
> actually shut down cleanly; the rest of the nodes do have them.
>
>
> Is it worth lodging a bug report or anything like that? And did I
> do anything wrong?
>
> Can anyone help here?
>
> Adam
>