List: Cluster
From: Jonas Oreland
Date: March 9 2006 9:09am
Subject: Re: Rolling restart full Crash 5.0.15
Adam Dixon wrote:
> This crash applies to 5.0.15. I know I should upgrade, as a lot of bugs
> have been fixed in newer releases, but I thought I would post my occurrence
> here in the hope that this 'bug' or situation gets fixed in future
> releases.
> 
> The purpose of this rolling stop/start was to increase
> MaxNoOfConcurrentOperations. From what I have read, this should
> only have involved a stop and start of each node, so no initial
> restart is required. (I hope)
> 
> There were active connections and queries running on the cluster
> during this process as well (in case that is supposed to affect the
> behaviour of a failing node).

Hi,

I analyzed what happened...and found an "embarrassing" bug.
When a master node fails, the taking over node will print a lot of stuff into
  the cluster log.

In your case it wrote so much that the internal buffers overflowed.
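
To illustrate the failure mode described above, here is a minimal sketch of a fixed-capacity buffer receiving a burst of log messages without being drained. It is a toy model only, not NDB's actual job buffer; the names `BoundedBuffer`, `BufferOverflow` and `log_burst`, and any capacities used with them, are made up for illustration.

```python
from collections import deque


class BufferOverflow(Exception):
    """Raised when a message arrives and the buffer has no free slot."""


class BoundedBuffer:
    """Toy fixed-capacity FIFO buffer (hypothetical, not NDB code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def put(self, item):
        # Refuse the message instead of growing past the fixed capacity.
        if len(self._items) >= self.capacity:
            raise BufferOverflow(f"buffer full at {self.capacity} items")
        self._items.append(item)

    def get(self):
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


def log_burst(buffer, count):
    """Enqueue `count` log messages without draining; return how many fit."""
    accepted = 0
    try:
        for i in range(count):
            buffer.put(f"takeover log line {i}")
            accepted += 1
    except BufferOverflow:
        # In this toy model the burst simply stops at the first overflow.
        pass
    return accepted
```

A burst smaller than the capacity is accepted in full; a larger one stops at the capacity, which is the point at which a real system has to block, drop messages, or fail.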

Please file a bug report, and include the trace files from nodes 17 and 18.

ok?

/Jonas

> 
> # diff config.ini config.ini.20050609
> 5c5
> < MaxNoOfConcurrentOperations=250000
> ---
> > MaxNoOfConcurrentOperations=200000
> 
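
As a rough illustration of checking which parameters differ between two versions of a config file, the sketch below compares `key=value` lines from two texts. It is a simplification under stated assumptions: the function name `changed_parameters` is hypothetical, `#` is treated as the comment character, and section headers are ignored, so a key that appears in several sections is collapsed to its last value.

```python
def changed_parameters(old_text, new_text):
    """Return {name: (old_value, new_value)} for key=value lines that differ."""

    def parse(text):
        params = {}
        for line in text.splitlines():
            # Assumption: '#' starts a comment; strip it and any whitespace.
            line = line.split("#", 1)[0].strip()
            # Skip section headers such as [NDBD DEFAULT] and blank lines.
            if "=" in line and not line.startswith("["):
                key, value = line.split("=", 1)
                params[key.strip()] = value.strip()
        return params

    old, new = parse(old_text), parse(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }


# Example texts are invented; only the changed parameter comes from the diff above.
old_cfg = "[NDBD DEFAULT]\nNoOfReplicas=2\nMaxNoOfConcurrentOperations=200000\n"
new_cfg = "[NDBD DEFAULT]\nNoOfReplicas=2\nMaxNoOfConcurrentOperations=250000\n"
print(changed_parameters(old_cfg, new_cfg))
```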
> 
> This configuration doesn't seem to have caused a problem, as the cluster
> is running fine after starting all the nodes again.
> 
> I have an 8-node NDBD (8 machines), 2-replica setup, each node with 6 GB of memory.
> I proceeded to go from nodeid 18 through to 11;
> X stop (and wait for it to stop)
> X# ndbd (and wait for it to start)
> Where X is 18, 17, 16, 15, 14, 13, 12, 11*
> *11: This was the last node, and triggered the full crash.
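
The procedure above can be written down as an ordered plan, which makes the sequence easier to review before typing it in. This is only a sketch: `rolling_restart_plan` is a hypothetical helper, the `host-of-node-N` labels are placeholders for the real machine names, and nothing here runs any command or waits for a node; it just lists the steps in order.

```python
def rolling_restart_plan(node_ids):
    """Return ordered (where, command) steps for a one-node-at-a-time restart.

    "mgm" steps are typed into the ndb_mgm console; "host-of-node-N" steps
    are run in a shell on the machine hosting data node N (placeholder name).
    """
    plan = []
    for node_id in node_ids:
        # Stop one data node and wait for the shutdown to complete.
        plan.append(("mgm", f"{node_id} stop"))
        # Start ndbd again on that node's machine (no initial restart).
        plan.append((f"host-of-node-{node_id}", "ndbd"))
        # Confirm the node reports "started" before moving to the next one.
        plan.append(("mgm", "all status"))
    return plan


# Same order as described above: 18 down to 11.
for where, command in rolling_restart_plan(range(18, 10, -1)):
    print(f"{where}: {command}")
```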
> 
> <<< Here is a paste from inside the mgm console >>>
> ndb_mgm> 12 stop
> Node 12: Node shutdown initiated
> Node 12 has shutdown.
> 
> ndb_mgm> Node 12: Node shutdown completed.
> 
> ndb_mgm> all status;
> Node 11: started (Version 5.0.15)
> Node 12: not connected
> 
> ndb_mgm> Node 12: Started (version 5.0.15)
> 
> ndb_mgm> 11 stop
> Node 11: Node shutdown initiated
> Node 11 has shutdown.
> 
> ndb_mgm> Node 11: Node shutdown completed.
> Node 18: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2334: 'Job buffer congestion(Internal error, programming error or missing error
> message, please report a bug). Temporary error, restart node'.
> Node 17: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2334: 'Job buffer congestion(Internal error, programming error or missing error
> message, please report a bug). Temporary error, restart node'.
> Node 16: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
> Node 15: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
> Node 14: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
> Node 13: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
> Node 12: Forced node shutdown completed. Initiated by signal 0. Caused by error
> 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)
> (Arbitration error). Temporary error, restart node'.
> 
> <<< end >>>
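
To see at a glance which nodes failed with which error, the console paste above can be grouped by error code. The following is a minimal sketch, assuming the message layout shown in the paste ("Node N: Forced node shutdown completed. ... Caused by error" followed by the code on the next line); the names `SHUTDOWN_RE` and `shutdowns_by_error` are hypothetical.

```python
import re
from collections import defaultdict

# Matches wrapped messages such as:
#   Node 18: Forced node shutdown completed. Initiated by signal 0. Caused by error
#   2334: 'Job buffer congestion(...
SHUTDOWN_RE = re.compile(
    r"Node (\d+): Forced node shutdown completed\..*?Caused by error\s+(\d+):",
    re.DOTALL,
)


def shutdowns_by_error(console_text):
    """Map error code -> sorted list of node ids that reported it."""
    # Drop mail quoting ("> ") so that wrapped lines join cleanly.
    cleaned = "\n".join(line.lstrip("> ") for line in console_text.splitlines())
    grouped = defaultdict(list)
    for node_id, error_code in SHUTDOWN_RE.findall(cleaned):
        grouped[int(error_code)].append(int(node_id))
    return {code: sorted(nodes) for code, nodes in grouped.items()}
```

Applied to the paste above, this should put nodes 17 and 18 under error 2334 and nodes 12 to 16 under error 2305, which matches the request for the trace files from nodes 17 and 18.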
> 
> I have uploaded all the log, out and trace files available in the ndbd
> directories to; http://adixon.adam.com.au/mysql/crash20060309/
> The ndb_mgm.log file is there too, which includes node 12's restart,
> then node 11's stop, then the arbitrary shutdown of all other nodes.
> 
> Most notably, node 11 has no error log entry or trace file, as it was
> actually shut down cleanly; the rest of the nodes, however, do.
> 
> 
> Is it worth lodging a bug report or anything like that? And/or did I
> do anything wrong?
> 
> Can anyone help here?
> 
> Adam
> 
