List: Cluster
From: Adam Dixon
Date: April 28 2006 7:01am
Subject: Re: Rolling restart full Crash 5.0.15
Hi guys,
This is now a fixed bug: http://bugs.mysql.com/bug.php?id=18118
I am now looking at upgrading from 5.0.15 to 5.0.20a so that this bug
can no longer affect our cluster. I have two questions.
First, how do I work around this bug during the upgrade process itself,
e.g. ensure that the buffer overflow does not occur when I shut the
elected master node down? Will minimising transactions on the cluster
do this?
Second, is there anything special to look out for that may not work
correctly when going from 5.0.15 to 5.0.20a? I tested this on a small
2-ndbd cluster and it seemed to work fine; I am just wondering whether
there are any hidden gotchas in jumping 5 minor releases in one go.

Thanks for any clarifications,
Adam
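
As a rough sketch only (not something tested against a live cluster), the
one-node-at-a-time sequence described in the quoted message below could be
written out like this. The function name rolling_restart_plan and the
(where, command) tuple layout are hypothetical; the command strings mirror
what was typed in the ndb_mgm console and at the shell in that message. It
only builds the list of steps. It does not run anything, and it does nothing
to avoid the bug itself.

```python
from typing import List, Tuple


def rolling_restart_plan(node_ids: List[int]) -> List[Tuple[str, str]]:
    """Build the ordered steps for restarting data nodes one at a time.

    Each step is a (where, command) pair: "ndb_mgm" means the command is
    typed in the management console, "shell" means it is run on the host
    of the node being restarted. Nothing is executed here.
    """
    steps: List[Tuple[str, str]] = []
    for node in node_ids:
        # Stop the node and wait for the shutdown to complete.
        steps.append(("ndb_mgm", f"{node} stop"))
        # Start the data node again on its own host (no initial start).
        steps.append(("shell", "ndbd"))
        # Check that the node reports as started before moving on.
        steps.append(("ndb_mgm", "all status"))
    return steps


if __name__ == "__main__":
    # Same order as in the quoted message: 18 down to 11.
    for where, command in rolling_restart_plan(list(range(18, 10, -1))):
        print(f"[{where}] {command}")
```

Keeping the plan as plain data makes it easy to review the order before
anyone touches the cluster.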




On 3/9/06, Jonas Oreland <jonas@stripped> wrote:
> Adam Dixon wrote:
> > This crash applies to 5.0.15. I know I should upgrade, as a lot of bugs
> > are fixed in newer releases, but I thought I would post my occurrence
> > here in the hope that this 'bug' or situation gets fixed in future
> > releases.
> >
> > The purpose of this rolling stop/start was to increase
> > MaxNoOfConcurrentOperations. From what I have read, this should only
> > have involved a stop and start of each node, so no initial restart
> > required (I hope).
> >
> > There were active connections and queries running on the cluster
> > during this process too (in case that is supposed to affect the
> > behaviour of a failing node).
>
> Hi,
>
> I analyzed what happened... and found an "embarrassing" bug.
> When a master node fails, the node taking over prints a lot of
> information into the cluster log.
>
> In your case it wrote so much that the internal buffers overflowed.
>
> Please file a bug report, and include the trace files from nodes 17 and 18.
>
> ok?
>
> /Jonas
>
> >
> > # diff config.ini config.ini.20050609
> > 5c5
> > < MaxNoOfConcurrentOperations=250000
> > ---
> > > MaxNoOfConcurrentOperations=200000
> >
> >
> > This configuration doesn't seem to have caused a problem, as the cluster
> > is running fine after starting all the nodes again.
> >
> > I have an 8 NDBD node (8 machines), 2-replica setup, each node with 6 GB
> > of memory.
> > I proceeded to go from node ID 18 through to 11:
> > X stop (and wait for it to stop)
> > X# ndbd (and wait for it to start)
> > Where X is 18, 17, 16, 15, 14, 13, 12, 11*
> > *11: This was the last node, and triggered the full crash.
> >
> > <<< Here is a paste from inside the mgm console >>>
> > ndb_mgm> 12 stop
> > Node 12: Node shutdown initiated
> > Node 12 has shutdown.
> >
> > ndb_mgm> Node 12: Node shutdown completed.
> >
> > ndb_mgm> all status;
> > Node 11: started (Version 5.0.15)
> > Node 12: not connected
> >
> > ndb_mgm> Node 12: Started (version 5.0.15)
> >
> > ndb_mgm> 11 stop
> > Node 11: Node shutdown initiated
> > Node 11 has shutdown.
> >
> > ndb_mgm> Node 11: Node shutdown completed.
> > Node 18: Forced node shutdown completed. Initiated by signal 0. Caused by error
> > 2334: 'Job buffer congestion(Internal error, programming error or missing error
> > message, please report a bug). Temporary error, restart node'.
> > Node 17: Forced node shutdown completed. Initiated by signal 0. Caused by error
> > 2334: 'Job buffer congestion(Internal error, programming error or missing error
> > message, please report a bug). Temporary error, restart node'.
> > Node 16: Forced node shutdown completed. Initiated by signal 0. Caused by error
> > 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration
> > error). Temporary error, restart node'.
> > Node 15: Forced node shutdown completed. Initiated by signal 0. Caused by error
> > 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration
> > error). Temporary error, restart node'.
> > Node 14: Forced node shutdown completed. Initiated by signal 0. Caused by error
> > 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration
> > error). Temporary error, restart node'.
> > Node 13: Forced node shutdown completed. Initiated by signal 0. Caused by error
> > 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration
> > error). Temporary error, restart node'.
> > Node 12: Forced node shutdown completed. Initiated by signal 0. Caused by error
> > 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration
> > error). Temporary error, restart node'.
> >
> > <<< end >>>
> >
> > I have uploaded all the log, out and trace files available in the ndbd
> > directories to http://adixon.adam.com.au/mysql/crash20060309/
> > The ndb_mgm.log file is there too; it includes node 12's restart,
> > then node 11's stop, then the arbitrary shutdown of all other nodes.
> >
> > Most notably, node 11 has no error log entry or trace file, as it was
> > actually shut down cleanly; the rest of the nodes do have them.
> >
> >
> > Is it worth lodging a bug report or anything like that? And/or did I
> > do anything wrong?
> >
> > Can anyone help here?
> >
> > Adam
> >
>
>
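
For anyone reading a similar console paste, a small sketch like the one
below could group the forced shutdowns by error code, which makes it quicker
to see which nodes hit the job buffer congestion (2334) and which only
followed with an arbitration error (2305). The names SHUTDOWN_RE and
summarize_forced_shutdowns are hypothetical, and the pattern assumes the
message wording shown in the paste above with the "> > " quote prefixes
already stripped; other versions may word these lines differently.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches "Node N: Forced node shutdown completed. ... Caused by error CODE:"
# even when the console wrapped the line between "error" and the code.
SHUTDOWN_RE = re.compile(
    r"Node (\d+): Forced node shutdown completed\..*?Caused by error\s+(\d+):",
    re.DOTALL,
)


def summarize_forced_shutdowns(console_text: str) -> Dict[int, List[int]]:
    """Map each error code to the sorted list of node IDs that reported it."""
    by_error: Dict[int, List[int]] = defaultdict(list)
    for node_id, error_code in SHUTDOWN_RE.findall(console_text):
        by_error[int(error_code)].append(int(node_id))
    return {code: sorted(nodes) for code, nodes in by_error.items()}


# Three entries copied from the paste above, quote prefixes removed.
SAMPLE = """\
Node 18: Forced node shutdown completed. Initiated by signal 0. Caused by error
2334: 'Job buffer congestion(Internal error, programming error or missing error
message, please report a bug). Temporary error, restart node'.
Node 17: Forced node shutdown completed. Initiated by signal 0. Caused by error
2334: 'Job buffer congestion(Internal error, programming error or missing error
message, please report a bug). Temporary error, restart node'.
Node 16: Forced node shutdown completed. Initiated by signal 0. Caused by error
2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration
error). Temporary error, restart node'.
"""

if __name__ == "__main__":
    print(summarize_forced_shutdowns(SAMPLE))
    # prints {2334: [17, 18], 2305: [16]}
```

Clean shutdowns such as "Node 11: Node shutdown completed." are not matched,
so they do not appear in the summary.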