List: Cluster
From:Jonas Oreland Date:December 29 2004 5:20pm
Subject:Re: I have run into a hitch with the
Lewis Bergman wrote:
> Jonas Oreland wrote:
>> Lewis Bergman wrote:
>>> Back to the books. It looks like the stale session problem keeps an ndbd
>>> node from coming back immediately. How do all you gurus who know how to
>>> do this handle this situation without intervention?
>>> If I use --no-nodeid-checks, can I then use a connect string that
>>> includes the nodeid=<nodeid> parameter and avoid the stale session
>>> problem?


>> There is an ndb_mgm command: "PURGE STALE SESSIONS".
> I have seen that and used it to get the nodes back up and running.
> My question is more like this:
> What do you do to ensure that a cluster node disappearing in the middle
> of the night does not necessitate someone's manual intervention?
> I may have different goals for this than most of you. I have no one
> babysitting a large cluster 24/7. I would have to be alerted somehow and
> then wake up, log in, start mgm, PURGE STALE SESSIONS, log out. The main
> reason for me to have the cluster is to avoid such problems to the
> greatest extent possible.
> I want the thing to come back on and get back in the cluster so I can
> figure it out tomorrow instead of at two in the morning.
> That does bring up a good point, though. Does anyone have any
> scripts or (dare I say it) SNMP capability that the mgmd can react with?
> I guess I could write one to watch the cluster log or something. That
> doesn't look very friendly, though. At any rate, if anyone has thoughts
> on how you deal with this, please let me know.


The only case where "PURGE STALE SESSIONS" should be needed
is if a _computer/OS_ fails.

That will likely require manual intervention.

1) We're working on removing this problem:
    the need to use the command even for computer/OS failures.

2) As you suggested:
    If all nodes have a nodeid in their connect string and ndb_mgmd is
    started without nodeid checks (i.e. no dynamic node allocation),
    there should _never_ be a need for the command.

    And ndb should run 24/7 without any kind of manual intervention.

3) If you find a way to force the use of "purge" without an OS/computer
    failure, please file a bug report, and we'll try to fix it asap.


Jonas Oreland, Software Engineer