From: Jim Hoadley
Date: July 22 2004 12:13am
Subject: Re: API loses data during node restarts
List-Archive: http://lists.mysql.com/cluster/175
Message-Id: <20040722001311.76521.qmail@web41904.mail.yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Mikael and Johan --

I made the change to HeartbeatIntervalDbApi and saw no improvement.

I reinstalled mysql-4.1.3-beta-nightly-20040628 and the problem began
occurring on both nodes. So my earlier statement is no longer true:

> > I consistently see this behavior on one node but never on the other

Next, I deleted MySQL and the Cluster software and reinstalled both using
a new code snapshot: mysql-4.1.4-beta-nightly-20040721. Same results. It
happens for either node.

The MySQL server reports "ERROR 1015: Can't lock file (errno: 4009)" the
instant a DB node running on the same computer as the API is stopped, and
MySQL resumes serving data when the node comes back up.

So I'm looking for help, again. I reported this as Bug#4641. Thanks.

-- Jim

--- Mikael_Ronström wrote:

> Hi Jim,
> When you say that it only happens to the mysql server executing on the
> same machine as the crashing node, it gives me an idea that it might be
> a timing issue. It could be that your machine is so busy with writing
> error logs stopping the storage node that the mysql server on the same
> machine doesn't get a chance to send heartbeats.
>
> If this is true you should change the config parameter
> HeartbeatIntervalDbApi. Its default setting is 1500 milliseconds,
> which means that a heartbeat must be responded to in 4.5-6.0 seconds.
> Normally this is not a problem but if the machine is very busy and
> starts swapping in and out processes then it can sometimes be a
> problem. So you could always try with setting this to something like
> 10000 milliseconds and see if that helps.
>
> Rgrds Mikael
>
> On 2004-07-20 at 02:22, Jim Hoadley wrote:
>
> > Mikael --
> >
> > Yes, the MySQL server reports "ERROR 1015: Can't lock file (errno:
> > 4009)" the instant a database node running on the same computer as
> > the API is stopped.
> >
> > In thinking about this problem please remember that I consistently
> > see this behavior on one node but never on the other, and that the
> > MySQL server API always resumes serving cluster data once the
> > database node starts back up.
> >
> > I will follow your suggestion and file a bug report. Thanks!
> >
> > -- Jim
> >
> >
> > --- Mikael_Ronström wrote:
> >> Hi Jim,
> >> The error code you get, 4009, means that the mysql server believes
> >> that the cluster is down.
> >> It is related to the bug reported by Johan but not the same, so I
> >> suggest you go ahead and file a new bug report to ensure that it
> >> gets priority in development.
> >>
> >> From what I could see in your email it seems as if one of the mysqld
> >> processes finds the cluster down (in error), which is reported with
> >> error code 4009. Then immediately after doing so the mysql server
> >> itself fails. Does that analysis seem correct?
> >>
> >> Rgrds Mikael
> >>
> >> On 2004-07-20 at 00:36, Jim Hoadley wrote:
> >>
> >>> Johan --
> >>>
> >>> Thanks for the fast response! I read bug report 4585. It says:
> >>>
> >>> - Description:
> >>> - If entire DB cluster goes down, then the mysqld servers should
> >>> - retry connecting to the DB. The mysql servers must not give up
> >>> - trying to reconnect to DB nodes.
> >>> -
> >>> - If the mysqld is not restarted after a cluster restart and a
> >>> - query is executed on that mysqld, then the mysqld will crash.
> >>> - Not so nice.
> >>> -
> >>> - How to repeat:
> >>> - 1. restart cluster
> >>> - 2. issue a query on one mysqld server
> >>> -
> >>> - Suggested fix:
> >>> - Let there be a configurable option (--ndbcluster_timeout) for
> >>> - how long the mysqld should try to reconnect to the db nodes.
> >>> - --ndbcluster_timeout={0,0x7fffffff} and let 0 be retry forever.
> >>>
> >>> Not sure we're talking about the same issue. I'm not taking the
> >>> entire cluster down, just one of the nodes. In that case, shouldn't
> >>> the API seamlessly and instantly read from another node?
> >>>
> >>> 1) I have a 2-node cluster with 2 replicas, with an API running on
> >>>    each node.
> >>> 2) I run a shell script that connects to the first API and executes
> >>>    one SELECT query per second. I can stop either DB node and
> >>>    everything still works.
> >>> 3) I run the same script against the second API. I can stop the DB
> >>>    node on the *other* computer, but if I stop the DB node on the
> >>>    same computer that the API is running on, mysqld reports it
> >>>    can't get a lock on the data file until the node comes back up.
> >>> 4) When the node is started again the API begins answering queries
> >>>    again.
> >>>
> >>> Comments? Thanks again for taking the time to look at my problem.
> >>>
> >>> -- Jim
> >>>
> >>>
> >>> --- Johan Andersson wrote:
> >>>> Hi,
> >>>> A bug report (4585) relating to this has been filed.
> >>>> Sorry for your inconvenience,
> >>>>
> >>>> b.r,
> >>>> Johan Andersson
> >>>>
> >>>> Devananda wrote:
> >>>>
> >>>>> I've been experiencing this same general problem, but haven't
> >>>>> tried to narrow it down to a reproducible pattern. Seems to
> >>>>> happen in relation to restarting a DB node, like Jim said.
> >>>>>
> >>>>> Jim Hoadley wrote:
> >>>>>
> >>>>>> When I stop/start or restart a database node, the API (MySQL
> >>>>>> server) loses connection with the data until the node comes
> >>>>>> back online.
> >>>>>> This only happens on one of my 2 nodes (BOX2). The other (BOX1)
> >>>>>> is fine. Been puzzling over this for a week or so. Something I
> >>>>>> missed? Please forward any suggestions. Details below.
> >>>>>>
> >>>>>> BOX1 = Pentium III/1000MHz/512MB RAM
> >>>>>> BOX2 = Pentium III/600MHz/512MB RAM
> >>>>>> Both running mysql-4.1.3-beta-nightly-20040628.tar.gz.
> >>>>>> Not a lot of RAM but only using a tiny test database at this
> >>>>>> point.
> >>>>>> Running the MGM on a separate computer (BOX4) to help isolate
> >>>>>> problem.
> >>>>>>
> >>>>>> Connected to BOX1, issue SELECT against test.simpsons and get
> >>>>>> proper response:
> >>>>>>
> >>>>>> ----------------------------------------
> >>>>>> mysql> select * from simpsons ;
> >>>>>> +----+------------+
> >>>>>> | id | first_name |
> >>>>>> +----+------------+
> >>>>>> |  2 | Lisa       |
> >>>>>> |  4 | Homer      |
> >>>>>> |  5 | Maggie     |
> >>>>>> |  3 | Marge      |
> >>>>>> |  1 | Bart       |
> >>>>>> +----+------------+
> >>>>>> 5 rows in set (0.03 sec)
> >>>>>> ----------------------------------------
> >>>>>>
> >>>>>> Stop node 3 on BOX1. SELECT now fails:
> >>>>>>
> >>>>>> ----------------------------------------
> >>>>>> mysql> select * from simpsons ;
> >>>>>> ERROR 1015: Can't lock file (errno: 4009)
> >>>>>> ----------------------------------------
> >>>>>>
> >>>>>> Repeating SELECT fails:
> >>>>>>
> >>>>>> ----------------------------------------
> >>>>>> mysql> select * from simpsons ;
> >>>>>> ERROR 2013: Lost connection to MySQL server during query
> >>>>>> ----------------------------------------
> >>>>>>
> >>>>>> Repeating SELECT fails again, then succeeds after node 3 is
> >>>>>> restarted:
> >>>>>>
> >>>>>> ----------------------------------------
> >>>>>> mysql> select * from simpsons ;
> >>>>>> ERROR 2006: MySQL server has gone away
> >>>>>> No connection. Trying to reconnect...
> >>>>>> Connection id: 1
> >>>>>> Current database: test
> >>>>>>
> >>>>>> +----+------------+
> >>>>>> | id | first_name |
> >>>>>> +----+------------+
> >>>>>> |  2 | Lisa       |
> >>>>>> |  4 | Homer      |
> >>>>>> |  5 | Maggie     |
> >>>>>> |  3 | Marge      |

=== message truncated ===
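
For reference, the heartbeat figures Mikael quotes in this thread (a 1500 millisecond default for HeartbeatIntervalDbApi giving a 4.5-6.0 second response window) work out to three to four heartbeat intervals. The sketch below only reproduces that arithmetic. The function name and the 3x/4x multipliers are assumptions inferred from those two figures, not taken from the MySQL Cluster source or documentation.

```python
# Sketch of the response window implied by HeartbeatIntervalDbApi.
# Assumption: the window spans 3 to 4 heartbeat intervals, inferred from
# the thread's figures (1500 ms -> 4.5-6.0 s). Hypothetical helper only.

def heartbeat_window_ms(interval_ms, min_missed=3, max_missed=4):
    """Return (earliest, latest) time in ms within which a heartbeat
    must be answered, for a given heartbeat interval in ms."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return interval_ms * min_missed, interval_ms * max_missed


if __name__ == "__main__":
    # 1500 is the default quoted in the thread; 10000 is the suggested value.
    for interval in (1500, 10000):
        low, high = heartbeat_window_ms(interval)
        print(f"{interval} ms -> {low / 1000:.1f}-{high / 1000:.1f} s")
```

Under that assumption, the suggested 10000 millisecond setting would stretch the window to roughly 30-40 seconds. Since the error is reported the instant the DB node is stopped, and raising the interval brought no improvement, a missed-heartbeat timeout alone does not seem to explain the behavior.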