List: Cluster
From: Mikael Ronström
Date: July 20 2004 9:45am
Subject: Re: API loses data during node restarts
Hi Jim,
When you say that it only happens to the mysql server executing on the
same machine as the crashing node, it gives me an idea that it might be
a timing issue. It could be that your machine is so busy writing error
logs and stopping the storage node that the mysql server on the same
machine doesn't get a chance to send heartbeats.

If this is true you should change the config parameter
HeartbeatIntervalDbApi. Its default setting is 1500 milliseconds, which
means that a heartbeat must be responded to within 4.5-6.0 seconds.
Normally this is not a problem, but if the machine is very busy and
starts swapping processes in and out then it can sometimes be a problem.
So you could try setting this to something like 10000 milliseconds and
see if that helps.
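
As a rough illustration of the arithmetic above, here is a small Python
sketch. It assumes the 4.5-6.0 second window corresponds to three to four
heartbeat intervals, which is inferred from the figures quoted for the
1500 ms default and is not taken from the cluster source. The function
name heartbeat_window_ms and its parameters are hypothetical.

```python
def heartbeat_window_ms(interval_ms, min_missed=3, max_missed=4):
    """Return the (shortest, longest) time in ms before a silent API node
    would be treated as dead.

    Assumption: the window spans 3 to 4 heartbeat intervals, inferred
    from the 4.5-6.0 s figure quoted for the 1500 ms default.
    """
    return interval_ms * min_missed, interval_ms * max_missed


# Compare the default setting with the suggested larger value.
for interval in (1500, 10000):
    low, high = heartbeat_window_ms(interval)
    print(f"HeartbeatIntervalDbApi={interval} ms -> "
          f"{low / 1000:.1f}-{high / 1000:.1f} s")
```

Under that assumption, raising the interval to 10000 ms would give a busy
mysql server roughly 30-40 seconds to respond instead of 4.5-6.0 seconds.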

Rgrds Mikael

On 2004-07-20 at 02:22, Jim Hoadley wrote:

> Mikael --
>
> Yes, the MySQL server reports "ERROR 1015: Can't lock file (errno: 4009)"
> the instant a database node running on the same computer as the API is
> stopped.
>
> In thinking about this problem please remember that I consistently see
> this behavior on one node but never on the other, and that the MySQL
> server API always resumes serving cluster data once the database node
> starts back up.
>
> I will follow your suggestion and file a bug report. Thanks!
>
> -- Jim
>
>
> --- Mikael Ronström <mikael@stripped> wrote:
>> Hi Jim,
>> The error code you get, 4009, means that the mysql server believes that
>> the cluster is down.
>> It is related to the bug reported by Johan but not the same, so I
>> suggest you go ahead and file a new bug report to ensure that it gets
>> priority in development.
>>
>> From what I could see in your email it seems as if one of the mysqld
>> processes finds the cluster down (in error), which is reported with
>> error code 4009. Then immediately after doing so the mysql server
>> itself fails. Does that analysis seem correct?
>>
>> Rgrds Mikael
>>
>> On 2004-07-20 at 00:36, Jim Hoadley wrote:
>>
>>> Johan --
>>>
>>> Thanks for the fast response! I read bug report 4585. It says:
>>>
>>> -   Description:
>>> -   If entire DB cluster goes down, then the mysqld servers should retry
>>> -   connecting to the DB. The mysql servers must not give up trying to
>>> -   reconnect to DB nodes.
>>> -
>>> -   If the mysqld is not restarted after a cluster restart and a query is
>>> -   executed on that mysqld, then the mysqld will crash. Not so nice.
>>> -
>>> -   How to repeat:
>>> -   1. restart cluster
>>> -   2. issue a query on one mysqld server
>>> -
>>> -   Suggested fix:
>>> -   Let there be a configurable option (--ndbcluster_timeout) for how long
>>> -   the mysqld should try to reconnect to the db nodes.
>>> -   --ndbcluster_timeout={0,0x7fffffff} and let 0 be retry forever.
>>>
>>> Not sure we're talking about the same issue. I'm not taking the entire
>>> cluster down, just one of the nodes. In that case, shouldn't the API
>>> seamlessly and instantly read from another node?
>>>
>>> 1) I have a 2-node cluster with 2 replicas, with an API running on each
>>>    node.
>>> 2) I run a shell script that connects to the first API and executes one
>>>    SELECT query per second. I can stop either DB node and everything
>>>    still works.
>>> 3) I run the same script against the second API. I can stop the DB node
>>>    on the *other* computer, but if I stop the DB node on the same
>>>    computer that the API is running on, mysqld reports it can't get a
>>>    lock on the data file until the node comes back up.
>>> 4) When the node is started again the API begins answering queries
>>>    again.
>>>
>>> Comments? Thanks again for taking the time to look at my problem.
>>>
>>> -- Jim
>>>
>>>
>>> --- Johan Andersson <johan@stripped> wrote:
>>>> Hi,
>>>> A bug report (4585) relating to this has been filed.
>>>> Sorry for your inconvenience,
>>>>
>>>> b.r,
>>>> Johan Andersson
>>>>
>>>> Devananda wrote:
>>>>
>>>>> I've been experiencing this same general problem, but haven't tried to
>>>>> narrow it down to a reproducible pattern. Seems to happen in relation
>>>>> to restarting a DB node, like Jim said.
>>>>>
>>>>> Jim Hoadley wrote:
>>>>>
>>>>>> When I stop/start or restart a database node, the API (MySQL server)
>>>>>> loses connection with the data until the node comes back online. This
>>>>>> only happens on one of my 2 nodes (BOX2). The other (BOX1) is fine.
>>>>>> Been puzzling over this for a week or so. Something I missed? Please
>>>>>> forward any suggestions. Details below.
>>>>>>
>>>>>> BOX1 = Pentium III/1000MHz/512MB RAM
>>>>>> BOX2 = Pentium III/600MHz/512MB RAM
>>>>>> Both running mysql-4.1.3-beta-nightly-20040628.tar.gz.
>>>>>> Not a lot of RAM but only using a tiny test database at this point.
>>>>>> Running the MGM on a separate computer (BOX4) to help isolate problem.
>>>>>>
>>>>>> Connected to BOX1, issue SELECT against test.simpsons and get proper
>>>>>> response:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> mysql> select * from simpsons ;
>>>>>> +----+------------+
>>>>>> | id | first_name |
>>>>>> +----+------------+
>>>>>> |  2 | Lisa       |
>>>>>> |  4 | Homer      |
>>>>>> |  5 | Maggie     |
>>>>>> |  3 | Marge      |
>>>>>> |  1 | Bart       |
>>>>>> +----+------------+
>>>>>> 5 rows in set (0.03 sec)
>>>>>> ----------------------------------------
>>>>>>
>>>>>> Stop node 3 on BOX1. SELECT now fails:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> mysql> select * from simpsons ;
>>>>>> ERROR 1015: Can't lock file (errno: 4009)
>>>>>> ----------------------------------------
>>>>>>
>>>>>> Repeating SELECT fails:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> mysql> select * from simpsons ;
>>>>>> ERROR 2013: Lost connection to MySQL server during query
>>>>>> ----------------------------------------
>>>>>>
>>>>>> Repeating SELECT fails again, then succeeds after node 3 is
>>>>>> restarted:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> mysql> select * from simpsons ;
>>>>>> ERROR 2006: MySQL server has gone away
>>>>>> No connection. Trying to reconnect...
>>>>>> Connection id:    1
>>>>>> Current database: test
>>>>>>
>>>>>> +----+------------+
>>>>>> | id | first_name |
>>>>>> +----+------------+
>>>>>> |  2 | Lisa       |
>>>>>> |  4 | Homer      |
>>>>>> |  5 | Maggie     |
>>>>>> |  3 | Marge      |
>>>>>> |  1 | Bart       |
>>>>>> +----+------------+
>>>>>> 5 rows in set (6.55 sec)
>>>>>> ----------------------------------------
>>>>>>
>>>>>> All data is intact. BTW new records added to node 2 on BOX2 while
>>>>>> node 3 on BOX1 is down show up (this is good).
>>>>>>
>>>>>> Here's what restarting node 3 on BOX1 with mgmd looks like (looks
>>>>>> right to me):
>>>>>>
>>>>>> ----------------------------------------
>>>>>> NDB> show
>>>>>> Cluster Configuration
>>>>>> ---------------------
>>>>>> 2 NDB Node(s)
>>>>>> DB node:        2  (Version: 3.5.0)
>>>>>> DB node:        3  (Version: 3.5.0)
>>>>>>
>>>>>> 4 API Node(s)
>>>>>> API node:       11  (not connected)
>>>>>> API node:       12  (Version: 3.5.0)
>>>>>> API node:       13  (not connected)
>>>>>> API node:       14  (not connected)
>>>>>>
>>>>>> 1 MGM Node(s)
>>>>>> MGM node:       1  (Version: 3.5.0)
>>>>>>
>>>>>> NDB> 2 restart
>>>>>> Executing RESTART on node 2.
>>>>>> Database node 2 is being restarted.
>>>>>>
>>>>>> NDB> 2 - endTakeOver
>>>>>> ----------------------------------------
>>>>>>
>>>>>> Here is the MySQL server error log output on BOX1 as node 3 is
>>>>>> restarted:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> 040713 10:53:31  mysqld started
>>>>>> 040713 10:53:32  InnoDB: Started; log sequence number 0 44112
>>>>>> /usr/local/mysql/libexec/mysqld: ready for connections.
>>>>>> Version: '4.1.3-beta-nightly-20040628-log'  socket: '/tmp/mysql.sock'  port: 3306
>>>>>> 2004-07-19 11:39:15 [NDB] INFO     -- Node shutdown initiated
>>>>>> mysqld got signal 11;
>>
> === message truncated ===
Mikael Ronström, Senior Software Architect
MySQL AB, www.mysql.com
Office: +46 70 2646363

Clustering:
http://www.infoworld.com/article/04/04/14/HNmysqlcluster_1.html

http://www.eweek.com/article2/0,1759,1567546,00.asp

