Hi Jim,
Since it only happens to the MySQL server running on the same machine
as the stopped node, I suspect it might be a timing issue. Your machine
may be so busy writing error logs while the storage node shuts down
that the MySQL server on the same machine doesn't get a chance to send
heartbeats.
If this is true, you should change the config parameter
HeartbeatIntervalDbApi. Its default setting is 1500 milliseconds,
which means that a heartbeat must be responded to within 4.5-6.0
seconds. Normally this is not a problem, but if the machine is very
busy and starts swapping processes in and out, it can sometimes be
one. So you could try setting this to something like 10000
milliseconds and see if that helps.
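
To make the arithmetic concrete, here is a small Python sketch. It
assumes the 4.5-6.0 second figure corresponds to three to four
heartbeat intervals; that multiplier is inferred from the numbers
above, not taken from the NDB source, and the function name is
hypothetical.

```python
def heartbeat_window_ms(interval_ms, min_missed=3, max_missed=4):
    """Return (earliest, latest) time in ms before a silent API node
    would be declared dead.

    Assumption: the node is dropped after between min_missed and
    max_missed heartbeat intervals without a response. The values 3
    and 4 are inferred from "1500 ms -> 4.5-6.0 s" in this thread.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return (min_missed * interval_ms, max_missed * interval_ms)


# Compare the default setting with the suggested larger value.
for interval in (1500, 10000):
    low, high = heartbeat_window_ms(interval)
    print(f"HeartbeatIntervalDbApi={interval} ms -> "
          f"{low / 1000:.1f}-{high / 1000:.1f} s")
```

Under that assumption, 10000 milliseconds gives a busy machine roughly
30-40 seconds to respond, at the cost of a real API failure taking
equally long to be noticed.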
Rgrds Mikael
On 2004-07-20 at 02:22, Jim Hoadley wrote:
> Mikael --
>
> Yes, the MySQL server reports "ERROR 1015: Can't lock file (errno:
> 4009)" the instant a database node running on the same computer as
> the API is stopped.
>
> In thinking about this problem please remember that I consistently
> see this behavior on one node but never on the other, and that the
> MySQL server API always resumes serving cluster data once the
> database node starts back up.
>
> I will follow your suggestion and file a bug report. Thanks!
>
> -- Jim
>
>
> --- Mikael Ronström <mikael@stripped> wrote:
>> Hi Jim,
>> The error code you get 4009 means that the mysql server believes that
>> the cluster is down.
>> It is related to the bug reported by Johan but not the same so I
>> suggest you go ahead and file a new bug report to ensure that it gets
>> priority in development.
>>
>> From what I could see in your email it seems as if one of the mysqld
>> process finds the cluster down
>> (in error) which is reported with error code 4009. Then immediately
>> after doing so the mysql server
>> itself fails. Does that analysis seem correct?
>>
>> Rgrds Mikael
>>
>> On 2004-07-20 at 00:36, Jim Hoadley wrote:
>>
>>> Johan --
>>>
>>> Thanks for the fast response! I read bug report 4585. It says:
>>>
>>> - Description:
>>> - If entire DB cluster goes down, then the mysqld servers should
>>> - retry connecting to the DB. The mysql servers must not give up
>>> - trying to reconnect to DB nodes.
>>> -
>>> - If the mysqld is not restarted after a cluster restart and a
>>> - query is executed on that mysqld, then the mysqld will crash.
>>> - Not so nice.
>>> -
>>> - How to repeat:
>>> - 1. restart cluster
>>> - 2. issue a query on one mysqld server
>>> -
>>> - Suggested fix:
>>> - Let be there be a configurable option (--ndbcluster_timeout) for
>>> - how long the mysqld should try to reconnect to the db nodes.
>>> - --ndbcluster_timeout={0,0x7fffffff} and let 0 be retry forever.
>>>
>>> Not sure we're talking about the same issue. I'm not taking the
>>> entire cluster down, just one of the nodes. In that case, shouldn't
>>> the API seamlessly and instantly read from another node?
>>>
>>> 1) I have a 2-node cluster with 2 replicas, with an API running on
>>>    each node.
>>> 2) I run a shell script that connects to the first API and executes
>>>    one SELECT query per second. I can stop either DB node and
>>>    everything still works.
>>> 3) I run the same script against the second API. I can stop the DB
>>>    node on the *other* computer, but if I stop the DB node on the
>>>    same computer that the API is running on, mysqld reports it
>>>    can't get a lock on the data file until the node comes back up.
>>> 4) When the node is started again the API begins answering queries
>>>    again.
>>>
>>> Comments? Thanks again for taking the time to look at my problem.
>>>
>>> -- Jim
>>>
>>>
>>> --- Johan Andersson <johan@stripped> wrote:
>>>> Hi,
>>>> A bug report (4585) relating to this has been filed.
>>>> Sorry for your inconvenience,
>>>>
>>>> b.r,
>>>> Johan Andersson
>>>>
>>>> Devananda wrote:
>>>>
>>>>> I've been experiencing this same general problem, but haven't
>>>>> tried to narrow it down to a reproducible pattern. Seems to
>>>>> happen in relation to restarting a DB node, like Jim said.
>>>>>
>>>>> Jim Hoadley wrote:
>>>>>
>>>>>> When I stop/start or restart a database node, the API (MySQL
>>>>>> server) loses connection with the data until the node comes
>>>>>> back online. This only happens on one of my 2 nodes (BOX2).
>>>>>> The other (BOX1) is fine. Been puzzling over this for a week
>>>>>> or so. Something I missed? Please forward any suggestions.
>>>>>> Details below.
>>>>>>
>>>>>> BOX1 = Pentium III/1000MHz/512MB RAM
>>>>>> BOX2 = Pentium III/600MHz/512MB RAM
>>>>>> Both running mysql-4.1.3-beta-nightly-20040628.tar.gz.
>>>>>> Not a lot of RAM but only using a tiny test database at this
>>>>>> point.
>>>>>> Running the MGM on a separate computer (BOX4) to help isolate
>>>>>> problem.
>>>>>>
>>>>>> Connected to BOX1, issue SELECT against test.simpsons and get
>>>>>> proper
>>>>>> response:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> mysql> select * from simpsons ;
>>>>>> +----+------------+
>>>>>> | id | first_name |
>>>>>> +----+------------+
>>>>>> | 2 | Lisa |
>>>>>> | 4 | Homer |
>>>>>> | 5 | Maggie |
>>>>>> | 3 | Marge |
>>>>>> | 1 | Bart |
>>>>>> +----+------------+
>>>>>> 5 rows in set (0.03 sec)
>>>>>> ----------------------------------------
>>>>>>
>>>>>> Stop node 3 on BOX1. SELECT now fails:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> mysql> select * from simpsons ;
>>>>>> ERROR 1015: Can't lock file (errno: 4009)
>>>>>> ----------------------------------------
>>>>>>
>>>>>> Repeating SELECT fails:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> mysql> select * from simpsons ;
>>>>>> ERROR 2013: Lost connection to MySQL server during query
>>>>>> ----------------------------------------
>>>>>>
>>>>>> Repeating SELECT fails again, then succeeds after node 3 is
>>>>>> restarted:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> mysql> select * from simpsons ;
>>>>>> ERROR 2006: MySQL server has gone away
>>>>>> No connection. Trying to reconnect...
>>>>>> Connection id: 1
>>>>>> Current database: test
>>>>>>
>>>>>> +----+------------+
>>>>>> | id | first_name |
>>>>>> +----+------------+
>>>>>> | 2 | Lisa |
>>>>>> | 4 | Homer |
>>>>>> | 5 | Maggie |
>>>>>> | 3 | Marge |
>>>>>> | 1 | Bart |
>>>>>> +----+------------+
>>>>>> 5 rows in set (6.55 sec)
>>>>>> ----------------------------------------
>>>>>>
>>>>>> All data is intact. BTW new records added to node 2 on BOX2
>>>>>> while node 3 on BOX1 is down show up (this is good).
>>>>>>
>>>>>> Here's what restarting node 3 on BOX1 with mgmd looks like
>>>>>> (looks right to me):
>>>>>>
>>>>>> ----------------------------------------
>>>>>> NDB> show
>>>>>> Cluster Configuration
>>>>>> ---------------------
>>>>>> 2 NDB Node(s)
>>>>>> DB node: 2 (Version: 3.5.0)
>>>>>> DB node: 3 (Version: 3.5.0)
>>>>>>
>>>>>> 4 API Node(s)
>>>>>> API node: 11 (not connected)
>>>>>> API node: 12 (Version: 3.5.0)
>>>>>> API node: 13 (not connected)
>>>>>> API node: 14 (not connected)
>>>>>>
>>>>>> 1 MGM Node(s)
>>>>>> MGM node: 1 (Version: 3.5.0)
>>>>>>
>>>>>> NDB> 2 restart
>>>>>> Executing RESTART on node 2.
>>>>>> Database node 2 is being restarted.
>>>>>>
>>>>>> NDB> 2 - endTakeOver
>>>>>> ----------------------------------------
>>>>>>
>>>>>> Here is the MySQL server error log output on BOX1 as node 3 is
>>>>>> restarted:
>>>>>>
>>>>>> ----------------------------------------
>>>>>> 040713 10:53:31 mysqld started
>>>>>> 040713 10:53:32 InnoDB: Started; log sequence number 0 44112
>>>>>> /usr/local/mysql/libexec/mysqld: ready for connections.
>>>>>> Version: '4.1.3-beta-nightly-20040628-log' socket: '/tmp/mysql.sock' port: 3306
>>>>>> 2004-07-19 11:39:15 [NDB] INFO -- Node shutdown initiated
>>>>>> mysqld got signal 11;
>>
> === message truncated ===
>
Mikael Ronström, Senior Software Architect
MySQL AB, www.mysql.com
Office: +46 70 2646363
Clustering:
http://www.infoworld.com/article/04/04/14/HNmysqlcluster_1.html
http://www.eweek.com/article2/0,1759,1567546,00.asp