List: Cluster
From: Johan Andersson
Date: July 19 2004 11:30pm
Subject: Re: API loses data during node restarts

Mikael Ronström wrote:

> From what I could see in your email it seems as if one of the mysqld 
> processes finds the cluster down (in error), which is reported with 
> error code 4009. Then immediately after doing so the mysql server 
> itself fails. Does that analysis seem correct?

The first query generates error code 4009 and a subsequent query 
generates a mysqld crash. This is also the case with bug 4585: when the 
cluster goes down, the table handler does not gracefully clean up after 
the connection loss, and it does not reconnect when the cluster comes 
back up. Either way, the mysqld server must never crash, since there 
may be data stored in other storage engines (InnoDB, MyISAM). This also 
means that the mysqld server must be able to reconnect to the NDB 
Cluster storage engine.
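
To make that concrete, here is a minimal sketch of the kind of cleanup 
and reconnect logic meant above. It is not the actual handler code; 
every name in it is invented for illustration:

----------------------------------------
// Sketch only -- NOT the real ha_ndbcluster code. The idea: on
// connection loss, drop per-statement NDB state and return a soft
// handler error instead of dereferencing stale objects (the likely
// cause of the signal 11), then retry the connection on the next call.

static const int HA_ERR_NO_CONNECTION = 149;  // assumed error code

static bool cluster_alive()     { /* assumed: check connection */ return false; }
static bool cluster_reconnect() { /* assumed: re-establish it   */ return false; }

int ndb_handler_read()
{
  if (!cluster_alive())
  {
    // release transaction/operation objects here, then ...
    if (!cluster_reconnect())
      return HA_ERR_NO_CONNECTION;  // soft error, mysqld keeps running
  }
  // ... perform the read through the (re)established connection ...
  return 0;
}
----------------------------------------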

b.r.,
johan

> Rgrds Mikael
>
> On 2004-07-20 at 00:36, Jim Hoadley wrote:
>
>> Johan --
>>
>> Thanks for the fast response! I read bug report 4585. It says:
>>
>> -   Description:
>> -   If the entire DB cluster goes down, then the mysqld servers should
>> -   retry connecting to the DB. The mysqld servers must not give up
>> -   trying to reconnect to the DB nodes.
>> -
>> -   If the mysqld is not restarted after a cluster restart and a query
>> -   is executed on that mysqld, then the mysqld will crash. Not so nice.
>> -
>> -   How to repeat:
>> -   1. restart cluster
>> -   2. issue a query on one mysqld server
>> -
>> -   Suggested fix:
>> -   Let there be a configurable option (--ndbcluster_timeout) for how
>> -   long the mysqld should try to reconnect to the DB nodes.
>> -   --ndbcluster_timeout={0,0x7fffffff} and let 0 be retry forever.
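
Read literally, that suggested fix amounts to a bounded retry loop 
along the lines of the sketch below. The function names are invented 
for illustration; only the --ndbcluster_timeout semantics come from 
the report:

----------------------------------------
// Sketch only. 0 means retry forever; any other value is a deadline
// in seconds, per the suggested --ndbcluster_timeout semantics.
#include <unistd.h>

static bool try_connect_to_db_nodes()  // assumed helper, not a real API
{
  /* ... attempt to (re)establish the connection to the DB nodes ... */
  return false;
}

bool connect_with_retry(unsigned int ndbcluster_timeout)
{
  for (unsigned int waited = 0;
       ndbcluster_timeout == 0 || waited < ndbcluster_timeout;
       waited++)
  {
    if (try_connect_to_db_nodes())
      return true;
    sleep(1);  // one attempt per second
  }
  return false;  // give up and report an error -- never crash
}
----------------------------------------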
>>
>> Not sure we're talking about the same issue. I'm not taking the 
>> entire cluster
>> down, just one of the nodes. In that case, shouldn't the API 
>> seamlessly and
>> instantly read from another node?
>>
>> 1) I have a 2-node cluster with 2 replicas, with an API running on
>>    each node.
>> 2) I run a shell script that connects to the first API and executes
>>    one SELECT query per second. I can stop either DB node and
>>    everything still works (a C API equivalent of this probe is
>>    sketched after the list).
>> 3) I run the same script against the second API. I can stop the DB
>>    node on the *other* computer, but if I stop the DB node on the
>>    same computer that the API is running on, mysqld reports it can't
>>    get a lock on the data file until the node comes back up.
>> 4) When the node is started again, the API begins answering queries
>>    again.
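
For reference, here is a probe equivalent to the shell script in step 
2, written against the standard MySQL C API. The host name and 
credentials are placeholders:

----------------------------------------
#include <mysql.h>
#include <stdio.h>
#include <unistd.h>

int main()
{
  MYSQL *conn = mysql_init(NULL);
  // "box1"/"test" are placeholder host and credentials
  if (!mysql_real_connect(conn, "box1", "test", "", "test", 3306, NULL, 0))
  {
    fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
    return 1;
  }
  for (;;)  // one query per second; watch where it starts failing
  {
    if (mysql_query(conn, "SELECT * FROM simpsons"))
      fprintf(stderr, "query failed: %s\n", mysql_error(conn));
    else
    {
      MYSQL_RES *res = mysql_store_result(conn);
      if (res)
        mysql_free_result(res);
    }
    sleep(1);
  }
}
----------------------------------------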
>>
>> Comments? Thanks again for taking the time to look at my problem.
>>
>> -- Jim
>>
>>
>> --- Johan Andersson <johan@stripped> wrote:
>>
>>> Hi,
>>> A bug report (4585) relating to this has been filed.
>>> Sorry for your inconvenience,
>>>
>>> b.r,
>>> Johan Andersson
>>>
>>> Devananda wrote:
>>>
>>>> I've been experiencing this same general problem, but haven't tried to
>>>> narrow it down to a reproducible pattern. Seems to happen in relation
>>>> to restarting a DB node, like Jim said.
>>>>
>>>> Jim Hoadley wrote:
>>>>
>>>>> When I stop/start or restart a database node, the API (MySQL server)
>>>>> loses connection with the data until the node comes back online.
>>>>> This only happens on one of my 2 nodes (BOX2). The other (BOX1) is
>>>>> fine. Been puzzling over this for a week or so. Something I missed?
>>>>> Please forward any suggestions. Details below.
>>>>>
>>>>> BOX1 = Pentium III/1000MHz/512MB RAM
>>>>> BOX2 = Pentium III/600MHz/512MB RAM
>>>>> Both running mysql-4.1.3-beta-nightly-20040628.tar.gz.
>>>>> Not a lot of RAM but only using a tiny test database at this point.
>>>>> Running the MGM on a separate computer (BOX4) to help isolate the
>>>>> problem.
>>>>>
>>>>> Connected to BOX1, issue SELECT against test.simpsons and get proper
>>>>> response:
>>>>>
>>>>> ----------------------------------------
>>>>> mysql> select * from simpsons ;
>>>>> +----+------------+
>>>>> | id | first_name |
>>>>> +----+------------+
>>>>> |  2 | Lisa       |
>>>>> |  4 | Homer      |
>>>>> |  5 | Maggie     |
>>>>> |  3 | Marge      |
>>>>> |  1 | Bart       |
>>>>> +----+------------+
>>>>> 5 rows in set (0.03 sec)
>>>>> ----------------------------------------
>>>>>
>>>>> Stop node 3 on BOX1. SELECT now fails:
>>>>>
>>>>> ----------------------------------------
>>>>> mysql> select * from simpsons ;
>>>>> ERROR 1015: Can't lock file (errno: 4009)
>>>>> ----------------------------------------
>>>>>
>>>>> Repeating SELECT fails:
>>>>>
>>>>> ----------------------------------------
>>>>> mysql> select * from simpsons ;
>>>>> ERROR 2013: Lost connection to MySQL server during query
>>>>> ----------------------------------------
>>>>>
>>>>> Repeating SELECT fails again, then succeeds after node 3 is 
>>>>> restarted:
>>>>>
>>>>> ----------------------------------------
>>>>> mysql> select * from simpsons ;
>>>>> ERROR 2006: MySQL server has gone away
>>>>> No connection. Trying to reconnect...
>>>>> Connection id:    1
>>>>> Current database: test
>>>>>
>>>>> +----+------------+
>>>>> | id | first_name |
>>>>> +----+------------+
>>>>> |  2 | Lisa       |
>>>>> |  4 | Homer      |
>>>>> |  5 | Maggie     |
>>>>> |  3 | Marge      |
>>>>> |  1 | Bart       |
>>>>> +----+------------+
>>>>> 5 rows in set (6.55 sec)
>>>>> ----------------------------------------
>>>>>
>>>>> All data is intact. BTW, new records added to node 2 on BOX2 while
>>>>> node 3 on BOX1 is down show up (this is good).
>>>>>
>>>>> Here's what restarting node 3 on BOX1 with mgmd looks like (looks
>>>>> right to me):
>>>>>
>>>>> ----------------------------------------
>>>>> NDB> show
>>>>> Cluster Configuration
>>>>> ---------------------
>>>>> 2 NDB Node(s)
>>>>> DB node:        2  (Version: 3.5.0)
>>>>> DB node:        3  (Version: 3.5.0)
>>>>>
>>>>> 4 API Node(s)
>>>>> API node:       11  (not connected)
>>>>> API node:       12  (Version: 3.5.0)
>>>>> API node:       13  (not connected)
>>>>> API node:       14  (not connected)
>>>>>
>>>>> 1 MGM Node(s)
>>>>> MGM node:       1  (Version: 3.5.0)
>>>>>
>>>>> NDB> 2 restart
>>>>> Executing RESTART on node 2.
>>>>> Database node 2 is being restarted.
>>>>>
>>>>> NDB> 2 - endTakeOver
>>>>> ----------------------------------------
>>>>>
>>>>> Here is the MySQL server error log output on BOX1 as node 3 is
>>>>> restarted:
>>>>>
>>>>> ----------------------------------------
>>>>> 040713 10:53:31  mysqld started
>>>>> 040713 10:53:32  InnoDB: Started; log sequence number 0 44112
>>>>> /usr/local/mysql/libexec/mysqld: ready for connections.
>>>>> Version: '4.1.3-beta-nightly-20040628-log'  socket: '/tmp/mysql.sock'
>>>>> port: 3306
>>>>> 2004-07-19 11:39:15 [NDB] INFO     -- Node shutdown initiated
>>>>> mysqld got signal 11;
>>>>> This could be because you hit a bug. It is also possible that this
>>>>> binary or one of the libraries it was linked against is corrupt,
>>>>> improperly built, or misconfigured. This error can also be caused
>>>>> by malfunctioning hardware.
>>>>> We will try our best to scrape up some info that will hopefully
>>>>> help diagnose the problem, but since we have already crashed,
>>>>> something is definitely wrong and this may fail.
>>>>>
>>>>> key_buffer_size=16777216
>>>>> read_buffer_size=258048
>>>>> max_used_connections=1
>>>>> max_connections=100
>>>>> threads_connected=1
>>>>> It is possible that mysqld could use up to
>>>>> key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections
>>>>> = 92783 K bytes of memory
>>>>> Hope that's ok; if not, decrease some variables in the equation.
>>>>>
>>>>> thd=0x8751280
>>>>> Attempting backtrace. You can use the following information to find
>>>>> out where mysqld died. If you see no messages after this, something
>>>>> went terribly wrong...
>>>>> Cannot determine thread, fp=0xafdc0c5c, backtrace may not be correct.
>>>>> Stack range sanity check OK, backtrace follows:
>>>>> 0x81725e0
>>>>> 0xb749de48
>>>>> 0x81bc7a4
>>>>> 0x81f21c0
>>>>> 0x81bc7a4
>>>>> 0x81bb8e8
>>>>> 0x81b57e7
>>>>> 0x81ad41e
>>>>> 0x81adf45
>>>>> 0x81aa3fa
>>>>> 0x8187aba
>>>>> 0x818d8cc
>>>>> 0x81864b0
>>>>> 0x8186056
>>>>> 0x81857fa
>>>>> 0xb7497dac
>>>>> 0xb73c3a8a
>>>>> New value of fp=(nil) failed sanity check, terminating stack trace!
>>>>> Please read http://www.mysql.com/doc/en/Using_stack_trace.html and
>>>>> follow instructions on how to resolve the stack trace. Resolved
>>>>> stack trace is much more helpful in diagnosing the problem, so
>>>>> please do resolve it
>>>>> Trying to get some variables.
>>>>> Some pointers may be invalid and cause the dump to abort...
>>>>> thd->query at 0x875d768 = select * from simpsons
>>>>> thd->thread_id=7854
>>>>> The manual page at http://www.mysql.com/doc/en/Crashing.html contains
>>>>> information that should help you find out what is causing the crash.
>>>>>
>>>>> Number of processes running now: 0
>>>>> 040719 11:39:24  mysqld restarted
>>>>> InnoDB: Warning: we did not need to do crash recovery, but log scan
>>>>> InnoDB: progressed past the checkpoint lsn 0 44112 up to lsn 0 44146
>>>>> 040719 11:39:24  InnoDB: Started; log sequence number 0 44112
>>>>> Restarting system
>>>>> 2004-07-19 11:39:28 [NDB] INFO     -- Ndb has terminated (pid 18367)
>>>>> restarting
>>>>> 2004-07-19 11:39:28 [NDB] INFO     -- Angel pid: 18366 ndb pid: 18428
>>>>> 2004-07-19 11:39:28 [NDB] INFO     -- NDB Cluster -- DB node 2
>>>>> 2004-07-19 11:39:28 [NDB] INFO     -- Version 3.5.0 (beta) --
>>>>> 2004-07-19 11:39:28 [NDB] INFO     -- Start initiated (version 3.5.0)
>>>>> NR: setLcpActiveStatusEnd - m_participatingLQH
>>>>> 2004-07-19 11:39:33 [NDB] INFO     -- Started (version 3.5.0)
>>>>> /usr/local/mysql/libexec/mysqld: ready for connections.
>>>>
>>>
>> === message truncated ===
>>
> Mikael Ronström, Senior Software Architect
> MySQL AB, www.mysql.com
>
> Clustering:
> http://www.infoworld.com/article/04/04/14/HNmysqlcluster_1.html
>
> http://www.eweek.com/article2/0,1759,1567546,00.asp
>
>

Thread
API loses data during node restarts (Jim Hoadley, 19 Jul)
  • Re: API loses data during node restarts (Devananda, 19 Jul)
    • Re: API loses data during node restarts (Johan Andersson, 19 Jul)
      • Re: API loses data during node restarts (Jim Hoadley, 20 Jul)
        • Re: API loses data during node restarts (Justin Swanhart, 20 Jul)
          • Re: API loses data during node restarts (Mikael Ronström, 20 Jul)
          • Re: API loses data during node restarts (Devananda, 20 Jul)
            • Re: API loses data during node restarts (Johan Andersson, 20 Jul)
        • Re: API loses data during node restarts (Mikael Ronström, 20 Jul)
          • Re: API loses data during node restarts (Johan Andersson, 20 Jul)
            • Re: API loses data during node restarts (Jim Hoadley, 20 Jul)
          • Re: API loses data during node restarts (Jim Hoadley, 20 Jul)
            • Re: API loses data during node restarts (Mikael Ronström, 20 Jul)
              • Re: API loses data during node restarts (Jim Hoadley, 22 Jul)