Mikael --
Yes, the MySQL server reports "ERROR 1015: Can't lock file (errno: 4009)" the
instant a database node running on the same computer as the API is stopped.
In thinking about this problem please remember that I consistently see this
behavior on one node but never on the other, and that the MySQL server API
always resumes serving cluster data once the database node starts back up.
I will follow your suggestion and file a bug report. Thanks!
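
In case it helps when reproducing this for the bug report, the client errors
quoted below (1015 with errno 4009, then 2013, then 2006) can be told apart
mechanically. What follows is a minimal Python sketch, not the shell script
mentioned in this thread: `classify`, `probe`, and the `run_query` callback are
hypothetical names, and the pattern assumes error lines in exactly the format
the mysql client prints in the transcript below.

```python
import re
import time
from typing import Callable, List

# Matches client error lines as they appear in this thread, e.g.
# "ERROR 1015: Can't lock file (errno: 4009)". The errno part is optional.
ERROR_RE = re.compile(r"^ERROR (\d+): (.*?)(?: \(errno: (\d+)\))?$")


def classify(line: str) -> str:
    """Map one line of client output to a coarse status label."""
    m = ERROR_RE.match(line.strip())
    if m is None:
        return "ok"  # not an error line (for example, a result row)
    code = int(m.group(1))
    errno = int(m.group(3)) if m.group(3) else None
    if code == 1015 and errno == 4009:
        # Per Mikael's reply: 4009 means mysqld believes the cluster is down.
        return "cluster-down"
    if code in (2006, 2013):
        # "MySQL server has gone away" / "Lost connection to MySQL server".
        return "mysqld-unreachable"
    return "other-error"


def probe(
    run_query: Callable[[], str],
    count: int,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call run_query `count` times, pausing `interval` seconds between calls.

    run_query is an assumed callback that issues one SELECT and returns the
    first line the client printed (a result row or an ERROR line).
    """
    results = []
    for i in range(count):
        results.append(classify(run_query()))
        if i + 1 < count:
            sleep(interval)
    return results
```

Here `run_query` would wrap whatever issues the SELECT against one API node.
A run that shows `cluster-down` followed by `mysqld-unreachable` matches the
sequence seen on the affected box.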
-- Jim
--- Mikael_Ronstr wrote:
> Hi Jim,
> The error code you get, 4009, means that the MySQL server believes that
> the cluster is down.
> It is related to the bug reported by Johan but is not the same, so I
> suggest you go ahead and file a new bug report to ensure that it gets
> priority in development.
>
> From what I could see in your email, it seems as if one of the mysqld
> processes finds the cluster down (in error), which is reported with
> error code 4009. Then, immediately after doing so, the MySQL server
> itself fails. Does that analysis seem correct?
>
> Rgrds Mikael
>
> 2004-07-20 kl. 00.36 skrev Jim Hoadley:
>
> > Johan --
> >
> > Thanks for the fast response! I read bug report 4585. It says:
> >
> > - Description:
> > - If entire DB cluster goes down, then the mysqld servers should retry
> > - connecting to the DB. The mysql servers must not give up trying to reconnect
> > - to DB nodes.
> > -
> > - If the mysqld is not restarted after a cluster restart and a query is
> > - executed on that mysqld, then the mysqld will crash. Not so nice.
> > -
> > - How to repeat:
> > - 1. restart cluster
> > - 2. issue a query on one mysqld server
> > -
> > - Suggested fix:
> > - Let there be a configurable option (--ndbcluster_timeout) for how long
> > - the mysqld should try to reconnect to the db nodes.
> > - --ndbcluster_timeout={0,0x7fffffff} and let 0 be retry forever.
> >
> > Not sure we're talking about the same issue. I'm not taking the entire
> > cluster down, just one of the nodes. In that case, shouldn't the API
> > seamlessly and instantly read from another node?
> >
> > 1) I have a 2-node cluster with 2 replicas, with an API running on
> >    each node.
> > 2) I run a shell script that connects to the first API and executes
> >    one SELECT query per second. I can stop either DB node and
> >    everything still works.
> > 3) I run the same script against the second API. I can stop the DB
> >    node on the *other* computer, but if I stop the DB node on the
> >    same computer that the API is running on, mysqld reports it can't
> >    get a lock on the data file until the node comes back up.
> > 4) When the node is started again, the API begins answering queries
> >    again.
> >
> > Comments? Thanks again for taking the time to look at my problem.
> >
> > -- Jim
> >
> >
> > --- Johan Andersson <johan@stripped> wrote:
> >> Hi,
> >> A bug report (4585) relating to this has been filed.
> >> Sorry for your inconvenience,
> >>
> >> b.r,
> >> Johan Andersson
> >>
> >> Devananda wrote:
> >>
> >>> I've been experiencing this same general problem, but haven't tried to
> >>> narrow it down to a reproducible pattern. Seems to happen in relation
> >>> to restarting a DB node, like Jim said.
> >>>
> >>> Jim Hoadley wrote:
> >>>
> >>>> When I stop/start or restart a database node, the API (MySQL server)
> >>>> loses connection with the data until the node comes back online. This
> >>>> only happens on one of my 2 nodes (BOX2). The other (BOX1) is fine.
> >>>> Been puzzling over this for a week or so. Something I missed? Please
> >>>> forward any suggestions. Details below.
> >>>>
> >>>> BOX1 = Pentium III/1000MHz/512MB RAM
> >>>> BOX2 = Pentium III/600MHz/512MB RAM
> >>>> Both running mysql-4.1.3-beta-nightly-20040628.tar.gz.
> >>>> Not a lot of RAM but only using a tiny test database at this point.
> >>>> Running the MGM on a separate computer (BOX4) to help isolate the
> >>>> problem.
> >>>>
> >>>> Connected to BOX1, issue SELECT against test.simpsons and get proper
> >>>> response:
> >>>>
> >>>> ----------------------------------------
> >>>> mysql> select * from simpsons ;
> >>>> +----+------------+
> >>>> | id | first_name |
> >>>> +----+------------+
> >>>> |  2 | Lisa       |
> >>>> |  4 | Homer      |
> >>>> |  5 | Maggie     |
> >>>> |  3 | Marge      |
> >>>> |  1 | Bart       |
> >>>> +----+------------+
> >>>> 5 rows in set (0.03 sec)
> >>>> ----------------------------------------
> >>>>
> >>>> Stop node 3 on BOX1. SELECT now fails:
> >>>>
> >>>> ----------------------------------------
> >>>> mysql> select * from simpsons ;
> >>>> ERROR 1015: Can't lock file (errno: 4009)
> >>>> ----------------------------------------
> >>>>
> >>>> Repeating SELECT fails:
> >>>>
> >>>> ----------------------------------------
> >>>> mysql> select * from simpsons ;
> >>>> ERROR 2013: Lost connection to MySQL server during query
> >>>> ----------------------------------------
> >>>>
> >>>> Repeating SELECT fails again, then succeeds after node 3 is restarted:
> >>>>
> >>>> ----------------------------------------
> >>>> mysql> select * from simpsons ;
> >>>> ERROR 2006: MySQL server has gone away
> >>>> No connection. Trying to reconnect...
> >>>> Connection id: 1
> >>>> Current database: test
> >>>>
> >>>> +----+------------+
> >>>> | id | first_name |
> >>>> +----+------------+
> >>>> |  2 | Lisa       |
> >>>> |  4 | Homer      |
> >>>> |  5 | Maggie     |
> >>>> |  3 | Marge      |
> >>>> |  1 | Bart       |
> >>>> +----+------------+
> >>>> 5 rows in set (6.55 sec)
> >>>> ----------------------------------------
> >>>>
> >>>> All data is intact. BTW, new records added to node 2 on BOX2 while
> >>>> node 3 on BOX1 is down show up (this is good).
> >>>>
> >>>> Here's what restarting node 3 on BOX1 with mgmd looks like (looks
> >>>> right to me):
> >>>>
> >>>> ----------------------------------------
> >>>> NDB> show
> >>>> Cluster Configuration
> >>>> ---------------------
> >>>> 2 NDB Node(s)
> >>>> DB node: 2 (Version: 3.5.0)
> >>>> DB node: 3 (Version: 3.5.0)
> >>>>
> >>>> 4 API Node(s)
> >>>> API node: 11 (not connected)
> >>>> API node: 12 (Version: 3.5.0)
> >>>> API node: 13 (not connected)
> >>>> API node: 14 (not connected)
> >>>>
> >>>> 1 MGM Node(s)
> >>>> MGM node: 1 (Version: 3.5.0)
> >>>>
> >>>> NDB> 2 restart
> >>>> Executing RESTART on node 2.
> >>>> Database node 2 is being restarted.
> >>>>
> >>>> NDB> 2 - endTakeOver
> >>>> ----------------------------------------
> >>>>
> >>>> Here is the MySQL server error log output on BOX1 as node 3 is
> >>>> restarted:
> >>>>
> >>>> ----------------------------------------
> >>>> 040713 10:53:31 mysqld started
> >>>> 040713 10:53:32 InnoDB: Started; log sequence number 0 44112
> >>>> /usr/local/mysql/libexec/mysqld: ready for connections.
> >>>> Version: '4.1.3-beta-nightly-20040628-log' socket: '/tmp/mysql.sock' port: 3306
> >>>> 2004-07-19 11:39:15 [NDB] INFO -- Node shutdown initiated
> >>>> mysqld got signal 11;
>
=== message truncated ===