List:Cluster« Previous MessageNext Message »
From:Jim Hoadley Date:September 1 2004 4:49pm
Subject:Re: nightly crashing
View as plain text  
I've had a cluster going for several weeks, however, almost every day one or
more of my nodes crashes. My first thought is that the Linux boxes are
underpowered, but is there any way I can prove that? I'd sure like to know
MySQL Cluster is stable and will suit my needs before shelling out the money
for production boxes.

I wrote to the list about this a month ago but never got to the root cause. So
I'd like to try again.

I've got 2 replicas, 4 storage nodes, 4 APIs on 4 test Linux boxes. Each box
has 512MB RAM, and the test database is tiny (4 rows of data). Here's what
STDERR says on the node [4] that died yesterday afternoon:

[root@ed 4.ndb_mgm]# ndbd &
[3] 8884
2004-08-31 16:38:10 [NDB] INFO     -- Angel pid: 8884 ndb pid: 8885
2004-08-31 16:38:10 [NDB] INFO     -- NDB Cluster -- DB node 4
2004-08-31 16:38:10 [NDB] INFO     -- Version 3.5.0 (beta) --
2004-08-31 16:38:10 [NDB] INFO     -- Start initiated (version 3.5.0)
2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 2 opened
2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 3 opened
2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 5 opened
2004-08-31 16:38:11 [NDB] INFO     -- Node 1 Connected
2004-08-31 16:38:12 [NDB] INFO     -- Node 2 Connected
2004-08-31 16:38:12 [NDB] INFO     -- Node 3 Connected
2004-08-31 16:38:14 [NDB] INFO     -- Node 5 Connected
2004-08-31 16:38:14 [NDB] INFO     -- Node 2: API version 3.5.0
2004-08-31 16:38:14 [NDB] INFO     -- Node 3: API version 3.5.0
2004-08-31 16:38:14 [NDB] INFO     -- Node 5: API version 3.5.0
NR: setLcpActiveStatusEnd - m_participatingLQH
2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 11 opened
2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 12 opened
2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 13 opened
2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 14 opened
2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 0 opened
2004-08-31 16:38:16 [NDB] INFO     -- Started (version 3.5.0)
2004-08-31 16:38:16 [NDB] INFO     -- Node 1: API version 3.5.0
2004-08-31 16:38:17 [NDB] INFO     -- Node 12 Connected
2004-08-31 16:38:17 [NDB] INFO     -- Node 12: API version 3.5.0
2004-08-31 16:38:17 [NDB] INFO     -- Node 11 Connected
2004-08-31 16:38:17 [NDB] INFO     -- Node 14 Connected
2004-08-31 16:38:17 [NDB] INFO     -- Node 11: API version 3.5.0
2004-08-31 16:38:17 [NDB] INFO     -- Node 14: API version 3.5.0
2004-08-31 16:38:17 [NDB] INFO     -- Node 13 Connected
2004-08-31 16:38:17 [NDB] INFO     -- Node 13: API version 3.5.0
Ndb kernel is stuck in: Job Handling
Ndb kernel is stuck in: Job Handling
2004-08-31 19:40:39 [NDB] ALERT    -- Node 3 Disconnected
Error handler shutting down system
Error handler shutdown completed - exiting

So I'm looking for logging information on node[4] and maybe the MGM node.

[root@ed 4.ndb_mgm]# find /var/ndbcluster/ -name *.log -print |xargs ls -lrt
<...>
-rw-r--r--    1 root     root            2 Aug 31 19:40
/var/ndbcluster/ndb/NextTraceFileNo.log
-rw-r--r--    1 root     root        10044 Aug 31 19:40
/var/ndbcluster/ndb/error.log

Let's look in the error log. Last message says:

Date/Time: Tuesday 31 August 2004 - 19:40:39
Type of error: error
Message: Arbitrator shutdown
Fault ID: 2305
Problem data: Arbitrator decided to shutdown this node
Object of reference: QMGR (Line: 3764) 0x00000002
ProgramName: NDB Kernel
ProcessID: 8885
TraceFile: /var/ndbcluster/ndbNDB_TraceFile_18.trace
***EOM***

So let's look at the trace file, #18.

[root@ed ndbcluster]# ls -l /var/ndbcluster/ndbNDB_TraceFile_18.trace
-rw-r--r--    1 root     root       961439 Aug 31 19:40
/var/ndbcluster/ndbNDB_TraceFile_18.trace

[root@ed ndbcluster]# wc !$
wc /var/ndbcluster/ndbNDB_TraceFile_18.trace
  16863  135203  961439 /var/ndbcluster/ndbNDB_TraceFile_18.trace

It's 16k+ lines. I shouldn't past that in here. Could I email it to someone?

Any other ideas how to track this down?

Thanks in advance!

-- Jim


--- Mikael_Ronstr
> Hi Jim,
> The trace file you sent doesn't seem to correlate with the previous 
> info. A bit confusing,
> the error log used should have been the one in the 2.ndb_db directory 
> if that's where you
> are running it from. This not being written in a week is very strange.
> 
> The other error.log is from another cluster instance you worked on it 
> seems. Actually this
> correlates better with the cluster log.
> 
> Anyways one problem that you might look for that we had problems with 
> another customer
> is whether there are any other processes starting to work at the time 
> of the crash. If there are
> other memory hungry processes working then the ndbd process might start 
> swapping and
> then it can easily be swapped out of the cluster due to missed 
> heartbeats. This will not cause
> the watch dog however to complain so the picture is not completely 
> clear.
> 
> Rgrds Mikael
> 
> 2004-07-29 kl. 23.45 skrev Jim Hoadley:
> 
> > Mikael --
> >
> > Thanks for taking on my question. I find 2 error.logs in the 
> > ndbcluster tree,
> > and the "live" one isn't where I expected it to be.
> >
> > I'm running the node from this directory:
> > /var/ndbcluster/mysql-test/ndbcluster/2.ndb_db
> >
> > But the (2) error.log(s) I found were here:
> >
> > -rw-r--r--    1 root     root          468 Jul 21 13:48
> > /var/ndbcluster/mysql-test/var/ndbcluster/2.ndb_db/error.log
> > -rw-r--r--    1 root     root         4458 Jul 28 20:49
> > /var/ndbcluster/ndb/error.log
> >
> > Is this a problem?
> >
> > Obviously the one dated Jul 21 is not used. The other has this as the 
> > last
> > entry, the one from the most recent "crash":
> >
> > Date/Time: Wednesday 28 July 2004 - 20:49:21
> > Type of error: error
> > Message: Arbitrator shutdown
> > Fault ID: 2305
> > Problem data: Arbitrator decided to shutdown this node
> > Object of reference: QMGR (Line: 3764) 0x00000002
> > ProgramName: NDB Kernel
> > ProcessID: 15932
> > TraceFile: /var/ndbcluster/ndbNDB_TraceFile_10.trace
> > ***EOM***
> >
> > So here's /var/ndbcluster/ndbNDB_TraceFile_10.trace
> >
> > [root@BOX2 ndbcluster]# ls -l ./ndbNDB_TraceFile_10.trace
> > -rw-r--r--    1 root     root       962184 Jul 28 20:49
> > ./ndbNDB_TraceFile_10.trace
> >
> >
> > [root@cooler ndbcluster]# head -160 ./ndbNDB_TraceFile_10.trace
> > JAM CONTENTS up->down left->right ?=not block entry
> > BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
> >        ?000112 001452 001475 001489 001489 001499
> > NDBFS   000913 000915 000726 000932
> > DBACC   000103 000110
> > QMGR    001817 001827 001852 001859 002705 002708 002727 002729
> >         002737 002737 002740 002737 002737 002737 002737 002737
> >         002737 002737 002737 002737 002737 002737 002737 002737
> >         002737 002737 002737 002737 002737 002737 002737 002737
> >         002737 002737 002737 002737 002737 002737 002737 002737
> >         002737 002737 002737 002737 002737 002737 002737 002737
> >         002737 002737 002737 002737 002737 002737 002737 002737
> >         002737 001243 001243 001243 001243 001243 001243 001243
> >         001243 001243 001243 001243 001243 001243 001243 001243
> >         001243 001243 001243 001243 001243 001243 001243 001243
> >         001243 001243 001243 001243 001243 001243 001243 001243
> >         001243 001243 001243 001243 001243 001243 001243 001243
> >         001243 001243 001243 001243 001243 001243 001243 001243
> >         001243 001273 001285 001296 001316 001316 001316 001316
> >         001316 001316 001316 001316 001316 001316 001316 001316
> >         001316 001316 001316 001316 001316 001316 001316 001316
> >         001316 001316 001316 001316 001316 001316 001316 001316
> >         001316 001316 001316 001316 001316 001316 001316 001316
> >         001316 001316 001316 001316 001316 001316 001316 001316
> >         001316 001316 001316 001316 002091 002093 002105 002109
> >         002109 002112 002109 002109 002109 002109 002109 002109
> >         002109 002109 002109 002109 002109 002109 002109 002109
> >         002109 002109 002109 002109 002109 002109 002109 002109
> >         002109 002109 002109 002109 002109 002109 002109 002109
> >         002109 002109 002109 002109 002109 002109 002109 002109
> >         002109 002109 002109 002109 002109 002109 002109 002109
> > QMGR    001817 001827 001895 001897
> > QMGR    001694 003061 003231 003268 003325 003352 003360 003360
> >         003364 003364 003386 003791 003791 003794 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003396 002956 002958 002960
> >         003312
> > QMGR    001817 001827 001895 001897
> > QMGR    001694 003054 003056 003058 003061 003231 003268 003325
> >         003352 003360 003360 003364 003364 003364 003386 003791
> >         003791 003794 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003791 003791 003791 003791 003791 003791 003791 003791
> >         003396 002956 002958 002960 003312
> > QMGR    000150 002091 002093
> > QMGR    002133
> > DBDIH   013454 002170 002870
> > DBTC    000795 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950
> > DBDICT  002546
> > SUMA    000306 000314
> > GREP    000367
> > CMVMI   000349 000353
> > DBTC    000795 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000845 000893 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950 000950 000950 000950 000950 000950 000950
> >         000950 000950
> > DBDICT  002546
> > SUMA    000306 000314
> > GREP    000367
> > CMVMI   000349 000353
> > CMVMI   000349 000353 000372
> > QMGR    002193 002218 002222 002228 002230 002257 002262
> > QMGR    002283 002297 002297 002297 002297 002297 002297 002297
> >         002297 002297 002297 002297 002297 002297 002297 002297
> >         002297 002297 002297 002297 002297 002297 002297 002297
> >         002297 002297 002297 002297 002297 002297 002297 002297
> 
=== message truncated ===



		
_______________________________
Do you Yahoo!?
Win 1 of 4,000 free domain names from Yahoo! Enter now.
http://promotions.yahoo.com/goldrush
Thread
Re: nightly crashingJim Hoadley1 Sep
  • Re: nightly crashingJim Hoadley1 Sep
    • Can't Init DatabaseSharad Maloo2 Sep
      • Re: Can't Init DatabaseOlivier Kaloudoff2 Sep
        • Re: Can't Init DatabaseSharad Maloo2 Sep
          • Re: Can't Init DatabaseTomas Ulin2 Sep
      • Re: Can't Init DatabaseWouter de Jong2 Sep
        • Re: Can't Init DatabaseChad Martin2 Sep
          • Re: Can't Init DatabaseSharad Maloo2 Sep
          • Re: Can't Init DatabaseWouter de Jong3 Sep
  • Re: nightly crashingPh.D. Joseph E. Sacco1 Sep
  • Re: nightly crashingMikael Ronström2 Sep
    • Re: nightly crashingJim Hoadley22 Sep
      • Re: nightly crashingMikael Ronström23 Sep
      • Re: nightly crashingTomas Ulin23 Sep
        • Re: nightly crashingJim Hoadley23 Sep
RE: Can't Init DatabaseLuke H. Crouch2 Sep
  • Re: Can't Init DatabaseWouter de Jong3 Sep
    • Re: Can't Init DatabaseTomas Ulin3 Sep
      • Re: Can't Init Database <- solvedWouter de Jong3 Sep
        • Re: Can't Init Database <- solvedMagnus Svensson6 Sep
Re: nightly crashingJim Hoadley23 Sep
  • Re: nightly crashingTomas Ulin25 Sep
    • Re: nightly crashingJim Hoadley27 Sep
      • Re: nightly crashingTomas Ulin28 Sep