List: Cluster
From: Mikael Ronström
Date: September 2 2004 6:59am
Subject: Re: nightly crashing
Hi Jim,
From the logs I can see that the ndbd process stopped working properly,
since you get:
> Ndb kernel is stuck in: Job Handling
> Ndb kernel is stuck in: Job Handling

This can have many causes, either in ndbd itself or in the OS. The
trace file will probably reveal which, so please send it to me and I
will have a look at it.
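
As a rough aid for reading the ndbd output quoted below, here is a minimal Python sketch that counts the watchdog "stuck" messages per phase and picks out the first timestamped event that follows them. The function name and both regular expressions are illustrative assumptions based only on the log lines shown in this thread; they are not part of ndbd or any MySQL tool.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the log lines quoted in this thread.
STUCK_RE = re.compile(r"Ndb kernel is stuck in:\s*(.+?)\s*$")
EVENT_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[NDB\] (\w+)\s+-- (.*?)\s*$"
)

def stuck_report(lines):
    """Return (stuck counts per phase, first timestamped event after a stuck line).

    re.search is used so that mail-quote prefixes such as "> " are tolerated.
    """
    counts = Counter()
    next_event = None
    seen_stuck = False
    for line in lines:
        stuck = STUCK_RE.search(line)
        if stuck:
            counts[stuck.group(1)] += 1
            seen_stuck = True
            continue
        event = EVENT_RE.search(line)
        if event and seen_stuck and next_event is None:
            # (timestamp, severity, message)
            next_event = event.groups()
    return counts, next_event

# Lines copied from the ndbd output quoted in this thread.
SAMPLE = [
    "> 2004-08-31 16:38:17 [NDB] INFO     -- Node 13: API version 3.5.0",
    "> Ndb kernel is stuck in: Job Handling",
    "> Ndb kernel is stuck in: Job Handling",
    "> 2004-08-31 19:40:39 [NDB] ALERT    -- Node 3 Disconnected",
]
```

Calling `stuck_report(SAMPLE)` gives the number of stuck messages per phase and the event that came next, which makes the three-hour gap between the last INFO line and the ALERT easy to spot in a longer log.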

Rgrds Mikael

On 2004-09-01 at 18:49, Jim Hoadley wrote:

> I've had a cluster going for several weeks. However, almost every day
> one or
> more of my nodes crashes. My first thought is that the Linux boxes are
> underpowered, but is there any way I can prove that? I'd sure like to 
> know
> MySQL Cluster is stable and will suit my needs before shelling out the 
> money
> for production boxes.
>
> I wrote to the list about this a month ago but never got to the root 
> cause. So
> I'd like to try again.
>
> I've got 2 replicas, 4 storage nodes, 4 APIs on 4 test Linux boxes. 
> Each box
> has 512MB RAM, and the test database is tiny (4 rows of data). Here's 
> what
> STDERR says on the node [4] that died yesterday afternoon:
>
> [root@ed 4.ndb_mgm]# ndbd &
> [3] 8884
> 2004-08-31 16:38:10 [NDB] INFO     -- Angel pid: 8884 ndb pid: 8885
> 2004-08-31 16:38:10 [NDB] INFO     -- NDB Cluster -- DB node 4
> 2004-08-31 16:38:10 [NDB] INFO     -- Version 3.5.0 (beta) --
> 2004-08-31 16:38:10 [NDB] INFO     -- Start initiated (version 3.5.0)
> 2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 2 opened
> 2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 3 opened
> 2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 5 opened
> 2004-08-31 16:38:11 [NDB] INFO     -- Node 1 Connected
> 2004-08-31 16:38:12 [NDB] INFO     -- Node 2 Connected
> 2004-08-31 16:38:12 [NDB] INFO     -- Node 3 Connected
> 2004-08-31 16:38:14 [NDB] INFO     -- Node 5 Connected
> 2004-08-31 16:38:14 [NDB] INFO     -- Node 2: API version 3.5.0
> 2004-08-31 16:38:14 [NDB] INFO     -- Node 3: API version 3.5.0
> 2004-08-31 16:38:14 [NDB] INFO     -- Node 5: API version 3.5.0
> NR: setLcpActiveStatusEnd - m_participatingLQH
> 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 11 opened
> 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 12 opened
> 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 13 opened
> 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 14 opened
> 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 0 opened
> 2004-08-31 16:38:16 [NDB] INFO     -- Started (version 3.5.0)
> 2004-08-31 16:38:16 [NDB] INFO     -- Node 1: API version 3.5.0
> 2004-08-31 16:38:17 [NDB] INFO     -- Node 12 Connected
> 2004-08-31 16:38:17 [NDB] INFO     -- Node 12: API version 3.5.0
> 2004-08-31 16:38:17 [NDB] INFO     -- Node 11 Connected
> 2004-08-31 16:38:17 [NDB] INFO     -- Node 14 Connected
> 2004-08-31 16:38:17 [NDB] INFO     -- Node 11: API version 3.5.0
> 2004-08-31 16:38:17 [NDB] INFO     -- Node 14: API version 3.5.0
> 2004-08-31 16:38:17 [NDB] INFO     -- Node 13 Connected
> 2004-08-31 16:38:17 [NDB] INFO     -- Node 13: API version 3.5.0
> Ndb kernel is stuck in: Job Handling
> Ndb kernel is stuck in: Job Handling
> 2004-08-31 19:40:39 [NDB] ALERT    -- Node 3 Disconnected
> Error handler shutting down system
> Error handler shutdown completed - exiting
>
> So I'm looking for logging information on node[4] and maybe the MGM 
> node.
>
> [root@ed 4.ndb_mgm]# find /var/ndbcluster/ -name *.log -print |xargs ls -lrt
> <...>
> -rw-r--r--    1 root     root            2 Aug 31 19:40 /var/ndbcluster/ndb/NextTraceFileNo.log
> -rw-r--r--    1 root     root        10044 Aug 31 19:40 /var/ndbcluster/ndb/error.log
>
> Let's look in the error log. Last message says:
>
> Date/Time: Tuesday 31 August 2004 - 19:40:39
> Type of error: error
> Message: Arbitrator shutdown
> Fault ID: 2305
> Problem data: Arbitrator decided to shutdown this node
> Object of reference: QMGR (Line: 3764) 0x00000002
> ProgramName: NDB Kernel
> ProcessID: 8885
> TraceFile: /var/ndbcluster/ndbNDB_TraceFile_18.trace
> ***EOM***
>
> So let's look at the trace file, #18.
>
> [root@ed ndbcluster]# ls -l /var/ndbcluster/ndbNDB_TraceFile_18.trace
> -rw-r--r--    1 root     root       961439 Aug 31 19:40 /var/ndbcluster/ndbNDB_TraceFile_18.trace
>
> [root@ed ndbcluster]# wc !$
> wc /var/ndbcluster/ndbNDB_TraceFile_18.trace
>   16863  135203  961439 /var/ndbcluster/ndbNDB_TraceFile_18.trace
>
> It's 16k+ lines. I shouldn't paste that in here. Could I email it to
> someone?
>
> Any other ideas how to track this down?
>
> Thanks in advance!
>
> -- Jim
>
>
> --- Mikael Ronström <mikael@stripped> wrote:
>
>> Hi Jim,
>> The trace file you sent doesn't seem to correlate with the previous
>> info, which is a bit confusing. The error log in use should have been
>> the one in the 2.ndb_db directory, if that's where you are running the
>> node from. That it has not been written to in a week is very strange.
>>
>> The other error.log seems to be from another cluster instance you
>> worked on. Actually, it correlates better with the cluster log.
>>
>> Anyway, one thing to look for, which caused problems for another
>> customer, is whether any other processes start working at the time of
>> the crash. If other memory-hungry processes are running, the ndbd
>> process might start swapping, and it can then easily be dropped from
>> the cluster due to missed heartbeats. That would not cause the
>> watchdog to complain, however, so the picture is not completely
>> clear.
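
One way to test the swapping theory on a Linux box is to sample the memory fields of the ndbd process around the time of the crashes. The sketch below is a minimal Python example under stated assumptions: it relies on `/proc/<pid>/status` exposing `VmRSS` and `VmSwap` lines, which newer Linux kernels do but older ones may not, and the function names are hypothetical.

```python
def vm_fields_kb(status_text, wanted=("VmRSS", "VmSwap")):
    """Extract selected memory fields (in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            parts = rest.split()
            # Values look like "   20480 kB"; keep the number only.
            if parts and parts[0].isdigit():
                fields[name] = int(parts[0])
    return fields

def read_status(pid):
    """Read the status file of a process (Linux /proc only)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()
```

Running `vm_fields_kb(read_status(pid))` from cron every minute, with `pid` set to the ndb pid printed at startup, and logging the result would show whether `VmSwap` grows shortly before a node is dropped. A missing `VmSwap` key simply means the kernel does not report it.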
>>
>> Rgrds Mikael
>>
>> On 2004-07-29 at 23:45, Jim Hoadley wrote:
>>
>>> Mikael --
>>>
>>> Thanks for taking on my question. I find 2 error.logs in the
>>> ndbcluster tree,
>>> and the "live" one isn't where I expected it to be.
>>>
>>> I'm running the node from this directory:
>>> /var/ndbcluster/mysql-test/ndbcluster/2.ndb_db
>>>
>>> But the (2) error.log(s) I found were here:
>>>
>>> -rw-r--r--    1 root     root          468 Jul 21 13:48 /var/ndbcluster/mysql-test/var/ndbcluster/2.ndb_db/error.log
>>> -rw-r--r--    1 root     root         4458 Jul 28 20:49 /var/ndbcluster/ndb/error.log
>>>
>>> Is this a problem?
>>>
>>> Obviously the one dated Jul 21 is not used. The other has this as the
>>> last
>>> entry, the one from the most recent "crash":
>>>
>>> Date/Time: Wednesday 28 July 2004 - 20:49:21
>>> Type of error: error
>>> Message: Arbitrator shutdown
>>> Fault ID: 2305
>>> Problem data: Arbitrator decided to shutdown this node
>>> Object of reference: QMGR (Line: 3764) 0x00000002
>>> ProgramName: NDB Kernel
>>> ProcessID: 15932
>>> TraceFile: /var/ndbcluster/ndbNDB_TraceFile_10.trace
>>> ***EOM***
>>>
>>> So here's /var/ndbcluster/ndbNDB_TraceFile_10.trace
>>>
>>> [root@BOX2 ndbcluster]# ls -l ./ndbNDB_TraceFile_10.trace
>>> -rw-r--r--    1 root     root       962184 Jul 28 20:49 ./ndbNDB_TraceFile_10.trace
>>>
>>>
>>> [root@cooler ndbcluster]# head -160 ./ndbNDB_TraceFile_10.trace
>>> JAM CONTENTS up->down left->right ?=not block entry
>>> BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
>>>        ?000112 001452 001475 001489 001489 001499
>>> NDBFS   000913 000915 000726 000932
>>> DBACC   000103 000110
>>> QMGR    001817 001827 001852 001859 002705 002708 002727 002729
>>>         002737 002737 002740 002737 002737 002737 002737 002737
>>>         002737 002737 002737 002737 002737 002737 002737 002737
>>>         002737 002737 002737 002737 002737 002737 002737 002737
>>>         002737 002737 002737 002737 002737 002737 002737 002737
>>>         002737 002737 002737 002737 002737 002737 002737 002737
>>>         002737 002737 002737 002737 002737 002737 002737 002737
>>>         002737 001243 001243 001243 001243 001243 001243 001243
>>>         001243 001243 001243 001243 001243 001243 001243 001243
>>>         001243 001243 001243 001243 001243 001243 001243 001243
>>>         001243 001243 001243 001243 001243 001243 001243 001243
>>>         001243 001243 001243 001243 001243 001243 001243 001243
>>>         001243 001243 001243 001243 001243 001243 001243 001243
>>>         001243 001273 001285 001296 001316 001316 001316 001316
>>>         001316 001316 001316 001316 001316 001316 001316 001316
>>>         001316 001316 001316 001316 001316 001316 001316 001316
>>>         001316 001316 001316 001316 001316 001316 001316 001316
>>>         001316 001316 001316 001316 001316 001316 001316 001316
>>>         001316 001316 001316 001316 001316 001316 001316 001316
>>>         001316 001316 001316 001316 002091 002093 002105 002109
>>>         002109 002112 002109 002109 002109 002109 002109 002109
>>>         002109 002109 002109 002109 002109 002109 002109 002109
>>>         002109 002109 002109 002109 002109 002109 002109 002109
>>>         002109 002109 002109 002109 002109 002109 002109 002109
>>>         002109 002109 002109 002109 002109 002109 002109 002109
>>>         002109 002109 002109 002109 002109 002109 002109 002109
>>> QMGR    001817 001827 001895 001897
>>> QMGR    001694 003061 003231 003268 003325 003352 003360 003360
>>>         003364 003364 003386 003791 003791 003794 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003396 002956 002958 002960
>>>         003312
>>> QMGR    001817 001827 001895 001897
>>> QMGR    001694 003054 003056 003058 003061 003231 003268 003325
>>>         003352 003360 003360 003364 003364 003364 003386 003791
>>>         003791 003794 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003791 003791 003791 003791 003791 003791 003791 003791
>>>         003396 002956 002958 002960 003312
>>> QMGR    000150 002091 002093
>>> QMGR    002133
>>> DBDIH   013454 002170 002870
>>> DBTC    000795 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950
>>> DBDICT  002546
>>> SUMA    000306 000314
>>> GREP    000367
>>> CMVMI   000349 000353
>>> DBTC    000795 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000845 000893 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950 000950 000950 000950 000950 000950 000950
>>>         000950 000950
>>> DBDICT  002546
>>> SUMA    000306 000314
>>> GREP    000367
>>> CMVMI   000349 000353
>>> CMVMI   000349 000353 000372
>>> QMGR    002193 002218 002222 002228 002230 002257 002262
>>> QMGR    002283 002297 002297 002297 002297 002297 002297 002297
>>>         002297 002297 002297 002297 002297 002297 002297 002297
>>>         002297 002297 002297 002297 002297 002297 002297 002297
>>>         002297 002297 002297 002297 002297 002297 002297 002297
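
Since most of the JAM dump above consists of the same address repeated many times, a run-length summary is easier to read and to post. This is a minimal Python sketch, assuming only what the dump itself shows: rows start with a block name followed by six-digit addresses, continuation rows contain addresses only, and a `?` prefix marks a non-block entry. The function name is hypothetical.

```python
import re
from itertools import groupby

ADDR_RE = re.compile(r"^\??\d{6}$")

def compress_jam(lines):
    """Return [(block, [(address, repeat_count), ...]), ...] for a JAM dump."""
    blocks = []  # list of [block_name, [addresses]]
    for raw in lines:
        # Drop mail-quote prefixes (">>> ") before tokenizing.
        tokens = re.sub(r"^(>\s?)+", "", raw).split()
        if not tokens:
            continue
        if ADDR_RE.match(tokens[0]):
            # Continuation row (or a row with no block name, filed under "?").
            if not blocks:
                blocks.append(["?", []])
            blocks[-1][1].extend(t for t in tokens if ADDR_RE.match(t))
        elif len(tokens) > 1 and all(ADDR_RE.match(t) for t in tokens[1:]):
            # New block row: name followed by addresses.
            blocks.append([tokens[0], list(tokens[1:])])
        # Anything else (the two header rows) is skipped.
    return [
        (name, [(addr, len(list(run))) for addr, run in groupby(addrs)])
        for name, addrs in blocks
    ]
```

Applied to the dump above, a long QMGR or DBTC section collapses to a handful of `(address, count)` pairs, which keeps the order of entries while removing the repetition.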
>>
> === message truncated ===
>
Mikael Ronström, Senior Software Architect
MySQL AB, www.mysql.com

Clustering:
http://www.infoworld.com/article/04/04/14/HNmysqlcluster_1.html

http://www.eweek.com/article2/0,1759,1567546,00.asp

