Hi Jim,
Looking at the two emails you have sent, it definitely looks like we are
getting to the core of the problem. Thanks for sharing the info and for the
work of gathering it. I'll make sure we look into the problem and get back
to you on it.
Which version are you using?
Rgrds Mikael
On 2004-09-22 at 22:42, Jim Hoadley wrote:
> Mikael --
>
> Thanks for your help on this. OK, I've been monitoring my servers' CPU and
> memory usage to see what may be causing these node crashes.
>
> First, I switched from 4 nodes to 2 nodes to simplify the configuration.
> So, we've got:
>
> TORMAN: [PIII 1Ghz 256K L2 cache 1024MB RAM], NDB [node2], API [node11]
> EDSEL: [PIII 1Ghz 256K L2 cache 512MB RAM], NDB [node3], API [node12]
>
> There is not much to implicate a problem with CPU. The load remains at
> or below 1.0 right up until the time of a so-called NDB node "crash".
> At that time the load (wait time) goes up to 4, 5, or 6, but I would guess
> this is a symptom and not a cause of the problem.
>
> Memory usage is much more interesting. On EDSEL, when both the API and NDB
> nodes are up, available memory is lost at a rapid rate (approximately 300MB
> lost in 13 hours), swapping begins, and then, with only ~6MB left, the
> NDB node crashes. At that time its memory chunk is freed, but the MySQL API
> keeps on eating memory at roughly the same rate.
>
> When I run the API without the NDB it eats memory at a slower rate, but
> still approximately 10MB per hour.
>
> When I run NDB without the API there is no such problem.
>
> Is this a memory leak?
>
> You could say the small amount of RAM in the machine is the source of the
> problem, but isn't the large amount of RAM that Cluster requires needed for
> the NDB, not the API? In this case it's the API that's taking all the
> memory.
>
> So, suggestions?
>
> -- Jim
>
>
> --- Mikael Ronström <mikael@stripped> wrote:
>
>> Hi Jim,
>> From the logs I can see that the ndbd process stopped working properly,
>> since you get
>>> Ndb kernel is stuck in: Job Handling
>>> Ndb kernel is stuck in: Job Handling
>>
>> This can have many causes, both ndbd causes and OS causes. A look into
>> the trace file will probably reveal the cause. So please send it to me
>> and I can have a look at it.
>>
>> Rgrds Mikael
>>
>> On 2004-09-01 at 18:49, Jim Hoadley wrote:
>>
>>> I've had a cluster going for several weeks; however, almost every day one
>>> or more of my nodes crashes. My first thought is that the Linux boxes are
>>> underpowered, but is there any way I can prove that? I'd sure like to know
>>> MySQL Cluster is stable and will suit my needs before shelling out the
>>> money for production boxes.
>>>
>>> I wrote to the list about this a month ago but never got to the root
>>> cause. So I'd like to try again.
>>>
>>> I've got 2 replicas, 4 storage nodes, 4 APIs on 4 test Linux boxes. Each
>>> box has 512MB RAM, and the test database is tiny (4 rows of data). Here's
>>> what STDERR says on the node [4] that died yesterday afternoon:
>>>
>>> [root@ed 4.ndb_mgm]# ndbd &
>>> [3] 8884
>>> 2004-08-31 16:38:10 [NDB] INFO -- Angel pid: 8884 ndb pid: 8885
>>> 2004-08-31 16:38:10 [NDB] INFO -- NDB Cluster -- DB node 4
>>> 2004-08-31 16:38:10 [NDB] INFO -- Version 3.5.0 (beta) --
>>> 2004-08-31 16:38:10 [NDB] INFO -- Start initiated (version 3.5.0)
>>> 2004-08-31 16:38:11 [NDB] INFO -- Communication to Node 2 opened
>>> 2004-08-31 16:38:11 [NDB] INFO -- Communication to Node 3 opened
>>> 2004-08-31 16:38:11 [NDB] INFO -- Communication to Node 5 opened
>>> 2004-08-31 16:38:11 [NDB] INFO -- Node 1 Connected
>>> 2004-08-31 16:38:12 [NDB] INFO -- Node 2 Connected
>>> 2004-08-31 16:38:12 [NDB] INFO -- Node 3 Connected
>>> 2004-08-31 16:38:14 [NDB] INFO -- Node 5 Connected
>>> 2004-08-31 16:38:14 [NDB] INFO -- Node 2: API version 3.5.0
>>> 2004-08-31 16:38:14 [NDB] INFO -- Node 3: API version 3.5.0
>>> 2004-08-31 16:38:14 [NDB] INFO -- Node 5: API version 3.5.0
>>> NR: setLcpActiveStatusEnd - m_participatingLQH
>>> 2004-08-31 16:38:16 [NDB] INFO -- Communication to Node 11 opened
>>> 2004-08-31 16:38:16 [NDB] INFO -- Communication to Node 12 opened
>>> 2004-08-31 16:38:16 [NDB] INFO -- Communication to Node 13 opened
>>> 2004-08-31 16:38:16 [NDB] INFO -- Communication to Node 14 opened
>>> 2004-08-31 16:38:16 [NDB] INFO -- Communication to Node 0 opened
>>> 2004-08-31 16:38:16 [NDB] INFO -- Started (version 3.5.0)
>>> 2004-08-31 16:38:16 [NDB] INFO -- Node 1: API version 3.5.0
>>> 2004-08-31 16:38:17 [NDB] INFO -- Node 12 Connected
>>> 2004-08-31 16:38:17 [NDB] INFO -- Node 12: API version 3.5.0
>>> 2004-08-31 16:38:17 [NDB] INFO -- Node 11 Connected
>>> 2004-08-31 16:38:17 [NDB] INFO -- Node 14 Connected
>>> 2004-08-31 16:38:17 [NDB] INFO -- Node 11: API version 3.5.0
>>> 2004-08-31 16:38:17 [NDB] INFO -- Node 14: API version 3.5.0
>>> 2004-08-31 16:38:17 [NDB] INFO -- Node 13 Connected
>>> 2004-08-31 16:38:17 [NDB] INFO -- Node 13: API version 3.5.0
>>> Ndb kernel is stuck in: Job Handling
>>> Ndb kernel is stuck in: Job Handling
>>> 2004-08-31 19:40:39 [NDB] ALERT -- Node 3 Disconnected
>>> Error handler shutting down system
>>> Error handler shutdown completed - exiting
>>>
>>> So I'm looking for logging information on node [4] and maybe the MGM node.
>>>
>>> [root@ed 4.ndb_mgm]# find /var/ndbcluster/ -name *.log -print |xargs ls -lrt
>>> <...>
>>> -rw-r--r-- 1 root root 2 Aug 31 19:40 /var/ndbcluster/ndb/NextTraceFileNo.log
>>> -rw-r--r-- 1 root root 10044 Aug 31 19:40 /var/ndbcluster/ndb/error.log
>>>
>>> Let's look in the error log. Last message says:
>>>
>>> Date/Time: Tuesday 31 August 2004 - 19:40:39
>>> Type of error: error
>>> Message: Arbitrator shutdown
>>> Fault ID: 2305
>>> Problem data: Arbitrator decided to shutdown this node
>>> Object of reference: QMGR (Line: 3764) 0x00000002
>>> ProgramName: NDB Kernel
>>> ProcessID: 8885
>>> TraceFile: /var/ndbcluster/ndbNDB_TraceFile_18.trace
>>> ***EOM***
>>>
>>> So let's look at the trace file, #18.
>>>
>>> [root@ed ndbcluster]# ls -l /var/ndbcluster/ndbNDB_TraceFile_18.trace
>>> -rw-r--r-- 1 root root 961439 Aug 31 19:40 /var/ndbcluster/ndbNDB_TraceFile_18.trace
>>>
>>> [root@ed ndbcluster]# wc !$
>>> wc /var/ndbcluster/ndbNDB_TraceFile_18.trace
>>> 16863 135203 961439 /var/ndbcluster/ndbNDB_TraceFile_18.trace
>>>
>>> It's 16k+ lines. I shouldn't paste that in here. Could I email it to
>>> someone?
>>>
>>> Any other ideas how to track this down?
>>>
>>> Thanks in advance!
>>>
>>> -- Jim
>>>
>>>
>>> --- Mikael Ronström <mikael@stripped> wrote:
>>>
>>>> Hi Jim,
>>>> The trace file you sent doesn't seem to correlate with the previous
>>>> info. A bit confusing: the error log used should have been the one in
>>>> the 2.ndb_db directory, if that's where you are running it from. It is
>>>> very strange that this one has not been written to in a week.
>>>>
>>>> The other error.log is from another cluster instance you worked on, it
>>>> seems. Actually this correlates better with the cluster log.
>>>>
>>>> Anyway, one thing you might look for, which caused problems for another
>>>> customer, is whether any other processes start working at the time of
>>>> the crash. If there are other memory-hungry processes working, then the
>>>> ndbd process might start swapping, and then it can easily be excluded
>>>> from the cluster due to missed heartbeats. This will not, however, cause
>>>> the watchdog to complain, so the picture is not completely clear.
>>>>
>>>> Rgrds Mikael
>>>>
>>>> On 2004-07-29 at 23:45, Jim Hoadley wrote:
>>>>
>>>>> Mikael --
>>>>>
>>>>> Thanks for taking on my question. I find 2 error.logs in the ndbcluster
>>>>> tree, and the "live" one isn't where I expected it to be.
>>>>>
>>>>> I'm running the node from this directory:
>>>>> /var/ndbcluster/mysql-test/ndbcluster/2.ndb_db
>>>>>
>>>>> But the (2) error.log(s) I found were here:
>>>>>
>>>>> -rw-r--r-- 1 root root 468 Jul 21 13:48 /var/ndbcluster/mysql-test/var/ndbcluster/2.ndb_db/error.log
>>>>> -rw-r--r-- 1 root root 4458 Jul 28 20:49 /var/ndbcluster/ndb/error.log
>>>>>
>>>>> Is this a problem?
>>>>>
>>>>> Obviously the one dated Jul 21 is not used. The other has this as the
>>>>> last entry, the one from the most recent "crash":
>>>>>
>>>>> Date/Time: Wednesday 28 July 2004 - 20:49:21
>>>>> Type of error: error
>>>>> Message: Arbitrator shutdown
>>>>> Fault ID: 2305
>>>>> Problem data: Arbitrator decided to shutdown this node
>>>>> Object of reference: QMGR (Line: 3764) 0x00000002
>>>>> ProgramName: NDB Kernel
>>>>> ProcessID: 15932
>>>>> TraceFile: /var/ndbcluster/ndbNDB_TraceFile_10.trace
>>>>> ***EOM***
>>>>>
>>>>> So here's /var/ndbcluster/ndbNDB_TraceFile_10.trace
>>>>>
>>>>> [root@BOX2 ndbcluster]# ls -l ./ndbNDB_TraceFile_10.trace
>>>>> -rw-r--r-- 1 root root 962184 Jul 28 20:49 ./ndbNDB_TraceFile_10.trace
>>>>>
>>>>>
>>>>> [root@cooler ndbcluster]# head -160 ./ndbNDB_TraceFile_10.trace
>>>>> JAM CONTENTS up->down left->right ?=not block entry
>>>>> BLOCK ADDR ADDR ADDR ADDR ADDR ADDR ADDR ADDR
>>>>> ?000112 001452 001475 001489 001489 001499
>>>>> NDBFS 000913 000915 000726 000932
>>>>> DBACC 000103 000110
>>>>> QMGR 001817 001827 001852 001859 002705 002708 002727 002729
>>>>> 002737 002737 002740 002737 002737 002737 002737 002737
>>>>> 002737 002737 002737 002737 002737 002737 002737 002737
>>>>> 002737 002737 002737 002737 002737 002737 002737 002737
>>>>> 002737 002737 002737 002737 002737 002737 002737 002737
>>>>> 002737 002737 002737 002737 002737 002737 002737 002737
>>>>> 002737 002737 002737 002737 002737 002737 002737 002737
>>>>> 002737 001243 001243 001243 001243 001243 001243 001243
>>>>> 001243 001243 001243 001243 001243 001243 001243 001243
>>>>> 001243 001243 001243 001243 001243 001243 001243 001243
>>
> === message truncated ===
>
>
>
>
>
> --
> MySQL Cluster Mailing List
> For list archives: http://lists.mysql.com/cluster
> To unsubscribe:
> http://lists.mysql.com/cluster?unsub=1
>
>
Mikael Ronström, Senior Software Architect
MySQL AB, www.mysql.com
Clustering:
http://www.infoworld.com/article/04/04/14/HNmysqlcluster_1.html
http://www.eweek.com/article2/0,1759,1567546,00.asp
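
The memory figures quoted in the thread (roughly 300MB lost in 13 hours with
both nodes up, and roughly 10MB per hour for the API alone) can be turned into
a growth rate and a rough time-to-exhaustion estimate. The following is only a
sketch under stated assumptions: it is not part of MySQL Cluster, the function
names are hypothetical, and read_rss_kb assumes a Linux /proc filesystem that
exposes a VmRSS line in /proc/<pid>/status.

```python
import re


def read_rss_kb(pid):
    """Return the resident set size in kB for a process, or None if absent.

    Assumption: Linux, with a VmRSS line in /proc/<pid>/status.
    """
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            match = re.match(r"VmRSS:\s+(\d+)\s+kB", line)
            if match:
                return int(match.group(1))
    return None


def growth_rate_mb_per_hour(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


def hours_until_exhausted(free_mb, rate_mb_per_hour):
    """Hours until free_mb is consumed at a constant, positive growth rate."""
    if rate_mb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return free_mb / rate_mb_per_hour


if __name__ == "__main__":
    # Figures taken from the thread: about 300MB consumed over 13 hours.
    rate = growth_rate_mb_per_hour([(0, 0), (13, 300)])
    print(f"growth rate: {rate:.1f} MB/hour")
    # Hypothetical starting point: 300MB free on a 512MB machine.
    print(f"hours until exhausted: {hours_until_exhausted(300, rate):.1f}")
```

Sampling read_rss_kb for the mysqld and ndbd processes separately, at a fixed
interval, and feeding the results to growth_rate_mb_per_hour would show which
process is actually growing. A steady positive slope for one process with a
flat line for the other would support the observation that the API side, not
the NDB node, is consuming the memory.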