List: Cluster
From: Tomas Ulin
Date: September 23 2004 9:55am
Subject: Re: nightly crashing
Jim,

looks like a memory leak to me.  Is it possible for you to send us your 
application (or part of it) so we can reproduce the leak and fix it?
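
To put a number on the growth per process while reproducing, one option is
to sample the resident set size from /proc at intervals. The following is
only a minimal sketch, assuming a Linux box with Python available; the
function names are made up for illustration and are not part of any MySQL
or NDB tool.

```python
import time

def parse_rss_kb(status_text):
    """Return the VmRSS value (in kB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like: "VmRSS:     10240 kB"
            return int(line.split()[1])
    return None

def growth_mb_per_hour(samples):
    """Average growth between the first and last (seconds, rss_kb) sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (r1 - r0) / 1024.0 / ((t1 - t0) / 3600.0)

def sample_rss(pid, interval_s=60, count=10):
    """Collect `count` RSS samples for `pid`, `interval_s` seconds apart."""
    samples = []
    for i in range(count):
        with open("/proc/%d/status" % pid) as f:
            samples.append((time.time(), parse_rss_kb(f.read())))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

Running sample_rss() against the mysqld and ndbd pids separately and
feeding the result to growth_mb_per_hour() would show which process is
actually growing. As a sanity check, 300MB lost in 13 hours comes out at
roughly 23MB per hour.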

A question about the below:

What do you mean by: "When I run the API without the NDB it eats memory 
at a slower rate, ..."?

BR,

T

Jim Hoadley wrote:

>Mikael --
>
>Thanks for your help on this. OK, I've been monitoring my servers' CPU and
>memory usage to see what may be causing these node crashes.
>
>First, I switched from 4 nodes to 2 nodes to simplify the configuration. So,
>we've got:
>
>TORMAN: [PIII 1GHz 256K L2 cache 1024MB RAM], NDB [node2], API [node11]
>EDSEL:  [PIII 1GHz 256K L2 cache  512MB RAM], NDB [node3], API [node12]
>
>There is not much to implicate a problem with CPU. The load remains at 
>or below 1.0 right up until the time of a so-called NDB node "crash". 
>At that time the load (wait time) goes up to 4, 5, or 6, but I would guess
>this is a symptom and not a cause of the problem.
>
>Memory usage is much more interesting. On EDSEL, when both the API and NDB
>nodes are up, available memory is lost at a rapid rate (approximately 300MB
>lost in 13 hours), swapping begins, then, with only ~6MB left, the NDB node
>crashes. At that time its memory chunk is freed, but the MySQL API keeps 
>on eating memory at roughly the same rate.
>
>When I run the API without the NDB it eats memory at a slower rate, but 
>still approximately 10MB per hour.
>
>When I run NDB without the API there is no such problem.
>
>Is this a memory leak?
>
>You could say the small amount of RAM in the machine is the source of the
>problem, but isn't the large amount of RAM Cluster requires for the NDB, 
>not the API? But in this case it's the API that's taking all the memory.
>
>So, suggestions? 
>
>-- Jim
>
>
>--- Mikael Ronström <mikael@stripped> wrote:
>
>  
>
>>Hi Jim,
>> From the logs I can see that the ndbd process stopped working properly 
>>since you get
>>    
>>
>>>Ndb kernel is stuck in: Job Handling
>>>Ndb kernel is stuck in: Job Handling
>>>      
>>>
>>This can have many causes, both ndbd causes and OS causes. A look into
>>the trace file will probably reveal the cause. So please send it to me
>>and I can have a look at it.
>>
>>Rgrds Mikael
>>
>>2004-09-01 kl. 18.49 skrev Jim Hoadley:
>>
>>    
>>
>>>I've had a cluster going for several weeks; however, almost every day
>>>one or more of my nodes crashes. My first thought is that the Linux
>>>boxes are underpowered, but is there any way I can prove that? I'd
>>>sure like to know MySQL Cluster is stable and will suit my needs
>>>before shelling out the money for production boxes.
>>>
>>>I wrote to the list about this a month ago but never got to the root
>>>cause. So I'd like to try again.
>>>
>>>I've got 2 replicas, 4 storage nodes, 4 APIs on 4 test Linux boxes.
>>>Each box has 512MB RAM, and the test database is tiny (4 rows of
>>>data). Here's what STDERR says on the node [4] that died yesterday
>>>afternoon:
>>>
>>>[root@ed 4.ndb_mgm]# ndbd &
>>>[3] 8884
>>>2004-08-31 16:38:10 [NDB] INFO     -- Angel pid: 8884 ndb pid: 8885
>>>2004-08-31 16:38:10 [NDB] INFO     -- NDB Cluster -- DB node 4
>>>2004-08-31 16:38:10 [NDB] INFO     -- Version 3.5.0 (beta) --
>>>2004-08-31 16:38:10 [NDB] INFO     -- Start initiated (version 3.5.0)
>>>2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 2 opened
>>>2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 3 opened
>>>2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 5 opened
>>>2004-08-31 16:38:11 [NDB] INFO     -- Node 1 Connected
>>>2004-08-31 16:38:12 [NDB] INFO     -- Node 2 Connected
>>>2004-08-31 16:38:12 [NDB] INFO     -- Node 3 Connected
>>>2004-08-31 16:38:14 [NDB] INFO     -- Node 5 Connected
>>>2004-08-31 16:38:14 [NDB] INFO     -- Node 2: API version 3.5.0
>>>2004-08-31 16:38:14 [NDB] INFO     -- Node 3: API version 3.5.0
>>>2004-08-31 16:38:14 [NDB] INFO     -- Node 5: API version 3.5.0
>>>NR: setLcpActiveStatusEnd - m_participatingLQH
>>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 11 opened
>>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 12 opened
>>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 13 opened
>>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 14 opened
>>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 0 opened
>>>2004-08-31 16:38:16 [NDB] INFO     -- Started (version 3.5.0)
>>>2004-08-31 16:38:16 [NDB] INFO     -- Node 1: API version 3.5.0
>>>2004-08-31 16:38:17 [NDB] INFO     -- Node 12 Connected
>>>2004-08-31 16:38:17 [NDB] INFO     -- Node 12: API version 3.5.0
>>>2004-08-31 16:38:17 [NDB] INFO     -- Node 11 Connected
>>>2004-08-31 16:38:17 [NDB] INFO     -- Node 14 Connected
>>>2004-08-31 16:38:17 [NDB] INFO     -- Node 11: API version 3.5.0
>>>2004-08-31 16:38:17 [NDB] INFO     -- Node 14: API version 3.5.0
>>>2004-08-31 16:38:17 [NDB] INFO     -- Node 13 Connected
>>>2004-08-31 16:38:17 [NDB] INFO     -- Node 13: API version 3.5.0
>>>Ndb kernel is stuck in: Job Handling
>>>Ndb kernel is stuck in: Job Handling
>>>2004-08-31 19:40:39 [NDB] ALERT    -- Node 3 Disconnected
>>>Error handler shutting down system
>>>Error handler shutdown completed - exiting
>>>
>>>So I'm looking for logging information on node[4] and maybe the MGM node.
>>>
>>>[root@ed 4.ndb_mgm]# find /var/ndbcluster/ -name *.log -print |xargs 
>>>ls -lrt
>>><...>
>>>-rw-r--r--    1 root     root            2 Aug 31 19:40
>>>/var/ndbcluster/ndb/NextTraceFileNo.log
>>>-rw-r--r--    1 root     root        10044 Aug 31 19:40
>>>/var/ndbcluster/ndb/error.log
>>>
>>>Let's look in the error log. Last message says:
>>>
>>>Date/Time: Tuesday 31 August 2004 - 19:40:39
>>>Type of error: error
>>>Message: Arbitrator shutdown
>>>Fault ID: 2305
>>>Problem data: Arbitrator decided to shutdown this node
>>>Object of reference: QMGR (Line: 3764) 0x00000002
>>>ProgramName: NDB Kernel
>>>ProcessID: 8885
>>>TraceFile: /var/ndbcluster/ndbNDB_TraceFile_18.trace
>>>***EOM***
>>>
>>>So let's look at the trace file, #18.
>>>
>>>[root@ed ndbcluster]# ls -l /var/ndbcluster/ndbNDB_TraceFile_18.trace
>>>-rw-r--r--    1 root     root       961439 Aug 31 19:40
>>>/var/ndbcluster/ndbNDB_TraceFile_18.trace
>>>
>>>[root@ed ndbcluster]# wc !$
>>>wc /var/ndbcluster/ndbNDB_TraceFile_18.trace
>>>  16863  135203  961439 /var/ndbcluster/ndbNDB_TraceFile_18.trace
>>>
>>>It's 16k+ lines. I shouldn't paste that in here. Could I email it to
>>>someone?
>>>
>>>Any other ideas how to track this down?
>>>
>>>Thanks in advance!
>>>
>>>-- Jim
>>>
>>>
>>>--- Mikael Ronström <mikael@stripped> wrote:
>>>
>>>      
>>>
>>>>Hi Jim,
>>>>The trace file you sent doesn't seem to correlate with the previous
>>>>info. A bit confusing: the error log used should have been the one in
>>>>the 2.ndb_db directory if that's where you are running it from. This
>>>>not being written in a week is very strange.
>>>>
>>>>The other error.log is from another cluster instance you worked on, it
>>>>seems. Actually this correlates better with the cluster log.
>>>>
>>>>Anyway, one problem you might look for, which we saw with another
>>>>customer, is whether any other processes start working at the time
>>>>of the crash. If other memory-hungry processes are running, the ndbd
>>>>process might start swapping, and then it can easily be dropped from
>>>>the cluster due to missed heartbeats. This would not cause the
>>>>watchdog to complain, however, so the picture is not completely
>>>>clear.
>>>>
>>>>Rgrds Mikael
>>>>
>>>>2004-07-29 kl. 23.45 skrev Jim Hoadley:
>>>>
>>>>        
>>>>
>>>>>Mikael --
>>>>>
>>>>>Thanks for taking on my question. I find 2 error.logs in the
>>>>>ndbcluster tree,
>>>>>and the "live" one isn't where I expected it to be.
>>>>>
>>>>>I'm running the node from this directory:
>>>>>/var/ndbcluster/mysql-test/ndbcluster/2.ndb_db
>>>>>
>>>>>But the (2) error.log(s) I found were here:
>>>>>
>>>>>-rw-r--r--    1 root     root          468 Jul 21 13:48
>>>>>/var/ndbcluster/mysql-test/var/ndbcluster/2.ndb_db/error.log
>>>>>-rw-r--r--    1 root     root         4458 Jul 28 20:49
>>>>>/var/ndbcluster/ndb/error.log
>>>>>
>>>>>Is this a problem?
>>>>>
>>>>>Obviously the one dated Jul 21 is not used. The other has this as the
>>>>>last
>>>>>entry, the one from the most recent "crash":
>>>>>
>>>>>Date/Time: Wednesday 28 July 2004 - 20:49:21
>>>>>Type of error: error
>>>>>Message: Arbitrator shutdown
>>>>>Fault ID: 2305
>>>>>Problem data: Arbitrator decided to shutdown this node
>>>>>Object of reference: QMGR (Line: 3764) 0x00000002
>>>>>ProgramName: NDB Kernel
>>>>>ProcessID: 15932
>>>>>TraceFile: /var/ndbcluster/ndbNDB_TraceFile_10.trace
>>>>>***EOM***
>>>>>
>>>>>So here's /var/ndbcluster/ndbNDB_TraceFile_10.trace
>>>>>
>>>>>[root@BOX2 ndbcluster]# ls -l ./ndbNDB_TraceFile_10.trace
>>>>>-rw-r--r--    1 root     root       962184 Jul 28 20:49
>>>>>./ndbNDB_TraceFile_10.trace
>>>>>
>>>>>
>>>>>[root@cooler ndbcluster]# head -160 ./ndbNDB_TraceFile_10.trace
>>>>>JAM CONTENTS up->down left->right ?=not block entry
>>>>>BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
>>>>>       ?000112 001452 001475 001489 001489 001499
>>>>>NDBFS   000913 000915 000726 000932
>>>>>DBACC   000103 000110
>>>>>QMGR    001817 001827 001852 001859 002705 002708 002727 002729
>>>>>        002737 002737 002740 002737 002737 002737 002737 002737
>>>>>        002737 002737 002737 002737 002737 002737 002737 002737
>>>>>        002737 002737 002737 002737 002737 002737 002737 002737
>>>>>        002737 002737 002737 002737 002737 002737 002737 002737
>>>>>        002737 002737 002737 002737 002737 002737 002737 002737
>>>>>        002737 002737 002737 002737 002737 002737 002737 002737
>>>>>        002737 001243 001243 001243 001243 001243 001243 001243
>>>>>        001243 001243 001243 001243 001243 001243 001243 001243
>>>>>        001243 001243 001243 001243 001243 001243 001243 001243
>>>>>          
>>>>>
>=== message truncated ===
