List: Cluster
From: Jim Hoadley
Date: September 23 2004 2:50pm
Subject: Re: nightly crashing

Tomas --

Thanks, I'll send the application to you and Mikael in a separate email.
Mikael: I am running mysql-4.1.5-gamma-nightly-20040902.tar.gz.

> What do you mean by: "When I run the API without the NDB it eats memory 
> at a slower rate, ..."?

Now that I have identified the reason NDB on EDSEL is crashing (it runs out
of memory), I am trying to determine whether the memory loss is due to the
NDB, the API, or both together. Make sense?

  API and NDB together: problem 
  NDB: looks like memory loss isn't a big problem, but haven't tested enough
  API: apparently still a problem

I could provide statistics to back up these statements.
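
A minimal sketch of how such statistics might be gathered, assuming a Linux box where /proc/<pid>/status reports a VmRSS line. The function names rss_kb and growth_mb_per_hour are hypothetical, made up for illustration; they are not part of MySQL Cluster or any of its tools. The idea is to sample the resident memory of the ndbd and mysqld processes at intervals, then fit a straight line to see how fast each one grows.

```python
import re


def rss_kb(status_text):
    """Extract VmRSS (resident memory, in kB) from the text of /proc/<pid>/status.

    Returns None if the text has no VmRSS line.
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def growth_mb_per_hour(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den
```

Running the same sampling for the three cases above (API and NDB together, NDB alone, API alone) would give one growth figure per case to compare.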

And I'm going to re-send an email I sent yesterday that didn't show up on the list.

Thanks again!

-- Jim

--- Tomas Ulin <tomas@stripped> wrote:

> Jim,
> 
> looks like a memory leak to me.  Is it possible for you to send us your 
> application (or part of it) so we can reproduce, and still the leak?
> 
> A question about the below:
> 
> What do you mean by: "When I run the API without the NDB it eats memory 
> at a slower rate, ..."?
> 
> BR,
> 
> T
> 
> Jim Hoadley wrote:
> 
> >Mikael --
> >
> >Thanks for your help on this. OK, I've been monitoring my servers' CPU and
> >memory usage to see what may be causing these node crashes.
> >
> >First, I switched from 4 nodes to 2 nodes to simplify the configuration. So,
> >we've got:
> >
> >TORMAN: [PIII 1Ghz 256K L2 cache 1024MB RAM], NDB [node2], API [node11]
> >EDSEL:  [PIII 1Ghz 256K L2 cache  512MB RAM], NDB [node3], API [node12]
> >
> >There is not much to implicate a problem with CPU. The load remains at 
> >or below 1.0 right up until the time of a so-called NDB node "crash". 
> >At that time the load (wait time) goes up to 4, 5, or 6, but I would guess
> >this is a symptom and not a cause of the problem.
> >
> >Memory usage is much more interesting. On EDSEL, when both the API and NDB
> >nodes are up, available memory is lost at a rapid rate (approximately 300MB
> >lost in 13 hours), swapping begins, then, with only ~6MB left, the NDB node
> >crashes. At that time its memory chunk is freed, but the MySQL API keeps 
> >on eating memory at roughly the same rate.
> >
> >When I run the API without the NDB it eats memory at a slower rate, but 
> >still approximately 10MB per hour.
> >
> >When I run NDB without the API there is no such problem.
> >
> >Is this a memory leak?
> >
> >You could say the small amount of RAM in the machine is the source of the
> >problem, but isn't the large amount of RAM Cluster requires for the NDB, 
> >not the API? But in this case it's the API that's taking all the memory.
> >
> >So, suggestions? 
> >
> >-- Jim
> >
> >
> >--- Mikael Ronström wrote:
> >  
> >
> >>Hi Jim,
> >> From the logs I can see that the ndbd process stopped working properly 
> >>since you get
> >>    
> >>
> >>>Ndb kernel is stuck in: Job Handling
> >>>Ndb kernel is stuck in: Job Handling
> >>>      
> >>>
> >>This can have many causes, both ndbd causes and OS causes. A look into
> >>the trace-file will probably reveal the cause. So please send it to me
> >>and I can have a look at it.
> >>
> >>Rgrds Mikael
> >>
> >>2004-09-01 kl. 18.49 skrev Jim Hoadley:
> >>
> >>    
> >>
> >>>I've had a cluster going for several weeks, however, almost every day
> >>>one or more of my nodes crashes. My first thought is that the Linux
> >>>boxes are underpowered, but is there any way I can prove that? I'd sure
> >>>like to know MySQL Cluster is stable and will suit my needs before
> >>>shelling out the money for production boxes.
> >>>
> >>>I wrote to the list about this a month ago but never got to the root
> >>>cause. So I'd like to try again.
> >>>
> >>>I've got 2 replicas, 4 storage nodes, 4 APIs on 4 test Linux boxes.
> >>>Each box has 512MB RAM, and the test database is tiny (4 rows of data).
> >>>Here's what STDERR says on the node [4] that died yesterday afternoon:
> >>>
> >>>[root@ed 4.ndb_mgm]# ndbd &
> >>>[3] 8884
> >>>2004-08-31 16:38:10 [NDB] INFO     -- Angel pid: 8884 ndb pid: 8885
> >>>2004-08-31 16:38:10 [NDB] INFO     -- NDB Cluster -- DB node 4
> >>>2004-08-31 16:38:10 [NDB] INFO     -- Version 3.5.0 (beta) --
> >>>2004-08-31 16:38:10 [NDB] INFO     -- Start initiated (version 3.5.0)
> >>>2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 2 opened
> >>>2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 3 opened
> >>>2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 5 opened
> >>>2004-08-31 16:38:11 [NDB] INFO     -- Node 1 Connected
> >>>2004-08-31 16:38:12 [NDB] INFO     -- Node 2 Connected
> >>>2004-08-31 16:38:12 [NDB] INFO     -- Node 3 Connected
> >>>2004-08-31 16:38:14 [NDB] INFO     -- Node 5 Connected
> >>>2004-08-31 16:38:14 [NDB] INFO     -- Node 2: API version 3.5.0
> >>>2004-08-31 16:38:14 [NDB] INFO     -- Node 3: API version 3.5.0
> >>>2004-08-31 16:38:14 [NDB] INFO     -- Node 5: API version 3.5.0
> >>>NR: setLcpActiveStatusEnd - m_participatingLQH
> >>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 11 opened
> >>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 12 opened
> >>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 13 opened
> >>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 14 opened
> >>>2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 0 opened
> >>>2004-08-31 16:38:16 [NDB] INFO     -- Started (version 3.5.0)
> >>>2004-08-31 16:38:16 [NDB] INFO     -- Node 1: API version 3.5.0
> >>>2004-08-31 16:38:17 [NDB] INFO     -- Node 12 Connected
> >>>2004-08-31 16:38:17 [NDB] INFO     -- Node 12: API version 3.5.0
> >>>2004-08-31 16:38:17 [NDB] INFO     -- Node 11 Connected
> >>>2004-08-31 16:38:17 [NDB] INFO     -- Node 14 Connected
> >>>2004-08-31 16:38:17 [NDB] INFO     -- Node 11: API version 3.5.0
> >>>2004-08-31 16:38:17 [NDB] INFO     -- Node 14: API version 3.5.0
> >>>2004-08-31 16:38:17 [NDB] INFO     -- Node 13 Connected
> >>>2004-08-31 16:38:17 [NDB] INFO     -- Node 13: API version 3.5.0
> >>>Ndb kernel is stuck in: Job Handling
> >>>Ndb kernel is stuck in: Job Handling
> >>>2004-08-31 19:40:39 [NDB] ALERT    -- Node 3 Disconnected
> >>>Error handler shutting down system
> >>>Error handler shutdown completed - exiting
> >>>
> >>>So I'm looking for logging information on node[4] and maybe the MGM node.
> >>>
> >>>[root@ed 4.ndb_mgm]# find /var/ndbcluster/ -name *.log -print |xargs ls -lrt
> >>><...>
> >>>-rw-r--r--    1 root     root            2 Aug 31 19:40 /var/ndbcluster/ndb/NextTraceFileNo.log
> >>>-rw-r--r--    1 root     root        10044 Aug 31 19:40 /var/ndbcluster/ndb/error.log
> >>>
> >>>Let's look in the error log. Last message says:
> >>>
> >>>Date/Time: Tuesday 31 August 2004 - 19:40:39
> >>>Type of error: error
> >>>Message: Arbitrator shutdown
> >>>Fault ID: 2305
> >>>Problem data: Arbitrator decided to shutdown this node
> >>>Object of reference: QMGR (Line: 3764) 0x00000002
> >>>ProgramName: NDB Kernel
> >>>ProcessID: 8885
> >>>TraceFile: /var/ndbcluster/ndbNDB_TraceFile_18.trace
> >>>***EOM***
> >>>
> >>>So let's look at the trace file, #18.
> >>>
> >>>[root@ed ndbcluster]# ls -l /var/ndbcluster/ndbNDB_TraceFile_18.trace
> >>>-rw-r--r--    1 root     root       961439 Aug 31 19:40 /var/ndbcluster/ndbNDB_TraceFile_18.trace
> >>>
> >>>[root@ed ndbcluster]# wc !$
> >>>wc /var/ndbcluster/ndbNDB_TraceFile_18.trace
> >>>  16863  135203  961439 /var/ndbcluster/ndbNDB_TraceFile_18.trace
> >>>
> >>>It's 16k+ lines. I shouldn't paste that in here. Could I email it to
> >>>someone?
> >>>
> >>>Any other ideas how to track this down?
> >>>
> >>>Thanks in advance!
> >>>
> >>>-- Jim
> >>>
> >>>
> >>>--- Mikael Ronström wrote:
> >>>      
> >>>
> >>>>Hi Jim,
> >>>>The trace file you sent doesn't seem to correlate with the previous
> >>>>info. A bit confusing, the error log used should have been the one
> >>>>in the 2.ndb_db directory if that's where you are running it from.
> >>>>This not being written in a week is very strange.
> >>>>
> >>>>The other error.log is from another cluster instance you worked on
> >>>>it seems. Actually this correlates better with the cluster log.
> >>>>
> >>>>Anyways one problem that you might look for that we had problems
> >>>>with another customer is whether there are any other processes
> >>>>starting to work at the time of the crash. If there are other
> >>>>memory hungry processes working then the ndbd process might start
> >>>>swapping and
> 
=== message truncated ===


	
		