List: Cluster
From: Jim Hoadley
Date: September 22 2004 8:42pm
Subject: Re: nightly crashing
Mikael --

Thanks for your help on this. OK, I've been monitoring my servers' CPU and
memory usage to see what may be causing these node crashes.

First, I switched from 4 nodes to 2 nodes to simplify the configuration. So,
we've got:

TORMAN: [PIII 1GHz 256K L2 cache 1024MB RAM], NDB [node2], API [node11]
EDSEL:  [PIII 1GHz 256K L2 cache  512MB RAM], NDB [node3], API [node12]

There is not much to implicate the CPU. The load remains at or below 1.0
right up until the time of a so-called NDB node "crash". At that time the
load (wait time) goes up to 4, 5, or 6, but I would guess this is a symptom
and not a cause of the problem.

Memory usage is much more interesting. On EDSEL, when both the API and NDB
nodes are up, available memory is lost at a rapid rate (approximately 300MB
lost in 13 hours), swapping begins, and then, with only ~6MB left, the NDB
node crashes. At that time its memory chunk is freed, but the MySQL API keeps
on eating memory at roughly the same rate.
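
For reference, the arithmetic behind those figures can be sketched in a few lines of Python. This is only a rough illustration: the function names are made up for this example, the loss rate is assumed to be constant, and treating all of EDSEL's 512MB as available is an upper bound rather than a measurement.

```python
def leak_rate_mb_per_hour(lost_mb: float, hours: float) -> float:
    """Average memory loss rate over the observation window."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return lost_mb / hours


def hours_until_exhausted(free_mb: float, rate_mb_per_hour: float) -> float:
    """Hours until free memory reaches zero, assuming a constant loss rate."""
    if rate_mb_per_hour <= 0:
        return float("inf")  # memory is not shrinking, so it never runs out
    return free_mb / rate_mb_per_hour


# Figures from the paragraph above: roughly 300MB lost in 13 hours.
rate = leak_rate_mb_per_hour(300, 13)
print(f"{rate:.1f} MB/hour")

# Assumption: the whole 512MB is free at start, which overstates the time left.
print(f"{hours_until_exhausted(512, rate):.1f} hours to consume 512MB")
```

At that rate a 512MB box cannot last much more than a day, which would fit a pattern of roughly daily crashes.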

When I run the API without the NDB it eats memory at a slower rate, but 
still approximately 10MB per hour.

When I run NDB without the API there is no such problem.
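
One way to pin the growth on a specific process (rather than on the box as a whole) would be to sample its resident set size at intervals and fit a trend. The following is a minimal sketch, assuming a Linux /proc filesystem that exposes a VmRSS line in /proc/<pid>/status; the helper names are hypothetical and this is not the method used to get the numbers above.

```python
import re
import time


def parse_vmrss_kb(status_text: str) -> int:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kb(pid: int) -> int:
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())


def growth_kb_per_hour(samples):
    """Least-squares slope over (seconds, rss_kb) samples, scaled to kB/hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (r - mean_r) for t, r in samples) / denom
    return slope * 3600


def sample_rss(pid: int, interval_s: float = 60.0, count: int = 10):
    """Collect `count` (elapsed_seconds, rss_kb) samples for one process."""
    start = time.time()
    samples = []
    for i in range(count):
        samples.append((time.time() - start, read_rss_kb(pid)))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

Running `growth_kb_per_hour(sample_rss(pid))` once against the mysqld (API) pid and once against the ndbd pid would show which of the two is actually growing, and a steady positive slope on an idle server would be consistent with a leak.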

Is this a memory leak?

You could say the small amount of RAM in the machine is the source of the
problem, but isn't the large amount of RAM that Cluster requires needed for
the NDB, not the API? In this case it's the API that's taking all the memory.

So, suggestions? 

-- Jim


--- Mikael Ronström wrote:
> Hi Jim,
> From the logs I can see that the ndbd process stopped working properly,
> since you get:
> > Ndb kernel is stuck in: Job Handling
> > Ndb kernel is stuck in: Job Handling
> 
> This can have many causes, both ndbd causes and OS causes. A look into
> the trace file will probably reveal the cause. So please send it to me
> and I can have a look at it.
> 
> Rgrds Mikael
> 
> 2004-09-01 kl. 18.49 skrev Jim Hoadley:
> 
> > I've had a cluster going for several weeks, however, almost every day one or
> > more of my nodes crashes. My first thought is that the Linux boxes are
> > underpowered, but is there any way I can prove that? I'd sure like to know
> > MySQL Cluster is stable and will suit my needs before shelling out the money
> > for production boxes.
> >
> > I wrote to the list about this a month ago but never got to the root cause. So
> > I'd like to try again.
> >
> > I've got 2 replicas, 4 storage nodes, 4 APIs on 4 test Linux boxes. Each box
> > has 512MB RAM, and the test database is tiny (4 rows of data). Here's what
> > STDERR says on the node [4] that died yesterday afternoon:
> >
> > [root@ed 4.ndb_mgm]# ndbd &
> > [3] 8884
> > 2004-08-31 16:38:10 [NDB] INFO     -- Angel pid: 8884 ndb pid: 8885
> > 2004-08-31 16:38:10 [NDB] INFO     -- NDB Cluster -- DB node 4
> > 2004-08-31 16:38:10 [NDB] INFO     -- Version 3.5.0 (beta) --
> > 2004-08-31 16:38:10 [NDB] INFO     -- Start initiated (version 3.5.0)
> > 2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 2 opened
> > 2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 3 opened
> > 2004-08-31 16:38:11 [NDB] INFO     -- Communication to Node 5 opened
> > 2004-08-31 16:38:11 [NDB] INFO     -- Node 1 Connected
> > 2004-08-31 16:38:12 [NDB] INFO     -- Node 2 Connected
> > 2004-08-31 16:38:12 [NDB] INFO     -- Node 3 Connected
> > 2004-08-31 16:38:14 [NDB] INFO     -- Node 5 Connected
> > 2004-08-31 16:38:14 [NDB] INFO     -- Node 2: API version 3.5.0
> > 2004-08-31 16:38:14 [NDB] INFO     -- Node 3: API version 3.5.0
> > 2004-08-31 16:38:14 [NDB] INFO     -- Node 5: API version 3.5.0
> > NR: setLcpActiveStatusEnd - m_participatingLQH
> > 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 11 opened
> > 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 12 opened
> > 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 13 opened
> > 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 14 opened
> > 2004-08-31 16:38:16 [NDB] INFO     -- Communication to Node 0 opened
> > 2004-08-31 16:38:16 [NDB] INFO     -- Started (version 3.5.0)
> > 2004-08-31 16:38:16 [NDB] INFO     -- Node 1: API version 3.5.0
> > 2004-08-31 16:38:17 [NDB] INFO     -- Node 12 Connected
> > 2004-08-31 16:38:17 [NDB] INFO     -- Node 12: API version 3.5.0
> > 2004-08-31 16:38:17 [NDB] INFO     -- Node 11 Connected
> > 2004-08-31 16:38:17 [NDB] INFO     -- Node 14 Connected
> > 2004-08-31 16:38:17 [NDB] INFO     -- Node 11: API version 3.5.0
> > 2004-08-31 16:38:17 [NDB] INFO     -- Node 14: API version 3.5.0
> > 2004-08-31 16:38:17 [NDB] INFO     -- Node 13 Connected
> > 2004-08-31 16:38:17 [NDB] INFO     -- Node 13: API version 3.5.0
> > Ndb kernel is stuck in: Job Handling
> > Ndb kernel is stuck in: Job Handling
> > 2004-08-31 19:40:39 [NDB] ALERT    -- Node 3 Disconnected
> > Error handler shutting down system
> > Error handler shutdown completed - exiting
> >
> > So I'm looking for logging information on node[4] and maybe the MGM node.
> >
> > [root@ed 4.ndb_mgm]# find /var/ndbcluster/ -name *.log -print |xargs ls -lrt
> > <...>
> > -rw-r--r--    1 root     root            2 Aug 31 19:40 /var/ndbcluster/ndb/NextTraceFileNo.log
> > -rw-r--r--    1 root     root        10044 Aug 31 19:40 /var/ndbcluster/ndb/error.log
> >
> > Let's look in the error log. Last message says:
> >
> > Date/Time: Tuesday 31 August 2004 - 19:40:39
> > Type of error: error
> > Message: Arbitrator shutdown
> > Fault ID: 2305
> > Problem data: Arbitrator decided to shutdown this node
> > Object of reference: QMGR (Line: 3764) 0x00000002
> > ProgramName: NDB Kernel
> > ProcessID: 8885
> > TraceFile: /var/ndbcluster/ndbNDB_TraceFile_18.trace
> > ***EOM***
> >
> > So let's look at the trace file, #18.
> >
> > [root@ed ndbcluster]# ls -l /var/ndbcluster/ndbNDB_TraceFile_18.trace
> > -rw-r--r--    1 root     root       961439 Aug 31 19:40 /var/ndbcluster/ndbNDB_TraceFile_18.trace
> >
> > [root@ed ndbcluster]# wc !$
> > wc /var/ndbcluster/ndbNDB_TraceFile_18.trace
> >   16863  135203  961439 /var/ndbcluster/ndbNDB_TraceFile_18.trace
> >
> > It's 16k+ lines. I shouldn't paste that in here. Could I email it to someone?
> >
> > Any other ideas how to track this down?
> >
> > Thanks in advance!
> >
> > -- Jim
> >
> >
> > --- Mikael Ronström wrote:
> >> Hi Jim,
> >> The trace file you sent doesn't seem to correlate with the previous info.
> >> A bit confusing, the error log used should have been the one in the
> >> 2.ndb_db directory if that's where you are running it from. This not being
> >> written in a week is very strange.
> >>
> >> The other error.log is from another cluster instance you worked on it
> >> seems. Actually this correlates better with the cluster log.
> >>
> >> Anyways one problem that you might look for that we had problems with
> >> another customer is whether there are any other processes starting to work
> >> at the time of the crash. If there are other memory hungry processes
> >> working then the ndbd process might start swapping and then it can easily
> >> be swapped out of the cluster due to missed heartbeats. This will not cause
> >> the watch dog however to complain so the picture is not completely clear.
> >>
> >> Rgrds Mikael
> >>
> >> 2004-07-29 kl. 23.45 skrev Jim Hoadley:
> >>
> >>> Mikael --
> >>>
> >>> Thanks for taking on my question. I find 2 error.logs in the ndbcluster
> >>> tree, and the "live" one isn't where I expected it to be.
> >>>
> >>> I'm running the node from this directory:
> >>> /var/ndbcluster/mysql-test/ndbcluster/2.ndb_db
> >>>
> >>> But the (2) error.log(s) I found were here:
> >>>
> >>> -rw-r--r--    1 root     root          468 Jul 21 13:48 /var/ndbcluster/mysql-test/var/ndbcluster/2.ndb_db/error.log
> >>> -rw-r--r--    1 root     root         4458 Jul 28 20:49 /var/ndbcluster/ndb/error.log
> >>>
> >>> Is this a problem?
> >>>
> >>> Obviously the one dated Jul 21 is not used. The other has this as the last
> >>> entry, the one from the most recent "crash":
> >>>
> >>> Date/Time: Wednesday 28 July 2004 - 20:49:21
> >>> Type of error: error
> >>> Message: Arbitrator shutdown
> >>> Fault ID: 2305
> >>> Problem data: Arbitrator decided to shutdown this node
> >>> Object of reference: QMGR (Line: 3764) 0x00000002
> >>> ProgramName: NDB Kernel
> >>> ProcessID: 15932
> >>> TraceFile: /var/ndbcluster/ndbNDB_TraceFile_10.trace
> >>> ***EOM***
> >>>
> >>> So here's /var/ndbcluster/ndbNDB_TraceFile_10.trace
> >>>
> >>> [root@BOX2 ndbcluster]# ls -l ./ndbNDB_TraceFile_10.trace
> >>> -rw-r--r--    1 root     root       962184 Jul 28 20:49 ./ndbNDB_TraceFile_10.trace
> >>>
> >>>
> >>> [root@cooler ndbcluster]# head -160 ./ndbNDB_TraceFile_10.trace
> >>> JAM CONTENTS up->down left->right ?=not block entry
> >>> BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
> >>>        ?000112 001452 001475 001489 001489 001499
> >>> NDBFS   000913 000915 000726 000932
> >>> DBACC   000103 000110
> >>> QMGR    001817 001827 001852 001859 002705 002708 002727 002729
> >>>         002737 002737 002740 002737 002737 002737 002737 002737
> >>>         002737 002737 002737 002737 002737 002737 002737 002737
> >>>         002737 002737 002737 002737 002737 002737 002737 002737
> >>>         002737 002737 002737 002737 002737 002737 002737 002737
> >>>         002737 002737 002737 002737 002737 002737 002737 002737
> >>>         002737 002737 002737 002737 002737 002737 002737 002737
> >>>         002737 001243 001243 001243 001243 001243 001243 001243
> >>>         001243 001243 001243 001243 001243 001243 001243 001243
> >>>         001243 001243 001243 001243 001243 001243 001243 001243
> 
=== message truncated ===



		