From:Johan Andersson Date:July 21 2004 4:32am
Subject:Re: unable to start (infinite crash loop) and workaround
Hi,

Thank you for providing us with important test data! I have also noticed 
problems with system restart and your information is very valuable. We 
are very interested in getting the tracefiles. Can you put together the 
information as follows:

devananda_mgm.tgz (cluster.log + config.ini)
devananda_db12.tgz (tracefiles + error.log)
devananda_db13.tgz (tracefiles + error.log)
devananda_db14.tgz (tracefiles + error.log)
devananda_db15.tgz (tracefiles + error.log)

Send this to me privately (because of the size of the attachments) and I 
will distribute it to the right people!
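A minimal sketch of how the archives listed above might be assembled, assuming the default NDB DataDir layout (cluster.log and config.ini on the management host, error.log and tracefiles in each data node's DataDir); the path and tracefile names are assumptions to adjust to your own config.ini:

```shell
# Assumed data directory -- change to match your config.ini.
DATADIR=/var/lib/mysql-cluster

# On the management node:
( cd "$DATADIR" && tar czf ~/devananda_mgm.tgz cluster.log config.ini )

# On each data node, e.g. node 12:
( cd "$DATADIR" && tar czf ~/devananda_db12.tgz error.log ndb_12_trace.log.* )
```

Repeat the second command on each data node, changing the node id in the archive name.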

Also, I noticed that your NDB nodes started to miss heartbeats. So I 
have a couple of questions and recommendations:

* Was the system heavily loaded when the nodes started to miss heartbeats?
* What hardware are you using (disk subsystem: IDE or SCSI; CPU)?
* Do two or more NDB nodes share a single disk?
* Can you run /sbin/hdparm -Tt /dev/hdX (where X is the drive that holds 
the NDB filesystem)?

If the system is heavily loaded and the disks are slow, there is a 
chance that the NDB nodes will miss heartbeats. This can happen because 
the NDB nodes write checkpoints and transaction logs to disk, which 
can be very disk intensive. Could you also run vmstat 1 (a program that 
exists at least on Linux) and tell me how many blocks per second are 
being written to disk (look at the "bo" column), as well as what block 
size and file system (ext3, reiserfs, ...) you are currently using?
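The measurements asked for above might look like this on a Linux data node; the device name and data-directory path are assumptions to adapt:

```shell
# Raw read throughput of the drive holding the NDB filesystem (needs root;
# replace /dev/hda with your actual device):
/sbin/hdparm -Tt /dev/hda

# File system type of the NDB data directory (path assumed):
df -T /var/lib/mysql-cluster

# Blocks written out per second: locate the "bo" column in the vmstat
# header row, then print that column for each sample.
vmstat 1 5 | awk 'NR==2 {for (i=1; i<=NF; i++) if ($i=="bo") c=i}
                  NR>2  {print $c}'
```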

A way to flatten out the disk writes is to change 
TimeBetweenLocalCheckpoints to ~500. This means that the redo log 
buffers will be flushed to disk more often, which spreads the disk 
writes out. Otherwise, during high (write) load the REDO log buffers 
can grow large, resulting in a lot of information that must be written 
to disk in one burst. During these disk writes on a heavily loaded 
system, other processes can be stalled waiting for I/O, with the result 
that heartbeats might not be sent when they should be.
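For reference, such a setting lives in the [ndbd default] section of config.ini. One caution: in the NDB documentation, the millisecond-valued interval that governs how often the redo log is flushed is TimeBetweenGlobalCheckpoints (default 2000 ms), while TimeBetweenLocalCheckpoints takes a small log2 word-count value, so the ~500 above most plausibly maps to the former; treat this fragment as an assumption to verify against your manual:

```ini
[ndbd default]
# Assumed mapping of the ~500 suggestion above: flush the redo log
# every 500 ms instead of the 2000 ms default.
TimeBetweenGlobalCheckpoints=500
```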

In any case, you should always be able to do a system restart. Sorry for 
the inconvenience caused, and thanks again for your help.

Best regards,
Johan Andersson



Devananda wrote:

> Sorry! forgot to post the workaround.
>
> executed 'all stop' and deleted the ndb data storage directory on 2 of 
> my 4 DB nodes (one from each pair). When I restarted the cluster, it 
> started slowly and then copied data over onto the 2 that I deleted. 
> This worked once, but I am having trouble making it work a second time.
>
