Hi,
Thank you for providing us with important test data! I have also noticed
problems with system restart and your information is very valuable. We
are very interested in getting the tracefiles. Can you put together the
information as follows:
devananda_mgm.tgz (cluster.log + config.ini)
devananda_db12.tgz (tracefiles + error.log)
devananda_db13.tgz (tracefiles + error.log)
devananda_db14.tgz (tracefiles + error.log)
devananda_db15.tgz (tracefiles + error.log)
Send this to me privately (because of the size of the attachments) and I
will distribute it to the right people!
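In case it helps, here is a minimal sketch of how the archives could be
built with tar. The helper name "pack" and the paths in the usage note are
only illustrative assumptions; use the real locations and file names of
your log and trace files.

```shell
# pack NAME DIR FILE...
# Creates NAME.tgz containing the listed files, which are looked up
# relative to DIR (so the archive does not contain absolute paths).
pack() {
  name=$1
  dir=$2
  shift 2
  tar -czf "$name.tgz" -C "$dir" "$@"
}
```

For example, something like "pack devananda_db12 /path/to/ndb/datadir
error.log <tracefiles>" on each DB node, and the same with cluster.log and
config.ini on the management node.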
Also, I noticed that your NDB nodes started to miss heartbeats. So I
have a couple of questions and recommendations:
* Was the system heavily loaded when the nodes started to miss heartbeats?
* What hardware are you using (disk subsystem (IDE, SCSI), CPU)?
* Do two or more NDB nodes share a single disk?
* Can you run /sbin/hdparm -Tt /dev/hdX (where X is the drive that holds
the NDB filesystem)?
If the system is heavily loaded and the disks are slow then there is a
chance that the NDB nodes can miss heartbeats. This can happen because
the NDB nodes write checkpoints and transaction logs to disk, and this
can be very disk intensive. Could you run vmstat 1 (a program that
exists on Linux at least) and tell me how many blocks per second are
written to disk (look for the "bo" column), and also what block size
and file system (ext3, reiserfs...) you are currently using?
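As a rough sketch of how the "bo" figure could be summarised, the helper
below reads vmstat output on stdin and prints the average of the "bo"
column. The name "avg_bo" is made up for this example. It finds the column
from the header line and skips the first sample, since the first line
vmstat prints is an average since boot rather than a current value.

```shell
# avg_bo: read vmstat output on stdin, print the average "bo" value
# (blocks written out per second), truncated to an integer.
avg_bo() {
  awk '
    # Until the header is seen, look for the field named "bo".
    col == 0 {
      for (i = 1; i <= NF; i++) if ($i == "bo") col = i
      next
    }
    # Skip header lines that vmstat repeats in long runs.
    $1 !~ /^[0-9]+$/ { next }
    # Skip the first sample: it reports averages since boot.
    !seen_first { seen_first = 1; next }
    { sum += $col; n++ }
    END { if (n > 0) printf "%d\n", sum / n }
  '
}
```

It could be used as "vmstat 1 30 | avg_bo" while the cluster is under the
load that triggers the missed heartbeats.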
A way to flatten out the disk writes is to change
TimeBetweenLocalCheckpoints to ~500. This means that the redo log
buffers will be flushed to disk more often, so each flush writes less
data. Otherwise, during high write load the REDO log buffers can become
big, resulting in a lot of information that must be written to disk at
once. During these disk writes on a heavily loaded system, other
processes can stall while they wait for I/O, so heartbeats might not be
sent when they should be.
In any case, you should always be able to do a system restart. Sorry for
the inconvenience caused and thanks again for your help.
Best regards,
Johan Andersson
Devananda wrote:
> Sorry! forgot to post the workaround.
>
> executed 'all stop' and deleted the ndb data storage directory on 2 of
> my 4 DB nodes (one from each pair). When I restarted the cluster, it
> started slowly and then copied data over onto the 2 that I deleted.
> This worked once, but I am having trouble making it work a second time.