More on this problem. A couple of hours later, node 4 went down,
then all nodes died, taking down the cluster.
Looks like the error messages available to me are more interesting
this time.
Here's what the ndb_error logs say:
Node 1:
Date/Time: Monday 2 May 2005 - 20:15:08
Type of error: assert
Message: Assertion, probably a programming error
Fault ID: 2301
Problem data: ArrayPool<T>::getPtr
Object of reference: ../../../../../ndb/src/kernel/vm/ArrayPool.hpp line: 350 (block: BACKUP)
ProgramName: ndbd
ProcessID: 13637
TraceFile: /usr/local/mysql/ndb_1_trace.log.4
***EOM***
Node 2:
Date/Time: Monday 2 May 2005 - 20:15:22
Type of error: assert
Message: Assertion, probably a programming error
Fault ID: 2301
Problem data: ArrayPool<T>::getPtr
Object of reference: ../../../../../ndb/src/kernel/vm/ArrayPool.hpp line: 350 (block: BACKUP)
ProgramName: ndbd
ProcessID: 3666
TraceFile: /usr/local/mysql/ndb_2_trace.log.8
***EOM***
Node 3:
<none>
Node 4:
<none>
Here's what the ndb_out files say:
Node 1:
Error handler shutting down system
Error handler shutdown completed - exiting
Node 3:
2005-05-02 20:15:23 [NDB] INFO -- Received signal 11. Running error handler.
Node 2:
Ndb kernel is stuck in: Polling for Receive
Error handler shutting down system
Error handler shutdown completed - exiting
Node 4:
2005-05-02 20:00:04 [NDB] INFO -- Received signal 11. Running error handler.
It looks like node 1 and node 2 died; with no node left alive in that node
group (nodegroup 0), the management server had to shut down node 3 and node 4.
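To make that reasoning concrete, here is a minimal sketch of the rule I am
assuming: the cluster can only stay up while every node group has at least one
live data node. The node group layout is taken from the "show" output quoted
below; the function names are my own and this is not how ndbd or ndb_mgmd
actually implement the check.

```python
# Node groups as listed by `ndb_mgm> show` in the quoted message:
# nodegroup 0 = nodes 1 and 2, nodegroup 1 = nodes 3 and 4.
NODE_GROUPS = {0: {1, 2}, 1: {3, 4}}


def lost_groups(alive, node_groups=NODE_GROUPS):
    """Return the ids of node groups that have no live data node left."""
    return sorted(gid for gid, members in node_groups.items()
                  if not (members & set(alive)))


def cluster_viable(alive, node_groups=NODE_GROUPS):
    """Assumed rule: viable only if no node group has lost all its nodes."""
    return not lost_groups(alive, node_groups)


# Replay the failures in the order the timestamps above suggest.
alive = {1, 2, 3, 4}
for failed in (4, 1, 2):
    alive.discard(failed)
    print(f"node {failed} down -> viable: {cluster_viable(alive)}, "
          f"lost groups: {lost_groups(alive)}")
```

Under that assumption, losing node 4 (20:00:04) and then node 1 (20:15:08)
still leaves one node in each group, and it is only when node 2 goes
(20:15:22) that nodegroup 0 is empty and the remaining node has to stop.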
The "object of reference" line in the error log mentions BACKUP. I began an
ndbcluster BACKUP just 10 or 15 minutes prior to the crash (at 20:00). Could
that have been the cause? If so, why?
The previous backup ran (at 18:00) when one of the 4 nodes was offline.
When it finished I deleted the directories. Could either of these have caused
some corruption?
-- Jim
--- Jim Hoadley <j_hoadley@stripped> wrote:
> After running for many days, my cluster crashed this
> afternoon. Can someone help me understand the reason?
>
> Any help would be greatly appreciated.
>
> This was the sequence of events. Node 1 crashed while I
> was logged in with the mysql client. I restarted node 1,
> then both node 1 and node 3 (both on the same host) crashed.
> I restarted nodes 1 and 3 successfully and they're still running.
>
> Here's what the error log for node1 says. Node 3 did not
> have an error log. Please let me know which lines from the
> trace logs are relevant and I'll post those too.
>
> Date/Time: Monday 2 May 2005 - 17:49:05
> Type of error: error
> Message: Internal program error (failed ndbrequire)
> Fault ID: 2341
> Problem data: DbtcMain.cpp
> Object of reference: DBTC (Line: 12251) 0x0000000a
> ProgramName: ndbd
> ProcessID: 3669
> TraceFile: /usr/local/mysql/ndb_1_trace.log.2
> ***EOM***
>
>
>
>
> Date/Time: Monday 2 May 2005 - 17:52:14
> Type of error: error
> Message: Node failed during system restart
> Fault ID: 2308
> Problem data: Unhandled node failure of started node during restart
> Object of reference: NDBCNTR (Line: 1417) 0x0000000a
> ProgramName: ndbd
> ProcessID: 13585
> TraceFile: /usr/local/mysql/ndb_1_trace.log.3
> ***EOM***
>
> These are my specs.
>
> 3-host cluster:
>
> host1 = node [1], node [3], API [6]
> host2 = node [2], node [4], API [7]
> host3 = mgm [5]
>
> Each host has 6GB RAM and 2 3.6G Xeons
> RedHat Enterprise Linux 3 with hugemem kernel
> 2.4.21-27.0.4.ELhugemem #1 SMP
>
> Here's my config.ini:
>
> [ndbd default]
> LockPagesInMainMemory=1
> TransactionDeadlockDetectionTimeout=14000
> NoOfReplicas= 2
> MaxNoOfConcurrentOperations=131072
> DataMemory= 1900M
> IndexMemory= 400M
> Diskless= 0
> DataDir= /var/mysql-cluster
> TimeBetweenWatchDogCheck=10000
> HeartbeatIntervalDbDb=10000
> HeartbeatIntervalDbApi=10000
> NoOfFragmentLogFiles=64
>
> NoOfDiskPagesToDiskAfterRestartTUP=54 #40
> NoOfDiskPagesToDiskAfterRestartACC=8 #20
>
> MaxNoOfAttributes = 2000 #1000
> MaxNoOfOrderedIndexes = 5000 #128
> MaxNoOfUniqueHashIndexes = 5000 #64
>
> [ndbd]
> HostName= 10.0.1.199
>
> [ndbd]
> HostName= 10.0.1.200
>
> [ndbd]
> HostName= 10.0.1.199
>
> [ndbd]
> HostName= 10.0.1.200
>
> [ndb_mgmd]
> HostName= 10.0.1.198
> PortNumber= 2200
>
> [mysqld]
>
> [mysqld]
> [tcp default]
> PortNumber= 2202
>
>
> And show:
>
> ndb_mgm> show
> Connected to Management Server at: 10.0.1.198:2200
> Cluster Configuration
> ---------------------
> [ndbd(NDB)] 4 node(s)
> id=1 @10.0.1.199 (Version: 4.1.11, Nodegroup: 0)
> id=2 @10.0.1.200 (Version: 4.1.11, Nodegroup: 0, Master)
> id=3 @10.0.1.199 (Version: 4.1.11, Nodegroup: 1)
> id=4 @10.0.1.200 (Version: 4.1.11, Nodegroup: 1)
>
> [ndb_mgmd(MGM)] 1 node(s)
> id=5 @10.0.1.198 (Version: 4.1.11)
>
> [mysqld(API)] 2 node(s)
> id=6 @10.0.1.199 (Version: 4.1.11)
> id=7 @10.0.1.200 (Version: 4.1.11)
>
>
> Thanks in advance.
>
> -- Jim
>
>
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>
> --
> MySQL Cluster Mailing List
> For list archives: http://lists.mysql.com/cluster
> To unsubscribe: http://lists.mysql.com/cluster?unsub=1
>
>