From: Jonas Oreland
Date: May 4 2005 11:09am
Subject: Re: failed ndbrequire -- reason?
Jim Hoadley wrote:
> More on this problem. A couple of hours later, node 4 went down, 
> then all nodes died, taking down the cluster. 
> 
> Looks like the error messages available to me are more interesting
> this time.
> 
> Here's what the ndb_error logs say:
> 
> Node 1:
> 
> Date/Time: Monday 2 May 2005 - 20:15:08
> Type of error: assert
> Message: Assertion, probably a programming error
> Fault ID: 2301
> Problem data: ArrayPool<T>::getPtr
> Object of reference: ../../../../../ndb/src/kernel/vm/ArrayPool.hpp line: 350
> (block: BACKUP)
> ProgramName: ndbd
> ProcessID: 13637
> TraceFile: /usr/local/mysql/ndb_1_trace.log.4
> ***EOM***
> 
> Node 2:
> 
> Date/Time: Monday 2 May 2005 - 20:15:22
> Type of error: assert
> Message: Assertion, probably a programming error
> Fault ID: 2301
> Problem data: ArrayPool<T>::getPtr
> Object of reference: ../../../../../ndb/src/kernel/vm/ArrayPool.hpp line: 350
> (block: BACKUP)
> ProgramName: ndbd
> ProcessID: 3666
> TraceFile: /usr/local/mysql/ndb_2_trace.log.8
> ***EOM***
> 
> Node 3:
> 
> <none>
> 
> Node 4:
> 
> <none>
> 
> Here's what the ndb_out files say:
> 
> Node 1:
> Error handler shutting down system
> Error handler shutdown completed - exiting
> 
> Node 3:
> 2005-05-02 20:15:23 [NDB] INFO     -- Received signal 11. Running error handler.
> 
> Node 2:
> Ndb kernel is stuck in: Polling for Receive
> Error handler shutting down system
> Error handler shutdown completed - exiting
> 
> Node 4:
> 2005-05-02 20:00:04 [NDB] INFO     -- Received signal 11. Running error handler.
> 
> It looks like node 1 and node 2 died; then, with no node available in that
> node group, the management server had to shut down node 3 and node 4.
> 
>  The "object of reference" line in the error log mentions BACKUP. I began an 
> ndbcluster BACKUP just 10 or 15 minutes prior to the crash (at 20:00). Could 
> that have been the cause? If so, why?

The backup is definitely what triggered the bug.
Exactly why is hard to say... you must include the trace files from the crashes.
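(For reference, the trace files named in the two error logs above sit in each
data node's DataDir, so collecting them is just a file copy; "collect-host"
below is only a placeholder destination:)

   # host1: trace file named in node 1's error log
   scp /usr/local/mysql/ndb_1_trace.log.4 collect-host:/tmp/
   # host2: trace file named in node 2's error log
   scp /usr/local/mysql/ndb_2_trace.log.8 collect-host:/tmp/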

> 
> The previous backup ran (at 18:00) when one of the 4 nodes was offline.
> When it finished I deleted the directories. Could either of these have
> caused some corruption?

There should be no problem taking a backup with some nodes offline.
And deleting the directories does not matter.
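(A completed backup is self-contained on disk, which is why deleting it is
harmless; assuming the standard layout, each data node writes one directory
per backup id under its DataDir, along these lines:)

   /var/mysql-cluster/BACKUP/BACKUP-1/
      BACKUP-1.1.ctl       # metadata: table definitions
      BACKUP-1-0.1.data    # table data stored by node 1
      BACKUP-1.1.log       # log of committed transactions

Removing such a directory only discards that backup; a running cluster never
reads these files back -- they are only consumed by ndb_restore.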

Regarding the first crash in TC:
1) What version are you running?
2) That's related to NDB's internal triggers, which are (among other things)
   used for backup and unique indexes. Were you running a backup at the time
   of the failure?
3) If so, there have been a number of bug fixes lately regarding node failure
   during backup, which might affect the second crash (but not the TC one).
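(On question 1: the SHOW output quoted further down already answers it --
every node reports 4.1.11. It can also be checked on the hosts directly; a
minimal sketch, assuming the usual --version and -c switches on the binaries:)

   # on a data node host
   ndbd --version
   # or connect to the management server and run SHOW at the prompt
   ndb_mgm -c 10.0.1.198:2200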


/Jonas
> 
> -- Jim
> 
> 
> 
> 
> --- Jim Hoadley <j_hoadley@stripped> wrote:
> 
>>After running for many days, my cluster crashed this 
>>afternoon. Can someone help me understand the reason?
>>
>>Any help would be greatly appreciated.
>>
>>This was the sequence of events. Node 1 crashed while I
>>was logged in with the mysql client. I restarted node 1,
>>then both node 1 and node 3 (both on the same host) crashed.
>>I restarted nodes 1 and 3 successfully and they're still running.
>>
>>Here's what the error log for node 1 says. Node 3 did not
>>have an error log. Please let me know which lines from the
>>trace logs are relevant and I'll post those too.
>>
>>   Date/Time: Monday 2 May 2005 - 17:49:05
>>   Type of error: error
>>   Message: Internal program error (failed ndbrequire)
>>   Fault ID: 2341
>>   Problem data: DbtcMain.cpp
>>   Object of reference: DBTC (Line: 12251) 0x0000000a
>>   ProgramName: ndbd
>>   ProcessID: 3669
>>   TraceFile: /usr/local/mysql/ndb_1_trace.log.2
>>   ***EOM***
>>
>>   Date/Time: Monday 2 May 2005 - 17:52:14
>>   Type of error: error
>>   Message: Node failed during system restart
>>   Fault ID: 2308
>>   Problem data: Unhandled node failure of started node during restart
>>   Object of reference: NDBCNTR (Line: 1417) 0x0000000a
>>   ProgramName: ndbd
>>   ProcessID: 13585
>>   TraceFile: /usr/local/mysql/ndb_1_trace.log.3
>>   ***EOM***
>>
>>These are my specs.
>>
>>3-host cluster:
>>
>>   host1 = node [1], node [3], API [6]
>>   host2 = node [2], node [4], API [7]
>>   host3 = mgm [5]
>>
>>Each host has 6GB RAM and two 3.6GHz Xeons
>>RedHat Enterprise Linux 3 with hugemem kernel
>>2.4.21-27.0.4.ELhugemem #1 SMP
>>
>>Here's my config.ini:
>>
>>   [ndbd default]
>>   LockPagesInMainMemory=1
>>   TransactionDeadlockDetectionTimeout=14000
>>   NoOfReplicas= 2
>>   MaxNoOfConcurrentOperations=131072
>>   DataMemory= 1900M
>>   IndexMemory= 400M
>>   Diskless= 0
>>   DataDir= /var/mysql-cluster
>>   TimeBetweenWatchDogCheck=10000
>>   HeartbeatIntervalDbDb=10000
>>   HeartbeatIntervalDbApi=10000
>>   NoOfFragmentLogFiles=64
>>   
>>   NoOfDiskPagesToDiskAfterRestartTUP=54   #40
>>   NoOfDiskPagesToDiskAfterRestartACC=8    #20
>>   
>>   MaxNoOfAttributes = 2000                #1000
>>   MaxNoOfOrderedIndexes = 5000            #128
>>   MaxNoOfUniqueHashIndexes = 5000         #64
>>    
>>   [ndbd]
>>   HostName= 10.0.1.199
>>    
>>   [ndbd]
>>   HostName= 10.0.1.200
>> 
>>   [ndbd]
>>   HostName= 10.0.1.199
>>   
>>   [ndbd]
>>   HostName= 10.0.1.200
>>
>>   [ndb_mgmd]
>>   HostName= 10.0.1.198
>>   PortNumber= 2200
>>   
>>   [mysqld]
>>   
>>   [mysqld]
>>
>>   [tcp default]
>>   PortNumber= 2202
>>
>>
>>And show:
>>
>>   ndb_mgm> show
>>   Connected to Management Server at: 10.0.1.198:2200
>>   Cluster Configuration
>>   ---------------------
>>   [ndbd(NDB)]     4 node(s)
>>   id=1    @10.0.1.199  (Version: 4.1.11, Nodegroup: 0)
>>   id=2    @10.0.1.200  (Version: 4.1.11, Nodegroup: 0, Master)
>>   id=3    @10.0.1.199  (Version: 4.1.11, Nodegroup: 1)
>>   id=4    @10.0.1.200  (Version: 4.1.11, Nodegroup: 1)
>>    
>>   [ndb_mgmd(MGM)] 1 node(s)
>>   id=5    @10.0.1.198  (Version: 4.1.11)
>> 
>>   [mysqld(API)]   2 node(s)
>>   id=6    @10.0.1.199  (Version: 4.1.11)
>>   id=7    @10.0.1.200  (Version: 4.1.11)
>>
>>
>>Thanks in advance.
>>
>>-- Jim


-- 
Jonas Oreland, Software Engineer
MySQL AB, www.mysql.com