List: Cluster
From: umapathi b
Date: November 8 2011 5:22am
Subject: cluster issues ..
Hi,

yesterday, for some reason, one of our 2 ndb nodes crashed. I am not able to
restart it. In startphase 5, when it tries to restore the schema, it always
crashes with error 2341. Starting with --initial does not work either.
The other node is currently working, but I am afraid it will also crash and
my databases will be lost.

Is there a way to solve this problem without losing data?

Best regards,
Christian

Here is the last output from ndb_4_out.log:

2010-02-09 09:02:38 [ndbd] INFO -- Angel pid: 2262 ndb pid: 2263
NDBMT: non-mt
2010-02-09 09:02:38 [ndbd] INFO -- NDB Cluster -- DB node 4
2010-02-09 09:02:38 [ndbd] INFO -- mysql-5.1.35 ndb-7.0.7 --
2010-02-09 09:02:38 [ndbd] INFO -- WatchDog timer is set to 6000 ms
2010-02-09 09:02:38 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 4100Mb
initial: 4484Mb
Adding 4484Mb to ZONE_LO (1,143487)
2010-02-09 09:02:38 [ndbd] INFO -- Start initiated (mysql-5.1.35 ndb-7.0.7)
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
WOPool::init(61, 9)
RWPool::init(22, 13)
WARNING: timerHandlingLab now: 3230666 sent: 3229519 diff: 1147
RWPool::init(42, 18)
RWPool::init(62, 13)
Using 1 fragments per node
RWPool::init(c2, 18)
RWPool::init(e2, 16)
WOPool::init(41, 8)
RWPool::init(82, 12)
RWPool::init(a2, 52)
WOPool::init(21, 10)
WARNING: timerHandlingLab now: 3230976 sent: 3230666 diff: 310
Time moved forward with 3503 milliseconds
WARNING: timerHandlingLab now: 3235613 sent: 3232110 diff: 3503
2010-02-09 09:02:44 [ndbd] INFO -- findNeighbours from: 2042 old (left:
65535 right: 65535) new (3 3)
WARNING: timerHandlingLab now: 3306680 sent: 3306627 diff: 53
WARNING: timerHandlingLab now: 3779560 sent: 3779504 diff: 56
WARNING: timerHandlingLab now: 3838808 sent: 3838756 diff: 52
WARNING: timerHandlingLab now: 4405036 sent: 4404984 diff: 52
WARNING: timerHandlingLab now: 4433104 sent: 4433054 diff: 50
WARNING: timerHandlingLab now: 4548566 sent: 4548516 diff: 50
WARNING: timerHandlingLab now: 4726247 sent: 4726196 diff: 51
WARNING: timerHandlingLab now: 4740371 sent: 4740320 diff: 51
WARNING: timerHandlingLab now: 4878876 sent: 4878824 diff: 52
WARNING: timerHandlingLab now: 4879903 sent: 4879850 diff: 53
WARNING: timerHandlingLab now: 4974276 sent: 4974222 diff: 54
WARNING: timerHandlingLab now: 5060594 sent: 5060544 diff: 50
WARNING: timerHandlingLab now: 5261477 sent: 5261424 diff: 53
2010-02-09 09:37:49 [ndbd] INFO -- dbdict/Dbdict.cpp
2010-02-09 09:37:49 [ndbd] INFO -- DBDICT (Line: 3986) 0x0000000e
2010-02-09 09:37:49 [ndbd] INFO -- Error handler startup shutting down
system
2010-02-09 09:37:49 [ndbd] INFO -- Error handler shutdown completed -
exiting
2010-02-09 09:37:49 [ndbd] INFO -- Angel received ndbd startup failure
count 1.
2010-02-09 09:37:49 [ndbd] ALERT -- Node 4: Forced node shutdown completed.
Occured during startphase 5. Caused by error 2341: 'Internal program error
(failed ndbrequire)(Internal error, programming error or missing error
message, please report a bug). Temporary error, restart node'.
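
For anyone comparing several restart attempts, a minimal sketch of pulling the node id, startphase, and error code out of an out log like the one above might look like this. The function name and the regular expression are my own assumptions, modelled only on the ALERT line shown here; this is not an official NDB tool, and other versions may word the message differently.

```python
import re

# Pattern modelled on the ALERT line in the log above (an assumption, not an
# official format). \s+ lets the match span the line wraps added by mail clients.
SHUTDOWN_RE = re.compile(
    r"Node\s+(?P<node>\d+):\s+Forced\s+node\s+shutdown\s+completed\."
    r"\s+Occurr?ed\s+during\s+startphase\s+(?P<phase>\d+)\."
    r"\s+Caused\s+by\s+error\s+(?P<code>\d+):\s+'(?P<msg>[^']*)'"
)


def find_forced_shutdowns(log_text):
    """Return one dict per forced-shutdown alert found in log_text."""
    results = []
    for m in SHUTDOWN_RE.finditer(log_text):
        results.append({
            "node": int(m.group("node")),
            "startphase": int(m.group("phase")),
            "error": int(m.group("code")),
            # Collapse wrapped lines and repeated spaces into single spaces.
            "message": " ".join(m.group("msg").split()),
        })
    return results


if __name__ == "__main__":
    import sys

    # Usage: python find_shutdowns.py ndb_4_out.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for hit in find_forced_shutdowns(fh.read()):
            print(hit)
```

This only summarises what the log already says; it does not explain why the ndbrequire in Dbdict.cpp failed.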

My NDBD DEFAULT settings:

[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/mysql/data
FileSystemPathDD=/mysql/data
#FileSystemPathUndoFiles=/mysql/data
#FileSystemPathDataFiles=/mysql/data
DataMemory=4096M
IndexMemory=384M
LockPagesInMainMemory=1

MaxNoOfConcurrentOperations=500000

StringMemory=50MB
MaxNoOfTables=20000
MaxNoOfOrderedIndexes=10000
MaxNoOfUniqueHashIndexes=2500
MaxNoOfAttributes=120000
DiskCheckpointSpeedInRestart=100M
FragmentLogFileSize=256M
InitFragmentLogFiles=FULL
NoOfFragmentLogFiles=18
RedoBuffer=256M

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

MemReportFrequency=3600
BackupReportFrequency=10

### Params for setting logging
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15

### Params for increasing Disk throughput
BackupMaxWriteSize=1M
BackupDataBufferSize=16M
BackupLogBufferSize=4M
BackupMemory=20M
#Reports indicate that ODirect=1 can cause I/O errors (OS error code 5) on
#some systems. You must test.
ODirect=1

### Watchdog
TimeBetweenWatchdogCheckInitial=60000

### TransactionInactiveTimeout - should be enabled in Production
TransactionInactiveTimeout=30000

TransactionDeadlockDetectionTimeout=10000
HeartbeatIntervalDbDb=3000
HeartbeatIntervalDbApi=3000

### CGE 6.3 - REALTIME EXTENSIONS
RealTimeScheduler=1
SchedulerExecutionTimer=80
SchedulerSpinTimer=40

### DISK DATA
SharedGlobalMemory=384M
#read my blog how to set this:
DiskPageBufferMemory=512M

### Multithreading
MaxNoOfExecutionThreads=2

### Increasing the LongMessageBuffer b/c of a bug (20090903)
LongMessageBuffer=8M

BatchSizePerLocalScan=512
