List: Cluster
From: Wagner Bianchi  Date: November 8 2011 12:19pm
Subject: Re: cluster issues ..
If your failed data node crashes in start phase 5 when you try to restart
it, there is probably a problem with the files used to record a local
checkpoint or with the redo logs, or that data node needs more memory.
Remember that if a data node starts to swap, it will stop. Check the log
messages to determine what is happening and share them with us, so that we
can better understand this case.
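
As a starting point for checking the log, here is a minimal sketch in Python
(my choice of language, not something the cluster requires). It pulls the node
id, start phase and error code out of "Forced node shutdown" alerts. The
pattern is modelled only on the alert format quoted further down in this
thread; the names find_forced_shutdowns and strip_quote are made up for this
example, and other NDB versions may word the alert differently.

```python
import re

# Pattern based on the ALERT line quoted in this thread (spelling "Occured"
# is taken verbatim from the log). Other versions may differ: an assumption.
ALERT_RE = re.compile(
    r"ALERT\s+--\s+Node (?P<node>\d+): Forced node shutdown completed\.\s*"
    r"Occured during startphase (?P<phase>\d+)\.\s*"
    r"Caused by error (?P<error>\d+)"
)


def strip_quote(line):
    """Drop a trailing newline and a leading '> ' mail-quote prefix, if any."""
    return re.sub(r"^>\s?", "", line.rstrip("\n"))


def find_forced_shutdowns(lines):
    """Return one dict per forced-shutdown alert found in the given lines."""
    # Join with spaces so an alert wrapped over several lines still matches.
    text = " ".join(strip_quote(line) for line in lines)
    return [
        {
            "node": int(m["node"]),
            "phase": int(m["phase"]),
            "error": int(m["error"]),
        }
        for m in ALERT_RE.finditer(text)
    ]
```

To use it on a real file, pass the open file object, for example
find_forced_shutdowns(open("ndb_4_out.log")), and look at the lines just
before each reported alert for the block and source line that failed.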

Best wishes,
--
*Wagner Bianchi* - Database Administrator
*Mobile:* +55 (31) 8654 - 9510
*LinkedIn*: http://br.linkedin.com/in/wagnerbianchi
*Twitter*: @wagnerbianchijr
*Skype*: wbianchijr



2011/11/8 umapathi b <umapathi.b@stripped>

>  Hi,
>
> yesterday, for some reason one of our 2 ndb nodes crashed. I am not able to
> restart it. In startphase 5 when it tries to restore schema, it always
> crashes with error 2341. Starting with --initial is not working either.
> The other node is currently working, but I am afraid it will also crash and
> my databases are lost.
>
> Is there a way to solve this problem without losing data?
>
> Best regards,
> Christian
>
> Here is the last output from ndb_4_out.log:
>
> 2010-02-09 09:02:38 [ndbd] INFO -- Angel pid: 2262 ndb pid: 2263
> NDBMT: non-mt
> 2010-02-09 09:02:38 [ndbd] INFO -- NDB Cluster -- DB node 4
> 2010-02-09 09:02:38 [ndbd] INFO -- mysql-5.1.35 ndb-7.0.7 --
> 2010-02-09 09:02:38 [ndbd] INFO -- WatchDog timer is set to 6000 ms
> 2010-02-09 09:02:38 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 4100Mb
> initial: 4484Mb
> Adding 4484Mb to ZONE_LO (1,143487)
> 2010-02-09 09:02:38 [ndbd] INFO -- Start initiated (mysql-5.1.35 ndb-7.0.7)
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
> WOPool::init(61, 9)
> RWPool::init(22, 13)
> WARNING: timerHandlingLab now: 3230666 sent: 3229519 diff: 1147
> RWPool::init(42, 18)
> RWPool::init(62, 13)
> Using 1 fragments per node
> RWPool::init(c2, 18)
> RWPool::init(e2, 16)
> WOPool::init(41, 8)
> RWPool::init(82, 12)
> RWPool::init(a2, 52)
> WOPool::init(21, 10)
> WARNING: timerHandlingLab now: 3230976 sent: 3230666 diff: 310
> Time moved forward with 3503 milliseconds
> WARNING: timerHandlingLab now: 3235613 sent: 3232110 diff: 3503
> 2010-02-09 09:02:44 [ndbd] INFO -- findNeighbours from: 2042 old (left:
> 65535 right: 65535) new (3 3)
> WARNING: timerHandlingLab now: 3306680 sent: 3306627 diff: 53
> WARNING: timerHandlingLab now: 3779560 sent: 3779504 diff: 56
> WARNING: timerHandlingLab now: 3838808 sent: 3838756 diff: 52
> WARNING: timerHandlingLab now: 4405036 sent: 4404984 diff: 52
> WARNING: timerHandlingLab now: 4433104 sent: 4433054 diff: 50
> WARNING: timerHandlingLab now: 4548566 sent: 4548516 diff: 50
> WARNING: timerHandlingLab now: 4726247 sent: 4726196 diff: 51
> WARNING: timerHandlingLab now: 4740371 sent: 4740320 diff: 51
> WARNING: timerHandlingLab now: 4878876 sent: 4878824 diff: 52
> WARNING: timerHandlingLab now: 4879903 sent: 4879850 diff: 53
> WARNING: timerHandlingLab now: 4974276 sent: 4974222 diff: 54
> WARNING: timerHandlingLab now: 5060594 sent: 5060544 diff: 50
> WARNING: timerHandlingLab now: 5261477 sent: 5261424 diff: 53
> 2010-02-09 09:37:49 [ndbd] INFO -- dbdict/Dbdict.cpp
> 2010-02-09 09:37:49 [ndbd] INFO -- DBDICT (Line: 3986) 0x0000000e
> 2010-02-09 09:37:49 [ndbd] INFO -- Error handler startup shutting down
> system
> 2010-02-09 09:37:49 [ndbd] INFO -- Error handler shutdown completed -
> exiting
> 2010-02-09 09:37:49 [ndbd] INFO -- Angel received ndbd startup failure
> count 1.
> 2010-02-09 09:37:49 [ndbd] ALERT -- Node 4: Forced node shutdown completed.
> Occured during startphase 5. Caused by error 2341: 'Internal program error
> (failed ndbrequire)(Internal error, programming error or missing error
> message, please report a bug). Temporary error, restart node'.
>
> My NDBD DEFAULT settings:
>
> [NDBD DEFAULT]
> NoOfReplicas=2
> Datadir=/mysql/data
> FileSystemPathDD=/mysql/data
> #FileSystemPathUndoFiles=/mysql/data
> #FileSystemPathDataFiles=/mysql/data
> DataMemory=4096M
> IndexMemory=384M
> LockPagesInMainMemory=1
>
> MaxNoOfConcurrentOperations=500000
>
> StringMemory=50MB
> MaxNoOfTables=20000
> MaxNoOfOrderedIndexes=10000
> MaxNoOfUniqueHashIndexes=2500
> MaxNoOfAttributes=120000
> DiskCheckpointSpeedInRestart=100M
> FragmentLogFileSize=256M
> InitFragmentLogFiles=FULL
> NoOfFragmentLogFiles=18
> RedoBuffer=256M
>
> TimeBetweenLocalCheckpoints=20
> TimeBetweenGlobalCheckpoints=1000
> TimeBetweenEpochs=100
>
> MemReportFrequency=3600
> BackupReportFrequency=10
>
> ### Params for setting logging
> LogLevelStartup=15
> LogLevelShutdown=15
> LogLevelCheckpoint=8
> LogLevelNodeRestart=15
>
> ### Params for increasing Disk throughput
> BackupMaxWriteSize=1M
> BackupDataBufferSize=16M
> BackupLogBufferSize=4M
> BackupMemory=20M
> #Reports indicate that ODirect=1 can cause I/O errors (OS error code 5) on
> #some systems. You must test.
> ODirect=1
>
> ### Watchdog
> TimeBetweenWatchdogCheckInitial=60000
>
> ### TransactionInactiveTimeout - should be enabled in Production
> TransactionInactiveTimeout=30000
>
> TransactionDeadlockDetectionTimeout=10000
> HeartbeatIntervalDbDb=3000
> HeartbeatIntervalDbApi=3000
>
> ### CGE 6.3 - REALTIME EXTENSIONS
> RealTimeScheduler=1
> SchedulerExecutionTimer=80
> SchedulerSpinTimer=40
>
> ### DISK DATA
> SharedGlobalMemory=384M
> #read my blog how to set this:
> DiskPageBufferMemory=512M
>
> ### Multithreading
> MaxNoOfExecutionThreads=2
>
> ### Increasing the LongMessageBuffer b/c of a bug (20090903)
> LongMessageBuffer=8M
>
> BatchSizePerLocalScan=512
>
