Hi,
and thx for excellent analysis
the patch that your proposing is "not correct" but do indeed solve your problem.
(btw amazing that this bug has "slipped through")
the "correct" patch is to run constructor in DbtupGen
(i however have not tested this :-)
===== dbtup/DbtupGen.cpp 1.31 vs edited =====
--- 1.31/storage/ndb/src/kernel/blocks/dbtup/DbtupGen.cpp 2006-11-15 10:29:28 +01:00
+++ dbtup/DbtupGen.cpp 2006-11-15 10:29:13 +01:00
@@ -329,6 +329,7 @@
ScanOpPtr lcp;
ndbrequire(c_scanOpPool.seize(lcp));
+ new (lcp.p) ScanOp();
c_lcp_scan_op= lcp.i;
czero = 0;
/Jonas
ps.
I filed a bug report for this, http://bugs.mysql.com/24331, and made patch accordingly
ds.
ps2.
ndb mails are better to send to cluster@stripped
ds2.
Vasil Dimov wrote:
> Hi,
>
> I got ndb_dd_backuprestore test failing with some core files lying
> around after that.
>
> Here is what happens (it is MySQL 5.1.14-beta from BK as of 2006.11.15):
>
> --- cut ---
> $ ./mysql-test-run.pl --mysqld=--binlog-format=mixed --do-test=ndb_dd_backuprestore
> ...
> ndb_dd_backuprestore [ fail ]
>
> Errors are (from /usr/local/devel/mysql/mysql-5.1/mysql-test/var/log/mysqltest-time)
> :
> Failed to connect to ndb!!
> Failed to initialize consumers
> mysqltest: In included file "./include/ndb_restore_master.inc": At line 8: command
> "$NDB_TOOLS_DIR/ndb_restore --no-defaults --ndb-connectstring="localhost:$NDBCLUSTER_PORT"
> -p 8 -b $the_backup_id -n 2 -r --print --print_meta
> $NDB_BACKUP_DIR/BACKUP/BACKUP-$the_backup_id >> $NDB_TOOLS_OUTPUT" failed
> (the last lines may be the most important ones)
> Result from queries before failure can be found in r/ndb_dd_backuprestore.log
> --- cut ---
>
> Here are the last 4 lines from r/ndb_dd_backuprestore.log:
>
> --- r/ndb_dd_backuprestore.log begins here ---
> 113: ENGINE = NDB;
> 114: DROP LOGFILE GROUP log_group1
> 115: ENGINE =NDB;
> 116: Abort trap (core dumped) <--- great, we will play with gdb soon...
> --- r/ndb_dd_backuprestore.log ends here ---
>
> and the relevant kernel messages:
>
> --- kern.all begins here ---
> Nov 15 08:39:31 qlovarnika kernel: pid 80453 (ndbd), uid 1001: exited on signal 6
> (core dumped)
> Nov 15 08:39:31 qlovarnika kernel: pid 80454 (ndbd), uid 1001: exited on signal 6
> (core dumped)
> Nov 15 08:39:54 qlovarnika kernel: pid 80492 (ndb_restore), uid 1001: exited on
> signal 6 (core dumped)
> --- kern.all ends here ---
>
> ndb_restore is aborting because it cannot connect to ndb,
> the root of the problem is that ndbd aborts, here is its backtrace:
>
> --- ndbd backtrace begins here ---
> Core was generated by `ndbd'.
> Program terminated with signal 6, Aborted.
> #0 0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so.2
> [New Thread 0x6c8c000 (sleeping)]
> [New Thread 0x6b75c00 (sleeping)]
> [New Thread 0x6b75800 (sleeping)]
> [New Thread 0x6b75400 (sleeping)]
> [New Thread 0x6b75000 (sleeping)]
> [New Thread 0x6a60c00 (sleeping)]
> [New Thread 0x6a60800 (sleeping)]
> [New Thread 0x6a60400 (sleeping)]
> [New Thread 0x6a60000 (sleeping)]
> [New Thread 0x6949c00 (sleeping)]
> [New Thread 0x6949800 (sleeping)]
> [New Thread 0x6949400 (sleeping)]
> [New Thread 0x6949000 (sleeping)]
> [New Thread 0x6832c00 (sleeping)]
> [New Thread 0x6832800 (sleeping)]
> [New Thread 0x6832400 (sleeping)]
> [New Thread 0x6832000 (sleeping)]
> [New Thread 0x671cc00 (sleeping)]
> [New Thread 0x671c800 (sleeping)]
> [New Thread 0x671c400 (sleeping)]
> [New Thread 0x671c000 (sleeping)]
> [New Thread 0x6604c00 (sleeping)]
> [New Thread 0x6604800 (sleeping)]
> [New Thread 0x6604400 (sleeping)]
> [New Thread 0x6604000 (sleeping)]
> [New Thread 0x6424c00 (sleeping)]
> [New Thread 0x6424800 (sleeping)]
> [New Thread 0xa14800 (LWP 100306)]
> [New Thread 0x8e1000 (LWP 100184)]
> #0 0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so.2
> #1 0x0000000800ab55c3 in sigaction () from /usr/lib/libpthread.so.2
> #2 0x0000000800ab70e2 in sigaction () from /usr/lib/libpthread.so.2
> #3 0x0000000800ab0db6 in pthread_kill () from /usr/lib/libpthread.so.2
> #4 0x0000000800ab0633 in raise () from /usr/lib/libpthread.so.2
> #5 0x0000000800ca115d in abort () from /lib/libc.so.6
> #6 0x000000000040541b in childAbort (code=-1, currentStartPhase=4)
> at main.cpp:105
> #7 0x00000000005bd2ee in NdbShutdown (type=NST_ErrorHandler,
> restartType=NRT_Default) at Emulator.cpp:255
> #8 0x00000000005c6569 in ErrorReporter::handleError (messageID=2306,
> problemData=0x4 <Address 0x4 out of bounds>,
> objRef=0x6 <Address 0x6 out of bounds>, nst=NST_ErrorHandler)
> at ErrorReporter.cpp:206
> #9 0x00000000005b5948 in SimulatedBlock::progError (this=0xed4000, line=538,
> err_code=2306, extra=0x629dfd "dbtup/DbtupScan.cpp")
> at SimulatedBlock.cpp:738
> #10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
> {p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
> #11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
> signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
> #12 0x00000000004d23cf in Dblqh::execLCP_PREPARE_CONF (this=0xec8000,
> signal=0x8c5b28) at SimulatedBlock.hpp:576
> #13 0x00000000005b9bf7 in FastScheduler::doJob (this=0x8cf4e0)
> at SimulatedBlock.hpp:576
> #14 0x00000000005bb485 in ThreadConfig::ipControlLoop (this=0x0)
> at ThreadConfig.cpp:175
> #15 0x0000000000405fac in main (argc=7, argv=0x7fffffffcff8) at main.cpp:474
> --- ndbd backtrace ends here ---
>
> The file names and line numbers in the backtrace are a little bit
> misleading because SimulatedBlock::executeFunction() is inline'd.
> Anyway let's dig a little bit more:
>
> --- gdb session begins here ---
> (gdb) f 10
> #10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
> {p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
> 538 ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
> --- gdb session ends here ---
>
> This is the implementation of ptrCheckGuard:
>
> --- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr begins here ---
> 93 #define ptrCheckGuardErr(ptr, limit, rec, error) {\
> 94 UintR TxxzLimit; \
> 95 TxxzLimit = (limit); \
> 96 UintR TxxxPtr; \
> 97 TxxxPtr = ptr.i; \
> 98 ptr.p = &rec[TxxxPtr]; \
> 99 if (TxxxPtr < (TxxzLimit)) { \
> 100 ; \
> 101 } else { \
> 102 progError(__LINE__, error, __FILE__); \
> 103 }}
> --- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr ends here ---
>
> It calls progError() if ptr.i >= limit, so let's investigate:
>
> --- gdb session begins here ---
> (gdb) f 10
> #10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
> {p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
> 538 ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
> (gdb) ins fragPtr.i
> $1 = 3503345872 <----- Seems bogus
> (gdb) ins cnoOfFragrec
> $2 = 452
> (gdb) l 534
> 529 Dbtup::scanFirst(Signal*, ScanOpPtr scanPtr)
> 530 {
> 531 ScanOp& scan = *scanPtr.p;
> 532 ScanPos& pos = scan.m_scanPos;
> 533 Local_key& key = pos.m_key;
> 534 const Uint32 bits = scan.m_bits;
> 535 // fragment
> 536 FragrecordPtr fragPtr;
> 537 fragPtr.i = scan.m_fragPtrI;
> 538 ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
> //
> // So fragPtr.i came from scan.m_fragPtrI which came from
> // scanPtr.p->m_fragPtrI which was passed as an argument to
> // Dbtup::scanFirst()
> //
> (gdb) ins scanPtr.p->m_fragPtrI
> $3 = 3503345872
> //
> // That's correct: fragPtr.i has the same value as scanPtr.p->m_fragPtrI,
> // let's see where did it came from:
> //
> (gdb) f 11
> #11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
> signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
> 1069 scanFirst(signal, scanPtr);
> (gdb) l 1065,1069
> 1065 ScanOpPtr scanPtr;
> 1066 c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op);
> 1067 //ndbrequire(scanPtr.p->m_fragPtrI == fragPtr.i); ?
> 1068
> 1069 scanFirst(signal, scanPtr);
> //
> // Obviously scanPtr must have been initialized in c_scanOpPool.getPtr().
> // What type is c_scanOpPool?
> //
> (gdb) ins &c_scanOpPool
> $4 = (ArrayPool<Dbtup::ScanOp> *) 0xed6fc8
> //
> // So here is the body of getPtr():
> //
> (gdb) l ArrayPool.hpp:360,366
> 360 template <class T>
> 361 inline
> 362 void
> 363 ArrayPool<T>::getPtr(Ptr<T> & ptr, Uint32 i){
> 364 ptr.i = i;
> 365 if(likely(i < size)){
> 366 ptr.p = &theArray[i];
> //
> // So ptr.p is just assigned the i'th member of theArray.
> // What is the value of i?
> //
> (gdb) f 11
> #11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
> signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
> 1069 scanFirst(signal, scanPtr);
> (gdb) l 1065,1066
> 1065 ScanOpPtr scanPtr;
> 1066 c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op);
> (gdb) ins frag.m_lcp_scan_op
> $5 = 0
> //
> // 0, great. Let's check:
> //
> (gdb) ins c_scanOpPool.theArray[0].m_fragPtrI
> $6 = 3503345872
> //
> // Yes, the same value (3503345872), let's look at the whole 0th
> // member of theArray:
> //
> (gdb) ins c_scanOpPool.theArray[0]
> $7 = {m_state = 53456, m_bits = 53456, m_userPtr = 3503345872,
> m_userRef = 3503345872, m_tableId = 3503345872, m_fragId = 3503345872,
> m_fragPtrI = 3503345872, m_transId1 = 3503345872, m_transId2 = 3503345872, {
> m_savePointId = 3503345872, m_scanGCI = 3503345872},
> m_accLockOp = 3503345872, m_scanPos = {m_get = 3503345872, m_key = {
> m_page_no = 3503345872, m_page_idx = 53456, m_file_no = 53456},
> m_page = 0xd0d0d0d0d0d0d0d0, m_key_mm = {m_page_no = 3503345872,
> m_page_idx = 53456, m_file_no = 53456}, m_realpid_mm = 3503345872,
> m_extent_info_ptr_i = 3503345872}, m_accLockOps = {
> firstItem = 3503345872, lastItem = 3503345872, in_use = 208}, {
> nextPool = 1, nextList = 1}, prevList = 3503345872}
> //
> // A lot of members with the bogus value 3503345872
> // which is D0D0D0D0 in hex. Now I remember that I have set up my
> // system malloc to initialize each byte of returned memory with 0xd0 [1]
> // Let's see where theArray member is allocated:
> //
> (gdb) l ArrayPool.hpp:220,262
> 220 template <class T>
> 221 inline
> 222 bool
> 223 ArrayPool<T>::setSize(Uint32 noOfElements,
> 224 bool align, bool exit_on_error, bool guard){
> 225 if(size == 0){
> 226 if(noOfElements == 0)
> 227 return true;
> 228 if(align)
> 229 {
> 230 alloc_ptr = ndbd_malloc((noOfElements+1) * sizeof(T));
> 231 UintPtr p = (UintPtr)alloc_ptr;
> 232 UintPtr mod = p % sizeof(T);
> 233 if (mod)
> 234 {
> 235 p += sizeof(T) - mod;
> 236 }
> 237 theArray = (T *)p;
> 238 }
> 239 else
> 240 theArray = (T *)(alloc_ptr = ndbd_malloc(noOfElements * sizeof(T)));
> 241
> 242 if(theArray == 0)
> 243 {
> 244 if (!exit_on_error)
> 245 return false;
> 246 ErrorReporter::handleAssert("ArrayPool<T>::setSize malloc
> failed",
> 247 __FILE__, __LINE__, NDBD_EXIT_MEMALLOC);
> 248 return false; // not reached
> 249 }
> 250 size = noOfElements;
> 251 noOfFree = noOfElements;
> 252
> 253 /**
> 254 * Set next pointers
> 255 */
> 256 T * t = &theArray[0];
> 257 for(Uint32 i = 0; i<size; i++){
> 258 t->nextPool = (i + 1);
> 259 t++;
> 260 }
> 261 theArray[size-1].nextPool = RNIL;
> 262 firstFree = 0;
> //
> // T in this case is ScanOp which has a nice constructor (that
> // initializes its members) but it is never called because ndbd_malloc()
> // is used instead of new. As one can see above - only nextPool members
> // are initialized (manually) after the allocation. Others are left
> // untouched as they are returned by ndbd_malloc().
> //
> --- gdb session ends here ---
>
> The following patch fixes the problem:
>
> --- ArrayPool.hpp_memset.diff begins here ---
> --- storage/ndb/src/kernel/vm/ArrayPool.hpp.orig Wed Nov 15 09:48:39 2006
> +++ storage/ndb/src/kernel/vm/ArrayPool.hpp Wed Nov 15 09:47:14 2006
> @@ -255,6 +255,7 @@
> */
> T * t = &theArray[0];
> for(Uint32 i = 0; i<size; i++){
> + memset(t, 0, sizeof(*t));
> t->nextPool = (i + 1);
> t++;
> }
> --- ArrayPool.hpp_memset.diff ends here ---
>
> I.e. after applying it I get ndb_dd_backuprestore test passing and
> no programs aborting:
>
> --- cut ---
> ndb_dd_backuprestore [ pass ] 109547
> --- cut ---
>
>
> Before filling a bug report I want to ask -
> "Is this, that the constructors are not called, expected behavior?".
> And thus "Is the above patch correct?".
>
> Maybe this has something to do with Bug#18546 (at least the beginning
> is the same :).
>
> My system is FreeBSD 6.2-PRERELEASE/amd64 and MySQL is
> compiled with BUILD/compile-amd64-debug-max but I think this
> is irrelevant.
>
> [1]
> See the J option in http://www.freebsd.org/cgi/man.cgi?query=malloc.conf
>