List:Internals« Previous MessageNext Message »
From:Jonas Oreland Date:November 15 2006 10:38am
Subject:Re: ndb: usage of uninitialized memory
View as plain text  
Hi,

and thx for excellent analysis

the patch that your proposing is "not correct" but do indeed solve your problem.
(btw amazing that this bug has "slipped through")

the "correct" patch is to run constructor in DbtupGen
(i however have not tested this :-)


===== dbtup/DbtupGen.cpp 1.31 vs edited =====
--- 1.31/storage/ndb/src/kernel/blocks/dbtup/DbtupGen.cpp       2006-11-15 10:29:28 +01:00
+++ dbtup/DbtupGen.cpp  2006-11-15 10:29:13 +01:00
@@ -329,6 +329,7 @@
 
   ScanOpPtr lcp;
   ndbrequire(c_scanOpPool.seize(lcp));
+  new (lcp.p) ScanOp();
   c_lcp_scan_op= lcp.i;
 
   czero = 0;

/Jonas

ps.
  I filed a bug report for this, http://bugs.mysql.com/24331, and made patch accordingly
ds.

ps2.
  ndb mails are better to send to cluster@stripped
ds2.

Vasil Dimov wrote:
> Hi,
> 
> I got ndb_dd_backuprestore test failing with some core files lying
> around after that.
> 
> Here is what happens (it is MySQL 5.1.14-beta from BK as of 2006.11.15):
> 
> --- cut ---
> $ ./mysql-test-run.pl --mysqld=--binlog-format=mixed --do-test=ndb_dd_backuprestore
> ...
> ndb_dd_backuprestore           [ fail ]
> 
> Errors are (from /usr/local/devel/mysql/mysql-5.1/mysql-test/var/log/mysqltest-time)
> :
> Failed to connect to ndb!!
> Failed to initialize consumers
> mysqltest: In included file "./include/ndb_restore_master.inc": At line 8: command
> "$NDB_TOOLS_DIR/ndb_restore --no-defaults --ndb-connectstring="localhost:$NDBCLUSTER_PORT"
> -p 8 -b $the_backup_id -n 2  -r --print --print_meta
> $NDB_BACKUP_DIR/BACKUP/BACKUP-$the_backup_id >> $NDB_TOOLS_OUTPUT" failed
> (the last lines may be the most important ones)
> Result from queries before failure can be found in r/ndb_dd_backuprestore.log
> --- cut ---
> 
> Here are the last 4 lines from r/ndb_dd_backuprestore.log:
> 
> --- r/ndb_dd_backuprestore.log begins here ---
> 113: ENGINE = NDB;
> 114: DROP LOGFILE GROUP log_group1
> 115: ENGINE =NDB;
> 116: Abort trap (core dumped)   <--- great, we will play with gdb soon...
> --- r/ndb_dd_backuprestore.log ends here ---
> 
> and the relevant kernel messages:
> 
> --- kern.all begins here ---
> Nov 15 08:39:31 qlovarnika kernel: pid 80453 (ndbd), uid 1001: exited on signal 6
> (core dumped)
> Nov 15 08:39:31 qlovarnika kernel: pid 80454 (ndbd), uid 1001: exited on signal 6
> (core dumped)
> Nov 15 08:39:54 qlovarnika kernel: pid 80492 (ndb_restore), uid 1001: exited on
> signal 6 (core dumped)
> --- kern.all ends here ---
> 
> ndb_restore is aborting because it cannot connect to ndb,
> the root of the problem is that ndbd aborts, here is its backtrace:
> 
> --- ndbd backtrace begins here ---
> Core was generated by `ndbd'.
> Program terminated with signal 6, Aborted.
> #0  0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so.2
> [New Thread 0x6c8c000 (sleeping)]
> [New Thread 0x6b75c00 (sleeping)]
> [New Thread 0x6b75800 (sleeping)]
> [New Thread 0x6b75400 (sleeping)]
> [New Thread 0x6b75000 (sleeping)]
> [New Thread 0x6a60c00 (sleeping)]
> [New Thread 0x6a60800 (sleeping)]
> [New Thread 0x6a60400 (sleeping)]
> [New Thread 0x6a60000 (sleeping)]
> [New Thread 0x6949c00 (sleeping)]
> [New Thread 0x6949800 (sleeping)]
> [New Thread 0x6949400 (sleeping)]
> [New Thread 0x6949000 (sleeping)]
> [New Thread 0x6832c00 (sleeping)]
> [New Thread 0x6832800 (sleeping)]
> [New Thread 0x6832400 (sleeping)]
> [New Thread 0x6832000 (sleeping)]
> [New Thread 0x671cc00 (sleeping)]
> [New Thread 0x671c800 (sleeping)]
> [New Thread 0x671c400 (sleeping)]
> [New Thread 0x671c000 (sleeping)]
> [New Thread 0x6604c00 (sleeping)]
> [New Thread 0x6604800 (sleeping)]
> [New Thread 0x6604400 (sleeping)]
> [New Thread 0x6604000 (sleeping)]
> [New Thread 0x6424c00 (sleeping)]
> [New Thread 0x6424800 (sleeping)]
> [New Thread 0xa14800 (LWP 100306)]
> [New Thread 0x8e1000 (LWP 100184)]
> #0  0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so.2
> #1  0x0000000800ab55c3 in sigaction () from /usr/lib/libpthread.so.2
> #2  0x0000000800ab70e2 in sigaction () from /usr/lib/libpthread.so.2
> #3  0x0000000800ab0db6 in pthread_kill () from /usr/lib/libpthread.so.2
> #4  0x0000000800ab0633 in raise () from /usr/lib/libpthread.so.2
> #5  0x0000000800ca115d in abort () from /lib/libc.so.6
> #6  0x000000000040541b in childAbort (code=-1, currentStartPhase=4)
>     at main.cpp:105
> #7  0x00000000005bd2ee in NdbShutdown (type=NST_ErrorHandler, 
>     restartType=NRT_Default) at Emulator.cpp:255
> #8  0x00000000005c6569 in ErrorReporter::handleError (messageID=2306, 
>     problemData=0x4 <Address 0x4 out of bounds>, 
>     objRef=0x6 <Address 0x6 out of bounds>, nst=NST_ErrorHandler)
>     at ErrorReporter.cpp:206
> #9  0x00000000005b5948 in SimulatedBlock::progError (this=0xed4000, line=538, 
>     err_code=2306, extra=0x629dfd "dbtup/DbtupScan.cpp")
>     at SimulatedBlock.cpp:738
> #10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
>       {p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
> #11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000, 
>     signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
> #12 0x00000000004d23cf in Dblqh::execLCP_PREPARE_CONF (this=0xec8000, 
>     signal=0x8c5b28) at SimulatedBlock.hpp:576
> #13 0x00000000005b9bf7 in FastScheduler::doJob (this=0x8cf4e0)
>     at SimulatedBlock.hpp:576
> #14 0x00000000005bb485 in ThreadConfig::ipControlLoop (this=0x0)
>     at ThreadConfig.cpp:175
> #15 0x0000000000405fac in main (argc=7, argv=0x7fffffffcff8) at main.cpp:474
> --- ndbd backtrace ends here ---
> 
> The file names and line numbers in the backtrace are a little bit
> misleading because SimulatedBlock::executeFunction() is inline'd.
> Anyway let's dig a little bit more:
> 
> --- gdb session begins here ---
> (gdb) f 10
> #10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
>       {p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
> 538       ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
> --- gdb session ends here ---
> 
> This is the implementation of ptrCheckGuard:
> 
> --- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr begins here ---
>  93 #define ptrCheckGuardErr(ptr, limit, rec, error) {\
>  94   UintR TxxzLimit; \
>  95   TxxzLimit = (limit); \
>  96   UintR TxxxPtr; \
>  97   TxxxPtr = ptr.i; \
>  98   ptr.p = &rec[TxxxPtr]; \
>  99   if (TxxxPtr < (TxxzLimit)) { \
> 100     ; \
> 101   } else { \
> 102     progError(__LINE__, error, __FILE__); \
> 103   }}
> --- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr ends here ---
> 
> It calls progError() if ptr.i >= limit, so let's investigate:
> 
> --- gdb session begins here ---
> (gdb) f 10
> #10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
>       {p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
> 538       ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
> (gdb) ins fragPtr.i
> $1 = 3503345872             <----- Seems bogus
> (gdb) ins cnoOfFragrec
> $2 = 452
> (gdb) l 534
> 529     Dbtup::scanFirst(Signal*, ScanOpPtr scanPtr)
> 530     {
> 531       ScanOp& scan = *scanPtr.p;
> 532       ScanPos& pos = scan.m_scanPos;
> 533       Local_key& key = pos.m_key;
> 534       const Uint32 bits = scan.m_bits;
> 535       // fragment
> 536       FragrecordPtr fragPtr;
> 537       fragPtr.i = scan.m_fragPtrI;
> 538       ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
> //
> // So fragPtr.i came from scan.m_fragPtrI which came from
> // scanPtr.p->m_fragPtrI which was passed as an argument to
> // Dbtup::scanFirst()
> //
> (gdb) ins scanPtr.p->m_fragPtrI
> $3 = 3503345872
> //
> // That's correct: fragPtr.i has the same value as scanPtr.p->m_fragPtrI,
> // let's see where did it came from:
> //
> (gdb) f 11
> #11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
>     signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
> 1069        scanFirst(signal, scanPtr);
> (gdb) l 1065,1069
> 1065        ScanOpPtr scanPtr;
> 1066        c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op);
> 1067        //ndbrequire(scanPtr.p->m_fragPtrI == fragPtr.i); ?
> 1068
> 1069        scanFirst(signal, scanPtr);
> //
> // Obviously scanPtr must have been initialized in c_scanOpPool.getPtr().
> // What type is c_scanOpPool?
> //
> (gdb) ins &c_scanOpPool
> $4 = (ArrayPool<Dbtup::ScanOp> *) 0xed6fc8
> //
> // So here is the body of getPtr():
> //
> (gdb) l ArrayPool.hpp:360,366
> 360     template <class T>
> 361     inline
> 362     void
> 363     ArrayPool<T>::getPtr(Ptr<T> & ptr, Uint32 i){
> 364       ptr.i = i;
> 365       if(likely(i < size)){
> 366         ptr.p = &theArray[i];
> //
> // So ptr.p is just assigned the i'th member of theArray.
> // What is the value of i?
> //
> (gdb) f 11
> #11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
>     signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
> 1069        scanFirst(signal, scanPtr);
> (gdb) l 1065,1066
> 1065        ScanOpPtr scanPtr;
> 1066        c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op);
> (gdb) ins frag.m_lcp_scan_op
> $5 = 0
> //
> // 0, great. Let's check:
> //
> (gdb) ins c_scanOpPool.theArray[0].m_fragPtrI
> $6 = 3503345872
> //
> // Yes, the same value (3503345872), let's look at the whole 0th
> // member of theArray:
> //
> (gdb) ins c_scanOpPool.theArray[0]
> $7 = {m_state = 53456, m_bits = 53456, m_userPtr = 3503345872,
>   m_userRef = 3503345872, m_tableId = 3503345872, m_fragId = 3503345872,
>   m_fragPtrI = 3503345872, m_transId1 = 3503345872, m_transId2 = 3503345872, {
>     m_savePointId = 3503345872, m_scanGCI = 3503345872},
>   m_accLockOp = 3503345872, m_scanPos = {m_get = 3503345872, m_key = {
>       m_page_no = 3503345872, m_page_idx = 53456, m_file_no = 53456},
>     m_page = 0xd0d0d0d0d0d0d0d0, m_key_mm = {m_page_no = 3503345872,
>       m_page_idx = 53456, m_file_no = 53456}, m_realpid_mm = 3503345872,
>     m_extent_info_ptr_i = 3503345872}, m_accLockOps = {
>     firstItem = 3503345872, lastItem = 3503345872, in_use = 208}, {
>     nextPool = 1, nextList = 1}, prevList = 3503345872}
> //
> // A lot of members with the bogus value 3503345872
> // which is D0D0D0D0 in hex. Now I remember that I have set up my
> // system malloc to initialize each byte of returned memory with 0xd0 [1]
> // Let's see where theArray member is allocated:
> //
> (gdb) l ArrayPool.hpp:220,262
> 220     template <class T>
> 221     inline
> 222     bool
> 223     ArrayPool<T>::setSize(Uint32 noOfElements,
> 224                           bool align, bool exit_on_error, bool guard){
> 225       if(size == 0){
> 226         if(noOfElements == 0)
> 227           return true;
> 228         if(align)
> 229         {
> 230           alloc_ptr = ndbd_malloc((noOfElements+1) * sizeof(T));
> 231           UintPtr p = (UintPtr)alloc_ptr;
> 232           UintPtr mod = p % sizeof(T);
> 233           if (mod)
> 234           {
> 235             p += sizeof(T) - mod;
> 236           }
> 237           theArray = (T *)p;
> 238         }
> 239         else
> 240           theArray = (T *)(alloc_ptr = ndbd_malloc(noOfElements * sizeof(T)));
> 241
> 242         if(theArray == 0)
> 243         {
> 244           if (!exit_on_error)
> 245             return false;
> 246           ErrorReporter::handleAssert("ArrayPool<T>::setSize malloc
> failed",
> 247                                       __FILE__, __LINE__, NDBD_EXIT_MEMALLOC);
> 248           return false; // not reached
> 249         }
> 250         size = noOfElements;
> 251         noOfFree = noOfElements;
> 252
> 253         /**
> 254          * Set next pointers
> 255          */
> 256         T * t = &theArray[0];
> 257         for(Uint32 i = 0; i<size; i++){
> 258           t->nextPool = (i + 1);
> 259           t++;
> 260         }
> 261         theArray[size-1].nextPool = RNIL;
> 262         firstFree = 0;
> //
> // T in this case is ScanOp which has a nice constructor (that
> // initializes its members) but it is never called because ndbd_malloc()
> // is used instead of new. As one can see above - only nextPool members
> // are initialized (manually) after the allocation. Others are left
> // untouched as they are returned by ndbd_malloc().
> //
> --- gdb session ends here ---
> 
> The following patch fixes the problem:
> 
> --- ArrayPool.hpp_memset.diff begins here ---
> --- storage/ndb/src/kernel/vm/ArrayPool.hpp.orig	Wed Nov 15 09:48:39 2006
> +++ storage/ndb/src/kernel/vm/ArrayPool.hpp	Wed Nov 15 09:47:14 2006
> @@ -255,6 +255,7 @@
>       */
>      T * t = &theArray[0];
>      for(Uint32 i = 0; i<size; i++){
> +      memset(t, 0, sizeof(*t));
>        t->nextPool = (i + 1);
>        t++;
>      }
> --- ArrayPool.hpp_memset.diff ends here ---
> 
> I.e. after applying it I get ndb_dd_backuprestore test passing and
> no programs aborting:
> 
> --- cut ---
> ndb_dd_backuprestore           [ pass ]         109547
> --- cut ---
> 
> 
> Before filling a bug report I want to ask -
> "Is this, that the constructors are not called, expected behavior?".
> And thus "Is the above patch correct?".
> 
> Maybe this has something to do with Bug#18546 (at least the beginning
> is the same :).
> 
> My system is FreeBSD 6.2-PRERELEASE/amd64 and MySQL is
> compiled with BUILD/compile-amd64-debug-max but I think this
> is irrelevant.
> 
> [1]
> See the J option in http://www.freebsd.org/cgi/man.cgi?query=malloc.conf
> 

Thread
ndb: usage of uninitialized memoryVasil Dimov15 Nov
  • Re: ndb: usage of uninitialized memoryJonas Oreland15 Nov
    • Re: ndb: usage of uninitialized memoryVasil Dimov15 Nov
      • Re: ndb: usage of uninitialized memoryJonas Oreland15 Nov
        • Re: ndb: usage of uninitialized memoryVasil Dimov15 Nov