List:Internals« Previous MessageNext Message »
From:Vasil Dimov Date:November 15 2006 9:43am
Subject:ndb: usage of uninitialized memory
View as plain text  
Hi,

I got ndb_dd_backuprestore test failing with some core files lying
around after that.

Here is what happens (it is MySQL 5.1.14-beta from BK as of 2006.11.15):

--- cut ---
$ ./mysql-test-run.pl --mysqld=--binlog-format=mixed --do-test=ndb_dd_backuprestore
...
ndb_dd_backuprestore           [ fail ]

Errors are (from /usr/local/devel/mysql/mysql-5.1/mysql-test/var/log/mysqltest-time) :
Failed to connect to ndb!!
Failed to initialize consumers
mysqltest: In included file "./include/ndb_restore_master.inc": At line 8: command
"$NDB_TOOLS_DIR/ndb_restore --no-defaults
--ndb-connectstring="localhost:$NDBCLUSTER_PORT" -p 8 -b $the_backup_id -n 2  -r --print
--print_meta $NDB_BACKUP_DIR/BACKUP/BACKUP-$the_backup_id >> $NDB_TOOLS_OUTPUT"
failed
(the last lines may be the most important ones)
Result from queries before failure can be found in r/ndb_dd_backuprestore.log
--- cut ---

Here are the last 4 lines from r/ndb_dd_backuprestore.log:

--- r/ndb_dd_backuprestore.log begins here ---
113: ENGINE = NDB;
114: DROP LOGFILE GROUP log_group1
115: ENGINE =NDB;
116: Abort trap (core dumped)   <--- great, we will play with gdb soon...
--- r/ndb_dd_backuprestore.log ends here ---

and the relevant kernel messages:

--- kern.all begins here ---
Nov 15 08:39:31 qlovarnika kernel: pid 80453 (ndbd), uid 1001: exited on signal 6 (core
dumped)
Nov 15 08:39:31 qlovarnika kernel: pid 80454 (ndbd), uid 1001: exited on signal 6 (core
dumped)
Nov 15 08:39:54 qlovarnika kernel: pid 80492 (ndb_restore), uid 1001: exited on signal 6
(core dumped)
--- kern.all ends here ---

ndb_restore is aborting because it cannot connect to ndb,
the root of the problem is that ndbd aborts, here is its backtrace:

--- ndbd backtrace begins here ---
Core was generated by `ndbd'.
Program terminated with signal 6, Aborted.
#0  0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so.2
[New Thread 0x6c8c000 (sleeping)]
[New Thread 0x6b75c00 (sleeping)]
[New Thread 0x6b75800 (sleeping)]
[New Thread 0x6b75400 (sleeping)]
[New Thread 0x6b75000 (sleeping)]
[New Thread 0x6a60c00 (sleeping)]
[New Thread 0x6a60800 (sleeping)]
[New Thread 0x6a60400 (sleeping)]
[New Thread 0x6a60000 (sleeping)]
[New Thread 0x6949c00 (sleeping)]
[New Thread 0x6949800 (sleeping)]
[New Thread 0x6949400 (sleeping)]
[New Thread 0x6949000 (sleeping)]
[New Thread 0x6832c00 (sleeping)]
[New Thread 0x6832800 (sleeping)]
[New Thread 0x6832400 (sleeping)]
[New Thread 0x6832000 (sleeping)]
[New Thread 0x671cc00 (sleeping)]
[New Thread 0x671c800 (sleeping)]
[New Thread 0x671c400 (sleeping)]
[New Thread 0x671c000 (sleeping)]
[New Thread 0x6604c00 (sleeping)]
[New Thread 0x6604800 (sleeping)]
[New Thread 0x6604400 (sleeping)]
[New Thread 0x6604000 (sleeping)]
[New Thread 0x6424c00 (sleeping)]
[New Thread 0x6424800 (sleeping)]
[New Thread 0xa14800 (LWP 100306)]
[New Thread 0x8e1000 (LWP 100184)]
#0  0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so.2
#1  0x0000000800ab55c3 in sigaction () from /usr/lib/libpthread.so.2
#2  0x0000000800ab70e2 in sigaction () from /usr/lib/libpthread.so.2
#3  0x0000000800ab0db6 in pthread_kill () from /usr/lib/libpthread.so.2
#4  0x0000000800ab0633 in raise () from /usr/lib/libpthread.so.2
#5  0x0000000800ca115d in abort () from /lib/libc.so.6
#6  0x000000000040541b in childAbort (code=-1, currentStartPhase=4)
    at main.cpp:105
#7  0x00000000005bd2ee in NdbShutdown (type=NST_ErrorHandler, 
    restartType=NRT_Default) at Emulator.cpp:255
#8  0x00000000005c6569 in ErrorReporter::handleError (messageID=2306, 
    problemData=0x4 <Address 0x4 out of bounds>, 
    objRef=0x6 <Address 0x6 out of bounds>, nst=NST_ErrorHandler)
    at ErrorReporter.cpp:206
#9  0x00000000005b5948 in SimulatedBlock::progError (this=0xed4000, line=538, 
    err_code=2306, extra=0x629dfd "dbtup/DbtupScan.cpp")
    at SimulatedBlock.cpp:738
#10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
      {p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
#11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000, 
    signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
#12 0x00000000004d23cf in Dblqh::execLCP_PREPARE_CONF (this=0xec8000, 
    signal=0x8c5b28) at SimulatedBlock.hpp:576
#13 0x00000000005b9bf7 in FastScheduler::doJob (this=0x8cf4e0)
    at SimulatedBlock.hpp:576
#14 0x00000000005bb485 in ThreadConfig::ipControlLoop (this=0x0)
    at ThreadConfig.cpp:175
#15 0x0000000000405fac in main (argc=7, argv=0x7fffffffcff8) at main.cpp:474
--- ndbd backtrace ends here ---

The file names and line numbers in the backtrace are a little bit
misleading because SimulatedBlock::executeFunction() is inline'd.
Anyway let's dig a little bit more:

--- gdb session begins here ---
(gdb) f 10
#10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
      {p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
538       ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
--- gdb session ends here ---

This is the implementation of ptrCheckGuard:

--- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr begins here ---
 93 #define ptrCheckGuardErr(ptr, limit, rec, error) {\
 94   UintR TxxzLimit; \
 95   TxxzLimit = (limit); \
 96   UintR TxxxPtr; \
 97   TxxxPtr = ptr.i; \
 98   ptr.p = &rec[TxxxPtr]; \
 99   if (TxxxPtr < (TxxzLimit)) { \
100     ; \
101   } else { \
102     progError(__LINE__, error, __FILE__); \
103   }}
--- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr ends here ---

It calls progError() if ptr.i >= limit, so let's investigate:

--- gdb session begins here ---
(gdb) f 10
#10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
      {p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
538       ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
(gdb) ins fragPtr.i
$1 = 3503345872             <----- Seems bogus
(gdb) ins cnoOfFragrec
$2 = 452
(gdb) l 534
529     Dbtup::scanFirst(Signal*, ScanOpPtr scanPtr)
530     {
531       ScanOp& scan = *scanPtr.p;
532       ScanPos& pos = scan.m_scanPos;
533       Local_key& key = pos.m_key;
534       const Uint32 bits = scan.m_bits;
535       // fragment
536       FragrecordPtr fragPtr;
537       fragPtr.i = scan.m_fragPtrI;
538       ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
//
// So fragPtr.i came from scan.m_fragPtrI which came from
// scanPtr.p->m_fragPtrI which was passed as an argument to
// Dbtup::scanFirst()
//
(gdb) ins scanPtr.p->m_fragPtrI
$3 = 3503345872
//
// That's correct: fragPtr.i has the same value as scanPtr.p->m_fragPtrI,
// let's see where did it came from:
//
(gdb) f 11
#11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
    signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
1069        scanFirst(signal, scanPtr);
(gdb) l 1065,1069
1065        ScanOpPtr scanPtr;
1066        c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op);
1067        //ndbrequire(scanPtr.p->m_fragPtrI == fragPtr.i); ?
1068
1069        scanFirst(signal, scanPtr);
//
// Obviously scanPtr must have been initialized in c_scanOpPool.getPtr().
// What type is c_scanOpPool?
//
(gdb) ins &c_scanOpPool
$4 = (ArrayPool<Dbtup::ScanOp> *) 0xed6fc8
//
// So here is the body of getPtr():
//
(gdb) l ArrayPool.hpp:360,366
360     template <class T>
361     inline
362     void
363     ArrayPool<T>::getPtr(Ptr<T> & ptr, Uint32 i){
364       ptr.i = i;
365       if(likely(i < size)){
366         ptr.p = &theArray[i];
//
// So ptr.p is just assigned the i'th member of theArray.
// What is the value of i?
//
(gdb) f 11
#11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
    signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
1069        scanFirst(signal, scanPtr);
(gdb) l 1065,1066
1065        ScanOpPtr scanPtr;
1066        c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op);
(gdb) ins frag.m_lcp_scan_op
$5 = 0
//
// 0, great. Let's check:
//
(gdb) ins c_scanOpPool.theArray[0].m_fragPtrI
$6 = 3503345872
//
// Yes, the same value (3503345872), let's look at the whole 0th
// member of theArray:
//
(gdb) ins c_scanOpPool.theArray[0]
$7 = {m_state = 53456, m_bits = 53456, m_userPtr = 3503345872,
  m_userRef = 3503345872, m_tableId = 3503345872, m_fragId = 3503345872,
  m_fragPtrI = 3503345872, m_transId1 = 3503345872, m_transId2 = 3503345872, {
    m_savePointId = 3503345872, m_scanGCI = 3503345872},
  m_accLockOp = 3503345872, m_scanPos = {m_get = 3503345872, m_key = {
      m_page_no = 3503345872, m_page_idx = 53456, m_file_no = 53456},
    m_page = 0xd0d0d0d0d0d0d0d0, m_key_mm = {m_page_no = 3503345872,
      m_page_idx = 53456, m_file_no = 53456}, m_realpid_mm = 3503345872,
    m_extent_info_ptr_i = 3503345872}, m_accLockOps = {
    firstItem = 3503345872, lastItem = 3503345872, in_use = 208}, {
    nextPool = 1, nextList = 1}, prevList = 3503345872}
//
// A lot of members with the bogus value 3503345872
// which is D0D0D0D0 in hex. Now I remember that I have set up my
// system malloc to initialize each byte of returned memory with 0xd0 [1]
// Let's see where theArray member is allocated:
//
(gdb) l ArrayPool.hpp:220,262
220     template <class T>
221     inline
222     bool
223     ArrayPool<T>::setSize(Uint32 noOfElements,
224                           bool align, bool exit_on_error, bool guard){
225       if(size == 0){
226         if(noOfElements == 0)
227           return true;
228         if(align)
229         {
230           alloc_ptr = ndbd_malloc((noOfElements+1) * sizeof(T));
231           UintPtr p = (UintPtr)alloc_ptr;
232           UintPtr mod = p % sizeof(T);
233           if (mod)
234           {
235             p += sizeof(T) - mod;
236           }
237           theArray = (T *)p;
238         }
239         else
240           theArray = (T *)(alloc_ptr = ndbd_malloc(noOfElements * sizeof(T)));
241
242         if(theArray == 0)
243         {
244           if (!exit_on_error)
245             return false;
246           ErrorReporter::handleAssert("ArrayPool<T>::setSize malloc failed",
247                                       __FILE__, __LINE__, NDBD_EXIT_MEMALLOC);
248           return false; // not reached
249         }
250         size = noOfElements;
251         noOfFree = noOfElements;
252
253         /**
254          * Set next pointers
255          */
256         T * t = &theArray[0];
257         for(Uint32 i = 0; i<size; i++){
258           t->nextPool = (i + 1);
259           t++;
260         }
261         theArray[size-1].nextPool = RNIL;
262         firstFree = 0;
//
// T in this case is ScanOp which has a nice constructor (that
// initializes its members) but it is never called because ndbd_malloc()
// is used instead of new. As one can see above - only nextPool members
// are initialized (manually) after the allocation. Others are left
// untouched as they are returned by ndbd_malloc().
//
--- gdb session ends here ---

The following patch fixes the problem:

--- ArrayPool.hpp_memset.diff begins here ---
--- storage/ndb/src/kernel/vm/ArrayPool.hpp.orig	Wed Nov 15 09:48:39 2006
+++ storage/ndb/src/kernel/vm/ArrayPool.hpp	Wed Nov 15 09:47:14 2006
@@ -255,6 +255,7 @@
      */
     T * t = &theArray[0];
     for(Uint32 i = 0; i<size; i++){
+      memset(t, 0, sizeof(*t));
       t->nextPool = (i + 1);
       t++;
     }
--- ArrayPool.hpp_memset.diff ends here ---

I.e. after applying it I get ndb_dd_backuprestore test passing and
no programs aborting:

--- cut ---
ndb_dd_backuprestore           [ pass ]         109547
--- cut ---


Before filling a bug report I want to ask -
"Is this, that the constructors are not called, expected behavior?".
And thus "Is the above patch correct?".

Maybe this has something to do with Bug#18546 (at least the beginning
is the same :).

My system is FreeBSD 6.2-PRERELEASE/amd64 and MySQL is
compiled with BUILD/compile-amd64-debug-max but I think this
is irrelevant.

[1]
See the J option in http://www.freebsd.org/cgi/man.cgi?query=malloc.conf

-- 
Vasil Dimov
gro.DSBeerF@dv
%
The only difference between the saint and the sinner is that
every saint has a past and every sinner has a future.
                -- Oscar Wilde

Attachment: [application/pgp-signature]
Thread
ndb: usage of uninitialized memoryVasil Dimov15 Nov
  • Re: ndb: usage of uninitialized memoryJonas Oreland15 Nov
    • Re: ndb: usage of uninitialized memoryVasil Dimov15 Nov
      • Re: ndb: usage of uninitialized memoryJonas Oreland15 Nov
        • Re: ndb: usage of uninitialized memoryVasil Dimov15 Nov