From: Date: November 15 2006 9:43am Subject: ndb: usage of uninitialized memory List-Archive: http://lists.mysql.com/internals/34081 Message-Id: <20061115084324.GA88290@qlovarnika.bg.datamax> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="liOOAslEiF7prFVr" --liOOAslEiF7prFVr Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Hi, I got ndb_dd_backuprestore test failing with some core files lying around after that. Here is what happens (it is MySQL 5.1.14-beta from BK as of 2006.11.15): --- cut --- $ ./mysql-test-run.pl --mysqld=3D--binlog-format=3Dmixed --do-test=3Dndb_dd= _backuprestore =2E.. ndb_dd_backuprestore [ fail ] Errors are (from /usr/local/devel/mysql/mysql-5.1/mysql-test/var/log/mysqlt= est-time) : Failed to connect to ndb!! Failed to initialize consumers mysqltest: In included file "./include/ndb_restore_master.inc": At line 8: = command "$NDB_TOOLS_DIR/ndb_restore --no-defaults --ndb-connectstring=3D"lo= calhost:$NDBCLUSTER_PORT" -p 8 -b $the_backup_id -n 2 -r --print --print_m= eta $NDB_BACKUP_DIR/BACKUP/BACKUP-$the_backup_id >> $NDB_TOOLS_OUTPUT" fail= ed (the last lines may be the most important ones) Result from queries before failure can be found in r/ndb_dd_backuprestore.l= og --- cut --- Here are the last 4 lines from r/ndb_dd_backuprestore.log: --- r/ndb_dd_backuprestore.log begins here --- 113: ENGINE =3D NDB; 114: DROP LOGFILE GROUP log_group1 115: ENGINE =3DNDB; 116: Abort trap (core dumped) <--- great, we will play with gdb soon... --- r/ndb_dd_backuprestore.log ends here --- and the relevant kernel messages: --- kern.all begins here --- Nov 15 08:39:31 qlovarnika kernel: pid 80453 (ndbd), uid 1001: exited on si= gnal 6 (core dumped) Nov 15 08:39:31 qlovarnika kernel: pid 80454 (ndbd), uid 1001: exited on si= gnal 6 (core dumped) Nov 15 08:39:54 qlovarnika kernel: pid 80492 (ndb_restore), uid 1001: exite= d on signal 6 (core dumped) --- kern.all ends here --- ndb_restore is aborting because it cannot connect to ndb, the root of the problem is that ndbd aborts, here is its backtrace: --- ndbd backtrace begins here --- Core was generated by `ndbd'. Program terminated with signal 6, Aborted. #0 0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so= =2E2 [New Thread 0x6c8c000 (sleeping)] [New Thread 0x6b75c00 (sleeping)] [New Thread 0x6b75800 (sleeping)] [New Thread 0x6b75400 (sleeping)] [New Thread 0x6b75000 (sleeping)] [New Thread 0x6a60c00 (sleeping)] [New Thread 0x6a60800 (sleeping)] [New Thread 0x6a60400 (sleeping)] [New Thread 0x6a60000 (sleeping)] [New Thread 0x6949c00 (sleeping)] [New Thread 0x6949800 (sleeping)] [New Thread 0x6949400 (sleeping)] [New Thread 0x6949000 (sleeping)] [New Thread 0x6832c00 (sleeping)] [New Thread 0x6832800 (sleeping)] [New Thread 0x6832400 (sleeping)] [New Thread 0x6832000 (sleeping)] [New Thread 0x671cc00 (sleeping)] [New Thread 0x671c800 (sleeping)] [New Thread 0x671c400 (sleeping)] [New Thread 0x671c000 (sleeping)] [New Thread 0x6604c00 (sleeping)] [New Thread 0x6604800 (sleeping)] [New Thread 0x6604400 (sleeping)] [New Thread 0x6604000 (sleeping)] [New Thread 0x6424c00 (sleeping)] [New Thread 0x6424800 (sleeping)] [New Thread 0xa14800 (LWP 100306)] [New Thread 0x8e1000 (LWP 100184)] #0 0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so= =2E2 #1 0x0000000800ab55c3 in sigaction () from /usr/lib/libpthread.so.2 #2 0x0000000800ab70e2 in sigaction () from /usr/lib/libpthread.so.2 #3 0x0000000800ab0db6 in pthread_kill () from /usr/lib/libpthread.so.2 #4 0x0000000800ab0633 in raise () from /usr/lib/libpthread.so.2 #5 0x0000000800ca115d in abort () from /lib/libc.so.6 #6 0x000000000040541b in childAbort (code=3D-1, currentStartPhase=3D4) at main.cpp:105 #7 0x00000000005bd2ee in NdbShutdown (type=3DNST_ErrorHandler,=20 restartType=3DNRT_Default) at Emulator.cpp:255 #8 0x00000000005c6569 in ErrorReporter::handleError (messageID=3D2306,=20 problemData=3D0x4
,=20 objRef=3D0x6
, nst=3DNST_ErrorHandler) at ErrorReporter.cpp:206 #9 0x00000000005b5948 in SimulatedBlock::progError (this=3D0xed4000, line= =3D538,=20 err_code=3D2306, extra=3D0x629dfd "dbtup/DbtupScan.cpp") at SimulatedBlock.cpp:738 #10 0x0000000000537444 in Dbtup::scanFirst (this=3D0xed4000, scanPtr=3D {p =3D 0x4378000, i =3D 0}) at dbtup/DbtupScan.cpp:538 #11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=3D0xed4000,=20 signal=3D0x8c5b28) at dbtup/DbtupScan.cpp:1069 #12 0x00000000004d23cf in Dblqh::execLCP_PREPARE_CONF (this=3D0xec8000,=20 signal=3D0x8c5b28) at SimulatedBlock.hpp:576 #13 0x00000000005b9bf7 in FastScheduler::doJob (this=3D0x8cf4e0) at SimulatedBlock.hpp:576 #14 0x00000000005bb485 in ThreadConfig::ipControlLoop (this=3D0x0) at ThreadConfig.cpp:175 #15 0x0000000000405fac in main (argc=3D7, argv=3D0x7fffffffcff8) at main.cp= p:474 --- ndbd backtrace ends here --- The file names and line numbers in the backtrace are a little bit misleading because SimulatedBlock::executeFunction() is inline'd. Anyway let's dig a little bit more: --- gdb session begins here --- (gdb) f 10 #10 0x0000000000537444 in Dbtup::scanFirst (this=3D0xed4000, scanPtr=3D {p =3D 0x4378000, i =3D 0}) at dbtup/DbtupScan.cpp:538 538 ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord); --- gdb session ends here --- This is the implementation of ptrCheckGuard: --- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr begins here --- 93 #define ptrCheckGuardErr(ptr, limit, rec, error) {\ 94 UintR TxxzLimit; \ 95 TxxzLimit =3D (limit); \ 96 UintR TxxxPtr; \ 97 TxxxPtr =3D ptr.i; \ 98 ptr.p =3D &rec[TxxxPtr]; \ 99 if (TxxxPtr < (TxxzLimit)) { \ 100 ; \ 101 } else { \ 102 progError(__LINE__, error, __FILE__); \ 103 }} --- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr ends here --- It calls progError() if ptr.i >=3D limit, so let's investigate: --- gdb session begins here --- (gdb) f 10 #10 0x0000000000537444 in Dbtup::scanFirst (this=3D0xed4000, scanPtr=3D {p =3D 0x4378000, i =3D 0}) at dbtup/DbtupScan.cpp:538 538 ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord); (gdb) ins fragPtr.i $1 =3D 3503345872 <----- Seems bogus (gdb) ins cnoOfFragrec $2 =3D 452 (gdb) l 534 529 Dbtup::scanFirst(Signal*, ScanOpPtr scanPtr) 530 { 531 ScanOp& scan =3D *scanPtr.p; 532 ScanPos& pos =3D scan.m_scanPos; 533 Local_key& key =3D pos.m_key; 534 const Uint32 bits =3D scan.m_bits; 535 // fragment 536 FragrecordPtr fragPtr; 537 fragPtr.i =3D scan.m_fragPtrI; 538 ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord); // // So fragPtr.i came from scan.m_fragPtrI which came from // scanPtr.p->m_fragPtrI which was passed as an argument to // Dbtup::scanFirst() // (gdb) ins scanPtr.p->m_fragPtrI $3 =3D 3503345872 // // That's correct: fragPtr.i has the same value as scanPtr.p->m_fragPtrI, // let's see where did it came from: // (gdb) f 11 #11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=3D0xed4000, signal=3D0x8c5b28) at dbtup/DbtupScan.cpp:1069 1069 scanFirst(signal, scanPtr); (gdb) l 1065,1069 1065 ScanOpPtr scanPtr; 1066 c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op); 1067 //ndbrequire(scanPtr.p->m_fragPtrI =3D=3D fragPtr.i); ? 1068 1069 scanFirst(signal, scanPtr); // // Obviously scanPtr must have been initialized in c_scanOpPool.getPtr(). // What type is c_scanOpPool? // (gdb) ins &c_scanOpPool $4 =3D (ArrayPool *) 0xed6fc8 // // So here is the body of getPtr(): // (gdb) l ArrayPool.hpp:360,366 360 template 361 inline 362 void 363 ArrayPool::getPtr(Ptr & ptr, Uint32 i){ 364 ptr.i =3D i; 365 if(likely(i < size)){ 366 ptr.p =3D &theArray[i]; // // So ptr.p is just assigned the i'th member of theArray. // What is the value of i? // (gdb) f 11 #11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=3D0xed4000, signal=3D0x8c5b28) at dbtup/DbtupScan.cpp:1069 1069 scanFirst(signal, scanPtr); (gdb) l 1065,1066 1065 ScanOpPtr scanPtr; 1066 c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op); (gdb) ins frag.m_lcp_scan_op $5 =3D 0 // // 0, great. Let's check: // (gdb) ins c_scanOpPool.theArray[0].m_fragPtrI $6 =3D 3503345872 // // Yes, the same value (3503345872), let's look at the whole 0th // member of theArray: // (gdb) ins c_scanOpPool.theArray[0] $7 =3D {m_state =3D 53456, m_bits =3D 53456, m_userPtr =3D 3503345872, m_userRef =3D 3503345872, m_tableId =3D 3503345872, m_fragId =3D 35033458= 72, m_fragPtrI =3D 3503345872, m_transId1 =3D 3503345872, m_transId2 =3D 3503= 345872, { m_savePointId =3D 3503345872, m_scanGCI =3D 3503345872}, m_accLockOp =3D 3503345872, m_scanPos =3D {m_get =3D 3503345872, m_key = =3D { m_page_no =3D 3503345872, m_page_idx =3D 53456, m_file_no =3D 53456}, m_page =3D 0xd0d0d0d0d0d0d0d0, m_key_mm =3D {m_page_no =3D 3503345872, m_page_idx =3D 53456, m_file_no =3D 53456}, m_realpid_mm =3D 35033458= 72, m_extent_info_ptr_i =3D 3503345872}, m_accLockOps =3D { firstItem =3D 3503345872, lastItem =3D 3503345872, in_use =3D 208}, { nextPool =3D 1, nextList =3D 1}, prevList =3D 3503345872} // // A lot of members with the bogus value 3503345872 // which is D0D0D0D0 in hex. Now I remember that I have set up my // system malloc to initialize each byte of returned memory with 0xd0 [1] // Let's see where theArray member is allocated: // (gdb) l ArrayPool.hpp:220,262 220 template 221 inline 222 bool 223 ArrayPool::setSize(Uint32 noOfElements, 224 bool align, bool exit_on_error, bool guard){ 225 if(size =3D=3D 0){ 226 if(noOfElements =3D=3D 0) 227 return true; 228 if(align) 229 { 230 alloc_ptr =3D ndbd_malloc((noOfElements+1) * sizeof(T)); 231 UintPtr p =3D (UintPtr)alloc_ptr; 232 UintPtr mod =3D p % sizeof(T); 233 if (mod) 234 { 235 p +=3D sizeof(T) - mod; 236 } 237 theArray =3D (T *)p; 238 } 239 else 240 theArray =3D (T *)(alloc_ptr =3D ndbd_malloc(noOfElements * s= izeof(T))); 241 242 if(theArray =3D=3D 0) 243 { 244 if (!exit_on_error) 245 return false; 246 ErrorReporter::handleAssert("ArrayPool::setSize malloc fai= led", 247 __FILE__, __LINE__, NDBD_EXIT_MEM= ALLOC); 248 return false; // not reached 249 } 250 size =3D noOfElements; 251 noOfFree =3D noOfElements; 252 253 /** 254 * Set next pointers 255 */ 256 T * t =3D &theArray[0]; 257 for(Uint32 i =3D 0; inextPool =3D (i + 1); 259 t++; 260 } 261 theArray[size-1].nextPool =3D RNIL; 262 firstFree =3D 0; // // T in this case is ScanOp which has a nice constructor (that // initializes its members) but it is never called because ndbd_malloc() // is used instead of new. As one can see above - only nextPool members // are initialized (manually) after the allocation. Others are left // untouched as they are returned by ndbd_malloc(). // --- gdb session ends here --- The following patch fixes the problem: --- ArrayPool.hpp_memset.diff begins here --- --- storage/ndb/src/kernel/vm/ArrayPool.hpp.orig Wed Nov 15 09:48:39 2006 +++ storage/ndb/src/kernel/vm/ArrayPool.hpp Wed Nov 15 09:47:14 2006 @@ -255,6 +255,7 @@ */ T * t =3D &theArray[0]; for(Uint32 i =3D 0; inextPool =3D (i + 1); t++; } --- ArrayPool.hpp_memset.diff ends here --- I.e. after applying it I get ndb_dd_backuprestore test passing and no programs aborting: --- cut --- ndb_dd_backuprestore [ pass ] 109547 --- cut --- Before filling a bug report I want to ask - "Is this, that the constructors are not called, expected behavior?". And thus "Is the above patch correct?". Maybe this has something to do with Bug#18546 (at least the beginning is the same :). My system is FreeBSD 6.2-PRERELEASE/amd64 and MySQL is compiled with BUILD/compile-amd64-debug-max but I think this is irrelevant. [1] See the J option in http://www.freebsd.org/cgi/man.cgi?query=3Dmalloc.conf --=20 Vasil Dimov gro.DSBeerF@dv % The only difference between the saint and the sinner is that every saint has a past and every sinner has a future. -- Oscar Wilde --liOOAslEiF7prFVr Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- iD8DBQFFWtMsFw6SP/bBpCARAiLJAJ9UzciuU9cWlD2o2nXabDznoFrqlwCg1Bqe V1JQAx5ipEiJbDyO82YGVdM= =Fiu0 -----END PGP SIGNATURE----- --liOOAslEiF7prFVr--