Hi,
I got ndb_dd_backuprestore test failing with some core files lying
around after that.
Here is what happens (it is MySQL 5.1.14-beta from BK as of 2006.11.15):
--- cut ---
$ ./mysql-test-run.pl --mysqld=--binlog-format=mixed --do-test=ndb_dd_backuprestore
...
ndb_dd_backuprestore [ fail ]
Errors are (from /usr/local/devel/mysql/mysql-5.1/mysql-test/var/log/mysqltest-time) :
Failed to connect to ndb!!
Failed to initialize consumers
mysqltest: In included file "./include/ndb_restore_master.inc": At line 8: command
"$NDB_TOOLS_DIR/ndb_restore --no-defaults
--ndb-connectstring="localhost:$NDBCLUSTER_PORT" -p 8 -b $the_backup_id -n 2 -r --print
--print_meta $NDB_BACKUP_DIR/BACKUP/BACKUP-$the_backup_id >> $NDB_TOOLS_OUTPUT"
failed
(the last lines may be the most important ones)
Result from queries before failure can be found in r/ndb_dd_backuprestore.log
--- cut ---
Here are the last 4 lines from r/ndb_dd_backuprestore.log:
--- r/ndb_dd_backuprestore.log begins here ---
113: ENGINE = NDB;
114: DROP LOGFILE GROUP log_group1
115: ENGINE =NDB;
116: Abort trap (core dumped) <--- great, we will play with gdb soon...
--- r/ndb_dd_backuprestore.log ends here ---
and the relevant kernel messages:
--- kern.all begins here ---
Nov 15 08:39:31 qlovarnika kernel: pid 80453 (ndbd), uid 1001: exited on signal 6 (core
dumped)
Nov 15 08:39:31 qlovarnika kernel: pid 80454 (ndbd), uid 1001: exited on signal 6 (core
dumped)
Nov 15 08:39:54 qlovarnika kernel: pid 80492 (ndb_restore), uid 1001: exited on signal 6
(core dumped)
--- kern.all ends here ---
ndb_restore is aborting because it cannot connect to ndb,
the root of the problem is that ndbd aborts, here is its backtrace:
--- ndbd backtrace begins here ---
Core was generated by `ndbd'.
Program terminated with signal 6, Aborted.
#0 0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so.2
[New Thread 0x6c8c000 (sleeping)]
[New Thread 0x6b75c00 (sleeping)]
[New Thread 0x6b75800 (sleeping)]
[New Thread 0x6b75400 (sleeping)]
[New Thread 0x6b75000 (sleeping)]
[New Thread 0x6a60c00 (sleeping)]
[New Thread 0x6a60800 (sleeping)]
[New Thread 0x6a60400 (sleeping)]
[New Thread 0x6a60000 (sleeping)]
[New Thread 0x6949c00 (sleeping)]
[New Thread 0x6949800 (sleeping)]
[New Thread 0x6949400 (sleeping)]
[New Thread 0x6949000 (sleeping)]
[New Thread 0x6832c00 (sleeping)]
[New Thread 0x6832800 (sleeping)]
[New Thread 0x6832400 (sleeping)]
[New Thread 0x6832000 (sleeping)]
[New Thread 0x671cc00 (sleeping)]
[New Thread 0x671c800 (sleeping)]
[New Thread 0x671c400 (sleeping)]
[New Thread 0x671c000 (sleeping)]
[New Thread 0x6604c00 (sleeping)]
[New Thread 0x6604800 (sleeping)]
[New Thread 0x6604400 (sleeping)]
[New Thread 0x6604000 (sleeping)]
[New Thread 0x6424c00 (sleeping)]
[New Thread 0x6424800 (sleeping)]
[New Thread 0xa14800 (LWP 100306)]
[New Thread 0x8e1000 (LWP 100184)]
#0 0x0000000800ac789c in pthread_testcancel () from /usr/lib/libpthread.so.2
#1 0x0000000800ab55c3 in sigaction () from /usr/lib/libpthread.so.2
#2 0x0000000800ab70e2 in sigaction () from /usr/lib/libpthread.so.2
#3 0x0000000800ab0db6 in pthread_kill () from /usr/lib/libpthread.so.2
#4 0x0000000800ab0633 in raise () from /usr/lib/libpthread.so.2
#5 0x0000000800ca115d in abort () from /lib/libc.so.6
#6 0x000000000040541b in childAbort (code=-1, currentStartPhase=4)
at main.cpp:105
#7 0x00000000005bd2ee in NdbShutdown (type=NST_ErrorHandler,
restartType=NRT_Default) at Emulator.cpp:255
#8 0x00000000005c6569 in ErrorReporter::handleError (messageID=2306,
problemData=0x4 <Address 0x4 out of bounds>,
objRef=0x6 <Address 0x6 out of bounds>, nst=NST_ErrorHandler)
at ErrorReporter.cpp:206
#9 0x00000000005b5948 in SimulatedBlock::progError (this=0xed4000, line=538,
err_code=2306, extra=0x629dfd "dbtup/DbtupScan.cpp")
at SimulatedBlock.cpp:738
#10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
{p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
#11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
#12 0x00000000004d23cf in Dblqh::execLCP_PREPARE_CONF (this=0xec8000,
signal=0x8c5b28) at SimulatedBlock.hpp:576
#13 0x00000000005b9bf7 in FastScheduler::doJob (this=0x8cf4e0)
at SimulatedBlock.hpp:576
#14 0x00000000005bb485 in ThreadConfig::ipControlLoop (this=0x0)
at ThreadConfig.cpp:175
#15 0x0000000000405fac in main (argc=7, argv=0x7fffffffcff8) at main.cpp:474
--- ndbd backtrace ends here ---
The file names and line numbers in the backtrace are a little bit
misleading because SimulatedBlock::executeFunction() is inline'd.
Anyway let's dig a little bit more:
--- gdb session begins here ---
(gdb) f 10
#10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
{p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
538 ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
--- gdb session ends here ---
This is the implementation of ptrCheckGuard:
--- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr begins here ---
93 #define ptrCheckGuardErr(ptr, limit, rec, error) {\
94 UintR TxxzLimit; \
95 TxxzLimit = (limit); \
96 UintR TxxxPtr; \
97 TxxxPtr = ptr.i; \
98 ptr.p = &rec[TxxxPtr]; \
99 if (TxxxPtr < (TxxzLimit)) { \
100 ; \
101 } else { \
102 progError(__LINE__, error, __FILE__); \
103 }}
--- storage/ndb/src/kernel/vm/pc.hpp:ptrCheckGuardErr ends here ---
It calls progError() if ptr.i >= limit, so let's investigate:
--- gdb session begins here ---
(gdb) f 10
#10 0x0000000000537444 in Dbtup::scanFirst (this=0xed4000, scanPtr=
{p = 0x4378000, i = 0}) at dbtup/DbtupScan.cpp:538
538 ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
(gdb) ins fragPtr.i
$1 = 3503345872 <----- Seems bogus
(gdb) ins cnoOfFragrec
$2 = 452
(gdb) l 534
529 Dbtup::scanFirst(Signal*, ScanOpPtr scanPtr)
530 {
531 ScanOp& scan = *scanPtr.p;
532 ScanPos& pos = scan.m_scanPos;
533 Local_key& key = pos.m_key;
534 const Uint32 bits = scan.m_bits;
535 // fragment
536 FragrecordPtr fragPtr;
537 fragPtr.i = scan.m_fragPtrI;
538 ptrCheckGuard(fragPtr, cnoOfFragrec, fragrecord);
//
// So fragPtr.i came from scan.m_fragPtrI which came from
// scanPtr.p->m_fragPtrI which was passed as an argument to
// Dbtup::scanFirst()
//
(gdb) ins scanPtr.p->m_fragPtrI
$3 = 3503345872
//
// That's correct: fragPtr.i has the same value as scanPtr.p->m_fragPtrI,
// let's see where did it came from:
//
(gdb) f 11
#11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
1069 scanFirst(signal, scanPtr);
(gdb) l 1065,1069
1065 ScanOpPtr scanPtr;
1066 c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op);
1067 //ndbrequire(scanPtr.p->m_fragPtrI == fragPtr.i); ?
1068
1069 scanFirst(signal, scanPtr);
//
// Obviously scanPtr must have been initialized in c_scanOpPool.getPtr().
// What type is c_scanOpPool?
//
(gdb) ins &c_scanOpPool
$4 = (ArrayPool<Dbtup::ScanOp> *) 0xed6fc8
//
// So here is the body of getPtr():
//
(gdb) l ArrayPool.hpp:360,366
360 template <class T>
361 inline
362 void
363 ArrayPool<T>::getPtr(Ptr<T> & ptr, Uint32 i){
364 ptr.i = i;
365 if(likely(i < size)){
366 ptr.p = &theArray[i];
//
// So ptr.p is just assigned the i'th member of theArray.
// What is the value of i?
//
(gdb) f 11
#11 0x000000000053904f in Dbtup::execLCP_FRAG_ORD (this=0xed4000,
signal=0x8c5b28) at dbtup/DbtupScan.cpp:1069
1069 scanFirst(signal, scanPtr);
(gdb) l 1065,1066
1065 ScanOpPtr scanPtr;
1066 c_scanOpPool.getPtr(scanPtr, frag.m_lcp_scan_op);
(gdb) ins frag.m_lcp_scan_op
$5 = 0
//
// 0, great. Let's check:
//
(gdb) ins c_scanOpPool.theArray[0].m_fragPtrI
$6 = 3503345872
//
// Yes, the same value (3503345872), let's look at the whole 0th
// member of theArray:
//
(gdb) ins c_scanOpPool.theArray[0]
$7 = {m_state = 53456, m_bits = 53456, m_userPtr = 3503345872,
m_userRef = 3503345872, m_tableId = 3503345872, m_fragId = 3503345872,
m_fragPtrI = 3503345872, m_transId1 = 3503345872, m_transId2 = 3503345872, {
m_savePointId = 3503345872, m_scanGCI = 3503345872},
m_accLockOp = 3503345872, m_scanPos = {m_get = 3503345872, m_key = {
m_page_no = 3503345872, m_page_idx = 53456, m_file_no = 53456},
m_page = 0xd0d0d0d0d0d0d0d0, m_key_mm = {m_page_no = 3503345872,
m_page_idx = 53456, m_file_no = 53456}, m_realpid_mm = 3503345872,
m_extent_info_ptr_i = 3503345872}, m_accLockOps = {
firstItem = 3503345872, lastItem = 3503345872, in_use = 208}, {
nextPool = 1, nextList = 1}, prevList = 3503345872}
//
// A lot of members with the bogus value 3503345872
// which is D0D0D0D0 in hex. Now I remember that I have set up my
// system malloc to initialize each byte of returned memory with 0xd0 [1]
// Let's see where theArray member is allocated:
//
(gdb) l ArrayPool.hpp:220,262
220 template <class T>
221 inline
222 bool
223 ArrayPool<T>::setSize(Uint32 noOfElements,
224 bool align, bool exit_on_error, bool guard){
225 if(size == 0){
226 if(noOfElements == 0)
227 return true;
228 if(align)
229 {
230 alloc_ptr = ndbd_malloc((noOfElements+1) * sizeof(T));
231 UintPtr p = (UintPtr)alloc_ptr;
232 UintPtr mod = p % sizeof(T);
233 if (mod)
234 {
235 p += sizeof(T) - mod;
236 }
237 theArray = (T *)p;
238 }
239 else
240 theArray = (T *)(alloc_ptr = ndbd_malloc(noOfElements * sizeof(T)));
241
242 if(theArray == 0)
243 {
244 if (!exit_on_error)
245 return false;
246 ErrorReporter::handleAssert("ArrayPool<T>::setSize malloc failed",
247 __FILE__, __LINE__, NDBD_EXIT_MEMALLOC);
248 return false; // not reached
249 }
250 size = noOfElements;
251 noOfFree = noOfElements;
252
253 /**
254 * Set next pointers
255 */
256 T * t = &theArray[0];
257 for(Uint32 i = 0; i<size; i++){
258 t->nextPool = (i + 1);
259 t++;
260 }
261 theArray[size-1].nextPool = RNIL;
262 firstFree = 0;
//
// T in this case is ScanOp which has a nice constructor (that
// initializes its members) but it is never called because ndbd_malloc()
// is used instead of new. As one can see above - only nextPool members
// are initialized (manually) after the allocation. Others are left
// untouched as they are returned by ndbd_malloc().
//
--- gdb session ends here ---
The following patch fixes the problem:
--- ArrayPool.hpp_memset.diff begins here ---
--- storage/ndb/src/kernel/vm/ArrayPool.hpp.orig Wed Nov 15 09:48:39 2006
+++ storage/ndb/src/kernel/vm/ArrayPool.hpp Wed Nov 15 09:47:14 2006
@@ -255,6 +255,7 @@
*/
T * t = &theArray[0];
for(Uint32 i = 0; i<size; i++){
+ memset(t, 0, sizeof(*t));
t->nextPool = (i + 1);
t++;
}
--- ArrayPool.hpp_memset.diff ends here ---
I.e. after applying it I get ndb_dd_backuprestore test passing and
no programs aborting:
--- cut ---
ndb_dd_backuprestore [ pass ] 109547
--- cut ---
Before filling a bug report I want to ask -
"Is this, that the constructors are not called, expected behavior?".
And thus "Is the above patch correct?".
Maybe this has something to do with Bug#18546 (at least the beginning
is the same :).
My system is FreeBSD 6.2-PRERELEASE/amd64 and MySQL is
compiled with BUILD/compile-amd64-debug-max but I think this
is irrelevant.
[1]
See the J option in http://www.freebsd.org/cgi/man.cgi?query=malloc.conf
--
Vasil Dimov
gro.DSBeerF@dv
%
The only difference between the saint and the sinner is that
every saint has a past and every sinner has a future.
-- Oscar Wilde
Attachment: [application/pgp-signature]