From: Olav Sandstaa Date: March 22 2009 10:59pm Subject: Deadlock detector accesses deleted Transaction? List-Archive: http://lists.mysql.com/falcon/629 Message-Id: <49C6C2BD.3000107@sun.com> MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset=ISO-8859-1 Content-Transfer-Encoding: 7BIT Hi, Bug#41665 reports a crash situation in Transaction::waitForTransaction() caused by Falcon accessing a deleted transaction object. I expected this to be caused by the RecordVersion object having a pointer to a deleted Transaction object. And by introducing the TransactionState object and using this instead of the Transaction object it first seemed like it fixed the problem. The attached repro case for bug#41655 which failed almost on every run before the fix ran without error for the ~15 test runs (or more than an hour) - but then it crashed again. The new crash occurs in the "dead lock detector" in Transaction::waitForTransaction(). In the for loop the "trans" pointer points to an illegal address: volatile Transaction *trans; for (trans = transaction->waitingFor; trans; trans = trans->waitingFor) if (trans == this) { *deadlock = true; break; } Note that as part of the rewrite and introduction of the new TransactionState object I have rewritten this code to: volatile TransactionState *trans; for (trans = ts->waitingFor; trans; trans = trans->waitingFor) if (trans == this->transactionState) { *deadlock = true; break; } I hope this code should behave the same as the original code but I can not exclude that I have introduced something in the code that causes this crash. My initial theory for how this crash can happen is that the Transaction object's waitingFor pointer can access a deleted Transaction object. I have not been able to get this in a debuger/core file but currently my theory based on reading code is that the following scenario can happen: 1. When we start, the following three Transactions A, B, C exists and A and B is blocked by C: |A|->|B|->|C| where "->" is the waitingFor pointer. 2. Transaction C commits (or more likely aborts). This unlocks the "syncIsActive" lock which B is waiting on. Note that C's transaction object still exists since B has a use count on it. 3. A "new" transaction D calls Transaction::waitForTransaction(A) and the wait for graph becomes: |D|->|A|->|B|->|C| 4. Transaction D initializes the "dead lock detector" by walking through this list. At some point the "trans" pointer points on "C" - at this time the thread is interrupted (or more likely runs in parallel with thread B on a multi-core machine). 5. Thread B is waken-up (signalled by thread C in step 2 above) and continue to run: it release the use count it has on C' transaction object. I think this can be the last use count and this leads to C's transaction object being deleted. 6. The thread running transaction D continues with the "dead lock detector". The "trans" pointer now point to a deleted (and overwritten) transaction object and this leads to it (eventually) accessing an illegal address (and crashes). Note that this is just a theory for how this might happen (and again: I do not rule out that I have introduced something in my prototype implementation of the new TransactionState object). Still if anyone has a better theory for what happens or suggestions for a solutions let me know. Note also the following comment in the header of the Transaction::waitForTransaction(): // Note: // Deadlock check could use locking, because there are potentially concurrent // threads checking and modifying the waitFor list. // Instead, it implements a fancy lock-free algorithm that works reliably only // with full memory barriers. Thus "volatile"-specifier and COMPARE_EXCHANGE // are used when traversing and modifying waitFor list. Maybe it is better to // use inline assembly or intrinsics to generate memory barrier instead of // volatile. I do not quite understand the purpose of the following use of "volatile" in the above code: volatile Transaction *trans; Hopefully after a some hours of sleep I might have a good idea for what happens and how to fix this. Olav