From: Olav Sandstaa Date: March 24 2009 10:26pm Subject: Re: Deadlock detector accesses deleted Transaction? List-Archive: http://lists.mysql.com/falcon/642 Message-Id: <49C95E27.8040508@sun.com> MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset=ISO-8859-1 Content-Transfer-Encoding: 7BIT Hi Kevin, Thanks for providing suggestions for how to best solve this. Kevin Lewis wrote: > Olav, > > Any thread can traverse a waitingFor list of TransactionState pointers > without getting ref counts because they always have a shared lock on > syncActiveTransactions while this search is carried out. > > Suppose there are three transactions, A,B&C with A->B->C. Suppose C > rolls back, releases all useCounts on the TransactionState except the > 1 for the waitingFor pointer from B. Then it unlocks its exclusive > lock on syncIsActive and B is signaled. C also gives up the exclusive > lock on syncActiveTransactions. A reader thread gets a shared lock > on it and collects a pointer to C. A nanosecond later, the thread > for transaction B, which had given up its shared lock on > syncActiveTransactions wakes up and removes the reference to C. > Notice that it does not get a shared lock on syncActiveTransactions > before removing the link and refCount. This is the bug! > > We need to re-get the shared lock on syncActiveTransactions before > removing the link in both locations where the link is removed; I do not quite understand how getting the shared lock on syncActiveTransactions before removing the waitingFor link will help. Most likely the thread for transaction B will be granted the shared lock on syncActiveTransactions "immediately" and will proceed to remove the waitingFor link from B to C and the last refCount on C. And the C transaction object will be deleted immediately - leaving the A reader thread's pointer pointing to a deleted transaction? It is a bit too late in the evening to think about lock-free data structures so I will give it another try tomorrow morning. Olav > > storage\falcon\Transaction.cpp(1036): > if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL)) > > storage\falcon\Transaction.cpp(1042): > if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL)) > > Olav, Please make this change in your next push. > > Olav Sandstaa wrote: >> Kevin Lewis wrote: >>> Olav, If a TransactionState::waitingFor is pointing to another >>> TransactionState object, it should set a useCount. Would this solve >>> the problem? >> >> The existing code already has a useCount for this. Before a >> Transaction sets the waitingFor to point to a Transaction (old >> code)/TransactionState (new code) object it increases the use count >> on the Transaction/TransactionState object. This protect the >> waitingFor pointer to always point to a valid >> Transaction/TransactionState object. >> >> But this seems not to handle all situations. In the example I sent we >> use transaction B's waitingFor pointer to locate the next transaction >> (C) in the graph. Unfortunately, there is nothing preventing the >> thread "owning" transaction B from continuing and remove the >> waitingFor pointer and decrease the use count and then delete the >> Transaction/TransactionState object. So even if the thread D got a >> valid pointer to Transaction/TransactionState object C there is >> nothing preventing the Transaction/TransactionState object from >> getting deleted while it accesses this pointer. >> >> So the current use of use count to protect this from happening seems >> not to be enough. >> >> I think one way to protect this would be to introduce a "use count" >> on the waitingFor pointer itself. Then each thread traversing the >> graph would set this new use count as it walked down the graph and >> removed it as it walked back up. This would result in two interlocked >> operations per node in the graph as we traverses it (seems a bit much >> to me...). >> >> There might be other, simpler solutions for this. I will try to think >> of better ways. >> >> The scenario I presented in the last email might also not be what >> actually happens..... >> >> Olav >> >> >>> >>> Olav Sandstaa wrote: >>>> Hi, >>>> >>>> Bug#41665 reports a crash situation in >>>> Transaction::waitForTransaction() caused by Falcon accessing a deleted >>>> transaction object. I expected this to be caused by the RecordVersion >>>> object having a pointer to a deleted Transaction object. And by >>>> introducing the TransactionState object and using this instead of the >>>> Transaction object it first seemed like it fixed the problem. The >>>> attached repro case for bug#41655 which failed almost on every run >>>> before the fix ran without error for the ~15 test runs (or more than >>>> an hour) - but then it crashed again. >>>> >>>> The new crash occurs in the "dead lock detector" in >>>> Transaction::waitForTransaction(). In the for loop the "trans" pointer >>>> points to an illegal address: >>>> >>>> volatile Transaction *trans; >>>> for (trans = transaction->waitingFor; trans; trans = >>>> trans->waitingFor) >>>> if (trans == this) >>>> { >>>> *deadlock = true; >>>> break; >>>> } >>>> >>>> Note that as part of the rewrite and introduction of the new >>>> TransactionState object I have rewritten this code to: >>>> >>>> volatile TransactionState *trans; >>>> for (trans = ts->waitingFor; trans; trans = trans->waitingFor) >>>> if (trans == this->transactionState) >>>> { >>>> *deadlock = true; >>>> break; >>>> } >>>> >>>> I hope this code should behave the same as the original code but I can >>>> not exclude that I have introduced something in the code that causes >>>> this crash. >>>> >>>> My initial theory for how this crash can happen is that the >>>> Transaction object's waitingFor pointer can access a deleted >>>> Transaction object. I have not been able to get this in a debuger/core >>>> file but currently my theory based on reading code is that the >>>> following scenario can happen: >>>> >>>> 1. When we start, the following three Transactions A, B, C exists and >>>> A and B is blocked by C: >>>> >>>> |A|->|B|->|C| >>>> >>>> where "->" is the waitingFor pointer. >>>> >>>> 2. Transaction C commits (or more likely aborts). This unlocks the >>>> "syncIsActive" lock which B is waiting on. Note that C's transaction >>>> object still exists since B has a use count on it. >>>> >>>> 3. A "new" transaction D calls Transaction::waitForTransaction(A) and >>>> the wait for graph becomes: >>>> >>>> |D|->|A|->|B|->|C| >>>> >>>> 4. Transaction D initializes the "dead lock detector" by walking >>>> through this list. At some point the "trans" pointer points on "C" - >>>> at this time the thread is interrupted (or more likely runs in >>>> parallel with thread B on a multi-core machine). >>>> >>>> 5. Thread B is waken-up (signalled by thread C in step 2 above) and >>>> continue to run: it release the use count it has on C' transaction >>>> object. I think this can be the last use count and this leads to C's >>>> transaction object being deleted. >>>> >>>> 6. The thread running transaction D continues with the "dead lock >>>> detector". The "trans" pointer now point to a deleted (and >>>> overwritten) transaction object and this leads to it (eventually) >>>> accessing an illegal address (and crashes). >>>> >>>> >>>> Note that this is just a theory for how this might happen (and again: >>>> I do not rule out that I have introduced something in my prototype >>>> implementation of the new TransactionState object). Still if anyone >>>> has a better theory for what happens or suggestions for a solutions >>>> let me know. >>>> >>>> Note also the following comment in the header of the >>>> Transaction::waitForTransaction(): >>>> >>>> // Note: >>>> // Deadlock check could use locking, because there are potentially >>>> concurrent >>>> // threads checking and modifying the waitFor list. >>>> // Instead, it implements a fancy lock-free algorithm that works >>>> reliably only >>>> // with full memory barriers. Thus "volatile"-specifier and >>>> COMPARE_EXCHANGE >>>> // are used when traversing and modifying waitFor list. Maybe it >>>> is better to >>>> // use inline assembly or intrinsics to generate memory barrier >>>> instead of >>>> // volatile. >>>> >>>> I do not quite understand the purpose of the following use of >>>> "volatile" in the above code: >>>> >>>> volatile Transaction *trans; >>>> >>>> Hopefully after a some hours of sleep I might have a good idea for >>>> what happens and how to fix this. >>>> >>>> Olav >>>> >>>> >>>> >> >>