List:Falcon Storage Engine« Previous MessageNext Message »
From:Kevin Lewis Date:March 24 2009 3:42pm
Subject:Re: Deadlock detector accesses deleted Transaction?
View as plain text  

Any thread can traverse a waitingFor list of TransactionState pointers 
without getting ref counts because they always have a shared lock on 
syncActiveTransactions while this search is carried out.

Suppose there are three transactions, A,B&C with A->B->C.  Suppose C 
rolls back, releases all useCounts on the TransactionState except the 1 
for the waitingFor pointer from B.  Then it unlocks its exclusive lock 
on syncIsActive and B is signaled.  C also gives up the exclusive lock 
on syncActiveTransactions.   A reader thread gets a shared lock on it 
and collects a pointer to C.   A nanosecond later, the thread for 
transaction B, which had given up its shared lock on 
syncActiveTransactions wakes up and removes the reference to C.  Notice 
that it does not get a shared lock on syncActiveTransactions before 
removing the link and refCount.  This is the bug!

We need to re-get the shared lock on syncActiveTransactions before 
removing the link in both locations where the link is removed;

	if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL))

	if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL))

Olav, Please make this change in your next push.

Olav Sandstaa wrote:
> Kevin Lewis wrote:
>> Olav, If a TransactionState::waitingFor is pointing to another 
>> TransactionState object, it should set a useCount.  Would this solve 
>> the problem?
> The existing code already has a useCount for this. Before a Transaction 
> sets the waitingFor to point to a Transaction (old 
> code)/TransactionState (new code) object it increases the use count on 
> the Transaction/TransactionState object. This protect the waitingFor 
> pointer to always point to a valid Transaction/TransactionState object.
> But this seems not to handle all situations. In the example I sent we 
> use transaction B's waitingFor pointer to locate the next transaction 
> (C) in the graph. Unfortunately, there is nothing preventing the thread 
> "owning" transaction B from continuing and remove the waitingFor pointer 
> and decrease the use count and then delete the 
> Transaction/TransactionState object. So even if the thread D got a valid 
> pointer to Transaction/TransactionState object C there is nothing 
> preventing the Transaction/TransactionState object from getting deleted 
> while it  accesses this pointer.
> So the current use of use count to protect this from happening seems not 
> to be enough.
> I think one way to protect this would be to introduce a "use count" on 
> the waitingFor pointer itself.  Then each thread traversing the graph 
> would set this new use count as it walked down the graph and removed it 
> as it walked back up. This would result in two interlocked operations 
> per node in the graph as we traverses it (seems a bit much to me...).
> There might be other, simpler solutions for this. I will try to think of 
> better ways.
> The scenario I presented in the last email might also not be what 
> actually happens.....
> Olav
>> Olav Sandstaa wrote:
>>> Hi,
>>> Bug#41665 reports a crash situation in
>>> Transaction::waitForTransaction() caused by Falcon accessing a deleted
>>> transaction object. I expected this to be caused by the RecordVersion
>>> object having a pointer to a deleted Transaction object. And by
>>> introducing the TransactionState object and using this instead of the
>>> Transaction object it first seemed like it fixed the problem. The
>>> attached repro case for bug#41655 which failed almost on every run
>>> before the fix ran without error for the ~15 test runs (or more than
>>> an hour) - but then it crashed again.
>>> The new crash occurs in the "dead lock detector" in
>>> Transaction::waitForTransaction(). In the for loop the "trans" pointer
>>> points to an illegal address:
>>>    volatile Transaction *trans;
>>>    for (trans = transaction->waitingFor; trans; trans = 
>>> trans->waitingFor)
>>>        if (trans == this)
>>>            {
>>>            *deadlock = true;
>>>            break;
>>>            }
>>> Note that as part of the rewrite and introduction of the new
>>> TransactionState object I have rewritten this code to:
>>>       volatile TransactionState *trans;
>>>    for (trans = ts->waitingFor; trans; trans = trans->waitingFor)
>>>        if (trans == this->transactionState)
>>>            {
>>>            *deadlock = true;
>>>            break;
>>>            }
>>> I hope this code should behave the same as the original code but I can
>>> not exclude that I have introduced something in the code that causes
>>> this crash.
>>> My initial theory for how this crash can happen is that the
>>> Transaction object's waitingFor pointer can access a deleted
>>> Transaction object. I have not been able to get this in a debuger/core
>>> file but currently my theory based on reading code is that the
>>> following scenario can happen:
>>> 1. When we start, the following three Transactions A, B, C exists and
>>> A and B is blocked by C:
>>>       |A|->|B|->|C|
>>>   where "->" is the waitingFor pointer.
>>> 2. Transaction C commits (or more likely aborts). This unlocks the
>>> "syncIsActive" lock which B is waiting on. Note that C's transaction
>>> object still exists since B has a use count on it.
>>> 3. A "new" transaction D calls Transaction::waitForTransaction(A) and
>>> the wait for graph becomes:
>>>     |D|->|A|->|B|->|C|
>>> 4. Transaction D initializes the "dead lock detector" by walking
>>> through this list. At some point the "trans" pointer points on "C" -
>>> at this time the thread is interrupted (or more likely runs in
>>> parallel with thread B on a multi-core machine).
>>> 5. Thread B is waken-up (signalled by thread C in step 2 above) and
>>> continue to run: it release the use count it has on C' transaction
>>> object. I think this can be the last use count and this leads to C's
>>> transaction object being deleted.
>>> 6. The thread running transaction D continues with the "dead lock
>>> detector". The "trans" pointer now point to a deleted (and
>>> overwritten) transaction object and this leads to it (eventually)
>>> accessing an illegal address (and crashes).
>>> Note that this is just a theory for how this might happen (and again:
>>> I do not rule out that I have introduced something in my prototype
>>> implementation of the new TransactionState object). Still if anyone
>>> has a better theory for what happens or suggestions for a solutions
>>> let me know.
>>> Note also the following comment in the header of the
>>> Transaction::waitForTransaction():
>>> // Note:
>>> // Deadlock check could use locking, because  there are potentially 
>>> concurrent
>>> // threads checking and modifying the waitFor list.
>>> // Instead, it implements a fancy lock-free algorithm  that works 
>>> reliably only
>>> // with full memory barriers. Thus "volatile"-specifier and 
>>> // are used  when traversing and modifying waitFor list. Maybe it is 
>>> better to
>>> // use inline assembly or intrinsics to generate memory barrier 
>>> instead of
>>> // volatile.
>>> I do not quite understand the purpose of the following use of
>>> "volatile" in the above code:
>>>  volatile Transaction *trans;
>>> Hopefully after a some hours of sleep I might have a good idea for
>>> what happens and how to fix this.
>>> Olav
Deadlock detector accesses deleted Transaction?Olav Sandstaa22 Mar
  • Re: Deadlock detector accesses deleted Transaction?Kevin Lewis23 Mar
    • Re: Deadlock detector accesses deleted Transaction?Olav Sandstaa23 Mar
      • Re: Deadlock detector accesses deleted Transaction?Kevin Lewis24 Mar
        • Re: Deadlock detector accesses deleted Transaction?Olav Sandstaa24 Mar
  • RE: Deadlock detector accesses deleted Transaction?Vladislav Vaintroub23 Mar
    • RE: Deadlock detector accesses deleted Transaction?Vladislav Vaintroub23 Mar