List:Falcon Storage Engine« Previous MessageNext Message »
From:Olav Sandstaa Date:March 24 2009 10:26pm
Subject:Re: Deadlock detector accesses deleted Transaction?
View as plain text  
Hi Kevin,

Thanks for providing suggestions for how to best solve this.

Kevin Lewis wrote:
> Olav,
> Any thread can traverse a waitingFor list of TransactionState pointers 
> without getting ref counts because they always have a shared lock on 
> syncActiveTransactions while this search is carried out.
> Suppose there are three transactions, A,B&C with A->B->C.  Suppose C 
> rolls back, releases all useCounts on the TransactionState except the 
> 1 for the waitingFor pointer from B.  Then it unlocks its exclusive 
> lock on syncIsActive and B is signaled.  C also gives up the exclusive 
> lock on syncActiveTransactions.   A reader thread gets a shared lock 
> on it and collects a pointer to C.   A nanosecond later, the thread 
> for transaction B, which had given up its shared lock on 
> syncActiveTransactions wakes up and removes the reference to C.  
> Notice that it does not get a shared lock on syncActiveTransactions 
> before removing the link and refCount.  This is the bug!
> We need to re-get the shared lock on syncActiveTransactions before 
> removing the link in both locations where the link is removed;

I do not quite understand how getting the shared lock on 
syncActiveTransactions before removing the waitingFor link will help.

Most likely the thread for transaction B will be granted the shared lock 
on syncActiveTransactions "immediately" and will proceed to remove the 
waitingFor link from B to C and the last refCount on C. And the C 
transaction object will be deleted immediately - leaving the A reader 
thread's pointer pointing to a deleted transaction?

It is a bit too late in the evening to think about lock-free data 
structures so I will give it another try tomorrow morning.


>   storage\falcon\Transaction.cpp(1036):
>     if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL))
>   storage\falcon\Transaction.cpp(1042):   
>     if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL))
> Olav, Please make this change in your next push.
> Olav Sandstaa wrote:
>> Kevin Lewis wrote:
>>> Olav, If a TransactionState::waitingFor is pointing to another 
>>> TransactionState object, it should set a useCount.  Would this solve 
>>> the problem?
>> The existing code already has a useCount for this. Before a 
>> Transaction sets the waitingFor to point to a Transaction (old 
>> code)/TransactionState (new code) object it increases the use count 
>> on the Transaction/TransactionState object. This protect the 
>> waitingFor pointer to always point to a valid 
>> Transaction/TransactionState object.
>> But this seems not to handle all situations. In the example I sent we 
>> use transaction B's waitingFor pointer to locate the next transaction 
>> (C) in the graph. Unfortunately, there is nothing preventing the 
>> thread "owning" transaction B from continuing and remove the 
>> waitingFor pointer and decrease the use count and then delete the 
>> Transaction/TransactionState object. So even if the thread D got a 
>> valid pointer to Transaction/TransactionState object C there is 
>> nothing preventing the Transaction/TransactionState object from 
>> getting deleted while it  accesses this pointer.
>> So the current use of use count to protect this from happening seems 
>> not to be enough.
>> I think one way to protect this would be to introduce a "use count" 
>> on the waitingFor pointer itself.  Then each thread traversing the 
>> graph would set this new use count as it walked down the graph and 
>> removed it as it walked back up. This would result in two interlocked 
>> operations per node in the graph as we traverses it (seems a bit much 
>> to me...).
>> There might be other, simpler solutions for this. I will try to think 
>> of better ways.
>> The scenario I presented in the last email might also not be what 
>> actually happens.....
>> Olav
>>> Olav Sandstaa wrote:
>>>> Hi,
>>>> Bug#41665 reports a crash situation in
>>>> Transaction::waitForTransaction() caused by Falcon accessing a deleted
>>>> transaction object. I expected this to be caused by the RecordVersion
>>>> object having a pointer to a deleted Transaction object. And by
>>>> introducing the TransactionState object and using this instead of the
>>>> Transaction object it first seemed like it fixed the problem. The
>>>> attached repro case for bug#41655 which failed almost on every run
>>>> before the fix ran without error for the ~15 test runs (or more than
>>>> an hour) - but then it crashed again.
>>>> The new crash occurs in the "dead lock detector" in
>>>> Transaction::waitForTransaction(). In the for loop the "trans" pointer
>>>> points to an illegal address:
>>>>    volatile Transaction *trans;
>>>>    for (trans = transaction->waitingFor; trans; trans = 
>>>> trans->waitingFor)
>>>>        if (trans == this)
>>>>            {
>>>>            *deadlock = true;
>>>>            break;
>>>>            }
>>>> Note that as part of the rewrite and introduction of the new
>>>> TransactionState object I have rewritten this code to:
>>>>       volatile TransactionState *trans;
>>>>    for (trans = ts->waitingFor; trans; trans = trans->waitingFor)
>>>>        if (trans == this->transactionState)
>>>>            {
>>>>            *deadlock = true;
>>>>            break;
>>>>            }
>>>> I hope this code should behave the same as the original code but I can
>>>> not exclude that I have introduced something in the code that causes
>>>> this crash.
>>>> My initial theory for how this crash can happen is that the
>>>> Transaction object's waitingFor pointer can access a deleted
>>>> Transaction object. I have not been able to get this in a debuger/core
>>>> file but currently my theory based on reading code is that the
>>>> following scenario can happen:
>>>> 1. When we start, the following three Transactions A, B, C exists and
>>>> A and B is blocked by C:
>>>>       |A|->|B|->|C|
>>>>   where "->" is the waitingFor pointer.
>>>> 2. Transaction C commits (or more likely aborts). This unlocks the
>>>> "syncIsActive" lock which B is waiting on. Note that C's transaction
>>>> object still exists since B has a use count on it.
>>>> 3. A "new" transaction D calls Transaction::waitForTransaction(A) and
>>>> the wait for graph becomes:
>>>>     |D|->|A|->|B|->|C|
>>>> 4. Transaction D initializes the "dead lock detector" by walking
>>>> through this list. At some point the "trans" pointer points on "C" -
>>>> at this time the thread is interrupted (or more likely runs in
>>>> parallel with thread B on a multi-core machine).
>>>> 5. Thread B is waken-up (signalled by thread C in step 2 above) and
>>>> continue to run: it release the use count it has on C' transaction
>>>> object. I think this can be the last use count and this leads to C's
>>>> transaction object being deleted.
>>>> 6. The thread running transaction D continues with the "dead lock
>>>> detector". The "trans" pointer now point to a deleted (and
>>>> overwritten) transaction object and this leads to it (eventually)
>>>> accessing an illegal address (and crashes).
>>>> Note that this is just a theory for how this might happen (and again:
>>>> I do not rule out that I have introduced something in my prototype
>>>> implementation of the new TransactionState object). Still if anyone
>>>> has a better theory for what happens or suggestions for a solutions
>>>> let me know.
>>>> Note also the following comment in the header of the
>>>> Transaction::waitForTransaction():
>>>> // Note:
>>>> // Deadlock check could use locking, because  there are potentially 
>>>> concurrent
>>>> // threads checking and modifying the waitFor list.
>>>> // Instead, it implements a fancy lock-free algorithm  that works 
>>>> reliably only
>>>> // with full memory barriers. Thus "volatile"-specifier and 
>>>> // are used  when traversing and modifying waitFor list. Maybe it 
>>>> is better to
>>>> // use inline assembly or intrinsics to generate memory barrier 
>>>> instead of
>>>> // volatile.
>>>> I do not quite understand the purpose of the following use of
>>>> "volatile" in the above code:
>>>>  volatile Transaction *trans;
>>>> Hopefully after a some hours of sleep I might have a good idea for
>>>> what happens and how to fix this.
>>>> Olav

Deadlock detector accesses deleted Transaction?Olav Sandstaa22 Mar
  • Re: Deadlock detector accesses deleted Transaction?Kevin Lewis23 Mar
    • Re: Deadlock detector accesses deleted Transaction?Olav Sandstaa23 Mar
      • Re: Deadlock detector accesses deleted Transaction?Kevin Lewis24 Mar
        • Re: Deadlock detector accesses deleted Transaction?Olav Sandstaa24 Mar
  • RE: Deadlock detector accesses deleted Transaction?Vladislav Vaintroub23 Mar
    • RE: Deadlock detector accesses deleted Transaction?Vladislav Vaintroub23 Mar