Hi Kevin,
Thanks for providing suggestions for how to best solve this.
Kevin Lewis wrote:
> Olav,
>
> Any thread can traverse a waitingFor list of TransactionState pointers
> without getting ref counts because they always have a shared lock on
> syncActiveTransactions while this search is carried out.
>
> Suppose there are three transactions, A,B&C with A->B->C. Suppose C
> rolls back, releases all useCounts on the TransactionState except the
> 1 for the waitingFor pointer from B. Then it unlocks its exclusive
> lock on syncIsActive and B is signaled. C also gives up the exclusive
> lock on syncActiveTransactions. A reader thread gets a shared lock
> on it and collects a pointer to C. A nanosecond later, the thread
> for transaction B, which had given up its shared lock on
> syncActiveTransactions wakes up and removes the reference to C.
> Notice that it does not get a shared lock on syncActiveTransactions
> before removing the link and refCount. This is the bug!
>
> We need to re-get the shared lock on syncActiveTransactions before
> removing the link in both locations where the link is removed;
I do not quite understand how getting the shared lock on
syncActiveTransactions before removing the waitingFor link will help.
Most likely the thread for transaction B will be granted the shared lock
on syncActiveTransactions "immediately" and will proceed to remove the
waitingFor link from B to C and the last refCount on C. And the C
transaction object will be deleted immediately - leaving the A reader
thread's pointer pointing to a deleted transaction?
It is a bit too late in the evening to think about lock-free data
structures so I will give it another try tomorrow morning.
Olav
>
> storage\falcon\Transaction.cpp(1036):
> if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL))
>
> storage\falcon\Transaction.cpp(1042):
> if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL))
>
> Olav, Please make this change in your next push.
>
> Olav Sandstaa wrote:
>> Kevin Lewis wrote:
>>> Olav, If a TransactionState::waitingFor is pointing to another
>>> TransactionState object, it should set a useCount. Would this solve
>>> the problem?
>>
>> The existing code already has a useCount for this. Before a
>> Transaction sets the waitingFor to point to a Transaction (old
>> code)/TransactionState (new code) object it increases the use count
>> on the Transaction/TransactionState object. This protect the
>> waitingFor pointer to always point to a valid
>> Transaction/TransactionState object.
>>
>> But this seems not to handle all situations. In the example I sent we
>> use transaction B's waitingFor pointer to locate the next transaction
>> (C) in the graph. Unfortunately, there is nothing preventing the
>> thread "owning" transaction B from continuing and remove the
>> waitingFor pointer and decrease the use count and then delete the
>> Transaction/TransactionState object. So even if the thread D got a
>> valid pointer to Transaction/TransactionState object C there is
>> nothing preventing the Transaction/TransactionState object from
>> getting deleted while it accesses this pointer.
>>
>> So the current use of use count to protect this from happening seems
>> not to be enough.
>>
>> I think one way to protect this would be to introduce a "use count"
>> on the waitingFor pointer itself. Then each thread traversing the
>> graph would set this new use count as it walked down the graph and
>> removed it as it walked back up. This would result in two interlocked
>> operations per node in the graph as we traverses it (seems a bit much
>> to me...).
>>
>> There might be other, simpler solutions for this. I will try to think
>> of better ways.
>>
>> The scenario I presented in the last email might also not be what
>> actually happens.....
>>
>> Olav
>>
>>
>>>
>>> Olav Sandstaa wrote:
>>>> Hi,
>>>>
>>>> Bug#41665 reports a crash situation in
>>>> Transaction::waitForTransaction() caused by Falcon accessing a deleted
>>>> transaction object. I expected this to be caused by the RecordVersion
>>>> object having a pointer to a deleted Transaction object. And by
>>>> introducing the TransactionState object and using this instead of the
>>>> Transaction object it first seemed like it fixed the problem. The
>>>> attached repro case for bug#41655 which failed almost on every run
>>>> before the fix ran without error for the ~15 test runs (or more than
>>>> an hour) - but then it crashed again.
>>>>
>>>> The new crash occurs in the "dead lock detector" in
>>>> Transaction::waitForTransaction(). In the for loop the "trans" pointer
>>>> points to an illegal address:
>>>>
>>>> volatile Transaction *trans;
>>>> for (trans = transaction->waitingFor; trans; trans =
>>>> trans->waitingFor)
>>>> if (trans == this)
>>>> {
>>>> *deadlock = true;
>>>> break;
>>>> }
>>>>
>>>> Note that as part of the rewrite and introduction of the new
>>>> TransactionState object I have rewritten this code to:
>>>>
>>>> volatile TransactionState *trans;
>>>> for (trans = ts->waitingFor; trans; trans = trans->waitingFor)
>>>> if (trans == this->transactionState)
>>>> {
>>>> *deadlock = true;
>>>> break;
>>>> }
>>>>
>>>> I hope this code should behave the same as the original code but I can
>>>> not exclude that I have introduced something in the code that causes
>>>> this crash.
>>>>
>>>> My initial theory for how this crash can happen is that the
>>>> Transaction object's waitingFor pointer can access a deleted
>>>> Transaction object. I have not been able to get this in a debuger/core
>>>> file but currently my theory based on reading code is that the
>>>> following scenario can happen:
>>>>
>>>> 1. When we start, the following three Transactions A, B, C exists and
>>>> A and B is blocked by C:
>>>>
>>>> |A|->|B|->|C|
>>>>
>>>> where "->" is the waitingFor pointer.
>>>>
>>>> 2. Transaction C commits (or more likely aborts). This unlocks the
>>>> "syncIsActive" lock which B is waiting on. Note that C's transaction
>>>> object still exists since B has a use count on it.
>>>>
>>>> 3. A "new" transaction D calls Transaction::waitForTransaction(A) and
>>>> the wait for graph becomes:
>>>>
>>>> |D|->|A|->|B|->|C|
>>>>
>>>> 4. Transaction D initializes the "dead lock detector" by walking
>>>> through this list. At some point the "trans" pointer points on "C" -
>>>> at this time the thread is interrupted (or more likely runs in
>>>> parallel with thread B on a multi-core machine).
>>>>
>>>> 5. Thread B is waken-up (signalled by thread C in step 2 above) and
>>>> continue to run: it release the use count it has on C' transaction
>>>> object. I think this can be the last use count and this leads to C's
>>>> transaction object being deleted.
>>>>
>>>> 6. The thread running transaction D continues with the "dead lock
>>>> detector". The "trans" pointer now point to a deleted (and
>>>> overwritten) transaction object and this leads to it (eventually)
>>>> accessing an illegal address (and crashes).
>>>>
>>>>
>>>> Note that this is just a theory for how this might happen (and again:
>>>> I do not rule out that I have introduced something in my prototype
>>>> implementation of the new TransactionState object). Still if anyone
>>>> has a better theory for what happens or suggestions for a solutions
>>>> let me know.
>>>>
>>>> Note also the following comment in the header of the
>>>> Transaction::waitForTransaction():
>>>>
>>>> // Note:
>>>> // Deadlock check could use locking, because there are potentially
>>>> concurrent
>>>> // threads checking and modifying the waitFor list.
>>>> // Instead, it implements a fancy lock-free algorithm that works
>>>> reliably only
>>>> // with full memory barriers. Thus "volatile"-specifier and
>>>> COMPARE_EXCHANGE
>>>> // are used when traversing and modifying waitFor list. Maybe it
>>>> is better to
>>>> // use inline assembly or intrinsics to generate memory barrier
>>>> instead of
>>>> // volatile.
>>>>
>>>> I do not quite understand the purpose of the following use of
>>>> "volatile" in the above code:
>>>>
>>>> volatile Transaction *trans;
>>>>
>>>> Hopefully after a some hours of sleep I might have a good idea for
>>>> what happens and how to fix this.
>>>>
>>>> Olav
>>>>
>>>>
>>>>
>>
>>