Olav,
Any thread can traverse a waitingFor list of TransactionState pointers
without getting ref counts because they always have a shared lock on
syncActiveTransactions while this search is carried out.
Suppose there are three transactions, A,B&C with A->B->C. Suppose C
rolls back, releases all useCounts on the TransactionState except the 1
for the waitingFor pointer from B. Then it unlocks its exclusive lock
on syncIsActive and B is signaled. C also gives up the exclusive lock
on syncActiveTransactions. A reader thread gets a shared lock on it
and collects a pointer to C. A nanosecond later, the thread for
transaction B, which had given up its shared lock on
syncActiveTransactions wakes up and removes the reference to C. Notice
that it does not get a shared lock on syncActiveTransactions before
removing the link and refCount. This is the bug!
We need to re-get the shared lock on syncActiveTransactions before
removing the link in both locations where the link is removed;
storage\falcon\Transaction.cpp(1036):
if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL))
storage\falcon\Transaction.cpp(1042):
if (!COMPARE_EXCHANGE_POINTER(&waitingFor, transaction, NULL))
Olav, Please make this change in your next push.
Olav Sandstaa wrote:
> Kevin Lewis wrote:
>> Olav, If a TransactionState::waitingFor is pointing to another
>> TransactionState object, it should set a useCount. Would this solve
>> the problem?
>
> The existing code already has a useCount for this. Before a Transaction
> sets the waitingFor to point to a Transaction (old
> code)/TransactionState (new code) object it increases the use count on
> the Transaction/TransactionState object. This protect the waitingFor
> pointer to always point to a valid Transaction/TransactionState object.
>
> But this seems not to handle all situations. In the example I sent we
> use transaction B's waitingFor pointer to locate the next transaction
> (C) in the graph. Unfortunately, there is nothing preventing the thread
> "owning" transaction B from continuing and remove the waitingFor pointer
> and decrease the use count and then delete the
> Transaction/TransactionState object. So even if the thread D got a valid
> pointer to Transaction/TransactionState object C there is nothing
> preventing the Transaction/TransactionState object from getting deleted
> while it accesses this pointer.
>
> So the current use of use count to protect this from happening seems not
> to be enough.
>
> I think one way to protect this would be to introduce a "use count" on
> the waitingFor pointer itself. Then each thread traversing the graph
> would set this new use count as it walked down the graph and removed it
> as it walked back up. This would result in two interlocked operations
> per node in the graph as we traverses it (seems a bit much to me...).
>
> There might be other, simpler solutions for this. I will try to think of
> better ways.
>
> The scenario I presented in the last email might also not be what
> actually happens.....
>
> Olav
>
>
>>
>> Olav Sandstaa wrote:
>>> Hi,
>>>
>>> Bug#41665 reports a crash situation in
>>> Transaction::waitForTransaction() caused by Falcon accessing a deleted
>>> transaction object. I expected this to be caused by the RecordVersion
>>> object having a pointer to a deleted Transaction object. And by
>>> introducing the TransactionState object and using this instead of the
>>> Transaction object it first seemed like it fixed the problem. The
>>> attached repro case for bug#41655 which failed almost on every run
>>> before the fix ran without error for the ~15 test runs (or more than
>>> an hour) - but then it crashed again.
>>>
>>> The new crash occurs in the "dead lock detector" in
>>> Transaction::waitForTransaction(). In the for loop the "trans" pointer
>>> points to an illegal address:
>>>
>>> volatile Transaction *trans;
>>> for (trans = transaction->waitingFor; trans; trans =
>>> trans->waitingFor)
>>> if (trans == this)
>>> {
>>> *deadlock = true;
>>> break;
>>> }
>>>
>>> Note that as part of the rewrite and introduction of the new
>>> TransactionState object I have rewritten this code to:
>>>
>>> volatile TransactionState *trans;
>>> for (trans = ts->waitingFor; trans; trans = trans->waitingFor)
>>> if (trans == this->transactionState)
>>> {
>>> *deadlock = true;
>>> break;
>>> }
>>>
>>> I hope this code should behave the same as the original code but I can
>>> not exclude that I have introduced something in the code that causes
>>> this crash.
>>>
>>> My initial theory for how this crash can happen is that the
>>> Transaction object's waitingFor pointer can access a deleted
>>> Transaction object. I have not been able to get this in a debuger/core
>>> file but currently my theory based on reading code is that the
>>> following scenario can happen:
>>>
>>> 1. When we start, the following three Transactions A, B, C exists and
>>> A and B is blocked by C:
>>>
>>> |A|->|B|->|C|
>>>
>>> where "->" is the waitingFor pointer.
>>>
>>> 2. Transaction C commits (or more likely aborts). This unlocks the
>>> "syncIsActive" lock which B is waiting on. Note that C's transaction
>>> object still exists since B has a use count on it.
>>>
>>> 3. A "new" transaction D calls Transaction::waitForTransaction(A) and
>>> the wait for graph becomes:
>>>
>>> |D|->|A|->|B|->|C|
>>>
>>> 4. Transaction D initializes the "dead lock detector" by walking
>>> through this list. At some point the "trans" pointer points on "C" -
>>> at this time the thread is interrupted (or more likely runs in
>>> parallel with thread B on a multi-core machine).
>>>
>>> 5. Thread B is waken-up (signalled by thread C in step 2 above) and
>>> continue to run: it release the use count it has on C' transaction
>>> object. I think this can be the last use count and this leads to C's
>>> transaction object being deleted.
>>>
>>> 6. The thread running transaction D continues with the "dead lock
>>> detector". The "trans" pointer now point to a deleted (and
>>> overwritten) transaction object and this leads to it (eventually)
>>> accessing an illegal address (and crashes).
>>>
>>>
>>> Note that this is just a theory for how this might happen (and again:
>>> I do not rule out that I have introduced something in my prototype
>>> implementation of the new TransactionState object). Still if anyone
>>> has a better theory for what happens or suggestions for a solutions
>>> let me know.
>>>
>>> Note also the following comment in the header of the
>>> Transaction::waitForTransaction():
>>>
>>> // Note:
>>> // Deadlock check could use locking, because there are potentially
>>> concurrent
>>> // threads checking and modifying the waitFor list.
>>> // Instead, it implements a fancy lock-free algorithm that works
>>> reliably only
>>> // with full memory barriers. Thus "volatile"-specifier and
>>> COMPARE_EXCHANGE
>>> // are used when traversing and modifying waitFor list. Maybe it is
>>> better to
>>> // use inline assembly or intrinsics to generate memory barrier
>>> instead of
>>> // volatile.
>>>
>>> I do not quite understand the purpose of the following use of
>>> "volatile" in the above code:
>>>
>>> volatile Transaction *trans;
>>>
>>> Hopefully after a some hours of sleep I might have a good idea for
>>> what happens and how to fix this.
>>>
>>> Olav
>>>
>>>
>>>
>
>