Kevin Lewis wrote:
> Olav, If a TransactionState::waitingFor is pointing to another
> TransactionState object, it should set a useCount. Would this solve
> the problem?
The existing code already has a useCount for this. Before a Transaction
sets the waitingFor to point to a Transaction (old
code)/TransactionState (new code) object it increases the use count on
the Transaction/TransactionState object. This protect the waitingFor
pointer to always point to a valid Transaction/TransactionState object.
But this seems not to handle all situations. In the example I sent we
use transaction B's waitingFor pointer to locate the next transaction
(C) in the graph. Unfortunately, there is nothing preventing the thread
"owning" transaction B from continuing and remove the waitingFor pointer
and decrease the use count and then delete the
Transaction/TransactionState object. So even if the thread D got a valid
pointer to Transaction/TransactionState object C there is nothing
preventing the Transaction/TransactionState object from getting deleted
while it accesses this pointer.
So the current use of use count to protect this from happening seems not
to be enough.
I think one way to protect this would be to introduce a "use count" on
the waitingFor pointer itself. Then each thread traversing the graph
would set this new use count as it walked down the graph and removed it
as it walked back up. This would result in two interlocked operations
per node in the graph as we traverses it (seems a bit much to me...).
There might be other, simpler solutions for this. I will try to think of
better ways.
The scenario I presented in the last email might also not be what
actually happens.....
Olav
>
> Olav Sandstaa wrote:
>> Hi,
>>
>> Bug#41665 reports a crash situation in
>> Transaction::waitForTransaction() caused by Falcon accessing a deleted
>> transaction object. I expected this to be caused by the RecordVersion
>> object having a pointer to a deleted Transaction object. And by
>> introducing the TransactionState object and using this instead of the
>> Transaction object it first seemed like it fixed the problem. The
>> attached repro case for bug#41655 which failed almost on every run
>> before the fix ran without error for the ~15 test runs (or more than
>> an hour) - but then it crashed again.
>>
>> The new crash occurs in the "dead lock detector" in
>> Transaction::waitForTransaction(). In the for loop the "trans" pointer
>> points to an illegal address:
>>
>> volatile Transaction *trans;
>> for (trans = transaction->waitingFor; trans; trans =
>> trans->waitingFor)
>> if (trans == this)
>> {
>> *deadlock = true;
>> break;
>> }
>>
>> Note that as part of the rewrite and introduction of the new
>> TransactionState object I have rewritten this code to:
>>
>> volatile TransactionState *trans;
>> for (trans = ts->waitingFor; trans; trans = trans->waitingFor)
>> if (trans == this->transactionState)
>> {
>> *deadlock = true;
>> break;
>> }
>>
>> I hope this code should behave the same as the original code but I can
>> not exclude that I have introduced something in the code that causes
>> this crash.
>>
>> My initial theory for how this crash can happen is that the
>> Transaction object's waitingFor pointer can access a deleted
>> Transaction object. I have not been able to get this in a debuger/core
>> file but currently my theory based on reading code is that the
>> following scenario can happen:
>>
>> 1. When we start, the following three Transactions A, B, C exists and
>> A and B is blocked by C:
>>
>> |A|->|B|->|C|
>>
>> where "->" is the waitingFor pointer.
>>
>> 2. Transaction C commits (or more likely aborts). This unlocks the
>> "syncIsActive" lock which B is waiting on. Note that C's transaction
>> object still exists since B has a use count on it.
>>
>> 3. A "new" transaction D calls Transaction::waitForTransaction(A) and
>> the wait for graph becomes:
>>
>> |D|->|A|->|B|->|C|
>>
>> 4. Transaction D initializes the "dead lock detector" by walking
>> through this list. At some point the "trans" pointer points on "C" -
>> at this time the thread is interrupted (or more likely runs in
>> parallel with thread B on a multi-core machine).
>>
>> 5. Thread B is waken-up (signalled by thread C in step 2 above) and
>> continue to run: it release the use count it has on C' transaction
>> object. I think this can be the last use count and this leads to C's
>> transaction object being deleted.
>>
>> 6. The thread running transaction D continues with the "dead lock
>> detector". The "trans" pointer now point to a deleted (and
>> overwritten) transaction object and this leads to it (eventually)
>> accessing an illegal address (and crashes).
>>
>>
>> Note that this is just a theory for how this might happen (and again:
>> I do not rule out that I have introduced something in my prototype
>> implementation of the new TransactionState object). Still if anyone
>> has a better theory for what happens or suggestions for a solutions
>> let me know.
>>
>> Note also the following comment in the header of the
>> Transaction::waitForTransaction():
>>
>> // Note:
>> // Deadlock check could use locking, because there are potentially
>> concurrent
>> // threads checking and modifying the waitFor list.
>> // Instead, it implements a fancy lock-free algorithm that works
>> reliably only
>> // with full memory barriers. Thus "volatile"-specifier and
>> COMPARE_EXCHANGE
>> // are used when traversing and modifying waitFor list. Maybe it is
>> better to
>> // use inline assembly or intrinsics to generate memory barrier
>> instead of
>> // volatile.
>>
>> I do not quite understand the purpose of the following use of
>> "volatile" in the above code:
>>
>> volatile Transaction *trans;
>>
>> Hopefully after a some hours of sleep I might have a good idea for
>> what happens and how to fix this.
>>
>> Olav
>>
>>
>>