Hi!
>>>>> "Guilhem" == Guilhem Bichot <guilhem@stripped> writes:
<cut>
>> For the close, the above *should* be safe as the share should not be
>> used by anyone, but somehow we got a hang here.
Guilhem> Thanks for the clear explanation.
Guilhem> It does sound like a bug that maria_close() (which, as it calls
Guilhem> _ma_remove_not_visible_states(), is the last close, so table is not used
Guilhem> by anybody, as you wrote) can happen at the same time as a commit on the
Guilhem> same table, which means that someone else is using the table. I think it
Guilhem> is dangerous to leave this situation unexplored, even though the
Guilhem> deadlock is now gone; this situation is so fishy that it could cause
Guilhem> other bugs.
Agreed, and I plan to look into this.
There are, however, asserts in maria_close() that should catch cases
like this.
Still, removing the mutex should be a good idea.
The one case that could be the reason for the problem is that a
checkpoint may keep the share in use even when the table should be
closed.
If we could release the share lock in maria_close() BEFORE we call
_ma_remove_not_visible_states() things would be much better!
This is the one thing that I plan to explore further.
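
A rough sketch of that idea, nothing more. All names here (demo_share,
demo_close, demo_remove_states) are made up for illustration and are
not the real Maria structures or functions. It assumes the cleanup
needs only its own lock, so the close can decide under the share lock
whether it is the last one, release that lock, and only then do the
cleanup:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a table share; NOT the real MARIA_SHARE. */
struct demo_share
{
  pthread_mutex_t share_lock;   /* protects open_count */
  pthread_mutex_t state_lock;   /* protects states_left */
  int open_count;
  int states_left;
};

void demo_share_init(struct demo_share *s, int opens, int states)
{
  pthread_mutex_init(&s->share_lock, NULL);
  pthread_mutex_init(&s->state_lock, NULL);
  s->open_count= opens;
  s->states_left= states;
}

/* Hypothetical cleanup step; takes only state_lock. */
void demo_remove_states(struct demo_share *s)
{
  pthread_mutex_lock(&s->state_lock);
  s->states_left= 0;
  pthread_mutex_unlock(&s->state_lock);
}

/* Returns true if this was the last close of the share. */
bool demo_close(struct demo_share *s)
{
  bool last;
  pthread_mutex_lock(&s->share_lock);
  assert(s->open_count > 0);              /* catch a close too many */
  last= (--s->open_count == 0);
  pthread_mutex_unlock(&s->share_lock);   /* released BEFORE cleanup */
  if (last)
    demo_remove_states(s);                /* share_lock is not held here */
  return last;
}
```

With this shape the two locks are never held at the same time in the
close path, so the close cannot take part in a lock-order cycle.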
Guilhem> If I could recommend something... it would be to, in a local tree,
Guilhem> revert the deadlock fix, see the deadlock happen again (I never saw it),
Guilhem> and then inspect what's going on with this close vs commit. If time is
Guilhem> lacking, maybe not debug it right away, but at least file a bug report
Guilhem> with all details, so that we can later fix the close vs commit problem,
Guilhem> or see if this is a bug in the server itself.
Guilhem> Could you please do this? That would be helpful.
If there is still a bug, I am quite sure it will show up as a crash
later instead, when something tries to access a share that has been
removed.
The main reason for removing the lock was to rule out even a
theoretical deadlock from two threads taking locks in the wrong order.
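
For reference, when both locks have to stay, one common way to avoid
the wrong-order problem is to always take them in one fixed order.
This is not what the fix does (the fix removes the lock); it is only a
sketch of the alternative, with made-up helper names (demo_order,
demo_lock_pair, demo_unlock_pair) and the mutex addresses used as the
order:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Put two mutex pointers into a canonical order (lowest address first). */
void demo_order(pthread_mutex_t **first, pthread_mutex_t **second)
{
  if ((uintptr_t) *first > (uintptr_t) *second)
  {
    pthread_mutex_t *tmp= *first;
    *first= *second;
    *second= tmp;
  }
}

/* Lock both mutexes in canonical order, whatever order the caller gave. */
void demo_lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
  assert(a != NULL && b != NULL);
  demo_order(&a, &b);
  pthread_mutex_lock(a);
  if (b != a)
    pthread_mutex_lock(b);
}

/* Unlock in the reverse of the canonical order. */
void demo_unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
  assert(a != NULL && b != NULL);
  demo_order(&a, &b);
  if (b != a)
    pthread_mutex_unlock(b);
  pthread_mutex_unlock(a);
}
```

A thread calling demo_lock_pair(&x, &y) and another calling
demo_lock_pair(&y, &x) end up taking the mutexes in the same order, so
neither can hold one while waiting for the other.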
Regards,
Monty