Ann W. Harrison wrote:
> Hi Olav,
>> 6. IO-thread: uncaught exception when writing the check point log
>> record. This is now handled in Cache::ioThread(). Checkpoint will
>> continue to run but the checkpoint log record will not be written.
>> NOTE: Pay particular note of the last of these given that the solution
>> to get out of this situation is to do a successful recovery. This fix
>> might result in checkpoints that writes pages to the database without
>> having a complete checkpoint log record between them (or at least I
>> think this might be a possible scenario). Can this give problems for
>> the recovery process? (I do not think so but thought it was good to
>> mention it.... :-) ).
> I think it's OK to fail to write the checkpoint log record and
> finish the database writes. However, after the checkpoint log
> record write fails, nothing else should happen.
As the code is now, I think the IO thread will go to sleep after it
has failed to write the checkpoint log record. But I expect it is
possible that a new checkpoint might be started, and although there
probably will not be much that has been updated since it was last active,
there might be dirty data pages it can write out (I admit that I have not
checked the details of this). I do not stop the IO thread when this
error is detected, and since we have several of these we would probably
have to stop all the others as well. So if I guess correctly about how
this works, it is likely that data pages written by checkpoint n+1 can be
written out even if the checkpoint log record from checkpoint n is not
written to the log.
But even if this happens, I do not see that it should have any negative
consequences for recovery. Dirty data pages can probably be written
out at any time anyway? But let me know if anyone thinks this is an
issue, and we can solve it by ensuring that we stop all IO threads.
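To make the scenario concrete, here is a minimal sketch of the behaviour described above: the exception from the checkpoint log record write is caught, the IO thread survives, and a later checkpoint still writes pages. All names (SerialLogStub, CacheStub, runCheckpoint, LogWriteError) are hypothetical stand-ins for illustration and are not taken from the Falcon source.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these names come from the Falcon code.
struct LogWriteError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct SerialLogStub {
    bool writeError = false;             // set once the log can no longer be written
    std::vector<int> checkpointRecords;  // checkpoints that got a log record

    void writeCheckpointRecord(int checkpoint) {
        if (writeError)
            throw LogWriteError("serial log write failed");
        checkpointRecords.push_back(checkpoint);
    }
};

struct CacheStub {
    std::vector<std::string> events;

    // One checkpoint pass as seen by an IO thread:
    // write the dirty pages, then try to write the checkpoint log record.
    void runCheckpoint(int checkpoint, SerialLogStub& log) {
        const std::string id = std::to_string(checkpoint);
        events.push_back("pages written for checkpoint " + id);
        try {
            log.writeCheckpointRecord(checkpoint);
            events.push_back("log record written for checkpoint " + id);
        } catch (const LogWriteError&) {
            // Caught here instead of escaping the thread. The thread stays
            // alive and can take part in the next checkpoint.
            events.push_back("log record missing for checkpoint " + id);
        }
    }
};
```

Running checkpoints n and n+1 through this after setting writeError shows pages from n+1 being written although no log record exists for n, which is the situation I am asking about.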
If we want to do some changes in how we control the IO-threads, maybe we
can take that as a separate patch?
> It's way beyond the scope of this set of changes, but as Vlad
> suggested we should have a high level state bit that says that
> Falcon doesn't work. Every call through the handler should
> check that bit and return an error. I'm not sure if that
> state should stop the gophers ... but it should stop subsequent
> changes to data and metadata and checkpoints. At least I
> think it should.
I think Vlad's idea of having a high level state bit to stop calls from
the server is a good idea. But that would not solve the problem with
uncaught exceptions that is fixed by this patch, just reduce it. At the
time we set the serial log to state "writeError" it would be too late to
stop threads that had already passed "Vlad's border control" and are
running at full speed towards the serial log exception.
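A minimal sketch of why the state bit alone only reduces the problem, assuming the bit is checked once at the entry of each handler call. The names (EngineState, passesBorderControl, writeToLog, handlerCall) are hypothetical and do not reflect Falcon's actual handler interface.

```cpp
#include <atomic>

// Hypothetical names for illustration; not Falcon's actual interface.
enum class CallResult { Ok, EngineUnavailable, LogWriteError };

struct EngineState {
    std::atomic<bool> broken{false};         // high level "Falcon doesn't work" bit
    std::atomic<bool> logWriteError{false};  // serial log is in state "writeError"
};

// The border control: checked once, at the entry of every handler call.
inline bool passesBorderControl(const EngineState& state) {
    return !state.broken.load();
}

// Work done after the check. It can still run into the serial log failure,
// so this path needs its own error handling regardless of the state bit.
inline CallResult writeToLog(EngineState& state) {
    if (state.logWriteError.load()) {
        state.broken.store(true);  // later calls are stopped at the border
        return CallResult::LogWriteError;
    }
    return CallResult::Ok;
}

inline CallResult handlerCall(EngineState& state) {
    if (!passesBorderControl(state))
        return CallResult::EngineUnavailable;
    return writeToLog(state);
}
```

A thread that evaluated passesBorderControl() before the log failed still reaches writeToLog() and hits the error there; only calls that arrive afterwards are turned away at the entry check.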
One potential drawback with Vlad's suggestion is that this would also
stop read-only queries. With the current code I am able to run read-only
queries successfully even after the serial log has run into problems -
but I expect that blocking everything in this situation is not a big
problem.
About the Scavenger and IO threads: in my first prototype implementation of
this I let these threads die when they hit the serial log exception, with
the argument that with Falcon in this state there was just no point in
having them around - and it was a very easy way to avoid more problems.
In this patch I let them continue to live and "do useful work?" :-)
Thanks for looking at this, Ann!