List: Falcon Storage Engine
From: Olav Sandstaa
Date: October 21 2008 8:43pm
Subject: Re: Review request: Handling of exceptions after serial log is in
state writeError
Ann W. Harrison wrote:
> Hi Olav,
>>
>> 6. IO-thread: uncaught exception when writing the checkpoint log
>>   record. This is now handled in Cache::ioThread(). Checkpoint will
>>   continue to run but the checkpoint log record will not be written.
>>
>> NOTE: Pay particular note of the last of these given that the solution
>> to get out of this situation is to do a successful recovery. This fix
>> might result in checkpoints that write pages to the database without
>> having a complete checkpoint log record between them (or at least I
>> think this might be a possible scenario). Can this give problems for
>> the recovery process? (I do not think so but thought it was good to
>> mention it.... :-) ).
>>
>
> I think it's OK to fail to write the checkpoint log record and
> finish the database writes.  However, after the checkpoint log
> record write fails, nothing else should happen.

As the code is now, I think the IO thread will go to sleep after it 
has failed to write the checkpoint log record. But I expect it is 
possible that a new checkpoint might be started, and although there 
probably will not be much that has been updated since it was last 
active, there might be dirty data pages it can write out (I admit that 
I have not checked the details about this). I do not stop the IO thread 
when this error is detected, and since we have several of these we 
would probably have to stop all the others as well. So if I guess 
correctly about how this works, it is likely that data pages written by 
checkpoint n+1 can be written out even if the checkpoint log record 
from checkpoint n is not written to the log.

But even if this happens, I do not see that this should have any 
negative consequences for recovery. Dirty data pages can probably be 
written out at any time anyway? But let me know if anyone thinks this 
is an issue, and we can solve it by ensuring that we stop all IO threads.
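
To make the scenario concrete, here is a small single-threaded C++ 
sketch. All names (SketchLog, runCheckpoint) are made up for 
illustration and are not the actual Falcon classes. It only assumes 
what is described above: the exception from the checkpoint log record 
write is caught, and the caller keeps running.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the serial log; not Falcon's real class.
struct SketchLog {
    bool writeError = false;             // mirrors the "writeError" state
    std::vector<int> checkpointRecords;  // checkpoint numbers present in the log

    void writeCheckpointRecord(int checkpoint) {
        if (writeError)
            throw std::runtime_error("serial log is in state writeError");
        checkpointRecords.push_back(checkpoint);
    }
};

// Hypothetical stand-in for one checkpoint pass of an IO thread.
// Returns true if the checkpoint log record was written.
bool runCheckpoint(SketchLog& log, int checkpoint, std::vector<int>& pagesOnDisk) {
    // Dirty pages are written first, tagged here with the checkpoint number.
    pagesOnDisk.push_back(checkpoint);
    try {
        log.writeCheckpointRecord(checkpoint);
        return true;
    } catch (const std::runtime_error&) {
        // The exception is caught instead of escaping the thread;
        // the checkpoint log record is simply missing.
        return false;
    }
}
```

If the log fails after checkpoint 1, calling runCheckpoint() for 
checkpoints 2 and 3 still adds their pages to pagesOnDisk while 
checkpointRecords keeps only the entry for checkpoint 1.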

If we want to make changes to how we control the IO threads, maybe we 
can take that as a separate patch?

>
> It's way beyond the scope of this set of changes, but as Vlad
> suggested we should have a high level state bit that says that
> Falcon doesn't work.  Every call through the handler should
> check that bit and return an error.  I'm not sure if that
> state should stop the gophers ... but it should stop subsequent
> changes to data and metadata and checkpoints.  At least I
> think it should.

I think Vlad's idea of having a high level state bit to stop calls from 
the server is a good one. But that would not solve the problem with 
uncaught exceptions that is fixed by this patch, just reduce it. At the 
time we set the serial log to state "writeError" it would be too late to 
stop threads that had already passed "Vlad's border control" and are 
running at full speed towards the serial log exception.
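
A minimal C++ sketch of why the state bit alone is not enough. The 
names (EngineSketch, passBorderControl, finishCall) are hypothetical, 
and the interleaving is played out by hand in a single thread rather 
than with real threads:

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical engine state; not Falcon's real classes.
struct EngineSketch {
    std::atomic<bool> unusable{false};  // the high level "Falcon doesn't work" bit
    bool logWriteError = false;         // serial log in state writeError

    // "Border control": checked once, at the start of a handler call.
    bool passBorderControl() const { return !unusable.load(); }

    // Called when the serial log write fails.
    void markLogFailed() {
        logWriteError = true;
        unusable.store(true);
    }

    void writeToLog() {
        if (logWriteError)
            throw std::runtime_error("serial log is in state writeError");
    }
};

enum class Result { Ok, Rejected, LogError };

// The remainder of a handler call, after the border check was made.
Result finishCall(EngineSketch& engine, bool passedCheck) {
    if (!passedCheck)
        return Result::Rejected;    // stopped by the state bit
    try {
        engine.writeToLog();
        return Result::Ok;
    } catch (const std::runtime_error&) {
        return Result::LogError;    // still has to be caught here
    }
}
```

A call that ran passBorderControl() before markLogFailed() still 
reaches writeToLog() and gets LogError, while a call that checks 
afterwards gets Rejected. So the catch is needed in both designs.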

One potential drawback with Vlad's suggestion is that it would also 
stop read-only queries. With the current code I am able to run read-only 
queries successfully even after the serial log has run into problems - 
but I expect that blocking everything in this situation is not a big 
concern.

About Scavenger and IO threads: in my first prototype implementation of 
this I let these threads die when they hit the serial log exception, 
with the argument that with Falcon in this state there was just no 
point in having them around, and it was a very easy way to avoid 
further problems. 
In this patch I let them continue to live and "do useful work?" :-)

Thanks for looking at this, Ann!

Best regards,
Olav