From: Olav Sandstaa
Date: October 21 2008 8:43pm
Subject: Re: Review request: Handling of exceptions after serial log is in state writeError
List-Archive: http://lists.mysql.com/falcon/76
Message-Id: <48FE3EF4.4000003@sun.com>
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset=ISO-8859-1
Content-Transfer-Encoding: 7BIT

Ann W. Harrison wrote:
> Hi Olav,
>>
>> 6. IO-thread: uncaught exception when writing the check point log
>>    record. This is now handled in Cache::ioThread(). Checkpoint will
>>    continue to run but the checkpoint log record will not be written.
>>
>> NOTE: Pay particular note of the last of these given that the solution
>> to get out of this situation is to do a successful recovery. This fix
>> might result in checkpoints that writes pages to the database without
>> having a complete checkpoint log record between them (or at least I
>> think this might be a possible scenario). Can this give problems for
>> the recovery process? (I do not think so but thought it was good to
>> mention it.... :-) ).
>>
>
> I think it's OK to fail to write the checkpoint log record and
> finish the database writes. However, after the checkpoint log
> record write fails, nothing else should happen.

As the code is now, I think the IO thread will go to sleep after it has
failed to write the checkpoint log record. But I expect it is possible
that a new checkpoint might be started, and although there probably
will not be much that has been updated since the last one was active,
there might be dirty data pages it can write out (I admit that I have
not checked the details of this). I do not stop the IO thread when this
error is detected, and since we have several of these threads we would
probably have to stop all the others as well. So if I guess correctly
about how this works, it is likely that data pages written by
checkpoint n+1 can be written out even if the checkpoint log record
from checkpoint n was not written to the log.
But even if this happens, I do not see that it should have any negative
consequences for recovery. Dirty data pages can probably be written
out at any time anyway? But let me know if anyone thinks this is an
issue, and we can solve it by ensuring that we stop all IO threads.

If we want to make some changes to how we control the IO threads, maybe
we can take that as a separate patch?

>
> It's way beyond the scope of this set of changes, but as Vlad
> suggested we should have a high level state bit that says that
> Falcon doesn't work. Every call through the handler should
> check that bit and return an error. I'm not sure if that
> state should stop the gophers ... but it should stop subsequent
> changes to data and metadata and checkpoints. At least I
> think it should.

I think Vlad's suggestion of a high-level state bit to stop calls from
the server is a good idea. But it would not solve the problem with
uncaught exceptions that is fixed by this patch, just reduce it. At the
time we set the serial log to state "writeError" it would be too late
to stop threads that had already passed "Vlad's border control" and are
running at full speed towards the serial log exception.

One potential drawback of Vlad's suggestion is that it would also stop
read-only queries. With the current code I am able to run read-only
queries successfully even after the serial log has run into problems,
but I expect that blocking everything in this situation is not a big
concern.

About Scavenger and IO threads: in my first prototype implementation of
this I let these threads die when they hit the serial log exception,
with the argument that with Falcon in this state there was just no
point in having them around, and that it was a very easy way to avoid
more problems. In this patch I let them continue to live and "do useful
work?" :-)

Thanks for looking at this, Ann!

Best regards,
Olav
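
For illustration only, here is a minimal C++ sketch of the state-bit idea discussed above and of why it reduces, but does not remove, the need to catch the serial log exception. All names (`Engine`, `SerialLog`, `admit`, `writeAfterAdmit`, `handlerWrite`, `Result`) are hypothetical and are not taken from the Falcon source; the sketch assumes the serial log throws once it is in state writeError.

```cpp
#include <atomic>
#include <stdexcept>
#include <string>

// Hypothetical names throughout; this is not Falcon's actual code.
enum class Result { Ok, EngineUnavailable, WriteFailed };

class SerialLog {
public:
    // Simulates the log entering state "writeError".
    void setWriteError() { writeError.store(true); }

    // Throws once the log is in the writeError state.
    void append(const std::string& record) {
        (void)record;  // payload is ignored in this sketch
        if (writeError.load())
            throw std::runtime_error("serial log is in state writeError");
        ++recordsWritten;
    }

    int recordsWritten = 0;

private:
    std::atomic<bool> writeError{false};
};

class Engine {
public:
    // "Border control": every handler entry point checks the state bit.
    bool admit() const { return !broken.load(); }

    // The part of a call that runs after the border check. A thread that
    // was admitted before the bit was set can still hit the exception
    // here, so it must still be caught.
    Result writeAfterAdmit(const std::string& record) {
        try {
            log.append(record);
            return Result::Ok;
        } catch (const std::runtime_error&) {
            broken.store(true);  // stop later calls at the border
            return Result::WriteFailed;
        }
    }

    // A complete handler call: check the bit, then do the work.
    Result handlerWrite(const std::string& record) {
        if (!admit())
            return Result::EngineUnavailable;
        return writeAfterAdmit(record);
    }

    SerialLog log;

private:
    std::atomic<bool> broken{false};
};
```

In this sketch, a caller that passes `admit()` just before `setWriteError()` is called still reaches `append()` and gets `WriteFailed` from the catch block, while every later `handlerWrite()` is turned away with `EngineUnavailable`. Note that the same check would also reject read-only calls unless they were given a separate entry point.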