From: Olav Sandstaa
Date: October 21 2008 7:11pm
Subject: Review request: Handling of exceptions after serial log is in state writeError
List-Archive: http://lists.mysql.com/falcon/72
Message-Id: <48FE2954.8060003@sun.com>

Hi,

I have committed the second patch for fixing Bug #39912 "Falcon can crash after hitting problems with the serial log". This patch adds code for handling uncaught exceptions that occur after the state of the serial log has been set to "writeError". The patch is available here:

http://lists.mysql.com/commits/56745

The following six situations, where MySQL previously would crash due to uncaught exceptions, are now handled:

1. Rollback to savepoint: handled in StorageConnection::rollbackVerb() and in StorageInterface::rollback()

2. Rollback of transaction: handled in StorageConnection::rollback() and in StorageInterface::rollback()

3. Commit: handled in StorageConnection::commit() and in StorageInterface::commit()

4. Commit of implicit transactions: handled in StorageConnection::commit(), StorageConnection::endImpliciteTransaction() and StorageInterface::external_lock()

All of these will now return HA_ERR_LOGGING_IMPOSSIBLE to the server after the serial log becomes "un-writable".

5. Scavenger: uncaught exception when committing updates to the cardinalities. This is now handled in Database::updateCardinalities(). After this situation has occurred, cardinalities will no longer be updated.

6. IO-thread: uncaught exception when writing the checkpoint log record. This is now handled in Cache::ioThread(). Checkpoints will continue to run, but the checkpoint log record will not be written.

NOTE: Pay particular attention to the last of these, given that the way out of this situation is to do a successful recovery.
The fix for case 6 might result in checkpoints that write pages to the database without a complete checkpoint log record between them (or at least I think this might be a possible scenario). Can this cause problems for the recovery process? (I do not think so, but thought it was worth mentioning. :-) )

Call stacks for all the uncaught exception scenarios are available in the bug report.

Concern 1: Wlad thinks that HA_ERR_LOGGING_IMPOSSIBLE is the wrong error code to return. His main objection was that this error code is only used by InnoDB in relation to replication. If we come up with a better error code, I can either update the patch before it gets pushed or fix it in a separate patch later. (Anyway, HA_ERR_LOGGING_IMPOSSIBLE is better than DuplicateKeyError, which we returned earlier, or than crashing the process.)

Concern 2: There are likely more cases where it is possible to get uncaught exceptions after the serial log is in writeError state, but I do think I have covered the most frequently occurring ones.

Testing: I have tested that the solution fixes all of the cases above except case 4, which I only saw once and could not easily reproduce. The testing was done by instrumenting the serial log to set its state to writeError after some time.

Please feel free to review and comment on the patch.

Olav