From: Olav Sandstaa Date: July 3 2009 2:14pm Subject: SRLOverflowPages log record and flushing of serial log List-Archive: http://lists.mysql.com/falcon/782 Message-Id: <4A4E124D.1000708@sun.com> MIME-Version: 1.0 Content-Type: text/plain; CHARSET=US-ASCII; format=flowed Content-Transfer-Encoding: 7BIT Hi, Here is a short status and some questions about the recovery bug I work on (Bug #45301 Recovery crash due to fetched page marked free in PIP http://bugs.mysql.com/bug.php?id=45301). The recovery crash occurs when the following happens: 1. Recovery is processing an UpdateRecords log record (record id 1602). It finds the record in the Record Locator Page which points to DataPage 461. 2. DataPage 461 looks fine (ie. no evidence of curruption). In this page the record's slot says that the record is stored in an overflowPage (with page number 5909). So the first thing the code tires to do is to delete this overflow page (and then the problems starts) 3. The code tries to fetch this page from disk by calling Cache::fetchPage() 4. First indication of something wrong: Cache::fetchPage() checks that this page is marked as used in the PageInventoryPage. This results in the error written: "During recovery, fetched page 5909 tablespace 1 type 9 marked free in PIP" 5. The code "corrects" this by calling PageInventoryPage::markPageInUse() (questionable??) 6. Next indication of something wrong: When IO::readPage() tries to read the page it get an EOF error. Conclusion: The overflow page does not exits in the database file. So the questions are: Why did this happen? (it has only happend once as far as I am aware of...). How to correct this? Some of the things I have found out this far is: * Is page "5909" a reasonable block number? Yes. The database file is 5864 blocks - so the "missing block" is "only" 55 blocks behind the EOF of the database file. I have also done a scan of the database and found that there are also three other DataPages that refers to OverflowPage that are behind EOF (data pages refer to overflow pages 5902, 5909, 5913, 5929). * The serial log (including log going back about three checkpoints) only contains SRLOverflowPages log records where the max overflow page size if 5854 (ie. all are within the existing database file). So the question I am currently wondering about is (and which would be great to get some feedback/suggestion on): why do we not have SRLOverflowPages log records for the four overflow pages that are behind EOF? Are these created before the section of the serial log I have? Or are they created after the log I have? Or have we forgotten to log them? If I read the code correctly, the only place we create overflow pages is in Section::storeTail(), at least this is the only place we log the creation of OverflowPages. This is mostly done by the Gopher (at least for normal data records) but I can not see that the Gopher is flushing the log after creating the SRLOverflowPages log record. So the main question is: Is ti OK that we create overflow pages and SRLOverflowPages log records without flushing the serial log? Is it possible that this is the reason for me not finding SRLOverflowPages log records for overflow page 5909? (and if it is OK, how should be handle this situation? Catch the EOF exception and ignore it? The page is going to be deleted/free anyway..) I am also studying some other issues related to this crash but that can wait until another email - but if anyone have suggestions for what caused this crash to happen feel free to let me know. BTW: I am also a bit concerned about the performance impact if we have to flush on every SRLOverflowPage (so it is best if we do not have to - and it would be great if someone confirmed that this is OK). Olav