From: Olav Sandstaa Date: July 8 2009 12:41pm Subject: Re: SRLOverflowPages log record and flushing of serial log List-Archive: http://lists.mysql.com/falcon/788 Message-Id: <4A549417.8030905@sun.com> MIME-Version: 1.0 Content-Type: text/plain; CHARSET=US-ASCII; format=flowed Content-Transfer-Encoding: 7BIT Ann and everyone else, I have no spent some more time looking at this and think I know more about what happened. Ann W. Harrison wrote: > Olav, >> >> The recovery crash occurs when the following happens: >> >> 1. Recovery is processing an UpdateRecords log record (record id >> 1602). It finds the record in the Record Locator Page which points to >> DataPage 461. >> 2. DataPage 461 looks fine (ie. no evidence of curruption). In this >> page the record's slot says that the record is stored in an >> overflowPage (with page number 5909). So the first thing the code >> tires to do is to delete this overflow page (and then the problems >> starts) >> 3. The code tries to fetch this page from disk by calling >> Cache::fetchPage() >> 4. First indication of something wrong: Cache::fetchPage() checks >> that this page is marked as used in the PageInventoryPage. This >> results in the error written: >> >> "During recovery, fetched page 5909 tablespace 1 type 9 marked >> free in PIP" >> >> 5. The code "corrects" this by calling >> PageInventoryPage::markPageInUse() (questionable??) > > Not if the goal is to make recovery succeed in as many cases as > possible. If due to some miscalculation, a page is left unmarked > in the PIP, but is valid - and it will be validated - I'd rather > recover the database and figure out what's wrong later than leave > a user with an unrecoverable database and no way to go forward. > >> 6. Next indication of something wrong: When IO::readPage() tries to >> read the page it get an EOF error. >> >> Conclusion: The overflow page does not exits in the database file. > > Question: > > Is this the second attempt to restore this database? No, I think it is the first recovery attempt. The behavior/crash I see when re-running recovery is the same as in the original log from test just after the test has "killed" MySQL ro force a recovery. > >> >> So the questions are: Why did this happen? (it has only happend once >> as far as I am aware of...). How to correct this? > > If it is the second attempt, then it may be related to the problem > I saw, which is that overflow pages can be released more than once. >> > ... >> >> So the question I am currently wondering about is (and which would be >> great to get some feedback/suggestion on): why do we not have >> SRLOverflowPages log records for the four overflow pages that are >> behind EOF? Are these created before the section of the serial log I >> have? Or are they created after the log I have? Or have we forgotten >> to log them? > > We don't make entries in the log for overflow pages that we create > during a recovery. So, this is what I think is happening. The first > recovery updated record 1602, releasing it's previous overflow page > and allocating a new one. When the cache filled with recovered pages, > page 461 was written out to free up space, so the reference to the > new overflow page 5909 went to disk, but neither page 5909 nor the > PIP did. Then recovery failed (or was killed). The second recovery > tried to update record 1602, starting by releasing its overflow > page and boom... Thanks for this explanation. Sounds like something that might happen in a secondary recovery attempt. I have an example further down that I think shows what can happen during normal execution. >> >> If I read the code correctly, the only place we create overflow pages >> is in Section::storeTail(), at least this is the only place we log >> the creation of OverflowPages. This is mostly done by the Gopher (at >> least for normal data records) but I can not see that the Gopher is >> flushing the log after creating the SRLOverflowPages log record. > > As I said, I don't think the problem is in normal processing, but in > recovery. The gopher doesn't have to flush the log after creating > an overflow page - that SRL record has to be on disk when the > transaction commits. The record that references it won't be in > the database until after the commit. I think the SRLOverflowPages log record are created by the Gopher after the commit of the transaction (see example further down). > > >> So the main question is: Is ti OK that we create overflow pages and >> SRLOverflowPages log records without flushing the serial log? Is it >> possible that this is the reason for me not finding SRLOverflowPages >> log records for overflow page 5909? (and if it is OK, how should be >> handle this situation? Catch the EOF exception and ignore it? The >> page is going to be deleted/free anyway..) > > No, I think we need to think about the handling of overflow pages > during recovery more carefully. >> >> >> BTW: I am also a bit concerned about the performance impact if we >> have to flush on every SRLOverflowPage (so it is best if we do not >> have to - and it would be great if someone confirmed that this is OK). > > Me too. Lets talk about that this afternoon. The crash in Bug#45301 has only been seen once and I have not been able to reproduce it. But I think I am now able to reproduce a very similar crash situation. The main difference between crash in Bug#45301 and the one I reproduce it that the page status is correct in the PageInventoryPage. But both crashes occurres when doing recovery and trying to delete an overflow page that is after End-Of-File. Here is what I do (slightly simplified): 1. Run a single transaction that updates 1000 records in a table. Each record consist of a key and a 4100 byte varchar field (initial size of the varchar field is 20 bytes) - so this increases the size of every record by about 4K. 2. When this transaction is executed it produces one UpdateRecords log record and one Commit log record. Serial log is flushed. 3. After the transaction has committed the Gopher executes the UpdateRecords log record. For each record that is updated the Gopher will do: a) allocate a new OverflowPage b) write a new SRLOverflowPages log record c) update the original data page Note: no flushing of the serial log take place here and no writing of the database buffer takes place. But the size of the database have grown a lot due to allocating a bunch of new OverflowPages. 4. Checkpoint starts: and get KILLED before it has written out all pages. Note: no flushing has occurred on the SerialLog since the 1000 SRLOverflowPages records got written - these are likely lost. 5. Recovery starts. 6. Stage 2 recovery: no SRLOverflowPages are in the log 7. Stage 3 recovery (REDO): We redo the UpdateRecords log by doing the following: Section::updateRecord() DataPage::updateRecord(): if this DataPage was one of the pages that got written to the disk before MySQL got killed it will find that there is an overflow page for this record, so we try to delete that: DataPage::deleteOverflowPage(): we ask the Cache to fetch the page from disk: Cache::fetchPage(): ask the IO to read the page from disk IO::readPage(): if this page was newly allocated (which is likely) and was not written to the disk before MySQL got killed this now leads to a "fatal" read of a block that does not exits on disk - and recovery fails. This is repeatable. What is the root cause for this to crash? Either: a) We are not able to redo all physical page allocations during stage 2 do not flushing SRLOverflowPages to disk? b) We do not handle or avoid reading pages beyond EOF. The "good thing" about the situation is that no data is lost - the log is complete so it is just a question about how to recover without crashing. Possible solutions: 1. Have the Gopher flush SRLOverflowPages log records to disk? (question: if we do not flush the log, why bother writing the log records?) My concern here is that this will be a "performance issue". 2. Improve how we handle DataPage::deleteOverflowPage() during recovery: by either: 2a: check the file size, if block is beyond EOF do not attempt to read it - assume it is freed. 2b: if beyond EOF is reported - ignore it during recovery - and assume that the block is freed. 3. Avoid having to read in the overflow block: Is there other reasons for reading it than to check if it has a "next" overflow page? If that is the case, we could change it to store the entire list of overflow blocks in the data page (could also improve on reading large records/blobs) - (no I am not voluntering for this right now) Note that alternativ 2: will not handle links to following overflow blocks and that can lead to permanently lost data blocks (but that is a minor issue with the current price per 4K block of disk :-) ). Olav