List: Falcon Storage Engine
From: Olav Sandstaa  Date: July 8 2009 12:41pm
Subject: Re: SRLOverflowPages log record and flushing of serial log
Ann and everyone else,

I have now spent some more time looking at this and think I know more 
about what happened.

Ann W. Harrison wrote:
> Olav,
>>
>> The recovery crash occurs when the following happens:
>>
>> 1. Recovery is processing an UpdateRecords log record (record id 
>> 1602). It finds the record in the Record Locator Page which points to 
>> DataPage 461.
>> 2. DataPage 461 looks fine (i.e. no evidence of corruption). In this 
>> page the record's slot says that the record is stored in an 
>> overflowPage (with page number 5909). So the first thing the code 
>> tries to do is to delete this overflow page (and then the problems 
>> start)
>> 3. The code tries to fetch this page from disk by calling 
>> Cache::fetchPage()
>> 4. First indication of something wrong: Cache::fetchPage() checks 
>> whether this page is marked as used in the PageInventoryPage. This 
>> results in the following error being written:
>>
>>       "During recovery, fetched page 5909 tablespace 1 type 9 marked 
>> free in PIP"
>>
>> 5. The code "corrects" this by calling 
>> PageInventoryPage::markPageInUse() (questionable??)
>
> Not if the goal is to make recovery succeed in as many cases as
> possible.  If due to some miscalculation, a page is left unmarked
> in the PIP, but is valid - and it will be validated - I'd rather
> recover the database and figure out what's wrong later than leave
> a user with an unrecoverable database and no way to go forward.
>
>> 6. Next indication of something wrong: When IO::readPage() tries to 
>> read the page, it gets an EOF error.
>>
>> Conclusion: The overflow page does not exist in the database file.
>
> Question:
>
>   Is this the second attempt to restore this database?

No, I think it is the first recovery attempt. The behavior/crash I see 
when re-running recovery is the same as in the original log from the test, 
just after the test has "killed" MySQL to force a recovery.

>
>>
>> So the questions are: Why did this happen? (it has only happened once 
>> as far as I am aware of...). How to correct this?
>
> If it is the second attempt, then it may be related to the problem
> I saw, which is that overflow pages can be released more than once.
>>
> ...
>>
>> So the question I am currently wondering about is (and which would be 
>> great to get some feedback/suggestion on): why do we not have 
>> SRLOverflowPages log records for the four overflow pages that are 
>> behind EOF? Are these created before the section of the serial log I 
>> have? Or are they created after the log I have? Or have we forgotten 
>> to log them?
>
> We don't make entries in the log for overflow pages that we create
> during a recovery.  So, this is what I think is happening.  The first
> recovery updated record 1602, releasing its previous overflow page
> and allocating a new one.  When the cache filled with recovered pages,
> page 461 was written out to free up space, so the reference to the
> new overflow page 5909 went to disk, but neither page 5909 nor the
> PIP did.  Then recovery failed (or was killed).  The second recovery
> tried to update record 1602, starting by releasing its overflow
> page and boom...

Thanks for this explanation. Sounds like something that might happen in 
a secondary recovery attempt. I have an example further down that I 
think shows what can happen during normal execution.

>>
>> If I read the code correctly, the only place we create overflow pages 
>> is in Section::storeTail(), at least this is the only place we log 
>> the creation of OverflowPages. This is mostly done by the Gopher (at 
>> least for normal data records) but I can not see that the Gopher is 
>> flushing the log after creating the SRLOverflowPages log record. 
>
> As I said, I don't think the problem is in normal processing, but in
> recovery.  The gopher doesn't have to flush the log after creating
> an overflow page - that SRL record has to be on disk when the
> transaction commits.  The record that references it won't be in
> the database until after the commit.

I think the SRLOverflowPages log records are created by the Gopher after 
the commit of the transaction (see example further down).

>
>
>> So the main question is: Is it OK that we create overflow pages and 
>> SRLOverflowPages log records without flushing the serial log? Is it 
>> possible that this is the reason for me not finding SRLOverflowPages 
>> log records for overflow page 5909? (and if it is OK, how should we 
>> handle this situation? Catch the EOF exception and ignore it? The 
>> page is going to be deleted/free anyway..)
>
> No, I think we need to think about the handling of overflow pages
> during recovery more carefully.
>>
>>
>> BTW: I am also a bit concerned about the performance impact if we 
>> have to flush on every SRLOverflowPage (so it is best if we do not 
>> have to - and it would be great if someone confirmed that this is OK).
>
> Me too.  Let's talk about that this afternoon.

The crash in Bug#45301 has only been seen once and I have not been able 
to reproduce it. But I think I am now able to reproduce a very similar 
crash situation. The main difference between the crash in Bug#45301 and 
the one I reproduce is that the page status is correct in the 
PageInventoryPage. But both crashes occur when doing recovery and 
trying to delete an overflow page that is after End-Of-File.

Here is what I do (slightly simplified):

1. Run a single transaction that updates 1000 records in a table. Each 
record consists of a key and a 4100 byte varchar field (initial size of 
the varchar field is 20 bytes) - so this increases the size of every 
record by about 4K.

2. When this transaction is executed it produces one UpdateRecords log 
record and one Commit log record. The serial log is flushed.

3. After the transaction has committed the Gopher executes the 
UpdateRecords log record. For each record that is updated the Gopher 
will do:

     a) allocate a new OverflowPage
     b) write a new SRLOverflowPages log record
     c) update the original data page

    Note: no flushing of the serial log takes place here and no writing 
of the database buffer takes place. But the size of the database has 
grown a lot due to allocating a bunch of new OverflowPages.

4. Checkpoint starts and gets KILLED before it has written out all 
pages.

    Note: no flushing has occurred on the SerialLog since the 1000 
SRLOverflowPages records got written - these are likely lost.

5. Recovery starts.

6. Stage 2 recovery: no SRLOverflowPages records are in the log.

7. Stage 3 recovery (REDO): We redo the UpdateRecords log record by doing 
the following (a rough sketch of this code path follows below):

       Section::updateRecord()
       DataPage::updateRecord(): if this DataPage was one of the pages 
that got written to disk before MySQL got killed, it will find that 
there is an overflow page for this record, so we try to delete that:
       DataPage::deleteOverflowPage(): we ask the Cache to fetch the 
page from disk:
       Cache::fetchPage(): asks the IO to read the page from disk:
       IO::readPage(): if this page was newly allocated (which is 
likely) and was not written to disk before MySQL got killed, this now 
leads to a "fatal" read of a block that does not exist on disk - and 
recovery fails.

This is repeatable.
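
To make the failing read path in step 7 a bit more concrete, here is a small 
stand-alone model of it as I read the code. This is not the real Falcon code: 
the class and method names echo the ones above (IO::readPage(), 
Cache::fetchPage(), DataPage::deleteOverflowPage()), but the signatures, the 
fixed 4K page size and the in-memory "disk" are simplified assumptions purely 
for illustration:

#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <vector>

static const int PAGE_SIZE = 4096;

struct IO {
    // Pages that actually made it to the tablespace file before the kill.
    std::vector<std::vector<uint8_t> > pagesOnDisk;

    int64_t fileLength() const { return (int64_t) pagesOnDisk.size() * PAGE_SIZE; }

    // Reading a page whose offset is at or beyond EOF fails, like the
    // "fatal" read described in step 7.
    std::vector<uint8_t> readPage(int pageNumber) {
        if ((int64_t) pageNumber * PAGE_SIZE >= fileLength())
            throw std::runtime_error("read beyond end of file");
        return pagesOnDisk[pageNumber];
    }
};

struct Cache {
    IO io;
    std::vector<uint8_t> fetchPage(int pageNumber) { return io.readPage(pageNumber); }
};

// The data page was checkpointed before the kill, so it already points at
// the new overflow page; the overflow page itself was never written out.
struct DataPage {
    int overflowPageNumber;

    void deleteOverflowPage(Cache &cache) {
        // Recovery reads the overflow page (to follow its "next" link)
        // before freeing it - and this read is what blows up.
        cache.fetchPage(overflowPageNumber);
    }
};

int main() {
    Cache cache;
    cache.io.pagesOnDisk.resize(5000);   // the file ends well before page 5909

    DataPage page461;
    page461.overflowPageNumber = 5909;   // the reference went to disk, the page did not

    try {
        page461.deleteOverflowPage(cache);
    } catch (const std::exception &e) {
        std::printf("recovery fails: %s\n", e.what());
    }
    return 0;
}

The point is just that the reference to the overflow page reached the data 
file (via the checkpointed data page) while the page itself never did, so the 
very first read of it during recovery has to fail.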

What is the root cause of this crash? Either:

  a) we are not able to redo all physical page allocations during stage 
2 because we do not flush the SRLOverflowPages records to disk, or
  b) we do not handle or avoid reading pages beyond EOF.

The "good thing" about the situation is that no data is lost - the log 
is complete so it is just a question about how to recover without crashing.

Possible solutions:

  1. Have the Gopher flush SRLOverflowPages log records to disk? 
(question: if we do not flush the log, why bother writing the log 
records?) My concern here is that this will be a "performance issue".
  2. Improve how we handle DataPage::deleteOverflowPage() during 
recovery (see the sketch after this list), by either:
       2a: checking the file size; if the block is beyond EOF, do not 
attempt to read it and assume it is freed.
       2b: if a read beyond EOF is reported, ignoring it during recovery 
and assuming that the block is freed.
  3. Avoid having to read in the overflow block: is there any other 
reason for reading it than to check whether it has a "next" overflow 
page? If not, we could change the code to store the entire list of 
overflow blocks in the data page (this could also improve reading of 
large records/blobs) - (no, I am not volunteering for this right now).

Note that alternative 2 will not handle links to following overflow 
blocks, and that can lead to permanently lost data blocks (but that is a 
minor issue at the current price per 4K block of disk :-) ).
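
To make alternative 2 a bit more concrete, here is a rough stand-alone sketch 
of how 2a and 2b could be combined. The inRecovery flag, the fileLength field 
and the fetchOverflowPageForDelete() helper are my own stand-ins for 
illustration; the real change would have to live in Cache::fetchPage() / 
DataPage::deleteOverflowPage() and would look different in detail:

#include <cstdint>
#include <cstdio>
#include <stdexcept>

static const int PAGE_SIZE = 4096;

struct IO {
    int64_t fileLength;   // current size of the tablespace file in bytes

    void readPage(int pageNumber, char *buffer) {
        if ((int64_t) pageNumber * PAGE_SIZE >= fileLength)
            throw std::runtime_error("read beyond end of file");
        // ... a real implementation would read the page into buffer here ...
        (void) buffer;
    }
};

struct Cache {
    IO io;
    bool inRecovery;

    // Returns false when the overflow page should simply be treated as freed.
    bool fetchOverflowPageForDelete(int pageNumber, char *buffer) {
        // Alternative 2a: during recovery, do not even try to read a block
        // beyond EOF - it was allocated after the last serial log flush and
        // never written, so it is effectively free already.
        if (inRecovery && (int64_t) pageNumber * PAGE_SIZE >= io.fileLength)
            return false;

        try {
            io.readPage(pageNumber, buffer);
            return true;
        } catch (const std::runtime_error &) {
            // Alternative 2b: if the read still reports EOF during recovery,
            // ignore it and assume the block is free; outside recovery this
            // remains a fatal error.
            if (inRecovery)
                return false;
            throw;
        }
    }
};

int main() {
    Cache cache;
    cache.io.fileLength = 5000LL * PAGE_SIZE;   // file ends before page 5909
    cache.inRecovery = true;

    char buffer[PAGE_SIZE];
    bool found = cache.fetchOverflowPageForDelete(5909, buffer);
    std::printf("overflow page 5909 %s\n",
                found ? "read from disk" : "treated as already freed");
    return 0;
}

Outside recovery a read past EOF stays fatal, so normal processing is 
unchanged; only the recovery path becomes tolerant of overflow pages that 
were allocated after the last flush of the serial log.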

Olav


