> The recovery crash occurs when the following happens:
> 1. Recovery is processing an UpdateRecords log record (record id 1602).
> It finds the record in the Record Locator Page which points to DataPage
> 2. DataPage 461 looks fine (i.e. no evidence of corruption). In this page
> the record's slot says that the record is stored in an overflow page
> (with page number 5909). So the first thing the code tries to do is to
> delete this overflow page (and then the problems start)
> 3. The code tries to fetch this page from disk by calling
> 4. First indication of something wrong: Cache::fetchPage() checks that
> this page is marked as used in the PageInventoryPage. This results in
> the error written:
> "During recovery, fetched page 5909 tablespace 1 type 9 marked
> free in PIP"
> 5. The code "corrects" this by calling
> PageInventoryPage::markPageInUse() (questionable??)
Not if the goal is to make recovery succeed in as many cases as
possible. If due to some miscalculation, a page is left unmarked
in the PIP, but is valid - and it will be validated - I'd rather
recover the database and figure out what's wrong later than leave
a user with an unrecoverable database and no way to go forward.
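
To make that policy concrete, here is a minimal sketch. ToyPip and
repairPipDuringRecovery() are hypothetical names used only for this
illustration; they are not the real PageInventoryPage interface, and
the later validation of the page is not modelled.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for the PageInventoryPage; not the real class.
struct ToyPip {
    std::set<int> inUse;                // page numbers marked as used
    std::vector<std::string> warnings;  // what recovery complained about
};

// Hypothetical helper: during recovery, a page that is fetched but marked
// free is reported and then marked in use, so recovery can continue.
// Returns true if the PIP had to be repaired.
bool repairPipDuringRecovery(ToyPip& pip, int pageNumber) {
    if (pip.inUse.count(pageNumber) > 0)
        return false;  // already consistent, nothing to do
    pip.warnings.push_back("page " + std::to_string(pageNumber) +
                           " marked free in PIP");
    pip.inUse.insert(pageNumber);
    return true;
}
```

The warning is kept so that the miscalculation can still be tracked
down after the database has been recovered.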
> 6. Next indication of something wrong: When IO::readPage() tries to read
> the page it gets an EOF error.
> Conclusion: The overflow page does not exist in the database file.
Is this the second attempt to restore this database?
> So the questions are: Why did this happen? (it has only happened once as
> far as I am aware of...). How to correct this?
If it is the second attempt, then it may be related to the problem
I saw, which is that overflow pages can be released more than once.
> So the question I am currently wondering about is (and which would be
> great to get some feedback/suggestion on): why do we not have
> SRLOverflowPages log records for the four overflow pages that are behind
> EOF? Are these created before the section of the serial log I have? Or
> are they created after the log I have? Or have we forgotten to log them?
We don't make entries in the log for overflow pages that we create
during a recovery. So, this is what I think is happening. The first
recovery updated record 1602, releasing its previous overflow page
and allocating a new one. When the cache filled with recovered pages,
page 461 was written out to free up space, so the reference to the
new overflow page 5909 went to disk, but neither page 5909 nor the
PIP did. Then recovery failed (or was killed). The second recovery
tried to update record 1602, starting by releasing its overflow
page and boom...
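
The sequence I am guessing at can be written down as a toy model. This
is only a sketch of the hypothesis, not engine code: Disk,
firstRecoveryThenCrash() and diagnoseOverflowFetch() are made-up names,
and the "disk" is just a few in-memory sets.

```cpp
#include <cassert>
#include <map>
#include <set>

// Toy picture of what is physically on disk; all names are hypothetical.
struct Disk {
    std::set<int> pages;            // page numbers that exist in the file
    std::set<int> pipInUse;         // pages the on-disk PIP marks as used
    std::map<int, int> overflowOf;  // data page -> overflow page it references
};

// First recovery: the record update allocates a new overflow page in the
// cache. Cache pressure writes out only the data page, then recovery dies.
void firstRecoveryThenCrash(Disk& disk, int dataPage, int newOverflow) {
    disk.overflowOf[dataPage] = newOverflow;  // the reference reaches disk
    // newOverflow itself and the updated PIP were still only in the cache,
    // so disk.pages and disk.pipInUse are left unchanged.
}

struct FetchDiagnosis {
    bool markedFreeInPip;  // the "marked free in PIP" complaint
    bool beyondEof;        // the EOF error when reading the page
};

// Second recovery: before releasing the overflow page it has to fetch it.
FetchDiagnosis diagnoseOverflowFetch(const Disk& disk, int dataPage) {
    int page = disk.overflowOf.at(dataPage);
    FetchDiagnosis d;
    d.markedFreeInPip = disk.pipInUse.count(page) == 0;
    d.beyondEof = disk.pages.count(page) == 0;
    return d;
}
```

With data page 461 on disk and 5909 as the newly allocated overflow
page, the second pass sees both symptoms from the report: the page is
marked free in the PIP and the read runs past EOF.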
> If I read the code correctly, the only place we create overflow pages is
> in Section::storeTail(), at least this is the only place we log the
> creation of OverflowPages. This is mostly done by the Gopher (at least
> for normal data records) but I can not see that the Gopher is flushing
> the log after creating the SRLOverflowPages log record.
As I said, I don't think the problem is in normal processing, but in
recovery. The gopher doesn't have to flush the log after creating
an overflow page - that SRL record has to be on disk when the
transaction commits. The record that references it won't be in
the database until after the commit.
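
The ordering argument can be sketched with a toy log. ToySerialLog and
ToyTransaction are hypothetical and the string records are only
labels; the point is that overflow page records can sit in the log
buffer and ride along with the single flush done at commit.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy serial log: records are buffered until flush() makes them durable.
struct ToySerialLog {
    std::vector<std::string> buffered;
    std::vector<std::string> durable;
    int flushCount = 0;

    void append(const std::string& record) { buffered.push_back(record); }

    void flush() {
        if (buffered.empty())
            return;
        durable.insert(durable.end(), buffered.begin(), buffered.end());
        buffered.clear();
        ++flushCount;
    }
};

// Toy transaction: creating an overflow page only appends a log record.
struct ToyTransaction {
    ToySerialLog& log;
    bool committed = false;

    explicit ToyTransaction(ToySerialLog& l) : log(l) {}

    void createOverflowPage(int pageNumber) {
        // No flush here: nothing in the database references the page yet.
        log.append("overflow page " + std::to_string(pageNumber));
    }

    void commit() {
        log.append("commit");
        log.flush();  // one flush covers every record appended so far
        committed = true;
    }

    // The record that references the overflow page only reaches the
    // database after the commit.
    bool referenceMayReachDatabase() const { return committed; }
};
```

In this model a transaction that creates several overflow pages still
pays for only one flush, which is why flushing per overflow page should
not be needed in normal processing.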
> So the main
> question is: Is it OK that we create overflow pages and SRLOverflowPages
> log records without flushing the serial log? Is it possible that this is
> the reason for me not finding SRLOverflowPages log records for overflow
> page 5909? (and if it is OK, how should we handle this situation? Catch
> the EOF exception and ignore it? The page is going to be deleted/free
No, I think we need to think about the handling of overflow pages
during recovery more carefully.
> BTW: I am also a bit concerned about the performance impact if we have
> to flush on every SRLOverflowPage (so it is best if we do not have to -
> and it would be great if someone confirmed that this is OK).
Me too. Let's talk about that this afternoon.