Here is a short status and some questions about the recovery bug I work
on (Bug #45301 Recovery crash due to fetched page marked free in PIP
The recovery crash occurs when the following happens:
1. Recovery is processing an UpdateRecords log record (record id 1602).
It finds the record in the Record Locator Page which points to DataPage 461.
2. DataPage 461 looks fine (ie. no evidence of curruption). In this
page the record's slot says that the record is stored in an overflowPage
(with page number 5909). So the first thing the code tires to do is to
delete this overflow page (and then the problems starts)
3. The code tries to fetch this page from disk by calling
4. First indication of something wrong: Cache::fetchPage() checks that
this page is marked as used in the PageInventoryPage. This results in
the error written:
"During recovery, fetched page 5909 tablespace 1 type 9 marked
free in PIP"
5. The code "corrects" this by calling
6. Next indication of something wrong: When IO::readPage() tries to
read the page it get an EOF error.
Conclusion: The overflow page does not exits in the database file.
So the questions are: Why did this happen? (it has only happend once as
far as I am aware of...). How to correct this?
Some of the things I have found out this far is:
* Is page "5909" a reasonable block number? Yes. The database file is
5864 blocks - so the "missing block" is "only" 55 blocks behind the EOF
of the database file.
I have also done a scan of the database and found that there are also
three other DataPages that refers to OverflowPage that are behind EOF
(data pages refer to overflow pages 5902, 5909, 5913, 5929).
* The serial log (including log going back about three checkpoints)
only contains SRLOverflowPages log records where the max overflow page
size if 5854 (ie. all are within the existing database file).
So the question I am currently wondering about is (and which would be
great to get some feedback/suggestion on): why do we not have
SRLOverflowPages log records for the four overflow pages that are behind
EOF? Are these created before the section of the serial log I have? Or
are they created after the log I have? Or have we forgotten to log them?
If I read the code correctly, the only place we create overflow pages is
in Section::storeTail(), at least this is the only place we log the
creation of OverflowPages. This is mostly done by the Gopher (at least
for normal data records) but I can not see that the Gopher is flushing
the log after creating the SRLOverflowPages log record. So the main
question is: Is ti OK that we create overflow pages and SRLOverflowPages
log records without flushing the serial log? Is it possible that this is
the reason for me not finding SRLOverflowPages log records for overflow
page 5909? (and if it is OK, how should be handle this situation? Catch
the EOF exception and ignore it? The page is going to be deleted/free
I am also studying some other issues related to this crash but that can
wait until another email - but if anyone have suggestions for what
caused this crash to happen feel free to let me know.
BTW: I am also a bit concerned about the performance impact if we have
to flush on every SRLOverflowPage (so it is best if we do not have to -
and it would be great if someone confirmed that this is OK).