List: Falcon Storage Engine
From: Olav Sandstaa
Date: June 17 2009 9:54pm
Subject: Recovery issue: record locator page is update but record is not in data page

Hi,

Here is a short status on one of the recovery issues I am working on 
(see Bug#44744 and Bug#45297).

The crash happens when we redo an SRLUpdateRecords log record:

1. The actual crash occurs when we call DataPage::updateRecord() in 
order to update the record in the data page. Common to both bugs is 
that the record is stored with a line number equal to the block's 
maxLine. This should not happen (i.e. it is illegal, and I have already 
pushed code to detect situations where this happens). So when we read 
the page's index entry at the given line number (equal to maxLine) we 
read "garbage". In both crash situations we interpret whatever is in 
the index entry as the offset to the overflow page, end up trying to 
read a "random block" from disk, and crash. (I have also seen 
situations where this does not lead to a crash, but where we actually 
manage to insert the record and so likely end up with an inconsistent 
data page.)
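
To make the illegal case concrete, here is a minimal sketch of the 
bounds check. It is not Falcon's real code: SketchDataPage, LineIndex 
and isValidLine() are hypothetical names, and I assume line numbers are 
zero-based so that the valid lines are 0 .. maxLine-1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a data page (not Falcon's DataPage):
// a line index of (offset, length) entries of which only the first maxLine
// entries have been initialized.
struct LineIndex {
    std::int32_t offset;
    std::int32_t length;
};

struct SketchDataPage {
    std::int32_t maxLine = 0;           // number of index entries in use
    std::vector<LineIndex> lineIndex;   // backing storage, may be larger

    // Assumption: lines are zero-based, so line == maxLine is one past the
    // last valid entry and whatever is stored there is "garbage".
    bool isValidLine(std::int32_t line) const {
        return line >= 0 && line < maxLine;
    }
};
```

With maxLine == 10, isValidLine(9) holds and isValidLine(10) does not, 
which is exactly the line number seen in both bugs.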

What causes the code to "believe" that the record already exists in the 
data page and to call DataPage::updateRecord()?

2. The record (using Bug#45297 as an example) has record number 2011. 
This is used for finding the section page and further the record locator 
page (page number #1392).

3. In the record locator page there is an entry for this record at 
element/index position 484. It has the following content: page=2836, 
line=10 (and availableSpace=0). This corresponds to the data page 
number and line number used when calling DataPage::updateRecord(). So 
the record locator page basically says that the record should already be 
stored in the data page(?)

So the "inconsistency" can be summarized (if I understand correctly) as 
follows: the record locator page claims that there is a record with 
record number 2011 and that it is stored in data page 2836 on line 10, 
while the data page does not have this record (maxLine is 10). I have 
also verified that this is the situation when the record locator page 
and data page are first read from disk during recovery, so this 
inconsistency exists in the database on disk when MySQL is killed (which 
probably is fine *if* we can use the serial log to fix it).

First of all, if anyone thinks this is a wrong conclusion please let me 
know. I have also tried to re-run the recovery starting as early as 
possible in the serial log. The result is still the same, and there 
does not seem to be any earlier log record for the record.

The challenge is to understand how we can get into this situation. So I 
have the following questions, maybe mostly for myself to figure out the 
answers to, but if anyone has suggestions that would be great. What I 
think has happened is:

1. We have inserted the initial version into the database. This has 
created the SRLUpdateRecords log record. The gopher has processed this 
log record, and this has resulted in the update to the record locator 
page and the data page. But only the updated record locator page has 
been flushed to disk (possibly during a checkpoint?) at the time the 
MySQL process is killed.

Does this seem like a likely and correct scenario for this case?
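
The suspected sequence can be played through with a toy model. This is 
only a sketch under my own assumptions, not Falcon code: ToyState, 
applyInsert(), crashAfterLocatorFlush() and locatorIsDangling() are 
hypothetical names, I assume a new record is placed on the line equal to 
the old maxLine, and I use page number 0 to mean "no locator entry".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy state: one locator entry plus the maxLine of the data
// page it points to. Not Falcon's real page layout.
struct ToyState {
    std::int32_t locatorPage;   // data page number in the locator entry (0 = none, assumed)
    std::int32_t locatorLine;   // line number in the locator entry
    std::int32_t dataMaxLine;   // maxLine of the data page
};

// In-memory effect of the insert (assumption: the record goes on the line
// equal to the old maxLine, after which maxLine is incremented).
ToyState applyInsert(ToyState s, std::int32_t dataPage) {
    s.locatorPage = dataPage;
    s.locatorLine = s.dataMaxLine;
    s.dataMaxLine += 1;
    return s;
}

// Kill the process after only the locator page was flushed: the disk holds
// the new locator entry together with the old data page.
ToyState crashAfterLocatorFlush(const ToyState& disk, const ToyState& memory) {
    return ToyState{memory.locatorPage, memory.locatorLine, disk.dataMaxLine};
}

// True when the locator entry points at a line the data page does not have.
bool locatorIsDangling(const ToyState& s) {
    return s.locatorPage != 0 && s.locatorLine >= s.dataMaxLine;
}
```

Starting from a data page with maxLine 10 and inserting into page 2836, 
the in-memory state is consistent (line 10, maxLine 11), while the state 
left on disk after the partial flush has line 10 and maxLine 10, which 
matches what I see at the start of recovery.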

If this is the case, then it is the job of redo processing of this log 
record to ensure that the update is done correctly to the data page. I 
see the following two possible solutions:

1. We add code to DataPage::updateRecord() that detects that we are 
trying to update a "non-existing" record by checking whether the given 
line number is equal to maxLine. If this is the case, we can either:

2a. extend DataPage::updateRecord() so that it inserts the record into 
the page (insert the record data on the page or on an overflow page, 
update the new index/line entry, increment maxLine)

or

2b. extend DataPage::updateRecord() so that it returns an error code 
(possibly returning a negative spaceAvailable value). This can then be 
handled by Section::updateRecord() so that it clears the entry in the 
record locator page (existing call to Section::deleteLine()) and does a 
complete new insert of the record by calling Section::storeRecord().

If 2b can re-use the existing code this would be the easiest alternative 
(and safest) while 2a would be more "correct" since it would actually 
complete the storing of the record on the same page as the record 
originally was stored at.
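
To compare the two alternatives, here is a rough sketch of the control 
flow. It is a toy model under my own assumptions, not the real Falcon 
classes: ToyDataPage, ToyLocatorEntry, updateOrAppendRecord() and 
redoUpdate2b() are hypothetical, the page is just a vector of strings 
whose size plays the role of maxLine, and the re-insert in 2b reuses the 
same page although the real Section::storeRecord() may pick another one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy page: lines.size() plays the role of maxLine.
struct ToyDataPage {
    std::vector<std::string> lines;

    std::int32_t maxLine() const {
        return static_cast<std::int32_t>(lines.size());
    }

    // 2b: refuse to touch a line that does not exist and report it with a
    // negative return value instead of reading a garbage index entry.
    int updateRecord(std::int32_t line, const std::string& data) {
        if (line < 0 || line >= maxLine())
            return -1;
        lines[static_cast<std::size_t>(line)] = data;
        return 0;
    }

    // 2a: treat line == maxLine as "the record is missing, append it here".
    int updateOrAppendRecord(std::int32_t line, const std::string& data) {
        if (line == maxLine()) {
            lines.push_back(data);   // maxLine grows by one
            return 0;
        }
        return updateRecord(line, data);
    }
};

struct ToyLocatorEntry {
    std::int32_t page;
    std::int32_t line;
};

// Caller side of 2b: on failure, clear the stale locator entry and store the
// record as if it were new. Returns the line the record ended up on.
std::int32_t redoUpdate2b(ToyDataPage& page, std::int32_t pageNumber,
                          ToyLocatorEntry& entry, const std::string& data) {
    if (page.updateRecord(entry.line, data) < 0) {
        entry = ToyLocatorEntry{0, 0};   // stands in for Section::deleteLine()
        page.lines.push_back(data);      // stands in for Section::storeRecord()
        entry.page = pageNumber;
        entry.line = page.maxLine() - 1;
    }
    return entry.line;
}
```

In this toy both variants leave the record on line 10 of the page. The 
difference only shows in the real code, where 2a guarantees the original 
page and line while 2b goes through the normal insert path.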

Does this seem like a correct approach? Any preferences? Please let me 
know if there are better ways for how to handle this. If not, then I 
will go ahead and try to implement one of the above strategies and see 
if it solves the two recovery situations I have.

Regards,
Olav


