From: Olav Sandstaa
Date: June 17 2009 9:54pm
Subject: Recovery issue: record locator page is update but record is not in data page
List-Archive: http://lists.mysql.com/falcon/767
Message-Id: <4A396600.8050703@sun.com>

Hi,

Here is a short status on one of the recovery issues I am working on
(see Bug#44744 and Bug#45297). The crash happens when we redo an
SRLUpdateRecords log record:

1. The actual crash occurs when we call DataPage::updateRecord() in
   order to update the record in the data page. Common to both bugs is
   that the record is stored with a line number equal to the block's
   maxLine. This should not happen (i.e. it is illegal, and I have
   already pushed code to detect situations where this happens). So
   when we read the page's index entry at the given line number (equal
   to maxLine) we read "garbage". In both crash situations we interpret
   whatever is in the index entry as the offset to the overflow page,
   end up trying to read a "random block" from disk, and crash. (I have
   also seen situations where this does not lead to a crash, but where
   we are actually able to insert the record and likely end up with an
   inconsistent data page.)

   What causes the code to "believe" that the record already exists in
   the data page and call DataPage::updateRecord()?

2. The record (using Bug#45297 as an example) has record number 2011.
   This is used for finding the section page and, from there, the
   record locator page (page number #1392).

3. In the record locator page there is an entry for this record at
   element/index position 484. It has the following content:
   page=2836, line=10 (and availableSpace=0). This corresponds to the
   data page number and line number used when calling
   DataPage::updateRecord(). So the record locator page basically says
   that the record should already be stored in the data page(?)
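The condition behind steps 1 to 3 can be illustrated with a minimal sketch. The types and the function below are hypothetical, simplified stand-ins written for this illustration; they are not Falcon's actual DataPage or record locator structures. The only assumption taken from the description above is that valid line numbers on a data page lie in the range [0, maxLine), so a locator entry whose line equals maxLine points at a slot the page does not have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the structures discussed above.
// They do not reproduce Falcon's real class layouts.
struct LineIndexEntry {
    uint16_t offset;   // offset of the record data within the page
    uint16_t length;   // length of the record data
};

struct SimpleDataPage {
    int maxLine = 0;                     // number of line slots in use
    std::vector<LineIndexEntry> lines;   // line index, valid for [0, maxLine)
};

struct LocatorEntry {
    int32_t page;   // data page number the locator entry points at
    int16_t line;   // line number within that data page
};

// Returns true only if the locator entry refers to a line slot that the
// data page actually contains. A line equal to maxLine is out of range.
bool lineExists(const SimpleDataPage& page, const LocatorEntry& entry)
{
    // Sanity check on the sketch's own invariant: maxLine never exceeds
    // the number of allocated line index entries.
    assert(page.maxLine >= 0 &&
           static_cast<std::size_t>(page.maxLine) <= page.lines.size());
    return entry.line >= 0 && entry.line < page.maxLine;
}
```

With the values from Bug#45297 (locator entry page=2836, line=10, and a data page whose maxLine is 10), such a check would report that the line does not exist, which is the situation in which reading the index entry yields "garbage".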

So the "inconsistency" can be summarized (if I understand correctly)
as follows: the record locator page claims that there is a record with
record number 2011 and that it is stored in data page 2836 on line 10,
while the data page does not have this record (maxLine is 10). I have
also verified that this is the situation when the record locator page
and data page are first read from disk during recovery, so this
inconsistency exists in the database on disk when MySQL is killed
(which probably is fine *if* we can use the serial log to fix it).

First of all, if anyone thinks this is a wrong conclusion, please let
me know.

I have also tried to re-run the recovery by starting as early as
possible in the serial log. The result is still the same, and there
does not seem to be any earlier log record for this record.

The challenge is to understand how we can get into this situation. So
I have the following questions, maybe mostly for myself to figure out
the answers to, but if anyone has suggestions that would be great.

What I think has happened is:

1. We have inserted the initial version into the database. This has
   created the SRLUpdateRecords log record. The gopher has processed
   this log record, and this has resulted in updates to the record
   locator page and the data page. But only the updated record locator
   page had been flushed to disk (possibly during a checkpoint?) at the
   time the MySQL process was killed.

Does this seem like a likely and correct scenario for this case?

If this is the case, then it is the job of redo processing of this log
record to ensure that the update is done correctly to the data page. I
see the following two possible solutions:

1. We add code to DataPage::updateRecord() that detects that we are
   trying to update a "non-existing" record by checking whether the
   given line number is equal to maxLine. If this is the case, we can
   either:

2a.
extend DataPage::updateRecord() so that it inserts the record into the
page (insert the record data on the page or on an overflow page, update
the new index/line entry, and increment maxLine), or

2b. extend DataPage::updateRecord() so that it returns an error code
    (possibly by returning a negative spaceAvailable value). This can
    then be handled by Section::updateRecord() so that it clears the
    entry in the record locator page (existing call to
    Section::deleteLine()) and does a complete new insert of the record
    by calling Section::storeRecord().

If 2b can re-use the existing code, it would be the easiest (and
safest) alternative, while 2a would be more "correct" since it would
actually complete the storing of the record on the same page where the
record was originally stored.

Does this seem like a correct approach? Any preferences? Please let me
know if there are better ways to handle this. If not, I will go ahead
and try to implement one of the above strategies and see if it solves
the two recovery situations I have.

Regards,
Olav