From: Mats Kindahl
Date: February 1 2013 12:15pm
Subject: Re: reducing fsyncs during handlerton->prepare and handlerton->commit in 5.6
List-Archive: http://lists.mysql.com/internals/38714
Message-Id: <510BB1E3.8040509@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

On 02/01/2013 11:26 AM, Zardosht Kasheff wrote:
> inlining
>
> On Fri, Feb 1, 2013 at 2:37 AM, Mats Kindahl wrote:
>> On 01/31/2013 10:28 PM, Zardosht Kasheff wrote:
>>> Thank you for the detailed reply.
>>>
>>> I want to confirm that I understand the contract:
>>> - when flush logs is called, the engine must ensure that any
>>> transaction committed up until that point is recovered as committed
>>> after a crash. No such transaction can come up in the prepared state.
>> Yes, that is correct.
>>
>>> The reason I ask this is that as of now, on flush logs, we flush all
>>> of our data to disk, and this seems like overkill. Instead, if we just
>>> fsync our recovery log, that will satisfy the above contract.
>>>
>>> Can we do this?
>> It looks like it should work. I don't know the details of your
>> implementation (and details can make a big difference), but if, by
>> using the recovery log, you can ensure that no transactions committed
>> before the flush_logs() show up as "prepared" on recovery, you should
>> be safe.
>> (Note: on recovery, the storage engine is asked for all prepared
>> transactions, and this set should not include any transactions that are
>> in any binary logs except the last one.)
>>
>> Just a question: how do you make it possible to flush all data to disk
>> if, e.g., the user wants to take a backup? You don't have to use FLUSH
>> LOGS for this (which would call flush_logs()), but it has to be possible.
> Our engine, TokuDB, has the concept of a checkpoint that flushes all
> data to disk. From what has been said in this thread, I think doing a
> checkpoint on flush logs is overkill, especially when rotating a
> binary log.
Yes, I agree.

> I think you are asking what mechanism will we have for the
> user to induce a checkpoint. I don't have an answer for that yet, we
> need to investigate. But it sounds like a user experience question.
Yes, that's what it is.

/Matz

>
>> /Matz
>>
>>> Thanks
>>> -Zardosht
>>>
>>> On Thu, Jan 31, 2013 at 4:12 PM, Mats Kindahl wrote:
>>>> On 01/31/2013 05:09 PM, Zardosht Kasheff wrote:
>>>>> Thanks a lot Kristian and Mats.
>>>>>
>>>>> I am learning that I know a lot less than I thought I knew. To help my
>>>>> understanding, I will focus in this thread on MySQL 5.6. I will focus
>>>>> on MariaDB in another thread.
>>>>>
>>>>> I want to make sure my understanding is correct. Is what I write below accurate?
>>>>>
>>>>> In MySQL 5.5, we have the following APIs:
>>>>> - handlerton->prepare
>>>>> - handlerton->commit
>>>>> - handlerton->flush_logs.
>>>>> We need to fsync on prepare and commit. But what does flush_logs need
>>>>> to do? According to comments, flush_logs runs a checkpoint on the
>>>>> system, which is pretty expensive. Is this accurate?
>>>> The flush_logs() function is called before rotating the binary log and
>>>> when doing an explicit FLUSH LOGS. It gives the storage engine a chance
>>>> to flush any in-memory buffers, so yes, it does a checkpoint and it is
>>>> expensive. However, it is just once for each binary log.
>>>>
>>>> This means that each time a binary log is rotated, the system grinds to
>>>> a halt, which is not very nice. We have been discussing ways to avoid this.
>>>>
>>>>> In MySQL 5.6, we have the same APIs, but the contract has changed. We
>>>>> still always fsync on prepare, that remains the same. For commit, if
>>>>> HA_IGNORE_DURABILITY is set, we should not fsync, otherwise we may
>>>>> have poor performance.
>>>> Not poorer than without group commits, but yes, if you sync with every
>>>> commit, you will have poor performance.
>>>>
>>>>> If HA_IGNORE_DURABILITY is not set, then we
>>>>> must fsync on commit.
>>>> Correct. The HA_IGNORE_DURABILITY flag says that the server "handles
>>>> the durability".
>>>>
>>>>> I do not know what flush_logs needs to do.
>>>>>
>>>>> My last question is the following:
>>>>> - what should flush_logs do?
>>>>> - what is the purpose/contract of flush_logs? Under what scenarios is
>>>>> it meant to be called?
>>>> The flush_logs() function should create a checkpoint, just as you said
>>>> above. It is called on binary log rotate and on explicit FLUSH LOGS
>>>> (actually also on ALTER TABLE, under some circumstances: see
>>>> sql_table.cc).
>>>>
>>>>> The comments in MySQL 5.5 and 5.6 imply that a checkpoint is run on
>>>>> the system. This is what we do as well. This sounds expensive, because
>>>>> IIUC, a checkpoint writes all dirty nodes to disk. But looking at the
>>>>> implementation, it seems that flush_logs only ensures that the redo
>>>>> log is synced up to the proper LSN, and if not, syncs it. Essentially,
>>>>> it just fsyncs the log.
>>>> I assume that you have been looking at InnoDB. The reason a checkpoint
>>>> is done (by flushing the log) is that the recovery procedure only looks
>>>> in the last binary log, which means that you might otherwise lose
>>>> committed transactions on a crash, which is not OK.
>>>>
>>>> If you have prepared and committed some transaction but do not flush the
>>>> log to disk before a rotate, it might be that on recovery it is listed
>>>> as prepared (because the commit record was not written to disk) but the
>>>> recovery procedure will not find it in the binary log (because it was
>>>> not in the last one, it was in the preceding one) and hence it will be
>>>> rolled back.
>>>>
>>>> Poof! Transaction gone.
>>>>
>>>> /Matz
>>>>
>>>>> Is this accurate? If so I think we need to modify our engine to not
>>>>> checkpoint and just fsync our recovery log.
>>>>>
>>>>> Thanks
>>>>> -Zardosht
>>>>>
>>>>> On Thu, Jan 31, 2013 at 4:41 AM, Kristian Nielsen wrote:
>>>>>> Mats Kindahl writes:
>>>>>>
>>>>>>> In MySQL 5.6, there are no new APIs that you *have* to comply with.
>>>>>> But MySQL 5.6 serialises calls to the commit handlerton method, the next one
>>>>>> cannot start before the previous one completes. So if you did fsync() with
>>>>>> group commit before in commit, your group commit will no longer work in 5.6
>>>>>> and you will get a serious performance regression if you do not honour the
>>>>>> HA_IGNORE_DURABILITY flag. I consider that breaking the storage engine API, as
>>>>>> you see Mats and I disagree a bit on that point :-)
>>>>>>
>>>>>> Anyway, it should be easy to do for you. MySQL 5.6 sets HA_IGNORE_DURABILITY,
>>>>>> this has similar semantics to when MariaDB 5.3+ calls the commit_ordered()
>>>>>> method. So you can probably use the same code for both with a small amount of
>>>>>> #ifdef.
>>>>>>
>>>>>> Note that for MySQL 5.6 you need to implement also the flush_logs() method to
>>>>>> fsync() all prior commits durably to disk (if you did not already implement
>>>>>> it). What happens is basically that in MySQL crash recovery looks only at the
>>>>>> last binlog file written. So it calls flush_logs() before creating a new
>>>>>> binlog, and storage engine must ensure that all commits become durable at that
>>>>>> point. Otherwise commits may be lost if a crash happens just after binlog
>>>>>> rotation.
>>>>>>
>>>>>> This is actually the _only_ reason that fsync() was ever needed in commit, to
>>>>>> ensure that it is done when binlog is rotated. So it is rather silly that we
>>>>>> have done it for _every_ commit for so long. Anyway, it will be fixed now.
>>>>>>
>>>>>> - Kristian.
--
Senior Principal Software Developer
Oracle, MySQL Department
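
The contract discussed in this thread can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not use the real handlerton signatures, and every name in it (ToyEngine, rolled_back_on_recovery, committed_tx_lost, the ignore_durability parameter) is hypothetical and invented for illustration. It models a recovery log with an in-memory buffer and a durable part, a commit that skips the sync when the server handles durability, and a flush_logs that only syncs the recovery log instead of running a full checkpoint.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Toy model only; NOT the MySQL handlerton API. All names are hypothetical.
enum class TxState { Prepared, Committed };

struct LogRecord {
    uint64_t xid;
    TxState state;
};

class ToyEngine {
public:
    // prepare: always synced, as described in the thread.
    void prepare(uint64_t xid) {
        buffer_.push_back({xid, TxState::Prepared});
        sync_log();
    }

    // commit: skip the sync when the server handles durability
    // (the role HA_IGNORE_DURABILITY plays in the thread).
    void commit(uint64_t xid, bool ignore_durability) {
        buffer_.push_back({xid, TxState::Committed});
        if (!ignore_durability) sync_log();
    }

    // flush_logs: sync the recovery log only, no full checkpoint.
    void flush_logs() { sync_log(); }

    // After a crash only durable records survive. Returns the transactions
    // whose last durable record is still "prepared".
    std::set<uint64_t> prepared_after_crash() const {
        std::map<uint64_t, TxState> last;
        for (const LogRecord& r : durable_) last[r.xid] = r.state;
        std::set<uint64_t> out;
        for (const auto& entry : last)
            if (entry.second == TxState::Prepared) out.insert(entry.first);
        return out;
    }

private:
    // Stand-in for fsync(): move buffered records to the durable part.
    void sync_log() {
        durable_.insert(durable_.end(), buffer_.begin(), buffer_.end());
        buffer_.clear();
    }

    std::vector<LogRecord> buffer_;   // written but not yet synced
    std::vector<LogRecord> durable_;  // survives a crash
};

// Recovery rule from the thread: a prepared transaction is kept only if it
// is found in the LAST binary log; otherwise it is rolled back.
std::set<uint64_t> rolled_back_on_recovery(const std::set<uint64_t>& prepared,
                                           const std::set<uint64_t>& last_binlog) {
    std::set<uint64_t> out;
    for (uint64_t xid : prepared)
        if (last_binlog.count(xid) == 0) out.insert(xid);
    return out;
}

// Scenario: commit transaction 1, rotate the binary log, then crash.
// Returns true if the committed transaction would be rolled back.
bool committed_tx_lost(bool call_flush_logs) {
    ToyEngine engine;
    std::set<uint64_t> last_binlog;

    engine.prepare(1);
    last_binlog.insert(1);                      // server logs tx 1
    engine.commit(1, /*ignore_durability=*/true);

    if (call_flush_logs) engine.flush_logs();   // done before rotation
    last_binlog.clear();                        // rotation: new, empty binlog

    // Crash happens here.
    std::set<uint64_t> prepared = engine.prepared_after_crash();
    return rolled_back_on_recovery(prepared, last_binlog).count(1) == 1;
}
```

In this model, committed_tx_lost(false) reproduces the "Poof! Transaction gone" case: the commit record never reached disk, the transaction is reported as prepared, and it is not in the last binary log. Calling committed_tx_lost(true) shows that syncing only the recovery log before rotation is enough to keep the transaction out of the prepared set, which is the point agreed on above. Whether a real engine can rely on this depends on its own recovery details.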