List: Internals
From: Zardosht Kasheff
Date: January 31 2013 4:09pm
Subject: Re: reducing fsyncs during handlerton->prepare and handlerton->commit

Thanks a lot Kristian and Mats. I am learning that I know a lot less than I thought I knew.

To help my understanding, I will focus in this thread on MySQL 5.6. I will focus on MariaDB in another thread.

I want to make sure my understanding is correct. Is what I write below accurate?

In MySQL 5.5, we have the following APIs:
- handlerton->prepare
- handlerton->commit
- handlerton->flush_logs

We need to fsync on prepare and commit. But what does flush_logs need to do? According to comments, flush_logs runs a checkpoint on the system, which is pretty expensive. Is this accurate?

In MySQL 5.6, we have the same APIs, but the contract has changed. We still always fsync on prepare, that remains the same. For commit, if HA_IGNORE_DURABILITY is set, we should not fsync, otherwise we may have poor performance. If HA_IGNORE_DURABILITY is not set, then we must fsync on commit. I do not know what flush_logs needs to do.

My last question is the following:
- what should flush_logs do?
- what is the purpose/contract of flush_logs? Under what scenarios is it meant to be called?

The comments in MySQL 5.5 and 5.6 imply that a checkpoint is run on the system. This is what we do as well. This sounds expensive, because IIUC, a checkpoint writes all dirty nodes to disk. But looking at the implementation, it seems that flush_logs only ensures that the redo log is synced up to the proper lsn, and if not, syncs it. Essentially, it just fsyncs the log. Is this accurate? If so I think we need to modify our engine to not checkpoint and just fsync our recovery log.

Thanks
-Zardosht

On Thu, Jan 31, 2013 at 4:41 AM, Kristian Nielsen <knielsen@stripped> wrote:
> Mats Kindahl <mats.kindahl@stripped> writes:
>
>> In MySQL 5.6, there are no new APIs that you *have* to comply with.
>
> But MySQL 5.6 serialises calls to the commit handlerton method, the next one
> cannot start before the previous one completes. So if you did fsync() with
> group commit before in commit, your group commit will no longer work in 5.6
> and you will get a serious performance regression if you do not honour the
> HA_IGNORE_DURABILITY flag. I consider that breaking the storage engine API, as
> you see Mats and I disagree a bit on that point :-)
>
> Anyway, it should be easy to do for you. MySQL 5.6 sets HA_IGNORE_DURABILITY,
> this has similar semantics to when MariaDB 5.3+ calls the commit_ordered()
> method. So you can probably use the same code for both with a small amount of
> #ifdef.
>
> Note that for MySQL 5.6 you need to implement also the flush_logs() method to
> fsync() all prior commits durably to disk (if you did not already implement
> it). What happens is basically that in MySQL crash recovery looks only at the
> last binlog file written. So it calls flush_logs() before creating a new
> binlog, and storage engine must ensure that all commits become durable at that
> point. Otherwise commits may be lost if a crash happens just after binlog
> rotation.
>
> This is actually the _only_ reason that fsync() was ever needed in commit, to
> ensure that it is done when binlog is rotated. So it is rather silly that we
> have done it for _every_ commit for so long. Anyway, it will be fixed now.
>
> - Kristian.
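
For illustration only, here is a toy model of the contract as discussed in this thread: always sync on prepare, skip the sync on commit when durability is ignored, and have flush_logs sync the recovery log up to the last written LSN instead of running a checkpoint. Everything in it (ToyLog, ToyEngine, count_fsyncs, the enum) is hypothetical and is not the MySQL handlerton API; only the idea behind HA_IGNORE_DURABILITY is taken from the messages above. It is a sketch under those assumptions, not an engine implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the server's durability setting.
// Only the idea of "ignore durability on commit" comes from the thread.
enum DurabilityProperty { REGULAR_DURABILITY, IGNORE_DURABILITY };

// Toy recovery log: tracks how far it has been written and how far synced.
struct ToyLog {
    uint64_t written_lsn = 0;
    uint64_t synced_lsn = 0;
    int fsync_count = 0;

    // Append one log record and return its LSN.
    uint64_t append() { return ++written_lsn; }

    // Sync only if something up to `lsn` is not yet durable.
    void sync_up_to(uint64_t lsn) {
        assert(lsn <= written_lsn);
        if (synced_lsn >= lsn) return;  // already durable, nothing to do
        ++fsync_count;                  // a real engine would fsync() here
        synced_lsn = written_lsn;       // one sync covers everything written
    }
};

struct ToyEngine {
    ToyLog log;

    // Prepare is always made durable.
    void prepare() { log.sync_up_to(log.append()); }

    // Commit skips the sync when the server says durability can be ignored.
    void commit(DurabilityProperty d) {
        uint64_t lsn = log.append();
        if (d != IGNORE_DURABILITY) log.sync_up_to(lsn);
    }

    // flush_logs makes all prior commits durable; no checkpoint is needed.
    void flush_logs() { log.sync_up_to(log.written_lsn); }
};

// Run `txns` prepare/commit pairs, then one flush_logs (as at binlog
// rotation), and report how many syncs the toy log performed.
int count_fsyncs(int txns, DurabilityProperty d) {
    ToyEngine e;
    for (int i = 0; i < txns; ++i) {
        e.prepare();
        e.commit(d);
    }
    e.flush_logs();
    return e.log.fsync_count;
}
```

With three transactions, count_fsyncs(3, REGULAR_DURABILITY) performs six syncs (one per prepare and one per commit, and flush_logs finds nothing left to do), while count_fsyncs(3, IGNORE_DURABILITY) performs four (one per prepare, plus one in flush_logs to cover the last commit).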