From: Zardosht Kasheff
Date: January 31 2013 9:28pm
Subject: Re: reducing fsyncs during handlerton->prepare and handlerton->commit in 5.6
List-Archive: http://lists.mysql.com/internals/38711
Message-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1

Thank you for the detailed reply.

I want to confirm that I understand the contract:
 - when flush logs is called, the engine must ensure that any
transaction committed up until that point is recovered as committed
after a crash. No such transaction can come up in the prepared state.

The reason I ask this is that as of now, on flush logs, we flush all
of our data to disk, and this seems like overkill. Instead, if we just
fsync our recovery log, that will satisfy the above contract. Can we
do this?

Thanks
-Zardosht

On Thu, Jan 31, 2013 at 4:12 PM, Mats Kindahl wrote:
>
> On 01/31/2013 05:09 PM, Zardosht Kasheff wrote:
>> Thanks a lot Kristian and Mats.
>>
>> I am learning that I know a lot less than I thought I knew. To help my
>> understanding, I will focus in this thread on MySQL 5.6. I will focus
>> on MariaDB in another thread.
>>
>> I want to make sure my understanding is correct. Is what I write below accurate?
>>
>> In MySQL 5.5, we have the following APIs:
>> - handlerton->prepare
>> - handlerton->commit
>> - handlerton->flush_logs.
>> We need to fsync on prepare and commit. But what does flush_logs need
>> to do? According to comments, flush_logs runs a checkpoint on the
>> system, which is pretty expensive. Is this accurate?
>
> The flush_logs() function is called before rotating the binary log and
> when doing an explicit FLUSH LOGS. It gives the storage engine a chance
> to flush any in-memory buffers, so yes, it does a checkpoint and it is
> expensive. However, it is just once for each binary log.
>
> This means that each time a binary log is rotated, the system grinds to
> a halt, which is not very nice. We have been discussing ways to avoid this.
>
>>
>> In MySQL 5.6, we have the same APIs, but the contract has changed. We
>> still always fsync on prepare, that remains the same. For commit, if
>> HA_IGNORE_DURABILITY is set, we should not fsync, otherwise we may
>> have poor performance.
>
> Not poorer than without group commits, but yes, if you sync with every
> commit, you will have poor performance.
>
>> If HA_IGNORE_DURABILITY is not set, then we
>> must fsync on commit.
>
> Correct. The HA_IGNORE_DURABILITY says that the server "handles the
> durability".
>
>> I do not know what flush_logs needs to do.
>>
>> My last question is the following:
>> - what should flush_logs do?
>> - what is the purpose/contract of flush_logs? Under what scenarios is
>> it meant to be called?
>
> The flush_logs should create a checkpoint, just as you said above. It is
> called on binary log rotate and on explicit FLUSH LOGS (actually also on
> ALTER TABLE, under some circumstances: see sql_table.cc).
>
>>
>> The comments in MySQL 5.5 and 5.6 imply that a checkpoint is run on
>> the system. This is what we do as well. This sounds expensive, because
>> IIUC, a checkpoint writes all dirty nodes to disk. But looking at the
>> implementation, it seems that flush_logs only ensures that the redo
>> log is synced up to the proper lsn, and if not, syncs it. Essentially,
>> it just fsyncs the log.
>
> I assume that you have been looking at InnoDB. The reason a checkpoint
> is done (by flushing the log) is that the recovery procedure only looks
> in the last binary log, which means that you might lose committed
> transactions on a crash, which is not OK.
>
> If you have prepared and committed some transaction but do not flush the
> log on disk before a rotate, it might be that on recovery it is listed
> as prepared (because the commit record was not written to disk) but the
> recovery procedure will not find it in the binary log (because it was
> not in the last one, it was in the preceding one) and hence it will be
> rolled back.
>
> Poof! Transaction gone.
>
> /Matz
>
>>
>> Is this accurate? If so I think we need to modify our engine to not
>> checkpoint and just fsync our recovery log.
>>
>> Thanks
>> -Zardosht
>>
>> On Thu, Jan 31, 2013 at 4:41 AM, Kristian Nielsen
>> wrote:
>>> Mats Kindahl writes:
>>>
>>>> In MySQL 5.6, there are no new APIs that you *have* to comply with.
>>> But MySQL 5.6 serialises calls to the commit handlerton method, the next one
>>> cannot start before the previous one completes. So if you did fsync() with
>>> group commit before in commit, your group commit will no longer work in 5.6
>>> and you will get a serious performance regression if you do not honour the
>>> HA_IGNORE_DURABILITY flag. I consider that breaking the storage engine API, as
>>> you see Mats and I disagree a bit on that point :-)
>>>
>>> Anyway, it should be easy to do for you. MySQL 5.6 sets HA_IGNORE_DURABILITY,
>>> this has similar semantics to when MariaDB 5.3+ calls the commit_ordered()
>>> method. So you can probably use the same code for both with a small amount of
>>> #ifdef.
>>>
>>> Note that for MySQL 5.6 you need to implement also the flush_logs() method to
>>> fsync() all prior commits durably to disk (if you did not already implement
>>> it). What happens is basically that in MySQL crash recovery looks only at the
>>> last binlog file written. So it calls flush_logs() before creating a new
>>> binlog, and storage engine must ensure that all commits become durable at that
>>> point. Otherwise commits may be lost if a crash happens just after binlog
>>> rotation.
>>>
>>> This is actually the _only_ reason that fsync() was ever needed in commit, to
>>> ensure that it is done when binlog is rotated. So it is rather silly that we
>>> have done it for _every_ commit for so long. Anyway, it will be fixed now.
>>>
>>> - Kristian.
>
> --
> Senior Principal Software Developer
> Oracle, MySQL Department
>
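
For readers following the thread, the contract above can be illustrated with a small toy model. This is only a sketch in Python, not MySQL or storage-engine code: ToyEngine, run() and all field names are hypothetical, "fsync" is simulated by moving records from an in-memory buffer to a durable list, and recovery is assumed to resolve prepared transactions by looking only at the last binlog, as described in the quoted replies.

```python
class ToyEngine:
    """Hypothetical engine with a recovery log; not the real handlerton API."""

    def __init__(self):
        self.durable_log = []   # records that have been "fsynced"
        self.buffered_log = []  # records written but not yet "fsynced"

    def _fsync(self):
        # Simulated fsync: everything buffered becomes durable.
        self.durable_log.extend(self.buffered_log)
        self.buffered_log.clear()

    def prepare(self, xid):
        self.buffered_log.append(("prepare", xid))
        self._fsync()  # prepare is always made durable

    def commit(self, xid, ignore_durability):
        self.buffered_log.append(("commit", xid))
        if not ignore_durability:
            self._fsync()  # only sync when the server does not handle durability

    def flush_logs(self):
        # The cheap variant asked about in the thread: sync the log only,
        # no full checkpoint of dirty data.
        self._fsync()

    def crash(self):
        self.buffered_log.clear()  # unsynced records are lost

    def recover(self, last_binlog_xids):
        state = {}
        for kind, xid in self.durable_log:
            state[xid] = "committed" if kind == "commit" else "prepared"
        # Prepared transactions are resolved against the last binlog only.
        for xid, s in state.items():
            if s == "prepared":
                state[xid] = "committed" if xid in last_binlog_xids else "rolled back"
        return state


def run(flush_on_rotate):
    """Commit one transaction, rotate the binlog, crash, then recover."""
    engine = ToyEngine()
    binlogs = [[]]
    engine.prepare(1)
    binlogs[-1].append(1)                     # server writes the transaction to the binlog
    engine.commit(1, ignore_durability=True)  # no fsync on commit
    if flush_on_rotate:
        engine.flush_logs()                   # called before rotating the binlog
    binlogs.append([])                        # rotation: a new, empty last binlog
    engine.crash()
    return engine.recover(set(binlogs[-1]))


print(run(flush_on_rotate=True))
print(run(flush_on_rotate=False))
```

In this model, syncing only the recovery log in flush_logs is enough for the committed transaction to come back as committed after the rotation and crash. Without that sync, the commit record is lost, the transaction is seen as prepared, it is not found in the last binlog, and it is rolled back.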