From: Mats Kindahl
Date: February 1 2013 7:37am
Subject: Re: reducing fsyncs during handlerton->prepare and handlerton->commit in 5.6
List-Archive: http://lists.mysql.com/internals/38712
Message-Id: <510B70D7.9050900@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

On 01/31/2013 10:28 PM, Zardosht Kasheff wrote:
> Thank you for the detailed reply.
>
> I want to confirm that I understand the contract:
> - when flush logs is called, the engine must ensure that any
> transaction committed up until that point is recovered as committed
> after a crash. No such transaction can come up in the prepared state.

Yes, that is correct.

>
> The reason I ask this is that as of now, on flush logs, we flush all
> of our data to disk, and this seems like overkill. Instead, if we just
> fsync our recovery log, that will satisfy the above contract.
>
> Can we do this?

It looks like it should work. I don't know the details of your
implementation (and details can make a big difference), but if, by
using the recovery log, you can ensure that no transactions committed
before the flush_logs() show up as "prepared" on recovery, you should
be safe.

(Note: on recovery, the storage engine is asked for all prepared
transactions, and this set should not include any transactions that are
in any binary logs except the last one.)

Just a question: how do you make it possible to flush all data to disk
if, e.g., the user wants to take a backup? You don't have to use FLUSH
LOGS for this (which would call flush_logs()), but it has to be possible.

/Matz

>
> Thanks
> -Zardosht
>
> On Thu, Jan 31, 2013 at 4:12 PM, Mats Kindahl wrote:
>> On 01/31/2013 05:09 PM, Zardosht Kasheff wrote:
>>> Thanks a lot Kristian and Mats.
>>>
>>> I am learning that I know a lot less than I thought I knew. To help my
>>> understanding, I will focus in this thread on MySQL 5.6. I will focus
>>> on MariaDB in another thread.
>>>
>>> I want to make sure my understanding is correct. Is what I write below accurate?
>>>
>>> In MySQL 5.5, we have the following APIs:
>>> - handlerton->prepare
>>> - handlerton->commit
>>> - handlerton->flush_logs.
>>> We need to fsync on prepare and commit. But what does flush_logs need
>>> to do? According to comments, flush_logs runs a checkpoint on the
>>> system, which is pretty expensive. Is this accurate?
>> The flush_logs() function is called before rotating the binary log and
>> when doing an explicit FLUSH LOGS. It gives the storage engine a chance
>> to flush any in-memory buffers, so yes, it does a checkpoint and it is
>> expensive. However, it is just once for each binary log.
>>
>> This means that each time a binary log is rotated, the system grinds to
>> a halt, which is not very nice. We have been discussing ways to avoid this.
>>
>>> In MySQL 5.6, we have the same APIs, but the contract has changed. We
>>> still always fsync on prepare, that remains the same. For commit, if
>>> HA_IGNORE_DURABILITY is set, we should not fsync, otherwise we may
>>> have poor performance.
>> Not poorer than without group commits, but yes, if you sync with every
>> commit, you will have poor performance.
>>
>>> If HA_IGNORE_DURABILITY is not set, then we
>>> must fsync on commit.
>> Correct. The HA_IGNORE_DURABILITY says that the server "handles the
>> durability".
>>
>>> I do not know what flush_logs needs to do.
>>>
>>> My last question is the following:
>>> - what should flush_logs do?
>>> - what is the purpose/contract of flush_logs? Under what scenarios is
>>> it meant to be called?
>> The flush_logs should create a checkpoint, just as you said above. It is
>> called on binary log rotate and on explicit FLUSH LOGS (actually also on
>> ALTER TABLE, under some circumstances: see sql_table.cc).
>>
>>> The comments in MySQL 5.5 and 5.6 imply that a checkpoint is run on
>>> the system. This is what we do as well.
>>> This sounds expensive, because
>>> IIUC, a checkpoint writes all dirty nodes to disk. But looking at the
>>> implementation, it seems that flush_logs only ensures that the redo
>>> log is synced up to the proper lsn, and if not, syncs it. Essentially,
>>> it just fsyncs the log.
>> I assume that you have been looking at InnoDB. The reason a checkpoint
>> is done (by flushing the log) is that the recovery procedure only looks
>> in the last binary log, which means that you might lose committed
>> transactions on a crash, which is not OK.
>>
>> If you have prepared and committed some transaction but do not flush the
>> log on disk before a rotate, it might be that on recovery it is listed
>> as prepared (because the commit record was not written to disk) but the
>> recovery procedure will not find it in the binary log (because it was
>> not in the last one, it was in the preceding one) and hence it will be
>> rolled back.
>>
>> Poof! Transaction gone.
>>
>> /Matz
>>
>>> Is this accurate? If so I think we need to modify our engine to not
>>> checkpoint and just fsync our recovery log.
>>>
>>> Thanks
>>> -Zardosht
>>>
>>> On Thu, Jan 31, 2013 at 4:41 AM, Kristian Nielsen
>>> wrote:
>>>> Mats Kindahl writes:
>>>>
>>>>> In MySQL 5.6, there are no new APIs that you *have* to comply with.
>>>> But MySQL 5.6 serialises calls to the commit handlerton method, the next one
>>>> cannot start before the previous one completes. So if you did fsync() with
>>>> group commit before in commit, your group commit will no longer work in 5.6
>>>> and you will get a serious performance regression if you do not honour the
>>>> HA_IGNORE_DURABILITY flag. I consider that breaking the storage engine API, as
>>>> you see Mats and I disagree a bit on that point :-)
>>>>
>>>> Anyway, it should be easy to do for you. MySQL 5.6 sets HA_IGNORE_DURABILITY,
>>>> this has similar semantics to when MariaDB 5.3+ calls the commit_ordered()
>>>> method.
>>>> So you can probably use the same code for both with a small amount of
>>>> #ifdef.
>>>>
>>>> Note that for MySQL 5.6 you need to implement also the flush_logs() method to
>>>> fsync() all prior commits durably to disk (if you did not already implement
>>>> it). What happens is basically that in MySQL crash recovery looks only at the
>>>> last binlog file written. So it calls flush_logs() before creating a new
>>>> binlog, and storage engine must ensure that all commits become durable at that
>>>> point. Otherwise commits may be lost if a crash happens just after binlog
>>>> rotation.
>>>>
>>>> This is actually the _only_ reason that fsync() was ever needed in commit, to
>>>> ensure that it is done when binlog is rotated. So it is rather silly that we
>>>> have done it for _every_ commit for so long. Anyway, it will be fixed now.
>>>>
>>>> - Kristian.
>> --
>> Senior Principal Software Developer
>> Oracle, MySQL Department
>>

--
Senior Principal Software Developer
Oracle, MySQL Department
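
The failure mode discussed in this thread (a transaction that comes back
as "prepared" after a crash, but whose binary log entry is no longer in
the last binlog file) can be illustrated with a small toy model. This is
only a sketch under stated assumptions: ToyEngine, LogRecord and
lost_after_crash are hypothetical names invented for the illustration,
not MySQL APIs, and the binary log is assumed to be durable on its own.
The model keeps prepare always synced, skips the sync on commit (as when
HA_IGNORE_DURABILITY is set), and lets flush_logs sync only the recovery
log rather than running a full checkpoint.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy model; none of these names are MySQL APIs.
enum class State { Prepared, Committed };

struct LogRecord {
    int xid;
    State state;
};

class ToyEngine {
public:
    // prepare is always made durable, as described in the thread.
    void prepare(int xid) {
        buffer_.push_back({xid, State::Prepared});
        sync_log();
    }

    // With ignore_durability set, the commit record stays in memory only.
    void commit(int xid, bool ignore_durability) {
        buffer_.push_back({xid, State::Committed});
        if (!ignore_durability) sync_log();
    }

    // Assumed minimal flush_logs: sync the recovery log, no full checkpoint.
    void flush_logs() { sync_log(); }

    // A crash discards everything that was not synced.
    void crash() { buffer_.clear(); }

    // Transactions whose durable log has a prepare record but no commit record.
    std::set<int> prepared_after_recovery() const {
        std::set<int> prepared;
        for (const LogRecord& r : durable_) {
            if (r.state == State::Prepared) prepared.insert(r.xid);
            else prepared.erase(r.xid);
        }
        return prepared;
    }

private:
    // Stand-in for fsync(): move buffered records to the durable log.
    void sync_log() {
        durable_.insert(durable_.end(), buffer_.begin(), buffer_.end());
        buffer_.clear();
    }

    std::vector<LogRecord> buffer_;
    std::vector<LogRecord> durable_;
};

// Commit three transactions, rotate the binlog, crash, then recover by
// looking only at the last binlog file. Returns how many transactions
// would be rolled back even though they were committed.
std::size_t lost_after_crash(bool flush_on_rotate) {
    ToyEngine engine;
    std::vector<std::set<int>> binlogs(1);  // one set of xids per binlog file

    for (int xid = 1; xid <= 3; ++xid) {
        engine.prepare(xid);
        binlogs.back().insert(xid);  // binlog write, assumed durable
        engine.commit(xid, /*ignore_durability=*/true);
    }

    if (flush_on_rotate) engine.flush_logs();
    binlogs.emplace_back();  // rotate: start a new, empty binlog file
    assert(!binlogs.empty());

    engine.crash();

    std::size_t lost = 0;
    for (int xid : engine.prepared_after_recovery()) {
        // Recovery only consults the last binlog; anything not found is rolled back.
        if (binlogs.back().count(xid) == 0) ++lost;
    }
    return lost;
}
```

In this model, calling lost_after_crash(false) reports one lost
transaction (the last commit record never reached disk and its binlog
entry is in the previous file), while lost_after_crash(true) reports
none. That matches the contract confirmed above: syncing the recovery
log in flush_logs is enough to keep committed transactions from
resurfacing as prepared, at least under the assumptions of this sketch.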