List: Internals
From: Zardosht Kasheff
Date: January 31 2013 9:28pm
Subject: Re: reducing fsyncs during handlerton->prepare and handlerton->commit in 5.6

Thank you for the detailed reply.

I want to confirm that I understand the contract:
 - when flush_logs is called, the engine must ensure that every
transaction committed up to that point is recovered as committed
after a crash. No such transaction may come up in the prepared state.
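
To make that contract concrete, here is a minimal sketch of the recovery
decision as described in the quoted messages below: recovery consults only the
last binary log, so a transaction that is still "prepared" in the engine and
whose XID sits in an older binary log gets rolled back. This is a toy model,
not MySQL code; the function name `recover` and the dictionary layout are
hypothetical and chosen only for illustration.

```python
def recover(engine_durable_state, last_binlog_xids):
    """Toy model of crash recovery (hypothetical, not the server's code).

    engine_durable_state: xid -> "prepared" or "committed", i.e. what the
        engine's own log shows on disk after the crash.
    last_binlog_xids: XIDs found in the most recent binary log only.
    """
    outcome = {}
    for xid, state in engine_durable_state.items():
        if state == "committed":
            # The commit record reached disk, so nothing needs deciding.
            outcome[xid] = "committed"
        elif xid in last_binlog_xids:
            # Prepared in the engine, but the last binlog proves it committed.
            outcome[xid] = "committed"
        else:
            # Prepared, and not visible in the only binlog that is examined.
            outcome[xid] = "rolled back"
    return outcome


# Transaction 7 was written to the previous binary log, then the log rotated.
# Case 1: the engine's commit record was not durable at rotation time.
lost = recover({7: "prepared"}, last_binlog_xids=set())
# Case 2: the engine made the commit durable when flush_logs was called.
safe = recover({7: "committed"}, last_binlog_xids=set())
```

In the first case the committed transaction disappears; in the second it
survives, which is exactly what the contract above asks for.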

The reason I ask is that, as of now, we flush all of our data to disk
on flush_logs, and this seems like overkill. If we instead just
fsync our recovery log, that should satisfy the above contract.
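
A rough sketch of what I have in mind, as a toy model rather than real engine
code. The class and method names (`ToyRecoveryLog`, `ToyEngine`) and the
`ignore_durability` argument are hypothetical stand-ins for the handlerton
calls and the HA_IGNORE_DURABILITY flag; the only real calls are Python's
`file.flush()` and `os.fsync()`.

```python
import os
import tempfile


class ToyRecoveryLog:
    """Append-only log that tracks how many records are not yet fsynced."""

    def __init__(self, path):
        self._file = open(path, "ab")
        self.fsync_count = 0
        self.unsynced_records = 0

    def append(self, record):
        self._file.write(record.encode() + b"\n")
        self.unsynced_records += 1

    def sync(self):
        # Nothing to do if every record is already durable.
        if self.unsynced_records == 0:
            return
        self._file.flush()               # push Python's buffer to the OS
        os.fsync(self._file.fileno())    # ask the OS to write it to disk
        self.fsync_count += 1
        self.unsynced_records = 0

    def close(self):
        self._file.close()


class ToyEngine:
    """Hypothetical engine: prepare always syncs, commit syncs only on demand."""

    def __init__(self, log):
        self.log = log

    def prepare(self, xid):
        self.log.append(f"PREPARE {xid}")
        self.log.sync()

    def commit(self, xid, ignore_durability):
        self.log.append(f"COMMIT {xid}")
        if not ignore_durability:
            self.log.sync()

    def flush_logs(self):
        # No checkpoint of dirty data here: only the recovery log is synced,
        # which makes every earlier COMMIT record durable.
        self.log.sync()


def demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        log = ToyRecoveryLog(os.path.join(tmpdir, "recovery.log"))
        engine = ToyEngine(log)
        for xid in range(3):
            engine.prepare(xid)
            engine.commit(xid, ignore_durability=True)
        pending_before = log.unsynced_records
        engine.flush_logs()
        result = (pending_before, log.unsynced_records, log.fsync_count)
        log.close()
        return result
```

In this model `demo()` returns the number of unsynced records before
flush_logs, the number after it, and the total fsync count. The point is that
flush_logs costs one log fsync instead of writing out all dirty data.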

Can we do this?

Thanks
-Zardosht

On Thu, Jan 31, 2013 at 4:12 PM, Mats Kindahl <mats.kindahl@stripped> wrote:
>
> On 01/31/2013 05:09 PM, Zardosht Kasheff wrote:
>> Thanks a lot Kristian and Mats.
>>
>> I am learning that I know a lot less than I thought I knew. To help my
>> understanding, I will focus in this thread on MySQL 5.6. I will focus
>> on MariaDB in another thread.
>>
>> I want to make sure my understanding is correct. Is what I write below accurate?
>>
>> In MySQL 5.5, we have the following APIs:
>>  - handlerton->prepare
>>  - handlerton->commit
>>  - handlerton->flush_logs.
>> We need to fsync on prepare and commit. But what does flush_logs need
>> to do? According to comments, flush_logs runs a checkpoint on the
>> system, which is pretty expensive. Is this accurate?
>
> The flush_logs() function is called before rotating the binary log and
> when doing an explicit FLUSH LOGS. It gives the storage engine a chance
> to flush any in-memory buffers, so yes, it does a checkpoint and it is
> expensive. However, it is just once for each binary log.
>
> This means that each time a binary log is rotated, the system grinds to
> a halt, which is not very nice. We have been discussing ways to avoid this.
>
>>
>> In MySQL 5.6, we have the same APIs, but the contract has changed. We
>> still always fsync on prepare, that remains the same. For commit, if
>> HA_IGNORE_DURABILITY is set, we should not fsync, otherwise we may
>> have poor performance.
>
> Not poorer than without group commits, but yes, if you sync with every
> commit, you will have poor performance.
>
>> If HA_IGNORE_DURABILITY is not set, then we
>> must fsync on commit.
>
> Correct. The HA_IGNORE_DURABILITY says that the server "handles the
> durability".
>
>> I do not know what flush_logs needs to do.
>>
>> My last question is the following:
>>  - what should flush_logs do?
>>  - what is the purpose/contract of flush_logs? Under what scenarios is
>> it meant to be called?
>
> The flush_logs should create a checkpoint, just as you said above. It is
> called on binary log rotate and on explicit FLUSH LOGS (actually also on
> ALTER TABLE, under some circumstances: see sql_table.cc).
>
>>
>> The comments in MySQL 5.5 and 5.6 imply that a checkpoint is run on
>> the system. This is what we do as well. This sounds expensive, because
>> IIUC, a checkpoint writes all dirty nodes to disk. But looking at the
>> implementation, it seems that flush_logs only ensures that the redo
>> log is synced up to the proper lsn, and if not, syncs it. Essentially,
>> it just fsyncs the log.
>
> I assume that you have been looking at InnoDB. The reason a checkpoint
> is done (by flushing the log) is that the recovery procedure only looks
> in the last binary log, which means that you might lose committed
> transactions on a crash, which is not OK.
>
> If you have prepared and committed some transaction but do not flush the
> log on disk before a rotate, it might be that on recovery it is listed
> as prepared (because the commit record was not written to disk) but the
> recovery procedure will not find it in the binary log (because it was
> not in the last one, it was in the preceding one) and hence it will be
> rolled back.
>
> Poof! Transaction gone.
>
> /Matz
>
>>
>> Is this accurate? If so I think we need to modify our engine to not
>> checkpoint and just fsync our recovery log.
>>
>> Thanks
>> -Zardosht
>>
>> On Thu, Jan 31, 2013 at 4:41 AM, Kristian Nielsen
>> <knielsen@stripped> wrote:
>>> Mats Kindahl <mats.kindahl@stripped> writes:
>>>
>>>> In MySQL 5.6, there are no new APIs that you *have* to comply with.
>>> But MySQL 5.6 serialises calls to the commit handlerton method, the next one
>>> cannot start before the previous one completes. So if you did fsync() with
>>> group commit before in commit, your group commit will no longer work in 5.6
>>> and you will get a serious performance regression if you do not honour the
>>> HA_IGNORE_DURABILITY flag. I consider that breaking the storage engine API, as
>>> you see Mats and I disagree a bit on that point :-)
>>>
>>> Anyway, it should be easy to do for you. MySQL 5.6 sets HA_IGNORE_DURABILITY,
>>> this has similar semantics to when MariaDB 5.3+ calls the commit_ordered()
>>> method. So you can probably use the same code for both with a small amount of
>>> #ifdef.
>>>
>>> Note that for MySQL 5.6 you need to implement also the flush_logs() method to
>>> fsync() all prior commits durably to disk (if you did not already implement
>>> it). What happens is basically that in MySQL crash recovery looks only at the
>>> last binlog file written. So it calls flush_logs() before creating a new
>>> binlog, and storage engine must ensure that all commits become durable at that
>>> point. Otherwise commits may be lost if a crash happens just after binlog
>>> rotation.
>>>
>>> This is actually the _only_ reason that fsync() was ever needed in commit, to
>>> ensure that it is done when binlog is rotated. So it is rather silly that we
>>> have done it for _every_ commit for so long. Anyway, it will be fixed now.
>>>
>>>  - Kristian.
>
> --
> Senior Principal Software Developer
> Oracle, MySQL Department
>