From: Mats Kindahl
Date: February 1 2013 7:37am
Subject: Re: reducing fsyncs during handlerton->prepare and handlerton->commit in 5.6
On 01/31/2013 10:28 PM, Zardosht Kasheff wrote:
> Thank you for the detailed reply.
>
> I want to confirm that I understand the contract:
>  - when flush logs is called, the engine must ensure that any
> transaction committed up until that point is recovered as committed
> after a crash. No such transaction can come up in the prepared state.

Yes, that is correct.

>
> The reason I ask this is that as of now, on flush logs, we flush all
> of our data to disk, and this seems like overkill. Instead, if we just
> fsync our recovery log, that will satisfy the above contract.
>
> Can we do this?

It looks like it should work. I don't know the details of your
implementation (and details can make a big difference), but if, by
using the recovery log, you can ensure that no transactions committed
before the flush_logs() show up as "prepared" on recovery, you should be safe.
(Note: on recovery, the storage engine is asked for all prepared
transactions, and this set should not include any transactions that are
in any binary logs except the last one.)
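
For illustration, a minimal sketch of such a flush_logs() on the engine
side, assuming the 5.6-style bool (*flush_logs)(handlerton*) hook and a
hypothetical engine that keeps its recovery log in a single file (the
names below are made up, not taken from any real engine):

  #include <unistd.h>        /* fsync() */

  struct handlerton;         /* defined in sql/handler.h in the server tree */

  /* Hypothetical engine state: file descriptor of the recovery log. */
  static int recovery_log_fd;

  /*
    Sketch of a flush_logs hook meeting the contract above: after it
    returns, every transaction committed so far must come up as committed
    (never as merely prepared) after a crash.  Instead of a full
    checkpoint it only forces the recovery log to disk.  Returns false on
    success, true on error, following the handlerton convention.
  */
  static bool example_flush_logs(handlerton *hton)
  {
    /* Commit records are assumed to have been write()n to the log file
       already (possibly still only in the OS page cache); one fsync
       makes them durable. */
    return fsync(recovery_log_fd) != 0;
  }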

Just a question: how do you make it possible to flush all data to disk
if, e.g., the user wants to take a backup? You don't have to use FLUSH
LOGS for this (which would call flush_logs()), but it has to be possible.

/Matz

>
> Thanks
> -Zardosht
>
> On Thu, Jan 31, 2013 at 4:12 PM, Mats Kindahl <mats.kindahl@stripped> wrote:
>> On 01/31/2013 05:09 PM, Zardosht Kasheff wrote:
>>> Thanks a lot Kristian and Mats.
>>>
>>> I am learning that I know a lot less than I thought I knew. To help my
>>> understanding, I will focus in this thread on MySQL 5.6. I will focus
>>> on MariaDB in another thread.
>>>
>>> I want to make sure my understanding is correct. Is what I write below
>>> accurate?
>>>
>>> In MySQL 5.5, we have the following APIs:
>>>  - handlerton->prepare
>>>  - handlerton->commit
>>>  - handlerton->flush_logs.
>>> We need to fsync on prepare and commit. But what does flush_logs need
>>> to do? According to comments, flush_logs runs a checkpoint on the
>>> system, which is pretty expensive. Is this accurate?
>> The flush_logs() function is called before rotating the binary log and
>> when doing an explicit FLUSH LOGS. It gives the storage engine a chance
>> to flush any in-memory buffers, so yes, it does a checkpoint and it is
>> expensive. However, it happens only once for each binary log.
>>
>> This means that each time a binary log is rotated, the system grinds to
>> a halt, which is not very nice. We have been discussing ways to avoid this.
>>
>>> In MySQL 5.6, we have the same APIs, but the contract has changed. We
>>> still always fsync on prepare, that remains the same. For commit, if
>>> HA_IGNORE_DURABILITY is set, we should not fsync, otherwise we may
>>> have poor performance.
>> Not poorer than without group commits, but yes, if you sync with every
>> commit, you will have poor performance.
>>
>>> If HA_IGNORE_DURABILITY is not set, then we
>>> must fsync on commit.
>> Correct. The HA_IGNORE_DURABILITY flag says that the server "handles
>> the durability".
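
As a sketch, the commit hook could then look roughly like this
(thd_get_durability_property() is, as far as I recall, how the 5.6
server exposes the flag to engines; the engine_* helpers are invented
names, not real APIs):

  class THD;                 /* from the server headers (sql/sql_class.h);
                                HA_IGNORE_DURABILITY and
                                thd_get_durability_property() also come
                                from the server headers */
  struct handlerton;         /* sql/handler.h */

  /* Hypothetical engine-internal helpers, prototypes only. */
  void engine_write_commit_record(THD *thd, bool all);
  int  engine_sync_log();

  /* Sketch of a commit hook that honours HA_IGNORE_DURABILITY. */
  static int example_commit(handlerton *hton, THD *thd, bool all)
  {
    engine_write_commit_record(thd, all);  /* make the commit recoverable */

    if (thd_get_durability_property(thd) == HA_IGNORE_DURABILITY)
      return 0;   /* the server has taken over durability; it will call
                     flush_logs() when the commits must become durable */

    return engine_sync_log();  /* otherwise the engine syncs itself */
  }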
>>
>>> I do not know what flush_logs needs to do.
>>>
>>> My last question is the following:
>>>  - what should flush_logs do?
>>>  - what is the purpose/contract of flush_logs? Under what scenarios is
>>> it meant to be called?
>> The flush_logs should create a checkpoint, just as you said above. It is
>> called on binary log rotate and on explicit FLUSH LOGS (actually also on
>> ALTER TABLE, under some circumstances: see sql_table.cc).
>>
>>> The comments in MySQL 5.5 and 5.6 imply that a checkpoint is run on
>>> the system. This is what we do as well. This sounds expensive, because
>>> IIUC, a checkpoint writes all dirty nodes to disk. But looking at the
>>> implementation, it seems that flush_logs only ensures that the redo
>>> log is synced up to the proper lsn, and if not, syncs it. Essentially,
>>> it just fsyncs the log.
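
For reference, that "sync the log up to the proper LSN" pattern looks
roughly like this (a hypothetical sketch, not InnoDB's actual code; the
LSN bookkeeping names are invented):

  #include <stdint.h>
  #include <unistd.h>

  /* Hypothetical engine bookkeeping: LSN of the last record written to
     the log file, and the LSN known to be durable on disk. */
  static uint64_t last_written_lsn;
  static uint64_t last_synced_lsn;
  static int      log_fd;

  /* Sync the log up to target_lsn, skipping the fsync when that LSN is
     already durable. */
  static int sync_log_up_to(uint64_t target_lsn)
  {
    if (last_synced_lsn >= target_lsn)
      return 0;                          /* already on disk, nothing to do */

    uint64_t written = last_written_lsn; /* snapshot before syncing */
    if (fsync(log_fd) != 0)
      return -1;

    if (last_synced_lsn < written)
      last_synced_lsn = written;         /* everything up to 'written' is durable */
    return 0;
  }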
>> I assume that you have been looking at InnoDB. The reason a checkpoint
>> is done (by flushing the log) is that the recovery procedure only looks
>> in the last binary log, which means that you might lose committed
>> transactions on a crash, which is not OK.
>>
>> If you have prepared and committed some transaction but do not flush the
>> log to disk before a rotate, it might be that on recovery it is listed
>> as prepared (because the commit record was not written to disk) but the
>> recovery procedure will not find it in the binary log (because it was
>> not in the last one, it was in the preceding one) and hence it will be
>> rolled back.
>>
>> Poof! Transaction gone.
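
In other words, the per-transaction recovery decision boils down to
something like this (a simplified sketch of the logic described above,
with XIDs reduced to plain integers; the real code works on XID
structures in the server's XA recovery path):

  #include <set>
  #include <vector>

  /* Stubs standing in for the real commit/rollback-by-XID calls. */
  static void commit_by_xid(long long xid)   { (void) xid; }
  static void rollback_by_xid(long long xid) { (void) xid; }

  /* A prepared transaction is committed only if its XID was found in the
     *last* binary log; otherwise it is rolled back. */
  static void recover_prepared(const std::vector<long long> &prepared_in_engine,
                               const std::set<long long> &xids_in_last_binlog)
  {
    for (long long xid : prepared_in_engine)
    {
      if (xids_in_last_binlog.count(xid))
        commit_by_xid(xid);    /* its commit record reached the last binlog */
      else
        rollback_by_xid(xid);  /* never made it into the last binlog */
    }
  }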
>>
>> /Matz
>>
>>> Is this accurate? If so I think we need to modify our engine to not
>>> checkpoint and just fsync our recovery log.
>>>
>>> Thanks
>>> -Zardosht
>>>
>>> On Thu, Jan 31, 2013 at 4:41 AM, Kristian Nielsen
>>> <knielsen@stripped> wrote:
>>>> Mats Kindahl <mats.kindahl@stripped> writes:
>>>>
>>>>> In MySQL 5.6, there are no new APIs that you *have* to comply with.
>>>> But MySQL 5.6 serialises calls to the commit handlerton method, the next one
>>>> cannot start before the previous one completes. So if you did fsync() with
>>>> group commit in your commit method before, your group commit will no longer
>>>> work in 5.6 and you will get a serious performance regression if you do not
>>>> honour the HA_IGNORE_DURABILITY flag. I consider that breaking the storage
>>>> engine API, as you see Mats and I disagree a bit on that point :-)
>>>>
>>>> Anyway, it should be easy to do for you. MySQL 5.6 sets HA_IGNORE_DURABILITY;
>>>> this has similar semantics to when MariaDB 5.3+ calls the commit_ordered()
>>>> method. So you can probably use the same code for both with a small amount of
>>>> #ifdef.
>>>>
>>>> Note that for MySQL 5.6 you also need to implement the flush_logs() method to
>>>> fsync() all prior commits durably to disk (if you did not already implement
>>>> it). What happens is basically that in MySQL, crash recovery looks only at the
>>>> last binlog file written. So it calls flush_logs() before creating a new
>>>> binlog, and the storage engine must ensure that all commits become durable at
>>>> that point. Otherwise commits may be lost if a crash happens just after binlog
>>>> rotation.
>>>>
>>>> This is actually the _only_ reason that fsync() was ever needed in commit: to
>>>> ensure that it is done when the binlog is rotated. So it is rather silly that
>>>> we have done it for _every_ commit for so long. Anyway, it will be fixed now.
>>>>
>>>>  - Kristian.
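
A rough illustration of the #ifdef idea, assuming a hypothetical shared
helper engine_commit_no_sync() that writes the commit record without
syncing (MARIADB_BASE_VERSION is the usual macro for telling the two
servers apart; adjust to your build setup):

  class THD;
  struct handlerton;

  /* Hypothetical shared helpers, prototypes only. */
  void engine_commit_no_sync(THD *thd, bool all);
  int  engine_sync_log();

  #ifdef MARIADB_BASE_VERSION
  /* MariaDB 5.3+: the server calls commit_ordered() under its own
     group-commit machinery, so durability is the server's problem. */
  static void example_commit_ordered(handlerton *hton, THD *thd, bool all)
  {
    engine_commit_no_sync(thd, all);
  }
  #else
  /* MySQL 5.6: same code path, reached from commit() when the server
     has set HA_IGNORE_DURABILITY on the THD. */
  static int example_commit(handlerton *hton, THD *thd, bool all)
  {
    engine_commit_no_sync(thd, all);
    if (thd_get_durability_property(thd) == HA_IGNORE_DURABILITY)
      return 0;                 /* the server will sync via flush_logs() */
    return engine_sync_log();   /* no group commit: the engine syncs itself */
  }
  #endif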
>> --
>> Senior Principal Software Developer
>> Oracle, MySQL Department
>>

-- 
Senior Principal Software Developer
Oracle, MySQL Department
