List: Internals
From: Kristian Nielsen
Date: February 8 2013 2:04pm
Subject: Re: [PATCH] use fallocate to create table
Domas Mituzas <domas.mituzas@stripped> writes:

> The way we modified InnoDB I/O in our patch is that when we're doing O_DIRECT writes,
> we're not calling fsync() afterwards.

O_DIRECT anyway only returns when the data is safely on disk. So what would be
the point of calling fsync() after O_DIRECT writes?

> Correct semantics would be calling fsync() afterwards, as that would make sure that
> OS updates all the metadata properly. 

What metadata? True, there is the "file last modified" property, but I am
assuming that is not what you meant.

I saw some Linux kernel mailing list discussions about a bug where O_DIRECT
writes that extend the file do not sync the file-size metadata to disk. Is
that what you have in mind? So we might write a new data block at the end of
the file, then crash, and when the kernel comes back up, the newly written
block is missing.

It does seem probable that there would be bugs like this. Hey, I even recently
discovered a bug where a normal write followed by fdatasync() was not crash
safe. It is not well-tested territory.

> Instead we do fsync() only at the end of zero-filled block append, which makes sure
> that FS has properly allocated those blocks and we can O_DIRECT to/from them as much as we
> want \o/

Yes. If we extend a file in 1MB chunks and f(data)sync() after each, then it
will be very inefficient. There will be a metadata write to disk (the new file
size) for every 1MB of data. This will about double the time needed on
spinning rust.

> Obviously, such a change allows more efficient in-place I/O, assuming one uses
> O_DIRECT, but we have to keep away from relying on any metadata afterwards.

I do not understand. What metadata? If we fdatasync() _once_ after extending
the file the full amount, then file dates may be incorrect but everything else
should be reliable. If we fsync() _once_, then even dates should be ok.

If we extend with a decent amount each time, say 64MB or so, then the overhead
of a single fdatasync() should be modest, shouldn't it?
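
To make that concrete, here is a rough sketch of what I mean: append the
zero-filled blocks in 1MB writes, but call fdatasync() only once, after the
whole extension is written. The helper name extend_file_zeroed() and the
CHUNK_BYTES constant are made up for illustration; this is not the actual
InnoDB code, and it uses plain buffered writes rather than O_DIRECT.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative write size; not taken from any real patch. */
#define CHUNK_BYTES (1024 * 1024)

/*
 * Hypothetical helper: append extend_bytes zero bytes to the end of fd,
 * then sync once. Returns 0 on success, -1 on error (errno is set).
 */
static int extend_file_zeroed(int fd, off_t extend_bytes)
{
    char *zeros = calloc(1, CHUNK_BYTES);   /* zero-filled buffer */
    if (zeros == NULL)
        return -1;

    off_t offset = lseek(fd, 0, SEEK_END);  /* current end of file */
    if (offset < 0) {
        free(zeros);
        return -1;
    }

    off_t remaining = extend_bytes;
    while (remaining > 0) {
        size_t want = remaining < CHUNK_BYTES ? (size_t)remaining
                                              : (size_t)CHUNK_BYTES;
        ssize_t n = pwrite(fd, zeros, want, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;                   /* retry interrupted write */
            free(zeros);
            return -1;
        }
        offset += n;                        /* handles short writes too */
        remaining -= n;
    }
    free(zeros);

    /* One sync for the whole extension instead of one per chunk. */
    return fdatasync(fd);
}
```

Calling it as extend_file_zeroed(fd, 64 * 1024 * 1024) would then cost one
fdatasync() per 64MB of extension rather than one per 1MB chunk.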

 - Kristian.