Zardosht Kasheff <zardosht@stripped> writes:
> Instead of having the binary log be stored as a log file, a table with
> a transactional engine (like InnoDB) is used. I have NOT done any
Yes, this is a tempting idea, is it not? Drizzle even implemented this.
But there are some subtle problems, and I think there are better solutions.
One problem you will find if you try implementing this is - how are you going
to keep the concept of binlog order? A relational table is unordered. You can
use an increasing uint64 as the primary key, but then how are you going to
ensure that the order in the table is the same as the commit order in InnoDB
(InnoDB redo log)? If you take a lock from allocating the sequence number to
doing the commit, then you have introduced a serialisation bottleneck and
killed group commit. If you accept different order (as Drizzle does), then
when you take an XtraBackup, it may not be consistent with the binlog.
You will be writing your data *6* times: To the real table, when flushing its
buffer pool pages. To the double-write buffer while flushing, and to the redo
log before flushing. Then to the binlog table, and before that to the redo log
and the doublewrite buffer. I could never reconcile myself with this.
So it is tempting, but re-implementing binlog is a huge change (also in terms
of end-user visibility), and my personal opinion is that for such huge a
change we need to go to the best solution, not second-best.
> investigation into whether this is possible, but I'm wondering what
> people abstractly think of this idea.
> My major motivation is to find a way to have users not be forced to
> fsync on every transaction commit on slaves. With crash safety of
My preference is MWL#164:
This reduces to one fsync() per group commit. But as you point out, this is
still horrendously expensive for a single-threaded slave :-(
Abstractly, the proper solution is to store the binlog and the InnoDB redo log
(and all other transactional-type logs) in the *same* log. The upper layer
would provide a general transactional log facility, and binlog, InnoDB,
partitioning, etc. would use this instead of their own logs. But it is hard to
see how we could arrive there from current code ...
For MariaDB global transaction ID, I support crash-safe without binlog
enabled, so there turning off fsync() will be possible. But this does not help
if a slave also needs to act as a master.
Parallel replication can help a lot, by allowing group commit, amortising the
cost of the fsync(). But it would still be nice to solve the orginal problem.
As you say, for a slave it is not a problem to loose transactions in a crash
as they can be just replicated again. The problem is to be consistent between
what is in the InnoDB tables and what is in the binlog.
InnoDB already records in a crash-safe way the binlog position of the last
commit. So if we can ensure that InnoDB is never ahead of the binlog, we can
just cut the binlog after a crash if it has more events than what InnoDB
recovered. We could do this by doing binlog writes as async I/O O_DIRECT
writes, and sending a signal somehow to InnoDB so it would buffer its own redo
log until binlog is safe on disk. Then no fsync() would be needed. But this is
so far just a loose idea, it needs more thought to be sure it can work...
Anyway, this is my current long-term plan: if we cannot have a proper single
database-wide transaction log, then instead coordinate writes between the
multiple logs that exist. Delay writes to logs so that they are never ahead of
the master log. And at crash, either recover missing pieces of logs from the
master log (MWL#164), or cut logs to be consistent. Then we can be consistent
without any fsync() delays.
> - As I understand it, for two phase commit, storage engines are
> currently expected to fsync once after a transaction is prepared and
> once after a transaction is committed. If the binary log is an InnoDB
> table, then two-phase commit is unnecessary and the number of fsyncs
> on the system is cut in half.
In MariaDB 10 and MySQL 5.6, the fsync() at commit is no longer needed (InnoDB
does not have it).
[Note that IIRC, MySQL 5.6 breaks the storage engine API, you will need to
implement the skip-fsync-at-commit-stage, or you will loose group commit.]
> - With crash safe slaves on MySQL 5.6, one theoretically does not
> need to enable fsyncs on transaction commit on slaves. If some data
> gets lost after a crash, then the data may be replayed from the
> master. However, to get GTIDs working, one must enable the binary log
> on slaves, which requires two-phase commit, which require fsyncs. By
Yes, requiring binlog on slaves is a serious mistake :-(
Ok, as you can perhaps tell, this is something I have thought a lot about over
the past years, so I have a lot of ideas but it is a bit difficult to put it
all in a single mail reply without it becomming mostly a jumble of loose
ideas. You are welcome to ask for more details on any specific points of