I made a rough AIO-prefetch patch to improve I/O-bound performance of InnoDB. I also
posted this to percona-discussion, but I hear that there may be some more interested
What the patch does is:
1. Implement read_multi_range_*(), read_range_*() for InnoDB (almost a copy of the default
2. Add asynchronous prefetches for pages that will be read by the innodb row fetch code
<-- main feature
3. Remap index_read_*(), read_range_*() to call read_multi_range_*()
4. Fix records_in_range() to do less synchronous I/O
The problem it addresses is that InnoDB issues synchronous requests for pages in a range
scan. This has a few drawbacks:
- When using O_DIRECT, you have no FS readahead, so sequential pages are requested from
the disk one by one
- You want to use O_DIRECT, because otherwise AIO is useless (io_submit() waits for completion).
- When not using O_DIRECT, when pages are not exactly sequential, the FS readahead does
not kick in, making things slow. This happens a lot if you have a fragmented table.
- When reading random pages you have to wait for a disk seek before getting EACH PAGE
- records_in_range() also does synchronous requests, which makes performance horrible
again, so I changed it to read fewer pages. It now reads only node pages, unless the range
begins and ends on the same or adjacent pages, in which case it descends to the leaf pages
to get the real count.
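As a rough illustration of the records_in_range() decision described above (not the code from the patch), here is a small Python sketch. The function name, the leaf-position arguments, the count_exact callback and the "pages spanned times average records per page" estimate are all assumptions made for the example; the patch itself may estimate differently.

```python
def estimate_records_in_range(start_leaf, end_leaf, avg_records_per_leaf, count_exact):
    """Return (row_estimate, leaf_pages_read). Hypothetical helper, not InnoDB API.

    start_leaf / end_leaf: ordinal positions of the leaf pages that hold the
    range boundaries, as worked out from node (non-leaf) pages only.
    count_exact: callback that reads the leaf pages and returns the real count.
    """
    span = end_leaf - start_leaf
    if span <= 1:
        # Same or adjacent leaf pages: reading them is cheap, so get the real count.
        return count_exact(start_leaf, end_leaf), span + 1
    # Wider range: do no leaf I/O at all and estimate from the page span
    # (assumed formula for this sketch).
    return (span + 1) * avg_records_per_leaf, 0


# Stand-in for the synchronous leaf read; always reports 7 matching rows.
fake_exact = lambda first, last: 7

narrow = estimate_records_in_range(10, 11, 100, fake_exact)
wide = estimate_records_in_range(10, 19, 100, fake_exact)
```

With these inputs the narrow range takes the exact path and touches two leaf pages, while the wide range is estimated without any leaf reads.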
The prefetch mechanism I made prefetches up to innodb_prefetch_pages during
read_multi_range_first(), and then prefetches the next innodb_prefetch_pages when needed
during read_multi_range_next(). The prefetches are done at the same time, asynchronously.
This allows your disk or RAID controller to do smart things like reordering and merging
the requests, which is otherwise impossible. The pages are read into the buffer pool, after
which the normal InnoDB code kicks in and gets the pages from there. The internal row
fetching code therefore triggers almost no direct (synchronous) I/O.
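The windowing can be sketched roughly like this in Python. This is only a model of the idea, not the patch: the RangePrefetcher class and the submit_batch callback are made up for the example, prefetch_pages stands in for innodb_prefetch_pages, and "when needed" is assumed here to mean "once the previous window has been fully consumed".

```python
from collections import deque


class RangePrefetcher:
    """Toy model of windowed prefetching (hypothetical, not InnoDB code)."""

    def __init__(self, pages, prefetch_pages, submit_batch):
        self._pending = deque(pages)    # pages the scan will touch, in order
        self._window = prefetch_pages   # stand-in for innodb_prefetch_pages
        self._submit = submit_batch     # issues one async request batch
        self._in_flight = 0             # prefetched but not yet consumed

    def _refill(self):
        batch = []
        while self._pending and len(batch) < self._window:
            batch.append(self._pending.popleft())
        if batch:
            # The whole window goes out in one submission, so the disk or
            # controller sees all requests at once and can reorder/merge them.
            self._submit(batch)
            self._in_flight += len(batch)

    def first(self):
        """Models read_multi_range_first(): prefetch the first window."""
        self._refill()

    def next(self):
        """Models read_multi_range_next(): the scan consumed one page."""
        if self._in_flight > 0:
            self._in_flight -= 1
        if self._in_flight == 0:
            self._refill()


submitted = []
scan = RangePrefetcher(range(10), prefetch_pages=4, submit_batch=submitted.append)
scan.first()
for _ in range(4):
    scan.next()
# submitted is now [[0, 1, 2, 3], [4, 5, 6, 7]]
```

A real implementation would more likely refill before the window is completely drained so the scan never stalls; the simple policy here just keeps the example short.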
I then remapped index_read() to do a range read as well, but I think there may be
some issues with that. The main reason I want to do that is when you're doing an
index_read with an index prefix, you can prefetch all the matches, which are requested
with index_next() after that. But it probably still breaks something.
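To make the index-prefix case concrete, here is a toy Python sketch of turning a prefix lookup into a range, so that everything a later index_next() would return is known up front and could be prefetched. The sorted list of tuples stands in for the index and prefix_range is a hypothetical helper; none of this is the handler API.

```python
from bisect import bisect_left


def prefix_range(sorted_keys, prefix):
    """Return (lo, hi) so that sorted_keys[lo:hi] are all keys starting with prefix."""
    n = len(prefix)
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so bisect_left lands on the first matching key.
    lo = bisect_left(sorted_keys, prefix)
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi][:n] == prefix:
        hi += 1
    return lo, hi


# Two-column index (a, b); look up all entries with a == 2.
index_keys = [(1, "a"), (2, "a"), (2, "b"), (2, "c"), (3, "a")]
lo, hi = prefix_range(index_keys, (2,))
matches = index_keys[lo:hi]
```

Here matches holds the three entries with a == 2, which is the set of rows whose pages could be requested in one go instead of one index_next() at a time.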
On my test machine (6-disk 10k rpm SAS, dual quad-core Xeon 3.0 GHz, 30 GB unfragmented
table, 4 GB buffer pool) this gives me the following:
- Normal performance (no O_DIRECT): 15 qps
- O_DIRECT performance: 10 qps (due to no readahead)
- My patch with O_DIRECT: 30 qps
A nice thing to see is that your iostats soar to 2000 IOPS with just a single query running.
Although the patch is aimed at increasing the speed of a single query, theory and a few small
tests say that it also improves performance when there are a lot of parallel queries.
I still need to test with fragmented tables, and I think there are probably various
memleaks and stuff in there. Also rnd_* needs to be mapped for prefetch too. I think there
is also a problem with the estimate code if the tree contains no node pages (small
The patch is at http://download.zarafa.com/zarafa/innodb_range_prefetch.diff
It's against 5.5.6-rc.
I'm not sure how useful it is in 5.1, since 5.1 has no AIO on Linux.