List:Falcon Storage Engine« Previous MessageNext Message »
From:Ann W. Harrison Date:December 14 2008 9:35pm
Subject:Record cache
View as plain text  
Hello all,

The question of the record cache came up during the review
of Falcon, with the implication that we're wasting cycles
copying records from the page cache to the record cache.
That's true if you're making a linear pass over a whole
table with no intention of returning to any previously
read record.  But that's not Falcon's typical use case.
When Jim and I talked about it later, he had several
reasons for using a separate cache for records.

The first reason is that a record cache is necessary to keep
record versions out of the database.  More or less by definition,
the page cache consists of images of database pages from disk.
Dirty pages in the page cache are written to disk during every
checkpoint.  Keeping record versions in memory only avoids
having to remove them from the database - like PostgreSQL's
vacuum or Firebird's garbage collector - expensive I/O operations.
Yes, we could have a separate cache just for old versions and
another just for new, uncommitted versions, but why?

The usual explanation is that the record cache stores more
records in a given amount of space than the pages cache.
Data pages have overhead - page header plus two 32bit words
per record for length and offset on page.  Data pages also
have wasted space - because they're not yet full, or because
they filled to a level where a new record will no longer fit,
or because records have been deleted from them.  In many
applications, only some of the records on a page are interesting.
Other records on the page just take up space.

Another reason is that accessing records from the record cache
is faster than accessing them from the page cache. The worst
case is a fragmented record - stored on two or more pages.  To
be useful, it must be copied somewhere and put back together.
But even in the case of non-fragmented records, finding a record
in the record cache is a direct access from the record tree.
Finding it by record number on a data pages requires decoding
the record number, finding the record locator page, indexing
into it to get the data page number and index offset, then
using the index offset on the data page to get the offset of
the actual record on page.

A final consideration is that records in the record cache
have a field offset vector on the front.  As fields are
referenced, their offset in the record is stored in that
vector, so the next reference to the field does not require
scanning and decoding the record up to that field.  Again,
not much use if you're just ripping records off the disk,
never to be seen again, but very useful if you have a
cache that holds a significant number of frequently used
records.

One question that I thought I heard was "Why not read records
directly into the record cache?"  One answer is that until
you've read the data page, you don't know exactly where the
record is on that page.  That indirection allows pages to
be compressed dynamically without affecting anything outside
of that page.  Another is that records in the same table vary
in length, and the data page is the only place where the
length of a particular record is stored.  And finally,
my (very out of date) understanding of disks is that
fixed size block reads and writes are more efficient and
easier to optimize than random offset random length access.

Cheers,


Ann

Thread
Record cacheAnn W. Harrison14 Dec