From: Ann W. Harrison Date: December 14 2008 9:35pm Subject: Record cache List-Archive: http://lists.mysql.com/falcon/306 Message-Id: <49457C0D.8020500@mysql.com> MIME-Version: 1.0 Content-Type: text/plain; format=flowed; charset=ISO-8859-1 Content-Transfer-Encoding: 7BIT Hello all, The question of the record cache came up during the review of Falcon, with the implication that we're wasting cycles copying records from the page cache to the record cache. That's true if you're making a linear pass over a whole table with no intention of returning to any previously read record. But that's not Falcon's typical use case. When Jim and I talked about it later, he had several reasons for using a separate cache for records. The first reason is that a record cache is necessary to keep record versions out of the database. More or less by definition, the page cache consists of images of database pages from disk. Dirty pages in the page cache are written to disk during every checkpoint. Keeping record versions in memory only avoids having to remove them from the database - like PostgreSQL's vacuum or Firebird's garbage collector - expensive I/O operations. Yes, we could have a separate cache just for old versions and another just for new, uncommitted versions, but why? The usual explanation is that the record cache stores more records in a given amount of space than the pages cache. Data pages have overhead - page header plus two 32bit words per record for length and offset on page. Data pages also have wasted space - because they're not yet full, or because they filled to a level where a new record will no longer fit, or because records have been deleted from them. In many applications, only some of the records on a page are interesting. Other records on the page just take up space. Another reason is that accessing records from the record cache is faster than accessing them from the page cache. The worst case is a fragmented record - stored on two or more pages. To be useful, it must be copied somewhere and put back together. But even in the case of non-fragmented records, finding a record in the record cache is a direct access from the record tree. Finding it by record number on a data pages requires decoding the record number, finding the record locator page, indexing into it to get the data page number and index offset, then using the index offset on the data page to get the offset of the actual record on page. A final consideration is that records in the record cache have a field offset vector on the front. As fields are referenced, their offset in the record is stored in that vector, so the next reference to the field does not require scanning and decoding the record up to that field. Again, not much use if you're just ripping records off the disk, never to be seen again, but very useful if you have a cache that holds a significant number of frequently used records. One question that I thought I heard was "Why not read records directly into the record cache?" One answer is that until you've read the data page, you don't know exactly where the record is on that page. That indirection allows pages to be compressed dynamically without affecting anything outside of that page. Another is that records in the same table vary in length, and the data page is the only place where the length of a particular record is stored. And finally, my (very out of date) understanding of disks is that fixed size block reads and writes are more efficient and easier to optimize than random offset random length access. Cheers, Ann