List:Falcon Storage Engine« Previous MessageNext Message »
From:Christopher Powers Date:February 13 2009 4:59am
Subject:Falcon Backlogging
View as plain text  

In an effort to more tightly couple backlogging and scavenging, several
questions about backlogging have arisen, especially with regard to

Kevin and I have assembled several questions, concerns and ideas about
backlogging. Can you take some time after the Friday team conference call
to discuss this with Kevin and I?




Falcon Backlogging

Falcon's holy grail is that "out of memory" errors should never
(or very, very rarely) be encountered. Is it achievable?

In principle, yes. "Enahanced" backlogging (see below) stands as

In practice, however, the expanded roll of backlogging raises several
issues with concurrency and puts the entire concept of backlogging
into question, "enhanced" or otherwise.

Load-based scavenges are triggered by the failure of record version
memory allocation. An out-of-memory exception will occur if a load-based
scavenge cannot free enough memory for the record allocation to succeed.

The record chill process moves record data from a transaction into the
serial log--sort of a pre-commit. Record chilling is triggered by
transaction size rather than the state of the record cache, and is
orthogonal to the record scavenging process.

Record backlogging is a more radical form of record chilling. It moves a
record and all of its prior record versions out of the record cache and
into a temporary table. Record backlogging is driven strictly by the
record chill process, and is therefore independent of the state of the
record cache.

Enhanced Backlogging
In a memory-constrained environment, out-of-memory exceptions can be
prevented if record chilling (and therefore backlogging) is triggered by
the failure of a record version allocation, just as are load-based
scavenges. This mechanism has been prototyped and shown to work for test
cases that previously triggered an out-of-memory exception.

For example, an INSERT/SELECT scenario that failed at 8 million rows can
now insert up to 32 million rows, despite a record memory max setting of
250 MB. In fact, the record cache never exceeded 200 MB (the test was
arbitrarily stopped at 32 million rows.)

Admittedly, a small record cache size will result in a slow response, but
the engine was still able to manage the load. This is the robust behavior
that users are likely to expect.

Here is how enhanced backlogging works:

1. Scavenger triggers low memory condition
    - Database::scavengeRecords() detects low memory, sets low memory flag
    - scavengeRecords() calls TransactionManager::setLowMemory()
    - TransactionManager::setLowMemory() sets lowMemory flag in all
      active transactions

2. Alloc record version forces backlog in current transaction BEFORE 
    - Table::allocRecordVersion() first checks low memory flag in 
    - If low memory, allocRecordVersion() forces Transaction::chillRecords()
    - chillRecords() causes backlog of records in current transaction

3. Other transactions will also force backlogging

4. When enough memory is freed, Scavenger clears low memory condition

Criteria for Backlogging
Backlogging criteria must be strict because backlogging removes
not only a given record but its entire record chain from the record
cache.  Until now, only the first of two backlogging requirements
was checked:

A. Record version must have already been chilled
     - Assurred via Transaction::chillRecords() criteria

B. No other thread or memory object can have a pointer to the memory
that will be released by backlogging this record chain.  If there is,
then Transaction::backlogRecord() needs to clear those pointers.

Backlogging Concurrency Issues
Since backlogging a record chain will remove records from memory, there
can be no other dangling pointers to this memory. Here is a summary of
possible pointer locations to Record and RecordVersion objects. In
theory, each one of these would have a useCount on the Record object it
is pointing to:

	1. RecordLeaf::records  <-- base record only
	2. Recordversion::priorVersion
	3. Transaction chain
	4. Savepoint::records
	5. Transaction::chillPoint
	6. StorageTable::record
	7. IndexWalker::currentRecord
	8. Stack based Record pointer

I think the following are Netfrastructure only:
	9. Context::record
	10. TriggerRecord::record
	11. SortRecord::object

We need to resolve the effect of each of these pointer locations to
determine if backlogging can be done.

Currently, backlogging handles locations 1,2 3 & 4. It takes the base
record off the Record Leaf, calls Transaction::removeRecord() to pull
it off the Transaction list, calls savePoint::backlogRecord to get a
bitmap of backlogged records, and resets the priorRecord pointers when
the chain is rebuilt from the backlogged stream.

(TBD: Where is Savepoint::records actually set to null?)

However, pointer locations 5-11 still present a concurrency problem:

- Is it true that a full record chain CANNOT be backlogged if any
   record in the chain contains a useCount > 2?

- If RecordVersion::transaction == NULL, the useCount should be 1.
   This is currently not checked.

- How can we detach a record chain from the RecordLeaf without
   affecting other threads that are concurrently looking through
   the chain or examining some record in the chain?

- How can we ensure that stack-based pointers to a record are
   reflected in the useCount when we look for the record?

IDEA: Table::fetchForBacklog()
The following is an idea for 'grabbing' a record chain to be backlogged:

1. Exclusive lock record leaf
2. Trade the base record for a NULL pointer
3. Set recordBitmap in Table.
4. Exclusive lock on syncPrior for record chain
5. Inspect use counts on prior version chain
    A. One useCount for the priorRecord or RecordLeaf pointer
    B. One useCount for each recordversion that has a transaction
6. Accept or reject the record for backlogging.

Question:  Is it possible to do a limited version of a backlogging when
the useCount is too high?  Here are two types of limited backlog:

1. Backlog only data buffers.
2. Backlogging Intermediate Record Versions

IDEA: Backlog only data buffers
Most databases that have a place for old record versions on disk will
write out only the data itself.  Active transaction and versioning
metadata always build up in memory.  Falcon backlogging is unique in
that it attempts to write out of memory the record metadata also.

IDEA: Backlog Only Intermediate Record Versions
Most traversals of prior version chain are superficial such that a
pointer to the intermediate version is not usually held outside a shared
lock on syncPrior. The exception is stack-based pointers to the visible
commmitted records.  Is it possible, then, to backlog intermediate
record versions?

The following are "deep" traversals, where the contents of ALL prior
record versions must be examined:

1. RecordVersion::scavengeSavepoint
    -> Scavenge::garbageCollect()
2. Scavenge::pruneRecords()
3. Transaction::releaseSavepoint()
    -> RecordVersion::scavengeSavepoint()
    -> Table::garbageCollect()
4. Table::checkUniqueRecordVersion()

Miscellaneous Questions
1. Within each savePoint, there should never be more than one version of
the same record number. Is that true?  Can a savepoint be partially
backlogged?  Where is Savepoint::records actually set to null?

Falcon BackloggingChristopher Powers13 Feb
  • Re: Falcon BackloggingJim Starkey13 Feb
    • Re: Falcon BackloggingChristopher Powers13 Feb