Thank you very much. It helps me a lot.
From: Ann.Harrison@stripped [mailto:Ann.Harrison@stripped] On Behalf Of Ann W. Harrison
Sent: 2008年11月26日 4:17
To: Hu, Xuekun
Cc: Ann.Harrison@stripped; FalconDev
Subject: Re: Several questions about Falcon Index?
> Hi, Ann
> Currently I'm looking at Falcon code to understand its index
> architecture. I want to understand how Falcon does an index
> search mechanism. I have several questions; hope you give me
> some answers.
OK, you're going to get more of an answer than you may have
anticipated, but here goes.
Physically, Falcon indexes are b*trees, containing nodes comprised of
a key value and either a record number, or the page number of a lower
level index page. Generally keys are both suffix and prefix compressed,
though some entries on each page are not compressed. Upper level keys
include the record number for the matching entry in the lowest level
of the index.
Indexes grow from the bottom up. An empty index consists of a single
page, and is level 0. When that page fills, it splits, and a new
higher level page is created as level 1, which points to the two
level 0 pages. Normally when an index page splits, half the
nodes on the page stay with the old page and half go to the new
page. If the page being split is the last page in the level,
the new page contains only one entry
All keys are mangled so they sort correctly bytewise. All numeric
keys are converted to floating point before mangling. Keys for
non-binary alphanumeric collations are expanded as necessary to
provide the desired order in a byte-wise binary comparison. Unlike
other MySQL engines, Falcon does not call back into the engine to
compare index keys. It does call into the engine to get the appropriate
conversion to a string of bytes that compare correctly.
There's also a bit of magic in the comparison to handle the fact that
the mother of all character sets was designed for teletypes, and spaces
are 0x20 and other legitimate characters are lower hex values. That
magic consists of adding an imaginary space to the end of both the
key value and the value being compared. Obviously, there's special
handling for fixed length multi-byte character sets.
Suffix compression removes trailing blanks from alphanumeric keys
and trailing zeros from numeric keys. It is applied to all parts
of a compound key. For single field keys, the trailing blanks or
spaces are adjusted so each field is some multiple of four bytes
long. (e.g. a field declared as char or varchar (20) containing
the value 'abc' becomes a four byte string 'abc ' (four bytes)
and 'abcde' becomes 'abcde ' (8 bytes). Then, a tag byte is
inserted after every four bytes, containing the position of the
field in the key. So, a composed key of last_name varchar(40)
first_name varchar(30) containing 'Ann' 'Harrison' becomes an
index key 'Ann 0Harr1ison1' - a considerable saving over
'Ann Harrison'. And, even better,
it recognizes the difference between 'Ann' 'Harrison' and
'An ' 'nHarrison', which many earlier algorithms didn't.
Suffix compression is performed on all keys.
Most keys also suffer prefix compression. Prefix compression just
means that the bytes that are the same as the previous key are
suppressed. Thus '12346', '12347', 12347', '12349', '12350'
become (0)12346, (4)7, (5), (4)9, (3)50 - where the number in
parentheses is the number of identical leading bytes which are
suppressed. Duplicates compress to just a count byte.
The first key on each page has a fully expanded prefix. In addition,
prefix expanded keys are scattered across the page and indexed by an
array of "super nodes" at the beginning of a page. Starting at a
page size of 4K, the cost of a scan and decompression becomes
measurable. The supernodes reduce the length of each scan and
allow a binary search of sections of the page. The middle supernode
contains the offset of a prefix-expanded nod near the middle of the
page. If that node is larger than the value to be found, the
searcher picks an lower supernode. If smaller, the searcher
picks a higher supernode.
Other key quirks.
The last node on each index page has the same value as the first
value on the next page. That reduces the number of page that
need to be read when looking for a non-existent value.
Upper level nodes include the record number of the matching
lowest level node, and lowest level duplicate nodes are stored in
record number order. This prevents problems that can occur when
trying to remove entries from long chains of duplicate values.
> 2. What's the purpose of DeferredIndex class?
When a transaction creates a new record version for a table with indexed
fields, it creates a temporary shadow of each index containing just its
changes to those indexes - new entries on insert, and entries for key
field changes on update.
That temporary index is called a deferred index. On commit, each
deferred index created by the transaction is written in its
entirety to the serial log. The gopher thread merges the
deferred index with the primary index. The main purpose of
deferred indexes is to reduce the cost of adding new entries
to an index and to avoid the need to remove entries from an index
after a rollback or rollback of a save point.
The complication of deferred indexes is that there is no one place
to look for index entries. There is the "parent" index which is stored
on database pages in the database and in the page cache, and there are
the memory resident deferred indexes with recent changes. The main
parent index has most of the entries, but some may exist in deferred
Normally, a transaction has to look only in its own deferred index
and in the parent index to find entries that it can see. When
checking for uniqueness of a new entry, Falcon must check the parent
index and all deferred pieces of the index.
One of the interesting states of a Falcon transaction is "committed
but not write complete". The transaction is committed when all its
changes have been written to the serial log and the serial log has
been flushed. However, until the gopher moves all the changes from
the serial log to the database, the transaction is not "write complete".
While a transaction is in that state, it must stick around and keep
all its deferred indexes. Other transactions will reference its
deferred indexes as necessary when their access uses the parent of
those deferred indexes.
Falcon normally does its indexed accesses in two stages. First, it
reads the index, setting bits in a sparse bit map of record numbers
when it finds a match to the search criteria. The second phase reads
the identified records in bit map order.
This has several advantages and one disadvantage.
One advantage is that since access is made first to the index
in a single pass, locking and repositioning the index is minimal
to non-existent. Another is that Falcon is not bouncing back
and forth between the index and the data, so positioning indexes and
data separately on disks is not necessary for performance. A third
advantage is that records are read in storage order - Falcon never
reads a data page twice for the same indexed lookup. Finally, and
unfortunately not with MySQL, Falcon has the ability to combine
bitmaps created from different indexes (e.g. state = 'MA' and
last_name = 'Harrison') to reduce the number of records read.
The disadvantage is that records aren't retrieved in indexed order.
This is bad when there's a LIMIT clause - the main query asks for
several thousand rows, then throws away all but the first three.
If the rows are guaranteed to arrive in index order and the index
matches the order by clause, LIMIT is much cheaper.
So, Falcon also has a mode in which it reads the index, retrieves that
row, reads the index again, retrieves that row, and so on. In its
normal operation, it can access the parent index first and then any
deferred indexes that it can see - meaning that they were created by
transactions that are visible to the searching transaction but are
not yet write complete. To return rows in indexed order, Falcon has
to read all the deferred indexes in parallel and merge the results
with the main index.
We tend to call the first method "bitmap" or "two-stage" index
access, and the second "navigational".
Normally we use bitmap access. For queries with LIMIT clauses and
supporting indexes, the server passes us a hint that says, "please
return rows in indexed order" and Falcon uses navigational access.
(I believe there are other cases where the server asks for results
in index order, but can't remember at the moment which they are."=)
> 1. What's the difference between Index::scanIndex and Index::positionIndex?
Index::scanIndex starts the "bitmap" index access method.
Index::positionIndex starts the "navigational" index access.
> BTW. Do you happen to have some docs about Falcon Index Design that can be shared with me? :)
Sorry, there aren't any. These are old family recipes for indexes that
have evolved over 25 years.