Falconers all,
Falcon's indexes were designed to be as dense as possible - the
keys are variable length with both suffix and prefix compression on
all key values, including suffix compression on the leading segments
of compound keys. Compressing the trailing blanks on compound keys
will lead to confusion unless some information is added to separate
the keys.
If these fields with these values form a compound key
Field1 Field2
a bcdefgh
ab cdefgh
abc defgh
abcd efgh
abcde fgh
abcdef gh
abcdefg h
If you just strip the spaces and stick the two fields together, all
those values are equivalent.
What Falcon does is pad each leading field out with spaces to a
multiple of four bytes (I think, not characters, though that
should be checked and probably changed for UTF-16). It then inserts
a binary byte after each group of four characters. The byte holds
the position of the field in the key. So the value above would
become
Field1 Field2
a<0x20><0x20><0x20><0x0>bcde<0x1>fgh
ab<0x20><0x20><0x0>cdef<0x1>gh
abc<0x20><0x0>defg<0x1>h
abcd<0x0>efgh
abcd<0x0>e<0x20><0x20><0x20><0x0>fgh
abcd<0x0>ef<0x20><0x20><0x0>gh
abcd<0x0>efg<0x20><0x0>h
(Putting the field position byte every five bytes is arbitrary -
spacing them further apart reduces the key growth at a cost of
raising the minimum key size.)
The question that concerns me at the moment is whether introducing
tab <0x7> characters causes a problem with this trick. I don't
think so, but ... Here are the cases, with each line of the
values above preceded by one that has a <tab> as the last byte
of the first field.
values:
Field1 Field2
a<0x7> bcdefgh
a bcdefgh
ab<0x7> cdefgh
ab cdefgh
abc<0x7> defgh
abc defgh
abcd<0x7> efgh
abcd efgh
abcde<0x7> fgh
abcde fgh
abcdef<0x7> gh
abcdef gh
abcdefg<0x7> h
abcdefg h
encoding:
Field1 Field2
a<0x7><0x20><0x20><0x0>bcde<0x1>fgh
a<0x20><0x20><0x20><0x0>bcde<0x1>fgh
ab<0x7><0x20><0x0>cdef<0x1>gh
ab<0x20><0x20><0x0>cdef<0x1>gh
abc<0x7><0x0>defg<0x1>h
abc<0x20><0x0>defg<0x1>h
abcd<0x0><0x7><0x20><0x20><0x20><0x0>efgh
abcd<0x0>efgh
abcde<0x0><0x7><0x20><0x20><0x0>fgh
abcd<0x0>e<0x20><0x20><0x20><0x0>fgh
abcd<0x0>ef<0x7><0x20><0x0>gh
abcd<0x0>ef<0x20><0x20><0x0>gh
abcd<0x0>efg<0x7><0x0>h
abcd<0x0>efg<0x20><0x0>h
If anyone has followed this, do you agree that it seems to
get the right answer?
Best,
Ann