Ann,
Embedding the tab character in this sequence works just fine and does
sort correctly.
Incidently, the run size is 6 bytes which includes the end-of-segment
byte. So we each RUN contains 5 bytes of the key (not 4), unless it is
the last RUN.
storage\falcon\IndexKey.h(36):static const uint RUN = 6;
Ann W. Harrison wrote:
> Falconers all,
>
> Falcon's indexes were designed to be as dense as possible - the
> keys are variable length with both suffix and prefix compression on
> all key values, including suffix compression on the leading segments
> of compound keys. Compressing the trailing blanks on compound keys
> will lead to confusion unless some information is added to separate
> the keys.
>
> If these fields with these values form a compound key
>
> Field1 Field2
>
> a bcdefgh
> ab cdefgh
> abc defgh
> abcd efgh
> abcde fgh
> abcdef gh
> abcdefg h
>
> If you just strip the spaces and stick the two fields together, all
> those values are equivalent.
>
> What Falcon does is pad each leading field out with spaces to a
> multiple of four bytes (I think, not characters, though that
> should be checked and probably changed for UTF-16). It then inserts
> a binary byte after each group of four characters. The byte holds
> the position of the field in the key. So the value above would
> become
>
> Field1 Field2
>
> a<0x20><0x20><0x20><0x0>bcde<0x1>fgh
> ab<0x20><0x20><0x0>cdef<0x1>gh
> abc<0x20><0x0>defg<0x1>h
> abcd<0x0>efgh
> abcd<0x0>e<0x20><0x20><0x20><0x0>fgh
> abcd<0x0>ef<0x20><0x20><0x0>gh
> abcd<0x0>efg<0x20><0x0>h
>
> (Putting the field position byte every five bytes is arbitrary -
> spacing them further apart reduces the key growth at a cost of
> raising the minimum key size.)
>
> The question that concerns me at the moment is whether introducing
> tab <0x7> characters causes a problem with this trick. I don't
> think so, but ... Here are the cases, with each line of the
> values above preceded by one that has a <tab> as the last byte
> of the first field.
>
> values:
>
> Field1 Field2
>
> a<0x7> bcdefgh
> a bcdefgh
> ab<0x7> cdefgh
> ab cdefgh
> abc<0x7> defgh
> abc defgh
> abcd<0x7> efgh
> abcd efgh
> abcde<0x7> fgh
> abcde fgh
> abcdef<0x7> gh
> abcdef gh
> abcdefg<0x7> h
> abcdefg h
>
>
> encoding:
>
> Field1 Field2
>
> a<0x7><0x20><0x20><0x0>bcde<0x1>fgh
> a<0x20><0x20><0x20><0x0>bcde<0x1>fgh
> ab<0x7><0x20><0x0>cdef<0x1>gh
> ab<0x20><0x20><0x0>cdef<0x1>gh
> abc<0x7><0x0>defg<0x1>h
> abc<0x20><0x0>defg<0x1>h
> abcd<0x0><0x7><0x20><0x20><0x20><0x0>efgh
> abcd<0x0>efgh
> abcde<0x0><0x7><0x20><0x20><0x0>fgh
> abcd<0x0>e<0x20><0x20><0x20><0x0>fgh
> abcd<0x0>ef<0x7><0x20><0x0>gh
> abcd<0x0>ef<0x20><0x20><0x0>gh
> abcd<0x0>efg<0x7><0x0>h
> abcd<0x0>efg<0x20><0x0>h
>
>
> If anyone has followed this, do you agree that it seems to
> get the right answer?
>
> Best,
>
> Ann
>