List:Falcon Storage Engine« Previous MessageNext Message »
From:Kevin Lewis Date:December 3 2008 11:33pm
Subject:Re: trailing spaces and compound keys and tabs
View as plain text  
Ann,

Embedding the tab character in this sequence works just fine and does 
sort correctly.

Incidently, the run size is 6 bytes which includes the end-of-segment 
byte.  So we each RUN contains 5 bytes of the key (not 4), unless it is 
the last RUN.

storage\falcon\IndexKey.h(36):static const uint RUN = 6;


Ann W. Harrison wrote:
> Falconers all,
> 
>    Falcon's indexes were designed to be as dense as possible - the
> keys are variable length with both suffix and prefix compression on
> all key values, including suffix compression on the leading segments
> of compound keys.  Compressing the trailing blanks on compound keys
> will lead to confusion unless some information is added to separate
> the keys.
> 
> If these fields with these values form a compound key
> 
> Field1   Field2
> 
> a        bcdefgh
> ab       cdefgh
> abc      defgh
> abcd     efgh
> abcde    fgh
> abcdef   gh
> abcdefg  h
> 
> If you just strip the spaces and stick the two fields together, all
> those values are equivalent.
> 
> What Falcon does is pad each leading field out with spaces to a
> multiple of four bytes (I think, not characters, though that
> should be checked and probably changed for UTF-16). It then inserts
> a binary byte after each group of four characters.  The byte holds
> the position of the field in the key.  So the value above would
> become
> 
> Field1         Field2
> 
> a<0x20><0x20><0x20><0x0>bcde<0x1>fgh
> ab<0x20><0x20><0x0>cdef<0x1>gh
> abc<0x20><0x0>defg<0x1>h
> abcd<0x0>efgh
> abcd<0x0>e<0x20><0x20><0x20><0x0>fgh
> abcd<0x0>ef<0x20><0x20><0x0>gh
> abcd<0x0>efg<0x20><0x0>h
> 
> (Putting the field position byte every five bytes is arbitrary -
> spacing them further apart reduces the key growth at a cost of
> raising the minimum key size.)
> 
> The question that concerns me at the moment is whether introducing
> tab <0x7> characters causes a problem with this trick.  I don't
> think so, but ...  Here are the cases, with each line of the
> values above preceded by one that has a <tab> as the last byte
> of the first field.
> 
> values:
> 
> Field1        Field2
> 
> a<0x7>        bcdefgh
> a             bcdefgh
> ab<0x7>       cdefgh
> ab            cdefgh
> abc<0x7>      defgh
> abc           defgh
> abcd<0x7>     efgh
> abcd          efgh
> abcde<0x7>    fgh
> abcde         fgh
> abcdef<0x7>   gh
> abcdef        gh
> abcdefg<0x7>  h
> abcdefg       h
> 
> 
> encoding:
> 
> Field1         Field2
> 
> a<0x7><0x20><0x20><0x0>bcde<0x1>fgh
> a<0x20><0x20><0x20><0x0>bcde<0x1>fgh
> ab<0x7><0x20><0x0>cdef<0x1>gh
> ab<0x20><0x20><0x0>cdef<0x1>gh
> abc<0x7><0x0>defg<0x1>h
> abc<0x20><0x0>defg<0x1>h
> abcd<0x0><0x7><0x20><0x20><0x20><0x0>efgh
> abcd<0x0>efgh
> abcde<0x0><0x7><0x20><0x20><0x0>fgh
> abcd<0x0>e<0x20><0x20><0x20><0x0>fgh
> abcd<0x0>ef<0x7><0x20><0x0>gh
> abcd<0x0>ef<0x20><0x20><0x0>gh
> abcd<0x0>efg<0x7><0x0>h
> abcd<0x0>efg<0x20><0x0>h
> 
> 
> If anyone has followed this, do you agree that it seems to
> get the right answer?
> 
> Best,
> 
> Ann
> 
Thread
trailing spaces and compound keys and tabsAnn W. Harrison4 Dec
  • Re: trailing spaces and compound keys and tabsKevin Lewis4 Dec
    • Re: trailing spaces and compound keys and tabsAnn W. Harrison4 Dec
  • Re: trailing spaces and compound keys and tabsAnn W. Harrison5 Dec