List:Falcon Storage Engine« Previous MessageNext Message »
From:Kevin Lewis Date:April 2 2009 3:42am
Subject:Re: Indexes and their walkers (again, slightly edited)
View as plain text  

Your investigations seem to be very good keep it up.  Please be aware 
that some indexes do indeed have a lot more entries in them than there 
are records.  Index key entries are not deleted until the old record 
version is deleted by the scavenger.  When a record is pruned, the 
garbageCollect will clean up the index of a key value pointing to that 
record number if that is the last record version with that key value.

Your sample file seems strange to have so many key values in the index. 
9 times more than records?  This is worth investigation.

Be aware that the last key value on a key page is the same as the first 
key value on the next page.  It is repeated.

Lars-Erik Bjørk wrote:
> Hi all,
> I am resending this mail, that I sent a couple of weeks ago, hoping to get
> some input that can help me move on. At Kevin's request, I have put my
> questions at the beginning this time, however, some of them will not make
> much sense without the context given below. I have also tagged pieces of
> the text so that you can skip directly to the part giving the context to 
> the question
> of your choice. However, the information is not perfectly modularized :)
> I am working with stabilizing the LIMIT feature, and I am running
> Philip's RQG to expose the bugs. So far I have fixed a few, but there
> seems to be more of them hiding behind each other.
> So my questions are:
> 1. Ann, do you agree that we are comparing the wrong keys after
> repositioning?

If you say so.  A fix should be fairly obvious.

> 2. How long should it take for old versions to be removed from the
> index? (I would believe this was done quickly)

Answered above.  It depends on the retension of old record versions in 
the record cache.  However, if a crash occurs.  The old records aree 
lost, they are not in the serial log of the file.  So there is no way to 
clean up the index short of dropping it and re-creating it.

> 3. If a record A changes it's value from x to y and back to x again, and
> there are three entries for it in the index. Is there any chance
> that another record B, also with value x, can be positioned between the
> two entries for A?

x -> A
x -> B
y -> A

But note that even if the record cache contains two versions of the 
record that contain x in that indexed field, They would effectively have 
the same key node; value x pointing to record A.  So there will only be 
one index entry for these two record versions.

> 4. Does anyone (Ann?:)) have a gut feeling of what is going on?

Hope my comments help.

> <Q4>
> When I started running the test as described at
>, I would get 0 rows returned
> instead of 1000, when executing a LIMIT query designed to return all the
> rows in a table. This was reproducible, and did not involve any deferred
> indexes. The walker structure is therefore simple, and only involves a
> WalkIndex that traverses the index at the bottom level.
> For the index nodes, we have two record numbers with special meaning.
> These are
> END_BUCKET (-1) - This is the last node in this page (I hope I interpret
> this correctly :) )
> END_LEVEL (-2) - This is the last node at this level
> At first I discovered that given enough updates, the first index page
> may contain only the END_BUCKET node. There is special handling of the
> first node of every page, but this logic was not able to handle this,
> and we would stop the search without finding any matching records. I
> have changed the behavior to reposition the index to the next page, and 
> continue
> the search (IndexRootPage::repositionIndex). This made the
> test return more rows, but still not all of them (and sometimes still 
> zero).
> <Q1>
> When traversing the nodes, we check each index key value against a key
> we build from the record the node refers to
> (IndexWalker::getValidatedRecord). When we reach the end of an index
> page we reposition to the next page and continue from there. When
> debugging, after "fixing" the first error, I saw that for many records,
> none of the versions in the index would match the record. I realized
> that when we reposition to the next page, we don't update the (local
> version of the) index key value that we compare to (at least I have not
> found the place it is done). It therefore seems like we compare the key
> built from the first referred record in a page to the index value of the
> last node in the previous page.
> </Q1>
> <Q2>
> Changing this improves the situation, but still not to perfection. In most
> runs of the test, we are returned 1001 rows instead of 1000 rows, or
> 101 instead of 100. In these cases, it is always the first
> record that is duplicated. However, every once in a while we are
> returned more than 1300 records instead of 1000. The excess records
> seem to be in a continuous sequence. Not a random record every now
> and then, but a sequence of records.
> I have no idea why this is happening. But I have made the observation
> that there seems to be a lot of versions in the index for each record.
> And the index does not seem to be cleaned. When counting,
> IndexWalker::getValidatedRecord was called almost 9000 times for a table
> with 1000 rows. I think this is strange.
> </Q2>
> </Q4>
> Thank you for reading the mail!
> Humble regards,
> Lars-Erik
> P.S: For the interested, my diff so far is the following:
> === modified file 'storage/falcon/WalkIndex.cpp'
> --- storage/falcon/WalkIndex.cpp    2009-03-02 08:16:53 +0000
> +++ storage/falcon/WalkIndex.cpp    2009-03-17 11:27:41 +0000
> @@ -88,9 +88,23 @@ int32 WalkIndex::getNextNode(void)
>             recordNumber = node.getNumber();
>             if (recordNumber >= 0)
> +                {
> +                node.expandKey(&indexKey);
>                 return recordNumber;
> -            else if (recordNumber == END_LEVEL || recordNumber == 
> +                }
> +            else if (recordNumber == END_BUCKET)
> +                {
> +                IndexRootPage::repositionIndex(index->dbb, 
> index->indexId, this);
> +                continue;
> +                }
> +            else if (recordNumber == END_LEVEL)
> +                {
>                 return -1;
> +                }
> +            else
> +                {
> +                ASSERT (false);
> +                }
>             }
>         node.getNext(endNodes);
Indexes and their walkers (again, slightly edited)Lars-Erik Bjørk1 Apr
  • Re: Indexes and their walkers (again, slightly edited)Kevin Lewis2 Apr