List:General Discussion« Previous MessageNext Message »
From:Michael T. Babcock Date:February 18 2003 4:03pm
Subject:Re: Slow retrieval of distinct on indexed fields [OT?]
View as plain text  
Dan Nelson wrote:

>You need to walk the entire index to make sure you have all the values.
>There might be a single "AAB" inbetween those million "AAA"'s and
>million "BBB"'s.
>  
>

Another DBA and I once discussed that an index of index values would be 
helpful for such large searches as web search engines (for example).  An 
index would list all the words that are indexed with their offsets 
within the index itself, and then those offsets would contain the 
locations in the document for that word; if you needed to find AAB, its 
the second entry (in alpha order) in the word index, and its list of 
positions within the document is the 10234th entry (or byte position) 
within the index file.  To know how many entries, one would simply grab 
the next index item (AAC in this case) and subtract (10235 - 10234 = 1 
entry for AAB).  

AAA -> 1
AAB -> 10234
AAC -> 10235

1: 12
2: 25
[...]
10233: 4285
10234: 73
10235: 4123

To top that off, finding closest matches to AAA with relation to AAC in 
a sentence (for example) would be simple as you can walk the index for 
AAA and AAC at the same time (since you know where both start in the 
index very quickly) and simply increment each according to the diff. 
between the position offsets in each (which are sorted in position order).

Just thinking out-loud, and no, I've never benchmarked it but I played 
with the idea in Python a few times as a proof-of-concept.

-- 
Michael T. Babcock
C.T.O., FibreSpeed Ltd. SQL
http://www.fibrespeed.net/~mbabcock


Thread
Slow retrieval of distinct on indexed fieldsRenars Jeromans18 Feb
  • Re: Slow retrieval of distinct on indexed fieldsDan Nelson18 Feb
    • Re: Slow retrieval of distinct on indexed fields [OT?]Michael T. Babcock18 Feb
  • Re: Slow retrieval of distinct on indexed fieldsPeter Grigor18 Feb
    • Re: Slow retrieval of distinct on indexed fieldsDan Nelson18 Feb
  • Re: Slow retrieval of distinct on indexed fieldsPeter Grigor18 Feb