Dan Nelson wrote:
>You need to walk the entire index to make sure you have all the values.
>There might be a single "AAB" inbetween those million "AAA"'s and
>million "BBB"'s.
>
>
Another DBA and I once discussed that an index of index values would be
helpful for such large searches as web search engines (for example). An
index would list all the words that are indexed with their offsets
within the index itself, and then those offsets would contain the
locations in the document for that word; if you needed to find AAB, its
the second entry (in alpha order) in the word index, and its list of
positions within the document is the 10234th entry (or byte position)
within the index file. To know how many entries, one would simply grab
the next index item (AAC in this case) and subtract (10235 - 10234 = 1
entry for AAB).
AAA -> 1
AAB -> 10234
AAC -> 10235
1: 12
2: 25
[...]
10233: 4285
10234: 73
10235: 4123
To top that off, finding closest matches to AAA with relation to AAC in
a sentence (for example) would be simple as you can walk the index for
AAA and AAC at the same time (since you know where both start in the
index very quickly) and simply increment each according to the diff.
between the position offsets in each (which are sorted in position order).
Just thinking out-loud, and no, I've never benchmarked it but I played
with the idea in Python a few times as a proof-of-concept.
--
Michael T. Babcock
C.T.O., FibreSpeed Ltd. SQL
http://www.fibrespeed.net/~mbabcock