Hi!
If you have a n rows table of x char. each. in disk of 4k pagesize, having
an index of y chars. per entry, with i entries. Selecting r number of rows
matching your WHERE restriction has the following plans and associated
costs:
Data Scanning (Dscan) Cost
It is a fixed nr. of physical reads to data disk = (n * x) / 4k
Index Scanning (Iscan) Cost
It is a nr. of phys. reads to index disk = (i * y) / 4k + Row Picking Cost;
Row Picking Cost = r variable number of random phys. reads to data disk
(not divided by 4k! except for clustered data on current index).
Case 1: Suppossing that r is few (i.e. 0:3), n is big (1M), rowsize is
middle (100 char.) and indexsize is middle (20 chars.)
Dscan cost = 1M * 100 / 4k = 24414 i/o's
Iscan cost = 1M * 20 / 4k = 4882 + 3 = 4885 i/o's [winner]
Case 2: Suppossing that r is big (i.e. 20% matching rows), n is big (1M),
rowsize is middle (100 char.) and indexsize is middle (20 chars.)
Dscan cost = 1M * 100 / 4k = 24414 i/o's [winner]
Iscan cost = 1M * 20 / 4k = 4882 + .2M = 204882 i/o's
Conclusion: Computing rawly the Iscan cost is trivial and implementing it,
too. However, a full Index scan plan cost varies a lot depending on the
resulting set size. I do not see a cheap and confident way to estimate the
size of a partial restriction.
Regards,
Enrique
Note 1. Optimization using Index must done the estimation of the resulting
set size. For short estimated set, Index Pick + Data Pick obtain cheap
costs because being function of matching rows instead of a fixed function
of table cardinality. But for big resulting set, Index Pick + Data Pick
can multiply easyly the Data Scan cost because Data Pick probably needs to
read full data page for reading a data row (except clustered data by
current index).
This suggest the case of very big data rows (=disk page size) which are
favoring index usage.
Case 3: Suppossing that r is big (i.e. 20% matching rows), n is big (1M),
rowsize is very big (4k char.) and indexsize is middle (20 chars.)
Dscan cost = 1M * 4k / 4k = 1M i/o's
Iscan cost = 1M * 20 / 4k = 4882 + .2M = 204882 i/o's [winner]
Note 2. You may have Index spaces whose indexrow number is different of
datarow number (eg.: Index on collection can have more entries than
original rows; Inverted List can have less entries than original rows
--because duplicated values are substituted by its ROWID in the list--), in
which case, i and n are different, but nothing changes to the conclusion.
----------
Desde: Michael Widenius[SMTP:monty@stripped]
Enviado el: domingo 4 de marzo de 2001 11:28
Para: Antony T Curtis
Cc: Thimble Smith; internals@stripped
Asunto: Re: Optimisation suggestion...
Hi!
Hi!
Sorry for the delayed reply; I have been on a vacation for a week and
are slowly catching up with all old emails.
>>>>> "Antony" == Antony T Curtis <antony@stripped> writes:
Antony> Thimble Smith wrote:
>>
>> On Wed, Feb 21, 2001 at 12:59:43PM +0000, Antony T Curtis wrote:
>> > In the case where there is a SELECT on a table where no index is
>> > directly applicable, if there is a compound index where the second
part
>> > (or third, etc) of the index is used in the WHERE clause, it should be
>> > possible to still use that index by being able to skip rows rather
than
>> > doing a brute-force search through all the records.
>> >
>> > Any hints where I should look to see if such an optimisation could be
>> > implemented?
>>
>> It isn't clear to me how that would work. Do you mean, to scan
>> the whole index in order to find rows where the secondary index
>> part matches, instead of scanning the data file? Or do you have
>> something else in mind? I think this optimization wouldn't work,
>> but I'm not sure I understand what you're suggesting.
>>
>> Tim
Antony> The kind of thing it'd do is something like this...
Antony> If the key-part to scan is the (n)th key part of the index,
Antony> 1. Seek index to start
Antony> 2. Read key from index
Antony> 3. Check (n)th key part against end-range, if less or equal, goto 6
Antony> 4. Increment (n-1)th key part and seek to greater/equal key
Antony> 5. goto stage 2
Antony> 6. Check (n)th key part against start-range, if greater/equal, goto
xx
Antony> 7. Set (n)th key part to start-range and seek to greather/equal key
Antony> 8. goto 2
Antony> 9. (the key should point to valid row in range)
Antony> 10. Seek to next key
Antony> 11. if not end of index, goto 6
The problem is that the above will only work if a major part of the
keys are identical in the first n-1 parts. (You would guess that at
least 1/3 of the keys must be identical per key block for this to
work, which is not a very common situation).
The major problem is that when it comes to scan indexes compared to
scan a fixed length data file, index scanning is at least 3 times
slower than data file scanning (for B-trees). The reason is that
there is much more seeks involved when scanning the index structure
and also because the index structure is more complicated than the data
file structure, especially when we talk about compressed indexes.
The above is at least true for MyISAM tables where the data file is
stored separately from the index.
If you have a database structure where the data is stored in the leafs
of the trees (like Berkeley DB), the above optimization could be faster
as there is as much seeks involved in scanning the index as scanning
the data.
Antony> As an aside, IBM DB2 does not allow two indexes to exist where the
two
Antony> keys have the same components but are in different orders,
Antony> Something like the following will fail in DB2.
>> CREATE TABLE db.test (
>> COL1 CHAR(10) NOT NULL,
>> COL2 CHAR(10) NOT NULL,
>> INDEX (COL1, COL2),
>> INDEX (COL2, COL1)
>> );
Antony> I suggest that their product does something similar to the above.
I can't see any reason why the above should fail; For example, if you
don't allow the above, you can't optimize a wide range of queries,
like:
SELECT * from test where col1 >= "A" and col1 <= "B"
SELECT * from test where col2 >= "A" and col2 <= "B"
SELECT * from test where col1 >= "A" and col2 >= "E"
SELECT * from test where col2 >= "A" and col1 >= "E"
SELECT * from test ORDER BY col1,col2;
SELECT * from test ORDER BY col2,col1;
To optimize a major part of the above queries, you must have both
indexes! To not allow one to do this is only relevant if you are
using some kind of index type where you can't do range or order by
searching.
As MySQL keeps a note about which indexes can be used to satisfy the
WHERE and the SELECT part, it would not be hard to do this
optimization for the handlers where the data is stored in the trees
(Currently only BDB tables).
The general optimization, where you allow any operator on the key
parts is easy to do:
- Add a handler function:
read_next_unique(length-of-key-to-test)
and add in sql_select.cc two new functions:
join_read_next_unique() and init_join_read_next_unique()
that could be based on join_read_next() and init_join_read_next()
in this case you need to call handler::read_next() as long as the
WHERE clause was satisfied and read_next_unique() when the WHERE
clause wasn't satisfied.
The general optimization to handle cases like:
SELECT * from test where key_part2>= "A" and key_part2 <="B" or
key_part2 >="D" or key_part2 <="E"
is much harder to do; In this case we could theoretically skip to
next key_part1 key when: we have found a key (#, "F"), but to do this
one would have to do a lot of new code, in the style of opt_range.cc
Regards,
Monty
---------------------------------------------------------------------
Before posting, please check:
http://www.mysql.com/manual.php (the manual)
http://lists.mysql.com/ (the list archive)
To request this thread, e-mail internals-thread555@stripped
To unsubscribe, e-mail <internals-unsubscribe@stripped>