List:Internals« Previous MessageNext Message »
From:Jan Dvorak Date:November 22 2000 1:55pm
Subject:Re: Index with condition? [A distant idea]
View as plain text  

Michael Widenius wrote:
> Hi!
> >>>>> "Jan" == Jan Dvorak <jan.dvorak@stripped> writes:
> Jan> Hi all,
> Jan> I'm having the following idea and
> Jan> I'm looking for a DBMS that would satisfy me.
> Jan> I wish it was MySQL, one day. :-)
> Jan> A conventional database index always spans
> Jan> all rows of the table it's constructed on.
> Jan> Sometimes, though, I'd prefer a finer resolution was possible:
> Jan> I'd like to restrict the rows to be indexed.
> Jan> Several examples:
> Jan> 1. A foreign key that realizes an optional relationship
> Jan>    (a "0,1" on the master side) that has only a low fraction
> Jan>    of non-NULLs.
> Jan>    But you have all the NULLs in the index,
> Jan>    although they won't ever be used:
> Jan>     * If I do joins, the NULLs won't match
> Jan>       (right, if I use the normal "=", not the "<=>",
> Jan>        but that's another story).
> Jan>     * If I select the "fk IS NULL" rows,
> Jan>       any sensible optimizer will decide to scan the table
> Jan>       instead of going through the index.
> The above is actually not true; If there is few values of fk, where fk
> is NULL, then it's much better to have the column indexed.

Yes, I assumed a low fraction of non-NULLs.

> Jan>    If I could construct the index with the condition
> Jan>    "fk IS NOT NULL":
> Jan>     CREATE INDEX idx_name
> Jan>       ON tab_name ( col1_name ) WHERE col1 IS NOT NULL;
> Jan>    I could get rid of the dead part of the complete index.
> Jan>    The index could be much smaller and the db operations faster.
> For a packed index (CHAR/VARCHAR/BLOB) , the index would not be
> notably smaller (as the NULL part is mostly optimized away);  You
> would win about 0.1 % space with this option.
> For fixed size columns you would win more (but we could actually fix
> this by adding a bit for each page that tells us if the keys on this
> page may be 'null' or not).

I think it would be smaller - one would get rid
of both the keys and the row pointers!
If I have 10M rows in a dynamic format,
that's 40MB of row pointers!

> Another problem is also that the checking of index corruption and
> repairing gets easily much more complicated.

After all myisamchk functionality is merged into mysqld,
will it continue to exist as a stand-alone utility?

> The major win is mainly that storing columns as NULL would be a bit
> faster.
> Note also that the above optimization will not work when the column is
> part of a multi-part index.

I could place the condition that neither of the foreign key columns be NULL
- then I would index just the rows that do join.

> I really don't know if this is a good idea or not...
> Jan> 2. Sometimes I don't trust my users and don't allow them
> Jan>    to do a real delete.  Instead, I put an "erased" field
> Jan>    in the table and add an additional condition to my selects,
> Jan>    like "... and not erased".  Now, if I have a uniqueness constraint,
> Jan>    I want it to work only on the "not erased" records.
> Jan>    I wish I could say
> Jan>     CREATE TABLE lookup1 (
> Jan>       code CHAR(5) NOT NULL,
> Jan>       description VARCHAR(255),
> Jan>       erased BIT NOT NULL DEFAULT 0,
> Jan>       UNIQUE ( code ) WHERE NOT erased
> Jan>     );
> Why not do it the SQL way:
>     CREATE TABLE lookup1 (
>       code CHAR(5),
>       description VARCHAR(255),
>       erased BIT NOT NULL DEFAULT 0,
>       UNIQUE ( code )
>     );
> This gives you exactly the same benefits!

Does it?
Assume a user sets the erased bit.
In your approach, one should also update code with NULL,
otherwise no new record of that code value can be inserted;
with conditional index, the previous value can stay there,
as the constraint is effective only on the non-erased rows.
If an admin wants to restore the previous record
(it turns out the user has screwed it up),
the old code is lost under your set-up.

True, I could have a "deleted" table where I would move
the records that are marked for deletion,
but that's what I wanted to avoid in the first place.

> Jan> 3. A similar situation arises when I want to track history
> Jan>    and add the "valid_from" and "valid_to" fields.
> Jan>    The "valid_from" can't be null, but "valid_to" can:
> Jan>    a null value there signals the up-to-date status,
> Jan>    whereas a non-null means the validity of the record
> Jan>    has been closed.
> Jan>    Suppose I track identification documents of a person.
> Jan>    The data is precious enough for me that I don't want to delete
> Jan>    previous data, neither do I want to keep the non-up-to-date data
> Jan>    in a separate table.
> Jan>    At any given time, I'll honor just one identification of a type.
> Jan>    I wish I could do
> Jan>      CREATE TABLE Person_Ident_Document (
> Jan>        person_id INTEGER UNSIGNED NOT NULL REFERENCES Person,
> Jan>        valid_from DATETIME NOT NULL,
> Jan>        valid_to DATETIME,
> Jan>        ident_doc_type VARCHAR(255) NOT NULL,
> Jan>        ident_doc_number VARCHAR(255) NOT NULL,
> Jan>        ident_doc_issued_when DATE,
> Jan>        ident_doc_issued_where VARCHAR(255),
> Jan>        ident_doc_issued_by VARCHAR(255),
> Jan>        UNIQUE ( person_id, ident_doc_type ) WHERE valid_to IS NULL
> Jan>      );
> you can do the above with:
>      CREATE TABLE Person_Ident_Document (
>        valid_from DATETIME NOT NULL,
>        valid_to DATETIME,
>        ident_doc_type VARCHAR(255) NOT NULL,
>        ident_doc_number VARCHAR(255) NOT NULL,
>        ident_doc_issued_when DATE,
>        ident_doc_issued_where VARCHAR(255),
>        ident_doc_issued_by VARCHAR(255),
>        valid_is_not_null CHAR(0)
>        UNIQUE ( person_id, ident_doc_type,valid_is_not_null)
>      );
> The only thing you need to do is to update valid_date as follows:
> Update Person_Ident_Document SET valid_to=NOW(),valid_is_not_NULL='';
> Valid_is_not_null will only take a bit in your Person_date_Document
> (and one byte per index entry)

O.k., I can see how you'd go around the issue:
A uniqueness constraint doesn't apply to rows
where a column of the unique index is NULL.

Still, you do have the rows indexed.

> Jan> As far as I know, allowing for a conditional index
> Jan> would not violate any of the principles of relational databases.
> No, but it would be a bit troublesome to maintain;  Especially the
> UNIQUE constraint is very hard to do generally!

I can see the following programming issues here:
 1. Modify the virtual table interface to accept this extension.
 2. Provide a table driver that actually can have the conditional index.
 3. Tweak the optimizer to take the condition into account.

To begin with, the conditions could be limited to simple
expressions containing only the =, <=>, <>, <, >, <=, >=,
IS NULL, IS NOT NULL, AND, OR, NOT operators, literals,
and the columns of the indexed table.
No functions (not to say UDFs), no subquery expressions.

> Jan> I think this is something other DBMS'es usually don't have,
> Jan> but what would be a very handy extension over the standards.
> Jan> The advantages are in a more efficient space usage,
> Jan> faster index maintenance, faster database operations.
> Only in some cases (mainly a fixed size index on one column)

Actually, in pretty much every situation
where you don't need to index all rows,
just those that satisfy a highly filtering condition.

> Jan> The only disadvantage I can see is that it's not implemented yet. :-)
> But as you can see, its very easy to simulate...

O.k., while it's not impossible to simulate,
it definitely impacts the way one designs the database.

> Regards,
> Monty

Thanks for your comments!