List:Internals« Previous MessageNext Message »
From:Michael Widenius Date:November 21 2000 4:50pm
Subject:Index with condition? [A distant idea]
View as plain text  
Hi!

>>>>> "Jan" == Jan Dvorak <jan.dvorak@stripped> writes:

Jan> Hi all,
Jan> I'm having the following idea and
Jan> I'm looking for a DBMS that would satisfy me.
Jan> I wish it was MySQL, one day. :-)

Jan> A conventional database index always spans
Jan> all rows of the table it's constructed on.
Jan> Sometimes, though, I'd prefer a finer resolution was possible:
Jan> I'd like to restrict the rows to be indexed.

Jan> Several examples:

Jan> 1. A foreign key that realizes an optional relationship
Jan>    (a "0,1" on the master side) that has only a low fraction
Jan>    of non-NULLs.
Jan>    But you have all the NULLs in the index,
Jan>    although they won't ever be used:
Jan>     * If I do joins, the NULLs won't match
Jan>       (right, if I use the normal "=", not the "<=>",
Jan>        but that's another story).
Jan>     * If I select the "fk IS NULL" rows,
Jan>       any sensible optimizer will decide to scan the table
Jan>       instead of going through the index.

The above is actually not true; If there is few values of fk, where fk
is NULL, then it's much better to have the column indexed.

Jan>    If I could construct the index with the condition
Jan>    "fk IS NOT NULL":

Jan>     CREATE INDEX idx_name
Jan>       ON tab_name ( col1_name ) WHERE col1 IS NOT NULL;

Jan>    I could get rid of the dead part of the complete index.
Jan>    The index could be much smaller and the db operations faster.

For a packed index (CHAR/VARCHAR/BLOB) , the index would not be
notably smaller (as the NULL part is mostly optimized away);  You
would win about 0.1 % space with this option.

For fixed size columns you would win more (but we could actually fix
this by adding a bit for each page that tells us if the keys on this
page may be 'null' or not).

Another problem is also that the checking of index corruption and
repairing gets easily much more complicated.

The major win is mainly that storing columns as NULL would be a bit
faster.

Note also that the above optimization will not work when the column is
part of a multi-part index.

I really don't know if this is a good idea or not...

Jan> 2. Sometimes I don't trust my users and don't allow them
Jan>    to do a real delete.  Instead, I put an "erased" field
Jan>    in the table and add an additional condition to my selects,
Jan>    like "... and not erased".  Now, if I have a uniqueness constraint,
Jan>    I want it to work only on the "not erased" records.
Jan>    I wish I could say

Jan>     CREATE TABLE lookup1 (
Jan>       id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
Jan>       code CHAR(5) NOT NULL,
Jan>       description VARCHAR(255),
Jan>       erased BIT NOT NULL DEFAULT 0,
Jan>       UNIQUE ( code ) WHERE NOT erased
Jan>     );

Why not do it the SQL way:

    CREATE TABLE lookup1 (
      id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      code CHAR(5),
      description VARCHAR(255),
      erased BIT NOT NULL DEFAULT 0,
      UNIQUE ( code )
    );

This gives you exactly the same benefits!

Jan> 3. A similar situation arises when I want to track history
Jan>    and add the "valid_from" and "valid_to" fields.
Jan>    The "valid_from" can't be null, but "valid_to" can:
Jan>    a null value there signals the up-to-date status,
Jan>    whereas a non-null means the validity of the record
Jan>    has been closed.

Jan>    Suppose I track identification documents of a person.
Jan>    The data is precious enough for me that I don't want to delete
Jan>    previous data, neither do I want to keep the non-up-to-date data
Jan>    in a separate table.
Jan>    At any given time, I'll honor just one identification of a type.
Jan>    I wish I could do

Jan>      CREATE TABLE Person_Ident_Document (
Jan>        id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
Jan>        person_id INTEGER UNSIGNED NOT NULL REFERENCES Person,
Jan>        valid_from DATETIME NOT NULL,
Jan>        valid_to DATETIME,
Jan>        ident_doc_type VARCHAR(255) NOT NULL,
Jan>        ident_doc_number VARCHAR(255) NOT NULL,
Jan>        ident_doc_issued_when DATE,
Jan>        ident_doc_issued_where VARCHAR(255),
Jan>        ident_doc_issued_by VARCHAR(255),
Jan>        UNIQUE ( person_id, ident_doc_type ) WHERE valid_to IS NULL
Jan>      );

you can do the above with:

     CREATE TABLE Person_Ident_Document (
       id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
       person_id INTEGER UNSIGNED NOT NULL REFERENCES Person,
       valid_from DATETIME NOT NULL,
       valid_to DATETIME,
       ident_doc_type VARCHAR(255) NOT NULL,
       ident_doc_number VARCHAR(255) NOT NULL,
       ident_doc_issued_when DATE,
       ident_doc_issued_where VARCHAR(255),
       ident_doc_issued_by VARCHAR(255),
       valid_is_not_null CHAR(0)
       UNIQUE ( person_id, ident_doc_type,valid_is_not_null)
     );

The only thing you need to do is to update valid_date as follows:

Update Person_Ident_Document SET valid_to=NOW(),valid_is_not_NULL='';

Valid_is_not_null will only take a bit in your Person_date_Document
(and one byte per index entry)

Jan> [End of examples.]

Jan> As far as I know, allowing for a conditional index
Jan> would not violate any of the principles of relational databases.

No, but it would be a bit troublesome to maintain;  Especially the
UNIQUE constraint is very hard to do generally!

Jan> I think this is something other DBMS'es usually don't have,
Jan> but what would be a very handy extension over the standards.

Jan> The advantages are in a more efficient space usage,
Jan> faster index maintenance, faster database operations.

Only in some cases (mainly a fixed size index on one column)

Jan> The only disadvantage I can see is that it's not implemented yet. :-)

But as you can see, its very easy to simulate...

Regards,
Monty
Thread