List:Internals« Previous MessageNext Message »
From:Michael Widenius Date:November 22 2000 3:37pm
Subject:Re: Index with condition? [A distant idea]
View as plain text  
Hi!

>>>>> "Jan" == Jan Dvorak <jan.dvorak@stripped> writes:

Jan> Hi,
Jan> Michael Widenius wrote:

<cut>

Jan> I think it would be smaller - one would get rid
Jan> of both the keys and the row pointers!
Jan> If I have 10M rows in a dynamic format,
Jan> that's 40MB of row pointers!

Only of there is a lot of NULL entries.  If the index is a compressed
index, each NULL will only take about 2 bytes, in which case it's only
20M of pointers. As long as you are not searching after NULL's these
entires are not touched and takes only place on disk; They will not
normally make the access to the other rows notably slower.

>> Another problem is also that the checking of index corruption and
>> repairing gets easily much more complicated.

Jan> Yes.
Jan> After all myisamchk functionality is merged into mysqld,
Jan> will it continue to exist as a stand-alone utility?

The plan is that it will continue to exist as a standalone utility;
This gives you the benefit to be able to repair most tables off-line
which is safer is there by some strange reason would be a bug in the
repair code.

<cut>

Jan> 2. Sometimes I don't trust my users and don't allow them
Jan> to do a real delete.  Instead, I put an "erased" field
Jan> in the table and add an additional condition to my selects,
Jan> like "... and not erased".  Now, if I have a uniqueness constraint,
Jan> I want it to work only on the "not erased" records.
Jan> I wish I could say
>> 
Jan> CREATE TABLE lookup1 (
Jan> id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
Jan> code CHAR(5) NOT NULL,
Jan> description VARCHAR(255),
Jan> erased BIT NOT NULL DEFAULT 0,
Jan> UNIQUE ( code ) WHERE NOT erased
Jan> );
>> 
>> Why not do it the SQL way:
>> 
>> CREATE TABLE lookup1 (
>> id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
>> code CHAR(5),
>> description VARCHAR(255),
>> erased BIT NOT NULL DEFAULT 0,
>> UNIQUE ( code )
>> );
>> 
>> This gives you exactly the same benefits!

Jan> Does it?
Jan> Assume a user sets the erased bit.

Jan> In your approach, one should also update code with NULL,
Jan> otherwise no new record of that code value can be inserted;
Jan> with conditional index, the previous value can stay there,
Jan> as the constraint is effective only on the non-erased rows.
Jan> If an admin wants to restore the previous record
Jan> (it turns out the user has screwed it up),
Jan> the old code is lost under your set-up.

I assumed you didn't mind changing code to NULL.

If you do mind, the solution is to do the unique constrains as:

UNIQUE ( code, erased )

And have erased defined as:

'erased CHAR(0) default ""'

To erase something you now only have to set erased to "NULL".

Jan> True, I could have a "deleted" table where I would move
Jan> the records that are marked for deletion,
Jan> but that's what I wanted to avoid in the first place.


Jan> 3. A similar situation arises when I want to track history
Jan> and add the "valid_from" and "valid_to" fields.
Jan> The "valid_from" can't be null, but "valid_to" can:
Jan> a null value there signals the up-to-date status,
Jan> whereas a non-null means the validity of the record
Jan> has been closed.

Jan> Suppose I track identification documents of a person.
Jan> The data is precious enough for me that I don't want to delete
Jan> previous data, neither do I want to keep the non-up-to-date data
Jan> in a separate table.
Jan> At any given time, I'll honor just one identification of a type.
Jan> I wish I could do
>> 
Jan> CREATE TABLE Person_Ident_Document (
Jan> id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
Jan> person_id INTEGER UNSIGNED NOT NULL REFERENCES Person,
Jan> valid_from DATETIME NOT NULL,
Jan> valid_to DATETIME,
Jan> ident_doc_type VARCHAR(255) NOT NULL,
Jan> ident_doc_number VARCHAR(255) NOT NULL,
Jan> ident_doc_issued_when DATE,
Jan> ident_doc_issued_where VARCHAR(255),
Jan> ident_doc_issued_by VARCHAR(255),
Jan> UNIQUE ( person_id, ident_doc_type ) WHERE valid_to IS NULL
Jan> );
>> 
>> you can do the above with:
>> 
>> CREATE TABLE Person_Ident_Document (
>> id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
>> person_id INTEGER UNSIGNED NOT NULL REFERENCES Person,
>> valid_from DATETIME NOT NULL,
>> valid_to DATETIME,
>> ident_doc_type VARCHAR(255) NOT NULL,
>> ident_doc_number VARCHAR(255) NOT NULL,
>> ident_doc_issued_when DATE,
>> ident_doc_issued_where VARCHAR(255),
>> ident_doc_issued_by VARCHAR(255),
>> valid_is_not_null CHAR(0)
>> UNIQUE ( person_id, ident_doc_type,valid_is_not_null)
>> );
>> 
>> The only thing you need to do is to update valid_date as follows:
>> 
>> Update Person_Ident_Document SET valid_to=NOW(),valid_is_not_NULL='';
>> 
>> Valid_is_not_null will only take a bit in your Person_date_Document
>> (and one byte per index entry)

Jan> O.k., I can see how you'd go around the issue:
Jan> A uniqueness constraint doesn't apply to rows
Jan> where a column of the unique index is NULL.
Jan> Clever.

Jan> Still, you do have the rows indexed.

Yes, but you already had the row indexed in the first place.  They
only thing we added was 1 byte / index entry.

Jan> As far as I know, allowing for a conditional index
Jan> would not violate any of the principles of relational databases.

>> No, but it would be a bit troublesome to maintain;  Especially the
>> UNIQUE constraint is very hard to do generally!

Jan> I can see the following programming issues here:
Jan>  1. Modify the virtual table interface to accept this extension.
Jan>  2. Provide a table driver that actually can have the conditional index.
Jan>  3. Tweak the optimizer to take the condition into account.

Jan> To begin with, the conditions could be limited to simple
Jan> expressions containing only the =, <=>, <>, <, >, <=, >=,
Jan> IS NULL, IS NOT NULL, AND, OR, NOT operators, literals,
Jan> and the columns of the indexed table.
Jan> No functions (not to say UDFs), no subquery expressions.

Jan> I think this is something other DBMS'es usually don't have,
Jan> but what would be a very handy extension over the standards.
>> 
Jan> The advantages are in a more efficient space usage,
Jan> faster index maintenance, faster database operations.
>> 
>> Only in some cases (mainly a fixed size index on one column)

Jan> Actually, in pretty much every situation
Jan> where you don't need to index all rows,
Jan> just those that satisfy a highly filtering condition.

Jan> The only disadvantage I can see is that it's not implemented yet. :-)
>> 
>> But as you can see, its very easy to simulate...

Jan> O.k., while it's not impossible to simulate,
Jan> it definitely impacts the way one designs the database.

The goods news is however that the solution I proposed is portable :)

I have put archived this mail so that we can look at doing this when
4.0 is stable.

Regards,
Monty
Thread