List:Maria Storage Engine« Previous MessageNext Message »
From:Guilhem Bichot Date:November 10 2008 2:34pm
Subject:Re: Versioning for delete & update (for transactional tables) (4619)
View as plain text  
Hello Monty,

I have some questions about the specs.

> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> TASK...........: Versioning for delete & update (for transactional tables)
> DESCRIPTION:
> 
> Versioning for delete & update (for transactional tables)
> 
> 
> HIGH-LEVEL SPECIFICATION:
> 
> For delete, instead of physically deleting rows when maria_delete() is
> called, we will change the delete internally to an update where the
> row and it's keys are tagged with the current transaction id as their
> delete trans id.

Will there be a new type of REDO log record to describe this change done 
to the row or key? Or does some existing REDO log record need to be 
slightly changed?
If there is a "yes" to one of these questions, we need to update 
recovery's code to add/modify "log record replay" functions.

> For update, we don't delete changed keys but instead tag them with
> with the current transaction id as their delete trans id.

Same question here.

> The position to the rows and keys are stored in the transaction log (like now).
> 
> After the transaction has committed, the link to the last deleted
> entry in the transaction log will be submitted to a purge handler.
> 
> The purge handler will wait until all transaction that started before
> the given one has completed and will then go through the linked entries
> in the log and execute a normal row-delete & key-deletes on the
> stale entries.

Actually, trnman_end_trn() already separates committed transactions 
which can be purged, from those to keep: see this function, it already 
has a comment:
   /* QQ: send them to the purge thread */
So the purge handler does not need to wait, whatever it receives can be 
purged immediately.

> LOW-LEVEL DESIGN:
> 
> Most of the flows for delete and update row described in WL#3165 are still accurate.
> 
> The only difference is that instead of deleting the rows/keys we mark the
> with a delete transid.
> 
> For deleting rows in the block format (WL#3061), we already have one
> bit in the flag reserved for the marking the row delete. The
> DELETE_TRANSID position is used for storing the transid for the
> transaction that is doing the delete.
> 
> The reading of a row or key entry is now going to use the following logic:
> 
> If row/key doesn't have transid
>    Return entry
> if transid is not visible
>    Skip entry
> if entry doesn't have delete-transid or delete-transid is not visible
>    Return entry
> Skip entry
> 
> The keys now have the following format (described in detail later)
> 
> key_data
> pointer_to_row << 1
> packed_optional_transid << 1
> packed_optional_delete_transid << 1
> 
> The pointer and transid are shifted up one bit so that we can use the
> lowest bit as a marker if the following optional entry exists. The
> reason packed_optional_delete_transid is shifted is that  we use the same
> function (transid_store_packed()) to pack all transid.
> 
> --------
> Storing of transid (Trid):
> 
> Trid is max 6 bytes long
> 
> First Trid it's converted to a smaller number by using
> trid= trid - create_trid.

I think this needs a note:
Right now maria_create() executes with dummy_transaction_object as far 
as I know. Either this can be changed to a real non-dummy TRN, or 
maria_create() can just fetch the current value of the global TrID 
generator and use that as create_trid.

> Then trid is then shifted up one bit so that we can use the
> lowest bit as a marker if it's followed by another trid.
> 
> Trid is then stored as follows:
> 
> if trid < 256-12
>   one byte
> else
>   one byte prefix (length_of_trid_in_bytes+249) followed by data
>   in high-byte-first order
> 
> Prefix bytes 244 to 249 are reserved for negative transid, that can be used
> when we pack transid relative to each other on a key block.
 >
> We have to store transid in high-byte-first order to be able to do a
> fast byte-per-byte comparision of them without packing them up.

you mean "without unpacking them" ?
I don't understand the sentence, could you please explain the scenario 
why this high-byte-first helps?

> ------------
> 
> For example, assuming we the following data:
> 
> key_data:               1                (4 byte integer)
> pointer_to_row:         2 << 8 + 3 = 515 (page 2, row 3)
> table_create_transid    1000             Defined at create table time
> transid                 1010             Transaction that created row
> delete_transid          2011             Transaction that deleted row
> 
> In addition we assume the table is created with a data pointer length
> of 4 bytes (this is automatically calculated based on the medium length of rows
> and the given max number of rows)
> 
> The binary data for the key would then look like this in hex:
> 
> 00 00 00 01     Key
> 00 00 00 47     (515 << 1) + 1         ;  The last 1 is marker that key cont.
> 15              ((1000-1010) << 1) + 1 ;  The last 1 is marker that key cont.

you mean 1000-1010 or 1010-1000 ?

> FB 07 E6        length byte and  ((2011 - 1000) << 1) = 07 E6

Could you please add "07E6 is 2 bytes and so 249 + 2 = 251 = FB" ?

Do you need to specifically update maria_chk, maria_pack, so that they 
don't fail when finding a delete_transid?
What about "zerofill" code, does it need an update?
Thread
Re: Versioning for delete & update (for transactional tables) (4619)Guilhem Bichot10 Nov
  • Re: Versioning for delete & update (for transactional tables) (4619)Oleksandr \"Sanja\" Byelkin11 Nov
  • Re: Versioning for delete & update (for transactional tables)(4619)Michael Widenius11 Nov
    • Re: Versioning for delete & update (for transactional tables) (4619)Guilhem Bichot12 Nov
      • Re: Versioning for delete & update (for transactional tables) (4619)Guilhem Bichot12 Nov