I have some questions about the specs.
> TASK...........: Versioning for delete & update (for transactional tables)
> Versioning for delete & update (for transactional tables)
> HIGH-LEVEL SPECIFICATION:
> For delete, instead of physically deleting rows when maria_delete() is
> called, we will change the delete internally to an update where the
> row and it's keys are tagged with the current transaction id as their
> delete trans id.
Will there be a new type of REDO log record to describe this change done
to the row or key? Or does some existing REDO log record need to be
If there is a "yes" to one of these questions, we need to update
recovery's code to add/modify "log record replay" functions.
> For update, we don't delete changed keys but instead tag them with
> with the current transaction id as their delete trans id.
Same question here.
> The position to the rows and keys are stored in the transaction log (like now).
> After the transaction has committed, the link to the last deleted
> entry in the transaction log will be submitted to a purge handler.
> The purge handler will wait until all transaction that started before
> the given one has completed and will then go through the linked entries
> in the log and execute a normal row-delete & key-deletes on the
> stale entries.
Actually, trnman_end_trn() already separates committed transactions
which can be purged, from those to keep: see this function, it already
has a comment:
/* QQ: send them to the purge thread */
So the purge handler does not need to wait, whatever it receives can be
> LOW-LEVEL DESIGN:
> Most of the flows for delete and update row described in WL#3165 are still accurate.
> The only difference is that instead of deleting the rows/keys we mark the
> with a delete transid.
> For deleting rows in the block format (WL#3061), we already have one
> bit in the flag reserved for the marking the row delete. The
> DELETE_TRANSID position is used for storing the transid for the
> transaction that is doing the delete.
> The reading of a row or key entry is now going to use the following logic:
> If row/key doesn't have transid
> Return entry
> if transid is not visible
> Skip entry
> if entry doesn't have delete-transid or delete-transid is not visible
> Return entry
> Skip entry
> The keys now have the following format (described in detail later)
> pointer_to_row << 1
> packed_optional_transid << 1
> packed_optional_delete_transid << 1
> The pointer and transid are shifted up one bit so that we can use the
> lowest bit as a marker if the following optional entry exists. The
> reason packed_optional_delete_transid is shifted is that we use the same
> function (transid_store_packed()) to pack all transid.
> Storing of transid (Trid):
> Trid is max 6 bytes long
> First Trid it's converted to a smaller number by using
> trid= trid - create_trid.
I think this needs a note:
Right now maria_create() executes with dummy_transaction_object as far
as I know. Either this can be changed to a real non-dummy TRN, or
maria_create() can just fetch the current value of the global TrID
generator and use that as create_trid.
> Then trid is then shifted up one bit so that we can use the
> lowest bit as a marker if it's followed by another trid.
> Trid is then stored as follows:
> if trid < 256-12
> one byte
> one byte prefix (length_of_trid_in_bytes+249) followed by data
> in high-byte-first order
> Prefix bytes 244 to 249 are reserved for negative transid, that can be used
> when we pack transid relative to each other on a key block.
> We have to store transid in high-byte-first order to be able to do a
> fast byte-per-byte comparision of them without packing them up.
you mean "without unpacking them" ?
I don't understand the sentence, could you please explain the scenario
why this high-byte-first helps?
> For example, assuming we the following data:
> key_data: 1 (4 byte integer)
> pointer_to_row: 2 << 8 + 3 = 515 (page 2, row 3)
> table_create_transid 1000 Defined at create table time
> transid 1010 Transaction that created row
> delete_transid 2011 Transaction that deleted row
> In addition we assume the table is created with a data pointer length
> of 4 bytes (this is automatically calculated based on the medium length of rows
> and the given max number of rows)
> The binary data for the key would then look like this in hex:
> 00 00 00 01 Key
> 00 00 00 47 (515 << 1) + 1 ; The last 1 is marker that key cont.
> 15 ((1000-1010) << 1) + 1 ; The last 1 is marker that key cont.
you mean 1000-1010 or 1010-1000 ?
> FB 07 E6 length byte and ((2011 - 1000) << 1) = 07 E6
Could you please add "07E6 is 2 bytes and so 249 + 2 = 251 = FB" ?
Do you need to specifically update maria_chk, maria_pack, so that they
don't fail when finding a delete_transid?
What about "zerofill" code, does it need an update?