List:Commits« Previous MessageNext Message »
From:Alfranio Correia Date:November 26 2009 10:09pm
Subject:Re: bzr commit into mysql-5.1-rep+2 branch (luis.soares:3137) WL#5092
View as plain text  
Hi Luis,

I forgot to mention that you should create a patch based on the rep+3.

Cheers.

Alfranio Correia wrote:
> Hi Luis,
> 
>   Great work.
>   Please, find some comments in what follows.
> 
> STATUS
> ------
>   Not approved.
> 
> REQUIRED CHANGES
> ----------------
> See comments in-line.
> 
> REQUESTS
> --------
> See comments in-line.
> 
> SUGGESTIONS
> -----------
> See comments in-line.
> 
> DETAILS
> -------
> n/a
> 
> 
> Cheers.
> 
> 
> 
> Luis Soares wrote:
>> #At
> file:///home/lsoares/Workspace/bzr/work/features/wl5092/mysql-5.1-rep%2B2-commit/ based on
> revid:li-bing.song@stripped
>>
>>  3137 Luis Soares	2009-11-06
>>       WL#5092: RBR: Options for writing partial or full row images in RBR events
>>       
>>       Summary
>>       =======
>>       
>>         Implement mysqld option so that the user is able to select
>>         whether to log minimal or full images on RBR events depending
>>         on the keys available in the master. When sending minimal rows,
>>         one can save binlog disk space, network resources and mysqld
>>         memory footprint.
>>       
>>       BACKGROUND
>>       ==========
>>       
>>         Before and After Images
>>         -----------------------
>>       
>>         In row based replication row events may contain two copies for the
>>         row that they are changing. These are generally known as images. The
>>         first one, called *before image* (BI), contains data that existed on
>>         the row before it was actually changed. The second one, called
>>         *after image* (AI), contain the actual changes. Each, BI and AI,
>>         usage is confined to two different moments in the execution
>>         flow. The BI is used while the slave is searching for the row to be
>>         updated, while AI is used when replaying the changes in the row.
>>       
>>         Both have some restrictions:
>>       
>>           - BI: needs to hold a set of values that can be used by the
>>             slave to fetch the correct row. In other words, it should
>>             provide a set a values that *uniquely* identify the row to
>>             be changed;
>>       
>>           - AI: needs to hold values that are needed to replay all the
>>             changes, that were actually done during the original
>>             execution, in an identical set (meaning, same index
>>             structures, same engine, ...).
>>       
>>         Given that BI and AI have different usages, their usefulness can be
>>         mapped into the data modification row events:
>>       
>>           - Write_rows_log_event: *requires only AI*.
>>       
>>             There is no need for a BI because, we are adding a record,
>>             and not changing an existing one. The current
>>             implementation logs only the AI.
>>       
>>           - Delete_rows_log_event: *requires only BI*.
>>       
>>             There is no need for an AI because, the row ceases to
>>             exist, as it is removed. However, before removing it, one
>>             needs to find it, thence BI is required.
>>       
>>           - Update_rows_log_event: *requires both: AI and BI*.
>>       
>>             Both BI and AI, are required. The row needs to be found (BI
>>             comes to play) before being changed (AI comes to play).
>>       
>>         Summing up, BI must contain values that uniquely identifies rows,
>>         acting like a primary key equivalent (PKE), while AI must contain
>>         values that make possible changing the row according to the original
>>         execution.
>>       
>>         Primary Key Equivalent
>>         ----------------------
>>       
>>         Tables contain an index structure - mysql calls indexes keys. This
>>         structure holds information on which keys are declared, and these
>>         range from plain keys (K), to unique keys (UK), or even a primary
>>         key (PK).
>>       
>>         PK - When it comes to logging row based events, PK plays an
>>              important role, as it covers the BI requirement of uniquely
>>              identifying a given row just by searching using the PK
>>              value. Thence, if replicating the master contents, one should
>>              be fine by just logging the PK column. The slave would then use
>>              this value to search the correct row. Technically, "A PRIMARY
>>              KEY is a unique index where all key columns must be defined as
>>              NOT NULL. If they are not explicitly declared as NOT NULL,
>>              MySQL declares them so implicitly (and silently). A table can
>>              have only one PRIMARY KEY." [1]
>>       
>>         UK - Unique keys share the same usefulness of the PK, except that
>>              they need to be declared without nullable parts. From the
>>              manual [1]: "[...] For all engines, a UNIQUE index allows
>>              multiple NULL values for columns that can contain NULL."  In
>>              fact, if there is no PK declared in a table and an application
>>              requests one, MySQL "[...] returns the first UNIQUE index that
>>              has no NULL columns as the PRIMARY KEY." [1]
>>         
>>          K - Regular keys or nullable UK are of no particular interest when
>>              logging row events. Given that these are stripped from
>>              uniqueness property provided by UK and PK, these cannot be used
>>              to uniquely identify a row.
>>       
>>         There can be the case that a table does not declare any index at
>>         all, or all indexes are regular keys. In this case the master must
>>         ensure that the data provided in BI shall be enough to uniquely
>>         identify the row. As such, the alternative is to log the full
>>         row. This ensures that when searching (using the BI) and applying
>>         (using the AI) the next record to be fetched will be uniquely
>>         identified by all the fields in the BI. Should there be
>>         indistinguishable rows, searching and updating either one in any
>>         given order leads to a correct state.
>>       
>>         Consequently, a *Primary Key Equivalent* (PKE) is defined as:
>>       
>>           1. If a PK exists, the PKE is equal to the PK.
>>       
>>           2. Otherwise, if there exists a UK where all columns have the NOT
>>              NULL attribute, then that is the PKE (if there are more than
>>              one such UKs, then one is chosen arbitrarily).
>>       
>>           3. Otherwise, the PKE is equal to the set of all columns.
>>       
>>         Hereafter, we will be considering to PK to be a subset of PKE and it
>>         shall map into items 1. and 2. of PKE definition. Furthermore, if no
>>         explicit primary key nor UK NOT NULL exists in the table, it is said
>>         that the table has no PK.
>>       
>>       PROBLEM STATEMENT
>>       =================
>>       
>>         Given the definition of PKE, one can have a smaller set of columns
>>         to be logged (instead of the full row), and still be able to find
>>         (BI) and update a record (AI), while replaying the row event.
>>       
>>         This can be useful to reduce bandwidth (less network traffic) and
>>         storage (samller binlogs) usage as well as mysqld memory
>>         footprint. It becomes even more important if tables contain large
>>         blobs that do not need to be logged as part of the BI. Currently, in
>>         MySQL 5.1 GA, PKE is always assumed to be the full row, ie, the
>>         index structure is ignored. Consequently, AI and BI are always
>>         logged with all their columns.
>>       
>>         It should be possible for the user to configure this behavior such
>>         that he could request that BI and AI always log full rows, or a PK
>>         when available. Put it more simply, the user should be able to state
>>         that he wants row images to be logged according to a PKE or always
>>         as a full row.
>>       
>>       SOLUTION
>>       ========
>>       
>>         MySQL shall provide an option and the different configurations
>>         should be:
>>       
>>           - minimal: 
>>              Means PKE in the before image and changed columns
>>              in after image
>>       
>>           - full:
>>              Means all columns in both before and after image
>>       
>>           - noblob:
>>              Works as full but avoids sending blobs when these are not
>>              needed. Blobs are still replicated if:
>>       
>>                1. In AI, if they have been changed.
>>                2. In BI, if they are part of PK.
>>       
>>         It shall be named:
>>       
>>           --binlog-row-image={minimal,noblob,full}
>>       
>>         DEFAULT VALUE SHALL BE: 'FULL'.
>>       
>>       BACKPORTS
>>       =========
>>       
>>         Part of this work required the backporting of patches for 
>>         BUG 33055 and BUG 40045.
>>       
>>       BUG FIXES
>>       =========
>>         
>>         - BUG 47200
>>         - BUG 47313
>>         - BUG 14068
>>         - BUG 33055 (backport)
>>         - BUG 40045 (backport)
>>         - BUG 47323 (fix applied in tree used - needed for testing purposes)
>>      @ sql/log.h
>>         Added enumeration for binlog_row_image option.
>>      @ sql/log_event.cc
>>         Several changes:
>>           1. refactored find_row so that BUG 47200 and BUG 47303 are fixed.
>>           2. added column marking according to binlog-row-image on slave
>>              while applying events.
>>           3. merged backport fix for BUG 40045 and BUG 33055.
>>           4. merged fix for BUG 47323, which is not available in 
>>              mysql-5.1-rep+2 ATM but is needed for testing purposes.
>>      @ sql/log_event.h
>>         Changes according to backport of BUG 33055.
>>      @ sql/mysqld.cc
>>         Added binlog-row-image option. Default set to: BINLOG_ROW_IMAGE_FULL
>>      @ sql/rpl_injector.cc
>>         Merged backport for BUG 33055.
>>      @ sql/rpl_injector.h
>>         Merged backport for BUG 33055.
>>      @ sql/rpl_record.cc
>>         Merged backport for BUG 40045 and BUG 33055.
>>      @ sql/set_var.cc
>>         Added dynamic variable binlog-row-image.
>>      @ sql/set_var.h
>>         Added class sys_var_thd_binlog_row_image.
>>      @ sql/sql_class.cc
>>         Several changes:
>>           1. merged backport for BUG 33055
>>           2. changed THD::binlog_*_row so that, when needed, some fields are
>>              removed from the read_set according to the binlog-row-image
>>              option used.
>>      @ sql/sql_class.h
>>         Merged backport for BUG 33055. Also added binlog_prepare_row_images
>>         method to THD.
>>      @ sql/sql_insert.cc
>>         Fixes insert delayed {read,write}_set.
>>      @ sql/sql_update.cc
>>         Merged BUG 40045 backport fix.
>>      @ sql/table.cc
>>         Changed mark_columns_needed_for_* so that these take into
>>         account binlog-row-image option.
>>      @ sql/table.h
>>         Added method mark_columns_per_binlog_row_image.
>>
>>     modified:
>>       sql/handler.cc
>>       sql/log.h
>>       sql/log_event.cc
>>       sql/log_event.h
>>       sql/log_event_old.cc
>>       sql/log_event_old.h
>>       sql/mysql_priv.h
>>       sql/mysqld.cc
>>       sql/rpl_injector.cc
>>       sql/rpl_injector.h
>>       sql/rpl_record.cc
>>       sql/rpl_record.h
>>       sql/set_var.cc
>>       sql/set_var.h
>>       sql/sql_class.cc
>>       sql/sql_class.h
>>       sql/sql_insert.cc
>>       sql/sql_update.cc
>>       sql/table.cc
>>       sql/table.h
>> === modified file 'sql/handler.cc'
>> --- a/sql/handler.cc	2009-10-03 10:50:25 +0000
>> +++ b/sql/handler.cc	2009-11-06 01:09:40 +0000
>> @@ -2484,8 +2484,9 @@ int handler::update_auto_increment()
>>  void handler::column_bitmaps_signal()
>>  {
>>    DBUG_ENTER("column_bitmaps_signal");
>> -  DBUG_PRINT("info", ("read_set: 0x%lx  write_set: 0x%lx", (long)
> table->read_set,
>> -                      (long) table->write_set));
>> +  if (table)
>> +    DBUG_PRINT("info", ("read_set: 0x%lx  write_set: 0x%lx", (long)
> table->read_set,
>> +                        table->write_set));
> 
> Why is this needed now?
> 
> 
>>    DBUG_VOID_RETURN;
>>  }
>>  
>> @@ -4553,8 +4554,8 @@ static int write_locked_table_maps(THD *
>>  }
>>  
>>  
>> -typedef bool Log_func(THD*, TABLE*, bool, MY_BITMAP*,
>> -                      uint, const uchar*, const uchar*);
>> +typedef bool Log_func(THD*, TABLE*, bool,
>> +                      const uchar*, const uchar*);
>>  
>>  static int binlog_log_row(TABLE* table,
>>                            const uchar *before_record,
>> @@ -4568,30 +4569,19 @@ static int binlog_log_row(TABLE* table,
>>  
>>    if (check_table_binlog_row_based(thd, table))
>>    {
>> -    MY_BITMAP cols;
>> -    /* Potential buffer on the stack for the bitmap */
>> -    uint32 bitbuf[BITMAP_STACKBUF_SIZE/sizeof(uint32)];
>> -    uint n_fields= table->s->fields;
>> -    my_bool use_bitbuf= n_fields <= sizeof(bitbuf)*8;
>> +    DBUG_DUMP("read_set 10", (uchar*) table->read_set->bitmap,
>> +              (table->s->fields + 7) / 8);
>>  
>>      /*
>>        If there are no table maps written to the binary log, this is
>>        the first row handled in this statement. In that case, we need
>>        to write table maps for all locked tables to the binary log.
>>      */
>> -    if (likely(!(error= bitmap_init(&cols,
>> -                                    use_bitbuf ? bitbuf : NULL,
>> -                                    (n_fields + 7) & ~7UL,
>> -                                    FALSE))))
>> +    if (likely(!(error= write_locked_table_maps(thd))))
>>      {
>> -      bitmap_set_all(&cols);
>> -      if (likely(!(error= write_locked_table_maps(thd))))
>> -        error= (*log_func)(thd, table, table->file->has_transactions(),
>> -                           &cols, table->s->fields,
>> -                           before_record, after_record);
>> -
>> -      if (!use_bitbuf)
>> -        bitmap_free(&cols);
>> +      bool const has_trans= table->file->has_transactions();
>> +      error=
>> +        (*log_func)(thd, table, has_trans, before_record, after_record);
>>      }
>>    }
>>    return error ? HA_ERR_RBR_LOGGING_FAILED : 0;
>>
>> === modified file 'sql/log.h'
>> --- a/sql/log.h	2009-10-07 21:13:07 +0000
>> +++ b/sql/log.h	2009-11-06 01:09:40 +0000
>> @@ -593,6 +593,25 @@ public:
>>    }
>>  };
>>  
>> +enum enum_binlog_row_image {
>> +  /*
>> +    MINIMAL
>> +      means PK in the before image and changed columns in after image
>> +
>> +    NOBLOB 
>> +      means that whenever possible, before and after image contain all columns
>> +      except blobs 
>> +
>> +    FULL
>> +      means all columns in both before and after image
>> +   */
>> +
>> +  BINLOG_ROW_IMAGE_MINIMAL= 0,
>> +  BINLOG_ROW_IMAGE_NOBLOB= 1,
>> +  BINLOG_ROW_IMAGE_FULL= 2
>> +};
>> +extern TYPELIB binlog_row_image_typelib;
>> +
> 
> ok.
> 
>>  enum enum_binlog_format {
>>    /*
>>      statement-based except for cases where only row-based can work (UUID()
>>
>> === modified file 'sql/log_event.cc'
>> --- a/sql/log_event.cc	2009-10-09 13:26:37 +0000
>> +++ b/sql/log_event.cc	2009-11-06 01:09:40 +0000
>> @@ -1852,6 +1852,7 @@ Rows_log_event::print_verbose_one_row(IO
>>  {
>>    const uchar *value0= value;
>>    const uchar *null_bits= value;
>> +  uint null_bit_index= 0;
>>    char typestr[64]= "";
>>    
>>    value+= (m_width + 7) / 8;
>> @@ -1860,7 +1861,8 @@ Rows_log_event::print_verbose_one_row(IO
>>    
>>    for (size_t i= 0; i < td->size(); i ++)
>>    {
>> -    int is_null= (null_bits[i / 8] >> (i % 8))  & 0x01;
>> +    int is_null= (null_bits[null_bit_index / 8] 
>> +                  >> (null_bit_index % 8))  & 0x01;
>>  
>>      if (bitmap_is_set(cols_bitmap, i) == 0)
>>        continue;
>> @@ -1897,6 +1899,8 @@ Rows_log_event::print_verbose_one_row(IO
>>      }
>>      
>>      my_b_printf(file, "\n");
>> +    
>> +    null_bit_index++;
>>    }
>>    return value - value0;
>>  }
>> @@ -7162,6 +7166,18 @@ int Rows_log_event::do_add_row_data(ucha
>>    DBUG_ENTER("Rows_log_event::do_add_row_data");
>>    DBUG_PRINT("enter", ("row_data: 0x%lx  length: %lu", (ulong) row_data,
>>                         (ulong) length));
>> +
>> +  /*
>> +    If length is zero, there is nothing to write, so we just
>> +    return. Note that this is not an optimization, since calling
>> +    realloc() with size 0 means free().
> 
> Please, improve this comment.
> Is this the case for insert into tt values()?
> 
> 
>> +   */
>> +  if (length == 0)
>> +  {
>> +    m_row_count++;
>> +    DBUG_RETURN(0);
>> +  }
>> +
>>    /*
>>      Don't print debug messages when running valgrind since they can
>>      trigger false warnings.
>> @@ -7476,7 +7492,7 @@ int Rows_log_event::do_apply_event(Relay
>>                            (ulong) m_curr_row, (ulong) m_curr_row_end, (ulong)
> m_rows_end));
>>  
>>        if (!m_curr_row_end && !error)
>> -        error= unpack_current_row(rli);
>> +        error= unpack_current_row(rli, &m_cols);
>>    
>>        // at this moment m_curr_row_end should be set
>>        DBUG_ASSERT(error || m_curr_row_end != NULL); 
>> @@ -8238,9 +8254,8 @@ void Table_map_log_event::print(FILE *fi
>>  #if !defined(MYSQL_CLIENT)
>>  Write_rows_log_event::Write_rows_log_event(THD *thd_arg, TABLE *tbl_arg,
>>                                             ulong tid_arg,
>> -                                           MY_BITMAP const *cols,
>>                                             bool is_transactional)
>> -  : Rows_log_event(thd_arg, tbl_arg, tid_arg, cols, is_transactional)
>> +  : Rows_log_event(thd_arg, tbl_arg, tid_arg, tbl_arg->write_set,
> is_transactional)
>>  {
>>  }
>>  #endif
>> @@ -8444,15 +8459,18 @@ Rows_log_event::write_row(const Relay_lo
>>    /* fill table->record[0] with default values */
>>    bool abort_on_warnings= (rli->sql_thd->variables.sql_mode &
>>                             (MODE_STRICT_TRANS_TABLES |
> MODE_STRICT_ALL_TABLES));
>> -  if ((error= prepare_record(table, m_width,
>> +  if ((error= prepare_record(table, &m_cols,
>>                               table->file->ht->db_type !=
> DB_TYPE_NDBCLUSTER,
>>                               abort_on_warnings, m_curr_row == m_rows_buf)))
>>      DBUG_RETURN(error);
>>    
>>    /* unpack row into table->record[0] */
>> -  if ((error= unpack_current_row(rli, abort_on_warnings)))
>> +  if ((error= unpack_current_row(rli, &m_cols, abort_on_warnings)))
>>      DBUG_RETURN(error);
>>  
>> +  // Temporary fix to find out why it fails [/Matz]
>> +  memcpy(m_table->write_set->bitmap, m_cols.bitmap,
> (m_table->write_set->n_bits + 7) / 8);
>> +
> 
> Is this really necessary after your changes?
> 
>>    if (m_curr_row == m_rows_buf)
>>    {
>>      /* this is the first row to be inserted, we estimate the rows with
>> @@ -8477,6 +8495,8 @@ Rows_log_event::write_row(const Relay_lo
>>      TODO: Add safety measures against infinite looping. 
>>     */
>>  
>> +  m_table->mark_columns_per_binlog_row_image();
>> +
>>    while ((error= table->file->ha_write_row(table->record[0])))
>>    {
>>      if (error == HA_ERR_LOCK_DEADLOCK ||
>> @@ -8493,7 +8513,7 @@ Rows_log_event::write_row(const Relay_lo
>>          - or because the information which key is not available
>>        */
>>        table->file->print_error(error, MYF(0));
>> -      DBUG_RETURN(error);
>> +      goto error;
>>      }
>>      /*
>>         We need to retrieve the old row into record[1] to be able to
>> @@ -8515,7 +8535,7 @@ Rows_log_event::write_row(const Relay_lo
>>          if (error == HA_ERR_RECORD_DELETED)
>>            error= HA_ERR_KEY_NOT_FOUND;
>>          table->file->print_error(error, MYF(0));
>> -        DBUG_RETURN(error);
>> +        goto error;
>>        }
>>      }
>>      else
>> @@ -8525,7 +8545,8 @@ Rows_log_event::write_row(const Relay_lo
>>        if (table->file->extra(HA_EXTRA_FLUSH_CACHE))
>>        {
>>          DBUG_PRINT("info",("Error when setting HA_EXTRA_FLUSH_CACHE"));
>> -        DBUG_RETURN(my_errno);
>> +        error= my_errno;
>> +        goto error;
>>        }
>>  
>>        if (key.get() == NULL)
>> @@ -8534,7 +8555,8 @@ Rows_log_event::write_row(const Relay_lo
>>          if (key.get() == NULL)
>>          {
>>            DBUG_PRINT("info",("Can't allocate key buffer"));
>> -          DBUG_RETURN(ENOMEM);
>> +          error= ENOMEM;
>> +          goto error;
>>          }
>>        }
>>  
>> @@ -8550,7 +8572,7 @@ Rows_log_event::write_row(const Relay_lo
>>          if (error == HA_ERR_RECORD_DELETED)
>>            error= HA_ERR_KEY_NOT_FOUND;
>>          table->file->print_error(error, MYF(0));
>> -        DBUG_RETURN(error);
>> +        goto error;
>>        }
>>      }
>>  
>> @@ -8567,7 +8589,7 @@ Rows_log_event::write_row(const Relay_lo
>>      if (!get_flags(COMPLETE_ROWS_F))
>>      {
>>        restore_record(table,record[1]);
>> -      error= unpack_current_row(rli);
>> +      error= unpack_current_row(rli, &m_cols);
>>      }
>>  
>>  #ifndef DBUG_OFF
>> @@ -8612,7 +8634,7 @@ Rows_log_event::write_row(const Relay_lo
>>          table->file->print_error(error, MYF(0));
>>        }
>>        
>> -      DBUG_RETURN(error);
>> +      goto error;
>>      }
>>      else
>>      {
>> @@ -8621,12 +8643,14 @@ Rows_log_event::write_row(const Relay_lo
>>        {
>>          DBUG_PRINT("info",("ha_delete_row() returns error %d",error));
>>          table->file->print_error(error, MYF(0));
>> -        DBUG_RETURN(error);
>> +        goto error;
>>        }
>>        /* Will retry ha_write_row() with the offending row removed. */
>>      }
>>    }
>>  
>> +error:
>> +  m_table->default_column_bitmaps();
>>    DBUG_RETURN(error);
> 
> 
> Why do you need to that?
> What is the difference between the master and the slave in this case?
> Please, add a comment when you call mark_columns_per_binlog_row_image
> and after setting the default values.
> 
> 
>>  }
>>  
>> @@ -8668,7 +8692,7 @@ void Write_rows_log_event::print(FILE *f
>>  
>>    Returns TRUE if different.
>>  */
>> -static bool record_compare(TABLE *table)
>> +static bool record_compare(TABLE *table, MY_BITMAP *cols)
>>  {
>>    /*
>>      Need to set the X bit and the filler bits in both records since
>> @@ -8700,14 +8724,16 @@ static bool record_compare(TABLE *table)
>>      }
>>    }
>>  
>> -  if (table->s->blob_fields + table->s->varchar_fields == 0)
>> +  if (table->s->blob_fields + table->s->varchar_fields == 0
> &&
>> +      bitmap_is_set_all(cols))
>>    {
>>      result= cmp_record(table,record[1]);
>>      goto record_compare_exit;
>>    }
>>  
>>    /* Compare null bits */
>> -  if (memcmp(table->null_flags,
>> +  if (bitmap_is_set_all(cols) &&
>> +      memcmp(table->null_flags,
>>  	     table->null_flags+table->s->rec_buff_length,
>>  	     table->s->null_bytes))
>>    {
>> @@ -8716,12 +8742,17 @@ static bool record_compare(TABLE *table)
>>    }
>>  
>>    /* Compare updated fields */
>> -  for (Field **ptr=table->field ; *ptr ; ptr++)
>> +  for (Field **ptr=table->field ; 
>> +       *ptr && ((ptr - table->field) < cols->n_bits);
>> +       ptr++)
>>    {
>> -    if ((*ptr)->cmp_binary_offset(table->s->rec_buff_length))
>> +    if (bitmap_is_set(cols, (*ptr)->field_index))
>>      {
>> -      result= TRUE;
>> -      goto record_compare_exit;
>> +      if ((*ptr)->cmp_binary_offset(table->s->rec_buff_length))
>> +      {
>> +        result= TRUE;
>> +        goto record_compare_exit;
>> +      }
>>      }
>>    }
>>  
>> @@ -8744,6 +8775,159 @@ record_compare_exit:
>>    return result;
>>  }
>>  
>> +
>> +/**
>> +  Validates before image. Searches the bitmap for 
>> +  columns set. If no colum, for the existing table, 
>> +  then the image cannot be used for searching a
>> +  record (regardless of using position(), index scan
>> +  or table scan).
>> +
>> +  @param table   the table we are using.
>> +  @param bi_cols the bitmap that signals usable columns.
>> +
>> +  @return TRUE if invalid, FALSE otherwise.
>> +*/
>> +static
>> +my_bool invalid_bi(TABLE *table, MY_BITMAP *bi_cols)
>> +{
>> +
>> +  int nfields_set= 0;
>> +  for (Field **ptr=table->field ; 
>> +       *ptr && ((ptr - table->field) < bi_cols->n_bits);
>> +       ptr++)
>> +  {
>> +    if (bitmap_is_set(bi_cols, (*ptr)->field_index))
>> +      nfields_set++;
>> +  }
>> +
>> +  return (nfields_set == 0);
>> +}
> 
> 
> ok.
> 
>> +
>> +/**
>> +  Validates whether the before image is usable for the
>> +  given key. It can be the case that the before image
>> +  does not contain values for the key (eg, master was
>> +  using 'minimal' option for image logging and slave has
>> +  different index structure on the table).
>> +  
>> +  @param keyinfo  reference to key.
>> +  @param bi_cols  the bitmap that signals usable columns.
>> +
>> +  @return TRUE if usable, FALSE otherwise.
>> +*/
>> +static
>> +my_bool is_usable_key(KEY *keyinfo, MY_BITMAP *bi_cols)
>> +{
>> +  for (uint i=0 ; i < keyinfo->key_parts ;i++)
>> +  {
>> +    uint fieldnr= keyinfo->key_part[i].fieldnr - 1;
>> +    if (fieldnr >= bi_cols->n_bits || 
>> +        !bitmap_is_set(bi_cols, fieldnr))
>> +      return FALSE;
>> +  }
>> + 
>> +  return TRUE;
>> +}
>> +
>> +/**
>> +  Searches the table for a given key that can be used
>> +  according to the existing values, ie, columns set
>> +  in the bitmap.
>> +
>> +  The caller can specify which type of key to find by
>> +  setting the following flags in the key_type parameter:
>> +
>> +    - PRI_KEY_FLAG
>> +      Returns the primary key.
>> +
>> +    - UNIQUE_KEY_FLAG
>> +      Returns a unique key (flagged with HA_NOSAME)
>> +
>> +    - MULTIPLE_KEY_FLAG
>> +      Returns a key not unique nor PK.
>> +
>> +  The above flags can be used together, in which case, the
>> +  search is conducted in the above listed order. Eg, the 
>> +  following flag:
>> +
>> +    (PRI_KEY_FLAG | UNIQUE_KEY_FLAG | MULTIPLE_KEY_FLAG)
>> +
>> +  means that a primary key is returned if it is suitable. If
>> +  not then the unique keys are searched. If no unique key is
>> +  suitable, then the keys are searched. Finally, if no key
>> +  is suitable, MAX_KEY is returned.
>> +
>> +  @param table    reference to the table.
>> +  @param bi_cols  the bitmap that signals usable columns.
>> +  @param key_type the type of key to search.
>> +
>> +  @return MAX_KEY if no key, according to the key_type specified
>> +          is suitable. Returns the key otherwise.
>> +
>> +*/
>> +static
>> +uint
>> +search_key_for_bi(TABLE *table, MY_BITMAP *bi_cols, uint key_type)
>> +{
>> +  KEY *keyinfo;
>> +  uint res= MAX_KEY;
>> +  uint key;
>> +
>> +  if (key_type & PRI_KEY_FLAG)
>> +  {
>> +    if (table->s->primary_key < MAX_KEY)
>> +    {
>> +      keyinfo= table->s->key_info + (uint) table->s->primary_key;
>> +      if (is_usable_key(keyinfo, bi_cols)) 
>> +        return table->s->primary_key;
>> +    }
>> +  }
>> +
>> +  if (key_type & UNIQUE_KEY_FLAG)
>> +  {
>> +    if (table->s->uniques)
>> +    {
>> +      for (key=0,keyinfo= table->key_info ; 
>> +           (key < table->s->keys) && (res == MAX_KEY);
>> +           key++,keyinfo++)
>> +      {
>> +        if (!(keyinfo->flags & HA_NOSAME) || /* skip not unique */
>> +            (key == table->s->primary_key))  /* skip primary */
>> +          continue;
>> +        res= is_usable_key(keyinfo, bi_cols) ? key : MAX_KEY;
>> +
>> +        if (res < MAX_KEY)
>> +          return res;
>> +      }
>> +    }
>> +  }
>> +
>> +  if (key_type & MULTIPLE_KEY_FLAG)
>> +  {
>> +    if (table->s->keys)
>> +    {
>> +      for (key=0,keyinfo= table->key_info ; 
>> +           (key < table->s->keys) && (res == MAX_KEY);
>> +           key++,keyinfo++)
>> +      {
>> +        if (!(table->s->keys_in_use.is_set(key)) || /* key is no active
> */
>> +            (keyinfo->flags & HA_NOSAME) || /* skip uniques */
>> +            (key == table->s->primary_key)) /* skip primary */
>> +          continue;
>> +
>> +        res= is_usable_key(keyinfo, bi_cols) ? key : MAX_KEY;
>> +
>> +        if (res < MAX_KEY)
>> +          return res;
>> +      }
>> +    }
>> +  }
>> +
>> +  return res;
>> +}
>> +
>> +
>>  /**
>>    Locate the current row in event's table.
>>  
>> @@ -8771,6 +8955,7 @@ record_compare_exit:
>>    @c position() and @c rnd_pos() will be used. 
>>   */
>>  
>> +
>>  int Rows_log_event::find_row(const Relay_log_info *rli)
>>  {
>>    DBUG_ENTER("Rows_log_event::find_row");
>> @@ -8779,6 +8964,8 @@ int Rows_log_event::find_row(const Relay
>>  
>>    TABLE *table= m_table;
>>    int error= 0;
>> +  KEY *keyinfo;
>> +  uint key;
>>  
>>    /*
>>      rpl_row_tabledefs.test specifies that
>> @@ -8787,16 +8974,28 @@ int Rows_log_event::find_row(const Relay
>>      Todo: fix wl3228 hld that requires defauls for all types of events
>>    */
>>    
>> -  prepare_record(table, m_width, FALSE);
>> -  error= unpack_current_row(rli);
>> +  prepare_record(table, &m_cols, FALSE);
>> +  error= unpack_current_row(rli, &m_cols);
>> +
>> +  // Temporary fix to find out why it fails [/Matz]
>> +  memcpy(m_table->read_set->bitmap, m_cols.bitmap,
> (m_table->read_set->n_bits + 7) / 8);
>> +
>> +  if (invalid_bi(table, &m_cols))
>> +  {
>> +    error= HA_ERR_END_OF_FILE;
>> +    goto err;
>> +  }
>>  
>>  #ifndef DBUG_OFF
>>    DBUG_PRINT("info",("looking for the following record"));
>>    DBUG_DUMP("record[0]", table->record[0], table->s->reclength);
>>  #endif
>>  
>> -  if ((table->file->ha_table_flags() &
> HA_PRIMARY_KEY_REQUIRED_FOR_POSITION) &&
>> -      table->s->primary_key < MAX_KEY)
>> +  if ((key= search_key_for_bi(table, &m_cols, PRI_KEY_FLAG)) >= MAX_KEY)
>> +    /* we dont have a PK, or PK is not usable with BI values */
>> +    goto INDEX_SCAN;
>> +
>> +  if ((table->file->ha_table_flags() &
> HA_PRIMARY_KEY_REQUIRED_FOR_POSITION))
>>    {
>>      /*
>>        Use a more efficient method to fetch the record given by
>> @@ -8830,11 +9029,7 @@ int Rows_log_event::find_row(const Relay
>>  
>>    // We can't use position() - try other methods.
>>    
>> -  /* 
>> -    We need to retrieve all fields
>> -    TODO: Move this out from this function to main loop 
>> -   */
>> -  table->use_all_columns();
>> +INDEX_SCAN:
>>  
>>    /*
>>      Save copy of the record in table->record[1]. It might be needed 
>> @@ -8842,8 +9037,16 @@ int Rows_log_event::find_row(const Relay
>>     */ 
>>    store_record(table,record[1]);    
>>  
>> -  if (table->s->keys > 0)
>> +  if ((key= search_key_for_bi(table, &m_cols, 
>> +                              (PRI_KEY_FLAG | UNIQUE_KEY_FLAG |
> MULTIPLE_KEY_FLAG))) 
>> +       >= MAX_KEY)
>> +    /* we dont have a key, or no key is suitable for the BI values */
>> +    goto TABLE_SCAN; 
>> +
>>    {
>> +    keyinfo= table->key_info + key;
>> +
>> +
>>      DBUG_PRINT("info",("locating record using primary key (index_read)"));
>>  
>>      /* We have a key: search the table using the index */
>> @@ -8857,14 +9060,14 @@ int Rows_log_event::find_row(const Relay
>>      /* Fill key data for the row */
>>  
>>      DBUG_ASSERT(m_key);
>> -    key_copy(m_key, table->record[0], table->key_info, 0);
>> +    key_copy(m_key, table->record[0], keyinfo, 0);
>>  
>>      /*
>>        Don't print debug messages when running valgrind since they can
>>        trigger false warnings.
>>       */
>>  #ifndef HAVE_purify
>> -    DBUG_DUMP("key data", m_key, table->key_info->key_length);
>> +    DBUG_DUMP("key data", m_key, keyinfo->key_length);
>>  #endif
>>  
>>      /*
>> @@ -8911,7 +9114,7 @@ int Rows_log_event::find_row(const Relay
>>        found.  I can see no scenario where it would be incorrect to
>>        chose the row to change only using a PK or an UNNI.
>>      */
>> -    if (table->key_info->flags & HA_NOSAME)
>> +    if (keyinfo->flags & HA_NOSAME || key ==
> table->s->primary_key)
>>      {
>>        table->file->ha_index_end();
>>        goto ok;
>> @@ -8924,7 +9127,7 @@ int Rows_log_event::find_row(const Relay
>>       */ 
>>      DBUG_PRINT("info",("non-unique index, scanning it to find matching
> record")); 
>>  
>> -    while (record_compare(table))
>> +    while (record_compare(table, &m_cols))
>>      {
>>        /*
>>          We need to set the null bytes to ensure that the filler bit
>> @@ -8957,8 +9160,12 @@ int Rows_log_event::find_row(const Relay
>>        Have to restart the scan to be able to fetch the next row.
>>      */
>>      table->file->ha_index_end();
>> +    goto ok;
>>    }
>> -  else
>> +
>> +TABLE_SCAN:
>> +
>> +  /* All that we can do now is rely on a table scan */
>>    {
>>      DBUG_PRINT("info",("locating record using table scan (rnd_next)"));
>>  
>> @@ -8978,7 +9185,6 @@ int Rows_log_event::find_row(const Relay
>>      {
>>    restart_rnd_next:
>>        error= table->file->rnd_next(table->record[0]);
>> -
>>        DBUG_PRINT("info", ("error: %s", HA_ERR(error)));
>>        switch (error) {
>>  
>> @@ -9005,7 +9211,7 @@ int Rows_log_event::find_row(const Relay
>>          goto err;
>>        }
>>      }
>> -    while (restart_count < 2 && record_compare(table));
>> +    while (restart_count < 2 && record_compare(table, &m_cols));
>>      
>>      /* 
>>        Note: above record_compare will take into accout all record fields 
>> @@ -9041,9 +9247,9 @@ err:
>>  
>>  #ifndef MYSQL_CLIENT
>>  Delete_rows_log_event::Delete_rows_log_event(THD *thd_arg, TABLE *tbl_arg,
>> -                                             ulong tid, MY_BITMAP const *cols,
>> +                                             ulong tid,
>>                                               bool is_transactional)
>> -  : Rows_log_event(thd_arg, tbl_arg, tid, cols, is_transactional)
>> +  : Rows_log_event(thd_arg, tbl_arg, tid, tbl_arg->read_set,
> is_transactional)
>>  {
>>  }
>>  #endif /* #if !defined(MYSQL_CLIENT) */
>> @@ -9065,22 +9271,14 @@ Delete_rows_log_event::Delete_rows_log_e
>>  int 
>>  Delete_rows_log_event::do_before_row_operations(const Slave_reporting_capability
> *const)
>>  {
>> -  if ((m_table->file->ha_table_flags() &
> HA_PRIMARY_KEY_REQUIRED_FOR_POSITION) &&
>> -      m_table->s->primary_key < MAX_KEY)
>> -  {
>> -    /*
>> -      We don't need to allocate any memory for m_key since it is not used.
>> -    */
>> -    return 0;
>> -  }
>> -
>>    if (m_table->s->keys > 0)
>>    {
>>      // Allocate buffer for key searches
>> -    m_key= (uchar*)my_malloc(m_table->key_info->key_length, MYF(MY_WME));
>> +    m_key= (uchar*)my_malloc(MAX_KEY_LENGTH, MYF(MY_WME));
>>      if (!m_key)
>>        return HA_ERR_OUT_OF_MEM;
>>    }
>> +
>>    return 0;
>>  }
>>  
>> @@ -9103,10 +9301,13 @@ int Delete_rows_log_event::do_exec_row(c
>>  
>>    if (!(error= find_row(rli))) 
>>    { 
>> +
>> +    m_table->mark_columns_per_binlog_row_image();
>>      /*
>>        Delete the record found, located in record[0]
>>      */
>>      error= m_table->file->ha_delete_row(m_table->record[0]);
>> +    m_table->default_column_bitmaps();
>>    }
>>    return error;
>>  }
>> @@ -9132,21 +9333,10 @@ void Delete_rows_log_event::print(FILE *
>>  #if !defined(MYSQL_CLIENT)
>>  Update_rows_log_event::Update_rows_log_event(THD *thd_arg, TABLE *tbl_arg,
>>                                               ulong tid,
>> -                                             MY_BITMAP const *cols_bi,
>> -                                             MY_BITMAP const *cols_ai,
>>                                               bool is_transactional)
>> -: Rows_log_event(thd_arg, tbl_arg, tid, cols_bi, is_transactional)
>> +: Rows_log_event(thd_arg, tbl_arg, tid, tbl_arg->read_set, is_transactional)
>>  {
>> -  init(cols_ai);
>> -}
>> -
>> -Update_rows_log_event::Update_rows_log_event(THD *thd_arg, TABLE *tbl_arg,
>> -                                             ulong tid,
>> -                                             MY_BITMAP const *cols,
>> -                                             bool is_transactional)
>> -: Rows_log_event(thd_arg, tbl_arg, tid, cols, is_transactional)
>> -{
>> -  init(cols);
>> +  init(tbl_arg->write_set);
>>  }
>>  
>>  void Update_rows_log_event::init(MY_BITMAP const *cols)
>> @@ -9232,7 +9422,7 @@ Update_rows_log_event::do_exec_row(const
>>        able to skip to the next pair of updates
>>      */
>>      m_curr_row= m_curr_row_end;
>> -    unpack_current_row(rli);
>> +    unpack_current_row(rli, &m_cols_ai);
>>      return error;
>>    }
>>  
>> @@ -9253,7 +9443,7 @@ Update_rows_log_event::do_exec_row(const
>>                             (MODE_STRICT_TRANS_TABLES |
> MODE_STRICT_ALL_TABLES));
>>    m_curr_row= m_curr_row_end;
>>    /* this also updates m_curr_row_end */
>> -  if ((error= unpack_current_row(rli, abort_on_warnings)))
>> +  if ((error= unpack_current_row(rli, &m_cols_ai, abort_on_warnings)))
>>      return error;
>>  
>>    /*
>> @@ -9270,9 +9460,15 @@ Update_rows_log_event::do_exec_row(const
>>    DBUG_DUMP("new values", m_table->record[0], m_table->s->reclength);
>>  #endif
>>  
>> +  // Temporary fix to find out why it fails [/Matz]
>> +  memcpy(m_table->read_set->bitmap, m_cols.bitmap,
> (m_table->read_set->n_bits + 7) / 8);
>> +  memcpy(m_table->write_set->bitmap, m_cols_ai.bitmap,
> (m_table->write_set->n_bits + 7) / 8);
>> +
>> +  m_table->mark_columns_per_binlog_row_image();
>>    error= m_table->file->ha_update_row(m_table->record[1],
> m_table->record[0]);
>>    if (error == HA_ERR_RECORD_IS_THE_SAME)
>>      error= 0;
>> +  m_table->default_column_bitmaps();
>>  
>>    return error;
>>  }
>>
>> === modified file 'sql/log_event.h'
>> --- a/sql/log_event.h	2009-10-01 17:22:44 +0000
>> +++ b/sql/log_event.h	2009-11-06 01:09:40 +0000
>> @@ -3563,13 +3563,14 @@ protected:
>>  
>>    // Unpack the current row into m_table->record[0]
>>    int unpack_current_row(const Relay_log_info *const rli,
>> +                         MY_BITMAP const *cols,
>>                           const bool abort_on_warning= TRUE)
>>    { 
>>      DBUG_ASSERT(m_table);
>>  
>>      bool first_row= (m_curr_row == m_rows_buf);
>>      ASSERT_OR_RETURN_ERROR(m_curr_row < m_rows_end, HA_ERR_CORRUPT_EVENT);
>> -    int const result= ::unpack_row(rli, m_table, m_width, m_curr_row,
> &m_cols,
>> +    int const result= ::unpack_row(rli, m_table, m_width, m_curr_row, cols,
>>                                     &m_curr_row_end,
> &m_master_reclength,
>>                                     abort_on_warning, first_row);
>>      if (m_curr_row_end > m_rows_end)
>> @@ -3661,7 +3662,7 @@ public:
>>  
>>  #if !defined(MYSQL_CLIENT)
>>    Write_rows_log_event(THD*, TABLE*, ulong table_id, 
>> -		       MY_BITMAP const *cols, bool is_transactional);
>> +		       bool is_transactional);
>>  #endif
>>  #ifdef HAVE_REPLICATION
>>    Write_rows_log_event(const char *buf, uint event_len, 
>> @@ -3670,14 +3671,12 @@ public:
>>  #if !defined(MYSQL_CLIENT) 
>>    static bool binlog_row_logging_function(THD *thd, TABLE *table,
>>                                            bool is_transactional,
>> -                                          MY_BITMAP *cols,
>> -                                          uint fields,
>>                                            const uchar *before_record
>>                                            __attribute__((unused)),
>>                                            const uchar *after_record)
>>    {
>>      return thd->binlog_write_row(table, is_transactional,
>> -                                 cols, fields, after_record);
>> +                                 after_record);
>>    }
>>  #endif
>>  
>> @@ -3724,7 +3723,6 @@ public:
>>                          bool is_transactional);
>>  
>>    Update_rows_log_event(THD*, TABLE*, ulong table_id,
>> -			MY_BITMAP const *cols,
>>                          bool is_transactional);
>>  
>>    void init(MY_BITMAP const *cols);
>> @@ -3740,13 +3738,11 @@ public:
>>  #if !defined(MYSQL_CLIENT) 
>>    static bool binlog_row_logging_function(THD *thd, TABLE *table,
>>                                            bool is_transactional,
>> -                                          MY_BITMAP *cols,
>> -                                          uint fields,
>>                                            const uchar *before_record,
>>                                            const uchar *after_record)
>>    {
>>      return thd->binlog_update_row(table, is_transactional,
>> -                                  cols, fields, before_record, after_record);
>> +                                  before_record, after_record);
>>    }
>>  #endif
>>  
>> @@ -3800,7 +3796,7 @@ public:
>>  
>>  #ifndef MYSQL_CLIENT
>>    Delete_rows_log_event(THD*, TABLE*, ulong, 
>> -			MY_BITMAP const *cols, bool is_transactional);
>> +			bool is_transactional);
>>  #endif
>>  #ifdef HAVE_REPLICATION
>>    Delete_rows_log_event(const char *buf, uint event_len, 
>> @@ -3809,14 +3805,12 @@ public:
>>  #if !defined(MYSQL_CLIENT) 
>>    static bool binlog_row_logging_function(THD *thd, TABLE *table,
>>                                            bool is_transactional,
>> -                                          MY_BITMAP *cols,
>> -                                          uint fields,
>>                                            const uchar *before_record,
>>                                            const uchar *after_record
>>                                            __attribute__((unused)))
>>    {
>>      return thd->binlog_delete_row(table, is_transactional,
>> -                                  cols, fields, before_record);
>> +                                  before_record);
>>    }
>>  #endif
>>    
>>
>> === modified file 'sql/log_event_old.cc'
>> --- a/sql/log_event_old.cc	2009-10-09 13:26:37 +0000
>> +++ b/sql/log_event_old.cc	2009-11-06 01:09:40 +0000
>> @@ -2007,7 +2007,7 @@ Old_rows_log_event::write_row(const Rela
>>  
>>    /* fill table->record[0] with default values */
>>  
>> -  if ((error= prepare_record(table, m_width,
>> +  if ((error= prepare_record(table, table->write_set,
>>                               TRUE /* check if columns have def. values */)))
>>      DBUG_RETURN(error);
>>    
>> @@ -2222,7 +2222,7 @@ int Old_rows_log_event::find_row(const R
>>    /* unpack row - missing fields get default values */
>>  
>>    // TODO: shall we check and report errors here?
>> -  prepare_record(table, m_width, FALSE /* don't check errors */); 
>> +  prepare_record(table, table->read_set, FALSE /* don't check errors */); 
>>    error= unpack_current_row(rli); 
>>  
>>  #ifndef DBUG_OFF
>>
>> === modified file 'sql/log_event_old.h'
>> --- a/sql/log_event_old.h	2007-12-05 19:00:14 +0000
>> +++ b/sql/log_event_old.h	2009-11-06 01:09:40 +0000
>> @@ -358,7 +358,8 @@ class Write_rows_log_event_old : public 
>>  public:
>>  #if !defined(MYSQL_CLIENT)
>>    Write_rows_log_event_old(THD*, TABLE*, ulong table_id,
>> -                           MY_BITMAP const *cols, bool is_transactional);
>> +                           MY_BITMAP const *cols,
>> +                           bool is_transactional);
>>  #endif
>>  #ifdef HAVE_REPLICATION
>>    Write_rows_log_event_old(const char *buf, uint event_len,
>> @@ -367,14 +368,12 @@ public:
>>  #if !defined(MYSQL_CLIENT) 
>>    static bool binlog_row_logging_function(THD *thd, TABLE *table,
>>                                            bool is_transactional,
>> -                                          MY_BITMAP *cols,
>> -                                          uint fields,
>>                                            const uchar *before_record
>>                                            __attribute__((unused)),
>>                                            const uchar *after_record)
>>    {
>>      return thd->binlog_write_row(table, is_transactional,
>> -                                 cols, fields, after_record);
>> +                                 after_record);
>>    }
>>  #endif
>>  
>> @@ -444,13 +443,11 @@ public:
>>  #if !defined(MYSQL_CLIENT) 
>>    static bool binlog_row_logging_function(THD *thd, TABLE *table,
>>                                            bool is_transactional,
>> -                                          MY_BITMAP *cols,
>> -                                          uint fields,
>>                                            const uchar *before_record,
>>                                            const uchar *after_record)
>>    {
>>      return thd->binlog_update_row(table, is_transactional,
>> -                                  cols, fields, before_record, after_record);
>> +                                  before_record, after_record);
>>    }
>>  #endif
>>  
>> @@ -509,7 +506,8 @@ class Delete_rows_log_event_old : public
>>  public:
>>  #ifndef MYSQL_CLIENT
>>    Delete_rows_log_event_old(THD*, TABLE*, ulong,
>> -                            MY_BITMAP const *cols, bool is_transactional);
>> +                            MY_BITMAP const *cols,
>> +                            bool is_transactional);
>>  #endif
>>  #ifdef HAVE_REPLICATION
>>    Delete_rows_log_event_old(const char *buf, uint event_len,
>> @@ -518,14 +516,12 @@ public:
>>  #if !defined(MYSQL_CLIENT) 
>>    static bool binlog_row_logging_function(THD *thd, TABLE *table,
>>                                            bool is_transactional,
>> -                                          MY_BITMAP *cols,
>> -                                          uint fields,
>>                                            const uchar *before_record,
>>                                            const uchar *after_record
>>                                            __attribute__((unused)))
>>    {
>>      return thd->binlog_delete_row(table, is_transactional,
>> -                                  cols, fields, before_record);
>> +                                  before_record);
>>    }
>>  #endif
>>    
> 
> 
> ok.
> 
>> === modified file 'sql/mysql_priv.h'
>> --- a/sql/mysql_priv.h	2009-09-29 14:40:52 +0000
>> +++ b/sql/mysql_priv.h	2009-11-06 01:09:40 +0000
>> @@ -1860,6 +1860,7 @@ extern ulong binlog_cache_size, open_fil
>>  extern ulonglong max_binlog_cache_size;
>>  extern ulong max_binlog_size, max_relay_log_size;
>>  extern ulong opt_binlog_rows_event_max_size;
>> +extern ulong opt_binlog_row_image_id;
>>  extern ulong rpl_recovery_rank, thread_cache_size, thread_pool_size;
>>  extern ulong back_log;
>>  #endif /* MYSQL_SERVER */
>> @@ -1983,6 +1984,7 @@ extern TYPELIB thread_handling_typelib;
>>  extern uint8 uc_update_queries[SQLCOM_END+1];
>>  extern uint sql_command_flags[];
>>  extern TYPELIB log_output_typelib;
>> +extern TYPELIB binlog_row_image_typelib;
>>  
>>  /* optional things, have_* variables */
>>  extern SHOW_COMP_OPTION have_community_features;
>>
>> === modified file 'sql/mysqld.cc'
>> --- a/sql/mysqld.cc	2009-10-03 10:50:25 +0000
>> +++ b/sql/mysqld.cc	2009-11-06 01:09:40 +0000
>> @@ -506,6 +506,14 @@ TYPELIB binlog_format_typelib=
>>      binlog_format_names, NULL };
>>  ulong opt_binlog_format_id= (ulong) BINLOG_FORMAT_UNSPEC;
>>  const char *opt_binlog_format= binlog_format_names[opt_binlog_format_id];
>> +
>> +const char *binlog_row_image_names[]= {"MINIMAL", "NOBLOB", "FULL", NullS};
>> +TYPELIB binlog_row_image_typelib=
>> +  { array_elements(binlog_row_image_names) - 1, "",
>> +    binlog_row_image_names, NULL };
>> +ulong opt_binlog_row_image_id= (ulong) BINLOG_ROW_IMAGE_FULL;
>> +const char *opt_binlog_row_image_arg=
>> +  binlog_row_image_names[BINLOG_ROW_IMAGE_MINIMAL];
>>  #ifdef HAVE_INITGROUPS
>>  static bool calling_initgroups= FALSE; /**< Used in SIGSEGV handler. */
>>  #endif
>> @@ -5685,7 +5693,8 @@ enum options_mysqld
>>    OPT_IGNORE_BUILTIN_INNODB,
>>    OPT_SYNC_RELAY_LOG,
>>    OPT_SYNC_RELAY_LOG_INFO,
>> -  OPT_SYNC_MASTER_INFO
>> +  OPT_SYNC_MASTER_INFO,
>> +  OPT_BINLOG_ROW_IMAGE
>>  };
>>  
>>  
>> @@ -7038,6 +7047,16 @@ The minimum value for this variable is 4
>>     (uchar**) &max_system_variables.net_wait_timeout, 0, GET_ULONG,
>>     REQUIRED_ARG, NET_WAIT_TIMEOUT, 1, IF_WIN(INT_MAX32/1000, LONG_TIMEOUT),
>>     0, 1, 0},
> 
> 
>> +  {"binlog-row-image", OPT_BINLOG_ROW_IMAGE,
>> +    "Controls wether the records should be logged in a full, reversible or "
>> +    "minimal manner. Full means that all columns are stored in the before "
>> +    "and after image are logged. Reversible means that PK and changed "
>> +    "columns are stored in the before and after image. Minimal means that PK "
>> +    "columns are stored in before image and changed columns in the after "
>> +    "image. (Default: minimal.)", 
> 
> Don't use the expression manner, replace it by the possible values.
> Reversible? I thought we would postpone its implementation?
> And blobs?
> Is minimal the default?
> 
>> +    (uchar **) &opt_binlog_row_image_arg,
>> +    (uchar **) &opt_binlog_row_image_arg,
>> +    0, GET_STR, REQUIRED_ARG, 0, 0, 0, 0, 0, 0},
>>    {0, 0, 0, 0, 0, 0, GET_NO_ARG, NO_ARG, 0, 0, 0, 0, 0, 0}
>>  };
>>  
>> @@ -7779,6 +7798,8 @@ static int mysql_init_variables(void)
>>    global_system_variables.old_passwords= 0;
>>    global_system_variables.old_alter_table= 0;
>>    global_system_variables.binlog_format= BINLOG_FORMAT_UNSPEC;
>> +  global_system_variables.binlog_row_image= BINLOG_ROW_IMAGE_FULL;
>> +
>>    /*
>>      Default behavior for 4.1 and 5.0 is to treat NULL values as unequal
>>      when collecting index statistics for MyISAM tables.
>> @@ -8044,6 +8065,13 @@ mysqld_get_one_option(int optid,
>>      global_system_variables.binlog_format= opt_binlog_format_id= id - 1;
>>      break;
>>    }
>> +  case OPT_BINLOG_ROW_IMAGE:
>> +  {
>> +    int id;
>> +    id= find_type_or_exit(argument, &binlog_row_image_typelib,
> opt->name);
>> +    global_system_variables.binlog_row_image= opt_binlog_row_image_id= id - 1;
>> +    break;
>> +  }
>>    case (int)OPT_BINLOG_DO_DB:
>>    {
>>      binlog_filter->add_do_db(argument);
>>
>> === modified file 'sql/rpl_injector.cc'
>> --- a/sql/rpl_injector.cc	2008-02-19 11:43:01 +0000
>> +++ b/sql/rpl_injector.cc	2009-11-06 01:09:40 +0000
>> @@ -109,14 +109,15 @@ int injector::transaction::write_row (se
>>  				      record_type record)
>>  {
>>     DBUG_ENTER("injector::transaction::write_row(...)");
>> -
>>     if (int error= check_state(ROW_STATE))
>>       DBUG_RETURN(error);
>>  
>>     server_id_type save_id= m_thd->server_id;
>>     m_thd->set_server_id(sid);
>> -   m_thd->binlog_write_row(tbl.get_table(), tbl.is_transactional(), 
>> -                           cols, colcnt, record);
>> +
>> +   table::save_sets saveset(tbl, cols, cols);
>> +
>> +   m_thd->binlog_write_row(tbl.get_table(), tbl.is_transactional(), record);
>>     m_thd->set_server_id(save_id);
>>     DBUG_RETURN(0);
>>  }
>> @@ -133,8 +134,8 @@ int injector::transaction::delete_row(se
>>  
>>     server_id_type save_id= m_thd->server_id;
>>     m_thd->set_server_id(sid);
>> -   m_thd->binlog_delete_row(tbl.get_table(), tbl.is_transactional(), 
>> -                            cols, colcnt, record);
>> +   table::save_sets saveset(tbl, cols, cols);
>> +   m_thd->binlog_delete_row(tbl.get_table(), tbl.is_transactional(),
> record);
>>     m_thd->set_server_id(save_id);
>>     DBUG_RETURN(0);
>>  }
>> @@ -151,8 +152,11 @@ int injector::transaction::update_row(se
>>  
>>     server_id_type save_id= m_thd->server_id;
>>     m_thd->set_server_id(sid);
>> -   m_thd->binlog_update_row(tbl.get_table(), tbl.is_transactional(),
>> -		            cols, colcnt, before, after);
>> +
>> +   // The read- and write sets with autorestore
> 
> Please, elaborate on this.
> 
>> +   table::save_sets saveset(tbl, cols, cols);
>> +
>> +   m_thd->binlog_update_row(tbl.get_table(), tbl.is_transactional(), before,
> after);
>>     m_thd->set_server_id(save_id);
>>     DBUG_RETURN(0);
>>  }
> 
> 
> Why do I need this if before writting to the binary log we save bitmap and after
> restore?
> Can you elaborate and also add some comments?
> Can you also explain why the readset is equal to the writeset?
> 
>> === modified file 'sql/rpl_injector.h'
>> --- a/sql/rpl_injector.h	2007-05-10 09:59:39 +0000
>> +++ b/sql/rpl_injector.h	2009-11-06 01:09:40 +0000
>> @@ -117,6 +117,27 @@ public:
>>        class table 
>>        {
>>        public:
>> +        class save_sets {
>> +        public:
>> +          save_sets(table const &tbl, MY_BITMAP const *new_rs, MY_BITMAP
> const *new_ws)
>> +            : m_table(tbl.get_table()),
>> +              save_read_set(m_table->read_set),
>> +              save_write_set(m_table->write_set)
>> +          {
>> +           
> m_table->column_bitmaps_set(const_cast<MY_BITMAP*>(new_rs),
>> +                                        const_cast<MY_BITMAP*>(new_ws));
>> +          }
>> +
>> +          ~save_sets() {
>> +            m_table->column_bitmaps_set(save_read_set, save_write_set);
>> +          }
>> +
>> +        private:
>> +          TABLE *m_table;
>> +          MY_BITMAP *save_read_set;
>> +          MY_BITMAP *save_write_set;
>> +        };
>> +
>>          table(TABLE *table, bool is_transactional) 
>>              : m_table(table), m_is_transactional(is_transactional)
>>          { 
>>
>> === modified file 'sql/rpl_record.cc'
>> --- a/sql/rpl_record.cc	2009-09-29 14:18:44 +0000
>> +++ b/sql/rpl_record.cc	2009-11-06 01:09:40 +0000
>> @@ -75,15 +75,20 @@ pack_row(TABLE *table, MY_BITMAP const* 
>>    unsigned int null_bits= (1U << 8) - 1;
>>    // Mask to mask out the correct but among the null bits
>>    unsigned int null_mask= 1U;
>> +  DBUG_PRINT("debug", ("null ptr: 0x%lx; row start: %p; null bytes: %d",
>> +                       (ulong) null_ptr, row_data, null_byte_count));
>> +  DBUG_DUMP("cols", (uchar*) cols->bitmap, cols->last_word_ptr -
> cols->bitmap + 1);
>>    for ( ; (field= *p_field) ; p_field++)
>>    {
>> -    DBUG_PRINT("debug", ("null_mask=%d; null_ptr=%p; row_data=%p;
> null_byte_count=%d",
>> -                         null_mask, null_ptr, row_data, null_byte_count));
>> +    DBUG_PRINT("debug", ("field: %s; null mask: 0x%x",
>> +                         field->field_name, null_mask));
>>      if (bitmap_is_set(cols, p_field - table->field))
>>      {
>>        my_ptrdiff_t offset;
>>        if (field->is_null(rec_offset))
>>        {
>> +        DBUG_PRINT("debug", ("Is NULL; null_mask: 0x%x; null_bits: 0x%x",
>> +                             null_mask, null_bits));
>>          offset= def_offset;
>>          null_bits |= null_mask;
>>        }
>> @@ -104,9 +109,9 @@ pack_row(TABLE *table, MY_BITMAP const* 
>>  #endif
>>          pack_ptr= field->pack(pack_ptr, field->ptr + offset,
>>                                field->max_data_length(), TRUE);
>> -        DBUG_PRINT("debug", ("field: %s; pack_ptr: 0x%lx;"
>> +        DBUG_PRINT("debug", ("Packed into row; pack_ptr: 0x%lx;"
>>                               " pack_ptr':0x%lx; bytes: %d",
>> -                             field->field_name, (ulong) old_pack_ptr,
>> +                             (ulong) old_pack_ptr,
>>                               (ulong) pack_ptr,
>>                               (int) (pack_ptr - old_pack_ptr)));
>>        }
>> @@ -120,6 +125,12 @@ pack_row(TABLE *table, MY_BITMAP const* 
>>          null_bits= (1U << 8) - 1;
>>        }
>>      }
>> +#ifndef DBUG_OFF
>> +    else
>> +    {
>> +      DBUG_PRINT("debug", ("Skipped"));
>> +    }
>> +#endif
>>    }
>>  
>>    /*
>> @@ -207,6 +218,11 @@ unpack_row(Relay_log_info const *rli,
>>    {
>>      Field *const f= *field_ptr;
>>  
>> +    DBUG_PRINT("debug", ("field: %s; null mask: 0x%x; null bits: 0x%lx;"
>> +                         " row start: %p; null bytes: %ld",
>> +                         f->field_name, null_mask, (ulong) null_bits,
>> +                         pack_ptr, (ulong) master_null_byte_count));
>> +
>>      /*
>>        No need to bother about columns that does not exist: they have
>>        gotten default values when being emptied above.
>> @@ -268,15 +284,20 @@ unpack_row(Relay_log_info const *rli,
>>          uchar const *const old_pack_ptr= pack_ptr;
>>  #endif
>>          pack_ptr= f->unpack(f->ptr, pack_ptr, metadata, TRUE);
>> -	DBUG_PRINT("debug", ("field: %s; metadata: 0x%x;"
>> +	DBUG_PRINT("debug", ("Unpacked; metadata: 0x%x;"
>>                               " pack_ptr: 0x%lx; pack_ptr': 0x%lx; bytes: %d",
>> -                             f->field_name, metadata,
>> -                             (ulong) old_pack_ptr, (ulong) pack_ptr,
>> +                             metadata, (ulong) old_pack_ptr, (ulong) pack_ptr,
>>                               (int) (pack_ptr - old_pack_ptr)));
>>        }
>>  
>>        null_mask <<= 1;
>>      }
>> +#ifndef DBUG_OFF
>> +    else
>> +    {
>> +      DBUG_PRINT("debug", ("Non-existent: skipped"));
>> +    }
>> +#endif
>>      i++;
>>    }
>>  
>> @@ -331,8 +352,6 @@ unpack_row(Relay_log_info const *rli,
>>    be NULL. Otherwise error is reported.
>>   
>>    @param table  Table whose record[0] buffer is prepared. 
>> -  @param skip   Number of columns for which default/nullable check 
>> -                should be skipped.
>>    @param check  Specifies if lack of default error needs checking.
>>    @param abort_on_warning
>>                  Controls how to react on lack of a field's default.
>> @@ -341,8 +360,7 @@ unpack_row(Relay_log_info const *rli,
>>                  
>>    @returns 0 on success or a handler level error code
>>   */ 
>> -int prepare_record(TABLE *const table, 
>> -                   const uint skip, const bool check,
>> +int prepare_record(TABLE *const table, const MY_BITMAP *cols, const bool check,
>>                     const bool abort_on_warning, const bool first_row)
>>  {
>>    DBUG_ENTER("prepare_record");
>> @@ -350,12 +368,7 @@ int prepare_record(TABLE *const table, 
>>    int error= 0;
>>    restore_record(table, s->default_values);
>>  
>> -  /*
>> -     This skip should be revisited in 6.0, because in 6.0 RBR one 
>> -     can have holes in the row (as the grain of the writeset is 
>> -     the column and not the entire row).
>> -   */
>> -  if (skip >= table->s->fields || !check)
>> +  if (!check)
>>      DBUG_RETURN(0);
>>  
>>    /*
>> @@ -364,34 +377,38 @@ int prepare_record(TABLE *const table, 
>>      explicit value for a field not having the explicit default 
>>      (@c check_that_all_fields_are_given_values()).
>>    */
>> -  for (Field **field_ptr= table->field+skip; *field_ptr; ++field_ptr)
>> +  
>> +  DBUG_PRINT_BITSET("debug", "cols: %s", cols);
>> +  for (Field **field_ptr= table->field; *field_ptr; ++field_ptr)
>>    {
>> -    uint32 const mask= NOT_NULL_FLAG | NO_DEFAULT_VALUE_FLAG;
>> -    Field *const f= *field_ptr;
>> -    if ((f->flags &  NO_DEFAULT_VALUE_FLAG) &&
>> -        (f->real_type() != MYSQL_TYPE_ENUM))
>> -    {
>> -
>> -      MYSQL_ERROR::enum_warning_level error_type=
>> -        MYSQL_ERROR::WARN_LEVEL_NOTE;
>> -      if (abort_on_warning && (table->file->has_transactions() ||
>> -                               first_row))
>> -      {
>> -        error= HA_ERR_ROWS_EVENT_APPLY;
>> -        error_type= MYSQL_ERROR::WARN_LEVEL_ERROR;
>> -      }
>> -      else
>> +    if ((uint) (field_ptr - table->field) >= cols->n_bits ||
>> +        !bitmap_is_set(cols, field_ptr - table->field))
>> +    {   
>> +      Field *const f= *field_ptr;
>> +      if ((f->flags &  NO_DEFAULT_VALUE_FLAG) &&
>> +          (f->real_type() != MYSQL_TYPE_ENUM))
>>        {
>> -        f->set_default();
>> -        error_type= MYSQL_ERROR::WARN_LEVEL_WARN;
>> +        MYSQL_ERROR::enum_warning_level error_type=
>> +          MYSQL_ERROR::WARN_LEVEL_NOTE;
>> +        if (abort_on_warning && (table->file->has_transactions()
> ||
>> +                                 first_row))
>> +        {
>> +          error= HA_ERR_ROWS_EVENT_APPLY;
>> +          error_type= MYSQL_ERROR::WARN_LEVEL_ERROR;
>> +        }
>> +        else
>> +        {
>> +          DBUG_PRINT("debug", ("Set default; field: %s", f->field_name));
>> +          f->set_default();
>> +          error_type= MYSQL_ERROR::WARN_LEVEL_WARN;
>> +        }
>> +        push_warning_printf(current_thd, error_type,
>> +                            ER_NO_DEFAULT_FOR_FIELD,
>> +                            ER(ER_NO_DEFAULT_FOR_FIELD),
>> +                            f->field_name);
>>        }
>> -      push_warning_printf(current_thd, error_type,
>> -                          ER_NO_DEFAULT_FOR_FIELD,
>> -                          ER(ER_NO_DEFAULT_FOR_FIELD),
>> -                          f->field_name);
>>      }
>>    }
>> -
>>    DBUG_RETURN(error);
>>  }
>>  
>>
>> === modified file 'sql/rpl_record.h'
>> --- a/sql/rpl_record.h	2009-09-29 14:18:44 +0000
>> +++ b/sql/rpl_record.h	2009-11-06 01:09:40 +0000
>> @@ -31,7 +31,8 @@ int unpack_row(Relay_log_info const *rli
>>                 const bool abort_on_warning= TRUE, const bool first_row= TRUE);
>>  
>>  // Fill table's record[0] with default values.
>> -int prepare_record(TABLE *const table, const uint skip, const bool check,
>> +int prepare_record(TABLE *const table, const MY_BITMAP *cols,
>> +                   const bool check,
>>                     const bool abort_on_warning= TRUE,
>>                     const bool first_row= TRUE);
>>  #endif
> 
> ok.
> 
>> === modified file 'sql/set_var.cc'
>> --- a/sql/set_var.cc	2009-09-29 14:40:52 +0000
>> +++ b/sql/set_var.cc	2009-11-06 01:09:40 +0000
>> @@ -180,6 +180,9 @@ static sys_var_long_ptr	sys_binlog_cache
>>  					      &binlog_cache_size);
>>  static sys_var_thd_binlog_format sys_binlog_format(&vars, "binlog_format",
>>                                              &SV::binlog_format);
>> +static sys_var_thd_binlog_row_image sys_binlog_row_image(&vars,
>> +                                                         "binlog_row_image",
>> +                                                        
> &SV::binlog_row_image);
>>  static sys_var_thd_ulong	sys_bulk_insert_buff_size(&vars,
> "bulk_insert_buffer_size",
>>  						  &SV::bulk_insert_buff_size);
>>  static sys_var_const_os         sys_character_sets_dir(&vars,
>> @@ -1289,12 +1292,23 @@ bool sys_var_thd_binlog_format::is_reado
>>    return sys_var_thd_enum::is_readonly();
>>  }
> 
>>  
>> -
> 
> Please, avoid this.
> 
>>  void fix_binlog_format_after_update(THD *thd, enum_var_type type)
>>  {
>>    thd->reset_current_stmt_binlog_row_based();
>>  }
>>  
>> +bool sys_var_thd_binlog_row_image::check(THD *thd, set_var *var) {
>> +  /*
>> +    All variables that affect writing to binary log (either format or
>> +    turning logging on and off) use the same checking. We call the
>> +    superclass ::check function to assign the variable correctly, and
>> +    then check the value.
>> +   */
> 
> Please, improve the comment.
> 
>> +  bool result= sys_var_thd_enum::check(thd, var);
>> +  if (!result)
>> +    result= check_log_update(thd, var);
>> +  return result;
>> +}
>>  
>>  static void fix_max_binlog_size(THD *thd, enum_var_type type)
>>  {
>>
>> === modified file 'sql/set_var.h'
>> --- a/sql/set_var.h	2009-09-29 14:40:52 +0000
>> +++ b/sql/set_var.h	2009-11-06 01:09:40 +0000
>> @@ -1281,6 +1281,19 @@ public:
>>    bool is_readonly() const;
>>  };
>>  
>> +class sys_var_thd_binlog_row_image :public sys_var_thd_enum
>> +{
>> +public:
>> +  sys_var_thd_binlog_row_image(sys_var_chain *chain, const char *name_arg, 
>> +                            ulong SV::*offset_arg)
>> +    :sys_var_thd_enum(chain, name_arg, offset_arg,
>> +                      &binlog_row_image_typelib,
>> +                      NULL)
>> +  {};
>> +  bool check(THD *thd, set_var *var);
>> +};
>> +
>> +
>>  /****************************************************************************
>>    Classes for parsing of the SET command
>>  ****************************************************************************/
>>
>> === modified file 'sql/sql_class.cc'
>> --- a/sql/sql_class.cc	2009-10-03 10:50:25 +0000
>> +++ b/sql/sql_class.cc	2009-11-06 01:09:40 +0000
>> @@ -3306,8 +3306,6 @@ void xid_cache_delete(XID_STATE *xid_sta
>>  
>>  template <class RowsEventT> Rows_log_event* 
>>  THD::binlog_prepare_pending_rows_event(TABLE* table, uint32 serv_id,
>> -                                       MY_BITMAP const* cols,
>> -                                       size_t colcnt,
>>                                         size_t needed,
>>                                         bool is_transactional,
>>  				       RowsEventT *hint __attribute__((unused)))
>> @@ -3345,13 +3343,11 @@ THD::binlog_prepare_pending_rows_event(T
>>        pending->server_id != serv_id || 
>>        pending->get_table_id() != table->s->table_map_id ||
>>        pending->get_type_code() != type_code || 
>> -      pending->get_data_size() + needed > opt_binlog_rows_event_max_size
> || 
>> -      pending->get_width() != colcnt ||
>> -      !bitmap_cmp(pending->get_cols(), cols)) 
>> +      pending->get_data_size() + needed > opt_binlog_rows_event_max_size)
> 
>>    {
>>      /* Create a new RowsEventT... */
>>      Rows_log_event* const
>> -	ev= new RowsEventT(this, table, table->s->table_map_id, cols,
>> +	ev= new RowsEventT(this, table, table->s->table_map_id,
>>                             is_transactional);
>>      if (unlikely(!ev))
>>        DBUG_RETURN(NULL);
>> @@ -3377,18 +3373,15 @@ THD::binlog_prepare_pending_rows_event(T
>>    compiling option.
>>  */
>>  template Rows_log_event*
>> -THD::binlog_prepare_pending_rows_event(TABLE*, uint32, MY_BITMAP const*,
>> -				       size_t, size_t, bool,
>> +THD::binlog_prepare_pending_rows_event(TABLE*, uint32, size_t, bool,
>>  				       Write_rows_log_event*);
>>  
>>  template Rows_log_event*
>> -THD::binlog_prepare_pending_rows_event(TABLE*, uint32, MY_BITMAP const*,
>> -				       size_t colcnt, size_t, bool,
>> +THD::binlog_prepare_pending_rows_event(TABLE*, uint32, size_t, bool,
>>  				       Delete_rows_log_event *);
>>  
>>  template Rows_log_event* 
>> -THD::binlog_prepare_pending_rows_event(TABLE*, uint32, MY_BITMAP const*,
>> -				       size_t colcnt, size_t, bool,
>> +THD::binlog_prepare_pending_rows_event(TABLE*, uint32, size_t, bool,
>>  				       Update_rows_log_event *);
>>  #endif
>>  
>> @@ -3580,7 +3573,6 @@ namespace {
>>  
>>  
>>  int THD::binlog_write_row(TABLE* table, bool is_trans, 
>> -                          MY_BITMAP const* cols, size_t colcnt, 
>>                            uchar const *record) 
>>  { 
>>    DBUG_ASSERT(current_stmt_binlog_row_based &&
> mysql_bin_log.is_open());
>> @@ -3595,11 +3587,10 @@ int THD::binlog_write_row(TABLE* table, 
>>  
>>    uchar *row_data= memory.slot(0);
>>  
>> -  size_t const len= pack_row(table, cols, row_data, record);
>> +  size_t const len= pack_row(table, table->write_set, row_data, record);
>>  
>>    Rows_log_event* const ev=
>> -    binlog_prepare_pending_rows_event(table, server_id, cols, colcnt,
>> -                                      len, is_trans,
>> +    binlog_prepare_pending_rows_event(table, server_id, len, is_trans,
>>                                       
> static_cast<Write_rows_log_event*>(0));
>>  
>>    if (unlikely(ev == 0))
>> @@ -3609,12 +3600,26 @@ int THD::binlog_write_row(TABLE* table, 
>>  }
>>  
>>  int THD::binlog_update_row(TABLE* table, bool is_trans,
>> -                           MY_BITMAP const* cols, size_t colcnt,
>>                             const uchar *before_record,
>>                             const uchar *after_record)
>>  { 
>> +  int error= 0;
>>    DBUG_ASSERT(current_stmt_binlog_row_based &&
> mysql_bin_log.is_open());
>>  
>> +  /**
>> +    Save a reference to the original read and write set bitmaps.
>> +    We will need this to restore the bitmaps at the end.
>> +   */
>> +  MY_BITMAP *old_read_set= table->read_set;
>> +  MY_BITMAP *old_write_set= table->write_set;
>> +
>> +  /** 
>> +     This will remove spurious fields required during execution but
>> +     not needed for binlogging. This is done according to the:
>> +     binlog-row-image option.
>> +   */
>> +  binlog_prepare_row_images(table);
>> +
>>    size_t const before_maxlen = max_row_length(table, before_record);
>>    size_t const after_maxlen  = max_row_length(table, after_record);
>>  
>> @@ -3625,9 +3630,9 @@ int THD::binlog_update_row(TABLE* table,
>>    uchar *before_row= row_data.slot(0);
>>    uchar *after_row= row_data.slot(1);
>>  
>> -  size_t const before_size= pack_row(table, cols, before_row,
>> +  size_t const before_size= pack_row(table, table->read_set, before_row,
>>                                          before_record);
>> -  size_t const after_size= pack_row(table, cols, after_row,
>> +  size_t const after_size= pack_row(table, table->write_set, after_row,
>>                                         after_record);
>>  
>>    /*
>> @@ -3642,24 +3647,43 @@ int THD::binlog_update_row(TABLE* table,
>>  #endif
>>  
>>    Rows_log_event* const ev=
>> -    binlog_prepare_pending_rows_event(table, server_id, cols, colcnt,
>> +    binlog_prepare_pending_rows_event(table, server_id,
>>  				      before_size + after_size, is_trans,
>>  				      static_cast<Update_rows_log_event*>(0));
>>  
>>    if (unlikely(ev == 0))
>>      return HA_ERR_OUT_OF_MEM;
>>  
>> -  return
>> -    ev->add_row_data(before_row, before_size) ||
>> -    ev->add_row_data(after_row, after_size);
>> +  error= ev->add_row_data(before_row, before_size) ||
>> +         ev->add_row_data(after_row, after_size);
>> +
> 
>> +  /* restore read set for the rest of execution */
>> +  table->column_bitmaps_set_no_signal(old_read_set,
>> +                                      old_write_set);
> 
> Restore not only the read set but also the write set.
> 
>> +
>> +  return error;
>>  }
>>  
>>  int THD::binlog_delete_row(TABLE* table, bool is_trans, 
>> -                           MY_BITMAP const* cols, size_t colcnt,
>>                             uchar const *record)
>>  { 
>> +  int error= 0;
>>    DBUG_ASSERT(current_stmt_binlog_row_based &&
> mysql_bin_log.is_open());
>>  
>> +  /**
>> +    Save a reference to the original read and write set bitmaps.
>> +    We will need this to restore the bitmaps at the end.
>> +   */
>> +  MY_BITMAP *old_read_set= table->read_set;
>> +  MY_BITMAP *old_write_set= table->write_set;
>> +
>> +  /** 
>> +     This will remove spurious fields required during execution but
>> +     not needed for binlogging. This is done according to the:
>> +     binlog-row-image option.
>> +   */
>> +  binlog_prepare_row_images(table);
>> +
>>    /* 
>>       Pack records into format for transfer. We are allocating more
>>       memory than needed, but that doesn't matter.
>> @@ -3670,19 +3694,80 @@ int THD::binlog_delete_row(TABLE* table,
>>  
>>    uchar *row_data= memory.slot(0);
>>  
>> -  size_t const len= pack_row(table, cols, row_data, record);
> 
>> +  DBUG_DUMP("table->read_set", (uchar*) table->read_set->bitmap,
> (table->s->fields + 7) / 8);
> 
> Add a line explaining why you are only interesing in dumping information on the
> delete.
> 
>> +  size_t const len= pack_row(table, table->read_set, row_data, record);
>>  
>>    Rows_log_event* const ev=
>> -    binlog_prepare_pending_rows_event(table, server_id, cols, colcnt,
>> -				      len, is_trans,
>> +    binlog_prepare_pending_rows_event(table, server_id, len, is_trans,
>>  				      static_cast<Delete_rows_log_event*>(0));
>>  
>>    if (unlikely(ev == 0))
>>      return HA_ERR_OUT_OF_MEM;
>>  
>> -  return ev->add_row_data(row_data, len);
>> +  error= ev->add_row_data(row_data, len);
>> +
>> +  /* restore read set for the rest of execution */
>> +  table->column_bitmaps_set_no_signal(old_read_set,
>> +                                      old_write_set);
>> +
>> +  return error;
> 
> Same as above.
> 
>>  }
>>  
>> +void THD::binlog_prepare_row_images(TABLE *table) 
>> +{
>> +  DBUG_ENTER("THD::binlog_prepare_row_images");
>> +  /** 
>> +    Remove from read_set spurious columns. The write_set has been
>> +    handled before in table->mark_columns_needed_for_update. 
>> +   */
>> +
>> +  DBUG_PRINT_BITSET("debug", "table->read_set (before preparing): %s",
> table->read_set);
>> +  THD *thd= table->in_use;
>> +
>> +  /** 
>> +    if there is a primary key in the table (ie, user declared PK or a
>> +    non-null unique index) and we dont want to ship the entire image.
>> +   */
>> +  if (table->s->primary_key < MAX_KEY &&
>> +      (thd->variables.binlog_row_image < BINLOG_ROW_IMAGE_FULL))
>> +  {
>> +    DBUG_ASSERT(table->read_set != &table->tmp_set);
> 
> Please, try to describe what the assertion means.
> 
>> +
>> +    bitmap_clear_all(&table->tmp_set);
>> +
>> +    switch(thd->variables.binlog_row_image)
>> +    {
>> +      case BINLOG_ROW_IMAGE_MINIMAL:
>> +        /* MINIMAL: Mark only PK */
>> +       
> table->mark_columns_used_by_index_no_reset(table->s->primary_key,
>> +                                                   &table->tmp_set);
>> +        break;
>> +      case BINLOG_ROW_IMAGE_NOBLOB:
>> +        /** 
>> +          NOBLOB: Remove unnecessary BLOB fields from read_set 
>> +                  (the ones that are not part of PK).
>> +         */
>> +        bitmap_union(&table->tmp_set, table->read_set);
>> +        for (Field **ptr=table->field ; *ptr ; ptr++)
>> +        {
>> +          Field *field= (*ptr);
>> +          if ((field->type() == MYSQL_TYPE_BLOB) &&
>> +              !(field->flags & PRI_KEY_FLAG))
>> +            bitmap_clear_bit(&table->tmp_set, field->field_index);
>> +        }
>> +        break;
>> +      default:
>> +        DBUG_ASSERT(0); // impossible.
>> +    }
>> +
>> +    /* set the temporary read_set */
>> +    table->column_bitmaps_set_no_signal(&table->tmp_set,
>> +                                        table->write_set);
>> +  }
>> +
>> +  DBUG_PRINT_BITSET("debug", "table->read_set (after preparing): %s",
> table->read_set);
>> +  DBUG_VOID_RETURN;
>> +}
>>  
>>  int THD::binlog_remove_pending_rows_event(bool clear_maps)
>>  {
>>
>> === modified file 'sql/sql_class.h'
>> --- a/sql/sql_class.h	2009-10-03 10:50:25 +0000
>> +++ b/sql/sql_class.h	2009-11-06 01:09:40 +0000
>> @@ -355,6 +355,7 @@ struct system_variables
>>    ulong ndb_index_stat_cache_entries;
>>    ulong ndb_index_stat_update_freq;
>>    ulong binlog_format; // binlog format for this thd (see enum_binlog_format)
>> +  ulong binlog_row_image; 
>>    /*
>>      In slave thread we need to know in behalf of which
>>      thread the query is being run to replicate temp tables properly
>> @@ -1405,14 +1406,12 @@ public:
>>    void binlog_set_stmt_begin();
>>    int binlog_write_table_map(TABLE *table, bool is_transactional);
>>    int binlog_write_row(TABLE* table, bool is_transactional,
>> -                       MY_BITMAP const* cols, size_t colcnt,
>> -                       const uchar *buf);
>> +                       const uchar *new_data);
>>    int binlog_delete_row(TABLE* table, bool is_transactional,
>> -                        MY_BITMAP const* cols, size_t colcnt,
>> -                        const uchar *buf);
>> +                        const uchar *old_data);
>>    int binlog_update_row(TABLE* table, bool is_transactional,
>> -                        MY_BITMAP const* cols, size_t colcnt,
>>                          const uchar *old_data, const uchar *new_data);
>> +  void binlog_prepare_row_images(TABLE* table);
>>  
>>    void set_server_id(uint32 sid) { server_id = sid; }
>>  
>> @@ -1421,8 +1420,6 @@ public:
>>    */
>>    template <class RowsEventT> Rows_log_event*
>>      binlog_prepare_pending_rows_event(TABLE* table, uint32 serv_id,
>> -                                      MY_BITMAP const* cols,
>> -                                      size_t colcnt,
>>                                        size_t needed,
>>                                        bool is_transactional,
>>  				      RowsEventT* hint);
>>
>> === modified file 'sql/sql_insert.cc'
>> --- a/sql/sql_insert.cc	2009-08-31 14:09:09 +0000
>> +++ b/sql/sql_insert.cc	2009-11-06 01:09:40 +0000
>> @@ -2202,6 +2202,11 @@ int write_delayed(THD *thd, TABLE *table
>>      DBUG_PRINT("delayed", ("transmitting auto_inc: %lu",
>>                             (ulong) row->forced_insert_id));
>>    }
>> +  
>> +  /* Copy the original write set */
>> +  bitmap_clear_all(di->table->write_set);
>> +  bitmap_union(di->table->write_set, table->write_set);
>> +  di->table->file->column_bitmaps_signal();
> 
> Please, add an explanation on this part.
> 
>>  
>>    di->rows.push_back(row);
>>    di->stacked_inserts++;
>> @@ -2568,7 +2573,13 @@ bool Delayed_insert::handle_inserts(void
>>    pthread_mutex_unlock(&mutex);
>>  
>>    table->next_number_field=table->found_next_number_field;
>> -  table->use_all_columns();
>> +  /* 
>> +     needed for some autoinc not null fields.
>> +     Otherwise, one would hit an assertion
>> +     when the insert tried to read the field.
>> +  */
>> +  bitmap_set_all(table->read_set);
>> +  table->mark_columns_needed_for_insert();
>>  
>>    thd_proc_info(&thd, "upgrading lock");
>>    if (thr_upgrade_write_delay_lock(*thd.lock->locks, delayed_lock))
> 
> 
> Can you explain the hunks above?
> 
> 
>> === modified file 'sql/sql_update.cc'
>> --- a/sql/sql_update.cc	2009-08-28 16:21:54 +0000
>> +++ b/sql/sql_update.cc	2009-11-06 01:09:40 +0000
>> @@ -326,6 +326,7 @@ int mysql_update(THD *thd,
>>    /* Update the table->file->stats.records number */
>>    table->file->info(HA_STATUS_VARIABLE | HA_STATUS_NO_LOCK);
>>  
>> +  table->mark_columns_needed_for_update();
> 
> 
> I think you should explain why you are moving.
> 
> 
>>    select= make_select(table, 0, 0, conds, 0, &error);
>>    if (error || !limit ||
>>        (select && select->check_quick(thd, safe_update, limit)))
>> @@ -364,8 +365,6 @@ int mysql_update(THD *thd,
>>    }
>>    init_ftfuncs(thd, select_lex, 1);
>>  
>> -  table->mark_columns_needed_for_update();
>> -
>>    /* Check if we are modifying a key that we are used to search with */
>>    
>>    if (select && select->quick)
>>
>> === modified file 'sql/table.cc'
>> --- a/sql/table.cc	2009-08-12 09:46:08 +0000
>> +++ b/sql/table.cc	2009-11-06 01:09:40 +0000
>> @@ -4443,6 +4443,8 @@ void st_table::mark_auto_increment_colum
>>  
>>  void st_table::mark_columns_needed_for_delete()
>>  {
>> +  mark_columns_per_binlog_row_image();
>> +
>>    if (triggers)
>>      triggers->mark_fields_used(TRG_EVENT_DELETE);
>>    if (file->ha_table_flags() & HA_REQUIRES_KEY_COLUMNS_FOR_DELETE)
>> @@ -4455,15 +4457,26 @@ void st_table::mark_columns_needed_for_d
>>      }
>>      file->column_bitmaps_signal();
>>    }
>> -  if (file->ha_table_flags() & HA_PRIMARY_KEY_REQUIRED_FOR_DELETE)
>> +  if (file->ha_table_flags() & HA_PRIMARY_KEY_REQUIRED_FOR_DELETE ||
>> +      mysql_bin_log.is_open() && in_use &&
>> +      in_use->current_stmt_binlog_row_based)
> 
> I don't get why you need this. Can you explain?
> And if you really need this why don't you do whatever you need to do in the
> mark_columns_per_binlog_row_image()?
> 
>>    {
>>      /*
>> -      If the handler has no cursor capabilites, we have to read either
>> -      the primary key, the hidden primary key or all columns to be
>> -      able to do an delete
>> +      If the handler has no cursor capabilites, or we have row-based
>> +      replication active for the current statement, we have to read
>> +      either the primary key, the hidden primary key or all columns to
>> +      be able to do an delete
>>      */
> 
> ok.
> 
>>      if (s->primary_key == MAX_KEY)
>> -      file->use_hidden_primary_key();
>> +    {
>> +      /*
>> +        If in RBR, this was done already in:
>> +        mark_columns_per_binlog_row_image
>> +      */
>> +      if (!(mysql_bin_log.is_open() && in_use &&
>> +          in_use->current_stmt_binlog_row_based))
>> +        file->use_hidden_primary_key();
>> +    }
> 
> Please, merge the comments and the "if".
> 
>>      else
>>      {
>>        mark_columns_used_by_index_no_reset(s->primary_key, read_set);
>> @@ -4493,6 +4506,9 @@ void st_table::mark_columns_needed_for_d
>>  
>>  void st_table::mark_columns_needed_for_update()
>>  {
>> +
>> +  mark_columns_per_binlog_row_image();
>> +
> 
> Please, move this to a place after the DBUG_ENTER.
> 
> 
>>    DBUG_ENTER("mark_columns_needed_for_update");
>>    if (triggers)
>>      triggers->mark_fields_used(TRG_EVENT_UPDATE);
> 
> 
> 
>> @@ -4511,12 +4527,21 @@ void st_table::mark_columns_needed_for_u
>>    if (file->ha_table_flags() & HA_PRIMARY_KEY_REQUIRED_FOR_DELETE)
>>    {
>>      /*
>> -      If the handler has no cursor capabilites, we have to read either
>> +      If the handler has no cursor capabilites, or we have row-based
>> +      logging active for the current statement, we have to read either
>>        the primary key, the hidden primary key or all columns to be
>>        able to do an update
>>      */
>>      if (s->primary_key == MAX_KEY)
>> -      file->use_hidden_primary_key();
>> +    {
>> +      /*
>> +        If in RBR, this was done already in:
>> +        mark_columns_per_binlog_row_image
>> +      */
>> +      if (!(mysql_bin_log.is_open() && in_use &&
>> +          in_use->current_stmt_binlog_row_based))
>> +        file->use_hidden_primary_key();
>> +    }
> 
> Please, merge the comments and the "if".
> 
> 
>>      else
>>      {
>>        mark_columns_used_by_index_no_reset(s->primary_key, read_set);
>> @@ -4526,6 +4551,67 @@ void st_table::mark_columns_needed_for_u
>>    DBUG_VOID_RETURN;
>>  }
> 
> 
> 
>>  
>> +void st_table::mark_columns_per_binlog_row_image()
>> +{
>> +  DBUG_ENTER("mark_columns_per_binlog_row_image");
> 
> 
> Add a description.
> 
>> +
>> +  /**
>> +    If in RBR we may need to mark some extra columns,
>> +    depending on the binlog-row-image command line argument.
>> +   */
>> +  if ((mysql_bin_log.is_open() && in_use &&
>> +       in_use->current_stmt_binlog_row_based))
> 
> 
> Why do you need the in_use?
> Should it be always true, shouldn't it?
> 
> 
>> +  {
>> +
>> +    THD *thd= current_thd;
>> +
>> +    /* if there is no PK, then mark all columns for the BI. */
>> +    if (s->primary_key >= MAX_KEY)
>> +      bitmap_set_all(read_set);
>> +
>> +    switch (thd->variables.binlog_row_image)
>> +    {
>> +      case BINLOG_ROW_IMAGE_FULL:
> 
>> +        if (s->primary_key < MAX_KEY)
>> +          bitmap_set_all(read_set);
> 
> Why do you need the if? Performance?
> It would clearer if you remove the "if".
> 
>> +        bitmap_set_all(write_set);
>> +        break;
>> +      case BINLOG_ROW_IMAGE_NOBLOB:
>> +        /* for every field that is not set, mark it unless it is a blob */
>> +        for (Field **ptr=field ; *ptr ; ptr++)
>> +        {
>> +          Field *field= *ptr;
>> +          /* 
>> +            bypass blob fields. These can be set or not set, we don't care.
>> +            Later if we don't need them in the before image, we will discard
>> +            them.
> 
> Later? When? Please, just give a hint. You don't need to get into too many details.
> Also explain why you do not get rid of them here.
> 
>> +
>> +            If set in the AI, then the blob is really needed. There is nothing
>> +            we can do.
> 
> s/. There is/, there is/
> 
>> +           */
>> +          if ((field->flags & PRI_KEY_FLAG) || 
>> +              (field->type() != MYSQL_TYPE_BLOB))
>> +            bitmap_set_bit(read_set, field->field_index);
> 
> 
> Now I think you should check if bitmap_set_all was called.
> 
> 
>> +
>> +          if (field->type() != MYSQL_TYPE_BLOB)
>> +            bitmap_set_bit(write_set, field->field_index);
> 
> 
> ok.
> 
> 
>> +        }
>> +        break;
>> +      case BINLOG_ROW_IMAGE_MINIMAL:
>> +        /* mark the primary key if available in the read_set */
>> +        if (s->primary_key < MAX_KEY)
>> +          mark_columns_used_by_index_no_reset(s->primary_key, read_set);
>> +        break;
>> +
>> +      default: 
>> +        DBUG_ASSERT(FALSE);
>> +    }
>> +    file->column_bitmaps_signal();
>> +  }
>> +
>> +  DBUG_VOID_RETURN;
>> +}
>> +
>>  
>>  /*
>>    Mark columns the handler needs for doing an insert
>> @@ -4536,6 +4622,7 @@ void st_table::mark_columns_needed_for_u
>>  
>>  void st_table::mark_columns_needed_for_insert()
>>  {
>> +  mark_columns_per_binlog_row_image();
>>    if (triggers)
>>    {
>>      /*
> 
> ok.
> 
>> === modified file 'sql/table.h'
>> --- a/sql/table.h	2009-09-23 21:32:31 +0000
>> +++ b/sql/table.h	2009-11-06 01:09:40 +0000
>> @@ -820,6 +820,7 @@ struct st_table {
>>    void mark_columns_needed_for_update(void);
>>    void mark_columns_needed_for_delete(void);
>>    void mark_columns_needed_for_insert(void);
>> +  void mark_columns_per_binlog_row_image(void);
>>    inline void column_bitmaps_set(MY_BITMAP *read_set_arg,
>>                                   MY_BITMAP *write_set_arg)
>>    {
> 
> 
> Change the name of the function. Maybe something like
> mark_columns_needed_for_binlog?
> 
> 
>>
>> ------------------------------------------------------------------------
>>
>> This body part will be downloaded on demand.
> 
> 
Thread
bzr commit into mysql-5.1-rep+2 branch (luis.soares:3137) WL#5092Luis Soares6 Nov
  • Re: bzr commit into mysql-5.1-rep+2 branch (luis.soares:3137) WL#5092Alfranio Correia26 Nov
    • Re: bzr commit into mysql-5.1-rep+2 branch (luis.soares:3137) WL#5092Alfranio Correia26 Nov
      • Re: bzr commit into mysql-5.1-rep+2 branch (luis.soares:3137) WL#5092Luís Soares4 Dec
  • Re: bzr commit into mysql-5.1-rep+2 branch (luis.soares:3137) WL#5092Mats Kindahl3 Dec
    • Re: bzr commit into mysql-5.1-rep+2 branch (luis.soares:3137) WL#5092Luís Soares4 Dec