List:Commits« Previous MessageNext Message »
From:Øystein Grøvlen Date:November 19 2008 2:35pm
Subject:Re: bzr commit into mysql-6.0-backup branch (Rafal.Somla:2731)
Bug#40303 Bug#40304
View as plain text  

Rafal Somla wrote:
 > Hi Oystein,
 >
 > Thank you for a very careful and detailed review. See my comments and
 > explanations below and expect an updated patch from me soon.
 >
 > Øystein Grøvlen wrote:

...

 >> REQUESTS
 >> ========
 >>
 >> 1. Some error reports are now less informative than previously.  (See
 >>    [2], [7]).  I think that should be fixed.
 >
 > I'd rather not deal with that issue in this patch. True that it changes
 > some error messages, but I think it is not as bad as you claim. See my
 > explanations below.

I think making error messages less informative is a step in the wrong
direction.  However, for [2] I think the extra error on the stack
makes it acceptable.  For [7], I still think my solution is better,
but I will not insists on it.

 >
 >>
 >> 2. Fix long lines: [6], [13], [19].
 >
 > Will do.
 >
 >>
 >> COMMENTARY
 >> ==========
 >>
 >> 1. A lot of the error messages used for backup have very little value
 >>    for a user and might as well have been replaced by "Internal error
 >>    during BACKUP/RESTORE".  What use is messages like these to the
 >>    user:
 >>
 >>     - "Error when preparing for backup operation"
 >>     - "Can't enumerate server tables"
 >>     - "Can't read backup archive preamble"
 >>     - "Error when reading summary section of backup image"
 >>     - "Can't create %-.64s backup driver"
 >>     - "Could not create the validity point"
 >>
 >>    Error messages should give useful information to user, not reflect
 >>    the internal design/implementation of the software!
 >
 > Sure, but this is not the scope of this patch.

I agree. That is why I put this comment in the Commentary section. ;-)

 > This patch aims at ensuring that the user sees the firs error
 > reported, as is the case for all other SQL statements. Improving
 > overall quality of error reporting was the task of WL#4384 (Online
 > backup: Revise error reporting), at least before it was redefined to
 > do something simpler.

I think the main principle should be to make error messages as
informative as possible to the user. From what I have experienced so
far, this is most often the first error reported, and it makes sense
to make this the default behavior.  This is why I think this patch is
generally good.  However, in some situations I think it makes sense to
override the first error in order to provide an error message that is
more informative to the user.  However, maybe that is easier said than
done.  If such functionality is not supported by the server, I guess
for now we just need to be satisfied with providing additional
information as warnings.


 >
 >>
 >> 2. Changes in this patch (e.g., [5]) illustrate that making
 >>    Backup_restore_ctx a sub-class of Logger was not a good design
 >>    decision.  Inheritance should only be used for is-a relationships.
 >>    I do not think it makes sense to say that a context "is a" logger.
 >>    However, a logger exist for a given context.  Hence, encapsulation
 >>    should be used instead.
 >>
 >
 > I can agree that this design is not perfect. However, there was a reason
 > behind such decision. The context object encapsulates all the context
 > needed to perform backup/restore operations and offers methods for
 > executing related actions. One of such actions is reporting errors to
 > the user (in the context of backup/restore opearation). Thus the context
 > object can be used to log messages to a user and hence, in this sense,
 > it "is a" logger. Although, I agree that one can look at this from
 > different perspectives.

I think your argumentation proves my point.  You say that "The context
object encapsulates all the context needed to perform backup/restore
operations".  That is why I think encapsulation should be used, not
inheritance.

 >
 >>
 >> SUGGESTIONS
 >> ===========

...

 >> 2. Skip masking of error numbers in SHOW WARNING [3].
 >
 > I'd rather keep it as it is - see below.

OK with me.

...

 >> 4. Try to give a more specific error message than
 >>    BACKUP_VP_FAILED. [8]
 >
 > I think it is not possible in the scope of this patch - see below.
 >

I questioned whether there is any value in this message if we cannot
be more specific. Why not use just use ER_BACKUP_BACKUP?

...

 >>
 >> 9. Is it OK to use different error codes for  report_error and
 >>    fatal_error for the same error?  [12], [14], [15].
 >
 > Yes, it is fine - I explain below in more detail.
 >

Even if it does not create any problems, it is a bit confusing.

 >>
 >> 10. Since setting m_error is encapsulated in a method (fatal_error), I
 >>     think reading it should also be encapsulated. [17].
 >
 > I'd rather keep it as it is - see below.

OK, not a big deal.

 >> Rafal Somla wrote:
 >>  > === modified file 'mysql-test/suite/backup/r/backup_views.result'
 >>  > --- a/mysql-test/suite/backup/r/backup_views.result    2008-10-07
 >> 17:15:44 +0000
 >>  > +++ b/mysql-test/suite/backup/r/backup_views.result    2008-11-12
 >> 11:20:04 +0000
 >>  > @@ -277,10 +277,10 @@ DROP DATABASE bup_db2;
 >>  >  Restore database.
 >>  >  restore database with view dependency to other, non-existing db
 >>  >  RESTORE FROM 'bup_objectview1.bak';
 >>  > -ERROR HY000: Could not restore view `bup_db1`.`v5`. Please check
 >> the view definition for possible missing dependencies.
 >>  > +ERROR 42S02: Table 'bup_db2.t2' doesn't exist
 >>  >  DROP DATABASE bup_db1;
 >>  >  RESTORE FROM 'bup_objectview2.bak';
 >>  > -ERROR HY000: Could not restore view `bup_db2`.`student_details`.
 >> Please check the view definition for possible missing dependencies.
 >>  > +ERROR 42S02: Table 'bup_db1.t3' doesn't exist
 >>
 >> [2] The new error message is not as informative as the old one.  I do
 >> not think this is a move in the right direction.
 >>
 >
 > The main goal of this patch is to change error reporting infrastructure
 > so that the first reported error is sent to the client. This is the
 > principle followed in the whole server. For example consider this:
 >
 >   mysql> create procedure p() create view v2 as select * from t2;
 >   Query OK, 0 rows affected (0.01 sec)
 >   mysql> call p();
 >   ERROR 1146 (42S02): Table 'test.t2' doesn't exist
 >
 > The error message gives you no clue that the original problem was
 > creating view v2.
 >
 > Do you say that we should apply a different principle in the backup code
 > and then BUG#40303 should be cancelled?

The difference is that in the create procedure example the user has
explicitly dealt with t2 and written the definition himself.  In the
restore case, it is not evident where t2 is used.

 >
 > If not, then I'm afraid there is not much we can do to improve
 > informativeness of the error message which user sees. It happens that
 > the first error detected and reported is during an attempt to access a
 > non-existent table, and according to the principle this error will be
 > sent to the client. If we want something else, we need a different
 > principle.

I am not sure we need "a principle".  What we need is a default
behavior and the possibility to override it when necessary when
necessary for usability.

 > However, the situation is not that bad, because BACKUP
 > command will push additional error message on the error stack and user
 > can investigate it with SHOW ERRORS:
 >
 > mysql> restore from 'db.bkp';
 > ERROR 1146 (42S02): Table 'test.t1' doesn't exist
 > mysql> show errors;
 > 
+-------+------+------------------------------------------------------------------------------------------------------+


 >
 > | Level | Code | Message                                       |
 > 
+-------+------+------------------------------------------------------------------------------------------------------+


 >
 > | Error | 1146 | Table 'test.t1' doesn't exist
 >                                       |
 > | Error | 1678 | Could not restore view `db`.`v`. Please check the view
 > definition for possible missing dependencies. |
 > 
+-------+------+------------------------------------------------------------------------------------------------------+


 >
 >
 > This actually gives more information than the original code.

I agree that together this gives enough information.  It would be
better if it was the other way around, but if that is difficult, I
guess this is sufficient.

...

 >>  > === modified file 'sql/backup/backup_info.cc'
 >>  > --- a/sql/backup/backup_info.cc    2008-11-05 09:41:15 +0000
 >>  > +++ b/sql/backup/backup_info.cc    2008-11-12 11:20:04 +0000
 > (cut)
 >>  > @@ -301,7 +301,7 @@ Backup_info::Dep_node::get_key(const uch
 >>  >    @c find_backup_engine().
 >>  >   */
 >>  >  Backup_info::Backup_info(Backup_restore_ctx &ctx)
 >>  > -  :m_ctx(ctx), m_state(Backup_info::ERROR), native_snapshots(8),
 >>  > +  :m_log(ctx), m_thd(ctx.thd()), m_state(Backup_info::ERROR),
 >> native_snapshots(8),
 >>
 >> [5] I think this illustrates that making Backup_restore_ctx a subclass
 >> of Logger was a bad idea.  Anyhow, if Backup_info is not supposed to
 >> deal with Backup_restore_ctx, I feel it is better to pass in Logger
 >> and THD as separate arguments to the constructor instead passing in a
 >> Backup_restore_ctx for Backup_info to pick sub-components.
 >>
 >
 > I think I could find some arguments for tying Backup_info to the
 > corresponding Backup_restore_ctx instance on the interface level, and
 > picking relevant parts of the context on the implementation level, but
 > it does not matter that much. Thus I'll do the change here and in
 > Restore_info ctor.

Why would you want to couple two classes when it is not necessary?
We should strive for low coupling.


 >
 >> [6] The above line is too long.
 >>
 >
 > OK.
 >
 >>  > === modified file 'sql/backup/data_backup.cc'
 >>  > --- a/sql/backup/data_backup.cc    2008-10-30 20:02:15 +0000
 >>  > +++ b/sql/backup/data_backup.cc    2008-11-12 11:20:04 +0000
 > (cut)
 >>  > @@ -448,9 +449,15 @@ int write_table_data(THD* thd, Backup_in
 >>  >
 >>  >      Scheduler::Pump *p= new Scheduler::Pump(*i, s);
 >>  >
 >>  > -    if (!p || !p->is_valid())
 >>  > +    if (!p)
 >>  >      {
 >>  > -      info.m_ctx.fatal_error(ER_OUT_OF_RESOURCES);
 >>  > +      log.report_error(ER_OUT_OF_RESOURCES);
 >>  > +      goto error;
 >>  > +    }
 >>  > +    if (!p->is_valid())
 >>  > +    {
 >>  > +      log.report_error(ER_BACKUP_CREATE_BACKUP_DRIVER,p->m_name);
 >>  > +      delete p;
 >>  >        goto error;
 >>  >      }
 >>
 >> [7] Currently, the only way that p is not valid is if allocation
 >> fails.
 >
 > That is not true. The Scheduler::Pump constructor calls base Backup_pump
 > constructor which can report errors by setting state to
 > backup_state::ERROR. This can happen if bitmap_init() or
 > Snapshot_info.get_backup_driver() fail. And the latter can fail because
 > it was not possible to create backup driver instance or initialize it.

As far as I can tell all get_backup_driver functions only return ERROR
if new returns null pointer.

 >
 >> Hence, OUT_OF_RESOURCES is more informative to a user than
 >> BACKUP_CREATE_BACKUP_DRIVER.
 >
 > I'm puzzled. Clearly the original code gave less information than the
 > current one because it did not distinguish between the two problems that
 > can be detected here - either memory allocation failed, or there was
 > some other problem during creation of backup driver and the
 > infrastructure required to use it. Thus the current code gives to user
 > the maximum of the information it possess.
 >
 >> A user is given no clue why a backup
 >> driver could not be created.
 >>
 >
 > You are right, but at least it does not give a false clue that the
 > problem was failed memory allocation.

I guess you can argue that the a useless message is better than an
incorrect message, but currently I believe OUT_OF_RESOURCE will be
always be correct.

 >
 > To give even more precise information, we would have to increase the
 > granularity of error reporting inside the constructor. E.g., instead of
 > setting state to general ERROR value, constructor would have to set it
 > to one of several error values indicating the precise reason of the
 > failure. Then the code would have to interpret these values and log
 > appropriate messages.

I agree.  This is an even bigger problem for the Stream interface, no
information about the error is propagated there either and that
prevents given specific errors about file operations etc.

 > However, I think such change is outside the scope of this patch. This
 > patch does not change the original design of the code: we detect the
 > fact that construction of an object failed but do not detect the reason
 > of this failure.

I agree that such a change is outside the patch, but I think that
tweaking the design in order to improve usability should be OK here,
and this patch is reverting such a tweak.


 >
 >>  > @@ -569,9 +576,12 @@ int write_table_data(THD* thd, Backup_in
 >>  >      int error= 0;
 >>  >      error= block_commits(thd, NULL);
 >>  >      if (error)
 >>  > +    {
 >>  > +      log.report_error(ER_BACKUP_VP_FAILED);
 >>
 >> [8] I do not feel this error gives much information about what is
 >> actually wrong.  Is it possible to say something about when
 >> block_commits() fail and what the user could do about it?
 >>
 >
 > I don't see any such thing and hence a general error message. I think,
 > the existence of a commit blocker should not be brought to the attention
 > of the user - it is a very low level implementation detail.

If block_commits timeouts out due to locks held by other connections,
this is useful information to the user regardless of whether the locking
is an implementation detail or not.

 >
 > To increase the amount of information given, block_commits(...) should
 > give more information about the error (need to extend the interface) and
 > the code above should analyse it and give better message in
 > user-relevant terms (if possible).
 >
 > But again, this is not the scope of this patch. This patch improves the
 > situation a bit, because originally nothing was reported at all and now
 > user sees that BACKUP failed during VP creation, which *is* some
 > information.

Which is useful to a developer trying to debug the problem, but does a user
care which phase of backup failed?

 >
 >>  > @@ -1371,11 +1377,25 @@ int restore_table_data(THD *thd, Restore
 >>  >    if (info.snap_count() == 0 || info.table_count() == 0) //
 >> nothing to restore
 >>  >      DBUG_RETURN(0);
 >>  >
 >>  > +  Logger &log= info.m_log;
 >>  > +
 >>  > +  /* Drv[n] points at restore driver used to process snapshot n. */
 >>  >    Restore_driver* drv[256];
 >>  > +  /*
 >>  > +    Active[n] is not NULL if driver drv[n] has been activated.
 >> Such driver needs
 >>  > +    an end() or cancel() call to shut it down properly.
 >>  > +  */
 >>  > +  Restore_driver* active[256];
 >>
 >> [9] A named constant should be used here instead of 256.
 >>
 >
 > Not sure. The maximum number of allowable drivers is not a settable
 > parameter. It is fixed by the backup image format in which one byte is
 > used to identify a snapshot. This puts a fixed limit of max 256
 > snapshots inside backup image and thus max 256 drivers used to process
 > them.
 >
 > But, since I see such suggestion for a second time, as a compromise I
 > can define a symbolic constants and document it explaining why it can
 > not be changed.
 >

Which is exactly what I was looking for.

 >>  > === modified file 'sql/backup/kernel.cc'
 >>  > --- a/sql/backup/kernel.cc    2008-10-30 20:02:15 +0000
 >>  > +++ b/sql/backup/kernel.cc    2008-11-12 11:20:04 +0000
 >>  > @@ -154,11 +154,8 @@ execute_backup_command(THD *thd, LEX *le
 >>  >      folders in the path could have been moved, deleted, etc.
 >>  >    */
 >>  >    if (backupdir->length() && my_access(backupdir->c_ptr(),
 >> (F_OK|W_OK)))
 >>  > -  {
 >>  > -    context.fatal_error(ER_BACKUP_BACKUPDIR, backupdir->c_ptr());
 >>  >      DBUG_RETURN(send_error(context, ER_BACKUP_BACKUPDIR,
 >> backupdir->c_ptr()));
 >>  > -  }
 >>  > -
 >>  > +
 >>  >    switch (lex->sql_command) {
 >>  >
 >>  >    case SQLCOM_BACKUP:
 >>  > @@ -195,7 +192,7 @@ execute_backup_command(THD *thd, LEX *le
 >>  >
 >>  >      if (info->db_count() == 0)
 >>  >      {
 >>  > -      context.fatal_error(ER_BACKUP_NOTHING_TO_BACKUP);
 >>  > +      context.report_error(ER_BACKUP_NOTHING_TO_BACKUP);
 >>  >        DBUG_RETURN(send_error(context, ER_BACKUP_NOTHING_TO_BACKUP));
 >>
 >> [10] Why is report_error called here, but not for other errors in this
 >> method?
 >>
 >
 > report_error() needs to be called only if the error was not already
 > reported by function or method called before, as is the case in this
 > fragment.
 >

In that case what is missing is comments that the called methods
report errors.

...

 >>  >
 >>  >    /*
 >>  >      Check access for SUPER rights. If user does not have SUPER,
 >> fail with error.
 >>  > +
 >>  > +    In case of error, we write only to backup logs, because
 >> check_global_access()
 >>  > +    pushes the same error on the error stack.
 >>  >    */
 >>  > -  if (check_global_access(m_thd, SUPER_ACL))
 >>  > -  {
 >>  > -    fatal_error(ER_SPECIFIC_ACCESS_DENIED_ERROR, "SUPER");
 >>  > -    return m_error;
 >>  > -  }
 >>  > +  ret= check_global_access(m_thd, SUPER_ACL);
 >>  > +  if (ret)
 >>
 >> [11] Why this change?  What is better with assigning the return value
 >> to a local variable.  I fell that complicates code since I need to be
 >> concerned with whether ret is only used in the line below or more
 >> times below.  used
 >>
 >
 > You are right - this change is not justified by the goal of the patch.
 > Chuck also pointed it out. It got it here because I was creating the
 > patch from an earlier prototype for dealing with interruption inside
 > backup/restore command. If check_global_access() is interrupted, then
 > before we report error, we should check if it is not in fact
 > interruption. I.e., the code will look as follows:
 >
 >   ret= check_global_access(m_thd, SUPER_ACL);
 >   if (<check for interruption>)
 >     <handle the interruption>;
 >   if (ret)
 >     <no interruption - handle the error>;
 >
 > So this change prepares for the future changes for BUG#35079/WL#4538. If
 > you and Chuck can accept this, then this can save me a substantial
 > amount of work needed for move these changes to the more appropriate 
patch.

I do not quite see why this change could not be part of the other
patch, but OK.

 >
 >>  > @@ -658,12 +651,16 @@ Backup_restore_ctx::prepare_for_backup(S
 >>  >
 >>  >    if (!info)
 >>  >    {
 >>  > -    fatal_error(ER_OUT_OF_RESOURCES);
 >>  > +    fatal_error(report_error(ER_OUT_OF_RESOURCES));
 >>  >      return NULL;
 >>  >    }
 >>  >
 >>  >    if (!info->is_valid())
 >>  > -    return NULL;    // Error has been logged by Backup_Info
 >> constructor
 >>  > +  {
 >>  > +    // Error has been logged by Backup_Info constructor
 >>  > +    fatal_error(ER_BACKUP_BACKUP_PREPARE);
 >>
 >> [12] So here fatal_error will register a different error from what is
 >> logged before.  Which is reported to the user?
 >>
 >
 > No, fatal_error(...) has been redefined and now it does not log
 > anything. It only moves context object to an error state by setting
 > internal m_error variable to the indicated error code. The error code
 > stored in m_error is not currently used - only the fact that m_error !=
 > 0 is detected in the code.

I understand that it does not currently create any problems, but I do
not understand the motivation for operating with different error
codes.

...

 >>  > +  int ret;
 >>  > +
 >>  >    /*
 >>  > -    Read catalogue from the input stream.
 >>  > +    Read header and catalogue from the input stream.
 >>  >     */
 >>  >
 >>  > -  if (read_header(*info, *s))
 >>  > -  {
 >>  > -    fatal_error(ER_BACKUP_READ_HEADER);
 >>  > -    return NULL;
 >>  > -  }
 >>  > -
 >>  > -  if (s->next_chunk() != BSTREAM_OK)
 >>  > +  ret= read_header(*info, *s);  // Can log errors via callback
 >> functions.
 >>  > +  if (ret)
 >>  >    {
 >>  > -    fatal_error(ER_BACKUP_NEXT_CHUNK);
 >>  > +    if (!error_reported())
 >>  > +      report_error(ER_BACKUP_READ_HEADER);
 >>  > +    fatal_error(ret);
 >>
 >> [14] Why is potentially different error messages used here for
 >> reporr_error and fatal_error?
 >>
 >
 > As explained above, the error code stored by fatal_error(..) does not
 > really matter. Thus we can store the exact code returned by
 > read_header(..) so that this information is not discarded in case it
 > turns out to be handy to look at it.

But in this case read_header either returns ERROR or OK.  Mixing
error code from different domains in m_error does not seem like a good
idea.

--
Øystein
Thread
bzr commit into mysql-6.0-backup branch (Rafal.Somla:2731) Bug#40303Bug#40304Rafal Somla12 Nov
  • RE: bzr commit into mysql-6.0-backup branch (Rafal.Somla:2731)Bug#40303 Bug#40304Chuck Bell13 Nov
    • Re: bzr commit into mysql-6.0-backup branch (Rafal.Somla:2731)Bug#40303 Bug#40304Rafal Somla18 Nov
  • Re: bzr commit into mysql-6.0-backup branch (Rafal.Somla:2731)Bug#40303 Bug#40304Øystein Grøvlen17 Nov
    • Re: bzr commit into mysql-6.0-backup branch (Rafal.Somla:2731)Bug#40303 Bug#40304Rafal Somla18 Nov
      • Re: bzr commit into mysql-6.0-backup branch (Rafal.Somla:2731)Bug#40303 Bug#40304Øystein Grøvlen19 Nov