List:Backup« Previous MessageNext Message »
From:Rafal Somla Date:October 27 2009 8:35am
Subject:Re: RFC: WL#5046 - error reporting
View as plain text  
Hi Andreas,

Thanks for the feedback.

Andreas Almroth wrote:
> My suggestions for error handling:
> 
> S12 Get error information
> 
>     [IN]  backup storage session
>     [OUT] error number
> 
> Will return last error occurred in the given BSM session. Returns BSM_NO_ERR 
> if no previous error has occurred.

Yes, Ingo suggested something similar. So, I'll change specification of S12 
to not error if there was no previous error.

> Error description is returned by the BSM using current I18N locale or default 
> to English.
> 

Not that easy I'm afraid. The problem is what is "the current I18N locale". 
The language in which module describes errors should be consistent with the 
language used for reporting errors to mysqld user. This depends on several 
settings in the server (and client). This means that user of BSM (backup 
kernel) must inform BSM about what language should be used. Thus we would 
need additional service. Perhaps locale information should be passed when 
creating storage session (service S1)? An alternative is that BSM uses 
callback functions for error reporting and backup kernel provides functions 
which know about locale settings (but not 100% sure if this idea can work).

> Note: In practical implementation the two outputs are split in to API calls 
> really. One for retrieving the error code (a la errno in C), and one to map 
> text to the error code; get_error. This is how I envision the implementation. 
> Calling Get_Error with overhead for text error code is too troublesome. If 
> the backup kernel requires the text (for output to user), it retrieves it 
> using get_error_desc or something.

Yes, in C it is most sensible to have 2 functions: one for error code and 
one for error description.

> 
> So;
> S12 Get last error
> 
>     [IN]  backup storage session
>     [OUT] error number
> 
> Will return last error occurred in the given BSM session. Returns BSM_NO_ERR if no
> previous error has occurred.
> 
> S13 Get error description
> 
>     [IN]  backup storage session
>     [IN]  error code
>     [OUT] error description
> 
> Return textual description of error code in current session. Error description should
> be returned by the BSM using current I18N locale or default to English.
>

OK, but wouldn't it be simpler to implement a service which only returns 
description of the last error. I.e., instead of having service for 
transforming *any* error code into the corresponding error description, we 
only need to return description of the last error that has happened. I don't 
think we need anything more than that.

> 
> Error codes must be standardized in the BSM API as else there would be no way 
> for the backup kernel to decide whether to abort, continue or act on error.

My idea is that if BSM reports error, then it is a fatal error and BSM can 
not be used after that. Any "non-fatal error", after which BSM can continue 
to serve the user, is not reported as an error but is signalled as a valid 
reply from a service.

For example, if service S10 is called to get information about backup image 
but the location does not contain any data, this is signalled in a reply 
from the service which completes successfully. This is so because after such 
situation the storage session is still usable and the user can e.g., create 
a new image at this location.

In such design, the only sensible thing which backup kernel can do when BSM 
signals error is to report it and abort. If other action is possible, this 
would depend on the context in which BSM was called. Thus I don't see it 
necessary that errors reported by BSMs must be interpreted by its user.

> In my MyBRM API there are a set of errors that is expected from the MyBRM 
> implementations that the backup kernel can handle.
> 

I found this list in the patch I got from you:

   #define MYBRM_RC_FAILURE -1
   #define MYBRM_RC_SUCCESS 0
   #define MYBRM_RC_NO_MATCH 17
   #define MYBRM_RC_NO_MORE_DATA 18

The RC_NO_MATCH code is probably for listing locations which we do not 
include in the current specifications. The RC_NO_MORE_DATA is a good example 
for the point I want to make. This "error code" is really informing the 
caller that end of stream has been reached. When this happens, the storage 
session is still usable and this is why in HLS it is not considered as an 
error, but as a reply from service S7 "read bytes from location": "[OUT] 
information about end of stream". How this information is passed to the 
caller is an implementation detail and using RC_NO_MORE_DATA code is one 
possible way. However, according to the specification, such reply should be 
distinguishable from an error. But this requirement could be satisfied by a 
simple convention like "only negative return codes denote errors" (for example).

 From this perspective, looking at the specifications in HLS, I think all 
required information is provided by BSM and in case of error, nothing more 
than the fact that error has happened is needed. Good if you can double 
check that.

> For the HLS, I don't see the need to go into such details, but the LLD will 
> cover it all.

I think some general principles should be decided first. And whether error 
codes from BSM calls have any interpretation (and what) or not is one of 
such things which should be decided on the HLS level. If for some services 
we are not satisfied with "opaque" error notification and want to 
distinguish different error situations, we should specify it in HLS I think. 
Let me know if you have any propositions for such distinctions.

> Also, errors reported by the backup kernel (not the BSM) should go into 
> the errmsg and any locales we choose to cover. These errors must be added to 
> the mysql global database.

Agree.

> 
> For BSM reported errors, the backup kernel should either relay to the error 
> log or just simply not forward to user. I'm more in favour for adding messages 
> in the error log and do like a stack trace of messages to guide the user. 
> That is; fixed part of backup kernel error message + dynamic part reported by the 
> individual BSM.

I'm not sure if I understand your proposition completely. Is this basically 
the same as currently described in HLS or you propose something else?

> We should all know that if the user runs SE_sv or DE_de or whatever locale, and 
> the 3rd party BSM does not, it will be mixed language in the error log.
> The only way to avoid this is to not log BSM errors at all, but the user will 
> have to rely on that the 3rd party will provide their own error log.

Or, accept that BSM error descriptions are in plain English as proposed in 
HLS :)

Rafal

> -----Original Message-----
> From: Ingo.Struewing@stripped [mailto:Ingo.Struewing@stripped] 
> Sent: 25. oktober 2009 19:34
> To: Rafal Somla
> Cc: Ingo Strüwing; Andreas Almroth; backup@stripped
> Subject: Re: RFC: WL#5046 - error reporting
> 
> Hi Rafal,
> 
> Rafal Somla, 24.10.2009 16:02:
> 
>> Hi,
>>
>> I've described in HLS a proposition how to handle errors from backup
>> storage modules. Please have a look and let me know if you have any
>> comments on that.
> 
> great work. Thank you.
> 
>> S12 Get error information.
> ...
>>     Note: this service will fail if it is called without any error being
>>     detected earlier.
> 
> 
> And then you can call it again and get a "no error" error? What about
> letting it always succeed and return zero and empty string (or "no
> error") if no error happened? If it can fail, its failure needs to be
> handled...
> 
> What does "earlier" mean? "Anytime before in the backup storage
> session"? Or "during the last service call"?
> 
>> Examples
> 
> The examples are nice. But should they be part of the specification?
> Confusion might come from the fact that they give additional
> information, which might be taken as "specified" or "optional". E.g. is
> it part of the specification that path names in error messages have to
> be canonical, or it is just an option?
> 
>> Design principles
>> -----------------
>>
>> - Several errors can be reported for a single failure - they provide information
>> which should be combined to get the most accurate description of the failure.
> 
> 
> From the text, later in the section, I guess that you mean that the
> backup kernel can add more error messages. But since we specify the API
> here, it could be misunderstood as if multiple "S12 Get error
> information" could be called in a row, to retrieve several errors from
> the failed service.
> 
> If my guess is correct, then this sentence is a duplicate of the later
> "Together with an error from storage module, backup kernel reports more
>  errors informing about the context in which the error has happened." I
> find the latter easier to understand and suggest to drop the former.
> 
>> - Errors from storage modules are reported as a single ... error" ...
>> - Backup kernel does not try to interpret errors ...
> 
> I suggest to exchange the order of the two paragraphs. Then two adjacent
> paragraphs explain, how the errors are reported.
> 
>> - Backup kernel does not try to interpret errors reported by backup storage
>> modules, it only notes that an error has happened and possibly forwards its to
>> backup error log. There is no global convention about which error number means
>> what.
> 
> I'm unhappy with this. I understand that the current semantics of the
> interface distinguishes between problems that can be accepted and
> circumvented (and which are not reported as errors), and errors that
> require an abort of the backup/restore.
> 
> But I'm not sure if no future extension can make it necessary that the
> kernel distinguishes certain errors to be able to react flexibly.
> 
> It is a pretty common design pattern to deduct different problems from
> different errors. I'd like this API to work that way too. I think this
> is a design decision. Perhaps you can note my preference as an alternative.
> 
>> There is no global convention about which error number means what.
> 
> This is something, which might bite us one day. If multiple modules
> report similar error messages for problems that are pretty different to
> handle by the user, then the support team might have a hard time to
> figure out, what happened exactly. Sure, the final message contains the
> module name, but often the customer doesn't remember the exact text.
> Especially if he is no native English speaker. "The backup said no such
> tape", but it was "file not found". Perhaps the xbsa: type specifier had
> been forgotten. A globally unique error number would help a lot.
> 
>> - For simplicity, storage module error descriptions are in plain English. The
>> common "error from storage engine" error message is internationalized like other
>> MySQL errors (using errmsg file).
> 
> Unfortunately, we have such behavior in storage engines already as
> historical ballast. But I oppose to specify this for new software. There
> should be a way to handle internationalization for storage modules.
> 
>> - For simplicity, storage module error descriptions are in plain English. The
>> common "error from storage engine" error message is internationalized like other
>> MySQL errors (using errmsg file).
> 
> A matter of course. I wouldn't mention it in the specification.
> 
>> Proposed error messages
> 
> A lot of work, a lot of details. You may have got it mostly right. My
> concern is that it may turn out as incomplete during implementation.
> Then the specification must be changed. I wonder if it won't be better
> to keep this out of the specification.
> 
> Regards
> Ingo
Thread
RFC: WL#5046 - error reportingRafal Somla24 Oct
  • Re: RFC: WL#5046 - error reportingIngo Strüwing25 Oct
    • RE: RFC: WL#5046 - error reportingAndreas Almroth26 Oct
    • Re: RFC: WL#5046 - error reportingRafal Somla27 Oct
      • Re: RFC: WL#5046 - error reportingIngo Strüwing27 Oct
        • Re: RFC: WL#5046 - error reportingRafal Somla28 Oct
          • Re: RFC: WL#5046 - error reportingIngo Strüwing29 Oct
            • Re: RFC: WL#5046 - error reportingRafal Somla3 Nov
              • RE: RFC: WL#5046 - error reportingAndreas Almroth4 Nov
  • Re: RFC: WL#5046 - error reportingRafal Somla27 Oct
    • Re: RFC: WL#5046 - error reportingIngo Strüwing27 Oct
      • Re: RFC: WL#5046 - error reportingRafal Somla27 Oct
        • RE: RFC: WL#5046 - error reportingAndreas Almroth27 Oct
  • Re: RFC: WL#5046 - error reportingIngo Strüwing4 Nov