List: Backup
From: Rafal Somla
Date: October 28 2009 4:16pm
Subject: Re: RFC: WL#5046 - error reporting

Hi Ingo,

The discussion continues...

Ingo Strüwing wrote:
> Hi Rafal,
> Rafal Somla, 27.10.2009 10:08:
> ...
>> Ingo Strüwing wrote:
> ...
>>> Rafal Somla, 24.10.2009 16:02:
> ...
>> But, do you have any concrete propositions what error situations for
>> which services should be distinguishable (across all BSMs)? Or you are
>> only concerned with possible future extensions?
> I am concerned about human fallibility. Often during implementation or
> changes (e.g. bug fixes) problems pop up, which haven't been foreseen
> during specification.

Sure. Working at MySQL I've learned that probably the best way to deal with 
that is:
1. do our best to design as good an interface as we can at the moment, 
accepting that we cannot predict everything;
2. when an unforeseen issue arises, rework the interface.

> Hence I'm not in favor of an interface specification, which requires all
> (if only non-fatal) problems to be identified in advance.

After thinking about it for a long while, for me this boils down to the 
following design choice (don't ask me why :)):

Currently, when a service is called, only two general outcomes are possible:

1. Service succeeds and provides the specified information.
2. Service fails and this is a fatal error - the whole session is interrupted.

But perhaps we want to have three possible outcomes:

1. Service succeeds and provides the specified information.
2. Service fails with fatal error - the whole session is interrupted.
3. Service fails with non-fatal error - the session can still be used.

I am still not convinced that we really need it, although I think I could 
buy it. The only thing which stops me right now is that I'd rather keep it 
simpler if possible.

If we are to go this way, then I think a user of a storage module (the backup 
kernel) cannot decide on its own whether a given error is fatal or non-fatal.
In the end, it is the storage module which knows whether the failure that 
has happened prevents further operation or not.

The distinction could be made by convention (some error codes are fatal and 
some are not) or by other means. But I don't think it needs to be specified 
at this level of abstraction. Most important is to agree whether a storage 
module can report non-fatal errors or whether, as in the current design, it 
must either succeed or treat every error as fatal.
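
To make the three-outcome variant concrete, here is a minimal sketch in C. 
All names (bsm_outcome, kernel_session, kernel_note_outcome) are hypothetical 
and not part of WL#5046; the only point is that the module reports the 
severity and the kernel merely records it.

```c
/* Hypothetical names; the specification defines none of these. */
typedef enum {
    BSM_OK = 0,        /* service succeeded, information is available    */
    BSM_ERR_NONFATAL,  /* service failed, session can still be used      */
    BSM_ERR_FATAL      /* service failed, the whole session is interrupted */
} bsm_outcome;

/* The kernel's view of a storage session. */
typedef struct {
    int usable;        /* 1 while the session may still be used */
} kernel_session;

/*
 * The kernel does not classify the error itself. It only records the
 * severity reported by the storage module. Returns 1 if the session
 * may still be used afterwards, 0 otherwise.
 */
static int kernel_note_outcome(kernel_session *s, bsm_outcome reported)
{
    if (reported == BSM_ERR_FATAL)
        s->usable = 0;   /* once interrupted, the session stays unusable */
    return s->usable;
}
```

With only two outcomes, as in the current design, BSM_ERR_NONFATAL would 
simply not exist and every failure would clear the flag.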

> If we want to specify all behavioral aspects, it might be easier to
> adapt the specification to reality by changing a list of error codes
> than by changing a service signature.

I'm trying to imagine how the alternative specification would look then. As 
far as I can tell, the difference would be that instead of explicitly saying 
that e.g. service S10 (get information about backup image) informs us if 
the location is empty, there would be a global error constant like 
BSM_LOC_EMPTY with the implicit understanding that:

a) whenever a service tries to access a backup image at a location which is 
empty, it should return the BSM_LOC_EMPTY error (it took me a while to 
formulate it precisely).

b) the BSM_LOC_EMPTY error is non-fatal: the storage session can be used 
after it has been reported.

I prefer to make such choices more explicit. Also, I see such a specification 
as more low-level, because it implies a particular implementation choice 
(use of error codes). I know this is an obvious and natural choice, but 
still I don't think it is necessary or appropriate to force it at this 
level of specification.

With my specification, I only state what information must be passed 
out of a service - I do not say how this should be done. In particular, it 
can be done as described above: an implementation of service S10

S10 Get information about image stored in the location.
     [IN]  backup storage session
     [OUT] size and timestamp of the image or information that location does
           not contain a backup image.

can be a function:

int get_image_info(session, ...)

which returns:
  0               - upon normal termination
  BSM_LOC_EMPTY   - if location is empty
  BSM_WRONG_DATA  - if location does not contain backup image
  error           - negative error code if fatal error has been encountered.

This is a valid implementation of the above specification. The information 
that the location does not contain a backup image is passed in the form of 
a positive return value. This is distinguished from (fatal) errors, which 
are signalled via negative return values.
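
A minimal sketch of that convention in C follows. The numeric values of 
BSM_LOC_EMPTY and BSM_WRONG_DATA, the demo_session type and its loc_state 
field are assumptions made only so the example is self-contained; a real 
module would inspect actual storage instead.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical values; only the sign convention comes from the text. */
#define BSM_LOC_EMPTY   1   /* positive: non-fatal, location is empty      */
#define BSM_WRONG_DATA  2   /* positive: non-fatal, not a backup image     */

/* Stand-in for a real storage location, for illustration only. */
typedef enum { LOC_EMPTY, LOC_GARBAGE, LOC_IMAGE, LOC_BROKEN } loc_state;

typedef struct {
    loc_state state;
    size_t    size;
    time_t    timestamp;
} demo_session;

/* Returns 0 and fills the outputs on success, a positive code for the
   two "no image" cases, and a negative code for a fatal error. */
int get_image_info(const demo_session *s, size_t *size, time_t *timestamp)
{
    switch (s->state) {
    case LOC_EMPTY:
        return BSM_LOC_EMPTY;
    case LOC_GARBAGE:
        return BSM_WRONG_DATA;
    case LOC_IMAGE:
        *size = s->size;
        *timestamp = s->timestamp;
        return 0;
    default:
        return -1;          /* fatal: session must not be used any more */
    }
}

/* Caller side: the sign alone tells whether the session is still usable. */
static const char *describe(int rc)
{
    if (rc < 0)               return "fatal error";
    if (rc == BSM_LOC_EMPTY)  return "location is empty";
    if (rc == BSM_WRONG_DATA) return "not a backup image";
    return "ok";
}
```

The caller never has to guess the severity: any negative value means the 
session is gone, any non-negative value means it can continue.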

> We can leave it to the backup kernel, which errors to take as fatal, and
> which to work around. Backup kernel could be fixed in this respect,
> without changing the interface.

But first of all, the backup kernel must know whether the backup storage 
session is usable after an error or not. This information must be passed 
somehow from the module to the kernel - the kernel cannot decide it on its 
own. But once it knows what kind of error has happened, then sure, it can 
freely decide what to do about it. As far as the interface is concerned, it 
is most important to specify what information is passed through it, and 
only then how.

> ...
>>>> There is no global convention about which error number means what.
>>> This is something, which might bite us one day. If multiple modules
>>> report similar error messages for problems that are pretty different to
>>> handle by the user, then the support team might have a hard time to
>>> figure out, what happened exactly. Sure, the final message contains the
>>> module name, but often the customer doesn't remember the exact text.
>>> Especially if he is no native English speaker. "The backup said no such
>>> tape", but it was "file not found". Perhaps the xbsa: type specifier had
>>> been forgotten. A globally unique error number would help a lot.
>> I don't understand the example. How "The backup said no such tape" could
>> possibly appear if we are using a filesystem BSM and the real problem
>> was "file not found"?
> By plain user error. Users don't read error messages carefully.
> Especially if they don't understand the language well. To the example:
> User wants to restore from tape. He forgets xbsa:. Error message is
> "file not found". User identifies the words "not found". From his school
> English he understands it as "not there". But the tape is there! So
> what's wrong? Let's call support and tell them MySQL is stupid. It
> claims "tape not there", while it is there.
> The user would do better, if the message was in his native language.
> Support would have better chances if there is a unique error number.

And what if the user makes a mistake when reporting the error number? ;)

> Some (management-)applications could also profit from unique error
> numbers. So they won't need to parse the error text.

OK, but I still see an issue: how would global error codes (code ranges) be 
assigned to particular BSM implementations? The only solution I can come up 
with is that we reserve a certain range for MySQL, and all BSMs developed 
at MySQL use unique error numbers from that range. But any external 
implementers will use arbitrarily chosen numbers outside of that range. 
Then the error numbers are bound to overlap and the advantage you have in 
mind is not going to happen. Thus I consider my proposition of local error 
codes to be "cleaner", as it does not promise something we cannot 
guarantee anyway.
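
One way to picture local error codes is the sketch below, where an error 
is identified by the pair (module name, local code) rather than by the 
number alone. The bsm_error struct, both helper functions and the module 
names "file" and "xbsa" are hypothetical and only illustrate the idea.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical error record: the code is meaningful only together
   with the name of the module that reported it. */
typedef struct {
    const char *module;  /* name of the reporting BSM            */
    int         code;    /* local to that module, may overlap    */
    const char *text;    /* description supplied by the module   */
} bsm_error;

/* Two errors are the same only if module AND code match. */
static int bsm_error_same(const bsm_error *a, const bsm_error *b)
{
    return a->code == b->code && strcmp(a->module, b->module) == 0;
}

/* Formats the error so the module name always accompanies the number. */
static int bsm_error_format(const bsm_error *e, char *buf, size_t len)
{
    return snprintf(buf, len, "[%s:%d] %s", e->module, e->code, e->text);
}
```

Two modules can then both use code 2 without any clash, because a 
management application would compare the pair and not the bare number.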

> ...
>>>     There
>>> should be a way to handle internationalization for storage modules.
>> Easier said than done :) Any propositions? If the proposition is that
>> BSMs use my_error() to report errors then I will oppose such solution...
> See for example:
> WL#2940 - plugin service: error reporting
> WL#751 - Error message construction
> But I must admit that this is not solved for us yet. We may not want to
> spend the effort at the moment. So I will not further insist on
> internationalized messages from storage modules for now, but I want to
> have the interface specified so that it can be added later without an
> interface change. This might be doable when the plugin services (here
> WL#2940) are implemented. You have already suggested a way to select the
> language.

Thanks for digging these out (I had a vague recollection that such WLs 
exist). Indeed WL#2940 does not solve our problems:

- it says nothing about global error codes
- it does not say how locale setting is passed to the module

For the moment, I have updated the HLS as suggested: the locale setting is 
passed to the "create storage session" service, and error descriptions 
returned by the storage module should be in the appropriate language. I 
think this specification should not be strict. That is, a BSM should do its 
best to describe errors in the selected language, but if that is not 
possible, it can use another language (English). 
Any thoughts about how to best specify it?
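
For illustration, the non-strict behaviour could look like the sketch 
below. The function names, the demo_session type, the locale strings and 
the message table are all assumptions; the HLS only says that the locale 
is passed when the session is created.

```c
#include <string.h>

/* Hypothetical message table for a single error description. */
typedef struct {
    const char *locale;
    const char *text;
} localized_msg;

static const localized_msg loc_empty_msgs[] = {
    { "en", "location is empty" },       /* first entry is the fallback */
    { "de", "Speicherort ist leer" },
};

typedef struct {
    const char *locale;   /* remembered from session creation */
} demo_session;

/* The locale is passed once, when the storage session is created. */
static demo_session create_storage_session(const char *locale)
{
    demo_session s;
    s.locale = locale;
    return s;
}

/* Best effort: use the selected language if the module has a
   translation, otherwise fall back to English. */
static const char *describe_loc_empty(const demo_session *s)
{
    size_t i;
    size_t n = sizeof loc_empty_msgs / sizeof loc_empty_msgs[0];

    for (i = 0; i < n; i++)
        if (strcmp(loc_empty_msgs[i].locale, s->locale) == 0)
            return loc_empty_msgs[i].text;
    return loc_empty_msgs[0].text;
}
```

A session created with an unsupported locale still gets a usable 
description, which is what "do its best" would mean in practice.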

