List: Backup
From: Rafal Somla
Date: November 3 2009 12:58pm
Subject: Re: RFC: WL#5046 - error reporting
Hi Ingo,

Ingo Strüwing wrote:

...
>> Sure. Working at MySQL I've learned that probably the best way to deal
>> with that is:
>> 1. do our best to design as good interface as we can at the moment,
>> accepting that we can not predict everything;
>> 2. when unpredicted issue arises, rework the interface.
> 
> 
> Interesting. We had this topic in the other direction when talking about
> the asynchronism extensions. :)

No no, this was point 1: "do our best to design as good interface as we can"  :)

> 
> ...
>> After thinking about it for a long while, for me this boils down to the
>> following design choice (don't ask me why :)):
>>
>> Currently, when a service is called, only two general outcomes are
>> possible:
>>
>> 1. Service succeeds and provides the specified information.
>> 2. Service fails and this is a fatal error - the whole session is
>> interrupted.
>>
>> But perhaps we want to have three possible outcomes:
>>
>> 1. Service succeeds and provides the specified information.
>> 2. Service fails with fatal error - the whole session is interrupted.
>> 3. Service fails with non-fatal error - the session can still be used.
> 
> 
> I agree. These are the choices, we were discussing about.
> 

I have changed HLS to specify the second alternative. That is, the 
distinction between fatal and non-fatal errors is made explicit in the 
interface. I kept fatal errors because I think it allows for more efficient 
implementations which can assume that after reporting fatal error no other 
services would be called. Otherwise, all service implementations would have 
to check for session validity and this would cost additional cycles.
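
A minimal sketch of the three outcomes and of the "no calls after a fatal
error" convention. All names here (Outcome, Session, SessionClosedError) are
hypothetical and not taken from the HLS. The guard on the caller side is only
an assumption about how the convention could be enforced in one place, so
that individual services do not need to check session validity themselves.

```python
from enum import Enum, auto


class Outcome(Enum):
    OK = auto()         # service succeeded and provided its information
    NON_FATAL = auto()  # service failed, session can still be used
    FATAL = auto()      # service failed, session must not be used again


class SessionClosedError(Exception):
    """Raised when the caller breaks the no-calls-after-fatal convention."""


class Session:
    """Caller-side wrapper around a storage session (hypothetical)."""

    def __init__(self):
        self.usable = True

    def call(self, service, *args):
        # The only validity check lives here, not in every service.
        if not self.usable:
            raise SessionClosedError("service called after a fatal error")
        outcome, value = service(*args)
        if outcome is Outcome.FATAL:
            self.usable = False
        return outcome, value
```

Each service is assumed to return an (Outcome, value) pair. How severity is
actually encoded is a low-level design decision and may look quite different.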


>> I am still not convinced that we really need it. Although I think I can
>> also buy it. The only thing which stops me right now is that I'd rather
>> keep it simpler if possible.
>>
>> If we are to go this way, then I think a user of a storage module
>> (backup kernel) can not decide on its own whether given error is fatal
>> or non-fatal.
>> In the end, it is the storage module which knows whether the failure
>> that has happened prevents further operation or not.
> 
> 
> Funny, I feel the contrary. How can the module know, which options the
> kernel has, to work around problems?
>

There can be two reasons why operations can not continue:

a) the storage module is in fatal error condition and it can not work,
b) the backup kernel has reached a state where nothing else can be done but 
aborting the operation.

Only storage module knows about condition a) and only backup kernel knows 
about condition b). Storage module should not decide whether b) has happened 
and backup kernel should not guess if a) has happened. Since it is backup 
kernel who drives the whole operation, the information about condition a) 
must be passed from storage module to it. I think there is no need to inform 
storage module about condition b). If backup kernel decides to abort 
operation it will abort storage session and shut down storage module using 
its services.

> For example. If the medium runs full, how can the module know, if the
> kernel is allowed by the user to retry with compression?
> 
> I think there are only very few errors that make a session unusable, for
> example insufficient memory.
> 
> OTOH, with the current service specification, we can probably just drop
> a failed session and initialize a new one. (With the compression
> selection service, things would be different.)
>

Very good example. In the previous version of HLS there was no way for 
storage module to inform about "disk full" condition - the only option was 
to report fatal error and then backup kernel would have to create new 
session. Currently I imagine that backup kernel logic would be something 
like this:

   1. Call "write bytes" service.
   2. If call was ok, then continue.
   3. If fatal error, then report error and abort.
   4. If non-fatal error, then:
      4a. close stream and free location,
      4b. re-open location for writing,
      4c. restart backup using compression.

Here I still avoid analysing errors signaled by storage module. Simply, 
whenever a non-fatal error is reported, the work-around with compression 
would be tried.
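
The kernel logic above could be sketched roughly as follows. The storage
object and its methods (open_for_write, write_bytes, close_and_free), the
compress callback and BackupAborted are hypothetical names used only for
illustration. The sketch also assumes that a non-fatal error during the
compressed attempt leads to an abort, which the steps above do not specify.

```python
from enum import Enum, auto


class Outcome(Enum):
    OK = auto()
    NON_FATAL = auto()
    FATAL = auto()


class BackupAborted(Exception):
    pass


def _write_all(storage, chunks, transform):
    """Write every chunk; stop at the first outcome that is not OK."""
    for chunk in chunks:
        outcome = storage.write_bytes(transform(chunk))  # step 1
        if outcome is not Outcome.OK:
            return outcome
    return Outcome.OK  # step 2 held for every chunk


def run_backup(storage, chunks, compress):
    """Return True if compression was needed, False otherwise."""
    storage.open_for_write()
    use_compression = False
    while True:
        transform = compress if use_compression else (lambda data: data)
        outcome = _write_all(storage, chunks, transform)
        if outcome is Outcome.OK:
            return use_compression
        if outcome is Outcome.FATAL:
            raise BackupAborted("fatal error from storage module")  # step 3
        if use_compression:
            # Assumption: no further work-around once compression failed too.
            raise BackupAborted("non-fatal error persisted with compression")
        storage.close_and_free()   # step 4a
        storage.open_for_write()   # step 4b
        use_compression = True     # step 4c: restart from the first chunk
```

Note that chunks must be a sequence that can be iterated twice, since the
backup is restarted from the beginning after a non-fatal error.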

However, it could make sense to try compression only in "disk full" 
condition and do other things (simple re-try or abort) upon other non-fatal 
errors. To implement this behaviour, backup kernel must be able to 
distinguish the two situations and this can be done only if storage module 
provides more information about it. Thus I see this as a request for 
extending the interface so that the information can be conveyed. In this 
case I would change specification of "write bytes service" to explicitly 
return information about disk full condition:

S6  Write bytes to location.
     IN:  Backup storage session, data buffer and amount of data to
          be written.
     OUT: Amount of data that has been written and information if there is
          space for more data.

Fatal and other non-fatal errors would be reported as usual.
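
To illustrate the extended S6, here is a small sketch of a "write bytes"
service that reports both the amount written and whether there is space for
more data. WriteResult and MemoryLocation are hypothetical names, and a
fixed-capacity in-memory buffer stands in for a real backup location.

```python
from dataclasses import dataclass


@dataclass
class WriteResult:
    written: int     # amount of data that has been written
    has_space: bool  # True if the location can accept more data


class MemoryLocation:
    """Stand-in for a backup location with a fixed capacity (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def write_bytes(self, buf, amount):
        # Write as much as fits; a short write is not reported as an error.
        free = self.capacity - len(self.data)
        n = min(amount, len(buf), free)
        self.data += buf[:n]
        return WriteResult(written=n, has_space=len(self.data) < self.capacity)
```

With this shape the kernel can tell "disk full" apart from other non-fatal
errors by checking has_space, without analysing error codes.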

...
> Agree. But I find it more natural for a software developer to think in
> function signatures when reading [IN] and [OUT]. I would prefer to avoid
> the surprise when switching from specification- to code reading.
> 
> If we want to leave freedom to the implementor, then we could perhaps
> specify the services with one paragraph per in/out "information" instead
> of [IN] and [OUT].
> 
> OTOH this should not be a prerequisite for me to approve the HLS.
> 

I like this more general form of specifications and I have updated HLS 
accordingly.

>>> We can leave it to the backup kernel, which errors to take as fatal, and
>>> which to work around. Backup kernel could be fixed in this respect,
>>> without changing the interface.
>>>
>> But first of all, backup kernel must know if backup storage session is
>> usable after an error or not. This information must be passed somehow
>> from module to the kernel - the kernel can not decide it on its own.
> 
> 
> Well, this could be solved by all further services to fail, so that
> every attempt to work around the problem would fail.

This approach has two problems:
- It might be difficult for backup kernel to decide if the failure is really 
fatal or if it can/should try again,
- It would require each service to check if session is valid upon each call. 
If storage module can explicitly report fatal error and there is a convention 
that service calls after fatal error are prohibited, then a slightly more 
efficient implementation is possible.

> 
> But the most important cases will have well-known error codes. And
> hence, well-known severity.
> 

I leave it to LLD to decide how error severity is reported. This is one 
possibility.

...
>> Ok, I still see an issue how global error codes (code ranges) would be
>> assigned to particular BSM implementations? The only solution which I
>> can come up with is that we reserve certain range for MySQL and then all
>> BSMs developed at MySQL will use unique error numbers from that range.
>> But any external implementers will use arbitrarily chosen numbers outside
>> of that range. Then the error numbers are bound to overlap and the
>> advantage you have in mind is not going to happen.
> 
> 
> This applies to storage modules, which are developed and used
> proprietarily. I think of community projects mainly. These will be added
> to the MySQL code base. If they add error messages to errmsg-utf8.txt,
> their final push will reserve the numbers once and forever.
> 

This suggestion is in contrast with what I propose in HLS. My idea was that 
storage module provides error description via service call and then kernel 
reports it using one general "error from storage module" MySQL error (one 
entry in errmsg.txt). With your proposal, storage modules should register 
error messages in errmsg.txt like the rest of the server. Then there is no 
need for a service which gives error description. It is enough that error 
number is reported and then backup kernel can locate its description in 
errmsg.txt as usual. I described it as alternative A5 - please verify that 
it is adequately described.

I think one consequence of such a design will be that a user would not be 
able to hot-plug a storage module which was not known at server build time 
(when errmsg.txt was compiled). To use such a module he would not only have 
to stop the server but also recompile it. With my design, it should be 
possible to upgrade a running server by adding to it a new storage module, 
even if this module did not exist at the server build time.
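
The difference between the two designs can be sketched as follows. The error
numbers, the message table and the TapeModule class are made up for
illustration; ERRMSG only stands in for the messages compiled from errmsg.txt
at server build time.

```python
# Messages known at server build time (stand-in for compiled errmsg.txt).
ER_BACKUP_STORAGE = 1000  # hypothetical number for the one general error
ERRMSG = {ER_BACKUP_STORAGE: "Error from storage module: {}"}


class TapeModule:
    """A storage module added after the server was built (hypothetical)."""

    _messages = {7: "tape drive offline"}

    def describe_error(self, code):
        # Service through which the module describes its own errors.
        return self._messages.get(code, "unknown module error %d" % code)


def report_via_service(module, code):
    # HLS design: one general server error wraps the module's description.
    return ERRMSG[ER_BACKUP_STORAGE].format(module.describe_error(code))


def report_via_table(code):
    # Alternative A5: the kernel looks the number up in the compiled table.
    # A module unknown at build time has no entry, so None is returned.
    return ERRMSG.get(code)
```

In this sketch report_via_service(TapeModule(), 7) yields a usable message,
while report_via_table(7) finds nothing until the table is rebuilt.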

Rafal
