Ingo, Rafal,
I think it is important to keep in mind that we are talking about two layers of error reporting here: the BSM and the backup kernel.
We require the BSM to report errors, and its error codes are a concern only between the backup kernel and the BSM. The clear-text messages of these errors, however, are of interest to the end user, and thus should/must be forwarded to the general logging facility available to mysqld.
Also, please note that most backup software uses its own logging facility, so if the user can't find the root cause in the logs from mysqld, they may be able to find it in the application logs. My view is that we should provide as much information as possible, regardless of any locale issues, enabling the user to troubleshoot the situation more efficiently.
The backup kernel will only act upon the error codes returned by the BSM, according to the states identified in the activity diagrams/sequence diagrams. I have decided not to write state machines, as it would take too much effort and give little in return. The activity diagrams will cover the possible error conditions. Unexpected errors will _always_ lead to a complete abort, and thus never to an undefined condition.
The second layer of errors is what the backup kernel reports to mysqld. These we can control and enumerate, and even have UTF8 versions of, but we cannot control the BSM's textual error messages. We forward the BSM errors as a favour to the user, not because we need to. Nor can we mandate that those textual error messages abide by the locale in use. If the BSM only uses English, the error message stack will consist of:
1) overall error description as seen by the backup kernel (I18N), abiding by the locale
2) error message from the BSM (I18N if implemented), in English or using the locale
The combination will guide the user for further action to remedy the error condition, regardless of language. Worst case would be a combination of two languages (seen before, sigh).
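To make the two-part stack concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the function name `build_error_stack`, the message catalogue, the error code 1 and the sample texts are assumptions, not part of WL#5046 or the server.

```python
# Hypothetical catalogue of backup kernel messages, keyed by (code, locale).
# In the server these would be the registered, localized kernel messages.
KERNEL_MESSAGES = {
    (1, "en"): "Backup storage module reported an error",
    (1, "fr"): "Erreur du module de stockage de sauvegarde",
}


def build_error_stack(kernel_code, locale, bsm_text):
    """Return the two-entry message stack shown to the user.

    Entry 1 comes from the backup kernel and follows the locale
    (falling back to English if no translation exists).
    Entry 2 is the BSM's own text, forwarded unchanged in whatever
    language the BSM chose to use.
    """
    kernel_text = KERNEL_MESSAGES.get(
        (kernel_code, locale), KERNEL_MESSAGES[(kernel_code, "en")]
    )
    return [kernel_text, bsm_text]


# Worst case from above: a localized kernel message plus an English BSM message.
stack = build_error_stack(1, "fr", "Tape drive not ready")
```

The kernel never interprets the BSM text here; it only passes it along, which matches the idea that BSM messages are forwarded as a favour to the user.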
So:
* Backup kernel error codes should be formally registered
* BSM error codes are "private" between backup kernel and BSM
Best regards / Cordialement
Andreas Almroth
-----Original Message-----
From: Rafal.Somla@stripped [mailto:Rafal.Somla@stripped]
Sent: 3. november 2009 13:59
To: Ingo Strüwing
Cc: Andreas Almroth; backup@stripped
Subject: Re: RFC: WL#5046 - error reporting
Hi Ingo,
Ingo Strüwing wrote:
...
>> Sure. Working at MySQL I've learned that probably the best way to deal
>> with that is:
>> 1. do our best to design as good interface as we can at the moment,
>> accepting that we can not predict everything;
>> 2. when unpredicted issue arises, rework the interface.
>
>
> Interesting. We had this topic in the other direction when talking about
> the asynchronism extensions. :)
No no, this was point 1: "do our best to design as good interface as we can" :)
>
> ...
>> After thinking about it for a long while, for me this boils down to the
>> following design choice (don't ask me why :)):
>>
>> Currently, when a service is called, only two general outcomes are
>> possible:
>>
>> 1. Service succeeds and provides the specified information.
>> 2. Service fails and this is a fatal error - the whole session is
>> interrupted.
>>
>> But perhaps we want to have three possible outcomes:
>>
>> 1. Service succeeds and provides the specified information.
>> 2. Service fails with fatal error - the whole session is interrupted.
>> 3. Service fails with non-fatal error - the session can still be used.
>
>
> I agree. These are the choices, we were discussing about.
>
I have changed the HLS to specify the second alternative. That is, the
distinction between fatal and non-fatal errors is made explicit in the
interface. I kept fatal errors because I think they allow for more efficient
implementations, which can assume that no other services will be called
after a fatal error has been reported. Otherwise, all service
implementations would have to check for session validity, and this would
cost additional cycles.
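A minimal Python sketch of the three outcomes, under stated assumptions: the names `Outcome`, `KernelSideSession` and `DemoModule` are hypothetical, and how severity is actually reported is left to the LLD. The service implementation does no validity check of its own; the single guard sits on the caller's side only to make the "no calls after a fatal error" convention visible.

```python
from enum import Enum


class Outcome(Enum):
    OK = 0         # service succeeded and provided the information
    NON_FATAL = 1  # service failed, session can still be used
    FATAL = 2      # service failed, session must not be used again


class DemoModule:
    """Toy storage module; its services never check session validity."""

    def write_bytes(self, data):
        if data is None:
            return Outcome.FATAL, None
        if len(data) == 0:
            return Outcome.NON_FATAL, 0
        return Outcome.OK, len(data)


class KernelSideSession:
    """Caller-side wrapper that enforces the convention in one place."""

    def __init__(self, module):
        self.module = module
        self.dead = False

    def call(self, service, *args):
        if self.dead:
            raise RuntimeError("service called after fatal error")
        outcome, value = getattr(self.module, service)(*args)
        if outcome is Outcome.FATAL:
            self.dead = True
        return outcome, value


session = KernelSideSession(DemoModule())
first = session.call("write_bytes", b"abc")   # succeeds
second = session.call("write_bytes", b"")     # non-fatal, session still usable
third = session.call("write_bytes", None)     # fatal, session is now dead
```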
>> I am still not convinced that we really need it. Although I think I can
>> also buy it. The only thing which stops me right now is that I'd rather
>> keep it simpler if possible.
>>
>> If we are to go this way, then I think a user of a storage module
>> (backup kernel) can not decide on its own whether given error is fatal
>> or non-fatal.
>> In the end, it is the storage module which knows whether the failure
>> that has happened prevents further operation or not.
>
>
> Funny, I feel the contrary. How can the module know, which options the
> kernel has, to work around problems?
>
There can be two reasons why operations can not continue:
a) the storage module is in a fatal error condition and can not work,
b) the backup kernel has reached a state where nothing else can be done but
aborting the operation.
Only the storage module knows about condition a) and only the backup kernel
knows about condition b). The storage module should not decide whether b)
has happened and the backup kernel should not guess whether a) has happened.
Since it is the backup kernel that drives the whole operation, the
information about condition a) must be passed to it from the storage module.
I think there is no need to inform the storage module about condition b). If
the backup kernel decides to abort the operation, it will abort the storage
session and shut down the storage module using its services.
> For example. If the medium runs full, how can the module know, if the
> kernel is allowed by the user to retry with compression?
>
> I think there are only very few errors that make a session unusable, for
> example insufficient memory.
>
> OTOH, with the current service specification, we can probably just drop
> a failed session and initialize a new one. (With the compression
> selection service, things would be different.)
>
Very good example. In the previous version of the HLS there was no way for
the storage module to report a "disk full" condition - the only option was
to report a fatal error, and then the backup kernel would have to create a
new session. Currently I imagine that the backup kernel logic would be
something like this:
1. Call the "write bytes" service.
2. If the call was ok, then continue.
3. If fatal error, then report the error and abort.
4. If non-fatal error, then:
4a. close the stream and free the location,
4b. re-open the location for writing,
4c. restart the backup using compression.
Here I still avoid analysing the errors signalled by the storage module.
Simply, whenever a non-fatal error is reported, the work-around with
compression would be tried.
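The steps above can be sketched roughly as follows in Python. All names (`write_with_fallback`, `FakeSession`, `BackupAborted`, the status constants) are hypothetical, the fake session just pretends that compression halves the data, and aborting when a second non-fatal error occurs with compression already on is my own assumption to keep the loop finite.

```python
OK, NON_FATAL, FATAL = "ok", "non-fatal", "fatal"


class BackupAborted(Exception):
    pass


class FakeSession:
    """Stand-in for a storage session with limited capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.log = []

    def write_bytes(self, chunk, compressed):
        size = len(chunk) // 2 if compressed else len(chunk)
        if self.used + size > self.capacity:
            return NON_FATAL          # e.g. the medium ran full
        self.used += size
        return OK

    def close_and_free(self):
        self.used = 0
        self.log.append("close")

    def reopen_for_writing(self):
        self.log.append("reopen")


def write_with_fallback(session, chunks):
    """Write all chunks; return True if compression had to be used."""
    compressed = False
    while True:
        restarted = False
        for chunk in chunks:
            status = session.write_bytes(chunk, compressed)  # step 1
            if status == OK:                                 # step 2
                continue
            if status == FATAL or compressed:                # step 3
                raise BackupAborted("storage module error")
            session.close_and_free()                         # step 4a
            session.reopen_for_writing()                     # step 4b
            compressed = True                                # step 4c
            restarted = True
            break
        if not restarted:
            return compressed


small = FakeSession(capacity=10)
used_compression = write_with_fallback(small, [b"aaaa"] * 4)
```

Note that the kernel does not look at which non-fatal error occurred; any non-fatal error triggers the same work-around.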
However, it could make sense to try compression only in the "disk full"
condition and do other things (a simple re-try or abort) upon other
non-fatal errors. To implement this behaviour, the backup kernel must be
able to distinguish the two situations, and this can be done only if the
storage module provides more information about the error. Thus I see this as
a request for extending the interface so that the information can be
conveyed. In this case I would change the specification of the "write bytes"
service to explicitly return information about the disk full condition:
S6 Write bytes to location.
IN: Backup storage session, data buffer and amount of data to
be written.
OUT: Amount of data that has been written and information if there is
space for more data.
Fatal and other non-fatal errors would be reported as usual.
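One possible shape for the extended S6, as a Python sketch only. The names `WriteResult`, `space_left` and `InMemoryLocation` are hypothetical; the HLS deliberately does not fix a function signature, so this is just one way the IN/OUT information could be mapped.

```python
from dataclasses import dataclass


@dataclass
class WriteResult:
    bytes_written: int  # OUT: amount of data that has been written
    space_left: bool    # OUT: whether there is space for more data


class InMemoryLocation:
    """Toy storage session with a fixed capacity, for illustration only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def write_bytes(self, buffer, amount):
        """S6: write up to `amount` bytes taken from `buffer`."""
        room = self.capacity - len(self.data)
        n = min(amount, len(buffer), room)
        self.data += buffer[:n]
        return WriteResult(n, len(self.data) < self.capacity)


location = InMemoryLocation(capacity=5)
first = location.write_bytes(b"abc", 3)    # fits, space remains
second = location.write_bytes(b"defg", 4)  # only 2 bytes fit, now full
```

With this shape, "disk full" is ordinary output rather than an error, so the kernel can decide on compression without inspecting error codes.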
...
> Agree. But I find it more natural for a software developer to think in
> function signatures when reading [IN] and [OUT]. I would prefer to avoid
> the surprise when switching from specification- to code reading.
>
> If we want to leave freedom to the implementor, then we could perhaps
> specify the services with one paragraph per in/out "information" instead
> of [IN] and [OUT].
>
> OTOH this should not be a prerequisite for me to approve the HLS.
>
I like this more general form of specifications and I have updated HLS
accordingly.
>>> We can leave it to the backup kernel, which errors to take as fatal, and
>>> which to work around. Backup kernel could be fixed in this respect,
>>> without changing the interface.
>>>
>> But first of all, backup kernel must know if backup storage session is
>> usable after an error or not. This information must be passed somehow
>> from module to the kernel - the kernel can not decide it on its own.
>
>
> Well, this could be solved by all further services to fail, so that
> every attempt to work around the problem would fail.
This approach has two problems:
- It might be difficult for the backup kernel to decide if the failure is
really fatal or if it can/should try again,
- It would require each service to check if the session is valid upon each
call.
If the storage module can explicitly report a fatal error and there is a
convention that service calls after a fatal error are prohibited, then a
slightly more efficient implementation is possible.
>
> But the most important cases will have well-known error codes. And
> hence, well-known severity.
>
I leave it to LLD to decide how error severity is reported. This is one
possibility.
...
>> Ok, I still see an issue how global error codes (code ranges) would be
>> assigned to particular BSM implementations? The only solution which I
>> can come up with is that we reserve certain range for MySQL and then all
>> BSMs developed at MySQL will use unique error numbers from that range.
>> But any external implementers will use arbitrarily chosen numbers outside
>> of that range. Then the error numbers are bound to overlap and the
>> advantage you have in mind is not going to happen.
>
>
> This applies to storage modules, which are developed and used
> proprietarily. I think of community projects mainly. These will be added
> to the MySQL code base. If they add error messages to errmsg-utf8.txt,
> their final push will reserve the numbers once and forever.
>
This suggestion is in contrast with what I propose in the HLS. My idea was
that the storage module provides the error description via a service call
and then the kernel reports it using one general "error from storage module"
mysql error (one entry in errmsg.txt). With your proposal, storage modules
would register their error messages in errmsg.txt like the rest of the
server. Then there is no need for a service which gives the error
description. It is enough that the error number is reported, and then the
backup kernel can locate its description in errmsg.txt as usual. I described
it as alternative A5 - please verify that it is adequately described.
I think one consequence of such a design will be that a user would not be
able to hot-plug a storage module which was not known at server build time
(when errmsg.txt was compiled). To use such a module he would not only have
to stop the server but also recompile it. With my design, it should be
possible to upgrade a running server by adding to it a new storage module,
even if this module did not exist at the server build time.
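The difference between the two designs can be sketched in Python as follows. This is only an illustration under assumptions: the dictionary stands in for the compiled errmsg.txt, and the module name, the error numbers, the message texts and the `describe_error` service name are all made up.

```python
# HLS design: one generic server message; the text comes from the module.
GENERIC_BSM_ERROR = "Error from storage module '{name}': {text}"

# Alternative A5: messages are fixed when the server is built
# (this dict is a stand-in for the compiled errmsg.txt).
COMPILED_MESSAGES = {9001: "Tape drive not ready"}


class HotPluggedModule:
    """A module that did not exist when the server was built."""

    name = "new_bsm"

    def describe_error(self, code):
        # Hypothetical service returning the module's own description.
        return {17: "Bucket quota exceeded"}.get(code, "unknown error")


def report_via_service(module, code):
    """HLS design: works for any module, known at build time or not."""
    return GENERIC_BSM_ERROR.format(
        name=module.name, text=module.describe_error(code)
    )


def report_via_table(code):
    """A5: raises KeyError for codes unknown at server build time."""
    return COMPILED_MESSAGES[code]


message = report_via_service(HotPluggedModule(), 17)
```

Calling `report_via_table(17)` fails in this sketch because the new module's code was never compiled in, which is the hot-plug limitation described above.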
Rafal