Andrei Elkin wrote:
> Mats, hello.
>
> You are still formally a reviewer for the bug. Since the last time we
> discussed the issue and my idea of fixes it got a customer.
> I have analyzed my last changeset once again to be utmost confident
> in the idea implemented in
>
> http://lists.mysql.com/commits/46451
>
> Let me put it all once again to avoid spawning yet more our last year
> thread.
>
> THE PROBLEM
>
> We have overreaching error messages in the slave's server error log.
> The message is either Deadlock or Timeout in waiting for a lock
> at applying a replication event. It is not really a problem but rather
> transient issue and should not be noticed by the user.
> \footnote{The temporary status changes to the regular once the slave
> has re-tried to apply the failing events unsuccessfully for
> rli->trans_retries == slave_trans_retries times}
>
>
> THE OFFERED SOLUTION
>
> To avoid logging an error having temporary status the reporting method
>
> rli->report(error description args)
>
> should call has_temporary_error() to sort out the actual status.
> If the error is temporary it won't force an error line in the error
> log.
>
>
>
> THE IMPLEMENTATION
>
> rli->report() demotes the error's severity to the warning level
> basing on has_temporary_error() report.
>
> has_temporary_error() becomes available in rpl_reporting.cc as well
> still to be executed exclusively by the SQL thread.
>
> The temporary to the regular status promotion is achieved as simple as
> with raising up thd->is_fatal_error flag.
>
>
> BEING DISCUSSED ISSUES
>
> has_temporary_error() is not necessary to be precise in detection of
> the temporary status. The function is rather optimistic to give a
> chance to re-apply (uselessly) an event e.g in a case of the last
> applying cycle produced a sequence of errors with the temporary in
> the end: a-reg-err_1, ..., a-temp-err_N.
>
> This optimistic feature is irrelevant to the idea of the
> solution. It does not make the SQL slave not to notice any errors
> and when it eventually stops all errors and warnings are printed
> out.
>
> A technical issue to change signature of
> Slave_reporting_capability::report to accept THD* as an arg:
> rli->report(..., THD *thd, ...)
>
> I would do that, still in a separate patch.
>
>
> I would be happy to hear your opinion.
The problem is still the following, to which I have not seen a convincing
explanation or alternative solution: the reporting function report() demotes the
error to a warning if the thread has a temporary error, which may or may not be
the same as the one that is passed.
Suppose that the function is used in this way somewhere, maybe by mistake, maybe
intentional, with a new error code ER_DEADLOCK_IS_FATAL:
/*
If we get a deadlock here, this is to be reported as an error.
*/
int error= thd->stmt_da->sql_errno();
if (error == ER_LOCK_DEADLOCK)
thd->report(ERROR_LEVEL, ER_DEADLOCK_IS_FATAL, "A fatal deadlock occured");
Now, since the patch demotes the error to a warning inside report(), this will
cause the function above to not do what was intended because the calling code
thinks this is an error (ER_DEADLOCK_IS_FATAL) while report() thinks this is a
warning (ER_LOCK_DEADLOCK). This occurs because has_temporary_error() talks
about the error stored in the diagnostics area (ER_LOCK_DEADLOCK) while the
calling code talks about the code passed to the report() function
(ER_DEADLOCK_IS_FATAL).
From my point of view, you can solve this either by:
- Convincing me that this cannot happen, i.e.: that nobody currently does or
ever wants to do anything like the above.
- Changing the patch to ensure that has_temporary_error() and calling code talks
about the same error code either by:
- Passing the error code passed to report() further on to the
has_temporary_error() function when doing the check
- Removing the error code from the parameter list of the report() function and
reading the error code from the diagnostics area inside the report()
function
Best wishes,
Mats Kindahl
>
> cheers,
>
> Andrei
>
> PS the very last mail I replied with to your comments is attached.
>
>> Mats, hi.
>>
>>>>>> void
>>>>>> Slave_reporting_capability::report(loglevel level, int
> err_code,
>>>>>> const char *msg, ...) const
>>>>>> @@ -13,6 +27,8 @@ Slave_reporting_capability::report(logle
>>>>>> va_list args;
>>>>>> va_start(args, msg);
>>>>>> + if (level == ERROR_LEVEL &&
> has_temporary_error(current_thd))
>>>>>> + level= WARNING_LEVEL;
>>>>>>
>>>>
>>>>> Why do you put the logic here? Isn't the what log level should be
> used
>>>>> a decision to be made by the caller? This seems to put to many
>>>>> features into the function, which previously had the role of just
>>>>> reporting an error at a certain level.
>>>>>
>>>> When I made the first patch where
>>>> Rows_log_event,Query_log_event::do_apply_event
>>>> composed the correct arg for rli->report(),
>>>> I had thought there would be just two occurances where rli->report()
>>>> needs to be careful.
>>>> But overnight I realized a simple thing that load data is prone to the
>>>> timeout, and moreover, there is the general problem.
>>>>
>>>> Anywhere in the slave sql thread execution there might happen a
> situation
>>>> that we would not like to report as an error for some time.
>>>> As there are many sources I think it's wiser to implement a general
>>>> mechanism.
>>>>
>>> I don't understand what you're saying here...
>> The paragraph is trying to say what is implemented:
>>
>> whenever there is a
>> rpl->report(error)
>> invocation there is a possibility that the error is a temporary.
>> Hence, its reporting should take of that.
>>
>> The current scheme of reporting
>> where
>>
>> rli->report()
>>
>> in the middle of executing an event
>> follows with the final too late care of whether that was a
>> temporary error is vicious.
>>
>> Here is a suggestion how it can be improved:
>>
>>>> The mechanism that can block reporting and resume it later could be as
>>>> simple as the patch offers, which is in
>>>>
>>>> 1. check if has_temporary_error() and block (demote) the error from
>>>> showing up as an error;
>>>>
>>>> 2. raise is_fatal_error as soon as there is certainty that a blocked
>>>> error reporting needs resuming.
>>>>
>> As you did not comment on 1,2 I presume you agree conceptually.
>>
>>>>> Why is this a problem?
>>>> I hope you see my concern to avoid duplication of the same logics all
>>>> around the code (mentioned above as done in the first patch).
>>>>
>>> Duplicating the logic might be an issue, but changing the report()
>>> function for this purpose strikes me as fixing the problem at the
>>> wrong place.
>>> Is there any particular reason to why you don't construct a special
>>> function for this file only with the functionality you want?
>>> That way,
>>> you avoid code duplication and the changes are kept local to the file,
>>> which eliminates the risk that you are changing behavior of code that
>>> should not be affected by this change.
>> A special function would be a wrapper over rli->report() if I
>> understand your advice correctly.
>> But then, it'd be of dubious advantange as we would extend error
>> reporting interface.
>>
>> Considering what rli->report() should do is a well good subject to
>> discuss.
>> As I see it, rli->report() provide a similar feature as my_error() does.
>> There is only one essencial difference though.
>> An error of a user session can not be temporary.
>> But the rest is conceptaully the same so that I think eventually
>> rli->report() should be replaced with my_error().
>>
>> Our task in that case would be to define a necessary
>>
>> int (*error_handler_hook)(uint my_err, const char *str,myf MyFlags);
>>
>> hooks for IO, SQL threads.
>>
>> We don't need any storage that Slave_reporting_capability provides,
>> because there is thd->main_da.sql_errno etc available.
>>
>> Considering the current DEADLOCK temp error and imagining for a second
>> if there was no rli->report() I would call
>>
>> my_error(error)
>>
>> and hooks, I'd have to prepare/register, would call has_temporary_error() to
>> avoid ERROR line showing up.
>>
>>
>>> The other option is to get all
>>> the information from the THD instance instead of part of the error as
>>> parameter and part from the THD instance.
>> Have you meant to pass THD as an argument to rli->report() and to use
>> thd->main_da ?
>>
>>>>
>>>>> This will make all errors reported through this
>>>>> interface be demoted provided the current thread has a temporary
>>>>> error, regardless of what is actually being reported.
>>>>>
>> Need to speak to you, as sorry, I am lost in guesses.
>>
>>>> It should not. Let's look at your cases.
>>>>
>>>>
>>>>> This can cause
>>>>> two strange situations, to a naïve user of the reporting
> function:
>>>>>
>>>>> 1. A different error than the error present for the current thread
> is
>>>>> passed to the reporting function. This can be a decision on the
>>>>> callers behalf to regard a certain temporary error as a fatal
>>>>> error, but the reporting function will demote this error to a
>>>>> warning level just because the caller has "forgotten" to set
> the
>>>>> "fatal error" flag.
>>>>>
>>>> That's a good question.
>>>>
>>>> In other words, if there had been a temporary error and after that
>>>> the 2nd error that is not meant to be temporary. Is that the
>>>> concern?
>>>>
>>> No, that is not the case I'm describing. The report() function can be
>>> used to report any error, not just one that has been generated.
>>>
>> I hope we must see it in agreement that
>> report() is an immediate reaction on an error situation.
>> There can not be two pending errors to report.
>> We report them as soon as they are generated.
>> It's reasonable and basicly the current non-perfect code follows that.
>>
>> This means indeed, report() can report only immediate, just has been
>> generated error, and nothing else.
>>
>> If there is disagreemnt we will need to talk that over.
>>
>>>> The current handling in rli->report() would respect
>>>> the current error, coming as an argument, so that the 2nd error
>>>> remains a legal error as long as if has_temporary_error() confirms
>>>> that.
>>> ... but what I'm describing is a case where has_temporary_error()
>>> looks at the last generated error and not at the error that was
>>> supplied to the reporting function.
>> Again, the assumption above is once an error situation happened it
>> must be reported.
>> At reporting time properties of the error have to
>> be shaped off, and so the patch does calling has_temporary_error().
>>
>>>> The latter function is interested only in the latest error.
>>>>
>>> Yes, but that might not be what is passed to report().
>>>
>>>> This observation tells me that has_temporary_error() is combinable
>>>> with rli->report() that also deals only with the latest.
>>>>
>>> Again, this is not the scenario I describe. You are changing the
>>> behavior of the report() function to *all* users, not just the case
>>> you are trying to "fix".
>> Right, for *all* because there are many already. But (*many* - all)
>> the rest is not affected: if we had an event and an error which were
>> not listed in has_temporary_error() there would be no change.
>>
>>>>
>>>>> 2. It is necessary to add code to prevent the error message to be
>>>>> repeated ad-nauseum,
>>>> there was no such intention, I change only for demotion. But logics
>>>> of rli->report() can be extended to add such a rule. Again, it would
>>>> be an advantage to do that in the method not on the caller level.
>>>>
>>> I cannot see any advantage. What would the advantage be?
>> The only advantage is that we catch temporary errors correctly on time
>> and in the single central location - inside of report().
>> Making rli->report() just a bit more intelligent is not a big deal.
>>
>> The above might be a temproray solution, but I worder if you will
>> share my ideal world view.
>>
>> Imo logics should be like this:
>>
>> an error situation happens,
>> my_error(error) is called,
>> (*error_handler_hook) having access to execution context figures out
>> what and how to report.
>>
>>>>
>>>>> which means that part of the logic for
>>>>> deciding if this should be a warning is in the caller and part
> is
>>>>> inside the reporting function. This split just introduces an
>>>>> unnecessary coupling between the caller and the callee.
>>>>>
>>>> The caller has is_fatal_error always available to express his strong
>>>> preference. And that idea has been worked so far. And then let's leave
>>>> it to the caller but release the caller from expressing the soft
>>>> opinion because that soft opinion is just a default.
>>>>
>>> I don't understand this...
>> thd->is_fatal_error cancels temporary meaning.
>> rli->report(error) can be prepared with the strong thd->is_fatal_error
>> opinion so that a temporary error will show up as an error at once.
>>
>> The express the soft opinion is contrary to designate explictly in the
>> caller that the error is temporary.
>> That's what i like to avoid because there are several report() calling
>> places that would need that - duplication of logics.
>>
>> So the patch offers to think about an error with the soft opinion to
>> allow reporting function to shape off the error fully, incl its
>> severity.
>> We have a simple way to change the opinion with thd->is_fatal_error
>> that we do e.g when (rli->trans_retries >= slave_trans_retries).
>>
>>>>
>>>>> I would like you to pass in the correct log-level to the reporting
>>>>> function instead,
>>>>>
>>>> I hope my comments will make it more clear the idea of this patch.
>>>>
>>>>> which will also keep the has_temporary_error()
>>>>> function inside the file and not exposing it outwards.
>>>>>
>>>>>
>>>> You can compare one one-line hunk with what we would have to do -
>>>> finding all case prone to the temprorary errors, duplicate the same dumb
>>>> lines all over the cases, and after that sit and wait then there will
>>>> show up yet another mtr->report() where the caller forgets to check
>>>> the temporary status.
>>>>
>>> The role of the reporting function is to report errors... *any* error,
>>> not just the latest generated error.
>> We need to talk as it is obvious to me it's impossible to report any
>> error. A simplest example can be a case of two errors where the latter
>> is inferred by the first. We never should report the 2nd as the first
>> or instead of the 1st.
>>
>> IMO, it's a reasonable requirement is to report errors immediately.
>>
>>> If you want to change the logic, then information should be collected
>>> so that the risk of incompatible information is minimal. The
>>> has_temporary_error() checks one error, and the error coming through
>>> the parameter list is another error.
>> That should not be possible and unless I miss something is not
>> possible with the current patch.
>>
>> I agree it'd be better to use the only source for the error code THD
>> for instance.
>>
>>> If they don't match, then
>>> something is wrong. Either you pass both items through as parameters
>>> *or* you get all information from the THD instance, but don't try to
>>> mix the two parts of the error, because they may or may not be
>>> compatible.
>>>
>>> *If* you decide to take the information from the THD instance instead
>>> (*both* the has_temporary_error() information and the actual error
>>> code that has_temporary_error() has decided is the temporary error,
>>> then you should make sure to pass the THD structure as parameter since
>>> using "current_thd" is very expensive (it starts digging in the
>>> thread-local store).
>> Even though it does this "digging" how it's expensive?
>>
>> First, not to say errors reporting subject is separate from
>> optimising, but it's not that important.
>>
>> Second, current_thd is used in reporting through my_error() that I
>> find as a good pattern to follow.
>>
>>
>> I hope we can talk and come to consensus about this discussion's
>> topics.
>>
--
Mats Kindahl
Senior Software Engineer
Database Technology Group
Sun Microsystems