Mats, hello.
> Hi Andrei!
>
> Comments inline below.
>
Let's talk it over.
> /Matz
>
> Andrei Elkin wrote:
>> Below is the list of changes that have just been committed into a local
>> 5.1 repository of aelkin. When aelkin does a push these changes
>> will be propagated to the main repository and, within 24 hours after the
>> push, to the public repository.
>> For information on how to access the public repository
>> see http://dev.mysql.com/doc/mysql/en/installing-source-tree.html
>>
>> ChangeSet@stripped, 2008-05-07 15:49:18+03:00, aelkin@stripped +3 -0
>> Bug #36524 incident of deadlock on slave is overdramatized
>> In cases of temporary errors in replication event execution
>> (deadlock, timeout)
>> the error log gained an overreacting error message whereas just a warning would
> be fine.
>> Fixed with checking of the temporary status of the error
>> inside Slave_reporting_capability::report() to demote the error to
>> the warning level if needed.
>> The warning gets converted into the error whenever number of successive
> attempts to re-apply
>> the event(s) gets equal to the global server variable slave_trans_retries.
>>
>> sql/rpl_reporting.cc@stripped, 2008-05-07 15:49:16+03:00,
> aelkin@stripped +16 -0
>> putting burden of decision whether to report the message as the error or
>> as the warning to report() method.
>>
>> sql/slave.cc@stripped, 2008-05-07 15:49:16+03:00, aelkin@stripped +4
> -1
>> exporting the function to be used in log_event.cc;
>> promoting a temporary error to the fatal level so that
>> has_temporary_error() won't return TRUE if is_fatal_error is ON.
>>
>> sql/slave.h@stripped, 2008-05-07 15:49:16+03:00, aelkin@stripped +1
> -0
>> the function starts to be shared by log_event.cc as well.
>>
>> diff -Nrup a/sql/rpl_reporting.cc b/sql/rpl_reporting.cc
>> --- a/sql/rpl_reporting.cc 2007-06-11 23:15:28 +03:00
>> +++ b/sql/rpl_reporting.cc 2008-05-07 15:49:16 +03:00
>> @@ -2,6 +2,20 @@
>> #include "mysql_priv.h"
>> #include "rpl_reporting.h"
>> +/**
>> + The method composes a message for the error log and/or
>> + for future show slave status exposure.
>> +
>> + @param level designate seriousity of the message.
>> + @param err_code the user level error code
>> + @param msg the format string of the message,
>> + can be followed with more arguments
>> + providing the data for the format string
>> +
>> + @note level can be corrected inside. E.g when the error
>> + appears to be a temprorary the level is demoted
>>
> typo "temprorary"
ok
>
>> + to WARNING_LEVEL
>> + */
>> void
>> Slave_reporting_capability::report(loglevel level, int err_code,
>> const char *msg, ...) const
>> @@ -13,6 +27,8 @@ Slave_reporting_capability::report(logle
>> va_list args;
>> va_start(args, msg);
>> + if (level == ERROR_LEVEL && has_temporary_error(current_thd))
>> + level= WARNING_LEVEL;
>>
>
> Why do you put the logic here? Isn't the what log level should be used
> a decision to be made by the caller? This seems to put to many
> features into the function, which previously had the role of just
> reporting an error at a certain level.
When I made the first patch where
Rows_log_event,Query_log_event::do_apply_event
composed the correct arg for rli->report(),
I had thought there would be just two occurances where rli->report()
needs to be careful.
But overnight I realized a simple thing that load data is prone to the
timeout, and moreover, there is the general problem.
Anywhere in the slave sql thread execution there might happen a situation
that we would not like to report as an error for some time.
As there are many sources I think it's wiser to implement a general
mechanism.
The mechanism that can block reporting and resume it later could be as
simple as the patch offers, which is in
1. check if has_temporary_error() and block (demote) the error from
showing up as an error;
2. raise is_fatal_error as soon as there is certainty that a blocked
error reporting needs resuming.
>
> Why is this a problem?
I hope you see my concern to avoid duplication of the same logics all
around the code (mentioned above as done in the first patch).
> This will make all errors reported through this
> interface be demoted provided the current thread has a temporary
> error, regardless of what is actually being reported.
It should not. Let's look at your cases.
> This can cause
> two strange situations, to a naïve user of the reporting function:
>
> 1. A different error than the error present for the current thread is
> passed to the reporting function. This can be a decision on the
> callers behalf to regard a certain temporary error as a fatal
> error, but the reporting function will demote this error to a
> warning level just because the caller has "forgotten" to set the
> "fatal error" flag.
That's a good question.
In other words, if there had been a temporary error and after that
the 2nd error that is not meant to be temporary.
Is that the concern?
The current handling in rli->report() would respect
the current error, coming as an argument, so that the 2nd error
remains a legal error as long as if has_temporary_error() confirms
that. The latter function is interested only in the latest error.
This observation tells me that has_temporary_error() is combinable
with rli->report() that also deals only with the latest.
> 2. It is necessary to add code to prevent the error message to be
> repeated ad-nauseum,
there was no such intention, I change only for demotion. But logics
of rli->report() can be extended to add such a rule. Again, it would
be an advantage to do that in the method not on the caller level.
> which means that part of the logic for
> deciding if this should be a warning is in the caller and part is
> inside the reporting function. This split just introduces an
> unnecessary coupling between the caller and the callee.
The caller has is_fatal_error always available to express his strong
preference. And that idea has been worked so far. And then let's leave
it to the caller but release the caller from expressing the soft
opinion because that soft opinion is just a default.
>
> I would like you to pass in the correct log-level to the reporting
> function instead,
I hope my comments will make it more clear the idea of this patch.
> which will also keep the has_temporary_error()
> function inside the file and not exposing it outwards.
>
You can compare one one-line hunk with what we would have to do -
finding all case prone to the temprorary errors, duplicate the same dumb
lines all over the cases, and after that sit and wait then there will
show up yet another mtr->report() where the caller forgets to check
the temporary status.
>> switch (level)
>> {
>> case ERROR_LEVEL:
>> diff -Nrup a/sql/slave.cc b/sql/slave.cc
>> --- a/sql/slave.cc 2008-03-31 11:55:42 +03:00
>> +++ b/sql/slave.cc 2008-05-07 15:49:16 +03:00
>> @@ -1780,7 +1780,7 @@ int check_expected_error(THD* thd, Relay
>> that the error is temporary by pushing a warning with the error code
>> ER_GET_TEMPORARY_ERRMSG, if the originating error is temporary.
>> */
>> -static int has_temporary_error(THD *thd)
>> +int has_temporary_error(THD *thd)
>> {
>> DBUG_ENTER("has_temporary_error");
>> @@ -2121,10 +2121,13 @@ static int exec_relay_log_event(THD* thd
>> }
>> }
>> else
>> + {
>> sql_print_error("Slave SQL thread retried transaction %lu time(s) "
>> "in vain, giving up. Consider raising the value of "
>> "the slave_transaction_retries variable.",
>> slave_trans_retries);
>> + thd->is_fatal_error= 1; // to force has_temporary_error() return 0
>> + }
>> }
>> else if (exec_res && !temp_err ||
>> (opt_using_transactions &&
>> diff -Nrup a/sql/slave.h b/sql/slave.h
>> --- a/sql/slave.h 2008-03-29 14:19:49 +02:00
>> +++ b/sql/slave.h 2008-05-07 15:49:16 +03:00
>> @@ -190,6 +190,7 @@ void set_slave_thread_default_charset(TH
>> void rotate_relay_log(Master_info* mi);
>> int apply_event_and_update_pos(Log_event* ev, THD* thd, Relay_log_info* rli,
>> bool skip);
>> +inline int has_temporary_error(THD *thd);
>>
>
> It is pointless to *declare* a function inline when it doesn't have a
> definition. Remove the keyword, it just adds confusion.
Ops, thanks for this remark! I guess a copy/paste happens to me...
>
>> pthread_handler_t handle_slave_io(void *arg);
>> pthread_handler_t handle_slave_sql(void *arg);
>>
>>
>
>
cheers,
Andrei