From: Date: June 17 2009 9:27pm Subject: Re: bk commit into 5.1 tree (aelkin:1.2612) BUG#36524 List-Archive: http://lists.mysql.com/commits/76513 Message-Id: <87tz2ewthx.fsf@mysql1000.dsl.inet.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE Mats, hello. You are still formally a reviewer for the bug. Since the last time w= e discussed the issue and my idea of fixes it got a customer. I have analyzed my last changeset once again to be utmost confident in the idea implemented in http://lists.mysql.com/commits/46451 Let me put it all once again to avoid spawning yet more our last year thread. THE PROBLEM We have overreaching error messages in the slave's server error log. The message is either Deadlock or Timeout in waiting for a lock at applying a replication event. It is not really a problem but rathe= r transient issue and should not be noticed by the user. \footnote{The temporary status changes to the regular once the slave has re-tried to apply the failing events unsuccessfully for rli->trans_retries =3D=3D slave_trans_retries times} THE OFFERED SOLUTION To avoid logging an error having temporary status the reporting metho= d rli->report(error description args)=20 should call has_temporary_error() to sort out the actual status. If the error is temporary it won't force an error line in the error log. THE IMPLEMENTATION rli->report() demotes the error's severity to the warning level basing on has_temporary_error() report. has_temporary_error() becomes available in rpl_reporting.cc as well still to be executed exclusively by the SQL thread. The temporary to the regular status promotion is achieved as simple a= s with raising up thd->is_fatal_error flag. BEING DISCUSSED ISSUES has_temporary_error() is not necessary to be precise in detection of the temporary status. The function is rather optimistic to give a chance to re-apply (uselessly) an event e.g in a case of the last applying cycle produced a sequence of errors with the temporary in the end: a-reg-err_1, ..., a-temp-err_N. This optimistic feature is irrelevant to the idea of the solution. It does not make the SQL slave not to notice any errors and when it eventually stops all errors and warnings are printed out. A technical issue to change signature of Slave_reporting_capability::report to accept THD* as an arg: rli->report(..., THD *thd, ...) I would do that, still in a separate patch. I would be happy to hear your opinion. cheers, Andrei PS the very last mail I replied with to your comments is attached. > Mats, hi. > >>>>> void >>>>> Slave_reporting_capability::report(loglevel level, int err_cod= e, >>>>> const char *msg, ...) const >>>>> @@ -13,6 +27,8 @@ Slave_reporting_capability::report(logle >>>>> va_list args; >>>>> va_start(args, msg); >>>>> + if (level =3D=3D ERROR_LEVEL && has_temporary_error(current= _thd)) >>>>> + level=3D WARNING_LEVEL; >>>>> =20 >>> >>> =20 >>>> Why do you put the logic here? Isn't the what log level should b= e used >>>> a decision to be made by the caller? This seems to put to many >>>> features into the function, which previously had the role of jus= t >>>> reporting an error at a certain level. >>>> =20 >>> >>> When I made the first patch where >>> Rows_log_event,Query_log_event::do_apply_event >>> composed the correct arg for rli->report(), >>> I had thought there would be just two occurances where rli->repor= t() >>> needs to be careful. >>> But overnight I realized a simple thing that load data is prone t= o the >>> timeout, and moreover, there is the general problem. >>> >>> Anywhere in the slave sql thread execution there might happen a s= ituation >>> that we would not like to report as an error for some time. >>> As there are many sources I think it's wiser to implement a gener= al >>> mechanism. >>> =20 >> >> I don't understand what you're saying here... > > The paragraph is trying to say what is implemented: > > whenever there is a=20 > rpl->report(error)=20 > invocation there is a possibility that the error is a temporary. > Hence, its reporting should take of that. > > The current scheme of reporting > where=20 > =20 > rli->report() > > in the middle of executing an event > follows with the final too late care of whether that was a > temporary error is vicious. > > Here is a suggestion how it can be improved: > >>> The mechanism that can block reporting and resume it later could = be as >>> simple as the patch offers, which is in >>> >>> 1. check if has_temporary_error() and block (demote) the error fr= om >>> showing up as an error; >>> >>> 2. raise is_fatal_error as soon as there is certainty that a bloc= ked >>> error reporting needs resuming. >>> > > As you did not comment on 1,2 I presume you agree conceptually. > >>>> Why is this a problem? =20 >>> >>> I hope you see my concern to avoid duplication of the same logics= all >>> around the code (mentioned above as done in the first patch). >>> =20 >> >> Duplicating the logic might be an issue, but changing the report() >> function for this purpose strikes me as fixing the problem at the >> wrong place. > >> Is there any particular reason to why you don't construct a specia= l >> function for this file only with the functionality you want? >> That way, >> you avoid code duplication and the changes are kept local to the f= ile, >> which eliminates the risk that you are changing behavior of code t= hat >> should not be affected by this change. > > A special function would be a wrapper over rli->report() if I > understand your advice correctly. > But then, it'd be of dubious advantange as we would extend error > reporting interface. > > Considering what rli->report() should do is a well good subject to > discuss. > As I see it, rli->report() provide a similar feature as my_error() = does. > There is only one essencial difference though. > An error of a user session can not be temporary. > But the rest is conceptaully the same so that I think eventually > rli->report() should be replaced with my_error(). > > Our task in that case would be to define a necessary > > int (*error_handler_hook)(uint my_err, const char *str,myf MyFl= ags); > > hooks for IO, SQL threads. > > We don't need any storage that Slave_reporting_capability provides, > because there is thd->main_da.sql_errno etc available. > > Considering the current DEADLOCK temp error and imagining for a sec= ond > if there was no rli->report() I would call > > my_error(error)=20 > > and hooks, I'd have to prepare/register, would call has_temporary_e= rror() to > avoid ERROR line showing up. > > >> The other option is to get all >> the information from the THD instance instead of part of the error= as >> parameter and part from the THD instance. > > Have you meant to pass THD as an argument to rli->report() and to u= se > thd->main_da ? > >>> =20 >>>> This will make all errors reported through this >>>> interface be demoted provided the current thread has a temporary >>>> error, regardless of what is actually being reported. >>>> =20 > > Need to speak to you, as sorry, I am lost in guesses. > >>> >>> It should not. Let's look at your cases. >>> >>> =20 >>>> This can cause >>>> two strange situations, to a na=EFve user of the reporting funct= ion: >>>> >>>> 1. A different error than the error present for the current th= read is >>>> passed to the reporting function. This can be a decision on= the >>>> callers behalf to regard a certain temporary error as a fat= al >>>> error, but the reporting function will demote this error to= a >>>> warning level just because the caller has "forgotten" to se= t the >>>> "fatal error" flag. >>>> =20 >>> >>> That's a good question. >>> >>> In other words, if there had been a temporary error and after tha= t >>> the 2nd error that is not meant to be temporary. Is that the >>> concern? >>> =20 > >> >> No, that is not the case I'm describing. The report() function can= be >> used to report any error, not just one that has been generated. >> > > I hope we must see it in agreement that=20 > report() is an immediate reaction on an error situation. > There can not be two pending errors to report. > We report them as soon as they are generated. > It's reasonable and basicly the current non-perfect code follows th= at. > > This means indeed, report() can report only immediate, just has bee= n > generated error, and nothing else. > > If there is disagreemnt we will need to talk that over. > >>> The current handling in rli->report() would respect >>> the current error, coming as an argument, so that the 2nd error >>> remains a legal error as long as if has_temporary_error() confirm= s >>> that. >> >> ... but what I'm describing is a case where has_temporary_error() >> looks at the last generated error and not at the error that was >> supplied to the reporting function. > > Again, the assumption above is once an error situation happened it > must be reported.=20 > At reporting time properties of the error have to > be shaped off, and so the patch does calling has_temporary_error(). > >> >>> The latter function is interested only in the latest error. >>> =20 >> >> Yes, but that might not be what is passed to report(). >> >>> This observation tells me that has_temporary_error() is combinabl= e >>> with rli->report() that also deals only with the latest. >>> =20 >> >> Again, this is not the scenario I describe. You are changing the >> behavior of the report() function to *all* users, not just the cas= e >> you are trying to "fix". > > Right, for *all* because there are many already. But (*many* - all) > the rest is not affected: if we had an event and an error which wer= e > not listed in has_temporary_error() there would be no change. > >> >>> =20 >>>> 2. It is necessary to add code to prevent the error message to= be >>>> repeated ad-nauseum, =20 >>> >>> there was no such intention, I change only for demotion. But log= ics >>> of rli->report() can be extended to add such a rule. Again, it wo= uld >>> be an advantage to do that in the method not on the caller level. >>> =20 >> >> I cannot see any advantage. What would the advantage be? > > The only advantage is that we catch temporary errors correctly on t= ime > and in the single central location - inside of report(). > Making rli->report() just a bit more intelligent is not a big deal. > > The above might be a temproray solution, but I worder if you will > share my ideal world view. > > Imo logics should be like this: > > an error situation happens, > my_error(error) is called, > (*error_handler_hook) having access to execution context figures ou= t > what and how to report. > >> >>> =20 >>>> which means that part of the logic for >>>> deciding if this should be a warning is in the caller and p= art is >>>> inside the reporting function. This split just introduces a= n >>>> unnecessary coupling between the caller and the callee. >>>> =20 >>> >>> The caller has is_fatal_error always available to express his str= ong >>> preference. And that idea has been worked so far. And then let's = leave >>> it to the caller but release the caller from expressing the soft >>> opinion because that soft opinion is just a default. >>> =20 >> >> I don't understand this... > > thd->is_fatal_error cancels temporary meaning. > rli->report(error) can be prepared with the strong thd->is_fatal_er= ror > opinion so that a temporary error will show up as an error at once. > > The express the soft opinion is contrary to designate explictly in = the > caller that the error is temporary. > That's what i like to avoid because there are several report() call= ing > places that would need that - duplication of logics. > > So the patch offers to think about an error with the soft opinion t= o > allow reporting function to shape off the error fully, incl its > severity. > We have a simple way to change the opinion with thd->is_fatal_error > that we do e.g when (rli->trans_retries >=3D slave_trans_retries). > >>> =20 >>>> I would like you to pass in the correct log-level to the reporti= ng >>>> function instead, >>>> =20 >>> >>> I hope my comments will make it more clear the idea of this patch= . >>> =20 >>>> which will also keep the has_temporary_error() >>>> function inside the file and not exposing it outwards. >>>> >>>> =20 >>> >>> You can compare one one-line hunk with what we would have to do - >>> finding all case prone to the temprorary errors, duplicate the sa= me dumb >>> lines all over the cases, and after that sit and wait then there = will >>> show up yet another mtr->report() where the caller forgets to che= ck >>> the temporary status. >>> =20 >> >> The role of the reporting function is to report errors... *any* er= ror, >> not just the latest generated error. > > We need to talk as it is obvious to me it's impossible to report an= y > error. A simplest example can be a case of two errors where the lat= ter > is inferred by the first. We never should report the 2nd as the fir= st > or instead of the 1st. > > IMO, it's a reasonable requirement is to report errors immediately. > >> >> If you want to change the logic, then information should be collec= ted >> so that the risk of incompatible information is minimal. The >> has_temporary_error() checks one error, and the error coming throu= gh >> the parameter list is another error.=20 > > That should not be possible and unless I miss something is not > possible with the current patch. > > I agree it'd be better to use the only source for the error code TH= D > for instance. > >>If they don't match, then >> something is wrong. Either you pass both items through as paramete= rs >> *or* you get all information from the THD instance, but don't try = to >> mix the two parts of the error, because they may or may not be >> compatible. >> >> *If* you decide to take the information from the THD instance inst= ead >> (*both* the has_temporary_error() information and the actual error >> code that has_temporary_error() has decided is the temporary error= , >> then you should make sure to pass the THD structure as parameter s= ince >> using "current_thd" is very expensive (it starts digging in the >> thread-local store). > > Even though it does this "digging" how it's expensive?=20 > > First, not to say errors reporting subject is separate from > optimising, but it's not that important. > > Second, current_thd is used in reporting through my_error() that I > find as a good pattern to follow. > > > I hope we can talk and come to consensus about this discussion's > topics. >