Lars, Rafal,
This is my reply to Rafal's comments to my comments to his comments...
Items 1-10 were under the 'REQUESTS' section of his latest review and
11-13 were under 'SUGGESTIONS' section.
I think this is ready to go as I have complied with all 10 of his
demands. I would suggest you read through this and if you agree, please
either allow me to push the patch as-is or perhaps have Jorgen review
the patch and emails to see if he concurs that I have met the demands of
the reviewers.
Chuck
Rafal Somla wrote:
> Hi Chuck,
>
> Please reconsider your decisions. See my comments below.
>
> Charles Bell wrote:
>> Rafal,
>>
>> This is a formal reply to your first review. I have a new patch ready
>> and will submit that soon for review.
>>
>> I have also asked that this be put on the backup decision meeting
>> agenda in anticipation of your rejection of my replies.
>>
>> Note: I have implemented all of Jorgen's requests and suggestions
>> where possible.
>>
>>> REQUESTS
>>> --------
>>> 1. Use a different method for testing errors during insertion of
>>> various object types into the catalogue (ER_BACKUP_CATALOG_ADD_*) and
>>> when restoring objects (ER_BACKUP_CANT_RESTORE_*/ER_BACKUP_GET_META_*).
>>
>> I see your request as an alternative implementation that is actually
>> more complex that my original patch and not a case where the code is
>> wrong. Your version is more complex because the three areas where this
>> occurs will mean three copies of something very similar to what I have
>> with sync points instead of debug execute if. Furthermore, your
>> version will make the test cases much more complex. Therefore, I must
>> reject this alternative as it does not invalidate nor improve the
>> original patch.
>>
>> STATUS: alternative implementation is rejected
>>
>
> It is a pity I think, but I accept your decision as the solution you
> proposed will work. In that case I replace my request 1 by two other
> requests:
>
> 1a. Guard the added extra debug code (the one which sets the name of the
> error insertion point in str variable) with appropriate #ifdef so that
> it is not included in production code.
>
I think this is extreme overkill and evidence that you do not trust the
code I have written. I include the #ifdef as you demand but under
extreme protest.
STATUS: Complied.
> 1b. Replace this fragment:
>
> + if (str.ptr())
> + {
> + DBUG_EXECUTE_IF(str.ptr(),
> + m_log.report_error(error, db.name().ptr(), obj->get_name()->ptr());
> + return NULL;);
> + }
>
> with:
>
> + if (str.ptr())
> + {
> + DBUG_EXECUTE_IF(str.ptr(), res= get_dep_node_res::ERROR;);
> + }
>
> so that the error handling branch in the original code is entered.
>
STATUS: complied
>>> 2. Do not set Backup_pump::m_name member upon error insertion - it
>>> should be already correctly initialized.
>>
>> As I have explained, this is so that the tests will be deterministic.
>> Reminder: This form of testing can only show the error handling for a
>> given error is possible, not that it is 100% error free.
>>
>> Since you do not believe me that the data I have masked is
>> non-deterministic, consider this modification:
>>
>> (data_backup.cc @ 1279)
>>
>> from:
>>
>> // Error code insertion
>> DBUG_EXECUTE_IF("ER_BACKUP_PREPARE_DRIVER",
>> res= ERROR; m_name= "Default";);
>>
>> to:
>>
>> // Error code insertion
>> DBUG_EXECUTE_IF("ER_BACKUP_PREPARE_DRIVER", res= ERROR;);
>>
>> As recorded without m_name change:
>>
>> # Show warning/error from progress and history log
>> # if they exist.
>> #
>> SELECT * FROM mysql.backup_progress WHERE error_num <> 0;
>> backup_id object error_num notes
>> # #### Default backup driver can't create its validity point
>> PURGE BACKUP LOGS;
>>
>> And after several runs, this happens. Ooops, who'da thunk it...it
>> really is non-deterministic!
>>
>> CURRENT_TEST: backup.backup_errors_debug_2
>> ---
>> d:/source/bzr/mysql-6.0-wl-4722/mysql-test/suite/backup/r/backup_errors_deb
>>
>> g_2.result 2009-08-20 21:44:42.337506400 +0300
>> +++
>> d:\source\bzr\mysql-6.0-wl-4722\mysql-test\suite\backup\r\backup_errors_deb
>>
>> g_2.reject 2009-08-20 21:45:39.293106400 +0300
>> @@ -53,7 +53,7 @@
>> # Execute backup
>> #
>> BACKUP DATABASE backup_test TO 'backup_test_err.bak';
>> -ERROR HY000: Default backup driver can't create its validity point
>> +ERROR HY000: MyISAM backup driver can't create its validity point
>> #
>> # Test case for error ER_BACKUP_CREATE_VP PASSED.
>> #
>> @@ -63,7 +63,7 @@
>> #
>> SELECT * FROM mysql.backup_progress WHERE error_num <> 0;
>> backup_id object error_num notes
>> -# #### Default backup driver can't create its
>> validity point
>> +# #### MyISAM backup driver can't create its validity
>> point
>> PURGE BACKUP LOGS;
>>
>> You may believe it is a bug and if so you're welcome to open a bug
>> report but I do not agree and see this as merely expected behavior. I
>> have explained this in another email so I will not rehash that here.
>>
>> STATUS: rejected
>>
>
> Thanks for detailed explanation.
>
> Even if the non-determinism is justified by the way the code works, the
> test can be easily made deterministic using:
>
> --replace_regex /(Default|MyISAM|...) backup driver/<name> backup driver/
>
> in front of the select. Thus I insist that the setting of
> Backup_pump::m_name upon error injection is removed and the test is
> modified instead.
>
I have made the change as you demand albeit with extreme prejudice as I
believe this is a case where the reviewer is usurping his role.
STATUS: Complied
>>> 3. Remove driver list cutting during error insertion in
>>> restore_table_data() or explain why it is needed.
>>
>> I do this to avoid the non-deterministic problem.
>>
>> STATUS: rejected
>>
>
> The error injection you propose
>
> > + DBUG_EXECUTE_IF("ER_BACKUP_CREATE_RESTORE_DRIVER", + //
> Only process the first driver (0) for debug test
> > + active[1]= 0; + drv[1]= 0;
>
> makes the error code in restore_table_data() (data_backup.cc:1749) not
> work as expected. In particular the loop which cancels other drivers
> than the failing one is not fully executed. Here is the affected code:
>
> error:
>
> state= ERROR;
>
> DBUG_PRINT("restore",("Cancelling restore process"));
>
> // Call cancel() for all active drivers
>
> for (uint n=0; n < info.snap_count(); ++n)
> {
> if (!active[n])
> continue;
>
> DBUG_PRINT("restore",("Cancelling restore driver %s",
> info.m_snap[n]->name()));
> res= active[n]->cancel();
> if (res)
> {
> if (!bad_drivers.is_empty())
> bad_drivers.append(",");
> bad_drivers.append(info.m_snap[n]->name());
> }
> }
>
> I think it is not good to change the normal execution path so
> drastically, as we can miss problems which could occur there. We should
> test the error handling code as it is, not an artificial modification of
> it.
>
> I assume that you are right about inherent non-determinism in the order
> in which backup/restore drivers are created and listed in the catalogue.
> In that case, I think this non-determinism should be removed in the
> test, by e.g. result masking.
>
STATUS: Complied
>>> 4. Directly trigger error when testing for
>>> ER_BACKUP_STOP_RESTORE_DRIVERS.
>>
>> This does *not* work. I have changed the code exactly as you say and
>> the test case fails. That is because we are simulating an error here
>> and the normal path of execution does not enter the section you
>> requested I move the code to.
>>
>> STATUS: rejected
>>
>
> You did not explain what and how fails. If it fails it looks suspicious
> - probably an error in the code. Note that what I proposed is doing the
> following error injection in the above driver cancelling loop:
>
> for (uint n=0; n < info.snap_count(); ++n)
> {
> if (!active[n])
> continue;
>
> DBUG_PRINT("restore",("Cancelling restore driver %s",
> info.m_snap[n]->name()));
> res= active[n]->cancel();
> DBUG_EXECUTE_IF("trigger_error", res= TRUE;); <--- PROPOSED INJECTION
> if (res)
> {
> if (!bad_drivers.is_empty())
> bad_drivers.append(",");
> bad_drivers.append(info.m_snap[n]->name());
> }
> }
>
> Thus it simulates exactly the situation when driver's cancel() method
> signals error, in which case ER_BACKUP_STOP_RESTORE_DRIVERS should be
> reported. If this does not work, then we have a bug in the code - it
> should work!
>
> Thus I insist that we do not go around this problem, as your original
> patch proposed, but use this most direct method of triggering the error
> and investigate why it does not work (report a bug if there is a problem).
>
This will not work, Rafal. I don't how many times I must explain that
your requested changes place the code in a situation where the code in
question is never reached. I forced the issue under protest by inserting
additional debug code to code around the code to comply with your
request. I object to the use of conjecture in a review.
STATUS: Complied after a fashion
>>> 5. Also trigger direct error in send_reply() method
>>> (ER_BACKUP_SEND_REPLY).
>>
>> Done. BUG#46594 set to 'not a bug'.
>>
>> STATUS: code changed
>>
>>> 6. Make sure that errors are simulated directly after a call to
>>> function reporting error, so that all code following it is executed.
>>
>> Done.
>>
>> STATUS: code changed
Status: We are agreeing on this point.
>>
>>> 7. Remove extra variable if a single one can be used.
>>
>> Done.
>>
>> STATUS: code changed
>>
Status: We are agreeing on this point.
>>> 8. When triggering ER_BACKUP_RESTORE_2 do not change type of the
>>> operation stored in the context class.
>>
>> Done.
>>
>> STATUS: code changed
>>
Status: We are agreeing on this point.
>>> 9. Remove assertion where error should be reported
>>> (ER_BACKUP_WRONG_TABLE_BE).
>>
>> I have made the change as you requested and it does *not* work. In the
>> case of snap= NULL, the server crashes. In the case of snap_num + 1
>> the test fails.
>>
>> STATUS: rejected
>>
>
> The change I proposed looks as follows (around kernel.cc:1834):
>
> DBUG_PRINT("restore",(" table's snapshot no. is %d", it->snap_num));
>
> DBUG_EXECUTE_IF("trigger_error", it->snap_num= <wrong number>); <---
>
> HERE
> Snapshot_info *snap= info->m_snap[it->snap_num];
>
> if (!snap)
> {
> /*
> This can happen only if the snapshot number is too big - if we
> failed
>
> As already explained, if this does not work, then we have a problem with
> error detection here. In case it->snap_num is wrong (e.g. backup image
> corruption), the info->m_snap[it->snap_num] should be NULL and the error
> should be reported.
>
> I insist that we test it here as suggested and report bug if it does not
> work, instead of injecting code which pretends that everything is ok:
>
>
> + // Debug insertion for ER_BACKUP_WRONG_TABLE_BE.
> + // We must use debug insertion to avoid assertion
> + DBUG_EXECUTE_IF("ER_BACKUP_WRONG_TABLE_BE", +
> log.report_error(ER_BACKUP_WRONG_TABLE_BE, it->snap_num + 1);
> + return BSTREAM_ERROR;);
>
With this latest suggestion, I got it to work.
STATUS : Complied
>>> 10. Remove error insertion for ER_BACKUP_INTERRUPTION since it is
>>> already extensively tested in backup_interruption test from
>>> backup_engines suite and this without using error injection.
>>
>> There are no tests that test this error message or its conditions. I
>> have searched the latest tree and there is also not test result that
>> contains this error message.
>>
>> STATUS: rejected
>>
>
> As I have said already twice, this is tested in backup_interruption test
> from backup_engines suite. The reason why you do not see
> ER_BACKUP_INTERRUPTION mentioned in test and result files is that this
> message is reported with log_level::INFO, and such messages are written
> only to the server error log - never reported by SQL statement or pushed
> on the error stack (see Logger::write_message() method in logger.cc).
> You can check with code coverage tool that the line in
> Logger::report_killed() which reports this error is executed during
> backup_interruption test (or simply look at the server error log after
> executing the test).
>
> Since this particular situation and error message is tested extensively,
> I think the error injection you proposed should be removed to not
> pollute the code and keep injections at minimum.
>
Ok, if you insist but I have executed said test and the code does *not*
reach this error condition.
STATUS: Complied
>>> SUGGESTIONS
>>> -----------
>>> 11. Whenever possible, test the original error branch, not a copy of it.
>>
>> Your specific example for the binlog test still crashes the server.
>> See BUG#46574.
>>
>> Your specific example for ER_BACKUP_UNEXPECTED_DATA will not work
>> because, again, we are executing through the normal non-error portion
>> of the code and your request has me move the debug insertion to inside
>> an error condition. This will not work and the test case fails.
>>
>> STATUS: rejected
>>
>
> I don't understand why it should not work. The injection which I propose
> looks as follows:
>
> /*
> If there are no tables stored in the image, there is nothing to do
> in this
> function. However, we must call bstream_rd_data_chunk() which will
> absorb
> the 0x00 byte signalling end of (the empty) table data chunk sequence.
> */
> if (info.snap_count() == 0 || info.table_count() == 0) // nothing to
> restore
> {
> int res= bstream_rd_data_chunk(&s, &chunk_info);
> DBUG_EXECUTE_IF("inject_error", res= BSTREAM_OK;); <---- HERE
> if (res != BSTREAM_EOC)
> {
> info.m_log.report_error(res == BSTREAM_ERROR ?
> ER_BACKUP_READ_DATA :
> ER_BACKUP_UNEXPECTED_DATA);
> DBUG_RETURN(ERROR);
> }
>
> This is the most direct way to trigger the situation where
> ER_BACKUP_UNEXPECTED_DATA is reported. Of course, to get to this
> situation you need to restore an image without tables, so that the whole
> branch is entered. Only in such situation the error should be reported.
>
Well, you didn't say it needed to be an empty backup and it was not
clear from the code (but it is now) that that was the case. I have made
the change and it works now.
STATUS: complied
>>> 12. Other minor suggestions.
>>
>> All were changed except the first which forced the server to crash and
>> the third which I do not consider relevant.
>
> I would rather investigate the crash - is it a bug in the error handling
> code?
>
No crash now that I've pulled latest code.
STATUS: complied
>>
>> STATUS: partially accepted
>>
>>> 13. Make include/basic_data.inc database independent so that it can
>>> be easily used from other tests.
>>
>> This is not so easy to do. Consider the event, trigger, procedure,
>> function, and grant statement. Parameterizing this would require extra
>> work. I choose to leave it as it is.
>>
>
> Ok, good to know. With this the situation looks much better.
>
STATUS: Unknown. It seems like Rafal is ok with me not implementing this
suggestion.