Charles Bell wrote:
> Rafal,
>
> General comment/rebut:
>
> All of the places where I hard coded names or other such variables were
> done intentionally. The tests are non-deterministic for these locations
> because there is more than one value for these variables. Thus, given
> enough executions the test would print (for example) 'MyISAM' instead of
> 'Default' (and vice-versa) for the driver name.
>
> Unless you reject this explanation, I shall ignore all such remarks in
> the review i.e. [2].
>
Unless I'm missing something completely, I don't see any good reason why code
should behave non-deterministically in the indicated places. Perhaps
non-determinism could be eliminated by more careful preparation of the testing
environment?
If this is not the case, then I suspect that we might face some subtle bugs here
which should be investigated, not ignored. Great that the improvements to the
patch which I have suggested helped in detecting these issues! In that case it
is even more important that you implement the suggested changes.
> NOTE: I will wait to generate a new patch once you have read my comments.
>
See my replies below.
>> STATUS
>> ------
>> Not approved - please resolve issues listed below.
>>
>> REQUESTS
>> --------
>> 1. Use a different method for testing errors during insertion of
>> various object types into the catalogue (ER_BACKUP_CATALOG_ADD_*) and
>> when restoring objects (ER_BACKUP_CANT_RESTORE_*/ER_BACKUP_GET_META_*).
>
> I experimented with different methods. This is the only one that worked
> reliably. Why is this so bad?
>
1. IMO, one of the points of this coverage testing is to ensure that all error
handling branches in the code are entered. In your solution you do not enter the
error handling branch but rather execute a copy of it which is inserted under
DBUG_EXECUTE_IF(). This is better to be avoided, since when the real error
branch changes but the copy is not updated then we would not detected potential
problems in the modified branch.
2. We should try to make the error insertion code as minimal as possible. Adding
single lines of the form DBUG_EXECUTE_IF("...", res= ERROR) is already intrusive
enough that we had doubts if it is an acceptable testing method. Here you add
new variable and several lines of code to setup the name of the insertion point.
This does not look good and I think it can be avoided by writing slightly more
advanced test using debug sync points, as I described in my previous review.
3. In any case, even if the extra code remains, it should be guarded with DBUG
flag so that it is not included in the production code. I would also insist that
the DBUG_EXECUTE_IF() sets res to ERROR and not repeats the error handling branch.
>> 3. Remove driver list cutting during error insertion in
>> restore_table_data() or explain why it is needed.
>>
>> 4. Directly trigger error when testing for
>> ER_BACKUP_STOP_RESTORE_DRIVERS.
>>
>> 5. Also trigger direct error in send_reply() method
>> (ER_BACKUP_SEND_REPLY).
>>
>> 6. Make sure that errors are simulated directly after a call to
>> function reporting error, so that all code following it is executed.
>>
>> 7. Remove extra variable if a single one can be used.
>>
>> 8. When triggering ER_BACKUP_RESTORE_2 do not change type of the
>> operation stored in the context class.
>>
>> 9. Remove assertion where error should be reported
>> (ER_BACKUP_WRONG_TABLE_BE).
>>
>> 10. Remove error insertion for ER_BACKUP_INTERRUPTION since it is
>> already extensively tested in backup_interruption test from
>> backup_engines suite and this without using error injection.
>>
>> SUGGESTIONS
>> -----------
>> 11. Whenever possible, test the original error branch, not a copy of it.
>> 12. Other minor suggestions.
>> 13. Make include/basic_data.inc database independent so that it can be
>> easily used from other tests.
>>
...
>>> === modified file 'sql/backup/data_backup.cc'
>>> --- a/sql/backup/data_backup.cc 2009-05-21 13:17:37 +0000
>>> +++ b/sql/backup/data_backup.cc 2009-08-09 21:33:03 +0000
...
>>> @@ -1544,6 +1604,14 @@ int restore_table_data(THD *thd, Restore
>>> res= snap->get_restore_driver(drv[n]);
>>> if (log.report_killed())
>>> goto error;
>>> +
>>> + // Error code insertion for ER_BACKUP_CREATE_RESTORE_DRIVER.
>>> + DBUG_EXECUTE_IF("ER_BACKUP_CREATE_RESTORE_DRIVER", + //
>>> Only process the first driver (0) for debug test
>>> + active[1]= 0; + drv[1]= 0;
>>
>> [3] Hmm, why this? It is not what happens in real code...
>
> For reliability. The test is not deterministic otherwise.
>
Why is it not deterministic? Can it be made deterministic by more careful setup
of the test environment? Is not the lack of determinism a sign that we have some
bug here (I see no obvious reason for non-determinism here)?
>>
>>> + res= backup::ERROR;);
>>> +
>>> if (res == backup::ERROR)
>>> {
>>> log.report_error(ER_BACKUP_CREATE_RESTORE_DRIVER, snap->name());
...
>>> === modified file 'sql/backup/kernel.cc'
>>> --- a/sql/backup/kernel.cc 2009-08-07 07:50:49 +0000
>>> +++ b/sql/backup/kernel.cc 2009-08-09 21:33:03 +0000
>> ...
>>> @@ -332,6 +358,10 @@ int send_reply(Backup_restore_ctx &conte
>>> goto err;
>>> }
>>> my_eof(context.thd()); // Never errors
>>> +
>>> + // Error code insertion for ER_BACKUP_SEND_REPLY.
>>> + DBUG_EXECUTE_IF("ER_BACKUP_SEND_REPLY", goto err;);
>>> +
>>
>> [5] Rather, simulate error reply from protocol->write() and/or
>> protocol->store() above. Note that this way my_eof() will not be
>> executed upon error, as is the case in the real code.
>
> Sorry, I tried that and it didn't work. The server crashed on Linux (but
> not Windows or Mac).
>
I think it is a sign that we have a bug here. If this is the case, then it is
unacceptable that we keep a blind eye on that by shaping error insertion in such
a way that the problem does not manifest itself. On the contrary - the whole
point of this patch is to catch such situations!
>>> DBUG_RETURN(0);
>>>
>>> err:
>> ...
>>> @@ -704,6 +738,9 @@ Backup_restore_ctx::prepare_for_backup(S
>>> if (! my_open_status)
>>> m_state= PREPARED_FOR_BACKUP;
>>>
>>
>> [6] Insert error just after s->open() call above, so that the above
>> code executes.
>
> Again, tried that but this causes problems.
Then it is probably a bug which needs to be detected and fixed.
>
>>> + // Error code insertion for ER_BACKUP_READ_LOC.
>>> + DBUG_EXECUTE_IF("ER_BACKUP_READ_LOC_1", my_open_status=
>>> ER_BACKUP_READ_LOC;);
>>> +
>>> if (my_open_status != 0 || is_killed())
>>> {
>>> if (report_killed())
>> ...
>>> @@ -1068,6 +1125,9 @@ void Backup_restore_ctx::unlock_tables()
>>> */ int Backup_restore_ctx::close()
>>> {
>>> + int result= 0;
>>> + int res= 0;
>>
>> [7] Why res and result? Wouldn't one variable be sufficient?
>
> I was trying to avoid corrupting the existing variable and would prefer
> to leave it that way. My intent was to avoid using existing variables as
> much as possible.
>
I think it is strange, but ok, I'll turn this into a suggestion insted.
>>
>>> +
>>> if (m_state == CLOSED)
>>> return 0;
>>>
>>> @@ -1082,8 +1142,14 @@ int Backup_restore_ctx::close()
>>>
>>> using namespace backup;
>>>
>>> + if (m_catalog)
>>> + result= !m_catalog->is_valid();
>>
>> [12] Suggestion: "result= m_catalog ? m_catalog->is_valid() : TRUE;"
>> and then below test for result only.
>>
>>> +
>>> + // Error code insertion for ER_BACKUP_RESTORE.
>>> + DBUG_EXECUTE_IF("ER_BACKUP_RESTORE_2", result= 1; m_type= RESTORE;);
>>
>> [8] I think it is not acceptable to change m_type like this - this
>> should reflect the real type of the operation (BACKUP or RESTORE) -
>> the rest of the code can depend on this. I also don't think it is
>> needed: if you activate this insertion point just before RESTORE
>> operation, then m_type should be RESTORE anyway.
>
> Again, this is for a deterministic result.
>
I don't understand how you can get non-deterministic result here - the value of
m_type is very well determined by the type of the operation you are performing:
BACKUP or RESTORE. If, regardless, you can observe that the value of m_type is
not determined then this is a bug which should be reported.
>>
>>> +
>>> // Move context to error state if the catalog became corrupted.
>>> - if (m_catalog && !m_catalog->is_valid())
>>> + if (m_catalog && result)
>>> fatal_error(m_type == BACKUP ? ER_BACKUP_BACKUP :
>>> ER_BACKUP_RESTORE);
>>>
>>> if (m_catalog && !m_catalog->get_end_time())
...
>>> @@ -1832,6 +1926,12 @@ int bcat_add_item(st_bstream_image_heade
>>>
>>> Snapshot_info *snap= info->m_snap[it->snap_num];
>>>
>>> + // Debug insertion for ER_BACKUP_WRONG_TABLE_BE.
>>> + // We must use debug insertion to avoid assertion
>>> + DBUG_EXECUTE_IF("ER_BACKUP_WRONG_TABLE_BE", +
>>> log.report_error(ER_BACKUP_WRONG_TABLE_BE, it->snap_num + 1);
>>> + return BSTREAM_ERROR;);
>>> +
>>
>> [9] If this error condition happens (info->m_snap[it->snap_num] is
>> NULL) then we either want to crash server (failed assertion) or report
>> an error and continue. I think the second alternative applies and thus
>> the assertion should be removed. Then this error would be tested like
>> all the other, i.e.:
>>
>> DBUG_EXECUTE_IF("ER_BACKUP_WRONG_TABLE_BE", snap= NULL);
>>
>> Even better, one could change it->snap_num to wrong number before the
>> snap definition. I.e.,
>>
>
> This didn't work.
>
>
>> DBUG_EXECUTE_IF("ER_BACKUP_WRONG_TABLE_BE",
>> it->snap_num= info->snap_count()+1;);
>
> I will try that but I do not hold much hope for success.
>
Note that this is the error situation which could happen in the reality and
which the code tries to detect. It is a situation where for some reason (e.g.,
data corruption) the snap_num, which was read from the backup image, has some
insane value which is out of range. In that case the code should stop here and
report error. If this doesn't work, then we have a bug which should be taken
care of.
>>
>>> if (!snap)
>>> {
>>> /*
...
>>> === modified file 'sql/backup/logger.cc'
>>> --- a/sql/backup/logger.cc 2009-03-16 14:38:05 +0000
>>> +++ b/sql/backup/logger.cc 2009-08-09 21:33:03 +0000
>> ...
>>> @@ -252,6 +254,8 @@ bool Logger::report_killed()
>>> if (m_state == CREATED || m_state == READY)
>>> return TRUE;
>>>
>>> +debug_error:
>>> +
>>> // log_error() does not push messages on the server's error stack.
>>> log_error(backup::log_level::INFO, ER_BACKUP_INTERRUPTED);
>>
>> [10] There is extensive test backup_interruption in backup_engines
>> suite which tests triggering this error in various situations.
>
> Not according to my research. Can you specify the exact test where this
> is so?
>
I have done it above. This is the backup_interruption test in backup_engines
test suite.
>>> === modified file 'sql/backup/stream.cc'
>>> --- a/sql/backup/stream.cc 2009-05-19 20:20:54 +0000
>>> +++ b/sql/backup/stream.cc 2009-08-09 21:33:03 +0000
>> ...
>>> @@ -435,7 +438,13 @@ bool Output_stream::init()
>>> an ASSERT and do an explicit cast from size_t.
>>> */
>>> DBUG_ASSERT(m_block_size <= ~((unsigned long)0));
>>> - if (BSTREAM_OK != bstream_open_wr(this, (unsigned
>>> long)m_block_size, len))
>>> +
>>> + int res= bstream_open_wr(this, (unsigned long)m_block_size, len);
>>> +
>>> + // Error code insertion for ER_BACKUP_OPEN_WR.
>>> + DBUG_EXECUTE_IF("ER_BACKUP_OPEN_WR", res= !BSTREAM_OK;);
>>
>> [12] Please change to "res= BSTREAM_ERROR"
>
> I needed another value as the code operates on the other values. I
> needed to simulate an error -- that's why it was added.
>
I don't understand. Function bstream_open_wr() signals errors by returning
BSTREAM_ERROR value and this is what you want to simulate here. After all, the
only thing which matters here is that the "if (res != BSTREAM_OK)" branch is
entered. Thus any value != BSTREAM_OK would do, but BSTREAM_ERROR has the
advantage of being the value which can occurr in reality whie !BSTREAM_OK not.
Anyway, this was just a suggestion so you can ignore it if you see a good reason
for this.
>>
>>> +
>>> + if (res != BSTREAM_OK)
>>> {
>>> m_log.report_error(ER_BACKUP_OPEN_WR);
>>> return FALSE;
Rafal