Hi Oystein,
After a very long while, I managed to update this patch to the current tree
*and* make it work. Sorry for the delay. Also thanks for your valuable remarks
which helped me to improve the patch. I reply to them below.
The new patch is here: <http://lists.mysql.com/commits/66984>. Please let me
know if you consider it good for pushing.
Øystein Grøvlen wrote:
> Rafal,
>
> Øystein Grøvlen wrote:
> > Rafal,
> >
> > I am trying to apply your patch, but I am not able to figure out which
> > revision it is on based on. I reverted back to the team tree at the
> > time of your commit, but still get conflicts. Would it be possible for
> > you to generate a new patch?
>
> I guess this email got lost in the backup camp heat. It will be
> pretty difficult to verify that kill checks are not missing without
> being able to apply the patch. Do you know which base version to use
> in order to succesfully apply the patch?
>
> I have so far gone through the code part (not tests) of the patch.
> Unfortunately, I seem to have lost my notes while travelling
> yesterday, but here are some main points that I remember from the top
> of my head (I will try to reconstruct more details later):
>
> - I am not quite sure it is necessary to check for killed status as
> frequently as this patch does. Some primitive grepping indicates
> that there will be more than 5 times as many checks per line of
> code in the backup directory as in the sql directory. Quite a few
> of the checks are after calls that are not very likely to take much
> time, but I also understand the advantage of using a general rule
> with respect to placement of checks, and not assume anything about
> the underlying operations.
I think, with very few exceptions, the general rule is that check is made after
calls to external code (or methods/functions which potentially call external
code). For example, in data_backup.cc, checks are made after:
- calls to server core functionality:
block_commits(thd, NULL);
unblock_commits(thd)
- calls to backup/restore drivers (via driver API):
drv->begin()
drv->prepare()
drv->send_data(buf)
drv->end()
eng->get_restore_driver(drv[n])
new Scheduler::Pump(*i, s) // because it calls get_restore_driver()
- calls to backup stream library:
bstream_rd_data_chunk(&s, &chunk_info)
And there are few other places:
- in Scheduler::step() - because this method is called repeatedly and it is
convenient to place check there. In particular the check will be done after
each drv->get_data() call.
- at the beginning of sync phase - this is a bit redundant, just to make sure
that we don't start syncing if process is interrupted.
- after the sync phase - this detects interruptions which might happen during
synchronization phase, e.g., during drv->lock() and drv->unlock() calls.
As you wrote, the idea is to not make any assumptions about the external code. I
consider it worth making a couple of possibly redundant checks.
>
> - I feel that some of the checks are a bit inconsistent. As far as I
> recall from the WL, the idea was to check when other modules are
> called and when possibly time consuming operations within backup is
> done. However, it seems that some times the checking is done at a
> higher level in the backup hierarchy instead when external methods
> are called. That is a bit confusing since one will have to check
> the caller in order to determine whether it is OK that a check is
> missing after a call to an external method.
Yes, but my idea was also that it is enough to make explicit checks for
interruption only at the top-level of the call stack, not all around the code. I
added checks only in Backup_restore_ctx methods and in
write/restore_table_data() functions which do the main job of pumping data
to/from the drivers.
This is based on assumption that it is not crucial to detect interruption
immediately after it has happened. The execution can continue until control
reaches the top-level caller and then the interruption will be detected and
appropriate action taken. This should work especially that most functions will
report error in case of an interruption and thus the execution will be stopped
anyway. If not, then it will be broken at the top-level. I assume that this is
enough for the purpose.
>
> - There is a lot of code like this:
>
> if (report_killed())
> return fatal_error(ER_QUERY_INTERRUPTED);
> if (ret)
> return fatal_error(report_error(ER_DDL_BLOCK));
>
> I think the following code would do the same and is more
> readable:
>
> if (ret || is_killed())
> return fatal_error(report_error(ER_DDL_BLOCK));
>
> where is_killed() is an inline method that checks m_thd->killed.
I buy it. This will work, because report_error() will correctly report
interruption, instead of the given error, if it has happened. And in that case
it will return ER_QUERY_INTERRUPTED, not the provided error code.
It has the advantage you notice below, that it is more likely that is_killed()
test will be inlined.
I changed the code as you suggest in places where it was suitable. But in few
places I left a separate check for interruption because it felt more appropriate
or, e.g., report_error() was not called.
>
> - It seems strange to me that report_error() is inline while
> report_killed() is not. It would have made sense if it was the
> other way around. report_killed() is called all the time while
> report_error() is only called upon errors. In other words,
> report_killed() will probably be called hundreds of times per
> backup, while report_error() will probably be called less than
> daily for most users.
>
> On the other hand, I am not sure that the compiler would decide to
> inline report_killed if asked to. It calls another inline function
> and that might complicate stuff for all I know. A pattern like
> this would probably make the inlining of the critical check more
> certain:
>
> if (is_killed())
> report_killed();
>
After above change, in most places the code has the pattern similar to what you
suggest here:
if (error || is_killed())
report_error()
But in places where report_error() is not called, and report_killed() must be
called explicitly, I preferred to leave it as
if (report_killed())
<do something>
> - Do you really need Locking_thread_st::m_thread_started? There is
> already
> lock_state that I think you could use instead.
I think lock_state will be not good for the purpose. It is managed from within
the locking thread. In particular, after creating the locking thread,
backup_thread_for_locking() sets lock_state to LOCK_NOT_STARTED at the
beginning. Thus, it is possible that for the short time the locking thread
exists but lock_state is still LOCK_NOT_STARTED. Then calling
kill_locking_thread() during that time would be no-op and would leave the
locking thread alive, leading to problems.
Since m_thread_started is controlled by Locking_thread_st methods there should
be no race conditions as long as the class instance is used from a single
thread. I added comment that Locking_thread_st is not multi-thread safe.
> - I did not really understand the movement of the
> reset_diagnostics_area calls. Especially with respect to stream.h.
> Why not handle it where read_meta_data() is called?
>
There is no strong reason. I was thinking like this:
- the reset is here because si_object functions, which internally execute DDL
statements do not clean the context properly.
- these DDL statements are executed when metadata is read and objects are
created - this happens inside bstream_rd_meta_data() function of bstream library.
- function read_meta_data() in stream.h is a wrapper around the
bstream_rd_meta_data() function, so it seems to be a perfect place to do such
extra cleanup, so that "outside" read_meta_data() everything is fine.
If you insist, I can revert this change - it is not important for the patch.
> --
> Øystein
>
> Some more detailed comments below:
>
> > Rafal Somla wrote:
> >> #At file:///ext/mysql/bzr/backup/wl4538-new/
> >>
> >> 2740 Rafal Somla 2008-12-12
> >> BUG#35079/WL#4538 - Make BACKUP/RESTORE statements interruptible.
> >> The server detects interruptions in statement execution
> >> such as when client connection is closed or if user hits Ctrl+C.
> >> However, the code inside the server must actively check if an
> >> interruption has happened and abort execution in that case.
> >> This patch adds such checks to the backup code, as
> >> described in WL#4538. After this patch it should be possible to
> >> interrupt on-going BACKUP/RESTORE command. It also fixes few
> >> problems in the code so that the new tests can pass through.
...
> >> === modified file 'mysql-test/lib/mtr_report.pl'
> >> --- a/mysql-test/lib/mtr_report.pl 2008-11-21 15:02:34 +0000
> >> +++ b/mysql-test/lib/mtr_report.pl 2008-12-12 15:04:40 +0000
> >> @@ -392,6 +392,17 @@ sub mtr_report_stats ($) {
> >>
> >> ($testname eq 'backup.backup_myisam1') and
> >> (/Backup: Can't initialize MyISAM backup driver/) or
> >> +
> >> + ($testname eq 'backup.backup_intr_errors') and
> >> + (
> >> + # ignore errors triggered on purpose during
> >> BACKUP/RESTORE + # operation shutdown
> >> + /Backup: .* backup driver can't cancel its backup
> >> operation/ or
> >> + /Backup: Error on delete of/ or
> >> + /Error on close of backup stream/ or
> >> + /Restore: Can't shut down .* restore driver/
> >> + ) or
> >> +
> >> /Sort aborted/ or
> >> /Time-out in NDB/ or
> >> /One can only use the --user.*root/ or
>
> The above code contains a mix of tabs and spaces for indentation.
>
Now mtr2 is used and the suppression rules are inside tests, not here.
> >> === modified file 'sql/backup/be_thread.cc'
...
> >> - kill_locking_thread();
> >> - wait_until_locking_thread_dies();
> >> -
> >> + if (m_thread_started)
> >> + {
> >> + kill_locking_thread();
> >> + wait_until_locking_thread_dies();
> >> + }
>
> You check for m_thread_started both before calling and within each
> function. That seems a bit unecessary.
>
Yes, but it does not hurt too much and makes code more robust (in the sense that
it will work correct even if internal check in kill_locking_thread() and
wait_until_locking_thread_dies() disappear for some reason).
> >> === modified file 'sql/backup/data_backup.cc'
> >> --- a/sql/backup/data_backup.cc 2008-11-26 10:05:19 +0000
> >> +++ b/sql/backup/data_backup.cc 2008-12-12 15:04:40 +0000
> >> @@ -510,7 +510,8 @@ int write_table_data(THD* thd, Backup_in
> >> List<Scheduler::Pump> inactive; // list of images not yet being
> >> created
> >>
> >> // keeps maximal init size for images in inactive list
> >> - size_t max_init_size=0;
> >> + size_t max_init_size=0;
> >> + bool commits_blocked= FALSE; // indicates if commit blocker is
> >> active
> >>
> >> DBUG_PRINT("backup_data",("initializing scheduler"));
> >>
> >> @@ -524,13 +525,12 @@ int write_table_data(THD* thd, Backup_in
> >> continue;
> >>
> >> Scheduler::Pump *p= new Scheduler::Pump(*i, s);
> >> -
> >> if (!p)
> >> {
> >> log.report_error(ER_OUT_OF_RESOURCES);
> >> goto error;
> >> }
> >> - if (!p->is_valid())
> >> + if (log.report_killed() || !p->is_valid())
> >> {
> >> log.report_error(ER_BACKUP_CREATE_BACKUP_DRIVER,p->m_name);
> >> delete p;
>
> I do not understand why you need to check here. The pump constructor
> is within the same module and it does not seem to do anything time
> consuming stuff.
But it makes a call to external code: it calls get_restore_driver() function
implemented by backup engine.
>
> >> @@ -650,8 +650,9 @@ int write_table_data(THD* thd, Backup_in
> >> non-transactional tables and pass that to block_commits().
> >> */
> >> int error= 0;
> >> - error= block_commits(thd, NULL);
> >> - if (error)
> >> + if (!(error= block_commits(thd, NULL)))
> >> + commits_blocked= TRUE;
> >> + if (log.report_killed() || error)
> >> goto error;
>
> I think this check needs to go within block_commits. There are
> several steps there involving locks, and I think one needs to check
> for killed thread in between those steps.
>
Hmm, I don't know why (un)block_commits() are in data_backup.cc. To me they are
part of server services and shuld be implemented in si_objects (or similar).
I want to treat these functions as interface to server functionality and have
backup kernel code written without any (implicit) assumptions about their
implementation. In particular, I don't assume that these functions correctly
detect and report interruptions. This is why I want to keep the check for
interruptions after calling them. Also, even if block_commits() detects and
reports interruption, we still want to record this fact in the logs and this
will be done by report_killed().
I agree that one should check that (un)block_commits() behave reasonable when
interrupted. Chances are that they do, because they call server functions which
usually report error in case of interruption. And if server function returns
error, (un)blocking commits is aborted.
> >>
> >> if (sch.prepare()) // logs errors
> >> @@ -665,6 +666,12 @@ int write_table_data(THD* thd, Backup_in
> >> DBUG_PRINT("backup_data",("-- SYNC PHASE --"));
> >>
> >> + /*
> >> + Before proceeding, check if the process is interrupted.
> >> + */
> >> + if (log.report_killed())
> >> + goto error;
> >> +
>
> The placement of this check confused me a bit since it did not
> directly follow a function call. I suggest to place it after
> Scheduler::step call. (Or maybe inside step if there are particular
> calls you are concerned about).
>
The comment is trying to explain that the check is here just to ensure that we
are healthy before we proceed with next, sensitive phase of the process.
Scheduler::step() already has checks for interruptions - here I do an extra
check. I can remove it if you insist.
> ...
>
> >> @@ -1988,7 +2138,7 @@ int bcat_create_item(st_bstream_image_he
> >> */
> >> if (!obs::check_user_existence(thd, sobj->get_name()))
> >> {
>
> No check necessary here?
>
Good point. Perhaps not exactly here, but these bcat_*() callback functions are
called repeatedly from within bstream library (e.g. when reading metadata and
creating objects) and it is good to check for interruption within them.
Especially in bcat_create_item() because, in case of an interruption, it is
important to break execution before next object is re-created, possibly
overwriting the existing one.
Therefore I added interruption checks inside the callback functions. As noted
above, it is not crucial to detect interruption exactly at the point where it
happened. Thus I added only one check, at the beginning of each callback.
Rafal
| Thread |
|---|
| • Re: bzr commit into mysql-6.0-backup branch (Rafal.Somla:2740)Bug#35079 WL#4538 | Rafal Somla | 20 Feb |