Hi Chuck,
Thanks for looking at patches and pointing out potential problems - not
mentioning the actual crash which needs to be fixed. I'm replying now in hope
that we will be able to push WL#4212 asap.
> I see you are storing a copy of the file name instead of referring to it. Do
> you expect or can you think of scenarios where the binlog filename can
> change during backup? If so, I think they would be better handled in
> WL#4209.
While a long BACKUP is running, other DML statements can run as backup is
on-line. Thus entries will be written to binary log and eventually it can rotate
which will change the binlog file name stored in LOG_INFO. This is why it is not
enough to store pointer to the file name but the copy of it must be made.
It seems to me that whatever is done in WL#4209, we always have to save the
binlog position at VP and while doing it need to copy the binlog file name. Or,
do you have some concrete ideas which will change this?
> Changing existing error messages has been known to cause cascading errors
> for other, seemingly unrelated tests (it has a logical explanation as I am
> sure you know). I ran a partial test set (didn't do all rpl or ndb and
> couldn't do backup) and found no tests that failed due to error number
> changes (but over 20 failed due to other unrelated problems). Please run a
> full test to make sure we're ok and include patches to fix any test failures
> found.
Yes, I run a full test suite and it all passed on my computer. Note that I
change only one error message (ER_BACKUP_CANT_RESTORE_TABLE) which was wrongly
defined by previous patch (copy-paste error) - all other messages are new ones.
The exact reason why error messages should not be changed is that users of the
server might parse them and depend on the fact that they remain the same in all
versions of the server. But that applies to the released (GA) versions of the
server, as is mentioned in the policy. We are still working on pre-GA code so it
is the last moment to fix such typos.
> The crashes are back on Windows. The problem is now on line#339 in
> kernel.cc.
This looks very mysterious. The line#399 in kernel.cc is
Backup_restore_ctx::~Backup_restore_ctx()
{
close();
delete m_catalog; <-- HERE
delete m_stream;
}
in my tree. Is it the same line? Are you sure the exception is caught exactly
here or somewhere deeper in the call stack (for example in the destructor of
Backup_info or Restore_info)? Is it possible to see the call stack at the crash
time?
The only reason I can think about, why "delete m_catalog" would generate an
exception, is that m_catalog points at invalid memory. This could happen either
if m_catalog is uninitialized (a "dangling pointer") or it was deleted before.
The only place where m_catalog is deleted is the Backup_restore_ctx destructor.
The destructor can't be called twice for the same object unless it is called
explicitly by "delete". But context object instance is not dynamic - it is
local variable of execute_backup_command() function and is never deleted
explicitly. Thus the second option is excluded.
The first option also seems out of question. The m_catalog member is initialized
to NULL inside Backup_restore_ctx constructor. It is assigned value only in two
places in the code: in Backup_restore_ctx::prepare_for_backup() at line#514 of
kernel.cc and in Backup_restore_ctx::prepare_for_backup() at line#590. In both
cases the value is a valid pointer to the newly created Backup_info or
Restore_info instance.
Finally, the memory dump you attach:
> - m_catalog 0x01e88aa8 {data_size=30 m_snap=0x01e8a70c
> m_table_count=1 ...} backup::Image_info *
> + st_bstream_image_header {version=1 server_version={...}
> flags=0 ...} st_bstream_image_header
> + __vfptr 0x00ca28f8 const backup::Image_info::`vftable' *
> data_size 30 unsigned int
> + m_snap 0x01e8a70c backup::Snapshot_info * [256]
> m_table_count 1 unsigned int
> + mem_root {free=0x01e918d8 used=0x00000000
> pre_alloc=0x00000000 ...} st_mem_root
> + m_dbs {...} Map<unsigned int,backup::Image_info::Db>
indicates that m_catalog points at a valid object instance, not at some random
bytes.
All this makes me think that the problem is not exactly at "delete m_catalog"
line but must lie somewhere deeper. I need more info to find out where.
Rafal