List:Commits« Previous MessageNext Message »
From:Rafal Somla Date:March 14 2008 11:04am
Subject:RE: bk commit into 6.0 tree (rafal:1.2579) WL#4212
View as plain text  
Hi Chuck,

Thanks for looking at patches and pointing out potential problems - not 
mentioning the actual crash which needs to be fixed. I'm replying now in hope 
that we will be able to push WL#4212 asap.

> I see you are storing a copy of the file name instead of referring to it. Do
> you expect or can you think of scenarios where the binlog filename can
> change during backup? If so, I think they would be better handled in
> WL#4209.

While a long BACKUP is running, other DML statements can run as backup is 
on-line. Thus entries will be written to binary log and eventually it can rotate 
which will change the binlog file name stored in LOG_INFO. This is why it is not 
enough to store pointer to the file name but the copy of it must be made.

It seems to me that whatever is done in WL#4209, we always have to save the 
binlog position at VP and while doing it need to copy the binlog file name. Or, 
do you have some concrete ideas which will change this?

> Changing existing error messages has been known to cause cascading errors
> for other, seemingly unrelated tests (it has a logical explanation as I am
> sure you know). I ran a partial test set (didn't do all rpl or ndb and
> couldn't do backup) and found no tests that failed due to error number
> changes (but over 20 failed due to other unrelated problems). Please run a
> full test to make sure we're ok and include patches to fix any test failures
> found.

Yes, I run a full test suite and it all passed on my computer. Note that I 
change only one error message (ER_BACKUP_CANT_RESTORE_TABLE) which was wrongly 
defined by previous patch (copy-paste error) - all other messages are new ones.

The exact reason why error messages should not be changed is that users of the 
server might parse them and depend on the fact that they remain the same in all 
versions of the server. But that applies to the released (GA) versions of the 
server, as is mentioned in the policy. We are still working on pre-GA code so it 
is the last moment to fix such typos.

> The crashes are back on Windows. The problem is now on line#339 in
> kernel.cc.

This looks very mysterious. The line#399 in kernel.cc is

Backup_restore_ctx::~Backup_restore_ctx()
{
   close();

   delete m_catalog;  <-- HERE
   delete m_stream;
}

in my tree. Is it the same line? Are you sure the exception is caught exactly 
here or somewhere deeper in the call stack (for example in the destructor of 
Backup_info or Restore_info)? Is it possible to see the call stack at the crash 
time?

The only reason I can think about, why "delete m_catalog" would generate an 
exception, is that m_catalog points at invalid memory. This could happen either 
if m_catalog is uninitialized (a "dangling pointer") or it was deleted before.

The only place where m_catalog is deleted is the Backup_restore_ctx destructor. 
The destructor can't be called twice for the same object unless it is called 
explicitly by "delete".  But context object instance is not dynamic - it is 
local variable of execute_backup_command() function and is never deleted 
explicitly. Thus the second option is excluded.

The first option also seems out of question. The m_catalog member is initialized 
to NULL inside Backup_restore_ctx constructor. It is assigned value only in two 
places in the code: in Backup_restore_ctx::prepare_for_backup() at line#514 of 
kernel.cc and in Backup_restore_ctx::prepare_for_backup() at line#590. In both 
cases the value is a valid pointer to the newly created Backup_info or 
Restore_info instance.

Finally, the memory dump you attach:

> -		m_catalog	0x01e88aa8 {data_size=30 m_snap=0x01e8a70c
> m_table_count=1 ...}	backup::Image_info *
> +		st_bstream_image_header	{version=1 server_version={...}
> flags=0 ...}	st_bstream_image_header
> +		__vfptr	0x00ca28f8 const backup::Image_info::`vftable'	*
> 		data_size	30	unsigned int
> +		m_snap	0x01e8a70c	backup::Snapshot_info * [256]
> 		m_table_count	1	unsigned int
> +		mem_root	{free=0x01e918d8 used=0x00000000
> pre_alloc=0x00000000 ...}	st_mem_root
> +		m_dbs	{...}	Map<unsigned int,backup::Image_info::Db>

indicates that m_catalog points at a valid object instance, not at some random 
bytes.


All this makes me think that the problem is not exactly at "delete m_catalog" 
line but must lie somewhere deeper. I need more info to find out where.

Rafal
Thread
bk commit into 6.0 tree (rafal:1.2579) WL#4212rsomla11 Mar
  • RE: bk commit into 6.0 tree (rafal:1.2579) WL#4212Chuck Bell12 Mar
RE: bk commit into 6.0 tree (rafal:1.2579) WL#4212Rafal Somla14 Mar
  • RE: bk commit into 6.0 tree (rafal:1.2579) WL#4212Chuck Bell14 Mar
    • Re: bk commit into 6.0 tree (rafal:1.2579) WL#4212Rafal Somla17 Mar