List:Commits« Previous MessageNext Message »
From:Guilhem Bichot Date:February 6 2009 9:20am
Subject:Bug still exists Re: bzr commit into mysql-6.0-backup branch
(ingo.struewing:2737) Bug#40944
View as plain text  
Hello Ingo,

Ingo Struewing a écrit, Le 12/03/2008 05:02 PM:
> #At file:///home2/mydev/bzrroot/mysql-6.0-bug40944-2/
> 
>  2737 Ingo Struewing	2008-12-03
>       Bug#40944 - Backup: crash after myisampack
>       
>       After restore of a compressed MyISAM table, the server crashed
>       when the table was accessed.
>       
>       The problem was twofold.
>       
>       1. The crash happened in the MyISAM keycache. It did not handle
>       an error in reading a block correctly. This is now fixed in
>       key_cache_read(), key_cache_insert(), and key_cache_write().
>       Their BLOCK_ERROR handling does now follow a common pattern.
>       Erroneous blocks are no longer put to the LRU ring, but freed.
>       
>       2. After fixing the above, accesses to a restored compressed table
>       failed with error messages about a corrupted table. Restore of a
>       table works by dropping the old table if exists, creating a new
>       one, and loading the data from the backup image. The MyISAM
>       native driver writes directly to the table files, what was there
>       at backup time. This works even for compressed tables, as long as
>       the table is freshly opened after loading the data. But RESTORE
>       needs to keep the table write locked until it is complete. This
>       includes keeping the table open. When restore is done, it unlocks
>       the tables and requests close. But our table cache does not
>       perform a full close. It marks it unused and keeps it open in the
>       cache. Later statements reuse the open table. But when the MyISAM
>       table format changes, we must force a full close. Since this is
>       difficult to detect, we do now always force close of restored
>       MyISAM tables, if restored through the native driver.


> === modified file 'sql/backup/kernel.cc'
> --- a/sql/backup/kernel.cc	2008-11-28 10:10:39 +0000
> +++ b/sql/backup/kernel.cc	2008-12-03 16:02:14 +0000
> @@ -893,12 +893,68 @@ int Backup_restore_ctx::lock_tables_for_
>   */ 
>  void Backup_restore_ctx::unlock_tables()
>  {
> +  TABLE_LIST *tables= NULL;
> +
>    // Do nothing if tables are not locked.
>    if (!m_tables_locked)
>      return;
>  
>    DBUG_PRINT("restore",("unlocking tables"));
>  
> +  /*
> +    Refresh tables that have been restored with the MyISAM native
> +    driver. MyISAM native driver might restore table layout that differs
> +    from the version created by materialize(). We need to force a final
> +    close after restore with close_cached_tables(). Note that we do this
> +    before we unlock the tables. Otherwise other threads could use the
> +    still open tables before we refresh them.

Unfortunately, this does not work as hoped and the bug still happens, 
please see below.

> +    For more information about this problem see the comment in
> +    myisam_backup_engine.cc:Table_restore::close().
> +
> +    Collect the list of affected tables.
> +  */
> +  for (uint snum= 0; snum < m_catalog->snap_count(); ++snum)
> +  {
> +    backup::Snapshot_info *snap= m_catalog->m_snap[snum];
> +
> +    if (snap->type() == backup::Snapshot_info::NATIVE_SNAPSHOT)
> +    {
> +      for (ulong tnum=0; tnum < snap->table_count(); ++tnum)
> +      {
> +        backup::Image_info::Table *btbl= snap->get_table(tnum);
> +        DBUG_ASSERT(btbl); // All tables should be present in the catalogue.
> +
> +        if (btbl->m_table->table &&
> +            (btbl->m_table->table->s->db_type() == myisam_hton))
> +        {
> +          TABLE_LIST *tlist= (TABLE_LIST*) alloc_root(m_thd->mem_root,
> +                                                    sizeof(TABLE_LIST));
> +          DBUG_EXECUTE_IF("Backup_restore_ctx_unlock_tables_alloc",
> +                          tlist= NULL;);
> +          if (!tlist)
> +          {
> +            /*
> +              Do not use fatal_error() here. We won't set
> +              Diagnostics_area then. Just flush all tables to be safe.
> +            */
> +            close_cached_tables(m_thd, NULL, FALSE, FALSE);
> +            goto close;
> +          }
> +          bzero(tlist, sizeof(TABLE_LIST));
> +          tlist->table_name= const_cast<char*>(btbl->name().ptr());
> +          tlist->alias= tlist->table_name;
> +          tlist->db= const_cast<char*>(btbl->db().name().ptr());
> +          tables= backup::link_table_list(*tlist, tables); // Never errors
> +        }
> +      }
> +    }
> +  }
> +  /* Refresh all affected tables, if any. */
> +  if (tables)
> +    close_cached_tables(m_thd, tables, FALSE, FALSE);

Looking at the documentation (function comment) of 
close_cached_tables(), this function is said to close tables which are 
not used by other threads.
Under concurrency, it is possible that the table is used by another 
thread, then close_cached_tables() does nothing.
I posted a repeatable testcase into BUG#40944 which uses two-thread 
concurrency and DEBUG_SYNC (shameless plug: your DEBUG_SYNC feature 
really really rocks; it makes writing those tests very simple):
- RESTORE starts
- it creates an empty table, opens and successfully locks it
- another thread starts a SELECT: opens table (making it "in use"), 
blocks on the lock held by RESTORE
- RESTORE restores the maria-packed data and index file, unlocks table, 
calls close_cached_tables() which does not close the table (because it's 
"in use" by the SELECT)
- RESTORE goes into close_thread_tables(m_thd) (below) which starts by 
unlocking tables; immediately after that unlocking, the SELECT thread 
resumes and manages to lock the table (which still isn't closed), read 
it and fail with the same error as before this patch ("Incorrect key 
file for table './mysql_db1/t1.MYI'; try to repair it") because it's 
using an old MYISAM_SHARE (which dates from the recreated empty table 
and does not mention the compressed format, as this info is in the old 
MYISAM_SHARE).
- SELECT unlocks the table
- RESTORE resumes close_thread_tables(), which closes table (a real 
close this time, but too late).

> + close:
>    close_thread_tables(m_thd);                   // Never errors

So, SELECT has been able to use the table (MYISAM_SHARE and probably
SQL-layer objects like TABLE or TABLE_SHARE) relying on wrong info (as 
these in-memory structures date from before calling the engine's restore 
driver and describe an empty table).

I am not sure that this problem will disappear when RESTORE will 
atomically create and restore data into the table without letting other 
clients open the table in between (needed to fix BUG#41716 "Backup Error 
when sending data (for table #1) to Snapshot restore driver"). It 
depends on the kind of locking used to achieve the atomicity (will it 
allow other client to open the table, will it force the client to 
close/reopen the table at some proper moment).
I suspect we'll need to involve the Runtime team's experts.

PS: I came through this while debugging BUG#42519 "Maria: RESTORE leads 
to corrupted table and assertion". I saw the close_cached_tables() doing 
nothing (it does help a lot in non-concurrent scenarios, but BUG#42519 
has concurrency).

-- 
Mr. Guilhem Bichot <guilhem@stripped>
Sun Microsystems / MySQL, Lead Software Engineer
Bordeaux, France
www.sun.com / www.mysql.com

Thread
bzr commit into mysql-6.0-backup branch (ingo.struewing:2737) Bug#40944Ingo Struewing3 Dec
  • Re: bzr commit into mysql-6.0-backup branch (ingo.struewing:2737)Bug#40944Øystein Grøvlen18 Dec
    • Re: bzr commit into mysql-6.0-backup branch (ingo.struewing:2737)Bug#40944Ingo Strüwing18 Dec
      • Re: bzr commit into mysql-6.0-backup branch (ingo.struewing:2737)Bug#40944Øystein Grøvlen19 Dec
  • Bug still exists Re: bzr commit into mysql-6.0-backup branch(ingo.struewing:2737) Bug#40944Guilhem Bichot6 Feb
    • Re: Bug still exists Re: bzr commit into mysql-6.0-backup branch(ingo.struewing:2737) Bug#40944Ingo Strüwing6 Feb
      • Re: Bug still exists Re: bzr commit into mysql-6.0-backup branch(ingo.struewing:2737) Bug#40944Rafal Somla6 Feb