List: Commits
From: ingo
Date: September 1 2006 10:35am
Subject: bk commit into 5.0 tree (istruewing:1.2208) BUG#17332
Below is the list of changes that have just been committed into a local
5.0 repository of istruewing. When istruewing does a push these changes will
be propagated to the main repository and, within 24 hours after the
push, to the public repository.
For information on how to access the public repository
see http://dev.mysql.com/doc/mysql/en/installing-source-tree.html

ChangeSet@stripped, 2006-09-01 12:34:55+02:00, istruewing@stripped +2 -0
  Bug#17332 - changing key_buffer_size on a running server
              can crash under load
  
  Resizing a key cache while it was in heavy use could crash the
  server. There were several race conditions.
  
  I reworked some of the algorithms to fix the race conditions.

  include/keycache.h@stripped, 2006-09-01 12:34:53+02:00, istruewing@stripped +2 -0
    Bug#17332 - changing key_buffer_size on a running server
                can crash under load
    Added a wait queue for use by the "resizer" if it has to wait
    for other threads to finish with blocks after the flush is
    complete.
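
    For illustration, here is the waiter/signaller protocol around this
    queue, condensed from the mf_keycache.c diff below (a sketch, not
    standalone compilable). Note that the signalling thread, not the
    waiter, unlinks the waiter from the queue:

      /* Waiter side (resize_key_cache), cache_lock held: */
      while (keycache->cnt_for_resize_op)
      {
        /* Re-enter the queue each time; the signaller unlinks us. */
        link_into_queue(&keycache->waiting_for_resize_cnt, thread);
        keycache_pthread_cond_wait(&thread->suspend, &keycache->cache_lock);
      }

      /* Signaller side (dec_counter_for_resize_op): */
      if (!--keycache->cnt_for_resize_op &&
          (last_thread= keycache->waiting_for_resize_cnt.last_thread))
      {
        keycache_pthread_cond_signal(&last_thread->next->suspend);
        unlink_from_queue(&keycache->waiting_for_resize_cnt,
                          last_thread->next);
      }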

  mysys/mf_keycache.c@stripped, 2006-09-01 12:34:53+02:00, istruewing@stripped +401 -113
    Bug#17332 - changing key_buffer_size on a running server
                can crash under load
    
    Changed the resize_queue handling so that the "resizer" does not
    hang in this queue while it flushes the cache. A thread can only
    wait in one queue. The "resizer" needs to wait for other things
    during the flush, so it has to leave the queue as soon as it is
    in charge of resizing.
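
    Condensed from resize_key_cache() in the diff below (a sketch, not
    standalone compilable):

      /* Wait only while another thread is already resizing. */
      while (keycache->resize_in_flush)
      {
        /* Re-enter resize_queue each time; the signaller unlinks us. */
        link_into_queue(wqueue, thread);
        keycache_pthread_cond_wait(&thread->suspend, &keycache->cache_lock);
      }
      /* Now in charge of resizing and no longer in any queue. */
      keycache->resize_in_flush= 1;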
    
    Added a new queue to be used if the resizer has to wait for other 
    threads to finish with blocks after the flush is complete.
    
    Added a check if a block has been freed while a thread was waiting
    for it.
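
    In the wait loop of wait_for_readers() (diff below; sketch only):

      block->condvar= &thread->suspend;
      keycache_pthread_cond_wait(&thread->suspend, &keycache->cache_lock);
      block->condvar= NULL;
      /* The other thread might have freed the block in between. */
      if (!block->hash_link)
        break;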
    
    Added a check for a changed page when writing during resize.
    A changed block must be returned as the caller might want to
    overwrite only a part of the block.
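
    The corresponding path in the diff below (sketch only):

      if (block->status & BLOCK_CHANGED)
      {
        /* The block has changed contents, which must not be lost. */
        reg_requests(keycache, block, 1);
        *page_st= page_status;
        DBUG_RETURN(block);
      }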
    
    Added checks for blocks marked in switch or to be freed by another
    thread. These must not be freed by this thread.
    
    Added flush handling to the eviction procedure. If a block is
    marked in flush, the "evicter" has to wait until this is complete.
    Otherwise it has to mark it in flush before releasing the mutex.
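
    Condensed from the find_key_block() changes in the diff below (a
    sketch, not standalone compilable):

      if (block->status & BLOCK_IN_FLUSH)
      {
        /* Another thread writes the block; wait until it is saved. */
        add_to_queue(&block->wqueue[COND_FOR_SAVED], thread);
        do
        {
          keycache_pthread_cond_wait(&thread->suspend,
                                     &keycache->cache_lock);
        } while (thread->next);
      }
      else
      {
        /* Mark the block in flush before releasing the mutex. */
        block->status|= BLOCK_IN_FLUSH;
        keycache_pthread_mutex_unlock(&keycache->cache_lock);
        error= my_pwrite(block->hash_link->file,
                         block->buffer + block->offset,
                         block->length - block->offset,
                         block->hash_link->diskpos + block->offset,
                         MYF(MY_NABP | MY_WAIT_IF_FULL));
        keycache_pthread_mutex_lock(&keycache->cache_lock);
      }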
    
    Changed the assignment of a block status bit after a first-time
    read to an addition of the bit to the block status. Other status
    bits must not be cleared here. The block might have been selected
    to be freed while the read was in progress.
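
    I.e. in read_block() and key_cache_insert() (diff below):

      block->status|= BLOCK_READ;   /* was: block->status= BLOCK_READ; */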
    
    Changed the flush handling for changed blocks so that it skips
    blocks which are marked in flush already and waits for them later.
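
    The waiting part, condensed from flush_key_blocks_int() in the diff
    below (sketch only):

      else if (last_in_flush)
      {
        /*
          No blocks to flush by this thread, but blocks in flush by
          other threads. Wait until one of them is saved, then rescan.
        */
        struct st_my_thread_var *thread= my_thread_var;
        add_to_queue(&last_in_flush->wqueue[COND_FOR_SAVED], thread);
        do
        {
          keycache_pthread_cond_wait(&thread->suspend,
                                     &keycache->cache_lock);
        } while (thread->next);
        goto restart;
      }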
    
    Changed the flush handling for clean blocks (this means freeing
    them) so that it skips blocks which are marked in switch or
    to be freed. Also we need to restart from the beginning of the
    chain whenever we detect that the next block changed its state.
    It might not be at the same position in the chain any more. The
    change for the next block can happen while we need to wait for
    readers of the current block before we can free it.
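
    Condensed from the diff below: the state of 'next' is snapshotted
    before free_block() can wait, and compared afterwards (sketch only):

      if ((requests= block->requests) && next)
      {
        /* Snapshot 'next' and its hash_link before we may yield. */
        hash_link= next->hash_link;
        diskpos=   hash_link->diskpos;
        file=      hash_link->file;
        status=    next->status;
      }
      free_block(keycache, block);        /* may wait for readers */
      if (requests && next &&
          ((status    != next->status) ||
           (hash_link != next->hash_link) ||
           (file      != hash_link->file) ||
           (diskpos   != hash_link->diskpos)))
        break;          /* 'next' changed state; restart the chain scan */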
    
    Changed the approach for flushing all blocks. It was not safe to
    select blocks from the LRU ring to find a file for flush. The LRU
    ring can be empty even though blocks in use exist. The only safe way
    is to go through the 'changed_blocks' and 'file_blocks' hashes.
    This two-step approach has the advantage that changed blocks do
    not need to be freed immediately after their write. They can
    remain in the cache for further reads until the second step.
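
    Condensed from the new flush_all_key_blocks() in the diff below (a
    sketch, not standalone compilable):

      /* Step 1: write out all dirty blocks, keeping them readable. */
      do
      {
        found= 0;
        for (idx= 0; idx < CHANGED_BLOCKS_HASH; idx++)
          if ((block= keycache->changed_blocks[idx]))
          {
            found++;
            if (flush_key_blocks_int(keycache, block->hash_link->file,
                                     FLUSH_FORCE_WRITE))
              return 1;
          }
      } while (found);

      /* Stop new blocks from being read into the cache. */
      keycache->can_be_used= FALSE;

      /* Step 2: free all (now clean) blocks. */
      do
      {
        found= 0;
        for (idx= 0; idx < CHANGED_BLOCKS_HASH; idx++)
          if ((block= keycache->file_blocks[idx]))
          {
            found++;
            if (flush_key_blocks_int(keycache, block->hash_link->file,
                                     FLUSH_RELEASE))
              return 1;
          }
      } while (found);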
    

# This is a BitKeeper patch.  What follows are the unified diffs for the
# set of deltas contained in the patch.  The rest of the patch, the part
# that BitKeeper cares about, is below these diffs.
# User:	istruewing
# Host:	chilla.local
# Root:	/home/mydev/mysql-5.0-bug17332-one

--- 1.56/mysys/mf_keycache.c	2006-09-01 12:35:01 +02:00
+++ 1.57/mysys/mf_keycache.c	2006-09-01 12:35:01 +02:00
@@ -413,6 +413,7 @@ int init_key_cache(KEY_CACHE *keycache, 
     keycache->resize_in_flush= 0;
     keycache->can_be_used= 1;
 
+    keycache->waiting_for_resize_cnt.last_thread= NULL;
     keycache->waiting_for_hash_link.last_thread= NULL;
     keycache->waiting_for_block.last_thread= NULL;
     DBUG_PRINT("exit",
@@ -503,14 +504,22 @@ int resize_key_cache(KEY_CACHE *keycache
 #ifdef THREAD
   wqueue= &keycache->resize_queue;
   thread= my_thread_var;
-  link_into_queue(wqueue, thread);
 
-  while (wqueue->last_thread->next != thread)
+  /* We may need to wait for another thread which is doing a resize already. */
+  while (keycache->resize_in_flush)
   {
+    /* Enter the resize_queue every time as the signalling thread removes us. */
+    link_into_queue(wqueue, thread);
     keycache_pthread_cond_wait(&thread->suspend, &keycache->cache_lock);
+    /* We are removed from the resize_queue here. */
   }
 #endif
 
+  /*
+    Mark the operation in progress. This blocks other threads from doing
+    a resize in parallel. And it prohibits changes to cache blocks.
+    Writing of index blocks has to bypass the cache.
+  */
   keycache->resize_in_flush= 1;
   if (flush_all_key_blocks(keycache))
   {
@@ -520,19 +529,46 @@ int resize_key_cache(KEY_CACHE *keycache
     keycache->can_be_used= 0;
     goto finish;
   }
+
+  /*
+    Mark the flush finished. This would allow other threads to start a
+    resize. But we do still have the cache_lock and do not release it
+    until the end of this function.
+  */
   keycache->resize_in_flush= 0;
+
+  /* Mark the cache unusable. This prohibits all uses of the cache. */
   keycache->can_be_used= 0;
+
 #ifdef THREAD
+  /*
+    Some cache read/write operations may still be unfinished. Wait until
+    they are done. However I doubt that this can ever happen. During
+    flush_all_key_blocks() we waited for every block in use and freed
+    all blocks. There cannot be any block in use any more (IMHO).
+  */
   while (keycache->cnt_for_resize_op)
   {
+    /*
+      Enter the waiting_for_resize_cnt queue every time as the
+      signalling thread removes us.
+    */
+    link_into_queue(&keycache->waiting_for_resize_cnt, thread);
     KEYCACHE_DBUG_PRINT("resize_key_cache: wait",
                         ("suspend thread %ld", thread->id));
     keycache_pthread_cond_wait(&thread->suspend, &keycache->cache_lock);
+    /* We are removed from the waiting_for_resize_cnt queue here. */
   }
 #else
   KEYCACHE_DBUG_ASSERT(keycache->cnt_for_resize_op == 0);
 #endif
 
+  /*
+    Free old cache structures, allocate new structures, and initialize
+    them. Note that the cache_lock mutex and the resize_queue are left
+    untouched. We do not lose the cache_lock and will release it only at
+    the end of this function.
+  */
   end_key_cache(keycache, 0);			/* Don't free mutex */
   /* The following will work even if use_mem is 0 */
   blocks= init_key_cache(keycache, key_cache_block_size, use_mem,
@@ -540,13 +576,17 @@ int resize_key_cache(KEY_CACHE *keycache
 
 finish:
 #ifdef THREAD
-  unlink_from_queue(wqueue, thread);
   /* Signal for the next resize request to proceed if any */
   if (wqueue->last_thread)
   {
     KEYCACHE_DBUG_PRINT("resize_key_cache: signal",
                         ("thread %ld", wqueue->last_thread->next->id));
     keycache_pthread_cond_signal(&wqueue->last_thread->next->suspend);
+    /*
+      Remove the signalled thread from the resize_queue. It will enter
+      itself again if necessary.
+    */
+    unlink_from_queue(wqueue, wqueue->last_thread->next);
   }
 #endif
   keycache_pthread_mutex_unlock(&keycache->cache_lock);
@@ -572,11 +612,17 @@ static inline void dec_counter_for_resiz
 #ifdef THREAD
   struct st_my_thread_var *last_thread;
   if (!--keycache->cnt_for_resize_op &&
-      (last_thread= keycache->resize_queue.last_thread))
+      (last_thread= keycache->waiting_for_resize_cnt.last_thread))
   {
+    /*
+      The count became zero and there is a thread waiting for this
+      event. Wake it.
+    */
     KEYCACHE_DBUG_PRINT("dec_counter_for_resize_op: signal",
                         ("thread %ld", last_thread->next->id));
     keycache_pthread_cond_signal(&last_thread->next->suspend);
+    /* Unlink the signalled thread from the queue. */
+    unlink_from_queue(&keycache->waiting_for_resize_cnt, last_thread->next);
   }
 #else
   keycache->cnt_for_resize_op--;
@@ -819,6 +865,10 @@ static inline void unlink_changed(BLOCK_
   if (block->next_changed)
     block->next_changed->prev_changed= block->prev_changed;
   *block->prev_changed= block->next_changed;
+
+  /* Avoid accidental traversal into the wrong chain. */
+  block->next_changed= NULL;
+  block->prev_changed= NULL;
 }
 
 
@@ -1138,6 +1188,9 @@ static inline void wait_for_readers(KEY_
     block->condvar= &thread->suspend;
     keycache_pthread_cond_wait(&thread->suspend, &keycache->cache_lock);
     block->condvar= NULL;
+    /* The other thread might have freed the block in between. */
+    if (!block->hash_link)
+      break;
   }
 #else
   KEYCACHE_DBUG_ASSERT(block->hash_link->requests == 0);
@@ -1377,29 +1430,54 @@ restart:
 
     if (page_status != PAGE_READ)
     {
-      /* We don't need the page in the cache: we are going to write on disk */
+      /*
+        The page is not in the cache. We don't need the page in the
+        cache: we are going to write on disk. Cancel the request.
+      */
       hash_link->requests--;
       unlink_hash(keycache, hash_link);
-      return 0;
+      DBUG_RETURN(0);
     }
+
+    /* The page is in the cache. */
+
     if (!(block->status & BLOCK_IN_FLUSH))
     {
-      hash_link->requests--;
+      /* The page is not yet or no longer part of a flush operation. */
+      if (block->status & BLOCK_CHANGED)
+      {
+        /* The block has changed contents, which must not be lost. */
+        reg_requests(keycache, block, 1);
+        *page_st= page_status;
+        DBUG_RETURN(block);
+      }
+
       /*
-        Remove block to invalidate the page in the block buffer
-        as we are going to write directly on disk.
-        Although we have an exlusive lock for the updated key part
-        the control can be yieded by the current thread as we might
-        have unfinished readers of other key parts in the block
-        buffer. Still we are guaranteed not to have any readers
-        of the key part we are writing into until the block is
-        removed from the cache as we set the BLOCL_REASSIGNED
-        flag (see the code below that handles reading requests).
+        Unregister from the hash_link. This must be done before freeing
+        the block. And it must also be done if not freeing it.
       */
-      free_block(keycache, block);
-      return 0;
+      hash_link->requests--;
+
+      /* If the block is not in eviction and not in free, we can free it. */
+      if (!(block->status & (BLOCK_IN_SWITCH | BLOCK_REASSIGNED)))
+      {
+        /*
+          Remove block to invalidate the page in the block buffer
+          as we are going to write directly on disk.
+          Although we have an exclusive lock for the updated key part
+          the control can be yielded by the current thread as we might
+          have unfinished readers of other key parts in the block
+          buffer. Still we are guaranteed not to have any readers
+          of the key part we are writing into until the block is
+          removed from the cache as we set the BLOCK_REASSIGNED
+          flag (see the code below that handles reading requests).
+        */
+        free_block(keycache, block);
+      }
+      DBUG_RETURN(0);
     }
-    /* Wait intil the page is flushed on disk */
+
+    /* Wait until the page is flushed on disk */
     hash_link->requests--;
     {
 #ifdef THREAD
@@ -1424,10 +1502,14 @@ restart:
       */
 #endif
     }
-    /* Invalidate page in the block if it has not been done yet */
-    if (block->status)
+    /*
+      Invalidate page in the block if it has not been done yet, and if
+      it is neither marked for eviction nor to be freed.
+    */
+    if (block->status &&
+        !(block->status & (BLOCK_IN_SWITCH | BLOCK_REASSIGNED)))
       free_block(keycache, block);
-    return 0;
+    DBUG_RETURN(0);
   }
 
   if (page_status == PAGE_READ &&
@@ -1518,17 +1600,20 @@ restart:
       }
       else
       {
-	/* There are no never used blocks, use a block from the LRU chain */
-
-        /*
-          Wait until a new block is added to the LRU chain;
-          several threads might wait here for the same page,
-          all of them must get the same block
+	/*
+          There are no free blocks and no never-used blocks; use a block
+          from the LRU ring.
         */
 
 #ifdef THREAD
         if (! keycache->used_last)
         {
+          /*
+            The LRU ring is empty. Wait until a new block is added to
+            it. Several threads might wait here for the same page; all
+            of them must get the same block.
+          */
+
           struct st_my_thread_var *thread= my_thread_var;
           thread->opt_info= (void *) hash_link;
           link_into_queue(&keycache->waiting_for_block, thread);
@@ -1545,6 +1630,11 @@ restart:
 #else
         KEYCACHE_DBUG_ASSERT(keycache->used_last);
 #endif
+        /*
+          If we waited above, hash_link->block has been assigned by
+          link_block(). Otherwise it is still NULL. In the latter case
+          we need to grab a block from the LRU ring ourselves.
+        */
         block= hash_link->block;
         if (! block)
         {
@@ -1559,6 +1649,11 @@ restart:
           hash_link->block= block;
         }
 
+        /*
+          If we had to wait above, there is a small chance that another
+          thread grabbed this block for the same file block already. But
+          in most cases the first condition is true.
+        */
         if (block->hash_link != hash_link &&
 	    ! (block->status & BLOCK_IN_SWITCH) )
         {
@@ -1573,19 +1668,40 @@ restart:
 	    /* The block contains a dirty page - push it out of the cache */
 
             KEYCACHE_DBUG_PRINT("find_key_block", ("block is dirty"));
-
-            keycache_pthread_mutex_unlock(&keycache->cache_lock);
-            /*
-	      The call is thread safe because only the current
-	      thread might change the block->hash_link value
-            */
-	    error= my_pwrite(block->hash_link->file,
-			     block->buffer+block->offset,
-			     block->length - block->offset,
-			     block->hash_link->diskpos+ block->offset,
-			     MYF(MY_NABP | MY_WAIT_IF_FULL));
-            keycache_pthread_mutex_lock(&keycache->cache_lock);
-	    keycache->global_cache_write++;
+            if (block->status & BLOCK_IN_FLUSH)
+            {
+#ifdef THREAD
+              /* Wait until the block is flushed. */
+              struct st_my_thread_var *thread= my_thread_var;
+              add_to_queue(&block->wqueue[COND_FOR_SAVED], thread);
+              do
+              {
+                KEYCACHE_DBUG_PRINT("find_key_block: wait",
+                                    ("suspend thread %ld", thread->id));
+                keycache_pthread_cond_wait(&thread->suspend,
+                                           &keycache->cache_lock);
+              } while (thread->next);
+#else
+              KEYCACHE_DBUG_ASSERT(0);
+              /* No parallel requests in single-threaded case */
+#endif
+            }
+            else
+            {
+              block->status|= BLOCK_IN_FLUSH;
+              keycache_pthread_mutex_unlock(&keycache->cache_lock);
+              /*
+                The call is thread safe because only the current
+                thread might change the block->hash_link value
+              */
+              error= my_pwrite(block->hash_link->file,
+                               block->buffer+block->offset,
+                               block->length - block->offset,
+                               block->hash_link->diskpos+ block->offset,
+                               MYF(MY_NABP | MY_WAIT_IF_FULL));
+              keycache_pthread_mutex_lock(&keycache->cache_lock);
+              keycache->global_cache_write++;
+            }
           }
 
           block->status|= BLOCK_REASSIGNED;
@@ -1715,7 +1831,7 @@ static void read_block(KEY_CACHE *keycac
       block->status|= BLOCK_ERROR;
     else
     {
-      block->status= BLOCK_READ;
+      block->status|= BLOCK_READ;
       block->length= got_length;
     }
     KEYCACHE_DBUG_PRINT("read_block",
@@ -1983,7 +2099,7 @@ int key_cache_insert(KEY_CACHE *keycache
         keycache_pthread_mutex_lock(&keycache->cache_lock);
         /* Here we are alone again. */
 #endif
-        block->status= BLOCK_READ;
+        block->status|= BLOCK_READ;
         block->length= read_length+offset;
         KEYCACHE_DBUG_PRINT("key_cache_insert",
                             ("primary request: new page in cache"));
@@ -2199,6 +2315,16 @@ static void free_block(KEY_CACHE *keycac
   KEYCACHE_DBUG_PRINT("free_block",
                       ("block %u to be freed, hash_link %p",
                        BLOCK_NUMBER(block), block->hash_link));
+  /*
+    IMHO the below conditional makes no sense. I can't see how it could
+    be possible that free_block() is entered with a NULL hash_link
+    pointer. The only place where it can become NULL is in free_block()
+    (or before its first use ever, but for those blocks free_block() is
+    not called). I don't remove the conditional as it cannot harm, but
+    place a DBUG_ASSERT to confirm my hypothesis. Eventually the
+    conditional can be removed.
+  */
+  DBUG_ASSERT(block->hash_link);
   if (block->hash_link)
   {
     /*
@@ -2209,6 +2335,8 @@ static void free_block(KEY_CACHE *keycac
     */
     block->status|= BLOCK_REASSIGNED;
     wait_for_readers(keycache, block);
+    /* The block must not have been freed by another thread. */
+    DBUG_ASSERT(block->hash_link);
     unlink_hash(keycache, block->hash_link);
   }
 
@@ -2300,6 +2428,8 @@ static int flush_cached_blocks(KEY_CACHE
     {
       keycache->blocks_changed--;
       keycache->global_blocks_changed--;
+      DBUG_ASSERT((block->status & BLOCK_IN_FLUSH) &&
+                  !(block->status & (BLOCK_IN_SWITCH | BLOCK_REASSIGNED)));
       free_block(keycache, block);
     }
     else
@@ -2355,6 +2485,7 @@ static int flush_key_blocks_int(KEY_CACH
     uint count= 0;
     BLOCK_LINK **pos,**end;
     BLOCK_LINK *first_in_switch= NULL;
+    BLOCK_LINK *last_in_flush;
     BLOCK_LINK *block, *next;
 #if defined(KEYCACHE_DEBUG)
     uint cnt=0;
@@ -2370,7 +2501,8 @@ static int flush_key_blocks_int(KEY_CACH
            block ;
            block= block->next_changed)
       {
-        if (block->hash_link->file == file)
+        if ((block->hash_link->file == file) &&
+            !(block->status & BLOCK_IN_FLUSH))
         {
           count++;
           KEYCACHE_DBUG_ASSERT(count<= keycache->blocks_used);
@@ -2388,6 +2520,7 @@ static int flush_key_blocks_int(KEY_CACH
 
     /* Retrieve the blocks and write them to a buffer to be flushed */
 restart:
+    last_in_flush= NULL;
     end= (pos= cache)+count;
     for (block= keycache->changed_blocks[FILE_HASH(file)] ;
          block ;
@@ -2400,54 +2533,63 @@ restart:
       next= block->next_changed;
       if (block->hash_link->file == file)
       {
-        /*
-           Mark the block with BLOCK_IN_FLUSH in order not to let
-           other threads to use it for new pages and interfere with
-           our sequence ot flushing dirty file pages
-        */
-        block->status|= BLOCK_IN_FLUSH;
-
-        if (! (block->status & BLOCK_IN_SWITCH))
+        if (!(block->status & BLOCK_IN_FLUSH))
         {
-	  /*
-	    We care only for the blocks for which flushing was not
-	    initiated by other threads as a result of page swapping
+          /*
+            Mark the block with BLOCK_IN_FLUSH in order not to let
+            other threads use it for new pages and interfere with
+            our sequence of flushing dirty file pages
           */
-          reg_requests(keycache, block, 1);
-          if (type != FLUSH_IGNORE_CHANGED)
+          block->status|= BLOCK_IN_FLUSH;
+
+          if (!(block->status & BLOCK_IN_SWITCH))
           {
-	    /* It's not a temporary file */
-            if (pos == end)
+            /*
+              We care only for the blocks for which flushing was not
+              initiated by other threads as a result of page swapping
+            */
+            reg_requests(keycache, block, 1);
+            if (type != FLUSH_IGNORE_CHANGED)
             {
-	      /*
-		This happens only if there is not enough
-		memory for the big block
-              */
-              if ((error= flush_cached_blocks(keycache, file, cache,
-                                              end,type)))
-                last_errno=error;
-              /*
-		Restart the scan as some other thread might have changed
-		the changed blocks chain: the blocks that were in switch
-		state before the flush started have to be excluded
-              */
-              goto restart;
+              /* It's not a temporary file */
+              if (pos == end)
+              {
+                /*
+                  This happens only if there is not enough
+                  memory for the big block
+                */
+                if ((error= flush_cached_blocks(keycache, file, cache,
+                                                end,type)))
+                  last_errno=error;
+                /*
+                  Restart the scan as some other thread might have changed
+                  the changed blocks chain: the blocks that were in switch
+                  state before the flush started have to be excluded
+                */
+                goto restart;
+              }
+              *pos++= block;
+            }
+            else
+            {
+              /* It's a temporary file */
+              keycache->blocks_changed--;
+              keycache->global_blocks_changed--;
+              DBUG_ASSERT(!(block->status & BLOCK_REASSIGNED));
+              free_block(keycache, block);
             }
-            *pos++= block;
           }
           else
           {
-            /* It's a temporary file */
-            keycache->blocks_changed--;
-	    keycache->global_blocks_changed--;
-            free_block(keycache, block);
+            /* Link the block into a list of blocks 'in switch' */
+            unlink_changed(block);
+            link_changed(block, &first_in_switch);
           }
         }
         else
         {
-	  /* Link the block into a list of blocks 'in switch' */
-          unlink_changed(block);
-          link_changed(block, &first_in_switch);
+          /* Remember the last block found to be in flush. */
+          last_in_flush= block;
         }
       }
     }
@@ -2456,7 +2598,35 @@ restart:
       if ((error= flush_cached_blocks(keycache, file, cache, pos, type)))
         last_errno= error;
     }
-    /* Wait until list of blocks in switch is empty */
+    else if (last_in_flush)
+    {
+      /*
+        There are no blocks to be flushed by this thread, but blocks in
+        flush by other threads. Wait until one of the blocks is flushed.
+       */
+#ifdef THREAD
+      struct st_my_thread_var *thread= my_thread_var;
+      add_to_queue(&last_in_flush->wqueue[COND_FOR_SAVED], thread);
+      do
+      {
+        KEYCACHE_DBUG_PRINT("flush_key_blocks_int: wait",
+                            ("suspend thread %ld", thread->id));
+        keycache_pthread_cond_wait(&thread->suspend,
+                                   &keycache->cache_lock);
+      } while (thread->next);
+#else
+      KEYCACHE_DBUG_ASSERT(0);
+      /* No parallel requests in single-threaded case */
+#endif
+      /* Be sure not to lose a block. They may be flushed in random order. */
+      goto restart;
+    }
+
+    /*
+      Wait until the list of blocks in switch is empty. The threads that
+      are switching these blocks will relink them to clean file chains
+      while we wait and thus empty the 'first_in_switch' chain.
+    */
     while (first_in_switch)
     {
 #if defined(KEYCACHE_DEBUG)
@@ -2485,29 +2655,87 @@ restart:
       KEYCACHE_DBUG_ASSERT(cnt <= keycache->blocks_used);
 #endif
     }
-    /* The following happens very seldom */
+
     if (! (type == FLUSH_KEEP || type == FLUSH_FORCE_WRITE))
     {
-#if defined(KEYCACHE_DEBUG)
-      cnt=0;
-#endif
-      for (block= keycache->file_blocks[FILE_HASH(file)] ;
-           block ;
-           block= next)
+      uint found;
+
+      /*
+        Finally free all clean blocks for this file.
+        During resize this may be run by two threads in parallel.
+      */
+      do
       {
-#if defined(KEYCACHE_DEBUG)
-        cnt++;
-        KEYCACHE_DBUG_ASSERT(cnt <= keycache->blocks_used);
-#endif
-        next= block->next_changed;
-        if (block->hash_link->file == file &&
-            (! (block->status & BLOCK_CHANGED)
-             || type == FLUSH_IGNORE_CHANGED))
+        found= 0;
+        for (block= keycache->file_blocks[FILE_HASH(file)] ;
+             block ;
+             block= next)
         {
-          reg_requests(keycache, block, 1);
-          free_block(keycache, block);
+          /* Remember the next block. After freeing we cannot get at it. */
+          next= block->next_changed;
+
+          /*
+            We must not free blocks in eviction (BLOCK_IN_SWITCH |
+            BLOCK_REASSIGNED), blocks intended to be freed
+            (BLOCK_REASSIGNED) or changed blocks (though I don't know
+            how this could happen here. Changed blocks should not appear
+            in the file_blocks hash. But I left this check as it was
+            before.)
+          */
+          if ((block->hash_link->file == file) &&
+              !(block->status & (BLOCK_IN_SWITCH | BLOCK_REASSIGNED)) &&
+              (!(block->status & BLOCK_CHANGED)
+               || (type == FLUSH_IGNORE_CHANGED)))
+          {
+            struct st_hash_link *hash_link;
+            my_off_t            diskpos;
+            File                file;
+            uint                status;
+            uint                requests;
+
+            found++;
+            KEYCACHE_DBUG_ASSERT(found <= keycache->blocks_used);
+
+            reg_requests(keycache, block, 1);
+
+            /*
+              free_block() may need to wait for readers of the block.
+              This is the moment where the other thread can move the
+              'next' blocks from the chain. free_block() needs to wait
+              if there are requests for the block pending.
+            */
+            if ((requests= block->requests) && next)
+            {
+              /* Copy values from the 'next' block and its hash_link. */
+              hash_link= next->hash_link;
+              diskpos=   hash_link->diskpos;
+              file=      hash_link->file;
+              status=    next->status;
+            }
+
+            free_block(keycache, block);
+
+            /*
+              If free_block() needed to wait and the state of the 'next'
+              block changed, break the inner loop. 'next' may no longer
+              be part of the current chain.
+
+              We do not want to break the loop after every free_block(),
+              not even only after waits. The chain might be quite long
+              and contain blocks for many files. Traversing it again and
+              again to find more blocks for this file could become quite
+              inefficient.
+            */
+            if (requests && next &&
+                ((status    != next->status) ||
+                 (hash_link != next->hash_link) ||
+                 (file      != hash_link->file) ||
+                 (diskpos   != hash_link->diskpos)))
+              break;
+          }
         }
-      }
+
+      } while (found);
     }
   }
 
@@ -2562,29 +2790,89 @@ int flush_key_blocks(KEY_CACHE *keycache
 
 static int flush_all_key_blocks(KEY_CACHE *keycache)
 {
-#if defined(KEYCACHE_DEBUG)
-  uint cnt=0;
-#endif
-  while (keycache->blocks_changed > 0)
+  uint found;
+
+  /* Flush all changed blocks first. */
+  do
   {
     BLOCK_LINK *block;
-    for (block= keycache->used_last->next_used ; ; block=block->next_used)
+    uint       idx;
+
+    found= 0;
+    /* Step over the whole changed_blocks hash array. */
+    for (idx= 0; idx < CHANGED_BLOCKS_HASH; idx++)
     {
-      if (block->hash_link)
+      /*
+        If an array element is non-empty, use the first block from its
+        chain to find a file for flush. All blocks for this file are
+        flushed. So the same block will not appear at this place again
+        with the next iteration. New writes for blocks are not accepted
+        during the flush.
+      */
+      if ((block= keycache->changed_blocks[idx]))
       {
-#if defined(KEYCACHE_DEBUG)
-        cnt++;
-        KEYCACHE_DBUG_ASSERT(cnt <= keycache->blocks_used);
-#endif
+        /* A block in the changed_blocks hash must have a hash_link. */
+        DBUG_ASSERT(block->hash_link);
+
+        found++;
+        /*
+          Flush dirty blocks but do not free them yet. They can be used
+          for reading until all other blocks are flushed too.
+        */
+        if (flush_key_blocks_int(keycache, block->hash_link->file,
+				 FLUSH_FORCE_WRITE))
+          return 1;
+      }
+    }
+
+  } while (found);
+
+  /*
+    Now there are no dirty blocks in the cache any more and no new
+    writes are accepted since the start of the flush.
+    Now also stop new clean blocks from being read into the cache.
+    This prevents a fight between the flushing thread and other threads
+    that request new pages for reading.
+    However there may be blocks still in use for reading. We need to
+    wait for them whenever we hit a block in use.
+  */
+  keycache->can_be_used= FALSE;
+
+  /* Now flush (free) all clean blocks. */
+  do
+  {
+    BLOCK_LINK *block;
+    uint       idx;
+
+    found= 0;
+    /* Step over the whole file_blocks hash array. */
+    for (idx= 0; idx < CHANGED_BLOCKS_HASH; idx++)
+    {
+      /*
+        If an array element is non-empty, use the first block from its
+        chain to find a file for flush. All blocks for this file are
+        freed. So the same block will not appear at this place again
+        with the next iteration. Unless it has been read into the cache
+        anew. In this case readers and the flusher fight against each
+        other. But since the flusher does not need to do I/O for clean
+        blocks, and writes for blocks are not accepted during the flush,
+        it will eventually win.
+      */
+      if ((block= keycache->file_blocks[idx]))
+      {
+        /* A block in the file_blocks hash must have a hash_link. */
+        DBUG_ASSERT(block->hash_link);
+
+        found++;
         if (flush_key_blocks_int(keycache, block->hash_link->file,
 				 FLUSH_RELEASE))
           return 1;
-        break;
       }
-      if (block == keycache->used_last)
-        break;
     }
-  }
+
+  } while (found);
+
+  /* Now all blocks are free and no new blocks can come in (!can_be_used). */
   return 0;
 }
 

--- 1.6/include/keycache.h	2006-09-01 12:35:01 +02:00
+++ 1.7/include/keycache.h	2006-09-01 12:35:01 +02:00
@@ -73,6 +73,8 @@ typedef struct st_key_cache
   BLOCK_LINK *used_ins;          /* ptr to the insertion block in LRU chain  */
   pthread_mutex_t cache_lock;    /* to lock access to the cache structure    */
   KEYCACHE_WQUEUE resize_queue;  /* threads waiting during resize operation  */
+  KEYCACHE_WQUEUE waiting_for_resize_cnt; /* waiting for a zero resize count.
+            Using a queue for symmetry though only one thread can wait here. */
   KEYCACHE_WQUEUE waiting_for_hash_link; /* waiting for a free hash link     */
   KEYCACHE_WQUEUE waiting_for_block;    /* requests waiting for a free block */
   BLOCK_LINK *changed_blocks[CHANGED_BLOCKS_HASH]; /* hash for dirty file bl.*/