List:Internals« Previous MessageNext Message »
From:Jay Pipes Date:May 10 2009 5:11pm
Subject:Re: [Drizzle-discuss] ptr_compare replace with memcmp and associated
performance
View as plain text  
Have we tried running the benchmarks with and without -funroll-loops?

-j

Stewart Smith wrote:
> So i broke out mtaylor's patch that we were bumming around with at the UC
> to replace ptr_compare with a simple memcmp call.
> 
> At the UC I benched that this patch actually caused a measurable performance
> regresssion.
> 
> So what's the difference?
> (same benchmark and machine as in previous mail)
> 
> MAX_FIELDS=64 with std::bitset
>     read/write requests:                 70000  (3374.50 per sec.)
>     read/write requests:                 70000  (3102.42 per sec.)
>     read/write requests:                 70000  (3113.47 per sec.)
>     read/write requests:                 70000  (3401.89 per sec.)
>     read/write requests:                 70000  (3164.61 per sec.)
> AVERAGE= 3231.37
> 
> With ptr_compare replaced with memcmp:
>     read/write requests:                 70000  (3090.33 per sec.)
>     read/write requests:                 70000  (3066.97 per sec.)
>     read/write requests:                 70000  (2954.93 per sec.)
>     read/write requests:                 70000  (2875.57 per sec.)
>     read/write requests:                 70000  (2953.99 per sec.)
> AVERAGE= 2988.35
> 
> With ptr_compare replaced with __builtin_memcmp:
>     read/write requests:                 70000  (3245.37 per sec.)
>     read/write requests:                 70000  (3001.91 per sec.)
>     read/write requests:                 70000  (3254.99 per sec.)
>     read/write requests:                 70000  (3177.13 per sec.)
>     read/write requests:                 70000  (3195.49 per sec.)
> AVERAGE= 3174.97
> 
> 
> So, I  thought that with just using the __builtin_memcmp automatically, we
> may be okay with merging this.
> 
> However... I do have the following concerns:
> 1) I'm pretty sure that __builtin_memcmp is gcc only, and that SunStudio
> may have an issue (yay - wrappers!)
> 2) Is this perf difference going to be true on different platforms?
> 3) the various memcmp implementations do seem to be dependent on a few
> things for performance: alignment, size of data.
> 
> For what we're seeing in sysbench, the parameters are things like this:
> "SELECT c from sbtest where id between 9942 and 10041 order by c"
> 
> 0xc30780,0xc2c340,240
> 0xc2e560,0xc2c340,240
> 0xc30780,0xc2e560,240
> 0xc37390,0xc32f50,240
> 0xc35170,0xc32f50,240
> 0xc37390,0xc35170,240
> 0xc3dcc8,0xc39888,240
> 0xc3baa8,0xc2e560,240
> 0xc35170,0xc2e560,240
> 0xc3baa8,0xc35170,240
> 0xc2c340,0xc35170,240
> 0xc35170,0xc3dcc8,240
> 0xc2c618,0xc35170,240
> 0xc35170,0xc3d9f0,240
> 0xc2c8f0,0xc35170,240
> 0x7fbed85f10f0,0x7fbed85eccb0,240
> 
> and from the CREATE TABLE:
>         `c` VARCHAR(120) NOT NULL  COLLATE utf8_general_ci DEFAULT `` ,
> 
> So here it's large, and aligned.
> 
> For some other micro-benchmarks I've clocked things looking much
> different.... so it's possibly query dependent as to what ends up being
> better (i.e. how much data we're comparing).
> 
> (1st parameter is number of repetitions, 2nd is number of bytes to memcmp)
> 
> 
> For comparing equal values:
> 
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8
> 168435455 repetitions
> 
> Testing memcmp ..... done,      7.432 seconds
> Testing builtin memcmp ....... done,     16.313 seconds
> Testing loop .... done,      6.144 seconds
> Testing loop32 .... done,      2.764 seconds
> Testing loop64 .... done,      2.128 seconds
> Testing no-op .... done,      1.488 seconds
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 16
> 168435455 repetitions
> 
> Testing memcmp ..... done,      5.912 seconds
> Testing builtin memcmp ....... done,     26.418 seconds
> Testing loop .... done,     11.201 seconds
> Testing loop32 .... done,      3.980 seconds
> Testing loop64 .... done,      2.564 seconds
> Testing no-op .... done,      1.492 seconds
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 32
> 168435455 repetitions
> 
> Testing memcmp ..... done,      6.828 seconds
> Testing builtin memcmp ....... done,     46.891 seconds
> Testing loop .... done,     21.473 seconds
> Testing loop32 .... done,      6.536 seconds
> Testing loop64 .... done,      3.804 seconds
> Testing no-op .... done,      1.476 seconds
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 64
> 168435455 repetitions
> 
> Testing memcmp ..... done,      9.549 seconds
> Testing builtin memcmp ....... done,     87.513 seconds
> Testing loop .... done,     41.455 seconds
> Testing loop32 .... done,     11.669 seconds
> Testing loop64 .... done,      6.368 seconds
> Testing no-op .... done,      1.468 seconds
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 128
> 168435455 repetitions
> 
> Testing memcmp ..... done,     15.081 seconds
> Testing builtin memcmp ....... done,    169.143 seconds
> Testing loop .... done,     86.397 seconds
> Testing loop32 .... done,     21.877 seconds
> Testing loop64 .... done,     11.445 seconds
> Testing no-op .... done,      1.488 seconds
> stewart@willster:~/src/test/memcmp$ gcc -O3 -fno-builtin memcmpbench.c
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 256
> 168435455 repetitions
> 
> Testing memcmp ..... done,     26.134 seconds
> Testing loop64 .... done,     21.549 seconds
> Testing no-op .... done,      1.500 seconds
> 
> 
> 
> Completely inequal values are all about the same:
> 
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8
> 168435455 repetitions
> 
> Testing memcmp ..... done,      3.204 seconds
> Testing builtin memcmp ....... done,     13.945 seconds
> Testing loop .... done,      1.896 seconds
> Testing loop32 .... done,      2.092 seconds
> Testing loop64 .... done,      2.136 seconds
> Testing no-op .... done,      1.488 seconds
> 
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 240
> 168435455 repetitions
> 
> Testing memcmp ..... done,      5.520 seconds
> Testing builtin memcmp ....... done,     13.973 seconds
> Testing loop .... done,      1.912 seconds
> Testing loop32 .... done,      2.124 seconds
> Testing loop64 .... done,      2.104 seconds
> Testing no-op .... done,      1.468 seconds
> 
> 
> For half the same:
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8
> 168435455 repetitions
> 
> Testing memcmp ..... done,      7.340 seconds
> Testing builtin memcmp ....... done,     19.193 seconds
> Testing loop .... done,      4.212 seconds
> Testing loop32 .... done,      2.956 seconds
> Testing loop64 .... done,      2.112 seconds
> Testing no-op .... done,      1.476 seconds
> 
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 32
> 168435455 repetitions
> 
> Testing memcmp ..... done,      5.920 seconds
> Testing builtin memcmp ....... done,     34.526 seconds
> Testing loop .... done,     11.885 seconds
> Testing loop32 .... done,      4.444 seconds
> Testing loop64 .... done,      3.388 seconds
> Testing no-op .... done,      1.468 seconds
> 
> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 64
> 168435455 repetitions
> 
> Testing memcmp ..... done,      6.948 seconds
> Testing builtin memcmp ....... done,     54.923 seconds
> Testing loop .... done,     22.081 seconds
> Testing loop32 .... done,      7.000 seconds
> Testing loop64 .... done,      4.444 seconds
> Testing no-op .... done,      1.488 seconds
> 
> 
> Is concurrency an issue here and not raw single threaded performance?
> 
> Let's try with 64 threads (on the same 2 core box):
> 
> 64 threads, __builtin_memcmp:
>     read/write requests:                 70000  (3184.20 per sec.)
>     read/write requests:                 70000  (3184.87 per sec.)
>     read/write requests:                 70000  (3166.83 per sec.)
>     read/write requests:                 70000  (3085.69 per sec.)
> AVERAGE=3155.39
> 
> 64 threads, memcmp:
>     read/write requests:                 70000  (3218.07 per sec.)
>     read/write requests:                 70000  (3219.04 per sec.)
>     read/write requests:                 70000  (3222.48 per sec.)
>     read/write requests:                 70000  (3116.00 per sec.)
> AVERAGE=3193.89
> 
> 
> 64 threads, 32bit cmp loop:
>     read/write requests:                 70000  (3173.76 per sec.)
>     read/write requests:                 70000  (3156.23 per sec.)
>     read/write requests:                 70000  (3206.70 per sec.)
>     read/write requests:                 70000  (3247.61 per sec.)
> AVERAGE=3196.07
> 
> 64 threads, 64bit cmp loop:
>     read/write requests:                 70000  (3210.02 per sec.)
>     read/write requests:                 70000  (3218.87 per sec.)
>     read/write requests:                 70000  (3225.98 per sec.)
>     read/write requests:                 70000  (3131.60 per sec.)
> AVERAGE= 3196.61
> 
> 64 threads, baseline (using original ptr_compare):
>     read/write requests:                 70000  (3542.95 per sec.)
>     read/write requests:                 70000  (3556.15 per sec.)
>     read/write requests:                 70000  (3560.71 per sec.)
>     read/write requests:                 70000  (3476.80 per sec.)
> AVERAGE=3534.15
> 
> i.e. the old ptr_compare whoops the arse of any of the memcmp calls with
> higher concurrency.
> 
> 
> So, after this I can conclude:
> - memcmp is nothing if not consistent across various microbenchmarks
> - builtin_memcmp seems to help a bit at lower concurrency, not at all at
> higher and is seldom useful in micro benchmark.
> 
> - the 64bit loop isn't so good at higher concurrency
> - the 64bit loop is favourable in microbenchmarks
> - memcmp is close to the 64bit loop performance (at most only 2x slower)
> 
> Unrolling loops being the key?
> 
> Testing loop .... done,     11.813 seconds
> Testing unrollloop .... done,      7.628 seconds
> Testing unrollloop32 .... done,      2.532 seconds
> Testing unrollloop64 .... done,      2.536 seconds
> Testing loop32 .... done,      4.432 seconds
> Testing loop64 .... done,      3.792 seconds
> Testing no-op .... done,      1.276 seconds
> 
> (32 and 64 do 32,64 bit compares in an unrolled loop)
> 
> Questions:
> - Why does my 64bit compare loop not equal performance of ptr_compare
> unrolled loop? (in a microbenchmark, it beats it). Concurrency again?
> 
> 
> So what should we do?
> 
> Certainly a  call to memcmp is easier to understand and keeps the code
> nice and simple, but we're talking up to a 10% speed hit here at higher
> concurrency (at least on my hardware).
> 
> I wonder what the difference is on various other hardware setups.
> 
> I'd be very interested to see what happens on sparc.
> 
> I'd also like to see it micro-benchmarked - preferably something we
> could repeat in future (even in ./configure ? on server startup ?)
> 
> I'm voting to keep the current ptr_compare code, and hope that somebody
> provides further explanation on top of what I've looked at here.
> 
> 
> 
> 
> Below is the patch, followed by the code i used for microbenchmarking
> (note the commented out 64bit loop i used for the above loop tests):
> 
> === modified file 'drizzled/filesort.cc'
> --- drizzled/filesort.cc	2009-04-28 00:17:10 +0000
> +++ drizzled/filesort.cc	2009-05-08 05:59:11 +0000
> @@ -1135,7 +1135,7 @@ int merge_buffers(SORTPARAM *param, IO_C
>    }
>    else
>    {
> -    cmp= get_ptr_compare(sort_length);
> +    cmp= (qsort2_cmp)ptr_compare;
>      first_cmp_arg= (void*) &sort_length;
>    }
>    priority_queue<BUFFPEK *, vector<BUFFPEK *>, compare_functor > 
> 
> === modified file 'mysys/mf_sort.cc'
> --- mysys/mf_sort.cc	2009-04-17 21:01:47 +0000
> +++ mysys/mf_sort.cc	2009-05-08 05:59:11 +0000
> @@ -34,7 +34,7 @@ void my_string_ptr_sort(unsigned char *b
>    {
>      if (size && items)
>      {
> -      my_qsort2(base,items, sizeof(unsigned char*), get_ptr_compare(size),
> +      my_qsort2(base,items, sizeof(unsigned char*), (qsort2_cmp)ptr_compare,
>                  (void*) &size);
>      }
>    }
> 
> === modified file 'mysys/my_sys.h'
> --- mysys/my_sys.h	2009-04-27 22:05:43 +0000
> +++ mysys/my_sys.h	2009-05-08 05:59:11 +0000
> @@ -420,7 +420,12 @@ extern void my_qsort(void *base_ptr, siz
>                       qsort_cmp cmp);
>  extern void my_qsort2(void *base_ptr, size_t total_elems, size_t size,
>                        qsort2_cmp cmp, void *cmp_argument);
> -extern qsort2_cmp get_ptr_compare(size_t);
> +
> +#if defined(__cplusplus)
> +extern "C"
> +#endif
> +int ptr_compare(size_t *compare_length, unsigned char **a, unsigned char **b);
> +
>  void my_store_ptr(unsigned char *buff, size_t pack_length, my_off_t pos);
>  my_off_t my_get_ptr(unsigned char *ptr, size_t pack_length);
>  File create_temp_file(char *to, const char *dir, const char *pfx,
> 
> === modified file 'mysys/ptr_cmp.cc'
> --- mysys/ptr_cmp.cc	2009-04-26 16:53:32 +0000
> +++ mysys/ptr_cmp.cc	2009-05-08 05:59:11 +0000
> @@ -21,137 +21,33 @@
>  
>  #include "mysys/mysys_priv.h"
>  #include "plugin/myisam/myisampack.h"
> -
> -static int ptr_compare(size_t *compare_length, unsigned char **a, unsigned char
> **b);
> -static int ptr_compare_0(size_t *compare_length, unsigned char **a, unsigned char
> **b);
> -static int ptr_compare_1(size_t *compare_length, unsigned char **a, unsigned char
> **b);
> -static int ptr_compare_2(size_t *compare_length, unsigned char **a, unsigned char
> **b);
> -static int ptr_compare_3(size_t *compare_length, unsigned char **a, unsigned char
> **b);
> -
> -	/* Get a pointer to a optimal byte-compare function for a given size */
> -
> -qsort2_cmp get_ptr_compare (size_t size)
> -{
> -  if (size < 4)
> -    return (qsort2_cmp) ptr_compare;
> -  switch (size & 3) {
> -    case 0: return (qsort2_cmp) ptr_compare_0;
> -    case 1: return (qsort2_cmp) ptr_compare_1;
> -    case 2: return (qsort2_cmp) ptr_compare_2;
> -    case 3: return (qsort2_cmp) ptr_compare_3;
> -    }
> -  return 0;					/* Impossible */
> -}
> -
> +#include <string.h>
>  
>  	/*
>  	  Compare to keys to see witch is smaller.
> -	  Loop unrolled to make it quick !!
>  	*/
>  
> -#define cmp(N) if (first[N] != last[N]) return (int) first[N] - (int) last[N]
> -
> -static int ptr_compare(size_t *compare_length, unsigned char **a, unsigned char
> **b)
> -{
> -  register int length= *compare_length;
> -  register unsigned char *first,*last;
> -
> -  first= *a; last= *b;
> -  while (--length)
> -  {
> -    if (*first++ != *last++)
> -      return (int) first[-1] - (int) last[-1];
> -  }
> -  return (int) first[0] - (int) last[0];
> -}
> -
> -
> -static int ptr_compare_0(size_t *compare_length,unsigned char **a, unsigned char
> **b)
> +extern "C"
> +int ptr_compare(size_t *compare_length, unsigned char **a, unsigned char **b)
>  {
> -  register int length= *compare_length;
> -  register unsigned char *first,*last;
> -
> -  first= *a; last= *b;
> - loop:
> -  cmp(0);
> -  cmp(1);
> -  cmp(2);
> -  cmp(3);
> -  if ((length-=4))
> +/*  size_t l= *compare_length / sizeof(uint64_t);
> +  uint64_t *aa= (uint64_t*)*a;
> +  uint64_t *bb= (uint64_t*)*b;
> +  while(l--)
>    {
> -    first+=4;
> -    last+=4;
> -    goto loop;
> -  }
> -  return (0);
> -}
> -
> -
> -static int ptr_compare_1(size_t *compare_length,unsigned char **a, unsigned char
> **b)
> -{
> -  register int length= *compare_length-1;
> -  register unsigned char *first,*last;
> -
> -  first= *a+1; last= *b+1;
> -  cmp(-1);
> - loop:
> -  cmp(0);
> -  cmp(1);
> -  cmp(2);
> -  cmp(3);
> -  if ((length-=4))
> -  {
> -    first+=4;
> -    last+=4;
> -    goto loop;
> -  }
> -  return (0);
> -}
> -
> -static int ptr_compare_2(size_t *compare_length,unsigned char **a, unsigned char
> **b)
> -{
> -  register int length= *compare_length-2;
> -  register unsigned char *first,*last;
> -
> -  first= *a +2 ; last= *b +2;
> -  cmp(-2);
> -  cmp(-1);
> - loop:
> -  cmp(0);
> -  cmp(1);
> -  cmp(2);
> -  cmp(3);
> -  if ((length-=4))
> -  {
> -    first+=4;
> -    last+=4;
> -    goto loop;
> +    if(*aa != *bb)
> +    {
> +      if(*aa < *bb)
> +	return -1;
> +      else
> +	return 1;
> +    }
> +    aa++, bb++;
>    }
> -  return (0);
> +  return 0;
> +*/  return memcmp(*a, *b, *compare_length);
>  }
>  
> -static int ptr_compare_3(size_t *compare_length,unsigned char **a, unsigned char
> **b)
> -{
> -  register int length= *compare_length-3;
> -  register unsigned char *first,*last;
> -
> -  first= *a +3 ; last= *b +3;
> -  cmp(-3);
> -  cmp(-2);
> -  cmp(-1);
> - loop:
> -  cmp(0);
> -  cmp(1);
> -  cmp(2);
> -  cmp(3);
> -  if ((length-=4))
> -  {
> -    first+=4;
> -    last+=4;
> -    goto loop;
> -  }
> -  return (0);
> -}
>  
>  void my_store_ptr(unsigned char *buff, size_t pack_length, my_off_t pos)
>  {
> 
> 
> For those interested, i started with the code at
> http://gcc.gnu.org/ml/gcc/2002-10/msg01666.html
> 
> and ended up with something like this:
> 
> #include <sys/resource.h>
> #include <sys/time.h>
> #include <stdio.h>
> #include <assert.h>
> #include <stdlib.h>
> 
> char s1[256];
> char s2[256];
> 
> int cmpsize= 240;
> 
> int
> speed_memcmp ()
> {
>   return memcmp (s1, s2, cmpsize);
> }
> 
> int
> speed_bimemcmp ()
> {
>   return __builtin_memcmp (s1, s2, cmpsize);
> }
> 
> int
> speed_loop ()
> {
>   int i=cmpsize;
>   char *a= s1;
>   char *b= s2;
>   while(i--)
>   {
>     if(*a != *b)
>       return -1;
>     a++, b++;
>   };
>   return 0;
> }
> 
> int
> speed_loop32 ()
> {
>   int i=cmpsize/4;
>   int *a= (int*)s1;
>   int *b= (int*)s2;
>   while(i--)
>   {
>     if(*a != *b)
>       return -1;
>     a++, b++;
>   };
>   return 0;
> }
> 
> int
> speed_loop64 ()
> {
>   int i=cmpsize/8;
>   long long *a= (long long*)s1;
>   long long *b= (long long*)s2;
>   while(i--)
>   {
>     if(*a != *b)
>       return -1;
>     a++, b++;
>   };
> 
>   return 0;
> }
> 
> int speed_null()
> {
>   return 0;
> }
> 
> 
> double
> do_test (int repetitions, int (*test_function) ())
> {
>   struct rusage r1, r2;
> 
>   getrusage(RUSAGE_SELF, &r1);
> 
>   while (repetitions--)
>     test_function ();
> 
>   getrusage(RUSAGE_SELF, &r2);
>   return (r2.ru_utime.tv_sec - r1.ru_utime.tv_sec) +
> 	 (r2.ru_utime.tv_usec - r1.ru_utime.tv_usec) / 1000000.0;
> }
> 
> main(int argc, char **argv)
> {
>   int repetitions = (argc == 1) ? 0x0fffffff : atoi(argv[1]);
>   cmpsize = (argc == 2) ? 240 : atoi(argv[2]);
> 
>   memset(s1, 42, 256);
>   memset(s2, 42, 256);
>   memset(s2+(cmpsize/2), 55, cmpsize/2);
> 
>   printf ("%d repetitions\n\n", repetitions);
>   printf ("Testing memcmp .....");
>   fflush (stdout);
>   printf (" done, %10.3f seconds\n", do_test (repetitions, speed_memcmp));
> 
>   printf ("Testing builtin memcmp .......");
>   fflush (stdout);
>   printf (" done, %10.3f seconds\n", do_test (repetitions, speed_bimemcmp));
> 
>   printf ("Testing loop ....");
>   fflush (stdout);
>   printf (" done, %10.3f seconds\n", do_test (repetitions, speed_loop));
> 
>   printf ("Testing loop32 ....");
>   fflush (stdout);
>   printf (" done, %10.3f seconds\n", do_test (repetitions, speed_loop32));
> 
>   printf ("Testing loop64 ....");
>   fflush (stdout);
>   printf (" done, %10.3f seconds\n", do_test (repetitions, speed_loop64));
> 
>   printf ("Testing no-op ....");
>   fflush (stdout);
>   printf (" done, %10.3f seconds\n", do_test (repetitions, speed_null));
> 
>   exit (0);
> }
> 
> 

Thread
ptr_compare replace with memcmp and associated performanceStewart Smith8 May
  • Re: ptr_compare replace with memcmp and associated performanceEric Day9 May
    • Re: ptr_compare replace with memcmp and associated performanceStewart Smith11 May
  • Re: [Drizzle-discuss] ptr_compare replace with memcmp and associatedperformanceJay Pipes10 May
    • Re: [Drizzle-discuss] ptr_compare replace with memcmp and associatedperformanceRoy Lyseng10 May
  • Re: ptr_compare replace with memcmp and associated performancePeter Benjamin Volk17 May