List:Internals« Previous MessageNext Message »
From:Roy Lyseng Date:May 10 2009 6:17pm
Subject:Re: [Drizzle-discuss] ptr_compare replace with memcmp and associated
performance
View as plain text  
One of the dangers of microbenchmarks is that they do not consider cache 
misses well. Unrolling one loop is no harm, but what happens when you 
have hundreds of unrolled loops in your program? I would imagine that 
the probability of cache misses increase greatly, perhaps so much that 
it regresses performance.

Is it possible to control loop unrolling with a pragma, so that only the 
most performance sensitive loops get unrolled?

Thanks,
Roy

Jay Pipes wrote:
> Have we tried running the benchmarks with and without -funroll-loops?
> 
> -j
> 
> Stewart Smith wrote:
>> So i broke out mtaylor's patch that we were bumming around with at the UC
>> to replace ptr_compare with a simple memcmp call.
>>
>> At the UC I benched that this patch actually caused a measurable 
>> performance
>> regresssion.
>>
>> So what's the difference?
>> (same benchmark and machine as in previous mail)
>>
>> MAX_FIELDS=64 with std::bitset
>>     read/write requests:                 70000  (3374.50 per sec.)
>>     read/write requests:                 70000  (3102.42 per sec.)
>>     read/write requests:                 70000  (3113.47 per sec.)
>>     read/write requests:                 70000  (3401.89 per sec.)
>>     read/write requests:                 70000  (3164.61 per sec.)
>> AVERAGE= 3231.37
>>
>> With ptr_compare replaced with memcmp:
>>     read/write requests:                 70000  (3090.33 per sec.)
>>     read/write requests:                 70000  (3066.97 per sec.)
>>     read/write requests:                 70000  (2954.93 per sec.)
>>     read/write requests:                 70000  (2875.57 per sec.)
>>     read/write requests:                 70000  (2953.99 per sec.)
>> AVERAGE= 2988.35
>>
>> With ptr_compare replaced with __builtin_memcmp:
>>     read/write requests:                 70000  (3245.37 per sec.)
>>     read/write requests:                 70000  (3001.91 per sec.)
>>     read/write requests:                 70000  (3254.99 per sec.)
>>     read/write requests:                 70000  (3177.13 per sec.)
>>     read/write requests:                 70000  (3195.49 per sec.)
>> AVERAGE= 3174.97
>>
>>
>> So, I  thought that with just using the __builtin_memcmp 
>> automatically, we
>> may be okay with merging this.
>>
>> However... I do have the following concerns:
>> 1) I'm pretty sure that __builtin_memcmp is gcc only, and that SunStudio
>> may have an issue (yay - wrappers!)
>> 2) Is this perf difference going to be true on different platforms?
>> 3) the various memcmp implementations do seem to be dependent on a few
>> things for performance: alignment, size of data.
>>
>> For what we're seeing in sysbench, the parameters are things like this:
>> "SELECT c from sbtest where id between 9942 and 10041 order by c"
>>
>> 0xc30780,0xc2c340,240
>> 0xc2e560,0xc2c340,240
>> 0xc30780,0xc2e560,240
>> 0xc37390,0xc32f50,240
>> 0xc35170,0xc32f50,240
>> 0xc37390,0xc35170,240
>> 0xc3dcc8,0xc39888,240
>> 0xc3baa8,0xc2e560,240
>> 0xc35170,0xc2e560,240
>> 0xc3baa8,0xc35170,240
>> 0xc2c340,0xc35170,240
>> 0xc35170,0xc3dcc8,240
>> 0xc2c618,0xc35170,240
>> 0xc35170,0xc3d9f0,240
>> 0xc2c8f0,0xc35170,240
>> 0x7fbed85f10f0,0x7fbed85eccb0,240
>>
>> and from the CREATE TABLE:
>>         `c` VARCHAR(120) NOT NULL  COLLATE utf8_general_ci DEFAULT `` ,
>>
>> So here it's large, and aligned.
>>
>> For some other micro-benchmarks I've clocked things looking much
>> different.... so it's possibly query dependent as to what ends up being
>> better (i.e. how much data we're comparing).
>>
>> (1st parameter is number of repetitions, 2nd is number of bytes to 
>> memcmp)
>>
>>
>> For comparing equal values:
>>
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,      7.432 seconds
>> Testing builtin memcmp ....... done,     16.313 seconds
>> Testing loop .... done,      6.144 seconds
>> Testing loop32 .... done,      2.764 seconds
>> Testing loop64 .... done,      2.128 seconds
>> Testing no-op .... done,      1.488 seconds
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 16
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,      5.912 seconds
>> Testing builtin memcmp ....... done,     26.418 seconds
>> Testing loop .... done,     11.201 seconds
>> Testing loop32 .... done,      3.980 seconds
>> Testing loop64 .... done,      2.564 seconds
>> Testing no-op .... done,      1.492 seconds
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 32
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,      6.828 seconds
>> Testing builtin memcmp ....... done,     46.891 seconds
>> Testing loop .... done,     21.473 seconds
>> Testing loop32 .... done,      6.536 seconds
>> Testing loop64 .... done,      3.804 seconds
>> Testing no-op .... done,      1.476 seconds
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 64
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,      9.549 seconds
>> Testing builtin memcmp ....... done,     87.513 seconds
>> Testing loop .... done,     41.455 seconds
>> Testing loop32 .... done,     11.669 seconds
>> Testing loop64 .... done,      6.368 seconds
>> Testing no-op .... done,      1.468 seconds
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 128
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,     15.081 seconds
>> Testing builtin memcmp ....... done,    169.143 seconds
>> Testing loop .... done,     86.397 seconds
>> Testing loop32 .... done,     21.877 seconds
>> Testing loop64 .... done,     11.445 seconds
>> Testing no-op .... done,      1.488 seconds
>> stewart@willster:~/src/test/memcmp$ gcc -O3 -fno-builtin memcmpbench.c
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 256
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,     26.134 seconds
>> Testing loop64 .... done,     21.549 seconds
>> Testing no-op .... done,      1.500 seconds
>>
>>
>>
>> Completely inequal values are all about the same:
>>
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,      3.204 seconds
>> Testing builtin memcmp ....... done,     13.945 seconds
>> Testing loop .... done,      1.896 seconds
>> Testing loop32 .... done,      2.092 seconds
>> Testing loop64 .... done,      2.136 seconds
>> Testing no-op .... done,      1.488 seconds
>>
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 240
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,      5.520 seconds
>> Testing builtin memcmp ....... done,     13.973 seconds
>> Testing loop .... done,      1.912 seconds
>> Testing loop32 .... done,      2.124 seconds
>> Testing loop64 .... done,      2.104 seconds
>> Testing no-op .... done,      1.468 seconds
>>
>>
>> For half the same:
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,      7.340 seconds
>> Testing builtin memcmp ....... done,     19.193 seconds
>> Testing loop .... done,      4.212 seconds
>> Testing loop32 .... done,      2.956 seconds
>> Testing loop64 .... done,      2.112 seconds
>> Testing no-op .... done,      1.476 seconds
>>
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 32
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,      5.920 seconds
>> Testing builtin memcmp ....... done,     34.526 seconds
>> Testing loop .... done,     11.885 seconds
>> Testing loop32 .... done,      4.444 seconds
>> Testing loop64 .... done,      3.388 seconds
>> Testing no-op .... done,      1.468 seconds
>>
>> stewart@willster:~/src/test/memcmp$ ./a.out 168435455 64
>> 168435455 repetitions
>>
>> Testing memcmp ..... done,      6.948 seconds
>> Testing builtin memcmp ....... done,     54.923 seconds
>> Testing loop .... done,     22.081 seconds
>> Testing loop32 .... done,      7.000 seconds
>> Testing loop64 .... done,      4.444 seconds
>> Testing no-op .... done,      1.488 seconds
>>
>>
>> Is concurrency an issue here and not raw single threaded performance?
>>
>> Let's try with 64 threads (on the same 2 core box):
>>
>> 64 threads, __builtin_memcmp:
>>     read/write requests:                 70000  (3184.20 per sec.)
>>     read/write requests:                 70000  (3184.87 per sec.)
>>     read/write requests:                 70000  (3166.83 per sec.)
>>     read/write requests:                 70000  (3085.69 per sec.)
>> AVERAGE=3155.39
>>
>> 64 threads, memcmp:
>>     read/write requests:                 70000  (3218.07 per sec.)
>>     read/write requests:                 70000  (3219.04 per sec.)
>>     read/write requests:                 70000  (3222.48 per sec.)
>>     read/write requests:                 70000  (3116.00 per sec.)
>> AVERAGE=3193.89
>>
>>
>> 64 threads, 32bit cmp loop:
>>     read/write requests:                 70000  (3173.76 per sec.)
>>     read/write requests:                 70000  (3156.23 per sec.)
>>     read/write requests:                 70000  (3206.70 per sec.)
>>     read/write requests:                 70000  (3247.61 per sec.)
>> AVERAGE=3196.07
>>
>> 64 threads, 64bit cmp loop:
>>     read/write requests:                 70000  (3210.02 per sec.)
>>     read/write requests:                 70000  (3218.87 per sec.)
>>     read/write requests:                 70000  (3225.98 per sec.)
>>     read/write requests:                 70000  (3131.60 per sec.)
>> AVERAGE= 3196.61
>>
>> 64 threads, baseline (using original ptr_compare):
>>     read/write requests:                 70000  (3542.95 per sec.)
>>     read/write requests:                 70000  (3556.15 per sec.)
>>     read/write requests:                 70000  (3560.71 per sec.)
>>     read/write requests:                 70000  (3476.80 per sec.)
>> AVERAGE=3534.15
>>
>> i.e. the old ptr_compare whoops the arse of any of the memcmp calls with
>> higher concurrency.
>>
>>
>> So, after this I can conclude:
>> - memcmp is nothing if not consistent across various microbenchmarks
>> - builtin_memcmp seems to help a bit at lower concurrency, not at all at
>> higher and is seldom useful in micro benchmark.
>>
>> - the 64bit loop isn't so good at higher concurrency
>> - the 64bit loop is favourable in microbenchmarks
>> - memcmp is close to the 64bit loop performance (at most only 2x slower)
>>
>> Unrolling loops being the key?
>>
>> Testing loop .... done,     11.813 seconds
>> Testing unrollloop .... done,      7.628 seconds
>> Testing unrollloop32 .... done,      2.532 seconds
>> Testing unrollloop64 .... done,      2.536 seconds
>> Testing loop32 .... done,      4.432 seconds
>> Testing loop64 .... done,      3.792 seconds
>> Testing no-op .... done,      1.276 seconds
>>
>> (32 and 64 do 32,64 bit compares in an unrolled loop)
>>
>> Questions:
>> - Why does my 64bit compare loop not equal performance of ptr_compare
>> unrolled loop? (in a microbenchmark, it beats it). Concurrency again?
>>
>>
>> So what should we do?
>>
>> Certainly a  call to memcmp is easier to understand and keeps the code
>> nice and simple, but we're talking up to a 10% speed hit here at higher
>> concurrency (at least on my hardware).
>>
>> I wonder what the difference is on various other hardware setups.
>>
>> I'd be very interested to see what happens on sparc.
>>
>> I'd also like to see it micro-benchmarked - preferably something we
>> could repeat in future (even in ./configure ? on server startup ?)
>>
>> I'm voting to keep the current ptr_compare code, and hope that somebody
>> provides further explanation on top of what I've looked at here.
>>
>>
>>
>>
>> Below is the patch, followed by the code i used for microbenchmarking
>> (note the commented out 64bit loop i used for the above loop tests):
>>
>> === modified file 'drizzled/filesort.cc'
>> --- drizzled/filesort.cc    2009-04-28 00:17:10 +0000
>> +++ drizzled/filesort.cc    2009-05-08 05:59:11 +0000
>> @@ -1135,7 +1135,7 @@ int merge_buffers(SORTPARAM *param, IO_C
>>    }
>>    else
>>    {
>> -    cmp= get_ptr_compare(sort_length);
>> +    cmp= (qsort2_cmp)ptr_compare;
>>      first_cmp_arg= (void*) &sort_length;
>>    }
>>    priority_queue<BUFFPEK *, vector<BUFFPEK *>, compare_functor >
>> === modified file 'mysys/mf_sort.cc'
>> --- mysys/mf_sort.cc    2009-04-17 21:01:47 +0000
>> +++ mysys/mf_sort.cc    2009-05-08 05:59:11 +0000
>> @@ -34,7 +34,7 @@ void my_string_ptr_sort(unsigned char *b
>>    {
>>      if (size && items)
>>      {
>> -      my_qsort2(base,items, sizeof(unsigned char*), 
>> get_ptr_compare(size),
>> +      my_qsort2(base,items, sizeof(unsigned char*), 
>> (qsort2_cmp)ptr_compare,
>>                  (void*) &size);
>>      }
>>    }
>>
>> === modified file 'mysys/my_sys.h'
>> --- mysys/my_sys.h    2009-04-27 22:05:43 +0000
>> +++ mysys/my_sys.h    2009-05-08 05:59:11 +0000
>> @@ -420,7 +420,12 @@ extern void my_qsort(void *base_ptr, siz
>>                       qsort_cmp cmp);
>>  extern void my_qsort2(void *base_ptr, size_t total_elems, size_t size,
>>                        qsort2_cmp cmp, void *cmp_argument);
>> -extern qsort2_cmp get_ptr_compare(size_t);
>> +
>> +#if defined(__cplusplus)
>> +extern "C"
>> +#endif
>> +int ptr_compare(size_t *compare_length, unsigned char **a, unsigned 
>> char **b);
>> +
>>  void my_store_ptr(unsigned char *buff, size_t pack_length, my_off_t 
>> pos);
>>  my_off_t my_get_ptr(unsigned char *ptr, size_t pack_length);
>>  File create_temp_file(char *to, const char *dir, const char *pfx,
>>
>> === modified file 'mysys/ptr_cmp.cc'
>> --- mysys/ptr_cmp.cc    2009-04-26 16:53:32 +0000
>> +++ mysys/ptr_cmp.cc    2009-05-08 05:59:11 +0000
>> @@ -21,137 +21,33 @@
>>  
>>  #include "mysys/mysys_priv.h"
>>  #include "plugin/myisam/myisampack.h"
>> -
>> -static int ptr_compare(size_t *compare_length, unsigned char **a, 
>> unsigned char **b);
>> -static int ptr_compare_0(size_t *compare_length, unsigned char **a, 
>> unsigned char **b);
>> -static int ptr_compare_1(size_t *compare_length, unsigned char **a, 
>> unsigned char **b);
>> -static int ptr_compare_2(size_t *compare_length, unsigned char **a, 
>> unsigned char **b);
>> -static int ptr_compare_3(size_t *compare_length, unsigned char **a, 
>> unsigned char **b);
>> -
>> -    /* Get a pointer to a optimal byte-compare function for a given 
>> size */
>> -
>> -qsort2_cmp get_ptr_compare (size_t size)
>> -{
>> -  if (size < 4)
>> -    return (qsort2_cmp) ptr_compare;
>> -  switch (size & 3) {
>> -    case 0: return (qsort2_cmp) ptr_compare_0;
>> -    case 1: return (qsort2_cmp) ptr_compare_1;
>> -    case 2: return (qsort2_cmp) ptr_compare_2;
>> -    case 3: return (qsort2_cmp) ptr_compare_3;
>> -    }
>> -  return 0;                    /* Impossible */
>> -}
>> -
>> +#include <string.h>
>>  
>>      /*
>>        Compare to keys to see witch is smaller.
>> -      Loop unrolled to make it quick !!
>>      */
>>  
>> -#define cmp(N) if (first[N] != last[N]) return (int) first[N] - (int) 
>> last[N]
>> -
>> -static int ptr_compare(size_t *compare_length, unsigned char **a, 
>> unsigned char **b)
>> -{
>> -  register int length= *compare_length;
>> -  register unsigned char *first,*last;
>> -
>> -  first= *a; last= *b;
>> -  while (--length)
>> -  {
>> -    if (*first++ != *last++)
>> -      return (int) first[-1] - (int) last[-1];
>> -  }
>> -  return (int) first[0] - (int) last[0];
>> -}
>> -
>> -
>> -static int ptr_compare_0(size_t *compare_length,unsigned char **a, 
>> unsigned char **b)
>> +extern "C"
>> +int ptr_compare(size_t *compare_length, unsigned char **a, unsigned 
>> char **b)
>>  {
>> -  register int length= *compare_length;
>> -  register unsigned char *first,*last;
>> -
>> -  first= *a; last= *b;
>> - loop:
>> -  cmp(0);
>> -  cmp(1);
>> -  cmp(2);
>> -  cmp(3);
>> -  if ((length-=4))
>> +/*  size_t l= *compare_length / sizeof(uint64_t);
>> +  uint64_t *aa= (uint64_t*)*a;
>> +  uint64_t *bb= (uint64_t*)*b;
>> +  while(l--)
>>    {
>> -    first+=4;
>> -    last+=4;
>> -    goto loop;
>> -  }
>> -  return (0);
>> -}
>> -
>> -
>> -static int ptr_compare_1(size_t *compare_length,unsigned char **a, 
>> unsigned char **b)
>> -{
>> -  register int length= *compare_length-1;
>> -  register unsigned char *first,*last;
>> -
>> -  first= *a+1; last= *b+1;
>> -  cmp(-1);
>> - loop:
>> -  cmp(0);
>> -  cmp(1);
>> -  cmp(2);
>> -  cmp(3);
>> -  if ((length-=4))
>> -  {
>> -    first+=4;
>> -    last+=4;
>> -    goto loop;
>> -  }
>> -  return (0);
>> -}
>> -
>> -static int ptr_compare_2(size_t *compare_length,unsigned char **a, 
>> unsigned char **b)
>> -{
>> -  register int length= *compare_length-2;
>> -  register unsigned char *first,*last;
>> -
>> -  first= *a +2 ; last= *b +2;
>> -  cmp(-2);
>> -  cmp(-1);
>> - loop:
>> -  cmp(0);
>> -  cmp(1);
>> -  cmp(2);
>> -  cmp(3);
>> -  if ((length-=4))
>> -  {
>> -    first+=4;
>> -    last+=4;
>> -    goto loop;
>> +    if(*aa != *bb)
>> +    {
>> +      if(*aa < *bb)
>> +    return -1;
>> +      else
>> +    return 1;
>> +    }
>> +    aa++, bb++;
>>    }
>> -  return (0);
>> +  return 0;
>> +*/  return memcmp(*a, *b, *compare_length);
>>  }
>>  
>> -static int ptr_compare_3(size_t *compare_length,unsigned char **a, 
>> unsigned char **b)
>> -{
>> -  register int length= *compare_length-3;
>> -  register unsigned char *first,*last;
>> -
>> -  first= *a +3 ; last= *b +3;
>> -  cmp(-3);
>> -  cmp(-2);
>> -  cmp(-1);
>> - loop:
>> -  cmp(0);
>> -  cmp(1);
>> -  cmp(2);
>> -  cmp(3);
>> -  if ((length-=4))
>> -  {
>> -    first+=4;
>> -    last+=4;
>> -    goto loop;
>> -  }
>> -  return (0);
>> -}
>>  
>>  void my_store_ptr(unsigned char *buff, size_t pack_length, my_off_t pos)
>>  {
>>
>>
>> For those interested, i started with the code at
>> http://gcc.gnu.org/ml/gcc/2002-10/msg01666.html
>>
>> and ended up with something like this:
>>
>> #include <sys/resource.h>
>> #include <sys/time.h>
>> #include <stdio.h>
>> #include <assert.h>
>> #include <stdlib.h>
>>
>> char s1[256];
>> char s2[256];
>>
>> int cmpsize= 240;
>>
>> int
>> speed_memcmp ()
>> {
>>   return memcmp (s1, s2, cmpsize);
>> }
>>
>> int
>> speed_bimemcmp ()
>> {
>>   return __builtin_memcmp (s1, s2, cmpsize);
>> }
>>
>> int
>> speed_loop ()
>> {
>>   int i=cmpsize;
>>   char *a= s1;
>>   char *b= s2;
>>   while(i--)
>>   {
>>     if(*a != *b)
>>       return -1;
>>     a++, b++;
>>   };
>>   return 0;
>> }
>>
>> int
>> speed_loop32 ()
>> {
>>   int i=cmpsize/4;
>>   int *a= (int*)s1;
>>   int *b= (int*)s2;
>>   while(i--)
>>   {
>>     if(*a != *b)
>>       return -1;
>>     a++, b++;
>>   };
>>   return 0;
>> }
>>
>> int
>> speed_loop64 ()
>> {
>>   int i=cmpsize/8;
>>   long long *a= (long long*)s1;
>>   long long *b= (long long*)s2;
>>   while(i--)
>>   {
>>     if(*a != *b)
>>       return -1;
>>     a++, b++;
>>   };
>>
>>   return 0;
>> }
>>
>> int speed_null()
>> {
>>   return 0;
>> }
>>
>>
>> double
>> do_test (int repetitions, int (*test_function) ())
>> {
>>   struct rusage r1, r2;
>>
>>   getrusage(RUSAGE_SELF, &r1);
>>
>>   while (repetitions--)
>>     test_function ();
>>
>>   getrusage(RUSAGE_SELF, &r2);
>>   return (r2.ru_utime.tv_sec - r1.ru_utime.tv_sec) +
>>      (r2.ru_utime.tv_usec - r1.ru_utime.tv_usec) / 1000000.0;
>> }
>>
>> main(int argc, char **argv)
>> {
>>   int repetitions = (argc == 1) ? 0x0fffffff : atoi(argv[1]);
>>   cmpsize = (argc == 2) ? 240 : atoi(argv[2]);
>>
>>   memset(s1, 42, 256);
>>   memset(s2, 42, 256);
>>   memset(s2+(cmpsize/2), 55, cmpsize/2);
>>
>>   printf ("%d repetitions\n\n", repetitions);
>>   printf ("Testing memcmp .....");
>>   fflush (stdout);
>>   printf (" done, %10.3f seconds\n", do_test (repetitions, 
>> speed_memcmp));
>>
>>   printf ("Testing builtin memcmp .......");
>>   fflush (stdout);
>>   printf (" done, %10.3f seconds\n", do_test (repetitions, 
>> speed_bimemcmp));
>>
>>   printf ("Testing loop ....");
>>   fflush (stdout);
>>   printf (" done, %10.3f seconds\n", do_test (repetitions, speed_loop));
>>
>>   printf ("Testing loop32 ....");
>>   fflush (stdout);
>>   printf (" done, %10.3f seconds\n", do_test (repetitions, 
>> speed_loop32));
>>
>>   printf ("Testing loop64 ....");
>>   fflush (stdout);
>>   printf (" done, %10.3f seconds\n", do_test (repetitions, 
>> speed_loop64));
>>
>>   printf ("Testing no-op ....");
>>   fflush (stdout);
>>   printf (" done, %10.3f seconds\n", do_test (repetitions, speed_null));
>>
>>   exit (0);
>> }
>>
>>
> 
> 
Thread
ptr_compare replace with memcmp and associated performanceStewart Smith8 May
  • Re: ptr_compare replace with memcmp and associated performanceEric Day9 May
    • Re: ptr_compare replace with memcmp and associated performanceStewart Smith11 May
  • Re: [Drizzle-discuss] ptr_compare replace with memcmp and associatedperformanceJay Pipes10 May
    • Re: [Drizzle-discuss] ptr_compare replace with memcmp and associatedperformanceRoy Lyseng10 May
  • Re: ptr_compare replace with memcmp and associated performancePeter Benjamin Volk17 May