List:Commits« Previous MessageNext Message »
From:Alexander Barkov Date:November 22 2010 3:11pm
Subject:Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737
View as plain text  
Hi Roy,

Thanks for your review!

Please see my comments inline:


Roy Lyseng wrote:
> Hi Alexander,
> 
> some preliminary notes to the code changes.
> 
> I still need some time to understand the logic of the patch.
> 
> On 11.11.10 09.44, Alexander Barkov wrote:
>> #At file:///home/bar/mysql-bzr/mysql-5.5-bugteam.b57737/ based on 
>> revid:dmitry.shulga@stripped
>>
>>   3114 Alexander Barkov    2010-11-11
>>        Bug#57737 Character sets: search fails with like, contraction, 
>> index
>>
>>        Problem: LIKE over an indexed column optimized away good results,
>>        because my_like_range_utf32/utf16 returned wrong ranges for 
>> contractions.
>>
>>        Fix: Make my_like_range_utf32/utf16 return correct ranges for 
>> contractions.
>>
>>        Note, this code presens in my_like_range_mb and 
>> my_like_range_ucs2.
>>        It was forgotten in utf32/utf16 versions (during mysql-6.0 
>> push/revert mess).
>>
>>
>>        added:
>>          @ mysql-test/include/ctype_czech.inc
>>          Adding shared test code
>>
>>        modified:
>>          @ include/m_ctype.h
>>          Adding helper functions for contractions.
>>
>>          @ mysql-test/r/ctype_uca.result
>>          @ mysql-test/r/ctype_utf16_uca.result
>>          @ mysql-test/r/ctype_utf32_uca.result
>>          @ mysql-test/t/ctype_uca.test
>>          @ mysql-test/t/ctype_utf16_uca.test
>>          @ mysql-test/t/ctype_utf32_uca.test
>>          Adding tests.
>>
>>          @ strings/ctype-mb.c
>>          Pad function did not put the last character.
>>
>>          @ strings/ctype-ucs2.c
>>          - my_fill_mb2 did not put the high byte, as previously
>>            it was used to put only characters in ASCII range.
>>            Now it puts high byte as well
>>            (needed to pupulate cs->max_sort_char correctly).
>>
>>          - Adding contraction handling into my_like_range_utf16
>>          - Removing legacy "Temporary fix" from my_like_range_utf16
>>          - Using my_fill_mb2 instead of loop (duplicate code 
>> eliminations)
>>          - Adding contraction handling into my_like_range_utf32
>>
>>      added:
>>        mysql-test/include/ctype_czech.inc
>>      modified:
>>        include/m_ctype.h
>>        mysql-test/r/ctype_uca.result
>>        mysql-test/r/ctype_utf16_uca.result
>>        mysql-test/r/ctype_utf32_uca.result
>>        mysql-test/t/ctype_uca.test
>>        mysql-test/t/ctype_utf16_uca.test
>>        mysql-test/t/ctype_utf32_uca.test
>>        strings/ctype-mb.c
>>        strings/ctype-ucs2.c

<skip>

>>         memcpy(str, buf, buflen);
>>
>> === modified file 'strings/ctype-ucs2.c'
>> --- a/strings/ctype-ucs2.c    2010-10-20 19:02:59 +0000
>> +++ b/strings/ctype-ucs2.c    2010-11-11 08:39:03 +0000
>> @@ -903,7 +903,7 @@ static void
>>   my_fill_mb2(CHARSET_INFO *cs __attribute__((unused)),
>>               char *s, size_t l, int fill)
> 
> Type of argument 'fill' should be uint16.
>>   {
>> -  for ( ; l>= 2; s[0]= 0, s[1]= fill, s+= 2, l-= 2);
>> +  for ( ; l>= 2; s[0]= (fill>>  8), s[1]= (fill&  0xFF), s+= 2, l-=
> 2);
>>   }


my_fill_mb2 is used to initialize the "fill" member of
MY_CHARSET_HANDLER structures, for example:

MY_CHARSET_HANDLER my_charset_utf16_handler=
{
   ...
   my_fill_mb2,
   ...
};

We cannot change prototype for my_fill_mb2 alone,
without changing MY_CHARSET_HANDLER definition and
therefore all other MY_CHARSET_HANDLER constants.


>>
>>
>> @@ -1590,14 +1590,19 @@ my_like_range_utf16(CHARSET_INFO *cs,
>>                       char *min_str,char *max_str,
>>                       size_t *min_length,size_t *max_length)
> 
> Can you please document arguments cs, w_one, w_many, min_length and 
> max_length?
> And likewise for my_like_range_utf32().


Yes, will do that in the next patch version.


> 
> The 'escape' argument is a single byte character. Is that correct in the 
> general case? If so, I think the fact should be documented.

I think it's documented here:
http://dev.mysql.com/doc/refman/5.0/en/string-comparison-functions.html#operator_like

"The escape sequence should be empty or one character long."


> 
>>   {
>> -  const char *end=ptr+ptr_length;
>> -  char *min_org=min_str;
>> -  char *min_end=min_str+res_length;
>> +  const char *end= ptr + ptr_length;
>> +  char *min_org= min_str;
> 
> min_org can be 'const char *'.

It could be 'const char *'.

I have only one doubt: for some reasons in all similar cases we use
'char *' in this context:


int func(char *s)
{
   const char *s0= s;
   ... /* traverse through s */
   return s - s0;
}

I can see it all around the code.

I thought the reason was that some compilers can produce warnings
for arithmetic operations between pointers with different qualifiers
(i.e. with 'const' and without 'const').



> 
>> +  char *min_end= min_str + res_length;
>> +  char *max_end= max_str + res_length;
>>     size_t charlen= res_length / cs->mbmaxlen;
>> +  my_bool have_contractions= my_cs_have_contractions(cs);
> 
> For symmetry, you may add here DBUG_ASSERT((res_length % 2) == 0);

I agree.

> 
>>
>>     for ( ; ptr + 1<  end&&  min_str + 1<  min_end&& 
> charlen>  0
>>           ; ptr+=2, charlen--)
>>     {
>> +    my_wc_t wc2;
>> +    int res;
> 
> The 'res' local variable is redundant (never tested).

Oops. This is wrong that it's not used.
I forgot to fix 'ptr+= 2' to 'ptr+= res'.
Thanks for noticing thing!

> 
>> +
>>       if (ptr[0] == '\0'&&  ptr[1] == escape&&  ptr + 1< 
> end)
>>       {
>>         ptr+=2;                                        /* Skip escape */
>> @@ -1623,34 +1628,35 @@ my_like_range_utf16(CHARSET_INFO *cs,
>>         *min_length= ((cs->state&  MY_CS_BINSORT) ? (size_t) (min_str 
>> - min_org) :
>>                       res_length);
>>         *max_length= res_length;
>> -      do {
>> -        *min_str++ = 0;
>> -        *min_str++ = 0;
>> -        *max_str++ = (char) (cs->max_sort_char>>  8);
>> -        *max_str++ = (char) (cs->max_sort_char&  255);
>> -      } while (min_str + 1<  min_end);
>> -      return FALSE;
>> +      goto pad_min_max;
>> +    }
>> +    if (have_contractions&&  ptr + 3<  end&&
>> +        ptr[0] == '\0'&&
>> +        my_cs_can_be_contraction_head(cs, (uchar) ptr[1])&&
> 
> The cast of ptr[1] is not according to the prototype of 
> my_cs_can_be_contraction_head().

The idea is to cast from "signed char" to "unsigned char" here.
Cast to my_wc_t is done automatically.

It could be:

  my_cs_can_be_contraction_head(cs, (my_wc_t) (uchar) ptr[1])&&

  to have everything done explicitly.

But I am not sure if we really need this.


> 
>> +        (res= my_utf16_uni(cs,&wc2, (uchar*) ptr + 2, (uchar*) 
>> end))>  0)
>> +    {
>> +      /*
>> +         Check contraction head followed by a wildcard,
>> +         or contraction that does not fit to result
>> +      */
>> +      if ((wc2 == (my_wc_t) w_one || wc2 == (my_wc_t) w_many) ||
>> +          (charlen == 1&&
>> +           my_cs_can_be_contraction_tail(cs, wc2)&&
>> +           my_cs_contraction2_weight(cs, (uchar) ptr[1], wc2)))
>> +      {
>> +        *min_length= *max_length= res_length;
>> +        goto pad_min_max;
>> +      }
>>       }
>>       *min_str++= *max_str++ = ptr[0];
>>       *min_str++= *max_str++ = ptr[1];
>>     }
>>
>> -  /* Temporary fix for handling w_one at end of string (key 
>> compression) */
>> -  {
>> -    char *tmp;
>> -    for (tmp= min_str ; tmp-1>  min_org&&  tmp[-1] == '\0'&& 
> 
>> tmp[-2]=='\0';)
>> -    {
>> -      *--tmp=' ';
>> -      *--tmp='\0';
>> -    }
>> -  }
>> -
>> -  *min_length= *max_length = (size_t) (min_str - min_org);
>> -  while (min_str + 1<  min_end)
>> -  {
>> -    *min_str++ = *max_str++ = '\0';
>> -    *min_str++ = *max_str++ = ' ';      /* Because if key compression */
>> -  }
>> +  *min_length= *max_length= (size_t) (min_str - min_org);
>> +
>> +pad_min_max:
>> +  my_fill_mb2(cs, min_str, min_end - min_str, cs->min_sort_char);
>> +  my_fill_mb2(cs, max_str, max_end - max_str, cs->max_sort_char);
>>     return FALSE;
>>   }
>>
>> @@ -2583,12 +2589,13 @@ my_like_range_utf32(CHARSET_INFO *cs,
>>     char *min_end= min_str + res_length;
>>     char *max_end= max_str + res_length;
>>     size_t charlen= res_length / cs->mbmaxlen;
>> +  my_bool have_contractions= my_cs_have_contractions(cs);
>>
>>     DBUG_ASSERT((res_length % 4) == 0);
>>
>>     for ( ; charlen>  0; ptr+= 4, charlen--)
>>     {
>> -    my_wc_t wc;
>> +    my_wc_t wc, wc2;
>>       int res;
> 
> The 'res' local variable is redundant.
> 
>>       if ((res= my_utf32_uni(cs,&wc, (uchar*) ptr, (uchar*) end))<  0)
>>       {
>> @@ -2640,6 +2647,24 @@ my_like_range_utf32(CHARSET_INFO *cs,
>>         goto pad_min_max;
>>       }
>>
>> +    if (have_contractions&&
>> +        my_cs_can_be_contraction_head(cs, wc)&&
>> +        (res= my_utf32_uni(cs,&wc2, (uchar*) ptr + 4, (uchar*) 
>> end))>  0)
>> +    {
>> +      /*
>> +         Check contraction head followed by a wildcard
>> +         or contraction which does not fit to result
>> +      */
>> +      if ((wc2 == (my_wc_t) w_one || wc2 == (my_wc_t) w_many) ||
>> +          (charlen == 1&&
>> +           my_cs_can_be_contraction_tail(cs, wc2)&&
>> +           my_cs_contraction2_weight(cs, wc, wc2)))
>> +      {
>> +        *min_length= *max_length= res_length;
>> +        goto pad_min_max;
>> +      }
>> +    }
>> +
>>       /* Normal character */
>>       if (my_uni_utf32(cs, wc, (uchar*) min_str, (uchar*) min_end) != 
>> 4 ||
>>           my_uni_utf32(cs, wc, (uchar*) max_str, (uchar*) max_end) != 4)
> 
> In escape handling, this function performs:
>   *min_str++= 4;
>   *max_str++= 4;
> Is that correct?

utf32 is a fixed length character set, 4 bytes per character.
Looks correct.

> 
> The return values in the two functions seem to be inconsistent - There 
> are no return TRUE in my_like_range_utf16() but two in 
> my_like_range_utf32().

I think utf16 version should return TRUE on an incorrect byte sequences,
like utf32 version does.


I'm starting to think that it was over-engineering to
have separate functions for ucs2, utf16, utf32.

It really makes sense for things like strnncoll(),
which affect performance seriously, as they executed
for every row.

But in case of my_like_range it could be a single function
which handles all three character sets.
my_like_range_xxx() is executed only one time per query,
so some character set specific optimization in every
function hardly affects the overall performance.

In 5.6 we'll have some more collation features,
which will need new code in my_like_range-family
functions, which means duplicate code again.


I think I'll put utf16 and utf32 versions together
into a single function.


ucs2 version have been here since 4.1, so it's a kind of stable thing.
Let's leave it alone, at least until 5.6.

Thanks.


> 
> Thanks,
> Roy
> 

Thread
bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Alexander Barkov11 Nov
  • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Roy Lyseng19 Nov
    • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Alexander Barkov22 Nov
      • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Roy Lyseng22 Nov
        • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Alexander Barkov23 Nov
  • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Roy Lyseng22 Nov