List:Commits« Previous MessageNext Message »
From:Roy Lyseng Date:November 22 2010 5:15pm
Subject:Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737
View as plain text  
Hi Alexander,

On 22.11.10 16.11, Alexander Barkov wrote:
> Hi Roy,
>
> Thanks for your review!
>
> Please see my comments inline:
>
>
> Roy Lyseng wrote:
>> Hi Alexander,
>>
>> some preliminary notes to the code changes.
>>
>> I still need some time to understand the logic of the patch.
>>
>> On 11.11.10 09.44, Alexander Barkov wrote:
>>> #At file:///home/bar/mysql-bzr/mysql-5.5-bugteam.b57737/ based on
>>> revid:dmitry.shulga@stripped
>>>
>>> 3114 Alexander Barkov 2010-11-11
>>> Bug#57737 Character sets: search fails with like, contraction, index
>>>
>>> Problem: LIKE over an indexed column optimized away good results,
>>> because my_like_range_utf32/utf16 returned wrong ranges for contractions.
>>>
>>> Fix: Make my_like_range_utf32/utf16 return correct ranges for contractions.
>>>
>>> Note, this code presens in my_like_range_mb and my_like_range_ucs2.
>>> It was forgotten in utf32/utf16 versions (during mysql-6.0 push/revert
> mess).
>>>
>>>
>>> added:
>>> @ mysql-test/include/ctype_czech.inc
>>> Adding shared test code
>>>
>>> modified:
>>> @ include/m_ctype.h
>>> Adding helper functions for contractions.
>>>
>>> @ mysql-test/r/ctype_uca.result
>>> @ mysql-test/r/ctype_utf16_uca.result
>>> @ mysql-test/r/ctype_utf32_uca.result
>>> @ mysql-test/t/ctype_uca.test
>>> @ mysql-test/t/ctype_utf16_uca.test
>>> @ mysql-test/t/ctype_utf32_uca.test
>>> Adding tests.
>>>
>>> @ strings/ctype-mb.c
>>> Pad function did not put the last character.
>>>
>>> @ strings/ctype-ucs2.c
>>> - my_fill_mb2 did not put the high byte, as previously
>>> it was used to put only characters in ASCII range.
>>> Now it puts high byte as well
>>> (needed to pupulate cs->max_sort_char correctly).
>>>
>>> - Adding contraction handling into my_like_range_utf16
>>> - Removing legacy "Temporary fix" from my_like_range_utf16
>>> - Using my_fill_mb2 instead of loop (duplicate code eliminations)
>>> - Adding contraction handling into my_like_range_utf32
>>>
>>> added:
>>> mysql-test/include/ctype_czech.inc
>>> modified:
>>> include/m_ctype.h
>>> mysql-test/r/ctype_uca.result
>>> mysql-test/r/ctype_utf16_uca.result
>>> mysql-test/r/ctype_utf32_uca.result
>>> mysql-test/t/ctype_uca.test
>>> mysql-test/t/ctype_utf16_uca.test
>>> mysql-test/t/ctype_utf32_uca.test
>>> strings/ctype-mb.c
>>> strings/ctype-ucs2.c
>
> <skip>
>
>>> memcpy(str, buf, buflen);
>>>
>>> === modified file 'strings/ctype-ucs2.c'
>>> --- a/strings/ctype-ucs2.c 2010-10-20 19:02:59 +0000
>>> +++ b/strings/ctype-ucs2.c 2010-11-11 08:39:03 +0000
>>> @@ -903,7 +903,7 @@ static void
>>> my_fill_mb2(CHARSET_INFO *cs __attribute__((unused)),
>>> char *s, size_t l, int fill)
>>
>> Type of argument 'fill' should be uint16.
>>> {
>>> - for ( ; l>= 2; s[0]= 0, s[1]= fill, s+= 2, l-= 2);
>>> + for ( ; l>= 2; s[0]= (fill>> 8), s[1]= (fill& 0xFF), s+= 2,
> l-= 2);
>>> }
>
>
> my_fill_mb2 is used to initialize the "fill" member of
> MY_CHARSET_HANDLER structures, for example:
>
> MY_CHARSET_HANDLER my_charset_utf16_handler=
> {
> ...
> my_fill_mb2,
> ...
> };
>
> We cannot change prototype for my_fill_mb2 alone,
> without changing MY_CHARSET_HANDLER definition and
> therefore all other MY_CHARSET_HANDLER constants.

Perhaps you can typecast the argument in the call, then?
>
>
>>>
>>>
>>> @@ -1590,14 +1590,19 @@ my_like_range_utf16(CHARSET_INFO *cs,
>>> char *min_str,char *max_str,
>>> size_t *min_length,size_t *max_length)
>>
>> Can you please document arguments cs, w_one, w_many, min_length and max_length?
>> And likewise for my_like_range_utf32().
>
>
> Yes, will do that in the next patch version.

Thanks
>
>
>>
>> The 'escape' argument is a single byte character. Is that correct in the
>> general case? If so, I think the fact should be documented.
>
> I think it's documented here:
>
> http://dev.mysql.com/doc/refman/5.0/en/string-comparison-functions.html#operator_like
>
>
> "The escape sequence should be empty or one character long."

It says one character. It does not say that it is restricted to one byte.
>
>
>>
>>> {
>>> - const char *end=ptr+ptr_length;
>>> - char *min_org=min_str;
>>> - char *min_end=min_str+res_length;
>>> + const char *end= ptr + ptr_length;
>>> + char *min_org= min_str;
>>
>> min_org can be 'const char *'.
>
> It could be 'const char *'.
>
> I have only one doubt: for some reasons in all similar cases we use
> 'char *' in this context:
>
>
> int func(char *s)
> {
> const char *s0= s;
> ... /* traverse through s */
> return s - s0;
> }
>
> I can see it all around the code.
>
> I thought the reason was that some compilers can produce warnings
> for arithmetic operations between pointers with different qualifiers
> (i.e. with 'const' and without 'const').

Hm, could be. I tried it out on Linux, and it did not give any warnings. Not a 
big issue though...
>
>
>
>>
>>> + char *min_end= min_str + res_length;
>>> + char *max_end= max_str + res_length;
>>> size_t charlen= res_length / cs->mbmaxlen;
>>> + my_bool have_contractions= my_cs_have_contractions(cs);
>>
>> For symmetry, you may add here DBUG_ASSERT((res_length % 2) == 0);
>
> I agree.
>
>>
>>>
>>> for ( ; ptr + 1< end&& min_str + 1< min_end&&
> charlen> 0
>>> ; ptr+=2, charlen--)
>>> {
>>> + my_wc_t wc2;
>>> + int res;
>>
>> The 'res' local variable is redundant (never tested).
>
> Oops. This is wrong that it's not used.
> I forgot to fix 'ptr+= 2' to 'ptr+= res'.
> Thanks for noticing thing!

:)

>
>>
>>> +
>>> if (ptr[0] == '\0'&& ptr[1] == escape&& ptr + 1< end)
>>> {
>>> ptr+=2; /* Skip escape */
>>> @@ -1623,34 +1628,35 @@ my_like_range_utf16(CHARSET_INFO *cs,
>>> *min_length= ((cs->state& MY_CS_BINSORT) ? (size_t) (min_str -
> min_org) :
>>> res_length);
>>> *max_length= res_length;
>>> - do {
>>> - *min_str++ = 0;
>>> - *min_str++ = 0;
>>> - *max_str++ = (char) (cs->max_sort_char>> 8);
>>> - *max_str++ = (char) (cs->max_sort_char& 255);
>>> - } while (min_str + 1< min_end);
>>> - return FALSE;
>>> + goto pad_min_max;
>>> + }
>>> + if (have_contractions&& ptr + 3< end&&
>>> + ptr[0] == '\0'&&
>>> + my_cs_can_be_contraction_head(cs, (uchar) ptr[1])&&
>>
>> The cast of ptr[1] is not according to the prototype of
>> my_cs_can_be_contraction_head().
>
> The idea is to cast from "signed char" to "unsigned char" here.
> Cast to my_wc_t is done automatically.
>
> It could be:
>
> my_cs_can_be_contraction_head(cs, (my_wc_t) (uchar) ptr[1])&&
>
> to have everything done explicitly.
>
> But I am not sure if we really need this.

Is the (uchar) cast needed? At least, the (my_wc_t) would show that you have 
thought about the cast issue. But: no big deal, it's up to you.
>
>
>>
>>> + (res= my_utf16_uni(cs,&wc2, (uchar*) ptr + 2, (uchar*) end))> 0)
>>> + {
>>> + /*
>>> + Check contraction head followed by a wildcard,
>>> + or contraction that does not fit to result
>>> + */
>>> + if ((wc2 == (my_wc_t) w_one || wc2 == (my_wc_t) w_many) ||
>>> + (charlen == 1&&
>>> + my_cs_can_be_contraction_tail(cs, wc2)&&
>>> + my_cs_contraction2_weight(cs, (uchar) ptr[1], wc2)))
>>> + {
>>> + *min_length= *max_length= res_length;
>>> + goto pad_min_max;
>>> + }
>>> }
>>> *min_str++= *max_str++ = ptr[0];
>>> *min_str++= *max_str++ = ptr[1];
>>> }
>>>
>>> - /* Temporary fix for handling w_one at end of string (key compression) */
>>> - {
>>> - char *tmp;
>>> - for (tmp= min_str ; tmp-1> min_org&& tmp[-1] == '\0'&&
> tmp[-2]=='\0';)
>>> - {
>>> - *--tmp=' ';
>>> - *--tmp='\0';
>>> - }
>>> - }
>>> -
>>> - *min_length= *max_length = (size_t) (min_str - min_org);
>>> - while (min_str + 1< min_end)
>>> - {
>>> - *min_str++ = *max_str++ = '\0';
>>> - *min_str++ = *max_str++ = ' '; /* Because if key compression */
>>> - }
>>> + *min_length= *max_length= (size_t) (min_str - min_org);
>>> +
>>> +pad_min_max:
>>> + my_fill_mb2(cs, min_str, min_end - min_str, cs->min_sort_char);
>>> + my_fill_mb2(cs, max_str, max_end - max_str, cs->max_sort_char);
>>> return FALSE;
>>> }
>>>
>>> @@ -2583,12 +2589,13 @@ my_like_range_utf32(CHARSET_INFO *cs,
>>> char *min_end= min_str + res_length;
>>> char *max_end= max_str + res_length;
>>> size_t charlen= res_length / cs->mbmaxlen;
>>> + my_bool have_contractions= my_cs_have_contractions(cs);
>>>
>>> DBUG_ASSERT((res_length % 4) == 0);
>>>
>>> for ( ; charlen> 0; ptr+= 4, charlen--)
>>> {
>>> - my_wc_t wc;
>>> + my_wc_t wc, wc2;
>>> int res;
>>
>> The 'res' local variable is redundant.
>>
>>> if ((res= my_utf32_uni(cs,&wc, (uchar*) ptr, (uchar*) end))< 0)
>>> {
>>> @@ -2640,6 +2647,24 @@ my_like_range_utf32(CHARSET_INFO *cs,
>>> goto pad_min_max;
>>> }
>>>
>>> + if (have_contractions&&
>>> + my_cs_can_be_contraction_head(cs, wc)&&
>>> + (res= my_utf32_uni(cs,&wc2, (uchar*) ptr + 4, (uchar*) end))> 0)
>>> + {
>>> + /*
>>> + Check contraction head followed by a wildcard
>>> + or contraction which does not fit to result
>>> + */
>>> + if ((wc2 == (my_wc_t) w_one || wc2 == (my_wc_t) w_many) ||
>>> + (charlen == 1&&
>>> + my_cs_can_be_contraction_tail(cs, wc2)&&
>>> + my_cs_contraction2_weight(cs, wc, wc2)))
>>> + {
>>> + *min_length= *max_length= res_length;
>>> + goto pad_min_max;
>>> + }
>>> + }
>>> +
>>> /* Normal character */
>>> if (my_uni_utf32(cs, wc, (uchar*) min_str, (uchar*) min_end) != 4 ||
>>> my_uni_utf32(cs, wc, (uchar*) max_str, (uchar*) max_end) != 4)
>>
>> In escape handling, this function performs:
>> *min_str++= 4;
>> *max_str++= 4;
>> Is that correct?
>
> utf32 is a fixed length character set, 4 bytes per character.
> Looks correct.

You are assigning the value 4 to *max_str and *min_str, before incrementing the 
pointers. Is that correct?
>
>>
>> The return values in the two functions seem to be inconsistent - There are no
>> return TRUE in my_like_range_utf16() but two in my_like_range_utf32().
>
> I think utf16 version should return TRUE on an incorrect byte sequences,
> like utf32 version does.

Ok.
>
>
> I'm starting to think that it was over-engineering to
> have separate functions for ucs2, utf16, utf32.
>
> It really makes sense for things like strnncoll(),
> which affect performance seriously, as they executed
> for every row.
>
> But in case of my_like_range it could be a single function
> which handles all three character sets.
> my_like_range_xxx() is executed only one time per query,
> so some character set specific optimization in every
> function hardly affects the overall performance.
>
> In 5.6 we'll have some more collation features,
> which will need new code in my_like_range-family
> functions, which means duplicate code again.
>
>
> I think I'll put utf16 and utf32 versions together
> into a single function.
>
>
> ucs2 version have been here since 4.1, so it's a kind of stable thing.
> Let's leave it alone, at least until 5.6.

Sounds reasonable.

Thanks,
Roy
Thread
bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Alexander Barkov11 Nov
  • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Roy Lyseng19 Nov
    • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Alexander Barkov22 Nov
      • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Roy Lyseng22 Nov
        • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Alexander Barkov23 Nov
  • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Roy Lyseng22 Nov