List:Commits« Previous MessageNext Message »
From:Alexander Barkov Date:November 23 2010 1:13pm
Subject:Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737
View as plain text  
Hi Roy,

Please find the new patch here:

http://lists.mysql.com/commits/124735

As discussed earlier, it removes three separate functions
for UCS2/UTF16/UTF32 and introduces a single generic function
which is suitable for all of them.

Also, I added SQL functions (available in debug build only)
to test ranges:

SELECT
   LIKE_RANGE_MIN(pattern, result_len),
   LIKE_RANGE_MIN(pattern, result_len);


Roy Lyseng wrote:
> Hi Alexander,
> 
> On 22.11.10 16.11, Alexander Barkov wrote:
>> Hi Roy,
>>
>> Thanks for your review!
>>
>> Please see my comments inline:
>>
>>
>> Roy Lyseng wrote:
>>> Hi Alexander,
>>>
>>> some preliminary notes to the code changes.
>>>
>>> I still need some time to understand the logic of the patch.
>>>
>>> On 11.11.10 09.44, Alexander Barkov wrote:
>>>> #At file:///home/bar/mysql-bzr/mysql-5.5-bugteam.b57737/ based on
>>>> revid:dmitry.shulga@stripped
>>>>
>>>> 3114 Alexander Barkov 2010-11-11
>>>> Bug#57737 Character sets: search fails with like, contraction, index
>>>>
>>>> Problem: LIKE over an indexed column optimized away good results,
>>>> because my_like_range_utf32/utf16 returned wrong ranges for 
>>>> contractions.
>>>>
>>>> Fix: Make my_like_range_utf32/utf16 return correct ranges for 
>>>> contractions.
>>>>
>>>> Note, this code presens in my_like_range_mb and my_like_range_ucs2.
>>>> It was forgotten in utf32/utf16 versions (during mysql-6.0 
>>>> push/revert mess).
>>>>
>>>>
>>>> added:
>>>> @ mysql-test/include/ctype_czech.inc
>>>> Adding shared test code
>>>>
>>>> modified:
>>>> @ include/m_ctype.h
>>>> Adding helper functions for contractions.
>>>>
>>>> @ mysql-test/r/ctype_uca.result
>>>> @ mysql-test/r/ctype_utf16_uca.result
>>>> @ mysql-test/r/ctype_utf32_uca.result
>>>> @ mysql-test/t/ctype_uca.test
>>>> @ mysql-test/t/ctype_utf16_uca.test
>>>> @ mysql-test/t/ctype_utf32_uca.test
>>>> Adding tests.
>>>>
>>>> @ strings/ctype-mb.c
>>>> Pad function did not put the last character.
>>>>
>>>> @ strings/ctype-ucs2.c
>>>> - my_fill_mb2 did not put the high byte, as previously
>>>> it was used to put only characters in ASCII range.
>>>> Now it puts high byte as well
>>>> (needed to pupulate cs->max_sort_char correctly).
>>>>
>>>> - Adding contraction handling into my_like_range_utf16
>>>> - Removing legacy "Temporary fix" from my_like_range_utf16
>>>> - Using my_fill_mb2 instead of loop (duplicate code eliminations)
>>>> - Adding contraction handling into my_like_range_utf32
>>>>
>>>> added:
>>>> mysql-test/include/ctype_czech.inc
>>>> modified:
>>>> include/m_ctype.h
>>>> mysql-test/r/ctype_uca.result
>>>> mysql-test/r/ctype_utf16_uca.result
>>>> mysql-test/r/ctype_utf32_uca.result
>>>> mysql-test/t/ctype_uca.test
>>>> mysql-test/t/ctype_utf16_uca.test
>>>> mysql-test/t/ctype_utf32_uca.test
>>>> strings/ctype-mb.c
>>>> strings/ctype-ucs2.c
>>
>> <skip>
>>
>>>> memcpy(str, buf, buflen);
>>>>
>>>> === modified file 'strings/ctype-ucs2.c'
>>>> --- a/strings/ctype-ucs2.c 2010-10-20 19:02:59 +0000
>>>> +++ b/strings/ctype-ucs2.c 2010-11-11 08:39:03 +0000
>>>> @@ -903,7 +903,7 @@ static void
>>>> my_fill_mb2(CHARSET_INFO *cs __attribute__((unused)),
>>>> char *s, size_t l, int fill)
>>>
>>> Type of argument 'fill' should be uint16.
>>>> {
>>>> - for ( ; l>= 2; s[0]= 0, s[1]= fill, s+= 2, l-= 2);
>>>> + for ( ; l>= 2; s[0]= (fill>> 8), s[1]= (fill& 0xFF), s+=
> 2, l-= 2);
>>>> }
>>
>>
>> my_fill_mb2 is used to initialize the "fill" member of
>> MY_CHARSET_HANDLER structures, for example:
>>
>> MY_CHARSET_HANDLER my_charset_utf16_handler=
>> {
>> ...
>> my_fill_mb2,
>> ...
>> };
>>
>> We cannot change prototype for my_fill_mb2 alone,
>> without changing MY_CHARSET_HANDLER definition and
>> therefore all other MY_CHARSET_HANDLER constants.
> 
> Perhaps you can typecast the argument in the call, then?

I added DBUG_ASSERT() in my_fill_mb2 to make sure we're not
trying to set a value higher than U+FFFF.


I did not add cast in the calls:

   cs->cset->fill(cs, min_str, min_end - min_str - res_length_diff,...
   cs->cset->fill(cs, max_str, max_end - max_str - res_length_diff,...

as it should be passed correctly as size_t automatically
and without warnings.

>>
>>
>>>>
>>>>
>>>> @@ -1590,14 +1590,19 @@ my_like_range_utf16(CHARSET_INFO *cs,
>>>> char *min_str,char *max_str,
>>>> size_t *min_length,size_t *max_length)
>>>
>>> Can you please document arguments cs, w_one, w_many, min_length and 
>>> max_length?
>>> And likewise for my_like_range_utf32().
>>
>>
>> Yes, will do that in the next patch version.
> 
> Thanks

Done.

>>
>>
>>>
>>> The 'escape' argument is a single byte character. Is that correct in the
>>> general case? If so, I think the fact should be documented.
>>
>> I think it's documented here:
>>
> http://dev.mysql.com/doc/refman/5.0/en/string-comparison-functions.html#operator_like 
>>
>>
>>
>> "The escape sequence should be empty or one character long."
> 
> It says one character. It does not say that it is restricted to one byte.

It needs to be fixed, to make Unicode and Asian character sets work 
properly.

The LIKE operator itself does support multi-byte one-character
long ESCAPE sequence. Only like_range does not.

Looks like a bug (legacy from 4.0 times which has not been addressed yet).


I'd like to do it in a separate patch.
I think we should put 'escape', 'one' and 'many' into a structure:

typedef struct my_wild_param_st
{
   my_wc_t escape;
   my_wc_t one;
   my_wc_t many;
} MY_WILD_PARAM;

and pass this structure to both xxx_wildcmp() and xxx_like_range() 
functions.

I suggest to do it in 5.6.


>>
>>
>>>
>>>> {
>>>> - const char *end=ptr+ptr_length;
>>>> - char *min_org=min_str;
>>>> - char *min_end=min_str+res_length;
>>>> + const char *end= ptr + ptr_length;
>>>> + char *min_org= min_str;
>>>
>>> min_org can be 'const char *'.
>>
>> It could be 'const char *'.
>>
>> I have only one doubt: for some reasons in all similar cases we use
>> 'char *' in this context:
>>
>>
>> int func(char *s)
>> {
>> const char *s0= s;
>> ... /* traverse through s */
>> return s - s0;
>> }
>>
>> I can see it all around the code.
>>
>> I thought the reason was that some compilers can produce warnings
>> for arithmetic operations between pointers with different qualifiers
>> (i.e. with 'const' and without 'const').
> 
> Hm, could be. I tried it out on Linux, and it did not give any warnings. 
> Not a big issue though...

I changed min_org and max_org to 'const char *'.
Let's see if we get any warnings on the non-gcc pushbuild machines.


>>
>>
>>
>>>
>>>> + char *min_end= min_str + res_length;
>>>> + char *max_end= max_str + res_length;
>>>> size_t charlen= res_length / cs->mbmaxlen;
>>>> + my_bool have_contractions= my_cs_have_contractions(cs);
>>>
>>> For symmetry, you may add here DBUG_ASSERT((res_length % 2) == 0);
>>
>> I agree.


Instead of DBUG_ASSERT I modified the code to fill up the 'odd'
reminder to 0x00.


>>
>>>
>>>>
>>>> for ( ; ptr + 1< end&& min_str + 1< min_end&&
> charlen> 0
>>>> ; ptr+=2, charlen--)
>>>> {
>>>> + my_wc_t wc2;
>>>> + int res;
>>>
>>> The 'res' local variable is redundant (never tested).
>>
>> Oops. This is wrong that it's not used.
>> I forgot to fix 'ptr+= 2' to 'ptr+= res'.
>> Thanks for noticing thing!
> 
> :)

Fixed in the new version.


> 
>>
>>>
>>>> +
>>>> if (ptr[0] == '\0'&& ptr[1] == escape&& ptr + 1< end)
>>>> {
>>>> ptr+=2; /* Skip escape */
>>>> @@ -1623,34 +1628,35 @@ my_like_range_utf16(CHARSET_INFO *cs,
>>>> *min_length= ((cs->state& MY_CS_BINSORT) ? (size_t) (min_str - 
>>>> min_org) :
>>>> res_length);
>>>> *max_length= res_length;
>>>> - do {
>>>> - *min_str++ = 0;
>>>> - *min_str++ = 0;
>>>> - *max_str++ = (char) (cs->max_sort_char>> 8);
>>>> - *max_str++ = (char) (cs->max_sort_char& 255);
>>>> - } while (min_str + 1< min_end);
>>>> - return FALSE;
>>>> + goto pad_min_max;
>>>> + }
>>>> + if (have_contractions&& ptr + 3< end&&
>>>> + ptr[0] == '\0'&&
>>>> + my_cs_can_be_contraction_head(cs, (uchar) ptr[1])&&
>>>
>>> The cast of ptr[1] is not according to the prototype of
>>> my_cs_can_be_contraction_head().
>>
>> The idea is to cast from "signed char" to "unsigned char" here.
>> Cast to my_wc_t is done automatically.
>>
>> It could be:
>>
>> my_cs_can_be_contraction_head(cs, (my_wc_t) (uchar) ptr[1])&&
>>
>> to have everything done explicitly.
>>
>> But I am not sure if we really need this.
> 
> Is the (uchar) cast needed? At least, the (my_wc_t) would show that you 
> have thought about the cast issue. But: no big deal, it's up to you.

In the new version I pass variables of my_wc_t type to the contraction
helper routines. So now cast is needed anymore.


>>
>>
>>>
>>>> + (res= my_utf16_uni(cs,&wc2, (uchar*) ptr + 2, (uchar*) end))>
> 0)
>>>> + {
>>>> + /*
>>>> + Check contraction head followed by a wildcard,
>>>> + or contraction that does not fit to result
>>>> + */
>>>> + if ((wc2 == (my_wc_t) w_one || wc2 == (my_wc_t) w_many) ||
>>>> + (charlen == 1&&
>>>> + my_cs_can_be_contraction_tail(cs, wc2)&&
>>>> + my_cs_contraction2_weight(cs, (uchar) ptr[1], wc2)))
>>>> + {
>>>> + *min_length= *max_length= res_length;
>>>> + goto pad_min_max;
>>>> + }
>>>> }
>>>> *min_str++= *max_str++ = ptr[0];
>>>> *min_str++= *max_str++ = ptr[1];
>>>> }
>>>>
>>>> - /* Temporary fix for handling w_one at end of string (key 
>>>> compression) */
>>>> - {
>>>> - char *tmp;
>>>> - for (tmp= min_str ; tmp-1> min_org&& tmp[-1] ==
> '\0'&& 
>>>> tmp[-2]=='\0';)
>>>> - {
>>>> - *--tmp=' ';
>>>> - *--tmp='\0';
>>>> - }
>>>> - }
>>>> -
>>>> - *min_length= *max_length = (size_t) (min_str - min_org);
>>>> - while (min_str + 1< min_end)
>>>> - {
>>>> - *min_str++ = *max_str++ = '\0';
>>>> - *min_str++ = *max_str++ = ' '; /* Because if key compression */
>>>> - }
>>>> + *min_length= *max_length= (size_t) (min_str - min_org);
>>>> +
>>>> +pad_min_max:
>>>> + my_fill_mb2(cs, min_str, min_end - min_str, cs->min_sort_char);
>>>> + my_fill_mb2(cs, max_str, max_end - max_str, cs->max_sort_char);
>>>> return FALSE;
>>>> }
>>>>
>>>> @@ -2583,12 +2589,13 @@ my_like_range_utf32(CHARSET_INFO *cs,
>>>> char *min_end= min_str + res_length;
>>>> char *max_end= max_str + res_length;
>>>> size_t charlen= res_length / cs->mbmaxlen;
>>>> + my_bool have_contractions= my_cs_have_contractions(cs);
>>>>
>>>> DBUG_ASSERT((res_length % 4) == 0);
>>>>
>>>> for ( ; charlen> 0; ptr+= 4, charlen--)
>>>> {
>>>> - my_wc_t wc;
>>>> + my_wc_t wc, wc2;
>>>> int res;
>>>
>>> The 'res' local variable is redundant.
>>>
>>>> if ((res= my_utf32_uni(cs,&wc, (uchar*) ptr, (uchar*) end))< 0)
>>>> {
>>>> @@ -2640,6 +2647,24 @@ my_like_range_utf32(CHARSET_INFO *cs,
>>>> goto pad_min_max;
>>>> }
>>>>
>>>> + if (have_contractions&&
>>>> + my_cs_can_be_contraction_head(cs, wc)&&
>>>> + (res= my_utf32_uni(cs,&wc2, (uchar*) ptr + 4, (uchar*) end))>
> 0)
>>>> + {
>>>> + /*
>>>> + Check contraction head followed by a wildcard
>>>> + or contraction which does not fit to result
>>>> + */
>>>> + if ((wc2 == (my_wc_t) w_one || wc2 == (my_wc_t) w_many) ||
>>>> + (charlen == 1&&
>>>> + my_cs_can_be_contraction_tail(cs, wc2)&&
>>>> + my_cs_contraction2_weight(cs, wc, wc2)))
>>>> + {
>>>> + *min_length= *max_length= res_length;
>>>> + goto pad_min_max;
>>>> + }
>>>> + }
>>>> +
>>>> /* Normal character */
>>>> if (my_uni_utf32(cs, wc, (uchar*) min_str, (uchar*) min_end) != 4 ||
>>>> my_uni_utf32(cs, wc, (uchar*) max_str, (uchar*) max_end) != 4)
>>>
>>> In escape handling, this function performs:
>>> *min_str++= 4;
>>> *max_str++= 4;
>>> Is that correct?
>>
>> utf32 is a fixed length character set, 4 bytes per character.
>> Looks correct.
> 
> You are assigning the value 4 to *max_str and *min_str, before 
> incrementing the pointers. Is that correct?


Oh, I must have been drunk when I wrote this :)


>>
>>>
>>> The return values in the two functions seem to be inconsistent - 
>>> There are no
>>> return TRUE in my_like_range_utf16() but two in my_like_range_utf32().
>>
>> I think utf16 version should return TRUE on an incorrect byte sequences,
>> like utf32 version does.
> 
> Ok.


The new function returns TRUE when a non-well-formed byte sequence is met.


>>
>>
>> I'm starting to think that it was over-engineering to
>> have separate functions for ucs2, utf16, utf32.
>>
>> It really makes sense for things like strnncoll(),
>> which affect performance seriously, as they executed
>> for every row.
>>
>> But in case of my_like_range it could be a single function
>> which handles all three character sets.
>> my_like_range_xxx() is executed only one time per query,
>> so some character set specific optimization in every
>> function hardly affects the overall performance.
>>
>> In 5.6 we'll have some more collation features,
>> which will need new code in my_like_range-family
>> functions, which means duplicate code again.
>>
>>
>> I think I'll put utf16 and utf32 versions together
>> into a single function.
>>
>>
>> ucs2 version have been here since 4.1, so it's a kind of stable thing.
>> Let's leave it alone, at least until 5.6.
> 
> Sounds reasonable.

Done :)

The debug LIKE_RANGE_MIN and LIKE_RANGE_MAX functions revealed
a small problem with ignorable characters in the UCS2 function.
So I replaced it to the new functions as well.


Can you please have a look into the new patch?

Thank you!


> 
> Thanks,
> Roy

Thread
bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Alexander Barkov11 Nov
  • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Roy Lyseng19 Nov
    • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Alexander Barkov22 Nov
      • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Roy Lyseng22 Nov
        • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Alexander Barkov23 Nov
  • Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737Roy Lyseng22 Nov