From: Roy Lyseng Date: November 22 2010 5:15pm Subject: Re: bzr commit into mysql-5.5-bugteam branch (bar:3114) Bug#57737 List-Archive: http://lists.mysql.com/commits/124664 Message-Id: <4CEAA51E.1010306@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Hi Alexander, On 22.11.10 16.11, Alexander Barkov wrote: > Hi Roy, > > Thanks for your review! > > Please see my comments inline: > > > Roy Lyseng wrote: >> Hi Alexander, >> >> some preliminary notes to the code changes. >> >> I still need some time to understand the logic of the patch. >> >> On 11.11.10 09.44, Alexander Barkov wrote: >>> #At file:///home/bar/mysql-bzr/mysql-5.5-bugteam.b57737/ based on >>> revid:dmitry.shulga@stripped >>> >>> 3114 Alexander Barkov 2010-11-11 >>> Bug#57737 Character sets: search fails with like, contraction, index >>> >>> Problem: LIKE over an indexed column optimized away good results, >>> because my_like_range_utf32/utf16 returned wrong ranges for contractions. >>> >>> Fix: Make my_like_range_utf32/utf16 return correct ranges for contractions. >>> >>> Note, this code presens in my_like_range_mb and my_like_range_ucs2. >>> It was forgotten in utf32/utf16 versions (during mysql-6.0 push/revert mess). >>> >>> >>> added: >>> @ mysql-test/include/ctype_czech.inc >>> Adding shared test code >>> >>> modified: >>> @ include/m_ctype.h >>> Adding helper functions for contractions. >>> >>> @ mysql-test/r/ctype_uca.result >>> @ mysql-test/r/ctype_utf16_uca.result >>> @ mysql-test/r/ctype_utf32_uca.result >>> @ mysql-test/t/ctype_uca.test >>> @ mysql-test/t/ctype_utf16_uca.test >>> @ mysql-test/t/ctype_utf32_uca.test >>> Adding tests. >>> >>> @ strings/ctype-mb.c >>> Pad function did not put the last character. >>> >>> @ strings/ctype-ucs2.c >>> - my_fill_mb2 did not put the high byte, as previously >>> it was used to put only characters in ASCII range. >>> Now it puts high byte as well >>> (needed to pupulate cs->max_sort_char correctly). >>> >>> - Adding contraction handling into my_like_range_utf16 >>> - Removing legacy "Temporary fix" from my_like_range_utf16 >>> - Using my_fill_mb2 instead of loop (duplicate code eliminations) >>> - Adding contraction handling into my_like_range_utf32 >>> >>> added: >>> mysql-test/include/ctype_czech.inc >>> modified: >>> include/m_ctype.h >>> mysql-test/r/ctype_uca.result >>> mysql-test/r/ctype_utf16_uca.result >>> mysql-test/r/ctype_utf32_uca.result >>> mysql-test/t/ctype_uca.test >>> mysql-test/t/ctype_utf16_uca.test >>> mysql-test/t/ctype_utf32_uca.test >>> strings/ctype-mb.c >>> strings/ctype-ucs2.c > > > >>> memcpy(str, buf, buflen); >>> >>> === modified file 'strings/ctype-ucs2.c' >>> --- a/strings/ctype-ucs2.c 2010-10-20 19:02:59 +0000 >>> +++ b/strings/ctype-ucs2.c 2010-11-11 08:39:03 +0000 >>> @@ -903,7 +903,7 @@ static void >>> my_fill_mb2(CHARSET_INFO *cs __attribute__((unused)), >>> char *s, size_t l, int fill) >> >> Type of argument 'fill' should be uint16. >>> { >>> - for ( ; l>= 2; s[0]= 0, s[1]= fill, s+= 2, l-= 2); >>> + for ( ; l>= 2; s[0]= (fill>> 8), s[1]= (fill& 0xFF), s+= 2, l-= 2); >>> } > > > my_fill_mb2 is used to initialize the "fill" member of > MY_CHARSET_HANDLER structures, for example: > > MY_CHARSET_HANDLER my_charset_utf16_handler= > { > ... > my_fill_mb2, > ... > }; > > We cannot change prototype for my_fill_mb2 alone, > without changing MY_CHARSET_HANDLER definition and > therefore all other MY_CHARSET_HANDLER constants. Perhaps you can typecast the argument in the call, then? > > >>> >>> >>> @@ -1590,14 +1590,19 @@ my_like_range_utf16(CHARSET_INFO *cs, >>> char *min_str,char *max_str, >>> size_t *min_length,size_t *max_length) >> >> Can you please document arguments cs, w_one, w_many, min_length and max_length? >> And likewise for my_like_range_utf32(). > > > Yes, will do that in the next patch version. Thanks > > >> >> The 'escape' argument is a single byte character. Is that correct in the >> general case? If so, I think the fact should be documented. > > I think it's documented here: > http://dev.mysql.com/doc/refman/5.0/en/string-comparison-functions.html#operator_like > > > "The escape sequence should be empty or one character long." It says one character. It does not say that it is restricted to one byte. > > >> >>> { >>> - const char *end=ptr+ptr_length; >>> - char *min_org=min_str; >>> - char *min_end=min_str+res_length; >>> + const char *end= ptr + ptr_length; >>> + char *min_org= min_str; >> >> min_org can be 'const char *'. > > It could be 'const char *'. > > I have only one doubt: for some reasons in all similar cases we use > 'char *' in this context: > > > int func(char *s) > { > const char *s0= s; > ... /* traverse through s */ > return s - s0; > } > > I can see it all around the code. > > I thought the reason was that some compilers can produce warnings > for arithmetic operations between pointers with different qualifiers > (i.e. with 'const' and without 'const'). Hm, could be. I tried it out on Linux, and it did not give any warnings. Not a big issue though... > > > >> >>> + char *min_end= min_str + res_length; >>> + char *max_end= max_str + res_length; >>> size_t charlen= res_length / cs->mbmaxlen; >>> + my_bool have_contractions= my_cs_have_contractions(cs); >> >> For symmetry, you may add here DBUG_ASSERT((res_length % 2) == 0); > > I agree. > >> >>> >>> for ( ; ptr + 1< end&& min_str + 1< min_end&& charlen> 0 >>> ; ptr+=2, charlen--) >>> { >>> + my_wc_t wc2; >>> + int res; >> >> The 'res' local variable is redundant (never tested). > > Oops. This is wrong that it's not used. > I forgot to fix 'ptr+= 2' to 'ptr+= res'. > Thanks for noticing thing! :) > >> >>> + >>> if (ptr[0] == '\0'&& ptr[1] == escape&& ptr + 1< end) >>> { >>> ptr+=2; /* Skip escape */ >>> @@ -1623,34 +1628,35 @@ my_like_range_utf16(CHARSET_INFO *cs, >>> *min_length= ((cs->state& MY_CS_BINSORT) ? (size_t) (min_str - min_org) : >>> res_length); >>> *max_length= res_length; >>> - do { >>> - *min_str++ = 0; >>> - *min_str++ = 0; >>> - *max_str++ = (char) (cs->max_sort_char>> 8); >>> - *max_str++ = (char) (cs->max_sort_char& 255); >>> - } while (min_str + 1< min_end); >>> - return FALSE; >>> + goto pad_min_max; >>> + } >>> + if (have_contractions&& ptr + 3< end&& >>> + ptr[0] == '\0'&& >>> + my_cs_can_be_contraction_head(cs, (uchar) ptr[1])&& >> >> The cast of ptr[1] is not according to the prototype of >> my_cs_can_be_contraction_head(). > > The idea is to cast from "signed char" to "unsigned char" here. > Cast to my_wc_t is done automatically. > > It could be: > > my_cs_can_be_contraction_head(cs, (my_wc_t) (uchar) ptr[1])&& > > to have everything done explicitly. > > But I am not sure if we really need this. Is the (uchar) cast needed? At least, the (my_wc_t) would show that you have thought about the cast issue. But: no big deal, it's up to you. > > >> >>> + (res= my_utf16_uni(cs,&wc2, (uchar*) ptr + 2, (uchar*) end))> 0) >>> + { >>> + /* >>> + Check contraction head followed by a wildcard, >>> + or contraction that does not fit to result >>> + */ >>> + if ((wc2 == (my_wc_t) w_one || wc2 == (my_wc_t) w_many) || >>> + (charlen == 1&& >>> + my_cs_can_be_contraction_tail(cs, wc2)&& >>> + my_cs_contraction2_weight(cs, (uchar) ptr[1], wc2))) >>> + { >>> + *min_length= *max_length= res_length; >>> + goto pad_min_max; >>> + } >>> } >>> *min_str++= *max_str++ = ptr[0]; >>> *min_str++= *max_str++ = ptr[1]; >>> } >>> >>> - /* Temporary fix for handling w_one at end of string (key compression) */ >>> - { >>> - char *tmp; >>> - for (tmp= min_str ; tmp-1> min_org&& tmp[-1] == '\0'&& tmp[-2]=='\0';) >>> - { >>> - *--tmp=' '; >>> - *--tmp='\0'; >>> - } >>> - } >>> - >>> - *min_length= *max_length = (size_t) (min_str - min_org); >>> - while (min_str + 1< min_end) >>> - { >>> - *min_str++ = *max_str++ = '\0'; >>> - *min_str++ = *max_str++ = ' '; /* Because if key compression */ >>> - } >>> + *min_length= *max_length= (size_t) (min_str - min_org); >>> + >>> +pad_min_max: >>> + my_fill_mb2(cs, min_str, min_end - min_str, cs->min_sort_char); >>> + my_fill_mb2(cs, max_str, max_end - max_str, cs->max_sort_char); >>> return FALSE; >>> } >>> >>> @@ -2583,12 +2589,13 @@ my_like_range_utf32(CHARSET_INFO *cs, >>> char *min_end= min_str + res_length; >>> char *max_end= max_str + res_length; >>> size_t charlen= res_length / cs->mbmaxlen; >>> + my_bool have_contractions= my_cs_have_contractions(cs); >>> >>> DBUG_ASSERT((res_length % 4) == 0); >>> >>> for ( ; charlen> 0; ptr+= 4, charlen--) >>> { >>> - my_wc_t wc; >>> + my_wc_t wc, wc2; >>> int res; >> >> The 'res' local variable is redundant. >> >>> if ((res= my_utf32_uni(cs,&wc, (uchar*) ptr, (uchar*) end))< 0) >>> { >>> @@ -2640,6 +2647,24 @@ my_like_range_utf32(CHARSET_INFO *cs, >>> goto pad_min_max; >>> } >>> >>> + if (have_contractions&& >>> + my_cs_can_be_contraction_head(cs, wc)&& >>> + (res= my_utf32_uni(cs,&wc2, (uchar*) ptr + 4, (uchar*) end))> 0) >>> + { >>> + /* >>> + Check contraction head followed by a wildcard >>> + or contraction which does not fit to result >>> + */ >>> + if ((wc2 == (my_wc_t) w_one || wc2 == (my_wc_t) w_many) || >>> + (charlen == 1&& >>> + my_cs_can_be_contraction_tail(cs, wc2)&& >>> + my_cs_contraction2_weight(cs, wc, wc2))) >>> + { >>> + *min_length= *max_length= res_length; >>> + goto pad_min_max; >>> + } >>> + } >>> + >>> /* Normal character */ >>> if (my_uni_utf32(cs, wc, (uchar*) min_str, (uchar*) min_end) != 4 || >>> my_uni_utf32(cs, wc, (uchar*) max_str, (uchar*) max_end) != 4) >> >> In escape handling, this function performs: >> *min_str++= 4; >> *max_str++= 4; >> Is that correct? > > utf32 is a fixed length character set, 4 bytes per character. > Looks correct. You are assigning the value 4 to *max_str and *min_str, before incrementing the pointers. Is that correct? > >> >> The return values in the two functions seem to be inconsistent - There are no >> return TRUE in my_like_range_utf16() but two in my_like_range_utf32(). > > I think utf16 version should return TRUE on an incorrect byte sequences, > like utf32 version does. Ok. > > > I'm starting to think that it was over-engineering to > have separate functions for ucs2, utf16, utf32. > > It really makes sense for things like strnncoll(), > which affect performance seriously, as they executed > for every row. > > But in case of my_like_range it could be a single function > which handles all three character sets. > my_like_range_xxx() is executed only one time per query, > so some character set specific optimization in every > function hardly affects the overall performance. > > In 5.6 we'll have some more collation features, > which will need new code in my_like_range-family > functions, which means duplicate code again. > > > I think I'll put utf16 and utf32 versions together > into a single function. > > > ucs2 version have been here since 4.1, so it's a kind of stable thing. > Let's leave it alone, at least until 5.6. Sounds reasonable. Thanks, Roy