List:Commits« Previous MessageNext Message »
From:Guilhem Bichot Date:April 4 2011 12:49pm
Subject:Re: bzr commit into mysql-trunk branch (alexander.barkov:3302) Bug#58329
Bug#11765369
View as plain text  
Hello,

Alexander Barkov a écrit, Le 01.04.2011 13:42:
> Hi Guilhem,
> 
> Thanks for review. Please find comments inline:
> 
> 
> On 04/01/2011 11:19 AM, Guilhem Bichot wrote:
>> Hello Bar,
>>
>> Alexander Barkov a écrit, Le 18.03.2011 13:46:
>>> #At file:///home/bar/mysql-bzr/mysql-trunk.b58329/ based on
>>> revid:alexander.nozdrin@stripped
>>>
>>> 3302 Alexander Barkov 2011-03-18
>>> Bug#11765369 - Bug#58329: BETWEEN DOES NOT USE INDEXES FOR DATE OR
>>> DATETIME FIELDS FOR UCS2+ Problem:
>>> WHERE date_column = 'ucs2 literal'
>>> was rewritten to
>>> WHERE CONVERT(date_column USING ucs2) = 'ucs2 literal'
>>> so index on date_column did not work any more.
>>> Fix: rewrite to
>>> WHERE date_column = CONVERT('ucs2 literal' USING latin1)
>>
>> Without the patch: why did we choose to convert date_column instead of
>> the literal? There was probably a rule somewhere?
> 
> Yes. We choose the collation of an operation using special rules,
> which are very close to the SQL Standard collation aggregation rules,
> with some MySQL extensions.
> 
> Please have a look into item.cc, into this method:
> 
> bool DTCollation::aggregate(DTCollation &dt, uint flags)
> 
> and the comments. The comments should give the idea of how it works.

yes they do

>> Why is the bug specific of UCS2?
> 
> The same problem happens with UTF16 and UTF32.
> 
> What happens in DTCollation::aggregate() in this particular example is:
> - both sides have the same coercibility
> - *only* one of the sides is Unicode

ok and Unicode is superset.

> So we allow automatic conversion of the non-Unicode side to the
> character set of the Unicode side (UCS2 in this example).
> 
> 
> Note, the problem does not happen with UTF8:
> we optimize away the CONVERT() function, because:
> - DATE/TIME columns use latin1 internally
> - DATE/TIME column values are limited to MY_REPERTOIRE_ASCII
> - latin1 and utf8 are compatible on the ASCII range
> 
> This optimization is done in agg_item_set_converter().
> (have a look into its comments as well, if interested)

done. Could you please add some doxygen comment to agg_item_set_converter()?


>>> === modified file 'sql/item_cmpfunc.cc'
>>> --- a/sql/item_cmpfunc.cc 2011-03-17 11:33:17 +0000
>>> +++ b/sql/item_cmpfunc.cc 2011-03-18 12:43:19 +0000
>>> @@ -484,10 +484,31 @@ void Item_bool_func2::fix_length_and_dec
>>> DTCollation coll;
>>> if (args[0]->result_type() == STRING_RESULT &&
>>> - args[1]->result_type() == STRING_RESULT &&
>>> - agg_arg_charsets_for_comparison(coll, args, 2))
>>> - return;
>>> - + args[1]->result_type() == STRING_RESULT)
>>> + {
>>> + if (args[0]->is_datetime() || args[1]->is_datetime())
>>> + {
>>> + /*
>>> + For DATE/TIME we always convert to my_charset_numeric (latin1),
>>> + to make this:
>>> + WHERE date_column = 'ucs2 literal';
>>> +
>>> + rewrite to
>>> + WHERE date_column = CONVERT('ucs2 literal' USING latin1);
>>> +
>>> + instead of + WHERE CONVERT(date_column USING ucs2) = 'ucs2 literal';
>>> +
>>> + This makes it possible to use index on date_column for optimization.
>>> + */
>>> + coll.set(&my_charset_numeric);
>>> + if (agg_item_set_converter(coll, func_name(), args, 2, 0, 1))
>>> + return;
>>> + }
>>> + else if (agg_arg_charsets_for_comparison(coll, args, 2))
>>> + return;
>>> + }
>>
>> I worry to have one agg* function called for DATE/TIME and another one
>> called for other types,
> 
> The agg* functions are called only for STRING_RESULT types.
> Never for the other result_type codes.

ok. Rephrasing:
"I worry to have one agg* function called for DATE/TIME and another one
called for other STRING_RESULT".

>  > i.e. I worry about "if/else".
> 
> The source of the problem is that DATE/TIME data types do not
> have their own result_type code - they pretend to be STRING_RESULT.
> In some cases we need to distinguish between "normal" strings
> and "fake" strings, like DATE/TIME. This is one of the cases.
> 
> The "real" solution is to introduce separate type codes
> for DATE/TIME data types, but this is too much for the scope
> of this bug fix.
> 
> With the current code, I don't have a good idea how
> to do this without "if/else".
> 
> We don't need to convert to Unicode in this case.
> We need to do the other way around here:
> convert the Unicode (i.e. UCS2) side to the character
> set of DATE column, which is latin1.
> 
> (strictly speaking, it's pure ASCII, but we use my_charset_latin1
> for ASCII internally)
> 
> 
>> I feel that this
>> could introduce DATE/TIME specific code paths, and thus bugs in the 
>> future.
>> It looks like agg_arg_charsets_for_comparison() calls
>> agg_item_set_converter(), so can't we do this instead, to avoid if/else:
>>
>>  > if (args[0]->result_type() == STRING_RESULT &&
>>  > + args[1]->result_type() == STRING_RESULT)
>>  > + {
>>  > + if (args[0]->is_datetime() || args[1]->is_datetime())
>>  > + {
>>  > + /*
>>  > + For DATE/TIME we always convert to my_charset_numeric (latin1),
>>  > + to make this:
>>  > + WHERE date_column = 'ucs2 literal';
>>  > +
>>  > + rewrite to
>>  > + WHERE date_column = CONVERT('ucs2 literal' USING latin1);
>>  > +
>>  > + instead of
>>  > + WHERE CONVERT(date_column USING ucs2) = 'ucs2 literal';
>>  > +
>>  > + This makes it possible to use index on date_column for optimization.
>>  > + */
>>  > + coll.set(&my_charset_numeric);
>>  > + }
>>  > + if (agg_arg_charsets_for_comparison(coll, args, 2))
>>  > + return;
>>  > + }
> 
> 
> It will not work. The original value of "coll" does not matter.
> 
> agg_arg_charsets_for_comparison() will reset "coll" to UCS2.

Could you please then document, in a doxygen comment for the proper 
function, that "coll" is an _out_ parameter? I see at least 
agg_item_collations() would deserve it, as it does:
c.set(something).
Callers of this function, if they also take a "coll" parameter, should 
also mention in their doxygen comment that it's an _out_ parameter.

This is important, because different from agg_item_set_converter(), 
which uses "coll" as an _in_ parameter. By the way that function doesn't 
change "coll", so I would emphasize this by declaring a const:
agg_item_set_converter(const DTCollation &const, etc)

> It does two things:
> 
> - calculates the collation of the operation using
>   agg_item_collations()
> 
> - install CONVERT() function for the arguments which need it,
>   using agg_item_set_converter()

I understand your explanations (I think :-).
But there is something illogical. The current code (unpatched) looks like:

   DTCollation coll;
   if (args[0]->result_type() == STRING_RESULT &&
       args[1]->result_type() == STRING_RESULT &&
       agg_arg_charsets_for_comparison(coll, args, 2))
     return;

   args[0]->cmp_context= args[1]->cmp_context=
     item_cmp_type(args[0]->result_type(), args[1]->result_type());
   // Make a special case of compare with fields to get nicer DATE 
comparisons

   if (functype() == LIKE_FUNC)  // Disable conversion in case of LIKE 
function.
   {
     set_cmp_func();
     return;
   }

Two questions:

Q1) the code above says something like: "if those are two strings let's 
aggregate the charsets for comparison of the strings". To me, this is 
valid logic. It should apply the same for DATE fields. So I think that 
the fix should be in the function which is responsible for saying how 
two charsets aggregate: you wrote that agg_arg_charsets_for_comparison():

 > does two things:
 >
 > - calculates the collation of the operation using
 >   agg_item_collations()
 >
 > - install CONVERT() function for the arguments which need it,
 >   using agg_item_set_converter()

so:

 > It does two things:
 >
 > - calculates the collation of the operation using
 >   agg_item_collations()

here agg_item_collations() could realize that we're using DATEs and make 
an exception (so it's this function which would have your bugfix),

 > - install CONVERT() function for the arguments which need it,
 >   using agg_item_set_converter()

and so here it would naturally know what arguments need a CONVERT(). 
Reusing the logic of agg_arg_charsets_for_comparison() entirely.

In other words: it's the job of agg_item_collations() to say how 
collations aggregate, so it's this function which needs a fix. Callers 
of this function should not have to care that a DATE requires special 
treatments.

I suspect that this way, we could have a more "unique execution path" 
for DATE and other STRING_RESULT.
It also prepares nicely for when DATE is not a STRING_RESULT anymore: at 
that time, maybe we will put in agg_item_collations() some virtual 
function call based on type, which will implement the exception.

Now, maybe I'm dreaming and this is impossible. In that case, I'll say 
"Bar knows his domain and future plans, his patch fixes the bug, so 
trust him" and you can push your fix.

2) for LIKE_FUNC (see code above), we make a special case it seems. "=" 
is a simplified form of LIKE after all, can't we fix bug#58329 with the 
same code as is used for LIKE_FUNC? If we can't, will the same bug as 
bug#58329 remain with
"WHERE col_datetime LIKE ucs2_literal" (where in theory an index could 
be usable too).
?

Thread
bzr commit into mysql-trunk branch (alexander.barkov:3302) Bug#58329Bug#11765369Alexander Barkov18 Mar
  • Re: bzr commit into mysql-trunk branch (alexander.barkov:3302) Bug#58329Bug#11765369Guilhem Bichot1 Apr
    • Re: bzr commit into mysql-trunk branch (alexander.barkov:3302) Bug#58329Bug#11765369Alexander Barkov1 Apr
      • Re: bzr commit into mysql-trunk branch (alexander.barkov:3302) Bug#58329Bug#11765369Guilhem Bichot4 Apr