List:Commits« Previous MessageNext Message »
From:Guilhem Bichot Date:March 2 2011 8:28pm
Subject:Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341
Bug#11764503
View as plain text  
Hello,

Alexander Barkov a écrit, Le 01.03.2011 15:13:
> Hi Guilhem,
> 
> Guilhem Bichot wrote:
>> Hello,
>>
>> Alexander Barkov a écrit, Le 01.03.2011 11:24:
>>> #At file:///home/bar/mysql-bzr/mysql-5.5.b57341/ based on 
>>> revid:tor.didriksen@stripped
>>>
>>>  3357 Alexander Barkov    2011-03-01
>>>       Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong 
>>> characters
>>>               @ mysql-test/r/ctype_latin1.result
>>>         @ mysql-test/r/ctype_utf8.result
>>>         @ mysql-test/t/ctype_latin1.test
>>>         @ mysql-test/t/ctype_utf8.test
>>>         Adding tests
>>>               @ sql/mysqld.h
>>>         Adding a new Item::print() mode - QT_EXPLAIN, which
>>>         - displays introducers
>>>         - prints ASCII characters as is
>>>         - prints non-ASCII characters using hex-escape
>>>         Note: as "EXPLAIN" output is only for human readability purposes
>>>         and does not need to be a pasrable SQL, using hex-escape is Ok.
>>>               @ sql/item.cc
>>>         Implementing QT_EXPLAIN mode for Item_string::print().
>>>         Note: ErrConvString class perfectly suites for EXPLAIN purposes.
>>>               @ sql/sql_parse.cc
>>>         Using new QT_EXPLAIN mode for "EXPLAIN EXTENDED SELECT".
>>
>>> === modified file 'mysql-test/r/ctype_latin1.result'
>>> --- a/mysql-test/r/ctype_latin1.result    2011-02-10 08:18:08 +0000
>>> +++ b/mysql-test/r/ctype_latin1.result    2011-03-01 10:17:10 +0000
>>> @@ -3246,5 +3246,20 @@ maketime(`a`,`a`,`a`)
>>>  DROP TABLE t1;
>>>  SET sql_mode=default;
>>>  #
>>> +# Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong 
>>> characters
>>> +#
>>> +SET NAMES utf8;
>>> +EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó',
> _utf8'abcdó';
>>> +id    select_type    table    type    possible_keys    key    
>>> key_len    ref    rows    filtered    Extra
>>> +1    SIMPLE    NULL    NULL    NULL    NULL    NULL    NULL    
>>> NULL    NULL    No tables used
>>> +Warnings:
>>> +Note    1003    select 'abcdó' AS
> `abcdó`,_latin1'abcd\xC3\xB3' AS 
>>> `abcdó`,_utf8'abcd\xC3\xB3' AS `abcdó`
>>> +SET NAMES latin1;
>>
>> I had looked a bit into this bug, and my impression was that the cause 
>> is rather one piece of code (the one printing literals of the SELECT 
>> list, or the one printing AS clauses?), which forgets to convert to 
>> the proper charset.
>> In other words, if I type
>> EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó',
> _utf8'abcdó';
>> wouldn't the proper solution be to display
>> Note    1003    select 'abcdó' AS `abcdó`, _latin1'abcdó' AS
> `abcdó`, 
>> etc?
>>
>> Isn't the solution chosen in the patch hiding the real bug?
> 
> Does not really forgets/hides. I had been thinking
> a few hours how to fix it properly, then intentionally made it this way.
> Let me try to explain why. Sorry for a long answer. This is tricky.
> 
> 
> 1. When there are no character set introducers (e.g. _latin1 or _utf8),
> everything works fine with my change:
> a. the patch does not touch behaviour for ASCII characters
> b. the patch fixes behaviour for non-ASCII characters
>   (which was the report about), so non-ASCII characters
>   are displayed well now in "EXPLAIN EXTENDED".
> 
> 2. When there are introducers:
> 2a. the patch does not change behaviour for ASCII characters
> 
> 
> 2b. when there are some non-ASCII characters, it's a tough situation.
> (it's not in the original report)
> 
> Suppose we have this query:
> 
> SET NAMES utf8;
> EXPLAIN EXTENDED SELECT _latin1'abcdó';  # where 'ó' is \xC3\xB3
> (utf8)
> 
> By the above query we say that:
> "although the character set of the entire query is utf8,
> please take the constant in quotes as latin1",
> so the sequence "\xC3\xB3" is treated as:
> 
> - 0xC3 = LATIN CAPITAL LETTER A WITH TILDE
> - 0xB3 = SUPERSCRIPT TREE
> 
> 
> To make display it in "EXPLAIN EXTENDED" in the original
> form _latin1'abcdó', Item_string would have to remember
> two character sets:
> 
> - latin1, which is the real character set of the constant,
>   e.g. for val_str() and all other usual Item functionality
> 
> - utf8, which is the "visible" character set,
>   i.e. how it was seen to the user when the query was typed.
>   So in Item_string::print() we could do:
>   "although the string is latin1, pretend to be utf8 when displaying",
>   i.e. backward substitution for what happened at parse time.

Ok... Thanks for explaining!
Without adding the "visible charset", Item_string could know that it has 
to print itself in utf8, because of this code in execute_sqlcom_select():

String str(buff,(uint32) sizeof(buff), system_charset_info);
                                            ^^^^^^^^^^^
str.length(0);
thd->lex->unit.print(&str, QT_ORDINARY);

But even that way, it would print two unicode characters:
LATIN CAPITAL LETTER A WITH TILDE
and SUPERSCRIPT TREE
instead of understanding that it needs to take the latin1 codes of those 
two and parse them as a single utf8 code, a single unicode character...

Somehow the "visible charset" would be like informing 
Item_string::print() that it is preceded by "_latin1". Item_string() 
could then understand what it has to do.
But I imagine this would make functions' prototypes longer, and maybe 
the printing logic more complex.

> This has a disadvantage:
> This would need a new member "CHARSET_INFO *visible_charset"
> in Item_string, which would be used extremely rarely: *only* for
> EXPLAIN EXTENDED purposes, only for the case when there
> is a string with an introducer and with non-ASCII characters.
> I think it would be a slow down without a good reason,
> because it would need initialization of the new member in all
> constructors.
> 
> Another serious disadvantage is that raw 8-bit data with
> introducer totally  messes everything up after conversion
> to another character set, and we're planning some work in this
> direction: "WL#3529 Unicode Escape Sequences",
> so we'll use escape sequences every time when an extended string
> with an introducer needs to be  displayed.
> 
> With WL#3529, it will be displayed as follows:
> 
> SET NAMES utf8;
> EXPLAIN EXTENDED _latin1'abcdó';
> -> Note    1003    select _latin1 U&'abcd\00C3\00B3' AS `abcdó`
> 
> Notice the U& syntax, where
> - 00C3 is Unicode code point for "LATIN1 CAPITAL LETTER A WITH TILDE"
> - 00B3 is Unicode code point for "SUPERSCRIPT THREE"
> 
> 
> It will be similar in other context, for example for VIEWs:
> 
> SET NAMES utf8;
> CREATE VIEW v1 AS SELECT _latin1'abcdó';
> SHOW CREATE VIEW v1;
> 
> will display:
>   CREATE VIEW v1 AS SELECT _latin1'abcd\00C3\00B3';
> 
> which can be harmlessly converted to any character set
> and then parsed back without data loss.

Ok. Would it be possible in the long run to have both forms unified? it 
sounds strange to have
  select _latin1 U&'abcd\00C3\00B3'
for plain SELECT, and
  select _latin1'abcd\00C3\00B3'
for SELECT-in-view.

> So I think, my decision should be fine:
> - it does not slow down Item_string to "fix" very rare situation
> 
> - it is inline with the general development direction:
>   a kind of small step towards "introducer-safe environment"
>   which will be later fully implemented in WL#3529
>   (I'm just using HEX sequences at the moment,
>    rather than Unicode sequences - for simplicity, and not
>    to confuse users by unknown yet syntax)

ok, I'll get back to reviewing the patch.
Thread
bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov1 Mar
  • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Guilhem Bichot1 Mar
    • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov1 Mar
      • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Guilhem Bichot2 Mar
        • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov3 Mar
  • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Guilhem Bichot3 Mar
    • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov4 Mar
      • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Guilhem Bichot4 Mar
        • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov4 Mar