List:Commits« Previous MessageNext Message »
From:Alexander Barkov Date:March 3 2011 6:17am
Subject:Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341
Bug#11764503
View as plain text  
Hi Guilhem,

Guilhem Bichot wrote:
> Hello,
> 
> Alexander Barkov a écrit, Le 01.03.2011 15:13:
>> Hi Guilhem,
>>
>> Guilhem Bichot wrote:
>>> Hello,
>>>
>>> Alexander Barkov a écrit, Le 01.03.2011 11:24:
>>>> #At file:///home/bar/mysql-bzr/mysql-5.5.b57341/ based on 
>>>> revid:tor.didriksen@stripped
>>>>
>>>>  3357 Alexander Barkov    2011-03-01
>>>>       Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong 
>>>> characters
>>>>               @ mysql-test/r/ctype_latin1.result
>>>>         @ mysql-test/r/ctype_utf8.result
>>>>         @ mysql-test/t/ctype_latin1.test
>>>>         @ mysql-test/t/ctype_utf8.test
>>>>         Adding tests
>>>>               @ sql/mysqld.h
>>>>         Adding a new Item::print() mode - QT_EXPLAIN, which
>>>>         - displays introducers
>>>>         - prints ASCII characters as is
>>>>         - prints non-ASCII characters using hex-escape
>>>>         Note: as "EXPLAIN" output is only for human readability 
>>>> purposes
>>>>         and does not need to be a pasrable SQL, using hex-escape is Ok.
>>>>               @ sql/item.cc
>>>>         Implementing QT_EXPLAIN mode for Item_string::print().
>>>>         Note: ErrConvString class perfectly suites for EXPLAIN 
>>>> purposes.
>>>>               @ sql/sql_parse.cc
>>>>         Using new QT_EXPLAIN mode for "EXPLAIN EXTENDED SELECT".
>>>
>>>> === modified file 'mysql-test/r/ctype_latin1.result'
>>>> --- a/mysql-test/r/ctype_latin1.result    2011-02-10 08:18:08 +0000
>>>> +++ b/mysql-test/r/ctype_latin1.result    2011-03-01 10:17:10 +0000
>>>> @@ -3246,5 +3246,20 @@ maketime(`a`,`a`,`a`)
>>>>  DROP TABLE t1;
>>>>  SET sql_mode=default;
>>>>  #
>>>> +# Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong 
>>>> characters
>>>> +#
>>>> +SET NAMES utf8;
>>>> +EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó',
> _utf8'abcdó';
>>>> +id    select_type    table    type    possible_keys    key    
>>>> key_len    ref    rows    filtered    Extra
>>>> +1    SIMPLE    NULL    NULL    NULL    NULL    NULL    NULL    
>>>> NULL    NULL    No tables used
>>>> +Warnings:
>>>> +Note    1003    select 'abcdó' AS
> `abcdó`,_latin1'abcd\xC3\xB3' AS 
>>>> `abcdó`,_utf8'abcd\xC3\xB3' AS `abcdó`
>>>> +SET NAMES latin1;
>>>
>>> I had looked a bit into this bug, and my impression was that the 
>>> cause is rather one piece of code (the one printing literals of the 
>>> SELECT list, or the one printing AS clauses?), which forgets to 
>>> convert to the proper charset.
>>> In other words, if I type
>>> EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó',
> _utf8'abcdó';
>>> wouldn't the proper solution be to display
>>> Note    1003    select 'abcdó' AS `abcdó`,
> _latin1'abcdó' AS 
>>> `abcdó`, etc?
>>>
>>> Isn't the solution chosen in the patch hiding the real bug?
>>
>> Does not really forgets/hides. I had been thinking
>> a few hours how to fix it properly, then intentionally made it this way.
>> Let me try to explain why. Sorry for a long answer. This is tricky.
>>
>>
>> 1. When there are no character set introducers (e.g. _latin1 or _utf8),
>> everything works fine with my change:
>> a. the patch does not touch behaviour for ASCII characters
>> b. the patch fixes behaviour for non-ASCII characters
>>   (which was the report about), so non-ASCII characters
>>   are displayed well now in "EXPLAIN EXTENDED".
>>
>> 2. When there are introducers:
>> 2a. the patch does not change behaviour for ASCII characters
>>
>>
>> 2b. when there are some non-ASCII characters, it's a tough situation.
>> (it's not in the original report)
>>
>> Suppose we have this query:
>>
>> SET NAMES utf8;
>> EXPLAIN EXTENDED SELECT _latin1'abcdó';  # where 'ó' is \xC3\xB3
> (utf8)
>>
>> By the above query we say that:
>> "although the character set of the entire query is utf8,
>> please take the constant in quotes as latin1",
>> so the sequence "\xC3\xB3" is treated as:
>>
>> - 0xC3 = LATIN CAPITAL LETTER A WITH TILDE
>> - 0xB3 = SUPERSCRIPT TREE
>>
>>
>> To make display it in "EXPLAIN EXTENDED" in the original
>> form _latin1'abcdó', Item_string would have to remember
>> two character sets:
>>
>> - latin1, which is the real character set of the constant,
>>   e.g. for val_str() and all other usual Item functionality
>>
>> - utf8, which is the "visible" character set,
>>   i.e. how it was seen to the user when the query was typed.
>>   So in Item_string::print() we could do:
>>   "although the string is latin1, pretend to be utf8 when displaying",
>>   i.e. backward substitution for what happened at parse time.
> 
> Ok... Thanks for explaining!
> Without adding the "visible charset", Item_string could know that it has 
> to print itself in utf8, because of this code in execute_sqlcom_select():
> 
> String str(buff,(uint32) sizeof(buff), system_charset_info);
>                                            ^^^^^^^^^^^
> str.length(0);
> thd->lex->unit.print(&str, QT_ORDINARY);
> 
> But even that way, it would print two unicode characters:
> LATIN CAPITAL LETTER A WITH TILDE
> and SUPERSCRIPT TREE
> instead of understanding that it needs to take the latin1 codes of those 
> two and parse them as a single utf8 code, a single unicode character...

True. The user would see the introducer followed by something different,
not what was typed originally. That would be confusing.

Another approach would be to print this way, but remove the introducer:

SELECT _latin1'abcdó' ...

   would be printed as:

SELECT 'abcdó' AS ...

This is, by the way, what is currently done during:
"SELECT VIEW_DEFINITION FROM INFORMATION_SCHEMA.VIEWS" query.


But I don't like this because:
- EXPLAIN tests (e.g. func_str.test, func_test.test) expect
   introducers to be printed, so they start to fail

- This does not really 100% correctly reflect the original query.
   In some context this difference is really important.


> Somehow the "visible charset" would be like informing 
> Item_string::print() that it is preceded by "_latin1". Item_string() 
> could then understand what it has to do.
> But I imagine this would make functions' prototypes longer, and maybe 
> the printing logic more complex.

There is a "m_cs_specified" member which indicates whether
introducer was specified. One of my ideas was to change it to
"CHARSET_INFO *visible_charset". When visible_charset is NULL,
it would mean "no introducer".

But as the direction is towards WL#3529
(which will remove m_cs_specified or visible_charset anyway),
I decided not to do this way.

> 
>> This has a disadvantage:
>> This would need a new member "CHARSET_INFO *visible_charset"
>> in Item_string, which would be used extremely rarely: *only* for
>> EXPLAIN EXTENDED purposes, only for the case when there
>> is a string with an introducer and with non-ASCII characters.
>> I think it would be a slow down without a good reason,
>> because it would need initialization of the new member in all
>> constructors.
>>
>> Another serious disadvantage is that raw 8-bit data with
>> introducer totally  messes everything up after conversion
>> to another character set, and we're planning some work in this
>> direction: "WL#3529 Unicode Escape Sequences",
>> so we'll use escape sequences every time when an extended string
>> with an introducer needs to be  displayed.
>>
>> With WL#3529, it will be displayed as follows:
>>
>> SET NAMES utf8;
>> EXPLAIN EXTENDED _latin1'abcdó';
>> -> Note    1003    select _latin1 U&'abcd\00C3\00B3' AS
> `abcdó`
>>
>> Notice the U& syntax, where
>> - 00C3 is Unicode code point for "LATIN1 CAPITAL LETTER A WITH TILDE"
>> - 00B3 is Unicode code point for "SUPERSCRIPT THREE"
>>
>>
>> It will be similar in other context, for example for VIEWs:
>>
>> SET NAMES utf8;
>> CREATE VIEW v1 AS SELECT _latin1'abcdó';
>> SHOW CREATE VIEW v1;
>>
>> will display:
>>   CREATE VIEW v1 AS SELECT _latin1'abcd\00C3\00B3';
>>
>> which can be harmlessly converted to any character set
>> and then parsed back without data loss.
> 
> Ok. Would it be possible in the long run to have both forms unified? it 
> sounds strange to have
>  select _latin1 U&'abcd\00C3\00B3'
> for plain SELECT, and
>  select _latin1'abcd\00C3\00B3'
> for SELECT-in-view.

Ah, sorry for a typo.
It will print in the same way in all cases of course:

   select _latin1 U&'abcd\00C3\00B3'

for both EXPLAIN and SHOW CREATE VIEW.

> 
>> So I think, my decision should be fine:
>> - it does not slow down Item_string to "fix" very rare situation
>>
>> - it is inline with the general development direction:
>>   a kind of small step towards "introducer-safe environment"
>>   which will be later fully implemented in WL#3529
>>   (I'm just using HEX sequences at the moment,
>>    rather than Unicode sequences - for simplicity, and not
>>    to confuse users by unknown yet syntax)
> 
> ok, I'll get back to reviewing the patch.

Thread
bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov1 Mar
  • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Guilhem Bichot1 Mar
    • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov1 Mar
      • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Guilhem Bichot2 Mar
        • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov3 Mar
  • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Guilhem Bichot3 Mar
    • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov4 Mar
      • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Guilhem Bichot4 Mar
        • Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341Bug#11764503Alexander Barkov4 Mar