Hello,
Alexander Barkov a écrit, Le 01.03.2011 15:13:
> Hi Guilhem,
>
> Guilhem Bichot wrote:
>> Hello,
>>
>> Alexander Barkov a écrit, Le 01.03.2011 11:24:
>>> #At file:///home/bar/mysql-bzr/mysql-5.5.b57341/ based on
>>> revid:tor.didriksen@stripped
>>>
>>> 3357 Alexander Barkov 2011-03-01
>>> Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong
>>> characters
>>> @ mysql-test/r/ctype_latin1.result
>>> @ mysql-test/r/ctype_utf8.result
>>> @ mysql-test/t/ctype_latin1.test
>>> @ mysql-test/t/ctype_utf8.test
>>> Adding tests
>>> @ sql/mysqld.h
>>> Adding a new Item::print() mode - QT_EXPLAIN, which
>>> - displays introducers
>>> - prints ASCII characters as is
>>> - prints non-ASCII characters using hex-escape
>>> Note: as "EXPLAIN" output is only for human readability purposes
>>> and does not need to be a pasrable SQL, using hex-escape is Ok.
>>> @ sql/item.cc
>>> Implementing QT_EXPLAIN mode for Item_string::print().
>>> Note: ErrConvString class perfectly suites for EXPLAIN purposes.
>>> @ sql/sql_parse.cc
>>> Using new QT_EXPLAIN mode for "EXPLAIN EXTENDED SELECT".
>>
>>> === modified file 'mysql-test/r/ctype_latin1.result'
>>> --- a/mysql-test/r/ctype_latin1.result 2011-02-10 08:18:08 +0000
>>> +++ b/mysql-test/r/ctype_latin1.result 2011-03-01 10:17:10 +0000
>>> @@ -3246,5 +3246,20 @@ maketime(`a`,`a`,`a`)
>>> DROP TABLE t1;
>>> SET sql_mode=default;
>>> #
>>> +# Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong
>>> characters
>>> +#
>>> +SET NAMES utf8;
>>> +EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó',
> _utf8'abcdó';
>>> +id select_type table type possible_keys key
>>> key_len ref rows filtered Extra
>>> +1 SIMPLE NULL NULL NULL NULL NULL NULL
>>> NULL NULL No tables used
>>> +Warnings:
>>> +Note 1003 select 'abcdó' AS
> `abcdó`,_latin1'abcd\xC3\xB3' AS
>>> `abcdó`,_utf8'abcd\xC3\xB3' AS `abcdó`
>>> +SET NAMES latin1;
>>
>> I had looked a bit into this bug, and my impression was that the cause
>> is rather one piece of code (the one printing literals of the SELECT
>> list, or the one printing AS clauses?), which forgets to convert to
>> the proper charset.
>> In other words, if I type
>> EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó',
> _utf8'abcdó';
>> wouldn't the proper solution be to display
>> Note 1003 select 'abcdó' AS `abcdó`, _latin1'abcdó' AS
> `abcdó`,
>> etc?
>>
>> Isn't the solution chosen in the patch hiding the real bug?
>
> Does not really forgets/hides. I had been thinking
> a few hours how to fix it properly, then intentionally made it this way.
> Let me try to explain why. Sorry for a long answer. This is tricky.
>
>
> 1. When there are no character set introducers (e.g. _latin1 or _utf8),
> everything works fine with my change:
> a. the patch does not touch behaviour for ASCII characters
> b. the patch fixes behaviour for non-ASCII characters
> (which was the report about), so non-ASCII characters
> are displayed well now in "EXPLAIN EXTENDED".
>
> 2. When there are introducers:
> 2a. the patch does not change behaviour for ASCII characters
>
>
> 2b. when there are some non-ASCII characters, it's a tough situation.
> (it's not in the original report)
>
> Suppose we have this query:
>
> SET NAMES utf8;
> EXPLAIN EXTENDED SELECT _latin1'abcdó'; # where 'ó' is \xC3\xB3
> (utf8)
>
> By the above query we say that:
> "although the character set of the entire query is utf8,
> please take the constant in quotes as latin1",
> so the sequence "\xC3\xB3" is treated as:
>
> - 0xC3 = LATIN CAPITAL LETTER A WITH TILDE
> - 0xB3 = SUPERSCRIPT TREE
>
>
> To make display it in "EXPLAIN EXTENDED" in the original
> form _latin1'abcdó', Item_string would have to remember
> two character sets:
>
> - latin1, which is the real character set of the constant,
> e.g. for val_str() and all other usual Item functionality
>
> - utf8, which is the "visible" character set,
> i.e. how it was seen to the user when the query was typed.
> So in Item_string::print() we could do:
> "although the string is latin1, pretend to be utf8 when displaying",
> i.e. backward substitution for what happened at parse time.
Ok... Thanks for explaining!
Without adding the "visible charset", Item_string could know that it has
to print itself in utf8, because of this code in execute_sqlcom_select():
String str(buff,(uint32) sizeof(buff), system_charset_info);
^^^^^^^^^^^
str.length(0);
thd->lex->unit.print(&str, QT_ORDINARY);
But even that way, it would print two unicode characters:
LATIN CAPITAL LETTER A WITH TILDE
and SUPERSCRIPT TREE
instead of understanding that it needs to take the latin1 codes of those
two and parse them as a single utf8 code, a single unicode character...
Somehow the "visible charset" would be like informing
Item_string::print() that it is preceded by "_latin1". Item_string()
could then understand what it has to do.
But I imagine this would make functions' prototypes longer, and maybe
the printing logic more complex.
> This has a disadvantage:
> This would need a new member "CHARSET_INFO *visible_charset"
> in Item_string, which would be used extremely rarely: *only* for
> EXPLAIN EXTENDED purposes, only for the case when there
> is a string with an introducer and with non-ASCII characters.
> I think it would be a slow down without a good reason,
> because it would need initialization of the new member in all
> constructors.
>
> Another serious disadvantage is that raw 8-bit data with
> introducer totally messes everything up after conversion
> to another character set, and we're planning some work in this
> direction: "WL#3529 Unicode Escape Sequences",
> so we'll use escape sequences every time when an extended string
> with an introducer needs to be displayed.
>
> With WL#3529, it will be displayed as follows:
>
> SET NAMES utf8;
> EXPLAIN EXTENDED _latin1'abcdó';
> -> Note 1003 select _latin1 U&'abcd\00C3\00B3' AS `abcdó`
>
> Notice the U& syntax, where
> - 00C3 is Unicode code point for "LATIN1 CAPITAL LETTER A WITH TILDE"
> - 00B3 is Unicode code point for "SUPERSCRIPT THREE"
>
>
> It will be similar in other context, for example for VIEWs:
>
> SET NAMES utf8;
> CREATE VIEW v1 AS SELECT _latin1'abcdó';
> SHOW CREATE VIEW v1;
>
> will display:
> CREATE VIEW v1 AS SELECT _latin1'abcd\00C3\00B3';
>
> which can be harmlessly converted to any character set
> and then parsed back without data loss.
Ok. Would it be possible in the long run to have both forms unified? it
sounds strange to have
select _latin1 U&'abcd\00C3\00B3'
for plain SELECT, and
select _latin1'abcd\00C3\00B3'
for SELECT-in-view.
> So I think, my decision should be fine:
> - it does not slow down Item_string to "fix" very rare situation
>
> - it is inline with the general development direction:
> a kind of small step towards "introducer-safe environment"
> which will be later fully implemented in WL#3529
> (I'm just using HEX sequences at the moment,
> rather than Unicode sequences - for simplicity, and not
> to confuse users by unknown yet syntax)
ok, I'll get back to reviewing the patch.