Hi Guilhem,
Guilhem Bichot wrote:
> Hello,
>
> Alexander Barkov a écrit, Le 01.03.2011 11:24:
>> #At file:///home/bar/mysql-bzr/mysql-5.5.b57341/ based on
>> revid:tor.didriksen@stripped
>>
>> 3357 Alexander Barkov 2011-03-01
>> Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong
>> characters
>> @ mysql-test/r/ctype_latin1.result
>> @ mysql-test/r/ctype_utf8.result
>> @ mysql-test/t/ctype_latin1.test
>> @ mysql-test/t/ctype_utf8.test
>> Adding tests
>> @ sql/mysqld.h
>> Adding a new Item::print() mode - QT_EXPLAIN, which
>> - displays introducers
>> - prints ASCII characters as is
>> - prints non-ASCII characters using hex-escape
>> Note: as "EXPLAIN" output is only for human readability purposes
>> and does not need to be a pasrable SQL, using hex-escape is Ok.
>> @ sql/item.cc
>> Implementing QT_EXPLAIN mode for Item_string::print().
>> Note: ErrConvString class perfectly suites for EXPLAIN purposes.
>> @ sql/sql_parse.cc
>> Using new QT_EXPLAIN mode for "EXPLAIN EXTENDED SELECT".
>
>> === modified file 'mysql-test/r/ctype_latin1.result'
>> --- a/mysql-test/r/ctype_latin1.result 2011-02-10 08:18:08 +0000
>> +++ b/mysql-test/r/ctype_latin1.result 2011-03-01 10:17:10 +0000
>> @@ -3246,5 +3246,20 @@ maketime(`a`,`a`,`a`)
>> DROP TABLE t1;
>> SET sql_mode=default;
>> #
>> +# Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong
>> characters
>> +#
>> +SET NAMES utf8;
>> +EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó',
> _utf8'abcdó';
>> +id select_type table type possible_keys key
>> key_len ref rows filtered Extra
>> +1 SIMPLE NULL NULL NULL NULL NULL NULL
>> NULL NULL No tables used
>> +Warnings:
>> +Note 1003 select 'abcdó' AS `abcdó`,_latin1'abcd\xC3\xB3' AS
>
>> `abcdó`,_utf8'abcd\xC3\xB3' AS `abcdó`
>> +SET NAMES latin1;
>
> I had looked a bit into this bug, and my impression was that the cause
> is rather one piece of code (the one printing literals of the SELECT
> list, or the one printing AS clauses?), which forgets to convert to the
> proper charset.
> In other words, if I type
> EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó', _utf8'abcdó';
> wouldn't the proper solution be to display
> Note 1003 select 'abcdó' AS `abcdó`, _latin1'abcdó' AS
> `abcdó`, etc?
>
> Isn't the solution chosen in the patch hiding the real bug?
Does not really forgets/hides. I had been thinking
a few hours how to fix it properly, then intentionally made it this way.
Let me try to explain why. Sorry for a long answer. This is tricky.
1. When there are no character set introducers (e.g. _latin1 or _utf8),
everything works fine with my change:
a. the patch does not touch behaviour for ASCII characters
b. the patch fixes behaviour for non-ASCII characters
(which was the report about), so non-ASCII characters
are displayed well now in "EXPLAIN EXTENDED".
2. When there are introducers:
2a. the patch does not change behaviour for ASCII characters
2b. when there are some non-ASCII characters, it's a tough situation.
(it's not in the original report)
Suppose we have this query:
SET NAMES utf8;
EXPLAIN EXTENDED SELECT _latin1'abcdó'; # where 'ó' is \xC3\xB3 (utf8)
By the above query we say that:
"although the character set of the entire query is utf8,
please take the constant in quotes as latin1",
so the sequence "\xC3\xB3" is treated as:
- 0xC3 = LATIN CAPITAL LETTER A WITH TILDE
- 0xB3 = SUPERSCRIPT TREE
To make display it in "EXPLAIN EXTENDED" in the original
form _latin1'abcdó', Item_string would have to remember
two character sets:
- latin1, which is the real character set of the constant,
e.g. for val_str() and all other usual Item functionality
- utf8, which is the "visible" character set,
i.e. how it was seen to the user when the query was typed.
So in Item_string::print() we could do:
"although the string is latin1, pretend to be utf8 when displaying",
i.e. backward substitution for what happened at parse time.
This has a disadvantage:
This would need a new member "CHARSET_INFO *visible_charset"
in Item_string, which would be used extremely rarely: *only* for
EXPLAIN EXTENDED purposes, only for the case when there
is a string with an introducer and with non-ASCII characters.
I think it would be a slow down without a good reason,
because it would need initialization of the new member in all
constructors.
Another serious disadvantage is that raw 8-bit data with
introducer totally messes everything up after conversion
to another character set, and we're planning some work in this
direction: "WL#3529 Unicode Escape Sequences",
so we'll use escape sequences every time when an extended string
with an introducer needs to be displayed.
With WL#3529, it will be displayed as follows:
SET NAMES utf8;
EXPLAIN EXTENDED _latin1'abcdó';
-> Note 1003 select _latin1 U&'abcd\00C3\00B3' AS `abcdó`
Notice the U& syntax, where
- 00C3 is Unicode code point for "LATIN1 CAPITAL LETTER A WITH TILDE"
- 00B3 is Unicode code point for "SUPERSCRIPT THREE"
It will be similar in other context, for example for VIEWs:
SET NAMES utf8;
CREATE VIEW v1 AS SELECT _latin1'abcdó';
SHOW CREATE VIEW v1;
will display:
CREATE VIEW v1 AS SELECT _latin1'abcd\00C3\00B3';
which can be harmlessly converted to any character set
and then parsed back without data loss.
So I think, my decision should be fine:
- it does not slow down Item_string to "fix" very rare situation
- it is inline with the general development direction:
a kind of small step towards "introducer-safe environment"
which will be later fully implemented in WL#3529
(I'm just using HEX sequences at the moment,
rather than Unicode sequences - for simplicity, and not
to confuse users by unknown yet syntax)