From: Alexander Barkov Date: March 1 2011 2:13pm Subject: Re: bzr commit into mysql-5.5 branch (alexander.barkov:3357) Bug#57341 Bug#11764503 List-Archive: http://lists.mysql.com/commits/132207 Message-Id: <4D6CFF27.7080707@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit Hi Guilhem, Guilhem Bichot wrote: > Hello, > > Alexander Barkov a écrit, Le 01.03.2011 11:24: >> #At file:///home/bar/mysql-bzr/mysql-5.5.b57341/ based on >> revid:tor.didriksen@stripped >> >> 3357 Alexander Barkov 2011-03-01 >> Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong >> characters >> @ mysql-test/r/ctype_latin1.result >> @ mysql-test/r/ctype_utf8.result >> @ mysql-test/t/ctype_latin1.test >> @ mysql-test/t/ctype_utf8.test >> Adding tests >> @ sql/mysqld.h >> Adding a new Item::print() mode - QT_EXPLAIN, which >> - displays introducers >> - prints ASCII characters as is >> - prints non-ASCII characters using hex-escape >> Note: as "EXPLAIN" output is only for human readability purposes >> and does not need to be a pasrable SQL, using hex-escape is Ok. >> @ sql/item.cc >> Implementing QT_EXPLAIN mode for Item_string::print(). >> Note: ErrConvString class perfectly suites for EXPLAIN purposes. >> @ sql/sql_parse.cc >> Using new QT_EXPLAIN mode for "EXPLAIN EXTENDED SELECT". > >> === modified file 'mysql-test/r/ctype_latin1.result' >> --- a/mysql-test/r/ctype_latin1.result 2011-02-10 08:18:08 +0000 >> +++ b/mysql-test/r/ctype_latin1.result 2011-03-01 10:17:10 +0000 >> @@ -3246,5 +3246,20 @@ maketime(`a`,`a`,`a`) >> DROP TABLE t1; >> SET sql_mode=default; >> # >> +# Bug#11764503 (Bug#57341) Query in EXPLAIN EXTENDED shows wrong >> characters >> +# >> +SET NAMES utf8; >> +EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó', _utf8'abcdó'; >> +id select_type table type possible_keys key >> key_len ref rows filtered Extra >> +1 SIMPLE NULL NULL NULL NULL NULL NULL >> NULL NULL No tables used >> +Warnings: >> +Note 1003 select 'abcdó' AS `abcdó`,_latin1'abcd\xC3\xB3' AS >> `abcdó`,_utf8'abcd\xC3\xB3' AS `abcdó` >> +SET NAMES latin1; > > I had looked a bit into this bug, and my impression was that the cause > is rather one piece of code (the one printing literals of the SELECT > list, or the one printing AS clauses?), which forgets to convert to the > proper charset. > In other words, if I type > EXPLAIN EXTENDED SELECT 'abcdó', _latin1'abcdó', _utf8'abcdó'; > wouldn't the proper solution be to display > Note 1003 select 'abcdó' AS `abcdó`, _latin1'abcdó' AS `abcdó`, etc? > > Isn't the solution chosen in the patch hiding the real bug? Does not really forgets/hides. I had been thinking a few hours how to fix it properly, then intentionally made it this way. Let me try to explain why. Sorry for a long answer. This is tricky. 1. When there are no character set introducers (e.g. _latin1 or _utf8), everything works fine with my change: a. the patch does not touch behaviour for ASCII characters b. the patch fixes behaviour for non-ASCII characters (which was the report about), so non-ASCII characters are displayed well now in "EXPLAIN EXTENDED". 2. When there are introducers: 2a. the patch does not change behaviour for ASCII characters 2b. when there are some non-ASCII characters, it's a tough situation. (it's not in the original report) Suppose we have this query: SET NAMES utf8; EXPLAIN EXTENDED SELECT _latin1'abcdó'; # where 'ó' is \xC3\xB3 (utf8) By the above query we say that: "although the character set of the entire query is utf8, please take the constant in quotes as latin1", so the sequence "\xC3\xB3" is treated as: - 0xC3 = LATIN CAPITAL LETTER A WITH TILDE - 0xB3 = SUPERSCRIPT TREE To make display it in "EXPLAIN EXTENDED" in the original form _latin1'abcdó', Item_string would have to remember two character sets: - latin1, which is the real character set of the constant, e.g. for val_str() and all other usual Item functionality - utf8, which is the "visible" character set, i.e. how it was seen to the user when the query was typed. So in Item_string::print() we could do: "although the string is latin1, pretend to be utf8 when displaying", i.e. backward substitution for what happened at parse time. This has a disadvantage: This would need a new member "CHARSET_INFO *visible_charset" in Item_string, which would be used extremely rarely: *only* for EXPLAIN EXTENDED purposes, only for the case when there is a string with an introducer and with non-ASCII characters. I think it would be a slow down without a good reason, because it would need initialization of the new member in all constructors. Another serious disadvantage is that raw 8-bit data with introducer totally messes everything up after conversion to another character set, and we're planning some work in this direction: "WL#3529 Unicode Escape Sequences", so we'll use escape sequences every time when an extended string with an introducer needs to be displayed. With WL#3529, it will be displayed as follows: SET NAMES utf8; EXPLAIN EXTENDED _latin1'abcdó'; -> Note 1003 select _latin1 U&'abcd\00C3\00B3' AS `abcdó` Notice the U& syntax, where - 00C3 is Unicode code point for "LATIN1 CAPITAL LETTER A WITH TILDE" - 00B3 is Unicode code point for "SUPERSCRIPT THREE" It will be similar in other context, for example for VIEWs: SET NAMES utf8; CREATE VIEW v1 AS SELECT _latin1'abcdó'; SHOW CREATE VIEW v1; will display: CREATE VIEW v1 AS SELECT _latin1'abcd\00C3\00B3'; which can be harmlessly converted to any character set and then parsed back without data loss. So I think, my decision should be fine: - it does not slow down Item_string to "fix" very rare situation - it is inline with the general development direction: a kind of small step towards "introducer-safe environment" which will be later fully implemented in WL#3529 (I'm just using HEX sequences at the moment, rather than Unicode sequences - for simplicity, and not to confuse users by unknown yet syntax)