Hello,
Alexander Barkov a écrit, Le 12.11.2010 20:26:
> Hi Guilhem,
>
> Thanks for your review. Please find comments inline.
>
> Guilhem Bichot wrote:
>> Hello,
>>
>> Alexander Barkov a écrit, Le 12.11.2010 15:45:
>>> #At file:///home/bar/mysql-bzr/mysql-5.5-bugteam.b57306/ based on
>>> revid:vvaintroub@stripped
>>>
>>> 3121 Alexander Barkov 2010-11-12
>>> Bug#57306 SHOW PROCESSLIST does not display string literals well.
>>> Problem: Extended characters outside of ASCII range where
>>> not displayed
>>> properly in SHOW PROCESSLIST, because thd_info->query was
>>> always sent as system_character_set (utf8). This was wrong,
>>> because query buffer
>>> is never converted to utf8 - it is always have client character
>>> set.
>>> Fix: sending query buffer using character_set_client of
>>> its thread.
>>>
>>> modified:
>>> mysql-test/r/show_check.result
>>> mysql-test/t/show_check.test
>>> sql/sql_show.cc
>>
>>> === modified file 'mysql-test/t/show_check.test'
>>> --- a/mysql-test/t/show_check.test 2010-07-27 14:32:42 +0000
>>> +++ b/mysql-test/t/show_check.test 2010-11-12 14:45:32 +0000
>>> @@ -1328,3 +1328,24 @@ disconnect con1;
>>> # Wait till all disconnects are completed
>>> --source include/wait_until_count_sessions.inc
>>>
>>> +--echo #
>>> +--echo # Bug#57306 SHOW PROCESSLIST does not display string literals
>>> well.
>>> +--echo #
>>> +
>>> +SET NAMES latin1;
>>> +SELECT GET_LOCK('t', 1000);
>>> +--connect (con1,localhost,root,,)
>>> +--connection con1
>>> +SET NAMES latin1;
>>> +--send SELECT GET_LOCK('t',1000) AS 'óóóó';
>>> +--connection default
>>> +--replace_column 1 ### 6 ### 7 ###
>>> +SHOW PROCESSLIST;
>>> +SET NAMES utf8;
>>> +--replace_column 1 ### 6 ### 7 ###
>>> +SHOW PROCESSLIST;
>>> +SELECT RELEASE_LOCK('t');
>>> +--connection con1
>>> +--reap
>>> +--disconnect con1
>>> +--connection default
>>
>> The default connection did "SET NAMES latin1" and then "SET NAMES
>> utf8". I guess this last command restores the connection's state to
>> what it was before "SET NAMES latin1"?
>
> I added this not really for purpose of character set restoring.
> My idea was to check that character set conversion for the "Info"
> column work correctly, so I am testing two different character sets,
> pretending that I have two different clients reading "SHOW PROCESSLIST",
> using latin1 and using utf8.
>
>
> The first "SHOW PROCESSLIST" (after "SET NAMES latin1")
> prints exactly what con1 sent, because con1 uses latin1
> as well. So no conversion happens here. The default thread
> gets result which is binary equal to what con1 sent.
> You can see it as "SMALL LETTER O WITH ACUTE" four times.
> This test checks what the report is actually about.
>
> The second "SHOW PROCESSLIST" (after "SET NAMES utf8") applies
> conversion from latin1 to utf8, whose result is visible
> as "CAPITAL LETTER A WITH TILDE" followed by "SUPERSCRIPT THREE"
> four times. This is an extra test for better coverage.
>
>
> Details:
> Encoding for the test code I added in show_check.test is utf8.
> So when I put a letter "SMALL LETTER O WITH ACUTE ABOVE" into
> show_check.test, I actually put two bytes 0xC3B3.
>
> As I previously use "SET NAMES latin1" on the con1 thread,
> I make server interpret 0xC3B3 as latin1 data.
>
> Now if you have a look here:
> ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
> 0xC3 0x00C4 # LATIN CAPITAL LETTER A WITH TILDE
> 0xB3 0x00B3 # SUPERSCRIPT THREE
>
> which is exactly what we can see in the results for the
> second "SHOW PROCESSLIST", with utf8.
>
>
> This proves that conversion from latin1 to utf8 happened correctly.
Thanks for the clear explanations, I learnt something!
I still have my original concern though: a test should leave the
connection in the same state as when it started, including the used
charset; imagine that a test is added afterwards to show_check.test.
That's why I asked: connection "default" does "SET NAMES latin1" at
start, shouldn't it restore charset variables to some default before ending?
>>> === modified file 'sql/sql_show.cc'
>>> --- a/sql/sql_show.cc 2010-11-02 14:02:16 +0000
>>> +++ b/sql/sql_show.cc 2010-11-12 14:45:32 +0000
>>> @@ -1728,6 +1728,7 @@ public:
>>> uint command;
>>> const char *user,*host,*db,*proc_info,*state_info;
>>> char *query;
>>> + CHARSET_INFO *character_set_client;
>>> };
>>>
>>> #ifdef HAVE_EXPLICIT_TEMPLATE_INSTANTIATION
>>> @@ -1833,6 +1834,7 @@ void mysqld_list_processes(THD *thd,cons
>>> }
>>> mysql_mutex_unlock(&tmp->LOCK_thd_data);
>>> thd_info->start_time= tmp->start_time;
>>> + thd_info->character_set_client=
>>> tmp->variables.character_set_client;
>>
>> This is an "unprotected" read of another thread's data
>> (tmp->variables.character_set_client). Imagine that this other thread
>> is precisely in the process of changing its @@character_set_client: we
>> may be reading a pointer value while it's changing, which may in
>> theory give garbage (some bytes of the pointer have changed and some
>> not yet); imagine this garbage pointer is an invalid pointer (not
>> pointing to any charset); then when we print the query in SHOW
>> PROCESSLIST, we'll dereference this invalid pointer and crash?
>> I read somewhere that on modern Intel CPUs, 64-bit reads are atomic,
>> but it's always unclear what that exactly means and what happens for
>> other CPUs...
>> So I don't feel very confident about this. Or is there no problem?
>>
>> I see the query is read for LOCK_thd_data.
>
> "query" is copied using strmake:
>
> thd_info->query= (char*) thd->strmake(tmp->query(),length);
>
> which is of course not atomic. We do need a mutex here.
>
>
> But I think there is no problem with copying CHARSET_INFO pointer.
> My understanding is that pointers use processor data types,
> which should be copied atomically, on both 32-bit and 64-bit hardware.
Hard to know whether this is always true on all machines we release
MySQL on.
It's indeed true for Intel CPUs starting from Pentium, that a read or
write of a 64-bit integer aligned on a 64-bit boundary is atomic (I
checked the Intel CPU manual), so the reader should receive the old or
new value but nothing else, so at worse we would show garbage characters
(using the wrong charset to interpret the query). Though, what about
Sparc, HP PA-RISC, PowerPC...
I'm not aware of other parts in code where we currently assume a pointer
read to be atomic.
In performance schema, where the user reading Performance Schema tables
reads data which is being updated by other threads at this moment (dirty
read of pointer, as in your patch), there is "sanitization" of the
dirtily read pointers (see call to "sanitize_mutex_class" in
storage/perfschema/table_events_waits.cc). In that case, "sanitization"
means that perfschema checks that the read pointer is in certain
boundaries, which ensures that when dereferenced, it will not cause a
segmentation fault (so the data shown to the user may be wrong, but not
crash).
>> If there is a problem, maybe we should change the code which updates
>> THD::variables::character_set_client to use LOCK_thd_data, or an
>> atomic operation; and above we would use the mutex or the atomic
>> operation...?
>
> I don't think we really need it.
Mmmm... not quite convinced...
Well, I see several ways:
1) keep the dirty read, but sanitize the result (by checking that it is
equal to one of the valid charset pointers (do we have a list of a all
CHARSET_INFO pointers? maybe in SHOW CHARSETS we do?))
2) protect all code which reads/writes the variable
(character_set_client) with a mutex or atomic operations. The code which
writes this variable (when one does SET @@CHARACTER_SET_CLIENT=...) is
generic code: @@character_set_client is a Sys_var_struct (a class which
serves for other variables, see sys_vars.*). You could do it the
following way. Make character_set_client a derived class of
Sys_var_struct, define its session_update() to use a mutex. In all
logic: this charset variable is used by other threads to interpret
thd->query_string; so it should be protected just like thd->query_string
is protected: with LOCK_thd_data (see THD::set_query()). This way, in
sql_show.cc, where we already use the mutex to read the other thread's
query, we would read the charset in the same critical section (before
unlocking the mutex). So character_set_client's session_update() would be:
{
/* comment to explain why mutex */
lock mutex LOCK_thd_data;
Sys_var_struct::session_update(); // call to base class
unlock;
}.
Replication code may have to be touched too.
I think I prefer (2).