List:Commits« Previous MessageNext Message »
From:Guilhem Bichot Date:November 15 2010 12:54pm
Subject:Re: bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306
View as plain text  
Hello,

Alexander Barkov a écrit, Le 12.11.2010 20:26:
> Hi Guilhem,
> 
> Thanks for your review. Please find comments inline.
> 
> Guilhem Bichot wrote:
>> Hello,
>>
>> Alexander Barkov a écrit, Le 12.11.2010 15:45:
>>> #At file:///home/bar/mysql-bzr/mysql-5.5-bugteam.b57306/ based on 
>>> revid:vvaintroub@stripped
>>>
>>>  3121 Alexander Barkov    2010-11-12
>>>       Bug#57306 SHOW PROCESSLIST does not display string literals well.
>>>             Problem: Extended characters outside of ASCII range where 
>>> not displayed
>>>       properly in SHOW PROCESSLIST, because thd_info->query was 
>>> always sent as       system_character_set (utf8). This was wrong, 
>>> because query buffer
>>>       is never converted to utf8 - it is always have client character 
>>> set.
>>>             Fix: sending query buffer using character_set_client of 
>>> its thread.
>>>
>>>     modified:
>>>       mysql-test/r/show_check.result
>>>       mysql-test/t/show_check.test
>>>       sql/sql_show.cc
>>
>>> === modified file 'mysql-test/t/show_check.test'
>>> --- a/mysql-test/t/show_check.test    2010-07-27 14:32:42 +0000
>>> +++ b/mysql-test/t/show_check.test    2010-11-12 14:45:32 +0000
>>> @@ -1328,3 +1328,24 @@ disconnect con1;
>>>  # Wait till all disconnects are completed
>>>  --source include/wait_until_count_sessions.inc
>>>  
>>> +--echo #
>>> +--echo # Bug#57306 SHOW PROCESSLIST does not display string literals 
>>> well.
>>> +--echo #
>>> +
>>> +SET NAMES latin1;
>>> +SELECT GET_LOCK('t', 1000);
>>> +--connect (con1,localhost,root,,)
>>> +--connection con1
>>> +SET NAMES latin1;
>>> +--send SELECT GET_LOCK('t',1000) AS 'óóóó';
>>> +--connection default
>>> +--replace_column 1 ### 6 ### 7 ###
>>> +SHOW PROCESSLIST;
>>> +SET NAMES utf8;
>>> +--replace_column 1 ### 6 ### 7 ###
>>> +SHOW PROCESSLIST;
>>> +SELECT RELEASE_LOCK('t');
>>> +--connection con1
>>> +--reap
>>> +--disconnect con1
>>> +--connection default
>>
>> The default connection did "SET NAMES latin1" and then "SET NAMES 
>> utf8". I guess this last command restores the connection's state to 
>> what it was before "SET NAMES latin1"?
> 
> I added this not really for purpose of character set restoring.
> My idea was to check that character set conversion for the "Info"
> column work correctly, so I am testing two different character sets,
> pretending that I have two different clients reading "SHOW PROCESSLIST",
> using latin1 and using utf8.
> 
> 
> The first "SHOW PROCESSLIST" (after "SET NAMES latin1")
> prints exactly what con1 sent, because con1 uses latin1
> as well. So no conversion happens here. The default thread
> gets result which is binary equal to what con1 sent.
> You can see it as "SMALL LETTER O WITH ACUTE" four times.
> This test checks what the report is actually about.
> 
> The second "SHOW PROCESSLIST" (after "SET NAMES utf8") applies
> conversion from latin1 to utf8, whose result is visible
> as "CAPITAL LETTER A WITH TILDE" followed by "SUPERSCRIPT THREE"
> four times. This is an extra test for better coverage.
> 
> 
> Details:
> Encoding for the test code I added in show_check.test is utf8.
> So when I put a letter "SMALL LETTER O WITH ACUTE ABOVE" into
> show_check.test, I actually put two bytes 0xC3B3.
> 
> As I previously use "SET NAMES latin1" on the con1 thread,
> I make server interpret 0xC3B3 as latin1 data.
> 
> Now if you have a look here:
> ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
> 0xC3    0x00C4    #    LATIN CAPITAL LETTER A WITH TILDE
> 0xB3    0x00B3    #    SUPERSCRIPT THREE
> 
> which is exactly what we can see in the results for the
> second "SHOW PROCESSLIST", with utf8.
> 
> 
> This proves that conversion from latin1 to utf8 happened correctly.

Thanks for the clear explanations, I learnt something!
I still have my original concern though: a test should leave the 
connection in the same state as when it started, including the used 
charset; imagine that a test is added afterwards to show_check.test. 
That's why I asked: connection "default" does "SET NAMES latin1" at 
start, shouldn't it restore charset variables to some default before ending?

>>> === modified file 'sql/sql_show.cc'
>>> --- a/sql/sql_show.cc    2010-11-02 14:02:16 +0000
>>> +++ b/sql/sql_show.cc    2010-11-12 14:45:32 +0000
>>> @@ -1728,6 +1728,7 @@ public:
>>>    uint   command;
>>>    const char *user,*host,*db,*proc_info,*state_info;
>>>    char *query;
>>> +  CHARSET_INFO *character_set_client;
>>>  };
>>>  
>>>  #ifdef HAVE_EXPLICIT_TEMPLATE_INSTANTIATION
>>> @@ -1833,6 +1834,7 @@ void mysqld_list_processes(THD *thd,cons
>>>          }
>>>          mysql_mutex_unlock(&tmp->LOCK_thd_data);
>>>          thd_info->start_time= tmp->start_time;
>>> +        thd_info->character_set_client= 
>>> tmp->variables.character_set_client;
>>
>> This is an "unprotected" read of another thread's data 
>> (tmp->variables.character_set_client). Imagine that this other thread 
>> is precisely in the process of changing its @@character_set_client: we 
>> may be reading a pointer value while it's changing, which may in 
>> theory give garbage (some bytes of the pointer have changed and some 
>> not yet); imagine this garbage pointer is an invalid pointer (not 
>> pointing to any charset); then when we print the query in SHOW 
>> PROCESSLIST, we'll dereference this invalid pointer and crash?
>> I read somewhere that on modern Intel CPUs, 64-bit reads are atomic, 
>> but it's always unclear what that exactly means and what happens for 
>> other CPUs...
>> So I don't feel very confident about this. Or is there no problem?
>>
>> I see the query is read for LOCK_thd_data.
> 
> "query" is copied using strmake:
> 
> thd_info->query= (char*) thd->strmake(tmp->query(),length);
> 
> which is of course not atomic. We do need a mutex here.
> 
> 
> But I think there is no problem with copying CHARSET_INFO pointer.
> My understanding is that pointers use processor data types,
> which should be copied atomically, on both 32-bit and 64-bit hardware.

Hard to know whether this is always true on all machines we release 
MySQL on.
It's indeed true for Intel CPUs starting from Pentium, that a read or 
write of a 64-bit integer aligned on a 64-bit boundary is atomic (I 
checked the Intel CPU manual), so the reader should receive the old or 
new value but nothing else, so at worse we would show garbage characters 
(using the wrong charset to interpret the query). Though, what about 
Sparc, HP PA-RISC, PowerPC...
I'm not aware of other parts in code where we currently assume a pointer 
read to be atomic.
In performance schema, where the user reading Performance Schema tables 
reads data which is being updated by other threads at this moment (dirty 
read of pointer, as in your patch), there is "sanitization" of the 
dirtily read pointers (see call to "sanitize_mutex_class" in 
storage/perfschema/table_events_waits.cc). In that case, "sanitization" 
means that perfschema checks that the read pointer is in certain 
boundaries, which ensures that when dereferenced, it will not cause a 
segmentation fault (so the data shown to the user may be wrong, but not 
crash).

>> If there is a problem, maybe we should change the code which updates 
>> THD::variables::character_set_client to use LOCK_thd_data, or an 
>> atomic operation; and above we would use the mutex or the atomic 
>> operation...?
> 
> I don't think we really need it.

Mmmm... not quite convinced...
Well, I see several ways:
1) keep the dirty read, but sanitize the result (by checking that it is 
equal to one of the valid charset pointers (do we have a list of a all 
CHARSET_INFO pointers? maybe in SHOW CHARSETS we do?))
2) protect all code which reads/writes the variable 
(character_set_client) with a mutex or atomic operations. The code which 
writes this variable (when one does SET @@CHARACTER_SET_CLIENT=...) is 
generic code: @@character_set_client is a Sys_var_struct (a class which 
serves for other variables, see sys_vars.*). You could do it the 
following way. Make character_set_client a derived class of 
Sys_var_struct, define its session_update() to use a mutex. In all 
logic: this charset variable is used by other threads to interpret 
thd->query_string; so it should be protected just like thd->query_string 
is protected: with LOCK_thd_data (see THD::set_query()). This way, in 
sql_show.cc, where we already use the mutex to read the other thread's 
query, we would read the charset in the same critical section (before 
unlocking the mutex). So character_set_client's session_update() would be:
{
  /* comment to explain why mutex */
  lock mutex LOCK_thd_data;
  Sys_var_struct::session_update(); // call to base class
  unlock;
}.
Replication code may have to be touched too.
I think I prefer (2).
Thread
bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306Alexander Barkov12 Nov
  • Re: bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306Guilhem Bichot12 Nov
    • Re: bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306Alexander Barkov12 Nov
      • Re: bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306Tor Didriksen15 Nov
        • Re: bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306Alexander Barkov15 Nov
          • Re: bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306Alexander Barkov15 Nov
            • Re: bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306Guilhem Bichot15 Nov
              • Re: bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306Alexander Barkov16 Nov
      • Re: bzr commit into mysql-5.5-bugteam branch (bar:3121) Bug#57306Guilhem Bichot15 Nov