List:Internals« Previous MessageNext Message »
From:Yan Yu Date:December 24 2009 5:35am
Subject:Re: system_charset_info
View as plain text  
Hi, THANK YOU SO MUCH for the explanation.
It is really helpful!
Best,
yan

On Dec 23, 2009, at 6:56 AM, Alexander Barkov wrote:

> Hi Monty, Yan,
>
>
> Michael Widenius wrote:
>> Hi!
>>
>>>>>>> "Yan" == Yan Yu <yan2yu2@stripped> writes:
>>
>> <cut>
>>
>> Yan> There are still many other broken test cases that we need to
>> Yan> investigate.
>>
>> Yan> What is the bug ?
>>
>> Yan> http://bugs.mysql.com/bug.php?id=49609
>>
>> Thanks. Either MySQL people or MariaDB people will certainly take a
>> look at this.
>
>
> This bug is fixed in MySQL-6.0 sources.
> I don't think it the fix will ever be back-ported to an
> earlier MySQL version. So you can try to back-port
> the fix from MySQL-6.0 to the desired Maria version.
>
>
>>
>> <cut>
>>
>> Monty> Most errors in this case should mainly be in the test suite.
>>
>> Monty> If you run mysql-test-run --force and examine the results,  
>> ignoring
>> Monty> errors in the test suite, you should be able to get some  
>> assurance
>> Monty> that your change works.
>>
>> Yan> Could you elaborate on what do you mean by "ignoring the  
>> errors in the test suite"?  what kind of error I can ignore?
>>
>> Yan> Currently I encounter three types of errors:
>> Yan> (1) most benign type of error is:
>> Yan> If I run mysql-test-run, some individual tests would fail;  
>> However, if I run a failed test by itself, using command:
>> Yan> mysql-test-run --record [single-test-name],
>> Yan> the test would succeed.  I even tried to bundle several tests  
>> proceeding the failed test together with the failed
>> Yan> I assume I can ignore this type of error.
>>
>> --record overwrites the old test data, so by using this, you may hide
>> a error.
>>
>> What I would suggest you to do:
>>
>> mysql-test-run --force
>>
>> and after it's done, do a diff of the .result and .reject files in  
>> the
>> 'r' directories.
>>
>> You can ignore
>> - Only the characterset name in the result files has changed
>>  (should be a majority of the cases)
>> - EXPLAIN statements where the key lengths are different..
>> - Test results where the order of lines has changed that depends on
>>  the different sorting of character sets.
>> - Difference caused because of different max string lengths
>>  (varchar/blob differences)
>>
>> Any other test results you should try to understand what could cause
>> them.
>>
>> Yan> (2) this type of error seems caused by MYSQL_TYPE_VAR_STRING  
>> vs. MYSQL_TYPE_BLOB.
>> Yan> for example, in sql/sql_select.cc, line  9456, function  
>> create_tmp_field_for_schema(),
>> Yan> based on item->max_length, it will create either a  Field_blob  
>> object, or a Field_varstring object.
>>
>> Yan>  9459     if (item->max_length > MAX_FIELD_VARCHARLENGTH)
>> Yan>  9460       field= new Field_blob(item->max_length, item- 
>> >maybe_null,
>> Yan>  9461                             item->name, item- 

>> >collation.collation);
>> Yan>  9462     else
>> Yan>  9463       field= new Field_varstring(item->max_length, item- 
>> >maybe_null,
>> Yan>  9464                                  item->name,
>> Yan>  9465                                  table->s, item- 
>> >collation.collation);
>>
>> Yan> if the system_charset_info is a multibyte encoding, e.g.,  
>> utf8, the max_length of the column type would be greater than  
>> MAX_FIELD_VARCHARLENGTH, it will create an MYSQL_TYPE_BLOB object,  
>> otherwise, it will create an MYSQL_TYPE_VAR_STRING object if the  
>> system_charset_info is latin1.
>>
>> Yan> The difference between MYSQL_TYPE_VAR_STRING vs.  
>> MYSQL_TYPE_BLOB manifests itself in at least two aspects:
>> Yan> (I) the "Type" of the COLUMN_TYPE returned to "describe/ 
>> explain [table name]" command is different in these two cases.
>> Yan> this causes mysql_client_test fails.
>>
>> This you can ignore.
>>
>> Yan> (II) the wire format (i.e., packing results to the client) is  
>> different for these two fields: MYSQL_TYPE_VAR_STRING vs.  
>> MYSQL_TYPE_BLOB.  Field_varstring::store() and Field_blob::store()  
>> in field.cc.
>> Yan> Somehow it seems like that the client got confused after I  
>> change system_charset_info to latin1, (I stepped traced the server  
>> code, it seems fine)
>>
>> Should not cause a problem for the client (if there isn't an unknown
>> bug somewhere)
>>
>> Yan> the results from "describe [table name] would miss information  
>> on two columns: Type and Null Column.
>> Yan> e.g., result for "describe Columns;" returns the following:
>> Yan> +--------------------------+------+------+-----+--------- 
>> +-------+
>> Yan> | Field                    | Type | Null | Key | Default |  
>> Extra |
>> Yan> +--------------------------+------+------+-----+--------- 
>> +-------+
>> Yan> | TABLE_CATALOG            |      |      |     | NULL     
>> |       |
>> Yan> | TABLE_SCHEMA             |      |      |     |          
>> |       |
>> Yan> | TABLE_NAME               |      |      |     |          
>> |       |
>> Yan> | COLUMN_NAME              |      |      |     |          
>> |       |
>> Yan> | ORDINAL_POSITION         |      |      |     | 0        
>> |       |
>> Yan> | COLUMN_DEFAULT           |      |      |     | NULL     
>> |       |
>>
>> Yan> I have not been able to pinpoint the root cause yet, BUT after  
>> I change
>> Yan> one line of ST_FIELD_INFO columns_fields_info[] from
>>
>> Yan> +  {"COLUMN_TYPE", 65535, MYSQL_TYPE_STRING, 0, 0, "Type",  
>> OPEN_FRM_ONLY},
>> Yan> to
>> Yan> -  {"COLUMN_TYPE", 5000, MYSQL_TYPE_STRING, 0, 0, "Type",  
>> OPEN_FRM_ONLY},
>>
>> Yan> the problem seems gone.
>>
>> Yan> My Q is:  would it be safe for me to make this change?
>>
>> For your purpose, the above should be safe.
>>
>> However, I need to look at what could cause the above display bug.
>>
>> Yan> (3) problems caused by conversion between different charsets.  
>> mb_wc() or wc_mb()
>> Yan> our system_charset uses latin1, and
>> Yan> "create.test"  includes the following SQL statement:
>> Yan> set names utf8;
>> Yan> create database  
>> имя_базы_в_кодировкYan> use  
>> имя_базы_в_кодировке_утф8_длиной_больше_чем_45 
>> ;
>> Yan> select database();
>>
>> Yan>  
>> "имя_базы_в_кодировке_утф8_длиной_больше_чем_45 
>> " was converted to unicode string internally.
>> Yan> it seems like either the conversion is mis-handled somewhere,  
>> or the conversion is missing in one direction. Would this be  
>> possible?
>>
>> This could be becase of two reasons
>>
>> - files_charset_info needs to be utf8.
>> or
>> - files_charset_info needs to be same as system_charset_info.
>>
>> Bar, can you answer this one ?
>
> The original idea of introducing "files_charset_info" was the  
> assumption
> that the file system character set can be different. For example,
> on Windows, it depends on OEM code page. In case of a Western machine
> it would be cp850. But we never went this way. Instead we
> we introduced table-name-to-file-name encoding, so  
> files_charset_info is
> just an alias for system_charset_info now.
>
>
>>
>> Yan> somehow, the database name is represented internally as
>> Yan> "???_????_?_?????????_???8_??????_??????_???_45",
>> Yan> and this same string  is also returned back to the client for  
>> "select database();" command.
>> Yan> I do not know much about the unicode string, so not sure about  
>> the conversion code.
>
> If you changed system_character_set to latin1, then
> non-Latin1 letters (e.g. Cyrillic in the above example)
> won't work well in identifiers. So I suggest to fix the
> code to disallow using of non-latin1 characters in identifiers.
>
> I think the fix should be done on parser level, in sql_yacc.yy.
>
> See the code corresponding to the "IDENT_sys" and maybe
> also "TEST_STRING_sys" rules. It uses thd->convert_string()
> implemented in sql_class.cc.
>
> You can extend this method by adding a new parameter, a pointer
> to "uint", to return conversion error count to the upper level.
> Conversion error count is returned in the "dummy_errors" variable,
> from the copy_and_convert() call.
>
> The new method version can do something like this:
>
> bool THD::convert_string(LEX_STRING *to, CHARSET_INFO *to_cs,
>                          const char *from, uint from_length,
>                          CHARSET_INFO *from_cs,
>                          uint *conversion_error_count)
> {
>   DBUG_ENTER("convert_string");
>   size_t new_length= to_cs->mbmaxlen * from_length;
>   if (!(to->str= (char*) alloc(new_length+1)))
>   {
>     to->length= 0;                              // Safety fix
>     DBUG_RETURN(1);                             // EOM
>   }
>   to->length= copy_and_convert((char*) to->str, new_length, to_cs,
>                                from, from_length, from_cs,
>                                conversion_error_count);
>   to->str[to->length]=0;                        // Safety
>   DBUG_RETURN(0);
> }
>
>
>
> Now in sql_yacc.yy you can check conversion_error_count
> returned from thd->convert_string().
> If conversion_error_count is 0, then everything is OK.
> If conversion_error_count > 0, then the identifier
> contains some non-Latin1 character and you just send
> a "Bad identifier" error to the client and terminate
> query execution.
>
>
>
>
>>
>> Regards,
>> Monty
>

Thread
system_charset_infoYan Yu20 Dec
  • re: system_charset_infoMichael Widenius22 Dec
  • Re: system_charset_infoAlexander Barkov23 Dec
Re: system_charset_infoMichael Widenius22 Dec
  • RE: system_charset_infoYan Yu23 Dec
  • Re: system_charset_infoAlexander Barkov23 Dec
    • Re: system_charset_infoYan Yu24 Dec