List:Internals« Previous MessageNext Message »
From:Alexander Barkov Date:December 23 2009 2:56pm
Subject:Re: system_charset_info
View as plain text  
Hi Monty, Yan,


Michael Widenius wrote:
> Hi!
> 
>>>>>> "Yan" == Yan Yu <yan2yu2@stripped> writes:
> 
> <cut>
> 
> Yan> There are still many other broken test cases that we need to
> Yan> investigate.
> 
> Yan> What is the bug ?
> 
> Yan> http://bugs.mysql.com/bug.php?id=49609
> 
> Thanks. Either MySQL people or MariaDB people will certainly take a
> look at this.


This bug is fixed in MySQL-6.0 sources.
I don't think it the fix will ever be back-ported to an
earlier MySQL version. So you can try to back-port
the fix from MySQL-6.0 to the desired Maria version.


> 
> <cut>
> 
> Monty> Most errors in this case should mainly be in the test suite.
> 
> Monty> If you run mysql-test-run --force and examine the results, ignoring
> Monty> errors in the test suite, you should be able to get some assurance
> Monty> that your change works.
> 
> Yan> Could you elaborate on what do you mean by "ignoring the errors in the test
> suite"?  what kind of error I can ignore?
> 
> Yan> Currently I encounter three types of errors:
> Yan> (1) most benign type of error is:
> Yan> If I run mysql-test-run, some individual tests would fail; However, if I run
> a failed test by itself, using command:
> Yan> mysql-test-run --record [single-test-name],
> Yan> the test would succeed.  I even tried to bundle several tests proceeding the
> failed test together with the failed
> Yan> I assume I can ignore this type of error.
> 
> --record overwrites the old test data, so by using this, you may hide
> a error.
> 
> What I would suggest you to do:
> 
> mysql-test-run --force
> 
> and after it's done, do a diff of the .result and .reject files in the
> 'r' directories.
> 
> You can ignore
> - Only the characterset name in the result files has changed
>   (should be a majority of the cases)
> - EXPLAIN statements where the key lengths are different..
> - Test results where the order of lines has changed that depends on
>   the different sorting of character sets.
> - Difference caused because of different max string lengths
>   (varchar/blob differences)
> 
> Any other test results you should try to understand what could cause
> them.
> 
> Yan> (2) this type of error seems caused by MYSQL_TYPE_VAR_STRING vs.
> MYSQL_TYPE_BLOB.
> Yan> for example, in sql/sql_select.cc, line  9456, function
> create_tmp_field_for_schema(),
> Yan> based on item->max_length, it will create either a  Field_blob object, or
> a Field_varstring object.
> 
> Yan>  9459     if (item->max_length > MAX_FIELD_VARCHARLENGTH)
> Yan>  9460       field= new Field_blob(item->max_length, item->maybe_null,
> Yan>  9461                             item->name,
> item->collation.collation);
> Yan>  9462     else
> Yan>  9463       field= new Field_varstring(item->max_length,
> item->maybe_null,
> Yan>  9464                                  item->name,
> Yan>  9465                                  table->s,
> item->collation.collation);
> 
> Yan> if the system_charset_info is a multibyte encoding, e.g., utf8, the
> max_length of the column type would be greater than MAX_FIELD_VARCHARLENGTH, it will
> create an MYSQL_TYPE_BLOB object, otherwise, it will create an MYSQL_TYPE_VAR_STRING
> object if the system_charset_info is latin1.
> 
> Yan> The difference between MYSQL_TYPE_VAR_STRING vs. MYSQL_TYPE_BLOB manifests
> itself in at least two aspects:
> Yan> (I) the "Type" of the COLUMN_TYPE returned to "describe/explain [table name]"
> command is different in these two cases.
> Yan> this causes mysql_client_test fails.
> 
> This you can ignore.
> 
> Yan> (II) the wire format (i.e., packing results to the client) is different for
> these two fields: MYSQL_TYPE_VAR_STRING vs. MYSQL_TYPE_BLOB.  Field_varstring::store() and
> Field_blob::store() in field.cc.
> Yan> Somehow it seems like that the client got confused after I change
> system_charset_info to latin1, (I stepped traced the server code, it seems fine)
> 
> Should not cause a problem for the client (if there isn't an unknown
> bug somewhere)
> 
> Yan> the results from "describe [table name] would miss information on two
> columns: Type and Null Column.
> Yan> e.g., result for "describe Columns;" returns the following:
> Yan> +--------------------------+------+------+-----+---------+-------+
> Yan> | Field                    | Type | Null | Key | Default | Extra |
> Yan> +--------------------------+------+------+-----+---------+-------+
> Yan> | TABLE_CATALOG            |      |      |     | NULL    |       |
> Yan> | TABLE_SCHEMA             |      |      |     |         |       |
> Yan> | TABLE_NAME               |      |      |     |         |       |
> Yan> | COLUMN_NAME              |      |      |     |         |       |
> Yan> | ORDINAL_POSITION         |      |      |     | 0       |       |
> Yan> | COLUMN_DEFAULT           |      |      |     | NULL    |       |
> 
> Yan> I have not been able to pinpoint the root cause yet, BUT after I change
> Yan> one line of ST_FIELD_INFO columns_fields_info[] from
> 
> Yan> +  {"COLUMN_TYPE", 65535, MYSQL_TYPE_STRING, 0, 0, "Type", OPEN_FRM_ONLY},
> Yan> to
> Yan> -  {"COLUMN_TYPE", 5000, MYSQL_TYPE_STRING, 0, 0, "Type", OPEN_FRM_ONLY},
> 
> Yan> the problem seems gone.
> 
> Yan> My Q is:  would it be safe for me to make this change?
> 
> For your purpose, the above should be safe.
> 
> However, I need to look at what could cause the above display bug.
> 
> Yan> (3) problems caused by conversion between different charsets. mb_wc() or
> wc_mb()
> Yan> our system_charset uses latin1, and
> Yan> "create.test"  includes the following SQL statement:
> Yan> set names utf8;
> Yan> create database
> имя_базы_в_кодировке_утф8_длиной_больше_чем_45;
> Yan> use
> имя_базы_в_кодировке_утф8_длиной_больше_чем_45;
> Yan> select database();
> 
> Yan>
> "имя_базы_в_кодировке_утф8_длиной_больше_чем_45" was
> converted to unicode string internally.
> Yan> it seems like either the conversion is mis-handled somewhere, or the
> conversion is missing in one direction. Would this be possible?
> 
> This could be becase of two reasons
> 
> - files_charset_info needs to be utf8.
> or
> - files_charset_info needs to be same as system_charset_info.
> 
> Bar, can you answer this one ?

The original idea of introducing "files_charset_info" was the assumption
that the file system character set can be different. For example,
on Windows, it depends on OEM code page. In case of a Western machine
it would be cp850. But we never went this way. Instead we
we introduced table-name-to-file-name encoding, so files_charset_info is
just an alias for system_charset_info now.


> 
> Yan> somehow, the database name is represented internally as
> Yan> "???_????_?_?????????_???8_??????_??????_???_45",
> Yan> and this same string  is also returned back to the client for "select
> database();" command.
> Yan> I do not know much about the unicode string, so not sure about the conversion
> code.

If you changed system_character_set to latin1, then
non-Latin1 letters (e.g. Cyrillic in the above example)
won't work well in identifiers. So I suggest to fix the
code to disallow using of non-latin1 characters in identifiers.

I think the fix should be done on parser level, in sql_yacc.yy.

See the code corresponding to the "IDENT_sys" and maybe
also "TEST_STRING_sys" rules. It uses thd->convert_string()
implemented in sql_class.cc.

You can extend this method by adding a new parameter, a pointer
to "uint", to return conversion error count to the upper level.
Conversion error count is returned in the "dummy_errors" variable,
from the copy_and_convert() call.

The new method version can do something like this:

bool THD::convert_string(LEX_STRING *to, CHARSET_INFO *to_cs,
                          const char *from, uint from_length,
                          CHARSET_INFO *from_cs,
                          uint *conversion_error_count)
{
   DBUG_ENTER("convert_string");
   size_t new_length= to_cs->mbmaxlen * from_length;
   if (!(to->str= (char*) alloc(new_length+1)))
   {
     to->length= 0;                              // Safety fix
     DBUG_RETURN(1);                             // EOM
   }
   to->length= copy_and_convert((char*) to->str, new_length, to_cs,
                                from, from_length, from_cs,
                                conversion_error_count);
   to->str[to->length]=0;                        // Safety
   DBUG_RETURN(0);
}



Now in sql_yacc.yy you can check conversion_error_count
returned from thd->convert_string().
If conversion_error_count is 0, then everything is OK.
If conversion_error_count > 0, then the identifier
contains some non-Latin1 character and you just send
a "Bad identifier" error to the client and terminate
query execution.




> 
> Regards,
> Monty

Thread
system_charset_infoYan Yu20 Dec
  • re: system_charset_infoMichael Widenius22 Dec
  • Re: system_charset_infoAlexander Barkov23 Dec
Re: system_charset_infoMichael Widenius22 Dec
  • RE: system_charset_infoYan Yu23 Dec
  • Re: system_charset_infoAlexander Barkov23 Dec
    • Re: system_charset_infoYan Yu24 Dec