List:Internals« Previous MessageNext Message »
From:Michael Widenius Date:August 29 2001 2:50pm
Subject:Re: varchar casting not correct
View as plain text  
Hi!

>>>>> "Timothy" == Timothy Smith <tim@stripped> writes:

Timothy> On 2001 Aug 28, Dirk Nehring <dnehring@stripped> wrote:
>> can anyone check
>> 
>>
> http://sourceforge.net/tracker/?func=detail&atid=100047&aid=445109&group_id=47

Timothy> I closed it because we can't have a discussion on sourceforge,
Timothy> and it's not a simple bug that can just be fixed.

>> This issue is not closed in my mind, since the problem still
>> exists. There is not reason to handle "xá1" and "xä1" differently!
> The
>> solution should be that "xa1", "xá1" and "xä1" are different strings
> and
>> mysql should take care of it.

It depends in which language zone you are in.
If you are in Europe (particularly in Scandinavia, but also in other
places) MySQL does the correct thing here!

Timothy> I agree with you that the current behavior is not the best.  The
Timothy> question is, what is the best way to fix it?

Timothy> I think there should be two solutions, one for MySQL 3.23 and
Timothy> one for MySQL 4.0.  This is a rather delicate issue, because the
Timothy> vast majority of MySQL users are using the latin1 character set
Timothy> for their tables, and if we change how the comparisons are done,
Timothy> everyone has to repair their tables or things will be broken.

Timothy> For MySQL 3.23, we can introduce a new character set called
Timothy> latin1_tmp.conf which will behave "properly" (whatever that is).
Timothy> If this bug is causing you trouble, you can then easily set the
Timothy> default character set for your server to latin1_tmp, repair your
Timothy> tables, and the problem will be fixed.

Timothy> For MySQL 4.0, we can rename latin1_tmp to latin1, and make a
Timothy> separate latin1_fi.conf (more on the names later) to match the
Timothy> current latin1.conf behavior.

We could do that, but this is likely to crash a lot of MySQL
installations if we do this too early :(

When we in 4.1 support multiple character sets / table then this will
be easy to do as we can just map the name latin1 to latin1_fi but
keep the internal character number for latin1.

Timothy> Now, the question of what the proper behavior is needs to be
Timothy> solved.  We have two main axes to consider: case folding (CF) and
Timothy> accent folding (AF).

It's not that easy.  In some languages you fold some accents but not
others.  For example in Finland E and É is sorted the same way, but
A and Ä are completely different.

<cut>

Timothy> Now, how can we decide which of these to use?  Currently it's
Timothy> been done on an ad-hoc basis. latin1 is +CF +AF. latin5 is +CF
Timothy> -AF.  Mainly because, for the person who created the conf file,
Timothy> that's what worked best.  It would be nice to have a more regular
Timothy> treatment of this.

Timothy> The only solution I can think of is to offer all four versions
Timothy> and let the user decide which to use.  The naming convention
Timothy> could be something like:

Timothy>     latin1_af           +CF -AF (your proposed behavior)
Timothy>     latin1              +CF +AF (current behavior, roughly)
Timothy>     latin1_cf_af        -CF -AF (pretty close to CHAR(x) BINARY)
Timothy>     latin1_cf           -CF +AF (probably not too useful)

Timothy> This is a bit complicated, but probably not too difficult to
Timothy> explain.  Since these character sets don't get updated, there
Timothy> shouldn't be much maintenance trouble.

I don't think the above will be that useful as almost no on will be
able to use the _af or _cf character sets in practice.  Life is just
not that simple :(

Timothy> The last question is, how do we accommodate the Finns?  And it's
Timothy> not just the Finns, either.  There is a separate character set
Timothy> for german sort order.  There could be one for french, etc., etc.
Timothy> How should they be handled?

The only solution I see in the long run is to have a LOTS of character
sets.

Timothy> Well, as long as we don't add an Afrikaans charset, the above
Timothy> scheme can be extended to handle languages by tagging them with
Timothy> the two-letter language code.

Timothy>     latin1_fi           +CF +AF Finnish style (current behavior)
Timothy>     latin1_fi_af        +CF -AF Finnish ...
Timothy>     ...

In finland, you ONLY would need latin1_fi ;  There is no use for a
latin_fi_af set.

Timothy> This is exhaustive, and exhausting.  But I'm not sure what else
Timothy> to do to solve this problem.

Regards,
Monty
Thread
varchar casting not correctDirk Nehring28 Aug
  • Re: varchar casting not correctTimothy Smith28 Aug
    • Re: varchar casting not correctDirk Nehring29 Aug
      • Re: varchar casting not correctMichael Widenius29 Aug
        • Re: varchar casting not correctDirk Nehring30 Aug
          • Re: varchar casting not correctMichael Widenius30 Aug
    • Re: varchar casting not correctMichael Widenius29 Aug