List: Internals
From: Thimble Smith
Date: November 1 2000 8:59pm
Subject: Re: Unicode
On Wed, Nov 01, 2000 at 04:16:54PM +0100, Kay-Michael Goertz wrote:
> Hello Tim,
> 
> yes, it will be a good idea to start first with a discussion on the
> mailing list.

OK, I'm posting this to the list.

> But a few more questions.
> 
> As I see it, you would like to have a lot of character sets
> sharing one encoding (e.g. UTF-8), each giving the right order.
> But there is a problem. As I know from German, there are (in
> German) 2 different orders for the umlauts (öäü). I am afraid
> we will find such things, and others, which make such a solution
> not very general. Would it not be better to use the character
> codes as the order, and to provide built-in functions to get
> the data in a different order?  Then I could hold data for
> different languages in one database (which is the idea behind
> using Unicode) and let different users get the data in the
> order they are used to.

The problem with this is that indexes use the character set to
sort the values.  If you change the sort order, then you won't
be able to find your data anymore.
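
To make the index problem concrete, here is a minimal sketch in C.
It is not MySQL code: the toy "index" and both comparison functions
(toy_cmp_bytes, toy_cmp_nocase) are made up for illustration.  The
index is just an array sorted once under one ordering; a binary
search that uses a different ordering can miss a key that is there.

```c
#include <stddef.h>
#include <string.h>
#include <ctype.h>

/* Ordering A: plain byte order. */
static int toy_cmp_bytes(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Ordering B: a hypothetical case-insensitive ordering. */
static int toy_cmp_nocase(const char *a, const char *b)
{
    const unsigned char *s = (const unsigned char *)a;
    const unsigned char *t = (const unsigned char *)b;
    while (*s && tolower(*s) == tolower(*t)) { s++; t++; }
    return tolower(*s) - tolower(*t);
}

/* Toy index: keys stored in byte order ('Z' sorts before 'a'). */
static const char *toy_index[] = { "Zebra", "apple", "mango" };
#define TOY_INDEX_LEN (sizeof toy_index / sizeof toy_index[0])

/* Binary search over the toy index using the given ordering.
   Returns 1 if the key is found, 0 otherwise. */
static int toy_lookup(const char *key,
                      int (*cmp)(const char *, const char *))
{
    size_t lo = 0, hi = TOY_INDEX_LEN;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = cmp(key, toy_index[mid]);
        if (c == 0) return 1;
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    return 0;
}
```

With this setup, toy_lookup("Zebra", toy_cmp_bytes) finds the key,
but toy_lookup("Zebra", toy_cmp_nocase) does not: the search goes
right at "apple" and never looks at the slot where "Zebra" is stored.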

Also, ORDER BY can be optimized to just read rows in order out
of the index in some cases.  But if we did it the way you
suggest, this wouldn't be possible anymore (at least, not in
general).

But, I agree with you that it would be nice to be able to store
the data in just one encoding (e.g., charset=utf-16) and then
do the sorting separately.  This might be possible, or even
easy, but I'm not positive so I will ask Monty to comment on
this idea.  Of course it's possible, but I'm not sure if it
can be done with MySQL without rewriting a LOT of code (and I
don't know what the performance implications would be).

> Why not use 2 new data types, unicode char and unicode
> varchar? Then I could mix "normal" data and unicode data in
> one database. Existing databases could easily be extended to
> use unicode if needed. And it would be possible for you (MySQL
> AB) to use whatever data structures, ... you want. Of course,
> we would need functions to convert between utf-8 and ucs-2 on
> one hand and between unicode and other character sets on the
> other. We can start with a few character sets and can slowly
> let the thing grow without changing the basic code, and
> databases would not have to be converted for a new version
> that supports a few more character sets.

The plan is to have the character set be chosen at the server,
table or column level.  I mean, you could do something like this
(the syntax isn't certain yet):

    -- both s and t will use utf8_de
    create table x (s char(10), t char(10)) charset=utf8_de;

    -- s uses utf8_de, t uses utf16_cn
    create table x (s char(10), t char(10) charset=utf16_cn) charset=utf8_de;

    -- s uses the server's default charset, t uses latin1
    create table x (s char(10), t char(10) charset=latin1);

This should be flexible enough for most needs.
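
The precedence implied by these examples (column setting first, then
table, then server default) could be sketched roughly as below.  The
function name toy_effective_charset and the use of NULL for "not
specified at this level" are assumptions for illustration only, not
anything that exists in MySQL.

```c
#include <stddef.h>

/* Hypothetical helper: pick the charset that applies to a column.
   A NULL argument means "nothing was specified at this level". */
static const char *toy_effective_charset(const char *column_cs,
                                         const char *table_cs,
                                         const char *server_cs)
{
    if (column_cs != NULL)
        return column_cs;   /* most specific setting wins */
    if (table_cs != NULL)
        return table_cs;    /* otherwise fall back to the table */
    return server_cs;       /* otherwise the server default */
}
```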


Below is my earlier letter to you, for those who are reading
the list.


On Tue, Oct 31, 2000 at 04:44:52PM +0100, Kay-Michael Goertz wrote:
> 
> I needed Unicode, so I used Binary Varchar. All data are stored
> as UTF-8.  The data can be accessed from an HTML interface (using
> perl to communicate with mysql). For simple solutions this will
> work, but I'm missing a few important things like correct ordering,
> functions to convert between different encodings, and so on. Is it
> possible to take part in developing such features?

Kay-Michael,

Yes, it is possible for you to be involved with developing this.
I don't have anything done on it yet - just some research (and
not much of that).

The plan is to handle most of this in the same way as we handle
other character sets.  That is, we'll add new character sets for:

    utf8-de     ucs2-de      ucs4-de    ...
    utf8-cn     ucs2-cn
    utf8-fr     ...
    ... etc.

Obviously all of these should share as much code as possible, but
they need to be separate character sets (as MySQL sees them) in
order to have separate sorting behaviors.
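
As a rough illustration of "same encoding, different sorting", here
is a sketch in C of the two German orders mentioned earlier in the
thread: a dictionary order where ä sorts like a, and a phone book
order where ä sorts like ae.  Everything here is hypothetical (the
toy_de_* names, the Latin-1 input, the handling of lowercase umlauts
only); it is not how MySQL's character set code is laid out.

```c
#include <stddef.h>
#include <string.h>

enum toy_de_order { TOY_DE_DICTIONARY, TOY_DE_PHONEBOOK };

/* Append one byte to the key if it fits; always count it. */
static void toy_put(char *dst, size_t cap, size_t *n, char ch)
{
    if (*n + 1 < cap) dst[*n] = ch;
    (*n)++;
}

/* strxfrm-like: build a sort key for a Latin-1 string so that plain
   strcmp() on two keys gives the chosen German order.  Returns the
   full key length (excluding the NUL), even if it was truncated. */
static size_t toy_de_strxfrm(char *dst, size_t cap, const char *src,
                             enum toy_de_order order)
{
    size_t n = 0;
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        char base = 0;
        switch (c) {
        case 0xE4: base = 'a'; break;   /* a-umlaut */
        case 0xF6: base = 'o'; break;   /* o-umlaut */
        case 0xFC: base = 'u'; break;   /* u-umlaut */
        }
        if (base) {
            toy_put(dst, cap, &n, base);
            if (order == TOY_DE_PHONEBOOK)
                toy_put(dst, cap, &n, 'e');
        } else {
            toy_put(dst, cap, &n, (char)c);
        }
    }
    if (cap > 0) dst[n < cap ? n : cap - 1] = '\0';
    return n;
}

/* strcoll-like comparison built on the shared key function. */
static int toy_de_strcoll(const char *a, const char *b,
                          enum toy_de_order order)
{
    char ka[64], kb[64];
    toy_de_strxfrm(ka, sizeof ka, a, order);
    toy_de_strxfrm(kb, sizeof kb, b, order);
    return strcmp(ka, kb);
}
```

Both orders share all of the code and differ only in one flag, yet
"m\xFCller" sorts after "mug" in dictionary order and before it in
phone book order.  A real collation would also need secondary weights
to break ties such as "muller" against "m\xFCller".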


So, basically, this involves writing these functions for each of
the above character sets:

  -- these functions are for sorting, and probably will need to
  -- be different for every language we support (but might be
  -- shared among different encodings)
  strcoll
  strnncoll
  strxfrm
  strnxfrm
  like_range

  -- these functions are for multi-byte encodings, and will need
  -- to be different for each encoding (but can be shared among
  -- different languages)
  ismbchar
  ismbhead
  mbcharlen
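
For the multi-byte group, a UTF-8 version might look roughly like the
sketch below.  The toy_utf8_ prefix, the exact signatures and the
return conventions are assumptions for illustration.  It also assumes
the modern four-byte limit on UTF-8 sequences and only checks the bit
patterns, so it does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Length in bytes of a UTF-8 sequence, judged from its first byte.
   Returns 0 if the byte cannot start a sequence. */
static int toy_utf8_mbcharlen(unsigned char c)
{
    if (c < 0x80) return 1;                 /* 0xxxxxxx: ASCII      */
    if (c >= 0xC2 && c <= 0xDF) return 2;   /* 110xxxxx             */
    if (c >= 0xE0 && c <= 0xEF) return 3;   /* 1110xxxx             */
    if (c >= 0xF0 && c <= 0xF4) return 4;   /* 11110xxx             */
    return 0;                               /* continuation/invalid */
}

/* Nonzero if c is the lead byte of a sequence of 2 or more bytes. */
static int toy_utf8_ismbhead(unsigned char c)
{
    return toy_utf8_mbcharlen(c) > 1;
}

/* If a complete multi-byte sequence starts at p and fits before end,
   return its length in bytes; otherwise return 0. */
static int toy_utf8_ismbchar(const unsigned char *p,
                             const unsigned char *end)
{
    int len, i;
    if (p >= end) return 0;
    len = toy_utf8_mbcharlen(*p);
    if (len < 2 || end - p < len) return 0;
    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;  /* must be 10xxxxxx */
    return len;
}
```

None of this depends on the language, which is why one copy could be
shared by utf8-de, utf8-fr and so on.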

Another snag that I'm not positive about is upper- and lower-
case mappings.  Currently we don't handle this properly for
multi-byte encodings.  I guess nobody cares because the only
multi-byte encodings are for alphabets that don't have upper-
and lower-case letters.  BUT that's not the case with the
UCS.  I expect we'll just add two more functions to the above
list (and correspondingly to the CHARSET_INFO struct), and we
will change the macro for toupper/tolower to something like:

/* if a toupper function exists for the charset, use it;
   otherwise use the lookup table mapping */
#define toupper(c) (charset_info->to_upper_func ? \
                        charset_info->to_upper_func((c)) : \
                        charset_info->to_upper_map[(unsigned char) (c)])

This is just a hunch of what it would be like.
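
A slightly fuller sketch of the same dispatch idea, written as a
function rather than a macro so it can be compiled on its own.  The
struct toy_charset is a made-up, cut-down stand-in for CHARSET_INFO,
and toy_ucs_upper only knows ASCII and the Latin-1 letters; all names
here are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of CHARSET_INFO. */
typedef struct toy_charset {
    unsigned long (*to_upper_func)(unsigned long c); /* may be NULL */
    const unsigned char *to_upper_map;               /* 256 entries */
} toy_charset;

/* Use the charset's function if it has one, else its lookup table.
   Values outside the table are returned unchanged. */
static unsigned long toy_toupper(const toy_charset *cs, unsigned long c)
{
    if (cs->to_upper_func)
        return cs->to_upper_func(c);
    return c < 256 ? cs->to_upper_map[c] : c;
}

/* Table for a simple single-byte charset: only a-z are mapped. */
static unsigned char toy_ascii_upper_map[256];

static void toy_init_ascii_upper_map(void)
{
    int i;
    for (i = 0; i < 256; i++)
        toy_ascii_upper_map[i] =
            (unsigned char)((i >= 'a' && i <= 'z') ? i - 32 : i);
}

/* Function for a UCS-style charset: maps a-z and the lowercase
   Latin-1 letters U+00E0..U+00FE (except U+00F7, the division sign). */
static unsigned long toy_ucs_upper(unsigned long c)
{
    if (c >= 'a' && c <= 'z') return c - 32;
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7) return c - 32;
    return c;
}
```

A charset that sets to_upper_func to toy_ucs_upper maps 0xF6 to 0xD6,
while one that only has the ASCII table leaves 0xF6 alone.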


Once this work is done, convenience functions to convert a UTF-8
string into a UCS-4 string could be added - but those really
aren't that important to worry about up front.  I don't think
they'll affect anything except themselves.
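
For a sense of scale, such a conversion function is small.  Here is a
minimal sketch, assuming sequences of at most four bytes and checking
only the bit patterns (overlong forms and surrogates are not
rejected).  The name toy_utf8_to_ucs4 and its signature are made up.

```c
#include <stddef.h>

/* Decode the UTF-8 bytes in src[0..src_len) into UCS-4 code points.
   Returns the number of code points written to dst, or (size_t)-1
   if the input is malformed or dst (capacity dst_cap) is too small. */
static size_t toy_utf8_to_ucs4(const unsigned char *src, size_t src_len,
                               unsigned long *dst, size_t dst_cap)
{
    size_t i = 0, n = 0;
    while (i < src_len) {
        unsigned char c = src[i];
        unsigned long cp;
        size_t len, k;
        if (c < 0x80)                { cp = c;        len = 1; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; len = 2; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; len = 3; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; len = 4; }
        else return (size_t)-1;      /* stray continuation byte etc. */
        if (src_len - i < len || n == dst_cap)
            return (size_t)-1;       /* truncated input or no room */
        for (k = 1; k < len; k++) {
            if ((src[i + k] & 0xC0) != 0x80)
                return (size_t)-1;   /* not a continuation byte */
            cp = (cp << 6) | (src[i + k] & 0x3F);
        }
        dst[n++] = cp;
        i += len;
    }
    return n;
}
```

It touches nothing but its own input and output buffers, which fits
the point that these functions only affect themselves.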


Notice that I'm using the ISO/IEC 10646 names for the encodings.
I think we should stay away from the word "Unicode", because
nobody means Unicode when they say "Unicode".  I think we should
probably implement UTF-8 first, and then probably UTF-16 or
UCS-4 (I don't think the latter is a good fit for database
storage, since every character takes four bytes).


Another thing I'm not sure about is sorting based on more
than one language.  For example (and I'm going to expose some
ignorance here), imagine that the kanji characters are sorted
differently in Japanese and in Chinese.  And, imagine that the
latin characters are sorted differently in Swedish and in French.
In this case, you might want to sort 4 different ways:

    Japanese-Swedish
    Japanese-French
    Chinese-Swedish
    Chinese-French

Then, imagine that people want the kanji words to come before the
latin words.  Or the other way around.  How should this stuff be
handled?  Should we even care?  There could be a huge number of
combinations that people might want.  This issue can certainly
wait until the others are solved.


There are probably some other things that will need to be figured
out.  But this should give a start.  I'd like to open this up to
the public.  Would you mind my posting this summary to the
internals@stripped mailing list, and discussing it there?

Tim

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /    Tim Smith <tim@stripped>
 / /|_/ / // /\ \/ /_/ / /__   MySQL AB, Development Team
/_/  /_/\_, /___/\___\_\___/   Boone, NC  USA
       <___/   www.mysql.com