Author: paul
Date: 2007-12-13 21:17:54 +0100 (Thu, 13 Dec 2007)
New Revision: 9298
Log:
r34262@arctic: paul | 2007-12-13 14:17:24 -0600
Unicode updates. Much based on info from PeterG.
Modified:
trunk/refman-6.0/internationalization.xml
trunk/refman-6.0/unicode-tmp.xml
Property changes on: trunk
___________________________________________________________________
Name: svk:merge
- 4767c598-dc10-0410-bea0-d01b485662eb:/mysqldoc-local/mysqldoc/trunk:34260
7d8d2c4e-af1d-0410-ab9f-b038ce55645b:/mysqldoc-local/mysqldoc:28105
b5ec3a16-e900-0410-9ad2-d183a3acac99:/mysqldoc-local/mysqldoc/trunk:14218
bf112a9c-6c03-0410-a055-ad865cd57414:/mysqldoc-local/mysqldoc/trunk:22582
+ 4767c598-dc10-0410-bea0-d01b485662eb:/mysqldoc-local/mysqldoc/trunk:34262
7d8d2c4e-af1d-0410-ab9f-b038ce55645b:/mysqldoc-local/mysqldoc:28105
b5ec3a16-e900-0410-9ad2-d183a3acac99:/mysqldoc-local/mysqldoc/trunk:14218
bf112a9c-6c03-0410-a055-ad865cd57414:/mysqldoc-local/mysqldoc/trunk:22582
Modified: trunk/refman-6.0/internationalization.xml
===================================================================
--- trunk/refman-6.0/internationalization.xml 2007-12-13 20:09:10 UTC (rev 9297)
+++ trunk/refman-6.0/internationalization.xml 2007-12-13 20:17:54 UTC (rev 9298)
Changed blocks: 4, Lines Added: 490, Lines Deleted: 42; 18796 bytes
@@ -3248,49 +3248,62 @@
<programlisting>
mysql> <userinput>SHOW CHARACTER SET;</userinput>
-+----------+-----------------------------+---------------------+
-| Charset | Description | Default collation |
-+----------+-----------------------------+---------------------+
-| big5 | Big5 Traditional Chinese | big5_chinese_ci |
-| dec8 | DEC West European | dec8_swedish_ci |
-| cp850 | DOS West European | cp850_general_ci |
-| hp8 | HP West European | hp8_english_ci |
-| koi8r | KOI8-R Relcom Russian | koi8r_general_ci |
-| latin1 | cp1252 West European | latin1_swedish_ci |
-| latin2 | ISO 8859-2 Central European | latin2_general_ci |
-| swe7 | 7bit Swedish | swe7_swedish_ci |
-| ascii | US ASCII | ascii_general_ci |
-| ujis | EUC-JP Japanese | ujis_japanese_ci |
-| sjis | Shift-JIS Japanese | sjis_japanese_ci |
-| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci |
-| tis620 | TIS620 Thai | tis620_thai_ci |
-| euckr | EUC-KR Korean | euckr_korean_ci |
-| koi8u | KOI8-U Ukrainian | koi8u_general_ci |
-| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci |
-| greek | ISO 8859-7 Greek | greek_general_ci |
-| cp1250 | Windows Central European | cp1250_general_ci |
-| gbk | GBK Simplified Chinese | gbk_chinese_ci |
-| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci |
-| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci |
-| utf8 | UTF-8 Unicode | utf8_general_ci |
-| ucs2 | UCS-2 Unicode | ucs2_general_ci |
-| cp866 | DOS Russian | cp866_general_ci |
-| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci |
-| macce | Mac Central European | macce_general_ci |
-| macroman | Mac West European | macroman_general_ci |
-| cp852 | DOS Central European | cp852_general_ci |
-| latin7 | ISO 8859-13 Baltic | latin7_general_ci |
-| cp1251 | Windows Cyrillic | cp1251_general_ci |
-| cp1256 | Windows Arabic | cp1256_general_ci |
-| cp1257 | Windows Baltic | cp1257_general_ci |
-| binary | Binary pseudo charset | binary |
-| geostd8 | GEOSTD8 Georgian | geostd8_general_ci |
-| cp932 | SJIS for Windows Japanese | cp932_japanese_ci |
-| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci |
-+----------+-----------------------------+---------------------+
++----------+-----------------------------+---------------------+--------+
+| Charset | Description | Default collation | Maxlen |
++----------+-----------------------------+---------------------+--------+
+| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
+| dec8 | DEC West European | dec8_swedish_ci | 1 |
+| cp850 | DOS West European | cp850_general_ci | 1 |
+| hp8 | HP West European | hp8_english_ci | 1 |
+| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
+| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
+| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
+| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
+| ascii | US ASCII | ascii_general_ci | 1 |
+| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
+| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
+| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
+| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
+| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
+| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
+| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
+| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
+| cp1250 | Windows Central European | cp1250_general_ci | 1 |
+| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
+| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
+| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | 1 |
+| utf8mb3 | UTF-8 Unicode | utf8mb3_general_ci | 3 |
+| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
+| cp866 | DOS Russian | cp866_general_ci | 1 |
+| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | 1 |
+| macce | Mac Central European | macce_general_ci | 1 |
+| macroman | Mac West European | macroman_general_ci | 1 |
+| cp852 | DOS Central European | cp852_general_ci | 1 |
+| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
+| utf8 | UTF-8 Unicode | utf8_general_ci | 4 |
+| cp1251 | Windows Cyrillic | cp1251_general_ci | 1 |
+| utf16 | UTF-16 Unicode | utf16_general_ci | 4 |
+| cp1256 | Windows Arabic | cp1256_general_ci | 1 |
+| cp1257 | Windows Baltic | cp1257_general_ci | 1 |
+| utf32 | UTF-32 Unicode | utf32_general_ci | 4 |
+| binary | Binary pseudo charset | binary | 1 |
+| geostd8 | GEOSTD8 Georgian | geostd8_general_ci | 1 |
+| cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 |
+| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 |
++----------+-----------------------------+---------------------+--------+
</programlisting>
<para>
+ The <literal>utf8</literal> character set prior to MySQL 6.0.4
+ encoded BMP characters using up to three bytes per character. In
+ MySQL 6.0.4, this character set is renamed to
+ <literal>utf8mb3</literal>, and a new <literal>utf8</literal> is
+ added that encodes BMP and supplementary characters using up to
+ four bytes per character. The <literal>utf16</literal> and
+ <literal>utf32</literal> were also added in MySQL 6.0.4.
+ </para>
+
+ <para>
In cases where a character set has multiple collations, it might
not be clear which collation is most suitable for a given
application. To avoid choosing the wrong collation, it can be
@@ -3304,8 +3317,8 @@
<title>Unicode Character Sets</title>
<para>
- MySQL has two Unicode character sets. You can store text in
- about 650 languages using these character sets.
+ MySQL has several Unicode character sets. You can store text
+ in about 650 languages using these character sets.
</para>
<itemizedlist>
@@ -3594,6 +3607,432 @@
</itemizedlist>
</listitem>
+ <listitem>
+ <para>
+ <literal>utf8mb3</literal> (UTF-8 Unicode) collations:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_bin</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_czech_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_danish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_esperanto_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_estonian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_general_ci</literal> (default)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_hungarian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_icelandic_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_latvian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_lithuanian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_persian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_polish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_roman_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_romanian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_sinhala_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_slovak_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_slovenian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_spanish2_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_spanish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_swedish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_turkish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf8mb3_unicode_ci</literal>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16</literal> (UTF-16 Unicode) collations:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ <literal>utf16_bin</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_czech_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_danish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_esperanto_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_estonian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_general_ci</literal> (default)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_hungarian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_icelandic_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_latvian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_lithuanian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_persian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_polish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_roman_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_romanian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_sinhala_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_slovak_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_slovenian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_spanish2_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_spanish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_swedish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_turkish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf16_unicode_ci</literal>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32</literal> (UTF-32 Unicode) collations:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ <literal>utf32_bin</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_czech_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_danish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_esperanto_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_estonian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_general_ci</literal> (default)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_hungarian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_icelandic_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_latvian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_lithuanian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_persian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_polish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_roman_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_romanian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_sinhala_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_slovak_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_slovenian_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_spanish2_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_spanish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_swedish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_turkish_ci</literal>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>utf32_unicode_ci</literal>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+ </listitem>
+
</itemizedlist>
<para>
@@ -3619,6 +4058,15 @@
MySQL 5.2.6.
</para>
+ <para>
+ The collations for <literal>utf8mb3</literal>,
+ <literal>utf16</literal>, and <literal>utf16</literal> were
+ added in MySQL 6.0.4. In 6.0.4, the old (three-byte)
+ <literal>utf8</literal> character set was renamed to
+ <literal>utf8mb3</literal> and the new (four-byte)
+ <literal>utf8</literal> character set was added.
+ </para>
+
<indexterm>
<primary>Unicode Collation Algorithm</primary>
</indexterm>
Modified: trunk/refman-6.0/unicode-tmp.xml
===================================================================
--- trunk/refman-6.0/unicode-tmp.xml 2007-12-13 20:09:10 UTC (rev 9297)
+++ trunk/refman-6.0/unicode-tmp.xml 2007-12-13 20:17:54 UTC (rev 9298)
Changed blocks: 4, Lines Added: 698, Lines Deleted: 5; 24105 bytes
@@ -9,10 +9,20 @@
<title>Unicode Support</title>
<para>
- MySQL ¤t-series; supports two character sets for storing
- Unicode data:
+ [NEED CHANGELOG ENTRY] <emphasis role="bold">Incompatible
+ change:</emphasis>
</para>
+ <para>
+ [NEED UPGRADE SECTION ENTRY] <emphasis role="bold">Incompatible
+ change:</emphasis>
+ </para>
+
+ <para>
+ The initial implementation of Unicode support (in MySQL 4.1)
+ included two character sets for storing Unicode data:
+ </para>
+
<itemizedlist>
<listitem>
@@ -31,11 +41,16 @@
</itemizedlist>
<para>
- MySQL support for Unicode includes the characters of Unicode Version
- 3.0, in the Basic Multilingual Plane (BMP). The BMP characters have
- these characteristics:
+ [Do I need <emphasis>brief</emphasis> description of them here to
+ set stage for added charsets later?]
</para>
+ <para>
+ Specficially, those character sets supported the characters of
+ Unicode Version 3.0, in the Basic Multilingual Plane (BMP). The BMP
+ characters have these characteristics:
+ </para>
+
<itemizedlist>
<listitem>
@@ -67,6 +82,57 @@
</itemizedlist>
<para>
+ Characters not supported by the preceding character sets include
+ supplementary characters that lie outside the BMP. As of MySQL 6.0
+ (versions 6.0.4 and up), Unicode support does include supplementary
+ characters, which requires new character sets that have a broader
+ range and therefore take more space. The following table shows a
+ brief feature comparison of previous and current Unicode support.
+ </para>
+
+<programlisting>
+Before MySQL 6.0 MySQL 6.0
+
+all Unicode 3.0 characters all Unicode 5.0 characters
+no supplementary characters with supplementary characters
+ucs2 character set no change
+utf8 character set for up to 3 bytes renamed to utf8mb3
+ new utf8 character set for up to 4 bytes
+ utf16 character set
+ utf32 character set
+</programlisting>
+
+ <para>
+ Many of the supplementary characters represent characters in East
+ Asian languages. However, what MySQL 6.0 adds is support for more
+ Japanese and Chinese characters in Unicode character sets, not
+ support for new Japanese and Chinese character sets.
+ </para>
+
+ <para>
+ Most of these changes are upward compatible. For notes on potential
+ incompatibility issues for your applications, see
+ <xref linkend="charset-unicode-upgrading"/>.
+ </para>
+
+ <para>
+ Note particularly that an old table created before MySQL 6.0 that
+ used the <literal>utf8</literal> character set (the three-byte
+ version) is treated as using the <literal>utf8mb3</literal>
+ character set after an upgrade to MySQL 6.0. The table's contents
+ will not have changed in any way and it can continue to be used as
+ before. [Any exceptions? Literal-charset-name tests, perhaps] If you
+ like, you can convert the table to use the new
+ <literal>utf8</literal> character set (the four-byte version), as
+ described later. [WHERE]
+ </para>
+
+ <para>
+ [Still have not given a <emphasis>brief</emphasis> (bullet list) of
+ the supported character sets]
+ </para>
+
+ <para>
In UCS-2 (binary Unicode representation), every character is
represented by a two-byte Unicode code with the most significant
byte first. For example: <literal>LATIN CAPITAL LETTER A</literal>
@@ -148,4 +214,631 @@
bytes for a <literal>CHAR(10) CHARACTER SET utf8</literal> column.
</para>
+ <section id="charset-unicode-characteristics">
+
+ <title>Characteristics of Unicode Character Sets</title>
+
+ <para>
+ [This can be xref-ed from the section that lists character sets]
+ </para>
+
+ <para>
+ Unicode and MySQL character set names:
+ </para>
+
+<programlisting>
+Unicode MySQL
+
+UCS-2 ucs2
+UTF-8 utf8
+UTF-16 utf16
+UTF-32 utf32
+</programlisting>
+
+ <para>
+ <emphasis role="bold">Old Character Set
+ <literal>ucs2</literal></emphasis>
+ </para>
+
+ <para>
+ The <literal>ucs2</literal> character set does not change in
+ essence. There are some Unicode-5.0 characters that are inside the
+ BMP, which MySQL will support with <literal>ucs2</literal>. But
+ <literal>ucs2</literal> remains what it always has been: a fixed
+ 16-bit encoding for Unicode BMP characters.
+ </para>
+
+ <para>
+ <emphasis role="bold">New Character Set
+ <literal>utf16</literal></emphasis>
+ </para>
+
+ <para>
+ The <literal>utf16</literal> character set is the
+ <literal>ucs2</literal> character set with an extension that
+ enables encoding of supplementary characters.
+ </para>
+
+ <para>
+ For a BMP character, <literal>utf16</literal> and
+ <literal>ucs2</literal> have identical storage characteristics:
+ same code values, same encoding, same length.
+ </para>
+
+ <para>
+ For a supplementary character, <literal>utf16</literal> has a
+ special sequence for representing the character using 32 bits.
+ This is called the <quote>surrogate</quote> mechanism: For a
+ number greater than 0xffff, take 10 bits and add them to 0xd800
+ and put them in the first 16-bit word, take 10 more bits and add
+ them to 0xdc00 and put them in the next 16-bit word. So all
+ supplementary characters require 32 bits in all, and the first 16
+ bits are a number between 0xd800 and 0xdbff, and the last 16 bits
+ are a number between 0xdc00 and 0xdfff. Examples are in "15.5
+ Surrogates Area" in the Unicode 4.0 document
+ http://unicode.org/versions/Unicode4.0.0/ch15.pdf
+ </para>
+
+ <para>
+ Because <literal>utf16</literal> supports surrogates and
+ <literal>ucs2</literal> doesn't, there is a validity check which
+ applies only in <literal>utf16</literal>: you can't insert a top
+ surrogate without a bottom surrogate, or vice versa. For example:
+ </para>
+
+<programlisting>
+INSERT INTO t (ucs2_column) VALUES (0xd800); /* legal */
+INSERT INTO t (utf16_column)VALUES (0xd800); /* illegal */
+</programlisting>
+
+ <para>
+ There is no validity check for characters which are technically
+ valid but are not true Unicode (that is, Unicode considers them to
+ be <quote>unassigned code points</quote> or <quote>private
+ use</quote> characters or even <quote>illegals</quote> like
+ 0xffff). For example, since U+F8FF is the Apple Logo:
+ </para>
+
+<programlisting>
+INSERT INTO t (utf16_column)VALUES (0xf8ff); /* legal */
+</programlisting>
+
+ <para>
+ Such characters cannot be expected to mean the same thing to
+ everyone.
+ </para>
+
+ <para>
+ Because MySQL must allow for the worst case (that 1 character
+ requires 4 bytes) the maximum length of a <literal>utf16</literal>
+ column or index is only one half of the maximum length for a
+ <literal>ucs2</literal> column or index. For example, since at the
+ time of writing the maximum length of a Falcon index key is 1100
+ bytes, this is legal: [Using default index block size?]
+ </para>
+
+<programlisting>
+CREATE TABLE tf (s1 VARCHAR(550) CHARACTER SET ucs2) engine=falcon;
+CREATE INDEX i ON tf (s1);
+CREATE TABLE tg (s1 VARCHAR(275) CHARACTER SET utf16) engine=falcon;
+CREATE INDEX i ON tg (s1);
+</programlisting>
+
+ <para>
+ You cannot use <literal>utf16</literal> as a client character set.
+ [No SET NAMES]
+ </para>
+
+ <para>
+ <emphasis role="bold">Revised Character Set
+ <literal>utf8</literal></emphasis>
+ </para>
+
+ <para>
+ We changed the characteristics of the <literal>utf8</literal>
+ character set, from (maximum 3 bytes per character) to (maximum 4
+ bytes per character). We had reasons:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ <literal>utf8</literal> is the 'SQL:2003 standard' name for
+ the full-repertoiree character set
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ because the new is 'upward compatible' with the old, with
+ respect to repertoire, switching to MySQL 6.0 will involve no
+ conversion of existing data or definitions, unless columns are
+ very long.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ [Move to upgrade notes?] Let's not minimize that subordinate
+ clause "unless columns are very long". The catch, as with
+ <literal>utf16</literal> versus <literal>ucs2</literal>, is that
+ the maximum length of a column or index key is unchanged in terms
+ of bytes. Therefore it is smaller in terms of characters, because
+ the maximum length of a character is 4 bytes instead of 3. The
+ three things to watch for are:
+ </para>
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ Look at all application programs, and make sure that buffer
+ lengths aren't calculated with an algorithm like "number of
+ characters times 3"
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Look at all indexes on <literal>utf8</literal> columns, and
+ make sure they won't exceed the maximum length for the storage
+ engine. Sometimes the maximum can change due to storage engine
+ enhancements.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Look at the definitions of <literal>utf8</literal> columns,
+ and make sure they won't exceed the maximum length for the
+ storage engine.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Do the same for <literal>TEXT</literal> as well as
+ <literal>CHAR</literal> or <literal>VARCHAR</literal>.
+ </para>
+ </listitem>
+
+ </orderedlist>
+
+ <para>
+ The MySQL Reference Manual will call this an INCOMPATIBLE CHANGE.
+ </para>
+
+ <para>
+ If the above situations apply for you, then you will have to
+ reduce the defined length, or you will have to change the column
+ definition from <literal>CHARACTER SET utf8</literal> to
+ <literal>CHARACTER SET utf8mb3</literal> (see
+ <literal>utf8mb3</literal> description in next section). Other
+ than that, we currently expect that the transition from 5.1 to 6.0
+ will be painless.
+ </para>
+
+ <para>
+ For a BMP character, the 5.1 and 6.0 versions of
+ <literal>utf8</literal> have identical storage characteristics:
+ same code values, same encoding, same length.
+ </para>
+
+ <para>
+ For a supplementary character, <literal>utf8</literal> in 5.1
+ can't store the character at all, while <literal>utf8</literal> in
+ 6.0 requires 4 bytes to store. Since <literal>utf8</literal> in
+ 5.1 can't store the character at all, you don't have any
+ supplementary characters in <literal>utf8</literal> columns in
+ your 5.1 database, so you don't have to worry about converting
+ characters or losing data.
+ </para>
+
+ <para>
+ <emphasis role="bold">Renamed Character Set
+ <literal>utf8mb3</literal></emphasis>
+ </para>
+
+ <para>
+ For the few cases where it's desirable to have complete
+ compatibility with the old <literal>utf8</literal> character set,
+ you can define with <literal>CHARACTER SET utf8mb3</literal>. The
+ <literal>utf8mb3</literal> character set has these
+ characteristics:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ No support for supplementary characters (BMP characters only)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Maximum of three bytes per multi-byte character
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ In other words, <literal>utf8mb3</literal> in MySQL 6.0 has
+ exactly the same characteristics as <literal>utf8</literal> in
+ MySQL 5.1. We expect that <literal>utf8mb3</literal> will only be
+ needed when updating from MySQL 5.1, as described in the previous
+ section.
+ </para>
+
+ <para>
+ Exactly the same set of characters is available in
+ <literal>utf8mb3</literal> as in <literal>ucs2</literal>, that is,
+ they have the same repertoire.
+ </para>
+
+ <para>
+ <emphasis role="bold">New Character Set
+ <literal>utf32</literal></emphasis>
+ </para>
+
+ <para>
+ The <literal>utf32</literal> character set is fixed length (like
+ <literal>ucs2</literal>), but uses 32 bits for every character
+ (unlike <literal>ucs2</literal>, which uses 16 bits for every
+ character).
+ </para>
+
+ <para>
+ Probably most people will avoid <literal>utf32</literal> for space
+ reasons. But it does have two advantages: it's predictable for
+ storage (number of bytes = number of characters times 4), and
+ there are no tricks in encoding so the stored value = the code
+ value.
+ </para>
+
+ <para>
+ To show how the latter advantage is useful, here's a trick that
+ shows how a user can determine the <literal>utf8</literal> value
+ given the <literal>utf32</literal> code value.
+ </para>
+
+<programlisting>
+/* Assume code value = 100cc LINEAR B CHARIOT WHEEL */
+CREATE TABLE tmp (utf32 CHAR(1) CHARACTER SET utf32,
+ utf8 CHAR(1) CHARACTER SET utf8);
+INSERT INTO tmp VALUES (0x000100cc,NULL);
+UPDATE tmp SET utf8 = utf32;
+SELECT HEX(utf32),HEX(utf8) FROM tmp;
+</programlisting>
+
+ <para>
+ Storage is big-endian, that is, the highest byte comes first and
+ there is no need for a mark to specify the byte ordering.
+ </para>
+
+ <para>
+ MySQL is very forgiving about additions of unassigned Unicode
+ characters, private-use-area characters, and other code values
+ that you won't find in the official Unicode 5.0 Character Database
+ (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt). There is
+ in fact only one validity check for <literal>utf32</literal>: no
+ code value may be greater than 0x10ffff. For example this is
+ illegal:
+ </para>
+
+<programlisting>
+INSERT INTO t (utf32_column) VALUES (0x110000); /* illegal */
+</programlisting>
+
+ <para>
+ You cannot use <literal>utf32</literal> as a client character set.
+ [No SET NAMES]
+ </para>
+
+ <para>
+ [Mention that new charsets have collations the correspond to those
+ of the existing charsets? Give some examples, and xref the
+ collation-listing page.]
+ </para>
+
+ <para>
+ [Say something about collation properties of supplementary
+ characters]
+ </para>
+
+ <para>
+ <emphasis role="bold">How MySQL Collates New Characters</emphasis>
+ </para>
+
+ <para>
+ For all collations except the "binary" (_bin) collations, MySQL
+ performs a table lookup to find a character's collating weight.
+ This weight is visible with the new weight_string() function. (See
+ here for description of weight_string:
+ http://forge.mysql.com/worklog/task.php?id=3716) But what if a
+ character isn't in the table, for example because it's a "new"
+ character? In that case things get more complex.
+ </para>
+
+ <para>
+ For BMP characters in general collations, e.g. utf8_general_ci,
+ weight = code point.
+ </para>
+
+ <para>
+ For BMP characters in uca collations, e.g. utf8_unicode_ci,
+ </para>
+
+<programlisting>
+if (code >= 0x3400 && code <= 0x4DB5)
+ base= 0xFB80; /* CJK Ideograph Extension */
+else if (code >= 0x4E00 && code <= 0x9FA5)
+ base= 0xFB40; /* CJK Ideograph */
+else
+ base= 0xFBC0; /* All other characters */
+aaaa= base + (code >> 15);
+bbbb= (code & 0x7FFF) | 0x8000;
+</programlisting>
+
+ <para>
+ For example,
+ </para>
+
+<programlisting>
+mysql> SELECT HEX(weight_string(_ucs2 0x04CF COLLATE ucs2_unicode_ci));
++-----------------------------------------------------------+
+| hex(weight_string(_ucs2 0x04CF collate ucs2_unicode_ci)) |
++-----------------------------------------------------------+
+| FBC084CF |
++-----------------------------------------------------------+
+1 row in set (0.00 sec)
+</programlisting>
+
+ <para>
+ Thus, U+04cf CYRILLIC SMALL LETTER PALOCHKA is, with all uca
+ collations, greater than U+04c0 CYRILLIC LETTER PALOCHKA. Only
+ later, after MySQL upgrades the collations according to WL#2673,
+ will all palochkas sort together.
+ </para>
+
+ <para>
+ For supplementary characters: the weight is the weight for 0xfffd
+ REPLACEMENT CHARACTER. That is, to MySQL, all supplementary
+ characters are equal to each other, and greater than almost all
+ BMP characters.
+ </para>
+
+ <para>
+ An example with Deseret characters and COUNT(DISTINCT):
+ </para>
+
+<programlisting>
+CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
+INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */
+INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */
+INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */
+SELECT COUNT(DISTINCT s1) FROM t;
+</programlisting>
+
+ <para>
+ The result is: 1, because Deseret Bee = Deseret Tee = Replacement
+ Character, in the MySQL Unicode collation.
+ </para>
+
+ <para>
+ An example with cuneiform characters and WEIGHT_STRING():
+ </para>
+
+<programlisting>
+/*
+The four characters in the INSERT string are
+00000041 # LATIN CAPITAL LETTER A
+0001218F # CUNEIFORM SIGN KAB
+000121A7 # CUNEIFORM SIGN KISH
+00000042 # LATIN CAPITAL LETTER B
+*/
+CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
+INSERT INTO t VALUES (0x000000410001218f000121a700000042);
+SELECT HEX(WEIGHT_STRING(s1)) FROM t;
+</programlisting>
+
+ <para>
+ The result is:
+ </para>
+
+<programlisting>
+0E33 FBC0 8FFD FBC0 8FFD 0E4A
+</programlisting>
+
+ <para>
+ where 0E33 and 0E4a are primary weights as in UCA 4.0.0
+ ftp://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt where
+ FBC0 8FFD is the weight for 0x8ffd according to the calculation
+ given above "For BMP characters in uca collations".
+ </para>
+
+ <para>
+ The "supplementary character collation" rule is non-wonderful. It
+ was necessary, so that we wouldn't have to rewrite the collation
+ algorithms at the same time that we were making possibly
+ disruptive changes to the character sets. We don't expect the rule
+ to cause trouble, because the characters are very rare, so it will
+ be very rare that a multi-character string consists entirely of
+ supplementary characters. In Japan, since the supplementary
+ characters are obscure Kanji ideographs, the typical user doesn't
+ care what order they're in anyway. However, if you really want to
+ get rows sorted by MySQL's rule and secondarily by code point
+ value, it's easy:
+ </para>
+
+<programlisting>
+ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
+</programlisting>
+
+ <para>
+ <emphasis role="bold">The utf16_bin Collation</emphasis>
+ </para>
+
+ <para>
+ [Move to Unicode collations page?]
+ </para>
+
+ <para>
+ There is a difference between "ordering by the character's code
+ value" and "ordering by the character's binary representation", a
+ difference which only appears with utf16_bin, because of
+ surrogates.
+ </para>
+
+ <para>
+ Suppose utf16_bin was a binary comparison "byte by byte" rather
+ than "character by character". If that were so, then the order of
+ characters in utf16_bin (binary collation for utf16) would differ
+ from the order in utf8_bin. For example:
+ </para>
+
+ <para>
+ Here is a chart showing two rare characters. The first character
+ is in the range E000-FFFF, so it is greater than a surrogate but
+ less than a supplementary. The second character is a
+ supplementary.
+ </para>
+
+<programlisting>
+Code point Character utf8 utf16
+---------- --------- ---- -----
+0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D
+10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
+</programlisting>
+
+ <para>
+ The two characters in the chart are in order by code point value,
+ because 0xff9d < 0x10384. And they are in order by utf8 value,
+ because 0xef < 0xf0. But they are not in order by utf16 value,
+ if we use byte-by-byte comparison, because 0xff > 0xd8.
+ </para>
+
+ <para>
+ So MySQL's utf16_bin collation is not "byte by byte". It is "code
+ point". When MySQL sees a supplementary-character encoding in
+ utf16, it converts to the character's code-point value, and then
+ compares. Therefore utf8_bin and utf16_bin are the same ordering.
+ This is consistent with the SQL:2008 standard requirement for a
+ UCS_BASIC collation: "UCS_BASIC is a collation in which the
+ ordering is determined entirely by the Unicode scalar values of
+ the characters in the strings being sorted. It is applicable to
+ the UCS character repertoire. Since every character repertoire is
+ a subset of the UCS repertoire, the UCS_BASIC collation is
+ potentially applicable to every character set. NOTE 11 — The
+ Unicode scalar value of a character is its code point treated as
+ an unsigned integer."
+ </para>
+
+ <para>
+ If the character set is ucs2, then comparison is byte-by-byte, but
+ ucs2 strings shouldn't contain surrogates anyway.
+ </para>
+
+ </section>
+
+ <section id="charset-unicode-upgrading">
+
+ <title>Upgrading from Previous to Current Unicode Support</title>
+
+ <para>
+ Upgrading notes:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ What to do to upgrade
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Run mysql_upgrade, just as for any upgrade. (But this may
+ cause downgrading problems.)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ utf8 becomes utf8mb3: Potential gotchas (literal name tests)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ New utf8 requires more space: Potential gotchas (possibly
+ shorter max column lengths, shorter index lengths)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ ucs2 restrictions that now apply to utf16/utf32
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ Incompatibility topics to cover:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ Column and index length issues
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Issues related to renaming of utf8 to utf8mb3 and new utf8
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ What will prevent a *down*grade
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Workaround for event scheduler getting disabled?
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ Give some specific examples to illustrate things that may break:
+ Maximum length in characters of <literal>VARCHAR</literal> columns
+ is 65535/4, not 65535/3. Maximum length in characters of
+ <literal>TINYTEXT</literal> columns is 255/4, not 255/3. Give
+ example for CHAR(255) in InnoDB: Full column could be indexed in
+ old utf8, but cannot in new utf8.
+ </para>
+
+ <para>
+ <emphasis role="bold">Identifiers</emphasis>
+ </para>
+
+ <para>
+ MySQL does not allow use of supplementary characters for
+ identifiers. If you try to use such a character in a name, you
+ will get an error message.
+ </para>
+
+ </section>
+
</section>
| Thread |
|---|
| • svn commit - mysqldoc@docsrva: r9298 - in trunk: . refman-6.0 | paul | 13 Dec |