List:Commits« Previous MessageNext Message »
From:paul Date:December 13 2007 8:17pm
Subject:svn commit - mysqldoc@docsrva: r9298 - in trunk: . refman-6.0
View as plain text  
Author: paul
Date: 2007-12-13 21:17:54 +0100 (Thu, 13 Dec 2007)
New Revision: 9298

Log:
 r34262@arctic:  paul | 2007-12-13 14:17:24 -0600
 Unicode updates. Much based on info from PeterG.


Modified:
   trunk/refman-6.0/internationalization.xml
   trunk/refman-6.0/unicode-tmp.xml

Property changes on: trunk
___________________________________________________________________
Name: svk:merge
   - 4767c598-dc10-0410-bea0-d01b485662eb:/mysqldoc-local/mysqldoc/trunk:34260
7d8d2c4e-af1d-0410-ab9f-b038ce55645b:/mysqldoc-local/mysqldoc:28105
b5ec3a16-e900-0410-9ad2-d183a3acac99:/mysqldoc-local/mysqldoc/trunk:14218
bf112a9c-6c03-0410-a055-ad865cd57414:/mysqldoc-local/mysqldoc/trunk:22582
   + 4767c598-dc10-0410-bea0-d01b485662eb:/mysqldoc-local/mysqldoc/trunk:34262
7d8d2c4e-af1d-0410-ab9f-b038ce55645b:/mysqldoc-local/mysqldoc:28105
b5ec3a16-e900-0410-9ad2-d183a3acac99:/mysqldoc-local/mysqldoc/trunk:14218
bf112a9c-6c03-0410-a055-ad865cd57414:/mysqldoc-local/mysqldoc/trunk:22582


Modified: trunk/refman-6.0/internationalization.xml
===================================================================
--- trunk/refman-6.0/internationalization.xml	2007-12-13 20:09:10 UTC (rev 9297)
+++ trunk/refman-6.0/internationalization.xml	2007-12-13 20:17:54 UTC (rev 9298)
Changed blocks: 4, Lines Added: 490, Lines Deleted: 42; 18796 bytes

@@ -3248,49 +3248,62 @@
 
 <programlisting>
 mysql&gt; <userinput>SHOW CHARACTER SET;</userinput>
-+----------+-----------------------------+---------------------+
-| Charset  | Description                 | Default collation   |
-+----------+-----------------------------+---------------------+
-| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |
-| dec8     | DEC West European           | dec8_swedish_ci     |
-| cp850    | DOS West European           | cp850_general_ci    |
-| hp8      | HP West European            | hp8_english_ci      |
-| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |
-| latin1   | cp1252 West European        | latin1_swedish_ci   |
-| latin2   | ISO 8859-2 Central European | latin2_general_ci   |
-| swe7     | 7bit Swedish                | swe7_swedish_ci     |
-| ascii    | US ASCII                    | ascii_general_ci    |
-| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |
-| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |
-| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |
-| tis620   | TIS620 Thai                 | tis620_thai_ci      |
-| euckr    | EUC-KR Korean               | euckr_korean_ci     |
-| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |
-| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |
-| greek    | ISO 8859-7 Greek            | greek_general_ci    |
-| cp1250   | Windows Central European    | cp1250_general_ci   |
-| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |
-| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |
-| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |
-| utf8     | UTF-8 Unicode               | utf8_general_ci     |
-| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |
-| cp866    | DOS Russian                 | cp866_general_ci    |
-| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |
-| macce    | Mac Central European        | macce_general_ci    |
-| macroman | Mac West European           | macroman_general_ci |
-| cp852    | DOS Central European        | cp852_general_ci    |
-| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |
-| cp1251   | Windows Cyrillic            | cp1251_general_ci   |
-| cp1256   | Windows Arabic              | cp1256_general_ci   |
-| cp1257   | Windows Baltic              | cp1257_general_ci   |
-| binary   | Binary pseudo charset       | binary              |
-| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |
-| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |
-| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |
-+----------+-----------------------------+---------------------+
++----------+-----------------------------+---------------------+--------+
+| Charset  | Description                 | Default collation   | Maxlen |
++----------+-----------------------------+---------------------+--------+
+| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 | 
+| dec8     | DEC West European           | dec8_swedish_ci     |      1 | 
+| cp850    | DOS West European           | cp850_general_ci    |      1 | 
+| hp8      | HP West European            | hp8_english_ci      |      1 | 
+| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 | 
+| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 | 
+| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 | 
+| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 | 
+| ascii    | US ASCII                    | ascii_general_ci    |      1 | 
+| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 | 
+| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 | 
+| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 | 
+| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 | 
+| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 | 
+| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 | 
+| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 | 
+| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 | 
+| cp1250   | Windows Central European    | cp1250_general_ci   |      1 | 
+| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 | 
+| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 | 
+| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 | 
+| utf8mb3  | UTF-8 Unicode               | utf8mb3_general_ci  |      3 | 
+| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 | 
+| cp866    | DOS Russian                 | cp866_general_ci    |      1 | 
+| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 | 
+| macce    | Mac Central European        | macce_general_ci    |      1 | 
+| macroman | Mac West European           | macroman_general_ci |      1 | 
+| cp852    | DOS Central European        | cp852_general_ci    |      1 | 
+| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |      1 | 
+| utf8     | UTF-8 Unicode               | utf8_general_ci     |      4 | 
+| cp1251   | Windows Cyrillic            | cp1251_general_ci   |      1 | 
+| utf16    | UTF-16 Unicode              | utf16_general_ci    |      4 | 
+| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 | 
+| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 | 
+| utf32    | UTF-32 Unicode              | utf32_general_ci    |      4 | 
+| binary   | Binary pseudo charset       | binary              |      1 | 
+| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |      1 | 
+| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |      2 | 
+| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |      3 | 
++----------+-----------------------------+---------------------+--------+
 </programlisting>
 
       <para>
+        The <literal>utf8</literal> character set prior to MySQL 6.0.4
+        encoded BMP characters using up to three bytes per character. In
+        MySQL 6.0.4, this character set is renamed to
+        <literal>utf8mb3</literal>, and a new <literal>utf8</literal> is
+        added that encodes BMP and supplementary characters using up to
+        four bytes per character. The <literal>utf16</literal> and
+        <literal>utf32</literal> were also added in MySQL 6.0.4.
+      </para>
+
+      <para>
         In cases where a character set has multiple collations, it might
         not be clear which collation is most suitable for a given
         application. To avoid choosing the wrong collation, it can be

@@ -3304,8 +3317,8 @@
         <title>Unicode Character Sets</title>
 
         <para>
-          MySQL has two Unicode character sets. You can store text in
-          about 650 languages using these character sets.
+          MySQL has several Unicode character sets. You can store text
+          in about 650 languages using these character sets.
         </para>
 
         <itemizedlist>

@@ -3594,6 +3607,432 @@
             </itemizedlist>
           </listitem>
 
+          <listitem>
+            <para>
+              <literal>utf8mb3</literal> (UTF-8 Unicode) collations:
+            </para>
+
+            <itemizedlist>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_bin</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_czech_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_danish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_esperanto_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_estonian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_general_ci</literal> (default)
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_hungarian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_icelandic_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_latvian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_lithuanian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_persian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_polish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_roman_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_romanian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_sinhala_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_slovak_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_slovenian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_spanish2_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_spanish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_swedish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_turkish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf8mb3_unicode_ci</literal>
+                </para>
+              </listitem>
+
+            </itemizedlist>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>utf16</literal> (UTF-16 Unicode) collations:
+            </para>
+
+            <itemizedlist>
+
+              <listitem>
+                <para>
+                  <literal>utf16_bin</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_czech_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_danish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_esperanto_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_estonian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_general_ci</literal> (default)
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_hungarian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_icelandic_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_latvian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_lithuanian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_persian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_polish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_roman_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_romanian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_sinhala_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_slovak_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_slovenian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_spanish2_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_spanish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_swedish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_turkish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf16_unicode_ci</literal>
+                </para>
+              </listitem>
+
+            </itemizedlist>
+          </listitem>
+
+          <listitem>
+            <para>
+              <literal>utf32</literal> (UTF-32 Unicode) collations:
+            </para>
+
+            <itemizedlist>
+
+              <listitem>
+                <para>
+                  <literal>utf32_bin</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_czech_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_danish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_esperanto_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_estonian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_general_ci</literal> (default)
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_hungarian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_icelandic_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_latvian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_lithuanian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_persian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_polish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_roman_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_romanian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_sinhala_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_slovak_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_slovenian_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_spanish2_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_spanish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_swedish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_turkish_ci</literal>
+                </para>
+              </listitem>
+
+              <listitem>
+                <para>
+                  <literal>utf32_unicode_ci</literal>
+                </para>
+              </listitem>
+
+            </itemizedlist>
+          </listitem>
+
         </itemizedlist>
 
         <para>

@@ -3619,6 +4058,15 @@
           MySQL 5.2.6.
         </para>
 
+        <para>
+          The collations for <literal>utf8mb3</literal>,
+          <literal>utf16</literal>, and <literal>utf16</literal> were
+          added in MySQL 6.0.4. In 6.0.4, the old (three-byte)
+          <literal>utf8</literal> character set was renamed to
+          <literal>utf8mb3</literal> and the new (four-byte)
+          <literal>utf8</literal> character set was added.
+        </para>
+
         <indexterm>
           <primary>Unicode Collation Algorithm</primary>
         </indexterm>


Modified: trunk/refman-6.0/unicode-tmp.xml
===================================================================
--- trunk/refman-6.0/unicode-tmp.xml	2007-12-13 20:09:10 UTC (rev 9297)
+++ trunk/refman-6.0/unicode-tmp.xml	2007-12-13 20:17:54 UTC (rev 9298)
Changed blocks: 4, Lines Added: 698, Lines Deleted: 5; 24105 bytes

@@ -9,10 +9,20 @@
   <title>Unicode Support</title>
 
   <para>
-    MySQL &current-series; supports two character sets for storing
-    Unicode data:
+    [NEED CHANGELOG ENTRY] <emphasis role="bold">Incompatible
+    change:</emphasis>
   </para>
 
+  <para>
+    [NEED UPGRADE SECTION ENTRY] <emphasis role="bold">Incompatible
+    change:</emphasis>
+  </para>
+
+  <para>
+    The initial implementation of Unicode support (in MySQL 4.1)
+    included two character sets for storing Unicode data:
+  </para>
+
   <itemizedlist>
 
     <listitem>

@@ -31,11 +41,16 @@
   </itemizedlist>
 
   <para>
-    MySQL support for Unicode includes the characters of Unicode Version
-    3.0, in the Basic Multilingual Plane (BMP). The BMP characters have
-    these characteristics:
+    [Do I need <emphasis>brief</emphasis> description of them here to
+    set stage for added charsets later?]
   </para>
 
+  <para>
+    Specficially, those character sets supported the characters of
+    Unicode Version 3.0, in the Basic Multilingual Plane (BMP). The BMP
+    characters have these characteristics:
+  </para>
+
   <itemizedlist>
 
     <listitem>

@@ -67,6 +82,57 @@
   </itemizedlist>
 
   <para>
+    Characters not supported by the preceding character sets include
+    supplementary characters that lie outside the BMP. As of MySQL 6.0
+    (versions 6.0.4 and up), Unicode support does include supplementary
+    characters, which requires new character sets that have a broader
+    range and therefore take more space. The following table shows a
+    brief feature comparison of previous and current Unicode support.
+  </para>
+
+<programlisting>
+Before MySQL 6.0                      MySQL 6.0
+
+all Unicode 3.0 characters            all Unicode 5.0 characters
+no supplementary characters           with supplementary characters
+ucs2 character set                    no change
+utf8 character set for up to 3 bytes  renamed to utf8mb3
+                                      new utf8 character set for up to 4 bytes
+                                      utf16 character set
+                                      utf32 character set
+</programlisting>
+
+  <para>
+    Many of the supplementary characters represent characters in East
+    Asian languages. However, what MySQL 6.0 adds is support for more
+    Japanese and Chinese characters in Unicode character sets, not
+    support for new Japanese and Chinese character sets.
+  </para>
+
+  <para>
+    Most of these changes are upward compatible. For notes on potential
+    incompatibility issues for your applications, see
+    <xref linkend="charset-unicode-upgrading"/>.
+  </para>
+
+  <para>
+    Note particularly that an old table created before MySQL 6.0 that
+    used the <literal>utf8</literal> character set (the three-byte
+    version) is treated as using the <literal>utf8mb3</literal>
+    character set after an upgrade to MySQL 6.0. The table's contents
+    will not have changed in any way and it can continue to be used as
+    before. [Any exceptions? Literal-charset-name tests, perhaps] If you
+    like, you can convert the table to use the new
+    <literal>utf8</literal> character set (the four-byte version), as
+    described later. [WHERE]
+  </para>
+
+  <para>
+    [Still have not given a <emphasis>brief</emphasis> (bullet list) of
+    the supported character sets]
+  </para>
+
+  <para>
     In UCS-2 (binary Unicode representation), every character is
     represented by a two-byte Unicode code with the most significant
     byte first. For example: <literal>LATIN CAPITAL LETTER A</literal>

@@ -148,4 +214,631 @@
     bytes for a <literal>CHAR(10) CHARACTER SET utf8</literal> column.
   </para>
 
+  <section id="charset-unicode-characteristics">
+
+    <title>Characteristics of Unicode Character Sets</title>
+
+    <para>
+      [This can be xref-ed from the section that lists character sets]
+    </para>
+
+    <para>
+      Unicode and MySQL character set names:
+    </para>
+
+<programlisting>
+Unicode   MySQL
+
+UCS-2     ucs2
+UTF-8     utf8
+UTF-16    utf16
+UTF-32    utf32
+</programlisting>
+
+    <para>
+      <emphasis role="bold">Old Character Set
+      <literal>ucs2</literal></emphasis>
+    </para>
+
+    <para>
+      The <literal>ucs2</literal> character set does not change in
+      essence. There are some Unicode-5.0 characters that are inside the
+      BMP, which MySQL will support with <literal>ucs2</literal>. But
+      <literal>ucs2</literal> remains what it always has been: a fixed
+      16-bit encoding for Unicode BMP characters.
+    </para>
+
+    <para>
+      <emphasis role="bold">New Character Set
+      <literal>utf16</literal></emphasis>
+    </para>
+
+    <para>
+      The <literal>utf16</literal> character set is the
+      <literal>ucs2</literal> character set with an extension that
+      enables encoding of supplementary characters.
+    </para>
+
+    <para>
+      For a BMP character, <literal>utf16</literal> and
+      <literal>ucs2</literal> have identical storage characteristics:
+      same code values, same encoding, same length.
+    </para>
+
+    <para>
+      For a supplementary character, <literal>utf16</literal> has a
+      special sequence for representing the character using 32 bits.
+      This is called the <quote>surrogate</quote> mechanism: For a
+      number greater than 0xffff, take 10 bits and add them to 0xd800
+      and put them in the first 16-bit word, take 10 more bits and add
+      them to 0xdc00 and put them in the next 16-bit word. So all
+      supplementary characters require 32 bits in all, and the first 16
+      bits are a number between 0xd800 and 0xdbff, and the last 16 bits
+      are a number between 0xdc00 and 0xdfff. Examples are in "15.5
+      Surrogates Area" in the Unicode 4.0 document
+      http://unicode.org/versions/Unicode4.0.0/ch15.pdf
+    </para>
+
+    <para>
+      Because <literal>utf16</literal> supports surrogates and
+      <literal>ucs2</literal> doesn't, there is a validity check which
+      applies only in <literal>utf16</literal>: you can't insert a top
+      surrogate without a bottom surrogate, or vice versa. For example:
+    </para>
+
+<programlisting>
+INSERT INTO t (ucs2_column) VALUES (0xd800); /* legal */
+INSERT INTO t (utf16_column)VALUES (0xd800); /* illegal */
+</programlisting>
+
+    <para>
+      There is no validity check for characters which are technically
+      valid but are not true Unicode (that is, Unicode considers them to
+      be <quote>unassigned code points</quote> or <quote>private
+      use</quote> characters or even <quote>illegals</quote> like
+      0xffff). For example, since U+F8FF is the Apple Logo:
+    </para>
+
+<programlisting>
+INSERT INTO t (utf16_column)VALUES (0xf8ff); /* legal */
+</programlisting>
+
+    <para>
+      Such characters cannot be expected to mean the same thing to
+      everyone.
+    </para>
+
+    <para>
+      Because MySQL must allow for the worst case (that 1 character
+      requires 4 bytes) the maximum length of a <literal>utf16</literal>
+      column or index is only one half of the maximum length for a
+      <literal>ucs2</literal> column or index. For example, since at the
+      time of writing the maximum length of a Falcon index key is 1100
+      bytes, this is legal: [Using default index block size?]
+    </para>
+
+<programlisting>
+CREATE TABLE tf (s1 VARCHAR(550) CHARACTER SET ucs2) engine=falcon;
+CREATE INDEX i ON tf (s1);
+CREATE TABLE tg (s1 VARCHAR(275) CHARACTER SET utf16) engine=falcon;
+CREATE INDEX i ON tg (s1);
+</programlisting>
+
+    <para>
+      You cannot use <literal>utf16</literal> as a client character set.
+      [No SET NAMES]
+    </para>
+
+    <para>
+      <emphasis role="bold">Revised Character Set
+      <literal>utf8</literal></emphasis>
+    </para>
+
+    <para>
+      We changed the characteristics of the <literal>utf8</literal>
+      character set, from (maximum 3 bytes per character) to (maximum 4
+      bytes per character). We had reasons:
+    </para>
+
+    <itemizedlist>
+
+      <listitem>
+        <para>
+          <literal>utf8</literal> is the 'SQL:2003 standard' name for
+          the full-repertoiree character set
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          because the new is 'upward compatible' with the old, with
+          respect to repertoire, switching to MySQL 6.0 will involve no
+          conversion of existing data or definitions, unless columns are
+          very long.
+        </para>
+      </listitem>
+
+    </itemizedlist>
+
+    <para>
+      [Move to upgrade notes?] Let's not minimize that subordinate
+      clause "unless columns are very long". The catch, as with
+      <literal>utf16</literal> versus <literal>ucs2</literal>, is that
+      the maximum length of a column or index key is unchanged in terms
+      of bytes. Therefore it is smaller in terms of characters, because
+      the maximum length of a character is 4 bytes instead of 3. The
+      three things to watch for are:
+    </para>
+
+    <orderedlist>
+
+      <listitem>
+        <para>
+          Look at all application programs, and make sure that buffer
+          lengths aren't calculated with an algorithm like "number of
+          characters times 3"
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Look at all indexes on <literal>utf8</literal> columns, and
+          make sure they won't exceed the maximum length for the storage
+          engine. Sometimes the maximum can change due to storage engine
+          enhancements.
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Look at the definitions of <literal>utf8</literal> columns,
+          and make sure they won't exceed the maximum length for the
+          storage engine.
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Do the same for <literal>TEXT</literal> as well as
+          <literal>CHAR</literal> or <literal>VARCHAR</literal>.
+        </para>
+      </listitem>
+
+    </orderedlist>
+
+    <para>
+      The MySQL Reference Manual will call this an INCOMPATIBLE CHANGE.
+    </para>
+
+    <para>
+      If the above situations apply for you, then you will have to
+      reduce the defined length, or you will have to change the column
+      definition from <literal>CHARACTER SET utf8</literal> to
+      <literal>CHARACTER SET utf8mb3</literal> (see
+      <literal>utf8mb3</literal> description in next section). Other
+      than that, we currently expect that the transition from 5.1 to 6.0
+      will be painless.
+    </para>
+
+    <para>
+      For a BMP character, the 5.1 and 6.0 versions of
+      <literal>utf8</literal> have identical storage characteristics:
+      same code values, same encoding, same length.
+    </para>
+
+    <para>
+      For a supplementary character, <literal>utf8</literal> in 5.1
+      can't store the character at all, while <literal>utf8</literal> in
+      6.0 requires 4 bytes to store. Since <literal>utf8</literal> in
+      5.1 can't store the character at all, you don't have any
+      supplementary characters in <literal>utf8</literal> columns in
+      your 5.1 database, so you don't have to worry about converting
+      characters or losing data.
+    </para>
+
+    <para>
+      <emphasis role="bold">Renamed Character Set
+      <literal>utf8mb3</literal></emphasis>
+    </para>
+
+    <para>
+      For the few cases where it's desirable to have complete
+      compatibility with the old <literal>utf8</literal> character set,
+      you can define with <literal>CHARACTER SET utf8mb3</literal>. The
+      <literal>utf8mb3</literal> character set has these
+      characteristics:
+    </para>
+
+    <itemizedlist>
+
+      <listitem>
+        <para>
+          No support for supplementary characters (BMP characters only)
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Maximum of three bytes per multi-byte character
+        </para>
+      </listitem>
+
+    </itemizedlist>
+
+    <para>
+      In other words, <literal>utf8mb3</literal> in MySQL 6.0 has
+      exactly the same characteristics as <literal>utf8</literal> in
+      MySQL 5.1. We expect that <literal>utf8mb3</literal> will only be
+      needed when updating from MySQL 5.1, as described in the previous
+      section.
+    </para>
+
+    <para>
+      Exactly the same set of characters is available in
+      <literal>utf8mb3</literal> as in <literal>ucs2</literal>, that is,
+      they have the same repertoire.
+    </para>
+
+    <para>
+      <emphasis role="bold">New Character Set
+      <literal>utf32</literal></emphasis>
+    </para>
+
+    <para>
+      The <literal>utf32</literal> character set is fixed length (like
+      <literal>ucs2</literal>), but uses 32 bits for every character
+      (unlike <literal>ucs2</literal>, which uses 16 bits for every
+      character).
+    </para>
+
+    <para>
+      Probably most people will avoid <literal>utf32</literal> for space
+      reasons. But it does have two advantages: it's predictable for
+      storage (number of bytes = number of characters times 4), and
+      there are no tricks in encoding so the stored value = the code
+      value.
+    </para>
+
+    <para>
+      To show how the latter advantage is useful, here's a trick that
+      shows how a user can determine the <literal>utf8</literal> value
+      given the <literal>utf32</literal> code value.
+    </para>
+
+<programlisting>
+/* Assume code value = 100cc LINEAR B CHARIOT WHEEL */
+CREATE TABLE tmp (utf32 CHAR(1) CHARACTER SET utf32,
+                  utf8 CHAR(1) CHARACTER SET utf8);
+INSERT INTO tmp VALUES (0x000100cc,NULL);
+UPDATE tmp SET utf8 = utf32;
+SELECT HEX(utf32),HEX(utf8) FROM tmp;
+</programlisting>
+
+    <para>
+      Storage is big-endian, that is, the highest byte comes first and
+      there is no need for a mark to specify the byte ordering.
+    </para>
+
+    <para>
+      MySQL is very forgiving about additions of unassigned Unicode
+      characters, private-use-area characters, and other code values
+      that you won't find in the official Unicode 5.0 Character Database
+      (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt). There is
+      in fact only one validity check for <literal>utf32</literal>: no
+      code value may be greater than 0x10ffff. For example this is
+      illegal:
+    </para>
+
+<programlisting>
+INSERT INTO t (utf32_column) VALUES (0x110000); /* illegal */
+</programlisting>
+
+    <para>
+      You cannot use <literal>utf32</literal> as a client character set.
+      [No SET NAMES]
+    </para>
+
+    <para>
+      [Mention that new charsets have collations the correspond to those
+      of the existing charsets? Give some examples, and xref the
+      collation-listing page.]
+    </para>
+
+    <para>
+      [Say something about collation properties of supplementary
+      characters]
+    </para>
+
+    <para>
+      <emphasis role="bold">How MySQL Collates New Characters</emphasis>
+    </para>
+
+    <para>
+      For all collations except the "binary" (_bin) collations, MySQL
+      performs a table lookup to find a character's collating weight.
+      This weight is visible with the new weight_string() function. (See
+      here for description of weight_string:
+      http://forge.mysql.com/worklog/task.php?id=3716) But what if a
+      character isn't in the table, for example because it's a "new"
+      character? In that case things get more complex.
+    </para>
+
+    <para>
+      For BMP characters in general collations, e.g. utf8_general_ci,
+      weight = code point.
+    </para>
+
+    <para>
+      For BMP characters in uca collations, e.g. utf8_unicode_ci,
+    </para>
+
+<programlisting>
+if (code &gt;= 0x3400 &amp;&amp; code &lt;= 0x4DB5)
+  base= 0xFB80; /* CJK Ideograph Extension */
+else if (code &gt;= 0x4E00 &amp;&amp; code &lt;= 0x9FA5)
+  base= 0xFB40; /* CJK Ideograph */
+else
+  base= 0xFBC0; /* All other characters */
+aaaa= base +  (code &gt;&gt; 15);
+bbbb= (code &amp; 0x7FFF) | 0x8000;
+</programlisting>
+
+    <para>
+      For example,
+    </para>
+
+<programlisting>
+mysql> SELECT HEX(weight_string(_ucs2 0x04CF COLLATE ucs2_unicode_ci));
++-----------------------------------------------------------+
+| hex(weight_string(_ucs2 0x04CF  collate ucs2_unicode_ci)) |
++-----------------------------------------------------------+
+| FBC084CF                                                  |
++-----------------------------------------------------------+
+1 row in set (0.00 sec)
+</programlisting>
+
+    <para>
+      Thus, U+04cf CYRILLIC SMALL LETTER PALOCHKA is, with all uca
+      collations, greater than U+04c0 CYRILLIC LETTER PALOCHKA. Only
+      later, after MySQL upgrades the collations according to WL#2673,
+      will all palochkas sort together.
+    </para>
+
+    <para>
+      For supplementary characters: the weight is the weight for 0xfffd
+      REPLACEMENT CHARACTER. That is, to MySQL, all supplementary
+      characters are equal to each other, and greater than almost all
+      BMP characters.
+    </para>
+
+    <para>
+      An example with Deseret characters and COUNT(DISTINCT):
+    </para>
+
+<programlisting>
+CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
+INSERT INTO t VALUES (0xfffd);   /* REPLACEMENT CHARACTER */
+INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */
+INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */
+SELECT COUNT(DISTINCT s1) FROM t;
+</programlisting>
+
+    <para>
+      The result is: 1, because Deseret Bee = Deseret Tee = Replacement
+      Character, in the MySQL Unicode collation.
+    </para>
+
+    <para>
+      An example with cuneiform characters and WEIGHT_STRING():
+    </para>
+
+<programlisting>
+/*
+The four characters in the INSERT string are
+00000041  # LATIN CAPITAL LETTER A
+0001218F # CUNEIFORM SIGN KAB
+000121A7 # CUNEIFORM SIGN KISH
+00000042  # LATIN CAPITAL LETTER B
+*/
+CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
+INSERT INTO t VALUES (0x000000410001218f000121a700000042);
+SELECT HEX(WEIGHT_STRING(s1)) FROM t;
+</programlisting>
+
+    <para>
+      The result is:
+    </para>
+
+<programlisting>
+0E33 FBC0 8FFD FBC0 8FFD 0E4A
+</programlisting>
+
+    <para>
+      where 0E33 and 0E4a are primary weights as in UCA 4.0.0
+      ftp://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt where
+      FBC0 8FFD is the weight for 0x8ffd according to the calculation
+      given above "For BMP characters in uca collations".
+    </para>
+
+    <para>
+      The "supplementary character collation" rule is non-wonderful. It
+      was necessary, so that we wouldn't have to rewrite the collation
+      algorithms at the same time that we were making possibly
+      disruptive changes to the character sets. We don't expect the rule
+      to cause trouble, because the characters are very rare, so it will
+      be very rare that a multi-character string consists entirely of
+      supplementary characters. In Japan, since the supplementary
+      characters are obscure Kanji ideographs, the typical user doesn't
+      care what order they're in anyway. However, if you really want to
+      get rows sorted by MySQL's rule and secondarily by code point
+      value, it's easy:
+    </para>
+
+<programlisting>
+ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
+</programlisting>
+
+    <para>
+      <emphasis role="bold">The utf16_bin Collation</emphasis>
+    </para>
+
+    <para>
+      [Move to Unicode collations page?]
+    </para>
+
+    <para>
+      There is a difference between "ordering by the character's code
+      value" and "ordering by the character's binary representation", a
+      difference which only appears with utf16_bin, because of
+      surrogates.
+    </para>
+
+    <para>
+      Suppose utf16_bin was a binary comparison "byte by byte" rather
+      than "character by character". If that were so, then the order of
+      characters in utf16_bin (binary collation for utf16) would differ
+      from the order in utf8_bin. For example:
+    </para>
+
+    <para>
+      Here is a chart showing two rare characters. The first character
+      is in the range E000-FFFF, so it is greater than a surrogate but
+      less than a supplementary. The second character is a
+      supplementary.
+    </para>
+
+<programlisting>
+Code point  Character                    utf8         utf16
+----------  ---------                    ----         -----
+0FF9D       HALFWIDTH KATAKANA LETTER N  EF BE 9D     FF 9D
+10384       UGARITIC LETTER DELTA        F0 90 8E 84  D8 00 DF 84
+</programlisting>
+
+    <para>
+      The two characters in the chart are in order by code point value,
+      because 0xff9d &lt; 0x10384. And they are in order by utf8 value,
+      because 0xef &lt; 0xf0. But they are not in order by utf16 value,
+      if we use byte-by-byte comparison, because 0xff &gt; 0xd8.
+    </para>
+
+    <para>
+      So MySQL's utf16_bin collation is not "byte by byte". It is "code
+      point". When MySQL sees a supplementary-character encoding in
+      utf16, it converts to the character's code-point value, and then
+      compares. Therefore utf8_bin and utf16_bin are the same ordering.
+      This is consistent with the SQL:2008 standard requirement for a
+      UCS_BASIC collation: "UCS_BASIC is a collation in which the
+      ordering is determined entirely by the Unicode scalar values of
+      the characters in the strings being sorted. It is applicable to
+      the UCS character repertoire. Since every character repertoire is
+      a subset of the UCS repertoire, the UCS_BASIC collation is
+      potentially applicable to every character set. NOTE 11 — The
+      Unicode scalar value of a character is its code point treated as
+      an unsigned integer."
+    </para>
+
+    <para>
+      If the character set is ucs2, then comparison is byte-by-byte, but
+      ucs2 strings shouldn't contain surrogates anyway.
+    </para>
+
+  </section>
+
+  <section id="charset-unicode-upgrading">
+
+    <title>Upgrading from Previous to Current Unicode Support</title>
+
+    <para>
+      Upgrading notes:
+    </para>
+
+    <itemizedlist>
+
+      <listitem>
+        <para>
+          What to do to upgrade
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Run mysql_upgrade, just as for any upgrade. (But this may
+          cause downgrading problems.)
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          utf8 becomes utf8mb3: Potential gotchas (literal name tests)
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          New utf8 requires more space: Potential gotchas (possibly
+          shorter max column lengths, shorter index lengths)
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          ucs2 restrictions that now apply to utf16/utf32
+        </para>
+      </listitem>
+
+    </itemizedlist>
+
+    <para>
+      Incompatibility topics to cover:
+    </para>
+
+    <itemizedlist>
+
+      <listitem>
+        <para>
+          Column and index length issues
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Issues related to renaming of utf8 to utf8mb3 and new utf8
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          What will prevent a *down*grade
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Workaround for event scheduler getting disabled?
+        </para>
+      </listitem>
+
+    </itemizedlist>
+
+    <para>
+      Give some specific examples to illustrate things that may break:
+      Maximum length in characters of <literal>VARCHAR</literal> columns
+      is 65535/4, not 65535/3. Maximum length in characters of
+      <literal>TINYTEXT</literal> columns is 255/4, not 255/3. Give
+      example for CHAR(255) in InnoDB: Full column could be indexed in
+      old utf8, but cannot in new utf8.
+    </para>
+
+    <para>
+      <emphasis role="bold">Identifiers</emphasis>
+    </para>
+
+    <para>
+      MySQL does not allow use of supplementary characters for
+      identifiers. If you try to use such a character in a name, you
+      will get an error message.
+    </para>
+
+  </section>
+
 </section>


Thread
svn commit - mysqldoc@docsrva: r9298 - in trunk: . refman-6.0paul13 Dec