List: Commits
From: paul
Date: May 29 2008 5:49pm
Subject:svn commit - mysqldoc@docsrva: r10862 - in trunk: . it/refman-5.1 pt/refman-5.1
Author: paul
Date: 2008-05-29 19:49:41 +0200 (Thu, 29 May 2008)
New Revision: 10862

Log:
 r31753@frost:  paul | 2008-05-29 11:37:51 -0500
 Sync translations


Modified:
   trunk/it/refman-5.1/internationalization.xml
   trunk/pt/refman-5.1/internationalization.xml

Property changes on: trunk
___________________________________________________________________
Name: svk:merge
   - 4767c598-dc10-0410-bea0-d01b485662eb:/mysqldoc-local/mysqldoc/trunk:35828
7d8d2c4e-af1d-0410-ab9f-b038ce55645b:/mysqldoc-local/mysqldoc:31752
b5ec3a16-e900-0410-9ad2-d183a3acac99:/mysqldoc-local/mysqldoc/trunk:14218
bf112a9c-6c03-0410-a055-ad865cd57414:/mysqldoc-local/mysqldoc/trunk:31512
   + 4767c598-dc10-0410-bea0-d01b485662eb:/mysqldoc-local/mysqldoc/trunk:35828
7d8d2c4e-af1d-0410-ab9f-b038ce55645b:/mysqldoc-local/mysqldoc:31753
b5ec3a16-e900-0410-9ad2-d183a3acac99:/mysqldoc-local/mysqldoc/trunk:14218
bf112a9c-6c03-0410-a055-ad865cd57414:/mysqldoc-local/mysqldoc/trunk:31512


Modified: trunk/it/refman-5.1/internationalization.xml
===================================================================
--- trunk/it/refman-5.1/internationalization.xml	2008-05-29 17:49:32 UTC (rev 10861)
+++ trunk/it/refman-5.1/internationalization.xml	2008-05-29 17:49:41 UTC (rev 10862)
Changed blocks: 2, Lines Added: 902, Lines Deleted: 9; 29935 bytes

@@ -5985,6 +5985,906 @@
 
   </section>
 
+  <section id="adding-collation">
+
+    <title>How to Add a New Collation to a Character Set</title>
+
+    <indexterm>
+      <primary>collation</primary>
+      <secondary>adding</secondary>
+    </indexterm>
+
+    <para>
+      A collation is a set of rules that defines how to compare and sort
+      character strings. Each collation in MySQL belongs to a single
+      character set. Every character set has at least one collation, and
+      most have two or more collations.
+    </para>
+
+    <para>
+      A collation orders characters based on weights. Each character in
+      a character set maps to a weight. Characters with equal weights
+      compare as equal, and characters with unequal weights compare
+      according to the relative magnitude of their weights.
+    </para>
+
+    <para>
+      MySQL supports several collation implementations, as discussed in
+      <xref linkend="charset-collation-implementations"/>. Some of these
+      can be added to MySQL without recompiling:
+    </para>
+
+    <itemizedlist>
+
+      <listitem>
+        <para>
+          Simple collations for 8-bit character sets
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          UCA-based collations for Unicode character sets
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Binary (<literal><replaceable>xxx</replaceable>_bin</literal>)
+          collations
+        </para>
+      </listitem>
+
+    </itemizedlist>
+
+    <para>
+      The following discussion describes how to add collations of the
+      first two types to existing character sets. All existing character
+      sets already have a binary collation, so there is no need here to
+      describe how to add one.
+    </para>
+
+    <para>
+      Summary of the procedure for adding a new collation:
+    </para>
+
+    <orderedlist>
+
+      <listitem>
+        <para>
+          Choose a collation ID
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Add configuration information that names the collation and
+          describes the character-ordering rules
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Restart the server
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Verify that the collation is present
+        </para>
+      </listitem>
+
+    </orderedlist>
+
+    <para>
+      The instructions here cover only collations that can be added
+      without recompiling MySQL. To add a collation that does require
+      recompiling (as implemented by means of functions in a C source
+      file), use the instructions in
+      <xref linkend="adding-character-set"/>. However, instead of adding
+      all the information required for a complete character set, just
+      modify the appropriate files for an existing character set. That
+      is, based on what is already present for the character set's
+      current collations, add new data structures, functions, and
+      configuration information for the new collation. For an example,
+      see the MySQL Blog article in the following list of additional
+      resources.
+    </para>
+
+    <para>
+      <emphasis role="bold">Additional resources</emphasis>
+    </para>
+
+    <itemizedlist>
+
+      <listitem>
+        <para>
+          The Unicode Collation Algorithm (UCA) specification:
+          <ulink url="http://www.unicode.org/reports/tr10/"/>
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          The Locale Data Markup Language (LDML) specification:
+          <ulink url="http://www.unicode.org/reports/tr35/"/>
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          MySQL University session <quote>How to Add a
+          Collation</quote>:
+          <ulink url="http://forge.mysql.com/wiki/How_to_Add_a_Collation"/>
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          MySQL Blog article <quote>Instructions for adding a new
+          Unicode collation</quote>:
+          <ulink url="http://blogs.mysql.com/peterg/2008/05/19/instructions-for-adding-a-new-unicode-collation/"/>
+        </para>
+      </listitem>
+
+    </itemizedlist>
+
+    <section id="charset-collation-implementations">
+
+      <title>Collation Implementation Types</title>
+
+      <para>
+        MySQL implements several types of collations:
+      </para>
+
+      <para>
+        <emphasis role="bold">Simple collations for 8-bit character
+        sets</emphasis>
+      </para>
+
+      <para>
+        This kind of collation is implemented using an array of 256
+        weights that defines a one-to-one mapping from character codes
+        to weights. <literal>latin1_swedish_ci</literal> is an example.
+        It is a case-insensitive collation, so the uppercase and
+        lowercase versions of a character have the same weights and they
+        compare as equal.
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';</userinput>
+Query OK, 0 rows affected (0.00 sec)
+
+mysql&gt; <userinput>SELECT 'a' = 'A';</userinput>
++-----------+
+| 'a' = 'A' |
++-----------+
+|         1 | 
++-----------+
+1 row in set (0.00 sec)
+</programlisting>
+
+      <para>
+        <emphasis role="bold">Complex collations for 8-bit character
+        sets</emphasis>
+      </para>
+
+      <para>
+        This kind of collation is implemented using functions in a C
+        source file that define how to order characters, as described in
+        <xref linkend="adding-character-set"/>.
+      </para>
+
+      <para>
+        <emphasis role="bold">Collations for non-Unicode multi-byte
+        character sets</emphasis>
+      </para>
+
+      <para>
+        For this type of collation, 8-bit (single-byte) and multi-byte
+        characters are handled differently. For 8-bit characters,
+        character codes map to weights in case-insensitive fashion. (For
+        example, the single-byte characters <literal>'a'</literal> and
+        <literal>'A'</literal> both have a weight of
+        <literal>0x41</literal>.) For multi-byte characters, there are
+        two types of relationship between character codes and weights:
+      </para>
+
+      <itemizedlist>
+
+        <listitem>
+          <para>
+            Weights equal character codes.
+            <literal>sjis_japanese_ci</literal> is an example of this
+            kind of collation. The multi-byte character
+            <literal>'&#x3062;'</literal> has a character code of
+            <literal>0x82C0</literal>, and the weight is also
+            <literal>0x82C0</literal>.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            Character codes map one-to-one to weights, but a code is not
+            necessarily equal to the weight.
+            <literal>gbk_chinese_ci</literal> is an example of this kind
+            of collation. The multi-byte character
+            <literal>'&#x81b0;'</literal> has a character code of
+            <literal>0x81B0</literal> but a weight of
+            <literal>0xC286</literal>.
+          </para>
+        </listitem>
+
+      </itemizedlist>
+
+      <para>
+        <emphasis role="bold">Collations for Unicode multi-byte
+        character sets</emphasis>
+      </para>
+
+      <para>
+        Some of these collations are based on the Unicode Collation
+        Algorithm (UCA), others are not.
+      </para>
+
+      <para>
+        Non-UCA collations have a one-to-one mapping from character code
+        to weight. In MySQL, such collations are case insensitive and
+        accent insensitive. <literal>utf8_general_ci</literal> is an
+        example: <literal>'a'</literal>, <literal>'A'</literal>,
+        <literal>'À'</literal>, and <literal>'á'</literal> each have
+        different character codes but all have a weight of
+        <literal>0x0041</literal> and compare as equal.
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SET NAMES 'utf8' COLLATE 'utf8_general_ci';</userinput>
+Query OK, 0 rows affected (0.00 sec)
+
+mysql&gt; <userinput>SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';</userinput>
++-----------+-----------+-----------+
+| 'a' = 'A' | 'a' = 'À' | 'a' = 'á' |
++-----------+-----------+-----------+
+|         1 |         1 |         1 | 
++-----------+-----------+-----------+
+1 row in set (0.06 sec)
+</programlisting>
+
+      <para>
+        UCA-based collations in MySQL have these properties:
+      </para>
+
+      <itemizedlist>
+
+        <listitem>
+          <para>
+            If a character has weights, each weight uses 2 bytes (16
+            bits)
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            A character may have zero weights (or an empty weight). In
+            this case, the character is ignorable. Example: "U+0000
+            NULL" does not have a weight and is ignorable.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            A character may have one weight. Example:
+            <literal>'a'</literal> has a weight of
+            <literal>0x0E33</literal>.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            A character may have many weights. This is an expansion.
+            Example: The German letter <literal>'ß'</literal> (SZ
+            LIGATURE, or SHARP S) has a weight of
+            <literal>0x0FEA0FEA</literal>.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            Many characters may have one weight. This is a contraction.
+            Example: <literal>'ch'</literal> is a single letter in Czech
+            and has a weight of <literal>0x0EE2</literal>.
+          </para>
+        </listitem>
+
+      </itemizedlist>
+
+      <para>
+        A many-characters-to-many-weights mapping is also possible (this
+        is contraction with expansion), but is not supported by MySQL.
+      </para>
+
+      <para>
+        <emphasis role="bold">Miscellaneous collations</emphasis>
+      </para>
+
+      <para>
+        There are also a few collations that do not fall into any of the
+        previous categories.
+      </para>
+
+    </section>
+
+    <section id="adding-collation-choosing-id">
+
+      <title>Choosing a Collation ID</title>
+
+      <para>
+        Each collation must have a unique ID. To add a new collation,
+        you must choose an ID value that is not currently used. The ID
+        that you choose is the value that will show up in these
+        contexts:
+      </para>
+
+      <itemizedlist>
+
+        <listitem>
+          <para>
+            The <literal>Id</literal> column of <literal>SHOW
+            COLLATION</literal> output
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            The <literal>ID</literal> column of the
+            <literal>INFORMATION_SCHEMA.COLLATIONS</literal> table
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            The <literal>charsetnr</literal> member of the
+            <literal>MYSQL_FIELD</literal> C API data structure
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            The <literal>number</literal> member of the
+            <literal>MY_CHARSET_INFO</literal> data structure returned
+            by the
+            <function role="capi">mysql_get_character_set_info()</function>
+            C API function
+          </para>
+        </listitem>
+
+      </itemizedlist>
+
+      <para>
+        To determine the largest currently used ID, issue the following
+        statement:
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;</userinput>
++---------+
+| MAX(ID) |
++---------+
+|     210 | 
++---------+
+</programlisting>
+
+      <para>
+        For the output just shown, you could choose an ID higher than 210
+        for the new collation.
+      </para>
+
+      <para>
+        To display a list of all currently used IDs, issue this
+        statement:
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SELECT ID FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID;</userinput>
++-----+
+| ID  |
++-----+
+|   1 | 
+|   2 | 
+| ... | 
+|  52 | 
+|  53 | 
+|  57 | 
+|  58 | 
+| ... | 
+|  98 | 
+|  99 | 
+| 128 | 
+| 129 | 
+| ... | 
+| 210 | 
++-----+
+</programlisting>
+
+      <para>
+        In this case, you can either choose an unused ID from within the
+        current range of IDs, or choose an ID that is higher than the
+        current maximum ID. For example, in the output just shown, there
+        are unused IDs between 53 and 57, and between 99 and 128. Or you
+        could choose an ID higher than 210.
+      </para>
+
+      <warning>
+        <para>
+          If you upgrade MySQL, you may find that the collation ID you
+          choose has been assigned to a collation included in the new
+          MySQL distribution. In this case, you will need to choose a
+          new value for your own collation.
+        </para>
+
+        <para>
+          In addition, before upgrading, you should save the
+          configuration files that you change. If you upgrade in place,
+          the process will replace your modified files.
+        </para>
+      </warning>
+
+    </section>
+
+    <section id="adding-collation-simple-8bit">
+
+      <title>Adding a Simple Collation to an 8-Bit Character Set</title>
+
+      <para>
+        To add a simple collation for an 8-bit character set without
+        recompiling MySQL, use the following procedure. The example adds
+        a collation named <literal>latin1_test_ci</literal> to the
+        <literal>latin1</literal> character set.
+      </para>
+
+      <orderedlist>
+
+        <listitem>
+          <para>
+            Choose a collation ID, as shown in
+            <xref linkend="adding-collation-choosing-id"/>. The
+            following steps use an ID of 56.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            You will need to modify the <literal>Index.xml</literal> and
+            <literal>latin1.xml</literal> configuration files. These
+            files will be located in the directory named by the
+            <literal>character_sets_dir</literal> system variable. You
+            can check the variable value as follows, although the
+            pathname might be different on your system:
+          </para>
+
+<programlisting>
+mysql&gt; <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
++--------------------+----------------------------------------+
+| Variable_name      | Value                                  |
++--------------------+----------------------------------------+
+| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ | 
++--------------------+----------------------------------------+
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Choose a name for the collation and list it in the
+            <filename>Index.xml</filename> file. Find the
+            <literal>&lt;charset&gt;</literal> element for the character
+            set to which the collation is being added, and add a
+            <literal>&lt;collation&gt;</literal> element that indicates
+            the collation name and ID. For example:
+          </para>
+
+<programlisting>
+&lt;charset name="latin1"&gt;
+  ...
+  &lt;!-- associate collation name with its ID --&gt;
+  &lt;collation name="latin1_test_ci" id="56"/&gt;
+  ...
+&lt;/charset&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            In the <filename>latin1.xml</filename> configuration file,
+            add a <literal>&lt;collation&gt;</literal> element that
+            names the collation and that contains a
+            <literal>&lt;map&gt;</literal> element that defines a
+            character code-to-weight mapping table. Each word within the
+            <literal>&lt;map&gt;</literal> element must be a number in
+            hexadecimal format.
+          </para>
+
+<programlisting>
+&lt;collation name="latin1_test_ci"&gt;
+&lt;map&gt;
+ 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
+ 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
+ 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
+ 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
+ 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
+ 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
+ 60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
+ 50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
+ 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
+ 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
+ A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
+ B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
+ 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
+ 44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
+ 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
+ 44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
+&lt;/map&gt;
+&lt;/collation&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Restart the server and use this statement to verify that the
+            collation is present:
+          </para>
+
+<programlisting>
+mysql&gt; <userinput>SHOW COLLATION LIKE 'latin1_test_ci';</userinput>
++----------------+---------+----+---------+----------+---------+
+| Collation      | Charset | Id | Default | Compiled | Sortlen |
++----------------+---------+----+---------+----------+---------+
+| latin1_test_ci | latin1  | 56 |         |          |       1 | 
++----------------+---------+----+---------+----------+---------+
+</programlisting>
+        </listitem>
+
+      </orderedlist>
+
+    </section>
+
+    <section id="adding-collation-unicode-uca">
+
+      <title>Adding a UCA Collation to a Unicode Character Set</title>
+
+      <para>
+        UCA collations for Unicode character sets can be added to MySQL
+        without recompiling by using a subset of the Locale Data Markup
+        Language (LDML), which is available at
+        <ulink url="http://www.unicode.org/reports/tr35/"/>. In
+        &current-series;, this method of adding collations is supported
+        as of MySQL 5.1.20. With this method, you begin with an existing
+        <quote>base</quote> collation. Then you describe the new
+        collation in terms of how it differs from the base collation,
+        rather than defining the entire collation. The following table
+        lists the base collations for the Unicode character sets.
+      </para>
+
+      <informaltable>
+        <tgroup cols="2">
+          <colspec colwidth="30*"/>
+          <colspec colwidth="60*"/>
+          <tbody>
+            <row>
+              <entry><emphasis role="bold">Character Set</emphasis></entry>
+              <entry><emphasis role="bold">Base Collation</emphasis></entry>
+            </row>
+            <row>
+              <entry><literal>utf8</literal></entry>
+              <entry><literal>utf8_unicode_ci</literal></entry>
+            </row>
+            <row>
+              <entry><literal>ucs2</literal></entry>
+              <entry><literal>ucs2_unicode_ci</literal></entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+
+      <para>
+        The following brief summary describes the LDML characteristics
+        required for understanding the procedure for adding a collation
+        given later in this section:
+      </para>
+
+      <itemizedlist>
+
+        <listitem>
+          <para>
+            LDML has reset rules and shift rules.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            Characters named in these rules can be written in
+            <literal>\u<replaceable>nnnn</replaceable></literal> format,
+            where <replaceable>nnnn</replaceable> is the hexadecimal
+            Unicode code point value. Basic Latin letters
+            <literal>A-Z</literal> <literal>a-z</literal> can also be
+            written literally.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            A reset rule does not specify any ordering in and of itself.
+            Instead, it <quote>resets</quote> the ordering for
+            subsequent shift rules to cause them to be taken in relation
+            to a given character. Either of the following rules resets
+            subsequent shift rules to be taken in relation to the letter
+            <literal>'A'</literal>:
+          </para>
+
+<programlisting>
+&lt;reset&gt;A&lt;/reset&gt;
+
+&lt;reset&gt;\u0041&lt;/reset&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Shift rules define primary, secondary, and tertiary
+            differences of a character from another character. They are
+            specified using <literal>&lt;p&gt;</literal>,
+            <literal>&lt;s&gt;</literal>, and
+            <literal>&lt;t&gt;</literal> elements. Either of the
+            following rules specifies a primary shift rule for the
+            <literal>'G'</literal> character:
+          </para>
+
+<programlisting>
+&lt;p&gt;G&lt;/p&gt;
+
+&lt;p&gt;\u0047&lt;/p&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Use the shift rules as follows to distinguish characters:
+          </para>
+
+          <itemizedlist>
+
+            <listitem>
+              <para>
+                Use primary differences to distinguish separate letters
+              </para>
+            </listitem>
+
+            <listitem>
+              <para>
+                Use secondary differences to distinguish accent
+                variations
+              </para>
+            </listitem>
+
+            <listitem>
+              <para>
+                Use tertiary differences to distinguish lettercase
+                variations
+              </para>
+            </listitem>
+
+          </itemizedlist>
+        </listitem>
+
+      </itemizedlist>
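+
+      <para>
+        As a minimal sketch of how reset and shift rules fit together,
+        the following fragment pairs each reset rule with a single
+        shift rule. The letters chosen are illustrative assumptions and
+        are not taken from any collation distributed with MySQL. In
+        LDML terms, the fragment gives <literal>'A'</literal> a
+        tertiary (lettercase) difference from <literal>'a'</literal>,
+        gives <literal>\u00E1</literal> (a with acute accent) a
+        secondary (accent) difference from <literal>'a'</literal>, and
+        treats <literal>\u00F1</literal> (n with tilde) as a separate
+        letter that follows <literal>'n'</literal>:
+      </para>
+
+<programlisting>
+&lt;rules&gt;
+  &lt;reset&gt;a&lt;/reset&gt;
+    &lt;t&gt;A&lt;/t&gt;      &lt;!-- tertiary: lettercase variation of a --&gt;
+  &lt;reset&gt;a&lt;/reset&gt;
+    &lt;s&gt;\u00E1&lt;/s&gt; &lt;!-- secondary: accent variation of a --&gt;
+  &lt;reset&gt;n&lt;/reset&gt;
+    &lt;p&gt;\u00F1&lt;/p&gt; &lt;!-- primary: separate letter after n --&gt;
+&lt;/rules&gt;
+</programlisting>
+
+      <para>
+        This fragment has not been tested against a server. How the
+        resulting characters compare in practice depends on the
+        collation in which the rules are placed.
+      </para>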
+
+      <remark role="todo">
+        Add: Examples of the use of these rules
+      </remark>
+
+      <para>
+        To add a UCA collation for a Unicode character set without
+        recompiling MySQL, use the following procedure. The example adds
+        a collation named <literal>utf8_phone_ci</literal> to the
+        <literal>utf8</literal> character set. The collation is designed
+        for a scenario involving a Web application for which users post
+        their names and phone numbers. Phone numbers can be given in
+        very different formats:
+      </para>
+
+<programlisting>
++7-12345-67
++7-12-345-67
++7 12 345 67
++7 (12) 345 67
++71234567
+</programlisting>
+
+      <para>
+        The problem raised by dealing with these kinds of values is that
+        the varying allowable formats make searching for a specific
+        phone number very difficult. The solution is to define a new
+        collation that reorders punctuation characters, making them
+        ignorable.
+      </para>
+
+      <orderedlist>
+
+        <listitem>
+          <para>
+            Choose a collation ID, as shown in
+            <xref linkend="adding-collation-choosing-id"/>. The
+            following steps use an ID of 252.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            You will need to modify the <literal>Index.xml</literal>
+            configuration file. This file will be located in the
+            directory named by the <literal>character_sets_dir</literal>
+            system variable. You can check the variable value as
+            follows, although the pathname might be different on your
+            system:
+          </para>
+
+<programlisting>
+mysql&gt; <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
++--------------------+----------------------------------------+
+| Variable_name      | Value                                  |
++--------------------+----------------------------------------+
+| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ | 
++--------------------+----------------------------------------+
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Choose a name for the collation and list it in the
+            <filename>Index.xml</filename> file. In addition, you'll
+            need to provide the collation ordering rules. Find the
+            <literal>&lt;charset&gt;</literal> element for the character
+            set to which the collation is being added, and add a
+            <literal>&lt;collation&gt;</literal> element that indicates
+            the collation name and ID. Within the
+            <literal>&lt;collation&gt;</literal> element, provide a
+            <literal>&lt;rules&gt;</literal> element containing the
+            ordering rules:
+          </para>
+
+<programlisting>
+&lt;charset name="utf8"&gt;
+  ...
+  &lt;!-- associate collation name with its ID --&gt;
+  &lt;collation name="utf8_phone_ci" id="252"&gt;
+    &lt;rules&gt;
+      &lt;reset&gt;\u0000&lt;/reset&gt;
+        &lt;s&gt;\u0020&lt;/s&gt; &lt;!-- space --&gt;
+        &lt;s&gt;\u0028&lt;/s&gt; &lt;!-- left parenthesis --&gt;
+        &lt;s&gt;\u0029&lt;/s&gt; &lt;!-- right parenthesis --&gt;
+        &lt;s&gt;\u002B&lt;/s&gt; &lt;!-- plus --&gt;
+        &lt;s&gt;\u002D&lt;/s&gt; &lt;!-- hyphen --&gt;
+    &lt;/rules&gt;
+  &lt;/collation&gt;
+  ...
+&lt;/charset&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            If you want a similar collation for other Unicode character
+            sets, add other <literal>&lt;collation&gt;</literal>
+            elements. For example, to define
+            <literal>ucs2_phone_ci</literal>, add a
+            <literal>&lt;collation&gt;</literal> element to the
+            <literal>&lt;charset name="ucs2"&gt;</literal> element.
+            Remember that each collation must have its own unique ID.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            Restart the server and use this statement to verify that the
+            collation is present:
+          </para>
+
+<programlisting>
+mysql&gt; <userinput>SHOW COLLATION LIKE 'utf8_phone_ci';</userinput>
++---------------+---------+-----+---------+----------+---------+
+| Collation     | Charset | Id  | Default | Compiled | Sortlen |
++---------------+---------+-----+---------+----------+---------+
+| utf8_phone_ci | utf8    | 252 |         |          |       8 | 
++---------------+---------+-----+---------+----------+---------+
+</programlisting>
+        </listitem>
+
+      </orderedlist>
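+
+      <para>
+        For the optional step that defines a similar collation for
+        another Unicode character set, the added element might look
+        like the following sketch. It assumes that
+        <literal>ucs2_phone_ci</literal> uses the same rules as
+        <literal>utf8_phone_ci</literal>. The ID of 253 is only a
+        placeholder: choose a value that is unused on your own server,
+        as described in <xref linkend="adding-collation-choosing-id"/>.
+      </para>
+
+<programlisting>
+&lt;charset name="ucs2"&gt;
+  ...
+  &lt;!-- 253 is a placeholder ID; substitute an unused value --&gt;
+  &lt;collation name="ucs2_phone_ci" id="253"&gt;
+    &lt;rules&gt;
+      &lt;reset&gt;\u0000&lt;/reset&gt;
+        &lt;s&gt;\u0020&lt;/s&gt; &lt;!-- space --&gt;
+        &lt;s&gt;\u0028&lt;/s&gt; &lt;!-- left parenthesis --&gt;
+        &lt;s&gt;\u0029&lt;/s&gt; &lt;!-- right parenthesis --&gt;
+        &lt;s&gt;\u002B&lt;/s&gt; &lt;!-- plus --&gt;
+        &lt;s&gt;\u002D&lt;/s&gt; &lt;!-- hyphen --&gt;
+    &lt;/rules&gt;
+  &lt;/collation&gt;
+  ...
+&lt;/charset&gt;
+</programlisting>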
+
+      <para>
+        Now we can test the collation to make sure that it has the
+        desired properties.
+      </para>
+
+      <para>
+        Create a table containing some sample phone numbers using the
+        new collation:
+      </para>
+
+<programlisting>
+<!--
+mysql> DROP TABLE IF EXISTS phonebook;
+Query OK, 0 rows affected, 1 warning (0.00 sec)
+-->
+mysql&gt; <userinput>CREATE TABLE phonebook (</userinput>
+    -&gt; <userinput>  name VARCHAR(64),</userinput>
+    -&gt; <userinput>  phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci</userinput>
+    -&gt; <userinput>);</userinput>
+Query OK, 0 rows affected (0.09 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');</userinput>
+Query OK, 1 row affected (0.00 sec)
+</programlisting>
+
+      <para>
+        Run some queries to see whether the ignored punctuation
+        characters are in fact ignored for sorting and comparisons:
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SELECT * FROM phonebook ORDER BY phone;</userinput>
++-------+--------------------+
+| name  | phone              |
++-------+--------------------+
+| Sanja | +380 (912) 8008005 | 
+| Bar   | +7-912-800-80-01   | 
+| Svoj  | +7 912 800 80 02   | 
+| Ramil | (7912) 800 80 03   | 
+| Hf    | +7 (912) 800 80 04 | 
++-------+--------------------+
+5 rows in set (0.00 sec)
+
+mysql&gt; <userinput>SELECT * FROM phonebook WHERE phone='+7(912)800-80-01';</userinput>
++------+------------------+
+| name | phone            |
++------+------------------+
+| Bar  | +7-912-800-80-01 | 
++------+------------------+
+1 row in set (0.00 sec)
+
+mysql&gt; <userinput>SELECT * FROM phonebook WHERE phone='79128008001';</userinput>
++------+------------------+
+| name | phone            |
++------+------------------+
+| Bar  | +7-912-800-80-01 | 
++------+------------------+
+1 row in set (0.00 sec)
+
+mysql&gt; <userinput>SELECT * FROM phonebook WHERE phone='7 9 1 2 8 0 0 8 0 0 1';</userinput>
++------+------------------+
+| name | phone            |
++------+------------------+
+| Bar  | +7-912-800-80-01 | 
++------+------------------+
+1 row in set (0.00 sec)
+</programlisting>
+
+    </section>
+
+  </section>
+
   <section id="problems-with-character-sets">
 
     <title>Problems With Character Sets</title>

@@ -6027,21 +6927,14 @@
           program with support for the character set.
         </para>
 
-        <remark role="todo">
-          Add xref to LDML section when it gets added.
-        </remark>
-
         <para>
           For Unicode character sets, you can define collations without
-          recompiling by using LDML notation.
+          recompiling by using LDML notation. See
+          <xref linkend="adding-collation-unicode-uca"/>.
         </para>
       </listitem>
 
       <listitem>
-        <remark role="todo">
-          dynamic = not compiled in?
-        </remark>
-
         <para>
           The character set is a dynamic character set, but you do not
           have a configuration file for it. In this case, you should


Modified: trunk/pt/refman-5.1/internationalization.xml
===================================================================
--- trunk/pt/refman-5.1/internationalization.xml	2008-05-29 17:49:32 UTC (rev 10861)
+++ trunk/pt/refman-5.1/internationalization.xml	2008-05-29 17:49:41 UTC (rev 10862)
Changed blocks: 6, Lines Added: 925, Lines Deleted: 37; 34318 bytes

@@ -5500,12 +5500,12 @@
           The <literal>&lt;charset&gt;</literal> element must list all
           the collations for the character set. These must include at
           least a binary collation and a default collation. The default
-          collation is usually <literal>general_ci</literal> (general,
-          case insensitive). It is possible for the binary collation to
-          be the default collation, but usually they are different. The
-          default collation should have a <literal>primary</literal>
-          flag. The binary collation should have a
-          <literal>binary</literal> flag.
+          collation is usually named using a suffix of
+          <literal>general_ci</literal> (general, case insensitive). It
+          is possible for the binary collation to be the default
+          collation, but usually they are different. The default
+          collation should have a <literal>primary</literal> flag. The
+          binary collation should have a <literal>binary</literal> flag.
         </para>
 
         <para>

@@ -5830,19 +5830,17 @@
         <literal>ctype_<replaceable>MYSET</replaceable>[]</literal>,
         <literal>to_lower_<replaceable>MYSET</replaceable>[]</literal>,
         and so forth. Not every complex character set has all of the
-        arrays. See the existing
-        <filename>ctype-<replaceable>charset_name</replaceable>.c</filename>
-        files for examples. See the
-        <filename>CHARSET_INFO.txt</filename> file in the
-        <filename>strings</filename> directory for additional
+        arrays. See the existing <filename>ctype-*.c</filename> files
+        for examples. See the <filename>CHARSET_INFO.txt</filename> file
+        in the <filename>strings</filename> directory for additional
         information.
       </para>
 
       <para>
         The <literal>ctype</literal> array is indexed by character value
-        + 1. This is an old legacy convention for handling
-        <literal>EOF</literal>. The other arrays are indexed by
-        character value.
+        + 1 and has 257 elements. This is an old legacy convention for
+        handling <literal>EOF</literal>. The other arrays are indexed by
+        character value and have 256 elements.
       </para>
 
       <para>

@@ -5935,10 +5933,9 @@
       <para>
         The existing character sets provide the best documentation and
         examples to show how these functions are implemented. Look at
-        the
-        <filename>ctype-<replaceable>charset_name</replaceable>.c</filename>
-        files in the <filename>strings</filename> directory, such as the
-        files for the <literal>big5</literal>, <literal>czech</literal>,
+        the <filename>ctype-*.c</filename> files in the
+        <filename>strings</filename> directory, such as the files for
+        the <literal>big5</literal>, <literal>czech</literal>,
         <literal>gbk</literal>, <literal>sjis</literal>, and
         <literal>tis620</literal> character sets. Take a look at the
         <literal>MY_COLLATION_HANDLER</literal> structures to see how

@@ -5973,16 +5970,14 @@
       <para>
         The existing character sets provide the best documentation and
         examples to show how these functions are implemented. Look at
-        the
-        <filename>ctype-<replaceable>charset_name</replaceable>.c</filename>
-        files in the <filename>strings</filename> directory, such as the
-        files for the <literal>euc_kr</literal>,
-        <literal>gb2312</literal>, <literal>gbk</literal>,
-        <literal>sjis</literal>, and <literal>ujis</literal> character
-        sets. Take a look at the <literal>MY_CHARSET_HANDLER</literal>
-        structures to see how they are used, and see the
-        <filename>CHARSET_INFO.txt</filename> file in the
-        <filename>strings</filename> directory for additional
+        the <filename>ctype-*.c</filename> files in the
+        <filename>strings</filename> directory, such as the files for
+        the <literal>euc_kr</literal>, <literal>gb2312</literal>,
+        <literal>gbk</literal>, <literal>sjis</literal>, and
+        <literal>ujis</literal> character sets. Take a look at the
+        <literal>MY_CHARSET_HANDLER</literal> structures to see how they
+        are used, and see the <filename>CHARSET_INFO.txt</filename> file
+        in the <filename>strings</filename> directory for additional
         information.
       </para>
 

@@ -5990,6 +5985,906 @@
 
   </section>
 
+  <section id="adding-collation">
+
+    <title>How to Add a New Collation to a Character Set</title>
+
+    <indexterm>
+      <primary>collation</primary>
+      <secondary>adding</secondary>
+    </indexterm>
+
+    <para>
+      A collation is a set of rules that defines how to compare and sort
+      character strings. Each collation in MySQL belongs to a single
+      character set. Every character set has at least one collation, and
+      most have two or more collations.
+    </para>
+
+    <para>
+      A collation orders characters based on weights. Each character in
+      a character set maps to a weight. Characters with equal weights
+      compare as equal, and characters with unequal weights compare
+      according to the relative magnitude of their weights.
+    </para>
+
+    <para>
+      MySQL supports several collation implementations, as discussed in
+      <xref linkend="charset-collation-implementations"/>. Some of these
+      can be added to MySQL without recompiling:
+    </para>
+
+    <itemizedlist>
+
+      <listitem>
+        <para>
+          Simple collations for 8-bit character sets
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          UCA-based collations for Unicode character sets
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Binary (<literal><replaceable>xxx</replaceable>_bin</literal>)
+          collations
+        </para>
+      </listitem>
+
+    </itemizedlist>
+
+    <para>
+      The following discussion describes how to add collations of the
+      first two types to existing character sets. All existing character
+      sets already have a binary collation, so there is no need here to
+      describe how to add one.
+    </para>
+
+    <para>
+      Summary of the procedure for adding a new collation:
+    </para>
+
+    <orderedlist>
+
+      <listitem>
+        <para>
+          Choose a collation ID
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Add configuration information that names the collation and
+          describes the character-ordering rules
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Restart the server
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          Verify that the collation is present
+        </para>
+      </listitem>
+
+    </orderedlist>
+
+    <para>
+      The instructions here cover only collations that can be added
+      without recompiling MySQL. To add a collation that does require
+      recompiling (as implemented by means of functions in a C source
+      file), use the instructions in
+      <xref linkend="adding-character-set"/>. However, instead of adding
+      all the information required for a complete character set, just
+      modify the appropriate files for an existing character set. That
+      is, based on what is already present for the character set's
+      current collations, add new data structures, functions, and
+      configuration information for the new collation. For an example,
+      see the MySQL Blog article in the following list of additional
+      resources.
+    </para>
+
+    <para>
+      <emphasis role="bold">Additional resources</emphasis>
+    </para>
+
+    <itemizedlist>
+
+      <listitem>
+        <para>
+          The Unicode Collation Algorithm (UCA) specification:
+          <ulink url="http://www.unicode.org/reports/tr10/"/>
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          The Locale Data Markup Language (LDML) specification:
+          <ulink url="http://www.unicode.org/reports/tr35/"/>
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          MySQL University session <quote>How to Add a
+          Collation</quote>:
+          <ulink url="http://forge.mysql.com/wiki/How_to_Add_a_Collation"/>
+        </para>
+      </listitem>
+
+      <listitem>
+        <para>
+          MySQL Blog article <quote>Instructions for adding a new
+          Unicode collation</quote>:
+          <ulink url="http://blogs.mysql.com/peterg/2008/05/19/instructions-for-adding-a-new-unicode-collation/"/>
+        </para>
+      </listitem>
+
+    </itemizedlist>
+
+    <section id="charset-collation-implementations">
+
+      <title>Collation Implementation Types</title>
+
+      <para>
+        MySQL implements several types of collations:
+      </para>
+
+      <para>
+        <emphasis role="bold">Simple collations for 8-bit character
+        sets</emphasis>
+      </para>
+
+      <para>
+        This kind of collation is implemented using an array of 256
+        weights that defines a one-to-one mapping from character codes
+        to weights. <literal>latin1_swedish_ci</literal> is an example.
+        It is a case-insensitive collation, so the uppercase and
+        lowercase versions of a character have the same weights and they
+        compare as equal.
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';</userinput>
+Query OK, 0 rows affected (0.00 sec)
+
+mysql&gt; <userinput>SELECT 'a' = 'A';</userinput>
++-----------+
+| 'a' = 'A' |
++-----------+
+|         1 | 
++-----------+
+1 row in set (0.00 sec)
+</programlisting>
+
+      <para>
+        <emphasis role="bold">Complex collations for 8-bit character
+        sets</emphasis>
+      </para>
+
+      <para>
+        This kind of collation is implemented using functions in a C
+        source file that define how to order characters, as described in
+        <xref linkend="adding-character-set"/>.
+      </para>
+
+      <para>
+        <emphasis role="bold">Collations for non-Unicode multi-byte
+        character sets</emphasis>
+      </para>
+
+      <para>
+        For this type of collation, 8-bit (single-byte) and multi-byte
+        characters are handled differently. For 8-bit characters,
+        character codes map to weights in case-insensitive fashion. (For
+        example, the single-byte characters <literal>'a'</literal> and
+        <literal>'A'</literal> both have a weight of
+        <literal>0x41</literal>.) For multi-byte characters, there are
+        two types of relationship between character codes and weights:
+      </para>
+
+      <itemizedlist>
+
+        <listitem>
+          <para>
+            Weights equal character codes.
+            <literal>sjis_japanese_ci</literal> is an example of this
+            kind of collation. The multi-byte character
+            <literal>'&#x3062;'</literal> has a character code of
+            <literal>0x82C0</literal>, and the weight is also
+            <literal>0x82C0</literal>.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            Character codes map one-to-one to weights, but a code is not
+            necessarily equal to the weight.
+            <literal>gbk_chinese_ci</literal> is an example of this kind
+            of collation. The multi-byte character
+            <literal>'&#x81b0;'</literal> has a character code of
+            <literal>0x81B0</literal> but a weight of
+            <literal>0xC286</literal>.
+          </para>
+        </listitem>
+
+      </itemizedlist>
+
+      <para>
+        <emphasis role="bold">Collations for Unicode multi-byte
+        character sets</emphasis>
+      </para>
+
+      <para>
+        Some of these collations are based on the Unicode Collation
+        Algorithm (UCA); others are not.
+      </para>
+
+      <para>
+        Non-UCA collations have a one-to-one mapping from character code
+        to weight. In MySQL, such collations are case insensitive and
+        accent insensitive. <literal>utf8_general_ci</literal> is an
+        example: <literal>'a'</literal>, <literal>'A'</literal>,
+        <literal>'À'</literal>, and <literal>'á'</literal> each have
+        different character codes but all have a weight of
+        <literal>0x0041</literal> and compare as equal.
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SET NAMES 'utf8' COLLATE 'utf8_general_ci';</userinput>
+Query OK, 0 rows affected (0.00 sec)
+
+mysql&gt; <userinput>SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';</userinput>
++-----------+-----------+-----------+
+| 'a' = 'A' | 'a' = 'À' | 'a' = 'á' |
++-----------+-----------+-----------+
+|         1 |         1 |         1 | 
++-----------+-----------+-----------+
+1 row in set (0.06 sec)
+</programlisting>
+
+      <para>
+        UCA-based collations in MySQL have these properties:
+      </para>
+
+      <itemizedlist>
+
+        <listitem>
+          <para>
+            If a character has weights, each weight uses 2 bytes (16
+            bits)
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            A character may have zero weights (or an empty weight). In
+            this case, the character is ignorable. Example: "U+0000
+            NULL" does not have a weight and is ignorable.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            A character may have one weight. Example:
+            <literal>'a'</literal> has a weight of
+            <literal>0x0E33</literal>.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            A character may have many weights. This is an expansion.
+            Example: The German letter <literal>'ß'</literal> (SZ
+            LIGATURE, or SHARP S) has a weight of
+            <literal>0x0FEA0FEA</literal>.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            Many characters may have one weight. This is a contraction.
+            Example: <literal>'ch'</literal> is a single letter in Czech
+            and has a weight of <literal>0x0EE2</literal>.
+          </para>
+        </listitem>
+
+      </itemizedlist>
+
+      <para>
+        A many-characters-to-many-weights mapping is also possible (this
+        is contraction with expansion), but is not supported by MySQL.
+      </para>
+
+      <para>
+        <emphasis role="bold">Miscellaneous collations</emphasis>
+      </para>
+
+      <para>
+        There are also a few collations that do not fall into any of the
+        previous categories.
+      </para>
+
+    </section>
+
+    <section id="adding-collation-choosing-id">
+
+      <title>Choosing a Collation ID</title>
+
+      <para>
+        Each collation must have a unique ID. To add a new collation,
+        you must choose an ID value that is not currently used. The ID
+        that you choose is the value that will show up in these
+        contexts:
+      </para>
+
+      <itemizedlist>
+
+        <listitem>
+          <para>
+            The <literal>Id</literal> column of <literal>SHOW
+            COLLATION</literal> output
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            The <literal>ID</literal> column of the
+            <literal>INFORMATION_SCHEMA.COLLATIONS</literal> table
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            The <literal>charsetnr</literal> member of the
+            <literal>MYSQL_FIELD</literal> C API data structure
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            The <literal>number</literal> member of the
+            <literal>MY_CHARSET_INFO</literal> data structure returned
+            by the
+            <function role="capi">mysql_get_character_set_info()</function>
+            C API function
+          </para>
+        </listitem>
+
+      </itemizedlist>
+
+      <para>
+        To determine the largest currently used ID, issue the following
+        statement:
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;</userinput>
++---------+
+| MAX(ID) |
++---------+
+|     210 | 
++---------+
+</programlisting>
+
+      <para>
+        For the output just shown, you could choose an ID higher than 210
+        for the new collation.
+      </para>
+
+      <para>
+        To display a list of all currently used IDs, issue this
+        statement:
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SELECT ID FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID;</userinput>
++-----+
+| ID  |
++-----+
+|   1 | 
+|   2 | 
+| ... | 
+|  52 | 
+|  53 | 
+|  57 | 
+|  58 | 
+| ... | 
+|  98 | 
+|  99 | 
+| 128 | 
+| 129 | 
+| ... | 
+| 210 | 
++-----+
+</programlisting>
+
+      <para>
+        In this case, you can either choose an unused ID from within the
+        current range of IDs, or choose an ID that is higher than the
+        current maximum ID. For example, in the output just shown, there
+        are unused IDs between 53 and 57, and between 99 and 128. Or you
+        could choose an ID higher than 210.
+      </para>
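+
+      <para>
+        As a quick check before settling on a value, you can ask
+        whether a candidate ID is already taken. The following sketch
+        assumes a candidate ID of 56, chosen here only for
+        illustration; an empty result indicates that no collation
+        currently uses that ID:
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLLATIONS WHERE ID = 56;</userinput>
+</programlisting>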
+
+      <warning>
+        <para>
+          If you upgrade MySQL, you may find that the collation ID you
+          choose has been assigned to a collation included in the new
+          MySQL distribution. In this case, you will need to choose a
+          new value for your own collation.
+        </para>
+
+        <para>
+          In addition, before upgrading, you should save the
+          configuration files that you change. If you upgrade in place,
+          the process will replace your modified files.
+        </para>
+      </warning>
+
+    </section>
+
+    <section id="adding-collation-simple-8bit">
+
+      <title>Adding a Simple Collation to an 8-Bit Character Set</title>
+
+      <para>
+        To add a simple collation for an 8-bit character set without
+        recompiling MySQL, use the following procedure. The example adds
+        a collation named <literal>latin1_test_ci</literal> to the
+        <literal>latin1</literal> character set.
+      </para>
+
+      <orderedlist>
+
+        <listitem>
+          <para>
+            Choose a collation ID, as shown in
+            <xref linkend="adding-collation-choosing-id"/>. The
+            following steps use an ID of 56.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            You will need to modify the <literal>Index.xml</literal> and
+            <literal>latin1.xml</literal> configuration files. These
+            files will be located in the directory named by the
+            <literal>character_sets_dir</literal> system variable. You
+            can check the variable value as follows, although the
+            pathname might be different on your system:
+          </para>
+
+<programlisting>
+mysql&gt; <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
++--------------------+----------------------------------------+
+| Variable_name      | Value                                  |
++--------------------+----------------------------------------+
+| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ | 
++--------------------+----------------------------------------+
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Choose a name for the collation and list it in the
+            <filename>Index.xml</filename> file. Find the
+            <literal>&lt;charset&gt;</literal> element for the character
+            set to which the collation is being added, and add a
+            <literal>&lt;collation&gt;</literal> element that indicates
+            the collation name and ID. For example:
+          </para>
+
+<programlisting>
+&lt;charset name="latin1"&gt;
+  ...
+  &lt;!-- associate collation name with its ID --&gt;
+  &lt;collation name="latin1_test_ci" id="56"/&gt;
+  ...
+&lt;/charset&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            In the <filename>latin1.xml</filename> configuration file,
+            add a <literal>&lt;collation&gt;</literal> element that
+            names the collation and that contains a
+            <literal>&lt;map&gt;</literal> element that defines a
+            character code-to-weight mapping table. Each word within the
+            <literal>&lt;map&gt;</literal> element must be a number in
+            hexadecimal format.
+          </para>
+
+<programlisting>
+&lt;collation name="latin1_test_ci"&gt;
+&lt;map&gt;
+ 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
+ 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
+ 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
+ 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
+ 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
+ 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
+ 60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
+ 50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
+ 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
+ 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
+ A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
+ B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
+ 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
+ 44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
+ 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
+ 44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
+&lt;/map&gt;
+&lt;/collation&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Restart the server and use this statement to verify that the
+            collation is present:
+          </para>
+
+<programlisting>
+mysql&gt; <userinput>SHOW COLLATION LIKE 'latin1_test_ci';</userinput>
++----------------+---------+----+---------+----------+---------+
+| Collation      | Charset | Id | Default | Compiled | Sortlen |
++----------------+---------+----+---------+----------+---------+
+| latin1_test_ci | latin1  | 56 |         |          |       1 | 
++----------------+---------+----+---------+----------+---------+
+</programlisting>
+        </listitem>
+
+      </orderedlist>
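+
+      <para>
+        Once the collation is listed, a comparison such as the
+        following can serve as a quick sanity check. This is a sketch
+        that assumes the <literal>&lt;map&gt;</literal> shown earlier,
+        in which the codes for <literal>'a'</literal>
+        (<literal>0x61</literal>) and <literal>'A'</literal>
+        (<literal>0x41</literal>) both map to weight
+        <literal>41</literal>, so the two strings should compare as
+        equal:
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SELECT _latin1'a' COLLATE latin1_test_ci = _latin1'A';</userinput>
+</programlisting>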
+
+    </section>
+
+    <section id="adding-collation-unicode-uca">
+
+      <title>Adding a UCA Collation to a Unicode Character Set</title>
+
+      <para>
+        UCA collations for Unicode character sets can be added to MySQL
+        without recompiling by using a subset of the Locale Data Markup
+        Language (LDML), which is available at
+        <ulink url="http://www.unicode.org/reports/tr35/"/>. In
+        &current-series;, this method of adding collations is supported
+        as of MySQL 5.1.20. With this method, you begin with an existing
+        <quote>base</quote> collation. Then you describe the new
+        collation in terms of how it differs from the base collation,
+        rather than defining the entire collation. The following table
+        lists the base collations for the Unicode character sets.
+      </para>
+
+      <informaltable>
+        <tgroup cols="2">
+          <colspec colwidth="30*"/>
+          <colspec colwidth="60*"/>
+          <tbody>
+            <row>
+              <entry><emphasis role="bold">Character Set</emphasis></entry>
+              <entry><emphasis role="bold">Base Collation</emphasis></entry>
+            </row>
+            <row>
+              <entry><literal>utf8</literal></entry>
+              <entry><literal>utf8_unicode_ci</literal></entry>
+            </row>
+            <row>
+              <entry><literal>ucs2</literal></entry>
+              <entry><literal>ucs2_unicode_ci</literal></entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </informaltable>
+
+      <para>
+        The following brief summary describes the LDML characteristics
+        required for understanding the procedure for adding a collation
+        given later in this section:
+      </para>
+
+      <itemizedlist>
+
+        <listitem>
+          <para>
+            LDML has reset rules and shift rules.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            Characters named in these rules can be written in
+            <literal>\u<replaceable>nnnn</replaceable></literal> format,
+            where <replaceable>nnnn</replaceable> is the hexadecimal
+            Unicode code point value. Basic Latin letters
+            <literal>A-Z</literal> and <literal>a-z</literal> can also be
+            written literally.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            A reset rule does not specify any ordering in and of itself.
+            Instead, it <quote>resets</quote> the ordering for
+            subsequent shift rules to cause them to be taken in relation
+            to a given character. Either of the following rules resets
+            subsequent shift rules to be taken in relation to the letter
+            <literal>'A'</literal>:
+          </para>
+
+<programlisting>
+&lt;reset&gt;A&lt;/reset&gt;
+
+&lt;reset&gt;\u0041&lt;/reset&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Shift rules define primary, secondary, and tertiary
+            differences of a character from another character. They are
+            specified using <literal>&lt;p&gt;</literal>,
+            <literal>&lt;s&gt;</literal>, and
+            <literal>&lt;t&gt;</literal> elements. Either of the
+            following rules specifies a primary shift rule for the
+            <literal>'G'</literal> character:
+          </para>
+
+<programlisting>
+&lt;p&gt;G&lt;/p&gt;
+
+&lt;p&gt;\u0047&lt;/p&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Use the shift rules as follows to distinguish characters:
+          </para>
+
+          <itemizedlist>
+
+            <listitem>
+              <para>
+                Use primary differences to distinguish separate letters
+              </para>
+            </listitem>
+
+            <listitem>
+              <para>
+                Use secondary differences to distinguish accent
+                variations
+              </para>
+            </listitem>
+
+            <listitem>
+              <para>
+                Use tertiary differences to distinguish lettercase
+                variations
+              </para>
+            </listitem>
+
+          </itemizedlist>
+        </listitem>
+
+      </itemizedlist>
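+
+      <para>
+        As an illustrative sketch of how reset and shift rules combine
+        (a hypothetical fragment, not taken from any collation shipped
+        with MySQL), the following rules ask for
+        <literal>'å'</literal> to sort as a separate letter after
+        <literal>'Z'</literal>, with <literal>'Å'</literal> differing
+        from <literal>'å'</literal> only in lettercase. Each shift rule
+        is taken relative to the character named just before it:
+      </para>
+
+<programlisting>
+&lt;rules&gt;
+  &lt;reset&gt;Z&lt;/reset&gt;
+    &lt;p&gt;\u00E5&lt;/p&gt; &lt;!-- a with ring above: primary difference from Z --&gt;
+    &lt;t&gt;\u00C5&lt;/t&gt; &lt;!-- A with ring above: tertiary difference from \u00E5 --&gt;
+&lt;/rules&gt;
+</programlisting>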
+
+      <remark role="todo">
+        Add: Examples of the use of these rules
+      </remark>
+
+      <para>
+        To add a UCA collation for a Unicode character set without
+        recompiling MySQL, use the following procedure. The example adds
+        a collation named <literal>utf8_phone_ci</literal> to the
+        <literal>utf8</literal> character set. The collation is designed
+        for a scenario involving a Web application for which users post
+        their names and phone numbers. Phone numbers can be given in
+        very different formats:
+      </para>
+
+<programlisting>
++7-12345-67
++7-12-345-67
++7 12 345 67
++7 (12) 345 67
++71234567
+</programlisting>
+
+      <para>
+        The problem raised by dealing with these kinds of values is that
+        the varying allowable formats make searching for a specific
+        phone number very difficult. The solution is to define a new
+        collation that reorders punctuation characters, making them
+        ignorable.
+      </para>
+
+      <orderedlist>
+
+        <listitem>
+          <para>
+            Choose a collation ID, as shown in
+            <xref linkend="adding-collation-choosing-id"/>. The
+            following steps use an ID of 252.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            You will need to modify the <literal>Index.xml</literal>
+            configuration file. This file will be located in the
+            directory named by the <literal>character_sets_dir</literal>
+            system variable. You can check the variable value as
+            follows, although the pathname might be different on your
+            system:
+          </para>
+
+<programlisting>
+mysql&gt; <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
++--------------------+----------------------------------------+
+| Variable_name      | Value                                  |
++--------------------+----------------------------------------+
+| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ | 
++--------------------+----------------------------------------+
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            Choose a name for the collation and list it in the
+            <filename>Index.xml</filename> file. In addition, you will
+            need to provide the collation ordering rules. Find the
+            <literal>&lt;charset&gt;</literal> element for the character
+            set to which the collation is being added, and add a
+            <literal>&lt;collation&gt;</literal> element that indicates
+            the collation name and ID. Within the
+            <literal>&lt;collation&gt;</literal> element, provide a
+            <literal>&lt;rules&gt;</literal> element containing the
+            ordering rules:
+          </para>
+
+<programlisting>
+&lt;charset name="utf8"&gt;
+  ...
+  &lt;!-- associate collation name with its ID --&gt;
+  &lt;collation name="utf8_phone_ci" id="252"&gt;
+    &lt;rules&gt;
+      &lt;reset&gt;\u0000&lt;/reset&gt;
+        &lt;s&gt;\u0020&lt;/s&gt; &lt;!-- space --&gt;
+        &lt;s&gt;\u0028&lt;/s&gt; &lt;!-- left parenthesis --&gt;
+        &lt;s&gt;\u0029&lt;/s&gt; &lt;!-- right parenthesis --&gt;
+        &lt;s&gt;\u002B&lt;/s&gt; &lt;!-- plus --&gt;
+        &lt;s&gt;\u002D&lt;/s&gt; &lt;!-- hyphen --&gt;
+    &lt;/rules&gt;
+  &lt;/collation&gt;
+  ...
+&lt;/charset&gt;
+</programlisting>
+        </listitem>
+
+        <listitem>
+          <para>
+            If you want a similar collation for other Unicode character
+            sets, add other <literal>&lt;collation&gt;</literal>
+            elements. For example, to define
+            <literal>ucs2_phone_ci</literal>, add a
+            <literal>&lt;collation&gt;</literal> element to the
+            <literal>&lt;charset name="ucs2"&gt;</literal> element.
+            Remember that each collation must have its own unique ID.
+          </para>
+        </listitem>
+
+        <listitem>
+          <para>
+            Restart the server and use this statement to verify that the
+            collation is present:
+          </para>
+
+<programlisting>
+mysql&gt; <userinput>SHOW COLLATION LIKE 'utf8_phone_ci';</userinput>
++---------------+---------+-----+---------+----------+---------+
+| Collation     | Charset | Id  | Default | Compiled | Sortlen |
++---------------+---------+-----+---------+----------+---------+
+| utf8_phone_ci | utf8    | 252 |         |          |       8 | 
++---------------+---------+-----+---------+----------+---------+
+</programlisting>
+        </listitem>
+
+      </orderedlist>
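+
+      <para>
+        For the optional step of defining a similar collation for
+        another Unicode character set, the
+        <literal>ucs2_phone_ci</literal> entry might look like the
+        following sketch. The ID of 253 is an assumption made for
+        illustration; choose any ID that is unused on your system:
+      </para>
+
+<programlisting>
+&lt;charset name="ucs2"&gt;
+  ...
+  &lt;!-- hypothetical ID; must differ from the utf8_phone_ci ID --&gt;
+  &lt;collation name="ucs2_phone_ci" id="253"&gt;
+    &lt;rules&gt;
+      &lt;reset&gt;\u0000&lt;/reset&gt;
+        &lt;s&gt;\u0020&lt;/s&gt; &lt;!-- space --&gt;
+        &lt;s&gt;\u0028&lt;/s&gt; &lt;!-- left parenthesis --&gt;
+        &lt;s&gt;\u0029&lt;/s&gt; &lt;!-- right parenthesis --&gt;
+        &lt;s&gt;\u002B&lt;/s&gt; &lt;!-- plus --&gt;
+        &lt;s&gt;\u002D&lt;/s&gt; &lt;!-- hyphen --&gt;
+    &lt;/rules&gt;
+  &lt;/collation&gt;
+  ...
+&lt;/charset&gt;
+</programlisting>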
+
+      <para>
+        Now we can test the collation to make sure that it has the
+        desired properties.
+      </para>
+
+      <para>
+        Create a table containing some sample phone numbers using the
+        new collation:
+      </para>
+
+<programlisting>
+<!--
+mysql> DROP TABLE IF EXISTS phonebook;
+Query OK, 0 rows affected, 1 warning (0.00 sec)
+-->
+mysql&gt; <userinput>CREATE TABLE phonebook (</userinput>
+    -&gt; <userinput>  name VARCHAR(64),</userinput>
+    -&gt; <userinput>  phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci</userinput>
+    -&gt; <userinput>);</userinput>
+Query OK, 0 rows affected (0.09 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql&gt; <userinput>INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');</userinput>
+Query OK, 1 row affected (0.00 sec)
+</programlisting>
+
+      <para>
+        Run some queries to see whether the ignored punctuation
+        characters are in fact ignored for sorting and comparisons:
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SELECT * FROM phonebook ORDER BY phone;</userinput>
++-------+--------------------+
+| name  | phone              |
++-------+--------------------+
+| Sanja | +380 (912) 8008005 | 
+| Bar   | +7-912-800-80-01   | 
+| Svoj  | +7 912 800 80 02   | 
+| Ramil | (7912) 800 80 03   | 
+| Hf    | +7 (912) 800 80 04 | 
++-------+--------------------+
+5 rows in set (0.00 sec)
+
+mysql&gt; <userinput>SELECT * FROM phonebook WHERE phone='+7(912)800-80-01';</userinput>
++------+------------------+
+| name | phone            |
++------+------------------+
+| Bar  | +7-912-800-80-01 | 
++------+------------------+
+1 row in set (0.00 sec)
+
+mysql&gt; <userinput>SELECT * FROM phonebook WHERE phone='79128008001';</userinput>
++------+------------------+
+| name | phone            |
++------+------------------+
+| Bar  | +7-912-800-80-01 | 
++------+------------------+
+1 row in set (0.00 sec)
+
+mysql&gt; <userinput>SELECT * FROM phonebook WHERE phone='7 9 1 2 8 0 0 8 0 0 1';</userinput>
++------+------------------+
+| name | phone            |
++------+------------------+
+| Bar  | +7-912-800-80-01 | 
++------+------------------+
+1 row in set (0.00 sec)
+</programlisting>
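+
+      <para>
+        A comparison can also be checked without a table. The
+        following statement is a sketch that assumes the
+        <literal>utf8_phone_ci</literal> collation is loaded; the
+        <literal>same_number</literal> alias is illustrative only. If
+        the collation ignores punctuation as in the preceding queries,
+        the result should be <literal>1</literal>:
+      </para>
+
+<programlisting>
+mysql&gt; <userinput>SELECT _utf8'+7 (912) 800-80-01' COLLATE utf8_phone_ci</userinput>
+    -&gt; <userinput>       = _utf8'79128008001' AS same_number;</userinput>
+</programlisting>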
+
+    </section>
+
+  </section>
+
   <section id="problems-with-character-sets">
 
     <title>Problems With Character Sets</title>

@@ -6032,21 +6927,14 @@
           program with support for the character set.
         </para>
 
-        <remark role="todo">
-          Add xref to LDML section when it gets added.
-        </remark>
-
         <para>
           For Unicode character sets, you can define collations without
-          recompiling by using LDML notation.
+          recompiling by using LDML notation. See
+          <xref linkend="adding-collation-unicode-uca"/>.
         </para>
       </listitem>
 
       <listitem>
-        <remark role="todo">
-          dynamic = not compiled in?
-        </remark>
-
         <para>
           The character set is a dynamic character set, but you do not
           have a configuration file for it. In this case, you should

