Author: paul
Date: 2008-05-29 19:49:41 +0200 (Thu, 29 May 2008)
New Revision: 10862
Log:
r31753@frost: paul | 2008-05-29 11:37:51 -0500
Sync translations
Modified:
trunk/it/refman-5.1/internationalization.xml
trunk/pt/refman-5.1/internationalization.xml
Property changes on: trunk
___________________________________________________________________
Name: svk:merge
- 4767c598-dc10-0410-bea0-d01b485662eb:/mysqldoc-local/mysqldoc/trunk:35828
7d8d2c4e-af1d-0410-ab9f-b038ce55645b:/mysqldoc-local/mysqldoc:31752
b5ec3a16-e900-0410-9ad2-d183a3acac99:/mysqldoc-local/mysqldoc/trunk:14218
bf112a9c-6c03-0410-a055-ad865cd57414:/mysqldoc-local/mysqldoc/trunk:31512
+ 4767c598-dc10-0410-bea0-d01b485662eb:/mysqldoc-local/mysqldoc/trunk:35828
7d8d2c4e-af1d-0410-ab9f-b038ce55645b:/mysqldoc-local/mysqldoc:31753
b5ec3a16-e900-0410-9ad2-d183a3acac99:/mysqldoc-local/mysqldoc/trunk:14218
bf112a9c-6c03-0410-a055-ad865cd57414:/mysqldoc-local/mysqldoc/trunk:31512
Modified: trunk/it/refman-5.1/internationalization.xml
===================================================================
--- trunk/it/refman-5.1/internationalization.xml 2008-05-29 17:49:32 UTC (rev 10861)
+++ trunk/it/refman-5.1/internationalization.xml 2008-05-29 17:49:41 UTC (rev 10862)
Changed blocks: 2, Lines Added: 902, Lines Deleted: 9; 29935 bytes
@@ -5985,6 +5985,906 @@
</section>
+ <section id="adding-collation">
+
+ <title>How to Add a New Collation to a Character Set</title>
+
+ <indexterm>
+ <primary>collation</primary>
+ <secondary>adding</secondary>
+ </indexterm>
+
+ <para>
+ A collation is a set of rules that defines how to compare and sort
+ character strings. Each collation in MySQL belongs to a single
+ character set. Every character set has at least one collation, and
+ most have two or more collations.
+ </para>
+
+ <para>
+ A collation orders characters based on weights. Each character in
+ a character set maps to a weight. Characters with equal weights
+ compare as equal, and characters with unequal weights compare
+ according to the relative magnitude of their weights.
+ </para>
+
+ <para>
+ MySQL supports several collation implementations, as discussed in
+ <xref linkend="charset-collation-implementations"/>. Some of these
+ can be added to MySQL without recompiling:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ Simple collations for 8-bit character sets
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ UCA-based collations for Unicode character sets
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Binary (<literal><replaceable>xxx</replaceable>_bin</literal>)
+ collations
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ The following discussion describes how to add collations of the
+ first two types to existing character sets. All existing character
+ sets already have a binary collation, so there is no need here to
+ describe how to add one.
+ </para>
+
+ <para>
+ Summary of the procedure for adding a new collation:
+ </para>
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ Choose a collation ID
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Add configuration information that names the collation and
+ describes the character-ordering rules
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Restart the server
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Verify that the collation is present
+ </para>
+ </listitem>
+
+ </orderedlist>
+
+ <para>
+ The instructions here cover only collations that can be added
+ without recompiling MySQL. To add a collation that does require
+ recompiling (as implemented by means of functions in a C source
+ file), use the instructions in
+ <xref linkend="adding-character-set"/>. However, instead of adding
+ all the information required for a complete character set, just
+ modify the appropriate files for an existing character set. That
+ is, based on what is already present for the character set's
+ current collations, add new data structures, functions, and
+ configuration information for the new collation. For an example,
+ see the MySQL Blog article in the following list of additional
+ resources.
+ </para>
+
+ <para>
+ <emphasis role="bold">Additional resources</emphasis>
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ The Unicode Collation Algorithm (UCA) specification:
+ <ulink url="http://www.unicode.org/reports/tr10/"/>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The Locale Data Markup Language (LDML) specification:
+ <ulink url="http://www.unicode.org/reports/tr35/"/>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ MySQL University session <quote>How to Add a
+ Collation</quote>:
+ <ulink url="http://forge.mysql.com/wiki/How_to_Add_a_Collation"/>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ MySQL Blog article <quote>Instructions for adding a new
+ Unicode collation</quote>:
+ <ulink url="http://blogs.mysql.com/peterg/2008/05/19/instructions-for-adding-a-new-unicode-collation/"/>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <section id="charset-collation-implementations">
+
+ <title>Collation Implementation Types</title>
+
+ <para>
+ MySQL implements several types of collations:
+ </para>
+
+ <para>
+ <emphasis role="bold">Simple collations for 8-bit character
+ sets</emphasis>
+ </para>
+
+ <para>
+ This kind of collation is implemented using an array of 256
+ weights that defines a one-to-one mapping from character codes
+ to weights. <literal>latin1_swedish_ci</literal> is an example.
+ It is a case-insensitive collation, so the uppercase and
+ lowercase versions of a character have the same weights and they
+ compare as equal.
+ </para>
+
+<programlisting>
+mysql> <userinput>SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';</userinput>
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> <userinput>SELECT 'a' = 'A';</userinput>
++-----------+
+| 'a' = 'A' |
++-----------+
+|         1 |
++-----------+
+1 row in set (0.00 sec)
+</programlisting>
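+
+ <para>
+ For contrast, the following sketch applies the binary collation
+ of the same character set. It assumes a connection character set
+ of <literal>latin1</literal>. Because
+ <literal>latin1_bin</literal> compares characters by code value
+ rather than through a case-insensitive weight table, the
+ comparison should return 0.
+ </para>
+
+<programlisting>
+mysql> <userinput>SET NAMES 'latin1';</userinput>
+mysql> <userinput>SELECT 'a' = 'A' COLLATE latin1_bin;</userinput>
+</programlisting>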
+
+ <para>
+ <emphasis role="bold">Complex collations for 8-bit character
+ sets</emphasis>
+ </para>
+
+ <para>
+ This kind of collation is implemented using functions in a C
+ source file that define how to order characters, as described in
+ <xref linkend="adding-character-set"/>.
+ </para>
+
+ <para>
+ <emphasis role="bold">Collations for non-Unicode multi-byte
+ character sets</emphasis>
+ </para>
+
+ <para>
+ For this type of collation, 8-bit (single-byte) and multi-byte
+ characters are handled differently. For 8-bit characters,
+ character codes map to weights in case-insensitive fashion. (For
+ example, the single-byte characters <literal>'a'</literal> and
+ <literal>'A'</literal> both have a weight of
+ <literal>0x41</literal>.) For multi-byte characters, there are
+ two types of relationship between character codes and weights:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ Weights equal character codes.
+ <literal>sjis_japanese_ci</literal> is an example of this
+ kind of collation. The multi-byte character
+ <literal>'ぢ'</literal> has a character code of
+ <literal>0x82C0</literal>, and the weight is also
+ <literal>0x82C0</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Character codes map one-to-one to weights, but a code is not
+ necessarily equal to the weight.
+ <literal>gbk_chinese_ci</literal> is an example of this kind
+ of collation. The multi-byte character
+ <literal>'膰'</literal> has a character code of
+ <literal>0x81B0</literal> but a weight of
+ <literal>0xC286</literal>.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ <emphasis role="bold">Collations for Unicode multi-byte
+ character sets</emphasis>
+ </para>
+
+ <para>
+ Some of these collations are based on the Unicode Collation
+ Algorithm (UCA); others are not.
+ </para>
+
+ <para>
+ Non-UCA collations have a one-to-one mapping from character code
+ to weight. In MySQL, such collations are case insensitive and
+ accent insensitive. <literal>utf8_general_ci</literal> is an
+ example: <literal>'a'</literal>, <literal>'A'</literal>,
+ <literal>'À'</literal>, and <literal>'á'</literal> each have
+ different character codes but all have a weight of
+ <literal>0x0041</literal> and compare as equal.
+ </para>
+
+<programlisting>
+mysql> <userinput>SET NAMES 'utf8' COLLATE 'utf8_general_ci';</userinput>
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> <userinput>SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';</userinput>
++-----------+-----------+-----------+
+| 'a' = 'A' | 'a' = 'À' | 'a' = 'á' |
++-----------+-----------+-----------+
+|         1 |         1 |         1 |
++-----------+-----------+-----------+
+1 row in set (0.06 sec)
+</programlisting>
+
+ <para>
+ UCA-based collations in MySQL have these properties:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ If a character has weights, each weight uses 2 bytes (16
+ bits)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A character may have zero weights (or an empty weight). In
+ this case, the character is ignorable. Example: "U+0000
+ NULL" does not have a weight and is ignorable.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A character may have one weight. Example:
+ <literal>'a'</literal> has a weight of
+ <literal>0x0E33</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A character may have many weights. This is an expansion.
+ Example: The German letter <literal>'ß'</literal> (SZ
+ LIGATURE, or SHARP S) has a weight of
+ <literal>0x0FEA0FEA</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Many characters may have one weight. This is a contraction.
+ Example: <literal>'ch'</literal> is a single letter in Czech
+ and has a weight of <literal>0x0EE2</literal>.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ A many-characters-to-many-weights mapping is also possible (this
+ is contraction with expansion), but is not supported by MySQL.
+ </para>
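+
+ <para>
+ The expansion property can be observed from SQL. The following
+ sketch assumes a connection character set of
+ <literal>utf8</literal> and a client that sends the literals as
+ UTF-8. Because <literal>utf8_unicode_ci</literal> is UCA-based
+ and gives <literal>'ß'</literal> two weights, the first
+ comparison should return 1. The non-UCA
+ <literal>utf8_general_ci</literal> collation assigns one weight
+ per character, so the second comparison should return 0.
+ </para>
+
+<programlisting>
+mysql> <userinput>SET NAMES 'utf8';</userinput>
+mysql> <userinput>SELECT 'ß' = 'ss' COLLATE utf8_unicode_ci;</userinput>
+mysql> <userinput>SELECT 'ß' = 'ss' COLLATE utf8_general_ci;</userinput>
+</programlisting>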
+
+ <para>
+ <emphasis role="bold">Miscellaneous collations</emphasis>
+ </para>
+
+ <para>
+ There are also a few collations that do not fall into any of the
+ previous categories.
+ </para>
+
+ </section>
+
+ <section id="adding-collation-choosing-id">
+
+ <title>Choosing a Collation ID</title>
+
+ <para>
+ Each collation must have a unique ID. To add a new collation,
+ you must choose an ID value that is not currently used. The ID
+ that you choose is the value that will show up in these
+ contexts:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ The <literal>Id</literal> column of <literal>SHOW
+ COLLATION</literal> output
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <literal>ID</literal> column of the
+ <literal>INFORMATION_SCHEMA.COLLATIONS</literal> table
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <literal>charsetnr</literal> member of the
+ <literal>MYSQL_FIELD</literal> C API data structure
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <literal>number</literal> member of the
+ <literal>MY_CHARSET_INFO</literal> data structure returned
+ by the
+ <function role="capi">mysql_get_character_set_info()</function>
+ C API function
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ To determine the largest currently used ID, issue the following
+ statement:
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;</userinput>
++---------+
+| MAX(ID) |
++---------+
+|     210 |
++---------+
+</programlisting>
+
+ <para>
+ For the output just shown, you could choose an ID higher than 210
+ for the new collation.
+ </para>
+
+ <para>
+ To display a list of all currently used IDs, issue this
+ statement:
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT ID FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID;</userinput>
++-----+
+| ID  |
++-----+
+|   1 |
+|   2 |
+| ... |
+|  52 |
+|  53 |
+|  57 |
+|  58 |
+| ... |
+|  98 |
+|  99 |
+| 128 |
+| 129 |
+| ... |
+| 210 |
++-----+
+</programlisting>
+
+ <para>
+ In this case, you can either choose an unused ID from within the
+ current range of IDs, or choose an ID that is higher than the
+ current maximum ID. For example, in the output just shown, there
+ are unused IDs between 53 and 57, and between 99 and 128. Or you
+ could choose an ID higher than 210.
+ </para>
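+
+ <para>
+ As a sketch of how to check candidate IDs directly, the first
+ statement below looks up one specific value (56 is used here
+ only as an illustration); an empty result means that the ID is
+ not currently in use. The second statement lists each value that
+ immediately follows a used ID but is itself unused, that is, the
+ start of each gap plus the value just above the current maximum.
+ The column alias <literal>first_unused_id</literal> is an
+ arbitrary name chosen for this example.
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT ID, COLLATION_NAME</userinput>
+    -> <userinput>FROM INFORMATION_SCHEMA.COLLATIONS</userinput>
+    -> <userinput>WHERE ID = 56;</userinput>
+
+mysql> <userinput>SELECT c1.ID + 1 AS first_unused_id</userinput>
+    -> <userinput>FROM INFORMATION_SCHEMA.COLLATIONS AS c1</userinput>
+    -> <userinput>LEFT JOIN INFORMATION_SCHEMA.COLLATIONS AS c2</userinput>
+    -> <userinput>  ON c2.ID = c1.ID + 1</userinput>
+    -> <userinput>WHERE c2.ID IS NULL</userinput>
+    -> <userinput>ORDER BY first_unused_id;</userinput>
+</programlisting>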
+
+ <warning>
+ <para>
+ If you upgrade MySQL, you may find that the collation ID you
+ choose has been assigned to a collation included in the new
+ MySQL distribution. In this case, you will need to choose a
+ new value for your own collation.
+ </para>
+
+ <para>
+ In addition, before upgrading, you should save the
+ configuration files that you change. If you upgrade in place,
+ the process will replace your modified files.
+ </para>
+ </warning>
+
+ </section>
+
+ <section id="adding-collation-simple-8bit">
+
+ <title>Adding a Simple Collation to an 8-Bit Character Set</title>
+
+ <para>
+ To add a simple collation for an 8-bit character set without
+ recompiling MySQL, use the following procedure. The example adds
+ a collation named <literal>latin1_test_ci</literal> to the
+ <literal>latin1</literal> character set.
+ </para>
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ Choose a collation ID, as shown in
+ <xref linkend="adding-collation-choosing-id"/>. The
+ following steps use an ID of 56.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ You will need to modify the <literal>Index.xml</literal> and
+ <literal>latin1.xml</literal> configuration files. These
+ files will be located in the directory named by the
+ <literal>character_sets_dir</literal> system variable. You
+ can check the variable value as follows, although the
+ pathname might be different on your system:
+ </para>
+
+<programlisting>
+mysql> <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
++--------------------+-----------------------------------------+
+| Variable_name      | Value                                   |
++--------------------+-----------------------------------------+
+| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
++--------------------+-----------------------------------------+
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Choose a name for the collation and list it in the
+ <filename>Index.xml</filename> file. Find the
+ <literal>&lt;charset&gt;</literal> element for the character
+ set to which the collation is being added, and add a
+ <literal>&lt;collation&gt;</literal> element that indicates
+ the collation name and ID. For example:
+ </para>
+
+<programlisting>
+<charset name="latin1">
+ ...
+ <!-- associate collation name with its ID -->
+ <collation name="latin1_test_ci" id="56"/>
+ ...
+</charset>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ In the <filename>latin1.xml</filename> configuration file,
+ add a <literal>&lt;collation&gt;</literal> element that
+ names the collation and that contains a
+ <literal>&lt;map&gt;</literal> element that defines a
+ character code-to-weight mapping table. Each word within the
+ <literal>&lt;map&gt;</literal> element must be a number in
+ hexadecimal format.
+ </para>
+
+<programlisting>
+<collation name="latin1_test_ci">
+<map>
+ 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
+ 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
+ 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
+ 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
+ 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
+ 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
+ 60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
+ 50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
+ 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
+ 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
+ A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
+ B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
+ 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
+ 44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
+ 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
+ 44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
+</map>
+</collation>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Restart the server and use this statement to verify that the
+ collation is present:
+ </para>
+
+<programlisting>
+mysql> <userinput>SHOW COLLATION LIKE 'latin1_test_ci';</userinput>
++----------------+---------+----+---------+----------+---------+
+| Collation      | Charset | Id | Default | Compiled | Sortlen |
++----------------+---------+----+---------+----------+---------+
+| latin1_test_ci | latin1  | 56 |         |          |       1 |
++----------------+---------+----+---------+----------+---------+
+</programlisting>
+ </listitem>
+
+ </orderedlist>
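+
+ <para>
+ After the collation is visible, a quick comparison can confirm
+ that the weight table is in effect. The following sketch assumes
+ a connection character set of <literal>latin1</literal>. In the
+ weight table shown earlier, positions <literal>0x41</literal>
+ (<literal>'A'</literal>) and <literal>0x61</literal>
+ (<literal>'a'</literal>) both hold the weight
+ <literal>41</literal>, so the comparison should return 1.
+ </para>
+
+<programlisting>
+mysql> <userinput>SET NAMES 'latin1';</userinput>
+mysql> <userinput>SELECT 'a' = 'A' COLLATE latin1_test_ci;</userinput>
+</programlisting>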
+
+ </section>
+
+ <section id="adding-collation-unicode-uca">
+
+ <title>Adding a UCA Collation to a Unicode Character Set</title>
+
+ <para>
+ UCA collations for Unicode character sets can be added to MySQL
+ without recompiling by using a subset of the Locale Data Markup
+ Language (LDML), which is available at
+ <ulink url="http://www.unicode.org/reports/tr35/"/>. In
+ &current-series;, this method of adding collations is supported
+ as of MySQL 5.1.20. With this method, you begin with an existing
+ <quote>base</quote> collation. Then you describe the new
+ collation in terms of how it differs from the base collation,
+ rather than defining the entire collation. The following table
+ lists the base collations for the Unicode character sets.
+ </para>
+
+ <informaltable>
+ <tgroup cols="2">
+ <colspec colwidth="30*"/>
+ <colspec colwidth="60*"/>
+ <tbody>
+ <row>
+ <entry><emphasis role="bold">Character Set</emphasis></entry>
+ <entry><emphasis role="bold">Base Collation</emphasis></entry>
+ </row>
+ <row>
+ <entry><literal>utf8</literal></entry>
+ <entry><literal>utf8_unicode_ci</literal></entry>
+ </row>
+ <row>
+ <entry><literal>ucs2</literal></entry>
+ <entry><literal>ucs2_unicode_ci</literal></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+
+ <para>
+ The following brief summary describes the LDML characteristics
+ required for understanding the procedure for adding a collation
+ given later in this section:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ LDML has reset rules and shift rules.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Characters named in these rules can be written in
+ <literal>\u<replaceable>nnnn</replaceable></literal> format,
+ where <replaceable>nnnn</replaceable> is the hexadecimal
+ Unicode code point value. Basic Latin letters
+ <literal>A-Z</literal> <literal>a-z</literal> can also be
+ written literally.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A reset rule does not specify any ordering in and of itself.
+ Instead, it <quote>resets</quote> the ordering for
+ subsequent shift rules to cause them to be taken in relation
+ to a given character. Either of the following rules resets
+ subsequent shift rules to be taken in relation to the letter
+ <literal>'A'</literal>:
+ </para>
+
+<programlisting>
+<reset>A</reset>
+
+<reset>\u0041</reset>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Shift rules define primary, secondary, and tertiary
+ differences of a character from another character. They are
+ specified using <literal>&lt;p&gt;</literal>,
+ <literal>&lt;s&gt;</literal>, and
+ <literal>&lt;t&gt;</literal> elements. Either of the
+ following rules specifies a primary shift rule for the
+ <literal>'G'</literal> character:
+ </para>
+
+<programlisting>
+<p>G</p>
+
+<p>\u0047</p>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Use the shift rules as follows to distinguish characters:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ Use primary differences to distinguish separate letters
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Use secondary differences to distinguish accent
+ variations
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Use tertiary differences to distinguish lettercase
+ variations
+ </para>
+ </listitem>
+
+ </itemizedlist>
+ </listitem>
+
+ </itemizedlist>
+
+ <remark role="todo">
+ Add: Examples of the use of these rules
+ </remark>
+
+ <para>
+ To add a UCA collation for a Unicode character set without
+ recompiling MySQL, use the following procedure. The example adds
+ a collation named <literal>utf8_phone_ci</literal> to the
+ <literal>utf8</literal> character set. The collation is designed
+ for a scenario involving a Web application for which users post
+ their names and phone numbers. Phone numbers can be given in
+ very different formats:
+ </para>
+
+<programlisting>
++7-12345-67
++7-12-345-67
++7 12 345 67
++7 (12) 345 67
++71234567
+</programlisting>
+
+ <para>
+ The problem raised by dealing with these kinds of values is that
+ the varying allowable formats make searching for a specific
+ phone number very difficult. The solution is to define a new
+ collation that reorders punctuation characters, making them
+ ignorable.
+ </para>
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ Choose a collation ID, as shown in
+ <xref linkend="adding-collation-choosing-id"/>. The
+ following steps use an ID of 252.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ You will need to modify the <literal>Index.xml</literal>
+ configuration file. This file will be located in the
+ directory named by the <literal>character_sets_dir</literal>
+ system variable. You can check the variable value as
+ follows, although the pathname might be different on your
+ system:
+ </para>
+
+<programlisting>
+mysql> <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
++--------------------+-----------------------------------------+
+| Variable_name      | Value                                   |
++--------------------+-----------------------------------------+
+| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
++--------------------+-----------------------------------------+
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Choose a name for the collation and list it in the
+ <filename>Index.xml</filename> file. In addition, you will
+ need to provide the collation ordering rules. Find the
+ <literal>&lt;charset&gt;</literal> element for the character
+ set to which the collation is being added, and add a
+ <literal>&lt;collation&gt;</literal> element that indicates
+ the collation name and ID. Within the
+ <literal>&lt;collation&gt;</literal> element, provide a
+ <literal>&lt;rules&gt;</literal> element containing the
+ ordering rules:
+ </para>
+
+<programlisting>
+<charset name="utf8">
+ ...
+ <!-- associate collation name with its ID -->
+ <collation name="utf8_phone_ci" id="252">
+ <rules>
+ <reset>\u0000</reset>
+ <s>\u0020</s> <!-- space -->
+ <s>\u0028</s> <!-- left parenthesis -->
+ <s>\u0029</s> <!-- right parenthesis -->
+ <s>\u002B</s> <!-- plus -->
+ <s>\u002D</s> <!-- hyphen -->
+ </rules>
+ </collation>
+ ...
+</charset>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ If you want a similar collation for other Unicode character
+ sets, add other <literal>&lt;collation&gt;</literal>
+ elements. For example, to define
+ <literal>ucs2_phone_ci</literal>, add a
+ <literal>&lt;collation&gt;</literal> element to the
+ <literal>&lt;charset name="ucs2"&gt;</literal> element.
+ Remember that each collation must have its own unique ID.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Restart the server and use this statement to verify that the
+ collation is present:
+ </para>
+
+<programlisting>
+mysql> <userinput>SHOW COLLATION LIKE 'utf8_phone_ci';</userinput>
++---------------+---------+-----+---------+----------+---------+
+| Collation     | Charset | Id  | Default | Compiled | Sortlen |
++---------------+---------+-----+---------+----------+---------+
+| utf8_phone_ci | utf8    | 252 |         |          |       8 |
++---------------+---------+-----+---------+----------+---------+
+</programlisting>
+ </listitem>
+
+ </orderedlist>
+
+ <para>
+ Now we can test the collation to make sure that it has the
+ desired properties.
+ </para>
+
+ <para>
+ Create a table containing some sample phone numbers using the
+ new collation:
+ </para>
+
+<programlisting>
+<!--
+mysql> DROP TABLE IF EXISTS phonebook;
+Query OK, 0 rows affected, 1 warning (0.00 sec)
+-->
+mysql> <userinput>CREATE TABLE phonebook (</userinput>
+ -> <userinput> name VARCHAR(64),</userinput>
+ -> <userinput> phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci</userinput>
+ -> <userinput>);</userinput>
+Query OK, 0 rows affected (0.09 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');</userinput>
+Query OK, 1 row affected (0.00 sec)
+</programlisting>
+
+ <para>
+ Run some queries to see whether the ignored punctuation
+ characters are in fact ignored for sorting and comparisons:
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT * FROM phonebook ORDER BY phone;</userinput>
++-------+--------------------+
+| name  | phone              |
++-------+--------------------+
+| Sanja | +380 (912) 8008005 |
+| Bar   | +7-912-800-80-01   |
+| Svoj  | +7 912 800 80 02   |
+| Ramil | (7912) 800 80 03   |
+| Hf    | +7 (912) 800 80 04 |
++-------+--------------------+
+5 rows in set (0.00 sec)
+
+mysql> <userinput>SELECT * FROM phonebook WHERE phone='+7(912)800-80-01';</userinput>
++------+------------------+
+| name | phone            |
++------+------------------+
+| Bar  | +7-912-800-80-01 |
++------+------------------+
+1 row in set (0.00 sec)
+
+mysql> <userinput>SELECT * FROM phonebook WHERE phone='79128008001';</userinput>
++------+------------------+
+| name | phone            |
++------+------------------+
+| Bar  | +7-912-800-80-01 |
++------+------------------+
+1 row in set (0.00 sec)
+
+mysql> <userinput>SELECT * FROM phonebook WHERE phone='7 9 1 2 8 0 0 8 0 0 1';</userinput>
++------+------------------+
+| name | phone            |
++------+------------------+
+| Bar  | +7-912-800-80-01 |
++------+------------------+
+1 row in set (0.00 sec)
+</programlisting>
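+
+ <para>
+ For comparison, the same lookup can be repeated with a standard
+ collation applied to the column. This is only a sketch;
+ <literal>utf8_general_ci</literal> is chosen here as an example
+ of a collation in which the punctuation characters are not
+ ignorable, so the statement should return an empty result.
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT * FROM phonebook</userinput>
+    -> <userinput>WHERE phone COLLATE utf8_general_ci = '79128008001';</userinput>
+</programlisting>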
+
+ </section>
+
+ </section>
+
<section id="problems-with-character-sets">
<title>Problems With Character Sets</title>
@@ -6027,21 +6927,14 @@
program with support for the character set.
</para>
- <remark role="todo">
- Add xref to LDML section when it gets added.
- </remark>
-
<para>
For Unicode character sets, you can define collations without
- recompiling by using LDML notation.
+ recompiling by using LDML notation. See
+ <xref linkend="adding-collation-unicode-uca"/>.
</para>
</listitem>
<listitem>
- <remark role="todo">
- dynamic = not compiled in?
- </remark>
-
<para>
The character set is a dynamic character set, but you do not
have a configuration file for it. In this case, you should
Modified: trunk/pt/refman-5.1/internationalization.xml
===================================================================
--- trunk/pt/refman-5.1/internationalization.xml 2008-05-29 17:49:32 UTC (rev 10861)
+++ trunk/pt/refman-5.1/internationalization.xml 2008-05-29 17:49:41 UTC (rev 10862)
Changed blocks: 6, Lines Added: 925, Lines Deleted: 37; 34318 bytes
@@ -5500,12 +5500,12 @@
 The <literal>&lt;charset&gt;</literal> element must list all
the collations for the character set. These must include at
least a binary collation and a default collation. The default
- collation is usually <literal>general_ci</literal> (general,
- case insensitive). It is possible for the binary collation to
- be the default collation, but usually they are different. The
- default collation should have a <literal>primary</literal>
- flag. The binary collation should have a
- <literal>binary</literal> flag.
+ collation is usually named using a suffix of
+ <literal>general_ci</literal> (general, case insensitive). It
+ is possible for the binary collation to be the default
+ collation, but usually they are different. The default
+ collation should have a <literal>primary</literal> flag. The
+ binary collation should have a <literal>binary</literal> flag.
</para>
<para>
@@ -5830,19 +5830,17 @@
<literal>ctype_<replaceable>MYSET</replaceable>[]</literal>,
<literal>to_lower_<replaceable>MYSET</replaceable>[]</literal>,
and so forth. Not every complex character set has all of the
- arrays. See the existing
- <filename>ctype-<replaceable>charset_name</replaceable>.c</filename>
- files for examples. See the
- <filename>CHARSET_INFO.txt</filename> file in the
- <filename>strings</filename> directory for additional
+ arrays. See the existing <filename>ctype-*.c</filename> files
+ for examples. See the <filename>CHARSET_INFO.txt</filename> file
+ in the <filename>strings</filename> directory for additional
information.
</para>
<para>
The <literal>ctype</literal> array is indexed by character value
- + 1. This is an old legacy convention for handling
- <literal>EOF</literal>. The other arrays are indexed by
- character value.
+ + 1 and has 257 elements. This is a legacy convention for
+ handling <literal>EOF</literal>. The other arrays are indexed by
+ character value and have 256 elements.
</para>
<para>
@@ -5935,10 +5933,9 @@
<para>
The existing character sets provide the best documentation and
examples to show how these functions are implemented. Look at
- the
- <filename>ctype-<replaceable>charset_name</replaceable>.c</filename>
- files in the <filename>strings</filename> directory, such as the
- files for the <literal>big5</literal>, <literal>czech</literal>,
+ the <filename>ctype-*.c</filename> files in the
+ <filename>strings</filename> directory, such as the files for
+ the <literal>big5</literal>, <literal>czech</literal>,
<literal>gbk</literal>, <literal>sjis</literal>, and
 <literal>tis620</literal> character sets. Take a look at the
<literal>MY_COLLATION_HANDLER</literal> structures to see how
@@ -5973,16 +5970,14 @@
<para>
The existing character sets provide the best documentation and
examples to show how these functions are implemented. Look at
- the
- <filename>ctype-<replaceable>charset_name</replaceable>.c</filename>
- files in the <filename>strings</filename> directory, such as the
- files for the <literal>euc_kr</literal>,
- <literal>gb2312</literal>, <literal>gbk</literal>,
- <literal>sjis</literal>, and <literal>ujis</literal> character
- sets. Take a look at the <literal>MY_CHARSET_HANDLER</literal>
- structures to see how they are used, and see the
- <filename>CHARSET_INFO.txt</filename> file in the
- <filename>strings</filename> directory for additional
+ the <filename>ctype-*.c</filename> files in the
+ <filename>strings</filename> directory, such as the files for
+ the <literal>euc_kr</literal>, <literal>gb2312</literal>,
+ <literal>gbk</literal>, <literal>sjis</literal>, and
+ <literal>ujis</literal> character sets. Take a look at the
+ <literal>MY_CHARSET_HANDLER</literal> structures to see how they
+ are used, and see the <filename>CHARSET_INFO.txt</filename> file
+ in the <filename>strings</filename> directory for additional
information.
</para>
@@ -5990,6 +5985,906 @@
</section>
+ <section id="adding-collation">
+
+ <title>How to Add a New Collation to a Character Set</title>
+
+ <indexterm>
+ <primary>collation</primary>
+ <secondary>adding</secondary>
+ </indexterm>
+
+ <para>
+ A collation is a set of rules that defines how to compare and sort
+ character strings. Each collation in MySQL belongs to a single
+ character set. Every character set has at least one collation, and
+ most have two or more collations.
+ </para>
+
+ <para>
+ A collation orders characters based on weights. Each character in
+ a character set maps to a weight. Characters with equal weights
+ compare as equal, and characters with unequal weights compare
+ according to the relative magnitude of their weights.
+ </para>
+
+ <para>
+ MySQL supports several collation implementations, as discussed in
+ <xref linkend="charset-collation-implementations"/>. Some of these
+ can be added to MySQL without recompiling:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ Simple collations for 8-bit character sets
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ UCA-based collations for Unicode character sets
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Binary (<literal><replaceable>xxx</replaceable>_bin</literal>)
+ collations
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ The following discussion describes how to add collations of the
+ first two types to existing character sets. All existing character
+ sets already have a binary collation, so there is no need here to
+ describe how to add one.
+ </para>
+
+ <para>
+ Summary of the procedure for adding a new collation:
+ </para>
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ Choose a collation ID
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Add configuration information that names the collation and
+ describes the character-ordering rules
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Restart the server
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Verify that the collation is present
+ </para>
+ </listitem>
+
+ </orderedlist>
+
+ <para>
+ The instructions here cover only collations that can be added
+ without recompiling MySQL. To add a collation that does require
+ recompiling (as implemented by means of functions in a C source
+ file), use the instructions in
+ <xref linkend="adding-character-set"/>. However, instead of adding
+ all the information required for a complete character set, just
+ modify the appropriate files for an existing character set. That
+ is, based on what is already present for the character set's
+ current collations, add new data structures, functions, and
+ configuration information for the new collation. For an example,
+ see the MySQL Blog article in the following list of additional
+ resources.
+ </para>
+
+ <para>
+ <emphasis role="bold">Additional resources</emphasis>
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ The Unicode Collation Algorithm (UCA) specification:
+ <ulink url="http://www.unicode.org/reports/tr10/"/>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The Locale Data Markup Language (LDML) specification:
+ <ulink url="http://www.unicode.org/reports/tr35/"/>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ MySQL University session <quote>How to Add a
+ Collation</quote>:
+ <ulink url="http://forge.mysql.com/wiki/How_to_Add_a_Collation"/>
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ MySQL Blog article <quote>Instructions for adding a new
+ Unicode collation</quote>:
+ <ulink url="http://blogs.mysql.com/peterg/2008/05/19/instructions-for-adding-a-new-unicode-collation/"/>
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <section id="charset-collation-implementations">
+
+ <title>Collation Implementation Types</title>
+
+ <para>
+ MySQL implements several types of collations:
+ </para>
+
+ <para>
+ <emphasis role="bold">Simple collations for 8-bit character
+ sets</emphasis>
+ </para>
+
+ <para>
+ This kind of collation is implemented using an array of 256
+ weights that defines a one-to-one mapping from character codes
+ to weights. <literal>latin1_swedish_ci</literal> is an example.
+ It is a case-insensitive collation, so the uppercase and
+ lowercase versions of a character have the same weights and they
+ compare as equal.
+ </para>
+
+<programlisting>
+mysql> <userinput>SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';</userinput>
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> <userinput>SELECT 'a' = 'A';</userinput>
++-----------+
+| 'a' = 'A' |
++-----------+
+| 1 |
++-----------+
+1 row in set (0.00 sec)
+</programlisting>
+
+ <para>
+ <emphasis role="bold">Complex collations for 8-bit character
+ sets</emphasis>
+ </para>
+
+ <para>
+ This kind of collation is implemented using functions in a C
+ source file that define how to order characters, as described in
+ <xref linkend="adding-character-set"/>.
+ </para>
+
+ <para>
+ <emphasis role="bold">Collations for non-Unicode multi-byte
+ character sets</emphasis>
+ </para>
+
+ <para>
+ For this type of collation, 8-bit (single-byte) and multi-byte
+ characters are handled differently. For 8-bit characters,
+ character codes map to weights in case-insensitive fashion. (For
+ example, the single-byte characters <literal>'a'</literal> and
+ <literal>'A'</literal> both have a weight of
+ <literal>0x41</literal>.) For multi-byte characters, there are
+ two types of relationship between character codes and weights:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ Weights equal character codes.
+ <literal>sjis_japanese_ci</literal> is an example of this
+ kind of collation. The multi-byte character
+ <literal>'ぢ'</literal> has a character code of
+ <literal>0x82C0</literal>, and the weight is also
+ <literal>0x82C0</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Character codes map one-to-one to weights, but a code is not
+ necessarily equal to the weight.
+ <literal>gbk_chinese_ci</literal> is an example of this kind
+ of collation. The multi-byte character
+ <literal>'膰'</literal> has a character code of
+ <literal>0x81B0</literal> but a weight of
+ <literal>0xC286</literal>.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ <emphasis role="bold">Collations for Unicode multi-byte
+ character sets</emphasis>
+ </para>
+
+ <para>
+ Some of these collations are based on the Unicode Collation
+ Algorithm (UCA); others are not.
+ </para>
+
+ <para>
+ Non-UCA collations have a one-to-one mapping from character code
+ to weight. In MySQL, such collations are case insensitive and
+ accent insensitive. <literal>utf8_general_ci</literal> is an
+ example: <literal>'a'</literal>, <literal>'A'</literal>,
+ <literal>'À'</literal>, and <literal>'á'</literal> each have
+ different character codes but all have a weight of
+ <literal>0x0041</literal> and compare as equal.
+ </para>
+
+<programlisting>
+mysql> <userinput>SET NAMES 'utf8' COLLATE 'utf8_general_ci';</userinput>
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> <userinput>SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';</userinput>
++-----------+-----------+-----------+
+| 'a' = 'A' | 'a' = 'À' | 'a' = 'á' |
++-----------+-----------+-----------+
+| 1 | 1 | 1 |
++-----------+-----------+-----------+
+1 row in set (0.06 sec)
+</programlisting>
+
+ <para>
+ UCA-based collations in MySQL have these properties:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ If a character has weights, each weight uses 2 bytes (16
+ bits)
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A character may have zero weights (or an empty weight). In
+ this case, the character is ignorable. Example: "U+0000
+ NULL" does not have a weight and is ignorable.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A character may have one weight. Example:
+ <literal>'a'</literal> has a weight of
+ <literal>0x0E33</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A character may have many weights. This is an expansion.
+ Example: The German letter <literal>'ß'</literal> (SZ
+ LIGATURE, or SHARP S) has a weight of
+ <literal>0x0FEA0FEA</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Many characters may have one weight. This is a contraction.
+ Example: <literal>'ch'</literal> is a single letter in Czech
+ and has a weight of <literal>0x0EE2</literal>.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ A many-characters-to-many-weights mapping is also possible (this
+ is contraction with expansion), but is not supported by MySQL.
+ </para>
+
+ <para>
+ <emphasis role="bold">Miscellaneous collations</emphasis>
+ </para>
+
+ <para>
+ There are also a few collations that do not fall into any of the
+ previous categories.
+ </para>
+
+ </section>
+
+ <section id="adding-collation-choosing-id">
+
+ <title>Choosing a Collation ID</title>
+
+ <para>
+ Each collation must have a unique ID. To add a new collation,
+ you must choose an ID value that is not currently used. The ID
+ that you choose is the value that will show up in these
+ contexts:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ The <literal>Id</literal> column of <literal>SHOW
+ COLLATION</literal> output
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <literal>ID</literal> column of the
+ <literal>INFORMATION_SCHEMA.COLLATIONS</literal> table
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <literal>charsetnr</literal> member of the
+ <literal>MYSQL_FIELD</literal> C API data structure
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <literal>number</literal> member of the
+ <literal>MY_CHARSET_INFO</literal> data structure returned
+ by the
+ <function role="capi">mysql_get_character_set_info()</function>
+ C API function
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ To determine the largest currently used ID, issue the following
+ statement:
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;</userinput>
++---------+
+| MAX(ID) |
++---------+
+| 210 |
++---------+
+</programlisting>
+
+ <para>
+ For the output just shown, you could choose an ID higher than 210
+ for the new collation.
+ </para>
+
+ <para>
+ To display a list of all currently used IDs, issue this
+ statement:
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT ID FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID;</userinput>
++-----+
+| ID |
++-----+
+| 1 |
+| 2 |
+| ... |
+| 52 |
+| 53 |
+| 57 |
+| 58 |
+| ... |
+| 98 |
+| 99 |
+| 128 |
+| 129 |
+| ... |
+| 210 |
++-----+
+</programlisting>
+
+ <para>
+ In this case, you can either choose an unused ID from within the
+ current range of IDs, or choose an ID that is higher than the
+ current maximum ID. For example, in the output just shown, there
+ are unused IDs between 53 and 57, and between 99 and 128. Or you
+ could choose an ID higher than 210.
+ </para>
+
+ <warning>
+ <para>
+ If you upgrade MySQL, you may find that the collation ID you
+ choose has been assigned to a collation included in the new
+ MySQL distribution. In this case, you will need to choose a
+ new value for your own collation.
+ </para>
+
+ <para>
+ In addition, before upgrading, you should save the
+ configuration files that you change. If you upgrade in place,
+ the process will replace your modified files.
+ </para>
+ </warning>
+
+ </section>
+
+ <section id="adding-collation-simple-8bit">
+
+ <title>Adding a Simple Collation to an 8-Bit Character Set</title>
+
+ <para>
+ To add a simple collation for an 8-bit character set without
+ recompiling MySQL, use the following procedure. The example adds
+ a collation named <literal>latin1_test_ci</literal> to the
+ <literal>latin1</literal> character set.
+ </para>
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ Choose a collation ID, as shown in
+ <xref linkend="adding-collation-choosing-id"/>. The
+ following steps use an ID of 56.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ You will need to modify the <literal>Index.xml</literal> and
+ <literal>latin1.xml</literal> configuration files. These
+ files will be located in the directory named by the
+ <literal>character_sets_dir</literal> system variable. You
+ can check the variable value as follows, although the
+ pathname might be different on your system:
+ </para>
+
+<programlisting>
+mysql> <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
++--------------------+-----------------------------------------+
+| Variable_name | Value |
++--------------------+-----------------------------------------+
+| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
++--------------------+-----------------------------------------+
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Choose a name for the collation and list it in the
+ <filename>Index.xml</filename> file. Find the
+ <literal><charset></literal> element for the character
+ set to which the collation is being added, and add a
+ <literal><collation></literal> element that indicates
+ the collation name and ID. For example:
+ </para>
+
+<programlisting>
+<charset name="latin1">
+ ...
+ <!-- associate collation name with its ID -->
+ <collation name="latin1_test_ci" id="56"/>
+ ...
+</charset>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ In the <filename>latin1.xml</filename> configuration file,
+ add a <literal><collation></literal> element that
+ names the collation and that contains a
+ <literal><map></literal> element that defines a
+ character code-to-weight mapping table. Each word within the
+ <literal><map></literal> element must be a number in
+ hexadecimal format.
+ </para>
+
+<programlisting>
+<collation name="latin1_test_ci">
+<map>
+ 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
+ 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
+ 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
+ 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
+ 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
+ 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
+ 60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
+ 50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
+ 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
+ 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
+ A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
+ B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
+ 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
+ 44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF
+ 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49
+ 44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF
+</map>
+</collation>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Restart the server and use this statement to verify that the
+ collation is present:
+ </para>
+
+<programlisting>
+mysql> <userinput>SHOW COLLATION LIKE 'latin1_test_ci';</userinput>
++----------------+---------+----+---------+----------+---------+
+| Collation | Charset | Id | Default | Compiled | Sortlen |
++----------------+---------+----+---------+----------+---------+
+| latin1_test_ci | latin1 | 56 | | | 1 |
++----------------+---------+----+---------+----------+---------+
+</programlisting>
+ </listitem>
+
+ </orderedlist>
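+
+ <para>
+ To go one step beyond checking that the collation is listed, a
+ comparison such as the following sketch can exercise the weight
+ table. In the <literal>latin1_test_ci</literal> map shown in the
+ procedure, the entries for <literal>'a'</literal> (code
+ <literal>0x61</literal>) and <literal>'A'</literal> (code
+ <literal>0x41</literal>) are both <literal>41</literal>, so the
+ two strings would be expected to compare as equal.
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT _latin1'a' COLLATE latin1_test_ci = _latin1'A';</userinput>
+</programlisting>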
+
+ </section>
+
+ <section id="adding-collation-unicode-uca">
+
+ <title>Adding a UCA Collation to a Unicode Character Set</title>
+
+ <para>
+ UCA collations for Unicode character sets can be added to MySQL
+ without recompiling by using a subset of the Locale Data Markup
+ Language (LDML), which is available at
+ <ulink url="http://www.unicode.org/reports/tr35/"/>. In
+ &current-series;, this method of adding collations is supported
+ as of MySQL 5.1.20. With this method, you begin with an existing
+ <quote>base</quote> collation. Then you describe the new
+ collation in terms of how it differs from the base collation,
+ rather than defining the entire collation. The following table
+ lists the base collations for the Unicode character sets.
+ </para>
+
+ <informaltable>
+ <tgroup cols="2">
+ <colspec colwidth="30*"/>
+ <colspec colwidth="60*"/>
+ <tbody>
+ <row>
+ <entry><emphasis role="bold">Character Set</emphasis></entry>
+ <entry><emphasis role="bold">Base Collation</emphasis></entry>
+ </row>
+ <row>
+ <entry><literal>utf8</literal></entry>
+ <entry><literal>utf8_unicode_ci</literal></entry>
+ </row>
+ <row>
+ <entry><literal>ucs2</literal></entry>
+ <entry><literal>ucs2_unicode_ci</literal></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+
+ <para>
+ The following brief summary describes the LDML characteristics
+ required for understanding the procedure for adding a collation
+ given later in this section:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ LDML has reset rules and shift rules.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Characters named in these rules can be written in
+ <literal>\u<replaceable>nnnn</replaceable></literal> format,
+ where <replaceable>nnnn</replaceable> is the hexadecimal
+ Unicode code point value. Basic Latin letters
+ <literal>A-Z</literal> <literal>a-z</literal> can also be
+ written literally.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A reset rule does not specify any ordering in and of itself.
+ Instead, it <quote>resets</quote> the ordering for
+ subsequent shift rules to cause them to be taken in relation
+ to a given character. Either of the following rules resets
+ subsequent shift rules to be taken in relation to the letter
+ <literal>'A'</literal>:
+ </para>
+
+<programlisting>
+<reset>A</reset>
+
+<reset>\u0041</reset>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Shift rules define primary, secondary, and tertiary
+ differences of a character from another character. They are
+ specified using <literal><p></literal>,
+ <literal><s></literal>, and
+ <literal><t></literal> elements. Either of the
+ following rules specifies a primary shift rule for the
+ <literal>'G'</literal> character:
+ </para>
+
+<programlisting>
+<p>G</p>
+
+<p>\u0047</p>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Use the shift rules as follows to distinguish characters:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ Use primary differences to distinguish separate letters
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Use secondary differences to distinguish accent
+ variations
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Use tertiary differences to distinguish lettercase
+ variations
+ </para>
+ </listitem>
+
+ </itemizedlist>
+ </listitem>
+
+ </itemizedlist>
+
+ <remark role="todo">
+ Add: Examples of the use of these rules
+ </remark>
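+
+ <para>
+ As an illustrative sketch (not taken from any shipped
+ collation), the following rules combine a reset rule with shift
+ rules. In LDML terms, they place <literal>'å'</literal>
+ (<literal>\u00E5</literal>) after <literal>'Z'</literal> as a
+ separate letter (primary difference) and mark
+ <literal>'Å'</literal> (<literal>\u00C5</literal>) as differing
+ from <literal>'å'</literal> only in lettercase (tertiary
+ difference). The choice of characters is an assumption made for
+ the example, and how a particular server version treats
+ secondary and tertiary differences in comparisons is not covered
+ here.
+ </para>
+
+<programlisting>
+<rules>
+ <reset>Z</reset>
+ <p>\u00E5</p> <!-- a with ring above: separate letter after Z -->
+ <t>\u00C5</t> <!-- A with ring above: lettercase variant of the above -->
+</rules>
+</programlisting>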
+
+ <para>
+ To add a UCA collation for a Unicode character set without
+ recompiling MySQL, use the following procedure. The example adds
+ a collation named <literal>utf8_phone_ci</literal> to the
+ <literal>utf8</literal> character set. The collation is designed
+ for a scenario involving a Web application for which users post
+ their names and phone numbers. Phone numbers can be given in
+ very different formats:
+ </para>
+
+<programlisting>
++7-12345-67
++7-12-345-67
++7 12 345 67
++7 (12) 345 67
++71234567
+</programlisting>
+
+ <para>
+ The problem raised by dealing with these kinds of values is that
+ the varying allowable formats make searching for a specific
+ phone number very difficult. The solution is to define a new
+ collation that reorders punctuation characters, making them
+ ignorable.
+ </para>
+
+ <orderedlist>
+
+ <listitem>
+ <para>
+ Choose a collation ID, as shown in
+ <xref linkend="adding-collation-choosing-id"/>. The
+ following steps use an ID of 252.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ You will need to modify the <literal>Index.xml</literal>
+ configuration file. This file will be located in the
+ directory named by the <literal>character_sets_dir</literal>
+ system variable. You can check the variable value as
+ follows, although the pathname might be different on your
+ system:
+ </para>
+
+<programlisting>
+mysql> <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
++--------------------+-----------------------------------------+
+| Variable_name | Value |
++--------------------+-----------------------------------------+
+| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
++--------------------+-----------------------------------------+
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Choose a name for the collation and list it in the
+ <filename>Index.xml</filename> file. In addition, you'll
+ need to provide the collation ordering rules. Find the
+ <literal><charset></literal> element for the character
+ set to which the collation is being added, and add a
+ <literal><collation></literal> element that indicates
+ the collation name and ID. Within the
+ <literal><collation></literal> element, provide a
+ <literal><rules></literal> element containing the
+ ordering rules:
+ </para>
+
+<programlisting>
+<charset name="utf8">
+ ...
+ <!-- associate collation name with its ID -->
+ <collation name="utf8_phone_ci" id="252">
+ <rules>
+ <reset>\u0000</reset>
+ <s>\u0020</s> <!-- space -->
+ <s>\u0028</s> <!-- left parenthesis -->
+ <s>\u0029</s> <!-- right parenthesis -->
+ <s>\u002B</s> <!-- plus -->
+ <s>\u002D</s> <!-- hyphen -->
+ </rules>
+ </collation>
+ ...
+</charset>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ If you want a similar collation for other Unicode character
+ sets, add other <literal><collation></literal>
+ elements. For example, to define
+ <literal>ucs2_phone_ci</literal>, add a
+ <literal><collation></literal> element to the
+ <literal><charset name="ucs2"></literal> element.
+ Remember that each collation must have its own unique ID.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Restart the server and use this statement to verify that the
+ collation is present:
+ </para>
+
+<programlisting>
+mysql> <userinput>SHOW COLLATION LIKE 'utf8_phone_ci';</userinput>
++---------------+---------+-----+---------+----------+---------+
+| Collation | Charset | Id | Default | Compiled | Sortlen |
++---------------+---------+-----+---------+----------+---------+
+| utf8_phone_ci | utf8 | 252 | | | 8 |
++---------------+---------+-----+---------+----------+---------+
+</programlisting>
+ </listitem>
+
+ </orderedlist>
+
+ <para>
+ Now we can test the collation to make sure that it has the
+ desired properties.
+ </para>
+
+ <para>
+ Create a table containing some sample phone numbers using the
+ new collation:
+ </para>
+
+<programlisting>
+<!--
+mysql> DROP TABLE IF EXISTS phonebook;
+Query OK, 0 rows affected, 1 warning (0.00 sec)
+-->
+mysql> <userinput>CREATE TABLE phonebook (</userinput>
+ -> <userinput> name VARCHAR(64),</userinput>
+ -> <userinput> phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci</userinput>
+ -> <userinput>);</userinput>
+Query OK, 0 rows affected (0.09 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');</userinput>
+Query OK, 1 row affected (0.00 sec)
+
+mysql> <userinput>INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');</userinput>
+Query OK, 1 row affected (0.00 sec)
+</programlisting>
+
+ <para>
+ Run some queries to see whether the ignored punctuation
+ characters are in fact ignored for sorting and comparisons:
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT * FROM phonebook ORDER BY phone;</userinput>
++-------+--------------------+
+| name | phone |
++-------+--------------------+
+| Sanja | +380 (912) 8008005 |
+| Bar | +7-912-800-80-01 |
+| Svoj | +7 912 800 80 02 |
+| Ramil | (7912) 800 80 03 |
+| Hf | +7 (912) 800 80 04 |
++-------+--------------------+
+5 rows in set (0.00 sec)
+
+mysql> <userinput>SELECT * FROM phonebook WHERE phone='+7(912)800-80-01';</userinput>
++------+------------------+
+| name | phone |
++------+------------------+
+| Bar | +7-912-800-80-01 |
++------+------------------+
+1 row in set (0.00 sec)
+
+mysql> <userinput>SELECT * FROM phonebook WHERE phone='79128008001';</userinput>
++------+------------------+
+| name | phone |
++------+------------------+
+| Bar | +7-912-800-80-01 |
++------+------------------+
+1 row in set (0.00 sec)
+
+mysql> <userinput>SELECT * FROM phonebook WHERE phone='7 9 1 2 8 0 0 8 0 0 1';</userinput>
++------+------------------+
+| name | phone |
++------+------------------+
+| Bar | +7-912-800-80-01 |
++------+------------------+
+1 row in set (0.00 sec)
+</programlisting>
+
+ </section>
+
+ </section>
+
<section id="problems-with-character-sets">
<title>Problems With Character Sets</title>
@@ -6032,21 +6927,14 @@
program with support for the character set.
</para>
- <remark role="todo">
- Add xref to LDML section when it gets added.
- </remark>
-
<para>
For Unicode character sets, you can define collations without
- recompiling by using LDML notation.
+ recompiling by using LDML notation. See
+ <xref linkend="adding-collation-unicode-uca"/>.
</para>
</listitem>
<listitem>
- <remark role="todo">
- dynamic = not compiled in?
- </remark>
-
<para>
The character set is a dynamic character set, but you do not
have a configuration file for it. In this case, you should