Author: pd221994
Date: 2011-05-13 19:41:11 +0200 (Fri, 13 May 2011)
New Revision: 26218
Log:
r47978@dhcp-adc-twvpn-1-vpnpool-10-154-14-71: paul | 2011-05-13 12:26:51 -0500
Adding charset/collation instruction revisions
Modified:
svk:merge
trunk/refman-5.0/internationalization.xml
trunk/refman-5.1/internationalization.xml
trunk/refman-5.5/internationalization.xml
trunk/refman-5.6/internationalization.xml
trunk/refman-6.0/internationalization.xml
Property changes on: trunk
___________________________________________________________________
Modified: svk:merge
===================================================================
Changed blocks: 0, Lines Added: 0, Lines Deleted: 0; 1277 bytes
Modified: trunk/refman-5.0/internationalization.xml
===================================================================
--- trunk/refman-5.0/internationalization.xml 2011-05-13 16:12:35 UTC (rev 26217)
+++ trunk/refman-5.0/internationalization.xml 2011-05-13 17:41:11 UTC (rev 26218)
Changed blocks: 29, Lines Added: 254, Lines Deleted: 198; 26827 bytes
@@ -6129,9 +6129,9 @@
<listitem>
<para>
- If the character set does not need to use special string
- collating routines for sorting and does not need multi-byte
- character support, it is simple.
+ If the character set does not need special string collating
+ routines for sorting and does not need multi-byte character
+ support, it is simple.
</para>
</listitem>
@@ -6165,7 +6165,8 @@
<replaceable>MYSET</replaceable> to the
<filename>sql/share/charsets/Index.xml</filename> file. Use
the existing contents in the file as a guide to adding new
- contents.
+ contents. A partial listing for the <literal>latin1</literal>
+ <literal><charset></literal> element follows:
</para>
<programlisting>
@@ -6179,14 +6180,19 @@
</collation>
<collation name="latin1_danish_ci" id="15" order="Danish"/>
...
+ <collation name="latin1_bin" id="47" order="Binary">
+ <flag>binary</flag>
+ <flag>compiled</flag>
+ </collation>
+ ...
</charset>
</programlisting>
<para>
The <literal><charset></literal> element must list all
the collations for the character set. These must include at
- least a binary collation and a default collation. The default
- collation is usually named using a suffix of
+ least a binary collation and a default (primary) collation.
+ The default collation is often named using a suffix of
<literal>general_ci</literal> (general, case insensitive). It
is possible for the binary collation to be the default
collation, but usually they are different. The default
@@ -6277,7 +6283,7 @@
character set:
</para>
- <orderedlist>
+ <itemizedlist>
<listitem>
<para>
@@ -6319,7 +6325,7 @@
</para>
</listitem>
- </orderedlist>
+ </itemizedlist>
</listitem>
<listitem>
@@ -6465,7 +6471,8 @@
<para>
Each simple character set has a configuration file located in
- the <filename>sql/share/charsets</filename> directory. The file
+ the <filename>sql/share/charsets</filename> directory. For a
+ character set named <replaceable>MYSET</replaceable>, the file
is named
<filename><replaceable>MYSET</replaceable>.xml</filename>. It
uses <literal><map></literal> array elements to list
@@ -6517,17 +6524,18 @@
<literal>ctype_<replaceable>MYSET</replaceable>[]</literal>,
<literal>to_lower_<replaceable>MYSET</replaceable>[]</literal>,
and so forth. Not every complex character set has all of the
- arrays. See the existing <filename>ctype-*.c</filename> files
- for examples. See the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
+ arrays. See also the existing <filename>ctype-*.c</filename>
+ files for examples. See the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
information.
</para>
<para>
- The <literal><ctype></literal> array is indexed by
- character value + 1 and has 257 elements. This is a legacy
- convention for handling <literal>EOF</literal>. The other arrays
- are indexed by character value and have 256 elements.
+ Most of the arrays are indexed by character value and have 256
+ elements. The <literal><ctype></literal> array is indexed
+ by character value + 1 and has 257 elements. This is a legacy
+ convention for handling <literal>EOF</literal>.
</para>
<para>
@@ -6582,14 +6590,14 @@
</programlisting>
<para>
- Each <literal><collation></literal> element contains a
- mapping array that indicates how characters should be ordered
- for comparison and sorting purposes. MySQL sorts characters
- based on the values of this information. In some cases, this is
- the same as the <literal>upper</literal> array, which means that
- sorting is case-insensitive. For more complicated sorting rules
- (for complex character sets), see the discussion of string
- collating in <xref linkend="string-collating"/>.
+ Each <literal><collation></literal> array indicates how
+ characters should be ordered for comparison and sorting
+ purposes. MySQL sorts characters based on the values of this
+ information. In some cases, this is the same as the
+ <literal><upper></literal> array, which means that sorting
+ is case-insensitive. For more complicated sorting rules (for
+ complex character sets), see the discussion of string collating
+ in <xref linkend="string-collating"/>.
</para>
</section>
@@ -6608,12 +6616,13 @@
</indexterm>
<para>
- For simple character sets, sorting rules are specified in the
- <filename><replaceable>MYSET</replaceable>.xml</filename>
+ For a simple character set named
+ <replaceable>MYSET</replaceable>, sorting rules are specified in
+ the <filename><replaceable>MYSET</replaceable>.xml</filename>
configuration file using <literal><map></literal> array
elements within <literal><collation></literal> elements.
If the sorting rules for your language are too complex to be
- handled with simple arrays, you need to define string collating
+ handled with simple arrays, you must define string collating
functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
@@ -6628,9 +6637,10 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
<literal>tis620</literal> character sets. Take a look at the
<literal>MY_COLLATION_HANDLER</literal> structures to see how
- they are used, and see the <filename>CHARSET_INFO.txt</filename>
- file in the <filename>strings</filename> directory for
- additional information.
+ they are used. See also the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
+ information.
</para>
</section>
@@ -6649,9 +6659,9 @@
</indexterm>
<para>
- If you want to add support for a new character set that includes
- multi-byte characters, you need to use multi-byte character
- functions in the
+ If you want to add support for a new character set named
+ <replaceable>MYSET</replaceable> that includes multi-byte
+ characters, you must use multi-byte character functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
</para>
@@ -6665,9 +6675,9 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
<literal>ujis</literal> character sets. Take a look at the
<literal>MY_CHARSET_HANDLER</literal> structures to see how they
- are used, and see the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
- information.
+ are used. See also the <filename>CHARSET_INFO.txt</filename>
+ file in the <filename>strings</filename> directory for
+ additional information.
</para>
</section>
@@ -6727,9 +6737,9 @@
</itemizedlist>
<para>
- The following discussion describes how to add collations of the
- first two types to existing character sets. All existing character
- sets already have a binary collation, so there is no need here to
+ The following sections describe how to add collations of the first
+ two types to existing character sets. All existing character sets
+ already have a binary collation, so there is no need here to
describe how to add one.
</para>
@@ -6775,10 +6785,8 @@
all the information required for a complete character set, just
modify the appropriate files for an existing character set. That
is, based on what is already present for the character set's
- current collations, add new data structures, functions, and
- configuration information for the new collation. For an example,
- see the MySQL Blog article in the following list of additional
- resources.
+ current collations, add data structures, functions, and
+ configuration information for the new collation.
</para>
<bridgehead>
@@ -6855,6 +6863,11 @@
</programlisting>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-collation-simple-8bit"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Complex collations for 8-bit character
sets</emphasis>
</para>
@@ -6908,6 +6921,11 @@
</itemizedlist>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-character-set"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Collations for Unicode multi-byte
character sets</emphasis>
</para>
@@ -6994,6 +7012,12 @@
</para>
<para>
+ For implementation instructions for a non-UCA collation, see
+ <xref linkend="adding-character-set"/>. For a UCA collation, see
+ <xref linkend="adding-collation-unicode-uca"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Miscellaneous collations</emphasis>
</para>
@@ -7019,16 +7043,16 @@
<listitem>
<para>
- The <literal>Id</literal> column of
- <literal role="stmt">SHOW COLLATION</literal> output
+ The <literal>ID</literal> column of the
+ <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
+ table
</para>
</listitem>
<listitem>
<para>
- The <literal>ID</literal> column of the
- <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
- table
+ The <literal>Id</literal> column of
+ <literal role="stmt">SHOW COLLATION</literal> output
</para>
</listitem>
@@ -7193,9 +7217,9 @@
add a <literal><collation></literal> element that
names the collation and that contains a
<literal><map></literal> element that defines a
- character code-to-weight mapping table. Each word within the
- <literal><map></literal> element must be a number in
- hexadecimal format.
+ character code-to-weight mapping table for character codes 0
+ to 255. Each value within the <literal><map></literal>
+ element must be a number in hexadecimal format.
</para>
<programlisting>
@@ -7252,8 +7276,8 @@
<literal><collation></literal> element within a
<literal><charset></literal> character set description.
The procedure described here does not require recompiling MySQL.
- It uses a subset of the Locale Data Markup Language (LDML),
- which is available at
+ It uses a subset of the Locale Data Markup Language (LDML)
+ specification, which is available at
<ulink url="http://www.unicode.org/reports/tr35/"/>. In
&current-series;, this method of adding collations is supported
as of MySQL 5.0.46. With this method, you need not define the
@@ -7264,7 +7288,8 @@
for which UCA collations can be defined.
</para>
- <informaltable>
+ <table>
+ <title>MySQL Character Sets Available for User-Defined UCA Collations</title>
<tgroup cols="2">
<colspec colwidth="30*"/>
<colspec colwidth="60*"/>
@@ -7285,65 +7310,79 @@
</row>
</tbody>
</tgroup>
- </informaltable>
+ </table>
<para>
- The following brief summary describes the LDML characteristics
- required to understand the procedure for adding a collation
- given later in this section:
+ The following sections show how to add a collation that is
+ defined using LDML syntax, and provide a summary of LDML rules
+ supported in MySQL.
</para>
- <itemizedlist>
+ <section id="ldml-rules">
- <listitem>
- <para>
- LDML has reset rules and shift rules.
- </para>
- </listitem>
+ <title>LDML Syntax Supported in MySQL</title>
- <listitem>
- <para>
- Characters named in these rules can be written in
- <literal>\u<replaceable>nnnn</replaceable></literal> format,
- where <replaceable>nnnn</replaceable> is the hexadecimal
- Unicode code point value. Basic Latin letters
- <literal>A-Z</literal> and <literal>a-z</literal> can also
- be written literally (this is a MySQL limitation; the LDML
- specification permits literal non-Latin1 characters in the
- rules). Only characters in the Basic Multilingual Plane can
- be specified. This notation does not apply to characters
- outside the BMP range of <literal>0000</literal> to
- <literal>FFFF</literal>.
- </para>
- </listitem>
+ <para>
+ This section describes the LDML rules that MySQL recognizes.
+ These are a subset of the rules described in the LDML
+ specification available at
+ <ulink url="http://www.unicode.org/reports/tr35/"/>. The rules
+ here are all supported except that character sorting occurs
+ only at the primary level. Rules that specify secondary or
+ higher sort levels are recognized but have no effect.
+ </para>
- <listitem>
- <para>
- A reset rule does not specify any ordering in and of itself.
- Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in relation
- to a given character. Either of the following rules resets
- subsequent shift rules to be taken in relation to the letter
- <literal>'A'</literal>:
- </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Characters named in LDML rules can be written in
+ <literal>\u<replaceable>nnnn</replaceable></literal>
+ format, where <replaceable>nnnn</replaceable> is the
+ hexadecimal Unicode code point value. Basic Latin letters
+ <literal>A-Z</literal> and <literal>a-z</literal> can also
+ be written literally (this is a MySQL limitation; the LDML
+ specification permits literal non-Latin1 characters in the
+ rules). Only characters in the Basic Multilingual Plane
+ can be specified. This notation does not apply to
+ characters outside the BMP range of
+ <literal>0000</literal> to <literal>FFFF</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ LDML has reset rules and shift rules.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A reset rule does not specify any ordering in and of
+ itself. Instead, it <quote>resets</quote> the ordering for
+ subsequent shift rules to cause them to be taken in
+ relation to a given character. Either of the following
+ rules resets subsequent shift rules to be taken in
+ relation to the letter <literal>'A'</literal>:
+ </para>
+
<programlisting>
<reset>A</reset>
<reset>\u0041</reset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. They are
- specified using <literal><p></literal>,
- <literal><s></literal>, and
- <literal><t></literal> elements. Either of the
- following rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
- </para>
+ <listitem>
+ <para>
+ Shift rules define primary, secondary, and tertiary
+ differences of a character from another character. They
+ are specified using <literal><p></literal>,
+ <literal><s></literal>, and
+ <literal><t></literal> elements. Either of the
+ following rules specifies a primary shift rule for the
+ <literal>'G'</literal> character:
+ </para>
<programlisting>
<p>G</p>
@@ -7351,43 +7390,57 @@
<p>\u0047</p>
</programlisting>
- <itemizedlist>
+ <itemizedlist>
- <listitem>
- <para>
- Use primary differences to distinguish separate letters.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use primary differences to distinguish separate
+ letters.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use secondary differences to distinguish accent
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use secondary differences to distinguish accent
+ variations.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use tertiary differences to distinguish lettercase
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use tertiary differences to distinguish lettercase
+ variations.
+ </para>
+ </listitem>
- </itemizedlist>
- </listitem>
+ </itemizedlist>
+ </listitem>
- </itemizedlist>
+ </itemizedlist>
- <para>
- To add a UCA collation for a Unicode character set without
- recompiling MySQL, use the following procedure. The example adds
- a collation named <literal>utf8_phone_ci</literal> to the
- <literal>utf8</literal> character set. The collation is designed
- for a scenario involving a Web application for which users post
- their names and phone numbers. Phone numbers can be given in
- very different formats:
- </para>
+ </section>
+ <section id="ldml-collation-example">
+
+ <title>Defining a UCA Collation Using LDML Syntax</title>
+
+ <para>
+ To add a UCA collation for a Unicode character set without
+ recompiling MySQL, use the following procedure. If you are
+ unfamiliar with the LDML rules used to describe the
+ collation's sort characteristics, see
+ <xref linkend="ldml-rules"/>.
+ </para>
+
+ <para>
+ The example adds a collation named
+ <literal>utf8_phone_ci</literal> to the
+ <literal>utf8</literal> character set. The collation is
+ designed for a scenario involving a Web application for which
+ users post their names and phone numbers. Phone numbers can be
+ given in very different formats:
+ </para>
+
<programlisting>
+7-12345-67
+7-12-345-67
@@ -7396,33 +7449,33 @@
+71234567
</programlisting>
- <para>
- The problem raised by dealing with these kinds of values is that
- the varying permissible formats make searching for a specific
- phone number very difficult. The solution is to define a new
- collation that reorders punctuation characters, making them
- ignorable.
- </para>
+ <para>
+ The problem raised by dealing with these kinds of values is
+ that the varying permissible formats make searching for a
+ specific phone number very difficult. The solution is to
+ define a new collation that reorders punctuation characters,
+ making them ignorable.
+ </para>
- <orderedlist>
+ <orderedlist>
- <listitem>
- <para>
- Choose a collation ID, as shown in
- <xref linkend="adding-collation-choosing-id"/>. The
- following steps use an ID of 252.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Choose a collation ID, as shown in
+ <xref linkend="adding-collation-choosing-id"/>. The
+ following steps use an ID of 252.
+ </para>
+ </listitem>
- <listitem>
- <para>
- To modify the <literal>Index.xml</literal> configuration
- file. This file will be located in the directory named by
- the <literal role="sysvar">character_sets_dir</literal>
- system variable. You can check the variable value as
- follows, although the path name might be different on your
- system:
- </para>
+ <listitem>
+ <para>
+ Modify the <literal>Index.xml</literal> configuration
+ file. This file is located in the directory named by
+ the <literal role="sysvar">character_sets_dir</literal>
+ system variable. You can check the variable value as
+ follows, although the path name might be different on your
+ system:
+ </para>
<programlisting>
mysql> <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
@@ -7432,21 +7485,22 @@
| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
+--------------------+-----------------------------------------+
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Choose a name for the collation and list it in the
- <filename>Index.xml</filename> file. In addition, you'll
- need to provide the collation ordering rules. Find the
- <literal><charset></literal> element for the character
- set to which the collation is being added, and add a
- <literal><collation></literal> element that indicates
- the collation name and ID, to associate the name with the
- ID. Within the <literal><collation></literal> element,
- provide a <literal><rules></literal> element
- containing the ordering rules:
- </para>
+ <listitem>
+ <para>
+ Choose a name for the collation and list it in the
+ <filename>Index.xml</filename> file. In addition, you'll
+ need to provide the collation ordering rules. Find the
+ <literal><charset></literal> element for the
+ character set to which the collation is being added, and
+ add a <literal><collation></literal> element that
+ indicates the collation name and ID, to associate the name
+ with the ID. Within the
+ <literal><collation></literal> element, provide a
+ <literal><rules></literal> element containing the
+ ordering rules:
+ </para>
<programlisting>
<charset name="utf8">
@@ -7464,25 +7518,25 @@
...
</charset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- If you want a similar collation for other Unicode character
- sets, add other <literal><collation></literal>
- elements. For example, to define
- <literal>ucs2_phone_ci</literal>, add a
- <literal><collation></literal> element to the
- <literal><charset name="ucs2"></literal> element.
- Remember that each collation must have its own unique ID.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ If you want a similar collation for other Unicode
+ character sets, add other
+ <literal><collation></literal> elements. For
+ example, to define <literal>ucs2_phone_ci</literal>, add a
+ <literal><collation></literal> element to the
+ <literal><charset name="ucs2"></literal> element.
+ Remember that each collation must have its own unique ID.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Restart the server and use this statement to verify that the
- collation is present:
- </para>
+ <listitem>
+ <para>
+ Restart the server and use this statement to verify that
+ the collation is present:
+ </para>
<programlisting>
mysql> <userinput>SHOW COLLATION LIKE 'utf8_phone_ci';</userinput>
@@ -7492,19 +7546,19 @@
| utf8_phone_ci | utf8 | 252 | | | 8 |
+---------------+---------+-----+---------+----------+---------+
</programlisting>
- </listitem>
+ </listitem>
- </orderedlist>
+ </orderedlist>
- <para>
- Now test the collation to make sure that it has the desired
- properties.
- </para>
+ <para>
+ Now test the collation to make sure that it has the desired
+ properties.
+ </para>
- <para>
- Create a table containing some sample phone numbers using the
- new collation:
- </para>
+ <para>
+ Create a table containing some sample phone numbers using the
+ new collation:
+ </para>
<programlisting>
<!--
@@ -7533,10 +7587,10 @@
Query OK, 1 row affected (0.00 sec)
</programlisting>
- <para>
- Run some queries to see whether the ignored punctuation
- characters are in fact ignored for sorting and comparisons:
- </para>
+ <para>
+ Run some queries to see whether the ignored punctuation
+ characters are in fact ignored for sorting and comparisons:
+ </para>
<programlisting>
mysql> <userinput>SELECT * FROM phonebook ORDER BY phone;</userinput>
@@ -7576,6 +7630,8 @@
1 row in set (0.00 sec)
</programlisting>
+ </section>
+
</section>
</section>
Modified: trunk/refman-5.1/internationalization.xml
===================================================================
--- trunk/refman-5.1/internationalization.xml 2011-05-13 16:12:35 UTC (rev 26217)
+++ trunk/refman-5.1/internationalization.xml 2011-05-13 17:41:11 UTC (rev 26218)
Changed blocks: 29, Lines Added: 254, Lines Deleted: 198; 26827 bytes
@@ -6319,9 +6319,9 @@
<listitem>
<para>
- If the character set does not need to use special string
- collating routines for sorting and does not need multi-byte
- character support, it is simple.
+ If the character set does not need special string collating
+ routines for sorting and does not need multi-byte character
+ support, it is simple.
</para>
</listitem>
@@ -6355,7 +6355,8 @@
<replaceable>MYSET</replaceable> to the
<filename>sql/share/charsets/Index.xml</filename> file. Use
the existing contents in the file as a guide to adding new
- contents.
+ contents. A partial listing for the <literal>latin1</literal>
+ <literal><charset></literal> element follows:
</para>
<programlisting>
@@ -6369,14 +6370,19 @@
</collation>
<collation name="latin1_danish_ci" id="15" order="Danish"/>
...
+ <collation name="latin1_bin" id="47" order="Binary">
+ <flag>binary</flag>
+ <flag>compiled</flag>
+ </collation>
+ ...
</charset>
</programlisting>
<para>
The <literal><charset></literal> element must list all
the collations for the character set. These must include at
- least a binary collation and a default collation. The default
- collation is usually named using a suffix of
+ least a binary collation and a default (primary) collation.
+ The default collation is often named using a suffix of
<literal>general_ci</literal> (general, case insensitive). It
is possible for the binary collation to be the default
collation, but usually they are different. The default
@@ -6467,7 +6473,7 @@
character set:
</para>
- <orderedlist>
+ <itemizedlist>
<listitem>
<para>
@@ -6509,7 +6515,7 @@
</para>
</listitem>
- </orderedlist>
+ </itemizedlist>
</listitem>
<listitem>
@@ -6655,7 +6661,8 @@
<para>
Each simple character set has a configuration file located in
- the <filename>sql/share/charsets</filename> directory. The file
+ the <filename>sql/share/charsets</filename> directory. For a
+ character set named <replaceable>MYSET</replaceable>, the file
is named
<filename><replaceable>MYSET</replaceable>.xml</filename>. It
uses <literal><map></literal> array elements to list
@@ -6707,17 +6714,18 @@
<literal>ctype_<replaceable>MYSET</replaceable>[]</literal>,
<literal>to_lower_<replaceable>MYSET</replaceable>[]</literal>,
and so forth. Not every complex character set has all of the
- arrays. See the existing <filename>ctype-*.c</filename> files
- for examples. See the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
+ arrays. See also the existing <filename>ctype-*.c</filename>
+ files for examples. See the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
information.
</para>
<para>
- The <literal><ctype></literal> array is indexed by
- character value + 1 and has 257 elements. This is a legacy
- convention for handling <literal>EOF</literal>. The other arrays
- are indexed by character value and have 256 elements.
+ Most of the arrays are indexed by character value and have 256
+ elements. The <literal><ctype></literal> array is indexed
+ by character value + 1 and has 257 elements. This is a legacy
+ convention for handling <literal>EOF</literal>.
</para>
<para>
@@ -6772,14 +6780,14 @@
</programlisting>
<para>
- Each <literal><collation></literal> element contains a
- mapping array that indicates how characters should be ordered
- for comparison and sorting purposes. MySQL sorts characters
- based on the values of this information. In some cases, this is
- the same as the <literal>upper</literal> array, which means that
- sorting is case-insensitive. For more complicated sorting rules
- (for complex character sets), see the discussion of string
- collating in <xref linkend="string-collating"/>.
+ Each <literal><collation></literal> array indicates how
+ characters should be ordered for comparison and sorting
+ purposes. MySQL sorts characters based on the values of this
+ information. In some cases, this is the same as the
+ <literal><upper></literal> array, which means that sorting
+ is case-insensitive. For more complicated sorting rules (for
+ complex character sets), see the discussion of string collating
+ in <xref linkend="string-collating"/>.
</para>
</section>
@@ -6798,12 +6806,13 @@
</indexterm>
<para>
- For simple character sets, sorting rules are specified in the
- <filename><replaceable>MYSET</replaceable>.xml</filename>
+ For a simple character set named
+ <replaceable>MYSET</replaceable>, sorting rules are specified in
+ the <filename><replaceable>MYSET</replaceable>.xml</filename>
configuration file using <literal><map></literal> array
elements within <literal><collation></literal> elements.
If the sorting rules for your language are too complex to be
- handled with simple arrays, you need to define string collating
+ handled with simple arrays, you must define string collating
functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
@@ -6818,9 +6827,10 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
<literal>tis620</literal> character sets. Take a look at the
<literal>MY_COLLATION_HANDLER</literal> structures to see how
- they are used, and see the <filename>CHARSET_INFO.txt</filename>
- file in the <filename>strings</filename> directory for
- additional information.
+ they are used. See also the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
+ information.
</para>
</section>
@@ -6839,9 +6849,9 @@
</indexterm>
<para>
- If you want to add support for a new character set that includes
- multi-byte characters, you need to use multi-byte character
- functions in the
+ If you want to add support for a new character set named
+ <replaceable>MYSET</replaceable> that includes multi-byte
+ characters, you must use multi-byte character functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
</para>
@@ -6855,9 +6865,9 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
<literal>ujis</literal> character sets. Take a look at the
<literal>MY_CHARSET_HANDLER</literal> structures to see how they
- are used, and see the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
- information.
+ are used. See also the <filename>CHARSET_INFO.txt</filename>
+ file in the <filename>strings</filename> directory for
+ additional information.
</para>
</section>
@@ -6917,9 +6927,9 @@
</itemizedlist>
<para>
- The following discussion describes how to add collations of the
- first two types to existing character sets. All existing character
- sets already have a binary collation, so there is no need here to
+ The following sections describe how to add collations of the first
+ two types to existing character sets. All existing character sets
+ already have a binary collation, so there is no need here to
describe how to add one.
</para>
@@ -6965,10 +6975,8 @@
all the information required for a complete character set, just
modify the appropriate files for an existing character set. That
is, based on what is already present for the character set's
- current collations, add new data structures, functions, and
- configuration information for the new collation. For an example,
- see the MySQL Blog article in the following list of additional
- resources.
+ current collations, add data structures, functions, and
+ configuration information for the new collation.
</para>
<bridgehead>
@@ -7045,6 +7053,11 @@
</programlisting>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-collation-simple-8bit"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Complex collations for 8-bit character
sets</emphasis>
</para>
@@ -7098,6 +7111,11 @@
</itemizedlist>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-character-set"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Collations for Unicode multi-byte
character sets</emphasis>
</para>
@@ -7184,6 +7202,12 @@
</para>
<para>
+ For implementation instructions for a non-UCA collation, see
+ <xref linkend="adding-character-set"/>. For a UCA collation, see
+ <xref linkend="adding-collation-unicode-uca"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Miscellaneous collations</emphasis>
</para>
@@ -7209,16 +7233,16 @@
<listitem>
<para>
- The <literal>Id</literal> column of
- <literal role="stmt">SHOW COLLATION</literal> output
+ The <literal>ID</literal> column of the
+ <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
+ table
</para>
</listitem>
<listitem>
<para>
- The <literal>ID</literal> column of the
- <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
- table
+ The <literal>Id</literal> column of
+ <literal role="stmt">SHOW COLLATION</literal> output
</para>
</listitem>
@@ -7383,9 +7407,9 @@
add a <literal><collation></literal> element that
names the collation and that contains a
<literal><map></literal> element that defines a
- character code-to-weight mapping table. Each word within the
- <literal><map></literal> element must be a number in
- hexadecimal format.
+ character code-to-weight mapping table for character codes 0
+ to 255. Each value within the <literal><map></literal>
+ element must be a number in hexadecimal format.
</para>
<programlisting>
@@ -7442,8 +7466,8 @@
<literal><collation></literal> element within a
<literal><charset></literal> character set description.
The procedure described here does not require recompiling MySQL.
- It uses a subset of the Locale Data Markup Language (LDML),
- which is available at
+ It uses a subset of the Locale Data Markup Language (LDML)
+ specification, which is available at
<ulink url="http://www.unicode.org/reports/tr35/"/>. In
&current-series;, this method of adding collations is supported
as of MySQL 5.1.20. With this method, you need not define the
@@ -7454,7 +7478,8 @@
for which UCA collations can be defined.
</para>
- <informaltable>
+ <table>
+ <title>MySQL Character Sets Available for User-Defined UCA Collations</title>
<tgroup cols="2">
<colspec colwidth="30*"/>
<colspec colwidth="60*"/>
@@ -7475,65 +7500,79 @@
</row>
</tbody>
</tgroup>
- </informaltable>
+ </table>
<para>
- The following brief summary describes the LDML characteristics
- required to understand the procedure for adding a collation
- given later in this section:
+ The following sections show how to add a collation that is
+ defined using LDML syntax, and provide a summary of LDML rules
+ supported in MySQL.
</para>
- <itemizedlist>
+ <section id="ldml-rules">
- <listitem>
- <para>
- LDML has reset rules and shift rules.
- </para>
- </listitem>
+ <title>LDML Syntax Supported in MySQL</title>
- <listitem>
- <para>
- Characters named in these rules can be written in
- <literal>\u<replaceable>nnnn</replaceable></literal> format,
- where <replaceable>nnnn</replaceable> is the hexadecimal
- Unicode code point value. Basic Latin letters
- <literal>A-Z</literal> and <literal>a-z</literal> can also
- be written literally (this is a MySQL limitation; the LDML
- specification permits literal non-Latin1 characters in the
- rules). Only characters in the Basic Multilingual Plane can
- be specified. This notation does not apply to characters
- outside the BMP range of <literal>0000</literal> to
- <literal>FFFF</literal>.
- </para>
- </listitem>
+ <para>
+ This section describes the LDML rules that MySQL recognizes.
+ These are a subset of the rules described in the LDML
+ specification available at
+ <ulink url="http://www.unicode.org/reports/tr35/"/>. The rules
+ here are all supported except that character sorting occurs
+ only at the primary level. Rules that specify secondary or
+ higher sort levels are recognized but have no effect.
+ </para>
- <listitem>
- <para>
- A reset rule does not specify any ordering in and of itself.
- Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in relation
- to a given character. Either of the following rules resets
- subsequent shift rules to be taken in relation to the letter
- <literal>'A'</literal>:
- </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Characters named in LDML rules can be written in
+ <literal>\u<replaceable>nnnn</replaceable></literal>
+ format, where <replaceable>nnnn</replaceable> is the
+ hexadecimal Unicode code point value. Basic Latin letters
+ <literal>A-Z</literal> and <literal>a-z</literal> can also
+ be written literally (this is a MySQL limitation; the LDML
+ specification permits literal non-Latin1 characters in the
+ rules). Only characters in the Basic Multilingual Plane
+ can be specified. This notation does not apply to
+ characters outside the BMP range of
+ <literal>0000</literal> to <literal>FFFF</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ LDML has reset rules and shift rules.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A reset rule does not specify any ordering in and of
+ itself. Instead, it <quote>resets</quote> the ordering for
+ subsequent shift rules to cause them to be taken in
+ relation to a given character. Either of the following
+ rules resets subsequent shift rules to be taken in
+ relation to the letter <literal>'A'</literal>:
+ </para>
+
<programlisting>
<reset>A</reset>
<reset>\u0041</reset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. They are
- specified using <literal><p></literal>,
- <literal><s></literal>, and
- <literal><t></literal> elements. Either of the
- following rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
- </para>
+ <listitem>
+ <para>
+ Shift rules define primary, secondary, and tertiary
+ differences of a character from another character. They
+ are specified using <literal><p></literal>,
+ <literal><s></literal>, and
+ <literal><t></literal> elements. Either of the
+ following rules specifies a primary shift rule for the
+ <literal>'G'</literal> character:
+ </para>
<programlisting>
<p>G</p>
@@ -7541,43 +7580,57 @@
<p>\u0047</p>
</programlisting>
- <itemizedlist>
+ <itemizedlist>
- <listitem>
- <para>
- Use primary differences to distinguish separate letters.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use primary differences to distinguish separate
+ letters.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use secondary differences to distinguish accent
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use secondary differences to distinguish accent
+ variations.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use tertiary differences to distinguish lettercase
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use tertiary differences to distinguish lettercase
+ variations.
+ </para>
+ </listitem>
- </itemizedlist>
- </listitem>
+ </itemizedlist>
+ </listitem>
- </itemizedlist>
+ </itemizedlist>
- <para>
- To add a UCA collation for a Unicode character set without
- recompiling MySQL, use the following procedure. The example adds
- a collation named <literal>utf8_phone_ci</literal> to the
- <literal>utf8</literal> character set. The collation is designed
- for a scenario involving a Web application for which users post
- their names and phone numbers. Phone numbers can be given in
- very different formats:
- </para>
+ </section>
+ <section id="ldml-collation-example">
+
+ <title>Defining a UCA Collation Using LDML Syntax</title>
+
+ <para>
+ To add a UCA collation for a Unicode character set without
+ recompiling MySQL, use the following procedure. If you are
+ unfamiliar with the LDML rules used to describe the
+ collation's sort characteristics, see
+ <xref linkend="ldml-rules"/>.
+ </para>
+
+ <para>
+ The example adds a collation named
+ <literal>utf8_phone_ci</literal> to the
+ <literal>utf8</literal> character set. The collation is
+ designed for a scenario involving a Web application for which
+ users post their names and phone numbers. Phone numbers can be
+ given in very different formats:
+ </para>
+
<programlisting>
+7-12345-67
+7-12-345-67
@@ -7586,33 +7639,33 @@
+71234567
</programlisting>
- <para>
- The problem raised by dealing with these kinds of values is that
- the varying permissible formats make searching for a specific
- phone number very difficult. The solution is to define a new
- collation that reorders punctuation characters, making them
- ignorable.
- </para>
+ <para>
+ The problem raised by dealing with these kinds of values is
+ that the varying permissible formats make searching for a
+ specific phone number very difficult. The solution is to
+ define a new collation that reorders punctuation characters,
+ making them ignorable.
+ </para>
- <orderedlist>
+ <orderedlist>
- <listitem>
- <para>
- Choose a collation ID, as shown in
- <xref linkend="adding-collation-choosing-id"/>. The
- following steps use an ID of 252.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Choose a collation ID, as shown in
+ <xref linkend="adding-collation-choosing-id"/>. The
+ following steps use an ID of 252.
+ </para>
+ </listitem>
- <listitem>
- <para>
- To modify the <literal>Index.xml</literal> configuration
- file. This file will be located in the directory named by
- the <literal role="sysvar">character_sets_dir</literal>
- system variable. You can check the variable value as
- follows, although the path name might be different on your
- system:
- </para>
+ <listitem>
+ <para>
+ You must modify the <literal>Index.xml</literal> configuration
+ file. This file is located in the directory named by
+ the <literal role="sysvar">character_sets_dir</literal>
+ system variable. You can check the variable value as
+ follows, although the path name might be different on your
+ system:
+ </para>
<programlisting>
mysql> <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
@@ -7622,21 +7675,22 @@
| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
+--------------------+-----------------------------------------+
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Choose a name for the collation and list it in the
- <filename>Index.xml</filename> file. In addition, you'll
- need to provide the collation ordering rules. Find the
- <literal><charset></literal> element for the character
- set to which the collation is being added, and add a
- <literal><collation></literal> element that indicates
- the collation name and ID, to associate the name with the
- ID. Within the <literal><collation></literal> element,
- provide a <literal><rules></literal> element
- containing the ordering rules:
- </para>
+ <listitem>
+ <para>
+ Choose a name for the collation and list it in the
+ <filename>Index.xml</filename> file. In addition, you'll
+ need to provide the collation ordering rules. Find the
+ <literal><charset></literal> element for the
+ character set to which the collation is being added, and
+ add a <literal><collation></literal> element that
+ indicates the collation name and ID, to associate the name
+ with the ID. Within the
+ <literal><collation></literal> element, provide a
+ <literal><rules></literal> element containing the
+ ordering rules:
+ </para>
<programlisting>
<charset name="utf8">
@@ -7654,25 +7708,25 @@
...
</charset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- If you want a similar collation for other Unicode character
- sets, add other <literal><collation></literal>
- elements. For example, to define
- <literal>ucs2_phone_ci</literal>, add a
- <literal><collation></literal> element to the
- <literal><charset name="ucs2"></literal> element.
- Remember that each collation must have its own unique ID.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ If you want a similar collation for other Unicode
+ character sets, add other
+ <literal><collation></literal> elements. For
+ example, to define <literal>ucs2_phone_ci</literal>, add a
+ <literal><collation></literal> element to the
+ <literal><charset name="ucs2"></literal> element.
+ Remember that each collation must have its own unique ID.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Restart the server and use this statement to verify that the
- collation is present:
- </para>
+ <listitem>
+ <para>
+ Restart the server and use this statement to verify that
+ the collation is present:
+ </para>
<programlisting>
mysql> <userinput>SHOW COLLATION LIKE 'utf8_phone_ci';</userinput>
@@ -7682,19 +7736,19 @@
| utf8_phone_ci | utf8 | 252 | | | 8 |
+---------------+---------+-----+---------+----------+---------+
</programlisting>
- </listitem>
+ </listitem>
- </orderedlist>
+ </orderedlist>
- <para>
- Now test the collation to make sure that it has the desired
- properties.
- </para>
+ <para>
+ Now test the collation to make sure that it has the desired
+ properties.
+ </para>
- <para>
- Create a table containing some sample phone numbers using the
- new collation:
- </para>
+ <para>
+ Create a table containing some sample phone numbers using the
+ new collation:
+ </para>
<programlisting>
<!--
@@ -7723,10 +7777,10 @@
Query OK, 1 row affected (0.00 sec)
</programlisting>
- <para>
- Run some queries to see whether the ignored punctuation
- characters are in fact ignored for sorting and comparisons:
- </para>
+ <para>
+ Run some queries to see whether the ignored punctuation
+ characters are in fact ignored for sorting and comparisons:
+ </para>
<programlisting>
mysql> <userinput>SELECT * FROM phonebook ORDER BY phone;</userinput>
@@ -7766,6 +7820,8 @@
1 row in set (0.00 sec)
</programlisting>
+ </section>
+
</section>
</section>
Modified: trunk/refman-5.5/internationalization.xml
===================================================================
--- trunk/refman-5.5/internationalization.xml 2011-05-13 16:12:35 UTC (rev 26217)
+++ trunk/refman-5.5/internationalization.xml 2011-05-13 17:41:11 UTC (rev 26218)
Changed blocks: 29, Lines Added: 267, Lines Deleted: 210; 27861 bytes
@@ -7585,9 +7585,9 @@
<listitem>
<para>
- If the character set does not need to use special string
- collating routines for sorting and does not need multi-byte
- character support, it is simple.
+ If the character set does not need special string collating
+ routines for sorting and does not need multi-byte character
+ support, it is simple.
</para>
</listitem>
@@ -7621,7 +7621,8 @@
<replaceable>MYSET</replaceable> to the
<filename>sql/share/charsets/Index.xml</filename> file. Use
the existing contents in the file as a guide to adding new
- contents.
+ contents. A partial listing for the <literal>latin1</literal>
+ <literal><charset></literal> element follows:
</para>
<programlisting>
@@ -7635,14 +7636,19 @@
</collation>
<collation name="latin1_danish_ci" id="15" order="Danish"/>
...
+ <collation name="latin1_bin" id="47" order="Binary">
+ <flag>binary</flag>
+ <flag>compiled</flag>
+ </collation>
+ ...
</charset>
</programlisting>
<para>
The <literal><charset></literal> element must list all
the collations for the character set. These must include at
- least a binary collation and a default collation. The default
- collation is usually named using a suffix of
+ least a binary collation and a default (primary) collation.
+ The default collation is often named using a suffix of
<literal>general_ci</literal> (general, case insensitive). It
is possible for the binary collation to be the default
collation, but usually they are different. The default
@@ -7735,7 +7741,7 @@
character set:
</para>
- <orderedlist>
+ <itemizedlist>
<listitem>
<para>
@@ -7777,7 +7783,7 @@
</para>
</listitem>
- </orderedlist>
+ </itemizedlist>
</listitem>
<listitem>
@@ -7879,7 +7885,8 @@
<para>
Each simple character set has a configuration file located in
- the <filename>sql/share/charsets</filename> directory. The file
+ the <filename>sql/share/charsets</filename> directory. For a
+ character set named <replaceable>MYSET</replaceable>, the file
is named
<filename><replaceable>MYSET</replaceable>.xml</filename>. It
uses <literal><map></literal> array elements to list
@@ -7931,17 +7938,18 @@
<literal>ctype_<replaceable>MYSET</replaceable>[]</literal>,
<literal>to_lower_<replaceable>MYSET</replaceable>[]</literal>,
and so forth. Not every complex character set has all of the
- arrays. See the existing <filename>ctype-*.c</filename> files
- for examples. See the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
+ arrays. See also the existing <filename>ctype-*.c</filename>
+ files for examples. See the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
information.
</para>
<para>
- The <literal><ctype></literal> array is indexed by
- character value + 1 and has 257 elements. This is a legacy
- convention for handling <literal>EOF</literal>. The other arrays
- are indexed by character value and have 256 elements.
+ Most of the arrays are indexed by character value and have 256
+ elements. The <literal><ctype></literal> array is indexed
+ by character value + 1 and has 257 elements. This is a legacy
+ convention for handling <literal>EOF</literal>.
</para>
<para>
@@ -7996,14 +8004,14 @@
</programlisting>
<para>
- Each <literal><collation></literal> element contains a
- mapping array that indicates how characters should be ordered
- for comparison and sorting purposes. MySQL sorts characters
- based on the values of this information. In some cases, this is
- the same as the <literal>upper</literal> array, which means that
- sorting is case-insensitive. For more complicated sorting rules
- (for complex character sets), see the discussion of string
- collating in <xref linkend="string-collating"/>.
+ Each <literal><collation></literal> array indicates how
+ characters should be ordered for comparison and sorting
+ purposes. MySQL sorts characters based on the values of this
+ information. In some cases, this is the same as the
+ <literal><upper></literal> array, which means that sorting
+ is case-insensitive. For more complicated sorting rules (for
+ complex character sets), see the discussion of string collating
+ in <xref linkend="string-collating"/>.
</para>
</section>
@@ -8022,12 +8030,13 @@
</indexterm>
<para>
- For simple character sets, sorting rules are specified in the
- <filename><replaceable>MYSET</replaceable>.xml</filename>
+ For a simple character set named
+ <replaceable>MYSET</replaceable>, sorting rules are specified in
+ the <filename><replaceable>MYSET</replaceable>.xml</filename>
configuration file using <literal><map></literal> array
elements within <literal><collation></literal> elements.
If the sorting rules for your language are too complex to be
- handled with simple arrays, you need to define string collating
+ handled with simple arrays, you must define string collating
functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
@@ -8042,9 +8051,10 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
<literal>tis620</literal> character sets. Take a look at the
<literal>MY_COLLATION_HANDLER</literal> structures to see how
- they are used, and see the <filename>CHARSET_INFO.txt</filename>
- file in the <filename>strings</filename> directory for
- additional information.
+ they are used. See also the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
+ information.
</para>
</section>
@@ -8063,9 +8073,9 @@
</indexterm>
<para>
- If you want to add support for a new character set that includes
- multi-byte characters, you need to use multi-byte character
- functions in the
+ If you want to add support for a new character set named
+ <replaceable>MYSET</replaceable> that includes multi-byte
+ characters, you must use multi-byte character functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
</para>
@@ -8079,9 +8089,9 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
<literal>ujis</literal> character sets. Take a look at the
<literal>MY_CHARSET_HANDLER</literal> structures to see how they
- are used, and see the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
- information.
+ are used. See also the <filename>CHARSET_INFO.txt</filename>
+ file in the <filename>strings</filename> directory for
+ additional information.
</para>
</section>
@@ -8141,9 +8151,9 @@
</itemizedlist>
<para>
- The following discussion describes how to add collations of the
- first two types to existing character sets. All existing character
- sets already have a binary collation, so there is no need here to
+ The following sections describe how to add collations of the first
+ two types to existing character sets. All existing character sets
+ already have a binary collation, so there is no need here to
describe how to add one.
</para>
@@ -8189,10 +8199,8 @@
all the information required for a complete character set, just
modify the appropriate files for an existing character set. That
is, based on what is already present for the character set's
- current collations, add new data structures, functions, and
- configuration information for the new collation. For an example,
- see the MySQL Blog article in the following list of additional
- resources.
+ current collations, add data structures, functions, and
+ configuration information for the new collation.
</para>
<bridgehead>
@@ -8269,6 +8277,11 @@
</programlisting>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-collation-simple-8bit"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Complex collations for 8-bit character
sets</emphasis>
</para>
@@ -8322,6 +8335,11 @@
</itemizedlist>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-character-set"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Collations for Unicode multi-byte
character sets</emphasis>
</para>
@@ -8408,6 +8426,12 @@
</para>
<para>
+ For implementation instructions, for a non-UCA collation, see
+ <xref linkend="adding-character-set"/>. For a UCA collation, see
+ <xref linkend="adding-collation-unicode-uca"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Miscellaneous collations</emphasis>
</para>
@@ -8434,16 +8458,16 @@
<listitem>
<para>
- The <literal>Id</literal> column of
- <literal role="stmt">SHOW COLLATION</literal> output
+ The <literal>ID</literal> column of the
+ <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
+ table
</para>
</listitem>
<listitem>
<para>
- The <literal>ID</literal> column of the
- <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
- table
+ The <literal>Id</literal> column of
+ <literal role="stmt">SHOW COLLATION</literal> output
</para>
</listitem>
@@ -8597,9 +8621,9 @@
add a <literal><collation></literal> element that
names the collation and that contains a
<literal><map></literal> element that defines a
- character code-to-weight mapping table. Each word within the
- <literal><map></literal> element must be a number in
- hexadecimal format.
+ character code-to-weight mapping table for character codes 0
+ to 255. Each value within the <literal><map></literal>
+ element must be a number in hexadecimal format.
</para>
<programlisting>
@@ -8656,8 +8680,8 @@
<literal><collation></literal> element within a
<literal><charset></literal> character set description.
The procedure described here does not require recompiling MySQL.
- It uses a subset of the Locale Data Markup Language (LDML),
- which is available at
+ It uses a subset of the Locale Data Markup Language (LDML)
+ specification, which is available at
<ulink url="http://www.unicode.org/reports/tr35/"/>. With this
method, you need not define the entire collation. Instead, you
begin with an existing <quote>base</quote> collation and
@@ -8667,7 +8691,8 @@
defined.
</para>
- <informaltable>
+ <table>
+ <title>MySQL Character Sets Available for User-Defined UCA Collations</title>
<tgroup cols="2">
<colspec colwidth="30*"/>
<colspec colwidth="60*"/>
@@ -8696,65 +8721,79 @@
</row>
</tbody>
</tgroup>
- </informaltable>
+ </table>
<para>
- The following brief summary describes the LDML characteristics
- required to understand the procedure for adding a collation
- given later in this section:
+ The following sections show how to add a collation that is
+ defined using LDML syntax, and provide a summary of LDML rules
+ supported in MySQL.
</para>
- <itemizedlist>
+ <section id="ldml-rules">
- <listitem>
- <para>
- LDML has reset, shift, and identity rules.
- </para>
- </listitem>
+ <title>LDML Syntax Supported in MySQL</title>
- <listitem>
- <para>
- Characters named in these rules can be written in
- <literal>\u<replaceable>nnnn</replaceable></literal> format,
- where <replaceable>nnnn</replaceable> is the hexadecimal
- Unicode code point value. Basic Latin letters
- <literal>A-Z</literal> and <literal>a-z</literal> can also
- be written literally (this is a MySQL limitation; the LDML
- specification permits literal non-Latin1 characters in the
- rules). Only characters in the Basic Multilingual Plane can
- be specified. This notation does not apply to characters
- outside the BMP range of <literal>0000</literal> to
- <literal>FFFF</literal>.
- </para>
- </listitem>
+ <para>
+ This section describes the LDML rules that MySQL recognizes.
+ These are a subset of the rules described in the LDML
+ specification available at
+ <ulink url="http://www.unicode.org/reports/tr35/"/>. The rules
+ here are all supported except that character sorting occurs
+ only at the primary level. Rules that specify secondary or
+ higher sort levels are recognized but have no effect.
+ </para>
- <listitem>
- <para>
- A reset rule does not specify any ordering in and of itself.
- Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in relation
- to a given character. Either of the following rules resets
- subsequent shift rules to be taken in relation to the letter
- <literal>'A'</literal>:
- </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Characters named in LDML rules can be written in
+ <literal>\u<replaceable>nnnn</replaceable></literal>
+ format, where <replaceable>nnnn</replaceable> is the
+ hexadecimal Unicode code point value. Basic Latin letters
+ <literal>A-Z</literal> and <literal>a-z</literal> can also
+ be written literally (this is a MySQL limitation; the LDML
+ specification permits literal non-Latin1 characters in the
+ rules). Only characters in the Basic Multilingual Plane
+ can be specified. This notation does not apply to
+ characters outside the BMP range of
+ <literal>0000</literal> to <literal>FFFF</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ LDML has reset, shift, and identity rules.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A reset rule does not specify any ordering in and of
+ itself. Instead, it <quote>resets</quote> the ordering for
+ subsequent shift rules to cause them to be taken in
+ relation to a given character. Either of the following
+ rules resets subsequent shift rules to be taken in
+ relation to the letter <literal>'A'</literal>:
+ </para>
+
<programlisting>
<reset>A</reset>
<reset>\u0041</reset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. They are
- specified using <literal><p></literal>,
- <literal><s></literal>, and
- <literal><t></literal> elements. Either of the
- following rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
- </para>
+ <listitem>
+ <para>
+ Shift rules define primary, secondary, and tertiary
+ differences of a character from another character. They
+ are specified using <literal><p></literal>,
+ <literal><s></literal>, and
+ <literal><t></literal> elements. Either of the
+ following rules specifies a primary shift rule for the
+ <literal>'G'</literal> character:
+ </para>
<programlisting>
<p>G</p>
@@ -8762,62 +8801,77 @@
<p>\u0047</p>
</programlisting>
- <itemizedlist>
+ <itemizedlist>
- <listitem>
- <para>
- Use primary differences to distinguish separate letters.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use primary differences to distinguish separate
+ letters.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use secondary differences to distinguish accent
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use secondary differences to distinguish accent
+ variations.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use tertiary differences to distinguish lettercase
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use tertiary differences to distinguish lettercase
+ variations.
+ </para>
+ </listitem>
- </itemizedlist>
- </listitem>
+ </itemizedlist>
+ </listitem>
- <listitem>
- <para>
- Identity rules indicate that one character sorts identically
- to another. The following rules cause <literal>'b'</literal>
- sort the same as <literal>'a'</literal>:
- </para>
+ <listitem>
+ <para>
+ Identity rules indicate that one character sorts
+ identically to another. The following rules cause
+ <literal>'b'</literal> to sort the same as
+ <literal>'a'</literal>:
+ </para>
<programlisting>
<reset>a</reset>
<i>b</i>
</programlisting>
- <para>
- Identity rules are supported as of MySQL 5.5.3. Prior to
- 5.5.3, use <literal><s> ... </s></literal>
- instead.
- </para>
- </listitem>
+ <para>
+ Identity rules are supported as of MySQL 5.5.3. Prior to
+ 5.5.3, use <literal><s> ... </s></literal>
+ instead.
+ </para>
+ </listitem>
- </itemizedlist>
+ </itemizedlist>
- <para>
- To add a UCA collation for a Unicode character set without
- recompiling MySQL, use the following procedure. The example adds
- a collation named <literal>utf8_phone_ci</literal> to the
- <literal>utf8</literal> character set. The collation is designed
- for a scenario involving a Web application for which users post
- their names and phone numbers. Phone numbers can be given in
- very different formats:
- </para>
+ </section>
+ <section id="ldml-collation-example">
+
+ <title>Defining a UCA Collation Using LDML Syntax</title>
+
+ <para>
+ To add a UCA collation for a Unicode character set without
+ recompiling MySQL, use the following procedure. If you are
+ unfamiliar with the LDML rules used to describe the
+ collation's sort characteristics, see
+ <xref linkend="ldml-rules"/>.
+ </para>
+
+ <para>
+ The example adds a collation named
+ <literal>utf8_phone_ci</literal> to the
+ <literal>utf8</literal> character set. The collation is
+ designed for a scenario involving a Web application for which
+ users post their names and phone numbers. Phone numbers can be
+ given in very different formats:
+ </para>
+
<programlisting>
+7-12345-67
+7-12-345-67
@@ -8826,33 +8880,33 @@
+71234567
</programlisting>
- <para>
- The problem raised by dealing with these kinds of values is that
- the varying permissible formats make searching for a specific
- phone number very difficult. The solution is to define a new
- collation that reorders punctuation characters, making them
- ignorable.
- </para>
+ <para>
+ The problem raised by dealing with these kinds of values is
+ that the varying permissible formats make searching for a
+ specific phone number very difficult. The solution is to
+ define a new collation that reorders punctuation characters,
+ making them ignorable.
+ </para>
- <orderedlist>
+ <orderedlist>
- <listitem>
- <para>
- Choose a collation ID, as shown in
- <xref linkend="adding-collation-choosing-id"/>. The
- following steps use an ID of 1029.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Choose a collation ID, as shown in
+ <xref linkend="adding-collation-choosing-id"/>. The
+ following steps use an ID of 1029.
+ </para>
+ </listitem>
- <listitem>
- <para>
- To modify the <literal>Index.xml</literal> configuration
- file. This file will be located in the directory named by
- the <literal role="sysvar">character_sets_dir</literal>
- system variable. You can check the variable value as
- follows, although the path name might be different on your
- system:
- </para>
+ <listitem>
+ <para>
+ You must modify the <literal>Index.xml</literal> configuration
+ file. This file is located in the directory named by
+ the <literal role="sysvar">character_sets_dir</literal>
+ system variable. You can check the variable value as
+ follows, although the path name might be different on your
+ system:
+ </para>
<programlisting>
mysql> <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
@@ -8862,21 +8916,22 @@
| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
+--------------------+-----------------------------------------+
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Choose a name for the collation and list it in the
- <filename>Index.xml</filename> file. In addition, you'll
- need to provide the collation ordering rules. Find the
- <literal><charset></literal> element for the character
- set to which the collation is being added, and add a
- <literal><collation></literal> element that indicates
- the collation name and ID, to associate the name with the
- ID. Within the <literal><collation></literal> element,
- provide a <literal><rules></literal> element
- containing the ordering rules:
- </para>
+ <listitem>
+ <para>
+ Choose a name for the collation and list it in the
+ <filename>Index.xml</filename> file. In addition, you'll
+ need to provide the collation ordering rules. Find the
+ <literal><charset></literal> element for the
+ character set to which the collation is being added, and
+ add a <literal><collation></literal> element that
+ indicates the collation name and ID, to associate the name
+ with the ID. Within the
+ <literal><collation></literal> element, provide a
+ <literal><rules></literal> element containing the
+ ordering rules:
+ </para>
<programlisting>
<charset name="utf8">
@@ -8894,25 +8949,25 @@
...
</charset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- If you want a similar collation for other Unicode character
- sets, add other <literal><collation></literal>
- elements. For example, to define
- <literal>ucs2_phone_ci</literal>, add a
- <literal><collation></literal> element to the
- <literal><charset name="ucs2"></literal> element.
- Remember that each collation must have its own unique ID.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ If you want a similar collation for other Unicode
+ character sets, add other
+ <literal><collation></literal> elements. For
+ example, to define <literal>ucs2_phone_ci</literal>, add a
+ <literal><collation></literal> element to the
+ <literal><charset name="ucs2"></literal> element.
+ Remember that each collation must have its own unique ID.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Restart the server and use this statement to verify that the
- collation is present:
- </para>
+ <listitem>
+ <para>
+ Restart the server and use this statement to verify that
+ the collation is present:
+ </para>
<programlisting>
mysql> <userinput>SHOW COLLATION LIKE 'utf8_phone_ci';</userinput>
@@ -8922,19 +8977,19 @@
| utf8_phone_ci | utf8 | 1029 | | | 8 |
+---------------+---------+------+---------+----------+---------+
</programlisting>
- </listitem>
+ </listitem>
- </orderedlist>
+ </orderedlist>
- <para>
- Now test the collation to make sure that it has the desired
- properties.
- </para>
+ <para>
+ Now test the collation to make sure that it has the desired
+ properties.
+ </para>
- <para>
- Create a table containing some sample phone numbers using the
- new collation:
- </para>
+ <para>
+ Create a table containing some sample phone numbers using the
+ new collation:
+ </para>
<programlisting>
<!--
@@ -8963,10 +9018,10 @@
Query OK, 1 row affected (0.00 sec)
</programlisting>
- <para>
- Run some queries to see whether the ignored punctuation
- characters are in fact ignored for sorting and comparisons:
- </para>
+ <para>
+ Run some queries to see whether the ignored punctuation
+ characters are in fact ignored for sorting and comparisons:
+ </para>
<programlisting>
mysql> <userinput>SELECT * FROM phonebook ORDER BY phone;</userinput>
@@ -9006,6 +9061,8 @@
1 row in set (0.00 sec)
</programlisting>
+ </section>
+
</section>
</section>
Modified: trunk/refman-5.6/internationalization.xml
===================================================================
--- trunk/refman-5.6/internationalization.xml 2011-05-13 16:12:35 UTC (rev 26217)
+++ trunk/refman-5.6/internationalization.xml 2011-05-13 17:41:11 UTC (rev 26218)
Changed blocks: 29, Lines Added: 278, Lines Deleted: 221; 29525 bytes
@@ -7725,9 +7725,9 @@
<listitem>
<para>
- If the character set does not need to use special string
- collating routines for sorting and does not need multi-byte
- character support, it is simple.
+ If the character set does not need special string collating
+ routines for sorting and does not need multi-byte character
+ support, it is simple.
</para>
</listitem>
@@ -7761,7 +7761,8 @@
<replaceable>MYSET</replaceable> to the
<filename>sql/share/charsets/Index.xml</filename> file. Use
the existing contents in the file as a guide to adding new
- contents.
+ contents. A partial listing for the <literal>latin1</literal>
+ <literal><charset></literal> element follows:
</para>
<programlisting>
@@ -7775,14 +7776,19 @@
</collation>
<collation name="latin1_danish_ci" id="15" order="Danish"/>
...
+ <collation name="latin1_bin" id="47" order="Binary">
+ <flag>binary</flag>
+ <flag>compiled</flag>
+ </collation>
+ ...
</charset>
</programlisting>
<para>
The <literal><charset></literal> element must list all
the collations for the character set. These must include at
- least a binary collation and a default collation. The default
- collation is usually named using a suffix of
+ least a binary collation and a default (primary) collation.
+ The default collation is often named using a suffix of
<literal>general_ci</literal> (general, case insensitive). It
is possible for the binary collation to be the default
collation, but usually they are different. The default
@@ -7874,7 +7880,7 @@
character set:
</para>
- <orderedlist>
+ <itemizedlist>
<listitem>
<para>
@@ -7916,7 +7922,7 @@
</para>
</listitem>
- </orderedlist>
+ </itemizedlist>
</listitem>
<listitem>
@@ -8018,7 +8024,8 @@
<para>
Each simple character set has a configuration file located in
- the <filename>sql/share/charsets</filename> directory. The file
+ the <filename>sql/share/charsets</filename> directory. For a
+ character set named <replaceable>MYSET</replaceable>, the file
is named
<filename><replaceable>MYSET</replaceable>.xml</filename>. It
uses <literal><map></literal> array elements to list
@@ -8070,17 +8077,18 @@
<literal>ctype_<replaceable>MYSET</replaceable>[]</literal>,
<literal>to_lower_<replaceable>MYSET</replaceable>[]</literal>,
and so forth. Not every complex character set has all of the
- arrays. See the existing <filename>ctype-*.c</filename> files
- for examples. See the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
+ arrays. See also the existing <filename>ctype-*.c</filename>
+ files for examples. See the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
information.
</para>
<para>
- The <literal><ctype></literal> array is indexed by
- character value + 1 and has 257 elements. This is a legacy
- convention for handling <literal>EOF</literal>. The other arrays
- are indexed by character value and have 256 elements.
+ Most of the arrays are indexed by character value and have 256
+ elements. The <literal><ctype></literal> array is indexed
+ by character value + 1 and has 257 elements. This is a legacy
+ convention for handling <literal>EOF</literal>.
</para>
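The indexing convention described above can be sketched as follows. This is only an illustration in Python, not the server's C code: the flag value `HYPOTHETICAL_DIGIT_FLAG` and the helper names are assumptions made for the example. It shows why a 257-element table indexed by character value + 1 lets a lookup accept `EOF` (taken here as -1) without a special case.

```python
EOF = -1                        # assumed EOF value, as in C's <stdio.h>
HYPOTHETICAL_DIGIT_FLAG = 0x04  # illustrative bit, not taken from the MySQL sources


def build_ctype():
    """Build a 257-element classification table; slot 0 is reserved for EOF."""
    table = [0] * 257
    for code in range(ord("0"), ord("9") + 1):
        # Character value `code` is stored at index code + 1.
        table[code + 1] |= HYPOTHETICAL_DIGIT_FLAG
    return table


def is_digit(table, value):
    """Look up a character value (or EOF) using the value + 1 convention."""
    return bool(table[value + 1] & HYPOTHETICAL_DIGIT_FLAG)


ctype = build_ctype()
print(len(ctype), is_digit(ctype, ord("5")), is_digit(ctype, EOF))
```

Because `EOF + 1` is index 0, the same lookup expression works for every input without a bounds check.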
<para>
@@ -8135,14 +8143,14 @@
</programlisting>
<para>
- Each <literal><collation></literal> element contains a
- mapping array that indicates how characters should be ordered
- for comparison and sorting purposes. MySQL sorts characters
- based on the values of this information. In some cases, this is
- the same as the <literal>upper</literal> array, which means that
- sorting is case-insensitive. For more complicated sorting rules
- (for complex character sets), see the discussion of string
- collating in <xref linkend="string-collating"/>.
+ Each <literal><collation></literal> array indicates how
+ characters should be ordered for comparison and sorting
+ purposes. MySQL sorts characters based on the values of this
+ information. In some cases, this is the same as the
+ <literal><upper></literal> array, which means that sorting
+ is case-insensitive. For more complicated sorting rules (for
+ complex character sets), see the discussion of string collating
+ in <xref linkend="string-collating"/>.
</para>
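As a rough sketch of the idea in the preceding paragraph, the following Python fragment builds a 256-entry weight table whose values equal an uppercase mapping and then compares byte strings by weight. The function names are invented for the example, and only ASCII letters are folded; a real character set defines its own values.

```python
def build_case_insensitive_weights():
    """Return 256 sort weights equal to an ASCII-only uppercase mapping."""
    weights = list(range(256))
    for code in range(ord("a"), ord("z") + 1):
        weights[code] = code - 32  # 'a'..'z' take the weight of 'A'..'Z'
    return weights


def sort_key(value, weights):
    """Map each byte of `value` to its weight; keys compare element-wise."""
    return [weights[byte] for byte in value]


weights = build_case_insensitive_weights()
names = [b"bob", b"Alice", b"alice", b"Bob"]
print(sorted(names, key=lambda name: sort_key(name, weights)))
```

With such a table, `b"alice"` and `b"Alice"` produce the same key, which is what makes the comparison case-insensitive.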
</section>
@@ -8161,12 +8169,13 @@
</indexterm>
<para>
- For simple character sets, sorting rules are specified in the
- <filename><replaceable>MYSET</replaceable>.xml</filename>
+ For a simple character set named
+ <replaceable>MYSET</replaceable>, sorting rules are specified in
+ the <filename><replaceable>MYSET</replaceable>.xml</filename>
configuration file using <literal><map></literal> array
elements within <literal><collation></literal> elements.
If the sorting rules for your language are too complex to be
- handled with simple arrays, you need to define string collating
+ handled with simple arrays, you must define string collating
functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
@@ -8181,9 +8190,10 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
 <literal>tis620</literal> character sets. Take a look at the
<literal>MY_COLLATION_HANDLER</literal> structures to see how
- they are used, and see the <filename>CHARSET_INFO.txt</filename>
- file in the <filename>strings</filename> directory for
- additional information.
+ they are used. See also the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
+ information.
</para>
</section>
@@ -8202,9 +8212,9 @@
</indexterm>
<para>
- If you want to add support for a new character set that includes
- multi-byte characters, you need to use multi-byte character
- functions in the
+ If you want to add support for a new character set named
+ <replaceable>MYSET</replaceable> that includes multi-byte
+ characters, you must use multi-byte character functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
</para>
@@ -8218,9 +8228,9 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
<literal>ujis</literal> character sets. Take a look at the
<literal>MY_CHARSET_HANDLER</literal> structures to see how they
- are used, and see the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
- information.
+ are used. See also the <filename>CHARSET_INFO.txt</filename>
+ file in the <filename>strings</filename> directory for
+ additional information.
</para>
</section>
@@ -8307,9 +8317,9 @@
</itemizedlist>
<para>
- The following discussion describes how to add collations of the
- first two types to existing character sets. All existing character
- sets already have a binary collation, so there is no need here to
+ The following sections describe how to add collations of the first
+ two types to existing character sets. All existing character sets
+ already have a binary collation, so there is no need here to
describe how to add one.
</para>
@@ -8355,10 +8365,8 @@
all the information required for a complete character set, just
modify the appropriate files for an existing character set. That
is, based on what is already present for the character set's
- current collations, add new data structures, functions, and
- configuration information for the new collation. For an example,
- see the MySQL Blog article in the following list of additional
- resources.
+ current collations, add data structures, functions, and
+ configuration information for the new collation.
</para>
<bridgehead>
@@ -8443,6 +8451,11 @@
</programlisting>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-collation-simple-8bit"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Complex collations for 8-bit character
sets</emphasis>
</para>
@@ -8544,6 +8557,11 @@
</itemizedlist>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-character-set"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Collations for Unicode multi-byte
character sets</emphasis>
</para>
@@ -8684,6 +8702,12 @@
</para>
<para>
+ For implementation instructions for a non-UCA collation, see
+ <xref linkend="adding-character-set"/>. For a UCA collation, see
+ <xref linkend="adding-collation-unicode-uca"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Miscellaneous collations</emphasis>
</para>
@@ -8709,16 +8733,16 @@
<listitem>
<para>
- The <literal>Id</literal> column of
- <literal role="stmt">SHOW COLLATION</literal> output
+ The <literal>ID</literal> column of the
+ <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
+ table
</para>
</listitem>
<listitem>
<para>
- The <literal>ID</literal> column of the
- <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
- table
+ The <literal>Id</literal> column of
+ <literal role="stmt">SHOW COLLATION</literal> output
</para>
</listitem>
@@ -8872,9 +8896,9 @@
add a <literal><collation></literal> element that
names the collation and that contains a
<literal><map></literal> element that defines a
- character code-to-weight mapping table. Each word within the
- <literal><map></literal> element must be a number in
- hexadecimal format.
+ character code-to-weight mapping table for character codes 0
+ to 255. Each value within the <literal><map></literal>
+ element must be a number in hexadecimal format.
</para>
<programlisting>
@@ -8931,8 +8955,8 @@
<literal><collation></literal> element within a
<literal><charset></literal> character set description.
The procedure described here does not require recompiling MySQL.
- It uses a subset of the Locale Data Markup Language (LDML),
- which is available at
+ It uses a subset of the Locale Data Markup Language (LDML)
+ specification, which is available at
<ulink url="http://www.unicode.org/reports/tr35/"/>. With this
method, you need not define the entire collation. Instead, you
begin with an existing <quote>base</quote> collation and
@@ -8940,12 +8964,13 @@
base collation. The following table lists the base collations of
the Unicode character sets for which UCA collations can be
defined. It is not possible to create user-defined UCA
- collations for <literal>utf16le</literal> because there is no
- <literal>utf16le_unicode_ci</literal> collation, which would
- serve as the basis for such collations.
+ collations for <literal>utf16le</literal>; there is no
+ <literal>utf16le_unicode_ci</literal> collation that would serve
+ as the basis for such collations.
</para>
- <informaltable>
+ <table>
+ <title>MySQL Character Sets Available for User-Defined UCA Collations</title>
<tgroup cols="2">
<colspec colwidth="30*"/>
<colspec colwidth="60*"/>
@@ -8974,65 +8999,79 @@
</row>
</tbody>
</tgroup>
- </informaltable>
+ </table>
<para>
- The following brief summary describes the LDML characteristics
- required to understand the procedure for adding a collation
- given later in this section:
+ The following sections show how to add a collation that is
+ defined using LDML syntax, and provide a summary of LDML rules
+ supported in MySQL.
</para>
- <itemizedlist>
+ <section id="ldml-rules">
- <listitem>
- <para>
- LDML has reset, shift, and identity rules.
- </para>
- </listitem>
+ <title>LDML Syntax Supported in MySQL</title>
- <listitem>
- <para>
- Characters named in these rules can be written in
- <literal>\u<replaceable>nnnn</replaceable></literal> format,
- where <replaceable>nnnn</replaceable> is the hexadecimal
- Unicode code point value. Basic Latin letters
- <literal>A-Z</literal> and <literal>a-z</literal> can also
- be written literally (this is a MySQL limitation; the LDML
- specification permits literal non-Latin1 characters in the
- rules). Only characters in the Basic Multilingual Plane can
- be specified. This notation does not apply to characters
- outside the BMP range of <literal>0000</literal> to
- <literal>FFFF</literal>.
- </para>
- </listitem>
+ <para>
+ This section describes the LDML rules that MySQL recognizes.
+ These are a subset of the rules described in the LDML
+ specification available at
+ <ulink url="http://www.unicode.org/reports/tr35/"/>. MySQL
+ supports all of the rules described here, except that
+ character sorting occurs only at the primary level. Rules that
+ specify secondary or higher sort levels are recognized but
+ have no effect.
+ </para>
- <listitem>
- <para>
- A reset rule does not specify any ordering in and of itself.
- Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in relation
- to a given character. Either of the following rules resets
- subsequent shift rules to be taken in relation to the letter
- <literal>'A'</literal>:
- </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Characters named in LDML rules can be written in
+ <literal>\u<replaceable>nnnn</replaceable></literal>
+ format, where <replaceable>nnnn</replaceable> is the
+ hexadecimal Unicode code point value. Basic Latin letters
+ <literal>A-Z</literal> and <literal>a-z</literal> can also
+ be written literally (this is a MySQL limitation; the LDML
+ specification permits literal non-Latin1 characters in the
+ rules). Only characters in the Basic Multilingual Plane
+ can be specified. This notation does not apply to
+ characters outside the BMP range of
+ <literal>0000</literal> to <literal>FFFF</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ LDML has reset rules and shift rules.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A reset rule does not specify any ordering in and of
+ itself. Instead, it <quote>resets</quote> the ordering for
+ subsequent shift rules to cause them to be taken in
+ relation to a given character. Either of the following
+ rules resets subsequent shift rules to be taken in
+ relation to the letter <literal>'A'</literal>:
+ </para>
+
<programlisting>
<reset>A</reset>
<reset>\u0041</reset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. They are
- specified using <literal><p></literal>,
- <literal><s></literal>, and
- <literal><t></literal> elements. Either of the
- following rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
- </para>
+ <listitem>
+ <para>
+ Shift rules define primary, secondary, and tertiary
+ differences of a character from another character. They
+ are specified using <literal><p></literal>,
+ <literal><s></literal>, and
+ <literal><t></literal> elements. Either of the
+ following rules specifies a primary shift rule for the
+ <literal>'G'</literal> character:
+ </para>
<programlisting>
<p>G</p>
@@ -9040,76 +9079,91 @@
<p>\u0047</p>
</programlisting>
- <itemizedlist>
+ <itemizedlist>
- <listitem>
- <para>
- Use primary differences to distinguish separate letters.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use primary differences to distinguish separate
+ letters.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use secondary differences to distinguish accent
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use secondary differences to distinguish accent
+ variations.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use tertiary differences to distinguish lettercase
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use tertiary differences to distinguish lettercase
+ variations.
+ </para>
+ </listitem>
- </itemizedlist>
- </listitem>
+ </itemizedlist>
+ </listitem>
- <listitem>
- <para>
- Identity rules indicate that one character sorts identically
- to another. The following rules cause <literal>'b'</literal>
- sort the same as <literal>'a'</literal>:
- </para>
+ <listitem>
+ <para>
+ Identity rules indicate that one character sorts
+ identically to another. The following rules cause
+ <literal>'b'</literal> to sort the same as
+ <literal>'a'</literal>:
+ </para>
<programlisting>
<reset>a</reset>
<i>b</i>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- In MySQL &current-series;, an extension to LDML rules is
- that the <literal><collation></literal> element
- permits an optional <literal>version</literal> attribute in
- <literal><collation></literal> tags to indicate the
- UCA version on which the collation is based. If the
- <literal>version</literal> attribute is omitted, its default
- value is <literal>4.0.0</literal>. For example, the
- following specification indicates that the collation is
- based on UCA 5.2.0:
- </para>
+ <listitem>
+ <para>
+ In MySQL &current-series;, an extension to LDML rules is
+ that the <literal><collation></literal> element
+ permits an optional <literal>version</literal> attribute
+ to indicate the UCA version on which the collation is
+ based. If the <literal>version</literal> attribute is
+ omitted, its default value is <literal>4.0.0</literal>.
+ For example, the following specification indicates that
+ the collation is based on UCA 5.2.0:
+ </para>
<programlisting>
<collation id="<replaceable>nnn</replaceable>" name="utf8_<replaceable>xxx</replaceable>_ci" version="5.2.0">
...
</collation>
</programlisting>
- </listitem>
+ </listitem>
- </itemizedlist>
+ </itemizedlist>
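The reset, primary shift, and identity rules listed above can be pictured with a small toy model. The Python sketch below is an assumption-laden illustration, not MySQL's implementation: it handles only primary-level ordering, represents rules as hypothetical `(kind, character)` pairs instead of parsing XML, and uses invented function names.

```python
def apply_rules(base_order, rules):
    """Return {character: primary rank} after applying toy LDML-style rules.

    `rules` is a list of (kind, character) pairs where kind is "reset",
    "p" (primary shift) or "i" (identity).
    """
    groups = [[ch] for ch in base_order]  # one group per primary weight
    anchor = None

    def find(ch):
        for index, group in enumerate(groups):
            if ch in group:
                return index
        raise ValueError(f"unknown character: {ch!r}")

    def detach(ch):
        index = find(ch)
        groups[index].remove(ch)
        if not groups[index]:
            del groups[index]

    for kind, ch in rules:
        if kind == "reset":
            find(ch)          # a reset only validates and moves the anchor
            anchor = ch
        elif kind == "p":
            detach(ch)        # re-insert ch immediately after the anchor
            groups.insert(find(anchor) + 1, [ch])
            anchor = ch
        elif kind == "i":
            detach(ch)        # ch shares the anchor's weight
            groups[find(anchor)].append(ch)
        else:
            raise ValueError(f"unsupported rule kind: {kind!r}")

    return {ch: rank for rank, group in enumerate(groups) for ch in group}


print(apply_rules("ABCDEFGH", [("reset", "A"), ("p", "G")]))
print(apply_rules("abc", [("reset", "a"), ("i", "b")]))
```

In this model a reset by itself changes nothing, which mirrors the statement that a reset rule does not specify any ordering in and of itself.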
- <para>
- To add a UCA collation for a Unicode character set without
- recompiling MySQL, use the following procedure. The example adds
- a collation named <literal>utf8_phone_ci</literal> to the
- <literal>utf8</literal> character set. The collation is designed
- for a scenario involving a Web application for which users post
- their names and phone numbers. Phone numbers can be given in
- very different formats:
- </para>
+ </section>
+ <section id="ldml-collation-example">
+
+ <title>Defining a UCA Collation Using LDML Syntax</title>
+
+ <para>
+ To add a UCA collation for a Unicode character set without
+ recompiling MySQL, use the following procedure. If you are
+ unfamiliar with the LDML rules used to describe the
+ collation's sort characteristics, see
+ <xref linkend="ldml-rules"/>.
+ </para>
+
+ <para>
+ The example adds a collation named
+ <literal>utf8_phone_ci</literal> to the
+ <literal>utf8</literal> character set. The collation is
+ designed for a scenario involving a Web application for which
+ users post their names and phone numbers. Phone numbers can be
+ given in very different formats:
+ </para>
+
<programlisting>
+7-12345-67
+7-12-345-67
@@ -9118,33 +9172,33 @@
+71234567
</programlisting>
- <para>
- The problem raised by dealing with these kinds of values is that
- the varying permissible formats make searching for a specific
- phone number very difficult. The solution is to define a new
- collation that reorders punctuation characters, making them
- ignorable.
- </para>
+ <para>
+ The problem raised by dealing with these kinds of values is
+ that the varying permissible formats make searching for a
+ specific phone number very difficult. The solution is to
+ define a new collation that reorders punctuation characters,
+ making them ignorable.
+ </para>
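The effect of treating punctuation as ignorable can be previewed outside the server with a short Python sketch. The set of ignored characters below is an assumption chosen to cover the sample formats shown earlier, and `phone_key` is a hypothetical helper; the collation itself is defined by the rules added to Index.xml, not by this code.

```python
# Assumed set of ignorable characters for this sketch only.
ASSUMED_IGNORABLE = frozenset(" ()+-")


def phone_key(value):
    """Drop the assumed ignorable characters to get a comparison key."""
    return "".join(ch for ch in value if ch not in ASSUMED_IGNORABLE)


samples = ["+7-12345-67", "+7-12-345-67", "+71234567"]
# All three spellings reduce to the same key, so they compare as equal.
print({phone_key(sample) for sample in samples})
```

A collation that ignores these characters gives the same result inside the server, without the application having to normalize the stored values.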
- <orderedlist>
+ <orderedlist>
- <listitem>
- <para>
- Choose a collation ID, as shown in
- <xref linkend="adding-collation-choosing-id"/>. The
- following steps use an ID of 1029.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Choose a collation ID, as shown in
+ <xref linkend="adding-collation-choosing-id"/>. The
+ following steps use an ID of 1029.
+ </para>
+ </listitem>
- <listitem>
- <para>
- To modify the <literal>Index.xml</literal> configuration
- file. This file will be located in the directory named by
- the <literal role="sysvar">character_sets_dir</literal>
- system variable. You can check the variable value as
- follows, although the path name might be different on your
- system:
- </para>
+ <listitem>
+ <para>
+ Modify the <literal>Index.xml</literal> configuration
+ file. This file is located in the directory named by
+ the <literal role="sysvar">character_sets_dir</literal>
+ system variable. You can check the variable value as
+ follows, although the path name might be different on your
+ system:
+ </para>
<programlisting>
mysql> <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
@@ -9154,21 +9208,22 @@
| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
+--------------------+-----------------------------------------+
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Choose a name for the collation and list it in the
- <filename>Index.xml</filename> file. In addition, you'll
- need to provide the collation ordering rules. Find the
- <literal><charset></literal> element for the character
- set to which the collation is being added, and add a
- <literal><collation></literal> element that indicates
- the collation name and ID, to associate the name with the
- ID. Within the <literal><collation></literal> element,
- provide a <literal><rules></literal> element
- containing the ordering rules:
- </para>
+ <listitem>
+ <para>
+ Choose a name for the collation and list it in the
+ <filename>Index.xml</filename> file. In addition, you'll
+ need to provide the collation ordering rules. Find the
+ <literal><charset></literal> element for the
+ character set to which the collation is being added, and
+ add a <literal><collation></literal> element that
+ indicates the collation name and ID, to associate the name
+ with the ID. Within the
+ <literal><collation></literal> element, provide a
+ <literal><rules></literal> element containing the
+ ordering rules:
+ </para>
<programlisting>
<charset name="utf8">
@@ -9186,25 +9241,25 @@
...
</charset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- If you want a similar collation for other Unicode character
- sets, add other <literal><collation></literal>
- elements. For example, to define
- <literal>ucs2_phone_ci</literal>, add a
- <literal><collation></literal> element to the
- <literal><charset name="ucs2"></literal> element.
- Remember that each collation must have its own unique ID.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ If you want a similar collation for other Unicode
+ character sets, add other
+ <literal><collation></literal> elements. For
+ example, to define <literal>ucs2_phone_ci</literal>, add a
+ <literal><collation></literal> element to the
+ <literal><charset name="ucs2"></literal> element.
+ Remember that each collation must have its own unique ID.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Restart the server and use this statement to verify that the
- collation is present:
- </para>
+ <listitem>
+ <para>
+ Restart the server and use this statement to verify that
+ the collation is present:
+ </para>
<programlisting>
mysql> <userinput>SHOW COLLATION LIKE 'utf8_phone_ci';</userinput>
@@ -9214,19 +9269,19 @@
| utf8_phone_ci | utf8 | 1029 | | | 8 |
+---------------+---------+------+---------+----------+---------+
</programlisting>
- </listitem>
+ </listitem>
- </orderedlist>
+ </orderedlist>
- <para>
- Now test the collation to make sure that it has the desired
- properties.
- </para>
+ <para>
+ Now test the collation to make sure that it has the desired
+ properties.
+ </para>
- <para>
- Create a table containing some sample phone numbers using the
- new collation:
- </para>
+ <para>
+ Create a table containing some sample phone numbers using the
+ new collation:
+ </para>
<programlisting>
<!--
@@ -9255,10 +9310,10 @@
Query OK, 1 row affected (0.00 sec)
</programlisting>
- <para>
- Run some queries to see whether the ignored punctuation
- characters are in fact ignored for sorting and comparisons:
- </para>
+ <para>
+ Run some queries to see whether the ignored punctuation
+ characters are in fact ignored for sorting and comparisons:
+ </para>
<programlisting>
mysql> <userinput>SELECT * FROM phonebook ORDER BY phone;</userinput>
@@ -9298,6 +9353,8 @@
1 row in set (0.00 sec)
</programlisting>
+ </section>
+
</section>
</section>
Modified: trunk/refman-6.0/internationalization.xml
===================================================================
--- trunk/refman-6.0/internationalization.xml 2011-05-13 16:12:35 UTC (rev 26217)
+++ trunk/refman-6.0/internationalization.xml 2011-05-13 17:41:11 UTC (rev 26218)
Changed blocks: 29, Lines Added: 267, Lines Deleted: 210; 27891 bytes
@@ -7923,9 +7923,9 @@
<listitem>
<para>
- If the character set does not need to use special string
- collating routines for sorting and does not need multi-byte
- character support, it is simple.
+ If the character set does not need special string collating
+ routines for sorting and does not need multi-byte character
+ support, it is simple.
</para>
</listitem>
@@ -7959,7 +7959,8 @@
<replaceable>MYSET</replaceable> to the
<filename>sql/share/charsets/Index.xml</filename> file. Use
the existing contents in the file as a guide to adding new
- contents.
+ contents. A partial listing for the <literal>latin1</literal>
+ <literal><charset></literal> element follows:
</para>
<programlisting>
@@ -7973,14 +7974,19 @@
</collation>
<collation name="latin1_danish_ci" id="15" order="Danish"/>
...
+ <collation name="latin1_bin" id="47" order="Binary">
+ <flag>binary</flag>
+ <flag>compiled</flag>
+ </collation>
+ ...
</charset>
</programlisting>
<para>
The <literal><charset></literal> element must list all
the collations for the character set. These must include at
- least a binary collation and a default collation. The default
- collation is usually named using a suffix of
+ least a binary collation and a default (primary) collation.
+ The default collation is often named using a suffix of
<literal>general_ci</literal> (general, case insensitive). It
is possible for the binary collation to be the default
collation, but usually they are different. The default
@@ -8073,7 +8079,7 @@
character set:
</para>
- <orderedlist>
+ <itemizedlist>
<listitem>
<para>
@@ -8115,7 +8121,7 @@
</para>
</listitem>
- </orderedlist>
+ </itemizedlist>
</listitem>
<listitem>
@@ -8261,7 +8267,8 @@
<para>
Each simple character set has a configuration file located in
- the <filename>sql/share/charsets</filename> directory. The file
+ the <filename>sql/share/charsets</filename> directory. For a
+ character set named <replaceable>MYSET</replaceable>, the file
is named
<filename><replaceable>MYSET</replaceable>.xml</filename>. It
uses <literal><map></literal> array elements to list
@@ -8313,17 +8320,18 @@
<literal>ctype_<replaceable>MYSET</replaceable>[]</literal>,
<literal>to_lower_<replaceable>MYSET</replaceable>[]</literal>,
and so forth. Not every complex character set has all of the
- arrays. See the existing <filename>ctype-*.c</filename> files
- for examples. See the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
+ arrays. See also the existing <filename>ctype-*.c</filename>
+ files for examples. See the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
information.
</para>
<para>
- The <literal><ctype></literal> array is indexed by
- character value + 1 and has 257 elements. This is a legacy
- convention for handling <literal>EOF</literal>. The other arrays
- are indexed by character value and have 256 elements.
+ Most of the arrays are indexed by character value and have 256
+ elements. The <literal><ctype></literal> array is indexed
+ by character value + 1 and has 257 elements. This is a legacy
+ convention for handling <literal>EOF</literal>.
</para>
<para>
@@ -8378,14 +8386,14 @@
</programlisting>
<para>
- Each <literal><collation></literal> element contains a
- mapping array that indicates how characters should be ordered
- for comparison and sorting purposes. MySQL sorts characters
- based on the values of this information. In some cases, this is
- the same as the <literal>upper</literal> array, which means that
- sorting is case-insensitive. For more complicated sorting rules
- (for complex character sets), see the discussion of string
- collating in <xref linkend="string-collating"/>.
+ Each <literal><collation></literal> array indicates how
+ characters should be ordered for comparison and sorting
+ purposes. MySQL sorts characters based on the values of this
+ information. In some cases, this is the same as the
+ <literal><upper></literal> array, which means that sorting
+ is case-insensitive. For more complicated sorting rules (for
+ complex character sets), see the discussion of string collating
+ in <xref linkend="string-collating"/>.
</para>
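<para>
As a rough illustration of the paragraph above, the following sketch sorts byte strings by a 256-element weight array. The helper names are hypothetical, and the weight array is built to match an ASCII-only uppercase mapping, which is an assumption made for the example rather than the contents of any real <literal><upper></literal> array.
</para>

```python
def build_upper():
    """Build a 256-element array that maps a-z to A-Z (ASCII only)."""
    upper = list(range(256))
    for c in range(ord("a"), ord("z") + 1):
        upper[c] = c - 32
    return upper


def sort_key(data, weights):
    """Translate each byte to its weight; comparing keys compares weights."""
    return [weights[b] for b in data]


# Using the uppercase mapping as the weight array makes sorting
# case-insensitive, because 'a' and 'A' receive the same weight.
weights = build_upper()
words = [b"banana", b"Apple", b"cherry", b"apple"]
ordered = sorted(words, key=lambda w: sort_key(w, weights))
```

<para>
Because <literal>Apple</literal> and <literal>apple</literal> produce identical keys, they compare as equal and keep their input order.
</para>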
</section>
@@ -8404,12 +8412,13 @@
</indexterm>
<para>
- For simple character sets, sorting rules are specified in the
- <filename><replaceable>MYSET</replaceable>.xml</filename>
+ For a simple character set named
+ <replaceable>MYSET</replaceable>, sorting rules are specified in
+ the <filename><replaceable>MYSET</replaceable>.xml</filename>
configuration file using <literal><map></literal> array
elements within <literal><collation></literal> elements.
If the sorting rules for your language are too complex to be
- handled with simple arrays, you need to define string collating
+ handled with simple arrays, you must define string collating
functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
@@ -8424,9 +8433,10 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
<literal>tis620</literal> character sets. Take a look at the
<literal>MY_COLLATION_HANDLER</literal> structures to see how
- they are used, and see the <filename>CHARSET_INFO.txt</filename>
- file in the <filename>strings</filename> directory for
- additional information.
+ they are used. See also the
+ <filename>CHARSET_INFO.txt</filename> file in the
+ <filename>strings</filename> directory for additional
+ information.
</para>
</section>
@@ -8445,9 +8455,9 @@
</indexterm>
<para>
- If you want to add support for a new character set that includes
- multi-byte characters, you need to use multi-byte character
- functions in the
+ If you want to add support for a new character set named
+ <replaceable>MYSET</replaceable> that includes multi-byte
+ characters, you must use multi-byte character functions in the
<filename>ctype-<replaceable>MYSET</replaceable>.c</filename>
source file in the <filename>strings</filename> directory.
</para>
@@ -8461,9 +8471,9 @@
<literal>gbk</literal>, <literal>sjis</literal>, and
<literal>ujis</literal> character sets. Take a look at the
<literal>MY_CHARSET_HANDLER</literal> structures to see how they
- are used, and see the <filename>CHARSET_INFO.txt</filename> file
- in the <filename>strings</filename> directory for additional
- information.
+ are used. See also the <filename>CHARSET_INFO.txt</filename>
+ file in the <filename>strings</filename> directory for
+ additional information.
</para>
</section>
@@ -8550,9 +8560,9 @@
</itemizedlist>
<para>
- The following discussion describes how to add collations of the
- first two types to existing character sets. All existing character
- sets already have a binary collation, so there is no need here to
+ The following sections describe how to add collations of the first
+ two types to existing character sets. All existing character sets
+ already have a binary collation, so there is no need here to
describe how to add one.
</para>
@@ -8598,10 +8608,8 @@
all the information required for a complete character set, just
modify the appropriate files for an existing character set. That
is, based on what is already present for the character set's
- current collations, add new data structures, functions, and
- configuration information for the new collation. For an example,
- see the MySQL Blog article in the following list of additional
- resources.
+ current collations, add data structures, functions, and
+ configuration information for the new collation.
</para>
<bridgehead>
@@ -8686,6 +8694,11 @@
</programlisting>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-collation-simple-8bit"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Complex collations for 8-bit character
sets</emphasis>
</para>
@@ -8787,6 +8800,11 @@
</itemizedlist>
<para>
+ For implementation instructions, see
+ <xref linkend="adding-character-set"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Collations for Unicode multi-byte
character sets</emphasis>
</para>
@@ -8927,6 +8945,12 @@
</para>
<para>
+ For implementation instructions for a non-UCA collation, see
+ <xref linkend="adding-character-set"/>. For a UCA collation, see
+ <xref linkend="adding-collation-unicode-uca"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Miscellaneous collations</emphasis>
</para>
@@ -8954,16 +8978,16 @@
<listitem>
<para>
- The <literal>Id</literal> column of
- <literal role="stmt">SHOW COLLATION</literal> output
+ The <literal>ID</literal> column of the
+ <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
+ table
</para>
</listitem>
<listitem>
<para>
- The <literal>ID</literal> column of the
- <literal role="is">INFORMATION_SCHEMA.COLLATIONS</literal>
- table
+ The <literal>Id</literal> column of
+ <literal role="stmt">SHOW COLLATION</literal> output
</para>
</listitem>
@@ -9117,9 +9141,9 @@
add a <literal><collation></literal> element that
names the collation and that contains a
<literal><map></literal> element that defines a
- character code-to-weight mapping table. Each word within the
- <literal><map></literal> element must be a number in
- hexadecimal format.
+ character code-to-weight mapping table for character codes 0
+ to 255. Each value within the <literal><map></literal>
+ element must be a number in hexadecimal format.
</para>
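<para>
To make the format concrete, the following sketch reads a <literal><map></literal> element and checks that it holds one hexadecimal weight for each character code from 0 to 255. The collation name <literal>myset_example_ci</literal>, the generated sample data, and the <literal>parse_map</literal> helper are all hypothetical; this is not how the server itself loads the file.
</para>

```python
import xml.etree.ElementTree as ET


def parse_map(collation_xml):
    """Return the 256 weights from the <map> child of a <collation> element."""
    root = ET.fromstring(collation_xml)
    map_text = root.findtext("map")
    if map_text is None:
        raise ValueError("no <map> element found")
    # Each whitespace-separated value must be a hexadecimal number.
    weights = [int(word, 16) for word in map_text.split()]
    if len(weights) != 256:
        raise ValueError("expected 256 weights, got %d" % len(weights))
    return weights


# Hypothetical sample: identity weights, except that a-z sort like A-Z.
values = [c - 32 if ord("a") <= c <= ord("z") else c for c in range(256)]
rows = [" ".join("%02X" % v for v in values[i:i + 16])
        for i in range(0, 256, 16)]
sample = ('<collation name="myset_example_ci"><map>\n'
          + "\n".join(rows)
          + "\n</map></collation>")

weights = parse_map(sample)
```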
<programlisting>
@@ -9176,8 +9200,8 @@
<literal><collation></literal> element within a
<literal><charset></literal> character set description.
The procedure described here does not require recompiling MySQL.
- It uses a subset of the Locale Data Markup Language (LDML),
- which is available at
+ It uses a subset of the Locale Data Markup Language (LDML)
+ specification, which is available at
<ulink url="http://www.unicode.org/reports/tr35/"/>. In
&current-series;, this method of adding collations is supported
as of MySQL 6.0.4. With this method, you need not define the
@@ -9188,7 +9212,8 @@
for which UCA collations can be defined.
</para>
- <informaltable>
+ <table>
+ <title>MySQL Character Sets Available for User-Defined UCA Collations</title>
<tgroup cols="2">
<colspec colwidth="30*"/>
<colspec colwidth="60*"/>
@@ -9217,65 +9242,79 @@
</row>
</tbody>
</tgroup>
- </informaltable>
+ </table>
<para>
- The following brief summary describes the LDML characteristics
- required to understand the procedure for adding a collation
- given later in this section:
+ The following sections show how to add a collation that is
+ defined using LDML syntax, and provide a summary of LDML rules
+ supported in MySQL.
</para>
- <itemizedlist>
+ <section id="ldml-rules">
- <listitem>
- <para>
- LDML has reset, shift, and identity rules.
- </para>
- </listitem>
+ <title>LDML Syntax Supported in MySQL</title>
- <listitem>
- <para>
- Characters named in these rules can be written in
- <literal>\u<replaceable>nnnn</replaceable></literal> format,
- where <replaceable>nnnn</replaceable> is the hexadecimal
- Unicode code point value. Basic Latin letters
- <literal>A-Z</literal> and <literal>a-z</literal> can also
- be written literally (this is a MySQL limitation; the LDML
- specification permits literal non-Latin1 characters in the
- rules). Only characters in the Basic Multilingual Plane can
- be specified. This notation does not apply to characters
- outside the BMP range of <literal>0000</literal> to
- <literal>FFFF</literal>.
- </para>
- </listitem>
+ <para>
+ This section describes the LDML rules that MySQL recognizes.
+ These are a subset of the rules described in the LDML
+ specification available at
+ <ulink url="http://www.unicode.org/reports/tr35/"/>. The rules
+ here are all supported except that character sorting occurs
+ only at the primary level. Rules that specify secondary or
+ higher sort levels are recognized but have no effect.
+ </para>
- <listitem>
- <para>
- A reset rule does not specify any ordering in and of itself.
- Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in relation
- to a given character. Either of the following rules resets
- subsequent shift rules to be taken in relation to the letter
- <literal>'A'</literal>:
- </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Characters named in LDML rules can be written in
+ <literal>\u<replaceable>nnnn</replaceable></literal>
+ format, where <replaceable>nnnn</replaceable> is the
+ hexadecimal Unicode code point value. Basic Latin letters
+ <literal>A-Z</literal> and <literal>a-z</literal> can also
+ be written literally (this is a MySQL limitation; the LDML
+ specification permits literal non-Latin1 characters in the
+ rules). Only characters in the Basic Multilingual Plane
+ can be specified. This notation does not apply to
+ characters outside the BMP range of
+ <literal>0000</literal> to <literal>FFFF</literal>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ LDML has reset rules and shift rules.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ A reset rule does not specify any ordering in and of
+ itself. Instead, it <quote>resets</quote> the ordering for
+ subsequent shift rules to cause them to be taken in
+ relation to a given character. Either of the following
+ rules resets subsequent shift rules to be taken in
+ relation to the letter <literal>'A'</literal>:
+ </para>
+
<programlisting>
<reset>A</reset>
<reset>\u0041</reset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. They are
- specified using <literal><p></literal>,
- <literal><s></literal>, and
- <literal><t></literal> elements. Either of the
- following rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
- </para>
+ <listitem>
+ <para>
+ Shift rules define primary, secondary, and tertiary
+ differences of a character from another character. They
+ are specified using <literal><p></literal>,
+ <literal><s></literal>, and
+ <literal><t></literal> elements. Either of the
+ following rules specifies a primary shift rule for the
+ <literal>'G'</literal> character:
+ </para>
<programlisting>
<p>G</p>
@@ -9283,62 +9322,77 @@
<p>\u0047</p>
</programlisting>
- <itemizedlist>
+ <itemizedlist>
- <listitem>
- <para>
- Use primary differences to distinguish separate letters.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use primary differences to distinguish separate
+ letters.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use secondary differences to distinguish accent
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use secondary differences to distinguish accent
+ variations.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Use tertiary differences to distinguish lettercase
- variations.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Use tertiary differences to distinguish lettercase
+ variations.
+ </para>
+ </listitem>
- </itemizedlist>
- </listitem>
+ </itemizedlist>
+ </listitem>
- <listitem>
- <para>
- Identity rules indicate that one character sorts identically
- to another. The following rules cause <literal>'b'</literal>
- sort the same as <literal>'a'</literal>:
- </para>
+ <listitem>
+ <para>
+ Identity rules indicate that one character sorts
+ identically to another. The following rules cause
+ <literal>'b'</literal> to sort the same as
+ <literal>'a'</literal>:
+ </para>
<programlisting>
<reset>a</reset>
<i>b</i>
</programlisting>
- <para>
- Identity rules are supported as of MySQL 6.0.9. Prior to
- 6.0.9, use <literal><s> ... </s></literal>
- instead.
- </para>
- </listitem>
+ <para>
+ Identity rules are supported as of MySQL 6.0.9. Prior to
+ 6.0.9, use <literal><s> ... </s></literal>
+ instead.
+ </para>
+ </listitem>
- </itemizedlist>
+ </itemizedlist>
- <para>
- To add a UCA collation for a Unicode character set without
- recompiling MySQL, use the following procedure. The example adds
- a collation named <literal>utf8_phone_ci</literal> to the
- <literal>utf8</literal> character set. The collation is designed
- for a scenario involving a Web application for which users post
- their names and phone numbers. Phone numbers can be given in
- very different formats:
- </para>
+ </section>
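<para>
The character notation described in the preceding list can be sketched as a small decoder. This is an illustration under stated assumptions, not MySQL's parser: the function name <literal>decode_rule_char</literal> is hypothetical, and the checks simply mirror the two forms named in the text (a four-digit <literal>\u<replaceable>nnnn</replaceable></literal> escape, or a literal Basic Latin letter).
</para>

```python
import re

# Exactly four hexadecimal digits, which limits values to 0000-FFFF (the BMP).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_rule_char(text):
    """Decode one character as written in a rule such as <reset> or <p>."""
    match = _ESCAPE.fullmatch(text)
    if match:
        return chr(int(match.group(1), 16))
    # Only the Basic Latin letters A-Z and a-z may be written literally.
    if len(text) == 1 and text.isascii() and text.isalpha():
        return text
    raise ValueError("unsupported character notation: %r" % text)
```

<para>
Under these assumptions, <literal>A</literal> and <literal>\u0041</literal> decode to the same character, which matches the equivalent pairs of rules shown in the list.
</para>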
+ <section id="ldml-collation-example">
+
+ <title>Defining a UCA Collation Using LDML Syntax</title>
+
+ <para>
+ To add a UCA collation for a Unicode character set without
+ recompiling MySQL, use the following procedure. If you are
+ unfamiliar with the LDML rules used to describe the
+ collation's sort characteristics, see
+ <xref linkend="ldml-rules"/>.
+ </para>
+
+ <para>
+ The example adds a collation named
+ <literal>utf8_phone_ci</literal> to the
+ <literal>utf8</literal> character set. The collation is
+ designed for a scenario involving a Web application for which
+ users post their names and phone numbers. Phone numbers can be
+ given in very different formats:
+ </para>
+
<programlisting>
+7-12345-67
+7-12-345-67
@@ -9347,33 +9401,33 @@
+71234567
</programlisting>
- <para>
- The problem raised by dealing with these kinds of values is that
- the varying permissible formats make searching for a specific
- phone number very difficult. The solution is to define a new
- collation that reorders punctuation characters, making them
- ignorable.
- </para>
+ <para>
+ The problem raised by dealing with these kinds of values is
+ that the varying permissible formats make searching for a
+ specific phone number very difficult. The solution is to
+ define a new collation that reorders punctuation characters,
+ making them ignorable.
+ </para>
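<para>
The effect of making punctuation ignorable can be approximated outside the server with a comparison key that drops those characters. The sketch below is only an analogy for what the collation achieves. The set of ignored characters (space, parentheses, plus sign, hyphen) is an assumption chosen to cover the sample formats shown above, and <literal>phone_key</literal> is a hypothetical helper.
</para>

```python
# Assumed set of ignorable characters; adjust to match the collation's rules.
IGNORABLE = set(" ()+-")


def phone_key(phone):
    """Return the phone number with ignorable characters removed."""
    return "".join(ch for ch in phone if ch not in IGNORABLE)


samples = ["+7-12345-67", "+7-12-345-67", "+71234567"]
keys = {phone_key(p) for p in samples}
```

<para>
All three sample values reduce to the same key, so a comparison based on that key treats them as equal. That is the behavior the new collation is meant to provide for searches and sorting.
</para>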
- <orderedlist>
+ <orderedlist>
- <listitem>
- <para>
- Choose a collation ID, as shown in
- <xref linkend="adding-collation-choosing-id"/>. The
- following steps use an ID of 1029.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ Choose a collation ID, as shown in
+ <xref linkend="adding-collation-choosing-id"/>. The
+ following steps use an ID of 1029.
+ </para>
+ </listitem>
- <listitem>
- <para>
- To modify the <literal>Index.xml</literal> configuration
- file. This file will be located in the directory named by
- the <literal role="sysvar">character_sets_dir</literal>
- system variable. You can check the variable value as
- follows, although the path name might be different on your
- system:
- </para>
+ <listitem>
+ <para>
+ Modify the <literal>Index.xml</literal> configuration
+ file. This file is located in the directory named by
+ the <literal role="sysvar">character_sets_dir</literal>
+ system variable. You can check the variable value as
+ follows, although the path name might be different on your
+ system:
+ </para>
<programlisting>
mysql> <userinput>SHOW VARIABLES LIKE 'character_sets_dir';</userinput>
@@ -9383,21 +9437,22 @@
| character_sets_dir | /user/local/mysql/share/mysql/charsets/ |
+--------------------+-----------------------------------------+
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- Choose a name for the collation and list it in the
- <filename>Index.xml</filename> file. In addition, you'll
- need to provide the collation ordering rules. Find the
- <literal><charset></literal> element for the character
- set to which the collation is being added, and add a
- <literal><collation></literal> element that indicates
- the collation name and ID, to associate the name with the
- ID. Within the <literal><collation></literal> element,
- provide a <literal><rules></literal> element
- containing the ordering rules:
- </para>
+ <listitem>
+ <para>
+ Choose a name for the collation and list it in the
+ <filename>Index.xml</filename> file. In addition, you'll
+ need to provide the collation ordering rules. Find the
+ <literal><charset></literal> element for the
+ character set to which the collation is being added, and
+ add a <literal><collation></literal> element that
+ indicates the collation name and ID, to associate the name
+ with the ID. Within the
+ <literal><collation></literal> element, provide a
+ <literal><rules></literal> element containing the
+ ordering rules:
+ </para>
<programlisting>
<charset name="utf8">
@@ -9415,25 +9470,25 @@
...
</charset>
</programlisting>
- </listitem>
+ </listitem>
- <listitem>
- <para>
- If you want a similar collation for other Unicode character
- sets, add other <literal><collation></literal>
- elements. For example, to define
- <literal>ucs2_phone_ci</literal>, add a
- <literal><collation></literal> element to the
- <literal><charset name="ucs2"></literal> element.
- Remember that each collation must have its own unique ID.
- </para>
- </listitem>
+ <listitem>
+ <para>
+ If you want a similar collation for other Unicode
+ character sets, add other
+ <literal><collation></literal> elements. For
+ example, to define <literal>ucs2_phone_ci</literal>, add a
+ <literal><collation></literal> element to the
+ <literal><charset name="ucs2"></literal> element.
+ Remember that each collation must have its own unique ID.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Restart the server and use this statement to verify that the
- collation is present:
- </para>
+ <listitem>
+ <para>
+ Restart the server and use this statement to verify that
+ the collation is present:
+ </para>
<programlisting>
mysql> <userinput>SHOW COLLATION LIKE 'utf8_phone_ci';</userinput>
@@ -9443,19 +9498,19 @@
| utf8_phone_ci | utf8 | 1029 | | | 8 |
+---------------+---------+------+---------+----------+---------+
</programlisting>
- </listitem>
+ </listitem>
- </orderedlist>
+ </orderedlist>
- <para>
- Now test the collation to make sure that it has the desired
- properties.
- </para>
+ <para>
+ Now test the collation to make sure that it has the desired
+ properties.
+ </para>
- <para>
- Create a table containing some sample phone numbers using the
- new collation:
- </para>
+ <para>
+ Create a table containing some sample phone numbers using the
+ new collation:
+ </para>
<programlisting>
<!--
@@ -9484,10 +9539,10 @@
Query OK, 1 row affected (0.00 sec)
</programlisting>
- <para>
- Run some queries to see whether the ignored punctuation
- characters are in fact ignored for sorting and comparisons:
- </para>
+ <para>
+ Run some queries to see whether the ignored punctuation
+ characters are in fact ignored for sorting and comparisons:
+ </para>
<programlisting>
mysql> <userinput>SELECT * FROM phonebook ORDER BY phone;</userinput>
@@ -9527,6 +9582,8 @@
1 row in set (0.00 sec)
</programlisting>
+ </section>
+
</section>
</section>