Author: pd221994
Date: 2011-05-23 17:26:29 +0200 (Mon, 23 May 2011)
New Revision: 26311
Log:
r48262@dhcp-adc-twvpn-1-vpnpool-10-154-20-51: paul | 2011-05-23 10:20:24 -0500
Document WL#5624: Collation customization improvements
Modified:
svk:merge
trunk/dynamic-docs/changelog/mysqld-2.xml
trunk/refman-5.0/globalization.xml
trunk/refman-5.1/globalization.xml
trunk/refman-5.5/globalization.xml
trunk/refman-5.6/globalization.xml
trunk/refman-6.0/globalization.xml
Property changes on: trunk
___________________________________________________________________
Modified: svk:merge
===================================================================
Changed blocks: 0, Lines Added: 0, Lines Deleted: 0; 1277 bytes
Modified: trunk/dynamic-docs/changelog/mysqld-2.xml
===================================================================
--- trunk/dynamic-docs/changelog/mysqld-2.xml 2011-05-23 14:32:58 UTC (rev 26310)
+++ trunk/dynamic-docs/changelog/mysqld-2.xml 2011-05-23 15:26:29 UTC (rev 26311)
Changed blocks: 1, Lines Added: 73, Lines Deleted: 0; 2331 bytes
@@ -47566,4 +47566,77 @@
</logentry>
+ <logentry entrytype="feature">
+
+ <tags>
+ <manual type="collations"/>
+ <manual type="LDML"/>
+ </tags>
+
+ <bugs>
+ <fixes wlid="5624"/>
+ </bugs>
+
+ <versions>
+ <version ver="5.6.1"/>
+ </versions>
+
+ <message>
+
+ <para>
+ Support for adding Unicode collations that are based on the
+ Unicode Collation Algorithm (UCA) has been improved:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ MySQL now recognizes a larger subset of the LDML syntax that
+ is used to write collation descriptions. In many cases, it
+ is possible to download a collation definition from the
+ Unicode Common Locale Data Repository and paste the relevant
+ part (that is, the part between the
+ <literal><rules></literal> and
+ <literal></rules></literal> tags) into the MySQL
+ <filename>Index.xml</filename> file.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Character representation in LDML rules is more flexible. Any
+ character can be written literally, not just basic Latin
+ letters. For collations based on UCA 5.2.0, hexadecimal
+ notation can be used for any character, not just BMP
+ characters.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ When problems are found while parsing
+ <filename>Index.xml</filename>, better diagnostics are
+ produced.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ For collations that require tailoring rules, there is no
+ longer a fixed size limit on the tailoring information.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ For more information, see <xref linkend="ldml-rules"/> and
+ <xref linkend="collation-diagnostics"/>.
+ </para>
+
+ </message>
+
+ </logentry>
+
</changelog>
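Note (illustration only, not part of r26311): the changelog entry above describes pasting the part of a CLDR collation definition between the <rules> and </rules> tags into Index.xml. The following is a minimal Python sketch of that extraction step. The sample document, the ldml/collations/collation/rules nesting, and the name extract_rules are assumptions made for this example. Real CLDR files may be laid out differently, and the extracted rules still have to fall within the LDML subset that MySQL recognizes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a CLDR collation file.
SAMPLE_LDML = """<ldml>
  <collations>
    <collation type="standard">
      <rules>
        <reset>a</reset>
        <p>b</p>
      </rules>
    </collation>
  </collations>
</ldml>"""


def extract_rules(ldml_text, collation_type="standard"):
    """Return the child elements of <rules> as serialized strings.

    Only the first <collation> whose type attribute matches is used.
    An empty list is returned when nothing matches.
    """
    root = ET.fromstring(ldml_text)
    for collation in root.iter("collation"):
        if collation.get("type") == collation_type:
            rules = collation.find("rules")
            if rules is None:
                return []
            # tostring() also emits trailing whitespace (the element tail),
            # so strip it from each serialized rule.
            return [ET.tostring(rule, encoding="unicode").strip()
                    for rule in rules]
    return []


if __name__ == "__main__":
    for rule in extract_rules(SAMPLE_LDML):
        print(rule)
```

The printed lines are the candidate content for the <rules> element of a collation in Index.xml. Any tag MySQL does not recognize would be reported through the diagnostics that this commit documents.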
Modified: trunk/refman-5.0/globalization.xml
===================================================================
--- trunk/refman-5.0/globalization.xml 2011-05-23 14:32:58 UTC (rev 26310)
+++ trunk/refman-5.0/globalization.xml 2011-05-23 15:26:29 UTC (rev 26311)
Changed blocks: 3, Lines Added: 22, Lines Deleted: 18; 2432 bytes
@@ -7593,12 +7593,13 @@
<listitem>
<para>
- A reset rule does not specify any ordering in and of
- itself. Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in
- relation to a given character. Either of these rules
- resets subsequent shift rules to be taken in relation to
- the letter <literal>'A'</literal>:
+ A <literal><reset></literal> rule does not specify
+ any ordering in and of itself. Instead, it
+ <quote>resets</quote> the ordering for subsequent shift
+ rules to cause them to be taken in relation to a given
+ character. Either of the following rules resets subsequent
+ shift rules to be taken in relation to the letter
+ <literal>'A'</literal>:
</para>
<programlisting>
@@ -7610,21 +7611,13 @@
<listitem>
<para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. To
- specify them, use <literal><p></literal>,
+ The <literal><p></literal>,
<literal><s></literal>, and
- <literal><t></literal> elements. Either of these
- rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
+ <literal><t></literal> shift rules define primary,
+ secondary, and tertiary differences of a character from
+ another character:
</para>
-<programlisting>
-<p>G</p>
-
-<p>\u0047</p>
-</programlisting>
-
<itemizedlist>
<listitem>
@@ -7649,6 +7642,17 @@
</listitem>
</itemizedlist>
+
+ <para>
+ Either of these rules specifies a primary shift rule for
+ the <literal>'G'</literal> character:
+ </para>
+
+<programlisting>
+<p>G</p>
+
+<p>\u0047</p>
+</programlisting>
</listitem>
</itemizedlist>
Modified: trunk/refman-5.1/globalization.xml
===================================================================
--- trunk/refman-5.1/globalization.xml 2011-05-23 14:32:58 UTC (rev 26310)
+++ trunk/refman-5.1/globalization.xml 2011-05-23 15:26:29 UTC (rev 26311)
Changed blocks: 3, Lines Added: 22, Lines Deleted: 18; 2432 bytes
@@ -7783,12 +7783,13 @@
<listitem>
<para>
- A reset rule does not specify any ordering in and of
- itself. Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in
- relation to a given character. Either of these rules
- resets subsequent shift rules to be taken in relation to
- the letter <literal>'A'</literal>:
+ A <literal><reset></literal> rule does not specify
+ any ordering in and of itself. Instead, it
+ <quote>resets</quote> the ordering for subsequent shift
+ rules to cause them to be taken in relation to a given
+ character. Either of the following rules resets subsequent
+ shift rules to be taken in relation to the letter
+ <literal>'A'</literal>:
</para>
<programlisting>
@@ -7800,21 +7801,13 @@
<listitem>
<para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. To
- specify them, use <literal><p></literal>,
+ The <literal><p></literal>,
<literal><s></literal>, and
- <literal><t></literal> elements. Either of these
- rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
+ <literal><t></literal> shift rules define primary,
+ secondary, and tertiary differences of a character from
+ another character:
</para>
-<programlisting>
-<p>G</p>
-
-<p>\u0047</p>
-</programlisting>
-
<itemizedlist>
<listitem>
@@ -7839,6 +7832,17 @@
</listitem>
</itemizedlist>
+
+ <para>
+ Either of these rules specifies a primary shift rule for
+ the <literal>'G'</literal> character:
+ </para>
+
+<programlisting>
+<p>G</p>
+
+<p>\u0047</p>
+</programlisting>
</listitem>
</itemizedlist>
Modified: trunk/refman-5.5/globalization.xml
===================================================================
--- trunk/refman-5.5/globalization.xml 2011-05-23 14:32:58 UTC (rev 26310)
+++ trunk/refman-5.5/globalization.xml 2011-05-23 15:26:29 UTC (rev 26311)
Changed blocks: 4, Lines Added: 28, Lines Deleted: 24; 3355 bytes
@@ -9020,12 +9020,13 @@
<listitem>
<para>
- A reset rule does not specify any ordering in and of
- itself. Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in
- relation to a given character. Either of these rules
- resets subsequent shift rules to be taken in relation to
- the letter <literal>'A'</literal>:
+ A <literal><reset></literal> rule does not specify
+ any ordering in and of itself. Instead, it
+ <quote>resets</quote> the ordering for subsequent shift
+ rules to cause them to be taken in relation to a given
+ character. Either of the following rules resets subsequent
+ shift rules to be taken in relation to the letter
+ <literal>'A'</literal>:
</para>
<programlisting>
@@ -9037,21 +9038,13 @@
<listitem>
<para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. To
- specify them, use <literal><p></literal>,
+ The <literal><p></literal>,
<literal><s></literal>, and
- <literal><t></literal> elements. Either of these
- rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
+ <literal><t></literal> shift rules define primary,
+ secondary, and tertiary differences of a character from
+ another character:
</para>
-<programlisting>
-<p>G</p>
-
-<p>\u0047</p>
-</programlisting>
-
<itemizedlist>
<listitem>
@@ -9076,13 +9069,24 @@
</listitem>
</itemizedlist>
+
+ <para>
+ Either of these rules specifies a primary shift rule for
+ the <literal>'G'</literal> character:
+ </para>
+
+<programlisting>
+<p>G</p>
+
+<p>\u0047</p>
+</programlisting>
</listitem>
<listitem>
<para>
- Identity rules indicate that one character sorts
- identically to another. These rules cause
- <literal>'b'</literal> to sort the same as
+ The <literal><i></literal> shift rule indicates that
+ one character sorts identically to another. The following
+ rules cause <literal>'b'</literal> to sort the same as
<literal>'a'</literal>:
</para>
@@ -9092,9 +9096,9 @@
</programlisting>
<para>
- Identity rules are supported as of MySQL 5.5.3. Prior to
- 5.5.3, use <literal><s> ... </s></literal>
- instead.
+ The <literal><i></literal> shift rule is supported
+ as of MySQL 5.5.3. Prior to 5.5.3, use <literal><s>
+ ... </s></literal> instead.
</para>
</listitem>
Modified: trunk/refman-5.6/globalization.xml
===================================================================
--- trunk/refman-5.6/globalization.xml 2011-05-23 14:32:58 UTC (rev 26310)
+++ trunk/refman-5.6/globalization.xml 2011-05-23 15:26:29 UTC (rev 26311)
Changed blocks: 6, Lines Added: 548, Lines Deleted: 43; 24284 bytes
@@ -9233,41 +9233,54 @@
specification available at
<ulink url="http://www.unicode.org/reports/tr35/"/>, which
should be consulted for further information. MySQL recognizes
- a large enough subset of the rules that in many cases, it is
+ a large enough subset of the syntax that, in many cases, it is
possible to download a collation definition from the Unicode
- Common Locale Data Repository and paste into the
- <filename>Index.xml</filename> file the relevant part (that
- is, the part between the <literal><rules></literal> and
- <literal></rules></literal> tags). The rules described
- here are all supported except that character sorting occurs
- only at the primary level. Rules that specify differences at
- secondary or higher sort levels are recognized (and thus can
- be included in collation definitions) but are treated as
- equality at the primary level.
+ Common Locale Data Repository and paste the relevant part
+ (that is, the part between the
+ <literal><rules></literal> and
+ <literal></rules></literal> tags) into the MySQL
+ <filename>Index.xml</filename> file. The rules described here
+ are all supported except that character sorting occurs only at
+ the primary level. Rules that specify differences at secondary
+ or higher sort levels are recognized (and thus can be included
+ in collation definitions) but are treated as equality at the
+ primary level.
</para>
<para>
+ The MySQL server generates diagnostics when it finds problems
+ while parsing the <filename>Index.xml</filename> file. See
+ <xref linkend="collation-diagnostics"/>.
+ </para>
+
+ <para>
<emphasis role="bold">Character Representation</emphasis>
</para>
<para>
- Characters named in LDML rules can be written in
+ Characters named in LDML rules can be written literally or in
<literal>\u<replaceable>nnnn</replaceable></literal> format,
where <replaceable>nnnn</replaceable> is the hexadecimal
- Unicode code point value. Within hexadecimal values, the
- digits <literal>A</literal> through <literal>F</literal> are
- not case sensitive; <literal>\u00E1</literal> and
- <literal>\u00e1</literal> are equivalent. Basic Latin letters
- <literal>A-Z</literal> and <literal>a-z</literal> can also be
- written literally (this is a MySQL limitation; the LDML
- specification permits literal non-Latin1 characters in the
- rules). Only characters in the Basic Multilingual Plane can be
- specified. This notation does not apply to characters outside
- the BMP range of <literal>0000</literal> to
- <literal>FFFF</literal>.
+ Unicode code point value. For example, <literal>A</literal>
+ and <literal>á</literal> can be written literally or as
+ <literal>\u0041</literal> and <literal>\u00E1</literal>.
+ Within hexadecimal values, the digits <literal>A</literal>
+ through <literal>F</literal> are not case sensitive;
+ <literal>\u00E1</literal> and <literal>\u00e1</literal> are
+ equivalent. For UCA 4.0.0 collations, hexadecimal notation can
+ be used only for characters in the Basic Multilingual Plane,
+ not for characters outside the BMP range of
+ <literal>0000</literal> to <literal>FFFF</literal>. For UCA
+ 5.2.0 collations, hexadecimal notation can be used for any
+ character.
</para>
<para>
+ The <filename>Index.xml</filename> file itself should be
+ written using UTF-8 encoding.
+ </para>
+
+ <para>
<emphasis role="bold">Syntax Rules</emphasis>
</para>
@@ -9283,12 +9296,13 @@
<listitem>
<para>
- A reset rule does not specify any ordering in and of
- itself. Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in
- relation to a given character. Either of these rules
- resets subsequent shift rules to be taken in relation to
- the letter <literal>'A'</literal>:
+ A <literal><reset></literal> rule does not specify
+ any ordering in and of itself. Instead, it
+ <quote>resets</quote> the ordering for subsequent shift
+ rules to cause them to be taken in relation to a given
+ character. Either of the following rules resets subsequent
+ shift rules to be taken in relation to the letter
+ <literal>'A'</literal>:
</para>
<programlisting>
@@ -9300,21 +9314,13 @@
<listitem>
<para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. To
- specify them, use <literal><p></literal>,
+ The <literal><p></literal>,
<literal><s></literal>, and
- <literal><t></literal> elements. Either of these
- rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
+ <literal><t></literal> shift rules define primary,
+ secondary, and tertiary differences of a character from
+ another character:
</para>
-<programlisting>
-<p>G</p>
-
-<p>\u0047</p>
-</programlisting>
-
<itemizedlist>
<listitem>
@@ -9338,14 +9344,79 @@
</para>
</listitem>
+<!--
+If we add this back in, there is also quaternary material
+in the reset-before and abbreviated-syntax descriptions
+that should be added
+ <listitem>
+ <para>
+ Use quaternary differences to distinguish punctuation in
+ <quote>Shifted</quote> mode.
+ </para>
+ </listitem>
+-->
+
</itemizedlist>
+
+ <para>
+ Either of these rules specifies a primary shift rule for
+ the <literal>'G'</literal> character:
+ </para>
+
+<programlisting>
+<p>G</p>
+
+<p>\u0047</p>
+</programlisting>
</listitem>
<listitem>
<para>
- Identity rules indicate that one character sorts
- identically to another. These rules cause
- <literal>'b'</literal> to sort the same as
+ Reset rules permit a <literal>before</literal> attribute.
+ Normally, shift rules after a reset rule indicate
+ characters that sort after the reset character. Shift
+ rules after a reset rule that has the
+ <literal>before</literal> attribute indicate characters
+ that sort before the reset character. The following rules
+ put the character <literal>'b'</literal> immediately
+ before <literal>'a'</literal> at the primary level:
+ </para>
+
+<programlisting>
+<reset before="primary">a</reset>
+<p>b</p>
+</programlisting>
+
+ <para>
+ Permissible <literal>before</literal> attribute values
+ specify the sort level by name or the equivalent numeric
+ value:
+ </para>
+
+<programlisting>
+<reset before="primary">
+<reset before="1">
+
+<reset before="secondary">
+<reset before="2">
+
+<reset before="tertiary">
+<reset before="3">
+</programlisting>
+
+<!--
+<programlisting>
+<reset before="quaternary">
+<reset before="4">
+</programlisting>
+-->
+ </listitem>
+
+ <listitem>
+ <para>
+ The <literal><i></literal> shift rule indicates that
+ one character sorts identically to another. The following
+ rules cause <literal>'b'</literal> to sort the same as
<literal>'a'</literal>:
</para>
@@ -9355,6 +9426,364 @@
</programlisting>
</listitem>
+ <listitem>
+ <para>
+ Abbreviated shift syntax specifies multiple shift rules
+ using a single pair of tags. The following table shows the
+ correspondence between abbreviated syntax rules and the
+ equivalent nonabbreviated rules.
+ </para>
+
+ <table>
+ <title>Abbreviated Shift Syntax</title>
+ <tgroup cols="2">
+ <colspec colwidth="40*"/>
+ <colspec colwidth="60*"/>
+ <thead>
+ <row>
+ <entry>Abbreviated Syntax</entry>
+ <entry>Nonabbreviated Syntax</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal><pc>xyz</pc></literal></entry>
+ <entry><literal><p>x</p><p>y</p><p>z</p></literal></entry>
+ </row>
+ <row>
+ <entry><literal><sc>xyz</sc></literal></entry>
+ <entry><literal><s>x</s><s>y</s><s>z</s></literal></entry>
+ </row>
+ <row>
+ <entry><literal><tc>xyz</tc></literal></entry>
+ <entry><literal><t>x</t><t>y</t><t>z</t></literal></entry>
+ </row>
+<!--
+ <row>
+ <entry><literal><qc>xyz</qc></literal></entry>
+ <entry><literal><q>x</q><q>y</q><q>z</q></literal></entry>
+ </row>
+-->
+ <row>
+ <entry><literal><ic>xyz</ic></literal></entry>
+ <entry><literal><i>x</i><i>y</i><i>z</i></literal></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </listitem>
+
+ <listitem>
+ <para>
+ MySQL supports expansions 2 to 6 characters long. An
+ expansion is a reset rule that establishes an anchor point
+ for a multiple-character sequence. The following rules put
+ <literal>'z'</literal> greater at the primary level than
+ the sequence of three characters <literal>'abc'</literal>:
+ </para>
+
+<programlisting>
+<reset>abc</reset>
+<p>z</p>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ MySQL supports contractions 2 to 6 characters long. A
+ contraction is a shift rule that sorts a
+ multiple-character sequence. The following rules put the
+ sequence of three characters <literal>'xyz'</literal>
+ greater at the primary level than <literal>'a'</literal>:
+ </para>
+
+<programlisting>
+<reset>a</reset>
+<p>xyz</p>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Long expansions and long contractions can be used
+ together. These rules put the sequence of three characters
+ <literal>'xyz'</literal> greater at the primary level than
+ the sequence of three characters <literal>'abc'</literal>:
+ </para>
+
+<programlisting>
+<reset>abc</reset>
+<p>xyz</p>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Normal expansion syntax uses <literal><x></literal>
+ plus <literal><extend></literal> elements to specify
+ an expansion. The following rules put the character
+ <literal>'k'</literal> greater at the secondary level than
+ the sequence <literal>'ch'</literal>. That is,
+ <literal>'k'</literal> behaves as if it expands to a
+ character after <literal>'c'</literal> followed by
+ <literal>'h'</literal>:
+ </para>
+
+<programlisting>
+<reset>c</reset>
+<x><s>k</s><extend>h</extend></x>
+</programlisting>
+
+ <para>
+ This syntax permits long sequences. These rules sort the
+ sequence <literal>'ccs'</literal> greater at the tertiary
+ level than the sequence <literal>'cscs'</literal>:
+ </para>
+
+<programlisting>
+<reset>cs</reset>
+<x><t>ccs</t><extend>cs</extend></x>
+</programlisting>
+
+ <para>
+ The LDML specification describes normal expansion syntax
+ as <quote>tricky.</quote> See that specification for
+ details.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Previous context syntax uses <literal><x></literal>
+ plus <literal><context></literal> elements to
+ specify that the context before a character affects how it
+ sorts. The following rules put <literal>'-'</literal>
+ greater at the secondary level than
+ <literal>'a'</literal>, but only when
+ <literal>'-'</literal> goes after <literal>'b'</literal>:
+ </para>
+
+<programlisting>
+<reset>a</reset>
+<x><context>b</context><s>-</s></x>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Previous context syntax can include the
+ <literal><extend></literal> element. These rules put
+ <literal>'def'</literal> greater at the primary level than
+ <literal>'aghi'</literal>, but only when
+ <literal>'def'</literal> comes after
+ <literal>'abc'</literal>:
+ </para>
+
+<programlisting>
+<reset>a</reset>
+<x><context>abc</context><p>def</p><extend>ghi</extend></x>
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ A reset rule can name a logical reset position rather than
+ a literal character:
+ </para>
+
+<programlisting>
+<first_tertiary_ignorable/>
+<last_tertiary_ignorable/>
+<first_secondary_ignorable/>
+<last_secondary_ignorable/>
+<first_primary_ignorable/>
+<last_primary_ignorable/>
+<first_variable/>
+<last_variable/>
+<first_non_ignorable/>
+<last_non_ignorable/>
+<first_trailing/>
+<last_trailing/>
+</programlisting>
+
+ <para>
+ These rules put <literal>'z'</literal> greater at the
+ primary level than nonignorable characters that have a
+ Default Unicode Collation Element Table (DUCET) entry and
+ that are not CJK:
+ </para>
+
+<programlisting>
+<reset><last_non_ignorable/></reset>
+<p>z</p>
+</programlisting>
+
+ <para>
+ Logical positions have the code points shown in the
+ following table.
+ </para>
+
+ <table>
+ <title>Logical Reset Position Code Points</title>
+ <tgroup cols="3">
+ <colspec colwidth="40*"/>
+ <colspec colwidth="30*"/>
+ <colspec colwidth="30*"/>
+ <thead>
+ <row>
+ <entry>Logical Position</entry>
+ <entry>Unicode 4.0.0 Code Point</entry>
+ <entry>Unicode 5.2.0 Code Point</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal><first_non_ignorable/></literal></entry>
+ <entry>U+02D0</entry>
+ <entry>U+02D0</entry>
+ </row>
+ <row>
+ <entry><literal><last_non_ignorable/></literal></entry>
+ <entry>U+A48C</entry>
+ <entry>U+1342E</entry>
+ </row>
+ <row>
+ <entry><literal><first_primary_ignorable/></literal></entry>
+ <entry>U+0332</entry>
+ <entry>U+0332</entry>
+ </row>
+ <row>
+ <entry><literal><last_primary_ignorable/></literal></entry>
+ <entry>U+20EA</entry>
+ <entry>U+101FD</entry>
+ </row>
+ <row>
+ <entry><literal><first_secondary_ignorable/></literal></entry>
+ <entry>U+0000</entry>
+ <entry>U+0000</entry>
+ </row>
+ <row>
+ <entry><literal><last_secondary_ignorable/></literal></entry>
+ <entry>U+FE73</entry>
+ <entry>U+FE73</entry>
+ </row>
+ <row>
+ <entry><literal><first_tertiary_ignorable/></literal></entry>
+ <entry>U+0000</entry>
+ <entry>U+0000</entry>
+ </row>
+ <row>
+ <entry><literal><last_tertiary_ignorable/></literal></entry>
+ <entry>U+FE73</entry>
+ <entry>U+FE73</entry>
+ </row>
+ <row>
+ <entry><literal><first_trailing/></literal></entry>
+ <entry>U+0000</entry>
+ <entry>U+0000</entry>
+ </row>
+ <row>
+ <entry><literal><last_trailing/></literal></entry>
+ <entry>U+0000</entry>
+ <entry>U+0000</entry>
+ </row>
+ <row>
+ <entry><literal><first_variable/></literal></entry>
+ <entry>U+0009</entry>
+ <entry>U+0009</entry>
+ </row>
+ <row>
+ <entry><literal><last_variable/></literal></entry>
+ <entry>U+2183</entry>
+ <entry>U+1D371</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </listitem>
+
+ <listitem>
+ <para>
+ The <literal><collation></literal> element permits a
+ <literal>shift-after-method</literal> attribute that
+ affects character weight calculation for shift rules. The
+ attribute has these permitted values:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ <literal>simple</literal>: Calculate character weights
+ as for reset rules that do not have a
+ <literal>before</literal> attribute. This is the
+ default.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>expand</literal>: Use expansions for shifts
+ after reset rules.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>
+ Suppose that <literal>'0'</literal> and
+ <literal>'1'</literal> have weights of
+ <literal>0E29</literal> and <literal>0E2A</literal> and we
+ want to put all basic Latin letters between
+ <literal>'0'</literal> and <literal>'1'</literal>:
+ </para>
+
+<programlisting>
+<reset>0</reset>
+<pc>abcdefghijklmnopqrstuvwxyz</pc>
+</programlisting>
+
+ <para>
+ For simple shift mode, weights are calculated as follows:
+ </para>
+
+<programlisting>
+'a' has weight 0E29+1
+'b' has weight 0E29+2
+'c' has weight 0E29+3
+...
+</programlisting>
+
+ <para>
+ However, there are not enough vacant positions to put 26
+ characters between <literal>'0'</literal> and
+ <literal>'1'</literal>. The result is that digits and
+ letters are intermixed.
+ </para>
+
+ <para>
+ To solve this, use
+ <literal>shift-after-method="expand"</literal>. Then
+ weights are calculated like this:
+ </para>
+
+<programlisting>
+'a' has weight [0E29][233D+1]
+'b' has weight [0E29][233D+2]
+'c' has weight [0E29][233D+3]
+...
+</programlisting>
+
+ <para>
+ <literal>233D</literal> is the UCA 4.0.0 weight for
+ character <literal>0xA48C</literal>, which is the last
+ nonignorable character (in effect, the greatest character
+ in the collation, excluding CJK). UCA 5.2.0 is similar but
+ uses <literal>3ACA</literal>, for character
+ <literal>0x1342E</literal>.
+ </para>
+ </listitem>
+
</itemizedlist>
<para>
@@ -9382,6 +9811,82 @@
</section>
+ <section id="collation-diagnostics">
+
+ <title>Diagnostics During <filename>Index.xml</filename> Parsing</title>
+
+ <para>
+ The MySQL server generates diagnostics when it finds problems
+ while parsing the <filename>Index.xml</filename> file:
+ </para>
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ Unknown tags are written to the error log. For example,
+ the following message results if a collation definition
+ contains a <literal><aaa></literal> tag:
+ </para>
+
+<programlisting>
+[Warning] Buffered warning: Unknown LDML tag:
+'charsets/charset/collation/rules/aaa'
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ Problems with collations generate warnings that clients
+ can display with <literal role="stmt">SHOW
+ WARNINGS</literal>. Suppose that a reset rule contains an
+ expansion longer than the maximum supported length of 6
+ characters:
+ </para>
+
+<programlisting>
+<reset>abcdefghi</reset>
+<i>x</i>
+</programlisting>
+
+ <para>
+ An attempt to use the collation produces an error and a warning:
+ </para>
+
+<programlisting>
+mysql> <userinput>SELECT _utf8'test' COLLATE utf8_test_ci;</userinput>
+ERROR 1273 (HY000): Unknown collation: 'utf8_test_ci'
+mysql> <userinput>SHOW WARNINGS;</userinput>
++---------+------+---------------------------------------+
+| Level | Code | Message |
++---------+------+---------------------------------------+
+| Error | 1273 | Unknown collation: 'utf8_test_ci' |
+| Warning | 1273 | Expansion is too long at 'hi</reset>' |
++---------+------+---------------------------------------+
+</programlisting>
+ </listitem>
+
+ <listitem>
+ <para>
+ If collation initialization is not possible, the server
+ reports an <quote>Unknown collation</quote> error, and
+ also generates warnings explaining the problems, such as
+ in the previous example. In other cases, when a collation
+ description is generally correct but contains some unknown
+ tags, the collation is initialized and is available for
+ use. The unknown parts are ignored, but a warning is
+ generated in the error log.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para></para>
+ </listitem>
+
+ </itemizedlist>
+
+ </section>
+
</section>
</section>
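Note (illustration only, not part of r26311): the refman-5.6 text above contrasts the simple and expand values of shift-after-method using <reset>0</reset> followed by <pc>abcdefghijklmnopqrstuvwxyz</pc>. The Python sketch below reproduces only the arithmetic given in that text. Representing a weight as a tuple compared element by element is an assumption of this model, as are all function names. It is not MySQL's implementation.

```python
def expand_abbreviated(tag, chars):
    """Spell out abbreviated shift syntax: <pc>xyz</pc> corresponds to
    expand_abbreviated("p", "xyz")."""
    return "".join("<%s>%s</%s>" % (tag, ch, tag) for ch in chars)


def simple_weights(reset_weight, chars):
    """shift-after-method="simple": the n-th shifted character gets
    reset_weight + n, as a one-element weight sequence."""
    return {ch: (reset_weight + n,) for n, ch in enumerate(chars, start=1)}


def expand_weights(reset_weight, last_non_ignorable_weight, chars):
    """shift-after-method="expand": the n-th shifted character gets the
    two-element sequence [reset_weight][last_non_ignorable_weight + n]."""
    return {ch: (reset_weight, last_non_ignorable_weight + n)
            for n, ch in enumerate(chars, start=1)}


# Values taken from the documentation text (UCA 4.0.0).
ZERO = (0x0E29,)        # weight of '0'
ONE = (0x0E2A,)         # weight of '1'
LAST_NON_IGNORABLE = 0x233D
LETTERS = "abcdefghijklmnopqrstuvwxyz"

simple = simple_weights(0x0E29, LETTERS)
expand = expand_weights(0x0E29, LAST_NON_IGNORABLE, LETTERS)

if __name__ == "__main__":
    # With "simple", 'a' already lands on the weight of '1'.
    print("simple: 'a' collides with '1':", simple["a"] == ONE)
    # With "expand", every letter sorts strictly between '0' and '1'.
    print("expand: all letters between '0' and '1':",
          all(ZERO < expand[ch] < ONE for ch in LETTERS))
```

In this model the simple method runs out of room immediately, because 0E29+1 is the weight already held by '1'. The expand method keeps 0E29 as the first element, so all 26 letters stay between the two digits.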
Modified: trunk/refman-6.0/globalization.xml
===================================================================
--- trunk/refman-6.0/globalization.xml 2011-05-23 14:32:58 UTC (rev 26310)
+++ trunk/refman-6.0/globalization.xml 2011-05-23 15:26:29 UTC (rev 26311)
Changed blocks: 4, Lines Added: 28, Lines Deleted: 24; 3355 bytes
@@ -9525,12 +9525,13 @@
<listitem>
<para>
- A reset rule does not specify any ordering in and of
- itself. Instead, it <quote>resets</quote> the ordering for
- subsequent shift rules to cause them to be taken in
- relation to a given character. Either of these rules
- resets subsequent shift rules to be taken in relation to
- the letter <literal>'A'</literal>:
+ A <literal><reset></literal> rule does not specify
+ any ordering in and of itself. Instead, it
+ <quote>resets</quote> the ordering for subsequent shift
+ rules to cause them to be taken in relation to a given
+ character. Either of the following rules resets subsequent
+ shift rules to be taken in relation to the letter
+ <literal>'A'</literal>:
</para>
<programlisting>
@@ -9542,21 +9543,13 @@
<listitem>
<para>
- Shift rules define primary, secondary, and tertiary
- differences of a character from another character. To
- specify them, use <literal><p></literal>,
+ The <literal><p></literal>,
<literal><s></literal>, and
- <literal><t></literal> elements. Either of these
- rules specifies a primary shift rule for the
- <literal>'G'</literal> character:
+ <literal><t></literal> shift rules define primary,
+ secondary, and tertiary differences of a character from
+ another character:
</para>
-<programlisting>
-<p>G</p>
-
-<p>\u0047</p>
-</programlisting>
-
<itemizedlist>
<listitem>
@@ -9581,13 +9574,24 @@
</listitem>
</itemizedlist>
+
+ <para>
+ Either of these rules specifies a primary shift rule for
+ the <literal>'G'</literal> character:
+ </para>
+
+<programlisting>
+<p>G</p>
+
+<p>\u0047</p>
+</programlisting>
</listitem>
<listitem>
<para>
- Identity rules indicate that one character sorts
- identically to another. These rules cause
- <literal>'b'</literal> to sort the same as
+ The <literal><i></literal> shift rule indicates that
+ one character sorts identically to another. The following
+ rules cause <literal>'b'</literal> to sort the same as
<literal>'a'</literal>:
</para>
@@ -9597,9 +9601,9 @@
</programlisting>
<para>
- Identity rules are supported as of MySQL 6.0.9. Prior to
- 6.0.9, use <literal><s> ... </s></literal>
- instead.
+ The <literal><i></literal> shift rule is supported
+ as of MySQL 6.0.9. Prior to 6.0.9, use <literal><s>
+ ... </s></literal> instead.
</para>
</listitem>
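Note (illustration only, not part of r26311): the hunks above show <p>G</p> and <p>\u0047</p> as two spellings of the same primary shift rule. The Python sketch below converts between a literal character and the \unnnn notation. The function names are hypothetical, and the sketch covers only four-digit (BMP) values, because the committed text does not spell out the notation for characters outside the BMP.

```python
import re

# Matches \u followed by exactly four hexadecimal digits, in either case.
_HEX_NOTATION = re.compile(r"\\u([0-9A-Fa-f]{4})")


def to_hex_notation(ch):
    """Return the \\unnnn spelling of a single BMP character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError("character is outside the BMP; not handled here")
    return "\\u%04X" % code_point


def from_hex_notation(text):
    """Replace every \\unnnn sequence in text with the literal character.

    The hexadecimal digits A through F are accepted in either case.
    """
    return _HEX_NOTATION.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    print(to_hex_notation("G"))
    print(from_hex_notation("<p>\\u0047</p>"))
```

Writing rules with literal characters assumes that Index.xml is saved as UTF-8, so the escaped form can be easier to keep intact when a file passes through tools with uncertain encoding handling.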