List:Commits
From:jon Date:October 23 2006 8:19pm
Subject:svn commit - mysqldoc@docsrva: r3704 - trunk/refman-5.0
Author: jstephens
Date: 2006-10-23 20:19:01 +0200 (Mon, 23 Oct 2006)
New Revision: 3704

Log:

Refactoring CJK FAQ: Completed first pass for 5.0.



Modified:
   trunk/refman-5.0/faqs.xml


Modified: trunk/refman-5.0/faqs.xml
===================================================================
--- trunk/refman-5.0/faqs.xml	2006-10-23 14:44:56 UTC (rev 3703)
+++ trunk/refman-5.0/faqs.xml	2006-10-23 18:19:01 UTC (rev 3704)
Changed blocks: 22, Lines Added: 444, Lines Deleted: 280; 36589 bytes

@@ -4114,10 +4114,10 @@
             </remark>
 
             <para>
-              It is also possible that there are issues with application
-              settings &mdash; consult also the question about
-              <quote>Troubles with Access (or Perl) (or PHP)
-              (etc.)</quote> much later in this FAQ.
+              It is also possible that there are issues with the API
+              configuration setting being used in your application
+              &mdash; see <xref linkend="faqs-cjk-access-perl-php"/>
+              for more information.
             </para>
 
           </answer>

@@ -4453,7 +4453,7 @@
 
     </section>
 
-    <section id="cjk-faq-great-yen-sign-problem">
+    <section id="cjk-faq-yen-sign">
 
       <title>How Do I Work With the Yen Sign (<literal>¥</literal>)?</title>
 

@@ -4532,7 +4532,7 @@
 
     </section>
 
-    <section id="cjk-faq-euckr-charset-problems">
+    <section id="cjk-faq-euckr-charset">
 
       <title>Issues with <literal>euckr</literal> (Korean) Character Set Support</title>
 

@@ -4946,7 +4946,7 @@
 
     </section>
 
-    <section id="cjk-faq-fulltext-searches">
+    <section id="faqs-cjk-like-fulltext-searches">
 
       <title>Why do some <literal>LIKE</literal> and <literal>FULLTEXT</literal>
         searches fail?</title>

@@ -4962,45 +4962,55 @@
       </indexterm>
 
       <para>
-        There is a simple problem with <literal>LIKE</literal> searches
-        on <literal>BINARY</literal> and <literal>BLOB</literal>
-        columns: we need to know the end of a character. With multi-byte
-        character sets, different characters might have different octet
-        lengths. For example, in <literal>utf8</literal>,
-        <literal>A</literal> requires one byte but
-        <literal>ペ</literal> requires three bytes. Illustration:
+        There is a very simple problem with <literal>LIKE</literal>
+        searches on <literal>BINARY</literal> and
+        <literal>BLOB</literal> columns: we need to know the end of a
+        character. With multi-byte character sets, different characters
+        might have different octet lengths. For example, in
+        <literal>utf8</literal>, <literal>A</literal> requires one byte
+        but <literal>ペ</literal> requires three bytes, as shown here:
 
 <programlisting>
-        +-------------------------+---------------------------+
-        | octet_length(_utf8 'A') | octet_length(_utf8 'ペ') |
-        +-------------------------+---------------------------+
-        |                       1 |                         3 |
-        +-------------------------+---------------------------+
-        1 row in set (0.00 sec)
-      </programlisting>
++-------------------------+---------------------------+
+| OCTET_LENGTH(_utf8 'A') | OCTET_LENGTH(_utf8 'ペ') |
++-------------------------+---------------------------+
+|                       1 |                         3 |
++-------------------------+---------------------------+
+1 row in set (0.00 sec)
+</programlisting>
 
         If we don't know where the first character ends, then we don't
-        know where the second character begins, and even simple-looking
-        searches like <literal>LIKE '_A%'</literal> will fail. The
+        know where the second character begins, in which case even very
+        simple searches such as <literal>LIKE '_A%'</literal> fail. The
         solution is to use a regular CJK character set in the first
-        place, or convert to a CJK character character set before
-        comparing. Incidentally, this is one reason why MySQL cannot
-        allow encodings of nonexistent characters: It must be strict
-        about rejecting bad input, or it won't know where characters
-        end. There is a simple problem with <literal>FULLTEXT</literal>:
-        we need to know the end of a word. With Western writing this is
-        rarely a problem because there are spaces between words. With
-        Asian writing this is not the case. We could use half-good
-        solutions, like saying that all Han characters represent words,
-        or depending on (Japanese) changes from Katakana to Hiragana
-        which are due to grammatical endings. But the only good solution
-        requires a dictionary, and we haven't found a good open-source
-        dictionary.
+        place, or to convert to a CJK character set before
+        comparing.
       </para>
 
+      <para>
+        This is one reason why MySQL cannot allow encodings of
+        nonexistent characters. If it is not strict about rejecting bad
+        input, then it has no way of knowing where characters end.
+      </para>
+
+      <para>
+        For <literal>FULLTEXT</literal> searches, we need to know where
+        words begin and end. With Western languages, this is rarely a
+        problem because most (if not all) of these use an
+        easy-to-identify word boundary &mdash; the space character.
+        However, this is not usually the case with Asian writing. We
+        could use arbitrary halfway measures, like assuming that all Han
+        characters represent words, or (for Japanese) depending on
+        changes from Katakana to Hiragana due to grammatical endings.
+        However, the only sure solution requires a comprehensive word
+        list, which means that we would have to include a dictionary in
+        the server for each Asian language supported. This is simply not
+        feasible.
+      </para>
+
     </section>
 
-    <section id="cjk-faq-available-cjk-charsets">
+    <section id="faqs-cjk-available-charsets">
 
       <title>What CJK character sets are available?</title>
 

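Editorial illustration, not part of r3704: a minimal Python sketch of the byte-length point made in the LIKE paragraph of the hunk above. Python's UTF-8 codec stands in for the server's utf8 handling, and all variable names are hypothetical.

```python
# Illustrative only: Python's UTF-8 codec stands in for MySQL's utf8 handling.
text = "ペA"
raw = text.encode("utf-8")  # the same string viewed as a byte sequence

# Per-character byte lengths, comparable to what OCTET_LENGTH() reports.
octets = {ch: len(ch.encode("utf-8")) for ch in text}

# LIKE '_A%' asks whether the second *character* is 'A'.
second_char_is_a = text[1] == "A"    # character semantics
second_byte_is_a = raw[1:2] == b"A"  # byte semantics, as with BINARY/BLOB

print(octets, raw.hex(), second_char_is_a, second_byte_is_a)
```

With character semantics the second position holds A. With byte semantics it holds the middle byte of the three-byte character, which is why the pattern fails on binary columns.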
@@ -5014,22 +5024,37 @@
         <secondary>character sets available</secondary>
       </indexterm>
 
-      <para>
-        The list of CJK character sets may vary depending on version.
-        For example, the <literal>eucjpms</literal> character set is a
-        recent addition. But the language name appears in the
-        <literal>DESCRIPTION</literal> column for every entry in
-        <literal>information_schema.character_sets</literal>. Therefore,
-        to get a current list of all the non-Unicode CJK character sets,
-        say:
+      <qandaset>
 
+        <qandaentry>
+
+          <question>
+
+            <para>
+              What CJK character sets are available in MySQL?
+            </para>
+
+          </question>
+
+          <answer>
+
+            <para>
+              The list of CJK character sets may vary depending on
+              version. For example, the <literal>eucjpms</literal>
+              character set is a recent addition. But the language name
+              appears in the <literal>DESCRIPTION</literal> column for
+              every entry in
+              <literal>information_schema.character_sets</literal>.
+              Therefore, to get a current list of all the non-Unicode
+              CJK character sets, say:
+
 <programlisting>
-mysql> <userinput>SELECT character_set_name, description</userinput>
-    -> <userinput>FROM information_schema.character_sets</userinput>
-    -> <userinput>WHERE description LIKE '%Chinese%'</userinput>
-    -> <userinput>OR    description LIKE '%Japanese%'</userinput>
-    -> <userinput>OR    description LIKE '%Korean%'</userinput>
-    -> <userinput>ORDER BY character_set_name;</userinput>
+mysql&gt; <userinput>SELECT character_set_name, description</userinput>
+    -&gt; <userinput>FROM information_schema.character_sets</userinput>
+    -&gt; <userinput>WHERE description LIKE '%Chinese%'</userinput>
+    -&gt; <userinput>OR description LIKE '%Japanese%'</userinput>
+    -&gt; <userinput>OR description LIKE '%Korean%'</userinput>
+    -&gt; <userinput>ORDER BY character_set_name;</userinput>
 +--------------------+---------------------------+
 | character_set_name | description               |
 +--------------------+---------------------------+

@@ -5044,13 +5069,20 @@
 +--------------------+---------------------------+
 8 rows in set (0.01 sec)
 </programlisting>
-      </para>
+            </para>
 
+          </answer>
+
+        </qandaentry>
+
+      </qandaset>
+
     </section>
 
-    <section id="cjk-faq-character-x-availability">
+    <section id="cjk-faq-character-availability">
 
-      <title>Is character X available in all character sets?</title>
+      <title>Is character <replaceable>X</replaceable> available in all character
+        sets?</title>
 
       <indexterm type="concept">
         <primary>CJK (Chinese, Japanese, Korean)</primary>

@@ -5062,17 +5094,34 @@
         <secondary>testing if specific characters are available</secondary>
       </indexterm>
 
-      <para>
-        The majority of everyday-use Chinese/Japanese characters
-        (simplified Chinese and basic non-halfwidth Kana Japanese)
-        appear in all CJK character sets. Here is a stored procedure
-        which accepts a UCS-2 Unicode character, converts it to all
-        other character sets, and displays the results in hexadecimal.
+      <qandaset>
 
+        <qandaentry>
+
+          <question>
+
+            <para>
+              How do I know whether character
+              <replaceable>X</replaceable> is available in all character
+              sets?
+            </para>
+
+          </question>
+
+          <answer>
+
+            <para>
+              The majority of simplified Chinese and basic non-halfwidth
+              Japanese <foreignphrase>Kana</foreignphrase> characters
+              appear in all CJK character sets. This stored procedure
+              accepts a <literal>UCS-2</literal> Unicode character,
+              converts it to all other character sets, and displays the
+              results in hexadecimal.
+
 <programlisting>
 DELIMITER //
 
-CREATE PROCEDURE p_convert (ucs2_char CHAR(1) CHARACTER SET ucs2)
+CREATE PROCEDURE p_convert(ucs2_char CHAR(1) CHARACTER SET ucs2)
 BEGIN
 
 CREATE TABLE tj

@@ -5118,19 +5167,21 @@
 END//
 </programlisting>
 
-        The input can be any single <literal>ucs2</literal> character,
-        or it can be the code point value (hexadecimal representation)
-        of that character. Here's an example of what
-        <function>P_CONVERT()</function> can do. An earlier answer said
-        that the character <quote>Katakana Letter Pe</quote> appears in
-        all CJK character sets. We know that the code point value of
-        Katakana Letter Pe is <literal>0x30da</literal>. (By the way, we
-        got the name from Unicode's list of ucs2 encodings and names:
-        <ulink url="http://www.unicode.org/Public/UNIDATA/UnicodeData.txt"/>.)
-        So we'll say:
+              The input can be any single <literal>ucs2</literal>
+              character, or it can be the code point value (hexadecimal
+              representation) of that character. For example, from
+              Unicode's list of <literal>ucs2</literal> encodings and
+              names
+              (<ulink url="http://www.unicode.org/Public/UNIDATA/UnicodeData.txt"/>),
+              we know that the <foreignphrase>Katakana</foreignphrase>
+              character <foreignphrase>Pe</foreignphrase> appears in all
+              CJK character sets, and that its code point value is
+              <literal>0x30da</literal>. If we use this value as the
+              argument to <literal>p_convert()</literal>, the result is
+              as shown here:
 
 <programlisting>
-mysql> <userinput>CALL P_CONVERT(0x30da)//</userinput>
+mysql&gt; <userinput>CALL p_convert(0x30da)//</userinput>
 +------+--------+------+-------+---------+-------+--------+------+------+------+
 | ucs2 | utf8   | big5 | cp932 | eucjpms | euckr | gb2312 | gbk  | sjis | ujis |
 +------+--------+------+-------+---------+-------+--------+------+------+------+
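Editorial illustration, not part of r3704: a rough Python analogue of the idea behind p_convert(), assuming Python's codecs as stand-ins for the MySQL character sets. The codec mapping below is an assumption, Python's conversion tables may differ from MySQL's, and eucjpms is omitted because no Python codec is known to match it exactly.

```python
# Hypothetical mapping from MySQL character set names to Python codec names.
CODECS = {
    "ucs2": "utf-16-be",
    "utf8": "utf-8",
    "big5": "big5",
    "cp932": "cp932",
    "euckr": "euc_kr",
    "gb2312": "gb2312",
    "gbk": "gbk",
    "sjis": "shift_jis",
    "ujis": "euc_jp",
}


def convert_all(ch):
    """Return {charset name: hex string}; unconvertible input becomes '3F'."""
    return {
        name: ch.encode(codec, errors="replace").hex().upper()
        for name, codec in CODECS.items()
    }


def failed(result):
    """List the character sets whose conversion produced a question mark."""
    return [name for name, hexval in result.items() if hexval == "3F"]


row = convert_all("\u30da")  # KATAKANA LETTER PE
print(row)
print("failed conversions:", failed(row))
```

As in the stored procedure, a value of 3F (the question mark) marks a conversion that did not work.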

@@ -5139,13 +5190,21 @@
 1 row in set (0.04 sec)
 </programlisting>
 
-        Since none of the column values is <literal>3F</literal>, we
-        know that every conversion worked.
-      </para>
+              Since none of the column values is <literal>3F</literal>
+              &mdash; that is, the question mark character
+              (<literal>?</literal>) &mdash; we know that every
+              conversion worked.
+            </para>
 
+          </answer>
+
+        </qandaentry>
+
+      </qandaset>
+
     </section>
 
-    <section id="cjk-faq-sorting-problems-unicode-1">
+    <section id="faqs-cjk-sorting-unicode-1">
 
       <title>Strings don't sort correctly in Unicode (I)</title>
 

@@ -5174,35 +5233,48 @@
         <secondary>using the right collation</secondary>
       </indexterm>
 
-      <para>
-        Sometimes people observe that the result of a
-        <literal>utf8_unicode_ci</literal> or
-        <literal>ucs2_unicode_ci</literal> search or <literal>ORDER
-        BY</literal> sort is not what they think a native would expect.
-        Although we never rule out the chance that there is a bug, we
-        have found in the past that people are not correctly reading the
-        standard table of weights for the Unicode Collation Algorithm.
-        So, here's how to check whether we're using the right collation.
-        The correct table for MySQL is this one:
-        <ulink url="http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt"/>.
-        This is different from the first table you will find by
-        navigating from the <literal>unicode.org</literal> home page.
-        MySQL deliberately uses the older 4.0.0 <quote>allkeys</quote>
-        table, instead of the current 4.1.0 table. We are very wary
-        about changing ordering which affects indexes. Here is an
-        example of a problem that we handled recently, for a complaint
-        in our bugs database,
-        <ulink url="http://bugs.mysql.com/bug.php?id=16526"/>:
+      <qandaset>
 
+        <qandaentry>
+
+          <question>
+
+            <para>
+              Why don't CJK strings sort correctly in Unicode? (I)
+            </para>
+
+          </question>
+
+          <answer>
+
+            <para>
+              Sometimes people observe that the result of a
+              <literal>utf8_unicode_ci</literal> or
+              <literal>ucs2_unicode_ci</literal> search, or of an
+              <literal>ORDER BY</literal> sort, is not what they think a
+              native would expect. Although we never rule out the
+              possibility that there is a bug, we have found in the past
+              that many people do not correctly read the standard table
+              of weights for the Unicode Collation Algorithm. MySQL uses
+              the table found at
+              <ulink url="http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt"/>.
+              This is not the first table you will find by navigating
+              from the <literal>unicode.org</literal> home page, because
+              MySQL uses the older 4.0.0 <quote>allkeys</quote>
+              table rather than the more recent 4.1.0 table. This is because we
+              are very wary about changing ordering which affects
+              indexes, lest we bring about situations such as that
+              reported in Bug #16526, illustrated as follows:
+
 <programlisting>
-mysql> <userinput>CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);</userinput>
+mysql&gt; <userinput>CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);</userinput>
 Query OK, 0 rows affected (0.05 sec)
 
-mysql> <userinput>INSERT INTO tj VALUES ('が'),('か');</userinput>
+mysql&gt; <userinput>INSERT INTO tj VALUES ('が'),('か');</userinput>
 Query OK, 2 rows affected (0.00 sec)
 Records: 2  Duplicates: 0  Warnings: 0
 
-mysql> <userinput>SELECT * FROM tj WHERE s1 = 'か';</userinput>
+mysql&gt; <userinput>SELECT * FROM tj WHERE s1 = 'か';</userinput>
 +------+
 | s1   |
 +------+

@@ -5212,14 +5284,14 @@
 2 rows in set (0.00 sec)
 </programlisting>
 
-        If your eyes are sharp, you'll see that the character in the
-        first result row isn't the one that we searched for. Why did
-        MySQL retrieve it? First we look for the Unicode code point
-        value, which is possible by reading the hexadecimal number for
-        the <literal>ucs2</literal> version of the characters:
+              The character in the first result row is not the one that
+              we searched for. Why did MySQL retrieve it? First we look
+              for the Unicode code point value, which is possible by
+              reading the hexadecimal number for the
+              <literal>ucs2</literal> version of the characters:
 
 <programlisting>
-mysql> <userinput>SELECT s1,HEX(CONVERT(s1 USING ucs2)) FROM tj;</userinput>
+mysql> <userinput>SELECT s1, HEX(CONVERT(s1 USING ucs2)) FROM tj;</userinput>
 +------+-----------------------------+
 | s1   | HEX(CONVERT(s1 USING ucs2)) |
 +------+-----------------------------+

@@ -5229,40 +5301,49 @@
 2 rows in set (0.03 sec)
 </programlisting>
 
-        Now let's search for <literal>304B</literal> and
-        <literal>304C</literal> in the 4.0.0 allkeys table. We'll find
-        these lines:
+              Now we search for <literal>304B</literal> and
+              <literal>304C</literal> in the <literal>4.0.0
+              allkeys</literal> table, and find these lines:
 
 <programlisting>
 304B  ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA
 304C  ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM
 </programlisting>
 
-        The official Unicode names (following the <quote>#</quote> mark)
-        are informative; they tell us the Japanese syllabary (Hiragana),
-        the informal classification (letter instead of digit or
-        punctuation), and the Western identifier (<literal>KA</literal>
-        or <literal>GA</literal>, which happen to be voiced/unvoiced
-        components of the same letter pair). More importantly, the
-        Primary Weight (the first hexadecimal number inside the square
-        brackets) is <literal>1E57</literal> on both lines. For
-        comparisons in both searching and sorting, MySQL pays attention
-        only to the Primary Weight, it ignores all the other numbers. So
-        now we know that we're sorting <literal>が</literal> and
-        <literal>か</literal> correctly according to the Unicode
-        specification. If we wanted to distinguish them, we'd have to
-        use a non-Unicode-Collation-Algorithm collation
-        (<literal>utf8_unicode_bin</literal> or
-        <literal>utf8_general_ci</literal>), or compare the
-        <function>HEX()</function> values, or say <literal>ORDER BY
-        CONVERT(s1 USING sjis)</literal>. Being correct <quote>according
-        to Unicode</quote> isn't enough, of course: the person who
-        submitted the bug was equally correct. We plan to add another
-        collation for Japanese according to the JIS X 4061 standard,
-        where voiced/unvoiced letters like KA/GA are distinguishable for
-        ordering purposes.
-      </para>
+              The official Unicode names (following the <quote>#</quote>
+              mark) tell us the Japanese syllabary (Hiragana), the
+              informal classification (letter, digit, or punctuation
+              mark), and the Western identifier (<literal>KA</literal>
+              or <literal>GA</literal>, which happen to be voiced and
+              unvoiced components of the same letter pair). More
+              importantly, the <firstterm>primary weight</firstterm>
+              (the first hexadecimal number inside the square brackets)
+              is <literal>1E57</literal> on both lines. For comparisons
+              in both searching and sorting, MySQL pays attention to the
+              primary weight only, ignoring all the other numbers. This
+              means that we are sorting <literal>が</literal> and
+              <literal>か</literal> correctly according to the Unicode
+              specification. If we wanted to distinguish them, we'd have
+              to use a non-UCA (Unicode Collation Algorithm) collation
+              (<literal>utf8_unicode_bin</literal> or
+              <literal>utf8_general_ci</literal>), or to compare the
+              <function>HEX()</function> values, or use <literal>ORDER
+              BY CONVERT(s1 USING sjis)</literal>. Being correct
+              <quote>according to Unicode</quote> isn't enough, of
+              course: the person who submitted the bug was equally
+              correct. We plan to add another collation for Japanese
+              according to the JIS X 4061 standard, in which
+              voiced/unvoiced letter pairs like
+              <literal>KA</literal>/<literal>GA</literal> are
+              distinguishable for ordering purposes.
+            </para>
 
+          </answer>
+
+        </qandaentry>
+
+      </qandaset>
+
     </section>
 
     <section id="cjk-faq-sorting-problems-unicode-2">
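Editorial illustration, not part of r3704: a small Python sketch of the primary-weight comparison described in the hunk above. It parses only the two allkeys lines quoted there and assumes, as the text states, that comparisons use the primary weight alone. The function and variable names are hypothetical.

```python
import re

# The two allkeys-4.0.0 lines quoted in the FAQ text.
ALLKEYS_LINES = [
    "304B  ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA",
    "304C  ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM",
]

# One collation element: [.primary.secondary.tertiary.fourth]
ELEMENT = re.compile(
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.[0-9A-F]{4,6}\]"
)


def parse(line):
    """Return (character, [(primary, secondary, tertiary), ...])."""
    code_point = int(line.split(";")[0].strip(), 16)
    elements = [tuple(int(w, 16) for w in m) for m in ELEMENT.findall(line)]
    return chr(code_point), elements


def primary_key(elements):
    """Keep only the non-zero primary weights, ignoring the other levels."""
    return [p for p, _secondary, _tertiary in elements if p != 0]


table = dict(parse(line) for line in ALLKEYS_LINES)
ka, ga = "\u304b", "\u304c"

print(primary_key(table[ka]) == primary_key(table[ga]))  # primary level only
print(table[ka] == table[ga])                            # all levels
```

Both characters reduce to the same primary key, so a primary-only comparison treats them as equal. They differ only in the extra element carried by GA.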

@@ -5294,12 +5375,27 @@
         <secondary>Unicode collations</secondary>
       </indexterm>
 
-      <para>
-        You're using Unicode (<literal>ucs2</literal> or
-        <literal>utf8</literal>), and you know what the Unicode sort
-        order is (see the previous question and answer), but MySQL still
-        seems to sort your table wrong? This might be easy.
+      <qandaset>
 
+        <qandaentry>
+
+          <question>
+
+            <para>
+              Why don't CJK strings sort correctly in Unicode? (II)
+            </para>
+
+          </question>
+
+          <answer>
+
+            <para>
+              You're using Unicode (<literal>ucs2</literal> or
+              <literal>utf8</literal>), and you know what the Unicode
+              sort order is (see the previous question and answer), but
+              MySQL still seems to sort your table wrong? This might be
+              easy.
+
 <programlisting>
 mysql> <userinput>SHOW CREATE TABLE t\G</userinput>
 ******************** 1. row ******************

@@ -5310,8 +5406,8 @@
 1 row in set (0.00 sec)
 </programlisting>
 
-        Hmm, the character set looks okay. Let's look at the
-        <literal>information_schema</literal> for this column.
+              Hmm, the character set looks okay. Let's look at the
+              <literal>information_schema</literal> for this column.
 
 <programlisting>
 mysql> <userinput>SELECT column_name, character_set_name, collation_name</userinput>

@@ -5326,8 +5422,8 @@
 1 row in set (0.01 sec)
 </programlisting>
 
-        Oops, the collation is <literal>ucs2_general_ci</literal>
-        instead of <literal>ucs2_unicode_ci</literal>! Here's why:
+              Oops, the collation is <literal>ucs2_general_ci</literal>
+              instead of <literal>ucs2_unicode_ci</literal>! Here's why:
 
 <programlisting>
 mysql> <userinput>SHOW CHARSET LIKE 'ucs2%';</userinput>

@@ -5339,15 +5435,22 @@
 1 row in set (0.00 sec)
 </programlisting>
 
-        For <literal>ucs2</literal> and <literal>utf8</literal>, the
-        <quote>general</quote> collation is the default. To specify that
-        you wanted a <quote>unicode</quote> collation, you should have
-        specified <literal>COLLATE ucs2_unicode_ci</literal>.
-      </para>
+              For <literal>ucs2</literal> and <literal>utf8</literal>,
+              the <quote>general</quote> collation is the default. To
+              specify that you wanted a <quote>unicode</quote>
+              collation, you should have specified <literal>COLLATE
+              ucs2_unicode_ci</literal>.
+            </para>
 
+          </answer>
+
+        </qandaentry>
+
+      </qandaset>
+
     </section>
 
-    <section id="cjk-faq-supplementary-chars-rejected">
+    <section id="faqs-cjk-supplementary-chars-rejected">
 
       <title>My supplementary characters get rejected</title>
 

@@ -5356,32 +5459,61 @@
         <secondary>rejected characters</secondary>
       </indexterm>
 
-      <para>
-        Right. MySQL doesn't support supplementary characters
-        (characters which need more than 3 bytes with UTF-8). We support
-        only what Unicode calls the <emphasis>Basic Multilingual Plane /
-        Plane 0</emphasis>. Only a few very rare Han characters are
-        supplementary; support for them is uncommon. This has led to bug
-        #12600 (<ulink url="http://bugs.mysql.com/bug.php?id=12600"/>)
-        which we rejected as <quote>not a bug</quote>. With
-        <literal>utf8</literal>, we must truncate an input string when
-        we encounter bytes that we don't understand. Otherwise, we
-        wouldn't know how long the bad multi-byte character is. A
-        workaround is: if you use <literal>ucs2</literal> instead of
-        <literal>utf8</literal>, then the bad characters will change to
-        question marks, but there will be no truncation. Or change the
-        data type to <literal>BLOB</literal> or
-        <literal>BINARY</literal>, which have no validity checking. In
-        our bugs database, bug #14052
-        (<ulink url="http://bugs.mysql.com/bug.php?id=14052"/>) is a
-        feature request for Wikipedia, asking us to support
-        supplementary characters extending <literal>ucs2</literal> as
-        well as <literal>utf8</literal>.
-      </para>
+      <qandaset>
 
+        <qandaentry>
+
+          <question>
+
+            <para>
+              Why are my supplementary characters rejected by MySQL?
+            </para>
+
+          </question>
+
+          <answer>
+
+            <para>
+              MySQL does not support supplementary characters &mdash;
+              that is, characters which need more than 3 bytes &mdash;
+              for <literal>UTF-8</literal>. We support only what Unicode
+              calls the <emphasis>Basic Multilingual Plane / Plane
+              0</emphasis>. Only a few very rare Han characters are
+              supplementary; support for them is uncommon. This has led
+              to reports such as that found in Bug #12600, which we
+              rejected as <quote>not a bug</quote>. With
+              <literal>utf8</literal>, we must truncate an input string
+              when we encounter bytes that we don't understand.
+              Otherwise, we wouldn't know how long the bad multi-byte
+              character is.
+            </para>
+
+            <para>
+              One possible workaround is to use <literal>ucs2</literal>
+              instead of <literal>utf8</literal>, in which case the
+              <quote>bad</quote> characters are changed to question
+              marks; however, no truncation takes place. You can also
+              change the data type to <literal>BLOB</literal> or
+              <literal>BINARY</literal>, which perform no validity
+              checking.
+            </para>
+
+            <para>
+              We intend at some point in the future to add support for
+              <literal>UTF-16</literal>, which would solve such issues
+              by allowing 4-byte characters. However, we have as yet set
+              no definite timetable for doing so.
+            </para>
+
+          </answer>
+
+        </qandaentry>
+
+      </qandaset>
+
     </section>
 
-    <section id="cjk-faq-cjkv">
+    <section id="faqs-cjk-cjkv">
 
       <title>Shouldn't it be CJKV (V for Vietnamese)?</title>
 

@@ -5399,21 +5531,45 @@
         <primary>Vietnamese</primary>
       </indexterm>
 
-      <para>
-        No. The term CJKV (Chinese Japanese Korean Vietnamese) refers to
-        character sets which contain Han (originally Chinese)
-        characters. MySQL has no plan to support the old Vietnamese
-        script using Han characters. MySQL does of course support the
-        modern Vietnamese script with Western characters. Another
-        question that has come up (once) is a request for specialized
-        Vietnamese collation, see
-        <ulink url="http://bugs.mysql.com/bug.php?id=4745"/>. We might
-        do something about it someday, if many more requests arise.
-      </para>
+      <qandaset>
 
+        <qandaentry>
+
+          <question>
+
+            <para>
+              Shouldn't it be <quote>CJKV</quote>?
+            </para>
+
+          </question>
+
+          <answer>
+
+            <para>
+              No. The term <quote>CJKV</quote> (<firstterm>Chinese
+              Japanese Korean Vietnamese</firstterm>) refers to
+              Vietnamese character sets which contain Han (originally
+              Chinese) characters. MySQL has no plan to support the old
+              Vietnamese script using Han characters. MySQL does of
+              course support the modern Vietnamese script with Western
+              characters.
+            </para>
+
+            <para>
+              Bug #4745 is a request for a specialized Vietnamese
+              collation, which we might add in the future if there is
+              sufficient demand for it.
+            </para>
+
+          </answer>
+
+        </qandaentry>
+
+      </qandaset>
+
     </section>
 
-    <section id="cjk-faq-fixing-cjk-problems">
+    <section id="faqs-cjk-db-object-names">
 
       <title>Will MySQL fix any CJK problems in version 5.1?</title>
 

@@ -5422,69 +5578,48 @@
         about are implemented.
       </remark>
 
-      <para>
-        Yes. We're changing the names of files and directories. Here's
-        an example, using mysql as <literal>root</literal> under Linux:
+      <qandaset>
 
-        <orderedlist>
+        <qandaentry>
 
-          <listitem>
+          <question>
+
             <para>
-              Create a table with a name containing a Han character:
-
-<programlisting>
-mysql> <userinput>CREATE TABLE tab_楮 (s1 INT);</userinput>
-Query OK, 0 rows affected (0.07 sec)
-</programlisting>
+              Does MySQL allow CJK characters to be used in database and
+              table names?
             </para>
-          </listitem>
 
-          <listitem>
+          </question>
+
+          <answer>
+
             <para>
-              Find out where MySQL stores database files:
-
-<programlisting>
-mysql> <userinput>SHOW VARIABLES LIKE 'datadir';</userinput>
-+---------------+-----------------------+
-| Variable_name | Value                 |
-+---------------+-----------------------+
-| datadir       | /usr/local/mysql/var/ |
-+---------------+-----------------------+
-1 row in set (0.00 sec)
-</programlisting>
+              This issue is fixed in MySQL 5.1, by automatically
+              rewriting the names of the corresponding directories and
+              files.
             </para>
-          </listitem>
 
-          <listitem>
             <para>
-              Look at the directory to see the MyISAM table files:
-
-<programlisting>
-# cd /usr/local/mysql/var/dba
-# dir tab_*
--rw-rw----  1 root root    0 2006-05-16 10:22 tab_@stripped
--rw-rw----  1 root root 1024 2006-05-16 10:22 tab_@stripped
--rw-rw----  1 root root 8556 2006-05-16 10:22 tab_@stripped
-</programlisting>
+              For example, if you create a database named
+              <literal>楮</literal> on a server whose operating system
+              does not support CJK in directory names, MySQL creates a
+              directory named <literal>@0w@00a5@00ae</literal>. which is
+              just a fancy way of encoding <literal>E6A5AE</literal>
+              &mdash; that is, the hexadecimal UTF-8 encoding
+              of the <literal>楮</literal> character. However, if you
+              run a <literal>SHOW DATABASES</literal> statement, you can
+              see that the database is listed as <literal>楮</literal>.
             </para>
-          </listitem>
 
-        </orderedlist>
+          </answer>
 
-        Notice that MySQL has converted the Han character to
-        <literal>@</literal> + (Unicode value of Han character), that
-        is, to a purely ASCII representation. This solves an old
-        problem, that database files weren't portable, because some
-        computers wouldn't allow <literal>楮</literal> in a file name.
-        Conversion to the new file names will be automatic when you
-        upgrade to version 5.1. This should take care of bug #6313 in
-        our bugs database,
-        <ulink url="http://bugs.mysql.com/bug.php?id=6313"/>.
-      </para>
+        </qandaentry>
 
+      </qandaset>
+
     </section>
 
-    <section id="cjk-faq-manual-translation">
+    <section id="faqs-cjk-manual-translation">
 
       <title>When will MySQL translate the manual again?</title>
 

@@ -5522,15 +5657,35 @@
         [SH] Update as CJK translations of manuals are updated.
       </remark>
 
-      <para>
-        A Beijing-based group has produced a Simplified Chinese version
-        for us under contract. It's complete and can be found on
-        <ulink url="http://dev.mysql.com/doc/#chinese-5.1"/>. It's up to
-        date as of version 5.1.2. The Japanese manual can be downloaded
-        from <ulink url="http://dev.mysql.com/doc/#japanese-4.1"/>. It
-        is still for version 4.1.
-      </para>
+      <qandaset>
 
+        <qandaentry>
+
+          <question>
+
+            <para>
+              Where can I find translations of the MySQL Manual?
+            </para>
+
+          </question>
+
+          <answer>
+
+            <para>
+              A Simplified Chinese version of the Manual, current for
+              MySQL 5.1.12, can be found at
+              <ulink url="http://dev.mysql.com/doc/#chinese-5.1"/>. The
+              Japanese translation of the MySQL 4.1 manual can be
+              downloaded from
+              <ulink url="http://dev.mysql.com/doc/#japanese-4.1"/>.
+            </para>
+
+          </answer>
+
+        </qandaentry>
+
+      </qandaset>
+
     </section>
 
     <section id="cjk-faq-contact">

@@ -5541,63 +5696,72 @@
         [SH] Update if things change.
       </remark>
 
-      <para>
-        Check <ulink url="http://dev.mysql.com/user-groups/"/> to see if
-        there is a MySQL user group near you. If there isn't: why not
-        start one yourself? To contact a sales engineer in MySQL KK's
-        Japan office:
+      <qandaset>
 
+        <qandaentry>
+
+          <question>
+
+            <para>
+              Whom can I talk to?
+            </para>
+
+          </question>
+
+          <answer>
+
+            <para>
+              The following resources are available:
+
+              <itemizedlist>
+
+                <listitem>
+                  <para>
+                    A listing of MySQL user groups can be found at
+                    <ulink url="http://dev.mysql.com/user-groups/"/>.
+                  </para>
+                </listitem>
+
+                <listitem>
+                  <para>
+                    You can contact a sales engineer at the MySQL KK
+                    Japan office using any of the following:
+
 <programlisting>
 Tel: +81(0)3-5326-3133
 Fax: +81(0)3-5326-3001
 Email: dsaito@stripped
 </programlisting>
+                  </para>
+                </listitem>
 
-        To see feature requests about language issues:
+                <listitem>
+                  <para>
+                    View feature requests relating to character set
+                    issues at <ulink url="http://tinyurl.com/y6xcuf"/>.
+                  </para>
+                </listitem>
 
-        <itemizedlist>
+                <listitem>
+                  <para>
+                    Visit the MySQL
+                    <ulink
+                      url="http://forums.mysql.com/list.php?103">Character
+                    Sets, Collation, Unicode Forum</ulink>. We are also
+                    in the process of adding foreign-language forums at
+                    <ulink url="http://forums.mysql.com/"/>.
+                  </para>
+                </listitem>
 
-          <listitem>
-            <para>
-              Go to <ulink url="http://bugs.mysql.com"/>.
+              </itemizedlist>
             </para>
-          </listitem>
 
-          <listitem>
-            <para>
-              Click <guimenu>Advanced Search</guimenu>.
-            </para>
-          </listitem>
+          </answer>
 
-          <listitem>
-            <para>
-              In the <guilabel>Severity</guilabel> dropdown box, click
-              <literal>S4 (Feature Request)</literal>.
-            </para>
-          </listitem>
+        </qandaentry>
 
-          <listitem>
-            <para>
-              In the list box beside <guilabel>Category</guilabel>,
-              click <literal>Character Sets</literal>.
-            </para>
-          </listitem>
+      </qandaset>
 
-          <listitem>
-            <para>
-              Click the <guibutton>Search</guibutton> button.
-            </para>
-          </listitem>
-
-        </itemizedlist>
-
-        You can post CJK questions, or see previous answers, on MySQL's
-        <quote>Character Sets, Collation, Unicode</quote> forum:
-        <ulink url="http://forums.mysql.com/list.php?103"/>. MySQL plans
-        to add native-language forums on
-        <ulink url="http://forums.mysql.com/"/> very soon.
-      </para>
-
     </section>
 
   </section>
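
Note on the directory-name example in the CJK naming answer: the value E6A5AE quoted there can be checked independently of MySQL. The following is a minimal sketch in Python, not MySQL's own file-name encoding routine; the helper name hex_forms is hypothetical and exists only for illustration. It prints the Unicode code point of the character alongside the hexadecimal form of its UTF-8 bytes, which is the form the example quotes.

```python
# Hypothetical helper for illustration only; it is not part of MySQL and
# does not reproduce MySQL's on-disk file-name encoding.
def hex_forms(ch):
    """Return (code point, UTF-8 bytes in hex) for a single character."""
    code_point = "U+{:04X}".format(ord(ch))        # Unicode code point
    utf8_hex = ch.encode("utf-8").hex().upper()    # UTF-8 byte sequence
    return code_point, utf8_hex


if __name__ == "__main__":
    # "\u696e" is the Han character used in the example above.
    print(hex_forms("\u696e"))
```

The two values differ, which is why it helps to say which one a hexadecimal string refers to when describing an encoded name.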

