Author: jstephens
Date: 2006-10-23 20:19:01 +0200 (Mon, 23 Oct 2006)
New Revision: 3704
Log:
Refactoring CJK FAQ: Completed first pass for 5.0.
Modified:
trunk/refman-5.0/faqs.xml
Modified: trunk/refman-5.0/faqs.xml
===================================================================
--- trunk/refman-5.0/faqs.xml 2006-10-23 14:44:56 UTC (rev 3703)
+++ trunk/refman-5.0/faqs.xml 2006-10-23 18:19:01 UTC (rev 3704)
Changed blocks: 22, Lines Added: 444, Lines Deleted: 280; 36589 bytes
@@ -4114,10 +4114,10 @@
</remark>
<para>
- It is also possible that there are issues with application
- settings — consult also the question about
- <quote>Troubles with Access (or Perl) (or PHP)
- (etc.)</quote> much later in this FAQ.
+ It is also possible that there are issues with the API
+ configuration setting being used in your application
+ — see <xref linkend="faqs-cjk-access-perl-php"/>,
+ for more information.
</para>
</answer>
@@ -4453,7 +4453,7 @@
</section>
- <section id="cjk-faq-great-yen-sign-problem">
+ <section id="cjk-faq-yen-sign">
<title>How Do I Work With the Yen Sign (<literal>¥</literal>)?</title>
@@ -4532,7 +4532,7 @@
</section>
- <section id="cjk-faq-euckr-charset-problems">
+ <section id="cjk-faq-euckr-charset">
<title>Issues with <literal>euckr</literal> (Korean) Character Set Support</title>
@@ -4946,7 +4946,7 @@
</section>
- <section id="cjk-faq-fulltext-searches">
+ <section id="faqs-cjk-like-fulltext-searches">
<title>Why do some <literal>LIKE</literal> and <literal>FULLTEXT</literal>
searches fail?</title>
@@ -4962,45 +4962,55 @@
</indexterm>
<para>
- There is a simple problem with <literal>LIKE</literal> searches
- on <literal>BINARY</literal> and <literal>BLOB</literal>
- columns: we need to know the end of a character. With multi-byte
- character sets, different characters might have different octet
- lengths. For example, in <literal>utf8</literal>,
- <literal>A</literal> requires one byte but
- <literal>ペ</literal> requires three bytes. Illustration:
+ There is a very simple problem with <literal>LIKE</literal>
+ searches on <literal>BINARY</literal> and
+ <literal>BLOB</literal> columns: we need to know the end of a
+ character. With multi-byte character sets, different characters
+ might have different octet lengths. For example, in
+ <literal>utf8</literal>, <literal>A</literal> requires one byte
+ but <literal>ペ</literal> requires three bytes, as shown here:
<programlisting>
- +-------------------------+---------------------------+
- | octet_length(_utf8 'A') | octet_length(_utf8 'ペ') |
- +-------------------------+---------------------------+
- | 1 | 3 |
- +-------------------------+---------------------------+
- 1 row in set (0.00 sec)
- </programlisting>
++-------------------------+---------------------------+
+| OCTET_LENGTH(_utf8 'A') | OCTET_LENGTH(_utf8 'ペ') |
++-------------------------+---------------------------+
+| 1 | 3 |
++-------------------------+---------------------------+
+1 row in set (0.00 sec)
+</programlisting>
If we don't know where the first character ends, then we don't
- know where the second character begins, and even simple-looking
- searches like <literal>LIKE '_A%'</literal> will fail. The
+ know where the second character begins, in which case even very
+ simple searches such as <literal>LIKE '_A%'</literal> fail. The
solution is to use a regular CJK character set in the first
- place, or convert to a CJK character character set before
- comparing. Incidentally, this is one reason why MySQL cannot
- allow encodings of nonexistent characters: It must be strict
- about rejecting bad input, or it won't know where characters
- end. There is a simple problem with <literal>FULLTEXT</literal>:
- we need to know the end of a word. With Western writing this is
- rarely a problem because there are spaces between words. With
- Asian writing this is not the case. We could use half-good
- solutions, like saying that all Han characters represent words,
- or depending on (Japanese) changes from Katakana to Hiragana
- which are due to grammatical endings. But the only good solution
- requires a dictionary, and we haven't found a good open-source
- dictionary.
+ place, or to convert to a CJK character set before
+ comparing.
</para>
+ <para>
+ This is one reason why MySQL cannot allow encodings of
+ nonexistent characters. If it is not strict about rejecting bad
+ input, then it has no way of knowing where characters end.
+ </para>
+
+ <para>
+ For <literal>FULLTEXT</literal> searches, we need to know where
+ words begin and end. With Western languages, this is rarely a
+ problem because most (if not all) of these use an
+ easy-to-identify word boundary — the space character.
+ However, this is not usually the case with Asian writing. We
+ could use arbitrary halfway measures, like assuming that all Han
+ characters represent words, or (for Japanese) depending on
+ changes from Katakana to Hiragana due to grammatical endings.
+ However, the only sure solution requires a comprehensive word
+ list, which means that we would have to include a dictionary in
+ the server for each Asian language supported. This is simply not
+ feasible.
+ </para>
+
</section>
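The character-boundary problem described in this answer can be illustrated outside the server. What follows is a minimal sketch in Python, not MySQL code. The helper names octet_length and second_unit_is_a are hypothetical, and the second one only approximates what LIKE '_A%' does when the unit of comparison is a character as opposed to a single byte.

```python
# Illustrative sketch only: Python stands in for the server here.
# The function names are hypothetical and are not part of MySQL.


def octet_length(text: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))


def second_unit_is_a(value) -> bool:
    """Rough analogue of LIKE '_A%': skip one unit, then expect 'A'.

    For str the unit is one character; for bytes it is one octet.
    """
    target = b"A" if isinstance(value, bytes) else "A"
    return value[1:2] == target


text = "\u30daA"            # KATAKANA LETTER PE followed by 'A'
raw = text.encode("utf-8")  # the same data with no character set attached

print(octet_length("A"), octet_length("\u30da"))
print(second_unit_is_a(text))
print(second_unit_is_a(raw))
```

On character data the pattern matches. On the raw bytes the single-unit wildcard consumes only the first of the three octets of the Katakana character, so the comparison lands in the middle of that character and fails.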
- <section id="cjk-faq-available-cjk-charsets">
+ <section id="faqs-cjk-available-charsets">
<title>What CJK character sets are available?</title>
@@ -5014,22 +5024,37 @@
<secondary>character sets available</secondary>
</indexterm>
- <para>
- The list of CJK character sets may vary depending on version.
- For example, the <literal>eucjpms</literal> character set is a
- recent addition. But the language name appears in the
- <literal>DESCRIPTION</literal> column for every entry in
- <literal>information_schema.character_sets</literal>. Therefore,
- to get a current list of all the non-Unicode CJK character sets,
- say:
+ <qandaset>
+ <qandaentry>
+
+ <question>
+
+ <para>
+ What CJK character sets are available in MySQL?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ The list of CJK character sets may vary depending on
+ version. For example, the <literal>eucjpms</literal>
+ character set is a recent addition. But the language name
+ appears in the <literal>DESCRIPTION</literal> column for
+ every entry in
+ <literal>information_schema.character_sets</literal>.
+ Therefore, to get a current list of all the non-Unicode
+ CJK character sets, say:
+
<programlisting>
-mysql> <userinput>SELECT character_set_name, description</userinput>
- -> <userinput>FROM information_schema.character_sets</userinput>
- -> <userinput>WHERE description LIKE '%Chinese%'</userinput>
- -> <userinput>OR description LIKE '%Japanese%'</userinput>
- -> <userinput>OR description LIKE '%Korean%'</userinput>
- -> <userinput>ORDER BY character_set_name;</userinput>
+mysql> <userinput>SELECT character_set_name, description</userinput>
+ -> <userinput>FROM information_schema.character_sets</userinput>
+ -> <userinput>WHERE description LIKE '%Chinese%'</userinput>
+ -> <userinput>OR description LIKE '%Japanese%'</userinput>
+ -> <userinput>OR description LIKE '%Korean%'</userinput>
+ -> <userinput>ORDER BY character_set_name;</userinput>
+--------------------+---------------------------+
| character_set_name | description |
+--------------------+---------------------------+
@@ -5044,13 +5069,20 @@
+--------------------+---------------------------+
8 rows in set (0.01 sec)
</programlisting>
- </para>
+ </para>
+ </answer>
+
+ </qandaentry>
+
+ </qandaset>
+
</section>
- <section id="cjk-faq-character-x-availability">
+ <section id="cjk-faq-character-availability">
- <title>Is character X available in all character sets?</title>
+ <title>Is character <replaceable>X</replaceable> available in all character
+ sets?</title>
<indexterm type="concept">
<primary>CJK (Chinese, Japanese, Korean)</primary>
@@ -5062,17 +5094,34 @@
<secondary>testing if specific characters are available</secondary>
</indexterm>
- <para>
- The majority of everyday-use Chinese/Japanese characters
- (simplified Chinese and basic non-halfwidth Kana Japanese)
- appear in all CJK character sets. Here is a stored procedure
- which accepts a UCS-2 Unicode character, converts it to all
- other character sets, and displays the results in hexadecimal.
+ <qandaset>
+ <qandaentry>
+
+ <question>
+
+ <para>
+ How do I know whether character
+ <replaceable>X</replaceable> is available in all character
+ sets?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ The majority of simplified Chinese and basic non-halfwidth
+ Japanese <foreignphrase>Kana</foreignphrase> characters
+ appear in all CJK character sets. This stored procedure
+ accepts a <literal>UCS-2</literal> Unicode character,
+ converts it to all other character sets, and displays the
+ results in hexadecimal.
+
<programlisting>
DELIMITER //
-CREATE PROCEDURE p_convert (ucs2_char CHAR(1) CHARACTER SET ucs2)
+CREATE PROCEDURE p_convert(ucs2_char CHAR(1) CHARACTER SET ucs2)
BEGIN
CREATE TABLE tj
@@ -5118,19 +5167,21 @@
END//
</programlisting>
- The input can be any single <literal>ucs2</literal> character,
- or it can be the code point value (hexadecimal representation)
- of that character. Here's an example of what
- <function>P_CONVERT()</function> can do. An earlier answer said
- that the character <quote>Katakana Letter Pe</quote> appears in
- all CJK character sets. We know that the code point value of
- Katakana Letter Pe is <literal>0x30da</literal>. (By the way, we
- got the name from Unicode's list of ucs2 encodings and names:
- <ulink url="http://www.unicode.org/Public/UNIDATA/UnicodeData.txt"/>.)
- So we'll say:
+ The input can be any single <literal>ucs2</literal>
+ character, or it can be the code point value (hexadecimal
+ representation) of that character. For example, from
+ Unicode's list of <literal>ucs2</literal> encodings and
+ names
+ (<ulink url="http://www.unicode.org/Public/UNIDATA/UnicodeData.txt"/>),
+ we know that the <foreignphrase>Katakana</foreignphrase>
+ character <foreignphrase>Pe</foreignphrase> appears in all
+ CJK character sets, and that its code point value is
+ <literal>0x30da</literal>. If we use this value as the
+ argument to <literal>p_convert()</literal>, the result is
+ as shown here:
<programlisting>
-mysql> <userinput>CALL P_CONVERT(0x30da)//</userinput>
+mysql> <userinput>CALL p_convert(0x30da)//</userinput>
+------+--------+------+-------+---------+-------+--------+------+------+------+
| ucs2 | utf8 | big5 | cp932 | eucjpms | euckr | gb2312 | gbk | sjis | ujis |
+------+--------+------+-------+---------+-------+--------+------+------+------+
@@ -5139,13 +5190,21 @@
1 row in set (0.04 sec)
</programlisting>
- Since none of the column values is <literal>3F</literal>, we
- know that every conversion worked.
- </para>
+ Since none of the column values is <literal>3F</literal>
+ — that is, the question mark character
+ (<literal>?</literal>) — we know that every
+ conversion worked.
+ </para>
+ </answer>
+
+ </qandaentry>
+
+ </qandaset>
+
</section>
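The idea behind p_convert() (convert one character to several character sets, show each result in hexadecimal, and look for 3F) can be sketched without a server. This is only an approximation in Python: the codec names below are Python's, not MySQL character set names, the function name convert_hex is hypothetical, and latin-1 is included purely because it cannot represent the character.

```python
# Illustrative sketch only: Python codecs are not guaranteed to match
# MySQL's conversion tables byte for byte. convert_hex is a hypothetical name.


def convert_hex(char: str, encoding: str) -> str:
    """Encode char and return the bytes as uppercase hexadecimal.

    Characters that the target encoding cannot represent are replaced
    by a question mark, which shows up as 3F.
    """
    return char.encode(encoding, errors="replace").hex().upper()


PE = "\u30da"  # KATAKANA LETTER PE, code point 0x30DA

for codec in ("utf-8", "shift_jis", "euc_jp", "latin-1"):
    result = convert_hex(PE, codec)
    status = "not available" if result == "3F" else "ok"
    print(f"{codec:10} {result:8} {status}")
```

A result of 3F for a given codec means the conversion fell back to a question mark, which is the same signal the stored procedure's output relies on.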
- <section id="cjk-faq-sorting-problems-unicode-1">
+ <section id="faqs-cjk-sorting-unicode-1">
<title>Strings don't sort correctly in Unicode (I)</title>
@@ -5174,35 +5233,48 @@
<secondary>using the right collation</secondary>
</indexterm>
- <para>
- Sometimes people observe that the result of a
- <literal>utf8_unicode_ci</literal> or
- <literal>ucs2_unicode_ci</literal> search or <literal>ORDER
- BY</literal> sort is not what they think a native would expect.
- Although we never rule out the chance that there is a bug, we
- have found in the past that people are not correctly reading the
- standard table of weights for the Unicode Collation Algorithm.
- So, here's how to check whether we're using the right collation.
- The correct table for MySQL is this one:
- <ulink url="http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt"/>.
- This is different from the first table you will find by
- navigating from the <literal>unicode.org</literal> home page.
- MySQL deliberately uses the older 4.0.0 <quote>allkeys</quote>
- table, instead of the current 4.1.0 table. We are very wary
- about changing ordering which affects indexes. Here is an
- example of a problem that we handled recently, for a complaint
- in our bugs database,
- <ulink url="http://bugs.mysql.com/bug.php?id=16526"/>:
+ <qandaset>
+ <qandaentry>
+
+ <question>
+
+ <para>
+ Why don't CJK strings sort correctly in Unicode? (I)
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ Sometimes people observe that the result of a
+ <literal>utf8_unicode_ci</literal> or
+ <literal>ucs2_unicode_ci</literal> search, or of an
+ <literal>ORDER BY</literal> sort, is not what they think a
+ native speaker would expect. Although we never rule out the
+ possibility that there is a bug, we have found in the past
+ that many people do not correctly read the standard table
+ of weights for the Unicode Collation Algorithm. MySQL uses
+ the table found at
+ <ulink url="http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt"/>.
+ This is not the first table you will find by navigating
+ from the <literal>unicode.org</literal> home page, because
+ MySQL uses the older 4.0.0 <quote>allkeys</quote>
+ table rather than the more recent 4.1.0 table. This is because we
+ are very wary about changing ordering which affects
+ indexes, lest we bring about situations such as that
+ reported in Bug #16526, illustrated as follows:
+
<programlisting>
-mysql> <userinput>CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);</userinput>
+mysql> <userinput>CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);</userinput>
Query OK, 0 rows affected (0.05 sec)
-mysql> <userinput>INSERT INTO tj VALUES ('が'),('か');</userinput>
+mysql> <userinput>INSERT INTO tj VALUES ('が'),('か');</userinput>
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0
-mysql> <userinput>SELECT * FROM tj WHERE s1 = 'か';</userinput>
+mysql> <userinput>SELECT * FROM tj WHERE s1 = 'か';</userinput>
+------+
| s1 |
+------+
@@ -5212,14 +5284,14 @@
2 rows in set (0.00 sec)
</programlisting>
- If your eyes are sharp, you'll see that the character in the
- first result row isn't the one that we searched for. Why did
- MySQL retrieve it? First we look for the Unicode code point
- value, which is possible by reading the hexadecimal number for
- the <literal>ucs2</literal> version of the characters:
+ The character in the first result row is not the one that
+ we searched for. Why did MySQL retrieve it? First we look
+ for the Unicode code point value, which is possible by
+ reading the hexadecimal number for the
+ <literal>ucs2</literal> version of the characters:
<programlisting>
-mysql> <userinput>SELECT s1,HEX(CONVERT(s1 USING ucs2)) FROM tj;</userinput>
+mysql> <userinput>SELECT s1, HEX(CONVERT(s1 USING ucs2)) FROM tj;</userinput>
+------+-----------------------------+
| s1 | HEX(CONVERT(s1 USING ucs2)) |
+------+-----------------------------+
@@ -5229,40 +5301,49 @@
2 rows in set (0.03 sec)
</programlisting>
- Now let's search for <literal>304B</literal> and
- <literal>304C</literal> in the 4.0.0 allkeys table. We'll find
- these lines:
+ Now we search for <literal>304B</literal> and
+ <literal>304C</literal> in the <literal>4.0.0
+ allkeys</literal> table, and find these lines:
<programlisting>
304B ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA
304C ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM
</programlisting>
- The official Unicode names (following the <quote>#</quote> mark)
- are informative; they tell us the Japanese syllabary (Hiragana),
- the informal classification (letter instead of digit or
- punctuation), and the Western identifier (<literal>KA</literal>
- or <literal>GA</literal>, which happen to be voiced/unvoiced
- components of the same letter pair). More importantly, the
- Primary Weight (the first hexadecimal number inside the square
- brackets) is <literal>1E57</literal> on both lines. For
- comparisons in both searching and sorting, MySQL pays attention
- only to the Primary Weight, it ignores all the other numbers. So
- now we know that we're sorting <literal>が</literal> and
- <literal>か</literal> correctly according to the Unicode
- specification. If we wanted to distinguish them, we'd have to
- use a non-Unicode-Collation-Algorithm collation
- (<literal>utf8_unicode_bin</literal> or
- <literal>utf8_general_ci</literal>), or compare the
- <function>HEX()</function> values, or say <literal>ORDER BY
- CONVERT(s1 USING sjis)</literal>. Being correct <quote>according
- to Unicode</quote> isn't enough, of course: the person who
- submitted the bug was equally correct. We plan to add another
- collation for Japanese according to the JIS X 4061 standard,
- where voiced/unvoiced letters like KA/GA are distinguishable for
- ordering purposes.
- </para>
+ The official Unicode names (following the <quote>#</quote>
+ mark) tell us the Japanese syllabary (Hiragana), the
+ informal classification (letter, digit, or punctuation
+ mark), and the Western identifier (<literal>KA</literal>
+ or <literal>GA</literal>, which happen to be the unvoiced and
+ voiced components of the same letter pair). More
+ importantly, the <firstterm>primary weight</firstterm>
+ (the first hexadecimal number inside the square brackets)
+ is <literal>1E57</literal> on both lines. For comparisons
+ in both searching and sorting, MySQL pays attention to the
+ primary weight only, ignoring all the other numbers. This
+ means that we are sorting <literal>が</literal> and
+ <literal>か</literal> correctly according to the Unicode
+ specification. If we wanted to distinguish them, we'd have
+ to use a non-UCA (Unicode Collation Algorithm) collation
+ (<literal>utf8_bin</literal> or
+ <literal>utf8_general_ci</literal>), or to compare the
+ <function>HEX()</function> values, or use <literal>ORDER
+ BY CONVERT(s1 USING sjis)</literal>. Being correct
+ <quote>according to Unicode</quote> isn't enough, of
+ course: the person who submitted the bug was equally
+ correct. We plan to add another collation for Japanese
+ according to the JIS X 4061 standard, in which
+ voiced/unvoiced letter pairs like
+ <literal>KA</literal>/<literal>GA</literal> are
+ distinguishable for ordering purposes.
+ </para>
+ </answer>
+
+ </qandaentry>
+
+ </qandaset>
+
</section>
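The primary-weight comparison described in this answer can be reproduced from the two allkeys lines quoted there. The sketch below is Python, not the server's implementation. The names primary_weights and ELEMENT are hypothetical, and it assumes (as in the Unicode Collation Algorithm) that a primary weight of 0000 is ignorable at the primary level.

```python
# Illustrative sketch only: this parses the two allkeys lines quoted in the
# answer and compares primary weights. It is not MySQL's collation code.
import re

ALLKEYS_LINES = """\
304B ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA
304C ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM
"""

# The first four hex digits after "[." (or "[*") are the primary weight.
ELEMENT = re.compile(r"\[[.*]([0-9A-F]{4})\.")


def primary_weights(line: str):
    """Return (code point, list of non-zero primary weights) for one line."""
    code, _, rest = line.partition(";")
    elements = rest.split("#")[0]
    weights = [int(w, 16) for w in ELEMENT.findall(elements)]
    # Assumption: a zero primary weight is ignorable at the primary level.
    return int(code.strip(), 16), [w for w in weights if w != 0]


table = {}
for line in ALLKEYS_LINES.splitlines():
    code_point, weights = primary_weights(line)
    table[chr(code_point)] = weights

ka, ga = "\u304b", "\u304c"
print([f"{w:04X}" for w in table[ka]], [f"{w:04X}" for w in table[ga]])
print("equal at primary level:", table[ka] == table[ga])
```

Because both characters reduce to the same primary weight, a comparison that looks only at that level treats them as equal, which matches the two-row result of the SELECT shown in the answer.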
<section id="cjk-faq-sorting-problems-unicode-2">
@@ -5294,12 +5375,27 @@
<secondary>Unicode collations</secondary>
</indexterm>
- <para>
- You're using Unicode (<literal>ucs2</literal> or
- <literal>utf8</literal>), and you know what the Unicode sort
- order is (see the previous question and answer), but MySQL still
- seems to sort your table wrong? This might be easy.
+ <qandaset>
+ <qandaentry>
+
+ <question>
+
+ <para>
+ Why don't CJK strings sort correctly in Unicode? (II)
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ You're using Unicode (<literal>ucs2</literal> or
+ <literal>utf8</literal>), and you know what the Unicode
+ sort order is (see the previous question and answer), but
+ MySQL still seems to sort your table wrong? This might be
+ easy.
+
<programlisting>
mysql> <userinput>SHOW CREATE TABLE t\G</userinput>
******************** 1. row ******************
@@ -5310,8 +5406,8 @@
1 row in set (0.00 sec)
</programlisting>
- Hmm, the character set looks okay. Let's look at the
- <literal>information_schema</literal> for this column.
+ Hmm, the character set looks okay. Let's look at the
+ <literal>information_schema</literal> for this column.
<programlisting>
mysql> <userinput>SELECT column_name, character_set_name, collation_name</userinput>
@@ -5326,8 +5422,8 @@
1 row in set (0.01 sec)
</programlisting>
- Oops, the collation is <literal>ucs2_general_ci</literal>
- instead of <literal>ucs2_unicode_ci</literal>! Here's why:
+ Oops, the collation is <literal>ucs2_general_ci</literal>
+ instead of <literal>ucs2_unicode_ci</literal>! Here's why:
<programlisting>
mysql> <userinput>SHOW CHARSET LIKE 'ucs2%';</userinput>
@@ -5339,15 +5435,22 @@
1 row in set (0.00 sec)
</programlisting>
- For <literal>ucs2</literal> and <literal>utf8</literal>, the
- <quote>general</quote> collation is the default. To specify that
- you wanted a <quote>unicode</quote> collation, you should have
- specified <literal>COLLATE ucs2_unicode_ci</literal>.
- </para>
+ For <literal>ucs2</literal> and <literal>utf8</literal>,
+ the <quote>general</quote> collation is the default. To
+ specify that you wanted a <quote>unicode</quote>
+ collation, you should have specified <literal>COLLATE
+ ucs2_unicode_ci</literal>.
+ </para>
+ </answer>
+
+ </qandaentry>
+
+ </qandaset>
+
</section>
- <section id="cjk-faq-supplementary-chars-rejected">
+ <section id="faqs-cjk-supplementary-chars-rejected">
<title>My supplementary characters get rejected</title>
@@ -5356,32 +5459,61 @@
<secondary>rejected characters</secondary>
</indexterm>
- <para>
- Right. MySQL doesn't support supplementary characters
- (characters which need more than 3 bytes with UTF-8). We support
- only what Unicode calls the <emphasis>Basic Multilingual Plane /
- Plane 0</emphasis>. Only a few very rare Han characters are
- supplementary; support for them is uncommon. This has led to bug
- #12600 (<ulink url="http://bugs.mysql.com/bug.php?id=12600"/>)
- which we rejected as <quote>not a bug</quote>. With
- <literal>utf8</literal>, we must truncate an input string when
- we encounter bytes that we don't understand. Otherwise, we
- wouldn't know how long the bad multi-byte character is. A
- workaround is: if you use <literal>ucs2</literal> instead of
- <literal>utf8</literal>, then the bad characters will change to
- question marks, but there will be no truncation. Or change the
- data type to <literal>BLOB</literal> or
- <literal>BINARY</literal>, which have no validity checking. In
- our bugs database, bug #14052
- (<ulink url="http://bugs.mysql.com/bug.php?id=14052"/>) is a
- feature request for Wikipedia, asking us to support
- supplementary characters extending <literal>ucs2</literal> as
- well as <literal>utf8</literal>.
- </para>
+ <qandaset>
+ <qandaentry>
+
+ <question>
+
+ <para>
+ Why are my supplementary characters rejected by MySQL?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ MySQL does not support supplementary characters (that is,
+ characters which need more than 3 bytes in
+ <literal>UTF-8</literal>). We support only what Unicode
+ calls the <emphasis>Basic Multilingual Plane / Plane
+ 0</emphasis>. Only a few very rare Han characters are
+ supplementary; support for them is uncommon. This has led
+ to reports such as that found in Bug #12600, which we
+ rejected as <quote>not a bug</quote>. With
+ <literal>utf8</literal>, we must truncate an input string
+ when we encounter bytes that we don't understand.
+ Otherwise, we wouldn't know how long the bad multi-byte
+ character is.
+ </para>
+
+ <para>
+ One possible workaround is to use <literal>ucs2</literal>
+ instead of <literal>utf8</literal>, in which case the
+ <quote>bad</quote> characters are changed to question
+ marks; however, no truncation takes place. You can also
+ change the data type to <literal>BLOB</literal> or
+ <literal>BINARY</literal>, which perform no validity
+ checking.
+ </para>
+
+ <para>
+ We intend at some point in the future to add support for
+ <literal>UTF-16</literal>, which would solve such issues
+ by allowing 4-byte characters. However, we have as yet set
+ no definite timetable for doing so.
+ </para>
+
+ </answer>
+
+ </qandaentry>
+
+ </qandaset>
+
</section>
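Whether a given character is supplementary can be checked before it is sent to the server. The following Python sketch is an illustration under that assumption only; the function names is_supplementary and utf8_length are hypothetical and nothing here is part of MySQL.

```python
# Illustrative sketch only: a client-side check for characters outside the
# Basic Multilingual Plane. The function names are hypothetical.


def is_supplementary(char: str) -> bool:
    """True if the code point lies above U+FFFF, that is, outside Plane 0."""
    return ord(char) > 0xFFFF


def utf8_length(char: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(char.encode("utf-8"))


samples = ["A", "\u30da", "\U00020000"]  # U+20000 is a rare Han character

for char in samples:
    utf16_units = len(char.encode("utf-16-be")) // 2
    print(
        f"U+{ord(char):04X}",
        "supplementary" if is_supplementary(char) else "BMP",
        f"utf8 bytes={utf8_length(char)}",
        f"utf16 units={utf16_units}",
    )
```

Only the last sample needs four bytes in UTF-8 (and two 16-bit units in UTF-16), which is the kind of input that the answer above says is rejected.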
- <section id="cjk-faq-cjkv">
+ <section id="faqs-cjk-cjkv">
<title>Shouldn't it be CJKV (V for Vietnamese)?</title>
@@ -5399,21 +5531,45 @@
<primary>Vietnamese</primary>
</indexterm>
- <para>
- No. The term CJKV (Chinese Japanese Korean Vietnamese) refers to
- character sets which contain Han (originally Chinese)
- characters. MySQL has no plan to support the old Vietnamese
- script using Han characters. MySQL does of course support the
- modern Vietnamese script with Western characters. Another
- question that has come up (once) is a request for specialized
- Vietnamese collation, see
- <ulink url="http://bugs.mysql.com/bug.php?id=4745"/>. We might
- do something about it someday, if many more requests arise.
- </para>
+ <qandaset>
+ <qandaentry>
+
+ <question>
+
+ <para>
+ Shouldn't it be <quote>CJKV</quote>?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ No. The term <quote>CJKV</quote> (<firstterm>Chinese
+ Japanese Korean Vietnamese</firstterm>) refers to
+ character sets which contain Han (originally
+ Chinese) characters. MySQL has no plan to support the old
+ Vietnamese script using Han characters. MySQL does of
+ course support the modern Vietnamese script with Western
+ characters.
+ </para>
+
+ <para>
+ Bug #4745 is a request for a specialized Vietnamese
+ collation, which we might add in the future if there is
+ sufficient demand for it.
+ </para>
+
+ </answer>
+
+ </qandaentry>
+
+ </qandaset>
+
</section>
- <section id="cjk-faq-fixing-cjk-problems">
+ <section id="faqs-cjk-db-object-names">
<title>Will MySQL fix any CJK problems in version 5.1?</title>
@@ -5422,69 +5578,48 @@
about are implemented.
</remark>
- <para>
- Yes. We're changing the names of files and directories. Here's
- an example, using mysql as <literal>root</literal> under Linux:
+ <qandaset>
- <orderedlist>
+ <qandaentry>
- <listitem>
+ <question>
+
<para>
- Create a table with a name containing a Han character:
-
-<programlisting>
-mysql> <userinput>CREATE TABLE tab_楮 (s1 INT);</userinput>
-Query OK, 0 rows affected (0.07 sec)
-</programlisting>
+ Does MySQL allow CJK characters to be used in database and
+ table names?
</para>
- </listitem>
- <listitem>
+ </question>
+
+ <answer>
+
<para>
- Find out where MySQL stores database files:
-
-<programlisting>
-mysql> <userinput>SHOW VARIABLES LIKE 'datadir';</userinput>
-+---------------+-----------------------+
-| Variable_name | Value |
-+---------------+-----------------------+
-| datadir | /usr/local/mysql/var/ |
-+---------------+-----------------------+
-1 row in set (0.00 sec)
-</programlisting>
+ This issue is fixed in MySQL 5.1, by automatically
+ rewriting the names of the corresponding directories and
+ files.
</para>
- </listitem>
- <listitem>
<para>
- Look at the directory to see the MyISAM table files:
-
-<programlisting>
-# cd /usr/local/mysql/var/dba
-# dir tab_*
--rw-rw---- 1 root root 0 2006-05-16 10:22 tab_@stripped
--rw-rw---- 1 root root 1024 2006-05-16 10:22 tab_@stripped
--rw-rw---- 1 root root 8556 2006-05-16 10:22 tab_@stripped
-</programlisting>
+ For example, if you create a database named
+ <literal>楮</literal> on a server whose operating system
+ does not support CJK in directory names, MySQL creates a
+ directory named <literal>@0w@00a5@00ae</literal>, which is
+ just a fancy way of encoding <literal>E6A5AE</literal>
+ (that is, the hexadecimal UTF-8 encoding of the
+ <literal>楮</literal> character). However, if you
+ run a <literal>SHOW DATABASES</literal> statement, you can
+ see that the database is listed as <literal>楮</literal>.
</para>
- </listitem>
- </orderedlist>
+ </answer>
- Notice that MySQL has converted the Han character to
- <literal>@</literal> + (Unicode value of Han character), that
- is, to a purely ASCII representation. This solves an old
- problem, that database files weren't portable, because some
- computers wouldn't allow <literal>楮</literal> in a file name.
- Conversion to the new file names will be automatic when you
- upgrade to version 5.1. This should take care of bug #6313 in
- our bugs database,
- <ulink url="http://bugs.mysql.com/bug.php?id=6313"/>.
- </para>
+ </qandaentry>
+ </qandaset>
+
</section>
- <section id="cjk-faq-manual-translation">
+ <section id="faqs-cjk-manual-translation">
<title>When will MySQL translate the manual again?</title>
@@ -5522,15 +5657,35 @@
[SH] Update as CJK translations of manuals are updated.
</remark>
- <para>
- A Beijing-based group has produced a Simplified Chinese version
- for us under contract. It's complete and can be found on
- <ulink url="http://dev.mysql.com/doc/#chinese-5.1"/>. It's up to
- date as of version 5.1.2. The Japanese manual can be downloaded
- from <ulink url="http://dev.mysql.com/doc/#japanese-4.1"/>. It
- is still for version 4.1.
- </para>
+ <qandaset>
+ <qandaentry>
+
+ <question>
+
+ <para>
+ Where can I find translations of the MySQL Manual?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ A Simplified Chinese version of the Manual, current for
+ MySQL 5.1.12, can be found at
+ <ulink url="http://dev.mysql.com/doc/#chinese-5.1"/>. The
+ Japanese translation of the MySQL 4.1 manual can be
+ downloaded from
+ <ulink url="http://dev.mysql.com/doc/#japanese-4.1"/>.
+ </para>
+
+ </answer>
+
+ </qandaentry>
+
+ </qandaset>
+
</section>
<section id="cjk-faq-contact">
@@ -5541,63 +5696,72 @@
[SH] Update if things change.
</remark>
- <para>
- Check <ulink url="http://dev.mysql.com/user-groups/"/> to see if
- there is a MySQL user group near you. If there isn't: why not
- start one yourself? To contact a sales engineer in MySQL KK's
- Japan office:
+ <qandaset>
+ <qandaentry>
+
+ <question>
+
+ <para>
+ Whom can I talk to?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ The following resources are available:
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ A listing of MySQL user groups can be found at
+ <ulink url="http://dev.mysql.com/user-groups/"/>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ You can contact a sales engineer at the MySQL KK
+ Japan office using any of the following:
+
<programlisting>
Tel: +81(0)3-5326-3133
Fax: +81(0)3-5326-3001
Email: dsaito@stripped
</programlisting>
+ </para>
+ </listitem>
- To see feature requests about language issues:
+ <listitem>
+ <para>
+ View feature requests relating to character set
+ issues at <ulink url="http://tinyurl.com/y6xcuf"/>.
+ </para>
+ </listitem>
- <itemizedlist>
+ <listitem>
+ <para>
+ Visit the MySQL
+ <ulink
+ url="http://forums.mysql.com/list.php?103">Character
+ Sets, Collation, Unicode Forum</ulink>. We are also
+ in the process of adding foreign-language forums at
+ <ulink url="http://forums.mysql.com/"/>.
+ </para>
+ </listitem>
- <listitem>
- <para>
- Go to <ulink url="http://bugs.mysql.com"/>.
+ </itemizedlist>
</para>
- </listitem>
- <listitem>
- <para>
- Click <guimenu>Advanced Search</guimenu>.
- </para>
- </listitem>
+ </answer>
- <listitem>
- <para>
- In the <guilabel>Severity</guilabel> dropdown box, click
- <literal>S4 (Feature Request)</literal>.
- </para>
- </listitem>
+ </qandaentry>
- <listitem>
- <para>
- In the list box beside <guilabel>Category</guilabel>,
- click <literal>Character Sets</literal>.
- </para>
- </listitem>
+ </qandaset>
- <listitem>
- <para>
- Click the <guibutton>Search</guibutton> button.
- </para>
- </listitem>
-
- </itemizedlist>
-
- You can post CJK questions, or see previous answers, on MySQL's
- <quote>Character Sets, Collation, Unicode</quote> forum:
- <ulink url="http://forums.mysql.com/list.php?103"/>. MySQL plans
- to add native-language forums on
- <ulink url="http://forums.mysql.com/"/> very soon.
- </para>
-
</section>
</section>