Author: jstephens
Date: 2006-10-23 08:44:04 +0200 (Mon, 23 Oct 2006)
New Revision: 3700
Log:
Refactoring CJK FAQ (first part).
Modified:
trunk/refman-5.0/faqs.xml
Modified: trunk/refman-5.0/faqs.xml
===================================================================
--- trunk/refman-5.0/faqs.xml 2006-10-23 04:29:21 UTC (rev 3699)
+++ trunk/refman-5.0/faqs.xml 2006-10-23 06:44:04 UTC (rev 3700)
Changed blocks: 12, Lines Added: 344, Lines Deleted: 167; 26771 bytes
@@ -3850,14 +3850,15 @@
</indexterm>
<para>
- This Frequently-Asked-Questions section comes from the experiences
- of MySQL's Support and Development groups, after handling many
- enquiries about CJK (Chinese Japanese Korean) issues.
+ This set of Frequently-Asked-Questions derives from the experience
+ of MySQL's Support and Development groups in handling many
+ enquiries about CJK (Chinese-Japanese-Korean) issues.
</para>
- <section id="cjk-faq-question-marks">
+ <section id="faqs-cjk-question-marks">
- <title>SELECT shows non-Latin characters as "?"s. Why?</title>
+ <title>Why Does <literal>SELECT</literal> Display non-Latin
characters as
+ <quote>?</quote> characters?</title>
<indexterm type="concept">
<primary>CJK (Chinese, Japanese, Korean)</primary>
@@ -3865,32 +3866,63 @@
</indexterm>
<para>
- You inserted CJK characters with <literal>INSERT</literal>, but
- when you do a <literal>SELECT</literal>, they all look like
- <quote>?</quote>. It usually is a setting in MySQL that doesn't
- match the settings for the application program or the operating
- system. These are common troubleshooting steps:
+ <emphasis role="bold">Problem</emphasis>: You inserted CJK
+ characters with <literal>INSERT</literal>, but when you do a
+ <literal>SELECT</literal>, they all look like
<quote>?</quote>.
+ </para>
- <itemizedlist>
+ <para>
+ <emphasis role="bold">Solution</emphasis>: This problem is
+ usually due to a setting in MySQL that doesn't match the
+ settings for the application program or the operating system.
+ Here are some common steps for correcting these types of issues,
+ given as a series of questions and answers:
+ </para>
- <listitem>
+ <qandaset defaultlabel="qanda">
+
+ <qandaentry>
+
+ <question>
+
<para>
- Find out: what version do you have? The statement
- <literal>SELECT VERSION();</literal> will tell you. This
- FAQ is for MySQL version 5, so some of the answers here
- will not apply to you if you have version 4.0 or 4.1.
+ What MySQL version are you using?
</para>
- </listitem>
- <listitem>
+ </question>
+
+ <answer>
+
<para>
- Find out: what character set is the database column really
- in? Too frequently, people think that the character set
- will be the same as the server's set (false), or the set
- used for display purposes (false). Make sure, by saying
- <literal>SHOW CREATE TABLE tablename</literal>, or better
- yet by saying this:
+ Use the statement <literal>SELECT VERSION();</literal> to
+ find out.
+ </para>
+ </answer>
+
+ </qandaentry>
+
+ <qandaentry>
+
+ <question>
+
+ <para>
+ What character set is the database column actually using?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ All too often, people think that the client character set
+ is always the same as either the server character set or
+ the character set used for display purposes. However, both
+ of these are false assumptions. You can make sure by
+ checking the result of <literal>SHOW CREATE TABLE
+ <replaceable>tablename</replaceable></literal> or
—
+ better — yet by using this statement:
+
<programlisting>
SELECT character_set_name, collation_name
FROM information_schema.columns WHERE table_schema = your_database_name
@@ -3898,62 +3930,118 @@
AND column_name = your_column_name;
</programlisting>
</para>
- </listitem>
- <listitem>
+ </answer>
+
+ </qandaentry>
+
+ <qandaentry>
+
+ <question>
+
<para>
- Find out: what is the hexadecimal value?
+ What is hexadecimal value of the character or characters?
+ </para>
+ </question>
+
+ <answer>
+
+ <para>
+ You can obtain this information for a column
+ <replaceable>column_name</replaceable> in the table
+ <replaceable>table_name</replaceable> using the following
+ query:
+ </para>
+
<programlisting>
-SELECT HEX(your_column_name)
-FROM your_table_name;
+SELECT HEX(<replaceable>column_name</replaceable>)
+FROM <replaceable>table_name</replaceable>;
</programlisting>
- If you see <literal>3F</literal>, then that really is the
- encoding for <literal>?</literal>, so no wonder you see
- <quote>?</quote>. Probably this happened because of a
- problem converting a particular character from your client
+ <para>
+ <literal>3F</literal> is the encoding for the
+ <literal>?</literal> character; this means that
+ <literal>?</literal> is the character actually stored in
+ the column. This most often happens because of a problem
+ converting a particular character from your client
character set to the target character set.
</para>
- </listitem>
- <listitem>
+ </answer>
+
+ </qandaentry>
+
+ <qandaentry>
+
+ <question>
+
<para>
- Find out: is a literal round trip possible, that is, if
- you select <quote>literal</quote> (or <quote>_introducer
- hexadecimal-value</quote>) do you get
- <quote>literal</quote> as a result? For example, with the
- Japanese Katakana Letter Pe, which looks like
- <literal>ペ'</literal>, and which exists in all CJK
- character sets, and which has the code point value
- (hexadecimal coding) <literal>0x30da</literal>, enter:
+ Is a round trip possible — that is, if you select
+ <replaceable>literal</replaceable> (or
+ <replaceable>_introducer hexadecimal-value</replaceable>)
+ do you obtain <replaceable>literal</replaceable> as a
+ result?
+ </para>
+ </question>
+
+ <answer>
+
+ <para>
+ For example, for the Japanese Katakana Letter
+ <emphasis>Pe</emphasis> — which looks like
+ <literal>ペ'</literal>, exists in all CJK character sets,
+ and has the code point value (hexadecimal coding)
+ <literal>0x30da</literal>, you would enter:
+
<programlisting>
SELECT 'ペ' AS `ペ`; /* or SELECT _ucs2 0x30da; */
</programlisting>
- If the result doesn't look like <literal>ペ</literal>, a
- round trip failed. For bug reports, we might ask people to
- follow up with <literal>SELECT hex('ペ');</literal>. Then
- we can see whether the client encoding is right.
+ If the result doesn't look like <literal>ペ</literal>,
+ the round trip failed.
</para>
- </listitem>
- <listitem>
<para>
- Find out: is it the browser or application? Just use
- <command>mysql</command> (the MySQL client program, which
- on Windows will be <command>mysql.exe</command>). If
+ For bug reports regarding such failures, we might ask you
+ to follow up with <literal>SELECT HEX('ペ');</literal>.
+ Then we can determine whether the client encoding is
+ correct.
+ </para>
+
+ </answer>
+
+ </qandaentry>
+
+ <qandaentry>
+
+ <question>
+
+ <para>
+ Is the problem with the browser or other application,
+ rather than with MySQL?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ Use the <command>mysql</command> client program (on
+ Windows: <command>mysql.exe</command>). If
<command>mysql</command> displays correctly but your
- application doesn't, then your problem is probably
- <quote>Settings</quote>, but consult also the question
- about <quote>Troubles with Access (or Perl) (or PHP)
- (etc.)</quote> much later in this FAQ.
+ application doesn't, then your problem is probably due to
+ settings in the application. <quote>Settings</quote>, but
+ consult also the question about <quote>Troubles with
+ Access (or Perl) (or PHP) (etc.)</quote> much later in
+ this FAQ.
</para>
<para>
- To find your settings, the statement you need here is
- <literal>SHOW VARIABLES</literal>. For example:
+ To find out what your settings are, use the <literal>SHOW
+ VARIABLES</literal> statement, whose output should
+ resemble what is shown here:
<programlisting>
mysql> <userinput>SHOW VARIABLES LIKE 'char%';</userinput>
@@ -3972,7 +4060,7 @@
8 rows in set (0.03 sec)
</programlisting>
- The above are typical character-set settings for an
+ These are typical character-set settings for an
international-oriented client (notice the use of
<literal>utf8</literal> Unicode) connected to a server in
the West (<literal>latin1</literal> is a West Europe
@@ -3981,38 +4069,39 @@
<para>
Although Unicode (usually the <literal>utf8</literal>
- variant on Unix, usually the <literal>ucs2</literal>
- variant on Windows) is better than <quote>latin</quote>,
- it's often not what your operating system utilities
- support best. Many Windows users find that a Microsoft
- character set, such as <literal>cp932</literal> for
- Japanese Windows, is what's suitable.
+ variant on Unix, and the <literal>ucs2</literal> variant
+ on Windows) is preferable to Latin, it's often not what
+ your operating system utilities support best. Many Windows
+ users find that a Microsoft character set, such as
+ <literal>cp932</literal> for Japanese Windows, is what's
+ suitable.
</para>
<para>
If you can't control the server settings, and you have no
- idea what your underlying computer is about, then try
- changing to a common character set for the country that
- you're in (<literal>euckr</literal> = Korea,
+ idea what your underlying computer is, then try changing
+ to a common character set for the country that you're in
+ (<literal>euckr</literal> = Korea;
<literal>gb2312</literal> or <literal>gbk</literal>
=
- People's Republic of China, <literal>big5</literal> =
- other China, <literal>sjis</literal> or
- <literal>ujis</literal> or <literal>cp932</literal>
or
- <literal>eucjpms</literal> = Japan,
- <literal>ucs2</literal> or <literal>utf8</literal>
=
- anywhere). Usually it is only necessary to change the
- client and connection and results settings, and there is a
- simple statement which changes all three at once, namely
+ People's Republic of China; <literal>big5</literal> =
+ Taiwan; <literal>sjis</literal>,
<literal>ujis</literal>,
+ <literal>cp932</literal>, or
<literal>eucjpms</literal> =
+ Japan; <literal>ucs2</literal> or
<literal>utf8</literal>
+ = anywhere). Usually it is necessary to change only the
+ client and connection and results settings. There is a
+ simple statement which changes all three at once:
<literal>SET NAMES</literal>. For example:
+ </para>
<programlisting>
SET NAMES 'big5';
</programlisting>
- Once you get the correct setting, you can make it
- permanent by editing <filename>my.cnf</filename> or
+ <para>
+ Once the setting is correct, you can make it permanent by
+ editing <filename>my.cnf</filename> or
<filename>my.ini</filename>. For example you might add
- lines looking like this:
+ lines looking like these:
<programlisting>
[mysqld]
@@ -4021,16 +4110,18 @@
default-character-set=big5
</programlisting>
</para>
- </listitem>
- </itemizedlist>
- </para>
+ </answer>
+ </qandaentry>
+
+ </qandaset>
+
</section>
- <section id="cjk-faq-gb-charset-problems">
+ <section id="faqs-cjk-gb-charsets-supported">
- <title>Troubles with GB character sets (Chinese)</title>
+ <title>What GB (Chinese) Character Sets Does MySQL Support?</title>
<indexterm type="concept">
<primary>CJK (Chinese, Japanese, Korean)</primary>
@@ -4046,51 +4137,109 @@
<primary>gb2312, gbk</primary>
</indexterm>
- <para>
- <remark role="update">
- [SH] References to d.udm.net (Bar's pages) need to be changed
- once we've moved those pages to the Reference Manual.
- </remark>
+ <remark role="update">
+ [SH] References to d.udm.net (Bar's pages) need to be changed
+ once we've moved those pages to the Reference Manual.
+ </remark>
- MySQL supports the two common variants of the GB (<quote>Guojia
- Biaozhun</quote> or <quote>National Standard</quote>) character
- sets which are official in the People's Republic of China:
- <literal>gb2312</literal> and <literal>gbk</literal>.
Sometimes
- people try to insert <literal>gbk</literal> characters into
- <literal>gb2312</literal>, and it works most of the time because
- <literal>gbk</literal> is a superset of
- <literal>gb2312</literal>. But eventually they try to insert a
- rarer Chinese character and it doesn't work. (Example: bug
- #16072 in our bugs database,
- <ulink url="http://bugs.mysql.com/bug.php?id=16072"/>). So we'll
- try to clarify here exactly what characters are legitimate in
- <literal>gb2312</literal> or <literal>gbk</literal>, with
- reference to the official documents. Please check these
- references before reporting <literal>gb2312</literal> or
- <literal>gbk</literal> bugs. We now have a graphic listing of
- the <literal>gbk</literal> characters, currently on the site of
- Mr Alexander Barkov (MySQL's principal programmer for character
- set issues). The chart is in order according to the
- <literal>gb2312_chinese_ci</literal> collation:
- <ulink url="http://d.udm.net/bar/~bar/charts/gb2312_chinese_ci.html"/>.
- MySQL's <literal>gbk</literal> is in reality <quote>Microsoft
- code page 936</quote>. This differs from the official
- <literal>gbk</literal> for characters
<literal>A1A4</literal>
- (middle dot), <literal>A1AA</literal> (em dash),
- <literal>A6E0-A6F5</literal>, and
<literal>A8BB-A8C0</literal>.
- For a listing of the differences, see
- <ulink
url="http://recode.progiciels-bpi.ca/showfile.html?name=dist/libiconv/gbk.h"/>.
- For a listing of gbk/Unicode mappings, see
- <ulink
url="http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT"/>.
- For MySQL's listing of gbk characters, see
- <ulink url="http://d.udm.net/bar/~bar/charts/gbk_chinese_ci.html"/>.
- </para>
+ <remark role="todo">
+ [js] Get the charts from Bar perferably as CSV or some other
+ format that will be easy to turn into DB-XML for inlcusion as
+ appendices to the RefMan.
+ </remark>
+ <qandaset>
+
+ <qandaentry>
+
+ <question>
+
+ <para>
+ What GB (Chinese) Character Sets Does MySQL Support?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ MySQL supports the two common variants of the
+ <firstterm>GB</firstterm> (<foreignphrase>Guojia
+ Biaozhun</foreignphrase>, or <emphasis>National
+ Standard</emphasis>) character sets which are official in
+ the People's Republic of China: <literal>gb2312</literal>
+ and <literal>gbk</literal>. Sometimes people try to insert
+ <literal>gbk</literal> characters into
+ <literal>gb2312</literal>, and it works most of the time
+ because <literal>gbk</literal> is a superset of
+ <literal>gb2312</literal> — but eventually they try
+ to insert a rarer Chinese character and it doesn't work.
+ (See Bug #16072 for an example).
+ </para>
+
+ <para>
+ Here, we try to clarify exactly what characters are
+ legitimate in <literal>gb2312</literal> or
+ <literal>gbk</literal>, with reference to the official
+ documents. Please check these references before reporting
+ <literal>gb2312</literal> or <literal>gbk</literal>
bugs.
+
+ <itemizedlist>
+
+ <listitem>
+ <para>
+ For a complete listing of the
+ <literal>gb2312</literal> characters, ordered
+ according to the
+ <literal>gb2312_chinese_ci</literal> collation:
+ <ulink
url="http://d.udm.net/bar/~bar/charts/gb2312_chinese_ci.html"/>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ MySQL's <literal>gbk</literal> is in reality
+ <quote>Microsoft code page 936</quote>. This differs
+ from the official <literal>gbk</literal> for
+ characters <literal>A1A4</literal> (middle dot),
+ <literal>A1AA</literal> (em dash),
+ <literal>A6E0-A6F5</literal>, and
+ <literal>A8BB-A8C0</literal>. For a listing of the
+ differences, see
+ <ulink
url="http://recode.progiciels-bpi.ca/showfile.html?name=dist/libiconv/gbk.h"/>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ For a listing of <literal>gbk</literal>/Unicode
+ mappings, see
+ <ulink
url="http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT"/>.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ For MySQL's listing of <literal>gbk</literal>
+ characters, see
+ <ulink
url="http://d.udm.net/bar/~bar/charts/gbk_chinese_ci.html"/>.
+ </para>
+ </listitem>
+
+ </itemizedlist>
+ </para>
+
+ </answer>
+
+ </qandaentry>
+
+ </qandaset>
+
</section>
- <section id="cjk-faq-big5-charset-problems">
+ <section id="faqs-cjk-big5-charset-problems">
- <title>Troubles with big5 character set (Chinese)</title>
+ <title>Troubles with the Big5 Character Set (Chinese)</title>
<indexterm type="concept">
<primary>CJK (Chinese, Japanese, Korean)</primary>
@@ -4108,25 +4257,27 @@
<para>
MySQL supports the Big5 character set which is common in Hong
- Kong and the Republic of China (Taiwan). MySQL's
- <literal>big5</literal> is in reality <quote>Microsoft code
page
- 950</quote>, which is very similar to the original
- <literal>big5</literal> character set. This is a recent change,
- starting with MySQL version 4.1.16 / 5.0.16. We made the change
- as a result of a bug report, bug #12476 in our bugs database,
- <ulink url="http://bugs.mysql.com/bug.php?id=12476"/> (title:
- <quote>Some big5 codes are still missing ...</quote>). For
- example, the following statements work in the current version of
- MySQL, but not in old versions:
+ Kong and Taiwan (Republic of China). MySQL's
+ <literal>big5</literal> is in reality Microsoft code page 950,
+ which is very similar to the original <literal>big5</literal>
+ character set. We changed to this
+ <remark role="todo">
+ [js] Fix version # as appropriate.
+ </remark>
+
+ character set starting with MySQL version 4.1.16 / 5.0.16 (as a
+ result of Bug #12476). For example, the following statements
+ work in current versions of MySQL, but not in old versions:
+
<programlisting>
-mysql> <userinput>create table big5 (big5 char(1) character set
big5);</userinput>
+mysql> <userinput>CREATE TABLE big5 (BIG5 CHAR(1) CHARACTER SET
BIG5);</userinput>
Query OK, 0 rows affected (0.13 sec)
-mysql> <userinput>insert into big5 values (0xf9dc);</userinput>
+mysql> <userinput>INSERT INTO big5 VALUES (0xf9dc);</userinput>
Query OK, 1 row affected (0.00 sec)
-mysql> <userinput>select * from big5;</userinput>
+mysql> <userinput>SELECT * FROM big5;</userinput>
+------+
| big5 |
+------+
@@ -4135,18 +4286,16 @@
1 row in set (0.02 sec)
</programlisting>
- There is a feature request for adding HKSCS extensions (bug
- #13577 in our bugs database,
- <ulink url="http://bugs.mysql.com/bug.php?id=13577)"/>. People
- who need the extension may find the suggested patch for bug
- #13577 is of interest.
+ There is a feature request for adding <literal>HKSCS</literal>
+ extensions. People who need this extension may find the
+ suggested patch for Bug #13577 to be of interest.
</para>
</section>
<section id="cjk-faq-charset-conversion-problems">
- <title>Troubles with character-set conversions (Japanese)</title>
+ <title>What Problems Does MySQL Have with Japanese Character Set
Conversions?</title>
<indexterm type="concept">
<primary>CJK (Chinese, Japanese, Korean)</primary>
@@ -4170,17 +4319,36 @@
between character sets. For example, there might be a Unix
server (typically with <literal>sjis</literal> or
<literal>ujis</literal>) and a Windows client (typically with
- <literal>cp932</literal>). But conversions can seem to fail.
- Here's why. In this conversion table, the
- <literal>ucs2</literal> column is the source, and the
-
<literal>sjis</literal>/<literal>cp932</literal>/<literal>ujis</literal>/<literal>eucjpms</literal>
- columns are the destination, that is, what the hexadecimal
- result would be if we used <literal>CONVERT(ucs2)</literal> or
- if we assigned a <literal>ucs2</literal> column containing the
- value to an
-
<literal>sjis</literal>/<literal>cp932</literal>/<literal>ujis</literal>/<literal>eucjpms</literal>
- column.
+ <literal>cp932</literal>).
+ </para>
+ <qandaset>
+
+ <qandaentry>
+
+ <question>
+
+ <para>
+ Why do such conversions appear to fail?
+ </para>
+
+ </question>
+
+ <answer>
+
+ <para>
+ Here's why. In this conversion table, the
+ <literal>ucs2</literal> column is the source, and the
+ <literal>sjis</literal>, <literal>cp932</literal>,
+ <literal>ujis</literal>, and
<literal>eucjpms</literal>
+ columns are the destinations — that is, they
+ represent what the hexadecimal result would be if we used
+ <literal>CONVERT(ucs2)</literal> or if we assigned a
+ <literal>ucs2</literal> column containing the value to an
+ <literal>sjis</literal>, <literal>cp932</literal>,
+ <literal>ujis</literal>, or
<literal>eucjpms</literal>
+ column.
+
<programlisting>
character name ucs2 sjis cp932 ujis eucjpms
-------------- ---- ---- ---- ---- -------
@@ -4219,7 +4387,7 @@
FULLWIDTH NOT SIGN FFE2 3F 81CA 3F A2CC
</programlisting>
- For example, consider this extract from the table:
+ For example, consider this extract from the table:
<programlisting>
ucs2 sjis cp932
@@ -4228,20 +4396,29 @@
FULLWIDTH NOT SIGN FFE2 3F 81CA
</programlisting>
- It means <quote>for NOT SIGN which is Unicode U+00AC, MySQL
- converts to sjis code point 0x81CA and to cp932 code point
- 3F</quote>. (<literal>3F</literal> is question mark
- (<quote>?</quote>) and is what we always use when we can't
- convert.) Now, what should we do if we want to convert
- <literal>sjis 81CA</literal> to <literal>cp932</literal>?
Our
- answer is: <quote>?</quote>. There are serious complaints about
- this, many people would prefer a <quote>loose</quote>
- conversion, so that <literal>81CA (NOT SIGN)</literal> in
- <literal>sjis</literal> becomes <literal>81CA (FULLWIDTH NOT
- SIGN)</literal> in <literal>cp932</literal>. We are considering
- changing.
- </para>
+ It means that MySQL converts the NOT SIGN (Unicode
+ <literal>U+00AC</literal>) to
<literal>sjis</literal> code
+ point <literal>0x81CA</literal> and to
+ <literal>cp932</literal> code point
<literal>3F</literal>.
+ (<literal>3F</literal> is the question mark
+ (<quote>?</quote>) — this is what we always use when
+ we can't convert.) Now, what should we do if we want to
+ convert <literal>sjis 81CA</literal> to
+ <literal>cp932</literal>? Our answer is:
<quote>?</quote>.
+ There are serious complaints about this, many people would
+ prefer a <quote>loose</quote> conversion, so that
+ <literal>81CA (NOT SIGN)</literal> in
+ <literal>sjis</literal> becomes <literal>81CA (FULLWIDTH
+ NOT SIGN)</literal> in <literal>cp932</literal>. We are
+ considering changing.
+ </para>
+ </answer>
+
+ </qandaentry>
+
+ </qandaset>
+
</section>
<section id="cjk-faq-great-yen-sign-problem">
| Thread |
|---|
| • svn commit - mysqldoc@docsrva: r3700 - trunk/refman-5.0 | jon | 23 Oct |