List: Internals
From: Eric Prud'hommeaux
Date: June 8 2005 8:08am
Subject: Re: UTF-8 Character set question
On Tue, Jun 07, 2005 at 01:54:00PM -0400, Barbara Deaton wrote:
> I have a MySQL 4.1.7 C API application that is having problems with displaying UTF-8
> characters.  I was trying to verify what the data should be when I noticed that the
> MySQL command line client wasn't displaying the data correctly either.
>
> I installed the MySQL 4.1.12 client on a Japanese windows OS and I am connecting to a
> Japanese server.  From the command line when I try to read data from the table I get:
>
> mysql> select label from label;
> +-----------------------------+
> | label                       |
> +-----------------------------+
> | Customer Satisfaction Index |
> | Actual                      |
> | template_jpnmai             |
> | ???                         |
> | DEFAULT CULTURE             |
> | ???????                     |
> | ??                          |


I note that you're getting the correct number of question marks. This
indicates that you've got all the encoding bits in order: each kanji is
being recognized as a single character and only then replaced by one
'?'. (Otherwise, there'd be little correlation between the number of
kanji and the number of '?'s displayed.) You might want to suspect your
terminal. I'm using bash in xterm on a Debian box with
  LANG=en_US
  LC_CTYPE=en_US.UTF-8
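
A minimal sketch of the question-mark reasoning, in plain Python rather than anything MySQL-specific. It assumes the substitution happens when correctly decoded text is converted to a character set with no kanji; the `ascii` target is only a stand-in for whatever charset the client is really using, and the helper names are hypothetical.

```python
# Labels taken from the INSERT statement quoted in this thread.
labels = ["日本語", "コンタクト情報", "目標"]


def replaced(text: str) -> str:
    """Convert to a charset without kanji: each unmappable character becomes one '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def misread(text: str) -> str:
    """Misread the raw UTF-8 bytes as latin-1: one displayed character per byte."""
    return text.encode("utf-8").decode("latin-1")


for label in labels:
    # Character count, the '?' rendering, and the length a byte-level misread would show.
    print(len(label), replaced(label), len(misread(label)))
```

If the bytes were being misinterpreted somewhere upstream, the display would follow the second length (three per kanji here), not the tidy one-per-character pattern in the quoted output.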

> The insert statement, used to insert the data is:
>
> INSERT INTO `label` VALUES
> ('bf30ad53-ac19-7ef8-01fe-e6fcf369cffe','0939f271-37d4-4cca-9c0b-953aa9b9476c','Customer Satisfaction Index'),
> ('92186b04-c925-43b9-bfa9-42e106f48252','1b414a4f-edd9-4f39-9a3a-3d49470f12b3','Actual'),
> ('1c27d26d-ac1a-1959-00e3-eda6ac3640b8','1c27d1b1-ac1a-1959-00e3-eda63367a1f7','template_jpnmai'),
> ('1c27d26d-ac1a-1959-00e3-eda6ac3640b8','1c27d26d-ac1a-1959-00e3-eda6ac3640b8','日本語'),
> ('DEFAULT_CULTURE_UNSET','1c27d26d-ac1a-1959-00e3-eda6ac3640b8','DEFAULT CULTURE'),
> ('1c27d26d-ac1a-1959-00e3-eda6ac3640b8','1c2a2cf7-ac1a-1959-00e3-eda6d3682288','コンタクト情報'),
> ('1c27d26d-ac1a-1959-00e3-eda6ac3640b8','1c2a79de-ac1a-1959-00e3-eda6f861a069','目標')
>
> Is there a client side option that I should be setting to let it know that the
> character set is UTF-8?  I thought this was all handled by the server?

There is a variable called character_set, whose value you presumably get
to pick from the list in character_sets.

mysql> SHOW VARIABLES...
| character_set  | latin1 |
| character_sets | latin1 big5 cp1251 cp1257 croat czech danish dec8 dos estonia euc_kr
                   gb2312 gbk german1 greek hebrew hp8 hungarian koi8_ru koi8_ukr latin1_de
                   latin2 latin5 sjis swe7 tis620 ujis usa7 win1250 win1251ukr win1251 |

Most of those are 8-bit, but some of them, like sjis, are variable
width. None that I noticed covers all of Unicode. I stuck with latin1
(which is to say, I did nothing), as it uses all of 0x00-0xFF, meaning
that any UTF-8 byte sequence is also a valid sequence of latin1 bytes
and so passes through unchanged.
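
The pass-through idea can be sketched in Python. This is an illustration under assumptions, not a description of the server's internals: Python's `latin-1` codec stands in for the server's latin1 (the two are not guaranteed to be defined identically), and the function names are hypothetical.

```python
def store_as_latin1(utf8_bytes: bytes) -> str:
    """Treat incoming bytes as latin-1 text; every value 0x00-0xFF is accepted."""
    return utf8_bytes.decode("latin-1")


def fetch_from_latin1(column_value: str) -> bytes:
    """Hand the same bytes back on the way out."""
    return column_value.encode("latin-1")


original = "コンタクト情報"
wire = original.encode("utf-8")      # what a UTF-8 terminal sends
stored = store_as_latin1(wire)       # what a latin1 column believes it holds
returned = fetch_from_latin1(stored)

assert returned == wire              # byte-for-byte round trip
print(returned.decode("utf-8"))      # a UTF-8 terminal renders the original text
print(len(original), len(stored))    # characters as typed vs. characters as stored
```

The catch with this approach is visible in the last line: the column holds one "character" per byte, so length limits and sort order operate on UTF-8 bytes, not on the characters that were typed.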

In the end, I don't have a nice collation to make かがきぎくぐけげこご
sort in the correct order, but at least what I put in is what I get
out:

mysql> CREATE TABLE label (foo CHAR(36), bar CHAR(36), label VARCHAR(40));
...
INSERT...
mysql> select label from label;
+-----------------------------+
| label                       |
+-----------------------------+
| Customer Satisfaction Index |
| Actual                      |
| template_jpnmai             |
| 日本語                   |
| DEFAULT CULTURE             |
| コンタクト情報       |
| 目標                      |
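
The ragged right edge on the Japanese rows appears consistent with the client padding each cell by byte count, which is what one would expect if it regards the data as latin1. A small Python sketch of that assumption follows; the padding rule is a guess at the behavior, not taken from the client's source.

```python
COLUMN_WIDTH = len("Customer Satisfaction Index")  # widest label in the table


def row_padded_by_bytes(label: str) -> str:
    """Pad a cell as if every UTF-8 byte occupied one display column."""
    pad = COLUMN_WIDTH - len(label.encode("utf-8"))
    return "| " + label + " " * pad + " |"


for label in ["Actual", "日本語", "コンタクト情報", "目標"]:
    print(row_padded_by_bytes(label))
```

ASCII rows line up because bytes and characters coincide; each kanji or kana takes three bytes in UTF-8, so those rows receive less padding than their display width needs.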
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@stripped)
Feel free to forward this message to any list for any purpose other than
email address distribution.
