List: General Discussion
From: Tierlieb
Date: March 7 2011 5:30pm
Subject: Chinese sorting and filtering on a table with charset utf-8
Hello there!

I'm trying to get some software ready to work with Chinese/Hanzi
characters/graphemes/sinograms. My tables, using MySQL 5.x, use UTF-8 for
all text data. It is not an option to change that for a local Chinese
installation.

Storing data works quite nicely. That's what I like about UTF-8.

But sorting does not work and, related to that, neither do comparisons.
I'm getting the hopefully silly notion that there is no collation for
CHARSET=UTF-8 that could sort according to the rules of GB2312 (which,
if it were true, would probably be an interesting story to explain).

This leads me to fear that I might have to use a Pinyin translation
library (like pinyin4j, which only translates character by character and
is not bijective), store the result in a separate column and sort by that.
This would only give approximate results: without a huge dictionary
backing it, the lack of a bijective mapping between Hanzi and Pinyin
loses information.

Or I could create a separate column with GB2312 as encoding and sort
by this using the collation I want (gb2312_chinese_ci most likely). The
latter would provide better sorting than the former, but require me to
do the same thing for other countries, too (I assume there'd be people
interested in having Big5, and that is still the same language...).

Or I could do it one level above the database layer - which is possible, 
but I assume any code I could write would be less sophisticated (and 
thereby slower) than what the database can do.
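To make the application-layer option concrete, here is a minimal Python sketch of what I have in mind. It assumes that sorting by the raw GB2312 byte sequence is good enough, which holds roughly for the common (level 1) characters because GB2312 lists those in Pinyin order. The function name gb2312_sort_key and the sample data are made up for illustration, and anything that cannot be encoded in GB2312 is simply pushed to the end, which is my own choice and not what any MySQL collation does.

```python
def gb2312_sort_key(text: str) -> tuple:
    """Hypothetical helper: order text by its GB2312 byte sequence."""
    try:
        # Level 1 GB2312 characters are laid out in Pinyin order, so their
        # encoded bytes give an approximate Pinyin sort.
        return (0, text.encode("gb2312"))
    except UnicodeEncodeError:
        # Not representable in GB2312 (e.g. traditional-only characters):
        # sort after everything else, by UTF-8 bytes (code point order).
        return (1, text.encode("utf-8"))


names = ["张", "李", "王", "阿", "陈"]

# Default string comparison in Python orders by Unicode code point.
by_code_point = sorted(names)

# Sorting with the GB2312 key: a, chen, li, wang, zhang.
by_gb2312 = sorted(names, key=gb2312_sort_key)

print(by_code_point)
print(by_gb2312)
```

This shows the Big5 problem as well: a traditional character such as 張 has no GB2312 encoding, so it falls into the fallback group instead of sorting next to 张.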

Previous research: I've seen the threads "Chinese order by with utf8" 
and "Chinese and MySQL / UTF8 and versions" which are a bit older than 
I'd like and also don't seem conclusive.

So what are my options here? I'm not married to the idea of using GB2312
- I would be willing to use any other Chinese collation, as long as it
is better than the default one (which sorts by numeric code point value).

Thanks in advance,
Tierlieb