I'm trying to get some software ready to work with Chinese/Hanzi
characters/graphems/sinograms. My tables, using mysql 5.x, use UTF-8 for
all text data. It is not an option to change that for a local Chinese
Storing data works quite nicely. That's what I like about UTF-8.
But sorting does not work. And, related to that, comparisons neither.
I'm getting the hopefully silly notion that there is no collation for
CHARSET=UTF-8 that could sort according to the rules of GB2312 (which,
if it was true, would probably an interesting story to explain why).
This leads me to fear that I might have to use a Pinyin translation
library (like pinyin4j, which only translates character by character and
is not bijective), store that in a separate column and sort that. Which
would only give approximate results since without a huge dictionary
backing, the lack of bijective mapping between Hanzi and Pinyin will
result in a lack of information.
Or I could create a separate column with GB2312 as encoding and sort
by this using the collation I want (gb2312_chinese_ci most likely). The
latter would provider better sorting than the former, but require me to
do the same thing for other countries, too (I assume there'd be people
interested in having big5 and that is still the same language...).
Or I could do it one level above the database layer - which is possible,
but I assume any code I could write would be less sophisticated (and
thereby slower) than what the database can do.
Previous research: I've seen the threads "Chinese order by with utf8"
and "Chinese and MySQL / UTF8 and versions" which are a bit older than
I'd like and also don't seem conclusive.
So what are my options here? I'm not married to the idea of using GB2312
- I would be willing to use any other Chinese collation, as long as it
is better than the default one (which sorts by number value).
Thanks in advance,
|• Chinese sorting and filtering on an table with charset utf-8||Tierlieb||7 Mar|