After doing much research I am at a loss how really utf8_bin and
utf8_general_ci differ, besides the obvious -- sort order.
What I inferred from MySQL docs and some online articles is the following.
utf8_bin is pretty much what it suggests, utf8 strings are stored as bytes
and MySQL server compares/sorts these on single byte basis, where the value
of each byte determines sort order. Make sense to me as utf8 is multi-byte
So what is utf8_general_ci collation all about, since it is DEFAULT
collation for utf8 character set. But here is the kicker, it does not
implement all utf8 characters, especially for languages like jp, ko, arabic,
and other right to left languages.
I know for fact ucs2_general_uca (default collation for ucs2 Unicode
encoding) has only partial Unicode support, so my guess is, the same may be
true for utf8_general_ci collation.
The only way I can get full Unicode support is to use utf8_bin collation as
utf8_general_ci messes some characters? (chars ?? are returned from the db)
My question is, has anyone else run into similar issues with collations?
Especially, how can utf8_general_ci not support all characters but be the
default collation for utf8?
The only way to get things to work is to set charset to utf8 for everything,
and utf8_bin collation for everything as well. (connection, db, tables)
I had to modify the mysql connector 1.0.4 to include:
SET collation_connection = @@collation_database;
# database collation is utf8_bin