I installed mysql-essential-5.0.18-win32.msi (under Win XP) in order to
use it with Bugzilla. Since I had some problem with importing old
Bugzilla dumps containing some special hungarian vowels (they are
outside of latin1 characterset), I checked how mysql stores these
characters (vowels) inside.
I stored the following string (entering a new bug in Bugzilla) into
database (hopefully you can read it):
These 9-9 vowels are used in the hungarian alpabet (small and capital
To my great surprise in mysql its hexadump is (select hex(column)):
It is longer (152 characters) that I expected (18x2x2 = 72 characters).
Here I cut the string into as many lines as vowels, moreover I placed
into brackets the UTF8 codes that should be used normally instead of them.
C383C2A1 (C3A1) LATIN SMALL LETTER A WITH ACUTE
C383C2A9 (C3A9) LATIN SMALL LETTER E WITH ACUTE
C383C2AD (C3AD) LATIN SMALL LETTER I WITH ACUTE
C383C2B3 (C3B3) LATIN SMALL LETTER O WITH ACUTE
C383C2B6 (C3B6) LATIN SMALL LETTER O WITH DIAERESIS
C385E28098 (C591) LATIN SMALL LETTER O WITH DOUBLE ACUTE
C383C2BA (C3BA) LATIN SMALL LETTER U WITH ACUTE
C383C2BC (C3BC) LATIN SMALL LETTER U WITH DIAERESIS
C385C2B1 (C5B1) LATIN SMALL LETTER U WITH DOUBLE ACUTE
C383C281 (C381) LATIN CAPITAL LETTER A WITH ACUTE
C383E280B0 (C389) LATIN CAPITAL LETTER E WITH ACUTE
C383C28D (C38D) LATIN CAPITAL LETTER I WITH ACUTE
C383E2809C (C393) LATIN CAPITAL LETTER O WITH ACUTE
C383E28093 (C396) LATIN CAPITAL LETTER O WITH DIAERESIS
C385C290 (C590) LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
C383C5A1 (C39A) LATIN CAPITAL LETTER U WITH ACUTE
C383C593 (C39C) LATIN CAPITAL LETTER U WITH DIAERESIS
C385C2B0 (C5B0) LATIN CAPITAL LETTER U WITH DOUBLE ACUTE
The letters should be consumed just 2 bytes, but in mysql all of them
uses 4 or 5(!!!) bytes. Does MySql 5.0 store the UTF8 codes not
directly? Since Bugzilla displays the entered bug correctly (it reads
the vowels properly from database), I think there must be some rule in
mysql storing utf8 codes. When I make a (mysql)dump from the database,
also these long codes are inside of the dump.
The old dumps were made still with mysql-essential-4.1.12-win32.msi and
they contain proper UTF8 codes (with 2 bytes length per hungarian
special vowel), so I do not know why MySQL 5.0 stores these vowels
differently than MySQL 4.1.