List:General Discussion« Previous MessageNext Message »
From:Zsolt Branyiczky Date:March 3 2006 9:03am
Subject:mysql 5.0 and utf8 letters inside
View as plain text  
Hi,

I installed mysql-essential-5.0.18-win32.msi (under Win XP) in order to 
use it with Bugzilla. Since I had some problem with importing old 
Bugzilla dumps containing some special hungarian vowels (they are 
outside of latin1 characterset), I checked how mysql stores these 
characters (vowels) inside.

I stored the following string (entering a new bug in Bugzilla) into 
database (hopefully you can read it):
áéíóöőúüűÁÉÍÓÖŐÚÜŰ
These 9-9 vowels are used in the hungarian alpabet (small and capital 
letters).

To my great surprise in mysql its hexadump is (select hex(column)):
C383C2A1C383C2A9C383C2ADC383C2B3C383C2B6C385E28098C383C2BAC383C2BCC385C2B1C383C281C383E280B0C383C28DC383E2809CC383E28093C385C290C383C5A1C383C593C385C2B0
It is longer (152 characters) that I expected (18x2x2 = 72 characters).

Here I cut the string into as many lines as vowels, moreover I placed 
into brackets the UTF8 codes that should be used normally instead of them.

C383C2A1 (C3A1) LATIN SMALL LETTER A WITH ACUTE
C383C2A9 (C3A9) LATIN SMALL LETTER E WITH ACUTE
C383C2AD (C3AD) LATIN SMALL LETTER I WITH ACUTE
C383C2B3 (C3B3) LATIN SMALL LETTER O WITH ACUTE
C383C2B6 (C3B6) LATIN SMALL LETTER O WITH DIAERESIS
C385E28098 (C591) LATIN SMALL LETTER O WITH DOUBLE ACUTE
C383C2BA (C3BA) LATIN SMALL LETTER U WITH ACUTE
C383C2BC (C3BC) LATIN SMALL LETTER U WITH DIAERESIS
C385C2B1 (C5B1) LATIN SMALL LETTER U WITH DOUBLE ACUTE
C383C281 (C381) LATIN CAPITAL LETTER A WITH ACUTE
C383E280B0 (C389) LATIN CAPITAL LETTER E WITH ACUTE
C383C28D (C38D) LATIN CAPITAL LETTER I WITH ACUTE
C383E2809C (C393) LATIN CAPITAL LETTER O WITH ACUTE
C383E28093 (C396) LATIN CAPITAL LETTER O WITH DIAERESIS
C385C290 (C590) LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
C383C5A1 (C39A) LATIN CAPITAL LETTER U WITH ACUTE
C383C593 (C39C) LATIN CAPITAL LETTER U WITH DIAERESIS
C385C2B0 (C5B0) LATIN CAPITAL LETTER U WITH DOUBLE ACUTE

The letters should be consumed just 2 bytes, but in mysql all of them 
uses 4 or 5(!!!) bytes. Does MySql 5.0 store the UTF8 codes not 
directly? Since Bugzilla displays the entered bug correctly (it reads 
the vowels properly from database), I think there must be some rule in 
mysql storing utf8 codes. When I make a (mysql)dump from the database, 
also these long codes are inside of the dump.

The old dumps were made still with mysql-essential-4.1.12-win32.msi and 
they contain proper UTF8 codes (with 2 bytes length per hungarian 
special vowel), so I do not know why MySQL 5.0 stores these vowels 
differently than MySQL 4.1.

Regards
Zsolt
Thread
mysql 5.0 and utf8 letters insideZsolt Branyiczky3 Mar
  • Re: mysql 5.0 and utf8 letters insideZsolt Branyiczky3 Mar