List:Bugs« Previous MessageNext Message »
From:Linus Upson Date:June 30 2004 6:37pm
Subject:unique key and utf8
View as plain text  
The following was tested with MySQL 4.1.1 running on RedHat ES 3. This message was sent in UTF-8.

The unique key index prefix length is misleading when you use the utf8 character set. If you specify a prefix of N characters the uniqueness constraint appears to be enforced for the first 3*N bytes of column, not the first N characters. I assume the constant 3 comes from the fact that mysql uses a maximum of 3 bytes to encode a utf8 character.

The following example demonstrates the problem:

CREATE TABLE `t` (
  `c` varchar(20) NOT NULL default '',
  UNIQUE KEY `c` (`c`(1))
) TYPE=MyISAM DEFAULT CHARSET=utf8

insert into `t` values ('12345');
insert into `t` values ('1----');
insert into `t` values ('12---');
insert into `t` values ('123--');
# ERROR 1062 (23000): Duplicate entry '123' for key 1

insert into `t` values ('fü');
insert into `t` values ('fübar');
# ERROR 1062 (23000): Duplicate entry 'fü' for key 1

Even though the unique prefix is specified as 1 character, the first set of insert statements show that mysql won't detect a duplicate entry until you try to insert 3 ASCII characters. The second set of insert statements show that mysql will hit the uniqueness constraint at 2 characters if one of them is encoded with two utf8 bytes. It is also possible for the index cover only part of a multibyte encoded character.

I discovered this problem because mysql silently shortened the unique index prefix on one of my utf8 varchar columns. When I execute the following statement:

CREATE TABLE `t` (
  `c` varchar(150) NOT NULL default '',
  UNIQUE KEY `c` (`c`(150))
) TYPE=MyISAM DEFAULT CHARSET=utf8

there are no errors or warnings, but the result is:

mysql> show create table t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `c` varchar(150) NOT NULL default '',
  UNIQUE KEY `c` (`c`(64))
) TYPE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

The 150 character prefix should still fit within the 500 byte index limit of 4.1.1 since 3 * 150 = 450, but for some reason the prefix has been reduced to 64.

There are three possible bug reports:

1. Since mysql silently converts the index prefix from characters to bytes, it is impossible to have a uniqueness constraint with a predictable number of characters unless the entire column is indexed or a fixed width character encoding is used.

2. The documentation isn't clear about what the unique index prefix implies for multibyte character sets.

3. There should a warning when mysql shortens a unique constraint. An even better fix would have mysql index the entire column.

Thanks,
Linus
Thread
unique key and utf8Linus Upson30 Jun
  • Re: unique key and utf8Sergei Golubchik30 Jun
    • RE: unique key and utf8Linus Upson4 Jul
      • Re: unique key and utf8Sinisa Milivojevic12 Jul