The following was tested with MySQL 4.1.1 running on RedHat ES 3. This message was sent in UTF-8.
The unique key index prefix length is misleading when you use the utf8 character set. If you specify a prefix of N characters the uniqueness constraint appears to be enforced for the first 3*N bytes of column, not the first N characters. I assume the constant 3 comes from the fact that mysql uses a maximum of 3 bytes to encode a utf8 character.
The following example demonstrates the problem:
CREATE TABLE `t` (
`c` varchar(20) NOT NULL default '',
UNIQUE KEY `c` (`c`(1))
) TYPE=MyISAM DEFAULT CHARSET=utf8
insert into `t` values ('12345');
insert into `t` values ('1----');
insert into `t` values ('12---');
insert into `t` values ('123--');
# ERROR 1062 (23000): Duplicate entry '123' for key 1
insert into `t` values ('fü');
insert into `t` values ('fübar');
# ERROR 1062 (23000): Duplicate entry 'fü' for key 1
Even though the unique prefix is specified as 1 character, the first set of insert statements show that mysql won't detect a duplicate entry until you try to insert 3 ASCII characters. The second set of insert statements show that mysql will hit the uniqueness constraint at 2 characters if one of them is encoded with two utf8 bytes. It is also possible for the index cover only part of a multibyte encoded character.
I discovered this problem because mysql silently shortened the unique index prefix on one of my utf8 varchar columns. When I execute the following statement:
CREATE TABLE `t` (
`c` varchar(150) NOT NULL default '',
UNIQUE KEY `c` (`c`(150))
) TYPE=MyISAM DEFAULT CHARSET=utf8
there are no errors or warnings, but the result is:
mysql> show create table t \G
*************************** 1. row ***************************
Table: t
Create Table: CREATE TABLE `t` (
`c` varchar(150) NOT NULL default '',
UNIQUE KEY `c` (`c`(64))
) TYPE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
The 150 character prefix should still fit within the 500 byte index limit of 4.1.1 since 3 * 150 = 450, but for some reason the prefix has been reduced to 64.
There are three possible bug reports:
1. Since mysql silently converts the index prefix from characters to bytes, it is impossible to have a uniqueness constraint with a predictable number of characters unless the entire column is indexed or a fixed width character encoding is used.
2. The documentation isn't clear about what the unique index prefix implies for multibyte character sets.
3. There should a warning when mysql shortens a unique constraint. An even better fix would have mysql index the entire column.
Thanks,
Linus