List:General Discussion« Previous MessageNext Message »
From:Peter Gulutzan Date:January 31 2007 5:48pm
Subject:Re: Hungarian collation
View as plain text  
On Tue, 2006-10-17 at 10:15 -0600, Peter Gulutzan wrote:
> Hi,
> 
> MySQL is looking for an authoritative, official statement
> which states all the current Hungarian collation rules.
> Please let other MySQL-using Hungarians (especially if you
> know a user group in Hungary) know about these
> questions. Best of all would be a translation of the
> Hungarian government standard, if there is one.
<cut>

I'm happy to say that several Hungarian experts
(acknowledged in the document below) helped us.
The proper direction for MySQL is clearer now.

I'm unhappy to say that implementing the new
Hungarian collations will be difficult. One
of the rules requires support for multiple
weights. That is a separate task for which
MySQL will need a long, long time.

The rest of this letter is "Worklog task
high level specification WL#2993" titled
"Add Hungarian collations for cp1250, latin2,
utf8, ucs2".

MySQL failed to find an authoritative, official statement
which states all the current Hungarian collation rules.
There is one, but it's not translated from Hungarian
(see "References" section later in this document).

MySQL also asked for help from Hungarians on lists.mysql.com.
Thanks to Peter Ritzinger, Csongor Fagyal, József Kozell,
András Bártházi, ImRe, and all others who helped.
This document reflects their interpretation of the standard
and their opinions, though no consensus exists.

1. Not deprecating latin2_hungarian_ci

The current latin2_hungarian_ci collation
is a chart in sql/share/charsets/latin2.xml.
Alexander Barkov has provided an easy-to-read web page: 
http://myoffice.izhnet.ru/bar/~bar/charts/latin2_hungarian_ci.html
This collation is unlike the Hungarian dictionaries,
collation descriptions, or products that we've seen.
For example the first letter is:
Latin Capital Letter A
= Latin Small Letter A
= control Single Shift 3
= No-Break Space
= Latin Small Letter L with caron
= Latin Capital Letter A with acute
= Latin Small Letter A with acute
But there is no reason that small L with caron
(which is Slovak not Hungarian) ever sorts with
A, there's some dispute whether A with acute
should sort with A, and all other
accented variants of A should be in this list too.
MySQL considered deprecating this
collation (which implies that MySQL will eventually
remove it), after introducing a new and more correct
one. However, that deprecation will not occur.

2. Hungarian alphabet chart

This is the chart of the Hungarian alphabet;
a á b c cs d dz dzs e é f g gy h i í j k l ly m n
ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs

Unicode names of special characters in the chart are:
á U+00E1 LATIN SMALL LETTER A WITH ACUTE
é U+00E9 LATIN SMALL LETTER E WITH ACUTE
í U+00ED LATIN SMALL LETTER I WITH ACUTE
ó U+00F3 LATIN SMALL LETTER O WITH ACUTE
ö U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
ő U+0151 LATIN SMALL LETTER O WITH DOUBLE ACUTE
ú U+00FA LATIN SMALL LETTER U WITH ACUTE
ü U+00FC LATIN SMALL LETTER U WITH DIAERESIS
ű U+0171 LATIN SMALL LETTER U WITH DOUBLE ACUTE

Each of the items in the chart is a "Hungarian
letter". The chart is in order, except for special
cases noted in later sections. Each character has
an upper case equivalent (not shown) which has the
same position in the collating sequence -- all the
new collations will be "case insensitive".

The DOUBLE ACUTE letters ő and ű are sometimes shown
as õ and û but that is either an error or a workaround
for character sets that do not contain ő and ű.
MySQL will treat all letters that do not appear
on the chart as "non-Hungarian" letters.

3. Special short/long vowel pairs rule

These are short/long vowel pairs:
a á, e é, i í, o ó, ö ő, u ú, ü ű
That is, á is the long form of a, and so on. 

There is a special rule for these short/long
vowel pairs:
long = short usually (primary rule),
but long > short if all other letters are equal (secondary rule).

Therefore:
'BÁ'>'BA' is true
'BÁ'>'BAC' is false

To put it another way:
First compare the strings without paying attention to
short-versus-long difference. Then, if they're equal,
compare them again paying attention to short-versus-long
difference.

Comparison is left-to-right. Therefore:
'BÁA'>'BAÁ' is true

This rule is controversial. Some Hungarians disagree.
Also, Simonsen's rules say A-acute > A, etc.:
http://std.dkuug.dk/i18n/locales/hu_HU
Also, there has been argument about the same thing for glibc:
http://sources.redhat.com/ml/libc-locales/2005-q4/msg00002.html
But there seems to be agreement that the short/long
vowel pairs rule is "standard" and MySQL will have to follow it.

3. Special digraph/trigraph rule

A digraph is two letters treated as a single letter.
A trigraph is three letters treated as a single letter.

Hungarian has eight digraphs: cs dz gy ly ny sz ty zs,
and one trigraph: dzs. These letters appear according
to the order in the chart in Section 2. Therefore
'GYA' > 'GZA' is true

4. Special double digraph/trigraph rule

There is a further detail when you see the first part of a
digraph followed by the digraph. For example, in 'ggy',
'g' is the first part of 'gy' and it's followed by 'gy'
... and MySQL can treat it as a repetition of the digraph,
i.e. as if it's 'gygy'. This applies to all digraphs and
trigraphs, so:
ccs = cscs, ddz = dzdz, ddsz = dzsdzs, ggy = gygy,
lly = lyly, nny = nyny, ssz = szsz, tty = tyty, zzs = zszs.
For example,
'TTY' = 'TYTY' is true
'TTY' > 'TZ' is true (because TY > TZ).

This does not mean "cscs sorts with ccs". Instead it means
"ccs sorts with cscs". Therefore these words are in order:
CB CC CD CSCSA CCSB.

In standard Hungarian, the special digraph/trigraph rule
is not always followed, there are exceptions for
words of foreign origin. MySQL has no way to know
of such exceptions, it would always have to follow the rule
rigidly.

This is not a requirement. The implementor will decide
whether to implement the double digraph/trigraph rule.

5. Non-Hungarian rule

All characters outside the Hungarian alphabet should be done
according to UCA 4.0.0 (until MySQL switches to the newer UCA).
This merely means that MySQL will handle (say) the German
character Ä as equal to A, because that's what happens in
the basic UCA-based collation, utf8_unicode_ci.

The Hungarian Academy of Sciences site says:
Ä = A usually but but Ä > A if all other characters are equal.
(Compare the short/long vowel pairs rule.)
For example:
'galamb' < 'Gärtner' < 'gáz' and 'mosna' < 'Mošna'.
But, as stated, "MySQL will follow UCA", not Hungarian Academy
of Sciences, for non-Hungarian letters.

6. LIKE rule

There is no standard for LIKE. By preference,
SELECT * FROM t WHERE column LIKE 'ká%'
would return rows where column1 = 'kar'. Addition of
suffixes can cause letters to lose or gain accents.
Similarly, one correspondent said it "would be nice" if:
column1 LIKE 'cs%' would return rows where column1 = 'ccs'.

This is not a requirement. The implementor will decide
whether to implement the LIKE rule.

7. Rules are for all comparisons

The special short/long vowel pairs rule and the
special digraph/trigraph rule are sometimes
thought of as a "secondary" difference, but the
effect is not minor. The rules affect not only ORDER BY,
but all comparisons. It means that a unique-key index can
contain both 'BÁ' and 'BA'. It means that a search for
"WHERE hungarian_column = 'ANDRAS'" will not return
'ANDRÁS'.

Our concern at this time is only for the "primary sort", the
collation necessary for searches. The "secondary sort" or
"tertiary sort" rules, the ones that affect only ORDER BY,
are of interest but will only be of importance in the future.

8. Names

The new collations will have these names:
ucs2_hungarian2_ci      for UCS2 character set
utf8_hungarian2_ci      for UTF8 character set
latin2_hungarian2_ci    for latin2 character set
cp1250_hungarian2_ci    for cp1250 character set

9. Prerequisites

Since this task requires multiple weights, it depends on
another worklog task: WL#896 Primary, Secondary and
Tertiary Sorts. WL#896 is not scheduled for 5.1 or 5.2,
because this is not a simple matter of changing a few
lines in strings/ctype-uca.c.

10. Other DBMSs

Using an Oracle 10g table with AL16UTF16 encoding for NCHAR,
SELECT * FROM t1 ORDER BY NLSSORT(c1,NLS_SORT='xhungarian')
returned rows in this order
BAD, BÁD, BADA, BÁDA, ttz, tty, tz, ty.
This indicates that Oracle observes the short/long vowel pairs
rule and the digraph/trigraph rule, but not the double
digraph/trigraph rule.

References
----------

"Hungarian collation" discussion on lists.mysql.com
http://lists.mysql.com/mysql/202704

Bug#12519 Incorrect Hungarian collation
http://bugs.mysql.com/bug.php?id=12519

thread 'nlssort' on comp.databases.oracle.server 2002-11-10
http://www.dbforums.com/archive/index.php/t-562201.html

Blog referring to Microsoft Hungarian Technical sorting:
http://blogs.msdn.com/michkap/archive/2005/11/26/495072.aspx

Hungarian sites that cite rules:
http://bme-tk.bme.hu/other/kuszob/hangok.htm 
http://www.mszt.hu/standardsearch/detail.asp?id=007042
http://mek.oszk.hu/01500/01547/index.phtml
http://www.tug.org.in/tug2002/presentations/bujdoso.pdf

-- 
Peter Gulutzan, Senior Software Architect
MySQL AB, www.mysql.com
Office: +1 780 472-6838
Mobile: +1 780 904-0297
VoIP:   +1 408 213-6654


Thread
Hungarian collationPeter Gulutzan17 Oct
  • Re: Hungarian collationPeter Gulutzan31 Jan
RE: Hungarian collationimre19 Oct
RE: Hungarian collationPeter Gulutzan30 Oct
  • RE: Hungarian collationimre31 Oct