From: Martin Ramsch Date: March 15 1999 5:12pm Subject: Re: DISTINCT weirdness.. List-Archive: http://lists.mysql.com/mysql/269 Message-Id: <19990315181233.A7850@forwiss.uni-passau.de> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="lrZ03NoBR/3+SXJZ" Content-Transfer-Encoding: 8bit --lrZ03NoBR/3+SXJZ Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit On Mo, 1999-03-15 09:12:30 -0600, Jay <3pound@stripped> wrote: > It only finds 1 dupe out of 3234 rows, where as SELECT DISTINCT .. into > outfile 'foo' counts 2937 to the file.. so, 297 dupes. Right? (That is if a > dupe is A = B) > > This is bizarre to me, I trust my script.. the single dupe it found is* 100% > equal to another record. I trust your script, too! :-) You might see the same problem I discovered last week: The compiled-in sorting order not only influences sorting but recognition of duplicates, too! (Default being iso-8859-1 with sorting the Swedish way, for details see "strings/ctype-*' in the MySQL sources.) So if sorting treats 'a' and 'ä' (a umlaut) as the same, then also 'Bar' and 'Bär' are seen as duplicates. Additionally (and independent of the sorting-order) comparisons are case independent by default. You can get rid of both case independence and sorting equivalences by declaring the affected text column fields as "BINARY". Search the manual for this keyword for more information or have a look at the manual excerpts at the end of this message. Unfortunately there seems to be no way to avoid changes to the table definition. Either a text field _is_ binary, then only identical values can be duplicates, but sorting for this field is in "ASCII" order (actually iso-8859-1). Or it is _not_ binary, then you might get duplicates where one shouldn't expect them. I'm not quite sure, if this behaviour is a bug or an intended (but -- in my opinion -- mis-designed) feature. Or maybe I'm completely wrong? "Official" statements from Monty et al regarding this subject would be very appreciated. To test yourself more in detail what's actually going on, you might try the following SQL statements: DROP TABLE IF EXISTS test.chars; CREATE TABLE test.chars ( c1 CHAR(1), c2 CHAR(1) ); INSERT INTO test.chars VALUES ... {list of all possible pairs of characters} ... ("þ", "ÿ"); DROP TABLE IF EXISTS test.duplicate; CREATE TABLE test.duplicate ( t1 INT, c1 CHAR(1), t2 INT, c2 CHAR(1) ); INSERT INTO test.duplicate SELECT DISTINCT ASCII(LOWER(c1)), LOWER(c1), ASCII(LOWER(c2)), LOWER(c2) FROM test.chars WHERE c1 = c2 AND ASCII(LOWER(c1)) < ASCII(LOWER(c2)); SELECT * FROM test.duplicate GROUP BY t2; To create the full list of pairs of characters I wrote a little perl script, see attachment. With the above SQL you should get a list of all character groups that are seen as equivalent for sorting and duplicates. Regards, Martin Quotes from MySQL Manual, V3.22.19b =================================== | 1.4 The main features of MySQL | [...] | * All data are saved in ISO-8859-1 Latin1 format. All comparisons for | normal string columns are case insensitive. | * Sorting is done according to the ISO-8859-1 Latin1 character set | (the Swedish way at the moment). It is possible to change this in | the source by adding new sort order arrays. To see an example of | very advanced sorting, look at the Czech sorting code. MySQL | supports many different character sets that can be specified at | compile time. | 5.1 MySQL extensions to ANSI SQL92 | [...] | * All string comparisons are case insensitive by default, with | sort ordering determined by the current character set (ISO-8859-1 | Latin1 by default). If you don't like this, you should declare your | columns with the BINARY attribute, which causes comparisons to be | done according to the ASCII order used on the MySQL server host. | 4.7.3 Typical configure options | [...] | * By default, MySQL uses the ISO-8859-1 (Latin1) character set. [...] | CHARSET may be one of big5, czech, danish, dec8, dos, german1, | hebrew, hp8, hungarian, koi8_ru, ru, latin1, latin2, sjis, swe7, | tis620, ujis, usa7 or win1251. [...] Note that if you want to | change the character set, you must do a make distclean between | configurations! | 9.1.1 The character set used for data and sorting | | By default, MySQL uses the ISO-8859-1 (Latin1) character set. This is | the character set used in the USA and western Europe. | The character set determines what characters are allowed in names | and how things are sorted by the ORDER BY and GROUP BY clauses of the | SELECT statement. | You can change the character set at compile time by using the | --with-charset=charset option to configure. [...] | 16.13 16.13 Case sensitivity in searches | | By default, MySQL searches are case-insensitive (although there are | some character sets that are never case insensitive, such as | czech). That means that if you search with col_name LIKE 'a%', you | will get all column values that start with A or a. If you want to make | this search case-sensitive, use something like INDEX(col_name, "A")=0 | to check a prefix. Or use STRCMP(col_name, "A") = 0 if the column | value must be exactly "A". | | Simple comparison operations (>=, >, = , < , <=, sorting and grouping) | are based on each character's "sort value". Characters with the same | sort value (like E, e and 'e) are treated as the same character! | | LIKE comparisons are done on the uppercase value of each character | (E == e but E <> 'e) | | If you want a column always to be treated in case-sensitive fashion, | declare it as BINARY. -- Martin Ramsch PGP KeyID=0xE8EF4F75 FiPr=52 44 5E F3 B0 B1 38 26 E4 EC 80 58 7B 31 3A D7 --lrZ03NoBR/3+SXJZ Content-Type: application/x-perl Content-Disposition: attachment; filename="dupltest.pl" #!/usr/local/bin/perl print <