List:General Discussion« Previous MessageNext Message »
From:Martin Ramsch Date:March 15 1999 5:12pm
Subject:Re: DISTINCT weirdness..
View as plain text  
On Mo, 1999-03-15 09:12:30 -0600,
Jay <3pound@stripped> wrote:
> It only finds 1 dupe out of 3234 rows, where as SELECT DISTINCT .. into
> outfile 'foo' counts 2937 to the file.. so, 297 dupes. Right? (That is if a
> dupe is A = B)
> 
> This is bizarre to me, I trust my script.. the single dupe it found is* 100%
> equal to another record.

I trust your script, too! :-)

You might see the same problem I discovered last week:

  The compiled-in sorting order not only influences sorting
  but recognition of duplicates, too!

  (Default being iso-8859-1 with sorting the Swedish way,
   for details see "strings/ctype-*' in the MySQL sources.)

  So if sorting treats 'a' and 'ä' (a umlaut) as the same,
  then also 'Bar' and 'Bär' are seen as duplicates.

  Additionally (and independent of the sorting-order) comparisons
  are case independent by default.

You can get rid of both case independence and sorting equivalences by
declaring the affected text column fields as "BINARY".  Search the
manual for this keyword for more information or have a look at the
manual excerpts at the end of this message.

Unfortunately there seems to be no way to avoid changes to the table
definition.  Either a text field _is_ binary, then only identical
values can be duplicates, but sorting for this field is in "ASCII"
order (actually iso-8859-1).  Or it is _not_ binary, then you might
get duplicates where one shouldn't expect them.

I'm not quite sure, if this behaviour is a bug or an intended (but --
in my opinion -- mis-designed) feature.

Or maybe I'm completely wrong?  "Official" statements from Monty et al
regarding this subject would be very appreciated.


To test yourself more in detail what's actually going on, you might
try the following SQL statements:

    DROP TABLE IF EXISTS test.chars;
    CREATE TABLE test.chars (
            c1      CHAR(1),
            c2      CHAR(1)
    );
    
    INSERT INTO test.chars VALUES
    ...
    {list of all possible pairs of characters}
    ...
      ("þ", "ÿ");
    
    DROP TABLE IF EXISTS test.duplicate;
    CREATE TABLE test.duplicate (
            t1      INT,
            c1      CHAR(1),
            t2      INT,
            c2      CHAR(1)
    );
    
    INSERT INTO test.duplicate
      SELECT DISTINCT
        ASCII(LOWER(c1)), LOWER(c1), ASCII(LOWER(c2)), LOWER(c2)
      FROM test.chars
      WHERE c1 = c2
        AND ASCII(LOWER(c1)) < ASCII(LOWER(c2));
    
    SELECT * FROM test.duplicate
    GROUP BY t2;

To create the full list of pairs of characters I wrote a little perl
script, see attachment.

With the above SQL you should get a list of all character groups that
are seen as equivalent for sorting and duplicates.

Regards,
  Martin


Quotes from MySQL Manual, V3.22.19b
===================================
| 1.4 The main features of MySQL
| [...]
|  * All data are saved in ISO-8859-1 Latin1 format. All comparisons for
|    normal string columns are case insensitive.
|  * Sorting is done according to the ISO-8859-1 Latin1 character set
|    (the Swedish way at the moment). It is possible to change this in
|    the source by adding new sort order arrays. To see an example of
|    very advanced sorting, look at the Czech sorting code. MySQL
|    supports many different character sets that can be specified at
|    compile time.

| 5.1 MySQL extensions to ANSI SQL92
| [...]
|  * All string comparisons are case insensitive by default, with
|    sort ordering determined by the current character set (ISO-8859-1
|    Latin1 by default). If you don't like this, you should declare your
|    columns with the BINARY attribute, which causes comparisons to be
|    done according to the ASCII order used on the MySQL server host.

| 4.7.3 Typical configure options
| [...]
|  * By default, MySQL uses the ISO-8859-1 (Latin1) character set. [...]
|    CHARSET may be one of big5, czech, danish, dec8, dos, german1,
|    hebrew, hp8, hungarian, koi8_ru, ru, latin1, latin2, sjis, swe7,
|    tis620, ujis, usa7 or win1251. [...]  Note that if you want to
|    change the character set, you must do a make distclean between
|    configurations!

| 9.1.1 The character set used for data and sorting
| 
| By default, MySQL uses the ISO-8859-1 (Latin1) character set. This is
| the character set used in the USA and western Europe.
|   The character set determines what characters are allowed in names
| and how things are sorted by the ORDER BY and GROUP BY clauses of the
| SELECT statement.
|   You can change the character set at compile time by using the
| --with-charset=charset option to configure. [...]

| 16.13 16.13 Case sensitivity in searches
| 
| By default, MySQL searches are case-insensitive (although there are
| some character sets that are never case insensitive, such as
| czech). That means that if you search with col_name LIKE 'a%', you
| will get all column values that start with A or a. If you want to make
| this search case-sensitive, use something like INDEX(col_name, "A")=0
| to check a prefix. Or use STRCMP(col_name, "A") = 0 if the column
| value must be exactly "A".
| 
| Simple comparison operations (>=, >, = , < , <=, sorting and grouping)
| are based on each character's "sort value". Characters with the same
| sort value (like E, e and 'e) are treated as the same character!
| 
| LIKE comparisons are done on the uppercase value of each character
| (E == e but E <> 'e)
| 
| If you want a column always to be treated in case-sensitive fashion,
| declare it as BINARY.

-- 
Martin Ramsch <m.ramsch@stripped> <URL: http://home.pages.de/~ramsch/ >
PGP KeyID=0xE8EF4F75 FiPr=52 44 5E F3 B0 B1 38 26  E4 EC 80 58 7B 31 3A D7

Attachment: [application/x-perl] dupltest.pl
Thread
DISTINCT weirdness..Jay15 Mar
  • Re: DISTINCT weirdness..Scott Mackie15 Mar
  • Re: DISTINCT weirdness..Martin Ramsch15 Mar
    • Re: DISTINCT weirdness..Michael Widenius15 Mar
      • Re: DISTINCT weirdness..Martin Ramsch17 Mar
  • Re: DISTINCT weirdness..Martin B. Jespersen16 Mar