List:General Discussion« Previous MessageNext Message »
From:Michael Widenius Date:March 15 1999 6:46pm
Subject:Re: DISTINCT weirdness..
View as plain text  
Hi!

>>>>> "Martin" == Martin Ramsch <m.ramsch@stripped> writes:

Martin> On Mo, 1999-03-15 09:12:30 -0600,
Martin> Jay <3pound@stripped> wrote:
>> It only finds 1 dupe out of 3234 rows, where as SELECT DISTINCT .. into
>> outfile 'foo' counts 2937 to the file.. so, 297 dupes. Right? (That is if a
>> dupe is A = B)
>> 
>> This is bizarre to me, I trust my script.. the single dupe it found is* 100%
>> equal to another record.

Martin> I trust your script, too! :-)

Martin> You might see the same problem I discovered last week:

Martin>   The compiled-in sorting order not only influences sorting
Martin>   but recognition of duplicates, too!

Martin>   (Default being iso-8859-1 with sorting the Swedish way,
Martin>    for details see "strings/ctype-*' in the MySQL sources.)

Martin>   So if sorting treats 'a' and 'ä' (a umlaut) as the same,
Martin>   then also 'Bar' and 'Bär' are seen as duplicates.

The above isn't completely true: 'Bar' and 'Bär' are treated as
distinct values , while 'BAR' and 'bar' isn't distinct

mysql> select "Bar" = "Bär","BAR"="bar", BINARY "BAR" = "bar";
+---------------+-------------+----------------------+
| "Bar" = "Bär" | "BAR"="bar" | BINARY "BAR" = "bar" |
+---------------+-------------+----------------------+
|             0 |           1 |                    0 |
+---------------+-------------+----------------------+
1 row in set (0.02 sec)

However 'e' and 'é' have in Swedish the same sort value, and because
of this 'e' and 'é' are treaded as the same character when it comes to 
comparisons.

Martin>   Additionally (and independent of the sorting-order) comparisons
Martin>   are case independent by default.

Martin> You can get rid of both case independence and sorting equivalences by
Martin> declaring the affected text column fields as "BINARY".  Search the
Martin> manual for this keyword for more information or have a look at the
Martin> manual excerpts at the end of this message.

Martin> Unfortunately there seems to be no way to avoid changes to the table
Martin> definition.  Either a text field _is_ binary, then only identical
Martin> values can be duplicates, but sorting for this field is in "ASCII"
Martin> order (actually iso-8859-1).  Or it is _not_ binary, then you might
Martin> get duplicates where one shouldn't expect them.

In MySQL 3.23 you will be able to do:

SELECT DISTINCT BINARY text_column FROM TABLE;

(The BINARY attribute casts the text_column to a binary column, that
is sorted / compared according to the ASCII values for the individual
characters)

Martin> I'm not quite sure, if this behaviour is a bug or an intended (but --
Martin> in my opinion -- mis-designed) feature.

Martin> Or maybe I'm completely wrong?  "Official" statements from Monty et al
Martin> regarding this subject would be very appreciated.

I appreciate any ideas how to do it better.  The problem is to make
everything 'hold' together.  I think that the DISTINCT should compare
the strings the same way as the normal compare operations, and this
put some restrictions how things can be solved.

Regards,
Monty
Thread
DISTINCT weirdness..Jay15 Mar
  • Re: DISTINCT weirdness..Scott Mackie15 Mar
  • Re: DISTINCT weirdness..Martin Ramsch15 Mar
    • Re: DISTINCT weirdness..Michael Widenius15 Mar
      • Re: DISTINCT weirdness..Martin Ramsch17 Mar
  • Re: DISTINCT weirdness..Martin B. Jespersen16 Mar