From: Michael Widenius Date: March 15 1999 6:46pm Subject: Re: DISTINCT weirdness.. List-Archive: http://lists.mysql.com/mysql/276 Message-Id: <14061.21425.254631.880772@monty.pp.sci.fi> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable Hi! >>>>> "Martin" =3D=3D Martin Ramsch writes: Martin> On Mo, 1999-03-15 09:12:30 -0600, Martin> Jay <3pound@stripped> wrote: >> It only finds 1 dupe out of 3234 rows, where as SELECT DISTINCT .. i= nto >> outfile 'foo' counts 2937 to the file.. so, 297 dupes. Right? (That = is if a >> dupe is A =3D B) >>=20 >> This is bizarre to me, I trust my script.. the single dupe it found = is* 100% >> equal to another record. Martin> I trust your script, too! :-) Martin> You might see the same problem I discovered last week: Martin> The compiled-in sorting order not only influences sorting Martin> but recognition of duplicates, too! Martin> (Default being iso-8859-1 with sorting the Swedish way, Martin> for details see "strings/ctype-*' in the MySQL sources.) Martin> So if sorting treats 'a' and '=E4' (a umlaut) as the same, Martin> then also 'Bar' and 'B=E4r' are seen as duplicates. The above isn't completely true: 'Bar' and 'B=E4r' are treated as distinct values , while 'BAR' and 'bar' isn't distinct mysql> select "Bar" =3D "B=E4r","BAR"=3D"bar", BINARY "BAR" =3D "bar"; +---------------+-------------+----------------------+ | "Bar" =3D "B=E4r" | "BAR"=3D"bar" | BINARY "BAR" =3D "bar" | +---------------+-------------+----------------------+ | 0 | 1 | 0 | +---------------+-------------+----------------------+ 1 row in set (0.02 sec) However 'e' and '=E9' have in Swedish the same sort value, and because of this 'e' and '=E9' are treaded as the same character when it comes t= o=20 comparisons. Martin> Additionally (and independent of the sorting-order) compariso= ns Martin> are case independent by default. Martin> You can get rid of both case independence and sorting equivalen= ces by Martin> declaring the affected text column fields as "BINARY". Search = the Martin> manual for this keyword for more information or have a look at = the Martin> manual excerpts at the end of this message. Martin> Unfortunately there seems to be no way to avoid changes to the = table Martin> definition. Either a text field _is_ binary, then only identic= al Martin> values can be duplicates, but sorting for this field is in "ASC= II" Martin> order (actually iso-8859-1). Or it is _not_ binary, then you m= ight Martin> get duplicates where one shouldn't expect them. In MySQL 3.23 you will be able to do: SELECT DISTINCT BINARY text_column FROM TABLE; (The BINARY attribute casts the text_column to a binary column, that is sorted / compared according to the ASCII values for the individual characters) Martin> I'm not quite sure, if this behaviour is a bug or an intended (= but -- Martin> in my opinion -- mis-designed) feature. Martin> Or maybe I'm completely wrong? "Official" statements from Mont= y et al Martin> regarding this subject would be very appreciated. I appreciate any ideas how to do it better. The problem is to make everything 'hold' together. I think that the DISTINCT should compare the strings the same way as the normal compare operations, and this put some restrictions how things can be solved. Regards, Monty