On Mon, 11 Apr 2005, Dan Bolser wrote:
>Given two columns of a table (Column1 and Column2) of length x, return two
>columns (Column1_Scram and Column2_Scram) such that the distributions of
>values in Column1 and Column2 are preserved in Column1_Scram and
>Column2_Scram, but the pairs of values are randomized.
>Solution suggested by Shawn Green:
>Create a table with two columns, and populate this table with random pairs
>of primary keys picked from the original table. Additionally, allow no
>duplicate primary keys within either column. Select x rows from this
>table, and link both primary keys (the primary key pair) back to the
>original table to get the appropriate number of randomized pairs of
>Column1 and Column2.
>He suggests doing the above like this (more or less):
>PK A B
>1 a c
>2 a d
>3 b e
>CREATE TEMPORARY TABLE IntermediateTable (
> PK1 INT NOT NULL,
> A CHAR(1) NOT NULL,
> PK2 INT NOT NULL,
> B CHAR(1) NOT NULL,
> UNIQUE INDEX (PK1,A),
> UNIQUE INDEX (PK2,B)
>INSERT IGNORE INTO IntermediateTable
> x.PK, x.A,
> y.PK, y.B
> OriginalTable x,
> OriginalTable y
> OriginalTable x ON (PK1 = x.PK) INNER JOIN
> OriginalTable y ON (PK2 = y.PK)
>The problem with this solution:
>Its too slow on reasonable sized tables!
Their is also a problem with the way RAND() works...
This takes soooooo long to pick a random row. Some cleaver 'LIMIT'
optimization could pick a results set almost instantly, instead of taking
in excess of half an hour with ~50,000 rows.