On Mon, 11 Apr 2005, Dan Bolser wrote:
>
>Requirement:
>
>Given two columns of a table (Column1 and Column2) of length x, return two
>columns (Column1_Scram and Column2_Scram) such that the distributions of
>values in Column1 and Column2 are preserved in Column1_Scram and
>Column2_Scram, but the pairs of values are randomized.
>
>
>Solution suggested by Shawn Green:
>
>Create a table with two columns, and populate this table with random pairs
>of primary keys picked from the original table. Additionally, allow no
>duplicate primary keys within either column. Select x rows from this
>table, and link both primary keys (the primary key pair) back to the
>original table to get the appropriate number of randomized pairs of
>Column1 and Column2.
>
>He suggests doing the above like this (more or less):
>
>OriginalTable
>PK A B
>1 a c
>2 a d
>3 b e
>...
>
>CREATE TEMPORARY TABLE IntermediateTable (
> PK1 INT NOT NULL,
> A CHAR(1) NOT NULL,
> PK2 INT NOT NULL,
> B CHAR(1) NOT NULL,
> #
> UNIQUE INDEX (PK1,A),
> UNIQUE INDEX (PK2,B)
>);
>
>INSERT IGNORE INTO IntermediateTable
>SELECT
> x.PK, x.A,
> y.PK, y.B
>FROM
> OriginalTable x,
> OriginalTable y
>ORDER BY
> RAND();
>
>SELECT
> x.A,
> y.B
>FROM
> IntermediateTable
>INNER JOIN
> OriginalTable x ON (PK1 = x.PK) INNER JOIN
> OriginalTable y ON (PK2 = y.PK)
>LIMIT
> the_length_of_OriginalTable;
>
>
>The problem with this solution:
>
>Its too slow on reasonable sized tables!
Their is also a problem with the way RAND() works...
SELECT
x.PK, x.A,
y.PK, y.B
FROM
OriginalTable x,
OriginalTable y
ORDER BY
RAND()
LIMIT
1;
This takes soooooo long to pick a random row. Some cleaver 'LIMIT'
optimization could pick a results set almost instantly, instead of taking
in excess of half an hour with ~50,000 rows.