List:General Discussion« Previous MessageNext Message »
From:hsv Date:January 21 2014 6:38am
Subject:Re: grouping by the difference between values in rows
View as plain text  
>>>> 2014/01/12 14:17 -0500, Larry Martell >>>>
I've been asked to do something that I do not think is possible in SQL.

I have a query that has this basic form:

SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
FROM t
GROUP BY a, b, c, d, f

x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
10053.490, 2542.094).

The business issue is that if either x or y in 2 rows that are in the
same a, b, c, d group are within 1 of each other then they should be
grouped together. And to make it more complicated, the tolerance is
applied as a rolling continuum. For example, if the x and y in a set
of grouped rows are:

row 1: 1.5, 9.5
row 2: 2.4, 20.8
row 3: 3.3, 40.6
row 4: 4.2, 2.5
row 5: 5.1, 10.1
row 6: 6.0, 7.9
row 7: 8.0, 21.0
row 8: 100, 200

1 through 6 get combined because all their X values are within the
tolerance of some other X in the set that's been combined. 7's Y value
is within the tolerance of 2's Y, so that should be combined as well.
8 is not combined because neither the X or Y value is within the
tolerance of any X or Y in the set that was combined.

In python I can easily parse the data and identify the rows that need
to be combined, but then I've lost the ability to calculate the
average and std. The only way I can think of to do this is to remove
the grouping from the SQL and do all the grouping and aggregating
myself. But this query often returns 20k to 30k rows after grouping.
It could easily be 80k to 100k rows that I have to process if I remove
the grouping and I think that will be very slow.

Anyone have any ideas?
<<<<<<<<
I suspect you can carry out their ideas by something like this, in an SQL procedure:

Besides your table "t", there are tables "t1", "t2", "t3", "tpair", and "tq".

With a cursor copy records from "t" to "t1" (with all of "t" s fields and an  "g1"
besides) ordered by a, b, c, d, x, going through all the complications of deciding where
a group boundary falls, numbering the groups by "g1".

Repeat this copying from "t1" to "t2" (which has besides "g1" also "g2"), ordered by a, b,
c, d, y, numbering the groups by "g2". Now "t1" no longer is needed.

Copy all distinct pairs of "g1" and "g2" to "tpair".

Until g = MIN(g1) of "tpair" is null, move (INSERT ... SELECT; DELETE ...) from "tpair" to
"tq" all pairs where g1=g, and as long as there is anything to move from "tpair" to "tq"
alternatly move records where any in "tpair" s "g2" match any already in "tq" and any in
"tpair" s "g1" match any already in "tq". Now all the pairs in "tq" represent the same
group: every record in "t2" with a pair in "tq" is copied into "t3" with a new number
"g3" instead of the pair.

After this one may group "t3" by "g3".

Thread
grouping by the difference between values in rowsLarry Martell12 Jan 2014
  • Re: grouping by the difference between values in rowsPeter Brawley12 Jan 2014
    • Re: grouping by the difference between values in rowsLarry Martell13 Jan 2014
      • Re: grouping by the difference between values in rowsPeter Brawley13 Jan 2014
        • Re: grouping by the difference between values in rowsLarry Martell13 Jan 2014
        • Re: grouping by the difference between values in rowsshawn l.green15 Jan 2014
          • Re: grouping by the difference between values in rowsLarry Martell16 Jan 2014
  • Re: grouping by the difference between values in rowshsv21 Jan 2014
    • Re: grouping by the difference between values in rowsTakeshi Hashimoto21 Jan 2014