List:General Discussion« Previous MessageNext Message »
From:Larry Martell Date:January 16 2014 4:42am
Subject:Re: grouping by the difference between values in rows
View as plain text  
On Wed, Jan 15, 2014 at 4:06 PM, shawn l.green <shawn.l.green@stripped> wrote:
> Hello Larry,
>
>
> On 1/13/2014 2:43 AM, Peter Brawley wrote:
>>
>>
>> On 2014-01-12 9:13 PM, Larry Martell wrote:
>>>
>>> On Sun, Jan 12, 2014 at 2:47 PM, Peter Brawley
>>> <peter.brawley@stripped> wrote:
>>>>
>>>> On 2014-01-12 1:17 PM, Larry Martell wrote:
>>>>>
>>>>> I've been asked to do something that I do not think is possible in
> SQL.
>>>>>
>>>>> I have a query that has this basic form:
>>>>>
>>>>> SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
>>>>> FROM t
>>>>> GROUP BY a, b, c, d, f
>>>>>
>>>>> x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
>>>>> 10053.490, 2542.094).
>>>>>
>>>>> The business issue is that if either x or y in 2 rows that are in
> the
>>>>> same a, b, c, d group are within 1 of each other then they should be
>>>>> grouped together. And to make it more complicated, the tolerance is
>>>>> applied as a rolling continuum. For example, if the x and y in a set
>>>>> of grouped rows are:
>>>>>
>>>>> row 1: 1.5, 9.5
>>>>> row 2: 2.4, 20.8
>>>>> row 3: 3.3, 40.6
>>>>> row 4: 4.2, 2.5
>>>>> row 5: 5.1, 10.1
>>>>> row 6: 6.0, 7.9
>>>>> row 7: 8.0, 21.0
>>>>> row 8: 100, 200
>>>>>
>>>>> 1 through 6 get combined because all their X values are within the
>>>>> tolerance of some other X in the set that's been combined. 7's Y
> value
>>>>> is within the tolerance of 2's Y, so that should be combined as
> well.
>>>>> 8 is not combined because neither the X or Y value is within the
>>>>> tolerance of any X or Y in the set that was combined.
>>>>>
>>>>> In python I can easily parse the data and identify the rows that
> need
>>>>> to be combined, but then I've lost the ability to calculate the
>>>>> average and std. The only way I can think of to do this is to remove
>>>>> the grouping from the SQL and do all the grouping and aggregating
>>>>> myself. But this query often returns 20k to 30k rows after grouping.
>>>>> It could easily be 80k to 100k rows that I have to process if I
> remove
>>>>> the grouping and I think that will be very slow.
>>>>>
>>>>> Anyone have any ideas?
>>>>
>>>>
>>>> Could you compute the row-to-row values & write them to a temp
> table,
>>>> then
>>>> run the SQL that incorporates that result column?
>>>
>>> I thought of temp tables, but I could not come up with a way to use
>>> them for this. How can I apply the x/y tolerance grouping in sql?
>>
>>
>> Run the query you showed, saving the result to a temp table. In an sproc
>> or your preferred app language, do the row-to-row processing to generate
>> a new column in the temp table from the biz rules you outlined, now
>> query the revised temp table as desired.
>>
>
> This is a very interesting problem but I am not sure what to do with a
> grouping. To me this appears to be a two-dimensional, nearest-neighbor-type
> sorting problem.
>
> The way you outline it you could end up with all of the elements in your
> analysis plane in a single group depending on how closely they are to each
> other. Are long strings of points and blobs (clusters of points) in your 2-D
> search space (your plane is defined by your {a,b,c,d} tuple) acceptable
> results of your re-grouping process?
>
> If I had to draw this out graphically, you are stamping each of your {x,y}
> points with a 2x2 boundary box (each point is in intersection in the middle)
> and you want to know which sets of those boxes form a contiguous
> intersection area (they overlap).
>
> Is that what you are looking for? One technique would be to start with
> Peter's suggestion. Then reprocess that list to generate an acyclic directed
> graph (by eliminating the reverse matches from your set), then tracing down
> the tree.  If point 2 links to point 7,then 7 also links to 2 (both would be
> found by your nearness test. Eliminate any matches where the first point
> appears after the second point from your list.


Thanks much for the reply Shawn. Unfortunately this project was put on
the back burner. When I get back to it I'll be sure to try the
suggestions given here and I will report back as to what worked.
Thread
grouping by the difference between values in rowsLarry Martell12 Jan 2014
  • Re: grouping by the difference between values in rowsPeter Brawley12 Jan 2014
    • Re: grouping by the difference between values in rowsLarry Martell13 Jan 2014
      • Re: grouping by the difference between values in rowsPeter Brawley13 Jan 2014
        • Re: grouping by the difference between values in rowsLarry Martell13 Jan 2014
        • Re: grouping by the difference between values in rowsshawn l.green15 Jan 2014
          • Re: grouping by the difference between values in rowsLarry Martell16 Jan 2014
  • Re: grouping by the difference between values in rowshsv21 Jan 2014
    • Re: grouping by the difference between values in rowsTakeshi Hashimoto21 Jan 2014