List:General Discussion« Previous MessageNext Message »
From:Larry Martell Date:September 19 2012 10:21pm
Subject:Re: getting certain rows from a group by
View as plain text  
Well, I'm getting the proper rows from the 4 joins I added, but now
I'm seeing a weird side effect - my counts are 3 times to much.

The query is really huge and nasty now, but I'm going to paste it below.

In the outer most select, Wafers, Rerun, Runs, and Count are 3 times
what they should be. If I take off the joins, I get the correct
counts. I can't see why that's happening.


        SELECT data_target.name as Target,
               q1.ep as EP,
               COUNT(*) as Wafers,
               Lots,
               FORMAT(SUM(numonep)/(COUNT(*)+(SUM(GREATEST(q1.reruns,
0)))), 1) as 'Sites/Wafer',
               DATE_FORMAT(MAX(LastRun), '%m/%d/%y') as "Last Run",
               SUM(GREATEST(q1.reruns, 0)) as Rerun,
               COUNT(*)+SUM(GREATEST(q1.reruns, 0)) as Runs,
               FORMAT(avgbottom, 1) as "Avg Bottom",
               FORMAT(3*stdbottom, 2) as "3 Sig",
               FORMAT(maxbottom, 1) as Max,
               FORMAT(minbottom, 1) as Min,
               SUM(numonep) as Count,
               SUM(numonep) - SUM(numbottoms) as NAs,
               CONCAT(FORMAT(100*SUM(numbottoms)/SUM(numonep), 1), '
%') as "% Success",
               FORMAT(3*stdbottom/avgbottom, 2) as "3Sig/Avg",
               FORMAT(AVG(avgbottom), 1) as 'Wafer Avg',
               FORMAT(AVG(Wafer3Sigma), 2) as 'Wafer 3 Sigma',
               FORMAT(AVG(Ranges), 1) as 'Avg Range',
               FORMAT(3*STD(Ranges), 2) as '3Sig of Ranges',
               FORMAT(MAX(Ranges), 1) as 'Max Range',
               FORMAT(MIN(Ranges), 1) as 'Min Range',
               FORMAT((SUM(numonep) - SUM(numbottoms))/COUNT(*), 1) as
'NAs/Wafer',
               mx.maxImage as maxImage,
               mn.minImage as minImage,
               lr.lrImage as lrlrImage,
               cm.cmImage as cmcmImage
           FROM (SELECT target_name_id,
                        ep,
                        wafer_id,
                        COUNT(bottom) as numbottoms,
                        AVG(bottom) as avgbottom,
                        STD(bottom) as stdbottom,
                        MAX(bottom) as maxbottom,
                        MIN(bottom) as minbottom,
                        MAX(date_time) as "LastRun",
                        COUNT(*) as numonep,
                        COUNT(DISTINCT target_name_id, ep, lot_id,
data_file_id)-1 as reruns,
                        COUNT(DISTINCT(lot_id)) as Lots,
                        3*STD(bottom) as Wafer3Sigma,
                        MAX(bottom) - MIN(bottom) as Ranges
           FROM data_cst
           WHERE TRUE
         AND data_cst.target_name_id IN (775, 776, 777, 778, 779, 780,
45, 44, 116, 117, 118, 119, 120, 121)
         AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
'2010-03-04 23:59:59'
         GROUP BY target_name_id, ep, wafer_id
         HAVING count(*) < 999) q1
               JOIN (SELECT data_cstimage.name as minImage,
                            data_cst.bottom,
                            data_cst.target_name_id,
                            data_cst.date_time,
                            data_cst.lot_id,
                            data_cst.tool_id,
                            data_cst.roiname,
                            data_cst.recipe_id,
                            data_cst.ep
                     FROM data_cstimage, data_cst
                     WHERE data_cstimage.id = data_cst.image_measurer_id) mn
                     ON mn.bottom = q1.minbottom
                AND mn.target_name_id IN (775, 776, 777, 778, 779,
780, 45, 44, 116, 117, 118, 119, 120, 121)
                AND mn.date_time BETWEEN '2010-03-04 00:00:00' AND
'2010-03-04 23:59:59'
               JOIN (SELECT data_cstimage.name as maxImage,
                            data_cst.bottom,
                            data_cst.target_name_id,
                            data_cst.date_time,
                            data_cst.lot_id,
                            data_cst.tool_id,
                            data_cst.roiname,
                            data_cst.recipe_id,
                            data_cst.ep
                     FROM data_cstimage, data_cst
                     WHERE data_cstimage.id = data_cst.image_measurer_id) mx
                     ON mx.bottom = q1.maxbottom
                AND mx.target_name_id IN (775, 776, 777, 778, 779,
780, 45, 44, 116, 117, 118, 119, 120, 121)
                AND mx.date_time BETWEEN '2010-03-04 00:00:00' AND
'2010-03-04 23:59:59'
               JOIN (SELECT data_cstimage.name as lrImage,
                            data_cst.bottom,
                            data_cst.target_name_id,
                            data_cst.date_time,
                            data_cst.lot_id,
                            data_cst.tool_id,
                            data_cst.roiname,
                            data_cst.recipe_id,
                            data_cst.ep
                     FROM data_cstimage, data_cst
                     WHERE data_cstimage.id = data_cst.image_measurer_id) lr
                     ON lr.date_time = q1.LastRun
                AND lr.target_name_id IN (775, 776, 777, 778, 779,
780, 45, 44, 116, 117, 118, 119, 120, 121)
                AND lr.date_time BETWEEN '2010-03-04 00:00:00' AND
'2010-03-04 23:59:59'
               JOIN (SELECT data_cstimage.name as cmImage,
                            data_cst.bottom,
                            data_cst.target_name_id,
                            data_cst.date_time,
                            data_cst.lot_id,
                            data_cst.tool_id,
                            data_cst.roiname,
                            data_cst.recipe_id,
                            data_cst.ep
               FROM data_cstimage, data_cst
               WHERE data_cstimage.id = data_cst.image_measurer_id
                AND data_cst.target_name_id IN (775, 776, 777, 778,
779, 780, 45, 44, 116, 117, 118, 119, 120, 121)
                AND data_cst.date_time BETWEEN '2010-03-04 00:00:00'
AND '2010-03-04 23:59:59'
               GROUP BY target_name_id, ep
               HAVING count(*) < 999
               ORDER BY ABS(bottom - AVG(bottom))) cm
               ON cm.target_name_id = q1.target_name_id
               AND cm.ep = q1.ep
                AND cm.target_name_id IN (775, 776, 777, 778, 779,
780, 45, 44, 116, 117, 118, 119, 120, 121)
                AND cm.date_time BETWEEN '2010-03-04 00:00:00' AND
'2010-03-04 23:59:59' ,
                data_target
                WHERE data_target.id = q1.target_name_id
                GROUP BY q1.target_name_id, q1.ep


On Wed, Sep 19, 2012 at 3:27 PM, Larry Martell <larry.martell@stripped> wrote:
> OK, I think I have this working. The last join was this:
>
> JOIN (select id, target_name_id, ep,date_time from data_cst WHERE
> target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44, 116, 117,
> 118, 119, 120, 121) AND data_cst.date_time BETWEEN '2010-03-04
> 00:00:00' AND '2010-03-04 23:59:59' GROUP BY  target_name_id, ep
> HAVING count(*) < 999 ORDER BY  ABS(data_cst.bottom - AVG(bottom))) cm
> on cm.target_name_id = q1.target_name_id and cm.ep = q1.ep and
> cm.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04 23:59:59'
>
> I have do to more testing to ensure I'm getting the correct rows back.
>
> Thanks for all the help!
>
> On Wed, Sep 19, 2012 at 10:07 AM, Larry Martell <larry.martell@stripped>
> wrote:
>> On Tue, Sep 18, 2012 at 7:41 PM, Rick James <rjames@stripped> wrote:
>>> SELECT  ((the appropriate id))          -- <--
>>>     FROM  data_cst, rollup, data_target
>>>     WHERE  data_target.name = rollup.Target
>>>       AND  data_cst.ep = rollup.EP
>>>       AND  data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04
> 23:59:59'
>>>       AND  data_target.id = data_cst.target_name_id
>>>     GROUP BY  rollup.Target, rollup.EP
>>>     ORDER BY  ABS(data_cst.bottom - rollup.`Avg Bottom`)
>>>     LIMIT  1               -- <--
>>
>> rollup is the name of the temp table that I was testing with. It
>> contains the output of the original query. I don't have that in the
>> real world. That's my issue - I can't figure out how to join with the
>> result set and apply the order by to the join condition. What would I
>> use in place of rollup to get the 'Avg Bottom'?
>>
>> Here's what my query looks like now:
>>
>> SELECT data_target.name as Target,
>>        q1.ep as EP,
>>        COUNT(*) as Wafers,
>>        Lots,
>>        SUM(numonep)/(COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns
>> ELSE 0 END)) as 'Sites/Wafer',
>>        MAX(LastRun) as "Last Run",
>>        SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as Rerun,
>>        COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as Runs,
>>        avgbottom as "Avg Bottom",
>>        3*stdbottom as "3 Sig",
>>        maxbottom as Max,
>>        minbottom as Min,
>>        SUM(numonep) as Count, SUM(numonep) - SUM(numbottoms) as NAs,
>>        100-((SUM(numonep) - SUM(numbottoms))/SUM(numonep)*100) as "% Success",
>>        3*stdbottom/avgbottom as "3Sig/Avg",
>>        AVG(avgbottom) as 'Wafer Avg',
>>        AVG(Wafer3Sigma) as 'Wafer 3 Sigma',
>>        AVG(Ranges) as 'Avg Range',
>>        3*STD(Ranges) as '3Sig of Ranges',
>>        MAX(Ranges) as 'Max Range',
>>        MIN(Ranges) as 'Min Range', (SUM(numonep) -
>> SUM(numbottoms))/COUNT(*) as 'NAs/Wafer',
>>        mx.id as maxID,
>>        mn.id as minID,
>>        lr.id as lrID
>> FROM (SELECT target_name_id,
>>              ep,
>>              wafer_id,
>>              COUNT(bottom) as numbottoms,
>>              AVG(bottom) as avgbottom,
>>              STD(bottom) as stdbottom,
>>              MAX(bottom) as maxbottom,
>>              MIN(bottom) as minbottom,
>>              MAX(date_time) as "LastRun",
>>              COUNT(*) as numonep,
>>              COUNT(DISTINCT target_name_id, ep, lot_id,
>> data_file_id)-1 as reruns,
>>              COUNT(DISTINCT(lot_id)) as Lots,
>>              3*STD(bottom) as Wafer3Sigma,
>>              MAX(bottom) - MIN(bottom) as Ranges
>>       FROM data_cst
>>       WHERE target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44,
>> 116, 117, 118, 119, 120, 121)
>>       AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
>> '2010-03-04 23:59:59'
>>       GROUP BY target_name_id, ep, wafer_id
>>       HAVING count(*) < 999) q1
>> join data_cst mn ON mn.bottom = q1.minbottom
>> join data_cst mx on mx.bottom = q1.maxbottom
>> join data_cst lr on lr.date_time = q1.LastRun,
>> data_target
>> WHERE data_target.id = q1.target_name_id
>> GROUP BY q1.target_name_id, q1.ep;
>>
>>>
>>>> -----Original Message-----
>>>> From: Larry Martell [mailto:larry.martell@stripped]
>>>> Sent: Tuesday, September 18, 2012 2:57 PM
>>>> To: Rick James
>>>> Cc: peter.brawley@stripped; mysql@stripped
>>>> Subject: Re: getting certain rows from a group by
>>>>
>>>> On Tue, Sep 18, 2012 at 4:01 PM, Rick James <rjames@stripped>
>>>> wrote:
>>>> > SELECT mn.*
>>>> >        mx.*
>>>> >> >    FROM ( SELECT @min = MIN(bottom),
>>>> >> >                  @max = MAX(bottom), ... )
>>>> >> >    JOIN data_cst mn  ON bottom = @min
>>>> >> >    JOIN data_cst mx  ON bottom = @max
>>>> >
>>>> > That is, the FROM finds the value for the desired row.
>>>> > The JOIN then gets to the rest of the fields.
>>>> > Caveat:  If two rows have bottom = @min, it returns both.
>>>>
>>>>
>>>> Thanks a lot Rick! This was super helpful. I've got it working for 3 of
>>>> the 4 cases - min(bottom), max(bottom), and max(date_time). But I can't
>>>> figure out how to work in the last case - where bottom is closest to
>>>> avg(bottom). In an individual query I can get it with an order by, like
>>>> this:
>>>>
>>>> SELECT rollup.Target, rollup.EP, rollup.`Avg Bottom`, data_cst.id,
>>>> data_cst.bottom FROM data_cst, rollup, data_target WHERE
>>>> data_target.name = rollup.Target AND data_cst.ep = rollup.EP AND
>>>> data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04
>>>> 23:59:59'
>>>> AND data_target.id = data_cst.target_name_id GROUP BY rollup.Target,
>>>> rollup.EP ORDER BY ABS(data_cst.bottom - rollup.`Avg Bottom`);
>>>>
>>>> Any way to work that into another join?
>>>>
>>>> Thanks!
>>>>
>>>> >
>>>> >> -----Original Message-----
>>>> >> From: Larry Martell [mailto:larry.martell@stripped]
>>>> >> Sent: Tuesday, September 18, 2012 12:54 PM
>>>> >> To: Rick James
>>>> >> Cc: peter.brawley@stripped; mysql@stripped
>>>> >> Subject: Re: getting certain rows from a group by
>>>> >>
>>>> >> On Tue, Sep 18, 2012 at 2:05 PM, Rick James
> <rjames@stripped>
>>>> >> wrote:
>>>> >> > For single query...
>>>> >> >
>>>> >> > Plan A:
>>>> >> > Repeat the SELECT as once for each (MIN, AVG, etc) as a
> subquery
>>>> in
>>>> >> >    WHERE bottom = ( SELECT MIN(bottom) ... )
>>>> >> >
>>>> >> > Plan B:
>>>> >> >    FROM ( SELECT @min = MIN(bottom),
>>>> >> >                  @max = MAX(bottom), ... )
>>>> >> >    JOIN data_cst mn  ON bottom = @min
>>>> >> >    JOIN data_cst mx  ON bottom = @max
>>>> >>
>>>> >>
>>>> >> Can you elaborate on this? I don't see how this will give me
> what I
>>>> >> need. I'm not looking for the min or max bottom - I already
> have
>>>> that
>>>> >> - I'm looking for the row from each group that has the min and
> max
>>>> >> bottom.
>>>> >>
>>>> >> > Plan C:
>>>> >> > Get rid of 3rd party packages that eventually get in your
> way
>>>> >> > instead
>>>> >> of 'helping'.
>>>> >>
>>>> >> It's not django that is getting in the way. django is a
> fantastic
>>>> >> framework for web development. It's just changing requirements.
> The
>>>> >> original developer (who no longer works here) created a few
>>>> different
>>>> >> classes, and then based on the original requirements, chose to
> use
>>>> >> the one that supports a single query. That worked then, but may
> not
>>>> >> work now. There are other classes that allow multiple queries,
> but
>>>> >> then you have to parse the data and stuff it into the context
> that
>>>> >> will be picked up by the browser yourself. The single query
> class
>>>> >> does that for you. I may yet go that way, but I'm trying to
> avoid a
>>>> >> big rewrite if I can.
>>>> >>
>>>> >> >> -----Original Message-----
>>>> >> >> From: Larry Martell [mailto:larry.martell@stripped]
>>>> >> >> Sent: Tuesday, September 18, 2012 4:57 AM
>>>> >> >> To: peter.brawley@stripped
>>>> >> >> Cc: mysql@stripped
>>>> >> >> Subject: Re: getting certain rows from a group by
>>>> >> >>
>>>> >> >> On Tue, Sep 18, 2012 at 7:52 AM, Peter Brawley
>>>> >> >> <peter.brawley@stripped> wrote:
>>>> >> >> > On 2012-09-18 5:53 AM, Larry Martell wrote:
>>>> >> >> >>
>>>> >> >> >> I have this query:
>>>> >> >> >>
>>>> >> >> >> SELECT data_target.name as Target,
>>>> >> >> >>         q1.ep as EP,
>>>> >> >> >>         COUNT(*) as Wafers,
>>>> >> >> >>         Lots,
>>>> >> >> >>         SUM(numonep)/(COUNT(*)+SUM(CASE WHEN
> reruns > 0 THEN
>>>> >> >> >> reruns ELSE 0 END)) as 'Sites/Wafer',
>>>> >> >> >>         MAX(LastRun) as "Last Run",
>>>> >> >> >>         SUM(CASE WHEN reruns > 0 THEN
> reruns ELSE 0 END) as
>>>> >> Rerun,
>>>> >> >> >>         COUNT(*)+SUM(CASE WHEN reruns > 0
> THEN reruns ELSE 0
>>>> >> >> >> END) as
>>>> >> >> Runs,
>>>> >> >> >>         avgbottom as "Avg Bottom",
>>>> >> >> >>         3*stdbottom as "3 Sig",
>>>> >> >> >>         maxbottom as Max,
>>>> >> >> >>         minbottom as Min,
>>>> >> >> >>         SUM(numonep) as Count,
>>>> >> >> >>         SUM(numonep) - SUM(numbottoms) as
> NAs,
>>>> >> >> >>         100-((SUM(numonep) -
>>>> SUM(numbottoms))/SUM(numonep)*100)
>>>> >> as
>>>> >> >> "%
>>>> >> >> >> Success",
>>>> >> >> >>         3*stdbottom/avgbottom as "3Sig/Avg",
>>>> >> >> >>         AVG(avgbottom) as 'Wafer Avg',
>>>> >> >> >>         AVG(Wafer3Sigma) as 'Wafer 3 Sigma',
>>>> >> >> >>         AVG(Ranges) as 'Avg Range',
>>>> >> >> >>         3*STD(Ranges) as '3Sig of Ranges',
>>>> >> >> >>         MAX(Ranges) as 'Max Range',
>>>> >> >> >>         MIN(Ranges) as 'Min Range',
>>>> >> >> >>         (SUM(numonep) -
> SUM(numbottoms))/COUNT(*) as
>>>> 'NAs/Wafer'
>>>> >> >> >> FROM (SELECT target_name_id,
>>>> >> >> >>               ep,
>>>> >> >> >>               wafer_id,
>>>> >> >> >>               COUNT(bottom) as numbottoms,
>>>> >> >> >>               AVG(bottom) as avgbottom,
>>>> >> >> >>               STD(bottom) as stdbottom,
>>>> >> >> >>               MAX(bottom) as maxbottom,
>>>> >> >> >>               MIN(bottom) as minbottom,
>>>> >> >> >>               MAX(date_time) as "LastRun",
>>>> >> >> >>               COUNT(*) as numonep,
>>>> >> >> >>               COUNT(DISTINCT target_name_id,
> ep, lot_id,
>>>> >> >> >> data_file_id)-1 as reruns,
>>>> >> >> >>               COUNT(DISTINCT(lot_id)) as
> Lots,
>>>> >> >> >>               3*STD(bottom) as Wafer3Sigma,
>>>> >> >> >>               MAX(bottom) - MIN(bottom) as
> Ranges
>>>> >> >> >>        FROM data_cst
>>>> >> >> >>        WHERE target_name_id IN (775, 776, 777,
> 778, 779, 780,
>>>> >> >> >> 45,
>>>> >> >> 44,
>>>> >> >> >> 116, 117, 118, 119, 120, 121)
>>>> >> >> >>        AND data_cst.date_time BETWEEN
> '2010-03-04 00:00:00'
>>>> AND
>>>> >> >> >> '2010-03-04 23:59:59'
>>>> >> >> >>        GROUP BY target_name_id, ep, wafer_id
>>>> >> >> >>        HAVING count(*) < 999) q1,
>>>> >> >> >>       data_target
>>>> >> >> >> WHERE data_target.id = target_name_id GROUP
> BY
>>>> >> >> >> q1.target_name_id, q1.ep;
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >> This works fine. But now I need to get a
> certain column
>>>> >> >> >> (image_measurer_id) with each row returned
> that corresponds to
>>>> >> the
>>>> >> >> >> row from the group that has bottom =
> Min(bottom), bottom =
>>>> >> >> >> Max(bottom), bottom closest to Avg(bottom),
> and bottom from
>>>> the
>>>> >> >> >> row where date_time = Max(date_time).
>>>> >> >> >>
>>>> >> >> >> Is this even possible from one query?
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > Might be, but what's the importance of doing it as
> one query?
>>>> >> >>
>>>> >> >> Because it's part of a django based web app, and the
> class that
>>>> >> >> this is part of only supports having a single query. To
> use
>>>> >> >> multiple queries will require a fairly major rewrite of
> the
>>>> server
>>>> >> >> side of
>>>> >> that app.
>>>> >> >>
>>>> >> >> > I'd start by
>>>> >> >> > saving this result to a temp table and developing
> the new
>>>> query.
>>>> >> >> > When that's running, see if you can to optimise a
> query built
>>>> by
>>>> >> >> > replacing the reference to the temp table with the
> original
>>>> query.
>>>> >> >>
>>>> >> >> Thanks, I'll look into this.
Thread
getting certain rows from a group byLarry Martell18 Sep
  • Re: getting certain rows from a group byPeter Brawley18 Sep
    • Re: getting certain rows from a group byLarry Martell18 Sep
      • RE: getting certain rows from a group byRick James18 Sep
        • Re: getting certain rows from a group byLarry Martell18 Sep
          • RE: getting certain rows from a group byRick James18 Sep
            • Re: getting certain rows from a group byLarry Martell18 Sep
              • RE: getting certain rows from a group byRick James18 Sep
                • Re: getting certain rows from a group byLarry Martell19 Sep
                  • Re: getting certain rows from a group byLarry Martell19 Sep
                    • Re: getting certain rows from a group byLarry Martell19 Sep
                      • RE: getting certain rows from a group byRick James19 Sep
                        • Re: getting certain rows from a group byLarry Martell20 Sep
      • Re: getting certain rows from a group byLarry Martell18 Sep
  • Re: getting certain rows from a group byhsv19 Sep
    • Re: getting certain rows from a group byLarry Martell19 Sep
      • Re: getting certain rows from a group byLarry Martell19 Sep
    • Re: getting certain rows from a group byLarry Martell19 Sep
      • Re: getting certain rows from a group byhsv19 Sep