On Wed, Sep 19, 2012 at 6:56 PM, Rick James <rjames@stripped> wrote:
>> my counts are 3 times too much.
> Without studying the code, I would guess that there is a JOIN between he data that
> needs COUNTing and the GROUP BY for the COUNT. That is, it collects more 'joined' rows
> before counting. Fixing it will probably make the query even messier.
The issue was with the last run join. I had to change the where clause
in the join from:
lr.target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44, 116, 117,
118, 119, 120, 121)
to:
lr.target_name_id = q1.target_name_id AND lr.ep = q1.ep
In fact I had to do this for all the joins.
Thanks again for all your help!
>
>> -----Original Message-----
>> From: Larry Martell [mailto:larry.martell@stripped]
>> Sent: Wednesday, September 19, 2012 3:22 PM
>> To: Rick James
>> Cc: peter.brawley@stripped; mysql@stripped
>> Subject: Re: getting certain rows from a group by
>>
>> Well, I'm getting the proper rows from the 4 joins I added, but now I'm
>> seeing a weird side effect - my counts are 3 times to much.
>>
>> The query is really huge and nasty now, but I'm going to paste it
>> below.
>>
>> In the outer most select, Wafers, Rerun, Runs, and Count are 3 times
>> what they should be. If I take off the joins, I get the correct counts.
>> I can't see why that's happening.
>>
>>
>> SELECT data_target.name as Target,
>> q1.ep as EP,
>> COUNT(*) as Wafers,
>> Lots,
>> FORMAT(SUM(numonep)/(COUNT(*)+(SUM(GREATEST(q1.reruns,
>> 0)))), 1) as 'Sites/Wafer',
>> DATE_FORMAT(MAX(LastRun), '%m/%d/%y') as "Last Run",
>> SUM(GREATEST(q1.reruns, 0)) as Rerun,
>> COUNT(*)+SUM(GREATEST(q1.reruns, 0)) as Runs,
>> FORMAT(avgbottom, 1) as "Avg Bottom",
>> FORMAT(3*stdbottom, 2) as "3 Sig",
>> FORMAT(maxbottom, 1) as Max,
>> FORMAT(minbottom, 1) as Min,
>> SUM(numonep) as Count,
>> SUM(numonep) - SUM(numbottoms) as NAs,
>> CONCAT(FORMAT(100*SUM(numbottoms)/SUM(numonep), 1), '
>> %') as "% Success",
>> FORMAT(3*stdbottom/avgbottom, 2) as "3Sig/Avg",
>> FORMAT(AVG(avgbottom), 1) as 'Wafer Avg',
>> FORMAT(AVG(Wafer3Sigma), 2) as 'Wafer 3 Sigma',
>> FORMAT(AVG(Ranges), 1) as 'Avg Range',
>> FORMAT(3*STD(Ranges), 2) as '3Sig of Ranges',
>> FORMAT(MAX(Ranges), 1) as 'Max Range',
>> FORMAT(MIN(Ranges), 1) as 'Min Range',
>> FORMAT((SUM(numonep) - SUM(numbottoms))/COUNT(*), 1) as
>> 'NAs/Wafer',
>> mx.maxImage as maxImage,
>> mn.minImage as minImage,
>> lr.lrImage as lrlrImage,
>> cm.cmImage as cmcmImage
>> FROM (SELECT target_name_id,
>> ep,
>> wafer_id,
>> COUNT(bottom) as numbottoms,
>> AVG(bottom) as avgbottom,
>> STD(bottom) as stdbottom,
>> MAX(bottom) as maxbottom,
>> MIN(bottom) as minbottom,
>> MAX(date_time) as "LastRun",
>> COUNT(*) as numonep,
>> COUNT(DISTINCT target_name_id, ep, lot_id,
>> data_file_id)-1 as reruns,
>> COUNT(DISTINCT(lot_id)) as Lots,
>> 3*STD(bottom) as Wafer3Sigma,
>> MAX(bottom) - MIN(bottom) as Ranges
>> FROM data_cst
>> WHERE TRUE
>> AND data_cst.target_name_id IN (775, 776, 777, 778, 779, 780,
>> 45, 44, 116, 117, 118, 119, 120, 121)
>> AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
>> '2010-03-04 23:59:59'
>> GROUP BY target_name_id, ep, wafer_id
>> HAVING count(*) < 999) q1
>> JOIN (SELECT data_cstimage.name as minImage,
>> data_cst.bottom,
>> data_cst.target_name_id,
>> data_cst.date_time,
>> data_cst.lot_id,
>> data_cst.tool_id,
>> data_cst.roiname,
>> data_cst.recipe_id,
>> data_cst.ep
>> FROM data_cstimage, data_cst
>> WHERE data_cstimage.id =
>> data_cst.image_measurer_id) mn
>> ON mn.bottom = q1.minbottom
>> AND mn.target_name_id IN (775, 776, 777, 778, 779, 780,
>> 45, 44, 116, 117, 118, 119, 120, 121)
>> AND mn.date_time BETWEEN '2010-03-04 00:00:00' AND
>> '2010-03-04 23:59:59'
>> JOIN (SELECT data_cstimage.name as maxImage,
>> data_cst.bottom,
>> data_cst.target_name_id,
>> data_cst.date_time,
>> data_cst.lot_id,
>> data_cst.tool_id,
>> data_cst.roiname,
>> data_cst.recipe_id,
>> data_cst.ep
>> FROM data_cstimage, data_cst
>> WHERE data_cstimage.id =
>> data_cst.image_measurer_id) mx
>> ON mx.bottom = q1.maxbottom
>> AND mx.target_name_id IN (775, 776, 777, 778, 779, 780,
>> 45, 44, 116, 117, 118, 119, 120, 121)
>> AND mx.date_time BETWEEN '2010-03-04 00:00:00' AND
>> '2010-03-04 23:59:59'
>> JOIN (SELECT data_cstimage.name as lrImage,
>> data_cst.bottom,
>> data_cst.target_name_id,
>> data_cst.date_time,
>> data_cst.lot_id,
>> data_cst.tool_id,
>> data_cst.roiname,
>> data_cst.recipe_id,
>> data_cst.ep
>> FROM data_cstimage, data_cst
>> WHERE data_cstimage.id =
>> data_cst.image_measurer_id) lr
>> ON lr.date_time = q1.LastRun
>> AND lr.target_name_id IN (775, 776, 777, 778, 779, 780,
>> 45, 44, 116, 117, 118, 119, 120, 121)
>> AND lr.date_time BETWEEN '2010-03-04 00:00:00' AND
>> '2010-03-04 23:59:59'
>> JOIN (SELECT data_cstimage.name as cmImage,
>> data_cst.bottom,
>> data_cst.target_name_id,
>> data_cst.date_time,
>> data_cst.lot_id,
>> data_cst.tool_id,
>> data_cst.roiname,
>> data_cst.recipe_id,
>> data_cst.ep
>> FROM data_cstimage, data_cst
>> WHERE data_cstimage.id = data_cst.image_measurer_id
>> AND data_cst.target_name_id IN (775, 776, 777, 778,
>> 779, 780, 45, 44, 116, 117, 118, 119, 120, 121)
>> AND data_cst.date_time BETWEEN '2010-03-04 00:00:00'
>> AND '2010-03-04 23:59:59'
>> GROUP BY target_name_id, ep
>> HAVING count(*) < 999
>> ORDER BY ABS(bottom - AVG(bottom))) cm
>> ON cm.target_name_id = q1.target_name_id
>> AND cm.ep = q1.ep
>> AND cm.target_name_id IN (775, 776, 777, 778, 779, 780,
>> 45, 44, 116, 117, 118, 119, 120, 121)
>> AND cm.date_time BETWEEN '2010-03-04 00:00:00' AND
>> '2010-03-04 23:59:59' ,
>> data_target
>> WHERE data_target.id = q1.target_name_id
>> GROUP BY q1.target_name_id, q1.ep
>>
>>
>> On Wed, Sep 19, 2012 at 3:27 PM, Larry Martell
>> <larry.martell@stripped> wrote:
>> > OK, I think I have this working. The last join was this:
>> >
>> > JOIN (select id, target_name_id, ep,date_time from data_cst WHERE
>> > target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44, 116, 117,
>> > 118, 119, 120, 121) AND data_cst.date_time BETWEEN '2010-03-04
>> > 00:00:00' AND '2010-03-04 23:59:59' GROUP BY target_name_id, ep
>> > HAVING count(*) < 999 ORDER BY ABS(data_cst.bottom - AVG(bottom)))
>> cm
>> > on cm.target_name_id = q1.target_name_id and cm.ep = q1.ep and
>> > cm.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04 23:59:59'
>> >
>> > I have do to more testing to ensure I'm getting the correct rows
>> back.
>> >
>> > Thanks for all the help!
>> >
>> > On Wed, Sep 19, 2012 at 10:07 AM, Larry Martell
>> <larry.martell@stripped> wrote:
>> >> On Tue, Sep 18, 2012 at 7:41 PM, Rick James
> <rjames@stripped>
>> wrote:
>> >>> SELECT ((the appropriate id)) -- <--
>> >>> FROM data_cst, rollup, data_target
>> >>> WHERE data_target.name = rollup.Target
>> >>> AND data_cst.ep = rollup.EP
>> >>> AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
>> '2010-03-04 23:59:59'
>> >>> AND data_target.id = data_cst.target_name_id
>> >>> GROUP BY rollup.Target, rollup.EP
>> >>> ORDER BY ABS(data_cst.bottom - rollup.`Avg Bottom`)
>> >>> LIMIT 1 -- <--
>> >>
>> >> rollup is the name of the temp table that I was testing with. It
>> >> contains the output of the original query. I don't have that in the
>> >> real world. That's my issue - I can't figure out how to join with
>> the
>> >> result set and apply the order by to the join condition. What would
>> I
>> >> use in place of rollup to get the 'Avg Bottom'?
>> >>
>> >> Here's what my query looks like now:
>> >>
>> >> SELECT data_target.name as Target,
>> >> q1.ep as EP,
>> >> COUNT(*) as Wafers,
>> >> Lots,
>> >> SUM(numonep)/(COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns
>> >> ELSE 0 END)) as 'Sites/Wafer',
>> >> MAX(LastRun) as "Last Run",
>> >> SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as Rerun,
>> >> COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as
>> Runs,
>> >> avgbottom as "Avg Bottom",
>> >> 3*stdbottom as "3 Sig",
>> >> maxbottom as Max,
>> >> minbottom as Min,
>> >> SUM(numonep) as Count, SUM(numonep) - SUM(numbottoms) as NAs,
>> >> 100-((SUM(numonep) - SUM(numbottoms))/SUM(numonep)*100) as "%
>> Success",
>> >> 3*stdbottom/avgbottom as "3Sig/Avg",
>> >> AVG(avgbottom) as 'Wafer Avg',
>> >> AVG(Wafer3Sigma) as 'Wafer 3 Sigma',
>> >> AVG(Ranges) as 'Avg Range',
>> >> 3*STD(Ranges) as '3Sig of Ranges',
>> >> MAX(Ranges) as 'Max Range',
>> >> MIN(Ranges) as 'Min Range', (SUM(numonep) -
>> >> SUM(numbottoms))/COUNT(*) as 'NAs/Wafer',
>> >> mx.id as maxID,
>> >> mn.id as minID,
>> >> lr.id as lrID
>> >> FROM (SELECT target_name_id,
>> >> ep,
>> >> wafer_id,
>> >> COUNT(bottom) as numbottoms,
>> >> AVG(bottom) as avgbottom,
>> >> STD(bottom) as stdbottom,
>> >> MAX(bottom) as maxbottom,
>> >> MIN(bottom) as minbottom,
>> >> MAX(date_time) as "LastRun",
>> >> COUNT(*) as numonep,
>> >> COUNT(DISTINCT target_name_id, ep, lot_id,
>> >> data_file_id)-1 as reruns,
>> >> COUNT(DISTINCT(lot_id)) as Lots,
>> >> 3*STD(bottom) as Wafer3Sigma,
>> >> MAX(bottom) - MIN(bottom) as Ranges
>> >> FROM data_cst
>> >> WHERE target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44,
>> >> 116, 117, 118, 119, 120, 121)
>> >> AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
>> >> '2010-03-04 23:59:59'
>> >> GROUP BY target_name_id, ep, wafer_id
>> >> HAVING count(*) < 999) q1
>> >> join data_cst mn ON mn.bottom = q1.minbottom join data_cst mx on
>> >> mx.bottom = q1.maxbottom join data_cst lr on lr.date_time =
>> >> q1.LastRun, data_target WHERE data_target.id = q1.target_name_id
>> >> GROUP BY q1.target_name_id, q1.ep;
>> >>
>> >>>
>> >>>> -----Original Message-----
>> >>>> From: Larry Martell [mailto:larry.martell@stripped]
>> >>>> Sent: Tuesday, September 18, 2012 2:57 PM
>> >>>> To: Rick James
>> >>>> Cc: peter.brawley@stripped; mysql@stripped
>> >>>> Subject: Re: getting certain rows from a group by
>> >>>>
>> >>>> On Tue, Sep 18, 2012 at 4:01 PM, Rick James
> <rjames@stripped>
>> >>>> wrote:
>> >>>> > SELECT mn.*
>> >>>> > mx.*
>> >>>> >> > FROM ( SELECT @min = MIN(bottom),
>> >>>> >> > @max = MAX(bottom), ... )
>> >>>> >> > JOIN data_cst mn ON bottom = @min
>> >>>> >> > JOIN data_cst mx ON bottom = @max
>> >>>> >
>> >>>> > That is, the FROM finds the value for the desired row.
>> >>>> > The JOIN then gets to the rest of the fields.
>> >>>> > Caveat: If two rows have bottom = @min, it returns both.
>> >>>>
>> >>>>
>> >>>> Thanks a lot Rick! This was super helpful. I've got it working
> for
>> >>>> 3 of the 4 cases - min(bottom), max(bottom), and
> max(date_time).
>> >>>> But I can't figure out how to work in the last case - where
> bottom
>> >>>> is closest to avg(bottom). In an individual query I can get it
>> with
>> >>>> an order by, like
>> >>>> this:
>> >>>>
>> >>>> SELECT rollup.Target, rollup.EP, rollup.`Avg Bottom`,
> data_cst.id,
>> >>>> data_cst.bottom FROM data_cst, rollup, data_target WHERE
>> >>>> data_target.name = rollup.Target AND data_cst.ep = rollup.EP
> AND
>> >>>> data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04
>> >>>> 23:59:59'
>> >>>> AND data_target.id = data_cst.target_name_id GROUP BY
>> >>>> rollup.Target, rollup.EP ORDER BY ABS(data_cst.bottom -
>> rollup.`Avg
>> >>>> Bottom`);
>> >>>>
>> >>>> Any way to work that into another join?
>> >>>>
>> >>>> Thanks!
>> >>>>
>> >>>> >
>> >>>> >> -----Original Message-----
>> >>>> >> From: Larry Martell [mailto:larry.martell@stripped]
>> >>>> >> Sent: Tuesday, September 18, 2012 12:54 PM
>> >>>> >> To: Rick James
>> >>>> >> Cc: peter.brawley@stripped; mysql@stripped
>> >>>> >> Subject: Re: getting certain rows from a group by
>> >>>> >>
>> >>>> >> On Tue, Sep 18, 2012 at 2:05 PM, Rick James
>> >>>> >> <rjames@stripped>
>> >>>> >> wrote:
>> >>>> >> > For single query...
>> >>>> >> >
>> >>>> >> > Plan A:
>> >>>> >> > Repeat the SELECT as once for each (MIN, AVG, etc)
> as a
>> >>>> >> > subquery
>> >>>> in
>> >>>> >> > WHERE bottom = ( SELECT MIN(bottom) ... )
>> >>>> >> >
>> >>>> >> > Plan B:
>> >>>> >> > FROM ( SELECT @min = MIN(bottom),
>> >>>> >> > @max = MAX(bottom), ... )
>> >>>> >> > JOIN data_cst mn ON bottom = @min
>> >>>> >> > JOIN data_cst mx ON bottom = @max
>> >>>> >>
>> >>>> >>
>> >>>> >> Can you elaborate on this? I don't see how this will
> give me
>> >>>> >> what I need. I'm not looking for the min or max bottom
> - I
>> >>>> >> already have
>> >>>> that
>> >>>> >> - I'm looking for the row from each group that has the
> min and
>> >>>> >> max bottom.
>> >>>> >>
>> >>>> >> > Plan C:
>> >>>> >> > Get rid of 3rd party packages that eventually get
> in your way
>> >>>> >> > instead
>> >>>> >> of 'helping'.
>> >>>> >>
>> >>>> >> It's not django that is getting in the way. django is
> a
>> >>>> >> fantastic framework for web development. It's just
> changing
>> >>>> >> requirements. The original developer (who no longer
> works here)
>> >>>> >> created a few
>> >>>> different
>> >>>> >> classes, and then based on the original requirements,
> chose to
>> >>>> >> use the one that supports a single query. That worked
> then, but
>> >>>> >> may not work now. There are other classes that allow
> multiple
>> >>>> >> queries, but then you have to parse the data and stuff
> it into
>> >>>> >> the context that will be picked up by the browser
> yourself. The
>> >>>> >> single query class does that for you. I may yet go that
> way,
>> but
>> >>>> >> I'm trying to avoid a big rewrite if I can.
>> >>>> >>
>> >>>> >> >> -----Original Message-----
>> >>>> >> >> From: Larry Martell
> [mailto:larry.martell@stripped]
>> >>>> >> >> Sent: Tuesday, September 18, 2012 4:57 AM
>> >>>> >> >> To: peter.brawley@stripped
>> >>>> >> >> Cc: mysql@stripped
>> >>>> >> >> Subject: Re: getting certain rows from a group
> by
>> >>>> >> >>
>> >>>> >> >> On Tue, Sep 18, 2012 at 7:52 AM, Peter
> Brawley
>> >>>> >> >> <peter.brawley@stripped> wrote:
>> >>>> >> >> > On 2012-09-18 5:53 AM, Larry Martell
> wrote:
>> >>>> >> >> >>
>> >>>> >> >> >> I have this query:
>> >>>> >> >> >>
>> >>>> >> >> >> SELECT data_target.name as Target,
>> >>>> >> >> >> q1.ep as EP,
>> >>>> >> >> >> COUNT(*) as Wafers,
>> >>>> >> >> >> Lots,
>> >>>> >> >> >>
> SUM(numonep)/(COUNT(*)+SUM(CASE WHEN reruns > 0
>> >>>> >> >> >> THEN reruns ELSE 0 END)) as
> 'Sites/Wafer',
>> >>>> >> >> >> MAX(LastRun) as "Last Run",
>> >>>> >> >> >> SUM(CASE WHEN reruns > 0
> THEN reruns ELSE 0 END)
>> >>>> >> >> >> as
>> >>>> >> Rerun,
>> >>>> >> >> >> COUNT(*)+SUM(CASE WHEN reruns
> > 0 THEN reruns
>> ELSE
>> >>>> >> >> >> 0
>> >>>> >> >> >> END) as
>> >>>> >> >> Runs,
>> >>>> >> >> >> avgbottom as "Avg Bottom",
>> >>>> >> >> >> 3*stdbottom as "3 Sig",
>> >>>> >> >> >> maxbottom as Max,
>> >>>> >> >> >> minbottom as Min,
>> >>>> >> >> >> SUM(numonep) as Count,
>> >>>> >> >> >> SUM(numonep) -
> SUM(numbottoms) as NAs,
>> >>>> >> >> >> 100-((SUM(numonep) -
>> >>>> SUM(numbottoms))/SUM(numonep)*100)
>> >>>> >> as
>> >>>> >> >> "%
>> >>>> >> >> >> Success",
>> >>>> >> >> >> 3*stdbottom/avgbottom as
> "3Sig/Avg",
>> >>>> >> >> >> AVG(avgbottom) as 'Wafer
> Avg',
>> >>>> >> >> >> AVG(Wafer3Sigma) as 'Wafer 3
> Sigma',
>> >>>> >> >> >> AVG(Ranges) as 'Avg Range',
>> >>>> >> >> >> 3*STD(Ranges) as '3Sig of
> Ranges',
>> >>>> >> >> >> MAX(Ranges) as 'Max Range',
>> >>>> >> >> >> MIN(Ranges) as 'Min Range',
>> >>>> >> >> >> (SUM(numonep) -
> SUM(numbottoms))/COUNT(*) as
>> >>>> 'NAs/Wafer'
>> >>>> >> >> >> FROM (SELECT target_name_id,
>> >>>> >> >> >> ep,
>> >>>> >> >> >> wafer_id,
>> >>>> >> >> >> COUNT(bottom) as
> numbottoms,
>> >>>> >> >> >> AVG(bottom) as
> avgbottom,
>> >>>> >> >> >> STD(bottom) as
> stdbottom,
>> >>>> >> >> >> MAX(bottom) as
> maxbottom,
>> >>>> >> >> >> MIN(bottom) as
> minbottom,
>> >>>> >> >> >> MAX(date_time) as
> "LastRun",
>> >>>> >> >> >> COUNT(*) as numonep,
>> >>>> >> >> >> COUNT(DISTINCT
> target_name_id, ep, lot_id,
>> >>>> >> >> >> data_file_id)-1 as reruns,
>> >>>> >> >> >> COUNT(DISTINCT(lot_id))
> as Lots,
>> >>>> >> >> >> 3*STD(bottom) as
> Wafer3Sigma,
>> >>>> >> >> >> MAX(bottom) -
> MIN(bottom) as Ranges
>> >>>> >> >> >> FROM data_cst
>> >>>> >> >> >> WHERE target_name_id IN (775,
> 776, 777, 778, 779,
>> >>>> >> >> >> 780, 45,
>> >>>> >> >> 44,
>> >>>> >> >> >> 116, 117, 118, 119, 120, 121)
>> >>>> >> >> >> AND data_cst.date_time BETWEEN
> '2010-03-04
>> 00:00:00'
>> >>>> AND
>> >>>> >> >> >> '2010-03-04 23:59:59'
>> >>>> >> >> >> GROUP BY target_name_id, ep,
> wafer_id
>> >>>> >> >> >> HAVING count(*) < 999) q1,
>> >>>> >> >> >> data_target
>> >>>> >> >> >> WHERE data_target.id = target_name_id
> GROUP BY
>> >>>> >> >> >> q1.target_name_id, q1.ep;
>> >>>> >> >> >>
>> >>>> >> >> >>
>> >>>> >> >> >> This works fine. But now I need to
> get a certain column
>> >>>> >> >> >> (image_measurer_id) with each row
> returned that
>> >>>> >> >> >> corresponds to
>> >>>> >> the
>> >>>> >> >> >> row from the group that has bottom =
> Min(bottom), bottom
>> =
>> >>>> >> >> >> Max(bottom), bottom closest to
> Avg(bottom), and bottom
>> >>>> >> >> >> from
>> >>>> the
>> >>>> >> >> >> row where date_time =
> Max(date_time).
>> >>>> >> >> >>
>> >>>> >> >> >> Is this even possible from one
> query?
>> >>>> >> >> >
>> >>>> >> >> >
>> >>>> >> >> > Might be, but what's the importance of
> doing it as one
>> query?
>> >>>> >> >>
>> >>>> >> >> Because it's part of a django based web app,
> and the class
>> >>>> >> >> that this is part of only supports having a
> single query. To
>> >>>> >> >> use multiple queries will require a fairly
> major rewrite of
>> >>>> >> >> the
>> >>>> server
>> >>>> >> >> side of
>> >>>> >> that app.
>> >>>> >> >>
>> >>>> >> >> > I'd start by
>> >>>> >> >> > saving this result to a temp table and
> developing the new
>> >>>> query.
>> >>>> >> >> > When that's running, see if you can to
> optimise a query
>> >>>> >> >> > built
>> >>>> by
>> >>>> >> >> > replacing the reference to the temp table
> with the
>> original
>> >>>> query.
>> >>>> >> >>
>> >>>> >> >> Thanks, I'll look into this.