MySQL Lists are EOL. Please join:

List:General Discussion« Previous MessageNext Message »
From:Michael Stassen Date:February 21 2005 7:30pm
Subject:Re: select where multiple joined records match
View as plain text  
AM Thomas wrote:
> Hi there,
> 
> I'll answer your questions below, but I should tell you that it looks 
> like  even three or four "joins" seem to make this too slow.  
> Theoretically, the  user could select up to 15 or so GoalNumber values, 
> 12 for Grade, and  about 20 possibilities for Subject - clearly it 
> wouldn't be useful to  specify that many items in a search, but that's 
> the theoretical maximum; 4  or 5 values isn't unreasonable, though.
> 
> Four "AND"ed goal numbers plus a subject and a grade slowed the search 
> (on  the shared commercial web host I'm using) into the 3 minute range, 
> and  that's with a regular join, not a left join.  This is the SELECT 
> that took  about 3 minutes (3 trials, simplifying slightly each time, 
> simplest given  here):

You've removed necessary conditions on the JOINs, so you are getting lots of 
extra rows.  Furthermore, the logic isn't right, so I think this query will, 
in all likelihood, retrieve incorrect rows.  I'll explain.  For a given 
resource id,

> select r.id from resources as r
>   join resource_goals as g0 on (r.id=g0.ResourceID)

(adding the WHERE clause below), this retrieves every row in copy 0 with the 
right subject and grade, *regardless of goal*

>   join resource_goals as g1 on (r.id=g1.ResourceID and g1.GoalNumber=1)

and pairs it with every row in copy 1 with GoalNumber=1, *regardless of 
Subject or Grade*

>   join resource_goals as g2 on (r.id=g2.ResourceID and g2.GoalNumber=2)

and pairs it with every row in copy 2 with GoalNumber=2, *regardless of 
Subject or Grade*

>   join resource_goals as g3 on (r.id=g3.ResourceID and g3.GoalNumber=3)

and pairs it with every row in copy 3 with GoalNumber=3, *regardless of 
Subject or Grade*

>   join resource_goals as g4 on (r.id=g4.ResourceID and g4.GoalNumber=4)

and pairs it with every row in copy 4 with GoalNumber=4, *regardless of 
Subject or Grade*

> where ((g0.Subject='Social_Studies')
>   and (g0.Grade='4th'))
> group by r.id;

and finally, we pick one of those many rows to display (the effect of the 
GROUP BY r.id).

Do you see why that is both more than and different from what you want? 
Consider a resource with the following rows in resource_goals:

+------------+-------+----------------+------------+--------------+
| ResourceID | Grade | Subject        | GoalNumber | NumericGrade |
+------------+-------+----------------+------------+--------------+
|         14 | 4th   | Social_Studies |          7 |            4 |
|         14 | 1st   | English        |          1 |            1 |
|         14 | 2nd   | English        |          2 |            2 |
|         14 | 3rd   | English        |          3 |            3 |
|         14 | 5th   | History        |          4 |            5 |
+------------+-------+----------------+------------+--------------+

ResourceID 14 would be returned by your query, but isn't what you want.  I'd 
guess you haven't come across a case like this because you have few, if any, 
cross-subject resources, but I assume they are a possibility, since you have 
Subject part of the resource_goals table, rather than part of the resources 
table.

> The fastest time was 2 min 48 sec.
> Last time (simplest query) was 3 min 2 sec.

If we join each row in resources to a single row in each copy of 
resource_goals using an index, this should be reasonably fast, but I expect 
you are getting multiple matching rows in each copy, as it stands now.  The 
total resulting rows per id is the product of the matches in each copy.  If 
just 3 rows match your current conditions per copy, that would be 3^5 = 243 
rows per resource id, where we expect only 1!  In other words, I expect some 
of the slow down is due to the overhead of retrieving many times the number 
of desired rows.  The rest is probably lack of a suitable index.

As I understand it, you are looking for a resource for 4th grade Social 
Studies which meets goals 1 through 4.  In terms of your tables, that 
corresponds to having 4 rows in resources_goals, *all* of which have 
grade=4th and Subject='Social_Studies'.  That is, we need to look in *4* 
copies of resources_goals (not 5).  For a given resource id, we want exactly 
one row from each copy, namely, the row with the correct resource id, 
correct subject, correct grade, and desired goal number.  I think this 
should do:

   SELECT r.id
   FROM resources as r
   JOIN resource_goals as g1
     ON  r.id = g1.ResourceID
     AND g1.Subject = 'Social_Studies'
     AND g1.Grade = '4th'
     AND g1.GoalNumber = 1
   JOIN resource_goals as g2
     ON  r.id = g2.ResourceID
     AND g2.Subject = 'Social_Studies'
     AND g2.Grade = '4th'
     AND g2.GoalNumber = 2
   JOIN resource_goals as g3
     ON  r.id = g3.ResourceID
     AND g3.Subject = 'Social_Studies'
     AND g3.Grade = '4th'
     AND g3.GoalNumber = 3
   JOIN resource_goals as g4
     ON  r.id = g4.ResourceID
     AND g4.Subject = 'Social_Studies'
     AND g4.Grade = '4th'
     AND g4.GoalNumber = 4;

That should return 1 row per resource, so long as there are no duplicates 
rows in resource_goals.

This should do the same thing:

   SELECT ResourceID
   FROM resource_goals
   WHERE Subject = 'Social_Studies'
     AND Grade = '4th'
     AND GoalNumber IN (1, 2, 3, 4)
   GROUP BY ResourceID
   HAVING COUNT(*) = 4;

You'll have to try both to see which is faster.

> I'm really running out of time on this project, so I just went ahead 
> and  made the user interface such that users can only select one 
> subject,  grade, and/or goal number at a time.  It's probably a sound 
> decision from  a usability perspective, so I'm not too sad.  If I decide 
> to make this  work in the future, I'd probably have to just do a SELECT 
> for each "AND"ed  field, get the list of resource id's for each SELECT, 
> then find the  intersection of the lists in Perl.
> 
> If I could speed this up with some kind of indexing, I'd love to know  
> about it.

You mentioned before that the combination of ResourceID, Subject, Grade, and
GoalNumber will be unique.  If you have not already done so, you should add 
a UNIQUE INDEX on that combination of columns.

   ALTER TABLE resource_goals
   ADD UNIQUE INDEX (ResourceID, Subject, Grade, GoalNumber);

This will enforce that constraint, and either query I gave above could use 
it to speed things up.  If you wanr further comments on indexes, include the 
output from

   SHOW INDEXES FROM resources;
   SHOW INDEXES FROM resource_goals;

in your next message.

> The GROUP BY phrase is because I wanted just one row per resource.  It  
> seemed like I'd get a row for each condition/resource (didn't test it 
> with  the final ).  I'm actually doing SELECT * FROM... in my code, and 
> not  using the resource_goals information in my output (that's a 
> separate view  at present, generated by different Perl code).

When you get more rows than you expect, you should resist the urge to "fix" 
your query by adding DISTINCT or GROUP BY.  Extra rows usually mean you've 
missed a condition.  At best, such a query is inefficient, at worst, it 
gives wrong results.  Either way, DISTINCT and GROUP BY are usually 
cover-ups, not solutions.

I'd also suggest not selecting columns you don't need.  That adds overhead. 
  Just select r.id or r.*, if that's all you need.

> It seems to work fine without the NOT NULL parts, you're right.  I was  
> wondering about that, but was sleepy enough at the time that I didn't  
> trust my thinking.
> 
> Sorry about not including my table defs :-(  .  I guess I was just 
> hoping  for a general approach, and didn't realize that anyone would be 
> interested  enough to read all that detail and provide an exact 
> solution for me.  Of  course, now I realize that it would have 
> simplified our discussion.   Anyway, late but not never, and for help to 
> whoever finds this in the list  archives someday, here are my table defs 
> (you're right - the goal number  is a TINYINT):
> 
> (Below is an abridged version of the resources table ; it also contains  
> about 60 more TINYINT fields which are essentially used as booleans, 
> some  of which I hope to eliminate.  Yes, I could have used SET or 
> something,  but I didn't for various reasons.)
> 
> CREATE TABLE resources (
>   id INT UNSIGNED PRIMARY KEY,
>   Title TEXT,
>   ResourceType_THJHArticle TINYINT,
>   ResourceType_NIEArticle TINYINT,
>   DataEntryName TINYTEXT,
>   Date DATETIME,
>   Notes TEXT,
>   Made_Keywords TEXT);
> 
> CREATE TABLE  resource_goals (
>   goal_id INT UNSIGNED PRIMARY KEY,
>   ResourceID INT,
>   Grade TINYTEXT,
>   Subject TINYTEXT,
>   GoalNumber TINYINT,
>   NumericGrade TINYINT);

Some comments:

In general, tables with fixed-length rows are faster than tables with 
variable-length rows (with some wasted space as the tradeoff).  To make a 
table have fixed-length rows requires every column to be fixed length.  I 
doubt that's reasonable for table resources, but I'm guessing it's doable 
for table resource_goals.  Even if not, it's still a good idea not to make 
columns any bigger than they need to be.  From your example, I would expect 
that column Grade contains values such as '1st', '3rd', and '11th', so I 
would think a CHAR(4) would do.  TINYTEXT is 255 characters wide. 
Similarly, I doubt you have any Subjects which approach 255 characters in 
length.

I see another column named NumericGrade.  Is that simply the same info as is 
in column Grade, expressed as a number ('1st' = 1, '2nd' = 2, and so on)? 
If so, that's good news, for 2 reasons.  First, numeric comparisons are 
faster than string comparisons, so we would make a couple of changes to my 
advice above.  In that case, we'd use NumericGrade instead of Grade in the 
multi-column unique index, and we'd substitute NumericGrade = 4 for 
Grade='4th' in the queries.

Second, that would allow us to improve things by further normalizing your 
data.  Having two columns with the same information is considered a bad 
idea, as it is wasteful of space and can impact speed.  I assume the purpose 
of column Grade is to express the grade in a nicer form.  Instead, we make a 
new table

   CREATE TABLE grade (id TINYINT PRIMARY KEY, name CHAR(4), INDEX (name));

   INSERT INTO grade
    SELECT DISTINCT NumericGrade, Grade FROM resource_goals
    ORDER BY NumericGrade;

which holds the mapping from number to text in just a few rows.  Then we no 
longer need the Grade column in table resource_goals.

Similarly, we could normalize the Subjects:

   CREATE TABLE subject (id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                         name CHAR(25), INDEX (name));

   INSERT INTO subject (name)
     SELECT Subject FROM resource_goals GROUP BY subject;

   ALTER TABLE resource_goals ADD subject_id INT UNSIGNED NOT NULL;

   UPDATE resource_goals JOIN subject
       ON resource_goals.Subject = subject.name
   SET resource_goals.subject_id = subject.id;

Now we no longer need the Subject column in resource_goals.

   ALTER TABLE resource_goals
   DROP Grade,
   DROP Subject;

We may then wish to rename column NumericGrade to just grade:

   ALTER TABLE resource_goals
   CHANGE NumericGrade grade TINYINT;

At this point, the multi-column index I mentioned above would be defined 
this way:

   ALTER TABLE resource_goals
   ADD UNIQUE INDEX (ResourceID, subject_id, grade, GoalNumber);

In general, we now get the strings for grade and subject by joining to the 
new tables on their respective IDs.  In this particualr case, with about 12 
grades and 20 subjects, it should be simple to select the grade and subject 
IDs and names from their respective tables, then use that to build your form 
with drop-down menus which show the names, but use the corresponding IDs in 
any query.  Then, the queries above become (assuming 'Social_Studies' has ID 
= 14):

   SELECT r.id
   FROM resources as r
   JOIN resource_goals as g1
     ON  r.id = g1.ResourceID
     AND g1.subject_id = 14
     AND g1.grade = 4
     AND g1.GoalNumber = 1
   JOIN resource_goals as g2
     ON  r.id = g2.ResourceID
     AND g2.subject_id = 14
     AND g2.grade = 4
     AND g2.GoalNumber = 2
   JOIN resource_goals as g3
     ON  r.id = g3.ResourceID
     AND g3.subject_id = 14
     AND g3.grade = 4
     AND g3.GoalNumber = 3
   JOIN resource_goals as g4
     ON  r.id = g4.ResourceID
     AND g4.subject_id = 14
     AND g4.grade = 4
     AND g4.GoalNumber = 4;

or

   SELECT ResourceID
   FROM resource_goals
   WHERE subject_id = 14
     AND grade = 4
     AND GoalNumber IN (1, 2, 3, 4)
   GROUP BY ResourceID
   HAVING COUNT(*) = 4;

The latter method, by the way, allows some flexibility.  For example, you 
could change the HAVING clause to "HAVING COUNT(*) >= 3" to find resources 
which meet at least 3 of the specified goals.

> Thanks a bunch for your help; I'm finding this more interesting than I  
> thought I would.

You're welcome.

Michael
Thread
select where multiple joined records matchAM Thomas13 Feb
  • Re: select where multiple joined records matchPeter Brawley13 Feb
    • Re: select where multiple joined records matchMichael Stassen13 Feb
  • Re: select where multiple joined records matchMichael Stassen13 Feb
    • more complexity (was: select where multiple joined records match)AM Thomas14 Feb
  • RE: select where multiple joined records matchGordon14 Feb
    • Re: select where multiple joined records matchAM Thomas14 Feb
      • Re: select where multiple joined records matchJeremy Cole14 Feb
        • Re: select where multiple joined records matchAM Thomas15 Feb
          • Re: select where multiple joined records matchMichael Stassen16 Feb
            • Re: select where multiple joined records matchAM Thomas16 Feb
              • Re: select where multiple joined records matchMathew Ray17 Feb
              • Re: select where multiple joined records matchMichael Stassen21 Feb
                • Re: select where multiple joined records matchAM Thomas21 Feb
        • Re: select where multiple joined records matchMichael Stassen16 Feb
Re: more complexity (was: select where multiple joined records match)Harald Fuchs14 Feb