List:General Discussion« Previous MessageNext Message »
From:Michael Stassen Date:June 1 2006 4:51pm
Subject:Re: Need help querying a database of polynomials
View as plain text  
Lew E. Lefton wrote:
 > Hi,
 >
 > I hope this is an appropriate place to ask this question, if you think
 > it is better suited for another list/forum, please let me know.
 >
 > I have a table that looks like this:
 >
 > mysql>  select polynomial_id, term_id from polynomial;
 > +---------------+---------+
 > | polynomial_id | term_id |
 > +---------------+---------+
 > |             1 |       1 |
 > |             1 |       2 |
 > |             1 |       3 |
 > |             1 |       4 |
 > |             2 |       1 |
 > |             2 |       2 |
 > |             2 |       3 |
 > |             2 |       4 |
 > |             2 |       5 |
 > |             3 |       1 |
 > |             3 |       2 |
 > |             3 |       3 |
 > |             3 |       5 |
 > +---------------+---------+
 >
 > which represents, say, three polynomials,
 >
 >   the first is a sum of 4 terms (term1 + term2 + term3 + term4),
 >   the second is a sum of 5 terms (term1 + term2 + term3 + term4 + term5),
 >   the third is a sum of 4 terms (term1 + term2 + term3 + term5),
 >   etc.
 >
 > I am storing the polynomials in this way because I may need to store
 > very large polynomials.  The table may grow to millions of rows before
 > I'm done, with potentially many of the same terms appearing in many
 > different polynomials.  Thus I have the terms stored in a separate
 > table. Also, this method of storage makes the table easily searched
 > (e.g. "find all polynomials which have term 2").
 >
 > If I have another polynomial, say the sum of terms 1,3,4, and 5, how can
 > I quickly search this database to see if it's already been stored?
 > Actually, I would eventually like to have a function (using appropriate
 > API) which, when given a list of terms, returns the polynomial_id
 > regardless of whether it is the result of a new insert or a successful
 > lookup.
 >
 > I tried variations of this
 >    SELECT DISTINCT polynomial_id FROM polynomial
 >     WHERE term_id in ('1','2','3','4')
 > but I get
 >
 > +---------------+
 > | polynomial_id |
 > +---------------+
 > |             1 |
 > |             2 |
 > |             3 |
 > +---------------+
 >
 > when I really just wanted '1'. I suspect some subquery magic (e.g.
 > restricting to only those polynomials with exactly a count of 4 terms)
 > would give me a nice efficient solution, but I am not an SQL expert and
 > I have not been succesful in crafting the appropriate query.
 >
 > Thanks in advance for your help.  I am happy to provide more details if
 > necessary, but I have tried to distill the essence of the problem by this
 > simple example.
 >
 > Cheers,
 > Lew Lefton


Peter Brawley wrote:
> SELECT DISTINCT polynomial_id
> FROM polynomial p1
> INNER JOIN polynomial p2 ON p1.term_id=1 AND p2.term_id=3
> INNER JOIN polynomial p3 ON p2.term_id=3 AND p3.term_id=4
> INNER JOIN polynomial p4 ON p3.term_id=4 AND p4.term_id=5

That won't work, as each join is missing a crucial condition.  Also, as each 
joined table has a polynomial_id, you need to specify which one you want. 
Finally, if we get the joins right, "DISTINCT" shouldn't be needed.  I think you 
meant to say

#1
   SELECT p1.polynomial_id
   FROM polynomial p1
   JOIN polynomial p2 ON p1.polynomial_id = p2.polynomial_id AND p2.term_id=3
   JOIN polynomial p3 ON p2.polynomial_id = p3.polynomial_id AND p3.term_id=4
   JOIN polynomial p4 ON p3.polynomial_id = p4.polynomial_id AND p4.term_id=5
   WHERE p1.term_id=1;

I expect that this query, and most of the ones given below, would definitely 
benefit from a multi-column index on (polynomial_id, term_id).  Even then, 
however, I don't think this query will scale very well.  Adding a join per term 
rapidly gets expensive.  Here's an equivalent query without joins:

#2
   SELECT polynomial_id
   FROM polynomial
   WHERE term_id IN (1,3,4,5)
   GROUP BY polynomial_id
   HAVING COUNT(*) = 4;

It should scale much better (and it's easier to write).

Unfortunately, neither query is quite right.  Both will return all polynomials 
which *contain* terms 1, 3, 4, and 5, not the one polynomial (if it exists) 
consisting of precisely those terms and no others.  You would have to do a 
subsequent check of the returned polynomials to see if they had any more terms.

In other words, there are actually two conditions: The polynomial must have all 
the required terms, and it must have the right number of terms.  We need to 
modify the query to take both conditions into account.  This means we cannot put 
the matching terms condition into the WHERE clause, as I did above, because then 
we won't see any other terms.  Instead, we must count total terms and matching 
terms, like this:

#3
   SELECT polynomial_id,
          COUNT(*) AS terms,
          SUM(term_id IN (1,3,4,5)) as matched_terms
   FROM polynomial
   GROUP BY polynomial_id
   HAVING terms = 4 AND matched_terms = 4;

I wrote it that way to make it clear what I'm doing.  If you prefer not to see 
the counts, you can do it this way:

#4
   SELECT polynomial_id
   FROM polynomial
   GROUP BY polynomial_id
   HAVING COUNT(*) = 4 AND SUM(term_id IN (1,3,4,5)) = 4;

Now, that's almost certainly a full table scan (or full index scan).  Depending 
on your data and indices, it may be faster to store the results of my first 
query (#2 above) into a temporary table of candidates, then join that to the 
polynomial table and look for polynomial_ids with 4 rows.

   CREATE TEMPORARY TABLE candidates
   SELECT polynomial_id ...          <= query #2 from above

   SELECT c.polynomial_id
   FROM candidates c
   JOIN polynomial p ON c.polynomial_id = p.polynomial_id
   GROUP BY c.polynomial_id
   HAVING COUNT(*) = 4;

   DROP TABLE candidates;    #clean up

Michael




Thread
name 'Szczech' returns more rows then 'Szczec%'Lukasz Budnik31 May
  • Need help querying a database of polynomialsLew E. Lefton31 May
    • Re: Need help querying a database of polynomialsPeter Brawley31 May
      • Re: Need help querying a database of polynomialsPooly31 May
      • Re: Need help querying a database of polynomialsMichael Stassen1 Jun
  • Re: name 'Szczech' returns more rows then 'Szczec%'sheeri kritzer31 May
    • Re: name 'Szczech' returns more rows then 'Szczec%'Lukasz Budnik31 May
      • Re: name 'Szczech' returns more rows then 'Szczec%'Jake Peavy1 Jun
        • Re: name 'Szczech' returns more rows then 'Szczec%'Brendan Bouffler1 Jun
          • Re: name 'Szczech' returns more rows then 'Szczec%'Lukasz Budnik1 Jun
            • Re: name 'Szczech' returns more rows then 'Szczec%'Jake Peavy1 Jun
  • Re: name 'Szczech' returns more rows then 'Szczec%'Remo Tex1 Jun
    • Re: name 'Szczech' returns more rows then 'Szczec%'Lukasz Budnik1 Jun
      • Re: name 'Szczech' returns more rows then 'Szczec%'Remo Tex1 Jun
        • Re: name 'Szczech' returns more rows then 'Szczec%'Lukasz Budnik1 Jun
        • myisamchk (was: name 'Szczech' returns more rows then 'Szczec%')Chris Sansom4 Jun
          • Re: myisamchk (was: name 'Szczech' returns more rows then'Szczec%')Chris Sansom5 Jun
          • Re: myisamchk (was: name 'Szczech' returns more rows then 'Szczec%')gclark5 Jun
            • Re: myisamchk (was: name 'Szczech' returns more rows then 'Szczec%')Chris Sansom8 Jun