List:General Discussion« Previous MessageNext Message »
From:chas Date:April 22 1999 7:44am
Subject:intersection in SQL with only one table ?
View as plain text  
I'm creating a search engine and have the indexing
done as follows :


Table : "Documents"

docid  |  URL                           |  title  
----------------------------------------------------
45     | www.mydomain.com/dir/page.html |  Some page
46     | www.otherplace.com/index.html  |  Elsewhere
47     | www.otherplace.com/services    |  His services



Table : "Words"

wordid  |  word
--------------------
367     | superb
368     | world
369     | happy



And then the occurences of each word in each document (and the
number of occurences, used for accuracy purposes) are stored in
a third 'search' table :

Table : "search"

wordid  |  docid  |   numtimes
------------------------------
367     |  47     |   8
368     |  47     |   2
367     |  45     |   4



So, we can see that the word 'superb' (which has wordid '367') 
occurs 8 times in the page with URL www.otherplace.com/services
and 4 times in  www.mydomain.com/dir/page.html 

Does this seem like an intelligent enough way to index it ?
(Even though I'm taking care of culling stopwords, I realise 
that the 'search' table will soon become very large)

Now for my problem :
--------------------
Now, if the user searches for "Happy world", I can find all the 
document IDs where the wordid is 369 (ie. "happy") and then 
all the document IDs where the wordid is 368  (ie. "world")
but is it possible to use SQL to return the intersection of the
two ? (At the moment, I'm using Python but this doesn't seem
scalable)

For example, this is taken from my current database with
some common words whose word ID's are 49 and 45 :

mysql> select search.docid from search where wordid = 49
+-------+
| docid |
+-------+
|     1 |
|     3 |
|     9 |
|    11 |
|    17 |
|    19 |
|    25 |
|    27 |
|    33 |
|    35 |
|    41 |
|    43 |
+-------+
12 rows in set (0.02 sec)

mysql> select search.docid from search where wordid = 45
+-------+
| docid |
+-------+
|     1 |
|     6 |
|     9 |
|    14 |
|    17 |
|    22 |
|    25 |
|    30 |
|    33 |
|    38 |
|    41 |
|    46 |
+-------+
12 rows in set (0.02 sec)

Assuming that documents which have both words are more likely to be correct,
how can I find the intersection ?   I can't use either :
'select search.docid where wordid = 45 or wordid = 49' ... produces a UNION
or 
'select search.docid where wordid = 45 and wordid = 49' ... always EMPTY

A JOIN is the logical choice here but would require 2 different tables.
It seems silly to have 2 identical tables... and not scalable because I'm
going to wish to check for intersections of 3 or 4 words too (in case 
somebody searches on 'very happy world' etc)

Is this beyond SQL ? or is my schema totally wrong in the first place ? 

Thank you very much,

chas

Thread
phpMichael Farr22 Apr
  • Re: phpSasha Pachev22 Apr
  • Re: phpDavor Cengija22 Apr
    • intersection in SQL with only one table ?chas22 Apr
      • RE: intersection in SQL with only one table ?Don Read22 Apr
        • RE: intersection in SQL with only one table ?chas22 Apr
          • RE: intersection in SQL with only one table ?Don Read22 Apr
      • Re: intersection in SQL with only one table ?Robin Bowes22 Apr
        • Re: intersection in SQL with only one table ?Alex Krohn22 Apr
          • Re: intersection in SQL with only one table ?chas23 Apr
  • Re: intersection in SQL with only one table ?Robin Bowes22 Apr
Re: phpPhilip Diller22 Apr