I'm creating a search engine and have the indexing
done as follows :
Table : "Documents"
docid | URL | title
----------------------------------------------------
45 | www.mydomain.com/dir/page.html | Some page
46 | www.otherplace.com/index.html | Elsewhere
47 | www.otherplace.com/services | His services
Table : "Words"
wordid | word
--------------------
367 | superb
368 | world
369 | happy
And then the occurences of each word in each document (and the
number of occurences, used for accuracy purposes) are stored in
a third 'search' table :
Table : "search"
wordid | docid | numtimes
------------------------------
367 | 47 | 8
368 | 47 | 2
367 | 45 | 4
So, we can see that the word 'superb' (which has wordid '367')
occurs 8 times in the page with URL www.otherplace.com/services
and 4 times in www.mydomain.com/dir/page.html
Does this seem like an intelligent enough way to index it ?
(Even though I'm taking care of culling stopwords, I realise
that the 'search' table will soon become very large)
Now for my problem :
--------------------
Now, if the user searches for "Happy world", I can find all the
document IDs where the wordid is 369 (ie. "happy") and then
all the document IDs where the wordid is 368 (ie. "world")
but is it possible to use SQL to return the intersection of the
two ? (At the moment, I'm using Python but this doesn't seem
scalable)
For example, this is taken from my current database with
some common words whose word ID's are 49 and 45 :
mysql> select search.docid from search where wordid = 49
+-------+
| docid |
+-------+
| 1 |
| 3 |
| 9 |
| 11 |
| 17 |
| 19 |
| 25 |
| 27 |
| 33 |
| 35 |
| 41 |
| 43 |
+-------+
12 rows in set (0.02 sec)
mysql> select search.docid from search where wordid = 45
+-------+
| docid |
+-------+
| 1 |
| 6 |
| 9 |
| 14 |
| 17 |
| 22 |
| 25 |
| 30 |
| 33 |
| 38 |
| 41 |
| 46 |
+-------+
12 rows in set (0.02 sec)
Assuming that documents which have both words are more likely to be correct,
how can I find the intersection ? I can't use either :
'select search.docid where wordid = 45 or wordid = 49' ... produces a UNION
or
'select search.docid where wordid = 45 and wordid = 49' ... always EMPTY
A JOIN is the logical choice here but would require 2 different tables.
It seems silly to have 2 identical tables... and not scalable because I'm
going to wish to check for intersections of 3 or 4 words too (in case
somebody searches on 'very happy world' etc)
Is this beyond SQL ? or is my schema totally wrong in the first place ?
Thank you very much,
chas