In the last episode (Oct 19), Barbara Ferrell said:
> We have 3.3 million text documents ( U.S. Patents) that we can put
> into any format needed to import into any database needed. Would
> Mysql be able to handle fast FULLTEXT searching? Each document has
> several fields that would need to be searched, for example: we would
> want to search all 3.3 million documents for the "database" and
> "relation", and sometimes limit the search to specific fields per
> document. .....the database is about 200 gigs.
Mysql 4.0's Fulltext support is not efficient enough for tables with
this many records. A better solution for now is to do your own
fulltext processing outside of mysql.
My solution is to have a mediumint auto_increment value on my main
table, then create a words table with this layout:
CREATE TABLE words (
id mediumint(9) NOT NULL auto_increment,
word varchar(25) NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY (word) )
doclist is an array of 4-byte integers representing the docid of each
document in the main table containing a particular word.
To add a document to the index, you split it into words, pull the
doclist for each word, add the document's ID to the doclist, and update
MySQL. For more efficient updates, you can process multiple docuemnts
at a time, cache the updated 'doclist' arrays between runs, and only
update the 'words' table at the end of the run. My method can process
about 100 documents/second with flushes every 2000 documents on a
p6/200. It would probably go 500 docs/sec if I only flush at the very
end of a run, since the majority of the time is spend pulling doclists
To process queries, you simply pull the doclists of each word in your
query and do set math on the arrays. Extremely fast.
My test table of 3.5 million 60-byte entries (250 MB table) creates a
50 MB words table with 213000 words. You can cut the size down by 50%
by simply compressing the doclist before storing it in mysql. By
contrast, creating a Mysql fulltext index on the same data creates a
500MB index, and is 100 times slower to search.