>>>>> "Andrew" == Andrew Dunstan <andrewmd@stripped> writes:
Andrew> I have a table with about 700,000 rows as follows:
Andrew> k char(50) not null, v char(50) not null, primary key (k)
Andrew> I have a program that gets fed external data with about
Andrew> 2,000,000 k type values which must be filtered against this
Andrew> table (i.e. we must knock out from the 2m values those that
Andrew> appear in the table).
Andrew> I've been trying to benchmark this, and have the following data points:
Andrew> (this is all done in Perl/DBI using prepared statements for MySQL)
Andrew> Baseline: reading in external values and doing nothing: 654s
Andrew> Filtering against a BerkeleyDB file (identical data): 1094s
Andrew> MySQL Filtering
Andrew> (select * from t where k = ?) 3497s
Andrew> (select * from t where k in (?,? ...[25 entries])) 1905s
Andrew> (select * from t where k = ? or k = ? or ...[25 entries]) 1865s
Here's a follow-up to my hashing suggestion.

I built a file of 700,000 unique words and another of 2,000,000 words.
Half of the 2 million were in the 700,000, half not. With a hash table
of size 2,000,000, I can read in and hash all 700,000 words and then
check the 2,000,000 for hits in the hash table in 12 seconds. This is
on my laptop (300MHz Pentium + 64MB RAM + Linux). If your times above
really are seconds, you might want to try hashing ;-)

But it also depends on how often your target 700K words change, and on
how long you would need to (and could afford to) keep the hash table
in memory, etc.
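The hash-table filter above can be sketched like this (a minimal
Python version, not the poster's actual code; names are illustrative):

```python
# Build a hash table (a Python set) from the ~700K known keys once,
# then stream the ~2M candidate values and keep only the misses.

def filter_against_table(known_keys, candidates):
    """Return the candidates that do NOT appear in known_keys."""
    table = set(known_keys)  # hash table built once, O(1) lookups
    return [c for c in candidates if c not in table]

known = ["apple", "banana", "cherry"]
stream = ["banana", "date", "cherry", "elderberry"]
print(filter_against_table(known, stream))  # -> ['date', 'elderberry']
```

In practice the candidates would be read line by line from the 2M-word
file rather than held in a list, but the lookup loop is the same.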
I also did it with a bitmap for fun. I built a bitmap of length
4,000,000 bits (a 0.5 MB file) from the 700,000 words in 1.8 seconds.
Then mmap'ing the bitmap and checking all 2,000,000 words took 3.5
seconds.
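A rough sketch of the bitmap idea (the hash function and modulus here
are assumptions, not the poster's actual choices): hash each word to
one of 4,000,000 bit positions. A clear bit means the word is
definitely not in the table; a set bit may be a collision, which is
why the bitmap, unlike the hash table, cannot give a definitive yes.

```python
import zlib

NBITS = 4_000_000  # 4M bits -> a 0.5 MB bitmap, as in the test above

def bit_pos(word):
    # crc32 used as a cheap, stable hash (an assumption for this sketch)
    return zlib.crc32(word.encode()) % NBITS

def build_bitmap(words):
    bitmap = bytearray(NBITS // 8)
    for w in words:
        p = bit_pos(w)
        bitmap[p >> 3] |= 1 << (p & 7)  # set bit p
    return bitmap

def maybe_in_table(bitmap, word):
    # False -> definitely absent; True -> present OR a false positive
    p = bit_pos(word)
    return bool(bitmap[p >> 3] & (1 << (p & 7)))

bm = build_bitmap(["apple", "banana"])
print(maybe_in_table(bm, "banana"))  # a word we inserted: True
```

The 0.5 MB file could then be written out and mmap'ed for the checking
pass, as described above.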
Obviously the hashing solution is better, since it gives you a
definitive answer for all queries (the bitmap can return false
positives on collisions). Let me know if you want the code.