"Tommy Svensson \(InfoGrafix\)" <tommy@stripped> wrote on 06/07/2005
04:49:09 PM:
> Hi all you mysql gurus,
> I have 400 000 unique strings where each and every one of these strings
are
> associated with 1 - 50 (appr.) integer values.
> Now, pretty simple for you guys I guess, but how will I design my
> database to make a search interface against this data as rapid as
possible?
> My first guess for a table whould be
> TABLE list
> string int
> ======= =====
> string1 234
> string1 6323
> string1 343
> string2 313
> string2 9055
> ...
> string434 5445
> string434 12
> ...
>
> But I come to a grinding halt when I realize this table will be 400 000
rows
> big, times the sum over the number of associated integers for each
unique
> string... let's say each unique string always have 10 associated
integers.
> Then we have a table of 400 000 x 10 = 4 000 000 rows. Is this really
the
> best approach?
> A search would appropriately look like this
> Search: string434
> and produce the following output
> Result: 5445 12.
> I guess the sql could look like this:
> SELECT int FROM list WHERE string = 'string434'
> What would be the optimal select query if I would like to return ints
for
> strings that begin with say 'str'?
> And what if I wanted to search for strings where I only know the mid
part or
> the end of the string? How then would a performance stable sql command
look?
> If the above design suggestion is the optimal one, then I guess the
internal
> database workings do not contain duplicates of the string values as they
do
> in the table...? I'm kinda hoping a binary tree is built or something
> similar. Though I read somewhere that in order for the db to optimize
and
> create btrees and the like, some data must be indexed... but how? I
can't
> very well make the string column 'primary' since it isn't...?
> Very, very, very interested in hearing what you pros have to say! Thx a
lot
> for this forum!
> /Tommy
You are correct in your analysis that you would have four million strings
in your table if you only created one table. The solution to your problem
is the design technique called "normalization" which in its simplest
explanation is the process of preventing data duplication.
Create a table for your strings. Since they are all unique, this table
will top out at 400000 rows.
CREATE TABLE StringMaster (
id int unsigned auto_increment primary key
, stringvalue varchar(255)
, UNIQUE (stringvalue)
);
And a second table to associate the string values with their 1-50 integer
values
CREATE TABLE StringInteger (
StringMaster_ID int unsigned not null
, IntegerValue int not null
, PRIMARY KEY(StringMaster_ID, IntegerValue)
, KEY (IntegerValue, StringMaster_ID)
)
The primary key prevents the same integer value from being assigned to the
same string more than once (in case you don't want that restriction just
delete the word PRIMARY from the declaration). The second key is a
"covering key" or "covering index" to do reverse lookups (tell me all of
the strings associated with the integer 36). The other index will nearly
double the storage requirements for this table. The payoff is that it will
make your lookups VERY fast and it really won't take up that much room as
integer values only use about 4 bytes per value.
So let's estimate disk space:
StringMaster:
(255 chars + 4 bytes) * 400000 rows = (appx) 103600000 bytes (around 99
MB) max size
StringIntegers:
8 bytes * 400000 string values * 10 integers per string = 32000000 bytes
(about 30 MB) data
plus an additional 30 MB or so for the extra index
Even with another index on StringMaster you are looking at less that 250MB
total. This is trivial for MySQL to handle (even on a P-II with almost no
memory) and you should get great results. There are working MySQL
databases with BILLIONS of entries containing hundreds of gigabytes of
data and with the right equipment and design, they are performing very
nicely.
You asked about how to write a query that searches for only part of a
string. For that I kindly refer you to the excellent manual. It can tell
you so much more about the LIKE function and the RLIKE function and all of
the other things MySQL can do with strings than I have time to go over.
(It's pretty easy once you actually try it out and get a feel for it)
http://dev.mysql.com/doc/mysql/en/string-comparison-functions.html
http://dev.mysql.com/doc/mysql/en/regexp.html
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine