Hey all,
I've been experimenting with the best way to store a large (~100GB) of data for retrieval.
Essentially, I'm storing 9 variables for approximately 1,000,000 locations a day for the
last ten years. This can work out at around 4MB a variable a day - but not all variables
are always present (could be null).
(If you don't care about the details, I'll summarize at the end of this email).
Inserting and retrieving from a MyISAM table seemed to be approaching impossible. (Not
totally surprising.) I originally had a table along the lines of:
create table basic_daily_report(
date DATE NOT NULL,
location_id MEDIUMINT UNSIGNED NOT NULL,
variable1 float,
variable2 float....
variable9 float,
primary key (date, location_id)
);
(Just a summary of the actual table)
With this I had a maxiumum table size of around 100GB - just barely enough to do it. I
expected I would end up segmenting by year, or something similar, as ugly as that is. I
tested InnoDB as an alternative to this, but we'll get to that in a second. Basically,
inserting a day's worth of data would take ages, and pretty much require an analyze table
for a couple hours every morning. Selecting was getting to be pretty slow, as well.
Eventually, I hit on the idea of including one row per day:
create table basic_daily_grid(
date DATE NOT NULL PRIMARY KEY,
variable1 MEDIUMBLOB,
variable2 MEDIUMBLOB ...
variable9 MEDIUMBLOB
);
And wrote a UDF such that you pass it the variable and a location, and it'll return the
exact value. This works well because every day has a constant number of locations in a
grid format, so it's simply an array lookup. So,
select grid_point(location_id, variable1) from basic_daily_grid where date=20030101
would return the right value for locationid. It turns out this is almost (95%) as fast as
the first version in selecting, but it has the added bonus of inserts now only take ~5
seconds per day!
Alas, after inserting 260 days (less than a year) I hit the MyISAM table size limit of
4.2GB - because a BLOB is a variable length field.
-----------------------------------------------------------------
I mention all the above in case someone has an alternative solution I'm looking over.
Possible solutions I've found are a) use InnoDB instead, b) work with the source to
create a new field type BLOBARRAY of a constant width instead of dynamic, c) work with
the source to somehow overcome the 4.2GB limit on a dynamic table.
c) Seems unlikely - if the actual developers can't do it, I probably can't
b) Seems possible, I assume no one saw a need for a constant width column of 4MB, so
hopefully its not too difficult
a) Was my first try. Inserting takes about twice as long as myisam... sure, I can deal
with that. Selecting a specific date is in the same ballpark as well, so little problem
there. What I'm having severe performance issues on are querys that group, or do a
count(*). For example:
mysql> explain select date from basic_daily_grid_innodb;
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
| basic_daily_grid_innodb | index | NULL | PRIMARY | 3 | NULL |301 | Using index |
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
mysql> select date from basic_daily_grid_innodb;
...
317 rows in set (0.00 sec)
mysql> explain select date, count(*) from basic_daily_grid_innodb group by date;
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
| table | type | possible_keys | key | key_len | ref | rows | Extra |
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
| basic_daily_grid_innodb | index | NULL | PRIMARY | 3 | NULL |301 | Using index |
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
mysql> explain select date, count(*) from basic_daily_grid_innodb group by date;
...
317 rows in set (2 min 54.95 sec)
I assume this is due to versioning or some other transactional feature. Or, is this a bug,
or am I doing something wrong? I don't quite see why grouping items that are all unique
should be that much slower than not grouping. I need InnoDB for the unlimited table size,
but I don't (Really) need transactions, commit/rollback, or checkpoints.
Any suggestions on solving this last hurdle? Its entirely likely I'll need to group by
year and average the results, or something similar - and at 3 minutes a pop, thats a
little high for my liking. Perhaps I'm expecting too much?
Thanks,
Nick Elliott