List:Database Benchmarks« Previous MessageNext Message »
From:Peter Zaitsev Date:June 28 2005 5:08pm
Subject:Re: Really big tables?
View as plain text  
On Sat, 2005-06-25 at 06:08 -0700, Chris Kantarjiev wrote:
> Not *strictly* a benchmark issue as much as a performance issue - I can
> take this to the general list if that's more appropriate.
> We're running a moderately small machine with a lot of memory: OpenBSD 3.5
> on a 2.8GHz machine, RAID1 disk, 4GB RAM (which isn't being fully used).
> We don't have many simultaneous queries at all - there might be two
> applications going on at once (this is a back end operation).
> But there are lots of rows. 24 million in one of the bigger tables.

> (All these tables are myISAM - innodb was *way* slower.)
> ANALYZE TABLE can take two days in this configuration, and I'm trying
> to figure out how to make it go (much) faster. We're seeing 93 qps,
> which doesn't seem nearly enough. (Well, it's not enough, but
> I don't know if it's unreasonable.) That table has about 15GB of
> data; the indices are about 10GB.

That is very long time to run ANALYZE, unless your table is badly
fragmented.  In this case you need to run OPTIMIZE rather than analyze.

> The basic flow of operations is that we read in data from several
> sources and insert it with one field (link_id) set to NULL. The
> PK is indexed, as is link_id. On a regular basis, we grab bunches
> of rows WHERE LINK_ID IS NULL and chew on them, then UPDATE with
> a new value. This update is pretty disk bound. We have tried
> large batches, medium batches, small batches, with and without
> locking the table first. None was particularly enlightening in
> terms of speed.

Depending on your index value distribution performance may be very
expected.   Index updates may need to update random locations of BTREE 
(Index tree), which does not fit in memory in your case and so these can
be random IO operations.     Hard drive can do some 100-200 IO/sec  so
it is quite normal.  Your goal should be ether avoiding random BTREE
updates or making BTREE to fit in memory. 

For example may be you need to split your data into several tables
(possibly get MERGE table on them) and process them  one by one so your 
indexes always fit in  key_buffer variable. 

> The index on link_id is there because we do a lot of queries based on it -
> but after the initial set of updates, it's (almost) all read-only.
> It's probably reasonable to think of this as a data warehousing application
> rather than an on-line operation. It's never been entirely clear to
> me that an index that is so big is useful, since we spend a lot of time
> moving it in and out of memory.
> The data is partitioned across a lot of tables; I'd love it if we could
> avoid that partitioning, but then I think it will be completely 
> unmanageable. I suppose we could break it down even farther with
> some work.

If you have already split data across many tables why did you chose them
to be so large ?   In case of partitioning it is normally better to keep
tables managable (small enough so they fit in memory well enough for all
operations to be fast)

> I'm looking for two kinds of advice:
> 1) What can I do to make my current config perform better?
> 2) What should I do to build a new config that will really go? We're a
> little strapped for money now, but believe that will get better soon.
> (Yes, I'm thinking about Opteron.)

If you're Disk bound your problem is Disk IO (or memory) not CPU.  
64bit (Opteron) and amount of memory large enough to save you from being
disk bound is good. 

It could be it is possible to optimize your application to be less IO
bound by partitioning, schema changes  or flow changes 

> Or are we just asking mySQL to do things it wasn't meant to do?
> Here is a portion of /etc/my.cnf.
> skip-innodb
> skip-locking
> key_buffer = 512M
> key_buffer_size = 512M

There are same :)

> max_allowed_packet = 1M
> table_cache = 120
> sort_buffer_size = 2M
> read_buffer_size = 2M
> myisam_sort_buffer_size = 256M
> thread_cache =16 
> query_cache_size = 32M
> # Try number of CPU's*2 for thread_concurrency
> thread_concurrency = 2

Peter Zaitsev, Senior Performance Engineer

Really big tables?Chris Kantarjiev25 Jun
  • Re: Really big tables?Peter Zaitsev28 Jun
Re: Really big tables?Chris Kantarjiev28 Jun