On 12/6/2011 01:40, Machiel Richards - Gmail wrote:
> Good day all
> I have someone who has asked me the following, however due to not having
> that many years of experience in these type of volumes, I am posting
> this as I know that someone will probably be able to answer it better
> than me.
> (This will also give me a learning opportunity to see what to do)
> _*Client Question: *_
> Well let me describe the issue.
> 1.I require to load records into a MySQL database table - no problem so
> far ;-)
> 2.The table represents "stock" that will be being searched and
> transacted (i.e. sold, which involves changing flags on the record) by a
> live system.
> 3.The stock table will be big --millions or tens of millions of rows
> 4.Stock is uniquely identified by two fields -- a supplier ID (numeric)
> and a serial number (varchar)
> 5.Transaction volumes may also be very high
> 6.Stock must be available to the system 24/7
> 7.I will need to replenish the stock table from a file, one or more
> times a day -- potentially loading tens or hundreds of thousands of rows
> each time
> 8.The DB will be a master-slave: reporting and recon files off the
> slave, transactions off the master (and presumably replenishment into
> I can go into a lot more detail about the process I am using (using an
> ETL tool called Talend) ... but the essential question is around
> strategies for doing this kind of dynamic loading:
> 1.How to insert data (high volumes) into the live table without locking
> it and affecting transaction performance (Insert low_priority?)
> 2.How to speed up inserts, even when there are two unique key
> constraints. My observation is obvious -- that inserts get slower and
> slower as the table grows (date based partitions of some kind maybe?).
> 3.General principles/ strategies in dealing with situations like this.
> Can someone please assist.
I can't give you precise details but I can point you in the right
directions. Your requirements are well-formed but they tend to
contradict each other. While there are no ways to completely remove the
contradictions, there are ways to minimize their impacts.
#5 High transaction volumes
#6 Available 24x7
#1,#7 Bulk updates of 10000+ records daily
These three are in conflict. Database changes require indexes to be
rebuilt. Index rebuilds can be fast (for small changes or small indexes)
or take a noticeable length of time (larger changes or larger indexes or
both). This means you may need to have two systems you flip-flop into
place to minimize your downtime. The same problem was solved by graphics
card manufacturers by creating multiple frame buffers. You can make your
'unavailability' time as short as possible by updating a passive copy of
the data while it is not being pointed to by your application front-end
then swapping the 'updated' set of data for the 'old' set of data by
either altering the virtual IPs of your sets of instances or by
redirecting which set your applications are pulling data from.
#8 System will be master-slave
My flip-flop idea implies that your system will have two sets of
master-slave(s) one carrying the 'current' data and one used to build
the 'new' set of data (with the imports). This also implies that your
'active' set will need to be replicating to your 'passive' set to keep
it in sync between bulk updates.
#2a Many records need to change in a day
#3 There will be millions of records
#2b Searches need to be fast
These conflict with each other too. The more records you add to a table,
the longer any indexed lookup will take. If you can't use the data in
memory in the index then a trip to the disk will be necessary to
retrieve the columns for your query. Combine this with the number of
queries at any one time and divide that by the maximum number of
random-access reads a physical disk can achieve and you may easily
exceed the capacity of any one disk storage system to supply. This
implies that you need to look at how to divide your storage among
several independent devices at the same time. Options abound: sharding,
partitioning, simple configuration changes (some tablespaces on one
device, some on others). Or, you can look at pricing solid-state disks
for your storage neeeds. Factoring in need #4, this suggests that a
partitioning scheme based on (supplier, serial#) may be a good first
So... After discussing the pain points of each of your requirements I
have the following mental image of a system:
a) two sets of master-slaves. The master of the passive set will be a
'slave' to the master of the active set.
b) data on each set is using InnoDB
c) data partitioned on the stock table based on (serial#, supplier) - I
chose that order because I think it will give a better random spread
among the partition tables and because I think it will be much more
common to ask 'which suppliers have part XXX' than it will be to say
'what are all the parts that supplier YYY has'.
As always, take this advice with a grain of salt and adjust this
possible design based on any other factors you did not include in your
list of requirements. It may even be possible (depending on the size of
your rows and other factors) that MySQL Cluster might be a better fit
for your requirements. I encourage you to engage with Cluster sales or
any reputable consultant to get an evaluation and their recommendation,
too. (disclaimer: I am not a cluster guru). I also encourage you to seek
multiple recommendations. Many different solutions to the same problems
you describe have been created by many different people. What works in
my mind may not work in all situations.
MySQL Principal Technical Support Engineer
Oracle USA, Inc. - Hardware and Software, Engineered to Work Together.
Office: Blountville, TN