List:General Discussion« Previous MessageNext Message »
From:Shawn Green (MySQL) Date:December 7 2011 1:16pm
Subject:Re: Loading of large amounts of data
View as plain text  
Hello Machiel,

On 12/6/2011 01:40, Machiel Richards - Gmail wrote:
> Good day all
> I have someone who has asked me the following, however due to not having
> that many years of experience in these type of volumes, I am posting
> this as I know that someone will probably be able to answer it better
> than me.
> (This will also give me a learning opportunity to see what to do)
> _*Client Question: *_
> Well let me describe the issue.
> 1.I require to load records into a MySQL database table - no problem so
> far ;-)
> 2.The table represents "stock" that will be being searched and
> transacted (i.e. sold, which involves changing flags on the record) by a
> live system.
> 3.The stock table will be big --millions or tens of millions of rows
> 4.Stock is uniquely identified by two fields -- a supplier ID (numeric)
> and a serial number (varchar)
> 5.Transaction volumes may also be very high
> 6.Stock must be available to the system 24/7
> 7.I will need to replenish the stock table from a file, one or more
> times a day -- potentially loading tens or hundreds of thousands of rows
> each time
> 8.The DB will be a master-slave: reporting and recon files off the
> slave, transactions off the master (and presumably replenishment into
> master)
> I can go into a lot more detail about the process I am using (using an
> ETL tool called Talend) ... but the essential question is around
> strategies for doing this kind of dynamic loading:
> 1.How to insert data (high volumes) into the live table without locking
> it and affecting transaction performance (Insert low_priority?)
> 2.How to speed up inserts, even when there are two unique key
> constraints. My observation is obvious -- that inserts get slower and
> slower as the table grows (date based partitions of some kind maybe?).
> 3.General principles/ strategies in dealing with situations like this.
> Can someone please assist.

I can't give you precise details but I can point you in the right 
directions. Your requirements are well-formed but they tend to 
contradict each other. While there are no ways to completely remove the 
contradictions, there are ways to minimize their impacts.

#5 High transaction volumes
#6 Available 24x7
#1,#7 Bulk updates of 10000+ records daily

These three are in conflict. Database changes require indexes to be 
rebuilt. Index rebuilds can be fast (for small changes or small indexes) 
or take a noticeable length of time (larger changes or larger indexes or 
both). This means you may need to have two systems you flip-flop into 
place to minimize your downtime. The same problem was solved by graphics 
card manufacturers by creating multiple frame buffers. You can make your 
'unavailability' time as short as possible by updating a passive copy of 
the data while it is not being pointed to by your application front-end 
then swapping the 'updated' set of data for the 'old' set of data by 
either altering the virtual IPs of your sets of instances or by 
redirecting which set your applications are pulling data from.

#8 System will be master-slave
My flip-flop idea implies that your system will have two sets of 
master-slave(s) one carrying the 'current' data and one used to build 
the 'new' set of data (with the imports). This also implies that your 
'active' set will need to be replicating to your 'passive' set to keep 
it in sync between bulk updates.

#2a Many records need to change in a day
#3 There will be millions of records
#2b Searches need to be fast

These conflict with each other too. The more records you add to a table, 
the longer any indexed lookup will take. If you can't use the data in 
memory in the index then a trip to the disk will be necessary to 
retrieve the columns for your query. Combine this with the number of 
queries at any one time and divide that by the maximum number of 
random-access reads a physical disk can achieve and you may easily 
exceed the capacity of any one disk storage system to supply. This 
implies that you need to look at how to divide your storage among 
several independent devices at the same time. Options abound: sharding, 
partitioning, simple configuration changes (some tablespaces on one 
device, some on others). Or, you can look at pricing solid-state disks 
for your storage neeeds. Factoring in need #4, this suggests that a 
partitioning scheme based on (supplier, serial#) may be a good first 
design choice.

So... After discussing the pain points of each of your requirements I 
have the following mental image of a system:

a) two sets of master-slaves. The master of the passive set will be a 
'slave' to the master of the active set.
b) data on each set is using InnoDB
c) data partitioned on the stock table based on (serial#, supplier) - I 
chose that order because I think it will give a better random spread 
among the partition tables and because I think it will be much more 
common to ask 'which suppliers have part XXX' than it will be to say 
'what are all the parts that supplier YYY has'.

As always, take this advice with a grain of salt and adjust this 
possible design based on any other factors you did not include in your 
list of requirements. It may even be possible (depending on the size of 
your rows and other factors) that MySQL Cluster might be a better fit 
for your requirements.  I encourage you to engage with Cluster sales or 
any reputable consultant to get an evaluation and their recommendation, 
too. (disclaimer: I am not a cluster guru). I also encourage you to seek 
multiple recommendations. Many different solutions to the same problems 
you describe have been created by many different people. What works in 
my mind may not work in all situations.

Shawn Green
MySQL Principal Technical Support Engineer
Oracle USA, Inc. - Hardware and Software, Engineered to Work Together.
Office: Blountville, TN
Loading of large amounts of dataMachiel Richards - Gmail6 Dec
  • Re: Loading of large amounts of dataMySQL)7 Dec
    • Re: Loading of large amounts of dataRafael Valenzuela7 Dec