List: General Discussion
From: Rafael Valenzuela
Date: December 7 2011 4:25pm
Subject: Re: Loading of large amounts of data
Hi,
I normally use PDI (Pentaho Data Integration); this software is similar to
Talend, or to a batch process written in C.

On 07/12/11 14:16, Shawn Green (MySQL) wrote:
> Hello Machiel,
> On 12/6/2011 01:40, Machiel Richards - Gmail wrote:
>> Good day all
>> I have someone who has asked me the following; however, due to not having
>> that many years of experience with these types of volumes, I am posting
>> this as I know that someone will probably be able to answer it better
>> than me.
>> (This will also give me a learning opportunity to see what to do)
>> _*Client Question: *_
>> Well let me describe the issue.
>> 1.I require to load records into a MySQL database table - no problem so
>> far ;-)
>> 2.The table represents "stock" that will be being searched and
>> transacted (i.e. sold, which involves changing flags on the record) by a
>> live system.
>> 3.The stock table will be big --millions or tens of millions of rows
>> 4.Stock is uniquely identified by two fields -- a supplier ID (numeric)
>> and a serial number (varchar)
>> 5.Transaction volumes may also be very high
>> 6.Stock must be available to the system 24/7
>> 7.I will need to replenish the stock table from a file, one or more
>> times a day -- potentially loading tens or hundreds of thousands of rows
>> each time
>> 8.The DB will be a master-slave: reporting and recon files off the
>> slave, transactions off the master (and presumably replenishment into
>> master)
>> I can go into a lot more detail about the process I am using (using an
>> ETL tool called Talend) ... but the essential question is around
>> strategies for doing this kind of dynamic loading:
>> 1.How to insert data (high volumes) into the live table without locking
>> it and affecting transaction performance (Insert low_priority?)
>> 2.How to speed up inserts, even when there are two unique key
>> constraints. My observation is obvious -- that inserts get slower and
>> slower as the table grows (date based partitions of some kind maybe?).
>> 3.General principles/ strategies in dealing with situations like this.
>> Can someone please assist.
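[Editor's note on question #1: the LOW_PRIORITY modifier only matters for engines that take table-level locks, such as MyISAM; InnoDB's row-level locking ignores it. A hedged sketch of both a low-priority insert and a bulk load follows -- the table, columns, and file path are illustrative assumptions, not details from this thread:]

```sql
-- LOW_PRIORITY delays the insert until no readers are using the table.
-- It only has an effect on table-locking engines (e.g. MyISAM);
-- it is effectively a no-op for InnoDB.
INSERT LOW_PRIORITY INTO stock (supplier_id, serial_number, status)
VALUES (42, 'SN-0001', 'available');

-- For bulk replenishment, LOAD DATA is usually far faster than
-- row-by-row INSERT statements:
LOAD DATA INFILE '/path/to/replenishment.csv'
  IGNORE                       -- skip rows violating the unique key
  INTO TABLE stock
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  (supplier_id, serial_number, status);
```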
> I can't give you precise details but I can point you in the right 
> directions. Your requirements are well-formed but they tend to 
> contradict each other. While there are no ways to completely remove 
> the contradictions, there are ways to minimize their impacts.
> #5 High transaction volumes
> #6 Available 24x7
> #1,#7 Bulk updates of 10000+ records daily
> These three are in conflict. Database changes require indexes to be 
> rebuilt. Index rebuilds can be fast (for small changes or small 
> indexes) or take a noticeable length of time (larger changes or larger 
> indexes or both). This means you may need to have two systems you 
> flip-flop into place to minimize your downtime. The same problem was 
> solved by graphics card manufacturers by creating multiple frame 
> buffers. You can make your 'unavailability' time as short as possible 
> by updating a passive copy of the data while it is not being pointed 
> to by your application front-end then swapping the 'updated' set of 
> data for the 'old' set of data by either altering the virtual IPs of 
> your sets of instances or by redirecting which set your applications 
> are pulling data from.
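[Editor's note: a common single-server variant of this flip-flop idea -- my addition, not something Shawn specifies -- is an atomic RENAME TABLE swap: bulk-load into a shadow table, then exchange it with the live table in one statement so readers never see a missing table. Table names here are illustrative:]

```sql
-- Build the refreshed data off to the side.
CREATE TABLE stock_new LIKE stock;

-- ... bulk-load stock_new here, with no impact on readers of `stock` ...

-- Swap both names in a single atomic statement.
RENAME TABLE stock TO stock_old, stock_new TO stock;
DROP TABLE stock_old;
```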
> #8 System will be master-slave
> My flip-flop idea implies that your system will have two sets of 
> master-slave(s) one carrying the 'current' data and one used to build 
> the 'new' set of data (with the imports). This also implies that your 
> 'active' set will need to be replicating to your 'passive' set to keep 
> it in sync between bulk updates.
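[Editor's note: keeping the passive set's master replicating from the active set's master uses the standard replication commands; every connection value below is a placeholder:]

```sql
-- Run on the passive set's master; all values are placeholders.
CHANGE MASTER TO
  MASTER_HOST = 'active-master.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 4;
START SLAVE;

-- Before a bulk import into the passive set, pause replication:
STOP SLAVE;
```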
> #2a Many records need to change in a day
> #3 There will be millions of records
> #2b Searches need to be fast
> These conflict with each other too. The more records you add to a 
> table, the longer any indexed lookup will take. If you can't use the 
> data in memory in the index then a trip to the disk will be necessary 
> to retrieve the columns for your query. Combine this with the number 
> of queries at any one time and divide that by the maximum number of 
> random-access reads a physical disk can achieve and you may easily 
> exceed the capacity of any one disk storage system to supply. This 
> implies that you need to look at how to divide your storage among 
> several independent devices at the same time. Options abound: 
> sharding, partitioning, simple configuration changes (some tablespaces 
> on one device, some on others). Or, you can look at pricing 
> solid-state disks for your storage needs. Factoring in need #4, this 
> suggests that a partitioning scheme based on (supplier, serial#) may 
> be a good first design choice.
> So... After discussing the pain points of each of your requirements I 
> have the following mental image of a system:
> a) two sets of master-slaves. The master of the passive set will be a 
> 'slave' to the master of the active set.
> b) data on each set is using InnoDB
> c) data partitioned on the stock table based on (serial#, supplier) - 
> I chose that order because I think it will give a better random spread 
> among the partition tables and because I think it will be much more 
> common to ask 'which suppliers have part XXX' than it will be to say 
> 'what are all the parts that supplier YYY has'.
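[Editor's note: the partitioned stock table Shawn sketches might look roughly like this. Column names, types, and the partition count are my assumptions. Note that MySQL requires every unique key on a partitioned table to include all partitioning columns, which the (serial_number, supplier_id) primary key satisfies here:]

```sql
CREATE TABLE stock (
  supplier_id   INT UNSIGNED NOT NULL,
  serial_number VARCHAR(64)  NOT NULL,
  status        ENUM('available','sold') NOT NULL DEFAULT 'available',
  -- Stock is uniquely identified by the (serial, supplier) pair.
  PRIMARY KEY (serial_number, supplier_id)
) ENGINE=InnoDB
-- KEY partitioning hashes the columns for a random spread of rows.
PARTITION BY KEY (serial_number, supplier_id)
PARTITIONS 16;
```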
> As always, take this advice with a grain of salt and adjust this 
> possible design based on any other factors you did not include in your 
> list of requirements. It may even be possible (depending on the size 
> of your rows and other factors) that MySQL Cluster might be a better 
> fit for your requirements.  I encourage you to engage with Cluster 
> sales or any reputable consultant to get an evaluation and their 
> recommendation, too. (disclaimer: I am not a cluster guru). I also 
> encourage you to seek multiple recommendations. Many different 
> solutions to the same problems you describe have been created by many 
> different people. What works in my mind may not work in all situations.
> Regards,
