I normally use the PDI(pentaho data integration), this software is like
talent or one batch process in C .
On 07/12/11 14:16, Shawn Green (MySQL) wrote:
> Hello Machiel,
> On 12/6/2011 01:40, Machiel Richards - Gmail wrote:
>> Good day all
>> I have someone who has asked me the following, however due to not having
>> that many years of experience in these type of volumes, I am posting
>> this as I know that someone will probably be able to answer it better
>> than me.
>> (This will also give me a learning opportunity to see what to do)
>> _*Client Question: *_
>> Well let me describe the issue.
>> 1.I require to load records into a MySQL database table - no problem so
>> far ;-)
>> 2.The table represents "stock" that will be being searched and
>> transacted (i.e. sold, which involves changing flags on the record) by a
>> live system.
>> 3.The stock table will be big --millions or tens of millions of rows
>> 4.Stock is uniquely identified by two fields -- a supplier ID (numeric)
>> and a serial number (varchar)
>> 5.Transaction volumes may also be very high
>> 6.Stock must be available to the system 24/7
>> 7.I will need to replenish the stock table from a file, one or more
>> times a day -- potentially loading tens or hundreds of thousands of rows
>> each time
>> 8.The DB will be a master-slave: reporting and recon files off the
>> slave, transactions off the master (and presumably replenishment into
>> I can go into a lot more detail about the process I am using (using an
>> ETL tool called Talend) ... but the essential question is around
>> strategies for doing this kind of dynamic loading:
>> 1.How to insert data (high volumes) into the live table without locking
>> it and affecting transaction performance (Insert low_priority?)
>> 2.How to speed up inserts, even when there are two unique key
>> constraints. My observation is obvious -- that inserts get slower and
>> slower as the table grows (date based partitions of some kind maybe?).
>> 3.General principles/ strategies in dealing with situations like this.
>> Can someone please assist.
> I can't give you precise details but I can point you in the right
> directions. Your requirements are well-formed but they tend to
> contradict each other. While there are no ways to completely remove
> the contradictions, there are ways to minimize their impacts.
> #5 High transaction volumes
> #6 Available 24x7
> #1,#7 Bulk updates of 10000+ records daily
> These three are in conflict. Database changes require indexes to be
> rebuilt. Index rebuilds can be fast (for small changes or small
> indexes) or take a noticeable length of time (larger changes or larger
> indexes or both). This means you may need to have two systems you
> flip-flop into place to minimize your downtime. The same problem was
> solved by graphics card manufacturers by creating multiple frame
> buffers. You can make your 'unavailability' time as short as possible
> by updating a passive copy of the data while it is not being pointed
> to by your application front-end then swapping the 'updated' set of
> data for the 'old' set of data by either altering the virtual IPs of
> your sets of instances or by redirecting which set your applications
> are pulling data from.
> #8 System will be master-slave
> My flip-flop idea implies that your system will have two sets of
> master-slave(s) one carrying the 'current' data and one used to build
> the 'new' set of data (with the imports). This also implies that your
> 'active' set will need to be replicating to your 'passive' set to keep
> it in sync between bulk updates.
> #2a Many records need to change in a day
> #3 There will be millions of records
> #2b Searches need to be fast
> These conflict with each other too. The more records you add to a
> table, the longer any indexed lookup will take. If you can't use the
> data in memory in the index then a trip to the disk will be necessary
> to retrieve the columns for your query. Combine this with the number
> of queries at any one time and divide that by the maximum number of
> random-access reads a physical disk can achieve and you may easily
> exceed the capacity of any one disk storage system to supply. This
> implies that you need to look at how to divide your storage among
> several independent devices at the same time. Options abound:
> sharding, partitioning, simple configuration changes (some tablespaces
> on one device, some on others). Or, you can look at pricing
> solid-state disks for your storage neeeds. Factoring in need #4, this
> suggests that a partitioning scheme based on (supplier, serial#) may
> be a good first design choice.
> So... After discussing the pain points of each of your requirements I
> have the following mental image of a system:
> a) two sets of master-slaves. The master of the passive set will be a
> 'slave' to the master of the active set.
> b) data on each set is using InnoDB
> c) data partitioned on the stock table based on (serial#, supplier) -
> I chose that order because I think it will give a better random spread
> among the partition tables and because I think it will be much more
> common to ask 'which suppliers have part XXX' than it will be to say
> 'what are all the parts that supplier YYY has'.
> As always, take this advice with a grain of salt and adjust this
> possible design based on any other factors you did not include in your
> list of requirements. It may even be possible (depending on the size
> of your rows and other factors) that MySQL Cluster might be a better
> fit for your requirements. I encourage you to engage with Cluster
> sales or any reputable consultant to get an evaluation and their
> recommendation, too. (disclaimer: I am not a cluster guru). I also
> encourage you to seek multiple recommendations. Many different
> solutions to the same problems you describe have been created by many
> different people. What works in my mind may not work in all situations.