From: Warren Young Date: October 14 2008 11:30am Subject: Re: Using MySQL++ with lots and lots of data List-Archive: http://lists.mysql.com/plusplus/8031 Message-Id: MIME-Version: 1.0 (Apple Message framework v929.2) Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit On Oct 13, 2008, at 3:22 PM, Rick Gutleber wrote: > Increasing the packet size limit I don't see how this would help. TCP's window size will be smaller than the MySQL packet size. Once that's filled, you can keep on pushing more data into the stack, and it will queue it for you (or block the sender, same thing), but it won't make the data go out any faster. MySQL has a packet size parameter to handle the case where a *single record* may be too big for a default packet. You increase it only to fix this problem, not to get better performance. Have you eliminated the possibility that disk I/O is the bottleneck? If you have less than a gigabit per second of data continually going between client and server, it could be disk I/O on either end holding you up, not packet overhead. It takes a real server -- as opposed to a desktop PC in a rack case -- to continually fill a gigabit pipe from disk. This is doubly the case on the DB server, which has more than simple disk I/O overhead. The DB server's disk subsystem has to be fast enough to do all the DB stuff on top of sinking a gigabit per second of data. > Is there any kind of performance gain that can be made by issuing > bulk INSERT statements over inserting records individually Yes, up to the point that you can keep the TCP window full, or at least continually "sated". Beyond that, no. > can MySQL++ be used to gain the performance advantages of bulk > INSERT statements without having to do a whole bunch of special > coding? You can use Query::insert(iter, iter) for this. You wrap it in a loop, passing a subranges of the vector on each iteration, covering the whole vector a chunk at a time. > I would love to contribute to creating a patch for such an > improvement if it makes sense. We accept your offer. :) A possible design: Add Query::insert(container, policy), on the pattern of Query::storein(). The policy parameter is a pointer to a policy object. If you pass 0 for the policy, it gives the same effect as Query::insert(con.begin(), con.end()): the whole container is expected to fit into a single MySQL packet. Can't make this a default parameter, else you create a possibility of ambiguity w.r.t. Query::insert(const SSQLS&). You want the ability to separate policy from function because Query can't hard-code details like the server's configured max packet size, the min/max/avg object size, etc. Policy object suggestions: 1. MaxPacketPolicy: create it with a size parameter, and it will pack as many records into the query as possible, up to that packet size limit. The difficulty here is that you probably have to build each values() fragment separately, then concatenate them, because there's no easier way to tell exactly how many bytes a given SSQLS holds in SQL terms. You'd be betting the extra memory overhead pays for the savings on disk and network I/O. 2. SizeThresholdPolicy: like above, but it doesn't check the size of the records before inserting them. It just keeps a running count of query size, and when it goes over, it sends the query. This works well when you give it a size on the order of the TCP window size, which should be much lower than the MySQL packet size. As long as the max record size is smaller than the MySQL max packet less the size threshold, you never overflow. Relative to the max-packet policy, it saves the cost of the extra data copy, but it loses the guaranteed safety of a hard upper packet size limit. 3. RecordCountPolicy: N records per query, every time. This just takes us further down the safety vs. simplicity axis toward simplicity. This design lets you try multiple things, benchmarking each, and deciding on one based on what you see in your particular situation.