List: MySQL++
From: Warren Young   Date: March 13 2009 7:36am
Subject: Re: binary bulk-insert
On Mar 12, 2009, at 11:13 PM, Andrej van der Zee wrote:

>> Does it serialize to strings? Does it bulk insert?

It does both.  It constructs large INSERT queries containing many  
rows, to reduce the per-packet processing overhead.  It also  
simplifies the task of doing the insertion: think of it as the  
opposite of Query::storein(), which you may already be using.
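
In code, the round trip looks something like this (an untested
sketch; it assumes the "stock" SSQLS from the examples and an open
Connection named con):

    std::vector<stock> rows;

    // storein(): table -> container
    mysqlpp::Query query = con.query("SELECT * FROM stock");
    query.storein(rows);

    // Two-iterator insert(): container -> table, built as a single
    // multi-row statement:
    //   INSERT INTO stock (...) VALUES (...),(...),(...)
    query.insert(rows.begin(), rows.end()).execute();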

Here's the new material in the svn version of the user manual that  
covers this:
> The two-iterator form of insert() has an associated risk: MySQL has  
> a limit on the size of the SQL query it will process. The default  
> limit is 1 MB. You can raise the limit, but the reason the limit is  
> configurable is not to allow huge numbers of inserts in a single  
> query. They made the limit configurable because a single row might  
> be bigger than 1 MB, so the default would prevent you from inserting  
> anything at all. If you raise the limit simply to be able to insert  
> more rows at once, you’re courting disaster with no compensating  
> benefit: the more data you send at a time, the greater the chance  
> and cost of something going wrong. Worse, this is pure risk, because  
> by the time you hit 1 MB, the per-packet overhead is such a small  
> fraction of the data being transferred that increasing the packet  
> size buys you essentially nothing.
>
> Let’s say you have a vector containing several megabytes of data; it  
> will get even bigger when expressed in SQL form, so there’s no way  
> you can insert it all in a single query without raising the MySQL  
> packet limit. One way to cope would be to write your own naïve loop,  
> inserting just one row at a time. This is slow, because you’re  
> paying the per-query cost for every row in the container. Then you  
> might realize that you could use the two-iterator form of insert(),  
> passing iterators expressing sub-ranges of the container instead of  
> trying to insert the whole container in one go. Now you’ve just got  
> to figure out how to calculate those sub-ranges to get efficient  
> operation without exceeding the packet size limit.
>
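If you went down that road, you'd end up with something like this
sketch, continuing the one above (CHUNK is a made-up number, and
sizing it correctly for your schema is exactly the hard part):

    // Hand-rolled chunking of insert(iter, iter).  Needs
    // <algorithm> for std::min.
    const std::size_t CHUNK = 1000;
    for (std::size_t i = 0; i < rows.size(); i += CHUNK) {
        std::vector<stock>::const_iterator start = rows.begin() + i;
        std::vector<stock>::const_iterator stop = rows.begin() +
                std::min(rows.size(), i + CHUNK);
        query.insert(start, stop).execute();
    }
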
> MySQL++ already knows how to do that, too, with Query::insertfrom().  
> We gave it a different name instead of adding yet another insert()  
> overload because it doesn’t merely build the INSERT query, which you  
> then execute(). It’s more like storein(), in that it wraps the  
> entire operation up in a single call. This feature is demonstrated  
> in examples/ssqls6.cpp:
>

Here's a link to that example's current code:

http://svn.gna.org/viewcvs/mysqlpp/trunk/examples/ssqls6.cpp?rev=2469&view=markup

Then, continuing:
> Most of the complexity in this example goes to just reading in the  
> data from a file; we have to get our test data from somewhere. There  
> are only two key lines of code: create an insertion policy object,  
> and pass it along with an STL container full of row data to  
> Query::insertfrom().
>
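Boiled down, those two key lines look about like this; this is from
memory, so check the linked ssqls6.cpp source rather than trusting
me:

    // Build an insertion policy object, then hand it and the row
    // container's range to insertfrom().  The 1000-byte cap is
    // deliberately tiny in the example, just to force it to issue
    // several INSERTs.
    mysqlpp::Query::MaxPacketInsertPolicy<> insert_policy(1000);
    query.insertfrom(stock_vector.begin(), stock_vector.end(),
            insert_policy);
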
> This policy object is the main thing that differentiates  
> insertfrom() from the two-iterator form of insert(). It controls how  
> insertfrom() builds the query strings, primarily how  
> large each query gets before insertfrom() executes it and starts  
> building a new query. We designed it to use policy objects because  
> there is no single “right” choice for the decisions it makes.
>
> MySQL++ ships with three different insertion policy classes, which  
> should cover most situations.
>
> MaxPacketInsertPolicy, demonstrated in the example above, does  
> things the most obvious way: when you create it, you pass the  
> maximum packet size, which it uses to prevent queries from going  
> over the size limit. It builds up a query string row by row,  
> checking each time through the loop whether adding another insert  
> statement to the query string would make the packet size go over the  
> limit. When that happens, or it gets to the end of the iteration  
> range, it executes the query and starts over if it’s not yet at the  
> end. This is robust, but it has a downside: it has to build each  
> insert query before it knows whether it can append it to the  
> larger query. Any time an insert query would push the packet over  
> the limit, it has to throw it away, causing the library to do more  
> work than is strictly necessary.
>
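In a real program you'd construct it with something near the
server's actual limit instead of the example's tiny 1000 bytes.
You can even ask the server for it; a sketch, again reusing the
variables above (needs <cstdlib> for strtoul):

    // Size the policy from the server's own max_allowed_packet,
    // leaving a little headroom.
    mysqlpp::Query vq = con.query(
            "SHOW VARIABLES LIKE 'max_allowed_packet'");
    mysqlpp::StoreQueryResult res = vq.store();
    unsigned long cap = res.empty() ? 1024 * 1024 :
            strtoul(res[0][1].c_str(), 0, 10);
    mysqlpp::Query::MaxPacketInsertPolicy<> policy(cap - 1024);
    query.insertfrom(rows.begin(), rows.end(), policy);
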
> Imagine you’ve done some benchmarking and have found that the point  
> of diminishing returns is at about 20 KB per query in your  
> environment; beyond that point, the per-query overhead ceases to be  
> an issue. Let’s also say you know for a fact that your largest row  
> will always be less than 1 MB minus 20 KB when expressed as a SQL  
> insert statement. In that case, you can use the more efficient  
> SizeThresholdInsertPolicy. It differs from MaxPacketInsertPolicy in  
> that it allows insertfrom() to insert rows blindly into the query  
> string until the built query exceeds the threshold, 20 KB in this  
> example. Then it ships the packet off, and if successful, starts a  
> new query. Thus, each query (except possibly the last) will be at  
> least 20 KB, exceeding that only by as much as one row’s worth of  
> data, minus one byte. This is quite appropriate behavior when your  
> rows are relatively small, as is typical for tables not containing  
> BLOB data. It is more efficient than MaxPacketInsertPolicy because  
> it never has to throw away any SQL fragments.
>
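Usage is identical apart from the policy type; a sketch:

    // Ship each query off once it crosses ~20 KB.  Unlike
    // MaxPacketInsertPolicy, no SQL fragment ever gets thrown away.
    mysqlpp::Query::SizeThresholdInsertPolicy<> policy(20 * 1024);
    query.insertfrom(rows.begin(), rows.end(), policy);
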
> The simplest policy object type is RowCountInsertPolicy. This lets  
> you simply say how many rows at a time to insert into the database.  
> This works well when you have a good handle on how big each row will  
> be, so you can calculate in advance how many rows you can insert at  
> once without exceeding some given limit. Say you know your rows  
> can’t be any bigger than about 1 KB. If we stick with that 20 KB  
> target, passing RowCountInsertPolicy<>(20) for the policy object  
> would ensure we never exceed the size threshold. Or, say the 1 KB  
> maximum above still holds, but we also know the average  
> row size is only 200 bytes. You could pass  
> RowCountInsertPolicy<>(100) for the policy, knowing that the average  
> packet size will be around 20 KB, and the worst case packet size 100  
> KB, still nowhere near the default 1 MB packet size limit. The code  
> for this policy is very simple, so it makes your program a little  
> smaller than if you used either of the above policies. Obviously  
> it’s a bad choice if you aren’t able to predict the size of your  
> rows accurately.
>
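Same call shape once more, using the 100-row arithmetic above; a
sketch:

    // 100 rows per INSERT: ~20 KB average, ~100 KB worst case,
    // given 200-byte average rows and a 1 KB maximum.
    mysqlpp::Query::RowCountInsertPolicy<> policy(100);
    query.insertfrom(rows.begin(), rows.end(), policy);
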
> If one of the provided insert policy classes doesn’t suit your  
> needs, you can easily create a custom one. Just study the  
> implementation in lib/insertpolicy.*.  
>
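
The interface a custom policy has to provide is small.  From memory
it's roughly the shape below, but trust lib/insertpolicy.* over my
recollection; the class itself is a made-up example:

    // Hypothetical custom policy capping each INSERT at 10 rows.
    template <class AccessController = mysqlpp::Transaction>
    class TenRowPolicy
    {
    public:
        // insertfrom() wraps each query in one of these, e.g. a
        // Transaction.
        typedef AccessController access_controller;

        TenRowPolicy() : count_(0) { }

        // Called before appending each row; returning false makes
        // insertfrom() execute the query built so far and start a
        // fresh one.
        template <class RowT>
        bool can_add(int /* current query length */, const RowT&)
        {
            if (++count_ > 10) {
                count_ = 0;
                return false;
            }
            return true;
        }

    private:
        unsigned count_;
    };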

I won't claim this is just as fast as a binary insert.  What I am  
suggesting is that maybe it's fast enough for your application.  It'll  
certainly be a lot faster than doing many one-row INSERTs, and a lot  
easier than iterating over insert(iter, iter) yourself.

If you find that it's not fast enough, and go ahead and write your own  
binary insertion code, please post the results and the test  
conditions.  One of my vague ideas for the far future of MySQL++ is a  
way to use SSQLS for binary data transfers.  That's unlikely to happen  
without someone proving it worthwhile first, though.