From: Warren Young Date: March 13 2009 7:36am Subject: Re: binary bulk-insert List-Archive: http://lists.mysql.com/plusplus/8462 Message-Id: <294F0C8D-5DF3-4117-972B-CAD32F2A1B19@etr-usa.com> MIME-Version: 1.0 (Apple Message framework v930.3) Content-Type: text/plain; charset=WINDOWS-1252; format=flowed; delsp=yes Content-Transfer-Encoding: quoted-printable On Mar 12, 2009, at 11:13 PM, Andrej van der Zee wrote: >> Does it serialize to strings? Does it bulk insert? It does both. It constructs large INSERT queries containing many =20 rows, to reduce the per-packet processing overhead. It also =20 simplifies the task of doing the insertion: think of it as the =20 opposite of Query::storein(), which you may already be using. Here's the new material in the svn version of the user manual that =20 covers this: > The two-iterator form of insert() has an associated risk: MySQL has =20= > a limit on the size of the SQL query it will process. The default =20 > limit is 1 MB. You can raise the limit, but the reason the limit is =20= > configurable is not to allow huge numbers of inserts in a single =20 > query. They made the limit configurable because a single row might =20 > be bigger than 1 MB, so the default would prevent you from inserting =20= > anything at all. If you raise the limit simply to be able to insert =20= > more rows at once, you=92re courting disaster with no compensating =20 > benefit: the more data you send at a time, the greater the chance =20 > and cost of something going wrong. Worse, this is pure risk, because =20= > by the time you hit 1 MB, the per-packet overhead is such a small =20 > fraction of the data being transferred that increasing the packet =20 > size buys you essentially nothing. > > Let=92s say you have a vector containing several megabytes of data; it = =20 > will get even bigger when expressed in SQL form, so there=92s no way =20= > you can insert it all in a single query without raising the MySQL =20 > packet limit. One way to cope would be to write your own na=EFve loop, = =20 > inserting just one row at a time. This is slow, because you=92re =20 > paying the per-query cost for every row in the container. Then you =20 > might realize that you could use the two iterator form of insert(), =20= > passing iterators expressing sub-ranges of the container instead of =20= > trying to insert the whole container in one go. Now you=92ve just got =20= > to figure out how to calculate those sub-ranges to get efficient =20 > operation without exceeding the packet size limit. > > MySQL++ already knows how to do that, too, with Query::insertfrom(). =20= > We gave it a different name instead of adding yet another insert() =20 > overload because it doesn=92t merely build the INSERT query, which you = =20 > then execute(). It=92s more like storein(), in that it wraps the =20 > entire operation up in a single call. This feature is demonstrated =20 > in examples/ssqls6.cpp: > Here's a link to that example's current code: = http://svn.gna.org/viewcvs/mysqlpp/trunk/examples/ssqls6.cpp?rev=3D2469&vi= ew=3Dmarkup Then, continuing: > Most of the complexity in this example goes to just reading in the =20 > data from a file; we have to get our test data from somewhere. There =20= > are only two key lines of code: create an insertion policy object, =20 > and pass it along with an STL container full of row data to =20 > Query::insertfrom(). > > This policy object is the main thing that differentiates =20 > insertfrom() from the two-iterator form of insert(). It controls how =20= > insertfrom() builds the query strings, primarily controlling how =20 > large each query gets before insertfrom() executes it and starts =20 > building a new query. We designed it to use policy objects because =20 > there is no single =93right=94 choice for the decisions it makes. > > MySQL++ ships with three different insertion policy classes, which =20 > should cover most situations. > > MaxPacketInsertPolicy, demonstrated in the example above, does =20 > things the most obvious way: when you create it, you pass the =20 > maximum packet size, which it uses to prevent queries from going =20 > over the size limit. It builds up a query string row by row, =20 > checking each time through the loop whether adding another insert =20 > statement to the query string would make the packet size go over the =20= > limit. When that happens, or it gets to the end of the iteration =20 > range, it executes the query and starts over if it=92s not yet at the =20= > end. This is robust, but it has a downside: it has to build each =20 > insert query in advance of knowing that it can append it to the =20 > larger query. Any time an insert query would push the packet over =20 > the limit, it has to throw it away, causing the library to do more =20 > work than is strictly necessary. > > Imagine you=92ve done some benchmarking and have found that the point =20= > of diminishing returns is at about 20 KB per query in your =20 > environment; beyond that point, the per-query overhead ceases to be =20= > an issue. Let=92s also say you know for a fact that your largest row =20= > will always be less than 1 MB =97 less 20 KB =97 when expressed as a = SQL =20 > insert statement. In that case, you can use the more efficient =20 > SizeThresholdInsertPolicy. It differs fromMaxPacketInsertPolicy in =20 > that it allows insertfrom() to insert rows blindly into the query =20 > string until the built query exceeds the threshold, 20 KB in this =20 > example. Then it ships the packet off, and if successful, starts a =20 > new query. Thus, each query (except possibly the last) will be at =20 > least 20 KB, exceeding that only by as much as one row=92s worth of =20= > data, minus one byte. This is quite appropriate behavior when your =20 > rows are relatively small, as is typical for tables not containing =20 > BLOB data. It is more efficient than MaxPacketInsertPolicy because =20 > it never has to throw away any SQL fragments. > > The simplest policy object type is RowCountInsertPolicy. This lets =20 > you simply say how many rows at a time to insert into the database. =20= > This works well when you have a good handle on how big each row will =20= > be, so you can calculate in advance how many rows you can insert at =20= > once without exceeding some given limit. Say you know your rows =20 > can=92t be any bigger than about 1 KB. If we stick with that 20 KB =20 > target, passing RowCountInsertPolicy<>(20) for the policy object =20 > would ensure we never exceed the size threshold. Or, say that =20 > maximum size value above is still true, but we also know the average =20= > row size is only 200 bytes. You could pass =20 > RowCountInsertPolicy<>(100) for the policy, knowing that the average =20= > packet size will be around 20 KB, and the worst case packet size 100 =20= > KB, still nowhere near the default 1 MB packet size limit. The code =20= > for this policy is very simple, so it makes your program a little =20 > smaller than if you used either of the above policies. Obviously =20 > it=92s a bad choice if you aren=92t able to predict the size of your =20= > rows accurately. > > If one of the provided insert policy classes doesn=92t suit your =20 > needs, you can easily create a custom one. Just study the =20 > implementation inlib/insertpolicy.*. > I won't claim this is just as fast as a binary insert. What I am =20 suggesting is that maybe it's fast enough for your application. It'll =20= certainly be a lot faster than doing many one-row INSERTs, and a lot =20 easier than iterating over insert(iter, iter) yourself. If you find that it's not fast enough, and go ahead and write your own =20= binary insertion code, please post the results and the test =20 conditions. One of my vague ideas for the far future of MySQL++ is a =20= way to use SSQLS for binary data transfers. That's unlikely to happen =20= without someone proving it worthwhile first, though.=