|List:||General Discussion||« Previous MessageNext Message »|
|From:||Zhangzhigang||Date:||May 10 2012 4:59am|
|Subject:||回复： 回复： Why is creating indexes f|
aster after inserting massive data rows?
|View as plain text|
>The “output” from the sortmerge is fed into code that builds the BTree for > the table. This building of the BTree is sequential – fill the first block, > move on to the next block, and never have to go back. James... Thanks for your > answer, so clearly. Firstly: I thought that the "block split" for building of the BTree > has to been done to do random I/O before accepting this answer. Now, i have known that > the mysql do the optimization to keep from "block split" by "sort merge" for building > BTree, so it does not do more "random" I/O. Secondly: It bypass BTree traversals, When > the index are too big to be cached which involves disk hit(s) fro each row > inserted. Thank you very much. Sincerely yours Zhigang > Zhang ________________________________ 发件人： Rick James > <rjames@stripped> 收件人： Zhangzhigang <zzgang_2008@stripped> > 抄送： "mysql@stripped" <mysql@stripped> 发送日期： > 2012年5月9日, 星期三, 下午 11:21 主题: RE: 回复： Why is creating indexes > faster after inserting massive data rows? A BTree that is small enough to be cached in > RAM can be quickly maintained. Even the “block splits” are not too > costly without the I/O. A big file that needs sorting – bigger than can be > cached in RAM – is more efficiently done with a dedicated “sort merge” > program. A “big” INDEX on a table may be big enough to fall into this > category. I/O is the most costly part of any of these operations. My rule of > thumb for MySQL SQL statements is: If everything is cached, the query will run ten > times as fast as it would if things have to be fetched from disk. Sortmerge works > this way: 1. Sort as much of the file as you can in > RAM. Write that sorted piece to disk. 2. Repeat > for the next chunk of the file. Repeat until the input file is broken into sorted > chunks. 3. Now, “merge” those chunks > together – take the first row from each, decide which is the > “smallest”, send it to the output 4. > Repeat until finished with all the pieces. For a really big task, there may have to be > more than on “merge” pass. Note how sort merge reads the input sequentially > once, writes the output sequentially once, and has sequential I/O for each merge > chunk. “Sequential” I/O is faster than “random” I/O – no arm > motion on traditional disks. (SSDs are a different matter; I won’t go into > that.) The “output” from the sortmerge is fed into code that builds the > BTree for the table. This building of the BTree is sequential – fill the first > block, move on to the next block, and never have to go back. BTrees (when built > randomly), if they need to spill to disk, will involve random I/O. (And we are > talking about an INDEX that is so big that it needs to spill to disk.) When a block > “splits”, one full block becomes two half-full blocks. Randomly filling > a BTree leads to, on average, the index being 69% full. This is not a big factor in > the overall issue, but perhaps worth noting. How bad can it get? Here’s > an example. · You have an INDEX on > some random value, such as a GUID or > MD5. · The INDEX will be 5 times as > big as you can fit in RAM. · MySQL > is adding to the BTree one row at a time (the non-sortmerge way) When it is nearly > finished, only 1 of 5 updates to the BTree can be done immediately in RAM; 4 out of 5 > updates to the BTree will have to hit disk. If you are using normal disks, that is > on the order of 125 rows per second that you can insert – Terrible! Sortmerge > is likely to average over 10,000. From:Zhangzhigang > [mailto:zzgang_2008@stripped] Sent: Tuesday, May 08, 2012 9:13 PM To: Rick James Cc: > mysql@stripped Subject: 回复：Why is creating indexes faster after inserting > massive data rows? James... >* By doing all the indexes after building the table > (or at least all the non-UNIQUE indexes), "sort merge" can be used. This technique > had been highly optimized over the past half-century, and is more efficient. I have > a question about "sort merge": Why does it do the all "sort merge"? In my > opinion, it just maintains the B tree and inserts one key into a B tree node which has > fewer sorted keys, so it is good performance. If it only does the "sort merge", the > B tree data structure have to been created separately. it wastes some > performance. Does > it? ________________________________ 发件人：Rick James > <rjames@stripped> 收件人：Johan De Meersman <vegivamp@stripped>; > Zhangzhigang <zzgang_2008@stripped> 抄送："mysql@stripped" > <mysql@stripped> 发送日期：2012年5月8日, 星期二, > 上午12:35 主题:RE: Why is creating indexes faster after inserting massive data > rows? * Batch INSERTs run faster than one-row-at-a-time, but this is unrelated to INDEX > updating speed. * The cache size is quite important to dealing with indexing during > INSERT; see http://mysql.rjweb.org/doc.php/memory * Note that mysqldump sets up for an > efficient creation of indexes after loading the data. This is not practical (or > necessarily efficient) when incremental INSERTing into a table. As for the original > question... * Updating the index(es) for one row often involves random BTree > traversals. When the index(es) are too big to be cached, this can involve disk > hit(s) for each row inserted. * By doing all the indexes after building the table (or at > least all the non-UNIQUE indexes), "sort merge" can be used. This technique had been > highly optimized over the past half-century, and is more efficient. > -----Original Message----- > From: Johan De Meersman [mailto:vegivamp@stripped] > Sent: Monday, May 07, 2012 1:29 AM > To: Zhangzhigang > Cc: mysql@stripped > Subject: Re: Why is creating indexes faster after inserting massive > data rows? > > ----- Original Message ----- > > From: "Zhangzhigang" <zzgang_2008@stripped> > > > > Creating indexes after inserting massive data rows is faster than > > before inserting data rows. > > Please tell me why. > > Plain and simple: the indices get updated after every insert statement, > whereas if you only create the index *after* the inserts, the index > gets created in a single operation, which is a lot more efficient. > > I seem to recall that inside of a transaction (thus, InnoDB or so) the > difference is markedly less; I might be wrong, though. > > > -- > Bier met grenadyn > Is als mosterd by den wyn > Sy die't drinkt, is eene kwezel > Hy die't drinkt, is ras een ezel > > -- > MySQL General Mailing List > For list archives: http://lists.mysql.com/mysql > To unsubscribe: http://lists.mysql.com/mysql