From: Claudio Nanni  Date: May 9 2012 3:34pm
Subject: Re: Re: Why is creating indexes faster after inserting massive data rows?
This thread is going on and on and on and on,
does anyone have time to actually measure I/O?
Let's make numbers talk.
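
Something like the sketch below would be one way to do that (illustrative
only; the connection details are made up): snapshot the InnoDB I/O counters
around each loading strategy and compare.

    import time
    import mysql.connector

    def innodb_io(cur):
        # SHOW GLOBAL STATUS returns (variable_name, value) pairs
        cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_data_%'")
        return dict(cur.fetchall())

    cnx = mysql.connector.connect(user='test', password='test',
                                  host='127.0.0.1', database='test')
    cur = cnx.cursor()

    before, t0 = innodb_io(cur), time.time()
    # ... run the load being measured here (indexes first, or indexes after) ...
    after, t1 = innodb_io(cur), time.time()

    for name in ('Innodb_data_reads', 'Innodb_data_writes'):
        print(name, int(after[name]) - int(before[name]))
    print('elapsed: %.1fs' % (t1 - t0))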

Claudio


2012/5/9 Rick James <rjames@stripped>

> A BTree that is small enough to be cached in RAM can be quickly
> maintained.  Even the “block splits” are not too costly without the I/O.
>
> A big file that needs sorting is more efficiently handled by a dedicated
> “sort merge” program.  A “big” INDEX on a table may be big enough to fall
> into this category.
>
> I/O is the most costly part of any of these operations.  My rule of thumb
> for MySQL SQL statements is:  If everything is cached, the query will run
> ten times as fast as it would if things have to be fetched from disk.
>
> Sortmerge works this way:
>
> 1.       Sort as much of the file as you can in RAM.  Write that sorted
> piece to disk.
>
> 2.       Repeat for the next chunk of the file.  Repeat until the input
> file is broken into sorted chunks.
>
> 3.       Now, “merge” those chunks together:  look at the next record from
> each, decide which is the “smallest”, and send it to the output.
>
> 4.       Repeat until finished with all the pieces.
> For a really big task, there may have to be more than one “merge” pass.
> Note how sort merge reads the input sequentially once, writes the output
> sequentially once, and has sequential I/O for each merge chunk.
> “Sequential” I/O is faster than “random” I/O, at least on
> traditional disks.  (SSDs are a different matter; I won’t go into that.)
> The “output” from the sortmerge is fed into code that builds the BTree for
> the table.  This building of the BTree is sequential:  fill one block, move
> on to the next block, and never have to go back.
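>
> Continuing the sketch above, the sequential build of the leaf level looks
> roughly like this (block size is made up):
>
>     BLOCK = 100   # keys per index block (made-up number)
>
>     def build_leaf_blocks(sorted_keys):
>         # Fill one block completely, then move on; a finished block is never
>         # revisited, so it can be written to disk exactly once.
>         block = []
>         for key in sorted_keys:
>             block.append(key)
>             if len(block) == BLOCK:
>                 yield block
>                 block = []
>         if block:
>             yield block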
>
> BTrees (when built randomly), if they need to spill to disk, will involve
> random I/O.  (And we are talking about an INDEX that is so big that it
> needs to spill to disk.)
>
> When a block “splits”, one full block becomes two half-full blocks.
>  Randomly filling a BTree leads to, on average, the index being 69% full.
>  This is not a big factor in the overall issue, but perhaps worth noting.
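>
> That ~69% (it is ln 2) is easy to reproduce with a toy leaf-split
> simulation (illustrative only; the block capacity is made up):
>
>     import bisect, random
>
>     CAP = 128                  # keys per leaf block (made-up number)
>     leaves = [[]]              # start with a single empty leaf
>     upper = [float('inf')]     # upper bound of each leaf's key range
>
>     for _ in range(1000000):
>         k = random.random()
>         i = bisect.bisect_left(upper, k)   # which leaf does this key land in?
>         bisect.insort(leaves[i], k)
>         if len(leaves[i]) > CAP:           # block split: one full block
>             full = leaves[i]               # becomes two half-full blocks
>             mid = len(full) // 2
>             leaves[i:i+1] = [full[:mid], full[mid:]]
>             upper[i:i+1] = [full[mid - 1], upper[i]]
>
>     fill = sum(map(len, leaves)) / float(len(leaves) * CAP)
>     print('average leaf fill: %.0f%%' % (100 * fill))   # comes out near 69%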
>
> How bad can it get?  Here’s an example.
>
> *         You have an INDEX on some random value, such as a GUID or MD5.
>
> *         The INDEX will be 5 times as big as you can fit in RAM.
>
> *         MySQL is adding to the BTree one row at a time (the
> non-sortmerge way).
> When it is nearly finished, only 1 of 5 updates to the BTree can be done
> immediately in RAM; 4 out of 5 updates to the BTree will have to hit disk.
>  If you are using normal disks, that is on the order of 125 rows per second
> that you can insert.  (A traditional disk handles roughly 100 random seeks
> per second, and at 0.8 disk hits per row that gives about 100 / 0.8 = 125
> rows per second.)
>
>
> From: Zhangzhigang [mailto:zzgang_2008@stripped]
> Sent: Tuesday, May 08, 2012 9:13 PM
> To: Rick James
> Cc: mysql@stripped
> Subject: Re: Why is creating indexes faster after inserting massive data
> rows?
>
> James...
> >* By doing all the indexes after building the table (or at least all the
> non-UNIQUE indexes), "sort merge" can be used.  This technique has been
> highly optimized over the past half-century, and is more efficient.
>
> I have a question about "sort merge":
>
> Why does it do the whole "sort merge"?
>
> In my opinion, it could just maintain the B tree and insert each key into a
> B tree node that still has room among its sorted keys, so the performance
> should be good.
>
> If it only does the "sort merge", the B tree data structure has to be
> created separately, which wastes some work.
>
> Does it?
>
>
> ________________________________
> From: Rick James <rjames@stripped>
> To: Johan De Meersman <vegivamp@stripped>;
> Zhangzhigang <zzgang_2008@stripped>
> Cc: "mysql@stripped" <mysql@stripped>
> Sent: Tuesday, May 8, 2012, 12:35 AM
> Subject: RE: Why is creating indexes faster after inserting massive data rows?
>
> * Batch INSERTs run faster than one-row-at-a-time (see the sketch after
> these bullets), but this is unrelated to INDEX updating speed.
> * The cache size is quite important to dealing with indexing during
> INSERT; see http://mysql.rjweb.org/doc.php/memory
> * Note that mysqldump sets up for an efficient creation of indexes after
> loading the data.  This is not practical (or necessarily efficient) when
> incrementally INSERTing into a table.
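>
> For the batch-INSERT point above, a minimal sketch with Connector/Python
> (illustrative only; the table, columns and data are made up):
>
>     import mysql.connector
>
>     cnx = mysql.connector.connect(user='test', password='test',
>                                   host='127.0.0.1', database='test')
>     cur = cnx.cursor()
>
>     rows = [(i, 'value %d' % i) for i in range(10000)]
>
>     # One-row-at-a-time: a round trip per row
>     # for r in rows:
>     #     cur.execute("INSERT INTO t (id, val) VALUES (%s, %s)", r)
>
>     # Batched: hand the driver all the rows at once
>     cur.executemany("INSERT INTO t (id, val) VALUES (%s, %s)", rows)
>     cnx.commit()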
>
> As for the original question...
> * Updating the index(es) for one row often involves random BTree
> traversals.  When the index(es) are too big to be cached, this can involve
> disk hit(s) for each row inserted.
> * By doing all the indexes after building the table (or at least all the
> non-UNIQUE indexes), "sort merge" can be used.  This technique has been
> highly optimized over the past half-century, and is more efficient.
>
>
> > -----Original Message-----
> > From: Johan De Meersman [mailto:vegivamp@stripped]
> > Sent: Monday, May 07, 2012 1:29 AM
> > To: Zhangzhigang
> > Cc: mysql@stripped
> > Subject: Re: Why is creating indexes faster after inserting massive
> > data rows?
> >
> > ----- Original Message -----
> > > From: "Zhangzhigang" <zzgang_2008@stripped>
> > >
> > > Creating indexes after inserting massive numbers of data rows is faster
> > > than creating them before inserting the rows.
> > > Please tell me why.
> >
> > Plain and simple: the indices get updated after every insert statement,
> > whereas if you only create the index *after* the inserts, the index
> > gets created in a single operation, which is a lot more efficient.
> >
> > I seem to recall that inside of a transaction (thus, InnoDB or so) the
> > difference is markedly less; I might be wrong, though.
> >
> >
> > --
> > Bier met grenadyn
> > Is als mosterd by den wyn
> > Sy die't drinkt, is eene kwezel
> > Hy die't drinkt, is ras een ezel
> >
> > --
> > MySQL General Mailing List
> > For list archives: http://lists.mysql.com/mysql
> > To unsubscribe:    http://lists.mysql.com/mysql
>
>
>


-- 
Claudio
