From: Karen Abgarian <abvk Date: May 7 2012 5:51pm Subject: =?utf-8?Q?Re=3A_=E5=9B=9E=E5=A4=8D=EF=BC=9A_Why_is_creating_inde?= =?utf-8?Q?xes_faster_after_inserting_massive_data_rows=3F?= List-Archive: http://lists.mysql.com/mysql/227346 Message-Id: <4BC02395-EA2C-48ED-8A18-84B64728CE9F@apple.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Good point about key buffer. I was only thinking about the table = updates for MyISAM, not indexes. The being stuck waiting for buffer = flush could also happen. However, for the table blocks this would be = the same issue as with load followed by index rebuild, and for the = indexes, it will have to be compared, performance-wise, with an expense = of sorting an equally sized index. =20 On May 7, 2012, at 10:40 AM, Rick James wrote: > (Correction to Karen's comments) > * MyISAM does all its index operations in the key_buffer, similar to = InnoDB and its buffer_pool. > * Yes, writes are delayed (in both engines), but not forever. If the = table is huge, you will eventually be stuck waiting for blocks to be = flushed from cache. > * If the table is small enough, all the I/O can be delayed, and done = only once. So yes, the in-memory cache may be faster.=20 >=20 > Based on this discussion, you should note that "random" indexes, such = as GUIDs, MD5s, etc, tend to=20 >=20 >=20 >> -----Original Message----- >> From: Karen Abgarian [mailto:abvk@stripped] >> Sent: Monday, May 07, 2012 10:31 AM >> To: mysql@stripped >> Subject: Re: =E5=9B=9E=E5=A4=8D=EF=BC=9A Why is creating indexes = faster after inserting >> massive data rows? >>=20 >> Hi, >>=20 >> A couple cents to this. >>=20 >> There isn't really a million of block writes. The record gets added >> to the block, but that gets modified in OS cache if we assume MyISAM >> tables and in the Innodb buffer if we assume InnoDB tables. In both >> cases, the actual writing does not take place and does not slow down >> the process. What does however happen for each operation, is >> processing the statement, locating the entries to update in the = index, >> index block splits and , for good reason, committing. >>=20 >> When it comes to creating an index, what needs to happen, is to read >> the whole table and to sort all rows by the index key. The latter >> process will be the most determining factor in answering the original >> question, because for the large tables the sort will have to do a lot >> of disk I/O. The point I am trying to make is there will be >> situations when creating indexes and then inserting the rows will be >> faster than creating an index afterwards. If we try to determine = such >> situations, we could notice that the likelihood of the sort going to >> disk increases with the amount of distinct values to be sorted. For >> this reason, my choice would be to create things like primary/unique >> keys beforehand unless I am certain that everything will fit in the >> available memory. >>=20 >> Peace >> Karen >>=20 >>=20 >>=20 >> On May 7, 2012, at 8:05 AM, Johan De Meersman wrote: >>=20 >>> ----- Original Message ----- >>>=20 >>>> From: "Zhangzhigang" >>>=20 >>>> Ok, Creating the index *after* the inserts, the index gets created >> in >>>> a single operation. >>>> But the indexes has to be updating row by row after the data rows >> has >>>> all been inserted. Does it work in this way? >>> No, when you create an index on an existing table (like after a mass >> insert), what happens is that the engine does a single full tablescan >> and builds the index in a single pass, which is a lot more performant >> than updating a single disk block for every record, for the simple >> reason that a single disk block can contain dozens of index entries. >>>=20 >>> Imagine that you insert one million rows, and you have 100 index >> entries in a disk block (random numbers, to make a point. Real = numbers >> will depend on storage, file system, index, et cetera). Obviously >> there's no way to write less than a single block to disk - that's how >> it works. >>>=20 >>> You can update your index for each record in turn. That means you >> will need to do 1 million index - and thus block - writes; plus >> additional reads for those blocks you don't have in memory - that's = the >> index cache. >>>=20 >>> Now, if you create a new index on an existing table, you are first = of >> all bypassing any index read operations - there *is* no index to = read, >> yet. Then the system is going to do a full tablescan - considered = slow, >> but you need all the data, so there's no better way anyway. The index >> will be built - in-memory as much as possible - and the system will >> automatically prefer to write only complete blocks - 10.000 of them. >> That's the exact same number of index blocks, but you only write each >> block once, so that's only 10.000 writes instead of 1.000.000. >>>=20 >>> Now there's a lot more at play, things like B-tree balancing and >> whatnot, but that's the basic picture. >>>=20 >>> -- >>>=20 >>> Bier met grenadyn >>> Is als mosterd by den wyn >>> Sy die't drinkt, is eene kwezel >>> Hy die't drinkt, is ras een ezel >>=20 >>=20 >> -- >> MySQL General Mailing List >> For list archives: http://lists.mysql.com/mysql >> To unsubscribe: http://lists.mysql.com/mysql >=20