List: General Discussion
From: Christian Koetteritzsch  Date: June 21 2012 7:04am
Subject: Re: Indexing about 40 Billion Entries
Thank you a lot. The first indexing process finished after about 13 
hours, so I think the problem is solved now.
I set the myisam_sort_buffer_size to 10GB.
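For reference, I assume the change amounts to something like this (value in bytes):

SET GLOBAL myisam_sort_buffer_size = 10 * 1024 * 1024 * 1024;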

For the query I will adjust it to your version.
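Concretely, I take it the adjusted query would look roughly like this (xxx again standing in for the ruid value; UNION DISTINCT on the assumption that one row can match on both columns, otherwise UNION ALL would do):

(SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1 = xxx)
UNION DISTINCT
(SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid2 = xxx)
ORDER BY overlap DESC;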

On 20.06.2012 23:32, Rick James wrote:
>> SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx ORDER BY overlap DESC
> MySQL does not optimize that kind of OR well.  This will run _much_ faster (with your
> two indexes):
>
> (
>     SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx
>     UNION
>     SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid2=xxx
> ) ORDER BY overlap DESC;
>
> Make it UNION ALL or UNION DISTINCT depending on whether xxx can be in both fields of
> one row.  UNION DISTINCT makes a pass over the temp table to dedup.
>
> Your version (OR) will do a table scan.
>
> My version will:
> 1. scan ruid1 index (.MYI) (1+ disk hits for "index range scan")
> 2. for each matching ruid1, fetch the data row (in .MYD) (1 hit/row); write to tmp
> table
> 3,4.  ditto for ruid2, appending to same tmp table
> 5. sort tmp table (ORDER BY) (probably 0 disk hits, due to using MEMORY)
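Once the indexes exist, I suppose I can confirm that access path with EXPLAIN on each half (xxx again a placeholder); if the plan shows ref access on the ruid1/ruid2 index rather than ALL, the table scan is gone:

EXPLAIN SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1 = xxx;
EXPLAIN SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid2 = xxx;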
>
> Once the indexes are in place...  Depending on the version you are using,
> key_buffer_size is limited to 4G or is not.  For your huge machine, 24G might be wise. 
> The key_buffer will help steps 1,3.  The rest of RAM will be available for the OS to cache
> the data blocks (2,4).
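If I read that right, on this 64-bit box that would be something along the lines of the following (assuming the running version allows a key_buffer above 4G; otherwise the value gets capped):

SET GLOBAL key_buffer_size = 24 * 1024 * 1024 * 1024;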
>
> The 16 cores -- A single query (even with the UNION I suggested) will use only one
> core.
>
> How many rows (average, max) do you expect from
>     SELECT ... FROM l4_link WHERE ruid1=xxx
> I ask, because that might run as slow as 100 rows/sec., simply because of the disk
> hits.

I'm expecting  an average of 5000 rows.
>
> PARTITIONing, per se, does not help performance.  There are only a few use cases
> where PARTITION shines.  (I have not seen such [yet] in your application.)
>
>> -----Original Message-----
>> From: Christian Koetteritzsch [mailto:ckoetteritzsch@stripped]
>> Sent: Wednesday, June 20, 2012 1:42 PM
>> To: Rick James
>> Cc: Ananda Kumar; mysql@stripped
>> Subject: Re: Indexing about 40 Billion Entries
>>
>> Thanks for the information. It is no problem if it takes days or weeks,
>> because the server is there specifically for tasks that take time and use
>> lots of resources.
>>
>> On 20.06.2012 19:55, Rick James wrote:
>>> Even if you get past the "REPAIR WITH KEYCACHE", the ALTER will still
>>> take days, maybe weeks.
>>> I strongly recommend you do not try to access that many rows
>>> directly.  Instead, build summary tables, and access them.  We can
>>> discuss further.
>> Did you mean that I should make several partitions of, for example, 1 billion
>> entries each and then make a union of the results from the partitions?
>>> Unfortunately, you did not even include a PRIMARY KEY when you built
>>> the table.  This makes any operations slow.
>> There is no PRIMARY KEY because ruid1 and ruid2 are not unique values.
>>>> I need an index for ruid1 and for ruid2.
>>> What will your SELECTs look like?  You may be better off with
>>> INDEX(ruid1, ruid2, overlap), and INDEX(ruid2, ruid1, overlap)
>>>
>>> Will you be adding more rows to this table?  Or is it now static?
>> No data will be added to this table, so it is a static table.  And the
>> indexes, as I have planned them, will be INDEX(ruid1) and INDEX(ruid2).
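For completeness, the statement I ran for that was essentially:

ALTER TABLE l4_link ADD INDEX (ruid1), ADD INDEX (ruid2);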
>>
>> My SELECT looks like this:
>>
>> SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx
>>
>> where xxx is the same int value in both places.
>>
>>> What are the semantics of the fields?
>> The semantics behind these fields are that ruid1 and ruid2 are IDs for
>> sentences that have words in common, and overlap is the number of words
>> they have in common.
>>
>> I hope the information I gave you is helpful. If you have any more
>> questions, feel free to ask.
>>
>> Kind regards
>>
>> Christian
>>>
>>>> -----Original Message-----
>>>> From: Ananda Kumar [mailto:anandkl@stripped]
>>>> Sent: Wednesday, June 20, 2012 4:37 AM
>>>> To: Christian Koetteritzsch
>>>> Cc: mysql@stripped
>>>> Subject: Re: Indexing about 40 Billion Entries
>>>>
>>>> It looks like the value that you gave for myisam_max_sort_file_size is
>>>> not enough for the index creation, and hence it is doing a "REPAIR WITH
>>>> KEYCACHE".
>>>>
>>>> Use the query below to find the minimum value required for
>>>> myisam_max_sort_file_size to avoid "repair with keycache":
>>>>
>>>>
>>>> select
>>>>     a.index_name as index_name,
>>>>     ceil(
>>>>         (select count(*) from `db_name`.`table_name`) * (
>>>>             @@global.myisam_data_pointer_size +
>>>>             sum(ifnull(1 + `a`.`sub_part` *
>>>>                     (`b`.`character_octet_length` / `b`.`character_maximum_length`),
>>>>                 ifnull(1 + `b`.`character_octet_length`,
>>>>                     case
>>>>                         when `b`.`data_type` = 'tinyint'   then 1
>>>>                         when `b`.`data_type` = 'smallint'  then 2
>>>>                         when `b`.`data_type` = 'mediumint' then 3
>>>>                         when `b`.`data_type` = 'int'       then 4
>>>>                         when `b`.`data_type` = 'bigint'    then 8
>>>>                         when `b`.`data_type` = 'float'     then 4
>>>>                         when `b`.`data_type` = 'double'    then 8
>>>>                         when `b`.`data_type` = 'real'      then 8
>>>>                         when `b`.`data_type` = 'bit'       then 8
>>>>                         when `b`.`data_type` = 'date'      then 3
>>>>                         when `b`.`data_type` = 'datetime'  then 8
>>>>                         when `b`.`data_type` = 'timestamp' then 4
>>>>                         when `b`.`data_type` = 'time'      then 3
>>>>                         when `b`.`data_type` = 'year'      then 1
>>>>                         when `b`.`data_type` = 'enum'      then 2
>>>>                         when `b`.`data_type` = 'set'       then 8
>>>>                         when `b`.`data_type` = 'decimal'   then 8
>>>>                     end
>>>>                 )
>>>>             )
>>>>             + if(`a`.`nullable` = 'YES', 1, 0)
>>>>             + if(`b`.`character_octet_length` >= 255, 2, 0)
>>>>         )
>>>>     ) / 1048576) * 1048576 as `index_size`
>>>> from `information_schema`.`statistics` `a`, `information_schema`.`columns` `b`
>>>> where
>>>>     `a`.`table_name` = `b`.`table_name` and
>>>>     `a`.`table_schema` = `b`.`table_schema` and
>>>>     `a`.`column_name` = `b`.`column_name` and
>>>>     `a`.`table_schema` = 'db_name' and
>>>>     `a`.`table_name` = 'table_name'
>>>> group by `a`.`index_name`
>>>> order by `index_size` desc limit 1;
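If I follow correctly, the index_size this returns is then the lower bound for the sort file limit, i.e. something like:

SET GLOBAL myisam_max_sort_file_size = <index_size returned above>;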
>>>>
>>>> On Wed, Jun 20, 2012 at 3:15 PM, Christian Koetteritzsch
>>>> <ckoetteritzsch@stripped> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> As the title says I'm trying to index 40 billion entries with two
>>>>> indexes on a server with 16 cores and 128GB RAM. The table is the
>>>>> one below and it is a myisam table. The *.myd file is about 640GB
>>>>>
>>>>> DROP TABLE IF EXISTS `l4_link`;
>>>>> CREATE TABLE  `l4_link` (
>>>>>    `ruid1` int NOT NULL,
>>>>>    `ruid2` int NOT NULL,
>>>>>    `overlap` int NOT NULL
>>>>> );
>>>>>
>>>>> I need an index for ruid1 and for ruid2.
>>>>>
>>>>> The status for this table is the following:
>>>>>
>>>>>             Name: l4_link
>>>>>           Engine: MyISAM
>>>>>          Version: 10
>>>>>       Row_format: Fixed
>>>>>             Rows: 39806500262
>>>>>    Avg_row_length: 17
>>>>>      Data_length: 676710504454
>>>>> Max_data_length: 4785074604081151
>>>>>     Index_length: 1024
>>>>>        Data_free: 0
>>>>>    Auto_increment: NULL
>>>>>      Create_time: 2012-06-19 14:51:29
>>>>>      Update_time: 2012-06-19 16:26:35
>>>>>       Check_time: NULL
>>>>>        Collation: utf8_general_ci
>>>>>         Checksum: NULL
>>>>>    Create_options:
>>>>>          Comment:
>>>>>
>>>>> The variables for myisam are the following:
>>>>> mysql> show global variables like '%myisam%';
>>>>> +---------------------------+----------------------+
>>>>> | Variable_name             | Value                |
>>>>> +---------------------------+----------------------+
>>>>> | myisam_data_pointer_size  | 6                    |
>>>>> | myisam_max_sort_file_size | 9223372036853727232  |
>>>>> | myisam_mmap_size          | 18446744073709551615 |
>>>>> | myisam_recover_options    | BACKUP               |
>>>>> | myisam_repair_threads     | 1                    |
>>>>> | myisam_sort_buffer_size   | 8388608              |
>>>>> | myisam_stats_method       | nulls_unequal        |
>>>>> | myisam_use_mmap           | OFF                  |
>>>>> +---------------------------+----------------------+
>>>>> 8 rows in set (0.00 sec)
>>>>>
>>>>> The temp folder has about 16 TB of free space.
>>>>>
>>>>> When I start the indexing process, it copies the 640GB into a temp
>>>>> file and then starts with "repair with keycache".
>>>>> On the internet I found that if it says "repair with keycache" you
>>>>> should increase "myisam_max_sort_file_size", but this didn't work.
>>>>> It still says "repair with keycache" after it copied the data to the
>>>>> temp dir.
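For reference, the kind of change I tried looks like this (the value being the effectively-unlimited maximum that shows up in the variable listing above):

SET GLOBAL myisam_max_sort_file_size = 9223372036853727232;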
>>>>>
>>>>> I hope you have any idea how to fix this.
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Christian
>>>>>
>>>>>
>>>>> --
>>>>> MySQL General Mailing List
>>>>> For list archives: http://lists.mysql.com/mysql
>>>>> To unsubscribe:    http://lists.mysql.com/mysql
>>>>>
>>>>>
>

