List: General Discussion
From: Christian Koetteritzsch  Date: June 20 2012 8:54pm
Subject: Re: Indexing about 40 Billion Entries
Sorry, I forgot the ORDER BY in the SELECT statement, so the correct 
SELECT statement is:

SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx 
ORDER BY overlap DESC
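
As an aside: with separate indexes on ruid1 and ruid2, the OR form may not
be able to use both indexes. One possible rewrite is a UNION, so each branch
can use its own index; a sketch (xxx is the same placeholder value as above):

(SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1 = xxx)
UNION   -- plain UNION (not UNION ALL): a row matching both conditions
        -- is returned once, matching the OR semantics
(SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid2 = xxx)
ORDER BY overlap DESC;  -- the trailing ORDER BY sorts the combined result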

On 20.06.2012 22:42, Christian Koetteritzsch wrote:
> Thanks for the information. It is no problem if it takes days or 
> weeks, because the server is dedicated to such tasks, which take time 
> and use lots of resources.
>
> On 20.06.2012 19:55, Rick James wrote:
>> Even if you get past the "REPAIR WITH KEYCACHE", the ALTER will still 
>> take days, maybe weeks.
>>
>> I strongly recommend you do not try to access that many rows 
>> directly.  Instead, build summary tables, and access them.  We can 
>> discuss further.
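>>
>> For example, a summary table might look something like this (a sketch
>> only; the table name, columns, and granularity are assumptions, not a
>> worked-out design):
>>
>> -- Hypothetical per-sentence summary: one small row per ruid, so a
>> -- lookup touches one row instead of scanning part of 40 billion.
>> CREATE TABLE l4_link_summary (
>>   ruid        INT NOT NULL,   -- sentence ID (appears as ruid1 or ruid2)
>>   num_links   INT NOT NULL,   -- rows in l4_link that mention it
>>   max_overlap INT NOT NULL,   -- largest overlap seen for it
>>   PRIMARY KEY (ruid)
>> ) ENGINE=MyISAM;
>>
>> -- Populated once, since the base table is static (illustrative only;
>> -- on 40 billion rows this would itself be a long-running job):
>> INSERT INTO l4_link_summary
>> SELECT ruid, COUNT(*), MAX(overlap)
>> FROM (SELECT ruid1 AS ruid, overlap FROM l4_link
>>       UNION ALL
>>       SELECT ruid2, overlap FROM l4_link) AS t
>> GROUP BY ruid;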
> Do you mean that I should make several partitions with, for example, 
> 1 billion entries each, and then take a union of the results from the 
> partitions?
>>
>> Unfortunately, you did not even include a PRIMARY KEY when you built 
>> the table.  This makes any operations slow.
> There is no PRIMARY KEY because ruid1 and ruid2 are not unique values.
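>
> (If each (ruid1, ruid2) pair occurred only once, the pair itself could
> serve as a composite PRIMARY KEY; a sketch under that assumption:
>
> -- Assumes no (ruid1, ruid2) pair repeats; fails with a duplicate-key
> -- error otherwise.
> ALTER TABLE l4_link ADD PRIMARY KEY (ruid1, ruid2);
>
> But I do not know whether the pairs are unique.)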
>>
>>> I need an index for ruid1 and for ruid2.
>> What will your SELECTs look like?  You may be better off with
>> INDEX(ruid1, ruid2, overlap), and
>> INDEX(ruid2, ruid1, overlap)
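>>
>> In one statement, that could look like this (a sketch; the index names
>> are made up, and combining both ADDs means a single table rebuild):
>>
>> ALTER TABLE l4_link
>>   ADD INDEX idx_r1 (ruid1, ruid2, overlap),  -- serves WHERE ruid1 = ...
>>   ADD INDEX idx_r2 (ruid2, ruid1, overlap);  -- serves WHERE ruid2 = ...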
>>
>> Will you be adding more rows to this table?  Or is it now static?
> No data will be added to this table, so it is a static table.  The 
> indexes, as I have planned them, will be
> INDEX(ruid1) and INDEX(ruid2).
>
> My SELECT looks like this:
>
> SELECT ruid1, ruid2, overlap FROM l4_link WHERE ruid1=xxx OR ruid2=xxx
>
> where xxx is the same int value in both conditions.
>
>> What are the semantics of the fields?
> The semantics behind these fields are that ruid1 and ruid2 are IDs 
> for sentences that have common words, and overlap is the number of 
> words they have in common.
>
> I hope the information I gave you is helpful. If you have more 
> questions, feel free to ask.
>
> Kind regards
>
> Christian
>>
>>
>>> -----Original Message-----
>>> From: Ananda Kumar [mailto:anandkl@stripped]
>>> Sent: Wednesday, June 20, 2012 4:37 AM
>>> To: Christian Koetteritzsch
>>> Cc: mysql@stripped
>>> Subject: Re: Indexing about 40 Billion Entries
>>>
>>> It looks like the value you gave for myisam_max_sort_file_size is not
>>> enough for the index creation, and hence it is doing a "REPAIR WITH
>>> KEYCACHE". Use the query below to find the minimum value required for
>>> myisam_max_sort_file_size to avoid "repair with keycache":
>>>
>>>
>>> select
>>>     a.index_name as index_name,
>>>     ceil(
>>>         (select count(*) from `db_name`.`table_name`) * (
>>>             @@global.myisam_data_pointer_size +
>>>             sum(
>>>                 ifnull(1 + `a`.`sub_part` *
>>>                            (`b`.`character_octet_length` / `b`.`character_maximum_length`),
>>>                     ifnull(1 + `b`.`character_octet_length`,
>>>                         case
>>>                             when `b`.`data_type` = 'tinyint'   then 1
>>>                             when `b`.`data_type` = 'smallint'  then 2
>>>                             when `b`.`data_type` = 'mediumint' then 3
>>>                             when `b`.`data_type` = 'int'       then 4
>>>                             when `b`.`data_type` = 'bigint'    then 8
>>>                             when `b`.`data_type` = 'float'     then 4
>>>                             when `b`.`data_type` = 'double'    then 8
>>>                             when `b`.`data_type` = 'real'      then 8
>>>                             when `b`.`data_type` = 'bit'       then 8
>>>                             when `b`.`data_type` = 'date'      then 3
>>>                             when `b`.`data_type` = 'datetime'  then 8
>>>                             when `b`.`data_type` = 'timestamp' then 4
>>>                             when `b`.`data_type` = 'time'      then 3
>>>                             when `b`.`data_type` = 'year'      then 1
>>>                             when `b`.`data_type` = 'enum'      then 2
>>>                             when `b`.`data_type` = 'set'       then 8
>>>                             when `b`.`data_type` = 'decimal'   then 8
>>>                         end
>>>                     )
>>>                 )
>>>                 + if(`a`.`nullable` = 'YES', 1, 0)
>>>                 + if(`b`.`character_octet_length` >= 255, 2, 0)
>>>             )
>>>         ) / 1048576
>>>     ) * 1048576 as `index_size`
>>> from `information_schema`.`statistics` `a`,
>>>      `information_schema`.`columns` `b`
>>> where
>>>     `a`.`table_name` = `b`.`table_name` and
>>>     `a`.`table_schema` = `b`.`table_schema` and
>>>     `a`.`column_name` = `b`.`column_name` and
>>>     `a`.`table_schema` = 'db_name' and
>>>     `a`.`table_name` = 'table_name'
>>> group by `a`.`index_name`
>>> order by `index_size` desc limit 1;
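>>>
>>> Once the query reports an index_size, apply at least that value before
>>> the ALTER; a sketch (the numbers are placeholders, not recommendations):
>>>
>>> SET GLOBAL myisam_max_sort_file_size = 1099511627776; -- placeholder: 1TB
>>> SET GLOBAL myisam_sort_buffer_size   = 268435456;     -- placeholder: 256MB
>>> -- Reconnect so the session picks up the new myisam_sort_buffer_size,
>>> -- then rebuild:
>>> ALTER TABLE l4_link ADD INDEX (ruid1), ADD INDEX (ruid2);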
>>>
>>> On Wed, Jun 20, 2012 at 3:15 PM, Christian Koetteritzsch <
>>> ckoetteritzsch@stripped> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> As the title says, I'm trying to index about 40 billion entries with two
>>>> indexes on a server with 16 cores and 128GB of RAM. The table is the one
>>>> below; it is a MyISAM table, and the *.MYD file is about 640GB.
>>>>
>>>> DROP TABLE IF EXISTS `l4_link`;
>>>> CREATE TABLE  `l4_link` (
>>>>   `ruid1` int NOT NULL,
>>>>   `ruid2` int NOT NULL,
>>>>   `overlap` int NOT NULL
>>>> ) ENGINE=MyISAM;
>>>>
>>>> I need an index for ruid1 and for ruid2.
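>>>>
>>>> The statement I plan to run is essentially this (a sketch; both indexes
>>>> in one ALTER, so the table is rebuilt only once):
>>>>
>>>> ALTER TABLE l4_link
>>>>   ADD INDEX (ruid1),
>>>>   ADD INDEX (ruid2);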
>>>>
>>>> The status for this table is the following:
>>>>
>>>>            Name: l4_link
>>>>          Engine: MyISAM
>>>>         Version: 10
>>>>      Row_format: Fixed
>>>>            Rows: 39806500262
>>>>   Avg_row_length: 17
>>>>     Data_length: 676710504454
>>>> Max_data_length: 4785074604081151
>>>>    Index_length: 1024
>>>>       Data_free: 0
>>>>   Auto_increment: NULL
>>>>     Create_time: 2012-06-19 14:51:29
>>>>     Update_time: 2012-06-19 16:26:35
>>>>      Check_time: NULL
>>>>       Collation: utf8_general_ci
>>>>        Checksum: NULL
>>>>   Create_options:
>>>>         Comment:
>>>>
>>>> The variables for myisam are the following:
>>>> mysql> show global variables like '%myisam%';
>>>> +---------------------------+----------------------+
>>>> | Variable_name             | Value                |
>>>> +---------------------------+----------------------+
>>>> | myisam_data_pointer_size  | 6                    |
>>>> | myisam_max_sort_file_size | 9223372036853727232  |
>>>> | myisam_mmap_size          | 18446744073709551615 |
>>>> | myisam_recover_options    | BACKUP               |
>>>> | myisam_repair_threads     | 1                    |
>>>> | myisam_sort_buffer_size   | 8388608              |
>>>> | myisam_stats_method       | nulls_unequal        |
>>>> | myisam_use_mmap           | OFF                  |
>>>> +---------------------------+----------------------+
>>>> 8 rows in set (0.00 sec)
>>>>
>>>> The temp folder has about 16TB of free space.
>>>>
>>>> When I start the indexing process, it copies the 640GB into a temp
>>>> file and then starts with "repair with keycache".
>>>> On the internet I found that if it says "repair with keycache" you
>>>> should increase "myisam_max_sort_file_size", but this didn't work:
>>>> it still says "repair with keycache" after it copies the data to the
>>>> temp dir.
>>>>
>>>> I hope you have an idea of how to fix this.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Christian
>>>>
>>>>
>>>> -- 
>>>> MySQL General Mailing List
>>>> For list archives: http://lists.mysql.com/mysql
>>>> To unsubscribe: http://lists.mysql.com/mysql
>>>>
>>>>
>
>
>


Thread
Indexing about 40 Billion Entries - Christian Koetteritzsch, 20 Jun
  • Re: Indexing about 40 Billion Entries - Ananda Kumar, 20 Jun
    • RE: Indexing about 40 Billion Entries - Rick James, 20 Jun
      • Re: Indexing about 40 Billion Entries - Christian Koetteritzsch, 20 Jun
        • Re: Indexing about 40 Billion Entries - Christian Koetteritzsch, 20 Jun
        • RE: Indexing about 40 Billion Entries - Rick James, 20 Jun
          • Re: Indexing about 40 Billion Entries - Christian Koetteritzsch, 21 Jun
            • Re: Indexing about 40 Billion Entries - mos, 21 Jun
              • Re: Indexing about 40 Billion Entries - Christian Koetteritzsch, 21 Jun
              • RE: Indexing about 40 Billion Entries - Rick James, 21 Jun
          • UNION and ORDER BY - hsv, 22 Jun
            • RE: UNION and ORDER BY - Rick James, 22 Jun
  • Re: Indexing about 40 Billion Entries - Shawn Green, 20 Jun
    • RE: Indexing about 40 Billion Entries - Rick James, 20 Jun
  • Re: Indexing about 40 Billion Entries - Brent Clark, 21 Jun