List:General Discussion« Previous MessageNext Message »
From:beacker Date:December 4 2003 11:39pm
Subject:Large data set load and access
View as plain text  
Folks,
     Just completed an interesting task utilizing mySQL 4.0.16.  The
database I'm creating is some summary information from the GenBank
info from the NCBI.  I must say that I am quite impressed by the
performance that I am seeing.  The data set is pretty substantial,
consisting of almost 30M records.  Yet it was loaded from the text
file in less than 6 minutes.  The text file itself is about 1.6GB
in size.

     An aggregation of the data also took just under a minute as
shown in the information I've included with this message.  I was also
impressed by the time to select a particular record without the use
of an index on the column being selected upon.  I'm sure that would
change once I create an index on this column.

     For information, this test was done on a 1300 MHz RH Linux 7.3
system with 896MB of memory, and WDC ATA drives.  Not the top of the
line machine, but pretty respectable.
                   Brad Eacker (beacker@stripped)

Particulars:
           
mysql> create table gb_locus (
    ->         gbl_id          int primary key,
    ->         gbl_fileID      int,
    ->         gbl_locus       varchar(20),
    ->         gbl_size        int,
    ->         gbl_date        date,
    ->         gbl_phylum      char(3),
    ->         gbl_foffset     int
    -> );
Query OK, 0 rows affected (0.00 sec)

mysql> load data infile '/hda3/beacker/gene/genbank/gbl_locus.txt'
    -> into table gb_locus fields terminated by ',';
Query OK, 29830869 rows affected (5 min 44.68 sec)
Records: 29830869  Deleted: 0  Skipped: 0  Warnings: 0


Input file information:
[beacker@phoenix genbank]$ ls -l gbl_locus.txt
-rw-rw-r--    1 beacker  beacker  1583781135 Dec  4 13:41 gbl_locus.txt


Aggregation select:

mysql> select gbl_phylum, count(*) from gb_locus group by gbl_phylum;
+------------+----------+
| gbl_phylum | count(*) |
+------------+----------+
| BCT        |   210778 |
| CON        |    11472 |
| EST        | 18836635 |
| GSS        |  7585521 |
| HTC        |   148411 |
| HTG        |    68390 |
| INV        |   186924 |
| MAM        |    52858 |
| PAT        |  1345394 |
| PHG        |     2396 |
| PLN        |   368927 |
| PRI        |   302997 |
| ROD        |   115600 |
| STS        |   257403 |
| SYN        |    10988 |
| UNA        |     1093 |
| VRL        |   203738 |
| VRT        |   121344 |
+------------+----------+
18 rows in set (59.74 sec)

[root@phoenix gene]# ls -l gb_locus*  
-rw-rw----    1 mysql    mysql        8766 Dec  4 13:57 gb_locus.frm
-rw-rw----    1 mysql    mysql    1075530216 Dec  4 14:04 gb_locus.MYD
-rw-rw----    1 mysql    mysql    244406272 Dec  4 14:04 gb_locus.MYI


Single record selections:

mysql> select * from gb_locus where gbl_id = 14000000;
+----------+------------+-----------+----------+------------+------------+-------------+
| gbl_id   | gbl_fileID | gbl_locus | gbl_size | gbl_date   | gbl_phylum | gbl_foffset |
+----------+------------+-----------+----------+------------+------------+-------------+
| 14000000 |     310212 | AL556818  |     1027 | 2003-05-31 | EST        |    41517291 |
+----------+------------+-----------+----------+------------+------------+-------------+
1 row in set (0.04 sec)

mysql> select * from gb_locus where gbl_locus = 'AL556818';
+----------+------------+-----------+----------+------------+------------+-------------+
| gbl_id   | gbl_fileID | gbl_locus | gbl_size | gbl_date   | gbl_phylum | gbl_foffset |
+----------+------------+-----------+----------+------------+------------+-------------+
| 14000000 |     310212 | AL556818  |     1027 | 2003-05-31 | EST        |    41517291 |
+----------+------------+-----------+----------+------------+------------+-------------+
1 row in set (45.69 sec)

Thread
Large data set load and accessbeacker4 Dec
  • Re: Large data set load and accessChuck Gadd5 Dec
Re: Large data set load and accessbeacker5 Dec
  • Re: Large data set load and accessChuck Gadd5 Dec