List:General Discussion« Previous MessageNext Message »
From:SGreen Date:June 18 2004 2:50pm
Subject:Re: load data into 2 tables and set id
View as plain text  
JS,

I need one more piece of information to help make that query work for you,
I need the structure of the table that you use (or would use) to bulk
import those logs into.

If you are running out of room you may consider further normalizing you
data (which saves space, but creates more data maintenance steps).  I, too,
have had to deal with millions of rows of internet usage logs so I
understand your pain.

You can store your IP addresses in an INT and get them back in dotted
notation with the MySQL functions INET_ATON() and INET_NTOA().  That will
save you an average of 10 bytes PER ROW (it adds up when you are into the
millions of rows).

Create a Proxy_Action table (id tinyint, action varchar(20) ) and populate
it with all of the actions your proxy/firewall can log.  Then replace the
column "action varchar(20)" with "ProxyAction_id tinyint".  (I assume there
are less than 256 "action" messages available from your proxy?) That's 1
byte vs. 8 at the low end of your sample data (counting the null at the end
of the string).  First add the new column to the table, populate it with
the ID values from your new Action table, then drop the old column.

URLs consist (in a basic sense) of the server portion (to the left of the
first  single / ) and the path portion (right of the first / and left of a
? or #) and either a fragment (after the #) or a query string (after the ?)
I would at least split out the server portion into it's own table. For each
page request (assume 1 page and 9 pictures, all from the same server) that
would be 10 rows of log data that all contain the same chunk of similar
information. Reducing that heavily repeated portion of your data to an ID
number will greatly help reduce the size of your database.

About the non-uniqueness of your internet_usage table.  Even if the same
user visits the same URL multiple times (is that what you mean by repeated
records?) the times should all be slightly different.  If they are not
different, it is still possible that the same person requested the same
page twice or more during the same second (the auto-login feature of MSIE
comes to mind as one culprit). OR you could have multiple users all on the
same userID hitting the same page from different machines.... I guess I
would have to see more data to understand your problem better.

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



                                                                                          
                             
                      "J S"                                                               
                             
                      <vervoom@hotmail.        To:       SGreen@stripped             
                                
                      com>                     cc:       mysql@stripped         
                                
                                               Fax to:                                    
                             
                      06/18/2004 09:40         Subject:  Re: load data into 2 tables and
set id                         
                      AM                                                                  
                             
                                                                                          
                             
                                                                                          
                             




Shawn,

Thanks for helping on this. I really appreciate it.

>No problem!!
>
>Please post the structures of your "big_table" and your "url_table"
>(whatever you called them)  and I will help you to rewrite step 4 to count
>how many times a URL appears in the "big_table".

mysql> desc internet_usage;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| uid    | varchar(10) | YES  | MUL | NULL    |       |
| time   | datetime    | YES  |     | NULL    |       |
| ip     | varchar(20) | YES  |     | NULL    |       |
| action | varchar(20) | YES  |     | NULL    |       |
| urlid  | int(11)     | YES  |     | NULL    |       |
| size   | int(11)     | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
6 rows in set (0.03 sec)

mysql> desc url_table;
+----------+--------------+------+-----+---------+----------------+
| Field    | Type         | Null | Key | Default | Extra    |
+----------+--------------+------+-----+---------+----------------+
| urlid    | mediumint(9) |      | PRI | NULL    | auto_increment |
| url      | text         | YES  | MUL | NULL    |                |
| hits     | mediumint(9) | YES  |     | NULL    |                |
| category | varchar(50)  | YES  |     | NULL    |                |
+----------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)

mysql>

>
>Is this bulk import a process you need to do repeatedly? if so we need to
>worry about updating the count column on the next batch import and not
just
>re-creating it from the new data. That will change the query
significantly.

Yes it will be a repeated process. Actually I have a backlog of 6 months
data to load!

Here's an example of what the data looks like:

mysql> select * from internet_usage limit 5;
+---------+---------------------+----------------+--------------+-------+------+

| uid     | time                | ip             | action       | urlid |
size |
+---------+---------------------+----------------+--------------+-------+------+

| n58396  | 2004-06-07 21:12:16 | 21.38.25.204   | TCP_TUNNELED |  5999 |
5297 |
| u344584 | 2004-06-07 21:07:12 | 21.33.136.74   | TCP_HIT      |  4494 |
438 |
| -       | 2004-06-07 21:07:02 | 21.38.92.76    | TCP_NC_MISS  |  2134 |
771 |
| u524797 | 2004-06-07 21:03:27 | 21.32.25.41    | TCP_NC_MISS  |   260 |
582 |
| -       | 2004-06-07 21:09:13 | 21.201.130.240 | TCP_HIT      |  3112 |
542 |
+---------+---------------------+----------------+--------------+-------+------+


mysql> select * from url_table limit 5;
+-------+-----------------------------------------------------------------------------------+------+---------------+

| urlid | url

                | hits | category      |
+-------+-----------------------------------------------------------------------------------+------+---------------+

|     1 |
http://www.bbc.co.uk/horoscopes/chinesehoroscopes/images/hp-snake.gif |
NULL
| Entertainment |
|     2 | http://www.call18866.co.uk/images/logo.jpg

                | NULL | none          |
|     3 | http://m2.doubleclick.net/866421/0409_santoku_250.gif

                | NULL | none          |
|     4 | http://lysto1-dc02.ww.ad.ba.com/

                | NULL | Travel        |
|     5 | http://www.aboutscotland.com/edin/newstreetph/sitview.jpg

                | NULL | Travel        |
+-------+-----------------------------------------------------------------------------------+------+---------------+

5 rows in set (0.00 sec)


One other problem I'm having here is making the rows in internet_usage
unique. At the moment I have lots of duplicates, and I was trying to create

a temporary table but unfortunately got an error 27 (I think this refers to

a 2GB limit).

mysql> CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM
internet_usage;
ERROR 1030: Got error 27 from table handler

Is there another way of doing this?

>
>Yours,
>Shawn Green
>Database Administrator
>Unimin Corporation - Spruce Pine
>
> >
> >Mos forgot to populate the url_id column in your user table. I would use
> >his same process but re-arrange it like this:
> >
> >1) create table BIG_TABLE....
> >2) load data infile....
> >3) create table URL_TABLE (
> >       url_id bigint not null auto_increment,
> >       url varchar(25) not null primary key,
> >       category ....
> >       )
> >4) INSERT IGNORE URL_TABLE (url, category)
> >       SELECT url,category
> >       FROM BIG_TABLE
> >4) create table USER_TABLE (
> >       user_id varchar?(...) primary key,
> >       date_time datetime,
> >       url_id bigint,
> >       size int
> >       )
> >
> >5) insert USER_TABLE
> >       SELECT bt.userID, bt.datetime, u.url_id, bt.size
> >       FROM BIG_TABLE bt
> >       INNER JOIN URL_TABLE u
> >       ON u.url = bt.url
> >
> >doing it this way lets the SQL engine handle the conversion of URLs to
> >their IDs in the USER_TABLE...see? No scripting required at all!
> >
> >Yours,
> >Shawn Green
> >Database Administrator
> >Unimin Corporation - Spruce Pine
> >
> >
>
>Shawn,
>
>Many thanks for your reply. It sounds like this might be quicker than my
>perl script which parses a 1GB log file and fills the database in 40 mins
>at
>the mo. (The actual parsing only takes 1.5 mins).
>There's one snag which I forgot about and that's in the url_table I have
>another column called hits which is the number of hits for each url.  I'm
>terrible at SQL and wondered if you might be able to suggest a way of
doing
>
>this with the method above?
>
>Thanks,
>
>JS.
>
>_________________________________________________________________
>It's fast, it's easy and it's free. Get MSN Messenger today!
>http://www.msn.co.uk/messenger
>
>
>
>
>
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe:    http://lists.mysql.com/mysql?unsub=1
>

_________________________________________________________________
Want to block unwanted pop-ups? Download the free MSN Toolbar now!
http://toolbar.msn.co.uk/






Thread
load data into 2 tables and set idJ S9 Jun
  • Re: load data into 2 tables and set idWilliam R. Mussatto9 Jun
  • Re: load data into 2 tables and set idmos9 Jun
Re: load data into 2 tables and set idJ S9 Jun
  • Re: load data into 2 tables and set idWilliam R. Mussatto9 Jun
Re: load data into 2 tables and set idJ S10 Jun
Re: load data into 2 tables and set idSGreen10 Jun
Re: load data into 2 tables and set idJ S18 Jun
Re: load data into 2 tables and set idSGreen18 Jun
Re: load data into 2 tables and set idJ S18 Jun
Re: load data into 2 tables and set idSGreen18 Jun
Re: load data into 2 tables and set idJ S18 Jun
Re: load data into 2 tables and set idSGreen18 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idSGreen22 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idJ S23 Jun
Re: load data into 2 tables and set idSGreen23 Jun
Re: load data into 2 tables and set idJ S23 Jun
  • Re: load data into 2 tables and set idMichael Stassen23 Jun
Re: load data into 2 tables and set idSGreen23 Jun
Re: load data into 2 tables and set idJ S23 Jun
Re: load data into 2 tables and set idSGreen23 Jun
Re: load data into 2 tables and set idJ S26 Jun
Re: load data into 2 tables and set idJ S26 Jun