Shawn,
Thanks for helping on this. I really appreciate it.
>No problem!!
>
>Please post the structures of your "big_table" and your "url_table"
>(whatever you called them) and I will help you to rewrite step 4 to count
>how many times a URL appears in the "big_table".
mysql> desc internet_usage;
+--------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| uid | varchar(10) | YES | MUL | NULL | |
| time | datetime | YES | | NULL | |
| ip | varchar(20) | YES | | NULL | |
| action | varchar(20) | YES | | NULL | |
| urlid | int(11) | YES | | NULL | |
| size | int(11) | YES | | NULL | |
+--------+-------------+------+-----+---------+-------+
6 rows in set (0.03 sec)
mysql> desc url_table;
+----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+----------------+
| urlid | mediumint(9) | | PRI | NULL | auto_increment |
| url | text | YES | MUL | NULL | |
| hits | mediumint(9) | YES | | NULL | |
| category | varchar(50) | YES | | NULL | |
+----------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)
mysql>
>
>Is this bulk import a process you need to do repeatedly? if so we need to
>worry about updating the count column on the next batch import and not just
>re-creating it from the new data. That will change the query significantly.
Yes it will be a repeated process. Actually I have a backlog of 6 months
data to load!
Here's an example of what the data looks like:
mysql> select * from internet_usage limit 5;
+---------+---------------------+----------------+--------------+-------+------+
| uid | time | ip | action | urlid |
size |
+---------+---------------------+----------------+--------------+-------+------+
| n58396 | 2004-06-07 21:12:16 | 21.38.25.204 | TCP_TUNNELED | 5999 |
5297 |
| u344584 | 2004-06-07 21:07:12 | 21.33.136.74 | TCP_HIT | 4494 |
438 |
| - | 2004-06-07 21:07:02 | 21.38.92.76 | TCP_NC_MISS | 2134 |
771 |
| u524797 | 2004-06-07 21:03:27 | 21.32.25.41 | TCP_NC_MISS | 260 |
582 |
| - | 2004-06-07 21:09:13 | 21.201.130.240 | TCP_HIT | 3112 |
542 |
+---------+---------------------+----------------+--------------+-------+------+
mysql> select * from url_table limit 5;
+-------+-----------------------------------------------------------------------------------+------+---------------+
| urlid | url
| hits | category |
+-------+-----------------------------------------------------------------------------------+------+---------------+
| 1 |
http://www.bbc.co.uk/horoscopes/chinesehoroscopes/images/hp-snake.gif | NULL
| Entertainment |
| 2 | http://www.call18866.co.uk/images/logo.jpg
| NULL | none |
| 3 | http://m2.doubleclick.net/866421/0409_santoku_250.gif
| NULL | none |
| 4 | http://lysto1-dc02.ww.ad.ba.com/
| NULL | Travel |
| 5 | http://www.aboutscotland.com/edin/newstreetph/sitview.jpg
| NULL | Travel |
+-------+-----------------------------------------------------------------------------------+------+---------------+
5 rows in set (0.00 sec)
One other problem I'm having here is making the rows in internet_usage
unique. At the moment I have lots of duplicates, and I was trying to create
a temporary table but unfortunately got an error 27 (I think this refers to
a 2GB limit).
mysql> CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM
internet_usage;
ERROR 1030: Got error 27 from table handler
Is there another way of doing this?
>
>Yours,
>Shawn Green
>Database Administrator
>Unimin Corporation - Spruce Pine
>
> >
> >Mos forgot to populate the url_id column in your user table. I would use
> >his same process but re-arrange it like this:
> >
> >1) create table BIG_TABLE....
> >2) load data infile....
> >3) create table URL_TABLE (
> > url_id bigint not null auto_increment,
> > url varchar(25) not null primary key,
> > category ....
> > )
> >4) INSERT IGNORE URL_TABLE (url, category)
> > SELECT url,category
> > FROM BIG_TABLE
> >4) create table USER_TABLE (
> > user_id varchar?(...) primary key,
> > date_time datetime,
> > url_id bigint,
> > size int
> > )
> >
> >5) insert USER_TABLE
> > SELECT bt.userID, bt.datetime, u.url_id, bt.size
> > FROM BIG_TABLE bt
> > INNER JOIN URL_TABLE u
> > ON u.url = bt.url
> >
> >doing it this way lets the SQL engine handle the conversion of URLs to
> >their IDs in the USER_TABLE...see? No scripting required at all!
> >
> >Yours,
> >Shawn Green
> >Database Administrator
> >Unimin Corporation - Spruce Pine
> >
> >
>
>Shawn,
>
>Many thanks for your reply. It sounds like this might be quicker than my
>perl script which parses a 1GB log file and fills the database in 40 mins
>at
>the mo. (The actual parsing only takes 1.5 mins).
>There's one snag which I forgot about and that's in the url_table I have
>another column called hits which is the number of hits for each url. I'm
>terrible at SQL and wondered if you might be able to suggest a way of doing
>
>this with the method above?
>
>Thanks,
>
>JS.
>
>_________________________________________________________________
>It's fast, it's easy and it's free. Get MSN Messenger today!
>http://www.msn.co.uk/messenger
>
>
>
>
>
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/mysql?unsub=1
>
_________________________________________________________________
Want to block unwanted pop-ups? Download the free MSN Toolbar now!
http://toolbar.msn.co.uk/