Shawn,
Thanks for your reply below. I found it extremely useful. I have followed
your instructions and got good results up to the url_visits table.
I have a perl script to parse the values out of the log. The log has
3,770,246 lines and is gzipped. I then applied your SQL statements with the
following times:
1. Parse script: 10m31
2. Load infile: 2m39
3. Insert url_servers and url_paths: 5m38
4. Insert url_visits 9m50
However when I ran the sql to insert the internet usage table, it just ran
until all the log space was filled up:
-rw-rw---- 1 mysql mysql 379694680 Jun 22 10:05 bulk_table.MYD
<====
-rw-rw---- 1 mysql mysql 1024 Jun 22 10:05 bulk_table.MYI
-rw-rw---- 1 mysql mysql 8760 Jun 22 09:59 bulk_table.frm
-rw-rw---- 1 mysql mysql 2114977792 Jun 22 11:11 internet_usage.MYD
<====
-rw-rw---- 1 mysql mysql 1024 Jun 22 10:34 internet_usage.MYI
-rw-rw---- 1 mysql mysql 8646 Jun 22 09:59 internet_usage.frm
-rw-rw---- 1 mysql mysql 33437600 Jun 22 10:23 url_paths.MYD
-rw-rw---- 1 mysql mysql 27696128 Jun 22 10:23 url_paths.MYI
-rw-rw---- 1 mysql mysql 8574 Jun 22 09:59 url_paths.frm
-rw-rw---- 1 mysql mysql 646160 Jun 22 10:19 url_servers.MYD
-rw-rw---- 1 mysql mysql 796672 Jun 22 10:19 url_servers.MYI
-rw-rw---- 1 mysql mysql 8578 Jun 22 09:59 url_servers.frm
-rw-rw---- 1 mysql mysql 119076844 Jun 22 10:32 url_visit.MYD
-rw-rw---- 1 mysql mysql 73026560 Jun 22 10:33 url_visit.MYI
-rw-rw---- 1 mysql mysql 8694 Jun 22 09:59 url_visit.frm
I'm sure something's not right because the internet Usage table is bigger
than the bulk table. I changed your sql a bit (on the last line with the
query). Could this be the problem?
INSERT internet_usage (uid,time,ip,urlid,size)
SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
FROM bulk_table bt
INNER JOIN url_servers us
ON us.server=bt.server
INNER JOIN url_paths up
ON up.path=bt.path
INNER JOIN url_visit uv
ON uv.url_server_ID=us.ID
AND uv.url_path_ID=up.ID
AND uv.query=bt.query;
Thanks again,
js.
>
>-------- welcome to a basic overview of bulk importing and normalizing as
>you go ------------
>
>[ author's note: if you are seeing this thread for the first time and
>certain items seem to be
>introduced out of context, please review all previous posts in this thread.
>There has been
>a lot of information already exchanged on this topic that I clipped
>out of this response. Thanks! -- SG]
>
>In an earlier post you said
> >The data values I have for each record are:
> >
> >user_id date_time size url category
>
> >for example:
> >
> >u752359 2004-04-02 12:33:04 3403 http://www.mysql.com/index.html
>business
> >
>
>To me that implies that you have a standard text log where each field is
>separated by a space and rows are delimited by a CRLF pair. To get that log
>into MySQL you will need to do something _like_ (I say "like" as you will
>most likely need to tweak it to accommodate your actual data)
>
>CREATE TABLE bulk_table (
> user varchar(10)
> , eventdate date
> , eventtime time
> , size int
> , url text
> , category varchar(50)
>);
>
>LOAD DATA INFILE <a proxy log file> INTO bulk_table FIELDS TERMINATED BY '
>' LINES TERMINATED BY '\r\n';
>
>Now we can start merging the logs into your data tables.
>
>To quote RFC 2396:
> The URI syntax does not require that the scheme-specific-part have
> any general structure or set of semantics which is common among all
> URI. However, a subset of URI do share a common syntax for
> representing hierarchical relationships within the namespace. This
> "generic URI" syntax consists of a sequence of four main components:
>
> <scheme>://<authority><path>?<query>
>
> each of which, except <scheme>, may be absent from a particular URI.
>
>That translates into 4 logical pieces we can split a URL into:
>1) the scheme -- HTTP, FTP, GOPHER, etc...
>2) the authority -- that is the server (www.yahoo.com) and any port numbers
>or login information
>3) the path -- /somefolder/somefile.whatever
>4) the query -- everything after the ?
>
>Not part of the generic URI is that bit known as a fragment (as identified
>by the #) it is ALWAYS at the end of the _entire URL_ (including the query)
>when it's used. I have always lumped those and queries into the same field.
>
>So you could create tables for each of those parts and get VERY normalized
>or you can partly normalize like this:
>
>ALTER TABLE bulk_table add server_split int not null default 0
> , add path_split int not null default 0
> , add server varchar(255)
> , add path varchar(255)
>
>UPDATE bulk_table
>set server_split = LOCATE('/', URL , 8)-1;
>
>UPDATE bulk_table
>SET path_split = if(LOCATE('?', URL, server_split)> 0, LOCATE('?', URL,
>server_split), LOCATE('#', URL, server_split)-1);
>
>UPDATE bulk_table
>set server=LEFT(URL, server_split )
>
>#those 4 new columns helped us to parse out the 3 major parts of the url
>#I added them to the table so that we would not have to keep recalculating
>those values later on
># if it turns out that adding the columns takes a LONG time, we can create
>this table with those columns
># already created and just not import to them (change the LOAD DATA INFILE
>statement slightly)
>
>CREATE TABLE IF NOT EXISTS url_servers (
>ID int not null auto_increment,
>server varchar(255) primary key,
>Key (ID)
>)
>
>CREATE TABLE IF NOT EXISTS url_paths (
>ID int not null auto_increment,
>path varchar(255) primary key,
>Key (ID)
>)
>
>INSERT IGNORE INTO url_servers(server)
>SELECT DISTINCT server
>FROM bulk_table
>
>INSERT IGNORE INTO url_paths (path)
>SELECT DISTINCT path
>FROM bulk_table
>
># at this point we have all of our new Servers and our Paths uniquely
>numbered
># but we are going to need a slightly different URL table to track visits.
>
>CREATE TABLE url_visit (
>urlid mediumint not null auto_increment primary key,
>url_server_ID int not null default 0,
>url_path_ID int not null default 0,
>querystring text default null,
>category varchar(50)default null,
>KEY(url_server_ID, url_path_ID)
>)
>
>## that last key is to speed up our joins to our _servers and _paths
>tables...
>
># we finally have enough information to insert to the visit table
>
>INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
>SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path), NULL),
>bt.category
>FROM bulk_table bt
>INNER JOIN url_servers us
> ON us.server = bt.server
>INNER JOIN url_paths up
> on up.path=bt.path
>
>## (see where the new pre-computed columns come in handy?) :-D
>
># and now we have enough information to load the internet_usage table.
>Though there isn't enough data in your sample
># source data to fill in all of the columns
>
>INSERT internet_usage ( uid,`time`,urlid, size)
>SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size
>FROM bulk_table bt
>INNER JOIN url_servers us
> ON us.server = bt.server
>INNER JOIN url_paths up
> ON up.path=bt.path
>INNER JOIN url_visit uv
> ON uv.url_server_ID = us.ID
> AND uv.url_path_ID = up.id
> AND uv.querystring = if(bt.path_split >0, SUBSTRING(bt.url,path),
>NULL)
>
>
>It may not be perfect but it's how I would do it. I am sure that others out
>there would have a twist or two to put on the process. When it's all over
>and done, clear out the bulk_table and do it all over again.
>
>I would not mix a statistic, like hit count, into the url_visit table. This
>next query should be blazingly fast as both columns are part of a compound
>index.
>
>SELECT url_server_ID, url_path_ID, count(1)
>from url_visit
>group by url_server_ID,url_path_Id;
>
>-OR- if you would rather see full names:
>
>create temporary table tmpStats
>SELECT url_server_ID, url_path_ID, count(1) as hits
>from url_visit
>group by url_server_ID,url_path_Id;
>
>select concat(us.server,up.path), ts.hits
>from tmpStats
>inner join url_servers us
>on us.id = ts.url_server_ID
>inner join url_paths up
>on up.id = ts.url_path_ID
>
>drop table tmpStats;
>
>(NOTE: I used a temp table to aggregate on the ID values before resolving
>to the names because resolving those BEFORE running the GROUP BY would have
>eliminated the use of an existing index.)
>
>Well, I have to get back to work. Let me know how this works for you, OK?
>
>Cheers,
>Shawn Green
>Database Administrator
>Unimin Corporation - Spruce Pine
>
>
>------------all previous responses clipped for space ---------------------
>
>
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/mysql?unsub=1
>
_________________________________________________________________
It's fast, it's easy and it's free. Get MSN Messenger today!
http://www.msn.co.uk/messenger