List:General Discussion« Previous MessageNext Message »
From:J S Date:June 22 2004 11:18am
Subject:Re: load data into 2 tables and set id
View as plain text  
Did you mean there to be duplicates in the url_visits? Do I need to use 
IGNORE in the following SQL?

INSERT url_visit (url_server_ID, url_path_ID, querystring,
category)
SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path),
NULL),
bt.category
FROM bulk_table bt
INNER JOIN url_servers us
       ON us.server = bt.server
INNER JOIN url_paths up
       on up.path=bt.path

js.

mysql> select * from url_visit where urlid=1631;
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
| urlid | url_server_ID | url_path_ID | query                                
                                                                             
   | category     |
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
|  1631 |         21720 |      630695 | 
cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift

| Online Sales |
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
1 row in set (0.01 sec)

mysql> select * from url_visit where urlid=1753;
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
| urlid | url_server_ID | url_path_ID | query                                
                                                                             
   | category     |
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
|  1753 |         21720 |      630695 | 
cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift

| Online Sales |
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
1 row in set (0.00 sec)

>
>Shawn,
>
>Thanks for your reply below. I found it extremely useful. I have followed 
>your instructions and got good results up to the url_visits table.
>
>I have a perl script to parse the values out of the log. The log has 
>3,770,246 lines and is gzipped. I then applied your SQL statements with the 
>following times:
>
>1. Parse script: 10m31
>2. Load infile: 2m39
>3. Insert url_servers and url_paths: 5m38
>4. Insert url_visits 9m50
>
>However when I ran the sql to insert the internet usage table, it just ran 
>until all the log space was filled up:
>
>-rw-rw----   1 mysql    mysql    379694680 Jun 22 10:05 bulk_table.MYD      
>       <====
>-rw-rw----   1 mysql    mysql       1024 Jun 22 10:05 bulk_table.MYI
>-rw-rw----   1 mysql    mysql       8760 Jun 22 09:59 bulk_table.frm
>-rw-rw----   1 mysql    mysql    2114977792 Jun 22 11:11 internet_usage.MYD 
>     <====
>-rw-rw----   1 mysql    mysql       1024 Jun 22 10:34 internet_usage.MYI
>-rw-rw----   1 mysql    mysql       8646 Jun 22 09:59 internet_usage.frm
>-rw-rw----   1 mysql    mysql    33437600 Jun 22 10:23 url_paths.MYD
>-rw-rw----   1 mysql    mysql    27696128 Jun 22 10:23 url_paths.MYI
>-rw-rw----   1 mysql    mysql       8574 Jun 22 09:59 url_paths.frm
>-rw-rw----   1 mysql    mysql     646160 Jun 22 10:19 url_servers.MYD
>-rw-rw----   1 mysql    mysql     796672 Jun 22 10:19 url_servers.MYI
>-rw-rw----   1 mysql    mysql       8578 Jun 22 09:59 url_servers.frm
>-rw-rw----   1 mysql    mysql    119076844 Jun 22 10:32 url_visit.MYD
>-rw-rw----   1 mysql    mysql    73026560 Jun 22 10:33 url_visit.MYI
>-rw-rw----   1 mysql    mysql       8694 Jun 22 09:59 url_visit.frm
>
>I'm sure something's not right because the internet Usage table is bigger 
>than the bulk table. I changed your sql a bit (on the last line with the 
>query). Could this be the problem?
>
>INSERT internet_usage (uid,time,ip,urlid,size)
>SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
>FROM bulk_table bt
>INNER JOIN url_servers us
>        ON us.server=bt.server
>INNER JOIN url_paths up
>        ON up.path=bt.path
>INNER JOIN url_visit uv
>        ON uv.url_server_ID=us.ID
>        AND uv.url_path_ID=up.ID
>        AND uv.query=bt.query;
>
>Thanks again,
>
>js.
>
>>
>>-------- welcome to a basic overview of bulk importing and normalizing as
>>you go ------------
>>
>>[ author's note: if you are seeing this thread for the first time and
>>certain items seem to be
>>introduced out of context, please review all previous posts in this 
>>thread.
>>There has been
>>a lot of information already exchanged on this topic that I clipped
>>out of this response. Thanks! -- SG]
>>
>>In an earlier post you said
>> >The data values I have for each record are:
>> >
>> >user_id   date_time   size   url   category
>>
>> >for example:
>> >
>> >u752359   2004-04-02 12:33:04   3403   http://www.mysql.com/index.html
>>business
>> >
>>
>>To me that implies that you have a standard text log where each field is
>>separated by a space and rows are delimited by a CRLF pair. To get that 
>>log
>>into MySQL you will need to do something _like_ (I say "like" as you will
>>most likely need to tweak it to accommodate your actual data)
>>
>>CREATE TABLE bulk_table (
>>       user varchar(10)
>>       , eventdate date
>>       , eventtime time
>>       , size int
>>       , url text
>>       , category varchar(50)
>>);
>>
>>LOAD DATA INFILE <a proxy log file> INTO bulk_table FIELDS TERMINATED BY '
>>' LINES TERMINATED BY '\r\n';
>>
>>Now we can start merging the logs into your data tables.
>>
>>To quote RFC 2396:
>>    The URI syntax does not require that the scheme-specific-part have
>>    any general structure or set of semantics which is common among all
>>    URI.  However, a subset of URI do share a common syntax for
>>    representing hierarchical relationships within the namespace.  This
>>    "generic URI" syntax consists of a sequence of four main components:
>>
>>       <scheme>://<authority><path>?<query>
>>
>>    each of which, except <scheme>, may be absent from a particular URI.
>>
>>That translates into 4 logical pieces we can split a URL into:
>>1) the scheme -- HTTP, FTP, GOPHER, etc...
>>2) the authority -- that is the server (www.yahoo.com) and any port 
>>numbers
>>or login information
>>3) the path -- /somefolder/somefile.whatever
>>4) the query -- everything after the ?
>>
>>Not part of the generic URI is that bit known as a fragment (as identified
>>by the #) it is ALWAYS at the end of the _entire URL_ (including the 
>>query)
>>when it's used. I have always lumped those and queries into the same 
>>field.
>>
>>So you could create tables for each of those parts and get VERY normalized
>>or you can partly normalize like this:
>>
>>ALTER TABLE bulk_table add server_split int not null default 0
>>       , add path_split int not null default 0
>>       , add server varchar(255)
>>       , add path varchar(255)
>>
>>UPDATE bulk_table
>>set server_split = LOCATE('/', URL , 8)-1;
>>
>>UPDATE bulk_table
>>SET path_split = if(LOCATE('?', URL, server_split)> 0, LOCATE('?', URL,
>>server_split),  LOCATE('#', URL, server_split)-1);
>>
>>UPDATE bulk_table
>>set server=LEFT(URL, server_split )
>>
>>#those 4 new columns helped us to parse out the 3 major parts of the url
>>#I added them to the table so that we would not have to keep recalculating
>>those values later on
>># if it turns out that adding the columns takes a LONG time, we can create
>>this table with those columns
>># already created and just not import to them (change the LOAD DATA INFILE
>>statement slightly)
>>
>>CREATE TABLE IF NOT EXISTS url_servers (
>>ID int not null auto_increment,
>>server varchar(255) primary key,
>>Key (ID)
>>)
>>
>>CREATE TABLE IF NOT EXISTS url_paths (
>>ID int not null auto_increment,
>>path varchar(255) primary key,
>>Key (ID)
>>)
>>
>>INSERT IGNORE INTO url_servers(server)
>>SELECT DISTINCT server
>>FROM bulk_table
>>
>>INSERT IGNORE INTO url_paths (path)
>>SELECT DISTINCT path
>>FROM bulk_table
>>
>># at this point we have all of our new Servers and our Paths uniquely
>>numbered
>># but we are going to need a slightly different URL table to track visits.
>>
>>CREATE TABLE url_visit (
>>urlid mediumint not null auto_increment primary key,
>>url_server_ID int not null default 0,
>>url_path_ID int not null default 0,
>>querystring text default null,
>>category varchar(50)default null,
>>KEY(url_server_ID, url_path_ID)
>>)
>>
>>## that last key is to speed up our joins to our _servers and _paths
>>tables...
>>
>>#  we finally have enough information to insert to the visit table
>>
>>INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
>>SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path), NULL),
>>bt.category
>>FROM bulk_table bt
>>INNER JOIN url_servers us
>>       ON us.server = bt.server
>>INNER JOIN url_paths up
>>       on up.path=bt.path
>>
>>## (see where the new pre-computed columns come in handy?) :-D
>>
>># and now we have enough information to load the internet_usage table.
>>Though there isn't enough data in your sample
>># source data to fill in all of the columns
>>
>>INSERT internet_usage ( uid,`time`,urlid, size)
>>SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size
>>FROM bulk_table bt
>>INNER JOIN url_servers us
>>       ON us.server = bt.server
>>INNER JOIN url_paths up
>>       ON up.path=bt.path
>>INNER JOIN url_visit uv
>>       ON uv.url_server_ID = us.ID
>>       AND uv.url_path_ID = up.id
>>       AND uv.querystring = if(bt.path_split >0, SUBSTRING(bt.url,path),
>>NULL)
>>
>>
>>It may not be perfect but it's how I would do it. I am sure that others 
>>out
>>there would have a twist or two to put on the process.  When it's all over
>>and done, clear out the bulk_table and do it all over again.
>>
>>I would not mix a statistic, like hit count, into the url_visit table. 
>>This
>>next query should be blazingly fast as both columns are part of a compound
>>index.
>>
>>SELECT url_server_ID, url_path_ID, count(1)
>>from url_visit
>>group by url_server_ID,url_path_Id;
>>
>>-OR- if you would rather see full names:
>>
>>create temporary table tmpStats
>>SELECT url_server_ID, url_path_ID, count(1) as hits
>>from url_visit
>>group by url_server_ID,url_path_Id;
>>
>>select concat(us.server,up.path), ts.hits
>>from tmpStats
>>inner join url_servers us
>>on us.id = ts.url_server_ID
>>inner join url_paths up
>>on up.id = ts.url_path_ID
>>
>>drop table tmpStats;
>>
>>(NOTE: I used a temp table to aggregate on the ID values before resolving
>>to the names because resolving those BEFORE running the GROUP BY would 
>>have
>>eliminated the use of an existing index.)
>>
>>Well, I have to get back to work. Let me know how this works for you, OK?
>>
>>Cheers,
>>Shawn Green
>>Database Administrator
>>Unimin Corporation - Spruce Pine
>>
>>
>>------------all previous responses clipped for space ---------------------
>>
>>
>>
>>
>>--
>>MySQL General Mailing List
>>For list archives: http://lists.mysql.com/mysql
>>To unsubscribe:    http://lists.mysql.com/mysql?unsub=1
>>
>
>_________________________________________________________________
>It's fast, it's easy and it's free. Get MSN Messenger today! 
>http://www.msn.co.uk/messenger
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe:    http://lists.mysql.com/mysql?unsub=1
>

_________________________________________________________________
It's fast, it's easy and it's free. Get MSN Messenger today! 
http://www.msn.co.uk/messenger

Thread
load data into 2 tables and set idJ S9 Jun
  • Re: load data into 2 tables and set idWilliam R. Mussatto9 Jun
  • Re: load data into 2 tables and set idmos9 Jun
Re: load data into 2 tables and set idJ S9 Jun
  • Re: load data into 2 tables and set idWilliam R. Mussatto9 Jun
Re: load data into 2 tables and set idJ S10 Jun
Re: load data into 2 tables and set idSGreen10 Jun
Re: load data into 2 tables and set idJ S18 Jun
Re: load data into 2 tables and set idSGreen18 Jun
Re: load data into 2 tables and set idJ S18 Jun
Re: load data into 2 tables and set idSGreen18 Jun
Re: load data into 2 tables and set idJ S18 Jun
Re: load data into 2 tables and set idSGreen18 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idSGreen22 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idJ S23 Jun
Re: load data into 2 tables and set idSGreen23 Jun
Re: load data into 2 tables and set idJ S23 Jun
  • Re: load data into 2 tables and set idMichael Stassen23 Jun
Re: load data into 2 tables and set idSGreen23 Jun
Re: load data into 2 tables and set idJ S23 Jun
Re: load data into 2 tables and set idSGreen23 Jun
Re: load data into 2 tables and set idJ S26 Jun
Re: load data into 2 tables and set idJ S26 Jun