From: J S Date: June 22 2004 11:18am Subject: Re: load data into 2 tables and set id List-Archive: http://lists.mysql.com/mysql/167682 Message-Id: MIME-Version: 1.0 Content-Type: text/plain; format=flowed Did you mean there to be duplicates in the url_visits? Do I need to use IGNORE in the following SQL? INSERT url_visit (url_server_ID, url_path_ID, querystring, category) SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path), NULL), bt.category FROM bulk_table bt INNER JOIN url_servers us ON us.server = bt.server INNER JOIN url_paths up on up.path=bt.path js. mysql> select * from url_visit where urlid=1631; +-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ | urlid | url_server_ID | url_path_ID | query | category | +-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ | 1631 | 21720 | 630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales | +-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ 1 row in set (0.01 sec) mysql> select * from url_visit where urlid=1753; +-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ | urlid | url_server_ID | url_path_ID | query | category | +-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ | 1753 | 21720 | 630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales | +-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ 1 row in set (0.00 sec) > >Shawn, > >Thanks for your reply below. I found it extremely useful. I have followed >your instructions and got good results up to the url_visits table. > >I have a perl script to parse the values out of the log. The log has >3,770,246 lines and is gzipped. I then applied your SQL statements with the >following times: > >1. Parse script: 10m31 >2. Load infile: 2m39 >3. Insert url_servers and url_paths: 5m38 >4. Insert url_visits 9m50 > >However when I ran the sql to insert the internet usage table, it just ran >until all the log space was filled up: > >-rw-rw---- 1 mysql mysql 379694680 Jun 22 10:05 bulk_table.MYD > <==== >-rw-rw---- 1 mysql mysql 1024 Jun 22 10:05 bulk_table.MYI >-rw-rw---- 1 mysql mysql 8760 Jun 22 09:59 bulk_table.frm >-rw-rw---- 1 mysql mysql 2114977792 Jun 22 11:11 internet_usage.MYD > <==== >-rw-rw---- 1 mysql mysql 1024 Jun 22 10:34 internet_usage.MYI >-rw-rw---- 1 mysql mysql 8646 Jun 22 09:59 internet_usage.frm >-rw-rw---- 1 mysql mysql 33437600 Jun 22 10:23 url_paths.MYD >-rw-rw---- 1 mysql mysql 27696128 Jun 22 10:23 url_paths.MYI >-rw-rw---- 1 mysql mysql 8574 Jun 22 09:59 url_paths.frm >-rw-rw---- 1 mysql mysql 646160 Jun 22 10:19 url_servers.MYD >-rw-rw---- 1 mysql mysql 796672 Jun 22 10:19 url_servers.MYI >-rw-rw---- 1 mysql mysql 8578 Jun 22 09:59 url_servers.frm >-rw-rw---- 1 mysql mysql 119076844 Jun 22 10:32 url_visit.MYD >-rw-rw---- 1 mysql mysql 73026560 Jun 22 10:33 url_visit.MYI >-rw-rw---- 1 mysql mysql 8694 Jun 22 09:59 url_visit.frm > >I'm sure something's not right because the internet Usage table is bigger >than the bulk table. I changed your sql a bit (on the last line with the >query). Could this be the problem? > >INSERT internet_usage (uid,time,ip,urlid,size) >SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size >FROM bulk_table bt >INNER JOIN url_servers us > ON us.server=bt.server >INNER JOIN url_paths up > ON up.path=bt.path >INNER JOIN url_visit uv > ON uv.url_server_ID=us.ID > AND uv.url_path_ID=up.ID > AND uv.query=bt.query; > >Thanks again, > >js. > >> >>-------- welcome to a basic overview of bulk importing and normalizing as >>you go ------------ >> >>[ author's note: if you are seeing this thread for the first time and >>certain items seem to be >>introduced out of context, please review all previous posts in this >>thread. >>There has been >>a lot of information already exchanged on this topic that I clipped >>out of this response. Thanks! -- SG] >> >>In an earlier post you said >> >The data values I have for each record are: >> > >> >user_id date_time size url category >> >> >for example: >> > >> >u752359 2004-04-02 12:33:04 3403 http://www.mysql.com/index.html >>business >> > >> >>To me that implies that you have a standard text log where each field is >>separated by a space and rows are delimited by a CRLF pair. To get that >>log >>into MySQL you will need to do something _like_ (I say "like" as you will >>most likely need to tweak it to accommodate your actual data) >> >>CREATE TABLE bulk_table ( >> user varchar(10) >> , eventdate date >> , eventtime time >> , size int >> , url text >> , category varchar(50) >>); >> >>LOAD DATA INFILE INTO bulk_table FIELDS TERMINATED BY ' >>' LINES TERMINATED BY '\r\n'; >> >>Now we can start merging the logs into your data tables. >> >>To quote RFC 2396: >> The URI syntax does not require that the scheme-specific-part have >> any general structure or set of semantics which is common among all >> URI. However, a subset of URI do share a common syntax for >> representing hierarchical relationships within the namespace. This >> "generic URI" syntax consists of a sequence of four main components: >> >> ://? >> >> each of which, except , may be absent from a particular URI. >> >>That translates into 4 logical pieces we can split a URL into: >>1) the scheme -- HTTP, FTP, GOPHER, etc... >>2) the authority -- that is the server (www.yahoo.com) and any port >>numbers >>or login information >>3) the path -- /somefolder/somefile.whatever >>4) the query -- everything after the ? >> >>Not part of the generic URI is that bit known as a fragment (as identified >>by the #) it is ALWAYS at the end of the _entire URL_ (including the >>query) >>when it's used. I have always lumped those and queries into the same >>field. >> >>So you could create tables for each of those parts and get VERY normalized >>or you can partly normalize like this: >> >>ALTER TABLE bulk_table add server_split int not null default 0 >> , add path_split int not null default 0 >> , add server varchar(255) >> , add path varchar(255) >> >>UPDATE bulk_table >>set server_split = LOCATE('/', URL , 8)-1; >> >>UPDATE bulk_table >>SET path_split = if(LOCATE('?', URL, server_split)> 0, LOCATE('?', URL, >>server_split), LOCATE('#', URL, server_split)-1); >> >>UPDATE bulk_table >>set server=LEFT(URL, server_split ) >> >>#those 4 new columns helped us to parse out the 3 major parts of the url >>#I added them to the table so that we would not have to keep recalculating >>those values later on >># if it turns out that adding the columns takes a LONG time, we can create >>this table with those columns >># already created and just not import to them (change the LOAD DATA INFILE >>statement slightly) >> >>CREATE TABLE IF NOT EXISTS url_servers ( >>ID int not null auto_increment, >>server varchar(255) primary key, >>Key (ID) >>) >> >>CREATE TABLE IF NOT EXISTS url_paths ( >>ID int not null auto_increment, >>path varchar(255) primary key, >>Key (ID) >>) >> >>INSERT IGNORE INTO url_servers(server) >>SELECT DISTINCT server >>FROM bulk_table >> >>INSERT IGNORE INTO url_paths (path) >>SELECT DISTINCT path >>FROM bulk_table >> >># at this point we have all of our new Servers and our Paths uniquely >>numbered >># but we are going to need a slightly different URL table to track visits. >> >>CREATE TABLE url_visit ( >>urlid mediumint not null auto_increment primary key, >>url_server_ID int not null default 0, >>url_path_ID int not null default 0, >>querystring text default null, >>category varchar(50)default null, >>KEY(url_server_ID, url_path_ID) >>) >> >>## that last key is to speed up our joins to our _servers and _paths >>tables... >> >># we finally have enough information to insert to the visit table >> >>INSERT url_visit (url_server_ID, url_path_ID, querystring, category) >>SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path), NULL), >>bt.category >>FROM bulk_table bt >>INNER JOIN url_servers us >> ON us.server = bt.server >>INNER JOIN url_paths up >> on up.path=bt.path >> >>## (see where the new pre-computed columns come in handy?) :-D >> >># and now we have enough information to load the internet_usage table. >>Though there isn't enough data in your sample >># source data to fill in all of the columns >> >>INSERT internet_usage ( uid,`time`,urlid, size) >>SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size >>FROM bulk_table bt >>INNER JOIN url_servers us >> ON us.server = bt.server >>INNER JOIN url_paths up >> ON up.path=bt.path >>INNER JOIN url_visit uv >> ON uv.url_server_ID = us.ID >> AND uv.url_path_ID = up.id >> AND uv.querystring = if(bt.path_split >0, SUBSTRING(bt.url,path), >>NULL) >> >> >>It may not be perfect but it's how I would do it. I am sure that others >>out >>there would have a twist or two to put on the process. When it's all over >>and done, clear out the bulk_table and do it all over again. >> >>I would not mix a statistic, like hit count, into the url_visit table. >>This >>next query should be blazingly fast as both columns are part of a compound >>index. >> >>SELECT url_server_ID, url_path_ID, count(1) >>from url_visit >>group by url_server_ID,url_path_Id; >> >>-OR- if you would rather see full names: >> >>create temporary table tmpStats >>SELECT url_server_ID, url_path_ID, count(1) as hits >>from url_visit >>group by url_server_ID,url_path_Id; >> >>select concat(us.server,up.path), ts.hits >>from tmpStats >>inner join url_servers us >>on us.id = ts.url_server_ID >>inner join url_paths up >>on up.id = ts.url_path_ID >> >>drop table tmpStats; >> >>(NOTE: I used a temp table to aggregate on the ID values before resolving >>to the names because resolving those BEFORE running the GROUP BY would >>have >>eliminated the use of an existing index.) >> >>Well, I have to get back to work. Let me know how this works for you, OK? >> >>Cheers, >>Shawn Green >>Database Administrator >>Unimin Corporation - Spruce Pine >> >> >>------------all previous responses clipped for space --------------------- >> >> >> >> >>-- >>MySQL General Mailing List >>For list archives: http://lists.mysql.com/mysql >>To unsubscribe: http://lists.mysql.com/mysql?unsub=vervoom@stripped >> > >_________________________________________________________________ >It's fast, it's easy and it's free. Get MSN Messenger today! >http://www.msn.co.uk/messenger > > >-- >MySQL General Mailing List >For list archives: http://lists.mysql.com/mysql >To unsubscribe: http://lists.mysql.com/mysql?unsub=vervoom@stripped > _________________________________________________________________ It's fast, it's easy and it's free. Get MSN Messenger today! http://www.msn.co.uk/messenger