Hi Shawn,
Here's the url_Schemes table (it's the same as the url_paths and
url_servers). This means url_scheme_ID is part of a unique constraint/key ?
mysql> desc url_schemes;
+--------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+----------------+
| ID | int(11) | | MUL | NULL | auto_increment |
| scheme | varchar(20) | | PRI | | |
+--------+-------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
mysql> desc url_visit;
+---------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+----------------+
| urlid | mediumint(9) | | PRI | NULL | auto_increment |
| url_scheme_ID | int(11) | | MUL | 0 | |
| url_server_ID | int(11) | | | 0 | |
| url_path_ID | int(11) | | | 0 | |
| query | text | YES | | NULL | |
| category | varchar(50) | YES | | NULL | |
+---------------+--------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)
mysql> select * from url_schemes;
+----+--------+
| ID | scheme |
+----+--------+
| 1 | http |
| 2 | tcp |
| 3 | - |
| 4 | ftp |
| 5 | https |
+----+--------+
5 rows in set (0.00 sec)
>
>
>J S,
>
>Check to see if the url_scheme_ID part of a unique constraint/key or the
>Primary Key? If it isn't, you will get dupes even if you use INSERT IGNORE
>unless you pre-screen your INSERTs some other way to avoid duplication.
>
>Yours,
>Shawn Green
>Database Administrator
>Unimin Corporation - Spruce Pine
>
>
>
>
>
>
> "J S"
> <vervoom@hotmail. To: SGreen@stripped
> com> cc:
>mysql@stripped
> Fax to:
> 06/23/2004 04:13 Subject: Re: load data
>into 2 tables and set id
> AM
>
>
>
>
>
>
>Shawn,
>
>I uncovered a problem this morning. I wonder if you (or anyone else) can
>help me out again?
>
>mysql> select * from url_visit where url_scheme_ID=3 limit 10;
>+---------+---------------+---------------+-------------+-------+----------+
>
>| urlid | url_scheme_ID | url_server_ID | url_path_ID | query | category
>|
>+---------+---------------+---------------+-------------+-------+----------+
>
>| 23392 | 3 | 1070 | 22221 | NULL | none
>|
>| 1346269 | 3 | 1070 | 22221 | NULL | none
>|
>+---------+---------------+---------------+-------------+-------+----------+
>
>2 rows in set (0.00 sec)
>
>This is the insert statement I'm using:
>
>INSERT IGNORE url_visit (url_scheme_ID, url_server_ID, url_path_ID, query,
>category)
> SELECT DISTINCT uc.ID, us.ID,
>up.ID,bt.query,bt.category
> FROM bulk_table bt
> INNER JOIN url_servers us ON us.server=bt.server
> INNER JOIN url_paths up ON up.path=bt.path
> INNER JOIN url_schemes uc ON uc.scheme=bt.scheme
>
>Do I need brackets after the distinct? e.g
>SELECT DISTINCT (uc.ID, us.ID, up.ID,bt.query,bt.category)
>
>Thanks,
>
>js.
>
> >
> >
> >Great catch! I believe you found the problem. Sorry about that! ;-D
> >
> >So... just curious... from the original 60GB of text data, how much space
> >did the final data require (including indices)?
> >
> >Yours,
> >Shawn Green
> >Database Administrator
> >Unimin Corporation - Spruce Pine
> >
> >
> >
> >
> > "J S"
> > <vervoom@hotmail. To:
>SGreen@stripped
> > com> cc:
> >mysql@stripped
> > Fax to:
> > 06/22/2004 07:55 Subject: Re: load data
> >into 2 tables and set id
> > AM
> >
> >
> >
> >
> >
> >
> >I think I fixed it!
> >
> >INSERT IGNORE url_visit (url_server_ID, url_path_ID, querystring,
> >category)
> >SELECT DISTINCT us.ID, up.ID, if(bt.path_split >0,
>SUBSTRING(bt.url,path),
> >NULL),
> >bt.category
> >FROM bulk_table bt
> >INNER JOIN url_servers us
> > ON us.server = bt.server
> >INNER JOIN url_paths up
> > on up.path=bt.path
> >
> >After this I ran the sql to insert into internet_usage and this time it
> >finished after 10m32.
> >
> >I'll do some more testing and let you know how it goes.
> >
> >Thanks,
> >
> >js.
> > >
> > >Did you mean there to be duplicates in the url_visits? Do I need to use
> > >IGNORE in the following SQL?
> > >
> > >INSERT url_visit (url_server_ID, url_path_ID, querystring,
> > >category)
> > >SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path),
> > >NULL),
> > >bt.category
> > >FROM bulk_table bt
> > >INNER JOIN url_servers us
> > > ON us.server = bt.server
> > >INNER JOIN url_paths up
> > > on up.path=bt.path
> > >
> > >js.
> > >
> > >mysql> select * from url_visit where urlid=1631;
> >
>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
> >
> > >| urlid | url_server_ID | url_path_ID | query
> >
> > >
> >
> > > | category |
> >
>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
> >
> > >| 1631 | 21720 | 630695 |
> >
>
> >cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift
>
> >
> > >| Online Sales |
> >
>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
> >
> > >1 row in set (0.01 sec)
> > >
> > >mysql> select * from url_visit where urlid=1753;
> >
>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
> >
> > >| urlid | url_server_ID | url_path_ID | query
> >
> > >
> >
> > > | category |
> >
>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
> >
> > >| 1753 | 21720 | 630695 |
> >
>
> >cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift
>
> >
> > >| Online Sales |
> >
>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
> >
> > >1 row in set (0.00 sec)
> > >
> > >>
> > >>Shawn,
> > >>
> > >>Thanks for your reply below. I found it extremely useful. I have
> >followed
> >
> > >>your instructions and got good results up to the url_visits table.
> > >>
> > >>I have a perl script to parse the values out of the log. The log has
> > >>3,770,246 lines and is gzipped. I then applied your SQL statements
>with
> > >>the following times:
> > >>
> > >>1. Parse script: 10m31
> > >>2. Load infile: 2m39
> > >>3. Insert url_servers and url_paths: 5m38
> > >>4. Insert url_visits 9m50
> > >>
> > >>However when I ran the sql to insert the internet usage table, it just
> >ran
> > >>until all the log space was filled up:
> > >>
> > >>-rw-rw---- 1 mysql mysql 379694680 Jun 22 10:05 bulk_table.MYD
> >
> > >> <====
> > >>-rw-rw---- 1 mysql mysql 1024 Jun 22 10:05 bulk_table.MYI
> > >>-rw-rw---- 1 mysql mysql 8760 Jun 22 09:59 bulk_table.frm
> > >>-rw-rw---- 1 mysql mysql 2114977792 Jun 22 11:11
> > >>internet_usage.MYD <====
> > >>-rw-rw---- 1 mysql mysql 1024 Jun 22 10:34
>internet_usage.MYI
> > >>-rw-rw---- 1 mysql mysql 8646 Jun 22 09:59
>internet_usage.frm
> > >>-rw-rw---- 1 mysql mysql 33437600 Jun 22 10:23 url_paths.MYD
> > >>-rw-rw---- 1 mysql mysql 27696128 Jun 22 10:23 url_paths.MYI
> > >>-rw-rw---- 1 mysql mysql 8574 Jun 22 09:59 url_paths.frm
> > >>-rw-rw---- 1 mysql mysql 646160 Jun 22 10:19 url_servers.MYD
> > >>-rw-rw---- 1 mysql mysql 796672 Jun 22 10:19 url_servers.MYI
> > >>-rw-rw---- 1 mysql mysql 8578 Jun 22 09:59 url_servers.frm
> > >>-rw-rw---- 1 mysql mysql 119076844 Jun 22 10:32 url_visit.MYD
> > >>-rw-rw---- 1 mysql mysql 73026560 Jun 22 10:33 url_visit.MYI
> > >>-rw-rw---- 1 mysql mysql 8694 Jun 22 09:59 url_visit.frm
> > >>
> > >>I'm sure something's not right because the internet Usage table is
> >bigger
> >
> > >>than the bulk table. I changed your sql a bit (on the last line with
>the
> > >>query). Could this be the problem?
> > >>
> > >>INSERT internet_usage (uid,time,ip,urlid,size)
> > >>SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
> > >>FROM bulk_table bt
> > >>INNER JOIN url_servers us
> > >> ON us.server=bt.server
> > >>INNER JOIN url_paths up
> > >> ON up.path=bt.path
> > >>INNER JOIN url_visit uv
> > >> ON uv.url_server_ID=us.ID
> > >> AND uv.url_path_ID=up.ID
> > >> AND uv.query=bt.query;
> > >>
> > >>Thanks again,
> > >>
> > >>js.
> > >>
> > >>>
> > >>>-------- welcome to a basic overview of bulk importing and
>normalizing
> >as
> > >>>you go ------------
> > >>>
> > >>>[ author's note: if you are seeing this thread for the first time
> and
> > >>>certain items seem to be
> > >>>introduced out of context, please review all previous posts in this
> > >>>thread.
> > >>>There has been
> > >>>a lot of information already exchanged on this topic that I clipped
> > >>>out of this response. Thanks! -- SG]
> > >>>
> > >>>In an earlier post you said
> > >>> >The data values I have for each record are:
> > >>> >
> > >>> >user_id date_time size url category
> > >>>
> > >>> >for example:
> > >>> >
> > >>> >u752359 2004-04-02 12:33:04 3403
> >http://www.mysql.com/index.html
> > >>>business
> > >>> >
> > >>>
> > >>>To me that implies that you have a standard text log where each
> field
> >is
> > >>>separated by a space and rows are delimited by a CRLF pair. To get
>that
> > >>>log
> > >>>into MySQL you will need to do something _like_ (I say "like" as
> you
> >will
> > >>>most likely need to tweak it to accommodate your actual data)
> > >>>
> > >>>CREATE TABLE bulk_table (
> > >>> user varchar(10)
> > >>> , eventdate date
> > >>> , eventtime time
> > >>> , size int
> > >>> , url text
> > >>> , category varchar(50)
> > >>>);
> > >>>
> > >>>LOAD DATA INFILE <a proxy log file> INTO bulk_table FIELDS
> TERMINATED
> >BY
> >
> > >>>'
> > >>>' LINES TERMINATED BY '\r\n';
> > >>>
> > >>>Now we can start merging the logs into your data tables.
> > >>>
> > >>>To quote RFC 2396:
> > >>> The URI syntax does not require that the scheme-specific-part
>have
> > >>> any general structure or set of semantics which is common among
>all
> > >>> URI. However, a subset of URI do share a common syntax for
> > >>> representing hierarchical relationships within the namespace.
>This
> > >>> "generic URI" syntax consists of a sequence of four main
> >components:
> > >>>
> > >>> <scheme>://<authority><path>?<query>
> > >>>
> > >>> each of which, except <scheme>, may be absent from a
> particular
> >URI.
> > >>>
> > >>>That translates into 4 logical pieces we can split a URL into:
> > >>>1) the scheme -- HTTP, FTP, GOPHER, etc...
> > >>>2) the authority -- that is the server (www.yahoo.com) and any port
> > >>>numbers
> > >>>or login information
> > >>>3) the path -- /somefolder/somefile.whatever
> > >>>4) the query -- everything after the ?
> > >>>
> > >>>Not part of the generic URI is that bit known as a fragment (as
> > >>>identified
> > >>>by the #) it is ALWAYS at the end of the _entire URL_ (including
> the
> > >>>query)
> > >>>when it's used. I have always lumped those and queries into the
> same
> > >>>field.
> > >>>
> > >>>So you could create tables for each of those parts and get VERY
> > >>>normalized
> > >>>or you can partly normalize like this:
> > >>>
> > >>>ALTER TABLE bulk_table add server_split int not null default 0
> > >>> , add path_split int not null default 0
> > >>> , add server varchar(255)
> > >>> , add path varchar(255)
> > >>>
> > >>>UPDATE bulk_table
> > >>>set server_split = LOCATE('/', URL , 8)-1;
> > >>>
> > >>>UPDATE bulk_table
> > >>>SET path_split = if(LOCATE('?', URL, server_split)> 0,
> LOCATE('?',
>URL,
> > >>>server_split), LOCATE('#', URL, server_split)-1);
> > >>>
> > >>>UPDATE bulk_table
> > >>>set server=LEFT(URL, server_split )
> > >>>
> > >>>#those 4 new columns helped us to parse out the 3 major parts of
> the
> >url
> > >>>#I added them to the table so that we would not have to keep
> > >>>recalculating
> > >>>those values later on
> > >>># if it turns out that adding the columns takes a LONG time, we can
> > >>>create
> > >>>this table with those columns
> > >>># already created and just not import to them (change the LOAD DATA
> > >>>INFILE
> > >>>statement slightly)
> > >>>
> > >>>CREATE TABLE IF NOT EXISTS url_servers (
> > >>>ID int not null auto_increment,
> > >>>server varchar(255) primary key,
> > >>>Key (ID)
> > >>>)
> > >>>
> > >>>CREATE TABLE IF NOT EXISTS url_paths (
> > >>>ID int not null auto_increment,
> > >>>path varchar(255) primary key,
> > >>>Key (ID)
> > >>>)
> > >>>
> > >>>INSERT IGNORE INTO url_servers(server)
> > >>>SELECT DISTINCT server
> > >>>FROM bulk_table
> > >>>
> > >>>INSERT IGNORE INTO url_paths (path)
> > >>>SELECT DISTINCT path
> > >>>FROM bulk_table
> > >>>
> > >>># at this point we have all of our new Servers and our Paths
> uniquely
> > >>>numbered
> > >>># but we are going to need a slightly different URL table to track
> > >>>visits.
> > >>>
> > >>>CREATE TABLE url_visit (
> > >>>urlid mediumint not null auto_increment primary key,
> > >>>url_server_ID int not null default 0,
> > >>>url_path_ID int not null default 0,
> > >>>querystring text default null,
> > >>>category varchar(50)default null,
> > >>>KEY(url_server_ID, url_path_ID)
> > >>>)
> > >>>
> > >>>## that last key is to speed up our joins to our _servers and
> _paths
> > >>>tables...
> > >>>
> > >>># we finally have enough information to insert to the visit table
> > >>>
> > >>>INSERT url_visit (url_server_ID, url_path_ID, querystring,
> category)
> > >>>SELECT us.ID, up.ID, if(bt.path_split >0,
> SUBSTRING(bt.url,path),
> >NULL),
> > >>>bt.category
> > >>>FROM bulk_table bt
> > >>>INNER JOIN url_servers us
> > >>> ON us.server = bt.server
> > >>>INNER JOIN url_paths up
> > >>> on up.path=bt.path
> > >>>
> > >>>## (see where the new pre-computed columns come in handy?) :-D
> > >>>
> > >>># and now we have enough information to load the internet_usage
>table.
> > >>>Though there isn't enough data in your sample
> > >>># source data to fill in all of the columns
> > >>>
> > >>>INSERT internet_usage ( uid,`time`,urlid, size)
> > >>>SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size
> > >>>FROM bulk_table bt
> > >>>INNER JOIN url_servers us
> > >>> ON us.server = bt.server
> > >>>INNER JOIN url_paths up
> > >>> ON up.path=bt.path
> > >>>INNER JOIN url_visit uv
> > >>> ON uv.url_server_ID = us.ID
> > >>> AND uv.url_path_ID = up.id
> > >>> AND uv.querystring = if(bt.path_split >0,
> >SUBSTRING(bt.url,path),
> > >>>NULL)
> > >>>
> > >>>
> > >>>It may not be perfect but it's how I would do it. I am sure that
>others
> > >>>out
> > >>>there would have a twist or two to put on the process. When it's
> all
> > >>>over
> > >>>and done, clear out the bulk_table and do it all over again.
> > >>>
> > >>>I would not mix a statistic, like hit count, into the url_visit
>table.
> > >>>This
> > >>>next query should be blazingly fast as both columns are part of a
> > >>>compound
> > >>>index.
> > >>>
> > >>>SELECT url_server_ID, url_path_ID, count(1)
> > >>>from url_visit
> > >>>group by url_server_ID,url_path_Id;
> > >>>
> > >>>-OR- if you would rather see full names:
> > >>>
> > >>>create temporary table tmpStats
> > >>>SELECT url_server_ID, url_path_ID, count(1) as hits
> > >>>from url_visit
> > >>>group by url_server_ID,url_path_Id;
> > >>>
> > >>>select concat(us.server,up.path), ts.hits
> > >>>from tmpStats
> > >>>inner join url_servers us
> > >>>on us.id = ts.url_server_ID
> > >>>inner join url_paths up
> > >>>on up.id = ts.url_path_ID
> > >>>
> > >>>drop table tmpStats;
> > >>>
> > >>>(NOTE: I used a temp table to aggregate on the ID values before
> >resolving
> > >>>to the names because resolving those BEFORE running the GROUP BY
>would
> > >>>have
> > >>>eliminated the use of an existing index.)
> > >>>
> > >>>Well, I have to get back to work. Let me know how this works for
> you,
> >OK?
> > >>>
> > >>>Cheers,
> > >>>Shawn Green
> > >>>Database Administrator
> > >>>Unimin Corporation - Spruce Pine
> > >>>
> > >>>
> > >>>------------all previous responses clipped for space
> > >>>---------------------
> > >>>
> > >>>
> >
> >
> >
> >
> >
> >--
> >MySQL General Mailing List
> >For list archives: http://lists.mysql.com/mysql
> >To unsubscribe: http://lists.mysql.com/mysql?unsub=1
> >
>
>_________________________________________________________________
>Want to block unwanted pop-ups? Download the free MSN Toolbar now!
>http://toolbar.msn.co.uk/
>
>
>
>
>
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/mysql?unsub=1
>
_________________________________________________________________
It's fast, it's easy and it's free. Get MSN Messenger today!
http://www.msn.co.uk/messenger