No, url_scheme_ID has key type MUL, which means that that multiple
occurences of a given value are allowed within the field.
To prevent duplicate entries in url_visit, decide which combination of
columns should have no duplicates, then add a unique index on that combination.
Michael
J S wrote:
> Hi Shawn,
>
> Here's the url_Schemes table (it's the same as the url_paths and
> url_servers). This means url_scheme_ID is part of a unique constraint/key ?
>
> mysql> desc url_schemes;
> +--------+-------------+------+-----+---------+----------------+
> | Field | Type | Null | Key | Default | Extra |
> +--------+-------------+------+-----+---------+----------------+
> | ID | int(11) | | MUL | NULL | auto_increment |
> | scheme | varchar(20) | | PRI | | |
> +--------+-------------+------+-----+---------+----------------+
> 2 rows in set (0.00 sec)
>
> mysql> desc url_visit;
> +---------------+--------------+------+-----+---------+----------------+
> | Field | Type | Null | Key | Default | Extra |
> +---------------+--------------+------+-----+---------+----------------+
> | urlid | mediumint(9) | | PRI | NULL | auto_increment |
> | url_scheme_ID | int(11) | | MUL | 0 | |
> | url_server_ID | int(11) | | | 0 | |
> | url_path_ID | int(11) | | | 0 | |
> | query | text | YES | | NULL | |
> | category | varchar(50) | YES | | NULL | |
> +---------------+--------------+------+-----+---------+----------------+
> 6 rows in set (0.00 sec)
>
> mysql> select * from url_schemes;
> +----+--------+
> | ID | scheme |
> +----+--------+
> | 1 | http |
> | 2 | tcp |
> | 3 | - |
> | 4 | ftp |
> | 5 | https |
> +----+--------+
> 5 rows in set (0.00 sec)
>
>>
>>
>> J S,
>>
>> Check to see if the url_scheme_ID part of a unique constraint/key or the
>> Primary Key? If it isn't, you will get dupes even if you use INSERT
>> IGNORE
>> unless you pre-screen your INSERTs some other way to avoid duplication.
>>
>> Yours,
>> Shawn Green
>> Database Administrator
>> Unimin Corporation - Spruce Pine
>>
>>
>>
>>
>>
>>
>> "J S"
>> <vervoom@hotmail. To:
>> SGreen@stripped
>> com> cc:
>> mysql@stripped
>> Fax to:
>> 06/23/2004 04:13 Subject: Re: load data
>> into 2 tables and set id
>> AM
>>
>>
>>
>>
>>
>>
>> Shawn,
>>
>> I uncovered a problem this morning. I wonder if you (or anyone else) can
>> help me out again?
>>
>> mysql> select * from url_visit where url_scheme_ID=3 limit 10;
>> +---------+---------------+---------------+-------------+-------+----------+
>>
>>
>> | urlid | url_scheme_ID | url_server_ID | url_path_ID | query |
>> category
>> |
>> +---------+---------------+---------------+-------------+-------+----------+
>>
>>
>> | 23392 | 3 | 1070 | 22221 | NULL | none
>> |
>> | 1346269 | 3 | 1070 | 22221 | NULL | none
>> |
>> +---------+---------------+---------------+-------------+-------+----------+
>>
>>
>> 2 rows in set (0.00 sec)
>>
>> This is the insert statement I'm using:
>>
>> INSERT IGNORE url_visit (url_scheme_ID, url_server_ID, url_path_ID,
>> query,
>> category)
>> SELECT DISTINCT uc.ID, us.ID,
>> up.ID,bt.query,bt.category
>> FROM bulk_table bt
>> INNER JOIN url_servers us ON us.server=bt.server
>> INNER JOIN url_paths up ON up.path=bt.path
>> INNER JOIN url_schemes uc ON uc.scheme=bt.scheme
>>
>> Do I need brackets after the distinct? e.g
>> SELECT DISTINCT (uc.ID, us.ID, up.ID,bt.query,bt.category)
>>
>> Thanks,
>>
>> js.
>>
>> >
>> >
>> >Great catch! I believe you found the problem. Sorry about that! ;-D
>> >
>> >So... just curious... from the original 60GB of text data, how much
>> space
>> >did the final data require (including indices)?
>> >
>> >Yours,
>> >Shawn Green
>> >Database Administrator
>> >Unimin Corporation - Spruce Pine
>> >
>> >
>> >
>> >
>> > "J S"
>> > <vervoom@hotmail. To:
>> SGreen@stripped
>> > com> cc:
>> >mysql@stripped
>> > Fax to:
>> > 06/22/2004 07:55 Subject: Re: load data
>> >into 2 tables and set id
>> > AM
>> >
>> >
>> >
>> >
>> >
>> >
>> >I think I fixed it!
>> >
>> >INSERT IGNORE url_visit (url_server_ID, url_path_ID, querystring,
>> >category)
>> >SELECT DISTINCT us.ID, up.ID, if(bt.path_split >0,
>> SUBSTRING(bt.url,path),
>> >NULL),
>> >bt.category
>> >FROM bulk_table bt
>> >INNER JOIN url_servers us
>> > ON us.server = bt.server
>> >INNER JOIN url_paths up
>> > on up.path=bt.path
>> >
>> >After this I ran the sql to insert into internet_usage and this time it
>> >finished after 10m32.
>> >
>> >I'll do some more testing and let you know how it goes.
>> >
>> >Thanks,
>> >
>> >js.
>> > >
>> > >Did you mean there to be duplicates in the url_visits? Do I need to
>> use
>> > >IGNORE in the following SQL?
>> > >
>> > >INSERT url_visit (url_server_ID, url_path_ID, querystring,
>> > >category)
>> > >SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path),
>> > >NULL),
>> > >bt.category
>> > >FROM bulk_table bt
>> > >INNER JOIN url_servers us
>> > > ON us.server = bt.server
>> > >INNER JOIN url_paths up
>> > > on up.path=bt.path
>> > >
>> > >js.
>> > >
>> > >mysql> select * from url_visit where urlid=1631;
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
>>
>>
>> >
>> > >| urlid | url_server_ID | url_path_ID | query
>> >
>> > >
>> >
>> > > | category |
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
>>
>>
>> >
>> > >| 1631 | 21720 | 630695 |
>> >
>>
> >cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift
>
>>
>>
>> >
>> > >| Online Sales |
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
>>
>>
>> >
>> > >1 row in set (0.01 sec)
>> > >
>> > >mysql> select * from url_visit where urlid=1753;
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
>>
>>
>> >
>> > >| urlid | url_server_ID | url_path_ID | query
>> >
>> > >
>> >
>> > > | category |
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
>>
>>
>> >
>> > >| 1753 | 21720 | 630695 |
>> >
>>
> >cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift
>
>>
>>
>> >
>> > >| Online Sales |
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>
>>
>>
>> >
>> > >1 row in set (0.00 sec)
>> > >
>> > >>
>> > >>Shawn,
>> > >>
>> > >>Thanks for your reply below. I found it extremely useful. I have
>> >followed
>> >
>> > >>your instructions and got good results up to the url_visits table.
>> > >>
>> > >>I have a perl script to parse the values out of the log. The log
> has
>> > >>3,770,246 lines and is gzipped. I then applied your SQL statements
>> with
>> > >>the following times:
>> > >>
>> > >>1. Parse script: 10m31
>> > >>2. Load infile: 2m39
>> > >>3. Insert url_servers and url_paths: 5m38
>> > >>4. Insert url_visits 9m50
>> > >>
>> > >>However when I ran the sql to insert the internet usage table, it
>> just
>> >ran
>> > >>until all the log space was filled up:
>> > >>
>> > >>-rw-rw---- 1 mysql mysql 379694680 Jun 22 10:05
>> bulk_table.MYD
>> >
>> > >> <====
>> > >>-rw-rw---- 1 mysql mysql 1024 Jun 22 10:05
> bulk_table.MYI
>> > >>-rw-rw---- 1 mysql mysql 8760 Jun 22 09:59
> bulk_table.frm
>> > >>-rw-rw---- 1 mysql mysql 2114977792 Jun 22 11:11
>> > >>internet_usage.MYD <====
>> > >>-rw-rw---- 1 mysql mysql 1024 Jun 22 10:34
>> internet_usage.MYI
>> > >>-rw-rw---- 1 mysql mysql 8646 Jun 22 09:59
>> internet_usage.frm
>> > >>-rw-rw---- 1 mysql mysql 33437600 Jun 22 10:23
> url_paths.MYD
>> > >>-rw-rw---- 1 mysql mysql 27696128 Jun 22 10:23
> url_paths.MYI
>> > >>-rw-rw---- 1 mysql mysql 8574 Jun 22 09:59 url_paths.frm
>> > >>-rw-rw---- 1 mysql mysql 646160 Jun 22 10:19
> url_servers.MYD
>> > >>-rw-rw---- 1 mysql mysql 796672 Jun 22 10:19
> url_servers.MYI
>> > >>-rw-rw---- 1 mysql mysql 8578 Jun 22 09:59
> url_servers.frm
>> > >>-rw-rw---- 1 mysql mysql 119076844 Jun 22 10:32
> url_visit.MYD
>> > >>-rw-rw---- 1 mysql mysql 73026560 Jun 22 10:33
> url_visit.MYI
>> > >>-rw-rw---- 1 mysql mysql 8694 Jun 22 09:59 url_visit.frm
>> > >>
>> > >>I'm sure something's not right because the internet Usage table is
>> >bigger
>> >
>> > >>than the bulk table. I changed your sql a bit (on the last line
> with
>> the
>> > >>query). Could this be the problem?
>> > >>
>> > >>INSERT internet_usage (uid,time,ip,urlid,size)
>> > >>SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
>> > >>FROM bulk_table bt
>> > >>INNER JOIN url_servers us
>> > >> ON us.server=bt.server
>> > >>INNER JOIN url_paths up
>> > >> ON up.path=bt.path
>> > >>INNER JOIN url_visit uv
>> > >> ON uv.url_server_ID=us.ID
>> > >> AND uv.url_path_ID=up.ID
>> > >> AND uv.query=bt.query;
>> > >>
>> > >>Thanks again,
>> > >>
>> > >>js.
>> > >>
>> > >>>
>> > >>>-------- welcome to a basic overview of bulk importing and
>> normalizing
>> >as
>> > >>>you go ------------
>> > >>>
>> > >>>[ author's note: if you are seeing this thread for the first
> time
>> and
>> > >>>certain items seem to be
>> > >>>introduced out of context, please review all previous posts in
> this
>> > >>>thread.
>> > >>>There has been
>> > >>>a lot of information already exchanged on this topic that I
> clipped
>> > >>>out of this response. Thanks! -- SG]
>> > >>>
>> > >>>In an earlier post you said
>> > >>> >The data values I have for each record are:
>> > >>> >
>> > >>> >user_id date_time size url category
>> > >>>
>> > >>> >for example:
>> > >>> >
>> > >>> >u752359 2004-04-02 12:33:04 3403
>> >http://www.mysql.com/index.html
>> > >>>business
>> > >>> >
>> > >>>
>> > >>>To me that implies that you have a standard text log where each
>
>> field
>> >is
>> > >>>separated by a space and rows are delimited by a CRLF pair. To
> get
>> that
>> > >>>log
>> > >>>into MySQL you will need to do something _like_ (I say "like" as
> you
>> >will
>> > >>>most likely need to tweak it to accommodate your actual data)
>> > >>>
>> > >>>CREATE TABLE bulk_table (
>> > >>> user varchar(10)
>> > >>> , eventdate date
>> > >>> , eventtime time
>> > >>> , size int
>> > >>> , url text
>> > >>> , category varchar(50)
>> > >>>);
>> > >>>
>> > >>>LOAD DATA INFILE <a proxy log file> INTO bulk_table FIELDS
>
>> TERMINATED
>> >BY
>> >
>> > >>>'
>> > >>>' LINES TERMINATED BY '\r\n';
>> > >>>
>> > >>>Now we can start merging the logs into your data tables.
>> > >>>
>> > >>>To quote RFC 2396:
>> > >>> The URI syntax does not require that the
> scheme-specific-part
>> have
>> > >>> any general structure or set of semantics which is common
> among
>> all
>> > >>> URI. However, a subset of URI do share a common syntax for
>> > >>> representing hierarchical relationships within the
> namespace.
>> This
>> > >>> "generic URI" syntax consists of a sequence of four main
>> >components:
>> > >>>
>> > >>>
> <scheme>://<authority><path>?<query>
>> > >>>
>> > >>> each of which, except <scheme>, may be absent from a
> particular
>> >URI.
>> > >>>
>> > >>>That translates into 4 logical pieces we can split a URL into:
>> > >>>1) the scheme -- HTTP, FTP, GOPHER, etc...
>> > >>>2) the authority -- that is the server (www.yahoo.com) and any
> port
>> > >>>numbers
>> > >>>or login information
>> > >>>3) the path -- /somefolder/somefile.whatever
>> > >>>4) the query -- everything after the ?
>> > >>>
>> > >>>Not part of the generic URI is that bit known as a fragment (as
>> > >>>identified
>> > >>>by the #) it is ALWAYS at the end of the _entire URL_ (including
> the
>> > >>>query)
>> > >>>when it's used. I have always lumped those and queries into the
> same
>> > >>>field.
>> > >>>
>> > >>>So you could create tables for each of those parts and get VERY
>> > >>>normalized
>> > >>>or you can partly normalize like this:
>> > >>>
>> > >>>ALTER TABLE bulk_table add server_split int not null default 0
>> > >>> , add path_split int not null default 0
>> > >>> , add server varchar(255)
>> > >>> , add path varchar(255)
>> > >>>
>> > >>>UPDATE bulk_table
>> > >>>set server_split = LOCATE('/', URL , 8)-1;
>> > >>>
>> > >>>UPDATE bulk_table
>> > >>>SET path_split = if(LOCATE('?', URL, server_split)> 0,
> LOCATE('?',
>> URL,
>> > >>>server_split), LOCATE('#', URL, server_split)-1);
>> > >>>
>> > >>>UPDATE bulk_table
>> > >>>set server=LEFT(URL, server_split )
>> > >>>
>> > >>>#those 4 new columns helped us to parse out the 3 major parts of
> the
>> >url
>> > >>>#I added them to the table so that we would not have to keep
>> > >>>recalculating
>> > >>>those values later on
>> > >>># if it turns out that adding the columns takes a LONG time, we
> can
>> > >>>create
>> > >>>this table with those columns
>> > >>># already created and just not import to them (change the LOAD
> DATA
>> > >>>INFILE
>> > >>>statement slightly)
>> > >>>
>> > >>>CREATE TABLE IF NOT EXISTS url_servers (
>> > >>>ID int not null auto_increment,
>> > >>>server varchar(255) primary key,
>> > >>>Key (ID)
>> > >>>)
>> > >>>
>> > >>>CREATE TABLE IF NOT EXISTS url_paths (
>> > >>>ID int not null auto_increment,
>> > >>>path varchar(255) primary key,
>> > >>>Key (ID)
>> > >>>)
>> > >>>
>> > >>>INSERT IGNORE INTO url_servers(server)
>> > >>>SELECT DISTINCT server
>> > >>>FROM bulk_table
>> > >>>
>> > >>>INSERT IGNORE INTO url_paths (path)
>> > >>>SELECT DISTINCT path
>> > >>>FROM bulk_table
>> > >>>
>> > >>># at this point we have all of our new Servers and our Paths
>> uniquely
>> > >>>numbered
>> > >>># but we are going to need a slightly different URL table to
> track
>> > >>>visits.
>> > >>>
>> > >>>CREATE TABLE url_visit (
>> > >>>urlid mediumint not null auto_increment primary key,
>> > >>>url_server_ID int not null default 0,
>> > >>>url_path_ID int not null default 0,
>> > >>>querystring text default null,
>> > >>>category varchar(50)default null,
>> > >>>KEY(url_server_ID, url_path_ID)
>> > >>>)
>> > >>>
>> > >>>## that last key is to speed up our joins to our _servers and
> _paths
>> > >>>tables...
>> > >>>
>> > >>># we finally have enough information to insert to the visit
> table
>> > >>>
>> > >>>INSERT url_visit (url_server_ID, url_path_ID, querystring,
> category)
>> > >>>SELECT us.ID, up.ID, if(bt.path_split >0,
> SUBSTRING(bt.url,path),
>> >NULL),
>> > >>>bt.category
>> > >>>FROM bulk_table bt
>> > >>>INNER JOIN url_servers us
>> > >>> ON us.server = bt.server
>> > >>>INNER JOIN url_paths up
>> > >>> on up.path=bt.path
>> > >>>
>> > >>>## (see where the new pre-computed columns come in handy?) :-D
>> > >>>
>> > >>># and now we have enough information to load the internet_usage
>
>> table.
>> > >>>Though there isn't enough data in your sample
>> > >>># source data to fill in all of the columns
>> > >>>
>> > >>>INSERT internet_usage ( uid,`time`,urlid, size)
>> > >>>SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size
>> > >>>FROM bulk_table bt
>> > >>>INNER JOIN url_servers us
>> > >>> ON us.server = bt.server
>> > >>>INNER JOIN url_paths up
>> > >>> ON up.path=bt.path
>> > >>>INNER JOIN url_visit uv
>> > >>> ON uv.url_server_ID = us.ID
>> > >>> AND uv.url_path_ID = up.id
>> > >>> AND uv.querystring = if(bt.path_split >0,
>> >SUBSTRING(bt.url,path),
>> > >>>NULL)
>> > >>>
>> > >>>
>> > >>>It may not be perfect but it's how I would do it. I am sure
> that
>> others
>> > >>>out
>> > >>>there would have a twist or two to put on the process. When
> it's
>> all
>> > >>>over
>> > >>>and done, clear out the bulk_table and do it all over again.
>> > >>>
>> > >>>I would not mix a statistic, like hit count, into the url_visit
>
>> table.
>> > >>>This
>> > >>>next query should be blazingly fast as both columns are part of
> a
>> > >>>compound
>> > >>>index.
>> > >>>
>> > >>>SELECT url_server_ID, url_path_ID, count(1)
>> > >>>from url_visit
>> > >>>group by url_server_ID,url_path_Id;
>> > >>>
>> > >>>-OR- if you would rather see full names:
>> > >>>
>> > >>>create temporary table tmpStats
>> > >>>SELECT url_server_ID, url_path_ID, count(1) as hits
>> > >>>from url_visit
>> > >>>group by url_server_ID,url_path_Id;
>> > >>>
>> > >>>select concat(us.server,up.path), ts.hits
>> > >>>from tmpStats
>> > >>>inner join url_servers us
>> > >>>on us.id = ts.url_server_ID
>> > >>>inner join url_paths up
>> > >>>on up.id = ts.url_path_ID
>> > >>>
>> > >>>drop table tmpStats;
>> > >>>
>> > >>>(NOTE: I used a temp table to aggregate on the ID values before
>> >resolving
>> > >>>to the names because resolving those BEFORE running the GROUP BY
>
>> would
>> > >>>have
>> > >>>eliminated the use of an existing index.)
>> > >>>
>> > >>>Well, I have to get back to work. Let me know how this works for
>
>> you,
>> >OK?
>> > >>>
>> > >>>Cheers,
>> > >>>Shawn Green
>> > >>>Database Administrator
>> > >>>Unimin Corporation - Spruce Pine
>> > >>>
>> > >>>
>> > >>>------------all previous responses clipped for space
>> > >>>---------------------
>> > >>>
>> > >>>
>> >
>> >
>> >
>> >
>> >
>> >--
>> >MySQL General Mailing List
>> >For list archives: http://lists.mysql.com/mysql
>> >To unsubscribe:
>> http://lists.mysql.com/mysql?unsub=1
>> >
>>
>> _________________________________________________________________
>> Want to block unwanted pop-ups? Download the free MSN Toolbar now!
>> http://toolbar.msn.co.uk/
>>
>>
>>
>>
>>
>>
>>
>> --
>> MySQL General Mailing List
>> For list archives: http://lists.mysql.com/mysql
>> To unsubscribe: http://lists.mysql.com/mysql?unsub=1
>>
>
> _________________________________________________________________
> It's fast, it's easy and it's free. Get MSN Messenger today!
> http://www.msn.co.uk/messenger
>
>