List:General Discussion« Previous MessageNext Message »
From:Michael Stassen Date:June 23 2004 2:27pm
Subject:Re: load data into 2 tables and set id
View as plain text  
No, url_scheme_ID has key type MUL, which means that that multiple 
occurences of a given value are allowed within the field.

To prevent duplicate entries in url_visit, decide which combination of 
columns should have no duplicates, then add a unique index on that combination.

Michael

J S wrote:

> Hi Shawn,
> 
> Here's the url_Schemes table (it's the same as the url_paths and 
> url_servers). This means url_scheme_ID is part of a unique constraint/key ?
> 
> mysql> desc url_schemes;
> +--------+-------------+------+-----+---------+----------------+
> | Field  | Type        | Null | Key | Default | Extra          |
> +--------+-------------+------+-----+---------+----------------+
> | ID     | int(11)     |      | MUL | NULL    | auto_increment |
> | scheme | varchar(20) |      | PRI |         |                |
> +--------+-------------+------+-----+---------+----------------+
> 2 rows in set (0.00 sec)
> 
> mysql> desc url_visit;
> +---------------+--------------+------+-----+---------+----------------+
> | Field         | Type         | Null | Key | Default | Extra          |
> +---------------+--------------+------+-----+---------+----------------+
> | urlid         | mediumint(9) |      | PRI | NULL    | auto_increment |
> | url_scheme_ID | int(11)      |      | MUL | 0       |                |
> | url_server_ID | int(11)      |      |     | 0       |                |
> | url_path_ID   | int(11)      |      |     | 0       |                |
> | query         | text         | YES  |     | NULL    |                |
> | category      | varchar(50)  | YES  |     | NULL    |                |
> +---------------+--------------+------+-----+---------+----------------+
> 6 rows in set (0.00 sec)
> 
> mysql> select * from url_schemes;
> +----+--------+
> | ID | scheme |
> +----+--------+
> |  1 | http   |
> |  2 | tcp    |
> |  3 | -      |
> |  4 | ftp    |
> |  5 | https  |
> +----+--------+
> 5 rows in set (0.00 sec)
> 
>>
>>
>> J S,
>>
>> Check to see if the url_scheme_ID part of a unique constraint/key or the
>> Primary Key? If it isn't, you will get dupes even if you use INSERT 
>> IGNORE
>> unless you pre-screen your INSERTs some other way to avoid duplication.
>>
>> Yours,
>> Shawn Green
>> Database Administrator
>> Unimin Corporation - Spruce Pine
>>
>>
>>
>>
>>
>>
>>                       "J S"
>>                       <vervoom@hotmail.        To:       
>> SGreen@stripped
>>                       com>                     cc:       
>> mysql@stripped
>>                                                Fax to:
>>                       06/23/2004 04:13         Subject:  Re: load data 
>> into 2 tables and set id
>>                       AM
>>
>>
>>
>>
>>
>>
>> Shawn,
>>
>> I uncovered a problem this morning. I wonder if you (or anyone else) can
>> help me out again?
>>
>> mysql> select * from url_visit where url_scheme_ID=3 limit 10;
>> +---------+---------------+---------------+-------------+-------+----------+ 
>>
>>
>> | urlid   | url_scheme_ID | url_server_ID | url_path_ID | query | 
>> category
>> |
>> +---------+---------------+---------------+-------------+-------+----------+ 
>>
>>
>> |   23392 |             3 |          1070 |       22221 | NULL  | none
>> |
>> | 1346269 |             3 |          1070 |       22221 | NULL  | none
>> |
>> +---------+---------------+---------------+-------------+-------+----------+ 
>>
>>
>> 2 rows in set (0.00 sec)
>>
>> This is the insert statement I'm using:
>>
>> INSERT IGNORE url_visit (url_scheme_ID, url_server_ID, url_path_ID, 
>> query,
>> category)
>>                          SELECT DISTINCT uc.ID, us.ID,
>> up.ID,bt.query,bt.category
>>                          FROM bulk_table bt
>>                          INNER JOIN url_servers us ON us.server=bt.server
>>                          INNER JOIN url_paths up ON up.path=bt.path
>>                          INNER JOIN url_schemes uc ON uc.scheme=bt.scheme
>>
>> Do I  need brackets after the distinct? e.g
>> SELECT DISTINCT (uc.ID, us.ID, up.ID,bt.query,bt.category)
>>
>> Thanks,
>>
>> js.
>>
>> >
>> >
>> >Great catch! I believe you found the problem. Sorry about that!   ;-D
>> >
>> >So... just curious... from the original 60GB of text data, how much 
>> space
>> >did the final data require (including indices)?
>> >
>> >Yours,
>> >Shawn Green
>> >Database Administrator
>> >Unimin Corporation - Spruce Pine
>> >
>> >
>> >
>> >
>> >                       "J S"
>> >                       <vervoom@hotmail.        To:
>> SGreen@stripped
>> >                       com>                     cc:
>> >mysql@stripped
>> >                                                Fax to:
>> >                       06/22/2004 07:55         Subject:  Re: load data
>> >into 2 tables and set id
>> >                       AM
>> >
>> >
>> >
>> >
>> >
>> >
>> >I think I fixed it!
>> >
>> >INSERT IGNORE url_visit (url_server_ID, url_path_ID, querystring,
>> >category)
>> >SELECT DISTINCT us.ID, up.ID, if(bt.path_split >0, 
>> SUBSTRING(bt.url,path),
>> >NULL),
>> >bt.category
>> >FROM bulk_table bt
>> >INNER JOIN url_servers us
>> >        ON us.server = bt.server
>> >INNER JOIN url_paths up
>> >        on up.path=bt.path
>> >
>> >After this I ran the sql to insert into internet_usage and this time it
>> >finished after 10m32.
>> >
>> >I'll do some more testing and let you know how it goes.
>> >
>> >Thanks,
>> >
>> >js.
>> > >
>> > >Did you mean there to be duplicates in the url_visits? Do I need to 
>> use
>> > >IGNORE in the following SQL?
>> > >
>> > >INSERT url_visit (url_server_ID, url_path_ID, querystring,
>> > >category)
>> > >SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path),
>> > >NULL),
>> > >bt.category
>> > >FROM bulk_table bt
>> > >INNER JOIN url_servers us
>> > >       ON us.server = bt.server
>> > >INNER JOIN url_paths up
>> > >       on up.path=bt.path
>> > >
>> > >js.
>> > >
>> > >mysql> select * from url_visit where urlid=1631;
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
> 
>>
>>
>> >
>> > >| urlid | url_server_ID | url_path_ID | query
>> >
>> > >
>> >
>> > >     | category     |
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
> 
>>
>>
>> >
>> > >|  1631 |         21720 |      630695 |
>> >
>>
> >cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift
> 
>>
>>
>> >
>> > >| Online Sales |
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
> 
>>
>>
>> >
>> > >1 row in set (0.01 sec)
>> > >
>> > >mysql> select * from url_visit where urlid=1753;
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
> 
>>
>>
>> >
>> > >| urlid | url_server_ID | url_path_ID | query
>> >
>> > >
>> >
>> > >     | category     |
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
> 
>>
>>
>> >
>> > >|  1753 |         21720 |      630695 |
>> >
>>
> >cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift
> 
>>
>>
>> >
>> > >| Online Sales |
>> >
>>
> >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
> 
>>
>>
>> >
>> > >1 row in set (0.00 sec)
>> > >
>> > >>
>> > >>Shawn,
>> > >>
>> > >>Thanks for your reply below. I found it extremely useful. I have
>> >followed
>> >
>> > >>your instructions and got good results up to the url_visits table.
>> > >>
>> > >>I have a perl script to parse the values out of the log. The log
> has
>> > >>3,770,246 lines and is gzipped. I then applied your SQL statements 
>> with
>> > >>the following times:
>> > >>
>> > >>1. Parse script: 10m31
>> > >>2. Load infile: 2m39
>> > >>3. Insert url_servers and url_paths: 5m38
>> > >>4. Insert url_visits 9m50
>> > >>
>> > >>However when I ran the sql to insert the internet usage table, it 
>> just
>> >ran
>> > >>until all the log space was filled up:
>> > >>
>> > >>-rw-rw----   1 mysql    mysql    379694680 Jun 22 10:05 
>> bulk_table.MYD
>> >
>> > >>        <====
>> > >>-rw-rw----   1 mysql    mysql       1024 Jun 22 10:05
> bulk_table.MYI
>> > >>-rw-rw----   1 mysql    mysql       8760 Jun 22 09:59
> bulk_table.frm
>> > >>-rw-rw----   1 mysql    mysql    2114977792 Jun 22 11:11
>> > >>internet_usage.MYD     <====
>> > >>-rw-rw----   1 mysql    mysql       1024 Jun 22 10:34
>> internet_usage.MYI
>> > >>-rw-rw----   1 mysql    mysql       8646 Jun 22 09:59
>> internet_usage.frm
>> > >>-rw-rw----   1 mysql    mysql    33437600 Jun 22 10:23
> url_paths.MYD
>> > >>-rw-rw----   1 mysql    mysql    27696128 Jun 22 10:23
> url_paths.MYI
>> > >>-rw-rw----   1 mysql    mysql       8574 Jun 22 09:59 url_paths.frm
>> > >>-rw-rw----   1 mysql    mysql     646160 Jun 22 10:19
> url_servers.MYD
>> > >>-rw-rw----   1 mysql    mysql     796672 Jun 22 10:19
> url_servers.MYI
>> > >>-rw-rw----   1 mysql    mysql       8578 Jun 22 09:59
> url_servers.frm
>> > >>-rw-rw----   1 mysql    mysql    119076844 Jun 22 10:32
> url_visit.MYD
>> > >>-rw-rw----   1 mysql    mysql    73026560 Jun 22 10:33
> url_visit.MYI
>> > >>-rw-rw----   1 mysql    mysql       8694 Jun 22 09:59 url_visit.frm
>> > >>
>> > >>I'm sure something's not right because the internet Usage table is
>> >bigger
>> >
>> > >>than the bulk table. I changed your sql a bit (on the last line
> with
>> the
>> > >>query). Could this be the problem?
>> > >>
>> > >>INSERT internet_usage (uid,time,ip,urlid,size)
>> > >>SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
>> > >>FROM bulk_table bt
>> > >>INNER JOIN url_servers us
>> > >>        ON us.server=bt.server
>> > >>INNER JOIN url_paths up
>> > >>        ON up.path=bt.path
>> > >>INNER JOIN url_visit uv
>> > >>        ON uv.url_server_ID=us.ID
>> > >>        AND uv.url_path_ID=up.ID
>> > >>        AND uv.query=bt.query;
>> > >>
>> > >>Thanks again,
>> > >>
>> > >>js.
>> > >>
>> > >>>
>> > >>>-------- welcome to a basic overview of bulk importing and 
>> normalizing
>> >as
>> > >>>you go ------------
>> > >>>
>> > >>>[ author's note: if you are seeing this thread for the first
> time 
>> and
>> > >>>certain items seem to be
>> > >>>introduced out of context, please review all previous posts in
> this
>> > >>>thread.
>> > >>>There has been
>> > >>>a lot of information already exchanged on this topic that I
> clipped
>> > >>>out of this response. Thanks! -- SG]
>> > >>>
>> > >>>In an earlier post you said
>> > >>> >The data values I have for each record are:
>> > >>> >
>> > >>> >user_id   date_time   size   url   category
>> > >>>
>> > >>> >for example:
>> > >>> >
>> > >>> >u752359   2004-04-02 12:33:04   3403
>> >http://www.mysql.com/index.html
>> > >>>business
>> > >>> >
>> > >>>
>> > >>>To me that implies that you have a standard text log where each
> 
>> field
>> >is
>> > >>>separated by a space and rows are delimited by a CRLF pair. To
> get
>> that
>> > >>>log
>> > >>>into MySQL you will need to do something _like_ (I say "like" as
> you
>> >will
>> > >>>most likely need to tweak it to accommodate your actual data)
>> > >>>
>> > >>>CREATE TABLE bulk_table (
>> > >>>       user varchar(10)
>> > >>>       , eventdate date
>> > >>>       , eventtime time
>> > >>>       , size int
>> > >>>       , url text
>> > >>>       , category varchar(50)
>> > >>>);
>> > >>>
>> > >>>LOAD DATA INFILE <a proxy log file> INTO bulk_table FIELDS
> 
>> TERMINATED
>> >BY
>> >
>> > >>>'
>> > >>>' LINES TERMINATED BY '\r\n';
>> > >>>
>> > >>>Now we can start merging the logs into your data tables.
>> > >>>
>> > >>>To quote RFC 2396:
>> > >>>    The URI syntax does not require that the
> scheme-specific-part 
>> have
>> > >>>    any general structure or set of semantics which is common
> among
>> all
>> > >>>    URI.  However, a subset of URI do share a common syntax for
>> > >>>    representing hierarchical relationships within the
> namespace.
>> This
>> > >>>    "generic URI" syntax consists of a sequence of four main
>> >components:
>> > >>>
>> > >>>      
> <scheme>://<authority><path>?<query>
>> > >>>
>> > >>>    each of which, except <scheme>, may be absent from a
> particular
>> >URI.
>> > >>>
>> > >>>That translates into 4 logical pieces we can split a URL into:
>> > >>>1) the scheme -- HTTP, FTP, GOPHER, etc...
>> > >>>2) the authority -- that is the server (www.yahoo.com) and any
> port
>> > >>>numbers
>> > >>>or login information
>> > >>>3) the path -- /somefolder/somefile.whatever
>> > >>>4) the query -- everything after the ?
>> > >>>
>> > >>>Not part of the generic URI is that bit known as a fragment (as
>> > >>>identified
>> > >>>by the #) it is ALWAYS at the end of the _entire URL_ (including
> the
>> > >>>query)
>> > >>>when it's used. I have always lumped those and queries into the
> same
>> > >>>field.
>> > >>>
>> > >>>So you could create tables for each of those parts and get VERY
>> > >>>normalized
>> > >>>or you can partly normalize like this:
>> > >>>
>> > >>>ALTER TABLE bulk_table add server_split int not null default 0
>> > >>>       , add path_split int not null default 0
>> > >>>       , add server varchar(255)
>> > >>>       , add path varchar(255)
>> > >>>
>> > >>>UPDATE bulk_table
>> > >>>set server_split = LOCATE('/', URL , 8)-1;
>> > >>>
>> > >>>UPDATE bulk_table
>> > >>>SET path_split = if(LOCATE('?', URL, server_split)> 0,
> LOCATE('?',
>> URL,
>> > >>>server_split),  LOCATE('#', URL, server_split)-1);
>> > >>>
>> > >>>UPDATE bulk_table
>> > >>>set server=LEFT(URL, server_split )
>> > >>>
>> > >>>#those 4 new columns helped us to parse out the 3 major parts of
> the
>> >url
>> > >>>#I added them to the table so that we would not have to keep
>> > >>>recalculating
>> > >>>those values later on
>> > >>># if it turns out that adding the columns takes a LONG time, we
> can
>> > >>>create
>> > >>>this table with those columns
>> > >>># already created and just not import to them (change the LOAD
> DATA
>> > >>>INFILE
>> > >>>statement slightly)
>> > >>>
>> > >>>CREATE TABLE IF NOT EXISTS url_servers (
>> > >>>ID int not null auto_increment,
>> > >>>server varchar(255) primary key,
>> > >>>Key (ID)
>> > >>>)
>> > >>>
>> > >>>CREATE TABLE IF NOT EXISTS url_paths (
>> > >>>ID int not null auto_increment,
>> > >>>path varchar(255) primary key,
>> > >>>Key (ID)
>> > >>>)
>> > >>>
>> > >>>INSERT IGNORE INTO url_servers(server)
>> > >>>SELECT DISTINCT server
>> > >>>FROM bulk_table
>> > >>>
>> > >>>INSERT IGNORE INTO url_paths (path)
>> > >>>SELECT DISTINCT path
>> > >>>FROM bulk_table
>> > >>>
>> > >>># at this point we have all of our new Servers and our Paths 
>> uniquely
>> > >>>numbered
>> > >>># but we are going to need a slightly different URL table to
> track
>> > >>>visits.
>> > >>>
>> > >>>CREATE TABLE url_visit (
>> > >>>urlid mediumint not null auto_increment primary key,
>> > >>>url_server_ID int not null default 0,
>> > >>>url_path_ID int not null default 0,
>> > >>>querystring text default null,
>> > >>>category varchar(50)default null,
>> > >>>KEY(url_server_ID, url_path_ID)
>> > >>>)
>> > >>>
>> > >>>## that last key is to speed up our joins to our _servers and
> _paths
>> > >>>tables...
>> > >>>
>> > >>>#  we finally have enough information to insert to the visit
> table
>> > >>>
>> > >>>INSERT url_visit (url_server_ID, url_path_ID, querystring,
> category)
>> > >>>SELECT us.ID, up.ID, if(bt.path_split >0,
> SUBSTRING(bt.url,path),
>> >NULL),
>> > >>>bt.category
>> > >>>FROM bulk_table bt
>> > >>>INNER JOIN url_servers us
>> > >>>       ON us.server = bt.server
>> > >>>INNER JOIN url_paths up
>> > >>>       on up.path=bt.path
>> > >>>
>> > >>>## (see where the new pre-computed columns come in handy?) :-D
>> > >>>
>> > >>># and now we have enough information to load the internet_usage
> 
>> table.
>> > >>>Though there isn't enough data in your sample
>> > >>># source data to fill in all of the columns
>> > >>>
>> > >>>INSERT internet_usage ( uid,`time`,urlid, size)
>> > >>>SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size
>> > >>>FROM bulk_table bt
>> > >>>INNER JOIN url_servers us
>> > >>>       ON us.server = bt.server
>> > >>>INNER JOIN url_paths up
>> > >>>       ON up.path=bt.path
>> > >>>INNER JOIN url_visit uv
>> > >>>       ON uv.url_server_ID = us.ID
>> > >>>       AND uv.url_path_ID = up.id
>> > >>>       AND uv.querystring = if(bt.path_split >0,
>> >SUBSTRING(bt.url,path),
>> > >>>NULL)
>> > >>>
>> > >>>
>> > >>>It may not be perfect but it's how I would do it. I am sure
> that
>> others
>> > >>>out
>> > >>>there would have a twist or two to put on the process.  When
> it's 
>> all
>> > >>>over
>> > >>>and done, clear out the bulk_table and do it all over again.
>> > >>>
>> > >>>I would not mix a statistic, like hit count, into the url_visit
> 
>> table.
>> > >>>This
>> > >>>next query should be blazingly fast as both columns are part of
> a
>> > >>>compound
>> > >>>index.
>> > >>>
>> > >>>SELECT url_server_ID, url_path_ID, count(1)
>> > >>>from url_visit
>> > >>>group by url_server_ID,url_path_Id;
>> > >>>
>> > >>>-OR- if you would rather see full names:
>> > >>>
>> > >>>create temporary table tmpStats
>> > >>>SELECT url_server_ID, url_path_ID, count(1) as hits
>> > >>>from url_visit
>> > >>>group by url_server_ID,url_path_Id;
>> > >>>
>> > >>>select concat(us.server,up.path), ts.hits
>> > >>>from tmpStats
>> > >>>inner join url_servers us
>> > >>>on us.id = ts.url_server_ID
>> > >>>inner join url_paths up
>> > >>>on up.id = ts.url_path_ID
>> > >>>
>> > >>>drop table tmpStats;
>> > >>>
>> > >>>(NOTE: I used a temp table to aggregate on the ID values before
>> >resolving
>> > >>>to the names because resolving those BEFORE running the GROUP BY
> 
>> would
>> > >>>have
>> > >>>eliminated the use of an existing index.)
>> > >>>
>> > >>>Well, I have to get back to work. Let me know how this works for
> 
>> you,
>> >OK?
>> > >>>
>> > >>>Cheers,
>> > >>>Shawn Green
>> > >>>Database Administrator
>> > >>>Unimin Corporation - Spruce Pine
>> > >>>
>> > >>>
>> > >>>------------all previous responses clipped for space
>> > >>>---------------------
>> > >>>
>> > >>>
>> >
>> >
>> >
>> >
>> >
>> >--
>> >MySQL General Mailing List
>> >For list archives: http://lists.mysql.com/mysql
>> >To unsubscribe:    
>> http://lists.mysql.com/mysql?unsub=1
>> >
>>
>> _________________________________________________________________
>> Want to block unwanted pop-ups? Download the free MSN Toolbar now!
>> http://toolbar.msn.co.uk/
>>
>>
>>
>>
>>
>>
>>
>> -- 
>> MySQL General Mailing List
>> For list archives: http://lists.mysql.com/mysql
>> To unsubscribe:    http://lists.mysql.com/mysql?unsub=1
>>
> 
> _________________________________________________________________
> It's fast, it's easy and it's free. Get MSN Messenger today! 
> http://www.msn.co.uk/messenger
> 
> 

Thread
load data into 2 tables and set idJ S9 Jun
  • Re: load data into 2 tables and set idWilliam R. Mussatto9 Jun
  • Re: load data into 2 tables and set idmos9 Jun
Re: load data into 2 tables and set idJ S9 Jun
  • Re: load data into 2 tables and set idWilliam R. Mussatto9 Jun
Re: load data into 2 tables and set idJ S10 Jun
Re: load data into 2 tables and set idSGreen10 Jun
Re: load data into 2 tables and set idJ S18 Jun
Re: load data into 2 tables and set idSGreen18 Jun
Re: load data into 2 tables and set idJ S18 Jun
Re: load data into 2 tables and set idSGreen18 Jun
Re: load data into 2 tables and set idJ S18 Jun
Re: load data into 2 tables and set idSGreen18 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idSGreen22 Jun
Re: load data into 2 tables and set idJ S22 Jun
Re: load data into 2 tables and set idJ S23 Jun
Re: load data into 2 tables and set idSGreen23 Jun
Re: load data into 2 tables and set idJ S23 Jun
  • Re: load data into 2 tables and set idMichael Stassen23 Jun
Re: load data into 2 tables and set idSGreen23 Jun
Re: load data into 2 tables and set idJ S23 Jun
Re: load data into 2 tables and set idSGreen23 Jun
Re: load data into 2 tables and set idJ S26 Jun
Re: load data into 2 tables and set idJ S26 Jun