From:Shawn Green Date:June 15 2012 12:48pm
Subject:Re: console input
On 6/14/2012 5:57 PM, Gary Aitken wrote:
> Hi all,
>
> I've looked high and low for what I hope is a trivial answer.
>
> I was trying to load a table using LOAD DATA INFILE.  Unfortunately, it craps out
> because there are some duplicate primary keys.  Not surprising as the source did not
> enforce uniqueness.  My problem is the load data simply dies without indicating which line
> of the input file was in error; the error message refers to line 3, which is not even the
> SQL statement for the LOAD DATA INTO statement:
>
> I can get the table loaded by specifying REPLACE INTO TABLE, but that still leaves me
> with not knowing where the duplicate records are.
>
> So...  I wanted to read the data line at a time and use a plain INSERT statement. 
> That way I could check for duplicate keys and discover where the duplicate records are. 
> However, I can't find a way to read input from the console or a file.  What am I missing? 
> I know I could write a java or C++ program to do this, but it seems like overkill for what
> should be a trivial task.
>
> Thanks for any pointers,
>
> Gary
>

The trivial thing I do to solve this problem is to create a copy of the 
destination table without any PRIMARY KEY or UNIQUE constraints on it. 
This gives you an empty space into which you can bulk import your raw 
data. I am not sure if there is an official term for this, but I call it 
a 'landing table'. This is the first step of the import process.

Once you can get your data off of disk and onto the landing table (it's 
where the raw import lands inside the database) you can check it for 
duplicates very easily.
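As a rough sketch of that first step (the table name 'dest' and the file path are hypothetical placeholders; adjust both, and note that DROP PRIMARY KEY needs extra care if the key column is AUTO_INCREMENT):

```sql
-- Clone the destination table's structure, then drop the
-- constraints that would make the bulk load fail on duplicates.
CREATE TABLE landing LIKE dest;
ALTER TABLE landing DROP PRIMARY KEY;  -- also drop any UNIQUE indexes

-- Bulk import the raw file; duplicate keys no longer abort the load.
LOAD DATA INFILE '/path/to/data.txt' INTO TABLE landing;
```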

1) create a normal index for the PRIMARY KEY column
2) create another table that holds the list of duplicates, like this:

CREATE TABLE dup_list ENGINE=MyISAM
  SELECT pkcol, COUNT(*) AS freq
  FROM landing
  GROUP BY pkcol
  HAVING freq > 1;

notes:
* Use a MyISAM table for this preparation work even if the destination 
table uses the InnoDB storage engine; you do not need to protect this 
scratch data with transactions yet.
* MyISAM indexes are also very fast for COUNT(*) queries.
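For step 1, something like the following (the index name is arbitrary): a plain, non-unique index never rejects duplicate values, but it speeds up the GROUP BY above and the joins that follow. And once dup_list exists, joining it back to the landing table shows you the full offending rows:

```sql
-- Step 1: a normal (non-unique) index on the key column.
ALTER TABLE landing ADD INDEX idx_pkcol (pkcol);

-- Pull the complete raw rows for every duplicated key,
-- grouped together so they are easy to eyeball.
SELECT l.*
FROM landing AS l
  INNER JOIN dup_list AS d ON l.pkcol = d.pkcol
ORDER BY l.pkcol;
```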

At this point you have identified all duplicate rows in the incoming data. You 
can also compare those rows with your existing rows to see if any of 
them duplicate each other (hint: INNER JOIN is your friend) or if any 
exist in one table but not the other (hint: LEFT JOIN).
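Sketches of both comparisons, again assuming the real table is called 'dest':

```sql
-- INNER JOIN: incoming keys that already exist in the real table.
SELECT l.pkcol
FROM landing AS l
  INNER JOIN dest AS d ON l.pkcol = d.pkcol;

-- LEFT JOIN: incoming keys that are genuinely new
-- (no matching row in dest, so the right side is NULL).
SELECT l.pkcol
FROM landing AS l
  LEFT JOIN dest AS d ON l.pkcol = d.pkcol
WHERE d.pkcol IS NULL;
```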

From here you should be able to cleanse the incoming data (remove 
duplicates, adjust any weird fields) and merge it with your existing 
data to maintain the relational and logical integrity of your tables.
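One possible shape for that final merge, keeping an arbitrary single row per key ('payload' stands in for your real non-key columns, and the conflict policy is only an example; you may prefer to resolve duplicates by hand first):

```sql
-- Collapse the landing table to one row per key, then merge it
-- into the real table; existing rows are updated rather than
-- causing a duplicate-key error.
INSERT INTO dest (pkcol, payload)
SELECT pkcol, MIN(payload)
FROM landing
GROUP BY pkcol
ON DUPLICATE KEY UPDATE payload = VALUES(payload);
```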

Regards,
-- 
Shawn Green
MySQL Principal Technical Support Engineer
Oracle USA, Inc. - Hardware and Software, Engineered to Work Together.
Office: Blountville, TN

