On 6/14/2012 5:57 PM, Gary Aitken wrote:
> Hi all,
> I've looked high and low for what I hope is a trivial answer.
> I was trying to load a table using LOAD DATA INFILE. Unfortunately, it craps out
> because there are some duplicate primary keys. Not surprising as the source did not
> enforce uniqueness. My problem is the load data simply dies without indicating which line
> of the input file was in error; the error message refers to line 3, which is not even the
> SQL statement for the LOAD DATA INFILE statement:
> I can get the table loaded by specifying REPLACE INTO TABLE, but that still leaves me
> with not knowing where the duplicate records are.
> So... I wanted to read the data line at a time and use a plain INSERT statement.
> That way I could check for duplicate keys and discover where the duplicate records are.
> However, I can't find a way to read input from the console or a file. What am I missing?
> I know I could write a java or C++ program to do this, but it seems like overkill for what
> should be a trivial task.
> Thanks for any pointers,
The trivial thing I do to solve this problem is to create a copy of the
destination table without any PRIMARY KEY or UNIQUE constraints on it.
That gives you an empty table into which you can bulk-import your raw
data. I am not sure if there is an official term for this, but I call it
a 'landing table'. This is the first step of the import process.
Once you get your data off of disk and into the landing table (it's
where the raw import lands inside the database), you can check it for
duplicates very easily.
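A minimal sketch of that first step (the table name `destination` and
the file path '/tmp/raw.txt' are placeholders; the field and line
terminators depend on how your source file is actually formatted):

```sql
-- Clone the structure of the real table, then strip the key so
-- duplicate values in the raw file cannot abort the load.
-- If the table also carries UNIQUE indexes, drop those too
-- (SHOW CREATE TABLE landing will list them).
CREATE TABLE landing LIKE destination;
ALTER TABLE landing DROP PRIMARY KEY;

-- Bulk-load the raw file into the unconstrained landing table.
LOAD DATA INFILE '/tmp/raw.txt' INTO TABLE landing
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';
```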
1) create a normal index for the PRIMARY KEY column
2) create another table that has a list of the duplicates, like this:
CREATE TABLE dup_list ENGINE=MYISAM
SELECT pkcol, COUNT(*) AS freq
FROM landing
GROUP BY pkcol
HAVING freq > 1;
* Use a MyISAM table for this preparation work even if the destination
table uses the InnoDB storage engine; you really do not need to
protect this data with a transaction yet.
* MyISAM indexes are also very fast for COUNT(*) queries.
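Step 1 above might look like this, assuming the key column is called
`pkcol` as in the example:

```sql
-- A plain (non-unique) index speeds up the GROUP BY scan without
-- rejecting the duplicate values we are trying to find.
ALTER TABLE landing ADD INDEX idx_pk (pkcol);
```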
At this point you have clearly identified all duplicate rows within the
incoming data. You can also compare those rows with your existing rows
to see if any of them duplicate each other (hint: INNER JOIN is your
friend) or if any exist in one table but not the other (hint: LEFT JOIN).
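The two comparisons hinted at might be sketched like this (again
assuming the real table is named `destination` and the key column is
`pkcol`):

```sql
-- Incoming rows whose keys collide with rows already in the database:
SELECT l.*
FROM landing l
INNER JOIN destination d ON l.pkcol = d.pkcol;

-- Incoming rows whose keys are genuinely new:
SELECT l.*
FROM landing l
LEFT JOIN destination d ON l.pkcol = d.pkcol
WHERE d.pkcol IS NULL;
```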
From here you should be able to cleanse the incoming data (remove
duplicates, adjust any weird fields) and merge it with your existing
data to maintain the relational and logical integrity of your tables.
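One way to sketch that final merge (names hypothetical, matching the
examples above): move over only the rows that are not flagged in
dup_list, leaving the duplicated keys behind for manual cleansing.

```sql
-- Merge the clean rows; anything listed in dup_list stays in the
-- landing table until you decide which copy of each key to keep.
INSERT INTO destination
SELECT l.*
FROM landing l
LEFT JOIN dup_list d ON l.pkcol = d.pkcol
WHERE d.pkcol IS NULL;
```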
MySQL Principal Technical Support Engineer
Oracle USA, Inc. - Hardware and Software, Engineered to Work Together.
Office: Blountville, TN