List:General Discussion« Previous MessageNext Message »
From:Jan Steinman Date:December 15 2009 2:50am
Subject:CSV pain and suffering with punctuation and non-ASCII characters
View as plain text  
I am trying to import a database of notable quotations. It is really  
quite simple, but I'm finding importing via phpMyAdmin is (IMHO)  
erroneous.

The database came from FileMaker Pro 7, and there are no options for  
changing the output. There are two choices:
1) tabs between fields (with no opportunity to include the tab  
character in a field), and
2) CSV, which double-quotes fields and has commas in-between, which  
apparently doubles embedded quotes.

I am using UTF-8 bin as the character set, and as the encoding for all  
fields.

The problem is that numerous punctuation characters appear to  
terminate the reading of a field, whether imported as a TAB file or as  
a CSV file.

For example, importing a quote with a "Context" field of:
	The Hitchhiker's Guide to the Galaxy

results in a field containing:
	The Hitchhiker

whether I use TAB or CSV. Note that this in an ASCII single quote  
character, not something exotic. I have also noted this on fields that  
contain "curly quotes."

I am using phpMyAdmin's "CSV using LOAD DATA" option, checking  
"Replace table data with file", terminating fields with a comma,  
enclosing fields with double quotes, and blanking (or leaving the '\',  
it doesn't matter) the "Fields escaped by" field.

It seems to me that using CSV, with double-quoted, comma separated  
fields should tell the import process, "Hands off until you see  
another double quote!" And that using TAB format, nothing between TAB  
characters should be interpreted. Why are characters inside the double  
quotes or tabs being interpreted?

I have not tried mysqlimport from the command line, assuming (perhaps  
incorrectly) that phpMyAdmin was simply passing things along, and not  
interpreting them.

Ah, one light in the tunnel: my "Quote" field was TEXT. Changing it to  
BLOB preserves non-ASCII characters, but I still see the strange  
behaviour noted above with a single quote, which was in a VARCHAR(256)  
UTF8 field.

Thanks for whatever help you can offer!

:::: My pants just went on a wild rampage through a Long Island  
Bowling Alley! --  Zippy the Pinhead
:::: Jan Steinman ::::

Thread
CSV pain and suffering with punctuation and non-ASCII charactersJan Steinman15 Dec
  • RE: CSV pain and suffering with punctuation and non-ASCII charactersNeil Aggarwal15 Dec