>I would suggest that the manner appropriate to most any environment is to
>just use plain ascii for your filenames :-) The "swung dash" you refer to
>is called a tilde, btw, and is mostly used in spanish.
[JS] ... and mathematical notation.
I certainly agree with your suggestion about file names, I don't even like to
see spaces (although that's common on Windows platforms).
However, I thought the original question had to do with the CONTENTS, not the
file names. Getting back to what I think was the original question, I work in
a multi-lingual (mostly Western Europe and Asia). As long as I stick to UTF-8
and web browsers, I'm okay. Unfortunately I have to import data from MS Excel
worksheets a lot, and then all bets are off. When you get a worksheet that was
populated by copy/paste from MS Word, you'll wind up with all kinds of
characters that "look right" but are actually not what you think they are.
(There are some Cyrillic punctuation marks that look like Latin-1 punctuation
marks, but are not the same.)
I spent a very long time experimenting with this, and summarized my
conclusions in http://lists.mysql.com/mysql/212392.
Here's a little bit of code that I use when cleaning up blocks of text. It
might not do what you want! It works on a "transliteration" scheme that maps
"funky" characters into rough equivalents. We use this throughout our system,
so the contents of the database are consistent in certain areas.
There are other places where we use UTF-8, because the source is UTF-8. It
would have been better to use UTF-8 throughout, rather than this
transliteration scheme, but not only did I inherit a lot of existing data but
my colleagues in Japan use the "MS Mincho" font, which can't handle these
characters. (If they used "Arial Unicode MS" it would solve a lot of problems,
but I don't run the zoo.)
It's written in Visual Basic, and implemented as an Excel function, but is
easily re-used. (I have implemented the same algorithm in PHP, for our web
Const VERSION As String = "2009-12-18 - 11:51"
Public Function FixCP1252(CellToScan As String)
' This function will transliterate the common high-ANSI (CP1252)
' characters that come from pasting MS Office text into an Excel
Dim Temp As String
Dim I As Integer
Dim CharsToReplace(7, 1) As String
CharsToReplace(0, 0) = Chr(&H96)
CharsToReplace(0, 1) = "-"
CharsToReplace(1, 0) = Chr(&H97)
CharsToReplace(1, 1) = "--"
CharsToReplace(2, 0) = Chr(&H91)
CharsToReplace(2, 1) = "'"
CharsToReplace(3, 0) = Chr(&H92)
CharsToReplace(3, 1) = "'"
CharsToReplace(4, 0) = Chr(&H85)
CharsToReplace(4, 1) = "..."
CharsToReplace(5, 0) = Chr(&H93)
CharsToReplace(5, 1) = """"
CharsToReplace(6, 0) = Chr(&H94)
CharsToReplace(6, 1) = """"
CharsToReplace(7, 0) = Chr(&H95)
CharsToReplace(7, 1) = "*"
Temp = CellToScan
For I = 0 To UBound(CharsToReplace, 1) - 1
Temp = Replace(Temp, CharsToReplace(I, 0), CharsToReplace(I, 1))
FixCP1252 = Temp
I also have to take text with all of this weirdness and make web pages out of
it. Just in case it comes in handy, here's the code I use for that:
Private Function FixTroubleChars(ByVal LineIn As String) As String
' This little function cleans out any really troublesome characters, making
' as appropriate. We'll have to extend the coding as necessary.
Dim Temp As String
' Fix some Unicode characters that are too weird to handle normally.
According to the
' Unicode maps, some are for "private" use (meaning that they have no
standard glyph assignment).
' Microsoft's "Arial Unicode MS" font can usually give you a suggestion,
since the data
' probably came from a Windows source.
Temp = Replace(LineIn, ChrW(&HDBC0), "•")
Temp = Replace(Temp, ChrW(&HDC83), "")
Temp = Replace(Temp, ChrW(&HF0A7), "•")
FixTroubleChars = Temp
Function MyHTMLEncode(ByVal InString As String)
Dim OutString As String, CleanString As String
Dim ThisChar As String * 1
Dim I As Integer
Dim CodePoint As Long
' First, take care of anything that is truly horrible and cannot be
CleanString = FixTroubleChars(InString)
' Encode all "special" characters for use in a web page.
OutString = ""
For I = 1 To Len(CleanString)
ThisChar = Mid(CleanString, I, 1)
If ThisChar Like "[- a-zA-Z0-9!""#$%'&()*+,./:;=?@]" Then
OutString = OutString & ThisChar
CodePoint = AscW(ThisChar)
If CodePoint < 0 Then
MsgBox "Untranslatable character in " & vbCrLf & """" &
InString & """" & vbCrLf & vbCrLf _
& "Codepoint = " & CodePoint & vbCrLf & vbCrLf _
& "Inspect the character and modify function
FixTroubleChars() accordingly", _
vbCritical + vbOKOnly, "Bad character"
Application.Cursor = xlDefault
OutString = OutString & "&#" & CodePoint & ";"
MyHTMLEncode = OutString
I hope this helps somebody.
The Infoshop by Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032
860.674.8796 / FAX: 860.674.8341