List:General Discussion« Previous MessageNext Message »
From:Jerry Schwartz Date:March 17 2010 3:41pm
Subject:RE: dash was converted o a wierd character
View as plain text  
>I would suggest that the manner appropriate to most any environment is to
>just use plain ascii for your filenames :-)  The "swung dash" you refer to
>is called a tilde, btw, and is mostly used in spanish.
>
[JS] ... and mathematical notation.

I certainly agree with your suggestion about file names, I don't even like to 
see spaces (although that's common on Windows platforms).

However, I thought the original question had to do with the CONTENTS, not the 
file names. Getting back to what I think was the original question, I work in 
a multi-lingual (mostly Western Europe and Asia). As long as I stick to UTF-8 
and web browsers, I'm okay. Unfortunately I have to import data from MS Excel 
worksheets a lot, and then all bets are off. When you get a worksheet that was 
populated by copy/paste from MS Word, you'll wind up with all kinds of 
characters that "look right" but are actually not what you think they are. 
(There are some Cyrillic punctuation marks that look like Latin-1 punctuation 
marks, but are not the same.)

I spent a very long time experimenting with this, and summarized my 
conclusions in http://lists.mysql.com/mysql/212392.

Here's a little bit of code that I use when cleaning up blocks of text. It 
might not do what you want! It works on a "transliteration" scheme that maps 
"funky" characters into rough equivalents. We use this throughout our system, 
so the contents of the database are consistent in certain areas.

There are other places where we use UTF-8, because the source is UTF-8. It 
would have been better to use UTF-8 throughout, rather than this 
transliteration scheme, but not only did I inherit a lot of existing data but 
my colleagues in Japan use the "MS Mincho" font, which can't handle these 
characters. (If they used "Arial Unicode MS" it would solve a lot of problems, 
but I don't run the zoo.)

It's written in Visual Basic, and implemented as an Excel function, but is 
easily re-used. (I have implemented the same algorithm in PHP, for our web 
site.)

==========
Option Explicit
Const VERSION As String = "2009-12-18 - 11:51"
Public Function FixCP1252(CellToScan As String)

'   This function will transliterate the common high-ANSI (CP1252)
'   characters that come from pasting MS Office text into an Excel
'   spreadsheet.

    Dim Temp As String
    Dim I As Integer

    Dim CharsToReplace(7, 1) As String

    CharsToReplace(0, 0) = Chr(&H96)
        CharsToReplace(0, 1) = "-"
    CharsToReplace(1, 0) = Chr(&H97)
        CharsToReplace(1, 1) = "--"
    CharsToReplace(2, 0) = Chr(&H91)
        CharsToReplace(2, 1) = "'"
    CharsToReplace(3, 0) = Chr(&H92)
        CharsToReplace(3, 1) = "'"
    CharsToReplace(4, 0) = Chr(&H85)
        CharsToReplace(4, 1) = "..."
    CharsToReplace(5, 0) = Chr(&H93)
        CharsToReplace(5, 1) = """"
    CharsToReplace(6, 0) = Chr(&H94)
        CharsToReplace(6, 1) = """"
    CharsToReplace(7, 0) = Chr(&H95)
        CharsToReplace(7, 1) = "*"

    Temp = CellToScan
    For I = 0 To UBound(CharsToReplace, 1) - 1
        Temp = Replace(Temp, CharsToReplace(I, 0), CharsToReplace(I, 1))
    Next I

    FixCP1252 = Temp

End Function
==========

I also have to take text with all of this weirdness and make web pages out of 
it. Just in case it comes in handy, here's the code I use for that:

==========
Private Function FixTroubleChars(ByVal LineIn As String) As String

'   This little function cleans out any really troublesome characters, making 
substitutions
'   as appropriate. We'll have to extend the coding as necessary.

    Dim Temp As String

    '   Fix some Unicode characters that are too weird to handle normally. 
According to the
    '   Unicode maps, some are for "private" use (meaning that they have no 
standard glyph assignment).
    '   Microsoft's "Arial Unicode MS" font can usually give you a suggestion, 
since the data
    '   probably came from a Windows source.

    Temp = Replace(LineIn, ChrW(&HDBC0), "•")
    Temp = Replace(Temp, ChrW(&HDC83), "")
    Temp = Replace(Temp, ChrW(&HF0A7), "•")

    FixTroubleChars = Temp

End Function

Function MyHTMLEncode(ByVal InString As String)

    Dim OutString As String, CleanString As String
    Dim ThisChar As String * 1
    Dim I As Integer
    Dim CodePoint As Long

    '   First, take care of anything that is truly horrible and cannot be 
used.

    CleanString = FixTroubleChars(InString)

    '   Encode all "special" characters for use in a web page.

    OutString = ""

    For I = 1 To Len(CleanString)

        ThisChar = Mid(CleanString, I, 1)
        If ThisChar Like "[- a-zA-Z0-9!""#$%'&()*+,./:;=?@]" Then
            OutString = OutString & ThisChar
        Else
            CodePoint = AscW(ThisChar)
            If CodePoint < 0 Then
                MsgBox "Untranslatable character in " & vbCrLf & """" & 
InString & """" & vbCrLf & vbCrLf _
                        & "Codepoint = " & CodePoint & vbCrLf & vbCrLf _
                        & "Inspect the character and modify function 
FixTroubleChars() accordingly", _
                        vbCritical + vbOKOnly, "Bad character"
                Application.Cursor = xlDefault
                End
            End If
            OutString = OutString & "&#" & CodePoint & ";"
        End If

    Next I

    MyHTMLEncode = OutString

End Function
==========

I hope this helps somebody.

Regards,

Jerry Schwartz
The Infoshop by Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com







Thread
dash was converted o a wierd characterNĂ©stor15 Mar
  • Re: dash was converted o a wierd characterJohan De Meersman16 Mar
    • Re: dash was converted o a wierd characterMichael Dykman16 Mar
      • Re: dash was converted o a wierd characterJohan De Meersman17 Mar
        • Identifiers (was: Re: dash was converted o a wierd character)Joerg Bruehe17 Mar
        • RE: dash was converted o a wierd characterJerry Schwartz17 Mar