List: MySQL and Perl
From: Steve Hay  Date: June 11 2004 12:03pm
Subject: Re: [PATCH] Re: blessing db data as utf8
Gaal Yahas wrote:

>On Fri, Jun 11, 2004 at 09:14:21AM +0100, Steve Hay wrote:
>>I was concerned that searching for some sequence of bytes that make up a 
>>UTF-8 character might accidentally match in the wrong place, like the 
>>last byte of one character and the first byte of another.  Are you 
>>implying that this can't ever happen because of how UTF-8 works?
>
>Yes, utf8 is self-synchronizing. If both the needle and the haystack are
>utf8, you don't get false positives even with non-utf8-aware strcmp-like
>code.
>
Cool.  That's a very useful thing to know.
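
A minimal sketch of why this holds, written in Python rather than Perl purely for illustration. The sample strings and the helper names (`is_continuation`, `find_all`) are assumptions of this sketch, not anything from DBD::mysql. In UTF-8, lead bytes and continuation bytes (`10xxxxxx`) occupy disjoint ranges, so a needle that is itself valid UTF-8 can only match a valid UTF-8 haystack at a character boundary, even when the search is done on raw bytes:

```python
# Byte-level substring search over UTF-8 data, with no decoding involved.
# "\u00ef" is i-diaeresis (bytes C3 AF); "\u00e9" is e-acute (bytes C3 A9).
haystack = "na\u00efve caf\u00e9".encode("utf-8")
needle = "\u00e9".encode("utf-8")


def is_continuation(byte):
    """True for UTF-8 continuation bytes, which all look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def find_all(haystack, needle):
    """Return every byte offset at which needle occurs in haystack."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


hits = find_all(haystack, needle)

# Every hit starts on a character boundary (not on a continuation byte) ...
starts_ok = all(not is_continuation(haystack[i]) for i in hits)

# ... and ends on one too (the next byte, if any, starts a new character).
ends_ok = all(
    i + len(needle) == len(haystack)
    or not is_continuation(haystack[i + len(needle)])
    for i in hits
)
```

Note that both characters above share the lead byte C3, yet the search still finds only the e-acute, because the needle's second byte has to line up with a continuation byte of the same character.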

>>>of course, collating and therefore ORDER BY won't
>>>work correctly either, and the sizes the database knows about will all
>>>be in bytes instead of characters. Bothersome but not insurmountable. :)
>>>
>>I assume you mean pulling the data into Perl, having it correctly 
>>flagged as UTF-8 there, and doing things like sorting in the Perl code?
>>
>>I could live with that, but the UPPER()/LOWER() issue is more of a 
>>problem.  I make a lot of use of them and it's not so easy to work around 
>>in the Perl.
>
>You're right, but there's not much we can do about it until the database
>supports utf8 natively.
>
>
>Out of curiosity, where do you make use of this? Case-insensitive lookups
>that preserve the original case?
>
Yes, exactly that.  I'm dealing with software that indexes various 
things read out of XML files into the database.  Users can search by 
either the data that was extracted or by the filenames.  Either way, 
they want to do case-insensitive searches, but see the original case in 
the results.

This is particularly relevant to the filenames themselves because this 
is all on Windows which has a case-insensitive but case-preserving 
filesystem.
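
One possible workaround while the database's UPPER()/LOWER() are not UTF-8 aware is to do the case folding in application code and store the folded form in a second column. The sketch below is only an illustration under stated assumptions: it uses Python and an in-memory SQLite database as stand-ins for Perl and MySQL, and the table, column, and function names (`files`, `name`, `name_key`, `fold`, `add_file`, `find_files`) are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the real MySQL database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT NOT NULL, name_key TEXT NOT NULL)")
conn.execute("CREATE INDEX files_name_key ON files (name_key)")


def fold(text):
    """Unicode-aware case folding done in the application, not in SQL."""
    return text.casefold()


def add_file(name):
    """Store the original name plus a folded key used only for lookups."""
    conn.execute(
        "INSERT INTO files (name, name_key) VALUES (?, ?)",
        (name, fold(name)),
    )


def find_files(query):
    """Case-insensitive exact match that returns the original-case names."""
    rows = conn.execute(
        "SELECT name FROM files WHERE name_key = ? ORDER BY name",
        (fold(query),),
    )
    return [row[0] for row in rows]


# "\u00e9" is e-acute and "\u00c9" is its uppercase form.
for name in ["R\u00e9sum\u00e9.XML", "Report.xml", "notes.txt"]:
    add_file(name)

matches = find_files("R\u00c9SUM\u00c9.xml")
```

The folded column costs extra storage, but the comparison itself becomes a plain byte-equality test that the database can index, and the original case is still available for display.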

- Steve



