Gaal Yahas wrote:
>On Fri, Jun 11, 2004 at 09:14:21AM +0100, Steve Hay wrote:
>
>
>>I was concerned that searching for some sequence of bytes that make up a
>>UTF-8 character might accidentally match in the wrong place, like the
>>last byte of one character and the first byte of another. Are you
>>implying that this can't ever happen because of how UTF-8 works?
>>
>>
>
>Yes, utf8 is self-synchronizing. If both the needle and the haystack are
>utf8, you don't get false positives even with non-utf8-aware strcmp-like
>code.
>

Cool. That's a very useful thing to know.

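
For anyone who wants to convince themselves of this, here is a minimal sketch in Python (not Perl, and not code from this thread). It assumes both strings are complete, well-formed UTF-8; the helper names find_all and is_char_boundary are made up for illustration. It runs a plain byte-level search and checks that every hit starts on a character boundary, which follows from lead bytes and continuation bytes (10xxxxxx) occupying disjoint ranges.

```python
def find_all(haystack: bytes, needle: bytes) -> list:
    """Return every byte offset where needle occurs, using a plain byte search."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


def is_char_boundary(data: bytes, offset: int) -> bool:
    """True if offset is the end of data or does not point at a continuation byte."""
    # UTF-8 continuation bytes always look like 10xxxxxx.
    return offset == len(data) or (data[offset] & 0xC0) != 0x80


text = "naïve café résumé"
haystack = text.encode("utf-8")
needle = "é".encode("utf-8")

# Byte-level search, with no knowledge of UTF-8 at all.
hits = find_all(haystack, needle)

# Every hit begins and ends on a character boundary.
aligned = all(
    is_char_boundary(haystack, h) and is_char_boundary(haystack, h + len(needle))
    for h in hits
)
print(hits, aligned)
```

The guarantee only holds when the needle is itself a whole number of UTF-8 characters; an arbitrary byte fragment can of course match in the middle of a character.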
>>>of course, collating and therefore ORDER BY won't
>>>work correctly either, and the sizes the database knows about will all
>>>be in bytes instead of characters. Bothersome but not insurmountable. :)
>>>
>>>
>>>
>>I assume you mean pull the data into Perl, have the data correctly
>>flagged as UTF-8 there, and do things like sorting in the Perl code?
>>
>>I could live with that, but the UPPER()/LOWER() issue is more of a
>>problem. I make a lot of use of them and it's not so easy to work around
>>in the Perl.
>>
>>
>
>You're right, but there's not much we can do about it until the database
>supports utf8 natively.
>
>
>Out of curiosity, where do you make use of this? Case-insensitive lookups
>that preserve the original case?
>
Yes, exactly that. I'm dealing with software that indexes various
things read out of XML files into the database. Users can search by
either the data that was extracted or by the filenames. Either way,
they want to do case-insensitive searches, but see the original case in
the results.
This is particularly relevant to the filenames themselves because this
is all on Windows which has a case-insensitive but case-preserving
filesystem.
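
One possible workaround for this, sketched in Python rather than Perl and not something I have actually implemented: store a case-folded copy of each value next to the original, search on the folded copy, and return the original. Everything named here is hypothetical (the files table, the name and name_folded columns, the fold and search helpers), and an in-memory SQLite database merely stands in for the real one. The point is that the folding happens in the application, so the database's UPPER()/LOWER() never have to understand UTF-8.

```python
import sqlite3


def fold(text: str) -> str:
    """Case-fold in application code, where the string is known to be Unicode."""
    return text.casefold()


# In-memory SQLite as a stand-in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, name_folded TEXT)")

for name in ["Résumé.XML", "RÉSUMÉ-old.xml", "notes.xml"]:
    # Keep the original for display and the folded copy for matching.
    conn.execute("INSERT INTO files VALUES (?, ?)", (name, fold(name)))


def search(connection, term: str) -> list:
    """Case-insensitive substring search that returns names in their original case."""
    rows = connection.execute(
        "SELECT name FROM files WHERE instr(name_folded, ?) > 0",
        (fold(term),),
    )
    return [row[0] for row in rows]


matches = search(conn, "résumé")
print(matches)
```

The cost is an extra column that has to be kept in step with the original on every insert and update, but the comparison the database performs is then a plain byte-for-byte one.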
- Steve
------------------------------------------------
Radan Computational Ltd.