On Fri, Jun 11, 2004 at 09:14:21AM +0100, Steve Hay wrote:
> I was concerned that searching for some sequence of bytes that make up a
> UTF-8 character might accidentally match in the wrong place, like the
> last byte of one character and the first byte of another. Are you
> implying that this can't ever happen because of how UTF-8 works? I've
> never really looked into the detail of the UTF-8 coding; I've just used
> interfaces that manipulate it and took the view that the internals don't
> really interest me (and shouldn't do, if I'm doing things properly).
>
> If it's true, then it certainly does alleviate some of the pain.
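Yes, utf8 is self-synchronizing: the byte that starts a character and the
bytes that continue one come from disjoint ranges, so a valid utf8 needle
can only match on a character boundary. If both the needle and the
haystack are utf8, you don't get false positives even with non-utf8-aware
strstr-like code.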
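
A quick way to convince yourself, sketched in Python rather than Perl just
to keep it short. The helper names (find_all, is_char_boundary) and the
sample strings are made up for illustration; the only thing being relied
on is that utf8 continuation bytes always look like 10xxxxxx.

```python
def find_all(haystack: bytes, needle: bytes) -> list:
    """Return every byte offset where needle occurs, using plain byte search."""
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


def is_char_boundary(data: bytes, offset: int) -> bool:
    """A character starts at any byte that is not a continuation byte (10xxxxxx)."""
    return offset == len(data) or (data[offset] & 0xC0) != 0x80


# Illustrative data only: a mix of 1-, 2- and 3-byte characters.
haystack = "naïve café, Ελληνικά, 日本語".encode("utf-8")
needle = "é".encode("utf-8")

hits = find_all(haystack, needle)

# Every match found by the byte-level search starts on a character boundary.
assert all(is_char_boundary(haystack, h) for h in hits)
print(hits)
```

The search itself knows nothing about characters; the boundary check is
only there to show that the matches it finds are never misaligned.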
> What about things like UPPER() and LOWER(), though? Presumably they're
> not going to work because they'll operate on bytes and completely screw
> everything up?
True, that will not work.
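
To make the failure concrete, here is a rough simulation, again in Python.
It assumes the database treats the column as a single-byte Latin-1 string,
which is only a guess about your setup; other charsets will misbehave in
other ways. The table-building helper is hypothetical and merely stands in
for whatever the server does internally.

```python
def latin1_case_table(upper: bool) -> bytes:
    """Build a 256-entry byte translation table for Latin-1 case mapping."""
    table = bytearray(range(256))
    for b in range(256):
        ch = chr(b)
        mapped = ch.upper() if upper else ch.lower()
        # Keep the byte unchanged when the mapping leaves Latin-1.
        if len(mapped) == 1 and ord(mapped) < 256:
            table[b] = ord(mapped)
    return bytes(table)


UPPER_TABLE = latin1_case_table(True)
LOWER_TABLE = latin1_case_table(False)

# UPPER() applied byte by byte: the ASCII part is converted, the
# two-byte "é" is left alone, so the result is wrong.
upper_result = "café".encode("utf-8").translate(UPPER_TABLE)
print(upper_result.decode("utf-8"))

# LOWER() applied byte by byte: the lead byte 0xC3 is itself an uppercase
# Latin-1 letter, so it gets changed and the sequence is no longer valid.
lower_result = "CAFÉ".encode("utf-8").translate(LOWER_TABLE)
try:
    lower_result.decode("utf-8")
    lower_is_valid = True
except UnicodeDecodeError:
    lower_is_valid = False
print(lower_is_valid)
```

So depending on the direction you either get a silently wrong answer or
bytes that are no longer utf8 at all.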
> >of course, collating and therefore ORDER BY won't
> >work correctly either, and the sizes the database knows about will all
> >be in bytes instead of characters. Bothersome but not insurmountable. :)
> >
> I assume you mean pull the data into Perl, have the data correctly
> flagged as UTF-8 there, and doing things like sorting in the Perl code?
>
> I could live with that, but the UPPER()/LOWER() issue is more of a
> problem. I make a lot of use of them and it's not so easy to work around
> in the Perl.
You're right, but there's not much we can do about it until the database
supports utf8 natively.
Out of curiosity, where do you make use of this? Case-insensitive lookups
that preserve the original case?
Note that ORDER BY and UPPER()/LOWER() will continue to work on the subset
of your strings that happen to be pure ASCII, even if some of your data is
not, since ASCII characters are encoded as the same single bytes in utf8.
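
A small sketch of the ordering and size points, in Python for brevity. It
assumes the database compares the column as raw bytes, and rough_sort_key
is a made-up, deliberately crude stand-in for real collation (it just
strips accents and folds case); a proper solution would use a real
collation library on the client side.

```python
import unicodedata


def rough_sort_key(s: str) -> str:
    """Crude client-side sort key: drop combining marks, then fold case."""
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


# Illustrative rows only.
rows = ["banana", "Apple", "zebra", "cherry", "éclair"]
encoded = [r.encode("utf-8") for r in rows]

# What a byte-wise ORDER BY would hand back: non-ASCII sorts after "z".
byte_order = [b.decode("utf-8") for b in sorted(encoded)]

# Sorting after decoding on the client puts "éclair" next to the other e's.
client_order = sorted(rows, key=rough_sort_key)

# The size the database sees is in bytes, not characters.
byte_len = len("éclair".encode("utf-8"))
char_len = len("éclair")

# For pure-ASCII strings the byte-wise and character-wise results agree.
ascii_agrees = "Apple".encode("utf-8").upper() == "Apple".upper().encode("utf-8")

print(byte_order)
print(client_order)
print(byte_len, char_len, ascii_agrees)
```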
--
Gaal Yahas <gaal@stripped>
http://gaal.livejournal.com/