From:
Date: June 11 2004 11:41am
Subject: Re: [PATCH] Re: blessing db data as utf8
List-Archive: http://lists.mysql.com/perl/3020
Message-Id: <20040611094130.GP17923@sike.forum2.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

On Fri, Jun 11, 2004 at 09:14:21AM +0100, Steve Hay wrote:
> I was concerned that searching for some sequence of bytes that make up a
> UTF-8 character might accidentally match in the wrong place, like the
> last byte of one character and the first byte of another. Are you
> implying that this can't ever happen because of how UTF-8 works? I've
> never really looked into the detail of the UTF-8 coding; I've just used
> interfaces that manipulate it and took the view that the internals don't
> really interest me (and shouldn't do, if I'm doing things properly).
>
> If it's true, then it certainly does alleviate some of the pain.

Yes, utf8 is self-synchronizing. If both the needle and the haystack
are utf8, you don't get false positives even with non-utf8-aware
strcmp-like code.

> What about things like UPPER() and LOWER(), though? Presumably they're
> not going to work because they'll operate on bytes and completely screw
> everything up?

True, that will not work.

> >of course, collating and therefore ORDER BY won't
> >work correctly either, and the sizes the database knows about will all
> >be in bytes instead of characters. Bothersome but not insurmountable. :)
> >
> I assume you mean pull the data into Perl, have the data correctly
> flagged as UTF-8 there, and doing things like sorting in the Perl code?
>
> I could live with that, but the UPPER()/LOWER() issue is more of a
> problem. I make a lot of use of them and it's not so easy to workaround
> in the Perl.

You're right, but there's not much we can do about it until the
database supports utf8 natively. Out of curiosity, where do you make
use of this? Case-insensitive lookups that preserve the original case?
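
To make the self-synchronizing point concrete, here is a minimal sketch
in Python (not Perl, and not anything from DBI or the patch under
discussion). The helper names byte_find_all and on_char_boundary are made
up for illustration. It searches raw bytes with no decoding at all and
then checks that every hit starts on a character boundary.

```python
def byte_find_all(haystack: bytes, needle: bytes) -> list:
    """Return every offset where needle occurs, comparing raw bytes only."""
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


def on_char_boundary(data: bytes, offset: int) -> bool:
    """True if offset does not fall inside a multi-byte UTF-8 sequence."""
    # Continuation bytes always look like 10xxxxxx, so a character can
    # only begin on a byte that does not match that pattern.
    return offset == len(data) or (data[offset] & 0xC0) != 0x80


# Hypothetical sample data, written with escapes to keep this text ASCII.
haystack = "na\u00efve caf\u00e9".encode("utf-8")
needle = "\u00e9".encode("utf-8")            # two bytes: C3 A9

hits = byte_find_all(haystack, needle)

# i-diaeresis (C3 AF) shares its lead byte with e-acute (C3 A9), but the
# continuation byte differs, so a byte search does not confuse the two.
no_hits = byte_find_all("\u00ef".encode("utf-8"), needle)
```

The offsets in hits are byte offsets, not character offsets, which is
the same "sizes in bytes" caveat mentioned in the quoted text.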

Note that ORDER BY and UPPER()/LOWER() will continue to work on the
subset of your strings that happen to be ASCII, even if some of your
data is not.

-- 
Gaal Yahas
http://gaal.livejournal.com/
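
P.S. A rough Python sketch of why the ASCII subset survives while other
data does not. It assumes the server lowercases one byte at a time using
Latin-1 rules. That is a guess made for illustration, not a statement
about what any particular MySQL version does, and latin1_lower is a
made-up stand-in, not a real database function.

```python
def latin1_lower(data: bytes) -> bytes:
    """Lowercase bytes as if each byte were a single Latin-1 character."""
    # Assumption: this imitates a byte-oriented LOWER() on a latin1 column.
    return data.decode("latin-1").lower().encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_only = "HELLO".encode("utf-8")
mixed = "CAF\u00c9".encode("utf-8")       # ends with bytes C3 89

# Pure ASCII: every byte is a whole character, so the result is correct.
ascii_result = latin1_lower(ascii_only)

# Non-ASCII: the lead byte C3 is treated as a Latin-1 capital letter and
# becomes E3, which no longer forms a valid UTF-8 sequence with 89.
mixed_result = latin1_lower(mixed)
```

Here ascii_result is still valid UTF-8, while mixed_result is not, so
the damage is limited to the rows that actually contain non-ASCII text.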