From:
Date: June 10 2004 8:49pm
Subject: Re: blessing db data as utf8
List-Archive: http://lists.mysql.com/perl/3013
Message-Id: <20040610184957.GO17923@sike.forum2.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

[I hope nobody minds that I'm moving this thread to the DBD::mysql list,
because it seems like the best place for it. Please drop cdbi-talk from
replies.]

On Thu, Jun 10, 2004 at 07:01:30PM +0100, Tim Bunce wrote:
> On Thu, Jun 10, 2004 at 12:18:42PM +0300, Gaal Yahas wrote:
> > On Thu, Jun 10, 2004 at 09:51:06AM +0100, Tim Bunce wrote:
> > > This isn't a good way to check for utf8:
> > >
> > > +int is_high_bit_set(char *val) {
> > > +    while (*val++)
> > > +        if (*val & 0x80) return 1;
> > > +    return 0;
> > > +}
> > >
> > > because it makes it hard for any latin-1 data to coexist.
> > > The perl guts probably has a function to check for well-formed utf8
> > > and that should be used instead.
> >
> > This function is only used as an optimization. The actual decision is here:
> >
> > +    if (imp_dbh->enable_utf8 &&
> > +        is_high_bit_set(col) && is_utf8_string(col, len))
> > +        SvUTF8_on(sv);
>
> Ah, okay.
>
> > That said, bad things are going to happen sooner or later if a table has
> > both latin-1 and utf8 data.
>
> I'm thinking more about different fields having either latin-1 or utf8 data.
>
> > But now that I think of it, I'm not sure the call to is_high_bit_set is
> > a good idea there, since SvUTF8_on() on a pure (7 bit) ASCII string
> > shouldn't do any harm
>
> It does add overhead (and is actually harmful on 5.6.x where many
> utf8 bugs lurk) so the check is worthwhile.
>
> > and may even be more correct if the string is later concatenated
> > with utf8 data.
>
> No, perl will do-the-right-thing.

So all in all it sounds like this patch is simple, but correct?
Steve Hay mentioned another similar patch had been written but didn't
reach CPAN; I'd like to encourage the maintainers to put in either
version :-)

> > I'm not sure what the cleanest way would be to go about this in the
> > long run (whose responsibility it is to say what is and what isn't
> > utf8) but the patch addresses an immediate need for people with
> > utf8-only data. Maybe this problem would go away in mysql 4.1; I'd
> > prefer not to wait.
>
> Something along these lines is needed. But it does require careful thought.

Perhaps the application, or Class::DBI::mysql (which already has some
provisions for similar things), should be responsible for keeping track
of which fields are in which charset, with no policy (except a default
one) being enforced on the DBD level. In this scheme the current approach
becomes part of the default handling, so it still makes sense to put it
in now.

-- 
Gaal Yahas
http://gaal.livejournal.com/
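
For illustration, the decision discussed in this thread (flag a column
value as utf8 only if the option is enabled, the value has a high-bit
byte, and it is well-formed utf8) can be sketched as standalone C. This
is a sketch under assumptions, not the patch itself: has_high_bit(),
looks_like_utf8() and should_flag_utf8() are hypothetical names, and
looks_like_utf8() is only a stand-in for perl's is_utf8_string() that
may not accept exactly the same byte sequences. The real code would call
SvUTF8_on(sv) instead of returning a flag. Note also that the quoted
loop increments val before testing it, so it never looks at the first
byte; the version here tests every byte in turn.

```c
#include <stddef.h>

/* Cheap pre-check: returns 1 if any byte in buf[0..len) has its high
 * bit set, 0 for pure 7-bit ASCII. Every byte is tested, including
 * the first one. */
int has_high_bit(const unsigned char *buf, size_t len) {
    size_t i;
    for (i = 0; i < len; i++)
        if (buf[i] & 0x80) return 1;
    return 0;
}

/* Hypothetical stand-in for perl's is_utf8_string(): returns 1 if
 * s[0..len) is well-formed UTF-8 (no overlong forms, no surrogates,
 * nothing above U+10FFFF), else 0. */
int looks_like_utf8(const unsigned char *s, size_t len) {
    size_t i = 0, k, need;
    while (i < len) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* range of 1st continuation */
        if (c < 0x80) { i++; continue; }    /* plain ASCII byte */
        if (c >= 0xC2 && c <= 0xDF) {
            need = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2;
            if (c == 0xE0) lo = 0xA0;       /* reject overlong forms */
            if (c == 0xED) hi = 0x9F;       /* reject surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3;
            if (c == 0xF0) lo = 0x90;       /* reject overlong forms */
            if (c == 0xF4) hi = 0x8F;       /* reject > U+10FFFF */
        } else {
            return 0;                       /* invalid lead byte */
        }
        if (len - i - 1 < need) return 0;   /* truncated sequence */
        if (s[i + 1] < lo || s[i + 1] > hi) return 0;
        for (k = 2; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += need + 1;
    }
    return 1;
}

/* Mirrors the shape of the quoted condition: the option must be on,
 * the cheap check must pass, and the bytes must be well-formed. */
int should_flag_utf8(int enable_utf8, const unsigned char *col, size_t len) {
    return enable_utf8 && has_high_bit(col, len) && looks_like_utf8(col, len);
}
```

With this sketch a latin-1 value such as "caf\xE9" is left alone, since
0xE9 would have to start a three-byte sequence that is not there, while
the utf8 encoding "caf\xC3\xA9" is flagged. Pure ASCII is never flagged,
which is the overhead-saving behaviour Tim argues for above.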