List: General Discussion
From: Paul DuBois
Date: January 21 2010 8:52pm
Subject: Re: REGEXP and unicode weirdness

On Jan 21, 2010, at 9:27 AM, John Campbell wrote:

> I want to find rows that contain a word that matches a term, accent
> insensitive.  I am using the utf8_general_ci collation everywhere.
> 
> attempt 1:
> SELECT * FROM t WHERE txt LIKE '%que%'
> Matches que and qué, but also matches 'queue'.
> 
> attempt 1.5:
> SELECT * FROM t WHERE txt LIKE '% que %' OR txt LIKE 'que %' OR txt LIKE '% que';
> Almost, but misses "que!" or "que..."
> 
> attempt 2:
> SELECT * FROM t WHERE txt REGEXP '[[:<:]]que[[:>:]]'
> Matches que, not queue, but doesn't match qué.
> 
> attempt 3:
> SELECT * FROM t WHERE txt REGEXP
> '[[:<:]]q[uùúûüũūŭůűųǔǖǘǚǜ][eèéêëēĕėęě][[:>:]]'
> Matches que, queue, qué.  (I have no idea why this matches queue, but
> the Regex behavior is bizarre with unicode.)
> 
> Does anyone know why the final regex acts weird?  Is there a good solution?


http://dev.mysql.com/doc/refman/5.1/en/regexp.html:

Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe
and may produce unexpected results with multi-byte character sets. In addition, these
operators compare characters by their byte values and accented characters may not compare
as equal even if a given collation treats them as equal.
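
To illustrate what "byte-wise" means in practice, here is a minimal sketch in
Python. It is not the regex library MySQL uses; it only emulates a
byte-oriented matcher by encoding both the pattern and the text as UTF-8
before matching. The helper names bytewise_search and charwise_search are
made up for this example, and the sketch does not attempt to reproduce the
exact 'queue' result reported above.

```python
import re


def bytewise_search(pattern: str, text: str) -> bool:
    """Match on UTF-8 bytes, the way a byte-oriented regex engine would."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None


def charwise_search(pattern: str, text: str) -> bool:
    """Match on Unicode characters, for comparison."""
    return re.search(pattern, text) is not None


E_ACUTE = "\u00e9"  # e with acute accent, two bytes in UTF-8: 0xC3 0xA9
A_GRAVE = "\u00e0"  # a with grave accent, two bytes in UTF-8: 0xC3 0xA0

# 1. A bracket expression becomes a set of single bytes, so one "character"
#    slot in the pattern cannot consume a whole two-byte character.
pattern = "^qu[e" + E_ACUTE + "]$"
print(charwise_search(pattern, "qu" + E_ACUTE))   # True
print(bytewise_search(pattern, "qu" + E_ACUTE))   # False

# 2. The same set matches unrelated characters that merely share a byte
#    (here the lead byte 0xC3).
pattern = "qu[" + E_ACUTE + "]"
print(charwise_search(pattern, "qu" + A_GRAVE))   # False
print(bytewise_search(pattern, "qu" + A_GRAVE))   # True

# 3. Plain and accented letters are different byte sequences, so they never
#    compare equal, whatever a collation would say.
print(bytewise_search("que", "qu" + E_ACUTE))     # False
```

Cases 1 and 2 correspond to the "not multi-byte safe" part of the warning,
and case 3 to the part about accented characters not comparing as equal.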

-- 
Paul DuBois
Sun Microsystems / MySQL Documentation Team
Madison, Wisconsin, USA
www.mysql.com
