List:General Discussion« Previous MessageNext Message »
From:hsv Date:June 4 2013 1:30pm
Subject:Re: string-likeness
View as plain text  
>>>> 2013/06/03 18:38 +0200, Hartmut Holzgraefe >>>>
equality checks have a linear cost of O(min(len1,len2)) and can make
use of indexes, too, while Levenshtein cost is is almost quadratic
O(len1*len2) and can't make any good use of indexes ... even using
a C UDF would help only so far with this kind of complexity. It will
increase performance by a constant factor, but given long enough
input strings the len1*len2 factor will still account for the majority
of the run time increase over simple equality comparions
My set isn't that big (not the hundreds of thousands to which many on this list refer),
only big enough to be a pain, and here the constant, between implementing in interpreted
SQL with no array, only temporary table, and compiled C, with real array, probably
matters--except that my C-implementation won't happen.

there are a few possible points of optimization though, first of all
you can cut off equal start and end sequences (linear complexity for
that part instead of quadratic). You can also add a few more tricks
if you are only interested in matches below a certain distance threshold:

* if string lengths differ by more than the threshold value you can
  rule out this pair of strings as being "similar" right away

* while iterating over the distance array keep track of the min.
  distance value of the current row ... if at the end of a row
  is larger than the threshold distance you can terminate right away
Didn't think of these ... will have to find a threshold

* only calculate operation cost, not operation type

* do not maintain a full len1*len2 array, having only the previous
  and current row in two one dimensional arrays is sufficient
  (this esp. helps in C implementation as the functions working set
  is more likely to fit into CPU caches) 
I already do this, because MySQL has no arrays, and I use a small temporary table instead
of one linear array.

string-likenesshsv3 Jun
  • Re: string-likenessJohan De Meersman3 Jun
  • Re: string-likenessHartmut Holzgraefe3 Jun
    • Re: string-likenesshsv4 Jun
  • RE: string-likenessRick James3 Jun
    • Re: string-likenessMichael Dykman3 Jun
    • RE: string-likenesshsv6 Jun