List:General Discussion« Previous MessageNext Message »
From:hsv Date:June 4 2013 1:30pm
Subject:Re: string-likeness
View as plain text  
>>>> 2013/06/03 18:38 +0200, Hartmut Holzgraefe >>>>
equality checks have a linear cost of O(min(len1,len2)) and can make
use of indexes, too, while Levenshtein cost is is almost quadratic
O(len1*len2) and can't make any good use of indexes ... even using
a C UDF would help only so far with this kind of complexity. It will
increase performance by a constant factor, but given long enough
input strings the len1*len2 factor will still account for the majority
of the run time increase over simple equality comparions
<<<<<<<<
My set isn't that big (not the hundreds of thousands to which many on this list refer),
only big enough to be a pain, and here the constant, between implementing in interpreted
SQL with no array, only temporary table, and compiled C, with real array, probably
matters--except that my C-implementation won't happen.

>>>>>>>>
there are a few possible points of optimization though, first of all
you can cut off equal start and end sequences (linear complexity for
that part instead of quadratic). You can also add a few more tricks
if you are only interested in matches below a certain distance threshold:

* if string lengths differ by more than the threshold value you can
  rule out this pair of strings as being "similar" right away

* while iterating over the distance array keep track of the min.
  distance value of the current row ... if at the end of a row
  is larger than the threshold distance you can terminate right away
<<<<<<<<
Didn't think of these ... will have to find a threshold

>>>>>>>>
* only calculate operation cost, not operation type

* do not maintain a full len1*len2 array, having only the previous
  and current row in two one dimensional arrays is sufficient
  (this esp. helps in C implementation as the functions working set
  is more likely to fit into CPU caches) 
<<<<<<<<
I already do this, because MySQL has no arrays, and I use a small temporary table instead
of one linear array.

Thread
string-likenesshsv3 Jun
  • Re: string-likenessJohan De Meersman3 Jun
  • Re: string-likenessHartmut Holzgraefe3 Jun
    • Re: string-likenesshsv4 Jun
  • RE: string-likenessRick James3 Jun
    • Re: string-likenessMichael Dykman3 Jun
    • RE: string-likenesshsv6 Jun