>>>> 2013/06/03 21:43 +0000, Rick James >>>>
Soundex is the 'right' approach, but it needs improvement. So, find an improvement, then
do something like this...
Hashing involves somekind normalizing, and in my case I see no means to it; otherwise I
would not have considered something so costly. On the other hand, maybe ....
I am comparing lists of place-names, and I want to match, say, any of "Mount Saint
Francis" or "MT ST FRANCIS" or "MOUNT ST FRANCIS" or "MT SAINT FRANCIS"--but it is not all
standard abbreviations. Sometimes there is "Galvestn" or "Galvston" or "Galvstn" for
"Galveston", and it is not always vowel-letter deletion, either: "Ft Benj Harrison", "FT
BENJAMIN HARRISON", "Ft Benj Harsn"; "CLVR MIL ACAD", "Culver Milt Acad".
Anyhow, I gave up on a perfect solution, and instead added to each name the name padded
with '%'s. On joining the longer name is used, but instead of the shorter the padded is
used after "LIKE", if "LOCATE" also fails to match, and overall the Levenstein edit
distance is used only for a check, with short-circuit "AND" and "OR" supposed (and the
timing is such that I believe it is):
ON (LOCATE(Bookk.Burgh, PO.Burgh) > 0 OR LOCATE(PO.Burgh, Bookk.Burgh) > 0 OR
CHAR_LENGTH(Bookk.Burgh) > CHAR_LENGTH(PO.Burgh) AND Bookk.Burgh LIKE PO.pBurgh OR
CHAR_LENGTH(Bookk.Burgh) < CHAR_LENGTH(PO.Burgh) AND PO.Burgh LIKE Bookk.pBurgh)
AND mismatch(Bookk.Burgh, PO.Burgh, 1, 2, 1) < 8 IS NOT FALSE
It does not match "MOUNT ST FRANCIS" and "MT SAINT FRANCIS".
At least for LOCATE and LIKE there are linear-time algorithms.
All along I assumed that in the end some of the mismatching will be handled by hand. It is
not that big a list, but doing all by hand is far too much.