List:General Discussion« Previous MessageNext Message »
From:Philip Mather Date:February 3 2007 8:21pm
Subject:Re: Fulltext relevance and weighting....
View as plain text  
    I doubt I'd qualify as an expert but here's my two pence worth ( ;^) 
)I wrote a search engine a while back that relies heavily on full-text 
searching and the three things I found that improved results were...

1) Precisely what Dan explains, doing extra biasing per field in the SQL 
with whatever extra info you've got is best. I had fields like 
"seriesFK" that were null if it was a one off TV episode so I used the 
result of (isnull(seriesFK) + 1) * {some multiplier}) and similar to 
shove a whole TV Series above it's episodes. Perhaps you'd want to bias 
the small_desc field more than the large_desc etc...

2) Adding/removing stop words to the |ft_stopword_file| and changing the 
word length as you've already done.

3) Fiddling with the search requests before they get put into the SQL, I 
was dealing with various people's names as well so and I knew they were 
so I've got a bit of PHP that doubles two word queries up but with one 
swapped around, so if the user enters "George Harrison" they'd actually 
get a search for "George Harrison" and "Harrison, George".

Some of those you've done or are domain specific (i.e. the names), 
sorry. ||Some other ideas I didn't get to try out were...

1) You can use the myisam_ftdump utility to dump the actual weightings 
MySQL generates, load those stats back in and use it to generate a new 
"meta" table then use that in combination with feedback about requested 
search terms and followed links to make an engine that could to some 
extent "learn".

2) the soundex() function would allow you to handle spelling mistakes 
and might be of some use in your problem to bias those results that 
sound most like the term? See and search 
for sondex.

The problem I found is that the clients (the people who are paying for 
the search engine) knew, as expert in their subject what exact order 
they'd like the results to come back in but you hit the problem that you 
can't readily program that knowledge into a computer no matter how hard 
you try or how fancy your algorithms get.  I can't tell you the pain 
"Charlie Chaplin" caused me, his real name is actually Charles but 
without lots of extra contextual information to hand you can't program 
that sort of knowledge and exception into a computer without spending 
lots of time or money.

I wrote the search engine bit inside and they were happy with it once 
we'd finished but I think this sort of area is somewhere you could spend 
a lot of effort making little difference on small or medium sized 
projects, there is a certain amount of "tweaking" that can be done but 
eventually you'd need to move to a solution that is significantly more 
technologically complex i.e. expensive & time consuming before achieve 
noticeably better results.

If you want I'd be happy to outline and expand on some ideas.


>> It is more of an issue to prioritizing fields for relevance, and whether it
>> is possible to do this within a fulltext query, or whether it needs to be
>> done through multiple queries, and then "outside" php processing of those
>> query results....
> You should be able to do what you need by making your 'score'
> expression something like this:
> select *, 
>  match(code) against ('ham*' in boolean mode) * 8 +
>  match(name) against ('ham*' in boolean mode) * 4 +
>  match(small_desc) against ('ham*' in boolean mode) * 2 +
>  match(large_desc) against ('ham*' in boolean mode)
>  as score from products where active='y' and site like '%,1,%' and 
>  match(code,name,small_desc,large_desc) against ('ham*' IN BOOLEAN MODE)
>  order by score desc
> This takes advantage of the fact that boolean mode matches always
> return 1 or 0, so a record matching in the "code" field will sort
> higher than a record with "ham" in all 3 of the others but not in
> "code".
>>>> Does anyone have any suggestions on how to solve the result
>>>> weighting problem? I have a client whose search results are
>>>> becoming more and more important, and the relevance demands on the
>>>> results are not entirely satisfactory...
>>>> The fields that are searched are code, name, small description and
>>>> large description, and are ranked in relevance in that order.
>>>> For example, a product with the name: "Bone-In Serrano Ham" should
>>>> ALWAYS outweigh the product with the name of "Boneless Jamon
>>>> Iberico", even if the Jamon Iberico has the word "ham" in the
>>>> description 20 times more than the Serrano product...
>>>> The query that is being run is: select
>>>> *,match(code,name,small_desc,large_desc) against ('ham*') as score
>>>> from products where active='y' and site like '%,1,%' and
>>>> match(code,name,small_desc,large_desc) against ('ham*' IN BOOLEAN
>>>> MODE) order by score desc
>>>> It returns some good relevant matches, but then in the middle of
>>>> products names with "ham" in them, it returns one without....
>>>> Does this require a complete logic switch, or is there a way to
>>>> build a query to do this?
>>>> Obviously the actual build of the query is more complex, and there
>>>> are other rules that need to be applied to the user submitted
>>>> query, but this is the basics...
>>>> If there is a Fulltext search relevance expert out there in list
>>>> land, I am at my wits end trying to make the results the most
>>>> relevant that they can be - I am willing to work closely with
>>>> (pay) someone with the knowledge and expertise to assist in this. 
>>>> (using PHP
Fulltext relevance and weighting....Mike Morton31 Jan
  • Re: Fulltext relevance and weighting....mos31 Jan
    • Re: Fulltext relevance and weighting....Mike Morton31 Jan
      • Re: Fulltext relevance and weighting....Dan Nelson31 Jan
        • Re: Fulltext relevance and weighting....Philip Mather3 Feb