List:General Discussion« Previous MessageNext Message »
From:Steve Edberg Date:April 12 2006 8:35am
Subject:RE: counting keywords
View as plain text  
At 1:55 PM +1000 4/12/06, Taco Fleur wrote:
>Hello Steve,
>
>Your suggestion works like a charm, I am now trying to get my head around
>your following statement, I can't seem to get a grip on what you mean. Is
>there anyway you could elaborate a little on the following? It would be
>really nice if I could get the solution to work with full text and be able
>to use stemming etc.
>
>
>>  where document_id is a foreign key pointing at the table containing your
>fulltext. This would be easier to extend to handle synonym handling too, and
>you could do all the suffix handling/stemming you need (eg; to take care of
>plurals). I've done something like that as well, and included an extra field
>for the metaphone version of the word, to match approximate spellings.


Assume table structures something like this:

    WordTable:
    word        char(32) not null         # or whatever your max word 
length is likely to be
    word_count  integer unsigned not null
    document_id integer unsigned not null

    DocTable:
    document_id integer unsigned not null auto_increment primary key
    doc_author
    doc_date
    ...
    doc_body    text

In your original post, you needed to get exact counts of words 
appearing in the document body; MySQL's full text search can't do 
that. What you could do in this case is some preprocessing on the 
doc_body when you insert it. You would scan the text, ignoring the 
insignificant words (eg; and, the, a, is, and so on), and then 
transform the remaining words into a canonical form (eg; glasses, 
glassy, glass all become glass), and then insert into the WordTable. 
Then, when a search is performed, you translate the search terms 
using the same algorithm and search the WordTable. You might also 
have a SynonymTable that you could use to translate all synonyms to a 
standard term before insertion into the WordTable and before 
searching.

There are various stemming algorithms around; the Porter Algorithm 
was one of the earlier ones, and the one I have worked with some:

    http://www.tartarus.org/martin/PorterStemmer/

It worked pretty well, but I needed to maintain an exception list for 
some words it incorrectly translated. Also, you have to decide 
whether similar terms like 'anthropologist' and 'anthropology' are 
identical as far as your search is concerned.

The Snowball or Porter2 algorithm is apparently an improvement on the original:

    http://snowball.tartarus.org/algorithms/english/stemmer.html

A google for 'stemming algorithms' also turned up this:

    http://www.comp.lancs.ac.uk/computing/research/stemming/

And if you're dealing with non-English words, then you'll have to 
look for native language stemmers or modify the rules in the above 
algorithms.

If you had a table set up as above, you could do either searches 
based on the word table (where you could return documents sorted by 
the number of times the search words appeared), or using a fulltext 
search on the doc_body. I would imagine that for many documents the 
order of results would be similar. The fulltext algorithm also 
weights words more heavily based on their uniqueness.

If you want to do searching based on approximate spellings, you could 
add an additional column to WordTable:

    word_approx    char(32) not null

and store either the soundex version of the word -

    http://en.wikipedia.org/wiki/Soundex

- or the more accurate (in my experience) metaphone algorithm -

    http://en.wikipedia.org/wiki/Metaphone

Then the search terms would be run through two transforms: a stemmer, 
and a 'metaphoner'.

Lastly, it appears that MySQL 5.1 has a new plug-in API -

    http://dev.mysql.com/doc/refman/5.1/en/plugin-api.html

- which includes the ability to modify/replace fulltext parser 
behavior. It looks like you might be able to create custom functions 
to do most or all of the above using user-defined functions, 
presumably with relatively high efficiency.

	steve

-- 
+--------------- my people are the people of the dessert, ---------------+
| Steve Edberg                                http://pgfsun.ucdavis.edu/ |
| UC Davis Genome Center                            sbedberg@stripped |
| Bioinformatics programming/database/sysadmin             (530)754-9127 |
+---------------- said t e lawrence, picking up his fork ----------------+
Thread
counting keywordsTaco Fleur10 Apr
  • Re: counting keywordsSteve Edberg11 Apr
RE: counting keywordsTaco Fleur11 Apr
  • RE: counting keywordsSteve Edberg11 Apr
RE: counting keywordsTaco Fleur12 Apr
  • RE: counting keywordsSteve Edberg12 Apr