List:General Discussion« Previous MessageNext Message »
From:Steve Edberg Date:April 10 2006 11:49pm
Subject:Re: counting keywords
View as plain text  
At 7:37 AM +1000 4/11/06, Taco Fleur wrote:
>Hi all,
>
>I am trying to find out whether it is possible to return the count of
>keywords found in a text field, does anyone know?
>For example; "ColdFusion or Java" is entered in the search string and 20
>records are found that match, I need to count how many times "ColdFusion"
>and "Java" appears in each match, add those two and than sort descending on
>that total.
>
>I was actually using verity for the search, but the client insists he sees
>the number of keywords found, which Verity does not do.
>
>The text searched are Résumé's, I initially thought that MySQL could
> search
>the résumé's when stored as binary data, but I was wrong, so I now have
> the
>CV's converted to HTML and then store them in the DB as VARCHAR
>
>Any help would be much appreciated, I am having a hard time coming from a MS
>SQL background ;-)


There's no function that I know of to do that directly; however, you 
could do something like this:

    select
       (length(your_text_field)-length(replace(your_text_field, 
'coldfusion','')))/length('coldfusion')
       as wordcount
    from
       your_table

That is, it removes all instances of 'coldfusion' from  your string, 
gets the difference in length from the unaltered string, and divides 
that by the number of characters in your search string. I've used 
this method several times.

Of course, if you want to avoid matching against terms like 
'javalike' or 'coldfusionista' then you've gotta do some additional 
checking, for example:

    select
       (length(your_text_field)-length(replace(concat(' 
',your_text_field,' '), ' coldfusion ','')))/length(' coldfusion ')
       as wordcount
    from
       your_table

See

    http://dev.mysql.com/doc/refman/4.1/en/string-functions.html

for more info.

If you're doing this alot, it might be more efficient to build a word 
index table like:

    word        char(32) not null         # or whatever your max word 
length is likely to be
    word_count  integer unsigned not null
    document_id integer unsigned not null

where document_id is a foreign key pointing at the table containing 
your fulltext. This would be easier to extend to handle synonym 
handling too, and you could do all the suffix handling/stemming you 
need (eg; to take care of plurals). I've done something like that as 
well, and included an extra field for the metaphone version of the 
word, to match approximate spellings. If the text fields were all in 
plain text, you could even include character positions like

    word          char(32) not null
    word_position integer unsigned not null
    document_id   integer unsigned not null

then you could get word counts by doing a select count(word).


	steve


>  Kind regards,
>
>
>Taco Fleur
>
>Free Call 1800 032 982 or Mobile 0421 851 786
>Pacific Fox  <http://www.pacificfox.com.au/> http://www.pacificfox.com.au an
>industry leader with commercial IT experience since 1994 …
>
>*
>
>	Web Design and Development
>*
>
>	SMS Solutions, including developer API
>*
>
>	Domain Registration, .COM for as low as fifteen dollars a year,
>.COM.AU for fifty dollars two years!
>
>


-- 
+--------------- my people are the people of the dessert, ---------------+
| Steve Edberg                                http://pgfsun.ucdavis.edu/ |
| UC Davis Genome Center                            sbedberg@stripped |
| Bioinformatics programming/database/sysadmin             (530)754-9127 |
+---------------- said t e lawrence, picking up his fork ----------------+
Thread
counting keywordsTaco Fleur10 Apr
  • Re: counting keywordsSteve Edberg11 Apr
RE: counting keywordsTaco Fleur11 Apr
  • RE: counting keywordsSteve Edberg11 Apr
RE: counting keywordsTaco Fleur12 Apr
  • RE: counting keywordsSteve Edberg12 Apr