List:General Discussion« Previous MessageNext Message »
From:Steve Edberg Date:April 11 2006 7:23am
Subject:RE: counting keywords
View as plain text  
At 1:35 PM +1000 4/11/06, Taco Fleur wrote:
>Thanks Steve,
>
>Much appreciated, I was hoping there was something a little simpler, but I
>will have a go at it.
>Anyway of doing this with RegEx, would that simplify things?


Hi, and you're welcome -

Unfortunately, I don't think this can be done with regex/rlike; those 
only give a boolean result (pattern matched/not matched), but can't 
as far as I know be used for counting/replacing strings. This doesn't 
appear to have changed even in MySQL 5.1.

	steve

>
>-----Original Message-----
>From: Steve Edberg [mailto:sbedberg@stripped]
>Sent: Tuesday, 11 April 2006 9:50 AM
>To: taco.fleur@stripped; mysql@stripped
>Subject: Re: counting keywords
>
>At 7:37 AM +1000 4/11/06, Taco Fleur wrote:
>>Hi all,
>>
>>I am trying to find out whether it is possible to return the count of
>>keywords found in a text field, does anyone know?
>>For example; "ColdFusion or Java" is entered in the search string and
>>20 records are found that match, I need to count how many times
>"ColdFusion"
>>and "Java" appears in each match, add those two and than sort
>>descending on that total.
>>
>>I was actually using verity for the search, but the client insists he
>>sees the number of keywords found, which Verity does not do.
>>
>>The text searched are Résumé's, I initially thought that MySQL
> could
>>search the résumé's when stored as binary data, but I was wrong, so
> I
>>now have the CV's converted to HTML and then store them in the DB as
>>VARCHAR
>>
>>Any help would be much appreciated, I am having a hard time coming from
>>a MS SQL background ;-)
>
>
>There's no function that I know of to do that directly; however, you could
>do something like this:
>
>     select
>        (length(your_text_field)-length(replace(your_text_field,
>'coldfusion','')))/length('coldfusion')
>        as wordcount
>     from
>        your_table
>
>That is, it removes all instances of 'coldfusion' from  your string, gets
>the difference in length from the unaltered string, and divides that by the
>number of characters in your search string. I've used this method several
>times.
>
>Of course, if you want to avoid matching against terms like 'javalike' or
>'coldfusionista' then you've gotta do some additional checking, for example:
>
>     select
>        (length(your_text_field)-length(replace(concat('
>',your_text_field,' '), ' coldfusion ','')))/length(' coldfusion ')
>        as wordcount
>     from
>        your_table
>
>See
>
>     http://dev.mysql.com/doc/refman/4.1/en/string-functions.html
>
>for more info.
>
>If you're doing this alot, it might be more efficient to build a word index
>table like:
>
>     word        char(32) not null         # or whatever your max word
>length is likely to be
>     word_count  integer unsigned not null
>     document_id integer unsigned not null
>
>where document_id is a foreign key pointing at the table containing your
>fulltext. This would be easier to extend to handle synonym handling too, and
>you could do all the suffix handling/stemming you need (eg; to take care of
>plurals). I've done something like that as well, and included an extra field
>for the metaphone version of the word, to match approximate spellings. If
>the text fields were all in plain text, you could even include character
>positions like
>
>     word          char(32) not null
>     word_position integer unsigned not null
>     document_id   integer unsigned not null
>
>then you could get word counts by doing a select count(word).
>
>
>	steve
>
>
>>   Kind regards,
>>
>>
>>Taco Fleur
>>
>>Free Call 1800 032 982 or Mobile 0421 851 786 Pacific Fox 
>><http://www.pacificfox.com.au/> http://www.pacificfox.com.au an
>>industry leader with commercial IT experience since 1994 …
>>
>>*
>>
>>	Web Design and Development
>>*
>>
>>	SMS Solutions, including developer API
>>*
>>
>>	Domain Registration, .COM for as low as fifteen dollars a year,
>>.COM.AU for fifty dollars two years!
>  >
>>


-- 
+--------------- my people are the people of the dessert, ---------------+
| Steve Edberg                                http://pgfsun.ucdavis.edu/ |
| UC Davis Genome Center                            sbedberg@stripped |
| Bioinformatics programming/database/sysadmin             (530)754-9127 |
+---------------- said t e lawrence, picking up his fork ----------------+
Thread
counting keywordsTaco Fleur10 Apr
  • Re: counting keywordsSteve Edberg11 Apr
RE: counting keywordsTaco Fleur11 Apr
  • RE: counting keywordsSteve Edberg11 Apr
RE: counting keywordsTaco Fleur12 Apr
  • RE: counting keywordsSteve Edberg12 Apr