Below is the list of changes that have just been committed into a local
5.0-fulltext repository of kostja. When kostja does a push these changes will
be propagated to the main repository and, within 24 hours after the
push, to the public repository.
For information on how to access the public repository
see http://dev.mysql.com/doc/mysql/en/installing-source-tree.html
ChangeSet
1.2017 05/10/25 22:56:31 konstantin@stripped +1 -0
- describe the relevancy evaluation formula of CNET_WEIGHT(a,b)
plugin/fulltext/cnet_weight.c
1.8 05/10/25 22:56:23 konstantin@stripped +60 -21
- describe the relevancy evaluation formula
# This is a BitKeeper patch. What follows are the unified diffs for the
# set of deltas contained in the patch. The rest of the patch, the part
# that BitKeeper cares about, is below these diffs.
# User: konstantin
# Host: dragonfly.local
# Root: /opt/local/work/mysql-5.0-cnet
--- 1.7/plugin/fulltext/cnet_weight.c 2005-10-25 22:05:27 +04:00
+++ 1.8/plugin/fulltext/cnet_weight.c 2005-10-25 22:56:23 +04:00
@@ -26,18 +26,18 @@
An introduction to UDFs in MySQL.
---------------------------------
- By means of a non-aggregate UDF a user can extend MySQL Server with an SQL
- level function. Such function can be used for query processing just like
- a built-in function, i.e. SUBSTRING, CONCAT, UPPER and so on.
-
- A non-aggregate UDF is defined by 3 callbacks which the server locates in
- a dynamic library when a UDF is installed. Each callback is invoked at
- a particular juncture of query processing:
+ MySQL Server can be extended with an SQL function with the help of
+ a non-aggregate UDF. Such a function can be used for query processing just
+ like a built-in one, e.g. SUBSTRING, CONCAT, UPPER and so on.
+
+ A non-aggregate UDF is defined by 3 callbacks, which are located by the
+ server in the dynamic library when the UDF is installed. Each callback
+ is invoked at a particular moment of query processing:
* the init function is called after the query has been parsed, but before
execution. This function is called once per query.
* the UDF body is invoked during query execution, possibly many times
- in case arguments of the UDF refer to non-constant expressions such
+ in case the arguments of the UDF refer to non-constant expressions such
as table columns
* the deinit function is called in the end of the query
@@ -47,8 +47,8 @@
CNET_WEIGHT(DOCUMENT, QUERY) -- a weighting UDF
-----------------------------------------------
- A weighting UDF `CNET_WEIGHT' is a function that demonstrates how
- relevance evaluation of fulltext can be implemented.
+ The weighting UDF `CNET_WEIGHT' is a function that demonstrates how
+ relevance evaluation for fulltext search can be implemented.
The function fulfills the following criteria:
* proximity between the matched words of a document is taken into
account: if the matched words stand closer to each other, the
@@ -63,9 +63,9 @@
Example:
CNET_WEIGHT("MySQL", "MySQL") > CNET_WEIGHT("MySQL", "mysql")
- The function accepts two arguments, a document and a search query
- respectively. Whereas a document can refer to a non-constant expression,
- such as a table column, a search query must be a constant.
+ The function accepts two arguments, a document and a search query.
+ Whereas the document can refer to a non-constant expression, such as
+ a table column, the search query must be a constant.
Example:
@@ -100,23 +100,62 @@
The architecture of CNET_WEIGHT.
--------------------------------
- In order to evaluate a relevance, the UDF parses the query and the
- document, and tries to find every word of the query in the document.
+ In order to evaluate the relevance, the UDF parses the query and the
+ document and then tries to find every word of the query in the document.
Parsing of the query is done once per query, in cnet_weight_init,
whereas the document (or if it corresponds to a table column, the record)
is parsed on every invocation of cnet_weight().
For parsing purposes an external function is used; at the moment it
- refers to the parsing function from CNET parsing plugin, which allows
- to achieve best correlation between relevance values and table contents:
+ is the parsing function from the CNET parsing plugin, which gives
+ the best correlation between relevance values and table
+ contents:
ALTER TABLE t1 ADD FULLTEXT KEY(a) WITH PARSER cnet_parser;
The signature of the plugin parser and its input buffers make it
- available for reuse in the UDF.
+ possible to reuse this function in the UDF implementation.
- The relevance calculation formula of CNET_WEIGHT
- --------------------------------------------------
+ The relevance calculation formula of CNET_WEIGHT.
+ -------------------------------------------------
+ For convenience, we'll refer to the search query and the document as
+ lists of words q_1 ... q_l and d_1 ... d_m respectively.
+ We need to introduce a few simple auxiliary functions to define
+ the final formula:
+
+ A start weight for every word in the search query
+ START_WEIGHT(q_i) shall be defined as follows:
+
+ -0.2 if a word matches a synonym from the synonyms list in
+ case-sensitive fashion
+ -0.1 if a word matches a synonym from the synonyms list in
+ case-insensitive fashion
+ 0.1 for any other word
+
+ Let us define function MATCH(q_i, d_j) as 1 if q_i matches d_j
+ (case-sensitively or case-insensitively), and as 0 otherwise.
+
+ Let us define function MATCH_WEIGHT(q_i, d_j) to:
+ 2 if the query word q_i matches the document word d_j in
+ case-sensitive fashion
+ 1 if the query word q_i matches the document word d_j in
+ case-insensitive fashion
+ 0 if there is no match
+
+ Let us define INDEX(d_j) as j.
+
+ Let us define function PROXIMITY(d_j) as INDEX(d_j) - INDEX(d_k)
+ where k is the index of the last word before d_j in the document for
+ which MATCH(q_i, d_k) = 1 for some q_i.
+
+ Using the auxiliary functions above, the relevance of the query to
+ a document will be calculated as follows:
+
+ RELEVANCE(q, d) = SUM [over every word in the document d_j and every
+ word in the query q_i]
+
+ MATCH_WEIGHT(q_i, d_j) / PROXIMITY(d_j) +
+ MATCH(q_i, d_j) * START_WEIGHT(q_i)
*/
/* This function will be used to parse the query and the document */
@@ -306,7 +345,7 @@
/*
Word from the query matched word from the document
in case sensitive fashion. Increase document weight.
- Lower distance between words gives higher weight
+ Shorter distance between words gives higher weight
(2.0 / weight_param->proximity). Also qwrd->weight,
which was calculated in cnet_weight_init function
affects weight. So synonyms have lower weight than
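The relevance formula described in the patch comment can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the code from cnet_weight.c: the names cnet_relevance_sketch, start_weight, match_weight and equal_ignore_case are hypothetical, the query and the document are assumed to be already split into words, the synonyms list is passed in by the caller, and PROXIMITY is assumed to be 1 for the first matched word of the document (the comment does not say what happens when there is no earlier match).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive equality of two NUL-terminated words. */
static int equal_ignore_case(const char *a, const char *b)
{
  while (*a && *b)
  {
    if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
      return 0;
    a++;
    b++;
  }
  return *a == *b;              /* equal only if both strings ended */
}

/* MATCH_WEIGHT(q_i, d_j): 2 = case-sensitive, 1 = case-insensitive, 0 = none */
static int match_weight(const char *q, const char *d)
{
  if (strcmp(q, d) == 0)
    return 2;
  return equal_ignore_case(q, d) ? 1 : 0;
}

/* START_WEIGHT(q_i): synonyms start lower than any other word. */
static double start_weight(const char *q, const char *const *syn, size_t n_syn)
{
  double w= 0.1;                /* any other word */
  size_t s;
  for (s= 0; s < n_syn; s++)
  {
    if (strcmp(q, syn[s]) == 0)
      return -0.2;              /* case-sensitive synonym match */
    if (equal_ignore_case(q, syn[s]))
      w= -0.1;                  /* case-insensitive synonym match */
  }
  return w;
}

/*
  RELEVANCE(q, d): sum over all document words d_j and query words q_i of
  MATCH_WEIGHT(q_i, d_j) / PROXIMITY(d_j) + MATCH(q_i, d_j) * START_WEIGHT(q_i)
*/
double cnet_relevance_sketch(const char *const *q, size_t l,
                             const char *const *d, size_t m,
                             const char *const *syn, size_t n_syn)
{
  double sum= 0.0;
  size_t last_match= 0;         /* 1-based index of previous match, 0 = none */
  size_t i, j;

  assert(q != NULL && d != NULL);
  for (j= 1; j <= m; j++)
  {
    /* Assumption: PROXIMITY is 1 when no earlier word has matched. */
    double proximity= last_match ? (double) (j - last_match) : 1.0;
    int matched= 0;
    for (i= 0; i < l; i++)
    {
      int mw= match_weight(q[i], d[j - 1]);
      if (mw == 0)
        continue;               /* both terms of the formula are 0 */
      sum+= mw / proximity + start_weight(q[i], syn, n_syn);
      matched= 1;
    }
    if (matched)
      last_match= j;
  }
  return sum;
}
```

Under these assumptions a case-sensitive match scores higher than a case-insensitive one, matched words that stand next to each other score higher than the same words separated by unmatched ones, and a query word found in the synonyms list contributes less than an ordinary word.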