List:Internals« Previous MessageNext Message »
From:Peter Gulutzan Date:March 8 2005 3:48am
Subject:bk commit - mysqldoc tree (peterg:1.2664)
View as plain text  
Below is the list of changes that have just been committed into a local
mysqldoc repository of pgulutzan. When pgulutzan does a push these
changes will
be propagated to the main repository and, within 24 hours after the
push, to the public repository.
For information on how to access the public repository
see http://www.mysql.com/doc/I/n/Installing_source_tree.html

ChangeSet
  1.2664 05/03/07 16:47:30 peterg@stripped +1 -0
  Added example in full-text section.

  Docs/internals.texi
    1.68 05/03/07 16:47:30 peterg@stripped +101 -29
    Added example in full-text section.

# This is a BitKeeper patch.  What follows are the unified diffs for the
# set of deltas contained in the patch.  The rest of the patch, the part
# that BitKeeper cares about, is below these diffs.
# User:	peterg
# Host:	d-142-59-78-116.abhsia.telus.net
# Root:	/home/pgulutzan/mysqldoc

--- 1.67/Docs/internals.texi	Sun Mar  6 18:49:54 2005
+++ 1.68/Docs/internals.texi	Mon Mar  7 16:47:30 2005
@@ -3697,9 +3697,9 @@
 can apply basic trigonometry to calculate "distances", and
 those distances are equatable with similarity measurements.
 A comprehensible discussion of vector space technology is here:
-@uref{http://www.miislita.com/term-vector/term-vector-1.html}.
+http://www.miislita.com/term-vector/term-vector-1.html.
 And a text which partly inspired our original developer is here:
-@uref{ftp://ftp.cs.cornell.edu/pub/smart/smart.11.0.tar.Z} ("SMART").
+ftp://ftp.cs.cornell.edu/pub/smart/smart.11.0.tar.Z ("SMART").
 
 But let's try to describe the classic formula:
 @example
@@ -3721,18 +3721,21 @@
 some calculations for "the normalization factor".
 In the end, MySQL's formula looks something like:
 @example
-w1 = log(dtf)+1);
-w = w1/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf);
+w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)
 @end example
 Where:
 @example
 dtf     is the number of times the term appears in the document
-sumdtf  is the sum of dtf's for all words in the same document
+sumdtf  is the sum of (log(dtf)+1)'s for all terms in the same document
 U       is the number of Unique terms in the document
 N       is the total number of documents
 nf      is the number of documents that contain the term
 @end example
 
+The formula has three parts: base part, normalization factor, global
multiplier.
+
+The base part is the left of the formula, "(log(dtf)+1)/sumdtf".
+
 The normalization factor is the middle part of the formula.
 The idea of normalization is: if a document is shorter than average
 length then weight goes up, if it's average length then
@@ -3741,12 +3744,16 @@
 normalization factor. For the theory and justification, see
 the paper "Pivoted Document Length Normalization" by Amit
 Singhal and Chris Buckley and Mandar Mitra ACM SIGIR'96, 21-29, 1996:
-@uref{http://ir.iit.edu/~dagr/cs529/files/handouts/singhal96pivoted.pdf}.
-The word "unique" here means, in effect, that our measure of
-document length is the number of unique terms in the document.
-We chose 0.115 as the pivot value, it's @code{PIVOT_VAL} in the MySQL
-source code header file @file{myisam/ftdefs.h}.
+http://ir.iit.edu/~dagr/cs529/files/handouts/singhal96pivoted.pdf.
+The word "unique" here means that our measure of
+document length is based on the unique terms in the document.
+We chose 0.0115 as the pivot value, it's PIVOT_VAL in the MySQL
+source code header file myisam/ftdefs.h.
 
+If we multiply the base part times the normalization factor, we
+have the term weight. The term weight is what MySQL stores in the
index.
+
+The global multiplier is the final part of the formula.
 In the classic Vector Space formula, the final part would be
 the inverse document frequency, or simply
 @example
@@ -3756,15 +3763,13 @@
 @example
 log((N-nf)/nf)
 @end example
-This variant is more often used in "probabilistic" formulas.  Such
formulas
-try to make a better guess of the probability that a term will be
relevant.
-To go back to the old system, look in @file{myisam/ftdefs.h} for
-"@code{#define GWS_IN_USE GWS_PROB}" (i.e. global weights by
probability)
-and change it to "@code{#define GWS_IN_USE GWS_IDF}" (i.e. global
weights by
+This variant is more often used in "probabilistic" formulas.
+Such formulas try to make a better guess of the probability that a term
+will be relevant. To go back to the old system, look in myisam/ftdefs.h
for
+"#define GWS_IN_USE GWS_PROB" (i.e. global weights by probability)
+and change it to "#define GWS_IN_USE GWS_IDF" (i.e. global weights by
 inverse document frequency).
 
-The term weight is what MySQL stores in the index.
-
 Then, when retrieving, the rank is the product of the weight
 and the frequency of the word in the query:
 
@@ -3780,17 +3785,17 @@
 
 In vector-space speak, the similarity is the product of the vectors.
 
-And R is the double-precision number that you see if you say:
-@code{SELECT MATCH(...) AGAINST (...) FROM t}.
+And R is the floating-point number that you see if you say:
+SELECT MATCH(...) AGAINST (...) FROM t.
 
-To sum it up, w --- which stands for weight ---
+To sum it up, w -- which stands for weight --
 goes up if the term occurs more often in a row, goes down if
 the term occurs in many rows, goes up / down depending whether
 the number of unique words in a row is fewer / more than average.
-Then R --- which stands for either Rank or Relevance --- is
-w times the frequence of the term in the @code{AGAINST} expression.
+Then R -- which stands for either Rank or Relevance -- is
+w times the frequency of the term in the AGAINST expression.
 
-@strong{Looking at the text weights}
+@strong{The Simplest Possible Example}
 
 First, make a fulltext index. Follow the instructions in the
 "MySQL Full-Text Functions" section of the MySQL Reference
@@ -3815,16 +3820,16 @@
 Now, let's look at the index.
 
 There's a utility for looking at the fulltext index keys and
-their weights. The source code is @file{myisam/myisam_ftdump.c}, and
+their weights. The source code is myisam/myisam_ftdump.c, and
 the executable comes with the binary distribution. So,
-if @code{exedir} is where the executable is, and @code{datadir} is the
+if exedir is where the executable is, and datadir is the
 directory name that you get with
-"@code{SHOW VARIABLES LIKE 'datadir%'}", and @code{dbname} is the
+"SHOW VARIABLES LIKE 'datadir%'", and dbname is the
 name of the database that contains the articles table,
 then this works:
 
 @example
->/exedir/myisam_ftdump /datadir/dbname/articles 1 --d
+>/exedir/myisam_ftdump /datadir/dbname/articles 1 -d
        b8            0.9456265 1001
        f8            0.9560229 comparison
       140            0.8148246 configured
@@ -3850,6 +3855,74 @@
        f8            0.9560229 yoursql
 @end example
 
+Let's see how one of these numbers relates to the formula.
+
+The term 'tutorial' appears in document 0. The full document is
+"MySQL Tutorial / DBMS stands for DataBase ...".
+The word "tutorial" appears once in the document, so dtf = 1.
+The word "for" is a stopword, so there are only 5 unique
+terms in the document ("mysql", "tutorial", "dbms", "stands",
"database"),
+so U = 5. Each of these terms appears once in the document,
+so sumdtf is the sum of log(1)+1, five times.
+So, taking the first two parts of the formula (the term weight), we
have:
+@example
+(log(dtf)+1)/sumdtf * U/(1+0.0115*U)
+@end example
+which is
+@example
+(log(1)+1)/( (log(1)+1)*5) * 5/(1+0.0115*5)
+@end example
+which is
+@example
+0.9456265
+@end example
+which is what myisam_ftdump says. So the term weight looks good.
+
+Now, what about the global multiplier? Well, you don't see it
+with myisam_ftdump, but you'll see it with the mysql client.
+The total number of rows in the articles table is 6, so N = 6.
+And "tutorial" occurs in two rows, in row 0 and in row 78, so nf = 2.
+So, taking the final (global multiplier) part of the formula, we have:
+@example
+log((N-nf)/nf)
+@end example
+which is
+@example
+log((6-2)/2)
+@end example
+which is
+@example
+0.6931472
+@end example
+
+So what would we get for row 0 with a search for 'tutorial'?
+Well, first we want w, so: Multiply the term weight of
+tutorial (which is 0.9456265) times the global multiplier
+(which is 0.6931472). Then we want R, so:
+Multiply w times the number of times
+that the word 'tutorial' appears in the search (which is 1).
+In other words, R = 0.9456265 * 0.6931472 * 1.
+Here's the proof:
+
+@example
+mysql> select round(0.9456265 * 0.6931472 * 1, 7) as R;
++-----------+
+| R         |
++-----------+
+| 0.6554583 |
++-----------+
+1 row in set (0.00 sec)
+
+mysql> select round(match(title,body) against ('tutorial'), 7) as R
+    -> from articles limit 1;
++-----------+
+| R         |
++-----------+
+| 0.6554583 |
++-----------+
+1 row in set (0.00 sec)
+@example
+
 @strong{You'll need memory}
 
 The MySQL experience is that many users appreciate 
@@ -3861,7 +3934,7 @@
 
 On the other hand, there are occasional complaints about speed.
 Here, the tricky part is that the formula depends on global
-factors --- specifically N (the number of documents) and nf
+factors -- specifically N (the number of documents) and nf
 (the number of documents that contain the term). Every time
 that insert/update/delete occurs for any row in the table,
 these global weight factors change for all rows in the table.
@@ -3881,7 +3954,6 @@
 of the keys. Using this tree, MySQL can calculate the count
 of matching rows with reasonable speed. But speed declines
 logarithmically as the number of terms increases.
-
 
 @strong{Weighting in boolean mode}


Thread
bk commit - mysqldoc tree (peterg:1.2664)Peter Gulutzan8 Mar