MySQL Lists are EOL. Please join:

List:General Discussion« Previous MessageNext Message »
From:Troy Grady Date:July 22 1999 1:50am
Subject:Re: indexes on TEXT columns, keyword searching through text
View as plain text  
Awesome!  All kinds of useful stuff over at devshed.com.  Attached to the
articles are user annotations, among which I found these two very nice
solutions (below) to the text indexing issue.  They use MySQL exclusively,
i.e. without an external indexing engine like htdig.

Thanks Philip!

Troy

---

(1)

Name: ErikLtz
Subject: Performance problem - and solution
Date: 05-10-1999 02:19AM

When using the described approach with plenty of data, the number of records
in the table will quickly become huge, one record for every (non-noise) word
in every article...

To index 28000 articles with approximately 64 relevant words in each the
word table would contain 1.8 million records and even mySQL would take some
time to search this table (despite indexing).

Many records contain the same word and to get rid of these doublettes,
another table is added. In short my solutions contain:

- one word list table with every unique word and a unique id for each word

- one word-id-to-article-ids table where the article id's consist of 32-bit
article id's stored as an array of binary values in a blob (not suitable for
PHP but works very well with a small search application written in C and run
from PHP3 using system)

The result is a very quick search function and the expected time complexity
when the number of articles grow (ordo) is O(log n) instead of O(n)
(linear).

On a measly 140 MHz Ultra Sparc the search completes in 0.05 seconds with a
word list containing 66551 unique words from 29648 telegrams. I discard too
common words but even with this there would have been 1 898 959 records with
the solution described in the article.

My search program also perform wildcard searches and multiple search words
and exclusions. With a separate word list and article id table, article
titles can be searched.

Guess systems like AltaVista use even more sophisticated algorithms, but the
critical point is to switch from O(n) to O(log n) and I've succeeded in
doing that with this still rather simple search function.

Regards,
Erik L.

---

(2)


Name: Porter
Subject: Re: Performance problem - and solution
Date: 05-14-1999 09:12PM

Hi,

About a year ago, I ran across a similar problem in terms of indexing a
keyword search across about 2GB of text data... So, I determined that it
would be a good idea to build an index of the unique, important words in the
body of text. I ended up with a very similar solution - at the time I was
using mSQL (miniSQL) with it's built in server-side scripting engine, Lite.

The table that I ended up with for indexing looked more like this, with 2
fields in the table.

word - varchar(50)
memo

The word was essentially the index (I didn't see as assigning a UID to each
unique, important word would make a difference), and the memo field
contained a comma delimited list of the linked records in which that word
appeared. When I did a query, I supported an ANY or ALL radio button on the
HTML form. This caused my script on the server to handle the search results
in two different ways:

1> ANY - just query for the memo fields attached to that word, and combine
the lists - stripping out duplicate entries (though I suppose, I could have
counted duplicates, and used it as a weighting for relevance).

2> ALL - looped thru the lists, and returned only links to records in which
the ID appeared in all the memo fields.

It seemed to work pretty good, was fairly fast, and it wasn't difficult to
add something new to the index:

Parse the article word by word, make a query against the index table, to see
if the current word is in it, if it is, append the ID of the article (or
record) to the memo field - else, create a new record and start the memo
field with one value.

- PW





__________________________
Troy Grady, Managing Director
Grady Levkov & Company, Inc.


-----Original Message-----
From: Philip Hallstrom <philip@stripped>
To: mysql@stripped <mysql@stripped>
Date: Tuesday, June 22, 1999 7:11 PM
Subject: Re: indexes on TEXT columns, keyword searching through text


>In article <376E5570.A935F50D@stripped> you write:
>>Troy Grady wrote:
>>>
>>> Folks-
>>>
>>> We store a lot of text (multi-page resumes and notes) that is too big to
fit
>>> in VARCHAR columns and we do a lot of keyword searches through these
fields
>>> like, for example:  "find me every TEXT field that contains the word
[...]
>>> and the word [...]"  With several thousand such fields -- and growing --
>>> clearly we need some kind of indexing going on.
>>>
>>> The MySQL manual indicates that the new default table type (MYISAM) in
the
>>> upcoming release (3.23.0) will support indexes on TEXT and BLOB columns.
>>> Does anyone know if this means an index on the first N characters of a
TEXT
>>> column or an index on every word of a TEXT column?
>>>
>>> Regards,
>>>
>>> Troy
>>>
>>
>>Most of the time, an index on a TEXT field will do you no good for what
>>you are doing. The queries you run search of a keyword within the text.
>>An index can be used only if your query specifies what the field starts
>>with, not just what it contains.
>>
>>Maybe a better solution for your setup would be to use a combination of
>>htdig and MySQL. You can store resumes on the filesystem, so that htdig
>>could index them, and all the meta information in MySQL along with a
>>reference to where the text is stored.
>
>You could also write a script that exported the TEXT fields to the
>filesystem, then use htdig on those, then modify htdig so that when you
>do your search, you parse the output and extract the id associated with
>the TEXT field.  A much better explanation of that is at:
>http://www.devshed.com/Server_Side/PHP/ (Search This).  There's also an
>article about doing exactly what you want using Mysql... it's not the
>best system, but it would work.
>
>-philip
>

Thread
indexes on TEXT columns, keyword searching through textTroy Grady22 Jun
  • Re: indexes on TEXT columns, keyword searching through textPaul DuBois22 Jun
  • Re: indexes on TEXT columns, keyword searching through textSasha Pachev22 Jun
    • RE: indexes on TEXT columns, keyword searching through textSteven Roussey22 Jun
      • C API webpagessimon22 Jun
        • RE: C API webpagesDouglas B. Jones22 Jun
        • Re: C API webpagesSasha Pachev22 Jun
      • Re: indexes on TEXT columns, keyword searching through textjonathan michaels23 Jun
Re: indexes on TEXT columns, keyword searching through textPhilip Hallstrom23 Jun
Re: indexes on TEXT columns, keyword searching through textTroy Grady24 Jun