jonathan michaels wrote:
> > I recently uploaded DBIx::TextIndex to CPAN, which perl users may find
> > useful for full-text indexing. It does multi-column inverted indexes, with
> > boolean searching and relevancy scoring.
> a question if i might.
> how does this (and the like) compare to a real text search
> engine, wais ?
> i've been trying to work out how to implement a text file
> repository, er text database and web searching facility on a few
> documents that total some 4 gb of sundry, various text and a
> few graphics files and some of the texts have been converted to
> sound format files so that they can be played by sight impaired
> users in lieu of 'reading' the text file as displayed on the
> as wais server (development and repairing) are going out of
> favour, in the face of this entrancement that most people have
> with (most inappropriate tools like) sql database technology.
> this is why i am looking at trying to use mysql for a task that
> no rdbms is built to handle well, if at all.
It uses traditional information retrieval techniques very similar to WAIS. It
builds an inverted index, and uses a vector space model to score document
relevancy by matching a query vector as closely as possible to a document vector.
Boolean operators are allowed, and I've started adding rudimentary
It also has the advantage of fielded searching, since it allows multiple inverted
indexes on different columns in the database.
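
To make the idea concrete, here is a toy sketch of the general technique, written in
Python for brevity. It is not the module's code or API: the ColumnIndex class, the
tokenizer, the tf-idf weighting and the sample rows are all assumptions made up for
illustration. It keeps one inverted index per column and ranks documents by the
cosine between the query vector and each document vector.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


class ColumnIndex:
    """Hypothetical inverted index over a single column."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_ids = set()

    def add(self, doc_id, text):
        self.doc_ids.add(doc_id)
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def idf(self, term):
        # Inverse document frequency; 0.0 for unknown terms.
        df = len(self.postings.get(term, {}))
        return math.log(len(self.doc_ids) / df) if df else 0.0

    def search(self, query):
        # Query vector: tf-idf weight per query term.
        q_weights = {t: tf * self.idf(t)
                     for t, tf in Counter(tokenize(query)).items()}
        q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

        # Dot product of query and document vectors, via the postings lists.
        dots = defaultdict(float)
        for term, qw in q_weights.items():
            for doc_id, tf in self.postings.get(term, {}).items():
                dots[doc_id] += qw * tf * self.idf(term)

        # Document vector lengths (recomputed per query to keep the sketch short).
        norms = defaultdict(float)
        for term, plist in self.postings.items():
            w = self.idf(term)
            for doc_id, tf in plist.items():
                norms[doc_id] += (tf * w) ** 2

        results = []
        for doc_id, dot in dots.items():
            denom = math.sqrt(norms[doc_id]) * q_norm
            if dot > 0 and denom > 0:
                results.append((doc_id, dot / denom))  # cosine similarity
        return sorted(results, key=lambda r: r[1], reverse=True)


# Made-up sample rows: one index per column gives fielded searching.
rows = {
    1: {"title": "Perl full-text indexing",
        "body": "Building an inverted index in Perl"},
    2: {"title": "MySQL tuning",
        "body": "Indexing tables for fast queries"},
    3: {"title": "WAIS servers",
        "body": "Running a WAIS text search engine"},
}

indexes = {"title": ColumnIndex(), "body": ColumnIndex()}
for doc_id, columns in rows.items():
    for column, text in columns.items():
        indexes[column].add(doc_id, text)

title_hits = indexes["title"].search("indexing")
body_hits = indexes["body"].search("indexing")
```

With these rows, the same query term matches a different document depending on which
column's index is searched, which is the point of fielded searching. A real
implementation would store the postings and document lengths persistently rather
than rebuilding them in memory.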
It allows the articles to remain in the database for indexing, instead of spitting
them out to external files and indexing them with a traditional text retrieval
solution. It integrates nicely with mod_perl, and allows a lot of flexibility in
the front-end query interface.
I use it on a 750 MB news article collection, and it returns most queries in less
than a second.