List: General Discussion
From: Christian Jaeger
Date: September 13 2001 2:42am
Subject: Fulltext indexing libraries (perl/C/C++)
Hello

[ I'm crossposting this to dbi-users because it might be of interest 
there too. Please don't reply to both lists, thanks. ]

While programming a journal in perl/axkit I realized that the problems 
of both creating useful indexes for searching content efficiently and 
parsing user input to build the right SQL queries from it are so 
common that there *must* be some good library already. :-) So I 
headed over to CPAN, but didn't really find what I was looking for.

It should create indexes that are efficiently searchable in MySQL, 
i.e. using only <select ... where .. like "abcd%"> queries, not "%abc%". 
It should allow searching for word parts (e.g. find "fulltext" when 
entering "text"), and allow for multiple form fields (e.g. one field for 
title words, one for author names, etc.) at once. Preferably it would 
allow some sort of query rules (AND/NOT/OR or something).
Preferably it would do some relevance sorting, and allow hooking some 
numbers (link or access counts etc.) into the relevance sorting.
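
To make the prefix-only requirement concrete, here is a minimal sketch of one way to get there: store every suffix of each word as its own index row, so that a search for "text" becomes a plain `LIKE 'text%'` lookup. It is written in Python with SQLite standing in for MySQL; the `frag` table, the `min_len` cutoff and all function names are made up for illustration and are not taken from any existing library.

```python
import sqlite3

def suffixes(word, min_len=3):
    """Return every suffix of `word` that is at least `min_len` characters long."""
    word = word.lower()
    return [word[i:] for i in range(len(word) - min_len + 1)]

def build_index(conn, docs):
    """Store one (fragment, doc_id) row per word suffix (hypothetical schema)."""
    conn.execute("CREATE TABLE frag (fragment TEXT, doc_id INTEGER)")
    conn.execute("CREATE INDEX frag_idx ON frag (fragment)")
    for doc_id, text in docs.items():
        # A set avoids duplicate rows when a fragment occurs twice in one document.
        rows = {(s, doc_id) for w in text.split() for s in suffixes(w)}
        conn.executemany("INSERT INTO frag VALUES (?, ?)", rows)

def search(conn, term):
    """Find documents containing `term` inside any word, using only a prefix LIKE."""
    cur = conn.execute(
        "SELECT DISTINCT doc_id FROM frag WHERE fragment LIKE ? ORDER BY doc_id",
        (term.lower() + "%",),
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
build_index(conn, {1: "fulltext indexing in perl", 2: "plain text files", 3: "journal"})
print(search(conn, "text"))  # documents 1 and 2 both match
```

The price is index size: a word of n characters produces up to n - 2 rows with the default cutoff. The sketch also does not escape `%` or `_` in the search term, which a real implementation would have to do.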

I think there are 3 tough parts which are needed:
1. creation of sophisticated index structures (inverted indexes)
2. somehow recognizing sub-word boundaries to split words on. Maybe use 
some form of thesaurus? Or syllables? (I suspect it should be the 
same rules as for hyphenating words at line breaks)
3. user input parser / query creator
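
As a rough illustration of parts 1 and 3, the sketch below builds an in-memory inverted index and evaluates a very small AND/OR/NOT query language against it. It is Python, and everything in it is an assumption made for the example: the function names, the whitespace tokenizer, and the strictly left-to-right evaluation without precedence or parentheses. Sub-word splitting (part 2) is left out.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def run_query(index, query):
    """Evaluate e.g. 'a AND b NOT c' strictly left to right.

    Operators must be upper case; terms with no operator between them are ANDed.
    """
    all_ids = set().union(*index.values())  # every document that has any word
    result = None
    op = "AND"
    for token in query.split():
        if token in ("AND", "OR", "NOT"):
            op = token
            continue
        hits = index.get(token.lower(), set())
        if result is None:
            # First term: a leading NOT means "everything except these hits".
            result = all_ids - hits if op == "NOT" else set(hits)
        elif op == "AND":
            result &= hits
        elif op == "OR":
            result |= hits
        else:  # NOT
            result -= hits
        op = "AND"
    return sorted(result or [])

docs = {1: "perl fulltext index", 2: "java fulltext index", 3: "perl dbi"}
index = build_inverted_index(docs)
print(run_query(index, "fulltext AND index NOT java"))  # [1]
```

In a database-backed version, each set operation would turn into a join or subquery over the index table instead of a Python set operation.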

Why not:

- use MySQL's fulltext indexes? Because I think that currently they 
are too limited (see the user comments about them at 
www.mysql.com/doc/) (should be better in mysql-4, I read, but we need 
it in a few weeks already...). And they are also not supported in 
InnoDB, which we want to use.

- use indexing robots? Because we work with XML documents, and would 
like both to keep the index up to date immediately and to split 
the XML contents into several parts (e.g. there's a title, byline, 
etc., which should be searchable or weighted differently). We want a 
*library*, not a finished product.
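
To show what weighting the parts differently could look like, here is a small Python sketch of a relevance score that counts term hits per field, multiplies them by a per-field weight, and adds a damped bonus from an access counter. The field names, the weights, the `access_count` key and the logarithmic damping are all assumptions chosen for the example, not a recommendation.

```python
import math

# Hypothetical per-field weights: a hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "byline": 2.0, "body": 1.0}

def score(doc, terms, popularity_weight=0.5):
    """Sum weighted term counts over the fields, then add a damped access-count bonus."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        total += weight * sum(words.count(t.lower()) for t in terms)
    if total == 0:
        return 0.0  # no match at all: popularity alone should not rank a document
    return total + popularity_weight * math.log1p(doc.get("access_count", 0))

def rank(docs, terms):
    """Return the ids of matching documents, best score first (ties by id)."""
    scored = [(score(d, terms), doc_id) for doc_id, d in docs.items()]
    return [doc_id for s, doc_id in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]

docs = {
    1: {"title": "Fulltext search", "body": "about indexes"},
    2: {"title": "Misc", "body": "fulltext fulltext"},
    3: {"title": "Other", "body": "nothing"},
}
print(rank(docs, ["fulltext"]))  # [1, 2]: one title hit outweighs two body hits
```

With the index kept in SQL tables, the same idea would amount to storing a field column next to each index row and summing the weights in the query.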

There's Lucene (www.lucene.com) in Java that I think does exactly 
what I want. Anyone want to help me port that to perl or 
C(++) with perl bindings (-; ? (It should be ready in a few weeks, and 
it's about 500k of source code :-().

(Something in C/C++ that would be loaded as a UDF or so would be nice 
too, but as I understand it (from the recent discussion about embedded 
procedural languages) that's not possible, since these UDFs would have 
to start other queries (e.g. to insert each word fragment into an 
index table).)

What are my current options? What do you use?
More info about mysql-4?

Thx
Christian.