Hello
[ I'm crossposting this to dbi-users because it might be of interest
there too. Please don't reply to both lists, thanks. ]
While programming a journal in Perl/AxKit I realized that two
problems, creating useful indexes for searching content efficiently
and parsing user input into the right SQL queries, are so common
that there *must* be a good library already. :-) So I headed over
to CPAN, but didn't really find what I was looking for.
What I'm looking for should:
- create indexes that are efficiently searchable in mysql, i.e. that
need only <select ... where ... like "abcd%"> queries, not "%abc%".
- allow searching for word parts (e.g. find "fulltext" when entering
"text").
- allow searching multiple form fields at once (e.g. one field for
title words, one for author names, etc.).
- preferably allow some sort of query rules (AND/NOT/OR or something).
- preferably do some relevance sorting.
- preferably allow hooking some numbers (link or access counts etc.)
into the relevance sorting.
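To make the word-part requirement concrete, here is a minimal sketch of one possible approach: store every suffix of each word, so that a substring search becomes a prefix search. It is written in Python with SQLite standing in for Perl/DBI and MySQL. The table word_index, the functions suffixes, index_document and search, and the minimum fragment length of 3 are all assumptions made up for illustration, not an existing library.

```python
import re
import sqlite3

def suffixes(word, min_len=3):
    """Return every suffix of `word` that has at least `min_len` characters."""
    return [word[i:] for i in range(len(word) - min_len + 1)]

def index_document(db, doc_id, fields):
    """Store all word suffixes of each field in the hypothetical word_index table."""
    for field, text in fields.items():
        for word in set(re.findall(r"\w+", text.lower())):
            db.executemany(
                "INSERT INTO word_index (fragment, field, doc_id) VALUES (?, ?, ?)",
                [(frag, field, doc_id) for frag in suffixes(word)],
            )

def search(db, field, term):
    """Find documents whose `field` has `term` anywhere inside a word.

    Only a prefix match (LIKE 'term%') is issued, which an ordinary
    index on `fragment` can serve. `term` is not escaped here, so a
    literal % or _ in it would act as a wildcard.
    """
    rows = db.execute(
        "SELECT DISTINCT doc_id FROM word_index "
        "WHERE field = ? AND fragment LIKE ? ORDER BY doc_id",
        (field, term.lower() + "%"),
    )
    return [row[0] for row in rows]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE word_index (fragment TEXT, field TEXT, doc_id INTEGER)")
db.execute("CREATE INDEX fragment_idx ON word_index (fragment, field)")
index_document(db, 1, {"title": "Fulltext search in MySQL", "author": "Somebody"})
index_document(db, 2, {"title": "Plain text files", "author": "Nobody"})
print(search(db, "title", "text"))  # [1, 2]: "fulltext" is found via its suffix "text"
```

The price is index size: a word of n characters produces up to n - 2 rows here, which is why splitting on real sub-word boundaries instead of on every character would be attractive.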
I think three tough parts are needed:
1. creation of sophisticated index structures (inverted indexes)
2. some way to recognize sub-word boundaries to split words on. Maybe
use some form of thesaurus? Or syllables? (I suspect the rules should
be the same as for hyphenating words at line breaks.)
3. user input parser / query creator
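As a rough idea of what part 3 could look like, here is a small sketch that turns user input with AND/OR/NOT into a WHERE clause with bound parameters. It is Python with SQLite standing in for Perl/DBI and MySQL again. The function build_where and the table word_index are hypothetical, parentheses are not supported, two terms in a row are treated as an implicit AND, and since the clause is handed to SQL as-is, AND binds tighter than OR.

```python
import sqlite3

def build_where(query):
    """Translate e.g. 'xml AND perl NOT java' into (where_clause, params)."""
    term_sql = "doc_id {}IN (SELECT doc_id FROM word_index WHERE fragment LIKE ?)"
    parts, params = [], []
    negate, need_operator = False, False
    for token in query.split():
        upper = token.upper()
        if upper in ("AND", "OR"):
            if need_operator:           # ignore a leading or doubled operator
                parts.append(upper)
                need_operator = False
        elif upper == "NOT":
            negate = True               # applies to the next term only
        else:
            if need_operator:           # two terms in a row: implicit AND
                parts.append("AND")
            parts.append(term_sql.format("NOT " if negate else ""))
            params.append(token.lower() + "%")   # prefix match only
            negate, need_operator = False, True
    if parts and parts[-1] in ("AND", "OR"):     # drop a dangling operator
        parts.pop()
    return " ".join(parts), params

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE word_index (fragment TEXT, doc_id INTEGER)")
db.executemany(
    "INSERT INTO word_index VALUES (?, ?)",
    [("xml", 1), ("perl", 1), ("xml", 2), ("java", 2), ("perl", 3)],
)

def run(query):
    """Run a user query against the toy index and return matching doc ids."""
    where, params = build_where(query)
    sql = "SELECT DISTINCT doc_id FROM word_index WHERE " + where + " ORDER BY doc_id"
    return [row[0] for row in db.execute(sql, params)]

print(run("xml perl NOT java"))  # [1]
```

The user's words only ever travel as bound parameters, never as part of the SQL string, so the parser itself does not open an injection hole.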
Why not:
- use mysql's fulltext indexes? Because I think they are currently
too limited (see e.g. the user comments about them at
www.mysql.com/doc/). They should be better in mysql-4, I read, but we
need this in a few weeks already. They are also not supported in
InnoDB, which we want to use.
- use indexing robots? Because we work with XML documents, and would
like both to keep the index up to date immediately and to split
the XML contents into several parts (e.g. there's a title, byline,
etc., which should be searchable or weighted differently). We want a
*library*, not a finished product.
There's Lucene (www.lucene.com) in Java that I think does exactly
what I want. Anyone want to help me port it to Perl, or to C(++)
with Perl bindings (-; ? (It should be ready in a few weeks, and
it's about 500k of source code :-().
(Something in C/C++ that could be loaded as a UDF or so would be nice
too, but as I understand it (from the recent discussion about embedded
procedural languages) that's not possible, since such UDFs would have
to start other queries (e.g. to insert each word fragment into an
index table).)
What are my current options? What do you use?
More info about mysql-4?
Thx
Christian.