List: General Discussion
From: Thimble Smith  Date: July 26 1999 8:27pm
Subject: Re: Reverse index + search engine question
At 20:38, 19990726, Martin Smallridge wrote:
>Quick intro, "I'm a newbie" (no surprise here but let me finish).... to
>Linux, mySQL, Perl, etc... so if I say anything dumb you know why.

Being a newbie doesn't buy you any grace.  But the fact is, we all
say dumb things once in a while.  Best you can do is say, "I'm sorry"
and try to learn from it.

>Ok on to the questions...
>
>First off is there anyone out there who has tackled a search engine spider
>script with mySQL as the storage medium for the reverse index? If so, how
>did you go about structuring the index?

Look for UDMSearch at http://www.mysql.com/Contrib/.  Also, on the Links
page of the MySQL site, there's a link to
    http://home.wxs.nl/cgi-bin/planeteers/pgidszoek.cgi

I don't know anything about it, but you could try contacting the webmaster
there to find out what they did.

>The reason I ask is because I'm looking at doing just that in a small but
>specialised topic area and want to create a fast and efficient store.
>
>At the moment I'm thinking of a dual database approach:
>DB1 stores the URL and a unique identifier
>DB2 stores the keywords that refer to the unique identifiers (as
>appropriate)

I think you mean a dual table approach, not dual DB.  Mixing up those
words is common when you're first working with relational DBs,
especially if you've done work with DBM-style databases.  But it's best
to get used to the right terms, so that people will follow what you're
saying.
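
For illustration, here is a minimal sketch of the two-table layout described
above.  It makes several assumptions: Python's built-in sqlite3 module stands
in for MySQL so the example runs anywhere, and the table names, column names,
and the index_page/search helpers are all hypothetical.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL here so the sketch runs anywhere.
# Table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE urls (
        id  INTEGER PRIMARY KEY,   -- the unique identifier
        url TEXT NOT NULL UNIQUE
    );
    CREATE TABLE keywords (
        word   TEXT    NOT NULL,
        url_id INTEGER NOT NULL REFERENCES urls(id),
        PRIMARY KEY (word, url_id)
    );
""")

def index_page(url, words):
    """Store one URL and list its id against each distinct word."""
    cur = conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
    url_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO keywords (word, url_id) VALUES (?, ?)",
        [(w, url_id) for w in sorted(set(words))],
    )
    return url_id

def search(word):
    """Return the URLs indexed under a word, in alphabetical order."""
    rows = conn.execute(
        "SELECT u.url FROM keywords k JOIN urls u ON u.id = k.url_id "
        "WHERE k.word = ? ORDER BY u.url",
        (word,),
    )
    return [r[0] for r in rows]

index_page("http://example.com/a", ["mysql", "index"])
index_page("http://example.com/b", ["mysql", "spider"])
```

Calling search("mysql") then joins the keyword rows back to the URL table.
Both tables live in one database, which is the point of the terminology note.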

>The spider will basically store the reverse indexed URLs in DB1 and list the
>identifier against the relevant words in DB2. I also figure going for a high
>base number (eg Hex) will keep the database file sizes down too [but I could
>be missing something here]

You shouldn't worry about a "high base number".  You're not going to store
the IDs as a string representation of a number, you're going to store
them in some native format (e.g., two's complement).  But you don't worry
about that.  You tell the DB "MEDIUMINT" (or INT or BIGINT, depending on
how big the number will get) and it uses the best representation for your
system.  Don't try to outsmart the DB - work with the DB, and everything
will go better for you.
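
To make the size argument concrete, here is a rough sketch comparing text
representations of an ID with a fixed-width binary integer.  The ID value is
made up, and the 3-byte width is chosen to mirror MEDIUMINT's storage size.
This only counts bytes in Python and does not reproduce how MySQL actually
lays out rows on disk.

```python
# Hypothetical ID, small enough to fit in 3 unsigned bytes (max 16,777,215).
url_id = 16_000_000

as_decimal_text = str(url_id)          # one byte per decimal digit
as_hex_text = format(url_id, "x")      # one byte per hex digit
as_native = url_id.to_bytes(3, "little", signed=False)  # 3-byte binary integer

sizes = {
    "decimal text": len(as_decimal_text),
    "hex text": len(as_hex_text),
    "3-byte integer": len(as_native),
}

# The binary form round-trips back to the same number.
restored = int.from_bytes(as_native, "little", signed=False)
```

Hex text is shorter than decimal text, but the binary integer is shorter than
both, and that is what an integer column type gives you without any effort.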

>Ok, that's about it.. any comments, pointers, advice, etc.. will be much
>appreciated so thanks in advance. Oh and please can you respond direct as
>I'm finding it hard to keep up with all the digest at the moment and would
>hate to miss anything.

Sounds like your plan is okay.  Do this: get that simple search that
you have planned now working.  Get the data into the tables and get
used to working with it.  You can't go wrong with that.  Then if you
need something else, you can massage the data into a new format at any
time.  Once you have the data in tables, you can do anything with it.

Good luck,

Tim