Shawn (and anyone else who will listen):
I'm already running out of RAM (actually, virtual memory/page file space)
just trying to display all the job titles without even joining them with
anything. I have to use a LIKE clause to just get a portion of them.
So, I could:
(1) Have multiple database connections going concurrently where the
first one joins everything except the keywords. As I'm iterating thru the
first result set with the fetch function, I could get a list of keyword
foreign keys for each row with a second database connect and store this in a
second result set. Is this a common approach? Are secondary database
connections cheap?
(2) I could try to store the first join in a hashmap first and then
iterate but I've already demonstrated that the hashmap takes too much
memory.
(3) I could create a new column of type string for each job title. This
would contain a comma separated list of integer foreign keys for the
keywords. This is the non-normalized option and you discouraged this
approach.
(4) I could have a very wide result set. Let assume I have a jobtitle
(joined with a job posting and company) with 26 keywords. That means 26 rows
in the result set are identical except the keyword foreign key (fk) column.
I have to then insert the logic to detect the fact that everything except
the keyword fk column is identical. Are you advocating this approach? It
seems like it requires a lot of computer space and computer time and (worst
of all) my time. I believe this is the classical approach, however.
Which would you choose?
Thanks,
Siegfried
_____
From: SGreen@stripped [mailto:SGreen@stripped]
Sent: Monday, July 25, 2005 9:31 AM
To: Siegfried Heintze
Cc: mysql@stripped
Subject: Re: Alternatives to performing join on normalized joins?
"Siegfried Heintze" <siegfried@stripped> wrote on 07/24/2005 11:35:36 AM:
> I have a large number of job titles (40K). Each job title has multiple
> keywords making a one-to-many parent-child relationship.
>
> If I join job title with company name, address, company url, company city,
> job name, job location, job url (etc...) I have a mighty wide result set
> that will be repeated for each keyword.
>
> What I have done in the past (in a different, much smaller, application)
is
> perform a join of everything except the keyword and store everything in a
> hashmap.
>
> Then I iterate thru each wide row in the hashmap and perform a separate
> SELECT statement foreach row in this hashmap to fetch the multiple
keywords.
>
> Whew! That would be a lot of RAM (and paging) for this application.
>
> Are there any other more efficient approaches?
>
> Thanks,
> Siegfried
>
>
There are two major classes of efficiency when dealing with any RDBMS: time
efficiency (faster results), space efficiency (stored data takes less room
on the disk). Which one are you worried about?
If it were me, I would start with all of the data normalized:
* a Companies table (name, address, url, city, etc)
* a Job Titles table (a list of names)
* a Keywords table (a list of words used to describe Job Titles)
* a JobPosting table ( Relates Companies to Job Titles. Should also
be used to track things like dateposted, dateclosed, salary offered, etc.)
* a Postings_Keywords table (matches a Posting to multiple
Keywords).
I would only denormalize if testing showed a dramatic improvement in
performance by doing so. I would think that the Job Title to Keyword
relationship would be different between Companies. One company posting for a
"Programmer" may want VB while another wants PHP and PERL. By associating
the Keywords with a Posting (and not just the Job Title), you can make that
list Company-specific.
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine