Thank you, that was a very helpful discussion.
I'm worried a bit about scalability. This is a personal project so I will
be the primary user of the end-result; however, if it is robust enough, I'd
like to open it up to a larger audience. This type of scheme doesn't seem
to lend itself to easily adding new fields or document types "on-the-fly"
from within the application. Am I missing something or is that a fair
assessment? It is not necessarily a current requirement; however, if it may
be added later, I'd like to consider the implications now.
We are also talking about small pieces of individual data, some of which
might be self-authored, meaning that many of the fields that would be
normally present in any work won't be in most of them here, such as "Title",
"Publisher" etc. Would including them in the documents aggregate table be a
"majority rules" type of decision?
Thank you again for your response,
On Wed, Sep 17, 2008 at 12:47 PM, Geert-Jan Brits <gbrits@stripped> wrote:
> Hi Jeffrey,
> David already gave a lot of valid points.
> Table-per-documenttype seems the way to go here.
> As to the 'best' db-scheme for your task given your description you have to
> ask yourself a couple of questions:
> Please note that with a document-type I mean a type like book, html page,
> 1. with variable fields per document I gather you mean that fields can
> differ from document-type to document-type right?
> But at the same time for any given document-type you can infer beforehand
> what fields this document-type should contain right?
> This would be ideal for table-per-documenttype. You can add a book-table
> once you start integrating books, and afterwards create a table for say
> word-documents with different fields altogether.
> 2. do you want your application to be able to search over all documents of
> different document-types at once?
> The answer is probably yes.
> This means you will need to have 1 central table which contains foreign
> to each row in each of the table-per-documenttype tables.
> Call this table 'documents' or something (see below).
> Given that this is a requirement (I'm guessing here), you HAVE to do joins
> anyway when you want to search over ALL documents of all different types.
> 3. given if 2 is true it would be best in my opinion to put shared fields
> the documents-table. Fields like author,title, google-like summary,
> publication_date, rating, filesize, content-type, fetch-date,
> Modification-date etc come to mind.
> 4. normalization vs. Speed.
> Normalization is a good thing... Having said that, when you often only want
> to query 1 document-type at a time, doing joins between 1
> table-per-dcoumenttype table and the documents-table is a bit
> Waste of time. Especially with large number of rows (when I hear talk about
> html-pages I think alot of rows ;-) . For this you could consider keeping
> the shared fields (author, publication-date, etc)
> Redundant in each of the table-per-documenttype tables.
> Of course this means that you have to keep these redundant pieces of
> information in a consistent state, otherwise madness lies ahead.
> For the most part your import-application should take care of this,
> perhaps other people know how to do this (partially) with MySQL.
> This requires some work / testing, but it would surely be worth it.
> What you're left with is this:
> | documents
> |id k
> |documentid fk
> |documenttypeid fk
> Field ID is globally unique over all documents of all different types
> Field Documentid is locally unique to all documents of type documenttypeid
> Field documenttypeid points to a table documenttypes where all different
> documenttypes are described (i.e: 1:Book, 2:HTML, 3:WORD-DOC,etc)
> |documenttypeid k
> And N different table-per-documenttype tables
> For instance
> | Books
> | documentid k
> | (redundant shared fields)
> | book specific fields
> | HTML
> | documentid k
> | (redundant shared fields)
> | html specific fields (rawhtml , cleanedhtml, stripped_html,etc)
> THis enables you to:
> - use the documenttypeid to filter based on doucmenttypes (only books and
> pdf's no problem)
> - use the documenttypeid to determine with which table-per-documenttype
> table to join.
> (define a map-structure in your application which holds the
> relation between documenttypeid and the table with which to join or you
> could add the table-name as
> A column to the Documenttypes-table but the lattter option would
> require joining with the documenttypes-table which is otherwise not needed.
> - NOTE: that you only have to join between documents-table and any of the
> table-per-documenttypes tables when you require specific fields from any of
> the table-per-documenttypes tables!
> - use the shared fields in the documents-table to filter/sort over ALL
> - query directly on 1 of the table-per-documenttype tables when you know
> that there's only 1 type to query a-priori and you need the
> document-specific fields.
> (and use the shared fields and specific fields from that table without
> requiring a join)
> All in all this approach works very well if you happen to have a
> master-detail application (which is usually the case): master page shows N
> results which can be clicked through to get to the actual result (which is
> the detail page).
> If you think beforehand what data you want to show in the master-page
> (title, author, summary, publication date?) it may become possible to make
> these fields shared (so they exist in
> The documents-table). This in turn would mean that:
> 1. for any query which produces a master-page you only need to query the
> documents-table. (given that you don't have to filter / sort on any of the
> document-type specific fields)
> 2. clicking through to a detail-page requires only 1 query to the
> document-type table of which the entity clicked on happens to belong
> (remember I entity could only belong to 1 type)
> 3. If your application enables the user to specify to search in a specific
> document-type, that document-type table could be used to produce the
> master-page without requiring a join.
> Master-detail without requiring joins.
> Hope this makes sense,
> -----Oorspronkelijk bericht-----
> Van: Jeffrey Santos [mailto:jmsdblist@stripped]
> Verzonden: woensdag 17 september 2008 17:39
> Aan: David Ashley
> CC: mysql@stripped
> Onderwerp: Re: Appropriate Design
> Hi David,
> My project involves storing information that could have been gathered from
> wildly different sources. You can think of this part of the database as a
> sort of bibliography for the sources of that information. If I gathered
> information from a book, for instance, there are some generally required
> fields such as "Author," "Title," ... etc. A website, on the other hand,
> would also require fields such as "URL" and "Date Visited" and so on and so
> I hope that's more clear!
> ~Jeffrey Santos
> On Wed, Sep 17, 2008 at 9:57 AM, David Ashley <dashleylist@stripped>
> > On Wed, Sep 17, 2008 at 5:46 AM, Jeffrey Santos
> >> I'm not sure if this is the right list to ask such a question; if not,
> >> direction as to the appropriate forum would be much appreciated!
> >> This may be a simple question, but I'm still on the early stages of my
> >> learning about databases. I'm developing an application that will
> >> information from various sources. Since what might be considered
> >> information about those sources will vary (occasionally dramatically),
> >> unsure as to the correct design for the database tables. In other words
> >> I'm
> >> left with (I believe) two choices:
> >> 1. Create a source table that will contain, among other things, every
> >> possible field for a source type and sort out what is considered
> >> information at the application level.
> >> 2. Create a few source tables detailing the required information about
> >> each source type. The only way I can think of doing this is something
> >> along
> >> the lines of:
> >> sourceFields: fieldId, fieldName, ....
> >> sourceRequireds: reqId, typeId, fieldId, ....
> >> sourceInfo: infoId, fieldId, sourceId, ....
> >> and then pull only the appropriate data from the database at query time
> >> using JOINs.
> >> Which of these options (or a third I have no idea about!) would be
> >> appropriate here?
> >> If possible, maybe a general "conventional wisdom" statement would
> >> help my education on these matters!
> >> Thank you,
> >> Jeffrey Santos
> > The second solution (multiple tables) is generally the only acceptable
> > solution.
> > The general rule (for one table or many) is whether the fields in the
> > single table are fixed in quantity or perhaps are repetitive enough to be
> > tedious or repetitive to manipulate.
> > For example, assume you want a database of people in your state and the
> > cars they own. Clearly, since people vary widely in the number of cars
> > own and since each car has similar data, puttling fields like car1vin,
> > car2vin, car3vin in the people table is the wrong solution.
> > Your problem description is rather vague, but the way I'm reading it is
> > that your sources vary widely in the fields that need to be recorded. In
> > that case, you'd generally want a table of sources (each instance of a
> > source), a table of fields (each field that MAY be associated with a
> > source), and a table of sourcefields (the actual data for a given field
> > a given source).
> > But you really need to solidify the description of your problem or have a
> > simplest example so people can really help you. I don't fully understand
> > your description.
> > Dave.