On Wed, Sep 17, 2003 at 09:58:16PM -0400, Bruce Feist wrote:
> Bob Hall wrote:
> >Bruce Feist has initiated a discussion with me off the list,
> Off the list by accident, by the way. I sometimes forget that in this
> list I need to do a REPLY ALL. I generally don't go private unless I
> want to avoid embarassing someone or need to discuss something genuinely
> of no interest to others in the group; in this case, the conversation
> was of general interest and reasonably civil <g>.
> >The conventional definition of NULL, whether or not it is included in
> >the SQL standard, is "Not Known" or "Not Applicable". This is both
> >precise and ambiguous: Ambiguous because it has two possible meanings,
> >and precise because it has only those two meanings.
> I'll agree with that.
> >an unambiguous definition of NULL, found in the SQL standard, is
> >the value that always causes a comparison operator to return FALSE in
> >any known DBMS.
> Interesting approach. I haven't thought it through completely enough to
> decide whether or not I can agree that this is an unambiguous
> definition. For instance, I'm not sure what SQL is supposed to evaluate
> 0/0 as -- is it NULL, or does it invalidate the statement computing it,
> or is it some non-NULL null-like thing? (Rhetorical question; I don't
> necessarily expect an answer, although one would be nice.)
RDBMSs seem to be roughly evenly divided between those that return
NULL and those that return an error. This leads me to believe that
the SQL standard doesn't say what to do when 0/0 is attempted.
> I do *not* think that it's a useful definition, though, because of the
> weakness you point out. It doesn't tell the designer how to use the
> feature. I cannot think of any case where it is useful to store a NULL
> if you don't know what it means other than that it's a value which when
> compared to other things returns UNKNOWN! In many ways, I see this as
> analagous to identifying the units that a numeric field is identified
> in; the field won't tell you whether it's in fortnights or seconds, so
> the documentation must.
Like it or not, it is the definition that the RDBMS uses, so it is the
definition that your database is using, whether you are aware of it or
> >the distinction between unknown and inapplicable
> >can be important, in theory.
> >a bank may keep a
> >record of credit limits for its customers. A database designer may be
> >tempted to set aside a special number, say -1, to place in the credit
> >limit column for customers who have no credit limit.
> >the attribute of having a credit limit is
> >different information from the attribute of the credit limit size. These
> >two types of data should not be in the same column. There should be
> >a boolean column for credit limit existence (HasCreditLimit Boolean)
> >and a currency column for the amount of the credit limit.
> If you don't mind a brief red herring, consider the fact that not all
> RDBMSs support CHECK constraints, and in those which don't this approach
> gives rise to the possibility of inconsistent data: what happens if
> HasCreditLimit is FALSE and CreditLimit is $10000?
If people can enter data that will break your database, they will, so
that's not really a red herring. If you can't use a CHECK constraint,
then you need to run periodic queries that check for data consistency.
It's a red herring only in the sense that we already knew that.
> It's a red herring, of course, because we're letting the real world
> intrude upon our theoretical discussion. In any real DBMS, experienced
> DBDs know that there are compromises which must be made; these don't
> invalidate what we'd *like* to do, but sometimes are things that we must
> do to have a well-functioning, or at least usable, system. Lack of a
> CHECK constraint in the target DBMS doesn't impact what the correct
> logical design is.
It's not really a theoretical discussion. We're discussing how the
SQL standard is implemented in existing DBMSs, and how that affects
> >This mistake
> >is so consistent that I think of it as diagnostic; if someone is
> >trying to create two types of NULLs,
> I wasn't suggesting creating two types of NULLs... I was suggesting that
> if NULLs are permitted for a column, the DBD should indicate which
> meaning of NULL is in use for that column.
Only one meaning is ever permitted; the meaning hardcoded into the
RDBMS. Anything that conforms to that meaning is always permitted. You
have no way of altering that. The ambiguous definition used by designers
is only an English language approximation. It doesn't matter whether the
ambiguous definition is in the SQL standard or not because it is not
implemented in the RDBMS. If your design permits only "Unknown" or
"Inapplicable", and you use NULL to represent this, you are reintroducing
confusion that the SQL standard removed. If you remember that the ambiguous
definition closely approximates the actual definition, then you will avoid
problems. If you try to restrict the definition of NULL to one of the two
conventional meanings, you are trying to force your database to do something
that has no correlation in the way the DBMS actually functions. In the
credit limit example above, you can compensate for the lack of a CHECK
constraint by running queries that test data consistency. If you try to
restrict NULL to only one of its conventional meanings, you will have no
way of enforcing that restriction and no way of testing to ensure that the
restriction is being followed. There are no queries that you can run to
test for NULL consistency. If you can't enforce consistency or test for
inconsistency, then your database will inevitably contain inconsistent data.
> >My disagreement with Bruce Feist is not over whether NULL is ambiguous;
> >the definition that database designers use is explicitly ambiguous. We
> >disagree over how to deal with the ambiguity. My contention is that it
> >is rarely a problem in a well designed database. But I acknowledge that it
> >sometimes may be a problem. Bruce advocates restricting the meaning of
> >NULL and documenting the restriction.
> I'm not sure what you mean by "restricting". If you mean "permitting
> only one of the two possible uses of NULL for a given column", then
> agree... and I think that *you* think that this is a good idea as well,
> given your above comments on permitting both being a violation of
That's not what I said. The use of NULL, as defined in the SQL standard,
doesn't violate 1NF. Putting a NOT NULL constraint on a column and
setting up A/I-mark signifiers doesn't violate 1NF. Trying to restrict
NULL to one of its conventional meanings doesn't in itself violate 1NF,
but in practice it correlates very highly with violating 1NF. There's
no theoretical reason for that; it just works that way in practice.
> To rephrase, I'm confused about your viewpoint. On one hand you're
> saying that if a field can contain both I-mark and A-mark NULLs that's a
> bad design choice; on the other, you're saying that since most current
> RDBMSs don't distinguish betwen I-mark and A-mark NULLs, trying to
> restrict the meaning to just one is a bad design choice. You're clearly
> knowlegable and have given thought to the subject, so I believe that
> there's something about your point of view that I'm just not getting.
> Please explain!
Let's take a step back: I'm not talking about what is theoretically
correct or best. I'm talking about what actually happens to the data
in a database, given the fact that the SQL standard is hardcoded into
the DBMS. You and I agree on our theoretical preferences. My point is
that if you want to use I-mark and A-mark, you have to set the column
NOT NULL, because the SQL standard gives you no way of testing for
NULL consistency or enforcing NULL consistency. Any existing RDBMS will
always permit any NULL as long as the column permits NULL. If you
try to use NULL for A-mark only or I-mark only, the SQL standard
and human fallibility ensure that your database will contain inconsistent
data. The only way you can use NULL safely is with the understanding
that it can mean anything permitted by the SQL standard.
> >My problem with that is that the RDBMS
> >will continue to accept NULLs and process them properly, even when they
> >violate the restriction.
> >When I've needed to specify that data in a text column is unknown, I've
> >used "Unknown" or "Not Known".
> What if "Unknown" and "Not Known" are valid values for the column? I
> think that in most cases it would be best to have a separate boolean
> flag column (or two of them) if NULLs are not an option, especially if
> CHECK constraints are available to ensure consistency.
You may be correct. All NULL means is "this data is not in the database".
If you try to indicate why the data is not in the database, you are
creating a new attribute. In a sense, A/I-marks are not NULLs because
they contain information about other data. In that case, they should have
their own column. Part of the reason NULL is NULL is because it doesn't
tell you anything about the missing data.