Bruce Feist has initiated a discussion with me off the list, and
has forced me to refine the way I express my position somewhat.
The combination of on-list and off-list discussion has been interesting,
but it's getting overwhelming, and reminds me that I have a tendency to
overindulge in controversy. Or, more generally, I never met a keyboard
I didn't like. (Actually, I own one that I hate, but that's another
matter.) This will be my last contribution to this thread. I'll
read the thread, so you can all make points at my expense, but I'll
take a tip from Odysseus and strap myself to the mast.
I'm not an expert on the SQL standard. Specifically, it has occured to me
that I don't actually know whether the SQL standard defines NULL as
"Not Known" or "Not Applicable". I've always been told that it does,
but I can't confirm that.
Here's what I do know. The "Interim Report 75-02-08 to the ANSI X3
(SPARC Study Group 1975)" described fourteen possible kinds of missing
data. Codd divided them into two groups, designated by his I-mark and
A-mark. The ANSI SQL committee lumped them under one heading, and called
that NULL. I think we all agree that we would have preferred Codd's
approach.
The conventional definition of NULL, whether or not it is included in
the SQL standard, is "Not Known" or "Not Applicable". This is both
precise and ambiguous: Ambiguous because it has two possible meanings,
and precise because it has only those two meanings. Of course, it can
be further subdivided into fourteen types of missing data.
NULL has another, operational definition which is both precise and
unambiguous, and I can confirm that this definition is in the SQL
standard. This definition has several parts, which define how NULL
effects various database operations, such as SORT, SELECT, addition,
subtraction, etc. We only need to deal with one part, from which the
other parts can more or less be implied. That part is the requirement
that whenever a comparison operator compares NULL with any other value,
including itself, the operator should return UNKNOWN. Every DBMS I know
of implements UNKNOWN as FALSE, but the standard calls for UNKNOWN.
So an unambiguous definition of NULL, found in the SQL standard, is
the value that always causes a comparison operator to return FALSE in
any known DBMS.
The problem with the unambiguous definition is that it is often the
hardest to use. The unambiguous definition of NULL doesn't make it
obvious why, or when, a database would contain NULL. The "Not Known"
or "Not Applicable" definition, while ambiguous, does make it clear.
So those of us who design databases use the ambiguous definition, and
our computers use the unambiguous definition. This works fine, as long
as we remember that there's no ambiguity in the way that NULL is
implemented in a properly designed RDBMS. A computer doesn't know whether
data is unknown or inapplicable; those concepts don't mean anything to
it. All it cares about is whether it receives a 0 or 1 from a comparison
operation.
To carry the anthropomorphism further, a computer doesn't care, and
an RDBMS doesn't care, but a database lives at a higher level of
abstraction, and it may care. For example, when I bought a radiator
for my car, the auto parts clerk asked me if I had an automatic
transmission. Cars that have automatic transmissions need radiators
with extra tubes for cooling the transmission fluid. Cars that have
manual transmissions don't need the extra tubes, and the radiators
usually aren't interchangeable. On the other hand, I use the same
valve lifters, regardless of the transmission type. So the auto parts
table has to include an attribute for "goes with an automatic
transmission". For a radiator, this would contain TRUE or FALSE.
For a valve lifter, this would be inapplicable. If it were unknown
instead of inapplicable, the data would have to be supplied before the
part could be sold. So the distinction between unknown and inapplicable
can be important, in theory.
In practice, this rarely happens. If necessary data is missing, no
record is created until the missing data is supplied. Parts warehouses
don't want parts in their databases if they don't know which cars can
use them. Usually, when someone wants to make a distinction between
unknown and inapplicable, it's because they're trying to fit two types
of data into a single columns, which violates the rules by which
relational databases are constructed. For example, a bank may keep a
record of credit limits for its customers. A database designer may be
tempted to set aside a special number, say -1, to place in the credit
limit column for customers who have no credit limit. The idea is to
split NULL into its two parts, and use -1 = NULL-not-applicable, and
NULL = NULL-not-known. But the attribute of having a credit limit is
different information from the attribute of the credit limit size. These
two types of data should not be in the same column. There should be
a boolean column for credit limit existence (HasCreditLimit Boolean)
and a currency column for the amount of the credit limit. This mistake
is so consistent that I think of it as diagnostic; if someone is
trying to create two types of NULLs, they're probably violating
1NF. Usually this happens when a database needs to store a new attribute
that the original designer didn't anticipate, and someone gets clever
and tries to create special flags for existing columns instead of
creating new columns for new attributes.
My disagreement with Bruce Feist is not over whether NULL is ambiguous;
the definition that database designers use is explicitly ambiguous. We
disagree over how to deal with the ambiguity. My contention is that it
is rarely a problem in a well designed database. But I acknowledge that it
sometimes may be a problem. Bruce advocates restricting the meaning of NULL
and documenting the restriction. My problem with that is that the RDBMS
will continue to accept NULLs and process them properly, even when they
violate the restriction. The meaning of NULL is well known, and is
specified in the SQL standard in one form or another, so a DBA is fully
justified in assuming that NULL is used according to the standard in
any given database, and has no reason to look for documentation that
tells him otherwise. NULL is any data that should cause any comparison
to return UNKNOWN. This could be either unknown data or inapplicable data.
This is akin to using a char or varchar column for dates. If you put
a date into a DATE column in an invalid format, the column will spit
the date back and complain. If you put an invalid date into a text
column, the column will happily store it for you and all your aggregate
queries will return inaccurate results. Similarly, if you try to use
NULL for data that is applicable but unknown, the column will also
accept NULL in cases where the data is inapplicable, and your assumptions
about the meaning of NULL in that column will be wrong. The RDBMS will
return FALSE on all comparisons with NULL, so there will be nothing to
tell you that the column contains invalid NULLs. My experience is that
if it is possible to put invalid data in a column, someone will do it.
(I've done data cleanup on a subset of the payroll database for a
multinational company. Lots of people had lots of opportunities to
screw up the data, so they did.)
When I've needed to specify that data in a text column is unknown, I've
used "Unknown" or "Not Known". This has two advantages. First, it doesn't
redefine an established term. (NULL is defined as anything that should cause
a comparison to return UNKNOWN, which includes *both* unknown and inapplicable
data. This definition is widely known and accepted, and it is implemented in
the RDBMS. Changing the definition for a specific database or table is
dangerous.) The other advantage is that it is self-documenting. Anyone
seeing "Unknown" or "Inapplicable" will know immediately that one doesn't
mean the other, or both.
Numeric columns are a bigger problem. You can put a number or a NULL into
and INT column, but not "Unknown" or "Inapplicable". In that case, the
best solution is to allow NULLs, but add a boolean column called Unknown,
with the restriction that both columns can't be NULL in the same row.
So if the INT column contains a NULL, you have to specify what the NULL
means in the Unknown column; 1 = unknown, 0 = inapplicable. If the column
only contains non-negative integers, then you can assign -1 and -2 the
meanings Unknown and Inapplicable. But make sure you disallow NULLs, or
NULL will end up being used for both (as it should be if it is allowed).
But this is coming dangerously close to cleverness, which is a bad quality
in a database designer. These problems should occur very rarely. I'll repeat
that if you are trying to split NULL into its two meanings, you are probably
making a design mistake, and should add columns to capture additional data,
rather than shoehorn multiple attributes into one column. If you have
to create multiple NULLs more than twice in your career, reexamine your
designs.
Bob Hall