List:General Discussion« Previous MessageNext Message »
From:Bob Hall Date:September 17 2003 7:57pm
Subject:Re: Does Null == ""?
View as plain text  
Bruce Feist has initiated a discussion with me off the list, and 
has forced me to refine the way I express my position somewhat. 
The combination of on-list and off-list discussion has been interesting, 
but it's getting overwhelming, and reminds me that I have a tendency to 
overindulge in controversy. Or, more generally, I never met a keyboard 
I didn't like. (Actually, I own one that I hate, but that's another 
matter.) This will be my last contribution to this thread. I'll 
read the thread, so you can all make points at my expense, but I'll 
take a tip from Odysseus and strap myself to the mast.

I'm not an expert on the SQL standard. Specifically, it has occured to me 
that I don't actually know whether the SQL standard defines NULL as 
"Not Known" or "Not Applicable". I've always been told that it does, 
but I can't confirm that. 

Here's what I do know. The "Interim Report 75-02-08 to the ANSI X3 
(SPARC Study Group 1975)" described fourteen possible kinds of missing 
data. Codd divided them into two groups, designated by his I-mark and 
A-mark. The ANSI SQL committee lumped them under one heading, and called 
that NULL. I think we all agree that we would have preferred Codd's 
approach. 

The conventional definition of NULL, whether or not it is included in 
the SQL standard, is "Not Known" or "Not Applicable". This is both 
precise and ambiguous: Ambiguous because it has two possible meanings, 
and precise because it has only those two meanings. Of course, it can 
be further subdivided into fourteen types of missing data.

NULL has another, operational definition which is both precise and 
unambiguous, and I can confirm that this definition is in the SQL 
standard. This definition has several parts, which define how NULL 
effects various database operations, such as SORT, SELECT, addition, 
subtraction, etc. We only need to deal with one part, from which the 
other parts can more or less be implied. That part is the requirement 
that whenever a comparison operator compares NULL with any other value, 
including itself, the operator should return UNKNOWN. Every DBMS I know 
of implements UNKNOWN as FALSE, but the standard calls for UNKNOWN. 
So an unambiguous definition of NULL, found in the SQL standard, is 
the value that always causes a comparison operator to return FALSE in 
any known DBMS. 

The problem with the unambiguous definition is that it is often the 
hardest to use. The unambiguous definition of NULL doesn't make it 
obvious why, or when, a database would contain NULL. The "Not Known" 
or "Not Applicable" definition, while ambiguous, does make it clear.
So those of us who design databases use the ambiguous definition, and 
our computers use the unambiguous definition. This works fine, as long
as we remember that there's no ambiguity in the way that NULL is 
implemented in a properly designed RDBMS. A computer doesn't know whether 
data is unknown or inapplicable; those concepts don't mean anything to 
it. All it cares about is whether it receives a 0 or 1 from a comparison 
operation.

To carry the anthropomorphism further, a computer doesn't care, and 
an RDBMS doesn't care, but a database lives at a higher level of 
abstraction, and it may care. For example, when I bought a radiator 
for my car, the auto parts clerk asked me if I had an automatic 
transmission. Cars that have automatic transmissions need radiators 
with extra tubes for cooling the transmission fluid. Cars that have 
manual transmissions don't need the extra tubes, and the radiators 
usually aren't interchangeable. On the other hand, I use the same 
valve lifters, regardless of the transmission type. So the auto parts 
table has to include an attribute for "goes with an automatic 
transmission". For a radiator, this would contain TRUE or FALSE. 
For a valve lifter, this would be inapplicable. If it were unknown 
instead of inapplicable, the data would have to be supplied before the 
part could be sold. So the distinction between unknown and inapplicable 
can be important, in theory. 

In practice, this rarely happens. If necessary data is missing, no 
record is created until the missing data is supplied. Parts warehouses 
don't want parts in their databases if they don't know which cars can 
use them. Usually, when someone wants to make a distinction between 
unknown and inapplicable, it's because they're trying to fit two types 
of data into a single columns, which violates the rules by which 
relational databases are constructed. For example, a bank may keep a 
record of credit limits for its customers. A database designer may be 
tempted to set aside a special number, say -1, to place in the credit 
limit column for customers who have no credit limit. The idea is to 
split NULL into its two parts, and use -1 = NULL-not-applicable, and 
NULL = NULL-not-known. But the attribute of having a credit limit is 
different information from the attribute of the credit limit size. These 
two types of data should not be in the same column. There should be 
a boolean column for credit limit existence (HasCreditLimit Boolean)
and a currency column for the amount of the credit limit. This mistake 
is so consistent that I think of it as diagnostic; if someone is 
trying to create two types of NULLs, they're probably violating 
1NF. Usually this happens when a database needs to store a new attribute 
that the original designer didn't anticipate, and someone gets clever 
and tries to create special flags for existing columns instead of 
creating new columns for new attributes.  

My disagreement with Bruce Feist is not over whether NULL is ambiguous; 
the definition that database designers use is explicitly ambiguous. We 
disagree over how to deal with the ambiguity. My contention is that it 
is rarely a problem in a well designed database. But I acknowledge that it 
sometimes may be a problem. Bruce advocates restricting the meaning of NULL 
and documenting the restriction. My problem with that is that the RDBMS 
will continue to accept NULLs and process them properly, even when they 
violate the restriction. The meaning of NULL is well known, and is 
specified in the SQL standard in one form or another, so a DBA is fully 
justified in assuming that NULL is used according to the standard in 
any given database, and has no reason to look for documentation that 
tells him otherwise. NULL is any data that should cause any comparison 
to return UNKNOWN. This could be either unknown data or inapplicable data. 

This is akin to using a char or varchar column for dates. If you put 
a date into a DATE column in an invalid format, the column will spit 
the date back and complain. If you put an invalid date into a text 
column, the column will happily store it for you and all your aggregate 
queries will return inaccurate results. Similarly, if you try to use 
NULL for data that is applicable but unknown, the column will also 
accept NULL in cases where the data is inapplicable, and your assumptions 
about the meaning of NULL in that column will be wrong. The RDBMS will 
return FALSE on all comparisons with NULL, so there will be nothing to 
tell you that the column contains invalid NULLs. My experience is that 
if it is possible to put invalid data in a column, someone will do it.
(I've done data cleanup on a subset of the payroll database for a 
multinational company. Lots of people had lots of opportunities to 
screw up the data, so they did.)

When I've needed to specify that data in a text column is unknown, I've 
used "Unknown" or "Not Known". This has two advantages. First, it doesn't 
redefine an established term. (NULL is defined as anything that should cause 
a comparison to return UNKNOWN, which includes *both* unknown and inapplicable 
data. This definition is widely known and accepted, and it is implemented in 
the RDBMS. Changing the definition for a specific database or table is 
dangerous.) The other advantage is that it is self-documenting. Anyone 
seeing "Unknown" or "Inapplicable" will know immediately that one doesn't 
mean the other, or both.

Numeric columns are a bigger problem. You can put a number or a NULL into 
and INT column, but not "Unknown" or "Inapplicable". In that case, the 
best solution is to allow NULLs, but add a boolean column called Unknown, 
with the restriction that both columns can't be NULL in the same row. 
So if the INT column contains a NULL, you have to specify what the NULL 
means in the Unknown column; 1 = unknown, 0 = inapplicable. If the column 
only contains non-negative integers, then you can assign -1 and -2 the 
meanings Unknown and Inapplicable. But make sure you disallow NULLs, or 
NULL will end up being used for both (as it should be if it is allowed).

But this is coming dangerously close to cleverness, which is a bad quality 
in a database designer. These problems should occur very rarely. I'll repeat 
that if you are trying to split NULL into its two meanings, you are probably 
making a design mistake, and should add columns to capture additional data, 
rather than shoehorn multiple attributes into one column. If you have 
to create multiple NULLs more than twice in your career, reexamine your 
designs.

Bob Hall
Thread
Does NULL == ""?Randy Chrismon15 Sep
  • Re: Does NULL == ""?Bruce Feist15 Sep
  • RE: Does NULL == ""?Mark Richards15 Sep
  • Re: Does NULL == ""?Brent Baisley15 Sep
    • Re: Does NULL == ""?Keith C. Ivey15 Sep
  • Re: Does NULL == ""?Paul DuBois15 Sep
Re: Does NULL == ""?Alec.Cawley15 Sep
  • RE: Does NULL == ""?Mark Richards15 Sep
Re: Does Null == ""?Randy Chrismon15 Sep
  • RE: Does Null == ""?Mark Richards15 Sep
  • Re: Does Null == ""?Bob Hall16 Sep
    • Re: Does Null == ""?Bruce Feist16 Sep
      • Re: Does Null == ""?Dan Nelson16 Sep
        • Re: Does Null == ""?Bruce Feist16 Sep
          • Re: Does Null == ""?Bob Hall16 Sep
        • Re: Does Null == ""?Bob Hall16 Sep
      • RE: Does Null == ""?Jon Frisby16 Sep
        • Re: Does Null == ""?Bruce Feist16 Sep
          • RE: Does Null == ""?Jon Frisby16 Sep
            • Re: Does Null == ""?Bruce Feist16 Sep
              • RE: Does Null == ""?Jon Frisby16 Sep
              • Re: Does Null == ""?Bob Hall17 Sep
        • Re: Does Null == ""?Bob Hall16 Sep
      • Re: Does Null == ""?Bob Hall16 Sep
    • Re: Does Null == ""?Bob Hall17 Sep
      • Re: Does Null == ""?Bruce Feist18 Sep
        • Re: Does Null == ""?Bob Hall18 Sep
  • Re: Does Null == ""?Haydies18 Sep
    • Re: Does Null == ""?Bob Hall18 Sep