List:General Discussion« Previous MessageNext Message »
From:Michael Stassen Date:December 12 2003 5:57am
Subject:Re: mean/median/mode
View as plain text  
Mike Johnson wrote:

> From: Robert Citek [mailto:rwcitek@stripped]
> 
> 
>>Hello all,
>>
>>How can I calculate the mean/median/mode from a set of data using SQL?
>>
>>Mean seems to exist as the average (avg):
>>   select name, avg(value) from table group by name
>>
>>Is there a way to calculate median and mode with a group by clause?
>>
>>Median: the value at which 50% of the samples are above and 
>>below that value.
>>Mode: the most common value
> 
> 
> 
> For mode, this should work:
> SELECT COUNT(value) AS mode FROM table GROUP BY value ORDER BY mode DESC LIMIT 1;

I expect you meant to include the value itself:

SELECT value, COUNT(value) AS mode FROM table
GROUP BY value ORDER BY mode DESC LIMIT 1;

This will work in many cases, but modes are tricky because:

- There may not be a mode.  For a value to be a mode, it must occur at 
least twice.

- There may be more than one mode.  Any value whose frequency matches 
the max frequency is a mode.  For example, if values 2, 8, and 13 each 
occur 8 times, and all other values occur less frequently, then 2, 8, 
and 13 are modes.

I feel that there must be a better way, but the following works:

   CREATE TEMPORARY TABLE counts
   SELECT val, COUNT(val) AS freq
   FROM data GROUP BY val HAVING COUNT(*)>1 ORDER BY freq DESC;

   SELECT @mode:=MAX(freq) FROM counts;

   SELECT val AS mode, freq FROM counts WHERE freq = @mode;

   DROP TABLE IF EXISTS counts;

As a further wrinkle, considering the error inherent in statistical 
sampling, many statisticians would consider a value whose frequency is 
much greater than most frequencies but not quite as high as the max 
frequency to be a mode, or at least sort of a mode.  For example, if 
values 36 and 48 have counts of 102, value 76 has a count of 98, and 
every other value occurs no more than 25 times, most would call the 
distribution trimodal (3 modes), even though 76 is not strictly a mode.

Put another way, finding the modes is usually about finding the shape of 
the distribution.  An alternative to calculating the modes would be to 
look for them visually by plotting the distribution.  Something like this:

SELECT val, COUNT(val) AS frequency, repeat('.',COUNT(val)) AS histogram
FROM data GROUP BY val ORDER BY val;

+------+-----------+-----------+
| val  | frequency | histogram |
+------+-----------+-----------+
|    1 |         3 | ...       |
|    2 |         8 | ........  |
|    3 |         3 | ...       |
|    4 |         3 | ...       |
|    5 |         2 | ..        |
|    6 |         1 | .         |
|    7 |         3 | ...       |
|    8 |         8 | ........  |
|    9 |         1 | .         |
|   11 |         2 | ..        |
|   12 |         3 | ...       |
|   13 |         8 | ........  |
|   14 |         1 | .         |
|   15 |         1 | .         |
|   16 |         2 | ..        |
+------+-----------+-----------+
15 rows in set (0.00 sec)

You might want to know something about the size of your counts relative 
to your screen width before you try that.  You could adjust accordingly, 
however, with something like

SELECT val, COUNT(val) AS frequency,
        repeat('+',ROUND(COUNT(val)/10)) AS histogram
FROM data GROUP BY val ORDER BY val;

Michael

Thread
mean/median/modeRobert Citek4 Dec
  • Solved - median (was Re: mean/median/mode)Robert Citek5 Dec
    • Re: Solved - median (was Re: mean/median/mode)Michael Stassen11 Dec
RE: mean/median/modeMike Johnson4 Dec
  • Re: mean/median/modeMichael Stassen12 Dec
RE: Solved - median (was Re: mean/median/mode)emierzwa12 Dec