From the Exploring Data website - http://curriculum.qed.qld.gov.au/kla/eda/
© Education Queensland, 1997
The Mean? Who Needs It
I had a clever idea some time ago. I wondered if I could convince the statistics world to abolish the mean in descriptive statistics as a measure of the center of a dataset. I could become famous as the man who put the mean in its rightful place. So I put my persuasive argument to the EdStat mailing list. Unfortunately not everyone agreed with me.
Rex Boggs wrote:
The NCSS help file says that for a symmetric distribution, the mean, median, mode and truncated mean are identical. It further states that for a non-symmetric distribution, the median is preferable for describing the location (i.e. centre) of the distribution as it is less affected by extreme values.
Given these two statements, it seems to me that there is absolutely no need for the mean if all that is wanted are descriptive statistics. Hence, I propose that for describing a dataset, the word 'average' should by default refer to the median.
Of course the mean is needed for inference, so it still has a job to do.
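The two statements quoted from the help file can be checked on a small example. The sketch below is only an illustration: the data values are hypothetical and chosen so that the first list is exactly symmetric, and the second list differs from it only by one extreme value.

```python
from statistics import mean, median

# Hypothetical, exactly symmetric data (illustrative values only).
symmetric = [2, 4, 5, 6, 8]

# The same data with the largest value replaced by an extreme one.
skewed = [2, 4, 5, 6, 80]

for label, data in [("symmetric", symmetric), ("skewed", skewed)]:
    # For the symmetric list the two measures coincide; for the skewed
    # list the mean is pulled toward the extreme value, the median is not.
    print(f"{label}: mean={mean(data)}, median={median(data)}")
```

On the symmetric list both measures equal 5. On the skewed list the median stays at 5 while the mean moves to 19.4.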
Responses
From Donald F. Burrill
Seems to me you aren't interpreting the two statements with an adequate degree of fuzziness. Hardly any real distributions are exactly symmetrical, but some are approximately so; for them, the various averages you cite are not identical, but may not be different enough to amount to a hill of beans. On the other hand, they may be; you can't tell if you don't look, and you can't look if you don't bother computing the several values.
Even for asymmetrical distributions, some are more asymmetric than others, and one presumably would like to have some notion of whether one is just a little unsymmetrical, or wildly unsymmetrical, or ... And for any distribution, only the mean can tell you anything useful about the sum of all values (aka the grand total), and for some variables the sum (and therefore the mean) is of some considerable importance.
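The point about the grand total can be made concrete with a short sketch. The figures are hypothetical; the only claim illustrated is that the count times the mean recovers the sum, whereas the count times the median generally does not.

```python
from statistics import mean, median

# Hypothetical values with one large observation (illustrative only).
values = [12, 15, 11, 14, 98]

n = len(values)
total = sum(values)

# The mean is defined as total / n, so n * mean gives back the total.
total_from_mean = n * mean(values)

# The median carries no such guarantee; here it badly understates the total.
total_from_median = n * median(values)

print(total, total_from_mean, total_from_median)
```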
"Hence, I propose that for describing a dataset, the word 'average' should by default refer to the median."
I would propose that if you want to refer to the median, you CALL it the median; and similarly for the mean. If you find it useful to use so ill-defined a term as 'average' and want to give it a conventional definition (by which I mean, define it to have a particular meaning by your [possibly idiosyncratic] convention), say so at the beginning of whatever discourse you embark on, and include that convention in a formal Glossary. To refer to 'median' when you say (or write) 'average' (and to do so silently) is to generate confusion among those for whom 'average' had already been more or less routinely and conventionally defined as 'arithmetic mean', and to make useful communication, always difficult at the best of times, nearly impossible.
"Of course the mean is needed for inference, so it still has a job to do."
Several jobs, in fact.
From Dave Krantz
In reply to Rex Boggs' question, "Do we really need the mean in descriptive stats?" I offer the following thoughts on location descriptors.
I'm not very clear on what is meant by "descriptive statistics". To be honest, I don't think there is any such thing, except as a textbook heading to refer to the things that are introduced prior to consideration of sampling distributions. Any description must have a purpose if it is to be useful--it is supposed to convey something real. The line between "mere description" and suggesting some sort of inference is very fuzzy.
To convey an accurate picture of a distribution requires different methods depending on WHY the picture is being conveyed and on the shape of the distribution. For example, for many social-policy purposes, the best descriptor of an income distribution might be the 20th or 25th percentile, which indicates how badly off the poorest families are. To give a fairly complete description one may need many percentiles.
Much of the purpose of description and inference is comparison. To compare LOCATIONS of two distributions that have approximately the same shape, any percentile is as good as any other - if the shapes are the same, the distance between the 5th percentiles will be the same as that between the 50th percentiles, or between the means, for that matter. So the selection of a location descriptor has much more to do with inferential reliability than with the logic of comparison in this case. For most commonly encountered distribution shapes, it would take quite a lot of data to estimate differences in 5th percentiles accurately, and much less data to estimate differences in 50th percentiles; so even though the location difference is estimated either way, it is estimated more reliably by medians. (Note that even this might not be true for a distribution shape that has a strong primary or secondary mode near the 5th percentile and is very stretched out around the 50th.) Means are favoured because their estimates use all the data very efficiently, but sometimes other estimators are favoured because they are less sensitive to contamination by a few extreme and perhaps erroneous observations.
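Both halves of this argument can be sketched numerically. The example below is an illustration under stated assumptions, not part of the original post: the data values, the shift of 10, the sample size and the number of replications are all hypothetical, the normal distribution stands in for a "commonly encountered" shape, and the nearest-rank rule is just one of several percentile conventions.

```python
import math
import random
from statistics import mean, stdev

def percentile(data, p):
    """Nearest-rank percentile (one simple convention among several)."""
    ordered = sorted(data)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Part 1: a pure shift moves every percentile, and the mean, by the same amount.
base = [3, 7, 8, 12, 13, 14, 18, 21, 25, 39]  # hypothetical values
shifted = [x + 10 for x in base]              # same shape, moved by 10
shifts = {
    "5th": percentile(shifted, 5) - percentile(base, 5),
    "50th": percentile(shifted, 50) - percentile(base, 50),
    "mean": mean(shifted) - mean(base),
}
print(shifts)

# Part 2: how much each location estimate varies from sample to sample,
# assuming normally distributed data (an assumption made for this sketch).
rng = random.Random(0)  # fixed seed so the run is repeatable
n, reps = 100, 2000
p5s, medians, means = [], [], []
for _ in range(reps):
    sample = [rng.gauss(0, 1) for _ in range(n)]
    p5s.append(percentile(sample, 5))
    medians.append(percentile(sample, 50))
    means.append(sum(sample) / n)

# A smaller spread across replications means a more reliable estimate.
spread = {"5th": stdev(p5s), "50th": stdev(medians), "mean": stdev(means)}
print(spread)
```

Under these assumptions the 5th percentile varies most across samples, the median less, and the mean least. A differently shaped distribution, such as the one described in the parenthetical note, could change that ordering.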
To compare the LOCATIONS of two distributions that differ greatly in shape is not a well-defined task. It really depends on the purposes of the comparison. As I mentioned, if one is interested in poverty comparisons, looking at the location difference at the low percentiles might make sense; if one is interested in sums (e.g., an airline wants to estimate TOTAL weight of passengers' baggage) then the mean is definitely what one needs.
In short, I think most of what is written about location in elementary sources is far too simple to be taken as a standard. Oversimplification may or may not be appropriate for the beginning student; but at some point early in the student's career, she or he should be induced to think seriously about the question of location of a distribution and the different meanings it has for different tasks and different shapes of distributions.
From Bob Frick
I thank Rex Boggs for a stimulating comment on a slow day. My two cents:
Suppose the question is how long does it take to drive from my home to work, when traffic is light. My impression is that, prior to the 1800s, this question would have been thought meaningless -- obviously, it will be a different time every time I make the drive. However, we now have a concept which could loosely be called propensity. (And my impression is that Quetelet gets credit for this.) To measure the "time" it takes me to drive to work, I make this drive several times, take the average, and call that an estimate of the propensity. The deviations from the propensity are then thought of as error, though they would presumably contain very little measurement error.
So, as Dave Krantz noted, the average is not merely descriptive, it is inferential. Of course, it is circular in what it is inferential to -- it is a measure of propensity, which is defined as the average. But propensities, bogus or not, seem to be the building block of much of psychology. IQ and extroversion, for example, are propensities calculated by averaging. Similarly, it would be difficult to understand "males are taller than females" or "Miami is warmer than Boston" without reference to propensities and hence averages.