From the Exploring Data website - http://curriculum.qed.qld.gov.au/kla/eda/
© Education Queensland, 1997
The Six Characteristics of a Dataset
Once some data have been gathered, the first step in working with the data is to look at it in a variety of ways. These six characteristics of a dataset are a good starting point in analysing a dataset, although a fuller analysis extends to looking for unexpected anomalies and patterns in the data. For example, in the Metric dataset which consists of forty-four estimates of the width of a lecture hall, there are a large number of estimates of 10 m and 15 m. This is almost certainly due to subjects rounding off their estimate and is a feature more of our number system than the size of the hall.
The Old Faithful Dataset
The Old Faithful dataset has some interesting features and hence will be the example used in this article. Old Faithful is a geyser in Yellowstone National Park in Wyoming. The graphical displays below are based on 222 measurements of the duration of the geyser's eruptions, in minutes.

Shape
The shape of a dataset will be the main factor in determining which set of summary statistics best summarises the dataset, so it should be the first characteristic to be noted. Shape is commonly categorised as symmetric, left-skewed or right-skewed, and as uni-modal, bi-modal or multi-modal.
The shape of the Old Faithful dataset is bi-modal. Note that both the histogram and the dotplot do a good job of showing this, while the boxplot doesn't indicate it at all.
Location
Statisticians often use the term 'location' for what Queensland texts often call the measure of central tendency. 'Location' is both simpler and more descriptive than 'measure of central tendency', so it is the term I've adopted for this website.
When initially examining a dataset only an approximate location is needed, often just estimated by eye. After further analysis the choice of measure of location should become clearer. Common measures of location are the mean and median. Less common measures of location are the mode (the most frequent value), the mid-range (the value midway between the minimum and maximum values) and the truncated mean (where a fixed percentage of the largest and smallest scores are deleted from the dataset and the mean of the remaining data is calculated).
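As a rough sketch, the measures of location listed above can be computed with Python's standard `statistics` module. The scores are invented for illustration and are not taken from any dataset on this website. The helper names `mid_range` and `truncated_mean` are hypothetical, and the truncated mean here assumes the same share of scores is deleted from each end.

```python
import statistics

def mid_range(values):
    """Value midway between the minimum and maximum."""
    return (min(values) + max(values)) / 2

def truncated_mean(values, fraction):
    """Mean after deleting `fraction` of the scores from each end.

    Assumption: the same share is dropped from the low and the high end.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of scores to drop per end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Hypothetical scores, invented for illustration only.
scores = [2, 3, 3, 4, 5, 6, 7, 8, 9, 53]

location = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "mid-range": mid_range(scores),
    "truncated mean (10% per end)": truncated_mean(scores, 0.10),
}
for name, value in location.items():
    print(f"{name}: {value}")
```

With one very large score in the list, the mean and mid-range are pulled upwards while the median and truncated mean stay close to the bulk of the data.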
For the Old Faithful dataset, I would say none of these is a good measure of location! The mean is 3.6 and the median is 4, and it is fairly obvious that these values tell us very little about the data. A more sophisticated description of location would be to say that the data is bi-modal with one peak at about 2 minutes and the other at about 4.5 minutes. If more accurate values are wanted then the dataset could be broken into two sections and the mean or median of each section calculated independently.
This example illustrates an important point - blindly following a procedure will not always give the best results. It is necessary to look at the data and use judgement about how to describe its location.
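One way to break a bi-modal dataset into two sections is to pick a cut point by eye and summarise each side separately. The sketch below assumes a cut at 3 minutes and uses a handful of invented durations, not the actual Old Faithful measurements. The function name `summarise_sections` is hypothetical.

```python
import statistics

def summarise_sections(values, cut):
    """Split values at `cut` and return (mean, median) for each section."""
    low = [v for v in values if v < cut]
    high = [v for v in values if v >= cut]
    return {
        "low": (statistics.mean(low), statistics.median(low)),
        "high": (statistics.mean(high), statistics.median(high)),
    }

# Hypothetical durations in minutes, invented for illustration only.
durations = [1.8, 2.0, 2.2, 4.0, 4.5, 5.0]

# Assumption: a cut point of 3.0 minutes, chosen by eye between the two peaks.
sections = summarise_sections(durations, cut=3.0)
for name, (mean, median) in sections.items():
    print(f"{name} section: mean={mean:.2f}, median={median:.2f}")
```

The choice of cut point is a judgement call, so it is worth checking that the summaries do not change much when the cut is moved slightly.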
Spread
This is a measure of the amount of variation in the data. Again, an approximate value is sufficient initially, with the choice of measure of spread being informed by the shape of the data, and its intended use. Common measures of spread are variance, standard deviation and the interquartile range. Less commonly used is the range, as it is not very robust.
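The sketch below computes these measures of spread with Python's standard `statistics` module, and compares the range with the interquartile range when one extreme value is present. The values are invented for illustration. Quartiles are taken from `statistics.quantiles` with its default method, which is an assumption, as textbooks and software differ in how they define quartiles. The helper names `value_range` and `iqr` are hypothetical.

```python
import statistics

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range, using the default quartile method of
    statistics.quantiles (other conventions give slightly different results)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical values, invented for illustration only.
base = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # largest value replaced

for label, data in (("base", base), ("with extreme value", with_extreme)):
    print(label)
    print("  variance (sample):", statistics.variance(data))
    print("  standard deviation:", round(statistics.stdev(data), 2))
    print("  interquartile range:", iqr(data))
    print("  range:", value_range(data))
```

In this example a single extreme value changes the range and the standard deviation considerably, while the interquartile range is unaffected.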
For the Old Faithful dataset the standard deviation doesn't give a good picture of the spread of the data, as it is usually used when the data are approximately normally distributed, or at least uni-modal and reasonably symmetric. The interquartile range is also unsatisfactory, as it doesn't give a true picture of how the data is distributed. Probably the best description of the spread would be found by dividing this dataset into two sections and discussing the spread of each section. Either the standard deviation or the interquartile range could be used, depending on which measure of location was chosen.
Outliers
Outliers are data values that lie away from the general cluster of other data values. Each outlier needs to be examined to determine if it represents a possible value from the population being studied, in which case it should be retained, or if it is non-representative (or an error) in which case it can be excluded. It may be that an outlier is the most important feature of a dataset. There is a true story that the ozone hole above the South Pole had been detected by a satellite years before it was detected by ground-based observations, but the values were tossed out by a computer program because they were smaller than thought possible. Read the Ozone and Outliers article at this website to learn more about this fascinating story.
The best choice of display when looking for outliers is the boxplot. A glance at the boxplot of the Old Faithful dataset shows that this dataset contains no outliers. Note that the three displays complement each other in the information they provide about the data. One strong argument for the need to use computers and graphing calculators when studying statistics is the necessity of viewing the data in a variety of ways. Without technology to draw the graphs this would be impossible to do efficiently.
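Boxplots drawn by software usually flag outliers with a fixed rule. The article does not say which rule was used for its boxplot, so the sketch below assumes the common convention of flagging values more than 1.5 interquartile ranges beyond the quartiles. The data are invented and the function name `boxplot_outliers` is hypothetical.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles.

    Assumption: the 1.5 * IQR convention and the default quartile
    method of statistics.quantiles; other software may differ.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    lower_fence = q1 - k * spread
    upper_fence = q3 + k * spread
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical values, invented for illustration only.
print(boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9]))   # no value is flagged
print(boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 90]))  # 90 is flagged
```

A flagged value is only a candidate for closer examination. Whether it is retained or excluded still depends on whether it is a plausible value from the population.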
Clustering
Clustering implies that the data tends to bunch up around certain values, eg annual wages for a factory may cluster around $20 000 for unskilled factory workers, $35 000 for tradespersons and $50 000 for management. Clustering shows up most clearly on a dotplot.
The Old Faithful dataset shows two clusters centred around 2 minutes and 4.5 minutes.
Granularity
Granularity implies that only certain discrete values are allowed, eg a company may only pay salaries in multiples of $1 000. A dotplot shows granularity as stacks of dots separated by gaps. By default, discrete data has some granularity as only certain values are possible. Continuous data can show granularity if the data is rounded.
The Old Faithful dataset shows evidence of granularity. By examining the original data it becomes clear that this is the result of the data being rounded to one decimal place and is not a feature of the data itself.
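A simple check for rounding of this kind is to find the smallest number of decimal places that reproduces every value exactly, and to count how often each value repeats. The sketch below uses invented durations rather than the original data. The function name `decimal_places` is hypothetical.

```python
from collections import Counter

def decimal_places(values, max_places=6):
    """Smallest number of decimal places that reproduces every value,
    or None if more than `max_places` would be needed."""
    for places in range(max_places + 1):
        if all(round(v, places) == v for v in values):
            return places
    return None

# Hypothetical durations in minutes, invented for illustration only.
durations = [1.8, 2.0, 2.0, 2.2, 4.5, 4.5, 4.5, 5.0]

print("decimal places recorded:", decimal_places(durations))

# Repeated values form the "stacks" that a dotplot would show.
stacks = Counter(durations)
for value, count in sorted(stacks.items()):
    print(f"{value:4.1f} {'o' * count}")
```

If every value fits on a grid of one decimal place, the stacks in a dotplot are more likely a product of how the data was recorded than a feature of the quantity being measured.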
Other Features
With the availability of computers and low cost statistics software it is possible to calculate summary statistics and generate graphical displays very rapidly. The choice of bin width of a histogram can markedly alter the apparent shape of the data, especially if the data is not uni-modal. As they are so quick to generate, it may be worth our while looking at some alternative histograms to see what they show.
The choice of bin width (and hence the number of bins) does change the appearance of the histogram. Which one best gives a true picture of the data is subjective. It is a worthwhile exercise to give students a dataset that is not uni-modal and ask them to choose the best histogram and then defend their decision.
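The effect of bin width can be seen without any plotting software by counting how many values fall in each bin. The sketch below uses a few invented durations, not the actual Old Faithful measurements, and assumes bins that start at zero. The function name `bin_counts` is hypothetical.

```python
import math
from collections import Counter

def bin_counts(values, width, start=0.0):
    """Count the values in consecutive bins [start + i*width, start + (i+1)*width)."""
    counts = Counter(math.floor((v - start) / width) for v in values)
    n_bins = max(counts) + 1
    return [counts.get(i, 0) for i in range(n_bins)]

# Hypothetical durations in minutes, invented for illustration only.
durations = [1.8, 2.0, 2.2, 2.1, 4.0, 4.4, 4.5, 4.6, 5.0]

for width in (0.5, 1.0, 3.0):
    print(f"bin width {width}:")
    for i, count in enumerate(bin_counts(durations, width)):
        print(f"  {i * width:4.1f} | {'#' * count}")
```

With a bin width of 1 minute the two groups are separated by an empty bin, whereas a bin width of 3 minutes leaves only two adjacent bars and the gap between the groups is no longer visible.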