From the Exploring Data website - http://curriculum.qed.qld.gov.au/kla/eda/
© Education Queensland, 1997
Stories To Accompany Datasets
Story: AIDS and HIV Infections
This dataset contains global estimates of the cumulative number of AIDS and HIV cases since 1980. One question that could be asked is whether there is any slowing in the rate at which HIV and AIDS cases are occurring. Even after the data is transformed using logarithms there is still a strong upward trend in the residuals. Finding a model that fits the data may require looking at more sophisticated functions than the usual exponential and polynomial functions. It would also be worth finding the number of new cases every year and modelling that growth. The relationship between the number of new HIV and AIDS cases might well be worth exploring. A dataset with rich potential.
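The residual check mentioned in this story can be sketched in a few lines of Python. The figures below are made up for illustration (they are not the Worldwatch estimates), and the helper name fit_line is hypothetical. A straight line is fitted to the logarithm of the cumulative count; if growth were truly exponential the residuals would scatter randomly around zero, whereas a systematic pattern suggests that another model is needed.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Illustrative cumulative counts only (NOT the real data): growth that is
# fast, but slower than exponential.
years_since_start = list(range(1, 11))
cumulative_cases = [t ** 3 for t in years_since_start]

# Fit a line to the logged counts and keep observed-minus-fitted values.
log_cases = [math.log(c) for c in cumulative_cases]
intercept, slope = fit_line(years_since_start, log_cases)
residuals = [y - (intercept + slope * x)
             for x, y in zip(years_since_start, log_cases)]

for year, r in zip(years_since_start, residuals):
    print(year, round(r, 3))
```

With these invented numbers the residuals are negative at both ends and positive in the middle, a curved pattern that a purely exponential model cannot capture.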
Variables:
Source: Compiled by Worldwatch Institute from: Global AIDS Policy Coalition, Harvard School of Public Health Boston, MA, discussion with author, 24 January 1997.
Story: Bradmanesque
Much was made of the outstanding batting performance of Steve Waugh for the Australian cricket team over the seasons 1993-1995. Some commentators believe that over this time his performance ranked with the best players of all time and point to his batting average during this period to support this view. This is of course a difficult comparison to make because different players in different eras competed under a great range of conditions. It should be noted that a number of other players during the 1993-1995 seasons also recorded very high batting averages, so it might well be that, for whatever reason, batting was relatively easy during this time. How then can a comparison be made between Steve Waugh's performance and those of players in other eras?
Download the Word document Bradmanesque for more information and a full analysis.
Variables:
Source: Geiger, Vince, Hillbrook Anglican School, Brisbane, Australia.
Story: Carbon Dioxide Concentration
It is an intriguing exercise to examine the trends in carbon dioxide concentration in the atmosphere over the last two centuries.
Variable Names:
Source: compiled by the WorldWatch Institute from H. Friedli et al., "Ice Core Record of the 13C/12C ratio of Atmospheric CO2 in the Past Two Centuries," Nature, November 20, 1986; Charles D. Keeling and Timothy Whorf, Scripps Institution of Oceanography, La Jolla, CA, private communications, February 26, 1993 and February 14, 1994 and private communication and printout, February 5, 1996.
Story: Carbon Emissions
This dataset highlights the rise in worldwide carbon emissions from 1950 to 1995. The data appears to exhibit at least two major trends, with carbon emissions rising exponentially until the early 1970s, and then roughly linearly since then. A piece-wise function may be an appropriate choice for modelling the data.
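A piece-wise model of the kind suggested here might look like the following sketch. The function name and every parameter value are placeholders chosen only to show the shape of the model; real values would have to be estimated from the dataset, and the break year of 1973 is an assumption based on "the early 1970s".

```python
import math

def piecewise_emissions(year, start_level, growth_rate, break_year, slope):
    """Exponential growth up to break_year, then a straight line.

    All parameters are hypothetical placeholders; in practice they would be
    estimated from the dataset.
    """
    base_year = 1950
    if year <= break_year:
        return start_level * math.exp(growth_rate * (year - base_year))
    level_at_break = start_level * math.exp(growth_rate * (break_year - base_year))
    # Continue from the level reached at the break so the curve has no jump.
    return level_at_break + slope * (year - break_year)

# Placeholder parameter values (NOT fitted to the real data).
params = dict(start_level=1.0, growth_rate=0.05, break_year=1973, slope=0.04)
for y in (1950, 1960, 1973, 1985, 1995):
    print(y, round(piecewise_emissions(y, **params), 3))
```

Anchoring the linear piece at the level reached in the break year keeps the fitted curve continuous, which is usually what is wanted when the two regimes describe the same quantity.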
Other interesting trends are the different patterns for the industrialised countries, the Eastern bloc countries and the 'others' (probably best categorised as the developing countries).
Variable Names:
Source: Marland, Andres and Boden, 1995, Oak Ridge National Laboratory. 1993-95 data from the WorldWatch Institute based on estimates from Marland, Andres and Boden, on OECD and on BP.
Story: Cloud Seeding
Clouds were randomly seeded or not with silver nitrate. Rainfall amounts were recorded from the clouds. The purpose of the experiment was to determine if cloud seeding increases rainfall. The rainfall distributions are more nearly symmetric after a log transformation. The log transformation also makes the variance of the two groups more nearly equal.
After a log transformation, a pooled t-test may be appropriate. Without a transformation it is neither appropriate (failing both the normality and equal variance assumptions) nor significant at .05. Without transforming, a Mann-Whitney U test would be appropriate.
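The effect of the log transformation on the pooled t-test can be illustrated with a short Python sketch. The rainfall amounts are made up (they are not the experiment's data) and are deliberately constructed so that one group is a constant multiple of the other, that is, a purely multiplicative treatment effect. The helper pooled_t is a hypothetical name; it returns only the t statistic and degrees of freedom, so a t table or a statistics library would still be needed for a p-value.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate.

    Returns (t, degrees_of_freedom). Assumes both groups share one variance.
    """
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, df

# Made-up rainfall amounts for illustration only (NOT the experiment's data).
seeded = [3, 15, 60, 240, 1200]
unseeded = [1, 5, 20, 80, 400]

# Compare the spread of the two groups before and after taking logs.
raw_ratio = variance(seeded) / variance(unseeded)
log_seeded = [math.log(x) for x in seeded]
log_unseeded = [math.log(x) for x in unseeded]
log_ratio = variance(log_seeded) / variance(log_unseeded)

t_stat, df = pooled_t(log_seeded, log_unseeded)
print("variance ratio, raw:", round(raw_ratio, 2))
print("variance ratio, logged:", round(log_ratio, 2))
print("pooled t on logs:", round(t_stat, 3), "with", df, "df")
```

On the raw scale the group variances differ greatly, while on the log scale they agree, which is the situation in which pooling the variances is reasonable.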
A boxplot or the dotplot of rainfall for the two groups of clouds is helpful.
Reference: Chambers, Cleveland, Kleiner, and Tukey. (1983). Graphical Methods for Data Analysis. Wadsworth International Group, Belmont, CA, 351. Original Source: Simpson, Alsen, and Eden. (1975). A Bayesian analysis of a multiplicative treatment effect in weather modification. Technometrics 17, 161-166.
Variable Names:
Story: Cricket (the Insect)
I wrote an email to the EdStat mailing list:
I am browsing the web, and just came across this 'factoid':
"Factoid: Crickets make their chirping sounds by rapidly sliding one wing over the other. The faster they move their wings, the higher the chirping sound that is produced. Scientists have noticed that crickets move their wings faster in warm temperatures than in cold temperatures. Therefore, by listening to the pitch of the chirp of crickets, it is possible to tell the temperature of the air. The table below gives the recorded pitch (in vibrations per second) of a cricket chirping recorded at 15 different temperatures. [the table was supplied as a gif file]."
1. What is a factoid?
2. Does anyone have access to any real cricket and temperature data? This 'factoid' sounds suspiciously to me like the 'life of light bulbs is normally distributed' story, ie endlessly repeated, but with no basis in reality. The phrase 'Scientists have noticed....' is a dead giveaway, I reckon.
Jerry Thornhill, (jerry_thornhill@sw.cc.va.us), Southwest Virginia Community College wrote:
According to the American Heritage Electronic Dictionary, Version 3.6, a factoid is: Unverified or inaccurate information that is presented in the press as factual, often as part of a publicity effort, and that is then accepted as true because of constant repetition
In this case, factoid (with the above definition) is probably inappropriate. In a 1948 book called The Song of Insects, George W. Pierce, a Harvard physics professor, presented real data relating the number of chirps per second for striped ground crickets to the temperature in degrees F. The data is real cricket and temperature data. Apparently the number of chirps represents some kind of average since it is given to the nearest tenth. I have no idea whether the original book is still in print.
Variable Names:
Source: Pierce, George W. 1948. The Song of Insects.
Story: Global Mean Temperature (1)
How much the current rise in global temperatures is due to man and how much is part of normal variation in global temperatures is very much in dispute. This dataset contains the global mean temperature from 1866 to 1996. There are intriguing non-random patterns in the data.
Variable Names:
Source: Compiled by Worldwatch Institute from James Hansen and Reto Ruedy, Goddard Institute for Space Studies, 14 January 1997.
Story: Global Mean Temperature (2)
This interesting dataset contains data on global temperature and atmospheric CO2 concentrations for the last 159 000 years. Because there is no year 0, the number of years before the present is used in place of a year variable. The temperature given is (I think) relative to present-day temperatures.
Variable Names:
Source: Compiled by Worldwatch Institute from J.M. Barnola et al. "Historical CO2 Record from the Vostok Ice Core," in Thomas A. Boden et al., eds., Trends '93: A Compendium of Data on Global Change (Oak Ridge, TN.: Oak Ridge National Laboratory, 1994); J. Jouzel et al., "Vostok Isotopic Temperature Record," in Thomas A. Boden et al.; Timothy Whorf, Scripps Institution of Oceanography, La Jolla, CA, private communication, February 2, 1995.
Story: Measuring Air Pollution
An oil refinery northeast of San Francisco conducted a series of 31 daily measurements of the carbon monoxide levels arising from one of their stacks between April 16 and May 16, 1993. The measurements were submitted as evidence for establishing a baseline to the Bay Area Air Quality Management District (BAAQMD). BAAQMD personnel had also made 9 independent measurements of the carbon monoxide from this same stack over the period from September 11, 1990 to March 30, 1993.
In this case, the refinery had an incentive to overestimate carbon monoxide emissions. The data show that both the average and median carbon monoxide measurements by the refinery were higher than the measurements by the BAAQMD. A t-test of the significance of this difference is not appropriate, however, as a line plot of the carbon monoxide measurement over time shows a cyclical pattern, violating the assumption of independently distributed observations.
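One simple numerical companion to the line plot is the lag-1 autocorrelation, sketched below in Python. The readings are synthetic, with a cycle built in (they are not the refinery's measurements), and the function name is hypothetical. Values near zero are consistent with independent observations; values near +1 or -1 are not.

```python
import math

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    # Sum of products of each deviation with the next day's deviation.
    numer = sum((series[i] - m) * (series[i + 1] - m) for i in range(n - 1))
    return numer / denom

# Synthetic daily readings with a built-in 12-day cycle (NOT the real data).
cyclical = [30 + 10 * math.sin(2 * math.pi * day / 12) for day in range(36)]
print(round(lag1_autocorrelation(cyclical), 3))
```

A strongly positive value, as produced by this cyclical series, indicates that neighbouring days carry much the same information, so the effective sample size is smaller than the number of measurements.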
Variable Names:
Story: The Old Faithful Geyser
Old Faithful is a geyser in Yellowstone National Park in Wyoming, USA. As it is a major tourist attraction, being able to predict the timing and length of the next eruption would be useful. The Old Faithful dataset contains data about the date of the observation, the duration of an eruption and the time between eruptions.
The Old Faithful dataset is often used to demonstrate how altering the width of the bins of a histogram can alter how the histogram displays the shape of the distribution of the data. This effect is most pronounced with bi-modal data. Both the duration of eruptions and the time between eruptions of the Old Faithful geyser have bi-modal distributions and hence exhibit this feature.
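The bin-width effect is easy to reproduce. The sketch below uses a handful of hypothetical waiting times in whole minutes (not the Old Faithful data) and a hypothetical helper, bin_counts, which assumes integer data so that bin boundaries are exact.

```python
def bin_counts(data, width, start):
    """Count integer observations in consecutive bins of the given width,
    the first bin beginning at start."""
    n_bins = (max(data) - start) // width + 1
    counts = [0] * n_bins
    for x in data:
        counts[(x - start) // width] += 1
    return counts

# Hypothetical waiting times in whole minutes (NOT the Old Faithful data).
waits = [50, 52, 54, 55, 57, 76, 78, 80, 82, 85]

narrow = bin_counts(waits, width=5, start=50)
wide = bin_counts(waits, width=20, start=50)
print("5-minute bins: ", narrow)   # [3, 2, 0, 0, 0, 2, 2, 1]
print("20-minute bins:", wide)     # [5, 5]
```

With 5-minute bins the two clusters and the gap between them are obvious; with 20-minute bins the same data looks flat and the bi-modal shape disappears.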
Variable Names:
Story: Reining in the Wild Horses
Management of the growing mustang population on federal lands has been a controversial issue. A suggested method for controlling overpopulation is to sterilize the dominant male in each group. Eagle, Asa, and Garrott et al. (1993) conducted an experiment evaluating the effectiveness of sterilizing the dominant males as a way to reduce foaling (birth) rates for 2 or more years.
The researchers chose two Herd Management Areas (HMAs), Flanigan in northwestern Nevada and Beaty Butte in southeastern Oregon, for this study. In December 1985, they rounded up the horses in bands and counted all individual horses, determined their sex, and estimated their ages by looking at tooth wear. They photographed all horses three years old or older and fitted them with numbered collars to assist in identification throughout the study. They identified the dominant male in each band, vasectomized it, and fitted it with a radio-transmitting collar. Finally, they released the band as a group. Between June 1986 and July 1988 they attempted to locate each sterilized male 3-4 times a year by aerial survey from helicopter. The researchers recorded the number of adults and foals in each group containing a sterilized male (treated groups), and in the groups without a sterilized male (untreated groups).
While the researchers could not record actual birthrates in the bands of horses, the number of foals per 100 adults in each band is a good substitute.
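The substitute measure is simple arithmetic, shown here as a small Python sketch. The function name and the counts are hypothetical and are not taken from the study.

```python
def foals_per_100_adults(foals, adults):
    """Foaling-rate proxy: number of foals for every 100 adults in a band."""
    if adults <= 0:
        raise ValueError("a band must contain at least one adult")
    return 100 * foals / adults

# Hypothetical counts for two bands (NOT taken from the study).
print(foals_per_100_adults(foals=3, adults=12))   # 25.0
print(foals_per_100_adults(foals=1, adults=8))    # 12.5
```

Scaling to 100 adults puts bands of different sizes on a common footing, so treated and untreated groups can be compared directly.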
Reference: Eagle, T. C., Asa, C., and Garrott, R. et al. (1993), "Efficacy of Dominant Male Sterilization To Reduce Reproduction in Feral Horses," Wildlife Society Bulletin, 21(2), 116-121.
Variable Names:
Story: The Size of Alligators
Many wildlife populations are monitored by taking aerial photographs. Information about the number of animals and their whereabouts is important to protecting certain species and to ensuring the safety of surrounding human populations.
In addition, it is sometimes possible to monitor certain characteristics of the animals. The length of an alligator can be estimated quite accurately from aerial photographs or from a boat. However, the alligator's weight is much more difficult to determine. In the example below, data on the length (in inches) and weight (in pounds) of alligators captured in central Florida are used to develop a model from which the weight of an alligator can be predicted from its length.
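One possible form for such a model, offered here only as an assumption and not necessarily the one used in the accompanying analysis, is a power law, weight = a * length ** b, fitted by least squares after taking logarithms of both variables. The measurements in this Python sketch are invented and follow an exact cubic law; they are not the Florida data.

```python
import math

def fit_power_law(lengths, weights):
    """Fit weight = a * length ** b by least squares on the log-log scale."""
    xs = [math.log(v) for v in lengths]
    ys = [math.log(v) for v in weights]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Invented measurements that follow an exact cubic law (NOT the Florida data).
lengths_in = [60, 70, 80, 90, 100]
weights_lb = [0.0002 * length ** 3 for length in lengths_in]

a, b = fit_power_law(lengths_in, weights_lb)
predicted = a * 85 ** b   # predicted weight for an 85-inch alligator
print(round(b, 3), round(predicted, 1))
```

An exponent near 3 would be plausible on physical grounds, since weight scales roughly with volume, but whether the real data supports it is for the analysis to decide.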
Download the Word document, Curve Fitting to get the full story and analysis.
Variables:
Story: Smoking and Cancer
Government statisticians in England conducted a study of the relationship between smoking and lung cancer. The data concern 25 occupational groups and are condensed from data on thousands of individual men. The explanatory variable is the number of cigarettes smoked per day by men in each occupation relative to the number smoked by all men of the same age. This smoking ratio is 100 if men in an occupation are exactly average in their smoking, it is below 100 if they smoke less than average, and above 100 if they smoke more than average. The response variable is the standardized mortality ratio for deaths from lung cancer. It is also measured relative to the entire population of men of the same ages as those studied, and is greater or less than 100 when there are more or fewer deaths from lung cancer than would be expected based on the experience of all English men.
Variable Names:
Story: Speed of Light
In 1879, A. A. Michelson made 100 determinations of the velocity of light in air using a modification of a method proposed by the French physicist Foucault. These measurements were grouped into five trials of 20 measurements each. The numbers are in km/sec, and have had 299,000 subtracted from them. The currently accepted "true" velocity of light in vacuum is 299,792.5 km/sec.
The data are given here as reported by Stigler. Stigler has applied the corrections used by Michelson and reports that the "true" value appropriate for comparison to these measurements is 734.5. Each trial may be a summary of several experimental observations.
Because the speed of light is a physical constant, we know (to a close approximation) the "true" value that Michelson was trying to measure. It is therefore possible to test the null hypothesis that the true mean = 734.5 for each of the trials or for all 100 determinations taken together.
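The test statistic for this null hypothesis can be computed as in the Python sketch below. The coded measurements are invented for illustration (they are not Michelson's values), and the function name is hypothetical. Only the t statistic and its degrees of freedom are returned; a t table or a statistics library is needed to turn them into a p-value.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t statistic for H0: population mean equals mu0, with n - 1 df."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n)), n - 1

# Invented coded measurements (km/sec minus 299,000), NOT Michelson's values.
trial = [760, 810, 790, 830, 770, 800]
t_stat, df = one_sample_t(trial, mu0=734.5)
print(round(t_stat, 2), df)
```

A large positive t statistic would indicate that the measurements in a trial sit systematically above the value they were meant to estimate.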
There is evidence of trouble in the data. Boxplots of the trials side-by-side indicate that not all were equally variable nor even centered on the same value. A one-way ANOVA confirms this. One might consider startup effects and underlying bias in the instrument.
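The F statistic behind a one-way ANOVA can be sketched in Python as follows. The three tiny groups are invented stand-ins for the five trials (they are not the real data), and the function name is hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Tiny invented groups standing in for the trials (NOT the real data).
trials = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(one_way_anova_f(trials))   # (3.0, 2, 6)
```

Note that the F test itself assumes equal variances across groups, so unequal spread in the boxplots is a reason to interpret the ANOVA result with some caution.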
Reference: Stigler, Stephen M. (1977). Do Robust Estimators Work with Real Data? The Annals of Statistics 5:4, 1075.
Variable Names:
Story: World Oil Production
Mathematical models which describe physical phenomena are often very accurate, reflecting the simple underlying formula that links the variables and the ability to measure such variables precisely. We usually aren't so fortunate when we model activities that involve nature and biological processes, while those involving people are the most difficult of all to model accurately.
The data in the table is the world oil production measured in millions of barrels. Your task is to find a function to model this data, discussing limitations of your model, and its usefulness as a predictor of future production.
Download the Word document, Curve Fitting to get the full story and analysis.
Variables:
Story: World Population Figures
The growth in the world's population is roughly exponential, but there are some intriguing patterns to the data brought out by a residual plot. The dataset consists of two pairs of data - one listing the world's population from the year 0 to 1995, and the other giving the population of the world from 1950 to 1995.
Variable Names:
Story: Year 10 Certificates
Education Queensland issues the Year 10 Certificate to all state school students who complete year 10. This dataset contains information on the number of certificates ordered by each school, and the region and district in which the school is located. This dataset has an unusual distribution due to a large number of schools with a small year 10 population, and one large outlier. Students will have to make a decision on the best displays for this dataset. They will also need to make a decision about whether to retain the outlier. The outlier is the School of Distance Education so there is a case to be made for discarding it as it is an atypical school.
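The influence of a single extreme value on summary statistics can be seen in a short Python sketch. The order counts are hypothetical (they are not the real dataset): many small schools, a couple of large ones, and one extreme value.

```python
from statistics import mean, median

# Hypothetical certificate orders per school (NOT the real dataset).
orders = [10, 12, 15, 18, 20, 150, 200, 2400]
without_outlier = [n for n in orders if n != max(orders)]

print("with outlier:   ", mean(orders), median(orders))
print("without outlier:", round(mean(without_outlier), 1), median(without_outlier))
```

The mean changes several-fold when the extreme value is dropped while the median barely moves, which is one reason the median and a boxplot may suit this kind of distribution better than the mean and a histogram.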