One summer afternoon between
the wars in Cambridge, England, a group of dons from the university and
their wives were taking tea. One woman stated that she liked milk
with her tea, but she liked the tea poured in first and the milk poured
in after, and not the other way around. Many of the professors were
in the sciences, and they had a hard time believing the woman. After all,
there should be no chemical difference between milk in tea and tea in milk,
so she probably only thought she tasted a difference. Others brought up Newton's
Law of Cooling, under which the rate at which a liquid cools depends on how much
hotter it is than its surroundings, so hot tea poured into colder milk might
cool differently than cold milk poured into hotter tea.
One of the people present was Ronald
Fisher, later knighted, one of the leading lights in the field of statistics;
his interest was in testing whether her statement could be verified by
experiment. He documented his method in his 1935 book, The Design
of Experiments, and retold this story in the book's second chapter.
The basic idea of the experiment
is this: let us assume that nothing special is happening. We will design
an experiment that will test a claim that something special is happening,
and we will only accept the claim that the special thing exists if the experiment
says that "nothing special happening" is very unlikely. In the case
above, the assumption that nothing special is happening would be "The lady
cannot tell the difference between milk poured into tea and tea poured into
milk." This is called the null hypothesis, and is denoted in
the literature as H0, pronounced "H nought".
Our experiment will consist
of a certain number of trials, and in a simple situation such as this, each
trial will have some way of being judged a success or a failure; in a taste
test, success will mean correctly stating whether the drink is milk-in-tea
or tea-in-milk, and failure will be stating incorrectly. Because H0
is that she cannot tell the difference, we can say the probability of being
correct by guessing is 50%; in other experiments, the probability of being
correct may be more or less than 50% at each trial. (For example, if you
have to guess the type of jelly bean you are tasting while blindfolded, and
there are five types of jelly beans randomly distributed, the probability
of success by guessing would be 20%.) In general, the results of an
experiment will be easier to deal with if every trial is independent and
the probability of success in each trial is uniform; these criteria may not
be possible in every experiment, and the methods shown here will not be valid
if those two criteria are not met.
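As a rough illustration of what "guessing" means under H0, the sketch below simulates a subject who answers each trial at random. The function name simulate_guessing, the seed, and the choice of ten trials are assumptions made for this example only; the 50% and 20% success rates are the tea and jelly bean figures from the text.

```python
import random


def simulate_guessing(trials, p_success, rng):
    """Count successes when every trial is an independent random guess.

    Each trial succeeds with the same probability p_success, which is
    exactly the pair of criteria (independence, uniform probability)
    that the methods in this article rely on.
    """
    return sum(1 for _ in range(trials) if rng.random() < p_success)


rng = random.Random(1935)  # fixed seed so the run is repeatable

tea_score = simulate_guessing(10, 0.5, rng)    # milk-in-tea vs. tea-in-milk
jelly_score = simulate_guessing(10, 0.2, rng)  # five kinds of jelly bean

print("tea guesses correct:", tea_score, "out of 10")
print("jelly bean guesses correct:", jelly_score, "out of 10")
```

Running the simulation many times with different seeds gives a feel for how often a pure guesser scores well by luck alone.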
If the lady tasting tea does very
well in identifying the different mixtures, can we assume the null hypothesis
is incorrect? Fisher decided that if the results of an experiment showed
that there was only a 5% probability of someone performing at least as well as the
test subject just by guessing randomly, then we could reject the null hypothesis
in favor of the alternate hypothesis, or HA, and call the result statistically
significant; if the probability of matching the result at random
was only 1%, the phrase highly statistically significant could be
used instead. (These arbitrary mileposts are chosen because they are
close to the points of being 2 standard deviations away from the mean and
a little over 2.5 standard deviations away from the mean respectively, given normal
distribution of data and counting both tails.)
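The link between the 5% and 1% mileposts and standard deviations can be checked numerically. The sketch below uses Python's standard statistics.NormalDist and assumes the two-sided convention, where both tails of the normal curve are counted; the helper names two_sided_tail and z_for_two_sided are made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def two_sided_tail(z):
    """Probability of landing more than z standard deviations from the mean."""
    return 2 * (1 - std_normal.cdf(z))


def z_for_two_sided(alpha):
    """Standard deviations that leave a total of alpha in the two tails."""
    return std_normal.inv_cdf(1 - alpha / 2)


for z in (2, 3):
    print(f"beyond {z} standard deviations: {two_sided_tail(z):.4f}")

for alpha in (0.05, 0.01):
    print(f"{alpha:.0%} milepost: {z_for_two_sided(alpha):.3f} standard deviations")
```

Under these assumptions, about 4.6% of a normal distribution lies more than 2 standard deviations from the mean and about 0.3% lies more than 3 away, while the 5% and 1% mileposts themselves fall near 1.96 and 2.58 standard deviations.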
The probability of getting n
or more successes out of t trials by random guessing when the probability
of success is p (and the probability of failure, which is 1-p,
is written as q) is
$$\sum_{k=n}^{t} \binom{t}{k} p^{k} q^{t-k}$$
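The probability of n or more successes in t trials is a sum of binomial terms, one for each count from n up to t. A minimal Python sketch follows; the function name tail_probability is an assumption for this example, and n, t, p and q are the symbols defined in the text.

```python
from math import comb


def tail_probability(n, t, p):
    """Probability of n or more successes in t independent trials.

    Each trial succeeds with probability p and fails with q = 1 - p.
    The term for exactly k successes is C(t, k) * p**k * q**(t - k).
    """
    q = 1 - p
    return sum(comb(t, k) * p**k * q**(t - k) for k in range(n, t + 1))


# Ten cups of tea, guessing at 50% per cup.
for n in (8, 9, 10):
    print(f"{n} or more correct out of 10: {tail_probability(n, 10, 0.5):.4f}")
```

With ten trials at 50%, nine or more correct has probability 11/1024, or about 1.07%, while eight or more has probability 56/1024, or about 5.47%, so eight correct answers would just miss the 5% milepost.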
In the Java applet below, the
user can move the slider bars to choose an experiment consisting of between
10 and 100 trials, and the probability of success can be set at a percentage
between 10% and 90%. The visible bars of the distribution, which
is skewed if the probability of success isn't 50%, are color-coded in the
following way.
Grey - more than 50% probability of getting n successes out of t trials by random guessing
Purple - between 40% and 50% probability of getting n successes out of t trials by random guessing
Blue - between 30% and 40% probability of getting n successes out of t trials by random guessing
Cyan - between 20% and 30% probability of getting n successes out of t trials by random guessing
Green - between 10% and 20% probability of getting n successes out of t trials by random guessing
Yellow - between 5% and 10% probability of getting n successes out of t trials by random guessing
Orange - between 1% and 5% probability of getting n successes out of t trials by random guessing
Red - less than 1% probability of getting n successes out of t trials by random guessing
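The color bands amount to a lookup from a probability to a color name. The sketch below is one way to write that lookup in Python, not the applet's actual code; the name colour_for is hypothetical, and the handling of values that fall exactly on a boundary (placed in the band whose description they fit, for example exactly 1% counts as Orange) is an assumption, since the list does not say.

```python
# Lower bound of each band below Grey, checked from the highest band down.
BANDS = [
    (0.40, "purple"),
    (0.30, "blue"),
    (0.20, "cyan"),
    (0.10, "green"),
    (0.05, "yellow"),
    (0.01, "orange"),
]


def colour_for(prob):
    """Return the band color for a probability given as a fraction of 1."""
    if prob > 0.50:          # "more than 50%"
        return "grey"
    for lower, colour in BANDS:
        if prob >= lower:    # boundary values go to the band above them
            return colour
    return "red"             # "less than 1%"


for prob in (0.62, 0.45, 0.07, 0.03, 0.004):
    print(f"{prob:.1%} -> {colour_for(prob)}")
```

The orange and red bands line up with the 5% and 1% mileposts for statistically significant and highly statistically significant results.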
The fractions below the left and
right edges of the bar graph show the number of trials where the "visible"
part of the distribution begins and ends; in this applet, the visible bars
are those that have a height that is more than 0.002 of the tallest bar.
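The visibility rule can be sketched in a few lines of Python. This assumes that the height of each bar is the probability of exactly k successes out of t trials, which the description of the applet does not state outright; the name visible_range is also made up for this example.

```python
from math import comb


def visible_range(t, p, cutoff=0.002):
    """Return the first and last success counts whose bars would be drawn.

    Bar heights are assumed to be the probabilities of exactly k successes.
    A bar is visible when its height exceeds cutoff times the tallest bar.
    """
    heights = [comb(t, k) * p**k * (1 - p) ** (t - k) for k in range(t + 1)]
    tallest = max(heights)
    visible = [k for k, h in enumerate(heights) if h > cutoff * tallest]
    return visible[0], visible[-1]


print(visible_range(10, 0.5))   # a short experiment
print(visible_range(100, 0.5))  # a long experiment
print(visible_range(100, 0.9))  # a skewed experiment
```

With only ten trials at 50%, every bar clears the cutoff, so the whole range from 0 to 10 is drawn; with 100 trials the extreme counts are so unlikely that their bars drop out.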
(A historical note: In David Salsburg's The Lady Tasting
Tea: How Statistics Revolutionized Science in the 20th Century,
Hugh Smith, a witness on that summer afternoon in Cambridge, does not recall
the exact number of trials in the experiment, but does say that the lady
in question passed with flying colors, correctly identifying milk-in-tea or
tea-in-milk every time. As for the cause of the taste difference, pouring
hot tea into cold milk makes the milk curdle, but not so pouring cold milk
into hot tea.)