Learning From Data

Preface
Many people contributed to the success of this edition. First and
foremost, we thank all the distinguished researchers who made time
to write essays for this edition. We owe a great debt to them for their
time, insight, and contribution to statistics education. Our thanks
also go to David Moore for crafting a fine introduction to the
collection. There are others deserving of our thanks as well: Larry Weldon
for his thoughtful comments on the essays and for writing most of
the study questions that appear at the end of each essay; Judith Tanur
for her priceless advice and support; Frederick Mosteller and the
editorial boards of the previous SAGTU editions for providing such
successful models for this edition; and our editor at Duxbury Press,
Carolyn Crockett, for her good humor and unwavering support.
Each essay was reviewed by several students, who provided
comments and suggestions for revision, and we thank them for their
work: Mathew Bowyer, Julia Busso, Marissa Deffebach, Dawn Eash,
Breanne Henkelman, Katherine Ianiro, Mary Joynt, Brad Kaplan,
Jordan Keene, Amanda King, Michael Kolkowski, Maya Markowitz,
Katie Pesicka, Rebecca Russ, Tierra Stimson.
We also thank the American Statistical Association for its support
of this project, which included providing the funding necessary for
the book to be produced in color. Our thanks also go to the
Sloan Foundation for providing the original funding for the first
edition of this book.
Finally, I would like to express my deepest gratitude to the
members of the editorial board. It was a pleasure to work with such a
supportive and responsive group of individuals. Each member of this
eminent group contributed many, many hours to this project, and
their commitment and dedication to it were extraordinary.
Roxy Peck, for the SAGTU Editorial Board
INTRODUCTION
Learning from Data

DAVID S. MOORE
Purdue University
What genes are active in a tissue? Answering this question can
unravel basic questions in biology, distinguish cancer cells from normal
cells, and distinguish between closely related types of cancer. To learn
the answer, apply the tissue to a "microarray" that contains thousands
of snippets of DNA arranged in a grid on a chip about the size of
your thumb. As DNA in the tissue binds to the snippets in the array,
special recorders pick up spots of light of varying color and intensity
across the grid and store what they see as numbers.
What's hot in popular music this week? SoundScan (part of
Nielsen Media Research) knows. SoundScan collects data
electronically from the cash registers in more than 14,000 retail outlets, and
also collects data on download sales from websites. When you buy a
CD, the checkout scanner is probably telling SoundScan what you
bought. SoundScan provides this information to Billboard
Magazine, MTV, and VH1, as well as to record companies and artists'
agents.
Should women take hormones such as estrogen after menopause,
when natural production of these hormones ends? In 1992, several
major medical organizations said yes. In particular, women who took
hormones seemed to reduce their risk of a heart attack by 35% to
50%. The risks of taking hormones appeared small compared with
the benefits. But in 2002, the National Institutes of Health declared
these findings wrong. Use of hormones after menopause immediately
plummeted. Both recommendations were based on extensive studies.
What happened?
DNA microarrays, SoundScan, and medical studies all produce
data (numerical facts), and lots of them. Using data effectively is a
large and growing part of most professions. Reacting to data is part
of everyday life. That's why statistics is important:
Statistics is the science of learning from data.
The essays in this book illustrate learning from data in many
settings. To get started, here are some comments on how we learn.
WHERE THE DATA COME FROM MATTERS

What's behind the flip-flop in advice offered to women about
hormone replacement? The evidence in favor of hormone replacement
came from a number of observational studies that compared women
who were taking hormones with others who were not. But women
who choose to take hormones are very different from women who do
not: they are richer and better educated and see doctors more often.
These women do many things to maintain their health. It isn't
surprising that they have fewer heart attacks.
Large and careful observational studies are expensive, but are
much easier to arrange than careful experiments. Experiments don't
let women decide what to do. They assign women to either hormone
replacement or to dummy pills that look and taste the same as the
hormone pills. The assignment is done by a coin toss, so that all
kinds of women are equally likely to get either treatment. Part of the
difficulty of a good experiment is persuading women to agree to
accept the result (invisible to them) of the coin toss. By 2002, several
experiments with women of different ages agreed that hormone
replacement does not reduce the risk of heart attacks.
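The contrast between letting women choose and letting a coin choose can be sketched in a short simulation. Everything in it is an illustrative assumption rather than data from any study: the population is invented, "health-conscious" stands in for the many ways hormone users differ from non-users, and the 70% and 30% choice rates are made up. The names `share_health_conscious` and `compare_assignments` are hypothetical.

```python
import random

def share_health_conscious(is_health_conscious, in_group):
    """Fraction of a group's members who are health-conscious."""
    members = [h for h, g in zip(is_health_conscious, in_group) if g]
    return sum(members) / len(members)

def compare_assignments(n=10_000, seed=1):
    rng = random.Random(seed)
    # Hypothetical population: about half the women are "health-conscious".
    health_conscious = [rng.random() < 0.5 for _ in range(n)]
    # Observational study: each woman chooses for herself. Assumed for
    # illustration: health-conscious women pick hormones 70% of the time,
    # other women 30% of the time.
    chose_hormones = [rng.random() < (0.7 if h else 0.3) for h in health_conscious]
    # Experiment: a fair coin toss decides, ignoring everything about the woman.
    coin_hormones = [rng.random() < 0.5 for _ in range(n)]
    return {
        "self-selected hormone": share_health_conscious(health_conscious, chose_hormones),
        "self-selected no hormone": share_health_conscious(
            health_conscious, [not c for c in chose_hormones]),
        "coin-toss hormone": share_health_conscious(health_conscious, coin_hormones),
        "coin-toss dummy pill": share_health_conscious(
            health_conscious, [not c for c in coin_hormones]),
    }

for group, share in compare_assignments().items():
    print(f"{group:26s} {share:.2f}")
```

Under these assumptions the self-selected groups differ sharply in how health-conscious they are, while the two coin-toss groups look nearly alike, so a difference in heart attacks between the coin-toss groups is much harder to blame on anything but the pills.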
Observational studies just observe; they don't try to change
behavior. We can learn from observational studies how chimpanzees
behave in the wild, or which popular songs sold best last week, or
what the public thinks about the president's performance. But to
learn whether some act causes a change in behavior (for example,
whether taking hormones causes a reduction in the risk of heart
attacks), observational studies are a poor choice. Experiments that
directly compare different actions (for example, taking a hormone
pill and taking a dummy pill) are designed to help us learn about
cause and effect.
The most important information about any statistical study is
how the data were produced.
Observation versus experiment is just one aspect of how data are
produced. SoundScan can't get data from every music-selling source
in the United States. Its data come from a large sample of sellers.
When downloading music became popular, SoundScan had to add
online sellers in order to keep its sample representative of all sellers.
As CDs are increasingly sold in untraditional places such as
Hallmark card stores and Starbucks, SoundScan faces more challenges.
The opinion poll that tells us that 55% of the public approves of
the president's performance faces challenges more severe than
SoundScan's, because there are so many of us whose opinions are part of
public opinion. Professional polls use random samples. That is, they
let impersonal chance choose the sample of people they talk with.
That's a great idea: it allows rich and poor, black and white,
Democrat and Republican the same chance to respond. You might contrast
the professional approach with just gathering a sample by stopping
people at shopping malls. Shopping mall patrons aren't "the public."
They are more likely to be either teens or retired. They are less likely
to be really poor. And the poll-taker may not want to question the
unshaven hulk in the muscle shirt. Random samples avoid the kinds
of favoritism that a mall survey produces.
But even with a random sample, we must watch the details.
Almost all polls create random samples by dialing telephone numbers
at random. That misses the few people without phones (and often
leaves out Alaska and Hawaii to hold down cost), but that's not the
big problem. Answering machines, never at home, don't want to talk
with someone who might be a telemarketer: the big problem is that
many of the people in the random sample don't respond. Most polls
don't announce their rates of nonresponse, because the truth would
be embarrassing. It appears from studies by groups such as the Pew
Research Center that roughly two-thirds of the initial sample
typically fails to respond to a telephone survey. Do you think the
nonresponders are different from the responders? If so, don't take the
poll results too seriously.
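A bit of arithmetic shows why nonresponse matters. The sketch below is illustrative only: the function name and all response rates are hypothetical assumptions, not figures from any poll. It supposes that 55% of the public approves and asks what share of the people who actually answer would approve if approvers and disapprovers respond at different rates.

```python
def approval_among_responders(true_approval, respond_if_approve, respond_if_disapprove):
    """Share of responders who approve, given assumed response rates for each group."""
    approvers_reached = true_approval * respond_if_approve
    disapprovers_reached = (1 - true_approval) * respond_if_disapprove
    return approvers_reached / (approvers_reached + disapprovers_reached)

# Hypothetical case 1: both groups respond at the same rate (33%).
equal = approval_among_responders(0.55, 0.33, 0.33)
# Hypothetical case 2: approvers respond 40% of the time, disapprovers 25%.
unequal = approval_among_responders(0.55, 0.40, 0.25)

print(f"equal response rates:   {equal:.1%}")
print(f"unequal response rates: {unequal:.1%}")
```

With equal response rates the responders still show 55% approval. With the assumed unequal rates, about 66% of responders approve even though only 55% of the public does, and no increase in sample size repairs that gap.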
Some sample surveys are more trustworthy. Government surveys,
such as the monthly Current Population Survey (CPS) that produces
the unemployment rate and much other information, have much
higher rates of response. Only about 6% or 7% of the households
chosen at random for the CPS sample don't respond. The Bureau of
Labor Statistics, unlike pollsters, makes its response rates public.
Knowing the details increases our confidence in the findings.
Before you trust the results of a statistical study, ask about details
of how the study was conducted.
YOU CAN OBSERVE A LOT JUST BY WATCHING

Yogi Berra's famous saying is a motto for learning from data. A few
carefully chosen graphs are often more instructive than great piles of
numbers. Consider the outcome of the 2000 presidential election in
Florida.
Elections don't come much closer: after much recounting, state
officials declared that George Bush had carried Florida by 537 votes
out of almost 6 million votes cast. Florida's vote decided the election
and made George Bush rather than Al Gore president. Lawsuits
followed, and the Supreme Court upheld the result. Legal and political
issues aside, Figure 1 displays a graph that plots votes for the
third-party candidate Pat Buchanan against votes for the Democratic
candidate Al Gore in Florida's 67 counties.
FIGURE 1 Votes for Pat Buchanan versus votes for Al Gore in Florida's 67 counties
What happened in Palm Beach County? The question leaps out from
the graph. In this large and heavily Democratic county, a conservative
third-party candidate did far better relative to the Democratic Party
candidate than in any other county. The points for the other 66
counties show votes for both candidates increasing together in a roughly
straight-line pattern. Both counts go up as county population goes up.
Based on this pattern, we would expect Buchanan to receive around
800 votes in Palm Beach County. He actually received more than
3400 votes. That difference determined the election result in Florida
and in the nation. All this from a simple graph. Once you have data in
hand, the first rule of data analysis is:
Always plot your data.

The graph demands an explanation. It turns out that Palm Beach
County used a confusing "butterfly" ballot in which candidate names
on both left and right pages led to a voting column in the center. It
would be easy for a voter who didn't look carefully to cast a vote for
Buchanan when a vote for Gore was intended. The graph is
convincing evidence that this in fact happened, probably more convincing
than the complaints of voters who (later) were unsure where their
votes ended up.
Plotting the data and thinking about the plots is the start of
learning from data, but only the start. Issues such as the effect of hormone
replacement on women's health are too complicated to be settled by
looking at graphs. The women studied vary in age, race, bad habits
such as smoking, good habits such as regular exercise, and so on. The
variation among women will overwhelm the effect of taking
hormones unless we can find a way to see through the variation.
FIGHTING THE CURSE OF VARIATION

Why don't opinion polls interview just one person? Because not all
people have the same opinions. Why can't we just compare the
long-term health of two women, one with hormone replacement and one
without? Because individuals vary even when the two individuals are
the same in sex, age, race, income, previous health history, and so on.
Accounting for variation, and making sure that variation among
individuals doesn't obscure important overall patterns, is a big reason
why we need the science of statistics, not just a quick look at the data.
Statisticians have two main strategies for overcoming variation.
First, take enough observations so that the effects of variation are
"averaged out." A poll of 50 people, if repeated, may give quite different
proportions who approve the president's performance. A poll of
2500 people will almost always give close to the same result, because
the variation caused by choosing people at random disappears as we
choose more and more people. It's just like tossing a coin: one, or
10, or even 50 tosses give quite variable percentages of "heads," but
thousands of tosses give close to half "heads" every time. In fact,
another big reason to use random samples is that probability, the same
mathematics that describes coin tosses, describes how random
samples behave. That larger samples are less variable isn't just common
sense; it's a mathematical fact. The mathematics of probability
allows statisticians to say how large a sample we need to reduce the
variation in results to whatever level we want.
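The claim about polls of 50 and 2500 people can be checked by simulation. This is a minimal sketch under stated assumptions: the true approval rate is set to 55% (borrowed from the poll example earlier in this essay), every person is chosen independently at random, and the function name `repeated_polls`, the number of repeats, and the seed are arbitrary choices made for illustration.

```python
import random
import statistics

def repeated_polls(sample_size, true_approval=0.55, repeats=500, seed=0):
    """Simulate many polls of the same size; return each poll's approval share."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        approvals = sum(rng.random() < true_approval for _ in range(sample_size))
        results.append(approvals / sample_size)
    return results

small_polls = repeated_polls(50)
large_polls = repeated_polls(2500)

for label, polls in (("n = 50", small_polls), ("n = 2500", large_polls)):
    print(f"{label:9s} lowest {min(polls):.2f}  highest {max(polls):.2f}  "
          f"spread (std. dev.) {statistics.pstdev(polls):.3f}")
```

Probability theory predicts that the spread shrinks in proportion to one over the square root of the sample size, so polls of 2500 should vary about seven times less than polls of 50; the simulated spreads should land close to that ratio.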
The second way to fight variability is to measure characteristics that
explain variation among individuals. If we know the age, race, income,
and health history of a woman, we can use this information to predict
her future health. If we do this for all the women in a study of hormone
replacement, it becomes easier to see the effect of hormone replacement
because we can remove variation explained by the things we measured.
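A small simulation can illustrate how removing explained variation sharpens a comparison. It is a deliberately simplified stand-in for the adjustments statisticians actually use, and every number in it is a hypothetical assumption: the outcome is an invented "health score" that depends on age, treatment is assigned by coin toss, the true treatment effect is set to 1.0, and age is the only characteristic measured. The function names are illustrative.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares slope and intercept for y against x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def group_difference(values, treated):
    """Mean of the treated group minus mean of the untreated group."""
    t = [v for v, g in zip(values, treated) if g]
    c = [v for v, g in zip(values, treated) if not g]
    return statistics.fmean(t) - statistics.fmean(c)

def one_study(rng, n=200, true_effect=1.0):
    """Return (raw estimate, age-adjusted estimate) of the treatment effect."""
    age = [rng.uniform(50, 80) for _ in range(n)]
    treated = [rng.random() < 0.5 for _ in range(n)]  # coin-toss assignment
    outcome = [0.5 * a + (true_effect if t else 0.0) + rng.gauss(0, 1)
               for a, t in zip(age, treated)]
    # Predict the outcome from age alone, then keep what age does not explain.
    slope, intercept = fit_line(age, outcome)
    unexplained = [y - (intercept + slope * a) for y, a in zip(outcome, age)]
    return group_difference(outcome, treated), group_difference(unexplained, treated)

rng = random.Random(0)
raw, adjusted = zip(*(one_study(rng) for _ in range(300)))
print("spread of raw estimates:     ", round(statistics.pstdev(raw), 2))
print("spread of adjusted estimates:", round(statistics.pstdev(adjusted), 2))
```

Both sets of estimates center near the assumed effect, but across repeated studies the adjusted estimates vary far less, because the variation that age explains has been removed before the groups are compared.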
These strategies can reduce variation in outcomes, but they don't
produce certainty. Experimental findings that hormone replacement
has few benefits and some risks trump observational studies that show
benefits, but we can't be absolutely sure that the experimental findings
are right. There remains some risk that by bad luck the dummy-pill
group received healthier women than the hormone group. So,
statistical findings are always uncertain. The laws of probability again come
to our rescue: we can attach to our findings a statement of just
how uncertain they are, and we can design our studies to make the
remaining uncertainty as small as we may wish. Opinion polls, for
example, give not only the percentage of the sample who support the
president but also a "margin of error" that describes the uncertainty in
applying the sample result to the wider universe of all adults. Saying
how much variation remains belongs with strategies for reducing
variation in the statistician's toolkit for dealing with variation.
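For readers who want to see the arithmetic behind a margin of error, here is a minimal sketch. It assumes a simple random sample with full response and uses the standard large-sample formula for a 95% margin of error, 1.96 times the square root of p(1 - p)/n. The essay does not say which formula any particular poll uses, and real polls often apply further adjustments. The function names are hypothetical.

```python
import math

Z_95 = 1.96  # normal critical value for 95% confidence

def margin_of_error(sample_share, n, z=Z_95):
    """Approximate 95% margin of error for a share from a simple random sample."""
    return z * math.sqrt(sample_share * (1 - sample_share) / n)

def sample_size_for(margin, share=0.5, z=Z_95):
    """Smallest n whose margin of error is at most `margin` (worst case share = 0.5)."""
    return math.ceil((z / margin) ** 2 * share * (1 - share))

print(f"margin for 55% approval, n = 2500: {margin_of_error(0.55, 2500):.3f}")
print(f"sample size for a 3-point margin:  {sample_size_for(0.03)}")
```

Under these assumptions a poll of 2500 people showing 55% approval carries a margin of error of about 2 percentage points, and holding the margin to 3 points takes 1068 respondents. The second function works the first one backward, which is the sense in which probability says how large a sample we need.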
Variation is everywhere. Individuals vary. Repeated measurements
on the same individual vary. The science of statistics provides tools
for dealing with variation.
Applying the laws of probability, removing variation due to
characteristics we have measured, and saying how much uncertainty
remains are technical matters; that's why there are statisticians. It's
important to realize that this technical magic can't remedy bad data
production. The mathematics of probability doesn't apply to a
haphazard shopping mall sample, and it doesn't account for nonresponse
in a random sample. The observational studies that seemed to show
beneficial effects from hormone replacement did measure and adjust
for many other variables, but still came up with the wrong answer.
Did these studies fail to make some essential measurement? Were the
women they studied not typical in some unknown way? We don't
know. We do know that experiments comparing two groups formed
at random are the gold standard for understanding the effect of an
intervention such as hormone replacement.
STATISTICS: A GUIDE TO THE UNKNOWN

Data enlighten; they shed light in dark places. We are interested in
whether the president has the support of the country and whether
unemployment is dropping. Women need to know whether the benefits
of hormone replacement outweigh the risks. We may be surprised to
learn from SoundScan's cousin, BookScan, that classics such as Jane
Austen's Pride and Prejudice outsell even the best sellers of a few years
ago: Pride and Prejudice sold 110,000 copies in 2002, and that
doesn't include students who had to buy the book. Biologists were
surprised to learn that the pattern of data from a microarray can
distinguish two types of leukemia that seem identical to the usual tools
of medicine.
The ideas and methods of statistics guide us in using data to
explore the unknown: how to produce trustworthy data, how to look at
data (starting with graphs), and how to reach sound conclusions that
come with an indication of just how confident we can be. These are
three important aspects of statistical science: data production, data
analysis, and inferring conclusions from data. We have seen brief
examples of each. The essays in this book illustrate in more detail the
reach of these statistical ideas, from counting tigers to reducing junk
mail. Every application of statistics has its special features (tigers
and junk mail have little in common), but common ways of
thinking illuminate all.
PUBLIC POLICY
AND SOCIAL
SCIENCE
Statistics in the Courtroom: United States v. Kristen Gilbert
George Cobb & Stephen Gehlbach
The Anatomy of a Preelection Poll
Edward C. Ratledge
Counting and Apportionment: Foundations of America's Democracy
Tommy Wright & George Cobb
Evaluating School Choice Programs
Jennifer Hill
Designing National Health Care Surveys to Inform Health Policy
Steven B. Cohen & Trena M. Ezzati-Rice
