
BELIEVE IT OR DON'T: USING OUTLIER DETECTION TO FIND THE WEIRDEST OF THE WEIRD

Did you hear about the man who stole a GPS, but ended up having to call 911 when he
got lost? How about the murder of nine college students that was recently blamed on
the Yeti (the Russian Bigfoot)? Or the California school kids who made a
fifty-foot-long peanut butter and jelly sandwich in less than three minutes?

On any given day, you can search the Internet and find stories like these: true
reports that are stranger than fiction. Many news outlets carry weird news stories,
and a surprising number of these stories seem to come from the state of Florida. In
fact, a recent Google search on "weird news Florida" revealed over 47,000,000 hits.
That's a lot of strange. And from sewer-surfing alligators to whale-wrangling
nudists, the Sunshine State has it all.

Outliers are the Florida of the data analysis world. These strange observations sit
at the extremes, far away from the rest of your data. And like the story about a
twelve-foot python caught wrapping itself around an unsuspecting woman's toilet,
outliers can leave you slightly disturbed, wondering what might've happened had
they not been found. In this chapter, we'll study News of the Weird stories and
use outlier detection to find the weirdest of the weird.

THE WORLD OF THE WEIRD

If you've read the last few chapters, you've already run across outliers. You know
they're extreme values that sit far away from the center of a dataset. You've seen
how they can ruin the results of a statistical analysis. And you've learned there
are techniques for minimizing the impact these weird values have. In this chapter,
you'll learn some techniques for finding and eliminating extreme values before they
have a chance to influence your analysis. If you're comfortable with the concept of
outliers and you've seen how much damage they can do, then go ahead and skip to the
next section. If you'd like to see one more illustration, then this section is for
you.

It only takes one or two extreme values to shift a sample mean, inflate a standard
deviation, and bias a slope estimate. Consider the two datasets in Figure 7.1.

Figure 7.1 Normal data with two outliers.

These data are simulated, made up, and just about as perfect as data get. Almost.
There are twenty observations in Group 1, all of them drawn from a normal
distribution with mean 5.5 and variance 1. There are also twenty observations in
Group 2. Eighteen of them are drawn from a normal distribution with mean 6 and
variance 1. The remaining two, the extreme values at 1.8 and 2.1, are outliers.
Figure 7.2 shows
what these two outliers can do to some basic statistics.

Figure 7.2 The impact of outliers on common statistical techniques.

The sample mean and standard deviation of Group 1 are close to the respective
values of 5.5 and 1.0 that we know to be true. The sample mean and standard
deviation for Group 2 are not close to the true values. Where the sample mean
should be about 6, it's 5.68. Where the standard deviation should be close to 1,
it's 1.6. Removing the two obvious outliers dramatically improves the sample
statistics, increasing the sample mean to 6.1 and decreasing the sample standard
deviation to 1.01.
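
The comparison can be tried with a small simulation. The following is only a sketch using Python's standard library: the seed and the helper name `summarize` are assumptions made for illustration, and because the draws are random, the printed values will not match the numbers in Figure 7.2 exactly. The pattern should be the same, though: the two outliers pull the mean of Group 2 down and push its standard deviation up.

```python
import random
import statistics

random.seed(7)  # arbitrary seed, used only so the run is repeatable

# Group 1: twenty draws from a normal distribution with mean 5.5, variance 1.
group1 = [random.gauss(5.5, 1.0) for _ in range(20)]

# Group 2: eighteen draws with mean 6, variance 1, plus the two outliers.
group2_clean = [random.gauss(6.0, 1.0) for _ in range(18)]
outliers = [1.8, 2.1]
group2 = group2_clean + outliers


def summarize(values):
    """Return the sample mean and sample standard deviation (n - 1 divisor)."""
    return statistics.mean(values), statistics.stdev(values)


for label, values in [
    ("Group 1", group1),
    ("Group 2 with outliers", group2),
    ("Group 2 without outliers", group2_clean),
]:
    mean, sd = summarize(values)
    print(f"{label:26s} n={len(values):2d}  mean={mean:.2f}  sd={sd:.2f}")
```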

The impact of these outliers on a t-test is even more dramatic. Because the
standard deviation is inflated when the outliers are included, the p-value is
0.49, which is not statistically significant, and we'd fail to reject the null
hypothesis that the means are the same, even though the true means differ. When
the two outliers are excluded, however, the p-value drops below the 0.05
significance level, leading to the correct conclusion that the means of the two
groups are different.
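
To see why the test is so sensitive, it helps to compute the t statistic by hand. The text does not say which variant of the t-test was used, so this sketch assumes the pooled, equal-variance two-sample test. The function name `pooled_t` and the seed are hypothetical, and the simulated draws will not reproduce the p-values quoted above.

```python
import math
import random
import statistics


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err, df


random.seed(7)  # arbitrary seed, used only so the run is repeatable
group1 = [random.gauss(5.5, 1.0) for _ in range(20)]
group2_clean = [random.gauss(6.0, 1.0) for _ in range(18)]
group2 = group2_clean + [1.8, 2.1]  # the two outliers from the text

for label, sample in [
    ("with outliers", group2),
    ("without outliers", group2_clean),
]:
    t, df = pooled_t(group1, sample)
    print(f"Group 1 vs Group 2 {label:17s} t={t:6.2f}  df={df}")
```

With 36 to 38 degrees of freedom, a two-sided test at the 0.05 level rejects the null hypothesis when the absolute value of t exceeds roughly 2.03. The outliers shrink the numerator (the difference in means) and inflate the denominator (the standard error), so both effects push t toward zero.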
