
What is Statistics?

Statistics is a tool for creating an understanding
from a set of numbers.

An Example: Stats Anxiety.

Key Statistical Concepts. . .

Population
A population is the group of all items of interest to a
statistics practitioner.
Frequently very large; sometimes infinite.
E.g. all 5 million Florida voters.
Sample
A sample is a set of data drawn from the population.
Potentially very large, but less than the population.
E.g. a sample of 765 voters exit polled on election day.

Parameter
A descriptive measure of a population.
Statistic
A descriptive measure of a sample.

Descriptive Statistics. . .
. . . are methods of organizing, summarizing, and
presenting data in a convenient and informative
way. These methods include:
Graphical Techniques, and
Numerical Techniques.
The actual method used depends on what
information we would like to extract.

Inferential Statistics. . .
Descriptive statistics describe the data set that's
being analyzed, but don't allow us to draw any
conclusions or make any inferences that go beyond
that data. Hence we need another branch of statistics:
inferential statistics.

Inferential statistics is also a set of methods, but
it is used to draw conclusions or inferences about
characteristics of populations based on data from
a sample.

We use statistics to make inferences about
parameters.
Therefore, we can make an estimate, prediction, or
decision about a population based on sample data.
Thus, we can apply what we know about a sample
to the larger population from which it was drawn!
Rationale:
Large populations make investigating each member
impractical and expensive.
Easier and cheaper to take a sample and make
estimates about the population from the sample.

However:
Such conclusions and estimates are not always going to be
correct. For this reason, we build into the statistical
inference measures of reliability, namely confidence level
and significance level.

Confidence and Significance Levels. . .
The confidence level is the proportion of times that
an estimating procedure will be correct.
E.g. a confidence level of 95% means that estimates
based on this form of statistical inference will be
correct 95% of the time.
When the purpose of the statistical inference is to
draw a conclusion about a population, the
significance level measures how frequently the
conclusion will be wrong in the long run.
E.g. a 5% significance level means that, in the long
run, this type of conclusion will be wrong 5% of the
time.

If we use α (Greek letter alpha) to represent
significance, then our confidence level is 1 − α.
This relationship can also be stated as:
Confidence Level + Significance Level = 1
Consider a statement from polling data you may
hear about in the news:
This poll is considered accurate within 3.4
percentage points, 19 times out of 20.
In this case, our confidence level is 95% (19/20 =
0.95), while our significance level is 5%.

Random Variables. . .
Variability is omnipresent in the
business world. To model variability
probabilistically, we need the
concept of a random variable.
A random variable is a numerically
valued variable which takes on
different values with given
probabilities.

Examples:
The return on an investment in a one-year
period
The price of an equity
The number of customers entering a store
The sales volume of a store on a
particular day
The turnover rate at your organization
next year

Types of Random
Variables. . .
Discrete Random Variable:
one that takes on a countable
number of possible
values, e.g.,
total of roll of two dice: 2, 3, . . . ,
12
number of desktops sold: 0, 1, . . .
customer count: 0, 1, . . .

Continuous Random Variable:


one that takes on an uncountable number
of possible
values, e.g.,
interest rate: 3.25%, 6.125%, . . .
task completion time: a nonnegative value
price of a stock: a nonnegative value
Basic Concept: Integer or rational numbers
are discrete, while real numbers are
continuous.

Probability Distributions. . .
Random variables have values that
are determined by chance events.
The future price of a share of stock is
a random variable because its value
is determined by chance factors such
as market conditions, the
accomplishment of revenue targets
by the company, interest rates, and
so on.

Random variables can be either discrete or
continuous. A random variable is discrete if it
can assume only a finite number of values or if its
values are distinct and separate units.
For example, the number of boxes of cookies
produced during a given shift is a discrete
random variable, because each box is a distinct,
whole unit; a manufacturer would not produce or
measure half a box of cookies.

Continuous random variables can assume any
value within a range along a continuum. Consider
boxes of cookies again. The weight of a box of
cookies is a continuous random variable because
it can be measured using an infinite range of
fractional values.
For example, the weight could assume values
such as 16 ounces, 16.24 ounces, 16.2411
ounces, or any of a range of fractional values.

Consider the experiment of tossing a single die. Define X as
the number of spots on the up face of the die after a toss.
Then R = {1, 2, 3, 4, 5, 6}. Assume the die is loaded so that
the probability that a given face lands up is proportional to
the number of spots showing. The discrete probability
distribution for this random experiment is given by
P(X = x) = x/21 for x = 1, 2, . . . , 6,
since the probabilities must sum to 1 and 1 + 2 + . . . + 6 = 21.
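
As a quick check of the loaded-die distribution described above, the following minimal sketch tabulates the probabilities, assuming only that each probability is proportional to the face value. The names `faces` and `pmf` are illustrative and not part of the original slides.

```python
from fractions import Fraction

# Faces of the die; the probability of each face is assumed
# proportional to the number of spots it shows.
faces = range(1, 7)
total = sum(faces)  # normalizing constant: 1 + 2 + ... + 6 = 21

# Probability mass function stored as exact fractions.
pmf = {x: Fraction(x, total) for x in faces}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")

# A valid probability distribution must sum to exactly 1.
assert sum(pmf.values()) == 1
```

Exact fractions are used here so that the "sums to 1" check is not affected by floating-point rounding.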

Population Mean (Expected Value). . .
The population mean is the weighted average of
all of its values. The weights are specified by the
probability mass function. This parameter is also
called the expected value of X and is denoted by
E(X).
The formal definition is similar to computing the
sample mean for grouped data:
E(X) = μ = Σ x p(x), summed over all values x of X.

Example: Expected No. of TVs


Let X be the number of TVs in a household.
Then,
E(X) = 0 × 0.012 + 1 × 0.319 + . . . + 5 × 0.028 = 2.084

Population Variance. . .
The population variance is calculated similarly. It is the
weighted average of the squared deviations from the
mean. Formally,
σ² = V(X) = E[(X − μ)²] = Σ (x − μ)² p(x), summed over all values x of X.

Since σ² is an expected value (of (X − μ)²), it should be
interpreted as the long-run average of squared deviations
from the mean. Thus, the parameter σ² is a measure of
the extent of variability in successive realizations of X.
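
The weighted-average definitions of the population mean and variance can be sketched in a few lines. The example below applies them to a die whose face probabilities are proportional to the number of spots (x/21); the helper names `expected_value` and `variance` are hypothetical, chosen for this illustration.

```python
from fractions import Fraction

def expected_value(pmf):
    """Weighted average of the values, weights given by the pmf."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Weighted average of squared deviations from the mean."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Loaded die: P(X = x) = x / 21 for x = 1..6.
loaded_die = {x: Fraction(x, 21) for x in range(1, 7)}

mu = expected_value(loaded_die)   # 13/3, about 4.33
sigma2 = variance(loaded_die)     # 20/9, about 2.22

print("E(X) =", mu, " V(X) =", sigma2)
```

Both functions work on any discrete distribution passed as a `{value: probability}` mapping, so the same sketch could be reused for other examples in this section.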

1. Terminals on an on-line computer system are
attached to a communication line to the central
computer system. The probability that any terminal is
ready to transmit is 0.95.
Let X = number of terminals polled until the first ready
terminal is located.
2. Toss a coin repeatedly.
Let X = number of tosses to the first head.
3. It is known that 20% of products on a production
line are defective. Products are inspected until the first
defective is encountered.
Let X = number of inspections to obtain the first defective.
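
All three examples above count the number of independent trials up to and including the first success, which is the setting of the geometric distribution: P(X = x) = (1 − p)^(x − 1) p for x = 1, 2, . . . The sketch below evaluates it for the three cases, assuming the trials are independent with constant success probability (and, for the coin, that it is fair, which the slide does not state). The function names are illustrative.

```python
def geometric_pmf(x, p):
    """P(first success occurs on trial x), for x = 1, 2, ..."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    return (1 - p) ** (x - 1) * p

def geometric_mean(p):
    """Expected number of trials until the first success."""
    return 1 / p

# 1. Terminals: probability the first ready terminal is the 2nd one polled.
p_terminal = geometric_pmf(2, 0.95)

# 2. Coin (assumed fair): probability the first head is on the 3rd toss.
p_coin = geometric_pmf(3, 0.5)

# 3. Inspection: expected number of inspections to the first defective.
mean_inspections = geometric_mean(0.20)

print(p_terminal, p_coin, mean_inspections)
```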

Poisson distribution

The Poisson distribution is a discrete distribution. It is
often used as a model for the number of events (such as
the number of telephone calls at a business, number of
customers in waiting lines, number of defects in a given
surface area, airplane arrivals, or the number of accidents
at an intersection) in a specific time period.

The major difference between Poisson and Binomial
distributions is that the Poisson does not have a fixed
number of trials. Instead, it uses the fixed interval of time
or space in which the number of successes is recorded.

λ is the parameter which indicates the average number of events in the given time interval.

Parameters: The mean is λ. The variance is λ.

Consider a computer system with a Poisson
job-arrival stream at an average of 2 per
minute. Determine the probability that in
any one-minute interval there will be
(i) 0 jobs;
(ii) exactly 2 jobs;
(iii) at most 3 arrivals.
(iv) What is the maximum number of jobs that should
arrive in one minute with 90% certainty?
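
The four parts of the job-arrival exercise can be worked with the Poisson formula P(X = k) = e^(−λ) λ^k / k!. The sketch below is one way to do so; it reads part (iv) as "the smallest k such that P(X ≤ k) is at least 0.90", which is an interpretation rather than something the slide states. Function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k), summed term by term."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

def poisson_quantile(q, lam):
    """Smallest k with P(X <= k) >= q (assumed reading of part iv)."""
    k = 0
    while poisson_cdf(k, lam) < q:
        k += 1
    return k

lam = 2  # average arrivals per minute

p_zero = poisson_pmf(0, lam)            # (i)   about 0.1353
p_two = poisson_pmf(2, lam)             # (ii)  about 0.2707
p_at_most_3 = poisson_cdf(3, lam)       # (iii) about 0.8571
k_90 = poisson_quantile(0.90, lam)      # (iv)  4 jobs

print(p_zero, p_two, p_at_most_3, k_90)
```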

Hypergeometric Distribution
The probability distribution of a hypergeometric random variable is
called a hypergeometric distribution.
The following notation is helpful when we talk about
hypergeometric distributions and hypergeometric probability.
N: The number of items in the population.
k: The number of items in the population that are classified as
successes.
n: The number of items in the sample.
x: The number of items in the sample that are classified as
successes.
kCx: The number of combinations of k things, taken x at a time.
h(x; N, n, k): hypergeometric probability - the probability that
an n-trial hypergeometric experiment results
in exactly x successes, when the population consists
of N items, k of which are classified as successes.

Hypergeometric Experiments

A hypergeometric experiment is a statistical experiment
that has the following properties:
A sample of size n is randomly selected
without replacement from a population of N items.
In the population, k items can be classified as successes,
and N - k items can be classified as failures.
Consider the following statistical experiment. You have an
urn of 10 marbles - 5 red and 5 green. You randomly select
2 marbles without replacement and count the number of
red marbles you have selected. This would be a
hypergeometric experiment.

Hypergeometric Distribution
A hypergeometric random variable is the number of
successes that result from a hypergeometric experiment.
The probability distribution of a hypergeometric random
variable is called a hypergeometric distribution.
Given x, N, n, and k, we can compute the hypergeometric
probability based on the following formula:
Hypergeometric Formula. Suppose a population
consists of N items, k of which are successes, and a
random sample drawn from that population consists
of n items, x of which are successes. Then the
hypergeometric probability is:
h(x; N, n, k) = [kCx] [N-kCn-x] / [NCn]

The hypergeometric distribution has
the following properties:
The mean of the distribution is equal
to n * k / N.
The variance is n * k * (N - k) *
(N - n) / [N^2 * (N - 1)].

Example 1
Suppose we randomly select 5 cards without replacement from an
ordinary deck of playing cards. What is the probability of getting
exactly 2 red cards (i.e., hearts or diamonds)?
Solution: This is a hypergeometric experiment in which we know
the following:
N = 52; since there are 52 cards in a deck.
k = 26; since there are 26 red cards in a deck.
n = 5; since we randomly select 5 cards from the deck.
x = 2; since 2 of the cards we select are red.
We plug these values into the hypergeometric formula as follows:
h(x;N,n,k) = [kCx] [N-kCn-x] / [NCn]
h(2;52,5,26) = [26C2] [26C3] / [52C5]
h(2;52,5,26) = [ 325 ] [ 2600 ] / [ 2,598,960 ] = 0.32513
Thus, the probability of randomly selecting exactly 2 red cards is 0.32513.
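
The card calculation above can be reproduced directly from the hypergeometric formula using Python's built-in `math.comb`. This is a minimal sketch; the function names are illustrative, and the marble figure is simply the urn experiment described earlier (10 marbles, 5 red, draw 2) evaluated for exactly one red marble.

```python
from math import comb

def hypergeom_pmf(x, N, n, k):
    """h(x; N, n, k) = [kCx][N-kCn-x] / [NCn]."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

def hypergeom_mean(N, n, k):
    """Mean of the distribution: n * k / N."""
    return n * k / N

def hypergeom_variance(N, n, k):
    """Variance: n * k * (N - k) * (N - n) / [N^2 * (N - 1)]."""
    return n * k * (N - k) * (N - n) / (N ** 2 * (N - 1))

# Example 1: exactly 2 red cards in 5 draws from a 52-card deck.
p_two_red = hypergeom_pmf(2, N=52, n=5, k=26)

# Urn example: exactly 1 red marble in 2 draws from 5 red + 5 green.
p_one_red_marble = hypergeom_pmf(1, N=10, n=2, k=5)

print(round(p_two_red, 5))                 # 0.32513
print(hypergeom_mean(52, 5, 26))           # 2.5 red cards on average
print(hypergeom_variance(52, 5, 26))
```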

Multinomial
The Binomial distribution was based on having a
series of events that could take on only two states:
success/failure, sick/well, heads/tails, et cetera.
But what if there are several possible events, like
left/right/center, or Africa/Eurasia/Australia/Americas?
The Multinomial distribution extends the Binomial
distribution for such cases.
The Binomial case could be expressed with one
parameter, p, which indicated success with probability
p and failure with probability 1 − p. The Multinomial
case requires k parameters, p1, . . . , pk, such that
p1 + p2 + . . . + pk = 1.

The binomial distribution allows one to compute the
probability of obtaining a given number of binary outcomes.
For example, it can be used to compute the probability of
getting 6 heads out of 10 coin flips. The flip of a coin is a
binary outcome because it has only two possible outcomes:
heads and tails. The multinomial distribution can be used to
compute the probabilities in situations in which there are
more than two possible outcomes.

For example, suppose that two chess players had played
numerous games and it was determined that the probability
that Player A would win is 0.40, the probability that Player B
would win is 0.35, and the probability that the game would
end in a draw is 0.25. The multinomial distribution can be
used to answer questions such as: "If these two chess
players played 12 games, what is the probability that Player
A would win 7 games, Player B would win 2 games, and the
remaining 3 games would be drawn?" The following formula
gives the probability of obtaining a specific set of outcomes
when there are three possible outcomes for each event:
p = [n! / (n1! n2! n3!)] × p1^n1 × p2^n2 × p3^n3

where
p is the probability,
n is the total number of events,
n1 is the number of times Outcome 1 occurs,
n2 is the number of times Outcome 2 occurs,
n3 is the number of times Outcome 3 occurs,
p1 is the probability of Outcome 1,
p2 is the probability of Outcome 2, and
p3 is the probability of Outcome 3.

For the chess example,
n = 12 (12 games are played),
n1 = 7 (number won by Player A),
n2 = 2 (number won by Player B),
n3 = 3 (the number drawn),
p1 = 0.40 (probability Player A wins),
p2 = 0.35 (probability Player B wins),
p3 = 0.25 (probability of a draw).
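
Plugging the chess values into the multinomial probability n! / (n1! n2! n3!) × p1^n1 × p2^n2 × p3^n3 can be done as in the sketch below. The function `multinomial_pmf` is a hypothetical helper written for this illustration; it accepts any number of outcome categories, not just three.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given counts for each outcome."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    # Multinomial coefficient n! / (n1! n2! ... nk!), kept as an integer.
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)
    return coefficient * prod(p ** c for p, c in zip(probs, counts))

# Chess example: A wins 7, B wins 2, 3 draws, out of 12 games.
p_chess = multinomial_pmf([7, 2, 3], [0.40, 0.35, 0.25])

print(round(p_chess, 4))  # 0.0248
```

With only two categories the same function reduces to the binomial probability, which is a convenient sanity check.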

Continuous Distributions

Normal Distribution

The lognormal distribution
A random variable x is lognormally distributed if
ln(x) is normally distributed.
If x is normal, and ln(y) = x (or y = e^x), then y is
lognormal.
If continuously compounded stock returns are
normal, then the stock price is lognormally
distributed.

Product of lognormal variables is lognormal
If x1 and x2 are normal, then y1 = e^x1 and y2 = e^x2 are
lognormal.
The product of y1 and y2: y1 × y2 = e^x1 × e^x2 = e^(x1 + x2)
Since x1 + x2 is normal (for example, when x1 and x2 are
independent), e^(x1 + x2) is lognormal.
Lognormal Distribution Probability Density Function

A random variable X is said to have the Lognormal
Distribution with parameters μ and σ, where μ > 0 and σ >
0, if the probability density function of X is:

f(x) = [1 / (x σ √(2π))] × exp(−(ln x − μ)² / (2σ²)), for x > 0
f(x) = 0, for x ≤ 0

If X ~ LN(μ, σ), then Y = ln(X) ~ N(μ, σ).

Lognormal Distribution - Probability Distribution Function
F(x) = P(X ≤ x) = Φ((ln x − μ) / σ)
where Φ(z) is the cumulative probability distribution function of
N(0,1).

Lognormal Distribution Example
A theoretical justification based on a certain material
failure mechanism underlies the assumption that ductile
strength X of a material has a lognormal distribution.
If the parameters are μ = 5 and σ = 0.1,
find:
(a) μx and σx
(b) P(X > 120)
(c) P(110 ≤ X ≤ 130)
(d) The median ductile strength
(e) The expected number having strength at least 120, if
ten different samples of an alloy steel of this type were
subjected to a strength test.
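
One way to work parts (a) to (e) numerically is sketched below. It assumes that μ = 5 and σ = 0.1 are the mean and standard deviation of ln(X), and it uses the standard lognormal results E(X) = exp(μ + σ²/2), V(X) = (exp(σ²) − 1) exp(2μ + σ²) and median = exp(μ). Part (e) is read as 10 × P(X ≥ 120), treating the ten samples as independent; that reading is an assumption. Python's standard-library `statistics.NormalDist` supplies the N(0,1) cumulative distribution function.

```python
import math
from statistics import NormalDist

mu, sigma = 5.0, 0.1          # parameters of ln(X), as assumed above
standard_normal = NormalDist()  # N(0, 1)

def lognormal_cdf(x):
    """F(x) = P(X <= x), via the standard normal cdf of (ln x - mu) / sigma."""
    return standard_normal.cdf((math.log(x) - mu) / sigma)

# (a) mean and standard deviation of X
mean_x = math.exp(mu + sigma ** 2 / 2)
sd_x = mean_x * math.sqrt(math.exp(sigma ** 2) - 1)

# (b) P(X > 120)
p_gt_120 = 1 - lognormal_cdf(120)

# (c) P(110 <= X <= 130)
p_110_130 = lognormal_cdf(130) - lognormal_cdf(110)

# (d) median of X
median_x = math.exp(mu)

# (e) expected count out of 10 samples with strength at least 120
expected_count = 10 * p_gt_120

print(mean_x, sd_x, p_gt_120, p_110_130, median_x, expected_count)
```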

The probability density function of a
log-normal distribution is
f(x) = [1 / (x σ √(2π))] × exp(−(ln x − μ)² / (2σ²)), for x > 0.

Negative binomial

Say that we have a sequence of Bernoulli draws. How many
failures will we see before we see n successes? If p percent
of cars are illegally parked, and a meter reader hopes to
write n parking tickets, the Negative binomial tells her the
probability that she will be able to stop with n + x cars.
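
The meter-reader question can be sketched with the negative binomial probability of seeing exactly x failures before the n-th success, P(X = x) = C(x + n − 1, x) p^n (1 − p)^x. The figures used below (20% of cars illegally parked, 3 tickets, 5 legally parked cars) are hypothetical values chosen for illustration, and the function name is not from the original.

```python
from math import comb

def neg_binomial_pmf(x, n, p):
    """P(exactly x failures occur before the n-th success)."""
    if x < 0 or n < 1:
        raise ValueError("need x >= 0 and n >= 1")
    return comb(x + n - 1, x) * p ** n * (1 - p) ** x

# Hypothetical numbers: 20% of cars are illegally parked (p = 0.2),
# the meter reader wants n = 3 tickets, and passes x = 5 legal cars,
# so she stops after checking n + x = 8 cars.
p_stop_at_8 = neg_binomial_pmf(5, n=3, p=0.2)

print(round(p_stop_at_8, 5))  # 0.05505
```

With n = 1 the formula reduces to the probability of x failures before the first success, which ties it back to the "trials until first success" examples earlier in the deck.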

Gamma distribution
A better name in the statistical context would be
Negative Poisson, because it relates to the
Poisson distribution in the same way the Negative
binomial relates to the Binomial.
If the timing of events follows a Poisson
distribution, meaning that events come by at the
rate of λ per period, then this distribution tells us
how long we would have to wait until the nth
event occurs.
The form of the Gamma distribution is typically
expressed in terms of a scale parameter 1/λ,
where λ is the Poisson parameter.
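
The link between the waiting time and the Poisson count can be made concrete: the n-th event has occurred by time t exactly when at least n events have arrived in [0, t], so P(T ≤ t) = 1 − P(fewer than n Poisson events with mean λt). The sketch below uses that identity for integer n; the rate and times in the example are hypothetical, as are the function names.

```python
import math

def waiting_time_cdf(t, n, lam):
    """P(the n-th event has occurred by time t), for integer n >= 1."""
    if t <= 0:
        return 0.0
    m = lam * t  # expected number of events in [0, t]
    p_fewer_than_n = sum(
        math.exp(-m) * m ** i / math.factorial(i) for i in range(n)
    )
    return 1.0 - p_fewer_than_n

def waiting_time_mean(n, lam):
    """Expected wait until the n-th event: shape n times scale 1/lam."""
    return n / lam

# Hypothetical example: events arrive at 2 per minute; how likely is it
# that the 3rd event has arrived within 2 minutes?
p_third_by_2min = waiting_time_cdf(2.0, n=3, lam=2.0)

print(p_third_by_2min, waiting_time_mean(3, 2.0))
```

For n = 1 this reduces to the exponential distribution, 1 − e^(−λt).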

Just as the Gamma distribution is named for the Gamma
function, the Beta distribution is named after the Beta
function, whose parameters are typically notated as α and β.

Bivariate Distributions. . .
Up to now, we have looked at univariate
distributions, i.e., probability distributions in one
variable.
Bivariate distributions, also called joint
distributions, are probabilities of combinations of
two variables.
For discrete variables X and Y, the joint
probability distribution or joint probability mass
function of X and Y is defined as:
P(x, y) = P(X = x and Y = y)
for all pairs of values x and y.
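
A joint probability mass function can be stored as a table keyed by (x, y) pairs, and the marginal probabilities of X or Y are then obtained by summing over the other variable. The sketch below uses a small hypothetical joint table (the numbers are invented for illustration and do not come from the slides).

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint pmf P(x, y) for two binary variables.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 2),
}

def marginal(joint_pmf, axis):
    """Sum the joint pmf over the other variable (axis 0 = X, 1 = Y)."""
    totals = defaultdict(Fraction)  # missing keys start at 0
    for pair, p in joint_pmf.items():
        totals[pair[axis]] += p
    return dict(totals)

p_x = marginal(joint, 0)  # P(X = 0) = 1/4, P(X = 1) = 3/4
p_y = marginal(joint, 1)  # P(Y = 0) = 3/8, P(Y = 1) = 5/8

print(p_x, p_y)
```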

Marginal Probabilities

Covariance and correlation describe how two
variables are related.
Variables are positively related if they move in
the same direction.
Variables are inversely related if they move in
opposite directions.
Both covariance and correlation indicate
whether variables are positively or inversely
related. Correlation also tells you the degree
to which the variables tend to move together.

You are probably already familiar with statements
about covariance and correlation that appear in
the news almost daily.
For example, you might hear that as economic
growth increases, stock market returns tend to
increase as well. These variables are said to be
positively related because they move in the same
direction. You may also hear that as world oil
production increases, gasoline prices fall. These
variables are said to be negatively, or inversely,
related because they move in opposite directions.

To determine the actual relationships of
these variables, you would use the formulas
for covariance and correlation.
Covariance
Covariance indicates how two variables are
related. A positive covariance means the
variables are positively related, while a
negative covariance means the variables are
inversely related. The formula for calculating
covariance of sample data is:
COV(x,y) = Σ (xi − x̄)(yi − ȳ) / (n − 1)

To understand how covariance is used, consider the
table below, which describes the rate of economic
growth (xi) and the rate of return on the S&P 500 (yi).

Using the covariance formula, you can determine
whether economic growth and S&P 500 returns have
a positive or inverse relationship. Before you
compute the covariance, calculate the mean
of x and y.
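
The two steps described above (compute the means, then average the products of deviations using n − 1 in the denominator) can be sketched as follows. The data table is not reproduced in this text, so the `growth` and `returns` values below are hypothetical, chosen to be consistent with the covariance of about 1.53 quoted later in the deck.

```python
def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations over (n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

# Hypothetical data: economic growth (%) and S&P 500 returns (%).
growth = [2.1, 2.5, 4.0, 3.6]
returns = [8, 12, 14, 10]

cov_xy = sample_covariance(growth, returns)

print(mean(growth), mean(returns), round(cov_xy, 2))
```

A positive result indicates that, in this sample, the two variables tend to move in the same direction.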

Correlation
Correlation also tells you the degree to which the
variables tend to move together.
Covariance measures variables that have
different units of measurement. Using covariance,
you can determine whether the variables move in the
same or in opposite directions, but it is impossible to
measure the degree to which the variables
move together because covariance does not use
one standard unit of measurement. To measure
the degree to which variables move together, you
must use correlation.

Correlation standardizes the measure of
interdependence between two variables and,
consequently, tells you how closely the two
variables move. The correlation measurement,
called a correlation coefficient, will always take on
a value between −1 and 1:

If the correlation coefficient is 1, the variables have
a perfect positive correlation. This means that if one
variable moves a given amount, the second moves
proportionally in the same direction.
If the correlation coefficient is zero, no linear relationship
exists between the variables. If one variable moves, you can
make no predictions about the movement of the other
variable; they are uncorrelated.
If the correlation coefficient is −1, the variables are
perfectly negatively correlated (or inversely correlated)
and move in opposition to each other. If one variable
increases, the other variable decreases proportionally.

To understand how correlation is used, consider
the table below, which describes the rate of
economic growth (xi) and the rate of return on the
S&P 500 (yi).

Using the correlation formula, you can determine
whether economic growth and S&P 500 returns
have a positive or inverse relationship.

You know that the covariance of S&P 500 returns
and economic growth was calculated to be 1.53.
Now you need to determine the standard
deviation of each of the variables. You would
calculate the standard deviation of the S&P 500
returns and the economic growth.

Using the information from above, you know that
COV(x,y) = 1.53
sx = 0.90
sy = 2.58
Now you can calculate the correlation coefficient
by substituting the numbers above into the
correlation formula:
r = COV(x,y) / (sx × sy) = 1.53 / (0.90 × 2.58) ≈ 0.66
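
The substitution step can be checked in code. The sketch below first divides the quoted covariance by the product of the quoted standard deviations, then shows a general helper that computes the same coefficient from raw data. The helper names and the small perfectly linear data set are illustrative assumptions, not taken from the slides.

```python
import math

def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both std devs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (sample_std(xs) * sample_std(ys))

# Using the summary figures quoted in the text.
r_from_summary = 1.53 / (0.90 * 2.58)
print(round(r_from_summary, 2))  # 0.66

# Sanity check on made-up data: a perfectly linear increasing relation
# should give a coefficient of (approximately) 1.
r_linear = correlation([1, 2, 3, 4], [10, 20, 30, 40])
print(r_linear)
```

Because the covariance is divided by both standard deviations, the result is unit-free and always lies between −1 and 1.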

A correlation coefficient of 0.66 tells you two
important things:
Because the correlation coefficient is a
positive number, returns on the S&P 500
and economic growth are positively related.
Because 0.66 is relatively far from zero (the value indicating
no correlation), the correlation between returns on the S&P
500 and economic growth is fairly strong.
