
Chapter 5

Analyze
Objective
In the measure phase, our focus was primarily on
Y.
In the analyze phase, we shall focus on the Xs.
Objectives are
To identify and establish the root cause(s) of the
problem (i.e. the Xs affecting Y)
[If required] To develop the function Y = f(X), where that is useful
To identify the important Xs for experimentation
Types of Factors (Xs)
Control Factors
The effect difference among its levels can't be reduced, but the best level can be selected and kept under control within a desired range of variation.
Example: (Machining: Speed, Feed rate), (Injection
molding: Injection pressure, Material temperature),
(Heat treatment: Annealing time and temperature),
(Electrical circuit: Resistance of a resistor, Transistor
gain), (Training: Duration, Course content)
Types of Factors (Xs)
Noise Factors
Neither can the effect difference among levels be reduced, nor can the best level be picked for implementation
Example: Environmental condition, measurement
method, Degradation time, Supply voltage, Tolerances
on control factors
Block Factors
The best level can't be picked for implementation, but the effect difference among levels can be reduced
One or more control factors and/or noise factors are
associated with a (naturally occurring) block
Example: Geographical location, Shift of operation,
Spindle number, Supplier
Analysis Strategy
Analyze Block Factors
Analyze Control and
Noise Factors
Analyze Control Factors
Analysis Steps
Develop detail process map
Identify and summarize potential causes
Scope project (w.r.t. Xs)
Validate measurement
Select probable causes
Develop and test hypotheses
If fully characterized: define and verify root causes; develop Y = f(X)
If partially characterized: identify control and noise factors for experimentation
Notes: Analysis Steps
Develop detail process map
More detailed than the SIPOC map
Relevant to Y (a process map for reducing cycle time will aim at identifying bottlenecks and non-value-added activities)
Identify and summarize potential causes
Brainstorm, then prepare a C&E diagram
Scope project with respect to Xs
Data collection for some Xs may not be possible
Some Xs may be a part of other projects
Notes: Analysis Steps
Validate measurement
Develop operating definition for each X
Perform gauge R & R for each X
Select probable causes
Base lining of each X
Is-Is Not testing
Importance rating
Develop and test hypotheses
Develop hypothesis → Collect data → Test hypothesis

Analyze Phase Deliverables
Process analysis
Process flow chart
CE diagram
Pareto analysis
Is/Is Not analysis
Gauge R&R for the Xs
Data and analysis results
Statistical tests
Graphs
Correlation/regression analysis
Identified root cause or list of short listed control factors
(and the list of control factors thought important but not
considered, if any)
Expected improvement
Not all of these are mandatory; others may be added
Brainstorming
Brainstorming is a part of problem solving,
where a group of people put social
inhibitions and norms aside for generating
as many new ideas as possible (regardless
of their initial worth).
Brainstorming Rules
Non-judgmental: Postpone and withhold your judgment of ideas
Radical: Encourage wild and exaggerated ideas
Quantity: Quantity counts at this stage, not quality
Building up: Build on the ideas put forward by
others
Equality: Every person and every idea has equal
worth
Brainstorming Steps
Make a specific problem statement
Must not suggest solution
Form team
Facilitator: Conduct the session (introduce, manage
time, enforce rules, make the participants feel
comfortable, encourage participation)
Recorder: Record ideas
Members: Creative, capable of out-of-the-box
thinking, may be recruited just for the brainstorming
session, may be from any department
Brainstorming Steps
Prepare the room and materials
A large room having enough space around a table is
preferable
Round shaped table (having no head of the table) is ideal
Flipcharts/white board for displaying problem statement
and recording ideas. Use lots of coloured pens
Writing pad and pen for each participant for writing down their own ideas while other ideas are called out and recorded
Place an object at the centre of the table. Members can
focus their vision on this object instead of looking into the
face of others
Provision for playing soft background music during the session can be beneficial

Brainstorming Steps
Conduct the session
Introduce: Welcome. State purpose and rules
Invite ideas: One-at-a-time. Each member gets equal
opportunity by rotation. Free for all is also permissible
Freewheel: Speed up the flow of ideas to avoid criticism and evaluation. Appreciate crazy ideas. Don't discourage non-radical ideas. Instead of calling members by name, use "we" to enhance group bonding. Take a mid-session break if the process slows down
Record: Record all ideas. Use different colours
End session: Thank the members and ask them to leave everything where it is. At the end of a good session (20-30 minutes) participants will normally be mentally exhausted and there will be more than fifty ideas
Cause and Effect Diagram
A tool for summarizing all the possible causes of a
problem after identifying the causes through
brainstorming
[or]
A tool for discovering all the possible causes of a
problem
Always the result of a group activity
Also called
Fishbone diagram (Because the diagram looks like the
skeleton of a fish)
Ishikawa diagram (after the name of its inventor, Prof. Kaoru Ishikawa)
Skeleton of CE Diagram
[Figure: fishbone skeleton showing main causes A, B, C and D branching into the Effect, with sub-causes and sub-sub-causes on each branch]
Effect = Problem
Sub-causes = Causes of main causes
Sub-sub-causes = Causes of sub-causes
Main Causes in CE Diagram
Manufacturing
4M (Man, Machine, Material, Method)
Service
4P (Policy, Procedure, People, Plant/Technology/Equipment)
Measurement need not be included as a main cause, since
it has already been taken care of during the Measure
phase. However, if it is desired to use the CE diagram for
purposes such as training then the same may be included
Environment may be a key factor
Categorization other than 4M or 4P may be considered
Affinity Diagram may be constructed with the causes
identified through brainstorming for identifying main causes
Benefits of CE Diagram
A complex situation is depicted in an orderly, easy-to-read
format
Being a group activity, provides a comprehensive summary of
the causes
Helps determine root causes
Increases process knowledge
Identifies areas for collecting data
Can be used for training
Can be used for data collection

At the minimum, constructing a CE diagram should lead to
a better understanding of the process
Types of CE Diagram
Three types
Cause enumeration type
Dispersion analysis type
Process classification type
Actually, the first two are similar. They differ only in the method of their construction
First one is the classical type, where causes are first
identified through brainstorming and then summarized as a
CE diagram
In the second case, we do not begin by enumerating
causes. Instead, we start by identifying the main causes
and then develop each branch by asking repeatedly,
Why does the dispersion (in cause x) occur?
CE Diagram: Dispersion Analysis Type
[Figure: dispersion-analysis CE diagram for the effect "Poor Data Quality", with a Sampling branch split into Sampling design, Design implementation and Sample collection; the Sampling design branch carries sub-causes such as Requirement (accuracy, cost), Unit sampling cost, Sample size and Stratification]
Why does the dispersion in Sampling occur? Because of sampling design, design implementation and sample collection
Why does the dispersion in Sampling design occur? Because of requirement variation, and so on
CE Diagram: Process Classification Type
[Figure: process-classification CE diagram for rice cooking, with the process steps Rice, Washing, Cooker and Steaming as the backbone and inputs such as uncooked rice and water feeding into the steps]
Advantage: Easy to make and understand
Disadvantage: Causes reflecting the combined effect of two or more factors belonging to different process steps are difficult to illustrate
Bad CE Diagram
(1) Very simple looking
(2) Contains very few causes
(3) Complicated looking, but full of non-actionable causes
(4) Complicated looking, but contains many causes having no effect
A good CE diagram is one that is easy to use, leads to action, and whose actions are likely to yield results
Hypothesis Testing
Purpose
You have some claim (belief) about the
parameter(s) of one or more populations
You want to verify the claim by collecting
appropriate data
Data either support or contradict the claim

Population, Parameter, Claim and Statistic

Population (assumption)*   Parameters@        Claims (examples)          Statistics for simple comparison#
Normal (one)               μ, σ               μ = 20, μ > 30, σ > 2      X̄, s
Binomial                   p                  p = .2, p < .1             Observed proportion defective
Normal (two)               μ1, μ2, σ1, σ2     μ1 ≠ μ2, σ1 > σ2           X̄1 and X̄2, s1 and s2

* Can be more than one of any form. In non-parametric methods, no assumption is made about the form of the population distribution
@ Values of some of the parameters may be known
# Statistic = function of sample values. Test statistics are, of course, a little more complicated than the statistics for simple comparison

What is a Hypothesis?
A hypothesis is an assumption about a population parameter.
A parameter is a characteristic of the population, like its mean or variance.
The parameter must be identified before analysis.
Example: "I assume the mean GPA of this class is 3.5!"

H0: The Null Hypothesis
States the assumption (numerical) to be tested
e.g. The grade point average of juniors is at least 3.0 (H0: μ ≥ 3.0)
Begin with the assumption that the null hypothesis is TRUE
(Similar to the notion of innocent until proven guilty)
Refers to the status quo
Always contains the = sign
The null hypothesis may or may not be rejected

H1: The Alternative Hypothesis
The opposite of the null hypothesis
e.g. The grade point average of juniors is less than 3.0 (H1: μ < 3.0)
Challenges the status quo
Never contains the = sign
The alternative hypothesis may or may not be accepted
Is generally the hypothesis that is believed to be true by the researcher

Hypothesis Testing Process
Population: assume the population mean age is 50 (null hypothesis H0: μ = 50)
Sample: the sample mean is 20
Is X̄ = 20 likely if μ = 50? No, not likely!
So we REJECT the null hypothesis

Reason for Rejecting H0
[Figure: sampling distribution of X̄ centred at μ = 50, with the observed sample mean of 20 far out in the tail]
It is unlikely that we would get a sample mean of this value (20) if in fact the population mean were 50. Therefore, we reject the null hypothesis that μ = 50.

Level of Significance, α
Defines unlikely values of the sample statistic if the null hypothesis is true
Called the rejection region of the sampling distribution
Designated α (alpha)
Typical values are 0.01, 0.05, 0.10
Selected by the researcher at the start
Provides the critical value(s) of the test


α and the Rejection Region
H0: μ ≥ 3, H1: μ < 3 → one rejection region of size α in the lower tail
H0: μ ≤ 3, H1: μ > 3 → one rejection region of size α in the upper tail
H0: μ = 3, H1: μ ≠ 3 → two rejection regions of size α/2, one in each tail
The critical value(s) mark the boundary of the rejection region(s)

Testing Errors
H0: Innocent

Jury Trial                                 Hypothesis Test
             Actual Situation                                Actual Situation
Verdict      Innocent    Guilty            Decision          H0 True            H0 False
Innocent     Correct     Error             Do Not Reject H0  Correct (1 − α)    Type II Error (β)
Guilty       Error       Correct           Reject H0         Type I Error (α)   Correct: Power (1 − β)

α and β have an inverse relationship: reduce the probability of one error and the other one goes up.

How to choose α
The choice depends on the cost of the error
Choose α small when the cost of (wrongly) rejecting H0 is high
Examples: convicting an innocent person, heavy investment in automation, before-and-after comparison in a Six Sigma project
Choose α large when you have an interest in changing the status quo
Examples: computerization in a startup company, searching for a special cause, screening of factors during the ANALYZE phase



Hypothesis Testing: Steps
Test the assumption that the true mean grade point average of juniors is at least 3.0
1. State H0:   H0: μ ≥ 3.0
2. State H1:   H1: μ < 3.0
3. Choose α:   α = .05
4. Choose n:   n = 100
5. Collect data: 100 students sampled

Hypothesis Testing: Steps (continued)
Test the assumption that the grade point average of juniors is at least 3.0
6. Choose the test: t Test (or p-value)
7. Compute the test statistic: t computed = -2
8. Set up the critical value(s): t critical = -1.7
9. Make the statistical decision: Reject H0
10. Express the decision: The true mean grade point average is less than 3.0

One Sample t Test
One continuous characteristic, say filled weight
Desired average weight , say 50 grams (known
standard), or at least/at most 50 grams
n random observations from a given population of
filled weights - assumed approximately Normal
Population standard deviation is unknown
Test objective: Based on the sample values, can we say that the mean of the given population differs significantly from the desired value(s)?
t Statistic
t = (X̄ − μ) / (s / √n)
μ : hypothesized population mean
n : number of observations
X̄ : sample mean
s : sample standard deviation
Follows the t distribution with ν = n − 1
ν is conveniently called degrees of freedom
Degrees of Freedom
ν = No. of observations − No. of restrictions imposed on the observations while computing the test statistic
While computing t, only one restriction is imposed, in the process of computing s: since X̄ is estimated from the data, the restriction imposed is Σ(Xi − X̄) = 0

Example: One Tail t-Test
Does an average box of cereal contain more than 368 grams of cereal? A random sample of 36 boxes showed X̄ = 372.5 and s = 15. Test at the α = 0.01 level. σ is not given.
H0: μ ≤ 368
H1: μ > 368
(A government inspector controlling weight might instead use H0: μ ≥ 368)
Example Solution: One Tail
t = (X̄ − μ) / (s / √n) = (372.5 − 368) / (15 / √36) = 1.80
The observed t = 1.80 does not fall in the rejection region (the upper-tail critical value for α = 0.01 with 35 d.f. is about 2.44), so H0 is not rejected.
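For readers who want to check the arithmetic, here is a minimal sketch in Python, assuming SciPy is available (it is not part of the original course material):

import math
from scipy import stats

# Cereal example: H0: mu <= 368 vs H1: mu > 368, alpha = 0.01
n, xbar, s, mu0, alpha = 36, 372.5, 15.0, 368.0, 0.01

t0 = (xbar - mu0) / (s / math.sqrt(n))        # observed t statistic
t_crit = stats.t.ppf(1 - alpha, df=n - 1)     # upper-tail critical value
p_value = stats.t.sf(t0, df=n - 1)            # one-sided p-value

print(f"t0 = {t0:.2f}, critical = {t_crit:.2f}, p = {p_value:.3f}")
# t0 = 1.80 is below the critical value (about 2.44), so H0 is not rejected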

Example: Z Test for Proportion
Problem: A marketing company claims that it receives 4% responses from its mailing.
Approach: To test this claim, a random sample of 500 mailings was surveyed, yielding 25 responses. So the observed p = .05 is greater than the claimed ps = .04.
Solution: Z test (assume α = .05). Why test at all, since p > ps already? Because H1 is two-sided and the observed difference may simply be sampling variation.
Z Test for Proportion: Solution
Z = (p − ps) / √[ps(1 − ps)/n] = (.05 − .04) / √(.04 × .96 / 500) = 1.14
Critical values: ±1.96 (rejection regions of .025 in each tail)
Decision: Do not reject H0
Conclusion: We do not have sufficient evidence to reject the company's claim of a 4% response rate.
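A similar sketch for the proportion test, again assuming SciPy is available:

import math
from scipy import stats

# Mailing example: claimed response rate p0 = 0.04, observed 25/500 = 0.05
n, x, p0, alpha = 500, 25, 0.04, 0.05
p_hat = x / n

z0 = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)   # z statistic under H0
z_crit = stats.norm.ppf(1 - alpha / 2)             # two-sided critical value
p_value = 2 * stats.norm.sf(abs(z0))               # two-sided p-value

print(f"z0 = {z0:.2f}, critical = +/-{z_crit:.2f}, p = {p_value:.3f}")
# z0 = 1.14 lies inside (-1.96, 1.96): do not reject the 4% claim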
Testing Against a Standard
Here the population variance is tested against a standard value σ0².
Hypotheses
1. H0: σ² = σ0²   HA: σ² ≠ σ0²
2. H0: σ² ≤ σ0²   HA: σ² > σ0²
3. H0: σ² ≥ σ0²   HA: σ² < σ0²
Test statistic
χ0² = (n − 1) s² / σ0²
s = sample standard deviation computed from n observations
Reject H0 if
1. χ0² > χ²α/2, n−1  or  χ0² < χ²1−α/2, n−1
2. χ0² > χ²α, n−1
3. χ0² < χ²1−α, n−1
χ²1−α/2, n−1 (or χ²1−α, n−1) = lower α/2 (or α) percentage point of the chi-square distribution with n − 1 d.f.
χ²α/2, n−1 (or χ²α, n−1) = upper α/2 (or α) percentage point of the chi-square distribution with n − 1 d.f.
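The slides give no worked example for this test, so the sketch below uses made-up numbers (n = 25, s = 2.5, σ0 = 2.0) purely for illustration, assuming SciPy is available:

from scipy import stats

# Hypothetical example (not from the slides): does the process standard
# deviation exceed the standard sigma0 = 2.0?  H0: sigma^2 <= 4, H1: sigma^2 > 4
n, s, sigma0, alpha = 25, 2.5, 2.0, 0.05

chi2_0 = (n - 1) * s**2 / sigma0**2               # test statistic
chi2_crit = stats.chi2.ppf(1 - alpha, df=n - 1)   # upper alpha percentage point
p_value = stats.chi2.sf(chi2_0, df=n - 1)

print(f"chi2_0 = {chi2_0:.2f}, critical = {chi2_crit:.2f}, p = {p_value:.3f}")
# Reject H0 if chi2_0 exceeds the critical value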
Two Samples: Testing Mean
[Figure: dotplots of the viscosity of material 1 (mean X̄1) and material 2 (mean X̄2), drawn for three different degrees of separation between the two samples]
Is the viscosity of the two materials significantly different? Depending on how much the samples overlap, the eyeball answer ranges from Yes, through Perhaps yes, to Not sure.
Testing Two Population Means
Two tail t Test
Hypotheses
H0: μ1 = μ2,   HA: μ1 ≠ μ2
Assumption
The two populations are normal and σ1 = σ2 = σ
Data
Sample 1: n1 random observations from population 1
Sample 2: n2 random observations from population 2
Basic statistics
X̄1, X̄2, s1, s2
Test statistic
t0 = (X̄1 − X̄2) / [sp √(1/n1 + 1/n2)]
Pooled standard deviation
sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
Reject H0 if |t0| > tα/2, ν, where ν = df = n1 + n2 − 2
Significance level α = .01, .05 or .1
t Test: An Example
Hardness after hardening by two quenching methods
Data
Sample no.         1    2    3    4    5    6    7    8    9   10
Oil quenching    145  150  153  148  141  152  146  154  139  148
Water quenching  155  153  150  158  143  149  161  155  154  146
[Figure: dotplot of oil quenching vs water quenching hardness. What can you conclude?]
              Mean    s.d.
Oil Q        147.6    4.97
Water Q      152.4    5.46
H0: μoil = μwater,   HA: μoil ≠ μwater
sp = √[(9 × 4.97² + 9 × 5.46²) / (10 + 10 − 2)] = 5.221
t0 = (147.6 − 152.4) / [5.221 × √(2/10)] = −2.06
t critical = t.025,18 = 2.101 > |t0|
H0 cannot be rejected at the 95% level of confidence
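The same comparison can be reproduced with SciPy's pooled two-sample t test (assuming SciPy is available):

from scipy import stats

oil   = [145, 150, 153, 148, 141, 152, 146, 154, 139, 148]
water = [155, 153, 150, 158, 143, 149, 161, 155, 154, 146]

# Pooled (equal-variance) two-sample t test, H0: mu_oil = mu_water
t0, p_value = stats.ttest_ind(oil, water, equal_var=True)
print(f"t0 = {t0:.2f}, p = {p_value:.3f}")
# t0 is about -2.06 with p just above 0.05, so H0 is not rejected at alpha = 0.05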
t Test Example (contd.)
If, based on technical considerations, we had expected water quenching to increase hardness, then we would have formed the hypotheses as
H0: μwater ≤ μoil,   HA: μwater > μoil
H0 will be rejected if t > t critical = t.05,18
From the t table, t.05,18 = 1.734; since the observed t = 2.06 > 1.734, the null hypothesis is rejected at the 95% level of confidence
Hence, we conclude that water quenching results in higher mean hardness than oil quenching
But remember, hypotheses are to be formed (incorporating technical knowledge) before data collection
Paired t Test An Example
A company wanted to introduce one of the two types of
pizza they have developed recently. In order to select the
one likely to be more popular, they conducted the
following experiment. Ten persons were each given two pizzas (one of type A and one of type B) to eat. Each pizza used was carefully weighed to be 16 oz. After 15 minutes, the remainder of each person's two pizzas was weighed. Here is the data.

Subject 1 2 3 4 5 6 7 8 9 10
Pizza A 12.9 5.7 16.0 14.3 2.4 1.6 14.6 10.2 4.3 6.6
Pizza B 16.0 7.5 16.0 14.0 13.2 3.4 15.5 11.3 15.4 10.6
Which type of pizza, if any, people seem to like more?
Pizza Experiment: Solution
Subject     1     2     3     4     5     6     7     8     9    10   Mean    s.d.
Pizza A  12.9   5.7  16.0  14.3   2.4   1.6  14.6  10.2   4.3   6.6   8.86    5.40
Pizza B  16.0   7.5  16.0  14.0  13.2   3.4  15.5  11.3  15.4  10.6  12.29    4.18
Two sample t test
Hypotheses: H0: μA = μB,   H1: μA ≠ μB
Pooled s.d. sp = √[(5.40² + 4.18²) / 2] = 4.83
t = (12.29 − 8.86) / [4.83 × √(2/10)] = 3.43 / 2.16 = 1.59 < t.025,18 = 2.101
H0 cannot be rejected
Paired t test
Subject     1     2     3     4     5     6     7     8     9    10   Mean    s.d.
B − A     3.1   1.8   0.0  −0.3  10.8   1.8   0.9   1.1  11.1   4.0   3.43    4.17
Hypotheses: H0: μB−A = 0,   H1: μB−A ≠ 0
t = 3.43 / (4.17 / √10) = 2.60 > t.025,9 = 2.262
H0 is rejected
There is sufficient evidence to claim that people prefer Pizza A over Pizza B
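The contrast between the two analyses can be reproduced with SciPy (assuming it is available); ttest_rel performs the paired test on the within-subject differences:

from scipy import stats

pizza_a = [12.9, 5.7, 16.0, 14.3, 2.4, 1.6, 14.6, 10.2, 4.3, 6.6]
pizza_b = [16.0, 7.5, 16.0, 14.0, 13.2, 3.4, 15.5, 11.3, 15.4, 10.6]

# Unpaired (two-sample) t test ignores the person-to-person pairing
t_unpaired, p_unpaired = stats.ttest_ind(pizza_b, pizza_a, equal_var=True)

# Paired t test works on the within-subject differences B - A
t_paired, p_paired = stats.ttest_rel(pizza_b, pizza_a)

print(f"unpaired: t = {t_unpaired:.2f}, p = {p_unpaired:.3f}")  # about 1.59, not significant
print(f"paired:   t = {t_paired:.2f}, p = {p_paired:.3f}")      # about 2.60, significant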
F Test: Testing Variances of Two Populations
Hypotheses
1. H0: σ1² = σ2²   H1: σ1² ≠ σ2²
2. H0: σ1² = σ2²   H1: σ1² > σ2²
3. H0: σ1² = σ2²   H1: σ1² < σ2²
Test statistic
1. F0 = s1² / s2², with s1² ≥ s2²
2. F0 = s1² / s2²
3. F0 = s2² / s1²
H0 rejection criteria
1. F0 > Fα/2, n1−1, n2−1
2. F0 > Fα, n1−1, n2−1
3. F0 > Fα, n2−1, n1−1
In the previous pizza example, the standard deviations in the two cases are 5.40 and 4.18. Assuming a two-sided alternative,
F0 = 5.40² / 4.18² = 1.67 < F.025, 9, 9 = 4.03
H0 not rejected: the two variances are not significantly different
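A sketch of the same F comparison in Python, assuming SciPy is available:

from scipy import stats

s1, s2, n1, n2, alpha = 5.40, 4.18, 10, 10, 0.05

# Two-sided F test: put the larger variance in the numerator
f0 = s1**2 / s2**2
f_crit = stats.f.ppf(1 - alpha / 2, dfn=n1 - 1, dfd=n2 - 1)
p_value = 2 * stats.f.sf(f0, dfn=n1 - 1, dfd=n2 - 1)

print(f"F0 = {f0:.2f}, critical = {f_crit:.2f}, p = {p_value:.3f}")
# F0 = 1.67 < 4.03, so the two variances are not significantly different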
Limitations of t
So far we have used t test for comparing
Mean of one group with a standard
Mean of two groups
How do we compare the mean of three or more groups?
Performing t test for all possible combinations of two
means is inadequate for several reasons
Tedious
Inefficient, since each time only a part of the information is used
Risky: there are so many comparisons to be made that some may turn out significant simply by chance
Basic ANOVA Situation
We want to compare means of two or more
groups
Analysis of variance (ANOVA) makes use of an
omnibus F test that tells us if there is any
significant difference anywhere among the groups
If F test says no significant difference then there is
no point in searching further
If F test indicates significance then we may use
other tools to find out where the difference is
An Example ANOVA Situation
Subjects: 25 patients with blisters
Treatments: Treatment A, Treatment B, Placebo
Measurement: # of days until blisters heal
Data [and means]:
A: 5, 6, 6, 7, 7, 8, 9, 10 [7.25]
B: 7, 7, 8, 9, 9, 10, 10, 11 [8.875]
P: 7, 9, 9, 10, 10, 10, 11, 12, 13 [10.11]
Are these differences significant?
Two Sources of Variability
In ANOVA, an estimate of variability between groups is
compared with variability within groups.
Between-group variation is the variation among the means of the
different treatment conditions due to chance (random sampling
error) and treatment effects, if any exist.
Within-group variation is the variation due to chance (random
sampling error) among individuals given the same treatment.
Total variation in data
= Within-group variation (variation due to chance)
+ Between-group variation (variation due to chance and treatment effects, if any)
Variability Between Groups
There is a lot of variability from one mean to the next.
Large differences between means probably are not due to
chance.
It is difficult to imagine that all six groups are random
samples taken from the same population.
The null hypothesis is rejected, indicating a treatment
effect in at least one of the groups.
Variability Within Groups
Same amount of variability between group means.
However, there is more variability within each group.
The larger the variability within each group, the less
confident we can be that we are dealing with
samples drawn from different populations.

The F Ratio
The ANOVA F statistic is the ratio of the between-group variation to the within-group variation:
F = Between Group Variation / Within Group Variation
A large F is evidence against H0, since it indicates that there is more difference between groups than within groups
[Figure: two illustrations of the ratio, one where variability between groups clearly exceeds variability within groups (F > 1) and one where the two are comparable (F = 1)]
Blister Experiment: Minitab ANOVA Output

Analysis of Variance for days
Source      DF     SS     MS     F      P
Treatment    2  34.74  17.37  6.45  0.006
Error       22  59.26   2.69
Total       24  94.00

Treatment DF = 1 less than the # of groups
Error DF = # of data values − # of groups (equals the df for each group added together)
Total DF = 1 less than the # of individuals (just like other situations)
Minitab ANOVA Output

Analysis of Variance for days
Source      DF     SS     MS     F      P
Treatment    2  34.74  17.37  6.45  0.006
Error       22  59.26   2.69
Total       24  94.00

SS stands for sum of squares
Treatment SS = Σ (x̄i − x̄)², Error SS = Σ (xij − x̄i)², Total SS = Σ (xij − x̄)²
(xij = observation j in group i, x̄i = mean of group i, x̄ = grand mean; sums run over all observations)
Minitab ANOVA Output

Analysis of Variance for days
Source      DF     SS     MS     F      P
Treatment    2  34.74  17.37  6.45  0.006
Error       22  59.26   2.69
Total       24  94.00

MSG = SSG / DFG
MSE = SSE / DFE
F = MSG / MSE
The P-value comes from the F(DFG, DFE) distribution
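The blister-experiment ANOVA can be reproduced with SciPy's f_oneway (assuming SciPy is available):

from scipy import stats

a = [5, 6, 6, 7, 7, 8, 9, 10]
b = [7, 7, 8, 9, 9, 10, 10, 11]
p = [7, 9, 9, 10, 10, 10, 11, 12, 13]

# One-way ANOVA: omnibus F test for any difference among the group means
f0, p_value = stats.f_oneway(a, b, p)
print(f"F = {f0:.2f}, p = {p_value:.3f}")
# F is about 6.45 with p about 0.006, matching the Minitab output above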
Contingency Table

Group        Defective      Non-defective      Total
Supplier 1     13 (26%)        37 (74%)          50
Supplier 2      6 (4%)        144 (96%)         150
Total          19             181               200

Y = Outcome; Y is discrete
It's a 2 × 2 contingency table
Is there any difference between the two suppliers?
Generally speaking, we want to know if there is any association between the groups and the outcomes
In the above we have two categorical variables, each at two levels. In general we can have a p × q × r × … table
We shall discuss only p × q
Expected Frequencies

Group        Defective            Non-defective          Total
Supplier 1     13 (expected 5)       37 (expected 45)       50
Supplier 2      6 (expected 14)     144 (expected 136)     150
Total          19                   181                    200

The expected frequency is the frequency expected under H0, i.e. no supplier difference
Under H0, probability of getting a defective = 19 / 200 = 0.095
So, under H0, the numbers of defectives expected in the samples are
Supplier 1: 50 × 0.095 = 5 (rounded to the nearest integer)
Supplier 2: 150 × 0.095 = 14 (rounded to the nearest integer)
Expected frequencies of non-defectives can be found similarly, or simply as (50 − 5) = 45 (supplier 1) and (150 − 14) = 136 (supplier 2)
Computing Expected Frequencies
Expected frequency of defectives for supplier 1 = 50 × (19 / 200) = (50 × 19) / 200 = 5
In general, the expected frequency of cell (i, j) is given by
Eij = (fi. × f.j) / f..
fi. = total frequency of the i-th row
f.j = total frequency of the j-th column
f.. = grand total
For example, E22 = (150 × 181) / 200 = 136
Contingency Table χ² Test
χ² = Σi Σj (Oij − Eij)² / Eij
Eij = expected frequency, Oij = observed frequency
H0 will be rejected if χ² > χ²α, (p−1)(q−1)
Expected frequencies should not be rounded; for our data they are 4.75, 45.25, 14.25 and 135.75
χ² = (13 − 4.75)²/4.75 + (37 − 45.25)²/45.25 + (6 − 14.25)²/14.25 + (144 − 135.75)²/135.75 = 21.11
χ².05,1 = 3.84
H0 is rejected: the two suppliers are significantly different
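The same χ² test can be reproduced with SciPy (assuming it is available); note correction=False, since the slides use the uncorrected statistic:

from scipy import stats

# Observed frequencies: rows = suppliers, columns = (defective, non-defective)
observed = [[13, 37],
            [6, 144]]

chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
print(expected)   # unrounded expected frequencies: 4.75, 45.25, 14.25, 135.75
# chi2 is about 21.11 > 3.84, so the suppliers differ significantly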
Correlation
In practice we frequently find that a group of two
or more variables move together
For simplicity, let us assume two continuous
variables
The variables may be (Y1, Y2), (X1, X2) or (X, Y)
At this moment we are also not concerned with whether the correlation is meaningful!
How do we measure the strength of correlation
between the two variables?
Correlation Coefficient
The most popular measure of association between
two continuous variables is the correlation
coefficient r
It is defined in such a manner so that it varies
Between -1 (perfect negative correlation)
Through 0 (absolutely no correlation)
To +1 (perfect positive correlation)
However we should never jump to compute the
value of r. The first step in studying association
between two variables is to plot the data in the
form of a Scatter Diagram
Scatter Diagram and r
[Figure: six scatter diagrams illustrating r = +1.0, r = −1.0, r = +0.7, r = +0.3, r = −0.3 and r = 0]
Usefulness of Scatter Diagram
Drawing scatter diagram should always be the first step in
correlation analysis for the following reasons
Guard against gross computational error
Detects outliers, if any (may be genuine or data error)
Detects influential points, if any (may be genuine or data error)
Detects groups in data, if any
Detects nonlinearity (r is a measure of linear association only)
Provides a quick approximate solution
Easy to understand
Computing r
Data: (Var 1, Var 2) = (x, y)
r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²] = Sxy / √(Sxx Syy)
Formulae for convenient and efficient computing:
Sxy = Σ(xi − x̄)(yi − ȳ) = Σxiyi − (Σxi)(Σyi)/n
Sxx = Σ(xi − x̄)² = Σxi² − (Σxi)²/n
Syy = Σ(yi − ȳ)² = Σyi² − (Σyi)²/n

 #      x    y    x²    y²    xy
 1      2    6     4    36    12
 2      6    8    36    64    48
 3      1    4     1    16     4
 4      4    7    16    49    28
 5      3    4     9    16    12
 6      5    9    25    81    45
 7      8   10    64   100    80
Total  29   48   155   362   229

Sxy = 229 − (29 × 48)/7 = 30.14
Sxx = 155 − (29 × 29)/7 = 34.86
Syy = 362 − (48 × 48)/7 = 32.86
r = 30.14 / √(34.86 × 32.86) = 0.891
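A quick check of the hand computation in Python, assuming NumPy and SciPy are available:

import numpy as np
from scipy import stats

x = np.array([2, 6, 1, 4, 3, 5, 8], dtype=float)
y = np.array([6, 8, 4, 7, 4, 9, 10], dtype=float)

r, p_value = stats.pearsonr(x, y)   # correlation coefficient and its p-value
print(f"r = {r:.3f}, p = {p_value:.4f}")
# r is about 0.891, matching the hand computation above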
Regression Analysis
As in correlation analysis, we have n observations of the form (X, Y) or (X1, X2, X3, …, Xk, Y)
Y is called the dependent variable and the Xs the independent variables
The objective is to develop an equation relating Y to the Xs for predicting Y in the data range
Simple linear regression
Y = α + βX + ε, where ε is the random error component
Multiple linear regression
Y = α + β1X1 + β2X2 + β3X3 + … + βkXk + ε
All or a few of the Xs may be functions of X1
Nonlinear regression
Here we shall discuss only simple linear regression
Regression Analysis: An Example
Given one variable (X), goal: predict Y
Example: given Years of Experience, predict Salary

X (years)   Y (salary in Rs. 1,000)
  3           30
  8           57
  9           64
 13           72
  3           36
  6           43
 11           59
 21           90
  1           20

Questions: When X = 10, what is Y? When X = 25, what is Y?
This is known as regression
Obtaining the Best Fit Line
1. Plot the data
[Figure: scatterplot of Salary vs Years of Experience]
Obtaining the Best Fit Line
2. Then obtain the best fit line
[Figure: the same scatterplot with the fitted line; Linear Regression: Y = 3.5X + 23.2]
Obtaining the Best Fit Line
… by minimizing the deviations
Usually, we minimize the square of the deviations for estimating α and β
Hence the name Least Squares Method
Regression Formulae
Fitted line: Ŷ = α + βX
β = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
α = ȳ − β x̄
Using our earlier notation, β = Sxy / Sxx
Exercise: Estimate α and β for the salary data
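A sketch of the exercise in Python, assuming NumPy and SciPy are available (linregress gives the least-squares estimates; the slide's Y = 3.5X + 23.2 is a rounded version of the same fit):

import numpy as np
from scipy import stats

years  = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1], dtype=float)
salary = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20], dtype=float)

fit = stats.linregress(years, salary)   # least-squares estimates of alpha and beta
print(f"beta (slope)      = {fit.slope:.2f}")      # about 3.46 (slide rounds to 3.5)
print(f"alpha (intercept) = {fit.intercept:.2f}")  # about 23.5 (slide shows 23.2)
print(f"prediction at X = 10: {fit.intercept + fit.slope * 10:.1f}")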
Goodness of Fit
R² = Sum of squares due to regression / Total sum of squares = SSR / TSS = β Sxy / Syy
Note that R² is nothing but the square of the correlation coefficient r
For practical purposes, the usefulness of the equation is judged by the standard error s
Error sum of squares SSE = TSS − SSR
Mean square error MSE = SSE / (n − 2)
Standard error s = √MSE
Approximate prediction band ≈ ±2s
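These goodness-of-fit quantities can be computed for the salary fit as follows (a sketch assuming NumPy and SciPy are available):

import numpy as np
from scipy import stats

years  = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1], dtype=float)
salary = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20], dtype=float)

fit = stats.linregress(years, salary)
predicted = fit.intercept + fit.slope * years

sse = np.sum((salary - predicted) ** 2)        # error sum of squares
tss = np.sum((salary - salary.mean()) ** 2)    # total sum of squares
r_squared = 1 - sse / tss                      # equals fit.rvalue ** 2
mse = sse / (len(years) - 2)                   # mean square error
s = np.sqrt(mse)                               # standard error

print(f"R^2 = {r_squared:.3f}, s = {s:.2f}, approx prediction band = +/- {2 * s:.1f}")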
