
SW388R7 Data Analysis & Computers II

Strategy for Complete Regression Analysis

Slide 1

Additional issues in regression analysis

Assumption of independence of errors
Influential cases
Multicollinearity
Adjusted R²
Strategy for solving problems
Sample problems
Complete regression analysis


Assumption of independence of errors - 1

Slide 2

Multiple regression assumes that the errors are independent and there is no serial correlation. Errors are the residuals, or differences between the actual score for a case and the score estimated using the regression equation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case.

The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 to 2.50.
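To make the statistic concrete, here is a minimal Python sketch of the Durbin-Watson computation, assuming residuals is a NumPy array of regression residuals in case order:

    import numpy as np

    def durbin_watson(residuals):
        # Ratio of summed squared successive differences to summed squared residuals.
        diff = np.diff(residuals)
        return np.sum(diff ** 2) / np.sum(residuals ** 2)

    # Values near 2 indicate uncorrelated residuals; 1.50 to 2.50 is the
    # acceptable range used in these slides.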


Assumption of independence of errors - 2

Slide 3

Serial correlation is more of a concern in analyses that involve time series.

If it does occur in relationship analyses, its presence can usually be understood by changing the sequence of cases and running the analysis again.

If the problem with serial correlation disappears, it may be ignored.

If the problem with serial correlation remains, it can usually be handled by using the difference between successive data points as the dependent variable.
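As a minimal sketch of that differencing remedy, assuming y is a NumPy array of dependent-variable scores in case order (the values here are hypothetical):

    import numpy as np

    y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])  # hypothetical scores in case order
    dy = np.diff(y)  # successive differences become the new dependent variable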


Multicollinearity - 1

Slide 4

Multicollinearity is a problem in regression analysis that occurs when two independent variables are highly correlated, e.g. r = 0.90 or higher.

The relationship between the independent variables and the dependent variable is distorted by the very strong relationship between the independent variables, leading to the likelihood that our interpretation of relationships will be incorrect.

In the worst case, if the variables are perfectly correlated, the regression cannot be computed.

SPSS guards against the failure to compute a regression solution by arbitrarily omitting the collinear variable from the analysis.


Multicollinearity - 2

Slide 5

Multicollinearity is detected by examining the tolerance for each independent variable. Tolerance is the amount of variability in one independent variable that is not explained by the other independent variables.

Tolerance values less than 0.10 indicate collinearity.

If we discover collinearity in the regression output, we should reject the interpretation of the relationships as false until the issue is resolved.

Multicollinearity can be resolved by combining the highly correlated variables through principal component analysis, or by omitting a variable from the analysis.
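A minimal Python sketch of the tolerance computation, assuming X is a pandas DataFrame of the independent variables. Tolerance is 1 minus the R² from regressing each variable on the remaining ones (equivalently, 1 / VIF):

    import pandas as pd
    import statsmodels.api as sm

    def tolerances(X: pd.DataFrame) -> pd.Series:
        values = {}
        for col in X.columns:
            others = sm.add_constant(X.drop(columns=col))
            r2 = sm.OLS(X[col], others).fit().rsquared
            values[col] = 1.0 - r2  # tolerance below 0.10 signals collinearity
        return pd.Series(values)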


Adjusted R²

Slide 6

The coefficient of determination, R², which measures the strength of a relationship, can usually be increased simply by adding more variables. In fact, if the number of variables equals the number of cases, it is often possible to produce a perfect R² of 1.00.

Adjusted R² is a measure which attempts to reduce the inflation in R² by taking into account the number of independent variables and the number of cases.

If there is a large discrepancy between R² and adjusted R², extraneous variables should be removed from the analysis and R² recomputed.
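For reference, the standard adjustment is: adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of cases and k is the number of independent variables.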


Influential cases

Slide 7

The degree to which outliers affect the regression solution depends upon where the outlier is located relative to the other cases in the analysis. Outliers whose location has a large effect on the regression solution are called influential cases.

Whether or not a case is influential is measured by Cook's distance.

Cook's distance is an index measure; it is compared to a critical value based on the formula:

4 / (n - k - 1)

where n is the number of cases and k is the number of independent variables.

If a case has a Cook's distance greater than the critical value, it should be examined for exclusion.
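A minimal statsmodels sketch of this check, assuming model is a fitted OLS results object:

    from statsmodels.stats.outliers_influence import OLSInfluence

    influence = OLSInfluence(model)
    cooks_d = influence.cooks_distance[0]               # one Cook's distance per case
    critical = 4.0 / (model.nobs - model.df_model - 1)  # 4 / (n - k - 1)
    flagged = cooks_d > critical                        # cases to examine for exclusion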


Overall strategy for solving problems

Slide 8
1. Run a baseline regression using the method for including variables implied by the problem statement to find the initial strength of the relationship, baseline R².
2. Test for useful transformations to improve normality, linearity, and homoscedasticity.
3. Substitute transformed variables and check for outliers and influential cases.
4. If R² from the regression model using transformed variables and omitting outliers is at least 2% better than baseline R², select it for interpretation; otherwise select the baseline model.
5. Validate and interpret the selected regression model.


Problem 1

Slide 9

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 10

Dissecting problem 1 - 1

When we test for influential cases using Cook's distance, we need to compute a critical value for comparison using the formula 4 / (n - k - 1), where n is the number of cases and k is the number of independent variables. The correct value (0.0160) is provided in the problem.

The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.05 as alpha for the regression, but 0.01 for testing assumptions.

The random number seed (788035) for the split-sample validation is also provided.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

After evaluating assumptions, outliers, and influential cases, we will decide whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.

Slide 11

Dissecting problem 1 - 2

When a problem states that there is a relationship between some independent variables and a dependent variable, we do standard multiple regression.

The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei]. The variable that is the target of the relationship is the dependent variable (DV): "how many in family earned money" [earnrs].

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 12

Dissecting problem 1 - 3

In order for a problem to be true, we will have to find that there is a statistically significant relationship between the set of IVs and the DV, and the strength of the relationship stated in the problem must be correct.

In addition, the relationship or lack of relationship between the individual IVs and the DV must be identified correctly, and must be characterized correctly.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 13

LEVEL OF MEASUREMENT

Multiple regression requires that the dependent variable be metric and the independent variables be metric or dichotomous.

"How many in family earned money" [earnrs] is an interval-level variable, which satisfies the level of measurement requirement.

"Age" [age] and "respondent's socioeconomic index" [sei] are interval-level variables, which satisfies the level of measurement requirements for multiple regression analysis.

"Sex" [sex] is a dichotomous or dummy-coded nominal variable which may be included in multiple regression analysis.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 14

The baseline regression

We begin our analysis by running a standard multiple regression analysis with earnrs as the dependent variable and age, sex, and sei as the independent variables.

Select Enter as the Method for including variables to produce a standard multiple regression.

Click on the Statistics button to select the statistics we will need for the analysis.
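Outside SPSS, a minimal Python sketch of the same baseline model, assuming df is a pandas DataFrame with the GSS columns earnrs, age, sex, and sei:

    import statsmodels.api as sm

    X = sm.add_constant(df[["age", "sex", "sei"]])  # Enter method: all IVs at once
    model = sm.OLS(df["earnrs"], X, missing="drop").fit()
    print(model.rsquared)                       # baseline R² for later comparison
    print(sm.stats.durbin_watson(model.resid))  # independence-of-errors check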

Slide 15

The baseline regression

Retain the default checkboxes for Estimates and Model fit to obtain the baseline R², which will be used to determine whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.

Mark the Descriptives checkbox to get the number of cases available for the analysis.

Mark the checkbox for the Durbin-Watson statistic, which will be used to test the assumption of independence of errors.

Slide 16

Initial sample size

Descriptive Statistics

          Mean     Std. Deviation   N
EARNRS    1.47     1.008            254
AGE       46.62    16.642           254
SEX       1.57     .496             254
SEI       48.601   19.1110          254

The initial sample size before excluding outliers and influential cases is 254. With 3 independent variables, the ratio of cases to variables is 84.7 to 1, satisfying both the minimum and preferred sample size for multiple regression.

If the sample size did not initially satisfy the minimum requirement, regression analysis would not be appropriate.

Slide 17

R² before transformations or removing outliers

The R² of 0.187 is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers/influential cases.

Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 18.7%.

The relationship is statistically significant, though we would not stop if it were not significant because the lack of significance may be a consequence of violation of assumptions or the inclusion of outliers and influential cases.

Slide 18

Assumption of independence of errors: the Durbin-Watson statistic

The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals, i.e., the assumption of independence of errors, which requires that the residuals or errors in prediction do not follow a pattern from case to case.

The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50.

The Durbin-Watson statistic for this problem is 1.849, which falls within the acceptable range.

If the Durbin-Watson statistic were not in the acceptable range, we would add a caution to the findings for a violation of regression assumptions.

Slide 19

Normality of dependent variable: how many in family earned money

We examine the normality of the dependent variable first, then the normality of each metric independent variable and the linearity of its relationship with the dependent variable.

To test the normality of the number of earners in family, run the script: NormalityAssumptionAndTransformations.SBS

First, move the variable EARNRS to the list box of variables to test.

Second, click on the OK button to produce the output.

Slide 20

Normality of dependent variable: how many in family earned money

Descriptives: HOW MANY IN FAMILY EARNED MONEY

                               Statistic   Std. Error
Mean                           1.43        .061
95% CI for Mean, Lower Bound   1.31
95% CI for Mean, Upper Bound   1.56
5% Trimmed Mean                1.37
Median                         1.00
Variance                       1.015
Std. Deviation                 1.008
Minimum                        0
Maximum                        5
Range                          5
Interquartile Range            1.00
Skewness                       .742        .149
Kurtosis                       1.324       .296

The dependent variable "how many in family earned money" [earnrs] does not satisfy the criteria for a normal distribution. The skewness (0.742) fell between -1.0 and +1.0, but the kurtosis (1.324) fell outside the range from -1.0 to +1.0.
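In Python, the same normality screen can be sketched as follows, assuming df is the working DataFrame; pandas' default skew() and kurt() should match the bias-corrected formulas SPSS reports, with kurtosis already centered at 0:

    skew = df["earnrs"].skew()
    kurt = df["earnrs"].kurt()  # excess kurtosis, centered at 0 like SPSS
    normal_enough = (-1.0 <= skew <= 1.0) and (-1.0 <= kurt <= 1.0)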

Slide 21

Normality of dependent variable: how many in family earned money

The logarithmic transformation improves the normality of "how many in family earned money" [earnrs]. In evaluating normality, the skewness (-0.483) and kurtosis (0.309) were both within the range of acceptable values from -1.0 to +1.0.

The square root transformation also has values of skewness and kurtosis in the acceptable range. However, by our order of preference for which transformation to use, the logarithm is preferred to the square root or inverse.

Slide 22

Transformation for how many in family earned money

The logarithmic transformation improves the normality of "how many in family earned money" [earnrs].

We will substitute the logarithmic transformation of how many in family earned money as the dependent variable in the regression analysis.
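As a sketch of the same computation outside SPSS, using the LG10(1+EARNRS) form shown later in these slides (the +1 avoids taking the log of zero for families with no earners), and assuming df is the working DataFrame:

    import numpy as np

    df["logearn"] = np.log10(1 + df["earnrs"])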

Slide 23

Adding a transformed variable

Before testing the assumptions for the independent variables, we need to add the transformation of the dependent variable to the data set.

First, move the variable that we want to transform to the list box of variables to test.

Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes.

Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable.

Fourth, click on the OK button to produce the output.

Slide 24

The transformed variable in the data editor

If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set.

Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.

Slide 25

Normality/linearity of independent variable: age

After evaluating the dependent variable, we examine the normality of each metric independent variable and the linearity of its relationship with the dependent variable.

To test the normality of age, run the script: NormalityAssumptionAndTransformations.SBS

First, move the independent variable AGE to the list box of variables to test.

Second, click on the OK button to produce the output.

Slide 26

Normality/linearity of independent variable: age

Descriptives: AGE OF RESPONDENT

                               Statistic   Std. Error
Mean                           45.99       1.023
95% CI for Mean, Lower Bound   43.98
95% CI for Mean, Upper Bound   48.00
5% Trimmed Mean                45.31
Median                         43.50
Variance                       282.465
Std. Deviation                 16.807
Minimum                        19
Maximum                        89
Range                          70
Interquartile Range            24.00
Skewness                       .595        .148
Kurtosis                       -.351       .295

In evaluating normality, the skewness (0.595) and kurtosis (-0.351) were both within the range of acceptable values from -1.0 to +1.0.

Slide 27

Normality/linearity of independent variable: age

To evaluate the linearity of age and the log transformation of number of earners in the family, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS

First, move the transformed dependent variable LOGEARN to the text box for the dependent variable.

Second, move the independent variable, AGE, to the list box for independent variables.

Third, click on the OK button to produce the output.

Slide 28

Normality/linearity of independent variable: age

[Correlation table: the log transformation of EARNRS [LG10(1+EARNRS)] correlated with AGE OF RESPONDENT and its transformations (logarithm, square, square root, inverse); r for AGE = -.493, p < .001, n = 269. **. Correlation is significant at the 0.01 level (2-tailed).]

The evidence of linearity in the relationship between the independent variable "age" [age] and the dependent variable "log transformation of how many in family earned money" [logearn] was the statistical significance of the correlation coefficient (r = -0.493). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.

The independent variable "age" [age] satisfies the criteria for both the assumption of normality and the assumption of linearity with the dependent variable "log transformation of how many in family earned money" [logearn].

Slide 29

Normality/linearity of independent variable: respondent's socioeconomic index

To test the normality of respondent's socioeconomic index, run the script: NormalityAssumptionAndTransformations.SBS

First, move the independent variable SEI to the list box of variables to test.

Second, click on the OK button to produce the output.

Slide 30

Normality/linearity of independent variable: respondent's socioeconomic index

Descriptives: RESPONDENT'S SOCIOECONOMIC INDEX

                               Statistic   Std. Error
Mean                           48.710      1.1994
95% CI for Mean, Lower Bound   46.348
95% CI for Mean, Upper Bound   51.072
5% Trimmed Mean                47.799
Median                         39.600
Variance                       366.821
Std. Deviation                 19.1526
Minimum                        19.4
Maximum                        97.2
Range                          77.8
Interquartile Range            31.100
Skewness                       .585        .153
Kurtosis                       -.862       .304

The independent variable "respondent's socioeconomic index" [sei] satisfies the criteria for the assumption of normality, but does not satisfy the assumption of linearity with the dependent variable "log transformation of how many in family earned money" [logearn]. In evaluating normality, the skewness (0.585) and kurtosis (-0.862) were both within the range of acceptable values from -1.0 to +1.0.

Slide 31

Normality/linearity of independent variable: respondent's socioeconomic index

To evaluate the linearity of the relationship between respondent's socioeconomic index and the log transformation of how many in family earned money, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS

First, move the transformed dependent variable LOGEARN to the text box for the dependent variable.

Second, move the independent variable, SEI, to the list box for independent variables.

Third, click on the OK button to produce the output.

Slide 32

Normality/linearity of independent variable: respondent's socioeconomic index

[Correlation table: the log transformation of EARNRS [LG10(1+EARNRS)] correlated with RESPONDENT'S SOCIOECONOMIC INDEX and its transformations (logarithm, square, square root, inverse); r for SEI = .055, p = .385, n = 254. **. Correlation is significant at the 0.01 level (2-tailed).]

The probability for the correlation coefficient was 0.385, greater than the level of significance of 0.01. We cannot reject the null hypothesis that r = 0, and cannot conclude that there is a linear relationship between the variables.

Since none of the transformations to improve linearity were successful, it is an indication that the problem may be a weak relationship, rather than a curvilinear relationship correctable by using a transformation. A weak relationship is not a violation of the assumption of linearity, and does not require a caution.

Slide 33

Homoscedasticity of independent variable: sex

To evaluate the homoscedasticity of the relationship between sex and the log transformation of how many in family earned money, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS

First, move the transformed dependent variable LOGEARN to the text box for the dependent variable.

Second, move the independent variable, SEX, to the list box for independent variables.

Third, click on the OK button to produce the output.

Slide 34

Homoscedasticity of independent variable: sex

Based on the Levene Test, the variance in "log transformation of how many in family earned money" [logearn] is homogeneous for the categories of "sex" [sex].

The probability associated with the Levene Statistic (0.767) is greater than the level of significance, so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.
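A minimal scipy sketch of the same test, assuming df contains logearn and the 1/2-coded sex column; center="mean" gives the classic Levene statistic, which is what SPSS reports:

    from scipy.stats import levene

    groups = [g["logearn"].dropna() for _, g in df.groupby("sex")]
    stat, p = levene(*groups, center="mean")  # p > 0.01 here, so homogeneity holds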

Slide 35

The regression to identify outliers and influential cases

We use the regression procedure to identify univariate outliers, multivariate outliers, and influential cases. We start with the same dialog we used for the baseline analysis and substitute the transformed variables which we think will improve the analysis.

To run the regression again, select the Regression | Linear command from the Analyze menu.

Slide 36

The regression to identify outliers and influential cases

First, we substitute the logarithmic transformation of earnrs, logearn, for earnrs as the dependent variable.

Second, we keep the method of entry at Enter so that all variables will be included in the detection of outliers. NOTE: we should always use Enter when testing for outliers and influential cases to make sure all variables are included in the determination.

Third, we want to save the calculated values of the outlier statistics to the data set. Click on the Save button to specify what we want to save.

Slide 37

Saving the measures of outliers/influential cases

First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set.

Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables.

Third, mark the checkbox for Cook's in the Distances panel. This will compute Cook's distances to identify influential cases.

Fourth, click on the OK button to complete the specifications.

Slide 38

The variables for identifying outliers/influential cases

The variable for identifying univariate outliers for the dependent variable is in a column which SPSS has named sre_1. These are the studentized residuals for the log transformed variable.

The variable for identifying multivariate outliers for the independent variables is in a column which SPSS has named mah_1.

The variable containing Cook's distances for identifying influential cases has been named coo_1 by SPSS.

Slide 39

Computing the probability for Mahalanobis D

To compute the probability of D, we will use an SPSS function in a Compute command.

First, select the Compute command from the Transform menu.

Slide 40

Formula for probability for Mahalanobis D

First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D score.

Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.

Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.

Third, click on the OK button to signal completion of the compute variable dialog.
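A minimal Python equivalent of 1 - CDF.CHISQ(mah_1, 3), assuming mah holds the saved Mahalanobis distances and there are 3 independent variables:

    from scipy.stats import chi2

    p_mah = chi2.sf(mah, df=3)  # sf is the upper tail, i.e. 1 - CDF
    # Cases with p_mah <= 0.001 will be treated as multivariate outliers.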

Slide 41

Univariate outliers
A score on the dependent variable is
considered unusual if its studentized
residual is bigger than 3.0.

Slide 42

Multivariate outliers
The combination of scores for the
independent variables is an outlier
if the probability of the Mahalanobis
D distance score is less than or
equal to 0.001.

Slide 43

Influential cases

In addition, a case may have a large influence on the regression analysis, resulting in an analysis that is less representative of the population represented by the sample. The criterion for identifying an influential case is a Cook's distance score with a value of 0.0160 or greater.

The critical value for Cook's distance is:

4 / (n - k - 1) = 4 / (254 - 3 - 1) = 0.0160

Slide 44

Omitting the outliers and influential cases

To omit the outliers and influential cases from the analysis, we select the cases that are not outliers and are not influential cases.

First, select the Select Cases command from the Data menu.

Slide 45

Specifying the condition to omit outliers

First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.

Second, click on the If button to specify the criteria for inclusion in the analysis.

Slide 46

The formula for omitting outliers

To eliminate the outliers and influential cases, we request the cases that are not outliers or influential cases.

The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3, the probability for Mahalanobis D is higher than the level of significance of 0.001, and the Cook's distance value is less than the critical value of 0.0160.

After typing in the formula, click on the Continue button to close the dialog box.
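The same selection in pandas, as a sketch assuming the saved columns sre_1, p_mah_1, and coo_1 from the steps above:

    keep = (df["sre_1"].abs() < 3) & (df["p_mah_1"] > 0.001) & (df["coo_1"] < 0.0160)
    df_clean = df[keep].copy()  # 248 of the 254 cases remain in this problem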

Slide 47

Completing the request for the selection

To complete the
request, we click on
the OK button.

Slide 48

An omitted outlier and influential case

SPSS identifies the excluded cases by drawing a slash mark through the case number. This omitted case has a large studentized residual, greater than 3.0, as well as a Cook's distance value that is greater than the critical value, 0.0160.

Slide 49

The outliers and influential cases

Case 20000159 is an influential case (Cook's distance=0.0320) as well as an outlier on the dependent variable (studentized residual=3.13). Case 20000915 is an influential case (Cook's distance=0.0239). Case 20001016 is an influential case (Cook's distance=0.0598) as well as an outlier on the dependent variable (studentized residual=-3.12). Case 20001761 is an influential case (Cook's distance=0.0167). Case 20002587 is an influential case (Cook's distance=0.0264). Case 20002597 is an influential case (Cook's distance=0.0293). There are 6 cases that have a Cook's distance score that is large enough to be considered influential cases.

Slide 50

Running the regression omitting outliers

We run the regression again, without the outliers which we selected out with the Select If command. Select the Regression | Linear command from the Analyze menu.

Slide 51

Opening the save options dialog

We specify the dependent and independent variables, continuing to substitute any transformed variables required by assumptions.

On our last run, we instructed SPSS to save studentized residuals, Mahalanobis distance, and Cook's distance. To prevent these values from being calculated again, click on the Save button.

Slide 52

Clearing the request to save outlier data

First, clear the checkbox for Studentized residuals.

Second, clear the checkbox for Mahalanobis distance.

Third, clear the checkbox for Cook's distance.

Fourth, click on the OK button to complete the specifications.

Slide 53

Opening the statistics options dialog

Once we have removed outliers, we need to check the sample size requirement for regression. Since we will need the descriptive statistics for this, click on the Statistics button.

Slide 54

Requesting descriptive statistics

First, mark the checkbox for Descriptives.

Second, mark the checkbox for Collinearity diagnostics to obtain the tolerance values for each independent variable in order to assess multicollinearity.

Third, click on the Continue button to complete the specifications.

Slide 55

Requesting the output

Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.

Slide 56

SELECTION OF MODEL FOR INTERPRETATION

Prior to any transformations of variables to satisfy the assumptions of multiple regression and the removal of outliers and influential cases, the proportion of variance in the dependent variable explained by the independent variables (R²) was 18.7%. After substituting transformed variables and removing outliers and influential cases, the proportion of variance in the dependent variable explained by the independent variables (R²) was 38.4%.

Model Summary(b)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .620a   .384       .377                .1457258

a. Predictors: (Constant), SEI, AGE, SEX
b. Dependent Variable: LOGEARN

Since the regression analysis using transformations and omitting outliers and influential cases explained at least two percent more variance than the regression analysis with all cases and no transformations, the regression analysis with transformed variables omitting outliers and influential cases was interpreted.

Slide 57

SAMPLE SIZE

The minimum ratio of valid cases to independent variables for multiple regression is 5 to 1. After removing 6 influential cases or outliers, there are 248 valid cases and 3 independent variables. The ratio of cases to independent variables for this analysis is 82.67 to 1, which satisfies the minimum requirement. In addition, the ratio of 82.67 to 1 satisfies the preferred ratio of 15 to 1.

Descriptive Statistics

           Mean      Std. Deviation   N
LOGEARN    .354289   .1845814         248
AGE        46.70     16.677           248
SEX        1.57      .496             248
SEI        48.819    19.1071          248

Slide 58

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES

The probability of the F statistic (50.759) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.

Slide 59

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES

The Multiple R for the relationship between the set of independent variables and the dependent variable is 0.620, which would be characterized as strong using the rule of thumb that a correlation less than or equal to 0.20 is very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.

Slide 60

MULTICOLLINEARITY

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

Multicollinearity occurs when one independent variable is so strongly correlated with one or more other variables that its relationship to the dependent variable is likely to be misinterpreted. Its potential unique contribution to explaining the dependent variable is minimized by its strong relationship to other independent variables. Multicollinearity is indicated when the tolerance value for an independent variable is less than 0.10.

The tolerance values for all of the independent variables are larger than 0.10. Multicollinearity is not a problem in this regression analysis.

Slide 61

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

For the independent variable age, the probability of the t statistic (-12.237) for the b coefficient is <0.001, which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and the log transformation of how many in family earned money.

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

The b coefficient associated with age (-0.007) is negative, indicating an inverse relationship in which higher numeric values for age are associated with lower numeric values for the log transformation of how many in family earned money. Therefore, the negative value of b implies that survey respondents who were older had fewer family members earning money.

Slide 62

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

For the independent variable sex, the probability of the t statistic (1.284) for the b coefficient is 0.200, which is greater than the level of significance of 0.05. We fail to reject the null hypothesis that the slope associated with sex is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between sex and the log transformation of how many in family earned money.

Slide 63

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

For the independent variable respondent's socioeconomic index, the probability of the t statistic (0.354) for the b coefficient is 0.724, which is greater than the level of significance of 0.05. We fail to reject the null hypothesis that the slope associated with respondent's socioeconomic index is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between respondent's socioeconomic index and the log transformation of how many in family earned money.

Slide 64

Validation analysis: set the random number seed

To set the random number seed, select the Random Number Seed command from the Transform menu.

Slide 65

Set the random number seed

First, click on the Set seed to option button to activate the text box.

Second, type in the random seed stated in the problem.

Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.

Slide 66

Validation analysis: compute the split variable

To enter the formula for the variable that will split the sample in two parts, click on the Compute command.

Slide 67

The formula for the split variable

First, type the name for the new variable, split, into the Target Variable text box.

Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1, which is compared to the value 0.50. If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent of true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent of false.

Third, click on the OK button to complete the dialog box.
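A sketch of the same split outside SPSS, assuming the cleaned DataFrame from the outlier step. Note that NumPy's generator differs from SPSS's uniform(1), so the same seed will not reproduce SPSS's case assignment:

    import numpy as np

    rng = np.random.default_rng(788035)
    df_clean["split"] = (rng.uniform(size=len(df_clean)) <= 0.50).astype(int)
    first_half = df_clean[df_clean["split"] == 0]
    second_half = df_clean[df_clean["split"] == 1]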

Slide 68

The split variable in the data editor

In the data editor, the split variable shows a random pattern of zeros and ones. To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.

Slide 69

Repeat the regression with first validation sample

To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.

Slide 70

Using "split" as the selection variable

First, scroll down the list of variables and highlight the variable split.

Second, click on the right arrow button to move the split variable to the Selection Variable text box.

Slide 71

Setting the value of split to select cases

When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.

Click on the Rule button to enter a value for split.

Slide 72

Completing the value selection

First, type the value for the first half of the sample, 0, into the Value text box.

Second, click on the Continue button to complete the value entry.

Slide 73

Requesting output for the first validation sample

When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable.

Click on the OK button to request the output.

Since the validation analysis requires us to compare the results of the analyses using the two split samples, we will request the output for the second sample before doing any comparison.

Slide 74

Repeat the regression with second validation sample

To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.

Slide 75

Setting the value of split to select cases

Since the split variable is already in the Selection Variable text box, we only need to change its value. Click on the Rule button to enter a different value for split.

Slide 76

Completing the value selection

First, type the value for the second half of the sample, 1, into the Value text box.

Second, click on the Continue button to complete the value entry.

Slide 77

Requesting output for the second validation sample

When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.

Click on the OK button to request the output.

Slide 78

SPLIT-SAMPLE VALIDATION - 1

In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant.

ANOVA(b,c): first validation sample

             Sum of Squares   df    Mean Square   F        Sig.
Regression   1.692            3     .564          24.220   .000a
Residual     2.538            109   .023
Total        4.230            112

a. Predictors: (Constant), SEI, AGE, SEX
b. Dependent Variable: LOGEARN
c. Selecting only cases for which SPLIT = .0000

In the first validation, the probability for the F statistic testing the overall relationship was <0.001.

ANOVA(b,c): second validation sample

             Sum of Squares   df    Mean Square   F        Sig.
Regression   1.500            3     .500          25.062   .000a
Residual     2.614            131   .020
Total        4.114            134

a. Predictors: (Constant), SEI, SEX, AGE
b. Dependent Variable: LOGEARN
c. Selecting only cases for which SPLIT = 1.0000

For the second validation analysis, the probability for the F statistic testing the overall relationship was <0.001. Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.

Slide 79

SPLIT-SAMPLE VALIDATION - 2

Model Summary(b,c): first validation sample
R (selected cases, SPLIT = .0000): .632a
R (unselected cases, SPLIT ~= .0000): .593
R Square: .400
Adjusted R Square: .383
Std. Error of the Estimate: .1525916
Durbin-Watson: 2.117 (selected), 1.862 (unselected)

a. Predictors: (Constant), SEI, AGE, SEX
b. Unless noted otherwise, statistics are based only on cases for which SPLIT = .0000.
c. Dependent Variable: LOGEARN

Model Summary(b,c): second validation sample
R (selected cases, SPLIT = 1.0000): .604a
R (unselected cases, SPLIT ~= 1.0000): .621
R Square: .365
Adjusted R Square: .350
Std. Error of the Estimate: .1412615
Durbin-Watson: 1.839 (selected), 2.161 (unselected)

a. Predictors: (Constant), SEI, SEX, AGE
b. Unless noted otherwise, statistics are based only on cases for which SPLIT = 1.0000.
c. Dependent Variable: LOGEARN

The total proportion of variance in the relationship utilizing the full data set was 38.4%, compared to 40.0% for the first split-sample validation and 36.5% for the second split-sample validation.

In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (38.4%).

Slide 80

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

The relationship between "age" [age] and "log transformation of how many in family earned money" [logearn] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant.

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .663   .077                 8.603    .000
AGE          -.007  .001         -.628   -8.429   .000   .992        1.008
SEX          .024   .029         .062    .828     .410   .989        1.011
SEI          .000   .001         -.039   -.525    .601   .985        1.015

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = .0000

In the first validation analysis, the probability for the test of relationship between "age" [age] and "log transformation of how many in family earned money" [logearn] was <0.001, which was less than or equal to the level of significance of 0.05 and statistically significant.

Slide 81

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .595   .062                 9.552    .000
AGE          -.007  .001         -.598   -8.590   .000   .999        1.001
SEX          .022   .025         .063    .907     .366   .999        1.001
SEI          .001   .001         .076    1.098    .274   .999        1.001

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = 1.0000

In the second validation analysis, the probability for the test of relationship between "age" [age] and "log transformation of how many in family earned money" [logearn] was <0.001, which was less than or equal to the level of significance of 0.05 and statistically significant.

Slide 82

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3

The relationship between "respondent's socioeconomic index" [sei] and "log transformation of how many in family earned money" [logearn] was not statistically significant for the model using the full data set (p=0.724). Similarly, the relationships in both of the validation analyses were not statistically significant.

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .663   .077                 8.603    .000
AGE          -.007  .001         -.628   -8.429   .000   .992        1.008
SEX          .024   .029         .062    .828     .410   .989        1.011
SEI          .000   .001         -.039   -.525    .601   .985        1.015

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = .0000

In the first validation analysis, the probability for the test of relationship between "respondent's socioeconomic index" [sei] and "log transformation of how many in family earned money" [logearn] was 0.601, which was greater than the level of significance of 0.05 and not statistically significant.

Slide 83

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 4

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .595   .062                 9.552    .000
AGE          -.007  .001         -.598   -8.590   .000   .999        1.001
SEX          .022   .025         .063    .907     .366   .999        1.001
SEI          .001   .001         .076    1.098    .274   .999        1.001

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = 1.0000

In the second validation analysis, the probability for the test of relationship between "respondent's socioeconomic index" [sei] and "log transformation of how many in family earned money" [logearn] was 0.274, which was greater than the level of significance of 0.05 and not statistically significant.

Slide 84

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 5

The relationship between "sex" [sex] and "log transformation of how many in family earned money" [logearn] was not statistically significant for the model using the full data set (p=0.200). Similarly, the relationships in both of the validation analyses were not statistically significant.

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .663   .077                 8.603    .000
AGE          -.007  .001         -.628   -8.429   .000   .992        1.008
SEX          .024   .029         .062    .828     .410   .989        1.011
SEI          .000   .001         -.039   -.525    .601   .985        1.015

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = .0000

In the first validation analysis, the probability for the test of relationship between "sex" [sex] and "log transformation of how many in family earned money" [logearn] was 0.410, which was greater than the level of significance of 0.05 and not statistically significant.

Slide 85

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 6

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .595   .062                 9.552    .000
AGE          -.007  .001         -.598   -8.590   .000   .999        1.001
SEX          .022   .025         .063    .907     .366   .999        1.001
SEI          .001   .001         .076    1.098    .274   .999        1.001

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = 1.0000

In the second validation analysis, the probability for the test of relationship between "sex" [sex] and "log transformation of how many in family earned money" [logearn] was 0.366, which was greater than the level of significance of 0.05 and not statistically significant.

The split-sample validation supports the findings of the regression analysis using the full data set. The answer to the original question is true.

Slide 86

Table of validation results: standard regression

It may be helpful to create a table for our validation results and fill in its cells as we complete the analysis.

                                   Full Data Set      Split = 0 (Split1 = 1)   Split = 1 (Split2 = 1)
ANOVA significance (sig <= 0.05)   <0.001             <0.001                   <0.001
R²                                 0.384              0.400                    0.365
Significant coefficients           Age of respondent  Age of respondent        Age of respondent
(sig <= 0.05)

The split-sample validation supports the findings of the regression analysis using the full data set.

Slide 87

Answering the problem question - 1

We have found that there is a statistically significant relationship between the set of IVs and the DV (p<0.001), and the Multiple R was 0.620, which would be characterized as a strong relationship.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 88

Answering the problem question - 2

The b coefficient associated with age was statistically significant (p<0.001), so there was an individual relationship to interpret. The b coefficient (-0.007) was negative, indicating an inverse relationship in which higher numeric values for age are associated with lower numeric values for the log transformation of how many in family earned money. Therefore, the negative value of b implies that survey respondents who were older had fewer family members earning money.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 89

Answering the problem question - 3

For the independent variable sex, the probability of the t statistic (1.284) for the b coefficient is 0.200, which is greater than the level of significance of 0.05. Sex did not have a relationship to the number of persons in the family earning money.

For the independent variable respondent's socioeconomic index, the probability of the t statistic (0.354) for the b coefficient is 0.724, which is greater than the level of significance of 0.05. Socioeconomic status did not have a relationship to the number of persons in the family earning money.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true.

Slide 90

Steps in regression analysis: running the baseline model

The following is a guide to the decision process for answering problems about the complete regression analysis:

Dependent variable metric? Independent variables metric or dichotomous?
- No: Inappropriate application of a statistic.
- Yes: continue.

Ratio of cases to independent variables at least 5 to 1?
- No: Inappropriate application of a statistic.
- Yes: Run the baseline regression, using the method for including variables identified in the research question. Record R² for evaluation of transformations and removal of outliers and influential cases. Record the Durbin-Watson statistic for the assumption of independence of errors.

Slide 91

Steps in regression analysis: evaluating assumptions - 1

Is the dependent variable normally distributed?
- No: Try (1) logarithmic transformation, (2) square root transformation, (3) inverse transformation. If unsuccessful, add a caution for violation of regression assumptions.
- Yes: continue.

Metric IVs normally distributed and linearly related to DV?
- No: Try (1) logarithmic transformation, (2) square root transformation, (3) square transformation, (4) inverse transformation. If unsuccessful, add a caution for violation of regression assumptions.
- Yes: continue.

Slide 92

Steps in regression analysis: evaluating assumptions - 2

DV is homoscedastic for categories of dichotomous IVs?
- No: Add a caution for violation of regression assumptions.
- Yes: continue.

Residuals are independent, Durbin-Watson between 1.5 and 2.5?
- No: Add a caution for violation of regression assumptions.
- Yes: continue.

Slide 93

Steps in regression analysis: evaluating outliers

Request statistics for detecting outliers and influential cases by running a standard multiple regression, using the Enter method to include all variables and substituting transformed variables.

Univariate outliers (DV), multivariate outliers (IVs), or influential cases?
- Yes: Remove outliers and influential cases from the data set.
- No: continue.

Ratio of cases to independent variables still at least 5 to 1?
- No: Restore outliers and influential cases to the data set, add a caution to the findings.
- Yes: continue.

Slide 94

Steps in regression analysis: picking the regression model for interpretation

Were transformed variables substituted, or outliers and influential cases omitted?
- No: Pick the baseline regression for interpretation.
- Yes: Evaluate the impact of transformations and removal of outliers by running the regression again, using the method for including variables identified in the research question.

R² for the evaluated regression greater than R² for the baseline regression by 2% or more?
- Yes: Pick the regression with transformations and omitting outliers for interpretation.
- No: Pick the baseline regression for interpretation.

Slide 95

Steps in regression analysis: overall relationship is interpretable

Probability of the ANOVA test of the regression less than or equal to the level of significance?
- No: False.
- Yes: continue.

Tolerance for all IVs greater than 0.10, indicating no multicollinearity?
- No: False.
- Yes: continue.

Steps in regression analysis: validation - 1

Slide 96

Enough valid cases to split the sample and keep a 5 to 1 ratio of cases/variables?
- Yes: Set the random seed and compute the split variable. Re-run the regression with split = 0; re-run the regression with split = 1.
- No: Set the first random seed and compute the split1 variable; re-run the regression with split1 = 1. Set the second random seed and compute the split2 variable; re-run the regression with split2 = 1.

Probability of the ANOVA test <= the level of significance for both validation analyses?
- No: False.
- Yes: continue.

Slide 97

Steps in regression analysis: validation - 2

R² for both validations within 5% of R² for the analysis of the full data set?
- No: False.
- Yes: continue.

Change in R² statistically significant in both validation analyses? (Hierarchical regression only.)
- No: False.
- Yes: continue.

Pattern of significance for the independent variables in both validations matches the pattern for the full data set?
- No: False.
- Yes: continue.

Slide 98

Steps in regression analysis: answering the question

Satisfies the ratio for the preferred sample size: 15 to 1 (stepwise: 50 to 1)?
- No: True with caution.
- Yes: continue.

DV is interval level and IVs are interval level or dichotomous?
- No: True with caution.
- Yes: continue.

Assumptions not violated? Outliers/influential cases excluded from interpretation?
- No: True with caution.
- Yes: True.

Slide 99

Interpreting the coefficients when the dependent variable is transformed

Slide 100

Interpreting b coefficients when the dependent variable is transformed - 1

For the independent variable age, the probability of the t statistic (-12.237) for the b coefficient is <0.001, which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and the log transformation of how many in family earned money.

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

The b coefficient associated with age (-0.007) is negative, indicating an inverse relationship in which higher numeric values for age are associated with lower numeric values for the log transformation of how many in family earned money. Therefore, the negative value of b implies that survey respondents who were older had fewer family members earning money.

Slide 101

Interpreting b coefficients when the dependent variable is transformed - 2

If we want to interpret a specific change in the number of earners for some amount of change in age, we will need to find the answer and convert it from log units to decimal units. We can use Microsoft Excel to calculate the answer.

In the worksheet, I have entered the b coefficient from the SPSS output (-0.007) in row 1. In row 2, I have entered different ages, e.g. 20, 30, 40, and 50.

Slide 102

Interpreting b coefficients when the dependent variable is transformed - 3

On row 3, we multiply the value for the independent variable age by the b coefficient, which measures the contribution to the dependent variable in log units.

On row 4, we reverse the log transform back to decimal units by raising the number 10 to the value on row 3. The caret symbol is used by Excel for raising to a power, so 10^-0.14 is 10 raised to the -0.14 power, or 0.7244.

Slide 103

Interpreting b coefficients when the dependent variable is transformed - 4

Based on our table, a respondent at age 20 contributes 0.7244 to the number of earners in the family. If a respondent were 30, rather than 20, the contribution to the number of earners would be 0.6166, a decrease of -0.1078. Thus, increasing age has a negative effect on the number of earners.

Note that as we go up in increments of 10, the difference between increments is decreasing. The logarithmic scale is not linear, requiring us to compute the change for any specific interval of interest.
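The same calculation can be sketched in Python rather than Excel, using the fitted b coefficient of -0.007:

    b = -0.007
    ages = [20, 30, 40, 50]
    contrib = [10 ** (b * a) for a in ages]  # undo the base-10 log: 0.7244, 0.6166, ...
    deltas = [c2 - c1 for c1, c2 in zip(contrib, contrib[1:])]  # shrinking differences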
