
SW388R7 Data Analysis & Computers II

Strategy for Complete Regression Analysis

Slide 1

Additional issues in regression analysis

Assumption of independence of errors
Influential cases
Multicollinearity
Adjusted R²
Strategy for solving problems
Sample problems
Complete regression analysis


Assumption of independence of errors - 1

Slide 2

Multiple regression assumes that the errors are independent and there is no serial correlation. Errors are the residuals, or differences between the actual score for a case and the score estimated using the regression equation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case.

The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 to 2.50.
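To make the statistic concrete, here is a minimal Python sketch of the Durbin-Watson computation, assuming residuals is a NumPy array of regression residuals in case order:

    import numpy as np

    def durbin_watson(residuals):
        # Ratio of summed squared successive differences to summed squared residuals.
        diff = np.diff(residuals)
        return np.sum(diff ** 2) / np.sum(residuals ** 2)

    # Values near 2 indicate uncorrelated residuals; 1.50 to 2.50 is the
    # acceptable range used in these slides.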


Assumption of independence of errors - 2

Slide 3

Serial correlation is more of a concern in analyses that involve time series.

If it does occur in relationship analyses, its presence can usually be understood by changing the sequence of cases and running the analysis again.

If the problem with serial correlation disappears, it may be ignored.

If the problem with serial correlation remains, it can usually be handled by using the difference between successive data points as the dependent variable.
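As a minimal sketch of that differencing remedy, assuming y is a NumPy array of dependent-variable scores in case order (the values here are hypothetical):

    import numpy as np

    y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])  # hypothetical scores in case order
    dy = np.diff(y)  # successive differences become the new dependent variable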


Multicollinearity - 1

Slide 4

Multicollinearity is a problem in regression analysis that occurs when two independent variables are highly correlated, e.g. r = 0.90 or higher.

The relationship between the independent variables and the dependent variable is distorted by the very strong relationship between the independent variables, leading to the likelihood that our interpretation of relationships will be incorrect.

In the worst case, if the variables are perfectly correlated, the regression cannot be computed.

SPSS guards against the failure to compute a regression solution by arbitrarily omitting the collinear variable from the analysis.


Multicollinearity - 2

Slide 5

Multicollinearity is detected by examining the tolerance for each independent variable. Tolerance is the amount of variability in one independent variable that is not explained by the other independent variables.

Tolerance values less than 0.10 indicate collinearity.

If we discover collinearity in the regression output, we should reject the interpretation of the relationships as false until the issue is resolved.

Multicollinearity can be resolved by combining the highly correlated variables through principal component analysis, or by omitting a variable from the analysis.
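A minimal Python sketch of the tolerance computation, assuming X is a pandas DataFrame of the independent variables. Tolerance is 1 minus the R² from regressing each variable on the remaining ones (equivalently, 1 / VIF):

    import pandas as pd
    import statsmodels.api as sm

    def tolerances(X: pd.DataFrame) -> pd.Series:
        values = {}
        for col in X.columns:
            others = sm.add_constant(X.drop(columns=col))
            r2 = sm.OLS(X[col], others).fit().rsquared
            values[col] = 1.0 - r2  # tolerance below 0.10 signals collinearity
        return pd.Series(values)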


Adjusted R²

Slide 6

The coefficient of determination, R², which measures the strength of a relationship, can usually be increased simply by adding more variables. In fact, if the number of variables equals the number of cases, it is often possible to produce a perfect R² of 1.00.

Adjusted R² is a measure which attempts to reduce the inflation in R² by taking into account the number of independent variables and the number of cases.

If there is a large discrepancy between R² and adjusted R², extraneous variables should be removed from the analysis and R² recomputed.
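For reference, the standard adjustment is: adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of cases and k is the number of independent variables.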


Influential cases

Slide 7

The degree to which outliers affect the regression solution depends upon where the outlier is located relative to the other cases in the analysis. Outliers whose location has a large effect on the regression solution are called influential cases.

Whether or not a case is influential is measured by Cook's distance.

Cook's distance is an index measure; it is compared to a critical value based on the formula:

4 / (n - k - 1)

where n is the number of cases and k is the number of independent variables.

If a case has a Cook's distance greater than the critical value, it should be examined for exclusion.
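A minimal statsmodels sketch of this check, assuming model is a fitted OLS results object:

    from statsmodels.stats.outliers_influence import OLSInfluence

    influence = OLSInfluence(model)
    cooks_d = influence.cooks_distance[0]               # one Cook's distance per case
    critical = 4.0 / (model.nobs - model.df_model - 1)  # 4 / (n - k - 1)
    flagged = cooks_d > critical                        # cases to examine for exclusion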


Overall strategy for solving problems

Slide 8
1. Run a baseline regression using the method for including variables implied by the problem statement to find the initial strength of the relationship, baseline R².
2. Test for useful transformations to improve normality, linearity, and homoscedasticity.
3. Substitute transformed variables and check for outliers and influential cases.
4. If R² from the regression model using transformed variables and omitting outliers is at least 2% better than baseline R², select it for interpretation; otherwise select the baseline model.
5. Validate and interpret the selected regression model.


Problem 1

Slide 9

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 10

Dissecting problem 1 - 1

When we test for influential cases using Cook's distance, we need to compute a critical value for comparison using the formula 4 / (n - k - 1), where n is the number of cases and k is the number of independent variables. The correct value (0.0160) is provided in the problem.

The problem may give us different levels of significance for the analysis. In this problem, we are told to use 0.05 as alpha for the regression, but 0.01 for testing assumptions.

The random number seed (788035) for the split-sample validation is also provided.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

After evaluating assumptions, outliers, and influential cases, we will decide whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.

Slide 11

Dissecting problem 1 - 2

When a problem states that there is a relationship between some independent variables and a dependent variable, we do standard multiple regression.

The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei]. The variable that is the target of the relationship is the dependent variable (DV): "how many in family earned money" [earnrs].

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 12

Dissecting problem 1 - 3

In order for a problem to be true, we will have to find that there is a statistically significant relationship between the set of IVs and the DV, and the strength of the relationship stated in the problem must be correct.

In addition, the relationship or lack of relationship between the individual IVs and the DV must be identified correctly, and must be characterized correctly.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 13

LEVEL OF MEASUREMENT

Multiple regression requires that the dependent variable be metric and the independent variables be metric or dichotomous.

"How many in family earned money" [earnrs] is an interval-level variable, which satisfies the level of measurement requirement.

"Age" [age] and "respondent's socioeconomic index" [sei] are interval-level variables, which satisfies the level of measurement requirements for multiple regression analysis.

"Sex" [sex] is a dichotomous or dummy-coded nominal variable which may be included in multiple regression analysis.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 14

The baseline regression

We begin our analysis by running a standard multiple regression analysis with earnrs as the dependent variable and age, sex, and sei as the independent variables.

Select Enter as the Method for including variables to produce a standard multiple regression.

Click on the Statistics button to select the statistics we will need for the analysis.
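Outside SPSS, a minimal Python sketch of the same baseline model, assuming df is a pandas DataFrame with the GSS columns earnrs, age, sex, and sei:

    import statsmodels.api as sm

    X = sm.add_constant(df[["age", "sex", "sei"]])  # Enter method: all IVs at once
    model = sm.OLS(df["earnrs"], X, missing="drop").fit()
    print(model.rsquared)                       # baseline R² for later comparison
    print(sm.stats.durbin_watson(model.resid))  # independence-of-errors check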

Slide 15

The baseline regression

Retain the default checkboxes for Estimates and Model fit to obtain the baseline R², which will be used to determine whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.

Mark the Descriptives checkbox to get the number of cases available for the analysis.

Mark the checkbox for the Durbin-Watson statistic, which will be used to test the assumption of independence of errors.

Slide 16

Initial sample size

Descriptive Statistics

          Mean     Std. Deviation   N
EARNRS    1.47     1.008            254
AGE       46.62    16.642           254
SEX       1.57     .496             254
SEI       48.601   19.1110          254

The initial sample size before excluding outliers and influential cases is 254. With 3 independent variables, the ratio of cases to variables is 84.7 to 1, satisfying both the minimum and preferred sample size for multiple regression.

If the sample size did not initially satisfy the minimum requirement, regression analysis would not be appropriate.

Slide 17

R² before transformations or removing outliers

The R² of 0.187 is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers/influential cases.

Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 18.7%.

The relationship is statistically significant, though we would not stop if it were not significant because the lack of significance may be a consequence of violation of assumptions or the inclusion of outliers and influential cases.

Slide 18

Assumption of independence of errors: the Durbin-Watson statistic

The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals, i.e., the assumption of independence of errors, which requires that the residuals or errors in prediction do not follow a pattern from case to case.

The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50.

The Durbin-Watson statistic for this problem is 1.849, which falls within the acceptable range.

If the Durbin-Watson statistic were not in the acceptable range, we would add a caution to the findings for a violation of regression assumptions.

Slide 19

Normality of dependent variable: how many in family earned money

We examine the normality of the dependent variable first, then the normality of each metric independent variable and the linearity of its relationship with the dependent variable.

To test the normality of the number of earners in family, run the script: NormalityAssumptionAndTransformations.SBS

First, move the variable EARNRS to the list box of variables to test.

Second, click on the OK button to produce the output.

Slide 20

Normality of dependent variable: how many in family earned money

Descriptives: HOW MANY IN FAMILY EARNED MONEY

                               Statistic   Std. Error
Mean                           1.43        .061
95% CI for Mean, Lower Bound   1.31
95% CI for Mean, Upper Bound   1.56
5% Trimmed Mean                1.37
Median                         1.00
Variance                       1.015
Std. Deviation                 1.008
Minimum                        0
Maximum                        5
Range                          5
Interquartile Range            1.00
Skewness                       .742        .149
Kurtosis                       1.324       .296

The dependent variable "how many in family earned money" [earnrs] does not satisfy the criteria for a normal distribution. The skewness (0.742) fell between -1.0 and +1.0, but the kurtosis (1.324) fell outside the range from -1.0 to +1.0.
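In Python, the same normality screen can be sketched as follows, assuming df is the working DataFrame; pandas' default skew() and kurt() should match the bias-corrected formulas SPSS reports, with kurtosis already centered at 0:

    skew = df["earnrs"].skew()
    kurt = df["earnrs"].kurt()  # excess kurtosis, centered at 0 like SPSS
    normal_enough = (-1.0 <= skew <= 1.0) and (-1.0 <= kurt <= 1.0)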

Slide 21

Normality of dependent variable: how many in family earned money

The logarithmic transformation improves the normality of "how many in family earned money" [earnrs]. In evaluating normality, the skewness (-0.483) and kurtosis (0.309) were both within the range of acceptable values from -1.0 to +1.0.

The square root transformation also has values of skewness and kurtosis in the acceptable range. However, by our order of preference for which transformation to use, the logarithm is preferred to the square root or inverse.

Slide 22

Transformation for how many in family earned money

The logarithmic transformation improves the normality of "how many in family earned money" [earnrs].

We will substitute the logarithmic transformation of how many in family earned money as the dependent variable in the regression analysis.
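As a sketch of the same computation outside SPSS, using the LG10(1+EARNRS) form shown later in these slides (the +1 avoids taking the log of zero for families with no earners), and assuming df is the working DataFrame:

    import numpy as np

    df["logearn"] = np.log10(1 + df["earnrs"])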

Slide 23

Adding a transformed variable

Before testing the assumptions for the independent variables, we need to add the transformation of the dependent variable to the data set.

First, move the variable that we want to transform to the list box of variables to test.

Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes.

Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable.

Fourth, click on the OK button to produce the output.

Slide 24

The transformed variable in the data editor

If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set.

Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.

Slide 25

Normality/linearity of independent variable: age

After evaluating the dependent variable, we examine the normality of each metric independent variable and the linearity of its relationship with the dependent variable.

To test the normality of age, run the script: NormalityAssumptionAndTransformations.SBS

First, move the independent variable AGE to the list box of variables to test.

Second, click on the OK button to produce the output.

Slide 26

Normality/linearity of independent variable: age

Descriptives: AGE OF RESPONDENT

                               Statistic   Std. Error
Mean                           45.99       1.023
95% CI for Mean, Lower Bound   43.98
95% CI for Mean, Upper Bound   48.00
5% Trimmed Mean                45.31
Median                         43.50
Variance                       282.465
Std. Deviation                 16.807
Minimum                        19
Maximum                        89
Range                          70
Interquartile Range            24.00
Skewness                       .595        .148
Kurtosis                       -.351       .295

In evaluating normality, the skewness (0.595) and kurtosis (-0.351) were both within the range of acceptable values from -1.0 to +1.0.

Slide 27

Normality/linearity of independent variable: age

To evaluate the linearity of age and the log transformation of number of earners in the family, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS

First, move the transformed dependent variable LOGEARN to the text box for the dependent variable.

Second, move the independent variable, AGE, to the list box for independent variables.

Third, click on the OK button to produce the output.

Slide 28

Normality/linearity of independent variable: age

[Correlation table: the log transformation of EARNRS [LG10(1+EARNRS)] correlated with AGE OF RESPONDENT and its transformations (logarithm, square, square root, inverse); r for AGE = -.493, p < .001, n = 269. **. Correlation is significant at the 0.01 level (2-tailed).]

The evidence of linearity in the relationship between the independent variable "age" [age] and the dependent variable "log transformation of how many in family earned money" [logearn] was the statistical significance of the correlation coefficient (r = -0.493). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.

The independent variable "age" [age] satisfies the criteria for both the assumption of normality and the assumption of linearity with the dependent variable "log transformation of how many in family earned money" [logearn].

Slide 29

Normality/linearity of independent variable: respondent's socioeconomic index

To test the normality of respondent's socioeconomic index, run the script: NormalityAssumptionAndTransformations.SBS

First, move the independent variable SEI to the list box of variables to test.

Second, click on the OK button to produce the output.

Slide 30

Normality/linearity of independent variable: respondent's socioeconomic index

Descriptives: RESPONDENT'S SOCIOECONOMIC INDEX

                               Statistic   Std. Error
Mean                           48.710      1.1994
95% CI for Mean, Lower Bound   46.348
95% CI for Mean, Upper Bound   51.072
5% Trimmed Mean                47.799
Median                         39.600
Variance                       366.821
Std. Deviation                 19.1526
Minimum                        19.4
Maximum                        97.2
Range                          77.8
Interquartile Range            31.100
Skewness                       .585        .153
Kurtosis                       -.862       .304

The independent variable "respondent's socioeconomic index" [sei] satisfies the criteria for the assumption of normality, but does not satisfy the assumption of linearity with the dependent variable "log transformation of how many in family earned money" [logearn]. In evaluating normality, the skewness (0.585) and kurtosis (-0.862) were both within the range of acceptable values from -1.0 to +1.0.

Slide 31

Normality/linearity of independent variable: respondent's socioeconomic index

To evaluate the linearity of the relationship between respondent's socioeconomic index and the log transformation of how many in family earned money, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS

First, move the transformed dependent variable LOGEARN to the text box for the dependent variable.

Second, move the independent variable, SEI, to the list box for independent variables.

Third, click on the OK button to produce the output.

Slide 32

Normality/linearity of independent variable: respondent's socioeconomic index

[Correlation table: the log transformation of EARNRS [LG10(1+EARNRS)] correlated with RESPONDENT'S SOCIOECONOMIC INDEX and its transformations (logarithm, square, square root, inverse); r for SEI = .055, p = .385, n = 254. **. Correlation is significant at the 0.01 level (2-tailed).]

The probability for the correlation coefficient was 0.385, greater than the level of significance of 0.01. We cannot reject the null hypothesis that r = 0, and cannot conclude that there is a linear relationship between the variables.

Since none of the transformations to improve linearity were successful, it is an indication that the problem may be a weak relationship, rather than a curvilinear relationship correctable by using a transformation. A weak relationship is not a violation of the assumption of linearity, and does not require a caution.

Slide 33

Homoscedasticity of independent variable: sex

To evaluate the homoscedasticity of the relationship between sex and the log transformation of how many in family earned money, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS

First, move the transformed dependent variable LOGEARN to the text box for the dependent variable.

Second, move the independent variable, SEX, to the list box for independent variables.

Third, click on the OK button to produce the output.

Slide 34

Homoscedasticity of independent variable: sex

Based on the Levene Test, the variance in "log transformation of how many in family earned money" [logearn] is homogeneous for the categories of "sex" [sex].

The probability associated with the Levene Statistic (0.767) is greater than the level of significance, so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.
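A minimal scipy sketch of the same test, assuming df contains logearn and the 1/2-coded sex column; center="mean" gives the classic Levene statistic, which is what SPSS reports:

    from scipy.stats import levene

    groups = [g["logearn"].dropna() for _, g in df.groupby("sex")]
    stat, p = levene(*groups, center="mean")  # p > 0.01 here, so homogeneity holds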

Slide 35

The regression to identify outliers and influential cases

We use the regression procedure to identify univariate outliers, multivariate outliers, and influential cases. We start with the same dialog we used for the baseline analysis and substitute the transformed variables which we think will improve the analysis.

To run the regression again, select the Regression | Linear command from the Analyze menu.

Slide 36

The regression to identify outliers and influential cases

First, we substitute the logarithmic transformation of earnrs, logearn, for earnrs as the dependent variable.

Second, we keep the method of entry at Enter so that all variables will be included in the detection of outliers. NOTE: we should always use Enter when testing for outliers and influential cases to make sure all variables are included in the determination.

Third, we want to save the calculated values of the outlier statistics to the data set. Click on the Save button to specify what we want to save.

Slide 37

Saving the measures of outliers/influential cases

First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set.

Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables.

Third, mark the checkbox for Cook's in the Distances panel. This will compute Cook's distances to identify influential cases.

Fourth, click on the OK button to complete the specifications.

Slide 38

The variables for identifying outliers/influential cases

The variable for identifying univariate outliers for the dependent variable is in a column which SPSS has named sre_1. These are the studentized residuals for the log transformed variable.

The variable for identifying multivariate outliers for the independent variables is in a column which SPSS has named mah_1.

The variable containing Cook's distances for identifying influential cases has been named coo_1 by SPSS.

Slide 39

Computing the probability for Mahalanobis D

To compute the probability of D, we will use an SPSS function in a Compute command.

First, select the Compute command from the Transform menu.

Slide 40

Formula for probability for Mahalanobis D

First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D score.

Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.

Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.

Third, click on the OK button to signal completion of the compute variable dialog.
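A minimal Python equivalent of 1 - CDF.CHISQ(mah_1, 3), assuming mah holds the saved Mahalanobis distances and there are 3 independent variables:

    from scipy.stats import chi2

    p_mah = chi2.sf(mah, df=3)  # sf is the upper tail, i.e. 1 - CDF
    # Cases with p_mah <= 0.001 will be treated as multivariate outliers.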

Slide 41

Univariate outliers
A score on the dependent variable is
considered unusual if its studentized
residual is bigger than 3.0.

Slide 42

Multivariate outliers
The combination of scores for the
independent variables is an outlier
if the probability of the Mahalanobis
D distance score is less than or
equal to 0.001.

Slide 43

Influential cases

In addition, a case may have a large influence on the regression analysis, resulting in an analysis that is less representative of the population represented by the sample. The criterion for identifying an influential case is a Cook's distance score with a value of 0.0160 or greater.

The critical value for Cook's distance is:

4 / (n - k - 1) = 4 / (254 - 3 - 1) = 0.0160

Slide 44

Omitting the outliers and influential cases

To omit the outliers and influential cases from the analysis, we select the cases that are not outliers and are not influential cases.

First, select the Select Cases command from the Data menu.

Slide 45

Specifying the condition to omit outliers

First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.

Second, click on the If button to specify the criteria for inclusion in the analysis.

Slide 46

The formula for omitting outliers

To eliminate the outliers and influential cases, we request the cases that are not outliers or influential cases.

The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3, the probability for Mahalanobis D is higher than the level of significance of 0.001, and the Cook's distance value is less than the critical value of 0.0160.

After typing in the formula, click on the Continue button to close the dialog box.
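The same selection in pandas, as a sketch assuming the saved columns sre_1, p_mah_1, and coo_1 from the steps above:

    keep = (df["sre_1"].abs() < 3) & (df["p_mah_1"] > 0.001) & (df["coo_1"] < 0.0160)
    df_clean = df[keep].copy()  # 248 of the 254 cases remain in this problem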

Slide 47

Completing the request for the selection

To complete the
request, we click on
the OK button.

Slide 48

An omitted outlier and influential case

SPSS identifies the excluded cases by drawing a slash mark through the case number. This omitted case has a large studentized residual, greater than 3.0, as well as a Cook's distance value that is greater than the critical value, 0.0160.

Slide 49

The outliers and influential cases

Case 20000159 is an influential case (Cook's distance=0.0320) as well as an outlier on the dependent variable (studentized residual=3.13). Case 20000915 is an influential case (Cook's distance=0.0239). Case 20001016 is an influential case (Cook's distance=0.0598) as well as an outlier on the dependent variable (studentized residual=-3.12). Case 20001761 is an influential case (Cook's distance=0.0167). Case 20002587 is an influential case (Cook's distance=0.0264). Case 20002597 is an influential case (Cook's distance=0.0293). There are 6 cases that have a Cook's distance score that is large enough to be considered influential cases.

Slide 50

Running the regression omitting outliers

We run the regression again, without the outliers which we selected out with the Select If command. Select the Regression | Linear command from the Analyze menu.

Slide 51

Opening the save options dialog

We specify the dependent and independent variables, continuing to substitute any transformed variables required by assumptions.

On our last run, we instructed SPSS to save studentized residuals, Mahalanobis distance, and Cook's distance. To prevent these values from being calculated again, click on the Save button.

Slide 52

Clearing the request to save outlier data

First, clear the checkbox for Studentized residuals.

Second, clear the checkbox for Mahalanobis distance.

Third, clear the checkbox for Cook's distance.

Fourth, click on the OK button to complete the specifications.

Slide 53

Opening the statistics options dialog

Once we have removed outliers, we need to check the sample size requirement for regression. Since we will need the descriptive statistics for this, click on the Statistics button.

Slide 54

Requesting descriptive statistics

First, mark the checkbox for Descriptives.

Second, mark the checkbox for Collinearity diagnostics to obtain the tolerance values for each independent variable in order to assess multicollinearity.

Third, click on the Continue button to complete the specifications.

Slide 55

Requesting the output

Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.

Slide 56

SELECTION OF MODEL FOR INTERPRETATION

Prior to any transformations of variables to satisfy the assumptions of multiple regression and the removal of outliers and influential cases, the proportion of variance in the dependent variable explained by the independent variables (R²) was 18.7%. After substituting transformed variables and removing outliers and influential cases, the proportion of variance in the dependent variable explained by the independent variables (R²) was 38.4%.

Model Summary(b)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .620a   .384       .377                .1457258

a. Predictors: (Constant), SEI, AGE, SEX
b. Dependent Variable: LOGEARN

Since the regression analysis using transformations and omitting outliers and influential cases explained at least two percent more variance than the regression analysis with all cases and no transformations, the regression analysis with transformed variables omitting outliers and influential cases was interpreted.

Slide 57

SAMPLE SIZE

The minimum ratio of valid cases to independent variables for multiple regression is 5 to 1. After removing 6 influential cases or outliers, there are 248 valid cases and 3 independent variables. The ratio of cases to independent variables for this analysis is 82.67 to 1, which satisfies the minimum requirement. In addition, the ratio of 82.67 to 1 satisfies the preferred ratio of 15 to 1.

Descriptive Statistics

           Mean      Std. Deviation   N
LOGEARN    .354289   .1845814         248
AGE        46.70     16.677           248
SEX        1.57      .496             248
SEI        48.819    19.1071          248

Slide 58

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES

The probability of the F statistic (50.759) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.

Slide 59

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES

The Multiple R for the relationship between the set of independent variables and the dependent variable is 0.620, which would be characterized as strong using the rule of thumb that a correlation less than or equal to 0.20 is very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.

Slide 60

MULTICOLLINEARITY

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

Multicollinearity occurs when one independent variable is so strongly correlated with one or more other variables that its relationship to the dependent variable is likely to be misinterpreted. Its potential unique contribution to explaining the dependent variable is minimized by its strong relationship to other independent variables. Multicollinearity is indicated when the tolerance value for an independent variable is less than 0.10.

The tolerance values for all of the independent variables are larger than 0.10. Multicollinearity is not a problem in this regression analysis.

Slide 61

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

For the independent variable age, the probability of the t statistic (-12.237) for the b coefficient is <0.001, which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and the log transformation of how many in family earned money.

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

The b coefficient associated with age (-0.007) is negative, indicating an inverse relationship in which higher numeric values for age are associated with lower numeric values for the log transformation of how many in family earned money. Therefore, the negative value of b implies that survey respondents who were older had fewer family members earning money.

Slide 62

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

For the independent variable sex, the probability of the t statistic (1.284) for the b coefficient is 0.200, which is greater than the level of significance of 0.05. We fail to reject the null hypothesis that the slope associated with sex is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between sex and the log transformation of how many in family earned money.

Slide 63

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

For the independent variable respondent's socioeconomic index, the probability of the t statistic (0.354) for the b coefficient is 0.724, which is greater than the level of significance of 0.05. We fail to reject the null hypothesis that the slope associated with respondent's socioeconomic index is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between respondent's socioeconomic index and the log transformation of how many in family earned money.

Slide 64

Validation analysis: set the random number seed

To set the random number seed, select the Random Number Seed command from the Transform menu.

Slide 65

Set the random number seed

First, click on the Set seed to option button to activate the text box.

Second, type in the random seed stated in the problem.

Third, click on the OK button to complete the dialog box. Note that SPSS does not provide you with any feedback about the change.

Slide 66

Validation analysis: compute the split variable

To enter the formula for the variable that will split the sample in two parts, click on the Compute command.

Slide 67

The formula for the split variable

First, type the name for the new variable, split, into the Target Variable text box.

Second, the formula for the value of split is shown in the text box. The uniform(1) function generates a random decimal number between 0 and 1, which is compared to the value 0.50. If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent of true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent of false.

Third, click on the OK button to complete the dialog box.
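A sketch of the same split outside SPSS, assuming the cleaned DataFrame from the outlier step. Note that NumPy's generator differs from SPSS's uniform(1), so the same seed will not reproduce SPSS's case assignment:

    import numpy as np

    rng = np.random.default_rng(788035)
    df_clean["split"] = (rng.uniform(size=len(df_clean)) <= 0.50).astype(int)
    first_half = df_clean[df_clean["split"] == 0]
    second_half = df_clean[df_clean["split"] == 1]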

Slide 68

The split variable in the data editor

In the data editor, the split variable shows a random pattern of zeros and ones. To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.

Slide 69

Repeat the regression with first validation sample

To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.

Slide 70

Using "split" as the selection variable

First, scroll down the list of variables and highlight the variable split.

Second, click on the right arrow button to move the split variable to the Selection Variable text box.

Slide 71

Setting the value of split to select cases

When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.

Click on the Rule button to enter a value for split.

Slide 72

Completing the value selection

First, type the value for the first half of the sample, 0, into the Value text box.

Second, click on the Continue button to complete the value entry.

Slide 73

Requesting output for the first validation sample

When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable.

Click on the OK button to request the output.

Since the validation analysis requires us to compare the results of the analyses using the two split samples, we will request the output for the second sample before doing any comparison.

Slide 74

Repeat the regression with second validation sample

To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.

Slide 75

Setting the value of split to select cases

Since the split variable is already in the Selection Variable text box, we only need to change its value. Click on the Rule button to enter a different value for split.

Slide 76

Completing the value selection

First, type the value for the second half of the sample, 1, into the Value text box.

Second, click on the Continue button to complete the value entry.

Slide 77

Requesting output for the second validation sample

When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.

Click on the OK button to request the output.

Slide 78

SPLIT-SAMPLE VALIDATION - 1

In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant.

ANOVA(b,c): first validation sample

             Sum of Squares   df    Mean Square   F        Sig.
Regression   1.692            3     .564          24.220   .000a
Residual     2.538            109   .023
Total        4.230            112

a. Predictors: (Constant), SEI, AGE, SEX
b. Dependent Variable: LOGEARN
c. Selecting only cases for which SPLIT = .0000

In the first validation, the probability for the F statistic testing the overall relationship was <0.001.

ANOVA(b,c): second validation sample

             Sum of Squares   df    Mean Square   F        Sig.
Regression   1.500            3     .500          25.062   .000a
Residual     2.614            131   .020
Total        4.114            134

a. Predictors: (Constant), SEI, SEX, AGE
b. Dependent Variable: LOGEARN
c. Selecting only cases for which SPLIT = 1.0000

For the second validation analysis, the probability for the F statistic testing the overall relationship was <0.001. Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.

Slide 79

SPLIT-SAMPLE VALIDATION - 2

Model Summary(b,c): first validation sample
R (selected cases, SPLIT = .0000): .632a
R (unselected cases, SPLIT ~= .0000): .593
R Square: .400
Adjusted R Square: .383
Std. Error of the Estimate: .1525916
Durbin-Watson: 2.117 (selected), 1.862 (unselected)

a. Predictors: (Constant), SEI, AGE, SEX
b. Unless noted otherwise, statistics are based only on cases for which SPLIT = .0000.
c. Dependent Variable: LOGEARN

Model Summary(b,c): second validation sample
R (selected cases, SPLIT = 1.0000): .604a
R (unselected cases, SPLIT ~= 1.0000): .621
R Square: .365
Adjusted R Square: .350
Std. Error of the Estimate: .1412615
Durbin-Watson: 1.839 (selected), 2.161 (unselected)

a. Predictors: (Constant), SEI, SEX, AGE
b. Unless noted otherwise, statistics are based only on cases for which SPLIT = 1.0000.
c. Dependent Variable: LOGEARN

The total proportion of variance in the relationship utilizing the full data set was 38.4%, compared to 40.0% for the first split-sample validation and 36.5% for the second split-sample validation.

In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (38.4%).

Slide 80

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1

The relationship between "age" [age] and "log transformation of how many in family earned money" [logearn] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant.

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .663   .077                 8.603    .000
AGE          -.007  .001         -.628   -8.429   .000   .992        1.008
SEX          .024   .029         .062    .828     .410   .989        1.011
SEI          .000   .001         -.039   -.525    .601   .985        1.015

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = .0000

In the first validation analysis, the probability for the test of relationship between "age" [age] and "log transformation of how many in family earned money" [logearn] was <0.001, which was less than or equal to the level of significance of 0.05 and statistically significant.

Slide 81

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .595   .062                 9.552    .000
AGE          -.007  .001         -.598   -8.590   .000   .999        1.001
SEX          .022   .025         .063    .907     .366   .999        1.001
SEI          .001   .001         .076    1.098    .274   .999        1.001

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = 1.0000

In the second validation analysis, the probability for the test of relationship between "age" [age] and "log transformation of how many in family earned money" [logearn] was <0.001, which was less than or equal to the level of significance of 0.05 and statistically significant.

Slide 82

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3

The relationship between "respondent's socioeconomic index" [sei] and "log transformation of how many in family earned money" [logearn] was not statistically significant for the model using the full data set (p=0.724). Similarly, the relationships in both of the validation analyses were not statistically significant.

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .663   .077                 8.603    .000
AGE          -.007  .001         -.628   -8.429   .000   .992        1.008
SEX          .024   .029         .062    .828     .410   .989        1.011
SEI          .000   .001         -.039   -.525    .601   .985        1.015

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = .0000

In the first validation analysis, the probability for the test of relationship between "respondent's socioeconomic index" [sei] and "log transformation of how many in family earned money" [logearn] was 0.601, which was greater than the level of significance of 0.05 and not statistically significant.

Slide 83

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 4

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .595   .062                 9.552    .000
AGE          -.007  .001         -.598   -8.590   .000   .999        1.001
SEX          .022   .025         .063    .907     .366   .999        1.001
SEI          .001   .001         .076    1.098    .274   .999        1.001

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = 1.0000

In the second validation analysis, the probability for the test of relationship between "respondent's socioeconomic index" [sei] and "log transformation of how many in family earned money" [logearn] was 0.274, which was greater than the level of significance of 0.05 and not statistically significant.

Slide 84

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 5

The relationship between "sex" [sex] and "log transformation of how many in family earned money" [logearn] was not statistically significant for the model using the full data set (p=0.200). Similarly, the relationships in both of the validation analyses were not statistically significant.

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .663   .077                 8.603    .000
AGE          -.007  .001         -.628   -8.429   .000   .992        1.008
SEX          .024   .029         .062    .828     .410   .989        1.011
SEI          .000   .001         -.039   -.525    .601   .985        1.015

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = .0000

In the first validation analysis, the probability for the test of relationship between "sex" [sex] and "log transformation of how many in family earned money" [logearn] was 0.410, which was greater than the level of significance of 0.05 and not statistically significant.

Slide 85

RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 6

Coefficients(a,b)

             B      Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   .595   .062                 9.552    .000
AGE          -.007  .001         -.598   -8.590   .000   .999        1.001
SEX          .022   .025         .063    .907     .366   .999        1.001
SEI          .001   .001         .076    1.098    .274   .999        1.001

a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = 1.0000

In the second validation analysis, the probability for the test of relationship between "sex" [sex] and "log transformation of how many in family earned money" [logearn] was 0.366, which was greater than the level of significance of 0.05 and not statistically significant.

The split-sample validation supports the findings of the regression analysis using the full data set. The answer to the original question is true.

Slide 86

Table of validation results: standard regression

It may be helpful to create a table for our validation results and fill in its cells as we complete the analysis.

                                   Full Data Set      Split = 0 (Split1 = 1)   Split = 1 (Split2 = 1)
ANOVA significance (sig <= 0.05)   <0.001             <0.001                   <0.001
R²                                 0.384              0.400                    0.365
Significant coefficients           Age of respondent  Age of respondent        Age of respondent
(sig <= 0.05)

The split-sample validation supports the findings of the regression analysis using the full data set.

Slide 87

Answering the problem question - 1

We have found that there is a statistically significant relationship between the set of IVs and the DV (p<0.001), and the Multiple R was 0.620, which would be characterized as a strong relationship.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 88

Answering the problem question - 2

The b coefficient associated with age was statistically significant (p<0.001), so there was an individual relationship to interpret. The b coefficient (-0.007) was negative, indicating an inverse relationship in which higher numeric values for age are associated with lower numeric values for the log transformation of how many in family earned money. Therefore, the negative value of b implies that survey respondents who were older had fewer family members earning money.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Slide 89

Answering the problem question - 3

For the independent variable sex, the probability of the t statistic (1.284) for the b coefficient is 0.200, which is greater than the level of significance of 0.05. Sex did not have a relationship to the number of persons in the family earning money.

For the independent variable respondent's socioeconomic index, the probability of the t statistic (0.354) for the b coefficient is 0.724, which is greater than the level of significance of 0.05. Socioeconomic status did not have a relationship to the number of persons in the family earning money.

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criteria for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.

The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].

Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true.

Slide 90

Steps in regression analysis: running the baseline model

The following is a guide to the decision process for answering problems about the complete regression analysis:

Dependent variable metric? Independent variables metric or dichotomous?
- No: Inappropriate application of a statistic.
- Yes: continue.

Ratio of cases to independent variables at least 5 to 1?
- No: Inappropriate application of a statistic.
- Yes: Run the baseline regression, using the method for including variables identified in the research question. Record R² for evaluation of transformations and removal of outliers and influential cases. Record the Durbin-Watson statistic for the assumption of independence of errors.

Slide 91

Steps in regression analysis: evaluating assumptions - 1

Is the dependent variable normally distributed?
- No: Try (1) logarithmic transformation, (2) square root transformation, (3) inverse transformation. If unsuccessful, add a caution for violation of regression assumptions.
- Yes: continue.

Metric IVs normally distributed and linearly related to DV?
- No: Try (1) logarithmic transformation, (2) square root transformation, (3) square transformation, (4) inverse transformation. If unsuccessful, add a caution for violation of regression assumptions.
- Yes: continue.

Slide 92

Steps in regression analysis: evaluating assumptions - 2

DV is homoscedastic for categories of dichotomous IVs?
- No: Add a caution for violation of regression assumptions.
- Yes: continue.

Residuals are independent, Durbin-Watson between 1.5 and 2.5?
- No: Add a caution for violation of regression assumptions.
- Yes: continue.

Slide 93

Steps in regression analysis: evaluating outliers

Request statistics for detecting outliers and influential cases by running a standard multiple regression, using the Enter method to include all variables and substituting transformed variables.

Univariate outliers (DV), multivariate outliers (IVs), or influential cases?
- Yes: Remove outliers and influential cases from the data set.
- No: continue.

Ratio of cases to independent variables still at least 5 to 1?
- No: Restore outliers and influential cases to the data set, add a caution to the findings.
- Yes: continue.

Slide 94

Steps in regression analysis: picking the regression model for interpretation

Were transformed variables substituted, or outliers and influential cases omitted?
- No: Pick the baseline regression for interpretation.
- Yes: Evaluate the impact of transformations and removal of outliers by running the regression again, using the method for including variables identified in the research question.

R² for the evaluated regression greater than R² for the baseline regression by 2% or more?
- Yes: Pick the regression with transformations and omitting outliers for interpretation.
- No: Pick the baseline regression for interpretation.

Slide 95

Steps in regression analysis: overall relationship is interpretable

Probability of the ANOVA test of the regression less than or equal to the level of significance?
- No: False.
- Yes: continue.

Tolerance for all IVs greater than 0.10, indicating no multicollinearity?
- No: False.
- Yes: continue.

Steps in regression analysis: validation - 1

Slide 96

Enough valid cases to split the sample and keep a 5 to 1 ratio of cases/variables?
- Yes: Set the random seed and compute the split variable. Re-run the regression with split = 0; re-run the regression with split = 1.
- No: Set the first random seed and compute the split1 variable; re-run the regression with split1 = 1. Set the second random seed and compute the split2 variable; re-run the regression with split2 = 1.

Probability of the ANOVA test <= the level of significance for both validation analyses?
- No: False.
- Yes: continue.

Slide 97

Steps in regression analysis: validation - 2

R² for both validations within 5% of R² for the analysis of the full data set?
- No: False.
- Yes: continue.

Change in R² statistically significant in both validation analyses? (Hierarchical regression only.)
- No: False.
- Yes: continue.

Pattern of significance for the independent variables in both validations matches the pattern for the full data set?
- No: False.
- Yes: continue.

Slide 98

Steps in regression analysis: answering the question

Satisfies the ratio for the preferred sample size: 15 to 1 (stepwise: 50 to 1)?
- No: True with caution.
- Yes: continue.

DV is interval level and IVs are interval level or dichotomous?
- No: True with caution.
- Yes: continue.

Assumptions not violated? Outliers/influential cases excluded from interpretation?
- No: True with caution.
- Yes: True.

Slide 99

Interpreting the coefficients when the dependent variable is transformed

Slide 100

Interpreting b coefficients when the dependent variable is transformed - 1

For the independent variable age, the probability of the t statistic (-12.237) for the b coefficient is <0.001, which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and the log transformation of how many in family earned money.

Coefficients(a)

             B      Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)   .626   .048                 12.989    .000
AGE          -.007  .001         -.615   -12.237   .000   .999        1.001
SEX          .024   .019         .065    1.284     .200   .997        1.003
SEI          .000   .000         .018    .354      .724   .997        1.004

a. Dependent Variable: LOGEARN

The b coefficient associated with age (-0.007) is negative, indicating an inverse relationship in which higher numeric values for age are associated with lower numeric values for the log transformation of how many in family earned money. Therefore, the negative value of b implies that survey respondents who were older had fewer family members earning money.

Slide 101

Interpreting b coefficients when the dependent variable is transformed - 2

If we want to interpret a specific change in the number of earners for some amount of change in age, we will need to find the answer and convert it from log units to decimal units. We can use Microsoft Excel to calculate the answer.

In the worksheet, I have entered the b coefficient from the SPSS output (-0.007) in row 1. In row 2, I have entered different ages, e.g. 20, 30, 40, and 50.

Slide 102

Interpreting b coefficients when the dependent variable is transformed - 3

On row 3, we multiply the value for the independent variable age by the b coefficient, which measures the contribution to the dependent variable in log units.

On row 4, we reverse the log transform back to decimal units by raising the number 10 to the value on row 3. The caret symbol is used by Excel for raising to a power, so 10^-0.14 is 10 raised to the -0.14 power, or 0.7244.

Slide 103

Interpreting b coefficients when the dependent variable is transformed - 4

Based on our table, a respondent at age 20 contributes 0.7244 to the number of earners in the family. If a respondent were 30, rather than 20, the contribution to the number of earners would be 0.6166, a decrease of -0.1078. Thus, increasing age has a negative effect on the number of earners.

Note that as we go up in increments of 10, the difference between increments is decreasing. The logarithmic scale is not linear, requiring us to compute the change for any specific interval of interest.
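The same calculation can be sketched in Python rather than Excel, using the fitted b coefficient of -0.007:

    b = -0.007
    ages = [20, 30, 40, 50]
    contrib = [10 ** (b * a) for a in ages]  # undo the base-10 log: 0.7244, 0.6166, ...
    deltas = [c2 - c1 for c1, c2 in zip(contrib, contrib[1:])]  # shrinking differences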
