
RESEARCH METHODOLOGY

RESULT AND
ANALYSIS
(part 2)

HYPOTHESIS TESTING
A hypothesis
is a conjecture about a population
parameter. This conjecture may or may not
be true.
An educated guess based on theory and
background information
A proposed explanation for a phenomenon.
Hypothesis Testing is a process of using
sample data and statistical procedures to
decide whether to reject or not reject a
hypothesis (statement) about a population
parameter value.

Examples
Whether seat belts will reduce the
severity of injuries caused by accident
Whether the public prefer certain
colour in the fabric lining
Whether adding a chemical will
improve water quality
The average life expectancy in the
next decade for man will be more than
100 years
Education increases income

The hypothesis "Education increases income" asserts
a positive relationship between the concepts
"education" and "income."


This abstract or conceptual hypothesis cannot be
tested. First, it must be operationalized or situated
in the real world by rules of interpretation. Consider
again the simple hypothesis "Education increases
Income."
To test the hypothesis the abstract meaning of
education and income must be derived or
operationalized. The concepts should be measured.
Education could be measured by "years of
school completed" or "highest degree
completed" etc. Income could be measured
by "hourly rate of pay" or "yearly salary" etc.

Two types of statistical hypothesis

The Null Hypothesis: symbolised by Ho,
states that there is no difference between a
parameter and a specific value OR that
there is no difference between two
parameters. NULL means NO CHANGE.
It is a statement of equality.
The Alternative Hypothesis: symbolised
by Ha (or H1), states a specific difference between
a parameter and a specific value OR states
that there is a difference between two
parameters. Also called the test or research hypothesis.

Situation A: A researcher is interested in finding out
whether a new medicine will have any undesirable
side effects on the pulse rate of the patient. Will the
pulse rate increase, decrease or remain unchanged?
Since the researcher knows the pulse rate of the
population under study is 82 beats per minute, the
hypothesis will be
Ho : μ = 82 (remains unchanged)
H1 : μ ≠ 82 (will be different)
This is a two-tailed test since the possible effect
could be to raise or lower the pulse rate.

Situation B: A chemist invents an additive to
increase the life of an automobile battery. The
mean lifetime of an ordinary battery is 36 months. The
hypothesis will be:
Ho : μ ≤ 36
Ha : μ > 36
The chemist is interested only in increasing the
lifespan of the battery. His alternative hypothesis is
that the mean is larger than 36.
Therefore the test is called right-tailed: it is interested
in the increase only.

Situation C: A contractor wishes to lower heating
bills by using a special type of insulation in houses.
If the average monthly bill is RM100, his
hypothesis will be
Ho : μ ≥ RM 100
H1 : μ < RM 100
This is a left-tailed test since the contractor is
only interested in reducing the bill.

General Procedure for testing the
hypothesis (statistical approach)
Step 1: State the hypotheses.
Step 2: Formulate an analysis plan: select a level of
significance (e.g. 0.1, 0.05, 0.01), decide whether the
test is one-tailed or two-tailed, and find the
corresponding critical value.
Step 3: Analyze sample data to obtain the test value.
Step 4: Interpret results or make the decision
to reject or not to reject the hypothesis. If the test
value falls beyond the critical value (in the critical
region), reject Ho; otherwise do not reject Ho.
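
The critical-value comparison in Steps 2 and 4 can be sketched in Python for a z-based test. This is only an illustration under the assumption that the test statistic follows the standard normal distribution; the function names `z_critical` and `decide` are hypothetical and not part of the original slides.

```python
from statistics import NormalDist


def z_critical(alpha, two_tailed=True):
    """Critical z value for significance level alpha (standard normal assumed)."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


def decide(z, alpha, tail="two"):
    """Step 4: compare the test value z with the critical value.

    tail is "two", "right" or "left" (hypothetical labels for the
    two-tailed, right-tailed and left-tailed cases).
    """
    if tail == "two":
        reject = abs(z) > z_critical(alpha, two_tailed=True)
    elif tail == "right":
        reject = z > z_critical(alpha, two_tailed=False)
    else:  # left-tailed
        reject = z < -z_critical(alpha, two_tailed=False)
    return "reject Ho" if reject else "do not reject Ho"


# Critical values for the usual significance levels
for alpha in (0.10, 0.05, 0.01):
    print(alpha,
          round(z_critical(alpha), 3),                     # two-tailed
          round(z_critical(alpha, two_tailed=False), 3))   # one-tailed
```

For tests based on other distributions (t, F), the same decision rule applies but the critical value comes from that distribution's table instead.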

significant difference
A significant difference occurs if the
difference between the hypothesized (null)
value and the sample statistic value is too
large to be attributed to chance. A
significant difference strongly suggests that
the null hypothesis is not true.
Significant difference at p<0.05 means that,
if the null hypothesis were true, a difference
at least this large would arise by chance in
fewer than 5% of samples.

TESTING THE DIFFERENCE


AMONG MEANS AND
VARIANCE
Situations:
To compare the average lifetime of
two different brands of tires


Two different brands of fertilizer,
whether one is better than the other
for growing plants
Two brands of cough syrup, to test
whether one brand is more effective
than the other

Problem 1: Two-Tailed Test


Suppose the Acme Drug Company develops a new
drug, designed to prevent colds. The company
states that the drug is equally effective for men
and women. To test this claim, they choose a
simple random sample of 100 women and 200
men from a population of 100,000 volunteers.
At the end of the study, 38% of the women caught
a cold; and 51% of the men caught a cold. Based
on these findings, can we reject the company's
claim that the drug is equally effective for men
and women? Use a 0.05 level of significance.

Solution:
State the hypotheses. The first step is to state

the null hypothesis and an alternative hypothesis.


Null hypothesis: P1 = P2
Alternative hypothesis: P1 ≠ P2
Note that these hypotheses constitute a two-tailed
test. The null hypothesis will be rejected if the
proportion from population 1 is too big or if it is too
small.
Formulate an analysis plan. For this analysis, the
significance level is 0.05. The test method is a
two-proportion z-test.

Analyze sample data. Using sample data, we
calculate the pooled sample proportion (p) and the
standard error (SE). Using those measures, we
compute the z-score test statistic (z).
p = (p1 * n1 + p2 * n2) / (n1 + n2) = [(0.38 * 100) +
(0.51 * 200)] / (100 + 200) = 140/300 = 0.467
SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }
SE = sqrt [ 0.467 * 0.533 * ( 1/100 + 1/200 ) ] = sqrt
[0.003733] = 0.061
z = (p1 - p2) / SE = (0.38 - 0.51)/0.061 = -2.13
where p1 is the sample proportion in sample 1 (women),
p2 is the sample proportion in sample 2 (men), n1 is the
size of sample 1, and n2 is the size of sample 2.

Since we have a two-tailed test, the P-value is

the probability that the z-score is less than


-2.13 or greater than 2.13.
We use the Normal Distribution Calculator to
find P(z < -2.13) = 0.017, and P(z > 2.13) =
0.017. Thus, the P-value = 0.017 + 0.017 =
0.034.
Interpret results. Since the P-value (0.034)
is less than the significance level (0.05),
we reject the null hypothesis that the drug is
equally effective for men and women.
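
As a cross-check on the hand calculation, the two-proportion z-test can be sketched with the Python standard library. The function name `two_proportion_z` is hypothetical, and sample 1 is taken to be the women and sample 2 the men, as in the problem statement.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p1, n1, p2, n2):
    """Pooled proportion, standard error and z statistic for two samples."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)            # pooled sample proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # standard error
    z = (p1 - p2) / se                             # test statistic
    return p, se, z


# Sample 1: 100 women, 38% caught a cold; sample 2: 200 men, 51% caught a cold
p, se, z = two_proportion_z(0.38, 100, 0.51, 200)

# Two-tailed P-value: probability of a z-score at least this extreme in either tail
p_two_tailed = 2 * NormalDist().cdf(-abs(z))

print(round(p, 3), round(se, 3), round(z, 2))
print(round(p_two_tailed, 3))
```

Because the code does not round intermediate values, its P-value can differ slightly in the third decimal from the hand calculation, which rounds each tail to 0.017 before adding. For a one-tailed test in the lower tail, `NormalDist().cdf(z)` alone gives the P-value.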

Problem 2: One-Tailed Test


Suppose the previous example is stated a little bit
differently. Suppose the Acme Drug Company
develops a new drug, designed to prevent colds.
The company states that the drug is more effective
for women than for men. To test this claim, they
choose a simple random sample of 100 women
and 200 men from a population of 100,000
volunteers.
At the end of the study, 38% of the women caught a
cold; and 51% of the men caught a cold. Based on
these findings, can we conclude that the drug is
more effective for women than for men? Use a 0.01
level of significance.

Solution:

State the hypotheses. The first step is to state

the null hypothesis and an alternative hypothesis.


Null hypothesis: P1 >= P2
Alternative hypothesis: P1 < P2
Note that these hypotheses constitute a one-tailed
test. The null hypothesis will be rejected if the
proportion of women catching cold (p1) is
sufficiently smaller than the proportion of men
catching cold (p2).

Formulate an analysis plan. For this analysis, the
significance level is 0.01. The test method is a
two-proportion z-test.

Analyze sample data. Using sample data, we

calculate the pooled sample proportion (p) and the


standard error (SE). Using those measures, we
compute the z-score test statistic (z).
p = (p1 * n1 + p2 * n2) / (n1 + n2) = [(0.38 * 100) +
(0.51 * 200)] / (100 + 200) = 140/300 = 0.467
SE = sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }
SE = sqrt [ 0.467 * 0.533 * ( 1/100 + 1/200 ) ] = sqrt
[0.003733] = 0.061
z = (p1 - p2) / SE = (0.38 - 0.51)/0.061 = -2.13
where p1 is the sample proportion in sample 1 (women),
p2 is the sample proportion in sample 2 (men), n1 is the size
of sample 1, and n2 is the size of sample 2.

Since we have a one-tailed test, the P-value is

the probability that the z-score is less than


-2.13. We use the Normal Distribution
Calculator to find P(z < -2.13) = 0.017. Thus,
the P-value = 0.017.
Interpret results. Since the P-value (0.017) is
greater than the significance level (0.01),
we cannot reject the null hypothesis.

Commonly used Methods


1. z-test
For detecting difference between two
means for large sample (two samples)
Assumptions required
The samples must be independent,
that is, there is no relationship between the
subjects in the samples
The sample must be normally distributed

Example problem
Suppose that in a particular geographic region, the
mean and standard deviation of scores on a
reading test are 100 points, and 12 points,
respectively. Our interest is in the scores of 55
students in a particular school who received a
mean score of 96. We can ask whether this mean
score is significantly lower than the regional
mean; that is, are the students in this school
comparable to a simple random sample of 55
students from the region as a whole, or are their
scores surprisingly low? Calculate the z-score.

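solution
We begin by calculating the standard error
(SE) of the mean:
SE = sd / sqrt(n) = 12 / sqrt(55) = 12 / 7.42 = 1.62

Next we calculate the z-score, which is the distance from the
sample mean (M) to the population mean (μ) in units of the standard
error:
z = (M - μ) / SE = (96 - 100) / 1.62 = -2.47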
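
The standard error and z-score for the reading-test example can be sketched in Python as follows. The function name `one_sample_z` is hypothetical; the inputs are the figures given in the problem (regional mean 100, standard deviation 12, 55 students, sample mean 96).

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """Standard error of the mean and z-score of a sample mean."""
    se = pop_sd / sqrt(n)                  # standard error of the mean
    z = (sample_mean - pop_mean) / se      # distance in standard-error units
    return se, z


se, z = one_sample_z(sample_mean=96, pop_mean=100, pop_sd=12, n=55)

# Lower-tail probability of a sample mean this low under the regional distribution
p_lower = NormalDist().cdf(z)

print(round(se, 2), round(z, 2))
```

The lower-tail probability `p_lower` is an addition for illustration; the slides only ask for the z-score.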

problems
1. The mean and standard deviation of scores on a calculating
test are 120 points and 18 points, respectively. Our interest is
in the scores of 81 students in a particular school who
received a mean score of 92. We can ask whether this mean
score is significantly lower than the regional mean; that is,
are the students in this school comparable to a simple
random sample of 81 students from the region as a whole, or
are their scores surprisingly low? Calculate the z-score.
2. Every year, 50,000 runners compete in the Peachtree Road
Race. They run 10 kilometers (a little over 6 miles). The
average finishing time is 55 minutes, with a standard
deviation of 10 minutes. Fred and Wilma completed the race
in 61 and 51 minutes, respectively. Barney and Betty had
finishing times with z-scores of -0.3 and 0.7, respectively.
List the runners in order, starting with the fastest runner
and ending with the slowest runner.
(A) Wilma, Barney, Fred, Betty
(B) Barney, Wilma, Fred, Betty
(C) Wilma, Barney, Betty, Fred
(D) Betty, Fred, Barney, Wilma
(E) None of the above

solution
1. Calculate the standard error (SE) of the mean:
SE = sd / sqrt(n) = 18 / sqrt(81) = 18 / 9 = 2
Next we calculate the z-score:
z = (M - μ) / SE = (92 - 120) / 2 = -28 / 2 = -14

solution
2. The answer is A. This problem can be solved by converting
Fred and Wilma's raw scores into z-scores. To do this, we use
the z-score equation:
z = (x - M) / sd
where z is the z-score, x is the runner's raw score, M is the
mean finishing time, and sd is the standard deviation of
finishing times.
Solving first for Fred's z-score, we get
z = (x - M) / sd = (61 - 55) / 10 = 0.60
Using the same approach to compute Wilma's z-score, we get
z = (x - M) / sd = (51 - 55) / 10 = -0.40
A lower finishing time means a faster runner, so based on
z-scores we can order the runners from fastest to
slowest as follows: Wilma (z = -0.4), Barney (z = -0.3), Fred (z
= 0.6), and Betty (z = 0.7).

problem
Each year, a national achievement test is

administered to 3rd graders. The test has a


mean score of 100 and a standard deviation of
15. If Jane's z-score is 1.20, what was her
score on the test?
(A) 82
(B) 88
(C) 100
(D) 112
(E) 118

solution
The correct answer is (E). From the z-score
equation, we know
z = (x - M) / sd
where z is the z-score, x is the value of Jane's
test score, M is the mean test score, and sd
is the standard deviation of test scores.
Solving for Jane's test score (x), we get
x = (z * sd) + M = (1.20 * 15) + 100 =
18 + 100 = 118

2. F test
For the comparison of two variances or
standard deviations, e.g. variation in
cholesterol level in men and women
Assumptions
The population from which the
samples were obtained must be
normally distributed
Samples must be independent of
each other

Example problem
Consider an experiment to

study the effect of three


different levels of a factor on a
response (e.g. three levels of a
fertilizer on plant growth). If
we had 6 observations for
each level, we could write the
outcome of the experiment in
a table like this, where a1, a2,
and a3 are the three levels of
the factor being studied.

a1    a2    a3
6     8     13
8     12    9
4     9     11
5     11    8
3     6     7
4     8     12
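solution
Step 1: Calculate the mean within each group:
mean(a1) = (6 + 8 + 4 + 5 + 3 + 4) / 6 = 5
mean(a2) = (8 + 12 + 9 + 11 + 6 + 8) / 6 = 9
mean(a3) = (13 + 9 + 11 + 8 + 7 + 12) / 6 = 10

Step 2: Calculate the overall mean:
overall mean = (5 + 9 + 10) / 3 = 8
where a = 3 is the number of groups.

Step 3: Calculate the "between-group" sum of
squares:
SB = n(5 - 8)^2 + n(9 - 8)^2 + n(10 - 8)^2 = 6(9) + 6(1) + 6(4) = 84
where n = 6 is the number of data values per group.

The between-group degrees of freedom is one less than the number of
groups
fb = 3 - 1 = 2
so the between-group mean square value is
MSB = 84 / 2 = 42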

Step 4: Calculate the "within-group" sum
of squares. Begin by centering the data in
each group (subtracting the group mean from each value):

a1            a2            a3
6 - 5 = 1     8 - 9 = -1    13 - 10 = 3
8 - 5 = 3     12 - 9 = 3    9 - 10 = -1
4 - 5 = -1    9 - 9 = 0     11 - 10 = 1
5 - 5 = 0     11 - 9 = 2    8 - 10 = -2
3 - 5 = -2    6 - 9 = -3    7 - 10 = -3
4 - 5 = -1    8 - 9 = -1    12 - 10 = 2

The within-group sum of squares is the sum of squares of all 18 values in this
table
SW = 1 + 9 + 1 + 0 + 4 + 1 + 1 + 9 + 0 + 4 + 9 + 1 + 9 + 1 + 1 + 4 + 9 + 4 = 68

The within-group degrees of freedom is
fW = a(n - 1) = 3(6 - 1) = 15

Thus the within-group mean square
value is
MSW = SW / fW = 68 / 15 = 4.5 (approximately)

Step 5: The F-ratio is
F = MSB / MSW = 42 / 4.5 = 9.3 (approximately)
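
The five steps of the F-ratio calculation can be sketched in Python as a check on the arithmetic. The function name `one_way_f` is hypothetical, and the sketch assumes every group has the same number of observations, as in this example.

```python
def one_way_f(groups):
    """Between-group and within-group mean squares and their F-ratio.

    Assumes all groups have the same number of observations.
    """
    a = len(groups)                              # number of groups
    n = len(groups[0])                           # observations per group
    means = [sum(g) / n for g in groups]         # Step 1: group means
    overall = sum(means) / a                     # Step 2: overall mean
    sb = n * sum((m - overall) ** 2 for m in means)                  # Step 3
    sw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)  # Step 4
    msb = sb / (a - 1)                           # between-group mean square
    msw = sw / (a * (n - 1))                     # within-group mean square
    return msb, msw, msb / msw                   # Step 5: F-ratio


a1 = [6, 8, 4, 5, 3, 4]
a2 = [8, 12, 9, 11, 6, 8]
a3 = [13, 9, 11, 8, 7, 12]

msb, msw, f_ratio = one_way_f([a1, a2, a3])
print(msb, round(msw, 2), round(f_ratio, 1))
```

The resulting F-ratio would then be compared with the critical F value for 2 and 15 degrees of freedom at the chosen significance level.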

3. t-test
To test the difference between two
means for small independent samples
(n<30)
Assumptions
Sample must be independent
The populations are normally
distributed

CORRELATION AND
REGRESSION

Correlation is a statistical method used to determine
whether a relationship between variables exists.
Correlation attempts to study the strength of the
mutual relationship between two variables. In
correlation we assume that the variables are random
and dependence of any nature is not involved.
Regression describes the nature of the relationship
between variables. Regression studies the
relationship where dependence is necessarily
involved: one variable depends on a
certain number of other variables. Regression can be used
for predicting the values of the variable which
depends upon other variables.

Linear and Non Linear Correlation

Linear Correlation:
Correlation is said to be linear if the ratio of
change is constant. For example, if the amount of output in a
factory doubles when the number of
workers is doubled, the correlation is linear.
In other words, if all the
points on the scatter diagram tend to lie near
a straight line, the
correlation is said to be linear, as shown in the
figure.

Non Linear (Curvilinear) Correlation:
Correlation is said to be non linear if
the ratio of change is not constant. In other
words, if all the points on
the scatter diagram tend to lie near a smooth
curve, the correlation is said to be non linear
(curvilinear), as shown in the figure.

Positive and Negative Correlation

Positive Correlation:
Correlation in the same direction is called positive
correlation: if one variable increases the other also
increases, and if one variable decreases the other also
decreases. For example, the length of an iron bar will
increase as the temperature increases.
Negative Correlation:
Correlation in opposite directions is called
negative correlation: if one variable increases the other
decreases, and vice versa. For example, the volume
of a gas will decrease as the pressure increases, or the
demand for a particular commodity increases as the price
of that commodity decreases.
No Correlation or Zero Correlation:
If there is no relationship between the two variables,
such that the value of one variable changes while the
other variable remains constant, it is called no or zero
correlation.

Perfect Correlation
If any change in the value of one variable changes the
value of the other variable in a fixed
proportion, the correlation between them is said to be
perfect. It is indicated numerically as +1 or
-1.
Perfect Positive Correlation:
If the values of both variables move in the
same direction in a fixed proportion, the correlation is called perfect
positive correlation. It is indicated numerically as +1.
Perfect Negative Correlation:
If the values of both variables move in
opposite directions in a fixed proportion, the correlation is called perfect
negative correlation. It is indicated numerically as -1.

Coefficient of Correlation

For sample data the correlation coefficient


denoted by r is a measure of strength of the
linear relation between X and Y variables, where
r is a pure number and lies between -1 and +1.
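
A minimal sketch of how r might be computed, using the usual product-moment definition (sum of products of deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` is hypothetical, and the study and sleep hours below are made-up illustrative values, not the data from the slides.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r between paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Illustrative data only: each extra 2 study hours costs exactly 1 sleep hour
study_hours = [2, 4, 6, 8, 10]
sleep_hours = [10, 9, 8, 7, 6]

r = pearson_r(study_hours, sleep_hours)
print(r)
```

Because the illustrative sleep hours fall on a straight line with negative slope, r comes out at the lower bound of its range.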

Examples of Correlation
Calculate and analyze the correlation

coefficient between the number of study


hours and the number of sleeping hours of
different students.

Solution:
The necessary calculation is given below:

There is perfect negative correlation between the number


of study hours and the number of sleeping hours.

Problem
From the following data, compute the

coefficient of correlation between X and Y:

Summation of products of deviations of X and Y


series from their arithmetic means = 122.

Solution:

LINEAR REGRESSION
If the plot of n pairs of data (x , y) for an
experiment appear to indicate a "linear
relationship" between y and x, then the
method of least squares may be used to write
a linear relationship between x and y.
The least squares regression line for the set of n
data points is given by
y = ax + b
where a and b are given by
a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)
b = (1/n)(Σy - aΣx)

Example
Consider the following set of points: {(-2 , -1) ,

(1 , 1) , (3 , 2)}
a) Find the least square regression line for the
given data points.
b) Plot the given points and the regression line
in the same rectangular system of axes.

Solutions
a) Let us organize the data in a table.

x     y     xy    x²
-2    -1    2     4
1     1     1     1
3     2     6     9
Σx = 2, Σy = 2, Σxy = 9, Σx² = 14

We now use the above formula to calculate a and b
as follows
a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) = (3*9 - 2*2) /
(3*14 - 2²) = 23/38
b = (1/n)(Σy - aΣx) = (1/3)(2 - (23/38)*2) = 5/19

b) We now graph the regression line given by


y = ax + b and the given points.
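
The least squares calculation for this example can be sketched in Python. The function name `least_squares_line` is hypothetical; `fractions.Fraction` is used only so the result prints as exact fractions, which assumes integer data points.

```python
from fractions import Fraction


def least_squares_line(points):
    """Slope a and intercept b of the least squares line y = ax + b.

    Assumes integer coordinates so the result can be kept as exact fractions.
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    a = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b


a, b = least_squares_line([(-2, -1), (1, 1), (3, 2)])
print(a, b)  # prints: 23/38 5/19
```

With non-integer data, the same sums can be computed in floating point by replacing the `Fraction(...)` call with an ordinary division.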

Problems

2 a) Find the least squares regression line for the
following set of data {(-1 , 0),(0 , 2),(1 , 4),(2 , 5)}
b) Plot the given points and the regression line in
the same rectangular system of axes.
3 The values of x and their corresponding values of y
are shown in the table below
a) Find the least squares regression line y = ax + b.
b) Estimate the value of y when x = 10.
4 The sales of a company (in million dollars) for each
year are shown in the table below.
a) Find the least squares regression line y = ax + b.
b) Use the least squares regression line as a model
to estimate the sales of the company in 2012.


Multiple Regression
Several independent variables and one dependent variable:

Y = a + b1x1 + b2x2 + ... + bkxk


Assumptions for multiple regression
For any specific value of the independent variable, the
values of the y variable are normally distributed
(normality assumption)
The variances or standard deviation for the y variable
are the same for each value of the independent
variable (equal variance assumption)
There is a linear relationship between the dependent
variable and the independent variable (linearity
assumption)
The independent variables are not correlated
The values for the y variables are independent

NON-PARAMETRIC TEST
Z, F and t-tests are parametric tests, appropriate when data are
normally distributed
When data are not normally distributed, a non-parametric test is more appropriate.
Also called Distribution Free Statistics

Non-parametric methods are widely used for


studying populations that take on a ranked order
(such as movie reviews receiving one to four
stars). The use of non-parametric methods may
be necessary when data have a ranking but no
clear numerical interpretation, such as when
assessing preferences; in terms of level of
measurement, for data on an ordinal scale.

Advantages &
Disadvantages

Advantages of Non Parametric Test


Can be used when the variable is not
normally distributed
Can be used when the sample size is small
Can be used to test hypothesis
The computation is easier
Easier to understand
Disadvantages
Less sensitive
Less information
Less efficient

USING MODELS
Be sure with data requirement and the
need of the study
Consists of 4 main steps
Model formulation
Model optimization
Model calibration/verification
Model Application

Model Formulation
Involves empirical and theoretical evidence
Make assumptions to reduce the problem to a
manageable form (simplification of process)
Model optimization
Regression analysis: the analytical way
Subjective optimization based on the experience of the
modelers
Model Calibration
Changing the coefficients
Reducing the error between observed and predicted values
Model Application
After the model has been calibrated and validated
