
T-TEST

The t-test is used for the following purposes:

1. Testing the significance of the difference between a sample mean and a hypothesized population mean.
2. Testing the significance of the difference between the means of two different samples.
3. Testing the significance of the difference between sample means before and after a stimulus or treatment.
4. Testing the significance of the difference between an observed rank correlation coefficient and a hypothesized population rank correlation coefficient.
5. Testing the significance of coefficients in regression analysis.
6. Testing the significance of partial and multiple correlation coefficients.

EXAMPLE:
For example, a drug company may want to test a new cancer drug to find out if it improves life
expectancy. In an experiment, there is always a control group (a group given a placebo, or
“sugar pill”). The control group may show an average life expectancy of +5 years, while the group
taking the new drug might show +6 years. It would seem that the drug works, but the difference
could simply be a fluke of sampling. To test this, researchers would use a Student’s t-test to find out
whether the result would hold for the entire population.
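
Below is a minimal sketch of this comparison in Python, using SciPy's two-sample t-test. The life-expectancy figures are invented purely for illustration, not real trial data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical gains in life expectancy (years) for each group.
control = rng.normal(loc=5.0, scale=1.5, size=50)  # placebo group
treated = rng.normal(loc=6.0, scale=1.5, size=50)  # new-drug group

# Two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the difference is unlikely
# to be a fluke of sampling.
```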

Suppose a large conglomerate such as TCS (an Indian IT company) has more than 300,000 employees, and
TCS wants to estimate the average overtime an employee works for the company in a week. It might not
be possible to collect the required data from every employee (a hypothetical constraint, though these days
it might be feasible). Therefore, the company takes a sample, say 3,000 employees, and records the number
of extra hours of work each has done in a week. With the help of the sample mean and sample standard
deviation, one can estimate, for the entire population, the range of the average number of extra hours of
work employees have done in a week.
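
As a sketch of this estimation, one can compute a t-based confidence interval for the population mean from the sample statistics. The overtime figures below are simulated, purely hypothetical numbers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
overtime = rng.normal(loc=4.0, scale=2.0, size=3000)  # hypothetical weekly extra hours

n = len(overtime)
mean = overtime.mean()
sem = overtime.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% confidence interval for the population mean overtime.
low, high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
print(f"Estimated average overtime: {mean:.2f} h/week (95% CI {low:.2f}-{high:.2f})")
```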

Z-TEST

A Z-test is a type of hypothesis test. Hypothesis testing is just a way for you to figure out if results from a
test are valid or repeatable.

For example, if someone said they had found a new drug that cures cancer, you would want to be sure it
was probably true. A hypothesis test will tell you if it’s probably true, or probably not true. A Z-test is used
when your data are approximately normally distributed.

You would use a Z-test if:

1. Your sample size is greater than 30. Otherwise, use a t-test.
2. Your data points are independent of each other; in other words, one data point isn’t related to and doesn’t affect another data point.
3. Your data are normally distributed. However, for large sample sizes (over 30) this doesn’t always matter.
4. Your data are randomly selected from the population, so that each item has an equal chance of being selected.
5. Sample sizes should be equal if at all possible.
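
A minimal sketch of a one-sample Z-test under these conditions, computed by hand with SciPy's normal distribution; the sample values, the hypothesized mean, and the assumed population standard deviation are all made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=103, scale=15, size=100)  # n > 30, so a Z-test is reasonable

mu0 = 100    # hypothesized population mean
sigma = 15   # assumed known population standard deviation

# Z statistic: standardized distance of the sample mean from mu0.
z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
print(f"z = {z:.2f}, p = {p_value:.4f}")
```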

F-TEST

Any statistical test whose test statistic follows the F-distribution under the null hypothesis can be called an F-test.

For example, suppose one wants to test whether there is any significant difference between the mean
height of male and female students in a particular college. In such a situation, a t-test for the difference of
means can be used.
However, one assumption of that t-test is that the variances of the two populations are equal; in this case,
the two populations are the heights of male and female students. Unless this assumption holds, the t-test
for the difference of means should not be carried out.

The F-test can be used to test the hypothesis that the population variances are equal.
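
A minimal sketch of this variance-equality check in Python, using made-up height samples and SciPy's F-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
male = rng.normal(loc=175, scale=7, size=25)    # hypothetical heights (cm)
female = rng.normal(loc=162, scale=6, size=25)

var_m = male.var(ddof=1)
var_f = female.var(ddof=1)

# Conventionally the larger sample variance goes in the numerator.
F = max(var_m, var_f) / min(var_m, var_f)
df1 = df2 = 25 - 1
p_value = min(1.0, 2 * stats.f.sf(F, df1, df2))  # two-tailed
print(f"F = {F:.2f}, p = {p_value:.4f}")
# A large p-value means the equal-variance assumption of the t-test is plausible.
```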
CHI-SQUARE TEST

The test is applied when you have two categorical variables from a single population. It is used to
determine whether there is a significant association between the two variables.

For example, in an election survey, voters might be classified by gender (male or female) and voting
preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to
determine whether gender is related to voting preference. The sample problem at the end of the lesson
considers this example.

When to Use Chi-Square Test for Independence

The test procedure described in this lesson is appropriate when the following conditions are met:

• The sampling method is simple random sampling.
• The variables under study are each categorical.
• If sample data are displayed in a contingency table, the expected frequency count for each cell of the table is at least 5.

This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze
sample data, and (4) interpret results.
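
A minimal sketch of the gender-versus-voting-preference test above, using SciPy's chi2_contingency on an invented contingency table:

```python
from scipy.stats import chi2_contingency

# Rows: male, female; columns: Democrat, Republican, Independent.
# The counts are hypothetical survey data.
observed = [
    [120, 90, 40],
    [110, 95, 45],
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# Check the condition above: every expected cell count should be >= 5.
print(expected)
```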
ANOVA TEST

An ANOVA test is a way to find out if survey or experiment results are significant. In other words, it
helps you figure out whether you need to reject the null hypothesis in favor of the alternative.
Basically, you’re testing groups to see if there’s a difference between them. Examples of when you might
want to test different groups:

• A group of psychiatric patients is trying three different therapies: counseling, medication, and
biofeedback. You want to see if one therapy is better than the others.

• A manufacturer has two different processes to make light bulbs. They want to know if one process is
better than the other.

• Students from different colleges take the same exam. You want to see if one college outperforms the
others.
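
A minimal sketch of the three-therapy comparison using SciPy's one-way ANOVA; the improvement scores below are hypothetical:

```python
from scipy.stats import f_oneway

# Hypothetical improvement scores for each therapy group.
counseling = [23, 20, 25, 22, 21, 24]
medication = [28, 26, 27, 30, 25, 29]
biofeedback = [22, 24, 21, 23, 20, 22]

F, p_value = f_oneway(counseling, medication, biofeedback)
print(f"F = {F:.2f}, p = {p_value:.4f}")
# A small p-value indicates at least one therapy's mean differs from the others.
```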

MEASURES OF CENTRAL TENDENCY

A measure of central tendency helps in identifying the center of all the
observations and is therefore also called a statistical average or a measure of
central location. The central tendency helps in condensing a large data set into
a single value that represents the entire data set. Thus, central tendency is very
useful when the data under study are very large.

A measure of central tendency also helps in comparing one data set with
another. For example, if there are two samples of girls studying in two different
schools and their class 12 marks need to be compared, then calculating the
average marks for each sample allows an easy comparison between the two
groups to be drawn.

Also, the central tendency helps in comparing one value with the entire
data set. For example, a boy who obtained 50% marks in science can compare
his score with the average marks obtained by the students in his class to find
out where he stands.

Basically, there are three important measures of central tendency:

1. Mean: The mean is the most common measure of central tendency. It is the
value obtained by dividing the sum of all the observations by the number of
observations in the dataset. Symbolically, for observations x₁, x₂, …, xₙ:

Mean (x̄) = (x₁ + x₂ + … + xₙ) / n = Σxᵢ / n

2. Median: The median is a positional average, basically used in the context of
qualitative data, such as intelligence. It divides the data into two equal parts:
half of the items are less than the median, while the other half are greater than
the median. Therefore, the data set is first arranged in either ascending or
descending order. The median then depends on whether the number of
observations ‘n’ is odd or even:

a) ‘n’ is odd: Median = value of the ((n + 1) / 2)th item.

b) ‘n’ is even: Median = average of the (n / 2)th and ((n / 2) + 1)th items.

3. Mode: In a data set, the mode is the most frequently occurring item or
observation.

For example, a manufacturer of cloth wants to know the size that is most
frequently ordered by customers so that he can manufacture a large
quantity of that size.

Thus, these are the measures of central tendency used to find out the most
representative value of the dataset.
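
A minimal sketch computing all three measures with Python's standard statistics module, on an invented set of shirt-size orders (echoing the cloth-manufacturer example):

```python
import statistics

orders = [38, 40, 40, 42, 40, 38, 44, 40, 42, 40]  # hypothetical sizes ordered

print("Mean:  ", statistics.mean(orders))    # sum of values / number of values
print("Median:", statistics.median(orders))  # middle value after sorting
print("Mode:  ", statistics.mode(orders))    # most frequently ordered size
```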

MEASURES OF DISPERSION

The measures of central tendency are not adequate to describe data: two data sets can have
the same mean yet be entirely different. Thus, to describe data, one also needs to know the
extent of variability. This is given by the measures of dispersion. Range, interquartile range, and
standard deviation are the three commonly used measures of dispersion.
RANGE

The range is the difference between the largest and the smallest observation in the data. The
prime advantage of this measure of dispersion is that it is easy to calculate. On the other hand, it
has several disadvantages: it is very sensitive to outliers and does not use all the observations in a
data set. It is often more informative to provide the minimum and the maximum values than
the range alone.

The term “range” is used in several contexts:

1. General: The universe of all realistic possibilities.
2. Cost estimating: The upper and lower possibilities of the forecasted costs for a product, program, or project.
3. Marketing: A group of associated products offered by a producer or supplier for the same broad set of customers.
4. Statistics: A measure of the variation in a set of data, computed by subtracting the lowest value from the highest value in the same set.

INTERQUARTILE RANGE

The interquartile range is defined as the difference between the 25th and 75th percentiles (also called
the first and third quartiles). Hence the interquartile range describes the middle 50% of
observations. If the interquartile range is large, it means that the middle 50% of observations are
spaced wide apart. An important advantage of the interquartile range is that it can be used as a
measure of variability even if the extreme values are not recorded exactly (as in the case of open-
ended class intervals in a frequency distribution). Another advantage is that it is not affected by
extreme values. Its main disadvantage as a measure of dispersion is that it is not amenable to
mathematical manipulation.
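
A minimal sketch contrasting the range and the interquartile range on invented data containing an outlier:

```python
import numpy as np

data = np.array([2, 4, 4, 5, 7, 9, 11, 12, 15, 40])  # 40 is an outlier

q1, q3 = np.percentile(data, [25, 75])
print("Range:", data.max() - data.min())  # 38 -- dominated by the outlier
print("IQR:  ", q3 - q1)                  # unaffected by the extreme value
```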

STANDARD DEVIATION

Standard deviation (SD) is the most commonly used measure of dispersion. It is a measure of the
spread of data about the mean: the square root of the sum of squared deviations from the mean
divided by the number of observations, i.e. the square root of the variance. Symbolically:

SD = √[ Σ(xᵢ − x̄)² / n ]

If the data points are farther from the mean, there is higher deviation within the data set; thus, the
more spread out the data, the higher the standard deviation.
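
A minimal sketch of the formula above, computing the SD by hand and checking it against NumPy (the data set is made up):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = data.mean()

# Square root of the mean squared deviation from the mean (population SD).
sd_manual = np.sqrt(((data - mean) ** 2).sum() / len(data))
print(sd_manual)     # 2.0
print(np.std(data))  # NumPy's population SD agrees
```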

In finance, standard deviation is a statistical measurement; when applied to the annual rate
of return of an investment, it sheds light on the historical volatility of that investment. The
greater the standard deviation of a security, the greater the variance between each price and
the mean, which shows a larger price range. For example, a volatile stock has a high standard
deviation, while the deviation of a stable blue-chip stock is usually rather low.

1. Investment: Measure of the variability (volatility) of a security, derived from the security's historical
returns, and used in determining the range of possible future returns. The higher the standard deviation,
the greater the potential for volatility.
2. Marketing: Measure of the difference between an average (arithmetic mean) and the individual
values included in the average, such as the variation between the response to the same advertisement
in different media vehicles.
3. Statistics: Measure of the unpredictability of a random variable, expressed as the average deviation of
a set of data from its arithmetic mean and computed as the positive square root of the variance.

MEAN DEVIATION

Mean deviation is a measure that removes a shortcoming of other measures such as the range: it does
not ignore extreme terms or values, which play a significant role in the average or mean.

According to some economists, mean deviation is very useful for forecasting business
cycles.

The mean deviation is the average of the absolute differences (differences expressed without plus or
minus sign) between each value in a set of values and the average of all values of that set. For example,
the average (arithmetic mean) of the set of values 1, 2, 3, 4, and 5 is (15 ÷ 5) = 3. The differences
between this average (3) and the values in the set are 2, 1, 0, −1, and −2; the absolute differences
are 2, 1, 0, 1, and 2. The average of these numbers (6 ÷ 5) is 1.2, which is the mean deviation.
Also called the mean absolute deviation, it is used as a measure of dispersion where the number of
values or quantities is small; otherwise, the standard deviation is used.
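
A minimal sketch reproducing the worked example above in Python:

```python
values = [1, 2, 3, 4, 5]
mean = sum(values) / len(values)  # 15 / 5 = 3.0

# Average of the absolute differences from the mean: (2+1+0+1+2) / 5.
mad = sum(abs(v - mean) for v in values) / len(values)
print(mad)  # 1.2
```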

MEASURES OF RELATIONSHIP

Measures of relationship are statistical measures which show a relationship between two or
more variables or two or more sets of data. For example, there is generally a high relationship, or
correlation, between parents' education and academic achievement. On the other hand, there is generally
no relationship between a person's height and academic achievement. The major statistical measure of
relationship is the correlation coefficient.

Correlation Coefficient

1. Correlation is the relationship between two or more variables or sets of data. It is expressed in the
form of a coefficient, with +1.00 indicating a perfect positive correlation, −1.00 indicating a perfect
inverse correlation, and 0.00 indicating a complete lack of relationship.

Note: A simplified method of determining the magnitude of a correlation is as follows:

• .00 – .20 Negligible
• .20 – .40 Low
• .40 – .60 Moderate
• .60 – .80 Substantial
• .80 – 1.00 High

2. Pearson's Product Moment Coefficient (r) is the most often used and most precise coefficient;
it is generally used with continuous variables.

3. Spearman Rank Order Coefficient (ρ) is a form of Pearson's Product Moment Coefficient
which can be used with ordinal or ranked data.

4. Phi Correlation Coefficient is a form of Pearson's Product Moment Coefficient which can be
used with dichotomous variables (e.g., pass/fail, male/female).
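
A minimal sketch computing Pearson's r and Spearman's ρ with SciPy, on invented study-hours versus exam-score data:

```python
from scipy import stats

hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [52, 55, 61, 60, 68, 70, 75, 80]

r, p_r = stats.pearsonr(hours, score)       # linear relationship, continuous data
rho, p_rho = stats.spearmanr(hours, score)  # rank-based relationship
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```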

Linear Regression and Multiple Regression

1. Linear regression uses correlation to plot a line illustrating the linear
relationship between two variables X and Y. The line is represented by
the formula: Y = a + bX, where
* Y = dependent variable
* X = independent variable
* b = slope of the line
* a = constant, or Y intercept

Regression is used extensively in making predictions, finding unknown Y values from
known X values. For example, the linear regression formula for predicting college GPA from
known high school grade point averages would be displayed as follows:

College GPA = a + b(High School GPA)

2. Multiple Regression is the same as linear regression except that it attempts to predict Y from two or
more independent X variables. The formula for multiple regression is an extension of the linear
regression formula: Y = a + b1X1 + b2X2 + ....

Multiple regression is likewise used extensively in making predictions from known X values. For
example, the multiple regression formula for predicting college GPA from known high school grade
point averages and SAT scores would be displayed as follows:

College GPA = a + b1(High School GPA) + b2(SAT Score)
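
A minimal sketch of both regression fits using NumPy least squares; all GPA and SAT numbers below are invented purely for illustration:

```python
import numpy as np

hs_gpa = np.array([2.8, 3.0, 3.2, 3.5, 3.7, 3.9])
sat = np.array([1050, 1100, 1180, 1250, 1320, 1400])
college_gpa = np.array([2.6, 2.9, 3.0, 3.3, 3.5, 3.8])

# Linear regression: College GPA = a + b * (High School GPA).
b, a = np.polyfit(hs_gpa, college_gpa, 1)  # returns slope first, then intercept
print(f"College GPA = {a:.2f} + {b:.2f} * HS GPA")

# Multiple regression: College GPA = a + b1 * HS GPA + b2 * SAT.
X = np.column_stack([np.ones_like(hs_gpa), hs_gpa, sat])
a, b1, b2 = np.linalg.lstsq(X, college_gpa, rcond=None)[0]
print(f"College GPA = {a:.2f} + {b1:.2f} * HS GPA + {b2:.4f} * SAT")
```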

Discriminant Analysis

Discriminant analysis is analogous to multiple regression, except that the criterion variable consists of two
categories rather than a continuous range of values.

Factor Analysis

Factor analysis is often used when a large number of correlations have been explored in a given study; it
is a means of grouping into clusters, or factors, variables that are moderately to highly correlated
with each other.

Examples:
Judging from a scatterplot, there may be a positive correlation between verbal SAT score and GPA.
For used cars, there is a negative correlation between the age of the car and the selling price.

Suppose you were to make a scatterplot of (adult) sons’ heights versus fathers’ heights by
collecting data on both from several of your male friends. You would now like to predict how tall
your nephew will be when he grows up, based on his father’s height.
Correlation: measures the strength of a certain type of relationship between two
measurement variables.

Regression: gives a numerical method for trying to predict one measurement variable from
another.
