
STATISTICAL METHODS NOTES

DEFINITION OF CONCEPTS
STATISTICS
Today, statistics or more specifically statistical method is used extensively in almost all phases of human
endeavor. In ancient times, it dealt with the affairs of the state, like collection of information (or data)
regarding population and property or wealth of the state so as to sub-serve political purposes of rulers.
Today, its influence has spread to various areas, such as agriculture, business, economics, medicine,
biology, education, electronics, sociology, psychology, political science, and many other branches of
science and technology.
The origin of the word statistics may be traced to the Latin word ‘status’, the Italian word ‘statista’ or
the German word ‘statistik’, each meaning a political state. As time progressed, the idea behind the word
statistics has undergone a phenomenal change, and the information it covers has been extended to
virtually every sphere of human activity. Statistics as statistical data forms the backbone
of many disciplines. By statistical data we mean numerical statements of facts, while statistical methods
deal with the principles and techniques used in collecting and analyzing such data.

What is statistics?

The word ‘Statistics’ was first introduced by a German Scholar, Gottfried Achenwall, in the middle of the
18th century. From the very name, it is felt that it must be related to the administrative functioning of
state supplying facts regularly and quantitatively regarding its various fields of administration. Today,
statistics as a separate discipline from mathematics is closely associated with almost all branches of
education and human endeavour which are mostly numerically representable. In modern times, it has
innumerable and varied applications, both qualitative and quantitative.

To be more precise, statistics refers to classified facts that represent the conditions of the people in a state
especially those facts which can be stated in numbers or in any tabular or classified arrangements. One of
the widely accepted definitions came from Horace Secrist. He defines statistics as, “By statistics we
mean aggregates of facts, affected to a marked extent by a multiplicity of causes, numerically
expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a
systematic manner for a predetermined purpose and placed in relation to each other.”

This definition reveals the following characteristics of statistics:


(i) Numerical data should be an aggregate of facts. Single, isolated facts or unrelated figures do not
constitute statistics.
(ii) Data should be affected to a marked extent by a multiplicity of causes. For instance, data on malnutrition
are affected not only by poverty but also by a host of other factors such as hygienic conditions, etc.
(iii) Data should be numerically expressed.
(iv) Data should be enumerated or estimated with accuracy as far as practicable.
(v) Data should be collected in a systematic manner.
(vi) Data should be collected to serve a predetermined purpose.
(vii) Data should be placed in order in relation to each other.
Croxton and Cowden defined statistics as “the science which deals with the collection,
analysis and interpretation of numerical data”.
FUNCTIONS OF STATISTICS
1. It forms a bridge between a question and conclusion.
2. Statistics helps the researcher to simplify or summarize data so that they can be easily examined. It
enables the researcher to summarize and present a large quantity of information in a form that
facilitates its communication and interpretation.
3. Statistics is used to evaluate data in order to determine its adequacy for the study that the
researcher wants to undertake.
4. Statistics presents facts and figures in a definite form. That makes a statement more logical and
convincing than mere description. It condenses a whole mass of figures into a single figure, which
makes the problem intelligible.
5. Statistics simplifies the complexity of data. The raw data are unintelligible. We make them simple
and intelligible by using different statistical measures. Some such commonly used measures are
graphs, averages, dispersion, skewness, kurtosis, correlation and regression. These measures
help in interpretation and in drawing inferences. Therefore, statistics enables one to enlarge the horizon of
one's knowledge.
6. Comparison between different sets of observation is an important function of statistics.
Comparison is necessary to draw conclusions; as Professor Boddington rightly points out, “the
object of statistics is to enable comparison between past and present results so as to ascertain the reasons
for changes which have taken place and the effect of such changes in the future.” So, to determine the
efficiency of any measure, comparison is necessary. Statistical devices like averages, ratios,
coefficients, etc. are used for the purpose of comparison.
7. Formulating and testing of hypothesis is an important function of statistics. This helps in
developing new theories. So statistics examines the truth and helps in innovating new ideas.
8. Statistics helps in formulating plans and policies in different fields. Statistical analysis of data
forms the beginning of policy formulations. Hence, statistics is essential for planners, economists,
scientists and administrators to prepare different plans and programmes.
9. The future is uncertain. Statistics helps in forecasting the trend and tendencies. Statistical
techniques are used for predicting the future values of a variable. For example a producer forecasts
his future production on the basis of the present demand conditions and his past experiences.
Similarly, the planners can forecast the future population etc. considering the present population
trends.
10. Statistical methods mainly aim at deriving inferences from an enquiry. Statistical techniques are
often used by scholars, planners and scientists to evaluate different projects. These techniques are
also used to draw inferences regarding population parameters on the basis of sample information.

MAIN LIMITATIONS OF STATISTICS

1. Qualitative Aspect Ignored:


Statistical methods do not study the nature of phenomena which cannot be expressed in quantitative
terms. Such phenomena cannot be a part of the study of statistics. These include health, riches,
intelligence, etc., unless the qualitative data are first converted into quantitative data.
For this reason experiments are undertaken to measure people's reactions through data. Nowadays statistics
is used in all aspects of life as well as in universal activities.
2. It does not deal with individual items:
It is clear from the definition given by Prof. Horace Secrist, “By statistics we mean aggregates of facts….
and placed in relation to each other”, that statistics deals only with aggregates of facts or items and does
not recognize any individual item. Thus, individual items such as the death of 6 persons in an accident, or an 85%
pass rate of a class of a school in a particular year, do not amount to statistics, as they are not placed in a
group of similar items. Statistics does not deal with individual items, however important they may be.

3. It does not depict the entire story of a phenomenon:


Whenever a phenomenon happens, it is due to many causes, but not all of these causes can be expressed in
terms of data, so we cannot reach fully correct conclusions. The development of a group depends upon many
social factors like the parents' economic condition, education, culture, region, administration by government,
etc. But all these factors cannot be placed in data, so we analyze only the data we find quantitatively and
not qualitatively. Results or conclusions are therefore not 100% correct because many aspects are ignored.

4. It is liable to be misused:
As W.I. King points out, “One of the short-comings of statistics is that they do not bear on their face the label
of their quality.” We can check the data and the procedures by which conclusions are reached, but the data
may have been collected by inexperienced persons, or the collectors may have been dishonest or biased.
Since it is a delicate science, statistics can easily be misused by an unscrupulous person, so data must be used
with caution; otherwise results may prove to be disastrous.

5. Laws are not exact:


The two fundamental laws of statistics:
(i) the law of inertia of large numbers and
(ii) the law of statistical regularity,
are not as exact as the laws of the physical sciences. They are based on probability, so their results will not
always be as exact as those of scientific laws. On the basis of probability or interpolation we can only estimate,
for example, the production of paddy in 2008; we cannot claim the estimate to be exactly correct. Only
approximations are made.

6. Results are true only on average:


As discussed above, results here are interpolated, for which time series, regression or probability can
be used, and these are not absolutely true. If the average marks of two sections of students in statistics are
the same, it does not mean that all the 50 students in section A have got the same marks as those in section B;
there may be much variation between the two. So we get only average results.
“Statistics largely deals with averages and these averages may be made up of individual items radically
different from each other.” —W.I. King

7. Too Many methods to study problems:


In this subject we use many methods to find a single result. Variation can be found by quartile
deviation, mean deviation or standard deviation, and the results vary in each case.
“It must not be assumed that statistics is the only method to use in research, neither should this
method be considered the best attack for the problem.” —Croxton and Cowden

8. Statistical results are not always beyond doubt:


“Statistics deals only with measurable aspects of things and therefore can seldom give the complete
solution to a problem. They provide a basis for judgment but not the whole judgment.” —Prof. L.R. Connor
Although we use many laws and formulae in statistics, the results achieved are not final and
conclusive. As they are unable to give a complete solution to a problem, the results must be taken and used
with much wisdom.
Misuses of Statistics (Distrust of Statistics)
The improper use of statistical tools by unscrupulous people with an improper statistical bent of mind has
led to public distrust in statistics. By this we mean that the public loses its belief, faith and confidence in
the science of statistics and starts condemning it. Such irresponsible, inexperienced and dishonest persons
who use statistical data and statistical techniques to fulfil their selfish motives have discredited the
science of statistics with some very interesting comments, some of which are stated below:
(i) An ounce of truth will produce tons of statistics
(ii) Statistics can prove anything
(iii) Figures do not lie. Liars figure
(iv) Statistics is an unreliable science
(v) There are three types of lies – lies, damned lies and statistics – wicked in the order of their naming;
and so on.
Some of the reasons for the above remarks may be enumerated as follows:
a) Arguments are put forward to establish certain results which are not true by making use of inaccurate
figures or by using incomplete data, thus distorting the truth.
b) Though accurate, the figures might be moulded and manipulated by dishonest and unscrupulous
persons to conceal the truth and present a wrong and distorted picture of the facts to the public for
personal and selfish motives.
Hence, if statistics and its tools are misused, the fault does not lie with the science of statistics; rather, it is
the people who misuse it who are to be blamed.

Utmost care and precautions should be taken for the interpretation of statistical data in all its
manifestations. Statistics should not be used as a blind man uses a lamp-post for support instead of
illumination. However, there are misapprehensions about the argument that statistics can be used
effectively by expert statisticians, as is given in the following remark due to Wallis and Roberts: ‘He who
accepts statistics indiscriminately will often be duped unnecessarily. But he who distrusts statistics
indiscriminately will often be ignorant unnecessarily’. There is an accessible alternative between blind
gullibility and blind distrust. It is possible to interpret statistics skillfully. The art of interpretation need
not be monopolized by statisticians, though, of course, technical statistical knowledge helps. Many
important ideas of technical statistics can be conveyed to the non-statistician without distortion or
dilution. Statistical interpretation depends not only on statistical ideas but also on ordinary clear thinking.
Clear thinking is not only indispensable in interpreting statistics but is often sufficient even in the absence
of specific statistical knowledge. For the statistician not only death and taxes but also statistical fallacies
are unavoidable. With skill, common sense, patience and above all objectivity, their frequency can be
reduced and their effects minimized.

Below are some illustrations regarding the misinterpretation of statistical data.


1. ‘The number of car accidents committed in a city in a particular year by women drivers is 10, while
those committed by men drivers is 40. Hence, women are safer drivers.’ This statement is obviously
wrong since nothing is said about the total number of men and women drivers in the city in the given
year. Some valid conclusions can be drawn only if we are given the proportion of accidents committed
by male and female drivers.
2. ‘80% of the people who drink alcohol die before attaining the age of 70 years. Hence, drinking is
harmful for longevity of life.’ This statement is also fallacious since no information is given about the
number of persons who do not drink alcohol and yet die before attaining the age of 70 years. In the
absence of information about the proportion of such persons we cannot draw any valid
conclusions.
3. A report: ‘The number of traffic accidents is lower in foggy weather than on clear weather days.
Hence, it is safer to drive in fog.’ The statement again is obviously wrong. To arrive at any valid
conclusion, we must take into account the difference between the volume of traffic under the two
weather conditions and also the extra cautiousness observed when driving in bad weather.
4. Incomplete data usually leads us to fallacious conclusions. Let us consider the scores of two students
John and Peter in three tests during a year.
                 1st test   2nd test   3rd test   Average score
John's score        50%        60%        70%          60%
Peter's score       70%        60%        50%          60%
If we are given only the average score, which is 60% in each case, we will conclude that the level of
intelligence of the two students at the end of the year is the same. But this conclusion is false and
misleading since a careful study of the detailed marks over the three tests reveals that John has improved
consistently while Peter has deteriorated consistently.

Conclusion
Numerous such examples can be constructed to illustrate the misuse of statistical methods, and this is all
due to their injudicious applications and interpretations, for which the science of statistics cannot be
blamed. Many people disbelieve statistics because it does not prove a particular thing in a particular manner.

It should be clearly understood that statistics does not prove anything. Statistics is only a method of
approach; it is a tool in the hands of a statistician to present a phenomenon in a particular manner and
nothing beyond. The science of statistics doesn’t prove or disprove a thing; it merely presents the true
facts about a problem and leaves the rest to other people. Different types of conclusions can be arrived at
from the same set of figures if there is a difference in the approach of various persons. From one set of
figures, a socialist can prove that a country has eliminated unemployment and improved the lot of the
working class, and from the same set of figures an anti-socialist can derive an opposite conclusion. This
fundamental difference in approach, or bias in the minds of the investigators, has been responsible for
different conclusions being drawn from the same set of figures. For this, the science of statistics cannot
be blamed. It is not the fault of the science; it is the mischief of those who use it.

DATA COLLECTION, ORGANISATION AND PRESENTATION


Introduction
Statistics are a set of numerical data. In fact, only numerical data constitute statistics. This means that the
phenomenon under study must be capable of quantitative measurement. Thus, the raw material of
statistics always originates from the operation of counting (enumeration) or measurement. For any
statistical enquiry, whether it is in business, economics or natural science, the basic problem is to collect
facts and figures relating to particular phenomenon under study. The person who conducts the statistical
enquiry i.e. counts or measures the characteristics under study for further statistical analysis is known as
investigator. The persons from whom the information is collected are known as respondents and the
items on which the measurements are taken are called the statistical units. The process of counting or
enumeration or measurement together with the systematic recording of results is called the collection of
statistical data. The entire structure of the statistical analysis for any enquiry is based upon systematic
collection of data. Information on any field, when expressed qualitatively and/or quantitatively, is called
data and they are usually classified into two main categories—primary and secondary data, depending on
their origin or source.
Type of data
Primary Data
Definition: This is the name given to data that are used for the specific purpose for which they were
collected. In other words, information collected directly from the desired field for a
predetermined purpose is called primary data, e.g., heights and weights of the students in a class.

It can thus be used with much confidence because of its direct nature of collection. The primary data are
usually published by the authorities who are directly responsible for its collection. However, it requires
enough manpower, time and money to make the process successful. This data is collected in a natural
setting, e.g. experiments, or directly from the field. It is collected through observation or through
direct communication with respondents by mail, telephone or personal interviews.

Secondary Data
This is a type of data that is used for a purpose other than that for which it was originally
collected. Data collected by other investigators and institutions for research purposes that differ from
the original reason for collection are included in this category. The most important and remarkable
advantage here is that it requires less manpower and time and as a result less costly to complete the entire
procedure. But, in practice, it frequently contains a number of errors due to erroneous transcription, faulty
rounding-off (up or down), etc., and, therefore, less dependable in nature for the researchers.

The investigators and the scholars working with them should therefore be much careful while using them
in their own fields. It is, therefore, quite clear that the data which are initially primary in nature at the
origin for one use becomes secondary in qualities and character for other uses.

Therefore, the distinction between primary and secondary data is one of degree only. A particular data
may be primary in the hands of data collecting authority but may be secondary to other people using them
afterwards. Prof. H. Secrist in this context says: “The distinction between primary and secondary data
is largely one of degree. Data which are secondary in the hands of one party may be primary in the
hands of another.”

However, the method of collection of primary data and secondary data must not be identical in nature
because, in the former case data are collected originally while, in the latter case, data are to be taken up in
the nature of compilation.

There exist various methods for the collection of the primary and secondary data. Choice of the exact
method depends largely on the nature, object and scope of statistical investigation.

Archival records form an example of secondary data. Archival records include public records,
judicial records, the mass media and personal documents such as letters, autobiographies and paintings; all
of these constitute what we call ARCHIVAL RECORDS. The methods used in deriving secondary data are
observation and content analysis (document analysis).

Advantages of Primary Data:


It is usually preferable to use primary data because of the following reasons:

1. First, primary data generally contain a detailed description and definitions of the terms used.
2. Secondly, since secondary data are second-hand data or ‘finished products’, an element of error may
creep in afterwards, which may give misleading information. Primary data cannot have such
errors.

3. Thirdly, in the primary data precise definition of the terms used is given and the scope of the data is
clearly mentioned.

4. Finally, reports of primary data often include the method or procedure followed and any approximations
used, so that one can find their limitations. On the contrary, secondary data usually lack such information.

Despite these advantages of primary data, secondary data are extensively used, particularly when a large
number of items are required. The drawbacks of secondary data seem to be of minor importance, especially
when the collection of primary data would be expensive and time-consuming. Secondary data invariably
convey less of the meaning behind the statistics and frequently present no explanation other than the captions
and footnotes in the tables. In fact, some information is usually suppressed in secondary data.

Advantages of Secondary Data:


Practically, there are various advantages in using secondary data.

1. First, cost of collection of data is less. That is why data produced by the Government, companies and
various organizations are readily available.

2. Secondly, one obtains a great variety of data on a wide range of subjects.

3. Thirdly, much of the secondary data available has been collected for many years and, therefore, it can
be used to study the trends.

4. Fourthly and most importantly, secondary data is of great value to the government, business world and
industry and also for research organizations. Again, secondary data help the government in making
present policy decisions and also planning for future economic policies.

Limitations of Secondary Data:


In spite of all these advantages of secondary data, one must use it with enough caution.

1. First, the method employed for the collection of such data is often unsatisfactory. That is why,
secondary data in most cases, is subject to transcribing errors (i.e., errors occurring due to wrong
transcriptions of the primary data).

2. Secondly, secondary data are often merely estimates and not facts.

In view of this, secondary data suffer from estimating errors.

3. Thirdly, scrutiny of secondary data is obviously essential since errors may creep in due to unwanted
bias. This is because fictitious figures are often recorded unknowingly in secondary data.
In this sense, secondary data can be not only inaccurate but also incomplete and inadequate. Without detailed
scrutiny of secondary data one is not advised to use them.
Precautions for Using Secondary Data:

Since secondary data is second-hand data, one must know as much about it as possible. In other words,
certain precautions are to be taken before using secondary data so that the data become really helpful to
the government or the researcher.

Secondary data users must consider the following points:

(i) The scope and object of enquiry for which the data were originally collected;

(ii) The proper methods of collecting such data;

(iii) The period (normal or unusual time) and the area covered for collection;

(iv) The reliability, integrity and dependability of the data collectors engaged;

(v) Precise definitions of the terms used and their units of measurement considered while the data were
collected;

(vi) Interpreting the data, especially when figures collected for one purpose are used for other fields.

Thus, one can say that secondary data should first be reliable; there are various ways of testing such
reliability. Secondly, secondary data should be used in such a manner that they can suitably
serve the purpose of the investigation and experimentation. Finally, in addition to reliability and suitability,
the data should be adequate and as accurate as practicable.

Methods of Primary Data Collection


These are the means by which information is obtained from the selected subjects of an investigation. There are
various methods of collecting primary data. These include interviews, questionnaires and observations.
Interview guide
Advantages of Interviews.
-In-depth information – interviews can obtain depth of information that is not possible with questionnaires.
-Clarification – the interviewer can clarify questions, helping the respondent to give relevant responses.
-Flexibility – interviews are more flexible than questionnaires because the interviewer can adapt to the
situation and get as much information as possible.
-Personal information – can be extracted from the respondent through honest and personal interaction
between the respondent and the interviewer.
-Purpose – the interviewer can explain the purpose of the research and effectively convince the respondent
about the importance of the research.
-Probing – unlike questionnaires, interviews can get more information by using probing questions.
-Return rate – interviews generally achieve a higher response (return) rate than questionnaires.

Disadvantage of Interviews
-Expensive – researchers have to travel to meet respondents.
-Skills – it requires a high level of skill, i.e. communication and interpretation skills.
-Bias – interviewers need to be trained to avoid bias.
-Time consuming – because interviews take time, they usually involve smaller samples; this becomes a
constraint if a researcher is interested in using a big sample.
-Influence on respondents – responses may be influenced by the respondent's reaction to the
interviewer.
Observational guide
Observation methods
It is mostly used in studies relating to consumer behavior.

Advantages
- The information is obtained by the investigator's own observation without asking
the respondent.
- Subjective bias is eliminated if the observation is done accurately.
- The information obtained relates to what is currently happening; it is not complicated
by either past behaviour or future intentions or attitudes.
Disadvantages

- The method is expensive because of the time and resources involved.
- The method gives very limited information.

Questionnaires
Advantages
1) Low cost even when the universe is large and widely spread geographically.
2) Free from the bias of the interviewer.
3) Respondents have time to give well thought out answers.
4) Respondents who are not easily approachable can also be reached conveniently.
5) Large samples can be made use of and thus the results made more dependable and
reliable.

Disadvantages
1) Low rate of return of duly filled questionnaires.
2) Can only be used when respondents are educated and cooperative.
3) Inbuilt inflexibility, because there is always difficulty in amending the approach once
questionnaires have been dispatched.
4) Possibility of ambiguous replies, or omission of replies altogether, to certain questions.
5) Difficulty of knowing whether willing respondents are truly representative.
6) The problem of constantly updating the mailing list.

Essentials of a Good Questionnaire


i) Its size should be kept to a minimum.
ii) There should be some control questions which indicate the reliability of the
respondents.
iii) Adequate space for answers should be provided in the questionnaire for ease
of editing and elaboration.
iv) Brief directions with regard to filling in the questionnaire should invariably be
given in the questionnaire itself.
v) Personal and intimate questions should be left to the end.
vi) Questions should proceed from easy ones to difficult ones.
vii) Technical terms and vague expressions capable of different interpretations
should be avoided.

Classification and Tabulation of Raw Data


After collecting raw data, which are so voluminous that they are unwieldy and
incomprehensible, we need to edit the data. Afterwards, we need to organize it, i.e. present it in
a readily comprehensible, condensed form which highlights the important characteristics of
the data. This facilitates comparisons and renders the data suitable for further processing
(statistical analysis) and interpretation.

The presentation of the data is broadly classified into the following categories:
(i) Tabular presentation
(ii) Diagrammatic or graphic presentation

A statistical table is an orderly and logical arrangement of data into rows and columns and it
attempts to present the voluminous and heterogeneous data in a condensed and homogeneous
form. But before tabulating the data, generally, systematic arrangement of the raw data into
different homogeneous classes is necessary to sort out the relevant and significant features
(details) from the irrelevant and insignificant ones.

This process of arranging the data into groups or classes according to resemblances and
similarities is technically called classification. Thus, classification of the data is preliminary to
its tabulation. It is thus the first step in tabulation, because items with similarities must be
brought together before the data are presented in the form of a ‘table’.

Some definitions of classification


Secrist ‘classification is the process of arranging data into sequences and groups according to
their common characteristics, or separating them into different but related parts’.
Tuttle A M ‘A classification is a scheme for breaking a category into a set of parts, called
classes, according to some precisely defined differing characteristics possessed by all the
elements of the category’.

Functions of classification
(i) It condenses the data. Classification presents the huge, unwieldy raw data in a
condensed form which is readily comprehensible to the mind and attempts to highlight
the significant features contained in the data.
(ii) It facilitates comparisons. Classification enables us to make meaningful comparisons
depending on the basis or criterion of classification. For instance, the classification of
students in a college according to sex enables us to make a comparative study of the
prevalence of college education among males and females.
(iii) It helps to study the relationship. The classification of the given data w.r.t two or more
criteria, say, the sex of the students and the faculty they join in a university will enable us
to study the relationship between these two criteria.
(iv) It facilitates the statistical treatment of the data. The arrangement of the voluminous
heterogeneous data into relatively homogeneous groups or classes according to their
points of similarities introduces homogeneity or uniformity amidst diversity and makes it
more intelligible, useful and readily amenable for further processing like tabulation,
analysis and interpretation of the data.

Rules for classification


No hard and fast rules can be laid down for classification but the following general guiding
principles may be observed for good classification.
(i) It should be un-ambiguous, i.e. the classes should be rigidly defined. In other words,
there should not be any room for doubt or confusion regarding the placement of the
observations in the given classes. For example, if we have to classify a group of
individuals as ‘employed’ and ‘unemployed’, it is imperative to define in clear cut terms
as to what we mean by an employed person and un-employed person.
(ii) It should be exhaustive and mutually exclusive. The classification must be exhaustive
in the sense that each and every item in the data must belong to one of the classes.
Further, the various classes should be mutually disjoint or non-overlapping so that an
observed value belongs to one and only one of the classes. For instance, if we classify
the students in a college by sex, i.e. as males and females, the two classes are mutually
exclusive. But if the same group is classified as males, females and addicts to a particular
drug then the classification is faulty because the group ‘addicts to a particular drug’
includes both males and females. However, in such a case, a proper classification will be
w.r.t two criteria i.e., w.r.t sex (males and females) and further dividing the students in
each of these two classes into ‘addicts’ and ‘non-addicts’ to the given drug.
(iii) It should be suitable for the purpose. The classification must be in keeping with the
objectives of the enquiry. For instance, if we want to study the relationship between
university education and sex, it will be futile to classify the students w.r.t age and
religion.
(iv) It should be stable.

Bases of classification
The bases or the criteria w.r.t which the data are classified primarily depend on the objectives
and purpose of the enquiry. Generally, the data can be classified on the following four bases:
(i) Geographical, i.e. Area-wise or regional: For example, data can be classified in terms
of the yield of agricultural output per hectare for different countries or regions in some
given period.
(ii) Chronological classification: here the data are classified on the basis of differences in
time, e.g., the production of an industrial concern for different periods. The profits of a
big business house over different years; the population of any country for different years
etc. The time series data, which are quite frequent in Economic and Business Statistics,
are generally classified chronologically, usually starting with the first period of
occurrence.
(iii) Qualitative classification: When the data are classified according to some qualitative
phenomena which are not capable of quantitative measurement like honesty, beauty,
employment, intelligence, sex, etc., the classification is termed as qualitative or
descriptive or w.r.t attributes. In qualitative classification the data are classified
according to the presence or absence of the attributes in the given units.
(iv) Quantitative classification: where the data are classified on the basis of phenomenon
which is capable of quantitative measurement like age, height, weight, prices, production,
income, expenditure, sales, profits etc. The quantitative phenomenon under study is
known as variable and hence this classification is also sometimes called classification by
variables.

In order to present and analyze data in a logical and meaningful way, it is necessary to understand some of
the natural forms that they take.
There are various ways of classifying data, as follows:
Preciseness
Data can either be measured precisely (described as discrete) or by approximation (described as
continuous).

Discrete Data
Discrete data can be obtained by counting, e.g. the number of students taking HRM or DBM in the diploma class.
It can also be obtained in situations where counting is not involved, e.g. the shoe sizes of a sample of
people.
The characteristic of discrete data is that its values progress in definite steps, e.g. 1, 2, 3, 4, etc.

Continuous Data
This data cannot be measured precisely; its values can only be approximated, e.g. length, weight,
temperature, time, etc.
How well continuous data are approximated depends on the situation and the quality of measuring
instruments.

Frequency Distributions
This is concerned with organization and presentation of numerical data.

Raw statistical data.


Before the data obtained from a statistical survey or investigation have been worked on, they are known
as raw data.
Raw data are organized in a haphazard way without any pattern, e.g.
60 50 20 46
51 45 31 80
20 90 60 50
Data arrays.
Raw data normally yield little information. One simple way of extracting information from raw data is
by arranging the values in order of size, and the result is what is known as a DATA ARRAY, e.g.
90 80 60 60
51 50 50 46
45 31 20 20

Raw statistical data


60 55 70 80
50 66 71 91
70 60 88 66
50 55 71 80
Data arrays
91 88 80 80
71 71 70 70
66 66 60 60
55 55 50 50
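
As a small illustration of forming a data array, the second set of raw data above can be put in order with a
few lines of code (Python is assumed here purely for illustration; a spreadsheet sort does the same job):

```python
# A minimal sketch: arranging raw data into a data array (descending order).
raw_data = [60, 55, 70, 80,
            50, 66, 71, 91,
            70, 60, 88, 66,
            50, 55, 71, 80]

# A data array is simply the raw values arranged in order of size.
data_array = sorted(raw_data, reverse=True)
print(data_array)
# [91, 88, 80, 80, 71, 71, 70, 70, 66, 66, 60, 60, 55, 55, 50, 50]
```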

Types of Frequency Distribution (F.D)


a) Simple frequency distribution.
This consists of data values, each showing the number of items having that value (the frequency).
There are two types of simple frequency distribution.
Univariate frequency distribution
This is a frequency distribution of a single variable.
Bivariate frequency distribution
This consists of two variables considered together.
Example of univariate

Marks of diploma students

Marks (X)    Frequency
   91            1
   88            1
   80            2
   71            2
   70            2
   66            2
   60            2
   55            2
   50            2

Total           16

Bivariate example

Marks in two subjects (QT = X, Remarks = Y):

Student    Marks X    Marks Y
Jane          70         60
John          80         55
June          70         80

Construction of a simple Frequency Distribution using a tally chart

A company dealing with the importation of shoes recorded the following shoe sizes:
5 6 9 8 5 6 4
4 4 9 9 7 7 6
6 5 4 4 4 4 5
6 9 8 4 5 6 10
10 4 5 7 8 8 7
5 5 6 6 7 8 10
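
A minimal sketch (Python assumed) of the tallying step: it counts how many times each shoe size occurs in
the 42 observations above and prints the simple frequency distribution together with its tally marks.

```python
from collections import Counter

# Shoe sizes recorded by the company (transcribed from the table above).
shoe_sizes = [5, 6, 9, 8, 5, 6, 4,
              4, 4, 9, 9, 7, 7, 6,
              6, 5, 4, 4, 4, 4, 5,
              6, 9, 8, 4, 5, 6, 10,
              10, 4, 5, 7, 8, 8, 7,
              5, 5, 6, 6, 7, 8, 10]

# Counter does the tallying: each distinct value with its frequency.
frequency = Counter(shoe_sizes)
for size in sorted(frequency):
    tally = "I" * frequency[size]
    print(f"size {size:2d}: {tally:<10} frequency {frequency[size]}")
print("total:", sum(frequency.values()))
```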

Construction of grouped frequency distribution


The main aim of a frequency distribution is to summarize numerical data in a logical manner that
enables an overall perspective of the data to be obtained quickly and easily. When the number of distinct
data values in a set of raw data is large, say more than 2000 items, a simple frequency distribution is not
appropriate, since there will be too much information which cannot be easily assimilated.
In this type of situation a grouped frequency distribution is used.

A grouped frequency distribution organizes data items into groups (classes) of values, each group showing how
many items have values within it (known as the class frequency).

N/B: However, once items have been grouped in this way, their individual values are lost.

Examples of Grouped Frequency Distribution.


The distribution of the length of life of a sample of bad debts of a company

No. Of Days (x) No. of bad debts


1-5 44
6-10 50
11-15 42
16-20 27
21-25 16

Example 2
The value of properties handled by a property dealer over a 6-month period.

Value of properties in ( m ) No. Of properties


10 and less than 15 2
15 and less than 20 6
20 and less than 25 14
25 and less than 30 21
30 and less than 35 33
35 and less than 40 19
40 and less than 45 5

Definition Associated With Frequency Distribution Classes.


Class limits.
These are the lower and upper values of the classes as physically described in the distribution.

Class boundaries.
These are the lower and upper values of a class that mark common points between adjacent classes.

Class width. (Length)


This is the numerical difference between the lower and upper boundaries of a class.

Class midpoint
These are situated at the centre of the classes. They are midway between the upper and lower boundaries or
limits.

Steps Involved In Formation of Grouped Frequency Distribution.


1) Calculate the range of values covered by the data. I.e. highest value minus the lowest value.
2) Divide the range obtained by 10 and adjust this value upwards to obtain a standard class width
for the distribution which is appropriate for the data concerned. Class widths of 5, 10, 20, etc. are best
since these span bands of values that form natural groups.
- The first class should contain the lowest value and the last class should contain the highest value.
3) The frequency distribution table can now be formed using a tally chart.
24 13 25 25 25 29 15 46
9 10 17 22 23 17 16 32
11 12 18 20 13 27 18 22
20 14 26 14 19 19 40 31
17 21 23 26 18 24 21 27
The data above describe the number of orders received by a company each week over a period of 40
weeks. Compile a grouped frequency distribution.

1) Step one: 46 - 9 = 37

2) Step two: 37 / 10 = 3.7, adjusted upwards to a class width of 5

Class Tally Frequency


5-9 I 1
10-14 IIII II 7
15-19 IIII IIII I 11
20-24 IIII IIII 10
25-29 IIII II 7
30-34 II 2
35-39 _ _
40-44 I 1
45-49 I 1
40
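
The three steps can also be carried out in code. The sketch below (Python assumed) uses the weekly-orders
data exactly as transcribed above, with the class width of 5 obtained by adjusting 3.7 upwards as in step two;
the tallies it prints follow the data as listed.

```python
# A minimal sketch of the three steps for forming a grouped frequency distribution.
orders = [24, 13, 25, 25, 25, 29, 15, 46,
          9, 10, 17, 22, 23, 17, 16, 32,
          11, 12, 18, 20, 13, 27, 18, 22,
          20, 14, 26, 14, 19, 19, 40, 31,
          17, 21, 23, 26, 18, 24, 21, 27]

# Step 1: the range covered by the data.
data_range = max(orders) - min(orders)      # 46 - 9 = 37

# Step 2: divide the range by 10 and adjust upwards to a convenient class width.
class_width = 5                             # 37 / 10 = 3.7, adjusted up to 5

# Step 3: tally each value into its class (5-9, 10-14, 15-19, ...).
lower = 5                                   # the first class must contain the lowest value (9)
while lower <= max(orders):
    upper = lower + class_width - 1
    freq = sum(lower <= x <= upper for x in orders)
    print(f"{lower:2d}-{upper:2d}: {'I' * freq:<12} {freq}")
    lower += class_width
```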

The following are marks extracted from the consolidated mark sheet for diploma students in QT
20 35 39 40 51
35 40 41 51 55
45 60 76 62 61
81 82 83 85 60
90 72 73 80 50
51 32 31 38 37
40 20 27 60 76
86 90 91 60 51
52 50 55 57 56
56 76 60 40 30
92 25 30 45 40

Quiz
Formulate a grouped frequency distribution in which the class 41-50 is among the classes.
ANSWER
Step 1: 92 - 20 = 72
Step 2: 72 / 10 = 7.2, adjusted upwards to a class width of 10

CLASS TALLY FREQUENCY


11-20 II 2
21-30 IIII 4
31-40 IIII IIII II 12
41-50 IIII 4
51-60 IIII IIII IIII 15
61-70 II 2
71-80 IIII II 7
81-90 IIII II 7
91-100 II 2
55

Cumulative Frequency Distribution.


Any frequency distribution can be adapted to form what is known as a cumulative frequency
distribution. Whereas an ordinary frequency distribution describes how many items lie within each
class of values, a cumulative frequency distribution describes the number of items that have values either
above or below a particular level.
Cumulative frequency distributions can be described in 2 forms.
Cumulative frequency distribution can be described in 2 forms

i) Less than CF
Here a set of item values is listed (normally the class upper limits), with each one showing the number of
items in the distribution having values up to and including that value.

Class             CF
Less than 9        1
Less than 14       8
Less than 19      19
Less than 24      29
Less than 29      36
Less than 34      38
Less than 39      38
Less than 44      39
Less than 49      40
ii) More than CF
Here a set of item values is listed (normally the class lower limits), with each one showing the number of
items in the distribution having values greater than that value.

Class             CF
More than 5       40
More than 10      39
More than 15      32
More than 20      21
More than 25      11
More than 30       4
More than 35       2
More than 40       2
More than 45       1
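
A minimal sketch (Python assumed) showing how both cumulative forms are built by running totals over the
grouped frequencies above: the "less than" figures accumulate class by class from the top, while the "more
than" figures are what remains above each lower limit.

```python
# Grouped frequency distribution from the table above.
classes = [(5, 9), (10, 14), (15, 19), (20, 24), (25, 29),
           (30, 34), (35, 39), (40, 44), (45, 49)]
freqs   = [1, 7, 11, 10, 7, 2, 0, 1, 1]

total = sum(freqs)                               # 40
running = 0
print("Upper limit  Less-than CF   Lower limit  More-than CF")
for (lower, upper), f in zip(classes, freqs):
    more_than = total - running                  # items with values above the lower limit
    running += f
    less_than = running                          # items with values up to the upper limit
    print(f"{upper:>11}  {less_than:>12}   {lower:>11}  {more_than:>12}")
```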
Data Presentation
Managers take decisions on the basis of available information or data. Raw data may be available to
managers in scattered form; unless data are properly collected, analyzed and presented, they may not be
useful for decision-making.
Raw data may not be easily comprehensible, so it becomes useful and important to classify and present
them in a meaningful manner.
There are various methods of classifying and presenting data. One of them is
Tabular presentation
It is used for summarizing and condensing data.
It also helps in the analysis of relationships, trends and the relative sizes of given data.

TABLES
No of respondents by gender (sex)
Respondent Frequency Percent
Male 18 60
Female 12 40
Total 30 100

Characteristics of a table
a) It must have a number
b) It must have a title
c) It must have captions (headings of the columns)
d) It must have stubs (headings of the rows)
Additional characteristics of a table
- The source of the data should be stated.
- Sometimes tables can reflect data with respect to other variables.

Types of tables
Two-way classification table
In this type of table, the data are set up in such a way that two different variables can be compared or contrasted.

                 Male                Female
Response    Frequency    %      Frequency    %
Yes             10      33.3        11      36.6
No               8      26.6         1       3.3
Total           18      60.0        12      40.0

Three way classification table.


In this table three or more variables can be contrasted.
Sale of products per zone:

              W               E               C
Products   1990   1991    1990   1991    1990   1991
P1           20     40      80     30      29     90
P2           30     50      20     30      40     50
P3          100     20      30     50      80     70
Total       150    110     130    110     149    210
B) Graphical presentation

i) Linear / Arithmetic charts.


This chart is used for identifying the changes and trends that have taken place during a particular period.
Figure 4.1: The sales volume of products Y1 and Y2

Sale volume
Year Y1 Y2
1986 10 9
1987 14 13
1988 12 11
1989 15 14
1990 20 19
1991 24 23
1992 23 22
1993 28 27

Plot the following values and draw a line graph, with the years on the x-axis and the volume of sales on
the y-axis.
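
A hedged sketch of the plotting exercise, assuming Python with matplotlib is available; any charting tool
would serve equally well.

```python
# A minimal sketch: line chart of the sales volumes of products Y1 and Y2.
import matplotlib.pyplot as plt

years = [1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993]
y1 = [10, 14, 12, 15, 20, 24, 23, 28]
y2 = [9, 13, 11, 14, 19, 23, 22, 27]

plt.plot(years, y1, marker="o", label="Y1")   # one line per product
plt.plot(years, y2, marker="s", label="Y2")
plt.xlabel("Years")
plt.ylabel("Volume of sales")
plt.title("Sales volume of products Y1 and Y2")
plt.legend()
plt.show()
```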
ii) Bar chart /graphs

In bar charts we make use of rectangles to present the given data.


A bar chart can be set up in different forms, i.e. horizontal, vertical or component form.
A bar chart distinguishes relative magnitudes more easily than the line chart, whose main objective is to
show trends.

Example
Suppose in a class of diploma students, 10 students were born in the month of November
30 students were born in the month of December
50 students were born in the month of January
70 students were born in the month of February
Required
Construct a bar chart (horizontal).

(Sketch of the bar chart: a frequency scale from 10 to 70 on one axis and the months Nov, Dec, Jan and Feb
on the other. Students to draw a composite bar chart as an exercise.)

Pie Chart
In a pie chart, different segments of a circle represent the percentage contributions of various components
to the total.
The pie chart is very useful because it clearly brings out the relative importance of the various components.
In drawing a pie chart, we construct a circle of any dimensions and this is broken down into various
segments.
An angle of 360° represents 100%, and the angle corresponding to each component can be found by
multiplying 360° by the component's share of the total.
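
A minimal sketch (Python assumed) of the angle calculation, reusing the birthday figures from the bar-chart
example above purely as illustrative component values.

```python
# Convert component values into percentages and pie-chart angles
# (an angle of 360 degrees represents 100% of the total).
components = {"Nov": 10, "Dec": 30, "Jan": 50, "Feb": 70}   # illustrative values only
total = sum(components.values())                            # 160

for name, value in components.items():
    percentage = value / total * 100
    angle = 360 * value / total        # angle of this segment in degrees
    print(f"{name}: {percentage:5.1f}%  ->  {angle:5.1f} degrees")
```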

Histogram & Frequency Polygon


 This is essentially a bar graph of a frequency distribution in which the frequency is presented in the form
of vertical bars.
 It is important to note that in a histogram the width of each bar stretches from the lower limit of the
class interval to the upper limit. Class boundaries may also be used.
 You can label score intervals by marking either the midpoints or the limits of the intervals (in
most cases midpoints are used).
A histogram is a series of rectangles, each proportional in width to the range of values within a class and
proportional in height to the number of items falling in the class, i.e. the frequency.
 If the class size is the same, the width of each rectangle will be the same.

Class        Frequency    Midpoint    Rf
5 – 9             1           7
10 – 14           7          12
15 – 19          11          17
20 – 24          10          22
25 – 29           7          27
30 – 34           2          32
Total            38

Frequency Polygon / histogram


It is constructed by plotting the class midpoints on the x-axis and the frequencies on the y-axis. After
plotting, the points are joined by straight lines and the resulting graph is called a frequency polygon.

Relative Frequency Distribution Graph

The relative frequency of a class is the frequency of the class divided by the total frequency of all
classes, and is generally expressed as a percentage.
Students to fill in the values and draw the relative frequency distribution graph.
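
A minimal sketch (Python assumed) that fills in the Rf column of the table above by dividing each class
frequency by the total frequency (38) and expressing the result as a percentage.

```python
# Relative frequencies (Rf) for the histogram table above.
freqs = {"5-9": 1, "10-14": 7, "15-19": 11, "20-24": 10, "25-29": 7, "30-34": 2}
total = sum(freqs.values())                       # 38

for cls, f in freqs.items():
    rf = f / total * 100                          # relative frequency as a percentage
    print(f"{cls:>6}: f = {f:2d}, Rf = {rf:4.1f}%")
```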

Cumulative Frequency Curve


 A finance manager may be interested in finding out the number of customers who have paid in
less than 30 days or who have taken more than 50 days to make payments.
 To answer such questions it is important to draw a cumulative frequency curve (ogive).
 A cumulative frequency curve enables us to see how many observations lie above or below a
certain value, rather than merely recording the number of observations within intervals.
 We have two types of ogive, depending upon whether we are interested in “more than”
information or “less than” information.

For the “more than” ogive, the class lower limits (5, 10, 15, …) are plotted on the x-axis and the
corresponding cumulative frequencies (40, 39, 32, 21, …) on the y-axis.

“More than” cumulative frequency ogive


Class limits cumulative frequency
More than 5 40
More than 10 39
More than 15 32
More than 20 21
More than 25 11
More than 30 4
More than 35 2
More than 40 2
More than 45 1
Plot the following and draw a “more than” cumulative frequency curve.
Exercise
Draw “less than” cumulative frequency curve.

The Purpose of Graph


 It presents data in a colorful and attractive way.
 It enables a general perspective of the data to be shown without excessive detail.
 Graphs complement tabular presentation.

Types of Frequency Curves


1) Symmetrical or bell-shaped
 This is a shape characteristic of a distribution forming a mirror image about the centre of the
distribution.
 It is characterized by the fact that observations equidistant from the central maximum occur with
the same frequency.
Examples include: the normal distribution, the triangular distribution and the U-shaped distribution.
 In a symmetrical distribution the mean, median and mode are all the same:
Mean = mode = median

2) Skewness of the distribution


 This is a distribution that does not satisfy the symmetrical property, i.e. the mean, median and mode
are not all equal.
 There are two types of skewness.

Positively skewed distribution

Characteristics of a positively skewed distribution
 The curve is skewed to the right (the longer tail is to the right)
 Mean > median > mode
 Most scores lie below the mean

Negatively skewed

Characteristics of a negatively skewed distribution

 The curve is skewed to the left (the longer tail is to the left)
 Mean < median < mode
 Most scores lie above the mean

3) Kurtosis
Kurtosis is a Greek word meaning ‘bulginess’. It refers to the peakedness or flatness of the curve.
Three terms are used for indicating the degree of flatness:

1) Mesokurtic (normal flatness)
2) Leptokurtic (peaked curves)
3) Platykurtic (less peaked, flatter curves)

(Sketches: a leptokurtic, a mesokurtic and a platykurtic curve.)
MEASURE OF CENTRAL TENDENCY
Introduction
It is very difficult to understand a mass of data and to get a concise and complete picture of a large data set.
It is therefore essential to obtain a single figure which represents the whole data; with the help of such a
figure, the data can be compared and understood easily. A figure which represents the whole data is known
as an average or measure of central tendency. An average removes all the unnecessary details of the data and
gives a concise picture of the huge data under investigation.

QUALITIES OF A GOOD AVERAGE


The measure of the value around which the data are clustered is known as a measure of central
tendency or an average.
The qualities of a good average are as follows;
1) Should be rigidly defined.
2) Should be based on all values.
3) Should be easily understood and calculated.
4) Should be least affected by the fluctuation of sampling
5) Should be capable of further algebraic and statistical treatment
6) Should be least affected by the extreme values

According to Prof Yule, the following are the desiderata (requirements) to be satisfied by an ideal
average:
(i) It should be rigidly defined i.e., the definition should be clear and un-ambiguous so that it leads
to one and only one interpretation by different persons. In other words, the definition should not
leave anything to the discretion of the investigator or the observer. If it is not rigidly defined then
the bias introduced by the investigator will make its value unstable and render it unrepresentative
of the distribution.
(ii) It should be easy to understand and calculate even for a non-mathematical person. In other
words, it should be readily comprehensible and should be computed with sufficient ease and
rapidity and should not involve heavy arithmetical calculations. However, this should not be
accomplished at the expense of accuracy or some other advantages which an average may
possess.
(iii) It should be based on all the observations, Thus in the computation of an ideal average the
entire set of data at our disposal should be used and there should not be any loss of information
resulting from not using the available data. Obviously, if the whole data is not used in computing
the average, it will be unrepresentative of the distribution.
(iv) It should be suitable for further mathematical treatment. In other words, the average should
possess some important and interesting mathematical properties so that its use in further statistical
theory is enhanced. For example, if we are given the averages and sizes (frequencies) of a
number of different groups then for an ideal average we should be in a position to compute the
average of the combined group. If the average is not amenable to further algebraic manipulation,
then obviously its use will be very much limited for further applications in statistical theory.
(v) It should be affected as little as possible by fluctuations of sampling. By this we mean that if
we take independent random samples of the same size from a given population and compute the
average for each of these samples then, for an ideal average, the values so obtained from different
samples should not vary much from one another. The difference in the values of the average for
different samples is attributed to the so called fluctuations of sampling. This property is also
explained by saying that an ideal average should possess sampling stability.
(vi) It should not be affected much by extreme observations. By extreme observations we mean
very small or very large observations. Thus, a few very small or very large observations should
not unduly affect the value of a good average.
There are five measures of central tendency:
1) Arithmetic mean
2) Geometric mean
3) Harmonic mean
4) Median
5) Mode

The first three measures are known as computed averages while the last two are known as positional
averages.

ARITHMETIC MEAN
Arithmetic mean of a set of values is defined as the sum of the values divided by the number of values.
In other words the mean is also called the average or the center of gravity.
AM = (sum of all values) / (number of values)

The AM is normally denoted by $\bar{x}$ (x-bar).

For ungrouped data,

$\text{Mean } (\bar{x}) = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n}$

 the average of a variable x in a sample is represented by $\bar{x} = \dfrac{\sum x}{n}$, where n is the number of
observations in the sample

 the average of a variable x in a population is represented by the parameter $\mu = \dfrac{\sum x}{N}$,
where N is the size of the population

Example 1

Find the mean for the set of data: 3, 7, 2, 1, and 7

Solution

$\bar{x} = \dfrac{3 + 7 + 2 + 1 + 7}{5} = \dfrac{20}{5} = 4$

For grouped data,

Direct method:

$\text{Mean } (\bar{x}) = \dfrac{x_1 f_1 + x_2 f_2 + x_3 f_3 + \dots + x_n f_n}{f_1 + f_2 + f_3 + \dots + f_n} = \dfrac{\sum fx}{\sum f}$

where f = the frequencies and $\sum f$ = the total number of frequencies.

Short cut method:
Under this method the formula for calculating the mean is

$\bar{x} = A + \dfrac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}$

where A = the assumed mean, $d_i$ = the deviation of the ith item from the assumed mean ($d_i = x_i - A$),
$f_i$ = the frequency of the ith observation and n = the number of observations.

a) Discrete Series
In a discrete series, the value of each individual item is multiplied by the corresponding frequency
and the total of the products is divided by the total frequency. So in a discrete series, the
arithmetic mean is calculated as

$\bar{X} = \dfrac{\sum fX}{\sum f}$

where f = the frequencies and $\sum f$ = the total number of frequencies.

Example 2
 Calculate the arithmetic mean from the following data:
Values       5    10    15    20    25    30    35    40    45    50
Frequency   20    43    75    67    72    45    39     9     8     6

Solution
Values (x)       f        fx
    5           20       100
   10           43       430
   15           75      1125
   20           67      1340
   25           72      1800
   30           45      1350
   35           39      1365
   40            9       360
   45            8       360
   50            6       300
Total     Σf = 384    Σfx = 8530

$\bar{x} = \dfrac{\sum fx}{\sum f} = \dfrac{8530}{384} = 22.2$
Example 3: Calculate the mean marks from the following data of the marks obtained by 60 students of a class.
Marks             20   30   40   50   60   70
No of students     8   12   20   10    6    4
Solution:

Calculation of Arithmetic mean


Marks (x)   No. of students (f)     fx     d = (x - 40)     fd
   20                8             160          -20        -160
   30               12             360          -10        -120
   40               20             800            0           0
   50               10             500           10         100
   60                6             360           20         120
   70                4             280           30         120
            N = 60          Σfx = 2,460                Σfd = 60
(i) Direct method:
Here N = total frequency = 60

$\bar{x} = \dfrac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \dfrac{2460}{60} = 41$

Hence the average marks = 41.

(ii) Short cut method:

$\bar{x} = A + \dfrac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}$

Since A = 40, $\bar{x} = 40 + \dfrac{60}{60} = 40 + 1 = 41$.

Hence the average marks = 41.
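
Both methods of Example 3 can be checked with a few lines of code (Python assumed); the short cut method
uses the assumed mean A = 40 as above.

```python
# Direct and short cut methods for the mean of a discrete frequency distribution.
marks = [20, 30, 40, 50, 60, 70]
freqs = [8, 12, 20, 10, 6, 4]

# Direct method: mean = sum(f*x) / sum(f)
n = sum(freqs)                                           # 60
mean_direct = sum(f * x for x, f in zip(marks, freqs)) / n
print(mean_direct)                                       # 41.0

# Short cut method: mean = A + sum(f*d) / sum(f), with d = x - A
A = 40                                                   # assumed mean
mean_shortcut = A + sum(f * (x - A) for x, f in zip(marks, freqs)) / n
print(mean_shortcut)                                     # 41.0
```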

b) Continuous Series
Now, all this is very well, but much data is in the form of grouped, continuous frequency distributions.
The method of calculating the arithmetic mean from a continuous series is exactly the same as that of
discrete series with the exception that in a continuous series, we first take the mid-points of the various
class intervals which are written against each class interval. These mid-values are multiplied by the
corresponding frequencies.

Example 4
Calculate the mean for the following frequency distribution:
Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70
No of students 6 5 8 15 7 6 3

Solution
Marks      Mid-values (X)    No of students (f)     fX
0-10             5                   6              30
10-20           15                   5              75
20-30           25                   8             200
30-40           35                  15             525
40-50           45                   7             315
50-60           55                   6             330
60-70           65                   3             195
                              Σf = 50        ΣfX = 1670

$\bar{X} = \dfrac{\sum fX}{\sum f} = \dfrac{1670}{50} = 33.4 \text{ marks}$

Example 5: Estimate the mean of a sample that has been grouped according to the following distribution:

Solution
Class      Frequency (f)    Class mark (x)       xf
15-24            8               19.5           156
25-34           12               29.5           354
35-44           13               39.5           513.5
45-54           23               49.5          1138.5
55-65           14               59.5           833
66-74           12               69.5           834
Sum             82                             3829

$\bar{x} = \dfrac{\sum xf}{\sum f} = \dfrac{3829}{82} \approx 46.7$

The mean of means – combined mean (grand mean): to find the overall average of the means
$\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$ of groups of sizes $n_1, n_2, \dots, n_k$, use the formula

$\bar{x} = \dfrac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \dots + n_k \bar{x}_k}{n_1 + n_2 + \dots + n_k}$

Problem: the average of the 17 females in a Statistics class is 83 points and the average of the 14 males
is 78. What is the average of the whole class?

Solution: $\bar{x} = \dfrac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} = \dfrac{17(83) + 14(78)}{17 + 14} = \dfrac{2503}{31} = 80.74$

In summary, one follows 3 steps to compute the AM of grouped data:


(i) Multiply each value of X, or the mid-value of the class (in the case of a grouped or continuous frequency
distribution), by the corresponding frequency (f).
(ii) Obtain the total of the products obtained in step (i) above to get ΣfX.
(iii) Divide the total obtained in step (ii) by N = Σf, the total frequency.
The resulting value gives the arithmetic mean.
Merits and Demerits of Arithmetic Mean
Merits
1. It can be easily understood and calculated.
2. It is based on all the observations (i.e., items of the series).
3. It can be computed for any set of numerical data, so it always exists.
4. A set of numerical data has one and only one mean, so it is always unique.
5. It is suitable for further mathematical treatment, e.g., the means of several sets of data can be combined into the
overall mean of all the data.
6. Unlike the median, it is not necessary to arrange the data first and then calculate the average.
7. It is used very frequently.

Demerits
1. It may produce a figure which does not exist in a series. In other words, it often produces results that
are not suitable for a communication view point, for example, the figure 2.2 children per adult female
is absurd! In situations like these, people expect an integer (i.e., whole number) to be representative
of the number of children per family.
2. It cannot be determined by inspection nor can it be located graphically.
3. It cannot be used if we are dealing with qualitative characteristics which cannot be measured
quantitatively such intelligence, honesty, beauty, etc. In such cases median is the only average to be
used.
4. It cannot be obtained if a single observation is missing or lost or is illegible unless we drop it out and
compute the AM of the remaining values.
5. The strongest drawback of the arithmetic mean is that it is very much affected by extreme observations.
Two or three very large or small values of the variable may unduly affect the value of the AM.
Consider for example, the case of John, who at an interview for a job is told that the average income
of salesmen in the company is £8000. He accepts the job as he considers the firm to be very
progressive with excellent prospects for himself. Although his starting salary is only £2000 a year,
his salary will obviously climb very quickly. You could imagine how cheated John felt when he found
that the sales force consisted of just 5 men: the sales director at £30,000 a year and 4 salesmen on
£2500 a year.
x̄ = (30,000 + 2500(4)) / 5 = 40,000 / 5 = £8,000.
The extreme value (in this case the sales director’s salary) has caused the AM to be most
unrepresentative.

THE MEDIAN
It is the value of the middle item of a series when these items are arranged in ascending or descending
order. In the words of L R Connor: ‘The median is that value of the variable which divides the group in
two equal parts, one part comprising all the values greater and the other, all values less than median’.

Thus, median of a distribution may be defined as that value of the variable which exceeds and is exceeded
by the same number of observations i.e., it is the value such that the number of observations above it is
equal to the number of observations below it. Thus, we see that as against AM which is based on all the
items of the distribution, the median is only positional average, i.e., its value depends on the position
occupied by a value in the frequency distribution.

The median: the median is the value at the middle after they are arranged in either ascending or
descending order.
To find the median in the case of ungrouped data:
Step 1. Arrange the values in order.
Step 2. If n is odd, the middle value is the median
If n is even, the average of the two middle values is the median
Example1: find the median of the values 8, 5, 7, 13, 2, 13, 9
Step 1: 2, 5, 7, 8, 9, 13, 13
Step 2: the median is the middle value 8.

Example 2: find the median of the values 8, 5, 7, 13, 2, 13, 9, 14


Step 1: 2, 5, 7, 8, 9, 13, 13, 14
Step 2: the median is the average of the two middle values 8 and 9, which is 8.5

MEDIAN FOR A SIMPLE FREQUENCY DISTRIBUTION

X f C.F
0 15 15
1 24 39
2 median 18 57
3 12 69
4 8 77
5 2 79
6 1 80
80
Steps
1. Calculate the value of (Σf + 1)/2 = (80 + 1)/2 = 81/2 = 40.5
2. Form a cumulative frequency column
3. Find the cumulative frequency value which first exceeds the value obtained in step 1
4. The median is the x value corresponding to the C.F. value identified in step 3.
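
The steps above can be sketched in Python (using the x values and frequencies of the table); this is only an illustration of the rule, not a general-purpose routine:

x_values = [0, 1, 2, 3, 4, 5, 6]
freq     = [15, 24, 18, 12, 8, 2, 1]

position = (sum(freq) + 1) / 2          # (80 + 1)/2 = 40.5
cumulative = 0
for x, f in zip(x_values, freq):
    cumulative += f                     # running cumulative frequency
    if cumulative >= position:          # first C.F. that reaches the position
        print("Median =", x)            # Median = 2
        break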

Median for grouped frequency.

As mentioned in the previous work, the penalty for grouping values is the loss of their individual identities,
and thus there is no way that a median can be calculated exactly in this situation. However, there are
two methods commonly employed for estimating the median.

1. Using an interpolation formula


2. By graphical interpolation

Interpolation in this context is simply a mathematical technique which estimates an unknown value by
utilizing the immediately surrounding known values.

Continuous Frequency Distribution


As before, the steps involved are:
(i) Prepare ‘less than’ cumulative frequency (cf) distribution.
(ii) Find N/2
(iii) See cf just greater than N/2
(iv) The corresponding class contains the median value and is called the median class.

One difficulty in computing the median of the continuous series is that the value of the median lies in a
class interval. So this value is calculated by the method of interpolation. The value of median is obtained
by the formula:
Md = Lm + ((½N – Cp) / fm) × c
Where
Lm = the lower limit of the median class.
N = total number of observations (= Σf, the total frequency).
Cp = cumulative frequency of the class immediately before the median class.
fm = the frequency of the median class.
c = the magnitude or width of the median class.

(Note: if N is an odd number, the value of ½N is not a whole number and should be rounded up to the
nearest integer.)
Also the formula works in the assumption that the distribution of the variable under consideration is
continuous with exclusive type classes without any gaps. Where classes are not continuous, then the
distribution must be converted into a continuous frequency distribution before applying the formula. This
adjustment will affect only the value of Lm.

Example
Classes’ f c.f
20 – 25 2 2
25 – 30 14 16
30 – 35 29 45
35 – 40 median class 43 88
40 – 45 33 121
45 – 50 9 130
Total 130

½N = 130 / 2 = 65
Md = 35 + ((65 – 45) / 43) × 5 ≈ 37.33

Students to calculate the answer in class. The answer should be = 18.854 for the example below

Class f c.f

5–9 10 10
10 – 14 36 46
15 – 19 m.c 62 108
20 – 24 72 180
25 – 29 20 200
Total 200
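
A minimal Python sketch of the interpolation formula (assuming the classes of the table are first converted to the continuous boundaries 4.5–9.5, 9.5–14.5, and so on):

boundaries = [(4.5, 9.5), (9.5, 14.5), (14.5, 19.5), (19.5, 24.5), (24.5, 29.5)]
freq       = [10, 36, 62, 72, 20]

half = sum(freq) / 2                           # N/2 = 100
cum = 0
for (lo, hi), f in zip(boundaries, freq):
    if cum + f >= half:                        # this is the median class
        median = lo + (half - cum) / f * (hi - lo)
        break
    cum += f
print(round(median, 2))                        # about 18.85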

Characteristics of the median

 It is an appropriate alternative to the mean when extreme values are present at one or both ends
of a distribution.
 It can be used when certain end values of a set or distribution are difficult, expensive or impossible to
obtain.
 It can be used when non-numerical data is involved, e.g. objects ranked by size or height.
 It assumes a value equal to one of the original observations (when n is odd).
 The only disadvantage of the median is that it is difficult to handle theoretically in more advanced
statistical work.

Merits and Demerits of Median


Merits
1. It is easy to understand and calculate, even for a non-mathematical person.
2. It is not affected at all by extreme observations, given that it is a positional average and as such is
very useful in the case of skewed distributions. So in the case of extreme observations, median is a
better average to use than AM since the latter gives a distorted picture of the distribution.
3. Median can sometimes be located by simple inspection and can also be computed graphically.
4. Median is the only average to be used while dealing with qualitative characteristics which cannot be
measured quantitatively but can still be arranged in ascending or descending order of magnitude, eg,
to find the average intelligence, average beauty, average honesty etc, among a group of people.

Demerits
1. In case of even number of observations for an ungrouped data, median cannot be determined exactly.
We merely estimate it as the AM of the two middle terms. In fact, any value lying between the two
middle observations can serve the purpose of median.
2. Median, being a positional average, is not based on each and every item of the distribution. It
depends on all the observations only to the extent whether they are smaller than or greater than it; the
exact magnitude of the observations being immaterial.
Consider a simple example:
The median value of 8, 12, 35, 40, and 60 is 35. Now if we replace the values 8 and 12 by any two
values less than 35 and the values of 40 and 60 by any two values greater than 35 the median is
unaffected.
3. Median is not suitable for further mathematical treatment, i.e., given the sizes and the median values
of the different groups, we cannot compute the median of the combined group.

THE MODE
Although the mean and the median will be the averages used in most circumstances, there are situations in
which other averages are particularly appropriate. Whereas the mean can be said to find the center of
gravity and the median the middle of a set of items, the mode identifies the most popular item. The mode
of a set of data is that value which occurs most often or, equivalently, has the largest frequency.

Mode is the value which occurs most frequently in a set of observations and around which the other items
of the set cluster densely. In other words, mode is the value of a series which is predominant in it.
According to A M Tuttle, ‘Mode is the value which has the greatest frequency density in its immediate
neighborhood’. Accordingly, mode may also be termed as the fashionable value (a derivation of the
French word ‘la mode’) of the distribution.

In the following statements:


(i) Average size of the shoe sold in a shop is 7;
(ii) Average height of a Kenyan (male) is 1.68 meters approximately;
(iii) Average collar size of a shirt sold in a readymade garment shop is 35 cms;
(iv) Average student in a professional college spends Ksh8, 000 per month.

 The average referred to is neither mean nor median but mode, the most frequent value in the
distribution. For example, by the first statement we mean that there is maximum demand for the shoe
of size no 7.
Computation of Mode- Ungrouped data
From individual series, mode is obtained through observation. It is the value of that item which occurs
maximum number of times. A distribution could have only one mode (unimodal), could have two modes
(bimodal), three modes (tri-modal) or many modes (multimodal)

Example:
The marks of 10 students in a test are: 65 43 57 63 39 57 60 48 57 55. Find the mode.

Solution: Through observation, we find that 57 has been repeated 3 times. So mode is 57 marks.

Frequency Distribution (Discrete Series)


In case of a frequency distribution, mode is the value of the variable corresponding to the maximum
frequency. This method can be applied with ease and simplicity if the distribution is ‘unimodal’, i.e., if it
has only one mode.
For example:
X: 1 2 3 4 5 6 7 8 9
: 3 1 18 25 40 30 22 10 6

The maximum frequency is 40 and therefore, the corresponding value of x, i.e., 5 gives the value of mode.

Continuous Frequency Distribution


In the case of continuous frequency distribution, the class corresponding to the maximum frequency is
called the ‘modal class and the value of the mode is obtained by any of the following three formulae:
(i) Mode = 3 Median – 2 Arithmetic mean

(ii) Mode = l1 + (Δ1 / (Δ1 + Δ2)) × i        (Preferred formula)

(iii) Mode (Mo) = l1 + ((f1 – f0) / (2f1 – f0 – f2)) × i

Where
Δ1 = f1 – f0
Δ2 = f1 – f2
l1 = lower limit of the modal class.
f1 = frequency of the modal class.
f0 = frequency of the class preceding the modal class.
f2 = frequency of the class succeeding the modal class.
i = size of the class.
Note:
1) While applying the above formula for calculating mode, it is necessary to see that the class
intervals are uniform throughout. If they are unequal, they should first be made equal on the
assumption that the frequencies are equally distributed throughout.
2) In case of bimodal distribution the mode cannot be found.
Finding mode in case of bimodal distribution:
In a bimodal distribution the value of mode cannot be determined by the help of the above
formulae. In this case the mode can be determined by using the empirical relation given below.
Mode = 3Median - 2Mean
And the mode which is obtained by using the above relation is called ‘Empirical mode’

Remarks
1. It may be pointed out that the above formula (iii) for computing mode is based on the following
assumptions:
(i) The frequency distribution must be continuous with exclusive type classes without any gaps. If
the data is not given in the form of continuous classes, it must first be converted into continuous
classes before applying formula (ii) or (iii).
(ii) The class intervals must be uniform throughout i.e., the width of all the class intervals must be the
same. In case of the distribution with unequal class intervals, they should be made equal under
the assumption that the frequencies are uniformly distributed over all the classes, otherwise the
value of mode computed from the formulae (ii) and (iii) will give misleading results.
2. The above technique of locating mode is not practicable in the following situations:
(i) If the maximum frequency is repeated or approximately equal concentration is found in two or
more neighboring values.
(ii) If the maximum frequency occurs either in the very beginning or at the end of the distribution.
(iii) If there are irregularities in the distribution, ie, the frequencies of the variable increase or decrease
in a haphazard way.

In the above situations, mode (or modal class in the case of continuous frequency distribution) is located
by the method of grouping – (Statistics II).

Graphic Location of Mode


Mode can be located graphically from the histogram of frequency distribution by making use of the
rectangles erected on the modal, pre-modal and post-modal classes. The method consists of the following
steps:
Locating mode graphically:
In a frequency distribution the value of mode can be determined graphically.
The steps in calculation are
1) Draw a histogram of the given data.
2) Draw two lines diagonally in the inside of the modal class bar, starting from each upper corner of
the bar to the upper corner of the adjacent bars.
3) Draw a perpendicular line from the intersection of the two diagonal lines to the X-axis which
gives the modal value.
Note:
The graphic method of determining mode can be used only where the data is unimodal.
Finding Mode Graphically
Marks (x), inclusive series   Conversion into exclusive series   No. of students (frequency, f)
10-19 9.5-19.5 10
20-29 19.5-29.5 12
30-39 29.5-39.5 18
40-49 39.5-49.5 30
50-59 49.5-59.5 16
60-69 59.5-69.5 6
70-79 69.5-79.5 8

The following steps must be followed to find the mode graphically.


1. Represent the given data in the form of a Histogram. The height of the rectangles in the histogram
is marked by the frequencies of the class interval as shown in the graph .Identify the highest
rectangle. This corresponds to the modal class of the series.
2. Join the top corners of the modal rectangle with the immediately next corners of the adjacent
rectangles. The two lines will cut each other. This might be difficult to visualize, so look
at the graph given below.
3. Let the point where the joining lines cut each other be 'A'. Draw a perpendicular line from point A
onto the x-axis. The point 'P' where the perpendicular will meet the x-axis will give the mode.

The Histogram

In this case the value of point P turns out to be 44.12
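
The same modal value can be checked numerically with formula (iii). The Python sketch below assumes the exclusive-series classes and frequencies of the table above:

boundaries = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5), (39.5, 49.5), (49.5, 59.5), (59.5, 69.5), (69.5, 79.5)]
freq       = [10, 12, 18, 30, 16, 6, 8]

k  = freq.index(max(freq))                     # position of the modal class
l1 = boundaries[k][0]                          # lower limit of the modal class
i  = boundaries[k][1] - boundaries[k][0]       # class width
f1, f0, f2 = freq[k], freq[k - 1], freq[k + 1]

mode = l1 + (f1 - f0) / (2 * f1 - f0 - f2) * i
print(round(mode, 2))                          # about 44.12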


Merit, Demerits and Uses of Mode

Merits
i. It is easy to calculate and understand
ii. It can be located in some cases by inspection
iii. It is capable of being ascertained graphically
iv. It is not affected by extreme values
v. It represents the most frequent value and hence it is very often used in the fashion industry.

Demerits
i. There are different formulae for its calculation which ordinarily give different answers.
ii. Mode is indeterminate when a series has two or more modes.
iii. It cannot be subjected to further statistical analysis. For example, the combined mode of two
series cannot be calculated.
iv. It is an unsuitable measure as it is affected considerably by sampling fluctuations.
v. Mode for a series with unequal class intervals cannot be calculated directly.
Uses:
i. It is used for the study of most popular fashion items
ii. It is extensively used by business and commercial managers.

1.6 Empirical Relation Between Mean (M), Median (Md) and Mode (Mo)
In case of a symmetrical distribution, mean, median and mode coincide i.e., M = Md = Mo.
In a highly asymmetrical (skewed) distribution it is impossible to forecast the exact relationship between the averages.
For a moderately asymmetrical distribution, Prof Karl Pearson has demonstrated the following important empirical relationship:
 Mode = Mean – 3(mean-median).
 Mode = 3 median - 2 mean
This formula is especially useful to determine the value of mode in case it is ill defined, eg, in the
case of bimodal or multimode distributions.
 Mean – median = 1/3 (mean – mode)
Thus, we see that the difference between mean and mode is three times the difference between mean and
median. In other words, median is closer to mean than mode.

Geometric mean
Geometric mean is defined as the nth root of the product of n items (or values).
Calculation of G.M – Individual series:
If x1, x2, x3, ......., xn be n observations studied on a variable X, then the G.M of the observations is
defined as

G.M = (x1 . x2 . x3 ........ xn)^(1/n)

Applying log on both sides,

log G.M = (1/n) log(x1 . x2 .......... xn)
        = (1/n) [log x1 + log x2 + ........... + log xn]
        = (1/n) Σ log xi

⇒ G.M = antilog[ (1/n) Σ log xi ]

Calculation of G.M – Discrete series:

If x1, x2, x3, ......., xn be n observations of a variable X with frequencies f1, f2, f3, ......., fn
respectively, then the G.M is defined as

G.M = (x1^f1 . x2^f2 . x3^f3 ........ xn^fn)^(1/N)   ………………(*)

Where N = Σ fi, i.e. the total frequency.
Applying log on both sides in (*) we get

G.M = antilog[ (1/N) Σ fi log xi ]

Calculation of G.M – Continuous series:

In a continuous series the G.M is calculated by replacing the values xi by the mid points of the
classes, i.e. mi.

G.M = antilog[ (1/N) Σ fi log mi ]

Where mi is the mid value of the ith class interval.

Properties of G.M:
1) If G1 and G2 are geometric means of two components having n1 and n2 observations and G is the
geometric mean of the combined series of n = n1 + n2 values, then

G = G1^w1 · G2^w2

Where w1 = n1/(n1 + n2) and w2 = n2/(n1 + n2).

Uses of G.M:
Geometrical Mean is especially useful in the following cases.
1) The G.M is used to find the average percentage increase in sales, production, or other economic
or business series.
For example, if from 1992 to 1994 prices increased by 5%, 10% and 18% respectively, then the average
annual increase is not 11% (the A.M of the rates) but about 10.9%, which is calculated by the G.M.
2) G.M is theoretically considered to be best average in the construction of Index numbers.
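
A minimal Python sketch of use (1), averaging growth rates through the geometric mean of the growth factors (the percentages are the ones quoted above):

rates = [0.05, 0.10, 0.18]                     # 5%, 10%, 18%

product = 1.0
for r in rates:
    product *= (1 + r)                         # combined growth factor
avg_rate = product ** (1 / len(rates)) - 1     # geometric mean of the factors, minus 1
print(round(avg_rate * 100, 1))                # about 10.9 (percent per year)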

Weighted Geometric Mean:


Like the weighted arithmetic mean, we can also find a weighted geometric mean, and the formula is
given by

G.M = (x1^w1 . x2^w2 . x3^w3 ........ xn^wn)^(1/N)

Where N = Σ wi, i.e. the total weight.
Applying log on both sides we get

G.M = antilog[ (1/N) Σ wi log xi ]
Harmonic mean:
The harmonic mean (H.M) is defined as the reciprocal of the arithmetic mean of the reciprocals of the
individual observations.

Calculation of H.M – Individual series:


If x1, x2, x3, .........., xn be ‘n’ observations of a variable X, then the harmonic mean is defined as

H.M = n / (1/x1 + 1/x2 + .............. + 1/xn)

⇒ H.M = n / Σ (1/xi)
Calculation of H.M – Discrete series:
If x1, x2, x3, .........., xn be ‘n’ observations occurring with frequencies f1, f2, f3, .........., fn respectively, then
H.M is defined as

H.M = Σ fi / (f1/x1 + f2/x2 + .............. + fn/xn)   …………..(*)

⇒ H.M = Σ fi / Σ (fi/xi)
Calculation of H.M – Continuous series:
In a continuous series H.M can be calculated by using the mid values (mi) in place of the xi's in equation
(*). Hence H.M is given by

H.M = Σ fi / Σ (fi/mi),  where mi is the mid value of the ith class interval.
Uses of harmonic mean:
1) The H.M is used for computing the average rate of increase in profits of a concern.
2) The H.M is used to calculate the average speed at which a journey has been performed.
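
A minimal Python sketch of use (2), with assumed illustrative speeds: when equal distances are covered at different speeds, the average speed is the harmonic mean of those speeds, not the arithmetic mean:

speeds = [40, 60]                              # km/h over two equal stretches (assumed values)

harmonic_mean = len(speeds) / sum(1 / s for s in speeds)
print(harmonic_mean)                           # 48.0 km/h, not the arithmetic mean of 50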

Merits:
1) Its value is based on all the observations of the data.
2) It is less affected by the extreme values.
3) It is suitable for further mathematical treatment.
4) It is strictly defined.

Demerits:
1) It is not simple to calculate or easy to understand.
2) It cannot be calculated if one of the observations is zero.
3) The H.M is always less than A.M and G.M.

Relation between A.M, G.M, and H.M:


The relation between A.M, G.M, and H.M is given by
A.M ≥ G.M ≥ H.M

Note: The equality condition holds true only if all the items in the distribution are equal.

Prove that if a and b are two positive numbers then A.M ≥ G.M ≥ H.M
Sol:
Let a and b be two positive numbers. Then
The Arithmetic mean of a and b = (a + b)/2
The Geometric mean of a and b = √(ab)
The Harmonic mean of a and b = 2ab/(a + b)

Let us assume A.M ≥ G.M
⇒ (a + b)/2 ≥ √(ab)
⇒ a + b ≥ 2√(ab)
⇒ (a + b)² ≥ 4ab
⇒ (a – b)² ≥ 0
Which is always true.

∴ A.M ≥ G.M ………………………… (1)

Let us assume G.M ≥ H.M
⇒ √(ab) ≥ 2ab/(a + b)
⇒ a + b ≥ 2√(ab)
⇒ (a + b)² ≥ 4ab
⇒ (a – b)² ≥ 0
Which is always true.
∴ G.M ≥ H.M ………………………… (2)
From (1) and (2) we get A.M ≥ G.M ≥ H.M
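
The inequality can also be checked numerically; the Python sketch below uses assumed positive values:

import math

data = [4, 9, 16]                                        # any positive numbers (assumed values)
n = len(data)

am = sum(data) / n                                       # arithmetic mean
gm = math.prod(data) ** (1 / n)                          # geometric mean
hm = n / sum(1 / x for x in data)                        # harmonic mean
print(am >= gm >= hm, round(am, 2), round(gm, 2), round(hm, 2))   # True 9.67 8.32 7.08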
Problems:
1. Calculate AM of the following data.
a) 4, 3, 2, 5, 3, 4, 5, 1, 7, 3, 2, 1 [3.33]
b) 30,70,10,75,500,8,42,250,40,36 [106.1]
c) 35, 46, 27, 38, 52, 44,50, 37, 41, 50[42]
2. Find the A.M of first n natural numbers.
3. Find the A.M of first n even numbers.
4. Find the A.M of first n odd numbers.
5. Find the A.M of first 10 even numbers.
6. Find the A.M of first 100 odd numbers.
7. Find A.M, G.M, H.M, median and mode of the following:
X 5 6 7
f 1 4 3

8. Find A.M, G.M, H.M, median and mode of the following:
X 20 21 22 23 24 25 26
f 1 2 4 7 5 3 1

10. Find A.M, G.M, H.M, median and mode of following data.
Marks 20 30 40 50 60 70
No. Of Students 8 12 20 10 6 4

11. Find A.M, G.M, H.M, median and mode of following


CI 0-10 10-20 20-30 30-40 40-50 50-60
Frequency 6 14 16 27 22 15
12. Find A.M, G.M, H.M, median and mode of following data.
CI 20-25 25-30 30-35 35-40 40-45 45-50 50-55
Frequency 10 12 8 20 11 4 5

13. Find A.M, G.M, H.M, median and mode of following data.
CI 10-20 20-40 40-70 70-120 120-200
Frequency 4 10 26 8 2

14. Find A.M, G.M, H.M, median and mode of following data.
CI 1-7 8-14 15-21 22-28 29-35
Frequency 3 17 12 11 7
15. In an examination marks secured by three students A, B, C along with the respective weights of
the subjects are given below. Determine the best performance

Students Math(wt-4) Science(wt-3) English(wt-2) History(wt-1)


A 90 80 60 70
B 85 75 65 75
C 95 70 55 80

16. Find A.M, G.M, H.M, median and mode of following data
Wages in rupees Less than Less than Less than Less than Less than 50
10 20 30 40
No. of workers 5 17 20 22 25

17. Find the missing frequencies from the data given below if mean is 60.

Marks 50 55 60 65 70 Total
No. Of Students ? 20 25 ? 10 100

18. Find the missing frequencies from the data given below if mean is 60.

Marks 60-62 63-65 66-68 69-71 72-74


No. Of Students 15 54 ? 81 24

19. Find the G.M of the following data.


Marks 10 20 30 40 50 60
No. Of Students 12 15 25 10 6 2
20. Find the average rate of growth of population which in the first decade increased by 20%, in
the second decade by 30% and in the third by 45%.

21. Find the modal marks of the following data.

Marks 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
No of students 6 29 87 181 247 263 133 43 9 2

22. Suggest appropriate averages for the following data


i. Average size of shoes
ii. Average number of children
iii. Average height of students
iv. Average speed of motor car
v. Average changes in the price level
23. Mean of the 10 observations is 20. If each observation is increased by 5 what is the mean of the
resultant series?
24. Mean of the 5 observations is 10. If each observation is doubled then what is the mean of the
new series.
25. How are the mean and median affected if every value of the variable is increased by 2 and then multiplied by
5?
26. The GM and HM of two observations are respectively 18 and 10.8. Find the observations.
27. The arithmetic mean of 10 observations is 72.5 and the arithmetic mean of 9 observations is
63.2, find the value of 10th observation.
Quartiles, Deciles and Percentiles: measures of relative standing
Measures of relative standing: the percentiles, the quartiles and the deciles are measures of
relative standing. They are also considered measures of location or position. They indicate the
position of a value taking into account the rest of the group
These divide the data set into four, ten or a hundred divisions, respectively.

Quartiles, Deciles and Percentiles are measures of position useful for comparing scores within
one set of data. You probably all took some type of college placement exam at some point. If
your composite math score was say 28, it might have been reported that this score was in the 94th
percentile. What does this mean? This does not mean you received a 94% on the test. It does
mean that of all the students who took that exam, 94% of them scored lower than you did (and
6% higher). For a set of data you can divide the data into three quartiles ( Q1 , Q2 , Q3 ), nine deciles
( D1 , D2 ,...D9 ) and 99 percentiles ( P1 , P2 ,...., P99 ). The quartile Q1 separates the bottom 25% from
the top 75%, Q2 is the median and Q3 separates the top 25% from the bottom 75%. To work with
percentiles, deciles and quartiles - you need to learn to do two different tasks. First you should
learn how to find the percentile that corresponds to a particular score and then how to find the
score in a set of data that corresponds to a given percentile.

Percentiles of a distribution are the values on the scale of measurement below which we find a
certain percentage of the group.
P90 : is called the 90th percentile; it is the value below which 90 percent of the group is located.
P40 : is called the 40th percentile; it is the value below which 40 percent of the group is located.
Pk : is called the kth percentile; it is the value below which k percent of the group is located.
Quartiles: there are three quartiles Q1 , Q2 and Q3
Q1 = P25, the first quartile is the same as the 25th percentile.
Q2 = P50 = Median, the second quartile is the same as the 50th percentile.
Q3 = P75, the third quartile is the same as the 75th percentile.
Deciles: there are nine deciles
D1 = P10, the first decile is the same as the 10th percentile.
D2 = P20, the second decile is the same as the 20th percentile, and so on.
Note: D5 = P50 = Q2 = Median, the fifth decile is the same as the median.
How to approximate the percentiles:
Step 1. Arrange the given scores in increasing order
k n
Step 2. Find the value of  N where N is obtained by rounding to a whole number.
100
X N  X N 1
Step 3. If N is obtained by rounding up, use Pk  X N , otherwise use Pk 
2
How to approximate the quartiles:
Step 1. Arrange the given scores in increasing order
Step 2. Q2 is the median of the distribution
Step 3. Q1 is the median of the first half of the distribution
Step 4. Q3 is the median of the second half of the distribution
Inter-quartile range Q3  Q1

How to find the percentile rank of a value xi


Percentile rank of xi = (Number of values less than xi / Total number of values in the data) × 100
Example 1:
Let X=73, 75, 69, 68, 78, 69, 74, 76, 72, 79, 68, 77, 71
a) Find the values of the three quartiles and the inter-quartile range Q3  Q1
Solution:
n=13; arranged in order X= 68, 68, 69, 69, 71, 72, 73, 74, 75, 76, 77,78, 79
The middle value is 73, therefore Q2  Median  73
The middle value of the first half 68, 68, 69, 69, 71, 72 is Q1 = (69 + 69)/2 = 69
The middle value of the second half 74, 75, 76, 77, 78, 79 is Q3 = (76 + 77)/2 = 76.5
Inter-quartile range=76.5 -69 = 7.5
b) Approximate the value of the 35th percentile.
N = (35 × 13)/100 = 4.55 ≈ 5 (rounded up), so P35 = X_5 = 71
c) Compute the percentile rank of 71
Percentile rank of 71 is (4/13) × 100 = 30.76923077% ≈ 30.77%
Example 2. Let X = 4, 8, 0, 3, 11, 7, 4, 14, 8, 13, 7, 9
N = 12, arranged in order X = 0, 3, 4, 4, 7, 7, 8, 8, 9, 11, 13, 14
a) Q2 = middle value = (7 + 8)/2 = 7.5
Q1 = middle value of the first half (0, 3, 4, 4, 7, 7) = (4 + 4)/2 = 4
Q3 = middle value of the second half (8, 8, 9, 11, 13, 14) = (9 + 11)/2 = 10
Inter-quartile range = Q3 – Q1 = 10 – 4 = 6
b) 70% of 12 = 8.4, which rounds to N = 8 (not rounded up), so P70 = (X_8 + X_9)/2 = (8 + 9)/2 = 8.5
c) The percentile rank of 3 = (1/12) × 100 = 8.33333% ≈ 8.33%
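
A minimal Python sketch of the percentile rule used in the notes (round k·n/100 to a whole number N; take X_N if it was rounded up, otherwise average X_N and X_(N+1)), applied to the data of Example 1:

data = sorted([73, 75, 69, 68, 78, 69, 74, 76, 72, 79, 68, 77, 71])

def percentile(k, values):
    pos = k * len(values) / 100
    n = round(pos)                                # round to a whole number
    if n > pos:                                   # it was rounded up
        return values[n - 1]                      # X_N (lists are 0-indexed)
    return (values[n - 1] + values[n]) / 2        # average of X_N and X_(N+1)

print(percentile(35, data))                       # 71, matching Example 1(b)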

THE MEASURE OF DISPERSION AND SKEWNESS.


Introduction and Meaning
The measures of central tendency give us an idea of the concentration of the observations about the
central part of the distribution. In spite of their great utility in statistical analysis, they have their own
limitations. If we are given only the average of a series of observations, we cannot form a complete idea
about the distribution, since there may exist a number of distributions whose averages are the same but which
may differ widely from each other in a number of ways. The following example will illustrate this view
point.

Let us consider the following three series A, B and C of 9 items each.


Series Total Mean

A 15, 15, 15, 15, 15, 15, 15, 15, 15 135 15

B 11, 12, 13, 15, 15, 16, 17, 18, 19 135 15

C 3, 6, 9, 12, 15, 18, 21, 24, 27 135 15

All the three series A, B, and C, have the same size (n=9) and the same mean, i.e., 15. Thus, if we are
given that the mean of a series of 9 observations is 15, we cannot determine if we are talking of the series
A, B, or C. In fact any series of 9 items which total 135 will give mean 15. Thus, we may have a large
number of series with entirely different structures and compositions but having the same mean.

From the above illustration, it is obvious that the measure of central tendency is inadequate to describe the
distribution completely. In the words of George Simpson and Fritz Kafka, ‘An average does not tell the
full story. It is hardly fully representative of a mass unless we know the manner in which the individual
items scatter around it. A further description of the series is necessary if we are to gauge how
representative the average is’.

Thus, the measures of central tendency must be supported and supplemented by some other measures.
One such measure is dispersion.

Literal meaning of dispersion is ‘scatteredness’. We study dispersion to have an idea of the homogeneity
(compactness) or heterogeneity (scatter) of the distribution. In the above illustration, we say that the
series A is stationary, i.e., it is constant and shows no variability. Series B is slightly dispersed and series
C is relatively more dispersed. We say that series B is more homogeneous (or uniform) as compared with
series C or the series C is more heterogeneous than series B.

W I King has defined dispersion as a term which is used to indicate the facts that within a given group,
the items differ from one another in size or in other words, there is lack of uniformity in their sizes.
Spiegel notes that ‘the degree to which numerical data tend to spread about an average value is the
variation or dispersion of the data’.

Characteristics for an Ideal Measure of Dispersion


The desiderata for an ideal measure of dispersion are the same as those for an ideal measure of central
tendency, i.e.
(i) It should be rigidly defined.
(ii) It should be easy to calculate and easy to understand
(iii) It should be based on all the observations
(iv) It should be amenable to further mathematical treatment
(v) It should not be affected much by extreme observations
(vi) It should have sampling stability.

Absolute and Relative Measures of Dispersion


The measures of dispersion which are expressed in terms of the original units of a series are termed as
absolute measures. Such measures are not suitable for comparing the variability of the two distributions
which are expressed in different units of measurement. On the other hand, relative measures of
dispersion are obtained as ratios or percentages and are thus pure numbers independent of the units of
measurement. For comparing the variability of the two distributions (even if they are measured in the
same units), we compute the relative measures of dispersion instead of absolute measures of dispersion.

The Usual Measures of Dispersion:


The usual measures of dispersion, very often suggested by the statisticians, are exhibited with the aid of
the following chart:

Primarily, we use two separate devices for measuring dispersion of a variable. One is an Algebraic
method and the other is Graphical method. In the algebraic method we use different notations and
definitions to measure it in a number of ways, and in the graphical method we try to measure the
variability of the given observations graphically, mainly through scatter diagrams and by fitting
different lines through those scattered points.
In the Algebraic method we split them up into two main categories, one is Absolute measure and the other
is Relative measure. Under the Absolute measure we again have four separate measures, namely Range,
Quartile Deviation, Standard Deviation and the Mean Deviation. And finally, under the Relative measure,
we have four other measures termed as Coefficient of Range, Coefficient of Variation, Coefficient of
Quartile Deviation and the Coefficient of Mean Deviation.

The Range
It is the simplest of all the measures of dispersion. It is defined as the difference between the two extreme
observations of the distribution. In other words, range is the difference between the greatest (maximum)
and the smallest (minimum) observation of the distribution. Thus:
Range = Xmax – Xmin (Range = L-S).
Where Xmax is the greatest observation (L) and
Xmin is the smallest observation of the variable value (S).

In case of the grouped frequency distribution (for discrete values) or the continuous frequency
distribution, range is defined as the difference between the upper limit of the highest class and the lower
limit of the smallest class. Here, the frequencies of the classes are immaterial.

To compute the coefficient of range (for comparison purposes), we have:


Coefficient of range = (Xmax – Xmin) / (Xmax + Xmin)
If the averages of the two distributions are about the same, a comparison of the range indicates that the
distribution with the smaller range has less dispersion and the average of that distribution is more typical
of the group.
Example 1:
Find the range and the coefficient of range of the following prices of shares of ABC company ltd.
Day: Monday Tuesday Wednesday Thursday Friday Saturday
Prices Ksh: 200 210 208 160 200 250
Solution:
Range = Xmax – Xmin where Xmax = 250, Xmin = 160.
= 250 – 160 = Ksh90.
Coefficient of range = (Xmax – Xmin) / (Xmax + Xmin) = (250 – 160) / (250 + 160) = 90/410 ≈ 0.22
Example 2:
The following table gives the age distribution of a group of 50 individuals:
Age (in years): 16-20 21-25 26-30 31-35
No of persons: 10 15 17 8
Solution
Since age is continuous variable we should first convert the given classes into continuous classes. The
first class will then become 15.5 – 20.5 and the last class will become 30.5 – 35.5.
Largest value (Xmax) = 35.5; Smallest value (Xmin) = 15.5
 Range = 35.5-15.5 = 20 years.
Coefficient of range = (35.5 – 15.5) / (35.5 + 15.5) = 20/51 ≈ 0.39
Example 3:
For the data presented with their respective frequencies, the idea is to measure the same as the difference
between the mid-values of the two marginal classes.
Consider the following table:

The required Range is 54.5 – 4.5 = 50 or the observations on the variable are found scattered within 50
units.
It is to be noted that any change in marginal values or the classes of the variable in the series given will
change both the absolute and the percentage values of the Range.

Merits and Demerits of Range


Merits
 It is rigidly defined, readily comprehensive and is perhaps the easiest to compute, requiring very little
calculations.

Demerits
 It is not based on the entire set of data (i.e., it ignores the bulk of the data available to us). It is based
only on two extreme observations.
 It is very much affected by fluctuations of sampling. Its value varies very widely from sample to
sample.

Uses of Range
In spite of these limitations, the range has its applications in a number of fields;
 It is used in studying the variations in the prices of stocks (ie, stock market fluctuations) and other
commodities.
 Range is used in industry for the statistical quality control of the manufactured product by the
construction of R-chart, ie, the control chart for range.
 Used very conveniently by meteorological department; ie, maximum and minimum temperatures of
the day.
 Most widely used measure of variability in our day-to-day life, difference between highly paid and
lowly paid worker, etc.

MEAN DEVIATION
This is a measure of dispersion that gives the average absolute difference
(i.e. ignoring –ve signs) between each item and the mean.
It is a much more representative measure than the range, since all item values are
taken into account in its calculation.
The mean deviation for a set of values is given by the formula

MD = Σ|x – x̄| / n
Example: 43, 75, 48, 39, 51, 47, 50, 47
Mean x̄ = 400/8 = 50
MD = Σ|x – 50| / n = 52/8 = 6.5

x      (x – x̄)
43     -7
75     25
48     -2
39     -11       ignore the negative signs
51     1
47     -3
50     0
47     -3
Total Σ|x – x̄| = 52
Mean deviation for a frequency distribution (simple frequency)

X    f    fx    (x – x̄)    f|x – x̄|
0 2 0 -2.56 5.12
1 4 4 -1.56 6.24
2 7 14 -0. 56 3.92
3 11 33 0.44 4.84
4 4 16 1.44 5.76
5 2 10 2.44 4.88
30 77 9.0 30.76

The formula is MD = Σf|x – x̄| / Σf.   First get the mean: x̄ = Σfx / Σf = 77/30 ≈ 2.56

MD = 30.76/30 ≈ 1.025
Mean deviation for grouped frequency
Formula: MD = Σf|x – x̄| / Σf
The example below to be calculated in class together with the teacher.

Class f x fx x-x f(x-x )


0-4 1
5-9 14
10-14 23
15-19 21
20-24 15
25-29 6
80

The answer should be 5.1
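
A minimal Python sketch of the grouped mean deviation (using the classes and frequencies of the table above, with each class represented by its mid-point):

classes = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24), (25, 29)]
freq    = [1, 14, 23, 21, 15, 6]

mids = [(lo + hi) / 2 for lo, hi in classes]                       # class mid-points
mean = sum(f * m for f, m in zip(freq, mids)) / sum(freq)          # grouped mean
md   = sum(f * abs(m - mean) for f, m in zip(freq, mids)) / sum(freq)
print(round(md, 1))                                                # about 5.1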


Characteristics of mean deviation
i. It is a good representative measure of dispersion that is not difficult to understand.
ii. Its practical disadvantage is that it can be complicated and awkward to calculate if the mean is
not a whole number.
iii. It may not be used in advanced statistics.

Advantages of the MD:


(a) On many occasions it gives fairly good results to represent the degree of variability or the
extent of dispersion of the given values of a variable as it takes separately all the observations
given into account.
(b) It can also be calculated about the median value of those observations as their central value
and then it gives us the minimum value for the MD
(c) In usual situations, it is calculated taking deviations from the easily computable arithmetic
mean of the given observations on the variable.
(d) It is easy to calculate numerically and simple to understand.

Disadvantages of the MD:


(a) The main complaint against this measure is that it ignores the algebraic signs of the
deviations.
(b) It is not generally computed taking deviations from the mode value and thereby disregards it
as another important average value of the variable.
(c) It is rarely used in practical purposes.

Standard Deviation
Standard deviation, usually denoted by the Greek letter σ (small sigma), was first suggested
by Karl Pearson as a measure of dispersion in 1893. It is defined as the positive square root of the
arithmetic mean of the squares of the deviations of the given observations from their arithmetic mean.
Thus, if X1, X2, …, Xn is a set of n observations, then its standard deviation is given by:

σ = s = √[ (1/n) Σ(x – x̄)² ] = √[ Σ(x – x̄)² / n ],  where x̄ = Σx/n (the AM of the given values).

Computation of Standard Deviation


From individual series
(i) Compute the AM (x̄) of the given series.
(ii) Compute the deviation (x – x̄) of each observation from the arithmetic mean, ie, obtain x1 – x̄, x2 – x̄, …,
xn – x̄.
(iii) Square each of the deviations obtained in step (ii), ie, compute (x1 – x̄)², (x2 – x̄)², …, (xn – x̄)².
(iv) Find the sum of the squared deviations in step (iii): Σ(x – x̄)² = (x1 – x̄)² + … + (xn – x̄)².
(v) Divide this sum in step (iv) by n to obtain Σ(x – x̄)² / n.
(vi) Take the positive square root of the value obtained in step (v): √[ Σ(x – x̄)² / n ].
(vii) The resulting value gives the standard deviation of the distribution.

4.4 Variance
According to William I Greenwald, the variance is the mean of the squared deviations about the mean of a
series. Thus, variance is the square of the standard deviation and is denoted by σ² = s². For the above
individual series, it is computed as: s² = Σ(x – x̄)² / n

Computation of standard deviation and variance in case of frequency distribution


In case of a frequency distribution, the standard deviation is given by:
S = √[ Σf(x – x̄)² / N ]
Where X is the value of the variable or the mid-value of the class (in case of grouped or continuous
frequency distributions); f is the corresponding frequency of the value X; N = Σf is the total frequency
and x̄ = Σfx / N is the arithmetic mean of the distribution.

Steps:
(i) Compute x̄ by either x̄ = Σfx/Σf or the usual short-cut formula: x̄ = A + Σfd/N.
(ii) Compute the deviations (x – x̄) for each value of the variable.
(iii) Obtain the squares of the deviations obtained in step (ii), ie, compute (x – x̄)².
(iv) Multiply each of the squared deviations by the corresponding frequency to get f(x – x̄)².
(v) Find the sum of the products obtained in step (iv) to get Σf(x – x̄)².
(vi) Divide the sum obtained in step (v) by N, the total frequency: S² = Σf(x – x̄)² / N.

 At this point, we have computed the variance for the frequency distribution.
(vii) The positive square root of the value obtained in step (vi) gives the standard deviation of the
distribution: S = √[ Σf(x – x̄)² / N ].

Note: The value of the s.d depends on the numerical value of the deviations (x1-x), (x2-x), …, (xn-x).
Thus, the value of S will be greater if the values of X are scattered widely away from the mean. Thus, a
small value of S will imply that the distribution is homogeneous and a large value of S will imply that it is
heterogeneous. In particular, s.d is zero if each of the deviations is zero, ie, S = 0 if and only if the
variable assumes a constant value ie,
S = 0 if, x1= x2 = x3 … = xn = k (constant).

Merits and Demerits of Standard Deviation


Advantages:
(a) Calculation of SD involves all the values of the given variable.
(b) It uses AM of the given data as an important component which is simply computable.
(c) It is least affected by sampling fluctuations.
(d) It is easily usable and capable of further Mathematical treatments.
(e) It is well defined.
(f) It is taken as the most reliable and dependable device for measuring dispersion or the variability of the
given values of a variable.
(g) Statisticians very often prescribe SD as the true measure of dispersion of a series of information.
(h) It can tactfully avoid the complication of considering negative algebraic sign while calculating
deviations.
Disadvantages:
(a) It involves complicated and laborious numerical calculations, especially when the data set is large.
(b) The concept of SD is neither easy to grasp nor simple to calculate.
(c) It is considerably affected by the extreme values of the given variable.
(d) To compute SD correctly, the method demands much time, money and manpower.
Therefore, the SD possesses almost all the prerequisites of a good measure of dispersion and hence it has
become the most familiar, important and widely used device for measuring dispersion for a set of values
on a given variable.

Example 1
Calculate the mean, median, mode, standard deviation and variance of the following data.
5, 8, 15, 29, 47, 47, 64, 71, 74.

Solution:
x x-x (x-x)2
5 5-40 = -35 1,225
8 8-40 = -32 1,024
15 15-40 = -25 625
29 29-40 = -11 121
47 47-40 = 7 49
47 47-40 = 7 49
64 64-40 = 24 576
71 71-40 = 31 961
74 74-40 = 34 1,156
x=360 (x-x)2 = 5786

Mean (x) = x = 360 = 40


n 9
Median = n+1 = 9+1 = 5th item = 47
2 2
Mode = 47 (appears twice – max).

s2 (x-x)2 = 5786 = 642.89


n

s = (x-x)2 = 642.89 = 25.36


n
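
A minimal Python sketch checking the figures of Example 1 (the data list restates the nine values used in the solution):

from collections import Counter
import math

data = [5, 8, 15, 29, 47, 47, 64, 71, 74]
n = len(data)

mean = sum(data) / n                                         # 40.0
median = sorted(data)[n // 2]                                # 47 (n is odd)
mode = Counter(data).most_common(1)[0][0]                    # 47
variance = sum((x - mean) ** 2 for x in data) / n            # about 642.89
sd = math.sqrt(variance)                                     # about 25.36
print(mean, median, mode, round(variance, 2), round(sd, 2))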

Example 2
Calculate the mean, variance and standard deviation from the following data:
Value: 90-99 80-89 70-79 60-69 50-59 40-49 30-39
Frequency: 2 12 22 20 14 4 1

Solution:
Class     Class boundary   Mid-value (x)   Frequency (f)   fx       (x – x̄)   (x – x̄)²   f(x – x̄)²
90-99     89.5-99.5        94.5            2               189      26.4      696.96     1393.92
80-89     79.5-89.5        84.5            12              1014     16.4      268.96     3227.52
70-79     69.5-79.5        74.5            22              1639     6.4       40.96      901.12
60-69     59.5-69.5        64.5            20              1290     -3.6      12.96      259.20
50-59     49.5-59.5        54.5            14              763      -13.6     184.96     2589.44
40-49     39.5-49.5        44.5            4               178      -23.6     556.96     2227.84
30-39     29.5-39.5        34.5            1               34.5     -33.6     1128.96    1128.96
Total                                      Σf = 75         Σfx = 5107.5                  Σf(x – x̄)² = 11728.00

x̄ = Σfx / Σf = 5107.5 / 75 = 68.1
s² = Σf(x – x̄)² / Σf = 11728 / 75 ≈ 156.4
s = √s² ≈ 12.5

4.5 Coefficient of Variation / Dispersion


S.d is only an absolute measure of dispersion, depending upon the units of measurement. The relative
measure of dispersion based on s.d is called the coefficient of s.d and is given by
Coefficient of s.d = S / x̄
This is a pure number independent of the units of measurement and thus, is suitable for comparing the
variability, homogeneity or uniformity of two or more distributions.

100 times the coefficient of dispersion based on sd is called the coefficient of variation, abbreviated as cv.
Thus: CV = (S / x̄) × 100

For comparing the variability of two distributions we compute the CV for each distribution. A
distribution with smaller CV is said to be more homogeneous or uniform or less variable than the other
and the series with greater CV is said to be more heterogeneous or more variable than the other.

Example:
Two workers on the same job show the following results over a long period of time:
Worker A Worker B
Mean time of completing the job (minutes) 30 25
Standard deviation (minutes) 6 4

(i) Which worker appears to be more consistent in the time she requires to complete the job?
(ii) Which worker appears to be faster in completing the job? Explain.

Solution:
(i) We know CV = S/x x 100
CV (for worker A) = 6x100/30 = 20
CV (for worker B) = 4x100/25 = 16
Since CV (B) is less than CV (A), the worker B appears to be more consistent in the time she requires to
complete the job.

(ii) Since x̄B < x̄A, ie, on the average the worker B takes less time than worker A to complete the job,
the worker B appears to be faster in completing the job.
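
A minimal Python sketch of the comparison (the means and standard deviations are the figures given for the two workers above):

workers = {"A": (30, 6), "B": (25, 4)}            # (mean time, standard deviation)

for name, (mean, sd) in workers.items():
    cv = sd / mean * 100                          # coefficient of variation
    print(name, cv)                               # A 20.0, B 16.0 -> B is more consistent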

4.6 Skewness and Kurtosis


As we have already noted, we need statistical measures which will reveal clearly the salient features of a
frequency distribution. The measures of central tendency tell us about the concentration of the
observations about the middle of the distribution and the measures of dispersion give us an idea about the
spread or scatter of the observations about some measure of central tendency. We may come across
frequency distributions which differ very widely in their nature and composition and yet may have the
same central tendency and dispersion. For example, the following two frequency distributions have the
same mean x=15 and s.d=6, yet they give histograms which differ very widely in shape and size.

Frequency distribution I Frequency distribution II


Class Frequency Class Frequency
0-5 10 0-5 10
5-10 30 5-10 40
10-15 60 10-15 30
15-20 60 15-20 90
20-25 30 20-25 20
25-30 10 25-30 10
Therefore, the two measures, ie, central tendency and dispersion are inadequate to characterize a
distribution completely and they must be supported and supplemented by two more measures ie,
skewness and kurtosis. Skewness helps in the study of the shape, i.e., the symmetry or asymmetry of the
distribution, while kurtosis refers to the flatness or peakedness of the curve which can be drawn with the help
of the given data. These four measures, i.e., central tendency, dispersion, skewness and kurtosis are
sufficient to describe a frequency distribution completely.

Skewness. Literal meaning of skewness is ‘lack of symmetry’. We study skewness to have an idea about
the shape of the curve which we can draw with the help of the given frequency distribution. It helps us to
determine the nature and extent of concentration of the observations towards the higher or lower values of
the variable. A distribution with an asymmetric tail extending out to the right is referred to as “positively
skewed” or “skewed to the right,” while a distribution with an asymmetric tail extending out to the left is
referred to as “negatively skewed” or “skewed to the left.” Skewness can range from minus infinity to
positive infinity.

Skewness
A score of zero infers a perfectly symmetrical distribution
Negative scores infer a negative skew
Positive scores infer a positive skew

Kurtosis
A score of zero infers a mesokurtic curve
Negative scores infer a platykurtic curve (too flat)
Positive scores infer a leptokurtic curve (too pointed)

The more each score deviates from zero, the more the curve deviates from a normal distribution.

CORRELATION ANALYSIS

Introduction
This is a technique used to measure the strength of the relationship between two variables. It’s very hard
at times to discuss correlation without mentioning regression. The purpose of regression is to identify a
relationship of a given set of bivariate data. What it does not do however is to give any indication of how
good this relationship might be, correlation therefore comes in to provide the measure of how well a line
of best fit describes the scattered point.

Correlation means the existence of some definite relationship between two or more variables. It means
that if two quantities vary in such a way that movements in one are accompanied by movements in the
other, these quantities are said to be correlated. For example, there exists some relationship between
family income and expenditure on luxury items, price of a commodity and amount demanded. There are
also appearing to be related in some way to movements in one or several other factors. For example, a
marketing manager may observe that sales increase when there has been a change in advertising
expenditure. The transport manager may notice that as vans and Lorries cover more miles then the need
for maintenance becomes more frequent.
Certain questions may arise in the mind of the manager or analyst. These may be summarized as follows:
(i) Are the movements in the same or in opposite direction?
(ii) Could changes in one phenomenon or variable be causing or be caused by movements in the other
variable?
(iii) Could apparently related movements come about purely by chance?
(iv) Could movements in one factor or variable be as a result of combined movements in several other
factors or variables?
(v) Could movements in two factors be related, not directly, but through movements in a third
variable hitherto unnoticed?
(vi) What is the use of this knowledge anyway?

5.2 Some Definitions of Correlation


(i) Connor: ‘If two or more quantities vary in sympathy, so that movements in one tend to be
accompanied by movements in the other, they are said to be correlated’.
(ii) Dr Bowley: ‘When two quantities are so related that the fluctuations in one are in sympathy with
the fluctuations in the other, so that an increase or decrease in one is accompanied by an increase or
decrease in the other, and the greater the magnitude of the change in one, the greater is the
magnitude of the change in the other, the quantities are said to be correlated’.

Correlation and Causation


Correlation analysis helps us in determining the degree of relationship between two or more variables – it
does not tell us anything about cause-effect relationship. Even a high degree of correlation does not
necessarily mean that a relationship of cause and effect exists between the variables, or simply stated,
correlation does not necessarily imply causation though the existence of causation always implies
correlation. The correlation may be due to pure chance.

Scatter diagram: Female values (vertical axis, 0 to 20) plotted against Male values (horizontal axis, 1 to 13) for the data below.
MALE FEMALE

9 6
10 8
13 20
5 9
6 11
10 9
Correlation therefore is concerned with describing the strength of relationship between two variables by
measuring the degree of scatter of the data value. The less scattered the data values are the stronger the
correlation is said to be.

TYPES OF CORRELATION
i) Positive & Negative Correlation
If 2 variables move in the same direction, they are said to be +ve correlated. On the other hand if the
variables move in different direction, they are said to be – vely correlated.

ii) Linear on non – linear correlation (Curvilinear)


This distinction is based on the constancy of the ratio of change between the variables.
If the amount of change in one variable tends to bear a constant ratio to the amount of change in the
other variable then the correlation is said to be linear. On the other hand, if the amount of change in
one variable does not tend to bear a constant ratio to the amount of change in the other variable then
the correlation is said to be nonlinear.

iii) Simple, partial and multiple correlations.


The distinction between simple, partial & multiple correlation is based upon the number of variables
studied when only two variables are studied it’s a problem of simple correlation, when three or more
variables are studied it’s a problem of multiple or partial correlation.

5.6 Methods of Studying correlation


The key methods include:
(i) Scatter diagram method
(ii) Karl Pearson’s coefficient of correlation
(iii) Spearman’s rank correlation coefficient and
(iv) Method of least squares (regression analysis).

Of these, the scatter diagram method is based on the knowledge of graphs whereas the others are
mathematical techniques.

A) Scatter Diagram
 It is a basic way of assessing correlation.
 It involves plotting dots for the paired values of the two variables on a chart called a scatter diagram; this
is normally done on graph paper.

MERITS (ADVANTAGES)
 It is simple and non – Mathematical.
 It is not influenced by extreme values.
 It is normally the 1st step in investigating correlation.

DEMERITS (DISADVANTAGES)
It is based on estimation; therefore different people can reach different answers on the same problem.
Typical scatter patterns (diagrams omitted): +ve correlation, no correlation, -ve correlation.

Pearson’s product Moment Correlation Coefficient

We need a way of measuring the strength of the correlation between 2 variables. This is achieved through
a correlation coefficient, normally represented by the symbol "r", whose value lies between minus one
and plus one.

-ve correlation                                                        +ve correlation
-1.0  -0.9  -0.8  -0.7  -0.6  -0.5  -0.4  -0.3  -0.2  -0.1   0   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
Strong      Moderate        Weak         No correlation          Weak        Moderate        Strong

Steps in interpreting r:
1. Identify the sign of the coefficient r.
2. Judge the strength or weakness of the correlation from its magnitude.

Calculation of Pearson Product moment correlation

r = (nΣxy – Σx Σy) / √[ (nΣx² – (Σx)²) (nΣy² – (Σy)²) ]

This provides a measure of the strength of association between two variables, one the dependent variable,
the other the independent variable. This coefficient-denoted ‘r’ gives us an indication of the strength of
the linear relationship between two variables. There are several possible formulae but a practical one is:

Interpretation of the Coefficient (r)


First of all, the coefficient of correlation lies between –1 and +1, symbolically, -1 r  1. Thus:
(i) When r = +1 it means that there is perfect positive correlation between the variables.
(ii) When r = -1, it means that there is perfect negative correlation between the variables.
(iii) When r = 0, it means that there is no correlation between the variables, ie, the variables are
uncorrelated.
(iv) The closer the r is to +1 or –1, the closer the relationship between the variables and the closer r is
to 0, the less close the relationship. Therefore, the higher the value of r, the better the estimate.

Example

Using the Ron’s Department Data:


X Y XY X2 Y2
100 9 900 10,000 81
105 8 840 11,025 64
90 5 450 8,100 25
80 2 160 6,400 4
80 4 320 6,400 16
85 6 510 7,225 36
87 4 348 7,569 16
92 7 644 8,464 49
90 6 540 8,100 36
95 7 665 9,025 49
93 5 465 8,649 25
85 5 425 7,225 25
85 4 340 7,225 16
70 3 210 4,900 9
85 3 255 7,225 9
1322 78 7072 117,532 460

r= nXY – (X) (Y)


nX2 – (X2) nY2 – (Y)2

r= 15 x 7072 – (1322) (78)


15(117532) – (1322)2 [15(460) – (78)2]

r= 106080 – 103116 = 2964


1762980 – 1747684 6900 – 6084 12481536
r = 2964 = 0.8389
353292

The coefficient is high and positive, meaning that on the evidence of this data, an increase in advertising
expenditure has a positive and large impact on sales.
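
A minimal Python sketch of the product-moment calculation (the x and y lists restate the advertising and sales figures of the table above):

import math

x = [100, 105, 90, 80, 80, 85, 87, 92, 90, 95, 93, 85, 85, 70, 85]
y = [9, 8, 5, 2, 4, 6, 4, 7, 6, 7, 5, 5, 4, 3, 3]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 4))   # about 0.8389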

Interpretation of ‘r’ = Caution


One disturbing feature of many students’ work is that although they can calculate the value of the
coefficient of correlation in a perfectly efficient manner they completely ignore any effort to interpret
what their results mean. What is the point of calculating a statistic if you cannot interpret?

Even though analysis indicates that correlation exists between the variables, we are not justified in
assuming that there is therefore, a cause and effect relationship. We must never fall into the trap of
assuming that cause and effect exists when it is nonsense to do so. Therefore, care is needed in the
interpretation of the calculated value of r. A high value (above +0.9 or –0.9) only shows a strong
association between the two variables if there is a causal relationship, ie, if a change in one variable
causes changes in the other. But a high positive correlation between television sales and annual
admissions into mental institutions does not indicate any causal association. It would be ludicrous to suggest
that cause and effect exists here. The only logical conclusion that one can draw is that, quite by chance,
both statistics were increasing at the same rate. In this case, it is quite clear that cause and effect is not
proved by a high correlation coefficient.

Also, a low correlation coefficient, somewhere near zero, does not always mean that there is no
relationship between the variables. All it says is that there is no linear relationship between the variables
– there may be a strong relationship but of a non-linear kind.

A further problem in interpretation arises from the fact that 'r' measures the relationship between a single
independent variable and a single dependent variable, whereas a particular variable may be dependent on
several independent variables, in which case multiple correlation should be calculated rather than the
simple two-variable coefficient.

To conclude, one may ask: if we can never use correlation analysis to prove a cause and effect
relationship, why study it?

ASSIGNMENT
(1) The data in the table below relate the weekly maintenance cost (in millions) to the age (in months)
of 10 machines of similar type in a manufacturing company.

Machine 1 2 3 4 5 6 7 8 9 10
Age 5 10 15 20 30 30 30 50 50 60
Cost 190 240 250 300 310 335 300 300 350 395

Required
1. Calculate the Pearson product moment correlation and interpret it.
2. As a manager what strategy would you recommend on the machine?

(2) The following data, obtained from claims drawn on life assurance policies for a particular
category of employee, relate age at official retirement to age at death for 9 males.

Age at retirement 57 62 60 57 65 60 58 62 56
Age at death 71 70 66 70 69 67 69 63 70
Required

1. Calculate the Pearson product moment correlation and comment on the correlation.
2. As a claim manager what basic decision can you make about this correlation?

(3) Calculate the Karl Pearson coefficient of correlation for the following ages of
husbands and wives at the time of their marriage.

Age of Husbands 23 27 28 28 28 30 30 33 35 38
Age of wife 18 20 22 27 21 29 29 29 28 29
The Rank Correlation Coefficient (e)
Sometimes we come across statistical series (data) in which the variables under consideration are not
capable of quantitative measurement but can be arranged in serial order. This happens when we are
dealing with qualitative characteristics (attributes) such as honesty, beauty, character, morality etc, which
cannot be measured quantitatively but can be arranged serially. Therefore, Charles Spearman, a British
psychologist, developed a formula in 1904 which consists in obtaining the correlation coefficient between
the ranks of individuals in the attributes under study.

Therefore, the purpose of the rank coefficient is to establish whether there is any form of association
between two variables when the variables are arranged in a ranked form.

The formula is as follows:


e = 1 − [6Σd²] / [n(n² − 1)]
Where:
e (Rho) is the rank correlation coefficient
d = difference between the pairs of ranked values
n = number of pairs of rankings.

Notes:
(i) The limits for e are between –1 and +1, i.e., −1 ≤ e ≤ +1, and, like r, it has a similar interpretation.
(ii) As with r, care should be taken in interpreting the value of e, whether it is a particularly high
or a particularly low value.
(iii) Since the square of a real quantity is always non-negative (≥ 0), Σd², being a sum of non-negative
quantities, is also non-negative. Therefore e ≤ 1, and equality (e = 1) holds if and only if Σd² = 0,
i.e., if and only if each d = 0 (the two rankings are identical).

In rank correlation, we may have two types of problems:


a) Where actual ranks are given
b) Where ranks are not given.

a) When actual ranks are given


Procedure
 Compute d, the difference of ranks
 Compute d2
 Obtain the sum of the d² values, i.e., Σd²
 Use the formula to get the value of e

Example
Ten competitors in a beauty contest are ranked by 3 judges in the following order.
1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7

Use the rank correlation coefficient to determine which pair of judges has the nearest approach to
common tastes in beauty

Solution:
R1 R2 R3 d12 d13 d23 d12² d13² d23²
(d12 = R1−R2, d13 = R1−R3, d23 = R2−R3)
1 3 6 -2 -5 -3 4 25 9
6 5 4 1 2 1 1 4 1
5 8 9 -3 -4 -1 9 16 1
10 4 8 6 2 -4 36 4 16
3 7 1 -4 2 6 16 4 36
2 10 2 -8 0 8 64 0 64
4 2 3 2 1 -1 4 1 1
9 1 10 8 -1 -9 64 1 81
7 6 5 1 2 1 1 4 1
8 9 7 -1 1 2 1 1 4
Totals 0 0 0 200 60 214

e12 = 1 − 6Σd12² / [n(n²−1)] = 1 − (6 × 200) / [10(10²−1)] = 1 − 1200/990 = −0.212

e13 = 1 − (6 × 60)/990 = 1 − 0.3636 = 0.6364

e23 = 1 − (6 × 214)/990 = 1 − 1.2970 = −0.297

Since e13 is maximum, the pair of first and third judges have the nearest approach to common tastes in
beauty. Since e12 and e23 are negative, the pair of judges (1,2) and (2,3) have opposite (divergent) tastes
for beauty.

Example 2
The personnel department of a large company is investigating the possibility of assessing the suitability of
applicants by using psychological tests instead of normal interview procedures. A comparative test of
seven applicants was carried out using both methods. The results were as follows:

Applicant Ranking by Interview procedure (x) Psychological tests (y)


A 4 5
B 1 2
C 7 7
D 6 4
E 2 1
F 3 3
G 5 6

You are required to:


a) Calculate the rank correlation coefficient
b) Interpret the result established.
Solution
Applicant A B C D E F G
X 4 1 7 6 2 3 5
Y 5 2 7 4 1 3 6
d -1 -1 0 2 1 0 -1
d² 1 1 0 4 1 0 1 Σd² = 8

e = 1 − 6Σd² / [n(n²−1)] = 1 − (6 × 8) / [7(49−1)] = 1 − 48/336 = 1 − 0.1429 = 0.857

(b) There is a high positive correlation (0.857) between the rankings produced by the two methods, suggesting that they could be used interchangeably.
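
A minimal sketch of the same calculation in Python (list names are illustrative; the lists hold the interview and psychological-test rankings from the solution table):

# Spearman's rank correlation coefficient from two sets of ranks
interview = [4, 1, 7, 6, 2, 3, 5]     # ranks from the interview procedure (x)
psych = [5, 2, 7, 4, 1, 3, 6]         # ranks from the psychological test (y)

n = len(interview)
sum_d2 = sum((a - b) ** 2 for a, b in zip(interview, psych))   # sum of squared rank differences
e = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(e, 3))   # 8 and 0.857, as above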

When ranks are not given:


Procedure:
Spearman's rank correlation formula can also be used even if we are dealing with variables which are
measurable quantitatively, i.e., when the actual data rather than the ranks relating to the two variables are
given. In such a case, we have to convert the data into ranks. The highest (or the lowest) observation is
given rank 1, the next highest (or next lowest) observation is given rank 2, and so on. It is immaterial in
which direction (descending or ascending) the ranks are assigned; however, the same approach should be
followed for all the variables under consideration.

Example 3
Compute the rank correlation coefficient between advertisement cost and sales revenue from the
following series:

(X) Advertisement (Ksh000) 39 65 62 90 82 75 25 98 36 78


(Y) Sales (Ksh million) 47 53 58 86 62 68 60 91 51 84

Solution
Let X denotes the advertisement cost and Y denote the sales revenue.

X Y Rank of X Rank of Y d (= rank of X − rank of Y) d²
39 47 8 10 -2 4
65 53 6 8 -2 4
62 58 7 7 0 0
90 86 2 2 0 0
82 62 3 5 -2 4
75 68 5 4 1 1
25 60 10 6 4 16
98 91 1 1 0 0
36 51 9 9 0 0
78 84 4 3 1 1
Σd² = 30
(Ranking is from the largest to the smallest value.)
e = 1 − 6Σd² / [n(n²−1)] = 1 − (6 × 30) / [10(10²−1)] = 1 − 180/990 = 1 − 0.1818 = 0.818
There is a high positive correlation between advertisement expenditure and sales revenue.
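
Where only raw figures are given, as in Example 3, the data must first be converted into ranks. The sketch below (Python; the helper rank_desc and the list names are illustrative, and it assumes there are no tied values) ranks from the largest value downwards and then applies the same formula:

# Spearman's rank correlation from raw (unranked) data
advert = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]   # advertisement cost (X)
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]    # sales revenue (Y)

def rank_desc(values):
    # rank 1 = largest value; assumes no ties
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

rx, ry = rank_desc(advert), rank_desc(sales)
n = len(advert)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
e = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(e, 3))   # 30 and 0.818, as in the worked example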
Remarks
1. One great advantage of rank correlation is that it can be used to rank attributes which cannot be given
a numerical value. Thus, we could assess the consistency with which different panels of judges
assess the contestants in beauty contests such as Miss World. In the same way, we could assess
whether different methods of selecting applicants for employment are likely to lead to the same
results. Do we for example, end up with the same ranking from a personal interview as we would
from a written test or a psychological test?
This is the only formula that can be used to find a correlation coefficient when we are dealing with
qualitative characteristics.
2. Another advantage is that it can also be used where actual (quantitative) data are given, by first converting the data into ranks.

3. Since d =0 (always) it provides a check for the numerical calculations.


4. Also, since Spearman’s rank correlation coefficient is nothing but Pearsonian correlation coefficient
between the ranks, it can be interpreted in the same way as the Karl Pearson correlation coefficient.
5. Finally, the rank correlation coefficient is easier to compute than the Karl Pearson correlation
coefficient.

Limitations
1. The rank correlation coefficient cannot be used for finding out correlation in a grouped frequency
distribution.
2. If n > 30, the formula should not be used unless the ranks are given, since otherwise the
calculations become quite time consuming.

REGRESSION ANALYSIS
Introduction
This is the measure of the average relationship between two or more variables in terms of the original
units of the data.
It is concerned with estimating the value of one variable when the value of the other variable is known.
Definition
The word ‘regression’ refers to the act of returning or going back. It is concerned with the estimation of
the relationship between variables. Regression analysis helps statisticians to determine the probable form
of the relationship between variables. In other words, one is able to find out the extent to which one
variable changes in response to a given change in the other. It can be used to predict or estimate the
unknown value of one variable corresponding to a given value of another variable. It also reveals the
nature of the relationship between the variables and hence determines the average probable change in one
variable when a certain amount of change in the other is known.



The Main Uses of Regression Analysis


(i) Through the methods of extrapolation and interpolation, it provides estimates of values of the
dependent variables from the values of the independent variables.
(ii) Regression analysis also enables us to obtain a measure of the error involved in using the
regression line as a basis for estimation.
(iii) Regression analysis also enables us to compute the coefficient of determination – which obtains a
measure of association or correlation that exists between the two variables.

Differences between Correlation and Regression


1. The objective of regression analysis is to study the relationship between the variables while the
coefficient of correlation is a measure of the degree of relationship between X and Y.
2. The cause and effect relation is clearly indicated through regression analysis but in correlation we
cannot say that one variable is the cause and the other the effect.
3. Correlation analysis is confined only to the study of linear relationship between the variables and
therefore, has limited applications. Regression analysis has much wider applications as it studies both
linear as well as non-linear relationship between the variables.
4. There may be nonsense correlation between two variables which is due to mere chance and has no
practical relevance. There is no such thing like nonsense regression.

METHODS OF CALCULATING REGRESSION

Simple Linear Regression Model

A simple linear regression involves an attempt to develop a straight-line (linear) mathematical model to
describe the relationship between two variables. A linear or straight-line equation is normally used
because such equations closely approximate many real relationships and are easy to work with and
interpret.

(Sketch: a straight line y = a + bx + m, cutting the y-axis at a.)

a – represents the y-intercept.

(The intercept specifies the value of the dependent variable when the independent variable has a value of 0;
it is also referred to as a constant.)

b – is another constant indicating the slope of the regression line. (The slope indicates the
amount of change in the value of the dependent variable for a unit change in the independent variable.)

That is: b = (change in Y) / (change in X)
Y = Dependent

X = Independent / Explanatory/ Predictor

M = random term (also called the stochastic term, error term or disturbance term).

These names mean that there are other factors that can explain change in Y which are not quantifiable.
These are factors that are not captured by the regression line and thus are absorbed by M.

Least Square Method of Calculating Regression


This method of obtaining the regression line has a mathematical base: the least squares method of fitting a line
(the line of best fit, or regression line) through the scatter diagram is the method which minimizes the sum of the
squared vertical deviations from the fitted line.

Methods under the least squares model

a) Simultaneous (normal) equations

Σy = na + bΣx

Σxy = aΣx + bΣx²

Example one
An agronomist experimented with different amounts of liquid fertilizer on a sample of equal-sized plots; the
amount of fertilizer and the yield per acre in tonnes are given below.

X – (Independent variable) Y- (dependent variable)

Plots Amount of fertilizer (tonnes) Yield (tonnes)

A 2 7

B 1 3

C 3 8

D 4 10

Required

a. Identify the dependent & independent variable.


b. Draw a scatter diagram illustrating the relationship between the dependent and independent variables.
c. Determine the regression line & interpret it in line with both dependent and independent variable.
(Scatter diagram: yield in tonnes plotted against amount of fertilizer in tonnes.)
X Y XY X2
2 7 14 4
1 3 3 1
3 8 24 9
4 10 40 16
Totals: 10 28 81 30

i) Σy = na + bΣx:  28 = 4a + 10b
ii) Σxy = aΣx + bΣx²:  81 = 10a + 30b

Multiplying (i) by 3 gives 84 = 12a + 30b; subtracting (ii) gives 3 = 2a, so a = 1.5 and, substituting back,
b = 2.2. (Students should verify the value of b.)

Determining Regression Line

Y = a + bx
Y = 1.5 + 2.2x
Ignore the value of a when interpreting the relationship; focus on the term +2.2x.
Y and X are positively related (the yield and the fertilizer are positively related).
A unit increase in X causes an increase in Y of 2.2 (one extra tonne of fertilizer raises the yield by 2.2 tonnes).
Question d
Calculate the yield that 8 tonnes of fertilizer will produce.
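
The normal equations can also be solved in a few lines of code. The sketch below (Python; list names are illustrative) recovers a = 1.5 and b = 2.2 for the fertilizer example from the standard least-squares formulas, and the fitted line can then be used to answer question d:

# Simple linear regression y = a + bx by least squares
fert = [2, 1, 3, 4]        # x: amount of fertilizer (tonnes)
crop = [7, 3, 8, 10]       # y: yield (tonnes)

n = len(fert)
sum_x, sum_y = sum(fert), sum(crop)
sum_xy = sum(xi * yi for xi, yi in zip(fert, crop))
sum_x2 = sum(xi * xi for xi in fert)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope, 2.2
a = sum_y / n - b * sum_x / n                                  # intercept, 1.5
print(round(a, 2), round(b, 2))
print(round(a + b * 8, 2))   # estimated yield for 8 tonnes of fertilizer (question d)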

THE SECOND FORMULA

b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]

a = (Σy / n) − b(Σx / n)

Estimated regression line: Y = a + bX

A company is planning to invest heavily in a marketing programme. The initial performance of the
marketing programme is as follows:
X Amount invested (M) 5 10 15 20 30
Y Volume of sales (Tonnes) 190 240 250 300 310

Required

a. Determine the dependant and independent variables.


b. Determine the regression line and show it using a scatter diagram (10Mks)
c. Advise the board of directors for or against their decision to pump a total of 50 million into the
investment programme.
(The value of b = 4.76; substitute to get the value of a.)
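
A minimal sketch of the second formula in Python (list names are illustrative; the data are the marketing figures above). It confirms the stated value b of about 4.76 and then obtains a by substitution:

# Regression coefficients from the direct ("second") formulas
invest = [5, 10, 15, 20, 30]          # x: amount invested (millions)
sales = [190, 240, 250, 300, 310]     # y: volume of sales (tonnes)

n = len(invest)
sum_x, sum_y = sum(invest), sum(sales)
sum_xy = sum(xi * yi for xi, yi in zip(invest, sales))
sum_x2 = sum(xi * xi for xi in invest)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # about 4.76, as stated above
a = sum_y / n - b * sum_x / n                                  # a = mean of y minus b times mean of x
print(round(b, 2), round(a, 2))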

Calculation of regression using the deviation method

b = Σxiyi / Σxi²     where xi = X − (mean of X) and yi = Y − (mean of Y)

a = (mean of Y) − b(mean of X)
X y xi yi xi2 xiyi
16 21 -2.8 0.2
18 22 -0.8 1.2
11 27 -7.8 6.2
23 14 4.2 -6.8
26 20 7.2 -0.8
94 104

Mean of x = 18.8 mean y = 20.8


Substitute and find the value of a and b

NORMAL PROBABILITY DISTRIBUTION


Normal Distribution – the Z Values

In statistics, we usually deal with the normal, binomial and Poisson distributions. The binomial and
Poisson distributions are discrete probability distributions (i.e., the variables under study are discrete
random variables).

The normal distribution or normal curve is the most important continuous theoretical distribution in
statistics. It occurs frequently when describing natural occurrences and is of particular importance in
sampling theory and statistical inference. Also, most of the data relating to economic and business
statistics or even in social and physical sciences conform to this distribution.

The normal distribution was first discovered by the mathematician Abraham De Moivre (1667-1754) in 1733,
who obtained the mathematical equation for this distribution while dealing with problems arising in games
of chance. The normal distribution is also known as the Gaussian distribution (Gaussian law of errors)
after Carl Friedrich Gauss (1777-1855), who used this distribution to describe the theory of accidental
errors of measurement involved in the calculation of the orbits of heavenly bodies.

The Main Features of the Normal Distribution


(i) It is a continuous distribution.
(ii) It is a perfectly symmetrical bell-shaped curve. The top of the bell is directly above the mean (u).

Its equation is:

y = [1 / (σ√(2π))] e^(−(x−u)² / (2σ²)), with the peak of the curve at x = u.

(iii) The curve is symmetrical about the line x=u, (z=0), i.e., it has the same shape on either side of the
line x=u (or z=0).
(iv) Since the distribution is symmetrical, mean, median and mode coincide. Thus, Mean = Median =
Mode = u. Thus, the distribution is unimodal, the only mode occurring at x=u.

(v) No portion of the curve lies below the x-axis, since p(x) being the probability can never be
negative. Thus, the ‘tails’ of the distribution continually approach, but never touch, the horizontal
axis.

Using the Normal Area Table


Normal area tables show the area under the normal curve between the mean and any given number of s.d
(including fractional values). To use the tables, it is necessary to calculate the standardized variate (often
called the ‘Z’ score) which is simply the number of s.d that the required value is away from the mean.
Z is calculated thus: Z = |x − u| / σ
Where: Z = number of standard deviations above or below the mean
x = the value of the variable being considered
u = the population mean
σ = the population standard deviation
The parallel lines (| |) round the expression mean that we are normally only concerned with the number of
standard deviations calculated, not whether the answer is negative or positive.

Applications of the Standardized Normal Curve

(Figure: the standard normal curve y = [1/(σ√(2π))]e^(−(x−u)²/(2σ²)), showing an area of 0.5000 to the left of the mean and a shaded area of 0.1915 between z = 0 and z = 0.5.)

If we want to compute areas such as those between 0 and −0.5 or between 0 and +0.5, we do not need to keep
evaluating the (y) formula above. Tables of these integrals have been compiled and, due to the
symmetry of the curve, usually take the following form: values of z are listed, running from 0 upwards,
and opposite each an area is given, which is the area under the curve between 0 and z.

For example, if we look up the value corresponding to z=0.5 in a set of these tables, we find a figure of
0.1915. This means that the shaded area (shown in the fig above) is 0.6915, i.e., 0.5000+0.1915 = 0.6915.

Now, suppose we require p (-1<z<+1). We can use the tables to find this probability as follows: by
symmetry the area between –1 and +1 equals twice the area between 0 and 1.

The area between 0 and 1 is 0.3413. Therefore, the required probability is
2 × 0.3413 = 0.6826, i.e., p(−1 < z < +1) = 0.6826.
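
Where a printed normal area table is not to hand, the same areas can be obtained from the cumulative distribution function of the standard normal. A minimal sketch in Python using only the standard library (the helper name phi is illustrative):

# Areas under the standard normal curve
from math import erf, sqrt

def phi(z):
    # P(Z <= z) for the standard normal distribution
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(0.5) - phi(0.0), 4))    # area between z = 0 and z = 0.5, about 0.1915
print(round(phi(1.0) - phi(-1.0), 4))   # p(-1 < z < +1), about 0.6827 (the table gives 0.6826)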

EXERCISE
1. A banker claims that the life of a regular savings account opened with his bank averages 18 months,
with a standard deviation of 6.45 months.
What is the probability that a savings account opened with the same bank will still hold money after 22
months?

What is the probability that the account will have been closed before 2 years?

2. Regarding a certain distribution of the incomes of individuals, we are given that the mean is
500Km and the standard deviation is 100Km.
Required

Find the probability that an individual selected at random will belong to the income group:


(a) Between 550Km and 650 Km
(b) Between 420Km to 570 Km.
3. Weights of bags of potatoes are normally distributed with a mean of 5 kg and a standard deviation of
0.2 kg. The potatoes are delivered to a supermarket at the rate of 200 bags at a time.

Required
a) What is the probability that a randomly chosen bag will weigh more than 5.5 kg?
b) How many bags from a single delivery would be expected to weigh more than 5.5 kg?

4. 1,000 students sat for an examination. The marks obtained followed a normal distribution with a mean of
50 and a standard deviation of 10.

Required
Find the number of students who scored.
i) Between 50 and 60 marks
ii) Between 50 marks
iii) Above 70 marks
iv) Between 30 and 40 marks
5. Draw and determine the area under the Normal Curve for the following.
i) Between Z = -1.88 and Z= -2.44
ii) Between Z= -1.64 and Z= -1.77
iii) Between Z= 1.44 and Z= -1.84
iv) To the left of Z= -1.97
v) To the left of Z= -2.88
vi) Between Z= 1.96 and Z= -1.89

STATISTICAL INFERENCE

Introduction
Statistical inference can be defined as the process by which conclusions are drawn about some
measure or attribute of a population (e.g., the mean or s.d.) based upon analysis of sample data. In
other words, the process of drawing inferences about a population on the basis of information
contained in a sample taken from the population is called statistical inference. Statistical inference is
traditionally divided into two main branches.
 Estimation of parameters; and
 Testing of hypothesis

Confidence Limits and Interval


Confidence limits are the outer limits to a confidence zone or interval. This is a zone of values within
which we may be confident that the true population mean (or the parameter being considered) does lie.
These limits are called the confidence limits since they are the limits determined by the chosen level of
confidence. The interval between these limits is called the confidence interval. The higher the
confidence level, the greater the confidence interval.

The first step in constructing a confidence interval is to decide how much confidence we want that this
interval will contain the population value. The following procedure is adopted in interval estimation:
(i) The particular statistic, say the mean of the sample or standard deviation of the sample is
determined.
(ii) The confidence level is decided, i.e., 95%, 99% etc.
(iii) The standard error of the particular statistic is calculated.
(iv) Finally, we state with a known degree of confidence that the parameter lies in this interval.

Confidence limits, for population means, are based on the sample mean, the standard error of the mean
and on the known characteristics of normal distribution.
It is known that a normal distribution has the following characteristics:
 Mean ± 1.96σ includes 95% of the population
 Mean ± 2.58σ includes 99% of the population.

These characteristics can be used to calculate confidence limits for the population mean when we have
established the sample mean and the standard error.

Estimation of Population Mean


The use of sample mean (x) to estimate the population mean (u) is a common procedure in statistics. If
we take a large sample (sample size not less than thirty) from a population, then we are of the opinion that
the mean of the population is very close to the mean of the sample.
To estimate a population mean, the following procedure is followed:
(i) Take a random sample of n items. Where n represents sample size and it is not less than 30.
(ii) Compute the sample mean (x) and standard deviation (sx).
(iii) Compute the standard error of the mean by using the formula: Sx = S / √n
Where: Sx = standard error of the mean
S = standard deviation of the sample
n = sample size
(iv) Choose a confidence level e.g., 95% or 99%.
(v) Estimate the population mean as under:

Population mean (u) = x ± (appropriate number of Sx). The appropriate number depends on the confidence level:
at the 95% confidence level it is 1.96, and at the 99% confidence level it is 2.58.

Example 1
The quality department of a wire manufacturing company periodically selects a sample of wire specimens
in order to test for breaking strength. Past experience has shown that the breaking strengths of a certain
type of wire are normally distributed with standard deviation of 200 kg. A random sample of 64
specimens gave a mean of 6,200 kg. Find out the population mean at 95% level of confidence.

Solution:
Population mean = x ± 1.96 Sx
Here x = 6,200, n = 64, s = 200
Therefore Sx = s/√n = 200/√64 = 200/8 = 25

Population mean (u) = x ± 1.96 Sx
= 6,200 ± 1.96(25)
= 6,200 ± 49
i.e., from 6,200 − 49 = 6,151 to 6,200 + 49 = 6,249.

This means that we are 95% confident that the population mean lies within the confidence zone, i.e.,
somewhere between 6,151 and 6,249.
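
A minimal sketch of the same interval in Python (the figures are those of Example 1; 1.96 is the 95% multiplier):

# 95% confidence interval for a population mean, large sample
from math import sqrt

x_bar, s, n = 6200, 200, 64        # sample mean, standard deviation, sample size
se = s / sqrt(n)                   # standard error of the mean
lower, upper = x_bar - 1.96 * se, x_bar + 1.96 * se
print(se, lower, upper)            # 25.0, 6151.0, 6249.0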
Example 2
A firm purchases very large quantity of metal offcuts and wishes to know the average weight of an offcut.
A random sample of 625 offcuts is weighted and it is found that the mean sample weight is 150 grams
with a sample s.d of 30 grams.
(i) Compute the standard error of the mean.
(ii) What would be the standard error if the sample size was 1,225?
(iii) What is the estimate of the population mean? Use the 95% confidence level.

Solution:
(i) When n = 625, x = 150 and s = 30:
Sx = s/√n = 30/√625 = 30/25 = 1.2 grams.

(ii) When n = 1,225:

Sx = 30/√1,225 = 30/35 = 0.857 grams
It is seen that increasing the sample size reduces the standard error. This accords with common sense: we
would expect a large sample to be better than a smaller one.

(iii) At the 95% confidence level, where x = 150 grams and Sx = 1.2 grams:

u = x ± 1.96 Sx = 150 ± 1.96(1.2) grams
= 150 ± 2.35 grams, i.e., from 147.65 grams to 152.35 grams

This means that we are 95% confident that the population mean lies within the confidence zone, i.e.,
somewhere between 147.65 grams and 152.35 grams.

At the 99% confidence level the limits are:


u = 150 ± 2.58(1.2) grams
= 150 ± 3.1 grams
i.e., a range from 146.9 to 153.1 grams

Remarks
(i) Raising the confidence factor from 95% to 99% increases the assurance that the confidence zone
contains the population mean but it makes the estimate less precise.
(ii) Any confidence level could be chosen and the appropriate number of standard errors found from
the normal area tables. However, the 95% and 99% values of 1.960 and 2.580 are widely used
and should be committed to memory.

The principles involved in setting confidence limits can be used to determine what sample size should be
taken, if we wish to achieve a given level of precision.

For example, what sample size would be required in the offcuts example above if we wish to be 95%
confident that the population mean is within 2 grams of the sample mean?

To obtain 95% limits of ±2 grams requires a standard error of 1.02 grams, i.e., 2 grams / 1.96.
Therefore, as Sx = s/√n, then 1.02 = 30/√n
Therefore, √n = 30/1.02 = 29.4, so n = 29.4² ≈ 865
n = 865
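
The same sample-size reasoning in code (a minimal Python sketch; the figures 30 grams and 2 grams come from the example above):

# Sample size so that the 95% limits are within +/- 2 grams of the mean
from math import ceil

s = 30          # estimated population standard deviation (grams)
error = 2       # required half-width of the interval (grams)
z = 1.96        # multiplier for 95% confidence

n = (z * s / error) ** 2
print(n, ceil(n))   # about 864.4, so a sample of 865 is needed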
The principles of confidence limits have been illustrated in connection with means but their use is not
limited solely to the estimation of population means from a sample. The concept is applied across a broad
range of statistical applications.

HYPOTHESIS TESTING
Introduction
A hypothesis is an assumption, belief or opinion which may or may not be true. For example, it may
be believed that a given drug cures 90% of the patients taking it, or that the average height of soldiers in
the army is 168 cms. The testing of a statistical hypothesis is the process by which this belief or
opinion is tested by statistical means. It means that testing of a hypothesis is a procedure which
enables to decide on the basis of information obtained from sample data whether to accept or reject a
statement or an assumption about the value of a population parameter. We accept the hypothesis as
being true, when it is supported by the sample data. We reject the hypothesis when the sample data
fail to support it. The testing of hypothesis is the most important technique in statistical inference.
The tests are widely used in business and industry for making decisions.

It is important to understand what we mean by the terms reject and accept in hypothesis testing. The
rejection of a hypothesis is to declare it false. The acceptance of a hypothesis is to conclude that there is
insufficient evidence to reject it. Acceptance does not necessarily mean that the hypothesis is true.

Statistical hypothesis. It is a tentative conclusion that specifies the properties of a distribution of a


random variable. These properties generally refer to parameters of the population and are hypothetical
values with which the values of statistics derived from a sample are compared in order to find the
difference between statistic and corresponding parameter. In other words, statistical hypothesis is some
assumption or statement, which may or may not be true, about a population or about the probability
distribution characterizing the given population, which we want to test on the basis of the evidence from a
random sample.

Test of hypothesis. The testing of hypothesis is a procedure that helps us to ascertain the likelihood of
hypothesized population parameter being correct by making use of the sample statistic. In other words, it
is a process of testing significance, concerned with testing some hypothesis regarding a
parameter of the population on the basis of a statistic from the sample. In testing of hypothesis, a statistic is
computed from a sample drawn from the parent population and on the basis of this statistic, it is observed
whether the sample so drawn has come from the population with certain specified characteristic. The
value of sample statistic may differ from corresponding population parameter due to sampling
fluctuations. The test of hypothesis discloses the fact whether the difference between sample statistic and
the corresponding hypothetical population parameter is significant or not significant. Thus the test of
hypothesis is also known as the test of significance.

Functions of Hypotheses
Hypotheses serve the following functions;
i. Hypothesis provides tentative explanations of observed phenomena subject to the finding of
research. It is the strategy through which findings of investigative work can be tested for
acceptance or falsification.
ii. It offers guidance to the direction of the research by providing a framework that delimits the
direction and scope of relationships of variables under investigation. For example, a hypothesis
may assert the nature of the relationship, whether positive or negative, and/or the degree of the
relationship, whether strong or weak.
iii. Hypothesis and theory complement each other in the sense that a tested and a widely accepted
hypothesis may in itself be generalized into or contribute to a theory, while on the other hand
specific hypotheses may be constructed from general theories, particularly in the rare, but not
impossible instances when theories are subjected to tests for purposes of falsification.
iv. Guides the direction of study, identifies facts that are relevant and suggests the research design
that is appropriate

Characteristic of good Hypotheses


a) Hypothesis must be adequate for its purpose by revealing the original problem, clearly identifying
facts that are relevant, explaining the facts that gave rise to the need for explanation, suggesting
the form of most appropriate research design and providing for organized conclusion.
b) Hypothesis must be testable by ensuring that it uses acceptable techniques, requiring explanation
that is plausible given known physical and psychological laws and revealing consequences and
derivatives that can be deduced for testing purposes. Testability also requires that the hypothesis
is simple necessitating few conditions or assumptions
c) Hypothesis should explain more facts and scope than its rivals.
d) A good hypothesis should be clear and precise to enable reliability of inference drawn from the
hypothesis
e) Hypothesis should state relationship between variables, if it happens to be relational
f) It should be specific and simple and limited in scope, but significant to allow precise testing
g) Hypothesis should be consistent with theories and established facts and common sense.
Hypothesis should have empirical reference

Hypotheses and Theory


Cooper and Schindler (2008), defines theory as a set of systematically interrelated concepts, definitions
and propositions that are advanced to explain and predict a phenomenon. According to Mugenda and
Mugenda (2003), a theory is a set of concepts or constructs and the interrelationship that are assumed to
exist among those concepts. In this definition, a good theory can be logically broken down into a set of
hypotheses which can be verified by experiments and observations to provide the basis for empirical
generalizations. The process of developing hypotheses from theories and then testing the hypotheses
through observations and experiments is called deductive logic. On the other hand the process of
constructing hypotheses and then theory from repeated observations is called inductive logic.
The general distinction between theory and hypotheses lies in the degree of complexity and abstraction.
Theories are complex and abstract and involve many variables. Hypotheses, on the other hand, are simpler, with
limited variables involving concrete instances (Cooper and Schindler, 2008). A theory predicts events in
general terms, while a hypothesis makes reference about a specific set of circumstances. A theory has
been extensively tested and is generally accepted whereas hypothesis is in most cases speculative and is
yet to be tested. Table 1 gives comparison between hypothesis and theory under various themes:
Theme: Definition
Hypothesis: a suggested explanation for an observable phenomenon, or a prediction of a possible causal correlation among multiple phenomena.
Theory: in science, a theory is a well-substantiated, unifying explanation for a set of verified, proven hypotheses.

Theme: Based on
Hypothesis: suggestion, possibility, projection or prediction; the result is uncertain.
Theory: certainty, evidence, verification, repeated testing.

Theme: Testable
Hypothesis: Yes. Theory: Yes.

Theme: Falsifiable
Hypothesis: Yes. Theory: Yes.

Theme: Is well substantiated
Hypothesis: No. Theory: Yes.

Theme: Data
Hypothesis: usually based on very limited data.
Theory: based on a very wide set of data tested under various circumstances.

Theme: Instance
Hypothesis: specific; a hypothesis is usually based on a very specific observation and is limited to that instance.
Theory: general; a theory is the establishment of a general principle through multiple tests and experiments, and this principle may apply to various specific instances.

Types of Hypothesis
Literature review on types of hypothesis reveals that there is no standard way of classifying hypothesis.
However, various authorities concur that hypothesis can take the form of Statistical Hypothesis which is
purely quantitative and Research Hypothesis which extends beyond the quantitative boundary to allow
for deductive and inductive logic in qualitative paradigm.
a) Statistical Hypothesis
A statistical hypothesis is given in statistical terms. It is a statement about one or more parameters that are
measures of the population under study. This is done when it is time to test whether data support or refute
the research hypothesis. Using inferential statistics, conclusions about the population values are drawn from
the sample. This is done through a null or an alternative hypothesis.
b) Null Hypothesis
The null hypothesis is a statement put forward as true, or believed to be true, about the relationship between
the variables. Null hypotheses are formulated in a manner that portrays no relationship between the
variables. The null hypothesis becomes the basis of argument for the results.
c) Alternative Hypothesis
The alternative hypothesis, displayed as H1 or Ha, indicates the predicted relationship between
the variables. The alternative hypothesis may be non-directional (two-tailed) or directional (one-tailed).
A non-directional alternative hypothesis (H1) is one where the direction of the difference
between the variables is not given. On the other hand, the directional or one-tailed hypothesis predicts the
direction of the relationship between the variables.
d) Research Hypothesis
The research hypothesis is a term used when a hypothesized relationship or prediction is to be tested by
scientific methods. It is a predictive statement that relates an independent variable to a dependent variable
and must be stated in a testable form. Research hypotheses are classified either by direction or by logic.
Classification of hypotheses by direction of relationship

 Directional Hypothesis; These are hypotheses that stipulate the


direction of the expected differences or relations
 Non- directional Hypothesis; This is a type of research hypothesis
which does not specify the direction of the expected differences or
relationships.
Classification by logic applicable
 Deductive Hypothesis; A deductive logic starts with alternative
generalizations (hypotheses) and uses particular observations to
discriminate among them.
 Inductive Hypothesis Induction is the inference that a relationship
probably holds in the general population because it holds in the
study population.
Basic Concepts
The basic concepts associated with hypothesis include:

(i) Null and Alternative Hypothesis


A null hypothesis, generally denoted by the symbol Ho, is any hypothesis which is to be tested for
possible rejection or nullification under the assumption that it is true. A null hypothesis should always be
precise such as ‘a drug is ineffective in curing a particular disease’.
A null hypothesis is usually assigned a numerical value. For example, suppose we think that the average
height of students in all colleges is 150 cms. This statement is taken as a hypothesis and is written
symbolically as H0: u = 150 cms. In other words, we hypothesize that u = 150 cms.

An alternative hypothesis is any other hypothesis which we are willing to accept when the null hypothesis
Ho is rejected. It is customarily denoted by H1 or HA. A null hypothesis Ho is thus tested against an
alternative hypothesis H1. For example, if our null hypothesis is Ho: u = 150 cms, then our alternative
hypothesis may be:
H1: u ≠ 150 cms, or H1: u > 150 cms, or H1: u < 150 cms.

(ii) Test-Statistics
A statistic (i.e. a function of the sample data not containing any parameter), which provides a basis for
testing a null hypothesis, is called a test-statistic. Every test-statistic has a probability (sampling)
distribution which gives the probability of obtaining a specified value of the test-statistic when the null
hypothesis is true. It is important to remember that a test-statistic does not prove the hypothesis to be
correct but it furnishes evidence against the hypothesis. The most commonly used test-statistics are Z, t,
X2 or F.
(iii) Acceptance and Rejection Regions
All possible values which a test-statistic may assume can be divided into two mutually exclusive groups,
one group consisting of values which appear to be consistent with the null hypothesis and the other
having values which lead to the rejection of the null hypothesis. The first group is called the acceptance
region and the second set of values is known as the rejection region for a test. The rejection region is also
called the critical region. The value(s) that separates the critical region from the acceptance region is
called the critical value(s).

(iv) The significance level


The significance level of a test is the probability used as a standard for rejecting a null hypothesis H0 when
H0 is assumed to be true. This probability is equal to some small pre-assigned value, conventionally denoted
by α (alpha). The value α is also known as the size of the critical region. The most frequently used values of α,
the significance level, are 0.05 and 0.01, i.e., 5 percent and 1 percent. By α = 5%, we mean that there are
about 5 chances in 100 of incorrectly rejecting a true null hypothesis. To put it another way, we say
that we are 95% confident of making the correct decision.
Rejection and Acceptance Regions

(Figure: the sampling distribution, showing the central acceptance region and a rejection (critical) region of area 0.05 in each tail.)

(v) Test of Significance


A test of significance is a rule or procedure by which sample results are used to decide whether to accept
or reject a null hypothesis. Such a procedure is usually based on the test statistic and the sampling
distribution of such a statistic under Ho. A value of the statistic is said to be statistically significant when
the probability of its occurrence under Ho is so small that the value falls in the rejection region; in this case
Ho is rejected. If, on the other hand, the value falls in the acceptance region, it is said to be statistically
insignificant. In this case, Ho may be accepted. There are two desirable qualities for a test of significance:
 When the null hypothesis is actually true, it must have a low probability of rejecting Ho; and
 When Ho is actually false, it must have a high probability of rejecting Ho.
It is because the significance level must be chosen before the test is carried out, and because it is a critical
factor in deciding whether to accept or reject a hypothesis, that the term 'significance testing' is commonly
used instead of hypothesis testing.

vi) Critical value


The value of the sample statistic that separates the region of acceptance from the region of rejection is called
the critical value. It is important to note that "the critical value of Z for a single-tailed test (left or right) at a
level of significance α is the same as the critical value of Z for a two-tailed test at a level of significance 2α".

Further, for a right-tailed test: P(Z ≥ Zα) = α
For a left-tailed test: P(Z ≤ −Zα) = α

Also, for a two-tailed test: P(|Z| ≥ Zα/2) = α, since
P(Z ≥ Zα/2) + P(Z ≤ −Zα/2) = α
and, by symmetry, P(Z ≥ Zα/2) = P(Z ≤ −Zα/2) = α/2, i.e., the area of each tail is α/2.
vii) Type I and Type II Errors
Obviously, efforts are made to avoid any type of error but it is not possible to make a correct decision
with 100% certainty. When a hypothesis is tested by sampling, there is always a possibility of either
Type I or Type II error.
When, on the basis of sample information, we reject a null hypothesis Ho, when it is in fact true, we
commit a Type I error. On the other hand, when, on the basis of sample information, we accept a null
hypothesis Ho, when it is actually false, we commit a Type II error. These two types of errors may be
displayed in a tabular form as below:
Decision Accept Ho Reject Ho (or accept H1)
Ho is true Correct decision (no error) Wrong decision (Type I error)
Ho is false Wrong decision (Type II error) Correct decision (No error)

A legal analogy will help in understanding the difference between Type I and Type II errors. In a court
trial, the supposition of law is that the accused (the defendant) is innocent. This supposition of innocence
may be regarded as a kind of null hypothesis Ho that is to be rejected or accepted. After having heard the
evidence presented during the trial, the judge arrives at a decision. Suppose the accused is, in fact,
innocent (i.e. Ho is true), but the finding of the judge is guilty.
 The judge has rejected a true null hypothesis and in so doing has made a Type I error.
 If, on the other hand, the accused is, in fact, guilty (i.e. Ho is false) and the finding is not guilty, the judge has accepted a false null hypothesis and in so doing has made a Type II error.

viii). Decision – the last step is the decision about the null hypothesis i.e., whether to accept it or to reject
it. In this regard, we compare the computed value of Z (Obtained in step 2) with the critical value or
significant value or tabled value Za (given by table for critical values in step 7) at a given level of
significance a and decide as under:

i. if Zs< Za then we accept the null hypothesis H0 – if the calculated value of Z is less than the tabled
value Za of Z at a level of significance a, then the difference between t – E(t) is not significant (and this
difference may be due to fluctuations of sampling), so we accept the null hypothesis. In this case, the test
statistic falls in the region of acceptance.
ii. if Zs> Za then we reject the null hypothesis H0 and accept the alternative hypothesis H1. In this case,
the computed value of Z is numerically greater than the critical value Za at a level of significance a, and
therefore the computed value of test statistic falls in the rejection region. So we reject the null hypothesis
and accept the alternative hypothesis at a level of significance a or confidence level (1 – a)

Steps to Hypothesis Testing


The goal of hypothesis testing is to determine the likelihood that a population parameter,
such as the mean, is likely to be true. In this section, we describe the four steps of hypothesis
testing.

Step 1: State the hypotheses.


Step 2: Set the criteria for a decision.
Step 3: Compute the test statistic.
Step 4: Make a decision.
Hypothesis Test for a Population Mean
Using a Large Sample (n>30)
H0: μ = μo
H1: 1. μ > μo
    2. μ < μo
    3. μ ≠ μo

Test Statistic:
z = (x̄ − μo) / (σ/√n)

Rejection Region: For a probability α of a Type-I error, we can reject H0 if

1. z ≥ zα
2. z ≤ −zα
3. z ≥ zα/2 or z ≤ −zα/2

If σ (the population standard deviation) is unknown, s may be used instead as an approximation.

Example1. WSU uses thousands of fluorescent light bulbs each year. The brand of bulb it currently uses
has a mean life of 900 hours. A manufacturer claims that its new brand of bulbs, which cost the same as
the brand the university currently uses, has a mean life of more than 900 hours. The university has
decided to purchase the new brand if, when tested, the test evidence supports the manufacturer's claim at
α = .05. Suppose sixty-four bulbs were tested with the following results:

MEAN = 930 hours s= 80 hours

Will WSU purchase the new brand of fluorescent bulbs? Conduct hypothesis test.

μ = population mean life for the new brand of bulbs

H0: μ = 900

Ha: μ > 900 (the mean life for the new brand of bulbs is higher than the mean life for the old brand)

Significance Level: α = .05

Test Statistic: z = (x̄ − 900) / (s/√n)

Rejection Region: For α = .05, we reject H0 if z ≥ zα, where zα = 1.645

Calculations: z = (930 − 900) / (80/√64) = 30/10 = 3.00

Conclusion: Using α = .05 we reject H0. There is sufficient evidence to conclude that the mean life for the
new brand of bulbs is greater than 900 hours.
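
A minimal sketch of the same test in Python (the figures are those of the bulb example; 1.645 is the upper 5% point of the standard normal):

# Upper-tailed z-test for a population mean, large sample
from math import sqrt

x_bar, mu0, s, n = 930, 900, 80, 64
z = (x_bar - mu0) / (s / sqrt(n))   # 3.0
z_critical = 1.645                  # critical value for alpha = .05, one-tailed

print(z)
print("reject H0" if z >= z_critical else "fail to reject H0")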

Example 2 The average (mean) live weight of a farmer’s steers prior to slaughter was 380 pounds in past
years. This year his 50 steers were fed on a new diet. Suppose we consider these 50 steers on the new diet
as a random sample taken from a population of all possible steers that may be fed the diet now or in the
future. Use the sample data given below and α = .01 to test the research hypothesis that the mean live
weight for steers on the new diet is greater than 380.

MEAN =390 s=35.2.



μ = population mean live weight of steers fed on the new diet

H0: μ = 380

Ha: μ > 380

Significance Level: α = P(Type-I error) = .01

Test Statistic: z = (x̄ − 380) / (s/√n)

Rejection Region: For α = .01, we reject H0 if z ≥ zα, where zα = 2.33

Calculations: z = (390 − 380) / (35.2/√50) = 10/4.98 = 2.01

Conclusion: Using α = .01 we fail to reject H0. There is not sufficient evidence to conclude that the mean
live weight for steers on the new diet is greater than 380 pounds.

Example 3
A Stenographer claims that she can type at the rate of 120 words per minute. Can we reject her claim on
the basis of 100 trials in which she demonstrates a mean of 116 words with a standard deviation of 15
words? Use 5% level of significance.

Solution
1. Null hypothesis: H0: the stenographer's claim is true, i.e., H0: u = 120.

Alternative hypothesis: H1: u ≠ 120; it is a case of a two-tailed test.

2. Calculation of test statistic.


Here n = 100, x̄ = 116, u = 120, s = 15

Standard error of the mean: S.E.(x̄) = s/√n = 15/√100 = 1.5

Test statistic: Zs = |x̄ − u| / S.E.(x̄) = |116 − 120| / 1.5 = 2.67

3. Level of significance: a = 0.05

4. Critical value: the value of Za at 5% level of significance is Za = 1.96. (from the table)

5. Decision: since Zs = 2.67 is greater than Za = 1.96 at the 5% level of significance, the null hypothesis
H0 is rejected. Thus, the stenographer's claim of typing at the rate of 120 words per minute is not supported.
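
The stenographer example is two-tailed, so the comparison uses the absolute value of z against the 5% two-tailed critical value 1.96. A minimal sketch in Python:

# Two-tailed z-test for a population mean
from math import sqrt

x_bar, mu0, s, n = 116, 120, 15, 100
z = (x_bar - mu0) / (s / sqrt(n))   # about -2.67
z_critical = 1.96                   # two-tailed critical value at the 5% level

print(round(z, 2))
print("reject H0" if abs(z) >= z_critical else "fail to reject H0")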

CHI-SQUARE TEST
Introduction
The chi-square (χ²) test is used to determine whether there is a significant difference between the expected
frequencies and the observed frequencies in one or more categories. Hence the chi-square test is a useful
measure of comparing experimentally obtained results with those expected theoretically and based on the
hypothesis. Chi-square test is applied to those problems in which we study whether the frequency with
which a given event has occurred is significantly different from the one as expected theoretically. The
measure of Chi-square enables us to find out the degree of discrepancy between observed frequencies and
theoretical frequencies and thus to determine whether the discrepancy so obtained between observed
frequencies and theoretical frequencies is due to sampling error or to chance.

The Chi-square is computed on the basis of frequencies in a sample and thus the value of Chi-square so
obtained is a statistic. Chi-square is not a parameter as its value is not derived from the observations in a
population. Hence chi-square test is a non-parametric test. Chi-square test is not concerned with any
population distribution and its observations.

The Chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is
also called a "goodness of fit" statistic, because it measures how well the observed distribution of data
fits with the distribution that is expected if the variables are independent.

Advantages of using nonparametric techniques are the following.

1. They are appropriate when only weak assumptions can be made about the distribution.

2. They can be used with categorical data when no adequate scale of measurement is available.

3. For data that can be ranked, nonparametric test using ranked data may be the best option.

4. They are relatively quick and easy to apply and to learn since they involve counts, ranks and signs.
Types of Data:
There are basically two types of random variables and they yield two types of data: numerical and
categorical. A chi square (X2) statistic is used to investigate whether distributions of categorical variables
differ from one another. Basically categorical variable yield data in the categories and numerical variables
yield data in numerical form. Responses to such questions as "What is your major?" or Do you own a
car?" are categorical because they yield data such as "biology" or "no." In contrast, responses to such
questions as "How tall are you?" or "What is your G.P.A.?" are numerical. Numerical data can be either
discrete or continuous. The table below may help you see the differences between these two variables.

Data Type                Question Type               Possible Responses

Categorical              What is your sex?           male or female

Numerical (discrete)     How many cars do you own?   two or three

Numerical (continuous)   How tall are you?           72 inches

Notice that discrete data arise from a counting process, while continuous data arise from a measuring
process. The Chi Square statistic compares the tallies or counts of categorical responses between two (or
more) independent groups. (note: Chi square tests can only be used on actual numbers and not on
percentages, proportions, means, etc.)

Therefore a Chi-square test is designed to analyze categorical data. That means that the data has been
counted and divided into categories. It will not work with parametric or continuous data (such as height in
inches). For example, if you want to test whether attending class influences how students perform on an
exam, using test scores (from 0-100) as data would not be appropriate for a Chi-square test. However,
arranging students into the categories "Pass" and "Fail" would. Additionally, the data in a Chi-square grid
should not be in the form of percentages, or anything other than frequency (count) data. Thus, by dividing
a class of 54 into groups according to whether they attended class and whether they passed the exam, you
might construct a data set like this:

Pass Fail

Attended 25 6

Skipped 8 15
IMPORTANT: Be very careful when constructing your categories! A Chi-square test can tell you
information based on how you divide up the data. However, it cannot tell you whether the categories you
constructed are meaningful. For example, if you are working with data on groups of people, you can
divide them into age groups (18-25, 26-40, 41-60...) or income level, but the Chi-square test will treat the
divisions between those categories exactly the same as the divisions between male and female, or alive
and dead! It's up to you to assess whether your categories make sense, and whether the difference (for
example) between age 25 and age 26 is enough to make the categories 18-25 and 26-40 meaningful. This
does not mean that categories based on age are a bad idea, but only that you need to be aware of the
control you have over organizing data of that sort.

Chi-Square Test Requirements


1. Quantitative data.
2. One or more categories: this test is used only for drawing inferences by testing hypothesis. It cannot be
used for estimation of parameter or any other value.
3. Independent observations: each of the observations constituting the sample for this test should be
independent of each other.
4. Adequate sample size (at least 10): The expected frequency of any item or cell should not be less than
5. If it is less than 5, then frequencies from the adjacent items or cells are pooled together in order to
make it 5 or more than 5.
5. Simple random sample: The observations collected for chi test should be on random basis of sampling
6. Data in frequency form: the frequencies used in chi test should be absolute and not relative in terms.
7. All observations must be used.
8. The total number of observations used in this test must be large, i.e., n is greater than 30
9. It is wholly dependent on the degree of freedom.

Properties of Chi –square Distribution


1. Chi –square curve is always positively skewed.
2. The mean of Chi –square distribution is the number of degrees of freedom
3. Chi-square values increases with the increase in degree of freedom.
4. The value of Chi-square lies between zero and infinity, i.e., 0 ≤ χ² < ∞.
5. For different degrees of freedom, the shape of the curve will be different. Its shape depends on the
degree of freedom but it is not a symmetrical distribution

Degree of Freedom
The number of data that are given in the form of a series of variables in a row or column or the number of
frequencies that are put in cells in a contingency table, which can be calculated independently is called the
degrees of freedom and is denoted by v.

Case I If the data is given in the form of a series of variables in a row or column, then the degrees of
freedom = (number of items in the series) – 1, i.e., v = n – 1, where n is the number of variables in the
series in a row column.

Case II When the number of frequencies are put in cells in a contingency table, the degrees of freedom
will be the product of (number of rows less one) and the (number of columns less one), i.e., v = (R – 1)
(C – 1), where R is the number of rows and C is the number of columns.
Use of Chi-Square - Test
The Chi –square test is a very powerful test for testing the hypothesis of a number of statistical problems.
The important uses of Chi –square test are:

1. Test of Goodness of Fit – this test is for assessing if a particular discrete model is a good fitting model
for a discrete characteristic, based on a random sample from the population.

E.g. Has the model for the method of transportation (drive, bike, walk, other) used by students to get to
class changed from that of 5 years ago? Under this test there is only one variable, i.e., the degrees of
freedom are v = n – 1.

2. Test of independence of attributes – this test helps us to assess if two discrete (categorical) variables
are independent for a population, or if there is an association between the two variables.

E.g. Is there an association between satisfaction with the quality of public schools (not satisfied,
somewhat satisfied, very satisfied) and political party (Republican, Democrat, etc.)

The chi-square test is used to see that the principles of classification of attributes are independent. In this
test the attributes are classified into a two way table or a contingency table as the case may be. The
observed frequency in each cell (square) is known as cell frequency. The total frequencies in each row or
column of the two ways contingency table is known as marginal frequency.

The degree of freedom is v = (R – 1)(C – 1),

Where R = number of rows, C = number of columns in the two way contingency table. This test discloses
whether there is any association or relationship between two or more attributes.

3. Test of homogeneity or a test for a specified standard deviation - The chi-square test may be used to
test the homogeneity of the attributes in respect of a particular characteristic, or it may be used to test the
population variance. In other words, this test is for assessing if two or more populations are homogeneous
(alike) with respect to the distribution of some discrete (categorical) variable.

E.g. Is the distribution of opinion on legal gambling the same for adult males versus adult females?
All of these tests are based on a X² test statistic that, if the corresponding H0 is true and the assumptions
hold, follows a chi-square distribution with some degrees of freedom, written χ²(df).

The steps in using the chi-square test may be summarized as follows:

Test of Goodness of Fit

1. Set up the Null Hypothesis H0 and Alternative hypothesis H1

2. Calculate the expected frequency E

3. Use the formula to find the chi-square value

4. Find the df. (N-1)

5. Find the table value (consult the Chi Square Table.)


6. If your chi-square value is equal to or greater than the table value, reject the null hypothesis:
differences in your data are not due to chance alone
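
As a minimal illustration of these six steps, the sketch below runs a goodness-of-fit test in Python using scipy. The transportation counts and the hypothesized proportions are made up for illustration; they are not data from these notes.

# Goodness-of-fit sketch in Python (hypothetical counts, not from the notes).
from scipy.stats import chisquare, chi2

# Observed method-of-transport counts: drive, bike, walk, other
observed = [60, 15, 20, 5]                       # n = 100 students
# H0 proportions from the model of 5 years ago (assumed for illustration)
p0 = [0.50, 0.20, 0.25, 0.05]
expected = [p * sum(observed) for p in p0]       # Step 2: expected frequencies

stat, p_value = chisquare(observed, f_exp=expected)   # Step 3: chi-square value
df = len(observed) - 1                                 # Step 4: df = N - 1
critical = chi2.ppf(0.95, df)                          # Step 5: table value at alpha = 0.05

print(f"chi-square = {stat:.3f}, df = {df}, critical = {critical:.3f}, p = {p_value:.4f}")
# Step 6: reject H0 if stat >= critical (equivalently, if p <= 0.05)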

Working Rule for Test of Homogeneity and Test of Independence of Attributes

The chi-square test is widely used to test the independence of attributes. It is applied to test the
association between the attributes when the sample data is presented in the form of a contingency table
with any number of rows or columns.

Step 1 Set up the Null Hypothesis H0: No Association exists between the attributes.

Alternative hypothesis H1 : An association exists between the attributes.

Step 2. Calculate the expected frequency E corresponding to each cell by the formula E = (R × C) / n,

Where R = Sum total of the row in which E is lying

C = sum total of the column in which E is lying

n = total sample size

(O  E ) 2
Step 3. Use the formula to find the chi-square value Pearson X 2   E
All cells

The characteristics of this distribution are completely defined by the number of degrees of freedom v
which is given by V = (R – 1) (C – 1), where R = number of rows and C = numbers of columns in the
contingency table.

Step 4. Find from the table the value of chi-square for the given level of significance and the
degrees of freedom v calculated in Step 3.

Step 5. Compare the computed value of chi-square with the tabled value of chi-square found in step 4. If
your chi-square value is equal to or greater than the table value, reject the null hypothesis: differences in
your data are not due to chance alone

Example 1
The example used is that for testing whether Gender is independent of type of high school attended
(public or private). The sample data is (the numbers reflect the counts).

Public Private Total

Female 38 7 45

Male 46 9 55

Total 84 16 100
Step 1. The basic premise of the test of independence is to see if the distribution of the percentages is the
same for each level of category. That is, is the percentage of males attending public schools “close
enough” to that percentage for females attending public schools? We do this by comparing what we
“observe” (i.e. the data in the table which is the sample data) to what we would expect to see in our
sample if there was no relationship, i.e. the variables were independent.

H0: “Gender and School Type are independent” versus the alternative hypothesis

H1: “Gender and School Type are dependent” or “there is a relationship between Gender and School
Type”

Step 2. The observed counts are easy, but how do we get the expected counts for each of the cells in the
table? Since the idea of the expected counts is to provide what the distribution would be if no
relationship existed, we use the observed column and row totals to calculate how each individual cell
count would be distributed if in fact there was no relationship. We do this by taking each row total times
the column total and then dividing by the overall total. This produces an expected count table of:

Public Private Total

Female (45*84)/100 = 37.80 (45*16)/100 = 7.20 45

Male (55*84)/100 = 46.20 (55*16)/100 = 8.80 55

Total 84 16 100

Step 3. Substituting into the formula for the chi-square:

χ² = Σ (O − E)²/E
   = (38 − 37.80)²/37.80 + (7 − 7.20)²/7.20 + (46 − 46.20)²/46.20 + (9 − 8.80)²/8.80
   = 0.012

Step 4. As you can see this test statistic is quite small as you might have guessed given how close the
observed values were to the expected values. Reading the chi-square table is similar to reading the T-
table in that there is a degree of freedom consideration and the table provides right tail probabilities. The
DF is found by taking the number of rows minus 1 times the number of columns minus 1, written as: (R-
1)*(C-1). For this example the degrees of freedom are (2-1)*(2-1) = 1.

Step 5. Decision: If you have a chi-square table you will see that the test statistic of 0.012 is less than the
chi-square value presented in the table for the 0.05 with 1 degree of freedom. This means that our p-value
is greater than that right tail probability which in turn is greater than 0.05 resulting in us not rejecting Ho.
We would conclude that there is not enough evidence to reject Ho: we cannot say a relationship exists
between Gender and School Type.
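
For reference, the same calculation can be reproduced in Python with scipy's chi2_contingency function, which applies Pearson's formula and also returns the expected counts. This is a sketch for checking the arithmetic, not part of the worked example above.

# Reproducing Example 1 with scipy (sketch).
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[38, 7],    # Female: public, private
                  [46, 9]])   # Male:   public, private

# correction=False gives the plain Pearson statistic (no Yates continuity correction)
stat, p_value, df, expected = chi2_contingency(table, correction=False)

print("expected counts:\n", expected)   # approximately [[37.8, 7.2], [46.2, 8.8]]
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.3f}")
# chi-square of about 0.012 with 1 df, p of about 0.91, so H0 (independence) is not rejected.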

Example 2
Suppose you conducted a drug trial on a group of animals and you hypothesized that the animals
receiving the drug would show increased heart rates compared to those that did not receive the drug. You
conduct the study and collect the following data:
Hypothetical drug trial results.

              Heart Rate Increased   No Heart Rate Increase   Total
Treated                36                     14                50
Not treated            30                     25                55
Total                  66                     39               105

Ho: The proportion of animals whose heart rate increased is independent of drug treatment.

Ha: The proportion of animals whose heart rate increased is associated with drug treatment.

Applying the shortcut formula for a 2 x 2 table, chi-square = n(ad - bc)² / [(a+b)(c+d)(a+c)(b+d)], we get:

Chi square = 105[(36)(25) - (14)(30)]² / [(50)(55)(66)(39)] = 3.418

Before we can proceed we need to know how many degrees of freedom we have. When a comparison is
made between one sample and another, a simple rule is that the degrees of freedom equal (number of
columns minus one) x (number of rows minus one) not counting the totals for rows or columns. For our
data this gives (2-1) x (2-1) = 1.

We now have our chi square statistic (x2 = 3.418), our predetermined alpha level of significance (0.05),
and our degrees of freedom (df = 1). Entering the Chi square distribution table with 1 degree of freedom
and reading along the row we find our value of x2 (3.418) lies between 2.706 and 3.841. The
corresponding probability is between the 0.10 and 0.05 probability levels. That means that the p-value is
above 0.05 (it is actually about 0.065). Since a p-value of 0.065 is greater than the conventionally accepted
significance level of 0.05 (i.e. p > 0.05) we fail to reject the null hypothesis. In other words, there is no
statistically significant difference in the proportion of animals whose heart rate increased.
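
The shortcut calculation above can be verified in a few lines of Python; the p-value comes from the right tail of the chi-square distribution. This is a sketch using scipy with the counts from the table.

# Checking Example 2's shortcut chi-square for a 2 x 2 table (sketch).
from scipy.stats import chi2

a, b, c, d = 36, 14, 30, 25          # treated/not treated vs increased/not increased
n = a + b + c + d                    # 105 animals

# chi-square = n(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)]
stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
p_value = chi2.sf(stat, df=1)        # right-tail probability with 1 degree of freedom

print(f"chi-square = {stat:.3f}, p-value = {p_value:.3f}")   # about 3.418, p about 0.064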

Example 3. Chi Square Test of Independence


For a contingency table that has r rows and c columns, the chi square test can be thought of as a test of
independence. In a test of independence the null and alternative hypotheses are:

Ho: The two categorical variables are independent.

H1: The two categorical variables are related.

We can use the equation Chi Square = Σ (fo − fe)² / fe.
Here fo denotes the frequency of the observed data and fe is the frequency of the expected values. The
general table would look something like the one below:
              Category I   Category II   Category III   Row Totals
Sample A          a            b              c           a+b+c
Sample B          d            e              f           d+e+f
Sample C          g            h              i           g+h+i
Column Totals   a+d+g        b+e+h          c+f+i       a+b+c+d+e+f+g+h+i = N

Now we need to calculate the expected values for each cell in the table, and we can do that using the
row total times the column total divided by the grand total (N). For example, for cell a the expected value
would be (a+b+c)(a+d+g)/N.

Once the expected values have been calculated for each cell, we can use the same procedure as before
for a simple 2 x 2 table, setting out columns for Observed, Expected, |O − E|, (O − E)² and (O − E)²/E.

Suppose you have the following categorical data set. Incidence of three types of malaria in three tropical
regions.

              Asia   Africa   South America   Totals
Malaria A      31      14           45           90
Malaria B       2       5           53           60
Malaria C      53      45            2          100
Totals         86      64          100          250

We could now set up the following table:

Observed   Expected   |O − E|   (O − E)²   (O − E)²/E


31 30.96 0.04 0.0016 0.0000516
14 23.04 9.04 81.72 3.546
45 36.00 9.00 81.00 2.25
2 20.64 18.64 347.45 16.83
5 15.36 10.36 107.33 6.99
53 24.00 29.00 841.00 35.04
53 34.40 18.60 345.96 10.06
45 25.60 19.40 376.36 14.70
2 40.00 38.00 1444.00 36.10
Chi Square = 125.516
Degrees of Freedom = (c - 1)(r - 1) = 2(2) = 4
Chi Square distribution table. Probability level (alpha)

Df 0.5 0.10 0.05 0.02 0.01 0.001

1 0.455 2.706 3.841 5.412 6.635 10.827

2 1.386 4.605 5.991 7.824 9.210 13.815

3 2.366 6.251 7.815 9.837 11.345 16.268

4 3.357 7.779 9.488 11.668 13.277 18.465

5 4.351 9.236 11.070 13.388 15.086 20.517


Reject Ho because 125.516 is greater than 9.488 (the table value for alpha = 0.05 with 4 degrees of freedom).

Thus, we would reject the null hypothesis that there is no relationship between location and type of
malaria. Our data tell us there is a relationship between type of malaria and location, but that's all it says.

Formula Card: Chi-Square Tests

Test of Independence & Test of Homogeneity
  Expected Count: E = expected = (row total × column total) / total n
  Test Statistic: χ² = Σ (O − E)²/E = Σ (observed − expected)²/expected
  df = (r – 1)(c – 1)

Test for Goodness of Fit
  Expected Count: Ei = expected = n pi0
  Test Statistic: χ² = Σ (O − E)²/E = Σ (observed − expected)²/expected
  df = k – 1

If Y follows a χ²(df) distribution, then E(Y) = df and Var(Y) = 2(df).
LINEAR PROGRAMMING
Linear programming is of one of the most powerful tools of management science. Linear
programming (or simply LP) refers to several related mathematical techniques that are used to
allocate limited resources among competing demands in an optimal way. In general, the goal of
manager decision making is to achieve some objective such as maximizing profit or minimizing
costs.
A Linear Program (LP) is a problem that can be expressed as follows (the so-called Standard
Form): maximize (or minimize) the objective function cx subject to the constraints Ax = b and x ≥ 0.
The expression "cx" is called the objective function, and the equations "Ax=b" are called the
constraints. The set of factors under the control of the manager is called the decision variables.
There are also a number of factors that limit or constrain what can be done. These may include,
for example, capacities of plant and equipment, limits on market demands, or processing and
delivery requirements.
A linear programming model, or LP model, is a particular type of mathematical model in which
the relationships involving the variables are linear, and in which there is a single performance
measure or objective. An advantage of this type of model is that there exists a mathematical
technique, called linear programming that can determine the best or optimal decision even when
there are thousands of variables and relationships.
Essential conditions
These are the conditions for linear programming to pertain.
 First, there must be limited resources-constraints (e.g., a limited number of workers,
machines, finances, and material); otherwise there would be no problem, nothing to be
sold.
 Second, there must be an explicit objective - a well-defined objective function (such as
maximize profit or minimize cost). When a single objective is to be maximized or
minimized, we can use linear programming. When multiple objectives exist, goal programming is used.
 Third, there must be linearity (if it takes three hours to make a part, then two parts
would take six hours, three parts would take nine hours etc.). Other restrictions on the
nature of the problem may require that it be solved by other variations of the technique,
such as non-linear programming or dynamic programming.
 Fourth, there must be homogeneity (the products are identical, or all the hours available
from a worker are equally productive).
 Fifth is divisibility (products and resources can be subdivided into fractions). If this
subdivision is not possible (such as flying half an aeroplane or hiring one-fourth of a
person), a modification of linear programming, called integer programming, can be used.
 Sixth, there must be an alternative course of action.
 Seventh, decision variables should be interrelated and non- negative.
Linear programming does not allow for uncertainty in any of the relationships; there cannot be
any probabilities or random variables. Also, time-dependent changes cannot be incorporated
into the classical LP problem.
Since its development in the late 1940s, linear programming has been applied to a wide variety of
decision problems in business and the public sector.
Graphical Solution
It is useful to get familiar with the graphical method for solving LP problems first — not because
this is used in practice, but because it provides best understanding of the model and its solution.
Graphical solution is based on a geometrical representation of the feasible region and the
objective function. In particular, the space to be considered is the n-dimensional space with each
dimension defined by one of the LP variables. The objective function will be described in this n-
dim space by its contour plots, i.e., the sets of points that correspond to the same objective value.
To facilitate the visualization we are restricted to the two-dimensional case, i.e., to LP models
with two decision variables (possibly with 3 variables in 3D graphic).
The steps of the graphical method are as follows:
1. Formulate the Problem in Mathematical Terms Example: Product planning - LP model
As mentioned above there are
 limited resources
 an explicit objective function
 the equations are linear
 the resources are homogeneous
 the decision variables are divisible and non-negative
2. Plot Constraint Equations
Every pair of values of the two variables can be plotted as a point with
co-ordinates in a 2-dimensional (planar) Cartesian system. The constraint equations are easily
plotted by letting one variable equal zero and solving for the axis intercept of the other. (The
inequality portions of the restrictions are disregarded for this step.)
3. Determine the Area of Feasibility
The direction of the inequality sign in each constraint determines the area of feasibility where a
feasible solution is found.
The feasible region of a 2-variables LP is depicted by the set of points the co-ordinates of which
satisfy all LP constraints and the sign restrictions. If all constraints are expressed by linear
inequalities, to geometrically characterize the feasible region, we must first characterize the set
of points that constitute the solution space of each linear inequality. Then, the LP feasible region
will result from the intersection of the solution spaces corresponding to each technological
constraint.
A feasible solution or feasible point satisfies all of the constraints and any restrictions on
the variables value (e.g. nonnegativities). The feasible set is a set of all feasible solutions.
The region of feasible solutions forms a convex polygon. If this condition of convexity does not
exist, the problem is either incorrectly set up or not amenable to linear programming.
4. Plot the Objective Function
The objective function may be plotted by assuming some arbitrary total profit figure and then
solving for the axis co-ordinates, as was done for the constraint equations. Other terms for the
objective function, when used in this context, are the iso-profit line or equal contribution line,
because it shows all possible production combinations for any given profit figure. All iso-profit
lines are parallel, so that we can move it to one side to maximize the objective function or other
side for minimization.

5. Find the Optimum Point
It can be shown mathematically that the optimal combination of decision variables is always
found at an extreme point (corner point) of the convex polygon. The number of corner points is
limited (compare with unlimited number of points of the whole polygon representing the set of
feasible solutions). We can determine which one is the optimum by either of two approaches.
The first approach is to find the values of the various corner solutions algebraically. This entails
simultaneously solving the equations of various pairs of intersecting lines and substituting the
quantities of the resultant variables in the objective function.
The second and generally preferred approach entails using the objective function or iso-profit
line directly to find the optimum point. The procedure involves simply drawing a straight line
parallel to any arbitrarily selected initial iso-profit line so that the iso-profit line is farthest from
the origin of the graph (in cost-minimization problems, the objective would be to draw the line
through the point closest to the origin.)
Note that all iso-profit lines have the same slope but different axis intercepts.
The profit lines of a linear programming model are always parallel.
The underlying idea is to keep ”sliding'' the iso-profit line in the direction of increasing value of
the objective function, until we cross the boundary of the LP feasible region. This extreme point
of the feasible region is the optimum point.
If there is a finite optimal solution for a linear programming model, an optimal solution
will be a corner point.
Not all examples lead to one optimal solution. It could happen that, attempting to plot the feasible
region for some problem, we get no points in the feasible region, i.e. no points on the plane that satisfy all
constraints, and therefore the problem is infeasible (also called an over-constrained problem).
If there is no point in the feasible set, there is no feasible solution of the linear
programming model.
One possible reason for no feasible solution is an error in transcribing the data or inputting the
model. Another type of error that is harder to detect is the inclusion of individually reasonable constraints
that cannot all be satisfied at once. For example, there may not be enough available
capacity to satisfy all of the requirements.
In the LP models considered above, the feasible region (if not empty) was a bounded area. For
this kind of problem it is obvious that all values of the LP objective function (and therefore the
optimal value) are bounded.
A bounded feasible set is one for which a finite number may be specified so that every variable
value, at any point in the feasible set, is less than or equal to that number.
Consider however the possibility that the feasible region is not bounded and the iso-profit line
can be shifted with no limits. An unbounded feasible set is a feasible set that is not bounded.
Only one variable has to be limitless for a feasible set to be unbounded. If feasible points
exist with objective function values as favorable as desired, the model is said to have an
unbounded optimal solution and the problem is called an unbounded problem (usually one extreme,
maximum or minimum, can still be found; the difficulty arises when it is not the extreme we search for). The
practical interpretation of such a solution is the same as in the previous case, i.e. there is no
optimal solution and the model creator must return to the real situation and check all constraints.
In the unbounded case there is probably some limitation missing, forgotten in the model
formulation.
The model can also have more than one optimal solution – an alternative solution. In this case the
iso-profit line runs in the same direction as one of the constraint lines. The optimal points then form a set
of points lying between two polygon corner points on this line (contour line).
If two corner points are optimal, then all of the points on the line segment connecting them
are also optimal.
Note that for all examples of optimal solutions for linear programming model, there is an optimal
solution in the corner point.
Graphical Method Exercises Solved in Linear Programming
EXAMPLE 1: A workshop has three (3) types of machines A, B and C; it can manufacture two
(2) products 1 and 2, and all products have to go to each machine and each one goes in the same
order; First to the machine A, then to B and then to C. The following table shows:
 The hours needed at each machine, per product unit
 The total available hours for each machine, per week
 The profit of each product per unit sold

Formulate and solve using the graphical method a Linear Programming model for the previous
situation that allows the workshop to obtain maximum gains.
Decision Variables:
 : Product 1 Units to be produced weekly
 : Product 2 Units to be produced weekly
Objective Function:
Maximize
Constraints:




The constraints represent the number of hours available weekly for machines A, B and C,
respectively, and also incorporate the non-negativity conditions.
For the graphical solution of this model we will use the Graphic Linear Optimizer
(GLP) software. The green colored area corresponds to the set of feasible solutions and the level
curve of the objective function that passes by the optimal vertex is shown with a red dotted line.

The optimal solution is and with an optimal


value that represents the workshop’s profit.

EXAMPLE 2 : A winemaking company has recently acquired a 110 hectares piece of land. Due
to the quality of the sun and the region’s excellent climate, the entire production of Sauvignon
Blanc and Chardonnay grapes can be sold. You want to know how to plant each variety in the
110 hectares, given the costs, net profits and labor requirements according to the data shown
below:

Suppose that you have a budget of US$10,000 and an availability of 1,200 man-days during the
planning horizon. Formulate and solve graphically a Linear Programming model for this
problem. Clearly outline the domain of feasible solutions and the process used to find the optimal
solution and the optimal value.
Decision Variables:
 : Hectares intended for growing Sauvignon Blanc
 : Hectares intended for growing Chardonnay
Objective Function:
Maximize
Constraints:




Where the restrictions are associated with the maximum availability of hectares for planting,
available budget, man-hours in the planting period and non-negativity, respectively.
The following graph shows the representation of the winemaking company problem. The shaded
area corresponds with the domain of feasible solutions, where the optimal basic feasible
solution is reached at vertex C, where the budget and man-days restraints are active. Thus
solving said equation system the coordinate of the optimal solution is found where
and (hectares). The optimal value is (dollars).

EXAMPLE 3
Nairobi Industries Limited manufactures three items A, B and C. The production requirements per
unit of these items and the capacities of the three departments are shown as follows:

                          Production requirements per unit
Department                   A        B        C        Capacity of the department (hours)
Assembling                   2        3        4                  32,000
Painting                     1        2        3                  37,500
Finishing                    3        2        3                  29,000
Profit contribution ($)     20       50       40

Required – Construct the linear programming model that would maximize the profit contribution
from the given information.
Let x be the number of units of item A manufactured
Let y be the number of units of item B manufactured
Let z be the number of units of item C manufactured

Maximize profit:
P = 20x + 50y + 40z
Constraints:
Assembly: 2x + 3y + 4z ≤ 32,000
Painting: x + 2y + 3z ≤ 37,500
Finishing: 3x + 2y + 3z ≤ 29,000

Non-negativity constraints: x ≥ 0, y ≥ 0, z ≥ 0
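
Although the question only asks for the formulation, a solver can confirm that the model is well posed. The sketch below feeds the same model to scipy.optimize.linprog (which minimizes, so the profit coefficients are negated); the variable names x, y, z match the formulation above.

# Solving the Nairobi Industries model with scipy.optimize.linprog (sketch).
from scipy.optimize import linprog

c = [-20, -50, -40]                       # maximize 20x + 50y + 40z (negated for minimization)
A_ub = [[2, 3, 4],                        # assembling hours
        [1, 2, 3],                        # painting hours
        [3, 2, 3]]                        # finishing hours
b_ub = [32000, 37500, 29000]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3, method="highs")

x, y, z = res.x
print(f"x = {x:.1f}, y = {y:.1f}, z = {z:.1f}, maximum profit = {-res.fun:,.2f}")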


INTRODUCTION TO INVENTORY MANAGEMENT
The Concepts of Inventory and Inventory Management
What is inventory?
Inventory can be defined as below;
1. It is the stock of any items or resources held in an organization for sale or use.
2. It can also be defined as materials in a supply chain or in a segment of supply chain; expressed
in quantities, locations/sites and/or values.
There is a distinction between manufacturing firm inventory and service firm inventory.
Manufacturing firm inventory refers to all items held in inventory that contribute to or become
part of the organization's products e.g. raw materials, finished goods, supplies, work in progress and
piece parts.
Service firm’s inventory refers to all tangible goods to be sold (stock in trade) and the supplies
necessary to administer the service. E.g. a mobile phone company such as Safaricom keeps stock
of scratch cards, SIM cards, packaging material etc. in order to offer competitive services to its
clients.
Types of Inventory.
5 Basic types of inventories are raw materials, work-in-progress, finished goods, packing
material and MRO supplies. Inventories are also classified as merchandise and manufacturing
inventory. Other such classifications on various bases are goods in transit, buffer stock,
anticipatory stock, decoupling inventory, and cycle inventory.
Merchandise Inventory: It is the inventory of trading goods held by the trader.
Manufacturing Inventory: It is the inventory held for manufacturing and selling of goods.
Based on the value addition or stage of completion, the manufacturing inventories are further
classified into 3 types of inventory – Raw Material, Work-In-Progress and Finished Goods.
Another type is MRO inventories which is to support the whole manufacturing and
administrating operation.
 Raw Materials: These are the materials or goods purchased by the manufacture.
Manufacturing process is applied on the raw material to produce desired finished goods.
For example, aluminum scrap is used to produce aluminum ingots. Flour is used to
produce bread. One firm's finished goods can be another firm's raw material. For
example, the aluminum ingot can be used as raw material by a utensils manufacturer.
Business importance of raw material as an inventory is mainly to protect any interruption
in production planning. Other reasons can be availing price discount on bulk purchases,
guard against market shortage situation, etc.
 Work-In-Progress (WIP): These are the partly processed raw materials lying on the
production floor. They may or may not be saleable. These are also called semi-finished
goods. It is unavoidable inventory which will be created in almost any manufacturing
business.
The level of this inventory should be kept as low as possible, since a lot of money is
blocked here which could otherwise be used to achieve better returns. Speeding up the
manufacturing process, proper production planning, customer and supplier system
integration etc. can diminish the levels of work in progress. Lean management considers
it as waste.
 Finished Goods: These are the final products after manufacturing process on raw materials.
They are sold in the market.
There are two kinds of manufacturing industries. One, where the product is first manufactured
and then sold. Second, where the order is received first and then it is manufactured as per
specifications. In the first one, it is inevitable to keep finished goods inventory whereas it can be
avoided in the second one.
 Packing Material: Packing material is the inventory used for packing of goods. It can be
primary packing and secondary packing. Primary packing is the packing without which the
goods are not usable. Secondary packing is the packing done for convenient transportation of
goods.
 MRO Goods: MRO stands for maintenance, repairs and operating supplies. They are also
called as consumables in various parts of the world. They are like a support function.
Maintenance and repairs goods like bearings, lubricating oil, bolt, nuts etc are used in the
machineries used for production. Operating supplies mean the stationery etc. used for operating
the business.
Other types of inventories, classified on various bases, are as follows:
Materially, there are only 4 types of inventories, as explained above. The following types of
inventories are either reasons to hold those 4 basic inventories or business requirements for the
same. Some of them are suitable strategies for certain businesses.
Goods in Transit: Under normal conditions, a business transports raw materials, WIP, finished
goods etc. from one site to other for various purpose like sales, purchase, further processing etc.
Due to long distances, the inventory stays on the way for days, weeks and even months
depending on distances. These are called Inventory / Goods in Transit. Goods in transit may
consist of any type of basic inventories.
Buffer Inventory: Buffer inventory is the inventory kept or purchased for the purpose of
meeting future uncertainties. Also known as safety stock, it is the amount of inventory besides
the current inventory requirement. Benefit is smooth business flow and customer satisfaction and
disadvantage is the carrying cost of inventory. Raw material as buffer stock is kept for achieving
nonstop production and finished goods for delivering any size, any type of order by customer.
Anticipatory Stock: Based on the past experiences, a businessman is able to foresee the future
trends of the market and takes certain decisions based on that. Expecting a price rise, spurt in
demand etc. some businessman invests money in stocking those goods. Such kind of inventory is
known as anticipatory stock. It is normally the raw materials or finished goods and this strategy
is executed by traders.
Decoupling Inventory: In manufacturing concern, plant and machinery should always keep
running. The act of stopping machinery, costs to the entrepreneur in terms of additional set up
costs, repairs, idle time depreciation, damages, trial runs etc. Reason for halt is not always
demand of the product. It may be because of availability of input. In production line, one
machine / process uses the output of other machine / process. Speed of different machines may
not always integrate with each other. For that reason, stock of input for all the machines should
be sufficient to keep the factory running. Such WIP inventory is called decoupling inventory.
Cycle Inventory: It is a type of inventory accumulated due to ordering in lots or sizes to avoid
carrying cost of inventory. In other words, it is the inventory to balance the carrying cost and
holding cost for optimizing the inventory ordering cost as suggested by Economic Order
Quantity (EOQ).
What inventory management entails:
→ Inventory management entails all the unified management of those internal activities
associated with the acquisition, storage, issue, use and internal distribution of inventory used in
production and service.
→ it is the activity of determining the rate and quantities and the procedures of materials to be
stocked in an organization and the regulation of receipts and issues of those stocks.
Importance of Inventory Management
The potential benefits of the application of inventory management concepts are many and
include the following:
1. Provides both internal and external customers the required service levels in terms of quantities
and the order rate fill (timing).
2. Ascertains present and future requirements for all types of inventory to avoid
overstocking or under- stocking.
3. Keeps costs at the minimum by variety reduction, economic lot sizes and analysis of
costs incurred in obtaining and keeping inventories.
4. Provides upstream and downstream inventory visibility or service in the supply chain.
→ In order to reap these benefits, inventory management requires an effective organization and
sound communication among supervisors and managers.
This is because various functions conflict making it difficult to have a harmonious way of
enjoying the above benefits e.g.
1) Purchasing, in an attempt to maintain low cost, may choose vendors who have poor delivery
records (time, quality). This may have a potentially negative effect on production.
2) Production on the other hand may require large adequate supply of materials. This may lead
to maintaining high inventory levels than necessary.
Therefore, due to these potential conflicts, it is necessary to deliberately set up rationalized
inventory management systems.
Reasons/Need for inventory.
1. Meeting production requirements. Raw materials, components and parts are required for
producing finished goods. A manufacturing organization keeps stocks to meet the continuous
requirements of production. Work in progress (WIP) inventory constitutes a major portion of
production related inventory.

2. Support operational Requirements. To support production operations, inventory is


required for repairs, maintenance and operations support e.g. spare parts, consumables such
as lubrication oils, welding rods, chemicals etc.
3. Customer Service Consideration. Products like equipment, machinery and appliances
require replacement of spare parts and other consumables for trouble free and smooth operation.
Suppliers maintain an inventory of these parts to extend after sales services to its customers.
Availability of spare parts when required at customer end is crucial for customer satisfaction and
may be used as a tool for competitive advantage.
4. Hedge against future expectations. Inventory takes care of shortages in material or
product availability or due to an unanticipated increase in the price of products.
5. Balancing supply and demand. The production and the consumption cycle never match.
A sudden requirement for products in large quantities may not be satisfied since production
cannot be ramped up at such short notice. In such a case, the products are manufactured in advance in anticipation
of a sudden demand and kept in stock for supply during the peak period.
6. Periodic variation. For seasonal products, the demand is at peak in certain periods while
it is lean/little for the rest of the year. Production runs in the factory are taken based on the
average demand for the year. Excess production in the lean periods is kept in inventory to take
care of high demand. In cases where raw material is available seasonally, the products are
manufactured and stocked as inventory to meet the demand of the finished product throughout
the year. e.g. agricultural produce.
7. Economies of scale. Products are manufactured at focused factories to achieve
economies of scale. This is done because of the availability of the latest technology, raw
materials and skilled labour. Hence the product is kept in store for distribution to consumption
centers as and when it is required.
8. Other reasons for keeping inventory include:
- To take advantage of quantity discounts.
- As a necessary part of production process e.g. the maturing of whisky/ wine products.
- Case of critical and strategic products such as petroleum or cereals (strategic food
reserve).
- As a contractual obligation e.g. the oil sector.
Note:
Sometimes the firm finds itself with stocks for non-logical reasons such as
1. Obsolete items are retained in stock.
2. Poor or non-existent inventory control policies resulting in larger than required orders,
replenishment orders being out of phase with production etc.
3. Inadequate or non-existent stock records.
4. Poor liaison between production control, purchasing and marketing departments.
5. Sub-optimal decision-making e.g. production department might increase WIP so as to ensure
long production runs (conflict between the functional departments).
Inventory Costs
These can broadly be categorized into 4 groups.
a) Holding costs/carrying costs.
These are costs incurred because a firm owns or maintains inventories e.g.:
i) Opportunity cost of capital i.e. money tied up in stock is not available to be
used or invested elsewhere e.g. interest foregone.
ii) Storage/warehouse costs such as personnel, equipment, running the
warehouse….etc.
iii) Insurance against fire, theft …etc.
iv) Security costs such as security personnel, alarm systems, electric fencing…etc.
v) Perishability costs. These are for perishable items such as edibles (bread, milk,
fruits, vegetables, newspapers and other periodicals; tax reports, fresh
flowers… etc.)
vi) Obsolescence costs: These are usually due to being overtaken by technology.
This is quite prevalent in the electronics industry e.g. computers, mobile
phones.
vii) Pilferages.
viii) Spillages for liquids and gases etc.
ix) Damages e.g. breakages (fragile items like glass, tiles etc.)
b) Ordering / Procurement costs.
These are the costs of getting the items into the firm’s inventory. They are typically incurred
each time an order is made.
Examples:
i) Purchasing dept. costs: include personnel, equipment’s, communication costs
(Phone, internet, fax), consumables (stationery, ribbons, ink, etc.)
ii) Transportation costs.
iii) Insurance on transit.
iv) Taxes e.g. customs duty for an imported item.
v) Clearing and forwarding charges.
vi) Handling costs e.g. loading, pilferages, damages (breakages & spillages)
vii) Exchange rate differentials. In the case of imported products, valuation is done
on the basis of the current exchange rates in the market. Any fluctuations
may increase or decrease the value of the product. Due to this, there is the risk
of selling stocks at prices lower than the landed costs.

c) Shortage costs.
These are incurred as a result of the item not being in stock.
Examples:
i) Loss of goodwill; could lead to loss of customers.
ii) Contribution lost, due to not making a sale.
iii) Back order costs - these are costs of dealing with disappointed customers.
iv) Costs of idle resources e.g.:- production personnel being paid when there’s a
raw material missing.
v) Cost of having to speed up orders e.g. personnel working overtime, using a faster
transportation mode (and hence more costly)
d) Purchase Costs:
This is what is paid to the supplier /seller by the buyer in exchange of the product.
Inventory is usually a large investment for many firms. It is normally the second largest item in
the balance sheet among the assets after fixed assets. Thus, inventory should only be held if
the benefits (service to customers) exceed the inventory costs. Also in inventory modeling,
purchase costs are a relevant factor to inventory policy due to availability of quantity discounts.
Thus, for inventories, Total Cost (TC) = Purchase costs + Holding costs + Ordering costs + Shortage costs.
The objective of any inventory management systems or models is to minimize these total costs.
DETERMINISTIC INVENTORY CONTROL MODELS
There are generally two types of inventory management models:
 Deterministic and
 Stochastic
A brief comparison is given below;
Differences between deterministic and probabilistic models
Deterministic                                              Stochastic (probabilistic)
i) Certainty model; factors known with certainty           i) Models to cope with uncertainty; factors
and usually constant                                       are uncertain and usually variable
ii) Simple model                                           ii) More complex model
iii) Not very realistic                                    iii) Reflects reality better than the deterministic model

Basic Economic Order Quantity (EOQ) Model


Assumptions of Basic EOQ Model

1. Demand is constant and known with certainty.


2. Lead-time is constant and known with certainty
3. There are no shortages - hence no shortage cost (no stock-outs). This is implied by
assumptions 1 & 2.
4. All items for a given order arrive in one batch or at the same time. i.e. simultaneous or
instantaneous arrival. In particular they do not arrive gradually.
5. Purchase cost is constant i.e. no discounts, hence for the basic EOQ model, purchase cost
is irrelevant since total purchase cost is the same regardless of the quantity ordered.
6. Holding cost per unit p.a. is constant. This implies that total holding cost is an increasing
linear function of quantity of stock in the year.
7. Ordering cost per order is constant irrespective of size of order.

EXAMPLE
Carpet Discount Store in North Georgia stocks carpet in its warehouse and sells it through an
adjoining showroom. The store keeps several brands and styles of carpet in stock; however, its
biggest seller is Super Shag carpet. The store wants to determine the optimal order size and total
inventory cost for this brand of carpet given an estimated annual demand of 10,000 yards of
carpet, an annual carrying cost of $0.75 per yard, and an ordering cost of $150. The store would
also like to know the number of orders that will be made annually and the time between orders
(i.e., the order cycle) given that the store is open every day except Sunday, Thanksgiving Day,
and Christmas Day (which is not on a Sunday).

SOLUTION:

Cc = $0.75 per yard

Co = $150

D = 10,000 yards

The optimal order size is

Qopt = √(2 Co D / Cc) = √((2 × 150 × 10,000) / 0.75) = √4,000,000 = 2,000 yards

The total annual inventory cost is determined by substituting Qopt into the total cost formula:

TCmin = (Cc Qopt)/2 + (Co D)/Qopt = (0.75 × 2,000)/2 + (150 × 10,000)/2,000 = 750 + 750 = $1,500

The number of orders per year is computed as follows:

Number of orders = D / Qopt = 10,000 / 2,000 = 5 orders per year

Given that the store is open 311 days annually (365 days minus 52 Sundays, Thanksgiving, and
Christmas), the order cycle is

Order cycle time = 311 / (D/Qopt) = 311 / 5 = 62.2 store days
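
The same arithmetic can be scripted. The sketch below applies the standard basic-EOQ formulas to the carpet data above; it assumes the basic EOQ model and its assumptions as stated in these notes.

# Basic EOQ calculations for the carpet example (sketch of the standard formulas).
import math

D, Cc, Co, store_days = 10_000, 0.75, 150, 311

Q_opt = math.sqrt(2 * Co * D / Cc)              # optimal order size (yards)
total_cost = Cc * Q_opt / 2 + Co * D / Q_opt    # annual carrying + ordering cost
orders_per_year = D / Q_opt
order_cycle = store_days / orders_per_year      # store days between orders

print(f"Q* = {Q_opt:.0f} yd, TC = ${total_cost:.2f}, "
      f"{orders_per_year:.0f} orders/yr, cycle = {order_cycle:.1f} store days")
# Q* = 2000 yd, TC = $1500, 5 orders per year, cycle of about 62.2 store days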

TIME SERIES ANALYSIS


A time series is a collection of observations of well-defined data items obtained through repeated
measurements over time. For example, measuring the value of retail sales each month of the year
would comprise a time series. This is because sales revenue is well defined, and consistently
measured at equally spaced intervals. Data collected irregularly or only once are not time series.

Time Series Analysis is used for many applications such as:


 Economic Forecasting
 Sales Forecasting
 Budgetary Analysis
 Stock Market Analysis
 Yield Projections
 Process and Quality Control
 Inventory Studies
 Workload Projections
 Utility Studies
 Census Analysis

The factors that are responsible to bring about changes in a time series, also called the
components of time series, are as follows:
1. Secular Trend (or General Trend)
2. Seasonal Movements
3. Cyclical Movements
4. Irregular Fluctuations

Secular Trend:
The secular trend is the main component of a time series which results from long term effect of
socio-economic and political factors. This trend may show the growth or decline in a time series
over a long period. This is the type of tendency which continues to persist for a very long period.
Prices, export and imports data, for example, reflect obviously increasing tendencies over time.
Seasonal Trend:
These are short term movements occurring in a data due to seasonal factors. The short term is
generally considered as a period in which changes occur in a time series with variations in
weather or festivities. For example, it is commonly observed that the consumption of ice-cream
during summer is generally high, and hence the sales of an ice-cream dealer would be higher in some
months of the year and relatively lower during winter months. Employment, output, export etc.
are subjected to change due to variation in weather. Similarly sales of garments, umbrella,
greeting cards and fire-work are subjected to large variation during festivals like Valentine’s
Day, Eid, Christmas, New Year etc. These types of variation in a time series are isolated only
when the series is provided biannually, quarterly or monthly.
Cyclic Movements:
These are long term oscillations occurring in a time series. These oscillations are mostly observed
in economic data and the periods of such oscillations generally extend from five to twelve
years or more. These oscillations are associated with the well-known business cycles. These cyclic
movements can be studied provided a long series of measurements, free from irregular
fluctuations is available.
Irregular Fluctuations:
These are sudden changes occurring in a time series which are unlikely to be repeated. This is the
component of a time series which cannot be explained by trend, seasonal or cyclic movements. It
is because of this fact that these variations are sometimes called the residual or random component. These
variations, though accidental in nature, can cause a continual change in the trend, seasonal and
cyclical oscillations during the forthcoming period. Floods, fires, earthquakes, revolutions,
epidemics and strikes etc., are the root cause of such irregularities.
EXAMPLE
The amount of cotton grown in the country during 2011 to 2014 is shown in the following table.
Cotton Grown (Kg, 000)
Quarter
I II III IV
Year
2011 25 22 24 27
2012 28 23 25 29
2013 30 25 27 31
2014 32 27 29 33

By using moving average of suitable order, compute the trend values.


Year Quarter Production Moving Moving Moving S=Y-T
(Y) Total average average
order 4 order 4 centered
(T)

2011 I 25

II 22

98 24.5

III 24 24.875 -0.875

101 25.25

IV 27 25.375 1.625

102 25.5

2012 I 28 25.625 2.375

103 25.75

II 23 26.00 -3.00

105 26.25

III 25 26.50 -1.50

107 26.75

IV 29 27.00 2.00

109 27.25

2013 I 30 27.50 2.5

111 27.75

II 25 28.00 -3.00

113 28.25

III 27 28.50 -1.50

115 28.75
IV 31 29.00 2.00

117 29.25

2014 I 32 29.50 2.50

119 29.75

II 27 30.00 -3.00

121 30.25

III 29

IV 33
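
The trend column T can be reproduced with a short pandas sketch. The rolling-window calls below mirror the 4-quarter moving total/average and the centering step used in the table; the shift used for centering is an implementation detail, not something stated in the notes.

# Reproducing the centered 4-quarter moving average trend with pandas (sketch).
import pandas as pd

y = pd.Series([25, 22, 24, 27, 28, 23, 25, 29,
               30, 25, 27, 31, 32, 27, 29, 33])   # quarterly cotton production, 2011 I to 2014 IV

ma4 = y.rolling(window=4).mean()                  # 4-quarter moving average (order 4)
trend = ma4.rolling(window=2).mean().shift(-2)    # centre the average on the quarters (T)
seasonal = y - trend                              # S = Y - T

print(pd.DataFrame({"Y": y, "T": trend.round(3), "S=Y-T": seasonal.round(3)}))
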
NETWORK DIAGRAMS AND SCHEDULE ANALYSIS
Network diagrams are schematic displays of project schedule activities and the
interdependencies between these activities. When developed properly, this graphical view of a
project’s activities conveys critical schedule characteristics required to effectively analyze and
adjust schedules – thus resulting in accurate and feasible schedules. This document addresses
what should be considered in the development of a network diagram, how network diagrams are
created, and how they may be analyzed to identify necessary corrective actions and ensure
optimal schedule definition.
Project Scheduling and Control Techniques
1. Critical Path Method (CPM)
2. Program Evaluation and Review Technique (PERT)

Project Network
• Network analysis is the general name given to certain specific techniques which can be used for
the planning, management and control of projects
• Use of nodes and arrows

Arrows
An arrow leads from tail to head directionally
– Indicate ACTIVITY, a time consuming effort that is required to perform a part of the
work.
Nodes 
A node is represented by a circle
- Indicate EVENT, a point in time where one or more activities start and/or finish

• Activity
– A task or a certain amount of work required in the project
– Requires time to complete
– Represented by an arrow
• Dummy Activity
– Indicates only precedence relationships
– Does not require any time or effort

• Event
– Signals the beginning or ending of an activity
– Designates a point in time
– Represented by a circle (node)
• Network
– Shows the sequential relationships among activities using nodes and arrows
 Activity-on-node (AON) nodes represent activities, and arrows show precedence relationships
 Activity-on-arrow (AOA) arrows represent activities and nodes are events for points in time

[Figure: AOA project network for building a house - activities: design house and obtain financing (3), lay foundation (2), order and receive materials (1), build house (3), select paint (1), select carpet (1), finish work (1), plus a dummy activity]

[Fig AON Project Network for House - the same project drawn with activities on the nodes, starting from a Start node]

Situations in network diagram

[Figure: A must finish before either B or C can start]

[Figure: Both A and B must finish before C can start]

[Figure: Both A and C must finish before either of B or D can start]

[Figure: A must finish before B can start; both A and C must finish before D can start (a dummy activity carries the A-to-D precedence)]

[Fig Concurrent Activities: the concurrent activities "lay foundation" and "order materials" - (a) incorrect precedence drawn between a single pair of nodes, (b) correct precedence using a dummy activity]

CPM - Critical Path Method


History of CPM
 E I Du Pont de Nemours & Co. (1957) for construction of new chemical plant and maintenance
shut-down
 Deterministic task times
 Activity-on-node network construction
 Repetitive nature of jobs

Planning a project usually involves dividing it into a number of small tasks that can be assigned to
individuals or teams. The project’s schedule depends on the duration of these tasks and the sequence in
which they are arranged. This sequence can be driven by several factors: customer deadlines, availability
of personnel or resources, and dependencies among tasks.
DuPont developed a Critical Path Method (CPM) designed to address the challenge of shutting down
chemical plants for maintenance and then restarting the plants once the maintenance had been completed.
Complex projects require a series of activities, some of which must be performed sequentially and others
that can be performed in parallel with other activities.
This collection of series and parallel tasks can be modeled as a network.
CPM models the activities and events of a project as a network. Activities are shown as nodes on the
network and events that signify the beginning or ending of activities are shown as arcs or lines between
the nodes.
The Figure below shows an example of a CPM network diagram:

Fig CPM network diagram


Steps in CPM Project Planning
1. Specify the individual activities.
2. Determine the sequence of those activities.
3. Draw a network diagram.
4. Estimate the completion time for each activity.
5. Identify the critical path (longest path through the network)
6. Update the CPM diagram as the project progresses.

1. Specify the individual activities

All the activities in the project are listed. This list can be used as the basis for adding sequence and
duration information in later steps.

2. Determine the sequence of the activities

Some activities are dependent on the completion of other activities. A list of the immediate predecessors
of each activity is useful for constructing the CPM network diagram.

3. Draw the Network Diagram

Once the activities and their sequences have been defined, the CPM diagram can be drawn. CPM
originally was developed as an activity on node network.

4. Estimate activity completion time


The time required to complete each activity can be estimated using past experience. CPM does not take
into account variation in the completion time.

5. Identify the Critical Path

The critical path is the longest-duration path through the network. The significance of the critical path is
that the activities that lie on it cannot be delayed without delaying the project. Because of its impact on
the entire project, critical path analysis is an important aspect of project planning.

The critical path can be identified by determining the following four parameters for each activity:

• ES - earliest start time: the earliest time at which the activity can start given that its precedent activities
must be completed first.
• EF - earliest finish time, equal to the earliest start time for the activity plus the time required to complete
the activity.
• LF - latest finish time: the latest time at which the activity can be completed without delaying the
project.
• LS - latest start time, equal to the latest finish time minus the time required to complete the activity.

The slack time for an activity is the time between its earliest and latest start time, or between its earliest
and latest finish time. Slack is the amount of time that an activity can be delayed past its earliest start or
earliest finish without delaying the project.

The critical path is the path through the project network in which none of the activities have slack, that is,
the path for which ES=LS and EF=LF for all activities in the path. A delay in the critical path delays the
project. Similarly, to accelerate the project it is necessary to reduce the total time required for the
activities in the critical path.

6. Update CPM diagram

As the project progresses, the actual task completion times will be known and the network diagram can be
updated to include this information.
A new critical path may emerge, and structural changes may be made in the network if project
requirements change.

CPM calculation

• Path
– A connected sequence of activities leading from the starting event to the ending event
• Critical Path
– The longest path (time); determines the project duration
• Critical Activities
– All of the activities that make up the critical path

Forward Pass
The forward pass goes from the initial task (the task with no predecessors) to the final task (the one with
no successors), visiting every task in every path and setting the ES and EF dates on the tasks.
The algorithm is similar to graph theory’s depth-first search, except that the forward pass follows every
path from initial to final task, while depth-first search stops when it arrives at a task that it’s already
visited. When the forward pass arrives at a task, it may change that task’s ES and EF dates, and that
change must be carried forward to the final task. During the forward pass, a task may be visited several
times as different paths through the network are followed. A task’s ES is determined by the predecessor
task with the latest EF, since a task can’t start until all of its predecessors have finished.
• Earliest Start Time (ES)
– earliest time an activity can start
– ES = maximum EF of immediate predecessors
• Earliest finish time (EF)
– earliest time an activity can finish
– earliest start time plus activity time EF= ES + t

Backward Pass
The backward pass goes from the final task to the initial task, visiting every task in every path and setting
the LS and LF dates on the tasks.
It’s similar to the forward pass in that it arrives at a task, it may change that task’s LS and LF dates, and
that change must be carried back to the initial task. The difference is:
• The forward pass sets the task's ES to the latest EF among its predecessors
• The backward pass sets the task's LF to the earliest LS among its successors.
The reason for the backward pass’s rule for setting LF is not as obvious as the forward pass’s rule. Any
start date, ES or LS, must be after the corresponding finish dates of all of the task’s predecessors. To
maintain this consistency, the backward pass must set a task’s LF to a value that’s earlier than the LS of
any of the task’s successors.
 Latest Start Time (LS)
Latest time an activity can start without delaying critical path time
LS = LF - t
 Latest finish time (LF)
Latest time an activity can be completed without delaying critical path time
LF = minimum LS of immediate successors
CPM analysis
• Draw the CPM network
• Analyze the paths through the network
• Determine the float for each activity
– Compute the activity’s float
float = LS - ES = LF - EF
– Float is the maximum amount of time that an activity can be delayed in its completion
before it becomes a critical activity, i.e., before it delays completion of the project
• Find the critical path, the sequence of activities and events with no "slack", i.e.
zero slack
– The longest path through the network
• Find the project duration, the minimum project completion time (see the sketch below)
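
As a minimal sketch of the forward and backward passes, the Python below computes ES, EF, LS, LF, float and the critical activities for a small set of hypothetical activities loosely based on the house-building example; the activity names and durations are assumptions for illustration, not data from the notes.

# Minimal CPM forward/backward pass sketch (hypothetical activities and durations).
from collections import defaultdict

# activity: (duration, [immediate predecessors]); listed in a valid topological order
tasks = {
    "A": (3, []),         # design house and obtain financing
    "B": (2, ["A"]),      # lay foundation
    "C": (1, ["A"]),      # order and receive materials
    "D": (3, ["B", "C"]), # build house
    "E": (1, ["C"]),      # select paint
    "F": (1, ["E"]),      # select carpet
    "G": (1, ["D", "F"]), # finish work
}

# Forward pass: ES = max EF of predecessors, EF = ES + duration
ES, EF = {}, {}
for name, (dur, preds) in tasks.items():
    ES[name] = max((EF[p] for p in preds), default=0)
    EF[name] = ES[name] + dur

project_duration = max(EF.values())

# Backward pass: LF = min LS of successors, LS = LF - duration
successors = defaultdict(list)
for name, (_, preds) in tasks.items():
    for p in preds:
        successors[p].append(name)

LS, LF = {}, {}
for name in reversed(list(tasks)):
    dur, _ = tasks[name]
    LF[name] = min((LS[s] for s in successors[name]), default=project_duration)
    LS[name] = LF[name] - dur

for name in tasks:
    slack = LS[name] - ES[name]                    # float = LS - ES = LF - EF
    flag = "critical" if slack == 0 else ""
    print(f"{name}: ES={ES[name]} EF={EF[name]} LS={LS[name]} LF={LF[name]} float={slack} {flag}")
print("Project duration:", project_duration)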

Program Evaluation and Review Technique (PERT)


 U S Navy (1958) for the POLARIS missile program
 Multiple task time estimates (probabilistic nature)
 Activity-on-arrow network construction
 Non-repetitive jobs

The Program Evaluation and Review Technique (PERT) is a network model that allows for
randomness in activity completion times. PERT was developed in the late 1950's for the U.S. Navy's
Polaris project having thousands of contractors. It has the potential to reduce both the time and cost
required to complete a project.

The Network Diagram


In a project, an activity is a task that must be performed and an event is a milestone marking the
completion of one or more activities. Before an activity can begin, all of its predecessor activities must be
completed. Project network models represent activities and milestones by arcs and nodes.

PERT is typically represented as an activity-on-arc network, in which the activities are represented on the
lines and milestones on the nodes. Figure 2-1 shows a simple example of a PERT diagram.

Figure PERT diagram.


The milestones generally are numbered so that the ending node of an activity has a higher number than
the beginning node. Incrementing the numbers by 10 allows for new ones to be inserted without
modifying the numbering of the entire diagram. The activities in the above diagram are labeled with
letters along with the expected time required to complete the activity.

Steps in the PERT Planning Process


PERT planning involves the following steps:

1. Identify the specific activities and milestones.


2. Determine the proper sequence of the activities.
3. Construct a network diagram.
4. Estimate the time required for each activity.
5. Determine the critical path.
6. Update the PERT chart as the project progresses.

1. Identify activities and milestones


The activities are the tasks required to complete the project. The milestones are the events marking the
beginning and end of one or more activities.

2. Determine activity sequence


This step may be combined with the activity identification step since the activity sequence is known for
some tasks. Other tasks may require more analysis to determine the exact order in which they must be
performed.

3. Construct the Network Diagram


Using the activity sequence information, a network diagram can be drawn showing the sequence of the
serial and parallel activities.
4. Estimate activity times
Weeks are a commonly used unit of time for activity completion, but any consistent unit of time can be
used.
A distinguishing feature of PERT is its ability to deal with uncertainty in activity completion times. For
each activity, the model usually includes three time estimates:

• Optimistic time (OT) - generally the shortest time in which the activity can be completed. (This is what
an inexperienced manager believes!)
• Most likely time (MT) - the completion time having the highest probability. This is different from
expected time. Seasoned managers have an amazing way of estimating very close to actual data from
prior estimation errors.
• Pessimistic time (PT) - the longest time that an activity might require.

The expected time for each activity can be approximated using the following weighted average:

Expected time = (OT + 4 x MT+ PT) / 6

This expected time might be displayed on the network diagram.

Variance for each activity is given by: [(PT - OT) / 6]²
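
As a quick, purely illustrative Python example of these two formulas (the three estimates below are hypothetical):

# Expected time and variance for one activity from its three estimates (hypothetical values, in weeks)
OT, MT, PT = 2.0, 4.0, 8.0            # optimistic, most likely, pessimistic
expected = (OT + 4 * MT + PT) / 6     # weighted average: (2 + 16 + 8) / 6 ≈ 4.33
variance = ((PT - OT) / 6) ** 2       # ((8 - 2) / 6)² = 1.0
print(expected, variance)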

5. Determine the Critical Path


The critical path is determined by adding the times for the activities in each sequence and determining the
longest path in the project. The critical path determines the total time required for the project.

If activities outside the critical path speed up or slow down (within limits), the total project time does not
change. The amount of time that a non-critical path activity can be delayed without delaying the project is
referred to as slack time.

If the critical path is not immediately obvious, it may be helpful to determine the following four quantities
for each activity:

• ES - Earliest Start time


• EF - Earliest Finish time
• LS - Latest Start time
• LF - Latest Finish time

These times are calculated using the expected time for the relevant activities. The ES and EF of each
activity are determined by working forward through the network and determining the earliest time at
which an activity can start and finish considering its predecessor activities.

The latest start and finish times are the latest times that an activity can start and finish without delaying
the project. LS and LF are found by working backward through the network. The difference in the latest
and earliest finish of each activity is that activity's slack.
The critical path then is the path through the network in which none of the activities have slack.

The variance in the project completion time can be calculated by summing the variances in the
completion times of the activities in the critical path. Given this variance, one can calculate the
probability that the project will be completed by a certain date.
Since the critical path determines the completion date of the project, the project can be accelerated by
adding the resources required to decrease the time for the activities in the critical path. Such a shortening
of the project sometimes is referred to as project crashing.

6. Update as project progresses


Make adjustments in the PERT chart as the project progresses. As the project unfolds, the estimated times
can be replaced with actual times. In cases where there are delays, additional resources may be needed to
stay on schedule and the PERT chart may be modified to reflect the new situation.

PERT analysis
 PERT is based on the assumption that an activity’s duration follows a probability distribution
instead of being a single value
 Three time estimates are required to compute the parameters of an activity’s duration distribution:

Mean (expected time): te = (to + 4 tm + tp) / 6

Variance: Vt = σ² = [(tp - to) / 6]²
 Draw the network.


 Analyze the paths through the network and find the critical path.
 The length of the critical path is the mean of the project duration probability distribution which is
assumed to be normal
 The standard deviation of the project duration probability distribution is computed by adding the
variances of the critical activities (all of the activities that make up the critical path) and taking
the square root of that sum
 Probability computations can now be made using the normal distribution table.

Probability computation
Determine probability that project is completed within specified time

Z = (x - μ) / σ

Where μ = tp = project mean time
σ = project standard deviation
x = (proposed) specified time
Figure Normal Distribution of Project Time
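
A short Python sketch of this computation using the standard library's NormalDist; the mean, standard deviation and proposed date below are hypothetical and simply anticipate the worked example later in these notes.

from statistics import NormalDist

mu, sigma = 23.0, 1.414               # hypothetical project mean time and standard deviation
x = 24.0                              # proposed (specified) completion time

z = (x - mu) / sigma                  # standardised value, here about 0.71
prob = NormalDist(mu, sigma).cdf(x)   # P(project is completed within x), here about 0.76
print(round(z, 2), round(prob, 4))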

PROJECT COST
Cost consideration in project
 Project managers may have the option or requirement to crash the project, or accelerate the
completion of the project.
 This is accomplished by reducing the length of the critical path(s).
 The length of the critical path is reduced by reducing the duration of the activities on the critical
path.
 If each activity requires the expenditure of an amount of money to reduce its duration by one unit of
time, then the project manager selects the least cost critical activity, reduces it by one time unit,
and traces that change through the remainder of the network.
 As a result of a reduction in an activity’s time, a new critical path may be created.
 When there is more than one critical path, each of the critical paths must be reduced.
 If the length of the project needs to be reduced further, the process is repeated.
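
A rough Python sketch of one pass of this rule is given below; the activity names, crash cost slopes and durations are hypothetical, and the critical path is assumed to have been found by a prior CPM analysis.

# One crashing iteration (hypothetical data): cut the least-cost critical activity by one time unit
crash_cost_per_unit = {"A": 400, "B": 500, "D": 200}   # cost of reducing each activity by one unit
duration = {"A": 12, "B": 8, "D": 4}
critical_path = ["A", "B", "D"]                        # taken from a prior CPM pass

cheapest = min(critical_path, key=crash_cost_per_unit.get)
duration[cheapest] -= 1                                # reduce it by one time unit
print(cheapest, duration)

# The network must now be re-analysed, because the critical path may have shifted;
# the step is repeated while further shortening is still required (and worthwhile).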

Project Crashing
• Crashing
– reducing project time by expending additional resources
• Crash time
– an amount of time an activity is reduced
• Crash cost
– cost of reducing activity time
• Goal
– reduce project duration at minimum cost
Activity crashing

Figure Time-Cost Relationship: activity cost plotted against activity time, showing the normal point
(normal time, normal cost), the crash point (crash time, crash cost), and the crashing line joining them,
whose slope is the crash cost per unit of time.
 Crashing costs increase as project duration decreases
 Indirect costs increase as project duration increases
 Reduce project length as long as crashing costs are less than indirect costs

Figure Time-Cost Tradeoff: total project cost is the sum of the direct (crashing) cost and the indirect
cost; the minimum of the total cost curve identifies the optimal project time.

PERT/CPM Chart
A project has been defined to contain the following list of activities along with their required times for
completion

Activity No   Activity                    Expected completion time   Dependency
1             Requirements collection     5                          -
2             Screen design               6                          1
3             Report design               7                          1
4             Database design             2                          2,3
5             User documentation          6                          4
6             Programming                 5                          4
7             Testing                     3                          6
8             Installation                1                          5,7

Fig CPM Chart: activity-on-node network for activities 1–8 drawn from the table above, with the earliest
expected completion times marked on the nodes (TE = 11 for activity 2, 12 for activity 3, 14 for activity 4,
19 for activity 6, 20 for activity 5, 22 for activity 7 and 23 for activity 8).

Using information from the table, indicate expected completion time for each activity.
Fig CPM Chart with TE: the same network with the expected completion time written beside each
activity (for example TE = 5 for activity 1, TE = 12 for activity 3, TE = 14 for activity 4, TE = 19 for
activity 6 and TE = 22 for activity 7).

Calculate earliest expected completion time for each activity (TE) and the entire project.
The earliest expected completion time for a given activity is determined by summing the expected
completion time of this activity and the earliest expected completion time of the immediate predecessor.

Rule: if two or more activities precede an activity, the one with the largest TE is used in calculation (e.g.,
for activity 4, we will use TE of activity 3 but not 2 since 12 > 11).
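
A small Python sketch that applies this rule to the activity table above (the times and dependencies are taken directly from the table):

# Forward pass over the example: TE = activity time + largest TE among its predecessors
tasks = {            # activity number: (expected completion time, immediate predecessors)
    1: (5, []), 2: (6, [1]), 3: (7, [1]), 4: (2, [2, 3]),
    5: (6, [4]), 6: (5, [4]), 7: (3, [6]), 8: (1, [5, 7]),
}

TE = {}
for a, (t, preds) in tasks.items():                 # predecessors appear before successors
    TE[a] = t + max((TE[p] for p in preds), default=0)

print(TE)   # {1: 5, 2: 11, 3: 12, 4: 14, 5: 20, 6: 19, 7: 22, 8: 23}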

The critical path is 1, 3, 4, 6, 7, 8.


Fig critical path: the same network with the critical path 1 → 3 → 4 → 6 → 7 → 8 highlighted; the TE
values 5, 12, 14, 19, 20, 22 and 23 appear on the nodes.


The critical path represents the shortest time in which a project can be completed. Any activity on the
critical path that is delayed in completion delays the entire project. Activities not on the critical path
contain slack time and allow the project manager some flexibility in scheduling.
PERT
Activity   Immed. Predec.   Optimistic Time (Hr.)   Most Likely Time (Hr.)   Pessimistic Time (Hr.)
A -- 4 6 8
B -- 1 4.5 5
C A 3 3 3
D A 4 5 6
E A 0.5 1 1.5
F B,C 3 4 5
G B,C 1 1.5 5
H E,F 5 6 7
I E,F 2 5 8
J D,H 2.5 2.75 4.5
K G,I 3 5 7

D
3 5

A E H J

1 C 4 7

B I K
F

G
2 6

Fig PERT Network


Activity Expected Time Variance
A 6 4/9
B 4 4/9
C 3 0
D 5 1/9
E 1 1/36
F 4 1/9
G 2 4/9
H 6 1/9
I 5 1
J 3 1/9
K 5 4/9
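
The expected times and variances in this table follow directly from the two PERT formulas; a short Python sketch that reproduces them from the three-estimate table above:

# Expected time and variance for each activity, using the three estimates listed earlier
estimates = {     # activity: (optimistic, most likely, pessimistic) time in hours
    "A": (4, 6, 8),     "B": (1, 4.5, 5),      "C": (3, 3, 3),   "D": (4, 5, 6),
    "E": (0.5, 1, 1.5), "F": (3, 4, 5),        "G": (1, 1.5, 5), "H": (5, 6, 7),
    "I": (2, 5, 8),     "J": (2.5, 2.75, 4.5), "K": (3, 5, 7),
}
for act, (o, m, p) in estimates.items():
    te = (o + 4 * m + p) / 6          # expected time
    var = ((p - o) / 6) ** 2          # variance
    print(act, round(te, 2), round(var, 3))   # e.g. A 6.0 0.444, I 5.0 1.0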

Activity ES EF LS LF Slack
A 0 6 0 6 0 *critical
B 0 4 5 9 5
C 6 9 6 9 0*
D 6 11 15 20 9
E 6 7 12 13 6
F 9 13 9 13 0*
G 9 11 16 18 7
H 13 19 14 20 1
I 13 18 13 18 0*
J 19 22 20 23 1
K 18 23 18 23 0*

The estimated project completion time is the Max EF at node 7 = 23.

s² = s²A + s²C + s²F + s²I + s²K

= 4/9 + 0 + 1/9 + 1 + 4/9

= 2

σ (critical path) = √2 = 1.414

z = (24 - 23) / 1.414 = 0.71
From the Standard Normal Distribution table:

P(z < .71) = .5 + .2612 = .7612


f(x) curve: the shaded area P(T < 24) = .7612 is made up of .5000 below the mean plus .2612 between
the mean and x = 24.
Fig Probability the project will be completed within 24 hrs

PERT/COST

Activity   Normal time   Normal cost   Crash time   Crash cost   Allowable crash time   Slope
1          12            3000          7            5000         5                      400
2          8             2000          5            3500         3                      500
3          4             4000          3            7000         1                      3000
4          12            50000         9            71000        3                      7000
5          4             500           1            1100         3                      200
6          4             500           1            1100         3                      200
7          4             1500          3            22000        1                      7000
Total                    75000                      110700

Table (1) Time Cost data
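
The slope column is the crash cost per unit of time saved, i.e. (crash cost - normal cost) / (normal time - crash time). A one-line Python check using activity 1's figures from the table:

# Crash cost slope for activity 1 (figures from Table (1))
normal_time, crash_time = 12, 7
normal_cost, crash_cost = 3000, 5000
slope = (crash_cost - normal_cost) / (normal_time - crash_time)
print(slope)   # 400.0, matching the table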


Fig PERT Network: network of activities 1–7, drawn from Table (1), with each activity's normal time
shown on its arc.

Fig Project duration: the same network annotated with each activity's crash cost slope; project
duration = 36.


Fig modified Project duration: the network after crashing, with project duration = 31 and additional
cost = R2000.

Summary

Program evaluation and review technique (PERT) charts depict task, duration, and dependency
information. Each chart starts with an initiation node which is the first task. Each task is represented by a
node (Activity on Node Network – AON) with lines connecting dependent tasks.
Each task is connected to its successor tasks in this manner forming a network of nodes and connecting
lines. The chart is complete when all final tasks come together at the completion node. When slack time
exists between the end of one task and the start of another, the usual method is to draw a broken or dotted
line between the end of the first task and the start of the next dependent task.
PERT charts are usually drawn on ruled paper with the horizontal axis indicating time period divisions in
days, weeks, months, and so on. Many PERT charts terminate at the major review points, such as at the
end of the analysis.
Critical Path Method (CPM) charts are similar to PERT charts and are sometimes known as PERT/CPM.
In a CPM chart, the critical path is indicated. A critical path consists of the set of dependent tasks (each
dependent on the preceding one) which together take the longest time to complete. Tasks which fall on
the critical path should be noted in some way, so that they may be given special attention. One way is to
draw critical path tasks with a double line instead of a single line.
Tasks which fall on the critical path should receive special attention by both the project manager and the
personnel assigned to them.
The critical path for any given project may shift as the project progresses; this can happen when tasks are
completed either behind or ahead of schedule, causing other tasks which may still be on-schedule to fall
on the new critical path.
Critical path computations are quite simple, yet they provide valuable information that simplifies the
scheduling of complex projects. The result is that PERT-CPM techniques enjoy tremendous popularity
among practitioners in the field. The usefulness of the techniques is further enhanced by the availability
of specialized computer systems for executing, analyzing, and controlling network projects.
CPM Benefits

• Provides a graphical view of the project.


• Predicts the time required to complete the project.
• Shows which activities are critical to maintaining the schedule and which are not.
CPM Limitations

While CPM is easy to understand and use, it does not consider the time variations that can have a great
impact on the completion time of a complex project. CPM was developed for complex but fairly routine
projects with minimum uncertainty in the project completion times.
For less routine projects there is more uncertainty in the completion times, and this uncertainty limits its
usefulness.

Benefits of PERT
PERT is useful because it provides the following information:

• Expected project completion time.


• Probability of completion before a specified date.
• The critical path activities that directly impact the completion time.
• The activities that have slack time and that can lend resources to critical path activities.
• Activity start and end dates.

Limitations of PERT
The following are some of PERT's limitations:

• The activity time estimates are somewhat subjective and depend on judgment. In cases where there is
little experience in performing an activity, the numbers may be only a guess. In other cases, if the person
or group performing the activity estimates the time there may be bias in the estimate.
• The underestimation of the project completion time that occurs when alternate paths become critical is
perhaps the most serious limitation.
