DEFINITION OF CONCEPTS
STATISTICS
Today, statistics or more specifically statistical method is used extensively in almost all phases of human
endeavor. In ancient times, it dealt with the affairs of the state, like collection of information (or data)
regarding population and property or wealth of the state so as to sub-serve political purposes of rulers.
Today, its influence has spread to various areas, such as agriculture, business, economics, medicine,
biology, education, electronics, sociology, psychology, political science, and many other branches of
science and technology.
The origin of the word statistics may be traced to the Latin word ‘status’, the Italian word ‘statista’ or
the German word ‘Statistik’, each meaning political state. As time progressed, the idea behind the word
statistics has undergone a phenomenal change. Over time, the character of the information provided has
been extended to any particular sphere of human activity. Statistics as statistical data forms the backbone
of many disciplines. By statistical data we mean numerical statements of facts, while statistical methods
deal with the principles and techniques used in collecting and analyzing such data.
What is statistics?
The word ‘Statistics’ was first introduced by a German Scholar, Gottfried Achenwall, in the middle of the
18th century. From the very name, it is felt that it must be related to the administrative functioning of
state supplying facts regularly and quantitatively regarding its various fields of administration. Today,
statistics as a separate discipline from mathematics is closely associated with almost all branches of
education and human endeavour which are mostly numerically representable. In modern times, it has
innumerable and varied applications both qualitatively and quantitatively.
To be more precise, statistics refers to classified facts that represent the conditions of the people in a state
especially those facts which can be stated in numbers or in any tabular or classified arrangements. One of
the widely accepted definitions came from Horace Secrist. He defines statistics as, “By statistics we
mean aggregates of facts, affected to a marked extent by a multiplicity of causes, numerically
expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a
systematic manner for a predetermined purpose and placed in relation to each other.”
4. It is liable to be misused:
As W.I. King points out, “One of the short-comings of statistics is that they do not bear on their face the label
of their quality.” We can check the data and the procedures used to reach conclusions, but the data may
have been collected by inexperienced, dishonest or biased persons. Statistics is a delicate science and can
easily be misused by an unscrupulous person, so data must be used with caution. Otherwise the results
may prove to be disastrous.
Utmost care and precautions should be taken for the interpretation of statistical data in all its
manifestations. Statistics should not be used as a blind man uses a lamp-post for support instead of
illumination. However, it is a misapprehension that statistics can be used effectively only by expert
statisticians, as the following remark due to Wallis and Roberts makes clear: ‘He who
accepts statistics indiscriminately will often be duped unnecessarily. But he who distrusts statistics
indiscriminately will often be ignorant unnecessarily’. There is an accessible alternative between blind
gullibility and blind distrust. It is possible to interpret statistics skillfully. The art of interpretation need
not be monopolized by statisticians, though, of course, technical statistical knowledge helps. Many
important ideas of technical statistics can be conveyed to the non-statistician without distortion or
dilution. Statistical interpretation depends not only on statistical ideas but also on ordinary clear thinking.
Clear thinking is not only indispensable in interpreting statistics but is often sufficient even in the absence
of specific statistical knowledge. For the statistician not only death and taxes but also statistical fallacies
are unavoidable. With skill, common sense, patience and above all objectivity, their frequency can be
reduced and their effects minimized.
Conclusion
Numerous such examples can be constructed to illustrate the misuse of statistical methods. This is all
due to their injudicious application and interpretation, for which the science of statistics cannot be
blamed. Many people disbelieve statistics because it does not prove a particular thing in a particular
manner.
It should be clearly understood that statistics does not prove anything. Statistics is only a method of
approach; it is a tool in the hands of a statistician to present a phenomenon in a particular manner and
nothing beyond. The science of statistics doesn’t prove or disprove a thing; it merely presents the true
facts about a problem and leaves the rest to other people. Different types of conclusions can be arrived at
from the same set of figures if there is a difference in the approach of various persons. From one set of
figures, a socialist can prove that a country has eliminated unemployment and improved the lot of the
working class, and from the same set of figures an anti-socialist can derive the opposite conclusion. This
fundamental difference in approach or bias in the minds of the investigators has been responsible for
different conclusions being drawn from the same set of figures. For this, the science of statistics cannot
be blamed. It is not the fault of the science. It is the mischief of those who use it.
It can thus be used with much confidence because of its direct nature of collection. The primary data are
usually published by the authorities who are directly responsible for its collection. However, it requires
enough manpower, time and money to make the process successful. These data are collected in a natural
setting, e.g. through experiments or directly from the field, either by observation or through
direct communication with respondents by mail, telephone or personal interview.
Secondary Data
These are data that are used for a purpose other than that for which they were originally
collected. Data collected by other investigators and institutions, and then used for research purposes that
differ from the original reason for collection, are included in this category. The most important
advantage here is that secondary data require less manpower and time and are therefore less costly to
obtain. In practice, however, they frequently contain a number of errors due to erroneous transcription,
faulty rounding-off (up or down), etc., and are therefore less dependable for researchers.
The investigators and the scholars working with them should therefore be very careful while using them
in their own fields. It is quite clear that data which are primary in nature at the
origin for one use become secondary in quality and character for other uses.
Therefore, the distinction between primary and secondary data is one of degree only. A particular set of
data may be primary in the hands of the data-collecting authority but secondary to other people using them
afterwards. Prof. H. Secrist in this context says: “The distinction between primary and secondary data
is largely one of degree. Data which are secondary in the hands of one party may be primary in the
hands of another.”
However, the method of collection of primary data and secondary data must not be identical in nature
because, in the former case data are collected originally while, in the latter case, data are to be taken up in
the nature of compilation.
There exist various methods for the collection of the primary and secondary data. Choice of the exact
method depends largely on the nature, object and scope of statistical investigation.
Archival records are an example of secondary data. They include public records,
judicial records, the mass media and private records such as letters, autobiographies and paintings; all
these constitute what we call ARCHIVAL RECORDS. The methods used in deriving secondary data are
observation and content analysis (document analysis).
1. First, it generally contains a detailed description and information of the definition for the terms used.
2. Secondly, since secondary data are second-hand data or ‘finished products’, an element of error may
creep in afterwards. This may then give some misleading information. Primary data cannot have such
errors.
3. Thirdly, in the primary data precise definition of the terms used is given and the scope of the data is
clearly mentioned.
4. Finally, primary data often include a description of the method or procedure followed and any
approximation used, so that one can find their limitations. On the contrary, secondary data usually lack such information.
Despite these advantages of primary data, secondary data are extensively used particularly when a large
number of items are required. The secondary data seem to be of minor importance, especially when
collection of primary data is very expensive and time-consuming. Secondary data invariably give less
meaning of the statistics and frequently present no explanation other than the captions and footnotes in the
tables. In fact, some information is usually suppressed in secondary data.
1. First, cost of collection of data is less. That is why data produced by the Government, companies and
various organizations are readily available.
3. Thirdly, much of the secondary data available has been collected for many years and, therefore, it can
be used to study the trends.
4. Fourthly and most importantly, secondary data is of great value to the government, business world and
industry and also for research organizations. Again, secondary data help the government in making
present policy decisions and also planning for future economic policies.
1. First, the method employed for the collection of such data is often unsatisfactory. That is why,
secondary data in most cases, is subject to transcribing errors (i.e., errors occurring due to wrong
transcriptions of the primary data).
2. Secondly, secondary data are really mere estimates and not the facts.
3. Thirdly, scrutiny of the secondary data is obviously essential since errors may creep in due to unwanted
bias. This is because of the fact that often fictitious figures are recorded unknowingly in secondary data.
In this sense, secondary data may be not only inaccurate but also incomplete and inadequate. Without
detailed scrutiny of the secondary data, one is not advised to use them.
Precautions for Using Secondary Data:
Since secondary data is second-hand data, one must know as much about it as possible. In other words,
certain precautions are to be taken before using secondary data so that the data become really helpful to
the government or the researcher.
(i) The scope and object of enquiry for which the data were originally collected;
(iii) The period (normal or unusual time) and the area covered for collection;
(iv) The reliability, integrity and dependability of the data collectors engaged;
(v) Precise definitions of the terms used and their units of measurement considered while the data were
collected;
(vi) Interpreting the data, especially when figures collected for one purpose are used for other fields.
Thus, one can say that secondary data first should be reliable. There are various ways of testing such
reliability of the data. Secondly, secondary data should be used in such a manner that they can suitably
serve the purpose of investigation and experimentation. Finally, in addition to reliability and suitability of
the data, it should be declared adequate and accurate as far as practicable.
Disadvantages of Interviews
-Expensive – The researcher has to travel to meet respondents.
-Skills – Interviewing requires a high level of communication and interpretation skills.
-Bias – Interviewers need to be trained to avoid bias.
-Time consuming – Interviews take time, so only smaller samples are practical; this becomes a
constraint if a researcher is interested in using a big sample.
-Influence on respondents – Responses may be influenced by the respondent's reaction to the
interviewer.
Observational guide
Observation methods
It is mostly used in studies relating to consumer behavior.
Advantages
The information is obtained by the investigator's own observation without asking
the respondent.
Subjective bias is eliminated if observation is done accurately.
Information obtained relates to what is currently happening. It is not complicated
by either past behavior or future intentions or attitudes.
Disadvantages
Questionnaires
Advantages
1) Low cost even when the universe is large and spread widely geographically.
2) Free from the bias of the interviewer
3) Respondents have time to give well thought answer
4) Respondents who are not easily approachable can also be reached conveniently
5) Large samples can be made use of and thus the results made more dependable and
reliable.
Disadvantages
1) Low rate of return of duly filled questionnaires.
2) Can only be used when respondents are educated and cooperative.
3) Inherent inflexibility, because it is always difficult to amend the approach once
questionnaires have been dispatched.
4) Possibility of ambiguous replies, or omission of replies altogether, to certain questions.
5) Difficulty of knowing whether willing respondents are truly representative.
6) The problem of constantly updating the mailing list.
The presentation of the data is broadly classified into the following categories:
(i) Tabular presentation
(ii) Diagrammatic or graphic presentation
A statistical table is an orderly and logical arrangement of data into rows and columns and it
attempts to present the voluminous and heterogeneous data in a condensed and homogeneous
form. But before tabulating the data, generally, systematic arrangement of the raw data into
different homogeneous classes is necessary to sort out the relevant and significant features
(details) from the irrelevant and insignificant ones.
This process of arranging the data into groups or classes according to resemblances and
similarities is technically called classification. Thus, classification of the data is preliminary to
its tabulation. It is thus, the first step in tabulation because the items with similarities must be
brought together before the data are presented in the form of a ‘table’
Functions of classification
(i) It condenses the data. Classification presents the huge unwieldy raw data in a
condensed form which is readily comprehensible to the mind and attempts to highlight
the significant features contained in the data.
(ii) It facilitates comparisons. Classification enables us to make meaningful comparisons
depending on the basis or criterion of classification. For instance, the classification of
students in a college according to sex enables us to make a comparative study of the
prevalence of college education among males and females.
(iii) It helps to study the relationship. The classification of the given data w.r.t two or more
criteria, say, the sex of the students and the faculty they join in a university will enable us
to study the relationship between these two criteria.
(iv) It facilitates the statistical treatment of the data. The arrangement of the voluminous
heterogeneous data into relatively homogeneous groups or classes according to their
points of similarities introduces homogeneity or uniformity amidst diversity and makes it
more intelligible, useful and readily amenable for further processing like tabulation,
analysis and interpretation of the data.
Bases of classification
The bases or the criteria w.r.t which the data are classified primarily depend on the objectives
and purpose of the enquiry. Generally, the data can be classified on the following four bases:
(i) Geographical, i.e. Area-wise or regional: For example, data can be classified in terms
of the yield of agricultural output per hectare for different countries or regions in some
given period.
(ii) Chronological classification: here the data are classified on the basis of differences in
time, e.g., the production of an industrial concern for different periods. The profits of a
big business house over different years; the population of any country for different years
etc. The time series data, which are quite frequent in Economic and Business Statistics,
are generally classified chronologically, usually starting with the first period of
occurrence.
(iii) Qualitative classification: When the data are classified according to some qualitative
phenomena which are not capable of quantitative measurement like honesty, beauty,
employment, intelligence, sex, etc., the classification is termed as qualitative or
descriptive or w.r.t attributes. In qualitative classification the data are classified
according to the presence or absence of the attributes in the given units.
(iv) Quantitative classification: where the data are classified on the basis of phenomenon
which is capable of quantitative measurement like age, height, weight, prices, production,
income, expenditure, sales, profits etc. The quantitative phenomenon under study is
known as variable and hence this classification is also sometimes called classification by
variables.
In order to present and analyze data in a logical and meaningful way, it is necessary to understand some of
the natural forms that they take.
There are various ways of classifying data; these are as follows:
Preciseness
Data can either be measured precisely (described as discrete) or by approximation (described as
continuous).
Discrete Data
Discrete data can be obtained by counting, e.g. the number of students taking HRM or DBM in the Diploma Class.
They can also arise in situations where counting is not involved, e.g. the shoe sizes of a sample of
people.
The characteristic of discrete data is that their values progress in definite steps, e.g. 1, 2, 3, 4, etc.
Continuous Data
These data cannot be measured precisely; their values can only be approximated, e.g. length, weight,
temperature, time, etc.
How well continuous data are approximated depends on the situation and the quality of measuring
instruments.
Frequency Distributions
This is concerned with organization and presentation of numerical data.
Total 16
Bivariate example
QT - X
Remarks - Y
Name    Marks X    Marks Y
Jane    70         60
John    80         55
June    70         80
A company dealing with importation of shoes recorded the following shoe sizes
5 6 9 8 5 6 4
4 4 9 9 7 7 6
6 5 4 4 4 4 5
6 9 8 4 5 6 10
10 4 5 7 8 8 7
5 5 6 6 7 8 10
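The recorded shoe sizes can be tallied into a simple (ungrouped) frequency distribution. The following is a minimal sketch in Python, assuming the 42 values are read row by row exactly as listed above; the variable names are illustrative only.

```python
from collections import Counter

# Shoe sizes exactly as recorded in the example above (42 values).
shoe_sizes = [
    5, 6, 9, 8, 5, 6, 4,
    4, 4, 9, 9, 7, 7, 6,
    6, 5, 4, 4, 4, 4, 5,
    6, 9, 8, 4, 5, 6, 10,
    10, 4, 5, 7, 8, 8, 7,
    5, 5, 6, 6, 7, 8, 10,
]

# Count how many times each size occurs.
frequency = Counter(shoe_sizes)

print("Size  Frequency")
for size in sorted(frequency):
    print(f"{size:>4}  {frequency[size]:>9}")
print(f"Total {sum(frequency.values()):>9}")
```

Each row of the printed table pairs a shoe size with its frequency, and the frequencies add up to the number of recorded items.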
A grouped frequency distribution organizes data items into groups (classes) of values, each showing how
many items have values included within the group (known as the class frequency).
N/B: Once items have been grouped in this way, their individual values are lost.
Example 2
The value of properties handled by a property dealer over a 6-month period.
Class boundaries.
These are the lower and upper values of a class that mark the common points between adjacent classes.
Class midpoint
These are situated at the center of classes. They lie midway between the upper and lower boundaries or
limits.
1) Step one 46 -9 = 37
The following are marks extracted from the consolidated mark sheet for diploma students in QT
20 35 39 40 51
35 40 41 51 55
45 60 76 62 61
81 82 83 85 60
90 72 73 80 50
51 32 31 38 37
40 20 27 60 76
86 90 91 60 51
52 50 55 57 56
56 76 60 40 30
92 25 30 45 40
Quiz
Formulate a grouped frequency distribution where class 41- 50 will be among the classes.
ANSWER
Step 1 – 92-20 =72
Step 2 – 72/10 = 7.2=10
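One way to carry the quiz through to a finished table is sketched below in Python. The class width of 10 and the starting class 11-20 are assumptions, chosen only so that 41-50 appears among the classes; other layouts that satisfy the quiz are possible.

```python
# Marks copied row by row from the consolidated mark sheet above (55 values).
marks = [
    20, 35, 39, 40, 51,
    35, 40, 41, 51, 55,
    45, 60, 76, 62, 61,
    81, 82, 83, 85, 60,
    90, 72, 73, 80, 50,
    51, 32, 31, 38, 37,
    40, 20, 27, 60, 76,
    86, 90, 91, 60, 51,
    52, 50, 55, 57, 56,
    56, 76, 60, 40, 30,
    92, 25, 30, 45, 40,
]

class_width = 10   # assumption: matches the required class 41-50
first_lower = 11   # assumption: makes 41-50 one of the classes

lowest, highest = min(marks), max(marks)
print("Range =", highest - lowest)

# Build classes 11-20, 21-30, ... until the highest mark is covered,
# counting the marks that fall inside each class (limits inclusive).
grouped = {}
lower = first_lower
while lower <= highest:
    upper = lower + class_width - 1
    grouped[(lower, upper)] = sum(lower <= m <= upper for m in marks)
    lower += class_width

for (lo, hi), f in grouped.items():
    print(f"{lo}-{hi}: {f}")
print("Total:", sum(grouped.values()))
```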
i) Less than CF
Here a set of item values is listed (normally the class upper limits), with each one showing the number of
items in the distribution having values up to that limit.
Class            CF
Less than  9      1
Less than 14      8
Less than 19     19
Less than 24     29
Less than 29     36
Less than 34     38
Less than 39     38
Less than 44     39
Less than 49     40
More than CF
Here a set of item values is listed (normally the class lower limits), with each one showing the number of
items in the distribution having values from that limit upwards.
Class            CF
More than  5     40
More than 10     39
More than 15     32
More than 20     21
More than 25     11
More than 30      4
More than 35      2
More than 40      2
More than 45      1
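Both cumulative tables can be generated from one list of class frequencies. The sketch below is in Python; the class frequencies are not given in the text and are an assumption, inferred by differencing the "less than" column (1, 8 - 1, 19 - 8, and so on). On these inferred frequencies one item lies in the class 45-49, so the last "more than" entry comes out as 1.

```python
from itertools import accumulate

# Classes 5-9, 10-14, ..., 45-49 with frequencies inferred (an assumption)
# by differencing the "less than" cumulative column.
classes = [(5, 9), (10, 14), (15, 19), (20, 24), (25, 29),
           (30, 34), (35, 39), (40, 44), (45, 49)]
frequencies = [1, 7, 11, 10, 7, 2, 0, 1, 1]

# "Less than" CF: running total from the first class downwards,
# reported against each class upper limit.
less_than_cf = list(accumulate(frequencies))

# "More than" CF: running total from the last class upwards,
# reported against each class lower limit.
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

for (lower, upper), lt, mt in zip(classes, less_than_cf, more_than_cf):
    print(f"less than {upper:>2}: {lt:>2}    more than {lower:>2}: {mt:>2}")
```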
Data Presentation
Managers take decisions on the basis of available information or data. Raw data may be available to
managers in scattered form; unless the data are properly collected, analyzed and presented, they may not
be useful for decision-making.
Raw data may not be easily comprehensible, so it becomes useful and important to classify and present
them in a meaningful manner.
There are various methods of classifying and presenting data. One of the ways is
Tabular presentation
It is used for summarizing and condensing data.
It also helps in the analysis of relationships, trends and the relative size of given data.
TABLES
No of respondents by gender (sex)
Respondent Frequency Percent
Male 18 60
Female 12 40
Total 30 100
Characteristics of a table
a) It must have a number
b) It must have a title
c) It must have captions (headings of the columns)
d) It must have stubs (headings of the rows)
Additional characteristics of a table
The source of data
Sometimes tables can reflect data with respect to other variables.
Types of tables
Two-way classification table
This type of table is set up in such a way that two different variables can be compared or contrasted.
              Male                 Female
Response   Frequency    %       Frequency    %
No             8       26.6         1        3.3
Total         18       60.0        12       40.0
                W              E              C
Products    1990  1991     1990  1991     1990  1991
P1            20    40       80    30       29    90
P2            30    50       20    30       40    50
P3           100    20       30    50       80    70
Total        150   110      130   110      149   210
B) Graphical presentation
Sale volume
Year Y1 Y2
1986 10 9
1987 14 13
1988 12 11
1989 15 14
1990 20 19
1991 24 23
1992 23 22
1993 28 27
[Line graph: volume of sales (vertical axis) plotted against years (horizontal axis)]
ii) Bar chart /graphs
Example
Suppose in a class of diploma students, 10 students were born in the month of November
30 students were born in the month of December
50 students were born in the month of January
70 students were born in the month of February
Required
Construct a horizontal bar chart.
[Bar chart: number of students by month of birth, axis scaled from 10 to 70]
Students to draw a composite bar chart.
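As a rough check on the drawing, the horizontal bar chart for the birth-month example can be sketched as text output in Python. The scale of one `#` per 10 students is an arbitrary choice made for this sketch.

```python
# Number of students born in each month, from the example above.
births = {"November": 10, "December": 30, "January": 50, "February": 70}

scale = 10  # assumption: one '#' stands for 10 students


def bar_line(month, count):
    """Return one horizontal bar: label, bar of '#', then the count."""
    return f"{month:<9} | {'#' * (count // scale)} {count}"


chart = [bar_line(month, count) for month, count in births.items()]
print("\n".join(chart))
```

The length of each bar is proportional to the number of students, which is the defining property of a bar chart.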
Pie Chart
In a pie chart, the different segments of a circle represent the percentage contributions of the various
components to the total.
The pie chart is very useful because it clearly brings out the relative importance of the various components.
In drawing a pie chart, we construct a circle of any dimension and break it down into
segments.
The full angle of 360° represents 100%, and the angle corresponding to each component can be found by
multiplying 360° by that component's percentage share of the total.
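The angle calculation can be sketched in Python using the respondents-by-gender table (18 male, 12 female) as the data; the variable names are illustrative only.

```python
# Frequencies from the "respondents by gender" table.
respondents = {"Male": 18, "Female": 12}
total = sum(respondents.values())

angles = {}
for group, freq in respondents.items():
    percent = 100 * freq / total          # share of the total, in percent
    angles[group] = 360 * freq / total    # same share, as an angle in degrees
    print(f"{group}: {percent:.0f}% -> {angles[group]:.0f} degrees")
```

The angles of all segments add up to 360 degrees, just as the percentages add up to 100.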
The relative frequency of a class is the frequency of that class divided by the total frequency of all
classes and is generally expressed as a percentage.
Students to plot the values and draw the relative frequency distribution graph.
U shaped distribution
In the symmetrical distribution the mean, median and mode are all the same.
Mean = mode = median
Characteristics of the positively skewed distribution
The curve is skewed to the right
AM > median > mode
Most scores lie below the mean
Negatively skewed
3. Kurtosis
Kurtosis comes from a Greek word meaning bulginess.
It refers to the flatness of the curve.
Three terms are used for indicating flatness:
Leptokurtic
Mesokurtic
Platykurtic
MEASURE OF CENTRAL TENDENCY
Introduction
It is very difficult to understand a mass of data. In order to get a concise and complete picture of large
data, it is essential to obtain a figure which represents the whole data; with the help of such a figure, the
data are compared and understood easily. A figure which represents the whole data is known as an average or
measure of central tendency. An average removes all the unnecessary details of the data and gives a
concise picture of the huge data under investigation.
According to Prof Yule, the following are the desiderata (requirements) to be satisfied by an ideal
average:
(i) It should be rigidly defined i.e., the definition should be clear and un-ambiguous so that it leads
to one and only one interpretation by different persons. In other words, the definition should not
leave anything to the discretion of the investigator or the observer. If it is not rigidly defined then
the bias introduced by the investigator will make its value unstable and render it unrepresentative
of the distribution.
(ii) It should be easy to understand and calculate even for a non-mathematical person. In other
words, it should be readily comprehensible and should be computed with sufficient ease and
rapidity and should not involve heavy arithmetical calculations. However, this should not be
accomplished at the expense of accuracy or some other advantages which an average may
possess.
(iii) It should be based on all the observations, Thus in the computation of an ideal average the
entire set of data at our disposal should be used and there should not be any loss of information
resulting from not using the available data. Obviously, if the whole data is not used in computing
the average, it will be unrepresentative of the distribution.
(iv) It should be suitable for further mathematical treatment. In other words, the average should
possess some important and interesting mathematical properties so that its use in further statistical
theory is enhanced. For example, if we are given the averages and sizes (frequencies) of a
number of different groups then for an ideal average we should be in a position to compute the
average of the combined group. If the average is not amenable to further algebraic manipulation,
then obviously its use will be very much limited for further applications in statistical theory.
(v) It should be affected as little as possible by fluctuations of sampling. By this we mean that if
we take independent random samples of the same size from a given population and compute the
average for each of these samples then, for an ideal average, the values so obtained from different
samples should not vary much from one another. The difference in the values of the average for
different samples is attributed to the so called fluctuations of sampling. This property is also
explained by saying that an ideal average should possess sampling stability.
(vi) It should not be affected much by extreme observations. By extreme observations we mean
very small or very large observations. Thus, a few very small or very large observations should
not unduly affect the value of a good average.
There are five measures of central tendency:
1) Arithmetic mean
2) Geometric mean
3) Harmonic mean
4) Median
5) Mode
The first three measures are known as computed averages while the last two are known as positional
averages.
ARITHMETIC MEAN
Arithmetic mean of a set of values is defined as the sum of the values divided by the number of values.
In other words the mean is also called the average or the center of gravity.
$$\text{AM} = \frac{\text{sum of all values}}{\text{number of values}}$$
The AM is normally denoted by $\bar{x}$ (x-bar).
Example 1
Solution
$$\bar{x} = \frac{3 + 7 + 2 + 1 + 7}{5} = \frac{20}{5} = 4$$
Direct method:
$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + x_3 f_3 + \dots + x_n f_n}{f_1 + f_2 + f_3 + \dots + f_n} = \frac{\sum xf}{n}$$
Where $f$ = frequencies.
$n = \sum f$ = total number of frequencies
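Short cut method:
Under this method the formula for calculating mean is
$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}$$
where $A$ is the assumed mean and $d_i = x_i - A$ is the deviation of each value from it.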
a) Discrete Series
In discrete series, the value of each of the individual item is multiplied by the corresponding frequency
and the total of the products is divided by the sum of the number of items. So in a discrete series,
arithmetic mean is calculated as
$$\bar{X} = \frac{\sum fX}{\sum f}$$
Where $f$ = frequencies.
$\sum f$ = total number of frequencies.
Example 2
Calculate the arithmetic mean from the following data:
Values 5 10 15 20 25 30 35 40 45 50
Frequency 20 43 75 67 72 45 39 9 8 6
Solution
Values (x)    Frequency (f)    fx
5             20               100
10            43               430
15            75               1125
20            67               1340
25            72               1800
30            45               1350
35            39               1365
40            9                360
45            8                360
50            6                300
Total         384              8530
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{8530}{384} = 22.2$$
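The arithmetic in Example 2 can be checked with a short Python sketch of the direct method. The variable names are illustrative only; the data are copied from the example. Summing the frequency row gives 384, and 8530 / 384 rounds to the stated mean of 22.2.

```python
# Values and frequencies from Example 2.
values = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
frequencies = [20, 43, 75, 67, 72, 45, 39, 9, 8, 6]

# Direct method: multiply each value by its frequency, then divide the
# total of the products by the total frequency.
total_f = sum(frequencies)
total_fx = sum(x * f for x, f in zip(values, frequencies))
mean = total_fx / total_f

print("Sum of f  =", total_f)
print("Sum of fx =", total_fx)
print("Mean      =", round(mean, 1))
```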
Example 3: From the following data of the marks obtained by 60 students of a class, calculate the arithmetic mean.
Marks 20 30 40 50 60 70
No of students 8 12 20 10 6 4
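Sol:
(i) Direct method:
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{2460}{60} = 41$$
Hence the average marks = 41.
(ii) Short cut method:
$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} = 40 + \frac{60}{60} = 40 + 1 = 41$$
since $A = 40$ and $\sum f_i d_i = 60$.
Hence the average marks = 41.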
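Both methods for the 60 students can be compared in a few lines of Python. This is a minimal sketch, with the assumed mean A = 40 taken from the solution and the deviations defined as d = x - A.

```python
# Marks and number of students from the example.
marks = [20, 30, 40, 50, 60, 70]
students = [8, 12, 20, 10, 6, 4]
n = sum(students)  # total frequency

# Direct method: sum of f*x divided by the total frequency.
direct_mean = sum(f * x for x, f in zip(marks, students)) / n

# Short cut method: work with deviations d = x - A from an assumed mean A.
A = 40
deviations = [x - A for x in marks]
sum_fd = sum(f * d for d, f in zip(deviations, students))
shortcut_mean = A + sum_fd / n

print("Direct mean    =", direct_mean)
print("Sum of f*d     =", sum_fd)
print("Short cut mean =", shortcut_mean)
```

The two results agree because the short cut method only shifts the origin to A before averaging and then shifts it back.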
b) Continuous Series
Now, all this is very well, but much data is in the form of grouped, continuous frequency distributions.
The method of calculating the arithmetic mean from a continuous series is exactly the same as that of
discrete series with the exception that in a continuous series, we first take the mid-points of the various
class intervals which are written against each class interval. These mid-values are multiplied by the
corresponding frequencies.
Example 3
Calculate the mean for the following frequency distribution:
Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70
No of students 6 5 8 15 7 6 3
Solution
Marks    Mid-values (X)    No of students (f)    fX
0-10     5                 6                     30
10-20    15                5                     75
20-30    25                8                     200
30-40    35                15                    525
40-50    45                7                     315
50-60    55                6                     330
60-70    65                3                     195
Total                      50                    1670
$$\bar{X} = \frac{\sum fX}{\sum f} = \frac{1670}{50} = 33.4$$
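For a continuous series the only extra step is taking class mid-points. A minimal Python sketch for this marks distribution follows; the names are illustrative only.

```python
# Class intervals and frequencies from the example.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
students = [6, 5, 8, 15, 7, 6, 3]

# Step 1: the mid-value of each class stands in for all items in the class.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Step 2: proceed exactly as for a discrete series.
total_f = sum(students)
total_fx = sum(f * x for x, f in zip(midpoints, students))
grouped_mean = total_fx / total_f

print("Mid-values:", midpoints)
print("Sum of f  =", total_f)
print("Sum of fX =", total_fx)
print("Mean      =", grouped_mean)
```

Because the individual marks are replaced by mid-values, the result is an estimate of the mean rather than its exact value.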
Example 3: Estimate the mean of a sample that has been grouped according to the following distribution:
Solution
Class    Frequency (f)    Class mark (x)    xf
15-24    8                19.5              156
25-34    12               29.5              354
35-44    13               39.5              513.5
45-54    23               49.5              1138.5
55-64    14               59.5              833
65-74    12               69.5              834
Sum      82                                 3829
$$\bar{x} = \frac{\sum xf}{\sum f} = \frac{3829}{82} \approx 46.7$$
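The mean of means (combined mean or grand mean): to find the average of the means
$\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$ of groups of sizes $n_1, n_2, \dots, n_k$, use the formula
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \dots + n_k \bar{x}_k}{n_1 + n_2 + \dots + n_k}$$
Problem: the average of the 17 females in a Statistics class is 83 points and the average of the 14 males
is 78. What is the average of the whole class?
Solution:
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} = \frac{17(83) + 14(78)}{17 + 14} = \frac{2503}{31} = 80.74$$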
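The combined mean is a weighted average of the group means, with the group sizes as weights. A small Python sketch, using a hypothetical helper `combined_mean` written for this illustration:

```python
def combined_mean(sizes, means):
    """Weighted average of group means, weighted by group size."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


# 17 females averaging 83 points and 14 males averaging 78 points.
class_mean = combined_mean([17, 14], [83, 78])
print(round(class_mean, 2))
```

Simply averaging 83 and 78 would give 80.5, which ignores the fact that the two groups differ in size.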
Demerits
1. It may produce a figure which does not exist in a series. In other words, it often produces results that
are not suitable for a communication view point, for example, the figure 2.2 children per adult female
is absurd! In situations like these, people expect an integer (i.e., whole number) to be representative
of the number of children per family.
2. It cannot be determined by inspection nor can it be located graphically.
3. It cannot be used if we are dealing with qualitative characteristics which cannot be measured
quantitatively, such as intelligence, honesty, beauty, etc. In such cases median is the only average to be
used.
4. It cannot be obtained if a single observation is missing or lost or is illegible unless we drop it out and
compute the AM of the remaining values.
5. The strongest drawback of arithmetic mean is that it is very much affected by extreme observations.
Two or three very large or small values of the variable may unduly affect the value of the AM.
Consider for example, the case of John, who at an interview for a job is told that the average income
of salesmen in the company is £8000. He accepts the job as he considers the firm to be very
progressive with excellent prospects for himself. Although his starting salary is only £2000 a year,
his salary will obviously climb very quickly. You could imagine how cheated John felt when he found
that the sales force consisted of just 5 men: the sales director on £30,000 a year and 4 salesmen on
£2500 a year.
$$\bar{x} = \frac{30000 + 4(2500)}{5} = \frac{40000}{5} = 8000$$
that is, £8,000.
The extreme value (in this case the sales director’s salary) has caused the AM to be most
unrepresentative.
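The effect of the extreme salary can be seen by computing the mean and the median of the same five salaries. This sketch uses Python's standard `statistics` module; the comparison with the median is added here for illustration.

```python
from statistics import mean, median

# One sales director on 30,000 and four salesmen on 2,500 (pounds a year).
salaries = [30000, 2500, 2500, 2500, 2500]

average = mean(salaries)    # pulled upwards by the single large value
middle = median(salaries)   # unaffected by how large the top salary is

print("Mean   =", average)
print("Median =", middle)
```

The mean is far above what four of the five men actually earn, while the median equals the typical salesman's salary.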
THE MEDIAN
It is the value of the middle item of a series when these items are arranged in ascending or descending
order. In the words of L R Connor: ‘The median is that value of the variable which divides the group in
two equal parts, one part comprising all the values greater and the other, all values less than median’.
Thus, median of a distribution may be defined as that value of the variable which exceeds and is exceeded
by the same number of observations i.e., it is the value such that the number of observations above it is
equal to the number of observations below it. Thus, whereas the AM is based on all the
items of the distribution, the median is only a positional average, i.e., its value depends on the position
occupied by a value in the frequency distribution.
In short, the median is the middle value once the observations are arranged in ascending or
descending order.
To find the median in the case of ungrouped data:
Step 1. Arrange the values in order.
Step 2. If n is odd, the middle value is the median
If n is even, the average of the two middle values is the median
Example 1: Find the median of the values 8, 5, 7, 13, 2, 13, 9.
Step 1: 2, 5, 7, 8, 9, 13, 13
Step 2: n = 7 is odd, so the median is the middle (4th) value, 8.
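The two steps above can be sketched in Python. This is only an illustrative sketch; the function name median_ungrouped is an assumption and does not come from the notes.

```python
def median_ungrouped(values):
    """Median of raw (ungrouped) data: sort, then take the middle value(s)."""
    ordered = sorted(values)              # Step 1: arrange the values in order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                        # Step 2: n odd -> the middle value
        return ordered[mid]
    # Step 2: n even -> the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


# Data of Example 1
example_median = median_ungrouped([8, 5, 7, 13, 2, 13, 9])
print(example_median)
```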
For a discrete frequency distribution the median is located with the help of the cumulative frequency (C.F.) column:
x       f     C.F.
0      15     15
1      24     39
2      18     57   (median)
3      12     69
4       8     77
5       2     79
6       1     80
Total  80
Steps
1. Calculate the value of (Σf + 1)/2 = (80 + 1)/2 = 81/2 = 40.5
2. Form a cumulative frequency column.
3. Find the cumulative frequency which first exceeds the value obtained in step 1 (here 57).
4. The median is the x value corresponding to the C.F. value identified in step 3 (here x = 2).
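As a rough sketch, the four steps can be written as a short Python function. The name median_discrete is hypothetical. The comparison uses "reaches or exceeds" rather than "exceeds", which is an assumption made here so that an odd total frequency (where (Σf + 1)/2 is a whole number) is also handled.

```python
def median_discrete(xs, fs):
    """Median of a discrete frequency distribution via cumulative frequencies."""
    target = (sum(fs) + 1) / 2            # step 1: (sum of f + 1) / 2
    running = 0
    for x, f in sorted(zip(xs, fs)):
        running += f                      # step 2: cumulative frequency
        if running >= target:             # step 3: first C.F. reaching the target
            return x                      # step 4: the corresponding x value
    raise ValueError("frequencies must contain at least one positive value")


# Table from the notes: x = 0..6 with frequencies summing to 80
table_median = median_discrete([0, 1, 2, 3, 4, 5, 6], [15, 24, 18, 12, 8, 2, 1])
print(table_median)
```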
As mentioned in the previous work, the penalty for grouping values is the loss of their individual identities,
and thus there is no way that a median can be calculated exactly in this situation. However, there are
two methods commonly employed for estimating the median.
Interpolation in this context is simply a mathematical technique which estimates an unknown value by
utilizing the immediately surrounding known values.
One difficulty in computing the median of the continuous series is that the value of the median lies in a
class interval. So this value is calculated by the method of interpolation. The value of median is obtained
by the formula:
\[ M_d = L_m + \frac{\tfrac{1}{2}N - C_p}{m} \times c \]
Where
Lm = the lower limit of the median class.
N = total number of observations (i.e., total frequency).
Cp = cumulative frequency of the class immediately before the median class.
m = the frequency of the median class.
c = the magnitude or width of the median class.
(Note: if N is an odd number, the value of ½ N must be rounded up to the nearest integer), i.e., just add 1
to the value.
Also the formula works in the assumption that the distribution of the variable under consideration is
continuous with exclusive type classes without any gaps. Where classes are not continuous, then the
distribution must be converted into a continuous frequency distribution before applying the formula. This
adjustment will affect only the value of Lm.
Example
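Classes     f     c.f.
20 – 25     2      2
25 – 30    14     16
30 – 35    29     45
35 – 40    43     88   (median class)
40 – 45    33    121
45 – 50     9    130
Total     130
½N = 130/2 = 65, so the median class is 35 – 40 (the first class whose c.f. exceeds 65).
\[ M_d = 35 + \frac{65 - 45}{43} \times 5 = 35 + 2.33 = 37.33 \]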
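The interpolation formula can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the classes are continuous and of equal width, ½N is used as it stands (no rounding), and the function name median_grouped is hypothetical.

```python
def median_grouped(lower_bounds, width, freqs):
    """Estimate the median of a grouped distribution by interpolation.

    lower_bounds: lower boundaries of continuous, equal-width classes
    width:        common class width c
    freqs:        class frequencies
    """
    total = sum(freqs)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    half_n = total / 2                    # (1/2) N
    cum_before = 0                        # Cp: c.f. before the current class
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum_before + f >= half_n:   # this is the median class
            return lower + (half_n - cum_before) / f * width
        cum_before += f
    raise ValueError("no median class found")


# Worked example from the notes (classes 20-25, ..., 45-50)
worked_median = median_grouped([20, 25, 30, 35, 40, 45], 5, [2, 14, 29, 43, 33, 9])
print(round(worked_median, 2))
```

For classes written as 5 – 9, 10 – 14 and so on, the lower boundaries 4.5, 9.5, ... would be passed instead of the printed lower limits.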
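Students are to calculate the answer in class. For the example below the answer should be approximately 18.85
(the classes must first be made continuous, so the lower boundary of the median class is 14.5).
Class      f     c.f.
5 – 9     10     10
10 – 14   36     46
15 – 19   62    108   (median class)
20 – 24   72    180
25 – 29   20    200
Total    200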
Merits
It is an appropriate alternative to the mean when extreme values are present at one or both ends
of a set or distribution.
It can be used when certain end values of a set or distribution are difficult, expensive or impossible to
obtain.
It can be used when non-numerical data is involved, provided the items can be ranked, e.g. the size of objects or even height.
It often takes a value equal to one of the original items.
Its main disadvantage is that it is difficult to handle theoretically in more advanced
statistical work.
Demerits
1. In case of even number of observations for an ungrouped data, median cannot be determined exactly.
We merely estimate it as the AM of the two middle terms. In fact, any value lying between the two
middle observations can serve the purpose of median.
2. Median, being a positional average, is not based on each and every item of the distribution. It
depends on all the observations only to the extent whether they are smaller than or greater than it; the
exact magnitude of the observations being immaterial.
Consider a simple example:
The median value of 8, 12, 35, 40, and 60 is 35. Now if we replace the values 8 and 12 by any two
values less than 35 and the values of 40 and 60 by any two values greater than 35 the median is
unaffected.
3. Median is not suitable for further mathematical treatment, i.e., given the sizes and the median values
of the different groups, we cannot compute the median of the combined group.
THE MODE
Although the mean and the median will be the averages used in most circumstances, there are situations in
which other averages are particularly appropriate. Whereas the mean can be said to find the center of
gravity and the median the middle of a set of items, the mode identifies the most popular item. The mode
of a set of data is that value which occurs most often or, equivalently, has the largest frequency.
Mode is the value which occurs most frequently in a set of observations and around which the other items
of the set cluster densely. In other words, mode is the value of a series which is predominant in it.
According to A M Tuttle, ‘Mode is the value which has the greatest frequency density in its immediate
neighborhood’. Accordingly, mode may also be termed the fashionable value (a derivation of the
French word ‘la mode’) of the distribution.
The average referred to is neither mean nor median but mode, the most frequent value in the
distribution. For example, by the first statement we mean that there is maximum demand for the shoe
of size no 7.
Computation of Mode- Ungrouped data
From individual series, mode is obtained through observation. It is the value of that item which occurs
the maximum number of times. A distribution may have only one mode (unimodal), two modes
(bimodal), three modes (tri-modal) or many modes (multimodal).
Example:
The marks of 10 students in a test are: 65 43 57 63 39 57 60 48 57 55. Find the mode.
Solution: Through observation, we find that 57 has been repeated 3 times. So mode is 57 marks.
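Counting by observation can be sketched with Python's standard collections.Counter. The function name modes is an assumption; it returns every value that attains the maximum frequency, so bimodal or multimodal data give more than one value.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the greatest frequency, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())            # the maximum frequency
    return sorted(v for v, c in counts.items() if c == top)


# Marks of the 10 students from the example
marks_modes = modes([65, 43, 57, 63, 39, 57, 60, 48, 57, 55])
print(marks_modes)
```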
The maximum frequency is 40 and therefore, the corresponding value of x, i.e., 5 gives the value of mode.
(iii) Preferred formula:
\[ \text{Mode } (M_o) = l_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i = l_1 + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times i \]
Where
Δ1 = f1 – f0
Δ2 = f1 – f2
l1 = lower limit of the modal class.
f1 = frequency of the modal class.
f0 = frequency of the class preceding the modal class.
f2 = frequency of the class succeeding the modal class.
i = size (width) of the modal class.
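A minimal Python sketch of formula (iii) is given below. The function name mode_grouped is hypothetical, the classes are assumed continuous and of equal width, and the sketch simply refuses to answer when the maximum frequency is repeated or falls in the first or last class. The data used for illustration are the class frequencies of the grouped median example earlier in these notes, not a worked mode example from the source.

```python
def mode_grouped(lower_bounds, width, freqs):
    """Mode of a grouped distribution: l1 + (f1 - f0) / (2*f1 - f0 - f2) * i."""
    f1 = max(freqs)
    if freqs.count(f1) > 1:
        raise ValueError("maximum frequency is repeated; formula not applicable")
    k = freqs.index(f1)                   # position of the modal class
    if k == 0 or k == len(freqs) - 1:
        raise ValueError("modal class at an end of the distribution")
    f0, f2 = freqs[k - 1], freqs[k + 1]   # preceding and succeeding frequencies
    return lower_bounds[k] + (f1 - f0) / (2 * f1 - f0 - f2) * width


# Classes 20-25, ..., 45-50 with frequencies 2, 14, 29, 43, 33, 9
illustrative_mode = mode_grouped([20, 25, 30, 35, 40, 45], 5, [2, 14, 29, 43, 33, 9])
print(round(illustrative_mode, 2))
```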
Note:
1) While applying the above formula for calculating mode, it is necessary to see that the class
intervals are uniform throughout. If they are unequal they should first be made equal on the
assumption that the frequencies are equally distributed throughout.
2) In case of bimodal distribution the mode cannot be found.
Finding mode in case of bimodal distribution:
In a bimodal distribution the value of mode cannot be determined by the help of the above
formulae. In this case the mode can be determined by using the empirical relation given below.
Mode = 3Median - 2Mean
And the mode which is obtained by using the above relation is called ‘Empirical mode’
Remarks
1. It may be pointed out that the above formula (iii) for computing mode is based on the following
assumptions:
(i) The frequency distribution must be continuous with exclusive type classes without any gaps. If
the data is not given in the form of continuous classes, it must first be converted into continuous
classes before applying formula (ii) or (iii).
(ii) The class intervals must be uniform throughout i.e., the width of all the class intervals must be the
same. In case of the distribution with unequal class intervals, they should be made equal under
the assumption that the frequencies are uniformly distributed over all the classes, otherwise the
value of mode computed from the formulae (ii) and (iii) will give misleading results.
2. The above technique of locating mode is not practicable in the following situations:
(i) If the maximum frequency is repeated or approximately equal concentration is found in two or
more neighboring values.
(ii) If the maximum frequency occurs either in the very beginning or at the end of the distribution.
(iii) If there are irregularities in the distribution, ie, the frequencies of the variable increase or decrease
in a haphazard way.
In the above situations, mode (or modal class in the case of continuous frequency distribution) is located
by the method of grouping – (Statistics II).
The Histogram
Merits
i. It is easy to calculate and understand
ii. It can be located in some cases by inspection
iii. It is capable of being ascertained graphically
iv. It is not affected by extreme values
v. It represents the most frequent value and hence it is very often used in the fashion industry.
Demerits
i. There are different formulae for its calculation, which ordinarily give different answers.
ii. Mode is ill-defined when a series has two or more modes.
iii. It cannot be subjected to further statistical analysis. For example, the combined mode of two
series cannot be calculated.
iv. It is an unsuitable measure as it is affected more by sampling fluctuations.
v. Mode for the series with unequal class – intervals cannot be calculated.
Uses:
i. It is used for the study of most popular fashion items
ii. It is extensively used by business and commercial managers.
1.6 Empirical Relation Between Mean (M), Median (Md) and Mode (Mo)
In case of a symmetrical distribution, mean, median and mode coincide i.e., M = Md = Mo.
In a highly asymmetrical distribution, it is impossible to forecast the relationship between the averages.
For a moderately asymmetrical distribution, Prof Karl Pearson has demonstrated the following important empirical relationship:
Mode = Mean – 3(Mean – Median)
i.e. Mode = 3 Median – 2 Mean
This formula is especially useful to determine the value of mode in case it is ill defined, e.g., in the
case of bimodal or multimodal distributions.
Equivalently, Mean – Median = 1/3 (Mean – Mode)
Thus, we see that the difference between mean and mode is three times the difference between mean and
median. In other words, median is closer to mean than mode.
Geometric mean
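Geometric mean is defined as the nth root of the product of n items (or values).
Calculation of G.M - Individual series:
If x_1, x_2, x_3, ..., x_n are n observations studied on a variable X, then the G.M of the observations is
defined as
\[ \text{G.M} = \left( x_1 \, x_2 \, x_3 \cdots x_n \right)^{1/n} \]
If the values x_1, x_2, ..., x_n occur with frequencies f_1, f_2, ..., f_n, then
\[ \text{G.M} = \left( x_1^{f_1} \, x_2^{f_2} \, x_3^{f_3} \cdots x_n^{f_n} \right)^{1/N} \qquad (*) \]
Where \( N = \sum_{i=1}^{n} f_i \), i.e. total frequency.
Applying log on both sides of (*) we get
\[ \text{G.M} = \operatorname{antilog}\left( \frac{1}{N} \sum_{i=1}^{n} f_i \log x_i \right) \]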
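The logarithmic form lends itself to a short Python sketch. The function name geometric_mean and the optional frequency argument are assumptions made for illustration; natural logarithms are used, with math.exp playing the role of the antilog.

```python
import math


def geometric_mean(xs, fs=None):
    """G.M = antilog( (1/N) * sum(f_i * log x_i) ), with N = sum of f_i."""
    if fs is None:
        fs = [1] * len(xs)                # individual series: every f_i = 1
    if not xs or any(x <= 0 for x in xs):
        raise ValueError("the G.M needs a non-empty set of positive values")
    total = sum(fs)                       # N, the total frequency
    log_sum = sum(f * math.log(x) for x, f in zip(xs, fs))
    return math.exp(log_sum / total)


print(geometric_mean([2, 8]))             # the square root of 2 * 8
print(geometric_mean([2, 4], [2, 1]))     # the cube root of 2 * 2 * 4
```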
Properties of G.M:
1) If G1 and G2 are the geometric means of two components having n1 and n2 observations and G is the
geometric mean of the combined series of n = n1 + n2 values, then
\[ G = G_1^{w_1} \, G_2^{w_2} \]
Where
\[ w_1 = \frac{n_1}{n_1 + n_2} \quad \text{and} \quad w_2 = \frac{n_2}{n_1 + n_2} \]
Uses of G.M:
Geometrical Mean is especially useful in the following cases.
1) The G.M is used to find the average percentage increase in sales, production, or other economic
or business series.
For example, if from 1992 to 1994 prices increased by 5%, 10% and 18% respectively, then the average
annual increase is not 11%, as calculated by the A.M, but about 10.9%, as calculated by the G.M of the growth factors 1.05, 1.10 and 1.18.
2) G.M is theoretically considered to be best average in the construction of Index numbers.
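The price example in use 1) can be checked with a small Python sketch. The function name average_growth_rate is hypothetical; it takes the G.M of the growth factors (1 + rate) and converts the result back to a percentage.

```python
def average_growth_rate(percent_increases):
    """Average percentage increase per period, via the G.M of growth factors."""
    if not percent_increases:
        raise ValueError("at least one rate is required")
    product = 1.0
    for p in percent_increases:
        product *= 1 + p / 100            # growth factor for this period
    return (product ** (1 / len(percent_increases)) - 1) * 100


rate = average_growth_rate([5, 10, 18])
print(round(rate, 1))                     # compare with the A.M of 11
```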
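If the values x_1, x_2, ..., x_n carry weights w_1, w_2, ..., w_n, the weighted G.M is
\[ \text{G.M} = \left( x_1^{w_1} \, x_2^{w_2} \, x_3^{w_3} \cdots x_n^{w_n} \right)^{1/N} \]
Where \( N = \sum_{i=1}^{n} w_i \), i.e. total weight.
Applying log on both sides we get
\[ \text{G.M} = \operatorname{antilog}\left( \frac{1}{N} \sum_{i=1}^{n} w_i \log x_i \right) \]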
Harmonic mean:
The harmonic mean (H.M) is defined as the reciprocal of the arithmetic mean of the reciprocals of the
individual observations. For values x_1, x_2, ..., x_n with frequencies f_1, f_2, ..., f_n:
\[ \text{H.M} = \frac{\sum_{i=1}^{n} f_i}{\dfrac{f_1}{x_1} + \dfrac{f_2}{x_2} + \cdots + \dfrac{f_n}{x_n}} = \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} \dfrac{f_i}{x_i}} \qquad (*) \]
Calculation of H.M – Continuous series:
In continuous series H.M can be calculated by using the mid values m_i in place of the x_i's in equation
(*). Hence H.M is given by
\[ \text{H.M} = \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} \dfrac{f_i}{m_i}} \]
where m_i is the mid value of the ith class interval.
Uses of harmonic mean:
1) The H.M is used for computing the average rate of increase in profits of a concern.
2) The H.M is used to calculate the average speed at which a journey has been performed.
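The average-speed use can be illustrated with a brief Python sketch. The function name harmonic_mean is an assumption, and the speeds (40 and 60 km/h over two equal distances) are made-up figures chosen only for illustration.

```python
def harmonic_mean(xs, fs=None):
    """H.M = (sum of f_i) / (sum of f_i / x_i); f_i defaults to 1 for raw data."""
    if fs is None:
        fs = [1] * len(xs)
    if not xs or any(x == 0 for x in xs):
        raise ValueError("the H.M needs non-empty data with no zero values")
    return sum(fs) / sum(f / x for x, f in zip(xs, fs))


# Hypothetical journey: equal distances covered at 40 km/h and at 60 km/h
average_speed = harmonic_mean([40, 60])
print(round(average_speed, 2))
```

The result is below the A.M of the two speeds (50), because more time is spent travelling at the lower speed.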
Merits:
1) Its value is based on all the observations of the data.
2) It is less affected by the extreme values.
3) It is suitable for further mathematical treatment.
4) It is strictly defined.
Demerits:
1) It is neither simple to calculate nor easy to understand.
2) It cannot be calculated if one of the observations is zero.
3) The H.M is never greater than the A.M or the G.M.
Note: Equality holds only if all the items in the distribution are equal.
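Prove that if a and b are two positive numbers then A.M ≥ G.M ≥ H.M
Sol:
Let a and b be two positive numbers. Then
The arithmetic mean of a and b = \( \frac{a + b}{2} \)
The geometric mean of a and b = \( \sqrt{ab} \)
The harmonic mean of a and b = \( \frac{2ab}{a + b} \)
Let us assume A.M ≥ G.M
\[ \frac{a + b}{2} \ge \sqrt{ab} \]
\[ a + b \ge 2\sqrt{ab} \]
\[ (a + b)^2 \ge 4ab \]
\[ (a - b)^2 \ge 0 \]
Which is always true.
A.M ≥ G.M ………………………… (1)
Let us assume G.M ≥ H.M
\[ \sqrt{ab} \ge \frac{2ab}{a + b} \]
\[ a + b \ge 2\sqrt{ab} \]
\[ (a + b)^2 \ge 4ab \]
\[ (a - b)^2 \ge 0 \]
Which is always true.
G.M ≥ H.M ………………………… (2)
From (1) and (2) we get A.M ≥ G.M ≥ H.M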
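The inequality can also be checked numerically. The sketch below relies on the standard-library statistics module (geometric_mean requires Python 3.8 or later); the helper name three_means and the sample values 4 and 9 are assumptions for illustration only.

```python
import statistics


def three_means(values):
    """Return (A.M, G.M, H.M) of a list of positive numbers."""
    return (
        statistics.mean(values),
        statistics.geometric_mean(values),
        statistics.harmonic_mean(values),
    )


am, gm, hm = three_means([4, 9])          # a = 4, b = 9
print(am, gm, hm)
print(am >= gm >= hm)
```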
Problems:
1. Calculate AM of the following data.
a) 4, 3, 2, 5, 3, 4, 5, 1, 7, 3, 2, 1 [3.33]
b) 30,70,10,75,500,8,42,250,40,36 [106.1]
c) 35, 46, 27, 38, 52, 44,50, 37, 41, 50[42]
2. Find the A.M of first n natural numbers.
3. Find the A.M of first n even numbers.
4. Find the A.M of first n odd numbers.
5. Find the A.M of first 10 even numbers.
6. Find the A.M of first 100 odd numbers.
7. Find A.M, G.M, H.M, median and mode of following
X 5 6 7
8.
f 1 4 3
9.
X 20 21 22 23 24 25 26
f 1 2 4 7 5 3 1
10. Find A.M, G.M, H.M, median and mode of following data.
Marks 20 30 40 50 60 70
No. Of Students 8 12 20 10 6 4
13. Find A.M, G.M, H.M, median and mode of following data.
CI 10-20 20-40 40-70 70-120 120-200
Frequency 4 10 26 8 2
14. Find A.M, G.M, H.M, median and mode of following data.
CI 1-7 8-14 15-21 22-28 29-35
Frequency 3 17 12 11 7
15. In an examination marks secured by three students A, B, C along with the respective weights of
the subjects are given below. Determine the best performance
16. Find A.M, G.M, H.M, median and mode of following data
Wages in rupees   Less than 10   Less than 20   Less than 30   Less than 40   Less than 50
No. of workers         5              17             20             22             25
17. Find the missing frequencies from the data given below if mean is 60.
Marks 50 55 60 65 70 Total
No. Of Students ? 20 25 ? 10 100
18. Find the missing frequencies from the data given below if mean is 60.
Marks 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99
No of students 6 29 87 181 247 263 133 43 9 2
Quartiles, Deciles and Percentiles are measures of position useful for comparing scores within
one set of data. You probably all took some type of college placement exam at some point. If
your composite math score was say 28, it might have been reported that this score was in the 94th
percentile. What does this mean? This does not mean you received a 94% on the test. It does
mean that of all the students who took that exam, 94% of them scored lower than you did (and
6% higher). For a set of data you can divide the data into three quartiles ( Q1 , Q2 , Q3 ), nine deciles
( D1 , D2 ,...D9 ) and 99 percentiles ( P1 , P2 ,...., P99 ). The quartile Q1 separates the bottom 25% from
the top 75%, Q2 is the median and Q3 separates the top 25% from the bottom 75%. To work with
percentiles, deciles and quartiles - you need to learn to do two different tasks. First you should
learn how to find the percentile that corresponds to a particular score and then how to find the
score in a set of data that corresponds to a given percentile.
Percentiles of a distribution are the values on the scale of measurement below which we find a
certain percentage of the group.
P90 : is called the 90th percentile; it is the value below which 90 percent of the group is located.
P40 : is called the 40th percentile; it is the value below which 40 percent of the group is located.
Pk : is called the kth percentile; it is the value below which k percent of the group is located.
Quartiles: there are three quartiles Q1 , Q2 and Q3
Q1 = P25: the first quartile is the same as the 25th percentile.
Q2 = P50 = Median: the second quartile is the same as the 50th percentile.
Q3 = P75: the third quartile is the same as the 75th percentile.
Deciles: there are nine deciles
D1 = P10: the first decile is the same as the 10th percentile.
D2 = P20: the second decile is the same as the 20th percentile, and so on.
Note: D5 = P50 = Q2 = Median, i.e. the fifth decile is the same as the median.
How to approximate the percentiles:
Step 1. Arrange the given scores in increasing order: X_1 ≤ X_2 ≤ ... ≤ X_n.
Step 2. Compute k·n/100 and let N be this value, rounded up to the next whole number if it is not already a whole number.
Step 3. If N was obtained by rounding up, use P_k = X_N; otherwise use P_k = (X_N + X_{N+1})/2.
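The approximation procedure can be sketched in Python as below. The function name percentile is hypothetical, k is assumed to be a whole number between 1 and 99, and integer arithmetic is used to decide whether k·n/100 is a whole number. The data are the seven values from the median example earlier in these notes.

```python
def percentile(scores, k):
    """Approximate the k-th percentile (k = 1..99) by the three-step rule."""
    if not 0 < k < 100:
        raise ValueError("k must lie strictly between 0 and 100")
    ordered = sorted(scores)              # Step 1: increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("no scores given")
    whole, remainder = divmod(k * n, 100) # Step 2: k*n/100
    if remainder:                         # not whole: round up, use X_N
        return ordered[whole]             # X_(whole+1) in 1-based counting
    # already whole: average X_N and X_(N+1) (1-based), Step 3
    return (ordered[whole - 1] + ordered[whole]) / 2


data = [8, 5, 7, 13, 2, 13, 9]
print(percentile(data, 50), percentile(data, 25))
```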
How to approximate the quartiles:
Step 1. Arrange the given scores in increasing order
Step 2. Q2 is the median of the distribution
Step 3. Q1 is the median of the first half of the distribution
Step 4. Q3 is the median of the second half of the distribution
Inter-quartile range = Q3 – Q1
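A minimal Python sketch of the quartile steps follows. The notes do not say what to do with the middle value when n is odd; the sketch assumes it is left out of both halves, which is one common convention. The function name quartiles is hypothetical.

```python
import statistics


def quartiles(scores):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(scores)              # Step 1
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two scores are needed")
    half = n // 2
    lower = ordered[:half]                # first half
    upper = ordered[half + n % 2:]        # second half (skips the middle if n is odd)
    return (
        statistics.median(lower),         # Step 3: Q1
        statistics.median(ordered),       # Step 2: Q2
        statistics.median(upper),         # Step 4: Q3
    )


q1, q2, q3 = quartiles([2, 5, 7, 8, 9, 13, 13])
print(q1, q2, q3, q3 - q1)                # the last value is the inter-quartile range
```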
All the three series A, B, and C, have the same size (n=9) and the same mean, i.e., 15. Thus, if we are
given that the mean of a series of 9 observations is 15, we cannot determine if we are talking of the series
A, B, or C. In fact any series of 9 items which total 135 will give mean 15. Thus, we may have a large
number of series with entirely different structures and compositions but having the same mean.
From the above illustration, it is obvious that the measure of central tendency is inadequate to describe the
distribution completely. In the words of George Simpson and Fritz Kafka, ‘An average does not tell the
full story. It is hardly fully representative of a mass unless we know the manner in which the individual
items scatter around it. A further description of the series is necessary if we are to gauge how
representative the average is’.
Thus, the measures of central tendency must be supported and supplemented by some other measures.
One such measure is dispersion.
Literal meaning of dispersion is ‘scatteredness’. We study dispersion to have an idea of the homogeneity
(compactness) or heterogeneity (scatter) of the distribution. In the above illustration, we say that the
series A is stationary, i.e., it is constant and shows no variability. Series B is slightly dispersed and series
C is relatively more dispersed. We say that series B is more homogeneous (or uniform) as compared with
series C or the series C is more heterogeneous than series B.
W I King has defined dispersion as a term which is used to indicate the facts that within a given group,
the items differ from one another in size or in other words, there is lack of uniformity in their sizes.
Spiegel notes that ‘the degree to which numerical data tend to spread about an average value is the
variation or dispersion of the data’.
Primarily, we use two separate devices for measuring the dispersion of a variable. One is the algebraic
method and the other is the graphical method. In the algebraic method we use different notations and
definitions to measure it in a number of ways, and in the graphical method we try to measure the
variability of the given observations graphically, mainly through scatter diagrams and by fitting
different lines through those scattered points.
In the Algebraic method we split them up into two main categories, one is Absolute measure and the other
is Relative measure. Under the Absolute measure we again have four separate measures, namely Range,
Quartile Deviation, Standard Deviation and the Mean Deviation. And finally, under the Relative measure,
we have four other measures termed as Coefficient of Range, Coefficient of Variation, Coefficient of
Quartile Deviation and the Coefficient of Mean Deviation.
The Range
It is the simplest of all the measures of dispersion. It is defined as the difference between the two extreme
observations of the distribution. In other words, range is the difference between the greatest (maximum)
and the smallest (minimum) observation of the distribution. Thus:
Range = Xmax – Xmin (Range = L-S).
Where Xmax is the greatest observation (L) and
Xmin is the smallest observation of the variable value (S).
In case of the grouped frequency distribution (for discrete values) or the continuous frequency
distribution, range is defined as the difference between the upper limit of the highest class and the lower
limit of the lowest class. Here, the frequencies of the classes are immaterial.
The required Range is 54.5 – 4.5 = 50 or the observations on the variable are found scattered within 50
units.
It is to be noted that any change in marginal values or the classes of the variable in the series given will
change both the absolute and the percentage values of the Range.
Demerits
It is not based on the entire set of data (i.e., it ignores the bulk of the data available to us). It is based
only on two extreme observations.
It is very much affected by fluctuations of sampling. Its value varies very widely from sample to
sample.
Uses of Range
In spite of these limitations, the range has its applications in a number of fields;
It is used in studying the variations in the prices of stocks (ie, stock market fluctuations) and other
commodities.
Range is used in industry for the statistical quality control of the manufactured product by the
construction of R-chart, ie, the control chart for range.
Used very conveniently by meteorological department; ie, maximum and minimum temperatures of
the day.
Most widely used measure of variability in our day-to-day life, difference between highly paid and
lowly paid worker, etc.
MEAN DEVIATION
It is a measure of dispersion that gives the average absolute difference
(i.e. ignoring negative signs) between each item and the mean.
It is a much more representative measure than the range, since all item values are
taken into account in its calculation.
The mean deviation (MD) for a set of values is
\[ \text{MD} = \frac{\sum |x - \bar{x}|}{n} \]
Example: 43, 75, 48, 39, 51, 47, 50, 47
Mean x̄ = 400/8 = 50
x      (x – x̄)
43      -7
75      25
48      -2
39     -11
51       1
47      -3
50       0
47      -3
Σ|x – x̄| = 52 (ignoring the negative signs)
MD = Σ|x – 50|/n = 52/8 = 6.5
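The calculation can be sketched in Python as follows; the function name mean_deviation is an assumption made for illustration.

```python
def mean_deviation(values):
    """Average absolute difference between each item and the arithmetic mean."""
    n = len(values)
    if n == 0:
        raise ValueError("no values given")
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n   # signs are ignored


md = mean_deviation([43, 75, 48, 39, 51, 47, 50, 47])
print(md)
```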
Mean deviation for a frequency distribution (simple frequency)
The formula is
\[ \text{MD} = \frac{\sum f\,|x - \bar{x}|}{\sum f} \]
Get the mean first, i.e. x̄ = Σfx/Σf = 77/30 ≈ 2.56
x       f     fx    x – x̄    f|x – x̄|
0       2      0    -2.56      5.12
1       4      4    -1.56      6.24
2       7     14    -0.56      3.92
3      11     33     0.44      4.84
4       4     16     1.44      5.76
5       2     10     2.44      4.88
Total  30     77     9.00 (signs ignored)   30.76
MD = 30.76/30 ≈ 1.025
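A short Python sketch of the frequency version is given below; the function name mean_deviation_freq is hypothetical. Because the code keeps the unrounded mean 77/30 rather than 2.56, its result (about 1.024) differs slightly from the hand calculation.

```python
def mean_deviation_freq(xs, fs):
    """MD = sum(f * |x - mean|) / sum(f) for a simple frequency distribution."""
    total = sum(fs)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    mean = sum(f * x for x, f in zip(xs, fs)) / total   # sum(fx) / sum(f)
    return sum(f * abs(x - mean) for x, f in zip(xs, fs)) / total


md_freq = mean_deviation_freq([0, 1, 2, 3, 4, 5], [2, 4, 7, 11, 4, 2])
print(round(md_freq, 3))
```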
Mean deviation for grouped frequency
\[ \text{MD} = \frac{\sum f\,|x - \bar{x}|}{\sum f} \]
where x is the mid-value of each class.
The example below to be calculated in class together with the teacher.
Standard Deviation
Standard deviation, usually denoted by the letter σ (small sigma) of the Greek alphabet, was first suggested
by Karl Pearson as a measure of dispersion in 1893. It is defined as the positive square root of the
arithmetic mean of the squares of the deviations of the given observations from their arithmetic mean.
Thus, if X1, X2, …, Xn is a set of n observations, then its standard deviation is given by:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}} \]
4.4 Variance
According to William I Greenwald, the variance is the mean of the squared deviations about the mean of a
series. Thus, variance is the square of the standard deviation and is denoted by σ² (or s²). For the above
individual series, it is computed as:
\[ s^2 = \frac{\sum (x - \bar{x})^2}{N} \]
Example 1
Calculate the mean, median, mode, standard deviation and variance of the following data.
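5, 8, 15, 29, 47, 47, 64, 71, 74.
Solution:
x       x – x̄             (x – x̄)²
5       5 – 40 = -35       1,225
8       8 – 40 = -32       1,024
15      15 – 40 = -25        625
29      29 – 40 = -11        121
47      47 – 40 = 7           49
47      47 – 40 = 7           49
64      64 – 40 = 24         576
71      71 – 40 = 31         961
74      74 – 40 = 34       1,156
Σx = 360                   Σ(x – x̄)² = 5,786
Mean x̄ = 360/9 = 40; median = 47 (the 5th of the 9 ordered values); mode = 47.
Variance = 5,786/9 ≈ 642.89; standard deviation = √642.89 ≈ 25.36.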
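The table can be reproduced with a few lines of Python. This is only a sketch: the function name variance_and_sd is an assumption, the divisor is N (as in the definition above, not N – 1), and the data are the nine values that appear in the solution table.

```python
import math


def variance_and_sd(values):
    """Return (mean, variance, standard deviation) using divisor N."""
    n = len(values)
    if n == 0:
        raise ValueError("no values given")
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n   # mean squared deviation
    return mean, variance, math.sqrt(variance)


mean_1, var_1, sd_1 = variance_and_sd([5, 8, 15, 29, 47, 47, 64, 71, 74])
print(mean_1, round(var_1, 2), round(sd_1, 2))
```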
Example 2
Calculate the mean, variance and standard deviation from the following data:
Value: 90-99 80-89 70-79 60-69 50-59 40-49 30-39
Frequency: 2 12 22 20 14 4 1
Solution:
Class    Class boundary   Mid-value (x)   Frequency (f)    fx      (x – x̄)   (x – x̄)²   f(x – x̄)²
90-99    89.5-99.5        94.5             2               189      26.4       696.96     1393.92
80-89    79.5-89.5        84.5            12              1014      16.4       268.96     3227.52
70-79    69.5-79.5        74.5            22              1639       6.4        40.96      901.12
60-69    59.5-69.5        64.5            20              1290      -3.6        12.96      259.20
50-59    49.5-59.5        54.5            14               763     -13.6       184.96     2589.44
40-49    39.5-49.5        44.5             4               178     -23.6       556.96     2227.84
30-39    29.5-39.5        34.5             1                34.5   -33.6      1128.96     1128.96
Total                                     75              5107.5                         11728.00
Mean x̄ = Σfx/Σf = 5107.5/75 = 68.1
Variance = Σf(x – x̄)²/Σf = 11728/75 ≈ 156.37
Standard deviation = √156.37 ≈ 12.50
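The grouped calculation can be sketched in Python using the class mid-values. The function name grouped_variance is hypothetical and the divisor is the total frequency, in line with the variance definition given earlier.

```python
import math


def grouped_variance(midpoints, freqs):
    """Return (mean, variance, standard deviation) of a grouped distribution."""
    total = sum(freqs)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / total
    variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / total
    return mean, variance, math.sqrt(variance)


mids = [94.5, 84.5, 74.5, 64.5, 54.5, 44.5, 34.5]
freqs = [2, 12, 22, 20, 14, 4, 1]
mean_2, var_2, sd_2 = grouped_variance(mids, freqs)
print(round(mean_2, 1), round(var_2, 2), round(sd_2, 2))
```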
The coefficient of dispersion based on the standard deviation, multiplied by 100, is called the coefficient of variation, abbreviated as CV.
Thus: CV = 100 × s / x̄
For comparing the variability of two distributions we compute the CV for each distribution. A
distribution with smaller CV is said to be more homogeneous or uniform or less variable than the other
and the series with greater CV is said to be more heterogeneous or more variable than the other.
Example:
Two workers on the same job show the following results over a long period of time:
Worker A Worker B
Mean time of completing the job (minutes) 30 25
Standard deviation (minutes) 6 4
(i) Which worker appears to be more consistent in the time she requires to complete the job?
(ii) Which worker appears to be faster in completing the job? Explain.
Solution:
(i) We know CV = (s / x̄) × 100
CV (for worker A) = 6 × 100/30 = 20
CV (for worker B) = 4 × 100/25 = 16
Since CV (B) is less than CV (A), worker B appears to be more consistent in the time she requires to
complete the job.
(ii) Since x̄B < x̄A, i.e., on average worker B takes less time than worker A to complete the job,
worker B appears to be faster in completing the job.
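The comparison of the two workers can be expressed as a small Python sketch; the function name coefficient_of_variation is an assumption.

```python
def coefficient_of_variation(mean, sd):
    """CV = 100 * s / mean, expressed as a percentage."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return 100 * sd / mean


cv_a = coefficient_of_variation(30, 6)    # worker A
cv_b = coefficient_of_variation(25, 4)    # worker B
more_consistent = "B" if cv_b < cv_a else "A"
print(cv_a, cv_b, more_consistent)
```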
Skewness. Literal meaning of skewness is ‘lack of symmetry’. We study skewness to have an idea about
the shape of the curve which we can draw with the help of the given frequency distribution. It helps us to
determine the nature and extent of concentration of the observations towards the higher or lower values of
the variable. A distribution with an asymmetric tail extending out to the right is referred to as “positively
skewed” or “skewed to the right,” while a distribution with an asymmetric tail extending out to the left is
referred to as “negatively skewed” or “skewed to the left.” Skewness can range from minus infinity to
positive infinity.
Skewness
A score of zero indicates a perfectly symmetrical distribution (such as the normal distribution)
Negative scores indicate a negative skew
Positive scores indicate a positive skew
Kurtosis
A score of zero indicates a mesokurtic curve
Negative scores indicate a platykurtic curve (too flat)
Positive scores indicate a leptokurtic curve (too peaked)
The more each score deviates from zero, the more the curve deviates from a normal distribution.
CORRELATION ANALYSIS
Introduction
This is a technique used to measure the strength of the relationship between two variables. It is very hard
at times to discuss correlation without mentioning regression. The purpose of regression is to identify a
relationship for a given set of bivariate data. What it does not do, however, is give any indication of how
good this relationship might be. Correlation therefore comes in to provide a measure of how well a line
of best fit describes the scattered points.
Correlation means the existence of some definite relationship between two or more variables. It means
that if two quantities vary in such a way that movements in one are accompanied by movements in the
other, these quantities are said to be correlated. For example, there exists some relationship between
family income and expenditure on luxury items, or between the price of a commodity and the amount demanded. Movements
in one factor often appear to be related in some way to movements in one or several other factors. For example, a
marketing manager may observe that sales increase when there has been a change in advertising
expenditure. The transport manager may notice that as vans and lorries cover more miles, the need
for maintenance becomes more frequent.
Certain questions may arise in the mind of the manager or analyst. These may be summarized as follows:
(i) Are the movements in the same or in opposite direction?
(ii) Could changes in one phenomenon or variable be causing or be caused by movements in the other
variable?
(iii) Could apparently related movements come about purely by chance?
(iv) Could movements in one factor or variable be as a result of combined movements in several other
factors or variables?
(v) Could movements in two factors be related, not directly, but through movements in a third
variable hitherto unnoticed?
(vi) What is the use of this knowledge anyway?
[Figure: scatter diagram of the Female values (vertical axis, 0 to 20) against the Male values (horizontal axis, 1 to 13), plotted from the table below]
MALE FEMALE
9 6
10 8
13 20
5 9
6 11
10 9
Correlation is therefore concerned with describing the strength of the relationship between two variables by
measuring the degree of scatter of the data values. The less scattered the data values are, the stronger the
correlation is said to be.
TYPES OF CORRELATION
i) Positive & Negative Correlation
If two variables move in the same direction, they are said to be positively correlated. On the other hand, if the
variables move in opposite directions, they are said to be negatively correlated.
Of these, the scatter diagram method is based on the knowledge of graphs whereas the others are
mathematical techniques.
A) Scatter Diagram
It is a basic way of assessing correlation.
It involves plotting a dot for each pair of values of the two variables on a chart called a dot diagram or scatter diagram; this
is normally done on graph paper.
MERITS (ADVANTAGES)
It is simple and non-mathematical.
It is not influenced by extreme values.
It is normally the first step in investigating correlation.
DEMERITS (DISADVANTAGES)
It is based on estimation; therefore different people can arrive at different answers for the same problem.
[Figure: three scatter diagrams showing positive correlation, no correlation and negative correlation]
We need a way of measuring the strength of the correlation between two variables. This is achieved through
a correlation coefficient, normally represented by the symbol ‘r’, whose value lies between –1
and +1.
Steps in interpreting r
1. Identify the sign of the index ‘r’ (positive or negative correlation).
2. Assess its strength or weakness (how close its magnitude is to 1).
This provides a measure of the strength of association between two variables, one the dependent variable,
the other the independent variable. This coefficient, denoted ‘r’, gives us an indication of the strength of
the linear relationship between two variables. There are several possible formulae but a practical one is:
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[\,n\sum x^2 - \left(\sum x\right)^2\right]\left[\,n\sum y^2 - \left(\sum y\right)^2\right]}} \]
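The product moment formula, r = (nΣxy – ΣxΣy) / √{[nΣx² – (Σx)²][nΣy² – (Σy)²]}, can be sketched in Python as follows. The function name pearson_r is an assumption; the data are the six MALE and FEMALE pairs tabulated earlier in this section.

```python
import math


def pearson_r(xs, ys):
    """Pearson product moment correlation coefficient from the raw sums."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when one series is constant")
    return (n * sum_xy - sum_x * sum_y) / denominator


male = [9, 10, 13, 5, 6, 10]
female = [6, 8, 20, 9, 11, 9]
r = pearson_r(male, female)
print(round(r, 2))
```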
Example
The coefficient is high and positive, meaning that on the evidence of this data, an increase in advertising
expenditure has a positive and large impact on sales.
Even though analysis indicates that correlation exists between the variables, we are not justified in
assuming that there is therefore a cause and effect relationship. We must never fall into the trap of
assuming that cause and effect exists when it is nonsense to do so. Therefore, care is needed in the
interpretation of the calculated value of r. A high value (above +0.9 or below –0.9) only shows a meaningful
association between the two variables if there is a causal relationship, i.e., if a change in one variable
causes changes in the other. A high positive correlation between television sales and annual
admissions into mental institutions, for instance, does not indicate any such association. It would be ludicrous to suggest
that cause and effect exists here. The only logical conclusion that one can draw is that, quite by chance,
both statistics were increasing at the same rate. In this case, it is quite clear that cause and effect is not proved by a high
correlation coefficient.
Also, a low correlation coefficient, somewhere near zero, does not always mean that there is no
relationship between the variables. All it says is that there is no linear relationship between the variables
– there may be a strong relationship but of a non-linear kind.
A further problem in interpretation arises from the fact that ‘r’ measures the relationship between a single
independent variable and a dependent variable, whereas a particular variable may be dependent on several
independent variables, in which case multiple correlation should have been calculated rather than the
simple two-variable coefficient.
To conclude, one may ask: if we can never use correlation analysis to prove a cause and effect
relationship, why study it?
ASSIGNMENT
(1) The data in the table below relate the weekly maintenance cost in millions (m) to the age
(in months) of 10 machines of similar type in a manufacturing company.
Machine 1 2 3 4 5 6 7 8 9 10
Age 5 10 15 20 30 30 30 50 50 60
Cost 190 240 250 300 310 335 300 300 350 395
Required
1. Calculate the Pearson product moment correlation and interpret it.
2. As a manager what strategy would you recommend on the machine?
(2) The following data obtained from claims drawn on life assurance policies for a particular
Category of employee relates age at official retirement to age at death for 9 males
Age at retirement 57 62 60 57 65 60 58 62 56
Age at death 71 70 66 70 69 67 69 63 70
Required
1. Calculate the Pearson product moment correlation and comment on the correlation.
2. As a claim manager what basic decision can you make about this correlation?
(3) Calculate the Karl Pearson coefficient of correlation for the following ages of
husbands and wives at the time of their marriage.
Age of Husbands 23 27 28 28 28 30 30 33 35 38
Age of wife 18 20 22 27 21 29 29 29 28 29
The Rank Correlation Coefficient (ρ)
Sometimes we come across statistical series (data) in which the variables under consideration are not
capable of quantitative measurement but can be arranged in serial order. This happens when we are
dealing with qualitative characteristics (attributes) such as honesty, beauty, character, morality etc, which
cannot be measured quantitatively but can be arranged serially. For such cases Charles Edward Spearman,
a British psychologist, developed a formula in 1904 which consists in obtaining the correlation coefficient
between the ranks of individuals in the attributes under study.
Therefore, the purpose of the rank coefficient is to establish whether there is any form of association
between two variables when the variables are arranged in a ranked form.
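For reference, the standard form of Spearman's rank correlation coefficient (for ranks without ties) is stated below. Here d is the difference between the two ranks given to the same individual and n is the number of individuals ranked.

$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$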
Notes:
(i) The limits for the rank correlation coefficient ρ are –1 and +1, i.e., –1 ≤ ρ ≤ +1, and it is
interpreted in the same way as r.
(ii) As with r, care should be taken in any interpretation of the value of ρ, whether it is a particularly
high or low value.
(iii) Since the square of a real quantity is always non-negative, i.e., d² ≥ 0, the sum Σd², being a sum of
non-negative quantities, is also non-negative, so ρ can never exceed +1. The sign of equality
(ρ = +1) holds if and only if Σd² = 0, and Σd² = 0 if and only if each d = 0.
Example
Ten competitors in a beauty contest are ranked by 3 judges in the following order.
1st Judge 1 6 5 10 3 2 4 9 7 8
2nd Judge 3 5 8 4 7 10 2 1 6 9
3rd Judge 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to determine which pair of judges has the nearest approach to
common tastes in beauty
Solution:
Here d12 = R1 – R2, d13 = R1 – R3 and d23 = R2 – R3.
R1    R2    R3    d12   d13   d23   d12²  d13²  d23²
 1     3     6    -2    -5    -3      4    25     9
 6     5     4     1     2     1      1     4     1
 5     8     9    -3    -4    -1      9    16     1
10     4     8     6     2    -4     36     4    16
 3     7     1    -4     2     6     16     4    36
 2    10     2    -8     0     8     64     0    64
 4     2     3     2     1    -1      4     1     1
 9     1    10     8    -1    -9     64     1    81
 7     6     5     1     2     1      1     4     1
 8     9     7    -1     1     2      1     1     4
Total              0     0     0    200    60   214
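Using the totals Σd² = 200, 60 and 214 with n = 10, so that n(n² – 1) = 990, the rank correlation
coefficients are ρ12 = 1 – 6(200)/990 = –0.21, ρ13 = 1 – 6(60)/990 = 0.64 and ρ23 = 1 – 6(214)/990 = –0.30.
Since ρ13 is the largest, the pair of first and third judges have the nearest approach to common tastes in
beauty. Since ρ12 and ρ23 are negative, the pairs of judges (1,2) and (2,3) have opposite (divergent) tastes
for beauty.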
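The judges example can be checked with a short script. This is a minimal sketch that assumes the standard no-ties formula ρ = 1 – 6Σd²/(n(n² – 1)); the function name spearman_rho and the variable names are illustrative.

```python
def spearman_rho(ranks_a, ranks_b):
    """Rank correlation for two rankings of the same individuals (no tied ranks)."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks awarded to the ten competitors, taken from the example above.
judge1 = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
judge2 = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
judge3 = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]

rho_12 = spearman_rho(judge1, judge2)
rho_13 = spearman_rho(judge1, judge3)
rho_23 = spearman_rho(judge2, judge3)

for label, rho in [("1 and 2", rho_12), ("1 and 3", rho_13), ("2 and 3", rho_23)]:
    print(f"Judges {label}: rho = {rho:.4f}")
```

The pair with the largest coefficient is the pair whose rankings agree most closely.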
Example 2
The personnel department of a large company is investigating the possibility of assessing the suitability of
applicants by using psychological tests instead of normal interview procedures. A comparative test of
seven applicants was carried out using both methods. The results were as follows:
(b) There is a high correlation between the two methods, meaning they can be used interchangeably.
Example 3
Compute the rank correlation coefficient between advertisement cost and sales revenue from the
following series:
Solution
Let X denote the advertisement cost and Y denote the sales revenue.
Limitations
1. The rank correlation coefficient cannot be used for finding out correlation in a grouped frequency
distribution.
2. If n > 30, the formula should not be used unless the ranks are given, since otherwise the
calculations are quite time consuming.
REGRESSION ANALYSIS
Introduction
This is the measure of the average relationship between two or more variables in terms of the original
units of the data.
It is concerned with estimating the value of one variable when the value of the other variable is known.
Definition
The word ‘regression’ refers to the act of returning or going back. It is concerned with the estimation of
the relationship between variables. Regression analysis helps statisticians to determine the probable form
of the relationship between variables. In other words, one is able to find out the extent to which one
variable changes in response to a given change in the other. It can be used to predict or estimate the
unknown value of one variable corresponding to a given value of another variable. It also reveals the
nature of the relationship between the variables and hence determines the average probable change in one
variable when a certain amount of change in the other is known.
Regression line
2. It is used to obtain a measure of the error involved in using the regression line as
a basis for estimates.
3. It helps to obtain a measure of the degree of association or correlation that exists
between 2 variables.
i. The coefficient of correlation is a measure of the degree of relationship between X and Y, whereas
the objective of regression analysis is to study the nature of the relationship between the variables.
ii. The cause and effect relationship is clearly indicated through regression analysis, which is not
the case with correlation.
iii. Correlation analysis is confined only to the study of linear relationships between the variables and
therefore has limited application. On the other hand, regression has a wider application since it
studies linear as well as non-linear relationships between the variables.
iv. There may be nonsensical correlation between the 2 variables which is due to mere chance and
has no practical relevance.
A simple linear regression involves an attempt to develop a straight-line (linear) mathematical model to
describe the relationship between 2 variables. A linear or straight-line equation is normally used. Linear
equations are important because they closely approximate many real relationships and because they are
easy to work with and interpret.
[Figure: graph of the regression line, y plotted against x]
y = a + bx + m
Where:
a- Is a constant indicating the intercept of the regression line. (The intercept specifies the value of the
dependent variable when the independent variable has a value of 0. It is also referred to as a constant.)
b- Is another constant indicating the slope of the regression line. (The slope of the line indicates the
amount of change in the value of the dependent variable for a unit change in the independent variable.)
That is, b = (Change in Y) / (Change in X).
Y = Dependent variable
X = Independent variable
m- Stands for the other factors that can explain change in Y but which are not quantifiable. These are
factors that are not captured by the regression line and thus are absorbed by m.
a) Simultaneous equations
$$\sum y = na + b\sum x$$
$$\sum xy = a\sum x + b\sum x^2$$
Example one
An agronomist experimented with different amounts of liquid fertilizer on a sample of equal-size plots. The
amount of fertilizer (X) and the yield per acre in tonnes (Y) are given below.
Plot   Fertilizer (X)   Yield (Y)
A      2                7
B      1                3
C      3                8
D      4                10
Required
Y = a + bx
Y = 1.5 + 2.2x
Ignore the value of a in the interpretation: Y = +2.2x
Y and X are positively related (the yield and the fertilizer are positively related).
A unit increase in X causes an increase in Y of 2.2 (one tonne of fertilizer causes an increase of 2.2 tonnes
in the yield).
Question d
Calculate the amount of yield that 8 tonnes of fertilizer will produce.
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
$$a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$
Equation:
$$\sum y = na + b\sum x$$
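The formulas for a and b can be applied to the fertilizer data of Example one as a check. The sketch below is illustrative; the function name fit_line and the variable names are assumptions, and the first column of the table is taken as X (fertilizer) and the second as Y (yield).

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b from the raw sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

fertilizer = [2, 1, 3, 4]     # X, plots A to D
crop_yield = [7, 3, 8, 10]    # Y, plots A to D

a, b = fit_line(fertilizer, crop_yield)
predicted_yield = a + b * 8   # Question d: yield expected from 8 tonnes of fertilizer

print(f"Y = {a:.1f} + {b:.1f}x")
print(f"Predicted yield for x = 8: {predicted_yield:.1f}")
```

Note that x = 8 lies well outside the observed range of 1 to 4, so the prediction assumes that the straight-line relationship continues to hold there.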
A company is planning to invest heavily in a marketing programme. The initial performance of the
marketing programme is as follows.
X Amount invested (M) 5 10 15 20 30
Y Volume of sales (Tonnes) 190 240 250 300 310
Required
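$$a = \bar{Y} - b\bar{X}$$
$$b = \frac{\sum x_i y_i}{\sum x_i^2}$$
where xi and yi are the deviations of X and Y from their respective means (here the mean of X is
94/5 = 18.8 and the mean of Y is 104/5 = 20.8).
X      y      xi      yi      xi2      xiyi
16     21     -2.8    0.2     7.84     -0.56
18     22     -0.8    1.2     0.64     -0.96
11     27     -7.8    6.2     60.84    -48.36
23     14     4.2     -6.8    17.64    -28.56
26     20     7.2     -0.8    51.84    -5.76
94     104                    138.80   -84.20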
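The deviation method in the table above can be reproduced in a few lines. This is a sketch under the assumption that xi and yi are deviations from the means of X and Y; the variable names are illustrative.

```python
xs = [16, 18, 11, 23, 26]
ys = [21, 22, 27, 14, 20]

mean_x = sum(xs) / len(xs)   # 94 / 5
mean_y = sum(ys) / len(ys)   # 104 / 5

# Deviations of each observation from its mean (the xi and yi columns).
dev_x = [x - mean_x for x in xs]
dev_y = [y - mean_y for y in ys]

sum_xi2 = sum(dx * dx for dx in dev_x)                  # total of the xi2 column
sum_xiyi = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) # total of the xiyi column

b = sum_xiyi / sum_xi2
a = mean_y - b * mean_x

print(f"b = {b:.4f}, a = {a:.4f}")
```

The slope comes out negative for this table, i.e. Y tends to fall as X rises.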
In statistics, we usually deal with the normal, binomial and Poisson distributions. The binomial and
Poisson distributions are discrete probability distributions (i.e., the variables under study are discrete
random variables).
The normal distribution or normal curve is the most important continuous theoretical distribution in
statistics. It occurs frequently when describing natural occurrences and is of particular importance in
sampling theory and statistical inference. Also, most of the data relating to economic and business
statistics or even in social and physical sciences conform to this distribution.
The normal distribution was first discovered in 1733 by Abraham de Moivre (1667-1754), a French-born
mathematician working in England, who obtained the mathematical equation for this distribution while
dealing with problems arising in games of chance. The normal distribution is also known as the Gaussian
distribution (Gaussian Law of Errors) after Carl Friedrich Gauss (1777-1855), who used this distribution to
describe the theory of accidental errors of measurement involved in the calculation of orbits of heavenly
bodies.
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
[Figure: normal curve p(x), centred on the line x = μ]
(iii) The curve is symmetrical about the line x = μ (z = 0), i.e., it has the same shape on either side of the
line x = μ (or z = 0).
(iv) Since the distribution is symmetrical, mean, median and mode coincide. Thus, Mean = Median =
Mode = μ. Thus, the distribution is unimodal, the only mode occurring at x = μ.
(v) No portion of the curve lies below the x-axis, since p(x) being the probability can never be
negative. Thus, the ‘tails’ of the distribution continually approach, but never touch, the horizontal
axis.
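[Figure: normal curve showing an area of 0.5000 to the left of z = 0 and a shaded area of 0.1915 between
z = 0 and z = 0.5]
$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$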
If we wanted to compute the areas between 0 and –0.5 and between 0 and +0.5, we do not need to keep on
evaluating using the above (y) formula. Tables of these integrals have been compiled and, due to the
symmetry of the curve, usually take the following form: values of z are listed, running from 0 upwards
and opposite each an area is given, which is the area under the curve between 0 and z.
For example, if we look up the value corresponding to z=0.5 in a set of these tables, we find a figure of
0.1915. This means that the shaded area (shown in the fig above) is 0.6915, i.e., 0.5000+0.1915 = 0.6915.
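Now, suppose we require P(–1 < z < +1). We can use the tables to find this probability as follows: by
symmetry the area between –1 and +1 equals twice the area between 0 and 1. The tables give 0.3413 for
z = 1, so P(–1 < z < +1) = 2 × 0.3413 = 0.6826.
[Figure: standard normal curve with the area between z = –1 and z = +1 shaded; the z-axis runs from –∞ to +∞]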
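The table look-ups above can be reproduced without printed tables. The sketch below uses NormalDist from Python's standard statistics module (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area between z = 0 and z = 0.5 (the figure read from the tables).
area_0_to_half = std_normal.cdf(0.5) - 0.5

# Area to the left of z = 0.5, i.e. 0.5000 plus the tabled figure.
area_left_of_half = std_normal.cdf(0.5)

# P(-1 < z < +1): twice the area between 0 and 1, by symmetry.
area_within_one = 2 * (std_normal.cdf(1.0) - 0.5)

print(f"{area_0_to_half:.4f} {area_left_of_half:.4f} {area_within_one:.4f}")
```

Small differences in the fourth decimal place from hand calculations arise because printed tables are themselves rounded to four places.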
EXERCISE
1. A banker claims that the life of a regular savings account opened with his bank averages 18 months,
with a standard deviation of 6.45 months.
What is the probability that there will still be money after 22 months in a savings account opened with
the same bank by a depositor?
What is the probability that the account will have been closed before 2 years?
2. Regarding a certain distribution concerning the income of individuals, we are given that the mean is
500Km and the standard deviation is 100Km.
Required
Required
a) What is the probability that a random bag will weigh more than 5.5Kgs?
b) How many bags from a single delivery will be expected to weigh more than 5.5Kgs?
4. 1,000 students sat for an examination. The marks obtained followed a normal distribution with a mean
of 50 and a standard deviation of 10.
Required
Find the number of students who scored.
i) Between 50 and 60 marks
ii) Between 50 marks
iii) Above 70 marks
iv) Between 30 and 40 marks
5. Draw and determine the area under the Normal Curve for the following.
i) Between Z = -1.88 and Z= -2.44
ii) Between Z= -1.64 and Z= -1.77
iii) Between Z= 1.44 and Z= -1.84
iv) To the left of Z= -1.97
v) To the left of Z= -2.88
vi) Between Z= 1.96 and Z= -1.89
STATISTICAL INFERENCE
Introduction
Statistical inference can be defined as the process by which conclusions are drawn about some
measure or attribute of a population (e.g., the mean or s.d.) based upon analysis of sample data. In
other words, the process of drawing inferences about a population on the basis of information
contained in a sample taken from the population is called statistical inference. Statistical inference is
traditionally divided into two main branches:
Estimation of parameters; and
Testing of hypothesis
The first step in constructing a confidence interval is to decide how much confidence we want that this
interval will contain the population value. The following procedure is adopted in interval estimation:
(i) The particular statistic, say the mean of the sample or standard deviation of the sample is
determined.
(ii) The confidence level is decided, i.e., 95%, 99% etc.
(iii) The standard error of the particular statistic is calculated.
(iv) Finally, we state with a known degree of confidence that the parameter lies in this interval.
Confidence limits, for population means, are based on the sample mean, the standard error of the mean
and on the known characteristics of normal distribution.
It is known that a normal distribution has the following characteristics:
Mean ± 1.96σ includes 95% of the population.
Mean ± 2.58σ includes 99% of the population.
These characteristics can be used to calculate confidence limits for the population mean when we have
established the sample mean and the standard error.
Population mean (μ) = sample mean ± (appropriate number × Sx), where Sx is the standard error of the
mean. The appropriate number depends on the confidence level: at the 95% confidence level it is 1.96 and
at the 99% confidence level it is 2.58.
Example 1
The quality department of a wire manufacturing company periodically selects a sample of wire specimens
in order to test for breaking strength. Past experience has shown that the breaking strengths of a certain
type of wire are normally distributed with standard deviation of 200 kg. A random sample of 64
specimens gave a mean of 6,200 kg. Find out the population mean at 95% level of confidence.
Solution:
Population mean = sample mean ± 1.96 Sx
Here the sample mean = 6200, n = 64 and s = 200.
$$S_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{200}{\sqrt{64}} = \frac{200}{8} = 25$$
$$\mu = 6200 \pm 1.96(25) = 6200 \pm 49$$
This means that we are 95% confident that the population mean lies within the confidence zone, i.e.,
somewhere between 6151 and 6249.
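The interval in Example 1 can be checked with a short function. This is a minimal sketch; the function name confidence_interval and its parameter names are illustrative, and the default multiplier of 1.96 corresponds to the 95% level quoted above.

```python
from math import sqrt

def confidence_interval(sample_mean, std_dev, n, z=1.96):
    """Confidence limits for a population mean: sample mean +/- z standard errors."""
    standard_error = std_dev / sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# Wire breaking-strength data from Example 1.
lower, upper = confidence_interval(sample_mean=6200, std_dev=200, n=64)
print(f"95% limits: {lower:.0f} kg to {upper:.0f} kg")

# The same data at the 99% level uses 2.58 standard errors instead.
lower_99, upper_99 = confidence_interval(6200, 200, 64, z=2.58)
print(f"99% limits: {lower_99:.1f} kg to {upper_99:.1f} kg")
```

The 99% interval is wider than the 95% interval, which is the loss of precision that comes with greater assurance.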
Example 2
A firm purchases a very large quantity of metal offcuts and wishes to know the average weight of an offcut.
A random sample of 625 offcuts is weighed and it is found that the mean sample weight is 150 grams
with a sample s.d. of 30 grams.
(i) Compute the standard error of the mean.
(ii) What would be the standard error if the sample size was 1,225?
(iii) What is the estimate of the population mean? Use the 95% confidence level.
Solution:
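(i) When n = 625, the sample mean = 150 and s = 30:
$$S_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{30}{\sqrt{625}} = \frac{30}{25} = 1.2 \text{ grams}$$
(ii) When n = 1,225:
$$S_{\bar{x}} = \frac{30}{\sqrt{1225}} = \frac{30}{35} = 0.857 \text{ grams}$$
(iii) Population mean = 150 ± 1.96(1.2) = 150 ± 2.35 grams
This means that we are 95% confident that the population mean lies within the confidence zone, i.e.,
somewhere between 147.65 grams and 152.35 grams.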
Remarks
(i) Raising the confidence factor from 95% to 99% increases the assurance that the confidence zone
contains the population mean but it makes the estimate less precise.
(ii) Any confidence level could be chosen and the appropriate number of standard errors found from
the normal area tables. However, the 95% and 99% values of 1.960 and 2.580 are widely used
and should be committed to memory.
The principles involved in setting confidence limits can be used to determine what sample size should be
taken, if we wish to achieve a given level of precision.
For example, what sample size would be required in Example 2 above if we wish to be 95%
confident that the population mean is within ±2 grams of the sample mean?
To obtain 95% limits of ±2 grams would require a standard error of 1.02 grams, i.e., 2 grams / 1.96.
$$S_{\bar{x}} = \frac{s}{\sqrt{n}} \quad\Rightarrow\quad 1.02 = \frac{30}{\sqrt{n}}$$
$$\sqrt{n} = \frac{30}{1.02} = 29.41$$
$$n = 29.41^2 \approx 865$$
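The sample size calculation can be written as a small function. This is a sketch under the assumption that the result is rounded up to the next whole item; the function name required_sample_size is illustrative.

```python
from math import ceil

def required_sample_size(std_dev, margin, z=1.96):
    """Smallest n for which z standard errors do not exceed the required margin."""
    return ceil((z * std_dev / margin) ** 2)

# Offcuts example: s = 30 grams, required precision of 2 grams at the 95% level.
n_required = required_sample_size(std_dev=30, margin=2)
print(n_required)
```

Halving the margin roughly quadruples the required sample, because n grows with the square of s divided by the margin.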
The principles of confidence limits have been illustrated in connection with means but their use is not
limited solely to the estimation of population means from a sample. The concept is applied across a broad
range of statistical applications.
HYPOTHESIS TESTING
Introduction
A hypothesis is an assumption, belief or opinion which may or may not be true. For example, it may
be believed that a given drug cures 90% of the patients taking it or that the average height of soldiers in
the army is 168 cms. The testing of a statistical hypothesis is the process by which this belief or
opinion is tested by statistical means. It means that testing of a hypothesis is a procedure which
enables us to decide on the basis of information obtained from sample data whether to accept or reject a
statement or an assumption about the value of a population parameter. We accept the hypothesis as
being true, when it is supported by the sample data. We reject the hypothesis when the sample data
fail to support it. The testing of hypothesis is the most important technique in statistical inference.
The tests are widely used in business and industry for making decisions.
It is important to understand what we mean by the terms reject and accept in hypothesis testing. The
rejection of a hypothesis is to declare it false. The acceptance of a hypothesis is to conclude that there is
insufficient evidence to reject it. Acceptance does not necessarily mean that the hypothesis is true.
Test of hypothesis. The testing of hypothesis is a procedure that helps us to ascertain the likelihood of a
hypothesized population parameter being correct by making use of the sample statistic. In other words, it
is a process of test of significance which is concerned with the testing of some hypothesis regarding a
parameter of the population on the basis of a statistic from the sample. In testing of hypothesis, a statistic is
computed from a sample drawn from the parent population and on the basis of this statistic, it is observed
whether the sample so drawn has come from the population with certain specified characteristic. The
value of sample statistic may differ from corresponding population parameter due to sampling
fluctuations. The test of hypothesis discloses the fact whether the difference between sample statistic and
the corresponding hypothetical population parameter is significant or not significant. Thus the test of
hypothesis is also known as the test of significance.
Functions of Hypotheses
Hypotheses serve the following functions;
i. Hypothesis provides tentative explanations of observed phenomena subject to the finding of
research. It is the strategy through which findings of investigative work can be tested for
acceptance or falsification.
ii. It offers guidance to the direction of the research by providing a framework that delimits the
direction and scope of relationships of variables under investigation. For example, a hypothesis
may assert the nature of the relationship, whether positive or negative, and/or the degree of
relationship, whether strong or weak.
iii. Hypothesis and theory complement each other in the sense that a tested and a widely accepted
hypothesis may in itself be generalized into or contribute to a theory, while on the other hand
specific hypotheses may be constructed from general theories, particularly in the rare, but not
impossible instances when theories are subjected to tests for purposes of falsification.
iv. Guides the direction of study, identifies facts that are relevant and suggests the research design
that is appropriate
Types of Hypothesis
Literature review on types of hypothesis reveals that there is no standard way of classifying hypothesis.
However, various authorities concur that hypothesis can take the form of Statistical Hypothesis which is
purely quantitative and Research Hypothesis which extends beyond the quantitative boundary to allow
for deductive and inductive logic in qualitative paradigm.
a) Statistical Hypothesis
A statistical hypothesis is given in statistical terms. It is a statement about one or more parameters that are
measures of the population under study. This is done when it is time to test whether data support or refute
the research hypothesis. Using inferential statistics, conclusions about the population values are drawn
from the sample. This is done through a null or an alternative hypothesis.
b) Null Hypothesis
The null hypothesis is a statement put forward as true or believed to be true of the relationship between
the variables. The null hypothesis is formulated in a manner that portrays no relationship between the
variables. The null hypothesis becomes the basis of argument for the results.
c) Alternative Hypothesis
The alternative hypothesis, written as H1 or Ha, indicates the prediction of the relationship between
the variables. The alternative hypothesis may be non-directional (two-tailed) or directional
(one-tailed). The non-directional alternative hypothesis (H1) is one where the direction of the difference
between the variables is not given. On the other hand, the directional or one-tailed hypothesis
predicts the direction of the relationship between the variables.
d) Research Hypothesis
The research hypothesis is a term used when a hypothesized relationship or prediction is to be tested by
scientific methods. It is a predictive statement that relates an independent variable to a dependent variable
and must be stated in a testable form. Research hypotheses are classified by either direction or logic.
Classification of hypotheses by direction of relationship
An alternative hypothesis is any other hypothesis which we are willing to accept when the null hypothesis
Ho is rejected. It is customarily denoted by H1 or HA. A null hypothesis Ho is thus tested against an
alternative hypothesis H1. For example, if our null hypothesis is Ho: μ = 150 cm, then our alternative
hypothesis may be:
H1: μ ≠ 150 cm or H1: μ > 150 cm or H1: μ < 150 cm.
(ii) Test-Statistics
A statistic (i.e. a function of the sample data not containing any parameter), which provides a basis for
testing a null hypothesis, is called a test-statistic. Every test-statistic has a probability (sampling)
distribution which gives the probability of obtaining a specified value of the test-statistic when the null
hypothesis is true. It is important to remember that a test-statistic does not prove the hypothesis to be
correct but it furnishes evidence against the hypothesis. The most commonly used test-statistics are Z, t,
χ² and F.
(iii) Acceptance and Rejection Regions
All possible values which a test-statistic may assume can be divided into two mutually exclusive groups,
one group consisting of values which appear to be consistent with the null hypothesis and the other
having values which lead to the rejection of the null hypothesis. The first group is called the acceptance
region and the second set of values is known as the rejection region for a test. The rejection region is also
called the critical region. The value(s) that separates the critical region from the acceptance region is
called the critical value(s).
[Figure: sampling distribution showing the acceptance region in the centre and a rejection region of area
0.05 in each tail]
A legal analogy will help in understanding the difference between Type I and Type II errors. In a court
trial, the supposition of law is that the accused (the defendant) is innocent. This supposition of innocence
may be regarded as a kind of null hypothesis Ho that is to be rejected or accepted. After having heard the
evidence presented during the trial, the judge arrives at a decision. Suppose the accused is, in fact,
innocent (i.e. Ho is true), but the finding of the judge is guilty.
The judge has rejected a true null hypothesis and in so doing has made a Type I error.
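If, on the other hand, the accused is, in fact, guilty (i.e. Ho is false) and the finding of the judge is
innocent, the judge has accepted a false null hypothesis and in so doing has made a Type II error.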
viii). Decision – the last step is the decision about the null hypothesis, i.e., whether to accept it or to reject
it. In this regard, we compare the computed value of Z (obtained in step 2), denoted Zs, with the critical
value or significant value or tabled value Zα (given by the table of critical values in step 7) at a given level
of significance α and decide as under:
i. If |Zs| < Zα then we accept the null hypothesis H0. If the calculated value of Z is numerically less than
the tabled value Zα at the level of significance α, then the difference between t and E(t) is not significant
(this difference may be due to fluctuations of sampling), so we accept the null hypothesis. In this case, the
test statistic falls in the region of acceptance.
ii. If |Zs| > Zα then we reject the null hypothesis H0 and accept the alternative hypothesis H1. In this case,
the computed value of Z is numerically greater than the critical value Zα at the level of significance α, and
therefore the computed value of the test statistic falls in the rejection region. So we reject the null
hypothesis and accept the alternative hypothesis at the level of significance α or confidence level (1 – α).
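Test Statistic:
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
Rejection Region: For a probability α of a Type-I error, we can reject H0 if
1. $z > z_{\alpha}$ (when Ha: μ > μ0)
2. $z < -z_{\alpha}$ (when Ha: μ < μ0)
3. $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$ (when Ha: μ ≠ μ0)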
Example1. WSU uses thousands of fluorescent light bulbs each year. The brand of bulb it currently uses
has a mean life of 900 hours. A manufacturer claims that its new brand of bulbs, which cost the same as
the brand the university currently uses, has a mean life of more than 900 hours. The university has
decided to purchase the new brand if, when tested, the test evidence supports the manufacturer’s claim at
α = .05. Suppose sixty-four bulbs were tested with the following results:
Will WSU purchase the new brand of fluorescent bulbs? Conduct a hypothesis test.
Ha: μ > 900 (the mean life for the new brand of bulbs is higher than the mean life for the old brand)
Test Statistic:
$$z = \frac{\bar{x} - 900}{s/\sqrt{n}}$$
Example 2 The average (mean) live weight of a farmer’s steers prior to slaughter was 380 pounds in past
years. This year his 50 steers were fed on a new diet. Suppose we consider these 50 steers on the new diet
as a random sample taken from a population of all possible steers that may be fed the diet now or in the
future. Use the sample data given below and α = .01 to test the research hypothesis that the mean live
weight for steers on the new diet is greater than 380.
Test Statistic:
$$z = \frac{\bar{x} - 380}{s/\sqrt{n}}$$
Calculations:
$$z = \frac{390 - 380}{35.2/\sqrt{50}} = 2.01$$
Conclusion: Using α = .01, the critical value for this one-tailed test is 2.33. Since 2.01 < 2.33, we fail to
reject H0. There is not sufficient evidence to conclude that the mean live weight for steers on the new diet
is greater than 380.
Example 3
A Stenographer claims that she can type at the rate of 120 words per minute. Can we reject her claim on
the basis of 100 trials in which she demonstrates a mean of 116 words with a standard deviation of 15
words? Use 5% level of significance.
Solution
1. Null hypothesis: H0: the stenographer’s claim is true, i.e., H0: μ = 120.
2. Alternative hypothesis: H1: μ ≠ 120 (two-tailed test).
3. Test statistic:
$$Z = \frac{116 - 120}{15/\sqrt{100}} = \frac{-4}{1.5} = -2.67, \text{ so } |Z_s| = 2.67$$
4. Critical value: the value of Zα at the 5% level of significance is Zα = 1.96 (from the table).
5. Decision: since |Zs| = 2.67 is greater than Zα = 1.96 at the 5% level of significance, the null hypothesis
H0 is rejected. Thus, the stenographer’s claim of typing at the rate of 120 words per minute is not supported.
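The calculations in Examples 2 and 3 can be checked with one helper. The sketch below is illustrative: the function name z_test, its tail argument and the use of the sample standard deviation in place of σ (as the examples do for large n) are assumptions. Critical values come from NormalDist in Python's standard statistics module, so they differ slightly from the rounded table values 1.96 and 2.33.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, std_dev, n, alpha, tail):
    """Return (z, critical value, reject H0?) for a one-sample z test.

    tail is "upper" (Ha: mu > mu0), "lower" (Ha: mu < mu0) or "two" (Ha: mu != mu0).
    """
    z = (sample_mean - mu0) / (std_dev / sqrt(n))
    if tail == "two":
        critical = NormalDist().inv_cdf(1 - alpha / 2)
        reject = abs(z) > critical
    else:
        critical = NormalDist().inv_cdf(1 - alpha)
        reject = z > critical if tail == "upper" else z < -critical
    return z, critical, reject

# Example 2: steers on the new diet, Ha: mu > 380, alpha = .01.
z_steers, crit_steers, reject_steers = z_test(390, 380, 35.2, 50, 0.01, "upper")

# Example 3: stenographer's claim, H1: mu != 120, 5% level of significance.
z_steno, crit_steno, reject_steno = z_test(116, 120, 15, 100, 0.05, "two")

print(f"Steers: z = {z_steers:.2f}, critical = {crit_steers:.2f}, reject H0: {reject_steers}")
print(f"Stenographer: z = {z_steno:.2f}, critical = {crit_steno:.2f}, reject H0: {reject_steno}")
```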
CHI-SQUARE TEST
Introduction
The chi-square (χ²) test is used to determine whether there is a significant difference between the expected
frequencies and the observed frequencies in one or more categories. Hence the chi-square test is a useful
measure for comparing experimentally obtained results with those expected theoretically on the basis of the
hypothesis. The chi-square test is applied to those problems in which we study whether the frequency with
which a given event has occurred is significantly different from the one expected theoretically. The
measure of chi-square enables us to find out the degree of discrepancy between observed frequencies and
theoretical frequencies and thus to determine whether the discrepancy so obtained is due merely to
sampling error (chance) or is significant.
The Chi-square is computed on the basis of frequencies in a sample and thus the value of Chi-square so
obtained is a statistic. Chi-square is not a parameter as its value is not derived from the observations in a
population. Hence chi-square test is a non-parametric test. Chi-square test is not concerned with any
population distribution and its observations.
The Chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is
also called a "goodness of fit" statistic, because it measures how well the observed distribution of data
fits with the distribution that is expected if the variables are independent.
1. They are appropriate when only weak assumptions can be made about the distribution.
2. They can be used with categorical data when no adequate scale of measurement is available.
3. For data that can be ranked, nonparametric test using ranked data may be the best option.
4. They are relatively quick and easy to apply and to learn since they involve counts, ranks and signs.
Types of Data:
There are basically two types of random variables and they yield two types of data: numerical and
categorical. A chi-square (χ²) statistic is used to investigate whether distributions of categorical variables
differ from one another. Basically, categorical variables yield data in categories and numerical variables
yield data in numerical form. Responses to such questions as "What is your major?" or "Do you own a
car?" are categorical because they yield data such as "biology" or "no." In contrast, responses to such
questions as "How tall are you?" or "What is your G.P.A.?" are numerical. Numerical data can be either
discrete or continuous. The table below may help you see the differences between these two variables.
Notice that discrete data arise from a counting process, while continuous data arise from a measuring
process. The Chi Square statistic compares the tallies or counts of categorical responses between two (or
more) independent groups. (note: Chi square tests can only be used on actual numbers and not on
percentages, proportions, means, etc.)
Therefore a Chi-square test is designed to analyze categorical data. That means that the data has been
counted and divided into categories. It will not work with parametric or continuous data (such as height in
inches). For example, if you want to test whether attending class influences how students perform on an
exam, using test scores (from 0-100) as data would not be appropriate for a Chi-square test. However,
arranging students into the categories "Pass" and "Fail" would. Additionally, the data in a Chi-square grid
should not be in the form of percentages, or anything other than frequency (count) data. Thus, by dividing
a class of 54 into groups according to whether they attended class and whether they passed the exam, you
might construct a data set like this:
            Pass   Fail
Attended     25      6
Skipped       8     15
IMPORTANT: Be very careful when constructing your categories! A Chi-square test can tell you
information based on how you divide up the data. However, it cannot tell you whether the categories you
constructed are meaningful. For example, if you are working with data on groups of people, you can
divide them into age groups (18-25, 26-40, 41-60...) or income level, but the Chi-square test will treat the
divisions between those categories exactly the same as the divisions between male and female, or alive
and dead! It's up to you to assess whether your categories make sense, and whether the difference (for
example) between age 25 and age 26 is enough to make the categories 18-25 and 26-40 meaningful. This
does not mean that categories based on age are a bad idea, but only that you need to be aware of the
control you have over organizing data of that sort.
Degree of Freedom
The number of data that are given in the form of a series of variables in a row or column or the number of
frequencies that are put in cells in a contingency table, which can be calculated independently is called the
degrees of freedom and is denoted by v.
Case I If the data is given in the form of a series of variables in a row or column, then the degrees of
freedom = (number of items in the series) – 1, i.e., v = n – 1, where n is the number of variables in the
series in a row or column.
Case II When the number of frequencies are put in cells in a contingency table, the degrees of freedom
will be the product of (number of rows less one) and the (number of columns less one), i.e., v = (R – 1)
(C – 1), where R is the number of rows and C is the number of columns.
Uses of the Chi-Square Test
The chi-square test is a very powerful test for testing the hypotheses of a number of statistical problems.
The important uses of the chi-square test are:
1. Test of Goodness of Fit – this test is for assessing if a particular discrete model is a good fitting model
for a discrete characteristic, based on a random sample from the population.
E.g. Has the model for the method of transportation (drive, bike, walk, other) used by students to get
to class changed from that of 5 years ago? Under this test there is only one variable, so the degrees of
freedom are v = n – 1.
2. Test of independence of attributes – this test helps us to assess if two discrete (categorical) variables
are independent for a population, or if there is an association between the two variables.
E.g. Is there an association between satisfaction with the quality of public schools (not satisfied,
somewhat satisfied, very satisfied) and political party (Republican, Democrat, etc.)?
The chi-square test is used to see whether the principles of classification of attributes are independent. In
this test the attributes are classified into a two-way table or a contingency table as the case may be. The
observed frequency in each cell (square) is known as the cell frequency. The total of the frequencies in each
row or column of the two-way contingency table is known as the marginal frequency.
The degrees of freedom are v = (R – 1)(C – 1), where R = number of rows and C = number of columns in the
two-way contingency table. This test discloses whether there is any association or relationship between two
or more attributes.
3. Test of homogeneity or a test for a specified standard deviation – the chi-square test may be used to
test the homogeneity of the attributes in respect of a particular characteristic, or it may be used to test the
population variance. In other words, this test is for assessing if two or more populations are homogeneous
(alike) with respect to the distribution of some discrete (categorical) variable.
E.g. Is the distribution of opinion on legal gambling the same for adult males versus adult females?
All of these tests are based on an $X^2$ test statistic that, if the corresponding H0 is true and the
assumptions hold, follows a chi-square distribution with some degrees of freedom, written $\chi^2_{(df)}$.
The steps in using the chi-square test may be summarized as follows: Test of Goodness of Fit
The chi-square test is widely used to test the independence of attributes. It is applied to test the
association between the attributes when the sample data is presented in the form of a contingency table
with any number of rows or columns.
Step 1. Set up the null hypothesis H0: no association exists between the attributes.
Step 2. Calculate the expected frequency E corresponding to each cell by the formula
$E = \dfrac{\text{row total} \times \text{column total}}{\text{grand total}}$
Step 3. Use the formula to find the chi-square value:
Pearson $X^2 = \sum_{\text{all cells}} \dfrac{(O - E)^2}{E}$
where O is the observed frequency and E is the expected frequency of a cell.
The characteristics of this distribution are completely defined by the number of degrees of freedom v,
which is given by v = (R – 1)(C – 1), where R = number of rows and C = number of columns in the
contingency table.
Step 4. Find from the table the value of chi-square for the given level of significance and the
degrees of freedom v calculated in Step 3.
Step 5. Compare the computed value of chi-square with the tabled value of chi-square found in Step 4. If
your chi-square value is equal to or greater than the table value, reject the null hypothesis: differences in
your data are not due to chance alone.
Example 1
The example used is that for testing whether Gender is independent of type of high school attended
(public or private). The sample data is (the numbers reflect the counts).
          Public   Private   Total
Female      38        7        45
Male        46        9        55
Total       84       16       100
Step 1. The basic premise of the test of independence is to see if the distribution of the percentages is the
same for each level of category. That is, is the percentage of males attending public schools “close
enough” to that percentage for females attending public schools? We do this by comparing what we
“observe” (i.e. the data in the table which is the sample data) to what we would expect to see in our
sample if there was no relationship, i.e. the variables were independent.
H0: “Gender and School Type are independent” versus the alternative hypothesis
H1: “Gender and School Type are dependent” or “there is a relationship between Gender and School
Type”
Step 2. The observed counts are easy, but how do we get the expected counts for each of the cells in the
table? Well, since the idea of the expected counts is to provide what the distribution would be if no
relationship existed, we use the observed column and row totals to calculate how each individual cell
count would be distributed if in fact there was no relationship. We do this by taking each row total times
the column total and then dividing by the overall total. This produces an expected count table of:
          Public   Private   Total
Female    37.80     7.20       45
Male      46.20     8.80       55
Total     84       16         100
Step 3. The test statistic is
$X^2 = \sum \dfrac{(O - E)^2}{E} = \dfrac{(38 - 37.80)^2}{37.80} + \dfrac{(7 - 7.20)^2}{7.20} + \dfrac{(46 - 46.20)^2}{46.20} + \dfrac{(9 - 8.80)^2}{8.80} = 0.012$
Step 4. As you can see this test statistic is quite small as you might have guessed given how close the
observed values were to the expected values. Reading the chi-square table is similar to reading the T-
table in that there is a degree of freedom consideration and the table provides right tail probabilities. The
DF is found by taking the number of rows minus 1 times the number of columns minus 1, written as: (R-
1)*(C-1). For this example the degrees of freedom are (2-1)*(2-1) = 1.
Step 5. Decision: If you have a chi-square table you will see that the test statistic of 0.012 is less than the
chi-square value presented in the table for the 0.05 level with 1 degree of freedom (3.841). This means that
our p-value is greater than 0.05, so we do not reject H0.
We would conclude that there is not enough evidence to reject H0: we cannot say a relationship exists
between Gender and School Type.
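The five steps of Example 1 can be retraced in code. The following is a minimal sketch in plain Python, not a definitive implementation: the function name `chi_square_independence` is an assumption made for illustration, and the critical value 3.841 is the standard table value for the 0.05 level with 1 degree of freedom.

```python
def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for an R x C table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count of a cell = row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Observed counts from Example 1 (rows: Female, Male)
observed = [[38, 7], [46, 9]]
statistic, df, expected = chi_square_independence(observed)

CRITICAL_VALUE = 3.841  # chi-square table value, 0.05 level, 1 degree of freedom
reject_h0 = statistic >= CRITICAL_VALUE

print(round(statistic, 3), df, reject_h0)  # prints: 0.012 1 False
```

The same function accepts any number of rows and columns, since the degrees of freedom are computed as (R – 1)(C – 1); only the critical value has to be looked up again for the new degrees of freedom.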
Example 2
Suppose you conducted a drug trial on a group of animals and you hypothesized that the animals
receiving the drug would show increased heart rates compared to those that did not receive the drug. You
conduct the study and collect the following data:
Hypothetical drug trial results.
               Heart Rate Increased   No Heart Rate Increase   Total
Treated                 36                      14               50
Not treated             30                      25               55
Total                   66                      39              105
Ho: The proportion of animals whose heart rate increased is independent of drug treatment.
Ha: The proportion of animals whose heart rate increased is associated with drug treatment.
Before we can proceed we need to know how many degrees of freedom we have. When a comparison is
made between one sample and another, a simple rule is that the degrees of freedom equal (number of
columns minus one) x (number of rows minus one) not counting the totals for rows or columns. For our
data this gives (2-1) x (2-1) = 1.
We now have our chi-square statistic ($\chi^2$ = 3.418), our predetermined alpha level of significance (0.05),
and our degrees of freedom (df = 1). Entering the chi-square distribution table with 1 degree of freedom
and reading along the row, we find our value of $\chi^2$ (3.418) lies between 2.706 and 3.841. The
corresponding probability is between the 0.10 and 0.05 probability levels. That means that the p-value is
above 0.05 (it is actually 0.065). Since a p-value of 0.065 is greater than the conventionally accepted
significance level of 0.05 (i.e. p > 0.05), we fail to reject the null hypothesis. In other words, there is no
statistically significant difference in the proportion of animals whose heart rate increased.
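The statistic and p-value quoted for Example 2 can be checked without a table. This is a sketch under one stated assumption: with exactly 1 degree of freedom a chi-square variable is the square of a standard normal variable, so the right-tail probability equals erfc(sqrt(statistic / 2)). The helper name `p_value_df1` is hypothetical and the shortcut does not apply to other degrees of freedom.

```python
import math

# Observed counts from Example 2 (rows: Treated, Not treated)
observed = [[36, 14], [30, 25]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum of (O - E)^2 / E over all cells, with E = row total * column total / N
statistic = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)


def p_value_df1(stat):
    """Right-tail probability of a chi-square statistic; valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(stat / 2))


p_value = p_value_df1(statistic)
fail_to_reject = p_value > 0.05

print(round(statistic, 3))  # prints: 3.418
```

Run as written, the p-value comes out slightly above 0.06, in line with the value of about 0.065 quoted above, so the null hypothesis is not rejected at the 0.05 level.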
We can use the equation $\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$
Here $f_o$ denotes the frequency of the observed data and $f_e$ is the frequency of the expected values. The
general table would look something like the one below:
                Category I   Category II   Category III   Row Totals
Sample A            a             b              c          a+b+c
Sample B            d             e              f          d+e+f
Sample C            g             h              i          g+h+i
Column Totals     a+d+g         b+e+h          c+f+i        a+b+c+d+e+f+g+h+i=N
Now we need to calculate the expected values for each cell in the table, and we can do that using the
row total times the column total divided by the grand total (N). For example, for cell a the expected value
would be (a+b+c)(a+d+g)/N.
Once the expected values have been calculated for each cell, we can use the same procedure as before
for a simple 2 x 2 table.
Observed   Expected   |O - E|   (O - E)^2   (O - E)^2 / E
Suppose you have the following categorical data set: incidence of three types of malaria in three tropical
regions.
             Asia   Africa   South America   Totals
Malaria A     31      14          45            90
Malaria B      2       5          53            60
Malaria C     53      45           2           100
Totals        86      64         100           250
Thus, we would reject the null hypothesis that there is no relationship between location and type of
malaria. Our data tell us there is a relationship between type of malaria and location, but that's all it says.
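The arithmetic behind this conclusion is not shown in the text. As a worked sketch, applying the rule "row total times column total divided by N" to the malaria table gives the expected counts and statistic below. The 0.05 significance level is an assumption carried over from the earlier examples; for 4 degrees of freedom the table value at that level is 9.488.

```latex
% Expected counts E = (row total x column total) / N, with N = 250
\begin{array}{l|ccc}
                 & \text{Asia} & \text{Africa} & \text{South America} \\ \hline
\text{Malaria A} & 30.96 & 23.04 & 36.00 \\
\text{Malaria B} & 20.64 & 15.36 & 24.00 \\
\text{Malaria C} & 34.40 & 25.60 & 40.00
\end{array}

\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \approx 125.5,
\qquad df = (3 - 1)(3 - 1) = 4,
\qquad 125.5 > 9.488
```

The largest contributions come from the cells for Malaria B and Malaria C in South America, where the observed counts (53 and 2) are furthest from the expected counts (24 and 40).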
Formula Card
Chi-Square Tests
The expression "cx" is called the objective function, and the equations "Ax=b" are called the
constraints. The set of factors under the control of the manager is called the decision variables.
There are also a number of factors that limit or constrain what can be done. These may include,
for example, capacities of plant and equipment, limits on market demands, or processing and
delivery requirements.
A linear programming model, or LP model, is a particular type of mathematical model in which
the relationships involving the variables are linear, and in which there is a single performance
measure or objective. An advantage of this type of model is that there exists a mathematical
technique, called linear programming that can determine the best or optimal decision even when
there are thousands of variables and relationships.
Essential conditions
These are the conditions that must hold for linear programming to apply.
First, there must be limited resources, i.e. constraints (e.g., a limited number of workers,
machines, finances, and material); otherwise there would be no problem, nothing to be
solved.
Second, there must be an explicit, well-defined objective function (such as
maximize profit or minimize cost). When a single objective is to be maximized or
minimized, we can use linear programming. When multiple objectives exist, goal
programming is used.
Third, there must be linearity (if it takes three hours to make a part, then two parts
would take six hours, three parts would take nine hours, etc.). Other restrictions on the
nature of the problem may require that it be solved by other variations of the technique,
such as non-linear programming or dynamic programming.
Fourth, there must be homogeneity (the products are identical, or all the hours available
from a worker are equally productive).
Fifth is divisibility (products and resources can be subdivided into fractions). If this
subdivision is not possible (such as flying half an aeroplane or hiring one-fourth of a
person), a modification of linear programming, called integer programming, can be used.
Sixth, there must be alternative courses of action.
Seventh, decision variables should be interrelated and non-negative.
Linear programming does not allow for uncertainty in any of the relationships; there cannot be
any probabilities or any random variables. Time-dependent changes also cannot be incorporated
into a classical LP problem.
Since its development in the late 1940s, linear programming has been applied to a wide variety of
decision problems in business and the public sector.
Graphical Solution
It is useful to get familiar with the graphical method for solving LP problems first, not because
it is used in practice, but because it provides the best understanding of the model and its solution.
The graphical solution is based on a geometrical representation of the feasible region and the
objective function. In particular, the space to be considered is the n-dimensional space with each
dimension defined by one of the LP variables. The objective function is described in this
n-dimensional space by its contour plots, i.e., the sets of points that correspond to the same objective value.
To facilitate the visualization we restrict ourselves to the two-dimensional case, i.e., to LP models
with two decision variables (or possibly three variables in a 3D graphic).
The steps of the graphical method are as follows:
1. Formulate the Problem in Mathematical Terms Example: Product planning - LP model
As mentioned above there are
limited resources
an explicit objective function
the equations are linear
the resources are homogeneous
the decision variables are divisible and non-negative
2. Plot Constraint Equations
Every pair of values of the two variables can be plotted as a point with
co-ordinates in a 2-dimensional (planar) Cartesian system. The constraint equations are easily
plotted by letting one variable equal zero and solving for the axis intercept of the other. (The
inequality portions of the restrictions are disregarded for this step.)
3. Determine the Area of Feasibility
The direction of the inequality sign in each constraint determines the area of feasibility where a
feasible solution is found.
The feasible region of a 2-variable LP is depicted by the set of points whose co-ordinates
satisfy all LP constraints and the sign restrictions. If all constraints are expressed by linear
inequalities, then to geometrically characterize the feasible region we must first characterize the set
of points that constitute the solution space of each linear inequality. The LP feasible region
then results from the intersection of the solution spaces corresponding to each technological
constraint.
A feasible solution or feasible point satisfies all of the constraints and any restrictions on
the variables' values (e.g. non-negativity). The feasible set is the set of all feasible solutions.
The region of feasible solutions forms a convex polygon. If this condition of convexity does not
exist, the problem is either incorrectly set up or not amenable to linear programming.
4. Plot the Objective Function
The objective function may be plotted by assuming some arbitrary total profit figure and then
solving for the axis co-ordinates, as was done for the constraint equations. Other terms for the
objective function, when used in this context, are the iso-profit line or equal contribution line,
because it shows all possible production combinations for any given profit figure. All iso-profit
lines are parallel, so we can move the line in one direction to maximize the objective function or
in the other direction to minimize it.
5. Find the Optimum Point
It can be shown mathematically that the optimal combination of decision variables is always
found at an extreme point (corner point) of the convex polygon. The number of corner points is
limited (compare with unlimited number of points of the whole polygon representing the set of
feasible solutions). We can determine which one is the optimum by either of two approaches.
The first approach is to find the values of the various corner solutions algebraically. This entails
simultaneously solving the equations of various pairs of intersecting lines and substituting the
quantities of the resultant variables in the objective function.
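The algebraic corner-point approach can be sketched in a few lines of Python. The model below is hypothetical and is not one of the examples in this text: maximize 3x + 5y subject to x ≤ 4, 2y ≤ 12, 3x + 2y ≤ 18 and x, y ≥ 0. Each constraint is written as a·x + b·y ≤ c, every pair of boundary lines is solved by Cramer's rule, and only the intersections that satisfy all constraints are kept as corner points.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical model (an assumption for illustration): each tuple is (a, b, c) for a*x + b*y <= c.
# The last two rows encode the non-negativity conditions x >= 0 and y >= 0.
constraints = [(1, 0, 4), (0, 2, 12), (3, 2, 18), (-1, 0, 0), (0, -1, 0)]
objective = (3, 5)  # maximize 3x + 5y


def corner_points(constraints):
    """Intersect every pair of boundary lines and keep the feasible intersections."""
    points = set()
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundary lines never intersect in a single point
        x = Fraction(c1 * b2 - c2 * b1, det)  # Cramer's rule
        y = Fraction(a1 * c2 - a2 * c1, det)
        if all(a * x + b * y <= c for a, b, c in constraints):
            points.add((x, y))
    return points


points = corner_points(constraints)
# Substitute each corner point into the objective function and keep the best one
best_point = max(points, key=lambda p: objective[0] * p[0] + objective[1] * p[1])
best_value = objective[0] * best_point[0] + objective[1] * best_point[1]

print([int(v) for v in best_point], int(best_value))  # prints: [2, 6] 36
```

Exact fractions are used instead of floating-point numbers so that the feasibility check at a boundary is not affected by rounding. For a minimization problem, `max` would be replaced by `min`.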
The second and generally preferred approach entails using the objective function or iso-profit
line directly to find the optimum point. The procedure involves simply drawing a straight line
parallel to any arbitrarily selected initial iso-profit line so that it is as far as possible from
the origin of the graph while still touching the feasible region (in cost-minimization problems, the
objective would be to draw the line through the feasible point closest to the origin).
Note that all iso-profit lines have the same slope but different axis intercepts.
The profit lines of a linear programming model are always parallel.
The underlying idea is to keep "sliding" the iso-profit line in the direction of increasing value of
the objective function until it is about to leave the LP feasible region. The last extreme point
of the feasible region that it touches is the optimum point.
If there is a finite optimal solution for a linear programming model, an optimal solution
will be a corner point.
Not all examples lead to one optimal solution. It could happen that, when attempting to plot the feasible
region for some problem, we get no points in the feasible region, i.e. no points on the plane that satisfy
all constraints. The problem is then infeasible, also called an over-constrained problem.
If there is no point in the feasible set, there is no feasible solution of the linear
programming model.
One possible reason for no feasible solution is an error in transcribing the data or inputting the
model. Another type of error that is harder to detect is the inclusion of constraints that are each
reasonable but cannot all be satisfied together. For example, there may not be enough available
capacity to satisfy all of the requirements.
In the LP models considered above, the feasible region (if not empty) was a bounded area. For
this kind of problem it is obvious that all values of the LP objective function (and therefore the
optimal value) are bounded.
A bounded feasible set is one for which a finite number may be specified so that any variable
value, at any point in the feasible set, is less than or equal to that number.
Consider, however, the possibility that the feasible region is not bounded and the iso-profit line
can be shifted with no limits. An unbounded feasible set is a feasible set that is not bounded.
Only one variable has to be limitless for a feasible set to be unbounded. If feasible points
exist with objective function values as favorable as desired, the model is said to have an
unbounded optimal solution and the problem is called an unbounded problem (usually one extreme,
maximum or minimum, can be found; the problem arises if this is not the extreme we are searching for). The
practical interpretation of such a result is the same as in the previous case, i.e. there is no
optimal solution and the model creator must return to the real situation and check all constraints.
In the unbounded case there is probably some limitation missing, forgotten in the model
formulation.
The model can also have more than one optimal solution (alternative solutions). In this case the
iso-profit line is parallel to one of the constraint lines. The set of optimal points is then the set
of points lying between two polygon corner points on this line (contour line).
If two corner points are optimal, then all of the points on the line segment connecting them
are also optimal.
Note that in all examples of optimal solutions for a linear programming model, there is an optimal
solution at a corner point.
Graphical Method Exercises Solved in Linear Programming
EXAMPLE 1: A workshop has three (3) types of machines A, B and C; it can manufacture two
(2) products 1 and 2, and every product has to pass through each machine in the same
order: first machine A, then B and then C. The following table shows:
The hours needed at each machine, per product unit
The total available hours for each machine, per week
The profit of each product per unit sold
Formulate and solve using the graphical method a Linear Programming model for the previous
situation that allows the workshop to obtain maximum gains.
Decision Variables:
: Product 1 Units to be produced weekly
: Product 2 Units to be produced weekly
Objective Function:
Maximize
Constraints:
The constraints represent the number of hours available weekly for machines A, B and C,
respectively, and also incorporate the non-negativity conditions.
For the graphical solution of this model we will use the Graphic Linear Optimizer
(GLP) software. The green colored area corresponds to the set of feasible solutions and the level
curve of the objective function that passes by the optimal vertex is shown with a red dotted line.
EXAMPLE 2 : A winemaking company has recently acquired a 110 hectares piece of land. Due
to the quality of the sun and the region’s excellent climate, the entire production of Sauvignon
Blanc and Chardonnay grapes can be sold. You want to know how to plant each variety in the
110 hectares, given the costs, net profits and labor requirements according to the data shown
below:
Suppose that you have a budget of US$10,000 and an availability of 1,200 man-days during the
planning horizon. Formulate and solve graphically a Linear Programming model for this
problem. Clearly outline the domain of feasible solutions and the process used to find the optimal
solution and the optimal value.
Decision Variables:
: Hectares intended for growing Sauvignon Blanc
: Hectares intended for growing Chardonnay
Objective Function:
Maximize
Constraints:
Where the restrictions are associated with the maximum availability of hectares for planting,
available budget, man-days in the planting period and non-negativity, respectively.
The following graph shows the representation of the winemaking company problem. The shaded
area corresponds to the domain of feasible solutions, where the optimal basic feasible
solution is reached at vertex C, where the budget and man-days constraints are active. Thus
solving said equation system the coordinate of the optimal solution is found where
and (hectares). The optimal value is (dollars).
EXAMPLE 3
Nairobi Industries Limited manufactures three items A, B and C. The production requirements per
unit of these items and the capacities of the three departments are shown as follows:
                           Production requirements per unit     Capacity of the
Department                    A        B        C               departments (Hours)
Assembling                    2        3        4                    32,000
Painting                      1        2        3                    37,500
Finishing                     3        2        3                    29,000
Profit contribution ($)      20       50       40
Required – Construct the linear programming model that would maximize the profit contribution
from the given information.
Let x be the number of units of item A manufactured
Let y be the number of units of item B manufactured
Let ƶ be the number of units of item C manufactured
Maximize profit:
P = 20x + 50y + 40ƶ
Constraints:
Assembly: 2x + 3y + 4ƶ ≤ 32000
Painting: x + 2y + 3ƶ ≤ 37500
Finishing: 3x + 2y + 3ƶ ≤ 29000
Non-negativity: x, y, ƶ ≥ 0
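A model like this can be checked mechanically once it is written down. The sketch below encodes the coefficients of Example 3 and evaluates two production plans against them. The plans `plan_a` and `plan_b` are hypothetical values chosen only for illustration (neither is claimed to be optimal), the name `evaluate` is an assumption, and `z` stands for the third variable (item C).

```python
# Coefficients taken from Example 3; "z" stands for the units of item C.
PROFIT = {"x": 20, "y": 50, "z": 40}
CONSTRAINTS = {
    "Assembly": ({"x": 2, "y": 3, "z": 4}, 32000),
    "Painting": ({"x": 1, "y": 2, "z": 3}, 37500),
    "Finishing": ({"x": 3, "y": 2, "z": 3}, 29000),
}


def evaluate(plan):
    """Return (feasible, profit, hours used per department) for a production plan."""
    usage = {
        name: sum(coeffs[var] * plan[var] for var in plan)
        for name, (coeffs, _capacity) in CONSTRAINTS.items()
    }
    within_capacity = all(
        usage[name] <= capacity for name, (_coeffs, capacity) in CONSTRAINTS.items()
    )
    non_negative = all(quantity >= 0 for quantity in plan.values())
    profit = sum(PROFIT[var] * plan[var] for var in plan)
    return within_capacity and non_negative, profit, usage


# Hypothetical plans, used only to illustrate the check
plan_a = {"x": 4600, "y": 7600, "z": 0}
plan_b = {"x": 10000, "y": 0, "z": 0}

feasible_a, profit_a, usage_a = evaluate(plan_a)
feasible_b, profit_b, usage_b = evaluate(plan_b)
```

Here `plan_a` uses exactly the available assembling and finishing hours and is feasible, while `plan_b` needs 30,000 finishing hours and is therefore rejected. A solver would search over all feasible plans for the one with the largest profit.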
c) Shortage costs.
These are incurred as a result of the item not being in stock.
Examples:
i) Loss of goodwill; could lead to loss of customers.
ii) Contribution lost, due to not making a sale.
iii) Back order costs - these are costs of dealing with disappointed customers.
iv) Costs of idle resources, e.g. production personnel being paid when there is a
raw material missing.
v) Cost of having to speed up orders, e.g. personnel working overtime, or using a faster
(and hence more costly) transportation mode.
d) Purchase Costs:
This is what is paid to the supplier /seller by the buyer in exchange of the product.
Inventory is usually a large investment for many firms. It is normally the second largest item in
the balance sheet among the assets after fixed assets. Thus, inventory should only be held if
the benefits (service to customers) exceed the inventory costs. Also in inventory modeling,
purchase costs are a relevant factor to inventory policy due to availability of quantity discounts.
Thus, for inventories,
Total Cost (TC) = Purchase costs + Holding costs + Ordering costs + Shortage costs
The objective of any inventory management system or model is to minimize these total costs.
DETERMINISTIC INVENTORY CONTROL MODELS
There are generally two types of inventory management models:
Deterministic and
Stochastic
A brief comparison is given below:
Differences between deterministic and probabilistic models
      Deterministic                              Stochastic (probabilistic)
i)    Certainty model; factors are known with    Model to cope with uncertainty; factors are
      certainty and are usually constant         uncertain and are usually variable
ii)   Simple model                               More complex model
iii)  Not very realistic                         Reflects reality better than the deterministic model
EXAMPLE
Carpet Discount Store in North Georgia stocks carpet in its warehouse and sells it through an
adjoining showroom. The store keeps several brands and styles of carpet in stock; however, its
biggest seller is Super Shag carpet. The store wants to determine the optimal order size and total
inventory cost for this brand of carpet given an estimated annual demand of 10,000 yards of
carpet, an annual carrying cost of $0.75 per yard, and an ordering cost of $150. The store would
also like to know the number of orders that will be made annually and the time between orders
(i.e., the order cycle) given that the store is open every day except Sunday, Thanksgiving Day,
and Christmas Day (which is not on a Sunday).
SOLUTION:
Cc = $0.75 per yard
Co = $150
D = 10,000 yards
The total annual inventory cost is determined by substituting Qopt into the total cost formula:
The number of orders per year is computed as follows:
Given that the store is open 311 days annually (365 days minus 52 Sundays, Thanksgiving, and
Christmas), the order cycle is
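The formulas for this solution did not survive in the text above. The sketch below reproduces the four quantities under the assumption that the basic economic order quantity (EOQ) model is intended, with Qopt = sqrt(2·Co·D / Cc) and a total annual inventory cost made up of ordering cost plus carrying cost only. The variable names are illustrative.

```python
import math

# Data from the example
D = 10_000        # annual demand, yards
Co = 150          # ordering cost per order, dollars
Cc = 0.75         # annual carrying cost per yard, dollars
days_open = 311   # 365 days minus 52 Sundays, Thanksgiving and Christmas

# Assumed basic EOQ formulas (ordering cost + carrying cost only)
q_opt = math.sqrt(2 * Co * D / Cc)            # optimal order size, yards
total_cost = Co * D / q_opt + Cc * q_opt / 2  # total annual inventory cost, dollars
orders_per_year = D / q_opt                   # number of orders per year
order_cycle_days = days_open / orders_per_year  # store days between orders

print(q_opt, total_cost, orders_per_year, round(order_cycle_days, 1))
# prints: 2000.0 1500.0 5.0 62.2
```

At the optimal order size the annual ordering cost and the annual carrying cost are equal ($750 each), which is a characteristic property of the basic EOQ model.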
The factors that are responsible to bring about changes in a time series, also called the
components of time series, are as follows:
1. Secular Trend (or General Trend)
2. Seasonal Movements
3. Cyclical Movements
4. Irregular Fluctuations
Secular Trend:
The secular trend is the main component of a time series which results from long term effect of
socio-economic and political factors. This trend may show the growth or decline in a time series
over a long period. This is the type of tendency which continues to persist for a very long period.
Prices, export and imports data, for example, reflect obviously increasing tendencies over time.
Seasonal Movements:
These are short-term movements occurring in data due to seasonal factors. The short term is
generally considered as a period in which changes occur in a time series with variations in
weather or festivities. For example, it is commonly observed that the consumption of ice-cream
during summer is generally high, and hence the sales of an ice-cream dealer would be higher in some
months of the year and relatively lower during the winter months. Employment, output, exports, etc.
are subject to change due to variation in weather. Similarly, sales of garments, umbrellas,
greeting cards and fireworks are subject to large variation during festivals like Valentine’s
Day, Eid, Christmas, New Year, etc. These types of variation in a time series can be isolated only
when the series is provided biannually, quarterly or monthly.
Cyclical Movements:
These are long-term oscillations occurring in a time series. These oscillations are mostly observed
in economic data, and the periods of such oscillations generally extend from five to twelve
years or more. These oscillations are associated with the well-known business cycles. These cyclical
movements can be studied provided a long series of measurements, free from irregular
fluctuations, is available.
Irregular Fluctuations:
These are sudden changes occurring in a time series which are unlikely to be repeated. This is the
component of a time series which cannot be explained by trend, seasonal or cyclical movements. For
this reason these variations are sometimes called the residual or random component. These
variations, though accidental in nature, can cause a continual change in the trend, seasonal and
cyclical oscillations during the forthcoming period. Floods, fires, earthquakes, revolutions,
epidemics, strikes, etc. are the root causes of such irregularities.
EXAMPLE
The amount of cotton grown in the country during 2011 to 2014 is shown in the following table.
Cotton Grown (Kg, 000)
Year    Quarter I   Quarter II   Quarter III   Quarter IV
2011       25           22           24            27
2012       28           23           25            29
2013       30           25           27            31
2014       32           27           29            33
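The four-quarter moving totals, moving averages and centered moving averages are computed as follows:
Year  Quarter  Cotton  4-quarter  4-quarter  Centered  Actual minus
                       moving     moving     moving    centered
                       total      average    average   average
2011  I          25
      II         22
                          98       24.50
      III        24                           24.875     -0.875
                         101       25.25
      IV         27                           25.375      1.625
                         102       25.50
2012  I          28                           25.625      2.375
                         103       25.75
      II         23                           26.000     -3.000
                         105       26.25
      III        25                           26.500     -1.500
                         107       26.75
      IV         29                           27.000      2.000
                         109       27.25
2013  I          30                           27.500      2.500
                         111       27.75
      II         25                           28.000     -3.000
                         113       28.25
      III        27                           28.500     -1.500
                         115       28.75
      IV         31                           29.000      2.000
                         117       29.25
2014  I          32                           29.500      2.500
                         119       29.75
      II         27                           30.000     -3.000
                         121       30.25
      III        29
      IV         33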
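The figures in this table follow a fixed recipe: sum each run of four consecutive quarters, divide by four, then average each pair of neighbouring moving averages so that the result lines up with an actual quarter. A minimal sketch in Python is given below; the function name `centered_moving_average` is an assumption, and the centering step as written applies to an even period such as 4.

```python
# Quarterly cotton data from the example, 2011 Q1 to 2014 Q4 (Kg, 000)
cotton = [25, 22, 24, 27, 28, 23, 25, 29, 30, 25, 27, 31, 32, 27, 29, 33]


def centered_moving_average(values, period=4):
    """Return moving totals, moving averages and centered moving averages (even period)."""
    totals = [sum(values[i:i + period]) for i in range(len(values) - period + 1)]
    averages = [total / period for total in totals]
    # Each centered value is the mean of two neighbouring moving averages
    centered = [(a + b) / 2 for a, b in zip(averages, averages[1:])]
    return totals, averages, centered


totals, averages, centered = centered_moving_average(cotton)

# The first centered value belongs to the third observation (2011 Q3), i.e. index period // 2
offset = 4 // 2
deviations = [cotton[i + offset] - c for i, c in enumerate(centered)]

print(totals[0], averages[0], centered[1], deviations[1])  # prints: 98 24.5 25.375 1.625
```

The deviations (actual value minus centered moving average) repeat an almost identical pattern every four quarters, which is the seasonal movement left after the trend has been smoothed out.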
NETWORK DIAGRAMS AND SCHEDULE ANALYSIS
Network diagrams are schematic displays of project schedule activities and the
interdependencies between these activities. When developed properly, this graphical view of a
project’s activities conveys critical schedule characteristics required to effectively analyze and
adjust schedules – thus resulting in accurate and feasible schedules. This document addresses
what should be considered in the development of a network diagram, how network diagrams are
created, and how they may be analyzed to identify necessary corrective actions and ensure
optimal schedule definition.
Project Scheduling and Control Techniques
1. Critical Path Method (CPM)
2. Program Evaluation and Review Technique (PERT)
Project Network
• Network analysis is the general name given to certain specific techniques which can be used for
the planning, management and control of projects
• Use of nodes and arrows
Arrows
An arrow leads from tail to head directionally
– Indicates an ACTIVITY, a time-consuming effort that is required to perform a part of the
work.
Nodes
A node is represented by a circle
– Indicates an EVENT, a point in time where one or more activities start and/or finish
• Activity
– A task or a certain amount of work required in the project
– Requires time to complete
– Represented by an arrow
• Dummy Activity
– Indicates only precedence relationships
– Does not require any time or effort
• Event
– Signals the beginning or ending of an activity
– Designates a point in time
– Represented by a circle (node)
• Network
– Shows the sequential relationships among activities using nodes and arrows
Activity-on-node (AON): nodes represent activities, and arrows show precedence relationships
Activity-on-arrow (AOA): arrows represent activities, and nodes are events for points in time
Fig: A must finish before either B or C can start
Fig: both A and B must finish before C can start
Fig: both A and C must finish before either B or D can start
Fig: A must finish before B can start; both A and C must finish before D can start (uses a dummy activity)
Fig: example network with "Lay foundation" and "Order" activities and a dummy activity
Planning a project usually involves dividing it into a number of small tasks that can be assigned to
individuals or teams. The project’s schedule depends on the duration of these tasks and the sequence in
which they are arranged. This sequence can be driven by several factors: customer deadlines, availability
of personnel or resources, and dependencies among tasks.
DuPont developed the Critical Path Method (CPM) to address the challenge of shutting down
chemical plants for maintenance and then restarting the plants once the maintenance had been completed.
Complex projects require a series of activities, some of which must be performed sequentially and others
that can be performed in parallel with other activities.
This collection of series and parallel tasks can be modeled as a network.
CPM models the activities and events of a project as a network. Activities are shown as nodes on the
network, and events that signify the beginning or ending of activities are shown as arcs or lines between
the nodes.
The Figure below shows an example of a CPM network diagram:
All the activities in the project are listed. This list can be used as the basis for adding sequence and
duration information in later steps.
Some activities are dependent on the completion of other activities. A list of the immediate predecessors
of each activity is useful for constructing the CPM network diagram.
Once the activities and their sequences have been defined, the CPM diagram can be drawn. CPM
originally was developed as an activity on node network.
The critical path is the longest-duration path through the network. The significance of the critical path is
that the activities that lie on it cannot be delayed without delaying the project. Because of its impact on
the entire project, critical path analysis is an important aspect of project planning.
The critical path can be identified by determining the following four parameters for each activity:
• ES - earliest start time: the earliest time at which the activity can start given that its precedent activities
must be completed first.
• EF - earliest finish time, equal to the earliest start time for the activity plus the time required to complete
the activity.
• LF - latest finish time: the latest time at which the activity can be completed without delaying the
project.
• LS - latest start time, equal to the latest finish time minus the time required to complete the activity.
The slack time for an activity is the time between its earliest and latest start time, or between its earliest
and latest finish time. Slack is the amount of time that an activity can be delayed past its earliest start or
earliest finish without delaying the project.
The critical path is the path through the project network in which none of the activities have slack, that is,
the path for which ES=LS and EF=LF for all activities in the path. A delay in the critical path delays the
project. Similarly, to accelerate the project it is necessary to reduce the total time required for the
activities in the critical path.
As the project progresses, the actual task completion times will be known and the network diagram can be
updated to include this information.
A new critical path may emerge, and structural changes may be made in the network if project
requirements change.
CPM calculation
• Path
– A connected sequence of activities leading from the starting event to the ending event
• Critical Path
– The longest path (time); determines the project duration
• Critical Activities
– All of the activities that make up the critical path
Forward Pass
The forward pass goes from the initial task (the task with no predecessors) to the final task (the one with
no successors), visiting every task in every path and setting the ES and EF dates on the tasks.
The algorithm is similar to graph theory’s depth-first search, except that the forward pass follows every
path from initial to final task, while depth-first search stops when it arrives at a task that it’s already
visited. When the forward pass arrives at a task, it may change that task’s ES and EF dates, and that
change must be carried forward to the final task. During the forward pass, a task may be visited several
times as different paths through the network are followed. A task’s ES is determined by the predecessor
task with the latest EF, since a task can’t start until all of its predecessors have finished.
• Earliest Start Time (ES)
– earliest time an activity can start
– ES = maximum EF of immediate predecessors
• Earliest finish time (EF)
– earliest time an activity can finish
– earliest start time plus activity time: EF = ES + t
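The forward-pass rules (ES = maximum EF of immediate predecessors, EF = ES + t) can be sketched in Python. This is only an illustration under stated assumptions: the four-task network, its durations, and the function name `forward_pass` are hypothetical, and the tasks are processed in topological order so that each task is computed once instead of being revisited along every path.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def forward_pass(durations, predecessors):
    """Return (ES, EF) dicts using ES = max EF of predecessors, EF = ES + t."""
    es, ef = {}, {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(predecessors).static_order():
        es[task] = max((ef[p] for p in predecessors[task]), default=0)
        ef[task] = es[task] + durations[task]
    return es, ef


# Hypothetical network: A precedes B and C, which both precede D.
durations = {"A": 3, "B": 2, "C": 4, "D": 1}
predecessors = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

es, ef = forward_pass(durations, predecessors)
print(es)  # earliest start of each task
print(ef)  # earliest finish of each task
```

In this sketch D cannot start until both B (EF = 5) and C (EF = 7) have finished, so its ES is 7 and the project's earliest finish is 8.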
Backward Pass
The backward pass goes from the final task to the initial task, visiting every task in every path and setting
the LS and LF dates on the tasks.
It’s similar to the forward pass in that when it arrives at a task, it may change that task’s LS and LF dates, and
that change must be carried back to the initial task. The difference is:
• The forward pass sets a task’s ES to the latest EF among its predecessors.
• The backward pass sets a task’s LF to the earliest LS among its successors.
The reason for the backward pass’s rule for setting LF is not as obvious as the forward pass’s rule. Any
start date, ES or LS, must be no earlier than the corresponding finish dates of all of the task’s predecessors. To
maintain this consistency, the backward pass must set a task’s LF to a value that’s no later than the LS of
any of the task’s successors.
• Latest start time (LS)
– latest time an activity can start without delaying critical path time
– LS = LF - t
• Latest finish time (LF)
– latest time an activity can be completed without delaying critical path time
– LF = minimum LS of immediate successors
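The backward-pass rules (LS = LF - t, with LF taken as the minimum LS of a task's immediate successors) can be sketched the same way. The network, the durations, the project duration of 8, and the function name `backward_pass` are all hypothetical values chosen for illustration; the project duration is assumed to be the EF of the final task from a forward pass.

```python
def backward_pass(durations, successors, project_duration, order):
    """Return (LS, LF) dicts using LF = min LS of successors, LS = LF - t.

    `order` lists the tasks so that every task appears after its predecessors;
    it is walked in reverse so each successor is handled before the task.
    """
    ls, lf = {}, {}
    for task in reversed(order):
        # A final task (no successors) may finish as late as the project end.
        lf[task] = min((ls[s] for s in successors[task]), default=project_duration)
        ls[task] = lf[task] - durations[task]
    return ls, lf


# Hypothetical network: A precedes B and C, which both precede D.
durations = {"A": 3, "B": 2, "C": 4, "D": 1}
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
project_duration = 8  # assumed EF of the final task D from a forward pass

ls, lf = backward_pass(durations, successors, project_duration, ["A", "B", "C", "D"])
print(ls)  # latest start of each task
print(lf)  # latest finish of each task
```

Here A has two successors with LS values of 5 (B) and 3 (C), so its LF is the smaller of the two, 3.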
CPM analysis
• Draw the CPM network
• Analyze the paths through the network
• Determine the float for each activity
– Compute the activity’s float
float = LS - ES = LF - EF
– Float is the maximum amount of time that this activity can be delayed in its completion
before it becomes a critical activity, i.e., delays completion of the project
• Find the critical path: the sequence of activities and events where there is no “slack”, i.e.,
zero slack
– Longest path through a network
• Find the project duration: the minimum project completion time
The Program Evaluation and Review Technique (PERT) is a network model that allows for
randomness in activity completion times. PERT was developed in the late 1950s for the U.S. Navy's
Polaris project, which had thousands of contractors. It has the potential to reduce both the time and cost
required to complete a project.
PERT is typically represented as an activity-on-arc network, in which the activities are represented on the
lines and milestones on the nodes. Figure 2-1 shows a simple example of a PERT diagram.
• Optimistic time (OT) - generally the shortest time in which the activity can be completed. (This is what
an inexperienced manager believes!)
• Most likely time (MT) - the completion time having the highest probability. This is different from
expected time. Seasoned managers have an amazing way of estimating very close to actual data from
prior estimation errors.
• Pessimistic time (PT) - the longest time that an activity might require.
The expected time for each activity can be approximated using the following weighted average:
Expected time = (OT + 4 × MT + PT) / 6
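As a small worked illustration, the weighted average commonly used in PERT gives the most likely time four times the weight of the other two estimates. The sketch below assumes that formula, and also the conventional PERT variance ((PT - OT) / 6) squared, which is not stated in the text above. The estimates of 2, 4, and 12 weeks and the function names are hypothetical.

```python
def expected_time(ot, mt, pt):
    """Weighted average of the three estimates: (OT + 4*MT + PT) / 6."""
    return (ot + 4 * mt + pt) / 6


def time_variance(ot, pt):
    """Conventional PERT variance, ((PT - OT) / 6) ** 2 (an assumption here)."""
    return ((pt - ot) / 6) ** 2


# Hypothetical estimates, in weeks, for a single activity.
ot, mt, pt = 2, 4, 12
print(expected_time(ot, mt, pt))  # (2 + 16 + 12) / 6 = 5.0
print(time_variance(ot, pt))      # (10 / 6) ** 2, roughly 2.78
```

Note that the expected time (5 weeks) differs from the most likely time (4 weeks) because the pessimistic estimate lies much further from the mode than the optimistic one.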
If activities outside the critical path speed up or slow down (within limits), the total project time does not
change. The amount of time that a non-critical path activity can be delayed without delaying the project is
referred to as slack time.
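If the critical path is not immediately obvious, it may be helpful to determine the following four quantities
for each activity:
• ES - earliest start time
• EF - earliest finish time
• LS - latest start time
• LF - latest finish time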
These times are calculated using the expected time for the relevant activities. The ES and EF of each
activity are determined by working forward through the network and determining the earliest time at
which an activity can start and finish considering its predecessor activities.
The latest start and finish times are the latest times that an activity can start and finish without delaying
the project. LS and LF are found by working backward through the network. The difference in the latest
and earliest finish of each activity is that activity's slack.
The critical path then is the path through the network in which none of the activities have slack.
The variance in the project completion time can be calculated by summing the variances in the
completion times of the activities in the critical path. Given this variance, one can calculate the
probability that the project will be completed by a certain date.
Since the critical path determines the completion date of the project, the project can be accelerated by
adding the resources required to decrease the time for the activities in the critical path. Such a shortening
of the project sometimes is referred to as project crashing.
PERT analysis
PERT is based on the assumption that an activity’s duration follows a probability distribution
instead of being a single value.
Three time estimates are required to compute the parameters of an activity’s duration distribution:
the optimistic time (OT), the most likely time (MT), and the pessimistic time (PT).
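Probability computation
Determine the probability that the project is completed within a specified time:
Z = (x - μ) / σ
where μ = tp = project mean time
σ = project standard deviation
x = (proposed) specified time
[Figure: normal curve of project time, marking the mean μ = tp, the specified time x, and the corresponding probability and Z.]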
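The probability computation can be sketched in Python, assuming the project completion time is approximately normal, with the project variance obtained by summing the variances of the critical-path activities. The figures (a mean of 25 weeks, three variances summing to 4, a target of 27 weeks) and the function name `completion_probability` are hypothetical.

```python
import math


def completion_probability(x, mean, variances):
    """P(project finishes by time x) under a normal approximation.

    `mean` is the expected length of the critical path and `variances`
    holds the variance of each activity on that path.
    """
    sigma = math.sqrt(sum(variances))  # project standard deviation
    z = (x - mean) / sigma
    # Standard normal CDF expressed through the error function.
    return z, 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Hypothetical project: mean 25 weeks, critical-path variances summing to 4.
z, p = completion_probability(x=27, mean=25, variances=[1.0, 2.0, 1.0])
print(round(z, 2), round(p, 4))  # Z = 1.0, probability about 0.8413
```

Using the error function avoids a lookup in a printed normal table; the result is the same area under the curve to the left of Z.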
PROJECT COST
Cost consideration in project
Project managers may have the option or requirement to crash the project, or accelerate the
completion of the project.
This is accomplished by reducing the length of the critical path(s).
The length of the critical path is reduced by reducing the duration of the activities on the critical
path.
If each activity requires the expenditure of an amount of money to reduce its duration by one unit of
time, then the project manager selects the least cost critical activity, reduces it by one time unit,
and traces that change through the remainder of the network.
As a result of a reduction in an activity’s time, a new critical path may be created.
When there is more than one critical path, each of the critical paths must be reduced.
If the length of the project needs to be reduced further, the process is repeated.
Project Crashing
• Crashing
– reducing project time by expending additional resources
• Crash time
– an amount of time an activity is reduced
• Crash cost
– cost of reducing activity time
• Goal
– reduce project duration at minimum cost
Activity crashing
[Figure: Time-Cost Relationship. Activity cost is plotted against activity time; a line joins the crashing activity (crash time, crash cost) to the normal activity (normal time, normal cost), and its slope is the crash cost per unit of time.]
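The slope in the time-cost relationship can be computed directly if the relationship between the normal point and the crash point is assumed to be a straight line. The sketch below makes that assumption; the activity figures and the function name `crash_cost_per_unit` are hypothetical.

```python
def crash_cost_per_unit(normal_time, normal_cost, crash_time, crash_cost):
    """Slope of the time-cost line: extra cost per unit of time saved."""
    return (crash_cost - normal_cost) / (normal_time - crash_time)


# Hypothetical activity: 12 weeks at a cost of 3000 normally,
# or 7 weeks at a cost of 5000 when fully crashed.
slope = crash_cost_per_unit(normal_time=12, normal_cost=3000,
                            crash_time=7, crash_cost=5000)
print(slope)  # 2000 extra cost over 5 weeks saved = 400.0 per week
```

Comparing this figure across the critical activities is what lets the project manager pick the least-cost activity to crash first.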
Crashing costs increase as project duration decreases
Indirect costs increase as project duration increases
Reduce project length as long as crashing costs are less than indirect costs
[Figure: Time-Cost Tradeoff. Direct cost plotted against time.]
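The rule "reduce project length as long as crashing costs are less than indirect costs" can be sketched as a simple loop. This is a simplification under stated assumptions: the cost of removing each successive time unit is given as a list that never decreases (the cheapest options are used first), and the indirect cost is a constant amount per time unit. All figures and the function name `units_worth_crashing` are hypothetical.

```python
def units_worth_crashing(marginal_crash_costs, indirect_cost_per_unit):
    """Count how many time units to remove from the project.

    `marginal_crash_costs[i]` is the cost of removing the (i+1)-th time unit,
    assumed to be non-decreasing because the cheapest options are used first.
    """
    units = 0
    for cost in marginal_crash_costs:
        if cost >= indirect_cost_per_unit:
            break  # further crashing costs more than it saves
        units += 1
    return units


# Hypothetical figures: each week saved avoids 600 of indirect cost.
costs = [200, 400, 500, 700, 3000]
units = units_worth_crashing(costs, indirect_cost_per_unit=600)
saving = sum(600 - c for c in costs[:units])
print(units, saving)  # 3 weeks are worth crashing, for a net saving of 700
```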
PERT/CPM Chart
A project has been defined to contain the following list of activities along with their required times for
completion
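[Figure: network diagram of activities 1 to 8 labelled with earliest expected completion times: TE = 11 (activity 2), TE = 12 (activity 3), TE = 14 (activity 4), TE = 20 (activity 5), TE = 19 (activity 6), TE = 22 (activity 7), TE = 23 (activity 8).]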
Using information from the table, indicate expected completion time for each activity.
[Figure: network diagram of activities 1 to 8 showing the expected completion time of each activity; TE = 5 for activity 1.]
Calculate earliest expected completion time for each activity (TE) and the entire project.
The earliest expected completion time for a given activity is determined by summing the expected
completion time of this activity and the earliest expected completion time of the immediate predecessor.
Rule: if two or more activities precede an activity, the one with the largest TE is used in calculation (e.g.,
for activity 4, we will use TE of activity 3 but not 2 since 12 > 11).
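The TE rule can be checked with a short Python sketch. The activity times and predecessor links below are assumptions chosen to be consistent with the TE values quoted in this example (for instance TE = 12 for activity 3 and TE = 11 for activity 2); they are not taken from the project's activity table.

```python
# Assumed activity times and immediate predecessors (illustrative only).
times = {1: 5, 2: 6, 3: 7, 4: 2, 5: 6, 6: 5, 7: 3, 8: 1}
predecessors = {1: [], 2: [1], 3: [1], 4: [2, 3], 5: [4], 6: [4], 7: [6], 8: [5, 7]}

te = {}
for activity in sorted(times):  # activities are numbered in precedence order
    # Rule: when several activities precede this one, use the largest TE.
    latest_predecessor = max((te[p] for p in predecessors[activity]), default=0)
    te[activity] = latest_predecessor + times[activity]

print(te)  # TE of activity 4 is 12 + 2 = 14; the project finishes at TE = 23
```

Under these assumptions activity 4 takes its start from activity 3 (TE = 12) and not activity 2 (TE = 11), which is the case described in the rule.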
[Figure: network diagram with nodes 1 to 7 and activities A to K.]
Activity   ES   EF   LS   LF   Slack
A           0    6    0    6    0 *
B           0    4    5    9    5
C           6    9    6    9    0 *
D           6   11   15   20    9
E           6    7   12   13    6
F           9   13    9   13    0 *
G           9   11   16   18    7
H          13   19   14   20    1
I          13   18   13   18    0 *
J          19   22   20   23    1
K          18   23   18   23    0 *
* critical activity
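The slack column and the critical activities can be recomputed from the ES, EF, LS, and LF values in the table. The sketch below uses only the table's figures; the variable names are illustrative.

```python
# (ES, EF, LS, LF) for each activity, copied from the table above.
schedule = {
    "A": (0, 6, 0, 6), "B": (0, 4, 5, 9), "C": (6, 9, 6, 9),
    "D": (6, 11, 15, 20), "E": (6, 7, 12, 13), "F": (9, 13, 9, 13),
    "G": (9, 11, 16, 18), "H": (13, 19, 14, 20), "I": (13, 18, 13, 18),
    "J": (19, 22, 20, 23), "K": (18, 23, 18, 23),
}

slack = {}
for name, (es, ef, ls, lf) in schedule.items():
    assert ls - es == lf - ef  # both definitions of slack must agree
    slack[name] = ls - es

# Critical activities are those with zero slack.
critical = [name for name, s in slack.items() if s == 0]
project_duration = max(lf for _, _, _, lf in schedule.values())
print(slack)
print(critical, project_duration)  # ['A', 'C', 'F', 'I', 'K'] 23
```

The zero-slack activities A, C, F, I, and K form the critical path, and their durations (6 + 3 + 4 + 5 + 5) add up to the project duration of 23.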
[Figure: normal curve with areas .5000 and .2612 marked relative to time x.]
PERT/COST
[Figure: two versions of a project network annotated with activity times and costs in R; the first has project duration = 36 and the second has project duration = 31.]
Summary
Program evaluation and review technique (PERT) charts depict task, duration, and dependency
information. Each chart starts with an initiation node which is the first task. Each task is represented by a
node (Activity on Node Network – AON) with lines connecting dependent tasks.
Each task is connected to its successor tasks in this manner forming a network of nodes and connecting
lines. The chart is complete when all final tasks come together at the completion node. When slack time
exists between the end of one task and the start of another, the usual method is to draw a broken or dotted
line between the end of the first task and the start of the next dependent task.
PERT charts are usually drawn on ruled paper with the horizontal axis indicating time period divisions in
days, weeks, months, and so on. Many PERT charts terminate at the major review points, such as at the
end of the analysis.
Critical Path Method (CPM) charts are similar to PERT charts and are sometimes known as PERT/CPM.
In a CPM chart, the critical path is indicated. A critical path consists of the set of dependent tasks (each
dependent on the preceding one) that together take the longest time to complete. Tasks which fall on
the critical path should be noted in some way, so that they may be given special attention. One way is to
draw critical path tasks with a double line instead of a single line.
Tasks which fall on the critical path should receive special attention by both the project manager and the
personnel assigned to them.
The critical path for any given method may shift as the project progresses; this can happen when tasks are
completed either behind or ahead of schedule, causing other tasks which may still be on-schedule to fall
on the new critical path.
Critical path computations are quite simple, yet they provide valuable information that simplifies the
scheduling of complex projects. The result is that PERT-CPM techniques enjoy tremendous popularity
among practitioners in the field. The usefulness of the techniques is further enhanced by the availability
of specialized computer systems for executing, analyzing, and controlling network projects.
CPM Benefits
While CPM is easy to understand and use, it does not consider the time variations that can have a great
impact on the completion time of a complex project. CPM was developed for complex but fairly routine
projects with minimum uncertainty in the project completion times.
For less routine projects there is more uncertainty in the completion times, and this uncertainty limits the
usefulness of CPM.
Benefits of PERT
PERT is useful because it provides the following information:
Limitations of PERT
The following are some of PERT's limitations:
• The activity time estimates are somewhat subjective and depend on judgment. In cases where there is
little experience in performing an activity, the numbers may be only a guess. In other cases, if the person
or group performing the activity estimates the time there may be bias in the estimate.
• The underestimation of the project completion time due to alternate paths becoming critical is perhaps
the most serious limitation.