Research is the means by which the frontiers of human knowledge are expanded. In geography and environmental studies, for instance, it should aim at sustaining the environment so as to promote well-being and the betterment of the human and physical environment. The aim of any good research is to advance the frontiers of human knowledge and add valid information to what is already known.
Mesay Mulugeta, 2009 1
Research can therefore be carried out in any field where there is a need to expand the horizon of human knowledge and bring about significant change for the betterment of human affairs.
At this juncture one may raise at least two questions: What is science? What is the relationship between science and research? Nachmias and Nachmias (1996: 3) answer these questions as follows: the word science is derived from the Latin scire, meaning "to know." Science is difficult to define primarily because people often confuse the content of science with its methodology. Science has no particular subject matter of its own, but a distinct methodology.
Science, in this sense, refers to any systematic and highly skilled means of acquiring knowledge. Research is a scientific method or a technique for investigating phenomena and acquiring new knowledge. To be termed scientific, a method of inquiry must be based on gathering observable, empirical, and measurable data subject to specific principles of reasoning.
If so, when did humans begin to search for truth? Or when did humans start doing research? One may answer that humans started searching for truth when they began to find fruits and animals for survival (hunting and gathering) millions of years back. Others may respond that humans started research when they began to select the most appropriate animals and plants for domestication. Still others may say it was when humans began to explore the world, or when they started to identify the locations of the most important minerals for industrial purposes. Different individuals may thus give different answers to these questions, yet none of them would be wrong: throughout history humans have tried to grasp knowledge in various ways. One of these ways, and the most recent, is scientific inquiry. Though the scientific method is not the only means of knowing, science has been helping humans to understand their environment and themselves. Science is therefore the best instrument for grasping knowledge through research, which involves observation, identification, description, experimental investigation and theoretical explanation of any phenomenon that occurs in nature.
The above paragraph briefly explains that research is one of the longest-standing human activities. One should also bear in mind that any systematic and justifiable method of inquiry into the human and physical environment is a science. Research in the social sciences is therefore scientific inquiry to the extent that these fields are founded on scientific methodology, rigorous data analysis and systematic observation. The one real difficulty in the human sciences is that the uniformity of nature is not a reasonable assumption in the world of human beings and their characteristics. This is mainly because of the complex nature of human beings, for whom developing sound theories is much more difficult than in the physical world. It also involves the environment, which is equally dynamic, so any investigation pertaining to humans or any living being cannot be treated in isolation.
Regarding this, Best and Kahn (2005) explain that research on human subjects is difficult mainly because of the following aspects of human nature:
1. No two persons are alike in feelings, drives and emotions. For instance, an event that extremely delights one individual may irritate another, and a method used to approach one interviewee may not work with other respondents.
2. No one person is completely consistent from one moment to another. Human behavior is influenced by the interaction of the individual with every changing element in his or her environment.
3. Human beings are influenced by the research process itself. Unlike other animals such as mice, they are affected by the attention that is focused on them while under investigation.
Because of these factors, some scholars in the applied sciences are less confident about scientific inquiry into the nonphysical aspects of our world. Hence, they recommend applying scientific methods with greater vigor and imagination in the social sciences. It is believed that the development of scientific inquiry in the social sciences and its application to human affairs may offer the best solutions to some of our greatest present and future challenges, such as peace and security, human rights violations, global warming, food insecurity, polar ice melting and global economic recession.
revolution is that it led to a shift from a descriptive (idiographic) geography to an empirical, law-making (nomothetic) geography. This is mainly because modern geography is an all-encompassing discipline that foremost seeks to understand the earth and all of its human and natural complexities in a more scientific manner. It studies not merely where objects are, but how they are interrelated, why they are there and the socio-economic value of their being there. In the early 1950s there was a growing sense that the existing paradigm for geographical research was not adequate to explain how physical, economic, social and political processes are spatially organized or ecologically related. A more abstract, theoretical approach to geographical research emerged, evolving the analytical method of inquiry. The logical structure of the quantitative research approach is shown below.

Figure 1.1: The logical structure of the quantitative research process
Theory or research problem --(Deduction)--> Hypothesis --(Operationalization)--> Observations/Data Collection --(Data Processing)--> Data Analysis --(Interpretation)--> Findings --(Induction)--> back to Theory
Quantitative research in the social sciences is, therefore, a set of quantitative techniques that allow researchers to answer research questions in the discipline. These methods and techniques specialize in quantities in the sense that numbers come to represent variables like altitude, income, rainfall, temperature, dietary energy, body weight and age. The interpretation of the numbers is viewed as strong scientific evidence of how a phenomenon works. Quantities are so predominant in the quantitative research approach that statistical tools and packages are essential elements of the researcher's toolkit. Sources of data matter less in identifying an approach as quantitative than the fact that empirically derived numbers lie at the core of the scientific evidence assembled. A quantitative researcher may use archival data or gather data through different tools such as interviews, questionnaires, measurements and personal observation. In all cases, the researcher is motivated by the numerical outputs and how to derive meaning from them. As indicated in the figure above (Fig. 1.1), the quantitative research process consists of at least five stages: theory, hypothesis formulation, observation or data collection, data analysis, and findings or generalization. The figure illustrates that the research process is cyclic in nature. It starts with a theory or research problem and ends with research findings or empirical generalization; the end of one research cycle is the beginning of the next. This cyclic process continues indefinitely, reflecting the process of scientific investigation, and it opens the door to self-correction in such a way that scientific investigators test the generalizations, hypotheses and findings of research problems logically and empirically.
Concept
Several writers define the term concept simply as an abstract notion or idea, something that isn't concrete. It is an abstract summary of characteristics that we see as having something in common.
Concepts are created by people for the purposes of communication and efficiency. Therefore, as an educator or researcher you would be expected to review the existing range of definitions of the term concept and decide which you are going to use.
Theory
As with the term concept discussed above, there are many definitions of theory in the literature and electronic media. A substantial one is given in the ENCARTA World English Dictionary, which defines a theory as "a set of facts, propositions, or principles analyzed in their relation to one another and used, especially in science, to explain phenomena. It is a set of hypotheses or principles linked by logical or mathematical arguments which is advanced to explain an area of empirical reality or a type of phenomenon." A theory, therefore, includes a set of basic assumptions and axioms as its foundation, and the body of the theory is composed of logically interrelated and empirically verifiable propositions.
Researchers use theory in a quantitative study to provide an explanation or prediction about the relationship among variables in the study. A theory explains how the variables are related, acting as a bridge between or among the variables.
Theories exist in different social science disciplines such as economics, psychology and sociology. As stated in Creswell (2003), in quantitative research the hypotheses and research questions are often based on theories that the researcher seeks to test. Creswell (2003) cites Kerlinger (1979), who defined theory as "a set of interrelated constructs (variables), definitions, and propositions that presents a systematic view of phenomena by specifying relations among variables." In this definition a theory is an interrelated set of constructs (variables) formed into propositions or hypotheses that specify the relationships among variables, typically in terms of magnitude or direction. The systematic view may be an argument, a discussion or a rationale, and it helps to explain or predict phenomena that occur in the world. Why would an independent variable, X, influence or affect a dependent variable, Y? The theory provides the explanation for this expectation or prediction.
Theories develop when researchers test a prediction many times. When investigators test hypotheses over and over in different settings and with different populations, a theory emerges and someone gives it a name. Thus, a theory develops as an explanation that advances knowledge in a particular field.
Another aspect of a theory is that it varies in breadth of coverage. Theories can be classified into micro-level, meso-level and macro-level. Micro-level theories provide explanations limited to small slices of time, space and variables, while meso-level theories integrate micro-level theories of organizations, social movements or communities. Macro-level theories explain larger aggregates such as social institutions, cultural systems and whole societies. As stated by Creswell (2003), the following procedure presents a model for writing a quantitative theoretical perspective section into a research plan: review the related literature; find out what theories were used by other investigators in your area of study; ask why the independent variable(s) would affect the dependent variable(s); and script out the theory section. Hence, a researcher must read widely in related research works and books before building a theory or hypotheses for his or her research.
Model
The ENCARTA World English Dictionary defines a model as "an interpretation of a theory arrived at by assigning referents in such a way as to make the theory true. It is a simplified version of something complex used in analyzing and solving problems or making predictions." A model may be an analogue model, for instance a globe as a model of the earth, or a symbolic model, which is based on logic and the inter-relationships between concepts and is usually expressed mathematically or algebraically. Symbolic models are concerned with quantification. For instance, the regression model below is used to quantify an estimated or predicted value of a data set, and the correlation model helps us to analyze the strength and direction of a linear relationship between an independent and a dependent variable. There are many such symbolic models in the fields of human and natural sciences. Examples of symbolic models:
Regression model (multiple linear regression):
Y = a + b1X1 + b2X2 + ... + bnXn

Correlation model (Pearson's product-moment correlation coefficient):
r = [NΣXY − (ΣX)(ΣY)] / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}
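As an illustration, the two symbolic models above can be computed directly from data. The sketch below (plain Python) implements the least-squares regression coefficients and Pearson's correlation coefficient from the summation formulas; the altitude/temperature figures are hypothetical, chosen only to demonstrate the calculation.

```python
import math

def simple_linear_regression(x, y):
    """Least-squares fit Y = a + bX using the summation formulas:
    b = [N*Sxy - Sx*Sy] / [N*Sxx - Sx^2],  a = mean(y) - b*mean(x)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

def pearson_r(x, y):
    """Pearson correlation coefficient from the same summations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Hypothetical data: altitude (m) against mean annual temperature (deg C)
altitude = [500, 1000, 1500, 2000, 2500]
temp = [25.0, 22.0, 19.0, 16.0, 13.0]
a, b = simple_linear_regression(altitude, temp)
r = pearson_r(altitude, temp)
```

With these perfectly linear figures the fit is exact: intercept 28, slope −0.006 (temperature falls 0.6 °C per 100 m), and r = −1, a perfect negative linear relationship.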
There is also what we call a conceptual model. A conceptual model is composed of a pattern of interrelated concepts not expressed in mathematical form and primarily not concerned with quantification. Maps, graphs, charts, balance sheets, circuit diagrams and flowcharts are often used to represent such models.
Law
Another word associated with concepts, theories and models is law. A law in research is a precise statement of a relationship among facts that has been repeatedly corroborated by scientific investigation and is generally accepted as accurate by experts in the field. Laws are generally derived from a theory. A law is frequently referred to as a universal and predictive statement: universal in the sense that the stated relationship is held always to occur under the specified conditions, and predictive in that outcomes can be expected to follow from those conditions.
2. Specify the hypothesis.
3. Use only a research question or a hypothesis, not both, to eliminate redundancy.
4. A hypothesis may be stated in either null or alternative form. The null hypothesis (H0) is a statistical hypothesis stating that there is no difference between observed and expected data. The alternative hypothesis (H1) predicts the outcome for the population of the study. It can be directional (using words like higher, less, better, etc.) or non-directional, formulated when a researcher cannot predict a direction from past literature.

1.6. Merits and Demerits of Quantitative Methods

Merits:
- Examines the relationships between and among variables critically
- Answers research questions through surveys and experiments
- Provides measures or observations to test theories and hypotheses
- Leads to meaningful interpretation of quantitative data
- Provides more empirical data analysis techniques than qualitative methods
- Seems more valid and reliable
- Is relatively free of the motivations, feelings, opinions and attitudes of the individuals carrying out the research and of those participating in it

Demerits:
- Collects a much narrower and sometimes superficial dataset
- Results are limited, as they provide numerical descriptions rather than detailed narrative, and generally give less elaborate accounts of human perception
- Is often carried out in an unnatural, artificial environment so that a level of control can be applied to the exercise
- Preset answers will not necessarily reflect how people really feel about a subject and in some cases may just be the closest match
- Overlooks the motivations, feelings, opinions and attitudes of the individuals carrying out the research and of those participating in it
statistics to make judgments about the probability that an observed difference between groups is a dependable one, or one that might have happened by chance in the study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what is going on in our data. For example, let's say we have data on the incomes of 20 instructors at Adama University. These data can be summarized by finding the average income of the 20 instructors, and we could describe how far each income is above or below the average. We could also go into Excel or SPSS and construct a table with the data, or make a pie chart or bar chart, or perhaps a frequency distribution of the number or proportion of instructors in each class or range. This is descriptive statistics. Now, if this group is representative of the whole university, we could estimate and test hypotheses that generalize these 20 instructors' average income to the university as a whole. These conclusions will be subject to some error, and we could even quantify the probability of error. We are now inferring, so this is inferential statistics.
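The descriptive/inferential distinction can be sketched in a few lines of Python. The 20 incomes below are hypothetical, and the 95% interval uses the normal z value of 1.96 for simplicity (a t value would be slightly wider for n = 20):

```python
import math
import statistics

# Hypothetical monthly incomes (birr) of 20 instructors
incomes = [2100, 2300, 2250, 1900, 2500, 2400, 2600, 2150, 2050, 2350,
           2450, 2200, 1950, 2700, 2550, 2000, 2300, 2400, 2250, 2100]

# Descriptive statistics: summarize the data at hand
mean = statistics.mean(incomes)
stdev = statistics.stdev(incomes)        # sample standard deviation

# Inferential statistics: estimate the university-wide mean income
# with an approximate 95% confidence interval around the sample mean
se = stdev / math.sqrt(len(incomes))     # standard error of the mean
ci = (mean - 1.96 * se, mean + 1.96 * se)
```

The first two lines of calculation only describe the 20 observations; the interval `ci` is an inference about the wider population, and it comes with a quantified chance of error.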
2.1. Introduction
Quantitative research usually employs surveying or measuring to collect data. It is research that depends upon quantities or quantifying variables. In order to collect quantified data, this research approach usually applies at least one of the sampling techniques discussed in detail in this unit. Sampling is a statistical practice concerned with the selection of some representative part of a population or universe, intended to yield generalized characteristics of the population while minimizing the computational work. Socioeconomic variables such as yield per hectare, landholding, daily income, number of oxen per rural household, origin of immigrants and reasons for migration, family size, rainfall and other related data for a very large entity can be obtained through sampling and used for statistical inference. Let's begin by covering some of the key terms in sampling: population, sample, sampling frame, sampling error, non-sampling error, parameter and statistic.
2.2. Population/Universe
In statistics, the term population (or universe) is used in a different sense from its literary one. A population is the entire collection of people, animals, plants or things from which we may select sample data. It is the entire group we are interested in and wish to describe or draw conclusions about. For instance, if you want to study the food security status of farm households in a woreda with 1200 farm households, you may contact only 8% of the total, which is 96 households. Here the 1200 farm households in the woreda are the population and the 96 are your samples.
In order to make any generalizations about a population, a sample, that is meant to be representative of the population, is often studied. A sample statistic gives information about the corresponding population parameter. For example, it is assumed that a sample mean for a set of data would give information about the overall population mean.
2.3. Samples
A sample is a group of items selected from a larger group (the population or universe) for statistical analysis. The sample is assumed to be as representative of the general population as possible. By studying the sample it is hoped to draw valid conclusions about the larger group. A sample is generally selected for study because the population may be too large to study in its entirety, or because it may be too costly and time consuming to deal with every member of the population under consideration. Before selecting the samples, it is important that the researcher carefully and completely define the population, including a description of the members to be included. At this juncture, you may raise questions like: What is the advantage of sampling? Why shouldn't we study the whole population rather than limiting ourselves to the data we obtain from a certain proportion of it? You will get answers to these questions after you complete the discussion of sampling with your instructor in this unit.
A statistic is a figure calculated from sample data. The mean, median, variance, standard deviation and skewness of a sample set of data are termed statistics. A statistic is used to give information about unknown values in the corresponding population. For example, the mean of the data in a sample is used to give information about the overall average in the population from which that sample was drawn. Parameters are often denoted by Greek letters (e.g. μ and σ), whereas statistics are denoted by Roman letters (e.g. m and s).
What is the optimum sample size or proportion? You will see in the discussions hereinafter. There are two basic causes of sampling error. One is the error that occurs simply by chance; some literature calls this bad chance. This may result in untypical choices. Unusual units (extremely small or large units) exist in any population, and there is always a possibility that an abnormally large or small number of them will be chosen. For example, when collecting data for your BA Thesis at the woreda level, you may unluckily select all the well-to-do farm households in the woreda for your sample. You may select your samples randomly and yet all the rich households in the population, which have the highest crop yield per year, may be selected, making the sample average far higher than it should be. Here you may raise a question: How can I protect against such bad chance during fieldwork for my research? The main protection against this kind of error is to use a sample size large enough. The second cause of sampling error is sampling bias. Sampling bias is a tendency to favor the selection of units or items that have particular characteristics, and it is usually the result of a poor sampling plan. The most notable is the bias of non-response, when for some reason some units have no chance of appearing in the sample. The means of selecting the units of analysis must be designed to avoid the most obvious forms of bias. For example, if you would like to know the average income of the residents of a town, you may decide to use mobile telephone numbers to select a sample from the total population in a locality where only the well-to-do households (in the Ethiopian case) own mobile telephones. You will then end up with a high average income, leading to wrong conclusions in your findings. Therefore, you must be very careful to select your research samples free of any bias.
There are also other significant factors which result in errors and reduce the quality of data. These are:
1. Interviewer's effect: No two interviewers are alike, and the same person may provide different answers to different interviewers. The manner in which a question is formulated can also result in inaccurate responses. Individuals tend to provide false answers to particular questions. For example, some people want to feel younger or older for reasons known only to them. If you ask such a person her/his age in years, it is easy for the individual to lie by overstating or understating her/his age by some years. But if you ask in which year s/he was born, s/he may give a more accurate figure, since it would require a bit of quick arithmetic to give a false date.
2. Respondent's effect: Respondents might also give incorrect answers to impress the interviewer. This type of error is the most difficult to prevent because it results from outright deceit on the part of the respondents or interviewees. An example of this is what I witnessed during my MA Thesis study, in which I was asking farmers how much crop they had harvested in the previous year, 2001. In most cases the household heads tended to lie by reporting only a very small amount of yield per year. I then tried my best to convince them to tell me more accurate figures.
3. Knowing the study purpose: This is the case when respondents give wrong data solely because they are aware of why a study is being conducted. A good example is a question on income. If a government agency is asking, a different figure may be provided than the respondents would give to a neutral or purely academic researcher. One way to guard against such bias is to make the questions very specific, allowing no room for personal interpretation. For example, the question "Are you employed?" could be followed by "What is your salary?", "Do you have any extra income?", "If yes, how much is it?" and so on. A sequence of such questions may produce more accurate information than directly asking, "What is your monthly income?"
Error and cost seem to compete in sampling, because reducing error often requires an increased expenditure of resources such as time, finance and human power. Of the two types of statistical errors, only sampling error can be controlled by exercising care in determining the appropriate method for choosing the sample.
The above discussion has shown that sampling error may be due to either bias or chance. The chance component (sometimes called random error) exists no matter how carefully the selection procedures are implemented, and the only way to minimize chance sampling errors is to select an adequately large sample size. Sampling bias, on the other hand, may be minimized by the wise choice of a sampling procedure.
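A small simulation can make the chance component concrete. The population figures below are synthetic, generated only for illustration (a mostly typical population with a handful of unusually large producers); the point is that the average sampling error shrinks as the sample size grows.

```python
import random
import statistics

random.seed(42)

# Synthetic population: 1200 farm households' annual yield (quintals),
# including 50 unusually large producers that can distort small samples
population = ([random.gauss(20, 5) for _ in range(1150)] +
              [random.gauss(80, 10) for _ in range(50)])
true_mean = statistics.mean(population)

def mean_abs_error(sample_size, trials=500):
    """Average absolute gap between the sample mean and the true mean
    over many repeated random samples (the 'chance' component)."""
    errors = []
    for _ in range(trials):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

# Chance error shrinks as the sample grows
err_small = mean_abs_error(30)
err_large = mean_abs_error(300)
```

Running this shows `err_large` well below `err_small`: bias cannot be cured this way, but chance error is tamed by an adequately large sample, exactly as the text argues.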
Sampling Methods
Probability Sampling (usually for quantitative methods):
- Simple random sampling
- Systematic sampling
- Stratified random sampling
- Cluster sampling
- Multistage sampling
Non-probability Sampling (usually for qualitative methods):
- Convenience sampling
- Judgment/purposive sampling
- Quota sampling
- Snowball sampling
- Volunteer sampling
Example: Let us assume that you want to study certain characteristics of rural households in woreda Y. Let the total number of farm households in the woreda be 2000, serially numbered from 1 to 2000 in alphabetical order. First, decide your sample size based on the criteria you have read above; let it be 100. Divide the population (2000) by the sample size (100): 2000 divided by 100 gives 20, which is the sampling interval; you take every 20th item, starting from a first item identified randomly. Now draw a random number from 1 to 20. Suppose number 13 is selected. Then select your samples as follows:
Calculation:
13, 13 + 20 = 33, 13 + 2×20 = 53, 13 + 3×20 = 73, 13 + 4×20 = 93, 13 + 5×20 = 113, 13 + 6×20 = 133, … and so on up to the 100th item (13 + 99×20 = 1993).
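The systematic selection above can be sketched as a short function. The population size, sample size and random start follow the worked example; everything else is illustrative.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Select every k-th unit, where k = population_size // sample_size,
    beginning at a random start in 1..k (units numbered from 1)."""
    k = population_size // sample_size
    if start is None:
        start = random.randint(1, k)
    return [start + i * k for i in range(sample_size)]

# The worked example: 2000 households, a sample of 100, random start 13
selected = systematic_sample(2000, 100, start=13)
# First few selected household numbers: 13, 33, 53, 73, 93, ...
```

Left with `start=None`, the function draws the starting point at random, which is what makes systematic sampling a probability method.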
In multistage stratification, say, the gender attribute of the household heads first divides the universe/population into two groups. At the second stage, monthly family income can be introduced to form sub-classes, and the population can be further sub-divided by introducing other characteristics one by one. In other cases, the study area can be divided into agro-climatic zones, which can be further divided into woredas, then kebeles, villages and so on. The two main reasons for using a stratified sampling design are (1) to ensure that every group within a population is adequately represented in the sample, and (2) to improve efficiency by gaining greater control over the composition of the sample, since the sample size in each stratum is usually proportional to the relative size of the stratum.
Example: Let us assume that you want to study certain characteristics of urban households in one of the towns in Ethiopia. First, you divide the whole set of urban households into a number of homogeneous groups, called strata, on the basis of certain characteristics of the households. For example, they may be grouped according to sex of household head, age or family income, or any other available information can be used. In fact, at this stage we need pre-documented information about each household. Finally, from each homogeneous group (stratum) you select a required number of sample households randomly and distribute your questionnaire or administer an interview.
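A minimal sketch of proportional stratified selection might look like the following; the strata by sex of household head and the household counts are hypothetical.

```python
import random

random.seed(1)

def proportional_stratified_sample(strata, total_sample):
    """strata: dict mapping stratum name -> list of unit IDs.
    Allocates the sample to each stratum in proportion to its size,
    then draws randomly within each stratum."""
    n_pop = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_h = round(total_sample * len(units) / n_pop)
        sample[name] = random.sample(units, n_h)
    return sample

# Hypothetical strata: 800 male-headed and 200 female-headed households
strata = {
    "male_headed": list(range(1, 801)),
    "female_headed": list(range(801, 1001)),
}
sample = proportional_stratified_sample(strata, total_sample=50)
```

With a total sample of 50, the male-headed stratum receives 40 slots and the female-headed stratum 10, mirroring their 80/20 share of the population, so both groups are guaranteed representation.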
Example: Let us assume that we want to know the amount of crop production in a woreda during a specific crop year. It may be very costly to contact each and every farming household in the woreda. We can instead divide the woreda into smaller segments (e.g. kebeles), known in statistics as primary units. Then we can randomly or purposively take a reasonable and representative number of the segments (primary units) and contact every household in the selected segments.
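The cluster (primary unit) idea can be sketched as follows; the kebele names and household IDs are purely hypothetical.

```python
import random

random.seed(7)

# Hypothetical woreda of 12 kebeles (primary units),
# each holding 100 households
kebeles = {f"kebele_{i}": [f"hh_{i}_{j}" for j in range(1, 101)]
           for i in range(1, 13)}

# Cluster sampling: randomly pick a few whole kebeles, then
# enumerate every household inside the chosen clusters
chosen = random.sample(list(kebeles), 3)
surveyed = [hh for k in chosen for hh in kebeles[k]]
```

Three whole kebeles yield 300 households to visit, but the travel is concentrated in three places instead of being scattered across all twelve, which is the cost advantage of the method.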
is its greatest weakness and quota versus probability has been a matter of controversy for many years.
Moreover, the three most important criteria that must be specified to determine the appropriate sample size are the level of precision, the level of confidence or risk, and the degree of variability in the attributes being measured. In case of an inadequate sample size, researchers should report both the statistically appropriate sample size and the sample size actually used in the study. This allows the reader to judge whether s/he accepts the researcher's assumptions and procedures. Finally, regarding sample size, note that an adequate sample size with high-quality data collection efforts may yield more reliable, valid and generalizable results than studies conducted with an entire population or census data.
Level of precision
The level of precision, sometimes called sampling error, is the range in which the true value of the population is estimated to be. This range is often expressed in percentage points (e.g. ±5%). For instance, if a researcher finds that 60% of the farmers in the sample have adopted a recommended practice with a precision of ±5%, then s/he can conclude that between 55% and 65% of the farmers in the population have adopted the practice.
Confidence level
The confidence or risk level is based on the ideas encompassed in the Central Limit Theorem. The key idea of the Central Limit Theorem is that when a population is repeatedly sampled, the average value of the attribute obtained from those samples equals the true population value. Furthermore, the values obtained from these samples are distributed normally about the true value, with some samples having a higher value and some a lower value than the true population value. In a normal distribution, approximately 95% of the sample values lie within two standard deviations of the true population value (e.g., the mean). In other words, if a 95% confidence level is selected, 95 out of 100 samples will estimate the population value within the stated precision. There is always a chance that the sample you obtain does not represent the true population value.
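The claim above can be checked with a quick simulation on a synthetic population: repeated sample means centre on the true mean, and roughly 95% of them fall within two standard errors of it.

```python
import random
import statistics

random.seed(0)

# Synthetic population of 10,000 attribute values
population = [random.gauss(50, 12) for _ in range(10000)]
true_mean = statistics.mean(population)

# Repeatedly sample (n = 100) and record the sample means
sample_means = [statistics.mean(random.sample(population, 100))
                for _ in range(1000)]

# The sample means centre on the true population value ...
centre = statistics.mean(sample_means)

# ... and roughly 95% fall within two standard errors of it
se = statistics.stdev(sample_means)
within = sum(abs(m - true_mean) <= 2 * se
             for m in sample_means) / len(sample_means)
```

With the seed above, `centre` sits very close to `true_mean` and `within` comes out near 0.95, which is exactly the 95-out-of-100-samples reading of the confidence level.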
Degree of Variability
The third criterion, the degree of variability in the attributes being measured, refers to the distribution of attributes in the population. The more heterogeneous a population, the larger the sample size required to obtain a given level of precision.
n = N / (1 + N·e²)

Where:
n = sample size
N = population size
e = level of precision
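This simplified formula (often attributed to Yamane) is easy to apply in code. Using the earlier example of a woreda with 1200 farm households and ±5% precision:

```python
def yamane_sample_size(N, e):
    """Simplified sample-size formula n = N / (1 + N * e**2),
    where N is the population size and e the level of precision."""
    return N / (1 + N * e ** 2)

# A woreda of 1200 farm households at +/-5% precision
n = yamane_sample_size(1200, 0.05)   # -> 300 households
```

Note that this gives 300 households, far more than the 8% (96 households) mentioned in the earlier illustration; the formula, not an arbitrary percentage, should drive the choice.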
n₀ = (Z² σ²) / e²

Where:
n₀ = sample size
Z = the abscissa of the normal curve that cuts off an area at the tails
σ² = an estimate of the variance of the attribute in the population
e = the desired level of precision, in the same unit of measure as the variance
The above mathematical formula can also be rewritten as follows to determine the required sample size for a specific confidence level and margin of error:

n = (z₍α/2₎ · σ / E)²

Where:
n = sample size
z₍α/2₎ = the value of the standard normal distribution corresponding to a confidence level of 1 − α
σ = the population standard deviation
E = the desired margin of error

This formula can be used if and only if a good estimate of the population standard deviation σ is available, for instance from previous studies or documents. It gives the sample size necessary to establish the population mean to within ±E with a confidence of 1 − α. For example, to be 95% sure that the sample mean is within 1 unit of the population mean, z₍α/2₎ is taken from the normal distribution for a 95% confidence level (z₍α/2₎ = 1.96) together with the estimated standard deviation.
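Both forms of the formula can be computed directly. The figures below (σ estimated at 4 units from a pilot survey, 95% confidence, margin of error of 1 unit) are illustrative assumptions, not values from the text.

```python
import math

def cochran_n(z, sigma, e):
    """n0 = z^2 * sigma^2 / e^2 — sample size from a variance estimate."""
    return (z ** 2 * sigma ** 2) / (e ** 2)

def n_for_mean(z_alpha2, sigma, E):
    """Equivalent form n = (z_{alpha/2} * sigma / E)^2 for estimating the
    population mean to within a margin of error E."""
    return (z_alpha2 * sigma / E) ** 2

# Illustrative assumptions: sigma = 4 (from a pilot survey),
# 95% confidence (z = 1.96), margin of error E = 1 unit
n = n_for_mean(1.96, 4, 1)     # about 61.5
n_rounded = math.ceil(n)       # round up: survey 62 units
```

Rounding up rather than to the nearest integer is the usual convention, since rounding down would fall short of the stated precision.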
The disadvantage of determining the sample size from the mean is that a good estimate of the population variance is necessary, and often such an estimate is not available. Furthermore, the required sample size can vary widely from one attribute to another, because each is likely to have a different variance.
Project work
Let us assume that you are going to study the ethnic composition of the residents of your town based only on primary data. Here you can imagine that it is very expensive and time consuming to carry out a census survey for this specific study. As a result, you may be required to select samples for this study. Then, try to explain the type and method of sampling technique you are going to apply to answer the research questions and come up with the most appropriate research outcomes. What will be your appropriate sample size? What method will you use to decide the sample size and what do you say about the adequate representation of the samples?
3.1. Introduction
The data collected from the field or obtained through measurements are usually presented in condensed form in tables, charts and graphs. The purpose of putting the data into tables, charts and graphs is threefold. Firstly, it enables the researcher to look at the data visually, see what happened and make interpretations. Secondly, it is usually a better way to show the data to the audience than immersing them in long runs of numbers in the text, which may put people to sleep while conveying little information. Thirdly, it allows readers to easily understand the research findings and what the writer wants to communicate. As a result, it is highly recommended to present data in tables, charts and graphs in condensed form. All these data presentation tools can be drawn by hand or on a computer. Software such as Microsoft Excel produces graphs and performs some statistical calculations. Statistical programs like SPSS, SYSTAT and SAS are higher-powered programs that perform many statistical analyses as well as producing high-quality graphs. However, graphs may mislead unless carefully and precisely drawn. Hence, you should bear in mind the following points while producing tables, charts and graphs for any quantitative data:
Scales must be in regular intervals
Graphs/charts that are compared must have the same scale
You must be clear about what you are going to communicate
Graphs, charts and tables must be easy to read
You should know who your audience is
Be sure whether or not the display tells the entire story of the specific issue
II. Ordinal (Rank) data is where the numerical value indicates something about relative rather than absolute position in a series, such as ranks. Many statistical tests make use of this type of ranked data. For instance, suppose four farmers, namely Chalchisa, Lamesa, Sardessa and Arfase, are the 1st, 2nd, 3rd and 4th crop producers in a rural kebele called Halelu-Chari during crop year 2007/8. The numerical values 1, 2, 3 and 4 indicate not the amount of yield the four farmers produced in 2007/8 but their relative positions among all the farmers in Halelu-Chari Kebele. Hence, 1, 2, 3 and 4 here are ordinal data. Here you can see that the data are ordered but the differences between values are not important. III. Interval data is a scale of measurement where the distance between any two adjacent units of measurement (or intervals) is the same but the zero point is arbitrary. Scores on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. An interval scale has all the characteristics of an ordinal scale, i.e. individuals or responses belonging to a subcategory have a common characteristic and the sub-categories are arranged in ascending or descending order. In addition, an interval scale uses a unit of measurement that enables the individuals or responses to be placed at equally spaced intervals in addition to the spread of the variable. This scale has starting and terminating points, and the number of units/intervals between them is arbitrary and varies from scale to scale.
Centigrade and Fahrenheit scales are examples of the interval scale. In the centigrade system the starting point (the freezing point of water) is 0°C while the end point (the boiling point) is 100°C. The gap between the freezing and boiling points is divided into 100 equally spaced intervals known as degrees. In the Fahrenheit system the freezing point is 32°F and the boiling point is 212°F, and the gap between the two points is divided into 180 equally spaced intervals. Each degree or interval is a measurement of temperature. As the starting and terminating points are arbitrary, they are not absolute, i.e. you cannot say that 60°C is twice as hot as 30°C, nor that 30°F is three times hotter than 10°F. This means that while multiplication and division cannot be performed on the readings themselves, they can be performed on the differences between readings. IV. Ratio data is a set of data whose values (observations) may take on any value within a finite or infinite interval. It has all the properties of the nominal, ordinal and interval scales plus its own property: the zero point of a ratio scale is fixed, which means it has a fixed, non-arbitrary starting point. You can count, order and measure ratio data. For example, during your field work (data collection) for your research you may record 80 quintals/year, 15 quintals/year or no (zero) production for certain farmers. Another example is the height and weight of newly born babies, where a 3.0 kg baby is twice as heavy as a 1.5 kg one and a 90 cm baby is 1.5 times as long as a 60 cm one. On this scale all mathematical operations, including multiplication and division, are therefore meaningful. The zero value on a ratio scale is non-arbitrary. Most physical quantities, such as mass, length or energy, are measured on ratio scales; so is temperature when it is measured in kelvins, i.e. relative to absolute zero.
In the real quantitative data collection process, it is also common to identify two further sets of data, namely continuous and discrete data. Continuous variables are those variables that have, in theory, an infinite number of gradations between any two measurements. For example, crop yield per unit of land, body weight of individuals, milk yield of cows, etc. are continuous variables. Many variables in geography and environmental studies are of the continuous type. Discrete variables, on the other hand, do not have continuous gradations; there is a definite gap between two measurements, i.e. they cannot be measured in fractions. For example, the number of people in a country, family size, and the number of eggs per chicken in poultry are discrete data.
Note that the distinction between nominal, ordinal, interval and ratio scale is of importance mainly because it helps us to choose the appropriate statistical tool for the analysis of the data.
Extraneous Variables

Consider a study aimed at finding the factors affecting crop productivity in a certain area of Ethiopia. We may hypothesize that climate and soil have beneficial effects on crop productivity. In this situation climate and soil are the Independent Variables (IV) and crop productivity is the Dependent Variable (DV). However, it is very likely that the DV (the amount of crop production per unit area) is also influenced by other factors, such as type of seed, fertilizers, soil type, farmers' level of education, farming technology, etc. All these other potential sources of influence are known as extraneous variables. In most scientific research, an experimental design is used to control these extraneous factors as far as possible.
Classification is the first step in condensing a mass of data into a more comprehensible form. The process of grouping data into different classes or sub-classes according to some characteristics is known as classification. Thus classification is the first step in tabulation. For example, crop products in the crop year 2007/8 can be classified according to their types, such as teff, barley, wheat, sorghum and oat. Similarly, population data can be classified as male or female, educated or uneducated, married or unmarried, etc.
Importance of Classification
The following are the main objectives of classifying geographical data:
It condenses the mass of data into an easily manageable form
It eliminates unnecessary details
It facilitates comparison and highlights the significant aspects of the data
It enables one to get a mental picture of the information and helps in drawing inferences
It helps in the statistical treatment of the information collected
Look at the table above: the farmers' sources of cash income are clearly indicated with their respective percentages, and the data are now easier to read and understand and more suitable for further statistical analysis.
Types of classification
Statistical data are classified in respect of their characteristics. Broadly, there are four basic types of classification, namely:
a) Chronological classification
b) Geographical or locational classification
c) Qualitative classification
d) Quantitative classification
a) Chronological classification In chronological classification the collected data are arranged according to the order of time, expressed in years, months, weeks, etc. The data are generally classified in ascending order of time.
b) Geographical classification In this type of classification the data are classified according to geographical region or place: for instance, the production of coffee in the different zonal administrations of Ethiopia, or the production of wheat in different woredas of Oromiya.
c) Qualitative classification In this type of classification data are classified on the basis of certain attributes or qualities like sex, literacy, religion, employment, etc. Such attributes cannot be measured on a numerical scale. For example, if the population is to be classified in respect of one attribute, say sex, then we can classify it into two groups, namely males and females. Similarly, it can also be classified into employed or unemployed on the basis of another attribute, employment.
d) Quantitative classification Quantitative classification refers to the classification of data according to characteristics that can be measured, such as amount of rainfall, temperature, crop production, altitude, height, weight, etc. For example, farmers in a kebele may be classified according to their amount of crop production within a year, as given below.
Table 3.2: Classification of hypothetical farmers by quantity of production

Production in Quintals (Classes/Groups)    Number of Farmers
Discontinuous    Continuous                (Frequency)
6 - 10           5.5 - 10.5                8
11 - 15          10.5 - 15.5               12
16 - 20          15.5 - 20.5               17
21 - 25          20.5 - 25.5               21
26 - 30          25.5 - 30.5               14
31 - 35          30.5 - 35.5               11
36 - 40          35.5 - 40.5               6
41 - 45          40.5 - 45.5               11
Total                                      100
In this type of classification there are two elements, namely (i) the variable, i.e. the production in the above example, and (ii) the frequency, i.e. the number of farmers in each class. There are 12 farmers whose production ranges from 11 to 15 quintals, 14 farmers whose production ranges between 26 and 30 quintals, and so on. Dear Student! Do you know the advantages of data classification? Please, explain it.
Several types of statistical/data presentation tools (graphic aids) exist. These include bar graphs, histograms, scattered diagrams, pie-chart, line graphs and tables.
3.5.1. Tabulation
Tabulation is the process of summarizing classified or grouped data in the form of a table so that it is easily understood and an investigator is quickly able to locate the desired information. A table is a
systematic arrangement of classified data in columns and rows. Thus, a statistical table makes it possible for the investigator to present a huge mass of data in a detailed and orderly form. It facilitates comparison and often reveals certain patterns in the data which are otherwise not obvious. Classification and tabulation, as a matter of fact, are not two distinct processes; actually they go together. Before tabulation, data are classified and then displayed under the different columns and rows of a table.
Advantages of Tabulation
Statistical data arranged in tabular form serve the following objectives: 1. It simplifies complex data, and the data presented are easily understood. 2. It facilitates comparison of related facts. 3. It facilitates computation of various statistical measures like measures of central tendency, dispersion and correlation. 4. It presents facts in the minimum possible space; unnecessary repetitions and explanations are avoided. Moreover, the needed information can easily be located. 5. Tabulated data are good for reference and make it easier to present the information in the form of graphs and diagrams.
Preparing a Table
The making of a compact table is itself an art. It should contain all the information needed within the smallest possible space. The purpose of tabulation and how the tabulated information is to be used are the main points to be kept in mind while preparing a statistical table. An ideal table should consist of main parts such as the table number, the title of the table, the body of the table and the sources of the data. Dear Student! Look at the table below and observe how the table number, title, body and sources of data should be presented.
Table 3.3: Number of livestock per household by types and agro-climatic zones, Kuyu Woreda (columns: Types | Dega | Woina Dega | Qolla | All zones)
Frequency Distribution
In statistics, a frequency distribution is a list of values that a variable takes in a sample. It is usually a list, ordered by quantity, showing the number of times each value appears. For example in the table below 22 appears three times while others such as 8, 12 and 17 appear two times each. Table 3.4: Frequency table
Laborers' Daily     Number of Laborers   Cumulative         Percentage Cumulative
Income (Eth. Birr)  (Frequency)          Frequency (<UL)    Frequency
3.00                2                    2                  12.5
8.00                2                    4                  25.0
12.00               2                    6                  37.5
17.00               2                    8                  50.0
18.00               2                    10                 62.5
22.00               3                    13                 81.3
23.00               1                    14                 87.5
24.00               1                    15                 93.8
27.00               1                    16                 100.0
Total               16                   ----               ----
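The frequency, cumulative frequency and percentage columns of a table like Table 3.4 can be computed mechanically. A minimal Python sketch using the income data above:

```python
from collections import Counter

# Daily incomes of the 16 laborers, as listed in Table 3.4
incomes = [3, 3, 8, 8, 12, 12, 17, 17, 18, 18, 22, 22, 22, 23, 24, 27]

freq = Counter(incomes)          # frequency of each distinct value
values = sorted(freq)

cumulative, running = {}, 0
for v in values:
    running += freq[v]
    cumulative[v] = running      # cumulative frequency (< upper limit)

n = len(incomes)
# Note: Python's round() is half-even, so 81.25 rounds to 81.2
# (the table shows 81.3, which uses half-up rounding)
pct_cumulative = {v: round(100 * cumulative[v] / n, 1) for v in values}
print(freq[22], cumulative[22], pct_cumulative[18])  # -> 3 13 62.5
```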
This simple tabulation has two drawbacks. When a variable can take continuous values instead of discrete values or when the number of possible values is too large, the table construction is cumbersome, if not impossible. A slightly different tabulation scheme based on the range of values is used in such cases. So we group the data into class intervals (or groups) to help us organize,
interpret and analyze the data. The frequency of a group (or class interval) is the number of data values that fall in the range specified by that group (or class interval). There are two ways in which observations in the data set are classified on the basis of class intervals, namely the exclusive method and the inclusive method. The exclusive method is when the data are classified in such a way that the upper limit of a class interval is the lower limit of the succeeding class interval. This method is illustrated in Table 3.5.

Table 3.5: Exclusive method of data classification
Class Interval    Frequency    CF (<UL)
0 - 5             2            2
5 - 10            2            4
10 - 15           2            6
15 - 20           4            10
20 - 25           5            15
25 - 30           1            16
Total             16           ----
The inclusive method, on the other hand, is when the data are classified in such a way that both the lower and upper limits of a class are included in the interval itself, as illustrated below.

Table 3.6: Inclusive method of data classification
Class Interval    Frequency
0 - 4             2
5 - 9             2
10 - 14           2
15 - 19           4
20 - 24           5
25 - 30           1
Total             16
The exclusive method is used to classify a set of data involving continuous variables, while the inclusive method should be used to classify a set of data involving discrete variables. If a continuous variable is classified according to the inclusive method, then a certain adjustment of the class intervals is needed to obtain continuity, as shown in Table 3.7. (Table 3.6 above can be adjusted as Table 3.7 below for continuous variables.)

Table 3.7: Adjusted inclusive method for continuous variables
Class Interval    Frequency
-0.5 - 4.5        2
4.5 - 9.5         2
9.5 - 14.5        2
14.5 - 19.5       4
19.5 - 24.5       5
24.5 - 30.5       1
Total             16

The method is to first calculate a correction factor as:

X = (Lower Limit of the Next Higher Class − Upper Limit of a Class) / 2, where X is the correction factor,

and then subtract the value of X from the lower limits of all classes and add it to the upper limits of all classes.
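The correction-factor adjustment can be expressed in a few lines of code. The sketch below converts the inclusive classes of Table 3.6 into the continuous boundaries of Table 3.7:

```python
# Inclusive classes from Table 3.6, as (lower, upper) pairs
classes = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24), (25, 30)]

# Correction factor X: half the gap between one class's upper limit
# and the next class's lower limit, here (5 - 4) / 2 = 0.5
x = (classes[1][0] - classes[0][1]) / 2

# Subtract X from each lower limit and add it to each upper limit
adjusted = [(lo - x, hi + x) for lo, hi in classes]
print(adjusted[0], adjusted[-1])  # -> (-0.5, 4.5) (24.5, 30.5)
```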
Graphic representations of data can be categorized into five types based on the dimensions of the graphs in use. These are:
1. Dimensionless diagrams, also known as point graphs or dot graphs
2. Unidimensional graphs or line graphs, such as the frequency curve/polygon and the cumulative frequency curve (Ogive)
3. Bidimensional diagrams such as histograms, bar diagrams and pie-charts
4. Tridimensional diagrams such as cubes, blocks and spheres
5. Pictorial representation of data using pictures such as human beings, animals, cups, houses and fruits
Generally, presentation of data in diagrams has the following advantages:
1. Diagrams give an attractive and elegant presentation of data
2. Diagrams leave a good visual impact and facilitate comparisons
3. Interpretations from diagrams save time
4. Diagrams simplify complexity and easily depict the characteristics of the data
Linechart or Linegraph
A linechart or linegraph is a type of graph created by connecting with line segments a series of data points, each representing an individual measurement. A linechart is a basic type of chart common in many fields, and it gives the reader a fairly good idea of the nature of the data. For instance, a linechart is often used to visualize a trend in data over intervals of time.
For instance, the data depicted in Table 3.8 above can easily be converted to a linechart as shown in figure below.
[Figure: linechart of the crop output data in Table 3.8 (x-axis: crop year, 1985-2005; y-axis: crop output in quintals)]
A frequency polygon is also a type of linechart, formed by marking the midpoints of the tops of the bars in a histogram and joining these dots by a series of straight lines. The frequency polygon is formed as a closed figure with the horizontal axis: straight lines are drawn from the midpoints of the top bases of the first and last rectangles to the midpoints, falling on the horizontal axis, of the next outlying intervals with zero frequency. Drawing a frequency polygon does not necessarily require constructing a histogram first. It can be obtained directly by plotting points above each class midpoint at heights equal to the corresponding class frequencies. The points so drawn are then joined by a series of straight lines and the polygon is closed as explained earlier. In this case, the horizontal x-axis measures the successive class midpoints and not the lower class limits.
Exercise
Use the data indicated in Table 3.4 hereinbefore and draw a frequency polygon A) By constructing a histogram first B) Without constructing histogram
Example
Plot (construct) an Ogive based on the data (assumed students' marks) in the table below. (i) Plot the points with coordinates having as abscissa (x-axis) the actual class limits and as ordinates (y-axis) the cumulative frequencies. (ii) Join the points plotted by a smooth curve. (iii) The Ogive is connected to a point on the x-axis representing the actual lower limit of the first class.
Frequency    CF (<UL)    Percentage    Cumulative Percentage
8            8           16            16
12           20          24            40
14           34          28            68
10           44          20            88
6            50          12            100
Total: 50    ----        100           ----
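The cumulative frequencies and cumulative percentages plotted on an Ogive can be derived from the raw frequencies alone, as this short sketch using the figures above shows:

```python
# Class frequencies from the students' marks table (Table 3.9)
frequencies = [8, 12, 14, 10, 6]
total = sum(frequencies)            # 50 students in all

# Running total gives the "less than upper limit" cumulative frequency
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(running)

pct_cumulative = [100 * c / total for c in cumulative]
print(cumulative)       # -> [8, 20, 34, 44, 50]
print(pct_cumulative)   # -> [16.0, 40.0, 68.0, 88.0, 100.0]
```

Plotting `cumulative` (or `pct_cumulative`) against the actual upper class limits and joining the points with a smooth curve produces the Ogive described in the steps above.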
Figure 3.3: Cumulative frequency curve (Ogive) based on the data in Table 3.9
(y-axis: cumulative frequency, 0-100; x-axis: marks, from 8 to 43)
Histogram and Bardiagram: These are two-dimensional diagrams used to represent both
ungrouped (raw) and grouped data. Values of the variables (the characteristics to be measured) are scaled along the horizontal axis and the numbers of observations (frequencies) along the vertical axis of the graph. The height of each box (rectangle) measures the number of observations in the corresponding class. See Figures 3.4 and 3.5. A histogram is simply a bar graph in which the bar lengths are determined by the frequencies in each class of a grouped frequency distribution. Notice how the bar diagram above is represented by the histogram below, which has eight interconnected bars representing the numbers of farmers in each class of the crop production distribution.
Figure 3.4: Bardiagram Presenting Number of Farmers by Crop Production Hypothetical Data
A piediagram, piechart or circlegraph is a circular chart divided into sectors, illustrating relative magnitudes, frequencies or percentages. In a pie chart, the arc length of each sector, and consequently its central angle and area, is proportional to the quantity it represents. Together, the sectors create a full disk. It is named for its resemblance to a pie which has been sliced.
Figure 3.5: Histogram Indicating Number of Farmers by Crop production: Hypothetical Data
Observe that the data indicated in the table below (Table 3.10) has been presented by the preceding pie-diagram or Figure 3.6.
Table 3.10: Percentage distribution of hypothetical population by marital status
Marital Status    % Distribution
Single            20
Married           62
Widowed           14
Divorced          4
Figure 3.6: Piechart presenting percentage distribution of marital status: Hypothetical data
To divide the pie-chart proportionally, use the following method. Observe that the sum of the percentages (14 + 4 + 20 + 62) is 100. You also know that the total angular measurement of any circle is 360°. Then you can establish a relationship between the percentages and the angular measurements as follows.
For instance, if 100% = 360°, what will be the equivalent angular measurements of the remaining percentages?
14%  = 50.40°
4%   = 14.40°
20%  = 72.00°
62%  = 223.20°
100% = 360.00°
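The conversion from percentages to central angles is a single proportional step (1% corresponds to 3.6°), as this sketch using the marital-status data confirms:

```python
# Percentage shares from Table 3.10
shares = {"Single": 20, "Married": 62, "Widowed": 14, "Divorced": 4}
assert sum(shares.values()) == 100

# 100% corresponds to 360 degrees, so each percent is 3.6 degrees
angles = {k: round(v * 3.6, 2) for k, v in shares.items()}
print(angles["Married"], angles["Widowed"])  # -> 223.2 50.4
```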
Now, by using your protractor, you can construct a piechart as indicated in Fig 3.6. Important formulas in this unit are:
1. Class interval: h = Upper Limit − Lower Limit
2. Midpoint of a class: m = (Upper Limit + Lower Limit) / 2
3. Approximate interval size to be used in constructing a frequency distribution: h = (Largest data value − Smallest data value) / Number of classes
Table 3.11: A farmer's crop output over years: Hypothetical data
Year                   1985   1990   1995   2000   2005
Crop Output in Qtls    10     12     15     20     28
Figure 3.8: Pictorial representation of the national regional states of Ethiopia by population (regions shown: Oromiya, Amhara, SNNPRS, Somali, Tigray, Addis Ababa, Afar, B/Gumuz)
Source: CSA (2008). Scale: one symbol = 1,000,000 people
Exercise
1. Define the following statistical terms A. Qualitative data B. Quantitative data C. Extraneous variable D. Histogram E. Ogive F. Cumulative frequency G. Frequency distribution H. Midpoint
2. Let the following data represent the gross income (in Eth. Birr) of forty (40) urban households in Adama town. Then construct a cumulative frequency table and Ogive for the data.
550, 600, 760, 500, 900, 550, 760, 750, 1000, 550, 700, 600, 550, 550, 670, 740, 1100, 570, 610, 480, 620, 680, 720, 740, 750, 550, 600, 780, 800, 820, 840, 880, 900, 920, 950, 750, 600, 520, 980, 500
3. The following table indicates the population sizes of the top 8 most populous countries in the world. Present the data in a pie diagram and a bar chart.
Table 3.12: The world's 8 most populous countries, 2008
Country       Population
China         1,321,851,888
India         1,129,866,154
USA           301,139,947
Indonesia     234,693,997
Brazil        190,010,647
Pakistan      169,270,617
Bangladesh    150,448,339
Russia        141,377,752
4. The following data show the major sources of annual cash income of sample rural households in Kuyu woreda during a particular year (2001). Draw a suitable diagram (chart) to present the data.
Table 3.13: Rural households' major sources of cash income
S/N   Source of cash income                   Cash income per household (Birr/household)
1     Livestock and livestock products sale   712.26
2     Poultry                                 32.84
3     Bee product                             48.64
4     Grain sale                              118.72
5     Vegetables sale                         59.46
6     Firewood sale                           56.87
7     Charcoal sale                           22.93
8     Transfer/gift                           70.12
9     Rural credit                            799.71
10    Local trades                            112.20
11    Other non-farm activities               44.30
      Total                                   945.38
5. The following data are the mean maximum temperatures of Debre Birhan town in °C (1997-2006). Present the data using a bar-diagram. Table 3.14
Month        Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
Mean Max T°  19.80  19.85  19.85  19.85  19.85  19.85  19.85  19.85  19.85  19.85  19.85  19.85
4.1. Introduction
In the previous sections, we discussed how raw data can be collected, organized and presented in terms of tables, charts and frequency distributions in order to be easily understood and analyzed. Although frequency distributions and the corresponding graphical presentations make raw data more meaningful, they fail to identify three major properties that describe a set of quantitative data. These are:
1. The central value of a set of data, called central tendency
2. The extent to which numerical values are dispersed around the central value, called variation or dispersion, measured in terms of the distances of individual observations from the central value
3. The extent of departure of the numerical values from a symmetrical distribution around the central value, called skewness or the shape of the frequency distribution (a measure of symmetry)
In this unit we will discuss in depth the measures of central tendency, also known as measures of location or first-order analysis. The term central tendency was coined because the observations (numerical values) in most data sets show a distinct tendency to group or cluster around a value located somewhere among the observations. It is necessary to identify or calculate these typical central values to describe or project the characteristics of the entire data set in one figure. This descriptive value is known as a measure of central tendency. It is very important in social science applications such as planning. For instance, it becomes easy to plan the annual total need of potable water supply for Adama town residents by first studying the average quantity of water needed per head per household in the town.
The most widely used measures of central tendency are the mean, the median and the mode. We will be calculating these values for populations (i.e. the collection of all elements we are describing) and for samples drawn from populations, as well as for grouped and ungrouped data sets.
4.2. Mean
The mean is a central value which is computed by taking into consideration all the observations or all recorded values. It has four sub-types, known as the arithmetic, geometric, weighted and harmonic means. Unless otherwise specified, the term mean invariably refers to the arithmetic mean or average. It is the measure most frequently used because it is easy to compute and is usable in further rigorous statistical analysis where the geometric and harmonic means are not; moreover, the geometric and harmonic means have very limited applications. Mathematically, the mean of a list of numerical observations is the sum of all the observations divided by the number of items in the list. The Greek letter μ (mu) is used to denote the mean of an entire population, while the sample mean is typically denoted by x̄ (pronounced "x bar"). Some related literature also uses ȳ and d̄ to denote sample means. a) Calculating the Arithmetic Mean for Raw Data This is the most widely used and widely reported measure of central tendency. There are at least two methods to calculate the arithmetic mean for ungrouped (raw) data.
Direct Method
In this method it is calculated by adding the values of all observations and dividing the total by the number (N) of observations.
Population Mean: μ = (X₁ + X₂ + X₃ + ... + X_N) / N = ΣXᵢ / N

Sample Mean: x̄ = (x₁ + x₂ + x₃ + ... + xₙ) / n = Σxᵢ / n
Then, the arithmetic mean of the above observations (data) is calculated to be about 25.3 kg.
Short-cut (Assumed Mean) Method

In this method an assumed mean (am) is chosen, the deviations d = x − am are computed, and the mean is obtained as x̄ = am + Σfᵢdᵢ/N. For example, with am = 24:

x       f      d = x − 24    f·d
34      2      10            20
23      3      -1            -3
18      2      -6            -12
22      4      -2            -8
36      3      12            36
38      4      14            56
12      5      -12           -60
Total   23     ----          29

x̄ = am + Σfᵢdᵢ/N = 24 + 29/23 = 24 + 1.26 = 25.26
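The assumed-mean (short-cut) calculation can be verified with a few lines of Python; the data and assumed mean are those of the worked example above:

```python
# Values and frequencies from the table above; assumed mean am = 24
x = [34, 23, 18, 22, 36, 38, 12]
f = [2, 3, 2, 4, 3, 4, 5]
am = 24

N = sum(f)                                               # 23 observations
sum_fd = sum(fi * (xi - am) for xi, fi in zip(x, f))     # sum of f*d = 29
mean = am + sum_fd / N
print(round(mean, 2))  # -> 25.26
```

The same value is obtained by the direct method (Σfᵢxᵢ/N), which is the point of the short-cut: the assumed mean only shifts the arithmetic, not the result.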
In the step-deviation method, the mean of the deviations expressed in class-interval units (ū) is multiplied by the class width C and then added to the assumed mean. For this it is also advisable to change the class intervals into the continuous form of grouping.
Example
Let the following table represent the daily earnings of 18 hypothetical employees in a firm in classified form. This method requires grouping the raw data into class intervals, calculating the class midpoints and identifying the number of observations (data) in each class. Look at the table below.
Case 1
Table 4.2: Calculation of Arithmetic Mean for Grouped Data
Daily Earnings in Birr    Class mid-value    Number of Employees    fᵢmᵢ
(Discrete Grouping)       (mᵢ)               (fᵢ)
10 - 14                   12                 2                      24
15 - 19                   17                 3                      51
20 - 24                   22                 2                      44
25 - 29                   27                 4                      108
30 - 34                   32                 3                      96
35 - 39                   37                 4                      148
Total                     ------             18                     471
Arithmetic Mean: x̄ = Σfᵢmᵢ / n, where n = sample size (Σfᵢ)

Then, the arithmetic mean for the above example is x̄ = 471 / 18 = 26.167
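The direct grouped-data calculation (Σfᵢmᵢ/n) for Table 4.2 can be reproduced as follows:

```python
# Class mid-values and frequencies from Table 4.2
midpoints = [12, 17, 22, 27, 32, 37]
freqs = [2, 3, 2, 4, 3, 4]

n = sum(freqs)                                          # 18 employees
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n # 471 / 18
print(round(mean, 3))  # -> 26.167
```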
Case 2
Table 4.3: Calculation of Arithmetic Mean for Grouped Data (assumed mean am = 27, class width C = 5)

Daily Earning    Class Mark       Frequency    dᵢ = mᵢ − am    fᵢdᵢ    uᵢ = (mᵢ − am)/C    fᵢuᵢ
in Birr          (Mid-Point) mᵢ   (fᵢ)
10 - 14          12               2            -15             -30     -3                  -6
15 - 19          17               3            -10             -30     -2                  -6
20 - 24          22               2            -5              -10     -1                  -2
25 - 29          27               4            0               0       0                   0
30 - 34          32               3            5               15      1                   3
35 - 39          37               4            10              40      2                   8
Total            ---              18           ---             -15     ---                 -3
d̄ = Σfᵢdᵢ / Σfᵢ = −15/18 = −0.833

Correct mean = am + d̄ = 27 + (−0.833) = 26.167

OR

ū = Σfᵢuᵢ / Σfᵢ = −3/18 = −0.167

Correct mean = am + C·ū = 27 + 5 × (−0.167) = 27 − 0.833 = 26.167
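The step-deviation calculation can likewise be coded directly; here integer division is safe because every deviation is an exact multiple of the class width C = 5:

```python
# Class mid-values and frequencies from Table 4.3
midpoints = [12, 17, 22, 27, 32, 37]
freqs = [2, 3, 2, 4, 3, 4]
am, C = 27, 5      # assumed mean and class width

n = sum(freqs)
# u = (m - am) / C, so sum of f*u = -3 here
sum_fu = sum(f * (m - am) // C for m, f in zip(midpoints, freqs))
mean = am + C * sum_fu / n      # 27 + 5 * (-3) / 18
print(round(mean, 3))  # -> 26.167
```

Note that both routes, the direct deviations d and the scaled deviations u, recover the same mean of 26.167 Birr; the step-deviation form simply keeps the intermediate arithmetic in small integers.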
Merits and Demerits of the Arithmetic Mean

Sharma, J.K. (2004) lists the merits and demerits of the arithmetic mean as follows:

Merits
1. The calculation of the arithmetic mean is simple and it is unique
2. It is clear and unambiguous, since every data set has one and only one mean value
3. The calculation of the arithmetic mean is based on all values given in the data set
4. The arithmetic mean is a reliable single value that reflects all values in the data set
5. The arithmetic mean is least affected by fluctuations in sample size; its value determined from various samples drawn from a population varies by the least possible amount
6. It is used for more rigorous further statistical analysis
7. The arithmetic mean is a stable average

Demerits
1. The value of the arithmetic mean cannot be calculated accurately for open-ended class intervals
2. It is affected by extreme values, which are not exact representatives of the data set
3. For large data sets the calculation of the arithmetic mean may sometimes be difficult and tedious, as every element is used in the calculation
4. It cannot be calculated for qualitative characteristics such as intelligence, beauty and loyalty
5. The arithmetic mean cannot be determined by inspection
Exercise
The following are monthly minimum temperature data taken from Guder Weather Station. Calculate the arithmetic mean of the monthly minimum temperatures and write your answers in the space provided at the bottom of the table.
[Table: monthly minimum temperatures at Guder Weather Station; only the June column is legible in this copy: 12.5, 12.6, 10.2, 11.4, 11.5, 8.2, 7.4, 9.9, 8.1. The mean minimum temperature for each month is to be entered in the bottom row.]
Exercise
Form a frequency table for the raw data given below and calculate an arithmetic mean.
22 38 12 17 17 18 22 23 22 27 18 28 12 12 24 18 18 22 33 33 38 24 24 17 22 22 28 12 28 24 18 33 27 22 24
Example
Let us assume that the East Shewa Zone administration wants to award only one top crop-producing farmer in the zone, based on his/her annual crop production. In the process of selecting the nominee, let three top farmers (Farmers X, Y and Z) be found, each of whom harvested 80 quintals of crops during the crop year. If each farmer produced four types of crops, namely wheat, teff, barley and sorghum, as depicted in Table 4.4, who do you think should be awarded? It is somewhat difficult to
decide, as the three farmers produced the same quantity and the average production per crop is 20 quintals for each of them. Do not forget that only one farmer is to be awarded. In order to decide whom to award, we may attach to each crop type a value (weight) w₁, w₂, w₃ and w₄. Therefore, based on their current market prices, for instance, we can attach 1200 to teff, 800 to barley, 700 to wheat and 400 to sorghum. Observe Table 4.5 below.

Table 4.5: Calculation of Weighted Arithmetic Mean
Crop (Current         Farmer X              Farmer Y              Farmer Z
Price = Weight)       Prodn (x)   w·x       Prodn (x)   w·x       Prodn (x)   w·x
Teff (1200)           18          21600     12          14400     25          30000
Barley (800)          22          17600     18          14400     25          20000
Wheat (700)           30          21000     45          31500     15          10500
Sorghum (400)         10          4000      5           2000      15          6000
Total (Σw = 3100)     80          64200     80          62300     80          66500
In order to decide who should be awarded, we should now calculate the weighted mean for each farmer from the table above:
WM for farmer X 64200 3100 20.7096
WM for farmer Y
WX
20.0968
WM for farmer Z
21.4516
As per the calculation above, it can be noted that Farmer Z should be awarded. Remark: As noted by Sharma, J. K (2004) the weighted arithmetic mean should be used, among others, where the importance of all the numerical values in the given data set is not equal and when the frequencies of various classes are widely varying. The term weighted mean usually refers to a weighted arithmetic mean, but weighted versions of other means can also be calculated, such as weighted geometric mean and weighted harmonic mean.
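The weighted arithmetic mean in the example above can be sketched in Python as follows; the dictionaries simply restate the figures from Table 4.5.

```python
# Crop prices used as weights, and production in quintals (from Table 4.5)
weights = {"teff": 1200, "barley": 800, "wheat": 700, "sorghum": 400}
production = {
    "X": {"teff": 18, "barley": 22, "wheat": 30, "sorghum": 10},
    "Y": {"teff": 12, "barley": 18, "wheat": 45, "sorghum": 5},
    "Z": {"teff": 25, "barley": 25, "wheat": 15, "sorghum": 15},
}

def weighted_mean(values, w):
    # WM = sum(w_i * x_i) / sum(w_i)
    return sum(w[k] * values[k] for k in w) / sum(w.values())

for farmer, prod in production.items():
    print(farmer, round(weighted_mean(prod, weights), 4))
```

The farmer with the largest weighted mean (Farmer Z, 21.4516) is the one selected, even though all three simple means are equal.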
Exercise
Let an examination be held to decide the award of a scholarship among three selected students, namely Gebre, Mandefro and Waktole. The weights of the various subjects were different (4 for mathematics, 3 for GeES, 2 for civics and 1 for language). The marks obtained by the three candidates (out of 100 in each subject) are given below.

Table 4.6
Subjects      Gebre   Mandefro   Waktole
Mathematics   60      57         62
GeES          62      61         67
Civics        55      53         60
Language      67      77         49
Then, decide who should be awarded the scholarship by calculating the weighted arithmetic mean.
Geometric Mean
It is the nth root of the product of n numbers (observations) or, equivalently, the antilog of the average of the logarithmic values of a data set. It is given by the formula

GM = (x1 × x2 × ... × xn)^(1/n)

or, in logarithmic form,

Log GM = (1/n)(Log x1 + Log x2 + ... + Log xn), hence GM = antilog[(1/n) Σ Log xi]

Then, how do you calculate a geometric mean? The easiest way to think of the geometric mean is that it is the average of the logarithmic values, converted back from logarithms. The actual definition, however, is that it is the nth root of the product of the n values.

Consider this example. Suppose you want to calculate the geometric mean of the numbers 2 and 32. This simple example can be done as follows. First, take the product (that means 2 × 32 = 64). Since there are two numbers, the nth root is the square root, and the square root of 64 is 8. Therefore the geometric mean of 2 and 32 is 8.

Now, let's solve the problem using logs. In this case, we will convert to base-2 logs so that we can solve the problem as follows. Converting our numbers, we have 2 = 2^1 and 32 = 2^5, so 2^1 × 2^5 = 2^6 = 64. The square root of 2^6 is 2^3, which is equal to 8. Of course, the short cut is to take the average of the two exponents (1 and 5), which is 3, and 2^3 is 8.

Merits and Demerits of Geometric Mean

Sharma, J.K (2004) writes the merits and demerits of the geometric mean as follows:

Merits
1. The value of GM is not much affected by extreme observations.
2. GM is calculated by taking all the observations into account.
3. It is useful in determining rates of increase or decrease.

Demerits
1. The calculation of GM, as compared to AM, is more difficult and intricate.
2. GM cannot be calculated when any of the observations in the data set is either negative or zero.
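The log-based definition translates directly into code; a minimal Python sketch (the function name is illustrative):

```python
import math

def geometric_mean(values):
    # Antilog of the mean of the logs: exp(mean(ln x)).
    # Equivalent to the nth root of the product of the n values.
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(geometric_mean([2, 32]))  # ≈ 8, as in the worked example
```

Note that the function fails (raises a math domain error) for zero or negative values, which is exactly the demerit listed above.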
Exercise
Calculate the geometric mean of 8, 24, 25, 26, 28 and 32.
Harmonic Mean
In mathematics, the harmonic mean is one of several kinds of average. Typically, it is appropriate for situations when the average of rates is desired. The harmonic mean H of the positive real numbers x1, x2, ..., xn is defined as

H = n / (1/x1 + 1/x2 + ... + 1/xn)

Note: Equivalently, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.

Demerits
1. It is not easy to calculate and understand compared to AM.
2. It is impossible to determine the harmonic mean if any of the values is zero and/or negative.
3. It is not a representative value of the distribution or the data set unless the analysis requires greater weight to be given to smaller items.
Exercise
Calculate the harmonic mean of 10, 14, 20, 22, 26 and 30. Note that for a given set of observations the following inequality holds:

Arithmetic Mean ≥ Geometric Mean ≥ Harmonic Mean
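The AM ≥ GM ≥ HM inequality can be checked numerically; a minimal Python sketch using the six values from the exercise above (the function names are illustrative):

```python
import math

def am(xs):
    return sum(xs) / len(xs)

def gm(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def hm(xs):
    # Reciprocal of the arithmetic mean of the reciprocals
    return len(xs) / sum(1 / x for x in xs)

data = [10, 14, 20, 22, 26, 30]
print(am(data), gm(data), hm(data))
assert am(data) >= gm(data) >= hm(data)
```

The inequality is strict unless all observations are equal, in which case the three means coincide.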
4.3. Median

The median is defined as the middle value in the data set when its elements are arranged in a sequential
order, that is, in either ascending or descending order of magnitude. It is called a middle value in an ordered sequence of data in the sense that half of the observations are smaller and half are larger than this value.
The median can be calculated for both ungrouped and classified data. In order to calculate it for ungrouped data, first the data should be arranged in ascending or descending order. If the number of observations (n) is odd, then the median (Med) is represented by the numerical value corresponding to the positioning point of the ((n + 1)/2)th order observation. Note that the median is only moderately useful for further statistical analysis, unlike the arithmetic mean, which is the most important measure of central tendency (measure of location) for further statistical analysis of a given dataset.
Example
Find the median value of the following data: 50 55 60 63 66 69 72 75 78

Here n = 9, so the median is the ((n + 1)/2)th = ((9 + 1)/2)th = 5th observation. Then, the median value of the data above is 66.

If the number of observations (n) is an even number, then the median is defined as the arithmetic mean of the numerical values of the (n/2)th and (n/2 + 1)th observations in the data array.

Example

Find the median value of the following data: 50 55 60 63 66 69 72 75 78 84

Here n = 10, which is even. The (n/2)th = (10/2)th = 5th value is 66, and the (n/2 + 1)th = 6th value is 69. Then Med = (66 + 69)/2 = 67.5, which is the median of the data above.

Median for Grouped Data: To find the median for grouped data, first identify the class interval which contains the median value, that is, the (n/2)th observation of the data set. To find such a class interval, find the first class for which the cumulative frequency is equal to or greater than n/2. The median is then given by

Med = L + ((n/2 − cf)/f) × h
Where,
L = lower limit of the median class interval
cf = cumulative frequency of the class prior to the median class interval, that is, the sum of all the class frequencies up to, but not including, the median class interval
f = frequency of the median class
h = width of the median class interval
n = total number of observations in the distribution
Example
Table 4.7 represents the dietary energy intake per person per day (kcal/day/person) of 100 rural households in one of the woredas in Ethiopia. Calculate the median value of the dietary energy intake in kilocalorie based on the discussion above.
Table 4.7: Dietary energy intake of 100 rural households

Energy intake (kcal/day/person)   No. of households (f)   Cumulative frequency (cf)
0 – 500                           20                      20
500 – 1000                        38                      58
1000 – 1500                       28                      86
1500 – 2000                       4                       90
2000 – 2500                       3                       93
2500 – 3000                       3                       96
3000 – 3500                       4                       100
Total                             100                     ---------

Here n/2 = 50, and the first cumulative frequency that equals or exceeds 50 is 58, so the median class is 500 – 1000:

Med = 500 + ((50 − 20)/38) × 500 = 894.74 kcal/day/person
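The grouped-data median formula can be sketched as a small Python function; the class limits below follow the dietary-energy example (assumed equal widths of 500), and the function name is illustrative.

```python
def grouped_median(classes):
    """classes: list of (lower_limit, frequency) pairs, equal class width.
    Uses Med = L + ((n/2 - cf) / f) * h, as in the text."""
    h = classes[1][0] - classes[0][0]      # class width
    n = sum(f for _, f in classes)
    cf = 0                                 # cumulative frequency before current class
    for lower, f in classes:
        if cf + f >= n / 2:                # first class whose cf reaches n/2
            return lower + ((n / 2 - cf) / f) * h
        cf += f

# Dietary energy intake classes: (lower limit in kcal, households)
table = [(0, 20), (500, 38), (1000, 28), (1500, 4),
         (2000, 3), (2500, 3), (3000, 4)]
print(round(grouped_median(table), 2))  # 500 + ((50-20)/38)*500 ≈ 894.74
```

The loop reproduces by hand the "search the cumulative frequencies" step described above.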
Demerits
1. In case of an even number of observations in ungrouped data, the median cannot be determined exactly.
2. The median, being a positional value, is not based on each item in a data set.
3. The median is not suitable for further mathematical treatment.
4. The median is more affected by fluctuations of sampling as compared to the arithmetic mean.
Exercise
Let us assume that you have collected data from a factory employing 80 workers during your project work for this course. Let your data indicate that the daily wage of 20 workers is less than 10 Eth. Birr, of 30 workers is 10 to 20 Eth. Birr, of 14 workers is 20 to 30 Eth. Birr, of 7 workers is 30 to 40 Eth. Birr, and of the remaining 9 workers ranges from 40 to 50 Eth. Birr. Then, calculate the median wage of the workers.
4.4. Mode
In statistics, the mode is the value that occurs most frequently in a data set or a probability distribution. Like the statistical mean and the median, the mode is a way of capturing important information about a random variable or a population in a single quantity. For instance, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.

Calculation of Mode in Grouped Data: In case of grouped data the following formula is used to calculate the mode:

Mode = L + ((f − f1)/(2f − f1 − f2)) × h

Where,
L = lower limit of the modal class interval
f = frequency of the modal class
f1 = frequency of the class preceding the modal class interval
f2 = frequency of the class following the modal class interval
h = width of the modal class interval
Example
Calculate the mode of the dietary energy intake in kilocalorie indicated in Table 4.7.
Steps: The largest frequency (38) corresponds to the class 500 – 1000. Then we have L = 500, f = 38, f1 = 20, f2 = 28 and h = 500.

Mode = 500 + ((38 − 20)/(2 × 38 − 20 − 28)) × 500 = 500 + 321.4 = 821.4
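The steps above can be sketched as a one-line Python function built directly on the grouped-mode formula; the parameter names match the "Where" list.

```python
def grouped_mode(L, f, f1, f2, h):
    # Mode = L + (f - f1) / (2f - f1 - f2) * h
    return L + (f - f1) / (2 * f - f1 - f2) * h

# Values from the dietary-energy example: modal class 500-1000
print(round(grouped_mode(L=500, f=38, f1=20, f2=28, h=500), 1))  # 821.4
```

Passing the five quantities by keyword makes it harder to mix up f1 (the preceding class) and f2 (the following class).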
Though a poor measure of central tendency, the mode has some advantages. It can be found by mere inspection of a simple frequency distribution. It is also unaffected by the presence of extreme values in a data set, and it can be calculated from frequency distributions with open-ended classes. However, the mode has no significance unless a large number of observations is available, and it has little or no use for further statistical analysis. You should also note that a data set may have one mode value (unimodal distribution), two mode values (bimodal distribution), three mode values (trimodal distribution) or many mode values (multimodal distribution).
Demerits
1. A data set may have more than one mode value, which makes comparison and interpretation more difficult.
2. It is difficult to locate the modal class in the case of multimodal frequency distributions.
3. The mode is not used for further rigorous statistical analysis.
Quartiles: The values of observations in a data set, when arranged in an ordered sequence, can be divided into four equal parts, or quarters, by three values, namely the first quartile (Q1), the second quartile (Q2) and the third quartile (Q3). The first quartile (Q1) divides a distribution in such a way that 25% (=1/4) of the observations have a value less than or equal to Q1 and 75% (=3/4) have a value greater than or equal to Q1. Similarly, Q2 has 50% of the items with values less than or equal to Q2, and Q3 has 75% of the items with values less than or equal to Q3.

For discrete data it is simple to locate the partition values. Arrange the data in order, if they are not already, and work out the cumulative frequencies. To find Q1, calculate (N + 1)/4 and search for the minimum cumulative frequency in which this value is contained; the variate value against this cumulative frequency is the value of Q1. For Q2, find 2(N + 1)/4, and the variate value corresponding to this cumulative frequency is Q2. Calculate Q3 from 3(N + 1)/4 in the same manner as Q1 and Q2. If the population or sample size is very large, N may be used in place of (N + 1) in the formula above, because in a very large sample or population the difference between the ratios of (N + 1) and N to the same denominator is negligible, unlike the case of a small size. We usually retain (N + 1) in geographic research since we usually make use of
smaller sample sizes as compared to other fields of study such as psychology.

Example

Calculate Q1, Q2 and Q3 for the hypothetical data given below.

Table 4.8: Daily income values 4 5 7 9 11 14 17 24 28, each with its frequency (N = 29)

Solution: Find where the 1(N + 1)/4 th, that is the (7.5)th, value is contained. This value lies at the cumulative frequency 11, so the Q1 of the data is 9. Continuing the calculation in the same way, Q2 and Q3 will be 11 and 17, respectively.
For a grouped set of data, to locate the ith quartile value, first calculate the i(N + 1)/4 th value and proceed as we do for ungrouped data above. This means: search for the minimum cumulative frequency in which the i(N + 1)/4 th value is contained. The class corresponding to this cumulative frequency is called the ith quartile class (1st, 2nd or 3rd quartile class). The general formula for calculating quartiles in case of grouped data is:

Qi = L + ((i(n + 1)/4 − cf)/f) × h;   i = 1, 2, 3

Where,
L = lower limit of the ith quartile class
cf = cumulative frequency prior to the ith quartile class
f = frequency of the ith quartile class interval
h = width of the class interval
Deciles: In descriptive statistics, a decile is any of the 9 values that divide the sorted data into 10 equal parts using nine deciles, Di (i = 1, 2, 3, ..., 9), so that each part represents 1/10th of the sample or population. The procedure for locating the ith decile class is to calculate i(N + 1)/10 and search for the minimum cumulative frequency in which this value is contained; the class corresponding to this cumulative frequency is the ith decile class. The general formula for calculating deciles in case of grouped data is:

Di = L + ((i(n + 1)/10 − cf)/f) × h;   i = 1, 2, 3, ..., 9
Percentile: Percentiles represent the values of observations in a data set, when arranged in an ordered sequence, divided into one hundred equal parts by ninety-nine percentiles, Pi (i = 1, 2, 3, ..., 99). So the 20th percentile is the value (or score) below which 20 percent of the observations may be found. The term percentile and the related term percentile rank are often used in descriptive statistics as well as in the reporting of scores from norm-referenced tests. The 25th percentile is also known as the first quartile (Q1); the 50th percentile as the median or second quartile (Q2); and the 75th percentile as the third quartile (Q3). The general formula for calculating percentiles in case of grouped data is:

Pi = L + ((i(n + 1)/100 − cf)/f) × h;   i = 1, 2, 3, ..., 99
Example

Let us assume that the following distribution (Table 4.9) gives the pattern of livestock population per household for 100 rural households in Woreda W in Ethiopia. Calculate the median, first quartile, 8th decile and 70th percentile of the grouped data.

Table 4.9: Livestock population per household

Livestock per household   No. of households (f)   Cumulative frequency (cf)
10 – 15                   15                      15
15 – 20                   8                       23
20 – 25                   25                      48
25 – 30                   4                       52
30 – 35                   33                      85
35 – 40                   8                       93
40 – 45                   7                       100
Total                     100                     ---------

Since the number of observations in the data set is 100, the median value is the (n/2)th = (100/2)th = 50th observation. This observation lies in the class interval 25 – 30. Applying the earlier formula, the median number of livestock can be calculated as:

Med = 25 + ((100/2 − 48)/4) × 5 = 25 + 2.5 = 27.50
To calculate Q1, first find where the i(N + 1)/4 th observation is contained, where N = 100 and i = 1. Then i(N + 1)/4 = 101/4 = the (25.25)th value, which is contained in the 20 – 25 class interval. So 20 – 25 is the Q1 class:

Q1 = 20 + ((25.25 − 23)/25) × 5 = 20.45
Similarly, to calculate D8, first find where the i(N + 1)/10 th observation is contained, where N = 100 and i = 8. Then 8(101)/10 = the (80.8)th value, which is contained in the 30 – 35 class interval. So 30 – 35 is the D8 or 8th decile class. Applying the formula above:

D8 = L + ((i(n + 1)/10 − cf)/f) × h = 30 + ((80.8 − 52)/33) × 5 = 34.36
You can also calculate P70 by using the same method as for Q1 and D8 above. Find where the i(N + 1)/100 th observation is contained, where N = 100 and i = 70. Then 70(101)/100 = the (70.7)th value, which is contained in the 30 – 35 class interval. So 30 – 35 is the P70 class:

P70 = L + ((i(n + 1)/100 − cf)/f) × h = 30 + ((70.7 − 52)/33) × 5 = 32.83
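Quartiles, deciles and percentiles all follow the same i(n + 1)/k positioning rule, so one Python function covers all three. The class limits below assume the livestock example's intervals run 10–15, 15–20, ..., 40–45; the function name is illustrative.

```python
def grouped_quantile(classes, i, k):
    """i-th k-tile for grouped data (k=4 quartiles, k=10 deciles,
    k=100 percentiles), using the i(n+1)/k positioning rule.
    classes: list of (lower_limit, frequency), equal class width."""
    h = classes[1][0] - classes[0][0]
    n = sum(f for _, f in classes)
    pos = i * (n + 1) / k                 # positioning point
    cf = 0                                # cumulative frequency before class
    for lower, f in classes:
        if cf + f >= pos:                 # quantile class found
            return lower + (pos - cf) / f * h
        cf += f

# Livestock-per-household classes: (lower limit, households)
table = [(10, 15), (15, 8), (20, 25), (25, 4), (30, 33), (35, 8), (40, 7)]
print(round(grouped_quantile(table, 1, 4), 2))     # Q1  = 20.45
print(round(grouped_quantile(table, 8, 10), 2))    # D8  = 34.36
print(round(grouped_quantile(table, 70, 100), 2))  # P70 = 32.83
```

Only the divisor k changes between the three partition measures, which is why the text's three formulas look identical apart from 4, 10 and 100.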
Exercise
Calculate Q3, D7 and P72 for the grouped data below.

Table 4.10:
Class Interval   Frequency
0 – 4            12
5 – 9            34
10 – 14          67
15 – 19          73
20 – 24          82
25 – 29          74
30 – 34          66
35 – 39          54
40 – 44          48
Remark
Partitioning (locational) values are not confined to quartiles, deciles and percentiles. One can also divide a distribution into any other number of parts, say 5, 15, 30, 60 or 90, by adjusting the divisor k in the positioning expression i(n + 1)/k accordingly. In each case the widths of the resulting intervals will, invariably, differ, but the share of observations falling in each part remains constant.
Relationships between Mean, Median and Mode: In a unimodal and symmetrical distribution, the values of the mean, median and mode are equal. In other words, when these three values are not all equal to each other (as in Figure 4.1 below), the distribution is not symmetrical.

Figure 4.1: Comparison of Mean, Median and Mode (frequency curves; figure not reproduced)

Conditions:
1. Unimodal and symmetrical distribution: Mean = Median = Mode.
2. Most of the values of observation fall toward the lower end and the tail stretches to the right (skewed to the right, or positively skewed): Mean > Median > Mode.
3. Most of the values of observation fall toward the higher end and the tail stretches to the left (skewed to the left, or negatively skewed): Mean < Median < Mode.
Exercise
Find the mean, median, mode, Q3, D4 and P65 of the grouped data set below and comment on the results. Give your judgment whether it is an asymmetrical or symmetrical distribution based on Karl Pearson's principle. Remark also whether it is a negatively or positively skewed set of data.

Table 4.11: Rural households by value of tradable material ownership in Kuyu Woreda (January 2001)
Value of materials (Eth. Birr)   No. of Households
0 – 50                           122
50 – 100                         95
100 – 150                        50
150 – 200                        68
200 – 250                        18
250 – 300                        11
300 – 350                        7
350 – 400                        14
400 – 450                        15
Total                            400
Source: Mesay M. (2002)
SPSS Practice
By using SPSS software, create a frequency table, pie chart and bar graph for the data given below. Edit the chart/graph (i.e. insert a title, insert the variables and percentages into the chart/graph, and select suitable colors) by using the SPSS Data Editor and Chart Editor windows.

Table 4.12: Farmland covered by major crops during 1999 crop year in Kuyu Woreda
Crops      Area in hectares
Teff       254.0
Sorghum    96.3
Wheat      64.0
Barley     14.6
Maize      6.8
Pulses     29.7
Oilseeds   54.1
Oats       1.0
Total      520.5
Source: Mesay M. 2002
Exercise
Let us assume that the mean and median of a certain skewed (asymmetrical) set of data are 117 and 86 units, respectively. Find the mode value of this data set and comment whether it is positively or negatively skewed.
5.1. Introduction
The term dispersion in statistical methods refers to the variability or spread in a data set. Measures of dispersion, also known as second-order analysis or measures of spread or variability, provide information about the spread of the scores in quantitative techniques. They help us to know whether the scores cluster around certain measures of central tendency or spread out over a large segment of the scale. Measures of dispersion are best employed when measures of central tendency fail to explicitly distinguish two or more distributions. For example, two sets of observations may have the same arithmetic mean, in which case it becomes difficult to identify or distinguish their specific differences. See the cases of the two sets of values below.
Data Set A: 90 95 100 105 110 — Sum = 500 units, Mean = 100 units, SD = 7.07 units, Range = 20 units
Data Set B: 80 90 100 110 120 — Sum = 500 units, Mean = 100 units, SD = 14.14 units, Range = 40 units
Look at the characteristics of the two data sets (Data Set A and Data Set B) above: in both cases the arithmetic mean is 100 units, which may lead to the wrong interpretation that the distributions are similar. The actual inference can be drawn only by using certain measures of dispersion such as the range, standard deviation and coefficient of variation.
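The contrast between the two data sets can be verified in a few lines of Python (population standard deviation, dividing by N, since each set is taken as a complete group):

```python
import math

def population_sd(xs):
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

a = [90, 95, 100, 105, 110]
b = [80, 90, 100, 110, 120]

# Same sum (500) and mean (100), but different spread
print(round(population_sd(a), 2), max(a) - min(a))   # 7.07  20
print(round(population_sd(b), 2), max(b) - min(b))   # 14.14 40
```

Identical means, but Data Set B has twice the standard deviation and twice the range of Data Set A, which is exactly what a measure of dispersion is meant to reveal.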
1. To judge how well an average represents its data: For instance, compare the means and the measure of variability (i.e. standard deviation) of the two data sets above. Data Set A is characterized by a much smaller measure of variability (SD), and hence its mean is a better representative value, than is the case for Data Set B.
2. To compare two or more sets of values with regard to their variability: Two or more sets of values can be compared by calculating the same or similar measures of dispersion. The set with the smaller value possesses less variability. Look at the example in serial number 1 above.
3. To provide information about the structure of a series: The value of a measure of dispersion gives an idea about the spread of the observations.
4. To pave the way for the use of other statistical measures: Measures of dispersion, especially the standard deviation and variance, lead to many other statistical techniques such as correlation, the coefficient of variation, regression and analysis of variance (ANOVA).

A measure of dispersion, therefore, is defined as a numerical value explaining the extent to which individual observations vary among themselves. Measures of dispersion can be broadly categorized into two: absolute and relative measures of dispersion. The main difference between the two is that absolute measures of dispersion express the heterogeneity of data numerically and are not free from the unit of measurement. Absolute measures, unlike relative ones, cannot be used to compare the degree of heterogeneity of two data sets that do not have the same unit of measurement. The following overview lists the measures of dispersion.
Solution: Max value = 12.5 °C; Range = Max value − Min value = 7.8 °C
Absolute dispersion: Range, Interquartile Range (IR), Quartile Deviation (QD), Mean Deviation (MD), Standard Deviation (SD), Variance
Relative dispersion: Coefficient of Variation (CV)

IR = Q3 − Q1
QD = (Q3 − Q1)/2
Exercise

The following are the mean maximum temperature data in °C for the month of January for 10 years (1998 – 2008), taken from Entoto weather station. Find the Interquartile Range (IR) and Quartile Deviation (QD) for the temperature records and statistically explain the result you have calculated.

19.1  18.9  19.1  19.5  19.2  19.9  20.3  20.2  21.1  21.2
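The IR and QD for this exercise can be sketched in Python using the (n + 1)/4 positioning rule from the quartiles section, with linear interpolation between ordered values (the helper name is illustrative):

```python
def quantile_pos(sorted_xs, pos):
    # Value at the (pos)-th ordered observation (1-based),
    # interpolating linearly between neighbours for fractional positions.
    lo = int(pos) - 1
    frac = pos - int(pos)
    if lo + 1 >= len(sorted_xs):
        return sorted_xs[-1]
    return sorted_xs[lo] + frac * (sorted_xs[lo + 1] - sorted_xs[lo])

temps = sorted([19.1, 18.9, 19.1, 19.5, 19.2, 19.9, 20.3, 20.2, 21.1, 21.2])
n = len(temps)
q1 = quantile_pos(temps, (n + 1) / 4)        # the (2.75)th value
q3 = quantile_pos(temps, 3 * (n + 1) / 4)    # the (8.25)th value
ir = q3 - q1        # interquartile range
qd = ir / 2         # quartile deviation
print(q1, q3, round(ir, 2), round(qd, 2))
```

With this positioning rule Q1 = 19.1 and Q3 = 20.5, giving IR = 1.4 °C and QD = 0.7 °C; other quantile conventions would give slightly different values.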
Mean Deviation

The mean deviation is the arithmetic mean of the absolute deviations of the observations from a central value, generally the arithmetic mean. For ungrouped and grouped data, respectively:

MD = Σ|xi − x̄| / N    and    MD = Σf|xi − x̄| / N

Where the letters have their usual meanings and | | means ignoring the negative signs: the two vertical bars in the formula indicate absolute values, that is, values omitting the signs. In case of grouped data, the mid-point of each class interval is treated as xi.
Exercise
Calculate mean deviation for the data given below.
27.9  28.1  28.7  27.2  27.8  28.1  27.5  27.7

Source: NMSA
Variance
It is simply the mean of the squared deviations from a central value, generally the arithmetic mean (i.e., the average of the squared deviations). The symbol for variance is s² for a sample or σ² for a population, accompanied by a subscript for the corresponding variable. The main demerit of variance is that its unit is the square of the unit of measurement of the values, and the resulting value is large, making it difficult to judge the magnitude of variation. Here are the formulas for the variance of a variable x for ungrouped or raw data:

σ² = Σ(Xi − μ)² / N    and    s² = Σ(xi − x̄)² / (n − 1)

Where,
σ² = population variance, μ = population mean
s² = sample variance, x̄ = sample mean
Variance is an important measure in statistics, particularly in assessing variation between two or more samples of a population. One very powerful statistical technique known as analysis of variance (ANOVA) uses variance to help decide whether two or more sets of samples differ significantly from each other. ANOVA will be discussed later in this course. In case of grouped data, the mid-values of the classes are taken as xi, and consequently we can make use of the formulas below:
σ² = Σf(Xi − μ)² / N    and    s² = Σf(xi − x̄)² / (n − 1)

The numerator, Σf(xi − x̄)², is called the total sum of squares (TSS). TSS measures the total variation among the values in a data set, while variance measures the average variation. The larger the value of TSS or variance, the greater the variation among the values of the data set.
Properties of Variance
1. The deviations are squared only to get rid of negative signs; if they were not squared, a meaningful central value could not be obtained because Σ(Xi − X̄) is always zero.
2. The main drawback of variance is that its unit is the square of the unit of measurement of the observations. This results in a larger value, making the variance more difficult to interpret.
3. The variance gives more weight to the extreme values as compared to those which are near the mean value, because each difference is squared.
Standard Deviation
Standard deviation, also known as root mean squared deviation, expresses the average amount of variation on either side of the mean. It is considered the best measure of dispersion and is the most widely used, even in advanced techniques. The population (σ) and sample (s) standard deviations are the positive square roots of their respective variances and have the desirable property of being in the same units as the data: if the data are in hectares, the standard deviation is in hectares as well. The standard deviation formula for ungrouped (raw) sample data is
s = √( Σ(xi − x̄)² / (n − 1) )
Notice the difference between the sample and population standard deviations. The sample standard deviation uses n-1 in the denominator, hence is slightly larger than the population standard deviation which uses N. The SD formula for ungrouped population data is
σ = √( Σ(xi − μ)² / N )
We have already discussed the use of Greek letters for population parameters vs sample statistics. This is why s is used for the sample standard deviation and σ (sigma) is used for the population standard deviation. However, another sigma, the capital one (Σ), appears inside the formula. It serves to indicate that we are adding things up. What are added up are the deviations from the mean, (xi − x̄). But the average deviation from the mean is actually zero. Occasionally the mean deviation, which averages the distances from the mean using the absolute value |x − x̄|, is used. However, a better measure of variation comes from squaring each deviation, summing those squares, and then taking the square root after dividing by the number of data elements (or one less than that). If you compare this with the formula for the quadratic mean you will realize we are doing the same thing, except for what we are dividing by. The SD formulas for grouped population and sample data are:
σ = √( Σf(xi − μ)² / N )    and    s = √( Σf(xi − x̄)² / (n − 1) )
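The difference between the sample (n − 1) and population (N) denominators can be sketched in Python; the function name and test data are illustrative.

```python
import math

def variance(xs, sample=True):
    mean = sum(xs) / len(xs)
    ss = sum((x - mean) ** 2 for x in xs)        # total sum of squares (TSS)
    return ss / (len(xs) - 1 if sample else len(xs))

data = [4, 8, 6, 5, 3, 7]
s2 = variance(data, sample=True)         # divides by n - 1
sigma2 = variance(data, sample=False)    # divides by N
print(s2, math.sqrt(s2), sigma2, math.sqrt(sigma2))
```

The sample variance is always slightly larger than the population variance for the same numbers, which is the point made in the text about s versus σ.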
Coefficient of Quartile Deviation (CQD):

CQD = (Q3 − Q1) / (Q3 + Q1)

Coefficient of Variation (CV): the ratio of the standard deviation to the mean, CV = s / x̄. Both are often reported as percentages (%) by multiplying the formula by 100. The coefficient of variation is useful because the standard deviation of data must always be understood in the context of the mean of the data. It is a dimensionless number. So when comparing
between data sets with different units or wildly different means, one should use the coefficient of variation for comparison instead of the standard deviation. When the mean value is near zero, the coefficient of variation is sensitive to small changes in the mean, limiting its usefulness.
Example
Let the following table indicate 65 sample farmers by their annual crop output in Adama district. Then find the mean deviation, variance, standard deviation and coefficient of variation (CV) of the data.

Table 5.1: Number of Sample Farmers by their Annual Crop Output (Hypothetical Data)
Crop Output in Quintals   Number of Farmers
10 – 19                   3
20 – 29                   5
30 – 39                   8
40 – 49                   14
50 – 59                   12
60 – 69                   10
70 – 79                   7
80 – 89                   4
90 – 99                   2
In order to calculate the requested values, we must first find the class midpoint (x), f, fx, the mean (x̄), |x − x̄|, f|x − x̄|, (x − x̄)² and f(x − x̄)². Look at the table below.
Table 5.2:
Output    Mid-point (x)   f    fx       |x − x̄|   f|x − x̄|   (x − x̄)²   f(x − x̄)²
10 – 19   14.5            3    43.5     38         114         1444        4332
20 – 29   24.5            5    122.5    28         140         784         3920
30 – 39   34.5            8    276.0    18         144         324         2592
40 – 49   44.5            14   623.0    8          112         64          896
50 – 59   54.5            12   654.0    2          24          4           48
60 – 69   64.5            10   645.0    12         120         144         1440
70 – 79   74.5            7    521.5    22         154         484         3388
80 – 89   84.5            4    338.0    32         128         1024        4096
90 – 99   94.5            2    189.0    42         84          1764        3528
Total     ---             65   3412.5   202        1020        ---         24240
Solution:

Mean (x̄) = Σfx / n = 3412.5 / 65 = 52.5

MD = Σf|x − x̄| / n = 1020 / 65 = 15.69

Variance (taking the 65 farmers as the whole group): σ² = Σf(x − x̄)² / N = 24240 / 65 = 372.92

SD = √372.92 = 19.31

CV = SD / x̄ = 19.31 / 52.50 = 0.37 = 37%
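Assuming the midpoints and frequencies from Table 5.2, a minimal Python sketch reproduces the example's figures (dividing by N, as the worked answer does):

```python
import math

# Class midpoints and frequencies from Table 5.2
mid = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
f   = [3, 5, 8, 14, 12, 10, 7, 4, 2]

n = sum(f)                                                    # 65
mean = sum(fi * xi for fi, xi in zip(f, mid)) / n             # 52.5
md = sum(fi * abs(xi - mean) for fi, xi in zip(f, mid)) / n   # mean deviation
var = sum(fi * (xi - mean) ** 2 for fi, xi in zip(f, mid)) / n
sd = math.sqrt(var)
cv = sd / mean * 100                                          # percent

print(mean, round(md, 2), round(var, 2), round(sd, 2), round(cv, 1))
```

Every column of Table 5.2 corresponds to one of the comprehensions above, so the script is a direct transcription of the hand calculation.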
Though it is beyond the scope of this resource material to deal with them, there are also statistical techniques to measure inequalities, concentrations and diversification of a variable. These include, for instance, the Lorenz Curve, Gini Coefficient, Theil Index of Inequality, Herfindahl-Hirschman Index and Tideman-Hall Index.

Exercises

1. The following table contains hypothetical data on the number of urban households by their monthly income in Kebele K.
a) Find the mean deviation, variance, standard deviation and coefficient of variation (CV) of the data based on the example given above.

Table 5.3:
Monthly income (Eth. Birr)   Number of Households
501 – 750                    16
751 – 1000                   24
1001 – 1250                  19
1251 – 1500                  28
1501 – 1750                  10
1751 – 2000                  12
2001 – 2250                  8
2251 – 2500                  6
2501 – 2750                  6
2751 – 3000                  2
Total                        131
b) Try to answer the question above by using SPSS software and compare your answers with the one you have done in question (a).
2. Calculate the variance, standard deviation and coefficient of variation for the two sets of data below and comment on the results you have calculated. Which set of data is more dispersed?

Table 5.4:
Data Set x: 450 465 460 765 500 496 440 392 389 871 986 567 457 987 876 234 345 567 987 432
Data Set y: 45 54 87 56 112 76 232 45 132 34 233 54 546 345 435 67 342 10 23 27
3. Calculate the mean maximum values, standard deviation and coefficient of variation for the temperature data given below. Comment on the results you have found.

Table 5.5: Monthly maximum temperature of Adama in °C (2000 – 2005)

Year   Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
2000   27.3   28.3   30.2   30.3   30.8   29.8   25.9   25.6   26.7   25.8   25.8   25.3
2001   24.7   27.8   28.5   29.8   29.6   28.2   25.8   25.3   27.4   28.7   27.3   26.8
2002   26.2   29.2   29.9   31     32.8   30.9   29.7   26.6   28.2   29.8   28.6   26.5
2003   27     29.5   29.8   29.1   32.4   30.1   25.5   25.9   27     28.5   27.5   25.4
2004   27.8   28.2   29.3   28.7   *      29.6   26.3   26     27.3   26.7   26.8   26.2
2005   26.6   29.6   29.9   30.2   29.1   29.6   25.6   26.7   27.6   28.9   27.5   26.3
Mean Max / SD / CV (%): to be calculated for each month

Source: NMSA
6.1. Introduction
In the previous chapters we have discussed measures of location and variation of a data set to describe the nature of individual values in the data set. However, the analysis of a data set still remains incomplete until we measure the degree to which these individual values in the data set deviate from symmetry on both sides of the central value and the direction in which these are distributed.
Graphically, in the case of a symmetrical distribution the lengths of the two segments of the curve on either side of the peak point are equal, while in an asymmetrical curve one of the segments or tails is longer than the other. Look at Figure 6.1.

1. Negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure. It has a few relatively low values. The distribution is said to be left-skewed. In such a distribution, the mean is lower than the median, which in turn is lower than the mode (i.e. mean < median < mode), in which case the skewness coefficient is lower than zero.

Figure 6.1: Symmetrical and Skewed Distributions

2. Positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. It has a few relatively high values. The distribution is said to be right-skewed. In such a distribution, the mean is greater than the median, which in turn is greater than the mode (i.e. mean > median > mode), in which case the skewness coefficient is greater than zero.

3. Symmetric distribution: If there is no skewness, or the distribution is symmetrical (and unimodal) like the bell-shaped normal curve, then mean = median = mode.

The degree of skewness can be measured by both absolute skewness and relative skewness (coefficient of skewness). For an asymmetrical distribution, the distance between the mean and the mode may be used to measure the degree of skewness, because the mean is equal to the mode in a symmetrical distribution:

Absolute Skewness (Sk) = Mean − Mode
For a positively skewed distribution Mean > Mode, and therefore Sk is positive; otherwise it is negative. Other than the above stated absolute method of measuring skewness, there are at least three important relative measures of skewness: Karl Pearson's Coefficient of Skewness, Bowley's Coefficient of Skewness and Kelly's Coefficient of Skewness. For the purpose of this course, you will observe the procedure suggested by Karl Pearson for measuring the coefficient of skewness:

SkP = (Mean − Mode) / SD

Where SkP = Karl Pearson's Coefficient of Skewness. Since a mode does not always exist uniquely in a distribution, it is convenient to define the measure using the median. From our previous discussion in this course, you know the relationship between mode, median and mean:

Mode = 3Median − 2Mean

Substituting, Mean − Mode = 3(Mean − Median), so that

SkP = 3(Mean − Median) / SD
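Pearson's median-based coefficient is a one-line computation; the sketch below feeds it the mean (35.875) and SD (about 6.5) from the worked example that follows, together with an assumed median of about 35.95 consistent with that example's mode of roughly 36.1 via Mode = 3Median − 2Mean.

```python
def pearson_skewness(mean, median, sd):
    # Sk_p = 3 * (mean - median) / SD
    return 3 * (mean - median) / sd

print(round(pearson_skewness(35.875, 35.95, 6.5), 3))  # ≈ -0.035
```

A negative result indicates a slight left skew, matching the interpretation given in the example.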
Example
Calculate the mode, mean, median and Pearson's Coefficient of Skewness for the set of data indicated below. Comment on the results!

Table 6.1:
Class Interval   Frequency (f)
21 – 25          5
26 – 30          15
31 – 35          41   (f1)
36 – 40          42   (modal class, f)
41 – 45          2    (f2)
46 – 50          12
51 – 55          3
Total            N = 120
Solution
Look at the table above: the mode lies in the class 36 – 40 by inspection. Then, by applying the formula we discussed earlier, the mode can be calculated as:

Mode = L + ((f − f1)/(2f − f1 − f2)) × h = 36 + ((42 − 41)/(2 × 42 − 41 − 2)) × 4 = 36.098
To calculate the mean, we can make use of any method (formula) we have seen in our earlier discussions.

Table 6.2:
Class Interval   Mid-point (x)   Frequency (f)   fx
21 – 25          23              5               115
26 – 30          28              15              420
31 – 35          33              41              1353
36 – 40          38              42              1596
41 – 45          43              2               86
46 – 50          48              12              576
51 – 55          53              3               159
Total            ---             120             Σfx = 4305
Mean
4305 120
35.875
Now we can calculate the remaining measure of central tendency that we need, the median, by using the grouped-data median formula from our earlier discussions.
We are now left with the standard deviation, so that we can easily calculate Pearson's coefficient of skewness. The standard deviation can be calculated with the formula below:

    SD = √( Σf(x - Mean)² / N )

Table 6.3 (the mean has been rounded to 36 in computing the deviations):

Class Interval   Midpoint (x)   Frequency (f)   x - Mean   f(x - Mean)²
21 - 25          23             5               -13        845
26 - 30          28             15              -8         960
31 - 35          33             41              -3         369
36 - 40          38             42              2          168
41 - 45          43             2               7          98
46 - 50          48             12              12         1728
51 - 55          53             3               17         867
Total                           120                        5035

    SD = √(5035 / 120) = √41.96 ≈ 6.5
    SkP = (Mean - Mode) / SD = (35.875 - 36.098) / 6.5 ≈ -0.034
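The worked example above can be reproduced in a few lines of Python. This is a minimal sketch, not part of the original text; the class width h = 4 follows the text's own working, and the standard deviation uses the rounded mean of 36 as in Table 6.3.

```python
# Sketch: mode, mean, SD and Pearson's skewness for the grouped data above.
from math import sqrt

classes = [(21, 25), (26, 30), (31, 35), (36, 40), (41, 45), (46, 50), (51, 55)]
freqs = [5, 15, 41, 42, 2, 12, 3]
n = sum(freqs)                                        # 120

# Mean from the class midpoints
mids = [(lo + hi) / 2 for lo, hi in classes]
mean = sum(f * x for f, x in zip(freqs, mids)) / n    # 4305 / 120 = 35.875

# Mode from the modal class (the class with the largest frequency); h = 4 as in the text
i = freqs.index(max(freqs))                           # modal class 36 - 40
L, f1, f0, f2, h = classes[i][0], freqs[i], freqs[i - 1], freqs[i + 1], 4
mode = L + (f1 - f0) / (2 * f1 - f0 - f2) * h

# Standard deviation about the rounded mean 36, as in Table 6.3
sd = sqrt(sum(f * (x - 36) ** 2 for f, x in zip(freqs, mids)) / n)

# Pearson's coefficient of skewness using the mode
sk_p = (mean - mode) / sd
print(mean, round(mode, 3), round(sd, 2), round(sk_p, 3))
```

The slightly negative result confirms the distribution is very mildly skewed to the left.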
Since the coefficient of skewness is negative (SkP ≈ -0.034), the distribution is slightly skewed to the left (slightly negatively skewed); the values of the distribution are concentrated slightly towards the lower values, to the extent of about 3.4%.

Remark: Skewness is an important concept in geographical statistics because very many of the variables measured in geographical studies show highly skewed distributions. This fact has two important consequences. First, it casts doubt upon the validity of applying parametric statistical tests to such data: a high degree of skewness is one sign that sample data are not normally distributed, and they are therefore unlikely to come from a population which is normally distributed. Remember that a normal distribution is symmetrical and has a skewness of zero. Secondly, other descriptive measures, particularly the mean, may be misleading if used in isolation: in a highly skewed data set the mean on its own is not a very informative measure. Besides the measures discussed above, there are other statistical measures of skewness. The most common one, properly known as momental skewness, is calculated using the following equation:
    Skewness = Σ(x - x̄)³ / (nσ³)

Where Σ(x - x̄)³ denotes the sum of the cubes of the deviations of the values from their mean, σ is the standard deviation and n is the number of values. The value of skewness for a symmetrical distribution is zero. Logically, positive values of the index indicate positive skewness and negative values indicate negative skewness.

6.3. Moments

The measure of moments includes the measures of the mean, average deviation and standard deviation. The value of these measures is obtained by taking the deviations of individual observations from a given origin. As in physics, the term moment is affected by (i) the size of the class interval, representing the force, and (ii) the deviation of the mid-value of each class from an observation, representing the distance. Moments can be calculated about the mean, an arbitrary mean, zero (the origin), or any arbitrary value.
Mesay Mulugeta, 2009
Then the rth moment about the actual mean of a variable is, for ungrouped data:

    mr = Σ(x - x̄)^r / n,    r = 1, 2, 3, 4

and for grouped data:

    mr = Σf(x - x̄)^r / n,    r = 1, 2, 3, 4
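The grouped-data moment formula can be sketched as follows. This is a minimal illustration, not from the original text, using the distribution of Table 6.2 above as sample input.

```python
# Sketch: the r-th moment about the mean for a grouped frequency distribution.
mids  = [23, 28, 33, 38, 43, 48, 53]   # class midpoints (Table 6.2)
freqs = [5, 15, 41, 42, 2, 12, 3]      # class frequencies
n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, mids)) / n

def moment(r):
    """r-th moment of the grouped distribution about its mean."""
    return sum(f * (x - mean) ** r for f, x in zip(freqs, mids)) / n

m1, m2, m3, m4 = (moment(r) for r in (1, 2, 3, 4))
# m1 is always zero (deviations about the mean cancel out); m2 is the variance.
```

The second, third and fourth moments are the building blocks of the skewness and kurtosis measures discussed in this unit.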
Exercise
Calculate the first four moments about the mean for the following grouped data and explain the results you have found. Explain whether the distribution is symmetric or asymmetric, and whether it is positively or negatively skewed. Comment also on whether the distribution is leptokurtic, platykurtic or mesokurtic.

Table 4.4:

Class Interval   22 - 26   27 - 31   32 - 36   37 - 41   42 - 46   47 - 51   52 - 56
Frequency        2         5         4         8         10        9         3
6.4. Kurtosis
Kurtosis (from the Greek word kyrtos or kurtos, meaning bulging) describes the degree of concentration of frequencies in a given distribution. That is, whether the observed values are concentrated more around the mode (a peaked curve) or away from the mode towards both tails of the frequency curve. In other words, the term kurtosis in statistics refers to the degree of flatness or peakedness in the region about the mode of a frequency curve.
Note that two or more distributions may have identical averages, variation and skewness, yet show different degrees of concentration of values around the mode, and hence different degrees of peakedness. The usual measure of kurtosis is calculated with the following equation:

    Kurtosis = [Σ(x - x̄)⁴ / n] / [Σ(x - x̄)² / n]²  =  Σ(x - x̄)⁴ / (nσ⁴)

Where Σ(x - x̄)⁴ denotes the sum of the fourth powers of the deviations of the values from the mean, σ is the standard deviation and n is the number of values. A normal distribution has a kurtosis of 3.0, while a very peaked (leptokurtic) distribution and a very flat (platykurtic) distribution have a kurtosis greater than 3.0 and less than 3.0, respectively. Kurtosis is based on the size of a distribution's tails: distributions with relatively large tails are called leptokurtic; those with small tails are called platykurtic. A distribution with the same kurtosis as the normal distribution is called mesokurtic. A second definition, excess kurtosis, subtracts 3 so that the standard normal distribution has a kurtosis of 0; with this second definition, positive kurtosis indicates a peaked distribution and negative kurtosis indicates a flat distribution. Like skewness, kurtosis gives valuable information about the distribution of a set of data values, in addition to that provided by the mean and standard deviation.
Exercises
A) Find the variance, skewness and kurtosis of the following frequency distribution by the method of moments. Explain the results you have calculated.

Class Interval   58 - 61   62 - 65   66 - 69   70 - 73   74 - 77
Frequency        11        14        11        6         8
B) Compute the first four moments about the mean from the following data. Comment on the result you have found.

x   17   22   27   32   37   42
f   8    12   13   16   5    7
7.1. Introduction
Although there may be disagreement in detail about the nature and purpose of geography, it can reasonably be argued that one of the central themes of geographical enquiry is the appreciation of the distribution of phenomena on the earth's surface. In view of the enthusiasm with which geographers embraced the 'quantitative revolution', it is perhaps surprising that remarkably little progress has been made in the development and application of spatial statistics. Although a wide variety of statistical techniques have been applied by geographers to data collected on an areal basis, only a handful of techniques are in common use for analyzing the spatial distribution of these data.
There are several possible explanations for this state of affairs. First, there is the lack of pre-existing theory and methods of statistical analysis applicable to spatial distributions. Neither geographers nor statisticians have shown a great deal of interest in spatial statistics in the past, and geographers are in a minority in demanding them now. Secondly, many of the techniques which do exist are, at least at first sight, difficult to understand and tedious to use. It is certainly true that some of the most advanced spatial techniques cannot be applied to real problems and data without the aid of a computer. Currently, very useful computer software, such as ArcGIS, has been introduced into this specific area of geography; ArcGIS and related software packages have remarkably eased spatial geographic data analysis.
Despite these reservations, there are several simple spatial techniques which can be applied using manual methods of calculation, and which are intuitively easy to understand. It may well be that the present underuse of spatial techniques is merely due to lack of publicity. Various types of phenomena can be studied using these techniques: points, lines and areas; and a number of different characteristics can be measured, including central tendency, dispersion, shape, pattern and spatial relationships.
phenomenon. As a first step in calculating the mean centre it is necessary to devise some way of quantifying the locations of the points with a co-ordinate system. This can be done by reading off the co-ordinates of each point from a set of rectangular axes laid down on the map showing the locations. Then, with reference to these X and Y axes, the co-ordinates can be measured either in centimetres/inches or in ground distances by using the map scale. For the calculation of most spatial statistics the positions of points need to be measured in relation to some such co-ordinate system. The orientation of the co-ordinate grid is quite arbitrary, however. Geographers are used to measuring location in terms of eastings and northings, but there is no reason why they should not use, say, south-eastings and north-eastings. Similarly the origin of the grid, the point from which the co-ordinates are measured, is arbitrary. For example, the national grid origin
of Ethiopia is found at 0°N latitude and 34°30'E longitude. The only prerequisites of a co-ordinate system to be used in the calculation of spatial statistics are: 1. The co-ordinate axes must be at right angles to each other; in other words they must be orthogonal axes. 2. Measurements along the two axes must be made in the same units. In Figure 7.1 below, an arbitrary co-ordinate system has been superimposed, with its origin at the bottom left-hand corner. For simplicity the horizontal axis, measuring eastings, has been labeled x, and the vertical axis, measuring northings, has been labeled y. The axes have been marked off in arbitrary distance units. The co-ordinates of all points are given in Table 7.1. Figure 7.1: Identification of the Spatial Mean Centre
[Figure: the eight towns T1 - T8 plotted on the co-ordinate grid; map scale 1:50,000.]
The mean centre can now be found simply by calculating the mean of the x co-ordinates (eastings) and the mean of the y co-ordinates (northings). These two mean co-ordinates mark the location of the mean centre. The equations for the mean centre are thus:

    x̄ = Σx / n,    ȳ = Σy / n
where x and y are the co-ordinates of the points and n is the number of points. The calculation of the mean centre for Figure 7.1 is given in Table 7.1.

Table 7.1: Finding the spatial mean centre (co-ordinate values in cm)

Point (Town)   X      Y
T1             1.0    6.7
T2             6.5    6.6
T3             11.6   6.7
T4             1.9    4.4
T5             10.9   4.4
T6             7.0    3.1
T7             4.1    0.9
T8             12.0   1.0
Total          55.0   33.8

    x̄ = 55.0 / 8 = 6.875
    ȳ = 33.8 / 8 = 4.225

The co-ordinates of the spatial mean centre are therefore (6.875, 4.225). We can then easily find and locate the spatial mean centre of the towns in Figure 7.1 above. The spatial mean centre of population can also be calculated, provided that the population size of each town (P1, P2, ...) is given.
The simple formulas to calculate the spatial mean centre of population are:

    X co-ordinate = (P1X1 + P2X2 + P3X3 + ... + PnXn) / (P1 + P2 + P3 + ... + Pn)
    Y co-ordinate = (P1Y1 + P2Y2 + P3Y3 + ... + PnYn) / (P1 + P2 + P3 + ... + Pn)
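Both the unweighted and the population-weighted mean centre can be sketched as follows. The co-ordinates are those of Table 7.1; the population figures are hypothetical, added only to illustrate the weighted formula.

```python
# Sketch: unweighted and population-weighted spatial mean centres.
xs = [1.0, 6.5, 11.6, 1.9, 10.9, 7.0, 4.1, 12.0]   # eastings (Table 7.1)
ys = [6.7, 6.6, 6.7, 4.4, 4.4, 3.1, 0.9, 1.0]      # northings (Table 7.1)

n = len(xs)
mean_centre = (sum(xs) / n, sum(ys) / n)           # unweighted mean centre

pops = [5000, 5200, 2800, 7800, 3200, 1900, 4300, 6000]  # hypothetical P1..P8
total = sum(pops)
weighted_centre = (sum(p * x for p, x in zip(pops, xs)) / total,
                   sum(p * y for p, y in zip(pops, ys)) / total)
print(mean_centre, weighted_centre)
```

The weighted centre is pulled towards the more populous towns, but always stays inside the bounding box of the points.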
[Figure: the towns T1 - T8 with the spatial mean centre (SMC) marked.]
The advantage of the median centre is that its location can be found very quickly, without resorting to any mathematics other than counting points. The disadvantage is that its location depends on the orientation of the two lines used to divide up the point distribution. As a result, the location of the median centre cannot be determined uniquely, and its use should be restricted to preliminary geographical investigations, where speed may be more important than accuracy.
Example
Locate the spatial median centre of population for the towns T1, T2, ..., T10, by using the population data/sizes P1, P2, ..., P10 given below.

[Figure: the ten towns T1 - T10 plotted on the map.]

P1 = 5000    P2 = 5200    P3 = 2800    P4 = 7800    P5 = 3200
P6 = 1900    P7 = 4300    P8 = 6000    P9 = 3450    P10 = 4280
Solution
First you have to calculate half of the total population:

    (P1 + P2 + ... + P10) / 2
    = (5000 + 5200 + 2800 + 7800 + 3200 + 1900 + 4300 + 6000 + 3450 + 4280) / 2
    = 43,930 / 2 = 21,965

The spatial median centre then lies at the intersection of two orthogonal lines, each of which divides the total population into two halves of 21,965 persons.
Exercise
Let us assume that the following figure indicates the distribution of hypothetical towns (T1, T2, ...) in a hypothetical region R. Then answer the following questions based on the data given about the assumed map.
[Figure: the eleven towns T1 - T11 plotted in region R; map scale 1:100,000.]
A. Find the spatial mean centre and median centre of the towns.

Table 7.2: The towns' population data (hypothetical)

Town   Population in 1995   Population in 2005
T1     200                  700
T2     250                  650
T3     400                  600
T4     120                  220
T5     80                   180
T6     456                  756
T7     250                  550
T8     550                  850
T9     350                  850
T10    680                  780
T11    450                  550
B. Calculate the spatial mean and median centres of population for the two periods, using the populations of each town given in Table 7.2. Comment on the direction of shift of the spatial population mean centre between the two periods.
Exercise
Let us assume that the following figure indicates the distribution of towns (T1, T2, ...) in a certain region R. Which one of the towns can serve as the centre of minimum travel for the whole eight towns indicated in the map?
[Figure: the eight towns T1 - T8 plotted in region R; map scale 1:50,000.]
The standard distance, the spatial analogue of the standard deviation, is given by:

    Standard Distance = √( Σd² / n )

where d is the distance of each point from the mean centre and n is the number of points.
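The formula can be sketched as follows, using the eight town co-ordinates of Table 7.1 as sample input (a minimal illustration, not from the original text).

```python
# Sketch: standard distance = sqrt(sum(d_i^2) / n), where d_i is the distance
# of each point from the mean centre of the point pattern.
from math import sqrt

points = [(1.0, 6.7), (6.5, 6.6), (11.6, 6.7), (1.9, 4.4),
          (10.9, 4.4), (7.0, 3.1), (4.1, 0.9), (12.0, 1.0)]
n = len(points)
cx = sum(x for x, _ in points) / n    # mean centre, x
cy = sum(y for _, y in points) / n    # mean centre, y

std_distance = sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)
print(round(std_distance, 2))
```

A larger standard distance indicates a more dispersed point pattern around the mean centre.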
Exercise
Calculate the standard distance for the point pattern shown below and comment on the result you have found if the points represent the distribution of towns.
[Figure: the seven towns T1 - T7 plotted on the map; map scale 1:100,000.]
Having located the mean centre, it is possible to measure all distances directly from the map, square them, add up the squares, divide by the number of points and then take the square root. Though there are various methods, this will be the simplest and quickest way of calculating the standard distance for many map distributions.
1. Elongation ratio

    Er = (diameter of a circle with the same area as the shape) / (length of the longest axis of the shape)

As Er approaches 1, the shape is more and more compact (circular); for the most elongated shapes, such as a straight line, Er approaches 0. In short: Er = 1 for a perfectly compact shape; Er = 0 for an elongated shape such as a straight line.
2. Form ratio

    Fr = A / L²

where A is the area of the shape and L is the length of its longest axis.

Example

Find the form ratio for a square which has an area of one square kilometre (A = 1 km²). The longest axis of a square is its diagonal, so L = √2 km.

    Fr = 1 / (√2)² = 1/2 = 0.5
3. Circularity ratio

    Cr = 2√(πA) / P

where A is the area and P the perimeter of the shape; this compares the circumference of a circle with the same area as the shape to the shape's actual perimeter.

Example

Calculate the Cr value for the square given in the example above (A = 1 km², P = 4 km).

    Cr = 2√(π × 1) / 4 ≈ 0.886
Example

Calculate Cole's compactness ratio for a square-shaped area with each side measuring 1 kilometre.

    CI = AA / ACSP

Where, AA = actual area and ACSP = area of a circle with the same perimeter. For the square, P = 4 km, so ACSP = P² / 4π = 16 / 4π ≈ 1.273 km², and CI = 1 / 1.273 ≈ 0.785.

The value of CI ranges between 0 and 1, indicating the most elongated and the most perfectly compact shapes, respectively.
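The shape indices for the unit square can be checked with a short script (a sketch; the form-ratio and circularity-index definitions follow the text above).

```python
# Sketch: shape indices for a 1 km x 1 km square.
from math import pi, sqrt

area, perimeter = 1.0, 4.0                 # unit square

# Circularity index: CI = A / (area of a circle with the same perimeter)
ci = area / (perimeter ** 2 / (4 * pi))    # equivalently 4*pi*A / P^2

# Form ratio: Fr = A / L^2, with L the longest axis (the diagonal, sqrt(2) km)
fr = area / sqrt(2) ** 2

print(round(ci, 3), round(fr, 3))
```

A circle gives CI = 1; the further a shape departs from circularity, the smaller CI becomes, which is the basis of the country-shape exercise below.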
Exercise
Calculate the circularity index (CI) and the compactness ratios for each of the countries indicated below. Explain the results you have found.

Table 7.3:

Country    Area (km²)   Boundary (km)
Ethiopia   1,106,000    5,290
Djibouti   22,000       820
Eritrea    117,400      2,420
Kenya      582,644      3,600
Somalia    637,657      5,100
Sudan      2,502,813    7,192
Project Work
You know that Ethiopia is currently divided into 9 regional states and 2 city administrations. Visit any nearby governmental or non-governmental offices, libraries or websites, find both the area and the boundary length of each of the regions/city administrations, and analyze their shapes based on what you have learnt in this unit. You are expected to show all the necessary steps and formulas you have used for the analysis.
8.1. Introduction
In this chapter we are going to analyze the degree and nature of relationships between variables, through what are known as correlation and regression analysis. For both regression and correlation studies, the number of variables may be two (bivariate analysis) or more (multivariate analysis). In this course we shall discuss both bivariate and multivariate relationships in some detail, treating correlation first and then regression.
The relationship between variables (the coefficient of correlation) can be direct or inverse. Its value ranges between +1 and -1, indicating positive and negative correlation between the variables, respectively. The analysis of the coefficient of correlation has been attempted by different scholars. The most widely known measure is Karl Pearson's product-moment correlation coefficient, or simply Pearson's coefficient of correlation, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. For bivariate linear correlation, Pearson's coefficient of correlation runs as follows:
    r = Σ(Xi - X̄)(Yi - Ȳ) / (N·σX·σY)

Where r is the coefficient of correlation, Xi and Yi are the paired values of the two variables, X̄ and Ȳ are their means, and σX and σY are their standard deviations. It can be shown that this is equivalent to the computational form:

    r = [N·ΣXiYi - ΣXi·ΣYi] / √( [N·ΣXi² - (ΣXi)²] × [N·ΣYi² - (ΣYi)²] )
Where,

N = number of pairs of scores
ΣXY = sum of the products of paired scores
ΣX = sum of X scores
ΣY = sum of Y scores
ΣX² = sum of squared X scores
ΣY² = sum of squared Y scores
Example
Let's assume that we want to look at the relationship between two variables: food grain available per head (in quintals) and family size. Perhaps we have a hypothesis that family size affects the food grain available per head in a household. Let's say we collect information on 15 households, recorded as indicated in Table 8.1 below.

Table 8.1:

Food Grain Available (Quintals per Head)   Family Size
12                                         3
8                                          5
8                                          4
9                                          3
3                                          6
5                                          5
21                                         2
2                                          9
2                                          10
7                                          4
21                                         3
25                                         2
18                                         2
2                                          9
3                                          7
You should immediately see in the bivariate plot that the relationship between the variables is negative or inverse, because if you were to fit a single straight line through the dots it would have a negative slope, moving down from left to right. Since the correlation is nothing more than a quantitative estimate of the relationship, we would expect a negative correlation. What does a negative relationship mean in this context? It means that, as one variable increases, the other decreases in value. You should confirm visually that this is generally true in the plot below (Figure 8.1).

Figure 8.1: Scatter plot of the data in Table 8.1

[Scatter plot of grain per head against family size, with the two regression lines Y = -2.3917X + 21.5325 and X = -0.2841Y + 7.6986 crossing at their intersection point.]
From Figure 8.1 above we can see that there are two lines, indicating that the two variables are mutually regressed against each other. The angle between the two lines will be zero when there is a perfect relationship between the variables, i.e. when the coefficient of correlation is +1 or -1; the greater the angle, the smaller the value of the correlation coefficient. Another check is that, if the calculations and the drawings have gone smoothly, the co-ordinates of the intersection of the two lines will be the means of the two variables. The two regression lines give two regression coefficients: for the regression of Y on X,

    bYX = r·(σY / σX)

and for the regression of X on Y,

    bXY = r·(σX / σY)

Their product,

    bYX × bXY = r²

gives the coefficient of determination, r².

Regression equations using means, standard deviations and correlation coefficients:

Case 1: Y regressed on X
    (Y - Ȳ) = r·(σY / σX)·(X - X̄)

    Y - 9.733 = -2.3917 (X - 4.933)
Case 2: X regressed on Y

    (X - X̄) = r·(σX / σY)·(Y - Ȳ)

    X - 4.933 = -0.2841 (Y - 9.733)
By using the previously explained formula, it is now easy to compute the coefficient of correlation for the variables in Table 8.1.
    rXY = [N·ΣXiYi - ΣXi·ΣYi] / √( [N·ΣXi² - (ΣXi)²] × [N·ΣYi² - (ΣYi)²] )
The symbol r stands for the coefficient of correlation. It is always between -1.0 and +1.0. If the coefficient of correlation is negative, we have an inverse relationship; if it is positive, the relationship is direct. You don't need to know how this formula was derived unless you want to be a statistician, but you probably will need to know how the formula relates to real data and how you can use it to compute the correlation coefficient.

Table 8.2:

Grain available (Y)   Family size (X)   XY    X²    Y²
12                    3                 36    9     144
8                     5                 40    25    64
8                     4                 32    16    64
9                     3                 27    9     81
3                     6                 18    36    9
5                     5                 25    25    25
21                    2                 42    4     441
2                     9                 18    81    4
2                     10                20    100   4
7                     4                 28    16    49
21                    3                 63    9     441
25                    2                 50    4     625
18                    2                 36    4     324
2                     9                 18    81    4
3                     7                 21    49    9
Total                 74                474   468   2288
(ΣY = 146)
Let's look at the data we need for the formula. Here is the original data with the other necessary columns (see Table 8.2). The first two columns are the same as in Table 8.1 above. The next three columns are simple computations based on the grain and family-size data. The bottom row consists of the sum of each column. This is all the information we need to compute the coefficient of correlation. Here are the values from the bottom row of the table (where N is 15) as they relate to the symbols in the formula:

    N = 15,  ΣX = 74,  ΣX² = 468,  ΣXY = 474,  ΣY = 146,  ΣY² = 2288

Now, when we plug these values into the given formula, we get the following:
    r = [15 × 474 - 74 × 146] / √( [15 × 468 - 74²] × [15 × 2288 - 146²] )
      = [7110 - 10804] / √( [7020 - 5476] × [34320 - 21316] )
      = -3694 / √(1544 × 13004)
      ≈ -0.824
Here we can determine the probable error of the correlation coefficient, and hence a confidence interval, as:

    Pe = 0.6745 × (1 - r²) / √N

Where Pe is the probable error, r is the coefficient of correlation and N is the number of pairs of observations. This probable error sets a range for the coefficients of correlation of other sets of samples selected randomly from the same population. The range is put as r - Pe to r + Pe. A further property associated with Pe is that if r ≥ 6Pe, the correlation coefficient is definitely significant.
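The whole calculation, Pearson's r plus its probable error, can be sketched as follows, using the grain/family-size data of Table 8.1 (a minimal illustration, not from the original text).

```python
# Sketch: Pearson's r via the computational formula, then the probable error.
from math import sqrt

y = [12, 8, 8, 9, 3, 5, 21, 2, 2, 7, 21, 25, 18, 2, 3]   # grain per head
x = [3, 5, 4, 3, 6, 5, 2, 9, 10, 4, 3, 2, 2, 9, 7]       # family size
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
sy2 = sum(b * b for b in y)

r = (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
pe = 0.6745 * (1 - r ** 2) / sqrt(n)       # probable error of r
print(round(r, 3), round(pe, 3))
```

Since |r| is far larger than 6·Pe, the correlation is definitely significant by the rule stated above.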
Example

Calculate the coefficients of correlation and determination for the following data and comment on the results you have found.

Table 8.3:

Crop yield/ha (in quintals)   18   8    28   20   14    22   24   16   6    12
Fertilizer used per ha        50   35   15   45   100   0    38   27   43   55
Solution:

No.     X (fertilizer)   Y (yield)   X²      Y²     XY
1       50               18          2500    324    900
2       35               8           1225    64     280
3       15               28          225     784    420
4       45               20          2025    400    900
5       100              14          10000   196    1400
6       0                22          0       484    0
7       38               24          1444    576    912
8       27               16          729     256    432
9       43               6           1849    36     258
10      55               12          3025    144    660
Total   408              168         23022   3264   6162

You can now substitute the values above in the formula below:

    r = [N·ΣXY - ΣX·ΣY] / √( [N·ΣX² - (ΣX)²] × [N·ΣY² - (ΣY)²] )
      = [10 × 6162 - 408 × 168] / √( [10 × 23022 - 408²] × [10 × 3264 - 168²] )
      = -6924 / √(63756 × 4416)
      ≈ -0.4127
The result shows a weak negative correlation between crop yield per hectare and use of fertilizer per hectare. The coefficient of determination (r² ≈ 0.1703) indicates that only about 17% of the variation in the dependent variable (Y) is explained by the investigated independent variable (X).
The null and alternative hypotheses are:

    H0: r = 0        H1: r ≠ 0

The easiest way to test this hypothesis is to find a statistics book that has a table of critical values of r; most introductory statistics texts have one. As in all hypothesis testing, you need first to determine the significance level. Here we can use the common significance level of α = .05. This means that we are conducting a test where the odds that the correlation is a chance occurrence are no more than 5 out of 100. Before we look up the critical value in a table we also have to compute the degrees of freedom (df). The df is simply N - 2, or, in this example of Table 8.2, 15 - 2 = 13. Finally, we have to decide whether we are doing a one-tailed or a two-tailed test. In this example, since we have no strong prior theory to suggest whether the relationship between food grain available and family size would be positive or negative, we opt for the two-tailed test. With these three pieces of information, i.e. the significance level (α = .05), the degrees of freedom (df = 13) and the type of test (two-tailed), we can now test the significance of the coefficient of correlation we have found. Looking this up in a table at the back of a statistics book, we find that the critical value of r is 0.514. This means that if the calculated value is greater than 0.514 or less than -0.514 (remember, this is a two-tailed test) we can conclude that the odds are less than 5 out of 100 that the result is a chance occurrence. Since our calculated correlation of r = -0.824 lies well beyond the negative critical value, we can conclude that it is not a chance finding and that the correlation is statistically significant. We can reject the null hypothesis and accept the alternative. We can also compute the correlation coefficient and statistically confirm (test) the relationship between the variables by using SPSS software, as indicated in the table below.

Bivariate Correlation SPSS Output

                                     Food Grain per Head
Family Size   Pearson Correlation    -0.824(**)
              Sig. (2-tailed)        .000
** Correlation is significant at the 0.01 level (2-tailed).
With k variables there are k(k - 1)/2 distinct pairs, so for the 9 variables below there are 9 × 8 / 2 = 36 correlations. We could do the above computation 36 times to obtain them all, or we could use SPSS or any statistics program to compute all 36 automatically with a simple click of the mouse. Here SPSS was used to calculate the correlations among the variables and create the correlation matrix for the 9 variables. Here's the result.

Table 8.5: Correlation Matrix
(Pearson correlations; lower triangle only)

Variables           Grain/   Family   Farm     Oxen     Live-    Sex of   Off-farm   Fert.
                    Head     Size     Land              stock    H/Head   Income     per ha
Family Size         -0.824
Farm Land Size      0.912    -0.650
Number of Oxen      0.911    -0.705   0.819
No. of Livestock    0.917    -0.589   0.855    0.882
Sex of H/Head       0.626    -0.666   0.562    0.564    0.512
Off-farm Income     0.916    -0.755   0.771    0.900    0.849    0.716
Fertilizer per ha   0.915    -0.743   0.742    0.870    0.901    0.669    0.936
Dung per ha         0.867    -0.834   0.638    0.818    0.790    0.678    0.860       0.930
This type of table is called a correlation matrix. It lists the variable names down the first column and across the first row, and shows only the lower triangle of the matrix. In every correlation matrix there are two triangles: the values below and to the left of the diagonal (the lower triangle) and those above and to the right of the diagonal (the upper triangle). There is no reason to print both, because the two triangles of a correlation matrix are always mirror images of each other (the correlation of variable x with variable y is always equal to the correlation of variable y with variable x). When a matrix has this mirror-image quality above and below the diagonal we refer to it as a symmetric matrix; a correlation matrix is always a symmetric matrix. To locate the correlation for any pair of variables, find the value in the table at the intersection of the row and column for those two variables. For instance, to find the correlation between Family Size and Grain per Head, we look where that row and column intersect, and find that the correlation is -0.824.
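A correlation matrix like the one above can be produced with numpy's corrcoef. This is a sketch; the third variable here is hypothetical, added only to give a 3 × 3 matrix alongside the two real variables of Table 8.1.

```python
# Sketch: a small correlation matrix, analogous to the SPSS output above.
import numpy as np

grain  = np.array([12, 8, 8, 9, 3, 5, 21, 2, 2, 7, 21, 25, 18, 2, 3])
family = np.array([3, 5, 4, 3, 6, 5, 2, 9, 10, 4, 3, 2, 2, 9, 7])
oxen   = np.array([2, 1, 1, 2, 0, 1, 3, 0, 0, 1, 4, 4, 3, 0, 1])  # hypothetical

m = np.corrcoef([grain, family, oxen])   # symmetric matrix, ones on the diagonal
print(np.round(m, 3))
```

Only the lower (or upper) triangle needs to be reported, exactly as in Table 8.5.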
Other Correlations
The specific type of correlation we have seen above is known as the Pearson product-moment correlation coefficient. It is appropriate when both variables are measured at an interval level. However, there are a wide variety of other types of correlation for other circumstances. For instance, if you have two ordinal variables, you could use the Spearman rank-order correlation or the Kendall rank-order correlation. When one measure is a continuous interval-level one and the other is dichotomous (i.e., two-category), you can use the point-biserial correlation.
The coefficient of determination, r², is the ratio of explained variation to total variation:

    r² = Explained variation / Total variation

    Total variation = Σ(Y - Ȳ)²        Unexplained variation = Σ(Y - Yc)²

so that

    r² = 1 - Σ(Y - Yc)² / Σ(Y - Ȳ)²

In computational form, for the bivariate case:

    r² = [a·ΣY + b·ΣXY - N·Ȳ²] / [ΣY² - N·Ȳ²]        (Bivariate)

and for the multivariate case:

    R² = [A·ΣY + B·ΣX1Y + C·ΣX2Y + D·ΣX3Y + ... - N·Ȳ²] / [ΣY² - N·Ȳ²]        (Multivariate)
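The "explained variation" definition can be sketched as follows, fitting a least-squares line to a small sample data set and computing r² as 1 - SSE/SST (a minimal illustration, not from the original text).

```python
# Sketch: r^2 as the share of explained variation for a straight-line fit.
x = [1, 2, 3, 4, 5]
y = [10, 12, 15, 20, 28]
n = len(x)

# Least-squares slope and intercept for Yc = a + b*X
b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
a = (sum(y) - b * sum(x)) / n

y_hat = [a + b * xi for xi in x]
y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
sse = sum((yi - yc) ** 2 for yi, yc in zip(y, y_hat))  # unexplained variation
r2 = 1 - sse / sst
print(round(r2, 3))
```

An r² near 1 means the fitted line accounts for almost all of the variation in Y.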
Exercise
Calculate the coefficients of correlation and determination for the following data and comment on the results you have found. Use SPSS software to do so.

Crop yield/ha (in quintals)   18   8   28   20   14   22   24   16   6   12
Farm oxen/ha                  2    1   4    0    3    1    2    1    4   2
Regression models are mathematical/algebraic expressions, while regression lines are their graphical representations: in a two-dimensional space in the case of a bivariate distribution, and in multidimensional spaces in the case of multivariate analysis. The number of curves/lines depends on the number of variables; for bivariate distributions, for instance, there can be two regression lines. The common bivariate regression models are:

    Linear:          Y = a + bX + e
    Second degree:   Y = a + bX + cX²
    Third degree:    Y = a + bX + cX² + dX³
    Exponential:     Y = a·b^X
It is hardly obvious why we should choose our line using the minimum-sum-of-squared-errors criterion. One virtue of the sum-of-squared-errors criterion is that it is very easy to employ computationally: when one expresses the sum of squared errors/deviations mathematically and applies calculus techniques to find the values of a and b that minimize it, one obtains expressions for a and b that are easy to evaluate.
Here a is the intercept, where the line cuts the Y axis (the value of Y when X = 0), and b is the slope of the line subtended with the x-axis. b is also called the regression coefficient, defined as the measure of change in the dependent variable (Y) corresponding to a unit change in the independent variable (X). The symbol e is the noise term, which is usually omitted in calculations; it comprises factors that are unobservable, or at least unobserved, while a and b are the constants.
Figure 8.2: [The regression line Y = a + bX plotted against the X and Y axes.]

Dear student! Do you know the importance of regression analysis? Where do you use it? Applications of regression analysis exist in almost every field: geography, economics, political science, sociology, psychology and education. The common aspect of these applications is that the dependent variable is a quantitative measure of some condition or behavior. When the dependent variable is qualitative or categorical, other methods may be more appropriate. You will know more about the application of regression analysis after you observe the examples below.
Example: fit a least-squares straight line to the following crop production data.

Crop Year (X):            1    2    3    4    5
Production in Quintals:   10   12   15   20   28

The prediction equation is Y = a + bX, and the so-called normal (standard) equations used to find a and b are:

    ΣY  = n·a + b·ΣX
    ΣXY = a·ΣX + b·ΣX²

First find ΣY, ΣX, ΣXY and ΣX², as follows:
Table 8.8:

Crop Year (X)   Production (Y)   XY    X²
1               10               10    1
2               12               24    4
3               15               45    9
4               20               80    16
5               28               140   25
Total: 15       85               299   55
These numerical values replace the symbols in the two normal equations to give two simultaneous equations in a and b:

    85 = 5a + 15b         ... (1)
    299 = 15a + 55b       ... (2)

Multiplying equation (1) by 3 gives 255 = 15a + 45b; subtracting this from equation (2):

    44 = 10b,  so  b = 4.4

Substituting b back into equation (1): 85 = 5a + 15(4.4) = 5a + 66, so 5a = 19 and a = 3.8. The estimating (prediction) equation or model is therefore:

    Yc = 3.8 + 4.4X

Once the constants a and b are calculated, two unknown variables remain in the regression equation, Y and X. We know Y depends on X in the case of the regression of Y on X. Under the presumption that the trend of change in Y corresponding to X remains the same, the value of Y can be estimated for any value of X: simply substitute the values of X (1, 2, 3, 4 and 5) into the estimating (prediction) equation and calculate Yc turn by turn. For instance, when X = 3, Yc = 3.8 + 4.4(3) = 17.0.
Production (Y)   Yc     Y - Yc
10               8.2    1.8
12               12.6   -0.6
15               17.0   -2.0
20               21.4   -1.4
28               25.8   2.2
Total: 85        85.0   0.0
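The normal-equation solution above can be reproduced with numpy (a sketch solving the same two simultaneous equations in matrix form).

```python
# Sketch: solving the normal equations for the straight line Y = a + bX.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 12, 15, 20, 28], dtype=float)
n = len(x)

# Normal equations:  sum(Y) = n*a + b*sum(X);  sum(XY) = a*sum(X) + b*sum(X^2)
A = np.array([[n, x.sum()], [x.sum(), (x * x).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)
print(round(a, 1), round(b, 1))
```

Solving the two-by-two system gives the same constants as the manual elimination above.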
[Figure: graph of the fitted regression line Y = a + bX.]
Second Degree Curve: Y = a + bX + cX². For this model the normal equations are:

    ΣY   = n·a + b·ΣX + c·ΣX²
    ΣXY  = a·ΣX + b·ΣX² + c·ΣX³
    ΣX²Y = a·ΣX² + b·ΣX³ + c·ΣX⁴

The required sums are computed as before:

Production (Y)   XY    X²   X³
10               10    1    1
12               24    4    8
15               45    9    27
20               80    16   64
28               140   25   125
Total: 85        299   55   225
Then, first find ΣY, ΣX, ΣX², ΣXY, ΣX³, ΣX²Y and ΣX⁴ (here ΣX²Y = 1213 and ΣX⁴ = 979), and substitute into the three equations above:

    85 = 5a + 15b + 55c          ... (1)
    299 = 15a + 55b + 225c       ... (2)
    1213 = 55a + 225b + 979c     ... (3)

To eliminate a, multiply equation (1) by 3 and subtract the result from (2):

    44 = 10b + 60c               ... (4)

Multiply equation (1) by 11 and subtract the result from (3):

    278 = 60b + 374c             ... (5)

Multiply equation (4) by 6 (giving 264 = 60b + 360c) and subtract from (5):

    14 = 14c,  so  c = 1

Substituting c into equation (4): 44 = 10b + 60(1), so 10b = -16 and b = -1.6. Substituting b and c into equation (1): 85 = 5a + 15(-1.6) + 55(1) = 5a + 31, so 5a = 54 and a = 10.8. You can confirm the results you have found by substituting into one of the equations above; for instance, substituting into equation (2) gives 299 = 299, which means confirmed! The estimating or prediction equation is therefore:

    Yc = 10.8 - 1.6X + 1X²
Now, for every value of X you can estimate Yc. Look at the estimated (predicted) values below.

Table 8.11:

X   Actual Value (Y)   Estimated Value (Yc)
1   10                 10.2
2   12                 11.6
3   15                 15.0
4   20                 20.4
5   28                 27.8
    85                 85.0
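The same second-degree fit can be obtained with numpy.polyfit (a sketch; polyfit returns the coefficients from the highest power down).

```python
# Sketch: least-squares quadratic fit Y = a + bX + cX^2 via numpy.polyfit.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 12, 15, 20, 28], dtype=float)

c, b, a = np.polyfit(x, y, 2)   # returned order: [c, b, a]
print(round(a, 1), round(b, 1), round(c, 1))
```

The returned constants match the manual normal-equation solution above.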
Note again that ΣY is equal to ΣYc except for minor differences due to the rounding of fractions.
Exponential Curve: the exponential prediction equation, based on the linear equation, is Y = a·b^X. Taking logarithms of both sides gives the linear form:

    Log Y = Log a + X·Log b

so the normal equations are:

    ΣLog Y   = n·Log a + Log b·ΣX
    ΣX·Log Y = Log a·ΣX + Log b·ΣX²

Now find ΣLog Y, ΣX, ΣX·Log Y and ΣX², as follows:

Table 8.12:

No.     X    Y    Log Y    X·Log Y   X²
1       1    10   1.0000   1.0000    1
2       2    12   1.0792   2.1584    4
3       3    15   1.1761   3.5283    9
4       4    20   1.3010   5.2041    16
5       5    28   1.4472   7.2358    25
Total   15   85   6.0035   19.1266   55
Now substitute the values above into the normal equations (rounding to two decimal places):

    6.01 = 5·Log a + 15·Log b        ... (1)
    19.15 = 15·Log a + 55·Log b      ... (2)

Multiply equation (1) by 3 to eliminate Log a:

    18.03 = 15·Log a + 45·Log b      ... (3)

Subtract equation (3) from equation (2):

    1.12 = 10·Log b,  so  Log b = 0.112  and  b = antilog(0.112) = 1.294

Now substitute Log b = 0.112 into equation (1) to find a:

    6.01 = 5·Log a + 15(0.112)
    5·Log a = 6.01 - 1.68 = 4.33,  so  Log a = 0.866  and  a = antilog(0.866) = 7.345

The prediction equation is therefore:

    Yc = 7.345 × (1.294)^X,   or   Log Yc = 0.866 + 0.112X
Now, for every value of X you can calculate Yc. Look at the estimated (predicted) values below.

Table 8.13:
X      Y (Actual)   log Yc   Yc (Estimated)
1      10           0.978    9.506
2      12           1.090    12.303
3      15           1.202    15.922
4      20           1.314    20.606
5      28           1.426    26.668
Total  85           6.010    85.005
Note again that Y is approximately equal to Yc; the minor differences are due to the rounding off of fractions. An exponential curve will never be zero whatever the value of X is: it never crosses the X-axis but only approaches it. This is why an exponential prediction equation is used in distance decay analysis.
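The logarithmic fitting above can also be cross-checked numerically. A minimal sketch (not from the original text) using least squares on log10(Y); small differences from the hand-rounded values are expected:

```python
import numpy as np

# Observed data from the worked example
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([10, 12, 15, 20, 28], dtype=float)

# Fit log10(Y) = log10(a) + X*log10(b), mirroring the hand calculation
M = np.column_stack([np.ones_like(X), X])
(log_a, log_b), *_ = np.linalg.lstsq(M, np.log10(Y), rcond=None)
a, b = 10**log_a, 10**log_b

Yc = a * b**X
print(round(a, 3), round(b, 3))   # close to the hand-computed 7.345 and 1.294
```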
3rd Degree (Cubic) Curve
For the third-degree curve Yc = a + bX + cX² + dX³, two additional sums are required: ΣX⁶ = 20515 and Σ(X³Y) = 5291. Then, we can apply the normal equations as usual:
85 = 5a + 15b + 55c + 225d ..................... (1)
299 = 15a + 55b + 225c + 979d .................. (2)
1213 = 55a + 225b + 979c + 4425d ............... (3)
5291 = 225a + 979b + 4425c + 20515d ............ (4)

Multiply equation No. 1 by 3:      255 = 15a + 45b + 165c + 675d ......... (5)
Multiply equation No. 1 by 11:     935 = 55a + 165b + 605c + 2475d ....... (6)
Multiply equation No. 1 by 45:     3825 = 225a + 675b + 2475c + 10125d ... (7)
Equation No. 2 − Equation No. 5:   44 = 10b + 60c + 304d ................. (8)
Equation No. 3 − Equation No. 6:   278 = 60b + 374c + 1950d .............. (9)
Equation No. 4 − Equation No. 7:   1466 = 304b + 1950c + 10390d .......... (10)
Multiply equation No. 8 by 6:      264 = 60b + 360c + 1824d .............. (11)
Multiply equation No. 8 by 30.4:   1337.6 = 304b + 1824c + 9241.6d ....... (12)
Equation No. 9 − Equation No. 11:  14 = 14c + 126d ....................... (13)
Equation No. 10 − Equation No. 12: 128.4 = 126c + 1148.4d ................ (14)
Multiply equation No. 13 by 9:     126 = 126c + 1134d .................... (15)
Equation No. 14 − Equation No. 15: 2.4 = 14.4d; hence d = 0.166

Substituting the value of d into equation No. 13: 14 = 14c + 126(0.166); −7 = 14c; c = −0.50
Substituting the values of c and d into equation No. 8: 44 = 10b − 30 + 50.67; 23.33 = 10b; b = 2.33
Substituting the values of b, c and d into equation No. 1: 85 = 5a + 15(2.33) − 27.5 + 37.5; a = 8.00
Then, the prediction equation is:
Yc = 8.00 + 2.33X − 0.50X² + 0.166X³
Now for every value of X you can calculate Yc. Look at the estimated (predicted) values below.

Table 8.14:
X      Y (Actual)   Yc (Estimated)   (Y − Yc)   (Y − Yc)²
1      10           9.99             0.01       0.0001
2      12           11.99            0.01       0.0001
3      15           14.99            0.01       0.0001
4      20           19.99            0.01       0.0001
5      28           27.99            0.01       0.0001
Total  85           84.95            0.05       0.0005
Finally, by comparing the values of Σ(Y − Yc)² for the linear, 2nd degree, 3rd degree and exponential curves, we can determine which one fits the given data best: the smallest sum indicates the best fit.
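The comparison described above can be sketched in code (not part of the original text). Each curve type is fitted by least squares and the sum of squared residuals decides the winner; for these five points the cubic happens to fit exactly:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([10, 12, 15, 20, 28], dtype=float)

def sse(Yc):
    """Sum of squared residuals, the comparison criterion in the text."""
    return float(np.sum((Y - Yc) ** 2))

# Linear, quadratic and cubic fits via least squares on powers of X
fits = {}
for degree in (1, 2, 3):
    A = np.column_stack([X**p for p in range(degree + 1)])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    fits[f'degree {degree}'] = sse(A @ coef)

# Exponential fit via least squares on log10(Y)
M = np.column_stack([np.ones_like(X), X])
(la, lb), *_ = np.linalg.lstsq(M, np.log10(Y), rcond=None)
fits['exponential'] = sse(10**la * (10**lb) ** X)

best = min(fits, key=fits.get)
print(fits)
print('best fit:', best)   # the cubic, with residuals at rounding level
```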
Exercise
Let us assume that a department has trained salesmen and has given them a test. The department measured the performance of the salesmen after a certain number of months and found the following data. Then identify (a) the best fitting line, (b) the prediction equation and (c) the predicted sales for a salesman having a test score of 9.5.

Test score (X):  2   3   5   5.5   6   8
Sales (Y):       12  23  14  30    28  34
the equation Y = A + BX; i.e. the Y variable can be expressed in terms of a constant (A) and a slope (B) times the X variable. The constant A is also referred to as the intercept on the Y-axis (the value of Y when X is 0), and the slope B as the regression coefficient. In the multivariate case, when there is more than one independent variable, the regression line cannot be visualized in two-dimensional space, but it can be computed just as easily. We can, however, construct a linear equation containing all those variables. In general, multiple regression procedures will estimate a linear equation of the form:
Y=
+ n Xn
Y = dependent variable X1, X2, X3 1, 2, 3 Estimating equation is YC = + 1X1 + 1X2 + nX3 +nXn Xn = independent variables + n = constants
In this equation, the value of each regression coefficient indicates the contribution of each independent variable to the observed change in the dependent variable. The smaller the variability of the residual values around the regression line relative to the overall variability, the better is our prediction. For example, if there is no relationship between the X and Y variables, then the ratio of the residual variability of the Y variable to the original variance is equal to 1.0. If X and Y are perfectly related, there is no residual variance and the ratio would be 0.0. In most cases, the ratio falls somewhere between these extremes, that is, between 0.0 and 1.0. One (1.0) minus this ratio is referred to as R-square, or the coefficient of determination. This value is immediately interpretable in the following manner: if we have an R-square of 0.4, then we know that the variability of the Y values around the regression line is 0.6 times the original variance; in other words, we have explained 40% of the original variability and are left with 60% residual variability.
Ideally, we would like to explain most, if not all, of the original variability. The R-square value is an indicator of how well the model fits the data. For instance, an R-square close to 1.0 indicates that we have accounted for almost all of the variability with the variables specified in the model. The degree to which two or more predictors (independent or X variables) are related to the dependent (Y) variable is expressed in the correlation coefficient R, which is the square root of R-square. In multiple regression, R can assume values between 0 and 1. To interpret the direction of the relationship between variables, one looks at the signs (plus or minus) of the regression (B) coefficients. If a coefficient is positive, then the relationship of this variable with the dependent variable is positive. Of course, if the B coefficient is equal to 0, there is no relationship between the variables.
Example
Analyze the relationship between the dependent variable (Y) and the independent variables (Xi) of the hypothetical data given below. In order to find the solution, we first have to know the values of the summations ΣY, ΣX1, ΣX2, ΣX3 and ΣX4, calculated in the table.
Table 8.15:
Crop yield/ha (Y)   Farm oxen/ha (X1)   Fertilizer, kg/ha (X2)   Per capita farmland, ha (X3)   Per capita irrigated land (X4)
18                  2                   50                       2                              1
8                   1                   35                       1                              0.5
28                  4                   15                       1.5                            1
20                  0                   45                       0.5                            0
14                  3                   100                      3                              2
22                  1                   0                        2                              1
24                  2                   38                       4                              2
16                  1                   27                       2                              1
6                   4                   43                       1                              0.5
12                  2                   55                       3                              2
Total: 168          20                  408                      20                             11
We also need the following summations, calculated in the next table: Σ(YX1), Σ(YX2), Σ(YX3), Σ(YX4), ΣX1², ΣX2², ΣX3², ΣX4², Σ(X1X2), Σ(X1X3), Σ(X2X3), Σ(X1X4), Σ(X2X4) and Σ(X3X4). The normal equations are:
ΣY = NA + BΣX1 + CΣX2 + DΣX3 + EΣX4 ........................ (1)
Σ(YX1) = AΣX1 + BΣX1² + CΣ(X1X2) + DΣ(X1X3) + EΣ(X1X4) ..... (2)
Σ(YX2) = AΣX2 + BΣ(X1X2) + CΣX2² + DΣ(X2X3) + EΣ(X2X4) ..... (3)
Σ(YX3) = AΣX3 + BΣ(X1X3) + CΣ(X2X3) + DΣX3² + EΣ(X3X4) ..... (4)
Σ(YX4) = AΣX4 + BΣ(X1X4) + CΣ(X2X4) + DΣ(X3X4) + EΣX4² ..... (5)
Table 8.16:
YX1    YX2     YX3    YX4    X1²   X2²     X3²     X4²    X1X3
36     900     36     18     4     2500    4       1      4
8      280     8      4      1     1225    1       0.25   1
112    420     42     28     16    225     2.25    1      6
0      900     10     0      0     2025    0.25    0      0
42     1400    42     28     9     10000   9       4      9
22     0       44     22     1     0       4       1      2
48     912     96     48     4     1444    16      4      8
16     432     32     16     1     729     4       1      2
24     258     6      3      16    1849    1       0.25   4
24     660     36     24     4     3025    9       4      6
Σ332   Σ6162   Σ352   Σ191   Σ56   Σ23022  Σ50.50  Σ16.50 Σ42
168 = 10A + 20B + 408C + 20D + 11E .............. (1)
332 = 20A + 56B + 880C + 42D + 24.5E ............ (2)
6162 = 408A + 880B + 23022C + 894D + 517E ....... (3)
352 = 20A + 42B + 894C + 50.5D + 28.5E .......... (4)
191 = 11A + 24.5B + 517C + 28.5D + 16.5E ........ (5)
You can now find all the constants to arrive at an estimating equation. The values of the constants are calculated as follows:
Variables   B coefficient   Beta coefficient
X1          0.273           0.052
X2          −0.129          −0.489
X3          4.873           0.751
X4          −3.950          −0.394
Constant    16.104

The estimating equation is therefore:
YC = 16.104 + 0.273X1 − 0.129X2 + 4.873X3 − 3.950X4
Note: The independent variables with negative Beta coefficients (X2 and X4 in the example above) affect the dependent variable (Y) negatively; the greater the absolute value, the greater the effect. Likewise, the independent variables with positive Beta coefficients (X1 and X3 in the example above) affect the dependent variable positively. The Beta coefficient of X3, for instance, is +0.751, which means that an increase of one standard deviation in X3 changes the dependent variable positively by 0.751 standard deviations. Let us calculate the multiple coefficients of correlation and determination for the example above. The formula for the multiple coefficient of determination is:
R² = [AΣY + BΣ(YX1) + CΣ(YX2) + DΣ(YX3) + EΣ(YX4) − N(Ȳ)²] / [ΣY² − N(Ȳ)²]
By using this formula, the coefficients of correlation and determination are calculated to be:
Multiple coefficient of correlation (R) = 0.566
Multiple coefficient of determination (R²) = 0.320, or 32.0%
What do the values above indicate? The value of the multiple coefficient of determination (R²) = 0.320, or 32.0%, indicates that all the independent variables together explain some 32% of the changes or variance in the dependent variable. This can easily be done by using SPSS software, the output table of which looks like the one indicated below.
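As an alternative to SPSS, the estimation can be sketched directly from the data of Table 8.15 (a minimal illustration, not part of the original text), solving the normal equations by least squares and computing R² from the residuals:

```python
import numpy as np

# Hypothetical data from Table 8.15
Y  = np.array([18, 8, 28, 20, 14, 22, 24, 16, 6, 12], dtype=float)
X1 = np.array([2, 1, 4, 0, 3, 1, 2, 1, 4, 2], dtype=float)
X2 = np.array([50, 35, 15, 45, 100, 0, 38, 27, 43, 55], dtype=float)
X3 = np.array([2, 1, 1.5, 0.5, 3, 2, 4, 2, 1, 3])
X4 = np.array([1, 0.5, 1, 0, 2, 1, 2, 1, 0.5, 2])

# Solve for YC = A + B*X1 + C*X2 + D*X3 + E*X4
M = np.column_stack([np.ones_like(Y), X1, X2, X3, X4])
coef, *_ = np.linalg.lstsq(M, Y, rcond=None)

Yc = M @ coef
r2 = 1 - np.sum((Y - Yc)**2) / np.sum((Y - Y.mean())**2)
print(np.round(coef, 3))   # close to [16.104, 0.273, -0.129, 4.873, -3.950]
print(round(r2, 3))        # close to 0.320
```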
Exercise: Answer the following questions based on the hypothetical data below.
Table 8.18: Hypothetical Data
Y: Food Grain Available per Head (Qntls/Head); X1: Family Size; X2: Farmland Size (Hectare per Head); X4: Number of other Livestock per Hectare; X5: Sex of Household Head; X6: Off-farm Income per Head (Birr)

Y     X1   X2    X4    X5   X6
12    3    1.5   3     M    322
8     5    1.2   2.5   M    323
8     4    1.4   2.5   M    123
9     3    1.5   2     M    212
3     6    0.4   1     F    76
5     5    0.4   0.5   F    45
21    2    4     7     M    380
2     9    0.3   3     F    23
2     10   0.6   1     F    76
7     4    1.6   1.3   M    150
21    3    5     7     M    456
25    2    7     8     M    444
18    2    6     4     M    267
2     9    1.8   1.8   M    80
3     7    1.4   1.5   F    35
a) Calculate the coefficients of each independent variable
b) What percent of the variance of the dependent variable is explained by the identified independent variables?
c) Find the multivariate prediction equation
d) Calculate the predicted values by using the prediction equation
e) Screen out the most significant independent variables by using the stepwise regression analysis model
f) Construct a correlation matrix
g) Confirm the correlation between the dependent variable and each independent variable by using a statistical testing technique you have learnt in this unit
9.1. Introduction
For any statistical/quantitative analysis, it is always required to establish the validity, or the acceptance and rejection level, of the results, generally in relation to already established values. For this purpose, a set of techniques has been established by statisticians pertaining to the various statistical parameters or measures. The whole operation is based on (1) establishing a hypothesis or assumption concerning the computed results in relation to the values with which they are to be compared, (2) a level of significance telling the fractional or percentage level of the comparison, and (3) the degrees of freedom, which vary with the technique and the number of observations.
While hypothesizing, we should consider whether or not the computed value differs significantly from the already established norm or the value with which the comparison is to be made. The presumption that it does not differ significantly is referred to as the Null Hypothesis, generally denoted by Ho. It is also possible to presume that the result differs significantly from the value to be compared with; this is referred to as the Alternative Hypothesis, denoted by H1. The outcome of the test is compared with already established values appearing in the relevant tables, and the interpretation concerning acceptance or rejection is based on the instructions accompanying the tables. If Ho is rejected, the alternative hypothesis is automatically accepted, i.e. the computed value differs significantly.
If Ho: μ1 = μ2 (or σ1² = σ2²), then H1 will be μ1 ≠ μ2 (or σ1² ≠ σ2²), μ1 < μ2, or μ1 > μ2.
A broad classification can be made between parametric and non-parametric tests. Parametric tests use statistical parameters like the mean, standard deviation, variance and correlation coefficients. Non-parametric methods, on the other hand, are widely used for studying populations that take on a ranked order, ignoring the actual values. The use of non-parametric methods is therefore necessary when the data have no actual numerical interpretation/values but rather consist of information like frequencies and ranks. In other words, a parametric test is a statistical test that depends on an assumption about the distribution, unlike a non-parametric test. For example, in Analysis of Variance (ANOVA), a typical parametric test, there are three assumptions:
- Observations are independent
- The sample data have a normal distribution
- Scores in different groups have homogeneous variances
Included in the first broad category (parametric tests), among others, are the z-test and ANOVA, while among the non-parametric tests are the Chi-square (χ²) test, the Mann-Whitney U-test (or Wilcoxon-Mann-Whitney rank-sum test) and the Kruskal-Wallis test (H-test). It is very difficult to make a comparative assessment of the two groups, parametric and non-parametric tests, because both suffer from some limitations. Invariably, the assumption with parametric tests is that the distribution is normal, while non-parametric tests can be used with all types of distributions. Thus, a non-parametric test is more flexible in its conditions than its counterpart. But it is also remarked that non-parametric tests are less reliable than parametric tests.
Other levels of significance, like 0.1, 0.2 and 0.3, may also appear in tables.
on the observations. The statistical test tells the researcher whether the landholding sizes differ significantly among the woredas or not. It is also possible to find the degree of relationship (correlation) between two geographical data sets, say crop yield and fertilizer input per unit area, and perform a statistical test which enables us to decide whether the correlation is statistically significant or not. Note that if the correlation between two variables X and Y is statistically not significant, the relationship between the variables is due only to chance, not because one variable affects the other.
The level of significance is specified before the samples are drawn, so that the results obtained do not influence the choice of the decision-maker. It is specified in terms of the probability of wrongly rejecting the null hypothesis.
Step 3: Establish the Critical or Rejection Region
As can be seen from the figure below, the sample space of the experiment is divided into two mutually exclusive regions. These are called the acceptance region and the rejection (or critical) region of the normally distributed test statistic.
Figure 9.1: Acceptance and rejection regions (critical values)
If the value of the test statistic falls into the acceptance region, the null hypothesis is accepted; otherwise it is rejected. At this stage we should bear in mind that research hypotheses can be of two types: one-tailed and two-tailed. A one-tailed hypothesis makes predictions regarding both the presence of a significant effect and the direction of this difference or association, whereas a two-tailed hypothesis predicts only the presence of a statistically significant effect, not its direction. For instance, we can test (1) whether there is a difference in performance between young and old participants on a memory test (two-tailed), or (2) whether young participants perform better on a memory test than elderly ones (one-tailed).
Step 4: Calculate the Suitable Test Statistic
The value of the test statistic is calculated from the distribution of the sample statistic by using the following formula:
Test statistic = (Value of sample statistic − Value of hypothesized population parameter) / (Standard error of the sample statistic)
Step 5: Reach a Conclusion
Compare the calculated value of the test statistic with the critical value (also called the standard table value or tabulated value). The decision rules for the null hypothesis are as follows:
|Value|cal ≥ |Value|table: Reject the Ho
|Value|cal < |Value|table: Accept the Ho
9.4.1. One-tailed and Two-tailed Tests
There are two types of tests, referred to as one-tailed and two-tailed tests. The type of test depends on the way the hypotheses are formulated.
a. A two-tailed test is used when the null and alternative hypotheses are stated as Ho: μ = μo and H1: μ ≠ μo. This implies that any deviation (either on the lower or higher side) of the calculated value of the test statistic from the hypothesized value leads to rejection of the null hypothesis, Ho. The rejection region is kept in both tails, as indicated in Fig. 9.1. Then, if the significance level for the test is α percent, a rejection region equal to α/2 percent is kept in each tail of the sampling distribution.
b. A one-tailed test is used when the null and alternative hypotheses are stated as Ho: μ ≤ μo and H1: μ > μo (right-tailed test), or Ho: μ ≥ μo and H1: μ < μo (left-tailed test).
This implies that the value of sample statistic is either higher or lower than the hypothesized parameter value. This leads to the rejection of null hypothesis for significant deviation from the specified value in one direction or tail of the curve of sampling distribution. Look at Fig. 9.2 below. Figure 9.2: One-tailed test (Right-tailed)
decision to accept or reject a hypothesis is based on sample data, there is a possibility of an incorrect decision or error. A decision-maker may commit two types of errors while testing a null hypothesis. These are known as Type I Error (α) and Type II Error (β). A Type I error is made when a true Ho is rejected, i.e. we conclude that H1 is true when it is not. On the other hand, a Type II error is made when a false Ho is accepted, i.e. we conclude that H1 is wrong when it is in fact true.
I. Hypothesis Testing for a Single Population Mean: The test statistic for determining the difference between the sample mean and the population mean is given by:

t = (x̄ − μ) / (s/√n)
Note: This test statistic has a t distribution with n − 1 degrees of freedom. The tabulated t-value gives the critical value of t. More clearly, if tcal exceeds ttab (or tα/2), reject Ho; otherwise accept it. Note also that the sample size should be small (<30), and at least five observations taken from a normally distributed population are desirable to apply the t-test.

Example
The average rural household's cereal production per year is specified to be 18.5 quintals. A sample of 14 households was selected, and the mean and standard deviation of the sample were calculated as 17.85 quintals and 1.955 quintals, respectively. Test the significance of the deviation.
Solution: Let us take the null hypothesis that there is no significant deviation of the sample mean from the specified amount of production. Ho: μ = 18.50 and H1: μ ≠ 18.50, with α = 0.05. The critical value of t at df = 13 and α/2 = 0.025 is 2.160.
tcal = (17.85 − 18.50) / (1.955/√14) = −0.65/0.522 = −1.24
Since the absolute value of tcal (1.24) is less than its critical value (ttab = 2.160), the null hypothesis Ho is accepted. Hence, we conclude that there is no significant deviation of the sample mean from the population mean.

Exercise
A herbicide spray machine is set to give 20 kilograms of herbicide per hectare of land. Seven plots of land (each one hectare in area) are examined, and the amounts of herbicide in the plots are found to be 19, 22, 20, 18, 21, 17 and 19 kilograms. Is there reason to believe that the machine is defective?
Hint: Table 9.1: Variables (x): 19, 22, 20, 18, 21, 17, 19. Find x̄ and s, then compute tcal.
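The arithmetic of the worked example above can be sketched in a few lines (the critical value 2.160 is read from a t-table, as in the text):

```python
import math

# Cereal-production example: summary statistics from the text
mu0, n = 18.50, 14
xbar, s = 17.85, 1.955

# t = (sample mean - hypothesized mean) / (s / sqrt(n))
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 2))   # -1.24

# Critical value at alpha = 0.05 (two-tailed), df = 13, from the t-table
t_crit = 2.160
print('accept Ho' if abs(t) < t_crit else 'reject Ho')
```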
II. Hypothesis Testing for the Difference of Two Means: For comparing the mean values of two normally distributed populations, we can draw independent random samples of sizes n1 and n2 from the two populations. If μ1 and μ2 are the mean values of the two populations, then our aim is to estimate the value of the difference μ1 − μ2 between the mean values of the two populations. Let the sample values be denoted by (x1a, x1b, x1c, ...) and (x2a, x2b, x2c, ...). Then, the expression for t is:
t = (x̄1 − x̄2) / (Sp √(1/n1 + 1/n2))    or    t = ((x̄1 − x̄2)/Sp) × √(n1n2/(n1 + n2))

Where x̄1 and x̄2 are the means of samples I and II respectively, and Sp is the pooled standard deviation, which can be calculated by using the formula below:

Sp = √[(Σ(x1i − x̄1)² + Σ(x2i − x̄2)²) / (n1 + n2 − 2)]

Note: In hypothesis testing for the difference of two means, the statistic t has (n1 + n2 − 2) degrees of freedom. The calculated value of the t-test statistic here represents the number of standard errors the difference x̄1 − x̄2 is from the value of μ1 − μ2 specified in Ho. Thus the rule to either accept or reject the null hypothesis is as follows:
Ho: μ1 − μ2 = 0 and H1: μ1 − μ2 ≠ 0
Accept Ho if the calculated value of t is less than its critical value at the specified level of significance and (n1 + n2 − 2) degrees of freedom; otherwise reject Ho.
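The pooled-t computation above can be sketched as follows (the two short samples passed in at the bottom are a small illustrative subset of scores, not a full data set):

```python
import math
from statistics import mean

def pooled_t(sample1, sample2):
    """Two-sample t statistic with pooled standard deviation Sp."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = mean(sample1), mean(sample2)
    ss1 = sum((x - m1) ** 2 for x in sample1)
    ss2 = sum((x - m2) ** 2 for x in sample2)
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))        # pooled std. deviation
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))  # test statistic
    return t, n1 + n2 - 2                              # t and degrees of freedom

# Illustrative scores only
t, df = pooled_t([11, 18, 12, 12, 17], [20, 8, 13, 18, 16])
print(round(t, 2), df)   # -0.39 8
```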
Exercise
Let us assume that the following table shows the test scores (out of 20) of two groups of students in a class:
Group I:  11  18  12  12  17  14  14  16  11  7   13  16  11  19
Group II: 20  8   13  18  16  19  15  17  17  10  15  13  14  10
Then, examine the significance of the difference between the mean of the marks secured by the students of the two groups.
2. Z-test
It is one of the commonest types of hypothesis testing. The Z-test (not the same as the Z-score, though closely related) compares sample and population means to determine whether there is a statistically significant difference. The Z-test, also known as the normal test, is used in cases where the population variance (σ²) is known and the sample size is large (>30). Theoretically, when the sample size is large, the sample variance approaches the population variance and is deemed to be almost equal to it. In this way, the population variance is taken as known even if we have only sample data, and hence the normal test is applicable. The distribution of Z is always normal, with a mean of zero (0) and a variance of one (1). For testing Ho: μ = μo against H1: μ ≠ μo, the test statistic is:

Z = (x̄ − μo) / (σ/√n)

where x̄ is the sample mean and σ is the standard deviation based on the large sample size n. In the formula above, σ/√n represents the standard error, SE. The Z-test can therefore also be stated as:

Z = (x̄ − μo) / SE
Exercise
The table below gives the daily income (in Eth. Birr) of 40 randomly selected laborers in a manufacturing plant.
Table 9.2:
12  7   8   9   12  10  16  23  21  17
6   8   9   12  32  16  4   21  20  22
8   9   4   23  21  20  12  16  18  19
5   18  9   12  12  6   14  21  22  18
Analyze whether it can be concluded or not that the average (mean) income of a person in this manufacturing plant is 15 Eth. Birr.
Hint:
1. State the hypothesis in such a way that Ho: μ = 15 against H1: μ ≠ 15.
2. Since the sample size is 40 (i.e. large), you should use a normal test (Z-test). First calculate the sample mean, x̄.
3. μo = 15.
4. Calculate also the standard deviation (σ), and note that n = 40. Then, by using the formula for the Z-test, compare the calculated value against the tabulated value and decide whether to accept or reject the Ho.
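The hint above can be carried out in a few lines (a sketch; 1.96 is the two-tailed critical z at α = 0.05):

```python
import math
from statistics import pstdev

# Daily incomes of the 40 sampled laborers (Table 9.2)
incomes = [12, 7, 8, 9, 12, 10, 16, 23, 21, 17,
           6, 8, 9, 12, 32, 16, 4, 21, 20, 22,
           8, 9, 4, 23, 21, 20, 12, 16, 18, 19,
           5, 18, 9, 12, 12, 6, 14, 21, 22, 18]

mu0 = 15
xbar = sum(incomes) / len(incomes)
sigma = pstdev(incomes)   # large n: sample sd taken as the population sd
z = (xbar - mu0) / (sigma / math.sqrt(len(incomes)))

print(round(xbar, 1), round(z, 2))
print('accept Ho' if abs(z) < 1.96 else 'reject Ho')
```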
For testing the equality of two population variances, the hypotheses are Ho: σ1² = σ2² against H1: σ1² ≠ σ2². Whenever independent random samples of size n1 and n2 are drawn from two normal populations, the F-ratio is calculated by the formula below:

F = S1² / S2²
As a norm, the larger variance is taken in the numerator of the formula (S1² > S2²), with n1 − 1 degrees of freedom for the numerator and n2 − 1 degrees of freedom for the denominator. Note that we keep the larger variance in the numerator so that the ratio is always equal to or greater than one. For H1: σ1² > σ2² or H1: σ1² < σ2², the corresponding one-tailed critical values are used.
Exercise
Let us assume that the following table represents the life expectancy in 7 and 9 regional states of Ethiopia in 1991 and 2007, respectively. Then, confirm whether the variation in life expectancy among the various regions in 1991 and in 2007 is the same or not.

Table 9.3: Life expectancy in years
Regions:  1     2     3     4     5     6     7     8     9
1991:     43.2  41.5  47.2  50.5  41.2  38.0  39.1
2007:     54.5  47.0  56.9  60.3  58.2  49.6  54.9  48.8  58.5
Hint: Ho: σ1² = σ2² vs H1: σ1² ≠ σ2². The degrees of freedom are 6 and 8 for the data sets of 1991 and 2007, respectively.
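A sketch of the F-ratio computation for this exercise (sample variances via the statistics module, larger variance on top):

```python
from statistics import variance

# Life expectancy data (Table 9.3)
le_1991 = [43.2, 41.5, 47.2, 50.5, 41.2, 38.0, 39.1]
le_2007 = [54.5, 47.0, 56.9, 60.3, 58.2, 49.6, 54.9, 48.8, 58.5]

s1, s2 = variance(le_2007), variance(le_1991)
F = max(s1, s2) / min(s1, s2)   # larger variance in the numerator
print(round(F, 2))              # F-ratio with (8, 6) degrees of freedom
```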
The Kruskal-Wallis test (H-test) is used to determine whether several independent sets of observations come from the same distribution or not. It is one of the best known non-parametric significance tests.
The H statistic is calculated as:

H = [12 / (N(N + 1))] × Σ(Ri²/ni) − 3(N + 1)

Where, N = total number of observations, Ri = sum of the ranks of the ith set, and ni = frequency (number of observations) of the ith set.
Exercise
Find the H-value for the four sets of data given below and compare it against the tabulated value.
Table 9.4:
A   B   C   D
4   6   12  10
6   8   16  10
2   6   14  10
6   10  8   6
2   0   10  4
Solution
Rank all 20 observations together, starting from the lowest; tied values receive the average of the ranks they occupy.

Table 9.5: Data Ranking (rank in brackets)
A: 4(4.5)   6(8)     2(2.5)   6(8)      2(2.5)   RA = 25.5
B: 6(8)     8(11.5)  6(8)     10(15)    0(1)     RB = 43.5
C: 12(18)   16(20)   14(19)   8(11.5)   10(15)   RC = 83.5
D: 10(15)   10(15)   10(15)   6(8)      4(4.5)   RD = 57.5
Total of the rank sums: 210

H = 12/(20 × 21) × (25.5² + 43.5² + 83.5² + 57.5²)/5 − 3(21) = 73.26 − 63 = 10.26
By using the chi-square table, compare H-calculated with H-tabulated and decide whether the Ho is to be rejected or retained. Note that it is the Chi-square (χ²) table that must be used for the H-test as well. H-calculated = 10.26; H-tabulated (χ² at α = 0.05, df = 3) = 7.815. Since Hcal > Htab, reject Ho.
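The ranking arithmetic can be recomputed from the raw data (a sketch using average ranks for ties and the simple H formula, without a tie correction):

```python
# Kruskal-Wallis H computed from the four data sets above
data = {'A': [4, 6, 2, 6, 2],
        'B': [6, 8, 6, 10, 0],
        'C': [12, 16, 14, 8, 10],
        'D': [10, 10, 10, 6, 4]}

# Rank all observations together, giving tied values their average rank
pooled = sorted(v for group in data.values() for v in group)
rank_of = {}
for value in set(pooled):
    positions = [i + 1 for i, v in enumerate(pooled) if v == value]
    rank_of[value] = sum(positions) / len(positions)

N = len(pooled)
R = {g: sum(rank_of[v] for v in vals) for g, vals in data.items()}
H = 12 / (N * (N + 1)) * sum(r**2 / 5 for r in R.values()) - 3 * (N + 1)
print(R)             # rank sums per set (ni = 5 each)
print(round(H, 2))   # 10.26
```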
6. Chi-square (χ²) Test
This is also one of the non-parametric, or distribution-free, statistical tests; it goes back to 1900, when Karl Pearson used it for frequency data classified into k mutually exclusive categories. It is usually represented by χ², where χ is the Greek letter chi. The sampling distribution of the statistic is called the chi-square distribution, and the calculated value of χ² is compared with its critical (or table) value to know whether the Ho hypothesis is true or not. The decision to accept the Ho is based on how close the sample results are to the expected results. The data should be expressed in original units, rather than in percentage or ratio form.
Let us assume that we have a perceived value of the variance (σo²) on the basis of previous knowledge. Draw a random sample of size n (<30) from this population. On the basis of the n sample observations, the postulated value σo² of the population variance (σ²) is to be either substantiated or rejected with the help of a statistical test. The hypothesis here is:
Ho: σ² = σo² vs H1: σ² ≠ σo²
This can be tested by:
χ² = Σ(xi − x̄)² / σo²  or  χ² = (n − 1)S² / σo²
For a two-tailed test (H1: σ² ≠ σo²), reject Ho if the calculated value exceeds the table value of χ² at α/2 (or falls below the table value at 1 − α/2) with n − 1 degrees of freedom. For a one-tailed test (H1: σ² > σo²), reject Ho if the calculated value exceeds the table value of χ² at α with n − 1 degrees of freedom.
Example
Let us assume that a factory owner wants to purchase a commodity if it does not have a variance of more than 0.4 kilograms in weight. To make sure of the specification, the buyer selects 8 sample items of the commodity. The weight of each sample item was measured to be as follows.

Table 9.6:
Weight in kilograms (xi): 5      7      10     4      9      4      8      6
(xi − x̄):                −1.625  0.375  3.375  −2.625 2.375  −2.625 1.375  −0.625
(xi − x̄)²:                2.64   0.14   11.39  6.89   5.64   6.89   1.89   0.39
x̄ = 6.625;  Σ(xi − x̄)² = 35.87

χ² = Σ(xi − x̄)² / σo² = 6.9922
For α = 0.05 and n − 1 = 7 degrees of freedom, the critical value of the χ²-distribution is 14.0671. Then, since the calculated value (6.9922) is less than the tabulated value (14.0671), we should accept the null hypothesis. It means that the factory owner should purchase the commodity.
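The sum of squared deviations in the table can be reproduced in a short sketch (the postulated variance sigma0_sq is left as a parameter; the 5.0 kg² used in the example call is an illustrative value, not from the text):

```python
# Sample weights from the chi-square variance example
weights = [5, 7, 10, 4, 9, 4, 8, 6]

n = len(weights)
xbar = sum(weights) / n                      # 6.625
ss = sum((x - xbar) ** 2 for x in weights)   # sum of squared deviations

def chi_square(ss, sigma0_sq):
    """Chi-square statistic for a postulated population variance."""
    return ss / sigma0_sq

print(xbar, ss)                   # 6.625 35.875
print(chi_square(ss, 5.0))        # illustrative sigma0_sq = 5.0 kg^2
```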
Hence, the null and alternative hypotheses of population means imply that the null hypothesis should be rejected if any of the r sample means is different from the others. The assumptions for the analysis of variance are:
a. Each population has a normal distribution
b. The populations from which the samples are drawn have equal variances
c. Each sample is drawn randomly and is independent of the other samples
The first step in the analysis of variance is to partition the total variation in the sample data into the following two component variations:
a. The amount of variation among (between) the sample means, or the variation attributable to the differences among the sample means.
b. The amount of variation within the sample observations. This difference is considered to be due to chance causes or random errors.
In the analysis of variance, a table known as the ANOVA table is required and is established as follows:

Table 9.7:
Source of variation   Degrees of freedom   Sum of squares   Variance   F-value
Variation between
Variation within
Total
Example
Let us assume that 4 enumerators are sent to a market to collect data related to the price of a commodity. Then, we can apply ANOVA to test the differences among the prices collected by the enumerators.
Ho: μ1 = μ2 = μ3 = μ4
H1: μ1 ≠ μ2 ≠ μ3 ≠ μ4 (the means are not all equal)
Enumerators
        A    B    C    D
        4    6    12   10
        6    8    16   10
        2    6    14   10
        6    10   8    6
        2    0    20   4
Total   20   30   70   40
Mean    4    6    14   8
Then, find the variation within each data set, Σ(xi − x̄)², as follows:
Variation within the data set collected by A: Σ(xi − x̄)² = 16
Variation within the data set collected by B: Σ(xi − x̄)² = 56
Variation within the data set collected by C: Σ(xi − x̄)² = 80
Variation within the data set collected by D: Σ(xi − x̄)² = 32
Total variation within = 16 + 56 + 80 + 32 = 184
We also have to find the variation between the data sets, n(x̄ − grand mean)², where the mean of each data set represents the whole set and the grand mean is 160/20 = 8:
A: 5(4 − 8)² = 80
B: 5(6 − 8)² = 20
C: 5(14 − 8)² = 180
D: 5(8 − 8)² = 0
Total variation between = 280
Grand variation = total variation within (184) + total variation between (280) = 464. Finally, create an ANOVA table as follows:

Source of variation   Degrees of freedom   Sum of squares   Variance   F-value
Variation between     3                    280              93.33      8.116
Variation within      16                   184              11.50
Grand variation       19                   464
The conclusion of the ANOVA above is that, since Fcal (8.116) > Ftab (3.24), the difference is significant (the means are not all equal), so the null hypothesis is rejected.
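The ANOVA computation above can be sketched as:

```python
# One-way ANOVA for the four enumerators' price data
groups = {'A': [4, 6, 2, 6, 2],
          'B': [6, 8, 6, 10, 0],
          'C': [12, 16, 14, 8, 20],
          'D': [10, 10, 10, 6, 4]}

values = [v for g in groups.values() for v in g]
grand_mean = sum(values) / len(values)

# Sum of squares within and between groups
ss_within = sum(sum((v - sum(g)/len(g))**2 for v in g) for g in groups.values())
ss_between = sum(len(g) * (sum(g)/len(g) - grand_mean)**2 for g in groups.values())

df_between = len(groups) - 1              # 3
df_within = len(values) - len(groups)     # 16
F = (ss_between / df_between) / (ss_within / df_within)
print(ss_between, ss_within, round(F, 3))   # SSB=280, SSW=184, F close to 8.116
```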
t(n−2) = r√(n − 2) / √(1 − r²)

Where:
r = sample coefficient of correlation of n pairs of observations
ρ = population coefficient of correlation
t(n−2) = t statistic with n − 2 degrees of freedom
n = sample size

If tcal > ttab at the α level of significance and (n − 2) degrees of freedom, reject Ho. Rejection of Ho leads to the conclusion that the two variables are not independent. This means that the correlation between them is worth considering. On the other hand, if Ho is accepted, it means that the value of r is due to sampling error, whereas in reality the two variables are uncorrelated in the population too.

Example
Calculate the correlation coefficient of the following paired hypothetical data and test its significance.

Table 9.8:
Farm Households                               1   2   3   4   5   6
Crop Yield per Unit Area (in Quintals)        25  14  18  22  15  18
Fertilizer Application per Unit Area (in Kg)  70  30  30  60  30  35
Solution
1. Firstly, we should calculate the coefficient of correlation (r) for the paired data. By the methods (formula) you have learnt earlier in this course, the Pearson's correlation coefficient for the data is calculated to be +0.940.
2. State the hypothesis, i.e. Ho: ρ = 0 (there is no correlation) vs H1: ρ ≠ 0 (there exists a correlation).
3. Apply the method above to test the correlation:
t = r√(n − 2) / √(1 − r²), with r = 0.940, r² = 0.8836 and n = 6
t = 0.940 × √4 / √(1 − 0.8836) = 1.880/0.341 = 5.510
4. The critical value of t at α = 0.05 and 4 degrees of freedom is 2.776, which is less than the calculated value of t (5.510). Hence, reject Ho. Rejection of Ho leads to the conclusion that the two variables are really highly correlated. This means that the correlation between them is worth considering.
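The correlation and its t-test can be recomputed from the raw data (a sketch; 2.776 is the two-tailed critical t at α = 0.05 with 4 degrees of freedom):

```python
import math

# Paired data from Table 9.8
yield_q = [25, 14, 18, 22, 15, 18]   # crop yield (quintals)
fert_kg = [70, 30, 30, 60, 30, 35]   # fertilizer (kg)

n = len(yield_q)
my, mf = sum(yield_q)/n, sum(fert_kg)/n
sxy = sum((y - my)*(f - mf) for y, f in zip(yield_q, fert_kg))
sxx = sum((f - mf)**2 for f in fert_kg)
syy = sum((y - my)**2 for y in yield_q)
r = sxy / math.sqrt(sxx * syy)       # Pearson's correlation coefficient

# t-test for the significance of r with n-2 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(r, 3), round(t, 2))      # 0.94 5.51
print('reject Ho' if t > 2.776 else 'accept Ho')
```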
References
1. Agrawal, B.L. (2006). Basic Statistics. New Age International Publishers: New Delhi.
2. Bartlett, James E., Joe W. Kotrlik and Chadwick C. Higgins (2001). Organizational Research: Determining Appropriate Sample Size in Survey Research. Information Technology, Learning and Performance Journal, Vol. 19, No. 1. Ball State University: Muncie.
3. Best, J.W. and J.V. Kahn (2005). Research in Education. Prentice-Hall Pvt.: New Delhi.
4. Bryman, A. (2000). Quantity and Quality Research in Social Sciences. Unwin University: London.
5. Cresswell, J.W. (2003). Research Design: Qualitative, Quantitative and Mixed Methods Approaches. 2nd Edition. SAGE Publications: London.
6. Frechtling, J. et al. (Eds.) (1997). User-Friendly Handbooks for Mixed Methods Evaluation. DIANE Publishing.
7. Gurumani, N. (2007). Research Methodology for Biological Sciences. MJP Publications: Chennai.
8. Henn, M. et al. (2006). A Short Introduction to Social Research. SAGE Publications Ltd.: London.
9. Nachmias, C.F. and D. Nachmias (1996). Research Methods in Social Sciences. St. Martin's Press Inc.: London.
10. Salvatore, D. (1982). Theory and Problems of Statistics and Econometrics. Schaum's Outline Series: New York.
11. Sayer, A. (1999). Methods in Social Sciences. Routledge Inc.: London.
12. Sharma, J.K. (2004). Business Statistics. Pearson Education Ltd.: New Delhi.
13. Shipman, M. (1998). The Limitations of Social Research. Longman Group: London.
14. Spratt, C., Walker, R. and Robinson, B. (2004). Mixed Research Methods. Commonwealth of Learning.