
Quantitative Methods in Social Sciences

Unit One Background Issues


Unit objectives
Having studied this material, you should be able to:
- Understand the general concept, approaches and steps of quantitative methods in the social sciences
- Describe the role of quantitative methods in social science research
- Explain the logical structure of quantitative techniques
- Understand how concepts, models, theories and laws develop in quantitative methods
- Explain the merits and demerits of quantitative methods
- Differentiate descriptive and inferential statistics
- Formulate and test hypotheses
- Apply basic quantitative techniques in research

1.1. Research Methods in Social Sciences


Research is a scientific or critical investigation aimed at discovering facts, interpreting data or applying the techniques thus evolved to solve particular problems and answer research questions. It is an organized and systematic way of finding solutions to problems. It is systematic because there is a definite set of procedures and a series of steps which a researcher must follow; certain procedures in the research process must always be carried out in order to obtain the most accurate results. Similarly, research is organized because it is conducted according to a plan, so that it yields the most appropriate answers to problems or questions. Questions and problems are central to any research: if there is no problem or question, there is nothing to solve or answer. Hence, the end result of almost all research is finding answers to problems, research questions and hypotheses.

Research is the method to expand the frontiers of human knowledge. In geography and environmental studies, for instance, it has to aim at maintaining the environment so as to attain the well-being and betterment of the human and physical environment. The aim of any good research is to advance the frontiers of human knowledge and add valid information to what is already known.
Mesay Mulugeta, 2009 1


Therefore, research can be carried out in any field wherever there is a need to expand the horizon of human knowledge and bring about significant change for the betterment of human affairs.

At this juncture one may raise at least two questions: What is science? What is the relationship between science and research? Nachmias, C. and David Nachmias (1996: 3) answer these questions as follows: the word science is derived from the Latin scire, meaning to know. Science is difficult to define primarily because people often confuse the content of science with its methodology. Science has no particular subject matter of its own, but a distinct methodology.

Science, in this sense, refers to any systematic and highly skilled means of acquiring knowledge. Research is a scientific method or a technique for investigating phenomena and acquiring new knowledge. To be termed scientific, a method of inquiry must be based on gathering observable, empirical, and measurable data subject to specific principles of reasoning.

If so, when did humans begin to search for truth? Or when did humans start doing research? One may answer that humans started searching for truth when they began to find fruits and animals for survival (hunting and gathering) millions of years back. Others may respond that humans started research when they began to select the most appropriate animals and plants for domestication. Still others may say it was when humans began to explore the world, or when they started to identify the locations of the minerals most important for industrial purposes. In this manner, different individuals may offer different views on these questions. In any case, none of them is wrong, in that throughout history humans have tried to grasp knowledge in various ways. One of these ways, and the most recent, is scientific inquiry. Though the scientific method is not the only means of knowing, science has been helping humans to understand their environment and themselves. Science is, therefore, the best instrument for grasping knowledge through research, which involves the observation, identification, description, experimental investigation and theoretical explanation of phenomena that occur in nature.


The above paragraph briefly explains the fact that research is one of the longest-standing human activities. One should also bear in mind that any systematic and justifiable method of inquiry into the human and physical environment is a science. Research in the social sciences is therefore scientific inquiry to the extent that these fields are founded on scientific methodology, rigorous data analysis and systematic observation. The main difficulty in the human sciences is that the uniformity of nature is not a reasonable assumption in the world of human beings and their characteristics. This is mainly because of the complex nature of human beings, for whom developing sound theories is much more difficult than in the physical world. Such research also involves the environment, which is equally dynamic, and any investigation pertaining to humans or any living being cannot be treated in isolation.

Regarding this, Best and Kahn (2005) explain that research on human subjects is difficult mainly because of the following aspects of human nature:

1. No two persons are alike in feelings, drives and emotions. For instance, an event that extremely delights one individual may irritate others, or a method that we employ to approach one interviewee may not work with other respondents in a research project.
2. No one person is completely consistent from one moment to another. Human behavior is influenced by the interactions of the individual with every ever-changing element in his or her environment.
3. Human beings are influenced by the research process itself. Unlike other animals such as mice, they are influenced by the attention that is focused on them when under investigation.

Because of these factors, some scholars in the applied sciences are less confident about scientific inquiry into the nonphysical aspects of our world. Hence, they recommend applying scientific methods with greater vigor and imagination in the social sciences. It is believed that the development of scientific inquiry in the social sciences, and its application to human affairs, may offer the best solutions to some of our present and future greatest challenges, such as peace and security, human rights violations, global warming, food insecurity, polar ice-melting and global economic recession.


1.2. Essential Steps in Research


Research, as discussed hereinbefore, is a systematic way of finding solutions to problems. Here the term systematic implies that in scientific research there is always a definite set of procedures that a researcher must follow in order to arrive at justifiable, testable and accurate research outputs. Pertaining to this, Gurumani (2007) indicates that a credible research project should follow these general steps:

1. Selection of the topic and specific problem of the research
2. General or pilot survey of the field to understand the problem of the research
3. Specific survey of the pertinent literature and development of a bibliography
4. Definition of the problem, including differentiating, defining and classifying the concepts
5. Determining the parameters required to be studied towards the solution of the problem
6. Choosing the methodology to study the parameters
7. Standardizing the methodology and testing its suitability for the specific problem
8. Designing the field survey with appropriate statistical techniques and tools
9. Collecting data or information
10. Systematic classification, tabulation, presentation, analysis and interpretation of the data
11. Reporting in the form of an essay, thesis, dissertation or other research output

1.3. Research Approaches in Social Sciences


Cresswell (2003) clearly identified three approaches to social science research: quantitative, qualitative and mixed-methods approaches. Other writers, such as Best and Kahn (2005), Spratt, C., Walker, R. and Robinson, B. (2004) and Frechtling, J. et al., eds. (1997), have also tried to distinguish and explain the three research approaches, though not as clearly as Cresswell. The objective of this course is therefore to discuss in detail quantitative research techniques in social sciences such as geography, sociology and development studies.

1.3.1. Quantitative Research Approaches in Social Sciences


One of the most notable social sciences that employ quantitative techniques in research is geography. In geography, the quantitative revolution was one of the major turning points in the discipline's history. It occurred during the 1950s and 1960s in universities in Europe and the USA and marked a rapid change in geographical research. The main claim for the quantitative

revolution is that it led to a shift from a descriptive (idiographic) geography to an empirical, law-making (nomothetic) geography. This is mainly because modern geography is an all-encompassing discipline that foremost seeks to understand the earth and all of its human and natural complexities in a more scientific manner. It studies not merely where objects are, but how they are interrelated, why they are there and the socio-economic value of their being there. In the early 1950s there was a growing sense that the existing paradigm for geographical research was not adequate for explaining how physical, economic, social and political processes are spatially organized or ecologically related. A more abstract, theoretical approach to geographical research emerged, evolving the analytical method of inquiry. Look at the logical structure of the quantitative research approach in the figure below.

Figure 1.1: The logical structure of the quantitative research process

Theory or research problem
        | (deduction)
Hypothesis
        | (operationalization)
Observations / data collection
        | (data processing)
Data analysis
        | (interpretation)
Findings
        | (induction, feeding back into theory)

Source: Bryman, A (2000: p. 20)


Quantitative research in the social sciences is, therefore, a set of quantitative techniques that allow researchers to answer research questions in the discipline. These methods and techniques specialize in quantities in the sense that numbers come to represent variables such as altitude, income, rainfall, temperature, dietary energy, body weight and age. The interpretation of the numbers is viewed as strong scientific evidence of how a phenomenon works. The presence of quantities is so predominant in the quantitative research approach that statistical tools and packages are an essential element in the researcher's toolkit. The source of the data matters less in identifying an approach as quantitative than the fact that empirically derived numbers lie at the core of the scientific evidence assembled. A quantitative researcher may use archival data or gather data through tools such as interviews, questionnaires, measurements and personal observation. In all cases, the researcher is motivated by the numerical outputs and how to derive meanings from them. As indicated in the figure above (Fig. 1.1), the quantitative research process consists of at least five stages: theory, hypothesis formulation, observation or data collection, data analysis, and findings or generalization. The figure illustrates that the research process is cyclic in nature. It starts with a theory or research problem and ends with research findings or an empirical generalization; the end of one research cycle is the beginning of another. This cyclic process continues indefinitely, reflecting the self-correcting nature of scientific investigation: investigators repeatedly test the generalizations, hypotheses and findings of research problems logically and empirically.

1.4. Concepts, Models, Theories and Laws in Quantitative Research


Concept, theory and model are the three important words that occur very regularly in research texts. It is often assumed that everyone knows what these words mean and what the differences between them are. The definitions of these important terms are collected from different sources like websites and books. Here we have to bear in mind that there are a number of possible definitions for each word.

Concept
Several writers define the term concept simply as an abstract notion or idea, something that isn't concrete. It is an abstract summary of characteristics that we see as having something in common.

Concepts are created by people for the purpose of communication and efficiency. Therefore, as an educator or researcher you would be expected to review all the existing range of definitions of the term concept and decide on which you are going to use.

Theory
As with the term concept discussed above, there are many definitions of theory in the literature and electronic media. A substantial definition is the one in the ENCARTA World English Dictionary, which defines a theory as a set of facts, propositions, or principles analyzed in their relation to one another and used, especially in science, to explain phenomena; it is a set of hypotheses or principles linked by logical or mathematical arguments which is advanced to explain an area of empirical reality or a type of phenomenon. A theory, therefore, includes a set of basic assumptions and axioms as its foundation, and the body of the theory is composed of logically interrelated and empirically verifiable propositions.

Researchers use theory in a quantitative study to provide an explanation or prediction about the relationship among variables in the study. A theory explains how the variables are related, acting as a bridge between or among the variables.

Theories exist in different social science disciplines such as economics, psychology and sociology. As stated in Cresswell (2003), in quantitative research, hypotheses and research questions are often based on theories that the researcher seeks to test. Cresswell (2003) cites Kerliger (1979), who defines a theory as a set of interrelated constructs (variables), definitions, and propositions that presents a systematic view of phenomena by specifying relations among variables. In this definition a theory is an interrelated set of constructs (variables) formed into propositions or hypotheses that specify the relationships among variables, typically in terms of magnitude or direction. The systematic view may be an argument, a discussion or a rationale, and it helps to explain or predict phenomena that occur in the world. Why would an independent variable, X, influence or affect a dependent variable, Y? The theory provides the explanation for this expectation or prediction.

Theories develop when researchers test a prediction many times. When investigators test hypotheses over and over in different settings and with different populations, a theory emerges and someone gives it a name. Thus, a theory develops as an explanation to advance knowledge in a particular field.

Another aspect of a theory is that it varies in its breadth of coverage. Theories can be classified into micro-level, meso-level and macro-level theories. Micro-level theories provide explanations limited to small slices of time, space and variables, while meso-level theories integrate micro-level theories of organizations, social movements or communities. Macro-level theories explain larger aggregates such as social institutions, cultural systems and whole societies. As stated by Cresswell (2003), the following procedure can be used to write a quantitative theoretical perspective section into a research plan: review the related literature; find out what theories were used by other investigators or researchers in your area of study; ask why the independent variable(s) should affect the dependent variable(s); and script out the theory section. Hence, a researcher or investigator must intensively read related research works and books before proceeding to build a theory or hypotheses for his or her research.

Model
The ENCARTA World English Dictionary defines a model as an interpretation of a theory arrived at by assigning referents in such a way as to make the theory true; it is a simplified version of something complex used in analyzing and solving problems or making predictions. A model may be an analogue model, for instance a globe as a model of the earth, or a symbolic model, which is based on logic and the inter-relationships between concepts and is usually expressed mathematically or algebraically. Symbolic models are concerned with quantification. For instance, the regression model below is used to estimate or predict values in a data set, and the correlation model below helps us to analyze the strength and direction of a linear relationship between an independent and a dependent variable. There are many such symbolic models in the fields of human and natural sciences. Examples of symbolic models:

Regression model:

    Y = A + BX1 + CX2 + DX3 + ... + NXn

Correlation analysis model (Pearson's r):

    r = [NΣXY − (ΣX)(ΣY)] / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}
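To see how these symbolic models operate on numbers, the sketch below computes Pearson's r from the raw-score correlation formula and fits the simplest one-predictor case of the regression model (Y = A + BX) by least squares. All data values here are hypothetical, chosen only for illustration; with several predictors the coefficients would normally be obtained with a statistical package rather than by hand.

```python
import math

# Hypothetical paired observations (say, rainfall X vs crop yield Y)
x = [10, 20, 30, 40, 50]
y = [12, 24, 33, 46, 55]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Raw-score formula for Pearson's r, as in the correlation model above
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Least-squares slope (B) and intercept (A) for the one-predictor model Y = A + BX
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(round(r, 4), round(a, 2), round(b, 2))
```

Note that the numerator of r is the same quantity as the numerator of the slope B, which is why a strong linear relationship and a well-determined slope go together.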


There is also what we call a conceptual model. A conceptual model is composed of a pattern of interrelated concepts, is not expressed in mathematical form and is primarily not concerned with quantification. Maps, graphs, charts, balance sheets, circuit diagrams and flowcharts are often used to represent such models.

Law
Another word associated with concepts, theories and models is law. A law in research is a precise statement of a relationship among facts that has been repeatedly corroborated by scientific investigation and is generally accepted as accurate by experts in the field. Laws are generally derived from a theory. A law is frequently referred to as a universal and predictive statement: universal in the sense that the stated relationship is held always to occur under the specified conditions, and predictive in that, whenever those conditions hold, the outcome may be predicted to follow.

1.5. Hypothesis Formulation in Quantitative Methods


In quantitative research, investigators use research questions or hypotheses to shape and sharply focus the purpose of the study. Research questions are interrogative statements that the investigator seeks to answer. Hypotheses, on the other hand, are propositions or predictions, or sets of propositions, set forth as an explanation for the occurrence of some specified group of phenomena, either asserted merely as a provisional conjecture to guide investigation or accepted as highly probable in the light of established facts. They are predictions that the researcher holds about the relationships among variables, tested through numeric estimates of population values based on the data collected from samples. A hypothesis requires further work by the researcher in order to either confirm or disprove it. In due course, a confirmed hypothesis may become part of a theory or occasionally may grow to become a theory itself. Testing hypotheses requires statistical procedures by which the investigator draws inferences about the population from a study sample; techniques of hypothesis testing will be discussed in Unit 8 of this teaching material. Cresswell (2003) puts forward the following guidelines for writing good quantitative research questions and hypotheses:

1. Follow one of three basic approaches: (a) compare the variables, i.e. the impact of the independent variable on the dependent variable; (b) relate the variables; or (c) describe the variables.

2. Specify the hypothesis.
3. Use either research questions or hypotheses, not both, to eliminate redundancy.
4. State the hypothesis in either null or alternative form. The null hypothesis (H0) is a statistical hypothesis stating that there is no difference between observed and expected data. The alternative hypothesis (H1) makes a prediction about the outcome for the population of the study; it can be directional (using words like higher, less, better, etc.) or non-directional, the latter being formulated when the researcher cannot predict the direction of the outcome from past literature.

1.6. Merits and Demerits of Quantitative Methods

Merits:
- Examines the relationships between and among variables critically
- Answers research questions through surveys and experiments
- Provides measures or observations to test theories and hypotheses
- Leads to meaningful interpretation of quantitative data
- Provides more empirical data analysis techniques than qualitative methods
- Tends to be more valid and reliable
- Is relatively more free of the motivations, feelings, opinions and attitudes of the individuals carrying out the research and of those participating in it

Demerits:
- Collects a much narrower and sometimes superficial dataset
- Results are limited, as they provide numerical descriptions rather than detailed narrative, and generally give less elaborate accounts of human perception
- Often carried out in an unnatural, artificial environment so that a level of control can be applied to the exercise
- Preset answers will not necessarily reflect how people really feel about a subject and in some cases might just be the closest match
- Overlooks the motivations, feelings, opinions and attitudes of the individuals carrying out the research and of those participating in it


1.7. Quantitative vs Qualitative Methods


In most social sciences the use of quantitative or qualitative methods has become a matter of controversy and even ideology, with particular schools of thought within each discipline favoring one type of method and pouring scorn on the other. Advocates of quantitative methods argue that only by using such methods can the social sciences become truly scientific. Advocates of qualitative methods, on the other hand, argue that quantitative methods tend to obscure the reality of the social phenomena under study because they underestimate or neglect non-measurable factors, which may be the most important ones. The modern tendency (and in reality the majority tendency throughout the history of social science) is to use eclectic approaches: quantitative methods might be used within a global qualitative frame, and qualitative methods might be used to understand the meaning of the numbers produced by quantitative methods. Using quantitative methods, it is possible to give precise and testable expression to qualitative ideas. This combination of quantitative and qualitative data gathering is often referred to as mixed-methods research. Quantitative methods have been available to social and human scientists for years, while qualitative methods emerged primarily during the last three or four decades. Mixed methods are newer still, and their form and substance are still developing.

1.8. Descriptive vs Inferential Statistics


Statistics as a subject is broken into two branches: descriptive statistics and inferential statistics. Descriptive statistics involves collecting, organizing, summarizing and presenting data. Inferential statistics involves making inferences from the analyzed data: testing hypotheses, determining relationships and making predictions. Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphic analysis, they form the basis of virtually every quantitative analysis of data. Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics you are simply describing what is, or what the data show. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from sample data what the population might think. Or, we use inferential

statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data. For example, let's say we have data on the incomes of 20 instructors at Adama University. These data can be summarized by finding the average income of those 20 instructors, and we could describe how far each income is above or below the average. We could also go into Excel or SPSS and construct a table with these data, or make a pie chart or bar chart, or perhaps a frequency distribution of the number or proportion of instructors in each class or range. This is descriptive statistics! Now, if this group is representative of the whole university, we could estimate and test various hypotheses generalizing from these 20 instructors' average income to the university as a whole. These conclusions will be subject to some error, and we could even quantify this probability of error. We are now inferring, so this would be inferential statistics.
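The instructors'-income example can be sketched in a few lines of Python using only the standard library. The income figures below are hypothetical, invented purely for illustration; the descriptive step summarizes the 20 observations, and the inferential step computes a rough 95% confidence interval for the university-wide mean under the assumption that the 20 instructors are a random sample.

```python
import math
import statistics

# Hypothetical monthly incomes (birr) of 20 instructors
incomes = [3200, 3400, 2900, 4100, 3800, 3000, 3600, 3300, 2800, 4500,
           3100, 3700, 3900, 2950, 3450, 3550, 4200, 3250, 3650, 3850]

# Descriptive statistics: summarizing the data at hand
mean = statistics.mean(incomes)
sd = statistics.stdev(incomes)              # sample standard deviation
deviations = [x - mean for x in incomes]    # each income above/below the average

# Inferential statistics: a rough 95% confidence interval for the
# university-wide mean income (1.96 is the large-sample normal critical
# value; a t critical value would give a slightly wider interval for n = 20)
se = sd / math.sqrt(len(incomes))
ci = (mean - 1.96 * se, mean + 1.96 * se)

print(mean, round(sd, 1), ci)
```

Everything up to `deviations` merely describes the 20 instructors; the confidence interval is the inferential step, because it makes a claim about instructors who were never observed.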


Unit Two Methods of Quantitative Data Collection


Unit objectives
Having studied this unit, you should be able to:
- Understand the rationale for sampling
- Explain the merits and demerits of sampling
- Define key terms in sampling such as sample, population/universe, sampling error and sampling frame
- Determine an appropriate sample size for any project work, such as a senior essay
- Describe different sampling processes and techniques
- Appreciate the use of sampling in geographic research

2.1. Introduction
Quantitative research usually employs surveying or measuring to collect data; it depends upon quantities or quantifiable variables. In order to collect quantified data, this research approach usually employs at least one of the sampling techniques that will be discussed in detail in this unit. Sampling is a statistical practice concerned with the selection of some representatives from a population or universe, intended to yield generalized characteristics of that population while minimizing the computational work. Socio-economic variables such as yield per hectare, landholding, daily income, number of oxen per rural household, origin of immigrants and reason for migration, family size, rainfall and other related data about a very large entity can be obtained through sampling and used for statistical inference. Let's begin by covering some of the key terms in sampling: population, sample, sampling frame, sampling error, non-sampling error, parameter and statistic.


2.2. Population/Universe
In statistics, the term population/universe is used in a different sense from its literal sense. A population is the entire collection of people, animals, plants or things from which we may select sample data. It is the entire group we are interested in, which we wish to describe or draw conclusions about. For instance, if you want to study the food security status of farm households in a woreda having 1200 farm households, you may contact only 8% of the total, which is 96 households. Here the 1200 farm households in the woreda constitute the population and the 96 selected households are your sample.
In order to make any generalizations about a population, a sample, that is meant to be representative of the population, is often studied. A sample statistic gives information about the corresponding population parameter. For example, it is assumed that a sample mean for a set of data would give information about the overall population mean.

2.3. Samples
A sample is a group of items selected from a larger group (the population/universe) for statistical analysis. The sample is intended to be as nearly as possible a perfect representative of the general population; by studying the sample it is hoped to draw valid conclusions about the larger group. A sample is generally selected for study because the population may be too large to study in its entirety, or because it may be too costly and time-consuming to deal with every member of the population under consideration. Before selecting the sample, it is important that the researcher carefully and completely define the population, including a description of the members to be included. At this juncture, you may raise questions like: What is the advantage of sampling? Why shouldn't we study the whole population rather than limiting ourselves to the data we obtain from a certain proportion of it? You will get answers to these questions after you complete the discussion of sampling with your instructor in this unit.
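The woreda example above, where 96 of 1200 farm households are selected, can be sketched as a simple random draw. The household IDs are hypothetical placeholders; in practice the list would come from an actual sampling frame such as a household register.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: ID numbers of 1200 farm households in a woreda
population = list(range(1, 1201))

# Draw an 8% simple random sample, as in the example above: 96 households
sample_size = int(0.08 * len(population))
sample = random.sample(population, sample_size)   # without replacement

print(sample_size, sorted(sample)[:5])
```

`random.sample` draws without replacement, so no household can appear in the sample twice, which matches how household surveys are normally conducted.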


2.4. Sampling Frame


The sampling frame is the actual set of units from which a sample is drawn. In other words, it is the listing or display of the entire accessible population from which the researcher draws his or her sample. Consider, for example, a survey aimed at studying the asset position of rural households in a woreda. In this example, the population of interest is all rural households of the woreda, and a possible sampling frame is a list of all rural households in the woreda under investigation. In defining the frame, practical, economic, ethical, spatial and technical issues need to be addressed. The need to obtain timely results may prevent extending the frame far into the future. The difficulties can be extreme when the population and the frame are disjoint; this causes particular problems in forecasting, where inferences about the future are made from present data.

2.5. Parameter and Statistic


A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population characteristic. For example, the population mean is a parameter that is often used to indicate the average condition of the population with respect to a given attribute. For a population, a specific parameter is a fixed value which does not vary, whereas each sample set drawn from the population will have its own value of the corresponding statistic, with slight dissimilarities between the value for one sample set and another.

A statistic is a figure calculated from sample data. The mean, median, variance, standard deviation and skewness of a sample set of data are statistics. A statistic is used to give information about unknown values in the corresponding population. For example, the mean of the data in a sample is used to give information about the overall mean in the population from which that sample was drawn. Parameters are often denoted by Greek letters (e.g. μ and σ), whereas statistics are denoted by Roman letters (e.g. m and s).
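The distinction between a fixed parameter and a varying statistic can be made concrete with a small simulation. The population of landholdings below is synthetic, generated only for illustration: its mean μ is a single fixed parameter, while each of the five samples yields its own slightly different sample mean m.

```python
import random
import statistics

random.seed(0)  # reproducible illustration

# Hypothetical population of 1000 household landholdings (hectares).
# Its mean is a PARAMETER: one fixed number, usually unknown in practice.
population = [random.uniform(0.25, 3.0) for _ in range(1000)]
mu = statistics.mean(population)

# Each sample yields its own STATISTIC (here, the sample mean),
# which varies from one sample to the next around the parameter.
sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(5)]

print(round(mu, 3), [round(m, 3) for m in sample_means])
```

Running this shows five different sample means clustered around the one population mean, which is exactly the parameter/statistic relationship described above.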

2.6. Errors in Sampling


A researcher may commit at least two types of errors during sampling. He or she should, therefore, take great care at the sampling stage of the research so as to avoid committing sampling errors and drawing wrong samples, which may lead to wrong conclusions. The two major types of error in sampling are known as sampling error and non-sampling (measurement) error.

2.6.1. Sampling Error


Sampling errors may happen simply because of sampling itself or due to a certain bias towards particular parameters. Sampling error is the discrepancy between the population parameter and the sample statistic. The discrepancy generally decreases as the sample size increases, becoming negligible for sufficiently large samples. Hence a sample of optimum size must be obtained for a study.
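The shrinking of sampling error with sample size can be demonstrated with a short simulation. This sketch uses made-up data (a synthetic population of 10,000 values), not figures from the text:

```python
import random
import statistics

random.seed(0)
population = [random.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Average absolute sampling error over 200 repeated samples, per sample size
avg_error = {}
for n in (10, 100, 1000):
    errors = [abs(statistics.mean(random.sample(population, n)) - true_mean)
              for _ in range(200)]
    avg_error[n] = statistics.mean(errors)
print(avg_error)  # the discrepancy shrinks as n grows
```

Running this shows the average discrepancy between the sample mean and the population mean falling steadily as the sample size rises from 10 to 1000.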

What is the optimum sample size or proportion? You will see in the discussions hereinafter. There are two basic causes of sampling error. One is the error that occurs just because of chance; some literature calls this bad chance. This may result in untypical choices. Unusual units (extremely small or large units) in a population do exist, and there is always a possibility that an abnormally large or small number of them will be chosen. For example, in collecting woreda-level data for your BA Thesis, you may unluckily select all the well-to-do farm households in the woreda into your sample. You may select your samples randomly, but all the rich households in the whole population, which have the highest crop yield per year, may be selected, making the sample average far higher than it should be. Here you may ask: how can I guard against such bad chance during fieldwork? The main protection against this kind of error is to use a sample size that is large enough. The second cause of sampling error is sampling bias. Sampling bias is a tendency to favor the selection of units/items that have particular characteristics, and it is usually the result of a poor sampling plan. The most notable is the bias of non-response, when for some reason some units have no chance of appearing in the sample. A means of selecting the units of analysis must be designed to avoid the most obvious forms of bias. For example, when you would like to know the average income of the residents of a town, you may
decide to use mobile telephone numbers to select a sample from the total population in a locality where only the well-to-do social class households (in the Ethiopian case) own mobile telephones. You will then end up with an inflated average income, which will lead to wrong conclusions in your findings. Therefore, you must be very careful to select your research samples free of any bias.

2.6.2. Non-sampling error (Measurement error)


The other main cause of unrepresentative samples is what is known as non-sampling error. This type of error can occur whether a census or a sample is being used. Like sampling error, non-sampling error may either be produced by participants in the statistical study or be an innocent by-product of the sampling plans and procedures. A non-sampling error is an error that results solely from the manner in which the observations are made. The simplest example of non-sampling error is inaccurate measurement due to malfunctioning instruments or poor procedures. For example, in the case of crop yield data for farm households in Ethiopia, if persons are asked to state their annual production, they may answer in terms of traditional measuring tools like qunna or enqib, which may vary in size from household to household or from place to place, as a result of which the answers will not be of equal reliability. Responses, therefore, will not be of comparable validity unless the information from all persons is weighed with the same measuring tools and under the same circumstances. Errors due to inaccurate measurement may happen innocently but can be devastating, leading to extremely wrong research findings, conclusions and recommendations. Therefore, we have to take care of the accuracy and proper functioning of our instruments, such as altimeters, weighing machines and GPS receivers, while collecting any primary socio-economic data for our research or any project work. Related to this, a story is told of a French astronomer who once proposed a new theory based on spectroscopic measurements of light emitted by a particular star. When his colleagues discovered that the measuring instrument had been contaminated by cigarette smoke, they rejected his findings.


There are also other significant factors which result in errors and reduce the quality of data. These are:

1. Interviewer's effect: No two interviewers are alike, and the same person may provide different answers to different interviewers. The manner in which a question is formulated can also result in inaccurate responses. Individuals tend to provide false answers to particular questions. For example, some people want to feel younger or older for some reason known to them. If you ask such a person her/his age in years, it is easy for the individual to lie by overstating or understating her/his age by some years. But if you ask which year s/he was born, s/he may give a considerably more accurate figure, since it would require a bit of quick arithmetic to produce a false date.

2. Respondents' effect: Respondents might also give incorrect answers to impress the interviewer. This type of error is the most difficult to prevent because it results from outright deceit (dishonesty) on the part of the respondents or interviewees. An example of this is what I witnessed during my MA Thesis study, in which I asked farmers how much crop they harvested in the previous year, 2001. In most cases, the household heads tended to lie by reporting only a very small amount of yield per year. I then tried my best to convince them to tell me more accurate figures.

3. Knowing the study purpose: This is the case when respondents give wrong data solely because they are aware of why a study is being conducted. A good example can be a question on income. If a government agency is asking, a different figure may be provided than the respondents would give to a neutral or purely academic researcher. One way to guard against such bias is to make the questions very specific, allowing no room for personal interpretation. For example, the question "Are you employed?" could be followed by "What is your salary?", "Do you have any extra income?" and, if yes, "How much is it?" A sequence of such questions may produce more accurate information than directly asking "What is your monthly income?"
Error and cost are competing concerns in sampling, because reducing error often requires an increased expenditure of resources such as time, finance and human power. Of the two types of statistical errors, only sampling error can be controlled by exercising care in determining the appropriate method for choosing the sample.
The above discussion has shown that sampling error may be due to either bias or chance. The chance component (sometimes called random error) exists no matter how carefully the selection procedures are implemented, and the only way to minimize chance sampling errors is to select an adequately large sample size. Sampling bias, on the other hand, may be minimized by the wise choice of a sampling procedure.

2.7. Stages in Sampling Process


Most of the literature recommends the following stages when sampling to collect primary data:
1. Defining the concerned population
2. Specifying a sampling frame, a set of items or events possible to measure
3. Specifying a sampling method for selecting items or events from the frame
4. Determining the sample size
5. Pilot testing the questionnaire
6. Administering the questionnaire or data collection

2.8. Sampling Techniques


Sampling methods are classified into probability and non-probability sampling techniques. In probability sampling, each member of the population has a known non-zero probability of being selected. Probability methods include random sampling, systematic sampling and stratified sampling. In non-probability sampling, members are selected from the population in some non-random manner; such methods include convenience sampling, judgment sampling, quota sampling and snowball sampling. The advantage of probability sampling is that every item in the population has a known chance of being selected, so the sampling error can be estimated.

2.8.1. Simple Random Sampling


In simple random sampling, every member of the population has an equal and independent chance of being selected. The selection of one observation does not affect the chance of other observations being selected. Each element of the sampling frame has an equal probability of selection if the frame is not subdivided or partitioned. One disadvantage of this method is that all members of the population have to be available for selection.
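As an illustration (with a hypothetical frame, not data from the text), drawing a simple random sample from a numbered sampling frame can be sketched in Python:

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units from a sampling frame, each with an equal and
    independent chance of selection (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical frame: 2,000 households serially numbered 1 to 2000
frame = list(range(1, 2001))
sample = simple_random_sample(frame, 100, seed=42)
print(len(sample), len(set(sample)))  # 100 100: all units distinct
```

Note that sampling without replacement guarantees no household appears twice in the sample.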
Figure 2.1: Classification of Sampling Methods

Probability Sampling (usually for quantitative methods):
- Simple random sampling
- Systematic sampling
- Stratified random sampling
- Cluster sampling
- Multistage sampling

Non-probability Sampling (usually for qualitative methods):
- Convenience sampling
- Judgment/Purposive sampling
- Quota sampling
- Snowball sampling
- Volunteer sampling

2.8.2. Systematic Sampling


Systematic sampling is also called a kth item selection technique. After the required sample size has been calculated, every kth record is selected from a list of population members. Selecting, say, every 10th name from the telephone directory is called an every-10th sample, which is an example of systematic sampling. It is a type of probability sampling as long as the starting point is chosen at random and the list itself is not ordered in a way that biases the selection. It is easy to implement, and the stratification induced can make it efficient, but it is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple of the sampling interval (here, 10), then bias will result. It is important that the first name chosen is not simply the first in the list, but is chosen at random, say the 7th, where 7 is a random integer.
Sampling interval (k) = Population Size / Sample Size

Example: Let us assume that you want to study certain characteristics of rural households in Y
woreda. Let the total farm households in the woreda be 2000 serially numbered from 1 to 2000 in alphabetical order. Firstly, you have to decide your sample size based on the criteria you have read above. Let it be 100. Divide the population (2000) by sample size (100). i.e. 2000 divided by 100
gives you 20, which is the sampling interval: every 20th item is selected, starting from a first item identified randomly. Now draw a random number from 1 to 20. Suppose number 13 is selected. Then select your samples as follows:

Calculation              Serial No. of the sample households from the list
13                       13th
13 + 20 = 33             33rd
13 + 2*20 = 53           53rd
13 + 3*20 = 73           73rd
13 + 4*20 = 93           93rd
13 + 5*20 = 113          113th
13 + 6*20 = 133          133rd
...

Continue until you select the predetermined 100 samples.
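The worked example above can be sketched as a short function. This is illustrative only; the population size, sample size and random start mirror the example:

```python
def systematic_sample(population_size, sample_size, start):
    """Select every k-th serial number beginning at a random start,
    where k = population_size // sample_size and 1 <= start <= k."""
    k = population_size // sample_size  # sampling interval
    return [start + i * k for i in range(sample_size)]

# The worked example: N = 2000, n = 100, k = 20, random start = 13
serials = systematic_sample(2000, 100, start=13)
print(serials[:5], serials[-1])  # [13, 33, 53, 73, 93] 1993
```

In practice the start would come from a random draw between 1 and k rather than being fixed at 13.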

2.8.3. Stratified Random Sampling


Stratified sampling is one of the most commonly used probability sampling methods and is superior to simple random sampling because it reduces sampling error. A stratum is a subset of the population that shares at least one common characteristic. Examples of strata might be males and females, educated and uneducated, or rural and urban. The researcher first identifies the relevant strata and their actual representation in the population. Random sampling is then used to select a sufficient number of subjects/items from each stratum. Sufficient here refers to a sample size large enough for the researcher to be reasonably confident that the stratum represents the population. Stratified sampling is often used when one or more of the strata in the population have a low incidence relative to the others. At this stage, attention should be given to the appropriate proportional representation of items in each stratum based on characteristics such as total population size and area. Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate strata. A sample is then selected from each stratum separately, producing a stratified sample.


In multistage stratification, a first attribute, say the gender of the household heads, divides the universe/population into two groups. At the second stage, monthly family income can be introduced to form sub-classes. The population can be further sub-divided by introducing other characteristics one by one. In other cases, the study area can be divided into agro-climatic zones, which can be further divided into woredas, then kebeles, villages and so on. The two main reasons for using a stratified sampling design are (1) to ensure that every group within the population is adequately represented in the sample, and (2) to improve efficiency by gaining greater control over the composition of the sample, since the sample size in each stratum is usually proportional to the relative size of that stratum.

Example: Let us assume that you want to study certain characteristics of urban households in one of the towns in Ethiopia. Firstly, you divide the whole set of urban households into a number of homogeneous groups, called strata, on the basis of certain characteristics of the households. For example, they may be divided into homogeneous groups according to sex of the household head, age, family income, or any other available information. In fact, at this stage we need pre-documented information about each household. Finally, from each homogeneous group (stratum) you select the required number of sample households randomly and distribute your questionnaire or administer an interview.
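Proportional allocation across strata can be sketched as follows. The strata and their sizes here are hypothetical, chosen only to illustrate the arithmetic:

```python
def proportional_allocation(strata_sizes, total_sample):
    """Allocate the sample across strata in proportion to stratum size."""
    N = sum(strata_sizes.values())
    return {name: round(total_sample * size / N)
            for name, size in strata_sizes.items()}

# Hypothetical town: 2,000 households stratified by sex of household head
strata = {"male-headed": 1500, "female-headed": 500}
print(proportional_allocation(strata, 100))
# {'male-headed': 75, 'female-headed': 25}
```

After allocation, a simple random sample of the stated size would be drawn within each stratum.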

2.8.4. Cluster or Area Sampling


Cluster sampling is an example of two-stage or multistage sampling: in the first stage a sample of clusters is chosen, while in the second stage a sample of respondents within those clusters is selected. This can reduce travel and other administrative costs. It also means that one does not need a sampling frame for the entire population, but only for the selected clusters. Cluster sampling generally increases the variability of sample estimates above that of simple random sampling, depending on how much the clusters differ between themselves compared with the within-cluster variation.

Example: Let us assume that we want to know the amount of crop production in a woreda during a specific crop year. It may be very costly to contact each and every farming household in the
woreda. Then we can divide the woreda into smaller segments (e.g. kebeles), known as primary units in statistics. Now we can randomly or purposely take a reasonable and representative number of the segments (primary units) and contact each household in the selected segments.
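A minimal sketch of the two stages, assuming a hypothetical woreda of ten kebeles with known household lists:

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Stage 1: choose clusters at random; stage 2: survey every unit
    inside each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical woreda: 10 kebeles, each listing 40 household IDs
kebeles = {f"kebele_{i}": list(range(i * 100, i * 100 + 40)) for i in range(10)}
selected = cluster_sample(kebeles, 3, seed=7)
print(sorted(selected))  # 3 kebeles; all 40 households in each are surveyed
```

Only the selected kebeles need a complete sampling frame, which is the cost advantage described above.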

2.8.5. Multistage Sampling


Multistage sampling is a complex form of cluster sampling. Using all the sample elements in all the selected clusters may be prohibitively expensive or unnecessary. Under these circumstances, multistage cluster sampling becomes useful. Instead of using all the items contained in the selected clusters, the researcher randomly selects items from each cluster. Constructing the clusters is the first stage; deciding which items within each cluster to use is the second stage. The technique is used frequently when a complete list of all members of the population does not exist or is inappropriate.

Example: In the example given for cluster sampling above, if the selected clusters (kebeles in the woreda) are homogeneous, or if all the elements of a selected cluster would give the same response, it is advisable not to study the whole cluster, as this increases the cost and time of the study. Hence, if the cluster (kebele) is homogeneous and each unit in the population is divisible into a number of smaller units (e.g. sub-kebeles and then villages), multistage sampling is recommended. Firstly, we randomly or purposely take a reasonable number of kebeles from the woreda; the kebeles are now called 1st stage units. Then we randomly select a reasonable number of sub-kebeles (2nd stage units), then villages (3rd stage units), and finally we take appropriate farm households from the selected villages. The sample households are termed 4th stage units.

2.8.6. Quota sampling


Quota sampling is one of the non-probability sampling techniques. In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. In quota sampling the selection of the sample is non-random. For example, interviewers might be tempted to interview those individuals who look most helpful to them. The problem is that these samples may be biased because not everyone gets a chance of being selected. This non-random element
is its greatest weakness, and quota versus probability sampling has been a matter of controversy for many years.

2.8.7. Convenience sampling


Convenience sampling, sometimes called opportunity sampling, is used in exploratory research where the researcher is interested in getting an inexpensive approximation of the truth. As the name implies, the samples are selected because they are convenient to the researcher. This non-probability method is often used during preliminary research efforts to get a general estimate of the results without incurring the cost or time required to select a random sample.

2.8.8. Judgment/Purposive sampling


Judgment sampling is one of the commonest non-probability sampling techniques. The researcher selects the sample based on his/her own judgment. This is usually an extension of convenience sampling technique. For example, a researcher who wants to study the status of urban poverty in Ethiopia may decide to draw the entire sample from one representative city in the country. Addis Ababa may be purposively selected for such a study even though the population includes all cities/towns in Ethiopia such as Adama, Bahr Dar, Mekele, Gonder, Dessie, Nekemte and Awasa. Note that when using this method, the researcher must be confident that the chosen sample is truly representative of the entire population.

2.8.9. Snowball sampling


Snowball sampling is a special non-probability sampling technique used when the desired sample characteristic is rare or little known to the researcher. It may be extremely difficult or cost-prohibitive to locate respondents in these situations. Snowball sampling relies on referrals from initial subjects to generate additional subjects, hence the term snowball. While this technique can dramatically lower search costs, it comes at the expense of introducing bias, because the technique itself reduces the likelihood that the sample will represent a good cross-section of the population. For instance, if you want to select chat-addicted individuals from the total residents of a kebele, your initial contact should be a person who can effectively help you to reach one or more such individuals (other required chat-addicted individuals or samples) in the kebele. As you begin to reach other such respondents, the importance of the initially contacted person diminishes, melting away like snow.
2.8.10. Volunteer Sampling


Volunteer sampling is a technique in which only volunteers are included in the sample. For instance, this may be the method through which Ethiopian television and radio stations contact and interview experts on current political, economic and social affairs of the country.

2.8.11. Spatial or Grid Sampling


This is a form of cluster sampling, a cluster being individual areas of a grid and hence consisting of groups of basic grid cells arranged in some standard geometric pattern. In grid sampling we should divide the field into squares or rectangles of equal size, usually referred to as grid cells. The location of each grid-cell is usually geo-referenced using global positioning system technology. Samples, such as plant, soil and rock samples, are then collected from each of the grid-cells.
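Laying out the grid cells described above can be sketched as follows. The field dimensions and cell size are hypothetical:

```python
def grid_cells(field_width, field_height, cell_size):
    """Return the (x, y) origin of each square grid cell covering a field."""
    return [(x, y)
            for x in range(0, field_width, cell_size)
            for y in range(0, field_height, cell_size)]

# Hypothetical 100m x 60m field divided into 20m cells: 5 x 3 = 15 cells
cells = grid_cells(100, 60, 20)
print(len(cells))  # 15
```

In a real survey each origin would be geo-referenced with GPS and a sample collected within each cell.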

2.9. Sample Size


One of the most challenging and frequently asked questions concerning sampling in research is the question of sample size: how large a sample is required to infer research findings back to a population? The answer to this question is influenced by a number of factors, including the purpose of the study, population size, budget, time and the allowable sampling error.

Moreover, the three most important criteria that need to be specified to determine the appropriate sample size are the level of precision, the level of confidence or risk, and the degree of variability in the attributes being measured. In the case of an inadequate sample size, the researcher should report both the statistically appropriate sample size and the sample size actually used in the study. This allows the reader to make her/his own judgment as to whether s/he accepts the researcher's assumptions and procedures. Finally, regarding sample size, you should note that an adequate sample size combined with high-quality data collection efforts may yield more reliable, valid and generalizable results than studies conducted with the entire population or census data.

Level of precision
The level of precision, sometimes called sampling error, is the range in which the true value of the population is estimated to lie. This range is often expressed in percentage points (e.g. ±5 percent). For instance, if a researcher finds that 60% of the farmers in the sample have adopted a
recommended practice with a precision rate of 5%, then s/he can conclude that between 55% and 65% of the farmers in the population have adopted the practice.

Confidence level
The confidence or risk level is based on the ideas encompassed in the Central Limit Theorem. The key idea of the Central Limit Theorem is that when a population is repeatedly sampled, the average value of the attribute obtained from those samples equals the true population value. Furthermore, the values obtained by these samples are distributed normally about the true value, with some samples yielding a higher value and some a lower value than the true population value. In a normal distribution, approximately 95% of the sample means lie within two standard errors of the true population value (e.g., the mean). In other words, if a 95% confidence level is selected, 95 out of 100 samples will yield an estimate within the stated margin of the true population value. There is always a chance that the sample you obtain does not represent the true population value.
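This 95-out-of-100 interpretation can be checked with a small simulation on a synthetic population (illustrative data, not from the text):

```python
import random
import statistics

random.seed(1)
population = [random.gauss(100, 15) for _ in range(50_000)]
mu = statistics.mean(population)

# Count how often a 95% interval around the sample mean captures mu
covered = 0
for _ in range(1_000):
    sample = random.sample(population, 100)
    m, se = statistics.mean(sample), statistics.stdev(sample) / 100 ** 0.5
    if m - 1.96 * se <= mu <= m + 1.96 * se:
        covered += 1
print(covered / 1_000)  # close to 0.95
```

Roughly 95% of the 1,000 repeated samples produce an interval containing the true population mean, as the theorem predicts.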

Degree of Variability
The third criterion, the degree of variability in the attributes being measured, refers to the distribution of attributes in the population. The more heterogeneous a population, the larger the sample size required to obtain a given level of precision.

2.10. Strategies for Determining Size of Respondents


Determination of the sample size to be selected from a population is one of the pivotal issues that arise during research. In fact, the sample size depends on several factors: the purpose of the study, the type of population (varied or identical), the availability of equipment and technical people, the resources allotted to the study in terms of time and money, and the level of precision required. Here we should bear in mind that determining sample size is a very important issue, because samples that are too large may waste time, resources and money, while samples that are too small may lead to inaccurate results. There are several approaches to determining the size of the respondents. These include using a census for small populations, imitating the sample size of similar studies, using published tables, and applying formulas to calculate a sample size. Each strategy is discussed briefly below.
Using a Census for Small Populations


One approach is a census survey, in which we consider the entire population as respondents. It is very difficult, and sometimes impossible, to carry out a census survey for a large population, as it consumes time and money. However, for a small population (e.g. 200 or fewer), a census yields more precise research output than sampling does, as it reaches each and every member of the population under study. A census survey also eliminates sampling error and provides data on all the individuals in the population.

Using a Sample Size of a Similar Study


Another approach is to use the same sample size as other methodologically sound studies similar to the one you plan to carry out. In this case you must be very careful: you may run the risk of repeating errors that were made in determining the sample size for the earlier study if you fail to review deeply the procedures it employed. Nonetheless, a review of the literature in your discipline can provide guidance about the sample sizes that are typically used.

Sample Size Determination Table


This is a table created by scholars in this field of study to be used to determine the appropriate sample size for a research based on three alpha levels (α = 0.10, α = 0.05 and α = 0.01) and margins of error.

Using Formulas to Calculate a Sample Size


Although tables can provide a useful guide for determining the sample size, you may need to calculate the necessary sample size for a different combination of levels of precision, confidence and variability. The fourth approach to determine sample size is the application of one of the several formulas such as:

n = N / (1 + N(e)^2)

Where:

n = sample size
N = population size
e = level of precision
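This formula, commonly attributed to Yamane, is straightforward to compute. A minimal sketch with hypothetical inputs:

```python
import math

def yamane_sample_size(N, e):
    """Sample size n = N / (1 + N * e**2), rounded up to a whole unit."""
    return math.ceil(N / (1 + N * e ** 2))

# A hypothetical population of 2,000 households at a +/-5% level of precision
print(yamane_sample_size(2000, 0.05))  # 334
```

Note how weakly the result depends on N once the population is large: the answer is dominated by the precision level e.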


Another formula to determine sample size may be

n0 = (Z^2 * σ^2) / e^2

Where:

n0 = sample size
Z = the abscissa of the normal curve that cuts off an area α at the tails
e = the desired level of precision, in the same unit of measure as the variance
σ^2 = the variance of an attribute in the population

The above mathematical formula can also be rewritten as follows to determine the required sample size with a specific confidence level and margin of error:

n = (z(α/2) * σ / E)^2

Where:

n = sample size
z(α/2) = critical value, obtained from the table of the probability distribution which the data follow
σ = population standard deviation
E = the maximum difference between the sample and population means

This formula can be used if and only if the value of σ is known, and it determines the sample size necessary to establish the population mean within ±E with a confidence of 1 − α. It is still possible to use this formula if we can estimate σ from a pilot test, similar studies or other related documents.

For example, suppose we want to be 95% sure that the sample mean is within 1 unit of the population mean. The value of z(α/2) is 1.96, read from the table of the standard normal distribution for a 95% (α = 0.05) level of confidence. The margin of error is E = 1, and the standard deviation σ can be estimated from a pilot test or related studies/documents.

The disadvantage of the sample size based on the mean is that a good estimate of the population variance is necessary. Often, an estimate is not available. Furthermore, the sample size can vary widely from one attribute to another because each is likely to have a different variance.
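The mean-based formula n = (z(α/2)·σ/E)² can be sketched as follows; the σ value here is a made-up pilot-study estimate:

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, E, confidence=0.95):
    """n = (z_{alpha/2} * sigma / E)**2, rounded up to a whole unit."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    return math.ceil((z * sigma / E) ** 2)

# sigma = 5 estimated from a hypothetical pilot study, E = 1 unit
print(sample_size_for_mean(sigma=5.0, E=1.0))  # 97
```

Tightening either the margin of error or the confidence level inflates n quadratically, which is why a good variance estimate matters so much.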


Project work
Let us assume that you are going to study the ethnic composition of the residents of your town based only on primary data. Here you can imagine that it is very expensive and time consuming to carry out a census survey for this specific study. As a result, you may be required to select samples for this study. Then, try to explain the type and method of sampling technique you are going to apply to answer the research questions and come up with the most appropriate research outcomes. What will be your appropriate sample size? What method will you use to decide the sample size and what do you say about the adequate representation of the samples?


Unit Three Techniques of Quantitative Data Presentation


Unit objectives
Having studied this chapter, you should be able to:
- Explain briefly statistical data presentation tools
- Present any quantitative raw data in a condensed and easily communicable way

3.1. Introduction
The data collected from the field or obtained through measurements are usually presented in condensed form in tables, charts and graphs. The purpose of putting the data into tables, charts and graphs is threefold. Firstly, it enables the researcher to look at the data visually, see what happened and make interpretations. Secondly, it is usually a better way to show the data to the audience than immersing them in lots of numbers in the text, which may put people to sleep while conveying little information. Thirdly, it allows readers to understand easily the research findings and what the writer wants to communicate. As a result, it is highly recommended to present data in tables, charts and graphs in condensed form. All the data presentation tools such as tables, charts and graphs can be drawn by hand or on computer. Software such as Microsoft Excel produces graphs and performs some statistical calculations. Statistical programs like SPSS, SYSTAT and SAS are higher-powered programs that perform many statistical analyses as well as producing high-quality graphs. However, graphs may mislead unless carefully and precisely drawn. Hence, you should bear in mind the under-mentioned points while producing tables, charts and graphs for any quantitative data:
- Scales must be in regular intervals
- Graphs/charts that are compared must have the same scale
- You must be clear about what you are going to communicate
- Graphs, charts and tables must be easy to read
- You should know who your audience is
- Be sure whether or not the display tells the entire story of the specific issue


3.2. Nature of Geographic Data


Numerical data is the essence of quantitative methods. In order to understand the phenomena under study, the researcher has to find a means of expressing the variables to be measured using some form of numerical technique. For most practical purposes, data can be measured at four different levels; each level has a specific purpose and also has important implications for the type of analysis to be undertaken. These four levels of measurement are known as nominal, ordinal, interval and ratio.

I. Nominal data: a set of data is said to be nominal if the values (observations) belonging to it can be assigned a code in the form of a number, where the numbers are simply labels. You can count, but not order or measure, nominal data. For example, in your research you can code males as 0 and females as 1; the marital status of an individual could be coded as 2 if married and 3 if single.

II. Ordinal (rank) data: data where the numerical value indicates relative rather than absolute position in a series, such as ranks. Many statistical tests make use of this type of ranked data. For instance, suppose four farmers, namely Chalchisa, Lamesa, Sardessa and Arfase, are the 1st, 2nd, 3rd and 4th crop producers in a rural kebele called Halelu-Chari during crop year 2007/8. The numerical values 1, 2, 3 and 4 indicate not the amount of yield the four farmers produced in 2007/8, but the relative positions of the four farmers among all the farmers in Halelu-Chari Kebele. Hence, 1, 2, 3 and 4 here are ordinal data. You can see that the data are ordered, but the differences between values are not meaningful.

III. Interval data: a scale of measurement where the distance between any two adjacent units of measurement (or intervals) is the same, but the zero point is arbitrary. Scores on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. An interval scale has all the characteristics of an ordinal scale, i.e. individuals or responses belonging to a sub-category have a common characteristic and the sub-categories are arranged in ascending or descending order. In addition, an interval scale uses a unit of measurement that enables the individuals or responses to be placed at equally spaced intervals along the spread of the variable. This scale has starting and terminating points, and the number of units/intervals between them is arbitrary and varies from scale to scale.
Mesay Mulugeta, 2009 31


The Centigrade and Fahrenheit temperature scales are examples of the interval scale. In the Centigrade system the starting point (the freezing point of water) is 0°C while the end point (the boiling point) is 100°C. The gap between the freezing and boiling points is divided into 100 equally spaced intervals known as degrees. In the Fahrenheit system the freezing point is 32°F and the boiling point is 212°F, and the gap between the two points is divided into 180 equally spaced intervals. Each degree or interval is a measurement of temperature. As the starting and terminating points are arbitrary, they are not absolute: you cannot say that 60°C is twice as hot as 30°C, nor that 30°F is three times hotter than 10°F. This means that while multiplication and division cannot be performed on the readings themselves, they can be performed on the differences between readings.

IV. Ratio data is a set of data whose values (observations) may take on any value within a finite or infinite interval. It has all the properties of the nominal, ordinal and interval scales plus its own property: the zero point of a ratio scale is fixed, which means it has a fixed, non-arbitrary starting point. You can count, order and measure ratio data. For example, during field work (data collection) for your research you may record 80 quintals/year, 15 quintals/year or no (zero) production for certain farmers. Other examples are the heights and weights of newly born babies, where a 2.5 kg baby is twice as heavy as a 1.5 kg one and a 90 cm baby is 1.5 times as long as a 60 cm one. On this scale all mathematical operations, including multiplication and division, are therefore meaningful. Most physical quantities, such as mass, length or energy, are measured on ratio scales; so is temperature when it is measured in kelvins, i.e. relative to absolute zero.
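To make the interval/ratio distinction concrete, here is a small sketch (not from the original text) showing that a ratio of Celsius readings is physically meaningless, while the same comparison on the Kelvin (ratio) scale is meaningful:

```python
# Illustration of interval vs ratio scales: Celsius has an arbitrary zero,
# Kelvin has a fixed zero (absolute zero), so only Kelvin ratios make sense.

def celsius_to_kelvin(c):
    """Convert a Celsius reading to the Kelvin (ratio) scale."""
    return c + 273.15

t_hot, t_cold = 60.0, 30.0

# On the interval scale the naive ratio suggests "twice as hot"...
naive_ratio = t_hot / t_cold

# ...while on the ratio scale the true ratio is much smaller.
true_ratio = celsius_to_kelvin(t_hot) / celsius_to_kelvin(t_cold)

print(round(naive_ratio, 2))   # 2.0 (but physically meaningless)
print(round(true_ratio, 2))    # about 1.1
```

The function name `celsius_to_kelvin` and the sample temperatures are illustrative choices, not part of the original text.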
In the actual process of quantitative data collection, it is also common to identify two sets of data, namely continuous and discrete data. Continuous variables are those that have, in theory, an infinite number of gradations between any two measurements. For example, crop yield per unit of land, body weight of individuals and milk yield of cows are continuous variables. Many variables in geography and environmental studies are of the continuous type. Discrete variables, on the other hand, do not have continuous gradations; there is a definite gap between two measurements, i.e. they cannot be measured in fractions. For example, the number of people in a country, a family size, and the number of eggs per chicken in a poultry farm are discrete variables.


Note that the distinction between nominal, ordinal, interval and ratio scale is of importance mainly because it helps us to choose the appropriate statistical tool for the analysis of the data.

3.3. Statistical Variables


A variable is a factor or character that can have more than one value, such as yield, height, weight and family size. In research, particularly in quantitative research where we use experiments to try to establish cause and effect, certain variables are especially important: (a) the Independent Variable (IV), (b) the Dependent Variable (DV) and (c) the Extraneous Variable (EV).

[Diagram: Climatic conditions (assumed cause) = Independent Variable; crop production per unit area (assumed effect) = Dependent Variable; type of seed, fertilizers, soil type, farmer's level of education, farming technology, etc. = Extraneous Variables]

Consider a study aimed at finding factors affecting crop productivity in a certain area of Ethiopia. We may hypothesize that climate and soil will have beneficial effects on crop productivity. In this situation climate and soil are the Independent Variables (IV) and crop productivity is the Dependent Variable (DV). However, it is very likely that the DV (the amount of crop production per unit area) is also affected by other factors such as type of seed and fertilizers. All these other potential sources of influence are known as extraneous variables. In most scientific research, experimental design is used to control these extraneous factors as far as possible.

3.4. Data Classification


The collected data, also known as raw data or ungrouped data, are always in an unorganized form and need to be organized and presented in meaningful and readily comprehensible form in order to facilitate further statistical analysis. It is, therefore, essential for an investigator to condense a mass
of data into a more comprehensible form. The process of grouping data into different classes or subclasses according to some characteristics is known as classification. Thus classification is the first step in tabulation. For example, crop products in the crop year 2007/8 can be classified according to their types, such as teff, barley, wheat, sorghum and oat. Similarly, population data can be classified as male or female, educated or uneducated, married or unmarried, etc.

Importance of Classification
The following are the main objectives of classifying data:
1. It condenses the mass of data into an easily manageable form.
2. It eliminates unnecessary details.
3. It facilitates comparison and highlights the significant aspects of the data.
4. It enables one to get a mental picture of the information and helps in drawing inferences.
5. It helps in the statistical treatment of the information collected.

Table 3.1: Classification of farm households' cash sources in Kuyu woreda


S/N   Source of Income                        % as of total income
1     Livestock and livestock products sale   61.2
2     Poultry                                 3.3
3     Bee production                          0.2
4     Grain sale                              11.9
5     Vegetables sale                         2.5
6     Firewood sale                           1.1
7     Charcoal sale                           0.2
8     Transfer/gift                           0.3
9     Rural credit                            18.8
10    Local trades                            0.5
Source: Mesay M. (2001).

Observe in the table above that the farmers' sources of cash income are clearly indicated with their respective percentages; the data are now easier to read and understand, and more suitable for further statistical analysis.


Types of classification
Statistical data are classified in respect of their characteristics. Broadly, there are four basic types of classification, namely:
a) Chronological classification
b) Geographical or locational classification
c) Qualitative classification
d) Quantitative classification

a) Chronological classification: In chronological classification the collected data are arranged according to the order of time, expressed in years, months, weeks, etc. The data are generally classified in ascending order of time.

b) Geographical classification: In this type of classification the data are classified according to geographical region or place; for instance, the production of coffee in the different zonal administrations of Ethiopia, or the production of wheat in different woredas of Oromiya.

c) Qualitative classification: In this type of classification data are classified on the basis of some attribute or quality, like sex, literacy, religion or employment. Such attributes cannot be measured on a numerical scale. For example, if the population is to be classified in respect of one attribute, say sex, then we can classify it into two groups, males and females. Similarly, the population can be classified into employed and unemployed on the basis of another attribute, 'employment'.

d) Quantitative classification: Quantitative classification refers to the classification of data according to some characteristic that can be measured, such as amount of rainfall, temperature, crop production, altitude, height or weight. For example, the farmers in a kebele may be classified according to their amount of crop production within a year, as given below.


Table 3.2: Classification of hypothetical farmers by quantity of production

Production in Quintals (Classes/Groups)    Number of Farmers
Discontinuous      Continuous              (Frequency)
6 - 10             5.5 - 10.5              8
11 - 15            10.5 - 15.5             12
16 - 20            15.5 - 20.5             17
21 - 25            20.5 - 25.5             21
26 - 30            25.5 - 30.5             14
31 - 35            30.5 - 35.5             11
36 - 40            35.5 - 40.5             6
41 - 45            40.5 - 45.5             11
Total                                      100

In this type of classification there are two elements, namely (i) the variable, i.e. the production in the above example, and (ii) the frequency, i.e. the number of farmers in each class. There are 12 farmers with production in the class 11-15 (10.5-15.5 in continuous terms), 14 farmers in the class 26-30 (25.5-30.5), and so on. Dear Student! Do you know the advantages of data classification? Please, explain them.
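As a small illustration of quantitative classification, the following sketch (with made-up production figures, not the actual 100 farmers behind Table 3.2) counts how many values fall into each inclusive class:

```python
# A minimal sketch of grouping raw values into classes and counting the
# frequency of each class. The production figures here are hypothetical.

def classify(values, classes):
    """Count how many values fall in each (lower, upper) inclusive class."""
    freq = {}
    for lower, upper in classes:
        freq[(lower, upper)] = sum(1 for v in values if lower <= v <= upper)
    return freq

production = [7, 12, 14, 18, 22, 23, 27, 31, 36, 44]   # hypothetical quintals
classes = [(6, 10), (11, 15), (16, 20), (21, 25),
           (26, 30), (31, 35), (36, 40), (41, 45)]

freq = classify(production, classes)
print(freq[(11, 15)])   # 2 farmers fall in the 11-15 class
```

The helper name `classify` is an illustrative choice; any binning routine (or a library histogram function) would do the same job.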

3.5. Data Presentation Tools and Analysis


Raw statistical data must be presented in a suitable and summarized form, without any loss of relevant information, so that they can be used efficiently for decision making. Whenever there is a need to present statistical data, graphic aids can help communicate the information to your audience more quickly. The two graphic aids most used in research reports are tables and graphs. Besides making the report easier to read and understand, graphic aids make the physical appearance of your research more attractive.

Several types of statistical/data presentation tools (graphic aids) exist. These include bar graphs, histograms, scatter diagrams, pie charts, line graphs and tables.

3.5.1. Tabulation
Tabulation is the process of summarizing classified or grouped data in the form of a table so that it is easily understood and an investigator is quickly able to locate the desired information. A table is a
systematic arrangement of classified data in columns and rows. Thus, a statistical table makes it possible for the investigator to present a huge mass of data in a detailed and orderly form. It facilitates comparison and often reveals certain patterns in the data which are otherwise not obvious. Classification and tabulation, as a matter of fact, are not two distinct processes; they go together. Before tabulation, data are classified and then displayed under the different columns and rows of a table.

Advantages of Tabulation
Statistical data arranged in tabular form serve the following objectives:
1. It simplifies complex data, and the data presented are easily understood.
2. It facilitates comparison of related facts.
3. It facilitates computation of various statistical measures like measures of central tendency, dispersion and correlation.
4. It presents facts in the minimum possible space; unnecessary repetitions and explanations are avoided. Moreover, the needed information can easily be located.
5. Tabulated data are good for reference and make it easier to present the information in the form of graphs and diagrams.

Preparing a Table
The making of a compact table is itself an art. It should contain all the information needed within the smallest possible space. The purpose of the tabulation and how the tabulated information is to be used are the main points to be kept in mind while preparing a statistical table. An ideal table should consist of main parts such as the table number, the title of the table, the body of the table and the source of the data. Dear Student! Look at the table below and observe how the table number, title, body and source of data should be presented.


Table 3.3: Number of livestock per household by types and agro-climatic zones: Kuyu Woreda

Types          Dega   Woina Dega   Qolla   All zones
Oxen           1.7    1.9          0.9     1.7
Cows           1.4    1.4          0.9     1.4
Young cattle   1.9    2.2          1.4     2.0
Sheep          2.2    1.0          0.0     1.4
Goats          1.4    2.3          3.5     2.0
Equine         0.7    0.5          0.2     0.6
Total          9.3    9.3          6.9     9.1

Source: MA Thesis (Mesay M., 2001)

Frequency Distribution
In statistics, a frequency distribution is a list of values that a variable takes in a sample. It is usually a list, ordered by quantity, showing the number of times each value appears. For example in the table below 22 appears three times while others such as 8, 12 and 17 appear two times each. Table 3.4: Frequency table
Laborers' Daily Income   Number of Laborers   Cumulative Frequency   Percentage Cumulative
(Eth. Birr)              (Frequency)          (<UL)                  Frequency
3.00                     2                    2                      12.5
8.00                     2                    4                      25.0
12.00                    2                    6                      37.5
17.00                    2                    8                      50.0
18.00                    2                    10                     62.5
22.00                    3                    13                     81.3
23.00                    1                    14                     87.5
24.00                    1                    15                     93.8
27.00                    1                    16                     100.0
Total                    16                   ----                   ----
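The cumulative and percentage columns of a frequency table like Table 3.4 can be computed mechanically. The sketch below (not from the original text) reproduces them from the frequency list; half-up rounding via the `decimal` module is used to match the table, since Python's built-in `round()` rounds halves to even:

```python
# Computing cumulative frequency and percentage cumulative frequency
# from a list of frequencies, as in Table 3.4.
from decimal import Decimal, ROUND_HALF_UP

incomes = [3, 8, 12, 17, 18, 22, 23, 24, 27]   # laborers' daily income (Birr)
freqs   = [2, 2, 2, 2, 2, 3, 1, 1, 1]          # number of laborers

total = sum(freqs)
cumulative = []
running = 0
for f in freqs:
    running += f                 # running total up to this income value
    cumulative.append(running)

# Percentage cumulative frequency, rounded half-up to one decimal place.
pct_cumulative = [
    float((Decimal(100 * c) / Decimal(total)).quantize(Decimal("0.1"),
                                                       rounding=ROUND_HALF_UP))
    for c in cumulative
]

print(cumulative)       # [2, 4, 6, 8, 10, 13, 14, 15, 16]
print(pct_cumulative)   # [12.5, 25.0, 37.5, 50.0, 62.5, 81.3, 87.5, 93.8, 100.0]
```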

This simple tabulation has two drawbacks. When a variable can take continuous values instead of discrete values or when the number of possible values is too large, the table construction is cumbersome, if not impossible. A slightly different tabulation scheme based on the range of values is used in such cases. So we group the data into class intervals (or groups) to help us organize,
interpret and analyze the data. The frequency of a group (or class interval) is the number of data values that fall in the range specified by that group (or class interval). There are two ways in which the observations in a data set are classified on the basis of class intervals, namely the exclusive method and the inclusive method. In the exclusive method the data are classified in such a way that the upper limit of a class interval is the lower limit of the succeeding class interval. This method is illustrated in Table 3.5.

Table 3.5: Exclusive method of data classification

Class Interval   Frequency   CF < UL
0 - 5            2           2
5 - 10           2           4
10 - 15          2           6
15 - 20          4           10
20 - 25          5           15
25 - 30          1           16
Total            16          ----

The inclusive method, on the other hand, classifies the data in such a way that both the lower and upper limits of a class are included in the interval itself, as illustrated below.

Table 3.6: Inclusive method of data classification

Class Interval   Frequency
0 - 4            2
5 - 9            2
10 - 14          2
15 - 19          4
20 - 24          5
25 - 30          1
Total            16


The exclusive method is used to classify a set of data involving continuous variables, while the inclusive method should be used for a set of data involving discrete variables. If a continuous variable is classified according to the inclusive method, then a certain adjustment of the class intervals is needed to obtain continuity, as shown in Table 3.7 (Table 3.6 above is adjusted as Table 3.7 below for continuous variables).

Table 3.7: Adjusted inclusive method for continuous variables

Class Interval   Frequency
-0.5 - 4.5       2
4.5 - 9.5        2
9.5 - 14.5       2
14.5 - 19.5      4
19.5 - 24.5      5
24.5 - 30.5      1
Total            16

The method is first to calculate the correction factor X as:

X = (Lower Limit of the Next Higher Class - Upper Limit of a Class) / 2

and then subtract the value of X from the lower limits of all classes and add it to the upper limits of all classes.
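The correction-factor adjustment described above can be sketched in a few lines (the function name `adjust_inclusive` is an illustrative choice):

```python
# Turning inclusive (discrete) class limits into continuous ones by
# applying the correction factor X = (next lower limit - current upper) / 2.

def adjust_inclusive(classes):
    """Widen each (lower, upper) inclusive class by the correction factor."""
    x = (classes[1][0] - classes[0][1]) / 2   # correction factor, e.g. (5-4)/2
    return [(lower - x, upper + x) for lower, upper in classes]

inclusive = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24), (25, 30)]
print(adjust_inclusive(inclusive))
# [(-0.5, 4.5), (4.5, 9.5), (9.5, 14.5), (14.5, 19.5), (19.5, 24.5), (24.5, 30.5)]
```

This reproduces the adjusted limits of Table 3.7 from the inclusive classes of Table 3.6.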

3.5.2. Graphical Presentation of Data


It has already been discussed that one of the most important functions of statistics is to present complex and unorganized (raw) data in such a manner that they can easily be understood. In the same way, graphic presentation of data helps us to grasp the overall nature of the data easily and facilitates further analysis or interpretation. The shape of a graph offers easy and immediate answers to several questions, such as the variations and trends of the data distribution. Graphic presentation therefore serves as an easy technique for quick and effective comparison between two or more variables.

Graphic representations of data can be categorized into five types based on the dimensions of the graphs in use. These are:
1. Dimensionless diagrams, also known as point graphs or dot graphs
2. Unidimensional graphs or line graphs, such as the frequency curve/polygon and the cumulative frequency curve (ogive)
3. Bidimensional diagrams such as histograms, bar diagrams and pie charts
4. Tridimensional diagrams such as cubes, blocks and spheres
5. Pictorial representations of data using pictures of human beings, animals, cups, houses, fruits, etc.

Generally, presentation of data in diagrams has the following advantages:
1. Diagrams give an attractive and elegant presentation of data
2. Diagrams leave a good visual impact and facilitate comparisons
3. Interpretations from diagrams save time
4. Diagrams simplify complexity and easily depict the characteristics of the data

Point Graph or Dot Graph


A point graph provides a visual representation of the relationship between two variables, x and y, of a set of data. A graph consists of two axes called the x-axis (horizontal line) and the y-axis (vertical line). The axes correspond to the variables we are analyzing. The point graph given in Fig. 3.1, for example, charts the relationship between fertilizer input per hectare and crop yield per hectare in quintals. The horizontal coordinate (x-coordinate) of each point corresponds to the amount of fertilizer input, while the vertical coordinate (y-coordinate) corresponds to the yield per hectare in quintals, as demarcated on the left-hand side of the chart. Each point in the graph is defined by a pair of numbers containing two coordinates. Points are identified by stating their coordinates in the form (x, y); note that the x-coordinate always comes first, followed by the y-coordinate. The coordinate values of a point are identified by drawing lines extending out from the specific values on each axis. For example, in Fig. 3.1 the x and y coordinates of point P can be identified as 40 kg per hectare and 25 quintals per hectare, respectively. Hence, the coordinates of point P are written as (40, 25).


Figure 3.1: Point graph

Linechart or Linegraph
A linechart or linegraph is a type of graph created by connecting a series of data points, each representing an individual measurement, with line segments. A linechart is a basic type of chart common in many fields, and it gives the reader a fairly good idea of the nature of the data. For instance, a linechart is often used to visualize a trend in data over intervals of time.

Table 3.8: A hypothetical farmer's crop output over years

Crop Year   Production in Quintals
1985        10
1990        12
1995        15
2000        20
2005        28

For instance, the data depicted in Table 3.8 above can easily be converted to a linechart, as shown in the figure below.


Figure 3.2: Line graph of the data in Table 3.8 (x-axis: Crop Year; y-axis: Yield per Hectare)

A frequency polygon is also a type of linechart, formed by marking the midpoints of the tops of the bars in a histogram and joining these points by a series of straight lines. The frequency polygon is closed against the horizontal axis: straight lines are drawn from the midpoints of the tops of the first and last rectangles to the midpoints, on the horizontal axis, of the adjacent outlying intervals with zero frequency. Drawing a frequency polygon does not necessarily require constructing a histogram first. It can be obtained directly by plotting points above each class midpoint at heights equal to the corresponding class frequencies. The points so drawn are then joined by a series of straight lines and the polygon is closed as explained earlier. In this case, the horizontal x-axis measures the successive class midpoints and not the lower class limits.

Exercise
Use the data indicated in Table 3.4 hereinbefore and draw a frequency polygon A) By constructing a histogram first B) Without constructing histogram


Ogive (cumulative frequency curve) is another type of linechart. It is a cumulative frequency polygon, often presented in cumulative frequency or percentage cumulative terms. Cumulative frequencies show the running total, i.e. the frequency below each class boundary, as shown in the example below. One way of constructing an ogive (pronounced "O-jive") or cumulative frequency curve is indicated below. The curve is usually S-shaped.

Example
Plot (construct) an ogive based on the data (assumed students' marks) in the table below. (i) Plot the points with coordinates having abscissae (x-axis) as the actual upper limits and ordinates (y-axis) as the cumulative frequencies; (ii) join the plotted points by a smooth curve; (iii) connect the ogive to a point on the x-axis representing the actual lower limit of the first class.

Table 3.9: Frequency distribution

Students'   Upper   Frequency (No of        Cumulative   %           % Cumulative
Mark        Limit   Students in the Range)  Frequency    Frequency   Frequency
0-19        19.5    8                       8            16          16
20-39       39.5    12                      20           24          40
40-59       59.5    14                      34           28          68
60-79       79.5    10                      44           20          88
80-99       99.5    6                       50           12          100
Total       -----   50                      -----        100         -----
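The plotting coordinates of the ogive can be derived directly from Table 3.9, as in the sketch below (not from the original text): x-values are the actual upper class limits, y-values are the cumulative frequencies, and the curve is anchored at the actual lower limit of the first class.

```python
# Deriving ogive coordinates from the frequency distribution of Table 3.9.

upper_limits = [19.5, 39.5, 59.5, 79.5, 99.5]   # actual upper class limits
frequencies  = [8, 12, 14, 10, 6]               # students per class

points = [(-0.5, 0)]      # actual lower limit of the first class (0-19), CF = 0
running = 0
for x, f in zip(upper_limits, frequencies):
    running += f
    points.append((x, running))

print(points)
# [(-0.5, 0), (19.5, 8), (39.5, 20), (59.5, 34), (79.5, 44), (99.5, 50)]
```

Joining these points with a smooth curve gives the S-shaped ogive described above.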


Figure 3.3: Cumulative frequency curve (Ogive) based on the data in Table 3.9

Histogram and bar diagram: These are diagrams used to represent both ungrouped (raw) and grouped data. Values of the variable (the characteristic to be measured) are scaled along the horizontal axis and the number of observations (frequencies) along the vertical axis of the graph. The height of each box (rectangle) measures the number of observations in its class. See Figures 3.4 and 3.5. A histogram is simply a bar graph in which the bar lengths are determined by the frequencies in each class of a grouped frequency distribution and the bars are drawn adjoining one another. Notice how the bar diagram above is represented by the histogram below, with eight interconnected bars representing the numbers of farmers in each class of the crop production distribution.


Figure 3.4: Bar diagram presenting number of farmers by crop production: hypothetical data

A pie diagram, pie chart or circle graph is a circular chart divided into sectors, illustrating relative magnitudes, frequencies or percentages. In a pie chart, the arc length of each sector, and consequently its central angle and area, is proportional to the quantity it represents. Together, the sectors create a full disk. It is named for its resemblance to a sliced pie.

Figure 3.5: Histogram indicating number of farmers by crop production, hypothetical data (x-axis: crop production in quintals; y-axis: number of farmers)


Observe that the data indicated in Table 3.10 below are presented in the pie diagram of Figure 3.6.

Table 3.10: Percentage distribution of hypothetical population by marital status

Marital Status   % Distribution
Single           20
Married          62
Widowed          14
Divorced         4

Figure 3.6: Piechart presenting percentage distribution of marital status: Hypothetical data

To divide the pie chart proportionally, use the following method. Observe that the sum of the percentages (14 + 4 + 20 + 62) is 100. You also know that the total angular measurement of any circle is 360°. You can then establish a relationship between the percentages and angular measurements as follows.


For instance, if 100% = 360°, what will be the equivalent angular measurements of the remaining percentages?

14%  = 50.40°
4%   = 14.40°
20%  = 72.00°
62%  = 223.20°
100% = 360.00°
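The percentage-to-angle conversion above amounts to scaling each share by 3.6 (i.e. 360/100). A short sketch, using the marital-status data of Table 3.10:

```python
# Converting percentage shares to pie-chart sector angles:
# each sector's central angle is its share of 360 degrees.

shares = {"Single": 20, "Married": 62, "Widowed": 14, "Divorced": 4}

angles = {status: pct * 360 / 100 for status, pct in shares.items()}

print(angles["Widowed"])               # 50.4
print(round(sum(angles.values()), 1))  # 360.0
```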

Now, by using your protractor, you can construct a pie chart as indicated in Fig. 3.6.

Important formulas in this unit are:
1. Class interval (h) = Upper Limit - Lower Limit
2. Midpoint of a class (m) = (Upper Limit + Lower Limit) / 2
3. Approximate interval size to be used in constructing a frequency distribution:
   h = (Largest data value - Smallest data value) / Number of classes
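The three formulas can be sketched as small helper functions (names such as `class_interval` are illustrative, not from the text); the last call uses the daily-income data of Table 3.4 with an assumed five classes:

```python
# The three class-interval formulas as functions.

def class_interval(lower, upper):
    """h = upper limit - lower limit."""
    return upper - lower

def midpoint(lower, upper):
    """m = (upper limit + lower limit) / 2."""
    return (upper + lower) / 2

def approx_interval_size(data, num_classes):
    """h = (largest value - smallest value) / number of classes."""
    return (max(data) - min(data)) / num_classes

print(class_interval(20, 25))   # 5
print(midpoint(20, 25))         # 22.5
print(approx_interval_size([3, 8, 12, 17, 18, 22, 23, 24, 27], 5))   # 4.8
```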

Three dimensional representation of dataset


This is the method of depicting a data set by using pictures of objects that have three dimensions (length, width and height), such as prisms, cubes, cylinders and cones. For instance, the data in Table 3.11 below can be shown by one of the three-dimensional graphs, as follows.

Table 3.11: A farmer's crop output over years: hypothetical data

Year   Crop Output in Qtls
1985   10
1990   12
1995   15
2000   20
2005   28


Figure 3.7: A farmer's crop output over years

Pictorial representation of data


This, also known as a pictogram or ideograph, refers to the method of presenting a set of quantitative data by using pictures of objects like trees, cattle, cars, airplanes, cups, sacks, balls and people. What is crucial and worth mentioning here is that each picture should be assigned a definite scale that it represents. For example, in Fig. 3.8, a single picture represents 1,000,000 people.

Figure 3.8: Pictorial representation of the national regional states of Ethiopia by population (regions shown: Oromiya, Amhara, SNNPRS, Somali, Tigray, Addis Ababa, Afar, B/Gumuz)
Source: CSA (2008). Scale: one picture = 1,000,000 people


Quantitative Methods in Social Sciences

Exercise
1. Define the following statistical terms:
A. Qualitative data
B. Quantitative data
C. Extraneous variable
D. Histogram
E. Ogive
F. Cumulative frequency
G. Frequency distribution
H. Midpoint

2. Let the following data represent the gross income (in Eth. Birr) of forty (40) urban households in Adama town. Then construct a cumulative frequency table and Ogive for the data.

550, 600, 760, 500, 900, 550, 760, 750, 1000, 550, 700, 600, 550, 550, 670, 740, 1100, 570, 610, 480, 620, 680, 720, 740, 750, 550, 600, 780, 800, 820, 840, 880, 900, 920, 950, 750, 600, 520, 980, 500

3. The following table indicates the population sizes of the top 8 most populous countries in the world. Present the data in a pie diagram and a bar chart.

Table 3.12: The world's 8 most populous countries, 2008

Country      Population
China        1,321,851,888
India        1,129,866,154
USA          301,139,947
Indonesia    234,693,997
Brazil       190,010,647
Pakistan     169,270,617
Bangladesh   150,448,339
Russia       141,377,752

4. The following data show the sources of annual cash income (in Birr per household) of sample rural households in Kuyu woreda during a particular year (2001). Draw a suitable diagram (chart) to present the data.

Table 3.13: Rural households' major sources of cash income

S/N   Source of cash income                   Cash income per household (Birr/household)
1     Livestock and livestock products sale   712.26
2     Poultry                                 32.84
3     Bee product                             48.64
4     Grain sale                              118.72
5     Vegetables sale                         59.46
6     Firewood sale                           56.87
7     Charcoal sale                           22.93
8     Transfer/gift                           70.12
9     Rural credit                            799.71
10    Local trades                            112.20
11    Other non-farm activities               44.30
      Total                                   945.38

5. The following data are the mean maximum temperatures of Debre Birhan town in °C (1997-2006). Present the data by using a bar diagram.

Table 3.14: Mean maximum temperature of Debre Birhan town (°C)

Month     Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
T (°C)    19.80   19.85   19.85   19.85   19.85   19.85   19.85   19.85   19.85   19.85   19.85   19.85

Source: Computed from NMSA's data


Unit Four Measures of Central Tendency


Unit objectives
Having studied this chapter, you should be able to:
1. Understand the general concept of measures of central tendency (measures of location) in geographic data
2. Identify the specific types, methods and applications of measures of central tendency
3. Describe the role of measures of central tendency in geography and environmental studies
4. Understand the merits and demerits of different measures of central tendency

4.1. Introduction
In the previous sections we discussed how raw data can be collected, organized and presented in terms of tables, charts and frequency distributions in order to be easily understood and analyzed. Although frequency distributions and the corresponding graphical presentations make raw data more meaningful, they fail to identify three major properties that describe a set of quantitative data. These are:
1. The central value of a set of data, called central tendency.
2. The extent to which numerical values are dispersed around the central value, called variation or dispersion, expressed in terms of the distances of individual observations from the central value.
3. The extent of departure of the numerical values from symmetry of distribution around the central value, called skewness, or the shape of the frequency distribution (measure of symmetry).

In this unit we discuss central tendency in depth, also known as measures of location or first-order analysis. The term central tendency was coined because the observations (numerical values) in most data sets show a distinct tendency to group or cluster around some central value. It is necessary to identify or calculate these typical central values in order to describe or summarize the characteristics of an entire data set in one figure. This descriptive value is known as a measure of central tendency. It is very important in social science applications such as planning. For instance, it becomes easier to plan the total annual potable water supply needed by Adama town residents by first studying the average quantity of water needed per head per household in the town.

The most widely used measures of central tendency are the mean, the median and the mode. We will be calculating these values for populations (i.e. the collection of all elements we are describing) and for samples drawn from populations, as well as for grouped and ungrouped data sets.

4.2. Mean
Mean is a central value computed by taking all the observations or recorded values into consideration. It has four subtypes, known as the arithmetic, geometric, weighted and harmonic means. Unless otherwise specified, however, the term mean invariably refers to the arithmetic mean, or average. This measure is used most frequently because it is easy to compute and because it is used in further rigorous statistical analysis where the geometric and harmonic means are not applicable; the geometric and harmonic means also have very limited applications of their own. Mathematically, the mean of a list of numerical observations is the sum of all the observations divided by the number of items in the list. The Greek letter μ is used to denote the mean of an entire population, while the sample mean is typically denoted by x̄ (pronounced "x bar"). Some related literature also uses ȳ and d̄ to denote sample means.

a) Calculating the Arithmetic Mean for Raw Data
The arithmetic mean is the most widely used and most widely reported measure of central tendency. There are at least two methods of calculating it for ungrouped (raw) data.

Direct Method
In this method it is calculated by adding the values of all observations and dividing the total by the number (N) of observations.
Population Mean: μ = (X1 + X2 + X3 + ... + XN) / N = ΣXi / N

Sample Mean: x̄ = (x1 + x2 + x3 + ... + xn) / n = Σxi / n

Example: Calculate the arithmetic mean of the data given below.


25kg, 30kg, 18kg, 22kg, 28kg, 24kg, 33kg, 27kg, 25kg, 21kg

Arithmetic Mean = (25 + 30 + 18 + 22 + 28 + 24 + 33 + 27 + 25 + 21) kg / 10 = 253 kg / 10 = 25.3 kg



Then, the arithmetic mean of the above observations (data) is calculated to be 25.3kg.
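The direct-method calculation above can be written as a few lines of code (a minimal sketch, not part of the original text):

```python
# Direct method for the arithmetic mean of raw data:
# sum all observations and divide by their number.

weights = [25, 30, 18, 22, 28, 24, 33, 27, 25, 21]   # observations in kg

mean = sum(weights) / len(weights)
print(mean)   # 25.3
```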

Short-Cut Method of Calculating Arithmetic Mean


In this method an arbitrary assumed mean (am) is taken as a base, and the mean of the deviations of the individual values from it is calculated. The correct mean is then obtained by adding this mean of deviations to the assumed mean. Look at the example below. The table represents the daily income of 23 hypothetical employees in a firm, with 24 Birr taken as the arbitrary assumed mean (am) of the employees' daily earnings.

Table 4.1: Calculation of Mean by Short-cut Method for Ungrouped Data

Daily Earnings in Birr (x)   Number of Employees (fi)   di = xi - am = xi - 24   fi.di
34                           2                          10                       20
23                           3                          -1                       -3
18                           2                          -6                       -12
22                           4                          -2                       -8
36                           3                          12                       36
38                           4                          14                       56
12                           5                          -12                      -60
Total                        23                         ----                     29

d̄ = Σfi.di / N = 29/23 = 1.26

Then, the real arithmetic mean = am + Σfi.di / N = 24 + 29/23 = 24 + 1.26 = 25.26 Birr
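The short-cut (assumed-mean) calculation above can be sketched in Python as follows; the variable names are illustrative, not from the text:

```python
# Short-cut (assumed-mean) method for the frequency data in Table 4.1.
am = 24                                   # arbitrary assumed mean, in Birr
earnings = [34, 23, 18, 22, 36, 38, 12]   # daily earnings x_i
freqs    = [2, 3, 2, 4, 3, 4, 5]          # number of employees f_i

N = sum(freqs)                            # 23 employees in total
# Sum of frequency-weighted deviations from the assumed mean
sum_fd = sum(f * (x - am) for x, f in zip(earnings, freqs))

mean = am + sum_fd / N                    # correct mean = am + mean of deviations
print(round(mean, 2))                     # 25.26
```

Changing `am` does not change the result; any assumed mean leads back to the same arithmetic mean, which is the point of the method.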

b) Calculating Arithmetic Mean for Grouped (Classified) Data

The arithmetic mean of grouped data can be obtained in two ways. (Case 1) Directly: take the class mid-value (class mark) as the representative value of each class, multiply it by the class frequency, and divide the sum of these products by the total number of observations. (Case 2) By the short-cut method: change the origin to an assumed mean and, if convenient, the scale as well by dividing each deviation from the assumed mean by the class interval (C); then calculate the mean of the new variate and recover the correct mean by reversing the transformation, i.e. multiplying by C where the scale was changed and adding the assumed mean. For Case 2 it is also advisable to change the class intervals into continuous form.

Example
The following table presents the daily earnings of hypothetical employees in a firm in classified form. This method requires grouping the raw data into class intervals, calculating the class mid-points and identifying the number of observations in each class. Look at the table below.

Case 1
Table 4.2: Calculation of Arithmetic Mean for Grouped Data

Daily Earnings in Birr (Discrete Grouping)   Class mid-value (mi)   Number of Employees (fi)   fi.mi
10-14                                        12                     2                          24
15-19                                        17                     3                          51
20-24                                        22                     2                          44
25-29                                        27                     4                          108
30-34                                        32                     3                          96
35-39                                        37                     4                          148
Total                                        ----                   18                         471

Arithmetic Mean = Σfi.mi / N (where N = population size) or Σfi.mi / n (where n = sample size)

Then, the arithmetic mean for the above example = 471/18 = 26.167
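A minimal Python sketch of the Case 1 (direct) grouped-data calculation, using the mid-values and frequencies from Table 4.2:

```python
# Mean from classified data: each class mid-value m_i stands in for all
# f_i observations in that class.
midpoints = [12, 17, 22, 27, 32, 37]
freqs     = [2, 3, 2, 4, 3, 4]

mean = sum(f * m for m, f in zip(midpoints, freqs)) / sum(freqs)
print(round(mean, 3))   # 26.167
```

Note that this is an approximation of the true mean of the raw data, since every observation in a class is treated as if it equalled the class mark.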

Case 2
Table 4.3: Calculation of Arithmetic Mean for Grouped Data (continuous classes; am = 27, C = 5)

Daily Earnings in Birr   Class Mark (mi)   Frequency (fi)   di = mi - am   fi.di   ui = (mi - am)/C   fi.ui
9.5-14.5                 12                2                -15            -30     -3                 -6
14.5-19.5                17                3                -10            -30     -2                 -6
19.5-24.5                22                2                -5             -10     -1                 -2
24.5-29.5                27                4                0              0       0                  0
29.5-34.5                32                3                5              15      1                  3
34.5-39.5                37                4                10             40      2                  8
Total                    ----              18               ----           -15     ----               -3

d̄ = Σfi.di / Σfi = -15/18 = -0.833

Correct mean = am + d̄ = 27 + (-0.833) = 26.167

OR

ū = Σfi.ui / Σfi = -3/18 = -0.167

Correct mean = am + C.ū = 27 + 5(-0.167) = 27 - 0.833 = 26.167
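The Case 2 step-deviation method can likewise be sketched in Python; `am` and `C` below are the assumed mean and class width from Table 4.3:

```python
# Step-deviation method: work with u_i = (m_i - am) / C instead of the raw
# mid-values, then scale back by C and shift back by am.
am, C = 27, 5
midpoints = [12, 17, 22, 27, 32, 37]
freqs     = [2, 3, 2, 4, 3, 4]

sum_fu = sum(f * (m - am) / C for m, f in zip(midpoints, freqs))  # Σ f_i u_i
mean = am + C * sum_fu / sum(freqs)
print(round(mean, 3))   # 26.167, the same result as the direct method
```

The step-deviation trick mattered for hand computation, where small integers like -3, -2, -1 are easier to multiply than the raw mid-values; in code it simply confirms the equivalence.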

Merits and Demerits of Arithmetic Mean

Sharma, J.K. (2004) lists the merits and demerits of the arithmetic mean as follows:

Merits
1. The calculation of the arithmetic mean is simple.
2. It is clear and unambiguous, since every data set has one and only one mean value.
3. Its calculation is based on all values given in the data set.
4. It is a reliable single value that reflects all values in the data set.
5. It is least affected by fluctuations of sampling: its value determined from various samples drawn from a population varies by the least possible amount.
6. It is used for further rigorous statistical analysis.
7. The arithmetic mean is a stable average.

Demerits
1. The arithmetic mean cannot be calculated accurately for open-ended class intervals.
2. It is affected by extreme values, which are not exact representatives of the data set.
3. For large data sets its calculation may be difficult and tedious, as every element is used in the calculation.
4. It cannot be calculated for qualitative characteristics such as intelligence, beauty or loyalty.
5. It cannot be determined by inspection.

Exercise
The following data are monthly minimum temperatures taken from Guder Weather Station. Calculate the arithmetic mean of the monthly minimums and write your answers in the row provided at the bottom of the table.

Table 4.4: Monthly Minimum Temperature in °C (1998-2006): Guder Station

Year   Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
1998   10.2   11     11.1   11.6   13.7   12.5   13.5   12.3   11.5   10.9   5.5    3.8
1999   6.8    7      10.5   11.7   12     12.6   13.3   11.8   9.8    9.9    3.4    6.2
2000   5.7    6.1    10.6   11.2   11.6   10.2   10.6   10.7   7.8    7.4    4.3    5.2
2001   7.8    10     11.2   9.4    12.5   11.4   12.5   12.4   8.8    9.2    6.6    7.2
2002   8.9    8.8    13     12.9   12.6   11.5   12.7   12     12.1   8.2    7      9.6
2003   10.2   8.5    10.7   10.2   9.9    8.2    7.8    6.5    6      2.4    1.6    1.8
2004   5.2    3.8    6.9    8.1    8.6    7.4    8.2    8.3    8.4    6      2.3    2.9
2005   3.5    4.1    9.1    12     11.1   9.9    9.4    7.8    7      1.7    0.6    -0.4
2006   2      5.1    9.4    10.8   9.2    8.1    7.4    7.3    6.9    5.4    4.9    5
Mean Min T°   ?      ?      ?      ?      ?      ?      ?      ?      ?      ?      ?      ?

Exercise
Form a frequency table for the raw data given below and calculate the arithmetic mean.
22 38 12 17 17 18 22 23 22 27 18 28 12 12 24 18 18 22 33 33 38 24 24 17 22 22 28 12 28 24 18 33 27 22 24

Weighted Arithmetic Mean


The arithmetic mean, as discussed earlier, gives equal importance (weight or value) to each observation in the data set. However, there are situations in which values of observations in the data are not of equal importance. In such cases computing simple arithmetic mean may not be truly representative and even misleading. Under these circumstances, we may attach to each observation a value or weight as an indicator of their importance and compute a weighted mean.

Example
Let us assume that the East Shewa Zone administration wants to award only one top crop-producing farmer in the zone on the basis of annual crop production. Suppose that in the selection process three top farmers (Farmers X, Y and Z) are found, each of whom harvested 80 quintals of crops during the crop year. If each farmer produced four types of crops, namely wheat, teff, barley and sorghum, as depicted in Table 4.5, who do you think should be awarded? It is somewhat difficult to decide, as all three farmers produced the same quantity and the average production per crop is 20 quintals for each of them. Do not forget that only one farmer is to be awarded. In order to decide whom to award, we may attach to each crop type a value (weight) w1, w2, w3, w4 as an indicator of its importance and compute a weighted mean. Based on current market prices, for instance, we can attach 1200 to teff, 800 to barley, 700 to wheat and 400 to sorghum. Observe Table 4.5 below.

Table 4.5: Calculation of Weighted Arithmetic Mean
Crop      Market Price per Qntl (Weight, w)   Farmer X: Prodn in qntls (x)   w.x     Farmer Y: Prodn in qntls (x)   w.x     Farmer Z: Prodn in qntls (x)   w.x
Teff      1200                                18                             21600   12                             14400   25                             30000
Barley    800                                 22                             17600   18                             14400   25                             20000
Wheat     700                                 30                             21000   45                             31500   15                             10500
Sorghum   400                                 10                             4000    5                              2000    15                             6000
Total     3,100                               80                             64200   80                             62300   80                             66500

In order to decide who should be awarded, we now calculate the weighted mean for each farmer from the table above:

WM = Σwi.xi / Σwi = (w1x1 + w2x2 + w3x3 + ... + wnxn) / (w1 + w2 + w3 + ... + wn)

WM for Farmer X = 64200/3100 = 20.7097
WM for Farmer Y = 62300/3100 = 20.0968
WM for Farmer Z = 66500/3100 = 21.4516

As per the calculation above, it can be noted that Farmer Z should be awarded. Remark: As noted by Sharma, J. K (2004) the weighted arithmetic mean should be used, among others, where the importance of all the numerical values in the given data set is not equal and when the frequencies of various classes are widely varying. The term weighted mean usually refers to a weighted arithmetic mean, but weighted versions of other means can also be calculated, such as weighted geometric mean and weighted harmonic mean.
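The farmer comparison above can be reproduced with a small weighted-mean helper in Python (the function and dictionary names are illustrative):

```python
# Weighted mean for the three farmers in Table 4.5; the weights are the
# market prices per quintal used in the text (teff, barley, wheat, sorghum).
weights = [1200, 800, 700, 400]

production = {            # quintals of each crop harvested per farmer
    "X": [18, 22, 30, 10],
    "Y": [12, 18, 45, 5],
    "Z": [25, 25, 15, 15],
}

def weighted_mean(values, weights):
    # WM = sum(w_i * x_i) / sum(w_i)
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

for farmer, qty in production.items():
    print(farmer, round(weighted_mean(qty, weights), 4))
# X 20.7097, Y 20.0968, Z 21.4516 -> Farmer Z wins
```

All three farmers have the same simple mean (20 quintals per crop), so only the weighting by market price separates them, which is exactly the situation the weighted mean is designed for.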

Exercise
An examination was held to decide the award of a scholarship among three selected students, namely Gebre, Mandefro and Waktole. The weights of the various subjects were different (4 for mathematics, 3 for GeES, 2 for civics and 1 for language). The marks obtained by the three candidates (out of 100 in each subject) are given below.

Table 4.6
Subject       Gebre   Mandefro   Waktole
Mathematics   60      57         62
GeES          62      61         67
Civics        55      53         60
Language      67      77         49

Then, decide who should be awarded the scholarship by calculating the weighted arithmetic mean.

Geometric Mean
It is the nth root of the product of n numbers (observations), or equivalently the antilog of the average of the logarithms of the values. It is given by the formula

GM = ⁿ√((x1)(x2)(x3)(x4)...(xn))

The logarithmic transformation is

log GM = (1/n)(log x1 + log x2 + ... + log xn)

Hence, GM = antilog of the average of the log values.

Then, how do you calculate a geometric mean? The easiest way to think of it is as the average of the logarithmic values converted back to an ordinary number. The formal definition, as above, is the nth root of the product of the n data points x1, x2, ..., xn, where n is the total number of data points used in the calculation.

Consider this example. Suppose you want to calculate the geometric mean of the numbers 2 and 32. First take the product: 2 × 32 = 64. Since there are only two numbers, the nth root is the square root, and the square root of 64 is 8. Therefore the geometric mean of 2 and 32 is 8.

Now let us solve the same problem using logs, converting to base-2 logarithms: 2 = 2¹ and 32 = 2⁵, so 2¹ × 2⁵ = 2⁶ = 64, and the square root of 2⁶ is 2³, which equals 8. Of course, the short cut is simply to take the average of the two exponents, (1 + 5)/2 = 3, and 2³ is 8.

Merits and Demerits of Geometric Mean

Sharma, J.K. (2004) lists the merits and demerits of the geometric mean as follows:

Merits
1. The value of the GM is not much affected by extreme observations.
2. The GM is calculated by taking all the observations into account.
3. It is useful in determining rates of increase or decrease.

Demerits
1. The calculation of the GM is more difficult and intricate than that of the AM.
2. The GM cannot be calculated when any observation in the data set is negative or zero.
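The log-based definition translates directly into code. A small sketch (Python 3.8+ also offers `statistics.geometric_mean`, which should agree with this implementation):

```python
import math

def geometric_mean(values):
    # Antilog of the average of the logs; mathematically equivalent to the
    # n-th root of the product, but numerically safer for long lists.
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(geometric_mean([2, 32]))   # ~8.0, matching the worked example above
```

Because the logs are averaged in floating point, the result may differ from 8 by a tiny rounding error, which is why comparisons should use a tolerance rather than exact equality.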

Exercise
Calculate the geometric mean of 8, 24, 25, 26, 28 and 32.


Harmonic Mean
In mathematics, the harmonic mean is one of several kinds of average. Typically, it is appropriate in situations where an average of rates is desired. The harmonic mean H of the positive real numbers x1, x2, ..., xn is defined as

H = n / (1/x1 + 1/x2 + 1/x3 + ... + 1/xn)

Note: equivalently, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.

Merits and Demerits of Harmonic Mean


Merits
1. The HM of a data set is calculated from every one of its elements.
2. More weight is given to smaller values in calculating the harmonic mean.
3. The HM can be extended to further statistical analysis.

Demerits
1. It is not easy to calculate and understand compared to the AM.
2. It is impossible to determine the harmonic mean if any of the values is zero or negative.
3. It is not a representative value of the distribution unless the analysis requires greater weight to be given to smaller items.

Exercise
Calculate the harmonic mean of 10, 14, 20, 22, 26 and 30. Note that for a given set of observations the following inequality always holds: Arithmetic Mean ≥ Geometric Mean ≥ Harmonic Mean.
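The AM ≥ GM ≥ HM inequality can be verified on the exercise data with the standard-library `statistics` module (`geometric_mean` requires Python 3.8 or later):

```python
from statistics import harmonic_mean, geometric_mean, mean

data = [10, 14, 20, 22, 26, 30]   # the exercise data above

hm = harmonic_mean(data)
gm = geometric_mean(data)
am = mean(data)

print(round(hm, 3), round(gm, 3), round(am, 3))
# The three means come out in increasing order: HM < GM < AM
assert hm <= gm <= am
```

Equality of the three means occurs only when all observations are identical; for any spread-out data set, the harmonic mean is pulled down by small values and the arithmetic mean up by large ones.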

4.3. Median

Median is defined as the middle value in the data set when its elements are arranged in sequential order, that is, in either ascending or descending order of magnitude. It is called a middle value in an ordered sequence of data in the sense that half of the observations are smaller and half are larger than this value.


The median can be calculated for both ungrouped and classified data. To calculate it for ungrouped data, first arrange the data in ascending or descending order. If the number of observations (n) is odd, the median (Med) is the numerical value at the positioning point of the ((n + 1)/2)th observation. Note that the median is only moderately useful for further statistical analysis, unlike the arithmetic mean, which is the most important measure of central tendency (measure of location) for further statistical analysis of a given data set.

Example
Find the median value of the following data.

50 55 60 63 66 69 72 75 78

n = 9 (odd), so we have to find the ((n + 1)/2)th = ((9 + 1)/2)th = 5th value in the data, which is 66.

Then, the median value of the data above is 66.

If the number of observations (n) is an even number, the median is defined as the arithmetic mean of the numerical values of the (n/2)th and (n/2 + 1)th observations in the data array.

Example
Find the median value of the following data.

50 55 60 63 66 69 72 75 78 84

n = 10, which is even.
(n/2)th value = (10/2)th = 5th value = 66
(n/2 + 1)th value = 6th value = 69

Then, the arithmetic mean of 66 and 69 = (66 + 69)/2 = 67.5, which is the median of the data above.

Median for Grouped Data: To find the median for grouped data, first identify the class interval which contains the median value, i.e. the ((n + 1)/2)th observation of the data set. To find this class interval, work out the cumulative frequency of each class and locate the first class whose cumulative frequency is equal to or greater than (n + 1)/2.
Med = L + (((n + 1)/2 - cf) / f) × h

Where,
L = lower class limit of the median class interval
cf = cumulative frequency of the class prior to the median class interval, that is, the sum of all the class frequencies up to, but not including, the median class interval
f = frequency of the median class
h = width of the median class interval
n = total number of observations in the distribution

Example
Table 4.7 represents the dietary energy intake per person per day (kcal/day/person) of 100 rural households in one of the woredas in Ethiopia. Calculate the median value of the dietary energy intake in kilocalorie based on the discussion above.

Steps
Total number of observations (n) = 100
The median is the ((n + 1)/2)th = ((100 + 1)/2)th = (50.5)th observation in the data set.
This observation lies in the 500-1000 class interval.
Applying the formula above, we have:

Med = 500 + ((50.5 - 20)/38) × 500 = 500 + 401.32 = 901.32
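The grouped-median formula is easy to wrap in a small function; a sketch using the values from the dietary-energy example (the function name is illustrative):

```python
def grouped_median(L, n, cf, f, h):
    # Med = L + ((n + 1)/2 - cf) / f * h, following the formula in the text
    return L + ((n + 1) / 2 - cf) / f * h

# Median class 500-1000: L = 500, cf = 20, f = 38, h = 500, n = 100
print(round(grouped_median(L=500, n=100, cf=20, f=38, h=500), 2))   # 901.32
```

Note that this interpolates within the median class, assuming observations are spread evenly across it; the result is an estimate, not an exact data value.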


Table 4.7: Calculation for Median Value

Dietary Energy (kcal/day/person)   No of Households (f)   Cumulative Frequency (<UL)
Below 500                          20                     20
500-1000                           38                     58
1000-1500                          28                     86
1500-2000                          4                      90
2000-2500                          3                      93
2500-3000                          3                      96
3000-3500                          4                      100
Total                              100                    ----

Merits and Demerits of Median

Merits
1. Easy to calculate and understand, even for professionals with little mathematics and statistics.
2. It is not affected by extreme values in a data set.
3. The median can be computed for a distribution with open-ended classes.
4. The median can sometimes be located by simple inspection.

Demerits
1. In the case of an even number of observations in ungrouped data, the median cannot be determined exactly.
2. The median, being a positional value, is not based on each item in the data set.
3. The median is not suitable for further mathematical treatment.
4. The median is more affected by fluctuations of sampling than the arithmetic mean.

Exercise
Let us assume that you have collected data from a factory employing 80 workers during your project work for this course. Your data indicate that the daily wage of 20 workers is less than 10 Eth. Birr, of 30 workers is 10 to 20 Eth. Birr, of 14 workers is 20 to 30 Eth. Birr, of 7 workers is 30 to 40 Eth. Birr, and of the remaining 9 workers ranges from 40 to 50 Eth. Birr. Then calculate the median wage of the workers.


4.4. Mode
In statistics, the mode is the value that occurs most frequently in a data set or a probability distribution. Like the mean and the median, the mode is a way of capturing important information about a random variable or a population in a single quantity. For instance, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.

Calculation of Mode in Grouped Data: For grouped data the following formula is used to calculate the mode.

Mode = L + ((f - f1) / (2f - f1 - f2)) × h

Where,
L = lower limit of the modal class interval
f = frequency of the modal class
f1 = frequency of the class preceding the modal class interval
f2 = frequency of the class following the modal class interval
h = width of the modal class interval

Example
Calculate the mode of the dietary energy intake in kilocalories indicated in Table 4.7.

Steps: The largest frequency (38) corresponds to the class 500-1000. Then we have
L = 500, f = 38, f1 = 20, f2 = 28, h = 500

Mode = 500 + ((38 - 20) / (2(38) - 20 - 28)) × 500 = 500 + 321.4 = 821.4

Though a poor measure of central tendency, the mode has some advantages. It can be found by mere inspection from a simple frequency distribution. It is also unaffected by the presence of extreme values in a data set and can be calculated from frequency distributions with open-ended classes. However, the mode has little significance unless a large number of observations is available, and it has little or no use for further statistical analysis. You should also note that a data set may have one mode value (unimodal distribution), two mode values (bimodal distribution), three mode values (trimodal distribution) or many mode values (multimodal distribution).
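The grouped-mode formula from the example above can be sketched as a Python function (the name is illustrative):

```python
def grouped_mode(L, f, f1, f2, h):
    # Mode = L + (f - f1) / (2f - f1 - f2) * h, as in the text
    return L + (f - f1) / (2 * f - f1 - f2) * h

# Modal class 500-1000 of Table 4.7: f = 38, f1 = 20, f2 = 28, h = 500
print(round(grouped_mode(500, 38, 20, 28, 500), 1))   # 821.4
```

The formula locates the mode inside the modal class by comparing how sharply the frequency rises into the class (f - f1) against how sharply it falls after it (f - f2).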

Merits and Demerits of Mode


Merits
1. The mode value is easy to understand and to calculate.
2. The modal class can be identified by inspection.
3. The mode is not affected by extreme values in the distribution.
4. The mode value can be calculated for open-ended frequency distributions.

Demerits
1. A data set may have more than one mode value, which makes comparison and interpretation more difficult.
2. It is difficult to locate the modal class in the case of multi-modal frequency distributions.
3. The mode is not used for further rigorous statistical analysis.

4.5. Partition Values: Quartiles, Deciles and Percentiles


The term partition value is used here to cover all such measures as the median, quartiles, pentiles, hexiles, heptiles, octiles, deciles and percentiles, which partition a set of data into two, four, five, six, seven, eight, ten and a hundred equal parts, respectively. The commonest partition values, however, are quartiles, deciles and percentiles. The basic purpose of all these data-partitioning measures is to learn more about the characteristics of a data set. In order to split a set of data into a certain number of partitions, the values should first be arranged in ascending or descending order of magnitude and the ordered series then divided into the required number of equal parts. The measures of central tendency used for dividing the data into several equal parts are called partition values. In this section we shall discuss dividing the data into four, ten and a hundred parts of equal size; the corresponding partition values are called quartiles, deciles and percentiles. All these values can be determined in the same way as the median; the only difference is in their location or position.

Quartiles: The values of observations in a data set, when arranged in an ordered sequence, can be divided into four equal parts, or quarters, by three quartiles: the first quartile (Q1), the second quartile (Q2) and the third quartile (Q3). The first quartile divides the distribution in such a way that 25% (1/4) of the observations have a value less than or equal to Q1 and 75% (3/4) have a value greater than or equal to Q1. Similarly, Q2 has 50% of the items with values less than or equal to it.

For discrete data it is simple to locate the partition values. Arrange the data in order, if they are not already, and work out the cumulative frequencies. To find Q1, calculate (N + 1)/4, where N is the total number of observations, and search for the minimum cumulative frequency in which (N + 1)/4 is contained; the variate value against this cumulative frequency is the value of Q1. For Q2, find (N + 1)/2 and search for the minimum cumulative frequency in which (N + 1)/2 is contained; the variate value corresponding to this cumulative frequency is Q2. Calculate 3(N + 1)/4 and locate Q3 in the same manner as Q1 and Q2.

(N + 1) is used instead of N when the population or sample size is relatively small. This is because in a very large sample or population the difference between the ratios of (N + 1) and N to the same denominator is negligible compared with the case of a small sample or population. Hence we use (N + 1) in geographic research, since we usually make use of smaller sample sizes than other fields of study such as psychology.

Example
Calculate Q1, Q2 and Q3 for the hypothetical data given below.

Table 4.8
Daily income   Number of workers   Cumulative frequency
4              2                   2
5              3                   5
7              2                   7
9              4                   11
11             5                   16
14             3                   19
17             4                   23
24             2                   25
28             4                   29
Total          29                  ----

To find Q1, compute 1(N + 1)/4 = 1(29 + 1)/4 = (7.5)th value and find where it is contained. It lies in the cumulative frequency 11, so the Q1 of the data is 9. Continuing the calculation, Q2 and Q3 come out as 11 and 17, respectively.

To find the deciles Di (i = 1, 2, ..., 9), we calculate the value i(N + 1)/10 and search for the minimum cumulative frequency which contains it; the variate value corresponding to this cumulative frequency is the ith decile. Similarly, calculate i(N + 1)/100 (i = 1, 2, ..., 99) for percentiles and, proceeding as for quartiles and deciles, locate the percentiles.

For a grouped set of data, to locate the ith quartile value, first calculate i(n + 1)/4 and proceed as for ungrouped data above: search for the minimum cumulative frequency in which i(n + 1)/4 is contained. The class corresponding to this cumulative frequency is called the ith quartile class (1st, 2nd or 3rd quartile class). The general formula for calculating quartiles for grouped data is:

Qi = L + ((i(n + 1)/4 - cf) / f) × h;  i = 1, 2, 3

Where,
L = lower limit of the ith quartile class
cf = cumulative frequency prior to the ith quartile class
f = frequency of the ith quartile class interval
h = width of the class interval
Qi = ith quartile value, which is to be worked out

Deciles: In descriptive statistics, a decile is any of the nine values Di (i = 1, 2, ..., 9) that divide the sorted data into ten equal parts, so that each part represents 1/10th of the sample or population. The procedure for finding the ith decile class is to calculate i(n + 1)/10 and search for the minimum cumulative frequency in which this value is contained; the class corresponding to this cumulative frequency is the ith decile class. The general formula for calculating deciles for grouped data is:

Di = L + ((i(n + 1)/10 - cf) / f) × h;  i = 1, 2, 3, ..., 9


Percentiles: These represent the values of observations in a data set, when arranged in an ordered sequence, divided into a hundred equal parts by ninety-nine percentiles Pi (i = 1, 2, 3, ..., 99). So the 20th percentile is the value (or score) below which 20 percent of the observations may be found. The terms percentile and percentile rank are often used in descriptive statistics as well as in the reporting of scores from norm-referenced tests. The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3). The general formula for calculating percentiles for grouped data is:

Pi = L + ((i(n + 1)/100 - cf) / f) × h;  i = 1, 2, 3, ..., 99

Example
Let us assume that the following distribution (Table 4.9) gives the pattern of livestock population per household for 100 rural households in Woreda W in Ethiopia. Calculate the median, first quartile, 8th decile and 70th percentile of the grouped data.

Since the number of observations in the data set is 100, the median value is the (n/2)th = (100/2)th = 50th observation. This observation lies in the class interval 25-30. Applying the earlier formula, the median number of livestock can be calculated as:

Med = 25 + ((100/2 - 48)/4) × 5 = 25 + 2.5 = 27.50

Table 4.9: Calculation for Partition Values

Livestock population per household   No of Households (f)   Cumulative Frequency
10-15                                15                     15
15-20                                8                      23
20-25                                25                     48
25-30                                4                      52
30-35                                33                     85
35-40                                8                      93
40-45                                7                      100
Total                                100                    ----

Mesay Mulugeta, 2009

69

Quantitative Methods in Social Sciences

To calculate Q1, first find where the (i(N + 1)/4)th observation is contained, where N = 100. Then 1(N + 1)/4 = 101/4 = (25.25)th value, which is contained in the 20-25 class interval; so 20-25 is the Q1 (first quartile) class. We can now apply the formula above:

Q1 = L + ((i(n + 1)/4 - cf) / f) × h = 20 + ((25.25 - 23)/25) × 5 = 20.45

Similarly, to calculate D8, first find where the (i(N + 1)/10)th observation is contained. Then 8(N + 1)/10 = 8(101)/10 = (80.8)th value, which is contained in the 30-35 class interval; so 30-35 is the D8 (eighth decile) class. Applying the formula, with the cumulative frequency prior to the 30-35 class being 52:

D8 = L + ((i(n + 1)/10 - cf) / f) × h = 30 + ((80.8 - 52)/33) × 5 = 34.36

You can also calculate P70 using the same method as for Q1 and D8 above. First find where the (i(N + 1)/100)th observation is contained. Then 70(N + 1)/100 = 70(101)/100 = (70.7)th value, which is contained in the 30-35 class interval; so 30-35 is the P70 class. Applying the formula:

P70 = L + ((i(n + 1)/100 - cf) / f) × h = 30 + ((70.7 - 52)/33) × 5 = 32.83
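All three partition formulas share the same shape, so a single helper covers quartiles, deciles and percentiles alike (a sketch; the function name and keyword arguments are illustrative):

```python
def grouped_quantile(L, pos, cf, f, h):
    # Generic partition-value formula: L + (pos - cf) / f * h, where pos is
    # i(n+1)/4, i(n+1)/10 or i(n+1)/100 for quartiles, deciles, percentiles.
    return L + (pos - cf) / f * h

n = 100  # number of households in Table 4.9
q1  = grouped_quantile(L=20, pos=1 * (n + 1) / 4,    cf=23, f=25, h=5)
p70 = grouped_quantile(L=30, pos=70 * (n + 1) / 100, cf=52, f=33, h=5)
print(round(q1, 2), round(p70, 2))   # 20.45 32.83
```

Only the position term changes between the three measures, which is why the text can reuse the same derivation for each.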

Exercise
Calculate Q3, D7 and P72 for the grouped data below.

Table 4.10:
Class Interval   Frequency
0-4              12
5-9              34
10-14            67
15-19            73
20-24            82
25-29            74
30-34            66
35-39            54
40-44            48

Remark
Partition values are used in the classification or grouping of data sets where the intervals differ but the frequencies of the classes remain constant. Partition (locational) values are not confined to quartiles, deciles and percentiles: one can also divide a distribution into any other number of groups, say 5, 15, 30, 60 or 90, by adjusting the divisor k in the general expression i(n + 1)/k accordingly.

Relationships between Mean, Median and Mode: In a unimodal and symmetrical distribution the values of the mean, median and mode are equal. In other words, when these three values are not all equal to each other (as in Figure 4.1), the distribution is not symmetrical.

Figure 4.1: Comparison of Mean, Median and Mode (frequency curves; figure not reproduced here)

Relationship            Condition
Mode = Median = Mean    Unimodal and symmetrical distribution.
Mean > Median > Mode    The bulk of the observations lies to the left and the tail stretches to the right: skewed to the right, or positively skewed (see Figure 4.1); here Mean - Mode = 3(Mean - Median).
Mean < Median < Mode    The bulk of the observations lies to the right and the tail stretches to the left: skewed to the left, or negatively skewed (see Figure 4.1); here Mode = 3Median - 2Mean.
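Pearson's empirical relation, Mode = 3·Median - 2·Mean, gives a quick mode estimate when only the mean and median are known. A sketch with hypothetical values (the numbers below are made up for illustration):

```python
# Pearson's empirical relation for moderately skewed distributions:
#     Mode = 3 * Median - 2 * Mean
# Hypothetical summary statistics, chosen only to illustrate the formula.
mean_v, median_v = 50, 46

mode_v = 3 * median_v - 2 * mean_v
print(mode_v)   # 38
print("positively skewed" if mean_v > median_v else "negatively skewed")
```

Since the mean exceeds the median here, the estimated mode falls below both, consistent with the Mean > Median > Mode ordering of a positively skewed distribution.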


Exercise
Find the mean, median, mode, Q3, D4 and P65 of the grouped data set below and comment on the results. Judge whether the distribution is symmetrical or asymmetrical on the basis of Karl Pearson's principle, and remark whether the data set is negatively or positively skewed.

Table 4.11: Rural households by value of tradable material ownership in Kuyu Woreda (January 2001)

Value of materials (Eth. Birr)   No. of Households
0-50                             122
50-100                           95
100-150                          50
150-200                          68
200-250                          18
250-300                          11
300-350                          7
350-400                          14
400-450                          15
Total                            400
Source: Mesay M. (2002)

SPSS Practice
By using SPSS software, create a frequency table, pie chart and bar graph for the data given below. Edit the chart/graph (i.e. insert a title, insert the variables and percentages into the chart/graph, and select suitable colors) using the SPSS Data Editor and Chart Editor windows.

Table 4.12: Farmland covered by major crops during the 1999 crop year in Kuyu Woreda

Crop       Area in hectares
Teff       254.0
Sorghum    96.3
Wheat      64.0
Barley     14.6
Maize      6.8
Pulses     29.7
Oilseeds   54.1
Oats       1.0
Total      520.5
Source: Mesay M. 2002


Exercise
Let us assume that the mean and median of a certain skewed (asymmetrical) data set are 117 and 86 units, respectively. Find the mode value of this data set and comment on whether it is positively or negatively skewed.


Unit Five Absolute and Relative Measures of Dispersion


Unit Objective
Having studied this chapter, you should be able to:
- Explain the terms and mathematical formulas used to measure the level of data dispersion
- Understand the techniques and methods of measuring dispersion
- Determine the nature of variability or dispersion in any set of quantitative data
- Know the purposes of measures of dispersion
- Differentiate absolute and relative measures of dispersion
- Identify the merits and demerits of each measure of dispersion

5.1. Introduction
The term dispersion in statistical methods refers to the variability or spread in a data set. Measures of dispersion, also known as second-order analysis or measures of spread or variability, provide information about the spread of the scores in quantitative techniques. They help us to know whether the scores cluster around certain measures of central tendency or spread out over a large segment of the scale. Measures of dispersion are best employed when measures of central tendency fail to distinguish two or more distributions explicitly. For example, two sets of observations may have the same arithmetic mean; it then becomes difficult to identify their specific differences. See the two data sets below.
Data Set A: 90, 95, 100, 105, 110 (Sum = 500 units, Mean = 100 units, SD = 7.07 units, Range = 20 units)
Data Set B: 80, 90, 100, 110, 120 (Sum = 500 units, Mean = 100 units, SD = 14.14 units, Range = 40 units)

Data Set B is more dispersed than Data Set A (SD_B > SD_A).

Look at the characteristics of the two data sets (Data Set A and Data Set B) above: in both cases the arithmetic mean is 100 units, which may lead to the wrong interpretation that the distributions are similar. The actual inference can be drawn only by using a measure of dispersion such as the range, standard deviation or coefficient of variation.

5.2. Purposes of Measures of Dispersion


Various measures of dispersion are calculated with the following purposes, as pointed out by Agarwal, B.L. (2006):

1. To have an idea about the reliability of central values: If the scatter or variability is large, an average is less representative and less reliable; if the value of dispersion is small, it indicates that the central value is a good representative of all the values in the set.

For instance, compare the means and the measure of variability (i.e. the standard deviation) of the two data sets below. Data set I is characterized by a far smaller measure of variability (SD), so its mean represents the values much better than in the case of data set II.

Data set I:  7, 6, 9, 8, 10          Mean = 8       Standard deviation ≈ 1.6
Data set II: 2, 120, 50, 300, 750    Mean = 244.4   Standard deviation ≈ 304.4

2. To compare two or more sets of values with regard to their variability: Two or more sets of values can be compared by calculating the same or similar measures of dispersion. A set with smaller value possesses lesser variability. Look at the example in serial number 1 above. 3. To provide information about the structure of a series: A value of measure of dispersion gives an idea about the spread of the observation.


4. To pave the way for the use of other statistical measures: measures of dispersion, especially the standard deviation and variance, lead to many other statistical techniques such as correlation, the coefficient of variation, regression and analysis of variance (ANOVA).

A measure of dispersion, therefore, is defined as a numerical value explaining the extent to which individual observations vary among themselves. Measures of dispersion can be broadly categorized into two: absolute and relative measures. The main difference between the two is that absolute measures express the heterogeneity of the data numerically and are not free of the unit of measurement. Absolute measures, unlike the relative ones, therefore cannot be used to compare the degree of heterogeneity of two data sets that are not measured in the same unit. The following diagram gives an overview of measures of dispersion.

5.3. Classification of Measures of Dispersion

5.3.1. Absolute Measures of Dispersion

Range


The range (the difference between the maximum and minimum values) is the simplest measure of data spread or variability. If there is an outlier in the data, however, it will be the minimum or maximum value, so the range is not robust to outliers. The range uses information from only the two extreme values and is highly unstable as a result: it ignores every observation in the set except the maximum and the minimum.

Example
The following data are the mean minimum temperatures in °C for the month of January for 10 years (1994–2004) taken from Tullubolo weather station. Find the range of the temperature records.

12.5  11.0  11.4  10.0  8.8  7.8  8.8  9.8  8.8  10.5

Solution:
Max value = 12.5 °C, Min value = 7.8 °C
Range = Max value - Min value = 12.5 °C - 7.8 °C = 4.7 °C


Figure 5.1: Classification of Measures of Dispersion

Measures of Dispersion/Deviation
- Absolute dispersion: Range, Quartile Deviation, Mean Deviation, Standard Deviation, Variance
- Relative dispersion: Coefficient of Variation, Coefficient of Quartile Deviation, Coefficient of Mean Deviation

Interquartile Range (IR)

The interquartile range is the difference between the third and first quartiles:

IR = Q3 - Q1

Quartile Deviation (QD)

The quartile deviation, or semi-interquartile range, is half of this difference:

QD = (Q3 - Q1) / 2

Exercise
The following data are the mean maximum temperatures in °C for the month of January for 10 years (1998–2008) taken from Entoto weather station. Find the Interquartile Range (IR) and Quartile Deviation (QD) for the temperature records and statistically explain the result you have calculated.

19.1  18.9  19.1  19.5  19.2  19.9  20.3  20.2  21.1  21.2
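One way to check a hand calculation of the IR and QD is to script it. The sketch below takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data (Tukey's hinges); other quartile conventions, including the one SPSS uses, may give slightly different values.

```python
def median(values):
    s = sorted(values)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def quartile_spread(values):
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])    # median of the lower half
    q3 = median(s[-half:])   # median of the upper half
    ir = q3 - q1             # interquartile range
    return ir, ir / 2        # (IR, QD)

entoto_jan = [19.1, 18.9, 19.1, 19.5, 19.2, 19.9, 20.3, 20.2, 21.1, 21.2]
print(quartile_spread(entoto_jan))
```

For these ten records the middle 50% of the Januaries span only about 1.2 °C.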


Mean Deviation (MD)


This is also an absolute measure of dispersion. It is the average of the absolute deviations taken from a central value, usually the mean or the median, ignoring the signs. Although the standard deviation is a more accurate measure of the error margin, the mean (average) deviation is often used because it is easy to calculate. The average deviation estimates how far, on average, the actual values are from the average value, assuming that the measuring device is accurate, and it can be used as the estimated error. It is sometimes given as a number and sometimes as a percentage. This measure is not used in more advanced statistical techniques. Follow the steps below to find the mean deviation:

1. Find the average value of your measurements.
2. Find the difference between your first value and the average value. This is called the deviation.
3. Take the absolute value of this deviation.
4. Repeat steps 2 and 3 for your other values.
5. Find the average of the deviations. This is the average (mean) deviation.

The formula for MD is:

MD = Σ|xi - x̄| / N    (for raw data)

MD = Σf|xi - x̄| / N   (for grouped data)

Here the two vertical bars | | indicate absolute values, i.e. the deviations taken ignoring their negative signs; the other symbols have the meanings discussed earlier. In the case of grouped data, the mid-point of each class interval is treated as xi.
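The five steps above can be sketched directly for raw data (deviations taken from the arithmetic mean; the median could be substituted in step 2):

```python
def mean_deviation(values):
    n = len(values)
    mean = sum(values) / n                         # step 1: the average
    deviations = [abs(x - mean) for x in values]   # steps 2-4: |xi - mean|
    return sum(deviations) / n                     # step 5: average deviation

print(mean_deviation([7, 6, 9, 8, 10]))  # data set I from earlier: 1.2
```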

Exercise
Calculate mean deviation for the data given below.

Monthly minimum temperature of Bushoftu town (Feb. 1994–2001) in °C

27.9  28.1  28.7  27.2  27.8  28.1  27.5  27.7
Source: NMSA

Variance
Variance is the mean of the squared deviations from a central value, generally the arithmetic mean (i.e., the average of the squared deviations). The symbol for variance is s² (sample) or σ² (population), accompanied by a subscript for the corresponding variable where necessary. The main demerit of variance is that its unit is the square of the unit of measurement of the values, and the resulting value can be so large that it is difficult to judge the magnitude of the variation. The formulas for the variance of a variable x for ungrouped (raw) data are:

σ² = Σ(xi - μ)² / N    and    s² = Σ(xi - x̄)² / (n - 1)

Where:
s² = sample variance, σ² (sigma squared) = population variance
x̄ = sample mean, μ = population mean
N = number of items in the population, n = number of items in the sample
xi = population or sample values
Variance is an important measure in statistics, particularly in assessing variation between two or more samples of a population. One very powerful statistical technique known as analysis of variance (ANOVA) uses variance to help decide whether two or more sets of samples differ significantly from each other. ANOVA will be discussed later in this course. In the case of grouped data, the mid-values of the classes are considered as xi, and consequently we can make use of the formulas below:


σ² = Σf(xi - μ)² / N    and    s² = Σf(xi - x̄)² / (n - 1)

In the formulas above, the numerator, Σf(xi - x̄)², is called the total sum of squares (TSS). TSS measures the total variation among the values in a data set, while the variance measures the average variation. The larger the value of TSS or of the variance, the greater the variation among the values of the data set.

Properties of Variance
- The deviations are squared only to get rid of the negative signs. If they were not squared, no useful value could be obtained, because Σ(xi - x̄) is always zero.
- The main drawback of variance is that its unit is the square of the unit of measurement of the observations. This results in a large value that is difficult to interpret.
- The variance gives more weight to extreme values than to those near the mean, because each difference is squared.
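A minimal sketch of these definitions for raw data, returning both the total sum of squares (TSS) and the variance; the `sample` flag chooses between the n - 1 and N denominators:

```python
def variance_stats(values, sample=True):
    """Return (TSS, variance); sample=True divides by n - 1, else by N."""
    n = len(values)
    mean = sum(values) / n
    tss = sum((x - mean) ** 2 for x in values)  # total sum of squares
    return tss, (tss / (n - 1) if sample else tss / n)

data = [2, 120, 50, 300, 750]  # data set II from earlier
print(variance_stats(data))
```

The huge variance relative to the individual values illustrates the squared-unit drawback noted above.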

Standard Deviation
Standard deviation, also known as the root mean squared deviation, expresses the average amount of variation on either side of the mean. It is considered the best measure of dispersion and is the most widely used, even in advanced techniques. The population (σ) and sample (s) standard deviations are the positive square roots of their respective variances and have the desirable property of being in the same units as the data: if the data are in hectares, the standard deviation is in hectares as well. The standard deviation formula for ungrouped (raw) sample data is:

s = √[Σ(xi - x̄)² / (n - 1)]

Notice the difference between the sample and population standard deviations. The sample standard deviation uses n - 1 in the denominator and is hence slightly larger than the population standard deviation, which uses N. The SD formula for ungrouped population data is:


σ = √[Σ(xi - μ)² / N]

Where: μ = population mean, σ (sigma) = population SD, N = number of items in the population, n = number of items in the sample
We have already discussed the use of Greek letters for population parameters as opposed to sample statistics. This is why s is used for the sample standard deviation and σ (sigma) is used for the population standard deviation. Another sigma, the capital one (Σ), appears inside the formula; it indicates that we are adding things up. What are added up are the deviations from the mean, (xi - x̄). But the average deviation from the mean is actually zero, which is why the mean deviation is occasionally used with the absolute-value symbols, |x - x̄|. A better measure of variation, however, comes from squaring each deviation, summing the squares, and then taking the square root after dividing by the number of data elements (or one less than that). If you compare this with the formula for the quadratic mean you will realize we are doing the same thing, except for what we divide by. The SD formulae for grouped population and sample data are:
σ = √[Σf(xi - μ)² / N]    and    s = √[Σf(xi - x̄)² / (n - 1)]
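The grouped-data formula can be sketched with class mid-points standing in for xi. The classes and frequencies below are illustrative, not from the text:

```python
import math

def grouped_sd(midpoints, freqs, sample=False):
    """SD for grouped data, using class mid-points as the values."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    ss = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
    return math.sqrt(ss / (n - 1 if sample else n))

# Hypothetical classes with mid-points 10, 20, 30 and frequencies 2, 5, 3
print(grouped_sd([10, 20, 30], [2, 5, 3]))  # population SD = 7.0
```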

Properties of Standard Deviation
- The drawbacks arising from the squared units of the variance are overcome in this measure of dispersion.
- One simple advantage of the SD over the variance is that, whenever the variance exceeds 1, the SD of a set of values is considerably smaller than the variance; the variance of a data set is a large number, often larger than any individual value in the set.
- The demerit of the standard deviation, however, is that the variability of two or more data sets cannot be compared if the variables are not measured in the same unit.


5.3.2. Relative Measures of Dispersion

Relative measures of dispersion, also known as coefficients, are third-order analysis. They are used when the absolute measures of dispersion are the same or are otherwise unable to distinguish the characteristics of two distributions. In addition, distributions measured in different units, say rainfall and temperature (temporally or spatially), cannot be compared by absolute measures of dispersion or location; only the coefficients can tell whether rainfall or temperature is more dispersed.

Coefficient of Quartile Deviation (CQD)

This is a dimensionless (unitless) quantity and is useful for comparing the variability of the middle 50% of the observations. It excludes the lowest and highest values, and as a result is not affected by extreme values. For the same reason, however, it is not considered a good measure of dispersion, as it does not show the scattering of the remaining values, and it is not commonly used.

CQD = (Q3 - Q1) / (Q3 + Q1)

Coefficient of Range or Range Coefficient (CR)

CR = (Maximum - Minimum) / (Maximum + Minimum)

a) Coefficient of Variation (CV)

Also known as relative variability, the coefficient of variation is a normalized measure of the dispersion of a distribution, defined as the ratio of the standard deviation to the mean:

CV = s / x̄  (sample)    or    CV = σ / μ  (population)

It is only defined for a non-zero mean and is most useful for variables that are always positive. The coefficient of variation describes the variation in the data relative to the magnitude of its values.

It is often reported as a percentage (%) by multiplying the above ratio by 100. The coefficient of variation is useful because the standard deviation of a data set must always be understood in the context of the mean of the data. Being a dimensionless number, it allows comparison between data sets with different units or wildly different means, for which the standard deviation itself should not be used. When the mean is near zero, however, the coefficient of variation is sensitive to small changes in the mean, which limits its usefulness.
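A short sketch of the point about units: the standard deviations of a rainfall series (mm) and a temperature series (°C) are not directly comparable, but their CVs are. The figures below are hypothetical.

```python
import math

def cv(values):
    """Coefficient of variation: population SD divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sd / mean

rainfall_mm = [820, 950, 700, 1100, 880]        # hypothetical annual totals
temperature_c = [19.1, 18.9, 19.5, 20.2, 19.8]  # hypothetical annual means
print(f"rainfall CV = {cv(rainfall_mm):.1%}")
print(f"temperature CV = {cv(temperature_c):.1%}")
```

Here rainfall is the more dispersed variable relative to its mean, a conclusion the raw SDs alone could not support.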

Example
The following table shows 65 sample farmers by their annual crop output in Adama district. Find the mean deviation, variance, standard deviation and coefficient of variation (CV) of the data.

Table 5.1: Number of Sample Farmers by their Annual Crop Output (Hypothetical Data)

Crop Output in Quintals   Number of Farmers
10–19     3
20–29     5
30–39     8
40–49     14
50–59     12
60–69     10
70–79     7
80–89     4
90–99     2

In order to calculate the requested values, we must first find the class mid-point (x), f, fx, the mean (x̄), |x - x̄|, (x - x̄)² and f(x - x̄)². Look at the table below.

Table 5.2:

Output   Mid-point (x)   f    fx       x - x̄   |x - x̄|   f|x - x̄|   (x - x̄)²   f(x - x̄)²
10–19    14.5            3    43.5     -38     38        114        1444       4332
20–29    24.5            5    122.5    -28     28        140        784        3920
30–39    34.5            8    276.0    -18     18        144        324        2592
40–49    44.5            14   623.0    -8      8         112        64         896
50–59    54.5            12   654.0    2       2         24         4          48
60–69    64.5            10   645.0    12      12        120        144        1440
70–79    74.5            7    521.5    22      22        154        484        3388
80–89    84.5            4    338.0    32      32        128        1024       4096
90–99    94.5            2    189.0    42      42        84         1764       3528
Total    ---             65   3412.5   ---     202       1020       ---        24240


Solution:

Mean x̄ = Σfx / n = 3412.5 / 65 = 52.5

MD = Σf|x - x̄| / n = 1020 / 65 = 15.69

S² = Σf(x - x̄)² / (n - 1) = 24240 / 64 = 378.75

S = √[Σf(x - x̄)² / (n - 1)] = √378.75 = 19.46

CV = S / x̄ = 19.46 / 52.5 = 0.37 = 37%
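The worked example can be re-checked by script. Note that the sum of squares is 24240, so dividing by n - 1 = 64 gives a sample variance of 378.75 (SD 19.46), while dividing by n = 65 would give 372.92 (SD 19.31); the sketch below uses the sample formula, and either way the CV rounds to 37%.

```python
import math

midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
freqs = [3, 5, 8, 14, 12, 10, 7, 4, 2]

n = sum(freqs)                                           # 65 farmers
mean = sum(f * x for x, f in zip(midpoints, freqs)) / n  # 52.5 quintals
md = sum(f * abs(x - mean) for x, f in zip(midpoints, freqs)) / n
s2 = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / (n - 1)
sd = math.sqrt(s2)
cv = sd / mean

print(n, mean, round(md, 2), round(s2, 2), round(sd, 2), f"{cv:.0%}")
```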

Though it is beyond the scope of this resource material to deal with them, there are also statistical techniques to measure inequalities, concentrations and diversifications of a variable. These include, for instance, the Lorenz Curve, Gini Coefficient, Theil Index of Inequality, Herfindahl-Hirschman Index and Tideman-Hall Index.

Exercises
1. The following table contains hypothetical data on the number of urban households by their monthly income in Kebele K.
a) Find the mean deviation, variance, standard deviation and coefficient of variation (CV) of the data based on the example given above.

Table 5.3:
Monthly Income   Number of Households
501–750      16
751–1000     24
1001–1250    19
1251–1500    28
1501–1750    10
1751–2000    12
2001–2250    8
2251–2500    6
2501–2750    6
2751–3000    2
Total        131

b) Try to answer the question above using SPSS software and compare your answers with those you obtained in question (a).

2. Calculate the variance, standard deviation and coefficient of variation for the two sets of data below and comment on the results you have calculated. Which set of data is more dispersed? Table 5.4: Data Set x: 450 465 Data Set y: 45 54 87 56 112 76 232 45 132 34 233 54 546 345 435 67 342 10 23 27 460 765 500 496 440 392 389 871 986 567 457 987 876 234 345 567 987 432

3. Calculate the mean maximum values, standard deviation and coefficient of variation for the temperature data given below. Comment on the results you have found.

Table 5.5: Monthly maximum temperature of Adama in °C (2000–2005)

Year   Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
2000   27.3   28.3   30.2   30.3   30.8   29.8   25.9   25.6   26.7   25.8   25.8   25.3
2001   24.7   27.8   28.5   29.8   29.6   28.2   25.8   25.3   27.4   28.7   27.3   26.8
2002   26.2   29.2   29.9   31     32.8   30.9   29.7   26.6   28.2   29.8   28.6   26.5
2003   27     29.5   29.8   29.1   32.4   30.1   25.5   25.9   27     28.5   27.5   25.4
2004   27.8   28.2   29.3   28.7   *      29.6   26.3   26     27.3   26.7   26.8   26.2
2005   26.6   29.6   29.9   30.2   29.1   29.6   25.6   26.7   27.6   28.9   27.5   26.3

(Complete the Mean Max, SD and CV (%) rows for each month.)
Source: NMSA


Unit Six Skewness, Moments and Kurtosis


Unit objectives
Having studied this chapter, you should be able to:
- Explain the terms and mathematical formulas used to determine skewness, moments and kurtosis
- Understand the techniques and methods of measuring skewness, moments and kurtosis
- Determine the nature of skewness and kurtosis in any set of quantitative data
- Know the purposes of measures of skewness, moments and kurtosis
- Appreciate the use of skewness, moments and kurtosis in quantitative data analysis

6.1. Introduction
In the previous chapters we have discussed measures of location and variation of a data set to describe the nature of individual values in the data set. However, the analysis of a data set still remains incomplete until we measure the degree to which these individual values in the data set deviate from symmetry on both sides of the central value and the direction in which these are distributed.

6.2. Measure of Skewness


A frequency distribution of a set of values that is not symmetrical is called asymmetrical or skewed. In a skewed distribution, extreme values incline towards one side or tail of the distribution, producing a longer tail in that direction. If the extreme values incline towards the upper or right tail, the distribution is known as positively skewed; if they incline towards the lower or left tail, the distribution is said to be negatively skewed. The relationship between the three measures of central tendency (mean, median and mode) tells us the nature of the skewness of a data set: for a positively skewed distribution Mean > Median > Mode, for a negatively skewed distribution Mean < Median < Mode, and for a symmetrical distribution Mean = Median = Mode.


Graphically, in a symmetrical distribution the lengths of the two segments of the curve on either side of the peak are equal, while in an asymmetrical curve one of the segments or tails is longer than the other. Look at Figure 6.1.

1. Negative skew: the left tail is longer and the mass of the distribution is concentrated on the right of the figure; there are a few relatively low values. The distribution is said to be left-skewed. In such a distribution the mean is lower than the median, which in turn is lower than the mode (i.e. mean < median < mode), and the skewness coefficient is lower than zero.

Figure 6.1: Symmetrical and Skewed Distribution

2. Positive skew: the right tail is longer and the mass of the distribution is concentrated on the left of the figure; there are a few relatively high values. The distribution is said to be right-skewed. In such a distribution the mean is greater than the median, which in turn is greater than the mode (i.e. mean > median > mode), and the skewness coefficient is greater than zero.

3. Symmetric distribution: if there is no skewness, i.e. the distribution is symmetrical (and unimodal) like the bell-shaped normal curve, then mean = median = mode.

The degree of skewness can be measured by both absolute skewness and relative skewness (the coefficient of skewness). For an asymmetrical distribution, the distance between the mean and the mode may be used to measure the degree of skewness, because the mean is equal to the mode in a symmetrical distribution:

Absolute Skewness (Sk) = Mean - Mode


For a positively skewed distribution Mean > Mode, so Sk is positive; otherwise it is negative. Besides the absolute method above, there are at least three important relative measures of skewness: Karl Pearson's Coefficient of Skewness, Bowley's Coefficient of Skewness and Kelly's Coefficient of Skewness. For the purposes of this course, we follow the procedure used by Karl Pearson. The measure suggested by Karl Pearson for the coefficient of skewness is:

SkP = (Mean - Mode) / SD

Where SkP = Karl Pearson's Coefficient of Skewness. Since a mode does not always exist uniquely in a distribution, it is convenient to define the measure using the median. From our previous discussion in this course, you know the relationship between mode, median and mean:

Mean - Mode = 3(Mean - Median)
Mode = 3·Median - 2·Mean

By substituting this value of the mode into the formula above, we get:

SkP = 3(Mean - Median) / SD

Note: the value of SkP theoretically varies between +3 and -3.
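Pearson's median-based coefficient can be sketched for raw data as follows (the sample standard deviation is assumed, and the data are illustrative):

```python
import math

def pearson_skew(values):
    """Karl Pearson's coefficient: 3(mean - median) / sample SD."""
    n = len(values)
    mean = sum(values) / n
    s = sorted(values)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return 3 * (mean - median) / sd

print(pearson_skew([1, 2, 2, 3, 3, 3, 10]))  # long right tail -> positive
```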

Example
Calculate the mode, mean, median and Pearson s Coefficient of Skewness for a set of data indicated below. Comment on the results!


Table 6.1:

Class Interval (x)   Frequency (f)
21–25     5
26–30     15
31–35     41    (median class, f0)
36–40     42    (modal class, f1)
41–45     2     (f2)
46–50     12
51–55     3
          N = 120

Solution
From the table above, the mode lies in the class 36–40 by inspection. Then, by applying the formula we discussed earlier, the mode can be calculated as:

Mode = L + [(f1 - f0) / (2f1 - f0 - f2)] × h
     = 36 + [(42 - 41) / (2×42 - 41 - 2)] × 4
     = 36 + 4/41 ≈ 36.098

To calculate the mean, we can make use of any method (formula) we have seen in our earlier discussions.

Table 6.2:

Class Interval (x)   Mid-point (x)   Frequency (f)   fx
21–25     23     5      115
26–30     28     15     420
31–35     33     41     1353
36–40     38     42     1596
41–45     43     2      86
46–50     48     12     576
51–55     53     3      159
Total            120    Σfx = 4305


Mean x̄ = Σfx / n = 4305 / 120 = 35.875

Now we can calculate the remaining measure of central tendency, the median:

Mode = 3·Median - 2·Mean
3·Median = Mode + 2·Mean
Median = (Mode + 2·Mean) / 3 = (36.098 + 2 × 35.875) / 3 ≈ 35.95

We are now left with the standard deviation, after which we can easily calculate Pearson's Coefficient of Skewness. The standard deviation can be calculated from the formula below:

σ = √[Σf(xi - x̄)² / N]

Table 6.3:

Class Interval (x)   Mid-point (x)   Mean (x̄)   x - x̄   (x - x̄)²   f    f(x - x̄)²
21–25     23     36     -13     169     5      845
26–30     28     36     -8      64      15     960
31–35     33     36     -3      9       41     369
36–40     38     36     2       4       42     168
41–45     43     36     7       49      2      98
46–50     48     36     12      144     12     1728
51–55     53     36     17      289     3      867
Total                                          5035

(The mean has been rounded to 36 for ease of calculation.)

σ = √[Σf(x - x̄)² / N] = √(5035 / 120) = 6.5

Then, Pearson's Coefficient of Skewness:

SkP = 3(Mean - Median) / SD = 3(35.875 - 35.95) / 6.5 = -0.035

Since the coefficient of skewness is -0.035 (SkP = -0.035), the distribution is slightly skewed to the left (slightly negatively skewed); the concentration of the values lies slightly towards the lower end, to the extent of 3.5%.

Remark: Skewness is an important concept in geographical statistics because many of the variables measured in geographical studies show highly skewed distributions. This fact has two important consequences. First, it casts doubt on the validity of applying parametric statistical tests to such data. A high degree of skewness is one sign that sample data are not normally distributed; they are therefore unlikely to come from a population which is normally distributed. At this juncture you have to remember that a normal distribution is symmetrical and has a skewness of zero. Secondly, other descriptive measures, particularly the mean, may be misleading if used in isolation: in a highly skewed data set the mean on its own is not a very informative measure. Besides those discussed above, there are various other statistical measures of skewness. The most common one, properly known as momental skewness, is calculated using the following equation:
Skewness = Σ(x - x̄)³ / (n·σ³)

Where Σ(x - x̄)³ denotes the sum of the cubed deviations of the values from their mean, σ is the standard deviation and n is the number of values. The value of skewness for a symmetrical distribution is zero. Logically, positive values of the index indicate positive skewness and negative values indicate negative skewness.

6.3. Moments

Measures based on moments include the mean, the average deviation and the standard deviation. The value of these measures is obtained by taking the deviations of the individual observations from a given origin. As in physics, a moment is determined by (i) the size of the class interval, representing the force, and (ii) the deviation of the mid-value of each class from an observation, representing the distance. Moments can be calculated about the mean, an arbitrary mean, zero (the origin), or any other arbitrary point in a set of data.


points in a set of data. For example, let x1, x2 , given by:


For ungrouped data mr (x n x)r

.. xn be the n observations in a data set with the

Then the rth moment about the actual mean of a variable both for ungrouped and grouped data is

, r 1, 2 , 3, 4

For grouped

data mr

f (x n

x) r

r 1, 2, 3, 4
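The two formulas can be sketched for ungrouped data; m1 is always zero (apart from rounding) and m2 is the population variance:

```python
def moments_about_mean(values):
    """First four moments about the mean for ungrouped data."""
    n = len(values)
    mean = sum(values) / n
    return [sum((x - mean) ** r for x in values) / n for r in (1, 2, 3, 4)]

print(moments_about_mean([2, 4, 4, 6, 9]))  # m1 ~ 0, m2 = variance
```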

Exercise
Calculate the first four moments about the mean for the following grouped data and explain the results you have found. Explain whether the distribution is symmetric or asymmetric, and whether it is positively or negatively skewed. Comment also on whether the distribution is leptokurtic, platykurtic or mesokurtic.

Table 6.4:

Class Interval   Frequency
22–26     2
27–31     5
32–36     4
37–41     8
42–46     10
47–51     9
52–56     3

6.4. Kurtosis
Kurtosis (from the Greek word kyrtos or kurtos, meaning bulging) describes the degree of concentration of frequencies in a given distribution. That is, whether the observed values are concentrated more around the mode (a peaked curve) or away from the mode towards both tails of the frequency curve. In other words, the term kurtosis in statistics refers to the degree of flatness or peakedness in the region about the mode of a frequency curve.


Note that two or more distributions may have identical averages, variation and skewness, yet show different degrees of concentration of values around the mode and hence different degrees of peakedness. The usual measure of kurtosis is calculated with the following equation:

Kurtosis = [Σ(x - x̄)⁴ / n] / [Σ(x - x̄)² / n]²

Where Σ(x - x̄)⁴ denotes the sum of the fourth powers of the deviations of the values from the mean, Σ(x - x̄)⁴/n is the fourth moment about the mean, the denominator is the square of the variance, and n is the number of values. A normal distribution has a kurtosis of 3.0, while a very peaked (leptokurtic) distribution has a kurtosis greater than 3.0 and a very flat (platykurtic) distribution has a kurtosis less than 3.0. Kurtosis is based on the size of a distribution's tails: distributions with relatively large tails are called leptokurtic, those with small tails are called platykurtic, and a distribution with the same kurtosis as the normal distribution is called mesokurtic. A second definition, excess kurtosis, subtracts 3 from this value so that the standard normal distribution has a kurtosis of 0; with this second definition, positive kurtosis indicates a peaked distribution and negative kurtosis indicates a flat distribution. Like skewness, kurtosis gives valuable information about the distribution of a set of data values, in addition to that provided by the mean and standard deviation.
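A sketch of this ratio, computed as the fourth moment about the mean divided by the squared variance, with an optional switch to the "excess" definition (kurtosis - 3):

```python
def kurtosis(values, excess=False):
    """m4 / m2**2; with excess=True, 3 is subtracted."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth moment
    k = m4 / m2 ** 2
    return k - 3 if excess else k

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # uniform-like, platykurtic
print(kurtosis(flat))               # well below 3
```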

Exercises
A) Find the Variance, Skewness and Kurtosis of the following frequency distribution by the method of moments. Explain the results you have calculated.


Height in Inches   No. of Soldiers (f)
58–61     11
62–65     14
66–69     11
70–73     6
74–77     8

B) Compute the first four moments about the mean from the following data. Comment on the result you have found.

Mid-value of the grouped data:   17   22   27   32   37   42
Frequency:                        8   12   13   16    5    7


Unit Seven Elementary Spatial Analysis


Unit objectives
At the end of this chapter, you should be able to:
- Explain the basic concept of spatial statistics
- Differentiate several techniques of spatial statistics
- Apply the techniques of spatial statistics in relevant aspects of geographic studies
- Appreciate the use of spatial statistics in the quantitative analysis of geographic data

7.1. Introduction
Although there may be disagreement in detail about the nature and purpose of geography, it can reasonably be argued that one of the central themes of geographical enquiry is the appreciation of the distribution of phenomena on the earth's surface. In view of the enthusiasm with which geographers embraced the "quantitative revolution", it is perhaps surprising that remarkably little progress has been made in the development and application of spatial statistics. Although a wide variety of statistical techniques have been applied by geographers to data collected on an areal basis, only a handful of techniques are in common use for analyzing the spatial distribution of these data.

There are several possible explanations for this state of affairs. First, there is the lack of pre-existing theory and methods of statistical analysis applicable to spatial distributions: neither geographers nor statisticians have shown a great deal of interest in spatial statistics in the past, and geographers are in a minority in demanding them now. Secondly, many of the techniques which do exist are, at least at first sight, difficult to understand and tedious to use. It is certainly true that some of the most advanced spatial techniques cannot be applied to real problems and data without the aid of a computer. Today, however, powerful software packages such as ArcGIS have been introduced into this area of geography and have remarkably eased spatial geographic data analysis.


Despite these reservations, there are several simple spatial techniques which can be applied using manual methods of calculation and which are intuitively easy to understand. It may well be that the present underuse of spatial techniques is merely due to lack of publicity. Various types of phenomena (points, lines and areas) can be studied using these techniques, and a number of different characteristics can be measured, including central tendency, dispersion, shape, pattern and spatial relationships.

7.2. Central Tendency in Point Patterns


In our previous discussions, various measures of central tendency were applied to sets of non-spatial tabular data. Each of these measures gives some indication of the average value in a set of data, or the centre of a frequency distribution. When dealing with spatial distributions, the concept of a centre is intuitively reasonable, but there are several ways in which the position of such a centre can be calculated, each of which will give a different result. It is important to realize that there is no one correct answer to the problem of finding the centre of a spatial distribution. Each measure has a different interpretation, and the choice should be determined by the nature of the problem.

7.2.1. Mean Centre of Point Distribution


The mean centre is the simplest measure of the centre of a spatial distribution. It is analogous to the mean of a set of data and is calculated in a very similar way. Figure 7.1 shows a hypothetical spatial distribution of points. It could be the distribution of towns or of any other geographical phenomenon. As a first step in calculating the mean centre, it is necessary to quantify the locations of the points with a co-ordinate system. This can be done by laying a set of rectangular axes over the map showing the locations and reading off the co-ordinates of each point. With reference to these X and Y axes, the co-ordinates can be measured either in centimetres/inches or in ground distances using the map scale. For the calculation of most spatial statistics, the positions of points need to be measured in relation to some such co-ordinate system. The orientation of the co-ordinate grid is quite arbitrary, however. Geographers are used to measuring location in terms of eastings and northings, but there is no reason why they should not use, say, south-eastings and north-eastings. Similarly, the origin of the grid, the point from which the co-ordinates are measured, is arbitrary. For example, the national grid origin of Ethiopia is found at 0°N latitude and 34°30'E longitude. The only prerequisites of a co-ordinate system which is to be used in the calculation of spatial statistics are:
1. The co-ordinate axes must be at right angles to each other; in other words, they must be orthogonal axes.
2. Measurements along the two axes must be made in the same units.

In Figure 7.1 below, an arbitrary co-ordinate system has been superimposed, with its origin at the bottom left-hand corner. For simplicity the horizontal axis, measuring eastings, has been labeled x, and the vertical axis, measuring northings, has been labeled y. The axes have been marked off in arbitrary distance units. The co-ordinates of all the points are given in Table 7.1.

Figure 7.1: Identification of the Spatial Mean Centre

[Eight towns, T1 to T8, are plotted on the map; map scale 1:50,000.]

The mean centre can now be found simply by calculating the mean of the x co-ordinates (eastings) and the mean of the y co-ordinates (northings). These two mean co-ordinates mark the location of the mean centre. The equations for the mean centre are thus:

x̄ = Σx / n        ȳ = Σy / n

where x and y are the co-ordinates of the points, x̄ and ȳ are the means of the x and y co-ordinates respectively, and n is the number of points. The calculation of the mean centre for Figure 7.1 is given in Table 7.1.

Table 7.1: Finding the spatial mean centre (co-ordinate values in cm)

Points or Towns    X       Y
T1                  1.0    6.7
T2                  6.5    6.6
T3                 11.6    6.7
T4                  1.9    4.4
T5                 10.9    4.4
T6                  7.0    3.1
T7                  4.1    0.9
T8                 12.0    1.0
Total              55.0   33.8

With n = 8:

x̄ = 55.0 / 8 = 6.875        ȳ = 33.8 / 8 = 4.225

The co-ordinates of the spatial mean centre are therefore (6.875, 4.225), and we can easily locate the spatial mean centre of the towns in Figure 7.1 above. The spatial mean centre of population can also be calculated, provided that the population size of each town (P1, P2, ...) is given.

The simple formulas to calculate the spatial mean centre of population are:

X co-ordinate = (P1X1 + P2X2 + P3X3 + ... + PnXn) / (P1 + P2 + P3 + ... + Pn)

Y co-ordinate = (P1Y1 + P2Y2 + P3Y3 + ... + PnYn) / (P1 + P2 + P3 + ... + Pn)
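The mean-centre arithmetic above, and its population-weighted variant, can be sketched in a few lines of Python. The coordinates are those of Table 7.1; the weights passed to the weighted helper stand in for the P values of the formula.

```python
# Mean centre of the eight towns in Table 7.1 (map co-ordinates in cm),
# plus a population-weighted variant matching the formula above.

def mean_centre(points):
    """Unweighted spatial mean centre: mean of the x's and mean of the y's."""
    n = len(points)
    return (sum(x for x, _ in points) / n,
            sum(y for _, y in points) / n)

def weighted_mean_centre(points, weights):
    """Population-weighted mean centre: sum(P*x)/sum(P), sum(P*y)/sum(P)."""
    total = sum(weights)
    return (sum(w * x for (x, _), w in zip(points, weights)) / total,
            sum(w * y for (_, y), w in zip(points, weights)) / total)

towns = [(1.0, 6.7), (6.5, 6.6), (11.6, 6.7), (1.9, 4.4),
         (10.9, 4.4), (7.0, 3.1), (4.1, 0.9), (12.0, 1.0)]
x_bar, y_bar = mean_centre(towns)
print(round(x_bar, 3), round(y_bar, 3))  # 6.875 4.225
```

The weighted version simply pulls the centre toward the heavier towns, which is why a population mean centre shifts over time as town populations change.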


7.2.2. Spatial Median Centre


The definition of the spatial median centre in this course is the intersection of two orthogonal axes, each of which has an equal number of points on either side. The spatial median centre, following this definition, is analogous to the median of a set of data. The median centre is located in such a way that it has as many points to the south as to the north of it, and as many to the west as to the east, provided that all the points have equal weights. This is illustrated in Figure 7.2.

Figure 7.2: Spatial median centre of towns/points

[Towns T1 to T8 are plotted, with the spatial median centre (SMC) at the intersection of the two dividing axes.]

The advantage of the median centre is that its location can be found very quickly, without resorting to any mathematics other than counting points. The disadvantage is that its location depends on the orientation of the two lines used to divide up the point distribution. As a result, the location of the median centre cannot be uniquely determined, and its use should be restricted to preliminary geographical investigations, where speed may be more important than accuracy.


Example
Locate the spatial median centre of population for the towns (T1, T2, ..., T10) by using the population data (P1, P2, ..., P10) given below.

[A sketch map shows the ten towns T1 to T10.]

P1 = 5000   P2 = 5200   P3 = 2800   P4 = 7800   P5 = 3200
P6 = 1900   P7 = 4300   P8 = 6000   P9 = 3450   P10 = 4280

Solution
First calculate half of the total population:

(p1 + p2 + ... + p10) / 2
= (5000 + 5200 + 2800 + 7800 + 3200 + 1900 + 4300 + 6000 + 3450 + 4280) / 2
= 43,930 / 2 = 21,965

The spatial median centre of population is then located at the intersection of two orthogonal axes, each placed so that about 21,965 people live on either side of it.
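The axis-placement step can be automated as a sketch: sort the towns along one axis, accumulate population, and stop where the running total first reaches half. The x-coordinates below are hypothetical, since the example supplies populations only.

```python
# Locating one median axis for a weighted point set: the axis sits at the
# coordinate where the cumulative population first reaches half the total.

def median_axis(positions, weights):
    """positions: each town's coordinate along one axis; weights: populations."""
    half = sum(weights) / 2
    cumulative = 0
    for pos, w in sorted(zip(positions, weights)):
        cumulative += w
        if cumulative >= half:
            return pos

populations = [5000, 5200, 2800, 7800, 3200, 1900, 4300, 6000, 3450, 4280]
print(sum(populations) / 2)  # 21965.0 people on each side of each axis

# Hypothetical x-coordinates for T1..T10 (the example map gives none).
hypothetical_x = [1.0, 2.5, 3.0, 4.2, 5.1, 6.0, 6.8, 7.5, 8.3, 9.0]
print(median_axis(hypothetical_x, populations))
```

Running the same helper on the y-coordinates gives the second axis; the two axes cross at the median centre of population.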

Exercise
Let us assume that the following figure indicates the distribution of hypothetical towns (T1, T2, ...) in a hypothetical region R. Then answer the following questions based on the data given about the assumed map.

Figure 7.3: Identification of Population Spatial Mean Center

[Eleven towns, T1 to T11, are plotted on the map; map scale 1:100,000.]

A. Find the spatial mean centre and median centre of the towns.

Table 7.2: The towns' population data (hypothetical)

Town    Population in 1995    Population in 2005
T1      200                   700
T2      250                   650
T3      400                   600
T4      120                   220
T5      80                    180
T6      456                   756
T7      250                   550
T8      550                   850
T9      350                   850
T10     680                   780
T11     450                   550

B. Calculate the spatial mean and median centres of population for the two periods, given the population of each town in Table 7.2. Comment on the direction of shift of the spatial population mean centre between the two periods.

7.2.3. Centre of Minimum Travel


The centre of minimum travel, referred to in many texts as the median centre, is the location from which the sum of the distances to all the points in a distribution is a minimum. It is based on the principle that the sum of the deviations around the median (ignoring the signs) is a minimum. The position of this centre could clearly be found manually by a process of trial and error: choosing a number of alternative trial locations and calculating the sum of the distances from each trial centre to all the points. The true centre of minimum travel could eventually be found. In most cases the mean centre, median centre and centre of minimum travel are likely to be fairly close to each other. Either of the first two could therefore be used as a starting point in the search for the centre of minimum travel. Although this iterative procedure (one which involves a long series of repeated steps) would be extremely time-consuming to do manually, it can easily be performed by computer software such as ArcGIS.
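One standard iterative scheme for this search, not named in the text but formalizing the trial-and-error idea, is Weiszfeld's algorithm: start from the mean centre and repeatedly move the trial point toward a distance-weighted average of the towns. The coordinates reused below are those of Table 7.1, purely as an illustration.

```python
import math

def centre_of_minimum_travel(points, iterations=200):
    """Weiszfeld's iteration: refine a trial centre so that the summed
    straight-line distance to all points shrinks at each step."""
    # Start from the mean centre, as the text suggests.
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iterations):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d == 0:          # trial centre coincides with a point
                continue
            num_x += px / d
            num_y += py / d
            denom += 1.0 / d
        x, y = num_x / denom, num_y / denom
    return x, y

def total_distance(centre, points):
    return sum(math.hypot(px - centre[0], py - centre[1]) for px, py in points)

towns = [(1.0, 6.7), (6.5, 6.6), (11.6, 6.7), (1.9, 4.4),
         (10.9, 4.4), (7.0, 3.1), (4.1, 0.9), (12.0, 1.0)]
best = centre_of_minimum_travel(towns)
# The iterated centre is never worse than the mean-centre starting point.
print(total_distance(best, towns) <= total_distance((6.875, 4.225), towns))  # True
```

Each step of the loop is exactly one "trial location" of the manual procedure; the update rule guarantees the total travel distance does not increase.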

Exercise
Let us assume that the following figure indicates the distribution of towns (T1, T2, ..., T8) in a certain region R. Which one of the towns can serve as the centre of minimum travel for the whole eight towns indicated in the map?

[A sketch map at a scale of 1:50,000 shows the eight towns T1 to T8.]


7.3. Spatial Dispersion


Just as the measures of dispersion discussed in earlier sections of this course describe the spread of values around some form of average, measures of spatial dispersion give information about the areal spread of points or places around a centre. In this section, techniques for measuring the spread of points around the mean centre will be discussed.

7.3.1. Standard Distance


Standard distance, otherwise known as standard distance deviation or root mean square distance deviation, is the spatial equivalent of the standard deviation: it measures the spread of points around the mean centre. The simplest equation for standard distance is:

Standard distance = √( Σd² / n )

where d is the distance of each point from the mean centre and n is the number of points.

Exercise
Calculate the standard distance for the point pattern shown below and comment on the result you have found if the points represent the distribution of towns.

[A sketch map at a scale of 1:100,000 shows seven towns, T1 to T7.]

Having located the mean centre, it is possible to measure all distances directly from the map, square them, add up all the squares, divide by the number of points and then take the square root. Though

there are various methods, this will be the simplest and the quickest way of calculating the standard distance for many map distributions.
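Those steps (measure distances from the mean centre, square, sum, divide by n, take the root) translate directly into a short routine. Since the exercise map carries no numbers, the Table 7.1 coordinates are reused here purely as an illustration.

```python
import math

def standard_distance(points):
    """sqrt(sum of squared distances from the mean centre / n)."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    ssd = sum((x - x_bar) ** 2 + (y - y_bar) ** 2 for x, y in points)
    return math.sqrt(ssd / n)

# Reusing the Table 7.1 co-ordinates as an illustration (map units in cm).
towns = [(1.0, 6.7), (6.5, 6.6), (11.6, 6.7), (1.9, 4.4),
         (10.9, 4.4), (7.0, 3.1), (4.1, 0.9), (12.0, 1.0)]
print(round(standard_distance(towns), 2))  # 4.64
```

A larger value means the towns are more widely scattered around their mean centre; on a 1:50,000 map, 4.64 cm corresponds to about 2.3 km on the ground.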

7.4. Analysis of Shape


The measurement of shape is an obvious area in which a statistician can help a geographer. However, shape is extremely difficult to quantify; only certain specific characteristics of shape can be quantified. It is possible, for example, to say how circular the shape of an areal unit is. The circle is taken as the standard because, of all shapes with a given perimeter, the circle encloses the maximum area. The most commonly measured characteristic of shape is compactness. This is effectively a measure of how far a shape deviates from the most compact possible shape, a circle. A circle is the most compact shape in the sense that it has the smallest possible perimeter relative to the area contained within it. A measure of the compactness of the shapes of geographical areas such as countries, drainage basins, regions, administrative units and urban areas is known as an index of shape or index of compactness. The commonest measures of compactness are discussed below. The following symbols are used throughout the formulas:

A = actual area
A1 = area of the smallest circle circumscribing the actual area
L = longest (major) axis
L1 = minor axis
P = perimeter

1. Elongation ratio

Er = L1 / L, where the value ranges between 0 and 1.

As Er approaches 1, the shape becomes more and more compact or circular: Er = 1 for a compact shape, and Er = 0 for a fully elongated shape such as a straight line.

2. Form ratio

Fr = A / L², where the value ranges between 0 and π/4.

Example
Find the form ratio for a square with an area of one square kilometre.

Area A = 1 km². The longest axis L is the diagonal of the square, so L = √2 km and L² = 2 km².

Fr = 1 / 2 = 0.5

3. Circularity ratio

Cr = 4A / P², where the value ranges between 0 (most elongated) and 1/π ≈ 0.3183 (a circle).

Example
Calculate the Cr value for the square given in the example above.

Cr = (4 × 1 km²) / P² = 4 / 16 = 0.25

4. Compactness ratios

a. Richardson compactness ratio = 2√(πA) / P; the value ranges between 0 and 1.

b. Gibbs compactness ratio = 1.273A / L²; the value ranges between 0 and 1.

c. Cole compactness ratio = A / A1; the value ranges between 0 and 1.


Example
Calculate Cole's compactness ratio for a square-shaped area with each side measuring 1 kilometre.

Area of the square: A = 1 km × 1 km = 1 km²

The smallest circumscribing circle has the square's diagonal (√2 km) as its diameter, so its radius is √2/2 km and its area is A1 = πr² = π/2 ≈ 1.5708 km².

Cole compactness ratio = A / A1 = 1 / 1.5708 ≈ 0.6366

5. Circularity Index (CI)

CI = AA / ACSP

where AA = actual area and ACSP = area of a circle with the same perimeter.

The value of CI ranges between 0 and 1, indicating the most elongated and the perfectly compact shape, respectively.
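Gathering the section's indices into one script makes the worked square easy to check (side 1 km, so A = 1 km², P = 4 km, and the longest axis L is the √2 km diagonal).

```python
import math

# Shape indices for the worked square: A = 1 km², P = 4 km, L = sqrt(2) km.
A, P = 1.0, 4.0
L = math.sqrt(2.0)                       # the diagonal is the longest axis

form_ratio = A / L ** 2                  # Fr = A / L²
circularity = 4 * A / P ** 2             # Cr = 4A / P²
richardson = 2 * math.sqrt(math.pi * A) / P
gibbs = 1.273 * A / L ** 2

# Smallest circumscribing circle: the diagonal is its diameter.
A1 = math.pi * (L / 2) ** 2
cole = A / A1

# A circle with the same perimeter P has area P² / (4π).
ci = A / (P ** 2 / (4 * math.pi))

print(round(form_ratio, 2), round(circularity, 2),
      round(cole, 4), round(ci, 4))      # 0.5 0.25 0.6366 0.7854
```

Every index equals 1 (or its stated maximum) for a circle and falls toward 0 as a shape becomes more elongated, which is what makes them comparable across countries or drainage basins.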

Exercise
Calculate the circularity index (CI) and compactness ratios for each of the countries indicated below. Explain the results you have found.

Table 7.3:

Country     Area (km²)     Boundary (km)
Ethiopia    1,106,000      5,290
Djibouti    22,000         820
Eritrea     117,400        2,420
Kenya       582,644        3,600
Somalia     637,657        5,100
Sudan       2,502,813      7,192


Project Work
You know that Ethiopia is currently divided into 9 regional states and 2 city administrations. Visit any nearby governmental or non-governmental offices, libraries or websites, find both the area and the boundary length of all the regions/city administrations, and analyse their shapes based on what you have learnt in this unit. You are expected to show all the necessary steps and formulas you have used for the analysis.


Unit Eight Correlation and Regression


Unit objectives
Having studied this chapter, you should be able to:
- Explain the basic concepts of correlation and regression analyses
- Apply the techniques of regression analysis in related geographic data analysis
- Appreciate the use of regression analysis in quantitative geographic data analysis
- Understand the basic theoretical concepts of regression analysis in SPSS
- Apply SPSS software in the analysis of correlation and regression

8.1. Introduction
In this chapter we are going to analyze the degree and nature of relationships between variables, using the techniques known as correlation and regression analysis. For both regression and correlation studies, the number of variables may be two (bivariate analysis) or more (multivariate analysis). In this course we shall discuss in some detail both bivariate and multivariate relationships. We shall treat correlation first and then regression.

8.2. Correlation Analysis


In statistics, correlation indicates the strength and direction of the mutual interdependence of two or more variables. The relationship can be either linear or non-linear. The relationship between two variables is termed as bivariate correlation while that between more than two variables is known as multivariate correlation. The relative strengths of relationships are identified by a measure referred to as Correlation Coefficient or Coefficient of Correlation. Thus, one can have a) Bivariate linear coefficients of correlation b) Bivariate nonlinear coefficients of correlation c) Multivariate linear coefficients of correlation d) Multivariate nonlinear coefficients of correlation


The relationship between variables, as measured by the coefficient of correlation, can be direct or inverse. Its value ranges between +1 and -1, indicating positive and negative correlation respectively. The analysis of the coefficient of correlation has been attempted by different scholars. The most widely known measure is Karl Pearson's product-moment correlation coefficient, or simply Pearson's coefficient of correlation, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. For bivariate linear correlation, Pearson's coefficient of correlation runs as follows:
r = Σ(Xi - X̄)(Yi - Ȳ) / (N·σX·σY)

where r is the coefficient of correlation and X and Y are the interdependent variables. For computation, the formula can be rewritten in the equivalent form:

r = [N·ΣXiYi - (ΣXi)(ΣYi)] / √{[N·ΣXi² - (ΣXi)²]·[N·ΣYi² - (ΣYi)²]}

where:

N = number of pairs of scores
ΣXY = sum of the products of paired scores
ΣX = sum of X scores
ΣY = sum of Y scores
ΣX² = sum of squared X scores
ΣY² = sum of squared Y scores

Assumption of Correlation Coefficient


There are three assumptions made in deriving the correlation coefficient using the above formula:
1. The random variables X and Y are normally distributed.
2. The variables X and Y are related or interdependent.
3. There is a cause-and-effect relationship between the X and Y variables.

Example
Let's assume that we want to look at the relationship between two variables: food grain available per head (in quintals) and family size. Perhaps we have a hypothesis that family size affects the food grain available per head in a family. Say we collect information on 15 households and record it as indicated in Table 8.1 below.

Table 8.1:

Food Grain Available (Quintals per Head)    Family Size
12                                          3
8                                           5
8                                           4
9                                           3
3                                           6
5                                           5
21                                          2
2                                           9
2                                           10
7                                           4
21                                          3
25                                          2
18                                          2
2                                           9
3                                           7

You should immediately see in the bivariate plot (Figure 8.1) that the relationship between the variables is a negative or inverse one, because if you were to fit a single straight line through the dots it would have a negative slope, moving down from left to right. Since the correlation is nothing more than a quantitative estimate of the relationship, we would expect a negative correlation. What does a negative relationship mean in this context? It means that as one variable increases, the other decreases in value. You should confirm visually that this is generally true in the plot.

Figure 8.1: Scatter plot of the data in Table 8.1

[The plot shows food grain per head on the vertical axis against family size on the horizontal axis, with two regression lines, Y = -2.3917X + 21.5325 and X = -0.2841Y + 7.6986, crossing at their intersection point.]

From Figure 8.1 above we can see that there are two lines, indicating that the two variables are mutually regressed against each other. The angle between the two lines will be zero when there is a perfect relationship between the variables, i.e. when the coefficient of correlation is +1 or -1. The greater the angle, the smaller the value of the correlation coefficient. Another check is that, if the calculations and the drawings have gone smoothly, the co-ordinates of the intersection of the two lines will be the means of the two variables.

The two regression lines give two regression coefficients. For the regression of Y on X, the regression coefficient is

bYX = r(σY/σX)

and for the regression of X on Y, the regression coefficient is

bXY = r(σX/σY)

The multiplication of the two regression coefficients, bYX × bXY = [r(σY/σX)][r(σX/σY)], gives the coefficient of determination, r².

Regression equations using means, standard deviations and the correlation coefficient

Case 1: Y regressed on X

(Y - Ȳ) = r(σY/σX)(X - X̄)
Y - 9.733 = -0.8243 × (7.602/2.62) × (X - 4.933)
Y - 9.733 = -2.3917(X - 4.933)
Y = -2.3917X + 11.7992 + 9.733
Y = -2.3917X + 21.5325        ... Equation No. 1

Case 2: X regressed on Y

(X - X̄) = r(σX/σY)(Y - Ȳ)
X - 4.933 = -0.8243 × (2.62/7.602) × (Y - 9.733)
X - 4.933 = -0.2841(Y - 9.733)
X = -0.2841Y + 2.7652 + 4.933
X = -0.2841Y + 7.6986        ... Equation No. 2

By using the previously explained formula, it is now easy to compute the coefficient of correlation for the variables in Table 8.1:

rXY = [N·ΣXiYi - (ΣXi)(ΣYi)] / √{[N·ΣXi² - (ΣXi)²]·[N·ΣYi² - (ΣYi)²]}

The symbol r stands for the coefficient of correlation. It always lies between -1.0 and +1.0. If the coefficient of correlation is negative, we have an inverse relationship; if it is positive, the relationship is direct. You don't need to know how this formula was derived unless you want to be a statistician, but you probably will need to know how the formula relates to real data and how you can use it to compute the correlation coefficient.

Table 8.2:

Grain available (Y)    Family Size (X)    XY     X²     Y²
12                     3                  36     9      144
8                      5                  40     25     64
8                      4                  32     16     64
9                      3                  27     9      81
3                      6                  18     36     9
5                      5                  25     25     25
21                     2                  42     4      441
2                      9                  18     81     4
2                      10                 20     100    4
7                      4                  28     16     49
21                     3                  63     9      441
25                     2                  50     4      625
18                     2                  36     4      324
2                      9                  18     81     4
3                      7                  21     49     9
Total: 146             74                 474    468    2288

Let's look at the data we need for the formula. Table 8.2 repeats the original data with the other necessary columns. The first two columns are the same as in Table 8.1 above. The next three columns are simple computations based on the food grain and family size data. The bottom row consists of the sum of each column, which is all the information we need to compute the coefficient of correlation. Here are the values from the bottom row of the table (where N is 15) as they relate to the symbols in the formula:

N = 15
ΣX = 74
ΣX² = 468
ΣXY = 474
ΣY = 146
ΣY² = 2288

Now, when we plug these values into the given formula, we get:

r = (15 × 474 - 74 × 146) / √[(15 × 468 - 5476) × (15 × 2288 - 21316)]
r = -3694 / √(1544 × 13004)
r ≈ -0.824
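The same computation in Python confirms the hand result (and agrees with the SPSS output shown later in this section).

```python
import math

# Pearson's r for the 15 households of Table 8.2, via the computational formula
# r = (N·ΣXY - ΣX·ΣY) / sqrt((N·ΣX² - (ΣX)²)(N·ΣY² - (ΣY)²)).
grain  = [12, 8, 8, 9, 3, 5, 21, 2, 2, 7, 21, 25, 18, 2, 3]   # Y
family = [3, 5, 4, 3, 6, 5, 2, 9, 10, 4, 3, 2, 2, 9, 7]       # X

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

print(round(pearson_r(family, grain), 3))  # -0.824
```

The function only needs the five column sums from Table 8.2, so it works unchanged on any pair of equal-length numeric lists.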

Here we can determine the probable error of the correlation coefficient, and hence a confidence interval, as:

Pe = 0.6745 × (1 - r²) / √n

where Pe is the probable error, r the correlation coefficient and n the number of pairs of observations.

This probable error sets a range, from r - Pe to r + Pe, for the coefficients of correlation of other sets of samples selected randomly from the same population. Other properties associated with Pe are that if the correlation coefficient is less than Pe, it is not significant at all, and if r is greater than 6Pe, it is definitely significant.
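Applied to the Table 8.2 result (r = -0.824, n = 15), the probable error works out as follows; since |r| is far above 6·Pe, the rule of thumb calls the correlation definitely significant.

```python
import math

def probable_error(r, n):
    """Probable error of a correlation coefficient: 0.6745(1 - r²)/sqrt(n)."""
    return 0.6745 * (1 - r * r) / math.sqrt(n)

pe = probable_error(-0.824, 15)
print(round(pe, 4))              # 0.0559
print(abs(-0.824) > 6 * pe)      # True: definitely significant by the rule
```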

Example
Calculate the coefficients of correlation and determination for the following data and comment on the results you have found.

Table 8.3:

Crop yield/ha (in quintals):  18   8   28   20   14   22   24   16   6   12
Fertilizer/ha (in kg):        50   35  15   45   100  0    38   27   43  55

Then, you have to find the summations as follows:

Table 8.4:

S/N     X       Y      X²       Y²      XY
1       50      18     2500     324     900
2       35      8      1225     64      280
3       15      28     225      784     420
4       45      20     2025     400     900
5       100     14     10000    196     1400
6       0       22     0        484     0
7       38      24     1444     576     912
8       27      16     729      256     432
9       43      6      1849     36      258
10      55      12     3025     144     660
Total   408     168    23022    3264    6162

You can now substitute the values above into the formula below:

r = [N·ΣXY - (ΣX)(ΣY)] / √{[N·ΣX² - (ΣX)²]·[N·ΣY² - (ΣY)²]}

r = (10 × 6162 - 408 × 168) / √{[10 × 23022 - (408)²] × [10 × 3264 - (168)²]}

r ≈ -0.4127

r² ≈ 0.1703, or 17.03%

So the correlation for our ten cases is r ≈ -0.41, a moderate negative relationship between crop yield per hectare and the use of fertilizer per hectare. The coefficient of determination (r² ≈ 0.1703) indicates that about 17.03% of the variation in the dependent variable (Y) is explained by the investigated independent variable (X).

Testing the Significance of a Correlation


Once you have computed a correlation coefficient, you can determine the probability that the observed correlation occurred by chance; that is, you can conduct a significance test. Most often you are interested in determining the probability that the correlation is a real one and not a chance occurrence. In this case, you are testing the mutually exclusive hypotheses:

Null hypothesis:          r = 0
Alternative hypothesis:   r ≠ 0

The easiest way to test this hypothesis is to find a statistics book that has a table of critical values of r; most introductory statistics texts have one. As in all hypothesis testing, you need first to determine the significance level. Here we can use the common significance level of α = 0.05. This means that we are conducting a test where the odds that the correlation is a chance occurrence are no more than 5 out of 100. Before we look up the critical value in a table, we also have to compute the degrees of freedom (df). The df is simply N - 2 or, in this example of Table 8.2, 15 - 2 = 13. Finally, we have to decide whether we are doing a one-tailed or a two-tailed test. In this example, since we have no strong prior theory to suggest whether the relationship between food grain available and family size would be positive or negative, we opt for the two-tailed test.

With these three pieces of information (the significance level α = 0.05, the degrees of freedom df = 13, and the type of test, two-tailed) we can now test the significance of the coefficient of correlation we have found. Looking up the critical value of r for df = 13 in a table at the back of a statistics book, we find it to be about 0.514. Since the absolute value of our calculated correlation, |r| = 0.824, is well above this critical value, we conclude that the odds are less than 5 out of 100 that this is a chance occurrence. It is not a chance finding: the correlation is statistically significant. We reject the null hypothesis and accept the alternative.

We can also compute the correlation coefficient and statistically confirm (test) the relationship between the variables by using SPSS software, as indicated in the table below.

Bivariate Correlation SPSS Output

                                      Food Grain per Head
Family Size    Pearson Correlation    -0.824(**)
               Sig. (2-tailed)        .000

** Correlation is significant at the 0.05 level (2-tailed).
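An equivalent route, common in statistics texts though not used above, converts r to a t statistic with df = N - 2; here |t| far exceeds the two-tailed critical t of 2.160 for df = 13 at α = 0.05, confirming significance.

```python
import math

def t_from_r(r, n):
    """t statistic for testing r = 0: t = r * sqrt((n - 2) / (1 - r²))."""
    return r * math.sqrt((n - 2) / (1 - r * r))

t = t_from_r(-0.824, 15)
print(round(t, 2))               # -5.24
print(abs(t) > 2.160)            # True: significant at the .05 level, df = 13
```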

Quantitative Methods in Social Sciences

The Correlation Matrix


All we have discussed so far is how to compute a correlation between two variables. In most studies we have considerably more than two variables. Let's say we have a study with 9 interval-level variables and we want to estimate the relationships among all of them (i.e., between all possible pairs of variables). In this instance, we have 36 unique correlations to estimate. How do we know that there are 36 unique correlations when we have 9 variables? There's a simple formula that tells how many pairs:

N × (N - 1) / 2

With N = 9 this gives 9 × 8 / 2 = 36.

We could do the above computations 36 times to obtain the correlations, or we could use SPSS or any statistics program to compute all 36 automatically with a simple click of the mouse. Here SPSS was used to calculate the correlations among the nine variables and create the correlation matrix. Here's the result.

Table 8.5: Correlation Matrix (Pearson correlations)

Variables           Grain/Head  Family Size  Farm Land  Oxen    Livestock  Sex H/Head  Off-farm  Fert./ha
Family Size         -0.824
Farm Land Size       0.912      -0.650
Number of Oxen       0.911      -0.705       0.819
No. of Livestock     0.917      -0.589       0.855      0.882
Sex of H/Head        0.626      -0.666       0.562      0.564   0.512
Off-farm Income      0.916      -0.755       0.771      0.900   0.849      0.716
Fertilizer per ha    0.915      -0.743       0.742      0.870   0.901      0.669       0.936
Dung per ha          0.867      -0.834       0.638      0.818   0.790      0.678       0.860     0.930

This type of table is called a correlation matrix. It lists the variable names down the first column and across the first row, and shows only the lower triangle of the correlation matrix. In every correlation matrix there are two triangles: the values below and to the left of the diagonal (the lower triangle) and those above and to the right of the diagonal (the upper triangle). There is no reason to print both triangles, because the two triangles of a correlation matrix are always mirror images of each other (the
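Producing such a matrix by hand really is 36 applications of the same formula, which is why a program is used. A plain-Python sketch over three hypothetical mini-series shows the symmetry and the unit diagonal described above.

```python
import math

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

# Three hypothetical mini-series standing in for survey variables.
data = {
    "grain":  [12, 8, 8, 9, 3, 5, 21, 2],
    "family": [3, 5, 4, 3, 6, 5, 2, 9],
    "oxen":   [2, 1, 1, 2, 0, 1, 3, 0],
}
names = list(data)
matrix = {(a, b): pearson_r(data[a], data[b]) for a in names for b in names}

print(len(names) * (len(names) - 1) // 2)                           # 3 unique pairs
print(round(matrix[("grain", "family")], 3))                        # -0.825
print(matrix[("grain", "family")] == matrix[("family", "grain")])   # True
```

Because r(x, y) = r(y, x), only the lower triangle ever needs printing, exactly as in Table 8.5.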

Other Correlations
The specific type of correlation we have seen above is known as the Pearson product-moment correlation coefficient. It is appropriate when both variables are measured at an interval level. However, there is a wide variety of other types of correlation for other circumstances. For instance, if you have two ordinal variables you could use the Spearman rank-order correlation or the Kendall rank-order correlation. When one measure is a continuous interval-level variable and the other is dichotomous (i.e., two-category), you can use the point-biserial correlation.

Bivariate Coefficient of Determination


The coefficient of determination, given as r², explains to what extent the variation of the dependent variable Y is explained by the independent variable X.

r² = explained variation / total variation

where:

Explained variation = aΣY + bΣXY - NȲ²
Unexplained variation = Σ(Y - Yc)²
Total variation = Σ(Y - Ȳ)²

and total variation = unexplained variation + explained variation. Hence:

r² = [aΣY + bΣXY - NȲ²] / [ΣY² - NȲ²]        (bivariate)

r² = [aΣY + bΣX1Y + cΣX2Y + dΣX3Y + ... - NȲ²] / [ΣY² - NȲ²]        (multivariate)


Exercise
Calculate the coefficients of correlation and determination for the following data and comment on the results you have found. Use SPSS software to do so.

Crop yield/ha (in quintals):  18   8   28   20   14   22   24   16   6   12
Farm oxen/ha:                 2    1   4    0    3    1    2    1    4   2

8.3. Regression Analysis


In statistics, regression analysis is a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (Y) and of one or more independent variables (X), also known as explanatory variables or predictors. The dependent variable in the regression equation is modeled as a function of the independent variables, the corresponding parameters (constants), and an error term. The error term is treated as a random variable and represents the unexplained variation in the dependent variable. The parameters are estimated so as to give a best fit to the data. Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used. Regression can be used for prediction (including forecasting of time-series data), inference, hypothesis testing and modeling of causal relationships. These uses of regression rely heavily on the underlying assumptions being satisfied. Regression analysis has been criticized as being misused for these purposes in many cases where the appropriate assumptions cannot be verified to hold. One factor contributing to the misuse of regression is that it can take considerably more skill to critique a model than to fit one.

A linear relation between two variables is represented by a straight line known as the regression line. In the study of the linear relationship between two variables Y and X, if the variable Y depends on X, we call the fitted line the regression line of Y on X. To find the regression line, the observations (Xi, Yi) on the variables X and Y are necessarily taken in pairs; the units may be people, holdings, crop plots, animals or any other units of observation. Generally, studies are based on samples of size n, and hence n pairs of sample observations can be written as (X1, Y1), (X2, Y2), (X3, Y3), ..., (Xn, Yn).


Regression models are the mathematical/algebraic expressions, while regression lines are their graphical representations: in a two-dimensional space in the case of a bivariate distribution, and in multidimensional spaces in the case of multivariate analysis. The number of curves/lines depends on the number of variables. For bivariate distributions, for instance, there can be two regression lines.

Least Square Method of Fitting a Regression Line


The general principle of the least squares method of fitting a regression model to a data set is that

Σ(Y - Yc)² should be a minimum

where Yc is the estimated value, Y is the actual or observed value, and (Y - Yc) is the deviation of the estimated value from the actual value (the residual). Regression analysis chooses among all possible models by selecting the one for which the sum of the squares of the residuals is a minimum. The intercept of the fitted line provides the estimate of α, and its slope provides the estimate of β.

The equation for a regression line of Y on X for the population is one of the following:

Regression Equation         Remark
Y = α + βX + e              First degree or straight-line equation (linear model)
Y = a + bX + cX²            Nonlinear second degree, curvilinear regression equation (parabolic model)
Y = a + bX + cX² + dX³      Nonlinear third degree regression model
Y = abˣ                     a) Growth model (if b > 1)  b) Decay model (if b < 1)
It is hardly obvious why we should choose our line using the minimum sum of squared errors criterion. One virtue of the sum of squared errors criterion is that it is very easy to employ computationally. When one expresses the sum of squared errors/deviations mathematically and applies calculus techniques to ascertain the values of α and β that minimize it, one obtains expressions for α and β that are easy to evaluate.


Note that in the equation Y = α + βX (the first degree or straight-line equation), α and β are constants: α is the intercept, where the line cuts the Y axis (i.e. the value of Y when X = 0), and β is the tangent of the angle the line subtends with the x-axis. β is also called the regression coefficient and is defined as the measure of change in the dependent variable (Y) corresponding to a unit change in the independent variable (X). The symbol e is the noise term, which is usually omitted in calculations. The noise component e comprises factors that are unobservable or at least unobserved, while α and β are constants.

Graphical straight line fitting


The first degree (straight line) estimating equation is Y = α + βX.

Figure 8.2:

[A sketch of the line Y = α + βX: α is the Y intercept, and β, the slope of the line (the tangent of the angle the line makes with the X axis), is the regression coefficient.]

Dear student! Do you know the importance of regression analysis? Where do you use it? Applications of regression analysis exist in almost every field, such as geography, economics, political science, sociology, psychology and education. The common aspect of the applications of regression analysis in these fields is that the dependent variable is a quantitative measure of some condition or behaviour. When the dependent variable is qualitative or categorical, other methods might be more appropriate. You will know more about the application of regression analysis after you observe the examples below.

Calculation of 1st Degree Curve

Calculate the estimated value (Yc) of the production data given below by using the first degree (straight line) prediction equation.

Table 8.6: A hypothetical farmer's crop output over years

Crop Year                  1985  1990  1995  2000  2005
Production in Quintals       10    12    15    20    28

Note: In this example production is being regressed on time.

In order to make the calculation more convenient, you can rename the production years 1, 2, 3, 4 and 5 as follows.

Table 8.7:

Crop Year (X)                 1    2    3    4    5
Production in Quintals (Y)   10   12   15   20   28

The prediction equation Y = a + bX leads to the so-called normal (or standard) equations:

ΣY  = na  + bΣX  ............(1)
ΣXY = aΣX + bΣX2 ............(2)

The raw data provide the values of ΣY, ΣX, ΣXY and ΣX2.


Table 8.8:

Crop Year (X)    Production (Y)    XY    X2
      1                10          10     1
      2                12          24     4
      3                15          45     9
      4                20          80    16
      5                28         140    25
Total 15               85         299    55

These numerical values replace the symbols in the above two equations to provide two simultaneous equations in a and b:

85  = 5a  + 15b ............(1)
299 = 15a + 55b ............(2)

Multiply equation No. 1 by 3 so as to eliminate a:

255 = 15a + 45b ............(3)

To eliminate a, subtract equation No. 3 from No. 2:

44 = 10b
Hence, b = 4.4

Substitute 4.4 for b in equation No. 1 to find a:

85 = 5a + 15b
85 = 5a + 15(4.4)
19 = 5a
a = 3.8

The estimating (prediction) equation or model is, therefore, Yc = 3.8 + 4.4X.

Once the constants, a and b, are calculated, there remain two variables in the regression equation, Y and X. We know Y depends on X in the case of the regression of Y on X. Under the presumption that the trend of change in Y corresponding to X remains the same, the value of Y

can be estimated for any value of X. It is simply a matter of substituting the values of X (1, 2, 3, 4 and 5) into the estimating (prediction) equation and calculating Yc turn by turn. For instance, when the value of X is 3, the estimated value Yc will be calculated as follows:

Yc = 3.8 + 4.4(3) = 17 quintals

The other estimated values are indicated below:

Table 8.9:

Crop Year (X)   Production (Y)     Yc    Y - Yc   (Y - Yc)2
      1               10           8.2     1.8       3.24
      2               12          12.6    -0.6       0.36
      3               15          17.0    -2.0       4.00
      4               20          21.4    -1.4       1.96
      5               28          25.8     2.2       4.84
Total 15              85          85.0     0.0      14.40

Note that ΣY is always equal to ΣYc, except in the case of exponential relations.
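The hand calculation above can be checked with a few lines of code. The following is a minimal sketch (NumPy is assumed to be available); `numpy.polyfit` solves the same least-squares normal equations:

```python
import numpy as np

# Crop data from Table 8.7 (years renamed 1..5)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 12, 15, 20, 28], dtype=float)

# Least-squares straight line: polyfit returns [slope b, intercept a]
b, a = np.polyfit(x, y, deg=1)

yc = a + b * x                       # estimated values (the Yc column of Table 8.9)
sse = float(np.sum((y - yc) ** 2))   # sum of squared errors
```

This reproduces a = 3.8, b = 4.4 and the residual sum 14.40 shown in Table 8.9.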

Curvilinear (Parabolic) Regression Equation

The relationship between the dependent variable Y and the independent variable X can be curvilinear in many cases. The shape of the curve depends on the rate of change in Y corresponding to the change in the value of X. Some of the most commonly used curves are given here along with their mathematical equations. These curves may be fitted to the data and used.

Y = a + bX + cX2: 2nd degree (parabolic) curve
Y = ab^X: exponential curve
  When b > 1: Exponential Growth Curve
  When b < 1: Exponential Decay Curve


Calculation of 2nd Degree Curve

Calculate the estimated value (Yc) of the production data in the immediate example above by using the second degree curvilinear (parabolic) prediction equation.

Table 8.10:

Crop Year (X)  Production (Y)    XY    X2    X3    X2Y     X4
      1              10          10     1     1     10      1
      2              12          24     4     8     48     16
      3              15          45     9    27    135     81
      4              20          80    16    64    320    256
      5              28         140    25   125    700    625
Total 15             85         299    55   225   1213    979

The normal equations to calculate a, b and c (the constants) for Y = a + bX + cX2 (2nd degree equation) are:

ΣY   = na   + bΣX   + cΣX2 ............(1)
ΣXY  = aΣX  + bΣX2  + cΣX3 ............(2)
ΣX2Y = aΣX2 + bΣX3  + cΣX4 ............(3)

Then, first you have to find ΣY, ΣX, ΣX2, ΣXY, ΣX3, ΣX2Y and ΣX4. Now you are expected to substitute these values into the three equations above as follows:

85   = 5a   + 15b  + 55c  ............(1)
299  = 15a  + 55b  + 225c ............(2)
1213 = 55a  + 225b + 979c ............(3)

To eliminate a, multiply equation No. 1 by 3:

255 = 15a + 45b + 165c ............(4)


Multiply again equation No. 1 by 11:

935 = 55a + 165b + 605c ............(5)

Equation No. 2 minus equation No. 4 gives you:

44 = 10b + 60c ............(6)

Equation No. 3 minus equation No. 5 gives you:

278 = 60b + 374c ............(7)

Multiply equation No. 6 by 6 to eliminate b:

264 = 60b + 360c ............(8)

Equation No. 7 minus equation No. 8 gives you:

14 = 14c
Then, c = 1

Substituting c into equation No. 6 you can get:

44 = 10b + 60c
44 = 10b + 60(1)
b = -1.6

Substituting c and b into equation No. 1 to get a:

85 = 5a + 15b + 55c
85 = 5a + 15(-1.6) + 55(1)
85 = 5a - 24 + 55
85 = 5a + 31
54 = 5a
a = 10.8

Then, a = 10.8, b = -1.6 and c = 1.

You can confirm the results you have found by substituting them into one of the equations above. For instance, if you substitute into equation No. 2 you find 299 = 299, which means the result is confirmed! The estimating or prediction equation is, therefore, Yc = 10.8 - 1.6X + 1X2.

Now, for every value of X you can estimate Yc. Look at the estimated (predicted) values below.

Table 8.11:

      X    Actual Value (Y)   Estimated Value (Yc)   Y - Yc   (Y - Yc)2
      1          10                  10.2             -0.2      0.04
      2          12                  11.6              0.4      0.16
      3          15                  15.0              0.0      0.00
      4          20                  20.4             -0.4      0.16
      5          28                  27.8              0.2      0.04
Total 15         85                  85.0              0.0      0.40

Note again that ΣY is always equal to ΣYc, except for minor differences due to rounding of fractions.
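As with the straight line, the parabolic fit can be verified in code. A sketch (NumPy assumed); `polyfit` with degree 2 returns the coefficients in descending powers of X:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 12, 15, 20, 28], dtype=float)

# Least-squares parabola: polyfit returns [c, b, a] for Yc = a + bX + cX^2
c, b, a = np.polyfit(x, y, deg=2)

yc = a + b * x + c * x ** 2          # estimated values (Table 8.11)
sse = float(np.sum((y - yc) ** 2))   # sum of squared errors
```

This reproduces a = 10.8, b = -1.6, c = 1 and the residual sum 0.40 of Table 8.11.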

Calculation of Exponential Curve

Calculate the estimated value (Yc) of the production data in Table 8.7 by using the exponential prediction equation.

Based on the linear equation, the exponential prediction equation will be:

Y = ab^X    or    LogY = Log a + X Log b

(Recall that Log_X Y = m is the same as Y = X^m.)

The normal equations will be:

ΣLogY     = N Log a  + Log b ΣX  ............(1)
Σ(X LogY) = Log a ΣX + Log b ΣX2 ............(2)

Now, you have to find ΣLogY, ΣX, Σ(X LogY) and ΣX2 as follows:

Table 8.12:

No.     X     Y    LogY      X LogY     X2
 1      1    10    1.00       1.00       1
 2      2    12    1.08       2.16       4
 3      3    15    1.18       3.54       9
 4      4    20    1.30       5.20      16
 5      5    28    1.45       7.25      25
Total  15    85    6.0035    19.1265    55

Now you are expected to substitute the values above into the exponential normal equations as follows:

6.01  = 5Log a  + 15Log b ............(1)
19.15 = 15Log a + 55Log b ............(2)

Multiply equation No. 1 by 3 to eliminate Log a:

18.03 = 15Log a + 45Log b ............(3)

Subtract equation No. 3 from No. 2 so that you can get:

1.12 = 10Log b
Log b = 0.112
Antilog of 0.112 gives b = 1.294

Now substitute Log b = 0.112 into equation No. 1 to find a:

6.01 = 5Log a + 15(0.112)
6.01 - 1.68 = 5Log a
Log a = 0.866
Antilog of 0.866 gives a = 7.345

Then the estimating or prediction equation will be:

Y = ab^X  or  LogY = Log a + X Log b, i.e.
LogYc = 0.866 + 0.112X  or  Yc = 7.34(1.29)^X

Remark: ΣLogY = ΣLogYc, but ΣY ≠ ΣYc unless arbitrarily or accidentally.

Now for every value of X you can calculate Yc. Look at the estimated (predicted) values below.

Table 8.13:

      X    Actual Value (Y)   LogYc    Estimated Value (Yc)
      1          10           0.978           9.506
      2          12           1.090          12.303
      3          15           1.202          15.922
      4          20           1.314          20.606
      5          28           1.426          26.669
Total 15         85           6.01           85.006

Note again that ΣLogY is always equal to ΣLogYc, except for minor differences due to rounding of fractions. An exponential curve will never reach zero whatever the value of X is: it never crosses the X-axis but approaches it asymptotically. This is why an exponential prediction equation is used in distance decay analysis.
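The log-transformation trick above is exactly what the following sketch does in code (NumPy assumed): fit a straight line to (X, log10 Y), then back-transform the intercept and slope to recover a and b:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 12, 15, 20, 28], dtype=float)

# Fit LogY = Log a + X Log b by ordinary least squares on log10(Y)
log_b, log_a = np.polyfit(x, np.log10(y), deg=1)

# Back-transform (antilog) to the multiplicative form Yc = a * b**X
a = 10 ** log_a
b = 10 ** log_b
yc = a * b ** x
```

Small differences from the hand result (a = 7.345, b = 1.294) come from the rounding of the logarithms in Table 8.12.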

Calculation of 3rd Degree Curve

Calculate the estimated value (Yc) of the production data in Table 8.7 by using the 3rd degree prediction equation. The normal equations in this case, based on Y = a + bX + cX2 + dX3, will be:

ΣY   = Na   + bΣX   + cΣX2  + dΣX3 ............(1)
ΣXY  = aΣX  + bΣX2  + cΣX3  + dΣX4 ............(2)
ΣX2Y = aΣX2 + bΣX3  + cΣX4  + dΣX5 ............(3)
ΣX3Y = aΣX3 + bΣX4  + cΣX5  + dΣX6 ............(4)

Additional sums needed here are: ΣX5 = 4425, ΣX6 = 20515 and ΣX3Y = 5291.

Then, we can now solve the simultaneous equations as usual, as follows:

85   = 5a   + 15b  + 55c   + 225d   ............(1)
299  = 15a  + 55b  + 225c  + 979d   ............(2)
1213 = 55a  + 225b + 979c  + 4425d  ............(3)
5291 = 225a + 979b + 4425c + 20515d ............(4)

Multiply equation No. 1 by 3:       255 = 15a + 45b + 165c + 675d     ............(5)
Multiply equation No. 1 by 11:      935 = 55a + 165b + 605c + 2475d   ............(6)
Multiply equation No. 1 by 45:     3825 = 225a + 675b + 2475c + 10125d ...........(7)
Equation No. 2 - Equation No. 5:     44 = 10b + 60c + 304d            ............(8)
Equation No. 3 - Equation No. 6:    278 = 60b + 374c + 1950d          ............(9)
Equation No. 4 - Equation No. 7:   1466 = 304b + 1950c + 10390d       ............(10)
Multiply equation No. 8 by 6:       264 = 60b + 360c + 1824d          ............(11)
Multiply equation No. 8 by 30.4: 1337.6 = 304b + 1824c + 9241.6d      ............(12)
Equation No. 9 - Equation No. 11:    14 = 14c + 126d                  ............(13)
Equation No. 10 - Equation No. 12: 128.4 = 126c + 1148.4d             ............(14)
Multiply equation No. 13 by 9:      126 = 126c + 1134d                ............(15)
Equation No. 14 - Equation No. 15:  2.4 = 14.4d; hence d = 0.166

Substituting the value of d into equation No. 13, we can get -7 = 14c; hence c = -0.50
Substituting the values of c and d into equation No. 8, we can get 23.33 = 10b; hence b = 2.33
Substituting the values of b, c and d into equation No. 1, we can get 40 = 5a; hence a = 8.00

Then, a = 8.00, b = 2.33, c = -0.50 and d = 0.166.

The 3rd degree estimating equation is: Yc = 8.00 + 2.33X - 0.5X2 + 0.166X3


Now for every value of X you can calculate Yc. Look at the estimated (predicted) values below.

Table 8.14:

      X    Actual Value (Y)   Estimated Value (Yc)   Y - Yc   (Y - Yc)2
      1          10                  9.99              0.01     0.0001
      2          12                 11.99              0.01     0.0001
      3          15                 14.99              0.01     0.0001
      4          20                 19.99              0.01     0.0001
      5          28                 27.99              0.01     0.0001
Total 15         85                 84.95              0.05     0.0005

Remark: Finally, by comparing the values of Σ(Y - Yc)2 for the linear, 2nd degree, 3rd degree and exponential curves, we can determine which fits the given data best: the smallest sum indicates the best fit.
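The four candidate models can be compared in one pass. A sketch (NumPy assumed) that fits each model and ranks them by their sum of squared errors; note that for this particular series the third differences of Y are constant, so the cubic happens to fit the five points essentially exactly:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 12, 15, 20, 28], dtype=float)

def sse(yc):
    """Sum of squared errors, the comparison criterion of the text."""
    return float(np.sum((y - yc) ** 2))

fits = {}
for deg, name in [(1, "linear"), (2, "2nd degree"), (3, "3rd degree")]:
    yc = np.polyval(np.polyfit(x, y, deg), x)
    fits[name] = sse(yc)

# Exponential model: least squares on log10(Y), then back-transform
log_b, log_a = np.polyfit(x, np.log10(y), 1)
fits["exponential"] = sse(10 ** log_a * (10 ** log_b) ** x)

best = min(fits, key=fits.get)  # model with the smallest sum of squared errors
```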
Exercise

Assume that a department has trained its salesmen and given them a test. The department measured the performance of the salesmen after a certain number of months and found the following data. Identify (a) the best fitting line, (b) the prediction equation and (c) the predicted sales of a salesman with a test score of 9.5.

Test scores (10%)      2    3    5   5.5    6    8
Sales in 000 Birr     12   23   14   30    28   34

Multiple Regression Model

The general purpose of multiple regression (the term was first used by Pearson, 1908) is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. The general computational problem to be solved in multiple regression analysis is to fit a straight line to a number of points. A line in a two-dimensional (two-variable) space is defined by the equation Y = A + BX; i.e. the Y variable can be expressed in terms of a constant (A) and a slope (B) times the X variable. The constant A is also referred to as the intercept on the Y-axis (the value of Y when X is 0), and the slope B as the regression coefficient. In the multivariate case, when there is more than one independent variable, the regression line cannot be visualized in two-dimensional space, but it can be computed just as easily: we construct a linear equation containing all those variables. In general, multiple regression procedures will estimate a linear equation of the form:

Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn

where Y is the dependent variable; X1, X2, X3, ..., Xn are the independent variables; and a, b1, b2, b3, ..., bn are constants. The estimating equation is Yc = a + b1X1 + b2X2 + b3X3 + ... + bnXn.

In this equation, the value of each regression coefficient indicates the contribution of the corresponding independent variable to the observed change in the dependent variable. The smaller the variability of the residual values around the regression line relative to the overall variability, the better our prediction. For example, if there is no relationship between the X and Y variables, the ratio of the residual variability of the Y variable to the original variance is equal to 1.0. If X and Y are perfectly related, there is no residual variance and the ratio would be 0.0. In most cases the ratio falls somewhere between these extremes, that is, between 0.0 and 1.0. One minus this ratio is referred to as R-square, or the coefficient of determination. This value is immediately interpretable in the following manner: if we have an R-square of 0.4, then the residual variability of the Y values around the regression line is 0.6 times the original variance; in other words, we have explained 40% of the original variability and are left with 60% residual variability.


Ideally, we would like to explain most if not all of the original variability. The R-square value is an indicator of how well the model fits the data; for instance, an R-square close to 1.0 indicates that we have accounted for almost all of the variability with the variables specified in the model. The degree to which two or more predictors (independent or X variables) are related to the dependent (Y) variable is expressed by the correlation coefficient R, which is the square root of R-square. In multiple regression, R can assume values between 0 and 1. To interpret the direction of the relationship between variables, one looks at the signs (plus or minus) of the B coefficients. If a B coefficient is positive, then the relationship of this variable with the dependent variable is positive; if the B coefficient is equal to 0, there is no relationship between the variables.
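The residual-variability ratio just described is easy to compute directly. A sketch (NumPy assumed) using the crop data from earlier in this unit with its straight-line fit:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 12, 15, 20, 28], dtype=float)

# Straight-line fit of the crop data used earlier in this unit
b, a = np.polyfit(x, y, 1)
yc = a + b * x

ss_res = np.sum((y - yc) ** 2)        # residual variability around the line
ss_tot = np.sum((y - y.mean()) ** 2)  # original (total) variability

# R-square: one minus the ratio of residual to original variability
r_square = 1.0 - ss_res / ss_tot
```

Here the straight line already explains about 93% of the variability in Y.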

Example

Analyze the relationship between the dependent variable (Y) and the independent variables (Xi) of the hypothetical data given below.


Table 8.15:

Crop yield/ha   Farm oxen/ha   Fertilizer (kg/ha)   Per capita farmland (ha)   Per capita irrigated land
     (Y)            (X1)              (X2)                   (X3)                       (X4)
      18             2                 50                     2                          1
       8             1                 35                     1                          0.5
      28             4                 15                     1.5                        1
      20             0                 45                     0.5                        0
      14             3                100                     3                          2
      22             1                  0                     2                          1
      24             2                 38                     4                          2
      16             1                 27                     2                          1
       6             4                 43                     1                          0.5
      12             2                 55                     3                          2
Total 168           20                408                    20                         11

To find the solution, we need the values of the following summations: ΣY, ΣX1, ΣX2, ΣX3 and ΣX4 (calculated above), together with Σ(YX1), Σ(YX2), Σ(YX3), Σ(YX4), ΣX12, ΣX22, ΣX32, ΣX42, Σ(X1X2), Σ(X1X3), Σ(X1X4), Σ(X2X3), Σ(X2X4) and Σ(X3X4), to be calculated in the table below.

The normal equations are:

ΣY     = NA   + BΣX1     + CΣX2     + DΣX3     + EΣX4     ............(1)
Σ(YX1) = AΣX1 + BΣX12    + CΣ(X1X2) + DΣ(X1X3) + EΣ(X1X4) ............(2)
Σ(YX2) = AΣX2 + BΣ(X1X2) + CΣX22    + DΣ(X2X3) + EΣ(X2X4) ............(3)
Σ(YX3) = AΣX3 + BΣ(X1X3) + CΣ(X2X3) + DΣX32    + EΣ(X3X4) ............(4)
Σ(YX4) = AΣX4 + BΣ(X1X4) + CΣ(X2X4) + DΣ(X3X4) + EΣX42    ............(5)


Table 8.16:

 YX1    YX2    YX3   YX4   X12    X22     X32    X42
  36    900     36    18     4    2500     4      1
   8    280      8     4     1    1225     1      0.25
 112    420     42    28    16     225     2.25   1
   0    900     10     0     0    2025     0.25   0
  42   1400     42    28     9   10000     9      4
  22      0     44    22     1       0     4      1
  48    912     96    48     4    1444    16      4
  16    432     32    16     1     729     4      1
  24    258      6     3    16    1849     1      0.25
  24    660     36    24     4    3025     9      4
 332   6162    352   191    56   23022    50.50  16.50

Continued ...

X1X2   X1X3   X1X4   X2X3   X2X4   X3X4
 100      4     2     100    50      2
  35      1     0.5    35    17.5    0.5
  60      6     4      22.5  15      1.5
   0      0     0      22.5   0      0
 300      9     6     300   200      6
   0      2     1       0     0      2
  76      8     4     152    76      8
  27      2     1      54    27      2
 172      4     2      43    21.5    0.5
 110      6     4     165   110      6
 880     42    24.5   894   517     28.5

168  = 10A  + 20B   + 408C   + 20D   + 11E   ............(1)
332  = 20A  + 56B   + 880C   + 42D   + 24.5E ............(2)
6162 = 408A + 880B  + 23022C + 894D  + 517E  ............(3)
352  = 20A  + 42B   + 894C   + 50.5D + 28.5E ............(4)
191  = 11A  + 24.5B + 517C   + 28.5D + 16.5E ............(5)

You can now find all the constants to arrive at an estimating equation. The values of the constants are calculated as follows:


Variables    Direct Coefficients (B)    Beta Coefficients (Standardized)
X1                  0.273                       0.052
X2                 -0.129                      -0.489
X3                  4.873                       0.751
X4                 -3.950                      -0.394
Constant           16.104

Then, the estimating equation is Yc = 16.104 + 0.273X1 - 0.129X2 + 4.873X3 - 3.950X4.

Note: The independent variables with negative Beta coefficients (X2 and X4 in the above example) affect the dependent variable (Y) negatively, and the greater the absolute value, the greater the effect. Likewise, the independent variables with positive Beta coefficients (X1 and X3 in the above example) affect the dependent variable positively. The Beta coefficient of X3, for instance, is +0.751, which means a one-unit increase in this independent variable (in standardized units) changes the dependent variable positively by 0.751 standardized units.

Let us calculate the multiple coefficients of correlation and determination for the example above. The formula for the multiple coefficient of determination is:

R2 = [AΣY + BΣ(X1Y) + CΣ(X2Y) + DΣ(X3Y) + EΣ(X4Y) - N(Ȳ)2] / [ΣY2 - N(Ȳ)2]

By using this formula, the coefficients of correlation and determination are calculated to be:

Multiple coefficient of correlation (R) = 0.566
Multiple coefficient of determination (R2) = 0.320, or 32.0%

What do these values indicate? The multiple coefficient of determination (R2) = 0.320, or 32.0%, indicates that all the independent variables together explain some 32% of the change or variance in the dependent variable. This can easily be done using SPSS software, the output table of which looks like the one indicated below.


Table 8.17: Linear Regression SPSS Output

Dependent Variable: Crop yield per hectare

                         Unstandardized Coefficients    Standardized Coefficients
                              B        Std. Error             Beta                  t       Sig.
(Constant)                 16.104        8.468                                    1.902    0.116
Farm oxen per hectare       0.273        2.303                0.052               0.119    0.910
Fertilizer per hectare     -0.129        0.113               -0.489              -1.141    0.306
Farmland                    4.873        9.787                0.751               0.498    0.640
Irrigated land             -3.950       16.183               -0.394              -0.244    0.817
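The five normal equations of this example form a linear system that any numerical library can solve directly, without the long hand elimination. A sketch (NumPy assumed), where the matrix rows are the coefficient sums from Table 8.16:

```python
import numpy as np

# Coefficient matrix and right-hand side of the five normal equations
M = np.array([
    [10,   20,    408,    20,   11  ],
    [20,   56,    880,    42,   24.5],
    [408,  880,  23022,  894,  517  ],
    [20,   42,    894,    50.5, 28.5],
    [11,   24.5,  517,    28.5, 16.5],
])
rhs = np.array([168, 332, 6162, 352, 191], dtype=float)

# Constants of the estimating equation Yc = A + B*X1 + C*X2 + D*X3 + E*X4
A, B, C, D, E = np.linalg.solve(M, rhs)
```

The solution reproduces, to rounding, the constants in the table above (A ≈ 16.104, B ≈ 0.273, C ≈ -0.129, D ≈ 4.873, E ≈ -3.950).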

Exercise: Answer the following questions based on the hypothetical data below.

Table 8.18: Hypothetical Data

Dependent variable: Y = food grain available per head (quintals/head)
Independent variables: X1 = family size; X2 = farmland size (ha per head); X3 = number of oxen per hectare; X4 = number of other livestock per hectare; X5 = sex of household head; X6 = off-farm income per head (Birr); X7 = fertilizer input per hectare (kg/ha); X8 = dung input per farmland (kg/ha)

  Y    X1    X2    X3    X4    X5    X6    X7    X8
 12     3   1.5    4     3     M    322    50    33
  8     5   1.2    3     2.5   M    323    40    28
  8     4   1.4    1     2.5   M    123    40    30
  9     3   1.5    4     2     M    212    33    26
  3     6   0.4    0.5   1     F     76     0    10
  5     5   0.4    0     0.5   F     45     4    14
 21     2   4      6     7     M    380    76    54
  2     9   0.3    1.5   3     F     23     0     6
  2    10   0.6    1     1     F     76     0     5
  7     4   1.6    0.5   1.3   M    150     4    14
 21     3   5      6     7     M    456    76    33
 25     2   7      6     8     M    444    89    40
 18     2   6      5     4     M    267    27    25
  2     9   1.8    0.5   1.8   M     80     5     7
  3     7   1.4    2     1.5   F     35     0    10

a) Calculate the coefficients of each independent variable
b) What percent of the dependent variable is explained by the identified independent variables?
c) Find the multivariate prediction equation
d) Calculate the predicted values by using the prediction equation

e) Screen out the most significant independent variables by using the stepwise regression analysis model
f) Construct a correlation matrix
g) Confirm the correlation between the dependent variable and each independent variable by using a statistical testing technique you have learnt in this unit


Unit Nine Tests of Significance


Unit objectives
Having studied this unit, you should be able to:
- Understand the basic concept of hypothesis testing
- Explain the need for tests of significance in quantitative methods
- Distinguish different tools of testing
- Select an appropriate testing technique for a specific analytical result
- Appreciate the use of hypothesis testing in quantitative research methods

9.1. Introduction
For any statistical/quantitative analysis, it is always required to establish the validity (the acceptance or rejection level) of the results, generally in relation to already established values. For this purpose, a set of techniques has been established by statisticians pertaining to various statistical parameters or measures. The whole operation is based on (1) establishing a hypothesis or assumption concerning the computed results in relation to the values with which they are to be compared, (2) a level of significance telling the fractional or percentage level of the comparison, and (3) the degrees of freedom, which vary with the technique and the number of observations.

In hypothesizing, we must consider whether or not the computed value differs significantly from the already established norm or the value with which the comparison is to be made. The hypothesis that it does not differ significantly is referred to as the Null Hypothesis, generally denoted by Ho. It is also possible to presume that the result differs significantly from the value with which it is compared; this is referred to as the Alternative Hypothesis, denoted by H1. The outcome of the test is compared with already established values appearing in the relevant tables, and the interpretation concerning acceptance or rejection is based on the instructions accompanying the tables. Note that if Ho is rejected, the alternative hypothesis is automatically accepted, i.e. the computed value differs significantly.
Mesay Mulugeta, 2009 139

Quantitative Methods in Social Sciences

If Ho is:            H1 will be:
μ = 0                μ ≠ 0, μ < 0, or μ > 0
μ1 = μ2              μ1 ≠ μ2, μ1 < μ2, or μ1 > μ2
σ1² = σ2²            σ1² ≠ σ2², σ1² < σ2², or σ1² > σ2²

Two broad classes of tests can be distinguished: parametric and non-parametric. Parametric tests use statistical parameters such as the mean, standard deviation, variance and correlation coefficients. Non-parametric methods, on the other hand, are widely used for studying populations in terms of ranked order, ignoring the actual values. The use of non-parametric methods is therefore necessary when the data have no actual numerical interpretation and carry only information such as frequencies and ranks. In other words, a parametric test is a statistical test that depends on an assumption about the distribution, unlike a non-parametric test. For example, in Analysis of Variance (ANOVA), a typical parametric test, there are three assumptions:
- Observations are independent
- The sample data have a normal distribution
- Scores in different groups have homogeneous variances

Included in the first broad category (parametric tests), among others, are the z-test and ANOVA, while among non-parametric tests are the Chi-square (X2) test, the Mann-Whitney U-test (Wilcoxon-Mann-Whitney rank-sum test) and the Kruskal-Wallis test (H-test). It is very difficult to make a comparative assessment of the two groups, because both suffer from some limitations. Parametric tests invariably assume that the distribution is normal, while non-parametric tests can be used with all types of distributions. Non-parametric tests are thus more flexible in their conditions, but they are also remarked to be less powerful than their parametric counterparts.


9.2. Level of Significance


It is the quantity of risk which we are ready to tolerate in making a decision about acceptance or rejection. In other words, it is the tolerable probability, or the probability level at which the decision-maker concludes that the observed difference between the value of the test statistic and the hypothesized parameter value cannot be due to chance. The level of significance is denoted by α (alpha) and is conventionally chosen as 0.05 or 0.01, corresponding to confidence levels of 95% and 99%, respectively. The level α = 0.01 is used for high precision and α = 0.05 for moderate precision. Other levels of significance, such as 0.1, 0.2 and 0.3, may also appear in tables.

9.3. Degree of Freedom


Degree of freedom is the number of independent observations in a set. The table values for the distributions of test statistics are provided in separate pages of statistics books; these tabulated values enable us to decide on the rejection or acceptance of a hypothesis. In a test of hypothesis, a sample is drawn from the population whose parameter is under test. The size of the sample varies, since it depends on the experimenter, on the resources available, and on the nature of the phenomenon being investigated. Moreover, the test statistic involves the estimated value of the parameter, which depends on the number of observations. Hence, the sample size plays an important role in the testing of hypotheses and is taken care of by the degrees of freedom.

The level of significance is specified before the samples are drawn, so that the results obtained do not influence the choice of the decision-maker. On the basis of observational data, a test is performed to decide whether a postulated hypothesis is accepted or not. This involves a certain amount of risk, which is called the level of significance. When the hypothesis is accepted, we consider the result non-significant; when the hypothesis is rejected, the result is called significant.

Statistical tests of hypotheses play an important role in geographical studies. For example, a researcher in economic geography may be interested to know whether there is a significant difference in landholdings among rural households in four woredas. The researcher can then collect the landholding sizes of sample households from each woreda and perform a statistical test based

on the observations. The statistical test tells the researcher whether the landholding sizes differ significantly among the woredas or not. It is also possible to find the degree of relationship (correlation) between two geographical variables, say crop yield and fertilizer input per unit area, and perform a statistical test which enables us to decide whether the correlation is statistically significant or not. Note that if the correlation between two variables X and Y is statistically not significant, the relationship between the variables is due only to chance, not to one variable affecting the other.

9.4. Procedure for Hypothesis Testing


This refers to the steps required to test the validity of the claim or assumption about the sample statistic. The results of the analysis are used to decide whether the claim is valid or not. The general procedure for any hypothesis testing is summarized below.

Step 1: State the null hypothesis (Ho) and alternative hypothesis (H1)

Theoretically, hypothesis testing requires that the null hypothesis be considered true (no difference) until it is proved false on the basis of results observed from the sample data. The null hypothesis is always expressed in the form of an equation making a claim regarding the specific value of the population parameter:

Ho: μ = μo

where μ = population mean and μo = hypothesized parameter value, or the parameter obtained from some similar study. The general way of expressing the null hypothesis is that the sample parameters do not differ significantly from the population parameters they represent.

An alternative hypothesis is the logical opposite of the null hypothesis; that is, the alternative hypothesis must be true when the null hypothesis is found to be false. It is stated as:

H1: μ ≠ μo, or H1: μ < μo, or H1: μ > μo

Step 2: State the level of significance, α (alpha), for the test

The level of significance is specified before the samples are drawn, so that the results obtained do not influence the choice of the decision-maker. It is specified in terms of the probability of wrongly rejecting the null hypothesis.

Step 3: Establish the critical or rejection region

As can be seen from the figure below, the sample space of the experiment is divided into two mutually exclusive regions: the acceptance region and the rejection (critical) region of a normally distributed test statistic.

Figure 9.1: Two-tailed test regions — the acceptance region (Ho is accepted) in the middle, with a rejection region of α/2 (Ho is rejected) in each tail beyond the critical values.

If the value of the test statistic falls into the acceptance region, the null hypothesis is accepted; otherwise it is rejected. At this stage we should bear in mind that research hypotheses can be of two types: one-tailed and two-tailed. A one-tailed hypothesis makes predictions regarding both the presence of a significant effect and the direction of the difference or association; for instance, that young participants perform better on a memory test than elderly participants. In contrast, a two-tailed hypothesis predicts only the presence of a statistically significant effect, not its direction; for instance, that there will be a difference in performance between young and old participants on a memory test.

Step 4: Calculate the suitable test statistic

The value of the test statistic is calculated from the distribution of the sample statistic by using the following formula:

Test statistic = (value of sample statistic - value of hypothesized population parameter) / (standard error of the sample statistic)

Step 5: Reach a conclusion

Compare the calculated value of the test statistic with the critical value (also called the standard table value or tabulated value). The decision rules for the null hypothesis are as follows:

|Value|Cal ≥ |Value|Table: Reject Ho

|Value|Cal < |Value|Table: Accept Ho

9.4.1. One-tailed and Two-tailed Tests

There are two types of tests, referred to as one-tailed and two-tailed tests. The type of test depends on the way the hypotheses are formulated.

a. A two-tailed test is used when the null and alternative hypotheses are stated as:

Ho: μ = μo and H1: μ ≠ μo

This implies that any deviation (on either the lower or the higher side) of the calculated value of the test statistic from the hypothesized value leads to rejection of the null hypothesis, Ho. The rejection region is kept in both tails, as indicated in Fig. 9.1. If the significance level for the test is α percent, a rejection region equal to α/2 percent is kept in each tail of the sampling distribution.

b. A one-tailed test is used when the null and alternative hypotheses are stated as:

Ho: μ ≤ μo and H1: μ > μo (right-tailed test), or
Ho: μ ≥ μo and H1: μ < μo (left-tailed test)

This implies that the value of the sample statistic is either higher or lower than the hypothesized parameter value, leading to rejection of the null hypothesis for a significant deviation from the specified value in one direction (tail) of the curve of the sampling distribution. Look at Fig. 9.2 below.

Figure 9.2: One-tailed test (right-tailed) — the acceptance region (Ho is accepted) to the left of the critical value, with a rejection region of α (Ho is rejected) in the right tail.

9.5. Errors in Hypothesis Testing


Ideally, the hypothesis testing procedure should lead to the acceptance of Ho when it is true and the rejection of Ho when it is not. However, the correct decision is not always possible. Since the decision to accept or reject a hypothesis is based on sample data, there is a possibility of an incorrect decision or error. A decision-maker may commit two types of errors while testing a null hypothesis, known as Type I error (α) and Type II error (β). A Type I error is made when a true Ho is rejected, concluding that H1 is true when it is not. A Type II error is made when a false Ho is accepted, concluding that H1 is false when it is in fact true.

9.6. Common Types of Hypothesis Testing

1. Student's t-test

It is the deviation of an estimated mean from its population mean expressed in terms of the standard error. The Student's t-test is named after William Sealy Gosset, who, under his pen name "Student", developed the method of hypothesis testing popularly known as the t-test. Gosset was employed by the Guinness Brewery in Dublin, which did not permit him to publish research findings under his own name, so he published his findings in 1908 under the pen name "Student". When testing a hypothesis with small samples (<30), we must assume that the samples come from a normally, or nearly normally, distributed population. A t-test is used, among others, for:
- Hypothesis testing for the difference b/n two populations with independent samples
- Hypothesis testing for the difference b/n two populations with dependent samples
- Hypothesis testing for an observed coefficient of correlation, including partial and rank correlations
- Hypothesis testing for an observed regression coefficient

I. Hypothesis Testing for a Single Population Mean: The test statistic for the difference b/n the sample mean and the population mean is given by:

t = (x̄ − μ) / (s/√n)

Where, x̄ = sample mean
μ = population mean
s = sample standard deviation
n = sample size


Note: This test statistic has a t distribution with n − 1 degrees of freedom. The tabulated t-value gives the critical value of t. More clearly, if |tcal| ≥ t(α) (one-tailed) or t(α/2) (two-tailed), reject Ho; otherwise accept it. Note also that the sample size should be small (<30), and at least five observations taken from a normally distributed population are desirable, to apply the t-test.

Example
The average rural household's cereal production per year is specified to be 18.5 quintals. A sample of 14 households was selected. The mean and standard deviation of the sample were calculated as 17.85 quintals and 1.955 quintals, respectively. Test the significance of the deviation.

Solution: Let us take the null hypothesis that there is no significant deviation of the households' production from the specified amount.
Ho: μ = 18.50 and H1: μ ≠ 18.50, with α = 0.05. The critical value of t at df = 13 and α/2 = 0.025 is 2.16.

Given n = 14, x̄ = 17.85, s = 1.955, df = 13:

t = (17.85 − 18.50) / (1.955/√14) = −1.24

Since |tcal| = 1.24 is less than the critical value (ttab = 2.160), the null hypothesis Ho is accepted. Hence, we conclude that there is no significant deviation of the sample mean from the population mean.

Exercise
A herbicide spray machine is set to deliver 20 kilograms of herbicide per hectare of land. Seven plots of land (each one hectare in area) are examined, and the amounts of herbicide in the plots are found to be 19, 22, 20, 18, 21, 17 and 19 kilograms. Is there reason to believe that the machine is defective?

Hint:
Table 9.1:
Variables (x): 19, 22, 20, 18, 21, 17, 19; x̄ = ?
Deviation from mean (x − x̄): ?
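Assuming the SciPy library is available, the single-mean example above can be verified in a few lines:

```python
import math
from scipy import stats

# Worked example: n = 14 households, sample mean 17.85 q,
# s = 1.955 q, hypothesized population mean 18.50 q.
n, xbar, s, mu0 = 14, 17.85, 1.955, 18.50

t_cal = (xbar - mu0) / (s / math.sqrt(n))     # one-sample t statistic
t_tab = stats.t.ppf(1 - 0.05 / 2, df=n - 1)   # two-tailed critical value, alpha = 0.05

print(round(t_cal, 2), round(t_tab, 2))       # -1.24 2.16
if abs(t_cal) < t_tab:
    print("Accept Ho: no significant deviation")
```

Because |−1.24| < 2.16, the code reaches the same conclusion as the hand computation.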

II. Hypothesis Testing for the Difference of Two Means: To compare the mean values of two normally distributed populations, we draw independent random samples of sizes n1 and n2 from the two populations. If μ1 and μ2 are the means of the two populations, our aim is to estimate the difference μ1 − μ2 b/n them. Let the sample values be denoted by (x11, x12, …, x1n1) and (x21, x22, …, x2n2). Then the expression for t is:

t = (x̄1 − x̄2) / (Sp √(1/n1 + 1/n2))

Where x̄1 and x̄2 are the means of samples I and II respectively, and Sp is the pooled standard deviation, which can be calculated by the formula below:

Sp = √[ (Σ(x1i − x̄1)² + Σ(x2i − x̄2)²) / (n1 + n2 − 2) ]

Note: In hypothesis testing for the difference of two means, the statistic t has (n1 + n2 − 2) degrees of freedom. The calculated value of the t-test statistic here represents the number of standard deviations the difference x̄1 − x̄2 is from the μ1 − μ2 specified in Ho. Thus the rule to either accept or reject the null hypothesis is as follows:

Ho: μ1 − μ2 = 0 and H1: μ1 − μ2 ≠ 0

Accept Ho if the calculated value of t is less than its critical value at the specified level of significance and (n1 + n2 − 2) degrees of freedom. Otherwise reject Ho.

Exercise
Let us assume that the following table shows the test scores (out of 20) of two groups of students in a class:

Group I: 11, 20, 18, 8, 12, 13, 12, 18, 17, 16, 14, 19, 14, 15, 16, 17, 11, 17, 7
Group II: 10, 13, 15, 16, 13, 11, 14, 19, 10

Then, examine the significance of the difference between the means of the marks secured by the students of the two groups.
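Assuming SciPy is available, such a two-sample comparison can be sketched as follows (the two score lists here are made up for illustration; working the exercise's own data is left to you). With equal_var=True, scipy.stats.ttest_ind uses the pooled standard deviation Sp described above.

```python
from scipy import stats

# Illustrative independent-samples t-test on made-up exam scores.
group1 = [11, 20, 18, 8, 12, 13, 12, 18, 17, 16]
group2 = [10, 13, 15, 16, 13, 11, 14, 19, 10, 7]

# equal_var=True -> pooled-variance (Student's) t-test, df = n1 + n2 - 2
t_cal, p_value = stats.ttest_ind(group1, group2, equal_var=True)
print(round(t_cal, 3), round(p_value, 3))
```

If p_value is below the chosen α, reject Ho: μ1 − μ2 = 0.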

2. Z-test

It is one of the commonest types of hypothesis testing. The Z-test (not the same as the Z-score, though closely related) compares sample and population means to determine whether there is a statistically significant difference. The Z-test, also known as the normal test, is used when the population variance (σ²) is known and the sample size is large (>30). Theoretically, when the sample size is large, the sample variance approaches the population variance and is deemed almost equal to it. In this way the population variance is taken as known even when we have only sample data, and hence the normal test is applicable. The distribution of Z is always normal, with mean zero (0) and variance one (1). For testing Ho: μ = μo against H1: μ ≠ μo, the test statistic is:

Z = (x̄ − μo) / (σ/√n)

Where x̄ is the sample mean and σ is the standard deviation based on the large sample size n. In the formula above, σ/√n represents the standard error, SE. The Z-test can therefore also be stated as:

Z = (x̄ − μo) / SE

Exercise
The table below gives the daily income (in Eth. Birr) of 40 randomly selected laborers in a manufacturing plant.

Table 9.2:
12 7 8 9 12 10 16 23 21 17 6 8 9 12 32 16 4 21 20 22 8 9 4 23 21 20 12 16 18 19 5 18 9 12 12 6 14 21 22 18

Test whether or not it can be concluded that the average (mean) income of a laborer in this manufacturing plant is 15 Eth. Birr.


Hint:
1. State the hypotheses as Ho: μ = 15 against H1: μ ≠ 15.
2. Since the sample size is 40 (i.e. large), you should use the normal test (Z-test). First calculate the sample mean, x̄.
3. Calculate also the sample standard deviation, s, as an estimate of σ.
4. Then, using the formula for the Z-test, compare the result against the tabulated value and decide whether to accept or reject Ho.
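The hint's steps can be carried out directly on the table's 40 values; since n > 30, the sample standard deviation is used in place of the unknown σ:

```python
import statistics

# Daily incomes of the 40 laborers from Table 9.2
incomes = [12, 7, 8, 9, 12, 10, 16, 23, 21, 17,
           6, 8, 9, 12, 32, 16, 4, 21, 20, 22,
           8, 9, 4, 23, 21, 20, 12, 16, 18, 19,
           5, 18, 9, 12, 12, 6, 14, 21, 22, 18]
mu0, n = 15, len(incomes)

xbar = statistics.mean(incomes)        # sample mean
s = statistics.stdev(incomes)          # sample standard deviation (estimate of sigma)
z = (xbar - mu0) / (s / n ** 0.5)      # z statistic

print(round(xbar, 1), round(z, 2))     # 14.3 -0.67
# |z| < 1.96, so Ho: mu = 15 is accepted at alpha = 0.05
```

The mean income of 14.3 Birr does not differ significantly from the hypothesized 15 Birr.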

3. Hypothesis Testing Based on F-distribution (F-test)


The F-test is one of the parametric tests, based on the variance-ratio statistic first developed by Sir Ronald Fisher in the 1920s. The F-distribution, also called the variance-ratio distribution, is used either for testing the hypothesis about the equality of two population variances or the equality of two or more population means. The equality of two population means has been dealt with under the t-test; here you should give more emphasis to testing hypotheses by comparing the standard deviations or variances of populations. The assumptions for the F-distribution (F-test) are:
a. The F-values are non-negative
b. The distribution is non-symmetric (positively skewed), and the samples should be drawn from normal populations
c. There are two independent degrees of freedom, one for the numerator and the other for the denominator
d. The larger variance should always be placed in the numerator

Let there be two normal populations N(μ1, σ1²) and N(μ2, σ2²). The hypothesis indicated below can be tested by the F-test:

Ho: σ1² = σ2² against H1: σ1² ≠ σ2²

When independent random samples of sizes n1 and n2 are drawn from the two normal populations, the F-ratio is calculated by the formula below:

F = S1² / S2²


As a norm, the larger variance is taken in the numerator of the formula (S1² > S2²), with n1 − 1 degrees of freedom for the numerator and n2 − 1 degrees of freedom for the denominator. Note that keeping the larger variance in the numerator makes the ratio always equal to or greater than one.

For H1: σ1² > σ2², reject Ho if Fcal > F(α)
For H1: σ1² < σ2², reject Ho if Fcal < F(1 − α)

In the reverse situation, Ho is not rejected.

Exercise
Let us assume that the following table represents the life expectancy in 7 and 9 regional states of Ethiopia in 1991 and 2007, respectively. Confirm whether or not the variation in life expectancy across the regions was the same in 1991 and in 2007.

Table 9.3: Life expectancy in years
Region: 1, 2, 3, 4, 5, 6, 7, 8, 9
1991: 43.2, 41.5, 47.2, 50.5, 41.2, 38.0, 39.1
2007: 54.5, 47.0, 56.9, 60.3, 58.2, 49.6, 54.9, 48.8, 58.5

Hint:
a. State the hypothesis as Ho: σ1² = σ2² vs H1: σ1² ≠ σ2²
b. First calculate S1² and S2²
c. The degrees of freedom are 6 and 8 for the data sets of 1991 and 2007, respectively.
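A sketch of the F-test on this exercise's data, assuming SciPy is available for the critical value:

```python
import statistics
from scipy import stats

# Life-expectancy figures from Table 9.3
le_1991 = [43.2, 41.5, 47.2, 50.5, 41.2, 38.0, 39.1]
le_2007 = [54.5, 47.0, 56.9, 60.3, 58.2, 49.6, 54.9, 48.8, 58.5]

v1 = statistics.variance(le_1991)   # sample variance, df = 6
v2 = statistics.variance(le_2007)   # sample variance, df = 8

# Larger variance in the numerator so that F >= 1
if v2 >= v1:
    f_cal, df_num, df_den = v2 / v1, len(le_2007) - 1, len(le_1991) - 1
else:
    f_cal, df_num, df_den = v1 / v2, len(le_1991) - 1, len(le_2007) - 1

f_tab = stats.f.ppf(0.95, df_num, df_den)  # upper critical value, alpha = 0.05
print(round(f_cal, 2), round(f_tab, 2))
# f_cal is well below f_tab, so Ho (equal variances) is not rejected
```

The variance ratio here is close to 1, so the spread of regional life expectancies did not change significantly between the two years.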

4. Mann-Whitney U-Test (Wilcoxon-Mann-Whitney Rank-Sum Test)


Initially proposed by Wilcoxon (1945) for equal sample sizes, and extended by Mann and Whitney (1947) to arbitrary sample sizes, it is a non-parametric test for assessing whether two samples of observations come from the same distribution or not. It is one of the best-known non-parametric significance tests.
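Assuming SciPy is available, the test is a one-liner; the income figures below are invented for illustration:

```python
from scipy import stats

# Illustrative data: daily incomes (Birr) of laborers in two villages
village_a = [12, 15, 9, 14, 20, 11, 13]
village_b = [18, 22, 17, 25, 16, 19, 24]

# Two-sided Mann-Whitney U-test: do the samples come from the same distribution?
u_stat, p_value = stats.mannwhitneyu(village_a, village_b, alternative="two-sided")
print(u_stat, round(p_value, 4))
```

Here the small p-value indicates the two villages' income distributions differ; the test uses only the ranks of the pooled observations, so it needs no normality assumption.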

5. Kruskal Wallis Rank-Sum Test (H-test)


In statistics, the Kruskal-Wallis one-way analysis of variance by ranks (named after William Kruskal and W. Allen Wallis) is a non-parametric method for testing the equality of population medians among groups. It is an extension of the Mann-Whitney U-test to 3 or more groups. The formula for the H-test is:

H = [12 / (N(N + 1))] Σ (Ri² / ni) − 3(N + 1)

Where, N = total number of observations
Ri = sum of the ranks of the ith set
ni = number of observations in the ith set

Exercise
Find the H-value for the four sets of data given below and compare it against the tabulated value.

Table 9.4:
A: 4, 6, 2, 6, 2
B: 6, 8, 6, 10, 0
C: 12, 16, 14, 8, 20
D: 10, 10, 10, 6, 4

Solution
Rank the pooled data, starting from the lowest value; tied values share the average of their ranks.

Table 9.5: Data (Rank)
A: 4(4.5), 6(8), 2(2.5), 6(8), 2(2.5); RA = 25.5
B: 6(8), 8(11.5), 6(8), 10(14.5), 0(1); RB = 43.0
C: 12(17), 16(19), 14(18), 8(11.5), 20(20); RC = 85.5
D: 10(14.5), 10(14.5), 10(14.5), 6(8), 4(4.5); RD = 56.0
Grand total of ranks = 210.0

H = [12 / (20 × 21)] × (25.5²/5 + 43²/5 + 85.5²/5 + 56²/5) − 3 × 21
  = (12/420) × (12945.5/5) − 63
  = 73.97 − 63 = 10.97


Compare H-calculated with the tabulated chi-square value and decide whether Ho is to be rejected or retained. Note that it is the chi-square (χ²) table, with k − 1 = 3 degrees of freedom, that must be used for the H-test.
H-calculated = 10.97
H-tabulated (χ², α = 0.05, df = 3) = 7.815
Since Hcal > Htab, reject Ho.
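The computation can be checked with SciPy, assuming it is available; note that scipy.stats.kruskal applies a correction for tied ranks, so its H is slightly larger than the uncorrected hand-computed value:

```python
from scipy import stats

# The four data sets of Table 9.4
a = [4, 6, 2, 6, 2]
b = [6, 8, 6, 10, 0]
c = [12, 16, 14, 8, 20]
d = [10, 10, 10, 6, 4]

# Hand computation of H (no tie correction), as in the text
rank_sums = [25.5, 43.0, 85.5, 56.0]
n_total = 20
h_manual = (12 / (n_total * (n_total + 1))) * sum(r ** 2 / 5 for r in rank_sums) \
           - 3 * (n_total + 1)
print(round(h_manual, 2))   # 10.97

# SciPy's version corrects for ties, giving a slightly larger H
h_scipy, p_value = stats.kruskal(a, b, c, d)
print(round(h_scipy, 2), round(p_value, 4))
```

Either way, H exceeds the χ² critical value of 7.815 (df = 3, α = 0.05), so Ho is rejected.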

6. Chi-square (χ²) Test

This is also one of the non-parametric or distribution-free statistical tests, which goes back to 1900, when Karl Pearson used it for frequency data classified into k mutually exclusive categories. It is usually represented by χ², where χ is the Greek letter chi. The sampling distribution of χ² is called the χ² distribution. Like other hypothesis-testing procedures, the calculated χ²-test statistic is compared with its critical (or table) value to know whether the Ho hypothesis is true or not. The decision to accept Ho is based on how close the sample results are to the expected results. The data should be expressed in original units, rather than in percentage or ratio form.

Let us assume that we have a postulated value of the variance (σo²) of a normal population on the basis of previous knowledge. Draw a random sample of size n (<30) from this population. On the basis of the n sample observations, the postulated value σo² of the population variance (σ²) is to be either substantiated or rejected with the help of a statistical test. The hypothesis here is:

Ho: σ² = σo² vs H1: σ² ≠ σo²

This can be tested by:

χ² = Σ(xi − x̄)² / σo²  or  χ² = (n − 1)S² / σo²

Where S² = sample variance, and the χ² statistic has (n − 1) degrees of freedom.

To accept or reject Ho:
1. For the two-tailed test, reject Ho at the pre-decided level of significance α if χ²cal ≥ χ²(α/2) or if χ²cal ≤ χ²(1 − α/2), each at n − 1 degrees of freedom; otherwise accept Ho.
2. In the case of a one-tailed test (Ho: σ² ≤ σo² vs H1: σ² > σo²), reject Ho if χ²cal ≥ χ²(α) at n − 1 degrees of freedom; for H1: σ² < σo², reject Ho if χ²cal ≤ χ²(1 − α) at n − 1 degrees of freedom.


Example
Let us assume that a factory owner wants to purchase a commodity only if it does not have a variance of more than 5.13 (kg²) in weight. To make sure of the specification, the buyer selects 8 sample items of the commodity and measures the weight of each, with the following results (in kilograms): 5, 7, 10, 4, 9, 4, 8, 6.

Table 9.6:
xi:        5       7       10      4       9       4       8       6
xi − x̄:   −1.625  0.375   3.375   −2.625  2.375   −2.625  1.375   −0.625
(xi − x̄)²: 2.64   0.14    11.39   6.89    5.64    6.89    1.89    0.39

x̄ = 6.625, Σ(xi − x̄)² = 35.87

χ² = Σ(xi − x̄)² / σo² = 35.87 / 5.13 = 6.99

For α = 0.05 and n − 1 = 7 degrees of freedom, the tabulated value from the chi-square table (χ²-distribution) is 14.0671. Since the calculated value (6.99) is less than the tabulated value (14.0671), we should accept the null hypothesis. It means that the factory owner should purchase the commodity.
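The example can be verified as follows, assuming SciPy is available for the critical value:

```python
from scipy import stats

# Worked example: 8 sampled weights, hypothesized variance sigma0^2 = 5.13
weights = [5, 7, 10, 4, 9, 4, 8, 6]
sigma0_sq = 5.13
n = len(weights)

xbar = sum(weights) / n                             # 6.625
ss = sum((x - xbar) ** 2 for x in weights)          # sum of squared deviations
chi2_cal = ss / sigma0_sq                           # equivalent to (n-1)S^2 / sigma0^2
chi2_tab = stats.chi2.ppf(0.95, df=n - 1)           # upper critical value, alpha = 0.05

print(round(chi2_cal, 2), round(chi2_tab, 4))       # 6.99 14.0671
# chi2_cal < chi2_tab, so Ho is accepted
```

Since 6.99 < 14.0671, the sample variance is consistent with the specification.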

7. Analysis of Variance (ANOVA)


Analysis of variance (ANOVA), developed by Sir Ronald Fisher, helps to test the differences among three or more sample means drawn from corresponding populations. Here we test the null hypothesis (Ho) that the populations from which the samples are drawn have equal (homogeneous) means against the alternative hypothesis (H1) that not all the population means are equal.
H0: μ1 = μ2 = … = μk
H1: Not all μj are equal (j = 1, 2, 3, …, k)


Hence, the null and alternative hypotheses imply that the null hypothesis should be rejected if any of the k sample means differs from the others. The assumptions for the analysis of variance are:
a. Each population is normally distributed
b. The populations from which the samples are drawn have equal variances
c. Each sample is drawn randomly and is independent of the other samples

The first step in the analysis of variance is to partition the total variation in the sample data into the following two component variations:
a. The variation among (between) the sample means, attributable to the differences among the samples.
b. The variation within the sample observations. This difference is considered due to chance causes or random errors.

In analysis of variance, a table known as the ANOVA table is required and is established as follows:

Table 9.7:
Source of variation | Degrees of freedom | Sum of squares | Variance | F-value
Between samples     |                    |                |          |
Within samples      |                    |                |          |
Total               |                    |                |          |

Example
Let us assume that 4 enumerators are sent to a market to collect data on the price of a commodity. We can then apply ANOVA to test the differences among the prices collected by the enumerators.

H0: μ1 = μ2 = μ3 = μ4
H1: Not all μj are equal

Enumerator | Prices in Eth. Birr | Total | x̄
A          | 4, 6, 2, 6, 2       | 20    | 4
B          | 6, 8, 6, 10, 0      | 30    | 6
C          | 12, 16, 14, 8, 20   | 70    | 14
D          | 10, 10, 10, 6, 4    | 40    | 8

Grand total = 20 + 30 + 70 + 40 = 160; Grand mean = 160/20 = 8

Then, find the variation within each data set as follows:
Variation within the data set collected by A: Σ(xi − x̄A)² = 16
Variation within the data set collected by B: Σ(xi − x̄B)² = 56
Variation within the data set collected by C: Σ(xi − x̄C)² = 80
Variation within the data set collected by D: Σ(xi − x̄D)² = 32
Total variation within = 16 + 56 + 80 + 32 = 184

We also have to find the variation between the data sets. Here the mean of each data set represents the whole set:
Variation b/n for the data set collected by A: n(x̄A − grand mean)² = 5(4 − 8)² = 80
Variation b/n for the data set collected by B: n(x̄B − grand mean)² = 5(6 − 8)² = 20
Variation b/n for the data set collected by C: n(x̄C − grand mean)² = 5(14 − 8)² = 180
Variation b/n for the data set collected by D: n(x̄D − grand mean)² = 5(8 − 8)² = 0
Total variation b/n = 80 + 20 + 180 + 0 = 280

Grand (total) variation = total variation within (184) + total variation b/n (280) = 464

Finally, create the ANOVA table as follows:

Source of variation | Degrees of freedom | Sum of squares | Variance        | F-value
Variation b/n       | 3                  | 280            | 280/3 = 93.33   | 93.33/11.5 = 8.12
Variation within    | 16                 | 184            | 184/16 = 11.5   |
Grand variation     | 19                 | 464            |                 |

The conclusion of the ANOVA above is that, since Fcal (8.12) > Ftab (3.24, at α = 0.05 with 3 and 16 degrees of freedom), the difference is significant (not all μj are equal) and the null hypothesis is rejected.
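Assuming SciPy is available, scipy.stats.f_oneway reproduces the F-value of the table above:

```python
from scipy import stats

# The four enumerators' price data from the worked example
a = [4, 6, 2, 6, 2]
b = [6, 8, 6, 10, 0]
c = [12, 16, 14, 8, 20]
d = [10, 10, 10, 6, 4]

# One-way ANOVA: F = (between-group MS) / (within-group MS)
f_cal, p_value = stats.f_oneway(a, b, c, d)
print(round(f_cal, 2), round(p_value, 4))   # 8.12 and a p-value below 0.05
```

The p-value below 0.05 matches the table-based conclusion that the enumerators' mean prices differ significantly.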

9.7. Test of Significance of Correlation Coefficients


Whatever conclusions are drawn from the sample under consideration are meant to support inferences about the parent population. The estimates are not unique, and hence confirmation is sought by way of a test of significance for the validity of the inferences drawn from the sample about the population. The test of significance for the existence of a linear relationship between two variables x and y involves the determination of the sample correlation coefficient r. Testing the linear relationship between x and y is the same as determining whether there is any significant correlation between them. We start by hypothesizing that the population correlation coefficient ρ equals zero. That is, we test whether or not the correlation coefficient is zero in the population:

Ho: ρ = 0 (there is no correlation) vs H1: ρ ≠ 0 (there is a correlation)

Thus, the t-test statistic for testing the null hypothesis is:

t = r√(n − 2) / √(1 − r²)

Where, r = sample coefficient of correlation of n pairs of observations
ρ = population coefficient of correlation
n = sample size
t = t-statistic with n − 2 degrees of freedom

If the calculated value of t is greater than the tabulated value of t at significance level α and (n − 2) degrees of freedom, reject Ho. Rejection of Ho leads to the conclusion that the two variables are not independent; this means that the correlation between them is worth considering. On the other hand, if Ho is accepted, the value of r is attributed to sampling error, and in reality the two variables are uncorrelated in the population.

Example
Calculate the correlation coefficient of the following paired hypothetical data and test its significance.

Table 9.8:
Farm Households: 1, 2, 3, 4, 5, 6
Crop Yield per Unit Area (in Quintals): 25, 14, 18, 22, 15, 18
Fertilizer Application per Unit Area (in Kg): 70, 30, 30, 60, 30, 35

Solution
1. First, calculate the coefficient of correlation (r) for the paired data. By the methods (formulae) you have learnt earlier in this course, the Pearson's correlation coefficient for the data is calculated to be +0.940.
2. State the hypothesis, i.e. Ho: ρ = 0 (there is no correlation) vs H1: ρ ≠ 0 (there is a correlation).
3. Apply the test statistic above:

t = r√(n − 2) / √(1 − r²)

Given r = 0.940, r² = 0.8836, n = 6:

t = 0.940 × √4 / √(1 − 0.8836) = 1.880 / √0.1164 = 1.880 / 0.3412 = 5.51

4. The tabulated value of t at α = 0.01 and (n − 2 = 6 − 2 = 4) degrees of freedom is 3.747, which is less than the calculated value of t (5.51). Hence, reject Ho. Rejection of Ho confirms the conclusion that the two variables are really highly correlated. This means that the correlation between them is worth considering.
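Assuming SciPy is available, both the correlation coefficient and its t-test can be reproduced:

```python
from scipy import stats

# Paired data from Table 9.8: crop yield vs. fertilizer application
yield_q = [25, 14, 18, 22, 15, 18]
fert_kg = [70, 30, 30, 60, 30, 35]

r, p_value = stats.pearsonr(yield_q, fert_kg)      # r and its two-sided p-value
n = len(yield_q)
t_cal = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5   # t with n - 2 df

print(round(r, 2), round(t_cal, 2), round(p_value, 4))
# r = 0.94; |t_cal| exceeds the critical t, so Ho: rho = 0 is rejected
```

pearsonr's p-value already embodies the t-test above, so in practice one can simply compare it with α.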


