
IIMM/DH/02/2006/8154, Business Statistics

Q 1. (a) What do you understand by the word Statistics? Give its definitions (minimum by 4 authors) as explained by various distinguished authors. Answer:-

Statistics is the science of making effective use of numerical data relating to groups of individuals or experiments. It deals with all aspects of this, including not only the collection, analysis and interpretation of such data, but also the planning of the collection of data, in terms of the design of surveys and experiments. A statistician is someone who is particularly versed in the ways of thinking necessary for the successful application of statistical analysis. Often such people have gained this experience after starting work in any of a number of fields. There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject.

The word statistics can either be singular or plural. When it refers to the discipline, "statistics" is singular, as in "Statistics is an art." When it refers to quantities (such as mean and median) calculated from a set of data, statistics is plural, as in "These statistics are misleading."

Statistics is considered by some to be a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data, while others consider it to be a branch of mathematics concerned with collecting and interpreting data. Because of its empirical roots and its focus on applications, statistics is usually considered to be a distinct mathematical science rather than a branch of mathematics. Statisticians improve the quality of data with the design of experiments and survey sampling. Statistics also provides tools for prediction and forecasting using data and statistical models. Statistics is applicable to a wide variety of academic disciplines, including natural and social sciences, government, and business.

Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. This is useful in research, when communicating the results of experiments. In addition, patterns in the data may be modeled in a way that accounts for randomness and uncertainty in the observations, and are then used to draw inferences about the process or population being studied; this is called inferential statistics. Inference is a vital element of scientific advance, since it provides a prediction (based on data) for where a theory logically leads. To further prove the guiding theory, these predictions are tested as well, as part of the scientific method. If the inference holds true, then the descriptive statistics of the new data increase the soundness of that hypothesis. Descriptive statistics and inferential statistics (a.k.a. predictive statistics) together comprise applied statistics.

The word statistics has its origin in the German Statistik (political state) or the Italian statista (statesman). It is a branch of mathematics applied to real situations and used for resolving problems. Essentially it means the use of data to help the decision maker reach better decisions in forecasting, controlling and exploring.

Various Definitions:
1. According to Prof. Horace Secrist, "Statistics is an aggregate of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other."
2. According to Croxton and Cowden, "Statistics is the science of collection, organization, presentation, analysis and interpretation of numerical data."


3. According to Prof. Ya Lun Chou, "Statistics is a method of decision making in the face of uncertainty on the basis of numerical data and calculated risks."
4. According to Wallis and Roberts, "Statistics is not a body of substantive knowledge but a body of methods for obtaining knowledge."

Experimental and observational studies


A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on dependent variables or response. There are two major types of causal statistical studies: experimental studies and observational studies. In both types of studies, the effect of differences of an independent variable (or variables) on the behavior of the dependent variable is observed. The difference between the two types lies in how the study is actually conducted. Each can be very effective. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and response are investigated.

An example of an experimental study is the famous Hawthorne study, which attempted to test changes to the working environment at the Hawthorne plant of the Western Electric Company. The researchers were interested in determining whether increased illumination would increase the productivity of the assembly line workers. The researchers first measured the productivity in the plant, then modified the illumination in an area of the plant and checked if the changes in illumination affected productivity. It turned out that productivity indeed improved (under the experimental conditions). However, the study is heavily criticized today for errors in experimental procedures, specifically for the lack of a control group and blindness. The Hawthorne effect refers to the finding that an outcome (in this case, worker productivity) changed due to observation itself. Those in the Hawthorne study became more productive not because the lighting was changed but because they were being observed.

An example of an observational study is one that explores the correlation between smoking and lung cancer. This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis. In this case, the researchers would collect observations of both smokers and nonsmokers, perhaps through a case-control study, and then look for the number of cases of lung cancer in each group.

The basic steps of an experiment are:
1. Planning the research, including determining information sources, research subject selection, and ethical considerations for the proposed research and method.
2. Design of experiments, concentrating on the system model and the interaction of independent and dependent variables.
3. Summarizing a collection of observations to feature their commonality by suppressing details. (Descriptive statistics)
4. Reaching consensus about what the observations tell about the world being observed. (Statistical inference)
5. Documenting / presenting the results of the study.

Levels of measurement


There are four types of measurements or levels of measurement or measurement scales used in statistics:

nominal, ordinal, interval, and ratio.

They have different degrees of usefulness in statistical research. Ratio measurements have both a zero value defined and the distances between different measurements defined; they provide the greatest flexibility in statistical methods that can be used for analyzing the data. Interval measurements have meaningful distances between measurements defined, but no meaningful zero value (as is the case with IQ measurements or with temperature measurements in Fahrenheit). Ordinal measurements have imprecise differences between consecutive values, but a meaningful order to those values. Nominal measurements have no meaningful rank order among values. Since variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, they are sometimes grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative or continuous variables due to their numerical nature.
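To make the four levels concrete, here is a minimal sketch in Python. The records, column names and values are hypothetical; the point is only which summary is meaningful at each level of measurement.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical survey records illustrating the four levels of measurement.
records = [
    {"blood_group": "A", "satisfaction": 2, "temp_f": 98.6, "income": 42_000},
    {"blood_group": "O", "satisfaction": 5, "temp_f": 99.1, "income": 55_000},
    {"blood_group": "B", "satisfaction": 4, "temp_f": 97.9, "income": 38_000},
]

# Nominal (blood_group): no order, so only counts and the mode are meaningful.
print(Counter(r["blood_group"] for r in records).most_common(1))

# Ordinal (satisfaction on a 1-5 scale): order is meaningful, so a median makes sense.
print(median(r["satisfaction"] for r in records))

# Interval (temperature in Fahrenheit): differences are meaningful, ratios are not (no true zero).
temps = [r["temp_f"] for r in records]
print(max(temps) - min(temps))

# Ratio (income): a true zero exists, so means and ratios are meaningful.
print(mean(r["income"] for r in records))
```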

Key terms used in statistics

Null hypothesis
Interpretation of statistical information can often involve the development of a null hypothesis, in that the assumption is that whatever is proposed as a cause has no effect on the variable being measured. The best illustration for a novice is the predicament encountered in a jury trial. The null hypothesis, H0, asserts that the defendant is innocent, whereas the alternative hypothesis, H1, asserts that the defendant is guilty. The indictment comes because of suspicion of guilt. The H0 (status quo) stands in opposition to H1 and is maintained unless H1 is supported by evidence "beyond a reasonable doubt". However, "failure to reject H0" in this case does not imply innocence, but merely that the evidence was insufficient to convict. So the jury does not necessarily accept H0 but fails to reject H0.

Error
Working from a null hypothesis, two basic forms of error are recognised:

Type I errors, where the null hypothesis is falsely rejected, giving a "false positive".
Type II errors, where the null hypothesis fails to be rejected and an actual difference between populations is missed.

Confidence intervals
Most studies will only sample part of a population, and the result is then used to interpret the null hypothesis in the context of the whole population. Any estimates obtained from the sample only approximate the population value. Confidence intervals allow statisticians to express how closely the sample estimate matches the true value in the whole population. Often they are expressed as 95% confidence intervals. Formally, a 95% confidence interval of a procedure is any range such that the interval covers the true population value 95% of the time given repeated sampling under the same conditions. If these intervals span a value (such as zero) at which the null hypothesis would be confirmed, then this can indicate that any observed value has been seen by chance. For example, a drug that gives a mean increase in heart rate of 2 beats per minute but has a 95% confidence interval of -5 to 9 for its increase may well have no effect whatsoever. The 95% confidence interval is often misinterpreted as the probability that the true value lies between the upper and lower limits, given the observed sample. However, that quantity is a credible interval, which is available only from Bayesian statistics.
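As a sketch of how such an interval is computed in practice: the heart-rate changes below are invented for illustration, and a standard t-based interval is assumed (the critical value 2.262 is the 95% t value for 9 degrees of freedom).

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical changes in heart rate (beats per minute) for 10 patients on the drug.
changes = [4, -3, 7, 1, 0, 5, -2, 9, -1, 3]

n = len(changes)
m = mean(changes)
se = stdev(changes) / sqrt(n)   # standard error of the sample mean
t_crit = 2.262                  # t value for 95% confidence, 9 degrees of freedom

print(f"mean change = {m:.2f} bpm")
print(f"95% CI = ({m - t_crit * se:.2f}, {m + t_crit * se:.2f})")
# If the interval spans 0, the data are consistent with the drug having no effect.
```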

Significance
Statistics rarely give a simple yes/no answer to the question asked of them. Interpretation often comes down to the level of statistical significance applied to the numbers, which often refers to the probability of a value accurately rejecting the null hypothesis (sometimes referred to as the p-value). When interpreting an academic paper, a reference to the statistical significance of a result does not necessarily mean that the overall result means anything in real-world terms. (For example, in a large study of a drug it may be shown that the drug has a statistically significant but very small beneficial effect, such that the drug will be unlikely to help anyone given it in a noticeable way.)
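A hedged sketch of the difference between statistical and practical significance, using invented data and assuming SciPy is available for the one-sample t-test:

```python
import random
from scipy import stats

# Hypothetical: a very large trial where the drug's true benefit is tiny (0.2 bpm).
random.seed(0)
effect = [random.gauss(0.2, 5.0) for _ in range(100_000)]

t_stat, p_value = stats.ttest_1samp(effect, popmean=0.0)
print(f"p-value = {p_value:.2e}")                               # typically far below 0.05
print(f"estimated effect = {sum(effect) / len(effect):.2f} bpm")  # still only ~0.2 bpm
# Statistically significant, yet probably too small to matter in real-world terms.
```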

Examples
Some well-known statistical tests and procedures are:

Analysis of variance (ANOVA)
Chi-square test
Correlation
Factor analysis
Mann-Whitney U
Mean square weighted deviation (MSWD)
Pearson product-moment correlation coefficient
Regression analysis
Spearman's rank correlation coefficient
Student's t-test
Time series analysis

Q 1. (b) Enumerate some important developments of statistical theory; also explain the merits and limitations of statistics. Answer:


The theory of statistics includes a number of topics:

Statistical models of the sources of data and typical problem formulation: A statistical model is a set of mathematical equations which describe the behavior of an object of study in terms of random variables and their associated probability distributions. If the model has only one equation it is called a single-equation model, whereas if it has more than one equation, it is known as a multiple-equation model. In mathematical terms, a statistical model is frequently thought of as a pair (Y, P), where Y is the set of possible observations and P the set of possible probability distributions on Y. It is assumed that there is a distinct element of P which generates the observed data. Statistical inference enables us to make statements about which element(s) of this set are likely to be the true one.

Three notions are sufficient to describe all statistical models:
1. We choose a statistical unit, such as a person, to observe directly. Multiple observations of the same unit over time is called longitudinal research. Observation of multiple statistical attributes is a common way of studying relationships among the attributes of a single unit.
2. Our interest may be in a statistical population (or set) of similar units rather than in any individual unit. Survey sampling offers an example of this type of modeling.
3. Our interest may focus on a statistical assembly, where we examine functional subunits of the statistical unit. For example, physiology modeling probes the organs which compose the unit. A common model for this type of research is the stimulus-response model.

One of the most basic models is the simple linear regression model, which assumes a relationship between two random variables Y and X. For instance, one may want to linearly explain child mortality in a given country by its GDP. This is a statistical model because the relationship need not be perfect and the model includes a disturbance term which accounts for effects on child mortality other than GDP. As a second example, Bayes' theorem in its raw form may be intractable, but assuming a general model H allows it to become


P(θ | x, H) = P(x | θ, H) P(θ | H) / P(x | H),

which may be easier. Models can also be compared using measures such as Bayes factors or mean square error.
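A minimal sketch of the simple linear regression model described above, with invented GDP and child-mortality figures and an ordinary least-squares fit:

```python
from statistics import mean

# Hypothetical (GDP per capita in $1000s, child mortality per 1000 births) pairs.
data = [(1, 95), (2, 80), (5, 55), (10, 35), (20, 22), (40, 12)]
x = [d[0] for d in data]
y = [d[1] for d in data]

# Ordinary least squares estimates for the model y = a + b*x + disturbance
x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in data) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

print(f"fitted line: mortality ~ {a:.1f} + {b:.2f} * GDP")
# The residuals (observed minus fitted values) play the role of the disturbance term.
```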
1. Sampling from a finite population
2. Measuring observational error and refining procedures
3. Studying statistical relations

Generating informative data using optimization and randomization while measuring and controlling observational error.[1] Optimization reduces the cost of data while meeting statistical needs, while randomization warrants reliable inferences:

1. Design of experiments to determine treatment effects
2. Survey sampling to describe natural populations

Summarizing statistical data in conventional forms (also known as descriptive statistics):

1. Choosing summary statistics to describe a sample
2. Fitting probability distributions to sample data

Interpreting statistical data is the final objective of all research:

1. Common assumptions that we make
2. Likelihood principle
3. Estimating parameters
4. Testing statistical hypotheses

Estimation Theory

In the statistical theory of estimation, estimating the maximum of a uniform distribution is a common illustration of differences between estimation methods. The specific case of sampling without replacement from a discrete uniform distribution is known, in the English-speaking world, as the German tank problem, due to its application in World War II to the estimation of the number of German tanks. Estimating the population maximum based on a single sample raises philosophical points about the evaluation of estimators and likelihood (particularly bias in maximum likelihood estimators) and yields divergent results depending on the approach, while estimation based on multiple samples is used in elementary statistics education as an instructive practical estimation question whose answer is simple but not obvious. The problem is usually framed for a discrete distribution, but virtually identical analysis holds for a continuous distribution.
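As an illustration of the divergent estimates mentioned above, a small sketch; the serial numbers are simulated, and the classical unbiased estimator m + m/k - 1 (sample maximum plus average gap) is assumed:

```python
import random

def mvue_estimate(sample):
    """Classical unbiased estimate for the German tank problem:
    N_hat = m + m/k - 1, where m is the sample maximum and k the sample size."""
    m, k = max(sample), len(sample)
    return m + m / k - 1

# Hypothetical illustration: the true number of "tanks" is 300.
random.seed(0)
true_n = 300
sample = random.sample(range(1, true_n + 1), 5)  # sampling without replacement

print("observed serial numbers:", sorted(sample))
print("maximum-likelihood estimate (sample max):", max(sample))
print("unbiased estimate m + m/k - 1:", round(mvue_estimate(sample), 1))
```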


Pareto interpolation is a method of estimating the median and other properties of a population that follows a Pareto distribution. It is used in economics when analysing the distribution of incomes in a population, when one must base estimates on a relatively small random sample taken from the population. The family of Pareto distributions is parameterized by two quantities: a positive number κ, the smallest value that a random variable with a Pareto distribution can take (as applied to the distribution of incomes, κ is the lowest income of any person in the population), and a positive number θ, the "Pareto index"; as this increases, the tail of the distribution gets thinner. As applied to the distribution of incomes, this means that the larger the value of the Pareto index, the smaller the proportion of incomes many times as big as the smallest incomes.

Pareto interpolation can be used when the available information includes the proportion of the sample that falls below each of two specified numbers a < b. For example, it may be observed that 45% of individuals in the sample have incomes below a = $35,000 per year, and 55% have incomes below b = $40,000 per year. Let
Pa = proportion of the sample that lies below a; Pb = proportion of the sample that lies below b.

Then the estimates of θ and κ are

estimated θ = [log(1 - Pa) - log(1 - Pb)] / [log(b) - log(a)]

and

estimated κ = [(Pb - Pa) / (a^(-θ) - b^(-θ))]^(1/θ)   (evaluated at the estimated θ).

The estimate of the median would then be

estimated median = (estimated κ) · 2^(1/θ),

since the actual population median is

median = κ · 2^(1/θ).
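A small sketch implementing the interpolation formulas above on the example figures (a = $35,000 with Pa = 0.45, b = $40,000 with Pb = 0.55); the function name is ours:

```python
import math

def pareto_interpolate(a, pa, b, pb):
    """Estimate the Pareto index (theta), the scale (kappa) and the median
    from the proportions pa, pb of the sample lying below thresholds a < b."""
    theta = (math.log(1 - pa) - math.log(1 - pb)) / (math.log(b) - math.log(a))
    kappa = ((pb - pa) / (a ** -theta - b ** -theta)) ** (1 / theta)
    median = kappa * 2 ** (1 / theta)
    return theta, kappa, median

theta, kappa, median = pareto_interpolate(35_000, 0.45, 40_000, 0.55)
print(f"Pareto index ~ {theta:.3f}, lowest income ~ {kappa:,.0f}, median ~ {median:,.0f}")
```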

THE POTENTIALS AND LIMITATIONS OF STATISTICS AS A SCIENTIFIC METHOD OF INFERENCE

The phases through which statistical inference is conducted are: defining the problem, which incorporates defining the population and one or more variables that should be observed on that population, including the goal of the research; defining the statistical methods that should lead to achieving the goal; sampling (defining the sample size and method of selection, as well as the selection itself); arranging the data gathered from the sample according to the selected method of statistical inference; conducting the statistical inference; and interpreting the obtained results. Fundamental statistical procedures can be classified according to the way of inference into two groups: estimation of parameters and testing of statistical hypotheses. Apart from this, it is possible to make other classifications according to other criteria, such as the kind of variable observed on a population, the complexity of the method of inference, and so on. The classification emphasized here is the one based on the complexity of inference. According to this classification, there are four types of statistical procedures.


Statistical procedures of the first type comprise the simplest inferences, performed at the level of arranging and presenting samples, in the first place by graphical means. The second type relates to applying statistics (functions of a sample with certain characteristics) to so-called "small" samples, but also those that lead to the same type of inferences based on "large" samples. From the point of view of the potentials and limitations of statistics as a science, special attention should be paid to the sense of "small" and "large" samples, since the sample size impacts the ability to infer from the selected statistics in the most direct way. It should be stressed here that there is no "universal" sample size which would be large enough for all statistics. The sample size determines the speed of convergence of the selected statistics towards the real value of the parameter, as well as the accuracy of the approximate estimation, which is only possible upon a part of the total population, a sample.

Statistics of the third type follow dynamical stochastic systems (as opposed to those of the second type, which are mainly concerned with static systems). Dynamical stochastic systems are by definition very complex systems comprising a large number of variables (characteristics) that are often not well differentiated from one another. Dynamical systems can range from natural to production to social. An example of a social dynamical system is, for example, the quality of life of a social group. The complexity of such a problem from the point of view of statistical inference is obvious. The number of factors (variables) that constitute this system; the types of variables in the system (qualitative and quantitative); the possibility of their measurement, direct or indirect; and the criteria, i.e. the range of their values that could be used to describe the quality of life, are only some of the questions that face the statistician on his way to understanding this system. However, it should be kept in mind that it is the statistician who provides a service to other scientists, or simply to those who commission the service, and that without those commissions and a well-defined project goal his services are not usable. This third type of statistical inference includes so-called statistical modeling. The fourth type of statistical inference comprises statistical decision making, the decision theory: making decisions at all levels of management of social, production, and even natural processes by means of statistical methods.

The Potentials and Limitations

When the potentials and limitations of a scientific method are discussed, it is assumed in the first place that the method has been applied correctly and properly, and then its limitations should be considered, as well as the usability of the obtained inferences. All these phases will be analyzed in the context of the statistical method. A special emphasis will be given to psychological research related to errors in applying statistical inference. First, let us consider the conditions under which a statistical method can be applied. It has been stated already that the statistical method can be applied to all events that have the property of so-called statistical homogeneity. That means that they must be equivalent to the statistical experiment.
The statistical experiment meets the following criteria: 1) it can be repeated any number of times under the same conditions; 2) it is defined in advance what to observe in the experiment and all possible outcomes of the observation are known; 3) the outcome of any individual experiment is not known in advance. In this context we will, from now on, only talk about the statistical experiment, or experiment, regardless of whether it is a natural or perhaps a social event or even an artificially triggered experiment. It follows directly from such a definition that a large number of events that psychology and the social sciences are interested in can be categorized as statistical experiments. Hence the large interest of social scientists and psychologists in this area of science. However, it should be kept in mind that if even a single element of the above definition is missing while observing an event, the statistical method is rendered useless. This is even more so since the conditions of the definition are not always obvious, i.e. it is not always possible at first glance to determine whether the observed event is a statistical experiment or not. Therefore, the applicability of the statistical method may be questionable, as are the conclusions that would eventually be drawn from its application.

By definition, a statistical experiment is conducted on a sample, rather than the total population. In order to be able to draw conclusions about the total population based on a sample, it is necessary that this sample be a representative one. The importance of the selection of a representative sample is often illustrated by Fisz's example. The population comprised all employed health insurance policyholders in Poland, namely N = 2,757,131 (the population size), and the observed variable was the type of work they did. All types of employment were grouped into four categories (a qualitative variable with four possible values). The following table shows the absolute frequency of these categories (Ni, i = 1, 2, 3, 4) in the total population of employees, as well as the absolute frequency of employment types in the sample (ni, i = 1, 2, 3, 4) of size n = 230,433, which was selected from the total population based on the criterion that an employee's surname starts with the letter "P":


Employment category                                    Ni          ni
1. laborers, except miners and steel mill workers     1,778,446   152,812
2. miners and steel mill workers                       250,397     22,493
3. social workers and public servants                  564,147     44,040
4. services                                            164,141     11,088

By comparing the distribution of the variable across the total population with the distribution of the same characteristic across the selected sample (testing the hypothesis about the equality of those distributions), it was shown that the distribution of the variable on the sample was significantly different from the distribution across the total population. So the sample was not representative, i.e. it was shown that the choice of profession in Poland was not independent of the starting letter of the surname.

A lot of psychological research has been dedicated to understanding the representative sample [4]. Let us list some of the most common misconceptions. Often, the validity and likelihood of a sample is judged according to its similarity with the actual population. For example, in samples of six successive births in a hospital realized in the following way: MFFMFF and MMMMMF (M = male sex, F = female sex), the first is accepted by laymen as more representative than the second one. Applying this heuristic also leads to the following error: the belief that small samples represent the total population equally as well as large samples. For example, it is erroneously concluded that 70% of obtained tails is an equally representative outcome for both a batch of 100 and a batch of 10 coin tosses. (However, it is easily calculated that the probability of obtaining 70 tails in a batch of 100 trials is approximately 0.000023, while the probability of obtaining 7 tails in a batch of 10 tosses is approximately 0.117187.) Therefore, the choice of a sample is one of the phases of the statistical method of inference which in a fundamental way determines the worth of inferences obtained using this method. The logical principle that "from a wrong premise an arbitrary conclusion (correct as well as incorrect) can be drawn" gets a practical confirmation here, in the sense that from an erroneously selected sample it is possible to draw any, i.e. totally useless, inferences. The importance of this phase of statistical inference is best illustrated by the fact that sampling theory is a special field of mathematical statistics. Hereafter it will be assumed that the sample, when one is discussed, has been selected in an appropriate way.

The sample size is a separate question. The sample size is associated with the limiting distribution of the statistics used in the method of inference. So only those statistics corresponding to the available sample size should be taken into account, or we should choose a sample size that is large enough, having decided in advance on the method of inference, i.e. on the use of certain statistics.

The third aspect concerns the question of chance. Chance is often erroneously equated with probability, and it is also often invoked in statistics without justification.
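The coin-toss probabilities quoted above can be checked with a short binomial calculation; a minimal sketch:

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Checking the figures quoted above for a fair coin:
print(binom_pmf(70, 100))  # ~0.0000232  (exactly 70 tails out of 100 tosses)
print(binom_pmf(7, 10))    # ~0.1171875  (exactly 7 tails out of 10 tosses)
```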
Conclusion

In the application of the statistical method of inference there are objective and subjective limitations. The aspects analyzed in this paper are certainly not the only ones, and this should be taken into account whenever applications of the statistical method are concerned. It is important, however, to conclude that at today's level of scientific thought and knowledge about objective and subjective reality, the statistical method is irreplaceable: it is the only one concerned with data and their processing, and the only one that accepts randomness as objectively existing. Concerning the latter, it should be pointed out that in the scientific world today there is increasing talk of determined, or deterministic, chaos. If it were proved that the whole of nature and society are arranged this way, mathematical statistics would definitely lose much of its importance. For deterministic chaos, however, it is important to discover the first element of a sequence; in that case, mathematical statistics may be thought of as the available scientific procedure for recognizing that first element in the sequence.

Q 3. (a) Describe Arithmetic, Geometric and Harmonic means with suitable examples. Explain the merits and limitations of the Geometric mean. Answer:

Arithmetic Mean:- In mathematics and statistics, the arithmetic mean (or simply the mean) of a list of numbers is the sum of all of the members of the list divided by the number of items in the list. If the list is a statistical population, then the mean of


that population is called a population mean. If the list is a statistical sample, we call the resulting statistic a sample mean. The mean is the most commonly used type of average and is often referred to simply as the average. The term "mean" or "arithmetic mean" is preferred in mathematics and statistics to distinguish it from other averages such as the median and the mode.

While the mean is often used to report central tendency, it is not a robust statistic, meaning that it is greatly influenced by outliers. Notably, for skewed distributions, the arithmetic mean may not accord with one's notion of "middle", and robust statistics such as the median may be a better description of central tendency. A classic example is average income. The arithmetic mean may be misinterpreted as the median, to imply that most people's incomes are higher than is in fact the case. When presented with an arithmetic mean, one may be led to believe that most people's incomes are near this number. This mean income is higher than most people's incomes because high-income outliers skew the result higher (in contrast, the median income resists such skew). However, the mean says nothing about the number of people near the median income (nor does it say anything about the modal income that most people are near). Nevertheless, because one might carelessly relate "average" and "most people", one might incorrectly assume that most people's incomes would be higher (nearer this inflated "average") than they are. For instance, reporting the "average" net worth in Medina, Washington as the arithmetic mean of all annual net worths would yield a surprisingly high number because of Bill Gates. Consider the scores (1, 2, 2, 2, 3, 9). The arithmetic mean is 3.17, but five out of six scores are below this value.

Directions
Particular care must be taken when using cyclic data such as phases or angles. Naively taking the arithmetic mean of 1° and 359° yields a result of 180°. This is incorrect for two reasons: Firstly, angle measurements are only defined up to an additive constant of 360° (or 2π, if measuring in radians). Thus one could as easily call these 1° and -1°, or 1° and 719°, each of which gives a different average. Secondly, in this situation, 0° (equivalently, 360°) is geometrically a better average value: there is lower dispersion about it (the points are both 1° from it, and 179° from 180°, the putative average).

In general application such an oversight will lead to the average value artificially moving towards the middle of the numerical range. A solution to this problem is to use the optimization formulation (viz., define the mean as the central point: the point about which one has the lowest dispersion), and redefine the difference as a modular distance (i.e., the distance on the circle: so the modular distance between 1° and 359° is 2°, not 358°). The arithmetic mean is the "standard" average, often simply called the "mean".
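A small sketch of the modular (circular) approach for the 1° and 359° example, using the mean direction of the corresponding unit vectors:

```python
import math

def circular_mean_deg(angles_deg):
    """Mean direction of angles (in degrees) via the average unit vector."""
    xs = [math.cos(math.radians(a)) for a in angles_deg]
    ys = [math.sin(math.radians(a)) for a in angles_deg]
    return math.degrees(math.atan2(sum(ys) / len(ys), sum(xs) / len(xs))) % 360

print(sum([1, 359]) / 2)            # naive arithmetic mean: 180.0
print(circular_mean_deg([1, 359]))  # circular mean: ~0 (equivalently 360)
```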

The mean may often be confused with the median, mode or range. The mean is the arithmetic average of a set of values, or distribution; however, for skewed distributions, the mean is not necessarily the same as the middle value (median) or the most likely value (mode). For example, mean income is skewed upwards by a small number of people with very large incomes, so that the majority have an income lower than the mean. By contrast, the median income is the level at which half the population is below and half is above. The mode income is the most likely income and favors the larger number of people with lower incomes. The median or mode is often a more intuitive measure of such data.


Nevertheless, many skewed distributions are best described by their mean, such as the exponential and Poisson distributions. For example, the arithmetic mean of the six values 34, 27, 45, 55, 22, 34 is (34 + 27 + 45 + 55 + 22 + 34) / 6 = 217 / 6 ≈ 36.17.

Geometric mean

The geometric mean, in mathematics, is a type of mean or average which indicates the central tendency or typical value of a set of numbers. It is similar to the arithmetic mean, which is what most people think of with the word "average", except that the numbers are multiplied and then the nth root (where n is the count of numbers in the set) of the resulting product is taken. For instance, the geometric mean of two numbers, say 2 and 8, is just the square root of their product, which equals 4; that is, √(2 × 8) = 4. As another example, the geometric mean of the three numbers 1, 1/2, 1/4 is the cube root of their product (1/8), which is 1/2; that is, (1 × 1/2 × 1/4)^(1/3) = 1/2. The geometric mean can also be understood in terms of geometry. The geometric mean of two numbers, a and b, is the length of one side of a square whose area is equal to the area of a rectangle with sides of lengths a and b. Similarly, the geometric mean of three numbers, a, b, and c, is the length of one side of a cube whose volume is the same as that of a cuboid with sides whose lengths are equal to the three given numbers. The geometric mean only applies to positive numbers. It is also often used for a set of numbers whose values are meant to be multiplied together or are exponential in nature, such as data on the growth of the human population or interest rates of a financial investment. The geometric mean is also one of the three classic Pythagorean means, together with the aforementioned arithmetic mean and the harmonic mean.

The geometric mean of a data set is less than or equal to the data set's arithmetic mean (the two means are equal if and only if all members of the data set are equal). This allows the definition of the arithmetic-geometric mean, a mixture of the two which always lies in between. The geometric mean is also the arithmetic-harmonic mean in the sense that if two sequences (an) and (hn) are defined by

a(n+1) = (a(n) + h(n)) / 2, starting from a(1) = (x + y) / 2,

and

h(n+1) = 2·a(n)·h(n) / (a(n) + h(n)), starting from h(1) = 2xy / (x + y),



then a(n) and h(n) will converge to the geometric mean of x and y. This can be seen easily from the fact that the sequences do converge to a common limit (which can be shown by the Bolzano-Weierstrass theorem) and the fact that the geometric mean is preserved: √(a(n+1)·h(n+1)) = √(a(n)·h(n)) = √(x·y).

Replacing arithmetic and harmonic mean by a pair of generalized means of opposite, finite exponents yields the same result.

Relationship with arithmetic mean of logarithms


By using logarithmic identities to transform the formula, we can express the multiplications as a sum and the power as a multiplication:

(a1 × a2 × ... × an)^(1/n) = exp[(ln a1 + ln a2 + ... + ln an) / n]

This is sometimes called the log-average. It is simply computing the arithmetic mean of the logarithm-transformed values of ai (i.e., the arithmetic mean on the log scale) and then using exponentiation to return the computation to the original scale, i.e., it is the generalised f-mean with f(x) = log x.

For ungrouped (individual) data: GM = antilog[(Σ log x) / n]

For a frequency distribution: GM = antilog[(Σ f·log x) / N], where N = Σ f

The geometric mean is an average that is useful for sets of positive numbers that are interpreted according to their product and not their sum (as is the case with the arithmetic mean) e.g. rates of growth.

For example, the geometric mean of the six values 34, 27, 45, 55, 22, 34 is (34 × 27 × 45 × 55 × 22 × 34)^(1/6) ≈ 34.5.

Harmonic Mean:- In mathematics, the harmonic mean (formerly sometimes called the subcontrary mean) is one of several kinds of average. Typically, it is appropriate for situations when the average of rates is desired.


The harmonic mean H of the positive real numbers x1, x2, ..., xn is defined to be H = n / (1/x1 + 1/x2 + ... + 1/xn).

Equivalently, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. The harmonic mean is an average which is useful for sets of numbers which are defined in relation to some unit, for example speed (distance per unit of time).

For example, the harmonic mean of the six values 34, 27, 45, 55, 22, and 34 is 6 / (1/34 + 1/27 + 1/45 + 1/55 + 1/22 + 1/34) ≈ 33.0.
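A quick check of the three means on the six values used in the examples above (Python 3.8+ standard library assumed); it also confirms the ordering discussed in the next section:

```python
from statistics import mean, geometric_mean, harmonic_mean

values = [34, 27, 45, 55, 22, 34]

am = mean(values)
gm = geometric_mean(values)  # equivalent to exponentiating the mean of the logs
hm = harmonic_mean(values)

print(f"AM = {am:.3f}, GM = {gm:.3f}, HM = {hm:.3f}")
print(am >= gm >= hm)  # True: the AM >= GM >= HM inequality
```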

Relationship between AM, GM, and HM


Main article: Inequality of arithmetic and geometric means. AM, GM, and HM satisfy the inequality AM ≥ GM ≥ HM.

Equality holds only when all the elements of the given sample are equal.

Merits and Limitations of the Geometric Mean:-

Merits of GM:
1. It is based on all the items of the data.
2. It is rigidly defined; different investigators will find the same result from the given set of data.
3. It is a relative measure and gives less importance to large items and more to small ones, unlike the arithmetic mean.
4. The geometric mean is useful in ratios and percentages and in determining rates of increase or decrease.
5. It is capable of algebraic treatment; we can find the combined geometric mean of two or more series.

Demerits or Limitations of GM:
1. It is not easily understood and is therefore not widely used.
2. It is difficult to compute, as it involves the knowledge of ratios, roots, logs and antilogs.
3. It becomes indeterminate if any value in the given series happens to be zero or negative.
4. With open-end class intervals of the data, the geometric mean cannot be calculated.
5. The geometric mean may not correspond to any value of the given data.


Q 3. (b) What do you understand by the concept of Probability? Explain various theories of probability. Answer:- Probability is a way of expressing knowledge or belief that an event will occur or has occurred. In mathematics the concept has been given an exact meaning in probability theory, which is used extensively in such areas of study as mathematics, statistics, finance, gambling, science, and philosophy to draw conclusions about the likelihood of potential events and the underlying mechanics of complex systems.

The word probability does not have a consistent direct definition. In fact, there are two broad categories of probability interpretations, whose adherents possess different (and sometimes conflicting) views about the fundamental nature of probability:

1. Frequentists talk about probabilities only when dealing with experiments that are random and well-defined. The probability of a random event denotes the relative frequency of occurrence of an experiment's outcome when repeating the experiment. Frequentists consider probability to be the relative frequency "in the long run" of outcomes.
2. Bayesians, however, assign probabilities to any statement whatsoever, even when no random process is involved. Probability, for a Bayesian, is a way to represent an individual's degree of belief in a statement, or an objective degree of rational belief, given the evidence.

The word probability derives from probity, a measure of the authority of a witness in a legal case in Europe, often correlated with the witness's nobility. In a sense, this differs much from the modern meaning of probability, which, in contrast, is used as a measure of the weight of empirical evidence, and is arrived at from inductive reasoning and statistical inference.

The scientific study of probability is a modern development. Gambling shows that there has been an interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions of use in those problems only arose much later. According to Richard Jeffrey, "Before the middle of the seventeenth century, the term 'probable' (Latin probabilis) meant approvable, and was applied in that sense, univocally, to opinion and to action. A probable action or opinion was one such as sensible people would undertake or hold, in the circumstances." However, in legal contexts especially, 'probable' could also apply to propositions for which there was good evidence. Aside from some elementary considerations made by Girolamo Cardano in the 16th century, the doctrine of probabilities dates to the correspondence of Pierre de Fermat and Blaise Pascal (1654). Christiaan Huygens (1657) gave the earliest known scientific treatment of the subject. Jakob Bernoulli's Ars Conjectandi (posthumous, 1713) and Abraham de Moivre's Doctrine of Chances (1718) treated the subject as a branch of mathematics. See Ian Hacking's The Emergence of Probability and James Franklin's The Science of Conjecture for histories of the early development of the very concept of mathematical probability.


The theory of errors may be traced back to Roger Cotes's Opera Miscellanea (posthumous, 1722), but a memoir prepared by Thomas Simpson in 1755 (printed 1756) first applied the theory to the discussion of errors of observation. The reprint (1757) of this memoir lays down the axioms that positive and negative errors are equally probable, and that there are certain assignable limits within which all errors may be supposed to fall; continuous errors are discussed and a probability curve is given. Pierre-Simon Laplace (1774) made the first attempt to deduce a rule for the combination of observations from the principles of the theory of probabilities. He represented the law of probability of errors by a curve y = φ(x), x being any error and y its probability, and laid down three properties of this curve:

1. it is symmetric as to the y-axis;
2. the x-axis is an asymptote, the probability of an infinite error being 0;
3. the area enclosed is 1, it being certain that an error exists.

He also gave (1781) a formula for the law of facility of error (a term due to Lagrange, 1774), but one which led to unmanageable equations. Daniel Bernoulli (1778) introduced the principle of the maximum product of the probabilities of a system of concurrent errors. The method of least squares is due to Adrien-Marie Legendre (1805), who introduced it in his Nouvelles méthodes pour la détermination des orbites des comètes (New Methods for Determining the Orbits of Comets). In ignorance of Legendre's contribution, an Irish-American writer, Robert Adrain, editor of "The Analyst" (1808), first deduced the law of facility of error, φ(x) = c·e^(-h²x²),

h being a constant depending on precision of observation, and c a scale factor ensuring that the area under
the curve equals 1. He gave two proofs, the second being essentially the same as John Herschel's (1850). Gauss gave the first proof which seems to have been known in Europe (the third after Adrain's) in 1809. Further proofs were given by Laplace (1810, 1812), Gauss (1823), James Ivory (1825, 1826), Hagen (1837), Friedrich Bessel (1838), W. F. Donkin (1844, 1856), and Morgan Crofton (1870). Other contributors were Ellis (1844), De Morgan (1864), Glaisher (1872), and Giovanni Schiaparelli (1875). Peters's (1856) formula for r, the probable error of a single observation, is well known. In the nineteenth century authors on the general theory included Laplace, Sylvestre Lacroix (1816), Littrow (1833), Adolphe Quetelet (1853), Richard Dedekind (1860), Helmert (1872), Hermann Laurent (1873), Liagre, Didion, and Karl Pearson. Augustus De Morgan and George Boole improved the exposition of the theory. Andrey Markov introduced the notion of Markov chains (1906), which play an important role in the theory of stochastic processes and its applications. The modern theory of probability based on measure theory was developed by Andrey Kolmogorov (1931).

Mathematical treatment
In mathematics, the probability of an event A is represented by a real number in the range from 0 to 1 and written as P(A), p(A) or Pr(A). An impossible event has a probability of 0, and a certain event has a probability of 1. However, the converses are not always true: probability 0 events are not always impossible,


nor probability 1 events certain. The rather subtle distinction between "certain" and "probability 1" is treated at greater length in the article on "almost surely". The opposite or complement of an event A is the event [not A] (that is, the event of A not occurring); its probability is given by P(not A) = 1 - P(A). As an example, the chance of not rolling a six on a six-sided die is 1 - (chance of rolling a six) = 1 - 1/6 = 5/6. See Complementary event for a more complete treatment.

If both the events A and B occur on a single performance of an experiment, this is called the intersection or joint probability of A and B, denoted as P(A ∩ B). If two events, A and B, are independent then the joint probability is

P(A ∩ B) = P(A) P(B);

for example, if two coins are flipped the chance of both being heads is 1/2 × 1/2 = 1/4. If either event A or event B or both events occur on a single performance of an experiment, this is called the union of the events A and B, denoted as P(A ∪ B). If two events are mutually exclusive, then the probability of either occurring is

P(A ∪ B) = P(A) + P(B).

For example, the chance of rolling a 1 or a 2 on a six-sided die is P(1 or 2) = P(1) + P(2) = 1/6 + 1/6 = 1/3.

If the events are not mutually exclusive then P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

For example, when drawing a single card at random from a regular deck of cards, the chance of getting a heart or a face card (J, Q, K) (or one that is both) is 13/52 + 12/52 - 3/52 = 22/52 ≈ 0.423, because of the 52 cards of a deck 13 are hearts, 12 are face cards, and 3 are both: here the possibilities included in the "3 that are both" are included in each of the "13 hearts" and the "12 face cards" but should only be counted once. Conditional probability is the probability of some event A, given the occurrence of some other event B. Conditional probability is written P(A|B), and is read "the probability of A, given B". It is defined by

P(A|B) = P(A ∩ B) / P(B).

If P(B) = 0, then P(A|B) is undefined.
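A small sketch that verifies the card-deck figure and the conditional-probability definition above by enumerating the 52-card sample space:

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

hearts = {c for c in deck if c[1] == "hearts"}
faces = {c for c in deck if c[0] in ("J", "Q", "K")}

def p(event):
    """Probability of an event (a subset of the deck) under equally likely draws."""
    return Fraction(len(event), len(deck))

print(p(hearts | faces))                          # 11/26, i.e. 22/52
print(p(hearts) + p(faces) - p(hearts & faces))   # same value by the addition rule

# Conditional probability: P(face | heart) = P(face and heart) / P(heart)
print(p(faces & hearts) / p(hearts))              # 3/13
```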

Theories of Probabilities:- Like other theories, the theory of probability is a representation of probabilistic concepts in formal terms, that is, in terms that can be considered separately from their meaning. These formal terms are manipulated by the rules of mathematics and logic, and any results are then interpreted or translated back into the problem domain.


Probability theory is the branch of mathematics concerned with analysis of random phenomena. The central objects of probability theory are random variables, stochastic processes, and events: mathematical abstractions of non-deterministic events or measured quantities that may either be single occurrences or evolve over time in an apparently random fashion. Although an individual coin toss or the roll of a die is a random event, if repeated many times the sequence of random events will exhibit certain statistical patterns, which can be studied and predicted. Two representative mathematical results describing such patterns are the law of large numbers and the central limit theorem. As a mathematical foundation for statistics, probability theory is essential to many human activities that involve quantitative analysis of large sets of data. Methods of probability theory also apply to descriptions of complex systems given only partial knowledge of their state, as in statistical mechanics. A great discovery of twentieth century physics was the probabilistic nature of physical phenomena at atomic scales, described in quantum mechanics.

There have been at least two successful attempts to formalize probability, namely the Kolmogorov formulation and the Cox formulation. In Kolmogorov's formulation (see probability space), sets are interpreted as events and probability itself as a measure on a class of sets. In Cox's theorem, probability is taken as a primitive (that is, not further analyzed) and the emphasis is on constructing a consistent assignment of probability values to propositions. In both cases, the laws of probability are the same, except for technical details. There are other methods for quantifying uncertainty, such as the Dempster-Shafer theory or possibility theory, but those are essentially different and not compatible with the laws of probability as they are usually understood.

Discrete probability distributions


Discrete probability theory deals with events that occur in countable sample spaces. Examples: throwing dice, experiments with decks of cards, and random walks.

Classical definition: Initially the probability of an event was defined as the number of cases favorable for the event over the number of total outcomes possible in an equiprobable sample space. For example, if the event is "occurrence of an even number when a die is rolled", the probability is given by 3/6 = 1/2, since 3 faces out of the 6 have even numbers and each face has the same probability of appearing.

The modern definition starts with a set called the sample space, which relates to the set of all possible outcomes in the classical sense, denoted by Ω. It is then assumed that for each element x in Ω, an intrinsic "probability" value f(x) is attached, which satisfies the following properties:

1. f(x) lies in [0, 1] for every x in Ω;
2. the sum of f(x) over all x in Ω equals 1.

That is, the probability function f(x) lies between zero and one for every value of x in the sample space Ω, and the sum of f(x) over all values x in the sample space is equal to 1. An event is defined as any subset E of the sample space Ω. The probability of the event E is defined as P(E) = the sum of f(x) over all x in E.

So, the probability of the entire sample space is 1, and the probability of the null event is 0. The function mapping a point in the sample space to the "probability" value is called a probability mass function abbreviated as pmf. The modern definition does not try to answer how probability mass functions are obtained; instead it builds a theory that assumes their existence.
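As a minimal sketch of a probability mass function, the fair-die example above can be written out directly:

```python
from fractions import Fraction

# pmf of a fair six-sided die
omega = [1, 2, 3, 4, 5, 6]
f = {x: Fraction(1, 6) for x in omega}

print(sum(f.values()))           # 1 -- the probability of the entire sample space
even = {2, 4, 6}                 # an event: a subset of the sample space
print(sum(f[x] for x in even))   # P(even) = 1/2
```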

Continuous probability distributions


Main article: Continuous probability distribution

Continuous probability theory deals with events that occur in a continuous sample space.

Classical definition: The classical definition breaks down when confronted with the continuous case. See Bertrand's paradox.

Modern definition: If the outcome space of a random variable X is the set of real numbers (R) or a subset thereof, then a function called the cumulative distribution function (or cdf) F exists, defined by F(x) = P(X ≤ x). That is, F(x) returns the probability that X will be less than or equal to x. The cdf necessarily satisfies the following properties:

1. F is a monotonically non-decreasing, right-continuous function;
2. F(x) tends to 0 as x tends to minus infinity;
3. F(x) tends to 1 as x tends to plus infinity.

If F is absolutely continuous, i.e., its derivative exists and integrating the derivative gives us the cdf back again, then the random variable X is said to have a probability density function, or pdf, or simply density f(x) = dF(x)/dx.

For a set E ⊆ R, the probability of the random variable X being in E is

P(X ∈ E) = ∫_{x ∈ E} dF(x).

In case the probability density function exists, this can be written as

P(X ∈ E) = ∫_E f(x) dx.

Whereas the pdf exists only for continuous random variables, the cdf exists for all random variables (including discrete random variables) that take values in R. These concepts can be generalized for multidimensional cases on R^n and other continuous sample spaces.
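A small sketch of the cdf/pdf relationship, using an exponential distribution (chosen here only for illustration) and a crude numerical integration of the density:

```python
import math

# Exponential distribution with rate lam: F(x) = 1 - exp(-lam*x), f(x) = lam*exp(-lam*x)
lam = 2.0
F = lambda x: 1 - math.exp(-lam * x) if x >= 0 else 0.0
f = lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0

# P(a < X <= b) from the cdf, and the same probability by integrating the pdf numerically
a, b = 0.5, 1.5
print(F(b) - F(a))

n = 100_000
h = (b - a) / n
print(sum(f(a + (i + 0.5) * h) for i in range(n)) * h)  # midpoint-rule approximation
```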

Measure-theoretic probability theory




The raison d'être of the measure-theoretic treatment of probability is that it unifies the discrete and the continuous cases, and makes the difference a question of which measure is used. Furthermore, it covers distributions that are neither discrete nor continuous nor mixtures of the two. An example of such a distribution is a mix of discrete and continuous distributions, for example a random variable which is 0 with probability 1/2 and takes a random value from a normal distribution with probability 1/2. It can still be studied to some extent by considering it to have a pdf of (δ[x] + φ(x))/2, where δ[x] is the Dirac delta function and φ(x) the normal density.
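The mixed distribution described above is easy to simulate; a minimal sketch:

```python
import random

def sample_mixed():
    """Draw from the mixed distribution above:
    0 with probability 1/2, otherwise a standard normal value."""
    return 0.0 if random.random() < 0.5 else random.gauss(0.0, 1.0)

random.seed(1)
draws = [sample_mixed() for _ in range(100_000)]
print(sum(d == 0.0 for d in draws) / len(draws))  # ~0.5: the discrete atom at zero
print(sum(d < -1 for d in draws) / len(draws))    # ~0.5 * P(Z < -1), about 0.079
```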

Other distributions may not even be a mix; for example, the Cantor distribution has no positive probability for any single point, neither does it have a density. The modern approach to probability theory solves these problems using measure theory to define the probability space: given any set Ω (also called the sample space) and a σ-algebra F on it, a measure P defined on F is called a probability measure if P(Ω) = 1. If F is the Borel σ-algebra on the set of real numbers, then there is a unique probability measure on F for any cdf, and vice versa. The measure corresponding to a cdf is said to be induced by the cdf. This measure coincides with the pmf for discrete variables, and with the pdf for continuous variables, making the measure-theoretic approach free of fallacies. The probability of a set E in the σ-algebra F is defined as

P(E) = ∫_{x ∈ E} dμ_F(x),

where the integration is with respect to the measure μ_F induced by F.

Along with providing a better understanding and unification of discrete and continuous probabilities, the measure-theoretic treatment also allows us to work on probabilities outside R^n, as in the theory of stochastic processes. For example, to study Brownian motion, probability is defined on a space of functions.

Q 4. (a) In any business all strategic and corporate policies/decisions are based on sampling. Define sampling techniques and the merits of sampling; support your answer with relevant examples. Answer:- Sampling is that part of statistical practice concerned with the selection of an unbiased or random subset of individual observations within a population of individuals, intended to yield some knowledge about the population of concern, especially for the purposes of making predictions based on statistical inference. Sampling is an important aspect of data collection. Researchers rarely survey the entire population for two reasons (Adèr, Mellenbergh, & Hand, 2008): the cost is too high, and the population is dynamic in that the individuals making up the population may change over time. The three main advantages of sampling are that the cost is lower, data collection is faster, and since the data set is smaller it is possible to ensure homogeneity and to improve the accuracy and quality of the data.



Each observation measures one or more properties (such as weight, location, color) of observable bodies distinguished as independent objects or individuals. In survey sampling, survey weights can be applied to the data to adjust for the sample design. Results from probability theory and statistical theory are employed to guide practice. In business and medical research, sampling is widely used for gathering information about a population.

Process, Techniques and Merits of Sampling

The sampling process and its techniques comprise several stages:

1. Defining the population of concern
2. Specifying a sampling frame, a set of items or events possible to measure
3. Specifying a sampling method for selecting items or events from the frame
4. Determining the sample size
5. Implementing the sampling plan
6. Sampling and data collecting
7. Reviewing the sampling process
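As a simple illustration of implementing a sampling plan, a sketch of simple random sampling from a hypothetical frame of customer IDs (the frame and IDs are invented):

```python
import random

# Hypothetical sampling frame: 500 customer IDs; we draw a simple random sample of 50.
frame = [f"CUST-{i:04d}" for i in range(1, 501)]

random.seed(42)
sample = random.sample(frame, k=50)  # sampling without replacement

print(len(sample), sample[:5])
```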

Population definition
Successful statistical practice is based on focused problem definition. In sampling, this includes defining the population from which our sample is drawn. A population can be defined as including all people or items with the characteristic one wishes to understand. Because there is very rarely enough time or money to gather information from everyone or everything in a population, the goal becomes finding a representative sample (or subset) of that population. Sometimes that which defines a population is obvious. For example, a manufacturer needs to decide whether a batch of material from production is of high enough quality to be released to the customer, or should be sentenced for scrap or rework due to poor quality. In this case, the batch is the population. Although the population of interest often consists of physical objects, sometimes we need to sample over time, space, or some combination of these dimensions. For instance, an investigation of supermarket staffing could examine checkout line length at various times, or a study on endangered penguins might aim to understand their usage of various hunting grounds over time. For the time dimension, the focus may be on periods or discrete occasions. In other cases, our 'population' may be even less tangible. For example, Joseph Jagger studied the behaviour of roulette wheels at a casino in Monte Carlo, and used this to identify a biased wheel. In this case, the 'population' Jagger wanted to investigate was the overall behaviour of the wheel (i.e. the probability distribution of its results over infinitely many trials), while his 'sample' was formed from observed results from that wheel. Similar considerations arise when taking repeated measurements of some physical characteristic such as the electrical conductivity of copper. This situation often arises when we seek knowledge about the cause system of which the observed population is an outcome. In such cases, sampling theory may treat the observed population as a sample from a larger 'superpopulation'. For example, a researcher might study the success rate of a new 'quit smoking' program on a test group of 100 patients, in order to predict the effects of the program if it were made available nationwide. Here the superpopulation is "everybody in the country, given access to this treatment" - a group which does not yet exist, since the program isn't yet available to all.


Note also that the population from which the sample is drawn may not be the same as the population about which we actually want information. Often there is large but not complete overlap between these two groups due to frame issues etc (see below). Sometimes they may be entirely separate - for instance, we might study rats in order to get a better understanding of human health, or we might study records from people born in 2008 in order to make predictions about people born in 2009. Time spent in making the sampled population and population of concern precise is often well spent, because it raises many issues, ambiguities and questions that would otherwise have been overlooked at this stage.

Sampling frame
In the most straightforward case, such as the sentencing of a batch of material from production (acceptance sampling by lots), it is possible to identify and measure every single item in the population and to include any one of them in our sample. However, in the more general case this is not possible. There is no way to identify all rats in the set of all rats. Where voting is not compulsory, there is no way to identify which people will actually vote at a forthcoming election (in advance of the election). These imprecise populations are not amenable to sampling in any of the ways below and to which we could apply statistical theory. As a remedy, we seek a sampling frame which has the property that we can identify every single element and include any in our sample. The most straightforward type of frame is a list of elements of the population (preferably the entire population) with appropriate contact information. For example, in an opinion poll, possible sampling frames include:

Electoral register
Telephone directory

Not all frames explicitly list population elements. For example, a street map can be used as a frame for a door-to-door survey; although it doesn't show individual houses, we can select streets from the map and then visit all houses on those streets. (One advantage of such a frame is that it would include people who have recently moved and are not yet on the list frames discussed above.) The sampling frame must be representative of the population and this is a question outside the scope of statistical theory demanding the judgment of experts in the particular subject matter being studied. All the above frames omit some people who will vote at the next election and contain some people who will not; some frames will contain multiple records for the same person. People not in the frame have no prospect of being sampled. Statistical theory tells us about the uncertainties in extrapolating from a sample to the frame. In extrapolating from frame to population, its role is motivational and suggestive.
"To the scientist, however, representative sampling is the only justified procedure for choosing individual objects for use as the basis of generalization, and is therefore usually the only acceptable basis for ascertaining truth." - Andrew A. Marino

It is important to understand this difference to steer clear of confusing prescriptions found in many web pages. In defining the frame, practical, economic, ethical, and technical issues need to be addressed. The need to obtain timely results may prevent extending the frame far into the future. The difficulties can be extreme when the population and frame are disjoint. This is a particular problem in forecasting where inferences about the future are made from historical data. In fact, in 1703, when Jacob
Bernoulli proposed to Gottfried Leibniz the possibility of using historical mortality data to predict the probability of early death of a living man, Gottfried Leibniz recognized the problem in replying:
"Nature has established patterns originating in the return of events but only for the most part. New illnesses flood the human race, so that no matter how many experiments you have done on corpses, you have not thereby imposed a limit on the nature of events so that in the future they could not vary." - Gottfried Leibniz

Kish posited four basic problems of sampling frames:


1. Missing elements: Some members of the population are not included in the frame.
2. Foreign elements: Non-members of the population are included in the frame.
3. Duplicate entries: A member of the population is surveyed more than once.
4. Groups or clusters: The frame lists clusters instead of individuals.

A frame may also provide additional 'auxiliary information' about its elements; when this information is related to variables or groups of interest, it may be used to improve survey design. For instance, an electoral register might include name and sex; this information can be used to ensure that a sample taken from that frame covers all demographic categories of interest. (Sometimes the auxiliary information is less explicit; for instance, a telephone number may provide some information about location.) Having established the frame, there are a number of ways for organizing it to improve efficiency and effectiveness. It's at this stage that the researcher should decide whether the sample is in fact to be the whole population and would therefore be a census.

Probability and non-probability sampling


A probability sampling scheme is one in which every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined. The combination of these traits makes it possible to produce unbiased estimates of population totals, by weighting sampled units according to their probability of selection. Example: We want to estimate the total income of adults living in a given street. We visit each household in that street, identify all adults living there, and randomly select one adult from each household. (For example, we can allocate each person a random number, generated from a uniform distribution between 0 and 1, and select the person with the highest number in each household). We then interview the selected person and find their income. People living on their own are certain to be selected, so we simply add their income to our estimate of the total. But a person living in a household of two adults has only a one-in-two chance of selection. To reflect this, when we come to such a household, we would count the selected person's income twice towards the total. (In effect, the person who is selected from that household is taken as representing the person who isn't selected.) In the above example, not everybody has the same probability of selection; what makes it a probability sample is the fact that each person's probability is known. When every element in the population does have the same probability of selection, this is known as an 'equal probability of selection' (EPS) design. Such designs are also referred to as 'self-weighting' because all sampled units are given the same weight.
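The weighting logic of this example can be illustrated with a short sketch. The following is a minimal Python illustration; the incomes and household sizes are invented, and only the inverse-probability weighting idea comes from the example above.

import random

# Hypothetical street: each household is a list of the incomes of its adult members.
households = [
    [30000],                    # one adult - selected with probability 1
    [42000, 25000],             # two adults - each selected with probability 1/2
    [38000, 51000, 27000],      # three adults - each selected with probability 1/3
]

random.seed(1)
estimated_total = 0.0
for adults in households:
    selected_income = random.choice(adults)    # random selection within the household
    selection_probability = 1 / len(adults)    # known, but not equal across households
    # Weighting by the inverse of the selection probability lets the selected person
    # stand in for the housemates who were not selected.
    estimated_total += selected_income / selection_probability

print("Estimated total income for the street:", estimated_total)

Because each selected income is divided by a known selection probability, a person drawn from a two-adult household is counted twice towards the total, exactly as described in the example.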


Probability sampling includes: Simple Random Sampling, Systematic Sampling, Stratified Sampling, Probability Proportional to Size Sampling, and Cluster or Multistage Sampling. These various ways of probability sampling have two things in common:
1. Every element has a known nonzero probability of being sampled.
2. The procedure involves random selection at some point.

Nonprobability sampling is any sampling method where some elements of the population have no chance of selection (these are sometimes referred to as 'out of coverage'/'undercovered'), or where the probability of selection can't be accurately determined. It involves the selection of elements based on assumptions regarding the population of interest, which forms the criteria for selection. Hence, because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors. These conditions give rise to exclusion bias, placing limits on how much information a sample can provide about the population. Information about the relationship between sample and population is limited, making it difficult to extrapolate from the sample to the population. Example: We visit every household in a given street, and interview the first person to answer the door. In any household with more than one occupant, this is a nonprobability sample, because some people are more likely to answer the door (e.g. an unemployed person who spends most of their time at home is more likely to answer than an employed housemate who might be at work when the interviewer calls) and it's not practical to calculate these probabilities. Nonprobability Sampling includes: Accidental Sampling, Quota Sampling and Purposive Sampling. In addition, nonresponse effects may turn any probability design into a nonprobability design if the characteristics of nonresponse are not well understood, since nonresponse effectively modifies each element's probability of being sampled.

Sampling methods
Within any of the types of frame identified above, a variety of sampling methods can be employed, individually or in combination. Factors commonly influencing the choice between these designs include:

Nature and quality of the frame
Availability of auxiliary information about units on the frame
Accuracy requirements, and the need to measure accuracy
Whether detailed analysis of the sample is expected
Cost/operational concerns

Simple random sampling


In a simple random sample ('SRS') of a given size, all such subsets of the frame are given an equal probability. Each element of the frame thus has an equal probability of selection: the frame is not subdivided or partitioned. Furthermore, any given pair of elements has the same chance of selection as any other such pair (and similarly for triples, and so on). This minimises bias and simplifies analysis of results. In particular, the variance between individual results within the sample is a good indicator of variance in the overall population, which makes it relatively easy to estimate the accuracy of results. However, SRS can be vulnerable to sampling error because the randomness of the selection may result in a sample that doesn't reflect the makeup of the population. For instance, a simple random sample of ten people from a given country will on average produce five men and five women, but any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and stratified techniques, discussed below,
attempt to overcome this problem by using information about the population to choose a more representative sample. SRS may also be cumbersome and tedious when sampling from an unusually large target population. In some cases, investigators are interested in research questions specific to subgroups of the population. For example, researchers might be interested in examining whether cognitive ability as a predictor of job performance is equally applicable across racial groups. SRS cannot accommodate the needs of researchers in this situation because it does not provide subsamples of the population. Stratified sampling, which is discussed below, addresses this weakness of SRS. Simple random sampling is always an EPS design, but not all EPS designs are simple random sampling.
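As a minimal illustration of drawing a simple random sample, the sketch below uses Python's standard library; the frame of 1,000 units and the sample size of 50 are invented for the example.

import random

# Hypothetical sampling frame: 1,000 numbered units (e.g. entries on a list).
frame = list(range(1, 1001))

random.seed(42)
sample = random.sample(frame, 50)   # every subset of size 50 is equally likely to be drawn

print(sorted(sample))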

Systematic sampling
Systematic sampling relies on arranging the target population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every kth element from then onwards. In this case, k=(population size/sample size). It is important that the starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the kth element in the list. A simple example would be to select every 10th name from the telephone directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10'). As long as the starting point is randomized, systematic sampling is a type of probability sampling. It is easy to implement and the stratification induced can make it efficient, if the variable by which the list is ordered is correlated with the variable of interest. 'Every 10th' sampling is especially useful for efficient sampling from databases. Example: Suppose we wish to sample people from a long street that starts in a poor district (house #1) and ends in an expensive district (house #1000). A simple random selection of addresses from this street could easily end up with too many from the high end and too few from the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th street number along the street ensures that the sample is spread evenly along the length of the street, representing all of these districts. (Note that if we always start at house #1 and end at #991, the sample is slightly biased towards the low end; by randomly selecting the start between #1 and #10, this bias is eliminated.) However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple or factor of the interval used, the sample is especially likely to be unrepresentative of the overall population, making the scheme less accurate than simple random sampling. Example: Consider a street where the odd-numbered houses are all on the north (expensive) side of the road, and the even-numbered houses are all on the south (cheap) side. Under the sampling scheme given above, it is impossible' to get a representative sample; either the houses sampled will all be from the oddnumbered, expensive side, or they will all be from the even-numbered, cheap side. Another drawback of systematic sampling is that even in scenarios where it is more accurate than SRS, its theoretical properties make it difficult to quantify that accuracy. (In the two examples of systematic sampling that are given above, much of the potential sampling error is due to variation between neighbouring houses - but because this method never selects two neighboring houses, the sample will not give us any information on that variation.)
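A minimal Python sketch of the procedure just described follows; the frame of 1,000 house numbers, the sample size and the helper name systematic_sample are illustrative assumptions, not from the text.

import random

def systematic_sample(frame, sample_size):
    # k = population size / sample size; the start is randomised within the first k elements.
    k = len(frame) // sample_size
    start = random.randint(0, k - 1)
    return [frame[i] for i in range(start, len(frame), k)][:sample_size]

random.seed(7)
houses = list(range(1, 1001))             # house numbers 1 to 1000 along the street
print(systematic_sample(houses, 100))     # an 'every 10th' sample with a random start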


As described above, systematic sampling is an EPS method, because all elements have the same probability of selection (in the example given, one in ten). It is not 'simple random sampling' because different subsets of the same size have different selection probabilities - e.g. the set {4,14,24,...,994} has a one-in-ten probability of selection, but the set {4,13,24,34,...} has zero probability of selection. Systematic sampling can also be adapted to a non-EPS approach; for an example, see discussion of PPS samples below.

Stratified sampling
Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata." Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected. There are several potential benefits to stratified sampling. First, dividing the population into distinct, independent strata can enable researchers to draw inferences about specific subgroups that may be lost in a more generalized random sample. Second, utilizing a stratified sampling method can lead to more efficient statistical estimates (provided that strata are selected based upon relevance to the criterion in question, instead of availability of the samples). Even if a stratified sampling approach does not lead to increased statistical efficiency, such a tactic will not result in less efficiency than would simple random sampling, provided that each stratum is proportional to the group's size in the population. Third, it is sometimes the case that data are more readily available for individual, pre-existing strata within a population than for the overall population; in such cases, using a stratified sampling approach may be more convenient than aggregating data across groups (though this may potentially be at odds with the previously noted importance of utilizing criterion-relevant strata). Finally, since each stratum is treated as an independent population, different sampling approaches can be applied to different strata, potentially enabling researchers to use the approach best suited (or most cost-effective) for each identified subgroup within the population. There are, however, some potential drawbacks to using stratified sampling. First, identifying strata and implementing such an approach can increase the cost and complexity of sample selection, as well as leading to increased complexity of population estimates. Second, when examining multiple criteria, stratifying variables may be related to some, but not to others, further complicating the design, and potentially reducing the utility of the strata. Finally, in some cases (such as designs with a large number of strata, or those with a specified minimum sample size per group), stratified sampling can potentially require a larger sample than would other methods (although in most cases, the required sample size would be no larger than would be required for simple random sampling).
A stratified sampling approach is most effective when three conditions are met:
1. Variability within strata is minimized.
2. Variability between strata is maximized.
3. The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
Advantages over other sampling methods
1. Focuses on important subpopulations and ignores irrelevant ones.

IIMM/DH/02/2006/8154, Business Statistics

2. Allows use of different sampling techniques for different subpopulations.
3. Improves the accuracy/efficiency of estimation.
4. Permits greater balancing of statistical power of tests of differences between strata by sampling equal numbers from strata varying widely in size.
Disadvantages
1. Requires selection of relevant stratification variables, which can be difficult.
2. Is not useful when there are no homogeneous subgroups.
3. Can be expensive to implement.
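To make the mechanics concrete, the following is a minimal Python sketch of proportionate stratified sampling; the strata, their sizes and the helper name stratified_sample are hypothetical and only illustrate sampling each stratum independently.

import random

def stratified_sample(strata, total_sample_size):
    # Proportionate allocation: each stratum's sample size is proportional to its share
    # of the population, and each stratum is then sampled independently by SRS.
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_h = round(total_sample_size * len(units) / population_size)
        sample[name] = random.sample(units, n_h)
    return sample

random.seed(3)
# Hypothetical population of employee IDs split into three plants (the strata).
strata = {
    "plant_A": list(range(0, 600)),
    "plant_B": list(range(600, 900)),
    "plant_C": list(range(900, 1000)),
}
sample = stratified_sample(strata, 100)
print({name: len(ids) for name, ids in sample.items()})   # {'plant_A': 60, 'plant_B': 30, 'plant_C': 10}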

Q 4. (b) Define Hypothesis and enumerate procedure for hypothesis testing. What are the common errors you are likely to encounter in the testing Hypothesis. Answer:-

A hypothesis (plural: hypotheses) is a proposed explanation for an observable phenomenon. The term derives from the Greek hypotithenai, meaning "to put under" or "to suppose." For a hypothesis to be put forward as a scientific hypothesis, the scientific method requires that one can test it. Scientists generally base scientific hypotheses on previous observations that cannot satisfactorily be explained with the available scientific theories. Even though the words "hypothesis" and "theory" are often used synonymously in common and informal usage, a scientific hypothesis is not the same as a scientific theory. A working hypothesis is a provisionally accepted hypothesis. In a related but distinguishable usage, the term hypothesis is used for the antecedent of a proposition; thus in the proposition "If P, then Q", P denotes the hypothesis (or antecedent); Q can be called a consequent. P is the assumption in a (possibly counterfactual) What If question. The adjective hypothetical, meaning "having the nature of a hypothesis," or "being assumed to exist as an immediate consequence of a hypothesis," can refer to any of these meanings of the term "hypothesis." In its ancient usage, hypothesis also refers to a summary of the plot of a classical drama.

Scientific hypothesis
People refer to a trial solution to a problem as a hypothesis often called an "educated guess" because it provides a suggested solution based on the evidence. Experimenters may test and reject several hypotheses before solving the problem. According to Schick and Vaughn, researchers weighing up alternative hypotheses may take into consideration:

Testability (compare falsifiability as discussed above)
Simplicity (as in the application of "Occam's razor", discouraging the postulation of excessive numbers of entities)
Scope - the apparent application of the hypothesis to multiple cases of phenomena
Fruitfulness - the prospect that a hypothesis may explain further phenomena in the future
Conservatism - the degree of "fit" with existing recognized knowledge-systems


Evaluating hypotheses
Karl Popper's formulation of the hypothetico-deductive method, which he called the method of "conjectures and refutations", demands falsifiable hypotheses, framed in such a manner that the scientific community can prove them false (usually by observation). According to this view, a hypothesis cannot be "confirmed", because there is always the possibility that a future experiment will show that it is false. Hence, failing to falsify a hypothesis does not prove that hypothesis: it remains provisional. However, a hypothesis that has been rigorously tested and not falsified can form a reasonable basis for action, i.e., we can act as if it were true, until such time as it is falsified. Just because we've never observed rain falling upward doesn't mean that we never will; however improbable, our theory of gravity may be falsified some day. Popper's view is not the only view on evaluating hypotheses. For example, some forms of empiricism hold that under a well-crafted, well-controlled experiment, a lack of falsification does count as verification, since such an experiment ranges over the full scope of possibilities in the problem domain. Should we ever discover some place where gravity did not function, and rain fell upward, this would not falsify our current theory of gravity (which, on this view, has been verified by innumerable well-formed experiments in the past); it would rather suggest an expansion of our theory to encompass some new force or previously undiscovered interaction of forces. In other words, our initial theory as it stands is verified but incomplete. This situation illustrates the importance of having well-crafted, well-controlled experiments that range over the full scope of possibilities for applying the theory. In recent years philosophers of science have tried to integrate the various approaches to evaluating hypotheses, and the scientific method in general, to form a more complete system that integrates the individual concerns of each approach. Notably, Imre Lakatos and Paul Feyerabend, both former students of Popper, have produced novel attempts at such a synthesis.

Hypotheses, Concepts and Measurement


Concepts, as abstract units of meaning, play a key role in the development and testing of hypotheses. Concepts are the basic components of hypotheses. Most formal hypotheses connect concepts by specifying the expected relationships between concepts. For example, a simple relational hypothesis such as education increases income specifies a positive relationship between the concepts education and income. This abstract or conceptual hypothesis cannot be tested. First, it must be operationalized or situated in the real world by rules of interpretation. Consider again the simple hypothesis Education increases Income. To test the hypothesis the abstract meaning of education and income must be derived or operationalized. The concepts should be measured. Education could be measured by years of school completed or highest degree completed etc. Income could be measured by hourly rate of pay or yearly salary etc. When a set of hypotheses are grouped together they become a type of conceptual framework. When a conceptual framework is complex and incorporates causality or explanation it is generally referred to as a theory. According to noted philosopher of science Carl Gustav Hempel An adequate empirical interpretation turns a theoretical system into a testable theory: The hypothesis whose constituent terms have been interpreted become capable of test by reference to observable phenomena. Frequently the interpreted hypothesis will be derivative hypotheses of the theory; but their confirmation or disconfirmation by empirical data will then immediately strengthen or weaken also the primitive hypotheses from which they were derived. Hempel provides a useful metaphor that describes the relationship between a conceptual framework and the framework as it is observed and perhaps tested (interpreted framework). The whole system floats, as it
were, above the plane of observation and is anchored to it by rules of interpretation. These might be viewed as strings which are not part of the network but link certain points of the latter with specific places in the plane of observation. By virtue of those interpretative connections, the network can function as a scientific theory. Hypotheses with concepts anchored in the plane of observation are ready to be tested. In actual scientific practice the process of framing a theoretical structure and of interpreting it are not always sharply separated, since the intended interpretation usually guides the construction of the theoretician. It is, however, possible and indeed desirable, for the purposes of logical clarification, to separate the two steps conceptually.

Statistical hypothesis testing


When a possible correlation or similar relation between phenomena is investigated, such as, for example, whether a proposed remedy is effective in treating a disease, that is, at least to some extent and for some patients, the hypothesis that a relation exists cannot be examined the same way one might examine a proposed new law of nature: in such an investigation a few cases in which the tested remedy shows no effect do not falsify the hypothesis. Instead, statistical tests are used to determine how likely it is that the overall effect would be observed if no real relation as hypothesized exists. If that likelihood is sufficiently small (e.g., less than 1%), the existence of a relation may be assumed. Otherwise, any observed effect may as well be due to pure chance. In statistical hypothesis testing two hypotheses are compared, which are called the null hypothesis and the alternative hypothesis. The null hypothesis is the hypothesis that states that there is no relation between the phenomena whose relation is under investigation, or at least not of the form given by the alternative hypothesis. The alternative hypothesis, as the name suggests, is the alternative to the null hypothesis: it states that there is some kind of relation. The alternative hypothesis may take several forms, depending on the nature of the hypothesized relation; in particular, it can be two-sided (for example: there is some effect, in a yet unknown direction) or one-sided (the direction of the hypothesized relation, positive or negative, is fixed in advance). Proper use of statistical testing requires that these hypotheses, and the threshold (such as 1%) at which the null hypothesis is rejected and the alternative hypothesis is accepted, all be determined in advance, before the observations are collected or inspected. If these criteria are determined later, when the data to be tested is already known, the test is invalid. A statistical hypothesis test is a method of making statistical decisions using experimental data. In statistics, a result is called statistically significant if it is unlikely to have occurred by chance. The phrase "test of significance" was coined by Ronald Fisher: "Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first." Hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data analysis. In frequency probability, these decisions are almost always made using null-hypothesis tests (i.e., tests that answer the question Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?) One use of hypothesis testing is deciding whether experimental results contain enough information to cast doubt on conventional wisdom. Statistical hypothesis testing is a key technique of frequentist statistical inference, and is widely used, but also much criticized. While controversial, the Bayesian approach to hypothesis testing is to base rejection of the hypothesis on the posterior probability. Other approaches to reaching a decision based on data are available via decision theory and optimal decisions.

The critical region of a hypothesis test is the set of all outcomes which, if they occur, will lead us to decide that there is a difference; that is, the outcomes that cause the null hypothesis to be rejected in favor of the alternative hypothesis. The critical region is usually denoted by C.
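As a minimal illustration of the procedure just described (a null and an alternative hypothesis, a threshold fixed in advance, and a decision based on the observed data), the sketch below runs a two-sided one-sample t-test; the sales figures are invented and the SciPy library is assumed to be available.

from scipy import stats

# Hypothetical data: weekly sales after a change; the long-run (null) mean is 500 units.
sales = [512, 498, 530, 541, 488, 507, 522, 515, 499, 534]

alpha = 0.05                                   # threshold fixed before seeing the data
# H0: true mean = 500; H1 (two-sided): true mean differs from 500.
t_statistic, p_value = stats.ttest_1samp(sales, popmean=500)

print("t =", round(t_statistic, 3), "p =", round(p_value, 4))
if p_value < alpha:
    print("Reject H0: the sample mean differs significantly from 500.")
else:
    print("Fail to reject H0: the data are consistent with a mean of 500.")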

Errors in Hypothesis Testing:
Meta-criticism
Criticism is of the application, or of the interpretation, rather than of the method. Criticism of null-hypothesis significance testing is available in other articles (for example "Statistical significance") and their references. Attacks and defenses of the null-hypothesis significance test are collected in Harlow et al. The original purposes of Fisher's formulation, as a tool for the experimenter, was to plan the experiment and to easily assess the information content of the small sample. There is little criticism, Bayesian in nature, of the formulation in its original context. In other contexts, complaints focus on flawed interpretations of the results and over-dependence/emphasis on one test. Numerous attacks on the formulation have failed to supplant it as a criterion for publication in scholarly journals. The most persistent attacks originated from the field of Psychology. After review, the American Psychological Association did not explicitly deprecate the use of null-hypothesis significance testing, but adopted enhanced publication guidelines which implicitly reduced the relative importance of such testing.

Philosophical criticism
Philosophical criticism to hypothesis testing includes consideration of borderline cases. Any process that produces a crisp decision from uncertainty is subject to claims of unfairness near the decision threshold. (Consider close election results.) The premature death of a laboratory rat during testing can impact doctoral theses and academic tenure decisions. "... surely, God loves the .06 nearly as much as the .05" The statistical significance required for publication has no mathematical basis, but is based on long tradition. "It is usual and convenient for experimenters to take 5% as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results."

Pedagogic criticism
Pedagogic criticism of the null-hypothesis testing includes the counter-intuitive formulation, the terminology and confusion about the interpretation of results. "Despite the stranglehold that hypothesis testing has on experimental psychology, I find it difficult to imagine a less insightful means of transiting from data to conclusions."


Students find it difficult to understand the formulation of statistical null-hypothesis testing. In rhetoric, examples often support an argument, but a mathematical proof "is a logical argument, not an empirical one". A single counterexample results in the rejection of a conjecture. Karl Popper defined science by its vulnerability to disproof by data. Null-hypothesis testing shares the mathematical and scientific perspective rather than the more familiar rhetorical one. Students expect hypothesis testing to be a statistical tool for illumination of the research hypothesis by the sample; it is not. The test asks indirectly whether the sample can illuminate the research hypothesis. There is widespread and fundamental disagreement on the interpretation of test results. "A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is almost always false in the real world.... If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal about rejecting it?" (The above criticism only applies to point hypothesis tests. If one were testing, for example, whether a parameter is greater than zero, it would not apply.) "How has the virtually barren technique of hypothesis testing come to assume such importance in the process by which we arrive at our conclusions from our data?" Null-hypothesis testing just answers the question of "how well the findings fit the possibility that chance factors alone might be responsible." "Statistical significance does not necessarily imply practical significance!"

Practical criticism
Practical criticism of hypothesis testing includes the sobering observation that published test results are often contradicted. Mathematical models support the conjecture that most published medical research test results are flawed. Null-hypothesis testing has not achieved the goal of a low error probability in medical journals. Many authors have expressed a strong skepticism, sometimes labeled as postmodernism, about the general unreliability of statistical hypothesis testing to explain many social and medical phenomena. For example, modern statistics do not reliably link exposures of carcinogens to spatial-temporal patterns of cancer incidence. There is not a strong convention in statistical hypothesis testing to consider alternate units of scale. With temporal data the units chosen for temporal aggregation (hour, day, week, year, decade) can completely alter the trends and cycles. With spatial data, the units chosen for analysis (the modifiable areal unit problem) can alter or reverse relationships between variables. If the issue of analysis scale is ignored in a hypothesis test then skepticism about the results is justified.

Straw man
Hypothesis testing is controversial when the alternative hypothesis is suspected to be true at the outset of the experiment, making the null hypothesis the reverse of what the experimenter actually believes; it is put forward as a straw man only to allow the data to contradict it. Many statisticians have pointed out that rejecting the null hypothesis says nothing or very little about the likelihood that the null is true. Under traditional null hypothesis testing, the null is rejected when the conditional probability P(Data as or more extreme than observed | Null) is very small, say 0.05. However, some say researchers are really interested in the probability P(Null | Data as actually observed), which cannot be inferred from a p-value: some like to present these as inverses of each other, but the events "Data as or more extreme than observed" and "Data as actually observed" are very different. In some cases, P(Null | Data) approaches 1 while P(Data as or more
extreme than observed | Null) approaches 0, in other words, we can reject the null when it's virtually certain to be true. For this and other reasons, Gerd Gigerenzer has called null hypothesis testing "mindless statistics" while Jacob Cohen described it as a ritual conducted to convince ourselves that we have the evidence needed to confirm our theories.

Bayesian criticism
Bayesian statisticians reject classical null hypothesis testing, since it violates the Likelihood principle and is thus incoherent and leads to sub-optimal decision-making. The Jeffreys-Lindley paradox illustrates this. Along with many frequentist statisticians, Bayesians prefer to provide an estimate, along with a confidence interval (although Bayesian credible intervals are different from classical confidence intervals). Some Bayesians (James Berger in particular) have developed Bayesian hypothesis testing methods, though these are not accepted by all Bayesians (notably, Andrew Gelman). Given a prior probability distribution for one or more parameters, sample evidence can be used to generate an updated posterior distribution. In this framework, but not in the null hypothesis testing framework, it is meaningful to make statements of the general form "the probability that the true value of the parameter is greater than 0 is p". According to Bayes' theorem, we have:

P(Null | Data) = P(Data | Null) x P(Null) / P(Data)

thus P(Null | Data) may approach 1 while P(Data | Null) approaches 0 only when P(Null)/P(Data) approaches infinity, i.e. (for instance) when the a priori probability of the null hypothesis, P(Null), is also approaching 1, while P(Data) approaches 0: then P(Data | Null) is low because we have extremely unlikely data, but the null hypothesis is extremely likely to be true.

Q 5. (a) What is the Chi Square (χ²) test? Narrate the steps for determining the value of χ² with suitable examples. Explain the conditions for applying χ² and the uses of the Chi Square test. Answer:-

A chi-square test (also chi-squared or χ² test) is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough. Some examples of chi-squared tests where the chi-square distribution is only approximately valid:
Pearson's chi-square test, also known as the chi-square goodness-of-fit test or chi-square test for independence. When mentioned without any modifiers or without other precluding context, this test is usually understood (for an exact test used in place of χ², see Fisher's exact test).
Yates' chi-square test, also known as Yates' correction for continuity.
Mantel-Haenszel chi-square test.
Linear-by-linear association chi-square test.
The portmanteau test in time-series analysis, testing for the presence of autocorrelation.
Likelihood-ratio tests in general statistical modelling, for testing whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one).


One case where the distribution of the test statistic is an exact chi-square distribution is the test that the variance of a normally-distributed population has a given value based on a sample variance. Such a test is uncommon in practice because values of variances to test against are seldom known exactly.

Chi-square test for variance in a normal population


If a sample of size n is taken from a population having a normal distribution, then there is a well-known result (see distribution of the sample variance) which allows a test to be made of whether the variance of the population has a pre-determined value. For example, a manufacturing process might have been in stable condition for a long period, allowing a value for the variance to be determined essentially without error. Suppose that a variant of the process is being tested, giving rise to a small sample of product items whose variation is to be tested. The test statistic T in this instance could be set to be the sum of squares about the sample mean, divided by the nominal value for the variance (i.e. the value to be tested as holding). Then T has a chi-square distribution with n - 1 degrees of freedom. For example, if the sample size is 21, the acceptance region for T for a significance level of 5% is the interval 9.59 to 34.17.
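The 5% acceptance region quoted above (9.59 to 34.17 for a sample of size 21) can be reproduced with a short calculation; this sketch assumes the SciPy library and is only an illustration, not part of the original text.

from scipy.stats import chi2

n = 21                                  # sample size
df = n - 1                              # degrees of freedom for the variance test
alpha = 0.05

lower = chi2.ppf(alpha / 2, df)         # 2.5% point of the chi-square distribution
upper = chi2.ppf(1 - alpha / 2, df)     # 97.5% point

# T = (sum of squares about the sample mean) / (nominal variance) is accepted at
# the 5% level if it falls inside this interval.
print(round(lower, 2), "to", round(upper, 2))   # approximately 9.59 to 34.17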

Test for fit of a distribution

Discrete uniform distribution


In this case N observations are divided among n cells. A simple application is to test the hypothesis that, in the general population, values would occur in each cell with equal frequency. The "theoretical frequency" for any cell (under the null hypothesis of a discrete uniform distribution) is thus calculated as

Ei = N / n,

and the reduction in the degrees of freedom is p = 1, notionally because the observed frequencies Oi are constrained to sum to N.

Other distributions
When testing whether observations are random variables whose distribution belongs to a given family of distributions, the "theoretical frequencies" are calculated using a distribution from that family fitted in some standard way. The reduction in the degrees of freedom is calculated as p = s + 1, where s is the number of parameters used in fitting the distribution. For instance, when checking a 3-parameter Weibull distribution, p = 4, and when checking a normal distribution (where the parameters are mean and standard deviation), p = 3. In other words, there will be n - p degrees of freedom, where n is the number of categories.

It should be noted that the degrees of freedom are not based on the number of observations as with a Student's t or F-distribution. For example, if testing for a fair, six-sided die, there would be five degrees of freedom because there are six categories/parameters (each number). The number of times the die is rolled will have absolutely no effect on the number of degrees of freedom.

Calculating the test-statistic


The value of the test-statistic is

χ² = Σ (Oi - Ei)² / Ei, summed over i = 1 to n,

where

χ² = the test statistic that asymptotically approaches a χ² distribution;
Oi = an observed frequency;
Ei = an expected (theoretical) frequency, asserted by the null hypothesis;
n = the number of possible outcomes of each event.
The chi-square statistic can then be used to calculate a p-value by comparing the value of the statistic to a chi-square distribution. The number of degrees of freedom is equal to the number of cells n, minus the reduction in degrees of freedom, p. The result about the number of degrees of freedom is valid when the original data was multinomial and hence the estimated parameters are efficient for minimizing the chi-square statistic. More generally however, when maximum likelihood estimation does not coincide with minimum chi-square estimation, the distribution will lie somewhere between a chi-square distribution with n - 1 - p and n - 1 degrees of freedom (see, for instance, Chernoff and Lehmann 1954).
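As an illustration of the calculation just described, here is a minimal goodness-of-fit sketch for the fair-die example mentioned earlier; the observed counts are invented and SciPy is assumed to be available.

from scipy.stats import chisquare

# Hypothetical counts from 120 rolls of a die; under H0 each face is expected 20 times.
observed = [15, 22, 19, 24, 18, 22]
expected = [20, 20, 20, 20, 20, 20]

statistic, p_value = chisquare(f_obs=observed, f_exp=expected)

# Degrees of freedom = number of cells (6) minus the reduction p = 1, i.e. 5.
print("chi-square =", round(statistic, 3), "p-value =", round(p_value, 4))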

Bayesian method
For more details on this topic, see Categorical distribution#Bayesian statistics.

In Bayesian statistics, one would instead use a Dirichlet distribution as conjugate prior. If one took a uniform prior, then the maximum likelihood estimate for the population probability is the observed probability, and one may compute a credible region around this or another estimate.

Test of independence
In this case, an "observation" consists of the values of two outcomes and the null hypothesis is that the occurrence of these outcomes is statistically independent. Each observation is allocated to one cell of a two-dimensional array of cells (called a contingency table) according to the values of the two outcomes. If there are r rows and c columns in the table, the "theoretical frequency" for a cell, given the hypothesis of independence, is

Ei,j = (total of row i x total of column j) / N, where N is the total number of observations,

and fitting the model of "independence" reduces the number of degrees of freedom by p = r + c - 1. The value of the test-statistic is

χ² = Σ (Oi,j - Ei,j)² / Ei,j, summed over all cells of the table.

The number of degrees of freedom is equal to the number of cells rc, minus the reduction in degrees of freedom, p, which reduces to (r - 1)(c - 1). For the test of independence, a chi-square probability of less than or equal to 0.05 (or the chi-square statistic being at or larger than the 0.05 critical point) is commonly interpreted by applied workers as justification for rejecting the null hypothesis that the row variable is unrelated (that is, only randomly related) to the column variable.[1] The alternative hypothesis corresponds to the variables having an association or relationship where the structure of this relationship is not specified.
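A minimal sketch of the test of independence follows; the 2 x 3 contingency table is invented for the example, and SciPy's chi2_contingency routine is assumed to be available.

from scipy.stats import chi2_contingency

# Hypothetical r x c contingency table: rows = two groups, columns = three responses.
observed = [
    [30, 45, 25],
    [20, 35, 45],
]

chi2_value, p_value, dof, expected = chi2_contingency(observed)

print("chi-square =", round(chi2_value, 3), "p-value =", round(p_value, 4), "df =", dof)   # df = (2-1)(3-1) = 2
if p_value <= 0.05:
    print("Reject the null hypothesis of independence.")
else:
    print("No evidence against independence at the 5% level.")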

Q 5. (b) How do you define Index Numbers? Narrate the nature and types of Index numbers with adequate examples. Answer:-

An index number is an economic data figure reflecting price or quantity compared with a standard or base value. The base usually equals 100 and the index number is usually expressed as 100 times the ratio to the base value. For example, if a commodity costs twice as much in 1970 as it did in 1960, its index number would be 200 relative to 1960. Index numbers are used especially to compare business activity, the cost of living, and employment. They enable economists to reduce unwieldy business data into easily understood terms. In economics, index numbers generally are time series summarizing movements in a group of related variables. In some cases, however, index numbers may compare geographic areas at a point in time. An example is a country's purchasing power parity. The best-known index number is the consumer price index, which measures changes in retail prices paid by consumers. In addition, a cost-of-living index (COLI) is a price index number that measures relative cost of living over time. In contrast to a COLI based on the true but unknown utility function, a superlative index number is an index number that can be calculated. Thus, superlative index numbers are used to provide a fairly close approximation to the underlying cost-of-living index number in a wide range of circumstances. There is a substantial body of economic analysis concerning the construction of index numbers, desirable properties of index numbers and the relationship between index numbers and economic theory. Index numbers are meant to study the change in the effects of such factors which cannot be measured directly. According to Bowley, Index numbers are used to measure the changes in some quantity which we cannot observe directly. For example, changes in business activity in a country are not capable of direct
measurement, but it is possible to study relative changes in business activity by studying the variations in the values of some such factors which affect business activity, and which are capable of direct measurement. Index numbers are a commonly used statistical device for measuring the combined fluctuations in a group of related variables. If we wish to compare the price level of consumer items today with that prevalent ten years ago, we are not interested in comparing the prices of only one item, but in comparing some sort of average price levels. We may wish to compare the present agricultural production or industrial production with that at the time of independence. Here again, we have to consider all items of production and each item may have undergone a different fractional increase (or even a decrease). How do we obtain a composite measure? This composite measure is provided by index numbers, which may be defined as a device for combining the variations that have come about in a group of related variables over a period of time, with a view to obtaining a figure that represents the net result of the change in the constituent variables.

"Index Numbers are devices for measuring differences in the magnitude of a group of related variables." - Croxton and Cowden.
"In its simplest form an index number is nothing more than a relative which expresses the relationship between two figures, where one figure is used as a base." - Morris Hamburg.
"Generally speaking, index numbers measure the size or magnitude of some object at a particular point in time as a percentage of some base or reference object in the past." - M.L. Berenson and D.M. Levine.
"An index number is a measure of how much a variable changes over time. We calculate an index number by finding the ratio of the current value to a base value. Then we multiply the resulting number by 100 to express the index as a percentage. This final value is a percentage relative. Needless to say, the index number for the base point in time is always 100." - Richard I. Levin & David S. Rubin.
Index Numbers: Index numbers are statistical measures designed to show changes in a variable or group of related variables with respect to time, geographic location or other characteristics such as income, profession, etc. A collection of index numbers for different years, locations, etc., is sometimes called an index series.
Simple Index Number: A simple index number is a number that measures a relative change in a single variable with respect to a base.
Composite Index Number: A composite index number is a number that measures an average relative change in a group of related variables with respect to a base.
Types of Index Numbers: The following types of index numbers are usually used:
Price Index Numbers: Price index numbers measure the relative changes in prices of commodities between two periods. Prices can be either retail or wholesale.


Quantity Index Numbers: These index numbers are considered to measure changes in the physical quantity of goods produced, consumed or sold of an item or a group of items.

Nature of Index Numbers


(1) Index numbers are specialized averages used for comparison in situations where two or more series are expressed in different units or represent different items, e.g. the Consumer Price Index representing prices of various items, or the Index of Industrial Production representing various commodities produced.
(2) Index numbers measure the net change in a group of related variables over a period of time.
(3) Index numbers measure the effect of change over a period of time, across a range of industries, geographical regions or countries.
(4) The computation of index numbers is carefully planned according to the purpose of their computation: collection of data, application of an appropriate method, and assignment of correct weightages and formulae.

Types of Index Numbers:-
(i) Price Index Numbers:- A price index number compares the level of prices from one period to another and is the most frequently used. Prices are generally represented by p in formulae. These are also expressed as price relatives, defined as follows:
Price Relative = (Current year's Price / Base year's Price) x 100 = (P1 / P0) x 100
Any increase in a price index amounts to a corresponding decrease in the purchasing power of the Rupee or other affected currency.
(ii) Quantity Index Numbers:- A quantity index number measures how much the number or quantity of a variable changes over time. Quantities are generally represented as q in formulae.
(iii) Value Index Numbers:- A value index number measures changes in total monetary worth, that is, it measures changes in the Rupee value of a variable. It combines price and quantity changes to present a more informative index.
(iv) Composite Index Numbers:- A single index may reflect a composite, or group, of changing variables. For instance, the Consumer Price Index measures the general price level for specific goods and services in the economy. These are also known simply as Index Numbers. In such cases the price relatives with respect to a selected base are determined separately for each variable and their statistical average (arithmetic, geometric or harmonic mean, mode or median) is computed.

Examples of Index Number


Prices of Different Commodities in Years 2002 and 2003

Commodity    Price in 2002 (Rs.)    Price in 2003 (Rs.)
A            50                     70
B            40                     60
C            80                     90
D            110                    120
E            20                     20

Computation of Price Indices and the Price Index

Commodity    Price in 2002 (Rs.) (P0)    Price in 2003 (Rs.) (P1)    Price Relative (P1/P0) x 100
A            50                          70                          140%
B            40                          60                          150%
C            80                          90                          112.5%
D            110                         120                         109.1%
E            20                          20                          100%
Total        300                         360

Price Index P01 = (Sum of P1 / Sum of P0) x 100 = 100 x 360/300 = 120%

Computation of Price Indices of a Commodity for a Number of Years, taking a Selected Year as the Base Year

Year    Price (Rs.) (from survey)    Index Number with 2000 as Base (%)    Index Number with 1997 as Base (%)
1997    40                           100 x 40/50 = 80                      Assumed as 100 (Base)
1998    36                           100 x 36/50 = 72                      100 x 36/40 = 90
1999    48                           100 x 48/50 = 96                      100 x 48/40 = 120
2000    50                           Assumed as 100 (Base)                 100 x 50/40 = 125
2001    44                           100 x 44/50 = 88                      100 x 44/40 = 110
2002    52                           100 x 52/50 = 104                     100 x 52/40 = 130
2003    46                           100 x 46/50 = 92                      100 x 46/40 = 115

Computation of Price Indices of a Commodity for a Number of Years by the Chain Base Method

Year    Price (Rs.) (from survey)    Chain Base Calculation    Index Value (%)
1997    40                           100 x 40/40               100
1998    36                           100 x 36/40               90
1999    48                           100 x 48/36               133.3
2000    50                           100 x 50/48               104.2
2001    44                           100 x 44/50               88
2002    52                           100 x 52/44               118.2
2003    46                           100 x 46/52               88.5
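The calculations in the three tables above can be reproduced with a short Python sketch; the dictionary and variable names are my own, while the prices are the ones used in the tables.

# Prices of commodities A-E in the base year (2002) and current year (2003), as in the tables above.
p0 = {"A": 50, "B": 40, "C": 80, "D": 110, "E": 20}
p1 = {"A": 70, "B": 60, "C": 90, "D": 120, "E": 20}

# Price relative for each commodity: (P1 / P0) x 100
price_relatives = {c: round(100 * p1[c] / p0[c], 1) for c in p0}
print(price_relatives)                                        # A 140.0, B 150.0, C 112.5, D 109.1, E 100.0

# Simple aggregate price index: (sum of P1 / sum of P0) x 100
print(round(100 * sum(p1.values()) / sum(p0.values()), 1))    # 120.0

# Fixed-base and chain-base indices for one commodity's yearly prices.
years = [1997, 1998, 1999, 2000, 2001, 2002, 2003]
prices = [40, 36, 48, 50, 44, 52, 46]

base = prices[years.index(2000)]                              # 2000 taken as the fixed base year
print([round(100 * p / base, 1) for p in prices])             # 80.0, 72.0, 96.0, 100.0, 88.0, 104.0, 92.0

chain = [100.0] + [round(100 * prices[i] / prices[i - 1], 1) for i in range(1, len(prices))]
print(chain)                                                  # 100.0, 90.0, 133.3, 104.2, 88.0, 118.2, 88.5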

Q 6. (a) What are the important Index Numbers used in the Indian Economy? Explain the Index Number of Industrial Production. Answer:-

A price index (plural: price indices or price indexes) is a normalized average (typically a weighted average) of prices for a given class of goods or services in a given region, during a given interval of time. It is a statistic designed to help to compare how these prices, taken as a whole, differ between time periods or geographical locations. Price indices have several potential uses. For particularly broad indices, the index can be said to measure the economy's price level or a cost of living. More narrow price indices can help producers with business plans and pricing. Sometimes, they can be useful in helping to guide investment. Some notable price indices include:

Consumer price index

A consumer price index (CPI) is a measure estimating the average price of consumer goods and services purchased by households. A consumer price index measures a price change for a constant market basket of goods and services from one period to the next within the same area (city, region, or nation). It is a price index determined by measuring the price of a standard group of goods meant to represent the typical market basket of a typical urban consumer. Related, but different, terms are the United Kingdom's CPI, RPI, and RPIX. It is one of several price indices calculated by most national statistical agencies. The percent change in the CPI is a measure estimating inflation. The CPI can be used to index (i.e., adjust for the effect of inflation on the real value of money: the medium of exchange) wages, salaries, pensions, and regulated or contracted prices. The CPI is, along with the population census and the National Income and Product Accounts, one of the most closely watched national economic statistics.

Two basic types of data are needed to construct the CPI: price data and weighting data. The price data are collected for a sample of goods and services from a sample of sales outlets in a sample of locations for a sample of times. The weighting data are estimates of the shares of the different types of expenditure as fractions of the total expenditure covered by the index. These weights are usually based upon expenditure data obtained for sampled periods from a sample of households. Although some of the sampling is done using a sampling frame and probabilistic sampling methods, much is done in a commonsense way (purposive sampling) that does not permit estimation of confidence intervals. Therefore, the sampling variance is normally ignored, since a single estimate is required in most of the purposes for which the index is used. The index is usually computed yearly, or quarterly in some countries, as a weighted average of sub-indices for different components of consumer expenditure, such as food, housing, clothing, each of which is in turn a weighted average of sub-sub-indices. At the most detailed level, the elementary aggregate level (for example, men's shirts sold in department stores in San Francisco), detailed weighting information is unavailable, so indices are computed using an unweighted arithmetic or geometric mean of the prices of the sampled product offers. (However, the growing use of scanner data is gradually making weighting information available even at the most detailed level.) These indices compare prices each month with prices in the price-reference month. The weights used to combine them into the higher-level aggregates, and then into the overall index, relate to the estimated expenditures during a preceding whole year of the consumers covered by the index on the products within its scope in the area covered. Thus the index is a fixed-weight index, but rarely a true Laspeyres index, since the weight-reference period of a year and the price-reference period, usually a more recent single month, do not coincide. It takes time to assemble and process the information used for weighting which, in addition to household expenditure surveys, may include trade and tax data. Ideally, the weights would relate to the composition of expenditure during the time between the price-reference month and the current month. There is a large technical economics literature on index formulae which would approximate this and which can be shown to approximate what economic theorists call a true cost of living index. Such an index would show how consumer expenditure would have to move to compensate for price changes so as to allow consumers to maintain a constant standard of living. Approximations can only be computed retrospectively, whereas the index has to appear monthly and, preferably, quite soon. Nevertheless, in some countries, notably in the United States and Sweden, the philosophy of the index is that it is inspired by and approximates the notion of a true cost of living (constant utility) index, whereas in most of Europe it is regarded more pragmatically. The coverage of the index may be limited. Consumers' expenditure abroad is usually excluded; visitors' expenditure within the country may be excluded in principle if not in practice; the rural population may or may not be included; certain groups such as the very rich or the very poor may be excluded. Saving and

Saving and investment are always excluded, though the prices paid for financial services provided by financial intermediaries may be included, along with insurance. The index reference period, usually called the base year, often differs both from the weight-reference period and from the price-reference period. This is just a matter of rescaling the whole time series to make the value for the index reference period equal to 100. Annually revised weights are a desirable but expensive feature of an index, for the older the weights, the greater is the divergence between the current expenditure pattern and that of the weight-reference period.

Example: The prices of 95,000 items from 22,000 stores and 35,000 rental units are added together and averaged. They are weighted this way: Housing 41.4%, Food and Beverage 17.4%, Transport 17.0%, Medical Care 6.9%, Other 6.9%, Apparel 6.0%, Entertainment 4.4%. Taxes (43%) are not included in the CPI computation. The index for the current period is then

CPI = [ Σ (Product_rep × Price_current) / Σ (Product_rep × Price_base) ] × 100,

where Product_rep denotes the quantity of each representative product in the fixed basket and Price_base is its price in the base year (1987 in this example).
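To make the fixed-basket calculation concrete, here is a minimal Python sketch; the basket items, quantities and prices below are invented for illustration and are not official CPI data:

# Minimal sketch of a fixed-basket consumer price index (hypothetical data).
# Each entry: item -> (basket quantity, base-period price, current-period price)
basket = {
    "bread":  (50, 1.20, 1.35),
    "rent":   (12, 650.00, 700.00),
    "petrol": (300, 0.95, 1.10),
}

base_cost = sum(q * p_base for q, p_base, _ in basket.values())
current_cost = sum(q * p_now for q, _, p_now in basket.values())

cpi = 100.0 * current_cost / base_cost      # index equals 100 in the base period
inflation_pct = cpi - 100.0                 # percent change since the base period
print(f"CPI = {cpi:.1f}, inflation since base period = {inflation_pct:.1f}%")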

Weighting
Weights and sub-indices
Weights can be expressed as fractions or ratios summing to one, as percentages summing to 100, or as per-mille numbers summing to 1000. In the European Union's Harmonised Index of Consumer Prices (HICP), for example, each country computes some 80 prescribed sub-indices, their weighted average constituting the national Harmonised Index. The weights for these sub-indices will consist of the sum of the weights of a number of component lower-level indexes. The classification is according to use, developed in a national accounting context; this is not necessarily the kind of classification that is most appropriate for a Consumer Price Index. Grouping together substitutes, or products whose prices tend to move in parallel, might be more suitable. For some of these lower-level indexes detailed weights are available, allowing computations in which the individual price observations can all be weighted. This may be the case, for example, where all selling is in the hands of a single national organisation which makes its data available to the index compilers. For most lower-level indexes, however, the weight will consist of the sum of the weights of a number of elementary aggregate indexes, each weight corresponding to its fraction of the total annual expenditure covered by the index. An 'elementary aggregate' is a lowest-level component of expenditure, one which has a weight but within which weights for its sub-components are usually lacking. Thus, for example: weighted averages of elementary aggregate indexes (e.g. for men's shirts, raincoats, women's dresses, etc.) make up low-level indexes (e.g. outer garments); weighted averages of these in turn provide sub-indices at a higher, more aggregated level (e.g. clothing); and weighted averages of the latter provide yet more aggregated sub-indices (e.g. clothing and footwear). Some of the elementary aggregate indexes, and some of the sub-indexes, can be defined simply in terms of the types of goods and/or services they cover, as in the case of products such as newspapers in some countries and postal services, which have nationally uniform prices. But where price movements do differ or might differ between regions or between outlet types, separate regional and/or outlet-type elementary aggregates are ideally required for each detailed category of goods and services, each with its own weight. An example might be an elementary aggregate for sliced bread sold in supermarkets in the Northern region.
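As a rough illustration of how elementary aggregate indices roll up into low-level indexes and then sub-indices, the following Python sketch combines invented index values with invented per-mille weights; none of the figures are actual HICP values:

# Hypothetical elementary aggregate indices with per-mille weights.
elementary = {
    "men's shirts":    (104.2, 8),
    "raincoats":       (101.5, 3),
    "women's dresses": (103.0, 9),
}

def weighted_index(components):
    # Weighted average of component indices; weights here are per-mille shares.
    total_w = sum(w for _, w in components.values())
    return sum(idx * w for idx, w in components.values()) / total_w

outer_garments = weighted_index(elementary)            # a low-level index
clothing = weighted_index({
    "outer garments": (outer_garments, 20),
    "underwear":      (102.1, 10),                     # another hypothetical sub-index
})
print(f"Outer garments: {outer_garments:.2f}  Clothing: {clothing:.2f}")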

Most elementary aggregate indexes are necessarily 'unweighted' averages for the sample of products within the sampled outlets. However, in cases where it is possible to select the sample of outlets from which prices are collected so as to reflect the shares of sales to consumers of the different outlet types covered, self-weighted elementary aggregate indexes may be computed. Similarly, if the market shares of the different types of product represented are known, even if only approximately, the number of observed products to be priced for each of them can be made proportional to those shares.

Estimating weights
The outlet and regional dimensions noted above mean that the estimation of weights involves a lot more than just the breakdown of expenditure by types of goods and services, and the number of separately weighted indexes composing the overall index depends upon two factors:
1. The degree of detail to which available data permit breakdown of total consumption expenditure in the weight reference-period by type of expenditure, region and outlet type.
2. Whether there is reason to believe that price movements vary between these most detailed categories.

How the weights are calculated, and in how much detail, depends upon the availability of information and upon the scope of the index. In the UK the RPI does not relate to the whole of consumption, for the reference population is all private households with the exception of a) pensioner households that derive at least three-quarters of their total income from state pensions and benefits and b) high income households whose total household income lies within the top four per cent of all households. The result is that it is difficult to use data sources relating to total consumption by all population groups. For products whose price movements can differ between regions and between different types of outlet:

The ideal, rarely realizable in practice, would consist of estimates of expenditure for each detailed consumption category, for each type of outlet, for each region. At the opposite extreme, with no regional data on expenditure totals but only on population (e.g. 24% in the Northern region), and only national estimates for the shares of different outlet types for broad categories of consumption (e.g. 70% of food sold in supermarkets), the weight for sliced bread sold in supermarkets in the Northern region has to be estimated as the share of sliced bread in total consumption × 0.24 × 0.7.
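A worked version of that approximation in Python; the regional and outlet shares are the hypothetical figures used in the text, and the national share assumed for sliced bread is also invented, not a survey result:

# Approximate the weight of "sliced bread sold in supermarkets in the Northern region"
# from national data only. All shares here are hypothetical.
share_sliced_bread = 0.012   # assumed share of sliced bread in total national consumption
share_northern     = 0.24    # Northern region's share of population, used as a proxy
share_supermarket  = 0.70    # national share of food sold in supermarkets

weight = share_sliced_bread * share_northern * share_supermarket
print(f"Estimated weight: {weight:.5f} of total consumption covered by the index")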

The situation in most countries comes somewhere between these two extremes. The point is to make the best use of whatever data are available.

The nature of the data used for weighting


No firm rules can be suggested on this issue for the simple reason that the available statistical sources differ between countries. However, all countries conduct periodical Household Expenditure surveys and all produce breakdowns of Consumption Expenditure in their National Accounts. The expenditure classifications used there may however be different. In particular:

Household Expenditure surveys do not cover the expenditures of foreign visitors, though these may be within the scope of a Consumer Price Index. National Accounts include imputed rents for owner-occupied dwellings which may not be within the scope of a Consumer Price Index.

Even with the necessary adjustments, the National Account estimates and Household Expenditure Surveys usually diverge.

The statistical sources required for regional and outlet-type breakdowns are usually weaker. Only a large-sample Household Expenditure survey can provide a regional breakdown. Regional population data are sometimes used for this purpose, but need adjustment to allow for regional differences in living standards and consumption patterns. Statistics of retail sales and market research reports can provide information for estimating outlet-type breakdowns, but the classifications they use rarely correspond to COICOP categories. The increasingly widespread use of bar codes and scanners in shops has meant that detailed, printed cash-register receipts are provided by shops for an increasing share of retail purchases. This development makes possible improved Household Expenditure surveys, as Statistics Iceland has demonstrated. Survey respondents keeping a diary of their purchases need to record only the total of purchases when itemised receipts were given to them, and keep these receipts in a special pocket in the diary. These receipts provide not only a detailed breakdown of purchases but also the name of the outlet. Thus the response burden is markedly reduced, accuracy is increased, product description is more specific and point-of-purchase data are obtained, facilitating the estimation of outlet-type weights. There are only two general principles for the estimation of weights: use all the available information, and accept that rough estimates are better than no estimates.

Reweighing
Ideally, in computing an index, the weights would represent current annual expenditure patterns. In practice they necessarily reflect past expenditure patterns, using the most recent data available or, if those data are not of high quality, some average of the data for more than one previous year. Some countries have used a three-year average in recognition of the fact that household survey estimates are of poor quality. In some cases some of the data sources used may not be available annually, in which case some of the weights for lower-level aggregates within higher-level aggregates are based on older data than the higher-level weights. Infrequent reweighing saves costs for the national statistical office but delays the introduction into the index of new types of expenditure. For example, subscriptions for Internet service entered index compilation with a considerable time lag in some countries, and account could be taken of digital camera prices between reweightings only by including some digital cameras in the same elementary aggregate as film cameras.

Index Number of Industrial Production:


Before turning to the index of industrial production itself, note the closely related Producer Price Index (PPI), which measures average changes in prices received by domestic producers for their output. The PPI is one of several price indices, and its importance is being undermined by the steady decline in manufactured goods as a share of spending.

United States
In the US, the PPI was known as the Wholesale Price Index, or WPI, up to 1978. The PPI is one of the oldest continuous systems of statistical data published by the Bureau of Labor Statistics, as well as one of the oldest economic time series compiled by the Federal Government. The origins of the index can be found in an 1891 U.S. Senate resolution authorizing the Senate Committee on Finance to investigate the effects of the tariff laws upon the imports and exports, the growth, development, production, and prices of agricultural and manufactured articles at home and abroad.

India

The Indian Wholesale Price Index (WPI) was first published in 1902 and was used by policy makers until it was replaced by the Producer Price Index (PPI) in 1978. The Index Number of Industrial Production is designed to measure the increase or decrease in the level of industrial production in a given period of time compared to some base period. Such an index measures changes in the quantities of production, not their values. Data about the level of industrial output in the base period and in the given period are collected first under the following heads:
(a) Textile industries, including cotton, woollen, silk, etc.
(b) Mining industries, such as iron ore, coal, copper, petroleum, etc.
(c) Metallurgical industries, such as iron, steel, aluminium, etc.
(d) Mechanical industries, such as automobiles, locomotives, aeroplanes, etc.
(e) Industries subject to excise duties, such as sugar, tobacco, matches, etc.
(f) Miscellaneous, such as glass, detergents, chemicals, cement, etc.

The figures of output for the various industries classified above are obtained on a monthly, quarterly or yearly basis. Weights are assigned to the various industries on the basis of criteria such as capital invested, turnover, net output or production; usually the weights are based on the values of net output of the different industries. The index of industrial production is obtained by taking the simple arithmetic mean or the geometric mean of the quantity relatives. When the simple arithmetic mean is used, the formula for constructing the index is:

Index of Industrial Production = (100 / Σw) × Σ{(q1/q0) × w} = (100 / Σw) × Σ(I × w)

where q1 = quantity produced in the given period, q0 = quantity produced in the base period, w = relative importance (weight) of the different outputs, and I = q1/q0 = the index (quantity relative) for the respective commodity. For determining the relative share of an individual output, the concept of value added during the production process is most commonly used.

Example: Construct an index number of business activity from the following data:

S.No.  Item                    Weight (w)   Index (I)
1      Industrial Production      36           250
2      Mineral Production          7           135
3      Internal Trade             24           200
4      Financial Activity         20           135
5      International Trade         7           325
6      Shipping                    6           300

Solution: Construction of the Index Number of Business Activity

Ser  Item                    Weight (w)   Index (I)    I × w
1    Industrial Production      36           250        9000
2    Mineral Production          7           135         945
3    Internal Trade             24           200        4800
4    Financial Activity         20           135        2700
5    International Trade         7           325        2275
6    Shipping                    6           300        1800
     Total                  Σw = 100                Σ(I × w) = 21520

Hence Index Number of Business Activity = Σ(I × w) / Σw = 21520/100 = 215.2. Answer.
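The same weighted-average calculation can be written in a few lines of Python, using the figures from the table above:

# Weighted arithmetic index of business activity: sum(I * w) / sum(w).
data = [
    ("Industrial Production", 36, 250),
    ("Mineral Production",     7, 135),
    ("Internal Trade",        24, 200),
    ("Financial Activity",    20, 135),
    ("International Trade",    7, 325),
    ("Shipping",               6, 300),
]

total_w  = sum(w for _, w, _ in data)
total_iw = sum(w * i for _, w, i in data)
index = total_iw / total_w
print(f"Index of business activity = {index:.1f}")   # prints 215.2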

Q 6. (b) Enumerate Probability Distributions, explain the Histogram and Probability Distribution curve. Answer:-

Probability distribution
In probability theory and statistics, a probability distribution identifies either the probability of each value of a random variable (when the variable is discrete), or the probability of the value falling within a particular interval (when the variable is continuous). The probability distribution describes the range of possible values that a random variable can attain and the probability that the value of the random variable is within any (measurable) subset of that range.

When the random variable takes values in the set of real numbers, the probability distribution is completely described by the cumulative distribution function, whose value at each real x is the probability that the random variable is smaller than or equal to x. The concepts of the probability distribution and of the random variables it describes underlie the mathematical discipline of probability theory and the science of statistics. There is spread or variability in almost any value that can be measured in a population (e.g. height of people, durability of a metal, sales growth, traffic flow, etc.); almost all measurements are made with some intrinsic error; and in physics many processes are described probabilistically, from the kinetic properties of gases to the quantum mechanical description of fundamental particles. For these and many other reasons, simple numbers are often inadequate for describing a quantity, while probability distributions are often more appropriate. Various probability distributions arise in different applications. One of the more important ones is the normal distribution, also known as the Gaussian distribution or the bell curve, which approximates many naturally occurring distributions. The toss of a fair coin yields another familiar distribution, where the possible values are heads or tails, each with probability 1/2. In the measure-theoretic formalization of probability theory, a random variable is defined as a measurable function X from a probability space (Ω, F, P) to a measurable space (E, ℰ), and a probability distribution is the pushforward measure X*P = P ∘ X⁻¹ on (E, ℰ).
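As a small illustration of the two kinds of distribution just mentioned (the discrete fair coin and the continuous normal), here is a minimal Python sketch using only the standard library; the parameter values are arbitrary:

import math
import random

# Discrete distribution: a fair coin, described by a probability mass function.
coin_pmf = {"heads": 0.5, "tails": 0.5}
assert abs(sum(coin_pmf.values()) - 1.0) < 1e-12

# Continuous distribution: the normal, described by its cumulative distribution function.
def normal_cdf(x, mu=0.0, sigma=1.0):
    # P(X <= x) for a normal random variable, computed via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(normal_cdf(1.96))   # roughly 0.975 for the standard normal
print(random.choices(list(coin_pmf), weights=list(coin_pmf.values()), k=5))  # simulated tosses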

Probability distributions of real-valued random variables


Because a probability distribution Pr on the real line is determined by the probability of a real-valued random variable X lying in a half-open interval (−∞, x], the probability distribution is completely characterized by its cumulative distribution function

F(x) = Pr[X ≤ x] for all real x.

Discrete probability distribution


A probability distribution is called discrete if its cumulative distribution function increases only in jumps. More precisely, a probability distribution is discrete if there is a finite or countable set whose probability is 1. For many familiar discrete distributions, the set of possible values is topologically discrete in the sense that all its points are isolated points. But there are discrete distributions for which this countable set is dense on the real line. Discrete distributions are characterized by a probability mass function p such that

Pr[X = u] = p(u) for each possible value u, with Σ p(u) = 1 over the countable set of possible values.

Continuous probability distribution


By one convention, a probability distribution μ is called continuous if its cumulative distribution function F(x) = μ((−∞, x]) is continuous and, therefore, the probability measure of every singleton is zero: μ({x}) = 0 for all x. Another convention reserves the term continuous probability distribution for absolutely continuous distributions. These distributions can be characterized by a probability density function: a non-negative, Lebesgue-integrable function f defined on the real numbers such that

F(b) − F(a) = μ((a, b]) = ∫_a^b f(x) dx for all a < b.

Discrete distributions and some continuous distributions (like the Cantor distribution) do not admit such a density.

Terminology
The support of a distribution is the smallest closed interval/set whose complement has probability zero. It may be understood as the points or elements that are actual members of the distribution. A discrete random variable is a random variable whose probability distribution is discrete. Similarly, a continuous random variable is a random variable whose probability distribution is continuous.

Simulated sampling
If one is programming and wishes to sample from a probability distribution (either discrete or continuous), the following approach lets one do so. It assumes access to the inverse of the cumulative distribution function (easy to calculate for a discrete distribution, and approximable for continuous distributions) and a computational primitive called "random()" which returns an arbitrary-precision floating-point value in the range [0, 1).

For discrete distributions, the function cdfInverse (the inverse of the cumulative distribution function) can be calculated from samples as follows: for each element in the sample range (the discrete values along the x-axis), calculate the cumulative total of samples up to and including it, then normalize these running totals by the total number of samples. The result is the CDF, which can be turned into an object that acts like a function: calling cdfInverse(query) returns the smallest x-value such that the CDF is greater than or equal to the query, as in the sketch below. Note that mathematics environments and computer algebra systems often have some way to represent probability distributions and sample from them, and this functionality may also be available in third-party libraries. Such packages greatly facilitate sampling, usually include optimizations for common distributions, and are likely to be more elegant than a hand-rolled bare-bones solution.
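A minimal Python sketch of this approach for a discrete distribution built from observed samples; the data are invented, and random.random() plays the role of the "random()" primitive described above:

import random
from bisect import bisect_left
from collections import Counter

# Hypothetical sample drawn from some discrete distribution.
observed = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4]

# Empirical CDF: for each distinct value, the fraction of samples <= that value.
counts = Counter(observed)
values = sorted(counts)
cdf, running = [], 0
for v in values:
    running += counts[v]
    cdf.append(running / len(observed))

def cdf_inverse(q):
    # Smallest observed value whose empirical CDF is >= q (generalized inverse).
    return values[bisect_left(cdf, q)]

# Draw new samples that follow the empirical distribution of `observed`.
samples = [cdf_inverse(random.random()) for _ in range(10)]
print(samples)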

Histogram
In statistics, a histogram is a graphical display of tabulated frequencies, shown as adjacent rectangles. Each rectangle is erected over an interval, with an area equal to the frequency of the interval. The height of a rectangle is also equal to the frequency density of the interval, i.e. the frequency divided by the width of the interval. The total area of the histogram is equal to the number of data points. A histogram may also be based on relative frequencies instead; it then shows what proportion of cases fall into each of several categories (a form of data binning), and the total area equals 1. The categories are usually specified as consecutive, non-overlapping intervals of some variable. The categories (intervals) must be adjacent and are often chosen to be of the same size, though not necessarily so. Histograms are used to plot the density of data, and often for density estimation: estimating the probability density function of the underlying variable. The total area of a histogram used for probability density is always normalized to 1. If the lengths of the intervals on the x-axis are all 1, then a histogram is identical to a relative frequency plot. An alternative to the histogram is kernel density estimation, which uses a kernel to smooth the samples; this constructs a smooth probability density function which will, in general, more accurately reflect the underlying variable. The histogram is one of the seven basic tools of quality control.

The etymology of the word histogram is uncertain. Sometimes it is said to be derived from the Greek histos, 'anything set upright' (as the masts of a ship, the bar of a loom, or the vertical bars of a histogram), and gramma, 'drawing, record, writing'. It is also said that Karl Pearson, who introduced the term in 1895, derived the name from "historical diagram". As an example, consider data collected by the U.S. Census Bureau on time to travel to work (2000 census). The census found that there were 124 million people who work outside their homes. An interesting feature of the resulting histogram is that the number recorded for "at least 30 but less than 35 minutes" is higher than for the bands on either side. This is likely to have arisen from people rounding their reported journey time; such rounding is a common phenomenon when collecting data from people.

A relative-frequency histogram differs from the frequency histogram only in the vertical scale: the height of each bar is the decimal fraction of the total that each category represents, and the total area of all the bars is equal to 1, the decimal equivalent of 100%. A simple density-estimate curve may be overlaid on such a histogram. This version shows proportions, and is also known as a unit area histogram. In other words, a histogram represents a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies. The intervals are placed together in order to show that the data represented by the histogram, while being exclusive, are also continuous. (For example, in a histogram it is possible to have two connecting intervals of 10.5-20.5 and 20.5-33.5, but not two connecting intervals of 10.5-20.5 and 22.5-32.5. Empty intervals are represented as empty and not skipped.)

Mathematical definition
In a more general mathematical sense, a histogram is a mapping m_i that counts the number of observations that fall into various disjoint categories (known as bins), whereas the graph of a histogram is merely one way to represent a histogram. Thus, if we let n be the total number of observations and k be the total number of bins, the histogram m_i meets the following condition:

n = m_1 + m_2 + ... + m_k = Σ m_i (summing over i = 1 to k).

Cumulative histogram
A cumulative histogram is a mapping that counts the cumulative number of observations in all of the bins up to the specified bin. That is, the cumulative histogram M_i of a histogram m_j is defined as

M_i = m_1 + m_2 + ... + m_i = Σ m_j (summing over j = 1 to i).
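A minimal Python sketch of both mappings, counting observations into adjacent bins and then accumulating the counts; the data and bin edges are arbitrary examples:

# Build a histogram m_i and its cumulative histogram M_i by hand.
data = [2.1, 3.7, 4.4, 4.9, 5.2, 6.8, 7.0, 7.3, 9.6]
edges = [0, 2, 4, 6, 8, 10]          # k = 5 adjacent, equal-width bins

k = len(edges) - 1
m = [0] * k
for x in data:
    for i in range(k):
        # Each bin is [edges[i], edges[i+1]); the last bin also includes its upper edge.
        if edges[i] <= x < edges[i + 1] or (i == k - 1 and x == edges[-1]):
            m[i] += 1
            break

M = []
running = 0
for count in m:
    running += count
    M.append(running)

print(m)   # bin counts m_i; their sum equals n = len(data)
print(M)   # cumulative counts M_i; the last entry equals n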

Number of bins and width


There is no "best" number of bins, and different bin sizes can reveal different features of the data. Some theoreticians have attempted to determine an optimal number of bins, but these methods generally make strong assumptions about the shape of the distribution. You should always experiment with bin widths before choosing one (or more) that illustrate the salient features in your data. The number of bins k can be assigned directly, or it can be calculated from a suggested bin width h as

k = ⌈(max x − min x) / h⌉,

where the brackets denote the ceiling function. Commonly cited rules include:

Sturges' formula, k = ⌈log2 n⌉ + 1, which implicitly bases the bin sizes on the range of the data and can perform poorly if n < 30.

Scott's choice, h = 3.5 s / n^(1/3), where s is the sample standard deviation.

Square-root choice, k = ⌈√n⌉, which takes the square root of the number of data points in the sample (used by Excel histograms and many others).

Freedman–Diaconis' choice, h = 2 IQR(x) / n^(1/3), which is based on the interquartile range.

A good discussion of this and other rules for the choice of bin widths is in Modern Applied Statistics with S, section 5.6: Density Estimation.
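These rules are straightforward to compute directly; the following Python sketch uses only the standard library, with arbitrary sample data and a simple linear-interpolation quantile for the interquartile range (other quantile conventions give slightly different results):

import math
import statistics

data = sorted([4.1, 5.6, 7.2, 7.9, 8.3, 9.0, 10.4, 11.1, 12.7, 14.2])
n = len(data)

def quantile(xs, q):
    # Linear-interpolation quantile on sorted data (one of several conventions).
    pos = q * (len(xs) - 1)
    lo, hi = int(math.floor(pos)), int(math.ceil(pos))
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

data_range = data[-1] - data[0]
s = statistics.stdev(data)                       # sample standard deviation
iqr = quantile(data, 0.75) - quantile(data, 0.25)

k_sturges = math.ceil(math.log2(n)) + 1
h_scott = 3.5 * s / n ** (1 / 3)
k_scott = math.ceil(data_range / h_scott)
k_sqrt = math.ceil(math.sqrt(n))
h_fd = 2 * iqr / n ** (1 / 3)
k_fd = math.ceil(data_range / h_fd)

print(k_sturges, k_scott, k_sqrt, k_fd)          # suggested bin counts under each rule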
