
Personal Learning Paper

On
Quantitative Techniques

Prepared By:
Submitted To: Prof. Priyabrata Nayak
Contents

• Introduction
• History
• Data Collecting
• Arranging Data
  - Using a Data Array
  - Using a Frequency Distribution
• Central Tendency
• Dispersion
• Skewness
• Measures of Central Tendency
  - Mean
  - Median
  - Mode
• Probability
• Probability Distribution
• Discrete Probability Distribution
Contents

• Sampling
• Central Limit Theorem
• Standard Error
• Sample Size
• Estimation
• Point Estimation
• Interval Estimation
• Regression
• Correlation
• Coefficient of Determination
Introduction to Statistics

• The word statistics means different things to different people.
• To a football fan, statistics are rushing, passing, and first-down numbers; to the Chargers' coach in the second example, statistics is the chance that the Giants will throw the pass over center.
• To the manager of a power station, statistics are the amounts of pollution being released into the atmosphere.
• To the Food and Drug Administration in our third example, statistics is the likely percentage of undesirable effects in the population using the new prostate drug.
• To the Community Bank in the fourth example, statistics is the chance that Sarah will repay her loan on time.
• Each of these people is using the word correctly, yet each uses it in a different way. All of them are using statistics to help them make decisions.
History

• The word statistik comes from the Italian word statista (meaning "statesman"). It was first used by Gottfried Achenwall (1719-1772), a professor at Marlborough and Gottingen. Dr. E. A. W. Zimmerman introduced the word statistics into England. Its use was popularized by Sir John Sinclair in his work Statistical Account of Scotland (1791-1799). Long before the eighteenth century, however, people had been recording and using data.
Data Collecting

• Statisticians select their observations so that all relevant groups are represented in the data. To determine the potential market for a new product, for example, analysts may study 100 consumers in a certain geographical area. Analysts must be certain that this group contains people representing variables such as income level, race, education, and neighborhood.
• Past data is used to make decisions about the future.
Arranging Data Using the Data Array

• The data array is one of the simplest ways to present data. It arranges the data in ascending or descending order.

Table 1-1: Sample of daily production in yards of 30 carpet looms (raw data)

16.2  15.8  15.8  15.8  16.3  15.6
15.7  16.0  16.2  16.1  16.8  16.0
16.4  15.2  15.9  15.9  15.9  16.8
15.4  15.7  15.9  16.0  16.3  16.0
16.4  16.6  15.6  15.6  16.9  16.3

Table 1-2: Data array of daily production in yards of 30 carpet looms

15.2  15.7  15.9  16.0  16.2  16.4
15.4  15.7  15.9  16.0  16.3  16.6
15.6  15.8  15.9  16.0  16.3  16.8
15.6  15.8  15.9  16.1  16.3  16.8
15.6  15.8  16.0  16.2  16.4  16.9

Table 1-1 contains the raw data, and Table 1-2 rearranges it in a data array in ascending order (a short sorting sketch follows the list below).

Advantages of the data array:
• We can quickly locate the lowest and highest values in the data.
• We can easily divide the data into sections.
• We can see whether any values appear more than once in the array.
• We can observe the distance between succeeding values in the data.
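
As a sketch of the idea, Table 1-2 can be produced from Table 1-1 with a one-line sort in Python (variable names are illustrative, not from the text):

# Raw daily production (yards) from Table 1-1
raw = [16.2, 15.8, 15.8, 15.8, 16.3, 15.6,
       15.7, 16.0, 16.2, 16.1, 16.8, 16.0,
       16.4, 15.2, 15.9, 15.9, 15.9, 16.8,
       15.4, 15.7, 15.9, 16.0, 16.3, 16.0,
       16.4, 16.6, 15.6, 15.6, 16.9, 16.3]

data_array = sorted(raw)                 # ascending data array (Table 1-2)
print(data_array[0], data_array[-1])     # lowest and highest values: 15.2 16.9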
Arranging Data Using the Frequency Distribution

In statistics, a frequency distribution is a graph or data set organized to show the frequency of occurrence of each possible outcome of a repeatable event observed many times.

Simple examples are election returns and test scores listed by percentile. A frequency distribution can be graphed as a histogram or pie chart. For large data sets, the stepped graph of a histogram is often approximated by the smooth curve of a distribution function (called a density function when normalized so that the area under the curve is 1).

The famed bell curve, or normal distribution, is the graph of one such function. Frequency distributions are particularly useful in summarizing large data sets and assigning probabilities.
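
A minimal sketch of tallying a frequency distribution, reusing the carpet-loom figures from Table 1-1:

from collections import Counter

# Daily production figures (yards) from Table 1-1
raw = [16.2, 15.8, 15.8, 15.8, 16.3, 15.6, 15.7, 16.0, 16.2, 16.1,
       16.8, 16.0, 16.4, 15.2, 15.9, 15.9, 15.9, 16.8, 15.4, 15.7,
       15.9, 16.0, 16.3, 16.0, 16.4, 16.6, 15.6, 15.6, 16.9, 16.3]

freq = Counter(raw)                       # occurrences of each exact value
for value in sorted(freq):
    print(f"{value:4.1f} yards: {freq[value]} looms")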
Central Tendency

A measure that indicates the typical value of a distribution. The mean and the median are examples of measures of central tendency.

Dispersion

A term used in statistics that refers to the spread of a set of values around a mean or average level. As Investopedia notes, in finance dispersion is used to measure the volatility of different types of investment strategies. Returns that have wide dispersions are generally seen as riskier because they have a higher probability of closing dramatically lower than the mean. In practice, standard deviation is the tool generally used to measure the dispersion of returns.
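
As a short sketch, the standard deviation can be computed with Python's statistics module (the returns below are illustrative):

import statistics

returns = [0.04, 0.11, -0.03, 0.07, 0.02]    # illustrative investment returns

mean = statistics.mean(returns)
spread = statistics.stdev(returns)           # sample standard deviation
print(f"mean = {mean:.4f}, standard deviation = {spread:.4f}")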
Skewness

The degree to which a distribution departs from symmetry about its mean value.

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. Roughly speaking, a distribution has positive skew (right-skewed) if the right (higher-value) tail is longer or fatter, and negative skew (left-skewed) if the left (lower-value) tail is longer or fatter. The two are often confused, since most of the mass of a right-skewed (or left-skewed) distribution lies to the left (or right) of its respective tail.
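
A sketch of measuring skewness as the standardized third moment (one common definition among several; the dataset is illustrative):

def skewness(xs):
    """Skewness: mean cubed deviation divided by the standard deviation cubed."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n    # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n    # third central moment
    return m3 / m2 ** 1.5

print(skewness([1, 2, 2, 3, 10]))   # positive value: long right tail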

Measures of Central Tendency
The three most common measures of central tendency are
the mean, the median, and the mode.
Measures of Central Tendency
Arithmetic Mean
The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the number of numbers. The symbol μ is used for the mean of a population and the symbol M for the mean of a sample. The formula for μ is:

μ = ΣX / N

where ΣX is the sum of all the numbers and N is the number of numbers. As an example, the mean of the numbers 1, 2, 3, 6, and 8 is (1 + 2 + 3 + 6 + 8) / 5 = 20 / 5 = 4, regardless of whether the numbers constitute the entire population or just a sample from the population.
The table, Number of touchdown passes, shows the number of touchdown (TD) passes thrown by each of the 31 teams in the National Football League in the 2000 season. The mean number of touchdown passes thrown is

μ = ΣX / N = 634 / 31 = 20.4516

Number of touchdown passes


37 33 33 32 29 28 28 23
22 22 22 21 21 21 20 20
19 19 18 18 18 18 16 15
14 14 14 12 12 9 6
Although the arithmetic mean is not the only "mean" (there is also a geometric mean),
it is by far the most commonly used. Therefore, if the term "mean" is used without
specifying whether it is the arithmetic mean, the geometric mean, or some other
mean, it is assumed to refer to the arithmetic mean.
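
The formula translates directly into Python; a sketch using the touchdown-pass data above:

td_passes = [37, 33, 33, 32, 29, 28, 28, 23,
             22, 22, 22, 21, 21, 21, 20, 20,
             19, 19, 18, 18, 18, 18, 16, 15,
             14, 14, 14, 12, 12, 9, 6]

mean = sum(td_passes) / len(td_passes)       # mu = sum(X) / N
print(f"{sum(td_passes)} / {len(td_passes)} = {mean:.4f}")   # 634 / 31 = 20.4516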
Measures of Central Tendency
Median
The median is also a frequently used measure of central tendency. The median is the midpoint of a distribution: the same number of scores are above the median as below it. For the data in the table, Number of touchdown passes, there are 31 scores. The 16th highest score (which equals 20) is the median, because there are 15 scores below it and 15 scores above it. The median can also be thought of as the 50th percentile.
Let's return to the made-up example of the quiz on which you made a three, discussed previously in the module Introduction to Central Tendency and shown in the table below.

Three possible datasets for the 5-point make-up quiz

Student        Dataset 1   Dataset 2   Dataset 3
You                3           3           3
John's             3           4           2
Maria's            3           4           2
Shareecia's        3           4           2
Luther's           3           5           1

For Dataset 1, the median is three, the same as your score. For Dataset 2, the median is 4; therefore, your score is below the median, meaning you are in the lower half of the class. Finally, for Dataset 3, the median is 2; for this dataset, your score is above the median and therefore in the upper half of the distribution.
Computation of the median: when there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4. When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4 + 7) / 2 = 5.5.
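
The odd/even rule translates into a short Python sketch:

def median(xs):
    """Middle value for an odd count; mean of the two middle values otherwise."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([2, 4, 7]))        # 4
print(median([2, 4, 7, 12]))    # (4 + 7) / 2 = 5.5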
Measures of Central Tendency
Mode

The mode is the most frequently occurring value. For the data in the table, Number of touchdown passes, the mode is 18, since more teams (4) had 18 touchdown passes than any other number. With continuous data, such as response time measured to many decimals, the frequency of each value is one, since no two scores will be exactly the same (see the discussion of continuous variables). Therefore the mode of continuous data is normally computed from a grouped frequency distribution. The grouped frequency distribution table below shows the target response time data. Since the interval with the highest frequency is 600-700, the mode is the middle of that interval (650).
Grouped frequency distribution

Range        Frequency
500-600          3
600-700          6
700-800          5
800-900          5
900-1000         0
1000-1100        1
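
A sketch of both cases: the exact mode of the touchdown-pass data, and the midpoint-of-the-modal-interval rule for the grouped data above:

from collections import Counter

td_passes = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
             19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]
value, count = Counter(td_passes).most_common(1)[0]
print(value, count)                      # 18 occurs 4 times

# Grouped data: (lower bound, upper bound, frequency) from the table above
grouped = [(500, 600, 3), (600, 700, 6), (700, 800, 5),
           (800, 900, 5), (900, 1000, 0), (1000, 1100, 1)]
lo, hi, _ = max(grouped, key=lambda row: row[2])
print((lo + hi) / 2)                     # mode of grouped data: 650.0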
Probability
Probability theory is the mathematical study of phenomena characterized by randomness or uncertainty. More precisely, probability is used for modelling situations in which an experiment, realized under the same circumstances, produces different results (typically throwing a die or tossing a coin).
Mathematicians and actuaries think of probabilities as numbers in the closed interval from 0 to 1 assigned to "events" whose occurrence or failure to occur is random. Probabilities P(A) are assigned to events A according to the probability axioms.
The probability that an event A occurs given the known occurrence of an event B is the conditional probability of A given B; its numerical value is P(A|B) = P(A ∩ B) / P(B) (as long as P(B) is nonzero). If the conditional probability of A given B is the same as the ("unconditional") probability of A, then A and B are said to be independent events. That this relation between A and B is symmetric may be seen more readily by realizing that it is the same as saying P(A ∩ B) = P(A) P(B) when A and B are independent events.
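
As a sketch of these formulas (the two-dice experiment is illustrative, not from the text), the conditional probability and the independence check can be computed by enumerating equally likely outcomes:

from itertools import product
from fractions import Fraction

# Illustrative experiment: roll two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(outcomes))

A = {o for o in outcomes if sum(o) == 8}    # event A: the faces sum to 8
B = {o for o in outcomes if o[0] >= 4}      # event B: first die shows 4, 5, or 6

print(prob(A & B) / prob(B))   # P(A|B) = P(A and B) / P(B) = 1/6
print(prob(A))                 # P(A) = 5/36 != 1/6, so A and B are not independent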
Probability Distribution
Outcomes of an experiment and their probabilities of occurrence. If the experiment were repeated any number of times, the same probabilities should recur. For example, the probability distribution for the possible number of heads from two tosses of a fair coin would be as follows:

Number of heads   Outcomes                      Probability
0                 (tail, tail)                  0.25
1                 (head, tail), (tail, head)    0.50
2                 (head, head)                  0.25
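
A minimal sketch reproducing the table above by enumerating the four equally likely outcomes:

from itertools import product

tosses = list(product(["head", "tail"], repeat=2))   # 4 equally likely outcomes
for k in range(3):
    p = sum(1 for t in tosses if t.count("head") == k) / len(tosses)
    print(f"P({k} heads) = {p}")                     # 0.25, 0.5, 0.25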

In mathematics and statistics, a probability distribution, more properly called a probability distribution function, assigns to every interval of the real numbers a probability, so that the probability axioms are satisfied. In technical terms, a probability distribution is a probability measure whose domain is the Borel algebra on the reals. A probability distribution is a special case of the more general notion of a probability measure, a function that assigns probabilities satisfying the Kolmogorov axioms to the measurable sets of a measurable space. Additionally, some authors define a distribution generally as the probability measure induced by a random variable X on its range: the probability of a set B is P(X⁻¹(B)). However, this section discusses only probability measures over the real numbers.
Discrete Probability Distribution
The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when n = 1, the binomial distribution is simply the Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

Example
A typical example is the following: assume 5% of the
population is green-eyed. You pick 500 people randomly.
The number of green-eyed people you pick is a random
variable X which follows a binomial distribution with n = 500
and p = 0.05 (when picking the people with replacement).
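
A sketch of the example, using the standard binomial probability mass function P(X = k) = C(n, k) p^k (1 - p)^(n - k):

from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 500, 0.05                     # 500 people drawn, 5% green-eyed
print(binomial_pmf(25, n, p))        # probability of exactly 25 successes
print(sum(binomial_pmf(k, n, p) for k in range(26)))   # cumulative P(X <= 25)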
Sampling
In many disciplines, there is often a need to describe the
characteristics of some large entity, such as the air quality in a region,
the prevalence of smoking in the general population, or the output
from a production line of a pharmaceutical company. Due to practical
considerations, it is impossible to assay the entire atmosphere,
interview every person in the nation, or test every pill. Sampling is the
process whereby information is obtained from selected parts of an
entity, with the aim of making general statements that apply to the
entity as a whole, or an identifiable part of it. Opinion pollsters use
sampling to gauge political allegiances or preferences for brands of
commercial products, whereas water quality engineers employed by
public health departments will take samples of water to make sure it
is fit to drink. The process of drawing conclusions about the larger
entity based on the information contained in a sample is known as
statistical inference.
There are several advantages to using sampling rather than
conducting measurements on an entire population. An important
advantage is the considerable savings in time and money that can
result from collecting information from a much smaller number of units.
When sampling individuals, the reduced number of subjects that need
to be contacted may allow more resources to be devoted to finding
and persuading nonresponders to participate. The information
collected using sampling is often more accurate, as greater effort can
be expended on the training of interviewers, more sophisticated and
expensive measurement devices can be used, repeated
measurements can be taken, and more detailed questions can be
posed.
Sampling
Definitions
The term "target population" is commonly used to refer to the group of
people or entities (the "universe") to which the findings of the sample
are to be generalized. The "sampling unit" is the basic unit (e.g.,
person, household, pill) around which a sampling procedure is
planned. For instance, if one wanted to apply sampling methods to estimate the prevalence of diabetes in a population, the sampling unit would be persons, whereas households would be the sampling unit for a study to determine the number of households where one or more persons were smokers. The "sampling frame" is any list of all the sampling units in the target population. Although a complete list of all individuals in a population is rarely available, alphabetic listings of residents in a community or of registered voters are examples of sampling frames.
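
As an illustrative sketch (the frame below is made up), a simple random sample can be drawn from a sampling frame with Python's standard library:

import random

# Illustrative sampling frame: a made-up alphabetic list of residents
frame = [f"resident_{i:04d}" for i in range(5000)]

random.seed(7)                          # arbitrary seed, for a reproducible run
sample = random.sample(frame, k=100)    # simple random sample without replacement
print(len(sample), sample[:3])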
Central Limit Theorem
A central limit theorem is any of a set of weak-
convergence results in probability theory. They all express
the fact that any sum of many independent identically
distributed random variables will tend to be distributed
according to a particular "attractor distribution". The most important and famous result, the Central Limit Theorem, states that if the sum of the variables has a finite variance, then it will be approximately normally distributed.
Since many real processes yield distributions with finite
variance, this explains the ubiquity of the normal
distribution.
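
A simulation sketch of the theorem (the sample sizes and seed are arbitrary choices): sums of independent Uniform(0, 1) variables have finite variance, so their distribution should be approximately Normal(n/2, n/12):

import random
import statistics

random.seed(1)                        # arbitrary seed for reproducibility
n = 50                                # number of summands per trial
sums = [sum(random.random() for _ in range(n)) for _ in range(10_000)]

# Each Uniform(0,1) summand has mean 0.5 and variance 1/12, so the sum
# should be approximately Normal(n/2, n/12) by the Central Limit Theorem.
print(statistics.mean(sums))          # close to n/2 = 25.0
print(statistics.variance(sums))      # close to n/12 = 4.1667
print(statistics.median(sums))        # close to the mean, as for a normal curve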

Standard Error
In statistics, the standard error of a measurement, value or quantity
is the estimated standard deviation of the process by which it was
generated, including adjusting for sample size. In other words, the standard error is the standard deviation of the sampling distribution of the sample statistic (such as the sample mean, sample proportion, or sample correlation).
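
A sketch of the standard error of the sample mean, s/√n, applied to the touchdown-pass data used earlier:

import math
import statistics

td_passes = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
             19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]

s = statistics.stdev(td_passes)           # sample standard deviation
se = s / math.sqrt(len(td_passes))        # standard error of the mean
print(f"standard error of the mean = {se:.4f}")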
Sample Size
Sample size, usually designated N, is the number of repeated measurements in a statistical sample. It is used to estimate a parameter, a descriptive quantity of some population, and N determines the precision of that estimate: larger N gives smaller error bounds of estimation. A typical statement is that one can be 95% sure the true parameter is within ±B of the estimate, where B is an error bound that decreases with increasing N. Such a bounded estimate is referred to as a confidence interval for that parameter.
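
A sketch of this relationship under the usual normal approximation, where B = z·σ/√N and z = 1.96 at the 95% level (σ is an assumed known population standard deviation; the numbers are illustrative):

import math

def required_n(sigma, B, z=1.96):
    """Smallest N with z * sigma / sqrt(N) <= B (95% level when z = 1.96)."""
    return math.ceil((z * sigma / B) ** 2)

print(required_n(sigma=15.0, B=2.0))   # N = 217; halving B would quadruple N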

Estimation
Estimation is the calculated approximation of a result which is usable even if the input data are incomplete, uncertain, or noisy (in statistics, see estimation theory and estimators).
In mathematics, approximation or estimation typically means finding upper or lower bounds of a quantity that cannot readily be computed precisely. Even when initial results are too uncertain to be usable, feeding outputs back as refined inputs can make the results progressively more accurate, certain, complete, and noise-free.
Point Estimation
In statistics, point estimation involves the use of sample data to
calculate a single value (known as a statistic) which is to serve as a
"best guess" for an unknown (fixed or random) population parameter.
More formally, it is the application of a point estimator to the data.
Point estimation should be contrasted with Bayesian methods of
estimation, where the goal is usually to compute (perhaps to an
approximation) the posterior distributions of parameters and other
quantities of interest. The contrast here is between estimating a single
point (point estimation), versus estimating a weighted set of points (a
probability density function).

Interval Estimation
In statistics, interval estimation is the use of sample data to calculate
an interval of possible (or probable) values of an unknown population
parameter. The most prevalent forms of interval estimation are
confidence intervals (a frequentist method) and credible intervals (a
Bayesian method).
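
A sketch combining the two ideas on illustrative data: the sample mean as a point estimate, with a 95% confidence interval around it using the normal multiplier 1.96 (a t-multiplier would be more exact for so small a sample):

import math
import statistics

sample = [12.1, 11.4, 13.2, 12.8, 11.9, 12.5, 13.0, 12.2]   # illustrative data

point = statistics.mean(sample)                   # point estimate of the mean
se = statistics.stdev(sample) / math.sqrt(len(sample))
lo, hi = point - 1.96 * se, point + 1.96 * se     # 95% interval estimate
print(f"point estimate {point:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")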
Regression Analysis

In statistics, regression analysis is used to model relationships between variables and determine the magnitude of those relationships. The models can be used to make predictions.

Introduction
Introduction

Regression analysis models the relationship between one or more response variables (also called dependent variables, explained variables, predicted variables, or regressands), usually named Y, and the predictors (also called independent variables, explanatory variables, control variables, or regressors), usually named X1,...,Xp. Multivariate regression describes models that have more than one response variable.

Types of regression

Simple and multiple linear regression

Simple linear regression and multiple linear regression are related statistical methods for modeling the relationship between two or more random variables using a linear equation. Simple linear regression refers to a regression on two variables, while multiple regression refers to a regression on more than two variables. Linear regression assumes the best estimate of the response is a linear function of some parameters (though not necessarily linear in the predictors).

Nonlinear regression models

If the relationship between the variables being analyzed is not linear in parameters, a number of nonlinear regression techniques may be used to obtain a more accurate regression.
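
A minimal sketch of simple linear regression via the standard least-squares formulas (the data are illustrative):

xs = [1.0, 2.0, 3.0, 4.0, 5.0]       # predictor (e.g., advertising spend)
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # response (e.g., sales)

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)

b = sxy / sxx                        # slope
a = mean_y - b * mean_x              # intercept
print(f"Y = {a:.3f} + {b:.3f} X")    # fitted line; predict with a + b * x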
Correlation
The degree of relationship between business and economic variables such as cost and volume. Correlation analysis looks at how consistently the value of one variable changes when the value of the other changes; a strong correlation does not by itself establish a cause/effect relationship. A prediction can be made based on the relationship uncovered. An example is the effect of advertising on sales. The degree of correlation is measured statistically by the coefficient of determination (r-squared).

Coefficient of Determination
A statistical measure of goodness of fit. It measures how good the estimated regression equation is, and is designated r² (read as r-squared). The higher the r², the more confidence one can have in the equation. Statistically, the coefficient of determination represents the proportion of the total variation in the y variable that is explained by the regression equation. It has a range of values between 0 and 1. It is computed as

r² = explained variation / total variation = 1 - (unexplained variation / total variation)
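
A sketch computing r² as one minus the unexplained share of variation, continuing the least-squares example above (data illustrative):

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
b = sxy / sxx                                 # slope
a = mean_y - b * mean_x                       # intercept

fitted = [a + b * x for x in xs]
ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))
r_squared = 1 - ss_resid / ss_total           # proportion of variation explained
print(f"r^2 = {r_squared:.4f}")               # close to 1: a good fit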
