
Business Statistics

1. The sample data from a research survey conducted in various cities on the amount of time 13-
15 year-old children spent with mobiles are as follows:

City           Time with mobiles (hours per week)
Hyderabad      46
Mumbai         50
Pune           46
Bangalore      54
Bhubneshwar    42
Indore         30
Bhopal         42
New Delhi      50
Chandigarh     46

For the above sample, determine the following measures:


i. The mean
ii. The standard deviation
iii. The mode
iv. The 75th percentile
Based on your calculations comment on the time spent on mobile.
Ans:

(i)
For a set of data, we determine a single quantity that summarises the whole set. This
quantity is termed a measure of central tendency. The most commonly used measures are the mean,
median and mode.

Mean:

For ungrouped data, the mean is calculated as:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}, \quad \text{where } x \text{ is the study variable.} \tag{1.1}$$

Using formula (1.1), the average time with mobiles (hours per week) is:

$$\bar{x} = \frac{46 + 50 + 46 + 54 + 42 + 30 + 42 + 50 + 46}{9} = \frac{406}{9} = 45.11 \text{ hours}$$
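The arithmetic in formula (1.1) can be checked with a short sketch. The list name `hours` and the helper `mean` are illustrative names chosen here, not part of the original answer.

```python
# Weekly hours with mobiles, in the order the cities are listed in Q.1.
hours = [46, 50, 46, 54, 42, 30, 42, 50, 46]


def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


print(sum(hours))              # 406
print(round(mean(hours), 2))   # 45.11
```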

(ii)
In probability and statistics, the standard deviation of a probability distribution, random variable,
population or multiset of values is a measure of the spread of its values. It is usually denoted
by the letter σ (lower case sigma) and is defined as the square root of the variance. The variance
is the average of the squared differences between the data points and the mean, so it is expressed
in squared units. The standard deviation, being the square root of that quantity, measures the
spread of the data about the mean in the same units as the data. More formally, the standard
deviation is the root mean square (RMS) deviation of the values from their arithmetic mean.

For example, in the population {4, 8}, the mean is 6 and the deviations from the mean are {-2, 2}.
The squared deviations are {4, 4}, whose average (the variance) is 4. Therefore, the standard
deviation is 2. In this case 100% of the values in the population lie exactly one standard
deviation from the mean.

The standard deviation is the most common measure of statistical dispersion, measuring how
widely spread the values in a data set are. If the data points are close to the mean, the
standard deviation is small. Conversely, if many data points are far from the mean, the standard
deviation is large. If all the data values are equal, the standard deviation is zero.

Formula for the standard deviation of ungrouped data:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \tag{1.2}$$

City           Time with mobiles (hours per week)   (xi - x̄)    (xi - x̄)^2
Hyderabad      46                                    0.889       0.790123457
Mumbai         50                                    4.889       23.90123457
Pune           46                                    0.889       0.790123457
Bangalore      54                                    8.889       79.01234568
Bhubneshwar    42                                    -3.111      9.679012346
Indore         30                                    -15.111     228.345679
Bhopal         42                                    -3.111      9.679012346
New Delhi      50                                    4.889       23.90123457
Chandigarh     46                                    0.889       0.790123457


$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 376.888$$

The standard deviation for the data given in Q.1 is calculated as:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} = \sqrt{\frac{376.888}{9}} = 6.47 \text{ hours}$$
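A minimal sketch of formula (1.2) follows. The function names are illustrative. The second function, which divides by n - 1, is an addition not used in the answer above; it is included only because the question describes the figures as sample data, where that divisor is often preferred.

```python
import math

hours = [46, 50, 46, 54, 42, 30, 42, 50, 46]


def sum_of_squared_deviations(values):
    """Sum of (x_i - mean)^2 over all observations."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def population_sd(values):
    """Formula (1.2): divide the squared deviations by n."""
    return math.sqrt(sum_of_squared_deviations(values) / len(values))


def sample_sd(values):
    """Assumed alternative (not in the original answer): divide by n - 1."""
    return math.sqrt(sum_of_squared_deviations(values) / (len(values) - 1))


print(round(sum_of_squared_deviations(hours), 3))  # 376.889
print(round(population_sd(hours), 2))              # 6.47
print(round(sample_sd(hours), 2))                  # 6.86
```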

(iii)
Mode: The mode is the most common (frequent) value. A list can have more than one mode.
In case of data set given in Q.1, most frequent value is 46 (hours).
Therefore mode=46 hours
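Because a list can have more than one mode, a sketch that returns every most-frequent value may be useful. The helper name `modes` is an illustrative choice.

```python
from collections import Counter

hours = [46, 50, 46, 54, 42, 30, 42, 50, 46]


def modes(values):
    """Return all values that share the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(hours))         # [46]
print(modes([1, 1, 2, 2]))  # [1, 2]
```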

(iv)
The meaning of percentile can be captured by stating that the pth percentile of a
distribution is a number such that approximately p percent (p%) of the values in the distribution
are equal to or less than that number. So, if ‘28’ is the 80th percentile of a larger batch of
numbers, 80% of those numbers are less than or equal to 28. A percentile can be calculated
directly for values that actually exist in the distribution. To calculate percentiles, sort the data so
that x1 is the smallest value and xn is the largest, with n = total number of observations.
Statisticians have suggested various formulas for calculating percentiles; the formula below for
raw data is the simplest one.

First, sort the data in ascending order.

pth percentile = n(p/100)th item,

where n is the sample size.

Sorted data (time with mobiles, hours per week): 30 42 42 46 46 46 50 50 54

75th percentile = (9 × 75/100)th item = 6.75th item, which is taken as the 7th item = 50 hours
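The percentile rule above can be sketched as follows. The answer turns position 6.75 into the 7th item without naming a rounding rule; rounding up (ceiling) is assumed here, and the function name is illustrative. Other percentile conventions can give slightly different values.

```python
import math

hours = [46, 50, 46, 54, 42, 30, 42, 50, 46]


def percentile_item(values, p):
    """pth percentile as the n*(p/100)th item of the ascending data.

    Assumption: a fractional position is rounded up to the next item.
    """
    ordered = sorted(values)                 # ascending order
    position = len(ordered) * p / 100        # 6.75 for n = 9, p = 75
    rank = max(1, math.ceil(position))       # 1-based item number
    return ordered[rank - 1]


print(sorted(hours))               # [30, 42, 42, 46, 46, 46, 50, 50, 54]
print(percentile_item(hours, 75))  # 50
```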


2. ‘Mumbai Ice Cream’, an ice cream store, is studying the relationship between ice cream cones sold
and temperature. The store has taken a sample of a week’s data. The results of the sample are
given below:

Day Cones Sold Temperature


1 350 110
2 200 100
3 210 90
4 100 80
5 80 70
6 70 60
7 50 50

i. Which variable is the dependent variable?


ii. Compute the least squares estimated line.
iii. Is there a significant relationship between the sales of cones and temperature?
iv. Predict sales of a 95 degree day.
Ans:
(i)

Let Y be the cones sold and X be the temperature. Cones sold (Y) is the dependent variable,
because the sale of ice cream cones depends on temperature.
(ii)

With cones sold (Y) as the dependent variable and temperature (X) as the independent variable,
the least squares estimated line is given by
Y = a + bX (2.1)
The normal equations used to estimate the coefficients a and b are

$$\sum y = na + b\sum x \tag{2.2}$$

$$\sum xy = a\sum x + b\sum x^2 \tag{2.3}$$

The sums needed for the normal equations are calculated in the table below:

Day     Cones Sold (Y)   Temperature (X)   XY            X^2
1       350              110               38500         12100
2       200              100               20000         10000
3       210              90                18900         8100
4       100              80                8000          6400
5       80               70                5600          4900
6       70               60                4200          3600
7       50               50                2500          2500
Total   Σy = 1060        Σx = 560          Σxy = 97700   Σx^2 = 47600

Substituting the totals from the table into (2.2) and (2.3) gives the normal equations:
1060 = 7a + 560b (2.4)
97700 = 560a + 47600b (2.5)
Multiplying equation (2.4) by 80, we have
84800 = 560a + 44800b (2.6)
Subtracting equation (2.6) from (2.5), we have
12900 = 2800b
Therefore, b = 12900/2800 = 4.607
Substituting b = 4.607 in equation (2.4), we have
7a = 1060 - 560b = 1060 - 560(4.607) = -1519.92
Therefore, a = -217.131
Hence, the final least squares estimated line is
Y = -217.131 + 4.607X (2.7)
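The elimination above can be cross-checked by solving the two normal equations in closed form. This is a sketch with illustrative names; because it does not round b before computing a, the intercept comes out as -217.143 rather than -217.131.

```python
temperature = [110, 100, 90, 80, 70, 60, 50]   # X
cones = [350, 200, 210, 100, 80, 70, 50]       # Y


def fit_line(x, y):
    """Solve the normal equations (2.2) and (2.3) for a and b in Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = fit_line(temperature, cones)
print(round(a, 3), round(b, 3))  # -217.143 4.607
```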
(iii)

To test for significance, the data were analysed with the Excel data analysis tool. The regression
output is given below:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.92218168
R Square             0.850419052
Adjusted R Square    0.820502862
Standard Error       45.72432925
Observations         7

ANOVA
             df   SS            MS            F          Significance F
Regression   1    59432.14286   59432.14286   28.42672   0.003110347
Residual     5    10453.57143   2090.714286
Total        6    69885.71429

                  Coefficients    Standard Error   t Stat         P-value   Lower 95%      Upper 95%
Intercept         -217.1428571    71.25622064      -3.047352992   0.02851   -400.3128035   -33.97291076
Temperature (X)   4.607142857     0.864108601      5.331671105    0.00311   2.385880985    6.828404729

Since the p-value for the temperature coefficient (0.00311) is less than 0.05, the relationship
between X and Y is statistically significant.
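The t statistic for the slope can be reproduced by hand as a rough check on the Excel output. The sketch below uses only the standard library, so instead of a p-value it compares the statistic with 2.571, the two-sided 5% critical value of Student's t with 5 degrees of freedom. The function and constant names are illustrative.

```python
import math

temperature = [110, 100, 90, 80, 70, 60, 50]   # X
cones = [350, 200, 210, 100, 80, 70, 50]       # Y

# Two-sided 5% critical value of Student's t with n - 2 = 5 degrees of freedom.
T_CRITICAL = 2.571


def slope_t_statistic(x, y):
    """t = b / SE(b) for the simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual sum of squares and the standard error of the slope.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2) / sxx)
    return b / se_b


t_stat = slope_t_statistic(temperature, cones)
print(round(t_stat, 3))           # 5.332
print(abs(t_stat) > T_CRITICAL)   # True
```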

(iv)
To predict the sales on a 95 degree day, put X = 95 in the fitted line Y = -217.131 + 4.607X:
Y = -217.131 + 4.607(95)
Y ≈ 220.5, i.e. about 221 cones (220.536 with unrounded coefficients)
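The prediction step is a single substitution. The sketch below uses the unrounded coefficients from the Excel output; the function name and constant names are illustrative.

```python
# Coefficients as reported in the Excel regression output.
INTERCEPT = -217.1428571
SLOPE = 4.607142857


def predict_cones(temp):
    """Predicted cones sold at the given temperature, from Y = a + bX."""
    return INTERCEPT + SLOPE * temp


print(round(predict_cones(95), 3))  # 220.536
```

Note that 95 lies inside the observed temperature range (50 to 110), so this is an interpolation rather than an extrapolation.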
3. According to a recent study conducted by an academic researcher on the international
placement of students from leading institutes in India, there is high variation in the salaries
offered across institutes. The following details have been gathered from the placement offices of
the colleges. The researcher wants to understand the trends in international
placement based on the data he has gathered.

Amount (in USD in lakhs per annum)   Age   Marital Status   Type of institute   Gender
2                                    35    Single           University          Male
5                                    24    Married          PGDM                Male
3.5                                  29    Married          University          Female
5                                    26    Single           University          Male
4                                    26    Married          PGDM                Female
8                                    25    Single           PGDM                Female
15                                   34    Married          PGDM                Male
3                                    26    Single           PGDM                Male
7                                    23    Single           PGDM                Male

a. Using descriptive statistics explore salary, and identify factors that appear to influence the
amount of the salary received.
Ans:

The frequency distributions of Marital Status, Type of institute and Gender are given
below:

Marital Status
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Single    5           55.6      55.6            55.6
        Married   4           44.4      44.4            100.0
        Total     9           100.0     100.0

Type of institute
                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid   PGDM         6           66.7      66.7            66.7
        University   3           33.3      33.3            100.0
        Total        9           100.0     100.0

Gender
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Female   3           33.3      33.3            33.3
        Male     6           66.7      66.7            100.0
        Total    9           100.0     100.0
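The frequency tables can be reproduced with a simple count. The sketch below does this for Marital Status, with the values typed in the row order of the data table; the helper name is illustrative.

```python
from collections import Counter

# Marital Status column, in the row order of the data table in Q.3.
marital_status = ["Single", "Married", "Married", "Single", "Married",
                  "Single", "Married", "Single", "Single"]


def frequency_table(values):
    """Map each category to (frequency, percent rounded to one decimal)."""
    counts = Counter(values)
    total = len(values)
    return {k: (c, round(100 * c / total, 1)) for k, c in counts.items()}


print(frequency_table(marital_status))
# {'Single': (5, 55.6), 'Married': (4, 44.4)}
```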

To explore salary and identify the factors that appear to influence the amount of
salary received, we use multiple regression. Running the multiple regression in SPSS,
we have

ANOVA(a)
Model            Sum of Squares   df   Mean Square   F      Sig.
1   Regression   50.148           4    12.537        .688   .637(b)
    Residual     72.852           4    18.213
    Total        123.000          8
a. Dependent Variable: Amount (in USD in lakhs per annum)
b. Predictors: (Constant), Gender, Type of institute, Marital Status, Age

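Gender, Type of institute, Marital Status and Age are not statistically significant. The overall
model significance (Sig. = .637) and the p-values of the individual coefficients in the table
below are all greater than 0.05, so there is no statistically significant evidence that these
factors influence the amount of salary received.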

Coefficients(a)
                        Unstandardized Coefficients   Standardized Coefficients
Model                   B         Std. Error          Beta                        t        P-value
1   (Constant)          -4.530    10.499                                          -.431    .688
    Age                 .402      .420                .438                        .959     .392
    Marital Status      .875      3.237               .118                        .270     .800
    Type of institute   -4.829    3.508               -.616                       -1.377   .241
    Gender              .755      3.313               .096                        .228     .831
a. Dependent Variable: Amount (in USD in lakhs per annum)
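Since the question asks for descriptive statistics, group means of the salary are a natural complement to the regression. The sketch below is an addition to the original answer: it computes the mean Amount within each category from the nine rows of the data table. The names `records`, `COLUMNS` and `mean_amount_by` are illustrative, and with so few observations per group any difference should be read as descriptive only.

```python
from collections import defaultdict

# (amount, age, marital status, type of institute, gender), as in the data table.
records = [
    (2, 35, "Single", "University", "Male"),
    (5, 24, "Married", "PGDM", "Male"),
    (3.5, 29, "Married", "University", "Female"),
    (5, 26, "Single", "University", "Male"),
    (4, 26, "Married", "PGDM", "Female"),
    (8, 25, "Single", "PGDM", "Female"),
    (15, 34, "Married", "PGDM", "Male"),
    (3, 26, "Single", "PGDM", "Male"),
    (7, 23, "Single", "PGDM", "Male"),
]

COLUMNS = {"marital": 2, "institute": 3, "gender": 4}


def mean_amount_by(rows, column):
    """Mean salary amount within each category of the chosen column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[COLUMNS[column]]].append(row[0])
    return {k: sum(v) / len(v) for k, v in groups.items()}


print(mean_amount_by(records, "institute"))  # {'University': 3.5, 'PGDM': 7.0}
```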

b. Do a correlation analysis between ‘Amount’ and ‘Age’ and interpret the coefficient of
correlation.
Ans:
Correlation
In statistics, the word "correlation" has a very specific meaning. Statistical correlation
means that, given two variables X and Y measured for each case in a sample, variation in X
corresponds (or does not correspond) to variation in Y, and vice versa. That is, extreme values of
X are associated with extreme values of Y, and less extreme X values with less extreme Y
values. The correlation coefficient (Pearson r) measures the degree of this correspondence.

Correlation and causation


If one variable causally influences a second variable, then we would expect a strong
correlation between them. However, a strong correlation could also mean, for example, that they
are both causally influenced by a third variable. Therefore a strong observed correlation can
suggest a causal connection, but it doesn't per se indicate the direction or nature of that causation.

X→Y       X influences Y
X←Y       Y influences X
X↔Y       X and Y influence each other
X←A→Y     A influences X and Y

Alternative Explanations for Strong Observed Correlation

Important: Correlation between two variables does not prove X causes Y or Y causes X.
Example: There is a statistical correlation between the temperature of sidewalks in New York
City and the number of infants born there on any given day.

Pearson r
There is a simple and straightforward way to measure correlation between two variables. It is
called the Pearson correlation coefficient (r), named after Karl Pearson, who invented it. Its
longer name, the Pearson product-moment correlation, is sometimes used.

The formula for computing the Pearson r is as follows:

$$r = \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum (x_i - \bar{X})^2 \sum (y_i - \bar{Y})^2}}$$

The value of r ranges between +1 and -1:

• r > 0 indicates a positive relationship of X and Y: as one gets larger, the other gets larger.
• r < 0 indicates a negative relationship: as one gets larger, the other gets smaller.
• r = 0 indicates no linear relationship.

Let's intuitively consider how this formula works. It starts by subtracting the means from X and
Y, and then multiplying the results. When we subtract the mean from a variable, some of the
resulting values will be positive and some negative. When we subtract the means from both X
and Y, that will happen with both variables.

If there is no association between X and Y, there will be no systematic relationship between
$(x_i - \bar{X})$ and $(y_i - \bar{Y})$. Therefore the positive values of one will match up with positive and
negative values of the other randomly, and the same with negative values of the first variable.
Therefore when we take the sum of $(x_i - \bar{X})(y_i - \bar{Y})$, all these positive and negative results will
tend to cancel each other out, making r close to 0. However, if two variables are positively
associated, then positive values of $(x_i - \bar{X})$ will match up with positive values of $(y_i - \bar{Y})$, and
negative values with negative values. The sum of $(x_i - \bar{X})(y_i - \bar{Y})$ will then produce a positive r.
In a negative relationship, positive values of $(x_i - \bar{X})$ will match up with negative values of
$(y_i - \bar{Y})$, and vice versa. Then the sum of $(x_i - \bar{X})(y_i - \bar{Y})$, and r, will be negative.

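Note also that if we calculate the Pearson correlation of X with itself, the result will be 1:

$$r = \frac{\sum (x_i - \bar{X})(x_i - \bar{X})}{\sqrt{\sum (x_i - \bar{X})^2 \sum (x_i - \bar{X})^2}} = \frac{\sum (x_i - \bar{X})^2}{\sum (x_i - \bar{X})^2} = 1$$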

To find the correlation between Amount (in USD in lakhs per annum) (X) and Age (Y), we use the formula

$$r = \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum (x_i - \bar{X})^2 \sum (y_i - \bar{Y})^2}}$$

The calculations are given in the table below:
Amount (X)   Age (Y)   (xi - X̄)   (yi - Ȳ)   (xi - X̄)(yi - Ȳ)   (xi - X̄)^2   (yi - Ȳ)^2
2            35        -3.833     7.444      -28.537            14.694       55.420
5            24        -0.833     -3.556     2.963              0.694        12.642
3.5          29        -2.333     1.444      -3.370             5.444        2.086
5            26        -0.833     -1.556     1.296              0.694        2.420
4            26        -1.833     -1.556     2.852              3.361        2.420
8            25        2.167      -2.556     -5.537             4.694        6.531
15           34        9.167      6.444      59.074             84.028       41.531
3            26        -2.833     -1.556     4.407              8.028        2.420
7            23        1.167      -4.556     -5.315             1.361        20.753
Total                                        27.833             123.000      146.222

Amount (X) is in USD in lakhs per annum. The three totals are Σ(xi - X̄)(yi - Ȳ) = 27.833,
Σ(xi - X̄)^2 = 123.000 and Σ(yi - Ȳ)^2 = 146.222.

$$r = \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum (x_i - \bar{X})^2 \sum (y_i - \bar{Y})^2}} = \frac{27.83}{\sqrt{123.0 \times 146.22}} = 0.2075$$
There is a positive correlation between Amount (in USD in lakhs per annum) and Age. Since
r = 0.2075 is much closer to 0 than to 1, the positive linear relationship between the two
variables is weak.
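The tabulated calculation can be verified with a short sketch of the Pearson formula. The function and list names are illustrative; the third amount is entered as 3.5, as in the data table for Q.3.

```python
import math

amount = [2, 5, 3.5, 5, 4, 8, 15, 3, 7]        # X, USD in lakhs per annum
age = [35, 24, 29, 26, 26, 25, 34, 26, 23]     # Y


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


print(round(pearson_r(amount, age), 4))  # 0.2075
```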
