
ETC1000/ETW1000/ETX9000 Business and Economic Statistics

LECTURE NOTES

Topic 1: Knowing What is Happening



1. Introduction: What is Econometrics and Business Statistics (and Why Should I Learn It)?

Econometrics and business statistics is about getting the best information from data. When
you have good information, you can use it to make good decisions. The "econo" and
"business" parts mean that the data is economic or business data, and/or the decision is an
economic or business decision.

How can data be used in business and economic decision making? Consider some
decisions that business and government may need to make:

Business decisions
e.g. A telecommunications company is trying to decide whether to branch out into
a new demographic of mobile phone users by targeting teenagers. Will it be
economical? Will teenagers take them up?

Economic policy decisions
e.g. The government is trying to decide whether to continue to support
children/young adults who have grown up in state care after they turn 18. How
do these young adults survive after they leave care? Do they end up without
jobs and in poor health once the support is removed, and thus incur higher costs
to the government than if they were supported until they were 25?

How do we answer these questions in order to make such decisions?

Business managers and policymakers need data: they need evidence which can help them
understand the way things are, weigh up options, identify problems, look at consequences,
evaluate performance, etc. It is not enough to go on intuition or commonly accepted
wisdom: these can often be wrong or misleading.

What can you expect to get out of this topic?
- Learn how to present and summarise data in a meaningful way.
- Learn how to use data to make informed decisions.
- Develop critical and analytical thinking about data and decisions made from it.


2. Collecting Relevant Data

2.1 Recording Business Activities

With so much business activity taking place via computers, businesses already have data
being collected on many aspects of their activities – sales, costs, etc. This historical data
can be invaluable in identifying trends and planning for the future.

e.g. A telecommunications company may be interested in how their mobile phone
sales are going – how many phones are sold per month, and has this begun to
slow over the last few years as the market has become saturated? Is there a
clear pattern of busy months and quiet months?

Here is the data they were able to obtain from their business records.

[Figure: time-series plot of Mobile Phone Sales (Thousands), quarterly from Q1-96 to
Q3-06; vertical axis 0 to 800.]

From this plot of sales over time, we can see that mobile phone sales have changed a lot
over the last 10 years. In the December quarter of 2005, sales were more than twice those in
the December quarter of 1996. This suggests a general increase in sales over the 10 years.
Looking a little more closely at the plot, we can see that there are regular peaks in sales in
the December quarter of each year, followed by a couple of low-sales quarters – a
phenomenon known as seasonality. The peak in the December quarter of each year would
suggest that mobile phones tend to be a common Christmas present in Australia!
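
A quick way to check for seasonality like this is to average sales by quarter of the year.
Here is a minimal Python sketch of the idea; the sales figures below are made-up
placeholders, not the company's actual data.

    # Hypothetical quarterly sales (thousands); Q4 peaks mimic the Christmas effect.
    sales = {
        (1996, 1): 310, (1996, 2): 300, (1996, 3): 320, (1996, 4): 380,
        (1997, 1): 330, (1997, 2): 325, (1997, 3): 345, (1997, 4): 410,
    }

    # Average sales for each quarter of the year, across all years.
    for q in (1, 2, 3, 4):
        values = [v for (year, quarter), v in sales.items() if quarter == q]
        print(f"Q{q} average: {sum(values) / len(values):.1f} thousand")

A clear Q4 peak in these averages is exactly the seasonal pattern described above.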


2.2 Surveys and Sampling

A sample survey is a very common source of data. Surveys can be a useful way of gaining
specific information.

e.g. Tourism: You want to assess customer satisfaction with the service they
receive at your tourist resort.

Retail sales: You want to understand spending patterns of people in a
particular product area so that you can target advertising better.

Local Council: You want to assess community needs so that you can ensure
that facilities being provided are appropriate.

There are two big issues in collecting survey data.

(1) Survey Design

The results of a survey are only as good as the design of the questions / form. For
example, the wording of questions is vital.

A bad question:
In light of recent claims that the Premier has engaged in corrupt practices,
who do you think is the more honest leader: the Premier or the Leader of the
Opposition?

The wording is designed to influence respondents towards a particular outcome.

This could be worded better:
Who do you think is the more honest leader: the Premier or the Leader of the
Opposition?

(2) Sample Design

Obviously you cannot survey everyone in the population, so you need to take a sample –
it would take too long to do the surveys, not everyone would be willing to
participate, and the time and cost of processing would be too great.

e.g. You might ask a small group of visitors to your tourist resort to complete a
questionnaire at the end of their visit. The hope is that this small sample of
respondents will be representative of the wider population of visitors to the
resort.

We need to ensure that our sample is representative of the population. When a
SAMPLE produces results which are not representative of the POPULATION, we say
that the sample is BIASED.

In particular, there are 2 sources of bias in sample design.

Selection Bias

e.g. If I choose my sample of tourists according to my impressions of people, I
might choose those who look friendly and willing to participate – they would
be more likely to say yes to completing the questionnaire. BUT they could
give a distorted picture of customer satisfaction. I may end up choosing all
the people who had a good time, who had no complaints, and not select any
grumpy, unhappy people. The results of my survey will then show things to
be better than they actually are.

This bias is known as SELECTION BIAS.

How do we ensure that our sample is representative?

The key is RANDOMNESS: selecting the sample in some kind of random way. This
reduces the possibility that the sample may not represent the population well.

e.g. Each hour, randomly choose a number between 1 and 60. If in this hour I
draw No. 11, say, then at 11 minutes past the hour, I ask the next customer
who comes through the door to complete the questionnaire.
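
This selection rule is easy to simulate. A minimal Python sketch, assuming the resort
is open from 9am to 5pm (the opening hours are an assumption for illustration):

    import random

    # For each hour of the day, draw a random minute from 1 to 60; the next
    # customer through the door at that minute is asked to complete the survey.
    for hour in range(9, 17):            # assumed opening hours: 9am-5pm
        minute = random.randint(1, 60)   # every minute is equally likely
        print(f"{hour:02d}:{minute:02d} - survey the next customer")

Because every minute (and hence every customer) has the same chance of selection, my
impressions of people play no role in who ends up in the sample.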

Non-Response Bias

Even if I take a random sample to avoid selection bias, I can still encounter a second
source of bias: non-response bias.

Whenever you do a survey, not all those who you ask to complete the survey will
respond. Non-response can be a problem in two ways:

First, it makes the sample smaller than we might like (e.g. you may post out a survey
to 200 customers and only get 30 responses – a sample of size 30 is pretty small).

Secondly, there may be a bias inherent in who responds and who doesn't. Suppose,
for example, I ask for feedback on this subject via a voluntary survey – students can
complete the survey if they wish. What might happen is that only very unhappy
students bother completing the survey – they have something to complain about.
Those who are generally happy don't bother filling out the survey. So my survey
results will be biased – they will suggest things are much worse than they actually are.

This bias is known as NON-RESPONSE BIAS.


3. Summarising Data in Meaningful Ways

We need to summarise data in ways that are appropriate for the type of data we have:

- Numerical or Quantitative Data – data that takes a numerical value.

Numerical or quantitative data can be of two types:

(1) Discrete

e.g. Number of people in a household, number of times you've been to
the doctor in the last year, number of children in school, etc.

(2) Continuous

e.g. Your height, value of the Consumer Price Index (CPI), the
unemployment rate, etc.

- Categorical or Qualitative Data – data that do not take on numerical values, but
can be classified into distinct categories (e.g. country of birth, gender, day of the
week, etc.)


3.1 Tables and Charts for Numerical Data

Suppose you have data on income levels of every household in a particular suburb: there
are 20,000 data points. You want some easy way of capturing the characteristics of
household incomes in that suburb. A snapshot of the data is given below.

[Table: snapshot of the raw data – one annual income figure per household.]

The first thing to do with any sort of data is work out the question that you want to answer.
For example, using this data you may be interested in what the income distribution is for
this suburb, and you could ask a question like "are households in this suburb mostly
affluent, or is there an uneven distribution of income?" One of the best ways of answering
this question using a table is by creating a FREQUENCY DISTRIBUTION: a table which
shows the number of households earning income within particular ranges.

e.g. [Table: frequency distribution of household incomes in $10,000 class ranges,
showing the number of households in each range.]

That is, we look at the incomes of these 20,000 households and count the number of
households with incomes in each class range. So in this case, there are 14 households who
earn between $10,001 and $20,000 per annum. In tutorials next week you will learn how to
automatically create frequency distributions like this in Excel using Data Analysis,
Histogram from the Data Tab.
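
Outside Excel, the same counting can be done in a few lines of code. A minimal Python
sketch, using a handful of made-up incomes and $10,000 class widths (the real data set
has 20,000 values):

    # Hypothetical household incomes ($ per annum).
    incomes = [8500, 15200, 18700, 23400, 31800, 34500, 47200, 52900, 61000]

    width = 10_000
    counts = {}
    for x in incomes:
        # Class ranges like $10,001-$20,000: the upper boundary is included.
        upper = ((x - 1) // width + 1) * width
        counts[upper] = counts.get(upper, 0) + 1

    for upper in sorted(counts):
        print(f"${upper - width + 1:>7,} - ${upper:>7,}: {counts[upper]}")

Each printed row corresponds to one row of the frequency distribution table.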

We can also take this information and present it in a more visually appealing manner as a
HISTOGRAM. e.g.

[Figure: histogram of the household income frequency distribution above.]

A histogram will be produced by Excel if you tick "Chart Output" in Data Analysis, Histogram.
Make sure your histogram is well presented:

- Headings and labels for axes, including units of data.

- No gaps between the bars – this is not the default Excel output; you will see in
tutorials how to fix this.

BUT: reading from charts tends to be more approximate than reading from tables.

The problem with the table and chart above is that in their current form, the numbers are not
particularly meaningful. Is 14 households a large or a small number? A more meaningful
table or chart would be one that presents the frequency column as a percentage of the
20,000 households.

e.g. [Table: the same frequency distribution with the frequency column expressed
as a percentage of the 20,000 households.]

This allows us to say things like "0.07% of households earn between $10,001 and $20,000 p.a.".

Some points about creating frequency distributions:

- Don't have too few classes (ranges are too broad, and useful information is lost),
but don't have too many (too much detail, hard to form overall impressions).
Somewhere between 5 and 15 is normal, depending on how much original data you
have, and how much it varies.

- Select sensible class boundaries – nice round numbers (you will find that Excel's
automatic class (bin) ranges don't do this, and so it's better to enter your own).

- Giving cumulative frequencies and cumulative percentages is also a useful
addition. It allows us to say things like "20.15% of households earn $40,000 or
less p.a.".
e.g. [Table: the frequency distribution extended with cumulative frequency and
cumulative percentage columns.]

3.2 Tables and Charts for Categorical Data

Suppose we had some qualitative information on a bunch of people in our suburb,
including information on whether the individual has been diagnosed with particular
medical conditions. Specifically, individuals were asked to indicate their primary medical
condition out of Asthma, Cancer, Depression, Diabetes, Heart Disease or None of Above.
A snapshot of the data is shown below.



Since individuals indicate only one medical condition, they can be categorised into one of 6
categories related to medical condition: Asthma, Cancer, Depression, Diabetes, Heart
Disease or None of Above.

How can we organise and summarise this categorical data? We can also create a frequency
distribution to see how many people are in each category. In Excel, you would use a
different function: From the Insert tab, select Pivot Table. You will look at pivot tables in
tutorials.

The frequency distribution for this example is as follows:

Sum of Victorians
Medical Condition      Total
Asthma                   628
Cancer                   575
Depression               720
Diabetes                 322
Heart Disease            348
None of Above           4871
Grand Total             7464

From this table we can see that, for example, 322 individuals out of the 7,464 respondents
in the sample have diabetes as their primary medical condition.
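
The same one-way count can be produced in code. A minimal Python sketch with a few
made-up responses (the real survey has 7,464):

    from collections import Counter

    # Hypothetical responses: each person's primary medical condition.
    conditions = ["Asthma", "None of Above", "Depression", "Diabetes",
                  "None of Above", "Cancer", "Asthma", "Heart Disease"]

    freq = Counter(conditions)
    n = len(conditions)
    # most_common() sorts from most to least frequent - the Pareto ordering
    # mentioned below.
    for condition, count in freq.most_common():
        print(f"{condition:<15} {count:>4}  ({count / n:.1%})")

The percentage column anticipates the conversion to percentages discussed shortly.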

A BAR CHART is the most common way of presenting this kind of categorical data. It is
easy to read and interpret.

[Figure: bar chart titled "Medical Conditions in Victoria" – Number of Individuals
(0 to 6,000) by Diagnosis: Asthma, Cancer, Depression, Diabetes, Heart Disease,
None of Above.]

Note here that our categories are in alphabetical order. Since the categories have no natural
ordering, it really wouldn't matter if we, say, put diabetes first. You may even like to order
the categories from most frequent to least frequent – this special presentation, called a
PARETO CHART, helps to distinguish the "vital few" from the "trivial many". Be sure to
use bars and not a line graph which joins the categories, as with no natural ordering it
doesn't make sense to join the dots!

As with the numerical data case, this information is often more useful in percentages. This
is quick to do in Excel, once the pivot table has been constructed: right-click on the table,
select "Field Settings" and choose "Options". You can then show data as "% of column",
and it will automatically change the frequencies to percentages.

Sum of Victorians
Medical Condition      Total
Asthma                 8.41%
Cancer                 7.70%
Depression             9.65%
Diabetes               4.31%
Heart Disease          4.66%
None of Above         65.26%
Grand Total          100.00%

So, 4.31% of the individuals reported diabetes as their primary medical condition.

Since the categories are mutually exclusive (individuals can only be in one category) and
exhaustive (there are no other possible categories – the total is 100%), a PIE CHART can
be used to present the same information:

[Figure: pie chart titled "Medical Conditions in Victoria" with segments for Asthma,
Cancer, Depression, Diabetes, Heart Disease and None of Above.]


The pie chart is often popular – it is visually more appealing, and also shows nicely how
the overall population is divided up into its various categories. However, the bar chart is
easier to read accurately – it is easier to judge the lengths of bars in a bar chart than angles /
areas in a pie chart.


3.3 Tables for Bivariate Data

So far we have been looking at just one characteristic of interest for the data – medical
condition. i.e. We have been analysing univariate data. Univariate data often provides a
description that is too simplistic and can even be misleading if there are other factors at
work behind the univariate categories. It may be more useful to look at two characteristics,
and look at frequencies in each pair-wise category. i.e. We want to look at bivariate data.

Let's consider an example. Suppose the government wants to find out more about the
people with each medical condition, so that it can come up with policies aimed at
reducing the prevalence of the condition.

e.g. If people who exercise more have a lower incidence of illness, the government
may push for people to become more active. (After all, expenditure on health
care is an important component of the government budget, and so it is in the
government's best interests, both financially and socially, to monitor and
reduce the prevalence of medical conditions.)

We have information on how much exercise each individual does. So, we have data for
each individual on medical condition AND on exercise – "Moderate to Frequent Exercise" or
"Minimal Exercise". For each individual we have a pair of data points: we have bivariate
data.



How do we present and summarise bivariate data? We use a CONTINGENCY TABLE. A
contingency table is a two-way frequency distribution. It can be produced as a pivot table
in Excel. You'll create some pivot tables in tutorials next week.

e.g.

Sum of Victorians                           Exercise
Medical Conditions    Moderate to Frequent Exercise    Minimal Exercise    Grand Total
Asthma                                          254                 374            628
Cancer                                          172                 403            575
Depression                                      249                 471            720
Diabetes                                         80                 242            322
Heart Disease                                    74                 274            348
None of Above                                  1861                3010           4871
Grand Total                                    2690                4774           7464

You will notice that we have split the medical condition frequency distribution into
categories based on exercise. For example, we can see the same total values that we saw
in the univariate table: out of the sample of 7,464 people, there were 322 people with
diabetes as their primary medical condition. Further, 242 people both suffer from Diabetes
AND do Minimal Exercise.
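
In code, a contingency table is one call to pandas' crosstab. A minimal sketch with a
hypothetical slice of the data:

    import pandas as pd

    # A hypothetical slice of the bivariate data set.
    df = pd.DataFrame({
        "condition": ["Asthma", "Diabetes", "None of Above", "Diabetes", "Asthma"],
        "exercise":  ["Minimal", "Minimal", "Moderate to Frequent",
                      "Minimal", "Moderate to Frequent"],
    })

    # Two-way frequency distribution with row/column totals ("All").
    print(pd.crosstab(df["condition"], df["exercise"], margins=True))

    # The "% of column" version: each exercise group sums to 100%, which
    # controls for the different group sizes, as discussed below.
    print(pd.crosstab(df["condition"], df["exercise"], normalize="columns"))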

Why would this bivariate presentation be more useful? Because it allows us to see whether
there are any differences in medical conditions across the different categories of exercise.
In particular, the bivariate table suggests that a large number of those with Diabetes do
Minimal Exercise. This information would be much more interesting to the government – it
indicates a possible relationship between diabetes and exercise, and suggests a policy angle
that the government might take to reduce the prevalence of diabetes. By allowing us to see
a second dimension to the data, the bivariate pivot table allows us to get a much richer
picture of the data than the univariate pivot table.

You will notice that in each category of medical condition, there are many more people
who do Minimal Exercise. But, we need to be careful here: because there are more people
in the Minimal Exercise group in total, naturally there will be more people in this category
with medical conditions. Again, we need to use percentages.

We could use the following "% of row" pivot table:

Sum of Victorians                           Exercise
Medical Conditions    Moderate to Frequent Exercise    Minimal Exercise    Grand Total
Asthma                                       40.45%              59.55%        100.00%
Cancer                                       29.91%              70.09%        100.00%
Depression                                   34.58%              65.42%        100.00%
Diabetes                                     24.84%              75.16%        100.00%
Heart Disease                                21.26%              78.74%        100.00%
None of Above                                38.21%              61.79%        100.00%
Grand Total                                  36.04%              63.96%        100.00%

From this table, we can see that 75.16% of people with Diabetes as their primary medical
condition do Minimal Exercise.

BUT: this doesn't really tell us anything more meaningful than the previous contingency
table, given that we are interested in whether more exercise improves health. Again, there
are more people in the minimal exercise group, thus the percentage will always be the
largest in the minimal exercise column.

To answer our question, we should use the "% of column" contingency table:

Sum of Victorians                           Exercise
Medical Conditions    Moderate to Frequent Exercise    Minimal Exercise    Grand Total
Asthma                                        9.44%               7.83%          8.41%
Cancer                                        6.39%               8.44%          7.70%
Depression                                    9.26%               9.87%          9.65%
Diabetes                                      2.97%               5.07%          4.31%
Heart Disease                                 2.75%               5.74%          4.66%
None of Above                                69.18%              63.05%         65.26%
Grand Total                                 100.00%             100.00%        100.00%

This pivot table effectively takes into account (controls for/ standardises according to)
differences in population size of each exercise category.

From this table, we can say things like "5.07% of the people who do Minimal Exercise have
Diabetes as their primary medical condition". Comparing this with the rate of Diabetes in the
overall sample (4.31%, as shown in the Grand Total column and as we saw in the univariate
table), this would suggest a connection between Diabetes and the level of exercise a person
undertakes.
If we believe that lack of exercise contributes to development of diabetes, a strategy for the
government might be to promote a more active lifestyle amongst people in the suburb.


3.4 Descriptive (Summary) Statistics

When we have a set of numerical or quantitative data, it is common to try and summarise
characteristics of the data with what we call summary measures.

Getting some idea of the general characteristics of the data is a good starting point. We can
do this quite quickly and easily in Excel using the Data tab, Data Analysis, Descriptive
Statistics, and selecting the "Summary Statistics" box.

Using the data on 20,000 households in our suburb, we get the following output:

[Table: Excel Descriptive Statistics output for household income, listing the
statistics discussed below.]

It is worth knowing how these statistics are calculated, what they mean and their
limitations. But before we start, let's introduce some notation:

Consider the following representation:

    \sum_{i=1}^{n} X_i

This means "sum the Xs from X_1 to X_n". The capital Greek letter Σ, pronounced "sigma",
is used widely in mathematics and statistics as shorthand for "sum a set of values as
described by the notation following it". The i indicates the element/observation number,
while the value specified below Σ is the term to begin with and the value above Σ is the
term to finish with.

That is:

    \sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5

and

    \sum_{i=3}^{4} 2X_i = 2X_3 + 2X_4.

You will have some practice with summation operators in tutorials next week.
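
The summation notation maps directly onto code. A tiny Python sketch of the two
examples above (note Python lists are 0-indexed, so X_1 is X[0]):

    X = [2, 4, 6, 8, 10]   # X_1, ..., X_5 (made-up values)

    # Sum of X_i for i = 1 to 5.
    total = sum(X[0:5])                    # X_1 + ... + X_5 = 30

    # Sum of 2*X_i for i = 3 to 4.
    partial = sum(2 * x for x in X[2:4])   # 2*X_3 + 2*X_4 = 28
    print(total, partial)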

Now, back to the summary statistics:

(1) Mean

The MEAN is the arithmetic average of the n data points:

    \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}   or   \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}

So, for our 20,000 data points the formula becomes:

    \bar{X} = \frac{\sum_{i=1}^{20000} X_i}{20000}   or   \bar{X} = \frac{X_1 + X_2 + \cdots + X_{20000}}{20000}

The mean is the most common measure of central tendency. Generally it is a good
measure. BUT it can be affected by a few extreme values.

e.g. If you are looking at average household income in a particular suburb, there
may be one household that earns an extremely high income which pushes up
the mean, so that it gives a misleading picture of where incomes of most
households in the suburb sit.

(2) Standard Error – ignore

(3) Median

The MEDIAN is the middle number when the data is ordered from smallest to
biggest. 50% of values are below it, and 50% above it.

The median has an advantage over the mean in that it is not affected by a few
extreme values. It does a good job of telling us what the typical income of a
household is in that particular suburb. BUT the median can also be misleading – it
takes no account of how the data is distributed around the middle.

e.g. Consider 5 housing properties that are up for sale in Clayton with prices at
$270,000; $320,000; $460,000; $470,000 and $480,000. In Wantirna South,
5 similar housing properties up for sale are priced at $450,000; $450,000;
$460,000; $850,000 and $1,000,000.

Both suburbs have a median price of $460,000, so we would say a typical
housing property has the same price in each suburb. But clearly, most
housing properties in Wantirna South are more costly than those in Clayton.
The median doesn't give the full picture. Information about the range of
property prices within each suburb does not enter into the computation of the
median.
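
We can check this suburb example directly in code – a minimal Python sketch using the
prices above:

    from statistics import mean, median

    clayton  = [270_000, 320_000, 460_000, 470_000, 480_000]
    wantirna = [450_000, 450_000, 460_000, 850_000, 1_000_000]

    # Same median, very different means: the two large Wantirna South prices
    # pull the mean up but leave the median untouched.
    print(median(clayton), median(wantirna))   # both 460000
    print(mean(clayton), mean(wantirna))       # 400000 vs 642000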
(4) Mode

The MODE is the most frequently occurring value. It is sometimes a useful
summary measure, for either numerical or categorical data.

e.g. You may have data on the number of people in each household in a suburb.
The mode is 4, telling us the most common household size is 4 people.

Sometimes the mode is of little interest if there are few repeated values. It needs
data which, by nature, has frequent repeats.

(5) Standard Deviation

and

(6) Sample Variance

The VARIANCE and STANDARD DEVIATION are the most commonly used
measures of variation or spread in numerical data. Often it is interesting to know
how spread out the data is.

e.g. We usually have an average result in this subject of around 65%, which is
very good. But we also need to know how spread out the results are: does
virtually everyone get between 55 and 75 (i.e. a good pass rate, but hard to
score well), or are results more spread out (indicating that a significant number
fail, and some score very high marks)?

The variance measures the average of the squared variation about the mean:

    s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.

i.e. We are looking for some indication of how much the data varies around the
mean. If the data is all close to the mean, then (X_i - \bar{X}) will be small for
all the values (that is, all i), and hence the average of their squares will be
small. Conversely, data which varies greatly above and below the mean will
have big (X_i - \bar{X}) values, hence a big variance.

N.B. Why divide by n - 1 and not n? This is a sample variance, and is being used
to estimate a population variance. It turns out that dividing by n - 1 gives a
better estimate of the population variance than dividing by n – we'll think
more about this in the next topic.

Standard deviation is just the square root of the variance. It is much easier to
interpret than the variance:

Strictly speaking, s is the square root of the average of the squared deviations from
the mean.

But a more understandable interpretation is: some X values are above the mean;
others are below the mean. s is an estimate of the average amount that the Xs vary
from the mean, either above or below. This isn't exactly correct, but it's at least
understandable!
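
A short Python sketch of both calculations, on a handful of made-up exam marks:

    from statistics import stdev, variance

    X = [52, 58, 61, 65, 70, 74, 78]   # hypothetical marks
    n = len(X)
    x_bar = sum(X) / n

    # Sample variance: sum of squared deviations from the mean, divided by n - 1.
    s2 = sum((x - x_bar) ** 2 for x in X) / (n - 1)
    s = s2 ** 0.5

    print(s2, variance(X))   # the two variance calculations agree
    print(s, stdev(X))       # ditto for the standard deviation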
(7) Kurtosis - ignore

(8) Skewness

SKEWNESS tells us about the shape of the distribution. In particular, it tells us about
how the data is distributed around the mean. Consider three typical histograms,
each with the same mean and standard deviation:













Symmetric – zero skewness












Negative skewness                      Positive skewness


Skewness describes the degree and direction of asymmetry in the data. The first
(upper-most) histogram represents a symmetric distribution. The skewness measure
would be zero in this case. The histogram is identical either side of the mean.

The "skewed" portion is the long, thin part of the curve. The lower left-hand-side
histogram is what we call negatively skewed – giving a negative number for the
skewness measure. There is a long left-hand tail formed by a scattering of some
very small values, but the bulk of the data sits further to the right. This data might
represent exam marks in this subject – most people score around 55%-80%, with
very few above this. But there is a long tail of people to the left who get below
55%; marks as low as 10% or 20% do happen.

The lower right-hand-side histogram is positively skewed – a long tail to the right.
This data could be, perhaps, household income levels. Most earn $30,000 to
$50,000, with very few earning much less than this. But there is a long tail of
households to the right who earn much more – $60,000, $80,000, $100,000 or
more.

[Each of the three histograms above plots Frequency (0 to 70) against values grouped
into bins 0-10, 11-20, ..., 91-100, More.]
(9) Range

The RANGE is another measure of spread or variation. It is simply the difference
between the maximum and minimum values in our data set.

(10) Minimum

The smallest value in our data set.

(11) Maximum

The largest value in our data set.

(12) Sum

The sum of all values in the data set – not usually very interesting.

(13) Count

The number of values in the data set.

Note that the bigger the sample is, the closer we are to having the population. So, if
we only have a few values to summarise, we may have a sample that is not
representative of the true population, and consequently our descriptive statistics
may not be meaningful.


3.5 Analysis of Variance (ANOVA)

We have just seen how summary statistics can be calculated from numerical data: e.g. we
saw that mean income in our suburb was $44,395. But what if we had bivariate data – in this
case, for each household, we have data on income (a numerical variable) and also
data on household type (a categorical variable)? That is, a snapshot of the data would look
like this:



With this sort of data, we could calculate summary statistics for income for each different
household type. To do this using Excel's Descriptive Statistics tool, we would need to first
sort the data by household type and rearrange it as follows (just the first 5 rows are shown
to save space):

[Table: income data rearranged into one column per household type.]


We can then obtain summary statistics for each of these household types as presented in the
columns. If there are a different number of observations in each category, then you'll need
to run the descriptive statistics for each category separately.



Here we see that mean household income is highest among couples without children, and
lowest amongst single adults without children. Generally, however, there is quite a large
degree of variation in incomes for households comprising couples, compared to singles
(this would, of course, be due to the fact that couples could have 1 or 2 working adults).
We can get a similar sort of output in one go using ANOVA: Single Factor in Excel's Data
Analysis tool. ANOVA stands for ANALYSIS OF VARIANCE, and it is primarily used to
compare the mean of a numerical variable across groups defined by a categorical variable.
In our case, we'd be comparing mean income across each household type group.

Here's the output we get when we do this:


Notice that the first block of summary output gives some (but not all) of the statistics that
we obtained from the Descriptive Statistics. We could interpret these as we did previously.

The second block of output performs what is known as Analysis of Variance or ANOVA.
When we perform ANOVA, we are essentially trying to figure out what the different
sources of variation are in the data. In other words, why are all the incomes not exactly the
same? They clearly are not all the same – there is variation in household incomes.

Here's how it works. Income varies. It varies by an amount given by the total sum of
squared deviations from its overall mean:

    Total Variation in Income (X) = \sum_{i=1}^{n} (X_i - \bar{X})^2

In Excel's ANOVA output, the total variation in income is given in the Total row of the SS
column (7.488 x 10^12).

This total variation in income is then decomposed into two parts (sources of variation):

(1) Between Groups Variation

This is the variation that is explained by membership to the groups in this case, it
is the variation in income that can be attributed to household structure.

A measure of the amount of Between Groups Variation is the Between Groups Sum of
Squares (under the SS column in the Excel output). It is a measure of how much the
group means vary from the overall mean:

    \text{Between Groups SS} = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2

where:  k is the number of groups
        n_j is the number of observations in group j
        \bar{X}_j is the mean of group j
        \bar{X} is the overall mean

(2) Within Groups Variation

This is the variation that is not explained by membership to the groups. That is, it is
the variation in income due to other factors (e.g. education of household members),
including chance or randomness.

    \text{Within Groups SS} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2

where:  X_{ij} is the ith observation on X in group j

We can compare the amount of explained variation with the amount of unexplained
variation. But to do this we need to convert the above into average measures. We do this by
dividing by the df column. The result is the MS column.

The MS Between Groups in our example is 6.704 x 10^11 while the MS Within Groups is
2.739 x 10^8 – the variation in income attributable to household structure is more than 2400
times the amount due to other factors (the F column gives us this ratio). It certainly
seems like the explained part is much larger than the unexplained part, so we would
consider household structure to be an important factor in determining household income. In
particular, it suggests that mean household income is not the same across groups defined by
household type.

So, how much larger does the MS Between Groups need to be than the MS Within Groups
for us to conclude that membership of the groups is an important source of variation?
The amount differs depending on a couple of factors – how many observations we have
and how many groups there are. We'll tell you how this works in Topic 3, but for now, the
cut-off value is given in the column "F crit". That is, if F is bigger than F crit in the Excel
output, then we'd conclude that the MS Between Groups is indeed bigger than the MS
Within Groups, and thus membership of the groups is an important source of variation.

In our example, the Excel output gives us:

    F = \frac{\text{SS Between Groups} / \text{df Between Groups}}{\text{SS Within Groups} / \text{df Within Groups}}
      = \frac{\text{MS Between Groups}}{\text{MS Within Groups}} = 2447.751

and:

    F crit = 2.605

So since F = 2447.751 > 2.605 = F crit, we say that the explained variation is much bigger
than the unexplained variation, and therefore household structure is indeed an important
factor in determining household income.
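
The whole decomposition is straightforward to reproduce in code. A minimal Python
sketch with tiny hypothetical groups (the real data has 20,000 households):

    # Hypothetical incomes for three household types.
    groups = {
        "couple, no children":  [78_000, 85_000, 91_000],
        "couple with children": [65_000, 72_000, 70_000],
        "single, no children":  [31_000, 28_000, 35_000],
    }

    all_x = [x for g in groups.values() for x in g]
    grand_mean = sum(all_x) / len(all_x)

    # Between-groups SS: how much the group means vary from the overall mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Within-groups SS: variation of observations around their own group mean.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups.values())

    k, n = len(groups), len(all_x)
    F = (ss_between / (k - 1)) / (ss_within / (n - k))   # MS Between / MS Within
    print(F)

A large F (relative to the critical value) again says that group membership explains much
more of the variation than chance does.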

We will build on this idea further in Topic 3.


3.6 Statistics for More than One Variable: Making Appropriate Comparisons

To make meaningful comparisons between different variables, e.g. incomes in different
suburbs, malnutrition rates among children in different developing countries, prices of
shares in different industries, etc., it is important to have standardised data sets/units.

Specifically, this might involve dividing the variable of interest by:

- The total, to convert magnitudes into percentages (to allow for different total
magnitudes).

e.g. "Shares in Macquarie Bank are much more profitable than shares in Westpac
Bank: Macquarie shares grew by $3.30 this year, while Westpac shares only
grew by $1.26."

But: at the beginning of the year, a share in Macquarie cost $69.00, while a
share in Westpac cost $26.52. We should convert the price growth into a
percentage return in order to make an appropriate comparison of the more
profitable investment:

Return on Macquarie share = $3.30 / 69.00 = 0.048, i.e. 4.8%.
Return on Westpac share = $1.26 / 26.52 = 0.048, i.e. 4.8%.
So in fact, even though the price increases are different in dollar terms, both
shares earned the same return as a percentage on original investment.
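
The conversion is a one-line function. A minimal sketch, using the figures from the
example:

    def pct_return(price_growth: float, start_price: float) -> float:
        """Price growth as a percentage of the original investment."""
        return price_growth / start_price * 100

    print(pct_return(3.30, 69.00))   # Macquarie: about 4.8%
    print(pct_return(1.26, 26.52))   # Westpac: about 4.8%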

- The consumer price index (CPI) (to remove the effects of inflation)

e.g. "Working families have never been better off. Average weekly earnings
have gone up 400% in the last 25 years."

[Figure: time-series plot of Average Weekly Earnings (Total Earnings, All Employees,
$/week, 0 to 900), 1981 to 2005.]


But: as well as wages, prices have gone up over the 25-year period, so a fair
comparison of how much better off families are should adjust for the effects of
inflation. i.e. Can they actually buy any more with these extra dollars?

The CPI is an index that measures the cost of living, and it is often used to
remove the effects of inflation from monetary data spanning across time.
When we divide nominal wages (the dollar amount you get on your pay sheet)
by the CPI, we get what is known as the REAL WAGE. It is obtained by the
following:

    \text{Real Wage} = \frac{\text{Nominal Wage}}{\text{CPI}} \times 100

The real wage tells us what our wage would be if prices had remained fixed at
what they were in the base period for the CPI (in Australia this is currently
1989/90). Changes reflect a capacity to buy more or fewer items, hence the
term "real" wage.
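
Deflating a series is a single division per observation. A minimal Python sketch with
made-up wage and CPI values (only the base-period convention, CPI = 100, comes from
the notes):

    # Hypothetical nominal weekly wages and CPI values (base period: CPI = 100).
    nominal = {1981: 255, 1990: 480, 2005: 830}
    cpi     = {1981: 55.0, 1990: 100.0, 2005: 148.0}

    for year in nominal:
        real = nominal[year] / cpi[year] * 100   # Real Wage = Nominal / CPI x 100
        print(f"{year}: nominal ${nominal[year]}/week, real ${real:.0f}/week")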

[Figure: the same Average Weekly Earnings series, 1981 to 2005, shown in both
nominal and real terms ($/week).]


- Population (to adjust for different sized populations).

e.g. "Economic growth, as measured by the growth rate in real Gross Domestic
Product (GDP), has been 3.2% this past year."

But: population has grown by around 2%, so actual per capita growth has been
much smaller.

e.g. "China's energy use is growing 10 times faster than Australia's."

But: China's population is around 50 times bigger, so per capita, Australia's
energy use is growing faster.

We capture this by calculating per capita energy consumption:

per capita energy consumption = consumption / population.

This tells us consumption per person. Growth in this variable over time
indicates actual growth in energy use per person, rather than growth in the
combination of energy use and population.


3.7 Summary

Looking at these descriptive statistics and graphs is often a useful first step in getting a
"feel" for the data you are looking at. Let's go through an example to show how the tools
you have learnt in this topic can be used to report and draw conclusions about a set of data.

Suppose we are interested in evaluating a job search program undertaken by Centrelink.
We have data on 100 individuals, some of whom took part in the job search program, and
some who didn't. We also have data on whether the individual found work within a 6-
month period, and if so, their annual wage.

Here's a snapshot of the data:

Individual    Participate in Job Search Training    Become Employed    Income
1 no no -
2 no no -
3 yes yes 43230
4 no yes 27635
5 no yes 62046
6 no yes 45157
7 yes yes 55970
8 no no -
9 no no -
10 no no -
11 no no -
12 no no -
13 no no -
14 no no -
15 no yes 66181
16 no no -
17 no no -
18 no yes 49030
19 no yes 50117
20 yes yes 47100

First, we could use a pivot table to see whether those who participated in the program were
more likely to get a job than those who didn't.

Employed?
Participate in Job Search Training? No Yes Grand Total
No 50 26 76
Yes 1 23 24
Grand Total 51 49 100

From this table, we can see that 24% of the individuals participated in the job search
program, and virtually all (23 out of 24, or 96%) of them found work. Amongst those who
did not participate in the training, only 26/76 or 34% found work.

These proportions would be easier to see as a "% of row" table, as by dividing by the row
totals we effectively account for the different number of people in each group (participants
in the training and non-participants).

That is, we'd construct:

Employed?
Participate in Job Search Training? No Yes Grand Total
No 66% 34% 100%
Yes 4% 96% 100%
Grand Total 51% 49% 100%

But just because there was a high success rate in employment doesn't mean those people
who got jobs have high-paying jobs. We can look at some descriptive statistics of income
for the two groups: job search training participants and non-participants. The output is
given below:


                        Participants in Job
                        Search Training       Non-Participants
Mean                    39314.87              41034.78
Standard Error          2861.283              3643.973
Median                  42519.76              40000
Mode                    #N/A                  40000
Standard Deviation      13722.23              18580.69
Sample Variance         1.88E+08              3.45E+08
Kurtosis                -0.36927              3.286039
Skewness                -0.67963              1.57031
Range                   45866.31              81156.34
Minimum                 10819.11              18843.66
Maximum                 56685.42              100000
Sum                     904242.1              1066904
Count                   23                    26

Note that these descriptive statistics were calculated for the employed only – the Counts are
the same as the total number of employed in each participation group.

The alternative output to the above is the ANOVA output below:

Anova: Single Factor

SUMMARY
Groups Count Sum Average Variance
Participate 23 904242.12 39314.87478 188299665.2
Do Not Participate 26 1066904.166 41034.7756 345241991


ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 36100391.21 1 36100391.21 0.132829645 0.717150714 4.047099759
Within Groups 12773642409 47 271779625.7

Total 12809742801 48

The statistics highlight that mean income is similar across the two groups – participants
earn on average $39,315 per annum, while the non-participants earn slightly more on
average at $41,035 per annum. In particular, the ratio of explained variation to unexplained
variation is too small for participation in the program to be deemed an important source of
variation in income (the ratio of 0.133 is much smaller than the critical value of 4.047).

The medians are also somewhat similar, with 50% of participants earning above and below
$42,520 and, for non-participants, 50% above and below $40,000. No mode is available for
income of participants – there are no two incomes exactly the same in this group, whereas an
income of $40,000 comes up most often amongst non-participants – a value quite close to
the mean and median.

There is some difference in variation between the two groups, with incomes of participants
varying around the mean by $13,722 on average, while incomes of non-participants tend to
vary more around their mean – on average, incomes vary by $18,581 above and below the
mean. This characteristic can also be seen in the values for the range, minimum and
maximum – incomes of participants range from $10,819 to $56,685 per annum, while those
for non-participants span a much larger range: $18,843 to $100,000. Incomes of
participants tend to vary more symmetrically and tightly around their mean than incomes
of non-participants – most non-participant incomes sit around the $40,000 mark, but a few
particularly large values like $100,000 do occur. These few large values seem to drag out
the mean and give the distribution a moderately large skew to the right (hence the positive
skewness coefficient of 1.57).

Summing up the findings of the analysis, we could conclude:

- The job search program seems to be successful in getting people jobs.
- The jobs participants get do not seem to be more highly paid than those the non-
participants get.
- There is much more variation in income outcomes of non-participants compared to
participants, but most people found jobs with incomes around the $40,000 mark.

N.B. What determined who participated in the program and who didn't?
