
RCM1613 Applied Statistics 1 Week 1

There is no required textbook

But we recommend the following two books for additional reading:

[1] Triola, M.F., (2005), Elementary Statistics (9th ed.). Pearson Education

[2] Levine, D.M., Stephan, D., Krehbiel, T.C., and Berenson, M.L., (2008), Statistics for Managers Using Microsoft Excel (5th ed.). Prentice Hall. ISBN: 0-13-157940-8

Features of the textbooks:


Hundreds of real-life examples and exercises
An Introduction to Microsoft Excel
Tutorials and Flexible Excel Presentations
More comprehensive coverage of topics
Web Exercises and Cases

What is Statistics
Statistics deals with data, or numerical facts. In this subject you will first learn methods for organising and describing data. Following that, methods of producing data to answer specific questions will be covered. Later in the semester, statistical methods for industrial quality control, time series analysis, and experimental designs will be tackled.

Definition of the word statistics in Webster's New World Dictionary:

1. facts or data of a numerical kind, assembled, classified, and tabulated so as to present significant information about a given subject.
2. the science of assembling, classifying, and tabulating such facts or data.

The on-line dictionary (http://foldoc.doc.ic.ac.uk) definition of Statistics is:


The practice, study or result of the application of mathematical functions to collections of data in order to summarise or extrapolate that data. The subject of Statistics can be divided into descriptive Statistics - describing data, and analytical Statistics - drawing conclusions from data.

The Microsoft Encarta definition (http://dictionary.msn.com):


1. a branch of mathematics that deals with the analysis and interpretation of numerical data in terms of samples and populations
2. a collection of numerical data, e.g., this month's sales statistics

Also from Encarta, origin of the word statistics is [Late 18th century. Via German Statistik from, ultimately, Latin status (see state ). The underlying meaning is the study of data relating to the state.]

Wherever data exists


Nowadays data exists in every field of research and every aspect of society. Particularly due to the fast development of information technology, data are collected much more easily and in much greater volume. Wherever data exists, methods of analysing it are needed, and thus correct or good methods need to be chosen or designed. Prior to obtaining data, good experimental design is also needed to collect data effectively. Therefore Statistics should be learned and used in every field.

Why Statistics

Statistics allows us to describe a circumstance or situation numerically, answer questions, resolve uncertainty, and make useful predictions and decisions. Statistics is used in every field that deals with data: Politics/government, Marketing, Medicine, Manufacturing, Transport, Farming, Sport, Business/Finance, IT, etc.

Web Sources of Statistics

http://lib.stat.cmu.edu/
http://www.stat.ufl.edu/vlib/statistics.html
http://www.statsci.org/

Statistics Job sites:
www.statsci.org/jobs/index.html
www.austms.org.au/Jobs/Job_listings.html
www.stat.ufl.edu/vlib/statistics.html
www.amstat.org/opportunities/index.html

A Video Program
From the VU Library you can borrow the video tapes of the following statistical teaching program: AGAINST ALL ODDS: Inside Statistics. There are 26 tapes/programs in total.

What An Australian Statistician Says


On her homepage, http://wwwmaths.anu.edu/~sue/PersonalPage, Prof. Sue Wilson says: "The heart of statistics is data. My research always has been, and will continue to be, data-motivated. ...."

Calculation and interpretation


We will learn how to calculate various numerical measures (such as sample mean, sample variance, etc.) of a data set. However, it is more important to be able to correctly interpret your results, to correctly use and understand statistical terms in a scientific report, and to draw correct conclusions in scientific research.

So we start with 4 statistical terms:
Population
Sample
Descriptive statistics
Inferential statistics

Population and sample


A population is the set of all measurements of interest to the sample collector. A sample is any subset of measurements selected from the population.

Relationship between population and sample

Example: A medical research group studies a new medicine for asthma and tests the new medicine on 88 asthma patients. In this study, the population is all asthma patients, and the 88 asthma patients are a sample.

Example 1:

Population: All VU Students

Student          Height (cm)
Student 1        178
Student 2        189
Student 3        169
Student 4        172
Student 5        171
Student 6        166
...              ...
Student 10,000   179

Sample: Heights of 10 selected students (the figure shows 189 and 171 among the selected heights)

Sample size
The number of items in a sample is called the size of the sample, or sample size, and is usually denoted by n. The number of items in a population is called the size of the population, or population size, and is usually denoted by N.

Descriptive statistics & Inferential statistics


Descriptive statistics consists of methods for organizing and summarizing information. Inferential statistics consists of methods for drawing and measuring the reliability of conclusions about a population based on information obtained from a sample of the population.


Descriptive statistics

Descriptive statistics is about the sample, about information contained in the sample data, such as the sample mean, sample variance, graphical presentation of the data, etc.
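
As a rough illustration of the numerical summaries named above, here is a minimal Python sketch. The ten heights are illustrative values only (they are not the actual sample behind the slides), and the variable names are assumptions made for this example.

```python
import statistics

# Illustrative heights (cm) of 10 selected students; hypothetical values,
# not the real sample used in the lecture.
heights_cm = [178, 189, 169, 172, 171, 166, 179, 165, 170, 161]

n = len(heights_cm)                                 # sample size
sample_mean = statistics.mean(heights_cm)           # sample mean
sample_variance = statistics.variance(heights_cm)   # sample variance (divides by n - 1)
sample_sd = statistics.stdev(heights_cm)            # sample standard deviation

print(f"n = {n}")
print(f"sample mean = {sample_mean} cm")
print(f"sample variance = {sample_variance} cm^2")
print(f"sample standard deviation = {sample_sd:.2f} cm")
```

All of these quantities describe the sample only; they say nothing yet about how close the sample mean is to the mean of the whole population.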

Inferential statistics
Inferential statistics is about drawing conclusions about the population. For example, since the average height of 10 selected VU students is 172 cm, it is reasonable to believe the average height of all VU students should be close to 172 cm. But how accurate is this estimate?

Presidential Election
Population: all US voters (Voter 1, Voter 2, Voter 3, ...), each voting for Obama, McCain, or Other. No one knew the exact results before the election.
But some people did a poll and obtained a small sample before the election. They found 48.6% of voters in the sample would vote for McCain, and 49.3% for Obama. Then they predicted that Obama would win 49.3% of all US voters, and McCain 48.6%. Margin of error of this prediction, according to them, is 3%.
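
The slide does not say how the 3% margin of error was obtained. One common approach, sketched below in Python, is the normal approximation for a sample proportion under simple random sampling at 95% confidence. The sample size of 1,000 and the function names are assumptions for illustration, not figures from the poll.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample of size n and a normal approximation;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def sample_size_for_margin(margin, p_hat=0.5, z=1.96):
    """Smallest n whose margin of error is at most `margin` (worst case p_hat = 0.5)."""
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)

n_assumed = 1000  # hypothetical sample size; the slide does not report one
for candidate, p_hat in [("Obama", 0.493), ("McCain", 0.486)]:
    moe = margin_of_error(p_hat, n_assumed)
    print(f"{candidate}: {p_hat:.1%} +/- {moe:.1%}")

print("n needed for a 3% margin:", sample_size_for_margin(0.03))
```

Under these assumptions, a sample of about a thousand voters gives a margin of roughly 3 percentage points, which is larger than the 0.7-point gap between the two candidates in the poll.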

Descriptive statistics: summarising the sample (Voter s1, Voter s2, Voter s3, ..., Voter sn; Obama, McCain, Other).
Inferential statistics: using the sample to draw conclusions about the population.

Statistic and parameter


A summary measure that is computed to describe a numerical characteristic from only a sample of the population is called a statistic. A summary measure that is computed to describe a characteristic of an entire population is called a parameter. For example, we have a sample of 10 students from the population of all VU students. The average height of the 10 students is a statistic, and the average height of all VU students is a parameter. Usually we use sample statistics to scientifically guess (the formal phrase is "to make statistical inference on") parameters of a population.
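
The difference between a parameter and a statistic can be seen in a small simulation. The sketch below is illustrative only: the population of 10,000 heights is simulated (normal with mean 172 cm and standard deviation 8 cm, both assumed), since the real heights of all VU students are not available.

```python
import random
import statistics

rng = random.Random(1613)  # fixed seed so the run is reproducible

# Hypothetical population: simulated heights (cm) of N = 10,000 students.
N = 10_000
population = [rng.gauss(172, 8) for _ in range(N)]

# Parameter: computed from the entire population (usually unknown in practice).
parameter = statistics.mean(population)

# Statistic: computed from a simple random sample of n = 10 students.
n = 10
sample = rng.sample(population, n)
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.2f} cm")
print(f"sample mean (statistic):     {statistic:.2f} cm")
```

Re-running with a different seed changes the statistic from sample to sample, while the parameter of a given population stays fixed.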

Statistical decisions are probabilistically correct


There are always chances that we may make wrong decisions or predictions due to randomness of the sample used. Correctly using statistical methods, we hope that the probability of making wrong decisions or predictions will be minimized.


Example: milk powder

Example: biscuits

Questions on labels
What does the number exactly mean?
How was the number obtained?
How would you test it?
Write the exact meaning with the correct unit.

Statistics helps us answer questions


How would you answer the following common problems?
The machine is supposed to deliver 250ml of soft drink. The last bottle contained 257ml. Should we adjust the machine?
We get a lot of complaints about how difficult it is to open the plastic seals on our products. What should we change in our manufacturing process to make them easier to open?

More Questions
If 98% of all calls to our help desk are to be answered within 5 rings, how many operators do we need to meet this objective?
How long should we allow for the downloading of files over the internet? We are to receive files (in Melbourne) from the head office in Chicago.

Another example
A new drug has been developed to increase the sleeping time of patients. In an experiment, 20 randomly selected patients used the drug while another 20 randomly selected patients used a placebo. Increases in sleeping time relative to each patient's sleeping records were recorded. What kind of results can lead us to conclude the drug is effective?

Statistical software packages


EXCEL
S-PLUS
R
SPSS
MINITAB
SAS
STATA
and many others

Microsoft Excel
Widely available
Easy to use
Powerful enough to perform basic to medium level statistical analysis
Yet it is being improved constantly by Microsoft and others.

The importance of collecting data


To provide scientific evidence
To assess different marketing strategies
To monitor a production process
To make forecasts
To enrich our knowledge

Data Sources
Primary (Data Collection): Observation, Survey, Experimentation
Secondary (Data Compilation): Print or Electronic, Internet

Reasons for Drawing a Sample


A census is when you collect data from the entire population
Taking a sample is less time consuming than a census
Less costly to administer than a census
Less cumbersome and more practical to administer than a census of the targeted population

Types of Samples Used


Non-probability Sample
Items included are chosen without regard to their probability of occurrence

Probability Sample
Items in the sample are chosen on the basis of known probabilities

Types of Samples Used (continued)

Samples
Non-Probability Samples: Judgement, Quota, Chunk, Convenience
Probability Samples: Simple Random, Systematic, Stratified, Cluster

Judgment sample
In the terminology of Deming (1947), a judgment sample is, in general, any sample which is not a probability sample: some element of human judgment enters directly into the selection of the sample. For example, a marketing manager judges that it is only important to know young female shoppers' opinions on some product, so she selects 20 of them in a supermarket and conducts a questionnaire survey.

Convenience sampling
Items are selected into the sample based only on the fact that they are easy, inexpensive, or convenient to sample. For example, if I want to know the opinion of all VU students on some teaching policy, I get the sample of students in this class now. That's convenient, but not good, as this is not a representative sample of all VU students.

Another example of nonprobability sample


Many companies conduct surveys by giving visitors to their website the opportunity to complete survey forms and submit them electronically. The responses to these surveys can provide large amounts of data in a timely fashion, but the sample is composed of self-selected web users.

An example from CNN webpage: (screenshots of an online poll and its result, not reproduced here)

QUOTA SAMPLE
A sample, usually of human beings, in which each investigator is instructed to collect information from an assigned number of individuals (the quota) but the individuals are left to his personal choice. In practice this choice is severely limited by controls, e.g. he is instructed to secure certain numbers in assigned age groups, equal numbers of the two sexes, certain numbers in particular social classes and so forth.

QUOTA SAMPLE (cont.)


For example, a supervisor asks an investigator to select 10 1st year male students, 10 1st year female students, 10 2nd year male students, 10 2nd year female students, 10 3rd year male students, and 10 3rd year female students, to learn the students' attitude towards some teaching policy.

A frame
You usually have or prepare a list of items comprising the population, and then select some items on the list to obtain a sample. The sampling process begins by defining the frame, i.e., a complete or partial listing of items comprising the population. The frame can be population lists, directories, or maps. Samples are drawn from these frames.

Chunk sample
A chunk sample is a group of items who happen to be available at the time of the study (e.g., people waiting in a waiting room at a hospital or people walking through a mall).


Probability samples:
Simple Random Samples
Every individual or item from the frame has an equal chance of being selected
Selection may be with replacement or without replacement
Samples obtained from table of random numbers or computer random number generators

Obtaining an SRS
List all possible samples and randomly select one
Use a software package
Use a table of random numbers
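
As one way of using a software package, here is a minimal Python sketch of drawing a simple random sample with the standard library. The frame of 100 student labels is hypothetical, and the seed is fixed only to make the run reproducible.

```python
import random

rng = random.Random(42)  # computer random number generator with a fixed seed

# Hypothetical frame: a list of 100 students.
frame = [f"Student {i}" for i in range(1, 101)]

# Without replacement: no item can appear twice in the sample.
without_replacement = rng.sample(frame, k=10)

# With replacement: the same item may be selected more than once.
with_replacement = rng.choices(frame, k=10)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

In both cases every item in the frame has the same chance of being picked on each draw.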


Random sampling by Excel

Steps to obtain a random sample by Excel:


Make a numerical ID column;
Click Tools, then Data Analysis;
Then you will have Sampling.

Systematic Samples
Decide on sample size: n
Divide frame of N individuals into groups of k individuals: k = N/n
Randomly select one individual from the 1st group
Select every kth individual thereafter

Example (figure): N = 64, n = 8, so k = 8; one individual is selected at random from the first group.

Another example of systematic sampling:


Suppose we want to select 4 funds out of 20. Four groups of funds:
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
We randomly pick the 4th one in the 1st group, and then 4+5 = 9, 4+2*5 = 14, and 4+3*5 = 19 in the other groups.
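
The systematic sampling steps can be sketched in Python as follows. The function name and its `start` and `seed` parameters are assumptions for this illustration, and the sketch only handles the case where N is an exact multiple of n.

```python
import random

def systematic_sample(frame, n, start=None, seed=None):
    """Systematic sample of size n from a frame of N items.

    `start` is the 1-based position chosen in the first group of k = N/n
    items; if omitted, it is picked at random.
    """
    N = len(frame)
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch assumes N is a positive multiple of n")
    k = N // n
    if start is None:
        start = random.Random(seed).randint(1, k)
    if not 1 <= start <= k:
        raise ValueError("start must lie within the first group")
    # Take the chosen item from the first group, then every kth item after it.
    return [frame[start - 1 + i * k] for i in range(n)]

funds = list(range(1, 21))  # funds numbered 1 to 20
picked = systematic_sample(funds, n=4, start=4)
print(picked)
```

With `start=4` this reproduces the fund example: the 4th fund, then every 5th fund after it. Leaving `start` out lets the function choose the starting position at random.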

Stratified Samples
Divide population into two or more subgroups (called strata) according to some common characteristic
A simple random sample is selected from each subgroup, with sample sizes proportional to strata sizes
Samples from subgroups are combined into one
(Figure: population divided into 4 strata, with a sample drawn from each stratum)

Another example:
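We randomly choose 10 male students and 10 female students among all VU students. Is that a stratified sample? Answer: No, not by the definition above, unless exactly half of all VU students are male, because the sample sizes must be proportional to the strata sizes.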

An example of stratified sample


Suppose 60% of VU students are male, 40% female. We randomly select 30 male students, and 20 female students. That is a stratified sample, as the sample sizes are proportional to strata sizes.
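
Proportional allocation can be sketched in Python as below. The population of 1,000 labelled students and the function name are hypothetical; only the 60%/40% split and the resulting 30/20 allocation come from the example.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample from each stratum, with sample sizes
    proportional to stratum sizes.

    `strata` maps a stratum name to the list of its items. Rounding may make
    the total differ slightly from n when the proportions are not exact.
    """
    rng = random.Random(seed)
    N = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        n_h = round(n * len(items) / N)   # proportional allocation
        sample[name] = rng.sample(items, n_h)
    return sample

# Hypothetical population of 1,000 students: 60% male, 40% female.
strata = {
    "male": [f"M{i}" for i in range(600)],
    "female": [f"F{i}" for i in range(400)],
}
chosen = stratified_sample(strata, n=50, seed=1)
print({name: len(items) for name, items in chosen.items()})
```

For a total sample of 50, the allocation works out to 30 male and 20 female students, matching the example.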


Cluster Samples
Population is divided into several clusters, each representative of the population
A simple random sample of clusters is selected
All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique
(Figure: population divided into 16 clusters, with randomly selected clusters forming the sample)

An example of cluster sampling


An investigator randomly selects 10 streets (including all street types such as street, road, court, etc.) in Melbourne, and conducts a questionnaire survey of each resident in the 10 streets. In this example, a cluster is a street.
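
A minimal Python sketch of this kind of one-stage cluster sample is given below. The 50 streets and their residents are made-up data, and the function name is an assumption for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and return every item in them.

    `clusters` maps a cluster name (here, a street) to the list of its items.
    """
    rng = random.Random(seed)
    selected = rng.sample(sorted(clusters), n_clusters)  # SRS of clusters
    items = [item for name in selected for item in clusters[name]]
    return selected, items

# Hypothetical frame: 50 streets, each with between 5 and 20 residents.
data_rng = random.Random(0)
streets = {
    f"Street {i}": [f"Street {i} / resident {j}" for j in range(data_rng.randint(5, 20))]
    for i in range(1, 51)
}

selected, residents = cluster_sample(streets, n_clusters=10, seed=2)
print("selected streets:", selected)
print("number of residents surveyed:", len(residents))
```

Note that the randomness is applied to streets, not to individual residents: once a street is selected, everyone living on it is surveyed.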


Advantages and Disadvantages


Simple random sample and systematic sample
o Simple to use
o May not be a good representation of the population's underlying characteristics
Stratified sample
o Ensures representation of individuals across the entire population
Cluster sample
o More cost effective
o Less efficient (need larger sample to acquire the same level of precision)

Focus group
A focus group is a form of qualitative research in which a group of people are asked about their attitude towards a product, service, concept, advertisement, idea, or packaging. Questions are asked in an interactive group setting where participants are free to talk with other group members. In the world of marketing, focus groups are an important tool for acquiring feedback regarding new products, as well as various topics. In particular, focus groups allow companies wishing to develop, package, name, or test market a new product, to discuss, view, and/or test the new product before it is made available to the public. This can provide invaluable information about the potential market acceptance of the product.

Types of Variables

Variables
Categorical
o Examples: Marital Status, Political Party, Eye Color, Blood type, Importance level, State or territory, Gender (Defined categories)
Numerical
o Discrete. Examples: Number of Children, Defects per hour (Counted items)
o Continuous. Examples: Weight, Voltage (Measured characteristics)

Categorical Variables
A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category. Each value is chosen from a set of non-overlapping categories. For example, shoes in a cupboard can be sorted according to color; the characteristic 'color' can have non-overlapping categories 'black', 'brown', 'red' and 'other'. People have the characteristic of 'gender' with categories 'male' and 'female'. Categories should be chosen carefully since a bad choice can prejudice the outcome of an investigation. Every value should belong to one and only one category, and there should be no doubt as to which one.

Ordinal Scale
A set of data is said to be ordinal if the values/observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data. Example: a data set has a variable with values "young", "middle-aged", "aged". It is of ordinal type, as a natural ranking exists. Another example: a data set from a survey has a variable with values "very important", "important", "not important". It is of ordinal type, as the values can be ranked.

Nominal Scale
A set of data is said to be nominal if the values/observations belonging to it can NOT be ranked (put in order) or have a rating scale attached. You can count but you can NOT order nominal data. Example: a data set has a variable with values VIC, NSW, SA, QLD. It is nominal, as the values cannot be ranked.

Continuous Variable
A set of data is said to be continuous if the values/observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. For example, height; weight; temperature; the amount of sugar in an orange; the time required to run a mile.

Discrete Variable
A set of data is said to be discrete if the values/observations belonging to it are distinct and separate, i.e. they can be counted (1, 2, 3, ...). Examples might include the number of kittens in a litter; the number of patients in a doctor's surgery; the number of flaws in one meter of cloth; gender (male, female); blood group (O, A, B, AB). Note that in the classification of variables used in this subject, gender and blood group are treated as categorical (nominal) data rather than discrete numerical data.

Interval Scale
An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or 'intervals') is the same but the zero point is arbitrary. For example, the time interval between the starts of years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days. The zero point, year 1 AD (or year 0 AD), is arbitrary; time did not begin then. Other examples of interval scales include the measurement of longitude, angle, and distances between stars in the universe.

Ratio scale
A ratio scale is an interval scale in which distances are stated with respect to a rational zero. For example, height is measured on a ratio scale, as zero height is a rational zero. So are salary, weight, etc.

Levels of Measurement and Measurement Scales

From the highest level (strongest form of measurement) to the lowest level (weakest form of measurement):

Ratio Data: differences between measurements, true zero exists
Interval Data: differences between measurements but no true zero
Ordinal Data: ordered categories (rankings, order, or scaling)
Nominal Data: categories (no ordering or direction)

Mutual Funds Data


Variable Fund is of nominal, categorical type.
Variable Type is of ordinal, categorical type.
Assets and 5-yr return are of continuous, numerical type.
Another variable, Number of employees, is of discrete, numerical type, as possible values are 1, 2, 3, ...

Example

Name             Blood group   Height (m)   No. of units   Distance to uni   Email address
John Smith                     1.76                        far               js@vu.edu.au
Adam Lee                       1.67                        near              al@hotmail.com
Fuchun B Huang                 1.89                        very far          fh@optusnet.com.au

(continued)
Blood group is nominal, categorical data
Height is continuous, numerical data
No. of units is discrete, numerical data
Distance to uni is ordinal, categorical data
Email address is nominal, categorical data

Evaluating Survey Worthiness


What is the purpose of the survey?
Is the survey based on a probability sample?
Coverage error: appropriate frame?
Nonresponse error: follow up
Measurement error: good questions elicit good responses
Sampling error: always exists

Types of Survey Errors

Coverage error or selection bias
Exists if some groups are excluded from the frame and have no chance of being selected

Non response error or bias
People who do not respond may be different from those who do respond

Sampling error
Variation from sample to sample will always exist

Measurement error
Due to weaknesses in question design, respondent error, and interviewer's effects on the respondent

Types of Survey Errors (continued)

Coverage error: Excluded from frame
Non response error: Follow up on nonresponses
Sampling error: Random differences from sample to sample
Measurement error: Bad or leading question

About response rates

The longer the questionnaire, the lower the rate.
Mail surveys usually produce lower response rates than personal interviews or telephone surveys.
Question wording can affect a response rate.

Leading question
In a leading question, the required answer is indicated by the question itself. An example of a leading question would be: 'As you have young children, you will be looking for childcare, right?' Leading questions should be used with considerable care, otherwise there is a real danger of enforcing your own ideas and not learning enough about the individual's needs.

Another example of leading question
Asking a woman: "In this country wives play a leading role in most families, so do you in your family, right?" The right question should be: "Who plays a leading role in your family?"

Chapter Summary
Why we need to learn statistics
Introduced key definitions:
o Population vs. Sample
o Primary vs. Secondary data sources
o Types of variables
Examined descriptive vs. inferential statistics
Described different types of samples
Reviewed data types
Examined survey worthiness and types of survey errors