Data Analysis

Data Analysis Study Notes
Data Analysis (Queensland University of Technology)
Data Analysis Mega Study Document

 Hey guys
 This is pretty much a compilation of lectures 1-11
 I wrote it with the mindset that you should be able to skip to any section and understand
what’s going on (so sometimes I write the same thing 3 times in 3 different ways)
 To use the doc more easily, open up the navigation pane (View -> Show -> Navigation Pane)
 I also tried to filter out things that I didn’t think were necessary and shortened sections
 I also expanded on some sections that were a bit harder to grasp.
 All the hypothesis testing stuff is compiled at the end (instead of using a week structure)
 There’s a table with all the test statistics compiled at the end
 If you’d like to work with me on future documents feel free to let me know.
jeyar.montemayor@gmail.com
Week 1 – Types of Data

Learning Objectives
 Distinguish different types of data
 Understand the difference between population/sample and Greek/Roman notations
Terminology
Population: Consists of all members of the group
Sample: Portion of the population selected for analysis
Parameter: Numerical measure that describes a characteristic of a population
Statistic: Numerical measure that describes a characteristic of a sample
Notations
Population/Greek = μ, σ, N
Sample/Roman = x̄, s, n
Sample Statistic    Population Parameter    Description
n                   N                       Number of members
x̄ (x-bar)           µ (mu)                  Mean/average
s                   σ (sigma)               Standard deviation
Content
Types of data
Categorical/Qualitative
 Can only be named or categorised
o E.g. Gender
 When ranked/ordered
o Called ‘Categorical Ordinal’
 E.g. Bad, good, excellent
 When unranked/unordered
o Called ‘Categorical Nominal’
 E.g. Brown hair, blond hair, black hair
Numerical/Quantitative
 Measured on a numerical scale
o E.g. Age
 When it can take any value in a range (decimals)
o Called ‘Numerical Continuous’
 E.g. 3.14
 When it can only take countable values (no decimals)
o Called ‘Numerical Discrete’
 E.g. 3
Time Series or Cross-sectional

Time series
 Collected through time
o E.g. Monthly sales
 Jan $100, Feb $150, Mar $120…
Cross-sectional
 Collected at a single point in time
o E.g. heights of class = 185, 172, 178…
Week 2 – Numerical Data Tables/Charts

Learning Objectives
 Tables and charts for numerical data
 Measures of Central Location
o Mean
o Median
o Mode
 Understand Quartiles and Interquartile Range
 Box and Whisker plot (Not included in document)
Terminology
Ordered Array: Data in order (e.g. 1,2,3,4,5)
Frequency Distribution: Summary table of arranged data with the frequency of each value

E.g. 1,1,1,1,2,2,3,3,3
Number    Frequency
1         4
2         2
3         3
Mean/X-bar: Sum of all numbers / Number of numbers
Median: Midpoint of ranked values
Mode: Most frequently observed value
Range: Difference between largest number and smallest number
Frequency: Number of times a number/result is counted
Cumulative Frequency: Frequency of a number plus the frequencies of all smaller numbers (a running total)

E.g. 1,2,2,3,4,5,5
Number    Frequency    Cumulative frequency
1         1            1
2         2            3
3         1            4
4         1            5
5         2            7
Total     7            7
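The cumulative frequency column is just a running total of the frequency column. A minimal Python sketch of the table above (Python and the variable names are my own choice for illustration, they are not from the lectures):

```python
from collections import Counter
from itertools import accumulate

data = [1, 2, 2, 3, 4, 5, 5]

# Count how many times each number appears (the frequency column)
freq = Counter(data)
values = sorted(freq)
frequencies = [freq[v] for v in values]

# Running total of the frequencies (the cumulative frequency column)
cumulative = list(accumulate(frequencies))

for value, f, c in zip(values, frequencies, cumulative):
    print(value, f, c)
```

The last cumulative frequency always equals the total number of results.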
Content
Graphs
Class Intervals
 Puts data into class intervals/bins
o E.g. Instead of listing 1,6,15,18,19,22,30
Class Interval    Frequency
1-10              2
11-20             3
21-30             2
 Every class grouping has the same width
o Determined using the formula below
 Width of interval = Range / Number of desired class groupings
Histograms
 Uses the midpoint of class intervals
 Vertical axis = frequency, relative frequency or percentage
Contingency Tables
 Can be in frequency or percentage
 Used for bar charts
o Frequency
             Cola Preference
Ethnicity    Regular    Diet    Total
Asian        12         3       15
Caucasian    12         13      25
Other        6          4       10
Total        30         20      50
o Percentage (each row as a percentage of its row total)
             Cola Preference
Ethnicity    Regular        Diet    Total
Asian        12/15 = 80%    20%     100%
Caucasian    48%            52%     100%
Other        60%            40%     100%
Total        60%            40%     100%
Scatter Diagrams
 Examine relationships between two numerical variables (X, Y)
o X = Independent variable [Horizontal/X axis]
o Y = Dependent variable [Vertical/Y axis]
 Important in future lectures when covariance is involved
o Covariance and correlation measure the direction and strength of the relationship between two variables
Time-Series Plot
 Studies patterns in values over time
 Examines one variable
o Y = the variable [vertical axis]
o X = Time [horizontal axis]
Points of central location
Mean/X-bar: Sum of all numbers / Total number of numbers
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$
$\bar{X}$: X-bar/Mean/Average
$X_i$: A symbol used instead of an actual number/result
∑: Sum of/add up
n: Total number of results
Median: Midpoint of ranked values
Position of Median = $\frac{n+1}{2}$
 Finding the median
o Obtain the position of the median
o Count from left to right through the ranked values to that position
 Rules
o Number of values = Odd
 Median is the middle ranked value
 E.g. 1,5,7,8,10
 Position = (5+1)/2 = 3
 Median = 7
o Number of values = Even
 Median is the average of the two middle ranked values
 E.g. 1,5,7,8,10,12
 Position = (6+1)/2 = 3.5
 Find the average of the 3rd and 4th numbers
 (7+8)/2 = 7.5
Mode: Most frequently observed value
 E.g. 1,2,3,3,3,3,4,5,5
 3 has the largest frequency, occurring 4 times
o Therefore, mode = 3
Range: Difference between largest number and smallest number
 E.g. 3,5,7,8,9,11,15
 Range = Largest number – Smallest number
 Range = 15 – 3
 Range = 12
Quartiles: 25% segments, i.e. the ranked data split into 4 divisions or quarters
 Named Q1, Q2, Q3, Q4 (Q2 is the median)
 Position of Q1 = (n + 1)/4
 Position of Q2 = (n + 1)/2
 Position of Q3 = 3(n + 1)/4
Interquartile Range: Range between Q1 and Q3
 IQR/Interquartile Range = Q3 − Q1
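The median, mode, range and quartile positions above can be checked with a short Python sketch. The helper `value_at_position` is a hypothetical name of my own; it assumes that a fractional position (such as 3.5) means "interpolate between the two neighbouring ranked values", which matches the even-count median rule but may differ from what your textbook or Excel does for quartiles.

```python
from statistics import median, mode

def value_at_position(ranked, position):
    """Value at a 1-based position in ranked data; fractional positions interpolate."""
    lower = int(position)            # whole part of the position
    fraction = position - lower      # e.g. 0.5 means halfway to the next value
    if fraction == 0:
        return ranked[lower - 1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])

data = sorted([3, 5, 7, 8, 9, 11, 15])
n = len(data)

q1 = value_at_position(data, (n + 1) / 4)       # position 2
q2 = value_at_position(data, (n + 1) / 2)       # position 4 (the median)
q3 = value_at_position(data, 3 * (n + 1) / 4)   # position 6
iqr = q3 - q1
data_range = max(data) - min(data)

# The median and mode examples from the notes
even_median = median([1, 5, 7, 8, 10, 12])
most_common = mode([1, 2, 3, 3, 3, 3, 4, 5, 5])

print(q1, q2, q3, iqr, data_range, even_median, most_common)
```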
Week 3 – Numerical Descriptive Measures

Learning Objectives
 Understand measures of dispersion
o Variance
o Standard Deviation
o Coefficient of variation
 Distribution Shape
 Bivariate measures
Terminology
Variance: Literally just the standard deviation squared (s² or σ²).
Note: Remember to convert variance to standard deviation by taking the square root (√).
Standard Deviation/Sigma/σ/s: Roughly the typical deviation of each number from the average; the square root of the variance.
Note: This is not the mean absolute deviation (MAD). The mean squared deviation is the variance, and the standard deviation is its square root.
Coefficient of variation: Used to compare data sets and determine which is more or less variable.
Z-Scores: Show how many standard deviations (units) a result is from the mean.
Content
Describing data
Measures of Dispersion
Standard deviation
 Unit used to count how far away a number is from the mean
 Formula
o Square root of: (sum of each number minus the average, squared) / (total number of numbers − 1)
o $S = \sqrt{\frac{\sum (X_i - \bar{x})^2}{n - 1}}$
 S = Standard deviation
 ∑ = Sum of all
 $X_i$ = any of the numbers/results
 $\bar{x}$ = The average
 n = total number of numbers/results
 For example
o Standard deviation of: 2,5,10,15,18,21,44
o Average/$\bar{x}$ = 16.43
o $\sum (X_i - \bar{x})^2$ = 1165.71
 Explanation: (2 − 16.43)² + (5 − 16.43)² + (10 − 16.43)² + …
o n − 1 = 7 − 1 = 6
 Therefore
 Standard deviation = $\sqrt{\frac{1165.71}{6}}$ = 13.94
 Uses in language (ALSO THE Z-SCORE, SEE BELOW)
o Take the number 44 and look at its difference from 16.43 (the average)
 44 − 16.43 = 27.57
o If we see how many times the standard deviation fits into 27.57 (the difference)
we find out how many standard deviations (units) away 44 is from the average
 27.57/13.94 = 1.98
o Therefore, we can say the number 44 is 1.98 standard deviations away from the
average.
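The worked example can be reproduced step by step. This is a minimal Python sketch with variable names of my own choosing; it follows the sample formula (dividing by n − 1) and carries full precision, so the last decimal can differ slightly from hand-rounded figures.

```python
from math import sqrt

data = [2, 5, 10, 15, 18, 21, 44]
n = len(data)

x_bar = sum(data) / n                               # the average
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)    # sum of squared deviations
sample_variance = sum_sq_dev / (n - 1)              # divide by n - 1 for a sample
s = sqrt(sample_variance)                           # standard deviation

# How many standard deviations is 44 from the average?
units_away = (44 - x_bar) / s

print(round(x_bar, 2), round(sum_sq_dev, 2), round(s, 2), round(units_away, 2))
```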
Variance
 Definition = The mean squared deviation
 Difference to standard deviation
o It is in squared units, as it doesn’t have the square root.
o It is expressed using specific Greek/Roman (population/sample) characters
o Population variance does not use n − 1; it just divides by N
o Sample variance uses n − 1 to adjust for the bias of sample statistics
o Sample variance = $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
o Population variance = $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
 EXAM NOTE: Be aware of whether the question gives you
o Standard deviation or variance
o If they only give the variance
 Remember to take its square root to get the standard deviation
Coefficient of Variation (CV)

 Measures relative variation
o Shows variation relative to the mean
 Can be used to compare two or more sets of data measured in different units
 ALWAYS EXPRESSED AS A PERCENTAGE (%)
 Formula
o $CV = \left(\frac{S}{\bar{x}}\right) \times 100\%$
 S = Standard deviation
 $\bar{x}$ = Average (X-bar)
Z-Scores
 Difference between a given observation (X) and the mean ($\bar{x}$), divided by the standard deviation
 Formula
o $Z = \frac{X - \bar{x}}{S}$
 X = given observation/number/result/value
 $\bar{x}$ = Mean/average
 S = Standard deviation
 Z = the score that you will look up in the z table
 Table value = the probability (area) to the left of that Z value.
 Also used to describe the number of standard deviations a value is from the mean
o E.g. a z-score of 2.0 means that a value is 2.0 standard deviations above the mean
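As a rough illustration, both formulas are one-liners once you have the mean and standard deviation. The sketch below uses Python's `statistics.stdev` (which is the sample version, dividing by n − 1); the function names and the choice of data set are my own, reusing the numbers from the standard deviation example.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = (S / x-bar) * 100, expressed as a percentage."""
    return stdev(values) / mean(values) * 100

def z_score(x, values):
    """Number of sample standard deviations that x lies from the sample mean."""
    return (x - mean(values)) / stdev(values)

data = [2, 5, 10, 15, 18, 21, 44]
cv = coefficient_of_variation(data)
z_44 = z_score(44, data)

print(f"CV = {cv:.1f}%, z-score of 44 = {z_44:.2f}")
```

A CV this large says the standard deviation is almost as big as the mean, so the data set is very spread out relative to its average.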
Distribution Shape
 Describes how data are distributed
o Left Skewed
 Mean < Median
o Symmetric
 Mean = Median
o Right Skewed
 Median < Mean
Bivariate Measures – Direction and Strength

 We use these to identify a relationship between 2 numerical variables
o Determine direction of relationship: positive or negative
o Determine strength of relationship
Covariance [Cov(X,Y)] – THE DIRECTION

 Determines the DIRECTION of the relationship
 Formula
o $cov(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
o Note: you have to work out each X,Y pairing separately, then do the total summation, then
the division.
 For example, $\sum (x_i - \bar{x})(y_i - \bar{y})$
X     Y     (X − Average)    (Y − Average)    (X − Average) × (Y − Average)
14    16    −12              −6               72
22    18    −4               −4               16
28    25    2                3                6
30    21    4                −1               −4
36    30    10               8                80
Average: X = 26, Y = 22                       Total sum = 170
o Note: n = the number of lines (X,Y pairs)
 In the above example there are 10 numbers
 However
 We only count the lines
 Therefore n = 5
 So cov(X,Y) = 170/(5 − 1) = 42.5 (positive, so the relationship is positive)
o In $x_i$, the i is just an index that stands for any given line
 For example, the deviations on line 1 of X and Y pair up and are multiplied together.
 X first line × Y first line
 X second line × Y second line
 Etc.
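The covariance table can be rebuilt line by line. This is a small Python sketch with my own variable names; it pairs each X with its Y, multiplies the two deviations, sums the products, and divides by n − 1.

```python
x = [14, 22, 28, 30, 36]
y = [16, 18, 25, 21, 30]
n = len(x)                       # number of lines (pairs), not number of values

x_bar = sum(x) / n
y_bar = sum(y) / n

# One product per line: (X - average of X) * (Y - average of Y)
products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
total_sum = sum(products)

cov_xy = total_sum / (n - 1)
print(products, total_sum, cov_xy)
```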
Coefficient of Correlation r – The Strength

 Measures the relative strength of the linear relationship between two variables
o Ranges between -1 and 1
 Closer to -1, the stronger the negative linear relationship
 Closer to 1, the stronger the positive linear relationship
 Closer to 0, the weaker the linear relationship
 Rule of thumb for the size of r (ignoring its sign):
 0 – 0.3 = Weak
 0.3 – 0.7 = Medium
 0.7 – 1 = Strong
o Excel formula: =CORREL(array1, array2)
 Array = the cells that you highlight to represent x and y.
 Formula
o $r = \frac{cov(X,Y)}{S_x S_y}$
 r = Coefficient of correlation
 $cov(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ (see above)
 $S_x$ = Standard deviation of x values
 $S_y$ = Standard deviation of y values
 $S = \sqrt{\frac{\sum (X_i - \bar{x})^2}{n - 1}}$ = Standard deviation formula
 Example
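As an illustration, here is r for the five X,Y pairs used in the covariance table (14,16), (22,18), (28,25), (30,21), (36,30). This is a Python sketch under my own naming; `strength_label` is a hypothetical helper that simply applies the weak/medium/strong cut-offs listed in these notes.

```python
from math import sqrt

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    avg = sum(values) / len(values)
    return sqrt(sum((v - avg) ** 2 for v in values) / (len(values) - 1))

def covariance(xs, ys):
    """Sample covariance of paired values."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (len(xs) - 1)

def strength_label(r):
    """Apply the rule-of-thumb cut-offs from the notes to the size of r."""
    size = abs(r)
    if size < 0.3:
        return "weak"
    if size < 0.7:
        return "medium"
    return "strong"

x = [14, 22, 28, 30, 36]
y = [16, 18, 25, 21, 30]

r = covariance(x, y) / (sample_sd(x) * sample_sd(y))
direction = "positive" if r > 0 else "negative"
print(round(r, 3), strength_label(r), direction)
```

Here r comes out at roughly 0.905, a strong positive linear relationship.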
Week 4 – Regression and Probability

Learning Objectives
 Simple linear Regression
o Estimation
o Interpretation
o Prediction
 Basic probability concepts
o Experiments
o Sample space
o Events
 Interpreting and assessing probabilities
o Interpreting probabilities
o Assessing probabilities
 Calculating and combining probabilities
o Marginal probabilities
o Complements
o Joint probabilities (Intersection)
Terminology
Regression analysis: Explains and measures the relationship between two variables.
Simple Linear Regression: Straight-line relationship between Y and X.
Dependent variable (Y): Variable we wish to predict or explain (Vertical Axis)
Independent variable (X): Variable used to explain the dependent variable
E.g. every time X increases by 1, Y increases/decreases by ‘_____’
Experiment: Process by which an observation or measurement is obtained
Sample space: List of all possible outcomes that can be generated

E.g. Sample space of a die = {1,2,3,4,5,6}
Event: Collection of simple events (Notation of event = A)

E.g. A = {1,2,3}
Content
Simple Linear Regression
 Helps with prediction
 The relationship between Y and X is treated as a causal relationship
o Y is dependent on X: for every 1 unit change in X, Y increases/decreases by ‘_____’
 Formula
o Y = mX + C
 C = the Y intercept = the value of Y when X is 0
 m = The slope of the line
 Measures the rate of change in Y for a 1 unit change in X
o E.g.
o when x = 1, y = 2
o when x = 2, y = 4
 Rate of change = (Y at x = 2) – (Y at x = 1)
 Rate of change = 4 – 2 = 2
 Therefore m/rate of change = 2
 We use this to predict what Y will be at any given value of X
o X = whatever number we choose
 Any given point at which we want to find out what Y is
Possible exam QUESTION!!!

 They’ll probably ask us to interpret an Excel output and predict what Y will be at a given X
value
 Example – They will give us something that looks like this (Excel regression output; image not included)
 You only need to be concerned with this
o Multiple R = the coefficient of correlation r = STRENGTH + DIRECTION (refer to Week 3)
 If there are any questions on this:
 They want us to tell them the direction and strength
o E.g. Direction = is it a positive value or a negative value?
 0.90 = Positive value
o E.g. Strength = is it strong, medium or weak?
 0.9 = Strong
o Therefore, 0.9 = a strong positive correlation
 Note: Excel reports Multiple R as a non-negative number, so confirm the direction from the sign of the slope coefficient
o Intercept and No of children
 These two rows refer us back to our equation Y = mX + C
 They’ll probably ask us:
 How X affects Y, by writing a statement, e.g. for every 1 unit change
in X, Y increases/decreases by ‘insert amount’
o The amount = m = rate of change
o E.g. For every 1 unit change in X, Y increases by 1.4.
o Give it context!
E.g. for every extra pool at the aquatic centre, there are
1.4 more children. Adding 10 pools adds 14 children.
 Or to actually use the formula: what does Y equal when X = a certain value?
 Intercept = C
 What we use for C in the Y = mX + C formula
 Number of children = m = rate of change
 What we use for m in the Y = mX + C formula
 ‘No of children’ is just a column title they used in Excel
o During the exam it’ll probably be called something else
 Just remember whatever is below the intercept (in the table) is the
m in Y = mX + C
 E.g. if X = 5, what does Y equal?
o Y = mX + C
 m = rate of change = no of children = 1.4
 X = whatever number they give us = 5
 C = intercept = 4.8
o Y = 1.4 × 5 + 4.8
o Y = 7 + 4.8
o Y = 11.8
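The prediction step is just plugging numbers into the line. A minimal Python sketch, assuming the slope (1.4) and intercept (4.8) have already been read off the Excel output; `predict` is a hypothetical helper name, not an Excel or library function.

```python
def predict(x, slope, intercept):
    """Predicted Y for a given X on the line Y = mX + C."""
    return slope * x + intercept

slope = 1.4       # m: the coefficient in the row below the intercept
intercept = 4.8   # C: the intercept row

y_at_5 = predict(5, slope, intercept)
print(round(y_at_5, 2))

# The slope is the change in Y for a 1 unit change in X
change_per_unit = predict(6, slope, intercept) - predict(5, slope, intercept)
print(round(change_per_unit, 2))
```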
Probability
Interpreting probability
 Probability
o Chance that an outcome is achieved in a particular experiment
o Notation for the probability of an event, A, occurring is P(A)
 Notation for event = A
 Notation for probability = P
 Notation for probability of an event = P(A)
o Probability of an event ranges between 0 and 1
 0% chance to 100% chance
o The probabilities of all simple events must sum to 1
 E.g. probability of rolling a die and getting 1,2,3,4,5 or 6 = 100% = 1
Assessing Probabilities
 A priori classical probability
o Based on prior knowledge
o E.g. based on the symmetrical nature of an experiment
 Flipping a coin (1/2)
 Rolling a die (1/6)
 Empirical classical probability
o Based on observed data or repeated experimentation to assign probabilities
o $P(A) = \frac{N_A}{N}$
 $N_A$ = number of trials in which A occurred
 N = number of trials
 Subjective probability
o Based on individual judgement or opinion about the probability of occurrence
Calculating and combining probabilities
 Summary of types of probabilities (exam questions)
o Marginal
o Complement
o Joint
o Either/or
o Conditional
 You need to become familiar with which words determine
the kind of probability you need to find to answer the question.
 Marginal = P(A)
o Probability of a single outcome
o Example
 What is the probability someone likes Coke?
 $P(\text{Coke}) = \frac{120}{200} = 0.6 = 60\%$
 Complement
o Probability something won’t happen
 Notation for the event not happening = $\bar{A}$
o Formula
 $P(\bar{A}) = 1 - P(A)$
o Example
 $P(\overline{\text{Coke}}) = 1 - P(\text{Coke})$
 = 1 – 120/200
 = 1 – 0.6
 = 0.4 = 40%
 Joint Probabilities
o Intersection of two events, A and B.
 Event that both A and B occur
 Key word is AND
o Notation = A ∩ B
o Example: Probability someone is female and likes Coke
 $P(A \cap B) = P(\text{Female} \cap \text{Coke}) = \frac{N_{A \cap B}}{N} = \frac{75}{200} = 0.375$
 N = total
 Either/Or (The Union)
o Union of events
o Either A or B can occur
o Or both
o P(A ∪ B)
o Formula (The addition rule)
 P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
o Example – Probability that the individual likes Coke
or is female
 P(Coke ∪ Female) = P(Coke) + P(Female) – P(Coke ∩ Female)
 $= \frac{120}{200} + \frac{110}{200} - \frac{75}{200}$
 = 0.6 + 0.55 – 0.375
 = 0.775
o BIGGEST NOTE: the key here is the words EITHER/OR. The biggest mistake is when
people confuse or/either with AND.
 E.g. Individual likes Coke OR is female (∪)
vs
Individual likes Coke AND is female (∩)
 ‘And’ has an ‘n’ in it… ∩
 Conditional Probability
o Probability of one event conditional upon the state of another event
o KEY WORD: given
 The conditional probability of A given we know about B
o P(A | B)
 Note: B is the condition/given
o Example
 Given a person is male, what is the probability that they prefer Pepsi?
 P(Pepsi | Male)
 What is the probability that an individual is female given they prefer Coke?
 P(Female | Coke)
o EXAM NOTE
 They might not give us a contingency table (as above) and instead ask a full-text
question.
 Pay attention to the words used and use the following formula
 $P(A|B) = \frac{P(A \cap B)}{P(B)}$ (Remember B is the GIVEN)
 EXAMPLE
 Situation
The probability that Mark has lunch in the tea room is 0.6 and the
probability that Gary has lunch in the tea room is 0.5. However,
Mark and Gary don’t like each other and the probability that they
have lunch together in the tea room is 0.1
Mark tea room = 0.6
Gary tea room = 0.5
Both tea room = 0.1
 Question
What is the probability that Mark will have lunch in the tea room
given Gary is having lunch in the tea room?
Or
Given Gary is having lunch in the tea room, what is the probability
that Mark will have lunch in the tea room?
 $P(A|B) = \frac{P(A \cap B)}{P(B)}$
o B = Given = Gary = 0.5
o A = Probability event = Mark = 0.6 (NOT NEEDED IN THIS Q)
o P(A ∩ B) = Gary AND Mark = 0.1
 $P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{0.1}{0.5} = 0.2$
 REMEMBER
 B = GIVEN/the condition
 A = The EVENT we are trying to find the probability for
 Mutually exclusive events
o If events are mutually exclusive they cannot occur together.
 E.g. in the survey above, male and female are mutually exclusive categories
 E.g. in rolling a die, the events 1 and 2 are mutually exclusive
 Collectively exhaustive events
o One of the events must occur
o The set of events covers the entire sample space
 E.g. in the survey above, male and female are collectively exhaustive
 E.g. in rolling a die, 1,2,3,4,5 and 6 are collectively exhaustive
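All five probability types can be computed from the same four counts used in the cola examples (200 people, 120 like Coke, 110 are female, 75 are female and like Coke). The Python sketch below uses `fractions.Fraction` so the arithmetic stays exact; the variable names are my own, and P(Female | Coke) is worked out from those counts rather than quoted from the lectures.

```python
from fractions import Fraction

total = 200
likes_coke = 120
female = 110
female_and_coke = 75

p_coke = Fraction(likes_coke, total)                        # marginal
p_not_coke = 1 - p_coke                                     # complement
p_female = Fraction(female, total)                          # marginal
p_female_and_coke = Fraction(female_and_coke, total)        # joint (AND)
p_coke_or_female = p_coke + p_female - p_female_and_coke    # union (EITHER/OR)
p_female_given_coke = p_female_and_coke / p_coke            # conditional (GIVEN)

# Mark and Gary: P(Mark | Gary) = P(Mark AND Gary) / P(Gary)
p_mark_given_gary = Fraction("0.1") / Fraction("0.5")

for name, p in [("P(Coke)", p_coke), ("P(not Coke)", p_not_coke),
                ("P(Female and Coke)", p_female_and_coke),
                ("P(Coke or Female)", p_coke_or_female),
                ("P(Female | Coke)", p_female_given_coke),
                ("P(Mark | Gary)", p_mark_given_gary)]:
    print(name, float(p))
```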
Week 5 – Discrete Probability and Distributions

Note:
I put the either/or, conditional and multiplication rules in Week 4 so that the information is amalgamated
in one place.
Learning objectives
 Random variable
 Probability distributions
o Discrete
o Continuous
 Binomial distribution
Terminology
Random Variable: Variable that assumes numerical values according to the random outcome of an experiment.
Notation = X
Content
The Addition Rule
 The formula can be rearranged
o This enables us to use an EITHER/OR probability to find an AND probability and vice
versa
 The below will find EITHER/OR
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
 Rearrange (mid step)
P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
o This is your bread and butter rule
 Rearrange to find what you need
 The below will find Event A occurring WITH/AND B also occurring (remember female AND
likes Coke)
P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
 Example
o P(A ∪ B) = 0.8 = Probability that A OR B occurs
o P(A) = 0.6 = Probability A occurs
o P(B) = 0.5 = Probability B occurs
o P(A ∩ B) = ??? = Probability A AND B occur
 P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
 P(A ∩ B) = 0.6 + 0.5 − 0.8
 P(A ∩ B) = 0.3
o Remember you can always check your answer
 P(A ∪ B) + P(A ∩ B) = P(A) + P(B)
 0.8 + 0.3 = 0.6 + 0.5
 1.1 = 1.1
The Conditional Rule

$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{N_{AB}}{N_B}$ = Probability A occurs GIVEN B is known
P = Probability
A = Event (that you’re trying to find the probability of)
B = Given/Condition (of A happening)
P(A ∩ B) = Probability A and B occur
$N_{AB}$ = Number of outcomes where both A and B occur; $N_B$ = Number of outcomes where B occurs
Note: The lecturers used N (counts) and P (probabilities) interchangeably. This works because dividing both counts by the total N turns $N_{AB}/N_B$ into $P(A \cap B)/P(B)$.
Example: Refer to Week 4 -> Conditional probability -> Exam note
The Multiplication Rule

P(A ∩ B) = P(A|B) × P(B)
P(A ∩ B) = Probability Event A AND Event B occur
P(A|B) = Probability Event A occurs GIVEN Event B is known
P(B) = Given event = Probability of B
 Example
o Situation
In a recent study it was found that the probability that a speeding driver was male, given
they were driving a ‘big car’, was 0.6. The probability that the speeding car was a ‘big
car’ was 0.8
o Question
What is the probability that the speeding car was a ‘big car’ AND driven by a male?
 P(B) = Given = Big car = 0.8
 P(A|B) = Male | Big car = Male given big car = 0.6
o P(A ∩ B) = P(A|B) × P(B)
o P(Male ∩ Big car) = 0.6 × 0.8 = 0.48
Dependent and Independent events

P(A) = P(A|B)    Independent
P(A) ≠ P(A|B)    Dependent
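The three rules in this week (rearranged addition rule, multiplication rule, independence check) are sketched below in Python. The function names are hypothetical ones of my own, and `Fraction` is used so that 0.6 + 0.5 − 0.8 comes out as exactly 0.3 rather than a floating-point near miss. The independence check reuses the Week 4 tea room numbers, P(Mark) = 0.6 and P(Mark | Gary) = 0.2.

```python
from fractions import Fraction

def joint_from_union(p_a, p_b, p_a_or_b):
    """Rearranged addition rule: P(A and B) = P(A) + P(B) - P(A or B)."""
    return p_a + p_b - p_a_or_b

def joint_from_conditional(p_a_given_b, p_b):
    """Multiplication rule: P(A and B) = P(A|B) * P(B)."""
    return p_a_given_b * p_b

def is_independent(p_a, p_a_given_b):
    """A and B are independent when knowing B does not change P(A)."""
    return p_a == p_a_given_b

p_a_and_b = joint_from_union(Fraction("0.6"), Fraction("0.5"), Fraction("0.8"))
p_male_and_big_car = joint_from_conditional(Fraction("0.6"), Fraction("0.8"))
mark_independent_of_gary = is_independent(Fraction("0.6"), Fraction("0.2"))

print(float(p_a_and_b), float(p_male_and_big_car), mark_independent_of_gary)
```

Mark's lunch habits are dependent on Gary's, because knowing Gary is in the tea room drops Mark's probability from 0.6 to 0.2.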
Random Variable
 Random variable
o Notaion = X
o Is a variable that assumes numerical values according to the random outcome of an
experiment
o Can be discrete or continuous
 Remember:
Discrete = no decimals (e.g. 3)
Continuous = decimals (e.g. 3.14159)
 The probability distribution of a random variable, X, describes the probability that X takes
each of its possible values.
 Probability distributions can be discrete OR continuous
o Discrete distribution, e.g. Binomial distribution
o Continuous distribution, e.g. Normal distribution
Discrete Distributions:
Expected Value:
 Formula
o $\mu = E(X) = \sum_{i=1}^{N} x_i P(x_i)$
 Multiply each possible value of the random variable ($x_i$) by its corresponding probability [$P(x_i)$]
 Then sum them all together
 The weighted average of the different possible outcomes, where the weights are the
probabilities of those outcomes
 EXAMPLE
o Situation
AXA offers a “Death and disability” policy, which pays $100k when you die, or $50k
when you are permanently disabled.
It charges an annual premium of $350 for these benefits
Outcome       Payout (X)    P(X) = Probability of X
Death         $100,000      0.1%
Disability    $50,000       0.2%
Neither       $0            99.7%
o Question
Is AXA likely to make a profit selling this policy?
o Answer
 $\mu = E(X) = \sum_{i=1}^{N} x_i P(x_i)$
 µ = population mean
 E(X) = Expected value
 $\sum_{i=1}^{N}$ = Sum of all (over every possible outcome)
 $x_i$ = value of the random variable (payout for death, disability, neither)
 $P(x_i)$ = probability that the outcome occurs
 Plain English
 Expected value = Sum of (each value of the random variable × probability of it
happening)
o 1 Policy
 E(X) = (Death payout × Death probability) + (Disability payout × Disability probability) + (Neither payout × Neither probability)
 E(X) = ($100,000 × 0.001) + ($50,000 × 0.002) + ($0 × 0.997)
 E(X) = $100 + $100 + $0 = $200
 The expected payout ($200) is less than the $350 premium, so on average AXA expects to make $150 per policy.
o EXAM NOTE: if the question asks about multiple policies, just multiply the probability by that
number. E.g. 1000 policies × 0.1% = 1 expected death claim
 Example 2
o Choose between two stocks
 Which has the greater expected return?
 Stock A = Less range and greater certainty
 Stock B = More range and less certainty
 Calculate using the expected value formula to find your answer
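The AXA example is a weighted average, which is easy to check in code. A minimal Python sketch with my own variable names; the payouts, probabilities and $350 premium are the ones given in the situation above.

```python
# (outcome, payout in dollars, probability)
policy = [
    ("Death", 100_000, 0.001),
    ("Disability", 50_000, 0.002),
    ("Neither", 0, 0.997),
]
premium = 350

# The probabilities of all outcomes should add up to 1
total_probability = sum(p for _, _, p in policy)

# E(X): multiply each payout by its probability, then sum
expected_payout = sum(payout * p for _, payout, p in policy)
expected_profit = premium - expected_payout

print(round(expected_payout, 2), round(expected_profit, 2))
```

A positive expected profit per policy means AXA makes money on average, even though any single policy could cost it far more than the premium.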
Variance:
 Variance
o Weighted average of the squared differences between each outcome and the
expected value of the distribution
o Measure of dispersion
o An indicator of RISK in investment
o Formula
 $\sigma_x^2 = V(X) = \sum_{i=1}^{N} [x_i - E(X)]^2 P(x_i)$
 EXAMPLE
o Calculate the variance of each stock as a measure of the stock’s risk
o Stock A
o $\sigma_x^2 = V(X) = \sum_{i=1}^{N} [x_i - E(X)]^2 P(x_i)$
 V(X) = Variance of stock
 $\sum_{i=1}^{N}$ = Sum of all
 $x_i$ = value of the random variable (Above, Normal, Below returns)
 E(X) = Expected value = $\sum x_i P(x_i)$
 $P(x_i)$ = Probability of each value [P(Above), P(Normal), P(Below)]
 σ = standard deviation (population) = S (sample)
o Plain English
 Sum of all [(value of random variable – the expected value)² × probability of that
value]
o Step 1
 Calculate the expected value (using the expected value formula above)
 E(X) = $17
o Step 2
 Substitute E(X) into the equation with the values of the random variable
 $V(X) = \sum_{i=1}^{N} [x_i - E(X)]^2 P(x_i)$
 V(X) = [(Above returns − Expected value)² × Probability of Above] + [(Normal returns − Expected value)² × Probability of Normal] + [(Below returns − Expected value)² × Probability of Below]
 V(X) = [($30 − 17)² × 0.1] + [($20 − 17)² × 0.8] + [(−$20 − 17)² × 0.1]
 = (13² × 0.1) + (3² × 0.8) + ((−37)² × 0.1)
 = (169 × 0.1) + (9 × 0.8) + (1369 × 0.1)
 = 16.9 + 7.2 + 136.9
 = 161
o Step 3
 Take the square root of V(X) to convert the variance into the standard
deviation
 $\sigma_x^2 = V(X) = 161$
 $\sigma_x = \sqrt{\sigma_x^2} = \sqrt{V(X)} = \sqrt{161}$
 $\sigma_x$ = 12.689
o Summary
 Stock A
 E(X) = 17%
 σ = 12.689
 Stock B
 E(X) = 18%
 σ = 44.788
 Which would you prefer?
 A higher standard deviation means there’s greater variation (risk), even though
the expected value is greater.
 Stock B has a greater expected return but less certainty
 Given that the gain is only 1%, the risk wouldn’t be worth it.
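Stock A's three steps (expected value, variance, standard deviation) can be checked with a few lines of Python. The function names are my own; the returns and probabilities (30, 20 and −20 with probabilities 0.1, 0.8 and 0.1) are read off the Step 2 working. Stock B's outcomes are not listed in these notes, so only Stock A is computed.

```python
from math import sqrt

# (return, probability) for Above, Normal and Below
stock_a = [(30, 0.1), (20, 0.8), (-20, 0.1)]

def expected_value(distribution):
    """E(X): sum of value * probability."""
    return sum(x * p for x, p in distribution)

def variance(distribution):
    """V(X): sum of (value - E(X))^2 * probability."""
    mu = expected_value(distribution)
    return sum((x - mu) ** 2 * p for x, p in distribution)

e_a = expected_value(stock_a)
v_a = variance(stock_a)
sd_a = sqrt(v_a)

print(round(e_a, 2), round(v_a, 2), round(sd_a, 3))
```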
Binomial Distribution
 Discrete distribution where the underlying experiment has only two outcomes
o Success or failure
 A binomial experiment possesses the following properties
o Fixed number of trials, n
o The probability of success, π (also written p), is constant for each trial
o Each trial is independent of the other trials
 The binomial random variable (X)
o Number of successes in n trials of the binomial experiment
 Example: if we toss a coin 3 times, what is the probability…
 Formulas
o Mean
 $\mu = E(X) = np$
o Variance and standard deviation
 $\sigma^2 = np(1 - p)$ = Variance
 $\sigma = \sqrt{np(1 - p)}$ = Standard deviation
 n = Sample size (number of trials)
 p = Probability of success
 (1 − p) = Probability of failure
 Shape of the binomial distribution depends on the values of p and n
 Example
o Situation
 If the sausage machine is working properly, no more than 10% of the output
produced will be defective
 No. of sausages sampled, n = 10
 Inspect and record the no. of defectives, x
o n = no. sampled = 10
o X = no. of defectives
o π = prob. of a defective = 10%
 Questions
 P(X = 1) = 0.387
 P(X ≥ 2)
o = P(X=2) + P(X=3) + … + P(X=10)
o = 0.194 + 0.057 + 0.011 + 0.001 + …
o = 0.263
 P(X ≥ 4)
o = P(X=4) + P(X=5) + … + P(X=10)
o = 0.011 + 0.001 + …
o = 0.012
 How to do the above questions
 Use your binomial tables
o Find the table where n = what you have
 In this example n = 10
o Go down the table to the ‘x’ you need, e.g. 2
o Go across the table to the π/probability you have, e.g. 0.1
 Interpreting your answers
 P(X ≥ 2) = 0.263
o We expect to find 2 or more defective sausages in 26.3% of
samples of 10 sausages, if the machine is producing 10%
defectives.
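If you don't have the binomial tables handy, the same probabilities come from the standard binomial formula P(X = x) = C(n, x) p^x (1 − p)^(n − x). That formula is not written out in these notes, so treat this Python sketch (and its function names) as my own illustration. Because it sums unrounded values, the results can differ from the table answers in the third decimal place (for example about 0.264 rather than 0.263).

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x): probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def prob_at_least(k, n, p):
    """P(X >= k): add up P(X = k), P(X = k + 1), ..., P(X = n)."""
    return sum(binomial_pmf(x, n, p) for x in range(k, n + 1))

n, p = 10, 0.1   # 10 sausages sampled, 10% chance each is defective

p_exactly_1 = binomial_pmf(1, n, p)
p_2_or_more = prob_at_least(2, n, p)
p_4_or_more = prob_at_least(4, n, p)

mean_defectives = n * p
sd_defectives = sqrt(n * p * (1 - p))

print(round(p_exactly_1, 3), round(p_2_or_more, 3), round(p_4_or_more, 3))
print(round(mean_defectives, 2), round(sd_defectives, 3))
```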
Week 6 – Continuous Probability Distributions

[Diagram: Random Variables split into Discrete Random Variable (Ch. 5) and Continuous Random Variable (Ch. 6)]
Learning objectives
 Normal distribution
 Standardisation
 Z-tables
 Inverse Z-Tables
 Rearranging Z Formula
Terminology
Continuous random variable: Variable that can assume any value on a continuum
Normal Distribution: Where results are spread symmetrically around the mean (50% above, 50% below)
Content
Summary
 Normal Distribution
o Symmetric around the mean
o Area under curve = 1
o Area under curve represents probability
o Determined by mean and standard deviation
 Standard normal (Z)
o Mean = 0
o Standard deviation = 1
o Transform:
o $Z = \frac{X - \mu}{\sigma}$
 Using the tables:
o P(Z < cutoff) = Probability
 Types of Q’s:
o Less than
o Greater than
o In between
o Inverse problems
o Solve for the 4th variable in Z = (X – mean)/standard deviation
Continuous Probability Distributions

 Variable that can assume any value on a continuum
 Can take on any value, depending on the ability to measure accurately
o Thickness of an item
o Time required to complete a task
o Weight, in grams
o Height, in centimetres
 Remember: Discrete vs Continuous, No decimals vs Decimals
Normal distributions
 Characteristics
o Continuous random variable
o −∞ < X < +∞
o Area under curve = 1
o Mean and standard deviation uniquely determine a normal distribution
Standardised Normal Distribution (Z-Distribution)

 Formula (Remember Week 3)
o $Z = \frac{X - \mu}{\sigma}$ = Finds the number of standard deviations X is from the mean
 µ = Mean
 σ = Standard deviation
 X = random variable
 The above will allow us to plot where on our normal distribution X
lies.
o Example
 Mean = 100
 Standard deviation = 50
 X = 200
 Z = (200 − 100)/50 = 2.0
The Standardised Normal Table

 Gives the probability of being less than a desired value of Z
o Area to the left of the Z value
 Using the Normal Table
o Obtain the z-score using the above formula
 Refer to the table
o Row = Z value to the first decimal place, e.g. 0.1, 1.2, 2.3 etc
o Column = second decimal place, e.g. 0.02
 E.g. Z-score = 0.12 (row 0.1, column 0.02) = 0.5478 probability
 P(Z < 0.12) = 0.5478
 Probability that Z is less than 0.12 is 54.78%
 Remember: we only use this when X is NORMALLY DISTRIBUTED.
Bible for solving P(X) in Normal Distributions

 LOWER TAIL PROBABILITIES
 Step 1 Draw the normal curve
 Step 2 Translate X values to Z values
 Step 3 Use the standardised normal table
 Example
o µ/mean = 8
o σ/St dev = 5
o Find P(X < 8.6)
 $Z = \frac{X - \mu}{\sigma}$
 $Z = \frac{8.6 - 8}{5}$
 Z = 0.12
o Use normal table
 P(Z < 0.12) = 0.5478
 UPPER TAIL PROBABILITIES
o Remember all area under the curve = 1
o The normal table gives us the area to the left (lower tail), therefore
 Lower tail + upper tail = 1
 Area to left + Area to right = 1
o We can rearrange the formula to find the upper tail area
 Upper tail = 1 – Lower tail
o Example
 Using the same values as above
 Upper tail = 1 – 0.5478 = 0.4522 = P(Z > 0.12)
 Between two value probabilities
o When the question asks to find the probability that X is larger than Value1 but smaller
than Value2
 Find P(Value1 < X < Value2)
o Example
 Step 1 Draw the normal curve
Identify what probability area you need
Identify what two Z values you need
 Step 2 Translate X values to Z values
Use standard normal tables
 Step 3 Calculate the area you need
Area between = (table area for the larger Z value) – (table area for the smaller Z value)
 Outside the two values (both tails)
o Two ways to do it (see picture; image not included)
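The three question types (lower tail, upper tail, between) can be checked without a table using `statistics.NormalDist` from the Python standard library (Python 3.8+), whose `cdf` method gives the area to the left, exactly like the normal table. The lower and upper tail use the example values above (mean 8, standard deviation 5, X = 8.6). The "between" example uses a lower value of 5, which is a made-up number of mine purely for illustration.

```python
from statistics import NormalDist

mean, st_dev = 8, 5
x = 8.6

# Step 2: translate X to Z, then Step 3: look up the area to the left
z = (x - mean) / st_dev
lower_tail = NormalDist().cdf(z)           # P(Z < 0.12)
upper_tail = 1 - lower_tail                # P(Z > 0.12)

# Between two values: area left of the larger minus area left of the smaller
x_dist = NormalDist(mu=mean, sigma=st_dev)
value1, value2 = 5, 8.6                    # value1 is a hypothetical example value
between = x_dist.cdf(value2) - x_dist.cdf(value1)

print(round(z, 2), round(lower_tail, 4), round(upper_tail, 4), round(between, 4))
```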
Finding X for a known probability

 Inverse problems
o Draw a normal curve placing all known values on it
(mean of X, Z)
o Shade in the area of interest and find the cumulative probability
o Find the Z value for the known probability
o Convert to X units using the formula
 X = μ + Zσ
 X = Mean + (Z × standard deviation)
 Inverse Problem Example
o Find X such that 20% of values are smaller than X, i.e. P(value < X) = 0.20
o Mean = 8.0
o St dev = 5.0
o Z = -0.84 (refer to the table and find the closest probability to 0.20)
o Convert to X units using the formula
o X = 8.0 + (−0.84 × 5.0) = 3.8
 Inverse normal table
o Use it when you need to find a Z-score or an X value (plug the Z-score into the inverse formula)
 You are provided with a percentage
 That is in the right-hand tail (upper tail)
o For example
Find the minimum weight for the heaviest 1% of cans by weight
 Mean = 400g
 St dev = 10g
 Area given = 1% (to the right, as it says heaviest)
 Formula
 X = μ + Zσ
 X = 400g + Z × 10g
o Two unknowns still exist (X and Z)
o Use the inverse table to find Z
 X = 400g + 2.3263 × 10g
 X = 423.26g
 Golden rule
 You can use the formula to find X, the mean or the st dev
o They will give you all the other variables (use the table to find Z)
o Then just rearrange the formula as needed
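Inverse problems run the table backwards: start from a probability, get Z, then convert to X with X = μ + Zσ. In Python the table lookup can be replaced by `NormalDist().inv_cdf` from the standard library, which takes the area to the LEFT, so an upper-tail percentage has to be converted first. This is a sketch with my own variable names; because it uses an unrounded Z, the 20% example gives about 3.79 rather than the 3.8 you get from Z = −0.84.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# Heaviest 1% of cans: 1% to the right means 99% to the left
z_heaviest = std_normal.inv_cdf(1 - 0.01)
min_weight = 400 + z_heaviest * 10          # X = mean + Z * st dev

# 20% of values are smaller than X (area to the left is already given)
z_low = std_normal.inv_cdf(0.20)
x_low = 8.0 + z_low * 5.0

print(round(z_heaviest, 4), round(min_weight, 2))
print(round(z_low, 2), round(x_low, 2))
```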
Week 7 – Sampling Distributions

Learning objectives
 Sampling distribution of the mean
 Sampling distribution of the proportion
Terminology
Statistical inference: When a sample is selected to draw conclusions regarding a population
Sampling error: An error we expect to occur when we make a statistical inference
Probability distribution: All the possible values that can occur (here, sample means) with their probabilities, often shown as a graph
Sampling distribution: Just the specific name for the probability distribution of a sample statistic
X-bar = Sample mean = $\bar{x}$
Uniform distribution: All outcomes have equal probability (like a coin has ½ for T or H)
Summary
 Fundamentally the exact same as Week 6 (finding probabilities)
 Content is based around the Z formula $Z = \frac{X - \mu}{\sigma}$
 Adds an extra step
o Find the probability that a sample will have a certain mean, instead of the probability that a single observation will
have a certain value
o Observation -> Sample & value -> mean
 Z formula is changed to $Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
 We use the above formula for:
o Normal distributions of populations
o Non-normal distributions of populations when n ≥ 30
 (Central limit theorem)
Content
Sampling Distributions
 The distribution of the possible values a sample statistic may take, spread around the population parameter of interest
o Every sample statistic calculated is a random variable
o Every random variable will have a distribution
o If we define the distribution we can use it to answer questions
 Takes account of the distribution of possible sampling errors.
Sampling Distribution of the Mean

 In week 6 we looked at the probability that ONE item would be greater/lesser than a certain value (n = 1)
o P(x < value) = __________
 The aim is to learn how to find the probability that an entire SAMPLE (using the mean) would be greater or lesser than a value (n = 2 or n = 3 … n = 25 etc.)
o P($\bar{x}$ < value) = ___________
 Why we can figure out X but not X-bar (sample mean, $\bar{x}$)
o For X, we knew it was normally distributed
    And we knew the mean and standard deviation
o Therefore, to answer questions about X-bar (sample mean)
    We need to know the probability distribution/sampling distribution of X-bar
 Sampling distribution of the mean
o In every sample we get a mean
o The sampling distribution considers EVERY possible mean we can get from
    Any combination of observations in a sample
 EXAMPLE OF SAMPLING DISTRIBUTION
o Situation
    Population size = N = 4
    Random variable, X, is the AGE of individuals
    Values of X: 18, 20, 22, 24 (years)
    Population mean = 21
    Population standard deviation = 2.236
    Distribution = uniform
o Developing a sampling distribution
    Pick an 'n' size (number of observations)
    E.g. n = 2
    Find all the possible sample results
    E.g. picking random people from a crowd (and putting them back in)

    1st person    Second person
                  18       20       22       24
    18            18-18    18-20    18-22    18-24
    20            20-18    20-20    20-22    20-24
    22            22-18    22-20    22-22    22-24
    24            24-18    24-20    24-22    24-24

    Calculate the means

    1st person    Second person
                  18       20       22       24
    18            18       19       20       21
    20            19       20       21       22
    22            20       21       22       23
    24            21       22       23       24

o Notice
    Probability that the mean is 21
    P(X-bar = 21) = 4/16
    P(X-bar = 20) = 3/16 = P(X-bar = 22)
    P(X-bar = 19) = 2/16 = P(X-bar = 23)
    P(X-bar = 18) = 1/16 = P(X-bar = 24)
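 Calculate the mean of the sampling distribution by using all of the sample means
o Notation = $\mu_{\bar{x}}$
o Formula: $\mu_{\bar{x}} = \frac{\sum \bar{x}_i}{16}$
    $\bar{x}_i$ = each of the 16 sample means (18, 19, 19, 20, …, 24)
    16 = number of possible samples (4 × 4)
    $\sum$ = sum of all
    Plain English = the mean of the sample means is the sum of all the sample means divided by the number of possible samples. It comes out equal to the population mean µ.
o $\mu_{\bar{x}} = \frac{18+19+19+20+20+20+21+21+21+21+22+22+22+23+23+24}{16}$
o $\mu_{\bar{x}} = 21$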
 We can also now calculate the standard deviation of the sample means (x-bar)
o Notation = $\sigma_{\bar{x}}$
 Formula
o Plain English
    The st dev of the sample means equals the square root of: the sum of all (sample mean minus the mean of the sample means) squared, divided by the number of possible samples
o $\sigma_{\bar{x}} = \sqrt{\frac{\sum (\bar{x}_i - \mu_{\bar{x}})^2}{16}}$
o $\sigma_{\bar{x}} = \sqrt{\frac{(18-21)^2 + (19-21)^2 + \dots + (24-21)^2}{16}} = 1.58$
 Notice
 With the above we have….
o Something that looks like/is a normal distribution
o A mean
o A standard deviation
o X values… (which are the X-bars)…
    Even though it is pretty much a normal distribution, we call it a sampling distribution
    We can now answer probability questions using $Z = \frac{X - \mu}{\sigma}$
o A step back to what we have found (Population vs Sampling distribution)
The relationship between the population St dev and Sampling distribution St Dev
 Before we said $\sigma_{\bar{x}} = \sqrt{\frac{\sum (\bar{x}_i - \mu_{\bar{x}})^2}{\text{number of possible samples}}}$
o This gets us our absolutely correct standard deviation
    But we can only do it when we have the whole population and when we know all the possible sample averages
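The brute-force approach (list every possible sample, then take the mean and st dev of the sample means) can be sketched in code. This is only an illustration using the four ages from the example; the variable names are made up, and it relies on Python's standard-library `itertools` and `statistics`.

```python
from itertools import product
from statistics import fmean, pstdev

ages = [18, 20, 22, 24]   # the whole population (N = 4)
n = 2                     # sample size

# Every possible sample of size n, drawn with replacement (4 x 4 = 16 samples)
samples = list(product(ages, repeat=n))
sample_means = [fmean(sample) for sample in samples]

mu_xbar = fmean(sample_means)       # mean of the sample means
sigma_xbar = pstdev(sample_means)   # st dev of the sample means (divide by 16, not 15)

print(len(samples), mu_xbar, round(sigma_xbar, 2))
```

With only 4 people this is quick, but the number of possible samples explodes for real populations, which is why a shortcut formula is needed.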
If population is normal
 If OUR POPULATION IS NORMAL we can say
o $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ and $\mu_{\bar{x}} = \mu$
    σ = standard deviation of the population
    n = sample size (number of observations in each sample)
    the greater our n (the more observations in the sample), the smaller the standard deviation of the sample means is
o which is what we want because it shows greater accuracy, as there is less deviation from the mean
o Our Z formula!!!
    Original (week 6)
    $Z = \frac{X - \mu}{\sigma}$
    Z formula for the sampling distribution of the mean
    $Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
    Comparison
    they share the exact same format
    $\mu \rightarrow \mu_{\bar{x}} = \mu$: the mean becomes the mean of the sample means
    $\sigma \rightarrow \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$: the st dev becomes the st dev of the sample means (the standard error)
 EXAMPLE
o Question
How likely is it that a sample of 25 bottles would have a mean fill of 598 mls or less?
    n = 25
    X-bar = 598
    µ = 600
    σ = 10
    P(x-bar < 598)
o Answer
    $Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
    $= \frac{598 - 600}{10 / \sqrt{25}} = \frac{-2}{10/5} = \frac{-2}{2} = -1$
    P(Z < -1) = 0.1587
    Note: -1.0 on the normal table is 0.1587
    If the population mean = 600, there is a 15.87% chance a sample of 25 bottles would produce a sample mean of less than 598 mls.
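The bottle calculation can be reproduced in a few lines. This is a sketch, assuming `statistics.NormalDist` from the Python standard library replaces the normal table; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_below(value, mu, sigma, n):
    """Return (Z, P(x-bar < value)) for a sample of size n."""
    standard_error = sigma / sqrt(n)    # sigma of x-bar = sigma / sqrt(n)
    z = (value - mu) / standard_error   # Z = (x-bar - mu) / (sigma / sqrt(n))
    return z, NormalDist().cdf(z)       # area to the left of Z

z, p = prob_sample_mean_below(598, mu=600, sigma=10, n=25)
print(z, round(p, 4))
```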
If Population is Not Normal

 With smaller samples, the sampling distribution looks more like the population
 With bigger samples, the sampling distribution becomes more normal
o Central limit theorem
    n ≥ 30 (the sample size must be at least 30)
    If the population is fairly symmetric, n ≥ 5 is sufficient
    If our sample size falls into the above categories we can treat the sampling distribution as if it is normal
o Where $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ and $\mu_{\bar{x}} = \mu$
Sampling Distribution of the PROPORTION

 Distribution is binomial (we won't get questions on this)
o Instead we will approximate it as being normal when:
    nπ ≥ 5 and n(1 – π) ≥ 5
 π (pi) = proportion of the population
 p = the sample proportion
 p = x/n = number of items in the sample having the characteristic of interest / sample size
 Steps to solving
o State formula
    $Z = \frac{p - \pi}{\sigma_p} = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}}$
o State rules
    $\mu_p = \pi$
    $\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}$
o Identify known values
o Find µp and σp
o Complete original formula
o Use table to convert z-scores to probabilities
 EXAMPLE
o Question
The proportion of voters who support Proposition A is 0.4. What is the probability that a sample of size 200 yields a sample proportion between 0.4 and 0.45?
o Formula
    $Z = \frac{p - \pi}{\sigma_p} = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}}$ where $\mu_p = \pi$ and $\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}$
o Known information
    π = µp = 0.4
    n = 200
    P(0.40 ≤ p ≤ 0.45)
    $\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} = \sqrt{\frac{0.4(1-0.4)}{200}} = \sqrt{\frac{0.24}{200}} = \sqrt{0.0012} = 0.03464$
    $P(0.40 \le p \le 0.45) = P\left(\frac{0.4 - 0.4}{0.03464} \le Z \le \frac{0.45 - 0.4}{0.03464}\right) = P(0 \le Z \le 1.44)$
    Z-score 0 = 0.5000
    Z-score 1.44 = 0.9251
    $P(0 \le Z \le 1.44) = P(Z < 1.44) - P(Z < 0) = 0.9251 - 0.5 = 0.4251$
    Answer
    42.51% chance that a sample of size 200 yields a sample proportion between 0.4 and 0.45.
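As a cross-check on the proportion example, the sketch below builds the approximating normal distribution directly (mean π, st dev σp) and takes the area between the two proportions. It assumes Python's standard-library `statistics.NormalDist`; the function name is made up. Because it does not round Z to two decimals the way the table does, it gives roughly 0.4255 rather than 0.4251.

```python
from math import sqrt
from statistics import NormalDist

def prob_proportion_between(low, high, pi, n):
    """P(low <= p <= high) using the normal approximation to the binomial."""
    # The approximation is only appropriate when both conditions hold
    assert n * pi >= 5 and n * (1 - pi) >= 5
    sigma_p = sqrt(pi * (1 - pi) / n)                # standard error of p
    sampling_dist = NormalDist(mu=pi, sigma=sigma_p)
    return sampling_dist.cdf(high) - sampling_dist.cdf(low)

prob = prob_proportion_between(0.40, 0.45, pi=0.4, n=200)
print(round(prob, 4))
```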
Week 8 – Estimation
Learning objectives
 Confidence intervals for the population mean (µ)
o When population standard deviation σ is known
    $\bar{x} \pm Z \frac{\sigma}{\sqrt{n}}$   Use inverse Z table with α/2 in each tail
o When population standard deviation σ is unknown (when s is given)
    $\bar{x} \pm t_{n-1} \frac{S}{\sqrt{n}}$   Use inverse T table with α/2 in each tail
o Answer format __________≤ µ ≤ _________ with ___% confidence
o Formula rearrangement to find sample size:
    $n = \frac{Z^2 \sigma^2}{e^2}$   Always round answer up to next whole number
 Confidence intervals for population proportion (π)
o $p \pm Z \sqrt{\frac{p(1-p)}{n}}$   Use inverse Z table with α/2 in each tail
o Answer format __________≤ π ≤ _________ with ___% confidence
o Formula rearrangement to find sample size:
    $n = \frac{Z^2 \pi(1-\pi)}{e^2}$   Always round answer up to next whole number
Terminology
Point estimate: value of a single statistic (e.g. the sample mean as the best guess of the population mean)
Confidence interval: range of values around the point estimate, limited by upper and lower boundaries. The range is created by choosing a particular level of confidence, e.g. a 95% confidence interval means we are 95% confident the population mean is within the interval.
Confidence level and Alpha (α): You can rearrange the formula (just remember the area under the graph is always equal to 1). Therefore:
1 = Confidence level + α
1 - Confidence level = α
1 - α = Confidence level
Content
Point and Interval estimates

 A point estimate is the value of a single sample statistic
 A confidence interval provides a range of values constructed around the point estimate
 Point estimate

    We can estimate a population parameter…    With a sample statistic (point estimate)
    Mean          µ                             X-bar
    Proportion    π                             p

Confidence Interval
 An interval gives a range of values
o Takes into consideration variation in sample statistics from sample to sample
o Based on observations from 1 sample
o Gives information about closeness to unknown population parameters
o Stated in terms of level of confidence
 The general formula for all confidence intervals is:
o Point estimate +/- (critical value) × (standard error)
 Confidence level = (1 – α)
o Common confidence levels = 90%, 95% or 99%
    Also written (1 – α) = 0.9, 0.95 or 0.99
o A relative frequency interpretation
    In the long run, 90%, 95% or 99% of all the confidence intervals that can be constructed (in repeated samples) will contain the unknown true parameter.
o For example, if we were to randomly select 100 samples and use the results of each sample to construct 95% confidence intervals, approximately 95 out of 100 would contain the population mean.
Confidence interval for µ (sigma known) (inverse normal table)
 Assumptions
o Population standard deviation (sigma) is known
o Population is normally distributed
o If the population is not normal, use the central limit theorem (n of 30 or more)
 Confidence interval estimate
o Point estimate +/- (critical value) × (standard error)
    $\bar{x} \pm Z \frac{\sigma}{\sqrt{n}}$
    $\bar{x}$ is the point estimate
    Z is the normal distribution critical value for a probability of α/2 in each tail
    $\frac{\sigma}{\sqrt{n}}$ is the standard error
o Finding the Critical Value (Z)
    Determine the confidence level
    Usually 90%, 95% or 99%
    Since there are two tails, divide (1 - confidence level) by 2
    E.g. (1 - 0.95)/2 = 0.025
    Look up 0.025 on the t table (critical values of t)
    Note: use the row for ∞ degrees of freedom
o Or you can use the inverse normal table
    An upper-tail area of 0.025 gives Z = 1.96
Example of finding confidence interval when Sigma is known
o Question
    Determine a 95% confidence interval for the true mean resistance of the population
o Known info
    Sample n = 11 circuits
    X-bar = 2.20 ohms
    Sigma = 0.35 ohms
    Confidence level = 95%
o Formula
    $\bar{x} \pm Z \frac{\sigma}{\sqrt{n}}$
    $\bar{x}$ point estimate = 2.2
    Z critical value: tail area = (1 - 0.95)/2 = 0.025, so Z = 1.96
    Standard error $= \frac{\sigma}{\sqrt{n}} = \frac{0.35}{\sqrt{11}} = \frac{0.35}{3.3166} = 0.1055$
    95% confidence interval $= 2.2 \pm (1.96 \times 0.1055) = 2.2 \pm 0.2068$
    Upper limit = 2.2 + 0.2068 = 2.4068
    Lower limit = 2.2 - 0.2068 = 1.9932
o Answer
    We are 95% confident the population mean is between 1.9932 and 2.4068 ohms
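The point estimate +/- (critical value) × (standard error) recipe can be sketched as a small function. This assumes Python's standard-library `statistics.NormalDist` supplies the critical Z value instead of the inverse normal table; the name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value with alpha/2 in each tail
    margin = z * sigma / sqrt(n)              # critical value x standard error
    return x_bar - margin, x_bar + margin

# Circuit resistance example: n = 11, x-bar = 2.20 ohms, sigma = 0.35 ohms
lower, upper = z_interval(2.20, sigma=0.35, n=11, confidence=0.95)
print(round(lower, 4), round(upper, 4))
```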
Confidence interval when sigma is UNKNOWN (use t-table)
 What do we do >_<;;
o Substitute the population st dev (sigma)
    with the sample st dev (S)
    This introduces extra uncertainty, since S varies from sample to sample
o We use the T distribution instead of the normal distribution
 Formula
o $\bar{x} \pm t_{n-1} \frac{S}{\sqrt{n}}$
    t = the critical value of the T distribution
    n - 1 = degrees of freedom
    Area of α/2 in each tail
 The T Table
o The body of the table contains T values, not probabilities
    Unlike the normal table
o We use it to find the T value
    That we sub into our formula above
    Find degrees of freedom = n – 1
    Find the area (α), which will always be divided by 2 (as we have two tails)
o It is what is left over from the confidence level
o E.g. if you want to have a 90% confidence interval
    Since we know that the area under the curve is always 1
    1 – 0.9 = 0.1 = area
    0.1/2 = area/2 = 0.05 (which is what we use)
    Using the table example
    Find the value of T when n = 28 with 99% confidence
o Find degrees of freedom
    D.f = n - 1
    D.f = 28 - 1 = 27
o Find area/2
    1 - 0.99 = 0.01 = area
    0.01/2 = area/2
    = 0.005
o Use table
    T = 2.7707

 Example
o Question
Form a 95% confidence interval for µ
o Given information
    n = 25
    x-bar = 50
    s = 8
    t(n-1) = use the table and locate: 2.0639
    Degrees of freedom = n - 1 = 25 - 1 = 24 (for T table)
    Area (α)
o Upper tail = (1 – 0.95)/2 = 0.025 (for T table)
o Lower tail = (1 – 0.95)/2 = 0.025 (for T table)
o Formula
    $\bar{x} \pm t_{n-1} \frac{S}{\sqrt{n}}$
o Working
    95% confidence interval $= \bar{x} \pm t_{n-1} \frac{S}{\sqrt{n}}$
    $= 50 \pm \left(2.0639 \times \frac{8}{\sqrt{25}}\right) = 50 \pm \left(2.0639 \times \frac{8}{5}\right) = 50 \pm (2.0639 \times 1.6) = 50 \pm 3.30224$
    Upper limit = 50 + 3.30224
o 53.30224
    Lower limit = 50 – 3.30224
o 46.69776
o Answer
    We are 95% confident the population mean is between 46.698 and 53.302
o Remember
    Draw your graphs and use visualisations
    Start by finding what T is
o Find degrees of freedom
o Find the area of EACH tail
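A sketch of the sigma-unknown interval is below. Python's standard library has no t distribution, so the critical value is passed in as a parameter after being read off the t table (2.0639 for 24 degrees of freedom and 0.025 in each tail); the function name `t_interval` is made up.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_crit):
    """Confidence interval for the mean when sigma is unknown.

    t_crit is the t-table value for n - 1 degrees of freedom
    with alpha/2 in each tail (looked up by hand).
    """
    margin = t_crit * s / sqrt(n)   # critical value x estimated standard error
    return x_bar - margin, x_bar + margin

# n = 25, x-bar = 50, s = 8, t(24 d.f., 0.025) = 2.0639 from the table
lower, upper = t_interval(50, s=8, n=25, t_crit=2.0639)
print(round(lower, 3), round(upper, 3))
```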
Confidence Intervals for Proportion (Use inverse normal table)
 Recall that the sampling distribution of the proportion is approximately normal if the sample size is large, with standard error:
    $\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}$
 We will estimate this with sample data
o i.e. p (sample proportion)
    since we don't know π (population proportion)
 Upper and lower confidence limits for the population proportion are therefore calculated with the formula:
    $p \pm Z \sqrt{\frac{p(1-p)}{n}}$
o Where:
    Z = standard normal value for the level of confidence desired
    p = the sample proportion
    n = the sample size
 Example
o Question
Form a 95% confidence interval for the true proportion of left handers
o Given information
    n = 100 people = sample size
    p = 25/100 = 0.25 = sample proportion
    Z = standard normal value (use inverse normal table)
    Area = 95% confidence level
    Area of tails = 1 - confidence level = 1 - 0.95 = 0.05
o Area of upper tail = area of tails/2 = 0.05/2 = 0.025
o Area of lower tail = area of tails/2 = 0.05/2 = 0.025
o Value of 0.025 in inverse normal table = 1.96
    1.96 = Z
o Formula
    $p \pm Z \sqrt{\frac{p(1-p)}{n}} = 0.25 \pm 1.96 \sqrt{\frac{0.25(1-0.25)}{100}} = 0.25 \pm 1.96 \sqrt{\frac{0.1875}{100}}$
    $= 0.25 \pm 1.96 \sqrt{0.001875} = 0.25 \pm (1.96 \times 0.0433)$
    $= 0.25 \pm 0.0849$
    Upper limit = 0.25 + 0.0849 = 0.3349 = 33.49%
    Lower limit = 0.25 – 0.0849 = 0.1651 = 16.51%
o Answer
    We are 95% confident that the true proportion of left handers in the population is between 16.51% and 33.49%.
o REMEMBER
    This is a PROPORTION
    p = sample PROPORTION = x/n or ?/total
    Therefore your answer will always be between 0 and 1
o In other words, your answer should be a decimal/percentage
    It is easy to get this mixed up with finding the confidence interval when sigma is known
    because they are very similar
    Read the question carefully
    Is it asking for a PROPORTION or a MEAN
o If proportion -> answer is in decimals
o If mean -> with sigma or without sigma
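The proportion interval follows the same pattern as the mean interval, with a different standard error. The sketch below assumes `statistics.NormalDist` from the Python standard library for the critical Z; the function name is hypothetical. It uses the left-hander figures (25 out of 100) and agrees with the hand calculation to about three decimal places.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence):
    """Confidence interval for a population proportion."""
    p = x / n                                  # sample proportion
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)    # alpha/2 in each tail
    margin = z * sqrt(p * (1 - p) / n)         # Z x standard error of p
    return p - margin, p + margin

lower, upper = proportion_interval(x=25, n=100, confidence=0.95)
print(round(lower, 4), round(upper, 4))
```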
Determining Sample Size, n

 Why this is important
o You need n, the sample size, to find the confidence interval for
    The mean
    The proportion
 Without it you can't use your formulas
o Cos you don't have 'n' derp…
 Returning to the formula for
o finding the confidence interval for the population mean (with sigma)
    $\bar{x} \pm Z \frac{\sigma}{\sqrt{n}}$
    Remember that $Z \frac{\sigma}{\sqrt{n}}$ was the sampling error
    we denote sampling error as 'e'
    Therefore,
    Sampling error $= e = Z \frac{\sigma}{\sqrt{n}}$
 If you don't have 'n' we have to rearrange the formula
o Swap e with n (so we isolate n)
    $e = Z \frac{\sigma}{\sqrt{n}} \rightarrow \sqrt{n} = \frac{Z\sigma}{e} \rightarrow n = \frac{Z^2 \sigma^2}{e^2}$
 FORMULA TO FOCUS ON
    $n = \frac{Z^2 \sigma^2}{e^2}$
 For example 'Finding sample size, n, when the question doesn't give it to us'
o Question
Find sample size, n
o Given information
    Standard deviation = 45
    ± 5 with 90% confidence
    Z: area = 1 - confidence level = 1 - 0.9 = 0.1
    Upper tail = area/2 = 0.05 (use inverse normal table)
    Lower tail = area/2 = 0.05 (use inverse normal table)
o Z = 1.6449
    e = 5
    Remember that the final step to finding the true mean was $\pm Z \frac{\sigma}{\sqrt{n}}$
o Therefore $\pm 5 = \pm Z \frac{\sigma}{\sqrt{n}} = e$
o Formula
    $n = \frac{Z^2 \sigma^2}{e^2} = \frac{1.6449^2 \times 45^2}{5^2} = \frac{2.7057 \times 2025}{25} = \frac{5479.0344}{25} = 219.16$
    n = 220
    Remember to round up (you can't have a part sample, e.g. having 219.16 students in a class)
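The rearranged formula and the round-up rule can be sketched in code. This assumes Python's standard-library `statistics.NormalDist` for the critical Z and `math.ceil` for rounding up; the function name is made up.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, e, confidence):
    """Smallest n so that the sampling error is at most e."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # alpha/2 in each tail
    n_exact = (z ** 2) * (sigma ** 2) / (e ** 2)
    return ceil(n_exact)                      # always round UP to a whole observation

n_needed = sample_size_for_mean(sigma=45, e=5, confidence=0.90)
print(n_needed)
```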
 If σ (sigma, aka st dev of the population) is unknown
o to determine the required sample size for the mean, you must know:
    the desired level of confidence (1 – α), which determines the critical Z value
    the acceptable sampling error, e
    the standard deviation, σ
o if unknown, sigma can be estimated:
    from past data, using that data's standard deviation
    if the population is normal, the range is approximately 6 standard deviations, so we can estimate the standard deviation by dividing the range by 6
    6 standard deviations / 6 = standard deviation
    by conducting a pilot study and estimating the standard deviation with the sample standard deviation, s.
Determining Sample size for π

 Very similar concept to the above
o Instead of the variance σ² we use π(1-π)
    For means: $n = \frac{Z^2 \sigma^2}{e^2}$
    For proportions: $n = \frac{Z^2 \pi(1-\pi)}{e^2}$
o Remember
    The key is just understanding what sampling error is
    Sampling error $= e = Z \sqrt{\frac{\pi(1-\pi)}{n}}$
 Example for finding sample size for a proportion (p, π)
o Question
How large a sample would be necessary to estimate the true proportion of defectives in a large population?
o Given information
    ±3% with 95% confidence
    p = 0.12
o Appropriate formula
    $n = \frac{Z^2 \pi(1-\pi)}{e^2}$
o Thought process
    we have p (sample), which we use as our estimate of π (population)
    = 0.12
    We have e
    e = sampling error = 3% = 0.03
    We need to find Z before we can use the formula
    $n = \frac{Z^2 \times 0.12(1-0.12)}{0.03^2}$
    Find Z
    Z = value that marks off the upper and lower tails for 95% confidence = 0.95
o Using the inverse normal table (which gives us area to the right)
    1 - 0.95 = 0.05
    The normal distribution is symmetrical, so the tail on the right mirrors the tail on the left
    Therefore since we have 2 tails
    tail = total tail area/2 = 0.05/2 = 0.025
    Tail = 0.025
    Z value = 1.96
    We now have all the values needed to finish the formula
    $n = \frac{1.96^2 \times 0.12(1-0.12)}{0.03^2} = \frac{3.8416 \times 0.1056}{0.0009} = \frac{0.40567}{0.0009} = 450.74$
o Answer
    Round up: the sample size needed is 451.
    To estimate the true proportion of defectives in a large population to within +/- 0.03 with 95% confidence (using 0.12 as the estimate of π), you need a sample size of 451.
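The proportion version of the sample size formula can be sketched the same way. This assumes Python's standard-library `statistics.NormalDist` for the critical Z; the function name is hypothetical, and `pi_estimate` is whatever prior guess of π the question supplies.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(pi_estimate, e, confidence):
    """Smallest n so that the sampling error of p is at most e."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # alpha/2 in each tail
    n_exact = (z ** 2) * pi_estimate * (1 - pi_estimate) / (e ** 2)
    return ceil(n_exact)                      # always round up

n_needed = sample_size_for_proportion(pi_estimate=0.12, e=0.03, confidence=0.95)
print(n_needed)
```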
Hypothesis Testing
All we are doing is testing a 'claim' about a population, to see if it is true or not. We do this by taking a SAMPLE and making a conclusion about the population.
 E.g. Advertisers claim: 4 out of 5 dentists recommend Colgate toothpaste.
o This is a claim about the population of dentists.
o It is a claim about the PROPORTION (π = 0.8 = 80%)
    Can I prove this claim? -> you can never prove this 100% (remember this)
    BUT we can use a sample to provide evidence to support or reject the claim.
 The null hypothesis and the alternative hypothesis
o always find the alternative before the null
    e.g. I want to say that the population proportion of dentists that recommend Colgate has decreased
    Ha : π < 0.8
    Therefore, H0: π ≥ 0.8 (that is, the proportion is still equal to or greater than 0.8)
 Using graphs
o When using your graphs, the information the question gives you will be in one of three categories:
    Z/T/F critical values:
    Percentage:
    Mean/Proportion value (actual value):
o You cannot draw a graph accurately if the information isn't all in one of the above categories
    therefore you have to convert all the information so that it all fits into one category (all the information speaks the same language)
o so what we do is either
    convert the Z values (which come from the alpha/significance level) into an x-bar/p value
o Do this by creating a confidence interval using the formulas from week 8 (use this for your test statistic)
o this requires a bit more work (as you have to do 2 equations)
    or
    convert the mean/proportion value into a Z/T value
o WE ALWAYS CONVERT TO Z/T VALUES and plot on the same graph (it's what we use the test statistic for)
 Using P values
o this is a quick way to determine if you should reject H0
    it's used a lot in the REAL WORLD
o if p-value < alpha, reject H0
o if p-value ≥ alpha, do not reject H0
o to find the p-value just ask the question: if H0 is true, what is the probability of getting a sample statistic (mean or proportion) at least as extreme as the one you got?
    remember the significance level (alpha) is a percentage/probability
    therefore your p-value is also a percentage/probability
Notations guide:
H0: Null hypothesis (what the current hypothesis/status quo is)
H1/Ha: Alternative hypothesis (what you are trying to find evidence for)
α: alpha, area of rejection (level of significance = total size of the rejection region)
D.R: Decision Rule – when do you reject H0 (null hypothesis)
T.S: Test Statistic – the formula you use
T.S.V: Test Statistic Value – the number you get from the test statistic formula
Con: Conclusion – are you rejecting or keeping H0 (null hypothesis)? State the significance level and contextualise
or
is there enough evidence to support Ha? Yes/no. Do you reject H0? Yes/no.
n: Sample size
The Different hypothesis Tests

 1. Sigma is known – Use Z test (population average)
 2. Sigma is unknown – Use T test (population average)
 3. P-value approach to testing (probability)
 4. One tailed tests
o Upper = alternative hypothesis says "greater than"
o Lower = alternative hypothesis says "lower than"
 5. Proportion testing – use Z-test (proportion)
Table of test statistics. For each hypothesis test: Null Hypothesis (H0), Alternative Hypothesis (H1, Ha), Test Statistic (T.S)

Population average (Sigma Known) (Two tailed, α/2 in each tail)
    H0: µ = ?        H1: µ ≠ ?
    Z-Test (Mean): $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Population average (Sigma Known) (Upper tail) "AN INCREASE"
    H0: µ ≤ ?        H1: µ > ?
    Z-Test (Mean): $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Population average (Sigma Known) (Lower tail) "A DECREASE"
    H0: µ ≥ ?        H1: µ < ?
    Z-Test (Mean): $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

Population average (Sigma Unknown)
    H0: µ = ?        H1: µ ≠ ?
    T-Test (Mean): $t_{n-1} = \frac{\bar{x} - \mu}{S / \sqrt{n}}$
    Degrees of freedom = n - 1

Proportion test
    H0: π = ?        H1: π ≠ ?
    Z-Test (Proportion): $Z = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}}$

Two Sample Test (Means), sigma unknown, variances assumed equal
    Assumptions: n = minimum 30, or populations normally distributed
    Lower: H0: µ1 ≥ µ2 (i.e. µ1 - µ2 ≥ 0)        H1: µ1 < µ2 (i.e. µ1 - µ2 < 0)
    Upper: H0: µ1 ≤ µ2 (i.e. µ1 - µ2 ≤ 0)        H1: µ1 > µ2 (i.e. µ1 - µ2 > 0)
    Two:   H0: µ1 = µ2 (i.e. µ1 - µ2 = 0)        H1: µ1 ≠ µ2 (i.e. µ1 - µ2 ≠ 0)
    Pooled variance test: $t_{STAT} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
    Where $s_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}$
    Confidence interval (extra): $(\bar{x}_1 - \bar{x}_2) \pm t_{n_1+n_2-2} \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
    Degrees of freedom = n1 + n2 - 2

Two Sample Test (Means), sigma unknown, variances NOT assumed equal
    Assumptions:
    - populations are normally distributed or both sample sizes are at least 30
    - population variances are unknown and cannot be assumed to be equal
    Lower: H0: µ1 ≥ µ2 (i.e. µ1 - µ2 ≥ 0)        H1: µ1 < µ2 (i.e. µ1 - µ2 < 0)
    Upper: H0: µ1 ≤ µ2 (i.e. µ1 - µ2 ≤ 0)        H1: µ1 > µ2 (i.e. µ1 - µ2 > 0)
    Two:   H0: µ1 = µ2 (i.e. µ1 - µ2 = 0)        H1: µ1 ≠ µ2 (i.e. µ1 - µ2 ≠ 0)
    Separate-variance t test: $t_{STAT} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$
    t_STAT has no simple d.f.; use $v = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$

F-Test for difference between two variances (USE THE F TABLE)
    Lower: H0: σ1² ≥ σ2²        H1: σ1² < σ2²
    Upper: H0: σ1² ≤ σ2²        H1: σ1² > σ2²
    Two:   H0: σ1² = σ2²        H1: σ1² ≠ σ2²
    $F = S_1^2 / S_2^2$
    Where:
    S1² = variance of sample 1 (larger sample variance)
    n1 = sample size of sample taken from population 1
    S2² = variance of sample 2
    n2 = sample size of sample taken from population 2
    n1 - 1 = numerator degrees of freedom (from sample 1)
    n2 - 1 = denominator degrees of freedom (from sample 2)

Two proportions
    Lower: H0: π1 ≥ π2        H1: π1 < π2
    Upper: H0: π1 ≤ π2        H1: π1 > π2
    Two:   H0: π1 = π2        H1: π1 ≠ π2
    $Z_{STAT} = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
    Where p-bar: $\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$
    Where x = number of successes in the sample

Chi square test for population variance or standard deviation
    $\chi^2 = \frac{(n-1)s^2}{\sigma^2}$
    Where
    n = sample size
    s² = sample variance
    σ² = hypothesised population variance
    Degrees of freedom = n - 1

χ² test of independence
    $\chi^2 = \sum_{\text{all cells}} \frac{(f_0 - f_e)^2}{f_e}$
    Where
    f0 = observed frequency in a particular cell of the r x c table
    fe = expected frequency in a particular cell if H0 is true
    fe = row total x column total / n
    Degrees of freedom = (r-1)(c-1)

Chi-Square Goodness of fit test
    $\chi^2_{k-p-1} = \sum_{k} \frac{(f_0 - f_e)^2}{f_e}$
    Where
    k = number of categories or classes remaining after combining classes
    f0 = observed frequency
    fe = expected frequency
    p = number of parameters estimated from the data
Steps in Hypothesis testing (IMPORTANT)

 1. H0 : State the null hypothesis
 2. H1 : State the alternative hypothesis (do this first, then determine step 1)
 3. α : Choose level of significance, n : choose sample size
 4. Determine the appropriate test statistic and sampling distribution
o is sigma known?
    Use Z Test
o is sigma unknown?
    Use T Test
 5. Determine the critical values that divide the rejection and non-rejection regions
o Rejection regions
    Is the test:
    Single tailed
    Two tailed
 6. Collect data and compute the value of the test statistic
 7. Make the statistical decision and state the managerial conclusion
o If the test statistic falls into the
    non-rejection region, do not reject the null hypothesis H0.
    rejection region, reject the null hypothesis
o Express the managerial conclusion in the context of the real-world problem
Hypothesis Testing template

 Question:
 H0 :
 H1 :
α :
 D.R :
 T.S :
 T.S.V :
 Con :
Chi Square Distribution

 Test for a variance or standard deviation
 Perform a test of independence
 Evaluate the goodness of fit of a set of data to a specific probability distribution
Z- Test Example: True Mean (With Sigma, 2-Tailed)

 Question:
o Test the assumption or status quo that the true mean number of TV sets in U.S. homes is equal to 3. Assume sigma = 0.8, sample mean = 2.84, sample size = 100
 H0 : µ = 3 , population mean equals 3
 H1 : µ ≠ 3 , population mean doesn't equal 3
 α : Significance level = 0.05
 α/2 : Area of 1 tail = α/2 = 0.05/2 = 0.025 (which is a probability), giving a critical value of 1.96 (which is a Z value)
o Note: use the inverse normal table to convert 0.025 into a Z value
 D.R : Reject H0 if Z < -1.96 or Z > 1.96; otherwise do not reject H0
 T.S : Sigma is known, therefore this is a Z Test
o $Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
    Where
    $\bar{x}$ = 2.84
    µ = 3
    σ = 0.8
    n = 100
 T.S.V : $Z = \frac{2.84 - 3}{0.8 / \sqrt{100}} = \frac{2.84 - 3}{0.8 / 10} = \frac{-0.16}{0.08} = -2$
 Con :
o Z = -2, which is below -1.96. Therefore, we reject the null hypothesis. There is sufficient evidence that the true mean number of TV sets in U.S. homes is not equal to 3.
 Probability value (P-value) approach
o As found above x-bar has a Z-score of -2.0
    Using the normal table
    P(Z < -2.0) = 0.0228 or 2.28%
    P(Z > 2.0) = 0.0228 or 2.28%
    Two-tailed p-value = 0.0228 + 0.0228 = 0.0456
o Saying reject H0 if Z < -1.96 or Z > 1.96
    is the same as saying
    Reject H0 if p-value < significance level (could use this for the D.R)
    P-value 0.0456 < 0.05
    Equivalently, looking at one tail only: 0.0228 < 0.025
o Reject the null hypothesis
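The two-tailed Z test, including the p-value approach, can be sketched as one function. This assumes Python's standard-library `statistics.NormalDist` in place of the normal table; the function name is hypothetical. It returns the test statistic value, the two-tailed p-value and the decision.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(x_bar, mu0, sigma, n, alpha):
    """Two-tailed Z test for a mean when sigma is known."""
    z = (x_bar - mu0) / (sigma / sqrt(n))      # test statistic value
    p_value = 2 * NormalDist().cdf(-abs(z))    # area in BOTH tails beyond |z|
    reject_h0 = p_value < alpha                # same decision as |z| > critical value
    return z, p_value, reject_h0

# TV sets example: H0: mu = 3, sigma = 0.8, x-bar = 2.84, n = 100, alpha = 0.05
z, p_value, reject_h0 = z_test_two_tailed(2.84, mu0=3, sigma=0.8, n=100, alpha=0.05)
print(round(z, 2), round(p_value, 4), reject_h0)
```

Comparing the two-tailed p-value with α gives the same decision as comparing one tail with α/2.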
Z Test Example: True Mean (Sigma Known, Upper Tail)
 Question:
o A phone industry manager claims that customer monthly cell phone bills have increased (in other words the status quo has changed), and now average over $52 per month.
o The company wishes to test this claim with α = 0.1
o Assume sigma = 10.
 H0 : µ ≤ $52 – average is not over $52 per month
 H1 : µ > $52 – average is over $52 per month
 α : 0.1 in the upper tail = Z critical value of 1.2816
 D.R : Reject H0 if Z > 1.28
 T.S :
o $Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
    where
    $\bar{x}$ = 53.1
    µ = $52
    σ = 10
    n = 64
    Z critical = 1.2816
 T.S.V :
o $Z = \frac{53.1 - 52}{10 / \sqrt{64}} = \frac{1.1}{10 / 8} = \frac{1.1}{1.25} = 0.88$
 Con :
o Do not reject the null hypothesis because 0.88 < 1.28. There is not enough evidence that the mean bill is over $52.
 Probability value approach
o As found, the Z-score is 0.88, which gives 0.8106 (area to the left)
    1 - 0.8106 = P(Z > 0.88) = 0.1894
o if p-value < significance level, reject the null hypothesis
    P(Z > 0.88) = 0.1894
    Significance = 0.1
o Do not reject the null hypothesis because 0.1894 is not < 0.1.
T-Test Example: True Mean (Sigma Unknown, Two tail)
 Question:
o Average cost of a hotel room in New York is said to be $168 per night. A random sample of 25 hotels resulted in X-bar = $172.5 and s = $15.4
o Test at α = 0.05 (assume the distribution is normal)
 H0 : µ = $168
 H1 : µ ≠ $168
 α : 0.05 (two tailed test), tail (α/2) = 0.025 at n - 1 = 25 - 1 = 24
o Critical values = refer to T table at 0.025, 24 degrees of freedom = 2.0639
 D.R : Reject H0 if t < -2.0639 or t > 2.0639
 T.S : The test statistic is the T-test
o $t_{n-1} = \frac{\bar{x} - \mu}{S / \sqrt{n}}$
    Where
    $\bar{x}$ = sample average = $172.5 p/night
    µ = population average = $168 p/night
    S = sample st dev = $15.4
    n = sample size = 25
 T.S.V :
o $t_{n-1} = \frac{172.5 - 168}{15.4 / \sqrt{25}} = \frac{4.5}{15.4 / 5} = \frac{4.5}{3.08} = 1.4610$
 Con : Do not reject H0 (1.4610 is between -2.0639 and 2.0639): there is not sufficient evidence that the true mean cost is different from $168
Z-Test Example: Proportion
 Question:
o A marketing company claims that it receives 8% responses from its mailing. To test this claim, a random sample of 500 were surveyed with 25 responses.
o Test at α = 0.05 significance level
 H0 : π = 0.08
 H1 : π ≠ 0.08
 α : 0.05 = significance level, area of each tail = α/2 = 0.025 = Z critical value 1.96
 D.R : Reject H0 if Z < -1.96 or Z > 1.96
 T.S :
o $Z = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}}$
    Where
    p = sample proportion (responses/total) = 25/500 = 0.05
    π = µp = mean of the sample proportions = 0.08
    n = sample size = 500
 T.S.V : $Z = \frac{0.05 - 0.08}{\sqrt{\frac{0.08(1-0.08)}{500}}} = \frac{-0.03}{\sqrt{\frac{0.0736}{500}}} = \frac{-0.03}{\sqrt{0.0001472}} = \frac{-0.03}{0.012133} = -2.4727$
 Con : Z = -2.47 is below -1.96, so there is sufficient evidence to reject the company's claim of an 8% response rate.
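The arithmetic in the proportion test is easy to slip on (the square root of 0.0001472 is about 0.0121, which gives Z of about -2.47), so a quick check in code helps. This sketch assumes Python's standard-library `statistics.NormalDist`; the function name is made up.

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(x, n, pi0, alpha):
    """Two-tailed Z test for a single proportion."""
    p = x / n                                   # sample proportion
    standard_error = sqrt(pi0 * (1 - pi0) / n)  # uses the hypothesised pi, not p
    z = (p - pi0) / standard_error
    p_value = 2 * NormalDist().cdf(-abs(z))     # both tails
    return z, p_value, p_value < alpha

# 25 responses out of 500, claimed response rate 8%, alpha = 0.05
z, p_value, reject_h0 = z_test_proportion(x=25, n=500, pi0=0.08, alpha=0.05)
print(round(z, 4), round(p_value, 4), reject_h0)
```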
Pooled-Variance t Test Example (mean)
 Question:
You are a financial analyst for a brokerage firm. Is there a difference in mean dividend yield between stocks listed on the NYSE and NASDAQ?
Assuming both populations are approximately normal with equal variances, is there a difference in mean yield (alpha = 0.05)?
You collect the following data

                      NYSE     NASDAQ
    Number            21       25
    Sample Mean       3.27     2.53
    Sample Std Dev    1.30     1.16

note: variance = (st dev)²
 H0 : µNYSE = µNASDAQ (µ1 = µ2 i.e., µ1 - µ2 = 0)
 H1 : µNYSE ≠ µNASDAQ (µ1 ≠ µ2 i.e., µ1 - µ2 ≠ 0)
 α : 0.05
 α/2 : 0.025
o T value = look up degrees of freedom and 0.025 in the T table
Degrees of freedom = n(NYSE) + n(NASDAQ) – 2 = 21 + 25 – 2 = 44
T values = 2.0154, -2.0154
 D.R : Reject H0 if t_STAT < -2.0154 or t_STAT > 2.0154
 T.S :
o $t_{STAT} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
    Where
    X-bar 1 = NYSE mean = 3.27
    X-bar 2 = NASDAQ mean = 2.53
    µ1 - µ2 = 0 (the value claimed under H0)
    n1 = NYSE number = 21
    n2 = NASDAQ number = 25
    S1 = NYSE st dev = 1.30
    S2 = NASDAQ st dev = 1.16
o $s_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}$
 T.S.V :
o $t_{STAT} = \frac{(3.27 - 2.53) - 0}{\sqrt{1.5021 \left(\frac{1}{21} + \frac{1}{25}\right)}} = \frac{0.74}{\sqrt{1.5021 \times 0.08762}} = \frac{0.74}{\sqrt{0.1316}} = \frac{0.74}{0.3627} = 2.040$
    WHERE
o $s_p^2 = \frac{(21 - 1)1.3^2 + (25 - 1)1.16^2}{(21 - 1) + (25 - 1)} = \frac{(20 \times 1.69) + (24 \times 1.3456)}{20 + 24} = \frac{33.8 + 32.2944}{44} = \frac{66.0944}{44} = 1.5021$
 Con : 2.040 > 2.0154, so there is sufficient evidence to support the alternative hypothesis. Reject the null.
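The pooled-variance calculation has two stages (pool the variances, then form the t statistic), which the sketch below keeps separate. The function names are made up, and the critical value 2.0154 is taken from the t table because Python's standard library has no t distribution.

```python
from math import sqrt

def pooled_variance(s1, n1, s2, n2):
    """Weighted average of the two sample variances."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / ((n1 - 1) + (n2 - 1))

def pooled_t_stat(x1, s1, n1, x2, s2, n2, hypothesised_diff=0.0):
    """t statistic for the difference between two means (equal variances assumed)."""
    sp2 = pooled_variance(s1, n1, s2, n2)
    standard_error = sqrt(sp2 * (1 / n1 + 1 / n2))
    return ((x1 - x2) - hypothesised_diff) / standard_error

sp2 = pooled_variance(1.30, 21, 1.16, 25)
t_stat = pooled_t_stat(3.27, 1.30, 21, 2.53, 1.16, 25)
t_crit = 2.0154                    # t table: 44 d.f., 0.025 in each tail
reject_h0 = abs(t_stat) > t_crit   # two-tailed decision rule
print(round(sp2, 4), round(t_stat, 3), reject_h0)
```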
Pooled-Variance Confidence interval Example (mean)
 Question:
You are a financial analyst for a brokerage firm. Is there a difference in mean dividend yield between stocks listed on the NYSE and NASDAQ?
Assuming both populations are approximately normal with equal variances, is there a difference in mean yield (alpha = 0.05)?
You collect the following data

                      NYSE     NASDAQ
    Number            21       25
    Sample Mean       3.27     2.53
    Sample Std Dev    1.30     1.16

note: variance = (st dev)²
 H0 : µNYSE = µNASDAQ (µ1 = µ2 i.e., µ1 - µ2 = 0)
 H1 : µNYSE ≠ µNASDAQ (µ1 ≠ µ2 i.e., µ1 - µ2 ≠ 0)
 α : 0.05
 α/2 : 0.025
o T value = look up degrees of freedom and 0.025 in the T table
Degrees of freedom = n(NYSE) + n(NASDAQ) – 2 = 21 + 25 – 2 = 44
T values = 2.0154, -2.0154
 D.R : Reject the null hypothesis if the confidence interval for µ1 - µ2 does not contain 0
 T.S :
o $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
    Where
    X-bar 1 = NYSE mean = 3.27
    X-bar 2 = NASDAQ mean = 2.53
    n1 = NYSE number = 21
    n2 = NASDAQ number = 25
    S1 = NYSE st dev = 1.30
    S2 = NASDAQ st dev = 1.16
o $s_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}$
 T.S.V :
o $(3.27 - 2.53) \pm 2.0154 \sqrt{1.5021 (0.0476 + 0.04)} = 0.74 \pm (2.0154 \times 0.3627) = 0.74 \pm 0.7310 = (0.009, 1.471)$
    WHERE
o $s_p^2 = \frac{(21 - 1)1.3^2 + (25 - 1)1.16^2}{(21 - 1) + (25 - 1)} = \frac{(20 \times 1.69) + (24 \times 1.3456)}{20 + 24} = \frac{33.8 + 32.2944}{44} = \frac{66.0944}{44} = 1.5021$
 Con : The interval 0.009 to 1.471 does not contain 0, so there is sufficient evidence to support the alternative hypothesis. Reject the null.
F Test Example : Difference between two Variances

 Quesion:
You are a inancial analysis for a brokerage irm. is there a diference in dividend yield
between stocks listed on the NYSE and NASDAQ? you collect the following data
NYSE NASDAQ
Number 21 25
Is there a diference in the variances between NYSE and NASDAQ at the 0.05% level
 H0 : σ1² = σ2² Note: (standard deviation)² = variance
 H1 : σ1² ≠ σ2²
 α : 0.05
 α/2 : 0.025
o Degrees of freedom 1 = n1 - 1 = 20 (numerator, F table columns)
o Degrees of freedom 2 = n2 - 1 = 24 (denominator, F table rows)
F-value upper = 2.33
F-value lower = found with the degrees of freedom FLIPPED: look at column 24 and row 20 = 2.41
 The F table only gives upper-tail values, so the lower limit is the reciprocal of the flipped value
 1/2.41 = 0.41 = F-value lower
 D.R : If F < 0.41 or F > 2.33, reject the null hypothesis
 T.S : $F = \frac{S_1^2}{S_2^2}$
o Where:
 S1² = variance of sample 1 (larger sample variance)
 n1 = sample size of sample taken from population 1
 S2² = variance of sample 2
 n2 = sample size of sample taken from population 2
 n1 - 1 = numerator degrees of freedom (from sample 1)
 n2 - 1 = denominator degrees of freedom (from sample 2)
 T.S.V :
o $F = \frac{1.3^2}{1.16^2} = \frac{1.69}{1.3456} = 1.256$
 Con :
o 1.256 lies between 0.41 and 2.33, so there is not enough evidence to support the alternative hypothesis. Do not reject the null
hypothesis.
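The F test steps above can be sketched in a few lines of Python. This is only an illustration: the function name `f_test_two_tailed` is hypothetical, and both table values (2.33 and the flipped 2.41) are typed in from the F table, not calculated.

```python
def f_test_two_tailed(s1, s2, f_upper, f_upper_flipped):
    """Two-tailed F test for equal variances.

    s1, s2          -- sample standard deviations (s1 is the larger one)
    f_upper         -- upper critical value with d.f. (n1 - 1, n2 - 1)
    f_upper_flipped -- upper critical value with the d.f. flipped (n2 - 1, n1 - 1)
    """
    f_stat = s1**2 / s2**2
    f_lower = 1 / f_upper_flipped  # lower limit is the reciprocal of the flipped value
    reject = f_stat < f_lower or f_stat > f_upper
    return f_stat, f_lower, reject

# NYSE st dev = 1.30 (n = 21), NASDAQ st dev = 1.16 (n = 25)
f_stat, f_lower, reject_h0 = f_test_two_tailed(1.30, 1.16, f_upper=2.33, f_upper_flipped=2.41)
print(f"F = {f_stat:.3f}, lower limit = {f_lower:.3f}, reject H0: {reject_h0}")
```

Putting the larger sample variance on top means F is always at least 1, so in practice only the upper limit can ever be crossed.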
Z-Test Example: Proportion 2 Samples

 Question:
o Is there a significant difference between the proportion of men and the proportion of
women who will vote yes on Proposition A?
 In a random sample, 36 of 72 men and 31 of 50 women indicated they would
vote yes
 Test at the 0.05 level of significance
 H0 : π1 = π2
 H1 : π1 ≠ π2
 α : 0.05
 α/2 : 0.025 = Z critical value: 1.96
o Upper tail = 1.96
o Lower tail = -1.96
 D.R : If Z < -1.96 or if Z > 1.96, reject the null hypothesis
 T.S :
o $Z_{stat} = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p} (1 - \bar{p}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$
 Where p-bar
 $\bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$
o Where x = number of successes in sample
o p1 = 36/72 = 0.5 = proportion of men = x1/n1
o p2 = 31/50 = 0.62 = proportion of women = x2/n2
o n1 = 72
o n2 = 50
o π1 - π2 = 0 (the hypothesised difference under H0)
o x1 = 36
o x2 = 31
 T.S.V :
o $Z_{stat} = \frac{(0.5 - 0.62) - 0}{\sqrt{0.549 (1 - 0.549) \left( \frac{1}{72} + \frac{1}{50} \right)}} = \frac{-0.12}{\sqrt{0.549 \times 0.451 \times 0.0339}} = \frac{-0.12}{0.0916} = -1.31$
 Where p-bar
 $\bar{p} = \frac{36 + 31}{72 + 50} = \frac{67}{122} = 0.5492$
 Con : -1.31 lies between -1.96 and 1.96, so there is no significant evidence of a difference between men and women in the proportion who will vote yes. Do not reject the null hypothesis
 Confidence interval
$(p_1 - p_2) \pm Z_{\alpha/2} \sqrt{\frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}}$
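Here is a rough Python sketch of the two-proportion Z test and the matching confidence interval, using the voting figures from this example. The function names are hypothetical, and 1.96 is typed in from the Z table. Note that the test statistic uses the pooled proportion p-bar, while the confidence interval uses p1 and p2 separately.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: pi1 = pi2, using the pooled proportion p-bar."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)
    std_error = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_error, p_bar

def two_proportion_ci(x1, n1, x2, n2, z_crit):
    """Confidence interval for pi1 - pi2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    std_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_crit * std_error, diff + z_crit * std_error

# 36 of 72 men and 31 of 50 women would vote yes
z_stat, p_bar = two_proportion_z(36, 72, 31, 50)
ci_low, ci_high = two_proportion_ci(36, 72, 31, 50, z_crit=1.96)

reject_h0 = abs(z_stat) > 1.96
print(f"p-bar = {p_bar:.4f}, Z = {z_stat:.2f}, reject H0: {reject_h0}")
print(f"95% interval for pi1 - pi2: ({ci_low:.3f}, {ci_high:.3f})")
```

The interval contains 0, which lines up with the decision not to reject the null.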
Chi Square test for ST dev or variance

 Question:
o A manufacturer of door knobs has a production process designed to provide a door
knob with a target diameter of 2.5cm. In the past the standard deviation of the
diameter has been 0.035cm. In an effort to reduce variation in the process, the
process has been redesigned.
o A sample of 25 door knobs produced under the new process indicates a sample
standard deviation of 0.025cm.
o At the 0.05 level of significance, is there evidence that the population standard deviation
is less than 0.035cm in the new process?
 H0 : σ = 0.035
 H1 : σ < 0.035
 α : 0.05
o Degrees of freedom = n - 1 = 25 - 1 = 24
o Critical value of χ² at 0.05 = 13.848 (lower tail)
 D.R : If χ² < 13.848, reject H0 (lower tail test)
 T.S :
o $\chi^2 = \frac{(n - 1) s^2}{\sigma^2}$
 Where
 n = sample size = 25
 s² = sample variance = 0.025²
 σ² = hypothesised population variance = 0.035²
 Degrees of freedom = n - 1
 T.S.V :
o $\chi^2 = \frac{(25 - 1) 0.025^2}{0.035^2} = \frac{24 \times 0.000625}{0.001225} = \frac{0.015}{0.001225} = 12.245$
 Con : 12.245 < 13.848, so there is sufficient evidence to support the alternative hypothesis. Reject the null
hypothesis
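A quick way to check the door knob calculation is the short Python sketch below. The function name `chi_square_variance_stat` is just an illustrative choice, and the critical value 13.848 is typed in from the chi-square table (24 d.f., lower tail).

```python
def chi_square_variance_stat(n, s, sigma0):
    """Chi-square statistic (n - 1) * s^2 / sigma0^2 for a single variance."""
    return (n - 1) * s**2 / sigma0**2

# 25 door knobs, sample st dev 0.025cm, hypothesised st dev 0.035cm
chi2_stat = chi_square_variance_stat(25, 0.025, 0.035)

# Lower-tail test: reject H0 when the statistic falls below the critical value
critical_lower = 13.848
reject_h0 = chi2_stat < critical_lower
print(f"chi-square = {chi2_stat:.3f}, reject H0: {reject_h0}")
```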
Chi Square (χ²) test for independence

 Question:
o A department store wants to know whether there is a relationship between the
method of payment and the gender of the customer.
o The results of a survey of 400 shoppers are shown in the table below.
o Is there any evidence of a relationship between gender and method of payment?
o Use alpha = 0.05
Gender      Method of payment
            Cash    Cheque  Credit Card   Total
Male        20      70      120           210
Female      24      76      90            190
Total       44      146     210           400
 H0 : Method of payment and gender are independent
 H1 : Method of payment and gender are dependent
 α : 0.05
o Degrees of freedom = (rows - 1)(columns - 1) = 1 × 2 = 2 degrees of freedom
 D.R : If χ² > χ²U, reject H0, otherwise do not reject H0
o χ²U = 5.991 (upper tail)
o If χ² is larger than the upper-tail critical value then reject the null.
 T.S :
o $\chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}$
 Where
 fo = observed frequency in a particular cell of the r x c table
o Column proportions (column total / n):
            Cash             Cheque             Credit Card
Total       44/400 = 11%     146/400 = 36.5%    210/400 = 52.5%
 fe = expected frequency in a particular cell if H0 is true
            Cash                Cheque                 Credit Card              Total
Male        210 x 11% = 23.1    210 x 36.5% = 76.65    210 x 52.5% = 110.25     210
Female      190 x 11% = 20.9    190 x 36.5% = 69.35    190 x 52.5% = 99.75      190
 fe = row total x column total / n
 Degrees of freedom = (r - 1)(c - 1)
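 T.S.V :
o $\chi^2 = \frac{(20 - 23.1)^2}{23.1} + \frac{(70 - 76.65)^2}{76.65} + \frac{(120 - 110.25)^2}{110.25} + \frac{(24 - 20.9)^2}{20.9} + \frac{(76 - 69.35)^2}{69.35} + \frac{(90 - 99.75)^2}{99.75} = 3.906$
 Con :
3.906 is < 5.991
Therefore, we don’t reject the null.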
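The expected frequencies and the chi-square statistic for the payment table can be rebuilt with the Python sketch below. The function names are hypothetical and the critical value 5.991 is typed in from the chi-square table (2 d.f.). It uses fe = row total x column total / n for every cell.

```python
def expected_frequencies(table):
    """Expected count for each cell: row total x column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    expected = expected_frequencies(table)
    stat = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(table, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected

# Rows: Male, Female. Columns: Cash, Cheque, Credit Card
observed = [[20, 70, 120],
            [24, 76, 90]]
chi2_stat, df, expected = chi_square_independence(observed)

# Upper-tail test: reject H0 only when the statistic exceeds the critical value
reject_h0 = chi2_stat > 5.991
print(f"chi-square = {chi2_stat:.3f}, d.f. = {df}, reject H0: {reject_h0}")
```

Working from row and column totals gives the same expected counts as multiplying each row total by the column percentages.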
Chi Square Goodness of fit test

 Question:
o Does the sample data conform to a hypothesised distribution?
 Do measurements from a production process follow a normal distribution?
 Mean = 50
 Sigma = 15
 H0 : The production process measurements are normally distributed
 H1 : The production process measurements are not normally distributed
 α : 0.05
o Degrees of freedom = k - p - 1
 For a normal distribution THERE ARE ALWAYS 2 PARAMETERS (mean and standard deviation), so p = 2
 With k = 7 classes: 7 - 2 - 1 = 4
 Critical χ²U = 9.488
 D.R : If χ² > χ²U, reject H0
 T.S :
o $\chi^2_{k-p-1} = \sum_{k} \frac{(f_o - f_e)^2}{f_e}$
 Where
 k = number of categories or classes remaining after combining classes
 fo = observed frequency
 fe = expected frequency
 p = number of parameters estimated from the data
 T.S.V :
 Con :
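The notes don't include the data for this example, so the Python sketch below only shows the mechanics. The class boundaries and observed counts are HYPOTHETICAL numbers invented for illustration (they are not from the lecture), and the function names are made up too. Only the mean (50), sigma (15), k = 7, p = 2 and the critical value 9.488 come from the example above.

```python
from statistics import NormalDist

def normal_expected_counts(upper_bounds, n, mean, sigma):
    """Expected counts for the classes (-inf, b1], (b1, b2], ..., (b_last, +inf)."""
    dist = NormalDist(mean, sigma)
    cdf = [0.0] + [dist.cdf(b) for b in upper_bounds] + [1.0]
    return [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

def gof_statistic(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (fo - fe)^2 / fe over all classes."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# HYPOTHETICAL class boundaries and observed counts (6 boundaries -> 7 classes)
bounds = [30, 40, 45, 55, 60, 70]
observed = [10, 15, 12, 27, 11, 17, 8]

expected = normal_expected_counts(bounds, n=sum(observed), mean=50, sigma=15)
chi2_stat = gof_statistic(observed, expected)

k, p = len(observed), 2   # 7 classes, 2 parameters for a normal distribution
df = k - p - 1            # 7 - 2 - 1 = 4
reject_h0 = chi2_stat > 9.488  # critical value for 4 d.f. at alpha = 0.05
print(f"chi-square = {chi2_stat:.3f}, d.f. = {df}, reject H0: {reject_h0}")
```

If any expected count comes out very small, classes are normally combined first, which is why k is defined as the number of classes remaining after combining.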
