Mean (Average): x̄ = Σx / n
Mean vs. Median: Skewed to the right (long right tail): mean > median. Skewed to the left (long left tail): mean < median.
Boxplot: 1. Q1, Q2, Q3 2. Fences / Outliers 3. Max, Min
Q1 (First Quartile): 25th Percentile. Q2 (Median): 50th Percentile. Q3 (Third Quartile): 75th Percentile.
Upper Fence: Q3 + 1.5·IQR. Lower Fence: Q1 − 1.5·IQR.
Interquartile Range (IQR) = Q3 − Q1: spread of the middle 50% of the distribution.
Sensitivity to Outliers: Not skewed (roughly symmetric, no outliers): use mean and range, variance, SD (these are sensitive to outliers). Skewed (has outliers): use median and IQR (not sensitive to outliers).
Population Standard Deviation: σ = √( Σ(x − μ)² / N )  or  Sample Standard Deviation: s = √( Σ(x − x̄)² / (n − 1) )
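The quartile and fence definitions above can be sketched in a few lines of Python. The data set and the median-split quartile convention are illustrative assumptions; textbooks differ slightly on quartile rules:

```python
# Sketch: quartiles, IQR, and 1.5*IQR outlier fences for a small sample.
# Quartiles here split the sorted data at the median (one common convention).
from statistics import median

def five_number_summary(data):
    xs = sorted(data)
    n = len(xs)
    q2 = median(xs)
    lower = xs[:n // 2]            # values below the median
    upper = xs[(n + 1) // 2:]      # values above the median
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outliers = [x for x in xs if x < fences[0] or x > fences[1]]
    return q1, q2, q3, iqr, fences, outliers

q1, q2, q3, iqr, fences, outliers = five_number_summary(
    [2, 4, 4, 5, 6, 7, 8, 9, 30])
print(q1, q2, q3, iqr, fences, outliers)
```

Here 30 lies above the upper fence, so it is flagged as an outlier, matching the boxplot rule above.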
Linear Transformation: Adding a constant (a): center (mean, median) shifts by a; spread does not change. Multiplying by a constant (b): center and spread (SD, IQR) are multiplied by b (spread by |b|); variance is multiplied by b². New Linear Model y = a + bx: o Center: plug the original center into the new formula o Spread: |b| × original spread o Variance: b² × original variance
Normal Distribution
Empirical Rule: about 68% of the data fall within 1 SD of the mean, 95% within 2 SDs, 99.7% within 3 SDs.
Standard Normal Distribution: N(0, 1), i.e. N(μ = 0, σ = 1). For data: z = (x − μ)/σ. For the sampling model: z = (x̄ − μ)/(σ/√n). For a %: convert the percentile to z from the table, then x = μ + z·σ.
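As a quick check of the Empirical Rule, the sketch below standardizes a simulated normal sample (μ = 100, σ = 15 are invented values) and counts how much of it falls within 1, 2, and 3 SDs of the mean:

```python
# Sketch: Empirical Rule (68-95-99.7) on a simulated normal sample.
import random

random.seed(1)
mu, sigma = 100, 15                          # invented parameters
data = [random.gauss(mu, sigma) for _ in range(100_000)]

z = [(x - mu) / sigma for x in data]         # z = (x - mu) / sigma
within1 = sum(abs(v) < 1 for v in z) / len(z)
within2 = sum(abs(v) < 2 for v in z) / len(z)
within3 = sum(abs(v) < 3 for v in z) / len(z)
print(within1, within2, within3)             # ~0.68, ~0.95, ~0.997
```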
Sampling Distributions for Proportions (Categorical Data): p: true proportion, center of the histogram. p̂: sample proportion, varies from one sample to the next.
p̂ ~ N(p, √(pq/n))
For CI: use SE(p̂) = √(p̂q̂/n).
Sampling Distributions for Means (Quantitative Data): μ: population mean. x̄: sample mean.
x̄ ~ N(μ, σ/√n)
For CI: use SE(x̄) = s/√n.
Statistical Inference
Properties of the Sampling Distribution of Sample Means: 1. μx̄ = μ: the mean of the sample means equals the population mean 2. σx̄ = σ/√n: the SD of the sample means equals the population SD divided by √n
Central Limit Theorem: the relationship between the sampling distribution of sample means and the population the samples are taken from. If samples of n ≥ 25 are drawn from any population with mean μ and SD σ, then the sampling distribution of sample means approximates a normal distribution; the greater the sample size, the better the approximation. If the population is normally distributed, the sampling distribution of sample means is normally distributed for any sample size n. If the question asks about one individual, use z = (x − μ)/σ instead.
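A small simulation can illustrate the CLT claim: even for a strongly right-skewed population (an exponential with μ = σ = 1, chosen here as an example), means of n = 30 draws center on μ with SD close to σ/√n:

```python
# Sketch: CLT on a skewed (exponential) population. Sample means of
# n = 30 draws should have mean ~ mu = 1 and SD ~ sigma/sqrt(n).
import math
import random

random.seed(2)
n, reps = 30, 20_000
means = [sum(random.expovariate(1.0) for _ in range(n)) / n
         for _ in range(reps)]

grand_mean = sum(means) / reps
sd_means = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / reps)
print(grand_mean, sd_means, 1 / math.sqrt(n))   # last two should be close
```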
One-Sample t-Test
H0: μ = μ0. Ha: μ < μ0, μ ≠ μ0, or μ > μ0. Unknown σ: estimate it with s from the data/sample; t = (x̄ − μ0) / (s/√n) ~ t(n−1).
One-proportion z-test: H0: p = p0; z = (p̂ − p0) / √(p0q0/n).
P-Value:
Ha <: P(t(n−1) < t0); from the t-table, bracket it: # < p-value < #.
Ha >: P(t(n−1) > t0) = 1 − P(t(n−1) < t0); for z: P(z > z0) = 1 − P(z < z0).
Ha ≠: 2·P(t(n−1) > |t0|); from the table: 2·# < p-value < 2·#; for z: 2·P(z > |z0|).
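As an illustration of the one-sample t statistic, the sketch below computes t = (x̄ − μ0)/(s/√n) on an invented sample and compares it with a t-table critical value (2.365 for df = 7, two-sided at the 5% level):

```python
# Sketch: one-sample t statistic computed by hand; the data and
# hypothesized mean mu0 are invented for illustration.
import math
from statistics import mean, stdev

data = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3, 5.2, 4.7]   # hypothetical sample
mu0 = 5.0                                          # H0: mu = 5.0
n = len(data)
t = (mean(data) - mu0) / (stdev(data) / math.sqrt(n))
t_crit = 2.365                # t-table value for df = 7, two-sided 5%
print(round(t, 3), abs(t) > t_crit)   # True would mean reject H0
```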
Conclusion: based on α / CI
P-value < α: reject H0, accept Ha. Sufficient evidence; significant; data unlikely if H0 is true.
P-value > α: fail to reject H0. Not sufficient evidence; not significant; data likely if H0 is true.
Conclusion: based on / CI
P-value < : reject H0, accept Ha Sufficient evidence Significant Unlikely P-value > : fail to reject H0 Not sufficient evidence Not significant Likely CI:
Conclusion: based on / CI
P-value < : reject H0, accept Ha Sufficient evidence Significant Unlikely P-value > : fail to reject H0 Not sufficient evidence Not significant Likely CI:
CI: p̂ ± z*·√(p̂q̂/n)
z*: 90% → 1.645, 95% → 1.960, 99% → 2.576
P-Value: the value on which we base our decision. It says how likely data like ours would be if H0 were true. The ultimate goal of the calculation. Conclusion: reject or fail to reject H0.
Confidence Interval
CI: estimated range of values, calculated from the sample data, that is likely to include the unknown population parameter.
Higher confidence level: have to capture the true value more often, so make the interval wider. Smaller interval (less variability): choose a larger sample.
Estimate ± ME. ME = (critical value) × SE: the extent of the interval on either side of the middle value. o ME < 5% is acceptable
Level of Confidence: probability that the interval estimate contains the population parameter. 95% confidence level: o one can be 95% confident that the population parameter is contained in the interval
Critical Value: number of SEs the interval must stretch out on either side of the middle value.
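A minimal sketch of the estimate ± ME recipe for one proportion, using the z* = 1.960 critical value for 95% confidence (the counts are invented):

```python
# Sketch: 95% CI for a proportion, p_hat +/- z* * sqrt(p_hat*q_hat/n).
import math

successes, n = 120, 400               # invented counts
p_hat = successes / n
z_star = 1.960                        # 95% critical value
se = math.sqrt(p_hat * (1 - p_hat) / n)
me = z_star * se                      # margin of error
ci = (p_hat - me, p_hat + me)
print(ci, me)
```

With these counts the ME is under 5%, which the notes above call acceptable.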
Type I Error (false positive): H0 is rejected when it is true (drew an unusual sample); P(Type I) = α. Healthy person diagnosed with disease. Jury convicts an innocent person. Money is invested in a project that turns out not to be profitable.
Type II Error (false negative): H0 is not rejected (fail to reject) when it is false; P(Type II) = β. Infected person diagnosed healthy. Jury fails to convict a guilty person. Money won't be invested in a project that would have been profitable.
Detecting a false hypothesis: Power (1 − β).
Descriptive Statistics
Categorical Variable: descriptive responses. Quantitative Variable: measure of quantity (units).
Time-series: a variable measured at regular intervals over time; consistent time interval (months, weeks). Cross-sectional data: several variables measured at the same point in time; exact time (every Feb at Starbucks).
Stem-plot: shows the distribution of the data while keeping the specific data points; can calculate mean, quartiles, median, shape. Scatterplot: plots 2 quantitative variables. Histogram: shows the distribution of the data by breaking the range of values of the variable into intervals and displaying the count or proportion of observations that fall into each interval.
Shapes of Distributions: Symmetric, Unimodal, Bimodal, Uniform, Skewed.
Data Collection: from direct observation or produced through experiments. Survey (response rate): personal interview, telephone interview, self-administered questionnaire.
Sampling Plans:
SRS: sample selected so that every possible sample with the same number of observations is equally likely to be selected.
Stratified RS: separating the population into mutually exclusive sets, or strata, and drawing an SRS from each o strata homogeneous within and different from one another (Black vs. White)
Cluster S: SRS of groups or clusters of elements o clusters heterogeneous within and similar to one another (Vancouver vs. Toronto)
Systematic S: sample every kth unit in the population. Multi-stage S: randomly choose clusters and randomly sample individuals within each cluster.
Errors:
Sampling Error: difference between sample and population that exists only because of the observations that happened to be selected for the sample.
Non-sampling Error: due to mistakes made in acquiring data or sample observations being selected improperly (more serious because it cannot be corrected by increasing the sample size)
o Errors in acquiring data: recording incorrect responses
o Nonresponse error: error (or bias) introduced when responses are not obtained from some members of the sample
o Response bias: anything that influences responses
o Voluntary response bias: a large group is invited to respond and all who do are counted
o Selection bias: the sampling plan is such that some members of the target population cannot be selected for inclusion in the sample
o Convenience sampling: include individuals who are convenient
o Under-coverage: some portion of the population is not sampled at all or has smaller representation; answer phrasing also matters
o Measurement errors: inaccurate responses
o Pilot test: small trial run of the study to check that the method is okay
Random Variable: a variable that assigns a numerical result to an outcome of an event that is associated with chance. Discrete: can take only a finite or countably infinite number of values. Continuous: not discrete.
Discrete conditions: 0 ≤ P(x) ≤ 1 and ΣP(x) = 1.
Expected Value (mean of the probability distribution): μ = E(X) = Σ x·P(x). Variance: σ² = Σ (x − μ)²·P(x) = Σ x²·P(x) − μ².
If linear, h(X) = aX + b: E(h(X)) = E(aX + b) = a·E(X) + b; Var(h(X)) = Var(aX + b) = a²·Var(X).
If X and Y are independent random variables: E(X + Y) = E(X) + E(Y); Var(X ± Y) = Var(X) + Var(Y).
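The discrete-distribution and linear-transformation rules above can be verified directly; the distribution used here is an invented example:

```python
# Sketch: E(X), Var(X) for a discrete random variable, plus the rules
# E(aX + b) = aE(X) + b and Var(aX + b) = a^2 Var(X).
xs = [0, 1, 2, 3]
ps = [0.1, 0.2, 0.3, 0.4]            # hypothetical distribution

ex = sum(x * p for x, p in zip(xs, ps))              # E(X) = sum x*P(x)
var = sum((x - ex) ** 2 * p for x, p in zip(xs, ps))

a, b = 2, 5                          # h(X) = 2X + 5
eh = a * ex + b                      # E(aX + b)
varh = a ** 2 * var                  # Var(aX + b)
print(ex, var, eh, varh)
```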
Two-Sample t-Test: H0: μ1 − μ2 = D0
t = ((x̄1 − x̄2) − D0) / √(s1²/n1 + s2²/n2)
df: ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
Pooled (equal variances): sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)
P-Value: Ha <: P(t(df) ≤ t0) = % from the table. Ha >: P(t(df) ≥ t0) = % from the table. Ha ≠: 2 × P(t(df) > |t0|).
Conclusion: P-value < α: reject H0, accept Ha. Sufficient evidence, significant, unlikely. P-value > α: fail to reject H0. Not sufficient evidence, not significant, likely.
Conclusion based on the CI: (#, #) contains the hypothesized value: fail to reject H0. (#, #) doesn't contain the value: reject H0. (−#, #): a two-sample CI containing 0: cannot conclude a difference.
Two-sample CI: (x̄1 − x̄2) ± t*(df)·√(s1²/n1 + s2²/n2); conservative df = min(n1 − 1, n2 − 1).
Two-proportion z-test: z = (p̂1 − p̂2) / √( p̂q̂(1/n1 + 1/n2) ), with pooled p̂ = (x1 + x2)/(n1 + n2).
Two-proportion CI: (p̂1 − p̂2) ± z*·√(p̂1q̂1/n1 + p̂2q̂2/n2).
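A sketch of the two-sample t statistic in the s1²/n1 + s2²/n2 form, with the conservative df = min(n1 − 1, n2 − 1) rule; both samples are invented for illustration:

```python
# Sketch: two-sample t statistic for H0: mu1 - mu2 = 0 and the
# conservative df rule min(n1 - 1, n2 - 1).
import math
from statistics import mean, variance

a = [12.1, 13.4, 11.8, 12.9, 13.0, 12.5]   # invented sample 1
b = [11.2, 11.9, 10.8, 11.5, 11.0, 11.7]   # invented sample 2

se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
t = (mean(a) - mean(b)) / se
df_conservative = min(len(a) - 1, len(b) - 1)
print(round(t, 2), df_conservative)
```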
Level of Confidence: 90% CI of (−#, #): since the 90% CI contains 0, we cannot conclude at the 10% level of significance that the two quantities being compared differ. 90% CI of (#, #): we are 90% confident that the interval (#, #) contains the true mean/proportion for the study.
Test of Independence
Used to determine whether the row and column variables in a two-way contingency table are independent or related. H0: no association between the row and column variables (the 2 categorical variables are independent). Ha: the variables are associated (they aren't independent).
Homogeneity Test
Comparing observed counts from 2 or more populations: examine the samples to see if they have the same proportions of some characteristic. H0: the populations have the same proportion of the characteristic. Ha: at least one of the populations has a different proportion.
r: number of rows; c: number of columns; df = (r − 1)(c − 1)
Expected count = (row total × column total) / grand total
X² = Σ (Observed − Expected)² / Expected
P-value = P(X²(df) > X²0). P-value < α (equivalently X²0 > the table value): reject H0, accept Ha. Sufficient evidence, significant, unlikely. Otherwise: fail to reject H0. Not sufficient evidence, not significant, likely.
Reject H0: we conclude that the variables are not independent; they are associated. Fail to reject H0: there is no evidence of an association.
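The expected-count and X² formulas above, sketched on an invented 2×2 table:

```python
# Sketch: chi-square test of independence on a 2x2 table, using
# expected count = row total * column total / grand total.
observed = [[30, 20],
            [20, 30]]                # invented counts

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand   # expected count
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r-1)(c-1)
print(chi2, df)
```

With df = 1 the 5% table value is 3.841, so a statistic of 4.0 would lead to rejecting H0 at the 5% level.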
Properties of the Correlation Coefficient 1. −1 ≤ r ≤ 1 2. rxy = ryx 3. Positive values = positive correlation; negative values = negative correlation 4. Strong correlation/linear relationship: r closer to −1 or 1 5. Weak correlation/linear relationship: r closer to 0 6. Only used on two quantitative variables 7. Calculated using means, SDs, z-scores 8. Not resistant to outliers
Describing Association 1. Form: is it linear, bell-shaped, curved, a cloud? 2. Direction: positive or negative? 3. Strength: spread
Linear Model: ŷ = b0 + b1x
Slope: b1 = r·(sy/sx): y changes by b1 (y units) per 1 x unit. Gets its sign from the correlation. Gets its units from the ratio of the two SDs, so the units of the slope are a ratio of the units of the variables. Intercept: b0 = ȳ − b1·x̄: the y-value when x = 0; the starting value for predictions.
To find the slope and intercept we need: Correlation (r): tells us the strength of the linear association. Means: tell us where to locate the line. SDs: tell us the units. Predicting: when x is one SD above/below its mean, the predicted y is r SDs above/below its mean: (predicted change in y) = r·(SDy) per SDx.
Correlation to the Line: on a plot of standardized variables, slope = r and intercept = 0. For every SD above/below the mean we are in x, we'll predict that y is r SDs above/below the mean of y.
Residual: e = observed (point) − predicted (line). Does the model make sense? How well does the line fit the data? o How much variation in y does our model explain? → coefficient of determination R²
Negative e: prediction too big (overestimate). Positive e: prediction too small (underestimate). A plot of residuals vs. predicted values should show no patterns, no direction, no shape; mean = 0.
Point Prediction: the value of ŷ obtained by plugging a value x* into the regression equation. o We can only make predictions within the range of our data, not beyond it: that is EXTRAPOLATION (bad).
Coefficient of Determination: measures the proportion of variation in y that is explained by the variation in x. R²: fraction of the data's variation accounted for by the model. "About R²% of the variation in y is explained by the variation in x." 1 − R²: fraction of variation left in the residuals.
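The slope/intercept recipe (b1 = r·sy/sx, b0 = ȳ − b1·x̄) and R² = r² can be sketched directly; the small data set is invented for illustration:

```python
# Sketch: least-squares line via the correlation, b1 = r * sy/sx,
# b0 = ybar - b1 * xbar, and R^2 = r^2.
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]                      # invented data
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar, ybar = mean(x), mean(y)
r = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / ((n - 1) * stdev(x) * stdev(y)))
b1 = r * stdev(y) / stdev(x)
b0 = ybar - b1 * xbar
r_squared = r ** 2
print(round(b1, 3), round(b0, 3), round(r_squared, 4))
```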
Lurking Variable: a variable that is not among the explanatory or response variables but can influence the interpretation of relationships among them. Ex. lung cancer and stained fingernails: smoking is the lurking variable. What is the most realistic value of the SD?
ŷ = predicted y-value
Coefficient of Determination (R²): the amount of the variation in y that is explained by the regression line; the ratio of the explained variation to the total variation: R² = SSR/SST.
Standard Error of Estimate (se): a measure of the differences between the observed sample y-values and the predicted ŷ obtained from the regression equation: se = √(SSE/(n − 2)). Sum of Squares for X (SSxx): the sum of the squares of the deviations of x: Σ(x − x̄)².
ASSESSING THE MODEL
Standard Deviation of the Error Variable (σε): if σε is large, some of the errors will be large, which implies the model's fit is poor. If σε is small, the errors tend to be close to their mean (which is 0), so the model fits well.
Sum of Squares for Regression (SSR): measures the amount of variation in y that is explained by the variation in the independent variable x. Variation in y: SST = SSR + SSE, so SSE is the amount of variation in y that remains unexplained.
The greater the explained variation (the greater the SSR or R²), the better the model.
Regression Equation
Prediction Interval: determines how closely ŷ matches the true value of y for a single observation; predicting an individual value → wider interval.
The confidence interval for the mean response is narrower than the prediction interval because there is less error in estimating a mean value than in predicting an individual value.
Coefficient of Correlation
Data observational; the two variables are bivariate normally distributed. Can test for the linear association between the 2 variables using a t-test: ρ, the population coefficient of correlation, is estimated by the sample coefficient of correlation r.
t = r·√((n − 2)/(1 − r²)), df: ν = n − 2
Conclusion: P-value < α: reject H0, accept Ha. Sufficient evidence, significant, unlikely. P-value > α: fail to reject H0. Not sufficient evidence, not significant, likely.
CI for the slope: b1 ± t*(n−2)·SE(b1), where SE(b1) is the estimated standard error of the slope; t(n−2) distribution, df: ν = n − 2.
Multiple Regressions
y = β0 + β1x1 + β2x2 + … + βkxk + ε, where: k = number of independent variables potentially related to the dependent variable; y = dependent variable; x1, x2, …, xk = independent variables; β0, β1, …, βk = coefficients; ε = error variable. Independent variables may be functions of other variables (e.g. x², x1·x2).
Meaning of a regression coefficient βi: with all other variables held constant, if xi increases by 1, your expected y increases/decreases on average by the coefficient βi.
Adjusted R²: takes into account the sample size and the number of independent variables: Adjusted R² = 1 − (1 − R²)·(n − 1)/(n − k − 1). When k is large relative to n, unadjusted R² may be unrealistically high.
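A quick sketch of the adjusted-R² penalty using invented inputs; when k is close to n, the adjustment collapses an impressive-looking R²:

```python
# Sketch: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# penalizing models that use many predictors relative to n.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.90, n=25, k=3))   # modest penalty
print(adjusted_r2(0.90, n=10, k=8))   # k close to n: heavy penalty
```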
CI and test for each variable: if the P-value < α, we conclude that βi is greater than / smaller than / not equal to 0.
Significance of Each Variable: testing the coefficients with t-tests. T-tests of individual coefficients allow us to determine whether a linear relationship exists between xi and y: H0: βi = 0 (if true, says there is no linear relationship); Ha: βi ≠ 0 (for i = 1, 2, …, k). Using lots of t-tests instead of the F-test to test the validity of the model increases the probability of Type I error.
t = (bi − βi)/SE(bi), df: ν = n − k − 1
Conclusion: P-value < α: reject H0, accept Ha. Sufficient evidence, significant, unlikely. P-value > α: fail to reject H0. Not sufficient evidence, not significant, likely.
F-test for overall model validity: df (numerator): ν = k; df (denominator): ν = n − k − 1. F > F(α; k, n−k−1): reject H0. sε: estimated standard error.
ANOVA: H0: μ1 = μ2 = … = μk (all population means equal). Ha: at least one μj is different (not all the same).
F: the test statistic involves the variation within groups and the variation among groups. If the differences among the sample means are very large relative to the variation within groups, the numerator of the test statistic becomes larger than the denominator. So large values of the test statistic suggest unequal means.
One-way ANOVA table:
Source | df | SS | MS | F
Between groups (treatments) | g − 1 | SSR | MSR = SSR/(g − 1) | F = MSR/MSE
Within groups (error) | n − g | SSE | MSE = SSE/(n − g) |
Total | n − 1 | SST = SSR + SSE | |
Significance F: P-Value = P(F(g−1, n−g) > F0); P-value < α: reject H0. g = number of groups (means); n = total number of observations.
SSR: sum of squares between groups (treatments); represents the variation between the means of the groups. SSE: sum of squares within groups (error); represents the variation within a group due to random error. SSTotal: total sum of squares; represents the total variation among all data points and equals the sum of squares between groups plus the sum of squares within groups.
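The sums-of-squares decomposition above, sketched on three invented groups; note that SST = SSR + SSE and F = MSR/MSE:

```python
# Sketch: one-way ANOVA F statistic from SSR (between groups) and
# SSE (within groups).
from statistics import mean

groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]   # invented; g = 3 groups
n = sum(len(g) for g in groups)
grand = mean([x for g in groups for x in g])

ssr = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)   # between
sse = sum((x - mean(g)) ** 2 for g in groups for x in g)     # within
g_count = len(groups)
msr = ssr / (g_count - 1)
mse = sse / (n - g_count)
f_stat = msr / mse
print(ssr, sse, f_stat)
```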
Regression Statistics (software output): Multiple R | R Square | Adjusted R Square | Standard Error | Observations
ANOVA section: Regression (df = k) | Residual (df = n − k − 1) | Total (df = n − 1), each with SS and MS; F = MSR/MSE and Significance F.
Coefficients table (Intercept, X Variable): Coefficients | Standard Error | t Stat | P-value = 2 × P(t(n−2) > |t Stat|) | Lower 95% | Upper 95%
Conditions + Assumptions
Regression Lines (Correlation): Quantitative Variables Condition; Linearity Condition; Outlier Condition; Equal Spread Condition: checking that the spread is about the same throughout.
Model for the Sampling Distribution of Proportions (68-95-99.7): Independence Assumption; Sample Size Assumption: the sample size n must be large enough
o Randomization Condition
o 10% Condition: n < 10% of the population
o Success/Failure Condition: np ≥ 10 and nq ≥ 10
Model for the Sampling Distribution of Means (z-score): Independence Assumption o Randomization Condition; Sample Size Assumption o 10% Condition o Large Enough Sample Condition: depends on the shape of the original data distribution
Confidence Intervals for One Proportion (one-proportion z-interval, one-proportion z-test): Independence Assumption o Randomization Condition o 10% Condition; Sample Size Assumption o inference relies on the CLT o need a large enough sample o Success/Failure Condition: np̂ ≥ 10 and nq̂ ≥ 10
Sampling Distribution for a Mean (t-score): Independence Assumption o Randomization Condition o 10% Condition; Normal Population Assumption: Student's t-model won't work for data that are badly skewed
o Nearly Normal Condition: n < 15: data should follow a normal model; 15 < n < 40: t-methods work well as long as the data are unimodal and symmetric; n > 40: t-methods are safe to use unless the data are very skewed, and can be used even for very skewed data if n is large enough, because the sampling distribution is close to Normal
ANOVA 1. All populations are normally distributed 2. The population variances are equal 3. The observations are independent of one another
Multiple Regression: required conditions for the error variable ε 1. The probability distribution of ε is normal 2. The mean of ε is 0 3. The standard deviation of ε is σε, which is constant for each value of x 4. The errors are independent
Simple Linear Regression (inference): required conditions for the error variable ε 1. The probability distribution of ε is normal 2. The mean of the distribution is 0; that is, E(ε) = 0 3. The standard deviation of ε is σε, which is constant regardless of the value of x 4. The ε associated with any particular value of y is independent of the ε associated with any other value of y
Chi-Square Test 1. Expected cell frequency condition: all expected cell counts are at least 5, so X² is reliable
Comparing Two Means 1. Independence Assumption: Randomization, 10% 2. Normal Population Assumption a. n < 15: do not use Student's t if skewed b. 15 ≤ n ≤ 40: okay if mildly skewed c. n > 40: CLT works unless the data are very skewed 3. Independent Groups Assumption a. two independent samples
Paired t-Test 1. Paired data assumption 2. Independence assumption