
Statistics Glossary

A
Acceptance Error, Beta Error, Type II Error - An error made by wrongly accepting the null
hypothesis when the null is really false.
Acceptance Region - Opposite of the Rejection Region; it is better to call this the "Fail to Reject
Region." In a two-tailed hypothesis t-test, if the test statistic falls between -t critical and
t critical, then we fail to reject the null hypothesis.

Adjusted R-Squared, R-Squared Adjusted - A version of R-Squared that has been adjusted for
the number of predictors in the model. R-Squared tends to overestimate the strength of the
association, especially if the model has more than one independent variable.
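The adjustment is a simple function of R-Squared, the sample size, and the number of predictors. A minimal sketch in Python, using hypothetical values for illustration:

# Adjusted R-squared from R-squared, n observations, and p predictors.
# The values below are hypothetical, for illustration only.
r_sq = 0.85
n = 30   # number of observations
p = 4    # number of independent variables
adj_r_sq = 1 - (1 - r_sq) * (n - 1) / (n - p - 1)
print(round(adj_r_sq, 3))  # 0.826, slightly below the unadjusted 0.85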
Alpha (α), Chosen Significance Level - The maximum amount of chance a researcher is
willing to take of rejecting a null hypothesis that is true (Type I Error).
Alpha Error, Type I Error - An error made by wrongly rejecting the null hypothesis when the
null is really true.
Alternative Hypothesis, Research Hypothesis - A hypothesis that does not conform to the one
being tested, usually the opposite of the null hypothesis. Symbolized H1 or Ha.
Analysis of Variance (ANOVA) - A test of differences between mean scores of two or more
groups with one or more variables.
Approximation Curve, Curve Fitting - The general method for using a line or curve to estimate
the relationship between two associated numerical variables.
Autocorrelation - This occurs when later variables in a time series are correlated with earlier
variables.
B
Backward Elimination - A method of determining the regression equation that starts with a
model including all independent variables and then removes variables that are not useful, one at
a time.

Best Subsets Regression - A method of determining the regression equation used with statistical
computer applications that allows the user to run multiple regression models using a specified
number of independent variables. The computer will sort through all of the models and display
the "best" subsets of all the models that were run. "Best" is typically identified by the highest
value of R-squared. Other diagnostic statistics such as R-square adjusted and Cp are also
displayed to help the user determine their best choice of a model.
Bell-Shaped Curve - A symmetrical curve. Looks like the cross-section of a bell.
Best Fit, Goodness of Fit - The model, among those considered, that best describes the given data.
Beta Error, Acceptance Error, Type II Error - An error made by wrongly accepting the null
hypothesis when the null is really false.
Bivariate Association/ Relationship - The relationship between two variables only.
C
Cp Statistic - Cp measures the differences of a fitted regression model from a true model, along
with the random error. When a regression model with p independent variables contains only
random differences from a true model, the average value of Cp is (p+1), the number of
parameters. Thus, in evaluating many alternative regression models, our goal is to find models
whose Cp is close to or below (p+1).
Centering - Takes the difference between each observation and the mean of the variable.
Cook's Distance: Cook's distance combines leverages and studentized residuals into one overall
measure of how unusual the predictor values and response are for each observation. Large values
signify unusual observations. Geometrically, Cook's distance is a measure of the distance
between coefficients calculated with and without the ith observation. Cook and Weisberg suggest
checking observations with Cook's distance > F(0.50, p, n-p), where F is a value from an F-distribution.
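A sketch of computing Cook's distances with statsmodels (an assumed tool choice; the data are hypothetical):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 + 3 * x + rng.normal(size=50)
results = sm.OLS(y, sm.add_constant(x)).fit()

cooks_d, _ = OLSInfluence(results).cooks_distance  # one value per observation
print(cooks_d[:5])  # large values flag unusual observations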
Coefficient of Determination - In general, the coefficient of determination measures the amount
of variation of the response variable that is explained by the predictor variable(s). The coefficient
of simple determination is denoted by r-squared and the coefficient of multiple determination is
denoted by R-squared.
Coefficient of Variation - The coefficient of variation, in regression, is the standard deviation of
the predictor variable divided by the mean of the predictor variable. If this value is small, the
variation in the predictor values is small relative to their magnitude, which implies that the data
are ill-conditioned.
Confidence Bands (Upper & Lower) - This is the range of the responses that can be expected
for all of the appropriate inputs of X's. The upper confidence band is the highest value that ŷ is
predicted to be; the lower confidence band is the lowest value that ŷ is predicted to be.
Confidence Level - This is the amount of error allowed for the model (given as a percent, or as alpha).
Confidence Intervals - A range of values used to estimate the value of a population parameter,
together with the amount of confidence the researcher has in the estimate. For example, we might
estimate the cost of a new space vehicle to be 35 million dollars. Assume that the confidence
level is 95% and the margin of error is 5 million dollars. We say that we are 95% confident that
the cost is between 30 and 40 million dollars.
Confidence Interval Bounds, Upper and Lower - The lower endpoint on a confidence interval
is called the lower bound or lower limit. The lower bound is the point estimate minus the margin
of error. The upper bound is the point estimate plus the margin of error.
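Since the bounds are just the point estimate plus or minus the margin of error, the space-vehicle example above works out as follows (a minimal sketch in Python):

point_estimate = 35   # millions of dollars
margin_of_error = 5
lower_bound = point_estimate - margin_of_error  # 30
upper_bound = point_estimate + margin_of_error  # 40
print(lower_bound, upper_bound)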

Correlation - The amount of association between two or more items. In these tutorials,
correlation will refer to the amount of association between two or more numerical variables.
Correlation Coefficients, Pearson's Sample Correlation Coefficient, r - Measures the strength
of linear association between two numerical variables.
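A quick way to compute r, sketched in Python with NumPy on hypothetical data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to 1: a strong positive linear association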

Correlation Matrix - A table that shows the correlation coefficients for all pairs of variables in a
set.
Correlation Ratio - A kind of correlation used when the relation between two variables is
assumed to be curvilinear (i.e., not linear).
Criterion Variable - Another term for the dependent variable.
Curve Fitting, Approximation Curve - The general method for using a line or curve to estimate
the relationship between two associated numerical variables.
D
Degrees of Freedom, df - The number of values that can vary independently of one another. For
example, if you have a sample of size n that is used to evaluate one parameter, then there are n-1
degrees of freedom.
Dependent Variable, Response Variable, Output Variable - The variable in correlation or
regression that cannot be controlled or manipulated. The variable that "depends" on the values of
one or more variables. In math, y frequently represents the dependent variable.
DFITS, DFFITS: Combines leverage and studentized residuals (deleted t residuals) into one
overall measure of how unusual an observation is. DFFITS is the difference between the fitted
values calculated with and without the ith observation, scaled by its estimated standard deviation.
Belsley, Kuh, and Welsch suggest that observations with |DFFITS| > 2*sqrt(p/n) should be
considered unusual.
Direct Correlation, Positive Correlation, Direct Relationship, Positive Relationship - A
relationship between two variables (x, y) such that as x increases, y increases, and as x decreases,
y decreases. As one variable increases, so does the other.

Dummy Variable, Indicator Variable - A variable used to code the categories of a measurement.
Usually, 1 indicates the presence of an attribute and 0 indicates the absence of an attribute.
Example: If the measurement variable is the cost of a space flight vehicle, then the vehicle might
be manned or unmanned. Let the dummy variable be 1 if the vehicle is manned and 0 if it is
unmanned. Note: Dummy variable coding can be used for more than 2 categories.
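A sketch of this 1/0 coding with pandas (an assumed tool choice; the data are hypothetical):

import pandas as pd

df = pd.DataFrame({"vehicle": ["manned", "unmanned", "manned"]})
# 1 indicates the presence of the attribute (manned), 0 its absence.
df["manned"] = (df["vehicle"] == "manned").astype(int)
print(df)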
E
Efficiency, Efficient Estimator - It is a measure of the variance of an estimate's sampling
distribution; the smaller the variance, the better the estimator.
Error - In general, the error is the difference between the observed and estimated value of a parameter.
Error, Measurement (Measurement Error) - Inaccurate results due to flaw(s) in the measuring
instrument.
Errors, Residuals - In regression analysis, the error is the difference between the observed Y
values and the Y values predicted by the regression model.

Error, Specification (Specification Error) - A mistake made when specifying which model to
use in the regression analysis. A common specification error involves including an irrelevant
variable or leaving out an important variable.
F
F (F test statistic) - This is the test statistic used when conducting an analysis of variance.
Fits, Fitted Values, Predicted Values - The Fits are the predicted values found by substituting
the original values for the independent variable(s) into the regression equation. The name "fit"
refers to how well the observed data matches the relationship specified in the model.
Forward Selection - A frequently available option of statistical software applications. A method
of determining the regression equation by adding variables to the regression equation until the
addition of new variables does not appear to be worthwhile.
F-test: An F-test is usually a ratio of two numbers, where each number estimates a variance. An
F-test is used to test the equality of the variances of two populations. An F-test is also used in
analysis of variance, where it tests the hypothesis of equality of means for two or more groups.
For instance, in an ANOVA test, the F statistic is usually a ratio of the Mean Square for the effect
of interest and the Mean Square Error. The F-statistic is very large when the MS for the factor is
much larger than the MS for error. In such cases, reject the null hypothesis that the group means
are equal. The p-value helps to determine the statistical significance of the F-statistic.
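A minimal one-way ANOVA sketch using SciPy, with three hypothetical groups:

from scipy import stats

group_a = [23, 25, 28, 30, 27]
group_b = [31, 33, 35, 30, 34]
group_c = [22, 24, 21, 26, 23]
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # a small p-value argues against equal group means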

G
General Linear Model (GLM) - A full range of methods used to study linear relations between
one continuous dependent variable and one or more independent variables, whether continuous or
categorical. General means the kind of variable is not specified. Examples include Regression
and ANOVA.
H
Heteroscedasticity - Non-constant error variance. Hetero = different; scedasticity = tendency to
scatter.
Hierarchical Regression Analysis - A multiple regression analysis method in which the
researcher, not a computer program, determines the order in which the variables are entered into
and removed from the regression equation. Perhaps the researcher has experience that leads
him/her to believe certain variables should be included in the model and in what order.
Homoscedasticity - Constant error variance. Homo = same; scedasticity = tendency to scatter.
Hypothesis Testing - This is the common approach to determining the statistical significance of
findings.
I
Independent Variable, Explanatory Variable, Predictor Variable, Input Variable - The
variable in correlation or regression that can be controlled or manipulated. In math, x frequently
represents the independent variable.
Influential Observation - An observation that has a large effect on the regression equation. Note:
Outliers and leverage points may be influential observations, but influential observations are
usually outliers and leverage points.
Inverse Relationship, Inverse Correlation, Negative Correlation, Negative Relationship - A
relationship between two variables (x, y) such that as x increases, y decreases (or vice versa).

Intercorrelation - Correlation between variables that are all independent (no dependent variables
involved).
L
Least Squares Regression - A regression analysis method that minimizes the sum of the squared
errors as the criterion for fitting the data. This can refer to linear or curvilinear regression.
Leverages, Leverage Points - An extreme value in the independent (explanatory) variable(s).
Compare with an outlier, which is an extreme value in the dependent (response) variable.
Line of Best Fit - See Regression Line.

Linear Correlation - A relationship between the independent and dependent data that, when
plotted, forms a straight line.
Linear Regression - Typically, when regression is used without qualification, linear regression is
assumed. This is the method of finding a linear model for the dependent variable based on the
independent variable(s).
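A sketch of fitting a simple linear model with SciPy on hypothetical data:

from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 5.9, 8.2, 9.9]
result = stats.linregress(x, y)
print(result.intercept, result.slope)  # the fitted linear model y = a + bx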
Linear Trend - The appearance that the data have a linear relationship when plotted.
M
Mean Square Residual, Mean Square Error (MSE) - A measure of variability of the data
around the regression line or surface.
Measurement Error (Error, Measurement) - Inaccurate results due to flaw(s) in the measuring
instrument.
Multicollinearity, Collinearity - The case when two or more independent variables are highly
correlated. The occurrence of multicollinearity can cause difficulties in multiple regression. If the
independent variables are interrelated, then it may be difficult or impossible to find the specific
effect of only one independent variable.
Multiple Correlation Coefficient, R - A measure of the amount of correlation between more
than two variables. As in multiple regression, one variable is the dependent variable and the
others are independent variables. The positive square root of R-squared.

Multiple Correlation - Correlation with one dependent variable and two or more independent
variables. Measures the combined influence of the independent variables on the dependent
variable. R^2 gives the proportion of the variance in the dependent variable that can be explained
by the action of all the independent variables taken together.
Multiple Correlation Matrices - A table of correlation coefficients that shows all pairs of
correlations of all the parameters within the sample.
Multiple Correlation Plots - A collection of scatterplots showing the relationship between the
variables of interest.

Multiple R - That is the name MS Excel uses for the Multiple Correlation Coefficient, R.

Multiple Regression, Multiple Linear Regression - A method of regression analysis that uses
more than one independent (explanatory) variable to predict a single dependent (response)
variable. Note: The coefficient for any particular explanatory variable is an estimate of the effect
that variable has on the response variable while holding constant the effects of the other predictor
variables. Multiple means two or more independent variables. Unless specified otherwise,
Multiple Regression generally refers to Linear Multiple Regression.
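A sketch of a multiple linear regression with statsmodels (an assumed tool choice) on hypothetical data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))  # two independent variables
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.params)  # intercept plus one partial coefficient per predictor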
Multiple Regression Analysis (MRA) - Statistical methods for evaluating the effects of more
than one independent variable on one dependent variable.
N
Negative Correlation- This occurs whenever the independent variable increases and the
dependent variable decreases. This is also called a negative relationship.

Nonadditivity - A statement used to describe a relation in which the separate effects do not add
up to the total effect.
Nonlinearity - The effects are not proportional to their causes.
Nonlinear Relationship - A relationship between two variables for which the points in the
corresponding scatterplot do not fall in approximately a straight line. Nonlinearity may occur
because there is no defined relationship between the variables, or because there is a specific
curvilinear relationship (for example, a parabola).

Normality Plot, Normal Probability Plot - A graphical representation of a data set used to
determine if the sample represents an approximately normal population. The sample data are
plotted on the x-axis against the probability of the occurrence of each value, assuming a normal
distribution, on the y-axis. If the resulting graph is approximately a straight line, then the
distribution is approximately normal. There are statistical hypothesis tests for
normality as well.
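A sketch of drawing such a plot with SciPy and matplotlib, using a hypothetical sample:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sample = np.random.default_rng(2).normal(loc=10, scale=2, size=100)
stats.probplot(sample, dist="norm", plot=plt)
plt.show()  # an approximately straight line suggests approximate normality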

Null Hypothesis, H0 - This is the hypothesis that two or more variables are not related, which the
researcher typically wants to reject.
O
Outlier - An extreme value in the dependent (response) variable. Compare with a leverage point,
which is an extreme value in the independent (explanatory) variables.
P
Parameter - Generally, it is either a boundary or limit, or an element or characteristic.
Partial Correlation - Correlation between two variables given that the linear effect of one or
more other variables has been controlled. Example: r12.3 is the correlation of variables one and
two given that variable three has been controlled.
Partial Correlation Coefficients - This is the square root of a coefficient of partial
determination. It is given the same sign as that of the corresponding regression coefficient in the
fitted regression function.
Partial Determination Coefficients - This measures the marginal contribution of one X variable
when all others are already included in the model. In contrast, the coefficient of multiple
determination, R^2, measures the proportional reduction in the variation of Y achieved by the
introduction of the entire set of X variables considered in the model.
Partial Regression Coefficient, Partials - In a multiple regression equation, the coefficients of
the independent variables are called partial regression coefficients because each coefficient tells
only how the dependent variable varies with the selected independent variable, holding the other
independent variables constant.
Partial Slope Coefficient - See Partial Regression Coefficient.
Pearson's Sample Correlation Coefficient, r - Measures the strength of linear association
between two numerical variables.
Population - A group of people that one wishes to describe or generalize about.
Predictor Variable, Independent Variable, Explanatory Variable, Input Variable - The
variable in correlation or regression that can be controlled or manipulated. In math, x frequently
represents the independent variable.

Prediction Equation - An equation that predicts the value of one variable on the basis of
knowing the value of one or more other variables. Note: Formally, a prediction equation is a
regression equation that does not include an error term.
Prediction Interval - In regression analysis, a range of values that estimates the value of the
dependent variable for given values of one or more independent variables. Comparing prediction
intervals with confidence intervals: prediction intervals estimate a random value (a new
observation), while confidence intervals estimate population parameters.
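A sketch of the contrast with statsmodels (an assumed tool choice; hypothetical data): the mean_ci_* columns give the confidence interval for the mean response, and the wider obs_ci_* columns give the prediction interval for a new observation.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 5 + 2 * x + rng.normal(size=50)
model = sm.OLS(y, sm.add_constant(x)).fit()

frame = model.get_prediction(sm.add_constant(x)).summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]].head())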

Population Parameter, Parameter - A measurement used to quantify a characteristic of the
population. Even when the word population is not used with parameter, the term refers to the
population. Example: The population mean is a measure of central tendency of the population.
The population parameter is usually unknown. (See Sample Statistic.)
Proportional Reduction of Error (PRE) - A measure of association that calculates how much
you can reduce your error in the prediction of y if you know x, compared with when you do not
know x. Pearson's r is not a PRE measure, but r-squared is.
Positive Correlation - This relationship occurs whenever the dependent variable increases as the
independent variable increases.

P-values, Observed Significance Level - The probability, computed assuming the null
hypothesis is true, of getting a data set like the one observed or one more extreme in the direction
of the alternative.

R
r, Correlation Coefficient, Pearson's r - Measures the strength of linear association between
two numerical variables.
R, Coefficient of Multiple Correlation - A measure of the amount of correlation between more
than two variables. As in multiple regression, one variable is the dependent variable and the
others are independent variables. The positive square root of R-squared.
r2, r-squared (r-sq.), Coefficient of Simple Determination - The percent of the variance in the
dependent variable that can be explained by the independent variable.
R-squared, Coefficient of Multiple Determination - The percent of the variance in the
dependent variable that can be explained by all of the independent variables taken together.
R-Squared Adjusted (R-sq. adj.), Adjusted R-Squared - A version of R-Squared that has been
adjusted for the number of predictors in the model. R-Squared tends to overestimate the strength
of the association, especially if the model has more than one independent variable.
Range of Predictability, Region of Predictability - The range of independent variable(s) for
which the regression model is considered to be a good predictor of the dependent variable. For
example, suppose you want to predict the cost of a new space vehicle subsystem based on its
weight, and all of the input subsystem weights range from 100 to 200 pounds. You could not
expect the resulting model to provide good predictions for a subsystem that weighs 3000 pounds.
Regression Analysis, Statistical Regression, Regression - Methods of establishing an equation
to explain or predict the variability of a dependent variable using information about one or more
independent variables. The equation is often represented by a regression line, which is the straight
line that comes closest to approximating a distribution of points in a scatter plot. When
"regression" is used without any qualification it refers to linear regression.
Regression Artifact, Regression Effect - An artificial result due to statistical regression or
regression toward the mean.
Regression Curve - The curve that represents the regression model.
Regression Coefficient, Regression Weight - In a regression equation the number in front of an
independent variable. For example, if the regression equation is Y = mx + b then m is the
regression coefficient of the x-variable. The regression coefficient estimates the effect of the
independent variable(s) on the dependent variable. (Compare with Partial Regression
Coefficients)
Regression Constant - Unless specified otherwise, the regression constant is the intercept in the
regression equation.
Regression Equation - An algebraic equation that models the relationship between two (or more)
variables. If the equation is Y = a + bX + e, then Y is the dependent variable, X is the independent
variable, b is the coefficient of X, a is the intercept, and e is the error term. (See Prediction
Equation.)
Regression Line, Trend Line - When the best fitting regression model is a straight line, that line
is called a regression line. Ordinary Least Squares method is usually used for computing the
regression line.

Regression Model - An equation used to describe the relationship between a continuous
dependent variable, an independent variable or variables, and an error term.
Regression Plane - When the regression model has two independent variables, a plane represents
the relationship between the variables in three dimensions. Example: z = a + bx + cy

Regression Plot - A scatterplot with the regression curve drawn on the same graph.
Regression SS (also SSR or SSregression) - The sum of squares that is explained by the
regression equation. Analogous to between-groups sum of squares in analysis of variance.
Regression Toward the Mean - The type of bias described by Francis Galton, a 19th-century
researcher. A tendency for those who score high on any measure to get somewhat lower scores on
a subsequent measure of the same thing, or, conversely, for someone who has scored very low on
some measure to get a somewhat higher score the next time the same thing is measured. Knowing
how much regression toward the mean there is for a particular pair of variables gives you a
prediction. If there is very little regression, you can predict quite well. If there is a great deal of
regression, you can predict poorly if at all. (See Vogt, page 240.)
Regression Weight, Regression Coefficient - In a regression equation the number in front of an
independent variable. For example, if the regression equation is Y = mx + b then m is the
regression coefficient of the x-variable. The regression coefficient estimates the effect of the
independent variable(s) on the dependent variable. (Compare with Partial Regression
Coefficients)

Regress On - The dependent variable is regressed on the independent variable(s). We will
regress the cost of the space vehicle on the weight of the vehicle. If x predicts y, then y is
regressed on x. (i.e., regress the dependent variable on the independent; the response variable is
regressed on the explanatory variable.)
Rejection Region - The area in the tail(s) of the sampling distribution for a test statistic. If the
test statistic falls in the rejection region, the null hypothesis is rejected.

Residuals, Errors - The amount of variation on the dependent variable not explained by the
independent variable.
Response Variable - Same as the dependent variable.
Robust - Said of a statistic that remains useful even when one or more of the assumptions is
violated.
S
Sample - A group of subjects selected from a larger group, the population.
Sample Statistic, Statistic - A measurement used to quantify a characteristic of the sample. Even
when the word sample is not used, the term statistic refers to the sample. Example: The sample
mean is a measure of central tendency of the sample. (See Population Parameter.)
Sampling Error, Sampling Variability, Random Error - The expected difference between the
sample statistic and the population parameter.
Sampling Distribution - All possible values of a statistic and their probabilities of occurring for
a sample of a particular size.
Scaling - Expresses the centered observations in units of the standard deviation of the
observations.
Scatter Diagram, Scattergram, Scatter Plot - The pattern of points due to plotting two
variables on a graph.

Significance - The degree to which a researcher's finding is meaningful or important. (See
statistical significance and practical significance.)
Significance Level - There are two types of significance levels: the chosen significance level
(alpha) and the observed significance level (the p-value). The lower the p-value relative to the
chosen alpha level, the greater the statistical significance.
Simple Correlation - Correlation between only two variables.
Simple Linear Correlation - Correlation that describes a linear relationship.
Simple Linear Regression - A form of regression analysis, which has only one independent
variable.
Slope - The rate at which the line or curve rises or falls when covering a given horizontal
distance.
Spearman Correlation Coefficient (rho), Rank-Difference Correlation, rs - A statistical
measure of the amount of monotonic relationship between two variables that are arranged in rank
order.
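A sketch with SciPy on hypothetical data whose relationship is monotonic but not linear:

from scipy import stats

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but curved
rho, p_value = stats.spearmanr(x, y)
print(rho)  # 1.0: a perfect rank-order (monotonic) relationship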
Specification Error (Error, Specification) - A mistake made when specifying which model to
use in the regression analysis. A common specification error involves including an irrelevant
variable or leaving out an important variable.
Standard Deviation - A statistic that measures the spread of the data; it is the square root of the
average squared distance of the data points from the mean.

s = sqrt( Σ(x − x̄)² / (n − 1) ) for a sample

σ = sqrt( Σ(x − μ)² / N ) for a population
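In NumPy the two versions differ only in the ddof argument (a minimal sketch on hypothetical data):

import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 7.0])
s = np.std(data, ddof=1)      # sample: divides by n - 1
sigma = np.std(data, ddof=0)  # population: divides by N
print(s, sigma)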

Standardized Measure of Scale - Any statistic that allows comparisons between things
measured on different scales. Examples: percents, standard deviations, and z-scores.
Standardized Regression Coefficient - Regression coefficients that have been standardized in
order to make comparisons between the regression coefficients easier. This is particularly helpful
when different independent variables have different units.
Standardized Regression Model - This is the regression model used after centering and scaling
of the dependent variable and independent variables.
Standardized Residuals - Standardized residuals are of the form (residual) / (square root of the
Mean Square Error). Standardized residuals have variance 1. If a standardized residual is larger
than 2, it is usually considered large.
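A sketch of computing them by hand for a simple regression, on hypothetical data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.8, 8.3, 9.6])
b, a = np.polyfit(x, y, 1)                 # slope and intercept
residuals = y - (a + b * x)
mse = np.sum(residuals**2) / (len(x) - 2)  # n - 2 df in simple regression
standardized = residuals / np.sqrt(mse)
print(standardized)  # values beyond about +/-2 are usually considered large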

Standard Error, Standard Error of the Regression, Standard Error of the Mean, Standard
Error of the Estimate - In regression, the standard error of the estimate is the standard deviation
of the observed y-values about the predicted y-values. In general, the standard error is a measure
of sampling error: it refers to error in estimates resulting from random fluctuations in samples,
and it is the standard deviation of the sampling distribution of a statistic. Typically, the smaller
the standard error, the better the sample statistic estimates the population parameter. As N goes
up, the standard error goes down.
Statistical Significance - Statistical significance does not necessarily mean that the result is
clinically or practically important. For example, a clinical trial might result in a statistically
significant finding (at the 5% level) that the average cholesterol rating for people taking drug A is
lower than that for those taking drug B. However, drug A may lower cholesterol by only 2 units
more than drug B, which is probably not a difference that is clinically important to the people
taking the drug. Note: Large sample sizes can lead to results that are statistically significant but
would otherwise be considered inconsequential.

Stepwise Regression - A method of regression analysis where independent variables are added
and removed in order to find the best model. Stepwise regression combines the methods of
backward elimination and forward selection. (See also Hierarchical Regression Analysis.)
Strength of Association, Strength of Effect Index - The degree of relationship between two (or
more) variables. One example is R-squared, which measures the proportion of variability in a
dependent variable explained by the independent variable(s).
Studentized Residuals: The studentized residual has the form (error) / (standard deviation of the
error). Studentized residuals have constant variance when the model is appropriate.
T
Transformations - This is a method of changing all the values of a variable by using some
mathematical operation.
Trend Line - This is a line representing the movement of the values of a variable in one direction
over a period of time.
U
Unbiased Estimator - A sample statistic that is free from systematic bias.
V
Variance Inflation Factor (VIF) - A statistic used to measure the possible collinearity of the
explanatory variables.
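A sketch of computing VIFs with statsmodels on hypothetical, nearly collinear predictors:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))

vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)  # large values (often > 5 or 10) suggest collinearity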
W
Weighted Least Squares - A method of regression used to take non-constant variance into
account. The variables are multiplied by particular numbers (weights). It is typical to choose
weights that are the inverse of the pure error variance in the response. (Minitab, page 2-7.) This
choice gives observations with large variances relatively small weights and vice versa.
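A sketch with statsmodels, assuming (for illustration) that the error variance grows with x, so the weights are taken as its inverse:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = np.linspace(1, 10, 50)
y = 3 + 2 * x + rng.normal(scale=x)  # error variance grows with x
weights = 1.0 / x**2                 # inverse of the assumed error variance
model = sm.WLS(y, sm.add_constant(x), weights=weights).fit()
print(model.params)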
Y
Y-intercept - The point where a regression line intersects the y-axis.