Data: facts and figures collected, analyzed and summarized for presentation and interpretation
Element: entity on which data are collected (usually first column)
Variable: characteristic of interest
Observation: set of measurements obtained for a particular element (# of obs = # of elements)
Total number of data items = # of observations x # of variables
Scales of measurement: determine amount of info in data and most appropriate summarization
and statistical analyses
Existing sources: company databases (employees and customers); the Internet has made
existing data sources even more accessible and important
Statistical studies:
o Experimental: the variable of interest is first identified; one or more other variables are
identified and controlled to learn how they influence the variable of interest
o Nonexperimental/observational: no attempt is made to control the variables. Example: a survey
Existing sources may be preferred because of the cost and time required for studies. The cost of data
acquisition and analysis should not exceed the savings generated by using the info.
Descriptive statistics: summaries of data (tabular, graphical, numerical)
Population: set of all elements of interest
Sample: subset of population
Census: survey that collects data for the entire population
Sample survey: survey that collects data for a sample
Statistical inference: uses data from a sample to make estimates and test hypotheses about characteristics of a population
Data warehousing: capture, store and maintain data
Data mining: methods for developing useful decision-making info from large databases;
technology that relies heavily on statistical methodology; stress on automated and predictive;
limited ability to uncover and identify causal relationships
CHAPTER 2: TABULAR AND GRAPHICAL PRESENTATIONS
Frequency distribution: tabular summary of data showing the number (frequency) of items in each
of several nonoverlapping classes
Relative frequency distribution: shows proportion of items = Frequency of the class / n
Percent frequency distribution: same as relative x 100%
Bar chart: graphical presentation for data in frequency, relative freq or percent freq distribution
Pie chart: graphical presentation for data in relative frequency or percent frequency distributions
Dot plot: simplest graphical summary of data
Histogram: graphical presentation of quantitative data
Often number of classes = number of categories found in data; sum of frequencies = number of
observations; Sum of relative frequencies = 1; sum of percent frequencies =100
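A minimal sketch of the three distributions above and their sums. The purchase data and the names `purchases`, `frequency`, `relative` and `percent` are made up for illustration.

```python
from collections import Counter

# Hypothetical categorical data (made up): one drink purchase per observation
purchases = ["Cola", "Lemon", "Cola", "Orange", "Cola",
             "Lemon", "Cola", "Orange", "Lemon", "Cola"]

n = len(purchases)
frequency = Counter(purchases)                        # number of items in each class
relative = {k: v / n for k, v in frequency.items()}   # frequency of the class / n
percent = {k: 100 * r for k, r in relative.items()}   # relative frequency x 100

for label in frequency:
    print(f"{label:8s} {frequency[label]:3d} {relative[label]:6.2f} {percent[label]:6.1f}%")

# Sums: frequencies -> n, relative frequencies -> 1, percent frequencies -> 100
print(sum(frequency.values()), sum(relative.values()), sum(percent.values()))
```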
Determine number of nonoverlapping classes: generally 5 to 20 classes.
Width of classes: same for each class; Approx class width = (largest - smallest)/number of classes; in
practice determined by trial and error
Class limits: chosen so that each data item belongs to one and only one class
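The steps above (number of classes, class width, class limits) can be sketched as follows. The data values, the choice of 5 classes, and rounding the width up to a whole number are assumptions made for illustration; in practice the width is a judgment call.

```python
import math

# Illustrative integer data (assumed, not taken from the notes)
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
num_classes = 5  # assumed choice within the usual 5 to 20 range

# Approx class width = (largest - smallest) / number of classes
approx_width = (max(data) - min(data)) / num_classes
width = math.ceil(approx_width)  # rounded up to a convenient whole number

# Class limits for integer data: each value falls in exactly one class
lower = min(data)
classes = [(lower + i * width, lower + (i + 1) * width - 1)
           for i in range(num_classes)]

frequency = {limits: sum(limits[0] <= x <= limits[1] for x in data)
             for limits in classes}

for (lo, hi), count in frequency.items():
    print(f"{lo}-{hi}: {count}")
```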
Cumulative frequency distribution: tabular summary of quantitative data; shows number of data
less than or equal to the upper class limit of each class; last entry equals total number of
observations
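A short sketch of a cumulative frequency distribution built from class frequencies. The upper class limits and frequencies are assumed values for illustration.

```python
from itertools import accumulate

# Assumed class summary: upper class limits and class frequencies
upper_limits = [16, 21, 26, 31, 36]
frequencies = [7, 7, 3, 2, 1]

# Running total: number of data items <= each upper class limit
cumulative = list(accumulate(frequencies))

for limit, count in zip(upper_limits, cumulative):
    print(f"<= {limit}: {count}")

# The last entry equals the total number of observations
print(cumulative[-1] == sum(frequencies))
```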
Exploratory data analysis: simple arithmetic and easy-to-draw graphs (e.g. stem and leaf) used to summarize data quickly
Stem and leaf: easier to construct than a histogram; gives more info because it shows the actual data
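A minimal sketch of a stem-and-leaf display for two-digit values, using the tens digit as the stem and the units digit as the leaf. The scores are made up for illustration.

```python
from collections import defaultdict

# Hypothetical two-digit test scores (made up)
scores = [68, 72, 75, 81, 85, 86, 91, 92, 73, 69, 84, 95]

stems = defaultdict(list)
for value in sorted(scores):
    stem, leaf = divmod(value, 10)  # tens digit is the stem, units digit the leaf
    stems[stem].append(leaf)

# Each row shows the actual data values, unlike a histogram bar
for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
```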
Cross tabulation: tabular summary for two variables; provides insight about relationship
Simpson's paradox: a conclusion based on aggregate data can be reversed when we look at the
unaggregated data
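A small numeric illustration of the reversal. The counts are hypothetical and chosen only so that the effect appears: option A has the higher success rate within each group, yet option B has the higher rate once the groups are combined.

```python
# Hypothetical (successes, trials) per group, made up for illustration
results = {
    "A": {"group 1": (9, 10), "group 2": (30, 100)},
    "B": {"group 1": (80, 100), "group 2": (2, 10)},
}

def rate(successes, trials):
    """Success rate for a single group."""
    return successes / trials

def aggregate_rate(groups):
    """Success rate after pooling all groups together."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(t for _, t in groups.values())
    return total_successes / total_trials

for option, groups in results.items():
    by_group = {g: round(rate(*counts), 3) for g, counts in groups.items()}
    print(option, by_group, "aggregate:", round(aggregate_rate(groups), 3))
```

The reversal comes from the unequal group sizes: most of A's trials fall in the harder group, most of B's in the easier one.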
Scatter diagram: graphical presentation of relationship between two quantitative variables
Trendline: approximation of relationship between two quantitative variables
Measures of variability:
Range: largest value - smallest value
Interquartile range: difference between third quartile and first quartile; range for the middle 50%
of the data
Variance: utilizes all data; based on the squared difference between the value of each observation and the mean
For any data set, the sum of deviations about the mean always equals zero.
Standard deviation: positive square root of variance; easier to interpret than variance because it is measured in the same units
as original data.
Coefficient of variation: expressed as percentage; how large standard deviation is relative to mean
Coefficient of variation= (standard deviation/mean) x 100%
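The measures of variability above can be sketched with Python's `statistics` module. The data are made up, the variance shown is the sample variance (divides by n - 1), and the quartiles use the module's default "exclusive" method; other quartile conventions can give slightly different values.

```python
import statistics

# Hypothetical data set (made up), already sorted for readability
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 12, 15]

data_range = max(data) - min(data)              # largest value - smallest value
q1, q2, q3 = statistics.quantiles(data, n=4)    # quartiles, default "exclusive" method
iqr = q3 - q1                                   # range of the middle 50% of the data

mean = statistics.mean(data)
deviations = [x - mean for x in data]           # deviations about the mean
variance = statistics.variance(data)            # sample variance (n - 1 in the denominator)
std_dev = statistics.stdev(data)                # positive square root of the variance
cv = std_dev / mean * 100                       # coefficient of variation, in percent

print("range:", data_range, "IQR:", iqr)
print("sum of deviations:", sum(deviations))
print("variance:", variance, "std dev:", round(std_dev, 4), "CV %:", round(cv, 2))
```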
Measures of Shape:
Skewed to the left: skewness is negative; skewed to the right: skewness is positive; when symmetric, mean and median are equal
Five-number summary:
Smallest value
First quartile (Q1)
Median (Q2)
Third quartile (Q3)
Largest value
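The five values listed above are commonly called the five-number summary. A minimal sketch, assuming made-up data and the default "exclusive" quartile method of `statistics.quantiles` (quartile conventions differ between textbooks and software):

```python
import statistics

# Hypothetical data set (made up)
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 12, 15]

q1, median, q3 = statistics.quantiles(data, n=4)  # Q1, Q2 (median), Q3
five_number_summary = (min(data), q1, median, q3, max(data))

labels = ["Smallest value", "First quartile (Q1)", "Median (Q2)",
          "Third quartile (Q3)", "Largest value"]
for label, value in zip(labels, five_number_summary):
    print(f"{label}: {value}")
```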
A positive value of sxy (sample covariance) shows a positive linear association between x and y; as x increases, y increases
If sxy is close to zero, there is no linear association between x and y
Problem with covariance: its value depends on the units of measurement for x and y. Solution:
correlation coefficient
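A sketch of why the correlation coefficient solves the units problem. The data and the function names `sample_covariance` and `correlation` are assumptions for illustration; the formulas are the usual sample covariance (divides by n - 1) and the sample correlation sxy / (sx * sy).

```python
import math

def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Sample correlation coefficient: s_xy / (s_x * s_y)."""
    s_x = math.sqrt(sample_covariance(x, x))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical data (made up): x in meters, then the same x in centimeters
x_m = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_cm = [100 * v for v in x_m]

print(sample_covariance(x_m, y), sample_covariance(x_cm, y))  # covariance changes with units
print(round(correlation(x_m, y), 4), round(correlation(x_cm, y), 4))  # correlation does not
```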
Permutations: used when n objects are to be selected from a set of N objects and the order of
selection is important; the same n objects selected in a different order are a different experimental outcome
An experiment results in more permutations than combinations for the same number of objects,
because each selection of n objects can be ordered in n! ways.
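A quick check of the relationship between the two counting rules, using `math.comb` and `math.perm` (Python 3.8+). The values N = 5 and n = 2 are arbitrary.

```python
import math

N, n = 5, 2  # arbitrary example: select 2 objects from 5

combinations = math.comb(N, n)  # order does not matter: N! / (n! (N - n)!)
permutations = math.perm(N, n)  # order matters:         N! / (N - n)!

print(combinations, permutations)

# Each combination of n objects can be arranged in n! different orders
print(permutations == combinations * math.factorial(n))
```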
Assigning probabilities: each probability must be between 0 and 1, and the probabilities of all experimental outcomes must sum to 1
Classical method: when all outcomes are equally likely; if n outcomes are possible, each has probability 1/n
Relative frequency method: data are available to estimate the proportion of the time the
experimental outcome will occur if the experiment is repeated a large number of times; probability = number of times outcome
is seen/total number of repetitions
Subjective method: one cannot realistically assume that outcomes are equally likely and little relevant
data are available; may use experience and intuition to specify a degree of belief
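A minimal sketch of the two basic requirements together with the classical and relative frequency methods. The die, the observed counts, and the helper name `is_valid_assignment` are assumptions for illustration.

```python
def is_valid_assignment(probabilities, tol=1e-9):
    """Check the two requirements: each value in [0, 1] and the total equal to 1."""
    probabilities = list(probabilities)
    return (all(0 <= p <= 1 for p in probabilities)
            and abs(sum(probabilities) - 1) <= tol)

# Classical method: n equally likely outcomes, each with probability 1/n
n_outcomes = 6  # e.g. a fair six-sided die
classical = [1 / n_outcomes] * n_outcomes

# Relative frequency method: hypothetical counts over 20 repetitions (made up)
observed_counts = {0: 2, 1: 5, 2: 6, 3: 4, 4: 3}
total = sum(observed_counts.values())
relative_frequency = {outcome: count / total
                      for outcome, count in observed_counts.items()}

print(is_valid_assignment(classical))
print(relative_frequency, is_valid_assignment(relative_frequency.values()))
```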
Start analysis with initial or prior probabilities, then calculate revised or posterior
probabilities
Prior probabilities → new info → application of Bayes' theorem → posterior probabilities
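The prior-to-posterior revision can be sketched for a set of mutually exclusive events. The two-supplier scenario and all numbers are hypothetical; `posterior` is an illustrative helper name.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem: P(A_i | B) = P(A_i) P(B | A_i) / sum_j P(A_j) P(B | A_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A_i and B)
    total = sum(joint)                                    # P(B)
    return [j / total for j in joint]

# Hypothetical example: two suppliers and the new info "a part is defective"
priors = [0.6, 0.4]         # P(supplier 1), P(supplier 2)
likelihoods = [0.02, 0.05]  # P(defective | supplier 1), P(defective | supplier 2)

revised = posterior(priors, likelihoods)
print([round(p, 3) for p in revised])
```

With these numbers the defective part shifts belief toward supplier 2, even though supplier 1 provides more parts.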
E(x) = μ = np
Var(x) = σ² = np(1 - p)
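Assuming the expected value and variance above are those of the binomial distribution, E(x) = np and Var(x) = np(1 - p), they can be checked numerically against the binomial probability function. The values n = 10 and p = 0.3 are arbitrary.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of x successes in n independent trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3  # arbitrary example values
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Expected value and variance computed directly from the distribution
mean = sum(x * f for x, f in enumerate(pmf))
variance = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))

print(round(mean, 6), n * p)                 # compare with np
print(round(variance, 6), n * p * (1 - p))   # compare with np(1 - p)
```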
Hypergeometric: closely related to binomial but trials are not independent and probability of
success changes from trial to trial
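A sketch of the hypergeometric probability function, f(x) = C(r, x) C(N - r, n - x) / C(N, n), where N is the population size, r the number of successes in the population, and n the number of trials. The numbers are hypothetical; exact fractions are used to avoid rounding.

```python
import math
from fractions import Fraction

def hypergeometric_pmf(x, N, r, n):
    """Probability of x successes in n draws without replacement
    from a population of N elements containing r successes."""
    return Fraction(math.comb(r, x) * math.comb(N - r, n - x), math.comb(N, n))

# Hypothetical example: 12 items, 5 defective, 3 drawn without replacement
N, r, n = 12, 5, 3
pmf = [hypergeometric_pmf(x, N, r, n) for x in range(n + 1)]
print([str(f) for f in pmf])

# Trials are not independent: the success probability changes after each draw
p_first = Fraction(r, N)                        # 5/12 on the first draw
p_second_given_success = Fraction(r - 1, N - 1) # 4/11 after one success is removed
print(p_first, p_second_given_success)
```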