
Chapter 1

Descriptive statistics deals with methods of organizing, summarizing, and presenting data
in a convenient and informative way. One form of descriptive statistics uses graphical
techniques that allow statistics practitioners to present data in ways that make it easy for
the reader to extract useful information.
Another form of descriptive statistics uses numerical techniques to summarize data. One
such method that you have already used frequently calculates the average or mean.
Measure of central location, example: the average (the mean)
Measure of variability, example: the range
Inferential statistics is a body of methods used to draw conclusions or inferences about
characteristics of populations based on sample data.
Exit polls: a random sample of voters who exit the polling booth are asked for whom they
voted.
Statistical inference problems involve three key concepts: the population, the sample, and
the statistical inference:
o A population is the group of all items of interest to a statistics practitioner. A
descriptive measure of a population is called a parameter.
o A sample is a set of data drawn from the studied population. A descriptive
measure of a sample is called a statistic.
o Statistical inference is the process of making an estimate, prediction, or decision
about a population based on sample data.
It is far easier and cheaper to take a sample from the population of interest and draw
conclusions or make estimates about the population on the basis of information provided
by the sample. However, such conclusions and estimates are not always going to be
correct. For this reason, we build into the statistical inference a measure of reliability.
There are two such measures: the confidence level and the significance level.
o The confidence level is the proportion of times that an estimating procedure will
be correct.
o When the purpose of the statistical inference is to draw a conclusion about a
population, the significance level measures how frequently the conclusion will be
wrong.
To help students understand the basic foundation, we offer two approaches. First, we will
teach readers how to create Excel spreadsheets that allow for what-if analyses. By
changing some of the input values, students can see for themselves how statistics works.
(The term is derived from "what happens to the statistics if I change this value?")
Second, we offer applets, which are computer programs that perform similar what-if
analyses or simulations.

Chapter 2

A variable is some characteristic of a population or sample.
The values of the variable are the possible observations of the variable.
Data are the observed values of a variable. "Data" is the plural of "datum".
Interval data are real numbers, such as heights, weights, incomes, and distances. We also
refer to this type of data as quantitative or numerical.
The values of nominal data are categories. For example, responses to questions about
marital status produce nominal data. The values of this variable are single, married,
divorced, and widowed. Notice that the values are not numbers but instead are words that
describe the categories. We often record nominal data by arbitrarily assigning a number
to each category. Nominal data are also called qualitative or categorical.
The third type of data is ordinal. Ordinal data appear to be nominal, but the difference is
that the order of their values has meaning.
The difference between nominal and ordinal types of data is that the order of the values
of the latter indicates a higher rating. Consequently, when assigning codes to the values,
we should maintain the order of the values. It's not the magnitude of the values that is
important; it's their order.
Students often have difficulty distinguishing between ordinal and interval data. The
critical difference between them is that the intervals or differences between values of
interval data are consistent and meaningful (which is why this type of data is called
interval). For example, the difference between marks of 85 and 80 is the same five-mark
difference that exists between 75 and 70; that is, we can calculate the difference and
interpret the results.
Because the codes representing ordinal data are arbitrarily assigned except for the order,
we cannot calculate and interpret differences. For example, using a 1-2-3-4-5 coding
system to represent poor, fair, good, very good, and excellent, we note that the difference
between excellent and very good is identical to the difference between good and fair.
With a 6-18-23-45-88 coding, the difference between excellent and very good is 43, and
the difference between good and fair is 5.
All calculations are permitted on interval data. We often describe a set of interval data by
calculating the average. For example, the average of the 10 marks listed on page 13 is
70.3.
Because the codes of nominal data are completely arbitrary, we cannot perform any
calculations on these codes. Calculations based on the codes used to store this type of
data are meaningless. All that we are permitted to do with nominal data is count or
compute the percentages of the occurrences of each category.
The most important aspect of ordinal data is the order of the values. As a result, the only
permissible calculations are those involving a ranking process. For example, we can place
all the data in order and select the code that lies in the middle. This descriptive
measurement is called the median.
Hierarchy of Data: The data types can be placed in order of the permissible calculations.
At the top of the list, we place the interval data type because virtually all computations
are allowed. The nominal data type is at the bottom because no calculations other than
determining frequencies are permitted. (We are permitted to perform calculations using
the frequencies of codes, but this differs from performing calculations on the codes
themselves.) In between interval and nominal data lies the ordinal data type. Permissible
calculations are ones that rank the data.
As discussed, the only allowable calculation on nominal data is to count the frequency or
compute the percentage that each value of the variable represents. We can summarize the
data in a table, which presents the categories and their counts, called a frequency
distribution. A relative frequency distribution lists the categories and the proportion
with which each occurs. We can use graphical techniques to present a picture of the data.
There are two graphical methods we can use: the bar chart and the pie chart.
To construct a frequency and relative frequency distribution for nominal data in Excel:
1) Copy the data into Excel.
2) Activate any empty cell and type =COUNTIF([Input range],[Criteria])
Input range is the range of cells containing the data.
The criteria are the codes you want to count.
Example: To count the number of 1s (Working full-time), type
=COUNTIF(P1:P2024,1).
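Outside Excel, the same counting can be sketched in a few lines of Python. This is only an illustration, not the method of these notes: the list `codes` is made-up data, and only code 1 (working full-time) comes from the example above; the meanings given to codes 2 and 3 are assumptions.

```python
from collections import Counter

# Made-up coded responses. Only "1 = working full-time" comes from the notes;
# codes 2 and 3 (say, part-time and not working) are assumed for illustration.
codes = [1, 2, 1, 3, 1, 2, 1, 1, 3, 2]

def frequency_distribution(values):
    """Return {category: (count, relative frequency)} for nominal codes."""
    counts = Counter(values)
    n = len(values)
    return {category: (count, count / n) for category, count in sorted(counts.items())}

table = frequency_distribution(codes)
for category, (count, proportion) in table.items():
    print(category, count, f"{proportion:.2f}")
```

Note that the codes are only counted, never added or averaged, which is the one calculation permitted on nominal data.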
The information contained in the data is summarized well in the table. However,
graphical techniques generally catch a reader's eye more quickly than does a table of
numbers. Two graphical techniques can be used to display the results shown in the table.
A bar chart is often used to display frequencies; a pie chart graphically shows relative
frequencies.
The bar chart is created by drawing a rectangle representing each category. The height of
the rectangle represents the frequency. The base is arbitrary.
If we wish to emphasize the relative frequencies instead of drawing the bar chart, we
draw a pie chart. A pie chart is simply a circle subdivided into slices that represent the
categories. It is drawn so that the size of each slice is proportional to the percentage
corresponding to that category.
To construct bar and pie charts for the frequency and relative frequency distributions of
nominal data in Excel:
1) After creating the frequency distribution, highlight the column of frequencies.
2) For a bar chart, click Insert, Column, and the first 2-D Column.
3) Click Chart Tools (if it does not appear, click inside the box containing the bar
chart) and Layout. This will allow you to make changes to the chart. We removed
the Gridlines and the Legend, and clicked Data Labels to create the titles.
4) For a pie chart, click Pie and Chart Tools to edit the graph.
Interpretation: The bar chart focuses on the frequencies and the pie chart focuses on the
proportions.
Pie and bar charts are used widely in newspapers, magazines, and business and
government reports. One reason for this appeal is that they are eye-catching and can
attract the reader's interest whereas a table of numbers might not. Pie and bar charts are
frequently used to simply present numbers associated with categories. The only reason to
use a bar or pie chart in such a situation would be to enhance the reader's ability to grasp
the substance of the data.
There are no specific graphical techniques for ordinal data. Consequently, when we wish
to describe a set of ordinal data, we will treat the data as if they were nominal and use the
techniques described in this section. The only criterion is that the bars in bar charts
should be arranged in ascending (or descending) ordinal values; in pie charts, the wedges
are typically arranged clockwise in ascending or descending order.
Techniques applied to single sets of data are called univariate. There are many situations
where we wish to depict the relationship between variables; in such cases, bivariate
methods are required. A cross-classification table (also called a cross-tabulation table) is
used to describe the relationship between two nominal variables.
To describe the relationship between two nominal variables, we must remember that we
are permitted only to determine the frequency of the values. As a first step, we need to
produce a cross-classification table that lists the frequency of each combination of the
values of the two variables.
Excel can produce the cross-classification table using several methods. We will use and
describe the PivotTable in two ways: (1) to create the cross-classification table featuring
the counts and (2) to produce a table showing the row relative frequencies.
1) Click Insert and PivotTable.
2) Make sure that the Table/Range is correct.
3) Drag the Occupation button to the ROW section of the box. Drag the Newspaper
button to the COLUMN section. Drag the Reader button to the DATA field.
Right-click any number in the table, click Summarize Data By, and check Count.
To convert to row percentages, right-click any number, click Summarize Data By,
More options, and Show values as. Scroll down and click % of rows. (We then
formatted the data into decimals.) To improve both tables, we substituted the
names of the occupations and newspapers.
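The two PivotTable outputs (counts and row relative frequencies) can also be sketched without Excel. The following is a minimal illustration under stated assumptions: the observations, the occupation labels, and the newspaper names are all invented, and the helper names `cross_classify` and `row_relative_frequencies` are hypothetical.

```python
from collections import Counter

# Invented (occupation, newspaper) pairs; one pair per observation.
observations = [
    ("blue-collar", "Post"), ("white-collar", "Globe"), ("blue-collar", "Post"),
    ("professional", "Globe"), ("white-collar", "Post"), ("professional", "Globe"),
]

def cross_classify(pairs):
    """Count how often each combination of the two nominal values occurs."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}

def row_relative_frequencies(table):
    """Divide each count by its row total, giving proportions per row."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: count / total for col, count in cells.items()}
    return result

counts_table = cross_classify(observations)
row_table = row_relative_frequencies(counts_table)
print(counts_table)
print(row_table)
```

Comparing the rows of `row_table` shows whether the distribution of the second variable differs across the categories of the first.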
There are several ways to store the data to be used in this section to produce a table or a
bar or pie chart.
1) The data are in two columns. The first column represents the categories of the
first nominal variable, and the second column stores the categories for the
second variable. Each row represents one observation of the two variables.
The number of observations in each column must be the same. Excel and
Minitab can produce a cross-classification table from these data. (To use
Excel's PivotTable, there also must be a third variable representing the
observation number.)
2) The data are stored in two or more columns, with each column representing
the same variable in a different sample or population. For example, the
variable may be the type of undergraduate degree of applicants to an MBA
program, and there may be five universities we wish to compare. To produce a
cross-classification table, we would have to count the number of observations
of each category (undergraduate degree) in each column.
3) The table representing counts in a cross-classification table may have already
been created.

Chapter 3

The histogram not only is a powerful graphical technique used to summarize interval data
but also is used to help explain an important aspect of probability.
We create a frequency distribution for interval data by counting the number of
observations that fall into each of a series of intervals, called classes, that cover the
complete range of observations.
A histogram is created by drawing rectangles whose bases are the intervals and whose
heights are the frequencies.
How to create a histogram:
1) Type or import the data into one column. In another column, type the upper limits
of the class intervals. Excel calls them bins.
2) Click Data, Data Analysis, and Histogram.
3) Specify the Input Range (A1:A201) and the Bin Range (B1:B9). Click Chart
Output. Click Labels if the first row contains names.
4) To remove the gaps, place the cursor over one of the rectangles and click the right
button of the mouse. Click (with the left button) Format Data Series, then move the
pointer to Gap Width and use the slider to change the number from 150 to 0.
The number of class intervals we select depends entirely on the number of observations
in the data set. The more observations we have, the larger the number of class intervals
we need to use to draw a useful histogram.
Approximate Number of Classes in Histograms

An alternative to the guidelines listed in the table above is to use Sturges's formula,
which recommends that the number of class intervals be determined by the following,
where log is the base-10 logarithm:
Number of class intervals = 1 + 3.3 log(n)
For example, if n = 50, Sturges's formula becomes
Number of class intervals = 1 + 3.3 log(50) = 1 + 3.3(1.7) = 6.6, which we round
to 7.
Class Interval Widths
1) We determine the approximate width of the classes by subtracting the smallest
observation from the largest and dividing the difference by the number of classes.
Thus, class width = (largest observation - smallest observation)/number of
classes.
We then define our class limits by selecting a lower limit for the first class from which all
other limits are determined. The only condition we apply is that the first class interval
must contain the smallest observation.
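Sturges's formula and the class-width rule are simple enough to check numerically. The sketch below assumes the base-10 logarithm (which matches log(50) = 1.7 in the example); the function names and the three observations passed to `class_width` are made up for illustration.

```python
import math

def sturges_classes(n):
    """Sturges's formula before rounding: 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

def class_width(observations, number_of_classes):
    """Approximate class width: (largest - smallest) / number of classes."""
    return (max(observations) - min(observations)) / number_of_classes

k = round(sturges_classes(50))       # about 6.6, rounded to 7 as in the notes
width = class_width([3, 12, 38], k)  # made-up data: (38 - 3) / 7
print(k, width)
```

In practice the computed width is usually rounded to a convenient value before the class limits are chosen.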
The purpose of drawing histograms, like that of all other statistical techniques, is to
acquire information. Once we have the information, we frequently need to describe what
we've learned to others. We describe the shape of histograms on the basis of the
following characteristics.
1) A histogram is said to be symmetric if, when we draw a vertical line down the
center of the histogram, the two sides are identical in shape and size.
2) A skewed histogram is one with a long tail extending to either the right or the left.
The former is called positively skewed, and the latter is called negatively
skewed.
A mode is the observation that occurs with the greatest frequency. A modal class is the
class with the largest number of observations. A unimodal histogram is one with a
single peak.
A bimodal histogram is one with two peaks, not necessarily equal in height. Bimodal
histograms often indicate that two different distributions are present.
A special type of symmetric unimodal histogram is one that is bell shaped.
One of the drawbacks of the histogram is that we lose potentially useful information by
classifying the observations.
By classifying the observations we did acquire useful information. However, the histogram
focuses our attention on the frequency of each class and by doing so sacrifices whatever
information was contained in the actual observations. A statistician named John Tukey
introduced the stem-and-leaf display, which is a method that to some extent overcomes
this loss.
The first step in developing a stem-and-leaf display is to split each observation into two
parts, a stem and a leaf.
There are several different ways of doing this. For example, the number 12.3 can be split
so that the stem is 12 and the leaf is 3. In this definition the stem consists of the digits to
the left of the decimal and the leaf is the digit to the right of the decimal. Another method
can define the stem as 1 and the leaf as 2 (ignoring the 3). In this definition the stem is
the number of tens and the leaf is the number of ones. After each stem, we list that stem's
leaves, usually in ascending order.
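A minimal sketch of the second splitting method (stem = number of tens, leaf = number of ones, decimals ignored) is shown below. The function name `stem_and_leaf` and the sample values other than 12.3 are assumptions, and the sketch only handles non-negative observations.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Split each non-negative value into stem (tens) and leaf (ones)."""
    stems = defaultdict(list)
    for value in values:
        whole = int(value)               # ignore anything after the decimal
        stems[whole // 10].append(whole % 10)
    # List each stem's leaves in ascending order, stems in ascending order.
    return {stem: sorted(leaves) for stem, leaves in sorted(stems.items())}

display = stem_and_leaf([12.3, 15, 21, 27, 29, 34])
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Each printed row is one stem followed by its leaves, so the row lengths trace out the same shape a histogram would show.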
The stem-and-leaf display is similar to a histogram turned on its side. The length of each
line represents the frequency in the class interval defined by the stems. The advantage of
the stem-and-leaf display over the histogram is that we can see the actual observations.
The frequency distribution lists the number of observations that fall into each class
interval. We can also create a relative frequency distribution by dividing the
frequencies by the number of observations.
As you can see, the relative frequency distribution highlights the proportion of the
observations that fall into each class. In some situations, we may wish to highlight the
proportion of observations that lie below each of the class limits. In such cases, we create
the cumulative relative frequency distribution.
Another way of presenting this information is the ogive, which is a graphical
representation of the cumulative relative frequencies.
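The step from class frequencies to relative and cumulative relative frequencies can be sketched as follows. The class counts are invented for illustration; the cumulative values are the heights an ogive would plot at each upper class limit.

```python
from itertools import accumulate

# Invented class frequencies for five consecutive class intervals.
class_counts = [8, 12, 20, 6, 4]
n = sum(class_counts)

# Relative frequency: each class count divided by the number of observations.
relative = [count / n for count in class_counts]

# Cumulative relative frequency: running total of counts divided by n.
cumulative = [running / n for running in accumulate(class_counts)]

for count, rel, cum in zip(class_counts, relative, cumulative):
    print(count, f"{rel:.2f}", f"{cum:.2f}")
```

Accumulating the counts before dividing keeps the final cumulative value at exactly 1.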
Besides classifying data by type, we can also classify them according to whether the
observations are measured at the same time or whether they represent measurements at
successive points in time. The former are called cross-sectional data, and the latter
time-series data.
Time-series data are often graphically depicted on a line chart, which is a plot of the
variable over time. It is created by plotting the value of the variable on the vertical axis
and the time periods on the horizontal axis.
Graphical excellence: a term we apply to techniques that are informative and concise
and that impart information clearly to their viewers.
Graphical excellence is achieved when the following characteristics apply.
1) The graph presents large data sets concisely and coherently. Graphical
techniques were created to summarize and describe large data sets. Small data sets
are easily summarized with a table. One or two numbers can best be presented in
a sentence.
2) The ideas and concepts the statistics practitioner wants to deliver are clearly
understood by the viewer. The chart is designed to describe what would
otherwise be described in words. An excellent chart is one that can replace a
thousand words and still be clearly comprehended by its readers.
3) The graph encourages the viewer to compare two or more variables. Graphs
displaying only one variable provide very little information. Graphs are often best
used to depict relationships between two or more variables or to explain how and
why the observed results occurred.
4) The display induces the viewer to address the substance of the data and not
the form of the graph. The form of the graph is supposed to help present the
substance. If the form replaces the substance, the chart is not performing its
function.
5) There is no distortion of what the data reveal. You cannot make statistical
techniques say whatever you like. A knowledgeable reader will easily see through
distortions and deception.
Graphical Deception
o The first thing to watch for is a graph without a scale on one axis.
o A second trap to avoid is being influenced by a graph's caption.
o Perspective is often distorted if only absolute changes in value, rather than
percentage changes, are reported.
o A graph can be made to appear more dramatic by stretching the axis.

Chapter 4

There are three different measures that we use to describe the center of a set of data. The
first is the best known, the arithmetic mean, which we'll refer to simply as the mean or
the average. The mean is computed by summing the observations and dividing by the
number of observations.
We label the observations in a sample x1, x2, ..., xn, where x1 is the first observation, x2
is the second, and so on until xn, where n is the sample size. The sample mean
is denoted x̄ (x-bar). In a population, the number of observations is labeled N and the
population mean is denoted by μ (Greek letter mu).

The median is calculated by placing all the observations in order (ascending or
descending). The observation that falls in the middle is the median. The sample and
population medians are computed in the same way.
When there is an even number of observations, the median is determined by averaging
the two observations in the middle.
The mode is defined as the observation (or observations) that occurs with the greatest
frequency. Both the statistic and parameter are computed in the same way.
For populations and large samples, it is preferable to report the modal class.
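The three measures of central location can be computed directly from their definitions. The sketch below uses the ten observations from the percentile example later in these notes (0 0 5 7 8 9 12 14 22 33); the function names are illustrative, and Python's built-in `statistics` module offers equivalent functions.

```python
from collections import Counter

def mean(observations):
    """Sum the observations and divide by their number."""
    return sum(observations) / len(observations)

def median(observations):
    """Middle value after sorting; average the two middle values if n is even."""
    ordered = sorted(observations)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(observations):
    """All values that occur with the greatest frequency."""
    counts = Counter(observations)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
print(mean(data), median(data), modes(data))
```

With an even number of observations (n = 10), the median is the average of the fifth and sixth ordered values, 8 and 9.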
With three measures from which to choose, which one should we use? There are several
factors to consider when making our choice of measure of central location. The mean is
generally our first selection. However, there are several circumstances when the median
is better. The mode is seldom the best measure of central location. One advantage the
median holds is that it is not as sensitive to extreme values as is the mean.
When the data are interval, we can use any of the three measures of central location.
However, for ordinal and nominal data, the calculation of the mean is not valid. Because
the calculation of the median begins by placing the data in order, this statistic is
appropriate for ordinal data. The mode, which is determined by counting the frequency of
each observation, is appropriate for nominal data. However, nominal data do not have a
center, so we cannot interpret the mode of nominal data in that way. It is generally
pointless to compute the mode of nominal data.
Measures of variability:
o Range = Largest observation - Smallest observation
The advantage of the range is its simplicity. The disadvantage is also its
simplicity. Because the range is calculated from only two observations, it
tells us nothing about the other observations.
o The variance and its related measure, the standard deviation, are arguably the
most important statistics.

Examine the formula for the sample variance s². It may appear to be illogical that in
calculating s² we divide by n - 1 rather than by n. However, we do so for the following
reason. Population parameters in practical settings are seldom known. One objective of
statistical inference is to estimate the parameter from the statistic. For example, we
estimate the population mean μ from the sample mean x̄. Although it is not obviously
logical, the statistic created by dividing Σ(xi - x̄)² by n - 1 is a better estimator than the one
created by dividing by n.
To compute the sample variance s², we begin by calculating the sample mean x̄. Next we
compute the difference (also called the deviation) between each observation and the mean.
We square the deviations and sum. Finally, we divide the sum of squared deviations by
n - 1.
The variance provides us with only a rough idea about the amount of variation in the
data. However, this statistic is useful when comparing two or more sets of data of the
same type of variable. If the variance of one data set is larger than that of a second data
set, we interpret that to mean that the observations in the first set display more variation
than the observations in the second set.
The standard deviation is simply the positive square root of the variance.
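The four steps for the sample variance (mean, deviations, squared deviations summed, divide by n - 1) translate directly into code. The sketch below is illustrative only; it reuses the ten observations from the percentile example in these notes, and the function names are assumptions.

```python
import math

def sample_variance(observations):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(observations)
    x_bar = sum(observations) / n
    squared_deviations = [(x - x_bar) ** 2 for x in observations]
    return sum(squared_deviations) / (n - 1)

def sample_std_dev(observations):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(observations))

data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
s2 = sample_variance(data)   # squared deviations sum to 922, so s2 = 922 / 9
s = sample_std_dev(data)
print(round(s2, 2), round(s, 2))
```

Dividing by n instead of n - 1 would give the smaller value 92.2 for these data; the n - 1 divisor is the one used when the goal is to estimate a population variance from a sample.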

Knowing the mean and standard deviation allows the statistics practitioner to extract
useful bits of information. The information depends on the shape of the histogram. If the
histogram is bell shaped, we can use the Empirical Rule.
o Approximately 68% of all observations fall within one standard deviation of the
mean.
o Approximately 95% of all observations fall within two standard deviations of the
mean.
o Approximately 99.7% of all observations fall within three standard deviations of
the mean.

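Chebysheff's Theorem applies to histograms of any shape: for k > 1, the proportion of
observations that lie within k standard deviations of the mean is at least 1 - 1/k².
When k = 2, Chebysheff's Theorem states that at least three-quarters (75%) of all
observations lie within two standard deviations of the mean. With k = 3, Chebysheff's
Theorem states that at least eight-ninths (88.9%) of all observations lie within three
standard deviations of the mean.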
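The 75% and 88.9% figures follow from the usual statement of Chebysheff's Theorem, a lower bound of 1 - 1/k² for k > 1. That general formula is assumed here because it is not written out in the notes; the function name is hypothetical.

```python
def chebysheff_lower_bound(k):
    """Minimum proportion of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebysheff's bound is only informative for k > 1")
    return 1 - 1 / k ** 2

# Empirical Rule percentages from the notes, valid only for bell-shaped histograms.
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}

for k in (2, 3):
    print(k, round(chebysheff_lower_bound(k), 3), EMPIRICAL_RULE[k])
```

The Chebysheff bounds are lower than the Empirical Rule figures because they must hold for histograms of every shape, not just bell-shaped ones.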

Measures of relative standing are designed to provide information about the position of
particular values relative to the entire data set.
The Pth percentile is the value for which P percent of the observations are less than that
value and (100 - P)% are greater than that value.
o Example: your SAT score is reported to be at the 60th percentile. This means that
60% of all the other marks are below yours and 40% are above it. You now know
exactly where you stand relative to the population of SAT scores.
We have special names for the 25th, 50th, and 75th percentiles. Because these three
statistics divide the set of data into quarters, these measures of relative standing are also
called quartiles. The first or lower quartile is labeled Q1. It is equal to the 25th
percentile. The second quartile, Q2, is equal to the 50th percentile, which is also the
median. The third or upper quartile, Q3, is equal to the 75th percentile.
The following formula allows us to approximate the location of any percentile:
o L_P = (n + 1)(P/100)
where L_P is the location of the Pth percentile.
Example:
o Placing the 10 observations in ascending order we get 0 0 5 7 8 9 12 14 22 33
o The location of the 25th percentile is L_25 = (10 + 1)(25/100) = (11)(.25) = 2.75
o The 25th percentile is three-quarters of the distance between the second (which is
0) and the third (which is 5) observations. Three-quarters of the distance is (.75)
(5 - 0) = 3.75.
o Because the second observation is 0, the 25th percentile is 0 + 3.75 = 3.75.
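The location formula and the interpolation step in the example above can be sketched as one function. The handling of locations that fall before the first or after the last observation is an assumption of this sketch, since the notes do not cover that case; the function name is illustrative.

```python
def percentile(ordered, p):
    """Pth percentile using location (n + 1) * P / 100 with linear interpolation."""
    n = len(ordered)
    location = (n + 1) * p / 100
    lower_position = int(location)        # 1-based position just below the location
    fraction = location - lower_position
    if lower_position < 1:                # assumed: clamp to the smallest value
        return ordered[0]
    if lower_position >= n:               # assumed: clamp to the largest value
        return ordered[-1]
    below = ordered[lower_position - 1]
    above = ordered[lower_position]
    return below + fraction * (above - below)

data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]   # already in ascending order
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
print(q1, q2, q3)
```

For these data the 25th percentile matches the worked example (3.75), and the 50th percentile equals the median. Other software may use slightly different percentile conventions and return different values.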
We can often get an idea of the shape of the histogram from the quartiles. For example, if
the first and second quartiles are closer to each other than are the second and third
quartiles, then the histogram is positively skewed. If the first and second quartiles are
farther apart than the second and third quartiles, then the histogram is negatively skewed.
If the difference between the first and second quartiles is approximately equal to the
difference between the second and third quartiles, then the histogram is approximately
symmetric.
The quartiles can be used to create another measure of variability, the interquartile
range: Interquartile range = Q3 - Q1
The interquartile range measures the spread of the middle 50% of the observations. Large
values of this statistic mean that the first and third quartiles are far apart, indicating a high
level of variability.
Box plots graph five statistics: the minimum and maximum observations, and the first,
second, and third quartiles. They also depict other features of a set of data.
The three vertical lines of the box are the first, second, and third quartiles. The lines
extending to the left and right are called whiskers. Any points that lie outside the
whiskers are called outliers. The whiskers extend outward to the smaller of 1.5 times the
interquartile range or to the most extreme point that is not an outlier.
Outliers are unusually large or small observations. Because an outlier is considerably
removed from the main body of the data set, its validity is suspect. Consequently, outliers
should be checked to determine that they are not the result of an error in recording their
values. Outliers can also represent unusual observations that should be investigated.
INSTRUCTIONS
o Type or import the data into one column or two or more adjacent columns. (Open
Xm0301.)
o Click Add-Ins, Data Analysis Plus, and Box Plot.
o Specify the Input Range (A1:A201).
o A box plot will be created for each column of data that you have specified or
highlighted.
Factors That Identify When to Compute Percentiles and Quartiles
1. Objective: Describe a single set of data
2. Type of data: Interval or ordinal
3. Descriptive measurement: Relative standing
Factors That Identify When to Compute the Interquartile Range
1. Objective: Describe a single set of data
2. Type of data: Interval or ordinal
3. Descriptive measurement: Variability

Chapter 3 (cont)

Statistics practitioners frequently need to know how two interval variables are related.
The graphical technique used to describe the relationship is called a scatter diagram.
To draw a scatter diagram, we need data for two variables. In applications where one
variable depends to some degree on the other variable, we label the dependent variable Y
and the other, called the independent variable, X. In other cases where no dependency is
evident, we label the variables arbitrarily.
To determine the strength of the linear relationship, we draw a straight line through the
points in such a way that the line represents the relationship. If most of the points fall
close to the line, we say that there is a linear relationship. If most of the points appear to
be scattered randomly with only a semblance of a straight line, there is no, or at best, a
weak linear relationship.
In general, if one variable increases when the other does, we say that there is a positive
linear relationship. When the two variables tend to move in opposite directions, we
describe the nature of their association as a negative linear relationship.
In interpreting the results of a scatter diagram it is important to understand that if two
variables are linearly related it does not mean that one is causing the other. In fact, we
can never conclude that one variable causes another variable. We can express this more
eloquently as "Correlation is not causation."

Chapter 4 (cont)

In Chapter 3, we introduced the scatter diagram, a graphical technique that describes the
relationship between two interval variables. At that time, we pointed out that we were
particularly interested in the direction and strength of the linear relationship. We now
present three numerical measures of linear relationship that provide this information:
covariance, coefficient of correlation, and coefficient of determination.
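The notes name the three measures without giving formulas, so the sketch below assumes the standard sample definitions: covariance s_xy = Σ(xi - x̄)(yi - ȳ)/(n - 1), coefficient of correlation r = s_xy/(s_x s_y), and coefficient of determination R² = r². The data and function names are made up for illustration.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of paired deviations, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # covariance of x with itself = variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

x = [1, 2, 3, 4, 5]          # made-up paired observations
y = [2, 4, 5, 4, 5]
cov = sample_covariance(x, y)
r = correlation(x, y)
r_squared = r ** 2
print(cov, round(r, 4), round(r_squared, 4))
```

The sign of the covariance gives the direction of the linear relationship; the correlation rescales it to lie between -1 and +1 so that its size indicates strength.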

Chapter 6

A random experiment is an action or process that leads to one of several possible
outcomes.
The first step in assigning probabilities is to produce a list of the outcomes. The listed
outcomes must be exhaustive, which means that all possible outcomes must be included.
In addition, the outcomes must be mutually exclusive, which means that no two
outcomes can occur at the same time.
A sample space of a random experiment is a list of all possible outcomes of the
experiment. The outcomes must be exhaustive and mutually exclusive.

There are three approaches to assigning probabilities:
o The classical approach is used by mathematicians to help determine probability
associated with games of chance. For example, the classical approach specifies
that the probabilities of heads and tails in the flip of a balanced coin are equal to
each other. Because the sum of the probabilities must be 1, the probability of
heads and the probability of tails are both 50%. Similarly, the six possible
outcomes of the toss of a balanced die have the same probability; each is assigned
a probability of 1/6.
o The relative frequency approach defines probability as the long-run relative
frequency with which an outcome occurs. For example, suppose that we know
that of the last 1,000 students who took the statistics course you're now taking,
200 received a grade of A. The relative frequency of A's is then 200/1000 or 20%.
This figure represents an estimate of the probability of obtaining a grade of A in
the course. It is only an estimate because the relative frequency approach defines
probability as the long-run relative frequency. One thousand students do not
constitute the long run. The larger the number of students whose grades we have
observed, the better the estimate becomes. In theory, we would have to observe an
infinite number of grades to determine the exact probability.
o When it is not reasonable to use the classical approach and there is no history of
the outcomes, we have no alternative but to employ the subjective approach. In
the subjective approach, we define probability as the degree of belief that we hold
in the occurrence of an event.
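The link between the classical and relative frequency approaches can be illustrated by simulation. The sketch below tosses a simulated balanced die and compares the relative frequency of a 6 with the classical value 1/6; the function name, the seed, and the trial counts are arbitrary choices made for this illustration.

```python
import random

def estimate_probability(event, trials, seed=0):
    """Relative frequency of `event` over simulated tosses of a balanced die."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if event(rng.randint(1, 6)))
    return hits / trials

def is_six(face):
    return face == 6

for trials in (100, 10_000, 60_000):
    print(trials, estimate_probability(is_six, trials))

estimate = estimate_probability(is_six, 60_000)
```

The estimates tend to settle near 1/6 as the number of tosses grows, but any finite run remains only an estimate of the long-run relative frequency.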
An event is a collection or set of one or more simple events in a sample space.
The probability of an event is the sum of the probabilities of the simple events that
constitute the event.
No matter what method was used to assign probability, we interpret it using the relative
frequency approach for an infinite number of experiments.
The intersection of events A and B is the event that occurs when both A and B occur. It
is denoted as A and B. The probability of the intersection is called the joint probability.
For example, one way to toss a 3 with two dice is to toss a 1 on the first die and a 2 on
the second die, which is the intersection of two simple events.
Marginal probabilities, computed by adding across rows or down columns, are so
named because they are calculated in the margins of the table.

We frequently need to know how two events are related. In particular, we would
like to know the probability of one event given the occurrence of another related
event. This probability is called a conditional probability. The conditional probability
that we seek is represented by P(B1|A1), where the | represents the word "given".

One of the objectives of calculating conditional probability is to determine whether two
events are related. In particular, we would like to know whether they are independent
events. Two events A and B are independent if P(A|B) = P(A) or, equivalently,
P(B|A) = P(B).

Put another way, two events are independent if the probability of one event is not affected
by the occurrence of the other event.

Another event that is the combination of other events is the union.

The complement of event A is the event that occurs when event A does not occur. The
complement of event A is denoted by A^C. The complement rule, P(A^C) = 1 - P(A), derives
from the fact that the probability of an event and the probability of the event's
complement must sum to 1.

The multiplication rule is used to calculate the joint probability of two events. It is based
on the formula for conditional probability, P(A|B) = P(A and B)/P(B). We derive the
multiplication rule simply by multiplying both sides by P(B), which gives
P(A and B) = P(A|B)P(B).

If A and B are independent events, P(A|B) = P(A) and P(B|A) = P(B). It follows that the
joint probability of two independent events is simply the product of the probabilities of
the two events. We can express this as a special form of the multiplication rule:
P(A and B) = P(A)P(B).

The addition rule enables us to calculate the probability of the union of two events:
P(A or B) = P(A) + P(B) - P(A and B).

As was the case with the multiplication rule, there is a special form of the addition rule.
When two events are mutually exclusive (which means that the two events cannot occur
together), their joint probability is 0, so P(A or B) = P(A) + P(B).
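The rules above can be checked on a small table of joint probabilities. In this sketch the four joint probabilities and the event labels A1, A2, B1, B2 are invented for illustration, and the formulas used are the standard ones: P(A|B) = P(A and B)/P(B) and P(A or B) = P(A) + P(B) - P(A and B). Exact fractions are used to avoid rounding error.

```python
from fractions import Fraction

# Invented joint probabilities; rows are A1/A2, columns are B1/B2, and they sum to 1.
joint = {
    ("A1", "B1"): Fraction(11, 100), ("A1", "B2"): Fraction(29, 100),
    ("A2", "B1"): Fraction(6, 100),  ("A2", "B2"): Fraction(54, 100),
}

def marginal(event):
    """Add the joint probabilities across the row or down the column of `event`."""
    return sum(p for pair, p in joint.items() if event in pair)

def conditional(a, b):
    """P(a | b) = P(a and b) / P(b), where a is a row event and b a column event."""
    return joint[(a, b)] / marginal(b)

def union(a, b):
    """Addition rule: P(a or b) = P(a) + P(b) - P(a and b)."""
    return marginal(a) + marginal(b) - joint[(a, b)]

def independent(a, b):
    """Independent exactly when the joint probability equals the product of marginals."""
    return joint[(a, b)] == marginal(a) * marginal(b)

print(marginal("A1"), marginal("B1"), conditional("A1", "B1"))
print(union("A1", "B1"), independent("A1", "B1"))
```

Here P(A1|B1) = 11/17 differs from P(A1) = 2/5, so these two made-up events are not independent.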

An effective and simpler method of applying the probability rules is the probability tree,
wherein the events in an experiment are represented by lines. The resulting figure
resembles a tree, hence the name.
The advantage of a probability tree is that it restrains its users from making the wrong
calculation. Once the tree is drawn and the probabilities of the branches inserted, virtually
the only allowable calculation is the multiplication of the probabilities of linked branches.
An easy check on those calculations is available. The joint probabilities at the ends of the
branches must sum to 1 because all possible events are listed.

Conditional probability is often used to gauge the relationship between two events. There
are situations, however, where we witness a particular event and we need to compute the
probability of one of its possible causes. Bayes's Law is the technique we use.
The probabilities P(A) and P(A^C) are called prior probabilities because they are
determined prior to the decision about taking the preparatory course. The conditional
probabilities are called likelihood probabilities for reasons that are beyond the
mathematics in this book. Finally, the conditional probability P(A | B) and similar
conditional probabilities P(A^C | B), P(A | B^C), and P(A^C | B^C) are called posterior
probabilities or revised probabilities because the prior probabilities are revised after the
decision about taking the preparatory course.
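Bayes's Law combines a prior probability with the two likelihood probabilities to give a posterior probability, which is the same calculation a probability tree performs. The sketch below is a minimal illustration: the function name is hypothetical and the three input probabilities are invented, not taken from the preparatory-course example.

```python
from fractions import Fraction

def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """P(A | B) = P(A)P(B|A) / [P(A)P(B|A) + P(A^C)P(B|A^C)]."""
    joint_a_and_b = prior_a * likelihood_b_given_a
    joint_not_a_and_b = (1 - prior_a) * likelihood_b_given_not_a
    return joint_a_and_b / (joint_a_and_b + joint_not_a_and_b)

# Invented inputs: P(A) = 0.10, P(B | A) = 0.52, P(B | A^C) = 0.23.
revised = posterior(Fraction(1, 10), Fraction(52, 100), Fraction(23, 100))
print(revised, float(revised))
```

The denominator is P(B), obtained by adding the joint probabilities at the ends of the two tree branches on which B occurs.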

Chapter 7

A random variable is a function or rule that assigns a number to each outcome of an
experiment.
There are two types of random variables, discrete and continuous. A discrete random
variable is one that can take on a countable number of values.
A continuous random variable is one whose values are uncountable. An excellent
example of a continuous random variable is the amount of time to complete a task.
A probability distribution is a table, formula, or graph that describes the values of a
random variable and the probability associated with these values.
An uppercase letter will represent the name of the random variable, usually X. Its
lowercase counterpart will represent the value of the random variable. Thus, we represent
the probability that the random variable X will equal x as P(X = x) or more simply P(x).

The population mean is the weighted average of all of its values. The weights are the
probabilities. This parameter is also called the expected value of X and is represented by
E(X): E(X) = μ = Σ x P(x).

The population variance is calculated similarly. It is the weighted average of the squared
deviations from the mean: V(X) = σ² = Σ (x - μ)² P(x).

There is a shortcut calculation that simplifies the calculations for the population variance:
σ² = Σ x² P(x) - μ².
This formula is not an approximation; it will yield the same value as the formula above.
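The agreement between the definitional and shortcut variance calculations can be checked on a small discrete distribution. The distribution below is invented, and the formulas assumed are the standard ones: E(X) = Σ x P(x), V(X) = Σ (x - μ)² P(x), and the shortcut V(X) = Σ x² P(x) - μ². Exact fractions make the comparison free of rounding error.

```python
from fractions import Fraction

# Invented probability distribution: value -> probability (probabilities sum to 1).
distribution = {
    0: Fraction(1, 10),
    1: Fraction(2, 10),
    2: Fraction(3, 10),
    3: Fraction(4, 10),
}

def expected_value(dist):
    """Weighted average of the values, with the probabilities as weights."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Weighted average of the squared deviations from the mean."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def variance_shortcut(dist):
    """Shortcut: weighted average of the squared values minus the squared mean."""
    mu = expected_value(dist)
    return sum(x ** 2 * p for x, p in dist.items()) - mu ** 2

print(expected_value(distribution), variance(distribution), variance_shortcut(distribution))
```

Both variance functions return exactly the same value, which illustrates that the shortcut is an algebraic rearrangement rather than an approximation.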

As you will discover, we often create new variables that are functions of other random
variables. The formulas given in the next two boxes allow us to quickly determine the
expected value and variance of these new variables. In the notation used here, X is the
random variable and c is a constant.
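The boxes themselves are not reproduced in these notes. The sketch below assumes they state the standard laws, E(X + c) = E(X) + c, E(cX) = cE(X), V(X + c) = V(X), and V(cX) = c²V(X), and checks them on an invented distribution. The helper `transform` is hypothetical and assumes a nonzero multiplier so that distinct values stay distinct.

```python
from fractions import Fraction

# Invented distribution of X: value -> probability.
x_dist = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expected_value(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def transform(dist, multiplier, shift):
    """Distribution of Y = multiplier * X + shift (multiplier must be nonzero)."""
    return {multiplier * x + shift: p for x, p in dist.items()}

y_dist = transform(x_dist, 3, 2)   # Y = 3X + 2

# Laws of expected value and variance applied to Y = 3X + 2.
print(expected_value(y_dist), 3 * expected_value(x_dist) + 2)
print(variance(y_dist), 3 ** 2 * variance(x_dist))
```

Adding the constant shifts the mean but leaves the variance unchanged, while multiplying by the constant scales the mean by c and the variance by c².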
