SESUG Proceedings (c) SESUG, Inc (http://www.sesug.org). The papers contained in the SESUG proceedings are the property of their authors, unless otherwise stated. Do not reprint without permission. SESUG papers are distributed freely as a courtesy of the Institute for Advanced Analytics (http://analytics.ncsu.edu).
Paper SA11
Cleaning Data the Chauvenet Way

Lily Lin, MedFocus, San Mateo, CA
Paul D Sherman, Independent Consultant, San Jose, CA
ABSTRACT
Throwing away data is a touchy subject. Keep the maverick and you contaminate a general trend. Toss away good data points and you don't know if something important has happened, until it's too late. How do you know what, if any, data to exclude? Chauvenet is the answer. Here is a really gentle test you can apply to any distribution of numbers. It works equally well for normal, skewed, and even multi-modal populations. This article gives you a macro tool for cleaning up your data and separating the good from the bad.
Skill Level: Basic statistics, Data step, SAS/MACRO, and Proc SQL
INTRODUCTION
You often want to compare data sets. You can't really do this point-by-point, so you summarize each set individually and compare their descriptive statistics on the aggregate. By looking only at the summary, you are making an assumption that all observations are related in some way. How do you verify this assumption? By testing for outliers. Values that are spurious or unrelated to the others must be excluded from summarization. In this paper we present a simple, efficient, and gentle macro for filtering a data set. There are many techniques, and the subject of robust statistics is modern and rich though not at all simple. Chauvenet's criterion is easy to understand, can be quickly computed for a billion rows of data, and is even gentle enough to be used on a tiny set of only ten points. We have seen that Chauvenet's criterion is used in astronomy, nuclear technology, geology, epidemiology, molecular biology, radiology and many fields of physical science. It is widely used by government laboratories, industries, and universities. Although Chauvenet's criterion is not currently used in clinical trials, we would like to explore the possibility of applying it to the trials environment as well.
THE FILTERING PROCESS
You want some way to identify which observations in your data set need closer study. It's not appropriate to simply throw away or delete an observation; you must keep it around to look at later. The picture is as follows.

[Figure: the Original Data Set is split into Good Values (continue with analysis) and Bad Values (why are these bad?)]

Our macro does this easily. To scrutinize variable x and split origdata into good and bad pieces, use this macro call:

%chauv(origdata, x, good=theGood, bad=theBad);

Then, for example,

proc means data=theGood; run; * summarize the good ... *;
proc print data=theBad; run; * ... and show us the bad *;
INTERQUARTILE RANGE (IQR) TEST

The IQR test is commonly used in clinical studies to reject data points. The one-way analysis of variance using "box plots" also uses the IQR test. The whiskers of the box are called Outlier Limits and are set 50% further away than the IQR, that is, 1.5 times the IQR beyond each quartile. The range for a data point to be considered "good" is defined below.
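\[ \mathrm{LOL} = P_{25} - 1.5 \times \mathrm{IQR}, \qquad \mathrm{UOL} = P_{75} + 1.5 \times \mathrm{IQR}, \qquad \text{where } \mathrm{IQR} = P_{75} - P_{25} \]

[Figure: geometry of the box plot, showing P25, P75, LOL, and UOL]

An acceptable data value must lie within these limits:

\[ 2.5\,P_{25} - 1.5\,P_{75} \;<\; x \;<\; 2.5\,P_{75} - 1.5\,P_{25} \]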
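The combined inequality follows by substituting the definition of the IQR into each limit. The intermediate step, spelled out here for convenience, is:

\[
\begin{aligned}
\mathrm{LOL} &= P_{25} - 1.5\,(P_{75} - P_{25}) = 2.5\,P_{25} - 1.5\,P_{75},\\
\mathrm{UOL} &= P_{75} + 1.5\,(P_{75} - P_{25}) = 2.5\,P_{75} - 1.5\,P_{25}.
\end{aligned}
\]

For instance, with hypothetical quartiles P25 = 4 and P75 = 9 (values chosen only for illustration), the IQR is 5 and the limits are LOL = -3.5 and UOL = 16.5.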
IQR is based on percentile statistics. Just like the median, percentiles presume a specific ordering of observations in a dataset. The sort step required can be costly, especially when the dataset is huge. For example, a data set with a billion observations takes half an hour to calculate the percentiles, even with the piecewise-parabolic algorithm, while it takes only six minutes to generate the distribution moments μ and σ.

                    Computation Time (minutes:seconds, or seconds)
Number of
Observations        Median, P25, P75      Mean, Std
1,000,000,000       32:31.43              6:07.45
  100,000,000        3:13.79                30.15
   10,000,000          18.62                 1.35
    1,000,000           1.89                 0.14
      100,000           0.20                 0.01
       10,000           0.03                 0.00
        1,000           0.01                 0.00
          100           0.00                 0.01
Sorting and observation numbering cannot be done in SQL because rows of a table are independent among themselves. You cannot compute the median in a database query. Neither can you use the ORDER BY clause in a sub-query. Therefore, we must seek an alternative outlier test which can be performed within a database query environment.
WHO IS CHAUVENET
A character in a play: the auntie of someone whose best friend is a pooka. Only Mrs. Chauvenet knows the truth about Elwood P. Dowd. But this observation is itself an outlier. American mathematician William Chauvenet, 1820-1870, is best known for his clear and simple writing style and pioneering contributions to the U.S. Naval Academy. He mathematically verified the first bridge spanning the Mississippi River, and was the second chancellor of Washington University in St. Louis. In his honor, each year since 1925 a well-written mathematical article receives the Chauvenet award.
CHAUVENET'S CRITERION
If the expected number of measurements at least as bad as the suspect measurement is less than 1/2, then the suspect measurement should be rejected.
PROCEDURE
Let's assume you have a data set with numeric variable x. Suppose there are n observations in your dataset. You want to throw away all observations which are "not good enough". How do you do this? Remember that in clinical practice, no point is "not good enough", so the subject of outliers does not apply.

1. Calculate μ and σ.
2. If n * erfc( |xi - μ| / σ ) < ½, then reject xi.
3. Repeat steps 1 and 2 until step 2 passes or too many points have been removed.
4. Report the final μ, σ, and n.
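A single pass of steps 1 and 2 can be sketched as one query. This is only an illustration under stated assumptions, not the authors' macro (which is listed at the end of the paper): the data set name meas and the column names expected and reject are made up for the example, the 14 values are those of Table 1 in the EXAMPLE section, and erfc is taken to be the SAS ERFC function.

```sas
/* Hypothetical input: the 14 measurements of Table 1 in a data set called MEAS */
data meas;
  input x @@;
  datalines;
8.02 8.16 3.97 8.64 0.84 4.46 0.81 7.74 8.78 9.26 20.46 29.87 10.38 25.71
;
run;

proc sql;
  /* Step 1: n, mu and sigma come from the one-row inline view S.         */
  /* Step 2: expected = n * erfc(|x - mu| / sigma); reject if below 1/2.  */
  create table onepass as
  select m.x,
         s.n * erfc(abs(m.x - s.mu) / s.sigma) as expected,
         (calculated expected < 0.5)           as reject
  from meas as m,
       (select count(x) as n, mean(x) as mu, std(x) as sigma from meas) as s;
quit;

proc print data=onepass; run;
```

Rows with reject=1 would go to the bad data set; steps 3 and 4 repeat the query on the remaining rows. On these 14 values the first pass flags 29.87 and 25.71, in line with the EXAMPLE section.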
When the dust settles, you have two data sets: the set of all good data points, and the set of "bad" points. Although most often we don't care about bad data points, sometimes the bad points tell us much more than do the good points. We must not be too hasty to completely forget about the bad data points, but keep them aside for later and careful evaluation.
Some questions to ask about the "bad" data points, which we will talk about later:

1. Why were these particular points excluded?
2. What do all the bad points have in common?
EXAMPLE
Suppose we have 14 measurements of some parameter, shown below in Table 1. It takes two iterations of the Chauvenet procedure to eliminate all the "bad" data values. The first pass marks 2 values as bad: 29.87 and 25.71. Then, the second pass marks another value bad: 20.46. Further applications of Chauvenet don't mark any more points. As "bad" data points are excluded, notice how the standard deviation significantly improves from 8.77 to 5.17 to 3.38.

Table 1

          Original    Pass #1    Pass #2
            8.02         .          .
            8.16         .          .
            3.97         .          .
            8.64         .          .
            0.84         .          .
            4.46         .          .
            0.81         .          .
            7.74         .          .
            8.78         .          .
            9.26         .          .
           20.46         .       outlier    (shielded outlier)
           29.87      outlier       .
           10.38         .          .
           25.71      outlier       .

avg:       10.51       7.63       6.46
stdev:      8.77       5.17       3.38
n:         14         12         11

The third outlier value, caught and excluded in Pass #2, is called a shielded outlier. At first, its value is small enough, or close enough to the mean, to be considered good. Only when the most extreme values are removed does this next largest value become noticeable. As we will see later, each removal of a data point "lightens the mass" of a distribution. Smaller sample sizes require their values to be closer together. The shielding effect produced by very large values is precisely why we must perform an outlier test iteratively.
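The shielding effect can be checked by hand from the averages and standard deviations in Table 1. The arithmetic below is a sketch that assumes erfc is evaluated as the SAS ERFC function applied directly to the z-score, as the macro at the end of this paper does; the erfc values are rounded.

\[
\begin{aligned}
\text{Pass 1, } x = 29.87:&\quad z = \frac{29.87 - 10.51}{8.77} \approx 2.21, &\quad 14 \times \operatorname{erfc}(2.21) &\approx 14 \times 0.0018 \approx 0.025 < \tfrac{1}{2} &&\Rightarrow \text{reject}\\
\text{Pass 1, } x = 25.71:&\quad z = \frac{25.71 - 10.51}{8.77} \approx 1.73, &\quad 14 \times \operatorname{erfc}(1.73) &\approx 14 \times 0.014 \approx 0.20 < \tfrac{1}{2} &&\Rightarrow \text{reject}\\
\text{Pass 1, } x = 20.46:&\quad z = \frac{20.46 - 10.51}{8.77} \approx 1.13, &\quad 14 \times \operatorname{erfc}(1.13) &\approx 14 \times 0.109 \approx 1.5 > \tfrac{1}{2} &&\Rightarrow \text{keep}\\
\text{Pass 2, } x = 20.46:&\quad z = \frac{20.46 - 7.63}{5.17} \approx 2.48, &\quad 12 \times \operatorname{erfc}(2.48) &\approx 12 \times 0.00045 \approx 0.005 < \tfrac{1}{2} &&\Rightarrow \text{reject}
\end{aligned}
\]

The same value, 20.46, passes while the two larger values inflate the mean and standard deviation, and fails once they are gone.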
HOW IT WORKS
If there are fewer points than you expect, then throw away those few points.

[Figure: expected number of points with value xi, with the ½ level marked. Three points are labeled: "These 3 points are outliers."]

WHAT IS ERFC?
The complementary error function, erfc, is the residual area under the tails of a distribution. Its value gets smaller for values further away from the center of the distribution. Thus, the complementary error function of infinity is always zero. That's like saying there is nothing else to see when you look at the whole picture. There is nothing specific to any particular distribution: erfc is simply the integral of a probability density function. In most database systems and statistics programs, erfc assumes the normal or Gaussian distribution. Using the appropriate calculation or look-up table makes Chauvenet's criterion a universal test.
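For reference, the standard definition is erfc(x) = 1 - erf(x), and for a standard normal variable the two-sided tail area beyond a z-score z equals erfc(z / sqrt(2)). The sketch below, which is not from the original paper, checks this numerically with the SAS functions ERFC and PROBNORM; the variable names are illustrative.

```sas
/* Compare two ways of getting the two-sided normal tail area beyond z */
data _null_;
  do z = 1.65, 1.96, 3.00;
    tail_erfc = erfc(z / sqrt(2));       /* via the complementary error function */
    tail_cdf  = 2 * (1 - probnorm(z));   /* via the standard normal CDF */
    put z= 4.2 tail_erfc= 8.5 tail_cdf= 8.5;
  end;
run;
```

The two columns should agree, with a tail area of about 0.05 at z = 1.96. Note that the macro at the end of this paper passes the z-score to ERFC directly, without the sqrt(2) scaling, so its cut-off rejects somewhat more readily than a look-up in a normal table would.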
WHY THE ½?
This is the magical Chauvenet number. We give each value a 50% chance of survival. Said another way, there must be as many points closer to the mean as there are further away. A value is an outlier if it is so far away that there are hardly any other values greater than it.

Sample size is very important. A distribution with more data points is less likely to be affected by any single point, regardless of its value. Think about sample size as the "mass" of a physical system. It is difficult for a cat to get a bowling ball moving. But a cat can easily play with a ping pong ball all day. Greater mass means more inertia. This analogy is exactly the same for distributions. A greater number of data points means that there is little chance for any single data point to affect the distribution shape. A value must be very far away from the mean in order to "move" the distribution of other points and be considered an outlier. With 200 data points, an outlying value is more than 3σ distant, very far away from the mean! On the other hand, suppose we have a nearly "mass-less", lightweight distribution with only 10 values. A bad value or outlier need only be 1.96*σ away from the mean. Therefore, smaller sample sizes place more rigid requirements on the individual values.

The critical threshold which separates good values from bad is shown in the figure below. Z is the usual z-score, |x-μ|/σ, indicating how far away a value is from the mean. Percentages show how confident we are that a particular value belongs to the distribution. This plot assumes a normal, Gaussian distribution, although the basic concept here is universal and other distributions may be similarly considered by tabulating the appropriate integral.

[Figure: critical z-score Z (vertical axis, 1.1 to 3.3) versus Sample Size ( n ) (horizontal axis, 0 to 200). The region above the curve is labeled OUTLIERS and the region below is labeled GOOD. Confidence levels marked along the curve: 90%, 95%, 98.3%, 99.5%, 99.75%.]

This figure is simply the z-score corresponding to the confidence level of ( 1 - 1/(2N) ). The "2" in this formula is the magical Chauvenet number. Two example calculations make this picture clear.
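\[ N = 5:\quad \frac{1}{2N} = \frac{1}{2 \times 5} = \frac{1}{10} = \frac{10}{100} = 0.10, \qquad 1 - p = 0.10,\ p = 0.90 \ \rightarrow\ \text{table look-up}\ \rightarrow\ z = 1.65 \]

\[ N = 10:\quad \frac{1}{2N} = \frac{1}{2 \times 10} = \frac{1}{20} = \frac{5}{100} = 0.05, \qquad 1 - p = 0.05,\ p = 0.95 \ \rightarrow\ \text{table look-up}\ \rightarrow\ z = 1.96 \]

WHAT HAPPENS IF WE USE 1/3 OR 5/6 OR ...?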
This changes the sensitivity of the outlier test, and corresponds to the nature of the distribution with which you are testing. Using 1/3 means that there must be twice as many smaller values as there are larger values. Similarly, 5/6 means there should be only one smaller value for every five larger values. Shown in the table below are a few values for the Chauvenet factor and a qualitative comparison.

Chauvenet   Distribution   Rejection      Outlier if ...   Acceptance criteria
factor      shape          sensitivity    (for N=10)
1/4         peaky          very lenient   x > 2.25*σ       allows more points closer to mean
1/3         skinny         loose          x > 2.14*σ       ...
1/2         normal         moderate       x > 1.96*σ       equal quantity of points close to and far from mean
2/3         disperse       tight          x > 1.84*σ       ...
5/6         long-tailed    very rigid     x > 1.74*σ       requires more points further from mean
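The table look-ups in this section and the previous one can be reproduced with the inverse normal CDF. The sketch below is an illustration rather than part of the original paper: it assumes a normal distribution, treats factor/N as the two-sided tail area (which reduces to 1/(2N) when the factor is 1/2), and uses the SAS PROBIT function; the data set and variable names are made up for the example.

```sas
/* Critical z-score for a given sample size N and Chauvenet factor */
data thresholds;
  do n = 5, 10, 200;
    do factor = 1/4, 1/3, 1/2, 2/3, 5/6;
      tail = factor / n;            /* two-sided tail area; 1/(2N) when factor = 1/2 */
      p    = 1 - tail;              /* confidence level */
      z    = probit(1 - tail / 2);  /* replaces the normal table look-up */
      output;
    end;
  end;
  format factor 6.4 tail p 7.5 z 5.2;
run;

proc print data=thresholds noobs; run;
```

With the factor at 1/2 this gives a z of roughly 1.645 for N=5, 1.96 for N=10, and just over 3 for N=200, matching the look-ups quoted earlier. The N=10 rows for the other factors agree with the tabulated thresholds to within about 0.01.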
SUGGESTION FOR GOOD PROCESS CONTROL

Monitoring summary statistics from complete detail or raw data sets can lead to many anomalous out-of-control alerts. Too many warnings dilute the effectiveness of a quality control team. Too few, obviously, is not good either; you don't want to be so blind as to pass all the junk. Using a gentle outlier filtering method such as Chauvenet's criterion is a good idea. You split your raw data set into two pieces: good and bad. Existing SPC (Statistical Process Control) methods are then performed on the good data set. There is no distribution among the bad observations; they are all useless garbage. However, the quantity of them is useful. Simply count the number of observations in the bad data set and watch a trend chart of that. A consistent, in-control process should have a similar number of outliers over time. If, for example, one day you come in to work and see your process within SPC control yet with almost no outliers, then you know something has changed and must be looked at.

[Figure: the Original Data Set is split into Good Values and Bad Values. The good values feed an SPC chart over time t with limits μ-3*σ, μ, and μ+3*σ. The bad values feed a trend chart of the outlier count (# outliers) over time t with limits nmin and nmax; a count outside these limits means something changed.]
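Counting the outliers of each run and appending the count to a running trend table might look like the sketch below. It is an illustration only: theBad here is a hand-made stand-in for the bad output of the macro (the three outliers of Table 1), and the names trend_today, outlier_trend, rundate, n_outliers, and n_out are hypothetical.

```sas
/* Stand-in for the BAD output of one run: the three outliers of Table 1 */
data theBad;
  input x @@;
  datalines;
29.87 25.71 20.46
;
run;

/* Count the outliers of this run ... */
proc sql noprint;
  select count(*) into :n_out from theBad;
quit;

/* ... and stamp the count with the run date */
data trend_today;
  rundate    = today();
  n_outliers = &n_out.;
  format rundate date9.;
run;

/* Append to the running trend table (created on first use) */
proc append base=outlier_trend data=trend_today;
run;
```

Plotting n_outliers against rundate from outlier_trend then gives the trend chart described above, to which limits such as nmin and nmax can be applied.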
WHAT MIGHT GO WRONG
It might appear that we assume normality of the data distribution. Our first step is to compute mean and sigma. However, we only consider the z-score, which is the ratio of the deviation from the mean to sigma, so their distribution inference is nullified. When the mean or sigma does not exist, such as for Lorentzian distributions found in Nuclear Magnetic Resonance experiments, other criteria must be applied. Chauvenet's criterion has trouble when the data distribution is strongly bi-modal. When there are widely separated, resolvable modes, all data points will be rejected. That's why we put a "stop limit" in step 3 of the procedure so that the entire data set isn't whacked away.
In practice, a bi-modal or multi-modal distribution of a parameter usually means that we have mixed disparate data sources which should not be mixed at all.
CONCLUSION
We have presented a simple, efficient, and gentle macro to make a cleaner data set. It is easier to interpret a summary of test results when the raw results are clean and all related (i.e., not spurious). The number of points excluded from summarization is an important parameter. Keep a log of which points were excluded. If many events exclude the same measurement, look for a systematic trend such as wrong test position, incorrect precondition, etc. With Chauvenet's criterion you can be sure your analyses are free of invisible rabbits.
AUTHOR CONTACT
Lily Lin
2715 South Norfolk Street, Apt. 103
San Mateo, CA 94403
(973) 978-8292
lily4lin@yahoo.com

Paul D Sherman
335 Elan Village Lane, Apt. 424
San Jose, CA 95134
(408) 383-0471
sherman@idiom.com
www.idiom.com/~sherman/paul/pubs/chauv

On-line Demonstration
http://www.idiom.com/~sherman/paul/pubs/demo/chauvdemo.html
REFERENCES
Chase, Mary Coyle. 1953. Harvey. UK: Oxford University Press.
Dixon, W. J. (1953). "Processing data for outliers." Biometrics, vol. 9, pp. 74-89.
Ferguson, T. S. (1961). "On the rejection of outliers." Proceedings of the 4th Berkeley Symp. Mathematical Statistics and Probability, 1, pp. 253-287.
Grubbs, F. (1969). "Procedures for Detecting Outlying Observations in Samples." Technometrics, 11, pp. 1-21.
Herzog, Erik D. (2003, Jan. 24). "Picturing Our Past." In Record, vol. 27, no. 17, St. Louis, MO: Washington University. Retrieved July 27, 2007, from http://record.wustl.edu/2003/1-24-03/picturing_our_past.html
Mathematical Association of America. "The Mathematical Association of America's Chauvenet Prize." Retrieved July 27, 2007, from http://www.maa.org/awards/chauvent.html
Peirce, B. (1852). "Criterion for the rejection of doubtful observations." Astronomical Journal, 11 (21), pp. 161-163.
Ross, Stephen M. (2003). "Peirce's Criterion for the Elimination of Suspect Experimental Data." J. Engr. Technology.
Taylor, John R. 1997. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, second ed. Herndon, VA: University Science Books.
Tietjen, Gary L. and Roger H. Moore. (1972, August). "Some Grubbs-Type Statistics for the Detection of Several Outliers." Technometrics, 14 (3), pp. 583-597.
ACKNOWLEDGMENTS
The authors would like to thank Grant Luo for creating and discussing the IQR comparison study. Annie Perlin deserves a warm round of applause for her role as the other Chauvenet. We greatly appreciate the generosity of MedFocus for allowing us to work on this article.
TRADEMARK INFORMATION
SAS, SAS Certified Professional, and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute, Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
THE CHAUVENET OUTLIER FILTERING MACRO

options nosource nonotes;
/* ====================================================
 * CHAUV - Chauvenet's criteria data cleaner
 *
 * INDAT   - input dataset
 * VAR     - variable name to process
 * GOOD    - output dataset for the GOOD observations
 * BAD     - ditto, for the OUTLIERS. Dot (.) throws away
 * CHAUFAC - sensitivity factor. positive, less than 1.
 *
 * 1.0 2007-04-24 pds lpl initial cut
 * ==================================================== */

* perform a single step of filtering *;
%macro chauv0(iodat, var, chaufac, macvar, loopnum);
  * (re)compute summary on only the good points *;
  proc means data=&iodat. noprint mean std;
    where isgood=1;
    var &var.;  * summarize only the variable under test *;
    output out=summ (drop=_type_ _freq_) mean=x std=s n=n;
  run;
  * apply the test; the deviation is signed, so only values above the mean can fail *;
  proc sql;
    create table &iodat. as (
      SELECT raw.&var.,
             summ.n*erfc((raw.&var.-summ.x)/summ.s)>&chaufac. AS isgood
      FROM summ inner join &iodat. AS raw ON raw.isgood=1
    );
  quit;
  * collect the points rejected in this pass and signal that another pass is needed *;
  %let &macvar.=TRUE;
  data loopdat&loopnum.;
    set &iodat.;
    if isgood eq 0 then do;
      call symput("&macvar.", 'FALSE');
      output loopdat&loopnum.;
    end;
  run;
%mend chauv0;

* the main macro *;
%macro chauv(indat, var, good=outdatg, bad=outdatb, chaufac=0.5);
  %local isAllgood loopnum i;
  * initialize all data points GOOD *;
  * assumes there is not already a variable called ISGOOD *;
  data chaudat;
    set &indat.;
    isgood=1;
  run;
  * loop forever until all values pass the test *;
  %let isAllgood=FALSE;
  %let loopnum=1;
  %do %until(&isAllgood. eq TRUE);
    %chauv0(chaudat, &var., &chaufac., isAllgood, &loopnum.);
    %let loopnum=%eval(&loopnum.+1);
  %end;
  data &good. (drop=isgood);
    set chaudat;
  run;
  %if &bad. ne . %then %do;
    data &bad. (drop=isgood);
      set %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
    run;
  %end;
  proc datasets lib=work nodetails nolist nowarn;
    delete chaudat summ;
    delete %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
  quit;
%mend chauv;

***************************************
*** CLEANING DATA THE CHAUVENET WAY ***
***         a demonstration         ***
*** by Lily Lin and Paul D Sherman  ***
***************************************;

*** fake some data ***;
proc sql noprint;
  create table raw (value num);
  insert into raw (value)
    values (8.02)  values (8.16)  values (3.97)  values (8.64)
    values (0.84)  values (4.46)  values (0.81)  values (7.74)
    values (8.78)  values (9.26)  values (20.46) values (29.87)
    values (10.38) values (25.71)
  ;
quit;

%chauv(raw, value, good=theGood, bad=theBad);
proc means data=theGood; run; * summarize the good ... *;
proc print data=theBad; run; * ... and show us the bad *;

%chauv(raw, value, good=raw, bad=.); * overwrite the original data *;
EXAMPLE - PERCENTILES AND QUARTILES TAKE A LONG TIME TO COMPUTE

Percentile-based calculations take significantly longer to perform than do distribution moments. This is due to the former involving internal sorting steps. These results appear to be quite general, and independent of how many CPUs or threads are allocated to a system.

options nosource notes;
%macro means(n);
  * generate n exponentially distributed values *;
  data dum;
    do i=1 to &n.;
      x=ranexp(6789);
      output;
    end;
  run;
  * percentile statistics *;
  proc means data=dum noprint nonobs p50 p25 p75 qmethod=p2 qntldef=5;
    var x;
    output out=dums (drop=_type_ _freq_) median=p50 p25=p25 p75=p75;
  run;
  * distribution moments *;
  proc means data=dum noprint nonobs mean std;
    var x;
    output out=dumss (drop=_type_ _freq_) mean=avg std=std;
  run;
%mend;
%means(1000000000);
%means(100000000);
%means(10000000);
%means(1000000);
%means(100000);
%means(10000);
%means(1000);
%means(100);
%put *** DONE ***;
EXAMPLE - COMPARING IQR TEST AND CHAUVENET'S CRITERION

The IQR test is a single-pass algorithm. Therefore, its number of outlier values is constant. For small sample sizes of a normal distribution with a few artificially "bad" points thrown in, Chauvenet's criterion somewhat overestimates the number of bad points. On the other hand, you might think that IQR somewhat underestimates the number of bad points when the data set is small. Remember that smaller sample sizes place stricter rules on what is an outlying value. When the data set is large, more than 10,000 observations, the situation is reversed. IQR rejects many more points than does Chauvenet. The cross-over point, where both tests report about the same quantity of outlying values, seems to be about 4,000 observations. Therefore, with very large data sets Chauvenet's criterion is superior. It rejects only those few really bad values, and without percentile calculations is much more time and memory efficient in its calculations.

Loop      N=200          N=4,000        N=10,000       N=1,000,000
Number    IQR   Chauv    IQR   Chauv    IQR   Chauv    IQR    Chauv
1         10    10       39    14       84    29       7032   408
2         10    15       39    39       84    49       7032   430
3         10    20       39    42       84    55       7032   431
4         10    22       39    44       84    57       .      .
5         10    23       39    45       .     .        .      .
The code which generates this data is shown below.

%let n_obs=10000;
data a;
  group=1;
  do i= 1 to 5;       x=25+0.7*rannor(6789); output; end;  * artificially bad, high *;
  do i= 6 to 10;      x=3+0.2*rannor(6789);  output; end;  * artificially bad, low *;
  do i=11 to &n_obs;  x=10+1*rannor(6789);   output; end;  * the normal bulk *;
run;

%macro mycompare;
  data final;
    merge a a_good(keep=i in=b) temp(keep=i in=c);
    by i;
    format flag $45.;
    if (not b) and (not c) then flag="Removed by Both IQR and Chauv Method";
    if (not b) and c       then flag="Removed by IQR ONLY";
    if b and (not c)       then flag="Removed by Chauv Method ONLY";
  run;
  proc freq data=final;
    table flag;
  run;
%mend mycompare;

%macro iqr(inds=a);
  proc univariate data=&inds noprint;
    var x;
    output out=sum_&inds mean=mean median=p50 q1=p25 q3=p75 std=std;
  run;
  data sum_&inds;
    set sum_&inds;
    range=p75-p25;
    UL=p75+1.5*range;
    LL=p25-1.5*range;
    call symput('UL', UL);
    call symput('LL', LL);
  run;
  title "Inter Quartile Range Method";
  title2 "All Data Point";
  footnote "Program: SESUG 2007 Paper SA-11  Output: outlier";
  data &inds._good &inds._bad;
    set &inds;
    if &LL<=x<=&UL then output &inds._good;
    else output &inds._bad;
  run;
  proc sql;
    title "Number of Bad Records";
    select count(*) into :n_bad from &inds._bad;
  quit;
  title "Inter Quartile Range Method";
  title2 "Good Data Point";
%mend iqr;
%iqr(inds=a);

%macro chauv(inds=a, loop=5);
  %do i=1 %to &loop;
    proc univariate data=&inds noprint;
      var x;
      output out=sum_&inds n=n mean=mean median=p50 q1=p25 q3=p75 std=std;
    run;
    proc sql;
      create table &inds. as
        select r.i, r.x,
               s.n*erfc(abs(r.x-s.mean)/s.std)>0.5 as is_good
        from sum_&inds as s inner join &inds as r on 1=1;
      title "Number of Bad Records in Loop &i";
      select count(*) into :n_bad from &inds where is_good=0;
    quit;
    data &inds;
      set &inds;
      group=1;
      where is_good=1;
    run;
    %if &n_bad=0 %then %do;
      %let i=1000;  * nothing rejected in this pass, so force the loop to end *;
      title "Chauvenet Method";
      title2 "Good Data Point after Loop End";
    %end;
    %else %do;
      title "Chauvenet Method";
      title2 "Good Data Point after Loop &i";
    %end;
    %mycompare;
  %end;
%mend chauv;

data temp;
  set a;
run;
%chauv(inds=temp, loop=50);