
SESUG Proceedings (c) SESUG, Inc (http://www.sesug.org). The papers contained in the SESUG proceedings are the property of their authors, unless otherwise stated. Do not reprint without permission. SESUG papers are distributed freely as a courtesy of the Institute for Advanced Analytics (http://analytics.ncsu.edu).

Paper SA11

Cleaning Data the Chauvenet Way


Lily Lin, MedFocus, San Mateo, CA
Paul D Sherman, Independent Consultant, San Jose, CA

ABSTRACT
Throwing away data is a touchy subject. Keep the maverick and you contaminate a general trend. Toss away good data points and you don't know if something important has happened until it's too late. How do you know what, if any, data to exclude? Chauvenet is the answer. Here is a really gentle test you can apply to any distribution of numbers. It works equally well for normal, skewed, and even multi-modal populations. This article gives you a macro tool for cleaning up your data and separating the good from the bad.

Skill Level: Basic statistics, Data step, SAS/MACRO, and Proc SQL

INTRODUCTION
You often want to compare data sets. You can't really do this point-by-point, so you summarize each set individually and compare their descriptive statistics on the aggregate. By looking only at the summary, you are making an assumption that all observations are related in some way. How do you verify this assumption? By testing for outliers. Values that are spurious or unrelated to the others must be excluded from summarization.

In this paper we present a simple, efficient, and gentle macro for filtering a data set. There are many techniques, and the subject of robust statistics is modern and rich though not at all simple. Chauvenet's criterion is easy to understand, can be quickly computed for a billion rows of data, and is even gentle enough to be used on a tiny set of only ten points.

We have seen that Chauvenet's criterion is used in astronomy, nuclear technology, geology, epidemiology, molecular biology, radiology and many fields of physical science. It is widely used by government laboratories, industries, and universities. Although Chauvenet's criterion is not currently used in clinical trials, we would like to explore the possibility of applying it to the trials environment as well.

THE FILTERING PROCESS

You want some way to identify what observations in your data set need closer study. It's not appropriate to simply throw away or delete an observation; you must keep it around to look at later. The picture is as follows.

[Figure: the Original Data Set is split into Good Values (→ continue with analysis) and Bad Values (→ why are these bad?).]

Our macro does this easily. To scrutinize variable x and split origdata into good and bad pieces, use this macro call:

    %chauv(origdata, x, good=theGood, bad=theBad);

Then, for example,

    proc means data=theGood; run; /* summarize the good ... */
    proc print data=theBad; run;  /* ... and show us the bad */

INTERQUARTILE RANGE (IQR) TEST


The IQR test is commonly used in clinical studies to reject data points. The one-way analysis of variance using "box plots" also uses the IQR test. Whiskers of the box are called Outlier Limits and are set 50% further away than the IQR, that is, 1.5 × IQR beyond the quartiles. The range for a data point to be considered "good" is defined below.

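$$\mathrm{LOL} = P_{25} - 1.5 \times \mathrm{IQR}, \qquad \mathrm{UOL} = P_{75} + 1.5 \times \mathrm{IQR}, \qquad \text{where } \mathrm{IQR} = P_{75} - P_{25}$$

[Figure: geometry of the box plot, marking LOL, P25, P75, and UOL.]

An acceptable data value must lie within these limits:

$$2.5 \times P_{25} - 1.5 \times P_{75} < x < 2.5 \times P_{75} - 1.5 \times P_{25}$$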
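As a quick check of the limits, here is a small worked example with made-up quartiles (the values 10 and 20 are assumptions chosen only for easy arithmetic, not taken from the paper's data). Both forms of the limits give the same answer:

$$P_{25} = 10, \quad P_{75} = 20 \;\Rightarrow\; \mathrm{IQR} = 20 - 10 = 10$$

$$\mathrm{LOL} = 10 - 1.5 \times 10 = 2.5 \times 10 - 1.5 \times 20 = -5, \qquad \mathrm{UOL} = 20 + 1.5 \times 10 = 2.5 \times 20 - 1.5 \times 10 = 35$$

With these quartiles, any value between -5 and 35 would be kept.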

IQR is based on percentile statistics. Just like the median, percentiles presume a specific ordering of observations in a dataset. The sort step required can be costly, especially when the dataset is huge. For example, a data set with a billion observations takes half an hour to calculate the percentiles, even with the piecewise-parabolic algorithm, while it takes only six minutes to generate the distribution moments μ and σ.

    Number of           Computation Time (min:sec)
    Observations        Median, P25, P75    Mean, Std
    1,000,000,000           32:31.43         6:07.45
      100,000,000            3:13.79           30.15
       10,000,000              18.62            1.35
        1,000,000               1.89            0.14
          100,000               0.20            0.01
           10,000               0.03            0.00
            1,000               0.01            0.00
              100               0.00            0.01

Sorting and observation numbering cannot be done in SQL because rows of a table are independent among themselves. You cannot compute the median in a database query. Neither can you use the ORDER BY clause in a sub-query. Therefore, we must seek an alternative outlier test which can be performed within a database query environment.

WHO IS CHAUVENET
A character in a play, the auntie of someone whose best friend is a pooka. Only Mrs. Chauvenet knows the truth about Elwood P. Dowd. But this observation is itself an outlier. American mathematician William Chauvenet, 1820-1870, is best known for his clear and simple writing style and pioneering contributions to the U.S. Naval Academy. He mathematically verified the first bridge spanning the Mississippi River, and was the second chancellor of Washington University in St. Louis. In his honor, each year since 1925 a well-written mathematical article receives the Chauvenet Prize.

CHAUVENET'S CRITERION
If the expected number of measurements at least as bad as the suspect measurement is less than ½, then the suspect measurement should be rejected.
PROCEDURE

Let's assume you have a data set with numeric variable x. Suppose there are n observations in your dataset. You want to throw away all observations which are "not good enough". How do you do this? Remember that in clinical practice every point is considered good enough, so the subject of outliers does not apply.

1. Calculate μ and σ.
2. If n × erfc( |x_i − μ| / σ ) < ½, then reject x_i.
3. Repeat steps 1 and 2 until step 2 passes or too many points have been removed.
4. Report the final μ, σ, and n.
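Steps 1 and 2 can be sketched as a single PROC SQL pass. This is a minimal illustration, not the authors' macro: the data set name MEAS, the in-line view alias S, and the column names EXPECTED and ISGOOD are assumptions made for the example. The values are the 14 measurements used in this paper's example, and the absolute deviation follows step 2 as written above.

```sas
/* Illustrative input: the 14 example measurements, in WORK.MEAS (name assumed) */
data meas;
  input x @@;
  datalines;
8.02 8.16 3.97 8.64 0.84 4.46 0.81 7.74 8.78 9.26 20.46 29.87 10.38 25.71
;
run;

/* One pass: n, mean and std come from an in-line view; every row is then flagged */
proc sql;
  create table flagged as
  select m.x,
         s.n * erfc(abs(m.x - s.mu) / s.sigma) as expected,  /* expected count */
         (calculated expected >= 0.5)          as isgood     /* 1 = keep, 0 = reject */
  from meas as m,
       (select count(x) as n, mean(x) as mu, std(x) as sigma from meas) as s;
quit;

proc print data=flagged; run;
```

Step 3 would repeat this pass on the rows with isgood=1 until no further row is flagged. On this data the first pass should flag 29.87 and 25.71.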

When the dust settles, you have two data sets: the set of all good data points, and the set of "bad" points. Although most often we don't care about bad data points, sometimes the bad points tell us much more information than do the good points. We must not be too hasty to completely forget about the bad data points, but keep them aside for later and careful evaluation.

Some questions to ask about the "bad" data points, which we will talk about later:

1. Why were these particular points excluded?
2. What do all the bad points have in common?

EXAMPLE
Suppose we have 14 measurements of some parameter, shown below in Table 1. It takes two iterations of the Chauvenet procedure to eliminate all the "bad" data values. The first pass marks 2 values as bad: 29.87 and 25.71. Then, the second pass marks another value bad: 20.46. Further applications of Chauvenet don't mark any more points. As "bad" data points are excluded, notice how the standard deviation significantly improves from 8.77 to 5.17 to 3.38.

    Table 1
                Original    Pass #1    Pass #2
                  8.02         .          .
                  8.16         .          .
                  3.97         .          .
                  8.64         .          .
                  0.84         .          .
                  4.46         .          .
                  0.81         .          .
                  7.74         .          .
                  8.78         .          .
                  9.26         .          .
                 20.46         .       outlier    (shielded outlier)
                 29.87      outlier       .
                 10.38         .          .
                 25.71      outlier       .
      avg:       10.51       7.63       6.46
      stdev:      8.77       5.17       3.38
      n:         14         12         11

The third outlier value, caught and excluded in Pass #2, is called a shielded outlier. At first, its value is small enough, or close enough to the mean, to be considered good. Only when the most extreme value is removed does this next largest value become noticeable. As we will see later, each removal of a data point "lightens the mass" of a distribution. Smaller sample sizes require their values to be closer together. The shielding effect produced by very large values is precisely why we must perform an outlier test iteratively.
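The first pass can be checked by hand for the three largest values. This sketch assumes that erfc is applied directly to the z-score, as in the macro listing at the end of this paper, and the erfc values are rounded approximations. With n = 14, μ = 10.51 and σ = 8.77:

$$x = 29.87: \quad z = \frac{29.87 - 10.51}{8.77} \approx 2.21, \qquad 14 \times \mathrm{erfc}(2.21) \approx 0.025 < \tfrac{1}{2} \;\Rightarrow\; \text{reject}$$

$$x = 25.71: \quad z = \frac{25.71 - 10.51}{8.77} \approx 1.73, \qquad 14 \times \mathrm{erfc}(1.73) \approx 0.20 < \tfrac{1}{2} \;\Rightarrow\; \text{reject}$$

$$x = 20.46: \quad z = \frac{20.46 - 10.51}{8.77} \approx 1.13, \qquad 14 \times \mathrm{erfc}(1.13) \approx 1.5 > \tfrac{1}{2} \;\Rightarrow\; \text{keep}$$

The value 20.46 survives only because the two larger values inflate σ. Once they are removed, μ drops to 7.63 and σ to 5.17, its z-score rises to about 2.48, and the second pass rejects it.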

HOW IT WORKS
If there are fewer points than you expect, then throw away those few points.

[Figure: expected number of points with value x_i, with the levels n_e and ½ × n_e marked; three points are labeled "These 3 points are outliers".]

WHAT IS ERFC?

The complementary error function, erfc, is the residual area under the tails of a distribution. Its value gets smaller for values further away from the center of the distribution. Thus, the complementary error function value at infinity is always zero. That's like saying there is nothing else to see when you look at the whole picture. There is nothing specific to any particular distribution. erfc simply is the integral of a probability density function. In most database systems and statistics programs, erfc assumes the normal or Gaussian distribution. Using the appropriate calculation or look-up table makes Chauvenet's criterion a universal test.
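For reference, the standard definition of erfc and its link to the normal distribution can be written as follows. This is textbook material rather than something stated in this paper:

$$\mathrm{erfc}(x) = 1 - \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt, \qquad \mathrm{erfc}(0) = 1, \quad \mathrm{erfc}(\infty) = 0$$

$$P\left( |X - \mu| > z\,\sigma \right) = \mathrm{erfc}\!\left( \frac{z}{\sqrt{2}} \right) \quad \text{for normally distributed } X$$

Under this definition the two-tailed normal probability needs the factor 1/√2 inside erfc. The macro listing in this paper passes the z-score to erfc without that factor, so for a given n its cut-off falls at a smaller z than a two-tailed normal table gives. For n = 10, for example, erfc(z) = 0.05 near z ≈ 1.39, while erfc(z/√2) = 0.05 at z ≈ 1.96.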
WHY THE ½?

This is the magical Chauvenet number. We give each value a 50% chance of survival. Said another way, there must be as many points closer to the mean as there are further away. A value is an outlier if it is so far away that there are hardly any other values greater than it.

Sample size is very important. A distribution with more data points is less likely to be affected by any single point, regardless of its value. Think about sample size as the "mass" of a physical system. It is difficult for a cat to get a bowling ball moving. But a cat can easily play with a ping pong ball all day. Greater mass means more inertia. This analogy is exactly the same for distributions. Greater numbers of data points means that there is little chance for any single data point to affect the distribution shape. A value must be very far away from the mean in order to "move" the distribution of other points and be considered an outlier. With 200 data points, an outlying value is more than 3σ distant, very far away from the mean! On the other hand, suppose we have a nearly "mass-less", lightweight distribution with only 10 values. A bad value or outlier need only be 1.96 × σ away from the mean. Therefore, smaller sample sizes place more rigid requirements on the individual values.

The critical threshold which separates good values from bad is shown in the figure below. Z is the usual z-score, |x − μ| / σ, indicating how far away a value is from the mean. Percentages show how confident we are that a particular value belongs to the distribution. This plot assumes a normal, Gaussian distribution, although the basic concept here is universal and other distributions may be similarly considered by tabulating the appropriate integral.
[Figure: critical z-score (vertical axis, 1.1 to 3.3) versus Sample Size (n) (horizontal axis, 0 to 200). The curve separates OUTLIERS (above) from GOOD values (below); confidence levels marked along the curve are 90%, 95%, 98.3%, 99.5%, and 99.75%.]

This figure is simply the z-score corresponding to the confidence level of ( 1 − 1/(2N) ). The "2" in this formula is the magical Chauvenet number. Two example calculations make this picture clear.

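$$N = 5: \quad \frac{1}{2N} = \frac{1}{2 \times 5} = \frac{1}{10} = \frac{10}{100} = 0.10, \qquad 1 - p = 0.10, \quad p = 0.90 \;\rightarrow\; \text{table look-up} \;\rightarrow\; z = 1.65$$

$$N = 10: \quad \frac{1}{2N} = \frac{1}{2 \times 10} = \frac{1}{20} = \frac{5}{100} = 0.05, \qquad 1 - p = 0.05, \quad p = 0.95 \;\rightarrow\; \text{table look-up} \;\rightarrow\; z = 1.96$$


WHAT HAPPENS IF WE USE 1/3 OR 5/6 OR ...?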

This changes the sensitivity of the outlier test, and corresponds to the nature of the distribution with which you are testing. Using 1/3 means that there must be twice as many smaller values as there are larger values. Similarly, 5/6 means there should be only one smaller value for every five larger values. Shown in the table below are a few values for the Chauvenet factor and a qualitative comparison.

    Chauvenet   Distribution   Rejection      Outlier if      Acceptance criteria
    factor      shape          sensitivity    (for N=10)
    1/4         peaky          very lenient   x > 2.25 × σ    allows more points closer to mean
    1/3         skinny         loose          x > 2.14 × σ    ...
    1/2         normal         moderate       x > 1.96 × σ    equal quantity of points close to and far from mean
    2/3         disperse       tight          x > 1.84 × σ    ...
    5/6         long-tailed    very rigid     x > 1.74 × σ    requires more points further from mean
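The thresholds in the table can be approximated with a short data step. This is a sketch under the assumption that the table was built from the two-tailed normal distribution, so that the critical z-score is the normal quantile at 1 − chaufac/(2N). The data set name THRESHOLDS and the variables P_TWO_TAIL and Z_CRIT are made up for the illustration.

```sas
/* Critical z-score for several Chauvenet factors at N = 10 (illustrative sketch) */
data thresholds;
  n = 10;
  do chaufac = 1/4, 1/3, 1/2, 2/3, 5/6;
    p_two_tail = chaufac / n;              /* total probability left in both tails */
    z_crit     = probit(1 - p_two_tail/2); /* standard normal quantile             */
    output;
  end;
run;

proc print data=thresholds noobs;
  format chaufac 6.4 p_two_tail 6.4 z_crit 5.2;
run;
```

For the factor 1/2 this gives the 1.96 quoted in the text. The other rows should land close to the table's entries. Changing n to 200 shows how the threshold moves past 3σ for larger samples.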

SUGGESTION FOR GOOD PROCESS CONTROL


Monitoring summary statistics from complete detail or raw data sets can lead to many anomalous out-of-control alerts. Too many warnings dilute the effectiveness of a quality control team. Too few, obviously, is not good either; you don't want to be so blind as to pass all the junk.

Using a gentle outlier filtering method such as Chauvenet's criterion is a good idea. You split your raw data set into two pieces: good and bad. Existing SPC (Statistical Process Control) methods are then performed on the good data set. There is no distribution among the bad observations; they are all useless garbage. However, the quantity of them is useful. Simply count the number of observations in the bad data set and watch a trend chart of that. A consistent, in-control process should have a similar number of outliers over time. If, for example, one day you come in to work and see your process within SPC control yet with almost no outliers, then you know something has changed and must be looked at.

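[Figure: the Original Data Set is split into Good Values, which feed an SPC chart (μ, μ+3σ and μ−3σ versus time t), and Bad Values, whose Count (# outliers) is trended versus time t between nmin and nmax; a count outside these limits indicates that something changed.]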
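Counting the outliers for a trend chart takes only a few statements. The sketch below is one possible way to do it, not part of the authors' macro: the log data set OUTLIER_LOG, the work data set OUTLIER_TODAY, and the variables RUN_DATE and N_BAD are assumed names, and the small THEBAD data set merely stands in for the BAD output of %chauv.

```sas
/* Stand-in for the BAD data set that %chauv would produce (values assumed) */
data theBad;
  input value @@;
  datalines;
29.87 25.71 20.46
;
run;

/* Count today's outliers into a macro variable */
proc sql noprint;
  select count(*) into :n_bad from theBad;
quit;

/* Append one row per run to a running log; the base is created if absent */
data outlier_today;
  run_date = today();
  n_bad    = &n_bad.;
  format run_date date9.;
run;

proc append base=outlier_log data=outlier_today;
run;
```

Plotting N_BAD against RUN_DATE from the log then gives the outlier-count trend described above.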

WHAT MIGHT GO WRONG

It might appear that we assume normality of the data distribution. Our first step is to compute mean and sigma. However, we only consider the z-score, which is the ratio of the deviation from the mean to sigma, so their distribution inference is nullified. When the mean or sigma does not exist, such as for Lorentzian distributions found in Nuclear Magnetic Resonance experiments, other criteria must be applied. Chauvenet's criterion has trouble when the data distribution is strongly bi-modal. When there are widely separated, resolvable modes, all data points will be rejected. That's why we put a "stop limit" in step 3 of the procedure so that the entire data set isn't whacked away.

In practice, a bi-modal or multi-modal distribution of a parameter usually means that we have mixed disparate data sources which should not be mixed at all.
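One way to express such a stop limit in macro code is to add a pass counter to the loop condition. This is only a sketch of the idea: the macro name LOOPLIMIT and the names MAXLOOP and DONE are hypothetical, and the filtering pass itself is left as a placeholder comment.

```sas
/* Sketch of a stop limit: leave the loop when a pass removes nothing */
/* or when MAXLOOP passes have run (all names here are hypothetical)  */
%macro looplimit(maxloop=10);
  %local loopnum done;
  %let loopnum=1;
  %let done=FALSE;
  %do %until(&done. eq TRUE or &loopnum. gt &maxloop.);
    /* one filtering pass would go here, setting DONE to TRUE */
    /* when no further observation is rejected                */
    %let loopnum=%eval(&loopnum.+1);
  %end;
  %put NOTE: stopped after %eval(&loopnum.-1) passes.;
%mend looplimit;

%looplimit(maxloop=3);
```

A limit on the fraction of observations removed could be tested in the same condition.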

CONCLUSION
We have presented a simple, efficient, and gentle macro to make a cleaner data set. It is easier to interpret a summary of test results when the raw results are clean and all related (i.e., not spurious). The number of points excluded from summarization is an important parameter. Keep a log of which points were excluded. If many events exclude the same measurement, look for a systematic trend such as wrong test position, incorrect precondition, etc. With Chauvenet's criterion you can be sure your analyses are free of invisible rabbits.

AUTHOR CONTACT
Lily Lin
2715 South Norfolk Street, Apt. 103
San Mateo, CA 94403
(973) 978-8292
lily4lin@yahoo.com

Paul D Sherman
335 Elan Village Lane, Apt. 424
San Jose, CA 95134
(408) 383-0471
sherman@idiom.com

www.idiom.com/~sherman/paul/pubs/chauv

On-line Demonstration
http://www.idiom.com/~sherman/paul/pubs/demo/chauvdemo.html

REFERENCES
Chase, Mary Coyle. 1953. Harvey. UK: Oxford University Press.
Dixon, W. J. (1953). "Processing data for outliers." Biometrics, vol. 9, pp. 74-89.
Ferguson, T. S. (1961). "On the rejection of outliers." Proceedings of the 4th Berkeley Symp. Mathematical Statistics and Probability, 1, pp. 253-287.
Grubbs, F. (1969). "Procedures for Detecting Outlying Observations in Samples." Technometrics, 11, pp. 1-21.
Herzog, Erik D. (2003, Jan. 24). "Picturing Our Past." In Record, vol. 27, no. 17, St. Louis, MO: Washington University. Retrieved July 27, 2007, from http://record.wustl.edu/2003/1-24-03/picturing_our_past.html
Mathematical Association of America. "The Mathematical Association of America's Chauvenet Prize." Retrieved July 27, 2007, from http://www.maa.org/awards/chauvent.html
Peirce, B. (1852). "Criterion for the rejection of doubtful observations." Astronomical Journal, 11 (21), pp. 161-163.
Ross, Stephen M. (2003). "Peirce's Criterion for the Elimination of Suspect Experimental Data." J. Engr. Technology.
Taylor, John R. 1997. An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, second ed. Herndon, VA: University Science Books.
Tietjen, Gary L. and Roger H. Moore. (1972, August). "Some Grubbs-Type Statistics for the Detection of Several Outliers." Technometrics, 14 (3), pp. 583-597.

ACKNOWLEDGMENTS
The authors would like to thank Grant Luo for creating and discussing the IQR comparison study. Annie Perlin deserves a warm round of applause for her role as the other Chauvenet. We greatly appreciate the generosity of MedFocus for allowing us to work on this article.

TRADEMARK INFORMATION
SAS, SAS Certified Professional, and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute, Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

THE CHAUVENET OUTLIER FILTERING MACRO


    options nosource nonotes;

    /* ====================================================
     * CHAUV - Chauvenet's criteria data cleaner
     *
     * INDAT   - input dataset
     * VAR     - variable name to process
     * GOOD    - output dataset for the GOOD observations
     * BAD     - ditto, for the OUTLIERS. Dot (.) throws away
     * CHAUFAC - sensitivity factor. positive, less than 1.
     *
     * 1.0 2007-04-24 pds lpl initial cut
     * ==================================================== */

    /* perform a single step of filtering */
    %macro chauv0(iodat, var, chaufac, macvar, loopnum);
      /* (re)compute summary on only the good points */
      proc means data=&iodat. noprint mean std;
        where isgood=1;
        var &var.;  /* restrict the summary to the tested variable */
        output out=summ (drop=_type_ _freq_) mean=x std=s n=n;
      run;

      /* apply the test */
      /* note: the deviation is signed here, so only values above the  */
      /* mean can fail (erfc of a negative argument exceeds 1); wrap   */
      /* it in abs() for the two-sided form of step 2 of the PROCEDURE */
      proc sql;
        create table &iodat. as (
          SELECT raw.&var.,
                 summ.n*erfc((raw.&var.-summ.x)/summ.s)>&chaufac. AS isgood
          FROM summ inner join &iodat. AS raw
            ON raw.isgood=1
        );
      quit;

      /* assume all is good until a rejected point proves otherwise */
      %let &macvar.=TRUE;
      data loopdat&loopnum.;
        set &iodat.;
        if isgood eq 0 then do;
          call symput("&macvar.", 'FALSE');
          output loopdat&loopnum.;
        end;
      run;
    %mend chauv0;

    /* the main macro */
    %macro chauv(indat, var, good=outdatg, bad=outdatb, chaufac=0.5);
      %local isAllgood loopnum i;

      /* initialize all data points GOOD */
      /* assumes there is not already a variable called ISGOOD */
      data chaudat;
        set &indat.;
        isgood=1;
      run;

      /* loop forever until all values pass the test */
      %let isAllgood=FALSE;
      %let loopnum=1;
      %do %until(&isAllgood. eq TRUE);
        %chauv0(chaudat, &var., &chaufac., isAllgood, &loopnum.);
        %let loopnum=%eval(&loopnum.+1);
      %end;

      data &good. (drop=isgood);
        set chaudat;
      run;

      /* collect the rejected points of every pass, unless BAD is a dot */
      %if &bad. ne . %then %do;
        data &bad. (drop=isgood);
          set %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
        run;
      %end;

      /* clean up the work data sets */
      proc datasets lib=work nodetails nolist nowarn;
        delete chaudat summ;
        delete %do i=1 %to &loopnum.-1; loopdat&i. %end; ;
      quit;
    %mend chauv;

    ***************************************
    *** CLEANING DATA THE CHAUVENET WAY ***
    ***         a demonstration         ***
    *** by Lily Lin and Paul D Sherman  ***
    ***************************************;

    /* fake some data */
    proc sql noprint;
      create table raw (value integer);
      insert into raw (value)
        values (8.02)  values (8.16)  values (3.97)  values (8.64)
        values (0.84)  values (4.46)  values (0.81)  values (7.74)
        values (8.78)  values (9.26)  values (20.46) values (29.87)
        values (10.38) values (25.71)
      ;
    quit;

    %chauv(raw, value, good=theGood, bad=theBad);
    proc means data=theGood; run; /* summarize the good ... */
    proc print data=theBad; run;  /* ... and show us the bad */

    %chauv(raw, value, good=raw, bad=.); /* overwrite the original data */

EXAMPLE - PERCENTILES AND QUARTILES TAKE A LONG TIME TO COMPUTE


Percentile-based calculations take significantly longer to perform than do distribution moments. This is due to the former involving internal sorting steps. These results appear to be quite general, and invariant of how many CPUs or threads are allocated to a system.

    options nosource notes;

    %macro means(n);
      /* generate the test data from an exponential distribution */
      data dum;
        do i=0 to &n.;
          x=ranexp(6789);
          output;
        end;
      run;

      /* percentile statistics: median, P25, P75 */
      proc means data=dum noprint nonobs p50 p25 p75 qmethod=p2 qntldef=5;
        var x;
        output out=dums (drop=_type_ _freq_) median=p50 p25=p25 p75=p75;
      run;

      /* distribution moments: mean and standard deviation */
      proc means data=dum noprint nonobs mean std;
        var x;
        output out=dumss (drop=_type_ _freq_) mean=avg std=std;
      run;
    %mend;

    %means(1000000000);
    %means(100000000);
    %means(10000000);
    %means(1000000);
    %means(100000);
    %means(10000);
    %means(1000);
    %means(100);
    %put *** DONE ***;

EXAMPLE - COMPARING IQR TEST AND CHAUVENET'S CRITERION


The IQR test is a single-pass algorithm. Therefore, its number of outlier values is constant. For small sample sizes of a normal distribution with a few artificially "bad" points thrown in, Chauvenet's criterion somewhat overestimates the number of bad points. On the other hand, you might think that IQR somewhat underestimates the number of bad points when the data set is small. Remember that smaller sample sizes place more strict rules on what is an outlying value.

When the data set is large, more than 10,000 observations, the situation is reversed. IQR rejects many more points than does Chauvenet. The cross-over point, where both tests report about the same quantity of outlying values, seems to be about 4,000 observations. Therefore, with very large data sets Chauvenet's criterion is superior. It rejects only those few really bad values, and without percentile calculations it is much more time and memory efficient.

                     N=200         N=4,000        N=10,000      N=1,000,000
    Loop Number    IQR  Chauv    IQR  Chauv     IQR  Chauv      IQR   Chauv
         1          10    10      39    14       84    29      7032    408
         2          10    15      39    39       84    49      7032    430
         3          10    20      39    42       84    55      7032    431
         4          10    22      39    44       84    57         .      .
         5          10    23      39    45        .     .         .      .

The code which generates this data is shown below.

    %let n_obs=10000;

    /* mostly normal data, with two small clusters of artificial bad points */
    data a;
      group=1;
      do i= 1 to 5;       x=25+0.7*rannor(6789); output; end;
      do i= 6 to 10;      x=3+0.2*rannor(6789);  output; end;
      do i=11 to &n_obs;  x=10+1*rannor(6789);   output; end;
    run;

    /* tabulate which method removed which observations */
    %macro mycompare;
      data final;
        merge a a_good(keep=i in=b) temp(keep=i in=c);
        by i;
        format flag $45.;
        if (not b) and (not c) then flag="Removed by Both IQR and Chauv Method";
        if (not b) and c then flag="Removed by IQR ONLY";
        if b and (not c) then flag="Removed by Chauv Method ONLY";
      run;
      proc freq data=final;
        table flag;
      run;
    %mend mycompare;

    %macro iqr(inds=a);
      proc univariate data=&inds noprint;
        var x;
        output out=sum_&inds mean=mean median=p50 q1=p25 q3=p75 std=std;
      run;

      /* outlier limits, passed on as macro variables */
      data sum_&inds;
        set sum_&inds;
        range=p75-p25;
        UL=p75+1.5*range;
        LL=p25-1.5*range;
        call symput('UL', UL);
        call symput('LL', LL);
      run;

      title "Inter Quartile Range Method";
      title2 "All Data Point";
      footnote "Program: SESUG 2007 Paper SA-11 Output: outlier";

      data &inds._good &inds._bad;
        set &inds;
        if &LL<=x<=&UL then output &inds._good;
        else output &inds._bad;
      run;

      proc sql;
        title "Number of Bad Records";
        select count(*) into: n_bad from &inds._bad;
        title "Inter Quartile Range Method";
        title2 "Good Data Point";
      quit;
    %mend iqr;

    %iqr(inds=a);

    %macro chauv(inds=a, loop=5);
      %do i=1 %to &loop;
        proc univariate data=&inds noprint;
          var x;
          output out=sum_&inds n=n mean=mean median=p50 q1=p25 q3=p75 std=std;
        run;

        proc sql;
          create table &inds. as
            select r.i, r.x,
                   s.n*erfc(abs(r.x-s.mean)/s.std)>0.5 as is_good
            from sum_&inds as s inner join &inds as r
              on 1=1
          ;
          title "Number of Bad Records in Loop &i";
          select count(*) into: n_bad from &inds where is_good=0;
        quit;

        data &inds;
          set &inds;
          group=1;
          where is_good=1;
        run;

        /* no more bad points: force the loop to end after this pass */
        %if &n_bad=0 %then %do;
          %let i=1000;
          title "Chauvenet Method";
          title2 "Good Data Point after Loop End";
        %end;
        %else %do;
          title "Chauvenet Method";
          title2 "Good Data Point after Loop &i";
        %end;
        %mycompare;
      %end;
    %mend chauv;

    data temp;
      set a;
    run;

    %chauv(inds=temp, loop=50);
