
Proceedings of the International Conference on Information and Automation, December 15-18, 2005, Colombo, Sri Lanka.

Domain Knowledge to Support Understanding and Treatment of Outliers


Robert Redpath, Judy Sheard
Caulfield School of Information Technology, Monash University, Caulfield East, Victoria, Australia
Email: robert.redpath@infotech.monash.edu.au
Email: judy.sheard@infotech.monash.edu.au

Abstract: The understanding and treatment of outliers is a complex and non-trivial part of many data analysis and data mining exercises, and it is not always done well. One approach, usable in combination with others, is to understand the domain of interest and use this knowledge to guide data preparation and the subsequent treatment and interpretation of outliers. To demonstrate the proposed approach, a study of web usage in the tertiary education sector is used. A particular issue in this sector, where use of the World Wide Web is widespread, is monitoring students' web use in the environment; this is important in evaluating and improving teaching outcomes. Data mining techniques play a key role in analyzing student interaction as captured in web logs. This paper considers the non-trivial task of preparing and analyzing web data, and in particular the treatment of outliers in this domain. Conclusions are drawn on how to define an outlier in terms of the strategic aims of a particular analysis, and on how to classify outliers in a web environment as noise or as indicators of phenomena of interest. It is argued that the approach demonstrated can be applied across a range of domains and offers a guide to how the knowledge discovery task may be partially automated.

I. INTRODUCTION

Outlier treatment is an important aspect of the Knowledge Discovery in Databases (KDD) process, as outliers may reveal important information about the data under analysis. Outliers can be detected by a number of techniques, including visualization and proximity-based approaches. All of these require some input from an expert in the domain of interest, so domain knowledge plays an important role in their identification. Once outliers have been identified, they may be rejected as due to measurement error or as coming from another sample population, or they may be accepted as phenomena of interest. Distinguishing between the outliers that should be rejected and those that should be accepted is not always easy. The purpose of this paper is to use a case study on student behavior, as revealed by students' usage of a courseware web system supporting their activities, to draw some general conclusions about how outliers may be treated in the domain of the case study, and to extend those conclusions to other domains of interest.

Having gained an understanding of how outliers may be treated in a general way, we suggest how this understanding could be incorporated into a partially automated model for KDD. Section II reviews the treatment of outliers, from traditional statistical approaches through to the more recent approaches that have emerged with the increased interest in data mining, where data mining serves as a label for a new grouping of data analysis techniques. Section III details the data preparation for the case study on student behavior, with an emphasis on how outliers are treated in the sample data; the particular role of the goals of the analysis is demonstrated (1) in identifying outliers and (2) in deciding whether they should be retained or rejected. Section IV discusses how domain knowledge is important at every step of the KDD process and relates the points made to the case study. Section V describes how capturing the appropriate domain knowledge would allow outlier treatment to be part of an automated KDD system. The use of domain knowledge is only one aspect of attempts to automate the KDD process; other issues, such as visualization of data and knowledge, integration of data mining tools, and automated support via workbenches, also need to be considered. This paper increases understanding of the issues by focusing on outliers, one of the types of knowledge that the process has the potential to reveal.

II. TREATMENT OF OUTLIERS

There is a large literature on outliers extending over many years. One would therefore expect that a concise definition of an outlier could be provided, but this turns out to be difficult: it is a complex matter to encapsulate precisely what an outlier is. Many writers convey a notion of what an outlier is within a larger series of observations, but providing an objective statement that can be used to identify one remains a great challenge. Some attempts are included here:

"It is well recognized by those who collect or analyse data that values occur in a sample which are so far removed from the remaining values that the analyst is not willing to believe that these values have come from the same population. Many times values occur which are dubious in the eyes of the analyst and he feels that he should make a decision as to whether to accept or reject these values as part of his sample." [1]

"The outliers are values which seem either too large or too small as compared to the rest of the observations." [2]

The use of phrases such as "in the eyes of the analyst" and "seem" indicates the subjective approach used for identifying outliers. Collett and Lewis [3] provide a comprehensive discussion of the subjective nature of the exercise. They point out that most procedures for identifying outliers essentially have two stages. First, a subjective decision is made that a value (or observation) is unusual in comparison to most other values; they use the term "surprising" for such a value. Second, the identified value is tested for discordance with the rest of the sample against some objective criterion. It thus becomes a hypothesis testing procedure using traditional statistical methods. By experiment they observe and support the conclusion that identification of outliers by such methods is subjective and is affected by the method of presentation and by the scale and pattern of the data [3].

While recognizing the subjective nature of outlier identification, once outliers are identified they can be treated in two broad ways: as values for rejection or as values that point to phenomena of interest. They are rejected if they can be considered values drawn from another population, or values resulting from measurement error, and therefore suitable for exclusion from the sample as distorting the analysis. Alternatively, they can be considered phenomena of interest that should not be excluded. In the past, outlier detection placed an emphasis on rejecting outliers as values that are not part of the population being analyzed, and so more statistically based approaches were used. Recent activity in the data mining area is more particularly concerned with outliers as anomalies of interest. Data mining as a term embraces traditional statistical approaches and other, less statistically rigorous, techniques such as decision trees and neural networks. Data mining will often avoid a strictly statistical view of data, as the volume of observations and number of attributes does not easily permit statistically based approaches. But outliers are still of great interest, so identification by density-based and proximity-based techniques, combined with visualizations, is typically employed. The aim is to find outliers that represent valid data that is significant but that differs from most of the sample. Applications include identifying unusual weather events, fraud, intrusion, medical conditions and public health issues.
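To make the two-stage procedure described above concrete, the following minimal sketch applies a Grubbs-type discordance test, one of many objective criteria the second stage could use. The sample values and the tabulated critical value are illustrative only; this is not a procedure taken from the paper's case study.

```python
import statistics

def grubbs_statistic(sample):
    """Grubbs test statistic G: the largest absolute deviation
    from the sample mean, measured in standard deviations."""
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)
    return max(abs(x - mean) for x in sample) / sd

# Stage 1: a value is judged "surprising" by inspection (here, 91.0).
observations = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 91.0]

# Stage 2: test the surprising value for discordance against an
# objective criterion; 2.02 is the tabulated two-sided critical
# value for n = 7 at the 5% level.
G = grubbs_statistic(observations)
print(f"G = {G:.2f}; discordant: {G > 2.02}")  # G = 2.27; discordant: True
```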

A range of other approaches exists for the detection of outliers, including model-based techniques and techniques that make use of class labels. If class labels are used and training data is available, a supervised learning approach can be employed; if no training data is available, an unsupervised technique may be employed. It may also be appropriate to remove an observation in the data preparation step of data mining if it is indicative of a measurement error or comes from another population not of interest in the analysis. Outliers of this type are often referred to as noise and would typically be rejected from the sample, as they would distort the analysis being carried out. An excellent overview of all these approaches can be found in the book by Tan et al. [4].

Other literature on outliers includes Knorr and Ng [5], who take an intuitive notion of outliers and formalize it as follows: an object O in a dataset T is a DB(p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than or equal to D from O. The definition is intuitive in the sense that the analyst must nominate the fraction p and the distance D based on the notion that an outlier lies at a distance from the main body of the data. The authors note that this can serve as the basis for approaches that are effective with the large datasets encountered in data mining. We note here that it also depends on the domain knowledge of the expert in the area to establish the fraction and distance parameters.

Work by Liu [6] addresses the problem of distinguishing between outliers that should be rejected and those that should be retained in the sample as phenomena of interest. The criteria suggested for making this decision are the characteristics of the data and the relevant domain knowledge. A strategy is suggested that models noise and error processes and accepts outliers as phenomena of interest if the noise model cannot account for them. We again note that the domain knowledge of the expert in the domain of interest is required to construct the noise model. Liu has also published a number of papers addressing this theme for the treatment of outliers with G. Cheng and J. Wu [7, 8]. In his PhD dissertation, G. Cheng suggests a methodology that uses two strategies for outlier treatment. The methodology requires that outliers in multivariate data are first detected and visualized by the use of self-organizing maps (SOMs). SOMs, a technique pioneered by T. Kohonen [9], are widely used for understanding and interpreting clusters in large datasets containing multivariate data. Two strategies are then proposed for handling the outliers. Domain knowledge can be used to identify outliers for retention, and to model what is considered normal data and thus reason about outliers based on this model. Domain knowledge can also be used to identify outliers for rejection and to build a model of the behavior of these outliers; data outside the norms would then be accepted if it is not accounted for by the model developed for the outliers suitable for rejection.
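As a concrete illustration of Knorr and Ng's distance-based definition quoted above, here is a minimal brute-force sketch. The sample data are invented, and the quadratic scan is for exposition only; the authors' own algorithms are designed to scale to large datasets.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def db_outliers(points, p, D):
    """Return the DB(p, D)-outliers of `points`: objects for which at
    least a fraction p of the other objects lie at distance >= D.
    Brute-force O(n^2), for illustration only."""
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        far = sum(1 for j, x in enumerate(points)
                  if j != i and dist(o, x) >= D)
        if far >= p * (n - 1):
            outliers.append(o)
    return outliers

# The fraction p and the distance D come from the analyst's domain
# knowledge, as the text notes.
data = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 1.0), (9.0, 9.0)]
print(db_outliers(data, p=0.95, D=3.0))  # -> [(9.0, 9.0)]
```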


One of the logical considerations for proceeding in the way Cheng suggests is that large datasets would not permit all outliers to be dealt with manually. The key point, in our view, is that the domain knowledge of the expert in the domain of interest essentially determines the treatment of outliers after application of some widely used supervised and unsupervised SOM learning approaches that generate visualizations. The visualizations permit the domain expert first to identify the outliers, and then to decide whether to accept them as phenomena of interest or to reject them as distorting the analysis.

III. OUTLIER TREATMENT IN A WEBSITE USAGE ANALYSIS

A case study was chosen to demonstrate the application of domain knowledge to the treatment of outliers in the context of the overall KDD process. The analysis used in this case study was concerned with the effectiveness of a web-based learning environment and was part of a study conducted in 2002. The environment was provided to students in the final year of their undergraduate degree, in which they undertake an industrial experience (IE) project to design, develop and deliver a small computer system for a client. A full description of the IE project can be found in Hagan [10]. The web site, known as the Web Industrial Experience Resources Website (WIER), was developed to give students an integrated learning environment to use during their IE project work. The site provides various resources, including general project information; a facility for event scheduling; project management facilities, including a task time tracker, time graph generator, file manager and risk list; and various forms of communication via news groups and discussion forums. The site also provides access to a repository of resources including standards documents, document templates, and samples of past projects. Details can be found in [11].

The research studies made use of the web log data generated when students used the WIER system. Full details of the data preparation can be reviewed in the report by Sheard et al. [12], and details of the findings based on web usage analysis can be found in [13]. To enable meaningful interpretation of the data, the following abstractions were defined. A connection to the WIER system was termed a session. A student session at the WIER web site can logically be viewed as a number of episodes, where an episode corresponds to a student making use of one of the functional areas of the system (e.g. accessing a past project, using the time tracker or engaging in a discussion). Each episode consists of a series of interactions, where an interaction was defined as a page request. This is shown diagrammatically in Fig. 1. Over the 27-week period of data collection there were 9442 sessions in the analysis, comprising 47725 episodes in total.
[Figure: Session 1 comprises Episode 1 .. Episode n; each episode comprises Interaction 1 .. Interaction n]

Fig. 1 Class hierarchy of the abstractions used

The goals of the analysis embraced three major themes: (1) the frequency of sessions for a student over the semester, (2) analysis of session times and (3) analysis of episode times. The level of abstraction (session, episode or interaction) suitable for one analysis goal, and what constituted an outlier for one analysis goal, was not necessarily the same for another. For the goal of session frequency it was decided to include all sessions, even though some lasted a number of days and were clearly inactive, and some lasted less than a second, in which no work could have been done. This was valid in terms of understanding this simple aspect of student behaviour: their attempts to connect to the system. For analysis goals (2) and (3), the aspects of student behaviour of interest required that students be actively engaged with the system, doing more than just connecting to it, so apparent failed logins were eliminated. Furthermore, a long session did not necessarily indicate inactivity, and neither did a long episode. An active episode was defined as one in which no time between interactions within the episode reached a given threshold. Careful consideration of the percentages of episodes and sessions excluded at different interaction-time thresholds, combined with knowledge of the system functions from an educator's perspective, allowed the analyst to arrive at a threshold of 600 seconds (10 minutes) between interactions. The threshold was used to attach a class label of inactive to an episode: if an episode had any time between interactions greater than or equal to 600 seconds, the episode was considered inactive. Sessions (and their single episode) that lasted less than 1 second were also excluded as inactive. For the analysis goal of session time, a session containing an inactive episode was itself identified as inactive, and the entire session was excluded from the analysis as misleading. For the analysis goal of episode time, any inactive episode was excluded. Table I, below, compares the method employed to exclude outliers with simply rejecting episodes longer than 1 hour. The method chosen includes some outliers (as phenomena of interest) that would be excluded by simpler methods or visualization approaches, while also excluding a greater number of undesirable outliers. Some episodes (18) with times greater than 1 hour were retained, including one episode that lasted more than 2 hours.


TABLE I
PERCENTAGE OF EPISODES REMAINING AFTER EXCLUDING OUTLIERS

Method                                    Percentage of episodes remaining    Episodes with times >= 3600 s included    Episodes with times >= 7200 s included
Extreme episode time (> 3600 sec.)        97.5                                0                                         0
Interaction interval (the method used)    93.9                                18                                        1
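A minimal sketch of the interaction-interval labelling rule described above, assuming interaction timestamps in seconds; the data structures and the sample session are hypothetical, not the study's actual code.

```python
INACTIVE_GAP = 600    # seconds (10 minutes), the expert-chosen threshold
MIN_SESSION_SECS = 1  # sessions shorter than this are excluded

def episode_is_active(timestamps):
    """An episode is active if no gap between consecutive
    interactions reaches the 600-second threshold."""
    return all(b - a < INACTIVE_GAP
               for a, b in zip(timestamps, timestamps[1:]))

def session_is_active(episodes, session_length):
    """A session is excluded (inactive) if it lasted under one
    second or contains any inactive episode."""
    return (session_length >= MIN_SESSION_SECS
            and all(episode_is_active(ep) for ep in episodes))

# Hypothetical session with two episodes; the second has a 700 s gap,
# so the whole session is excluded from the time-of-session analysis.
session = [[0, 30, 95], [120, 200, 900]]
print(session_is_active(session, session_length=900))  # -> False
```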

IV. DOMAIN KNOWLEDGE TO SUPPORT THE KDD PROCESS

It has been suggested that domain knowledge has a role in all steps of the KDD process [14]. It must be noted that the success of a data mining exercise depends on a number of factors, including the degree of automation and the use of visualization tools. A comprehensive overview of the issues and literature is given in [15], which details the impact of domain knowledge on all aspects of the KDD process. Also suggested in [15] is an architecture for a partially automated KDD system that splits the analyst/user's involvement into initial knowledge acquisition and subsequent user interventions. It is additionally noted there that the strategic goals of the analyst would, and should, be included in a broad definition of domain knowledge.

Web usage analysis and data collection present some particular challenges, and an understanding of the domain of interest must inform the analysis. As Sullivan comments, many page hits "could represent a deeply satisfying experience or a hopelessly lost reader" [16]. So in discussing the steps in the KDD process, from data collection and data preparation through to interpretation of the results, an emphasis will be placed on how domain knowledge informs the various decisions being made. (A detailed commentary on the steps can be found in [12].)

In order to understand the users' behavior better, it was recognised that typical web log file data, if used alone, did not provide enough information. A script was therefore included in each page on the site that recorded information in a database each time a page was loaded. The students entered the site via a login page, so the start of each session could be determined. The domain expert understood that identity was vital for the analysis, and students' identities could be matched to interactions. Caching was disabled on most pages, so page requests could be recorded as interactions. Pages were also categorized according to which functional area of the system they belonged to, so that a student's use of each function could be analysed; the categorization required the domain knowledge of the expert in the application. Another key decision informed by domain knowledge is the suitable level of abstraction to apply in terms of the analysis goals. In the case study, the goals of analysis required that the class hierarchy of session, episode and interaction be established, and the interaction time was used to attach the class label (or abstraction label) inactive to an episode. It is clear that domain knowledge assists at a number of steps in the analysis. Table II summarises the contribution at each step, and a short illustration of the enrichment decisions follows it.

TABLE II
SUMMARY OF DECISIONS INFORMED BY DOMAIN KNOWLEDGE AT EACH STEP IN THE ANALYSIS

Data Process Step    Decision informed by Domain Knowledge
Collection design    Capture additional information for the analysis
Collection           Monitor student logons and record identity
Abstraction          Decide on semantically meaningful groupings
Integration          Combine with additional data related to the user
Cleaning             Remove irrelevant items; determine missing items; remove outliers that distort the analysis
Transformation       Identify the user; identify the session
Mining               Outliers as phenomena of interest are identified
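The sketch below illustrates the kind of enrichment described above: stamping each page request with the student's identity and a functional-area category. The URL prefixes and record fields are hypothetical; the actual categorization was defined by the domain expert.

```python
# Hypothetical mapping from URL path prefix to WIER functional area.
FUNCTIONAL_AREAS = {
    "/timer": "task time tracker",
    "/forum": "discussion forum",
    "/repository": "past projects and templates",
    "/schedule": "event scheduling",
}

def categorise(path):
    """Assign a page request to a functional area by path prefix."""
    for prefix, area in FUNCTIONAL_AREAS.items():
        if path.startswith(prefix):
            return area
    return "other"

def enrich(request, student_id):
    """Attach identity and functional area to a raw page request,
    much as the per-page script stored each interaction."""
    return {**request, "student": student_id,
            "area": categorise(request["path"])}

print(enrich({"path": "/forum/thread/42", "ts": 1031211000},
             student_id="s1234567"))
```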

In order to know whether outliers should be rejected or included as phenomena of interest, the goals of the analyst must be taken into account. Here the goals were to answer the following questions: How long do students spend using the web site? What are the patterns and trends in web site usage over the year? Are there any differences in use based on student performance? Clearly, the data on students who spend long amounts of time at initial setup need to be included to address these objectives. Observations related to inactive or disengaged users, as identified via the domain knowledge that defines the inactive observations, would be rejected from the sample.

V. OUTLIER TREATMENT AS PART OF A PARTIALLY AUTOMATED MODEL FOR KNOWLEDGE DISCOVERY

There are two ways in which domain knowledge can be captured in order to partially automate the knowledge discovery process, and in particular to aid in categorizing identified outliers as suitable for rejection or as phenomena of interest. The domain knowledge can be captured prior to and during the analysis via a knowledge acquisition module. It can also be embodied in case studies of previous analyses that have been formally captured in a data bank of representative case studies. The web usage case study detailed here will be used to demonstrate how the suggested approaches could be applied in a particular domain of interest. Domain knowledge acquisition and the use of representative case studies can be applied separately, but would preferably be applied in combination.

If used in combination, they would have complementary and consequential effects on each other. The user would also have the discretion to intervene at decision points as the data analysis proceeds. Figure 2, adapted from [15], shows the broad architecture of what is proposed.
[Figure 2: the proposed architecture. Its components include an Initial Knowledge Acquisition phase; a Knowledge Acquisition Module; a Convert to Formalism Module; a DK Meta Definition Module; Process-Metadata; Formal Domain Knowledge; a User KDD Process Interface and Workbench; the Web Usage Log Data; Data Selection, Design and Collection; Data Integration, Transformation and Coding; Selected Data; Transformed Data; Data Mining; Result Evaluation; a Case Repository; and a Case Based Reasoning Module.]

Fig. 2 Proposed architecture for a partially automated KDD system

Case-based reasoning would allow a previous web usage analysis, such as the one described here, to provide prompts for the major steps in the analysis if the user so desired, overriding a generic set of steps provided as a default. The steps would then guide the interaction with the knowledge acquisition module in prompting for the appropriate domain knowledge. Table III summarises the knowledge acquisition as it would apply to the case study. The goals of the analysis would be captured and formalized in terms of the level of abstraction they relate to, and the attributes of that level of abstraction that are important in categorizing its instances; in the case study this was the time between interactions in an episode. With reference to outliers, the knowledge acquisition module would capture the hierarchy of abstractions, in this case session, episode and interaction. The domain expert could then be provided with a frequency distribution for the level of abstraction showing the instances above and below certain threshold values. This would assist the domain expert in determining upper and lower threshold values for the analysis in question that capture phenomena of interest but reject outliers that would, in the expert's view, distort the analysis.
TABLE III
AN AUTOMATED MODEL APPLIED TO THE CASE STUDY

Relevant domain knowledge captured by the knowledge acquisition module at the beginning of the KDD process    Case study application
Attributes to enrich data requested                                               Student identity on page requests
Goals in terms of data abstraction requested (supported by automatic interrogation of metadata)    Session frequency; session length
Abstraction hierarchy requested                                                   Session has many episodes; an episode has many interactions
Class label requested                                                             Inactive episode
Attribute to determine class label requested                                      Time between interactions
Threshold for the identified attribute requested                                  Threshold is 10 minutes
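A sketch of the frequency-distribution support mentioned above: for each candidate threshold, the module reports what percentage of episodes would be labelled inactive, letting the expert weigh lost data against noise. The gap values are invented for illustration.

```python
def exclusion_rates(max_gaps, thresholds):
    """For each candidate threshold (seconds), give the percentage
    of episodes whose largest interaction gap meets or exceeds it,
    i.e. the episodes that would be labelled inactive."""
    n = len(max_gaps)
    return {t: 100.0 * sum(g >= t for g in max_gaps) / n
            for t in thresholds}

# Hypothetical largest-gap value for each episode, in seconds.
gaps = [12, 45, 610, 90, 3700, 300, 75, 1500, 20, 55]
for t, pct in exclusion_rates(gaps, [300, 600, 1800, 3600]).items():
    print(f"threshold {t:>4} s: {pct:4.1f}% of episodes excluded")
```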


VI. CONCLUSION

It is not always clear how outliers can be identified and, once identified, how they should be treated. Domain knowledge can play a key part in outlier treatment, and its definition must be broad enough to include the goals of the analysis being carried out. It can be employed at a number of points in the KDD process to ensure correct handling of outliers. In particular, it can inform the following steps:
- establishment of the goals of the analysis;
- design of data collection so that the correct attributes are captured;
- determination of the suitable level of abstraction to allow identification of outliers both for rejection and for inclusion in the analysis;
- determination of the attributes that permit the correct class labels to be attached at the appropriate level of abstraction.
In the particular example of the web usage case study, it was found that the data collection had to be designed to allow individual student sessions to be identified, and also students' use of the functional areas (episodes) within a session. Episodes then had to be labeled as active or inactive, and sessions containing an inactive episode were likewise considered inactive by the domain expert. An understanding of the treatment of outliers will permit a model for a partially automated KDD workbench to be developed that handles outliers based on the principles established. This model will have two major parts that relate to domain knowledge acquisition. The first is an initial knowledge acquisition module that captures domain knowledge. The second is a case-based reasoning module that records the major steps of an analysis and makes them available for any future analysis exercise in the same domain of interest.

REFERENCES
[1] W. J. Dixon, "Analysis of Extreme Values," Ann. Math. Statist., vol. 21, pp. 488-506, 1950.
[2] E. J. Gumbel, "Discussion on 'Rejection of Outliers' by F. J. Anscombe," Technometrics, vol. 2, pp. 165-166, 1960.
[3] D. Collett and T. Lewis, "The Subjective Nature of Outlier Rejection Procedures," Applied Statistics, vol. 25, pp. 228-237, 1976.
[4] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Pearson Education, Inc., 2006.
[5] E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation," in Proceedings of Knowledge Discovery in Databases (KDD-97), Newport Beach, CA, 1997.
[6] X. Liu, "Strategies for Outlier Analysis," presented at the Colloquium on Knowledge Discovery and Data Mining, London, UK, 1998.
[7] X. Liu, G. Cheng, and J. X. Wu, "Noise and Uncertainty Management in Intelligent Data Modelling," in Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), 1994.
[8] J. G. Cheng, "Outlier Management in Intelligent Data Analysis," PhD dissertation, Department of Computer Science, Birkbeck College, University of London, London, 2000.
[9] T. Kohonen, Self-Organizing Maps. Berlin; New York: Springer, 1995.
[10] D. Hagan, S. Tucker, and J. Ceddia, "Industrial Experience Products: A Balance of Product and Process," Computer Science Education, vol. 9, pp. 106-113, 1999.
[11] J. Ceddia, S. Tucker, C. Clemence, and A. Cambrell, "WIER: Implementing Artifact Reuse in an Educational Environment with Real Products," in Proceedings of the 31st Annual Frontiers in Education Conference, Reno, Nevada, 2001.
[12] J. Sheard, J. Ceddia, J. Hurst, and J. Tuovinen, "Determining Website Usage Time from Interactions: Data Preparation and Analysis," Educational Technology Systems, vol. 32, pp. 101-121, 2003-2004.
[13] J. Sheard, J. Ceddia, J. Hurst, and J. Tuovinen, "Inferring Student Learning Behaviour from Website Interactions: A Usage Analysis," Education and Information Technologies, vol. 8, pp. 245-266, 2003.
[14] A. Knobbe, A. Schipper, and P. Brockhausen, "Domain Knowledge and Data Mining Process Decisions: Enabling End-User Datawarehouse Mining," Contract No. IST-1999-11993, Deliverable No. D5, 2000. Available: www-ai.cs.uni-dortmund.de/MMWEB/content/publications.html
[15] R. Redpath and B. Srinivasan, "A Model for Domain Centered Knowledge Discovery in Databases," in Proceedings of the IEEE 4th International Conference on Intelligent Systems Design and Applications (ISDA 2004), Budapest, Hungary, 2004.
[16] T. Sullivan, "Reading Reader Reaction: A Proposal for Inferential Analysis of Web Server Log Files," in Proceedings of the 3rd Conference on Human Factors and the Web, Denver, Colorado, 1997.
