
The role of Domain Knowledge in a large scale Data Mining Project

Kopanas I., Avouris N., Daskalaki S. University of Patras

Outline of the talk


Knowledge in a DM process
Case study in a large DM project: Prediction of customer insolvency in Telecommunications business
The role of domain expertise (and domain experts) in the process
Summary and conclusions

University of Patras, HCI Group - SETN02

Data Mining
Evolution of knowledge-based systems
Key partners in Data Mining
  Data analyst / statistician
  Knowledge Engineer
  Domain Expert

Role of domain knowledge in Data Mining


DM phases
(a) Problem definition
(b) Creating target data set
(c) Data pre-processing and transformation
(d) Feature and algorithm selection
(e) Data Mining
(f) Evaluation of learned knowledge
(g) Fielding the knowledge base

Case study: Prediction of Customer Insolvency in Telecommunications business


Predict which customers will become insolvent, that is, which customers will refuse to pay their telephone bills by the next payment due date, while there is still time for preventive (and possibly averting) measures
Problem objectives
  Detect as many insolvent customers as possible
  Minimize false alarms (solvent customers classified as insolvent)

Case study: problem characteristics


Significant loss of revenue for the company
Human behavior is (generally) unpredictable
Insolvency cases are rare compared to non-insolvencies
Information can be retrieved only after processing huge amounts of data from several sources


The billing process (domain knowledge)


[Figure: billing timeline running from June to April, marking the billing period, issue of bill, due date, service interruption and nullification]


Target data set definition (semantic value of data)


Data from 3 different cities (combination of rural, urban and touristic areas)
Types of data
  Customer data (coded)
  Data from billing and payments
  Call detail records (from switching centers)

Time span of data studied
  Cases of collected and uncollected bills (10/99-2/01)
  Call records (8/99-12/00)

Data pre-processing (knowledge-based reduction of search space)


Eliminating inexpensive calls (< 0.3)
Synchronizing data
Removing noise
Missing values
Data aggregation by period

DATA WAREHOUSE
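
The filtering and aggregation steps listed on this slide could look roughly like the sketch below. It is only an illustration: the record layout `(customer_id, call_date, charge)`, the function name and the start date of the first period are assumptions, not details taken from the project. The 0.3 threshold comes from the slide, which does not give its unit.

```python
from collections import defaultdict
from datetime import date

PERIOD_DAYS = 14   # two-week periods, as used for the call-habit variables
MIN_CHARGE = 0.3   # threshold from the slide; its unit is not given in the source

def aggregate_by_period(records, start):
    """Sum call charges per (customer, period index).

    records: iterable of (customer_id, call_date, charge) tuples (hypothetical layout)
    start:   first day of period 0 (an assumed reference date)
    """
    totals = defaultdict(float)
    for customer_id, call_date, charge in records:
        if charge < MIN_CHARGE:
            continue  # drop inexpensive calls before aggregating
        period = (call_date - start).days // PERIOD_DAYS
        if period < 0:
            continue  # call predates the study window
        totals[(customer_id, period)] += charge
    return dict(totals)

# Toy records, invented for illustration only
records = [
    ("c1", date(1999, 8, 2), 5.0),
    ("c1", date(1999, 8, 10), 3.0),
    ("c1", date(1999, 8, 11), 0.1),   # below the threshold, ignored
    ("c1", date(1999, 8, 20), 2.0),
    ("c2", date(1999, 8, 3), 7.0),
]
result = aggregate_by_period(records, start=date(1999, 8, 1))
```

Aggregating per fixed-length period reduces millions of call records to a small number of values per customer, which is the kind of search-space reduction the slide title refers to.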


Dataset for model fitting


Stratified sample of solvent customers
Class distribution: 90% solvent customers and 10% insolvent customers

2066 cases in total, described by 46 variables
  2 variables describing the phone account
  4 variables describing customer attitude towards previous phone bills
  40 variables summarizing customer call habits over fifteen 2-week periods
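
One way to obtain a 90%/10% class distribution is to keep all insolvent cases and draw a stratified sample of solvent ones, nine times as large. The sketch below illustrates that idea only; the slides do not say which variables were used for stratification, so the `city` key, the function name and the toy data are assumptions.

```python
import random
from collections import defaultdict

def stratified_downsample(solvent, n_target, key, seed=0):
    """Draw about n_target solvent cases, preserving each stratum's share.

    key:  function mapping a case to its stratum (hypothetical choice)
    seed: fixed seed so the sketch is reproducible
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in solvent:
        strata[key(case)].append(case)
    total = len(solvent)
    sample = []
    for members in strata.values():
        # proportional allocation; rounding may shift the total by a case or two
        k = round(n_target * len(members) / total)
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Toy data, invented for illustration: 20 insolvent and 1000 solvent customers
insolvent = [{"id": i, "city": "A"} for i in range(20)]
solvent = [{"id": 100 + i, "city": "A" if i < 600 else "B"} for i in range(1000)]

sample = stratified_downsample(solvent, n_target=9 * len(insolvent),
                               key=lambda c: c["city"])
insolvent_share = len(insolvent) / (len(insolvent) + len(sample))
```

With these toy numbers the sample holds 180 solvent cases, so insolvent customers make up 10% of the fitting data while the solvent sample keeps the 60/40 split between the two cities.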


Data mining
Classification problem
  2 classes: solvent and insolvent customers
  Distribution among classes in original dataset: 99% solvent customers and 1% insolvent customers
  Very small number of insolvencies
  Very different costs of misclassification between the two classes of customers


Criteria for evaluation of prediction


The precision of the classifier: the percentage of actually insolvent customers among those predicted as insolvent by the classifier.
The accuracy of the classifier: the percentage of correctly predicted insolvent customers out of all insolvent customers in the data set (a measure more commonly called recall or sensitivity).
Precision > 30% and accuracy > 70%
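
The two criteria can be written down directly from counts of correct and incorrect predictions. The sketch below follows the definitions on this slide; the function names are chosen for illustration and the example counts are hypothetical, not project results.

```python
def precision(tp, fp):
    """Share of customers predicted insolvent who are actually insolvent.

    tp: insolvent customers predicted insolvent
    fp: solvent customers predicted insolvent (false alarms)
    """
    return tp / (tp + fp) if (tp + fp) else 0.0

def accuracy_on_insolvent(tp, fn):
    """Share of actually insolvent customers that the classifier detects.

    This is the slide's 'accuracy', usually called recall elsewhere.
    fn: insolvent customers predicted solvent (missed cases)
    """
    return tp / (tp + fn) if (tp + fn) else 0.0

def meets_criteria(tp, fp, fn, min_precision=0.30, min_accuracy=0.70):
    """Check both thresholds stated on the slide."""
    return (precision(tp, fp) > min_precision
            and accuracy_on_insolvent(tp, fn) > min_accuracy)

# Hypothetical counts, for illustration only
ok = meets_criteria(tp=80, fp=120, fn=20)    # precision 0.40, accuracy 0.80
bad = meets_criteria(tp=80, fp=400, fn=20)   # too many false alarms
```

Neither measure counts correctly classified solvent customers, which matters here because solvent customers make up almost all of the population.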

Features selected (most popular in 50 classifiers)


NewCust
TrendUnitsMax
Latency
Count_X_charges
TrendDif3, TrendDif5, TrendDif6, TrendDif7, TrendDif8, TrendDif10, TrendDif11
MaxSec
TrendUnits5
Average_Dif
CountResiduals
Type
StdDif
AverageUnits
TrendCount5
CountInstallments

TrendDifxx, StdDif: dispersion of called telephone numbers in a given time interval xx

Deployment of the Knowledge-based system


The classifiers are combined (voting algorithms have been used)
Heuristics are used as applicability criteria

Visualization plays an important role in the design of the system


The roles of the user and the knowledge-based system have to be carefully defined
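
Combining classifiers by voting, with heuristics deciding which classifiers apply to a case, can be sketched as below. The slides do not specify the voting scheme, so simple majority voting is assumed here; the interface, the tie rule and the toy rules with their thresholds are all hypothetical.

```python
from collections import Counter

INSOLVENT, SOLVENT = 0, 1  # class codes as used in the result tables

def vote(case, classifiers, default=SOLVENT):
    """Majority vote over the classifiers whose applicability test accepts the case.

    classifiers: list of (is_applicable, predict) function pairs
                 (a hypothetical interface, not the project's actual design)
    """
    votes = [predict(case)
             for is_applicable, predict in classifiers
             if is_applicable(case)]
    if not votes:
        return default  # no classifier applies
    counts = Counter(votes)
    if counts[INSOLVENT] > counts[SOLVENT]:
        return INSOLVENT
    return SOLVENT  # ties go to solvent to limit false alarms (an assumption)

# Toy rules with invented thresholds; only the feature names come from the slides
classifiers = [
    (lambda c: True,
     lambda c: INSOLVENT if c["CountInstallments"] > 2 else SOLVENT),
    (lambda c: not c["NewCust"],   # heuristic: needs a payment history
     lambda c: INSOLVENT if c["Latency"] > 30 else SOLVENT),
    (lambda c: True,
     lambda c: INSOLVENT if c["TrendUnitsMax"] > 1.5 else SOLVENT),
]

case_a = {"NewCust": False, "CountInstallments": 3, "Latency": 45, "TrendUnitsMax": 1.0}
case_b = {"NewCust": True, "CountInstallments": 3, "Latency": 45, "TrendUnitsMax": 1.0}
```

For `case_a` all three classifiers vote and two of them flag insolvency; for `case_b` the second classifier is excluded by its applicability heuristic, the remaining two disagree, and the tie rule returns solvent.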


Stepwise Discriminant Analysis


Classification Results E3

                                                     Predicted 0   Predicted 1   Total
Cases selected       Original         Count    0          78            58         136
                                               1          28          1184        1212
                                      %        0       57.35         42.65         100
                                               1        2.31         97.69         100
                     Cross-validated  Count    0          77            59         136
                                               1          35          1177        1212
                                      %        0       56.62         43.38         100
                                               1        2.89         97.11         100
Cases not selected   Original         Count    0          36            28          64
                                               1          22           632         654
                                      %        0       56.25         43.75         100
                                               1        3.36         96.64         100

93.6% of selected original grouped cases correctly classified
93.02% of selected cross-validated cases correctly classified
93.04% of unselected original grouped cases correctly classified


Decision Tree
Classification Results E21

                                              Predicted 0   Predicted 1   Total
Cases selected       Original  Count    0         101            35         136
                                        1           9          1203        1212
                               %        0       74.26         25.74         100
                                        1        0.74         99.26         100
Cases not selected   Original  Count    0          42            22          64
                                        1          16           638         654
                               %        0       65.62         34.38         100
                                        1        2.45         97.55         100


Neural Network
Classification Results E30

                                              Predicted 0   Predicted 1   Total
Cases selected       Original  Count    0          65            69         136
                                        1           8          1203        1212
                               %        0        47.7          50.7         100
                                        1         0.6          99.2         100
Cases not selected   Original  Count    0          24            40          64
                                        1          11           643         654
                               %        0        37.5          62.5         100
                                        1         1.6          98.3         100


Evaluation of classifiers (example)


                               Predicted cases
Actual cases          Insolvent (0)      Solvent (1)
Insolvent (0)         113 (83.1%)        23 (16.9%)
Solvent (1)           2731 (9.8%)        25081 (90.2%)

Performance over 90% in the majority class and over 83% in the minority class.

precision = 113/2844 ≈ 4.0%
accuracy = 113/136 ≈ 83%
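
The gap between a detection rate of 83% and a precision of about 4% follows from how rare insolvent customers are in this evaluation set (136 out of 27948). The sketch below is an added illustration, not part of the original analysis: it recomputes precision from the detection rate, the false-alarm rate and the share of insolvent customers. The 10% share is the class distribution of the model-fitting dataset and is used here only for comparison.

```python
def precision_from_rates(detection_rate, false_alarm_rate, insolvent_share):
    """Precision implied by per-class rates and the share of insolvent customers."""
    flagged_insolvent = detection_rate * insolvent_share
    flagged_solvent = false_alarm_rate * (1 - insolvent_share)
    return flagged_insolvent / (flagged_insolvent + flagged_solvent)

# Counts from the evaluation table on this slide
tp, fn = 113, 23        # insolvent customers: detected, missed
fp, tn = 2731, 25081    # solvent customers: false alarms, correctly passed

detection_rate = tp / (tp + fn)            # the slide's "accuracy"
false_alarm_rate = fp / (fp + tn)
insolvent_share = (tp + fn) / (tp + fn + fp + tn)

# Same value as 113/2844, obtained through the rates
p_evaluation = precision_from_rates(detection_rate, false_alarm_rate, insolvent_share)

# Same per-class rates, but with the 10% insolvent share of the fitting dataset
p_at_10_percent = precision_from_rates(detection_rate, false_alarm_rate, 0.10)
```

With identical per-class rates, precision would be close to 48% at a 10% insolvent share, so the low figure on this slide reflects the class distribution as much as the classifier itself.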

Stage                                         DK       Type of DK
(a) Problem definition                        HIGH     Business and domain knowledge, requirements; implicit, tacit knowledge
(b) Creating target data set                  MEDIUM   Attribute relations, semantics of corporate DB
(c) Data pre-processing                       HIGH     Tacit and implicit knowledge for inferences
(d) Feature and algorithm selection           MEDIUM   Interpretation of the selected features
(e) Data Mining                               LOW      Inspection of discovered knowledge
(f) Evaluation of learned knowledge           MEDIUM   Definition of criteria related to business objectives
(g) Fielding the knowledge base               HIGH     Supplementary domain knowledge necessary for implementing the system



Selection of DM tool (Elder 98)


Conclusion
Data mining is a knowledge-driven process
All stages contribute to the success of the process
Domain experts play a significant role in most phases of the process
Need for selection of algorithms and techniques that support interpretation of mined knowledge

Need for integrated tools and adequate techniques to support involvement of domain experts in the process
