The Role of Domain Knowledge in a Large-Scale Data Mining Project

Kopanas I., Avouris N., Daskalaki S. University of Patras
Outline of the talk

Knowledge in a DM process
Case study in a large DM project: prediction of customer insolvency in the telecommunications business
The role of domain expertise (and domain experts) in the process
Summary and conclusions
University of Patras, HCI Group - SETN02
Data Mining
Evolution of knowledge-based systems
Key partners in Data Mining
Data analyst / statistician
Knowledge Engineer
Domain Expert
Role of domain knowledge in Data Mining
DM phases
(a) Problem definition
(b) Creating target data set
(c) Data pre-processing and transformation
(d) Feature and algorithm selection
(e) Data Mining
(f) Evaluation of learned knowledge
(g) Fielding the knowledge base
Case study: Prediction of Customer Insolvency in Telecommunications business

Predict which customers will become insolvent, that is, the customers who will refuse to pay their telephone bills by the next payment due date, while there is still time for preventive (and possibly averting) measures
Problem objectives
Detect as many insolvent customers as possible
Minimize false alarms (solvent customers classified as insolvent)
Case study: problem characteristics

Significant loss of revenue for the company
Human behavior is (generally) unpredictable
Insolvency cases are rare compared to non-insolvencies
Information can be retrieved only after processing huge amounts of data from several sources
The billing process (domain knowledge)

[Figure: timeline running from June to April of the following year, marking the billing period, issue of bill, due date, service interruption, and nullification]
Target data set definition (semantic value of data)

Data from 3 different cities (a combination of rural, urban and tourist areas)
Types of data
Customer data (coded)
Data from billing and payments
Call detail records (from switching centers)
Time span of data studied

Cases of collected and uncollected bills (10/99-2/01)
Call records (8/99-12/00)
Data pre-processing (knowledge-based reduction of search space)

Eliminating inexpensive calls (< 0.3)
Synchronizing data
Removing noise
Missing values
Data aggregation by period
DATA WAREHOUSE
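
The pre-processing steps above can be illustrated with a minimal sketch. It assumes each call record is a tuple of customer, date, charge and units; these field names, the sample records and the function name `aggregate_calls` are hypothetical and do not come from the slides. The 0.3 threshold is taken from the slide (its unit is not given there), and the 14-day period length follows the 2-week periods mentioned for the model-fitting dataset.

```python
from collections import defaultdict
from datetime import date

# Assumed parameters: threshold from the slide (unit unknown), 2-week periods.
MIN_CHARGE = 0.3
PERIOD_DAYS = 14


def aggregate_calls(calls, start, min_charge=MIN_CHARGE, period_days=PERIOD_DAYS):
    """Drop inexpensive calls, then total units and call counts per customer and period.

    `calls` is an iterable of (customer_id, call_date, charge, units) tuples.
    Returns {(customer_id, period_index): {"units": total, "count": n}}.
    """
    totals = defaultdict(lambda: {"units": 0, "count": 0})
    for customer_id, call_date, charge, units in calls:
        if charge < min_charge:
            continue  # knowledge-based reduction: ignore inexpensive calls
        if call_date < start:
            continue  # outside the studied time span
        period = (call_date - start).days // period_days
        bucket = totals[(customer_id, period)]
        bucket["units"] += units
        bucket["count"] += 1
    return dict(totals)


# Hypothetical call records, for illustration only.
calls = [
    ("c1", date(1999, 8, 2), 0.10, 1),   # dropped: below the charge threshold
    ("c1", date(1999, 8, 3), 0.50, 4),
    ("c1", date(1999, 8, 20), 1.20, 9),  # falls in the second 2-week period
    ("c2", date(1999, 8, 5), 0.80, 6),
]
summary = aggregate_calls(calls, start=date(1999, 8, 1))
```

Filtering before aggregation keeps the per-period summaries small, which is the point of a knowledge-based reduction of the search space.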
Dataset for model fitting

Stratified sample of solvent customers
Class distribution: 90% solvent customers and 10% insolvent customers
2066 total number of cases and 46 variables

2 variables describing the phone account
4 variables describing customer attitude towards previous phone bills
40 variables summarizing customer call habits over fifteen 2-week periods
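
One way to arrive at a 90% / 10% class distribution is to keep every insolvent case and sample from the solvent ones. The sketch below is an assumption about how this might be done, not the procedure used in the project: it uses plain random sampling in place of the stratified sampling named on the slide (the strata are not described there), and the function name `undersample_majority` and the synthetic population are hypothetical.

```python
import random


def undersample_majority(cases, minority_share=0.10, seed=0):
    """Keep every insolvent case and draw solvent cases so that insolvent
    cases make up roughly `minority_share` of the result.

    `cases` is a list of (customer_id, label) pairs with label "insolvent"
    or "solvent". Plain random sampling is used here as a stand-in for
    stratified sampling.
    """
    insolvent = [c for c in cases if c[1] == "insolvent"]
    solvent = [c for c in cases if c[1] == "solvent"]
    n_solvent = round(len(insolvent) * (1 - minority_share) / minority_share)
    n_solvent = min(n_solvent, len(solvent))
    rng = random.Random(seed)
    return insolvent + rng.sample(solvent, n_solvent)


# Synthetic population with a 1% share of insolvent customers.
population = [(i, "insolvent") for i in range(20)] + \
             [(i, "solvent") for i in range(20, 2000)]
sample = undersample_majority(population)
share = sum(1 for _, label in sample if label == "insolvent") / len(sample)
```

With 20 insolvent cases the function draws 180 solvent ones, so the sample has 200 cases and a 10% insolvent share.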
Data mining
Classification problem
2 classes: solvent and insolvent customers
Distribution among classes in the original dataset: 99% solvent customers and 1% insolvent customers
Very small number of insolvencies
Very different costs of misclassification between the two classes of customers
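
A small sketch shows why such a skewed distribution makes the overall share of correct predictions a poor yardstick. The labels are synthetic and only reproduce the 99% / 1% split; the helper `rates` is a hypothetical name introduced for illustration.

```python
def rates(actual, predicted, positive="insolvent"):
    """Return (overall share correct, share of positive cases detected)."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    positives = [p for a, p in zip(actual, predicted) if a == positive]
    detected = sum(p == positive for p in positives)
    return correct / len(actual), detected / len(positives)


# Synthetic labels with a 99% / 1% class distribution.
actual = ["solvent"] * 990 + ["insolvent"] * 10
# A trivial "classifier" that labels every customer as solvent.
predicted = ["solvent"] * 1000

overall, detected = rates(actual, predicted)
print(f"overall correct: {overall:.1%}, insolvent detected: {detected:.1%}")
```

The trivial classifier is right for 99% of customers yet detects no insolvent customer at all, which is why the minority class needs evaluation criteria of its own.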
Criteria for evaluation of prediction

Precision of the classifier: the percentage of actually insolvent customers among those predicted as insolvent by the classifier.
Accuracy of the classifier: the percentage of correctly predicted insolvent customers out of all insolvent customers in the data set (a quantity often called recall).
Precision > 30% and accuracy > 70%
Features selected (most popular in 50 classifiers)

NewCust
TrendUnitsMax
Latency
Count_X_charges
MaxSec
Type
CountResiduals
Average_Dif
StdDif
AverageUnits
TrendUnits5
TrendCount5
CountInstallments
TrendDif3, TrendDif5, TrendDif6, TrendDif7, TrendDif8, TrendDif10, TrendDif11
TrendDifxx, StdDif: dispersion of called telephone numbers in a given time interval xx
Deployment of the knowledge-based system

The classifiers are combined (voting algorithms have been used)
Heuristics are used as applicability criteria
Visualization plays an important role in the design of the system

The roles of the user and the knowledge-based system have to be carefully defined
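
The slide says only that voting algorithms were used to combine the classifiers. The sketch below assumes simple majority voting with a tie-break in favour of "solvent"; both the voting rule and the tie-break are assumptions made for illustration, and the votes are hypothetical.

```python
from collections import Counter


def majority_vote(predictions, tie_break="solvent"):
    """Combine the labels predicted by several classifiers for one customer."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_break  # assumed rule: on a tie, avoid a false alarm
    return counts[0][0]


# Hypothetical outputs of the classifiers for four customers.
votes_per_customer = [
    ["insolvent", "insolvent", "solvent"],
    ["solvent", "solvent", "solvent"],
    ["insolvent", "solvent", "solvent"],
    ["insolvent", "solvent"],  # a tie, e.g. when one classifier is not applicable
]
combined = [majority_vote(v) for v in votes_per_customer]
```

Applicability heuristics could be modelled by leaving a classifier's vote out of the list for customers it does not apply to, as in the last row.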
Stepwise Discriminant Analysis

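Classification Results E3

Case set / measure                      Category   Predicted 0   Predicted 1   Total
Selected, original, count               0                   78            58     136
Selected, original, count               1                   28          1184    1212
Selected, original, %                   0                57.35         42.65     100
Selected, original, %                   1                 2.31         97.69     100
Selected, cross-validated, count        0                   77            59     136
Selected, cross-validated, count        1                   35          1177    1212
Selected, cross-validated, %            0                56.62         43.38     100
Selected, cross-validated, %            1                 2.89         97.11     100
Not selected, original, count           0                   36            28      64
Not selected, original, count           1                   22           632     654
Not selected, original, %               0                56.25         43.75     100
Not selected, original, %               1                 3.36         96.64     100

93.6% of selected original grouped cases correctly classified
93.02% of selected cross-validated cases correctly classified
93.04% of unselected original grouped cases correctly classified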
Decision Tree
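Classification Results E21

Case set / measure                      Category   Predicted 0   Predicted 1   Total
Selected, original, count               0                  101            35     136
Selected, original, count               1                    9          1203    1212
Selected, original, %                   0                74.26         25.74     100
Selected, original, %                   1                 0.74         99.26     100
Not selected, original, count           0                   42            22      64
Not selected, original, count           1                   16           638     654
Not selected, original, %               0                65.62         34.38     100
Not selected, original, %               1                 2.45         97.55     100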
Neural Network
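Classification Results E30

Case set / measure                      Category   Predicted 0   Predicted 1   Total
Selected, original, count               0                   65            69     136
Selected, original, count               1                    8          1203    1212
Selected, original, %                   0                 47.7          50.7     100
Selected, original, %                   1                  0.6          99.2     100
Not selected, original, count           0                   24            40      64
Not selected, original, count           1                   11           643     654
Not selected, original, %               0                 37.5          62.5     100
Not selected, original, %               1                  1.6          98.3     100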
Evaluation of classifiers (example)

Actual cases      Predicted insolvent (0)   Predicted solvent (1)
Insolvent (0)     113 (83.1%)               23 (16.9%)
Solvent (1)       2731 (9.8%)               25081 (90.2%)
Performance over 90% in the majority class and over 83% in the minority class.
precision = 113/2844 ≈ 4.0%
accuracy = 113/136 ≈ 83%
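
The two criteria can be recomputed from the counts in this example. The sketch below follows the definitions used in the talk, where "accuracy" is the share of insolvent customers that are detected; the function and parameter names are hypothetical.

```python
def precision_and_accuracy(true_insolvent, missed_insolvent, false_alarms):
    """Evaluation criteria as defined in the talk.

    true_insolvent   : insolvent customers predicted as insolvent
    missed_insolvent : insolvent customers predicted as solvent
    false_alarms     : solvent customers predicted as insolvent
    """
    precision = true_insolvent / (true_insolvent + false_alarms)
    accuracy = true_insolvent / (true_insolvent + missed_insolvent)
    return precision, accuracy


# Counts from the example: 113 detected, 23 missed, 2731 false alarms.
precision, accuracy = precision_and_accuracy(113, 23, 2731)
print(f"precision = {precision:.1%}, accuracy = {accuracy:.1%}")
```

Because solvent customers vastly outnumber insolvent ones, a 9.8% false-alarm rate in the majority class is enough to pull precision down to a few percent even though 83% of the insolvent customers are detected.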
Stage                                  DK       Type of DK
(a) Problem definition                 HIGH     Business and domain knowledge, requirements; implicit, tacit knowledge
(b) Creating target data set           MEDIUM   Attribute relations, semantics of corporate DB
(c) Data pre-processing                HIGH     Tacit and implicit knowledge for inferences
(d) Feature and algorithm selection    MEDIUM   Interpretation of the selected features
(e) Data Mining                        LOW      Inspection of discovered knowledge
(f) Evaluation of learned knowledge    MEDIUM   Definition of criteria related to business objectives
(g) Fielding the knowledge base        HIGH     Supplementary domain knowledge necessary for implementing the system

Selection of DM tool (Elder 98)
Conclusion
Data mining is a knowledge-driven process
All stages contribute to the success of the process
Domain experts play a significant role in most phases of the process
Need for selection of algorithms and techniques that support interpretation of mined knowledge
Need for integrated tools and adequate techniques to support involvement of domain experts in the process
