
Knowledge Discovery (KDD) Process

Amanullah Yasin (PhD)

Center For Advanced Studies In Engineering


Islamabad, Pakistan

Data Mining
Data mining is the exploration and analysis of large
quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable
patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.

Data Mining Tasks


Data mining tasks are generally divided into two major categories:

Predictive tasks [Use some attributes to predict unknown or future values of other attributes.]
Classification
Regression
Deviation Detection

Descriptive tasks [Find human-interpretable patterns that describe the data.]
Association Discovery
Clustering
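
A minimal sketch of the two categories on toy data. All records and names below are invented for illustration. The predictive half uses a simple 1-nearest-neighbour classifier, and the descriptive half counts co-occurring item pairs as a stand-in for association discovery; neither is a method prescribed by these slides.

```python
from collections import Counter
from itertools import combinations

# --- Predictive task: classification ---
# Hypothetical labelled records: (age, income in thousands) -> class label.
training = [
    ((25, 30), "no"),
    ((30, 42), "no"),
    ((45, 80), "yes"),
    ((50, 95), "yes"),
]

def predict(record):
    """Return the label of the nearest training record (1-NN)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training, key=lambda pair: sq_dist(pair[0], record))
    return nearest[1]

# --- Descriptive task: association discovery ---
# Hypothetical market baskets; count how often item pairs occur together.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(predict((48, 85)))           # label predicted for an unseen record
print(pair_counts.most_common(1))  # most frequent co-occurring pair
```

The classifier answers a question about an unknown value, while the pair counts only describe what is already in the data.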

Knowledge Discovery (KDD) Process

Data Mining vs. KDD


Knowledge Discovery in Databases (KDD): process of finding useful
information and patterns in data.
Data Mining: Use of algorithms to extract the information and
patterns derived by the KDD process.


Knowledge Discovery (KDD) Process


Data mining: core of the knowledge discovery process

[Figure: KDD process flow: Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]


KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.

Data Mining: Obtain desired results.


Interpretation/Evaluation: Present results to user in meaningful manner.
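
The five steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the temperature records, the field names, and the threshold-based "mining" step are all hypothetical placeholders, chosen so that each stage has something visible to do.

```python
# Hypothetical raw records from two sources with different formats.
source_a = [{"id": 1, "temp_c": 21.5}, {"id": 2, "temp_c": None}]
source_b = [{"id": 3, "temp_f": 75.2}, {"id": 4, "temp_f": 98.6}]

def select(*sources):
    """Selection: gather records from the various sources."""
    return [rec for src in sources for rec in src]

def preprocess(records):
    """Preprocessing: drop records whose measurement is missing."""
    return [r for r in records
            if r.get("temp_c") is not None or r.get("temp_f") is not None]

def transform(records):
    """Transformation: convert every record to a common format (Celsius)."""
    out = []
    for r in records:
        if r.get("temp_c") is not None:
            celsius = r["temp_c"]
        else:
            celsius = (r["temp_f"] - 32) * 5 / 9
        out.append({"id": r["id"], "temp_c": round(celsius, 1)})
    return out

def mine(records, threshold=30.0):
    """Data mining (placeholder): flag records above an assumed threshold."""
    return [r for r in records if r["temp_c"] > threshold]

def interpret(patterns):
    """Interpretation/Evaluation: present results in a readable form."""
    return [f"record {p['id']}: {p['temp_c']} C is unusually high"
            for p in patterns]

report = interpret(mine(transform(preprocess(select(source_a, source_b)))))
print(report)
```

Keeping each stage as its own function makes it easy to rerun one step (for example, a different transformation) without touching the others.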


KDD Process: Several Key Steps


Learning the application domain: relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining: summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge


KDD Process Ex: Web Log


Selection:

Select log data (dates and locations) to use

Preprocessing:

Remove identifying URLs


Remove error logs

Transformation:

Sessionize (sort and group)

Data Mining:

Identify and count patterns


Construct data structure

Interpretation/Evaluation:

Identify and display frequently accessed sequences.

Potential User Applications:


Cache prediction
Personalization
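
The web log steps above might look roughly like this in code. The log layout (user, timestamp, URL, HTTP status), the 30-minute session gap, and the choice to count consecutive page pairs are assumptions made for this sketch; only the error-removal part of preprocessing is shown.

```python
from collections import Counter
from itertools import groupby

# Hypothetical log entries: (user, timestamp in seconds, url, HTTP status).
log = [
    ("u1",    0, "/home",     200),
    ("u1",   40, "/products", 200),
    ("u1",   90, "/cart",     200),
    ("u2",   10, "/home",     200),
    ("u2",   70, "/products", 200),
    ("u2",   95, "/missing",  404),
    ("u1", 5000, "/home",     200),
    ("u1", 5060, "/products", 200),
]

SESSION_GAP = 1800  # assumed inactivity limit (30 minutes) between sessions

# Preprocessing: remove error entries.
clean = [e for e in log if e[3] < 400]

# Transformation: sessionize (sort by user and time, then group).
clean.sort(key=lambda e: (e[0], e[1]))
sessions = []
for user, entries in groupby(clean, key=lambda e: e[0]):
    current, last_time = [], None
    for _user, ts, url, _status in entries:
        if last_time is not None and ts - last_time > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(url)
        last_time = ts
    sessions.append(current)

# Data mining: count consecutive page pairs across all sessions.
pair_counts = Counter()
for pages in sessions:
    pair_counts.update(zip(pages, pages[1:]))

# Interpretation/Evaluation: display the most frequently accessed sequences.
for pair, count in pair_counts.most_common(2):
    print(" -> ".join(pair), count)
```

A cache-prediction application could, for instance, prefetch the page that most often follows the one currently being viewed.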


Knowledge Discovery Process


Example: the database of a magazine publisher which
sells five types of magazines on cars, houses, sports,
music and comics
Data mining:
Find interesting categorical properties

Questions:
What is the profile of a reader of a car magazine?
Is there any correlation between an interest in cars and an interest in
comics?

The knowledge discovery process consists of six stages

Data Selection
Select the information about people who have
subscribed to a magazine

Cleaning
Pollution: typing errors, people moving from one place to
another without notifying a change of address,
people giving incorrect information about
themselves
Pattern Recognition Algorithms

Cleaning
Lack of domain consistency
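
A minimal cleaning sketch, assuming two simple rules: records whose normalized name and address match are treated as duplicates, and birth years outside a plausible range are treated as missing. The records, field names, and valid range are hypothetical; real de-duplication usually needs fuzzier matching than this.

```python
# Hypothetical subscriber records; all field names and values are invented.
records = [
    {"name": "J. Smith",  "address": "12 Oak St",  "birth_year": 1975},
    {"name": "j smith",   "address": "12 oak st.", "birth_year": 1975},
    {"name": "A. Khan",   "address": "4 Lake Rd",  "birth_year": 1901},
    {"name": "M. Garcia", "address": "9 Hill Ave", "birth_year": 1988},
]

def normalize(text):
    """Lower-case and drop punctuation so near-identical strings match."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ")
    return kept.strip()

# De-duplication: keep the first record for each normalized (name, address).
seen, deduplicated = set(), []
for rec in records:
    key = (normalize(rec["name"]), normalize(rec["address"]))
    if key not in seen:
        seen.add(key)
        deduplicated.append(dict(rec))  # copy so the input stays untouched

# Domain consistency: treat implausible birth years as missing values.
# The valid range is an assumption chosen for this example.
MIN_YEAR, MAX_YEAR = 1910, 2010
for rec in deduplicated:
    if not MIN_YEAR <= rec["birth_year"] <= MAX_YEAR:
        rec["birth_year"] = None

print(deduplicated)
```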

Enrichment
Need extra information about the clients consisting of
date of birth, income, amount of credit, and whether or
not an individual owns a car or a house

Enrichment
The new information needs to be easily joined to the
existing client records
Extract more knowledge
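
Joining the extra information to the client records could be sketched as a left join on a shared key. The `client_id` key, the field names, and the values are assumptions for illustration; the slides do not say how the records are keyed.

```python
# Hypothetical client records held by the publisher.
clients = [
    {"client_id": 1, "name": "Smith",  "magazine": "car"},
    {"client_id": 2, "name": "Khan",   "magazine": "comics"},
    {"client_id": 3, "name": "Garcia", "magazine": "house"},
]

# Hypothetical extra data, keyed by the same client identifier.
extra = {
    1: {"birth_date": "1975-03-02", "income": 18500, "credit": 17800,
        "car_owner": "yes", "house_owner": "no"},
    2: {"birth_date": "1990-11-20", "income": 36400, "credit": 26600,
        "car_owner": "no", "house_owner": "no"},
}

# Left join: every client is kept; clients without a match get None
# for each of the new fields.
new_fields = ["birth_date", "income", "credit", "car_owner", "house_owner"]
enriched = []
for client in clients:
    addition = extra.get(client["client_id"], {})
    row = dict(client)
    for field in new_fields:
        row[field] = addition.get(field)
    enriched.append(row)

print(enriched)
```

A left join is used here so that no subscriber is silently lost when the extra source has no entry for them.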

Coding
We select only those records that have enough
information to be of value (row)
Project the fields in which we are interested
(column)
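
Row selection and column projection can be sketched as follows. Which fields count as "enough information" and which fields are of interest are assumptions for this example.

```python
# Hypothetical enriched records; None marks a missing value.
records = [
    {"client_id": 1, "name": "Smith", "birth_date": "1975-03-02",
     "income": 18500, "magazine": "car"},
    {"client_id": 2, "name": "Khan", "birth_date": None,
     "income": None, "magazine": "comics"},
    {"client_id": 3, "name": "Garcia", "birth_date": "1988-07-15",
     "income": 26600, "magazine": "house"},
]

# Fields we are interested in (assumed for this example).
wanted_fields = ["client_id", "birth_date", "income", "magazine"]

def has_enough_information(rec, required=("birth_date", "income")):
    """Row test: every required field must be present."""
    return all(rec.get(field) is not None for field in required)

# Select rows first, then project the columns of interest.
selected = [
    {field: rec[field] for field in wanted_fields}
    for rec in records
    if has_enough_information(rec)
]

print(selected)
```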

Coding
Code the information which is too detailed

Address to region
Birth date to age
Divide income by 1000
Divide credit by 1000
Convert cars yes-no to 1-0
Convert purchase date to month numbers starting from
1990
The way in which we code the information will determine the
type of patterns we find
Coding has to be performed repeatedly in order to get the best
results
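
The coding steps listed above can be sketched as one function per record. The city-to-region table, the reference date used for computing ages, and the reading of "month numbers starting from 1990" as January 1990 = month 1 are all assumptions made for this example.

```python
from datetime import date

REGION_OF_CITY = {"Islamabad": 1, "Lahore": 2, "Karachi": 3}  # hypothetical
REFERENCE_DATE = date(1995, 1, 1)  # assumed "today" for computing ages

def age_on(birth, today):
    """Whole years between birth and today."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years

def month_number(d, start_year=1990):
    """Months counted from January of start_year (January 1990 -> 1)."""
    return (d.year - start_year) * 12 + d.month

def code_record(rec):
    """Replace overly detailed fields with coarser codes."""
    return {
        "client_id": rec["client_id"],
        "region": REGION_OF_CITY[rec["city"]],            # address to region
        "age": age_on(rec["birth_date"], REFERENCE_DATE), # birth date to age
        "income": rec["income"] / 1000,                   # in thousands
        "credit": rec["credit"] / 1000,                   # in thousands
        "car_owner": 1 if rec["car_owner"] == "yes" else 0,
        "purchase_month": month_number(rec["purchase_date"]),
        "magazine": rec["magazine"],
    }

raw = {
    "client_id": 1, "city": "Lahore", "birth_date": date(1975, 3, 2),
    "income": 18500, "credit": 17800, "car_owner": "yes",
    "purchase_date": date(1994, 4, 15), "magazine": "car",
}
coded = code_record(raw)
print(coded)
```

Changing a single coding choice, such as grouping ages into bands instead of exact years, changes which patterns a mining algorithm is able to find, which is why the step is typically repeated.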

Coding
We are interested in the relationships between
readers of different magazines
Perform flattening operation
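
One way to read the flattening operation is: turn one row per subscription into one row per client, with a 0/1 attribute for each magazine type. The sketch below assumes that reading; the records and field names are hypothetical.

```python
MAGAZINES = ["car", "house", "sports", "music", "comics"]

# Hypothetical coded data: one row per subscription.
subscriptions = [
    {"client_id": 1, "age": 19, "magazine": "car"},
    {"client_id": 1, "age": 19, "magazine": "comics"},
    {"client_id": 2, "age": 34, "magazine": "house"},
    {"client_id": 2, "age": 34, "magazine": "sports"},
    {"client_id": 3, "age": 27, "magazine": "comics"},
]

# Flatten: one row per client, one binary column per magazine type.
flat = {}
for sub in subscriptions:
    row = flat.setdefault(
        sub["client_id"],
        {"client_id": sub["client_id"], "age": sub["age"],
         **{m: 0 for m in MAGAZINES}},
    )
    row[sub["magazine"]] = 1

flattened = list(flat.values())
for row in flattened:
    print(row)
```

With all of a reader's magazines on a single row, relationships such as "reads comics and cars" become simple conditions on one record.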

Data mining
We may find the following rules
A customer with credit > 13000 and aged between 22 and 31 who
has subscribed to a comics magazine at time T will very likely subscribe to a
car magazine five years later
The number of house magazines sold to customers with credit
between 12000 and 31000 living in region 4 is increasing
A customer with credit between 5000 and 10000 who reads a
comics magazine will very likely become a customer with credit
between 12000 and 31000 who reads a sports and a house
magazine after 12 years
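
A rule like the first one above can be checked against the flattened data by computing its support and confidence. These two measures are the standard ones for association rules and are brought in here as an assumption; the slides do not name an evaluation measure. The sketch also ignores the time lag in the rule, and the reader records are invented, with credit already coded in thousands.

```python
# Hypothetical flattened reader records (credit in thousands).
readers = [
    {"credit": 17.8, "age": 25, "comics": 1, "car": 1},
    {"credit": 14.2, "age": 29, "comics": 1, "car": 1},
    {"credit": 15.0, "age": 30, "comics": 1, "car": 0},
    {"credit":  8.5, "age": 24, "comics": 1, "car": 0},
    {"credit": 26.6, "age": 45, "comics": 0, "car": 1},
]

def antecedent(r):
    """Credit > 13000, aged 22 to 31, subscribed to a comics magazine."""
    return r["credit"] > 13 and 22 <= r["age"] <= 31 and r["comics"] == 1

def consequent(r):
    """Subscribed to a car magazine."""
    return r["car"] == 1

matching = [r for r in readers if antecedent(r)]
both = [r for r in matching if consequent(r)]

# Support: share of all readers satisfying both sides of the rule.
support = len(both) / len(readers)
# Confidence: share of readers matching the antecedent who also
# satisfy the consequent.
confidence = len(both) / len(matching)

print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule would normally be reported only when both values exceed thresholds chosen by the analyst.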

Knowledge Discovery Process

Business-Question-Driven Process

Data Mining and Business Intelligence

[Figure: layered pyramid, listed top to bottom; potential to support business decisions increases toward the top]
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

Architecture of a Typical Data Mining System

Data Mining: On What Kinds of Data?


Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications

Data streams and sensor data


Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases
Spatial data and spatiotemporal data
Multimedia database

Text databases
The World-Wide Web
Big Data (NoSQL)
