
Data Mining & Knowledge Discovery
Data mining is the task of discovering interesting patterns in large amounts of data, where the data may be stored in databases, data warehouses, or other information repositories.
A knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
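The steps of the knowledge discovery process can be pictured as a chain of small operations. The sketch below is only an illustration under stated assumptions: the records, field names, and the frequency threshold are made up, and a simple frequency count stands in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical records from two sources; None marks a missing value.
source_a = [{"city": "Chicago", "item": "phone"},
            {"city": "Chicago", "item": None}]
source_b = [{"city": "Toronto", "item": "phone"},
            {"city": "Toronto", "item": "computer"}]

# Data cleaning: drop records with a missing item.
cleaned = [[r for r in src if r["item"] is not None]
           for src in (source_a, source_b)]

# Data integration: combine the cleaned sources into one collection.
integrated = [r for src in cleaned for r in src]

# Data selection: keep only the attribute relevant to the analysis.
selected = [r["item"] for r in integrated]

# Data transformation: normalize values into a consistent form.
transformed = [item.upper() for item in selected]

# Data mining: a frequency count stands in for a real mining algorithm.
patterns = Counter(transformed)

# Pattern evaluation: keep patterns that meet an assumed threshold of 2.
interesting = {item: n for item, n in patterns.items() if n >= 2}

# Knowledge presentation: report the surviving patterns.
for item, n in interesting.items():
    print(f"{item} appears in {n} records")
```

Each stage consumes the output of the previous one, which mirrors the order in which the steps are listed above.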
[Figure: Data sources in Chicago, New York, Toronto, and Vancouver are cleaned, transformed, integrated, and loaded into a data warehouse, which clients access through query and analysis tools.]


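[Figure: The knowledge discovery process. Databases and flat files pass through cleaning and integration into a data warehouse; selection and transformation prepare the data for data mining; the resulting patterns pass through evaluation and presentation to become knowledge.]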
Data Warehouse & OLAP
A data warehouse is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making.
Online Analytical Processing (OLAP) makes use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Examples of OLAP operations include drill-down and roll-up.
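Roll-up and drill-down can be illustrated by aggregating the same facts at two levels of detail. The sketch below is a minimal illustration, not an OLAP engine: the sales figures are hypothetical, and the `aggregate` helper is an assumed name introduced here.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, quarter, item_type, units_sold).
sales = [
    ("Karachi", "Q1", "phone", 10),
    ("Karachi", "Q1", "computer", 5),
    ("Karachi", "Q2", "phone", 7),
    ("Lahore", "Q1", "phone", 4),
    ("Lahore", "Q2", "computer", 6),
]

def aggregate(facts, dimensions):
    """Sum units over the chosen dimension indexes
    (0 = city, 1 = quarter, 2 = item type)."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[3]
    return dict(totals)

# Detailed view: units per city and quarter.
by_city_quarter = aggregate(sales, (0, 1))

# Roll-up: drop the quarter dimension to get units per city.
by_city = aggregate(sales, (0,))

print(by_city)
print(by_city_quarter)
```

Moving from `by_city` to `by_city_quarter` corresponds to a drill-down, and the reverse direction corresponds to a roll-up.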
Data Cube in OLAP

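[Figure: A three-dimensional data cube with dimensions Location (cities: Karachi, Lahore), Time (quarters: Q1 to Q4), and Item Types (grocery, furniture, phone, computer). Visible cell values include 605, 825, 400, 440, and 345.]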


Types of Data Mining
Classification
Association
Characterization
Clustering

Classification
Classification builds a predictive model that assigns samples to classes.
The model may be represented as if-then rules, decision trees, neural networks, etc.
Algorithms: ID3, Bayesian classification
Example: The three classes in a sales campaign are good response, mild response, and no response, and the features of the items are price, brand, and category. A decision tree may identify price as the single factor that best distinguishes the three classes.
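As a rough illustration of how a decision tree learner such as ID3 might single out price, the sketch below computes the information gain of each feature on a tiny campaign table. The rows, attribute values, and function names are assumptions made up for illustration; they are not data from an actual campaign.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical campaign data: price fully determines the response here.
rows = [
    {"price": "low",  "brand": "A", "category": "X", "response": "good"},
    {"price": "low",  "brand": "B", "category": "Y", "response": "good"},
    {"price": "mid",  "brand": "A", "category": "Y", "response": "mild"},
    {"price": "mid",  "brand": "B", "category": "X", "response": "mild"},
    {"price": "high", "brand": "A", "category": "X", "response": "none"},
    {"price": "high", "brand": "B", "category": "Y", "response": "none"},
]

gains = {a: information_gain(rows, a, "response")
         for a in ("price", "brand", "category")}
best = max(gains, key=gains.get)
print(best, gains)
```

ID3 places the attribute with the highest gain at the root of the tree and then repeats the same calculation on each branch.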

Association
Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. It is widely used for market basket analysis.
Algorithm: Apriori
Support and confidence are the two measures used.
Confidence is a measure of how often the relationship holds true, e.g., what percentage of the time people who bought milk also bought eggs.
Support is the percentage of all transactions in which the two items occur together.
One can adjust the thresholds on these measures to discover items with a corresponding level of association and set marketing strategy accordingly.
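Support and confidence for the milk-and-eggs rule can be computed directly from a list of transactions. The sketch below assumes a small set of hypothetical baskets; the function names are introduced here for illustration and it does not implement the Apriori algorithm itself.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"milk", "bread"},
    {"eggs", "bread"},
    {"milk", "eggs", "butter"},
]

def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    both = support(baskets, set(antecedent) | set(consequent))
    return both / support(baskets, antecedent)

milk_eggs_support = support(transactions, {"milk", "eggs"})
milk_to_eggs_confidence = confidence(transactions, {"milk"}, {"eggs"})

print(milk_eggs_support)        # 3 of 5 baskets contain both items
print(milk_to_eggs_confidence)  # 3 of the 4 milk baskets also contain eggs
```

Raising the minimum support or confidence threshold keeps only the stronger rules, which is the adjustment described above.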
Characterization
Characterization is the discovery of interesting concepts, expressed in concise and succinct terms at generalized levels, for examining the general behavior of the data.
Example: In a database of graduate students, the departments of music, history, and literature can be generalized as the arts department and the rest as the science department.
Algorithms: Version Space Search, Attribute-Oriented Induction
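The generalization step in the graduate student example can be sketched as replacing each department with a higher-level concept and counting the results. This is a simplified illustration in the spirit of attribute-oriented induction, not a full implementation; the student records and the `generalize` helper are hypothetical.

```python
from collections import Counter

# Departments treated as "arts", following the example above;
# every other department is generalized to "science".
ARTS_DEPARTMENTS = {"music", "history", "literature"}

def generalize(department):
    """Map a department to its higher-level concept."""
    return "arts" if department in ARTS_DEPARTMENTS else "science"

# Hypothetical department values for five graduate students.
student_departments = ["music", "physics", "history", "literature", "chemistry"]

# Replace each value with its generalization and merge identical tuples by counting.
summary = Counter(generalize(d) for d in student_departments)
print(summary)
```

The resulting counts describe the data at the generalized level instead of listing every department separately.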

Clustering
A cluster is a group of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
Examples: distinct groups of customers, categories of emails in a mailing list database, and different categories of web usage from log files.
Clustering can serve as a preprocessing step for other algorithms such as classification and characterization.
Algorithm: K-means
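The K-means idea of alternating between assigning points to the nearest centroid and recomputing centroids can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: the points are hypothetical, and the first k points are used as initial centroids to keep the run deterministic, whereas practical implementations usually initialize randomly.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Return (centroids, clusters) after at most `iterations` rounds."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments no longer change the centroids
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

With these points, the three points near the origin end up in one cluster and the three points near (8, 8) in the other.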