Business intelligence
DW & OLAP
Data mining
Data Warehousing and Data Mining Motivation
Classification, clustering, association, etc.
1. The new technology for understanding the past and predicting the future
2. A broad category of technologies that allows for
Gathering, storing, accessing and analyzing data to help business users make better decisions
Analyzing business performance through data-driven insight
3. A broad category of applications, which includes the activities of
Decision support systems
Query and reporting
OLAP
Statistical, forecasting and data mining
A data warehouse is simply a single, complete and consistent store of data obtained from a variety of
sources and made available to end users in a way they can understand and use in a business context
Many Definitions
o Search for valuable information (knowledge) from large volumes of data
o Exploration & analysis, by automatic or semi-automatic means, of large quantities of
data in order to discover meaningful patterns & rules
Alternative terms:
o Data analysis, pattern analysis, data dredging, data exploration, data understanding,
data summarization
o Data mining: a misnomer?
KDD process
Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston
area)
Group together similar documents returned by a search engine according to their context (e.g.
Amazon rainforest, Amazon.com)
a. Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
b. Traditional techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
Automated data collection tools and mature database technology lead to large amounts of data
stored in databases and data warehouses
and warehoused
b) Purchases at department/grocery stores
c) Bank/credit card transactions
a) Provide better, customized services for an edge (e.g. in Customer Relationship Management)
expression data
4) scientific simulations
1) Query processing
2) Reporting tool
3) Spreadsheet
4) Statistics
What we need is
assist humans in
1) Prediction Methods
2) Description Methods
3) Classification [Predictive]
4) Clustering [Descriptive]
Classification: Definition
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
1) Institution: a credit card company typically receives thousands of applications for new cards. The
application contains information: annual salary, any outstanding debts, age etc.
2) The problem: A decision has to be taken whether to accept or reject the applications.
3) Data mining task: To categorize applicants into those who have good credit, those who have bad
credit, and those who fall into a gray area (thus requiring further human analysis).
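To make the credit card example concrete, here is a minimal sketch of a classifier that predicts the class attribute from the other attributes. It uses a k-nearest-neighbour vote, which is only one of many possible models and is not named in these notes. The training records, the attribute units (thousands for salary and debt) and the rule that a split vote means "gray" are all assumptions made for illustration.

```python
from collections import Counter
from math import dist

# Hypothetical training records (invented for illustration):
# (annual_salary_k, outstanding_debt_k, age) -> class attribute
TRAIN = [
    ((90, 5, 45), "good"),
    ((75, 10, 38), "good"),
    ((82, 2, 50), "good"),
    ((30, 40, 23), "bad"),
    ((25, 35, 30), "bad"),
    ((35, 50, 27), "bad"),
]

def classify(applicant, train=TRAIN, k=3):
    """Label an applicant by a vote of the k nearest training records.

    A unanimous vote returns that class; a split vote returns "gray",
    meaning the application should go to a human analyst.
    """
    nearest = sorted(train, key=lambda rec: dist(rec[0], applicant))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label if count == k else "gray"

print(classify((85, 4, 41)))   # close to the "good" records
print(classify((28, 42, 25)))  # close to the "bad" records
print(classify((55, 22, 33)))  # between the two groups
```

The attributes are compared on their raw scales here; a real system would normally rescale them so that no single attribute dominates the distance.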
Clustering
a) Groups data into meaningful classes/clusters
b) Unsupervised learning
c) Motivation:
2) The first step in identifying useful patterns is to group data by their similarity
3) Once data are grouped (clustered), properties of each cluster can be analyzed
Given points in some space, group the points into a small number of clusters
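As an illustration of grouping points by similarity, the sketch below implements a bare-bones k-means loop. The notes do not prescribe a clustering algorithm, so k-means is an assumption, as are the sample points and the naive choice of the first k points as starting centroids.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeated assign-and-recentre steps."""
    centroids = list(points[:k])  # naive, deterministic start: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

No class labels are supplied anywhere, which is what makes this unsupervised: the groups emerge from distances alone, and their properties can be analyzed afterwards.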
Given a set of records, each of which contains some number of items from a given collection:
1) Produce dependency rules which will predict occurrence of an item based on occurrences of other
items.
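A dependency rule of this kind is usually judged by how often it applies and how often it holds. The sketch below computes two common measures, support and confidence, for a single candidate rule. These measures are not defined in the notes and are introduced here as an assumption; the transactions are invented for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all records containing both sides of the rule
    confidence = fraction of records containing the antecedent that also
                 contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Hypothetical market-basket records.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

support, confidence = rule_stats(transactions, {"milk", "diapers"}, {"beer"})
print(support, confidence)
```

Here the rule {milk, diapers} -> {beer} holds in 2 of the 5 records, and in 2 of the 3 records that contain both milk and diapers.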
Sequential Pattern Discovery: Definition
Given a set of objects, with each object associated with its own timeline of events, find rules that
predict strong sequential dependencies among different events.
Stock market
3) Computer Bookstore:
Medical field
1) If a patient underwent cardiac bypass surgery for blocked arteries (blood vessels) and later
developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within
the next 18 months.
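A rule like the one above can be checked against per-object timelines by counting how often the second event follows the first within a time window. The sketch below does this for one candidate rule. The timelines are invented purely for illustration and are not real patient data; the event names, the month-based time unit and the helper names are all hypothetical.

```python
def follows_within(timeline, first, second, max_gap):
    """True if `second` occurs after `first` and at most max_gap time units later."""
    return any(
        e1 == first and e2 == second and 0 < t2 - t1 <= max_gap
        for t1, e1 in timeline
        for t2, e2 in timeline
    )

def rule_confidence(timelines, first, second, max_gap):
    """Among timelines containing `first`, the fraction where `second` follows in time."""
    with_first = [tl for tl in timelines if any(e == first for _, e in tl)]
    if not with_first:
        return 0.0
    hits = sum(follows_within(tl, first, second, max_gap) for tl in with_first)
    return hits / len(with_first)

# Made-up timelines: lists of (month, event) pairs, one list per object.
timelines = [
    [(0, "bypass"), (8, "high_urea"), (20, "kidney_failure")],
    [(0, "bypass"), (30, "high_urea")],
    [(0, "high_urea"), (10, "kidney_failure")],
    [(0, "bypass"), (5, "high_urea")],
]

conf = rule_confidence(timelines, "high_urea", "kidney_failure", max_gap=18)
print(conf)
```

A full sequential pattern miner would search over many candidate event sequences; this sketch only scores one rule that has already been proposed.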
Deviation/Anomaly Detection
b) Applications:
2) Network Intrusion Detection