Chap1 DM Intrn

DATA MINING
INTRODUCTION
DB Vs VLDB

The world contains an unimaginably vast amount of digital information, which is growing ever vaster ever more rapidly. Despite the abundance of tools to capture, process and share all this information (sensors, computers, mobile phones, etc.), it already exceeds the available storage space.
Data Growth

The amount of digital information increases tenfold every five years. Moore's law says that the processing power and storage capacity of computer chips double, or their prices halve, roughly every 18 months. Data are becoming the new raw material of business: an economic input almost on a par with capital and labour.
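To compare the two growth rates quoted above, both can be converted to an annual growth factor. The sketch below only does that arithmetic; the variable names are illustrative and the two rates are taken from the text as given.

```python
# Compound-growth arithmetic for the two rates quoted in the text.
tenfold_years = 5       # data grows tenfold every five years
doubling_months = 18    # chip capacity doubles roughly every 18 months

# Annual growth factor implied by "tenfold every five years".
data_annual = 10 ** (1 / tenfold_years)

# Annual growth factor implied by "doubles every 18 months".
chip_annual = 2 ** (12 / doubling_months)

# Growth of chip capacity over the same five-year window.
chip_five_years = chip_annual ** tenfold_years

print(f"data:  x{data_annual:.2f} per year")
print(f"chips: x{chip_annual:.2f} per year, x{chip_five_years:.1f} per five years")
```

Taken at face value, the two quoted rates are close to each other on an annual basis.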
What is the use of VLDB?

Farecast, a part of Microsoft's search engine Bing, can advise customers whether to buy an airline ticket now or wait for the price to come down, by examining 225 billion flight and price records.
Industry Need

In recent years Oracle, IBM, Microsoft and SAP spent more than $15 billion on buying software firms specialising in data management and analytics. This industry is estimated to be worth more than $100 billion and growing at almost 10% a year, roughly twice as fast as the software business as a whole. Google's search engine is partly guided by the number of clicks on an item to help determine its relevance to a search query. If the eighth listing for a search term is the one most people go to, the algorithm puts it higher up.
Data Mining Professional

Chief information officers (CIOs) have become somewhat more prominent in the executive suite, and a new kind of professional has emerged: the data scientist, who combines the skills of software programmer, statistician and storyteller/artist.
Evolution of Database Technology

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases

Motivation: Necessity is the Mother of Invention

Data explosion problem

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining

  Data warehousing and on-line analytical processing
  Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
A real life scenario

A credit card company must determine whether to authorize a credit card purchase by a customer. Each purchase can be placed in one of the following classes:
1) Authorize
2) Ask for further identification
3) Do not authorize
4) Do not authorize; contact police
Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused:

  Web data, e-commerce
  Purchases at department/grocery stores
  Bank/credit card transactions

Computers have become cheaper and more powerful.

Competitive pressure is strong:

  Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint

Data is collected and stored at enormous speeds (GB/hour):

  Remote sensors on a satellite
  Telescopes scanning the skies
  Microarrays generating gene expression data
  Scientific simulations generating terabytes of data

Traditional techniques are infeasible for raw data.

Data mining may help scientists:

  in classifying and segmenting data
  in hypothesis formation
What Is Data Mining?

Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.

Alternative names and their inside stories:

  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
DATA MINING - Definition

Process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses or other information repositories.
DM Definition contd.
"Data mining is the process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge from databases that is used to make crucial business decisions." - Gregory Piatetsky-Shapiro, Editor, KDnuggets.com
What is (not) Data Mining?

What is not Data Mining?

  Look up a phone number in a phone directory
  Query a Web search engine for information about "Amazon"

What is Data Mining?

  Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
  Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Why Data Mining

Credit ratings/targeted marketing:

  Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  Identify likely responders to sales promotions

Fraud detection:

  Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

Customer relationship management:

  Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data mining helps extract such information.
Applications

Medicine: disease outcome, effectiveness of treatments

  Analyze patient disease history: find relationships between diseases

Molecular/Pharmaceutical: identify new drugs

Scientific data analysis:

  Identify new galaxies by searching for sub-clusters

Web site/store design and promotion:

  Find affinity of visitors to pages and modify layout

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Steps of a KDD Process

Learning the application domain:

  relevant prior knowledge and goals of the application

Creating a target data set: data selection

Data cleaning and preprocessing (may take 60% of the effort!)

Data reduction and transformation:

  find useful features, dimensionality/variable reduction, invariant representation

Choosing functions of data mining:

  summarization, classification, regression, association, clustering

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation:

  visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge
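The steps listed above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, field names, thresholds and the choice of "counting frequent items" as the mining function are all made up for the example.

```python
from collections import Counter

# Hypothetical raw records; one of them is incomplete.
raw = [
    {"customer": "c1", "item": "pc", "amount": 900},
    {"customer": "c2", "item": "pc", "amount": None},
    {"customer": "c3", "item": "printer", "amount": 150},
    {"customer": "c4", "item": "pc", "amount": 1100},
    {"customer": "c5", "item": "scanner", "amount": 80},
]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, min_amount):
    """Data selection: keep only the task-relevant records."""
    return [r for r in records if r["amount"] >= min_amount]

def transform(records):
    """Data reduction: keep only the attribute to be mined."""
    return [r["item"] for r in records]

def mine(items):
    """Data mining: count how often each item occurs."""
    return Counter(items)

def evaluate(counts, min_count):
    """Pattern evaluation: keep only patterns that occur often enough."""
    return {item: n for item, n in counts.items() if n >= min_count}

patterns = evaluate(mine(transform(select(clean(raw), min_amount=100))), min_count=2)
print(patterns)  # {'pc': 2}
```

Each function corresponds to one step, so a step can be swapped out (for example, a different mining function) without touching the others.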
Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top.]

Layers, from top to bottom:

  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Architecture of a Typical Data Mining System

[Figure: layered architecture, listed here from top to bottom.]

Graphical user interface
Pattern evaluation
Data mining engine
Database or data warehouse server
Data cleaning, data integration and filtering
Databases and data warehouse

Side component: Knowledge base
Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the centre, drawing on the disciplines listed below.]

Database Technology
Statistics
Machine Learning
Visualization
Information Science
Other Disciplines
Data Mining: On What Kind of Data?

Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories:
  Object-oriented and object-relational databases
  Spatial databases
  Time-series data and temporal data
  Text databases and multimedia databases
  Heterogeneous and legacy databases
  WWW
Data Mining Functionalities

Concept/class description:

Associating data with classes (class of items: computers and printers) and concepts (concept on customers: big spenders and budget spenders)
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Data characterization: summarizing the data of the class under study (the target class)
Data discrimination: comparing the target class with one or a set of comparative classes
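Characterization and discrimination can be illustrated with a few summary statistics. The sketch below assumes a hypothetical customer table with made-up values; using the mean of each numeric attribute as the "summary" is a simplification chosen for the example.

```python
from statistics import mean

# Hypothetical customers, already labelled with a concept (segment).
customers = [
    {"segment": "big spender", "age": 45, "yearly_spend": 5200},
    {"segment": "big spender", "age": 50, "yearly_spend": 4800},
    {"segment": "budget spender", "age": 25, "yearly_spend": 600},
    {"segment": "budget spender", "age": 31, "yearly_spend": 800},
]
NUMERIC = ("age", "yearly_spend")

def characterize(rows, segment):
    """Characterization: summarize the target class by its attribute means."""
    members = [r for r in rows if r["segment"] == segment]
    return {attr: mean([r[attr] for r in members]) for attr in NUMERIC}

def discriminate(rows, target, contrast):
    """Discrimination: difference between target-class and contrast-class summaries."""
    t = characterize(rows, target)
    c = characterize(rows, contrast)
    return {attr: t[attr] - c[attr] for attr in NUMERIC}

print(characterize(customers, "big spender"))
print(discriminate(customers, "big spender", "budget spender"))
```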
Association

Mining frequent patterns:

  Frequent itemset: a set of items that frequently appear together in a transactional data set
  Mining frequent patterns leads to the discovery of interesting associations and correlations within data
  Threshold measures: support and confidence

Single-dimensional vs. multi-dimensional association:

  contains(T, "computer") => contains(T, "software") [support = 1%, confidence = 75%]
  age(X, "20..29") ^ income(X, "20..29K") => buys(X, "PC") [support = 2%, confidence = 60%]
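Support and confidence can be computed directly from a transactional data set. The following is a minimal sketch with made-up transactions; the function names are illustrative. Support of an itemset is the fraction of transactions containing it, and the confidence of a rule A => C is support(A and C) divided by support(A).

```python
# Toy transactional data set; the item names are made up for illustration.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software", "scanner"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer => software [support = {rule_support:.0%}, confidence = {rule_confidence:.0%}]")
```

Note that `confidence` divides by the support of the antecedent, so it is only defined for antecedents that occur at least once.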

Classification and Prediction

Finding models (or functions) that describe and distinguish classes or concepts, and using the model for future prediction
The derived model is based on training data
E.g., classify countries based on climate, or classify cars based on gas mileage
Presentation of the model: decision tree, classification rules, neural network
Prediction: predict unknown or missing numerical values, e.g. by regression analysis
Both should be preceded by relevance analysis: identifying the attributes that contribute to the classification or prediction process

Decision trees

A tree in which internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

[Figure: example decision tree with internal tests "Salary < 1 M", "Prof = teacher" and "Age < 30", and leaves labelled Good or Bad.]
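A decision tree of this kind maps naturally onto nested conditionals: each test is an internal node and each returned label is a leaf. The sketch below uses the three tests and two labels from the slide, but the way the branches are arranged is an assumption, since the figure itself is not recoverable; the field names are hypothetical.

```python
def classify(applicant):
    """Walk a small hand-built tree; each `if` is an internal node, each return a leaf."""
    if applicant["salary"] < 1_000_000:            # root test (assumed position)
        if applicant["profession"] == "teacher":   # test on the low-salary branch (assumed)
            return "Good"
        return "Bad"
    if applicant["age"] < 30:                      # test on the high-salary branch (assumed)
        return "Bad"
    return "Good"

print(classify({"salary": 500_000, "profession": "teacher", "age": 40}))
```

In practice the tree is learned from training data rather than written by hand; the hand-written version only shows how a finished tree is applied to a new record.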
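Neural network

Set of nodes connected by directed weighted edges.

[Figure: left, a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3; right, a more typical NN with inputs x1, x2, x3, hidden nodes and output nodes.]

Basic NN unit:

$$o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$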
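A basic unit computes a weighted sum of its inputs and passes it through the sigmoid function 1 / (1 + e^(-y)). A minimal sketch of that computation follows; the weights and inputs are arbitrary example values.

```python
import math

def sigmoid(y):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def unit_output(weights, inputs):
    """Output of one unit: sigmoid of the weighted sum of its inputs."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(activation)

# Arbitrary example values for w1..w3 and x1..x3.
o = unit_output([0.5, -0.25, 0.1], [1.0, 2.0, 3.0])
print(round(o, 4))
```

A larger network chains such units: the outputs of the hidden nodes become the inputs of the output nodes.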
Cluster analysis

Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Clustering is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity
Facilitates taxonomy formation, i.e., the organization of observations into a hierarchy of classes that group similar events together
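The slide does not name a clustering algorithm. As one common example, the sketch below runs k-means on one-dimensional data: values are assigned to the nearest centre and each centre is then moved to the mean of its members, which tends to keep members of a cluster close together. The house prices and starting centres are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to the nearest centre, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

house_prices = [95, 100, 105, 290, 300, 310]   # made-up values, in thousands
centers, clusters = kmeans_1d(house_prices, centers=[100, 200])
print(centers)
print(clusters)
```

No class labels are supplied; the two groups emerge from the data alone.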
Outlier analysis

Outlier: a data object that does not comply with the general behavior of the data
It can be considered noise or an exception, but is quite useful in fraud detection and rare-event analysis
Statistical methods: distribution models or distance measures
Deviation-based methods: examine the differences in the main characteristics of objects in a group
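One simple statistical method is to flag values that lie far from the mean, measured in standard deviations (a z-score). The sketch below assumes that approach; the threshold of 2.0 and the transaction amounts are illustrative choices, not values from the slide.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:          # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

amounts = [20, 22, 19, 21, 23, 20, 500]   # made-up transaction amounts
print(zscore_outliers(amounts))  # [500]
```

In a fraud detection setting the flagged record would be the interesting one, not noise to be discarded.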
Trend and evolution analysis

Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Time series data analysis
Similarity-based data analysis
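Trend analysis by regression can be sketched as fitting a straight line to a time series by ordinary least squares, using the time index 0, 1, 2, ... as the predictor. The series below is made up, and the function assumes at least two observations.

```python
def fit_trend(series):
    """Least-squares line y = slope * t + intercept over t = 0, 1, ..., n-1."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    intercept = y_mean - slope * t_mean
    return slope, intercept

monthly_sales = [10, 12, 14, 16, 18]   # made-up values
slope, intercept = fit_trend(monthly_sales)
print(slope, intercept)  # 2.0 10.0
```

The deviation of each observation from the fitted line can then be inspected to spot unusual periods.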
Discovered Patterns Interestingness

A data mining system/query may generate thousands of patterns; not all of them are interesting.

  Suggested approach: human-centered, query-based, focused mining

Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.

Objective vs. subjective interestingness measures:

  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
Can We Find All and Only Interesting Patterns?

Find all the interesting patterns: completeness

  Can a data mining system find all the interesting patterns?
  User-provided constraints and interestingness measures are used to focus the search
  Example: association rule mining

Search for only interesting patterns: optimization

  Can a data mining system find only the interesting patterns?
  Approaches:
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns (mining query optimization)
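The two approaches can be contrasted on frequent itemsets, taking "interesting" to mean "support at least min_support" (an assumption made for this sketch). The first function enumerates every itemset and filters afterwards; the second grows itemsets level by level and only extends those already found frequent, which is the idea behind Apriori-style algorithms. The data and function names are illustrative.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def generate_then_filter(transactions, min_support):
    """Approach 1: enumerate every possible itemset, then drop the infrequent ones."""
    items = sorted(set().union(*transactions))
    candidates = [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
    ]
    return {c for c in candidates if support(c, transactions) >= min_support}

def generate_only_frequent(transactions, min_support):
    """Approach 2: grow itemsets level by level, extending only frequent ones."""
    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support}
    frequent = set()
    size = 1
    while level:
        frequent |= level
        size += 1
        # Candidates are unions of two frequent sets that are exactly one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        level = {c for c in candidates if support(c, transactions) >= min_support}
    return frequent

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
all_then_filter = generate_then_filter(transactions, min_support=0.5)
only_frequent = generate_only_frequent(transactions, min_support=0.5)
print(all_then_filter == only_frequent)
```

Both return the same itemsets here, but the second never evaluates a candidate built from an infrequent itemset, which matters once the number of items grows.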
CLASSIFICATION OF DM SYSTEMS

Classification according to:

  Kinds of databases mined (data models, types of data or applications)
  Kinds of knowledge mined (data mining functionalities)
  Kinds of techniques utilized (degree of user interaction involved or methods of data analysis employed)
  Applications adapted (e.g. finance, stock markets, telecommunications)
DM Task Primitives

Each user will have a DM task in mind
The task can be specified to the DM system in the form of a DM query
A DM query is defined in terms of DM task primitives
The primitives allow interactive communication with the DM system to direct the mining process
DM Primitives

Task-relevant data to be mined: relevant database attributes or DWH dimensions of interest
Kinds of knowledge to be mined: functionalities
Background knowledge: concept hierarchies
Interestingness measures: support and confidence
Knowledge presentation and visualization: form of display

DM Query Language

Incorporates the DM task primitives
Foundation on which a user-friendly graphical interface can be built
Example for DMQL:

Use database <dbname>
Use hierarchy <type of hierarchy> for <attrib>
Mine <functionality> as <name_of_pattern>
In relevance to <relevant attributes>
From <table names>
Where <condition>
Group by <attribute>
Having <min threshold>
Display as <visualization of result>
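To show how the task primitives fill the slots of such a query, the sketch below models a mining task as a small Python data class and renders it as text following the template. It is not a DMQL implementation: the class, its field names and every example value (database, table, attributes, threshold) are hypothetical, and the hierarchy and group-by clauses are left out for brevity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MiningTask:
    database: str                   # task-relevant data
    functionality: str              # kind of knowledge to be mined
    pattern_name: str
    relevant_attributes: List[str]
    tables: List[str]
    condition: str
    min_threshold: str              # interestingness measure
    display_as: str                 # knowledge presentation

    def to_query(self) -> str:
        """Render the primitives in the order used by the template."""
        return "\n".join([
            f"use database {self.database}",
            f"mine {self.functionality} as {self.pattern_name}",
            f"in relevance to {', '.join(self.relevant_attributes)}",
            f"from {', '.join(self.tables)}",
            f"where {self.condition}",
            f"having {self.min_threshold}",
            f"display as {self.display_as}",
        ])

# All values below are made up for illustration.
task = MiningTask(
    database="sales_db",
    functionality="associations",
    pattern_name="buying_patterns",
    relevant_attributes=["item", "age", "income"],
    tables=["purchases"],
    condition="year = 2004",
    min_threshold="support >= 0.02",
    display_as="table",
)
query = task.to_query()
print(query)
```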
DM System Architecture

Coupling or integrating a DM system with a DB/DWH system:

  No coupling (the DM system does not use any function of the DB or DWH system)
  Loose coupling (some facilities of the DB/DWH system are used)
  Semi-tight coupling (a few DM primitives are provided as part of the DB/DWH system)
  Tight coupling (the DM system is integrated into the DB/DWH system)
MAJOR ISSUES

Mining methodology and user interaction issues
Performance issues
Diversity of database types issues
Mining methodology & user interaction issues

Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
DM query languages and ad hoc data mining
Presentation and visualization of results
Handling noisy or incomplete data
Pattern evaluation: the interestingness problem
Performance Issues

Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining algorithms
Diversity of DB types issues

Handling of relational and complex types of data
Mining information from heterogeneous databases and global information systems
To conclude

Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
