INTRODUCTION
DB Vs VLDB
The world contains an unimaginably vast amount of digital information, which is getting ever vaster ever more rapidly. Despite the abundance of tools to capture, process and share all this information (sensors, computers, mobile phones, etc.), it already exceeds the available storage space.
Data Growth
The amount of digital information increases tenfold every five years. Moore's law says that the processing power and storage capacity of computer chips double, or their prices halve, roughly every 18 months. Data are becoming the new raw material of business: an economic input almost on par with capital and labour.
Farecast, a part of Microsoft's search engine Bing, can advise customers whether to buy an airline ticket now or wait for the price to come down by examining 225 billion flight and price records.
Industry Need
In recent years Oracle, IBM, Microsoft and SAP spent more than $15 billion on buying software firms specialising in data management and analytics. This industry is estimated to be worth more than $100 billion and growing at almost 10% a year, roughly twice as fast as the software business as a whole. Google's search engine is partly guided by the number of clicks on an item to help determine its relevance to a search query. If the eighth listing for a search term is the one most people go to, the algorithm puts it higher up.
Chief information officers (CIOs) have become somewhat more prominent in the executive suite, and a new kind of professional has emerged: the data scientist, who combines the skills of software programmer, statistician and storyteller/artist.
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge!
Solution: data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
A credit card company must determine whether to authorize a credit card purchase by a customer.
Each purchase can be placed in one of the following classes:
1) Authorize
2) Ask for further identification
3) Do not authorize
4) Do not authorize, contact police
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Remote sensors on a satellite
Telescopes scanning the skies
Microarrays generating gene expression data
Scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data
Data mining may help scientists
Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their inside stories:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses or other information repositories.
DM Definition contd.
Data mining is the process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge from databases that is used to make crucial business decisions. - Gregory Piatetsky-Shapiro, Editor, KDnuggets.com
Look up a phone number in a phone directory
Query a Web search engine for information about Amazon
Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Fraud detection
Applications
Identify new galaxies by searching for sub-clusters
Find the affinity of visitors to pages and modify the layout
[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing the data mining function: summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
[Figure: layered view of data mining in business intelligence, top to bottom. Making Decisions (End User); Data Presentation: Visualization Techniques; Data Mining: Information Discovery; Data Exploration: Statistical Analysis, Querying and Reporting; Data Warehouses / Data Marts: OLAP, MDA (DBA); Data Sources: Paper, Files, Information Providers, Database Systems, OLTP]
[Figure: components of a data mining system, including Knowledge-base, Databases and Data Warehouse]
[Figure: data mining and related fields: Machine Learning, Visualization, Information Science, Other Disciplines]
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories:
  Object-oriented and object-relational databases
  Spatial databases
  Time-series data and temporal data
  Text databases and multimedia databases
  Heterogeneous and legacy databases
  WWW
Data can be associated with classes (e.g. classes of items: computers and printers) and concepts (e.g. concepts of customers: big spenders and budget spenders)
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Data characterization: summarizing the data of the class under study (the target class)
Data discrimination: comparing the target class with one or a set of comparative classes
Association
Mining frequent patterns
Frequent itemset: a set of items that frequently appear together in a transactional data set
Mining frequent patterns leads to the discovery of interesting associations and correlations within data
Threshold measures: support and confidence
Single-dimensional vs. multi-dimensional association
Single-dimensional: contains(T, computer) => contains(T, software) [support = 1%, confidence = 75%]
Multi-dimensional: age(X, 20..29) ^ income(X, 20..29K) => buys(X, PC) [support = 2%, confidence = 60%]
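Support and confidence can be computed directly from a list of transactions. The following is a minimal sketch, assuming each transaction is a Python set of item names; the function names and the five sample transactions are made up for illustration and do not come from the source.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = count_containing(transactions, antecedent)
    if with_antecedent == 0:
        return 0.0
    with_both = count_containing(transactions, set(antecedent) | set(consequent))
    return with_both / with_antecedent


# Hypothetical sample data, not taken from the source.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

# Rule: contains(T, computer) => contains(T, software)
print(support(transactions, {"computer", "software"}))       # 3 of 5 transactions
print(confidence(transactions, {"computer"}, {"software"}))  # 3 of the 4 with a computer
```

A rule is usually kept only if both values reach user-chosen minimum thresholds.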
Finding models (or functions) that describe and distinguish classes or concepts, and using the model for future prediction
The derived model is based on training data
E.g., classify countries based on climate, or classify cars based on gas mileage
Presentation of the model: decision tree, classification rule, neural network
Prediction: predict some unknown or missing numerical values, as in regression analysis
Both should be preceded by relevance analysis: identifying the attributes that contribute to the classification or prediction process
Decision trees
Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: example decision tree with internal nodes Salary < 1 M, Prof = teacher and Age < 30, and leaf labels Good and Bad]
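A decision tree of this kind can be read as nested if/else rules. The sketch below uses the node tests Salary < 1 M, Prof = teacher and Age < 30 and the leaf labels Good and Bad. Which branch leads to which leaf, and reading "1 M" as 1,000,000, are assumptions made for illustration only.

```python
def classify(salary, profession, age):
    """Walk a small decision tree and return the predicted class label.

    The branch layout below is assumed for illustration; only the node
    tests and the two labels are given.
    """
    if salary < 1_000_000:            # root node: Salary < 1 M
        if profession == "teacher":   # internal node: Prof = teacher
            return "Good"
        return "Bad"
    if age < 30:                      # internal node: Age < 30
        return "Bad"
    return "Good"


print(classify(salary=500_000, profession="teacher", age=45))
print(classify(salary=2_000_000, profession="lawyer", age=25))
```

Each path from the root to a leaf corresponds to one classification rule.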
Neural network
$$o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$
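A single unit of such a network takes a weighted sum of its inputs and passes it through the sigmoid function. This is a minimal sketch assuming plain Python lists for the weights and inputs; the function names and sample numbers are illustrative, not from the source.

```python
import math


def sigmoid(y):
    """Logistic function 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def neuron_output(weights, inputs):
    """Output of one unit: sigmoid of the weighted sum of the inputs."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum)


# Hypothetical weights and inputs: the weighted sum is 0, so the output is 0.5.
print(neuron_output([0.5, -0.5], [1.0, 1.0]))
```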
Cluster analysis
Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Clustering is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity
Facilitates taxonomy formation, i.e., organization of observations into a hierarchy of classes that group similar events together
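One common way to group unlabeled points is k-means clustering. The source does not name a specific algorithm, so the sketch below is only one possible choice. It assumes 2-D points as tuples, Euclidean distance via math.dist (Python 3.8 or later) and caller-supplied starting centroids; the house coordinates are made up.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its members; repeat until nothing changes."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical house locations forming two visible groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(houses, [(0.0, 0.0), (10.0, 10.0)])
print(clusters[0])
print(clusters[1])
```

Points within a cluster end up close to their own centroid (high intraclass similarity) and far from the other one (low interclass similarity).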
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
It can be considered as noise or an exception, but is quite useful in fraud detection and rare events analysis
Statistical methods: distribution model or distance measures
Deviation-based methods: examine the differences in the main characteristics of objects in a group
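A simple statistical approach is to flag values that lie far from the mean in units of standard deviation (a z-score test). This is a minimal sketch, assuming roughly bell-shaped data; the threshold of 2.0 and the sample purchase amounts are illustrative choices, not values from the source.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical purchase amounts with one unusually large transaction.
amounts = [20, 22, 19, 21, 20, 23, 500]
print(find_outliers(amounts))  # prints [500]
```

On very small samples a single extreme value inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.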
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Time series data analysis
Similarity-based data analysis
A data mining system/query may generate thousands of patterns; not all of them are interesting.
Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures:
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Can a data mining system find all the interesting patterns?
User-provided constraints and interestingness measures are used to focus the search
Ex: association rule mining
Can a data mining system find only the interesting patterns?
Approaches:
First generate all the patterns and then filter out the uninteresting ones
Generate only the interesting patterns: mining query optimization
CLASSIFICATION OF DM SYSTEMS
Classification according to
Kinds of databases mined (data models, types of data or applications)
Kinds of knowledge mined (data mining functionalities)
Kinds of techniques utilized (degree of user interaction involved or methods of data analysis employed)
Applications adapted (like finance, stock markets, telecommunications)
DM Task Primitives
Each user will have a DM task in mind
The task can be specified to the DM system in the form of a DM query
A DM query is defined in terms of DM task primitives
The primitives allow interactive communication with the DM system to direct the mining process
DM Primitives
Task-relevant data to be mined: relevant DB attributes or DWH dimensions of interest
Kinds of knowledge to be mined: functionalities
Background knowledge: concept hierarchy
Interestingness measures: support and confidence
Knowledge presentation and visualization: form of display
DM Query Language
To incorporate DM task primitives
Foundation on which a user-friendly graphical interface can be built
Example for DMQL:
Use database <dbname>
Use hierarchy <type of hierarchy> for <attrib>
Mine <functionality> as <name_of_pattern>
In relevance to <relevant attributes>
From <table names>
Where <condition>
Group by <attribute>
Having <min threshold>
Display as <visualization of result>
DM System Architecture
Coupling or integrating a DM system with a DB/DWH system
No coupling (the DM system does not use any function of the DB or DWH system)
Loose coupling (some facilities of the DB/DWH system are used)
Semi-tight coupling (a few DM primitives are provided as part of the DB/DWH system)
Tight coupling (the DM system is integrated into the DB/DWH system)
MAJOR ISSUES
Mining methodology and user interaction issues
Performance issues
Diversity of database types issues
Mining Methodology and User Interaction Issues
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
DM query languages and ad hoc data mining
Presentation and visualization of results
Handling noisy or incomplete data
Pattern evaluation: the interestingness problem
Performance Issues
Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining algorithms
Diversity of Database Types Issues
Handling of relational and complex types of data
Mining information from heterogeneous databases and global information systems
To conclude
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.