
Data Mining

Knowledge discovery in databases

By
M. Pranay Teja
Id no. 180030368.
Sec 25.

Data 3 1
What is Data Mining?
• Data mining is a capability to support the
recognition of previously unknown but
potentially useful relationships within large
databases/data warehouses.
• Aim: find useful patterns in the data.
• Uses statistical, mathematical, artificial
intelligence, and machine-learning techniques.

Data Mining Tools
• Data mining tools use statistical or rules-based
methods to identify patterns and create predictive
models.
• Tools look for patterns using a variety of models
– Statistical methods, e.g. correlation
– Decision trees
– Case-based reasoning
– Neural computing
– Intelligent agents
– Genetic algorithms

Text Mining
• Text Mining – Analyze text documents.
– Find hidden content
– Group by themes
– Determine relationships between documents

Process of Data Mining/ Knowledge Discovery

Stages, from raw data to results:
• Databases
• Data Cleaning and Data Integration
• Data Warehouse
• Selection
• Task-relevant Data
• Data Mining
• Pattern Evaluation
What does it let you do?
• Data mining automates the process of
sifting through historical data in order to
discover new information.
• Data Mining techniques enable users to
identify patterns and correlations within a
set of data
• These can then be used as predictive
models that anticipate behaviour or events
based on trends in the data.

Correlation versus Causation
• Correlation
– A statistical relation between two or more
variables such that changes in the value of one
variable are accompanied by changes in the value
of the other
• Causation
– Changes in one variable cause changes in another.
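A minimal sketch of how a correlation can be measured, assuming the Pearson coefficient as the statistic. The slide does not name one, and the two data series are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical monthly figures, invented for illustration only.
ice_cream_sales = [20, 35, 50, 65, 80]
sunburn_cases = [2, 4, 5, 7, 8]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"r = {r:.3f}")  # close to 1: the two series move together
```

A value of r near 1 only shows that the two variables change together. It does not show that one causes the other; here both could plausibly be driven by a third factor such as hot weather.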

What do you need for Data Mining?

• Massive data collection
• Powerful computers
• Data mining algorithms

Five Basic Operations
• Clustering
– identifies groups of items that share a particular characteristic
• Classification
– infers the defining characteristics of a certain group
• Association
– identifies relationships between events that occur at the same
time
• Sequencing
– identifies relationships over time
• Forecasting
– estimates future values based on patterns within large sets of
data

Clustering

• The process of identifying relationships between
similar records without any preconceived notion
of what that similarity might involve.
• Examples:
– Disease clusters
– Similarities in customers' telephone usage
• Often used as an exploratory exercise before
further data mining using a classification
technique.
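As an illustration of grouping records with no labels given in advance, here is a minimal sketch assuming k-means as the clustering method. The slide does not name an algorithm, and the telephone usage figures (daytime minutes, evening minutes) are invented.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means: returns (centroids, cluster index per point)."""
    centroids = [points[i] for i in range(k)]  # naive init: the first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical customers: (daytime minutes, evening minutes) per month.
usage = [(300, 20), (40, 210), (280, 35), (55, 190), (310, 25), (30, 220)]
centroids, labels = kmeans(usage, k=2)
print(labels)     # cluster index for each customer
print(centroids)  # the "typical" customer of each cluster
```

The algorithm is never told what "daytime caller" or "evening caller" means; the two groups emerge from the data. Taking the first k points as starting centroids keeps the sketch deterministic, whereas practical implementations usually pick starting points more carefully.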

Classification

• The DM system learns from examples of the
data how to partition or classify the data,
i.e. it formulates classification rules which
can be used for prediction.
– Example: a bank classifies customers and may
offer them differing levels of service, different
offers, different charges. Can build loan
approval models.
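The idea of learning a classification rule from examples can be sketched as follows. This assumes the simplest possible rule, a single income threshold; the loan history is invented and real loan approval models use many more attributes.

```python
def learn_threshold(examples):
    """Find the income threshold that best separates repaid from defaulted loans.

    examples: list of (income, repaid) pairs, where repaid is True or False.
    Returns (threshold, accuracy) for the rule "approve if income >= threshold".
    """
    best = (None, -1.0)
    for threshold in sorted({income for income, _ in examples}):
        correct = sum((income >= threshold) == repaid for income, repaid in examples)
        accuracy = correct / len(examples)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best

# Hypothetical history: (income in thousands, loan repaid?).
history = [(18, False), (22, False), (25, False), (31, True),
           (38, True), (27, True), (45, True), (20, False)]
threshold, accuracy = learn_threshold(history)

def approve(income):
    """Apply the learned rule to a new applicant."""
    return income >= threshold

print(threshold, accuracy)
print(approve(30), approve(21))
```

The rule is derived from past cases and then applied to new applicants, which is the prediction step the slide describes.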

Association

• Looks for links between records in a data set
– e.g. items purchased at the same time.
• Patterns can be identified to indicate probabilities
e.g.
• 500,000 transactions
• 20,000 nappies
• 30,000 beer
• 10,000 nappies + beer
– Beer and nappies occur together in 2% of transactions.
– “when people buy beer they buy nappies 1/3 of the
time”
– “when people buy nappies they buy beer 50% of the
time”
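The arithmetic behind these three figures can be checked with a short sketch. The names "support" and "confidence" are the usual terms for these ratios and are an addition here, not wording from the slide; the function name is hypothetical.

```python
def rule_stats(total, count_a, count_b, count_both):
    """Return (support, confidence A->B, confidence B->A) from raw counts."""
    support = count_both / total          # share of all transactions with both items
    conf_a_to_b = count_both / count_a    # of those with A, how many also have B
    conf_b_to_a = count_both / count_b    # of those with B, how many also have A
    return support, conf_a_to_b, conf_b_to_a

# Counts taken from the slide: A = beer, B = nappies.
support, beer_to_nappies, nappies_to_beer = rule_stats(
    total=500_000, count_a=30_000, count_b=20_000, count_both=10_000
)
print(f"beer and nappies together: {support:.1%}")            # 2.0%
print(f"beer buyers who buy nappies: {beer_to_nappies:.1%}")  # 33.3%
print(f"nappy buyers who buy beer: {nappies_to_beer:.1%}")    # 50.0%
```

The same pair of items gives two different probabilities depending on which item is taken as the starting point, which is why the two quoted statements differ.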

Sequential Analysis
• A form of association used to track
relationships over time.
– E.g. health insurance claims.
– E.g. 10% of customers who bought a tent bought a
backpack within one month.
– Weather patterns e.g. tidal wave in Hawaii follows
hurricane in N. Atlantic x% of the time.
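A figure such as the tent and backpack percentage could be computed along these lines. This is a sketch under stated assumptions: the purchase log, customer names and the 30-day window are invented, and only each customer's earliest tent purchase is considered.

```python
from datetime import date, timedelta

def followed_within(purchases, first, then, window_days=30):
    """Fraction of customers who bought `first` and bought `then` within the window."""
    window = timedelta(days=window_days)
    first_dates = {}  # customer -> earliest date they bought `first`
    for customer, item, day in purchases:
        if item == first and (customer not in first_dates or day < first_dates[customer]):
            first_dates[customer] = day
    if not first_dates:
        return 0.0
    followers = {
        customer
        for customer, item, day in purchases
        if item == then
        and customer in first_dates
        and timedelta(0) <= day - first_dates[customer] <= window
    }
    return len(followers) / len(first_dates)

# Hypothetical purchase log: (customer, item, purchase date).
log = [
    ("ann", "tent", date(2024, 5, 1)), ("ann", "backpack", date(2024, 5, 20)),
    ("bob", "tent", date(2024, 5, 3)), ("bob", "backpack", date(2024, 7, 15)),
    ("cat", "tent", date(2024, 6, 10)),
    ("dan", "backpack", date(2024, 6, 1)), ("dan", "tent", date(2024, 6, 5)),
]
share = followed_within(log, "tent", "backpack")
print(f"{share:.0%} of tent buyers bought a backpack within 30 days")
```

Unlike plain association, the order and timing of the events matter: a backpack bought before the tent, or long after it, does not count.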

Forecasting
• Concerns the prediction of continuous variables
e.g. sales, share values, stock market levels, oil
prices etc.
• Often done with regression functions: statistical
methods for examining the relationship between
variables in order to predict a future value.
• Two types:
– Forecasting single continuous value based on
unordered examples. e.g. predict income based on
personal details.
– Predict one or more values based on a sequential
pattern – time series forecasting.
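Both types can be illustrated with the same regression function. The sketch below assumes a straight-line least-squares fit, which is only one of many regression methods, and all figures are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Type 1: unordered examples (years of experience -> income in thousands).
experience = [1, 3, 5, 7, 9]
income = [22, 30, 38, 46, 54]
slope, intercept = fit_line(experience, income)
predicted_income = slope * 6 + intercept  # estimate for someone with 6 years

# Type 2: time series, where the predictor is simply the period index.
monthly_sales = [100, 104, 110, 113, 118]
trend, base = fit_line(range(len(monthly_sales)), monthly_sales)
next_month = trend * len(monthly_sales) + base  # extrapolate one period ahead

print(predicted_income, next_month)
```

In the first case the order of the examples is irrelevant; in the second the position in the sequence is the only input, so the fit is extrapolated beyond the observed range.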

Data Mining Tools in more detail
• Case-based Reasoning
– Use historical cases to identify patterns.
• Neural Computing
– Examine historical data for pattern recognition e.g.
identify potential customers for a new product.
• Intelligent agents
– Retrieve information from large databases.
• Other tools e.g. decision trees, rule induction,
data visualisation.

Some Key Application Areas
• Data mining is used in many different areas
• Two big areas are:
– Market analysis and management
• Initial data gathered from credit card transactions, loyalty cards,
discount coupons, customer complaint calls, lifestyle studies, focus groups

– Fraud detection and management

Examples
• Target marketing
– Find clusters of “model” customers who share the same characteristics: e.g.
interests, income
• Determine customer purchasing patterns over time
• Cross-market analysis uses associations/correlations between product
sales and makes predictions based on the association information
• Customer profiling:
– What types of customers buy what products
• Identifying customer requirements:
– Identifying the best products for different customers; using prediction to find
what factors will attract new customers

Fraud detection and management

• Used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Use historical data to build models of fraudulent
behavior and use data mining to help identify
similar instances
• Examples
– auto insurance: detect a group of people who stage
accidents to collect on insurance
– money laundering: detect suspicious money
transactions
– medical insurance: detect professional patients,
rings of doctors and rings of references

Text Mining

• Application of data mining to unstructured or less
structured files.
• Text mining operates with less structured
information and helps organisations to:
– Find hidden content of documents, including useful
relationships.
– Relate documents across unnoticed divisions, e.g.
customers in two product divisions have the same
characteristics.
– Group documents by themes e.g. all customers who
have similar complaints.
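Grouping documents by theme can be sketched with simple keyword overlap. This assumes Jaccard similarity on word sets and a greedy grouping rule, both chosen here for brevity; the complaint texts, the stopword list and the threshold are invented.

```python
STOPWORDS = {"the", "a", "is", "my", "was", "and", "to", "of", "in", "for", "it", "on"}

def keywords(text):
    """Lowercased words with punctuation stripped and stopwords removed."""
    words = (w.strip(".,!?:;").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}

def jaccard(a, b):
    """Overlap between two keyword sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_by_theme(docs, threshold=0.2):
    """Greedy grouping: put each document in the first group whose seed it resembles."""
    groups = []  # list of (seed keyword set, [document indices])
    for i, text in enumerate(docs):
        kw = keywords(text)
        for seed, members in groups:
            if jaccard(kw, seed) >= threshold:
                members.append(i)
                break
        else:
            groups.append((kw, [i]))  # no similar group found: start a new one
    return [members for _, members in groups]

# Hypothetical customer complaints.
complaints = [
    "The delivery was late and the box was damaged.",
    "My invoice shows the wrong billing amount.",
    "Late delivery again, box arrived damaged!",
    "Billing error: wrong amount on my invoice.",
]
themes = group_by_theme(complaints)
print(themes)  # indices of complaints that share a theme
```

No theme names are supplied in advance; complaints end up together because they share vocabulary. Practical text mining tools use richer representations than raw word sets, so this only shows the principle.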

Some more example applications by area

• Marketing: predicting which customers will respond to internet
banners or buy a product; segmenting customer
demographics.

• Banking: forecasting bad loans and fraudulent credit card
usage, credit card spending by new customers, and which
customers will respond best to new loan offers.

• Retailing and Sales: predicting sales, correct stock levels,
distribution schedules

• Manufacturing and Production: predicting when to expect
machinery failures, finding key factors that control the
optimisation of manufacturing capacity.
• Brokerage and Securities Trading: predicting when bond
prices will change, forecasting range of stock fluctuation
for particular issues, determining when to trade stock.
• Insurance: forecasting claim amounts, medical coverage
costs, classifying the most important elements that affect
medical coverage, predicting which customers will buy
new policies.
• Computer Hardware and Software: Predicting drive
failure, forecasting creation time for new chips, predicting
potential security violations.
• Government and Defence: Forecasting cost of moving
military equipment, testing strategies for potential
military engagements, predicting resource consumption.

• Airlines: capturing data on where customers are
flying and the destination of those who change
carriers mid-journey.

• Healthcare: correlating demographics of patients
with critical illnesses.

• Broadcasting: predicting which programs are best shown in prime
time and how to maximize returns by inserting
advertisements.

• Police: tracking crime patterns, locations, criminal
behaviour and attributes to help crack criminal
cases.

Problems with data mining
• Need clear business objectives and access to the
appropriate data.
• Need the right data.
– Bad data quality can lead to spurious results
• Models are not fail-safe.
• Privacy, property and other legal and ethical
issues.
• Companies must change mode of operation and
maintain the effort (e.g. loyalty programs such as
air miles).

Conclusion
• Data mining is an attractive-sounding
technology which is still evolving.
• The key is that the algorithms discover useful
relationships.
– Unlike standard research where researchers
hypothesise correlations and then search for
them.
• There are ethical issues:
– E.g. Criminal profiling.

