
What is Data Mining?

By
Saurabh Jain
General Concept of Data Mining
• Most organizations have accumulated a great
deal of data, but what they really want is
information.
• Data mining is the process of finding
correlations or patterns among dozens of
fields in large relational databases.
General Concept of Data Mining
(cont’d)

• Data mining uses sophisticated statistical
analysis and modeling techniques to
uncover patterns and relationships hidden in
an organization's databases.
• Data mining is one of several terms for this
activity; others include knowledge discovery,
knowledge extraction, data archaeology,
information harvesting and even data dredging.
How Does Data Mining Work?

Algorithms:
• Associations
• Classifications
• Sequential discovery
• Clustering
Technologies:
• Neural networks
• Decision trees
• Rule induction
• Data visualization
Algorithms
1. Associations
- This is used to identify items that occur
together in a given event or record.
- This technique is often used for market
analysis, to uncover rules hidden among the attributes.
Ex. “When people buy a hammer they also buy
nails 50% of the time.”
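The hammer-and-nails figure can be reproduced on a toy dataset. The sketch below is only an illustration: the baskets are made up, and the helper name `rule_confidence` is an assumption, not something from the slides. It counts how often nails appear in the baskets that contain a hammer.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


# Made-up shopping baskets, for illustration only.
baskets = [
    {"hammer", "nails"},
    {"hammer", "gloves"},
    {"nails", "glue"},
    {"hammer", "nails", "tape"},
    {"hammer"},
]

confidence = rule_confidence(baskets, "hammer", "nails")
print(f"hammer -> nails: {confidence:.0%}")
```

Four baskets contain a hammer and two of those also contain nails, so the rule holds half of the time on this data.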
Algorithms (cont’d)

2. Classifications
This is used to classify database records
into a number of predefined classes based
on certain criteria.
Ex. “Customers with excellent credit history
have a debt/equity ratio of less than 10%”
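One very simple way to obtain a rule like the one above is to search for the cutoff that best separates two predefined classes in labeled records. The sketch below assumes a single attribute (debt/equity ratio), two classes, and made-up records; the function names are hypothetical, and real classifiers are considerably more elaborate.

```python
def classify(ratio, cutoff):
    """Apply the rule: ratio below the cutoff -> "excellent", otherwise "other"."""
    return "excellent" if ratio < cutoff else "other"


def best_cutoff(records):
    """Return the observed ratio that, used as cutoff, labels the most records correctly."""
    best, best_correct = None, -1
    for cutoff in sorted({ratio for ratio, _ in records}):
        correct = sum(1 for ratio, label in records if classify(ratio, cutoff) == label)
        if correct > best_correct:
            best, best_correct = cutoff, correct
    return best


# Made-up (debt/equity ratio, credit history) pairs, for illustration only.
labeled = [
    (0.04, "excellent"), (0.07, "excellent"), (0.09, "excellent"),
    (0.12, "other"), (0.25, "other"), (0.40, "other"),
]

cutoff = best_cutoff(labeled)
print(cutoff, classify(0.05, cutoff), classify(0.30, cutoff))
```

Once the cutoff is learned, any new record can be assigned to one of the predefined classes by applying the same rule.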
Algorithms (cont’d)

3. Sequential Discovery
This helps identify patterns in time series.

Ex. “60% of customers buy TVs followed by
8mm camcorders.”
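A statement of this kind can be checked by looking at each customer's purchases in time order. The following is a minimal sketch on made-up purchase histories; `followed_by_rate` is a hypothetical helper, and it only considers the first time the earlier item was bought.

```python
def followed_by_rate(histories, first, second):
    """Among customers who bought `first`, the share who bought `second` afterwards."""
    bought_first = 0
    then_second = 0
    for purchases in histories:
        if first in purchases:
            bought_first += 1
            position = purchases.index(first)  # first purchase of `first`
            if second in purchases[position + 1:]:
                then_second += 1
    return then_second / bought_first if bought_first else 0.0


# Made-up purchase histories, each listed in time order.
histories = [
    ["tv", "camcorder"],
    ["tv", "cables", "camcorder"],
    ["camcorder", "tv"],   # camcorder came first, so this does not count
    ["tv"],
    ["tv", "camcorder", "tripod"],
    ["radio"],             # never bought a TV, so excluded from the base
]

rate = followed_by_rate(histories, "tv", "camcorder")
print(f"tv followed by camcorder: {rate:.0%}")
```

Unlike an association, the order matters here: the customer who bought the camcorder before the TV does not support the pattern.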
Algorithms (cont’d)

4. Clustering
This is used to segment the database into
different clusters, based on a set of
attributes.

Ex. “Group customers into segments to understand
the target market; the segments can then be used
to classify new data.”
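The slides do not name a clustering method. As one common choice, the sketch below uses plain k-means on two made-up attributes (age, yearly spend) with hand-picked starting centroids; all names and numbers are assumptions for illustration.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [(point[0] - cx) ** 2 + (point[1] - cy) ** 2 for cx, cy in centroids]
    return distances.index(min(distances))


def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            (sum(p[0] for p in members) / len(members),
             sum(p[1] for p in members) / len(members)) if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up customers: (age, yearly spend in thousands).
customers = [(22, 1.0), (25, 2.0), (24, 1.5), (58, 9.0), (61, 10.0), (60, 8.0)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])

# A new record is placed in the segment whose centroid is nearest.
new_customer_segment = nearest((27, 2.5), centroids)
print(centroids, new_customer_segment)
```

No classes are defined in advance here; the two segments emerge from the attributes, and a new record is then assigned to whichever segment it is closest to.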
Technologies
1. Neural networks
The net is trained on a
training dataset and
is then used to make
predictions.
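The train-then-predict cycle can be shown with the smallest possible network, a single artificial neuron (perceptron). This is a deliberately tiny sketch, not the kind of network used in practice; the training data (the logical OR function), the learning rate and the epoch count are assumptions chosen for illustration.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(dataset, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: nudge the weights whenever a prediction is wrong."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in dataset:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Training dataset: inputs and the desired output (logical OR).
training_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train(training_data)
predictions = [predict(weights, bias, inputs) for inputs, _ in training_data]
print(weights, bias, predictions)
```

After training, the same `predict` function is applied to inputs without consulting the targets, which is the "use it to make predictions" half of the slide.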
Technologies (cont’d)
2. Decision trees
• A way of representing
a series of rules that
lead to a value or class.
• The tree segregates the
data based on the
values of the variables.
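A tree of this kind can be written down as nested tests on variables, with a class at each leaf. The sketch below uses a hand-written tree rather than one learned from data; the variables, values and risk classes are made up for illustration.

```python
# Internal nodes test one variable; leaves hold a class.
tree = {
    "variable": "owns_home",
    "branches": {
        "yes": {"class": "low risk"},
        "no": {
            "variable": "employment",
            "branches": {
                "salaried": {"class": "medium risk"},
                "unemployed": {"class": "high risk"},
            },
        },
    },
}


def classify(node, record, path=()):
    """Follow the branch matching each tested variable until a leaf is reached.

    Returns the class and the list of tests passed on the way (the rule).
    """
    if "class" in node:
        return node["class"], list(path)
    value = record[node["variable"]]
    step = f"{node['variable']} = {value}"
    return classify(node["branches"][value], record, path + (step,))


label, rule = classify(tree, {"owns_home": "no", "employment": "salaried"})
print(label, "because", " and ".join(rule))
```

Each path from the root to a leaf reads as one rule, which is why the tree is described as a series of rules leading to a class.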
Technologies (cont’d)

2. Decision Trees (cont’d)
- Some algorithms (e.g. ID3) cannot use continuous
data directly; the values must first be binned
- Often used for model understanding rather than
prediction
Technologies (cont’d)
3. Rule induction
All possible patterns in the database are
systematically pulled out and then the
accuracy and the coverage are calculated.
Ex.
IF breakfast cereals, then milk: accuracy 90%, coverage 15%
IF Friday and male and diapers, then beer: accuracy 60%, coverage 0.1%
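The two figures attached to each rule can be computed by counting. The sketch below assumes the usual reading of the terms: coverage is the share of all records to which the IF part applies, and accuracy is the share of those records for which the THEN part also holds. The baskets and the helper `rule_stats` are made up for illustration.

```python
def rule_stats(records, condition, conclusion):
    """Return (accuracy, coverage) for the rule IF condition THEN conclusion.

    `condition` is a set of items that must all be present in a record.
    """
    matches = [r for r in records if condition <= r]
    coverage = len(matches) / len(records)
    if not matches:
        return 0.0, coverage
    accuracy = sum(1 for r in matches if conclusion in r) / len(matches)
    return accuracy, coverage


# Ten made-up baskets: four contain cereal, three of those also contain milk.
baskets = (
    [{"cereal", "milk"}] * 3
    + [{"cereal"}]
    + [{"milk"}] * 2
    + [{"bread"}] * 4
)

accuracy, coverage = rule_stats(baskets, {"cereal"}, "milk")
print(f"IF cereal, then milk: accuracy {accuracy:.0%}, coverage {coverage:.0%}")
```

A rule can be very accurate yet rarely applicable, or widely applicable yet often wrong, which is why both numbers are reported for every rule.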


Technologies (cont’d)
4. Data visualization
• Graphics tools are used to illustrate data
relationships.
• Gain a deeper, intuitive understanding of
the data by presenting a picture for users.
Current Limitations

• Cost, Time and Effort
– Data mining setup can be expensive.
– Many man-hours of development are needed.
– Some data mining functions involve steep learning
curves for end users.
– Extensive training and practice are still needed
for most users.
Current Limitations (cont’d)
• Low-end software
– These tools have limited query capabilities and
cannot perform multidimensional analyses.
– Many of the current methods are not truly
interactive and cannot incorporate prior
knowledge.
• Large databases
– The large size makes it hard to find
efficient algorithms for association rules.
Conclusion
• Data mining helps users find patterns and
relationships in the data.
• Data mining is a powerful tool, not magic.
• Organize the large volumes of data into some form of
categories. To avoid GIGO (garbage in, garbage out),
data should have minimal missing values.
• Current limitations: cost, time, effort, low-end
software, and large databases.
Any Questions?
Thank You!
