
Data mining questions

Lecture 1
1. What is classification? How is classification done?
Classification is a data mining function that assigns items in a collection to target
categories or classes. The goal of classification is to accurately predict the target class
for each case in the data. For example, a classification model could be used to identify
loan applicants as low, medium, or high credit risks.

2. What are test and training data? What is a training set?

A classification model is built from a given collection of records (the training set), where each record contains a set of attributes and one of those attributes is the class label.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
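A minimal sketch of this workflow with scikit-learn, assuming it is installed; the iris data set stands in for any labeled collection of records:

# Build a classifier on a training set and validate it on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)              # records with attributes and a class label

# Divide the given data set into a training set (to build the model)
# and a test set (to validate it).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)

y_pred = model.predict(X_test)                 # predict the target class of each test case
print("Test-set accuracy:", accuracy_score(y_test, y_pred))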

3. Applications of classification
Direct Marketing - Reduce the cost of mailing by targeting the set of customers likely to buy a new cell-phone product.
Fraud Detection - Predict fraudulent cases in credit card transactions.
Customer Attrition - Predict whether a customer is likely to be lost to a competitor.
Sky Survey Cataloging - Predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescope survey images.

5. What is clustering?
Given a set of data points, each having a set of attributes, and a similarity
measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
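A minimal sketch of this idea with k-means in scikit-learn; the points and the choice of three clusters are illustrative assumptions:

# Group points so that points within a cluster are more similar (closer)
# to one another than to points in other clusters.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0],
                   [5.0, 5.2], [5.1, 4.9],
                   [9.0, 9.1], [8.9, 9.0]])

# Euclidean distance serves as the (dis)similarity measure here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment of each data point
print(kmeans.cluster_centers_)  # centroid of each cluster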

6. Applications of clustering
Market Segmentation - Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. The approach is to find clusters of similar customers, and to measure the clustering quality by observing the buying patterns of customers in the same cluster versus those from different clusters.
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the
important terms appearing in them.
Approach: Identify frequently occurring terms in each document, form a similarity measure based on the frequencies of different terms, and use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a new document or
search term to clustered documents.
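A hedged sketch of this approach, using TF-IDF term weights from scikit-learn as the frequency-based representation; the documents are made-up examples:

# Cluster documents by the important terms appearing in them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["stocks fell as markets slid",
        "markets and stocks rallied today",
        "the team won the football match",
        "football season ends with a final match"]

# Represent each document by the (weighted) frequencies of its terms.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# k-means on these vectors groups documents that share important terms.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents on the same topic should share a label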

7. What is association rule discovery?
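Given a set of records, each containing some set of items, association rule discovery produces dependency rules that predict the occurrence of an item based on the occurrences of other items; market-basket analysis (e.g., the rule {Diaper} -> {Beer}) is the classic example. A minimal sketch of the two standard rule-quality measures, support and confidence; the transactions and the rule are made-up:

# Compute support and confidence for the rule {diaper} -> {beer}.
transactions = [
    {"milk", "bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"milk", "bread"},
    {"bread", "diaper", "beer"},
]
antecedent, consequent = {"diaper"}, {"beer"}
n = len(transactions)

# support(X -> Y): fraction of transactions containing both X and Y
support = sum(1 for t in transactions if antecedent | consequent <= t) / n
# confidence(X -> Y): support(X and Y) divided by support(X)
confidence = support / (sum(1 for t in transactions if antecedent <= t) / n)
print(f"support={support:.2f}, confidence={confidence:.2f}")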

9. Regression
Regression predicts the value of a continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
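A minimal regression sketch with scikit-learn; the made-up data follows a roughly linear relationship:

# Predict a continuous target from a predictor variable.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # predictor variable
y = np.array([2.1, 3.9, 6.2, 7.8])           # continuous target, roughly y = 2x

reg = LinearRegression().fit(X, y)
print(reg.predict([[5.0]]))                  # predicted value for a new case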
10. Deviation/anomaly detection
Anomaly detection identifies significant deviations from normal behavior, as in credit card fraud or network intrusion detection.
11. Challenges of data mining

Scalability: Because of advances in data generation and collection, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, then they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner.

High Dimensionality: It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality.
Heterogeneous and Complex Data: Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes.

Data Ownership and Distribution: Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security issues.
Non-traditional Analysis: The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed experiment and often represent opportunistic samples of the data rather than random samples.

12. What is a test set/test data?
The test set is the portion of the given data held out from model building and used to estimate the model's accuracy (see question 2 above).

Lecture 2
1. How does data quality degrade?
Examples of data quality problems:

Noise and outliers

Missing values

Duplicate data
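A hedged sketch of spotting these problems with pandas; the small DataFrame and the plausibility threshold are made-up:

# Detect noise/outliers, missing values, and duplicate records.
import pandas as pd

df = pd.DataFrame({"age":    [25, 25, 230, None],        # 230 is noise, None is missing
                   "income": [50000, 50000, 60000, 55000]})

print(df[df["age"] > 120])                    # implausible values (noise and outliers)
print(df[df["age"].isna()])                   # missing values
df = df.drop_duplicates()                     # remove duplicate data
df["age"] = df["age"].fillna(df["age"].median())  # impute remaining missing values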

2. What is data pre-processing?


Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of
resolving such issues. Data preprocessing prepares raw data for further processing.
Data preprocessing is used in database-driven applications such as customer relationship
management and in rule-based applications (such as neural networks).
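A minimal sketch of a preprocessing pipeline with scikit-learn that turns raw, incomplete data into a consistent numeric format; the records are illustrative:

# Transform raw data: fill in missing values, then standardize the attributes.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

raw = np.array([[1.0, 200.0],
                [np.nan, 180.0],    # incomplete record
                [3.0, np.nan]])

prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # resolve missing values
    ("scale", StandardScaler()),                  # put attributes on a common scale
])
print(prep.fit_transform(raw))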

3. Feature/attribute subset selection


Another way to reduce the dimensionality of the data:
Redundant features - duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid.
Irrelevant features - contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's GPA.
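A hedged sketch of dropping one redundant and one irrelevant feature with pandas, using exactly the two examples above; the DataFrame is made-up:

# sales_tax duplicates the information in price; student_id is irrelevant.
import pandas as pd

df = pd.DataFrame({"price":      [100, 200, 300, 400],
                   "sales_tax":  [8.0, 16.0, 24.0, 32.0],   # redundant: 8% of price
                   "student_id": [17, 42, 8, 23],           # no predictive signal
                   "gpa":        [3.1, 3.5, 2.9, 3.8]})

print(df["price"].corr(df["sales_tax"]))    # correlation of 1.0 confirms redundancy
reduced = df.drop(columns=["sales_tax", "student_id"])
print(reduced.columns.tolist())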
4. Discretization and Binarization

Some data mining algorithms, especially certain classification algorithms, require that the data be in the form of categorical attributes. Algorithms that find association patterns require that the data be in the form of binary attributes. Thus, it is often necessary to transform a continuous attribute into a categorical attribute (discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (binarization). Additionally, if a categorical attribute has a large number of values (categories), or some values occur infrequently, then it may be beneficial for certain data mining tasks to reduce the number of categories by combining some of the values.
As with feature selection, the best discretization and binarization approach is the one that "produces the best result for the data mining algorithm that will be used to analyze the data." It is typically not practical to apply such a criterion directly. Consequently, discretization or binarization is performed in a way that satisfies a criterion that is thought to have a relationship to good performance for the data mining task being considered.
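A minimal sketch of both transformations with scikit-learn (version 1.2 or later for the sparse_output argument); the bin count and the data are illustrative:

# Discretization: continuous attribute -> categorical (three equal-width bins).
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

ages = np.array([[3.0], [17.0], [25.0], [44.0], [68.0]])
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(ages).ravel())

# Binarization: categorical attribute -> one binary attribute per value.
colors = np.array([["red"], ["green"], ["red"], ["blue"], ["green"]])
onehot = OneHotEncoder(sparse_output=False)
print(onehot.fit_transform(colors))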

5. Feature creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.
Three general methodologies:

Feature Extraction - domain-specific

Mapping Data to New Space

Feature Construction - combining features
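A hedged sketch of feature construction and of mapping data to a new space with pandas; the attributes and the constructed features are made-up (feature extraction is domain-specific and omitted):

# Construct new attributes that capture information more efficiently.
import numpy as np
import pandas as pd

df = pd.DataFrame({"mass": [2.0, 8.0, 18.0], "volume": [1.0, 2.0, 3.0]})

df["density"] = df["mass"] / df["volume"]   # feature construction: combine features
df["log_mass"] = np.log(df["mass"])         # mapping to a new space: log transform
print(df)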

6. What is stratified sampling?


Split the data into several partitions; then draw random samples from each partition
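A minimal sketch with scikit-learn, partitioning by class label so the sample preserves class proportions; the imbalanced labels are a made-up example:

# Draw a stratified 50% sample from data with an 80/20 class split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)

X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)

print(np.bincount(y_sample) / len(y_sample))   # still 80/20, as in the full data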
7. What are aggregation and sampling?
Aggregation combines two or more attributes (or objects) into a single attribute (or object). Sampling selects a subset of the data objects for analysis, which is useful when processing the entire data set is too expensive or time-consuming.
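A hedged sketch of both operations with pandas; the sales records are made-up:

# Aggregation: combine many objects (daily sales) into one object per month.
import pandas as pd

sales = pd.DataFrame({"month":  ["Jan", "Jan", "Feb", "Feb"],
                      "amount": [100, 150, 200, 250]})
print(sales.groupby("month", sort=False)["amount"].sum())

# Sampling: analyze a representative subset instead of the full data set.
print(sales.sample(frac=0.5, random_state=0))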
