
Department of Computer Science

582634 Data Mining


Exercises 1
18 March 2009

1. Present an example where data mining is crucial to the success of a business. What data mining
functions does this business need? Can they be performed alternatively by data query processing or
simple statistical analysis?

A suitable example can be found in practically any business that sells items or services. Such a
business would require both cross-market analysis (finding associations between product sales) and
customer profiling (what types of customers buy what products). Based on the acquired profiles,
predictions can be made about which marketing strategies would be most effective.

In theory this knowledge can be acquired with data query processing or simple statistical analysis,
but doing so would require a considerable amount of manual work by expert market analysts, both
because they would have to decide which queries to run and how to interpret the statistics, and
because of the huge amount of data.

2. What is the difference between discrimination and classification? Between characterization and
clustering? Between classification and prediction? For each of these pairs of tasks, how are they
similar?

Discrimination vs. classification


Data discrimination is a comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes. Classification is the process of
finding a set of models (or functions) that describe and distinguish data classes or concepts, for the
purpose of being able to use the model to predict the class of objects whose class label is unknown.
The model is based on the analysis of a set of training data (data objects whose class label is
known). (Han & Kamber)

The difference between discrimination and classification is that discrimination compares the general
features of the target class data with those of contrasting classes, whereas in classification the
goal is to build models that describe and distinguish data classes from each other. As for
similarity, both methods are interested in what differs between classes of objects.

Characterization vs. clustering


Data characterization is a summarization of the general characteristics or features of a target class of
data. In clustering the objects are grouped together based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity, e.g. for the purpose of generating
training data for classification. (Han & Kamber)

So the difference between characterization and clustering is that in characterization the general
features of the target class are deduced, whereas in clustering similar objects are simply grouped
together without any interest in their features at this point (rules can later be derived from the
formed clusters). You could also say that the output of the process is different: in
characterization it is a set of general features, whereas in clustering it is a set of object
classes. As for similarity, both methods are interested in what is common to some class of objects.

Classification vs. prediction

Classification is the process of finding a set of models (or functions) that describe and distinguish
data classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. In prediction, rather than predicting class labels, the main
interest (usually) is missing or unavailable data values. (Han & Kamber)

So, although classification is actually the step of finding the models, the goal of both methods is to
predict something about unknown data objects. The difference is that in classification that
“something” is the class of objects, whereas in prediction it is the missing data values.
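
The contrast can be illustrated with a minimal Python sketch. The two techniques used here, a nearest-neighbour classifier and a least-squares line fit, are chosen only for illustration and do not come from the course material; the function names and the toy data are hypothetical.

```python
def nearest_neighbor_class(training, x):
    """Classification: return the class label of the closest training point."""
    _, label = min(training, key=lambda pair: abs(pair[0] - x))
    return label


def fit_line(points):
    """Prediction: least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: (attribute value, class label).
labeled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
predicted_label = nearest_neighbor_class(labeled, 7.5)  # a class label

# Hypothetical numeric data: (x, y) pairs with a missing y for x = 4.
numeric = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
slope, intercept = fit_line(numeric)
predicted_value = slope * 4.0 + intercept  # a numeric value
```

The first function outputs one of a fixed set of class labels, the second a continuous value, which is exactly the distinction drawn above.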

3. Describe two challenges to data mining regarding performance issues.

The challenges to data mining regarding performance issues are


- efficiency and scalability, and
- parallelization.

To effectively extract information from a huge amount of data in databases, data mining algorithms
must be efficient and scalable. In other words, the running time and required runtime storage space
of a data mining algorithm must be predictable (as some, preferably linear, function of the amount
of data mined) and acceptable in large databases.

Parallel, distributed, and incremental algorithms are needed due to the huge size of many databases,
the wide distribution of data, and the computational complexity of some data mining methods. Such
algorithms divide the data into partitions that are processed in parallel. Then the results are merged.
The high cost of some data mining processes also promotes the need for incremental data mining
algorithms that incorporate database updates without having to mine the entire data again.
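
The partition-and-merge idea and the incremental update can be sketched as follows. This is only a minimal illustration under simplifying assumptions: the mining task is reduced to counting co-occurring item pairs, the partitions are processed one after another rather than truly in parallel, and the function names and toy transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(transactions):
    """Count how often each unordered item pair occurs within one partition."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


def merge_counts(partial_counts):
    """Merge the per-partition results into one global result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts together
    return total


# Hypothetical toy data, split into two partitions.
partition_a = [["bread", "milk"], ["bread", "butter", "milk"]]
partition_b = [["milk", "butter"], ["bread", "milk"]]

# Each count_pairs call is independent, so it could run on a separate worker.
merged = merge_counts([count_pairs(partition_a), count_pairs(partition_b)])

# Incremental step: a database update is folded in without re-mining old data.
new_batch = [["bread", "milk"]]
merged.update(count_pairs(new_batch))
```

Because the counts are additive, merging partial results gives the same answer as mining all the data at once. Not every data mining method decomposes this easily.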

(Source: Han & Kamber)

4. What is KDD? What about KDD Cup?

KDD stands for Knowledge Discovery in Databases, and means the extraction of interesting
(non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge
amounts of data.

It consists of an iterative sequence of the following steps (see slide 17 of lecture 1):
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are
used to present the mined knowledge to the user)
(Source: Han & Kamber)

Data mining is only one step in the process, but many people still associate the term with the whole
process.

KDD Cup is the leading data mining and knowledge discovery competition in the world, organized by
ACM SIGKDD, the Special Interest Group on Knowledge Discovery and Data Mining.

5. Present examples of data mining tasks where cosine similarity is useful.

Cosine similarity measures the similarity between two n-dimensional vectors as the cosine of the
angle between them. It is often used to compare documents (represented as keyword vectors) in text
mining. Another example is biological taxonomy (comparing DNA sequences).
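
As a minimal sketch of the text mining case, the following Python code computes the cosine similarity of two documents represented as term-count vectors. The helper names, the whitespace tokenization and the example sentences are assumptions made for illustration; returning 0.0 for a zero vector is a convention chosen here, since the cosine is undefined in that case.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here; the cosine is undefined
    return dot / (norm_u * norm_v)


def term_vectors(doc_a, doc_b):
    """Turn two documents into term-count vectors over a shared vocabulary."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    vocabulary = sorted(set(counts_a) | set(counts_b))
    return ([counts_a[t] for t in vocabulary],
            [counts_b[t] for t in vocabulary])


# Hypothetical example documents.
u, v = term_vectors("data mining finds patterns", "data mining finds rules")
similarity = cosine_similarity(u, v)
```

The measure depends only on the angle between the vectors, not on their lengths, so a long and a short document with the same relative keyword frequencies get a similarity of 1.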

6. What other similarity measures are suitable for data mining tasks?

For example
- Euclidean distance
- Minkowski distance
- Jaccard coefficient (for asymmetric binary variables)
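
The three measures above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the vectors are assumed to have equal length, and returning 0.0 when both binary vectors are all zeros is a convention chosen here.

```python
def minkowski_distance(u, v, p):
    """Minkowski distance of order p (p = 1: Manhattan, p = 2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def euclidean_distance(u, v):
    """Euclidean distance, i.e. the Minkowski distance with p = 2."""
    return minkowski_distance(u, v, 2)


def jaccard_coefficient(u, v):
    """Jaccard coefficient for asymmetric binary vectors (0/1 entries).

    Positions where both vectors are 0 are ignored, because a shared
    absence carries no information for asymmetric variables.
    """
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return both / either if either else 0.0  # convention for all-zero input
```

For example, `euclidean_distance([0, 0], [3, 4])` is 5 and `minkowski_distance([0, 0], [3, 4], 1)` is 7. Note that the first two are distances (smaller means more similar), while the Jaccard coefficient is a similarity (larger means more similar).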
