1. Present an example where data mining is crucial to the success of a business. What data mining
functions does this business need? Can they be performed alternatively by data query processing or
simple statistical analysis?
A suitable example can be found in practically any business that sells items or services. Such a
business would require both cross-market analysis (finding associations between product sales) and
customer profiling (determining what types of customers buy which products). Based on the acquired
profiles, predictions can be made about which marketing strategies would be most effective.
In theory this knowledge could be acquired with data query processing or simple statistical analysis,
but that would require a considerable amount of manual work by expert market analysts, both to
decide which queries to run and to interpret the statistics, and the huge amount of data makes this
impractical.
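The kind of cross-market analysis mentioned above can be illustrated with a small sketch. The transactions below are invented toy data; `support` and `confidence` are the standard association-rule measures:

```python
from collections import Counter  # not strictly needed, kept for extensions

# Hypothetical transaction data: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Association "diapers => beer": co-occurrence in 3 of 5 baskets,
# and beer appears in 3 of the 4 baskets that contain diapers.
print(support({"diapers", "beer"}, transactions))
print(confidence({"diapers"}, {"beer"}, transactions))
```

This also shows why plain queries do not scale: each candidate association would need its own hand-written query, whereas a mining algorithm enumerates them systematically.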
2. What is the difference between discrimination and classification? Between characterization and
clustering? Between classification and prediction? For each of these pairs of tasks, how are they
similar?
The difference between discrimination and classification is that discrimination compares the general
features of the target class data with those of one or more contrasting classes, whereas classification
builds models that describe and distinguish data classes from each other. As for similarity, both
methods are interested in what differs between some classes of objects.
The difference between characterization and clustering is that characterization deduces the general
features of a target class, whereas clustering simply groups similar objects together without any
interest in their features at that point (rules can later be derived from the formed clusters). You
could also say that the output of the process differs: in characterization it is a set of general
features, whereas in clustering it is a set of object classes. As for similarity, both methods are
interested in what is common to some class of objects.
Classification vs. prediction
Classification is the process of finding a set of models (or functions) that describe and distinguish
data classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. In prediction, rather than predicting class labels, the main
interest (usually) is missing or unavailable data values. (Han & Kamber)
So, although classification is strictly the step of finding the models, the goal of both methods is to
predict something about unknown data objects. The difference is that in classification that
“something” is the class of the objects, whereas in prediction it is the missing data values.
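The contrast can be made concrete with a minimal nearest-neighbour sketch (all data values are invented for illustration): `classify` outputs a class label, while `predict_value` outputs a missing numeric value.

```python
import math

# Toy labeled data: (feature vector, class label). Hypothetical values.
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((4.8, 5.2), "B"),
]

def classify(x):
    """Classification: predict a *class label* for an unlabeled object (1-NN)."""
    _, label = min(train, key=lambda p: math.dist(p[0], x))
    return label

def predict_value(x):
    """Prediction: estimate a *missing numeric value* for x (1-NN)."""
    known = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]  # hypothetical (x, y) pairs
    _, y = min(known, key=lambda p: abs(p[0] - x))
    return y

print(classify((1.1, 0.9)))  # class label: "A"
print(predict_value(2.2))    # numeric value: 20.0
```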
To effectively extract information from a huge amount of data in databases, data mining algorithms
must be efficient and scalable. In other words, the running time and required runtime storage space
of a data mining algorithm must be predictable (as some, preferably linear, function of the amount
of data mined) and acceptable in large databases.
Parallel, distributed, and incremental algorithms are needed due to the huge size of many databases,
the wide distribution of data, and the computational complexity of some data mining methods. Such
algorithms divide the data into partitions that are processed in parallel. Then the results are merged.
The high cost of some data mining processes also promotes the need for incremental data mining
algorithms that incorporate database updates without having to mine the entire data set again.
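The partition-merge and incremental ideas above can be sketched as follows. Item counting stands in for a real mining task, and the sequential loop stands in for parallel workers; the data is invented:

```python
from collections import Counter

# Hypothetical partitions of a database; each could go to its own worker.
partitions = [
    ["a", "b", "a"],
    ["b", "c"],
    ["a", "c", "c"],
]

# "Parallel" phase: mine each partition independently.
partial_results = [Counter(p) for p in partitions]

# Merge phase: combine the partial results into one.
total = sum(partial_results, Counter())

# Incremental phase: a database update is mined on its own and folded
# into the existing result instead of re-mining everything.
update = Counter(["a", "d"])
total += update

print(total)  # counts over all partitions plus the update
```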
KDD stands for Knowledge Discovery in Databases, and means the extraction of interesting
(non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge
amounts of data.
It consists of an iterative sequence of the following steps (see slide 17 of lecture 1):
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are
used to present the mined knowledge to the user)
(Source: Han & Kamber)
Data mining is only one step in the process, but many people still associate the term with the whole
process.
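The iterative sequence of steps listed above can be sketched as a function pipeline. The records, the selection rule, and the "patterns" are all invented for illustration (step 2, integration, is skipped because there is only one source here):

```python
# Hypothetical raw records, including noise (None, "bad") and messy formatting.
raw = [" 5 ", "7", None, "7", "bad", "3"]

def clean(records):            # 1. remove noise / inconsistent data
    return [r.strip() for r in records if isinstance(r, str) and r.strip().isdigit()]

def select(records):           # 3. keep only task-relevant data
    return [r for r in records if r != "7"]  # pretend "7" is irrelevant here

def transform(records):        # 4. consolidate into a mining-ready form
    return [int(r) for r in records]

def mine(values):              # 5. the actual data-mining step
    return {"max": max(values), "mean": sum(values) / len(values)}

# Steps 6-7 (evaluation, presentation) would inspect and display `patterns`.
patterns = mine(transform(select(clean(raw))))
print(patterns)
```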
KDD Cup is the leading Data Mining and Knowledge Discovery competition in the world,
organized by ACM SIGKDD - Special Interest Group on Knowledge Discovery and Data Mining.
Cosine similarity is a measure of similarity between two n-dimensional vectors, obtained as the
cosine of the angle between them. It is often used to compare documents (keyword vectors) in text
mining. Another example is biological taxonomy (comparing DNA sequences).
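A minimal sketch of the measure, with invented term-frequency vectors as the document representation:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) between two n-dimensional vectors: dot(a, b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical term-frequency vectors over the same keyword vocabulary.
doc1 = [3, 0, 1]
doc2 = [6, 0, 2]   # same direction as doc1, only "longer"
doc3 = [0, 4, 0]   # no keywords in common with doc1

print(cosine_similarity(doc1, doc2))  # ~1.0: identical keyword mix
print(cosine_similarity(doc1, doc3))  # 0.0: orthogonal, nothing shared
```

Note that the measure ignores vector length, so a short and a long document about the same topic still come out as similar.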
6. What other similarity measures are suitable for data mining tasks?
For example:
- Euclidean distance
- Minkowski distance
- Jaccard coefficient (for asymmetric binary variables)
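The listed measures can be sketched in a few lines (toy data, pure Python). Note that Minkowski distance generalizes Euclidean (p = 2) and Manhattan (p = 1) distance:

```python
def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)

def jaccard(a, b):
    """Jaccard coefficient for asymmetric binary variables: 1-1 matches
    over positions where at least one object is 1 (0-0 matches ignored)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either

print(euclidean([0, 0], [3, 4]))            # 5.0
print(minkowski([0, 0], [3, 4], 1))         # 7.0 (Manhattan)
print(jaccard([1, 0, 1, 0], [1, 1, 1, 0]))  # 2/3
```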