
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/234800045

Data mining: concepts and techniques by Jiawei Han and Micheline Kamber

Article  in  ACM SIGMOD Record · June 2002


DOI: 10.1145/565117.565130 · Source: DBLP



All content following this page was uploaded by Fernando Berzal on 02 June 2014.



Data Mining: Concepts and Techniques
By Jiawei Han and Micheline Kamber

Academic Press, Morgan Kaufmann Publishers, 2001


500 pages, list price $54.95
ISBN 1-55860-489-8

Review by:
Fernando Berzal and Nicolás Marín, University of Granada
Department of Computer Science and AI
fberzal@decsai.ugr.es

Mining information from data: A present-day gold rush. Data Mining is a multidisciplinary field which supports knowledge workers who try to extract information in our “data rich, information poor” environment. Its name stems from the idea of mining knowledge from large amounts of data. The tools it provides assist us in the discovery of relevant information through a wide range of data analysis techniques. Any method used to extract patterns from a given data source is considered to be a data mining technique.

Han and Kamber’s book provides more than a good starting point for those interested in this eclectic research field. The book surveys techniques for the main tasks data miners have to perform. Most existing data mining texts emphasize the managerial and marketing aspects involved in the adoption of this technology by modern enterprises. In contrast, Han and Kamber’s textbook focuses on issues such as algorithmic efficiency and scalability from a database perspective.

Basic Concepts for Beginners. The evolution of database technology is an essential prerequisite for understanding the need for knowledge discovery in databases (KDD). This evolution is described in the book to present data mining as a natural stage in the data processing history: we have collected data in the early days of computing, we created database management systems in the seventies, we developed advanced data models in the eighties, and, now, we are left with huge databases which have to be automatically analyzed.

Data mining is a pivotal step in the KDD process: the extraction of interesting patterns from a set of data sources (relational, transactional, object-oriented, spatial, temporal, text, and legacy databases, as well as data warehouses and the World Wide Web). The patterns obtained are used to describe concepts, to analyze associations, to build classification and regression models, to cluster data, to model trends in time-series, and to detect outliers (“data objects that do not comply with the general behavior or model of the data”). Since the patterns which are present in data are not all equally useful, interestingness measures are needed to estimate the relevance of the discovered patterns to guide the mining process. Although the book stresses the importance of interestingness measures and it presents the standard simplicity, certainty, utility, and novelty measures, a more in-depth treatment of alternative interestingness measures would be of interest to data miners but is not given.

From the authors’ point of view, data warehousing and multidimensional databases are introduced as desirable intermediate layers between the original data sources and the On-Line Analytical Mining system the user interacts with. OLAM (also known as OLAP mining) integrates on-line analytical processing with data mining.

In the initial chapters of the book, the reader will find an excellent overview of data warehousing concepts and the proposal of an integrated OLAM architecture, as well as an introduction to DMQL (Data Mining Query Language). Microsoft OLE DB for Data Mining is an alternative to this language and it is briefly described in a separate appendix.

Irrespective of whether data warehouses are used or not, input data must be preprocessed in order to reduce the effect of noise, missing values, and inconsistencies before applying data mining algorithms. Data cleaning, data integration, data transformation, data reduction, discretization and concept hierarchies are enabling techniques which help to prepare the data for the mining process. All these techniques are explained in the book without focusing too much on implementation details so that the reader can easily understand these preprocessing methods.

Data Mining in Action. According to their final goal, data mining techniques can be considered to be descriptive or predictive: Descriptive data mining intends to summarize data and to highlight their interesting properties, while predictive data mining aims to build models to forecast future behaviors.

Generalization is the basis of descriptive techniques and can be used to summarize data by applying attribute-oriented induction using characteristic rules and generalized relations. Analytical characterization is used to perform attribute relevance measurements to identify irrelevant and weakly relevant attributes (the lower the number of attributes, the more efficient the mining process). Generalization techniques can also be extended to discriminate among different classes. The authors refer to these techniques as “class comparison mining”. The discussion of descriptive techniques is completed with a brief study of statistical measures (i.e., central tendency and data dispersion measures) and their insightful graphical display.

Association rules are midway between descriptive and predictive data mining (maybe closer to descriptive techniques). They find interesting relationships among large sets of data items and are typically used in market basket analysis. The Apriori family of algorithms is presented as the landmark in association rule mining. Several improvements over the original Apriori algorithm are also described. Han et al.’s FP-Growth (SIGMOD 2000) is thoroughly discussed in the book as an alternative to mine association rules without candidate generation, the common step in all Apriori-like algorithms. Additional extensions to the basic association rule framework are explored, e.g., iceberg queries and multilevel, multidimensional, constraint-based, and quantitative association rules (which, from our point of view, are artificially categorized into quantitative and distance-based association rules when both of them work with quantitative attributes).

Closer to the predictive arena, the book deals with the classical machine learning topics of supervised and unsupervised learning. Several classification and regression techniques are introduced taking into account accuracy, speed, robustness, scalability, and interpretability issues.

Decision trees, Bayesian classifiers, and backpropagation neural networks are presented as outstanding classification techniques. The authors also discuss some classification methods based on concepts from association rule mining. Furthermore, the chapter on classification mentions alternative models based on instance-based learning (e.g., k-NN, CBR), genetic algorithms, rough and fuzzy sets. We believe that this book section would deserve a more detailed treatment (even a whole volume on its own), which should obviously include an extended version of the study of classifier accuracy found at the end of the chapter.

Regression (called prediction by the authors) appears as an extension of the classical classification models. The former deals with continuous values while the latter is intended to work with discrete categories. Linear regression is clearly explained; multiple, nonlinear, generalized linear, and log-linear regression models are only referenced in the text.

With respect to unsupervised learning (i.e., “learning by observation” rather than learning by examples), cluster analysis is precisely treated in Han and Kamber’s book. A general framework for the clustering process is presented pointing out how to compute the dissimilarity between objects taking into account the various types of attributes which can characterize them (binary, nominal, ordinal, interval-based, and ratio-scaled). A taxonomy of clustering methods is proposed including examples for each category: partitioning methods (e.g. k-Means and CLARANS), agglomerative and divisive hierarchical methods (such as BIRCH), density-based methods (like DBSCAN), grid-based methods (such as CLIQUE), and model-based methods (like COBWEB). This categorization of clustering algorithms provides an excellent overview of current clustering techniques, although it can be slightly too dense for people who are new to the field.

The book’s survey of data mining tasks and techniques is concluded by the discussion of other relevant problems which are as appealing as the previous ones. For example, outlier analysis has important applications in fraud detection, exception handling, and data preprocessing (i.e., to detect measurement errors); while time-series and sequence mining can be useful to detect trends in market indicators and match similar patterns in genome databases. Unfortunately, these interesting techniques are only briefly described in this book.

Space constraints also limit the discussion of data mining in complex types of data, such as object-oriented databases, spatial, multimedia, and text databases. Web mining, for instance, is only overviewed in its three flavors: web content mining (search engines and information retrieval), web structure mining (linkage analysis), and web usage mining (web log mining).

Practical Issues. The book’s final chapter describes some interesting examples of the use of data mining in the real world (i.e., biomedical research, financial data analysis, retail industry, and telecommunication utilities). This chapter also offers some practical tips on how to choose a particular data mining system, advocating for multidimensional thinking (as Tom Gilb did in “Principles of Software Engineering Management” some time ago). Moreover, the features of some commercial data mining systems are outlined, such as the authors’ DBMiner, whose architecture and capabilities are introduced in a separate appendix. Some buzzwordism about the role of data mining and its social impact can be found in this chapter and a forecast of future trends is included at its end, although we feel that the authors’ forecast ignores the importance of reusable data mining toolkits and frameworks.

Why to Read This Book. Maybe the authors’ goal of covering the whole field of data mining hinders a detailed treatment of some of the topics discussed in the book. Data mining has become an important research area in just a few years and its current breadth makes it impossible to fit into a single-volume book. The youth of this field might justify the authors’ bias we have found in some specific sections (e.g. they strongly advocate for tightly coupled data mining systems discouraging alternative solutions). Anyway, this book is an indispensable road map for those interested in data mining, both researchers and practitioners.

This book constitutes a superb example of how to write a technical textbook with didactic content and academic rigor. It is written in a direct style with questions and answers scattered throughout the text that keep the reader involved and explain the reasons behind every decision. The presence of examples makes concepts easy to understand and the summary and exercises at the end of each chapter support the reader in checking his/her comprehension of the book’s contents. The chapters are mostly self-contained, so they can be separately used to teach particular data mining areas. In fact, you may even use the book artwork which is freely available from the Web. Moreover, the bibliographical discussions presented at the end of every chapter describe related work and may prove invaluable for those interested in further reading. A must-have for data miners!
