
Lecture review of DISCOVERING KNOWLEDGE IN DATA: An Introduction to Data Mining

DANIEL T. LAROSE, Director of Data Mining, Central Connecticut State University (2005); Mohammad Forghani (2011)

WHAT IS DATA MINING?


Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

CROSS-INDUSTRY STANDARD PROCESS: CRISP-DM


According to CRISP-DM, a given data mining project has a life cycle consisting of six phases:

1. Business understanding phase
2. Data understanding phase
3. Data preparation phase
4. Modeling phase
5. Evaluation phase
6. Deployment phase

WHAT TASKS CAN DATA MINING ACCOMPLISH?


Description
Find ways to describe patterns and trends lying within the data. High-quality description can often be accomplished by exploratory data analysis, a graphical method of exploring data in search of patterns and trends.

Estimation
Estimation is similar to classification, except that the target variable is numerical rather than categorical.

Prediction
Prediction is similar to classification and estimation, except that for prediction the results lie in the future.

Classification
Common data mining methods used for classification are k-nearest neighbor, decision trees, and neural networks.

Clustering
A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters. Clustering differs from classification in that there is no target variable for clustering.

Association
Finding which attributes "go together."

DATA PREPROCESSING
To be useful for data mining purposes, databases need to undergo preprocessing, in the form of data cleaning and data transformation.

HANDLING MISSING DATA:
1. Replace the missing value with some constant.
2. Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables), as sketched below.
3. Replace the missing value with a value generated at random from the observed distribution of the variable.
To replace missing values more precisely and accurately: Bayesian estimation.
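A minimal sketch of options 2 and 3 using pandas; the data set and its column names are made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical data set with missing values.
df = pd.DataFrame({
    "income": [38000.0, 52000.0, np.nan, 41000.0],        # numerical
    "marital": ["single", np.nan, "married", "married"],  # categorical
})

# Option 2: field mean for the numerical variable, mode for the categorical one.
df["income"] = df["income"].fillna(df["income"].mean())
df["marital"] = df["marital"].fillna(df["marital"].mode()[0])

# Option 3 (alternative): fill with a value drawn at random from the observed values.
# df["income"] = df["income"].fillna(np.random.choice(df["income"].dropna()))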

DATA TRANSFORMATION
Min-max normalization
Z-score standardization
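A minimal NumPy sketch of both transformations; the field x is made up:

import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0])  # hypothetical numerical field

# Min-max normalization: X* = (X - min(X)) / (max(X) - min(X)); range becomes [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: X* = (X - mean(X)) / SD(X); mean becomes 0, SD becomes 1.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax, x_zscore)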

STATISTICAL APPROACHES TO ESTIMATION AND PREDICTION

UNIVARIATE METHODS: MEASURES OF CENTER AND SPREAD
Mean
Standard deviation
Standard error of the mean
Confidence interval estimation

BIVARIATE METHODS:
LINEAR REGRESSION
MULTIPLE REGRESSION
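A minimal sketch of the univariate measures, a t-based confidence interval for the mean, and a least-squares simple linear regression; the sample and predictor values are made up:

import numpy as np
from scipy import stats

y = np.array([10.2, 11.1, 9.8, 12.5, 10.7])  # hypothetical sample

mean = y.mean()
sd = y.std(ddof=1)           # sample standard deviation
se = sd / np.sqrt(len(y))    # standard error of the mean

# 95% t-confidence interval for the mean: mean +/- t * SE.
t = stats.t.ppf(0.975, df=len(y) - 1)
print(mean - t * se, mean + t * se)

# Simple linear regression y = b0 + b1*x by least squares.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical predictor
b1, b0 = np.polyfit(x, y, deg=1)
print(b0, b1)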

SUPERVISED VERSUS UNSUPERVISED METHODS

Unsupervised methods: no target variable is identified; the data mining algorithm searches for patterns and structure among all the variables. Clustering is the classic example.
Supervised methods: there is a particular prespecified target variable, and the algorithm is given many examples where the value of the target variable is provided, so that the algorithm may learn the association between variables. The algorithm is provided with a training set of data, which includes the preclassified values of the target variable in addition to the predictor variables. Classification methods include decision trees, neural networks, and k-nearest neighbors.

Methodology for supervised modeling
The data set is partitioned into a training set, on which the model is built, and a holdout test set, on which the model's error rate is estimated; evaluating on data withheld from training guards against overfitting (a minimal partition sketch follows).
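A minimal sketch of a random train/test partition; the feature matrix, target, and 75/25 split are made up:

import numpy as np

# Hypothetical data: 8 records, 2 predictors, binary target.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3],
              [5, 6], [6, 5], [7, 8], [8, 7]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Random 75% / 25% train/test partition.
rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(y))
cut = int(0.75 * len(y))
X_train, y_train = X[idx[:cut]], y[idx[:cut]]
X_test, y_test = X[idx[cut:]], y[idx[cut:]]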

k-NEAREST NEIGHBOR ALGORITHM

Instance-based learning: the k-nearest neighbor algorithm assigns the classification of the most similar record or records.

Data analysts define distance metrics to measure similarity.
Numerical variables: Euclidean distance.
Categorical variables: the "different from" function (distance 0 if the two values match, 1 otherwise).
Combining neighbors: unweighted voting and weighted voting.

Choosing k? A cross-validation approach can be used. A minimal sketch of the classifier itself follows.
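A minimal sketch for numerical predictors with Euclidean distance and unweighted voting; the training records, labels, and query are made up:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    """Classify one query record by unweighted vote of its k nearest
    neighbors under Euclidean distance."""
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Hypothetical training records (two numerical predictors) and labels.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [5.8, 6.2]])
y_train = np.array(["low", "low", "high", "high"])

print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # "low"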

DECISION TREES
A collection of decision nodes, connected by branches, extending downward from the root node until terminating in leaf nodes. Decision tree algorithms represent supervised learning, and as such require preclassified target variables.
The training data set should be rich and varied.
The target attribute classes must be discrete: decision tree analysis cannot be applied to a continuous target variable.
Two leading algorithms: the classification and regression trees (CART) algorithm and the C4.5 algorithm.

CLASSIFICATION AND REGRESSION TREES (CART)

The decision trees produced by CART are strictly binary, containing exactly two branches for each decision node.

C4.5 algorithm
The C4.5 algorithm recursively visits each decision node, selecting the optimal split, until no further splits are possible. Unlike CART, the C4.5 algorithm is not restricted to binary splits. It uses the concept of information gain, or entropy reduction, to select the optimal split.
Entropy of X: H(X) = -Σj pj log2(pj), where pj is the proportion of records belonging to class j.
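A minimal sketch of entropy and information gain (entropy reduction), the quantity C4.5 uses to compare candidate splits; the parent and child partitions are made up:

import numpy as np

def entropy(labels):
    """H(X) = -sum_j p_j * log2(p_j) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into `children`."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical split of 8 records on some candidate attribute.
parent = ["yes"] * 4 + ["no"] * 4
left, right = ["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]
print(information_gain(parent, [left, right]))  # about 0.19 bits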

NEURAL NETWORKS

The inputs (xi) are collected from upstream neurons (or the data set) and combined through a combination function such as summation (Σ), which is then input into a (usually nonlinear) activation function to produce an output response (y), which is then channeled downstream to other neurons.
Neural networks are robust with respect to noisy data.
One possible drawback of neural networks is that all attribute values must be encoded in a standardized manner, taking values between 0 and 1, even for categorical variables.
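A minimal sketch of a single neuron, assuming summation as the combination function and a sigmoid activation (one common choice); the inputs and weights are made up:

import numpy as np

def neuron(x, w, b):
    """One neuron: summation combination function followed by a
    sigmoid activation, y = 1 / (1 + exp(-(w.x + b)))."""
    net = np.dot(w, x) + b              # combination function (summation)
    return 1.0 / (1.0 + np.exp(-net))   # sigmoid activation

# Hypothetical inputs already standardized to [0, 1], with made-up weights.
x = np.array([0.2, 0.7, 0.5])
w = np.array([0.4, -0.3, 0.9])
print(neuron(x, w, b=0.1))  # output response channeled downstream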

CLUSTERING
A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters. Clustering differs from classification in that there is no target variable for clustering. Clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of records within a cluster is maximized and the similarity to records outside the cluster is minimized.

HIERARCHICAL CLUSTERING METHODS

Agglomerative clustering methods: initialize each observation to be a tiny cluster of its own; then, in succeeding steps, the two closest clusters are aggregated into a new combined cluster.
Divisive clustering methods: begin with all the records in one big cluster, with the most dissimilar records being split off recursively into separate clusters, until each record represents its own cluster.

Criteria for determining the distance between clusters:
Single linkage: the minimum distance between any record in cluster A and any record in cluster B. Tends to produce long, slender clusters.
Complete linkage: the maximum distance between any record in cluster A and any record in cluster B. Tends to produce compact, sphere-like clusters.
Average linkage: the average distance of all the records in cluster A from all the records in cluster B. Tends to produce clusters with approximately equal within-cluster variability.
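A minimal sketch of agglomerative clustering under each linkage criterion using SciPy's hierarchy module; the records and the cut into two clusters are made up:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical records (two numerical fields).
X = np.array([[1.0, 1.0], [1.5, 1.2], [8.0, 8.0], [8.2, 7.8], [4.5, 4.5]])

# Agglomerative clustering under each linkage criterion.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)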

k-MEANS CLUSTERING
Step 1: Ask the user how many clusters k the data set should be partitioned into.
Step 2: Randomly assign k records to be the initial cluster center locations.
Step 3: For each record, find the nearest cluster center. Each cluster center thus owns a subset of the records, representing a partition of the data set. We therefore have k clusters, C1, C2, . . . , Ck.
Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
Step 5: Repeat steps 3 and 4 until convergence or termination. The algorithm terminates when the centroids no longer change.

The k-means algorithm cannot guarantee finding the global minimum SSE, instead often settling at a local minimum. To improve the probability of achieving a global minimum, the analyst should rerun the algorithm using a variety of initial cluster centers; a sketch with random restarts follows.
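A minimal sketch of the five steps with random restarts, keeping the run with the lowest SSE; the data and restart count are made up, and empty-cluster handling is omitted:

import numpy as np

def kmeans(X, k, n_restarts=5, seed=0):
    """Plain k-means with several random restarts, keeping the run with
    the lowest SSE (a single run may settle at a local minimum)."""
    rng = np.random.default_rng(seed)
    best_sse, best_labels, best_centers = np.inf, None, None
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), size=k, replace=False)]  # step 2
        while True:
            # Step 3: assign each record to its nearest cluster center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 4: move each center to its cluster's centroid.
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):  # step 5: convergence
                break
            centers = new_centers
        sse = ((X - centers[labels]) ** 2).sum()
        if sse < best_sse:
            best_sse, best_labels, best_centers = sse, labels, centers
    return best_labels, best_centers, best_sse

# Hypothetical data with two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])
print(kmeans(X, k=2)[0])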

KOHONEN NETWORKS
Kohonen networks represent a type of self-organizing map (SOM), which itself represents a special class of neural networks.

The goal of self-organizing maps is to convert a complex high-dimensional input signal into a simpler low-dimensional discrete map. SOMs are well suited to cluster analysis, where underlying hidden patterns among records and fields are sought.
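A minimal one-dimensional SOM sketch, assuming a decaying learning rate and an immediate-neighbor update rule; the node count, epochs, and data are made up:

import numpy as np

def train_som(X, n_nodes=3, epochs=50, lr=0.5, seed=0):
    """A tiny 1-D SOM: the winning node (nearest weight vector) and its
    immediate neighbors are nudged toward each input record."""
    rng = np.random.default_rng(seed)
    W = rng.random((n_nodes, X.shape[1]))      # one weight vector per map node
    for epoch in range(epochs):
        alpha = lr * (1 - epoch / epochs)      # decaying learning rate
        for x in X:
            win = int(np.linalg.norm(W - x, axis=1).argmin())   # competition
            for j in range(max(0, win - 1), min(n_nodes, win + 2)):
                W[j] += alpha * (x - W[j])     # cooperative adaptation
    return W

# Hypothetical records, already normalized; each maps to its nearest node.
X = np.array([[0.1, 0.2], [0.15, 0.1], [0.9, 0.8], [0.85, 0.9]])
W = train_som(X)
print(np.linalg.norm(X[:, None] - W[None], axis=2).argmin(axis=1))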

ASSOCIATION RULES
Association rules take the form "If antecedent, then consequent," together with a measure of the support and confidence associated with the rule. The number of possible association rules grows exponentially in the number of attributes.

support = P(A ∩ B) = (number of transactions containing both A and B) / (total number of transactions)

confidence = P(B|A) = P(A ∩ B) / P(A) = (number of transactions containing both A and B) / (number of transactions containing A)
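A minimal sketch computing support and confidence for one rule over made-up market-basket transactions:

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A and B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: if a customer buys bread, then the customer buys milk.
print(support({"bread", "milk"}))       # 0.6
print(confidence({"bread"}, {"milk"}))  # 0.75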

ASSOCIATION RULES (cont.)

Reduce the search space:

GENERALIZED RULE INDUCTION (GRI) METHOD:

The GRI methodology can handle either categorical or numerical variables as inputs, but still requires categorical variables as outputs. Specifically, GRI applies the J-measure:

J = p(x) [ p(y|x) ln( p(y|x) / p(y) ) + (1 - p(y|x)) ln( (1 - p(y|x)) / (1 - p(y)) ) ]

where p(x) is the probability of the antecedent, p(y) the prior probability of the consequent, and p(y|x) the conditional probability of the consequent given the antecedent.

The End!
