DANIEL T. LAROSE, Director of Data Mining, Central Connecticut State University (2005). Mohammad Forghani (2011).
Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
THE CRISP-DM PHASES:
1. Business understanding phase
2. Data understanding phase
3. Data preparation phase
4. Modeling phase
5. Evaluation phase
6. Deployment phase
Estimation
Estimation is similar to classification except that the target variable is numerical rather than categorical.
Prediction
Prediction is similar to classification and estimation, except that for prediction, the results lie in the future.
Classification
In classification, the target variable is categorical. Common data mining methods used for classification are k-nearest neighbor, decision trees, and neural networks.
Clustering
A cluster is a collection of records that are similar to one another, and dissimilar
to records in other clusters. Clustering differs from classification in that there is no target variable for clustering.
Association
The association task is the job of finding which attributes go together.
DATA PREPROCESSING
To be useful for data mining purposes, the databases
need to undergo preprocessing, in the form of data cleaning and data transformation.
HANDLING MISSING DATA:
1. Replace the missing value with some constant.
2. Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).
3. Replace the missing value with a value generated at random from the observed distribution of the variable.
To replace missing values more precisely and accurately: Bayesian estimation.
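Below is a minimal sketch of the three simple strategies in Python with pandas; the DataFrame and column name are hypothetical.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [42.0, np.nan, 58.0, 61.0, np.nan, 49.0]})

    # 1. Replace the missing value with a constant chosen by the analyst.
    constant_filled = df["income"].fillna(0.0)

    # 2. Replace the missing value with the field mean.
    mean_filled = df["income"].fillna(df["income"].mean())

    # 3. Replace the missing value with a draw from the observed distribution.
    observed = df["income"].dropna()
    random_filled = df["income"].apply(
        lambda v: observed.sample(1).iloc[0] if pd.isna(v) else v)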
BIVARIATE METHODS: linear regression (one predictor variable).
MULTIPLE REGRESSION: extends linear regression to two or more predictor variables.
k-NEAREST NEIGHBOR
Classification of a new record is based on its similarity to the records in the training set.
Numerical attributes: Euclidean distance. Categorical attributes: the "different from" function (distance 0 if the two values match, 1 otherwise).
Unweighted voting and weighted voting (closer neighbors carry greater weight).
Cross-validation approach for estimating classification accuracy.
Choosing k? A small k risks overfitting to noise; a large k may smooth over genuinely local structure.
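A minimal k-NN sketch with Euclidean distance and distance-weighted voting; the toy data are hypothetical (scikit-learn's KNeighborsClassifier would be the usual library choice).

    import numpy as np
    from collections import defaultdict

    def knn_predict(X_train, y_train, x_new, k=3):
        # Euclidean distance from the new record to every training record.
        dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        votes = defaultdict(float)
        for i in nearest:
            # Weighted voting: closer neighbors receive larger weights.
            votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)
        return max(votes, key=votes.get)

    X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.5], [7.8, 8.1]])
    y = np.array(["a", "a", "b", "b"])
    print(knn_predict(X, y, np.array([1.1, 1.0])))  # prints "a"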
DECISION TREES
A decision tree is a collection of decision nodes, connected by branches, extending downward from the root node until terminating in leaf nodes. Decision tree algorithms represent supervised learning, and as such require preclassified target variables. The training data set should be rich and varied, providing a broad cross section of the attribute values and the classes of the target variable.
Classification and regression trees (CART): recursively partitions the records into subsets using binary splits only.
C4.5 algorithm
The C4.5 algorithm recursively visits each decision node, selecting the optimal split, until no further splits are possible. Unlike CART, the C4.5 algorithm is not restricted to binary splits. It uses the concept of information gain, or entropy reduction, to select the optimal split.
Entropy of X: H(X) = -Σj pj log2(pj), where pj is the proportion of records taking the jth value of X.
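A small sketch of entropy and information gain, the split-scoring quantities C4.5 uses (illustrative only, not the full algorithm):

    import math
    from collections import Counter

    def entropy(labels):
        # H(X) = -sum_j p_j * log2(p_j)
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(parent_labels, child_label_lists):
        # Entropy reduction achieved by a candidate split.
        n = len(parent_labels)
        weighted = sum(len(ch) / n * entropy(ch) for ch in child_label_lists)
        return entropy(parent_labels) - weighted

    parent = ["yes", "yes", "yes", "no", "no", "no"]
    split = [["yes", "yes", "yes"], ["no", "no", "no"]]
    print(information_gain(parent, split))  # a perfect split: gain = 1.0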
NEURAL NETWORKS
The inputs (xi) are collected from upstream neurons (or the data set) and combined through a combination function, such as summation (Σ), whose result is then input into a (usually nonlinear) activation function to produce an output response (y), which is then channeled downstream to other neurons. Neural networks are robust with respect to noisy data. One possible drawback of neural networks is that all attribute values must be encoded in a standardized manner, taking values between 0 and 1, even for categorical variables.
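A minimal sketch of a single neuron, with the inputs min-max scaled to [0, 1] as required; the weights, bias, and attribute ranges are hypothetical.

    import numpy as np

    def sigmoid(z):
        # A common nonlinear activation function.
        return 1.0 / (1.0 + np.exp(-z))

    x_raw = np.array([50_000.0, 35.0])            # e.g., income and age
    lo = np.array([0.0, 18.0])                    # assumed attribute minima
    hi = np.array([200_000.0, 90.0])              # assumed attribute maxima
    x = (x_raw - lo) / (hi - lo)                  # standardize to [0, 1]

    w = np.array([0.4, -0.7])                     # connection weights
    b = 0.1                                       # bias input
    y = sigmoid(np.dot(w, x) + b)                 # combination + activation
    print(y)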
CLUSTERING
A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters. Clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of the records within the cluster is maximized and the similarity to records outside the cluster is minimized.
Agglomerative clustering methods: begin with each record as its own cluster; in succeeding steps, the two closest clusters are aggregated into a new combined cluster.
Divisive clustering methods: begin with all the records in one big cluster, with the most dissimilar records being split off recursively into separate clusters, until each record represents its own cluster.
Criteria for determining distance between clusters:
Single linkage: the minimum distance between any record in cluster A and any record in cluster B; tends to form long, slender clusters.
Complete linkage: the maximum distance between any record in cluster A and any record in cluster B; tends to form compact, spherelike clusters.
Average linkage: the average distance of all the records in cluster A from all the records in cluster B; tends to form clusters with approximately equal within-cluster variability.
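A short sketch comparing the three linkage criteria with SciPy's agglomerative clustering; the toy data set is hypothetical.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    X = np.array([[1.0, 1.0], [1.5, 1.2], [8.0, 8.0], [8.2, 7.7], [4.5, 4.0]])

    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)                    # merge history
        labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
        print(method, labels)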
k-MEANS CLUSTERING
Step 1: Ask the user how many clusters k the data set should be partitioned into.
Step 2: Randomly assign k records to be the initial cluster center locations.
Step 3: For each record, find the nearest cluster center. Thus, in a sense, each cluster center owns a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, . . . , Ck.
Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
Step 5: Repeat steps 3 and 4 until convergence or termination. The algorithm terminates when the centroids no longer change.
The k-means algorithm cannot guarantee finding the global minimum SSE (sum of squared errors), instead often settling at a local minimum. To improve the probability of achieving a global minimum, the analyst should rerun the algorithm using a variety of initial cluster centers.
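A compact k-means sketch with several random restarts, keeping the run with the lowest SSE, per the advice above; the data are synthetic.

    import numpy as np

    def kmeans(X, k, n_iter=100):
        rng = np.random.default_rng()
        centers = X[rng.choice(len(X), size=k, replace=False)]   # step 2
        for _ in range(n_iter):
            # Step 3: assign each record to the nearest cluster center.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Step 4: move each center to the centroid of its records.
            new_centers = np.array(
                [X[labels == j].mean(axis=0) if (labels == j).any()
                 else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):                # converged
                break
            centers = new_centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        sse = (d.min(axis=1) ** 2).sum()
        return centers, sse

    X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 6])
    centers, sse = min((kmeans(X, 2) for _ in range(10)), key=lambda r: r[1])
    print("best SSE:", sse)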
KOHONEN NETWORKS
Kohonen networks represent a type of self-organizing map (SOM). A SOM converts a high-dimensional input signal into a simpler low-dimensional discrete map. SOMs are well suited for cluster analysis, where underlying hidden patterns among records and fields are sought.
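A tiny sketch of the Kohonen learning rule: the output node whose weight vector lies closest to the input record wins, and the winner is nudged toward the input (a full SOM would also update the winner's grid neighbors). The grid size and learning rate are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.random((4, 2))      # 4 output nodes, 2 input attributes
    eta = 0.3                         # learning rate

    def train_step(x):
        # Winning node: the weight vector nearest the input record.
        j = np.linalg.norm(weights - x, axis=1).argmin()
        # Kohonen learning rule: move the winner toward the input.
        weights[j] += eta * (x - weights[j])

    for x in rng.random((50, 2)):     # hypothetical records scaled to [0, 1]
        train_step(x)
    print(weights)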
ASSOCIATION RULES
Association rules take the form "If antecedent, then consequent," along with a measure of the support and confidence associated with the rule. The number of possible association rules grows exponentially in the number of attributes.
support = P(A ∩ B) = (number of transactions containing both A and B) / (total number of transactions).
confidence = P(B|A) = P(A ∩ B) / P(A) = (number of transactions containing both A and B) / (number of transactions containing A).
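A direct translation of the two formulas; the transaction list and the rule "if milk, then bread" are hypothetical.

    transactions = [
        {"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"},
        {"bread"}, {"milk", "bread"},
    ]
    A, B = {"milk"}, {"bread"}

    n_total = len(transactions)
    n_A = sum(A <= t for t in transactions)         # transactions containing A
    n_AB = sum((A | B) <= t for t in transactions)  # containing both A and B

    support = n_AB / n_total    # P(A and B)
    confidence = n_AB / n_A     # P(B | A)
    print(support, confidence)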
Generalized rule induction (GRI) can handle either categorical or numerical variables as inputs, but still requires categorical variables as outputs. Specifically, GRI applies the J-measure:
J = p(x) [ p(y|x) ln( p(y|x) / p(y) ) + (1 - p(y|x)) ln( (1 - p(y|x)) / (1 - p(y)) ) ]
where p(x) is the probability of the antecedent, p(y) is the prior probability of the consequent, and p(y|x) is the confidence of the rule.
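The J-measure computed directly from that definition, for a hypothetical rule with p(x) = 0.4, p(y) = 0.5, and confidence p(y|x) = 0.75:

    import math

    def j_measure(p_x, p_y, p_y_given_x):
        # J = p(x) [ p(y|x) ln(p(y|x)/p(y))
        #            + (1 - p(y|x)) ln((1 - p(y|x))/(1 - p(y))) ]
        term1 = p_y_given_x * math.log(p_y_given_x / p_y)
        term2 = (1 - p_y_given_x) * math.log((1 - p_y_given_x) / (1 - p_y))
        return p_x * (term1 + term2)

    print(j_measure(0.4, 0.5, 0.75))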
The End!