Professional Documents
Culture Documents
Clustering
Goal for Today
Age
Clustering
One Cluster
Revenue
Centroid
== Center of
the cluster
Age
clustering applications
• Fraud: Detect Outliers
C2
C3
C1
C2
C3
C1
C1
C2
C3
C3
C2
C1
C3
C2
C1
C3
C2
C1
C1
C3
C3
C2
C2
C1
C3
C2
C1
C3
C2
C1
C1
C3
C3
C2
C2
clustering challenges
• Curse of Dimensionality
• Performance
• Choice # of clusters
Mahout Clustering
Challenges
Image Vectorized
Data Processing
Data
Voice
Log / DB
Mahout K-Means on Text
Workflow
Text Files
mahout
seqdirectory
Mahout Sequence Files
mahout
seq2parse
Tfidf Vectors
mahout
kmeans
Clusters
Mahout K-Means on
Database Extract Worflow
Database Dump (CSV)
org.apache.mahout.clustering.conv
ersion.InputDriver
Mahout Vectors
mahout
kmeans
Clusters
Convert a CSV File to
Mahout Vector
• Converting Categorical
variables to dimensions
• Variable Rescaling
K (number of clusters)
K-Means Circles Point -> ClusterId
Convergence
K (number of clusters) Point -> ClusterId * ,
Fuzzy K-Means Circles
Convergence Probability
Expectation K (Number of clusterS)
Gaussian distribution Point -> ClusterId*, Probability
Maximization Convergence
Mean-Shift Distance boundaries,
Gradient like distribution Point -> Cluster ID
Clustering Convergence
Top Down Point -> Large ClusterId, Small
Two Clustering Algorithns Hierarchy
Clustering ClusterId
Dirichlet Points are a mixture of
Model Distribution Point -> ClusterId, Probability
Process distribution
Spectral
- - Point -> ClusterId
Clustering
MinHash Number of hash / keys
High Dimension Point -> Hash*
Clustering Hash Type
Comparing Clustering
KMeans Dirichlet Fuzzy KMeans
MeanShift
Canopy Optimization
T2
T2 T1
Surely in Cluster
Pick a random point Surely not in cluster