
Machine Learning and

Data Mining
Introduction

Albert Bifet (@abifet)
Data Science

Data Science is an interdisciplinary field focused on
extracting knowledge or insights from large volumes
of data.
Data Scientist

Figure: http://www.marketingdistillery.com/2014/11/29/
is-data-science-a-buzzword-modern-data-scientist-defined/
Data Science

Figure: Drew Conway's Venn diagram


Classification

Definition
Given nC different classes, a classifier algorithm builds a model
that predicts for every unlabelled instance I the class C to which
it belongs with accuracy.

Example
A spam filter

Example
Twitter sentiment analysis: analyze tweets with positive or
negative feelings
Classification
Example
Data set that describes e-mail features for deciding if it is spam:

Contains Money   Domain type   Has attach.   Time received   spam
yes              com           yes           night           yes
yes              edu           no            night           yes
no               com           yes           night           yes
no               edu           no            day             no
no               com           no            day             no
yes              cat           no            day             yes

Assume we have to classify the following new instance:

Contains Money   Domain type   Has attach.   Time received   spam
yes              edu           yes           day             ?
k-Nearest Neighbours

k-NN Classifier
Training: store all instances in memory
Prediction:
Find the k nearest instances
Output majority class of these k instances
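A minimal sketch of this classifier in Python/NumPy; the toy training set and the value of k below are made up for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # "Training" is just keeping X_train, y_train in memory.
    # Prediction: find the k nearest instances and output their majority class.
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k nearest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array(["spam", "spam", "ham", "ham"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> "spam"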
Bayes Classifiers
Naive Bayes

Based on Bayes' Theorem:

    P(c|d) = P(c) P(d|c) / P(d)

    posterior = prior × likelihood / evidence

Estimates the probability of observing attribute a and the
prior probability P(c)
Probability of class c given an instance d:

    P(c|d) = P(c) ∏_{a ∈ d} P(a|c) / P(d)
Bayes Classifiers

Multinomial Naive Bayes

Considers a document as a bag-of-words.
Estimates the probability of observing word w and the prior
probability P(c)
Probability of class c given a test document d:

    P(c|d) = P(c) ∏_{w ∈ d} P(w|c)^{n_wd} / P(d)

where n_wd is the number of times w occurs in d
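A minimal Python sketch of a Multinomial Naive Bayes classifier; the Laplace (add-one) smoothing and the toy documents are assumptions added for illustration, not part of the slide:

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    # docs: list of token lists; labels: list of class labels
    prior = Counter(labels)
    word_counts = defaultdict(Counter)    # per-class word counts
    vocab = set()
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
        vocab.update(d)
    return prior, word_counts, vocab

def predict_nb(prior, word_counts, vocab, doc):
    n = sum(prior.values())
    best, best_score = None, -math.inf
    for c in prior:
        total = sum(word_counts[c].values())
        # log P(c) + sum over word occurrences of log P(w|c), Laplace-smoothed
        score = math.log(prior[c] / n)
        for w in doc:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["free", "money", "now"], ["meeting", "agenda"], ["free", "offer"]]
labels = ["spam", "ham", "spam"]
model = train_multinomial_nb(docs, labels)
print(predict_nb(*model, ["free", "money"]))   # -> "spam"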
Perceptron

[Figure: perceptron with input attributes x_1 ... x_5, weights w_1 ... w_5, and output h_w(x_i)]

Data stream: ⟨x_i, y_i⟩

Classical perceptron: h_w(x_i) = sgn(w^T x_i)
Minimize mean-square error: J(w) = ½ Σ_i (y_i − h_w(x_i))²

Perceptron

[Figure: perceptron with input attributes x_1 ... x_5, weights w_1 ... w_5, and output h_w(x_i)]

We use the sigmoid function h_w = σ(w^T x) where

    σ(x) = 1/(1 + e^(−x))

    σ′(x) = σ(x)(1 − σ(x))
Perceptron

Minimize mean-square error: J(w) = ½ Σ_i (y_i − h_w(x_i))²

Stochastic Gradient Descent: w = w − η ∇J x_i

Gradient of the error function:

    ∇J = −Σ_i (y_i − h_w(x_i)) ∇h_w(x_i)

    ∇h_w(x_i) = h_w(x_i)(1 − h_w(x_i))

Weight update rule:

    w = w + η Σ_i (y_i − h_w(x_i)) h_w(x_i)(1 − h_w(x_i)) x_i
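A minimal Python/NumPy sketch of the sigmoid perceptron trained with the weight update rule above, processing one (x_i, y_i) pair at a time; the learning rate η and the toy data stream are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_sgd(stream, n_features, eta=0.1):
    # One pass over a stream of (x_i, y_i) pairs, y_i in {0, 1}.
    w = np.zeros(n_features)
    for x, y in stream:
        h = sigmoid(w @ x)
        # weight update rule: w <- w + eta * (y - h) * h * (1 - h) * x
        w += eta * (y - h) * h * (1 - h) * x
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # a linearly separable toy concept
w = perceptron_sgd(zip(X, y), n_features=2)
print(w)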
Restricted Boltzmann Machines (RBMs)

[Figure: bipartite graph with hidden units z_1 ... z_4 connected to visible units x_1 ... x_5]

Energy-based models, where

    P(x, z) ∝ e^(−E(x,z)).

Manipulate a weight matrix W to find low-energy states
and thus generate high probability P(x, z), where

    E(x, z) = −x^T W z.

RBMs can be stacked on top of each other to form
so-called Deep Belief Networks (DBNs)
Classification

Assume we have to classify the following new instance:

Contains Money   Domain type   Has attach.   Time received   spam
yes              edu           yes           day             ?

Decision tree:

                 Time
            Day       Night
     Contains Money    YES
       Yes     No
       YES     NO
Decision Trees

Basic induction strategy:

A ← the best decision attribute for the next node
Assign A as decision attribute for node
For each value of A, create a new descendant of node
Sort training examples to leaf nodes
If training examples are perfectly classified, then STOP; else
iterate over new leaf nodes
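A sketch of tree induction using scikit-learn on the e-mail example above; the use of pandas, one-hot encoding of the categorical attributes, and the entropy criterion are implementation assumptions, not part of the slide:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The e-mail data set from the earlier example.
data = pd.DataFrame({
    "money":  ["yes", "yes", "no", "no", "no", "yes"],
    "domain": ["com", "edu", "com", "edu", "com", "cat"],
    "attach": ["yes", "no", "yes", "no", "no", "no"],
    "time":   ["night", "night", "night", "day", "day", "day"],
    "spam":   ["yes", "yes", "yes", "no", "no", "yes"],
})
X = pd.get_dummies(data.drop(columns="spam"))        # one-hot encode categories
y = data["spam"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

new = pd.DataFrame({"money": ["yes"], "domain": ["edu"], "attach": ["yes"], "time": ["day"]})
new = pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new))   # predicted class for the new instance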
Bagging

Example
Dataset of Instances: A, B, C, D

Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C

Bagging builds a set of M base models, each trained on a
bootstrap sample created by drawing random samples with
replacement.
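A minimal Python sketch of bagging; using scikit-learn decision trees as base models and M = 5 are illustrative choices:

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=5, seed=0):
    # Build M base models, each trained on a bootstrap sample
    # (random samples drawn with replacement).
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(M):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap indices
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]       # majority vote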
Random Forests

Bagging
Random Trees: trees that in each node use only a random
subset of the attributes
Random Forests is one of the most popular methods in
machine learning.
Boosting

"The Strength of Weak Learnability" (Schapire)

A boosting algorithm transforms a weak learner
into a strong one
Boosting

A formal description of Boosting (Schapire)

given a training set (x_1, y_1), . . . , (x_m, y_m)
    y_i ∈ {−1, +1} correct label of instance x_i ∈ X
for t = 1, . . . , T
    construct distribution D_t
    find weak classifier

        h_t : X → {−1, +1}

    with small error ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i] on D_t

output final classifier
Boosting

AdaBoost
1: Initialize D_1(i) = 1/m for all i ∈ {1, 2, ..., m}
2: for t = 1, 2, ..., T do
3:   Call WeakLearn, providing it with distribution D_t
4:   Get back hypothesis h_t : X → Y
5:   Calculate error of h_t: ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i)
6:   Update distribution D_t:
         D_{t+1}(i) = (D_t(i)/Z_t) × ε_t/(1 − ε_t)   if h_t(x_i) = y_i
         D_{t+1}(i) = (D_t(i)/Z_t) × 1               otherwise
     where Z_t is a normalization constant (chosen so D_{t+1} is a
     probability distribution)
7: return h_fin(x) = arg max_{y ∈ Y} Σ_{t: h_t(x)=y} log((1 − ε_t)/ε_t)
Boosting

AdaBoost
1: Initialize D_1(i) = 1/m for all i ∈ {1, 2, ..., m}
2: for t = 1, 2, ..., T do
3:   Call WeakLearn, providing it with distribution D_t
4:   Get back hypothesis h_t : X → Y
5:   Calculate error of h_t: ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i)
6:   Update distribution D_t:
         D_{t+1}(i) = (D_t(i)/Z_t) × ε_t         if h_t(x_i) = y_i
         D_{t+1}(i) = (D_t(i)/Z_t) × (1 − ε_t)   otherwise
     where Z_t is a normalization constant (chosen so D_{t+1} is a
     probability distribution)
7: return h_fin(x) = arg max_{y ∈ Y} Σ_{t: h_t(x)=y} log((1 − ε_t)/ε_t)
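A minimal Python sketch of binary AdaBoost with decision stumps as the weak learner. It uses the equivalent α_t = ½ log((1 − ε_t)/ε_t) re-weighting formulation rather than the β_t form on the slides, and assumes labels in {−1, +1}:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    # y must be a numpy array with values in {-1, +1}
    m = len(X)
    D = np.full(m, 1.0 / m)                 # D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()            # weighted error on D_t
        if eps == 0 or eps >= 0.5:
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        # increase weight of misclassified instances, then normalize
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()
        hs.append(h); alphas.append(alpha)
    return hs, alphas

def adaboost_predict(hs, alphas, X):
    votes = sum(a * h.predict(X) for h, a in zip(hs, alphas))
    return np.sign(votes)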
Stacking

Use a classifier to combine predictions of base classifiers

Example
Use a perceptron to do stacking
Use decision trees as base classifiers
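A sketch of the example above: decision trees as base classifiers and a perceptron as the combining classifier. Using out-of-fold predictions as meta-features is an implementation choice, and numeric class labels (e.g. 0/1) are assumed:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_predict

def stacking_fit(X, y, n_base=3):
    # Base level: decision trees of different depths.
    bases = [DecisionTreeClassifier(max_depth=d + 1, random_state=0) for d in range(n_base)]
    # Meta features: out-of-fold predictions of each base classifier,
    # so the meta learner is not trained on in-sample predictions.
    meta_X = np.column_stack([cross_val_predict(b, X, y, cv=5) for b in bases])
    meta = Perceptron().fit(meta_X, y)        # assumes numeric class labels
    bases = [b.fit(X, y) for b in bases]
    return bases, meta

def stacking_predict(bases, meta, X):
    meta_X = np.column_stack([b.predict(X) for b in bases])
    return meta.predict(meta_X)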
Clustering

Definition
Clustering is the distribution of a set of instances or examples
into previously unknown groups according to some common relations or
affinities.

Example
Market segmentation of customers

Example
Social network communities
Clustering

Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
    f : I → {1, . . . , K}
that minimizes the objective function cost(I)

Clustering
Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function

    cost(C, I) = Σ_{x ∈ I} d²(x, C)

where
    d(x, c): distance function between x and c
    d²(x, C) = min_{c ∈ C} d²(x, c): distance from x to the nearest
    point in C
k-means

1. Choose k initial centers C = {c_1, . . . , c_k}
2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to each instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k
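A minimal NumPy sketch of this loop; initializing the centers with k random instances is an assumption (the k-means++ seeding on the next slide would replace that step):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Choose k initial centers (here: k random instances)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # assign each instance to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # set each center to the center of mass of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):     # stopping criterion
            break
        centers = new_centers
    return centers, labels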
k-means++

1. Choose an initial center c_1
   For k = 2, . . . , K
     select c_k = p ∈ I with probability d²(p, C)/cost(C, I)
2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to each instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k
Performance Measures

Internal Measures
    Sum square distance
    Dunn index D = d_min / d_max
    C-Index C = (S − S_min) / (S_max − S_min)

External Measures
    Rand Measure
    F Measure
    Jaccard
    Purity
Density based methods
DBSCAN
    ε-neighborhood(p): set of points that are at a distance from p
    less than or equal to ε
    Core object: object whose ε-neighborhood has an overall
    weight of at least MinPts
    A point p is directly density-reachable from q if
        p is in ε-neighborhood(q)
        q is a core object
    A point p is density-reachable from q if
        there is a chain of points p_1, . . . , p_n such that p_{i+1} is directly
        density-reachable from p_i
    A point p is density-connected to q if
        there is a point o such that p and q are density-reachable
        from o
Density based methods

DBSCAN
    A cluster C of points satisfies
        if p ∈ C and q is density-reachable from p, then q ∈ C
        all points p, q ∈ C are density-connected
    A cluster is uniquely determined by any of its core points
    A cluster can be obtained by
        choosing an arbitrary core point as a seed
        retrieving all points that are density-reachable from the seed
DBSCAN

Figure: DBSCAN point example with MinPts = 3
Density based methods

DBSCAN
    select an arbitrary point p
    retrieve all points density-reachable from p
    if p is a core point, a cluster is formed
    if p is a border point
        no points are density-reachable from p and
        DBSCAN visits the next point of the database
    continue the process until all of the points have been
    processed
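A short sketch using scikit-learn's DBSCAN implementation on synthetic data; the ε and MinPts values (eps, min_samples) and the toy data are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered noise points
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2)),
               rng.uniform(-2, 7, size=(5, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks noise points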
Frequent Patterns

Suppose D is a dataset of patterns, t ∈ D, and min_sup is a
constant.

Definition
Support(t): number of patterns in D that are superpatterns of t.

Definition
Pattern t is frequent if Support(t) ≥ min_sup.

Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns
in D.
Pattern Mining

Dataset Example
Document   Patterns
d1         abce
d2         cde
d3         abce
d4         acde
d5         abcde
d6         bcd
Itemset Mining

Document   Patterns     Support                Frequent
d1         abce         d1,d2,d3,d4,d5,d6      c
d2         cde          d1,d2,d3,d4,d5         e, ce
d3         abce         d1,d3,d4,d5            a, ac, ae, ace
d4         acde         d1,d3,d5,d6            b, bc
d5         abcde        d2,d4,d5,d6            d, cd
d6         bcd          d1,d3,d5               ab, abc, abe,
                                               be, bce, abce
                        d2,d4,d5               de, cde

minimal support = 3
Itemset Mining

Document   Patterns     Support   Frequent
d1         abce         6         c
d2         cde          5         e, ce
d3         abce         4         a, ac, ae, ace
d4         acde         4         b, bc
d5         abcde        4         d, cd
d6         bcd          3         ab, abc, abe,
                                  be, bce, abce
                        3         de, cde
Itemset Mining

Document   Patterns   Support   Frequent               Gen    Closed
d1         abce       6         c                      c      c
d2         cde        5         e, ce                  e      ce
d3         abce       4         a, ac, ae, ace         a      ace
d4         acde       4         b, bc                  b      bc
d5         abcde      4         d, cd                  d      cd
d6         bcd        3         ab, abc, abe,          ab
                                be, bce, abce          be     abce
                      3         de, cde                de     cde
Itemset Mining

Document   Patterns   Support   Frequent               Gen    Closed   Max
d1         abce       6         c                      c      c
d2         cde        5         e, ce                  e      ce
d3         abce       4         a, ac, ae, ace         a      ace
d4         acde       4         b, bc                  b      bc
d5         abcde      4         d, cd                  d      cd
d6         bcd        3         ab, abc, abe,          ab
                                be, bce, abce          be     abce     abce
                      3         de, cde                de     cde      cde
Itemset Mining

Example: the generator e maps to its closed pattern ce (e → ce); both have
support 5 in the table above.
Itemset Mining

Example: the generator a maps to its closed pattern ace (a → ace); both have
support 4 in the table above.
Closed Patterns

Usually, there are too many frequent patterns. We can compute
a smaller set, while keeping the same information.

Example
A set of n items has 2^n subsets; for even moderately large n this is more
than the number of atoms in the universe.
Closed Patterns

A priori property
If t′ is a subpattern of t, then Support(t′) ≥ Support(t).

Definition
A frequent pattern t is closed if none of its proper superpatterns
has the same support as it has.

Frequent subpatterns and their supports can be generated from
closed patterns.
Maximal Patterns

Definition
A frequent pattern t is maximal if none of its proper
superpatterns is frequent.

Frequent subpatterns can be generated from maximal patterns,
but not with their support.
All maximal patterns are closed, but not all closed patterns are
maximal.
Non streaming frequent itemset miners
Representation:
    Horizontal layout
        T1: a, b, c
        T2: b, c, e
        T3: b, d, e
    Vertical layout (one bit per transaction)
        a: 1 0 0
        b: 1 1 1
        c: 1 1 0

Search:
    Breadth-first (levelwise): Apriori
    Depth-first: Eclat, FP-Growth
The Apriori Algorithm

APRIORI
1  Initialize the item set size k = 1
2  Start with single element sets
3  Prune the non-frequent ones
4  while there are frequent item sets
5      do create candidates with one item more
6      Prune the non-frequent ones
7      Increment the item set size k = k + 1
8  Output: the frequent item sets
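A minimal Python sketch of this levelwise search, run on the itemset-mining example dataset from earlier with min_sup = 3; the naive candidate generation and support counting are simplifications for illustration:

from itertools import combinations

def apriori(transactions, min_sup):
    # transactions: list of sets of items; min_sup: absolute support threshold
    items = {i for t in transactions for i in t}
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)
    k = 1
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    all_frequent = list(frequent)
    while frequent:
        # create candidates with one item more, then prune the non-frequent ones
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        frequent = [c for c in candidates if support(c) >= min_sup]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

D = [set("abce"), set("cde"), set("abce"), set("acde"), set("abcde"), set("bcd")]
print(sorted("".join(sorted(s)) for s in apriori(D, min_sup=3)))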


The Eclat Algorithm

Depth-First Search
    divide-and-conquer scheme: the problem is processed by
    splitting it into smaller subproblems, which are then
    processed recursively
        conditional database for the prefix a:
            transactions that contain a
        conditional database for item sets without a:
            transactions that do not contain a

Vertical representation
    Support counting is done by intersecting lists of
    transaction identifiers
The FP-Growth Algorithm

Depth-First Search
    divide-and-conquer scheme: the problem is processed by
    splitting it into smaller subproblems, which are then
    processed recursively
        conditional database for the prefix a:
            transactions that contain a
        conditional database for item sets without a:
            transactions that do not contain a
Vertical and Horizontal representation: FP-Tree
    prefix tree with links between nodes that correspond to the
    same item
    Support counting is done using the FP-Tree
Mining Graph Data
Problem
Given a data set D of graphs, find frequent graphs.

Transaction Id   Graph
[Figure: three small molecule-like graphs built from C, S, N, O atoms]
The gSpan Algorithm

GSPAN(g, D, min_sup, S)

Input: A graph g, a graph dataset D, min_sup.
Output: The frequent graph set S.

1   if g ≠ min(g)
2       then return S
3   insert g into S
4   update support counter structure
5   C ← ∅
6   for each g′ that can be right-most
        extended from g in one step
7       do if support(g′) ≥ min_sup
8           then insert g′ into C
9   for each g′ in C
10      do S ← GSPAN(g′, D, min_sup, S)
11  return S
Machine Learning and
Data Mining
Data Preprocessing

Albert Bifet (@abifet)
Data Basics
Machine Learning/Data Mining Applications

Business Analytics
    Is this customer credit-worthy?
    Is a customer willing to respond to an email?
    Do customers divide into similar groups?
    How much is a customer going to spend next semester?
World Wide Web
Financial Analytics
Internet of Things
Image Recognition, Speech
...
The Data Mining Process

Data collection
Data Preprocessing
    Feature extraction
    Data cleaning
    Feature selection and transformation
Analytical processing and algorithms
Data Postprocessing
Multidimensional Data

Example:
Competitor Name   Swim   Cycle   Run   Total
John T            -      -       -     -
Norman P          -      -       -     -
Alex K            -      -       n/a   n/a
Sarah H           -      -       -     -
Table: Triathlon results

Example or Instance
data point, transaction, entity, tuple, object, or feature-vector
Attribute or Feature
field, dimension
Instance Types

Dense
    red, white, Barcelona, up
    red, red, Barcelona, down
    black, white, Paris, up
    red, green, Paris, down
Sparse
    long vectors in which most entries are zero, e.g.
    0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Attribute Type

Numerical
    real or integer values
Categorical or Discrete
    +, -
    red, green, black
    yes, no
    up, down
    Barcelona, Paris, London, New York
Text Data: vector-space representation
    The cat is black
Binary: Categorical or Numerical
Analytical processing and algorithms

Attribute/Column Relationships
    Classification: predict value of a discrete attribute
    Regression: predict value of a numeric attribute
Instance/Row Relationships
    Clustering: determine subsets of rows, in which the values
    in the corresponding columns are similar
    Outlier Detection: determine the rows that are very different
    from the other rows
Big Data Scalability

Distributed Systems:
Hardware: Hadoop cluster
Software: MapReduce, Spark, Flink, Storm
Streaming Algorithms
Single pass over the data
Concept Drift
Data Preparation
The Data Mining Process

Data collection
Data Preprocessing
    Feature extraction
    Data cleaning
    Feature selection and transformation
Analytical processing and algorithms
Data Postprocessing
Feature Extraction

Sensor data: wavelets or Fourier Transforms
Image Data: histograms or visual words
Web logs: multidimensional data
Network traffic: specific features such as network protocol,
bytes transferred
Text Data: remove stop words, stem data,
multidimensional data
Feature Conversion

Numeric to Discrete
Equi-width ranges
Equi-log ranges
Equi-depth ranges
Discrete to Numeric
Binarization: one numeric attribute for each value
Text to Numeric
remove stop words, stem data, tf-idf, multidimensional data
Time Series to Discrete Sequence Data
SAX: equi-depth discretization after window-based
averaging
Time Series to Numeric Data
Discrete Wavelet Transform
Discrete Fourier Transform
Term Frequency-Inverse Document Frequency
Term frequency
    Boolean frequencies
        tf(t, d) = 1 if t occurs in d and 0 otherwise
    Logarithmically scaled frequency
        tf(t, d) = 1 + log f_{t,d}, or zero if f_{t,d} is zero
    Augmented frequency

        tf(t, d) = 0.5 + 0.5 · f_{t,d} / max{f_{t′,d} : t′ ∈ d}

Inverse document frequency

        idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

Term frequency-inverse document frequency

        tfidf(t, d, D) = tf(t, d) · idf(t, D)
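A minimal Python sketch computing tf-idf weights with the logarithmically scaled term frequency and the idf above; the toy documents are illustrative:

import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))    # document frequency per term
    out = []
    for d in docs:
        f = Counter(d)
        out.append({t: (1 + math.log(f[t])) * math.log(N / df[t]) for t in f})
    return out

docs = [["the", "cat", "is", "black"], ["the", "dog"], ["black", "dog", "dog"]]
for weights in tfidf(docs):
    print(weights)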


Data Cleaning
Handling missing entries
    Eliminate entries with a missing value
    Estimate missing values
    Algorithms can handle missing values
Handling incorrect entries
    Duplicate detection and inconsistency detection
    Domain knowledge
    Data-centric methods
Scaling and normalization
    Standardization: for instance i, attribute j:

        z_ij = (x_ij − μ_j) / σ_j

    Normalization:

        y_ij = (x_ij − min_j) / (max_j − min_j)
Feature selection and transformation

Sampling for Static Data
    Sampling with Replacement
    Sampling without Replacement: no duplicates
    Biased Sampling
    Stratified Sampling
Reservoir Sampling for Data Streams
    Given a data stream, choose k items with the same
    probability, storing only k elements in memory.
RESERVOIR SAMPLING

1  for every item i in the first k items of the stream
2      do store item i in the reservoir
3  n = k
4  for every item i in the stream after the first k items of the stream
5      do select a random number r between 1 and n
6      if r ≤ k
7          then replace item r in the reservoir with item i
8      n = n + 1

Figure: Algorithm RESERVOIR SAMPLING
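A minimal Python sketch of the algorithm above; the stream and seed are illustrative:

import random

def reservoir_sample(stream, k, seed=None):
    rng = random.Random(seed)
    reservoir = []
    n = 0
    for item in stream:
        n += 1
        if n <= k:
            reservoir.append(item)          # store the first k items
        else:
            r = rng.randint(1, n)           # random number between 1 and n
            if r <= k:
                reservoir[r - 1] = item     # replace item r in the reservoir
    return reservoir

print(reservoir_sample(range(1_000_000), k=5, seed=42))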
Feature selection and transformation

Feature Subset Selection
    Supervised feature selection
    Unsupervised feature selection
    Biased Sampling
    Stratified Sampling
Dimensionality reduction with axis rotation
    Principal Component Analysis
    Singular Value Decomposition
    Latent Semantic Analysis
Principal Component Analysis

Goal: Principal component analysis computes the most
meaningful basis to re-express a noisy, garbled data set.
The hope is that this new basis will filter out the noise and
reveal hidden dynamics

Normalize Input Data
Compute k orthonormal vectors to have a basis for the
normalized data
Sort these principal components
Eliminate components with low variance
Principal Component Analysis

Organize the data set X as an m × n matrix, where m is the
number of features and n is the number of instances.
Normalize Input Data: subtract off the mean for each
instance x_i
Calculate the SVD or the eigenvectors of the covariance
Find some orthonormal matrix P where Y = PX such that

    S_Y = (1/(n−1)) Y Y^T

is diagonalized.
The rows of P are the principal components of X.
Sort these principal components
Eliminate components with low variance
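A minimal NumPy sketch of these steps using the SVD, with X organized as features × instances as above; centering each feature (row) and the toy data are assumptions made for illustration:

import numpy as np

def pca(X, n_components):
    # X: m x n matrix (m features, n instances), as on the slide.
    Xc = X - X.mean(axis=1, keepdims=True)          # subtract the mean of each feature
    # SVD of the centered data; the left singular vectors give the directions
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = U.T[:n_components]                          # rows of P = principal components
    Y = P @ Xc                                      # re-expressed data
    explained_variance = (s ** 2) / (X.shape[1] - 1)
    return P, Y, explained_variance[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))
X[2] = 0.5 * X[0] + 0.1 * rng.normal(size=100)      # a nearly redundant feature
P, Y, var = pca(X, n_components=2)
print(var)    # variance captured by the kept components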
Clustering, classification and
evaluation

Mostafa H. Chehreghani

Mostafa.chehreghani@gmail.com
Clustering

Albert Bifet (@abifet)

Paris, 18 October 2015


albert.bifet@telecom-paristech.fr
Clustering

Definition
Clustering is the distribution of a set of instances or examples
into previously unknown groups according to some common relations or
affinities.

Example
Market segmentation of customers

Example
Social network communities
Clustering

Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
    f : I → {1, . . . , K}
that minimizes the objective function cost(I)

Clustering
Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function

    cost(C, I) = Σ_{x ∈ I} d²(x, C)

where
    d(x, c): distance function between x and c
    d²(x, C) = min_{c ∈ C} d²(x, c): distance from x to the nearest
    point in C
k-means

1. Choose k initial centers C = {c_1, . . . , c_k}
2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to each instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k

k-means++

1. Choose an initial center c_1
   For k = 2, . . . , K
     select c_k = p ∈ I with probability d²(p, C)/cost(C, I)
2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to each instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k
Performance Measures
Internal Measures
    Cluster Cohesion: Measures how closely related the
    objects in a cluster are
    Cluster Separation: Measures how distinct or well
    separated a cluster is from other clusters
    Silhouette Coefficient: 1 − a/b if a < b
        a = average distance of i to the points in its cluster
        b = min (average distance of i to points in another cluster)

External Measures
    Rand Measure
    F Measure
    Jaccard
    Purity
Distances

Numeric features
    Euclidean:

        d(x, y) = ||x − y||_2 = sqrt( Σ_i (x_i − y_i)² )

    Manhattan distance:

        d(x, y) = ||x − y||_1 = Σ_i |x_i − y_i|
Density based methods

DBSCAN
    ε-neighborhood(p): set of points that are at a distance from p
    less than or equal to ε
    Core object: object whose ε-neighborhood has an overall
    weight of at least MinPts
    A point p is directly density-reachable from q if
        p is in ε-neighborhood(q)
        q is a core object
    A point p is density-reachable from q if
        there is a chain of points p_1, . . . , p_n such that p_{i+1} is directly
        density-reachable from p_i
    A point p is density-connected to q if
        there is a point o such that p and q are density-reachable
        from o

Density based methods

DBSCAN
    A cluster C of points satisfies
        if p ∈ C and q is density-reachable from p, then q ∈ C
        all points p, q ∈ C are density-connected
    A cluster is uniquely determined by any of its core points
    A cluster can be obtained by
        choosing an arbitrary core point as a seed
        retrieving all points that are density-reachable from the seed

Density based methods

DBSCAN
    select an arbitrary point p
    retrieve all points density-reachable from p
    if p is a core point, a cluster is formed
    if p is a border point
        no points are density-reachable from p and
        DBSCAN visits the next point of the database
    continue the process until all of the points have been
    processed
DBSCAN

Figure: DBSCAN point example with MinPts = 3
BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES
    Clustering Features CF = (N, LS, SS)
        N: number of data points
        LS: linear sum of the N data points
        SS: square sum of the N data points
    Properties:
        Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        Easy to compute: average inter-cluster distance
        and average intra-cluster distance
    Uses a CF tree
        Height-balanced tree with two parameters
            B: branching factor
            T: radius leaf threshold
BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Phase 1: Scan all data and build an initial in-memory CF
tree
Phase 2: Condense into desirable range by building a
smaller CF tree (optional)
Phase 3: Global clustering
Phase 4: Cluster refining (optional and offline, as it requires
more passes)
BIRCH:
Balanced Iterative Reducing and Clustering using
Hierarchies

Tian Zhang, Raghu Ramakrishnan, Miron Livny

Presented by Zhao Li
2009, Spring
Introduction to BIRCH

Designed for very large data sets
Time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of data is necessary
Does not need the whole data set in advance

Two key phases:
Scans the database to build an in-memory tree
Applies clustering algorithm to cluster the leaf nodes

Similarity Metric(1)

Given a cluster of instances , we define:


Centroid:

Radius: average distance from member points to centroid

Diameter: average pair-wise distance within a cluster

Similarity Metric(2)

centroid Euclidean distance:


centroid Manhattan distance:

average inter-cluster:
average intra-cluster:
variance increase:

Clustering Feature

The Birch algorithm builds a dendrogram called clustering
feature tree (CF tree) while scanning the data set.

Each entry in the CF tree represents a cluster of objects
and is characterized by a 3-tuple: (N, LS, SS), where N is
the number of objects in the cluster and LS, SS are defined
in the following.

Properties of Clustering Feature

CF entry is more compact
Stores significantly less than all of the data points in
the sub-cluster

A CF entry has sufficient information to calculate
D0-D4

Additivity theorem allows us to merge sub-clusters
incrementally & consistently

CF-Tree

Each non-leaf node has at


most B entries
Each leaf node has at
most L CF entries,
each of which satisfies
threshold T

CF-Tree Insertion

Recurse down from root
Find the appropriate leaf
Follow the "closest"-CF path, w.r.t. D0 ... D4

Modify the leaf
If the closest-CF leaf cannot absorb, make a new CF
entry. If there is no room for new leaf, split the parent
node

Traverse back
Update CFs on the path or splitting nodes
CF-Tree Rebuilding

If we run out of space, increase threshold T
By increasing the threshold, CFs absorb more data

Rebuilding "pushes" CFs over
The larger T allows different CFs to group together

Reducibility theorem
Increasing T will result in a CF-tree smaller than the
original

Example of BIRCH

[Figure: CF tree with a root node, leaf nodes LN1-LN3, and subclusters sc1-sc8;
a new subcluster sc8 arrives and is inserted under LN1]
Insertion Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN1 is split.

[Figure: after inserting sc8, leaf node LN1 is split into two leaf nodes]
If the branching factor of a non-leaf node cannot exceed 3, then the root is
split and the height of the CF Tree increases by one.

[Figure: the root is split into two non-leaf nodes NLN1 and NLN2, which point
to the leaf nodes]
BIRCH Overview

[Figure: BIRCH overview diagram]
Experimental Results

Input parameters:
Memory (M): 5% of data set
Disk space (R): 20% of M
Distance equation: D2
Quality equation: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
Experimental Results

KMEANS clustering
DS   Time   D      # Scan      DS   Time   D      # Scan
1    43.9   2.09   289         1o   33.8   1.97   197
2    13.2   4.43   51          2o   12.7   4.20   29
3    32.9   3.66   187         3o   36.0   4.35   241

BIRCH clustering
DS   Time   D      # Scan      DS   Time   D      # Scan
1    11.5   1.87   2           1o   13.6   1.87   2
2    10.7   1.99   2           2o   12.1   1.99   2
Exam Questions

What is the main limitation of BIRCH?
Since each node in a CF tree can hold only a limited
number of entries due to its size, a CF tree node doesn't
always correspond to what a user may consider a natural
cluster. Moreover, if the clusters are not spherical in
shape, it doesn't perform well because it uses the notion
of radius or diameter to control the boundary of a
cluster.
Classification Evaluation

Albert Bifet (@abifet)

Paris, 27 September 2016


albert.bifet@telecom-paristech.fr
Evaluation

1. Error estimation: Hold-out or Cross-Validation
2. Evaluation performance measures: Accuracy or κ-statistic
3. Statistical significance validation: McNemar or Nemenyi test

Evaluation Framework
Error Estimation

Data available for testing
    Hold out an independent test set
    Apply the current decision model to the test set
    The loss estimated in the holdout is an unbiased estimator

Holdout Evaluation
1. Error Estimation

Not enough data available for testing
    Divide the dataset into 10 folds
    Repeat 10 times: use one fold for testing and the rest for
    training

k-fold Cross-validation
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example

    Precision = tp / (tp + fp)
    Recall = tp / (tp + fn)
    F1 = 2 · precision · recall / (precision + recall)
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example

    Accuracy = 75/100 + 10/100 = (75/83)·(83/100) + (10/17)·(17/100) = 85%
    Arithmetic mean = (75/83 + 10/17)/2 = 74.59%
    Geometric mean = sqrt( (75/83)·(10/17) ) = 72.90%
2. Performance Measures with Unbalanced Classes
Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example

Predicted Predicted
Class+ Class- Total
Correct Class+ 68.06 14.94 83
Correct Class- 13.94 3.06 17
Total 82 18 100
Table: Confusion matrix for chance predictor
2. Performance Measures with Unbalanced Classes
Kappa Statistic
    p0: classifier's prequential accuracy
    pc: probability that a chance classifier makes a correct
    prediction.
    κ statistic

        κ = (p0 − pc) / (1 − pc)

    κ = 1 if the classifier is always correct
    κ = 0 if the predictions coincide with the correct ones as
    often as those of the chance classifier

Matthews correlation coefficient (MCC)

    MCC = (tp · tn − fp · fn) / sqrt( (tp + fp)(tp + fn)(tn + fp)(tn + fn) )
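A small Python sketch computing κ and MCC from a confusion matrix, modelling the chance classifier with the observed marginal frequencies; the numbers below are the example matrix from the previous slides:

import math

def kappa_and_mcc(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    p0 = (tp + tn) / n                                  # observed accuracy
    # chance classifier: predicts classes with the observed marginal frequencies
    pc = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (p0 - pc) / (1 - pc)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return kappa, mcc

# confusion matrix from the example above: 75, 8 / 7, 10
print(kappa_and_mcc(tp=75, fn=8, fp=7, tn=10))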
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example

AUC: Area under the ROC curve

A ROC space is defined by FPR and TPR (recall)
    FPR = fp / (fp + tn)
    TPR = tp / (tp + fn)
3. Statistical significance validation (2 Classifiers)

                      Classifier A Class+   Classifier A Class-   Total
Classifier B Class+   c                     a                     c+a
Classifier B Class-   b                     d                     b+d
Total                 c+b                   a+d                   a+b+c+d

    M = (|a − b| − 1)² / (a + b)

The test follows the χ² distribution. At 0.99 confidence it rejects
the null hypothesis (the performances are equal) if M > 6.635.

McNemar test
3. Statistical significance validation (> 2 Classifiers)
Two classifiers are performing differently if the corresponding
average ranks differ by at least the critical difference

    CD = q_α · sqrt( k(k + 1) / (6N) )

    k is the number of learners, N is the number of datasets,
    critical values q_α are based on the Studentized range
    statistic divided by sqrt(2).

Nemenyi test

# classifiers   2       3       4       5       6       7
q0.05           1.960   2.343   2.569   2.728   2.850   2.949
q0.10           1.645   2.052   2.291   2.459   2.589   2.693
Table: Critical values for the Nemenyi test
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
How to organize the Web?
First try: Human curated
Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval attempts to
find relevant docs in a small
and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents,
random things, web spam, etc.
So we need a good way to rank webpages!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
2 challenges of web search:
(1) Web contains many sources of information
Who to trust?
Insight: Trustworthy pages may point to each other!
(2) What is the best answer to query
newspaper?
No single right answer
Insight: Pages that actually know about newspapers
might all be pointing to many newspapers

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
All web pages are not equally important
www.joe-schmoe.com vs. www.stanford.edu

We already know:
There is large diversity
in the web-graph vs.
node connectivity.
So, lets rank the pages
using the web graph
link structure!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4
We will cover the following Link Analysis
approaches to computing importance of
nodes in a graph:
    Hubs and Authorities (HITS)
    PageRank
    Random Walk with Restarts

Sidenote: Various notions of centrality of a node v:
    Degree centrality = degree of v
    Betweenness centrality = # shortest paths passing through v
    Closeness centrality = avg. length of shortest paths from v to
    all other nodes of the network
    Eigenvector centrality = like PageRank
Goal (back to the newspaper example):
Don't just find newspapers. Find "experts": pages that
link in a coordinated way to good newspapers
Idea: Links as votes
    A page is more important if it has more links
    In-coming links? Out-going links?
Hubs and Authorities
Each page has 2 scores:
    Quality as an expert (hub):
        Total sum of votes of pages pointed to
    Quality as a content provider (authority):
        Total sum of votes of experts
Principle of repeated improvement

[Figure: example votes NYT: 10, CNN: 8, WSJ: 9, Ebay: 3, Yahoo: 3]
Interesting pages fall into two classes:
1. Authorities are pages containing useful information
    Newspaper home pages
    Course home pages
    Home pages of auto manufacturers
2. Hubs are pages that link to authorities
    List of newspapers
    Course bulletin
    List of U.S. auto manufacturers

[Figure: example votes NYT: 10, CNN: 8, WSJ: 9, Ebay: 3, Yahoo: 3]
Each page starts with hub score 1
Authorities collect their votes

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and the authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9
Hubs collect authority scores

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10
Authorities collect hub scores

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11
A good hub links to many good authorities
A good authority is linked from many good hubs
Note a self-reinforcing recursive definition

Model using two scores for each node:
    Hub score and Authority score
    Represented as vectors h and a, where the i-th
    element is the hub/authority score of the i-th node
[Kleinberg '98]

Each page i has 2 scores:
    Authority score: a_i
    Hub score: h_i

Convergence criteria:
    Σ_i ( a_i^(t+1) − a_i^(t) )² < ε
    Σ_i ( h_i^(t+1) − h_i^(t) )² < ε

HITS algorithm:
    Initialize: a_j^(0) = 1/√n, h_j^(0) = 1/√n
    Then keep iterating until convergence:
        ∀i: Authority: a_i^(t+1) = Σ_{j→i} h_j^(t)
        ∀i: Hub: h_i^(t+1) = Σ_{i→j} a_j^(t)
        Normalize: Σ_i ( a_i^(t+1) )² = 1,  Σ_j ( h_j^(t+1) )² = 1
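A minimal NumPy sketch of this iteration; the small adjacency matrix is an illustrative input and normalization uses the L2 norm as on the slide:

import numpy as np

def hits(A, n_iters=100, tol=1e-8):
    # A: n x n adjacency matrix, A[i, j] = 1 if page i links to page j.
    n = A.shape[0]
    a = np.full(n, 1 / np.sqrt(n))       # authority scores
    h = np.full(n, 1 / np.sqrt(n))       # hub scores
    for _ in range(n_iters):
        a_new = A.T @ h                   # authority: sum of hub scores of in-links
        h_new = A @ a_new                 # hub: sum of authority scores of out-links
        a_new /= np.linalg.norm(a_new)    # normalize
        h_new /= np.linalg.norm(h_new)
        converged = np.allclose(a_new, a, atol=tol) and np.allclose(h_new, h, atol=tol)
        a, h = a_new, h_new
        if converged:
            break
    return h, a

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
hubs, auths = hits(A)
print(hubs.round(3), auths.round(3))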
[Kleinberg '98]
Details!

HITS in the vector notation:
    Vector a = (a_1, ..., a_n), h = (h_1, ..., h_n)
    Adjacency matrix A (n x n): A_ij = 1 if i → j, 0 otherwise
    Can rewrite h_i = Σ_{i→j} a_j as h_i = Σ_j A_ij · a_j
    So: h = A · a   and similarly: a = A^T · h
    Repeat until convergence:
        h^(t+1) = A · a^(t)
        a^(t+1) = A^T · h^(t)
        Normalize a^(t+1) and h^(t+1)
Details!

What is a = A^T · h? Then: a = A^T (A · a)
    a is updated (in 2 steps):
        h = A · a, then a = A^T · h = (A^T A) · a
    h is updated (in 2 steps):
        a = A^T · h, then h = A · a = (A A^T) · h
Thus, in 2k steps:
    a = (A^T A)^k · a
    h = (A A^T)^k · h
Repeated matrix powering
Definition: Eigenvectors & Eigenvalues
    Let A x = λ x
    for some scalar λ, vector x, matrix A
    Then x is an eigenvector, and λ is its eigenvalue
The steady state (HITS has converged) is:
    c · a = A^T A · a
    c′ · h = A A^T · h
    (Note: constants c, c′ don't matter as we
    normalize them out every step of HITS)

So, authority a is the eigenvector of A^T A
(associated with the largest eigenvalue)
Similarly: hub h is the eigenvector of A A^T
Still the same idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes:
www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link

Are all in-links equal?


Links from important pages count more
Recursive question!

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 18
A vote from an important page is worth more:
    Each link's vote is proportional to the importance of its source page
    If page i with importance r_i has d_i out-links, each link gets r_i / d_i votes
    Page j's own importance r_j is the sum of the votes on its in-links

[Figure: pages i and k link to j; i has 3 out-links, k has 4 out-links]

    r_j = r_i/3 + r_k/4
A page is important if it is pointed to by other important pages
Define a rank r_j for node j:

    r_j = Σ_{i→j} r_i / d_i        d_i ... out-degree of node i

The web in 1839 [Figure: three pages y, a, m with links y→y, y→a, a→y, a→m, m→a]

Flow equations:
    r_y = r_y/2 + r_a/2
    r_a = r_y/2 + r_m
    r_m = r_a/2

You might wonder: Let's just use Gaussian elimination
to solve this system of linear equations. Bad idea!
Stochastic adjacency matrix M
    Let page i have d_i out-links
    If i → j, then M_ji = 1/d_i
    M is a column stochastic matrix: columns sum to 1

Rank vector r: an entry per page
    r_i is the importance score of page i
    Σ_i r_i = 1

The flow equations can be written as

    r = M · r        i.e.   r_j = Σ_{i→j} r_i / d_i
Imagine a random web surfer:
    At any time t, the surfer is on some page i
    At time t+1, the surfer follows an out-link from i uniformly at random
    Ends up on some page j linked from i
    Process repeats indefinitely

    r_j = Σ_{i→j} r_i / d_out(i)

Let:
    p(t) ... vector whose i-th coordinate is the
    prob. that the surfer is at page i at time t
    So, p(t) is a probability distribution over pages

Where is the surfer at time t+1?
    Follows a link uniformly at random:
        p(t+1) = M · p(t)
Suppose the random walk reaches a state
    p(t+1) = M · p(t) = p(t)
then p(t) is a stationary distribution of the random walk
Our original rank vector r satisfies r = M · r
    So, r is a stationary distribution for the random walk
Given a web graph with n nodes, where the
nodes are pages and edges are hyperlinks:
    Assign each node an initial page rank
    Repeat until convergence ( Σ_i |r_i^(t+1) − r_i^(t)| < ε ):
        Calculate the page rank of each node:

            r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

        d_i ... out-degree of node i
Power Iteration:
    Set r_j ← 1/N
    1: r′_j ← Σ_{i→j} r_i / d_i
    2: r ← r′
    If |r − r′| > ε: goto 1

[Stochastic matrix M for the y, a, m graph:
        y    a    m
    y   1/2  1/2  0
    a   1/2  0    1
    m   0    1/2  0 ]

Flow equations:
    r_y = r_y/2 + r_a/2
    r_a = r_y/2 + r_m
    r_m = r_a/2

Example:
    r_y     1/3   1/3   5/12   9/24   ...   6/15
    r_a  =  1/3   3/6   1/3    11/24  ...   6/15
    r_m     1/3   1/6   3/12   1/6    ...   3/15
    Iteration 0, 1, 2, ...
    r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i     or equivalently    r = M · r

Does this converge?

Does it converge to what we want?

Are results reasonable?
The "Spider trap" problem:

[Graph: a → b, b → b (self-loop)]

    r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

Example:
    Iteration:  0, 1, 2, 3, ...
    r_a =       1  0  0  0
    r_b =       0  1  1  1
The "Dead end" problem:

[Graph: a → b, b has no out-links]

    r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

Example:
    Iteration:  0, 1, 2, 3, ...
    r_a =       1  0  0  0
    r_b =       0  1  0  0
2 problems:
(1) Some pages are
dead ends (have no out-links)
Such pages cause
importance to leak out

(2) Spider traps


(all out-links are within the group)
Eventually spider traps absorb all importance

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31
Power Iteration:
    Set r_j = 1/N
    r_j = Σ_{i→j} r_i / d_i
    And iterate

[Graph y, a, m where m now links only to itself (a spider trap)]

    r_y = r_y/2 + r_a/2
    r_a = r_y/2
    r_m = r_a/2 + r_m

Example:
    r_y     1/3   2/6   3/12   5/24    ...   0
    r_a  =  1/3   1/6   2/12   3/24    ...   0
    r_m     1/3   3/6   7/12   16/24   ...   1
    Iteration 0, 1, 2, ...
The Google solution for spider traps: At each
time step, the random surfer has two options
    With prob. β, follow a link at random
    With prob. 1−β, jump to a random page
    Common values for β are in the range 0.8 to 0.9
The surfer will teleport out of a spider trap within a
few time steps

[Figure: the y, a, m graph with random teleport links added]
Power Iteration:
    Set r_j = 1/N
    r_j = Σ_{i→j} r_i / d_i
    And iterate

[Graph y, a, m where m is a dead end (no out-links)]

    r_y = r_y/2 + r_a/2
    r_a = r_y/2
    r_m = r_a/2

Example:
    r_y     1/3   2/6   3/12   5/24   ...   0
    r_a  =  1/3   1/6   2/12   3/24   ...   0
    r_m     1/3   1/6   1/12   2/24   ...   0
    Iteration 0, 1, 2, ...
Teleports: Follow random teleport links with
probability 1.0 from dead-ends
    Adjust the matrix accordingly

[Figure: the column of the stochastic matrix for dead-end node m changes from
all zeros to all 1/3]
Google's solution: At each step, the random
surfer has two options:
    With probability β, follow a link at random
    With probability 1−β, jump to some random page
PageRank equation [Brin-Page, '98]:

    r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N

    d_i ... out-degree of node i

The above formulation assumes that M has no dead ends. We can
either preprocess matrix M (bad!) or explicitly follow random teleport
links with probability 1.0 from dead-ends. See P. Berkhin, "A Survey
on PageRank Computing", Internet Mathematics, 2005.
Details!

PageRank as a principal eigenvector:

    r = M · r   or equivalently   r_j = Σ_{i→j} r_i / d_i

But we really want (**):

    r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N

Let's define the Google Matrix A:

    A = β M + (1 − β) [1/N]_{N×N}

Now we get what we want:

    r = A · r

Note: M is a sparse matrix but A is dense (all entries ≥ 0). In
practice we never materialize A but rather use the sum formulation (**).
What is β? In practice 1−β ≈ 0.15 (jump approx. every 5-6 links)
Input: Graph G and parameter β
    Directed graph G with spider traps and dead ends
    Parameter β
Output: PageRank vector r

    Set: r_j^(0) = 1/N, t = 1
    do:
        ∀j:  r′_j^(t) = Σ_{i→j} β · r_i^(t−1) / d_i
             r′_j^(t) = 0 if in-degree of j is 0

        Now re-insert the leaked PageRank:
        ∀j:  r_j^(t) = r′_j^(t) + (1 − S)/N    where S = Σ_j r′_j^(t)

        t = t + 1
    while Σ_j | r_j^(t) − r_j^(t−1) | > ε
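A minimal Python sketch of this algorithm (teleport with β, leaked PageRank re-inserted uniformly); the y, a, m example graph from the earlier slides is used as input:

import numpy as np

def pagerank(out_links, beta=0.85, eps=1e-10):
    # out_links: dict {node: list of nodes it links to}; dead ends are handled
    # by re-inserting the "leaked" PageRank uniformly, as in the algorithm above.
    nodes = sorted(out_links)
    idx = {v: i for i, v in enumerate(nodes)}
    N = len(nodes)
    r = np.full(N, 1.0 / N)
    while True:
        r_new = np.zeros(N)
        for i, outs in out_links.items():
            for j in outs:
                r_new[idx[j]] += beta * r[idx[i]] / len(outs)
        r_new += (1.0 - r_new.sum()) / N        # re-insert leaked PageRank
        if np.abs(r_new - r).sum() < eps:
            return dict(zip(nodes, r_new))
        r = r_new

# the y, a, m example: y -> {y, a}, a -> {y, m}, m -> {a}
print(pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}))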
PageRank and HITS are two solutions to the
same problem:
What is the value of an in-link from u to v?
In the PageRank model, the value of the link
depends on the links into u
In the HITS model, it depends on the value of the
other links out of u

The destinies of PageRank and HITS


post-1998 were very different

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40
[Tong-Faloutsos, '06]

[Figure: example graph with nodes A-J and unit edge weights]

a.k.a.: Relevance, Closeness, Similarity
Given:
    Conferences-to-authors graph
Goal:
    Proximity on graphs
Q: What is the most related conference to ICDM?

[Figure: bipartite graph of conferences (IJCAI, KDD, ICDM, SDM, AAAI, NIPS)
and authors (Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan)]
Shortest path is not good:

No influence for degree-1 nodes (E, F, G)!


Multi-faceted relationships
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 44
Network Flow is not good:

Does not punish long paths

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 45
[Tong-Faloutsos, '06]

[Figure: example graph with nodes A-J and unit edge weights]

    Multiple Connections
    Quality of connection
        Direct & In-direct connections
        Length, Degree, Weight
[Figure: example graph with 12 numbered nodes]
Goal: Evaluate pages not just by popularity
but by how close they are to the topic
Teleporting can go to:
    Any page with equal probability
        PageRank (we used this so far)
    A topic-specific set of relevant pages S
        Topic-specific (personalized) PageRank (S ... teleport set)

    A_ij = β M_ij + (1 − β)/|S|   if i ∈ S
    A_ij = β M_ij                 otherwise

Random Walk with Restart: S is a single element
Graphs and web search:
    Ranks nodes by importance
Personalized PageRank:
    Ranks proximity of nodes to the teleport nodes S
Proximity on graphs:
    Q: What is the most related conference to ICDM?
    Random Walks with Restarts
        Teleport back to the starting node: S = { single node }

[Figure: bipartite graph of conferences (IJCAI, KDD, ICDM, SDM, AAAI, NIPS)
and authors (Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan)]
[Figure: Random Walk with Restarts scores relative to query Node 4:
Node 1: 0.13, Node 2: 0.10, Node 3: 0.13, Node 5: 0.13, Node 6: 0.05,
Node 7: 0.05, Node 8: 0.08, Node 9: 0.04, Node 10: 0.03, Node 11: 0.04,
Node 12: 0.02]

Nearby nodes get higher scores (more red, more relevant); the scores form a
ranking vector
[Figure: conferences closest to ICDM by Random Walk with Restart proximity:
KDD, SDM, PAKDD, PKDD, ICML, CIKM, ICDE, ECML, SIGMOD, DMKD, with scores
between 0.004 and 0.011]
Q: Which conferences are closest to KDD & ICDM?
A: Personalized PageRank with teleport set S = {KDD, ICDM}
   on the graph of CS conferences
Pins belong to Boards

[Figures: Pinterest examples; the input is a query pin and the output is a
ranked list of related pins]

Proximity to query node(s) Q:
    Bipartite Pin and Board graph
Pixie Random Walk
Proximity to query node(s) Q:

[Figure: Pixie random walk visit counts from query pin Q across pins on the
boards "Yummm", "Strawberries", "Smoothies", "Smoothie Madness!!"]
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Observations                        Models                            Algorithms

Small diameter,                     Erdős–Rényi model,                Decentralized search
Edge clustering                     Small-world model

Patterns of signed                  Structural balance,               Models for predicting
edge creation                       Theory of status                  edge signs

Viral Marketing, Blogosphere,       Independent cascade model,        Influence maximization,
Memetracking                        Game theoretic model              Outbreak detection, LIM

Scale-Free                          Preferential attachment,          PageRank, Hubs and
                                    Copying model                     authorities

Densification power law,            Microscopic model of              Link prediction,
Shrinking diameters                 evolving networks                 Supervised random walks

Strength of weak ties,              Kronecker Graphs                  Community detection:
Core-periphery                                                        Girvan-Newman, Modularity
We often think of networks looking
like this:

What lead to such a conceptual picture?


11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
How information flows through the network?
What structurally distinct roles do nodes play?
What roles do different links (short vs. long) play?
How people find out about new jobs?
Mark Granovetter, part of his PhD in 1960s
People find the information through personal contacts
But: Contacts were often acquaintances
rather than close friends
This is surprising: One would expect your friends to
help you out more than casual acquaintances
Why is it that acquaintances are most helpful?
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4
[Granovetter 73]

Two perspectives on friendships:


Structural: Friendships span different parts of the
network
Interpersonal: Friendship between two people is
either strong or weak
Structural role: Triadic Closure
a If two people in a
network have a friend in
common, then there is
an increased likelihood
b they will become friends
c
themselves.
Which edge is more
likely, a-b or a-c?
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5
Granovetter makes a connection between the
social and structural role of an edge
First point: Structure
    Structurally embedded edges are also socially strong
    Long-range edges spanning different parts of the
    network are socially weak
Second point: Information
    Long-range edges allow you to gather information
    from different parts of the network and get a job
    Structurally embedded edges are
    heavily redundant in terms of
    information access

[Figure: network with strong (S) embedded edges and a weak (W) long-range
edge a-b]
Triadic closure == High clustering coefficient
Reasons for triadic closure:
If B and C have a common friend A, then:
    B is more likely to meet C
    (since they both spend time with A)
    B and C trust each other
    (since they have a friend in common)
    A has incentive to bring B and C together
    (as it is hard for A to maintain two disjoint relationships)
Empirical study by Bearman and Moody:
    Teenage girls with low clustering coefficient are
    more likely to contemplate suicide
Define: Bridge edge
    If removed, it disconnects the graph

Define: Local bridge
    Edge of span > 2
    (The span of an edge is the distance of the edge endpoints if the edge is
    deleted. Local bridges with long span are like real bridges.)

Define: Two types of edges:
    Strong (friend), Weak (acquaintance)

Define: Strong triadic closure:
    Two strong ties imply a third edge

Fact: If strong triadic closure is satisfied then local bridges are weak ties!

[Figures: a bridge edge a-b, a local bridge a-b, and a network with strong (S)
and weak (W) edge labels]
Claim: If a node A satisfies Strong Triadic Closure
and is involved in at least two strong ties, then
any local bridge adjacent to A must be a weak tie.

Proof by contradiction:
    Assume A satisfies Strong Triadic Closure and has 2 strong ties
    Let A-B be a local bridge and a strong tie
    Then the edge B-C must exist because of Strong Triadic Closure
    But then A-B is not a local bridge!
    (since B-C must be connected due to the Strong Triadic Closure property,
    the span of A-B is only 2)
For many years Granovetters theory was not
tested
But, today we have large who-talks-to-whom
graphs:
Email, Messenger, Cell phones, Facebook

Onnela et al. 2007:


Cell-phone network of 20% of countrys population
Edge strength: # phone calls

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10
Edge overlap:

    O_ij = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|

    N(i) ... the set of neighbors of node i

Overlap = 0 when an edge is a local bridge
Cell-phone network
Observation:
    Highly used links have high neighborhood overlap!

Legend:
    True: The data
    Permuted strengths: Keep the network structure
    but randomly reassign edge strengths

[Figure: neighborhood overlap vs. edge strength (#calls) for the true and
permuted-strength networks]

[Figure: real edge strengths in the mobile call graph;
strong ties are more embedded (have higher overlap)]

[Figure: same network, same set of edge strengths,
but now strengths are randomly shuffled]
[Figure: size of the largest component vs. fraction of removed links, removing
links by strength (#calls), low-to-high vs. high-to-low; removing the low-strength
links first disconnects the network sooner. Conceptual picture of network structure.]

[Figure: size of the largest component vs. fraction of removed links, removing
links based on overlap, low-to-high vs. high-to-low; removing the low-overlap
links first disconnects the network sooner. Conceptual picture of network structure.]
Granovetter's theory leads to the following
conceptual picture of networks

[Figure: conceptual network with strong ties inside groups and weak ties
between groups]
Granovetter's theory suggests that networks
are composed of tightly connected
sets of nodes

Network communities (also called clusters, groups, modules):
Sets of nodes with lots of connections inside and
few to the outside (the rest of the network)
How to automatically
find such densely
connected groups of
nodes?
Ideally such automatically
detected clusters would
then correspond to real
groups
For example (communities, clusters, groups, modules):

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20
Zachary's Karate club network:
Observe social ties and rivalries in a university karate club.
During his observation, conflicts led the group to split.
The split could be explained by a minimum cut in the network.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21
Find micro-markets by partitioning the
query x advertiser graph:
query

advertiser

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 22
Can we identify
node groups?
(communities,
modules, clusters)

Nodes: Teams
Edges: Games played
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 23
NCAA conferences

Nodes: Teams
Edges: Games played
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 24
Can we identify
social communities?

Nodes: Users
Edges: Friendships
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25
High school Company

Stanford (Basketball)
Stanford (Squash)

Nodes: Users
Social communities Edges: Friendships
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26
Can we identify
functional modules?

Nodes: Proteins
Edges: Interactions
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27
Functional modules

Nodes: Proteins
Edges: Interactions
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 28
How to find communities?

We will work with undirected (unweighted) networks


11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 29
Edge betweenness: the number of shortest paths passing over the edge (example values in the figure: b=16, b=7.5).
Intuition (figures): edge strengths (call volume) in a real network vs. edge betweenness in a real network.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 30
[Girvan-Newman 02]

Divisive hierarchical clustering based on the


notion of edge betweenness:
Number of shortest paths passing through the edge
Girvan-Newman Algorithm:
Undirected unweighted networks
Repeat until no edges are left:
Calculate betweenness of edges
Remove edges with highest betweenness
Connected components are communities
Gives a hierarchical decomposition of the network

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31
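A compact sketch of the Girvan-Newman loop for an undirected, unweighted graph. It uses networkx only to compute edge betweenness (that library choice is ours; the slides do not prescribe one), and records the communities after each removal round:

import networkx as nx

def girvan_newman_levels(G):
    G = G.copy()
    levels = []
    while G.number_of_edges() > 0:
        # re-compute betweenness after every removal, as the slides stress
        betweenness = nx.edge_betweenness_centrality(G)
        max_b = max(betweenness.values())
        for edge, b in list(betweenness.items()):
            if b == max_b:
                G.remove_edge(*edge)
        levels.append([sorted(c) for c in nx.connected_components(G)])
    return levels    # hierarchical decomposition, from few large to many small communities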
(Figure: example edge betweenness values 12, 1, 33 and 49 on the bridging edges.)
Need to re-compute betweenness at every step.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32
(Figure: Step 1, Step 2 and Step 3 of the edge removal, and the resulting hierarchical network decomposition.)
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 33
1. How to compute betweenness?
2. How to select the number of
clusters?

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 34
Want to compute the betweenness of paths starting at a given node:
do a breadth-first search starting from that node.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 35
Forward step: count the number of shortest paths from the starting node to all other nodes of the network.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 36
Backward step: compute betweenness. If there are multiple shortest paths, count them fractionally.
The algorithm:
Add edge flows: node flow = 1 + sum of child edge flows; split the flow up based on the parent values.
Repeat the BFS procedure for each starting node.
(Figure: e.g. 1+1 paths to H, split evenly; 1+0.5 paths to J, split 1:2; 1 path to K, split evenly.)
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 37
1. How to compute betweenness?
2. How to select the number of
clusters?

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 39
Communities: sets of
tightly connected nodes
Define: Modularity
A measure of how well
a network is partitioned
into communities
Given a partitioning of the network into groups s ∈ S:
Q ∝ Σ_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]
Need a null model!
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40
Given a real graph G on n nodes and m edges, construct a rewired network G':
Same degree distribution, but otherwise random connections
Consider G' as a multigraph
The expected number of edges between nodes i and j of degrees k_i and k_j equals k_i·k_j / 2m
The expected number of edges in the (multigraph) G':
(1/2) Σ_i Σ_j k_i·k_j / 2m = (1/2)·(1/2m) Σ_i k_i · Σ_j k_j = (1/4m)·2m·2m = m
Note: Σ_i k_i = 2m
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 41
Modularity of a partitioning S of graph G:
Q ∝ Σ_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]
Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i·k_j / 2m )
where A_ij = 1 if i→j and 0 otherwise, and 1/2m is a normalizing constant so that −1 < Q < 1.
Modularity values take the range [−1, 1].
Q is positive if the number of edges within groups exceeds the expected number.
Q in the range 0.3-0.7 means significant community structure.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 42
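The definition above can be evaluated directly: for each group s, Q gains L_s/m − (d_s/2m)^2, where L_s is the number of edges inside s and d_s the total degree of s. A small sketch (our own function and variable names; adj is a dict of neighbour sets, groups a list of node sets):

def modularity(adj, groups):
    m = sum(len(nbrs) for nbrs in adj.values()) / 2          # number of edges
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    Q = 0.0
    for s in groups:
        L_s = sum(1 for v in s for u in adj[v] if u in s) / 2  # edges inside s
        d_s = sum(deg[v] for v in s)                           # total degree of s
        Q += L_s / m - (d_s / (2 * m)) ** 2
    return Q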
Modularity is useful for selecting the
number of clusters: Q

Why not optimize Modularity directly?


11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 43
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 44
Let's split the graph into 2 communities!
We want to directly optimize modularity:
max_S Q(G, S) = (1/2m) Σ_ij ( A_ij − k_i·k_j / 2m ) δ(c_i, c_j)
Community membership vector s:
s_i = +1 if node i is in community 1, s_i = −1 if node i is in community 2
Note that (s_i·s_j + 1)/2 = 1 if s_i = s_j and 0 otherwise, so
Q(G, s) = (1/2m) Σ_ij ( A_ij − k_i·k_j / 2m ) · (s_i·s_j + 1)/2
        = (1/4m) Σ_ij ( A_ij − k_i·k_j / 2m ) s_i s_j
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 45
Define the modularity matrix B: B_ij = A_ij − k_i·k_j / 2m
Note: each row/column of B sums to 0, since Σ_j A_ij = k_i and Σ_j k_i·k_j / 2m = k_i.
Membership vector: s ∈ {−1, +1}^n
Then:
Q(G, s) = (1/4m) Σ_ij ( A_ij − k_i·k_j / 2m ) s_i s_j
        = (1/4m) Σ_ij B_ij s_i s_j = (1/4m) s^T B s
Task: find s ∈ {−1, +1}^n that maximizes Q(G, s)
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 46
Consider a symmetric matrix A that is positive semi-definite: x^T A x ≥ 0.
Then the solutions x_i, λ_i of the equation A x = λ x are:
Eigenvectors x_i, ordered by the magnitude of their corresponding eigenvalues λ_i (λ_1 ≤ λ_2 ≤ ... ≤ λ_n);
they are orthonormal (orthogonal and of unit length) and form a coordinate system (basis).
If A is positive semi-definite, λ_i ≥ 0 (and the eigenvalues always exist).
Eigen-decomposition theorem: the matrix can be rewritten in terms of its eigenvectors and eigenvalues: A = Σ_i λ_i x_i x_i^T
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 47
Rewrite Q(G, s) = (1/4m) s^T B s in terms of the eigenvectors and eigenvalues of B:
Q(G, s) = (1/4m) s^T ( Σ_i λ_i x_i x_i^T ) s = (1/4m) Σ_i (x_i^T s)^2 λ_i
So, if there were no other constraints on s, then to maximize Q we would set s = x_n,
the eigenvector associated with the largest eigenvalue λ_n.
Why? Because s has fixed length, this assigns all the weight in the sum to λ_n (the largest eigenvalue);
all other terms are zero because of orthonormality.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 48
Let's consider only the leading term in the summation (because λ_n is the largest):
max_s Q(G, s) ≈ (1/4m) (x_n^T s)^2 λ_n
So let's maximize x_n^T s = Σ_j x_n,j · s_j, where s_j ∈ {−1, +1}.
To do this, we set:
s_j = +1 if x_n,j ≥ 0 (the j-th coordinate of x_n is non-negative)
s_j = −1 if x_n,j < 0 (the j-th coordinate of x_n is negative)
Continue the bisection hierarchically.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 49
Fast Modularity Optimization Algorithm:
Find leading eigenvector of modularity matrix B
Divide the nodes by the signs of the elements of
Repeat hierarchically until:
If a proposed split does not cause modularity to increase,
declare community indivisible and do not split it
If all communities are indivisible, stop
How to find x_n? Power method!
Start with a random v(0) and repeat: v(t+1) = B·v(t) / ||B·v(t)||
When converged (v(t) ≈ v(t+1)), set x_n = v(t)

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 50
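A sketch of one spectral bisection step with numpy (our own function names; B is built explicitly for clarity, and we shift B so the power iteration reliably converges to the leading eigenvector, a standard trick not spelled out in the slides):

import numpy as np

def leading_eigenvector(B, iters=1000):
    # shift so all eigenvalues are non-negative; the ordering is preserved,
    # so the dominant eigenvector of the shifted matrix is x_n of B
    shift = np.abs(B).sum(axis=1).max()
    Bs = B + shift * np.eye(B.shape[0])
    v = np.random.rand(B.shape[0])
    for _ in range(iters):              # power iteration: v <- B v / ||B v||
        w = Bs @ v
        v = w / np.linalg.norm(w)
    return v

def spectral_split(A):
    """A: symmetric 0/1 adjacency matrix. Returns the +1/-1 membership vector s."""
    k = A.sum(axis=1)
    B = A - np.outer(k, k) / k.sum()    # modularity matrix, with 2m = k.sum()
    x_n = leading_eigenvector(B)
    return np.where(x_n >= 0, 1, -1)    # divide the nodes by the signs of x_n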
Girvan-Newman:
Based on the strength of weak ties
Remove edge of highest betweenness
Modularity:
Overall quality of the partitioning of a graph
Use to determine the number of communities
Fast modularity optimization:
Transform the modularity optimization into an eigenvalue problem

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 51
[Ron Burt]

Who is better off, Robert or James?


11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 53
Few structural holes Many structural holes

Structural Holes provide ego with access


to novel information, power, freedom
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 54
The network constraint measure [Burt]:
To what extent are a person's contacts redundant?
Low: disconnected contacts
High: contacts that are close or strongly tied
c_i = Σ_j c_ij = Σ_j ( p_ij + Σ_k p_ik·p_kj )^2 ,  k ≠ i, j
where p_ij is the proportion of i's energy (time) invested in the relationship with j.
Example matrix of p_ij for the small five-node network in the figure:
     1    2    3    4    5
1  .00  .25  .25  .25  .25
2  .50  .00  .00  .00  .50
3  1.0  .00  .00  .00  .00
4  .50  .00  .00  .00  .50
5  .33  .33  .00  .33  .00
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 55
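A direct implementation of the constraint formula above for an unweighted, undirected graph given as a dict of neighbour sets (a sketch; the function and variable names are our own). For such a graph, p_ij = 1/deg(i) when j is a contact of i, which reproduces the example matrix above:

def constraint(adj, i):
    def p(a, b):
        return 1.0 / len(adj[a]) if b in adj[a] else 0.0
    c_i = 0.0
    for j in adj[i]:
        # redundancy of contact j: direct investment plus indirect paths via q
        indirect = sum(p(i, q) * p(q, j) for q in adj[i] if q not in (i, j))
        c_i += (p(i, j) + indirect) ** 2
    return c_i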
Constraint: To what
extent are persons
contacts redundant
Low: disconnected
contacts
High: contacts that
are close or strongly
tied
Network constraint:
James: = 0.309
Robert: = 0.148
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 56
[Ron Burt]

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 57
Machine Learning and
Data Mining
Introduction

Albert Bifet(@abifet)
Data Science

Data Science is an interdisciplinary eld focused on


extracting knowledge or insights from large volumes
of data.
Data Scientist

Figure: http://www.marketingdistillery.com/2014/11/29/
is-data-science-a-buzzword-modern-data-scientist-defined/
Data Science

Figure: Drew Convays Venn diagram


Classication

Denition
Given nC different classes, a classier algorithm builds a model
that predicts for every unlabelled instance I the class C to which
it belongs with accuracy.

Example
A spam lter

Example
Twitter Sentiment analysis: analyze tweets with positive or
negative feelings
Classication
Example
Contains Domain Has Time
Data set that Money type attach. received spam
describes e-mail yes com yes night yes
features for yes edu no night yes
deciding if it is no com yes night yes
spam. no edu no day no
no com no day no
yes cat no day yes

Assume we have to classify the following new instance:


Contains Domain Has Time
Money type attach. received spam
yes edu yes day ?
k-Nearest Neighbours

k-NN Classier
Training: store all instances in memory
Prediction:
Find the k nearest instances
Output majority class of these k instances
Bayes Classiers
Nave Bayes

Based on Bayes Theorem:

P(c)P(d|c)
P(c|d) =
P(d)

prior likelikood
posterior =
evidence
Estimates the probability of observing attribute a and the
prior probability P(c)
Probability of class c given an instance d:

P(c) a2d P(a|c)


P(c|d) =
P(d)
Bayes Classiers

Multinomial Nave Bayes



Considers a document as a bag-of-words.
Estimates the probability of observing word w and the prior
probability P(c)
Probability of class c given a test document d:

P(c) w2d P(w|c)nwd


P(c|d) =
P(d)
Perceptron

Attribute w

Attribute w

Attribute w Output h~w (~xi )

Attribute w

Attribute w

Data stream: h~xi , yi i


Classical perceptron: h~w (~xi ) = sgn(~wT~xi ),
Minimize Mean-square error: J(~w) = (yi h~w (~xi ))
Perceptron
Attribute w

Attribute w

Attribute w Output h~w (~xi )

Attribute w

Attribute w

We use sigmoid function h~w = s (~wT~x) where

s (x) = /( + e x )

s 0 (x) = s (x)( s (x))


Perceptron

Minimize Mean-square error: J(~w) = (yi h~w (~xi ))


Stochastic Gradient Descent: ~w = ~w hJ~xi
Gradient of the error function:

J = (yi h~w (~xi ))h~w (~xi )


i

h~w (~xi ) = h~w (~xi )( h~w (~xi ))


Weight update rule

~w = ~w + h (yi h~w (~xi ))h~w (~xi )( h~w (~xi ))~xi


i
Restricted Boltzmann Machines (RBMs)
z1 z2 z3 z4

x1 x2 x3 x4 x5

Energy-based models, where

E(~x,~z)
P(~x,~z) e .

Manipulate a weight matrix W to nd low-energy states


and thus generate high probability P(~x,~z), where

E(~x,~z) = W.

RBMs can be stacked on top of each other to form


so-called Deep Belief Networks (DBNs)
Classication
Example
Contains Domain Has Time
Data set that Money type attach. received spam
describes e-mail yes com yes night yes
features for yes edu no night yes
deciding if it is no com yes night yes
spam. no edu no day no
no com no day no
yes cat no day yes

Assume we have to classify the following new instance:


Contains Domain Has Time
Money type attach. received spam
yes edu yes day ?
Classication

Assume we have to classify the following new instance:


Contains Domain Has Time
Money type attach. received spam
yes edu yes day ?

Time
Day Night
Contains Money YES

Yes No
YES NO
Decision Trees

Basic induction strategy:


A the best decision attribute for next node
Assign A as decision attribute for node
For each value of A, create new descendant of node
Sort training examples to leaf nodes
If training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
Bagging

Example
Dataset of instances: A, B, C, D

Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C

Bagging builds a set of M base models, with a bootstrap


sample created by drawing random samples with
replacement.
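A minimal sketch of the bootstrap step just described, using scikit-learn decision trees as base models (that choice, and the function names, are ours; the slide does not fix a base learner):

import random
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=5):
    models = []
    n = len(X)
    for _ in range(n_models):
        idx = [random.randrange(n) for _ in range(n)]   # sample with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        models.append(DecisionTreeClassifier().fit(Xb, yb))
    return models

def bagging_predict(models, x):
    votes = [m.predict([x])[0] for m in models]
    return max(set(votes), key=votes.count)             # majority vote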
Random Forests

Bagging
Random Trees: trees that in each node only uses a random
subset of the attributes
Random Forests is one of the most popular methods in
machine learning.
Boosting

The strength of Weak Learnability, Schapire

A boosting algorithm transforms a weak learner


into a strong one
Boosting

A formal description of Boosting (Schapire)

given a training set (x_1, y_1), . . . , (x_m, y_m)
y_i ∈ {−1, +1} is the correct label of instance x_i ∈ X
for t = 1, . . . , T
  construct distribution D_t
  find a weak classifier
  h_t : X → {−1, +1}
  with small error e_t = Pr_{D_t}[ h_t(x_i) ≠ y_i ] on D_t
output the final classifier
Boosting

AdaBoost
1: Initialize D_1(i) = 1/m for all i ∈ {1, 2, ..., m}
2: for t = 1, 2, ..., T do
3:   Call WeakLearn, providing it with distribution D_t
4:   Get back hypothesis h_t : X → Y
5:   Calculate the error of h_t: e_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i)
6:   Update the distribution:
     D_{t+1}(i) = D_t(i)/Z_t × { e_t/(1 − e_t) if h_t(x_i) = y_i ; 1 otherwise }
     where Z_t is a normalization constant (chosen so D_{t+1} is a probability distribution)
7: return h_fin(x) = arg max_{y∈Y} Σ_{t: h_t(x)=y} log (1 − e_t)/e_t
Boosting

AdaBoost
: Initialize D (i) = /m for all i 2 {, , ..., m}
: for t = ,,...T do
: Call WeakLearn, providing it with distribution Dt
: Get back hypothesis ht : X ! Y
: Calculate error of ht : et = i:ht (xi )6=yi Dt (i)
6: Update distribution
et if ht (xi ) = yi
Dt : Dt+ (i) = DZt (i)
t et otherwise
where Zt is a normalization constant (chosen so Dt+ is a
probability distribution)
: return hn (x) = arg maxy2Y t:ht (x)=y log et /( et )
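A sketch of AdaBoost with decision stumps as the weak learner (the stumps are our choice; the pseudocode only assumes some WeakLearn). Labels are assumed to be in {−1, +1}, and the re-weighting uses the equivalent exponential form rather than the e_t/(1 − e_t) form shown above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    X, y = np.asarray(X, float), np.asarray(y)
    D = np.full(len(X), 1.0 / len(X))             # D_1(i) = 1/m
    hypotheses = []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        err = D[pred != y].sum()                  # e_t
        if err == 0 or err >= 0.5:                # stop on a perfect or useless stump
            break
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this hypothesis
        D *= np.exp(-alpha * y * pred)            # re-weight the instances
        D /= D.sum()                              # Z_t normalisation
        hypotheses.append((alpha, h))
    return hypotheses

def adaboost_predict(hypotheses, X):
    X = np.asarray(X, float)
    score = sum(alpha * h.predict(X) for alpha, h in hypotheses)
    return np.sign(score)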
Stacking

Use a classier to combine predictions of base classiers


Example
Use a perceptron to do stacking
Use decision trees as base classiers
Clustering

Denition
Clustering is the distribution of a set of instances of examples
into non-known groups according to some common relations or
afnities.

Example
Market segmentation of customers

Example
Social network communities
Clustering

Denition
Given
a set of instances I
a number of clusters K
an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
f : I ! {, . . . , K}
that minimizes the objective function cost(I)
Clustering
Denition
Given
a set of instances I
a number of clusters K
an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function

cost(C, I) = d (x, C)
x2I

where
d(x, c): distance function between x and c
d (x, C) = minc2C d (x, c): distance from x to the nearest
point in C
k-means

. Choose k initial centers C = {c , . . . , ck }


. while stopping criterion has not been met
For i = , . . . , N
nd closest center ck 2 C to each instance pi
assign instance pi to cluster Ck
For k = , . . . , K
set ck to be the center of mass of all points in Ci
k-means++

. Choose a initial center c


For k = , . . . , K
select ck = p 2 I with probability d (p, C)/cost(C, I)
. while stopping criterion has not been met
For i = , . . . , N
nd closest center ck 2 C to each instance pi
assign instance pi to cluster Ck
For k = , . . . , K
set ck to be the center of mass of all points in Ci
Performance Measures

Internal Measures
Sum square distance
Dunn index D = ddmin
max

C-Index C = S S Smin
max S min

External Measures
Rand Measure
F Measure
Jaccard
Purity
Density based methods
DBSCAN
e-neighborhood(p): set of points that are at a distance of p
less or equal to e
Core object: object whose e-neighborhood has an overall
weight at least
A point p is directly density-reachable from q if
p is in e-neighborhood(q)
q is a core object
A point p is density-reachable from q if
there is a chain of points p , . . . , pn such that pi+ is directly
density-reachable from pi
A point p is density-connected from q if
there is point o such that p and q are density-reachable
from o
Density based methods

DBSCAN
A cluster C of points satises
if p 2 C and q is density-reachable from p, then q 2 C
all points p, q 2 C are density-connected
A cluster is uniquely determined by any of its core points
A cluster can be obtained
choosing an arbitrary core point as a seed
retrieve all points that are density-reachable from the seed
DBSCAN

Figure: DBSCAN Point Example with =


Density based methods

DBSCAN
select an arbitrary point p
retrieve all points density-reachable from p
if p is a core point, a cluster is formed
If p is a border point
no points are density-reachable from p
DBSCAN visits the next point of the database
Continue the process until all of the points have been
processed
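A short usage sketch with scikit-learn's DBSCAN, where the parameters eps and min_samples play the roles of epsilon and the core-object weight threshold above (the toy data is our own):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],      # a dense group
              [8.0, 8.1], [8.2, 7.9], [8.1, 8.0],      # another dense group
              [4.5, 4.5]])                              # an isolated point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)        # e.g. [0 0 0 1 1 1 -1]; -1 marks noise points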
Frequent Patterns

Suppose D is a dataset of patterns, t ∈ D, and min_sup is a constant.

Definition: Support(t) is the number of patterns in D that are superpatterns of t.
Definition: Pattern t is frequent if Support(t) ≥ min_sup.

Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns in D.
Pattern Mining

Dataset Example
Document Patterns
d abce
d cde
d abce
d acde
d abcde
d6 bcd
Itemset Mining

Support Frequent
d abce d,d,d,d,d,d6 c
d cde
d,d,d,d,d e,ce
d abce
d,d,d,d a,ac,ae,ace
d acde
d,d,d,d6 b,bc
d abcde
d,d,d,d6 d,cd
d6 bcd
d,d,d ab,abc,abe
be,bce,abce
d,d,d de,cde

minimal support =
Itemset Mining

Support Frequent
d abce 6 c
d cde
e,ce
d abce
a,ac,ae,ace
d acde
b,bc
d abcde
d,cd
d6 bcd
ab,abc,abe
be,bce,abce
de,cde
Itemset Mining

Support Frequent Gen Closed


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce
de,cde de cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
e ! ce be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
a ! ace be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Closed Patterns

Usually, there are too many frequent patterns. We can compute


a smaller set, while keeping the same information.
Example
A set of items, has subsets, that is more
than the number of atoms in the universe
Closed Patterns

A priori property
If t0 is a subpattern of t, then Support (t0 ) Support (t).

Denition
A frequent pattern t is closed if none of its proper superpatterns
has the same support as it has.

Frequent subpatterns and their supports can be generated from


closed patterns.
Maximal Patterns

Denition
A frequent pattern t is maximal if none of its proper
superpatterns is frequent.

Frequent subpatterns can be generated from maximal patterns,


but not with their support.
All maximal patterns are closed, but not all closed patterns are
maximal.
Non streaming frequent itemset miners
Representation:
Horizontal layout
T: a, b, c
T: b, c, e
T: b, d, e
Vertical layout
a:
b:
c:

Search:
Breadth-rst (levelwise): Apriori
Depth-rst: Eclat, FP-Growth
The Apriori Algorithm

APRIORI
1  Initialize the item set size k = 1
2  Start with single element sets
3  Prune the non-frequent ones
4  while there are frequent item sets
5    do create candidates with one item more
6    Prune the non-frequent ones
7    Increment the item set size k = k + 1
8  Output: the frequent item sets
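A compact sketch of this levelwise search for itemset data, where transactions are sets of items and min_sup is an absolute count (function and variable names are ours):

def apriori(transactions, min_sup):
    transactions = [set(t) for t in transactions]
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items]                     # single element sets
    level = [s for s in level if support(s) >= min_sup]         # prune non-frequent
    frequent = []
    while level:
        frequent.extend(level)
        # candidates with one item more, built by joining the current level
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_sup]
    return frequent

# Example on the dataset used earlier (d1..d6), with minimal support 3:
docs = ["abce", "cde", "abce", "acde", "abcde", "bcd"]
print(sorted("".join(sorted(s)) for s in apriori(docs, 3)))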


The Eclat Algorithm

Depth-First Search
divide-and-conquer scheme : the problem is processed by
splitting it into smaller subproblems, which are then
processed recursively
conditional database for the prefix a:
transactions that contain a
conditional database for item sets without a:
transactions that do not contain a

Vertical representation
Support counting is done by intersecting lists of
transaction identiers
The FP-Growth Algorithm

Depth-First Search
divide-and-conquer scheme : the problem is processed by
splitting it into smaller subproblems, which are then
processed recursively
conditional database for the prefix a:
transactions that contain a
conditional database for item sets without a:
transactions that do not contain a
Vertical and Horizontal representation : FP-Tree
prex tree with links between nodes that correspond to the
same item
Support counting is done using FP-Tree
Mining Graph Data
Problem
Given a data set D of graphs, nd frequent graphs.

Transaction Id Graph
O
C C S N
O
O
C C S N
C
N
C C S N
The gSpan Algorithm

GSPAN(g, D, min_sup, S)

Input: A graph g, a graph dataset D, min_sup.
Output: The frequent graph set S.

1  if g ≠ min(g)
2    then return S
3  insert g into S
4  update support counter structure
5  C ← ∅
6  for each g' that can be right-most extended from g in one step
7    do if support(g') ≥ min_sup
8      then insert g' into C
9  for each g' in C
10   do S ← GSPAN(g', D, min_sup, S)
11 return S
Machine Learning and
Data Mining
Data Preprocessing

Albert Bifet(@abifet)
Data Basics
Machine Learning/Data Mining Applications

Business Analytics
Is this costumer credit-worthy?
Is a costumer willing to respond to an email?
Do costumers divide in similar groups?
How much a costumer is going to spend next semester?
World Wide Web
Financial Analytics
Internet of Things
Image Recognition, Speech
..
The Data Mining Process

Data collection
Data Preprocesing
Feature extraction
Data cleaning
Feature selection and transformation
Analytical processing and algorithms
Data Postprocesing
Multidimensional Data

Example:
Competitor Name Swim Cycle Run Total
John T : : 8: :
Norman P 8: : : :
Alex K : 8: n/a n/a
Sarah H : : : :
Table: Triathlon results

Example or Instance
data point, transaction, entity, tuple, object, or feature-vector
Attribute or Feature
eld, dimension
Instance Types

Dense
red, white, Barcelona, , up
red, red, Barcelona, , down
black, white, Paris, , up
red, green, Paris, , down
Sparse
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
Attribute Type

Numerical
, , ., ., .
Categorical or Discrete
+, -
red, green, black
yes, no
up, down
Barcelona, Paris, London, New York
Text Data: vector-space representation
The cat is black
Binary: Categorical or Numerical
Analytical processing and algorithms

Attribute/Column Relationships
Classication : predict value of a discrete attribute
Regression: predict value of a numeric attribute
Instance/Row Relationships
Clustering: determine subsets of rows, in which the values
in the corresponding columns are similar
Outlier Detection: determine the rows that are very different
from the other rows
Big Data Scalability

Distributed Systems:
Hardware: Hadoop cluster
Software: MapReduce, Spark, Flink, Storm
Streaming Algorithms
Single pass over the data
Concept Drift
Data Preparation
The Data Mining Process

Data collection
Data Preprocesing
Feature extraction
Data cleaning
Feature selection and transformation
Analytical processing and algorithms
Data Postprocesing
Feature Extraction

Sensor data: wavelets or Fourier Transforms


Image Data: histograms or visual words
Web logs: multidimensional data
Network trafc: specic features as network protocol,
bytes transferred
Text Data: remove stop words, stem data,
multidimensional data
Feature Conversion

Numeric to Discrete
Equi-width ranges
Equi-log ranges
Equi-depth ranges
Discrete to Numeric
Binarization: one numeric attribute for each value
Text to Numeric
remove stop words, stem data, tf-idf, multidimensional data
Time Series to Discrete Sequence Data
SAX: equi-depth discretization after window-based
averaging
Time Series to Numeric Data
Discrete Wavelet Transform
Discrete Fourier Transform
Term Frequency-Inverse Document Frequency
Term frequency
Boolean frequencies:
tf(t, d) = 1 if t occurs in d and 0 otherwise
Logarithmically scaled frequency:
tf(t, d) = 1 + log f_{t,d}, or zero if f_{t,d} is zero
Augmented frequency:
tf(t, d) = 0.5 + 0.5 · f_{t,d} / max{ f_{t',d} : t' ∈ d }

Inverse document frequency
idf(t, D) = log ( N / |{ d ∈ D : t ∈ d }| )

Term frequency-inverse document frequency
tfidf(t, d, D) = tf(t, d) · idf(t, D)
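A small sketch of these formulas, using the logarithmically scaled term frequency and natural logarithms (those choices, and the names, are ours):

import math

def tf(t, d):
    f = d.count(t)
    return 1 + math.log(f) if f > 0 else 0.0

def idf(t, D):
    n_t = sum(1 for d in D if t in d)
    return math.log(len(D) / n_t) if n_t else 0.0

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

D = [["the", "cat", "is", "black"], ["the", "dog", "barks"], ["black", "dog"]]
print(tfidf("black", D[0], D))      # 1 * log(3/2)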


Data Cleaning
Handling missing entries
Eliminate entries with a missing value
Estimate missing values
Algorithms can handle missing values
Handling incorrect entries
Duplicate detection and inconsistency detection
Domain knowledge
Data-centric methods
Scaling and normalization
Standardization: for instance i, attribute j:
z_i^j = ( x_i^j − μ_j ) / σ_j
Normalization:
y_i^j = ( x_i^j − min_j ) / ( max_j − min_j )
Feature selection and transformation

Sampling for Static Data


Sampling with Replacement
Sampling without Replacement: no duplicates
Biased Sampling
Stratied Sampling
Reservoir Sampling for Data Streams
Given a data stream, choose k items with the same
probability, storing only k elements in memory.
RESERVOIR SAMPLING (Algorithm R)

1  for every item i in the first k items of the stream
2    do store item i in the reservoir
3  n = k
4  for every item i in the stream after the first k items of the stream
5    do select a random number r between 1 and n
6    if r < k
7      then replace item r in the reservoir with item i
8    n = n + 1

Figure: Algorithm RESERVOIR SAMPLING
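A sketch of Algorithm R in Python, using 0-based indexing so each of the first n items ends up in the reservoir with probability exactly k/n (the names are ours; the stream can be any iterable):

import random

def reservoir_sample(stream, k):
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)              # store the first k items
        else:
            r = random.randrange(n)             # uniform in {0, ..., n-1}
            if r < k:
                reservoir[r] = item             # replace with probability k/n
    return reservoir

print(reservoir_sample(range(10000), 5))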
Feature selection and transformation

Feature Subset Selection


Supervised feature selection
Unsupervised feature selection
Biased Sampling
Stratied Sampling
Dimensionality reduction with axis rotation
Principal Component Analysis
Singular Value Decomposition
Latent Semantic Analysis
Principal Component Analysis

Goal: Principal component analysis computes the most meaningful basis to re-express a noisy, garbled data set.
The hope is that this new basis will filter out the noise and reveal hidden dynamics.

Normalize Input Data


Compute k orthonormal vectors to have a basis for the
normalized data
Sort these principal components
Eliminate components with low variance
Principal Component Analysis

Organize the data set X as an m × n matrix, where m is the number of features and n is the number of instances.
Normalize the input data: subtract off the mean of each feature from every instance x_i.
Calculate the SVD or the eigenvectors of the covariance matrix.
Find some orthonormal matrix P, with Y = PX, such that
S_Y = (1/(n − 1)) · Y·Y^T
is diagonalized.
The rows of P are the principal components of X.
Sort these principal components
Eliminate components with low variance
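A sketch of these steps with numpy, keeping the slide's convention that X has one column per instance (m features × n instances); the function and variable names are ours:

import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=1, keepdims=True)        # subtract the mean of each feature
    cov = Xc @ Xc.T / (X.shape[1] - 1)            # covariance matrix S_X
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]             # sort components by variance
    P = eigvecs[:, order[:k]].T                   # principal components as rows
    return P, P @ Xc                              # components and projected data

X = np.random.randn(5, 200)                       # 5 features, 200 instances
P, Y = pca(X, 2)
print(P.shape, Y.shape)                           # (2, 5) (2, 200)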
Clustering, classification and
evaluation

Mostafa H. Chehreghani

Mostafa.chehreghani@gmail.com
Clustering

Albert Bifet (@abifet)

Paris, 18 October 2015


albert.bifet@telecom-paristech.fr
Clustering

Definition
Clustering is the distribution of a set of instances of examples
into non-known groups according to some common relations or
affinities.

Example
Market segmentation of customers

Example
Social network communities
Clustering

Definition
Given
I a set of instances I
I a number of clusters K
I an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
f : I {1, . . . , K }
that minimizes the objective function cost(I)
Clustering
Definition
Given
I a set of instances I
I a number of clusters K
I an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function
X
cost(C, I) = d 2 (x, C)
xI

where
I d(x, c): distance function between x and c
I d 2 (x, C) = mincC d 2 (x, c): distance from x to the nearest
point in C
k-means

I 1. Choose k initial centers C = {c1 , . . . , ck }


I 2. while stopping criterion has not been met
I For i = 1, . . . , N
I find closest center ck C to each instance pi
I assign instance pi to cluster Ck
I For k = 1, . . . , K
I set ck to be the center of mass of all points in Ci
k-means++

I 1. Choose a initial center c1


I For k = 2, . . . , K
I select ck = p I with probability d 2 (p, C)/cost(C, I)
I 2. while stopping criterion has not been met
I For i = 1, . . . , N
I find closest center ck C to each instance pi
I assign instance pi to cluster Ck
I For k = 1, . . . , K
I set ck to be the center of mass of all points in Ci
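A sketch of the k-means++ seeding step described above, in plain Python (our own names); points are tuples, and each new center is drawn with probability d^2(p, C)/cost(C, I):

import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_init(points, K):
    centers = [random.choice(points)]                         # 1. first center at random
    while len(centers) < K:
        d2 = [min(dist2(p, c) for c in centers) for p in points]
        total = sum(d2)
        r, acc = random.uniform(0, total), 0.0
        for p, w in zip(points, d2):                          # roulette-wheel selection
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers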
Performance Measures
Internal Measures
I Cluster Cohesion: Measures how closely related are
objects in a cluster
I Cluster Separation: Measure how distinct or well
separated a cluster is from other clusters
I Silhouette Coefficient: s = 1 − a/b if a < b
I a = average distance of i to the points in its cluster
I b = min (average distance of i to points in another cluster)

External Measures
I Rand Measure
I F Measure
I Jaccard
I Purity
Distances

Numeric features
I Euclidean:
X
d(x, y ) = ||x y ||2 = (xi yi )2

I Manhattan distance:
X
d(x, y) = ||x y||1 = |xi yi |
Density based methods

DBSCAN
I -neighborhood(p): set of points that are at a distance of p
less or equal to 
I Core object: object whose -neighborhood has an overall
weight at least
I A point p is directly density-reachable from q if
I p is in -neighborhood(q)
I q is a core object
I A point p is density-reachable from q if
I there is a chain of points p1 , . . . , pn such that pi+1 is directly
density-reachable from pi
I A point p is density-connected from q if
I there is point o such that p and q are density-reachable
from o
Density based methods

DBSCAN
I A cluster C of points satisfies
I if p C and q is density-reachable from p, then q C
I all points p, q C are density-connected
I A cluster is uniquely determined by any of its core points
I A cluster can be obtained
I choosing an arbitrary core point as a seed
I retrieve all points that are density-reachable from the seed
Density based methods

DBSCAN
I select an arbitrary point p
I retrieve all points density-reachable from p
I if p is a core point, a cluster is formed
I If p is a border point
I no points are density-reachable from p
I DBSCAN visits the next point of the database
I Continue the process until all of the points have been
processed
DBSCAN

Figure: DBSCAN Point Example with =3


BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES
I Clustering Features CF = (N, LS, SS)
I N: number of data points
I LS: linear sum of the N data points
I SS: square sum of the N data points
I Properties:
I Additivity: CF1 + CF2 = (N1 + N2 , LS1 + LS2 , SS1 + SS2 )
I Easy to compute: average inter-cluster distance
and average intra-cluster distance
I Uses CF tree
I Height-balanced tree with two parameters
I B: branching factor
I T: radius leaf threshold
BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Phase 1: Scan all data and build an initial in-memory CF


tree
Phase 2: Condense into desirable range by building a
smaller CF tree (optional)
Phase 3: Global clustering
Phase 4: Cluster refining (optional and off line, as requires
more passes)
BIRCH:
Balanced Iterative Reducing and Clustering using
Hierarchies

Tian Zhang, Raghu Ramakrishnan, Miron Livny

Presented by Zhao Li
2009, Spring
Introduction to BIRCH

Designed for very large data sets
Time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of data is necessary
Does not need the whole data set in advance

Two key phases:
Scans the database to build an in-memory tree
Applies clustering algorithm to cluster the leaf nodes

September 1, 2017 2
Similarity Metric(1)

Given a cluster of instances , we define:


Centroid:

Radius: average distance from member points to centroid

Diameter: average pair-wise distance within a cluster

September 1, 2017 3
Similarity Metric(2)

centroid Euclidean distance:


centroid Manhattan distance:

average inter-cluster:
average intra-cluster:
variance increase:

September 1, 2017 4
Clustering Feature

The Birch algorithm builds a dendrogram called clustering
feature tree (CF tree) while scanning the data set.

Each entry in the CF tree represents a cluster of objects
and is characterized by a 3-tuple: (N, LS, SS), where N is
the number of objects in the cluster and LS, SS are defined
in the following.

September 1, 2017 5
Properties of Clustering Feature

CF entry is more compact
Stores significantly less than all of the data points in
the sub-cluster

A CF entry has sufficient information to calculate
D0-D4

Additivity theorem allows us to merge sub-clusters
incrementally & consistently

September 1, 2017 6
CF-Tree

Each non-leaf node has at


most B entries
Each leaf node has at
most L CF entries,
each of which satisfies
threshold T

September 1, 2017 7
CF-Tree Insertion

Recurse down from root
Find the appropriate leaf
Follow the "closest"-CF path, w.r.t. D0 / / D4

Modify the leaf
If the closest-CF leaf cannot absorb, make a new CF
entry. If there is no room for new leaf, split the parent
node

Traverse back
Update CFs on the path or splitting nodes
September 1, 2017 8
CF-Tree Rebuilding

If we run out of space, increase threshold T
By increasing the threshold, CFs absorb more data

Rebuilding "pushes" CFs over
The larger T allows different CFs to group together

Reducibility theorem
Increasing T will result in a CF-tree smaller than the
original

September 1, 2017 9
Example of BIRCH
(Figure: CF-tree example — a root with entries LN1, LN2, LN3; the leaves hold subclusters sc1-sc7, and a new subcluster sc8 arrives at leaf LN1.)

September 1, 2017 10
Insertion Operation in BIRCH
If the branching factor of a leaf node can not exceed 3, then LN1 is split.

(Figure: the leaf LN1 is split and the root gains an additional leaf entry.)
September 1, 2017 11
If the branching factor of a non-leaf node can not
exceed 3, then the root is split and the height of
the CF Tree increases by one.

(Figure: the root is split into two non-leaf nodes NLN1 and NLN2, and the CF tree grows by one level.)

September 1, 2017 12
BIRCH Overview

September 1, 2017 13
Experimental Results

Input parameters:
Memory (M): 5% of data set
Disk space (R): 20% of M
Distance equation: D2
Quality equation: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
September 1, 2017 14
Experimental Results

KMEANS clustering:
DS   Time   D     # Scan      DS    Time   D     # Scan
1    43.9   2.09  289         1o    33.8   1.97   197
2    13.2   4.43  51          2o    12.7   4.20   29
3    32.9   3.66  187         3o    36.0   4.35   241

BIRCH clustering:
DS   Time   D     # Scan      DS    Time   D     # Scan
1    11.5   1.87  2           1o    13.6   1.87   2
2    10.7   1.99  2           2o    12.1   1.99   2

September 1, 2017 15
Exam Questions

What is the main limitation of BIRCH?
Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node doesn't always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH doesn't perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

September 1, 2017 16
Classification Evaluation

Albert Bifet (@abifet)

Paris, 27 September 2016


albert.bifet@telecom-paristech.fr
Evaluation

1. Error estimation: Hold-out or Cross-Validation


2. Evaluation performance measures: Accuracy or κ-statistic
3. Statistical significance validation: MacNemar or Nemenyi test

Evaluation Framework
Error Estimation

Data available for testing


I Holdout an independent test set
I Apply the current decision model to the test set
I The loss estimated in the holdout is an unbiased estimator

Holdout Evaluation
1. Error Estimation

Not enough data available for testing


I Divide dataset in 10 folds
I Repeat 10 times: use one fold for testing and the rest for
training

k-fold Cross-validation
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example

I Precision = tp / (tp + fp)
I Recall = tp / (tp + fn)
I F1 = 2 · (precision · recall) / (precision + recall)
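A tiny sketch of these measures for the confusion matrix shown earlier (tp=75, fn=8, fp=7, tn=10); the function name is ours:

def prf(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf(75, 8, 7, 10))   # approx. (0.915, 0.904, 0.909)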
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example

I Accuracy = 75/100 + 10/100 = (75/83)·(83/100) + (10/17)·(17/100) = 85%
I Arithmetic mean = ( 75/83 + 10/17 ) / 2 = 74.59%
I Geometric mean = sqrt( (75/83) · (10/17) ) = 72.90%
2. Performance Measures with Unbalanced Classes
Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example

Predicted Predicted
Class+ Class- Total
Correct Class+ 68.06 14.94 83
Correct Class- 13.94 3.06 17
Total 82 18 100
Table: Confusion matrix for chance predictor
2. Performance Measures with Unbalanced Classes
Kappa Statistic
I p0: classifier's prequential accuracy
I pc: probability that a chance classifier makes a correct prediction
I κ statistic:
  κ = (p0 − pc) / (1 − pc)
I κ = 1 if the classifier is always correct
I κ = 0 if the predictions coincide with the correct ones as often as those of the chance classifier

Matthews correlation coefficient (MCC)
MCC = (tp · tn − fp · fn) / sqrt( (tp + fp)(tp + fn)(tn + fp)(tn + fn) )
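Both statistics computed from a 2x2 confusion matrix, with the chance-agreement term pc taken from the marginals as in the chance-predictor table above (a sketch; the names are ours):

import math

def kappa(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    p0 = (tp + tn) / n
    pc = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return (p0 - pc) / (1 - pc)

def mcc(tp, fn, fp, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(kappa(75, 8, 7, 10), 3), round(mcc(75, 8, 7, 10), 3))   # both approx. 0.481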
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example

AUC Area under the curve


A ROC space is defined by FPR and TPR (recall):
I FPR = fp / (fp + tn)
I TPR = tp / (tp + fn)
3. Statistical significance validation (2 Classifiers)
Classifier A Classifier A
Class+ Class- Total
Classifier B Class+ c a c+a
Classifier B Class- b d b+d
Total c+b a+d a+b+c+d

M = ( |a − b| − 1 )^2 / (a + b)
The test follows the χ2 distribution. At 0.99 confidence it rejects
the null hypothesis (the performances are equal) if M > 6.635.

McNemar test
3. Statistical significance validation (> 2 Classifiers)
Two classifiers are performing differently if the corresponding
average ranks differ by at least the critical difference
CD = q_α · sqrt( k(k + 1) / 6N )

I k is the number of learners, N is the number of datasets,
I critical values q_α are based on the Studentized range statistic divided by sqrt(2).

Nemenyi test
3. Statistical significance validation (> 2 Classifiers)
Two classifiers are performing differently if the corresponding
average ranks differ by at least the critical difference
CD = q_α · sqrt( k(k + 1) / 6N )

I k is the number of learners, N is the number of datasets,
I critical values q_α are based on the Studentized range statistic divided by sqrt(2).

# classifiers 2 3 4 5 6 7
q0.05 1.960 2.343 2.569 2.728 2.850 2.949
q0.10 1.645 2.052 2.291 2.459 2.589 2.693
Table: Critical values for the Nemenyi test
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
How to organize the Web?
First try: Human curated
Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval attempts to
find relevant docs in a small
and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents,
random things, web spam, etc.
So we need a good way to rank webpages!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
2 challenges of web search:
(1) Web contains many sources of information
Who to trust?
Insight: Trustworthy pages may point to each other!
(2) What is the best answer to query
newspaper?
No single right answer
Insight: Pages that actually know about newspapers
might all be pointing to many newspapers

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
All web pages are not equally important
www.joe-schmoe.com vs. www.stanford.edu

We already know:
There is large diversity
in the web-graph vs.
node connectivity.
So, lets rank the pages
using the web graph
link structure!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4
We will cover the following Link Analysis
approaches to computing importance of
nodes in a graph:
Hubs and Authorities (HITS)
Page Rank
Random Walk with Restarts

Sidenote: Various notions of node centrality: Node


Degree centrality = degree of
Betweenness centrality = #shortest paths passing through
Closeness centrality = avg. length of shortest paths from to
all other nodes of the network
Eigenvector centrality = like PageRank
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5
Goal (back to the newspaper example):
Dont just find newspapers. Find experts pages that
link in a coordinated way to good newspapers
Idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Hubs and Authorities NYT: 10

Each page has 2 scores: Ebay: 3


Quality as an expert (hub):
Yahoo: 3
Total sum of votes of pages pointed to
Quality as a content provider (authority): CNN: 8

Total sum of votes of experts WSJ: 9


Principle of repeated improvement
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 7
Interesting pages fall into two classes:
1. Authorities are pages containing
useful information
Newspaper home pages
Course home pages
Home pages of auto manufacturers

2. Hubs are pages that link to authorities


List of newspapers NYT: 10
Ebay: 3
Course bulletin Yahoo: 3
List of U.S. auto manufacturers CNN: 8
WSJ: 9

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8
Each page starts with hub score 1
Authorities collect their votes

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and the authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9
Hubs collect authority scores

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10
Authorities collect hub scores

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11
A good hub links to many good authorities
A good authority is linked from many good
hubs
Note a self-reinforcing recursive definition

Model using two scores for each node:


Hub score and Authority score
Represented as vectors and , where the i-th
element is the hub/authority score of the i-th node

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 12
[Kleinberg '98]

Each page i has 2 scores:
Authority score: a_i
Hub score: h_i

HITS algorithm:
Initialize: a_j^(0) = 1/sqrt(n), h_j^(0) = 1/sqrt(n)
Then keep iterating until convergence:
∀i: Authority: a_i^(t+1) = Σ_{j→i} h_j^(t)
∀i: Hub: h_i^(t+1) = Σ_{i→j} a_j^(t)
∀i: Normalize: Σ_i (a_i^(t+1))^2 = 1, Σ_j (h_j^(t+1))^2 = 1
Convergence criteria:
Σ_i (a_i^(t) − a_i^(t−1))^2 < ε and Σ_i (h_i^(t) − h_i^(t−1))^2 < ε
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 13
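A sketch of the iteration above with numpy (our own names); A is a 0/1 adjacency matrix with A[i, j] = 1 when page i links to page j, and scores are L2-normalized each round as in the slide:

import numpy as np

def hits(A, eps=1e-8, max_iter=100):
    n = A.shape[0]
    a = np.ones(n) / np.sqrt(n)
    h = np.ones(n) / np.sqrt(n)
    for _ in range(max_iter):
        a_new = A.T @ h                  # authorities collect hub scores
        h_new = A @ a_new                # hubs collect authority scores
        a_new /= np.linalg.norm(a_new)
        h_new /= np.linalg.norm(h_new)
        if np.linalg.norm(a_new - a) < eps and np.linalg.norm(h_new - h) < eps:
            return a_new, h_new
        a, h = a_new, h_new
    return a, h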
[Kleinberg '98]
Details!

HITS in vector notation:
Vectors a = (a_1, ..., a_n) and h = (h_1, ..., h_n)
Adjacency matrix A (n × n): A_ij = 1 if i→j, 0 otherwise
We can rewrite h_i = Σ_{i→j} a_j as h_i = Σ_j A_ij · a_j
So: h = A · a, and similarly: a = A^T · h
Repeat until convergence:
h^(t+1) = A · a^(t)
a^(t+1) = A^T · h^(t)
Normalize a^(t+1) and h^(t+1)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 14
Details!

What is a = A^T (A · a)?
a is updated (in 2 steps):
a = A^T (A a) = (A^T A) a
h is updated (in 2 steps):
h = A (A^T h) = (A A^T) h
Thus, after k rounds:
a = (A^T A)^k · a
h = (A A^T)^k · h
Repeated matrix powering
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 15
Definition: Eigenvectors & Eigenvalues
Let A · x = λ · x for some scalar λ, vector x and matrix A.
Then x is an eigenvector and λ is its eigenvalue.
The steady state (HITS has converged) is:
a = c · (A^T A) · a,  h = c' · (A A^T) · h
(the constants c, c' don't matter, as we normalize them out at every step of HITS)
So, the authority vector a is an eigenvector of A^T A
(associated with the largest eigenvalue),
and similarly the hub vector h is an eigenvector of A A^T.
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 16
Still the same idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes:
www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link

Are all in-links equal?


Links from important pages count more
Recursive question!

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 18
A vote from an important page is worth more:
Each link's vote is proportional to the importance of its source page.
If page i with importance r_i has d_i out-links, each link gets r_i / d_i votes.
Page j's own importance r_j is the sum of the votes on its in-links.
(Example in the figure: r_j = r_i / 3 + r_k / 4.)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 19
A page is important if it is pointed to by other important pages.
Define a rank r_j for node j:
r_j = Σ_{i→j} r_i / d_i, where d_i is the out-degree of node i.
Example ("the web in 1839", three pages y, a, m), flow equations:
r_y = r_y / 2 + r_a / 2
r_a = r_y / 2 + r_m
r_m = r_a / 2
You might wonder: let's just use Gaussian elimination to solve this system of linear equations. Bad idea!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20
Stochastic adjacency matrix M:
Let page i have d_i out-links.
If i→j, then M_ji = 1 / d_i.
M is a column stochastic matrix: columns sum to 1.
Rank vector r: an entry per page,
r_i is the importance score of page i, and Σ_i r_i = 1.
The flow equations can be written as r = M · r, i.e.
r_j = Σ_{i→j} r_i / d_i
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21
i1 i2 i3
Imagine a random web surfer:
At any time , surfer is on some page
At time + , the surfer follows an j
out-link from uniformly at random ri
rj =
Ends up on some page linked from i j d out (i)
Process repeats indefinitely
Let:
() vector whose th coordinate is the
prob. that the surfer is at page at time
So, () is a probability distribution over pages

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 22
Where is the surfer at time t+1?
The surfer follows a link uniformly at random:
p(t + 1) = M · p(t)
Suppose the random walk reaches a state where
p(t + 1) = M · p(t) = p(t);
then p(t) is a stationary distribution of the random walk.
Our original rank vector r satisfies r = M · r,
so r is a stationary distribution for the random walk.
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 23
Given a web graph with n nodes, where the nodes are pages and the edges are hyperlinks:
Assign each node an initial page rank.
Repeat until convergence ( Σ_i |r_i^(t+1) − r_i^(t)| < ε ):
Calculate the page rank of each node:
r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
where d_i is the out-degree of node i.
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25
y a m
Power Iteration: y
y 0
Set ' 1/N a 0 1
]^
a m m 0 0
1: ' 5' _
^
ry = ry /2 + ra /2
2: ra = ry /2 + rm
If | | > : goto 1 rm = ra /2

Example:
ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 6/15
rm 1/3 1/6 3/12 1/6 3/15
Iteration 0, 1, 2,

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26
y a m
Power Iteration: y
y 0
Set ' 1/N a 0 1
]^
a m m 0 0
1: ' 5' _
^
ry = ry /2 + ra /2
2: ra = ry /2 + rm
If | | > : goto 1 rm = ra /2

Example:
ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 6/15
rm 1/3 1/6 3/12 1/6 3/15
Iteration 0, 1, 2,

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27
(t )
( t +1) ri
rj = or
r = Mr
i j di
equivalently

Does this converge?

Does it converge to what we want?

Are results reasonable?

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 28
The Spider trap problem:

(t )
( t +1) ri
a b rj =
i j di
Example:
Iteration: 0, 1, 2, 3
ra 1 0 0 0
=
rb 0 1 1 1

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 29
The Dead end problem:

(t )
( t +1) ri
a b rj =
i j di

Example:
Iteration: 0, 1, 2, 3
ra 1 0 0 0
rb = 0 1 0 0

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 30
2 problems:
(1) Some pages are
dead ends (have no out-links)
Such pages cause
importance to leak out

(2) Spider traps


(all out-links are within the group)
Eventually spider traps absorb all importance

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31
y a m
Power Iteration: y
y 0
8
Set ' = a 0 0
c a m m 0 1
]^
' = 5' _
^ ry = ry /2 + ra /2
And iterate ra = ry /2
rm = ra /2 + rm
Example:
ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 0
rm 1/3 3/6 7/12 16/24 1
Iteration 0, 1, 2,

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32
The Google solution for spider traps: At each
time step, the random surfer has two options
With prob. , follow a link at random
With prob. 1-, jump to a random page
Common values for are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a
few time steps
y y

a m a m
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 33
y a m
Power Iteration: y
y 0
8
Set ' = a 0 0
c a m m 0 0
]^
' = 5' _
^ ry = ry /2 + ra /2
And iterate ra = ry /2
rm = ra /2
Example:
ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 0
rm 1/3 1/6 1/12 2/24 0
Iteration 0, 1, 2,

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 34
Teleports: Follow random teleport links with
probability 1.0 from dead-ends
Adjust matrix accordingly

y y

a m a m
y a m y a m
y 0 y
a 0 0 a 0
m 0 0 m 0

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 35
Google's solution: At each step, the random surfer has two options:
With probability β, follow a link at random
With probability 1 − β, jump to some random page
PageRank equation [Brin-Page, '98]:
r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N
where d_i is the out-degree of node i.

The above formulation assumes that has no dead ends. We can


either preprocess matrix (bad!) or explicitly follow random teleport
links with probability 1.0 from dead-ends. See P. Berkhin, A Survey
on PageRank Computing, Internet Mathematics, 2005.
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 36
Details!

PageRank as a principal eigenvector:
r = M · r, or equivalently r_j = Σ_{i→j} r_i / d_i
But we really want (**):
r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N    (d_i ... out-degree of node i)
Let's define:
A_ij = β · M_ij + (1 − β) · 1/N
(Note: M is a sparse matrix but A is dense (all entries ≥ 0). In practice we never materialize A, but rather use the sum formulation (**).)
Now we get what we want:
r = A · r
What is β? In practice 1 − β ≈ 0.15, i.e. β ≈ 0.85 (a jump approximately every 5-6 links).
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 37
Input: Graph G and parameter β
Directed graph G with spider traps and dead ends
Parameter β
Output: PageRank vector r
Set: r_j^(0) = 1/N, t = 1
do:
  ∀j: r'_j^(t) = Σ_{i→j} β · r_i^(t−1) / d_i
      (r'_j^(t) = 0 if the in-degree of j is 0)
  Now re-insert the leaked PageRank:
  ∀j: r_j^(t) = r'_j^(t) + (1 − S)/N, where S = Σ_j r'_j^(t)
  t = t + 1
while Σ_j | r_j^(t) − r_j^(t−1) | > ε
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 38
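A sketch of the complete algorithm above, with dead ends handled by re-inserting the leaked mass (the graph is a dict mapping each node to its list of out-links; the names and the toy example are ours):

def pagerank(graph, beta=0.8, eps=1e-8):
    nodes = list(graph)
    N = len(nodes)
    r = {v: 1.0 / N for v in nodes}
    while True:
        r_new = {v: 0.0 for v in nodes}
        for i in nodes:
            for j in graph[i]:
                r_new[j] += beta * r[i] / len(graph[i])
        leaked = 1.0 - sum(r_new.values())          # teleport and dead-end mass
        for v in nodes:
            r_new[v] += leaked / N                  # re-insert it uniformly
        if sum(abs(r_new[v] - r[v]) for v in nodes) < eps:
            return r_new
        r = r_new

print(pagerank({'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}))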
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 39
PageRank and HITS are two solutions to the
same problem:
What is the value of an in-link from u to v?
In the PageRank model, the value of the link
depends on the links into u
In the HITS model, it depends on the value of the
other links out of u

The destinies of PageRank and HITS


post-1998 were very different

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40
[Tong-Faloutsos, 06]

I 1 J
1 1

A 1 H 1 B

1 1
D

1 1 1
E G
F

a.k.a.: Relevance, Closeness, Similarity


11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 42
Given: a conference-to-author bipartite graph (conferences such as IJCAI, KDD, ICDM, SDM, AAAI, NIPS; authors such as Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan).
Goal: proximity on graphs.
Q: What is the most related conference to ICDM?
Shortest path is not a good proximity measure:
- it gives no influence to degree-1 nodes (E, F, G)!
- it cannot capture multi-faceted relationships
Network flow is not a good proximity measure either: it does not punish long paths.
[Tong-Faloutsos, '06]
(Figure: the same A–J example graph.)
A good proximity measure should account for:
- multiple connections
- quality of connection
- direct & indirect connections
- length, degree, weight
(Figure: example graph on 12 numbered nodes, used for the Random Walk with Restart example below.)
Goal: evaluate pages not just by popularity but by how close they are to a topic.
Teleporting can go to:
- Any page with equal probability: standard PageRank (what we used so far)
- A topic-specific set of relevant pages: topic-specific (personalized) PageRank (S ... teleport set)

  A_ij = β·M_ij + (1-β)/|S|   if i ∈ S
  A_ij = β·M_ij               otherwise

Random Walk with Restart: S is a single element.
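A sketch of topic-specific PageRank: the only change from the earlier code is that all leaked mass (teleports and dead ends) is redistributed over the teleport set S instead of over all nodes. Function and variable names are illustrative:

```python
def personalized_pagerank(out_links, teleport_set, beta=0.85, eps=1e-8):
    """Topic-specific PageRank: teleports land only on the pages in teleport_set."""
    nodes = list(out_links)
    S = list(teleport_set)
    r = {n: 1.0 / len(nodes) for n in nodes}
    while True:
        r_new = {n: 0.0 for n in nodes}
        for i in nodes:
            outs = out_links[i]
            if not outs:
                continue                      # dead end: its mass is re-inserted below
            for j in outs:
                r_new[j] += beta * r[i] / len(outs)
        leaked = 1.0 - sum(r_new.values())    # teleport mass + dead-end mass
        for j in S:
            r_new[j] += leaked / len(S)       # ...spread over the teleport set only
        if sum(abs(r_new[n] - r[n]) for n in nodes) < eps:
            return r_new
        r = r_new

# Random Walk with Restart is the special case of a single-element teleport set,
# e.g. personalized_pagerank(graph, teleport_set={"ICDM"}).
```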
Graphs and web search: PageRank ranks nodes by importance.
Personalized PageRank: ranks the proximity of nodes to the teleport nodes S.
Proximity on graphs (Q: what is the most related conference to ICDM?): Random Walks with Restarts — teleport back to the starting node, S = {single node}.
(Figure: the conference–author bipartite graph from before.)
Example: Random Walk with Restart on the 12-node graph, with node 4 as the query node. Ranking vector (nearby nodes get higher scores; "more red, more relevant" in the original figure):

  Node 1: 0.13   Node 2: 0.10   Node 3: 0.13   Node 4: (query)
  Node 5: 0.13   Node 6: 0.05   Node 7: 0.05   Node 8: 0.08
  Node 9: 0.04   Node 10: 0.03  Node 11: 0.04  Node 12: 0.02
(Figure: proximity of CS conferences to ICDM computed by Random Walk with Restart; neighbouring conferences such as KDD, SDM, PAKDD, PKDD, ICML, CIKM, ICDE, ECML, SIGMOD and DMKD receive scores roughly in the range 0.004–0.011.)
Q: Which conferences are closest to KDD & ICDM?
A: Personalized PageRank with teleport set S = {KDD, ICDM}.
(Figure: graph of CS conferences.)
Example (Pinterest): Pins belong to Boards, forming a bipartite Pin–Board graph.
(Figures: given input pins, the system recommends related pins as output.)
Recommendations are computed as proximity to the query node(s) Q in the bipartite Pin and Board graph.
Pixie Random Walk: proximity to the query node(s) Q.
(Figure: a random walk starting from the query pin Q repeatedly hops between pins and boards such as "Yummm", "Strawberries", "Smoothies" and "Smoothie Madness!!"; the visit counts of the pins serve as proximity scores.)
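The details of Pinterest's Pixie walk are not given on the slide, but the basic idea — simulate many walk steps from the query pin, restart with some probability, and rank pins by visit count — can be sketched as follows (the restart probability, step budget and graph encoding are assumptions):

```python
import random

def random_walk_with_restart(neighbors, query, n_steps=100_000, restart_prob=0.5):
    """Monte-Carlo proximity: walk the bipartite graph, jump back to the
    query node with probability restart_prob, and count visits."""
    counts = {}
    node = query
    for _ in range(n_steps):
        if random.random() < restart_prob:
            node = query                           # restart at the query pin
        else:
            node = random.choice(neighbors[node])  # hop pin -> board or board -> pin
        counts[node] = counts.get(node, 0) + 1
    return sorted(counts.items(), key=lambda kv: -kv[1])  # highest counts = most related
```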
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Observations → Models → Algorithms:
- Small diameter, edge clustering → Erdős–Rényi model, small-world model → decentralized search
- Patterns of signed edge creation → structural balance, theory of status → models for predicting edge signs
- Viral marketing, blogosphere, memetracking → independent cascade model, game-theoretic model → influence maximization, outbreak detection, LIM
- Scale-free → preferential attachment, copying model → PageRank, hubs and authorities
- Densification power law, shrinking diameters → microscopic model of evolving networks → link prediction, supervised random walks
- Strength of weak ties, core-periphery → Kronecker graphs → community detection: Girvan-Newman, modularity
We often think of networks as looking like this (figure: tightly knit clusters connected by a few links).
What led to such a conceptual picture?
How does information flow through the network?
What structurally distinct roles do nodes play?
What roles do different links (short vs. long) play?
How do people find out about new jobs?
Mark Granovetter, as part of his PhD in the 1960s, found that people learn about jobs through personal contacts.
But: the contacts were often acquaintances rather than close friends.
This is surprising: one would expect your friends to help you out more than casual acquaintances.
Why is it that acquaintances are most helpful?
[Granovetter '73]
Two perspectives on friendships:
- Structural: friendships span different parts of the network
- Interpersonal: friendship between two people is either strong or weak

Structural role: Triadic Closure
If two people in a network have a friend in common, then there is an increased likelihood they will become friends themselves.
(Figure: three nodes a, b, c. Which new edge is more likely to form, a–b or a–c?)
Granovetter makes a connection between the social and the structural role of an edge.
First point: Structure
- Structurally embedded edges are also socially strong
- Long-range edges spanning different parts of the network are socially weak
Second point: Information
- Long-range edges allow you to gather information from different parts of the network and get a job
- Structurally embedded edges are heavily redundant in terms of information access
(Figure: strong (S) and weak (W) ties; the long-range edge a–b is a weak tie.)
Triadic closure == high clustering coefficient.
Reasons for triadic closure: if B and C have a friend A in common, then:
- B is more likely to meet C (since they both spend time with A)
- B and C trust each other (since they have a friend in common)
- A has an incentive to bring B and C together (as it is hard for A to maintain two disjoint relationships)
Empirical study by Bearman and Moody: teenage girls with a low clustering coefficient are more likely to contemplate suicide.
Define: Bridge edge — if removed, it disconnects the graph. (Figure: bridge edge a–b.)
Define: Local bridge — edge of span > 2. (The span of an edge is the distance of the edge endpoints if the edge is deleted; local bridges with long span are like real bridges.) (Figure: local bridge a–b.)
Define: Two types of edges: Strong (friend) and Weak (acquaintance).
Define: Strong triadic closure — two strong ties imply a third edge.
Fact: if strong triadic closure is satisfied, then local bridges are weak ties!
(Figure: a graph with edges labelled S/W; the local bridge a–b is labelled W.)
Claim: If a node A satisfies Strong Triadic Closure and is involved in at least two strong ties, then any local bridge adjacent to A must be a weak tie.
Proof by contradiction:
- Assume A satisfies Strong Triadic Closure and has at least two strong ties.
- Let the edge A–B be a local bridge and a strong tie, and let A–C be another strong tie.
- Then the edge B–C must exist, because of Strong Triadic Closure.
- But then A–B is not a local bridge! (since B and C must be connected due to the Strong Triadic Closure property)
For many years Granovetter's theory was not tested.
But today we have large who-talks-to-whom graphs: email, messenger, cell phones, Facebook.
Onnela et al. 2007:
- Cell-phone network covering 20% of a country's population
- Edge strength: number of phone calls
Edge overlap:

  O_ij = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|

where N(i) is the set of neighbors of node i (excluding i and j themselves).
Overlap = 0 when an edge is a local bridge.
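A small sketch of the overlap computation, assuming the graph is stored as a dict of neighbor sets (names are illustrative):

```python
def edge_overlap(adj, i, j):
    """O_ij = |N(i) & N(j)| / |N(i) | N(j)|, with i and j themselves excluded."""
    Ni = adj[i] - {i, j}
    Nj = adj[j] - {i, j}
    union = Ni | Nj
    return len(Ni & Nj) / len(union) if union else 0.0

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(edge_overlap(adj, 1, 2))   # 1.0: fully embedded edge
print(edge_overlap(adj, 3, 4))   # 0.0: edge (3, 4) is a local bridge
```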
Cell-phone network observation: highly used links have high overlap!
(Plot: neighborhood overlap vs. edge strength (#calls).
Legend — True: the data; Permuted strengths: keep the network structure but randomly reassign edge strengths.)
Real edge strengths in the mobile call graph: strong ties are more embedded (have higher overlap).
Same network, same set of edge strengths, but now the strengths are randomly shuffled.
Removing links by strength (#calls), either from low to high or from high to low:
removing the low-strength (weak) links first disconnects the network sooner.
(Plot: size of the largest component vs. fraction of removed links; inset: conceptual picture of the network structure.)
Removing links based on overlap, either from low to high or from high to low:
removing the low-overlap links first disconnects the network sooner.
(Plot: size of the largest component vs. fraction of removed links; inset: conceptual picture of the network structure.)
Granovetter's theory leads to the following conceptual picture of networks: tightly knit groups held together by strong ties, connected to one another by weak ties.
(Figure: strong ties inside the clusters, weak ties between them.)
Granovetter's theory suggests that networks are composed of tightly connected sets of nodes.
Network communities (also called clusters, groups, or modules): sets of nodes with lots of connections inside and few to the outside (the rest of the network).
How can we automatically find such densely connected groups of nodes (communities, clusters, groups, modules)?
Ideally, such automatically detected clusters would then correspond to real groups.
For example:
Zachary's karate club network:
- Observed social ties and rivalries in a university karate club
- During the observation, conflicts led the group to split
- The split could be explained by a minimum cut in the network
Find micro-markets by partitioning the query-to-advertiser graph.
(Figure: bipartite query × advertiser graph.)
Can we identify node groups (communities, modules, clusters)?
- Nodes: teams; Edges: games played. The detected groups correspond to NCAA conferences.
- Nodes: users; Edges: friendships. The detected groups correspond to social communities (e.g. high school, company, Stanford basketball, Stanford squash).
- Nodes: proteins; Edges: interactions. The detected groups correspond to functional modules.
How to find communities?
We will work with undirected (unweighted) networks.
Edge betweenness: the number of shortest paths passing over the edge.
(Figure: example edges with betweenness b=16 and b=7.5.)
Intuition: compare edge strengths (call volume) in a real network with edge betweenness in the same network.
[Girvan-Newman '02]
Divisive hierarchical clustering based on the notion of edge betweenness (the number of shortest paths passing through an edge).

Girvan-Newman algorithm (undirected, unweighted networks):
- Repeat until no edges are left:
  - Calculate the betweenness of all edges
  - Remove the edge(s) with the highest betweenness
- Connected components are communities
- Gives a hierarchical decomposition of the network (a sketch in code follows below)
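As a sketch (assuming the networkx library, which is not part of the lecture material), the loop looks like this:

```python
import networkx as nx   # assumed available; any graph library with edge betweenness would do

def girvan_newman_levels(G):
    """Repeatedly remove the highest-betweenness edge(s); record the communities
    each time the number of connected components grows."""
    G = G.copy()
    levels = []
    while G.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(G)   # recomputed at every step
        max_b = max(betweenness.values())
        for edge, b in list(betweenness.items()):
            if b == max_b:
                G.remove_edge(*edge)
        comps = [sorted(c) for c in nx.connected_components(G)]
        if not levels or len(comps) != len(levels[-1]):
            levels.append(comps)
    return levels   # hierarchical decomposition, from coarse to fine

# Example: first split of Zachary's karate-club network from the earlier slide
print(girvan_newman_levels(nx.karate_club_graph())[0])
```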
(Figure: example graph with edge betweenness values such as 1, 12, 33, 49.)
Note: we need to re-compute betweenness at every step.
(Figure: steps 1-3 of edge removal and the resulting hierarchical network decomposition.)
1. How to compute betweenness?
2. How to select the number of
clusters?
We want to compute the betweenness contribution of shortest paths starting at a given node, using breadth-first search from that node.
(Figure: BFS tree; the starting node is at depth 0.)
Forward step: count the number of shortest paths from the starting node to all other nodes of the network.
Backward step: compute betweenness; if there are multiple shortest paths, count them fractionally.
The algorithm adds edge flows:
- node flow = 1 + sum of the flows on its child edges
- the flow is split up based on the parent values (the number of shortest paths through each parent)
Repeat the BFS procedure for each starting node.
(Figure annotations: 1+1 paths to H, split evenly; 1+0.5 paths to J, split 1:2; 1 path to K, split evenly.)
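The forward and backward steps can be sketched for a single starting node as follows (adjacency is a dict of neighbor collections; names are illustrative). Summing the resulting edge credits over all starting nodes, and halving because each shortest path is discovered from both of its endpoints, gives the edge betweenness used by Girvan-Newman:

```python
from collections import deque

def single_source_edge_credit(adj, s):
    """Forward step: BFS from s counting shortest paths.
    Backward step: send flow 1 from every node back toward s, splitting it
    among parents in proportion to their shortest-path counts."""
    dist, n_paths, order = {s: 0}, {s: 1}, []
    q = deque([s])
    while q:                                   # forward step
        v = q.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                n_paths[w] = 0
                q.append(w)
            if dist[w] == dist[v] + 1:
                n_paths[w] += n_paths[v]       # every shortest path to v extends to w

    node_flow = {v: 1.0 for v in order}        # each node starts with flow 1
    edge_credit = {}
    for w in reversed(order):                  # backward step, deepest nodes first
        for v in adj[w]:
            if dist.get(v) == dist[w] - 1:     # v is a parent of w in the BFS DAG
                c = node_flow[w] * n_paths[v] / n_paths[w]
                edge_credit[tuple(sorted((v, w)))] = c
                node_flow[v] += c
    return edge_credit
```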
1. How to compute betweenness?
2. How to select the number of
clusters?
Communities: sets of tightly connected nodes.
Define: Modularity Q — a measure of how well a network is partitioned into communities.
Given a partitioning of the network into groups s ∈ S:

  Q ∝ Σ_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]

We need a null model!
Given a real graph G on n nodes and m edges, construct a rewired network G':
- same degree distribution, but random connections
- consider G' as a multigraph
The expected number of edges between nodes i and j of degrees k_i and k_j equals k_i·k_j / 2m.
The expected number of edges in the (multigraph) G':

  (1/2) Σ_i Σ_j k_i·k_j / 2m = (1/2) · (1/2m) Σ_i k_i Σ_j k_j = (1/4m) · 2m · 2m = m

Note: Σ_u k_u = 2m.
Modularity of a partitioning S of graph G:
Q ∝ Σ_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]

  Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i·k_j / 2m )

where A_ij = 1 if i→j is an edge and 0 otherwise; the 1/2m is a normalizing constant so that −1 < Q < 1.

Modularity values take the range [−1, 1]:
- Q is positive if the number of edges within groups exceeds the expected number
- Q greater than about 0.3–0.7 means significant community structure
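A direct (O(n²)) sketch of the formula, with the graph given as adjacency sets and the partitioning as a list of node sets (names are illustrative):

```python
def modularity(adj, communities):
    """Q = (1/2m) * sum over same-community pairs (i, j) of (A_ij - k_i*k_j / 2m)."""
    two_m = sum(len(neigh) for neigh in adj.values())     # 2m for an undirected graph
    degree = {v: len(neigh) for v, neigh in adj.items()}
    group = {v: g for g, comm in enumerate(communities) for v in comm}

    Q = 0.0
    for i in adj:
        for j in adj:
            if group[i] != group[j]:
                continue
            a_ij = 1.0 if j in adj[i] else 0.0
            Q += a_ij - degree[i] * degree[j] / two_m
    return Q / two_m

# Two triangles joined by one edge, split into the two obvious communities:
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(modularity(adj, [{1, 2, 3}, {4, 5, 6}]))   # about 0.357, clearly positive
```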
Modularity is useful for selecting the number of clusters: cut the Girvan-Newman dendrogram at the level with the largest Q.
(Plot: modularity Q as a function of the number of clusters.)
But why not optimize modularity directly?
Let's split the graph into 2 communities, and directly optimize modularity:

  max_S Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i·k_j / 2m )

Community membership vector s:
  s_i = +1 if node i is in community 1
  s_i = −1 if node i is in community 2
Note that (s_i·s_j + 1)/2 equals 1 if s_i = s_j and 0 otherwise, so

  Q(G, s) = (1/2m) Σ_{ij} ( A_ij − k_i·k_j / 2m ) · (s_i·s_j + 1)/2
          = (1/4m) Σ_{ij} ( A_ij − k_i·k_j / 2m ) · s_i·s_j

(the constant "+1" term vanishes because Σ_ij A_ij = Σ_ij k_i·k_j / 2m = 2m).
Define the modularity matrix B with entries B_ij = A_ij − k_i·k_j / 2m.
Note: each row/column of B sums to 0, since Σ_j B_ij = Σ_j A_ij − (k_i/2m) Σ_j k_j = k_i − k_i = 0.
Membership: s ∈ {−1, +1}^n.
Then:

  Q(G, s) = (1/4m) Σ_{ij} ( A_ij − k_i·k_j / 2m ) s_i s_j
          = (1/4m) Σ_{ij} s_i B_ij s_j
          = (1/4m) s^T B s

Task: find s ∈ {−1, +1}^n that maximizes Q(G, s).
Linear-algebra reminder. For a symmetric matrix A (positive semi-definite means x^T A x ≥ 0):
- The solutions (λ_i, x_i) of the equation A·x = λ·x are the eigenvalue/eigenvector pairs.
- Eigenvectors, ordered by the magnitude of their corresponding eigenvalues λ_1 ≤ λ_2 ≤ ... ≤ λ_n, are orthonormal (orthogonal and of unit length) and form a coordinate system (basis).
- If A is positive semi-definite, then λ_i ≥ 0 (and the eigenvalues always exist).
- Eigen-decomposition theorem: the matrix can be rewritten in terms of its eigenvectors and eigenvalues as A = Σ_i λ_i x_i x_i^T.
Rewrite Q(G, s) = (1/4m) s^T B s in terms of B's eigenvectors and eigenvalues:

  Q = (1/4m) s^T ( Σ_{i=1}^n λ_i x_i x_i^T ) s = (1/4m) Σ_{i=1}^n (s^T x_i) λ_i (x_i^T s) = (1/4m) Σ_{i=1}^n (x_i^T s)^2 λ_i

So, if there were no other constraints on s, then to maximize Q we would set s = x_n.
Why? Because s has fixed length, this assigns all the weight in the sum to λ_n (the largest eigenvalue); all other terms are zero because of orthonormality.
Let's keep only the leading term of the summation (because λ_n is the largest):

  max_s Q(G, s) ∝ Σ_{i=1}^n (x_i^T s)^2 λ_i ≈ (x_n^T s)^2 λ_n

So we want to maximize (x_n^T s)^2 subject to s_j ∈ {−1, +1}. To do this, we set:

  s_j = +1 if the j-th coordinate of x_n is ≥ 0
  s_j = −1 if the j-th coordinate of x_n is < 0

Continue the bisection hierarchically.
Fast Modularity Optimization Algorithm:
- Find the leading eigenvector x_n of the modularity matrix B
- Divide the nodes by the signs of the elements of x_n
- Repeat hierarchically until:
  - If a proposed split does not cause modularity to increase, declare the community indivisible and do not split it
  - If all communities are indivisible, stop

How to find x_n? The power method!
Start with a random v^(0) and repeat:

  v^(t+1) = B·v^(t) / ||B·v^(t)||

When converged (v^(t) ≈ v^(t+1)), set x_n = v^(t).
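A sketch of one bisection step using numpy (assumed available). One practical detail the slide glosses over: plain power iteration converges to the eigenvalue largest in magnitude, which for B may be negative, so the sketch shifts B by a constant before iterating:

```python
import numpy as np

def spectral_split(A, n_iter=1000):
    """One step of fast modularity optimization: build B, approximate its
    leading eigenvector by power iteration, and split nodes by sign."""
    k = A.sum(axis=1)                           # degrees
    two_m = k.sum()                             # 2m
    B = A - np.outer(k, k) / two_m              # modularity matrix B_ij = A_ij - k_i k_j / 2m
    shift = np.abs(B).sum()                     # crude upper bound on the spectral radius
    B_shifted = B + shift * np.eye(len(A))      # makes the top eigenvalue dominate
    v = np.random.rand(len(A))
    for _ in range(n_iter):
        v = B_shifted @ v
        v /= np.linalg.norm(v)
    s = np.where(v >= 0, 1.0, -1.0)             # membership vector in {-1, +1}
    Q = s @ B @ s / (2 * two_m)                 # Q = (1/4m) s^T B s
    return s, Q

# Two triangles joined by one edge:
A = np.array([[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]], float)
print(spectral_split(A))   # splits nodes 0-2 from nodes 3-5 with positive Q
```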
Girvan-Newman:
Based on the strength of weak ties
Remove edge of highest betweenness
Modularity:
Overall quality of the partitioning of a graph
Use to determine the number of communities
Fast modularity optimization:
Transforms the modularity optimization into an eigenvalue problem
[Ron Burt]
Who is better off, Robert or James?
(Figure: the ego networks of James and Robert — one with few structural holes, one with many.)
Structural holes provide the ego with access to novel information, power, and freedom.
The network constraint measure [Burt]: to what extent are a person's contacts redundant?
- Low: disconnected contacts
- High: contacts that are close or strongly tied

  c_i = Σ_j c_ij = Σ_j ( p_ij + Σ_{k≠i,j} p_ik·p_kj )²

where p_ij is the proportion of i's "energy" (tie weight) invested in the relationship with j.

(Example: a 5-node graph with edges 1–2, 1–3, 1–4, 1–5, 2–5 and 4–5. The matrix of proportions p_ij is:

        1    2    3    4    5
  1   .00  .25  .25  .25  .25
  2   .50  .00  .00  .00  .50
  3   1.0  .00  .00  .00  .00
  4   .50  .00  .00  .00  .50
  5   .33  .33  .00  .33  .00 )
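A small sketch of the constraint for an unweighted graph, where p_ij is simply 1/degree(i) for each neighbor j (the function name and example graphs are illustrative):

```python
def network_constraint(adj, i):
    """Burt's constraint c_i = sum_j (p_ij + sum_{k != i, j} p_ik * p_kj)^2."""
    def p(u, v):
        return 1.0 / len(adj[u]) if v in adj[u] else 0.0

    c_i = 0.0
    for j in adj[i]:
        indirect = sum(p(i, k) * p(k, j) for k in adj[i] if k not in (i, j))
        c_i += (p(i, j) + indirect) ** 2
    return c_i

# A star centre has disconnected contacts (low constraint, many structural holes);
# a node inside a clique has redundant contacts (high constraint).
star   = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
clique = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(network_constraint(star, 0))    # 0.25
print(network_constraint(clique, 0))  # about 0.93
```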
Constraint: to what extent are a person's contacts redundant?
- Low: disconnected contacts
- High: contacts that are close or strongly tied
Network constraint:
- James: c = 0.309
- Robert: c = 0.148