
Machine Learning and

Data Mining
Introduction

Albert Bifet (@abifet)
Data Science

Data Science is an interdisciplinary field focused on
extracting knowledge or insights from large volumes
of data.
Data Scientist

Figure: http://www.marketingdistillery.com/2014/11/29/
is-data-science-a-buzzword-modern-data-scientist-defined/
Data Science

Figure: Drew Conway's Venn diagram


Classification

Definition
Given nC different classes, a classifier algorithm builds a model
that predicts for every unlabelled instance I the class C to which
it belongs with accuracy.

Example
A spam filter

Example
Twitter sentiment analysis: analyze tweets with positive or
negative feelings
Classification
Example
Data set that describes e-mail features for deciding if it is spam:

Contains Money   Domain type   Has attach.   Time received   spam
yes              com           yes           night           yes
yes              edu           no            night           yes
no               com           yes           night           yes
no               edu           no            day             no
no               com           no            day             no
yes              cat           no            day             yes

Assume we have to classify the following new instance:

Contains Money   Domain type   Has attach.   Time received   spam
yes              edu           yes           day             ?
k-Nearest Neighbours

k-NN Classifier
Training: store all instances in memory
Prediction:
Find the k nearest instances
Output majority class of these k instances
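A minimal sketch of this classifier in Python/NumPy; the toy training set and the value of k below are made up for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # "Training" is just keeping X_train, y_train in memory.
    # Prediction: find the k nearest instances and output their majority class.
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k nearest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array(["spam", "spam", "ham", "ham"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> "spam"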
Bayes Classifiers
Naive Bayes

Based on Bayes' Theorem:

    P(c|d) = P(c) P(d|c) / P(d)

    posterior = prior × likelihood / evidence

Estimates the probability of observing attribute a and the
prior probability P(c)
Probability of class c given an instance d:

    P(c|d) = P(c) ∏_{a ∈ d} P(a|c) / P(d)
Bayes Classifiers

Multinomial Naive Bayes

Considers a document as a bag-of-words.
Estimates the probability of observing word w and the prior
probability P(c)
Probability of class c given a test document d:

    P(c|d) = P(c) ∏_{w ∈ d} P(w|c)^{n_wd} / P(d)

where n_wd is the number of times w occurs in d
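A minimal Python sketch of a Multinomial Naive Bayes classifier; the Laplace (add-one) smoothing and the toy documents are assumptions added for illustration, not part of the slide:

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    # docs: list of token lists; labels: list of class labels
    prior = Counter(labels)
    word_counts = defaultdict(Counter)    # per-class word counts
    vocab = set()
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
        vocab.update(d)
    return prior, word_counts, vocab

def predict_nb(prior, word_counts, vocab, doc):
    n = sum(prior.values())
    best, best_score = None, -math.inf
    for c in prior:
        total = sum(word_counts[c].values())
        # log P(c) + sum over word occurrences of log P(w|c), Laplace-smoothed
        score = math.log(prior[c] / n)
        for w in doc:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["free", "money", "now"], ["meeting", "agenda"], ["free", "offer"]]
labels = ["spam", "ham", "spam"]
model = train_multinomial_nb(docs, labels)
print(predict_nb(*model, ["free", "money"]))   # -> "spam"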
Perceptron

[Figure: perceptron with input attributes x_1 ... x_5, weights w_1 ... w_5, and output h_w(x_i)]

Data stream: ⟨x_i, y_i⟩

Classical perceptron: h_w(x_i) = sgn(w^T x_i)
Minimize mean-square error: J(w) = ½ Σ_i (y_i − h_w(x_i))²

Perceptron

[Figure: perceptron with input attributes x_1 ... x_5, weights w_1 ... w_5, and output h_w(x_i)]

We use the sigmoid function h_w = σ(w^T x) where

    σ(x) = 1/(1 + e^(−x))

    σ′(x) = σ(x)(1 − σ(x))
Perceptron

Minimize mean-square error: J(w) = ½ Σ_i (y_i − h_w(x_i))²

Stochastic Gradient Descent: w = w − η ∇J x_i

Gradient of the error function:

    ∇J = −Σ_i (y_i − h_w(x_i)) ∇h_w(x_i)

    ∇h_w(x_i) = h_w(x_i)(1 − h_w(x_i))

Weight update rule:

    w = w + η Σ_i (y_i − h_w(x_i)) h_w(x_i)(1 − h_w(x_i)) x_i
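A minimal Python/NumPy sketch of the sigmoid perceptron trained with the weight update rule above, processing one (x_i, y_i) pair at a time; the learning rate η and the toy data stream are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_sgd(stream, n_features, eta=0.1):
    # One pass over a stream of (x_i, y_i) pairs, y_i in {0, 1}.
    w = np.zeros(n_features)
    for x, y in stream:
        h = sigmoid(w @ x)
        # weight update rule: w <- w + eta * (y - h) * h * (1 - h) * x
        w += eta * (y - h) * h * (1 - h) * x
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)       # a linearly separable toy concept
w = perceptron_sgd(zip(X, y), n_features=2)
print(w)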
Restricted Boltzmann Machines (RBMs)

[Figure: bipartite graph with hidden units z_1 ... z_4 connected to visible units x_1 ... x_5]

Energy-based models, where

    P(x, z) ∝ e^(−E(x,z)).

Manipulate a weight matrix W to find low-energy states
and thus generate high probability P(x, z), where

    E(x, z) = −x^T W z.

RBMs can be stacked on top of each other to form
so-called Deep Belief Networks (DBNs)
Classification

Assume we have to classify the following new instance:

Contains Money   Domain type   Has attach.   Time received   spam
yes              edu           yes           day             ?

Decision tree:

                 Time
            Day       Night
     Contains Money    YES
       Yes     No
       YES     NO
Decision Trees

Basic induction strategy:

A ← the best decision attribute for the next node
Assign A as decision attribute for node
For each value of A, create a new descendant of node
Sort training examples to leaf nodes
If training examples are perfectly classified, then STOP; else
iterate over new leaf nodes
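A sketch of tree induction using scikit-learn on the e-mail example above; the use of pandas, one-hot encoding of the categorical attributes, and the entropy criterion are implementation assumptions, not part of the slide:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The e-mail data set from the earlier example.
data = pd.DataFrame({
    "money":  ["yes", "yes", "no", "no", "no", "yes"],
    "domain": ["com", "edu", "com", "edu", "com", "cat"],
    "attach": ["yes", "no", "yes", "no", "no", "no"],
    "time":   ["night", "night", "night", "day", "day", "day"],
    "spam":   ["yes", "yes", "yes", "no", "no", "yes"],
})
X = pd.get_dummies(data.drop(columns="spam"))        # one-hot encode categories
y = data["spam"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

new = pd.DataFrame({"money": ["yes"], "domain": ["edu"], "attach": ["yes"], "time": ["day"]})
new = pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new))   # predicted class for the new instance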
Bagging

Example
Dataset of Instances: A, B, C, D

Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C

Bagging builds a set of M base models, each trained on a
bootstrap sample created by drawing random samples with
replacement.
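A minimal Python sketch of bagging; using scikit-learn decision trees as base models and M = 5 are illustrative choices:

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=5, seed=0):
    # Build M base models, each trained on a bootstrap sample
    # (random samples drawn with replacement).
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(M):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap indices
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]       # majority vote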
Random Forests

Bagging
Random Trees: trees that in each node use only a random
subset of the attributes
Random Forests is one of the most popular methods in
machine learning.
Boosting

"The Strength of Weak Learnability" (Schapire)

A boosting algorithm transforms a weak learner
into a strong one
Boosting

A formal description of Boosting (Schapire)

given a training set (x_1, y_1), . . . , (x_m, y_m)
    y_i ∈ {−1, +1} correct label of instance x_i ∈ X
for t = 1, . . . , T
    construct distribution D_t
    find weak classifier

        h_t : X → {−1, +1}

    with small error ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i] on D_t

output final classifier
Boosting

AdaBoost
1: Initialize D_1(i) = 1/m for all i ∈ {1, 2, ..., m}
2: for t = 1, 2, ..., T do
3:   Call WeakLearn, providing it with distribution D_t
4:   Get back hypothesis h_t : X → Y
5:   Calculate error of h_t: ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i)
6:   Update distribution D_t:
         D_{t+1}(i) = (D_t(i)/Z_t) × ε_t/(1 − ε_t)   if h_t(x_i) = y_i
         D_{t+1}(i) = (D_t(i)/Z_t) × 1               otherwise
     where Z_t is a normalization constant (chosen so D_{t+1} is a
     probability distribution)
7: return h_fin(x) = arg max_{y ∈ Y} Σ_{t: h_t(x)=y} log((1 − ε_t)/ε_t)
Boosting

AdaBoost
1: Initialize D_1(i) = 1/m for all i ∈ {1, 2, ..., m}
2: for t = 1, 2, ..., T do
3:   Call WeakLearn, providing it with distribution D_t
4:   Get back hypothesis h_t : X → Y
5:   Calculate error of h_t: ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i)
6:   Update distribution D_t:
         D_{t+1}(i) = (D_t(i)/Z_t) × ε_t         if h_t(x_i) = y_i
         D_{t+1}(i) = (D_t(i)/Z_t) × (1 − ε_t)   otherwise
     where Z_t is a normalization constant (chosen so D_{t+1} is a
     probability distribution)
7: return h_fin(x) = arg max_{y ∈ Y} Σ_{t: h_t(x)=y} log((1 − ε_t)/ε_t)
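A minimal Python sketch of binary AdaBoost with decision stumps as the weak learner. It uses the equivalent α_t = ½ log((1 − ε_t)/ε_t) re-weighting formulation rather than the β_t form on the slides, and assumes labels in {−1, +1}:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    # y must be a numpy array with values in {-1, +1}
    m = len(X)
    D = np.full(m, 1.0 / m)                 # D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()            # weighted error on D_t
        if eps == 0 or eps >= 0.5:
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        # increase weight of misclassified instances, then normalize
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()
        hs.append(h); alphas.append(alpha)
    return hs, alphas

def adaboost_predict(hs, alphas, X):
    votes = sum(a * h.predict(X) for h, a in zip(hs, alphas))
    return np.sign(votes)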
Stacking

Use a classifier to combine predictions of base classifiers

Example
Use a perceptron to do stacking
Use decision trees as base classifiers
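A sketch of the example above: decision trees as base classifiers and a perceptron as the combining classifier. Using out-of-fold predictions as meta-features is an implementation choice, and numeric class labels (e.g. 0/1) are assumed:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_predict

def stacking_fit(X, y, n_base=3):
    # Base level: decision trees of different depths.
    bases = [DecisionTreeClassifier(max_depth=d + 1, random_state=0) for d in range(n_base)]
    # Meta features: out-of-fold predictions of each base classifier,
    # so the meta learner is not trained on in-sample predictions.
    meta_X = np.column_stack([cross_val_predict(b, X, y, cv=5) for b in bases])
    meta = Perceptron().fit(meta_X, y)        # assumes numeric class labels
    bases = [b.fit(X, y) for b in bases]
    return bases, meta

def stacking_predict(bases, meta, X):
    meta_X = np.column_stack([b.predict(X) for b in bases])
    return meta.predict(meta_X)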
Clustering

Definition
Clustering is the distribution of a set of instances or examples
into previously unknown groups according to some common relations or
affinities.

Example
Market segmentation of customers

Example
Social network communities
Clustering

Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
    f : I → {1, . . . , K}
that minimizes the objective function cost(I)

Clustering
Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function

    cost(C, I) = Σ_{x ∈ I} d²(x, C)

where
    d(x, c): distance function between x and c
    d²(x, C) = min_{c ∈ C} d²(x, c): distance from x to the nearest
    point in C
k-means

1. Choose k initial centers C = {c_1, . . . , c_k}
2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to each instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k
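A minimal NumPy sketch of this loop; initializing the centers with k random instances is an assumption (the k-means++ seeding on the next slide would replace that step):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Choose k initial centers (here: k random instances)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # assign each instance to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # set each center to the center of mass of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):     # stopping criterion
            break
        centers = new_centers
    return centers, labels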
k-means++

1. Choose an initial center c_1
   For k = 2, . . . , K
     select c_k = p ∈ I with probability d²(p, C)/cost(C, I)
2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to each instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k
Performance Measures

Internal Measures
    Sum square distance
    Dunn index D = d_min / d_max
    C-Index C = (S − S_min) / (S_max − S_min)

External Measures
    Rand Measure
    F Measure
    Jaccard
    Purity
Density based methods
DBSCAN
    ε-neighborhood(p): set of points that are at a distance from p
    less than or equal to ε
    Core object: object whose ε-neighborhood has an overall
    weight of at least MinPts
    A point p is directly density-reachable from q if
        p is in ε-neighborhood(q)
        q is a core object
    A point p is density-reachable from q if
        there is a chain of points p_1, . . . , p_n such that p_{i+1} is directly
        density-reachable from p_i
    A point p is density-connected to q if
        there is a point o such that p and q are density-reachable
        from o
Density based methods

DBSCAN
    A cluster C of points satisfies
        if p ∈ C and q is density-reachable from p, then q ∈ C
        all points p, q ∈ C are density-connected
    A cluster is uniquely determined by any of its core points
    A cluster can be obtained by
        choosing an arbitrary core point as a seed
        retrieving all points that are density-reachable from the seed
DBSCAN

Figure: DBSCAN point example with MinPts = 3
Density based methods

DBSCAN
    select an arbitrary point p
    retrieve all points density-reachable from p
    if p is a core point, a cluster is formed
    if p is a border point
        no points are density-reachable from p and
        DBSCAN visits the next point of the database
    continue the process until all of the points have been
    processed
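A short sketch using scikit-learn's DBSCAN implementation on synthetic data; the ε and MinPts values (eps, min_samples) and the toy data are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered noise points
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2)),
               rng.uniform(-2, 7, size=(5, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks noise points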
Frequent Patterns

Suppose D is a dataset of patterns, t ∈ D, and min_sup is a
constant.

Definition
Support(t): number of patterns in D that are superpatterns of t.

Definition
Pattern t is frequent if Support(t) ≥ min_sup.

Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns
in D.
Pattern Mining

Dataset Example
Document   Patterns
d1         abce
d2         cde
d3         abce
d4         acde
d5         abcde
d6         bcd
Itemset Mining

Document   Patterns     Support                Frequent
d1         abce         d1,d2,d3,d4,d5,d6      c
d2         cde          d1,d2,d3,d4,d5         e, ce
d3         abce         d1,d3,d4,d5            a, ac, ae, ace
d4         acde         d1,d3,d5,d6            b, bc
d5         abcde        d2,d4,d5,d6            d, cd
d6         bcd          d1,d3,d5               ab, abc, abe,
                                               be, bce, abce
                        d2,d4,d5               de, cde

minimal support = 3
Itemset Mining

Document   Patterns     Support   Frequent
d1         abce         6         c
d2         cde          5         e, ce
d3         abce         4         a, ac, ae, ace
d4         acde         4         b, bc
d5         abcde        4         d, cd
d6         bcd          3         ab, abc, abe,
                                  be, bce, abce
                        3         de, cde
Itemset Mining

Document   Patterns   Support   Frequent               Gen    Closed
d1         abce       6         c                      c      c
d2         cde        5         e, ce                  e      ce
d3         abce       4         a, ac, ae, ace         a      ace
d4         acde       4         b, bc                  b      bc
d5         abcde      4         d, cd                  d      cd
d6         bcd        3         ab, abc, abe,          ab
                                be, bce, abce          be     abce
                      3         de, cde                de     cde
Itemset Mining

Document   Patterns   Support   Frequent               Gen    Closed   Max
d1         abce       6         c                      c      c
d2         cde        5         e, ce                  e      ce
d3         abce       4         a, ac, ae, ace         a      ace
d4         acde       4         b, bc                  b      bc
d5         abcde      4         d, cd                  d      cd
d6         bcd        3         ab, abc, abe,          ab
                                be, bce, abce          be     abce     abce
                      3         de, cde                de     cde      cde
Itemset Mining

Example: the generator e maps to its closed pattern ce (e → ce); both have
support 5 in the table above.
Itemset Mining

Example: the generator a maps to its closed pattern ace (a → ace); both have
support 4 in the table above.
Closed Patterns

Usually, there are too many frequent patterns. We can compute
a smaller set, while keeping the same information.

Example
A set of n items has 2^n subsets; for even moderately large n this is more
than the number of atoms in the universe.
Closed Patterns

A priori property
If t′ is a subpattern of t, then Support(t′) ≥ Support(t).

Definition
A frequent pattern t is closed if none of its proper superpatterns
has the same support as it has.

Frequent subpatterns and their supports can be generated from
closed patterns.
Maximal Patterns

Definition
A frequent pattern t is maximal if none of its proper
superpatterns is frequent.

Frequent subpatterns can be generated from maximal patterns,
but not with their support.
All maximal patterns are closed, but not all closed patterns are
maximal.
Non streaming frequent itemset miners
Representation:
    Horizontal layout
        T1: a, b, c
        T2: b, c, e
        T3: b, d, e
    Vertical layout (one bit per transaction)
        a: 1 0 0
        b: 1 1 1
        c: 1 1 0

Search:
    Breadth-first (levelwise): Apriori
    Depth-first: Eclat, FP-Growth
The Apriori Algorithm

APRIORI
1  Initialize the item set size k = 1
2  Start with single element sets
3  Prune the non-frequent ones
4  while there are frequent item sets
5      do create candidates with one item more
6      Prune the non-frequent ones
7      Increment the item set size k = k + 1
8  Output: the frequent item sets
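A minimal Python sketch of this levelwise search, run on the itemset-mining example dataset from earlier with min_sup = 3; the naive candidate generation and support counting are simplifications for illustration:

from itertools import combinations

def apriori(transactions, min_sup):
    # transactions: list of sets of items; min_sup: absolute support threshold
    items = {i for t in transactions for i in t}
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)
    k = 1
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    all_frequent = list(frequent)
    while frequent:
        # create candidates with one item more, then prune the non-frequent ones
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        frequent = [c for c in candidates if support(c) >= min_sup]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

D = [set("abce"), set("cde"), set("abce"), set("acde"), set("abcde"), set("bcd")]
print(sorted("".join(sorted(s)) for s in apriori(D, min_sup=3)))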


The Eclat Algorithm

Depth-First Search
    divide-and-conquer scheme: the problem is processed by
    splitting it into smaller subproblems, which are then
    processed recursively
        conditional database for the prefix a:
            transactions that contain a
        conditional database for item sets without a:
            transactions that do not contain a

Vertical representation
    Support counting is done by intersecting lists of
    transaction identifiers
The FP-Growth Algorithm

Depth-First Search
    divide-and-conquer scheme: the problem is processed by
    splitting it into smaller subproblems, which are then
    processed recursively
        conditional database for the prefix a:
            transactions that contain a
        conditional database for item sets without a:
            transactions that do not contain a
Vertical and Horizontal representation: FP-Tree
    prefix tree with links between nodes that correspond to the
    same item
    Support counting is done using the FP-Tree
Mining Graph Data
Problem
Given a data set D of graphs, find frequent graphs.

Transaction Id   Graph
[Figure: three small molecule-like graphs built from C, S, N, O atoms]
The gSpan Algorithm

GSPAN(g, D, min_sup, S)

Input: A graph g, a graph dataset D, min_sup.
Output: The frequent graph set S.

1   if g ≠ min(g)
2       then return S
3   insert g into S
4   update support counter structure
5   C ← ∅
6   for each g′ that can be right-most
        extended from g in one step
7       do if support(g′) ≥ min_sup
8           then insert g′ into C
9   for each g′ in C
10      do S ← GSPAN(g′, D, min_sup, S)
11  return S
Machine Learning and
Data Mining
Data Preprocessing

Albert Bifet (@abifet)
Data Basics
Machine Learning/Data Mining Applications

Business Analytics
    Is this customer credit-worthy?
    Is a customer willing to respond to an email?
    Do customers divide into similar groups?
    How much is a customer going to spend next semester?
World Wide Web
Financial Analytics
Internet of Things
Image Recognition, Speech
...
The Data Mining Process

Data collection
Data Preprocessing
    Feature extraction
    Data cleaning
    Feature selection and transformation
Analytical processing and algorithms
Data Postprocessing
Multidimensional Data

Example:
Competitor Name   Swim   Cycle   Run   Total
John T            -      -       -     -
Norman P          -      -       -     -
Alex K            -      -       n/a   n/a
Sarah H           -      -       -     -
Table: Triathlon results

Example or Instance
data point, transaction, entity, tuple, object, or feature-vector
Attribute or Feature
field, dimension
Instance Types

Dense
    red, white, Barcelona, up
    red, red, Barcelona, down
    black, white, Paris, up
    red, green, Paris, down
Sparse
    long vectors in which most entries are zero, e.g.
    0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
Attribute Type

Numerical
    real or integer values
Categorical or Discrete
    +, -
    red, green, black
    yes, no
    up, down
    Barcelona, Paris, London, New York
Text Data: vector-space representation
    The cat is black
Binary: Categorical or Numerical
Analytical processing and algorithms

Attribute/Column Relationships
    Classification: predict value of a discrete attribute
    Regression: predict value of a numeric attribute
Instance/Row Relationships
    Clustering: determine subsets of rows, in which the values
    in the corresponding columns are similar
    Outlier Detection: determine the rows that are very different
    from the other rows
Big Data Scalability

Distributed Systems:
Hardware: Hadoop cluster
Software: MapReduce, Spark, Flink, Storm
Streaming Algorithms
Single pass over the data
Concept Drift
Data Preparation
The Data Mining Process

Data collection
Data Preprocessing
    Feature extraction
    Data cleaning
    Feature selection and transformation
Analytical processing and algorithms
Data Postprocessing
Feature Extraction

Sensor data: wavelets or Fourier Transforms
Image Data: histograms or visual words
Web logs: multidimensional data
Network traffic: specific features such as network protocol,
bytes transferred
Text Data: remove stop words, stem data,
multidimensional data
Feature Conversion

Numeric to Discrete
Equi-width ranges
Equi-log ranges
Equi-depth ranges
Discrete to Numeric
Binarization: one numeric attribute for each value
Text to Numeric
remove stop words, stem data, tf-idf, multidimensional data
Time Series to Discrete Sequence Data
SAX: equi-depth discretization after window-based
averaging
Time Series to Numeric Data
Discrete Wavelet Transform
Discrete Fourier Transform
Term Frequency-Inverse Document Frequency
Term frequency
    Boolean frequencies
        tf(t, d) = 1 if t occurs in d and 0 otherwise
    Logarithmically scaled frequency
        tf(t, d) = 1 + log f_{t,d}, or zero if f_{t,d} is zero
    Augmented frequency

        tf(t, d) = 0.5 + 0.5 · f_{t,d} / max{f_{t′,d} : t′ ∈ d}

Inverse document frequency

        idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

Term frequency-inverse document frequency

        tfidf(t, d, D) = tf(t, d) · idf(t, D)
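A minimal Python sketch computing tf-idf weights with the logarithmically scaled term frequency and the idf above; the toy documents are illustrative:

import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))    # document frequency per term
    out = []
    for d in docs:
        f = Counter(d)
        out.append({t: (1 + math.log(f[t])) * math.log(N / df[t]) for t in f})
    return out

docs = [["the", "cat", "is", "black"], ["the", "dog"], ["black", "dog", "dog"]]
for weights in tfidf(docs):
    print(weights)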


Data Cleaning
Handling missing entries
    Eliminate entries with a missing value
    Estimate missing values
    Algorithms can handle missing values
Handling incorrect entries
    Duplicate detection and inconsistency detection
    Domain knowledge
    Data-centric methods
Scaling and normalization
    Standardization: for instance i, attribute j:

        z_ij = (x_ij − μ_j) / σ_j

    Normalization:

        y_ij = (x_ij − min_j) / (max_j − min_j)
Feature selection and transformation

Sampling for Static Data
    Sampling with Replacement
    Sampling without Replacement: no duplicates
    Biased Sampling
    Stratified Sampling
Reservoir Sampling for Data Streams
    Given a data stream, choose k items with the same
    probability, storing only k elements in memory.
RESERVOIR SAMPLING

1  for every item i in the first k items of the stream
2      do store item i in the reservoir
3  n = k
4  for every item i in the stream after the first k items of the stream
5      do select a random number r between 1 and n
6      if r ≤ k
7          then replace item r in the reservoir with item i
8      n = n + 1

Figure: Algorithm RESERVOIR SAMPLING
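A minimal Python sketch of the algorithm above; the stream and seed are illustrative:

import random

def reservoir_sample(stream, k, seed=None):
    rng = random.Random(seed)
    reservoir = []
    n = 0
    for item in stream:
        n += 1
        if n <= k:
            reservoir.append(item)          # store the first k items
        else:
            r = rng.randint(1, n)           # random number between 1 and n
            if r <= k:
                reservoir[r - 1] = item     # replace item r in the reservoir
    return reservoir

print(reservoir_sample(range(1_000_000), k=5, seed=42))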
Feature selection and transformation

Feature Subset Selection
    Supervised feature selection
    Unsupervised feature selection
    Biased Sampling
    Stratified Sampling
Dimensionality reduction with axis rotation
    Principal Component Analysis
    Singular Value Decomposition
    Latent Semantic Analysis
Principal Component Analysis

Goal: Principal component analysis computes the most
meaningful basis to re-express a noisy, garbled data set.
The hope is that this new basis will filter out the noise and
reveal hidden dynamics

Normalize Input Data
Compute k orthonormal vectors to have a basis for the
normalized data
Sort these principal components
Eliminate components with low variance
Principal Component Analysis

Organize the data set X as an m × n matrix, where m is the
number of features and n is the number of instances.
Normalize Input Data: subtract off the mean for each
instance x_i
Calculate the SVD or the eigenvectors of the covariance
Find some orthonormal matrix P where Y = PX such that

    S_Y = (1/(n−1)) Y Y^T

is diagonalized.
The rows of P are the principal components of X.
Sort these principal components
Eliminate components with low variance
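A minimal NumPy sketch of these steps using the SVD, with X organized as features × instances as above; centering each feature (row) and the toy data are assumptions made for illustration:

import numpy as np

def pca(X, n_components):
    # X: m x n matrix (m features, n instances), as on the slide.
    Xc = X - X.mean(axis=1, keepdims=True)          # subtract the mean of each feature
    # SVD of the centered data; the left singular vectors give the directions
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = U.T[:n_components]                          # rows of P = principal components
    Y = P @ Xc                                      # re-expressed data
    explained_variance = (s ** 2) / (X.shape[1] - 1)
    return P, Y, explained_variance[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))
X[2] = 0.5 * X[0] + 0.1 * rng.normal(size=100)      # a nearly redundant feature
P, Y, var = pca(X, n_components=2)
print(var)    # variance captured by the kept components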
Clustering, classification and
evaluation

Mostafa H. Chehreghani

Mostafa.chehreghani@gmail.com
Clustering

Albert Bifet (@abifet)

Paris, 18 October 2015


albert.bifet@telecom-paristech.fr
Clustering

Definition
Clustering is the distribution of a set of instances or examples
into previously unknown groups according to some common relations or
affinities.

Example
Market segmentation of customers

Example
Social network communities
Clustering

Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
    f : I → {1, . . . , K}
that minimizes the objective function cost(I)

Clustering
Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function

    cost(C, I) = Σ_{x ∈ I} d²(x, C)

where
    d(x, c): distance function between x and c
    d²(x, C) = min_{c ∈ C} d²(x, c): distance from x to the nearest
    point in C
k-means

1. Choose k initial centers C = {c_1, . . . , c_k}
2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to each instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k

k-means++

1. Choose an initial center c_1
   For k = 2, . . . , K
     select c_k = p ∈ I with probability d²(p, C)/cost(C, I)
2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to each instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k
Performance Measures
Internal Measures
    Cluster Cohesion: Measures how closely related the
    objects in a cluster are
    Cluster Separation: Measures how distinct or well
    separated a cluster is from other clusters
    Silhouette Coefficient: 1 − a/b if a < b
        a = average distance of i to the points in its cluster
        b = min (average distance of i to points in another cluster)

External Measures
    Rand Measure
    F Measure
    Jaccard
    Purity
Distances

Numeric features
    Euclidean:

        d(x, y) = ||x − y||_2 = sqrt( Σ_i (x_i − y_i)² )

    Manhattan distance:

        d(x, y) = ||x − y||_1 = Σ_i |x_i − y_i|
Density based methods

DBSCAN
    ε-neighborhood(p): set of points that are at a distance from p
    less than or equal to ε
    Core object: object whose ε-neighborhood has an overall
    weight of at least MinPts
    A point p is directly density-reachable from q if
        p is in ε-neighborhood(q)
        q is a core object
    A point p is density-reachable from q if
        there is a chain of points p_1, . . . , p_n such that p_{i+1} is directly
        density-reachable from p_i
    A point p is density-connected to q if
        there is a point o such that p and q are density-reachable
        from o

Density based methods

DBSCAN
    A cluster C of points satisfies
        if p ∈ C and q is density-reachable from p, then q ∈ C
        all points p, q ∈ C are density-connected
    A cluster is uniquely determined by any of its core points
    A cluster can be obtained by
        choosing an arbitrary core point as a seed
        retrieving all points that are density-reachable from the seed

Density based methods

DBSCAN
    select an arbitrary point p
    retrieve all points density-reachable from p
    if p is a core point, a cluster is formed
    if p is a border point
        no points are density-reachable from p and
        DBSCAN visits the next point of the database
    continue the process until all of the points have been
    processed
DBSCAN

Figure: DBSCAN point example with MinPts = 3
BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES
    Clustering Features CF = (N, LS, SS)
        N: number of data points
        LS: linear sum of the N data points
        SS: square sum of the N data points
    Properties:
        Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        Easy to compute: average inter-cluster distance
        and average intra-cluster distance
    Uses a CF tree
        Height-balanced tree with two parameters
            B: branching factor
            T: radius leaf threshold
BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Phase 1: Scan all data and build an initial in-memory CF
tree
Phase 2: Condense into desirable range by building a
smaller CF tree (optional)
Phase 3: Global clustering
Phase 4: Cluster refining (optional and offline, as it requires
more passes)
BIRCH:
Balanced Iterative Reducing and Clustering using
Hierarchies

Tian Zhang, Raghu Ramakrishnan, Miron Livny

Presented by Zhao Li
2009, Spring
Introduction to BIRCH

Designed for very large data sets
Time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of data is necessary
Does not need the whole data set in advance

Two key phases:
Scans the database to build an in-memory tree
Applies clustering algorithm to cluster the leaf nodes

Similarity Metric(1)

Given a cluster of instances , we define:


Centroid:

Radius: average distance from member points to centroid

Diameter: average pair-wise distance within a cluster

Similarity Metric(2)

centroid Euclidean distance:


centroid Manhattan distance:

average inter-cluster:
average intra-cluster:
variance increase:

Clustering Feature

The Birch algorithm builds a dendrogram called clustering
feature tree (CF tree) while scanning the data set.

Each entry in the CF tree represents a cluster of objects
and is characterized by a 3-tuple: (N, LS, SS), where N is
the number of objects in the cluster and LS, SS are defined
in the following.

Properties of Clustering Feature

CF entry is more compact
Stores significantly less than all of the data points in
the sub-cluster

A CF entry has sufficient information to calculate
D0-D4

Additivity theorem allows us to merge sub-clusters
incrementally & consistently

CF-Tree

Each non-leaf node has at


most B entries
Each leaf node has at
most L CF entries,
each of which satisfies
threshold T

CF-Tree Insertion

Recurse down from root
Find the appropriate leaf
Follow the "closest"-CF path, w.r.t. D0 ... D4

Modify the leaf
If the closest-CF leaf cannot absorb, make a new CF
entry. If there is no room for new leaf, split the parent
node

Traverse back
Update CFs on the path or splitting nodes
CF-Tree Rebuilding

If we run out of space, increase threshold T
By increasing the threshold, CFs absorb more data

Rebuilding "pushes" CFs over
The larger T allows different CFs to group together

Reducibility theorem
Increasing T will result in a CF-tree smaller than the
original

Example of BIRCH

[Figure: CF tree with a root node, leaf nodes LN1-LN3, and subclusters sc1-sc8;
a new subcluster sc8 arrives and is inserted under LN1]
Insertion Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN1 is split.

[Figure: after inserting sc8, leaf node LN1 is split into two leaf nodes]
If the branching factor of a non-leaf node cannot exceed 3, then the root is
split and the height of the CF Tree increases by one.

[Figure: the root is split into two non-leaf nodes NLN1 and NLN2, which point
to the leaf nodes]
BIRCH Overview

[Figure: BIRCH overview diagram]
Experimental Results

Input parameters:
Memory (M): 5% of data set
Disk space (R): 20% of M
Distance equation: D2
Quality equation: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
Experimental Results

KMEANS clustering
DS   Time   D      # Scan      DS   Time   D      # Scan
1    43.9   2.09   289         1o   33.8   1.97   197
2    13.2   4.43   51          2o   12.7   4.20   29
3    32.9   3.66   187         3o   36.0   4.35   241

BIRCH clustering
DS   Time   D      # Scan      DS   Time   D      # Scan
1    11.5   1.87   2           1o   13.6   1.87   2
2    10.7   1.99   2           2o   12.1   1.99   2
Exam Questions

What is the main limitation of BIRCH?
Since each node in a CF tree can hold only a limited
number of entries due to its size, a CF tree node doesn't
always correspond to what a user may consider a natural
cluster. Moreover, if the clusters are not spherical in
shape, it doesn't perform well because it uses the notion
of radius or diameter to control the boundary of a
cluster.
Classification Evaluation

Albert Bifet (@abifet)

Paris, 27 September 2016


albert.bifet@telecom-paristech.fr
Evaluation

1. Error estimation: Hold-out or Cross-Validation
2. Evaluation performance measures: Accuracy or κ-statistic
3. Statistical significance validation: McNemar or Nemenyi test

Evaluation Framework
Error Estimation

Data available for testing
    Hold out an independent test set
    Apply the current decision model to the test set
    The loss estimated in the holdout is an unbiased estimator

Holdout Evaluation
1. Error Estimation

Not enough data available for testing
    Divide the dataset into 10 folds
    Repeat 10 times: use one fold for testing and the rest for
    training

k-fold Cross-validation
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example

    Precision = tp / (tp + fp)
    Recall = tp / (tp + fn)
    F1 = 2 · precision · recall / (precision + recall)
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example

    Accuracy = 75/100 + 10/100 = (75/83)·(83/100) + (10/17)·(17/100) = 85%
    Arithmetic mean = (75/83 + 10/17)/2 = 74.59%
    Geometric mean = sqrt( (75/83)·(10/17) ) = 72.90%
2. Performance Measures with Unbalanced Classes
Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example

Predicted Predicted
Class+ Class- Total
Correct Class+ 68.06 14.94 83
Correct Class- 13.94 3.06 17
Total 82 18 100
Table: Confusion matrix for chance predictor
2. Performance Measures with Unbalanced Classes
Kappa Statistic
    p0: classifier's prequential accuracy
    pc: probability that a chance classifier makes a correct
    prediction.
    κ statistic

        κ = (p0 − pc) / (1 − pc)

    κ = 1 if the classifier is always correct
    κ = 0 if the predictions coincide with the correct ones as
    often as those of the chance classifier

Matthews correlation coefficient (MCC)

    MCC = (tp · tn − fp · fn) / sqrt( (tp + fp)(tp + fn)(tn + fp)(tn + fn) )
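A small Python sketch computing κ and MCC from a confusion matrix, modelling the chance classifier with the observed marginal frequencies; the numbers below are the example matrix from the previous slides:

import math

def kappa_and_mcc(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    p0 = (tp + tn) / n                                  # observed accuracy
    # chance classifier: predicts classes with the observed marginal frequencies
    pc = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (p0 - pc) / (1 - pc)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return kappa, mcc

# confusion matrix from the example above: 75, 8 / 7, 10
print(kappa_and_mcc(tp=75, fn=8, fp=7, tn=10))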
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example

AUC: Area under the ROC curve

A ROC space is defined by FPR and TPR (recall)
    FPR = fp / (fp + tn)
    TPR = tp / (tp + fn)
3. Statistical significance validation (2 Classifiers)

                      Classifier A Class+   Classifier A Class-   Total
Classifier B Class+   c                     a                     c+a
Classifier B Class-   b                     d                     b+d
Total                 c+b                   a+d                   a+b+c+d

    M = (|a − b| − 1)² / (a + b)

The test follows the χ² distribution. At 0.99 confidence it rejects
the null hypothesis (the performances are equal) if M > 6.635.

McNemar test
3. Statistical significance validation (> 2 Classifiers)
Two classifiers are performing differently if the corresponding
average ranks differ by at least the critical difference

    CD = q_α · sqrt( k(k + 1) / (6N) )

    k is the number of learners, N is the number of datasets,
    critical values q_α are based on the Studentized range
    statistic divided by sqrt(2).

Nemenyi test

# classifiers   2       3       4       5       6       7
q0.05           1.960   2.343   2.569   2.728   2.850   2.949
q0.10           1.645   2.052   2.291   2.459   2.589   2.693
Table: Critical values for the Nemenyi test
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
How to organize the Web?
First try: Human curated
Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval attempts to
find relevant docs in a small
and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents,
random things, web spam, etc.
So we need a good way to rank webpages!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
2 challenges of web search:
(1) Web contains many sources of information
Who to trust?
Insight: Trustworthy pages may point to each other!
(2) What is the best answer to query
newspaper?
No single right answer
Insight: Pages that actually know about newspapers
might all be pointing to many newspapers

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
All web pages are not equally important
www.joe-schmoe.com vs. www.stanford.edu

We already know:
There is large diversity
in the web-graph vs.
node connectivity.
So, lets rank the pages
using the web graph
link structure!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4
We will cover the following Link Analysis
approaches to computing importance of
nodes in a graph:
    Hubs and Authorities (HITS)
    PageRank
    Random Walk with Restarts

Sidenote: Various notions of centrality of a node v:
    Degree centrality = degree of v
    Betweenness centrality = # shortest paths passing through v
    Closeness centrality = avg. length of shortest paths from v to
    all other nodes of the network
    Eigenvector centrality = like PageRank
Goal (back to the newspaper example):
Don't just find newspapers. Find "experts": pages that
link in a coordinated way to good newspapers
Idea: Links as votes
    A page is more important if it has more links
    In-coming links? Out-going links?
Hubs and Authorities
Each page has 2 scores:
    Quality as an expert (hub):
        Total sum of votes of pages pointed to
    Quality as a content provider (authority):
        Total sum of votes of experts
Principle of repeated improvement

[Figure: example votes NYT: 10, CNN: 8, WSJ: 9, Ebay: 3, Yahoo: 3]
Interesting pages fall into two classes:
1. Authorities are pages containing useful information
    Newspaper home pages
    Course home pages
    Home pages of auto manufacturers
2. Hubs are pages that link to authorities
    List of newspapers
    Course bulletin
    List of U.S. auto manufacturers

[Figure: example votes NYT: 10, CNN: 8, WSJ: 9, Ebay: 3, Yahoo: 3]
Each page starts with hub score 1
Authorities collect their votes

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and the authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9
Hubs collect authority scores

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10
Authorities collect hub scores

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11
A good hub links to many good authorities
A good authority is linked from many good hubs
Note a self-reinforcing recursive definition

Model using two scores for each node:
    Hub score and Authority score
    Represented as vectors h and a, where the i-th
    element is the hub/authority score of the i-th node
[Kleinberg '98]

Each page i has 2 scores:
    Authority score: a_i
    Hub score: h_i

Convergence criteria:
    Σ_i ( a_i^(t+1) − a_i^(t) )² < ε
    Σ_i ( h_i^(t+1) − h_i^(t) )² < ε

HITS algorithm:
    Initialize: a_j^(0) = 1/√n, h_j^(0) = 1/√n
    Then keep iterating until convergence:
        ∀i: Authority: a_i^(t+1) = Σ_{j→i} h_j^(t)
        ∀i: Hub: h_i^(t+1) = Σ_{i→j} a_j^(t)
        Normalize: Σ_i ( a_i^(t+1) )² = 1,  Σ_j ( h_j^(t+1) )² = 1
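A minimal NumPy sketch of this iteration; the small adjacency matrix is an illustrative input and normalization uses the L2 norm as on the slide:

import numpy as np

def hits(A, n_iters=100, tol=1e-8):
    # A: n x n adjacency matrix, A[i, j] = 1 if page i links to page j.
    n = A.shape[0]
    a = np.full(n, 1 / np.sqrt(n))       # authority scores
    h = np.full(n, 1 / np.sqrt(n))       # hub scores
    for _ in range(n_iters):
        a_new = A.T @ h                   # authority: sum of hub scores of in-links
        h_new = A @ a_new                 # hub: sum of authority scores of out-links
        a_new /= np.linalg.norm(a_new)    # normalize
        h_new /= np.linalg.norm(h_new)
        converged = np.allclose(a_new, a, atol=tol) and np.allclose(h_new, h, atol=tol)
        a, h = a_new, h_new
        if converged:
            break
    return h, a

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
hubs, auths = hits(A)
print(hubs.round(3), auths.round(3))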
[Kleinberg '98]
Details!

HITS in the vector notation:
    Vector a = (a_1, ..., a_n), h = (h_1, ..., h_n)
    Adjacency matrix A (n x n): A_ij = 1 if i → j, 0 otherwise
    Can rewrite h_i = Σ_{i→j} a_j as h_i = Σ_j A_ij · a_j
    So: h = A · a   and similarly: a = A^T · h
    Repeat until convergence:
        h^(t+1) = A · a^(t)
        a^(t+1) = A^T · h^(t)
        Normalize a^(t+1) and h^(t+1)
Details!

What is a = A^T · h? Then: a = A^T (A · a)
    a is updated (in 2 steps):
        h = A · a, then a = A^T · h = (A^T A) · a
    h is updated (in 2 steps):
        a = A^T · h, then h = A · a = (A A^T) · h
Thus, in 2k steps:
    a = (A^T A)^k · a
    h = (A A^T)^k · h
Repeated matrix powering
Definition: Eigenvectors & Eigenvalues
    Let A x = λ x
    for some scalar λ, vector x, matrix A
    Then x is an eigenvector, and λ is its eigenvalue
The steady state (HITS has converged) is:
    c · a = A^T A · a
    c′ · h = A A^T · h
    (Note: constants c, c′ don't matter as we
    normalize them out every step of HITS)

So, authority a is the eigenvector of A^T A
(associated with the largest eigenvalue)
Similarly: hub h is the eigenvector of A A^T
Still the same idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes:
www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link

Are all in-links equal?


Links from important pages count more
Recursive question!

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 18
A vote from an important page is worth more:
    Each link's vote is proportional to the importance of its source page
    If page i with importance r_i has d_i out-links, each link gets r_i / d_i votes
    Page j's own importance r_j is the sum of the votes on its in-links

[Figure: pages i and k link to j; i has 3 out-links, k has 4 out-links]

    r_j = r_i/3 + r_k/4
A page is important if it is pointed to by other important pages
Define a rank r_j for node j:

    r_j = Σ_{i→j} r_i / d_i        d_i ... out-degree of node i

The web in 1839 [Figure: three pages y, a, m with links y→y, y→a, a→y, a→m, m→a]

Flow equations:
    r_y = r_y/2 + r_a/2
    r_a = r_y/2 + r_m
    r_m = r_a/2

You might wonder: Let's just use Gaussian elimination
to solve this system of linear equations. Bad idea!
Stochastic adjacency matrix M
    Let page i have d_i out-links
    If i → j, then M_ji = 1/d_i
    M is a column stochastic matrix: columns sum to 1

Rank vector r: an entry per page
    r_i is the importance score of page i
    Σ_i r_i = 1

The flow equations can be written as

    r = M · r        i.e.   r_j = Σ_{i→j} r_i / d_i
Imagine a random web surfer:
    At any time t, the surfer is on some page i
    At time t+1, the surfer follows an out-link from i uniformly at random
    Ends up on some page j linked from i
    Process repeats indefinitely

    r_j = Σ_{i→j} r_i / d_out(i)

Let:
    p(t) ... vector whose i-th coordinate is the
    prob. that the surfer is at page i at time t
    So, p(t) is a probability distribution over pages

Where is the surfer at time t+1?
    Follows a link uniformly at random:
        p(t+1) = M · p(t)
Suppose the random walk reaches a state
    p(t+1) = M · p(t) = p(t)
then p(t) is a stationary distribution of the random walk
Our original rank vector r satisfies r = M · r
    So, r is a stationary distribution for the random walk
Given a web graph with n nodes, where the
nodes are pages and edges are hyperlinks:
    Assign each node an initial page rank
    Repeat until convergence ( Σ_i |r_i^(t+1) − r_i^(t)| < ε ):
        Calculate the page rank of each node:

            r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

        d_i ... out-degree of node i
Power Iteration:
    Set r_j ← 1/N
    1: r′_j ← Σ_{i→j} r_i / d_i
    2: r ← r′
    If |r − r′| > ε: goto 1

[Stochastic matrix M for the y, a, m graph:
        y    a    m
    y   1/2  1/2  0
    a   1/2  0    1
    m   0    1/2  0 ]

Flow equations:
    r_y = r_y/2 + r_a/2
    r_a = r_y/2 + r_m
    r_m = r_a/2

Example:
    r_y     1/3   1/3   5/12   9/24   ...   6/15
    r_a  =  1/3   3/6   1/3    11/24  ...   6/15
    r_m     1/3   1/6   3/12   1/6    ...   3/15
    Iteration 0, 1, 2, ...
    r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i     or equivalently    r = M · r

Does this converge?

Does it converge to what we want?

Are results reasonable?
The "Spider trap" problem:

[Graph: a → b, b → b (self-loop)]

    r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

Example:
    Iteration:  0, 1, 2, 3, ...
    r_a =       1  0  0  0
    r_b =       0  1  1  1
The "Dead end" problem:

[Graph: a → b, b has no out-links]

    r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

Example:
    Iteration:  0, 1, 2, 3, ...
    r_a =       1  0  0  0
    r_b =       0  1  0  0
2 problems:
(1) Some pages are
dead ends (have no out-links)
Such pages cause
importance to leak out

(2) Spider traps


(all out-links are within the group)
Eventually spider traps absorb all importance

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31
Power Iteration:
    Set r_j = 1/N
    r_j = Σ_{i→j} r_i / d_i
    And iterate

[Graph y, a, m where m now links only to itself (a spider trap)]

    r_y = r_y/2 + r_a/2
    r_a = r_y/2
    r_m = r_a/2 + r_m

Example:
    r_y     1/3   2/6   3/12   5/24    ...   0
    r_a  =  1/3   1/6   2/12   3/24    ...   0
    r_m     1/3   3/6   7/12   16/24   ...   1
    Iteration 0, 1, 2, ...
The Google solution for spider traps: At each
time step, the random surfer has two options
    With prob. β, follow a link at random
    With prob. 1−β, jump to a random page
    Common values for β are in the range 0.8 to 0.9
The surfer will teleport out of a spider trap within a
few time steps

[Figure: the y, a, m graph with random teleport links added]
Power Iteration:
    Set r_j = 1/N
    r_j = Σ_{i→j} r_i / d_i
    And iterate

[Graph y, a, m where m is a dead end (no out-links)]

    r_y = r_y/2 + r_a/2
    r_a = r_y/2
    r_m = r_a/2

Example:
    r_y     1/3   2/6   3/12   5/24   ...   0
    r_a  =  1/3   1/6   2/12   3/24   ...   0
    r_m     1/3   1/6   1/12   2/24   ...   0
    Iteration 0, 1, 2, ...
Teleports: Follow random teleport links with
probability 1.0 from dead-ends
    Adjust the matrix accordingly

[Figure: the column of the stochastic matrix for dead-end node m changes from
all zeros to all 1/3]
Google's solution: At each step, the random
surfer has two options:
    With probability β, follow a link at random
    With probability 1−β, jump to some random page
PageRank equation [Brin-Page, '98]:

    r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N

    d_i ... out-degree of node i

The above formulation assumes that M has no dead ends. We can
either preprocess matrix M (bad!) or explicitly follow random teleport
links with probability 1.0 from dead-ends. See P. Berkhin, "A Survey
on PageRank Computing", Internet Mathematics, 2005.
Details!

PageRank as a principal eigenvector:

    r = M · r   or equivalently   r_j = Σ_{i→j} r_i / d_i

But we really want (**):

    r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N

Let's define the Google Matrix A:

    A = β M + (1 − β) [1/N]_{N×N}

Now we get what we want:

    r = A · r

Note: M is a sparse matrix but A is dense (all entries ≥ 0). In
practice we never materialize A but rather use the sum formulation (**).
What is β? In practice 1−β ≈ 0.15 (jump approx. every 5-6 links)
Input: Graph G and parameter β
    Directed graph G with spider traps and dead ends
    Parameter β
Output: PageRank vector r

    Set: r_j^(0) = 1/N, t = 1
    do:
        ∀j:  r′_j^(t) = Σ_{i→j} β · r_i^(t−1) / d_i
             r′_j^(t) = 0 if in-degree of j is 0

        Now re-insert the leaked PageRank:
        ∀j:  r_j^(t) = r′_j^(t) + (1 − S)/N    where S = Σ_j r′_j^(t)

        t = t + 1
    while Σ_j | r_j^(t) − r_j^(t−1) | > ε
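A minimal Python sketch of this algorithm (teleport with β, leaked PageRank re-inserted uniformly); the y, a, m example graph from the earlier slides is used as input:

import numpy as np

def pagerank(out_links, beta=0.85, eps=1e-10):
    # out_links: dict {node: list of nodes it links to}; dead ends are handled
    # by re-inserting the "leaked" PageRank uniformly, as in the algorithm above.
    nodes = sorted(out_links)
    idx = {v: i for i, v in enumerate(nodes)}
    N = len(nodes)
    r = np.full(N, 1.0 / N)
    while True:
        r_new = np.zeros(N)
        for i, outs in out_links.items():
            for j in outs:
                r_new[idx[j]] += beta * r[idx[i]] / len(outs)
        r_new += (1.0 - r_new.sum()) / N        # re-insert leaked PageRank
        if np.abs(r_new - r).sum() < eps:
            return dict(zip(nodes, r_new))
        r = r_new

# the y, a, m example: y -> {y, a}, a -> {y, m}, m -> {a}
print(pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}))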
PageRank and HITS are two solutions to the
same problem:
What is the value of an in-link from u to v?
In the PageRank model, the value of the link
depends on the links into u
In the HITS model, it depends on the value of the
other links out of u

The destinies of PageRank and HITS


post-1998 were very different

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40
[Tong-Faloutsos, '06]

[Figure: example graph with nodes A-J and unit edge weights]

a.k.a.: Relevance, Closeness, Similarity
Given:
    Conferences-to-authors graph
Goal:
    Proximity on graphs
Q: What is the most related conference to ICDM?

[Figure: bipartite graph of conferences (IJCAI, KDD, ICDM, SDM, AAAI, NIPS)
and authors (Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan)]
Shortest path is not good:

No influence for degree-1 nodes (E, F, G)!


Multi-faceted relationships
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 44
Network Flow is not good:

Does not punish long paths

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 45
[Tong-Faloutsos, '06]

[Figure: example graph with nodes A-J and unit edge weights]

    Multiple Connections
    Quality of connection
        Direct & In-direct connections
        Length, Degree, Weight
[Figure: example graph with 12 numbered nodes]
Goal: Evaluate pages not just by popularity
but by how close they are to the topic
Teleporting can go to:
    Any page with equal probability
        PageRank (we used this so far)
    A topic-specific set of relevant pages S
        Topic-specific (personalized) PageRank (S ... teleport set)

    A_ij = β M_ij + (1 − β)/|S|   if i ∈ S
    A_ij = β M_ij                 otherwise

Random Walk with Restart: S is a single element
Graphs and web search:
    Ranks nodes by importance
Personalized PageRank:
    Ranks proximity of nodes to the teleport nodes S
Proximity on graphs:
    Q: What is the most related conference to ICDM?
    Random Walks with Restarts
        Teleport back to the starting node: S = { single node }

[Figure: bipartite graph of conferences (IJCAI, KDD, ICDM, SDM, AAAI, NIPS)
and authors (Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan)]
[Figure: Random Walk with Restarts scores relative to query Node 4:
Node 1: 0.13, Node 2: 0.10, Node 3: 0.13, Node 5: 0.13, Node 6: 0.05,
Node 7: 0.05, Node 8: 0.08, Node 9: 0.04, Node 10: 0.03, Node 11: 0.04,
Node 12: 0.02]

Nearby nodes get higher scores (more red, more relevant); the scores form a
ranking vector
[Figure: conferences closest to ICDM by Random Walk with Restart proximity:
KDD, SDM, PAKDD, PKDD, ICML, CIKM, ICDE, ECML, SIGMOD, DMKD, with scores
between 0.004 and 0.011]
Q: Which conferences are closest to KDD & ICDM?
A: Personalized PageRank with teleport set S = {KDD, ICDM}
   on the graph of CS conferences
Pins belong to Boards

[Figures: Pinterest examples; the input is a query pin and the output is a
ranked list of related pins]

Proximity to query node(s) Q:
    Bipartite Pin and Board graph
Pixie Random Walk
Proximity to query node(s) Q:

[Figure: Pixie random walk visit counts from query pin Q across pins on the
boards "Yummm", "Strawberries", "Smoothies", "Smoothie Madness!!"]
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Observations                        Models                            Algorithms

Small diameter,                     Erdős–Rényi model,                Decentralized search
Edge clustering                     Small-world model

Patterns of signed                  Structural balance,               Models for predicting
edge creation                       Theory of status                  edge signs

Viral Marketing, Blogosphere,       Independent cascade model,        Influence maximization,
Memetracking                        Game theoretic model              Outbreak detection, LIM

Scale-Free                          Preferential attachment,          PageRank, Hubs and
                                    Copying model                     authorities

Densification power law,            Microscopic model of              Link prediction,
Shrinking diameters                 evolving networks                 Supervised random walks

Strength of weak ties,              Kronecker Graphs                  Community detection:
Core-periphery                                                        Girvan-Newman, Modularity
We often think of networks looking
like this:

What lead to such a conceptual picture?


11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
How information flows through the network?
What structurally distinct roles do nodes play?
What roles do different links (short vs. long) play?
How people find out about new jobs?
Mark Granovetter, part of his PhD in 1960s
People find the information through personal contacts
But: Contacts were often acquaintances
rather than close friends
This is surprising: One would expect your friends to
help you out more than casual acquaintances
Why is it that acquaintances are most helpful?
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4
[Granovetter 73]

Two perspectives on friendships:


Structural: Friendships span different parts of the
network
Interpersonal: Friendship between two people is
either strong or weak
Structural role: Triadic Closure
a If two people in a
network have a friend in
common, then there is
an increased likelihood
b they will become friends
c
themselves.
Which edge is more
likely, a-b or a-c?
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5
Granovetter makes a connection between the
social and structural role of an edge
First point: Structure
    Structurally embedded edges are also socially strong
    Long-range edges spanning different parts of the
    network are socially weak
Second point: Information
    Long-range edges allow you to gather information
    from different parts of the network and get a job
    Structurally embedded edges are
    heavily redundant in terms of
    information access

[Figure: network with strong (S) embedded edges and a weak (W) long-range
edge a-b]
Triadic closure == High clustering coefficient
Reasons for triadic closure:
If B and C have a common friend A, then:
    B is more likely to meet C
    (since they both spend time with A)
    B and C trust each other
    (since they have a friend in common)
    A has incentive to bring B and C together
    (as it is hard for A to maintain two disjoint relationships)
Empirical study by Bearman and Moody:
    Teenage girls with low clustering coefficient are
    more likely to contemplate suicide
Define: Bridge edge
    If removed, it disconnects the graph

Define: Local bridge
    Edge of span > 2
    (The span of an edge is the distance of the edge endpoints if the edge is
    deleted. Local bridges with long span are like real bridges.)

Define: Two types of edges:
    Strong (friend), Weak (acquaintance)

Define: Strong triadic closure:
    Two strong ties imply a third edge

Fact: If strong triadic closure is satisfied then local bridges are weak ties!

[Figures: a bridge edge a-b, a local bridge a-b, and a network with strong (S)
and weak (W) edge labels]
Claim: If a node A satisfies Strong Triadic Closure
and is involved in at least two strong ties, then
any local bridge adjacent to A must be a weak tie.

Proof by contradiction:
    Assume A satisfies Strong Triadic Closure and has 2 strong ties
    Let A-B be a local bridge and a strong tie
    Then the edge B-C must exist because of Strong Triadic Closure
    But then A-B is not a local bridge!
    (since B-C must be connected due to the Strong Triadic Closure property,
    the span of A-B is only 2)
For many years Granovetters theory was not
tested
But, today we have large who-talks-to-whom
graphs:
Email, Messenger, Cell phones, Facebook

Onnela et al. 2007:


Cell-phone network of 20% of countrys population
Edge strength: # phone calls

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10
Edge overlap:

    O_ij = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|

    N(i) ... the set of neighbors of node i

Overlap = 0 when an edge is a local bridge
Cell-phone network
Observation:
    Highly used links have high neighborhood overlap!

Legend:
    True: The data
    Permuted strengths: Keep the network structure
    but randomly reassign edge strengths

[Figure: neighborhood overlap vs. edge strength (#calls) for the true and
permuted-strength networks]

[Figure: real edge strengths in the mobile call graph;
strong ties are more embedded (have higher overlap)]

[Figure: same network, same set of edge strengths,
but now strengths are randomly shuffled]
[Figure: size of the largest component vs. fraction of removed links, removing
links by strength (#calls), low-to-high vs. high-to-low; removing the low-strength
links first disconnects the network sooner. Conceptual picture of network structure.]

[Figure: size of the largest component vs. fraction of removed links, removing
links based on overlap, low-to-high vs. high-to-low; removing the low-overlap
links first disconnects the network sooner. Conceptual picture of network structure.]
Granovetter's theory leads to the following
conceptual picture of networks

[Figure: conceptual network with strong ties inside groups and weak ties
between groups]
Granovetter's theory suggests that networks
are composed of tightly connected
sets of nodes

Network communities (also called clusters, groups, modules):
Sets of nodes with lots of connections inside and
few to the outside (the rest of the network)
How to automatically
find such densely
connected groups of
nodes?
Ideally such automatically
detected clusters would
then correspond to real
groups
For example (communities, clusters, groups, modules):

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20
Zachary's Karate club network:
Observe social ties and rivalries in a university karate club.
During his observation, conflicts led the group to split.
The split could be explained by a minimum cut in the network.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21
Find micro-markets by partitioning the
query x advertiser graph:
query

advertiser

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 22
Can we identify
node groups?
(communities,
modules, clusters)

Nodes: Teams
Edges: Games played
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 23
NCAA conferences

Nodes: Teams
Edges: Games played
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 24
Can we identify
social communities?

Nodes: Users
Edges: Friendships
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25
High school Company

Stanford (Basketball)
Stanford (Squash)

Nodes: Users
Social communities Edges: Friendships
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26
Can we identify
functional modules?

Nodes: Proteins
Edges: Interactions
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27
Functional modules

Nodes: Proteins
Edges: Interactions
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 28
How to find communities?

We will work with undirected (unweighted) networks


11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 29
Edge betweenness: the number of shortest paths passing over the edge (example values in the figure: b=16, b=7.5).
Intuition (figures): edge strengths (call volume) in a real network vs. edge betweenness in a real network.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 30
[Girvan-Newman 02]

Divisive hierarchical clustering based on the


notion of edge betweenness:
Number of shortest paths passing through the edge
Girvan-Newman Algorithm:
Undirected unweighted networks
Repeat until no edges are left:
Calculate betweenness of edges
Remove edges with highest betweenness
Connected components are communities
Gives a hierarchical decomposition of the network

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31
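A compact sketch of the Girvan-Newman loop for an undirected, unweighted graph. It uses networkx only to compute edge betweenness (that library choice is ours; the slides do not prescribe one), and records the communities after each removal round:

import networkx as nx

def girvan_newman_levels(G):
    G = G.copy()
    levels = []
    while G.number_of_edges() > 0:
        # re-compute betweenness after every removal, as the slides stress
        betweenness = nx.edge_betweenness_centrality(G)
        max_b = max(betweenness.values())
        for edge, b in list(betweenness.items()):
            if b == max_b:
                G.remove_edge(*edge)
        levels.append([sorted(c) for c in nx.connected_components(G)])
    return levels    # hierarchical decomposition, from few large to many small communities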
(Figure: example edge betweenness values 12, 1, 33 and 49 on the bridging edges.)
Need to re-compute betweenness at every step.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32
(Figure: Step 1, Step 2 and Step 3 of the edge removal, and the resulting hierarchical network decomposition.)
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 33
1. How to compute betweenness?
2. How to select the number of
clusters?

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 34
Want to compute the betweenness of paths starting at a given node:
do a breadth-first search starting from that node.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 35
Forward step: count the number of shortest paths from the starting node to all other nodes of the network.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 36
Backward step: compute betweenness. If there are multiple shortest paths, count them fractionally.
The algorithm:
Add edge flows: node flow = 1 + sum of child edge flows; split the flow up based on the parent values.
Repeat the BFS procedure for each starting node.
(Figure: e.g. 1+1 paths to H, split evenly; 1+0.5 paths to J, split 1:2; 1 path to K, split evenly.)
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 37
1. How to compute betweenness?
2. How to select the number of
clusters?

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 39
Communities: sets of
tightly connected nodes
Define: Modularity
A measure of how well
a network is partitioned
into communities
Given a partitioning of the network into groups s ∈ S:
Q ∝ Σ_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]
Need a null model!
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40
Given a real graph G on n nodes and m edges, construct a rewired network G':
Same degree distribution, but otherwise random connections
Consider G' as a multigraph
The expected number of edges between nodes i and j of degrees k_i and k_j equals k_i·k_j / 2m
The expected number of edges in the (multigraph) G':
(1/2) Σ_i Σ_j k_i·k_j / 2m = (1/2)·(1/2m) Σ_i k_i · Σ_j k_j = (1/4m)·2m·2m = m
Note: Σ_i k_i = 2m
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 41
Modularity of a partitioning S of graph G:
Q ∝ Σ_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]
Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i·k_j / 2m )
where A_ij = 1 if i→j and 0 otherwise, and 1/2m is a normalizing constant so that −1 < Q < 1.
Modularity values take the range [−1, 1].
Q is positive if the number of edges within groups exceeds the expected number.
Q in the range 0.3-0.7 means significant community structure.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 42
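The definition above can be evaluated directly: for each group s, Q gains L_s/m − (d_s/2m)^2, where L_s is the number of edges inside s and d_s the total degree of s. A small sketch (our own function and variable names; adj is a dict of neighbour sets, groups a list of node sets):

def modularity(adj, groups):
    m = sum(len(nbrs) for nbrs in adj.values()) / 2          # number of edges
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    Q = 0.0
    for s in groups:
        L_s = sum(1 for v in s for u in adj[v] if u in s) / 2  # edges inside s
        d_s = sum(deg[v] for v in s)                           # total degree of s
        Q += L_s / m - (d_s / (2 * m)) ** 2
    return Q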
Modularity is useful for selecting the
number of clusters: Q

Why not optimize Modularity directly?


11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 43
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 44
Let's split the graph into 2 communities!
We want to directly optimize modularity:
max_S Q(G, S) = (1/2m) Σ_ij ( A_ij − k_i·k_j / 2m ) δ(c_i, c_j)
Community membership vector s:
s_i = +1 if node i is in community 1, s_i = −1 if node i is in community 2
Note that (s_i·s_j + 1)/2 = 1 if s_i = s_j and 0 otherwise, so
Q(G, s) = (1/2m) Σ_ij ( A_ij − k_i·k_j / 2m ) · (s_i·s_j + 1)/2
        = (1/4m) Σ_ij ( A_ij − k_i·k_j / 2m ) s_i s_j
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 45
Define the modularity matrix B: B_ij = A_ij − k_i·k_j / 2m
Note: each row/column of B sums to 0, since Σ_j A_ij = k_i and Σ_j k_i·k_j / 2m = k_i.
Membership vector: s ∈ {−1, +1}^n
Then:
Q(G, s) = (1/4m) Σ_ij ( A_ij − k_i·k_j / 2m ) s_i s_j
        = (1/4m) Σ_ij B_ij s_i s_j = (1/4m) s^T B s
Task: find s ∈ {−1, +1}^n that maximizes Q(G, s)
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 46
Consider a symmetric matrix A that is positive semi-definite: x^T A x ≥ 0.
Then the solutions x_i, λ_i of the equation A x = λ x are:
Eigenvectors x_i, ordered by the magnitude of their corresponding eigenvalues λ_i (λ_1 ≤ λ_2 ≤ ... ≤ λ_n);
they are orthonormal (orthogonal and of unit length) and form a coordinate system (basis).
If A is positive semi-definite, λ_i ≥ 0 (and the eigenvalues always exist).
Eigen-decomposition theorem: the matrix can be rewritten in terms of its eigenvectors and eigenvalues: A = Σ_i λ_i x_i x_i^T
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 47
Rewrite Q(G, s) = (1/4m) s^T B s in terms of the eigenvectors and eigenvalues of B:
Q(G, s) = (1/4m) s^T ( Σ_i λ_i x_i x_i^T ) s = (1/4m) Σ_i (x_i^T s)^2 λ_i
So, if there were no other constraints on s, then to maximize Q we would set s = x_n,
the eigenvector associated with the largest eigenvalue λ_n.
Why? Because s has fixed length, this assigns all the weight in the sum to λ_n (the largest eigenvalue);
all other terms are zero because of orthonormality.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 48
Let's consider only the leading term in the summation (because λ_n is the largest):
max_s Q(G, s) ≈ (1/4m) (x_n^T s)^2 λ_n
So let's maximize x_n^T s = Σ_j x_n,j · s_j, where s_j ∈ {−1, +1}.
To do this, we set:
s_j = +1 if x_n,j ≥ 0 (the j-th coordinate of x_n is non-negative)
s_j = −1 if x_n,j < 0 (the j-th coordinate of x_n is negative)
Continue the bisection hierarchically.
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 49
Fast Modularity Optimization Algorithm:
Find leading eigenvector of modularity matrix B
Divide the nodes by the signs of the elements of
Repeat hierarchically until:
If a proposed split does not cause modularity to increase,
declare community indivisible and do not split it
If all communities are indivisible, stop
How to find x_n? Power method!
Start with a random v(0) and repeat: v(t+1) = B·v(t) / ||B·v(t)||
When converged (v(t) ≈ v(t+1)), set x_n = v(t)

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 50
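A sketch of one spectral bisection step with numpy (our own function names; B is built explicitly for clarity, and we shift B so the power iteration reliably converges to the leading eigenvector, a standard trick not spelled out in the slides):

import numpy as np

def leading_eigenvector(B, iters=1000):
    # shift so all eigenvalues are non-negative; the ordering is preserved,
    # so the dominant eigenvector of the shifted matrix is x_n of B
    shift = np.abs(B).sum(axis=1).max()
    Bs = B + shift * np.eye(B.shape[0])
    v = np.random.rand(B.shape[0])
    for _ in range(iters):              # power iteration: v <- B v / ||B v||
        w = Bs @ v
        v = w / np.linalg.norm(w)
    return v

def spectral_split(A):
    """A: symmetric 0/1 adjacency matrix. Returns the +1/-1 membership vector s."""
    k = A.sum(axis=1)
    B = A - np.outer(k, k) / k.sum()    # modularity matrix, with 2m = k.sum()
    x_n = leading_eigenvector(B)
    return np.where(x_n >= 0, 1, -1)    # divide the nodes by the signs of x_n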
Girvan-Newman:
Based on the strength of weak ties
Remove edge of highest betweenness
Modularity:
Overall quality of the partitioning of a graph
Use to determine the number of communities
Fast modularity optimization:
Transform the modularity optimization into an eigenvalue problem

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 51
[Ron Burt]

Who is better off, Robert or James?


11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 53
Few structural holes Many structural holes

Structural Holes provide ego with access


to novel information, power, freedom
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 54
The network constraint measure [Burt]:
To what extent are a person's contacts redundant?
Low: disconnected contacts
High: contacts that are close or strongly tied
c_i = Σ_j c_ij = Σ_j ( p_ij + Σ_k p_ik·p_kj )^2 ,  k ≠ i, j
where p_ij is the proportion of i's energy (time) invested in the relationship with j.
Example matrix of p_ij for the small five-node network in the figure:
     1    2    3    4    5
1  .00  .25  .25  .25  .25
2  .50  .00  .00  .00  .50
3  1.0  .00  .00  .00  .00
4  .50  .00  .00  .00  .50
5  .33  .33  .00  .33  .00
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 55
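A direct implementation of the constraint formula above for an unweighted, undirected graph given as a dict of neighbour sets (a sketch; the function and variable names are our own). For such a graph, p_ij = 1/deg(i) when j is a contact of i, which reproduces the example matrix above:

def constraint(adj, i):
    def p(a, b):
        return 1.0 / len(adj[a]) if b in adj[a] else 0.0
    c_i = 0.0
    for j in adj[i]:
        # redundancy of contact j: direct investment plus indirect paths via q
        indirect = sum(p(i, q) * p(q, j) for q in adj[i] if q not in (i, j))
        c_i += (p(i, j) + indirect) ** 2
    return c_i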
Constraint: To what
extent are persons
contacts redundant
Low: disconnected
contacts
High: contacts that
are close or strongly
tied
Network constraint:
James: = 0.309
Robert: = 0.148
11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 56
[Ron Burt]

11/14/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 57
Machine Learning and
Data Mining
Introduction

Albert Bifet(@abifet)
Data Science

Data Science is an interdisciplinary eld focused on


extracting knowledge or insights from large volumes
of data.
Data Scientist

Figure: http://www.marketingdistillery.com/2014/11/29/
is-data-science-a-buzzword-modern-data-scientist-defined/
Data Science

Figure: Drew Convays Venn diagram


Classication

Denition
Given nC different classes, a classier algorithm builds a model
that predicts for every unlabelled instance I the class C to which
it belongs with accuracy.

Example
A spam lter

Example
Twitter Sentiment analysis: analyze tweets with positive or
negative feelings
Classication
Example
Contains Domain Has Time
Data set that Money type attach. received spam
describes e-mail yes com yes night yes
features for yes edu no night yes
deciding if it is no com yes night yes
spam. no edu no day no
no com no day no
yes cat no day yes

Assume we have to classify the following new instance:


Contains Domain Has Time
Money type attach. received spam
yes edu yes day ?
k-Nearest Neighbours

k-NN Classier
Training: store all instances in memory
Prediction:
Find the k nearest instances
Output majority class of these k instances
Bayes Classiers
Nave Bayes

Based on Bayes Theorem:

P(c)P(d|c)
P(c|d) =
P(d)

prior likelikood
posterior =
evidence
Estimates the probability of observing attribute a and the
prior probability P(c)
Probability of class c given an instance d:

P(c) a2d P(a|c)


P(c|d) =
P(d)
Bayes Classiers

Multinomial Nave Bayes



Considers a document as a bag-of-words.
Estimates the probability of observing word w and the prior
probability P(c)
Probability of class c given a test document d:

P(c) w2d P(w|c)nwd


P(c|d) =
P(d)
Perceptron

Attribute w

Attribute w

Attribute w Output h~w (~xi )

Attribute w

Attribute w

Data stream: h~xi , yi i


Classical perceptron: h~w (~xi ) = sgn(~wT~xi ),
Minimize Mean-square error: J(~w) = (yi h~w (~xi ))
Perceptron
Attribute w

Attribute w

Attribute w Output h~w (~xi )

Attribute w

Attribute w

We use sigmoid function h~w = s (~wT~x) where

s (x) = /( + e x )

s 0 (x) = s (x)( s (x))


Perceptron

Minimize Mean-square error: J(~w) = (yi h~w (~xi ))


Stochastic Gradient Descent: ~w = ~w hJ~xi
Gradient of the error function:

J = (yi h~w (~xi ))h~w (~xi )


i

h~w (~xi ) = h~w (~xi )( h~w (~xi ))


Weight update rule

~w = ~w + h (yi h~w (~xi ))h~w (~xi )( h~w (~xi ))~xi


i
Restricted Boltzmann Machines (RBMs)
z1 z2 z3 z4

x1 x2 x3 x4 x5

Energy-based models, where

E(~x,~z)
P(~x,~z) e .

Manipulate a weight matrix W to nd low-energy states


and thus generate high probability P(~x,~z), where

E(~x,~z) = W.

RBMs can be stacked on top of each other to form


so-called Deep Belief Networks (DBNs)
Classication
Example
Contains Domain Has Time
Data set that Money type attach. received spam
describes e-mail yes com yes night yes
features for yes edu no night yes
deciding if it is no com yes night yes
spam. no edu no day no
no com no day no
yes cat no day yes

Assume we have to classify the following new instance:


Contains Domain Has Time
Money type attach. received spam
yes edu yes day ?
Classication

Assume we have to classify the following new instance:


Contains Domain Has Time
Money type attach. received spam
yes edu yes day ?

Time
Day Night
Contains Money YES

Yes No
YES NO
Decision Trees

Basic induction strategy:


A the best decision attribute for next node
Assign A as decision attribute for node
For each value of A, create new descendant of node
Sort training examples to leaf nodes
If training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
Bagging

Example
Dataset of instances: A, B, C, D

Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C

Bagging builds a set of M base models, with a bootstrap


sample created by drawing random samples with
replacement.
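A minimal sketch of the bootstrap step just described, using scikit-learn decision trees as base models (that choice, and the function names, are ours; the slide does not fix a base learner):

import random
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=5):
    models = []
    n = len(X)
    for _ in range(n_models):
        idx = [random.randrange(n) for _ in range(n)]   # sample with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        models.append(DecisionTreeClassifier().fit(Xb, yb))
    return models

def bagging_predict(models, x):
    votes = [m.predict([x])[0] for m in models]
    return max(set(votes), key=votes.count)             # majority vote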
Random Forests

Bagging
Random Trees: trees that in each node only uses a random
subset of the attributes
Random Forests is one of the most popular methods in
machine learning.
Boosting

The strength of Weak Learnability, Schapire

A boosting algorithm transforms a weak learner


into a strong one
Boosting

A formal description of Boosting (Schapire)

given a training set (x_1, y_1), . . . , (x_m, y_m)
y_i ∈ {−1, +1} is the correct label of instance x_i ∈ X
for t = 1, . . . , T
  construct distribution D_t
  find a weak classifier
  h_t : X → {−1, +1}
  with small error e_t = Pr_{D_t}[ h_t(x_i) ≠ y_i ] on D_t
output the final classifier
Boosting

AdaBoost
1: Initialize D_1(i) = 1/m for all i ∈ {1, 2, ..., m}
2: for t = 1, 2, ..., T do
3:   Call WeakLearn, providing it with distribution D_t
4:   Get back hypothesis h_t : X → Y
5:   Calculate the error of h_t: e_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i)
6:   Update the distribution:
     D_{t+1}(i) = D_t(i)/Z_t × { e_t/(1 − e_t) if h_t(x_i) = y_i ; 1 otherwise }
     where Z_t is a normalization constant (chosen so D_{t+1} is a probability distribution)
7: return h_fin(x) = arg max_{y∈Y} Σ_{t: h_t(x)=y} log (1 − e_t)/e_t
Boosting

AdaBoost
: Initialize D (i) = /m for all i 2 {, , ..., m}
: for t = ,,...T do
: Call WeakLearn, providing it with distribution Dt
: Get back hypothesis ht : X ! Y
: Calculate error of ht : et = i:ht (xi )6=yi Dt (i)
6: Update distribution
et if ht (xi ) = yi
Dt : Dt+ (i) = DZt (i)
t et otherwise
where Zt is a normalization constant (chosen so Dt+ is a
probability distribution)
: return hn (x) = arg maxy2Y t:ht (x)=y log et /( et )
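A sketch of AdaBoost with decision stumps as the weak learner (the stumps are our choice; the pseudocode only assumes some WeakLearn). Labels are assumed to be in {−1, +1}, and the re-weighting uses the equivalent exponential form rather than the e_t/(1 − e_t) form shown above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    X, y = np.asarray(X, float), np.asarray(y)
    D = np.full(len(X), 1.0 / len(X))             # D_1(i) = 1/m
    hypotheses = []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        err = D[pred != y].sum()                  # e_t
        if err == 0 or err >= 0.5:                # stop on a perfect or useless stump
            break
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this hypothesis
        D *= np.exp(-alpha * y * pred)            # re-weight the instances
        D /= D.sum()                              # Z_t normalisation
        hypotheses.append((alpha, h))
    return hypotheses

def adaboost_predict(hypotheses, X):
    X = np.asarray(X, float)
    score = sum(alpha * h.predict(X) for alpha, h in hypotheses)
    return np.sign(score)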
Stacking

Use a classier to combine predictions of base classiers


Example
Use a perceptron to do stacking
Use decision trees as base classiers
Clustering

Denition
Clustering is the distribution of a set of instances of examples
into non-known groups according to some common relations or
afnities.

Example
Market segmentation of customers

Example
Social network communities
Clustering

Denition
Given
a set of instances I
a number of clusters K
an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
f : I ! {, . . . , K}
that minimizes the objective function cost(I)
Clustering
Denition
Given
a set of instances I
a number of clusters K
an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function

cost(C, I) = d (x, C)
x2I

where
d(x, c): distance function between x and c
d (x, C) = minc2C d (x, c): distance from x to the nearest
point in C
k-means

. Choose k initial centers C = {c , . . . , ck }


. while stopping criterion has not been met
For i = , . . . , N
nd closest center ck 2 C to each instance pi
assign instance pi to cluster Ck
For k = , . . . , K
set ck to be the center of mass of all points in Ci
k-means++

. Choose a initial center c


For k = , . . . , K
select ck = p 2 I with probability d (p, C)/cost(C, I)
. while stopping criterion has not been met
For i = , . . . , N
nd closest center ck 2 C to each instance pi
assign instance pi to cluster Ck
For k = , . . . , K
set ck to be the center of mass of all points in Ci
Performance Measures

Internal Measures
Sum square distance
Dunn index D = ddmin
max

C-Index C = S S Smin
max S min

External Measures
Rand Measure
F Measure
Jaccard
Purity
Density based methods
DBSCAN
e-neighborhood(p): set of points that are at a distance of p
less or equal to e
Core object: object whose e-neighborhood has an overall
weight at least
A point p is directly density-reachable from q if
p is in e-neighborhood(q)
q is a core object
A point p is density-reachable from q if
there is a chain of points p , . . . , pn such that pi+ is directly
density-reachable from pi
A point p is density-connected from q if
there is point o such that p and q are density-reachable
from o
Density based methods

DBSCAN
A cluster C of points satises
if p 2 C and q is density-reachable from p, then q 2 C
all points p, q 2 C are density-connected
A cluster is uniquely determined by any of its core points
A cluster can be obtained
choosing an arbitrary core point as a seed
retrieve all points that are density-reachable from the seed
DBSCAN

Figure: DBSCAN Point Example with =


Density based methods

DBSCAN
select an arbitrary point p
retrieve all points density-reachable from p
if p is a core point, a cluster is formed
If p is a border point
no points are density-reachable from p
DBSCAN visits the next point of the database
Continue the process until all of the points have been
processed
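A short usage sketch with scikit-learn's DBSCAN, where the parameters eps and min_samples play the roles of epsilon and the core-object weight threshold above (the toy data is our own):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],      # a dense group
              [8.0, 8.1], [8.2, 7.9], [8.1, 8.0],      # another dense group
              [4.5, 4.5]])                              # an isolated point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)        # e.g. [0 0 0 1 1 1 -1]; -1 marks noise points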
Frequent Patterns

Suppose D is a dataset of patterns, t ∈ D, and min_sup is a constant.

Definition: Support(t) is the number of patterns in D that are superpatterns of t.
Definition: Pattern t is frequent if Support(t) ≥ min_sup.

Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns in D.
Pattern Mining

Dataset Example
Document Patterns
d abce
d cde
d abce
d acde
d abcde
d6 bcd
Itemset Mining

Support Frequent
d abce d,d,d,d,d,d6 c
d cde
d,d,d,d,d e,ce
d abce
d,d,d,d a,ac,ae,ace
d acde
d,d,d,d6 b,bc
d abcde
d,d,d,d6 d,cd
d6 bcd
d,d,d ab,abc,abe
be,bce,abce
d,d,d de,cde

minimal support =
Itemset Mining

Support Frequent
d abce 6 c
d cde
e,ce
d abce
a,ac,ae,ace
d acde
b,bc
d abcde
d,cd
d6 bcd
ab,abc,abe
be,bce,abce
de,cde
Itemset Mining

Support Frequent Gen Closed


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce
de,cde de cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
e ! ce be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
a ! ace be,bce,abce be abce abce
de,cde de cde cde
Itemset Mining

Support Frequent Gen Closed Max


d abce 6 c c c
d cde
e,ce e ce
d abce
a,ac,ae,ace a ace
d acde
b,bc b bc
d abcde
d,cd d cd
d6 bcd
ab,abc,abe ab
be,bce,abce be abce abce
de,cde de cde cde
Closed Patterns

Usually, there are too many frequent patterns. We can compute


a smaller set, while keeping the same information.
Example
A set of items, has subsets, that is more
than the number of atoms in the universe
Closed Patterns

A priori property
If t0 is a subpattern of t, then Support (t0 ) Support (t).

Denition
A frequent pattern t is closed if none of its proper superpatterns
has the same support as it has.

Frequent subpatterns and their supports can be generated from


closed patterns.
Maximal Patterns

Denition
A frequent pattern t is maximal if none of its proper
superpatterns is frequent.

Frequent subpatterns can be generated from maximal patterns,


but not with their support.
All maximal patterns are closed, but not all closed patterns are
maximal.
Non streaming frequent itemset miners
Representation:
Horizontal layout
T: a, b, c
T: b, c, e
T: b, d, e
Vertical layout
a:
b:
c:

Search:
Breadth-rst (levelwise): Apriori
Depth-rst: Eclat, FP-Growth
The Apriori Algorithm

APRIORI
1  Initialize the item set size k = 1
2  Start with single element sets
3  Prune the non-frequent ones
4  while there are frequent item sets
5    do create candidates with one item more
6    Prune the non-frequent ones
7    Increment the item set size k = k + 1
8  Output: the frequent item sets
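A compact sketch of this levelwise search for itemset data, where transactions are sets of items and min_sup is an absolute count (function and variable names are ours):

def apriori(transactions, min_sup):
    transactions = [set(t) for t in transactions]
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items]                     # single element sets
    level = [s for s in level if support(s) >= min_sup]         # prune non-frequent
    frequent = []
    while level:
        frequent.extend(level)
        # candidates with one item more, built by joining the current level
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_sup]
    return frequent

# Example on the dataset used earlier (d1..d6), with minimal support 3:
docs = ["abce", "cde", "abce", "acde", "abcde", "bcd"]
print(sorted("".join(sorted(s)) for s in apriori(docs, 3)))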


The Eclat Algorithm

Depth-First Search
divide-and-conquer scheme : the problem is processed by
splitting it into smaller subproblems, which are then
processed recursively
conditional database for the prefix a:
transactions that contain a
conditional database for item sets without a:
transactions that do not contain a

Vertical representation
Support counting is done by intersecting lists of
transaction identiers
The FP-Growth Algorithm

Depth-First Search
divide-and-conquer scheme : the problem is processed by
splitting it into smaller subproblems, which are then
processed recursively
conditional database for the prefix a:
transactions that contain a
conditional database for item sets without a:
transactions that do not contain a
Vertical and Horizontal representation : FP-Tree
prex tree with links between nodes that correspond to the
same item
Support counting is done using FP-Tree
Mining Graph Data
Problem
Given a data set D of graphs, nd frequent graphs.

Transaction Id Graph
O
C C S N
O
O
C C S N
C
N
C C S N
The gSpan Algorithm

GSPAN(g, D, min_sup, S)

Input: A graph g, a graph dataset D, min_sup.
Output: The frequent graph set S.

1  if g ≠ min(g)
2    then return S
3  insert g into S
4  update support counter structure
5  C ← ∅
6  for each g' that can be right-most extended from g in one step
7    do if support(g') ≥ min_sup
8      then insert g' into C
9  for each g' in C
10   do S ← GSPAN(g', D, min_sup, S)
11 return S
Machine Learning and
Data Mining
Data Preprocessing

Albert Bifet(@abifet)
Data Basics
Machine Learning/Data Mining Applications

Business Analytics
Is this costumer credit-worthy?
Is a costumer willing to respond to an email?
Do costumers divide in similar groups?
How much a costumer is going to spend next semester?
World Wide Web
Financial Analytics
Internet of Things
Image Recognition, Speech
..
The Data Mining Process

Data collection
Data Preprocesing
Feature extraction
Data cleaning
Feature selection and transformation
Analytical processing and algorithms
Data Postprocesing
Multidimensional Data

Example:
Competitor Name Swim Cycle Run Total
John T : : 8: :
Norman P 8: : : :
Alex K : 8: n/a n/a
Sarah H : : : :
Table: Triathlon results

Example or Instance
data point, transaction, entity, tuple, object, or feature-vector
Attribute or Feature
eld, dimension
Instance Types

Dense
red, white, Barcelona, , up
red, red, Barcelona, , down
black, white, Paris, , up
red, green, Paris, , down
Sparse
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
Attribute Type

Numerical
, , ., ., .
Categorical or Discrete
+, -
red, green, black
yes, no
up, down
Barcelona, Paris, London, New York
Text Data: vector-space representation
The cat is black
Binary: Categorical or Numerical
Analytical processing and algorithms

Attribute/Column Relationships
Classication : predict value of a discrete attribute
Regression: predict value of a numeric attribute
Instance/Row Relationships
Clustering: determine subsets of rows, in which the values
in the corresponding columns are similar
Outlier Detection: determine the rows that are very different
from the other rows
Big Data Scalability

Distributed Systems:
Hardware: Hadoop cluster
Software: MapReduce, Spark, Flink, Storm
Streaming Algorithms
Single pass over the data
Concept Drift
Data Preparation
The Data Mining Process

Data collection
Data Preprocesing
Feature extraction
Data cleaning
Feature selection and transformation
Analytical processing and algorithms
Data Postprocesing
Feature Extraction

Sensor data: wavelets or Fourier Transforms


Image Data: histograms or visual words
Web logs: multidimensional data
Network trafc: specic features as network protocol,
bytes transferred
Text Data: remove stop words, stem data,
multidimensional data
Feature Conversion

Numeric to Discrete
Equi-width ranges
Equi-log ranges
Equi-depth ranges
Discrete to Numeric
Binarization: one numeric attribute for each value
Text to Numeric
remove stop words, stem data, tf-idf, multidimensional data
Time Series to Discrete Sequence Data
SAX: equi-depth discretization after window-based
averaging
Time Series to Numeric Data
Discrete Wavelet Transform
Discrete Fourier Transform
Term Frequency-Inverse Document Frequency
Term frequency
Boolean frequencies:
tf(t, d) = 1 if t occurs in d and 0 otherwise
Logarithmically scaled frequency:
tf(t, d) = 1 + log f_{t,d}, or zero if f_{t,d} is zero
Augmented frequency:
tf(t, d) = 0.5 + 0.5 · f_{t,d} / max{ f_{t',d} : t' ∈ d }

Inverse document frequency
idf(t, D) = log ( N / |{ d ∈ D : t ∈ d }| )

Term frequency-inverse document frequency
tfidf(t, d, D) = tf(t, d) · idf(t, D)
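A small sketch of these formulas, using the logarithmically scaled term frequency and natural logarithms (those choices, and the names, are ours):

import math

def tf(t, d):
    f = d.count(t)
    return 1 + math.log(f) if f > 0 else 0.0

def idf(t, D):
    n_t = sum(1 for d in D if t in d)
    return math.log(len(D) / n_t) if n_t else 0.0

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

D = [["the", "cat", "is", "black"], ["the", "dog", "barks"], ["black", "dog"]]
print(tfidf("black", D[0], D))      # 1 * log(3/2)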


Data Cleaning
Handling missing entries
Eliminate entries with a missing value
Estimate missing values
Algorithms can handle missing values
Handling incorrect entries
Duplicate detection and inconsistency detection
Domain knowledge
Data-centric methods
Scaling and normalization
Standardization: for instance i, attribute j:
z_i^j = ( x_i^j − μ_j ) / σ_j
Normalization:
y_i^j = ( x_i^j − min_j ) / ( max_j − min_j )
Feature selection and transformation

Sampling for Static Data


Sampling with Replacement
Sampling without Replacement: no duplicates
Biased Sampling
Stratied Sampling
Reservoir Sampling for Data Streams
Given a data stream, choose k items with the same
probability, storing only k elements in memory.
RESERVOIR SAMPLING (Algorithm R)

1  for every item i in the first k items of the stream
2    do store item i in the reservoir
3  n = k
4  for every item i in the stream after the first k items of the stream
5    do select a random number r between 1 and n
6    if r < k
7      then replace item r in the reservoir with item i
8    n = n + 1

Figure: Algorithm RESERVOIR SAMPLING
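A sketch of Algorithm R in Python, using 0-based indexing so each of the first n items ends up in the reservoir with probability exactly k/n (the names are ours; the stream can be any iterable):

import random

def reservoir_sample(stream, k):
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)              # store the first k items
        else:
            r = random.randrange(n)             # uniform in {0, ..., n-1}
            if r < k:
                reservoir[r] = item             # replace with probability k/n
    return reservoir

print(reservoir_sample(range(10000), 5))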
Feature selection and transformation

Feature Subset Selection


Supervised feature selection
Unsupervised feature selection
Biased Sampling
Stratied Sampling
Dimensionality reduction with axis rotation
Principal Component Analysis
Singular Value Decomposition
Latent Semantic Analysis
Principal Component Analysis

Goal: Principal component analysis computes the most meaningful basis to re-express a noisy, garbled data set.
The hope is that this new basis will filter out the noise and reveal hidden dynamics.

Normalize Input Data


Compute k orthonormal vectors to have a basis for the
normalized data
Sort these principal components
Eliminate components with low variance
Principal Component Analysis

Organize the data set X as an m × n matrix, where m is the number of features and n is the number of instances.
Normalize the input data: subtract off the mean of each feature from every instance x_i.
Calculate the SVD or the eigenvectors of the covariance matrix.
Find some orthonormal matrix P, with Y = PX, such that
S_Y = (1/(n − 1)) · Y·Y^T
is diagonalized.
The rows of P are the principal components of X.
Sort these principal components
Eliminate components with low variance
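A sketch of these steps with numpy, keeping the slide's convention that X has one column per instance (m features × n instances); the function and variable names are ours:

import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=1, keepdims=True)        # subtract the mean of each feature
    cov = Xc @ Xc.T / (X.shape[1] - 1)            # covariance matrix S_X
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]             # sort components by variance
    P = eigvecs[:, order[:k]].T                   # principal components as rows
    return P, P @ Xc                              # components and projected data

X = np.random.randn(5, 200)                       # 5 features, 200 instances
P, Y = pca(X, 2)
print(P.shape, Y.shape)                           # (2, 5) (2, 200)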
Clustering, classification and
evaluation

Mostafa H. Chehreghani

Mostafa.chehreghani@gmail.com
Clustering

Albert Bifet (@abifet)

Paris, 18 October 2015


albert.bifet@telecom-paristech.fr
Clustering

Definition
Clustering is the distribution of a set of instances of examples
into non-known groups according to some common relations or
affinities.

Example
Market segmentation of customers

Example
Social network communities
Clustering

Definition
Given
I a set of instances I
I a number of clusters K
I an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
f : I {1, . . . , K }
that minimizes the objective function cost(I)
Clustering
Definition
Given
I a set of instances I
I a number of clusters K
I an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function
X
cost(C, I) = d 2 (x, C)
xI

where
I d(x, c): distance function between x and c
I d 2 (x, C) = mincC d 2 (x, c): distance from x to the nearest
point in C
k-means

I 1. Choose k initial centers C = {c1 , . . . , ck }


I 2. while stopping criterion has not been met
I For i = 1, . . . , N
I find closest center ck C to each instance pi
I assign instance pi to cluster Ck
I For k = 1, . . . , K
I set ck to be the center of mass of all points in Ci
k-means++

I 1. Choose a initial center c1


I For k = 2, . . . , K
I select ck = p I with probability d 2 (p, C)/cost(C, I)
I 2. while stopping criterion has not been met
I For i = 1, . . . , N
I find closest center ck C to each instance pi
I assign instance pi to cluster Ck
I For k = 1, . . . , K
I set ck to be the center of mass of all points in Ci
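A sketch of the k-means++ seeding step described above, in plain Python (our own names); points are tuples, and each new center is drawn with probability d^2(p, C)/cost(C, I):

import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_init(points, K):
    centers = [random.choice(points)]                         # 1. first center at random
    while len(centers) < K:
        d2 = [min(dist2(p, c) for c in centers) for p in points]
        total = sum(d2)
        r, acc = random.uniform(0, total), 0.0
        for p, w in zip(points, d2):                          # roulette-wheel selection
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers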
Performance Measures
Internal Measures
I Cluster Cohesion: Measures how closely related are
objects in a cluster
I Cluster Separation: Measure how distinct or well
separated a cluster is from other clusters
I Silhouette Coefficient: s = 1 − a/b if a < b
I a = average distance of i to the points in its cluster
I b = min (average distance of i to points in another cluster)

External Measures
I Rand Measure
I F Measure
I Jaccard
I Purity
Distances

Numeric features
I Euclidean:
X
d(x, y ) = ||x y ||2 = (xi yi )2

I Manhattan distance:
X
d(x, y) = ||x y||1 = |xi yi |
Density based methods

DBSCAN
I -neighborhood(p): set of points that are at a distance of p
less or equal to 
I Core object: object whose -neighborhood has an overall
weight at least
I A point p is directly density-reachable from q if
I p is in -neighborhood(q)
I q is a core object
I A point p is density-reachable from q if
I there is a chain of points p1 , . . . , pn such that pi+1 is directly
density-reachable from pi
I A point p is density-connected from q if
I there is point o such that p and q are density-reachable
from o
Density based methods

DBSCAN
I A cluster C of points satisfies
I if p C and q is density-reachable from p, then q C
I all points p, q C are density-connected
I A cluster is uniquely determined by any of its core points
I A cluster can be obtained
I choosing an arbitrary core point as a seed
I retrieve all points that are density-reachable from the seed
Density based methods

DBSCAN
I select an arbitrary point p
I retrieve all points density-reachable from p
I if p is a core point, a cluster is formed
I If p is a border point
I no points are density-reachable from p
I DBSCAN visits the next point of the database
I Continue the process until all of the points have been
processed
DBSCAN

Figure: DBSCAN Point Example with =3


BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES
I Clustering Features CF = (N, LS, SS)
I N: number of data points
I LS: linear sum of the N data points
I SS: square sum of the N data points
I Properties:
I Additivity: CF1 + CF2 = (N1 + N2 , LS1 + LS2 , SS1 + SS2 )
I Easy to compute: average inter-cluster distance
and average intra-cluster distance
I Uses CF tree
I Height-balanced tree with two parameters
I B: branching factor
I T: radius leaf threshold
BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Phase 1: Scan all data and build an initial in-memory CF


tree
Phase 2: Condense into desirable range by building a
smaller CF tree (optional)
Phase 3: Global clustering
Phase 4: Cluster refining (optional and off line, as requires
more passes)
BIRCH:
Balanced Iterative Reducing and Clustering using
Hierarchies

Tian Zhang, Raghu Ramakrishnan, Miron Livny

Presented by Zhao Li
2009, Spring
Introduction to BIRCH

Designed for very large data sets
Time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of data is necessary
Does not need the whole data set in advance

Two key phases:
Scans the database to build an in-memory tree
Applies clustering algorithm to cluster the leaf nodes

September 1, 2017 2
Similarity Metric(1)

Given a cluster of instances , we define:


Centroid:

Radius: average distance from member points to centroid

Diameter: average pair-wise distance within a cluster

September 1, 2017 3
Similarity Metric(2)

centroid Euclidean distance:


centroid Manhattan distance:

average inter-cluster:
average intra-cluster:
variance increase:

September 1, 2017 4
Clustering Feature

The Birch algorithm builds a dendrogram called clustering
feature tree (CF tree) while scanning the data set.

Each entry in the CF tree represents a cluster of objects
and is characterized by a 3-tuple: (N, LS, SS), where N is
the number of objects in the cluster and LS, SS are defined
in the following.

September 1, 2017 5
Properties of Clustering Feature

CF entry is more compact
Stores significantly less than all of the data points in
the sub-cluster

A CF entry has sufficient information to calculate
D0-D4

Additivity theorem allows us to merge sub-clusters
incrementally & consistently

September 1, 2017 6
CF-Tree

Each non-leaf node has at


most B entries
Each leaf node has at
most L CF entries,
each of which satisfies
threshold T

September 1, 2017 7
CF-Tree Insertion

Recurse down from root
Find the appropriate leaf
Follow the "closest"-CF path, w.r.t. D0 / / D4

Modify the leaf
If the closest-CF leaf cannot absorb, make a new CF
entry. If there is no room for new leaf, split the parent
node

Traverse back
Update CFs on the path or splitting nodes
September 1, 2017 8
CF-Tree Rebuilding

If we run out of space, increase threshold T
By increasing the threshold, CFs absorb more data

Rebuilding "pushes" CFs over
The larger T allows different CFs to group together

Reducibility theorem
Increasing T will result in a CF-tree smaller than the
original

September 1, 2017 9
Example of BIRCH
(Figure: CF-tree example — a root with entries LN1, LN2, LN3; the leaves hold subclusters sc1-sc7, and a new subcluster sc8 arrives at leaf LN1.)

September 1, 2017 10
Insertion Operation in BIRCH
If the branching factor of a leaf node can not exceed 3, then LN1 is split.

(Figure: the leaf LN1 is split and the root gains an additional leaf entry.)
September 1, 2017 11
If the branching factor of a non-leaf node can not
exceed 3, then the root is split and the height of
the CF Tree increases by one.

(Figure: the root is split into two non-leaf nodes NLN1 and NLN2, and the CF tree grows by one level.)

September 1, 2017 12
BIRCH Overview

September 1, 2017 13
Experimental Results

Input parameters:
Memory (M): 5% of data set
Disk space (R): 20% of M
Distance equation: D2
Quality equation: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
September 1, 2017 14
Experimental Results

KMEANS clustering:
DS   Time   D     # Scan      DS    Time   D     # Scan
1    43.9   2.09  289         1o    33.8   1.97   197
2    13.2   4.43  51          2o    12.7   4.20   29
3    32.9   3.66  187         3o    36.0   4.35   241

BIRCH clustering:
DS   Time   D     # Scan      DS    Time   D     # Scan
1    11.5   1.87  2           1o    13.6   1.87   2
2    10.7   1.99  2           2o    12.1   1.99   2

September 1, 2017 15
Exam Questions

What is the main limitation of BIRCH?
Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node doesn't always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH doesn't perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

September 1, 2017 16
Classification Evaluation

Albert Bifet (@abifet)

Paris, 27 September 2016


albert.bifet@telecom-paristech.fr
Evaluation

1. Error estimation: Hold-out or Cross-Validation


2. Evaluation performance measures: Accuracy or κ-statistic
3. Statistical significance validation: MacNemar or Nemenyi test

Evaluation Framework
Error Estimation

Data available for testing


I Holdout an independent test set
I Apply the current decision model to the test set
I The loss estimated in the holdout is an unbiased estimator

Holdout Evaluation
1. Error Estimation

Not enough data available for testing


I Divide dataset in 10 folds
I Repeat 10 times: use one fold for testing and the rest for
training

k-fold Cross-validation
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example

I Precision = tp / (tp + fp)
I Recall = tp / (tp + fn)
I F1 = 2 · (precision · recall) / (precision + recall)
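A tiny sketch of these measures for the confusion matrix shown earlier (tp=75, fn=8, fp=7, tn=10); the function name is ours:

def prf(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf(75, 8, 7, 10))   # approx. (0.915, 0.904, 0.909)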
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example

I Accuracy = 75/100 + 10/100 = (75/83)·(83/100) + (10/17)·(17/100) = 85%
I Arithmetic mean = ( 75/83 + 10/17 ) / 2 = 74.59%
I Geometric mean = sqrt( (75/83) · (10/17) ) = 72.90%
2. Performance Measures with Unbalanced Classes
Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example

Predicted Predicted
Class+ Class- Total
Correct Class+ 68.06 14.94 83
Correct Class- 13.94 3.06 17
Total 82 18 100
Table: Confusion matrix for chance predictor
2. Performance Measures with Unbalanced Classes
Kappa Statistic
I p0: classifier's prequential accuracy
I pc: probability that a chance classifier makes a correct prediction
I κ statistic:
  κ = (p0 − pc) / (1 − pc)
I κ = 1 if the classifier is always correct
I κ = 0 if the predictions coincide with the correct ones as often as those of the chance classifier

Matthews correlation coefficient (MCC)
MCC = (tp · tn − fp · fn) / sqrt( (tp + fp)(tp + fn)(tn + fp)(tn + fn) )
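Both statistics computed from a 2x2 confusion matrix, with the chance-agreement term pc taken from the marginals as in the chance-predictor table above (a sketch; the names are ours):

import math

def kappa(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    p0 = (tp + tn) / n
    pc = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return (p0 - pc) / (1 - pc)

def mcc(tp, fn, fp, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(kappa(75, 8, 7, 10), 3), round(mcc(75, 8, 7, 10), 3))   # both approx. 0.481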
2. Evaluation performance measures

Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example

AUC Area under the curve


A ROC space is defined by FPR and TPR (recall):
I FPR = fp / (fp + tn)
I TPR = tp / (tp + fn)
3. Statistical significance validation (2 Classifiers)
Classifier A Classifier A
Class+ Class- Total
Classifier B Class+ c a c+a
Classifier B Class- b d b+d
Total c+b a+d a+b+c+d

M = ( |a − b| − 1 )^2 / (a + b)
The test follows the χ2 distribution. At 0.99 confidence it rejects
the null hypothesis (the performances are equal) if M > 6.635.

McNemar test
3. Statistical significance validation (> 2 Classifiers)
Two classifiers are performing differently if the corresponding
average ranks differ by at least the critical difference
CD = q_α · sqrt( k(k + 1) / 6N )

I k is the number of learners, N is the number of datasets,
I critical values q_α are based on the Studentized range statistic divided by sqrt(2).

Nemenyi test
3. Statistical significance validation (> 2 Classifiers)
Two classifiers are performing differently if the corresponding
average ranks differ by at least the critical difference
CD = q_α · sqrt( k(k + 1) / 6N )

I k is the number of learners, N is the number of datasets,
I critical values q_α are based on the Studentized range statistic divided by sqrt(2).

# classifiers 2 3 4 5 6 7
q0.05 1.960 2.343 2.569 2.728 2.850 2.949
q0.10 1.645 2.052 2.291 2.459 2.589 2.693
Table: Critical values for the Nemenyi test
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
How to organize the Web?
First try: Human curated
Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval attempts to
find relevant docs in a small
and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents,
random things, web spam, etc.
So we need a good way to rank webpages!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
2 challenges of web search:
(1) Web contains many sources of information
Who to trust?
Insight: Trustworthy pages may point to each other!
(2) What is the best answer to query
newspaper?
No single right answer
Insight: Pages that actually know about newspapers
might all be pointing to many newspapers

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
All web pages are not equally important
www.joe-schmoe.com vs. www.stanford.edu

We already know:
There is large diversity
in the web-graph vs.
node connectivity.
So, lets rank the pages
using the web graph
link structure!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4
We will cover the following Link Analysis
approaches to computing importance of
nodes in a graph:
Hubs and Authorities (HITS)
Page Rank
Random Walk with Restarts

Sidenote: Various notions of node centrality: Node


Degree centrality = degree of
Betweenness centrality = #shortest paths passing through
Closeness centrality = avg. length of shortest paths from to
all other nodes of the network
Eigenvector centrality = like PageRank
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5
Goal (back to the newspaper example):
Dont just find newspapers. Find experts pages that
link in a coordinated way to good newspapers
Idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Hubs and Authorities NYT: 10

Each page has 2 scores: Ebay: 3


Quality as an expert (hub):
Yahoo: 3
Total sum of votes of pages pointed to
Quality as a content provider (authority): CNN: 8

Total sum of votes of experts WSJ: 9


Principle of repeated improvement
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 7
Interesting pages fall into two classes:
1. Authorities are pages containing
useful information
Newspaper home pages
Course home pages
Home pages of auto manufacturers

2. Hubs are pages that link to authorities


List of newspapers NYT: 10
Ebay: 3
Course bulletin Yahoo: 3
List of U.S. auto manufacturers CNN: 8
WSJ: 9

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8
Each page starts with hub score 1
Authorities collect their votes

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and the authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9
Hubs collect authority scores

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10
Authorities collect hub scores

(Note this is an idealized example. In reality graph is not bipartite and


each page has both the hub and authority score)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11
A good hub links to many good authorities
A good authority is linked from many good
hubs
Note a self-reinforcing recursive definition

Model using two scores for each node:


Hub score and Authority score
Represented as vectors and , where the i-th
element is the hub/authority score of the i-th node

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 12
[Kleinberg '98]

Each page i has 2 scores:
Authority score: a_i
Hub score: h_i

HITS algorithm:
Initialize: a_j^(0) = 1/sqrt(n), h_j^(0) = 1/sqrt(n)
Then keep iterating until convergence:
∀i: Authority: a_i^(t+1) = Σ_{j→i} h_j^(t)
∀i: Hub: h_i^(t+1) = Σ_{i→j} a_j^(t)
∀i: Normalize: Σ_i (a_i^(t+1))^2 = 1, Σ_j (h_j^(t+1))^2 = 1
Convergence criteria:
Σ_i (a_i^(t) − a_i^(t−1))^2 < ε and Σ_i (h_i^(t) − h_i^(t−1))^2 < ε
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 13
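A sketch of the iteration above with numpy (our own names); A is a 0/1 adjacency matrix with A[i, j] = 1 when page i links to page j, and scores are L2-normalized each round as in the slide:

import numpy as np

def hits(A, eps=1e-8, max_iter=100):
    n = A.shape[0]
    a = np.ones(n) / np.sqrt(n)
    h = np.ones(n) / np.sqrt(n)
    for _ in range(max_iter):
        a_new = A.T @ h                  # authorities collect hub scores
        h_new = A @ a_new                # hubs collect authority scores
        a_new /= np.linalg.norm(a_new)
        h_new /= np.linalg.norm(h_new)
        if np.linalg.norm(a_new - a) < eps and np.linalg.norm(h_new - h) < eps:
            return a_new, h_new
        a, h = a_new, h_new
    return a, h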
[Kleinberg '98]
Details!

HITS in vector notation:
Vectors a = (a_1, ..., a_n) and h = (h_1, ..., h_n)
Adjacency matrix A (n × n): A_ij = 1 if i→j, 0 otherwise
We can rewrite h_i = Σ_{i→j} a_j as h_i = Σ_j A_ij · a_j
So: h = A · a, and similarly: a = A^T · h
Repeat until convergence:
h^(t+1) = A · a^(t)
a^(t+1) = A^T · h^(t)
Normalize a^(t+1) and h^(t+1)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 14
Details!

What is a = A^T (A · a)?
a is updated (in 2 steps):
a = A^T (A a) = (A^T A) a
h is updated (in 2 steps):
h = A (A^T h) = (A A^T) h
Thus, after k rounds:
a = (A^T A)^k · a
h = (A A^T)^k · h
Repeated matrix powering
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 15
Definition: Eigenvectors & Eigenvalues
Let A · x = λ · x for some scalar λ, vector x and matrix A.
Then x is an eigenvector and λ is its eigenvalue.
The steady state (HITS has converged) is:
a = c · (A^T A) · a,  h = c' · (A A^T) · h
(the constants c, c' don't matter, as we normalize them out at every step of HITS)
So, the authority vector a is an eigenvector of A^T A
(associated with the largest eigenvalue),
and similarly the hub vector h is an eigenvector of A A^T.
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 16
Still the same idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes:
www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link

Are all in-links equal?


Links from important pages count more
Recursive question!

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 18
A vote from an important page is worth more:
Each link's vote is proportional to the importance of its source page.
If page i with importance r_i has d_i out-links, each link gets r_i / d_i votes.
Page j's own importance r_j is the sum of the votes on its in-links.
(Example in the figure: r_j = r_i / 3 + r_k / 4.)
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 19
A page is important if it is pointed to by other important pages.
Define a rank r_j for node j:
r_j = Σ_{i→j} r_i / d_i, where d_i is the out-degree of node i.
Example ("the web in 1839", three pages y, a, m), flow equations:
r_y = r_y / 2 + r_a / 2
r_a = r_y / 2 + r_m
r_m = r_a / 2
You might wonder: let's just use Gaussian elimination to solve this system of linear equations. Bad idea!
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20
Stochastic adjacency matrix M:
Let page i have d_i out-links.
If i→j, then M_ji = 1 / d_i.
M is a column stochastic matrix: columns sum to 1.
Rank vector r: an entry per page,
r_i is the importance score of page i, and Σ_i r_i = 1.
The flow equations can be written as r = M · r, i.e.
r_j = Σ_{i→j} r_i / d_i
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 21
i1 i2 i3
Imagine a random web surfer:
At any time , surfer is on some page
At time + , the surfer follows an j
out-link from uniformly at random ri
rj =
Ends up on some page linked from i j d out (i)
Process repeats indefinitely
Let:
() vector whose th coordinate is the
prob. that the surfer is at page at time
So, () is a probability distribution over pages

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 22
Where is the surfer at time t+1?
The surfer follows a link uniformly at random:
p(t + 1) = M · p(t)
Suppose the random walk reaches a state where
p(t + 1) = M · p(t) = p(t);
then p(t) is a stationary distribution of the random walk.
Our original rank vector r satisfies r = M · r,
so r is a stationary distribution for the random walk.
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 23
Given a web graph with n nodes, where the nodes are pages and the edges are hyperlinks:
Assign each node an initial page rank.
Repeat until convergence ( Σ_i |r_i^(t+1) − r_i^(t)| < ε ):
Calculate the page rank of each node:
r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
where d_i is the out-degree of node i.
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25
y a m
Power Iteration: y
y 0
Set ' 1/N a 0 1
]^
a m m 0 0
1: ' 5' _
^
ry = ry /2 + ra /2
2: ra = ry /2 + rm
If | | > : goto 1 rm = ra /2

Example:
ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 6/15
rm 1/3 1/6 3/12 1/6 3/15
Iteration 0, 1, 2,

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26
y a m
Power Iteration: y
y 0
Set ' 1/N a 0 1
]^
a m m 0 0
1: ' 5' _
^
ry = ry /2 + ra /2
2: ra = ry /2 + rm
If | | > : goto 1 rm = ra /2

Example:
ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 6/15
rm 1/3 1/6 3/12 1/6 3/15
Iteration 0, 1, 2,

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27
(t )
( t +1) ri
rj = or
r = Mr
i j di
equivalently

Does this converge?

Does it converge to what we want?

Are results reasonable?

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 28
The Spider trap problem:

(t )
( t +1) ri
a b rj =
i j di
Example:
Iteration: 0, 1, 2, 3
ra 1 0 0 0
=
rb 0 1 1 1

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 29
The Dead end problem:

(t )
( t +1) ri
a b rj =
i j di

Example:
Iteration: 0, 1, 2, 3
ra 1 0 0 0
rb = 0 1 0 0

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 30
2 problems:
(1) Some pages are
dead ends (have no out-links)
Such pages cause
importance to leak out

(2) Spider traps


(all out-links are within the group)
Eventually spider traps absorb all importance

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31
y a m
Power Iteration: y
y 0
8
Set ' = a 0 0
c a m m 0 1
]^
' = 5' _
^ ry = ry /2 + ra /2
And iterate ra = ry /2
rm = ra /2 + rm
Example:
ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 0
rm 1/3 3/6 7/12 16/24 1
Iteration 0, 1, 2,

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32
The Google solution for spider traps: At each
time step, the random surfer has two options
With prob. , follow a link at random
With prob. 1-, jump to a random page
Common values for are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a
few time steps
y y

a m a m
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 33
y a m
Power Iteration: y
y 0
8
Set ' = a 0 0
c a m m 0 0
]^
' = 5' _
^ ry = ry /2 + ra /2
And iterate ra = ry /2
rm = ra /2
Example:
ry 1/3 2/6 3/12 5/24 0
ra = 1/3 1/6 2/12 3/24 0
rm 1/3 1/6 1/12 2/24 0
Iteration 0, 1, 2,

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 34
Teleports: Follow random teleport links with
probability 1.0 from dead-ends
Adjust matrix accordingly

y y

a m a m
y a m y a m
y 0 y
a 0 0 a 0
m 0 0 m 0

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 35
Google's solution: At each step, the random surfer has two options:
With probability β, follow a link at random
With probability 1 − β, jump to some random page
PageRank equation [Brin-Page, '98]:
r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N
where d_i is the out-degree of node i.

The above formulation assumes that has no dead ends. We can


either preprocess matrix (bad!) or explicitly follow random teleport
links with probability 1.0 from dead-ends. See P. Berkhin, A Survey
on PageRank Computing, Internet Mathematics, 2005.
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 36
Details!

PageRank as a principal eigenvector:
r = M · r, or equivalently r_j = Σ_{i→j} r_i / d_i
But we really want (**):
r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N    (d_i ... out-degree of node i)
Let's define:
A_ij = β · M_ij + (1 − β) · 1/N
(Note: M is a sparse matrix but A is dense (all entries ≥ 0). In practice we never materialize A, but rather use the sum formulation (**).)
Now we get what we want:
r = A · r
What is β? In practice 1 − β ≈ 0.15, i.e. β ≈ 0.85 (a jump approximately every 5-6 links).
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 37
Input: Graph G and parameter β
Directed graph G with spider traps and dead ends
Parameter β
Output: PageRank vector r
Set: r_j^(0) = 1/N, t = 1
do:
  ∀j: r'_j^(t) = Σ_{i→j} β · r_i^(t−1) / d_i
      (r'_j^(t) = 0 if the in-degree of j is 0)
  Now re-insert the leaked PageRank:
  ∀j: r_j^(t) = r'_j^(t) + (1 − S)/N, where S = Σ_j r'_j^(t)
  t = t + 1
while Σ_j | r_j^(t) − r_j^(t−1) | > ε
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 38
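A sketch of the complete algorithm above, with dead ends handled by re-inserting the leaked mass (the graph is a dict mapping each node to its list of out-links; the names and the toy example are ours):

def pagerank(graph, beta=0.8, eps=1e-8):
    nodes = list(graph)
    N = len(nodes)
    r = {v: 1.0 / N for v in nodes}
    while True:
        r_new = {v: 0.0 for v in nodes}
        for i in nodes:
            for j in graph[i]:
                r_new[j] += beta * r[i] / len(graph[i])
        leaked = 1.0 - sum(r_new.values())          # teleport and dead-end mass
        for v in nodes:
            r_new[v] += leaked / N                  # re-insert it uniformly
        if sum(abs(r_new[v] - r[v]) for v in nodes) < eps:
            return r_new
        r = r_new

print(pagerank({'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}))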
11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 39
PageRank and HITS are two solutions to the
same problem:
What is the value of an in-link from u to v?
In the PageRank model, the value of the link
depends on the links into u
In the HITS model, it depends on the value of the
other links out of u

The destinies of PageRank and HITS


post-1998 were very different

11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40
[Tong-Faloutsos, 06]

I 1 J
1 1

A 1 H 1 B

1 1
D

1 1 1
E G
F

a.k.a.: Relevance, Closeness, Similarity


11/10/16 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 42
Given: a conference-to-author bipartite graph (conferences such as IJCAI, KDD, ICDM, SDM, AAAI, NIPS; authors such as Philip S. Yu, Ning Zhong, R. Ramakrishnan, M. Jordan).
Goal: proximity on graphs.
Q: What is the most related conference to ICDM?
Shortest path is not a good proximity measure:
- it gives no influence to degree-1 nodes (E, F, G)!
- it cannot capture multi-faceted relationships
Network flow is not a good proximity measure either: it does not punish long paths.
[Tong-Faloutsos, '06]
(Figure: the same A–J example graph.)
A good proximity measure should account for:
- multiple connections
- quality of connection
- direct & indirect connections
- length, degree, weight
(Figure: example graph on 12 numbered nodes, used for the Random Walk with Restart example below.)
Goal: evaluate pages not just by popularity but by how close they are to a topic.
Teleporting can go to:
- Any page with equal probability: standard PageRank (what we used so far)
- A topic-specific set of relevant pages: topic-specific (personalized) PageRank (S ... teleport set)

  A_ij = β·M_ij + (1-β)/|S|   if i ∈ S
  A_ij = β·M_ij               otherwise

Random Walk with Restart: S is a single element.
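A sketch of topic-specific PageRank: the only change from the earlier code is that all leaked mass (teleports and dead ends) is redistributed over the teleport set S instead of over all nodes. Function and variable names are illustrative:

```python
def personalized_pagerank(out_links, teleport_set, beta=0.85, eps=1e-8):
    """Topic-specific PageRank: teleports land only on the pages in teleport_set."""
    nodes = list(out_links)
    S = list(teleport_set)
    r = {n: 1.0 / len(nodes) for n in nodes}
    while True:
        r_new = {n: 0.0 for n in nodes}
        for i in nodes:
            outs = out_links[i]
            if not outs:
                continue                      # dead end: its mass is re-inserted below
            for j in outs:
                r_new[j] += beta * r[i] / len(outs)
        leaked = 1.0 - sum(r_new.values())    # teleport mass + dead-end mass
        for j in S:
            r_new[j] += leaked / len(S)       # ...spread over the teleport set only
        if sum(abs(r_new[n] - r[n]) for n in nodes) < eps:
            return r_new
        r = r_new

# Random Walk with Restart is the special case of a single-element teleport set,
# e.g. personalized_pagerank(graph, teleport_set={"ICDM"}).
```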
Graphs and web search: PageRank ranks nodes by importance.
Personalized PageRank: ranks the proximity of nodes to the teleport nodes S.
Proximity on graphs (Q: what is the most related conference to ICDM?): Random Walks with Restarts — teleport back to the starting node, S = {single node}.
(Figure: the conference–author bipartite graph from before.)
Example: Random Walk with Restart on the 12-node graph, with node 4 as the query node. Ranking vector (nearby nodes get higher scores; "more red, more relevant" in the original figure):

  Node 1: 0.13   Node 2: 0.10   Node 3: 0.13   Node 4: (query)
  Node 5: 0.13   Node 6: 0.05   Node 7: 0.05   Node 8: 0.08
  Node 9: 0.04   Node 10: 0.03  Node 11: 0.04  Node 12: 0.02
(Figure: proximity of CS conferences to ICDM computed by Random Walk with Restart; neighbouring conferences such as KDD, SDM, PAKDD, PKDD, ICML, CIKM, ICDE, ECML, SIGMOD and DMKD receive scores roughly in the range 0.004–0.011.)
Q: Which conferences are closest to KDD & ICDM?
A: Personalized PageRank with teleport set S = {KDD, ICDM}.
(Figure: graph of CS conferences.)
Example (Pinterest): Pins belong to Boards, forming a bipartite Pin–Board graph.
(Figures: given input pins, the system recommends related pins as output.)
Recommendations are computed as proximity to the query node(s) Q in the bipartite Pin and Board graph.
Pixie Random Walk: proximity to the query node(s) Q.
(Figure: a random walk starting from the query pin Q repeatedly hops between pins and boards such as "Yummm", "Strawberries", "Smoothies" and "Smoothie Madness!!"; the visit counts of the pins serve as proximity scores.)
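The details of Pinterest's Pixie walk are not given on the slide, but the basic idea — simulate many walk steps from the query pin, restart with some probability, and rank pins by visit count — can be sketched as follows (the restart probability, step budget and graph encoding are assumptions):

```python
import random

def random_walk_with_restart(neighbors, query, n_steps=100_000, restart_prob=0.5):
    """Monte-Carlo proximity: walk the bipartite graph, jump back to the
    query node with probability restart_prob, and count visits."""
    counts = {}
    node = query
    for _ in range(n_steps):
        if random.random() < restart_prob:
            node = query                           # restart at the query pin
        else:
            node = random.choice(neighbors[node])  # hop pin -> board or board -> pin
        counts[node] = counts.get(node, 0) + 1
    return sorted(counts.items(), key=lambda kv: -kv[1])  # highest counts = most related
```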
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Observations → Models → Algorithms:
- Small diameter, edge clustering → Erdős–Rényi model, small-world model → decentralized search
- Patterns of signed edge creation → structural balance, theory of status → models for predicting edge signs
- Viral marketing, blogosphere, memetracking → independent cascade model, game-theoretic model → influence maximization, outbreak detection, LIM
- Scale-free → preferential attachment, copying model → PageRank, hubs and authorities
- Densification power law, shrinking diameters → microscopic model of evolving networks → link prediction, supervised random walks
- Strength of weak ties, core-periphery → Kronecker graphs → community detection: Girvan-Newman, modularity
We often think of networks as looking like this (figure: tightly knit clusters connected by a few links).
What led to such a conceptual picture?
How does information flow through the network?
What structurally distinct roles do nodes play?
What roles do different links (short vs. long) play?
How do people find out about new jobs?
Mark Granovetter, as part of his PhD in the 1960s, found that people learn about jobs through personal contacts.
But: the contacts were often acquaintances rather than close friends.
This is surprising: one would expect your friends to help you out more than casual acquaintances.
Why is it that acquaintances are most helpful?
[Granovetter '73]
Two perspectives on friendships:
- Structural: friendships span different parts of the network
- Interpersonal: friendship between two people is either strong or weak

Structural role: Triadic Closure
If two people in a network have a friend in common, then there is an increased likelihood they will become friends themselves.
(Figure: three nodes a, b, c. Which new edge is more likely to form, a–b or a–c?)
Granovetter makes a connection between the social and the structural role of an edge.
First point: Structure
- Structurally embedded edges are also socially strong
- Long-range edges spanning different parts of the network are socially weak
Second point: Information
- Long-range edges allow you to gather information from different parts of the network and get a job
- Structurally embedded edges are heavily redundant in terms of information access
(Figure: strong (S) and weak (W) ties; the long-range edge a–b is a weak tie.)
Triadic closure == high clustering coefficient.
Reasons for triadic closure: if B and C have a friend A in common, then:
- B is more likely to meet C (since they both spend time with A)
- B and C trust each other (since they have a friend in common)
- A has an incentive to bring B and C together (as it is hard for A to maintain two disjoint relationships)
Empirical study by Bearman and Moody: teenage girls with a low clustering coefficient are more likely to contemplate suicide.
Define: Bridge edge — if removed, it disconnects the graph. (Figure: bridge edge a–b.)
Define: Local bridge — edge of span > 2. (The span of an edge is the distance of the edge endpoints if the edge is deleted; local bridges with long span are like real bridges.) (Figure: local bridge a–b.)
Define: Two types of edges: Strong (friend) and Weak (acquaintance).
Define: Strong triadic closure — two strong ties imply a third edge.
Fact: if strong triadic closure is satisfied, then local bridges are weak ties!
(Figure: a graph with edges labelled S/W; the local bridge a–b is labelled W.)
Claim: If a node A satisfies Strong Triadic Closure and is involved in at least two strong ties, then any local bridge adjacent to A must be a weak tie.
Proof by contradiction:
- Assume A satisfies Strong Triadic Closure and has at least two strong ties.
- Let the edge A–B be a local bridge and a strong tie, and let A–C be another strong tie.
- Then the edge B–C must exist, because of Strong Triadic Closure.
- But then A–B is not a local bridge! (since B and C must be connected due to the Strong Triadic Closure property)
For many years Granovetter's theory was not tested.
But today we have large who-talks-to-whom graphs: email, messenger, cell phones, Facebook.
Onnela et al. 2007:
- Cell-phone network covering 20% of a country's population
- Edge strength: number of phone calls
Edge overlap:

  O_ij = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|

where N(i) is the set of neighbors of node i (excluding i and j themselves).
Overlap = 0 when an edge is a local bridge.
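A small sketch of the overlap computation, assuming the graph is stored as a dict of neighbor sets (names are illustrative):

```python
def edge_overlap(adj, i, j):
    """O_ij = |N(i) & N(j)| / |N(i) | N(j)|, with i and j themselves excluded."""
    Ni = adj[i] - {i, j}
    Nj = adj[j] - {i, j}
    union = Ni | Nj
    return len(Ni & Nj) / len(union) if union else 0.0

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(edge_overlap(adj, 1, 2))   # 1.0: fully embedded edge
print(edge_overlap(adj, 3, 4))   # 0.0: edge (3, 4) is a local bridge
```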
Cell-phone network observation: highly used links have high overlap!
(Plot: neighborhood overlap vs. edge strength (#calls).
Legend — True: the data; Permuted strengths: keep the network structure but randomly reassign edge strengths.)
Real edge strengths in the mobile call graph: strong ties are more embedded (have higher overlap).
Same network, same set of edge strengths, but now the strengths are randomly shuffled.
Removing links by strength (#calls), either from low to high or from high to low:
removing the low-strength (weak) links first disconnects the network sooner.
(Plot: size of the largest component vs. fraction of removed links; inset: conceptual picture of the network structure.)
Removing links based on overlap, either from low to high or from high to low:
removing the low-overlap links first disconnects the network sooner.
(Plot: size of the largest component vs. fraction of removed links; inset: conceptual picture of the network structure.)
Granovetter's theory leads to the following conceptual picture of networks: tightly knit groups held together by strong ties, connected to one another by weak ties.
(Figure: strong ties inside the clusters, weak ties between them.)
Granovetter's theory suggests that networks are composed of tightly connected sets of nodes.
Network communities (also called clusters, groups, or modules): sets of nodes with lots of connections inside and few to the outside (the rest of the network).
How can we automatically find such densely connected groups of nodes (communities, clusters, groups, modules)?
Ideally, such automatically detected clusters would then correspond to real groups.
For example:
Zachary's karate club network:
- Observed social ties and rivalries in a university karate club
- During the observation, conflicts led the group to split
- The split could be explained by a minimum cut in the network
Find micro-markets by partitioning the query-to-advertiser graph.
(Figure: bipartite query × advertiser graph.)
Can we identify node groups (communities, modules, clusters)?
- Nodes: teams; Edges: games played. The detected groups correspond to NCAA conferences.
- Nodes: users; Edges: friendships. The detected groups correspond to social communities (e.g. high school, company, Stanford basketball, Stanford squash).
- Nodes: proteins; Edges: interactions. The detected groups correspond to functional modules.
How to find communities?
We will work with undirected (unweighted) networks.
Edge betweenness: the number of shortest paths passing over the edge.
(Figure: example edges with betweenness b=16 and b=7.5.)
Intuition: compare edge strengths (call volume) in a real network with edge betweenness in the same network.
[Girvan-Newman '02]
Divisive hierarchical clustering based on the notion of edge betweenness (the number of shortest paths passing through an edge).

Girvan-Newman algorithm (undirected, unweighted networks):
- Repeat until no edges are left:
  - Calculate the betweenness of all edges
  - Remove the edge(s) with the highest betweenness
- Connected components are communities
- Gives a hierarchical decomposition of the network (a sketch in code follows below)
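As a sketch (assuming the networkx library, which is not part of the lecture material), the loop looks like this:

```python
import networkx as nx   # assumed available; any graph library with edge betweenness would do

def girvan_newman_levels(G):
    """Repeatedly remove the highest-betweenness edge(s); record the communities
    each time the number of connected components grows."""
    G = G.copy()
    levels = []
    while G.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(G)   # recomputed at every step
        max_b = max(betweenness.values())
        for edge, b in list(betweenness.items()):
            if b == max_b:
                G.remove_edge(*edge)
        comps = [sorted(c) for c in nx.connected_components(G)]
        if not levels or len(comps) != len(levels[-1]):
            levels.append(comps)
    return levels   # hierarchical decomposition, from coarse to fine

# Example: first split of Zachary's karate-club network from the earlier slide
print(girvan_newman_levels(nx.karate_club_graph())[0])
```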
(Figure: example graph with edge betweenness values such as 1, 12, 33, 49.)
Note: we need to re-compute betweenness at every step.
(Figure: steps 1-3 of edge removal and the resulting hierarchical network decomposition.)
1. How to compute betweenness?
2. How to select the number of
clusters?
We want to compute the betweenness contribution of shortest paths starting at a given node, using breadth-first search from that node.
(Figure: BFS tree; the starting node is at depth 0.)
Forward step: count the number of shortest paths from the starting node to all other nodes of the network.
Backward step: compute betweenness; if there are multiple shortest paths, count them fractionally.
The algorithm adds edge flows:
- node flow = 1 + sum of the flows on its child edges
- the flow is split up based on the parent values (the number of shortest paths through each parent)
Repeat the BFS procedure for each starting node.
(Figure annotations: 1+1 paths to H, split evenly; 1+0.5 paths to J, split 1:2; 1 path to K, split evenly.)
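The forward and backward steps can be sketched for a single starting node as follows (adjacency is a dict of neighbor collections; names are illustrative). Summing the resulting edge credits over all starting nodes, and halving because each shortest path is discovered from both of its endpoints, gives the edge betweenness used by Girvan-Newman:

```python
from collections import deque

def single_source_edge_credit(adj, s):
    """Forward step: BFS from s counting shortest paths.
    Backward step: send flow 1 from every node back toward s, splitting it
    among parents in proportion to their shortest-path counts."""
    dist, n_paths, order = {s: 0}, {s: 1}, []
    q = deque([s])
    while q:                                   # forward step
        v = q.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                n_paths[w] = 0
                q.append(w)
            if dist[w] == dist[v] + 1:
                n_paths[w] += n_paths[v]       # every shortest path to v extends to w

    node_flow = {v: 1.0 for v in order}        # each node starts with flow 1
    edge_credit = {}
    for w in reversed(order):                  # backward step, deepest nodes first
        for v in adj[w]:
            if dist.get(v) == dist[w] - 1:     # v is a parent of w in the BFS DAG
                c = node_flow[w] * n_paths[v] / n_paths[w]
                edge_credit[tuple(sorted((v, w)))] = c
                node_flow[v] += c
    return edge_credit
```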
1. How to compute betweenness?
2. How to select the number of
clusters?
Communities: sets of tightly connected nodes.
Define: Modularity Q — a measure of how well a network is partitioned into communities.
Given a partitioning of the network into groups s ∈ S:

  Q ∝ Σ_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]

We need a null model!
Given a real graph G on n nodes and m edges, construct a rewired network G':
- same degree distribution, but random connections
- consider G' as a multigraph
The expected number of edges between nodes i and j of degrees k_i and k_j equals k_i·k_j / 2m.
The expected number of edges in the (multigraph) G':

  (1/2) Σ_i Σ_j k_i·k_j / 2m = (1/2) · (1/2m) Σ_i k_i Σ_j k_j = (1/4m) · 2m · 2m = m

Note: Σ_u k_u = 2m.
Modularity of a partitioning S of graph G:
Q ∝ Σ_{s∈S} [ (# edges within group s) − (expected # edges within group s) ]

  Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i·k_j / 2m )

where A_ij = 1 if i→j is an edge and 0 otherwise; the 1/2m is a normalizing constant so that −1 < Q < 1.

Modularity values take the range [−1, 1]:
- Q is positive if the number of edges within groups exceeds the expected number
- Q greater than about 0.3–0.7 means significant community structure
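A direct (O(n²)) sketch of the formula, with the graph given as adjacency sets and the partitioning as a list of node sets (names are illustrative):

```python
def modularity(adj, communities):
    """Q = (1/2m) * sum over same-community pairs (i, j) of (A_ij - k_i*k_j / 2m)."""
    two_m = sum(len(neigh) for neigh in adj.values())     # 2m for an undirected graph
    degree = {v: len(neigh) for v, neigh in adj.items()}
    group = {v: g for g, comm in enumerate(communities) for v in comm}

    Q = 0.0
    for i in adj:
        for j in adj:
            if group[i] != group[j]:
                continue
            a_ij = 1.0 if j in adj[i] else 0.0
            Q += a_ij - degree[i] * degree[j] / two_m
    return Q / two_m

# Two triangles joined by one edge, split into the two obvious communities:
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(modularity(adj, [{1, 2, 3}, {4, 5, 6}]))   # about 0.357, clearly positive
```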
Modularity is useful for selecting the number of clusters: cut the Girvan-Newman dendrogram at the level with the largest Q.
(Plot: modularity Q as a function of the number of clusters.)
But why not optimize modularity directly?
Let's split the graph into 2 communities, and directly optimize modularity:

  max_S Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i·k_j / 2m )

Community membership vector s:
  s_i = +1 if node i is in community 1
  s_i = −1 if node i is in community 2
Note that (s_i·s_j + 1)/2 equals 1 if s_i = s_j and 0 otherwise, so

  Q(G, s) = (1/2m) Σ_{ij} ( A_ij − k_i·k_j / 2m ) · (s_i·s_j + 1)/2
          = (1/4m) Σ_{ij} ( A_ij − k_i·k_j / 2m ) · s_i·s_j

(the constant "+1" term vanishes because Σ_ij A_ij = Σ_ij k_i·k_j / 2m = 2m).
Define the modularity matrix B with entries B_ij = A_ij − k_i·k_j / 2m.
Note: each row/column of B sums to 0, since Σ_j B_ij = Σ_j A_ij − (k_i/2m) Σ_j k_j = k_i − k_i = 0.
Membership: s ∈ {−1, +1}^n.
Then:

  Q(G, s) = (1/4m) Σ_{ij} ( A_ij − k_i·k_j / 2m ) s_i s_j
          = (1/4m) Σ_{ij} s_i B_ij s_j
          = (1/4m) s^T B s

Task: find s ∈ {−1, +1}^n that maximizes Q(G, s).
Linear-algebra reminder. For a symmetric matrix A (positive semi-definite means x^T A x ≥ 0):
- The solutions (λ_i, x_i) of the equation A·x = λ·x are the eigenvalue/eigenvector pairs.
- Eigenvectors, ordered by the magnitude of their corresponding eigenvalues λ_1 ≤ λ_2 ≤ ... ≤ λ_n, are orthonormal (orthogonal and of unit length) and form a coordinate system (basis).
- If A is positive semi-definite, then λ_i ≥ 0 (and the eigenvalues always exist).
- Eigen-decomposition theorem: the matrix can be rewritten in terms of its eigenvectors and eigenvalues as A = Σ_i λ_i x_i x_i^T.
Rewrite Q(G, s) = (1/4m) s^T B s in terms of B's eigenvectors and eigenvalues:

  Q = (1/4m) s^T ( Σ_{i=1}^n λ_i x_i x_i^T ) s = (1/4m) Σ_{i=1}^n (s^T x_i) λ_i (x_i^T s) = (1/4m) Σ_{i=1}^n (x_i^T s)^2 λ_i

So, if there were no other constraints on s, then to maximize Q we would set s = x_n.
Why? Because s has fixed length, this assigns all the weight in the sum to λ_n (the largest eigenvalue); all other terms are zero because of orthonormality.
Let's keep only the leading term of the summation (because λ_n is the largest):

  max_s Q(G, s) ∝ Σ_{i=1}^n (x_i^T s)^2 λ_i ≈ (x_n^T s)^2 λ_n

So we want to maximize (x_n^T s)^2 subject to s_j ∈ {−1, +1}. To do this, we set:

  s_j = +1 if the j-th coordinate of x_n is ≥ 0
  s_j = −1 if the j-th coordinate of x_n is < 0

Continue the bisection hierarchically.
Fast Modularity Optimization Algorithm:
- Find the leading eigenvector x_n of the modularity matrix B
- Divide the nodes by the signs of the elements of x_n
- Repeat hierarchically until:
  - If a proposed split does not cause modularity to increase, declare the community indivisible and do not split it
  - If all communities are indivisible, stop

How to find x_n? The power method!
Start with a random v^(0) and repeat:

  v^(t+1) = B·v^(t) / ||B·v^(t)||

When converged (v^(t) ≈ v^(t+1)), set x_n = v^(t).
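A sketch of one bisection step using numpy (assumed available). One practical detail the slide glosses over: plain power iteration converges to the eigenvalue largest in magnitude, which for B may be negative, so the sketch shifts B by a constant before iterating:

```python
import numpy as np

def spectral_split(A, n_iter=1000):
    """One step of fast modularity optimization: build B, approximate its
    leading eigenvector by power iteration, and split nodes by sign."""
    k = A.sum(axis=1)                           # degrees
    two_m = k.sum()                             # 2m
    B = A - np.outer(k, k) / two_m              # modularity matrix B_ij = A_ij - k_i k_j / 2m
    shift = np.abs(B).sum()                     # crude upper bound on the spectral radius
    B_shifted = B + shift * np.eye(len(A))      # makes the top eigenvalue dominate
    v = np.random.rand(len(A))
    for _ in range(n_iter):
        v = B_shifted @ v
        v /= np.linalg.norm(v)
    s = np.where(v >= 0, 1.0, -1.0)             # membership vector in {-1, +1}
    Q = s @ B @ s / (2 * two_m)                 # Q = (1/4m) s^T B s
    return s, Q

# Two triangles joined by one edge:
A = np.array([[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]], float)
print(spectral_split(A))   # splits nodes 0-2 from nodes 3-5 with positive Q
```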
Girvan-Newman:
Based on the strength of weak ties
Remove edge of highest betweenness
Modularity:
Overall quality of the partitioning of a graph
Use to determine the number of communities
Fast modularity optimization:
Transforms the modularity optimization into an eigenvalue problem
[Ron Burt]
Who is better off, Robert or James?
(Figure: the ego networks of James and Robert — one with few structural holes, one with many.)
Structural holes provide the ego with access to novel information, power, and freedom.
The network constraint measure [Burt]: to what extent are a person's contacts redundant?
- Low: disconnected contacts
- High: contacts that are close or strongly tied

  c_i = Σ_j c_ij = Σ_j ( p_ij + Σ_{k≠i,j} p_ik·p_kj )²

where p_ij is the proportion of i's "energy" (tie weight) invested in the relationship with j.

(Example: a 5-node graph with edges 1–2, 1–3, 1–4, 1–5, 2–5 and 4–5. The matrix of proportions p_ij is:

        1    2    3    4    5
  1   .00  .25  .25  .25  .25
  2   .50  .00  .00  .00  .50
  3   1.0  .00  .00  .00  .00
  4   .50  .00  .00  .00  .50
  5   .33  .33  .00  .33  .00 )
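A small sketch of the constraint for an unweighted graph, where p_ij is simply 1/degree(i) for each neighbor j (the function name and example graphs are illustrative):

```python
def network_constraint(adj, i):
    """Burt's constraint c_i = sum_j (p_ij + sum_{k != i, j} p_ik * p_kj)^2."""
    def p(u, v):
        return 1.0 / len(adj[u]) if v in adj[u] else 0.0

    c_i = 0.0
    for j in adj[i]:
        indirect = sum(p(i, k) * p(k, j) for k in adj[i] if k not in (i, j))
        c_i += (p(i, j) + indirect) ** 2
    return c_i

# A star centre has disconnected contacts (low constraint, many structural holes);
# a node inside a clique has redundant contacts (high constraint).
star   = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
clique = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(network_constraint(star, 0))    # 0.25
print(network_constraint(clique, 0))  # about 0.93
```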
Constraint: to what extent are a person's contacts redundant?
- Low: disconnected contacts
- High: contacts that are close or strongly tied
Network constraint:
- James: c = 0.309
- Robert: c = 0.148