Data Mining
Introduction
Albert Bifet (@abifet)
Data Science
Figure: http://www.marketingdistillery.com/2014/11/29/
is-data-science-a-buzzword-modern-data-scientist-defined/
Data Science
Definition
Given nC different classes, a classifier algorithm builds a model
that predicts for every unlabelled instance I the class C to which
it belongs with accuracy.
Example
A spam filter
Example
Twitter sentiment analysis: analyze tweets with positive or
negative sentiment
Classification
Example
Data set that describes e-mail features for deciding if it is spam:

Contains Money  Domain type  Has attach.  Time received  spam
yes             com          yes          night          yes
yes             edu          no           night          yes
no              com          yes          night          yes
no              edu          no           day            no
no              com          no           day            no
yes             cat          no           day            yes
k-NN Classifier
Training: store all instances in memory
Prediction:
Find the k nearest instances
Output majority class of these k instances
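To make the procedure concrete, here is a minimal Python sketch (assuming NumPy; the function name knn_predict and the toy data are illustrative, not from the original slides):

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Training is just storing the instances; prediction does the work.
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored instance
    nearest = np.argsort(dists)[:k]               # the k nearest instances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority class of the k neighbours

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9])))  # -> 0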
Bayes Classifiers
Naïve Bayes
Based on Bayes' Theorem:
  P(c|d) = P(c) · P(d|c) / P(d)
  posterior = prior × likelihood / evidence
Estimates the probability of observing attribute a in each class,
and the prior probability P(c)
Probability of class c given an instance d:
[Figure: linear classifier over inputs x1 x2 x3 x4 x5, one weight w per attribute]
Sigmoid function: σ(x) = 1/(1 + e^(−x))
P(x, z) ∝ e^(−E(x,z)), where the energy E(x, z) is defined by the weight matrix W.
[Figure: decision tree for the spam example — root splits on Time:
Night → YES; Day → split on Contains Money: Yes → YES, No → NO]
Decision Trees
Example
Dataset of Instances: A, B, C, D
Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C
Bagging
Random Trees: trees that at each node use only a random
subset of the attributes
Random Forests is one of the most popular methods in
machine learning.
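As a hedged illustration, the two ensembles can be compared with scikit-learn (assumed available; the synthetic dataset and parameters are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# Bagging: each tree is trained on a bootstrap sample of the training set.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Random Forest: bagging plus a random subset of attributes at each node.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
print(bag.fit(X_tr, y_tr).score(X_te, y_te))
print(rf.fit(X_tr, y_tr).score(X_te, y_te))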
Boosting
h_t : X → {−1, +1}
AdaBoost
: Initialize D (i) = /m for all i 2 {, , ..., m}
: for t = ,,...T do
: Call WeakLearn, providing it with distribution Dt
: Get back hypothesis ht : X ! Y
: Calculate error of ht : et = i:ht (xi )6=yi Dt (i)
6: Update distribution
et /( et ) if ht (xi ) = yi
Dt : Dt+ (i) = DZt (i)
t otherwise
where Zt is a normalization constant (chosen so Dt+ is a
probability distribution)
: return hn (x) = arg maxy2Y t:ht (x)=y log et /( et )
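A minimal Python sketch of the pseudocode above for binary labels y ∈ {−1, +1} (weak_learn stands for any learner trained on a weighted distribution; the names are illustrative):

import numpy as np

def adaboost(X, y, weak_learn, T):
    # y must be in {-1, +1}; D is the distribution over the m instances.
    m = len(y)
    D = np.full(m, 1.0 / m)
    hs, betas = [], []
    for t in range(T):
        h = weak_learn(X, y, D)            # step 3: train on distribution D_t
        miss = h(X) != y
        eps = D[miss].sum()                # step 5: weighted error
        if eps == 0 or eps >= 0.5:         # weak learner must beat chance
            break
        beta = eps / (1.0 - eps)
        D[~miss] *= beta                   # step 6: shrink weight of correct instances
        D /= D.sum()                       # normalize (the Z_t constant)
        hs.append(h)
        betas.append(beta)
    def h_fin(X):
        # Final vote: each h_t weighted by log((1 - eps_t)/eps_t).
        votes = sum(np.log(1.0 / b) * h(X) for h, b in zip(hs, betas))
        return np.sign(votes)
    return h_fin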
Stacking
Definition
Clustering is the partition of a set of instances (examples)
into previously unknown groups according to some common relations or
affinities.
Example
Market segmentation of customers
Example
Social network communities
Clustering
Definition
Given
  a set of instances I
  a number of clusters K
  an objective function cost(I)
a clustering algorithm computes an assignment of a cluster to
each instance
  f : I → {1, . . . , K}
that minimizes the objective function cost(I)
Clustering
Definition
Given
  a set of instances I
  a number of clusters K
  an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function
  cost(C, I) = Σ_{x∈I} d²(x, C)
where
  d(x, c): distance function between x and c
  d²(x, C) = min_{c∈C} d²(x, c): distance from x to the nearest
  point in C
k-means
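k-means minimizes exactly this objective by alternating assignment and re-centering. A minimal sketch of Lloyd's algorithm, assuming NumPy (names are illustrative):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k instances chosen at random as the initial centers C.
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every instance to its nearest center.
        labels = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned instances.
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):
            break
        C = new_C
    return C, labels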
Internal Measures
Sum square distance
Dunn index D = d_min / d_max
C-Index C = (S − S_min) / (S_max − S_min)
External Measures
Rand Measure
F Measure
Jaccard
Purity
Density based methods
DBSCAN
ε-neighborhood(p): set of points whose distance from p is at most ε
Core object: object whose ε-neighborhood has an overall
weight of at least some minimum (MinPts)
A point p is directly density-reachable from q if
  p is in ε-neighborhood(q)
  q is a core object
A point p is density-reachable from q if
  there is a chain of points p_1, . . . , p_n such that p_{i+1} is directly
  density-reachable from p_i
A point p is density-connected to q if
  there is a point o such that p and q are density-reachable
  from o
Density based methods
DBSCAN
A cluster C of points satisfies:
  if p ∈ C and q is density-reachable from p, then q ∈ C
  all points p, q ∈ C are density-connected
A cluster is uniquely determined by any of its core points
A cluster can be obtained
choosing an arbitrary core point as a seed
retrieve all points that are density-reachable from the seed
DBSCAN
DBSCAN
select an arbitrary point p
retrieve all points density-reachable from p
if p is a core point, a cluster is formed
if p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
Continue the process until all of the points have been
processed
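For illustration, the algorithm as implemented in scikit-learn (assumed available; eps and min_samples correspond to ε and the minimum neighborhood weight):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; -1 marks noise points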
Frequent Patterns
Definition
Support(t): number of patterns in D that are superpatterns of t.
Definition
Pattern t is frequent if Support(t) ≥ min_sup.
Dataset Example
Document  Patterns
d1        abce
d2        cde
d3        abce
d4        acde
d5        abcde
d6        bcd
Itemset Mining
Document  Patterns    Support (documents)    Frequent itemsets
d1        abce        d1,d2,d3,d4,d5,d6      c
d2        cde         d1,d2,d3,d4,d5         e, ce
d3        abce        d1,d3,d4,d5            a, ac, ae, ace
d4        acde        d1,d3,d5,d6            b, bc
d5        abcde       d2,d4,d5,d6            d, cd
d6        bcd         d1,d3,d5               ab, abc, abe, be, bce, abce
                      d2,d4,d5               de, cde
minimal support = 3
Itemset Mining
Document  Patterns    Support    Frequent itemsets
d1        abce        6          c
d2        cde         5          e, ce
d3        abce        4          a, ac, ae, ace
d4        acde        4          b, bc
d5        abcde       4          d, cd
d6        bcd         3          ab, abc, abe, be, bce, abce
                      3          de, cde
Itemset Mining
A priori property
If t′ is a subpattern of t, then Support(t′) ≥ Support(t).
Definition
A frequent pattern t is closed if none of its proper superpatterns
has the same support as it has.
Definition
A frequent pattern t is maximal if none of its proper
superpatterns is frequent.
Search:
Breadth-first (levelwise): Apriori
Depth-first: Eclat, FP-Growth
The Apriori Algorithm
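A compact levelwise sketch over the d1–d6 dataset from above (plain Python; helper names are illustrative). It joins frequent itemsets into candidates and prunes with the a priori property before counting support:

from itertools import combinations

docs = {"d1": "abce", "d2": "cde", "d3": "abce",
        "d4": "acde", "d5": "abcde", "d6": "bcd"}
transactions = [frozenset(p) for p in docs.values()]
min_sup = 3

def support(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
frequent = list(level)
while level:
    seen = set(frequent)
    # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates.
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    # Prune step: all k-subsets must be frequent; then count support.
    level = [c for c in candidates
             if all(frozenset(s) in seen for s in combinations(c, len(c) - 1))
             and support(c) >= min_sup]
    frequent += level
print(sorted("".join(sorted(f)) for f in frequent))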
Depth-First Search
divide-and-conquer scheme : the problem is processed by
splitting it into smaller subproblems, which are then
processed recursively
conditional database for the prefix a
  transactions that contain a
conditional database for item sets without a
  transactions that do not contain a
Vertical representation
Support counting is done by intersecting lists of
transaction identifiers
The FP-Growth Algorithm
Depth-First Search
divide-and-conquer scheme : the problem is processed by
splitting it into smaller subproblems, which are then
processed recursively
conditional database for the prefix a
  transactions that contain a
conditional database for item sets without a
  transactions that do not contain a
Vertical and Horizontal representation: FP-Tree
  prefix tree with links between nodes that correspond to the
  same item
Support counting is done using FP-Tree
Mining Graph Data
Problem
Given a data set D of graphs, nd frequent graphs.
Transaction Id  Graph
[Figure: three molecule transaction graphs, each a C–C–S–N chain
with O, N, or C substituents]
The gSpan Algorithm
gSpan(g, D, min_sup, S):
1: if g ≠ min(g)
2:   then return S
3: insert g into S
4: update support counter structure
5: C ← ∅
6: for each g′ that can be right-most
   extended from g in one step
7:   do if support(g′) ≥ min_sup
8:     then insert g′ into C
9: for each g′ in C
10:   do S ← gSpan(g′, D, min_sup, S)
11: return S
Machine Learning and
Data Mining
Data Preprocessing
Albert Bifet (@abifet)
Data Basics
Machine Learning/Data Mining Applications
Business Analytics
Is this customer credit-worthy?
Is a customer willing to respond to an e-mail?
Do customers divide into similar groups?
How much is a customer going to spend next semester?
World Wide Web
Financial Analytics
Internet of Things
Image Recognition, Speech
..
The Data Mining Process
Data collection
Data Preprocessing
Feature extraction
Data cleaning
Feature selection and transformation
Analytical processing and algorithms
Data Postprocessing
Multidimensional Data
Example:
Competitor Name Swim Cycle Run Total
John T : : 8: :
Norman P 8: : : :
Alex K : 8: n/a n/a
Sarah H : : : :
Table: Triathlon results
Example or Instance
data point, transaction, entity, tuple, object, or feature-vector
Attribute or Feature
field, dimension
Instance Types
Dense
red, white, Barcelona, , up
red, red, Barcelona, , down
black, white, Paris, , up
red, green, Paris, , down
Sparse
[Example: long sparse vectors with only a few nonzero entries]
Attribute Type
Numerical
Categorical or Discrete
+, -
red, green, black
yes, no
up, down
Barcelona, Paris, London, New York
Text Data: vector-space representation
The cat is black
Binary: Categorical or Numerical
Analytical processing and algorithms
Attribute/Column Relationships
Classification: predict value of a discrete attribute
Regression: predict value of a numeric attribute
Instance/Row Relationships
Clustering: determine subsets of rows, in which the values
in the corresponding columns are similar
Outlier Detection: determine the rows that are very different
from the other rows
Big Data Scalability
Distributed Systems:
Hardware: Hadoop cluster
Software: MapReduce, Spark, Flink, Storm
Streaming Algorithms
Single pass over the data
Concept Drift
Data Preparation
The Data Mining Process
Data collection
Data Preprocessing
Feature extraction
Data cleaning
Feature selection and transformation
Analytical processing and algorithms
Data Postprocessing
Feature Extraction
Numeric to Discrete
Equi-width ranges
Equi-log ranges
Equi-depth ranges
Discrete to Numeric
Binarization: one numeric attribute for each value
Text to Numeric
remove stop words, stem data, tf-idf, multidimensional data
Time Series to Discrete Sequence Data
SAX: equi-depth discretization after window-based
averaging
Time Series to Numeric Data
Discrete Wavelet Transform
Discrete Fourier Transform
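For example, equi-width and equi-depth discretization of a numeric attribute can be sketched with NumPy (toy values; the bin counts are arbitrary):

import numpy as np

x = np.array([2.0, 3.5, 4.1, 7.8, 9.0, 15.2, 21.7, 40.3])
# Equi-width: bins of equal range over [min, max].
equi_width = np.digitize(x, np.linspace(x.min(), x.max(), 5)[1:-1])
# Equi-depth: bins holding roughly the same number of instances (quartiles).
equi_depth = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))
print(equi_width, equi_depth)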
Term Frequency-Inverse Document Frequency
Term frequency
  Boolean frequencies:
    tf(t, d) = 1 if t occurs in d and 0 otherwise
  Logarithmically scaled frequency:
    tf(t, d) = 1 + log f_{t,d}, or zero if f_{t,d} is zero
  Augmented frequency:
    tf(t, d) = 0.5 + 0.5 · f_{t,d} / max{ f_{t′,d} : t′ ∈ d }
Inverse document frequency:
  idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )
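A small sketch of these formulas in Python (augmented tf variant; the toy documents are illustrative):

import math

docs = [["the", "cat", "is", "black"], ["the", "dog", "is", "white"]]
N = len(docs)

def tf(t, d):
    # Augmented frequency: 0.5 + 0.5 * f(t,d) / max_t' f(t',d)
    return 0.5 + 0.5 * d.count(t) / max(d.count(w) for w in d)

def idf(t):
    return math.log(N / sum(t in d for d in docs))

print(tf("cat", docs[0]) * idf("cat"))  # tf-idf weight of "cat" in document 0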
Standardization:
  z_i^j = (x_i^j − μ_j) / σ_j
Normalization:
  y_i^j = (x_i^j − min_j) / (max_j − min_j)
Feature selection and transformation
Feature selection and transformation
S_Y = (1/(n−1)) · Y Yᵀ
is diagonalized.
The rows of P are the principal components of X.
Sort these principal components
Eliminate components with low variance
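A hedged NumPy sketch of this procedure (random data for illustration): center the data, diagonalize the covariance matrix, sort the components by variance, and drop the low-variance ones:

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
# Center the data, then diagonalize the covariance matrix.
Y = X - X.mean(axis=0)
S = np.cov(Y, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
# Sort principal components by decreasing variance; keep the top two.
order = np.argsort(eigvals)[::-1]
P = eigvecs[:, order[:2]]
X_reduced = Y @ P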
Clustering, classification and
evaluation
Mostafa H. Chehreghani
Mostafa.chehreghani@gmail.com
Clustering
Definition
Clustering is the partition of a set of instances (examples)
into previously unknown groups according to some common relations or
affinities.
Example
Market segmentation of customers
Example
Social network communities
Clustering
Definition
Given
  a set of instances I
  a number of clusters K
  an objective function cost(I)
a clustering algorithm computes an assignment of a cluster to
each instance
  f : I → {1, . . . , K}
that minimizes the objective function cost(I)
Clustering
Definition
Given
  a set of instances I
  a number of clusters K
  an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function
  cost(C, I) = Σ_{x∈I} d²(x, C)
where
  d(x, c): distance function between x and c
  d²(x, C) = min_{c∈C} d²(x, c): distance from x to the nearest
  point in C
k-means
External Measures
Rand Measure
F Measure
Jaccard
Purity
Distances
Numeric features
Euclidean:
  d(x, y) = ‖x − y‖₂ = √( Σ_i (x_i − y_i)² )
Manhattan distance:
  d(x, y) = ‖x − y‖₁ = Σ_i |x_i − y_i|
Density based methods
DBSCAN
ε-neighborhood(p): set of points whose distance from p is at most ε
Core object: object whose ε-neighborhood has an overall
weight of at least some minimum (MinPts)
A point p is directly density-reachable from q if
  p is in ε-neighborhood(q)
  q is a core object
A point p is density-reachable from q if
  there is a chain of points p_1, . . . , p_n such that p_{i+1} is directly
  density-reachable from p_i
A point p is density-connected to q if
  there is a point o such that p and q are density-reachable
  from o
Density based methods
DBSCAN
A cluster C of points satisfies:
  if p ∈ C and q is density-reachable from p, then q ∈ C
  all points p, q ∈ C are density-connected
A cluster is uniquely determined by any of its core points
A cluster can be obtained by
  choosing an arbitrary core point as a seed
  retrieving all points that are density-reachable from the seed
Density based methods
DBSCAN
select an arbitrary point p
retrieve all points density-reachable from p
if p is a core point, a cluster is formed
if p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
continue the process until all of the points have been
processed
DBSCAN
Presented by Zhao Li
2009, Spring
Introduction to BIRCH
Designed for very large data sets
Time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of data is necessary
Does not need the whole data set in advance
Two key phases:
Scans the database to build an in-memory tree
Applies clustering algorithm to cluster the leaf nodes
Similarity Metric
[Figure: BIRCH distance definitions D0–D4, including:
  average inter-cluster distance
  average intra-cluster distance
  variance increase distance]
Clustering Feature
The Birch algorithm builds a dendrogram called clustering
feature tree (CF tree) while scanning the data set.
Each entry in the CF tree represents a cluster of objects
and is characterized by a 3-tuple (N, LS, SS), where N is
the number of objects in the cluster, LS is their linear sum,
and SS is their square sum.
Properties of Clustering Feature
CF entry is more compact
Stores significantly less than all of the data points in
the sub-cluster
A CF entry has sufficient information to calculate
D0-D4
Additivity theorem allows us to merge sub-clusters
incrementally & consistently
CF-Tree
[Figure: structure of a CF-tree, with CF entries in the root,
nonleaf, and leaf nodes]
CF-Tree Insertion
Recurse down from root
Find the appropriate leaf
Follow the "closest"-CF path, w.r.t. D0 … D4
Modify the leaf
If the closest-CF leaf cannot absorb the new entry, make a new CF
entry. If there is no room for the new leaf, split the parent
node
Traverse back
Update CFs on the path or splitting nodes
CF-Tree Rebuilding
If we run out of space, increase threshold T
By increasing the threshold, CFs absorb more data
Rebuilding "pushes" CFs over
The larger T allows different CFs to group together
Reducibility theorem
Increasing T will result in a CF-tree smaller than the
original
Example of BIRCH
[Figure: CF-tree with leaf nodes LN1–LN3 holding subclusters
sc1–sc8; a new subcluster arrives]
Insertion Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN1 is split.
[Figure: subclusters sc1–sc8 in leaf nodes LN1–LN3 under the root;
LN1 is split into two leaves, and when the root itself overflows,
the tree grows new nonleaf nodes NLN1 and NLN2]
Experimental Results
Input parameters:
Memory (M): 5% of data set
Disk space (R): 20% of M
Distance equation: D2
Quality equation: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
Experimental Results

KMEANS clustering:
DS  Time  D     #Scan    DS  Time  D     #Scan
2   13.2  4.43  51       2o  12.7  4.20  29
3   32.9  3.66  187      3o  36.0  4.35  241

BIRCH clustering:
DS  Time  D     #Scan    DS  Time  D     #Scan
2   10.7  1.99  2        2o  12.1  1.99  2
Exam Questions
What is the main limitation of BIRCH?
Since each node in a CF tree can hold only a limited
number of entries due to its size, a CF tree node doesn't
always correspond to what a user may consider a natural
cluster. Moreover, if the clusters are not spherical in
shape, BIRCH doesn't perform well because it uses the notion
of radius or diameter to control the boundary of a
cluster.
Classification Evaluation
Evaluation Framework
Error Estimation
Holdout Evaluation
1. Error Estimation
k-fold Cross-validation
2. Evaluation performance measures
                  Predicted Class+   Predicted Class-   Total
Correct Class+          75                  8             83
Correct Class-           7                 10             17
Total                   82                 18            100
Table: Simple confusion matrix example
2. Evaluation performance measures
                  Predicted Class+   Predicted Class-   Total
Correct Class+          tp                 fn            tp+fn
Correct Class-          fp                 tn            fp+tn
Total                  tp+fp              fn+tn             N
Table: Simple confusion matrix example
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
F1 = 2 · (precision · recall) / (precision + recall)
2. Evaluation performance measures
                  Predicted Class+   Predicted Class-   Total
Correct Class+          75                  8             83
Correct Class-           7                 10             17
Total                   82                 18            100
Table: Simple confusion matrix example
Accuracy = (75 + 10)/100 = (75/83) · (83/100) + (10/17) · (17/100) = 85%
Arithmetic mean = (75/83 + 10/17) / 2 = 74.59%
Geometric mean = √( (75/83) · (10/17) ) = 72.90%
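These numbers can be checked directly in Python from the confusion matrix above:

tp, fn, fp, tn = 75, 8, 7, 10
precision = tp / (tp + fp)                   # 0.9146
recall = tp / (tp + fn)                      # 0.9036
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)   # 0.85
arithmetic_mean = (tp / (tp + fn) + tn / (fp + tn)) / 2     # 0.7459
geometric_mean = (tp / (tp + fn) * tn / (fp + tn)) ** 0.5   # 0.7290
print(accuracy, arithmetic_mean, geometric_mean)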
2. Performance Measures with Unbalanced Classes
                  Predicted Class+   Predicted Class-   Total
Correct Class+          75                  8             83
Correct Class-           7                 10             17
Total                   82                 18            100
Table: Simple confusion matrix example

                  Predicted Class+   Predicted Class-   Total
Correct Class+        68.06              14.94            83
Correct Class-        13.94               3.06            17
Total                    82                 18            100
Table: Confusion matrix for chance predictor
2. Performance Measures with Unbalanced Classes
Kappa Statistic
p_0: classifier's prequential accuracy
p_c: probability that a chance classifier makes a correct
prediction.
κ statistic:
  κ = (p_0 − p_c) / (1 − p_c)
κ = 1 if the classifier is always correct
κ = 0 if the predictions coincide with the correct ones as
often as those of the chance classifier
Matthews correlation coefficient:
  MCC = (tp · tn − fp · fn) / √( (tp + fp)(tp + fn)(tn + fp)(tn + fn) )
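Both statistics can be computed from the example matrix; the chance probability p_c comes from the marginals, as in the chance-predictor table above:

import math

tp, fn, fp, tn = 75, 8, 7, 10
N = tp + fn + fp + tn
p0 = (tp + tn) / N                                            # observed accuracy: 0.85
pc = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / N**2   # chance accuracy: 0.7112
kappa = (p0 - pc) / (1 - pc)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(kappa, 4), round(mcc, 4))  # ~0.4806, ~0.4809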
2. Evaluation performance measures
                  Predicted Class+   Predicted Class-   Total
Correct Class+          tp                 fn            tp+fn
Correct Class-          fp                 tn            fp+tn
Total                  tp+fp              fn+tn             N
Table: Simple confusion matrix example
McNemar test
  M = ( |a − b| − 1 )² / (a + b)
where a and b count the examples misclassified by exactly one of
the two classifiers. The test statistic follows the χ² distribution.
At 0.99 confidence it rejects the null hypothesis (the performances
are equal) if M > 6.635.
3. Statistical significance validation (> 2 Classifiers)
Two classifiers are performing differently if the corresponding
average ranks differ by at least the critical difference
  CD = q_α · √( k(k + 1) / (6N) )
Nemenyi test

# classifiers    2      3      4      5      6      7
q_0.05         1.960  2.343  2.569  2.728  2.850  2.949
q_0.10         1.645  2.052  2.291  2.459  2.589  2.693
Table: Critical values for the Nemenyi test
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
How to organize the Web?
First try: Human curated
Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval attempts to
find relevant docs in a small
and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents,
random things, web spam, etc.
So we need a good way to rank webpages!
2 challenges of web search:
(1) Web contains many sources of information
Who to trust?
Insight: Trustworthy pages may point to each other!
(2) What is the best answer to the query "newspaper"?
No single right answer
Insight: Pages that actually know about newspapers
might all be pointing to many newspapers
All web pages are not equally important
www.joe-schmoe.com vs. www.stanford.edu
We already know:
There is large diversity in the web-graph node connectivity.
So, let's rank the pages using the web graph link structure!
We will cover the following Link Analysis
approaches to computing importance of
nodes in a graph:
Hubs and Authorities (HITS)
Page Rank
Random Walk with Restarts
Each page starts with hub score 1
Authorities collect their votes
[Kleinberg '98]
Each page i has 2 scores:
  Authority score: a_i
  Hub score: h_i

HITS algorithm:
  Initialize: a_i^(0) = 1/√n, h_i^(0) = 1/√n
  Then keep iterating until convergence:
    ∀i: Authority: a_i^(t+1) = Σ_{j→i} h_j^(t)
    ∀i: Hub: h_i^(t+1) = Σ_{i→j} a_j^(t)
    Normalize: Σ_i ( a_i^(t+1) )² = 1,  Σ_j ( h_j^(t+1) )² = 1
Convergence criteria:
  Σ_i ( a_i^(t) − a_i^(t−1) )² < ε  and  Σ_i ( h_i^(t) − h_i^(t−1) )² < ε
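A minimal NumPy sketch of these updates (the 3-page adjacency matrix is a made-up toy graph, not from the slides):

import numpy as np

# A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
n = A.shape[0]
a = np.ones(n) / np.sqrt(n)   # authority scores
h = np.ones(n) / np.sqrt(n)   # hub scores
for _ in range(100):
    a_new = A.T @ h           # authority: sum of hub scores of in-neighbours
    h_new = A @ a_new         # hub: sum of authority scores of out-neighbours
    a = a_new / np.linalg.norm(a_new)   # normalize every step
    h = h_new / np.linalg.norm(h_new)
print(a, h)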
[Kleinberg '98]
Details!
HITS in vector notation, with adjacency matrix A (A_ij = 1 if i→j):
  h = A · a,  a = Aᵀ · h
a is updated (in 2 steps):
  a = Aᵀ(A a) = (AᵀA) a
h is updated (in 2 steps):
  h = A(Aᵀ h) = (A Aᵀ) h
Thus, in 2k steps:
  a = (AᵀA)^k a
  h = (A Aᵀ)^k h
Repeated matrix powering
Definition: Eigenvectors & Eigenvalues
Let A x = λ x
for some scalar λ, vector x, matrix A
Then x is an eigenvector, and λ is its eigenvalue
The steady state (HITS has converged) is:
  c · a = (AᵀA) a,  c′ · h = (A Aᵀ) h
Note: the constants c, c′ don't matter, as we normalize the scores
in every step of HITS. So a is an eigenvector of AᵀA and h is an
eigenvector of A Aᵀ.
A vote from an important page is worth more.
Each link's vote is proportional to the importance of its source
page: a page i with 3 out-links passes r_i/3 along each link, a
page k with 4 out-links passes r_k/4.
[Figure: page j receiving votes r_i/3 and r_k/4 from pages i and k]
[Figure: pages i1, i2, i3 all linking to page j]
Where is the surfer at time t+1?
  Follows a link uniformly at random:
  p(t + 1) = M · p(t)
Suppose the random walk reaches a state
  p(t + 1) = M · p(t) = p(t)
then p(t) is a stationary distribution of the random walk.
Our original rank vector r satisfies r = M · r.
So, r is a stationary distribution for the random walk.
Given a web graph with n nodes, where the nodes are pages and
edges are hyperlinks:
  Assign each node an initial page rank
  Repeat until convergence ( Σ_i |r_i^(t+1) − r_i^(t)| < ε ):
    Calculate the page rank of each node:
      r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
      d_i ... out-degree of node i
Power Iteration:
  Set r_j = 1/N
  1: r′_j = Σ_{i→j} r_i / d_i
  2: r = r′
  If |r − r′| > ε: goto 1

      y    a    m
 y   1/2  1/2   0
 a   1/2   0    1
 m    0   1/2   0

  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2

Example:
  r_y   1/3  1/3  5/12   9/24  …  6/15
  r_a = 1/3  3/6  1/3   11/24  …  6/15
  r_m   1/3  1/6  3/12   1/6   …  3/15
  Iteration 0, 1, 2, …
r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
or equivalently
r = M · r
The Spider trap problem:
[Figure: a → b, with b linking only to itself]
  r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
Example:
  Iteration:  0, 1, 2, 3 …
  r_a = 1  0  0  0
  r_b = 0  1  1  1
The Dead end problem:
[Figure: a → b, with b having no out-links]
  r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
Example:
  Iteration:  0, 1, 2, 3 …
  r_a = 1  0  0  0
  r_b = 0  1  0  0
2 problems:
(1) Some pages are dead ends (have no out-links)
    Such pages cause importance to "leak out"
Power Iteration:
      y    a    m
 y   1/2  1/2   0
 a   1/2   0    0
 m    0   1/2   1
  Set r_j = 1/N
  r_j = Σ_{i→j} r_i / d_i
And iterate:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2
  r_m = r_a/2 + r_m
Example:
  r_y   1/3  2/6  3/12   5/24  …  0
  r_a = 1/3  1/6  2/12   3/24  …  0
  r_m   1/3  3/6  7/12  16/24  …  1
  Iteration 0, 1, 2, …
The Google solution for spider traps: At each
time step, the random surfer has two options:
  With prob. β, follow a link at random
  With prob. 1 − β, jump to a random page
Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a
few time steps
[Figure: y/a/m graph before and after adding teleport links]
Power Iteration:
      y    a    m
 y   1/2  1/2   0
 a   1/2   0    0
 m    0   1/2   0
  Set r_j = 1/N
  r_j = Σ_{i→j} r_i / d_i
And iterate:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2
  r_m = r_a/2
Example:
  r_y   1/3  2/6  3/12  5/24  …  0
  r_a = 1/3  1/6  2/12  3/24  …  0
  r_m   1/3  1/6  1/12  2/24  …  0
  Iteration 0, 1, 2, …
Teleports: Follow random teleport links with
probability 1.0 from dead-ends
Adjust matrix accordingly

Before:                 After:
      y    a    m            y    a    m
 y   1/2  1/2   0       y   1/2  1/2  1/3
 a   1/2   0    0       a   1/2   0   1/3
 m    0   1/2   0       m    0   1/2  1/3
Google's solution: At each step, the random
surfer has two options:
  With probability β, follow a link at random
  With probability 1 − β, jump to some random page
PageRank equation [Brin-Page, '98]:
  r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N
  (d_i ... out-degree of node i)
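A hedged sketch of power iteration with teleports, using the column-stochastic matrix of the y/a/m example from the earlier slides:

import numpy as np

# Column j holds the out-link distribution of page j (y, a, m).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
n = M.shape[0]
beta = 0.8
r = np.ones(n) / n
for _ in range(100):
    # r_j = sum_{i->j} beta * r_i / d_i + (1 - beta) / N
    r = beta * (M @ r) + (1 - beta) / n
print(r)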
[Tong-Faloutsos, '06]
[Figure: example graph on nodes A–J with unit edge weights]
Conferences-to-authors graph: conferences (IJCAI, KDD, ICDM, NIPS)
linked to the authors who publish there (e.g. Philip S. Yu, Ning Zhong)
Goal: Proximity on graphs
Conference ↔ Author
Shortest path is not good:
[Tong-Faloutsos, '06]
[Figure: example graph on nodes A–J with unit edge weights]
  Multiple connections
  Quality of connection
  Direct & indirect connections
  Length, degree, weight
[Figure: example graph with nodes numbered 1–12]
Goal: Evaluate pages not just by popularity
but by how close they are to the topic
Teleporting can go to:
  Any page with equal probability
    PageRank (we used this so far)
  A topic-specific set of relevant pages
    Topic-specific (personalized) PageRank (S ... teleport set)
  M′_ij = β M_ij + (1 − β)/|S|  if i ∈ S
  M′_ij = β M_ij                otherwise
Random Walk with Restart: S is a single element
Graphs and web search:
  Ranks nodes by importance
Proximity on graphs: e.g., which authors are closest to the
ICDM conference?
Teleport back to the starting node: S = { single node }
[Figure: conference-author graph with IJCAI, ICDM, NIPS]
[Figure: Random Walk with Restart scores relative to query Node 4]
  Node 1   0.13
  Node 2   0.10
  Node 3   0.13
  Node 4   (query)
  Node 5   0.13
  Node 6   0.05
  Node 7   0.05
  Node 8   0.08
  Node 9   0.04
  Node 10  0.03
  Node 11  0.04
  Node 12  0.02
[Figure: proximity scores of conferences to ICDM — KDD 0.011,
PAKDD 0.009, PKDD 0.008, SDM 0.007, and ICML, CIKM, ICDE, ECML,
SIGMOD, DMKD between 0.004 and 0.005]
Q: Which conferences are closest to KDD & ICDM?
A: Personalized PageRank with teleport set S = {KDD, ICDM}
[Figure: graph of CS conferences]
Pins belong to Boards
[Figure: Pinterest pin-board bipartite graph]
[Figure: example inputs (query pins) and outputs (recommended pins)]
Proximity to query node(s) Q
Pixie Random Walk
[Figure: random-walk visit counts spreading out from the query Q]
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Observations Models Algorithms
We often think of networks looking
like this:
Triadic closure == High clustering coefficient
Reasons for triadic closure:
If B and C have a friend A in common, then:
  B is more likely to meet C
  (since they both spend time with A)
  B and C trust each other
  (since they have a friend in common)
  A has incentive to bring B and C together
  (as it is hard for A to maintain two disjoint relationships)
Empirical study by Bearman and Moody:
  Teenage girls with low clustering coefficient are
  more likely to contemplate suicide
Bridge
Define: Bridge edge: if removed, it disconnects the graph
[Figure: bridge edge a–b]
Claim: If node A satisfies Strong Triadic Closure
and is involved in at least two strong ties, then
any local bridge adjacent to A must be a weak tie.
Proof by contradiction:
  Assume A satisfies Strong Triadic Closure and has 2 strong ties.
  Suppose the local bridge A–B is a strong tie.
  Then the edge B–C must exist, because of Strong Triadic Closure.
  But then A–B is not a bridge!
  (since B–C must be connected due to the Strong Triadic Closure property)
For many years Granovetter's theory was not tested.
But, today we have large who-talks-to-whom graphs:
  Email, Messenger, Cell phones, Facebook
Edge overlap:
  O_ij = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|
  N(i): the set of neighbors of node i
Overlap = 0 when an edge is a local bridge
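A small sketch of the overlap computation (here N(i) excludes the endpoints i and j themselves, a common convention that is an assumption on our part; the adjacency dict is a toy graph):

def edge_overlap(adj, i, j):
    # Neighbour sets, excluding the endpoints of the edge itself.
    Ni = adj[i] - {i, j}
    Nj = adj[j] - {i, j}
    union = Ni | Nj
    return len(Ni & Nj) / len(union) if union else 0.0

adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1, 5}, 5: {4}}
print(edge_overlap(adj, 1, 2))  # shared neighbour 3 out of {3, 4} -> 0.5
print(edge_overlap(adj, 4, 5))  # local bridge -> 0.0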
Cell-phone network
Observation: highly used links have high neighborhood overlap!
Legend:
  True: the data
  Permuted strengths: keep the network structure
  but randomly reassign edge strengths
[Figure: neighborhood overlap vs. edge strength (#calls),
for the true and permuted data]
Real edge strengths in mobile call graph
Strong ties are more embedded (have higher overlap)
Same network, same set of edge strengths
but now strengths are randomly shuffled
[Figure: size of largest component as edges are removed in order
of strength or overlap — removing the low-strength (weak) ties
first disconnects the network sooner]
Granovetter's theory suggests that networks are composed of
tightly connected sets of nodes.
Network communities:
  Sets of nodes with lots of connections inside and
  few to outside (the rest of the network)
  Also called: communities, clusters, groups, modules
How to automatically
find such densely
connected groups of
nodes?
Ideally such automatically detected clusters would
then correspond to real groups.
For example: communities, clusters, groups, modules.
Zachary's Karate club network:
  Observe social ties and rivalries in a university karate club
  During his observation, conflicts led the group to split
  Split could be explained by a minimum cut in the network
Find micro-markets by partitioning the
query × advertiser graph:
[Figure: bipartite query-advertiser graph with dense blocks
marking micro-markets]
Can we identify
node groups?
(communities,
modules, clusters)
Nodes: Teams
Edges: Games played
NCAA conferences
Nodes: Teams
Edges: Games played
Can we identify
social communities?
Nodes: Users
Edges: Friendships
[Figure: ego-network with social communities labeled High school,
Company, Stanford (Basketball), Stanford (Squash)]
Nodes: Users
Edges: Friendships
Can we identify
functional modules?
Nodes: Proteins
Edges: Interactions
Functional modules
Nodes: Proteins
Edges: Interactions
How to find communities?
[Figure: graph with edge betweenness values (e.g. 12, 1, 33, 49);
the bridging edges carry the highest betweenness]
Need to re-compute betweenness at
every step
[Figure: Girvan-Newman edge removal — the components after
removal step 1 and step 2]
1. How to compute betweenness?
2. How to select the number of
clusters?
Want to compute betweenness of paths starting at a given node.
Breadth-first search starting from that node:
[Figure: BFS layers of the network, starting node at depth 0]
Forward step: Count the number of shortest
paths from the starting node to all other nodes of the
network
Backward step: Compute betweenness: if
there are multiple paths, count them
fractionally
The algorithm:
  Add edge flows: node flow = 1 + Σ child edges
  Split the flow up based on the parent value
  Repeat the BFS procedure for each starting node
[Figure: 1+1 paths to H, split evenly; 1+0.5 paths to J,
split 1:2; 1 path to K, split evenly]
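For illustration, NetworkX (assumed available) ships this procedure; each step removes the highest-betweenness edge and re-computes betweenness:

import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()              # Zachary's karate club from earlier
communities = next(girvan_newman(G))    # first split: two communities
print([sorted(c) for c in communities])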
1. How to compute betweenness?
2. How to select the number of
clusters?
Communities: sets of tightly connected nodes
Define: Modularity Q
  A measure of how well a network is partitioned
  into communities
  Given a partitioning of the network into groups s ∈ S:
    Q ∝ Σ_{s∈S} [ (# edges within group s)
                  − (expected # edges within group s) ]
Need a null model!
Given real G on n nodes and m edges,
construct rewired network G′:
  Same degree distribution but random connections
  Consider G′ as a multigraph
The expected number of edges between nodes i and j of
degrees k_i and k_j equals: k_i · k_j / 2m
The expected number of edges in (multigraph) G′:
  ½ Σ_i Σ_j k_i k_j / 2m = (1/4m) · (Σ_i k_i) · (Σ_j k_j)
  = (1/4m) · 2m · 2m = m
Note: Σ_i k_i = 2m
Modularity of partitioning S of graph G:
  Q ∝ Σ_{s∈S} [ (# edges within group s)
                − (expected # edges within group s) ]
  Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i k_j / 2m )
  A_ij = 1 if i→j, 0 else
Normalizing constant: −1 < Q < 1
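A direct NumPy sketch of this formula (δ is realized as a boolean membership comparison; the two-triangle graph is a toy example):

import numpy as np

def modularity(A, labels):
    # Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * delta(c_i, c_j)
    k = A.sum(axis=1)
    two_m = A.sum()
    delta = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

# Two triangles joined by a single edge, split into the two obvious groups.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))  # 5/14 ~ 0.357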
Modularity is useful for selecting the
number of clusters:
  Q(G, S) = (1/2m) Σ_{i,j} ( A_ij − k_i k_j / 2m ) δ(c_i, c_j)
[Figure: modularity Q as a function of the number of clusters]
Define the modularity matrix B:  B_ij = A_ij − k_i k_j / 2m
Note: each row/col of B sums to 0:
  Σ_j B_ij = k_i − k_i · (Σ_j k_j)/2m = 0
Membership vector: s ∈ {−1, +1}^n
Then:
  Q(G, s) = (1/4m) Σ_{i,j} B_ij s_i s_j = (1/4m) sᵀ B s
To maximize Q, express s in the eigenbasis of B and set
  s_j = +1 if the j-th coordinate of the leading eigenvector of B is positive
  s_j = −1 otherwise
Girvan-Newman:
Based on the strength of weak ties
Remove edge of highest betweenness
Modularity:
Overall quality of the partitioning of a graph
Use to determine the number of communities
Fast modularity optimization:
  Transform the modularity optimization into an
  eigenvalue problem
Machine Learning and
Data Mining
Introduction
Albert Bifet(@abifet)
Data Science
Figure: http://www.marketingdistillery.com/2014/11/29/
is-data-science-a-buzzword-modern-data-scientist-defined/
Data Science
Denition
Given nC different classes, a classier algorithm builds a model
that predicts for every unlabelled instance I the class C to which
it belongs with accuracy.
Example
A spam lter
Example
Twitter Sentiment analysis: analyze tweets with positive or
negative feelings
Classication
Example
Contains Domain Has Time
Data set that Money type attach. received spam
describes e-mail yes com yes night yes
features for yes edu no night yes
deciding if it is no com yes night yes
spam. no edu no day no
no com no day no
yes cat no day yes
k-NN Classier
Training: store all instances in memory
Prediction:
Find the k nearest instances
Output majority class of these k instances
Bayes Classiers
Nave Bayes
Based on Bayes Theorem:
P(c)P(d|c)
P(c|d) =
P(d)
prior likelikood
posterior =
evidence
Estimates the probability of observing attribute a and the
prior probability P(c)
Probability of class c given an instance d:
Attribute w
Attribute w
Attribute w
Attribute w
Attribute w
Attribute w
Attribute w
s (x) = /( + e x )
x1 x2 x3 x4 x5
E(~x,~z)
P(~x,~z) e .
E(~x,~z) = W.
Time
Day Night
Contains Money YES
Yes No
YES NO
Decision Trees
Example
Dataset of Instances : A, B, C, D
Classier : B, A, C, B
Classier : D, B, A, D
Classier : B, A, C, B
Classier : B, C, B, B
Classier : D, C, A, C
Bagging
Random Trees: trees that in each node only uses a random
subset of the attributes
Random Forests is one of the most popular methods in
machine learning.
Boosting
ht : X ! { , +}
AdaBoost
: Initialize D (i) = /m for all i 2 {, , ..., m}
: for t = ,,...T do
: Call WeakLearn, providing it with distribution Dt
: Get back hypothesis ht : X ! Y
: Calculate error of ht : et = i:ht (xi )6=yi Dt (i)
6: Update distribution
et /( et ) if ht (xi ) = yi
Dt : Dt+ (i) = DZt (i)
t otherwise
where Zt is a normalization constant (chosen so Dt+ is a
probability distribution)
: return hn (x) = arg maxy2Y t:ht (x)=y log et /( et )
Boosting
AdaBoost
: Initialize D (i) = /m for all i 2 {, , ..., m}
: for t = ,,...T do
: Call WeakLearn, providing it with distribution Dt
: Get back hypothesis ht : X ! Y
: Calculate error of ht : et = i:ht (xi )6=yi Dt (i)
6: Update distribution
et if ht (xi ) = yi
Dt : Dt+ (i) = DZt (i)
t et otherwise
where Zt is a normalization constant (chosen so Dt+ is a
probability distribution)
: return hn (x) = arg maxy2Y t:ht (x)=y log et /( et )
Stacking
Denition
Clustering is the distribution of a set of instances of examples
into non-known groups according to some common relations or
afnities.
Example
Market segmentation of customers
Example
Social network communities
Clustering
Denition
Given
a set of instances I
a number of clusters K
an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
f : I ! {, . . . , K}
that minimizes the objective function cost(I)
Clustering
Denition
Given
a set of instances I
a number of clusters K
an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function
cost(C, I) = d (x, C)
x2I
where
d(x, c): distance function between x and c
d (x, C) = minc2C d (x, c): distance from x to the nearest
point in C
k-means
Internal Measures
Sum square distance
Dunn index D = ddmin
max
C-Index C = S S Smin
max S min
External Measures
Rand Measure
F Measure
Jaccard
Purity
Density based methods
DBSCAN
e-neighborhood(p): set of points that are at a distance of p
less or equal to e
Core object: object whose e-neighborhood has an overall
weight at least
A point p is directly density-reachable from q if
p is in e-neighborhood(q)
q is a core object
A point p is density-reachable from q if
there is a chain of points p , . . . , pn such that pi+ is directly
density-reachable from pi
A point p is density-connected from q if
there is point o such that p and q are density-reachable
from o
Density based methods
DBSCAN
A cluster C of points satises
if p 2 C and q is density-reachable from p, then q 2 C
all points p, q 2 C are density-connected
A cluster is uniquely determined by any of its core points
A cluster can be obtained
choosing an arbitrary core point as a seed
retrieve all points that are density-reachable from the seed
DBSCAN
DBSCAN
select an arbitrary point p
retrieve all points density-reachable from p
if p is a core point, a cluster is formed
If p is a border point
no points are density-reachable from p
DBSCAN visits the next point of the database
Continue the process until all of the points have been
processed
Frequent Patterns
Denition Denition
Support (t): number of Pattern t is frequent if
patterns in D that are Support (t) min sup.
superpatterns of t.
Denition Denition
Support (t): number of Pattern t is frequent if
patterns in D that are Support (t) min sup.
superpatterns of t.
Denition Denition
Support (t): number of Pattern t is frequent if
patterns in D that are Support (t) min sup.
superpatterns of t.
Denition Denition
Support (t): number of Pattern t is frequent if
patterns in D that are Support (t) min sup.
superpatterns of t.
Dataset Example
Document Patterns
d abce
d cde
d abce
d acde
d abcde
d6 bcd
Itemset Mining
Support Frequent
d abce d,d,d,d,d,d6 c
d cde
d,d,d,d,d e,ce
d abce
d,d,d,d a,ac,ae,ace
d acde
d,d,d,d6 b,bc
d abcde
d,d,d,d6 d,cd
d6 bcd
d,d,d ab,abc,abe
be,bce,abce
d,d,d de,cde
minimal support =
Itemset Mining
Support Frequent
d abce 6 c
d cde
e,ce
d abce
a,ac,ae,ace
d acde
b,bc
d abcde
d,cd
d6 bcd
ab,abc,abe
be,bce,abce
de,cde
Itemset Mining
A priori property
If t0 is a subpattern of t, then Support (t0 ) Support (t).
Denition
A frequent pattern t is closed if none of its proper superpatterns
has the same support as it has.
Denition
A frequent pattern t is maximal if none of its proper
superpatterns is frequent.
Search:
Breadth-rst (levelwise): Apriori
Depth-rst: Eclat, FP-Growth
The Apriori Algorithm
A A
Depth-First Search
divide-and-conquer scheme : the problem is processed by
splitting it into smaller subproblems, which are then
processed recursively
conditional database for the prex a
transactions that contain a
conditional database for item sets without a
transactions that not contain a
Vertical representation
Support counting is done by intersecting lists of
transaction identiers
The FP-Growth Algorithm
Depth-First Search
divide-and-conquer scheme : the problem is processed by
splitting it into smaller subproblems, which are then
processed recursively
conditional database for the prex a
transactions that contain a
conditional database for item sets without a
transactions that not contain a
Vertical and Horizontal representation : FP-Tree
prex tree with links between nodes that correspond to the
same item
Support counting is done using FP-Tree
Mining Graph Data
Problem
Given a data set D of graphs, nd frequent graphs.
Transaction Id Graph
O
C C S N
O
O
C C S N
C
N
C C S N
The gSpan Algorithm
if g 6= min(g)
then return S
insert g into S
update support counter structure
C 0/
6 for each g0 that can be right-most
extended from g in one step
do if support(g) min sup
8 then insert g0 into C
for each g0 in C
do S S(g0 , D, min sup, S)
return S
Machine Learning and
Data Mining
Data Preprocessing
Albert Bifet(@abifet)
Data Basics
Machine Learning/Data Mining Applications
Business Analytics
Is this costumer credit-worthy?
Is a costumer willing to respond to an email?
Do costumers divide in similar groups?
How much a costumer is going to spend next semester?
World Wide Web
Financial Analytics
Internet of Things
Image Recognition, Speech
..
The Data Mining Process
Data collection
Data Preprocesing
Feature extraction
Data cleaning
Feature selection and transformation
Analytical processing and algorithms
Data Postprocesing
Multidimensional Data
Example:
Competitor Name Swim Cycle Run Total
John T : : 8: :
Norman P 8: : : :
Alex K : 8: n/a n/a
Sarah H : : : :
Table: Triathlon results
Example or Instance
data point, transaction, entity, tuple, object, or feature-vector
Attribute or Feature
eld, dimension
Instance Types
Dense
red, white, Barcelona, , up
red, red, Barcelona, , down
black, white, Paris, , up
red, green, Paris, , down
Sparse
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
, , , , , , , , , , , , , , , , , , , , , ,
Attribute Type
Numerical
, , ., ., .
Categorical or Discrete
+, -
red, green, black
yes, no
up, down
Barcelona, Paris, London, New York
Text Data: vector-space representation
The cat is black
Binary: Categorical or Numerical
Analytical processing and algorithms
Attribute/Column Relationships
Classication : predict value of a discrete attribute
Regression: predict value of a numeric attribute
Instance/Row Relationships
Clustering: determine subsets of rows, in which the values
in the corresponding columns are similar
Outlier Detection: determine the rows that are very different
from the other rows
Big Data Scalability
Distributed Systems:
Hardware: Hadoop cluster
Software: MapReduce, Spark, Flink, Storm
Streaming Algorithms
Single pass over the data
Concept Drift
Data Preparation
The Data Mining Process
Data collection
Data Preprocesing
Feature extraction
Data cleaning
Feature selection and transformation
Analytical processing and algorithms
Data Postprocesing
Feature Extraction
Numeric to Discrete
Equi-width ranges
Equi-log ranges
Equi-depth ranges
Discrete to Numeric
Binarization: one numeric attribute for each value
Text to Numeric
remove stop words, stem data, tf-idf, multidimensional data
Time Series to Discrete Sequence Data
SAX: equi-depth discretization after window-based
averaging
Time Series to Numeric Data
Discrete Wavelet Transform
Discrete Fourier Transform
Term Frequency-Inverse Document Frequency
Term frequency
Boolean frequencies
tf(t, d) = if t occurs in d and otherwise;
Logarithmically scaled frequency
tf(t, d) = + logft,d , or zero if ft,d is zero;
Augmented frequency,
ft,d
tf(t, d) = . + .
max{ft 0 ,d : t 0 2 d}
N
idf(t, D) = log
|{d 2 D : t 2 d}|
xij - j
zji =
j
Normalization:
xij - minj
yij =
maxj - min j
Feature selection and transformation
R S
Figure: Algorithm R S
Feature selection and transformation
SY = YY T
n-
is diagonalized.
The rows of P are the principal components of X.
Sort these principal components
Eliminate components with low variance
Clustering, classification and
evaluation
Mostafa H. Chehreghani
Mostafa.chehreghani@gmail.com
Clustering
Definition
Clustering is the distribution of a set of instances of examples
into non-known groups according to some common relations or
affinities.
Example
Market segmentation of customers
Example
Social network communities
Clustering
Definition
Given
I a set of instances I
I a number of clusters K
I an objective function cost(I)
a clustering algorithm computes an assignment of a cluster for
each instance
f : I {1, . . . , K }
that minimizes the objective function cost(I)
Clustering
Definition
Given
I a set of instances I
I a number of clusters K
I an objective function cost(C, I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function
X
cost(C, I) = d 2 (x, C)
xI
where
I d(x, c): distance function between x and c
I d 2 (x, C) = mincC d 2 (x, c): distance from x to the nearest
point in C
k-means
External Measures
I Rand Measure
I F Measure
I Jaccard
I Purity
Distances
Numeric features
I Euclidean:
X
d(x, y ) = ||x y ||2 = (xi yi )2
I Manhattan distance:
X
d(x, y) = ||x y||1 = |xi yi |
Density based methods
DBSCAN
I -neighborhood(p): set of points that are at a distance of p
less or equal to
I Core object: object whose -neighborhood has an overall
weight at least
I A point p is directly density-reachable from q if
I p is in -neighborhood(q)
I q is a core object
I A point p is density-reachable from q if
I there is a chain of points p1 , . . . , pn such that pi+1 is directly
density-reachable from pi
I A point p is density-connected from q if
I there is point o such that p and q are density-reachable
from o
Density based methods
DBSCAN
I A cluster C of points satisfies
I if p C and q is density-reachable from p, then q C
I all points p, q C are density-connected
I A cluster is uniquely determined by any of its core points
I A cluster can be obtained
I choosing an arbitrary core point as a seed
I retrieve all points that are density-reachable from the seed
Density based methods
DBSCAN
I select an arbitrary point p
I retrieve all points density-reachable from p
I if p is a core point, a cluster is formed
I If p is a border point
I no points are density-reachable from p
I DBSCAN visits the next point of the database
I Continue the process until all of the points have been
processed
DBSCAN
Presented by Zhao Li
2009, Spring
Introduction to BIRCH
Designed for very large data sets
Time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of data is necessary
Does not need the whole data set in advance
Two key phases:
Scans the database to build an in-memory tree
Applies clustering algorithm to cluster the leaf nodes
September 1, 2017 2
Similarity Metric(1)
September 1, 2017 3
Similarity Metric(2)
average inter-cluster:
average intra-cluster:
variance increase:
September 1, 2017 4
Clustering Feature
The Birch algorithm builds a dendrogram called clustering
feature tree (CF tree) while scanning the data set.
Each entry in the CF tree represents a cluster of objects
and is characterized by a 3-tuple: (N, LS, SS), where N is
the number of objects in the cluster and LS, SS are defined
in the following.
September 1, 2017 5
Properties of Clustering Feature
CF entry is more compact
Stores significantly less than all of the data points in
the sub-cluster
A CF entry has sufficient information to calculate
D0-D4
Additivity theorem allows us to merge sub-clusters
incrementally & consistently
September 1, 2017 6
CF-Tree
September 1, 2017 7
CF-Tree Insertion
Recurse down from root
Find the appropriate leaf
Follow the "closest"-CF path, w.r.t. D0 / / D4
Modify the leaf
If the closest-CF leaf cannot absorb, make a new CF
entry. If there is no room for new leaf, split the parent
node
Traverse back
Update CFs on the path or splitting nodes
September 1, 2017 8
CF-Tree Rebuilding
If we run out of space, increase threshold T
By increasing the threshold, CFs absorb more data
Rebuilding "pushes" CFs over
The larger T allows different CFs to group together
Reducibility theorem
Increasing T will result in a CF-tree smaller than the
original
September 1, 2017 9
Example of BIRCH
New subcluster
sc8 sc3
sc1 sc4 sc5 sc6 sc7
September 1, 2017 10
Insertion Operation in BIRCH
If the branching factor of a leaf node can not exceed 3, then LN1 is split.
sc8 sc3
sc1 sc4 sc5 sc6 sc7
sc2
LN1 LN2 LN3
LN1 Root
LN1 LN2 LN3
LN1
sc8 sc3
sc1 sc4 sc5 sc6 sc7
sc2
LN1 LN2 LN3
LN1 Root
NLN1 NLN2
LN1
LN1 LN2 LN3
September 1, 2017 13
Experimental Results
Input parameters:
Memory (M): 5% of data set
Disk space (R): 20% of M
Distance equation: D2
Quality equation: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
September 1, 2017 14
Experimental Results
KMEANS clustering
BIRCH clustering
2 13.2 4.43 51 2o 12.7 4.20 29
DS
3 Time
32.9 D
3.66 # 187
Scan DS
3o Time
36.0 D
4.35 # 241
Scan
September 1, 2017 15
2 10.7 1.99 2 2o 12.1 1.99 2
Exam Questions
What is the main limitation of BIRCH?
Since each node in a CF tree can hold only a limited
number of entries due to the size, a CF tree node doesnt
always correspond to what a user may consider a nature
cluster. Moreover, if the clusters are not spherical in
shape, it doesnt perform well because it uses the notion
of radius or diameter to control the boundary of a
cluster.
September 1, 2017 16
Classification Evaluation
Evaluation Framework
Error Estimation
Holdout Evaluation
1. Error Estimation
k-fold Cross-validation
2. Evaluation performance measures
Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example
2. Evaluation performance measures
Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example
tp
I Precision = tp+fp
tp
I Recall = tp+fn
precisionrecall
I F1 = 2 precision+recall
2. Evaluation performance measures
Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example
75 10 75 83 10 17
I Accuracy = 100 +100 = 83 100 + 17 100 = 85%
I Arithmetic mean = ( 75 10
83 + 17 )/2 = 74.59%
q
75 10
I Geometric mean = 83 17 = 72.90%
2. Performance Measures with Unbalanced Classes
Predicted Predicted
Class+ Class- Total
Correct Class+ 75 8 83
Correct Class- 7 10 17
Total 82 18 100
Table: Simple confusion matrix example
Predicted Predicted
Class+ Class- Total
Correct Class+ 68.06 14.94 83
Correct Class- 13.94 3.06 17
Total 82 18 100
Table: Confusion matrix for chance predictor
2. Performance Measures with Unbalanced Classes
Kappa Statistic
I p0 : classifiers prequential accuracy
I pc : probability that a chance classifier makes a correct
prediction.
I statistic
p0 pc
=
1 pc
I = 1 if the classifier is always correct
I = 0 if the predictions coincide with the correct ones as
often as those of the chance classifier
tp tn fp fn
p
(tp + fp)(tp + fn)(tn + fp)(tn + fn)
2. Evaluation performance measures
Predicted Predicted
Class+ Class- Total
Correct Class+ tp fn tp+fn
Correct Class- fp tn fp+tn
Total tp+fp fn+tn N
Table: Simple confusion matrix example
M = |a b 1|2 /(a + b)
The test follows the 2 distribution. At 0.99 confidence it rejects
the null hypothesis (the performances are equal) if M > 6.635.
McNemar test
3. Statistical significance validation (> 2 classifiers): Nemenyi test
Two classifiers are performing differently if the corresponding
average ranks differ by at least the critical difference
    CD = q_α · sqrt( k(k + 1) / (6N) )
where k is the number of classifiers and N the number of data sets.

# classifiers    2      3      4      5      6      7
q0.05          1.960  2.343  2.569  2.728  2.850  2.949
q0.10          1.645  2.052  2.291  2.459  2.589  2.693
Table: Critical values for the Nemenyi test
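A one-liner sketch of the critical difference, using the q0.05 row of the table above:

```python
from math import sqrt

# critical values q_alpha for the Nemenyi test (from the table above)
Q_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850, 7: 2.949}

def critical_difference(k, n_datasets, q=Q_005):
    """CD = q_alpha * sqrt(k(k+1) / (6N))."""
    return q[k] * sqrt(k * (k + 1) / (6 * n_datasets))

# e.g. 4 classifiers compared over 10 data sets
print(critical_difference(4, 10))  # ~1.48
```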
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
How to organize the Web?
First try: Human-curated Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval attempts to
find relevant docs in a small
and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents,
random things, web spam, etc.
So we need a good way to rank webpages!
2 challenges of web search:
(1) Web contains many sources of information
Who to trust?
Insight: Trustworthy pages may point to each other!
(2) What is the best answer to the query
"newspaper"?
No single right answer
Insight: Pages that actually know about newspapers
might all be pointing to many newspapers
All web pages are not equally important
www.joe-schmoe.com vs. www.stanford.edu
We already know: there is large diversity
in web-graph node connectivity.
So, let's rank the pages
using the web graph
link structure!
We will cover the following Link Analysis
approaches to computing importance of
nodes in a graph:
Hubs and Authorities (HITS)
Page Rank
Random Walk with Restarts
Each page starts with hub score 1
Authorities collect their votes
[Kleinberg 98]
Each page i has 2 scores:
  Authority score: a_i
  Hub score: h_i

HITS algorithm:
  Initialize: a_j^(0) = 1/√n, h_j^(0) = 1/√n
  Then keep iterating until convergence:
    Authority: a_i^(t+1) = Σ_{j→i} h_j^(t)
    Hub: h_i^(t+1) = Σ_{i→j} a_j^(t)
    Normalize: Σ_i (a_i^(t+1))² = 1, Σ_j (h_j^(t+1))² = 1
  Convergence criteria:
    Σ_i (h_i^(t) − h_i^(t−1))² < ε
    Σ_i (a_i^(t) − a_i^(t−1))² < ε
[Kleinberg 98]
HITS in vector notation: let A be the adjacency matrix (A_ij = 1 if i→j), h the vector of hub scores and a the vector of authority scores.
Then: h = A·a and a = Aᵀ·h
a is updated (in 2 steps): a = Aᵀ(A·a) = (AᵀA)·a
h is updated (in 2 steps): h = A(Aᵀ·h) = (AAᵀ)·h
Thus, in 2k steps:
  h = (AAᵀ)^k · h
  a = (AᵀA)^k · a
Repeated matrix powering
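A minimal NumPy sketch of this power iteration (the toy graph at the end is my own, purely for illustration):

```python
import numpy as np

def hits(A, eps=1e-8, max_iter=100):
    """HITS by power iteration. A[i, j] = 1 if page i links to page j."""
    n = A.shape[0]
    a = np.ones(n) / np.sqrt(n)  # authority scores
    h = np.ones(n) / np.sqrt(n)  # hub scores
    for _ in range(max_iter):
        a_new = A.T @ h                 # authorities collect hub votes
        h_new = A @ a_new               # hubs collect authority votes
        a_new /= np.linalg.norm(a_new)  # normalize so sum of squares = 1
        h_new /= np.linalg.norm(h_new)
        done = (np.sum((a_new - a) ** 2) < eps and
                np.sum((h_new - h) ** 2) < eps)
        a, h = a_new, h_new
        if done:
            break
    return a, h

# toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])
print(hits(A))
```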
Definition: Eigenvectors & Eigenvalues
Let A·x = λ·x for some scalar λ, vector x, and matrix A.
Then x is an eigenvector and λ is its eigenvalue.
The steady state (HITS has converged) is:
  h = c · (AAᵀ) · h
  a = c′ · (AᵀA) · a
Note: the constants c, c′ don't matter, as we normalize them out
every step of HITS. So h and a are eigenvectors of AAᵀ and AᵀA.
PageRank: a vote from an important page is worth more.
A page i with rank r_i and 3 out-links passes r_i/3 along each link;
a page k with 4 out-links passes r_k/4.
Each link's vote is proportional to the importance of its source page.
(Figure: page j with in-links from pages i1, i2, i3.)
Where is the surfer at time t+1?
It follows a link uniformly at random:
  p(t + 1) = M · p(t)
Suppose the random walk reaches a state where
  p(t + 1) = M · p(t) = p(t)
Then p(t) is a stationary distribution of the random walk.
Our original rank vector r satisfies r = M · r.
So, r is a stationary distribution for
the random walk.
Given a web graph with n nodes, where the
nodes are pages and edges are hyperlinks:
  Assign each node an initial page rank
  Repeat until convergence ( Σ_i |r_i^(t+1) − r_i^(t)| < ε ):
    Calculate the page rank of each node:
      r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
    where d_i is the out-degree of node i
Power Iteration on the three-page graph y, a, m
(y links to y and a; a links to y and m; m links to a):

        y     a     m
   y   1/2   1/2    0
   a   1/2    0     1
   m    0    1/2    0

Set r_j = 1/N
1: r′_j = Σ_{i→j} r_i / d_i
2: r = r′
If |r − r′| > ε: goto 1

Flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2

Example (iterations 0, 1, 2, ...):
  ry = 1/3, 1/3, 5/12, 9/24, ... → 6/15
  ra = 1/3, 3/6, 1/3, 11/24, ... → 6/15
  rm = 1/3, 1/6, 3/12, 1/6, ... → 3/15
r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i    or equivalently    r = M · r
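A short NumPy sketch of this power iteration on the y/a/m example above (column-stochastic M, entries as in the flow equations):

```python
import numpy as np

# columns = out-links: y -> {y, a}, a -> {y, m}, m -> {a}
M = np.array([[0.5, 0.5, 0.0],   # ry = ry/2 + ra/2
              [0.5, 0.0, 1.0],   # ra = ry/2 + rm
              [0.0, 0.5, 0.0]])  # rm = ra/2

r = np.ones(3) / 3               # start from the uniform vector
for _ in range(100):
    r_new = M @ r
    if np.abs(r_new - r).sum() < 1e-10:
        break
    r = r_new
print(r)  # -> [6/15, 6/15, 3/15] = [0.4, 0.4, 0.2]
```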
The Spider trap problem:
(Graph: a → b and b → b, so all of b's out-links stay within {b}.)
  r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
Example (iterations 0, 1, 2, 3, ...):
  ra = 1, 0, 0, 0
  rb = 0, 1, 1, 1
The spider trap eventually absorbs all the importance.
The Dead end problem:
(Graph: a → b, and b has no out-links.)
  r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i
Example (iterations 0, 1, 2, 3, ...):
  ra = 1, 0, 0, 0
  rb = 0, 1, 0, 0
All the importance leaks out.
2 problems:
(1) Some pages are dead ends (have no out-links):
    such pages cause importance to leak out.
(2) Spider traps (all out-links within a group of pages):
    such groups eventually absorb all the importance.
Power Iteration with a spider trap: same y, a, m graph,
but m now links only to itself (m → m):

        y     a     m
   y   1/2   1/2    0
   a   1/2    0     0
   m    0    1/2    1

Set r_j = 1/N and iterate r′_j = Σ_{i→j} r_i / d_i:
  ry = ry/2 + ra/2
  ra = ry/2
  rm = ra/2 + rm

Example (iterations 0, 1, 2, ...):
  ry = 1/3, 2/6, 3/12, 5/24, ... → 0
  ra = 1/3, 1/6, 2/12, 3/24, ... → 0
  rm = 1/3, 3/6, 7/12, 16/24, ... → 1
The spider trap m absorbs all the importance.
The Google solution for spider traps: at each
time step, the random surfer has two options
  With prob. β, follow a link at random
  With prob. 1 − β, jump to a random page
Common values for β are in the range 0.8 to 0.9
The surfer will teleport out of a spider trap within a
few time steps
Power Iteration with a dead end: same y, a, m graph,
but m now has no out-links at all:

        y     a     m
   y   1/2   1/2    0
   a   1/2    0     0
   m    0    1/2    0

Set r_j = 1/N and iterate r′_j = Σ_{i→j} r_i / d_i:
  ry = ry/2 + ra/2
  ra = ry/2
  rm = ra/2

Example (iterations 0, 1, 2, ...):
  ry = 1/3, 2/6, 3/12, 5/24, ... → 0
  ra = 1/3, 1/6, 2/12, 3/24, ... → 0
  rm = 1/3, 1/6, 1/12, 2/24, ... → 0
All the importance leaks out: the matrix is no longer column stochastic.
Teleports: follow random teleport links with
probability 1.0 from dead-ends.
Adjust the matrix accordingly: the column of the dead end m becomes uniform (1/3 everywhere):

        y     a     m                 y     a     m
   y   1/2   1/2    0            y   1/2   1/2   1/3
   a   1/2    0     0     →      a   1/2    0    1/3
   m    0    1/2    0            m    0    1/2   1/3
Google's solution: at each step, the random
surfer has two options:
  With probability β, follow a link at random
  With probability 1 − β, jump to some random page
PageRank equation [Brin-Page, 98]:
  r_j = Σ_{i→j} β · r_i / d_i + (1 − β) · 1/N
where d_i is the out-degree of node i.
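A compact NumPy sketch of this equation by power iteration, applied to the spider-trap example above (β = 0.8):

```python
import numpy as np

def pagerank(M, beta=0.8, eps=1e-10, max_iter=100):
    """PageRank with teleports: r = beta*M*r + (1-beta)*(1/N)."""
    n = M.shape[0]
    r = np.ones(n) / n
    for _ in range(max_iter):
        r_new = beta * (M @ r) + (1 - beta) / n
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

# spider-trap example: y -> {y, a}, a -> {y, m}, m -> {m}
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank(M))  # ~[7/33, 5/33, 21/33]: teleports keep y and a above 0
```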
[Tong-Faloutsos, 06]
(Figure: a small graph with nodes A-J and unit edge weights, and a conferences-to-authors bipartite graph: conferences such as IJCAI, KDD, ICDM, NIPS; authors such as Philip S. Yu and Ning Zhong.)
Goal: proximity on graphs, e.g., how close is an author to a conference?
Shortest path is not a good proximity measure.
[Tong-Faloutsos, 06]
A good proximity measure should take into account:
  Multiple connections
  Quality of connection
  Direct & indirect connections
  Length, degree, weight
(Figure: the same A-J graph, with several parallel paths between the query nodes.)
Goal: evaluate pages not just by popularity,
but by how close they are to the topic.
Teleporting can go to:
  Any page with equal probability
    → PageRank (what we used so far)
  A topic-specific set of relevant pages S
    → topic-specific (personalized) PageRank (S ... teleport set)
      M′_ij = β·M_ij + (1 − β)/|S|   if i ∈ S
      M′_ij = β·M_ij                  otherwise
Random Walk with Restart: the case where S is a single element.
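A sketch of personalized PageRank by power iteration; the restart vector is uniform over S, and with |S| = 1 this is exactly Random Walk with Restart (helper name my own):

```python
import numpy as np

def personalized_pagerank(M, S, beta=0.8, eps=1e-10, max_iter=200):
    """Teleport only to nodes in S: r = beta*M*r + (1-beta)*e_S/|S|."""
    n = M.shape[0]
    restart = np.zeros(n)
    restart[list(S)] = 1.0 / len(S)  # uniform over the teleport set
    r = restart.copy()
    for _ in range(max_iter):
        r_new = beta * (M @ r) + (1 - beta) * restart
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

# Random Walk with Restart = personalized PageRank with a single-node S
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(personalized_pagerank(M, S={0}))  # proximity of every node to node 0
```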
Random Walk with Restart on the conference-author graph:
ranks nodes by proximity to the query (e.g., which conferences are closest to a given author?).
Teleport back to the starting node: S = { single node }.
(Figure: bipartite conference-author graph with conferences such as IJCAI, ICDM, NIPS.)
(Figure: RWR proximity scores with node 4 as the query/restart node:
  Node 1: 0.13   Node 2: 0.10   Node 3: 0.13   Node 4: query
  Node 5: 0.13   Node 6: 0.05   Node 7: 0.05   Node 8: 0.08
  Node 9: 0.04   Node 10: 0.03  Node 11: 0.04  Node 12: 0.02
Nodes near the query score high; distant nodes score low.)
(Figure: proximity scores of CS conferences to the query: PKDD, SDM, PAKDD, KDD, ICML, ICDM, CIKM, ICDE, ECML, SIGMOD, DMKD, with scores between roughly 0.004 and 0.011.)
Q: Which conferences are closest to KDD & ICDM?
A: Personalized PageRank on the graph of CS conferences,
with teleport set S = {KDD, ICDM}.
Pinterest: Pins belong to Boards, forming a bipartite pin-board graph.
Input: a query pin (or set of pins) Q. Output: other pins ranked by proximity to Q.
Pixie Random Walk: estimate proximity to the query node(s) Q by running many short random walks from Q and counting visits to each candidate node (the figure shows visit counts accumulating around Q).
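A minimal sketch of this Monte Carlo idea (a plain adjacency dict and a fixed restart probability; the production Pixie system adds biased walks and early stopping, which are omitted here):

```python
import random
from collections import Counter

def pixie_walk(adj, q, n_steps=100_000, alpha=0.5, rng=random.Random(0)):
    """Estimate proximity to query node q by a random walk with restarts.
    adj: dict node -> list of neighbors; alpha: restart probability."""
    counts = Counter()
    v = q
    for _ in range(n_steps):
        if rng.random() < alpha or not adj[v]:
            v = q                   # restart at the query node
        else:
            v = rng.choice(adj[v])  # follow a random edge
        counts[v] += 1
    return counts

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(pixie_walk(adj, q=0).most_common())  # visit counts ~ proximity to node 0
```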
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Observations Models Algorithms
We often think of networks as looking like this: (figure)
Triadic closure == high clustering coefficient
Reasons for triadic closure: if B and C have a friend A in common, then:
  B is more likely to meet C
    (since they both spend time with A)
  B and C trust each other
    (since they have a friend in common)
  A has an incentive to bring B and C together
    (as it is hard for A to maintain two disjoint relationships)
Empirical study by Bearman and Moody:
  Teenage girls with low clustering coefficient are
  more likely to contemplate suicide
Bridge
Define: bridge edge (a, b):
if removed, it disconnects the graph
Claim: if a node A satisfies Strong Triadic Closure
and is involved in at least two strong ties, then
any local bridge adjacent to A must be a weak tie.
Proof by contradiction:
  Assume A satisfies Strong Triadic Closure and has two strong ties.
  Suppose the local bridge A-B is a strong tie; A also has a strong tie A-C.
  Then the edge B-C must exist, because of Strong Triadic Closure.
  But then A-B is not a local bridge!
  (since B and C must be connected due to the Strong Triadic Closure property)
For many years Granovetter's theory was not
tested.
But today we have large who-talks-to-whom
graphs:
  Email, Messenger, Cell phones, Facebook
Edge overlap:
  O_ij = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|
where N(i) is the set of neighbors of node i
(the endpoints i and j themselves excluded).
Overlap = 0 when an edge is a local bridge.
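A tiny sketch of this computation over adjacency sets (the example graph is my own):

```python
def edge_overlap(adj, i, j):
    """O_ij = |N(i) & N(j)| / |N(i) | N(j)|, excluding i and j themselves."""
    ni = adj[i] - {j}
    nj = adj[j] - {i}
    union = ni | nj
    return len(ni & nj) / len(union) if union else 0.0

adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(edge_overlap(adj, 0, 1))  # node 2 is shared -> 0.5
print(edge_overlap(adj, 0, 3))  # local bridge -> 0.0
```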
Cell-phone network
Observation: highly used links have high neighborhood overlap!
(Figure: neighborhood overlap vs. edge strength (# calls), for the true data and for permuted strengths.)
Legend:
  True: the data
  Permuted strengths: keep the network structure
    but randomly reassign edge strengths
Real edge strengths in mobile call graph
Strong ties are more embedded (have higher overlap)
Same network, same set of edge strengths
but now strengths are randomly shuffled
(Figure: size of the largest component as edges are removed, from low to high strength/overlap and vice versa.)
Removing weak ties first disconnects the network sooner; likewise, removing low-overlap edges first disconnects the network sooner.
Granovetter's theory suggests that networks are composed of
tightly connected sets of nodes
(communities, clusters, groups, modules).
Network communities:
sets of nodes with lots of connections inside and
few to the outside (the rest of the network)
How to automatically
find such densely
connected groups of
nodes?
Ideally, such automatically
detected clusters would
then correspond to real
groups (communities, clusters, groups, modules).
Zachary's Karate club network:
  Observed social ties and rivalries in a university karate club
  During the observation, conflicts led the group to split
  The split could be explained by a minimum cut in the network
Find micro-markets by partitioning the
query × advertiser graph
(figure: bipartite graph with queries on one side, advertisers on the other)
Can we identify
node groups?
(communities,
modules, clusters)
Nodes: Teams
Edges: Games played
NCAA conferences
Nodes: Teams
Edges: Games played
Can we identify
social communities?
Nodes: Users
Edges: Friendships
Social communities
(Figure: clusters labeled High school, Company, Stanford (Basketball), Stanford (Squash).)
Nodes: Users
Edges: Friendships
Can we identify
functional modules?
Nodes: Proteins
Edges: Interactions
Functional modules
Nodes: Proteins
Edges: Interactions
How to find communities?
(Figure: example graph with edge betweenness values 12, 1, 33, 49.)
Need to re-compute
betweenness at
every step
(Figure: the graph after removal step 1 and after step 2.)
1. How to compute betweenness?
2. How to select the number of
clusters?
Want to compute the betweenness of paths starting at node A.
Breadth-first search starting from A
(figure: the BFS tree, with A at depth 0).
Forward step: count the number of shortest
paths from A to all other nodes of the
network.
Backward step: compute betweenness; if
there are multiple shortest paths, count them
fractionally.
The algorithm:
  Add edge flows:
    node flow = 1 + Σ of child edge flows
    split the flow up based on the parent value
  Repeat the BFS procedure for each starting node
(Figure: 1 + 1 paths to H, split evenly; 1 + 0.5 paths to J, split 1:2; 1 path to K, split evenly.)
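A compact sketch of both steps for one starting node (names are my own; summing the returned flows over all starting nodes counts each undirected path twice, so divide the totals by 2):

```python
from collections import deque, defaultdict

def betweenness_from(adj, s):
    """Forward step: BFS shortest-path counts; backward step: flow splitting."""
    dist, sigma, order = {s: 0}, defaultdict(float, {s: 1.0}), [s]
    parents = defaultdict(list)
    q = deque([s])
    while q:                                   # forward: count shortest paths
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
                order.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                parents[w].append(v)
    flow = defaultdict(float)                  # backward: split flows
    node_flow = {v: 1.0 for v in order}        # node flow starts at 1
    for w in reversed(order[1:]):              # farthest nodes first
        for v in parents[w]:
            f = node_flow[w] * sigma[v] / sigma[w]  # split by # paths
            flow[(min(v, w), max(v, w))] += f
            node_flow[v] += f                  # 1 + sum of child edge flows
    return flow

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(betweenness_from(adj, 0))
```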
1. How to compute betweenness?
2. How to select the number of
clusters?
Communities: sets of
tightly connected nodes
Define: Modularity Q
  A measure of how well
  a network is partitioned
  into communities
Given a partitioning of the
network into groups s ∈ S:
  Q ∝ Σ_{s∈S} [ (# edges within group s)
                − (expected # edges within group s) ]
Need a null model!
Given a real graph G on n nodes and m edges,
construct a rewired network G′:
  Same degree distribution but
  random connections
  Consider G′ as a multigraph
The expected number of edges between nodes
i and j of degrees k_i and k_j equals k_i · k_j / 2m.
The expected number of edges in (multigraph) G′:
  (1/2) Σ_i Σ_j k_i k_j / 2m = (1/2) · (1/2m) · Σ_i k_i · Σ_j k_j
                             = (1/4m) · 2m · 2m = m
Note: Σ_u k_u = 2m
Modularity of partitioning S of graph G:
  Q ∝ Σ_{s∈S} [ (# edges within group s)
                − (expected # edges within group s) ]
  Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i k_j / 2m )
where A_ij = 1 if i→j (0 else), and 1/2m is a
normalizing constant so that −1 < Q < 1.
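A direct NumPy transcription of this formula, with communities given as a label per node (the toy graph is my own):

```python
import numpy as np

def modularity(A, labels):
    """Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) over same-community pairs."""
    k = A.sum(axis=1)                      # degrees
    two_m = k.sum()                        # 2m = sum of degrees
    same = np.equal.outer(labels, labels)  # indicator: same community
    B = A - np.outer(k, k) / two_m         # modularity matrix entries
    return (B * same).sum() / two_m

# two triangles joined by a single edge
A = np.zeros((6, 6))
for i, j in [(0,1),(0,2),(1,2),(3,4),(3,5),(4,5),(2,3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, np.array([0,0,0,1,1,1])))  # ~0.357
```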
Modularity is useful for selecting the
number of clusters: choose the partitioning with the largest Q.
  Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i k_j / 2m )
          = (1/2m) Σ_{i,j} ( A_ij − k_i k_j / 2m ) · 1[c_i = c_j]
where c_i is the community of node i.
Define the modularity matrix B:
  B_ij = A_ij − k_i k_j / 2m
Note: each row/column of B sums to 0:
  Σ_j B_ij = k_i − (k_i / 2m) Σ_j k_j = k_i − k_i = 0
Membership (two communities): s_i ∈ {−1, +1}
Then:
  Q(G, s) = (1/4m) Σ_{i,j} ( A_ij − k_i k_j / 2m ) s_i s_j
          = (1/4m) Σ_{i,j} s_i B_ij s_j = (1/4m) sᵀ B s
Maximize Q over s using the leading eigenvector x of B:
  s_j = +1 if x_j ≥ 0 (jth coordinate of x)
  s_j = −1 if x_j < 0
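A sketch of this spectral trick: take the sign of the leading eigenvector of B as the two-way split (toy graph reused from the modularity example):

```python
import numpy as np

def spectral_split(A):
    """Two-community split from the leading eigenvector of B = A - kk^T/2m."""
    k = A.sum(axis=1)
    two_m = k.sum()
    B = A - np.outer(k, k) / two_m    # modularity matrix
    vals, vecs = np.linalg.eigh(B)    # B is symmetric
    x = vecs[:, np.argmax(vals)]      # leading eigenvector
    s = np.where(x >= 0, 1, -1)       # membership s_j in {-1, +1}
    q = s @ B @ s / (2 * two_m)       # Q = s^T B s / 4m
    return s, q

# the two-triangles example again
A = np.zeros((6, 6))
for i, j in [(0,1),(0,2),(1,2),(3,4),(3,5),(4,5),(2,3)]:
    A[i, j] = A[j, i] = 1
s, q = spectral_split(A)
print(s, q)  # s should separate the two triangles; Q ~ 0.357
```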
Girvan-Newman:
Based on the strength of weak ties
Remove edge of highest betweenness
Modularity:
Overall quality of the partitioning of a graph
Use to determine the number of communities
Fast modularity optimization:
  Transform the modularity optimization into an
  eigenvalue problem
[Ron Burt]