MT 2012
Reading
Outline
1 Introduction
2 Models for Networks
3 Stein's Method for Normal Approximation
4 Stein's Method for Poisson Approximation
5 Threshold Behaviour
6 Yule Processes
7 Shortest paths in small world networks
8 Other models and summaries
9 Sampling from networks
10 Fitting a model
11 Nonparametric methods
12 Statistical inference for vertex characteristics and edges
13 Motifs
14 Modules and communities
15 Further topics
1 Introduction
Gesine Reinert
MT 2012
[Figure: Padgett's Florentine families marriage network, with vertices Acciaiuoli, Albizzi, Barbadori, Bischeri, Castellani, Ginori, Guadagni, Lamberteschi, Medici, Pazzi, Peruzzi, Pucci, Ridolfi, Salviati, Strozzi, Tornabuoni.]
metabolic networks
protein-protein interaction networks
spread of epidemics
neural network of C. elegans
social networks
collaboration networks (Erdős numbers ... )
Membership of management boards
World Wide Web
power grid of the Western US
[Data: the 16 x 16 symmetric 0/1 adjacency matrix of the Florentine families marriage network.]
Here 0/0 := 0. The average clustering coefficient is defined as
$$\bar{C} = \frac{1}{|V|} \sum_{v \in V} C(v).$$
The global clustering coefficient (transitivity) is
$$C = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}},$$
where a connected triple is an unordered pair of edges sharing a vertex.
A related quantity replaces this ratio of counts by the ratio of expectations,
$$\frac{3\,\mathbb{E}(\text{number of triangles})}{\mathbb{E}(\text{number of connected triples})}.$$
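A minimal sketch (not from the original notes) of how both coefficients can be computed from a symmetric 0/1 adjacency matrix with numpy; the function names are illustrative.

```python
import numpy as np

def clustering_coefficients(A):
    """Local clustering C(v) for each vertex of a simple graph,
    given a symmetric 0/1 adjacency matrix A (with 0/0 := 0)."""
    A = np.asarray(A, dtype=int)
    deg = A.sum(axis=1)
    # number of triangles through v = (A^3)_{vv} / 2
    tri_at_v = np.diagonal(np.linalg.matrix_power(A, 3)) / 2
    pairs = deg * (deg - 1) / 2
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(pairs > 0, tri_at_v / pairs, 0.0)

def average_clustering(A):
    return clustering_coefficients(A).mean()

def transitivity(A):
    """3 * (number of triangles) / (number of connected triples)."""
    A = np.asarray(A, dtype=int)
    triangles = np.trace(np.linalg.matrix_power(A, 3)) / 6
    deg = A.sum(axis=1)
    triples = (deg * (deg - 1) / 2).sum()
    return 3 * triangles / triples if triples > 0 else 0.0
```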
Network        Los Alamos   MEDLINE     NCSTRL
n              52,909       1,520,251   11,994
ave degree     9.7          18.1        3.59
C              0.43         0.066       0.496
C_Bernoulli    0.00018      0.000011    0.0003
If $\lambda_1, \ldots, \lambda_L$ denote the proportions of the vertices of the different types, $\sum_\ell \lambda_\ell = 1$, then for a vertex $V$ picked uniformly at random,
$$\mathbb{E}(d(V)) = \sum_{\ell} \lambda_\ell \Big( (n\lambda_\ell - 1)\, p_{\ell,\ell} + \sum_{k \ne \ell} n\lambda_k\, p_{k,\ell} \Big).$$
Daudin et al. (2008) have shown that this model is very flexible and fits many real-world networks reasonably well. However, it does not produce a power-law degree distribution.
We note that more than one set of matchings may result in the same network; if we labelled the stubs, there would typically be many ways to join up pairs of labelled stubs that create the same final configuration.
Here $d(V)$ is the degree of a randomly chosen vertex;
$$\mathbb{E}\, d(V) = \frac{1}{n} \sum_{i=1}^{n} d(i) = \frac{2m}{n}.$$
As long as $\mathbb{E}\, d(V)$ remains bounded, the expected number of self-loops will remain bounded, whereas the number of edges will grow with the number of vertices in the network. Hence self-loops will be rare when the network is large.
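A small illustrative sketch of the stub-matching construction; the function name and the chosen degree sequence are assumptions for illustration only.

```python
import random

def configuration_model(degrees, seed=None):
    """Match stubs uniformly at random; returns the multiset of edges
    (self-loops and multi-edges are possible). Degree sum must be even."""
    rng = random.Random(seed)
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

degrees = [3, 2, 2, 2, 1] * 20          # degree sum is even
edges = configuration_model(degrees, seed=1)
self_loops = sum(u == v for u, v in edges)
print(len(edges), "edges,", self_loops, "self-loops")
```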
$F_n \Rightarrow F$, equivalently $P_n \Rightarrow P$, is equivalent to:
1. $P_n(A) \to P(A)$ for each $P$-continuity set $A$ (i.e. $P(\partial A) = 0$);
2. $\int f\, dP_n \to \int f\, dP$ for all functions $f$ that are bounded, continuous, real-valued;
3. $\int f\, dP_n \to \int f\, dP$ for all functions $f$ that are bounded, infinitely often differentiable, real-valued.
Theorem 3.1 (Stein 1972). If $W \sim \mathcal{N}(0,1)$, then, for all absolutely continuous $f$ for which the expectations exist,
$$\mathbb{E}[W f(W)] = \mathbb{E} f'(W). \tag{3.1}$$
Proof. We have
$$\mathbb{E} f'(W) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f'(w)\, e^{-w^2/2}\, dw$$
$$= -\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{0} f'(w) \int_{-\infty}^{w} x\, e^{-x^2/2}\, dx\, dw + \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} f'(w) \int_{w}^{\infty} x\, e^{-x^2/2}\, dx\, dw.$$
Proof continued. Interchanging the order of integration,
$$\mathbb{E} f'(W) = -\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{0} \Big( \int_{x}^{0} f'(w)\, dw \Big)\, x\, e^{-x^2/2}\, dx + \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} \Big( \int_{0}^{x} f'(w)\, dw \Big)\, x\, e^{-x^2/2}\, dx$$
$$= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} [f(x) - f(0)]\, x\, e^{-x^2/2}\, dx = \mathbb{E}[W f(W)],$$
as required.
Lemma 3.2. The unique bounded solution $f(w) = f_h(w)$ of the Stein equation
$$f'(w) - w f(w) = h(w) - Nh, \tag{3.2}$$
where $Nh = \mathbb{E}h(Z)$ for $Z \sim \mathcal{N}(0,1)$, is given by
$$f_h(w) = e^{w^2/2} \int_{-\infty}^{w} (h(x) - Nh)\, e^{-x^2/2}\, dx \tag{3.3}$$
$$\phantom{f_h(w)} = -\, e^{w^2/2} \int_{w}^{\infty} (h(x) - Nh)\, e^{-x^2/2}\, dx. \tag{3.4}$$
The next lemma will show that this solution is bounded. The general solution to (3.2) is given by $f_h(w)$ plus a constant multiple of $e^{w^2/2}$.
Lemma 3.3. For $h$ bounded and $f_h$ its solution (3.3) of the Stein equation (3.2) we have
$$\|f_h\| \le \sqrt{\frac{\pi}{2}}\, \|h(\cdot) - Nh\|. \tag{3.5}$$

Proof. From (3.3) and (3.4) we have that
$$|f_h(w)| \le \|h(\cdot) - Nh\|\; e^{w^2/2} \int_{|w|}^{\infty} e^{-x^2/2}\, dx.$$
Set
$$g(w) = e^{w^2/2} \int_{|w|}^{\infty} e^{-x^2/2}\, dx.$$
Then $g$ is finite everywhere: for $w > 0$,
$$\int_{w}^{\infty} e^{-x^2/2}\, dx \le \int_{w}^{\infty} \frac{x}{w}\, e^{-x^2/2}\, dx = \frac{e^{-w^2/2}}{w}. \tag{3.6}$$
Proof continued. Moreover,
$$g(0) = \int_{0}^{\infty} e^{-x^2/2}\, dx = \sqrt{2\pi}\, \mathbb{P}(Z \ge 0) = \frac{1}{2}\sqrt{2\pi} = \sqrt{\frac{\pi}{2}}.$$
Since $g$ is decreasing in $|w|$, combining (3.6) with the value at 0 gives
$$g(w) = e^{w^2/2} \int_{|w|}^{\infty} e^{-x^2/2}\, dx \le \sqrt{\frac{\pi}{2}} \quad \text{for all } w, \tag{3.7}$$
which proves (3.5).

Theorem 3.4 (Stein 1972). If (3.1) holds for all bounded, continuous and piecewise continuously differentiable functions $f$ with $\mathbb{E}|f'(Z)| < \infty$, then $W \sim \mathcal{N}(0,1)$.
[Proof of Theorem 3.4: it is based on the Stein equation solutions $f_h$, $h \in \mathcal{H}$; details lost in extraction.]
More bounds

Lemma 3.5. Let $X, X_1, \ldots, X_n$ be i.i.d. with $\mathbb{E}X = 0$ and $\operatorname{Var}(X) = \frac{1}{n}$; put $W = \sum_{i=1}^{n} X_i$. For any absolutely continuous function $h : \mathbb{R} \to \mathbb{R}$ we have
$$|\mathbb{E}h(W) - Nh| \le \|h'\| \Big( \frac{2}{\sqrt{n}} + \sum_{i=1}^{n} \mathbb{E}|X_i|^3 \Big).$$
Proof. Write $W_i = W - X_i$, so that $W_i$ is independent of $X_i$ and $\mathbb{E} X_i f(W_i) = 0$. By Taylor expansion,
$$\mathbb{E}\{W f(W)\} = \sum_{i=1}^{n} \mathbb{E} X_i f(W) = \sum_{i=1}^{n} \mathbb{E} X_i \{f(W) - f(W_i)\} = \sum_{i=1}^{n} \mathbb{E} X_i^2 f'(W_i) + R,$$
where $R$ is a remainder with $|R| \le \frac{1}{2}\|f''\| \sum_i \mathbb{E}|X_i|^3$.

Proof continued. Since $\mathbb{E}X_i^2 = \frac{1}{n}$ and $X_i$ is independent of $W_i$,
$$\sum_{i=1}^{n} \mathbb{E} X_i^2 f'(W_i) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E} f'(W_i).$$
So
$$\mathbb{E}\{f'(W)\} - \mathbb{E}\{W f(W)\} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\{f'(W) - f'(W_i)\} - R,$$
and $|f'(W) - f'(W_i)| \le \|f''\|\, |X_i|$ together with $\mathbb{E}|X_i| \le (\mathbb{E}X_i^2)^{1/2} = n^{-1/2}$ bounds the first term by $\|f''\|/\sqrt{n} \le 2\|h'\|/\sqrt{n}$. Collecting terms and using $\|f''\| \le 2\|h'\|$ proves the lemma.
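As a quick numerical sanity check of Lemma 3.5 (an addition, not part of the original notes), take $X_i = \pm 1/\sqrt{n}$ with probability $\frac12$ each and $h = \sin$, so that $Nh = 0$ and $\|h'\| = 1$; the bound is then $3/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 200_000
# X_i = +/- 1/sqrt(n): EX = 0, Var(X) = 1/n, E|X|^3 = n^{-3/2}
X = rng.choice([-1.0, 1.0], size=(reps, n)) / np.sqrt(n)
W = X.sum(axis=1)

Nh = 0.0                                  # E sin(Z) = 0 by symmetry
lhs = abs(np.sin(W).mean() - Nh)
rhs = 2 / np.sqrt(n) + n * n ** -1.5      # ||h'|| (2/sqrt(n) + sum E|X_i|^3)
print(f"|Eh(W) - Nh| ~ {lhs:.4f} <= bound {rhs:.4f}")
```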
Proof. Let $h : \mathbb{R} \to \mathbb{R}$ be absolutely continuous. Then, for $f = f_h$ the solution of the Stein equation (3.2) for $h$, we bound $\mathbb{E}\{W f(W) - f'(W)\}$. Write $W_i = W - Z_i$ and $Z_i = \sum_{k \in K_i} Z_{ik}$, so that $X_i$ is independent of $W_i$. Then
$$\mathbb{E}(W f(W)) = \sum_{i \in I} \mathbb{E}(X_i f(W)) = \sum_{i \in I} \mathbb{E}\big( X_i \{ f(W) - f(W_i) \} \big),$$
and, since $\mathbb{E}X_i = 0$ and $\operatorname{Var}(W) = 1$,
$$1 = \sum_{i \in I} \mathbb{E}(X_i W) = \sum_{i \in I} \big( \mathbb{E} X_i W - \mathbb{E} X_i W_i \big) = \sum_{i \in I} \mathbb{E}(X_i Z_i) = \sum_{i \in I} \sum_{k \in K_i} \mathbb{E}(X_i Z_{ik}).$$

Proof continued. Therefore
$$\mathbb{E}(W f(W)) - \mathbb{E} f'(W) = \sum_{i \in I} \mathbb{E}(X_i Z_i f'(W_i)) - \sum_{i \in I} \sum_{k \in K_i} \mathbb{E}(X_i Z_{ik})\, \mathbb{E}\{f'(W)\} + \text{remainder terms}$$
$$= \sum_{i \in I} \sum_{k \in K_i} \mathbb{E}(X_i Z_{ik}) \big[ \mathbb{E}\{f'(W_{ik})\} - \mathbb{E}\{f'(W)\} \big] + \text{remainder terms},$$
using the independence of $W_{ik}$ and $(X_i, Z_{ik})$.
Proof continued. By Taylor expansion,
$$W f(W) = \sum_{i \in I} X_i f(W) = \sum_{i \in I} X_i \Big( f(W_i) + Z_i f'(W_i) + \frac{1}{2} Z_i^2 f''(W_i + \theta_i Z_i) \Big)$$
for some $\theta_i \in [0,1]$, and, writing $W_i = W_{ik} + V_{ik}$,
$$X_i Z_{ik} f'(W_i) = X_i Z_{ik} \big( f'(W_{ik}) + V_{ik} f''(W_{ik} + \theta_{ik} V_{ik}) \big)$$
for some $\theta_{ik} \in [0,1]$. As $W_{ik}$ and $(X_i, Z_{ik})$ are independent,
$$\Big| \sum_{i \in I} \mathbb{E}(X_i Z_i f'(W_i)) - \sum_{i \in I} \sum_{k \in K_i} \mathbb{E}(X_i Z_{ik})\, \mathbb{E}(f'(W_{ik})) \Big| \le \|f''\| \sum_{i \in I} \sum_{k \in K_i} \mathbb{E}|X_i Z_{ik} V_{ik}|,$$
and the theorem follows from using the bound $\|f''\| \le 2\|h'\|$.
We standardise: writing $Y_\alpha$ for the indicator that the vertex triple $\alpha$ forms a triangle,
$$X_\alpha = \frac{Y_\alpha - p^3}{\sqrt{\operatorname{Var}(T)}} \qquad\text{and}\qquad Z_\alpha = \sum_{\beta \in K_\alpha} X_\beta,$$
where $K_\alpha$ consists of $\alpha$ together with all triples sharing an edge with $\alpha$. With $W = \sum_\alpha X_\alpha$ and $\sigma^2 = \operatorname{Var}(T)$, the local dependence bound gives
$$|\mathbb{E}h(W) - Nh| \le \frac{27\, n^5 p^3}{\sigma^3}\, \|h'\|.$$
Proof (sketch). With $\sigma^2 = \operatorname{Var}(T)$, each summand satisfies
$$\mathbb{E}\{(\sigma X_\alpha)^2\} = p^3 (1 - p^3), \qquad \mathbb{E}\{|\sigma X_\alpha|^3\} = p^3(1-p^3)^3 + (1-p^3)\, p^9 \le p^3(1 - p^3),$$
and, by Hölder's inequality, mixed moments such as $\mathbb{E}(|X_\alpha X_\beta X_\gamma|)$ are bounded by the largest third absolute moment. Counting terms, with $\binom{n}{3} < n^3/6$ triples and $|K_\alpha| < 3n$ neighbours each,
$$\sum_{\alpha} \sum_{\beta \in K_\alpha} \mathbb{E}|X_\alpha Z_{\alpha\beta} V_{\alpha\beta}| \le \frac{n^3}{6} \cdot \frac{22\, n^2 p^3}{\sigma^3} = \frac{11\, n^5 p^3}{3 \sigma^3},$$
and, using independence for terms of the form $\mathbb{E}\{|X_\alpha Z_{\alpha\beta}|\}\, \mathbb{E}\{|Z_{\alpha\beta} + V_{\alpha\beta}|\}$,
$$\sum_{\alpha} \sum_{\beta \in K_\alpha} \mathbb{E}\{|X_\alpha Z_{\alpha\beta}|\}\, \mathbb{E}\{|Z_{\alpha\beta} + V_{\alpha\beta}|\} \le \frac{n^3}{6} \cdot \frac{30\, n^2 p^3 (1 - p^3)}{\sigma^3} < \frac{5\, n^5 p^3}{\sigma^3}.$$
Collecting all contributions gives the stated bound.
More formally, the Poisson($\lambda$) distribution has probability mass function
$$p_k = e^{-\lambda}\, \frac{\lambda^k}{k!}, \qquad k = 0, 1, \ldots.$$
To see this:
$$\mathbb{E}g(Z + 1) = \sum_{j \ge 0} g(j+1)\, e^{-\lambda} \frac{\lambda^j}{j!} = \frac{1}{\lambda} \sum_{k \ge 1} k\, g(k)\, e^{-\lambda} \frac{\lambda^k}{k!} = \frac{1}{\lambda}\, \mathbb{E}\{Z g(Z)\}. \tag{4.1}$$

Lemma 4.1. For $A \subset \mathbb{N}_0$, the Stein equation $\lambda g(j+1) - j g(j) = I(j \in A) - Po(\lambda)(A)$ has solution
$$g(j+1) = \frac{j!}{\lambda^{j+1}}\, e^{\lambda} \sum_{k=0}^{j} e^{-\lambda} \frac{\lambda^k}{k!}\, [I(k \in A) - Po(\lambda)(A)] \tag{4.2}$$
$$\phantom{g(j+1)} = -\, \frac{j!}{\lambda^{j+1}}\, e^{\lambda} \sum_{k=j+1}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!}\, [I(k \in A) - Po(\lambda)(A)]. \tag{4.3}$$
Proof continued. With $g$ given by (4.2),
$$\lambda g(j+1) - j g(j) = \frac{j!}{\lambda^{j}}\, e^{\lambda} \sum_{k=0}^{j} e^{-\lambda} \frac{\lambda^k}{k!}\, [I(k \in A) - Po(\lambda)(A)] - \frac{j!}{\lambda^{j}}\, e^{\lambda} \sum_{k=0}^{j-1} e^{-\lambda} \frac{\lambda^k}{k!}\, [I(k \in A) - Po(\lambda)(A)]$$
$$= \frac{j!}{\lambda^{j}}\, e^{\lambda}\, e^{-\lambda}\, \frac{\lambda^j}{j!}\, [I(j \in A) - Po(\lambda)(A)] = I(j \in A) - Po(\lambda)(A).$$
Lemma 4.2. For the solution of the Stein equation,
$$\Delta g := \sup_j |g(j+1) - g(j)| \le \min\Big( 1, \frac{1}{\lambda} \Big).$$
Here the sup is over all functions $g$ which solve the Stein equation.

Lemma 4.3. Suppose that, for all bounded functions $g$,
$$\mathbb{E}\{ \lambda g(W + 1) - W g(W) \} = 0.$$
Then $W \sim Po(\lambda)$.
Let $X_1, X_2, \ldots$ be independent Bernoulli($p_i$) and $W = \sum_{i=1}^{n} X_i$. With $\lambda = \sum_{i=1}^{n} p_i$,
$$\mathbb{E}(W g(W)) = \sum_i \mathbb{E}(X_i g(W)) = \sum_i \mathbb{E}(X_i g(W - X_i + 1)) = \sum_i p_i\, \mathbb{E}(g(W - X_i + 1)),$$
and thus
$$\mathbb{E}\{\lambda g(W+1) - W g(W)\} = \sum_i p_i\, \mathbb{E}\big( g(W+1) - g(W - X_i + 1) \big) = \sum_i p_i^2\, \mathbb{E}\big( g(W+1) - g(W - X_i + 1) \,\big|\, X_i = 1 \big).$$
We conclude that
$$d_{TV}(\mathcal{L}(W), Po(\lambda)) \le \Delta g \sum_i p_i^2 \le \min\Big( 1, \frac{1}{\lambda} \Big) \sum_i p_i^2.$$
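A hedged numerical check of this bound (not from the original notes): compute the exact law of $W$ by dynamic programming and compare it with Poisson($\lambda$); the function name is illustrative.

```python
import numpy as np
from math import exp, factorial

def poisson_binomial_pmf(p):
    """Exact pmf of W = sum of independent Bernoulli(p_i), by convolution."""
    pmf = np.array([1.0])
    for pi in p:
        pmf = np.append(pmf, 0.0) * (1 - pi) + np.append(0.0, pmf) * pi
    return pmf

p = np.array([0.02, 0.05, 0.01, 0.03, 0.04] * 10)
lam = p.sum()
pmf_w = poisson_binomial_pmf(p)
k = np.arange(len(pmf_w))
pmf_po = np.array([exp(-lam) * lam**j / factorial(j) for j in k])

# dTV = (1/2) sum_k |P(W=k) - Po(lam)(k)|, here truncated to the support of W
dtv = 0.5 * np.abs(pmf_w - pmf_po).sum()
bound = min(1.0, 1.0 / lam) * (p**2).sum()
print(f"dTV ~ {dtv:.5f} <= bound {bound:.5f}")
```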
Theorem 4.4. With $\lambda = \mathbb{E}(W)$,
$$d_{TV}(\mathcal{L}(W), Po(\lambda)) \le \sum_{\alpha \in I} \big[ p_\alpha\, \mathbb{E}(\eta_\alpha) + \mathbb{E}(X_\alpha (\eta_\alpha - X_\alpha)) \big]\, \min\Big( 1, \frac{1}{\lambda} \Big),$$
where $\eta_\alpha = \sum_{\beta \in A_\alpha} X_\beta$.

Proof. We have
$$\mathbb{E} W g(W) = \sum_{\alpha \in I} \mathbb{E} X_\alpha g(W) = \sum_{\alpha \in I} \mathbb{E} X_\alpha g(W - X_\alpha + 1),$$
since $X_\alpha \in \{0,1\}$. Putting $W_\alpha = \sum_{\beta \notin A_\alpha} X_\beta$, so that $W = W_\alpha + \eta_\alpha$,
$$\sum_{\alpha \in I} \mathbb{E} X_\alpha g(W - X_\alpha + 1) = \sum_{\alpha \in I} p_\alpha\, \mathbb{E} g(W + 1) + R_1 + R_2 = \lambda\, \mathbb{E} g(W+1) + R_1 + R_2,$$
where
$$|R_1| = \Big| \sum_{\alpha \in I} p_\alpha \big\{ \mathbb{E}(g(W+1)) - \mathbb{E}(g(W_\alpha + 1)) \big\} \Big| \le \Delta g \sum_{\alpha \in I} p_\alpha\, \mathbb{E}(\eta_\alpha)$$
and
$$|R_2| = \Big| \sum_{\alpha \in I} \mathbb{E}\big\{ X_\alpha \big( g(W - X_\alpha + 1) - g(W_\alpha + 1) \big) \big\} \Big| \le \Delta g \sum_{\alpha \in I} \mathbb{E}\{ X_\alpha (\eta_\alpha - X_\alpha) \}.$$
Theorem 4.5. Let $W$ be the number of triangles in a Bernoulli random graph. With $\lambda = \binom{n}{3} p^3$,
$$d_{TV}(\mathcal{L}(W), Po(\lambda)) \le \binom{n}{3} p^3 \big( 3np^3 + 3np^2 \big)\, \min\Big( 1, \frac{1}{\lambda} \Big).$$
Thus, if $p = c/n$, then $\lambda \to c^3/6$ and the bound is of order $n^{-1}$.
Proof. If $|\{u,v,w\} \cap \{a,b,c\}| \le 1$ then $X_{u,v,w}$ and $X_{a,b,c}$ are independent. Let $A_\alpha = K_{u,v,w}$ contain $(u,v,w)$ plus all those index triples which share exactly two vertices with $\{u,v,w\}$. Recall that $|A_\alpha| < 3n$. Here $p_\alpha = p^3$ and
$$\mathbb{E}\{\eta_\alpha\} \le 3np^3.$$
Moreover,
$$\mathbb{E}\{ X_\alpha (\eta_\alpha - X_\alpha) \} = \sum_{\beta \in A_\alpha, \beta \ne \alpha} p^5 < 3np^5.$$
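An illustrative simulation (not from the notes) comparing the triangle-count distribution in $G(n,p)$ with $Po(\lambda)$ in the sparse regime $p = c/n$; all names are illustrative.

```python
import math
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)

def triangle_count(n, p):
    """Number of triangles in one draw of G(n, p)."""
    A = np.triu(rng.random((n, n)) < p, k=1)
    A = (A | A.T).astype(int)
    return int(np.trace(np.linalg.matrix_power(A, 3)) // 6)

n, c, reps = 60, 1.5, 2000
p = c / n
lam = (n * (n - 1) * (n - 2) / 6) * p**3
counts = Counter(triangle_count(n, p) for _ in range(reps))
print("lambda =", round(lam, 3))
for k in range(4):
    po = math.exp(-lam) * lam**k / math.factorial(k)
    print(k, counts[k] / reps, round(po, 4))
```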
For the average clustering coefficient
$$\bar{C} = \frac{1}{n} \sum_{v} C(v),$$
it is also true that, in distribution,
$$\bar{C} \approx \frac{3}{n \binom{n-1}{2} p^2}\, Z, \qquad \text{where } Z \sim \text{Poisson}\Big( \binom{n}{3} p^3 \Big).$$

5 Threshold behaviour
A mathematical formulation

[Figure: three simulated graphs, with 35 edges, 0 triangles, 0 squares; 45 edges, 1 triangle, 0 squares; and 80 edges, 8 triangles, 5 squares.]
Pseudo-code
Start: S = U = ∅, T = V.
If U ≠ ∅: let v be the last vertex added to U (a runnable sketch follows).
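A minimal runnable sketch of this DFS component discovery (not from the original slides), assuming an adjacency-list dictionary; T holds unvisited vertices, U is the stack of active vertices, and fully explored vertices move to S.

```python
def dfs_components(adj):
    """Reveal connected components in the spirit of the pseudo-code above."""
    T = set(adj)                           # not yet touched
    components = []
    while T:
        root = next(iter(T))
        T.remove(root)
        U, comp = [root], [root]           # U: stack of active vertices
        while U:
            v = U[-1]                      # last vertex added to U
            nbr = next((w for w in adj[v] if w in T), None)
            if nbr is None:
                U.pop()                    # v moves from U to S
            else:
                T.remove(nbr)              # positive answer to a query
                U.append(nbr)
                comp.append(nbr)
        components.append(comp)            # U empty: component complete
    return components

adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
print(dfs_components(adj))                 # three components: {1,2,3}, {4,5}, {6}
```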
Properties of DFS
The DFS algorithm can be used to reveal the connected components of G. It starts revealing a connected component C at the moment the first vertex of C enters the (empty beforehand) U, and completes discovering all of C when U becomes empty again.
Some observations
Since the addition of a vertex (except the first vertex) of a connected component to U is caused by a positive answer to a query,
$$|U(t)| \le 1 + \sum_{i=1}^{t} X_i.$$
At all times,
$$|U(t)| = n - |S(t)| - |T(t)|.$$
Heuristic
To make this argument precise we use Chebychev's inequality and the Chernoff inequality:

Lemma 5.1.
1. For $X \sim \text{Bin}(N, p)$ and $k > 0$,
$$\mathbb{P}(X - Np > k) \le e^{-\frac{k^2}{2(Np + k/3)}}.$$
Corollary 5.2. Let $X_i$, $i = 1, \ldots, N$, be a sequence of i.i.d. Bernoulli trials with success probability $p$, and let $k, \ell, N$ be such that $k > \ell p$. Then the probability that there is a (discrete) interval of length $\ell$ in $\{1, \ldots, N\}$ in which at least $k$ of the random variables $X_i$ take the value 1 is at most
$$(N - \ell + 1)\, e^{-\frac{(k - \ell p)^2}{2(\ell p + (k - \ell p)/3)}}.$$

Proof. Let $W$ denote the number of intervals of length $\ell$ in $\{1, \ldots, N\}$ in which at least $k$ of the random variables $X_i$ take the value 1. Using Markov's inequality, $\mathbb{P}(W \ge 1) \le \mathbb{E}(W)$. Now
$$\mathbb{E}(W) = (N - \ell + 1)\, \mathbb{P}\Big( \sum_{j=1}^{\ell} X_j \ge k \Big),$$
and, by Lemma 5.1,
$$\mathbb{P}\Big( \sum_{j=1}^{\ell} X_j \ge k \Big) = \mathbb{P}\Big( \sum_{j=1}^{\ell} X_j - \ell p \ge k - \ell p \Big) \le e^{-\frac{(k - \ell p)^2}{2(\ell p + (k - \ell p)/3)}}.$$
Theorem 5.3. Let $\varepsilon > 0$, and let $G \sim G(n, p)$.
1. If $p = (1 - \varepsilon)/n$ then, with probability tending to 1, all connected components of $G$ have at most $7\varepsilon^{-2} \log n$ vertices.
2. If $p = (1 + \varepsilon)/n$ then, with probability tending to 1, $G$ contains a path with at least $\varepsilon^2 n / 5$ vertices.

Proof: Part 1
For Part 1, let $p = (1 - \varepsilon)/n$. Assume that $G$ contains a connected component $C$ with more than $7\varepsilon^{-2} \log n$ vertices. Let $k + 1 = \lceil 5\varepsilon^{-2} \log n \rceil + 1$ be the smallest integer which is larger than $5\varepsilon^{-2} \log n$.
Look at the epoch when $C$ was created. Consider the moment inside this epoch when the algorithm has found the $(k+1)$st vertex of $C$ and is about to move it to U. Write $S' = S \cap C$ at this moment. Then $|S' \cup U| = k$, and thus the algorithm got exactly $k$ positive answers to its queries to random variables $X_i$ during the epoch.
At that moment during the epoch, only pairs of vertices touching $S' \cup U$ have been queried, and the number of such pairs is therefore at most
$$\binom{k}{2} + k(n - k) < kn.$$
Proof of Part 2
Now let $p = (1 + \varepsilon)/n$, and let $N_0 = \varepsilon n^2 / 2$; note $N_0 < \binom{n}{2}$. We claim that after the first $N_0$ queries of the DFS algorithm we have $|U| \ge \varepsilon^2 n / 5$, with the contents of U forming a path of the desired length at that moment. Assume that the DFS algorithm has carried out the first $N_0$ queries.
Indeed, by Lemma 5.1,
$$\mathbb{P}\Big( |U(t)| > \frac{1}{3}\varepsilon n \Big) \le \mathbb{P}\Big( 1 + \sum_{i=1}^{t} X_i > \frac{1}{3}\varepsilon n \Big) \le \exp\Bigg\{ - \frac{\big( \frac{1}{3}\varepsilon n - (1 + tp) \big)^2}{2\big( tp + \frac{1}{3}\big( \frac{1}{3}\varepsilon n - (1 + tp) \big) \big)} \Bigg\}. \tag{5.1}$$
Bounding $tp \le N_0 p$ and collecting terms shows that the exponent in (5.1) is at most $-n/13$, and so
$$\mathbb{P}\Big( |U(t)| > \frac{1}{3}\varepsilon n \Big) \le e^{-n/13}.$$

Now
$$1 \ge \mathbb{P}\Big( |S(N_0)| \ge \frac{1}{3} n \Big) + \mathbb{P}\Big( |U(t)| < \frac{1}{3}\varepsilon n \Big) \ge \mathbb{P}\Big( |S(N_0)| \ge \frac{1}{3} n \Big) + 1 - e^{-n/13},$$
and so
$$\mathbb{P}\Big( |S(N_0)| \ge \frac{1}{3} n \Big) \le e^{-n/13},$$
as claimed. Finally, combining $|S(N_0)| < n/3$ with $|U(t)| = n - |S(t)| - |T(t)|$ and the concentration of the number of positive answers around $N_0 p$ yields $|U(N_0)| \ge \varepsilon^2 n / 5$, completing the proof.
6 Yule Processes
The main references for this chapter are the book by S.M. Ross (1995) and Steven Lalley's lecture notes at
http://galton.uchicago.edu/~lalley/Courses/312/Branching.pdf.
For the Yule process we have (Exercise) that
$$\mathbb{P}(X(t) = j \mid X(0) = i) = \binom{j - 1}{i - 1}\, e^{-\lambda t i} \big( 1 - e^{-\lambda t} \big)^{j - i}, \qquad j \ge i \ge 1, \tag{6.1}$$
for $t \ge 0$.
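A short simulation sketch (an addition, not from the notes) checking (6.1) for $i = 1$, in which case $X(t)$ is geometric with success probability $e^{-\lambda t}$; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def yule(t, lam=1.0, i=1):
    """Simulate a Yule (pure birth) process with per-individual rate lam,
    started from i individuals; returns X(t)."""
    x, clock = i, 0.0
    while True:
        clock += rng.exponential(1.0 / (lam * x))   # holding time ~ Exp(lam * x)
        if clock > t:
            return x
        x += 1

t, lam, reps = 1.0, 1.0, 100_000
samples = np.array([yule(t, lam) for _ in range(reps)])
for j in (1, 2, 5):
    emp = (samples == j).mean()
    thry = np.exp(-lam * t) * (1 - np.exp(-lam * t)) ** (j - 1)
    print(f"j={j}: empirical {emp:.4f}, formula {thry:.4f}")
```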
Starting with a single individual, the time of the $i$-th birth is $S_i = \sum_{j=1}^{i} T_j$. The following result holds.

Lemma 6.1. Given $X(t) = n + 1$, the birth times $S_1, \ldots, S_n$ are distributed as the order statistics of $n$ i.i.d. random variables with density
$$f(s) = \frac{\lambda e^{-\lambda(t - s)}}{1 - e^{-\lambda t}}, \qquad 0 \le s \le t.$$

Proof (heuristic)
Assume that we can treat densities as if they were probabilities. Then, for $0 = s_0 \le s_1 \le s_2 \le \ldots \le s_n \le t$ we have
$$\mathbb{P}(S_1 = s_1, \ldots, S_n = s_n \mid X(t) = n + 1) = \frac{\mathbb{P}(S_1 = s_1, \ldots, S_n = s_n, X(t) = n+1)}{\mathbb{P}(X(t) = n + 1)}$$
$$= \frac{1}{\mathbb{P}(X(t) = n+1)}\, \mathbb{P}(T_i = s_i - s_{i-1},\, i = 1, \ldots, n,\; T_{n+1} > t - s_n)$$
$$= \frac{1}{\mathbb{P}(X(t) = n+1)} \Big\{ \prod_{i=1}^{n} i\lambda\, e^{-i\lambda(s_i - s_{i-1})} \Big\}\, e^{-(n+1)\lambda(t - s_n)}$$
$$= n!\, \lambda^n\, \frac{e^{-\lambda t}}{\mathbb{P}(X(t) = n+1)} \prod_{i=1}^{n} e^{-\lambda(t - s_i)}.$$
Hence
$$\mathbb{P}(S_1 = s_1, \ldots, S_n = s_n \mid X(t) = n + 1) = n! \prod_{i=1}^{n} f(s_i),$$
as required.

Lemma 6.2. Start the process with one individual. Then
$$\mathbb{E}X(t) = e^{\lambda t}.$$
Lemma 6.3. Start the process with one individual, and let $a_0$ be the age at time $t = 0$ of the initial individual. Then the sum $A(t)$ of all ages at time $t$ satisfies
$$\mathbb{E}\{A(t) \mid X(t)\} = a_0 + t + (X(t) - 1)\, \frac{1 - e^{-\lambda t} - \lambda t e^{-\lambda t}}{\lambda (1 - e^{-\lambda t})} \qquad\text{and}\qquad \mathbb{E}A(t) = a_0 + \frac{e^{\lambda t} - 1}{\lambda}.$$

Proof. The sum $A(t)$ of all ages at time $t$ is
$$A(t) = a_0 + t + \sum_{i=1}^{X(t) - 1} (t - S_i).$$
Condition on $X(t)$: by Lemma 6.1,
$$\mathbb{E}\{A(t) \mid X(t) = n + 1\} = a_0 + t + \mathbb{E}\Big\{ \sum_{i=1}^{n} (t - S_i) \,\Big|\, X(t) = n + 1 \Big\} = a_0 + t + n \int_0^t (t - x)\, \frac{\lambda e^{-\lambda(t - x)}}{1 - e^{-\lambda t}}\, dx.$$
Asymptotic growth

Theorem 6.4. Start the process with 1 individual. Then there is a random variable $W$ such that
$$\frac{X(t)}{e^{\lambda t}} \to W \quad (t \to \infty) \quad\text{almost surely},$$
and correspondingly, for the birth times,
$$\lambda S_m - \log m \to -\log W \quad (m \to \infty).$$
Lemma. There is a random variable $V$ such that $S_m - \frac{1}{\lambda} \log m \to V$ with probability 1.

Proof. We have
$$S_m = \sum_{i=1}^{m} T_i, \qquad \mathbb{E}S_m = \sum_{i=1}^{m} \frac{1}{\lambda i} = \frac{1}{\lambda} \log m + O(1), \qquad \operatorname{Var}(T_i) = (\lambda i)^{-2},$$
so that $S_m - \mathbb{E}S_m$ is a sum of independent, centred random variables with summable variances, and with Theorem 6.6 we may conclude that the lemma holds.

Theorem 6.6. Let $X_j$, $j \ge 1$, be independent with $\mathbb{E}X_j = 0$ and $\sum_j \operatorname{Var}(X_j) < \infty$. Then $S_n = \sum_{j \le n} X_j$ converges with probability 1.
A branching argument
Multiplying by $e^{-\lambda t}$ gives
$$e^{-\lambda t} X(t) = e^{-\lambda t}\, \mathbb{1}(t < T_1) + e^{-\lambda T_1}\, \mathbb{1}(t \ge T_1)\, \big( e^{-\lambda(t - T_1)} Z'_{t - T_1} + e^{-\lambda(t - T_1)} Z''_{t - T_1} \big),$$
where $Z'$ and $Z''$ are independent copies of the process, started from the two individuals present at time $T_1$. Letting $t \to \infty$ gives
$$W = e^{-\lambda T_1} (W' + W'') = U (W' + W''),$$
where $W'$ and $W''$ are independent copies of $W$ and are independent of $T_1$; $U = e^{-\lambda T_1} \sim U[0,1]$. For the moment generating function this yields
$$M_W(t) = \int_0^1 \mathbb{E}\, e^{ut(W' + W'')}\, du = \int_0^1 \big( \mathbb{E}\, e^{utW'} \big)\big( \mathbb{E}\, e^{utW''} \big)\, du = \int_0^1 M_W(ut)^2\, du.$$
Substituting $v = ut$ gives $t\, M_W(t) = \int_0^t M_W(v)^2\, dv$, and differentiating,
$$t\, M_W'(t) = -M_W(t) + M_W(t)^2;$$
with $M_W(0) = 1$ this is solved by $M_W(t) = 1/(1 - t)$, so $W$ is standard exponential.

Theorem 6.8. Let $X_j$ be independent with mean 0 and $\operatorname{Var}(X_j) = \sigma_j^2$, let $S_n = \sum_{j \le n} X_j$, and let $T$ be a stopping time. Then
$$\mathbb{E}\{S_T^2\} = \mathbb{E} \sum_{j=1}^{T} \sigma_j^2.$$
Proof. Since $T$ is a stopping time, for any integer $k \ge 1$ the event $\{T \ge k\} = \{T > k - 1\}$ depends only on the random variables $X_i$ for $i < k$, and hence is independent of $X_k$. In particular, if $j < k$, then $X_j \mathbb{1}(T \ge j)\, \mathbb{1}(T \ge k)$ is independent of $X_k$, so the corresponding cross term has expectation 0.

Thus,
$$\mathbb{E}\{S_{\min(T,m)}^2\} = \mathbb{E}\Big( \sum_{k=1}^{m} X_k \mathbb{1}(T \ge k) \Big)^2 = \sum_{k=1}^{m} \mathbb{E}\big\{ X_k^2\, \mathbb{1}(T \ge k) \big\} + 2 \sum_{1 \le j < k \le m} \mathbb{E}\big\{ X_j X_k\, \mathbb{1}(T \ge j)\, \mathbb{1}(T \ge k) \big\}$$
$$= \sum_{k=1}^{m} \mathbb{E}\big\{ X_k^2 \big\}\, \mathbb{E}\{ \mathbb{1}(T \ge k) \} = \sum_{k=1}^{m} \sigma_k^2\, \mathbb{P}(T \ge k) = \mathbb{E} \sum_{k=1}^{\min(T,m)} \sigma_k^2,$$
as required; letting $m \to \infty$ completes the proof by monotone convergence.
Lemma 6.9 (A Maximal Inequality). Assume in addition that $\sum_j \sigma_j^2 = \sigma^2 < \infty$. Then, for every $\varepsilon > 0$,
$$\mathbb{P}\Big\{ \sup_{n \ge 1} |S_n| \ge \varepsilon \Big\} \le \frac{\sigma^2}{\varepsilon^2}. \tag{6.2}$$

Proof. Let $T = \inf\{ n : |S_n| \ge \varepsilon \}$. Pick $m$; then for $\min(T, m)$ Wald's Theorem applies and we get $\mathbb{E}\{ S_{\min(T,m)}^2 \} \le \sigma^2$. Hence
$$\varepsilon^2\, \mathbb{P}(T \le m) \le \mathbb{E}\big\{ S_{\min(T,m)}^2\, \mathbb{1}(T \le m) \big\} \le \mathbb{E}\big\{ S_{\min(T,m)}^2 \big\} \le \sigma^2,$$
and letting $m \to \infty$ gives (6.2).

Finally, using (6.2), for every $k \ge 1$ we can find an $n_k < \infty$ such that $\sum_{j \ge n_k} \sigma_j^2 \le 8^{-k}$; then by (6.2),
$$\mathbb{P}\Big\{ \sup_{n \ge n_k} |S_n - S_{n_k}| \ge 2^{-k} \Big\} \le 2^{-k}.$$
As $\sum_k 2^{-k} < \infty$ we may apply the Borel-Cantelli theorem, and hence it follows that $S_n$ converges with probability 1, and therefore has a finite limit.

Recall the Borel-Cantelli theorem: if $\sum_i \mathbb{P}(E_i) < \infty$, then
$$\mathbb{P}\Big( \bigcap_{n \ge 1} \bigcup_{i \ge n} E_i \Big) = \lim_{n \to \infty} \mathbb{P}\Big( \bigcup_{i \ge n} E_i \Big) \le \lim_{n \to \infty} \sum_{i \ge n} \mathbb{P}(E_i) = 0.$$
7 Shortest paths in small world networks
For the pure growth process $S(t)$ started at $P$, let $M(t)$ denote the number of intervals at time $t$, and let $s(t)$ denote the total length of the circle covered at time $t$. For a Yule process with birth rate 2 we know from Lemma 6.2 and Lemma 6.3 that
$$\mathbb{E}M(t) = e^{2t}; \qquad \mathbb{E}s(t) = \frac{1}{2}\big( e^{2t} - 1 \big); \qquad V_x \approx 2e^{2x}.$$
Poisson approximation
We now derive a Poisson approximation for the number of intersections of intervals when started from $P, P'$. Assume that there are $m$ intervals $I_1, \ldots, I_m$, with given lengths $s_1, \ldots, s_m$, in the first process, and $n$ intervals $J_1, \ldots, J_n$, with given lengths $u_1, \ldots, u_n$, in the second process, positioned uniformly on $C$, independently. Let $X_{ij}$ be the indicator that $I_i$ and $J_j$ intersect; the $X_{ij}$ are independent if their index pairs do not intersect. Then
$$V = \sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij}, \qquad \mathbb{E}X_{ij} = \frac{2}{L} \min(s_i, u_j) \le \frac{2}{L}\, l_{su},$$
and put
$$\lambda = \mathbb{E}V = \frac{2}{L} \sum_{i=1}^{m} \sum_{j=1}^{n} \min(s_i, u_j), \qquad l_{su} = \max\{ \max_i s_i, \max_j u_j \}.$$
Corollary 7.1.
$$d_{TV}(\mathcal{L}(V), Po(\lambda)) \le \frac{4}{L}\, (m + n)\, l_{su}.$$

Proof. Let $I = \{(i,j) : i \in \{1, \ldots, m\}, j \in \{1, \ldots, n\}\}$ and
$$A_{(i,j)} = \{ (k, \ell) \in I : \{k, \ell\} \cap \{i, j\} \ne \emptyset \},$$
so that $|A_\alpha| \le m + n$ for each $\alpha \in I$. A random interval of length $b$ overlaps with a given interval of length $a$ on the circle with probability $\frac{2\min(a,b)}{L} \le \frac{2}{L} l_{su}$, so $p_\beta \le \frac{2}{L} l_{su}$ for all $\beta$. Then, with the notation of Theorem 4.4,
$$\sum_{\alpha \in I} p_\alpha\, \mathbb{E}(\eta_\alpha) \le \lambda\, (m + n)\, \frac{2}{L} l_{su} \qquad\text{and}\qquad \sum_{\alpha \in I} \mathbb{E}(X_\alpha (\eta_\alpha - X_\alpha)) \le \lambda\, (m + n)\, \frac{2}{L} l_{su},$$
and the corollary follows from Theorem 4.4 since $\lambda \min(1, 1/\lambda) \le 1$.
So, conditionally,
$$V_x \approx \text{Poisson}\Big( \frac{2}{L} \sum_{i=1}^{m} \sum_{j=1}^{n} \min(s_i, u_j) \Big)$$
in distribution, and hence
$$\mathbb{P}\{ V_x = 0 \mid M(x^-) = m, N(x^-) = n, s_1, \ldots, s_m, u_1, \ldots, u_n \} \approx e^{-\frac{2}{L} \sum_{i=1}^{m} \sum_{j=1}^{n} \min(s_i, u_j)}.$$
Unconditioning

Unconditionally,
$$\mathbb{P}\{ V_x = 0 \} \approx \mathbb{E} \exp\Big\{ -\frac{4}{L} \int^{x} M(u) N(u)\, du \Big\},$$
and, since $M(u) \approx W e^{2u}$ and $N(u) \approx W' e^{2u}$ with $W, W'$ independent standard exponentials,
$$\frac{4}{L} \int^{x} M(u) N(u)\, du \approx \frac{4}{L}\, W W'\, \frac{e^{4x}}{4} = \frac{1}{L}\, W W' e^{4x} = W W' e^{2\tilde{x}}, \qquad \tilde{x} = 2x - \tfrac{1}{2} \log L,$$
and so
$$\mathbb{P}\{ V_x = 0 \} \approx \mathbb{E}\, e^{-e^{2\tilde{x}}\, W W'}.$$

Theorem 7.2.
$$\mathbb{E} \exp\{ -e^{2x}\, W W' \} = \int_0^{\infty} \frac{e^{-y}}{1 + e^{2x} y}\, dy.$$
Corollary 7.3. If $T$ denotes a random variable with distribution given by
$$\mathbb{P}[T > x] = \int_0^{\infty} \frac{e^{-y}}{1 + e^{2x} y}\, dy$$
and $\tilde{D} = \frac{1}{2} \log(L) + T$, then
$$\sup_x \big| \mathbb{P}[D \le x] - \mathbb{P}[\tilde{D} \le x] \big| = O\big( L^{-1/5} \log^2(L) \big).$$

Proof (heuristic) of Theorem 7.2. Conditioning on $W = y$, which has the standard exponential density $e^{-y}$, and using that $W'$ is standard exponential and independent of $W$,
$$\mathbb{E}\big( \exp\{ -e^{2x} W W' \} \mid W = y \big) = \frac{1}{1 + e^{2x} y}.$$
Hence
$$\mathbb{E} \exp\{ -e^{2x} W W' \} = \int_0^{\infty} \frac{e^{-y}}{1 + e^{2x} y}\, dy.$$
This finishes the proof.
Remarks

8 Other models and summaries
The average degree is
$$\bar{D} = \frac{1}{n} \sum_{v=1}^{n} d(v) = \frac{2}{n} \sum_{u < v} X_{u,v},$$
noting that each edge gets counted twice. As the $X_{u,v}$ are independent, in the sparse regime we can use a Poisson approximation again, giving that, in distribution,
$$\sum_{u < v} X_{u,v} \approx Po\Big( \frac{n(n-1)}{2}\, p \Big), \qquad\text{so that}\qquad \bar{D} \approx \frac{2}{n}\, Z, \quad Z \sim Po\Big( \frac{n(n-1)}{2}\, p \Big).$$
In the dense regime, the counts for small subgraphs are not asymptotically independent. Indeed, in dense Bernoulli random graphs the number of edges already asymptotically determines the number of 2-stars and the number of triangles; see Reinert and Röllin (2010).

Here $m_{++} = \sum_{u,v} m_{u,v}$ is twice the total number of edges in $m$; see Picard et al. (2007).
$$\mathbb{E}(C) \approx \frac{m - 1}{8}\, \frac{(\log n)^2}{n}.$$
$$\frac{\log n - \log(m/2) - 1}{\log \log n + \log(m/2)} + \frac{3}{2}.$$
$$p_{i,j} = \frac{d(i)\, d(j)}{2m - 1}.$$
Connectivity
For a motif $m$ on $k$ nodes with adjacency matrix $(m_{uv})_{1 \le u, v \le k}$, in the Erdős-Rényi mixture model,
$$\mu(m) = \sum_{c_1 = 1}^{Q} \cdots \sum_{c_k = 1}^{Q} \alpha_{c_1} \cdots \alpha_{c_k} \prod_{1 \le u, v \le k} p_{c_u, c_v}^{m_{uv}}.$$
Weighted networks
Directed graphs
9 Sampling from networks
Sampling schemes
For a sample $S$ of size $n$ from a population $U$ of $N_u$ units,
$$\hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i \in S} y_i; \qquad \hat{\tau} = N_u\, \bar{y}.$$
Under simple random sampling with replacement,
$$\operatorname{Var}(\hat{\mu}) = \frac{1}{n}\, \sigma^2; \qquad \operatorname{Var}(\hat{\tau}) = \frac{1}{n}\, \sigma^2 N_u^2,$$
estimated by
$$\hat{V}(\hat{\tau}) = \frac{1}{n}\, N_u^2\, s^2, \qquad s^2 = \frac{1}{n - 1} \sum_{i \in S} (y_i - \bar{y})^2.$$
Mean
Let $\pi_{i,j} = \mathbb{P}(i \in S, j \in S)$, with $\pi_{i,i} = \pi_i$, and let $Z_i = \mathbb{1}(i \in S)$. Then
$$\mathbb{E}\Big( \sum_{i \in S} \frac{y_i}{\pi_i} \Big) = \mathbb{E}\Big( \sum_{i \in U} \frac{y_i}{\pi_i}\, Z_i \Big) = \sum_{i \in U} \frac{y_i}{\pi_i}\, \mathbb{E}Z_i = \sum_{i \in U} y_i = \tau,$$
so $\hat{\tau} = \sum_{i \in S} y_i / \pi_i$ is an unbiased estimator for $\tau$.

Variance
$$\operatorname{Var}(\hat{\tau}) = \sum_{i \in U} \frac{y_i^2}{\pi_i^2}\, \operatorname{Var}Z_i + \sum_{i \in U} \sum_{j \in U, j \ne i} \frac{y_i y_j}{\pi_i \pi_j}\, \operatorname{Cov}(Z_i, Z_j)$$
$$= \sum_{i \in U} \frac{y_i^2}{\pi_i^2}\, \pi_i (1 - \pi_i) + \sum_{i \in U} \sum_{j \in U, j \ne i} \frac{y_i y_j}{\pi_i \pi_j}\, (\pi_{i,j} - \pi_i \pi_j).$$
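A minimal sketch of the Horvitz-Thompson estimator (not from the original notes), here applied to simple random sampling without replacement where $\pi_i = n/N$; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def horvitz_thompson(y_sampled, pi_sampled):
    """Unbiased estimator of the population total: sum of y_i / pi_i."""
    return np.sum(np.asarray(y_sampled) / np.asarray(pi_sampled))

N, n = 1000, 100
y = rng.gamma(2.0, 5.0, size=N)                 # synthetic population values
idx = rng.choice(N, size=n, replace=False)      # SRS without replacement
tau_hat = horvitz_thompson(y[idx], np.full(n, n / N))
print(f"estimate {tau_hat:.1f}, true total {y.sum():.1f}")
```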
An alternative expression is
$$\operatorname{Var}(\hat{\tau}) = \sum_{i \in U} \sum_{j \in U} y_i y_j \Big( \frac{\pi_{i,j}}{\pi_i \pi_j} - 1 \Big).$$
Assuming that $\pi_{i,j} > 0$ for all pairs, an unbiased estimator for $\operatorname{Var}(\hat{\tau})$ is given by
$$\widehat{\operatorname{Var}}(\hat{\tau}) = \sum_{i \in S} \sum_{j \in S} y_i y_j \Big( \frac{1}{\pi_i \pi_j} - \frac{1}{\pi_{i,j}} \Big).$$
For simple random sampling without replacement,
$$\pi_i = \frac{n}{N_u} \qquad\text{and, for } i \ne j,\qquad \pi_{i,j} = \frac{n(n-1)}{N_u (N_u - 1)}.$$
Under simple random sampling without replacement the estimators
$$\bar{y} = \frac{1}{n} \sum_{i \in S} y_i; \qquad \hat{\tau} = N_u\, \bar{y}$$
are unchanged, but their variance is strictly less than the variance under simple random sampling with replacement:
$$\operatorname{Var}(\hat{\tau}) = \frac{N_u (N_u - n)}{n}\, \sigma^2 < \frac{N_u^2}{n}\, \sigma^2.$$
For vertex sampling from a network with vertex set $V$,
$$\pi_i = \frac{n}{|V|} \qquad\text{and, for } i \ne j,\qquad \pi_{i,j} = \frac{n(n-1)}{|V| (|V| - 1)}.$$
Star sampling
If we include all adjacent vertices of all vertices in S, then the vertex inclusion probabilities change. For vertex $i$, by inclusion-exclusion,
$$\pi_i = \sum_{\emptyset \ne L \subseteq N(i)} (-1)^{|L| + 1}\, \mathbb{P}(L),$$
where $N(i)$ is the set of all vertices which are adjacent to $i$, together with $i$ itself, and $\mathbb{P}(L)$ is the probability of selecting the set $L$ when obtaining the initial sample S. A similar inclusion-exclusion expression holds for the pairwise inclusion probabilities $\pi_{i,j}$.
Variants
k      ·      E.no   mean p-value   max p-value
1      0.01   3      1.74E-09       8.97E-08
1000   0.167  50     0.1978         0.8913
2      0.01   6      0              0
1      0.003  3      1.65E-13       3.30E-12
2      0.05   50     0.0101         0.1124
       0.03   60     0.0146         0.2840
Bernoulli (Erdős-Rényi) random graphs. In the random graph model of Erdős and Rényi (1959), the (finite) node set $V$ is given, say $|V| = n$. We denote the set of all potential edges by $E$; thus $|E| = \binom{n}{2}$. An edge between two nodes is present with probability $p$, independently of all other edges. Here $p$ is an unknown parameter.
10 Fitting a model
We denote by
$$E = \{ (u, v) \in V \times V : u \ne v \}$$
the set of potential edges.
The likelihood is
$$L(p; x) = \prod_{(i,j) \in E} p^{x_{i,j}} (1 - p)^{1 - x_{i,j}} = (1 - p)^{|E|} \prod_{(i,j) \in E} \Big( \frac{p}{1 - p} \Big)^{x_{i,j}} = (1 - p)^{|E|} \Big( \frac{p}{1 - p} \Big)^{\sum_{(i,j) \in E} x_{i,j}}.$$
Note that $t = \sum_{(i,j) \in E} x_{i,j}$ is the total number of edges in the random graph.
Differentiating the log-likelihood,
$$\frac{\partial}{\partial p}\, \ell(p; x) = -\frac{1}{1 - p}\, (|E| - t) + \frac{t}{p}.$$
Setting this to zero,
$$t(1 - p) = p\, (|E| - t) \iff t = p\, |E| \iff \hat{p} = \frac{t}{|E|}.$$
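A small illustrative sketch (not from the notes): simulate a Bernoulli graph and recover $p$ by the maximum likelihood estimator $\hat{p} = t / |E|$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_true = 200, 0.05
# Each of the |E| = n(n-1)/2 potential edges is present independently with prob p
A = np.triu(rng.random((n, n)) < p_true, k=1)
t = int(A.sum())                 # observed number of edges
E = n * (n - 1) // 2             # number of potential edges
p_hat = t / E                    # maximum likelihood estimator
print(f"p_hat = {p_hat:.4f} (true p = {p_true})")
```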
$$p(x) = \frac{1}{\kappa}\, \exp\{ \theta_1 L(x) + \theta_2 S_2(x) + \theta_3 S_3(x) + \theta_4 T(x) \},$$
where $L(x)$ counts edges, $S_2(x)$ 2-stars, $S_3(x)$ 3-stars, $T(x)$ triangles, and $\kappa$ is a normalising constant.
As in a Bernoulli random graph, $\bar{D} \approx \frac{2}{n} Z$ where $Z \sim Po\big( \frac{n(n-1)}{2} p \big)$; under the null hypothesis the average node degree can thus be assessed via this Poisson approximation.
Quantile-quantile plots

[Figure: two quantile-quantile plots against theoretical Normal quantiles.]
11 Nonparametric methods

We can also use a quantile-quantile plot for two sets of simulated data, or for one set of simulated data and one set of observed data. The interpretation is always the same: if the data come from the same distribution, then we should see a diagonal line; otherwise not. Here we compare 1000 Normal(0,1) variables with 1000 Poisson(1) variables, clearly not a good fit.
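A sketch of such a comparison (an illustrative addition, using matplotlib): the matched order statistics of the two samples are plotted against each other.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = np.sort(rng.normal(0, 1, 1000))     # 1000 Normal(0,1) draws
y = np.sort(rng.poisson(1, 1000))       # 1000 Poisson(1) draws

# Points on the diagonal would indicate the same distribution
plt.scatter(x, y, s=5)
lims = [min(x.min(), y.min()), max(x.max(), y.max())]
plt.plot(lims, lims, "k--")
plt.xlabel("Normal(0,1) quantiles")
plt.ylabel("Poisson(1) quantiles")
plt.show()
```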
With $m$ the number of simulated values at least as extreme as the observed one among $n$ simulations, the Monte Carlo p-value is
$$\frac{m}{n + 1}.$$
For random graphs, Monte Carlo tests often shuffle edges with the number of edges fixed, or fix the vertex degree distribution, or fix some other summary.
Suppose we want to see whether our observed clustering coefficient is unusual for the type of network we would like to consider. Then we may draw many networks uniformly at random from all networks having the same vertex degree sequence, say. We count how often a clustering coefficient at least as extreme as ours occurs, and we use that count to test the hypothesis (see the sketch below).
In practice these types of test are the most widely used tests in network analysis. They are called conditional uniform graph tests.
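A simplified sketch (not from the notes) of degree-preserving edge shuffling via double-edge swaps; multi-edge checks are omitted, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def rewire(edges, n_swaps=10_000):
    """Double-edge swap: replace (a,b),(c,d) by (a,d),(c,b). Every vertex
    degree is preserved; a long run of swaps approximates a uniform draw
    from graphs with this degree sequence (multi-edges not checked here)."""
    edges = [list(e) for e in edges]
    for _ in range(n_swaps):
        i, j = rng.integers(len(edges), size=2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) == 4:       # keeps the graph free of self-loops
            edges[i], edges[j] = [a, d], [c, b]
    return [tuple(e) for e in edges]

# Conditional uniform graph test: recompute the statistic on many rewired
# networks, count how often it is at least as extreme as the observed value,
# and form the Monte Carlo p-value m / (n + 1) as above.
```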
Some caveats:
A power-law degree distribution satisfies
$$\mathbb{P}(d(V) = k) \approx C(k) \approx C\, k^{-\gamma},$$
which gives a straight line in a plot of $\log P(k)$ against $\log k$.

[Figure: log-log plot of the degree distribution.]
12 Statistical inference for vertex characteristics and edges

        A1     A2
B1      n11    n21    n+1
B2      n12    n22    n+2
        n1+    n2+    n
The likelihood
Write $e_{ij}$ for the expected counts. Under $H_0$ (independence),
$$\log e_{ij} = \mu + \alpha_i + \beta_j.$$
Thus, under $H_0$,
$$n\, p_{ij} = e_{ij} = e^{\mu + \alpha_i + \beta_j}.$$
F (a, S(x)) := P
bS(n)
nB(x)
Organism         Predicted proteins   M.V.   F.     E.F.
D.M              1262                 0.35   0.17   0.44
C.E              78                   0.36   0.37   0.49
S.C              1608                 0.39   0.31   0.54
E.C              150                  0.57   0.70   0.71
M.M              32                   0.72   0.50   0.69
H.S              273                  0.44   0.47   0.71

Organism (DIP)   Predicted proteins   M.V.   F.     E.F.
D.M              1275                 0.53   0.67   0.69
C.E              85                   0.38   0.55   0.71
S.C              1618                 0.67   0.61   0.67
E.C              154                  0.69   0.69   0.70
M.M              32                   0.59   0.88   0.81
H.S              274                  0.79   0.90   0.89
$$\log \frac{p(x)}{1 - p(x)} = \theta^T x, \qquad\text{or, equivalently,}\qquad p(x) = \frac{e^{\theta^T x}}{1 + e^{\theta^T x}}.$$
[Figure: protein interaction network with proteins labelled A, B, C, D.]
$$t_{xy} = \sum_{z \in B(x,y)} f(v_a\, v_c\, v_b\, v_a),$$
13 Motifs
Picard et al. (2008) calculate not only the mean but also the variance for motifs on 3 and 4 nodes in undirected graphs under the models
1. Bernoulli random graph (ER)
2. random graphs with fixed degree sequence (FDD); here they estimate the mean and standard deviation from simulations, and they also consider a version with fixed expected degrees
3. Erdős-Rényi mixture models (ERMG).
Observed counts and estimated means:

         obs          ER           FDD            ERMG
         14,113       5,704.08     14,113         13,602.97
         75           10.85        66.91          52.82
         112,490      7,676.83     112,490        93,741.08
         248,093      52,774.79    248,093        243,846.93
         11,368       72.47        3,579.49       10,221.17
         6,425,495    133,050.00   5,772,005.15   1,537,740.00
Estimated standard deviations:

ER          FDD        ERMG
311.08      3.40       681.76
0           7.80       0
2,659.18    20.41      27,039.88
1,281.87    8.90       5,089.62
0           68.58      0
51,676.68   3,041.98   1,672,086.51
$$\mathbb{P}(W = w) = e^{-\lambda} \sum_{c=1}^{w} \frac{\lambda^c}{c!} \binom{w-1}{c-1} (1 - a)^c\, a^{w - c}, \qquad w \ge 1,$$
and
$$\mathbb{P}(W = 0) = e^{-\lambda}.$$
Parameter estimation
$$a = \frac{\text{var} - \text{mean}}{\text{var} + \text{mean}} \qquad\text{and}\qquad \lambda = (1 - a)\, \text{mean}.$$
These parameters can be estimated using the sample mean and the sample variance.
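A minimal sketch of these method-of-moments estimates (not from the notes); the sample counts below are purely illustrative.

```python
import numpy as np

def polya_aeppli_moments(counts):
    """Method-of-moments estimates: a = (var - mean)/(var + mean),
    lambda = (1 - a) * mean, from a sample of motif counts."""
    m, v = np.mean(counts), np.var(counts, ddof=1)
    a = (v - m) / (v + m)
    lam = (1 - a) * m
    return a, lam

counts = [3, 7, 0, 5, 2, 9, 4, 0, 6, 3]   # illustrative counts only
a_hat, lam_hat = polya_aeppli_moments(counts)
print(f"a = {a_hat:.3f}, lambda = {lam_hat:.3f}")
```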
The Pólya-Aeppli distribution is a better fit to the count distribution than the normal distribution; this result is fairly consistent across motifs and across networks. In particular, the normal approximation can lead to false positive results: some motifs may be flagged as exceptional when they are not.
Let $d_G^j(k)$ be the sample distribution of the node counts for a given orbit degree $k$ in a graph $G$ and for a particular automorphism orbit $j$. This sample distribution is then scaled by $1/k$ so that large degrees do not dominate the score, and normalised to give a total sum of 1:
$$N_G^j(k) = \frac{d_G^j(k)/k}{\sum_{\ell} d_G^j(\ell)/\ell}.$$
The graph degree distribution agreement between graphs $G$ and $H$ is
$$\text{GDDA} = \frac{1}{73} \sum_{j=0}^{72} \big( 1 - D^j(G, H) \big),$$
where $D^j(G, H)$ is a distance between the scaled distributions $N_G^j$ and $N_H^j$.
Threshold behaviour
While we see that similar models have overlapping GDDA scores, none of the models under investigation here fits the protein interaction network. We also tried preferential attachment and gene duplication networks. In Rito et al. (2012) we find that the Yeast protein interaction network is highly heterogeneous, which makes model fitting very difficult. But remember: the model depends on the research question; here we were interested in finding motifs.
14 Modules and communities
Modularity
As guidance for how many communities a network should be split into, they use the modularity. For a division with $g$ groups, define a $g \times g$ matrix $e$ whose component $e_{ij}$ is the fraction of edges in the original network that connect nodes in group $i$ to those in group $j$. Then the modularity is defined to be
$$Q = \sum_{i} e_{i,i} - \sum_{i,j,k} e_{i,j}\, e_{k,i},$$
the fraction of all edges that lie within communities minus the expected value of the same quantity in a graph where the nodes have the same degrees but edges are placed at random. A value of $Q = 0$ indicates that the community structure is no stronger than would be expected by random shuffling.
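A minimal sketch (not from the notes) computing $Q = \sum_i e_{ii} - \sum_i a_i^2$, with $a_i = \sum_j e_{ij}$, from an adjacency matrix and a group labelling; all names are illustrative.

```python
import numpy as np

def modularity(A, groups):
    """Q = sum_i e_{ii} - sum_i a_i^2, where e_{ij} is the fraction of edge
    ends running between groups i and j and a_i = sum_j e_{ij}."""
    A = np.asarray(A, dtype=float)
    two_m = A.sum()                              # each edge counted twice
    labels = np.unique(groups)
    e = np.array([[A[np.ix_(groups == g, groups == h)].sum() / two_m
                   for h in labels] for g in labels])
    a = e.sum(axis=1)
    return np.trace(e) - np.sum(a ** 2)

# Two triangles joined by one edge, split into the two obvious communities
A = np.zeros((6, 6), int)
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))   # clearly positive
```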
15 Further topics
Further reading