
Information Bottleneck Method

Azamat Berdyshev

University of Toronto

October 1, 2017
Outline

Some Information Theory basics

Information Bottleneck Method

Applications in Deep Learning



Entropy

Let X ∈ 𝒳 be a discrete r.v. distributed as X ∼ P. Then

  H(X) = − ∑_{x∈𝒳} P(x) log P(x)
       = − E[log P(X)]

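Not part of the original slides: a minimal MATLAB/Octave sketch of the definition, using a hypothetical pmf p.

% Entropy of a discrete pmf, straight from the definition (in nats).
p = [0.5 0.25 0.125 0.125];              % hypothetical example distribution
H = -sum(p(p > 0) .* log(p(p > 0)));     % drop zero-probability outcomes (0·log 0 = 0)
% Using log2 instead of log gives entropy in bits; for this p the answer is 1.75 bits.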


Conditional Entropy

Let (X, Y) ∈ 𝒳 × 𝒴 be a pair of discrete r.v.s jointly distributed as (X, Y) ∼ P_XY. Then

  H(X|Y) = ∑_{x∈𝒳} ∑_{y∈𝒴} P_XY(x, y) log ( P_Y(y) / P_XY(x, y) )
         = ∑_{y∈𝒴} P_Y(y) H(X | Y = y)

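As a rough sketch (again not from the slides), H(X|Y) can be computed from any joint pmf via the mixture form above; Pxy below is a hypothetical example.

% Conditional entropy H(X|Y) from a joint pmf Pxy (rows: x, columns: y), in nats.
Pxy  = [0.25 0.25; 0.10 0.40];             % hypothetical joint distribution
Py   = sum(Pxy, 1);                        % marginal P_Y(y)
Px_y = Pxy ./ Py;                          % P_X|Y(x|y), columns sum to 1
HX_y = -sum(Px_y .* log(Px_y + eps), 1);   % H(X | Y = y) for each y (eps guards log 0)
HXgY = sum(Py .* HX_y);                    % H(X|Y) = ∑_y P_Y(y) H(X | Y = y)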


Mutual Information

  I(X; Y) = H(X) − H(X|Y)
          = ∑_{x∈𝒳} ∑_{y∈𝒴} P_XY(x, y) log ( P_XY(x, y) / (P_X(x) P_Y(y)) )

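A small consistency check (hypothetical joint pmf, not from the slides): both expressions above give the same number.

% Mutual information I(X;Y) from a joint pmf, two equivalent ways (in nats).
Pxy = [0.25 0.25; 0.10 0.40];              % hypothetical joint distribution
Px  = sum(Pxy, 2);  Py = sum(Pxy, 1);      % marginals P_X, P_Y
R   = Pxy ./ (Px * Py);                    % ratio P_XY(x,y) / (P_X(x) P_Y(y))
I1  = sum(Pxy(:) .* log(R(:)));            % double-sum form
HX   = -sum(Px .* log(Px));                % H(X)
HXgY = -sum(sum(Pxy .* log(Pxy ./ Py)));   % H(X|Y) = -∑ P_XY log P_X|Y
I2   = HX - HXgY;                          % entropy-difference form; I2 matches I1 up to rounding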


Data Processing Inequality

◮ Let X → Y → Z be a Markov chain; then

  I(X; Y) ≥ I(X; Z)

◮ Reparametrization invariance trick: for any invertible φ, ψ

  I(X; Y) = I(φ(X); ψ(Y))

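Both claims are easy to sanity-check numerically. The sketch below (all distributions randomly generated, hence hypothetical) builds a Markov chain X → Y → Z and compares I(X;Y) with I(X;Z).

% Numerical check of the data processing inequality on a random chain X -> Y -> Z.
nx = 4; ny = 3; nz = 5;
Px   = rand(nx,1);  Px   = Px / sum(Px);          % p(x)
Py_x = rand(nx,ny); Py_x = Py_x ./ sum(Py_x, 2);  % channel p(y|x)
Pz_y = rand(ny,nz); Pz_y = Pz_y ./ sum(Pz_y, 2);  % channel p(z|y)
Pxy = Px .* Py_x;                                 % joint p(x,y)
Pxz = Pxy * Pz_y;                                 % joint p(x,z): Z depends on X only through Y
mi  = @(J) sum(sum(J .* log(J ./ (sum(J,2) * sum(J,1)))));   % mutual information of a joint pmf
I_XY = mi(Pxy);
I_XZ = mi(Pxz);
% I_XY >= I_XZ holds for every draw, as the DPI requires.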


Outline

Some Information Theory basics

Information Bottleneck Method

Applications in Deep Learning



Information Bottleneck Problem
(N. Tishby, F. Pereira, W. Bialek, 1999)

◮ Consider the information channel  X −f(x)→ T −g(t)→ Y

  minimize_{P_T|X(t|x)}   I(X; T)
  subject to              I(T; Y) ≥ ε                      (1)

◮ Let λ be the Lagrange multiplier; then

  min_{P_T|X(t|x)}   I(X; T) − λ I(T; Y)                   (2)

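Problem (2) is usually solved with the alternating self-consistent updates derived in Tishby, Pereira & Bialek (1999): p(t|x) ∝ p(t) exp(−β D_KL(P_Y|X=x ‖ P_Y|T=t)), with p(t) and p(y|t) refit in between. The block below is a minimal MATLAB/Octave sketch of those iterations, not the authors' code; the joint pmf Pxy, the cardinality K of T, and the trade-off parameter beta (playing the role of λ above) are hypothetical placeholders.

% Iterative IB: alternate the self-consistent updates until p(t|x) stabilizes.
Pxy = rand(8, 5); Pxy = Pxy / sum(Pxy(:));                % hypothetical joint p(x,y)
Px  = sum(Pxy, 2);                                        % p(x)
Py_x = Pxy ./ Px;                                         % p(y|x), rows sum to 1
K = 3; beta = 5;                                          % |T| and trade-off parameter
Pt_x = rand(numel(Px), K); Pt_x = Pt_x ./ sum(Pt_x, 2);   % random initial p(t|x)
for it = 1:300
    Pt   = Pt_x' * Px;                                    % p(t)   = ∑_x p(x) p(t|x)
    Px_t = (Pt_x .* Px) ./ (Pt' + eps);                   % p(x|t) by Bayes' rule
    Py_t = Px_t' * Py_x;                                  % p(y|t) = ∑_x p(y|x) p(x|t)
    D = zeros(numel(Px), K);                              % D(x,t) = KL( p(y|x) || p(y|t) )
    for t = 1:K
        D(:,t) = sum(Py_x .* log((Py_x + eps) ./ (Py_t(t,:) + eps)), 2);
    end
    Pt_x = Pt' .* exp(-beta * D);                         % p(t|x) ∝ p(t) exp(-beta * D(x,t))
    Pt_x = Pt_x ./ sum(Pt_x, 2);                          % normalize over t
end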


Outline

Some Information Theory basics

Information Bottleneck Method

Applications in Deep Learning



[Figures drawn with tikz]


Group lasso
(e.g., Yuan & Lin; Meier, van de Geer, Bühlmann; Jacob, Obozinski, Vert)

◮ problem:

  minimize   f(x) + λ ∑_{i=1}^N ‖x_i‖_2

  i.e., like the lasso, but entire groups of variables are set to zero (or kept) together

◮ also called ℓ_{1,2} mixed-norm regularization; its proximal operator is the block soft-thresholding sketched below

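In proximal and ADMM solvers the group-lasso penalty enters only through its proximal operator, which is block soft-thresholding: each group is shrunk toward zero and set exactly to zero once its norm falls below the threshold. A minimal MATLAB/Octave sketch (the groups and test vector are hypothetical):

% Block soft-thresholding: prox of lambda * ∑_i ||x_i||_2 over non-overlapping groups.
prox_group = @(v, lam) v * max(0, 1 - lam / max(norm(v), realmin));  % prox for one group

x = randn(6,1); lambda = 0.5;
groups = {1:2, 3:4, 5:6};                  % hypothetical partition of the variables
z = zeros(size(x));
for i = 1:numel(groups)
    z(groups{i}) = prox_group(x(groups{i}), lambda);   % shrink each block; zero it if ||x_i||_2 <= lambda
end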


Structured group lasso
(Jacob, Obozinski, Vert; Bach et al.; Zhao, Rocha, Yu; . . . )

◮ problem:

  minimize   f(x) + ∑_{i=1}^N λ_i ‖x_{g_i}‖_2

  where g_i ⊆ [n] and G = {g_1, . . . , g_N}

◮ like group lasso, but the groups can overlap arbitrarily

◮ particular choices of groups can impose ‘structured’ sparsity

◮ e.g., topic models, selecting interaction terms for (graphical) models, tree structure of gene networks, fMRI data

◮ generalizes to the composite absolute penalties family, as in the sketch below:

  r(x) = ‖( ‖x_{g_1}‖_{p_1} , . . . , ‖x_{g_N}‖_{p_N} )‖_{p_0}

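To make the notation concrete, the sketch below (not from the slides) evaluates r(x) for a hypothetical x, using the overlapping groups of the next slide, inner norms p_i = 2, and outer norm p_0 = 1, which recovers the overlapping group-lasso penalty.

% Composite absolute penalty r(x) = ||( ||x_g1||_p1, ..., ||x_gN||_pN )||_p0.
x  = randn(6,1);                        % hypothetical point
G  = {4, 5, 6, [2 4], [3 5 6], 1:6};    % overlapping groups g_i as index vectors
p  = 2 * ones(1, numel(G));             % inner norms p_i (all 2 here)
p0 = 1;                                 % outer norm p_0 (a plain sum of group norms)
v  = zeros(numel(G), 1);
for i = 1:numel(G)
    v(i) = norm(x(G{i}), p(i));         % ||x_{g_i}||_{p_i}
end
r = norm(v, p0);                        % value of the penalty at x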


Structured group lasso
(Jacob, Obozinski, Vert; Bach et al.; Zhao, Rocha, Yu; . . . )

hierarchical selection:

[Tree diagram: node 1 at the root with children 2 and 3; node 2 has child 4, node 3 has children 5 and 6]

◮ G = {{4}, {5}, {6}, {2, 4}, {3, 5, 6}, {1, 2, 3, 4, 5, 6}}


◮ nonzero variables form a rooted, connected subtree (illustrated in the sketch below)
  – if a node is selected, so are its ancestors
  – if a node is not selected, neither are its descendants

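A tiny MATLAB/Octave sketch of why the subtree property holds (the choice of which groups the penalty zeroes out is hypothetical): because each group is a node together with all of its descendants, zeroing any collection of groups always leaves a support that is a rooted, connected subtree.

% Each group g_i = a node of the tree above plus all of its descendants.
G = {4, 5, 6, [2 4], [3 5 6], 1:6};
x = ones(6,1);                  % pretend every variable starts out nonzero
zeroed_groups = [2 4];          % hypothetical: the penalty zeroes groups {5} and {2,4}
for i = zeroed_groups
    x(G{i}) = 0;                % zeroing {2,4} removes node 2 and its descendant 4
end
support = find(x)'              % remaining support {1, 3, 6}: a rooted, connected subtree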


Sample ADMM implementation: lasso

% ADMM for the lasso in graph form: minimize (1/2)||y - b||^2 + ||x||_1 subject to y = A*x
% (the l1 weight is 1 here; scale the threshold in prox_g for other weights).
% Assumes A (m x n), b, rho > 0 and MAX_ITER are defined, and that xz, xt, yz, yt
% have been initialized (e.g., to zero vectors of lengths n, n, m, m).

prox_f = @(v,rho) (rho/(1 + rho))*(v - b) + b;              % prox of (1/2)||y - b||^2
prox_g = @(v,rho) (max(0, v - 1/rho) - max(0, -v - 1/rho)); % soft thresholding, prox of ||x||_1

AA = A*A';
L  = chol(eye(m) + AA);        % factor I + A*A' once, reuse in every iteration

for iter = 1:MAX_ITER
    % proximal steps on the two objective terms
    xx = prox_g(xz - xt, rho);
    yx = prox_f(yz - yt, rho);

    % project (xx + xt, yx + yt) onto the constraint set {(x, y) : y = A*x}
    yz = L \ (L' \ (A*(xx + xt) + AA*(yx + yt)));
    xz = xx + xt + A'*(yx + yt - yz);

    % dual variable updates
    xt = xt + xx - xz;
    yt = yt + yx - yz;
end
