Deep Belief Networks

(1)

Deep Belief Networks

Seung-Hoon Na

Chonbuk National University

(2)

Boltzmann Machines

• Energy-based model defined over a d-dimensional binary random vector 𝒙 ∈ {0,1}^𝑑

(3)

Boltzmann Machines with latent variables

• Latent variables can model higher-order interactions among the visible units.

• Boltzmann machine becomes a universal approximator of probability mass functions over discrete variables (Le Roux and Bengio, 2008)

• Decompose 𝒙 into the visible units 𝒗 and the latent (or hidden) units 𝒉.

(4)

Boltzmann Machines

(5)

Restricted Boltzmann Machine

• No connections are permitted among the visible units or among the hidden units → bipartite graph

$$
E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^{T}\mathbf{W}\mathbf{h} - \mathbf{b}^{T}\mathbf{v} - \mathbf{c}^{T}\mathbf{h}
$$

(6)

Restricted Boltzmann Machine

• Energy-based model

(7)

Restricted Boltzmann Machine:

Conditional Distributions

• The conditional distributions 𝑃(𝒉 | 𝒗) and 𝑃(𝒗 | 𝒉) are factorial and relatively simple to compute and to sample from
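Because the conditionals factorize, each unit can be computed and sampled independently of the others in its layer. The following is a minimal sketch in plain Python (standard library only), assuming the usual sigmoid form of the RBM conditionals; the toy sizes, parameter values, and helper names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_h_given_v(v, W, c):
    """Return [P(h_j = 1 | v)] for every hidden unit j; W[i][j] links v_i and h_j."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def p_v_given_h(h, W, b):
    """Return [P(v_i = 1 | h)] for every visible unit i."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def sample_bernoulli(probs, rng):
    """Draw every unit independently, which is valid because the conditional is factorial."""
    return [1 if rng.random() < p else 0 for p in probs]

# Hypothetical toy RBM: 3 visible units, 2 hidden units.
W = [[0.5, -1.0], [1.5, 0.0], [-0.5, 2.0]]
b = [0.0, 0.1, -0.2]
c = [0.3, -0.3]
rng = random.Random(0)

v = [1, 0, 1]
ph = p_h_given_v(v, W, c)      # one probability per hidden unit
h = sample_bernoulli(ph, rng)  # one binary sample per hidden unit
pv = p_v_given_h(h, W, b)      # one probability per visible unit
```

Computing a whole layer costs one matrix-vector product, which is what makes block Gibbs sampling in an RBM cheap.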

(8)

Restricted Boltzmann Machine:

Conditional Distributions

(9)

Restricted Boltzmann Machine

(10)

Restricted Boltzmann Machine

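$$
P(\mathbf{h}, \mathbf{v}) = \frac{1}{Z}\exp\bigl(-E(\mathbf{h}, \mathbf{v})\bigr), \qquad Z = \sum_{\mathbf{h},\mathbf{v}} \exp\bigl(-E(\mathbf{h}, \mathbf{v})\bigr)
$$

$$
E(\mathbf{h}, \mathbf{v}) = -\mathbf{b}^{T}\mathbf{v} - \mathbf{c}^{T}\mathbf{h} - \mathbf{v}^{T}\mathbf{W}\mathbf{h}
$$

Decomposition into the terms in 𝒉_−𝑗 (all hidden units except ℎ_𝑗) and the terms in ℎ_𝑗, where 𝑾_𝑗 is the j-th column of 𝑾 and 𝑾_−𝑗 holds the remaining columns:

$$
E(\mathbf{h}, \mathbf{v}) = -\mathbf{b}^{T}\mathbf{v} - \mathbf{c}_{-j}^{T}\mathbf{h}_{-j} - c_j h_j - \mathbf{v}^{T}\mathbf{W}_{-j}\mathbf{h}_{-j} - \mathbf{v}^{T}\mathbf{W}_{j} h_j
$$

$$
P(\mathbf{h}, \mathbf{v}) = \frac{1}{Z}\exp\bigl(\mathbf{b}^{T}\mathbf{v} + \mathbf{c}_{-j}^{T}\mathbf{h}_{-j} + \mathbf{v}^{T}\mathbf{W}_{-j}\mathbf{h}_{-j}\bigr)\exp\bigl(c_j h_j + \mathbf{v}^{T}\mathbf{W}_{j} h_j\bigr) = f(\mathbf{h}_{-j})\exp\bigl(c_j h_j + \mathbf{v}^{T}\mathbf{W}_{j} h_j\bigr)
$$

Here 𝑓(𝒉_−𝑗) is the factor that is independent of ℎ_𝑗 (for a fixed 𝒗).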

(11)

Restricted Boltzmann Machine

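$$
P(h_j, \mathbf{v}) = \sum_{\mathbf{h}_{-j}} P(\mathbf{h}, \mathbf{v}) = \sum_{\mathbf{h}_{-j}} f(\mathbf{h}_{-j})\exp\bigl(c_j h_j + \mathbf{v}^{T}\mathbf{W}_{j} h_j\bigr) = \tilde{Z}\exp\bigl(c_j h_j + \mathbf{v}^{T}\mathbf{W}_{j} h_j\bigr)
$$

$$
P(h_j = 1 \mid \mathbf{v}) = \frac{P(h_j = 1, \mathbf{v})}{P(\mathbf{v})} = \frac{P(h_j = 1, \mathbf{v})}{P(h_j = 0, \mathbf{v}) + P(h_j = 1, \mathbf{v})} = \frac{\exp\bigl(c_j + \mathbf{v}^{T}\mathbf{W}_{j}\bigr)}{\exp(0) + \exp\bigl(c_j + \mathbf{v}^{T}\mathbf{W}_{j}\bigr)} = \sigma\bigl(c_j + \mathbf{v}^{T}\mathbf{W}_{j}\bigr)
$$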
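The sigmoid result can be checked numerically on a tiny model. The sketch below enumerates every hidden configuration, forms P(h_j = 1 | v) directly from the energy, and compares it with the closed form; the parameter values and the choice j = 1 are hypothetical.

```python
import itertools
import math

def energy(v, h, W, b, c):
    """E(v, h) = -b^T v - c^T h - v^T W h."""
    bv = sum(bi * vi for bi, vi in zip(b, v))
    ch = sum(cj * hj for cj, hj in zip(c, h))
    vWh = sum(v[i] * W[i][j] * h[j]
              for i in range(len(v)) for j in range(len(h)))
    return -bv - ch - vWh

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical toy RBM: 3 visible units, 2 hidden units.
W = [[0.5, -1.0], [1.5, 0.0], [-0.5, 2.0]]
b = [0.0, 0.1, -0.2]
c = [0.3, -0.3]
v = (1, 0, 1)
j = 1

# Sum exp(-E) over all hidden configurations; Z cancels in the ratio.
num = 0.0
den = 0.0
for h in itertools.product([0, 1], repeat=len(c)):
    weight = math.exp(-energy(v, h, W, b, c))
    den += weight
    if h[j] == 1:
        num += weight
brute = num / den

closed = sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
print(brute, closed)
```

The two numbers agree up to floating-point rounding, because every factor that does not involve h_j cancels between numerator and denominator.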

(12)

Restricted Boltzmann Machine

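• The gradient of the log-likelihood with respect to 𝑾

$$
\log p(\mathbf{v}) = \log \sum_{\mathbf{h}} P(\mathbf{v}, \mathbf{h}) = \underbrace{\log \sum_{\mathbf{h}} \exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr)}_{(A)} - \underbrace{\log Z}_{(B)}
$$

$$
\frac{\partial (A)}{\partial \mathbf{W}} = \frac{\sum_{\mathbf{h}} \exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr)\,\mathbf{v}\mathbf{h}^{T}}{\sum_{\mathbf{h}} \exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr)} = \sum_{\mathbf{h}} \frac{P(\mathbf{v}, \mathbf{h})}{P(\mathbf{v})}\,\mathbf{v}\mathbf{h}^{T} = \sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v})\,\mathbf{v}\mathbf{h}^{T} = \mathbf{v}\sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v})\,\mathbf{h}^{T} = \mathbf{v}\,\mathbb{E}[\mathbf{h} \mid \mathbf{v}]^{T}
$$

$$
\frac{\partial (B)}{\partial \mathbf{W}} = \frac{\sum_{\mathbf{v}}\sum_{\mathbf{h}} \exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr)\,\mathbf{v}\mathbf{h}^{T}}{Z} = \sum_{\mathbf{v}}\sum_{\mathbf{h}} P(\mathbf{v}, \mathbf{h})\,\mathbf{v}\mathbf{h}^{T} = \sum_{\mathbf{v}} P(\mathbf{v}) \sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v})\,\mathbf{v}\mathbf{h}^{T} = \mathbb{E}[\mathbf{v}\mathbf{h}^{T}]
$$

Combining the two terms:

$$
\frac{\partial \log p(\mathbf{v})}{\partial \mathbf{W}} = \mathbf{v}\,\mathbb{E}[\mathbf{h} \mid \mathbf{v}]^{T} - \mathbb{E}[\mathbf{v}\mathbf{h}^{T}]
$$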
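One way to sanity-check the derivation is to compare the analytic gradient (positive statistic minus negative statistic) with central finite differences of the exact log-likelihood. The sketch below does this by brute-force enumeration on a hypothetical 3 x 2 RBM; the helper names and the step size eps are assumptions.

```python
import itertools
import math

def neg_energy(v, h, W, b, c):
    """-E(v, h) = b^T v + c^T h + v^T W h."""
    return (sum(bi * vi for bi, vi in zip(b, v))
            + sum(cj * hj for cj, hj in zip(c, h))
            + sum(v[i] * W[i][j] * h[j]
                  for i in range(len(v)) for j in range(len(h))))

def states(n):
    return list(itertools.product([0, 1], repeat=n))

def log_p_v(v, W, b, c):
    """Exact log p(v) = (A) - (B), by enumeration."""
    a = math.log(sum(math.exp(neg_energy(v, h, W, b, c))
                     for h in states(len(c))))
    log_z = math.log(sum(math.exp(neg_energy(u, h, W, b, c))
                         for u in states(len(b)) for h in states(len(c))))
    return a - log_z

def analytic_grad(v, W, b, c):
    """v E[h|v]^T - E[v h^T], with both expectations computed exactly."""
    nv, nh = len(b), len(c)
    hs = states(nh)
    # Positive statistic: posterior over h for the given v.
    w_pos = [math.exp(neg_energy(v, h, W, b, c)) for h in hs]
    s_pos = sum(w_pos)
    e_h = [sum(w * h[j] for w, h in zip(w_pos, hs)) / s_pos for j in range(nh)]
    # Negative statistic: expectation under the joint model distribution.
    joint = [(u, h, math.exp(neg_energy(u, h, W, b, c)))
             for u in states(nv) for h in hs]
    z = sum(w for _, _, w in joint)
    return [[v[i] * e_h[j] - sum(w * u[i] * h[j] for u, h, w in joint) / z
             for j in range(nh)] for i in range(nv)]

def numeric_grad(v, W, b, c, eps=1e-5):
    """Central finite differences of log p(v) with respect to each W[i][j]."""
    grad = [[0.0] * len(c) for _ in range(len(b))]
    for i in range(len(b)):
        for j in range(len(c)):
            old = W[i][j]
            W[i][j] = old + eps
            up = log_p_v(v, W, b, c)
            W[i][j] = old - eps
            down = log_p_v(v, W, b, c)
            W[i][j] = old  # restore the original weight
            grad[i][j] = (up - down) / (2 * eps)
    return grad

W = [[0.5, -1.0], [1.5, 0.0], [-0.5, 2.0]]
b = [0.0, 0.1, -0.2]
c = [0.3, -0.3]
v = (1, 0, 1)
ga = analytic_grad(v, W, b, c)
gn = numeric_grad(v, W, b, c)
max_err = max(abs(ga[i][j] - gn[i][j]) for i in range(3) for j in range(2))
print(max_err)
```

The enumeration is only feasible for very small models, which is exactly why the negative statistic has to be estimated by sampling in practice.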

(13)

Restricted Boltzmann Machine

• Notation

https://www.cs.toronto.edu/~tijmen/csc321/documents/maddison_rbmtutorial.pdf

(14)

Restricted Boltzmann Machine

• The gradient of the log-likelihood wrt 𝑾, 𝒃, 𝒄

• Vectorize everything

https://www.cs.toronto.edu/~tijmen/csc321/documents/maddison_rbmtutorial.pdf
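For reference, the vectorized gradients for a single data point can be written as below. The form for 𝑾 matches the derivation on the earlier slide; the forms for 𝒃 and 𝒄 follow by the same argument, and the expectation notation is an assumption rather than the slide's own.

```latex
\begin{aligned}
\frac{\partial \log p(\mathbf{v})}{\partial \mathbf{W}} &= \mathbf{v}\,\mathbb{E}[\mathbf{h} \mid \mathbf{v}]^{T} - \mathbb{E}_{\mathbf{v},\mathbf{h}}[\mathbf{v}\mathbf{h}^{T}] \\
\frac{\partial \log p(\mathbf{v})}{\partial \mathbf{b}} &= \mathbf{v} - \mathbb{E}_{\mathbf{v},\mathbf{h}}[\mathbf{v}] \\
\frac{\partial \log p(\mathbf{v})}{\partial \mathbf{c}} &= \mathbb{E}[\mathbf{h} \mid \mathbf{v}] - \mathbb{E}_{\mathbf{v},\mathbf{h}}[\mathbf{h}],
\qquad \mathbb{E}[\mathbf{h} \mid \mathbf{v}] = \sigma(\mathbf{c} + \mathbf{W}^{T}\mathbf{v})
\end{aligned}
```

In each line the first term (the positive statistic) is cheap to compute, while the second term is an expectation under the model and requires the partition function or sampling.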

(15)

Restricted Boltzmann Machine

• 𝑍: Partition function
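To see why 𝑍 is the bottleneck, the sketch below computes log 𝑍 exactly for a tiny RBM in two ways: by summing over all joint states, and by summing out the hidden units analytically. Both function names and the toy parameters are hypothetical; the second version is cheaper but still exponential in the number of visible units.

```python
import itertools
import math

def log_z_brute(W, b, c):
    """log Z by summing exp(-E) over all 2^(n_v + n_h) joint states."""
    total = 0.0
    for v in itertools.product([0, 1], repeat=len(b)):
        for h in itertools.product([0, 1], repeat=len(c)):
            total += math.exp(
                sum(bi * vi for bi, vi in zip(b, v))
                + sum(cj * hj for cj, hj in zip(c, h))
                + sum(v[i] * W[i][j] * h[j]
                      for i in range(len(b)) for j in range(len(c))))
    return math.log(total)

def log_z_sum_out_h(W, b, c):
    """Same quantity with the hidden units summed out analytically:
    Z = sum_v exp(b^T v) * prod_j (1 + exp(c_j + v^T W_j))."""
    total = 0.0
    for v in itertools.product([0, 1], repeat=len(b)):
        term = math.exp(sum(bi * vi for bi, vi in zip(b, v)))
        for j in range(len(c)):
            term *= 1.0 + math.exp(
                c[j] + sum(v[i] * W[i][j] for i in range(len(b))))
        total += term
    return math.log(total)

W = [[0.5, -1.0], [1.5, 0.0], [-0.5, 2.0]]
b = [0.0, 0.1, -0.2]
c = [0.3, -0.3]
z1 = log_z_brute(W, b, c)
z2 = log_z_sum_out_h(W, b, c)

# With all parameters zero every state has energy 0, so Z = 2^(3 + 2) = 32.
z0 = log_z_brute([[0.0, 0.0] for _ in range(3)], [0.0] * 3, [0.0] * 2)
```

For an image model with hundreds of visible units neither sum is tractable, which motivates the sampling-based methods that follow.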


(16)

Restricted Boltzmann Machine:

MC sampling

• The negative statistic is the real problem → MC sampling

• With M true samples (𝒗_𝑚, 𝒉_𝑚) from the distribution defined by the RBM, we could approximate the negative statistic 𝐸[𝒗𝒉^𝑇] by the sample average (1/M) Σ_𝑚 𝒗_𝑚𝒉_𝑚^𝑇

• We can get these samples by initializing N independent Markov chains, one at each data point 𝒗_𝑛, and running them until convergence

(17)

Restricted Boltzmann Machine:

MCMC - Gibbs

• Alternating Gibbs
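Alternating Gibbs sampling can be sketched as follows: sample all hidden units given the visible units, then all visible units given the hidden units, and repeat. This is a minimal illustration with hypothetical helper names and toy parameters; how many steps are needed for convergence is not addressed here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_h_given_v(v, W, c, rng):
    """Sample every hidden unit in parallel from P(h_j = 1 | v)."""
    return [1 if rng.random() < sigmoid(
                c[j] + sum(v[i] * W[i][j] for i in range(len(v)))) else 0
            for j in range(len(c))]

def sample_v_given_h(h, W, b, rng):
    """Sample every visible unit in parallel from P(v_i = 1 | h)."""
    return [1 if rng.random() < sigmoid(
                b[i] + sum(W[i][j] * h[j] for j in range(len(h)))) else 0
            for i in range(len(b))]

def gibbs_chain(v0, W, b, c, n_steps, rng):
    """Run n_steps of alternating Gibbs sampling starting from the visible vector v0."""
    v = list(v0)
    h = sample_h_given_v(v, W, c, rng)
    for _ in range(n_steps):
        v = sample_v_given_h(h, W, b, rng)
        h = sample_h_given_v(v, W, c, rng)
    return v, h

# Hypothetical toy RBM: 3 visible units, 2 hidden units.
W = [[0.5, -1.0], [1.5, 0.0], [-0.5, 2.0]]
b = [0.0, 0.1, -0.2]
c = [0.3, -0.3]
v_s, h_s = gibbs_chain([1, 0, 1], W, b, c, n_steps=100, rng=random.Random(0))
```

Averaging the outer products of many such (v, h) pairs gives a Monte Carlo estimate of the negative statistic.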

(18)

MCMC:

A picture of the maximum likelihood learning algorithm for an RBM

0

v_ih_j v_ih_j^

i

j

i

j

i

j

i

j

t = 0 t = 1 t = 2 t = infinity













 



j i j

i ij

h v h

w v

v

p ( )

₀

log

Start with a training vector on the visible units.

Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

a fantasy

http://www.cs.toronto.edu/~hinton/csc2535/lectures.html

(19)

RBM as an infinite logistic belief net with tied weights

The learning rule of RBM is the same as the maximum likelihood learning rule for the infinite logistic belief net with tied weights [Hinton et al. '06]

[Figure: an infinite logistic belief net with tied weights; layers v⁰, h⁰, v¹, h¹, v², h², etc. are stacked with weights alternating between W and W^T, and with unit states s_i⁰, s_j⁰, s_i¹, s_j¹, s_i², s_j²]

(20)

Restricted Boltzmann Machine:

Contrastive divergence

• Run the Markov chain for only one step (1-step CD), get samples, and assume that they can stand in for true samples from the model

• Use the smoothed “reconstructions” in their place in gradient calculations

(21)

Restricted Boltzmann Machine:

Contrastive divergence

• The contrastive divergence gradients on data point 𝒗_𝑛
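A minimal sketch of one CD-1 update on a single data point is given below. It assumes binary units, uses hidden probabilities rather than hidden samples in the statistics (one common way to get "smoothed" values), and samples the reconstruction; the toy data, learning rate, and function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 parameter update on data point v0 (parameters changed in place)."""
    ph0 = hidden_probs(v0, W, c)                   # positive statistic uses the data
    h0 = bernoulli(ph0, rng)
    v1 = bernoulli(visible_probs(h0, W, b), rng)   # one-step reconstruction
    ph1 = hidden_probs(v1, W, c)                   # negative statistic uses the reconstruction
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])

# Hypothetical toy problem: two binary patterns, 4 visible and 2 hidden units.
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
n_v, n_h, lr = 4, 2, 0.1
W = [[0.0] * n_h for _ in range(n_v)]
b = [0.0] * n_v
c = [0.0] * n_h
rng = random.Random(0)
n_updates = 0
for epoch in range(50):
    for v0 in data:
        cd1_update(v0, W, b, c, lr, rng)
        n_updates += 1
```

Starting the chain at the data point avoids any burn-in, at the cost of a biased gradient estimate.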

(22)

Contrastive divergence:

A quick way to learn an RBM

0

v_ih_j v_ih_j¹

i

j

i

j

t = 0 t = 1

) (  

⁰

  

¹



 w

_ij

 v

_i

h

_j

v

_i

h

_j

Start with a training vector on the visible units.

Update all the hidden units in parallel

Update the all the visible units in parallel to get a “reconstruction”.

Update the hidden units again.

This is not following the gradient of the log likelihood. But it

works well. It is approximately following the gradient of another objective function (Carreira-Perpinan & Hinton, 2005).

reconstruction data

Revised from http://www.cs.toronto.edu/~hinton/csc2535/lectures.html

(23)

Partition function

• Normalized probability

• Partition function


(24)

Log-Likelihood Gradient

Positive phase Negative phase

(25)

MCMC Algorithm: Basic

• Burning in a set of Markov chains from a random initialization every time the gradient is needed

(26)

MCMC Algorithm

(27)

Contrastive Divergence [Hinton ‘00]

• Initializes the Markov chain at each step with samples from the data distribution

(28)

Contrastive Divergence

• The negative phase of CD can fail to suppress spurious modes

(29)

Restricted Boltzmann Machine:

Summary

• Conditional probs.

– 𝑃(𝒉|𝒗) = ∏_𝑗 𝜎((2𝒉 − 1) ⊙ (𝒄 + 𝑾^𝑇𝒗))_𝑗

– 𝑃(𝒗|𝒉) = ∏_𝑖 𝜎((2𝒗 − 1) ⊙ (𝒃 + 𝑾𝒉))_𝑖

– Easy to sample from.

• Training RBM

– 𝜕 log 𝑝(𝒗) / 𝜕𝑤_𝑖𝑗 = ⟨𝑣_𝑖ℎ_𝑗⟩_𝑑𝑎𝑡𝑎 − ⟨𝑣_𝑖ℎ_𝑗⟩_𝑚𝑜𝑑𝑒𝑙

– Δ𝑤_𝑖𝑗 = 𝜂(⟨𝑣_𝑖ℎ_𝑗⟩_𝑑𝑎𝑡𝑎 − ⟨𝑣_𝑖ℎ_𝑗⟩_𝑚𝑜𝑑𝑒𝑙)

– Based on sampling: MCMC, CD, PCD

(30)

Deep Belief Networks (DBN)

• Generative models with several layers of latent variables

• No intra-layer connections

• A hybrid graphical model involving both directed and undirected connections

– The connections between the top two layers are undirected

– The connections between all other layers are directed

• A DBN with only one hidden layer = RBM

(31)

Training DBN:

Greedy layer-wise pretraining

• The first layer RBM is trained to approximately maximize

– Contrastive divergence or stochastic maximum likelihood

• The second RBM is trained to approximately maximize
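The two objectives are commonly written as below. The superscripts (1) and (2) for the first and second RBM, and 𝒉^(1) for the first hidden layer, are notation assumed here.

```latex
\begin{aligned}
\text{first RBM:} \quad & \mathbb{E}_{\mathbf{v} \sim p_{\text{data}}} \log p(\mathbf{v}) \\
\text{second RBM:} \quad & \mathbb{E}_{\mathbf{v} \sim p_{\text{data}}}\,
\mathbb{E}_{\mathbf{h}^{(1)} \sim p^{(1)}(\mathbf{h}^{(1)} \mid \mathbf{v})} \log p^{(2)}(\mathbf{h}^{(1)})
\end{aligned}
```

In words, the second RBM models the distribution of hidden samples that the first RBM produces when it is driven by the data.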

(32)

DBN: Representation Learning

• Greedy layer-wise unsupervised pretraining [Hinton ’06]

– Proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed

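– The lower layers are not adapted when the upper layers are introduced

First layer: maximize 𝐸_{𝒗∼𝑝_𝑑𝑎𝑡𝑎} log 𝑝(𝒗)

Upper layers: approximately maximize the corresponding objective on the hidden activities of the layer below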
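Greedy layer-wise pretraining can be sketched as a loop that trains one RBM, freezes it, and feeds its sampled hidden states to the next RBM as data. The code below is an illustration only: it uses CD-1 as the per-layer trainer, and the layer sizes, learning rate, epoch count, and function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def train_rbm(data, n_h, epochs, lr, rng):
    """Train a single RBM with CD-1 and return its parameters (W, b, c)."""
    n_v = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_h)] for _ in range(n_v)]
    b = [0.0] * n_v
    c = [0.0] * n_h
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, c)
            v1 = bernoulli(visible_probs(bernoulli(ph0, rng), W, b), rng)
            ph1 = hidden_probs(v1, W, c)
            for i in range(n_v):
                for j in range(n_h):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
                b[i] += lr * (v0[i] - v1[i])
            for j in range(n_h):
                c[j] += lr * (ph0[j] - ph1[j])
    return W, b, c

def pretrain_dbn(data, layer_sizes, epochs, lr, rng):
    """Train one RBM per layer, bottom up; lower layers stay fixed once trained."""
    rbms = []
    layer_input = data
    for n_h in layer_sizes:
        W, b, c = train_rbm(layer_input, n_h, epochs, lr, rng)
        rbms.append((W, b, c))
        # Sampled hidden states of the frozen RBM become the next layer's data.
        layer_input = [bernoulli(hidden_probs(v, W, c), rng) for v in layer_input]
    return rbms

data = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
rbms = pretrain_dbn(data, layer_sizes=[3, 2], epochs=20, lr=0.1, rng=random.Random(0))
```

Each call to `train_rbm` sees only the output of the layers below it, so earlier parameters are never revisited during this stage.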

(33)

Training DBN:

Wake-sleep fine tuning

• Untie the recognition weights 𝑊_𝑖^𝑇 from the generative weights 𝑊_𝑖

• Perform wake-sleep algorithm

• 1) Up-pass (wake stage)

– Pick a state for every hidden variable using the recognition weights 𝑊_𝑖^𝑇

– Adjust the generative weights 𝑊_𝑖

• 2) Down-pass (sleep stage)

– Stochastically activate each lower layer using the generative weights 𝑊_𝑖

– Update only the recognition weights 𝑊_𝑖^𝑇

[Figure: DBN (layer H⁰ shown) with generative weights 𝑊₀, 𝑊₁, 𝑊₂, 𝑊₃, recognition weights 𝑊₀^𝑇, 𝑊₁^𝑇, 𝑊₂^𝑇, and top-level undirected connections]
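The two stages can be sketched for a single directed layer with untied weights. This is a simplified illustration under stated assumptions: `R` (recognition) and `G` (generative) are hypothetical names for the untied matrices, the updates are the simple delta rules usually associated with wake-sleep, and a uniform Bernoulli draw stands in for a sample from the top-level undirected model, which is not implemented here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def recognize(v, R, c):
    """q(h_j = 1 | v) with recognition weights R[j][i]."""
    return [sigmoid(c[j] + sum(R[j][i] * v[i] for i in range(len(v))))
            for j in range(len(c))]

def generate(h, G, b):
    """p(v_i = 1 | h) with generative weights G[i][j]."""
    return [sigmoid(b[i] + sum(G[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def wake_step(v, R, c, G, b, lr, rng):
    """Up-pass: pick hidden states with R, then adjust only the generative weights."""
    h = bernoulli(recognize(v, R, c), rng)
    pv = generate(h, G, b)
    for i in range(len(b)):
        for j in range(len(h)):
            G[i][j] += lr * (v[i] - pv[i]) * h[j]
        b[i] += lr * (v[i] - pv[i])

def sleep_step(h, R, c, G, b, lr, rng):
    """Down-pass: activate the lower layer with G, then adjust only the recognition weights."""
    v = bernoulli(generate(h, G, b), rng)
    qh = recognize(v, R, c)
    for j in range(len(c)):
        for i in range(len(v)):
            R[j][i] += lr * (h[j] - qh[j]) * v[i]
        c[j] += lr * (h[j] - qh[j])

data = [[1, 1, 0, 0], [0, 0, 1, 1]]
n_v, n_h, lr = 4, 2, 0.05
rng = random.Random(0)
G = [[0.0] * n_h for _ in range(n_v)]
b = [0.0] * n_v
R = [[0.0] * n_v for _ in range(n_h)]
c = [0.0] * n_h
n_steps = 0
for _ in range(100):
    for v in data:
        wake_step(v, R, c, G, b, lr, rng)
        # Hypothetical stand-in for a sample from the top-level model.
        h_top = bernoulli([0.5] * n_h, rng)
        sleep_step(h_top, R, c, G, b, lr, rng)
        n_steps += 1
```

In a full DBN the same pattern would be applied at every directed layer, with the top two layers sampled as an RBM between the up-pass and the down-pass.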

(34)

DBN for classification:

A model of digit recognition

[Figure: 28 x 28 pixel image, 500 neurons, 2000 top-level neurons, 10 label neurons]

The model learns to generate combinations of labels and images.

To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by a few iterations of the top-level associative memory.

The top two layers form an associative memory whose energy landscape models the low dimensional manifolds of the digits.

The energy valleys have names.

https://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf

(35)

Deep Boltzmann Machine

• Unlike DBN, DBM is an entirely undirected model

Connections are only between units in neighboring layers. There are no intralayer connections.

(36)

Deep Boltzmann Machine

• The bipartite structure of the DBM

• We can apply the same equations we have previously used for the conditional distributions of an RBM to determine the conditional distributions in a DBM

(37)

Deep Boltzmann Machine

• DBM with two hidden layers

• The bipartite structure

– Makes Gibbs sampling in a DBM efficient

– Gibbs sampling can be divided into two blocks of updates

• Block 1: all even layers (including the visible layer)

• Block 2: all odd layers
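For a DBM with two hidden layers, the two blocks can be sketched as follows: the even layers (𝒗 and the second hidden layer) are sampled together given the first hidden layer, then the odd layer (the first hidden layer) is sampled given both of its neighbours. Biases are omitted for brevity, and the weight names `W1`, `W2` and the toy values are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def dbm_gibbs_sweep(v, h1, h2, W1, W2, rng):
    """One block Gibbs sweep in a DBM with layers v - h1 - h2 (no biases).
    W1[i][j] links v_i and h1_j; W2[j][k] links h1_j and h2_k."""
    n_v, n_1, n_2 = len(v), len(h1), len(h2)
    # Block 1 (even layers v and h2): conditionally independent given h1.
    v = bernoulli([sigmoid(sum(W1[i][j] * h1[j] for j in range(n_1)))
                   for i in range(n_v)], rng)
    h2 = bernoulli([sigmoid(sum(h1[j] * W2[j][k] for j in range(n_1)))
                    for k in range(n_2)], rng)
    # Block 2 (odd layer h1): receives input from both neighbouring layers.
    h1 = bernoulli([sigmoid(sum(v[i] * W1[i][j] for i in range(n_v))
                            + sum(W2[j][k] * h2[k] for k in range(n_2)))
                    for j in range(n_1)], rng)
    return v, h1, h2

# Hypothetical toy DBM: 3 visible units, 2 units in h1, 2 units in h2.
W1 = [[0.5, -1.0], [1.5, 0.0], [-0.5, 2.0]]
W2 = [[1.0, -0.5], [0.0, 0.5]]
rng = random.Random(0)
v, h1, h2 = [1, 0, 1], [0, 1], [1, 0]
for _ in range(20):
    v, h1, h2 = dbm_gibbs_sweep(v, h1, h2, W1, W2, rng)
```

The whole network is refreshed in two parallel updates per sweep, regardless of how many layers each block contains.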

(38)

Deep Boltzmann Machine

• Interesting Properties

– The posterior distribution 𝑃(𝒉|𝒗) is simpler than in a DBN

• Allows richer approximations of the posterior

– Interesting from the point of view of neuroscience

• The use of proper mean field allows the approximate inference procedure for DBMs to capture the influence of top-down feedback interactions

– Sampling is relatively difficult

• Need to use MCMC across all layers, with every layer of the model participating in every Markov chain transition