Deep Belief Networks
Seung-Hoon Na
Chonbuk National University
Boltzmann Machines
• Energy-based model define dover a d-
dimensional binary random vector 𝒙 ∈ 0,1
𝑑Boltzmann Machines with latent variables
• Latent variables can model higher-order interactions among the visible units.
• Boltzmann machine becomes a universal
approximator of probability mass functions over discrete variables (Le Roux and Bengio, 2008)
• Decompose 𝒙 into the visible units 𝒗 and the latent
(or hidden) units 𝒉.
Boltzmann Machines
Restricted Boltzmann Machine
• No connections are permitted between visible units and hidden units Bipartite graph
𝐸 𝒗, 𝒉 = −𝒗𝑇𝑾𝒉 − 𝒃𝑇𝒗 − 𝒄𝑇𝒉
𝑾
Restricted Boltzmann Machine
• Energy-based model
Restricted Boltzmann Machine:
Conditional Distributions
• The conditional distributions 𝑃(𝒉 | 𝒗) and 𝑃(𝒗 | 𝒉) are factorial and relatively simple to compute and to sample from
Restricted Boltzmann Machine:
Conditional Distributions
Restricted Boltzmann Machine
Restricted Boltzmann Machine
𝑃(𝒉, 𝒗) = 1
𝑍 𝑒𝑥𝑝 −𝐸(𝒉, 𝒗) 𝑍 =
𝒉,𝒗
𝑒𝑥𝑝 −𝐸(𝒉, 𝒗)
𝐸 𝒉, 𝒗 = −𝒃
𝑇𝒗 −𝒄
𝑇𝒉 −𝒗
𝑇𝑾𝒉 𝐸 𝒉, 𝒗 = −𝒃
𝑇𝒗 − 𝒄
−𝑗𝑇𝒉
−𝑗− 𝑐
𝑗ℎ
𝑗−𝒗
𝑇𝑾
−𝑗𝒉
−𝑗− 𝒗
𝑇𝑾
𝑗ℎ
𝑗𝑃 𝒉, 𝒗
= 1
𝑍 𝑒𝑥𝑝 𝒃
𝑇𝒗 + 𝒄
−𝑗𝑇𝒉
−𝑗+ 𝒗
𝑇𝑾
−𝑗𝒉
−𝑗𝑒𝑥𝑝
𝑐𝑗ℎ𝑗 + 𝒗𝑇𝑾𝑗ℎ𝑗=
𝑓 𝒉−𝑗𝑒𝑥𝑝
𝑐𝑗ℎ𝑗 + 𝒗𝑇𝑾𝑗ℎ𝑗𝒉−𝑗와 ℎ𝑗의 항으로 decomposition
ℎ𝑗와 독립인 항
Restricted Boltzmann Machine
𝑃 ℎ𝑗, 𝒗 =
𝒉−𝑗
𝑃 𝒉, 𝒗 =
𝒉−𝑗
𝑓 𝒉−𝑗 𝑒𝑥𝑝 𝑐𝑗ℎ𝑗 + 𝒗𝑇𝑾𝑗ℎ𝑗
= ෨𝑍 𝑒𝑥𝑝 𝑐𝑗ℎ𝑗 + 𝒗𝑇𝑾𝑗ℎ𝑗
𝑃 ℎ𝑗 = 1|𝒗 = 𝑃(ℎ𝑗 = 1, 𝒗)
𝑃 𝒗 = 𝑃(ℎ𝑗 = 1, 𝒗)
𝑃 ℎ𝑗 = 0, 𝒗 + 𝑃(ℎ𝑗 = 1, 𝒗)
= 𝑒𝑥𝑝 𝑐𝑗 + 𝒗𝑇𝑾𝑗
exp(0) + 𝑒𝑥𝑝 𝑐𝑗 + 𝒗𝑇𝑾𝑗
= 𝜎 𝑐𝑗 + 𝒗𝑇𝑾𝑗
Restricted Boltzmann Machine
• The gradient of the Log-likelihood wrt 𝑾
log 𝑝(𝒗) = log
ℎ
𝑃 𝒗, 𝒉 = log
𝒉
exp −𝐸 𝒗, 𝒉 − log 𝑍
(A) (B)
𝜕(𝐴)
𝜕𝑾 =
𝒉
exp −𝐸 𝒗, 𝒉 ⋅ 𝒗𝒉𝑇
σ𝒉 exp −𝐸 𝒗, 𝒉 =
𝒉
𝑃 𝒗, 𝒉
𝑃 𝒗 𝒗𝒉𝑇
=
𝒉
𝑃 𝒉 𝒗 𝒗𝒉𝑇 = 𝒗
𝒉
𝑃 𝒉 𝒗 𝒉𝑇 = 𝒗𝑬 𝒉 𝒗 𝑇
𝜕(𝐵)
𝜕𝑾 = σ𝒗 σ𝒉 exp −𝐸(𝒗, 𝒉) 𝒗𝒉𝑇
𝑍 =
𝒗
𝒉
𝑃 𝒗, 𝒉 𝒗𝒉𝑇
=
𝒗
𝒉
𝑃 𝒗, 𝒉 𝒗𝒉𝑇 =
𝒗
𝑃(𝒗)
𝒉
𝑃 𝒗|𝒉 𝒗𝒉𝑇 = 𝐸[𝒗𝒉𝑇]
Restricted Boltzmann Machine
• Notation
https://www.cs.toronto.edu/~tijmen/csc321/documents/maddison_rbmtutorial.pdf
Restricted Boltzmann Machine
• The gradient of the log-likelihood wrt 𝑾, 𝒃, 𝒄
• Vectorize everything
https://www.cs.toronto.edu/~tijmen/csc321/documents/maddison_rbmtutorial.pdf
Restricted Boltzmann Machine
• 𝑍: Partition function
Partition function
Restricted Boltzmann Machine:
MC sampling
• The negative statistic is the real problem MC sampling
• With M true samples (𝒗𝑚, 𝒉𝑚) from the distribution defined by the RBM, we could approximate
• We can get these samples by initializing N independent Markov chain at each data point 𝒗𝑛 and running until convergence
Restricted Boltzmann Machine:
MCMC - Gibbs
• Alternating Gibbs
MCMC:
A picture of the maximum likelihood learning algorithm for an RBM
0
vihj vihj
i
j
i
j
i
j
i
j
t = 0 t = 1 t = 2 t = infinity
j i j
i ij
h v h
w v
v
p ( )
0log
Start with a training vector on the visible units.
Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
a fantasy
http://www.cs.toronto.edu/~hinton/csc2535/lectures.html
RBM as an infinite logistic belief net with tied weights
an infinite logistic belief net with tied weights
The learning rule of RBM is the same as the maximum likelihood learning rule for the infinite logistic belief net with tied weights [Hinton et al ‘06]
W v
1h
1v
0h
0v
2h
2WT
WT
WT
W
W etc.
0
s
i0
s
j 1s
j 2s
j1
s
i 2s
iWT
WT
WT
W
W
Restricted Boltzmann Machine:
Contrastive divergence
• Run the Markov chain for only one step, get samples , assume that
1-step CD
• Use the smoothed “reconstructions” in their place in gradient calculations
Restricted Boltzmann Machine:
Contrastive divergence
• The contrastive divergence gradients on data
point 𝒗
𝑛Contrastive divergence:
A quick way to learn an RBM
0
vihj vihj1
i
j
i
j
t = 0 t = 1
) (
0
1
w
ij v
ih
jv
ih
jStart with a training vector on the visible units.
Update all the hidden units in parallel
Update the all the visible units in parallel to get a “reconstruction”.
Update the hidden units again.
This is not following the gradient of the log likelihood. But it
works well. It is approximately following the gradient of another objective function (Carreira-Perpinan & Hinton, 2005).
reconstruction data
Revised from http://www.cs.toronto.edu/~hinton/csc2535/lectures.html
Partition function
• Normalized probability
• Partition function
Partition function
Log-Likelihood Gradient
Positive phrase Negative phrase
MCMC Algorithm: Basic
• Burning in a set of Markov chains from a random initialization every time the gradient is needed
MCMC Algorithm
Contrastive Divergence [Hinton ‘00]
• Initializes the Markov chain at each step with samples from the data distribution
Contrastive Divergence
• The negative phase of CD can fail to suppress
spurious modes
Restricted Boltzmann Machine:
Summary
• Conditional probs.
– 𝑃 𝒉|𝒗 = ς
𝑗=1𝜎 2𝒉 − 1 ⊙ 𝒄 + 𝑾
𝑇𝒗
𝑗
– 𝑃 𝒗|𝒉 = ς
𝑗=1𝜎 2𝒗 − 1 ⊙ 𝒃 + 𝑾𝒉
𝑗
– Easy to sample from.
• Training RBM
–
𝜕 log 𝑝(𝒗)𝜕 𝑤𝑖𝑗
= 𝑣
𝑖ℎ
𝑗 𝑑𝑎𝑡𝑎− 𝑣
𝑖ℎ
𝑗 𝑚𝑜𝑑𝑒𝑙– Δ𝑤
𝑖𝑗= 𝜂 𝑣
𝑖ℎ
𝑗 𝑑𝑎𝑡𝑎− 𝑣
𝑖ℎ
𝑗 𝑚𝑜𝑑𝑒𝑙– Based on sampling: MCMC, CD, PSD
Deep Belief Networks (DBN)
• Generative models with several layers of latent variables
• No intra-layer connections
– The connections between the top two layers are undirected – The connections between all other layers are directed
a hybrid graphical model involving both directed and undirected connections
A DBN with only one hidden layer
= RBM
Training DBN:
Greedy layer-wise pretraining
• The first layer RBM is trained to approximately maximize
– Contrastive divergence or stochastic maximum likelihood
• The second RBM is trained to approximately
maximize
DBN: Representation Learning
• Greedy layer-wise unsupervised pretraining [Hinton ’06]
– Proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed
– The lower layers are not adapted when the upper
Maximize 𝐸𝒗~𝑝𝑑𝑎𝑡𝑎 log 𝑝 𝒗
Approximately maximize
• Untie the recognition weights 𝑊𝑖𝑇 from the generative weights 𝑊𝑖
• Perform wake-sleep algorithm
• 1) Up-pass (wake stage)
– Picks a state for every hidden variable using 𝑊𝑖𝑇
– Adjust generative weights 𝑊𝑖
• 2) Down-pass (sleep stage)
– Stochastically activate each lower layer using 𝑊𝑖
– Update only recognition weights 𝑊𝑖𝑇
Training DBN:
Wake-sleep fine tuning
H
0𝑊0 𝑊1 𝑊2 𝑊3
𝑊0𝑇 𝑊1𝑇 𝑊2𝑇
Top-level undirected connections
Generative weights Recognition
weights
DBN for classification:
A model of digit recognition
2000 top-level neurons
500 neurons
500 neurons
28 x 28 pixel image
10 label neurons
The model learns to generate
combinations of labels and images.
To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by a few iterations of the top-level associative memory.
The top two layers form an associative memory whose
energy landscape models the low dimensional manifolds of the
digits.
The energy valleys have names
https://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf
Deep Boltzmann Machine
• Unlike DBN, DBM is an entirely undirected model
Connections are only between units in neighboring layers. There are no
intralayer connections
Deep Boltzmann Machine
• The bipartite structure of the DBM
• We can apply the same equations we have previously used for the conditional distributions of an RBM to determine the conditional distributions in a DBM
Deep Boltzmann Machine
• DBM with two hidden layers
• The bipartite structure
– Makes Gibbs sampling in a DBM efficient
– Gibbs sampling can be divided into two blocks of updates
• Block1: including all even layers (including the visible layer)
• Block2: including all odd layers.
Deep Boltzmann Machine
• Interesting Properties
– The posterior distrib. 𝑃(𝒉|𝒗) is simple
• Allows richer approximations of the posterior
– Interesting from the point of view of neuroscience
• The use of proper mean field allows the approximate
inference procedure for DBMs to capture the influence of top-down feedback interactions
– Sampling is relatively difficult
• Need to use MCMC across all layers, with every layer of the model participating in every Markov chain transition
Deep Boltzmann Machine
• Mean field approximation for two hidden layers
• : approx of
• The mean field approach is to minimize
• Parametrize 𝑄 as a product of Bernoulli distributions
Deep Boltzmann Machine
• We have the following approximation to the posterior
• Applying the mean field equations, we obtain