• 검색 결과가 없습니다.

Deep Belief Networks

N/A
N/A
Protected

Academic year: 2021

Share "Deep Belief Networks"

Copied!
43
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

Deep Belief Networks

Seung-Hoon Na

Chonbuk National University

(2)

Boltzmann Machines

• Energy-based model define dover a d-

dimensional binary random vector 𝒙 ∈ 0,1

𝑑

(3)

Boltzmann Machines with latent variables

• Latent variables can model higher-order interactions among the visible units.

• Boltzmann machine becomes a universal

approximator of probability mass functions over discrete variables (Le Roux and Bengio, 2008)

• Decompose 𝒙 into the visible units 𝒗 and the latent

(or hidden) units 𝒉.

(4)

Boltzmann Machines

(5)

Restricted Boltzmann Machine

• No connections are permitted between visible units and hidden units  Bipartite graph

𝐸 𝒗, 𝒉 = −𝒗𝑇𝑾𝒉 − 𝒃𝑇𝒗 − 𝒄𝑇𝒉

𝑾

(6)

Restricted Boltzmann Machine

• Energy-based model

(7)

Restricted Boltzmann Machine:

Conditional Distributions

• The conditional distributions 𝑃(𝒉 | 𝒗) and 𝑃(𝒗 | 𝒉) are factorial and relatively simple to compute and to sample from

(8)

Restricted Boltzmann Machine:

Conditional Distributions

(9)

Restricted Boltzmann Machine

(10)

Restricted Boltzmann Machine

𝑃(𝒉, 𝒗) = 1

𝑍 𝑒𝑥𝑝 −𝐸(𝒉, 𝒗) 𝑍 = ෍

𝒉,𝒗

𝑒𝑥𝑝 −𝐸(𝒉, 𝒗)

𝐸 𝒉, 𝒗 = −𝒃

𝑇

𝒗 −𝒄

𝑇

𝒉 −𝒗

𝑇

𝑾𝒉 𝐸 𝒉, 𝒗 = −𝒃

𝑇

𝒗 − 𝒄

−𝑗𝑇

𝒉

−𝑗

− 𝑐

𝑗

𝑗

−𝒗

𝑇

𝑾

−𝑗

𝒉

−𝑗

− 𝒗

𝑇

𝑾

𝑗

𝑗

𝑃 𝒉, 𝒗

= 1

𝑍 𝑒𝑥𝑝 𝒃

𝑇

𝒗 + 𝒄

−𝑗𝑇

𝒉

−𝑗

+ 𝒗

𝑇

𝑾

−𝑗

𝒉

−𝑗

𝑒𝑥𝑝

𝑐𝑗𝑗 + 𝒗𝑇𝑾𝑗𝑗

=

𝑓 𝒉−𝑗

𝑒𝑥𝑝

𝑐𝑗𝑗 + 𝒗𝑇𝑾𝑗𝑗

𝒉−𝑗와 ℎ𝑗의 항으로 decomposition

𝑗와 독립인 항

(11)

Restricted Boltzmann Machine

𝑃 ℎ𝑗, 𝒗 = ෍

𝒉−𝑗

𝑃 𝒉, 𝒗 = ෍

𝒉−𝑗

𝑓 𝒉−𝑗 𝑒𝑥𝑝 𝑐𝑗𝑗 + 𝒗𝑇𝑾𝑗𝑗

= ෨𝑍 𝑒𝑥𝑝 𝑐𝑗𝑗 + 𝒗𝑇𝑾𝑗𝑗

𝑃 ℎ𝑗 = 1|𝒗 = 𝑃(ℎ𝑗 = 1, 𝒗)

𝑃 𝒗 = 𝑃(ℎ𝑗 = 1, 𝒗)

𝑃 ℎ𝑗 = 0, 𝒗 + 𝑃(ℎ𝑗 = 1, 𝒗)

= 𝑒𝑥𝑝 𝑐𝑗 + 𝒗𝑇𝑾𝑗

exp(0) + 𝑒𝑥𝑝 𝑐𝑗 + 𝒗𝑇𝑾𝑗

= 𝜎 𝑐𝑗 + 𝒗𝑇𝑾𝑗

(12)

Restricted Boltzmann Machine

• The gradient of the Log-likelihood wrt 𝑾

log 𝑝(𝒗) = log ෍

𝑃 𝒗, 𝒉 = log ෍

𝒉

exp −𝐸 𝒗, 𝒉 − log 𝑍

(A) (B)

𝜕(𝐴)

𝜕𝑾 = ෍

𝒉

exp −𝐸 𝒗, 𝒉 ⋅ 𝒗𝒉𝑇

σ𝒉 exp −𝐸 𝒗, 𝒉 = ෍

𝒉

𝑃 𝒗, 𝒉

𝑃 𝒗 𝒗𝒉𝑇

= ෍

𝒉

𝑃 𝒉 𝒗 𝒗𝒉𝑇 = 𝒗 ෍

𝒉

𝑃 𝒉 𝒗 𝒉𝑇 = 𝒗𝑬 𝒉 𝒗 𝑇

𝜕(𝐵)

𝜕𝑾 = σ𝒗 σ𝒉 exp −𝐸(𝒗, 𝒉) 𝒗𝒉𝑇

𝑍 = ෍

𝒗

𝒉

𝑃 𝒗, 𝒉 𝒗𝒉𝑇

= ෍

𝒗

𝒉

𝑃 𝒗, 𝒉 𝒗𝒉𝑇 = ෍

𝒗

𝑃(𝒗) ෍

𝒉

𝑃 𝒗|𝒉 𝒗𝒉𝑇 = 𝐸[𝒗𝒉𝑇]

(13)

Restricted Boltzmann Machine

• Notation

https://www.cs.toronto.edu/~tijmen/csc321/documents/maddison_rbmtutorial.pdf

(14)

Restricted Boltzmann Machine

• The gradient of the log-likelihood wrt 𝑾, 𝒃, 𝒄

• Vectorize everything

https://www.cs.toronto.edu/~tijmen/csc321/documents/maddison_rbmtutorial.pdf

(15)

Restricted Boltzmann Machine

• 𝑍: Partition function

Partition function

(16)

Restricted Boltzmann Machine:

MC sampling

• The negative statistic is the real problem  MC sampling

• With M true samples (𝒗𝑚, 𝒉𝑚) from the distribution defined by the RBM, we could approximate

• We can get these samples by initializing N independent Markov chain at each data point 𝒗𝑛 and running until convergence

(17)

Restricted Boltzmann Machine:

MCMC - Gibbs

• Alternating Gibbs

(18)

MCMC:

A picture of the maximum likelihood learning algorithm for an RBM

0

vihjvihj

i

j

i

j

i

j

i

j

t = 0 t = 1 t = 2 t = infinity

 

j i j

i ij

h v h

w v

v

p ( )

0

log

Start with a training vector on the visible units.

Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

a fantasy

http://www.cs.toronto.edu/~hinton/csc2535/lectures.html

(19)

RBM as an infinite logistic belief net with tied weights

an infinite logistic belief net with tied weights

The learning rule of RBM is the same as the maximum likelihood learning rule for the infinite logistic belief net with tied weights [Hinton et al ‘06]

W v

1

h

1

v

0

h

0

v

2

h

2

WT

WT

WT

W

W etc.

0

s

i

0

s

j 1

s

j 2

s

j

1

s

i 2

s

i

WT

WT

WT

W

W

(20)

Restricted Boltzmann Machine:

Contrastive divergence

• Run the Markov chain for only one step, get samples , assume that

1-step CD

• Use the smoothed “reconstructions” in their place in gradient calculations

(21)

Restricted Boltzmann Machine:

Contrastive divergence

• The contrastive divergence gradients on data

point 𝒗

𝑛

(22)

Contrastive divergence:

A quick way to learn an RBM

0

vihj vihj1

i

j

i

j

t = 0 t = 1

) (  

0

  

1

w

ij

v

i

h

j

v

i

h

j

Start with a training vector on the visible units.

Update all the hidden units in parallel

Update the all the visible units in parallel to get a “reconstruction”.

Update the hidden units again.

This is not following the gradient of the log likelihood. But it

works well. It is approximately following the gradient of another objective function (Carreira-Perpinan & Hinton, 2005).

reconstruction data

Revised from http://www.cs.toronto.edu/~hinton/csc2535/lectures.html

(23)

Partition function

• Normalized probability

• Partition function

Partition function

(24)

Log-Likelihood Gradient

Positive phrase Negative phrase

(25)

MCMC Algorithm: Basic

• Burning in a set of Markov chains from a random initialization every time the gradient is needed

(26)

MCMC Algorithm

(27)

Contrastive Divergence [Hinton ‘00]

• Initializes the Markov chain at each step with samples from the data distribution

(28)

Contrastive Divergence

• The negative phase of CD can fail to suppress

spurious modes

(29)

Restricted Boltzmann Machine:

Summary

• Conditional probs.

– 𝑃 𝒉|𝒗 = ς

𝑗=1

𝜎 2𝒉 − 1 ⊙ 𝒄 + 𝑾

𝑇

𝒗

𝑗

– 𝑃 𝒗|𝒉 = ς

𝑗=1

𝜎 2𝒗 − 1 ⊙ 𝒃 + 𝑾𝒉

𝑗

– Easy to sample from.

• Training RBM

𝜕 log 𝑝(𝒗)

𝜕 𝑤𝑖𝑗

= 𝑣

𝑖

𝑗 𝑑𝑎𝑡𝑎

− 𝑣

𝑖

𝑗 𝑚𝑜𝑑𝑒𝑙

– Δ𝑤

𝑖𝑗

= 𝜂 𝑣

𝑖

𝑗 𝑑𝑎𝑡𝑎

− 𝑣

𝑖

𝑗 𝑚𝑜𝑑𝑒𝑙

– Based on sampling: MCMC, CD, PSD

(30)

Deep Belief Networks (DBN)

• Generative models with several layers of latent variables

• No intra-layer connections

– The connections between the top two layers are undirected – The connections between all other layers are directed

a hybrid graphical model involving both directed and undirected connections

A DBN with only one hidden layer

= RBM

(31)

Training DBN:

Greedy layer-wise pretraining

• The first layer RBM is trained to approximately maximize

– Contrastive divergence or stochastic maximum likelihood

• The second RBM is trained to approximately

maximize

(32)

DBN: Representation Learning

• Greedy layer-wise unsupervised pretraining [Hinton ’06]

– Proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed

– The lower layers are not adapted when the upper

Maximize 𝐸𝒗~𝑝𝑑𝑎𝑡𝑎 log 𝑝 𝒗

Approximately maximize

(33)

• Untie the recognition weights 𝑊𝑖𝑇 from the generative weights 𝑊𝑖

• Perform wake-sleep algorithm

• 1) Up-pass (wake stage)

– Picks a state for every hidden variable using 𝑊𝑖𝑇

– Adjust generative weights 𝑊𝑖

• 2) Down-pass (sleep stage)

– Stochastically activate each lower layer using 𝑊𝑖

– Update only recognition weights 𝑊𝑖𝑇

Training DBN:

Wake-sleep fine tuning

H

0

𝑊0 𝑊1 𝑊2 𝑊3

𝑊0𝑇 𝑊1𝑇 𝑊2𝑇

Top-level undirected connections

Generative weights Recognition

weights

(34)

DBN for classification:

A model of digit recognition

2000 top-level neurons

500 neurons

500 neurons

28 x 28 pixel image

10 label neurons

The model learns to generate

combinations of labels and images.

To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by a few iterations of the top-level associative memory.

The top two layers form an associative memory whose

energy landscape models the low dimensional manifolds of the

digits.

The energy valleys have names

https://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf

(35)

Deep Boltzmann Machine

• Unlike DBN, DBM is an entirely undirected model

Connections are only between units in neighboring layers. There are no

intralayer connections

(36)

Deep Boltzmann Machine

• The bipartite structure of the DBM

• We can apply the same equations we have previously used for the conditional distributions of an RBM to determine the conditional distributions in a DBM

(37)

Deep Boltzmann Machine

• DBM with two hidden layers

• The bipartite structure

– Makes Gibbs sampling in a DBM efficient

– Gibbs sampling can be divided into two blocks of updates

• Block1: including all even layers (including the visible layer)

• Block2: including all odd layers.

(38)

Deep Boltzmann Machine

• Interesting Properties

– The posterior distrib. 𝑃(𝒉|𝒗) is simple

• Allows richer approximations of the posterior

– Interesting from the point of view of neuroscience

• The use of proper mean field allows the approximate

inference procedure for DBMs to capture the influence of top-down feedback interactions

– Sampling is relatively difficult

• Need to use MCMC across all layers, with every layer of the model participating in every Markov chain transition

(39)

Deep Boltzmann Machine

• Mean field approximation for two hidden layers

• : approx of

• The mean field approach is to minimize

• Parametrize 𝑄 as a product of Bernoulli distributions

(40)

Deep Boltzmann Machine

• We have the following approximation to the posterior

• Applying the mean field equations, we obtain

(41)

Deep Boltzmann Machine

• DBM Parameter Learning

– ELBO: The variational lower bound on log 𝑃 𝒗; 𝜃

(42)

Deep Boltzmann Machine

(43)

Deep Boltzmann Machine

참조

관련 문서

We implemented a gated QRNN model with medium hidden size: 2 layers with 640 units in each layer. Both QRNN layers use a convolutional filter width k of

ABSTRACT : In this study, the analysis of load transition of PHC pile was performed with the static load test, which was driven in deep soft clay layer of MyungJi site in