(1)

Advanced Deep Learning

Approximate Inference-2

U Kang

(2)

Outline

Inference as Optimization

Expectation Maximization

MAP Inference

Variational Inference and Learning

(3)

MAP Inference

If we wish to develop a learning process based on maximizing 𝐿(𝑣, 𝜃, 𝑞), then it is helpful to think of MAP inference as a procedure that provides a value of 𝑞.

What is MAP inference?

It finds the most probable value of a variable (e.g., ℎ* = argmaxℎ 𝑝(ℎ|𝑣)), rather than the full distribution.

(4)

MAP Inference

Exact inference consists of maximizing 𝐿(𝑣, 𝜃, 𝑞) with respect to 𝑞 over an unrestricted family of probability distributions, using an exact optimization algorithm.

We instead restrict the family of distributions 𝑞 may be drawn from by requiring 𝑞 to take on a Dirac distribution:

𝑞(ℎ|𝑣) = 𝛿(ℎ − 𝜇)

(5)

MAP Inference

We can now control 𝑞 entirely via 𝜇. Dropping terms of 𝐿 that do not vary with 𝜇, we are left with the optimization problem

max𝜇 log 𝑝(ℎ = 𝜇, 𝑣)

which is equivalent to the MAP inference problem

ℎ* = argmaxℎ log 𝑝(ℎ, 𝑣)

(6)

MAP Inference

Thus, we can think of the following algorithm for maximizing ELBO, which is similar to EM

Alternate the following two steps

Perform MAP inference to infer h* while fixing 𝜃

Update 𝜃 to increase log p(h*, v)
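Below is a minimal sketch (not from the slides) of this alternation on a toy Gaussian mixture with uniform mixing weights and identity covariance, so that 𝜃 is the set of component means and h is each point's component index; the data and all names are assumed purely for illustration. With these choices the alternation reduces to a k-means-style procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian clusters in 2-D (assumed for illustration).
X = np.concatenate([rng.normal(-3.0, 1.0, size=(100, 2)),
                    rng.normal(+3.0, 1.0, size=(100, 2))])
mu = rng.normal(size=(2, 2))   # theta: the two component means

for step in range(20):
    # Step 1 (MAP inference, theta fixed): h* = argmax_h log p(h, v).
    # With uniform mixing and identity covariance, this picks the nearest mean.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    h_star = d2.argmin(axis=1)

    # Step 2 (learning): update theta to increase log p(h*, v).
    # Here the maximizer is available in closed form: each mean becomes
    # the average of the points currently assigned to it.
    for k in range(2):
        if np.any(h_star == k):
            mu[k] = X[h_star == k].mean(axis=0)

print("learned component means:\n", np.round(mu, 2))
```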

(7)

Outline

Inference as Optimization

Expectation Maximization

MAP Inference

Variational Inference and Learning

Discrete Latent Variables

Calculus of Variations

Continuous Latent Variables

(8)

Previously

Lower bound L

Inference = max L w.r.t. q

Learning = max L w.r.t. θ

EM algorithm -> allows us to make large learning steps with q fixed

MAP inference enables us to learn using a point estimate of h rather than inferring the entire distribution

(9)

Variational Methods

We want to do the following:

Given this surveillance footage X, did the suspect show up in it?

Given this twitter feed X, is the author depressed?

Problem: we cannot compute the posterior P(z|X)

(10)

Variational Methods

Idea:

Allow us to rewrite statistical inference problems as optimization problems

Statistical inference = infer the value of one RV given another

Optimization problem = find the parameter values that minimize a cost function

Bridge between optimization and statistical techniques

(11)

Variational Learning

Key idea: We can maximize L over a restricted family of distributions q

L is a lower bound on log 𝑝(𝑣; 𝜃)

Choose the family such that 𝐸ℎ∼𝑞 log 𝑝(ℎ, 𝑣) is easy to compute.

(12)

The Mean Field Approach

q is a factorial distribution

𝑞(ℎ|𝑣) = ∏𝑖 𝑞(ℎ𝑖|𝑣)

where the ℎ𝑖 are independent under 𝑞 (and thus q(h|v) generally cannot match the true distribution p(h|v))

Advantage: no need to specify a parametric form for q.

The optimization problem determines the optimal probability distribution within these independence constraints

(13)

Discrete Latent Variables

The goal is to maximize the ELBO:

𝐿 = 𝐸ℎ∼𝑞 log 𝑝(ℎ, 𝑣) + 𝐻(𝑞)

In the mean field approach, we assume q is a factorial distribution

We can parameterize q with a vector ĥ whose entries are probabilities; then 𝑞(ℎ𝑖 = 1|𝑣) = ĥ𝑖. We then simply optimize the parameters ĥ, e.g., by solving ∂𝐿/∂ĥ𝑖 = 0 or by gradient-based optimization of the ELBO.
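As a minimal sketch (not from the slides), assume a toy model with 𝑝(ℎ𝑖) = Bernoulli(0.5) and 𝑝(𝑣|ℎ) = 𝑁(𝑣; 𝑊ℎ, 𝐼); all names (W, v, hhat) are illustrative. Under the factorial q, setting ∂𝐿/∂ĥ𝑖 = 0 with the other ĥ𝑗 fixed yields a sigmoid fixed-point update, which we iterate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: p(h_i) = Bernoulli(0.5), p(v | h) = N(v; W h, I),
# with mean-field posterior q(h_i = 1 | v) = hhat_i.
D, K = 8, 4                        # observed dimension, number of binary latents
W = rng.normal(size=(D, K))        # fixed generative weights
h_true = rng.integers(0, 2, size=K)
v = W @ h_true + 0.1 * rng.normal(size=D)

hhat = np.full(K, 0.5)             # initialize q(h_i = 1 | v) = 0.5
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Coordinate ascent on the ELBO: each update solves dL/d hhat_i = 0
# while the other hhat_j are held fixed.
for sweep in range(50):
    for i in range(K):
        w_i = W[:, i]
        residual = v - W @ hhat + w_i * hhat[i]      # v minus the other units' mean reconstruction
        logit = w_i @ residual - 0.5 * w_i @ w_i     # prior log-odds is 0 for Bernoulli(0.5)
        hhat[i] = sigmoid(logit)

print("true h      :", h_true)
print("q(h_i=1 | v):", np.round(hhat, 3))
```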

(14)

Outline

Inference as Optimization

Expectation Maximization

MAP Inference and Sparse Coding

Variational Inference and Learning

Discrete Latent Variables

Calculus of Variations

Continuous Latent Variables

(15)

Recap

Inference is to compute 𝑝(ℎ|𝑣) = 𝑝(𝑣|ℎ)𝑝(ℎ) / 𝑝(𝑣)

However, exact inference requires an exponential amount of time in these models

Computing 𝑝(𝑣) is intractable

Approximate inference is needed

(16)

Recap

We compute the Evidence Lower Bound (ELBO) instead of log 𝑝(𝑣).

ELBO: 𝐿(𝑣, 𝜃, 𝑞) = log 𝑝(𝑣; 𝜃) − 𝐷KL(𝑞(ℎ|𝑣) ∥ 𝑝(ℎ|𝑣; 𝜃))

After rearranging the equation,

𝐿(𝑣, 𝜃, 𝑞) = 𝐸ℎ∼𝑞 log 𝑝(ℎ, 𝑣) + 𝐻(𝑞)

For any choice of 𝑞, 𝐿 provides a lower bound on log 𝑝(𝑣).

If we take 𝑞 equal to 𝑝(ℎ|𝑣), we recover log 𝑝(𝑣) exactly.
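A quick numeric sanity check of these identities on a tiny discrete toy joint (the table and numbers are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy joint p(h, v): 4 latent states, 3 visible states.
joint = rng.random((4, 3))
joint /= joint.sum()

v0 = 1                                   # the observed value of v
log_p_v = np.log(joint[:, v0].sum())     # log p(v)

# For an arbitrary q(h|v): L = E_{h~q}[log p(h, v)] + H(q) <= log p(v).
q = rng.random(4)
q /= q.sum()
elbo = q @ np.log(joint[:, v0]) - q @ np.log(q)
print(f"ELBO = {elbo:.4f}  <=  log p(v) = {log_p_v:.4f}")

# With q = p(h|v), the bound is tight: L = log p(v).
q_post = joint[:, v0] / joint[:, v0].sum()
elbo_tight = q_post @ np.log(joint[:, v0]) - q_post @ np.log(q_post)
print(f"with q = p(h|v): ELBO = {elbo_tight:.4f}")
```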

(17)

Calculus of Variations

In machine learning, we usually minimize a function 𝐽(𝜃) by finding the input vector 𝜃 ∈ ℝ𝑛 for which 𝐽 takes its minimal value.

This can be accomplished by solving for the critical points where 𝛻𝜃𝐽(𝜃) = 0.

(18)

Calculus of Variations

But in some cases, we actually want to solve for a function 𝑓(𝑥).

Calculus of variations is a method of finding critical points with respect to a function 𝑓(𝑥).

A function of a function 𝑓 is called a functional 𝐽[𝑓].

We can take functional derivatives (a.k.a. variational derivatives) of 𝐽[𝑓] with respect to the value of the function 𝑓(𝑥) at any specific point 𝑥.

(19)

Calculus of Variations

Euler-Lagrange equation (simplified form)

Consider a functional 𝐽[𝑓] = ∫ 𝑔(𝑓(𝑥), 𝑥) 𝑑𝑥. An extreme point of 𝐽 is given by the condition

∂𝑔(𝑓(𝑥), 𝑥) / ∂𝑓(𝑥) = 0 for all 𝑥

(Proof) Suppose we change 𝑓 by the amount 𝜖𝜂(𝑥), where 𝜂(𝑥) is an arbitrary function. Then

𝐽[𝑓(𝑥) + 𝜖𝜂(𝑥)] = ∫ 𝑔(𝑓(𝑥) + 𝜖𝜂(𝑥), 𝑥) 𝑑𝑥 = ∫ [𝑔(𝑓(𝑥), 𝑥) + (∂𝑔/∂𝑓) 𝜖𝜂(𝑥)] 𝑑𝑥 + 𝑂(𝜖²) = 𝐽[𝑓] + 𝜖 ∫ (∂𝑔/∂𝑓) 𝜂(𝑥) 𝑑𝑥 + 𝑂(𝜖²)

Here we use the first-order expansion 𝑦(𝑥1 + 𝜖1, …, 𝑥𝐷 + 𝜖𝐷) = 𝑦(𝑥1, …, 𝑥𝐷) + Σ𝑖=1,…,𝐷 (∂𝑦/∂𝑥𝑖) 𝜖𝑖 + 𝑂(𝜖²).

Since 𝜂(𝑥) is arbitrary, the first-order term vanishes for every 𝜂 only if ∂𝑔/∂𝑓(𝑥) = 0 at every 𝑥, which gives the condition above.
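As a small worked example of this simplified condition (added for illustration, not from the slides), consider an integrand that depends only on 𝑓(𝑥) and 𝑥:

```latex
% Illustrative functional: J[f] = \int g(f(x), x)\,dx with g(f(x), x) = (f(x) - x)^2.
% The simplified Euler-Lagrange condition is applied pointwise in x:
\[
  \frac{\partial}{\partial f(x)}\, g\bigl(f(x), x\bigr) = 2\bigl(f(x) - x\bigr) = 0
  \quad\Longrightarrow\quad f^{*}(x) = x .
\]
% Since the integrand is nonnegative and vanishes exactly at f(x) = x,
% the identity function is indeed the minimizer of J.
```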

(20)

Calculus of Variations

The entropy term in 𝐿 is a functional of 𝑞: 𝐻(𝑞) = −∫ 𝑞(𝑥) log 𝑞(𝑥) 𝑑𝑥

We want to maximize 𝐿.

So we have to find the 𝑞 which is a critical point of 𝐿:

Find 𝑞 such that 𝛿𝐿 / 𝛿𝑞(𝑥) = 0

(21)

Example

Consider the problem of finding the probability density function over 𝑥 ∈ ℝ that has the maximal differential entropy 𝐻(𝑝) = −∫ 𝑝(𝑥) log 𝑝(𝑥) 𝑑𝑥 among the distributions with 𝐸[𝑥] = 𝜇 and Var(𝑥) = 𝜎².

I.e., the problem is to find 𝑝(𝑥) that maximizes 𝐻(𝑝) = −∫ 𝑝(𝑥) log 𝑝(𝑥) 𝑑𝑥, such that

∫ 𝑝(𝑥) 𝑑𝑥 = 1 (p(x) integrates to 1), 𝐸[𝑥] = 𝜇, and Var(𝑥) = 𝜎²

(22)

Lagrange Multipliers

How to maximize (or minimize) a function under an equality constraint?

Lagrange multipliers

Problem: maximize f(x) subject to g(x) = 0

Solution

Maximize L(𝑥, 𝜆) = 𝑓(𝑥) + 𝜆 · 𝑔(𝑥)

𝜆 is called Lagrange multiplier

We find x and 𝜆 s.t. 𝛻𝑥𝐿 = 0 and ∂𝐿/∂𝜆 = 0
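A minimal sketch of this recipe with sympy (the objective 𝑥 + 𝑦 and the unit-circle constraint are assumed purely for illustration):

```python
import sympy as sp

# Illustrative problem: maximize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 - 1 = 0.
x, y, lam = sp.symbols('x y lam', real=True)
f = x + y
g = x**2 + y**2 - 1
L = f + lam * g                      # Lagrangian L(x, y, lambda)

# Stationary points: grad_{x,y} L = 0 and dL/dlambda = 0 (i.e., the constraint).
stationary = sp.solve([sp.diff(L, s) for s in (x, y, lam)], [x, y, lam], dict=True)
print(stationary)
# Two critical points (+-sqrt(2)/2, +-sqrt(2)/2); the positive one maximizes f.
```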

(23)

Calculus of Variations

Goal: find 𝑝(𝑥) which maximizes 𝐻(𝑝) = −∫ 𝑝(𝑥) log 𝑝(𝑥) 𝑑𝑥, s.t. ∫ 𝑝(𝑥) 𝑑𝑥 = 1, 𝐸[𝑥] = 𝜇, and Var(𝑥) = 𝜎²

Using Lagrange multipliers, we maximize the Lagrangian functional

𝐿[𝑝] = 𝜆1(∫ 𝑝(𝑥) 𝑑𝑥 − 1) + 𝜆2(𝐸[𝑥] − 𝜇) + 𝜆3(Var(𝑥) − 𝜎²) + 𝐻(𝑝)

(24)

Calculus of Variations

We set the functional derivative equal to 0 for all 𝑥:

𝛿𝐿 / 𝛿𝑝(𝑥) = 𝜆1 + 𝜆2𝑥 + 𝜆3(𝑥 − 𝜇)² − 1 − log 𝑝(𝑥) = 0

We obtain

𝑝(𝑥) = exp(𝜆1 + 𝜆2𝑥 + 𝜆3(𝑥 − 𝜇)² − 1)

We are free to choose the Lagrange multipliers as long as

∫ 𝑝(𝑥) 𝑑𝑥 = 1, 𝐸[𝑥] = 𝜇, and Var(𝑥) = 𝜎²

We may set 𝜆1 = 1 − log 𝜎√2𝜋, 𝜆2 = 0, and 𝜆3 = −1/(2𝜎²).

Then, we obtain 𝑝(𝑥) = 𝑁(𝑥; 𝜇, 𝜎²): the Gaussian maximizes differential entropy under these constraints.
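As a quick numeric check of this result (illustrative, not from the slides), compare the closed-form differential entropies of a few familiar distributions matched to the same variance 𝜎²; the Gaussian's entropy is the largest:

```python
import numpy as np

sigma = 1.5   # any common standard deviation (assumed for illustration)

h_gauss   = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # N(mu, sigma^2)
h_uniform = np.log(np.sqrt(12) * sigma)                  # uniform with width sqrt(12)*sigma
h_laplace = 1 + np.log(np.sqrt(2) * sigma)               # Laplace with scale sigma/sqrt(2)

print(f"Gaussian: {h_gauss:.4f}  Uniform: {h_uniform:.4f}  Laplace: {h_laplace:.4f}")
# The Gaussian value is the largest, consistent with the variational derivation.
```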

(25)

Outline

Inference as Optimization

Expectation Maximization

MAP Inference and Sparse Coding

Variational Inference and Learning

Discrete Latent Variables

Calculus of Variations

Continuous Latent Variables

(26)

Continuous Latent Variables

When our model contains continuous latent variables, we can perform variational inference by maximizing 𝐿 using calculus of variations.

If we make the mean field approximation 𝑞(ℎ|𝑣) = ∏𝑖 𝑞(ℎ𝑖|𝑣), and fix 𝑞(ℎ𝑖|𝑣) for all 𝑖 ≠ 𝑗, then the optimal 𝑞(ℎ𝑗|𝑣) can be obtained by normalizing the unnormalized distribution

𝑞̃(ℎ𝑗|𝑣) = exp(𝐸ℎ−𝑗∼𝑞(ℎ−𝑗|𝑣) log 𝑝(ℎ, 𝑣))
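A minimal sketch of this fixed-point update (assuming, for illustration, that the true 𝑝(ℎ|𝑣) for the observed 𝑣 is a correlated bivariate Gaussian; the numbers are made up). Applying 𝑞̃(ℎ𝑗|𝑣) ∝ exp(𝐸−𝑗 log 𝑝(ℎ, 𝑣)) to each factor in turn gives the classic Gaussian coordinate updates:

```python
import numpy as np

# Assumed toy posterior p(h | v) = N(h; mu, Lam^{-1}) with correlated components.
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])     # precision matrix

m = np.zeros(2)                  # means of the factors q1(h1|v) and q2(h2|v)
for sweep in range(30):
    # Update q1 with q2 fixed: E_{h2~q2}[log p(h, v)] is quadratic in h1,
    # so exponentiating gives a Gaussian with precision Lam[0,0] and mean:
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    # Update q2 with q1 fixed, by the same argument:
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

print("mean-field means    :", np.round(m, 4))      # converges to mu
print("mean-field variances:", 1 / np.diag(Lam))    # underestimate the true marginal variances
print("true variances      :", np.diag(np.linalg.inv(Lam)))
```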

(27)

Proof of Mean Field Approximation

(Claim) Assuming 𝑞(ℎ|𝑣) = ∏𝑖 𝑞(ℎ𝑖|𝑣), the optimal 𝑞(ℎ𝑗|𝑣) is given by normalizing the unnormalized distribution 𝑞̃(ℎ𝑗|𝑣) = exp(𝐸ℎ−𝑗∼𝑞(ℎ−𝑗|𝑣) log 𝑝(ℎ, 𝑣))

(Proof) Note that ELBO = 𝐸ℎ∼𝑞 log 𝑝(ℎ, 𝑣) + 𝐻(𝑞), and 𝑞(ℎ|𝑣) = ∏𝑖 𝑞𝑖(ℎ𝑖|𝑣).

Thus, ELBO = ∫ (∏𝑖 𝑞𝑖) log 𝑝(ℎ, 𝑣) 𝑑ℎ − ∫ (∏𝑖 𝑞𝑖)(log ∏𝑖 𝑞𝑖) 𝑑ℎ
= ∫ 𝑞𝑗 {∫ log 𝑝(ℎ, 𝑣) (∏𝑖≠𝑗 𝑞𝑖 𝑑ℎ𝑖)} 𝑑ℎ𝑗 − ∫ (∏𝑖 𝑞𝑖)(Σ𝑖 log 𝑞𝑖) 𝑑ℎ

If we keep only the terms that depend on 𝑞𝑗, the ELBO becomes

∫ 𝑞𝑗 𝐸−𝑗[log 𝑝(ℎ, 𝑣)] 𝑑ℎ𝑗 − ∫ 𝑞𝑗 log 𝑞𝑗 𝑑ℎ𝑗 + const

= ∫ 𝑞𝑗 log 𝑝̃(ℎ𝑗, 𝑣) 𝑑ℎ𝑗 − ∫ 𝑞𝑗 log 𝑞𝑗 𝑑ℎ𝑗 + const    (1)

where 𝑝̃(ℎ𝑗, 𝑣) is a probability distribution with log 𝑝̃(ℎ𝑗, 𝑣) = 𝐸−𝑗[log 𝑝(ℎ, 𝑣)] + const

Note that (1) is the negative KL divergence −𝐷KL(𝑞𝑗 ∥ 𝑝̃(ℎ𝑗, 𝑣)); thus the best 𝑞𝑗 is 𝑞𝑗(ℎ𝑗|𝑣) = 𝑝̃(ℎ𝑗, 𝑣), i.e., 𝑞𝑗(ℎ𝑗|𝑣) ∝ exp(𝐸−𝑗[log 𝑝(ℎ, 𝑣)]), which proves the claim.

(28)

What you need to know

Inference as Optimization

Expectation Maximization

MAP Inference and Sparse Coding

Variational Inference and Learning

(29)

Questions?
