VAE (Variational AutoEncoder)

(1)

Variational AutoEncoder

Wanho Choi

(wanochoi.com)

(2)

Artificial Neural Networks

(3)

(4)


(5)

Generative Models

• Discriminative Model: decision boundary (examples: FCN, CNN)
• Generative Model: probability distribution (examples: Autoencoder, GAN)

(6)

Two Most Popular Generative Models

• VAE (Variational AutoEncoder)
‣ Root: Bayesian inference
‣ Goal: to model the underlying probability distribution of the given data so that new data can be sampled from that distribution
• GAN (Generative Adversarial Network)
‣ Root: game theory
‣ Goal: to find the Nash equilibrium

(7)

Autoencoder

(8)

Data Compression

• It converts the input into a smaller representation from which the original data can be reproduced to some degree of quality (lossy data compression).
• An AE performs data compression with unsupervised learning (SW 2.0), unlike many hard-coded algorithms (SW 1.0).
Original Data → Compressor (encoder) → Compressed (encoded) Data → Decompressor (decoder) → Decompressed Data

(9)

Encoding / Decoding

• Encoding: the process of converting given data into a coded form
• Decoding: the process of restoring the original data from the encoded data

• Codec = (En)coder + Decoder (or Compression + Decompression)

http://aess.com.tr/encoding-decoding-2/ https://www.joydeepdeb.com/tools/url-encoding-decoding.html

(10)

AutoEncoder

[Figure: Input Layer → Hidden Layer → Output Layer]
Encoder (recognition network): $z = f(x)$, with input $x \in \mathbb{R}^d$
Latent vector: $z \in \mathbb{R}^k$ $(k \ll d)$, a lower-dimensional representation (information loss, key features only)
Decoder (generative network): $\hat{x} = g(z)$, with output $\hat{x} \in \mathbb{R}^d$

(11)

AutoEncoder

• The general idea of the AE is to squeeze the given information (input data) through a narrow bottleneck between the mirrored encoder (input) and decoder (output) parts of a neural network.
• Because the network architecture and loss function are set up so that the output tries to emulate the input, the network has to learn how to encode the input data in the very limited space represented by the bottleneck.

(12)

Some Keywords

• Autoencoder = Sandglass-shaped Net. = Diabolo Net.

• Bottleneck = Latent Space = Manifold

• Code = Features = Hidden Variable = Latent Variable = Latent Vector = Encoding Vector

• Dimensionality Reduction = Feature Detection

= Latent Representation = Hidden Representation = Compressed Representation

• Representation Learning = Manifold Learning

[Figure: image space → latent space → image space]

(13)

Manifold

• Euclidean space: a space in which the distance between two points $A(a_1, a_2, \ldots, a_n)$ and $B(b_1, b_2, \ldots, b_n)$ is defined as follows:
$$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
‣ 1D Euclidean space: line
‣ 2D Euclidean space: plane
‣ 3D Euclidean space: the ordinary 3D space in which we live
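The distance that defines a Euclidean space can be sketched in a few lines of Python. The helper name `euclidean_distance` and the sample points are assumptions for illustration, not part of the original slides.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points given as equal-length coordinate sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# 2D plane: the classic 3-4-5 right triangle
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
# 3D space: a point has zero distance to itself
print(euclidean_distance((1.0, 2.0, 3.0), (1.0, 2.0, 3.0)))
```

The same function works for any dimension n, which is the sense in which a line, a plane, and ordinary 3D space are all Euclidean spaces.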

(14)

Manifold

• Manifold: a space that locally looks Euclidean
• For each point on the manifold, there exists a hypersphere centered at that point with a very small radius such that the intersection of this sphere and the manifold can be continuously deformed into a disk. This means the shape can be twisted or bent without cutting or gluing.
• An n-manifold can be pictured as a smooth surface in (n+1)-dimensional space.
• Ex) 2-manifold = smooth surface in 3D space

(15)

Manifold in Deep Learning

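• Manifold: a low-dimensional sub-space that captures well the data living in a high-dimensional space
• The points obtained by moving in a particular direction along the manifold form a set of meaningful results.
• The Euclidean distance between two points in the original high-dimensional space is likely to be a meaningless notion.
• The geodesic distance between two points on the manifold is a meaningful notion.
• On the manifold, classification of the data also becomes easier and clearer.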

https://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB3-algorithms-AE-depth.pdf

$$f : \mathbb{R}^d \to \mathbb{R}^m \quad (d \le m)$$
$$x_i = f(\tau_i) + \epsilon_i, \quad x_i \in \mathbb{R}^m$$

https://www.slideshare.net/NaverEngineering/ss-96581209 https://en.wikipedia.org/wiki/Manifold_regularization

(16)

Curse of Dimensionality

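• As the dimensionality of the data increases, the size (volume) of the space grows exponentially, so the density of a data set made of the same number of discrete sample points drops rapidly relative to the whole space.
• Therefore, as the dimensionality of the data increases, the number of samples needed to analyze the data distribution or to estimate a model grows exponentially.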
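The effect described above can be sketched by splitting each axis of a unit hypercube into a fixed number of bins and asking what fraction of the resulting cells a fixed sample budget could occupy at best. The function name `occupied_fraction`, the budget of 1000 samples, and the 10 bins per axis are illustrative assumptions.

```python
def occupied_fraction(num_samples: int, bins_per_axis: int, dims: int) -> float:
    """Upper bound on the fraction of grid cells that can hold at least one sample."""
    total_cells = bins_per_axis ** dims  # the cell count grows exponentially with dims
    return min(1.0, num_samples / total_cells)

# The same 1000 samples cover less and less of the space as dims grows
for dims in (1, 2, 3, 5, 10):
    print(dims, occupied_fraction(1000, 10, dims))
```

Keeping the coverage constant would require the number of samples to grow as `bins_per_axis ** dims`, which is the exponential growth the slide refers to.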

(17)

Manifold Hypothesis

• Data such as images have a low density in the high-dimensional space, but there exists a low-dimensional manifold that contains them.
• As soon as one leaves this low-dimensional manifold, the density of the data drops sharply.

https://dsp.stackexchange.com/questions/34126/random-noise-removal-in-images

(18)

The Latent Space of MNIST

(19)

The Latent Space of Face-like Images

• The space of all face-like images is smaller than the space of all images.

[Figure: inside the space of all images, the face-like images lie on a face-like image manifold]

(20)

Vanilla AutoEncoder

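• The simplest autoencoder
• Three-layer network
• Loss = $\lVert x_i - \hat{x}_i \rVert^2$
• Closely related to PCA: with linear layers and this squared-error loss, the latent space spans the same subspace as the top $k$ principal components
[Figure: fully connected network with inputs $x_1, x_2, \ldots, x_d$, latent units $z_1, \ldots, z_k$, and outputs $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_d$]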

(21)

Vanilla AE as PCA

import torch
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

LEARNING_RATE = 0.001
TOTAL_EPOCHS = 1000

def Points(n, w1, w2, noise):
    # n noisy 3D points that lie close to a 2D plane
    points = np.empty((n, 3))
    angles = np.random.rand(n) * 3 * np.pi / 2 - 0.5
    points[:,0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(n) / 2
    points[:,1] = np.sin(angles) * 0.7 + noise * np.random.randn(n) / 2
    points[:,2] = points[:,0] * w1 + points[:,1] * w2 + noise * np.random.randn(n)
    return points

points = Points(100, 0.1, 0.3, 0.1)
x = torch.from_numpy(points).float()

(22)

Vanilla AE as PCA

def Display3D(points):
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d') # Axes3D(fig) no longer attaches itself in recent matplotlib
    ax.scatter(points[:,0], points[:,1], points[:,2])
    plt.show()

def Display2D(points):
    plt.plot(points[:,0], points[:,1], "b.")
    plt.show()

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.encoder = torch.nn.Linear(3, 2) # from 3D to 2D
        self.decoder = torch.nn.Linear(2, 3) # from 2D to 3D

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

    def z(self, x): # latent vector
        z = self.encoder(x)
        return z

(23)

Vanilla AE as PCA

model = Model()
CostFunc = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(TOTAL_EPOCHS):
    output = model(x)
    cost = CostFunc(output, x) # reconstruction error against the input itself
    cost.backward()
    optimizer.step()
    optimizer.zero_grad()
    print('Cost: {:.4f}'.format(cost.item()))

Display3D(points)
z = model.z(x).detach()
Display2D(z.numpy())
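For comparison with the 2D latent points learned by the linear autoencoder, classical PCA can be sketched with NumPy's SVD. This is a minimal sketch under assumptions: the helper name `pca_project` and the random demo data are illustrative and not part of the original slides, and the autoencoder's latent axes are only expected to span the same plane as the PCA axes, not to coincide with them.

```python
import numpy as np

def pca_project(points: np.ndarray, k: int) -> np.ndarray:
    """Project points onto their top-k principal components."""
    centered = points - points.mean(axis=0)           # PCA works on mean-centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T                        # rows of vt are the principal directions

# Hypothetical demo data: 100 random 3D points
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(100, 3))
z_pca = pca_project(demo_points, 2)
print(z_pca.shape)
```

Passing the `points` array from the slides instead of `demo_points` gives a 2D scatter that can be compared visually with the autoencoder's latent plot.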


(24)

Vanilla AE as PCA

3D Input Points

(25)

Vanilla AE for MNIST

import torch, torchvision
from matplotlib import pyplot as plt

BATCH_SIZE = 100
LEARNING_RATE = 0.001
TOTAL_EPOCHS = 3

transforms = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
train_dataset = torchvision.datasets.MNIST(root='./data/', train=True, transform=transforms, download=True)
test_dataset = torchvision.datasets.MNIST(root='./data/', train=False, transform=transforms)
train_dataloader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=True)

(26)

Vanilla AE for MNIST

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.encoder = torch.nn.Linear(28*28, 100)
        self.decoder = torch.nn.Linear(100, 28*28)

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

model = Model()
CostFunc = torch.nn.MSELoss()
# The optimizer and the epoch loop are not shown on this slide; Adam (as in the PCA example) is assumed
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(TOTAL_EPOCHS):
    for images, _ in train_dataloader:
        images = images.reshape(-1, 784) # flattening for input
        output = model(images) # feed forward
        cost = CostFunc(output, images) # compare with input images (not with labels)
        cost.backward()
        optimizer.step()
        optimizer.zero_grad()
    print('Cost: {:.4f}'.format(cost.item()))

(27)

Vanilla AE for MNIST

# Top row: input images, bottom row: reconstructions (both from the last batch)
inputs = images.numpy()
outputs = output.detach().numpy()

plt.figure(figsize=(20, 4))
for i in range(10):
    plt.subplot(2, 10, i+1)
    plt.imshow(inputs[i].reshape(28, 28))
    plt.gray()
    plt.axis('off')
    plt.subplot(2, 10, i+11)
    plt.imshow(outputs[i].reshape(28, 28))
    plt.gray()
    plt.axis('off')
plt.tight_layout()
plt.show()


(28)

(29)

Vanilla AE as De-noiser

(30)

Vanilla AE as De-noiser

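class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.encoder = torch.nn.Linear(28*28, 100)
        self.decoder = torch.nn.Linear(100, 28*28)

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

model = Model()
CostFunc = torch.nn.MSELoss()
# The optimizer and the epoch loop are not shown on this slide; Adam is assumed
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(TOTAL_EPOCHS):
    for images, _ in train_dataloader:
        images = images.reshape(-1, 784)
        noisy_images = images + torch.randn(images.size()) * 0.5 # add noise
        output = model(noisy_images)
        cost = CostFunc(output, images) # compare with the clean images so the model learns to remove the noise
        cost.backward()
        optimizer.step()
        optimizer.zero_grad() # reset gradients so they do not accumulate across batches
    print('Cost: {:.4f}'.format(cost.item()))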

(31)

Vanilla AE as De-noiser


(32)

(33)

Deep AE as De-noiser

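[Figure: Input Layer → Hidden Layer → Output Layer]
Encoder: $z = f(x)$, with input $x \in \mathbb{R}^d$
Latent vector: $z \in \mathbb{R}^k$ $(k \ll d)$, a lower-dimensional representation (information loss, key features only)
Decoder: $\hat{x} = g(z)$, with output $\hat{x} \in \mathbb{R}^d$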

(34)

Deep AE as De-noiser

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        # 784 -> 256 -> 64 -> 16 -> 4 (latent vector)
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(28*28, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 4))
        # 4 -> 16 -> 64 -> 256 -> 784
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(4, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, 28*28), torch.nn.Tanh())

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

(35)

(36)

Convolutional AutoEncoder (CAE)

• AE with convolution & pooling layers instead of fully-connected layers
‣ Encoder: down-sampling
‣ Decoder: up-sampling
• The input layer is not flattened
[Figure: the encoder (convolutional and pooling layers, down-sampling) maps $x \in \mathbb{R}^d$ to the latent vector $z = f(x)$, $z \in \mathbb{R}^k$ $(k \ll d)$, a lower-dimensional representation (information loss, key features only); the decoder (de-convolutional and de-pooling layers, up-sampling) maps it to $\hat{x} = g(z)$, $\hat{x} \in \mathbb{R}^d$]

(37)

Convolutional AutoEncoder (CAE)

(38)

Convolutional AutoEncoder (CAE)

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        # PyTorch tensors are laid out as (bs, channels, height, width)
        # (bs, 1, 28, 28) -> conv -> (bs, 128, 28, 28) -> pool -> (bs, 128, 14, 14)
        # (bs, 128, 14, 14) -> conv -> (bs, 8, 14, 14) -> pool -> (bs, 8, 7, 7)
        self.encoder = torch.nn.Sequential(
            torch.nn.Conv2d(1, 128, 3, 1, 1), # in_channels, out_channels, kernel_size, stride, padding
            torch.nn.MaxPool2d(2, 2),
            torch.nn.Conv2d(128, 8, 3, 1, 1), # in_channels, out_channels, kernel_size, stride, padding
            torch.nn.MaxPool2d(2, 2))
        # (kernel_size, stride) = (2, 2) doubles the spatial dims
        # (bs, 8, 7, 7) -> (bs, 128, 14, 14)
        # (bs, 128, 14, 14) -> (bs, 1, 28, 28)
        self.decoder = torch.nn.Sequential(
            torch.nn.ConvTranspose2d(8, 128, 2, 2, 0), # in_channels, out_channels, kernel_size, stride, padding
            torch.nn.ConvTranspose2d(128, 1, 2, 2, 0)) # in_channels, out_channels, kernel_size, stride, padding

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x


(39)

Convolutional AutoEncoder (CAE)

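model = Model()
CostFunc = torch.nn.MSELoss()
# The optimizer and the epoch loop are not shown on this slide; Adam is assumed
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(TOTAL_EPOCHS):
    for images, _ in train_dataloader: # no flattening process
        output = model(images)
        cost = CostFunc(output, images)
        cost.backward()
        optimizer.step()
        optimizer.zero_grad() # reset gradients so they do not accumulate across batches
    print('Cost: {:.4f}'.format(cost.item()))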

(40)

Convolutional AutoEncoder (CAE)


(41)

(42)

Symmetry & Weight Sharing
Q) Does an autoencoder have to be symmetrical in its encoder and decoder?
A) There is no specific constraint on the symmetry of an autoencoder.
At the beginning, people tended to enforce such symmetry to the maximum: not only were the layers symmetrical, but the weights of the layers in the encoder and decoder were also shared. This is not a requirement, but it allows the use of certain loss functions (e.g., RBM score matching) and can act as regularization, as you effectively halve the number of parameters to optimize. Nowadays, however, I think no one imposes encoder-decoder weight sharing.
About architectural symmetry, it is common to find the same number of layers, the same type of layers and the same layer sizes in the encoder and decoder, but there is no need for that.
For instance, in convolutional autoencoders, it used to be very common to find convolutional layers in the encoder and deconvolutional layers in the decoder, but now you normally see upsampling layers in the decoder because they produce fewer artifacts.

(43)

AE as a Fake Sample Generator

• After the learning is completed, you can separate the decoder and use it as a fake sample generator.
[Figure: training: Noisy Input → Encoder → Features → Decoder → Denoised Output; generation: Features → Decoder → Newly Generated Output]

(44)


(45)

Logarithmic Function

(46)

Log: Definition

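• Given $y = a^x$, the logarithm is the inverse of this function: $x = \log_a y$
• Only the case $a > 0$ and $a \neq 1$ is considered (to avoid the indeterminate and impossible cases).
• Because the two functions are inverses of each other, the graphs of $y = a^x$ and $y = \log_a x$ are symmetric about the line $y = x$.
$$y = a^x \iff x = \log_a y$$
($a$: base, $y$: argument)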

(47)

Log: Graph

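• Example with $a = 2$
• $y = 2^x$ and $y = \log_2 x$ are inverse functions of each other.
• For $y = \log_2 x$: as $x \to 0$, $y \to -\infty$
• For $y = \log_2 x$: as $x \to \infty$, $y \to \infty$
[Figure: graphs of $y = 2^x$, $y = x$, and $y = \log_2 x$; marked values: $0 = \log_2 1$, $-\infty = \log_2 \epsilon$; on $y = 2^x$, $x = 1$ gives $y = 2$; on $y = \log_2 x$, $x = 2$ gives $y = 1$]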

(48)

Log: Two Cases

• If $a > 1$, it is an increasing function (e.g., $y = \log_2 x$).
• If $0 < a < 1$, it is a decreasing function (e.g., $y = \log_{1/2} x$).

(49)

Log: Properties

With $f(x) = \log_a x$, where $y = a^x \iff x = \log_a y$:
• $\log_a 1 = 0, \ \log_a a = 1$ : $f(1) = 0, \ f(a) = 1$
• $\log_a x^n = n \log_a x$ : $f(x^n) = n f(x)$
• $\log_a x + \log_a y = \log_a xy$ : $f(x) + f(y) = f(xy)$
• $\log_a x - \log_a y = \log_a \frac{x}{y}$ : $f(x) - f(y) = f\left(\frac{x}{y}\right)$
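The four properties can be checked numerically with Python's `math` module. This is a small sanity check, not a proof; the helper `log_a` and the sample values `a = 2`, `x = 8`, `y = 4`, `n = 3` are assumptions chosen for illustration.

```python
import math

def log_a(value: float, base: float) -> float:
    """Logarithm of value in the given base, via the change-of-base rule."""
    return math.log(value) / math.log(base)

a, x, y, n = 2.0, 8.0, 4.0, 3

# Each entry pairs the left-hand side with the right-hand side of one property
identities = {
    "log_a 1 = 0": (log_a(1.0, a), 0.0),
    "log_a a = 1": (log_a(a, a), 1.0),
    "log_a x^n = n log_a x": (log_a(x ** n, a), n * log_a(x, a)),
    "log_a x + log_a y = log_a xy": (log_a(x, a) + log_a(y, a), log_a(x * y, a)),
    "log_a x - log_a y = log_a x/y": (log_a(x, a) - log_a(y, a), log_a(x / y, a)),
}

for name, (lhs, rhs) in identities.items():
    print(name, math.isclose(lhs, rhs, abs_tol=1e-12))
```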

(50)

Log: Properties (proof)

• $\log_a 1 = 0, \ \log_a a = 1$
Using $y = a^x \iff x = \log_a y$:
When $x = 0$, $1 = a^0$, so $0 = \log_a 1$.
When $x = 1$, $a = a^1$, so $1 = \log_a a$.

(51)

Log: Properties (proof)

• $\log_a x^n = n \log_a x$
Let $\log_a x = m$. Then $a^m = x$, so $a^{nm} = x^n$, and by the definition of log:
$$\log_a x^n = mn = n \log_a x$$

(52)

Log: Properties (proof)

• $\log_a x + \log_a y = \log_a xy$
Let $\log_a x = m$ and $\log_a y = n$. Then $a^m = x$ and $a^n = y$, so from
$$a^m \times a^n = a^{m+n} = xy$$
the definition of log gives
$$\log_a xy = m + n = \log_a x + \log_a y$$

(53)

Log: Properties (proof)

• $\log_a x - \log_a y = \log_a \frac{x}{y}$
Let $\log_a x = m$ and $\log_a y = n$. Then $a^m = x$ and $a^n = y$, so from
$$a^m \div a^n = a^{m-n} = x \div y$$
the definition of log gives
$$\log_a \frac{x}{y} = m - n = \log_a x - \log_a y$$

(54)

Natural Logarithm

• The logarithm whose base is $e$ (the natural constant, or Euler's number) $= 2.71\ldots$
• The base $e$ is sometimes omitted, written as follows:
$$y = a^x \iff x = \log_a y$$
When $a = e$:
$$y = e^x \iff x = \log_e y$$
$$\log_e y = \ln y$$

(55)

Exponential and Logarithm

(Here $\log$ denotes the natural logarithm.)
• $\log e = 1$
• $\log e^k = k$
• $e^{\log k} = k$

(56)

Exponential and Logarithm (proof)

• $\log e = \log_e e = 1$
• $\log e^k = k \log e = k \times 1 = k$
• Let $e^{\log k} = y$. By the definition of the logarithmic function, $\log k = \log_e y$, that is, $\log k = \log y$, so $y = k$.

(57)


(58)


(59)

Probability

(60)

Deterministic vs Stochastic

• Deterministic model
‣ The output of the model is fully determined by the system parameters.
• Stochastic (= Probabilistic) model
‣ Some inherent randomness
‣ The same set of parameters can lead to different results (an ensemble).

(61)

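Sample Space
• The set of all outcomes (= events) that can occur in a single trial
‣ Sample space for the number shown by a thrown die: $S = \{1, 2, 3, 4, 5, 6\}$
‣ Sample space for the side shown by a tossed coin: $S = \{\text{front}, \text{back}\}$
‣ Sample space for the score on a math exam with 10 questions: $S = \{0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$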

(62)

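Probability
• The fraction of times a particular outcome occurs when the same experiment is repeated indefinitely under identical conditions
• A probability takes a value between 0 and 1.
• The closer the value is to 1, the higher the probability is said to be, which means the outcome is that much more certain to occur over infinitely many trials.
• Probability of rolling a 3 with a die: 1/6
• Probability of getting heads when tossing a coin: 1/2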

(63)

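Random Variable
• A numerical value assigned to an experimental outcome (event) that occurs with a certain probability.
• What distinguishes a random variable from an ordinary variable is whether or not it is observed from a random sample. (stochastic vs deterministic)
• In general, a random variable is written in upper case and the value it takes in lower case.
$P(X = x) = p$ : the probability that the event $x$ in the sample space occurs is $p$.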

(64)

Mean
• Mean = arithmetic mean = sample mean
• One of the easiest concepts to understand in statistics
• Given $N$ observations $(x_1, x_2, \cdots, x_N)$, the mean is defined as follows.
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

(65)

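Variance
• How far are the observations scattered from the mean?
• Only the degree of spread matters, so the sign is not important.
• The sign can be removed either by taking the absolute value or by squaring; the absolute value is inconvenient in algebraic manipulation, so it is rarely used.
• Therefore, the variance is defined as the mean of the squared deviations (deviation = observation − mean).
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \qquad (\sigma: \text{standard deviation})$$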

(66)

Example: Mean and Variance
• Math scores of the students in Grade 2, Class 1: 65, 72, 57, 92, 45
$$\mu = \frac{1}{5}(65 + 72 + 57 + 92 + 45) = 66.2$$
$$\sigma^2 = \frac{1}{5}\left\{(65 - 66.2)^2 + (72 - 66.2)^2 + (57 - 66.2)^2 + (92 - 66.2)^2 + (45 - 66.2)^2\right\} = 246.96$$
• Math scores of the students in Grade 2, Class 2: 76, 58, 72, 65, 63
$$\mu = \frac{1}{5}(76 + 58 + 72 + 65 + 63) = 66.8$$
$$\sigma^2 = \frac{1}{5}\left\{(76 - 66.8)^2 + (58 - 66.8)^2 + (72 - 66.8)^2 + (65 - 66.8)^2 + (63 - 66.8)^2\right\} = 41.36$$
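The two classes have nearly the same mean but very different spreads. A minimal sketch that recomputes the numbers from the definitions (the helper name `mean_and_variance` is an assumption for illustration):

```python
from statistics import fmean

class_1 = [65, 72, 57, 92, 45]
class_2 = [76, 58, 72, 65, 63]

def mean_and_variance(scores):
    """Population mean and variance, following the definitions on the slides."""
    mu = fmean(scores)
    variance = sum((s - mu) ** 2 for s in scores) / len(scores)
    return mu, variance

mu_1, var_1 = mean_and_variance(class_1)
mu_2, var_2 = mean_and_variance(class_2)
print(f"class 1: mean={mu_1:.1f}, variance={var_1:.2f}")
print(f"class 2: mean={mu_2:.1f}, variance={var_2:.2f}")
```

Note that the division is by N (population variance), matching the slide's definition, rather than by N − 1 as in the sample variance.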

(67)

Expectation Value
• A weighted average
• The sum, over all events, of (the outcome when each event occurs) × (the probability that the event occurs)
• It carries the meaning of a mean for a probabilistic event.
$$E(X) = \sum_{i=1}^{N} [x_i \cdot P(x_i)] = \mu$$

(68)

Expectation Value
             Return    Probability
Boom          50%        10%
Normal        30%        50%
Recession    −10%        40%
$$E(X) = (50 \times 0.1) + (30 \times 0.5) + (-10 \times 0.4) = 5 + 15 - 4 = 16\%$$
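The expected return from the table can be reproduced as a weighted average. The dictionary layout and the helper name `expectation` are assumptions for illustration.

```python
# (return in percent, probability) for each economic scenario in the table
scenarios = {
    "boom": (50.0, 0.1),
    "normal": (30.0, 0.5),
    "recession": (-10.0, 0.4),
}

def expectation(outcomes):
    """Sum of value * probability over all (value, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)

# The probabilities of all scenarios must add up to 1
total_probability = sum(prob for _, prob in scenarios.values())
expected_return = expectation(scenarios.values())
print(f"total probability = {total_probability:.1f}")
print(f"expected return = {expected_return:.1f}%")
```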

(69)

$E(aX + b) = aE(X) + b$
$$E(aX + b) = \sum_{i=1}^{N} [(a x_i + b) P(x_i)] \qquad \left(P(x_i) = P(a x_i + b)\right)$$
$$= \sum_{i=1}^{N} [a x_i P(x_i) + b P(x_i)]$$
$$= \sum_{i=1}^{N} [a x_i P(x_i)] + \sum_{i=1}^{N} [b P(x_i)]$$
$$= a \sum_{i=1}^{N} [x_i P(x_i)] + b \sum_{i=1}^{N} [P(x_i)]$$
$$= aE(X) + b \qquad \left(\sum_{i=1}^{N} [P(x_i)] = 1\right)$$

(70)

$$Var(X) = E[(X - \mu)^2]$$

(71)

$Var(X) = E(X^2) - E(X)^2$
$$Var(X) = E[(X - \mu)^2] = \sum_{i=1}^{N} [(x_i - \mu)^2 P(x_i)]$$
$$= \sum_{i=1}^{N} [(x_i^2 - 2\mu x_i + \mu^2) P(x_i)]$$
$$= \sum_{i=1}^{N} [x_i^2 P(x_i) - 2\mu x_i P(x_i) + \mu^2 P(x_i)]$$
$$= \sum_{i=1}^{N} [x_i^2 P(x_i)] - 2\mu \sum_{i=1}^{N} [x_i P(x_i)] + \mu^2 \sum_{i=1}^{N} [P(x_i)]$$
$$= E(X^2) - 2\mu^2 + \mu^2 \qquad \left(\mu = \sum_{i=1}^{N} [x_i P(x_i)]\right)$$
$$= E(X^2) - \mu^2$$
$$= E(X^2) - E(X)^2 \qquad (E(X) = \mu)$$

(72)

$Var(aX + b) = a^2 Var(X)$
$$Var(aX + b) = E[(aX + b)^2] - E[(aX + b)]^2$$
$$= E[(aX + b)^2] - (aE[X] + b)^2$$
$$= E[a^2 X^2 + 2abX + b^2] - (a^2 E[X]^2 + 2abE[X] + b^2)$$
$$= a^2 E[X^2] + 2abE[X] + b^2 - a^2 E[X]^2 - 2abE[X] - b^2$$
$$= a^2 E[X^2] - a^2 E[X]^2$$
$$= a^2 (E[X^2] - E[X]^2)$$
$$= a^2 Var(X)$$
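Both linearity results, $E(aX + b) = aE(X) + b$ and $Var(aX + b) = a^2 Var(X)$, can be verified exactly on a small discrete distribution. The sketch below assumes a fair six-sided die and the arbitrary constants `a = 2`, `b = 3`; exact fractions are used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction

# Fair die: each face 1..6 has probability 1/6
die = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(dist, g=lambda v: v):
    """E[g(X)] for a discrete distribution given as {value: probability}."""
    return sum(g(value) * prob for value, prob in dist.items())

def variance(dist, g=lambda v: v):
    """Var[g(X)] = E[(g(X) - E[g(X)])^2]."""
    mu = expectation(dist, g)
    return expectation(dist, lambda v: (g(v) - mu) ** 2)

a, b = 2, 3
affine = lambda v: a * v + b

print(expectation(die, affine), a * expectation(die) + b)   # both sides of the mean rule
print(variance(die, affine), a ** 2 * variance(die))        # both sides of the variance rule
```

The shift `b` moves the mean but drops out of the variance, which is why only `a` (squared) appears on the right-hand side.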

(73)

Conditional Probability
• The probability that event A occurs given that event B has occurred
$$P(A|B) = \frac{P(A, B)}{P(B)} \iff P(A, B) = P(A|B) \cdot P(B)$$
where $P(A, B) \equiv P(A \cap B)$

(74)

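Prior Probability vs Posterior Probability
• Prior probability
‣ The probability of a future event, assigned from knowledge already held before the event occurs
‣ Ex) Probability of getting heads when tossing a coin: 0.5
• Posterior probability
‣ The probability revised using the additional information gained after the event has occurred
‣ Expressed in the form of a conditional probability
‣ Ex) Probability of being married: 0.5, probability of having no children: 0.4, probability of having no children given being married: 0.2
P(married) = 0.5
P(childless) = 0.4
P(childless|married) = 0.2
P(married|childless) = P(childless|married) ⋅ P(married) / P(childless) = 0.2 × 0.5 / 0.4 = 0.25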
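The posterior in the example follows from the definition of conditional probability applied twice: $P(A|B) = P(B|A) \cdot P(A) / P(B)$, so the product $P(\text{childless}|\text{married}) \cdot P(\text{married})$ has to be divided by $P(\text{childless})$. A minimal sketch with the slide's numbers (the function name `bayes_posterior` is an assumption for illustration):

```python
def bayes_posterior(likelihood: float, prior: float, evidence: float) -> float:
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0.0:
        raise ValueError("evidence probability must be positive")
    return likelihood * prior / evidence

p_married = 0.5                  # prior P(married)
p_childless = 0.4                # evidence P(childless)
p_childless_given_married = 0.2  # likelihood P(childless | married)

p_married_given_childless = bayes_posterior(
    p_childless_given_married, p_married, p_childless
)
print(f"P(married | childless) = {p_married_given_childless:.2f}")
```

Knowing that a person has no children lowers the probability of being married from the prior 0.5 to a posterior of about 0.25.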

(75)

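Probability Distribution
• The correspondence between the values a random variable $X$ takes and the probabilities of taking those values
‣ Discrete probability distribution: the case where the variable does not take continuous values
‣ Continuous probability distribution: the case where the distribution can be expressed using a probability density function
[Figure: bar chart over the values 1 to 6, vertical axis from 0 to 0.2]
The probability distribution of "random variable = the number shown by a thrown die" is a discrete uniform distribution.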

(76)

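Probability Density Function
• A function that describes the continuous distribution of a random variable
$$P(a \le x \le b) = \int_a^b f(x)\,dx$$
[Figure: curve $y = f(x)$ with the interval from $x = a$ to $x = b$ marked]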

(77)

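Total Probability
• The probabilities of all events sum to 1.
‣ Discrete probability distribution: $\sum_{i=1}^{N} P(x_i) = 1$
‣ Continuous probability distribution: $\int_{-\infty}^{\infty} f(x)\,dx = 1$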

(78)

The Law of Total Probability
$$P(X = x_i) = \sum_{j=1}^{N} P(X = x_i \cap Z = z_j) = \sum_{j=1}^{N} [P(X = x_i | Z = z_j) \cdot P(Z = z_j)]$$
discrete version:
$$P(x) = \sum_{z} [P(x|z) \cdot P(z)]$$
continuous version:
$$P(X = x) = \int_{-\infty}^{\infty} [P(x|z) \cdot P(z)]\,dz$$

(79)

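Gaussian Distribution
• Also called the normal distribution.
$$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
[Figure: bell curve with area labels 34.1% and 13.5% on each side of the mean and 2.5% in each tail]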
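The area labels in the figure can be approximated from the Gaussian cumulative distribution function, which the Python standard library exposes through `math.erf`. This is a minimal sketch; the function names `normal_cdf` and `interval_probability` are assumptions for illustration, and the tail beyond two standard deviations comes out slightly below the rounded 2.5% shown on the slide.

```python
import math

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(X <= x) for X ~ N(mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_probability(lo: float, hi: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """P(lo <= X <= hi), i.e. the integral of the density between lo and hi."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

mean_to_1sigma = interval_probability(0.0, 1.0)   # between the mean and +1 sigma
one_to_2sigma = interval_probability(1.0, 2.0)    # between +1 sigma and +2 sigma
beyond_2sigma = 1.0 - normal_cdf(2.0)             # upper tail beyond +2 sigma

print(f"{mean_to_1sigma:.3f} {one_to_2sigma:.3f} {beyond_2sigma:.3f}")
```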

(80)

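Gaussian Distribution
• Many economic, social, and natural phenomena follow a normal distribution.
• Saying that the average height of Korean men is 170 cm means that people whose height is close to 170 cm are the most numerous, and that the number of people far from this value, such as 150 cm or 190 cm, falls off exponentially.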

(81)

Normal Distribution
• When a random variable $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$, it is written as follows.
$$X \sim N(\mu, \sigma^2)$$

(82)

Linear Transformation of a Normal Distribution
• If $X \sim N(\mu, \sigma^2)$ and $Y = aX + b$, then $Y \sim N(a\mu + b, a^2\sigma^2)$.
$$\mu_Y = E(Y) = E(aX + b) = aE(X) + b = a\mu + b$$
$$\sigma_Y^2 = Var(Y) = Var(aX + b) = a^2 Var(X) = a^2 \sigma^2$$
$$Y \sim N(a\mu + b, a^2\sigma^2)$$

(83)

Log of Gaussian

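$$\log P(x) = \log\left(\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\right)$$
$$= \log\frac{1}{\sigma\sqrt{2\pi}} + \log e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
$$= \log(2\pi\sigma^2)^{-\frac{1}{2}} + \log e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
$$= -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$$
Result:
$$\log P(x) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$$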
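The closed form can be checked against the logarithm of the density evaluated directly. A minimal sketch; the function names and the test point (`x = 1.3`, `mu = 0.5`, `sigma = 2.0`) are arbitrary assumptions.

```python
import math

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_log_pdf(x: float, mu: float, sigma: float) -> float:
    """Closed form: -1/2 * log(2*pi*sigma^2) - (x - mu)^2 / (2*sigma^2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

x, mu, sigma = 1.3, 0.5, 2.0
direct = math.log(gaussian_pdf(x, mu, sigma))
closed_form = gaussian_log_pdf(x, mu, sigma)
print(direct, closed_form)
```

Working with the closed form avoids computing a very small density first, which matters numerically when $x$ is far from $\mu$.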

(84)

Information Theory

(85)

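Information Gain
• How can the amount of information be quantified?
• The lower the probability of an event, the more information it conveys. (An event that always occurs conveys no information.)
Ex) "The number on the die will be less than 7" is information with no meaning at all.
• Let us express the amount of information with a function that is 0 for an event of probability 1 (an event that is certain to occur) and grows without bound as the probability approaches 0 (a function that maps probabilities 0~1 to amounts of information ∞~0).
• That is, the amount of information carried by an event $x$ that occurs with probability $P(x)$:
$$I(x) = \log\frac{1}{P(x)} = -\log P(x), \qquad 0 \le P(x) \le 1$$
[Figure: plot of $I = -\log P(x)$ against $P(x)$]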

(86)

Information Gain
$$I(x) = -\log P(x)$$
High Probability → Low Information
Low Probability → High Information
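The mapping from probability to information can be sketched as follows. The slides do not fix the base of the logarithm; base 2 (information measured in bits) is an assumption here, as is the function name `information`.

```python
import math

def information(p: float, base: float = 2.0) -> float:
    """I(x) = log(1 / P(x)); with base 2 the unit is bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log(1.0 / p) / math.log(base)

print(information(1.0))      # a certain event, e.g. "the die shows less than 7"
print(information(0.5))      # a fair coin landing heads
print(information(1.0 / 6))  # a die showing a particular face
```

The rarer the event, the larger the value, and an event of probability 1 carries no information at all.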

(87)

Entropy
• The expectation of the information gain over all outcomes of a given random variable: the average of information
$$I(x) = -\log P(x)$$
$$H(P(X)) = E_{P(X)}[-\log P(X)] = \sum_{i=1}^{N} [P(x_i) \cdot I(x_i)] = -\sum_{i=1}^{N} [P(x_i) \cdot \log P(x_i)]$$
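Entropy as the probability-weighted average of the information gain can be sketched like this. Base 2 and the example distributions are assumptions for illustration; terms with zero probability are skipped because they contribute nothing to the sum.

```python
import math

def entropy(p, base: float = 2.0) -> float:
    """H(P) = sum_i P(x_i) * log(1 / P(x_i)), skipping zero-probability outcomes."""
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0.0) / math.log(base)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
fair_die = [1.0 / 6] * 6

print(entropy(fair_coin))    # the most uncertain two-outcome distribution
print(entropy(biased_coin))  # more predictable, hence lower entropy
print(entropy(fair_die))     # more equally likely outcomes, hence higher entropy
```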

(88)

Cross Entropy
• What if the information was obtained using wrong information… (e.g., noise)
• The entropy obtained through the wrong probability information $Q(X)$
$$H(P(X), Q(X)) = E_{P(X)}[-\log Q(X)] = -\sum_{i=1}^{N} [P(x_i) \cdot \log Q(x_i)]$$
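A minimal sketch of the cross entropy for discrete distributions, assuming base 2 and two hypothetical distributions `p` (the true one) and `q` (the wrong one). Setting `q = p` recovers the ordinary entropy, and any other `q` gives a value that is at least as large.

```python
import math

def cross_entropy(p, q, base: float = 2.0) -> float:
    """H(P, Q) = -sum_i P(x_i) * log Q(x_i); terms with P(x_i) = 0 are skipped."""
    return sum(pi * math.log(1.0 / qi) for pi, qi in zip(p, q) if pi > 0.0) / math.log(base)

p = [0.5, 0.5]  # true distribution (hypothetical)
q = [0.9, 0.1]  # wrong distribution used to measure the information (hypothetical)

entropy_p = cross_entropy(p, p)  # H(P, P) equals the entropy H(P)
cross_pq = cross_entropy(p, q)
print(entropy_p, cross_pq)
```

The function assumes that `q` assigns a nonzero probability wherever `p` does; otherwise the cross entropy is infinite and the division by zero would raise an error.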

(89)

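KL (Kullback-Leibler) Divergence
• A way to measure how similar two different probability distributions are
• $D_{KL}(P(X) \,\|\, Q(X))$ : the difference incurred when the probability distribution $Q$ approximates the probability distribution $P$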