Variational AutoEncoder
Wanho Choi
(wanochoi.com)
Artificial Neural Networks
Generative Models
• Discriminative Model: learns a decision boundary (e.g., FCN, CNN)
• Generative Model: learns a probability distribution (e.g., Autoencoder, GAN)
Two Most Popular Generative Models
• VAE (Variational AutoEncoder)
‣ Root: Bayesian inference
‣ Goal: to model the underlying probability distribution of the given data so that new data can be sampled from that distribution
• GAN (Generative Adversarial Network)
‣ Root: game theory
‣ Goal: to find the Nash equilibrium
AutoEncoder
Data Compression
• It converts the input into a smaller representation from which the original data can be reproduced to some degree of quality (lossy data compression).
• AE performs data compression with unsupervised learning (SW 2.0), unlike many hard-coded algorithms (SW 1.0).
(Diagram: Original Data → Compressor (encoder) → Compressed (encoded) Data → Decompressor (decoder) → Decompressed Data)
Encoding / Decoding
• Encoding: the process of converting given data into a coded form
• Decoding: the process of restoring the original data from the encoded form
• Codec = (En)coder + Decoder (or Compression + Decompression)
http://aess.com.tr/encoding-decoding-2/
https://www.joydeepdeb.com/tools/url-encoding-decoding.html
AutoEncoder
• Encoder (recognition network): z = f(x), maps the input x ∈ ℝ^d to a latent vector z ∈ ℝ^k (k ≪ d)
• Decoder (generative network): x̂ = g(z), maps the latent vector back to x̂ ∈ ℝ^d
• The latent vector is a lower-dimensional representation: some information is lost, and only the key features are kept.
(Diagram: Input Layer → Hidden Layer (latent vector) → Output Layer)
AutoEncoder
• The general idea of the AE is to squeeze the given information (input data) through a narrow bottleneck between the mirrored encoder (input) and decoder (output) parts of a neural network.
• Because the network architecture and loss function are set up so that the output tries to emulate the input, the network has to learn how to encode the input data in the very limited space represented by the bottleneck.
Some Keywords
• Autoencoder = Sandglass-shaped Net. = Diabolo Net.
• Bottleneck = Latent Space = Manifold
• Code = Features = Hidden Variable = Latent Variable = Latent Vector = Encoding Vector
• Dimensionality Reduction = Feature Detection = Latent Representation = Hidden Representation = Compressed Representation
• Representation Learning = Manifold Learning
(Diagram: image space → latent space → image space)
Manifold
• Euclidean space: a space in which the distance between two points A(a_1, a_2, ..., a_n) and B(b_1, b_2, ..., b_n) is defined as d(A, B) = √( Σ_{i=1}^{n} (a_i − b_i)² )
‣ 1D Euclidean space: line
‣ 2D Euclidean space: plane
‣ 3D Euclidean space: the general 3D space where we live
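As a quick numerical illustration of the distance formula above (a hypothetical example, not from the slides):

import numpy as np

# two points in 3D Euclidean space
A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 6.0, 3.0])

# d(A, B) = sqrt( sum_i (a_i - b_i)^2 )
d = np.sqrt(np.sum((A - B) ** 2))
print(d)                        # 5.0
print(np.linalg.norm(A - B))    # same result via the built-in norm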
Manifold
• Manifold: a space which locally looks Euclidean
• For each point on the manifold, there exists a hyper-sphere centered at that point with a very small radius such that the intersection of this sphere and the manifold can be continuously deformed into a disk. This means one can twist or bend the shape, but not cut or glue it.
• An n-manifold can be thought of as a smooth surface in (n+1)-dimensional space.
• Ex) 2-manifold = smooth surface in 3D space
Manifold in Deep Learning
• Manifold: a low-dimensional sub-space that captures the data living in a high-dimensional space
• Points obtained by moving in a particular direction on the manifold form a set of meaningful results.
• The Euclidean distance between two points in the original high-dimensional space is likely to be a meaningless notion.
• The geodesic distance between two points on the manifold, however, carries real meaning.
• On the manifold, classification of the data also becomes easier and clearer.
https://www.iro.umontreal.ca/~bengioy/talks/gss2012-YB3-algorithms-AE-depth.pdf
f : ℝ^d → ℝ^m (d ≤ m),  x_i = f(τ_i) + ε_i,  x_i ∈ ℝ^m
https://www.slideshare.net/NaverEngineering/ss-96581209
https://en.wikipedia.org/wiki/Manifold_regularization
Curse of Dimensionality
• The curse of dimensionality
• As the dimensionality of the data increases, the size (volume) of the space grows exponentially, so a dataset consisting of the same number of discrete sample points occupies a rapidly shrinking fraction of the whole space.
• Therefore, as the dimensionality of the data increases, the number of samples needed to analyze the data distribution or to estimate a model grows exponentially.
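A small sketch of this effect (my own illustration, not from the slides): the fraction of uniformly drawn points that fall within distance 0.5 of the center of the unit cube collapses toward zero as the dimension grows, even though the number of samples stays the same.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                       # same number of sample points in every dimension

for d in [1, 2, 3, 5, 10, 20]:
    x = rng.uniform(-0.5, 0.5, size=(n, d))       # unit cube centered at the origin
    inside = np.linalg.norm(x, axis=1) < 0.5      # points inside the inscribed ball
    print(d, inside.mean())                       # this fraction shrinks rapidly with d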
Manifold Hypothesis
• Data such as images have a low density in the high-dimensional space, but there exists a lower-dimensional manifold that contains them.
• As soon as you move off this low-dimensional manifold, the data density drops sharply.
https://dsp.stackexchange.com/questions/34126/random-noise-removal-in-images
The Latent Space of MNIST
The Latent Space of Face-like Images
• The space of all face-like images is smaller than the space of all images.
(Diagram: the face-like image manifold is a small region within the space of all images.)
Vanilla AutoEncoder
• The simplest autoencoder
• Three-layer network
• Loss = ∥ x_i − x̂_i ∥²
• Essentially the same as PCA: with linear activations, it learns the same subspace that PCA finds
(Network diagram: inputs x_1, x_2, ..., x_d → latent units z_1, ..., z_k → outputs x̂_1, x̂_2, ..., x̂_d)
Vanilla AE as PCA
import torch
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

LEARNING_RATE = 0.001
TOTAL_EPOCHS = 1000

def Points(n, w1, w2, noise):
    points = np.empty((n, 3))
    angles = np.random.rand(n) * 3 * np.pi / 2 - 0.5
    points[:,0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(n) / 2
    points[:,1] = np.sin(angles) * 0.7 + noise * np.random.randn(n) / 2
    points[:,2] = points[:,0] * w1 + points[:,1] * w2 + noise * np.random.randn(n)
    return points

points = Points(100, 0.1, 0.3, 0.1)
x = torch.from_numpy(points).float()
Vanilla AE as PCA
def Display3D(points):
    fig = plt.figure()
    ax = Axes3D(fig)
    ax.scatter(points[:,0], points[:,1], points[:,2])
    plt.show()

def Display2D(points):
    plt.plot(points[:,0], points[:,1], "b.")
    plt.show()

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.encoder = torch.nn.Linear(3, 2)    # from 3D to 2D
        self.decoder = torch.nn.Linear(2, 3)    # from 2D to 3D
    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
    def z(self, x):                             # latent vector
        z = self.encoder(x)
        return z
Vanilla AE as PCA
model = Model()
CostFunc = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(TOTAL_EPOCHS):
    output = model(x)
    cost = CostFunc(output, x)
    cost.backward()
    optimizer.step()
    optimizer.zero_grad()
    print('Cost: {:.4f}'.format(cost.item()))

Display3D(points)
z = model.z(x).detach()
Display2D(z.numpy())
Vanilla AE as PCA
3D Input Points
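To check the "same as PCA" claim empirically, one can compare the 2D subspace found by the linear autoencoder above with the one found by PCA. A rough sketch, assuming `points` and the trained `model` from the previous slides are still in scope (the match is only approximate, since the AE has biases and the data is not centered during training):

import numpy as np

# PCA via SVD on the centered data
centered = points - points.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pca_basis = Vt[:2]                                  # top-2 principal directions (2 x 3)

# subspace spanned by the columns of the trained decoder weight (3 x 2)
ae_basis = model.decoder.weight.detach().numpy()

# compare the two 2D subspaces via principal angles (both should be close to 0 degrees)
q1, _ = np.linalg.qr(pca_basis.T)
q2, _ = np.linalg.qr(ae_basis)
angles = np.arccos(np.clip(np.linalg.svd(q1.T @ q2)[1], -1.0, 1.0))
print(np.degrees(angles))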
Vanilla AE for MNIST
import torch, torchvision
from matplotlib import pyplot as plt

BATCH_SIZE = 100
LEARNING_RATE = 0.001
TOTAL_EPOCHS = 3

transforms = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
train_dataset = torchvision.datasets.MNIST(root='./data/', train=True, transform=transforms, download=True)
test_dataset = torchvision.datasets.MNIST(root='./data/', train=False, transform=transforms)
train_dataloader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=True)
Vanilla AE for MNIST
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.encoder = torch.nn.Linear(28*28, 100)
        self.decoder = torch.nn.Linear(100, 28*28)
    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

model = Model()
CostFunc = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(TOTAL_EPOCHS):
    for images, _ in train_dataloader:
        images = images.reshape(-1, 784)        # flattening for input
        output = model(images)                  # feed forward
        cost = CostFunc(output, images)         # compare with input images (not with labels)
        cost.backward()
        optimizer.step()
        optimizer.zero_grad()
    print('Cost: {:.4f}'.format(cost.item()))
Vanilla AE for MNIST
plt.figure(figsize=(20, 4))
for i in range(10):
    img = images
    plt.subplot(2, 10, i+1)
    plt.imshow(img[i].reshape(28, 28))
    plt.gray()
    plt.axis('off')
    img = output.detach().numpy()
    plt.subplot(2, 10, i+11)
    plt.imshow(img[i].reshape(28, 28))
    plt.gray()
    plt.axis('off')
plt.tight_layout()
plt.show()
Vanilla AE as De-noiser
import torch, torchvision
from matplotlib import pyplot as plt

BATCH_SIZE = 100
LEARNING_RATE = 0.001
TOTAL_EPOCHS = 3

transforms = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
train_dataset = torchvision.datasets.MNIST(root='./data/', train=True, transform=transforms, download=True)
test_dataset = torchvision.datasets.MNIST(root='./data/', train=False, transform=transforms)
train_dataloader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=True)
Vanilla AE as De-noiser
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.encoder = torch.nn.Linear(28*28, 100)
        self.decoder = torch.nn.Linear(100, 28*28)
    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

model = Model()
CostFunc = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(TOTAL_EPOCHS):
    for images, _ in train_dataloader:
        images = images.reshape(-1, 784)
        images = images + torch.randn(images.size()) * 0.5   # add noise to the input
        output = model(images)
        cost = CostFunc(output, images)   # the target is the noisy image itself;
                                          # the narrow bottleneck is what removes the noise
        cost.backward()
        optimizer.step()
        optimizer.zero_grad()
    print('Cost: {:.4f}'.format(cost.item()))
Vanilla AE as De-noiser
plt.figure(figsize=(20, 4))
for i in range(10):
    img = images
    plt.subplot(2, 10, i+1)
    plt.imshow(img[i].reshape(28, 28))
    plt.gray()
    plt.axis('off')
    img = output.detach().numpy()
    plt.subplot(2, 10, i+11)
    plt.imshow(img[i].reshape(28, 28))
    plt.gray()
    plt.axis('off')
plt.tight_layout()
plt.show()
Deep AE as De-noiser
(Same encoder-bottleneck-decoder diagram as before, now with multiple hidden layers in the encoder and decoder.)
Deep AE as De-noiser
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(28*28, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 4))
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(4, 16), torch.nn.ReLU(),
            torch.nn.Linear(16, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, 28*28), torch.nn.Tanh())
    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

Convolutional AutoEncoder (CAE)
• AE with convolution & pooling layers instead of fully-connected layers
‣ Encoder: down-sampling
‣ Decoder: up-sampling
• The input is not flattened.
(Diagram: the encoder down-samples with convolutional + pooling layers; the decoder up-samples with de-convolutional + de-pooling layers.)
Convolutional AutoEncoder (CAE)
import torch, torchvision
from matplotlib import pyplot as plt

BATCH_SIZE = 100
LEARNING_RATE = 0.001
TOTAL_EPOCHS = 3

transforms = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])
train_dataset = torchvision.datasets.MNIST(root='./data/', train=True, transform=transforms, download=True)
test_dataset = torchvision.datasets.MNIST(root='./data/', train=False, transform=transforms)
train_dataloader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE, shuffle=True)
Convolutional AutoEncoder (CAE)
class Model(torch.nn.Module): def __init__(self): super(Model, self).__init__() # (bs, 28, 28, 1) -> conv -> (bs, 28, 28, 8) -> pool -> (bs, 14, 14, 8) # (bs, 14, 14, 8) -> conv -> (bs, 14, 14, 4) -> pool -> (bs, 7, 7, 4) self.encoder = torch.nn.Sequential(torch.nn.Conv2d(1, 128, 3, 1, 1), # in_channels, out_channels, kernel_size, stride, padding
torch.nn.MaxPool2d(2, 2),
torch.nn.Conv2d(128, 8, 3, 1, 1), # in_channels, out_channels, kernel_size, stride, padding
torch.nn.MaxPool2d(2, 2))
# (kernel_size, strid) = (2, 2) will increase the spatial dims by 2 # (bs, 7, 7, 4) -> (bs, 14, 14, 8)
# (bs, 14, 14, 8) -> (bs, 28, 28, 1)
self.decoder = torch.nn.Sequential(
torch.nn.ConvTranspose2d(8, 128, 2, 2, 0), # in_channels, out_channels, kernel_size, stride, padding
torch.nn.ConvTranspose2d(128, 1, 2, 2, 0)) # in_channels, out_channels, kernel_size, stride, padding def forward(self, x): x = self.encoder(x) x = self.decoder(x) return x
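A quick way to confirm the shape progression in the comments is to push a dummy batch through each stage. A small sketch, assuming the Model class defined above:

import torch

x = torch.randn(1, 1, 28, 28)    # dummy MNIST-sized batch: (batch, channels, height, width)
m = Model()
z = m.encoder(x)
y = m.decoder(z)
print(z.shape)    # torch.Size([1, 8, 7, 7])    - the latent feature map
print(y.shape)    # torch.Size([1, 1, 28, 28])  - back to the input size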
Convolutional AutoEncoder (CAE)
model = Model()
CostFunc = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

for epoch in range(TOTAL_EPOCHS):
    for images, _ in train_dataloader:    # no flattening needed
        output = model(images)
        cost = CostFunc(output, images)
        cost.backward()
        optimizer.step()
        optimizer.zero_grad()
    print('Cost: {:.4f}'.format(cost.item()))
Convolutional AutoEncoder (CAE)
plt.figure(figsize=(20, 4))
for i in range(10):
    img = images
    plt.subplot(2, 10, i+1)
    plt.imshow(img[i].reshape(28, 28))
    plt.gray()
    plt.axis('off')
    img = output.detach().numpy()
    plt.subplot(2, 10, i+11)
    plt.imshow(img[i].reshape(28, 28))
    plt.gray()
    plt.axis('off')
plt.tight_layout()
plt.show()
Symmetry & Weight Sharing
Q) Do autoencoders have to be symmetrical in the encoder and decoder?
A) There is no specific constraint on the symmetry of an autoencoder.
In the beginning, people tended to enforce such symmetry to the maximum: not only were the layers symmetrical, but the weights of the layers in the encoder and decoder were also shared. This is not a requirement, but it allows the use of certain loss functions (e.g., RBM score matching) and can act as regularization, since it effectively halves the number of parameters to optimize. Nowadays, however, hardly anyone imposes encoder-decoder weight sharing.
Regarding architectural symmetry, it is common to find the same number of layers, the same types of layers, and the same layer sizes in the encoder and decoder, but there is no need for that. For instance, in convolutional autoencoders it used to be very common to find convolutional layers in the encoder and deconvolutional layers in the decoder, but now you normally see upsampling layers in the decoder because they have fewer artifact problems.
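A minimal sketch of the weight sharing described above (my own illustration, not from the slides): the decoder simply reuses the transposed encoder weight, so the encoder and decoder share a single weight matrix and only a separate decoder bias is added.

import torch

class TiedAE(torch.nn.Module):
    def __init__(self, d=784, k=100):
        super().__init__()
        self.encoder = torch.nn.Linear(d, k)                  # encoder weight shape: (k, d)
        self.decoder_bias = torch.nn.Parameter(torch.zeros(d))
    def forward(self, x):
        z = self.encoder(x)
        # decoder reuses the encoder weight, transposed to shape (d, k)
        x_hat = torch.nn.functional.linear(z, self.encoder.weight.t(), self.decoder_bias)
        return x_hat

model = TiedAE()
x = torch.randn(4, 784)
print(model(x).shape)    # torch.Size([4, 784])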
AE as a Fake Sample Generator
• After the learning is completed, you can separate the decoder and use it as a fake sample generator.
(Diagram: Encoder + Decoder turn a noisy input into a denoised output; the decoder alone turns latent features into newly generated output.)
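A minimal sketch of this idea (assuming the trained deep autoencoder from the earlier slides, whose latent vector has 4 dimensions): after training, feed a random latent vector directly into the decoder.

import torch

model.eval()
with torch.no_grad():
    z = torch.randn(1, 4)         # a random point in the 4-dimensional latent space
    fake = model.decoder(z)       # decode it into a 784-dimensional "fake" image
    fake_image = fake.reshape(28, 28)
print(fake_image.shape)           # torch.Size([28, 28])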
Logarithmic Functions
Log: Definition
• The logarithm is the inverse function of y = a^x:  y = a^x ⟺ x = log_a y  (a: base, y: argument)
• We only consider a > 0 and a ≠ 1 (to avoid indeterminate or impossible cases).
• Since the two functions are inverses of each other, the graphs of y = a^x and y = log_a x are symmetric about the line y = x.
Log: Graph
• Example with a = 2
• y = 2^x and y = log_2 x are inverse functions of each other.
• For y = log_2 x: as x → 0, y → −∞
• For y = log_2 x: as x → ∞, y → ∞
(Graph: y = 2^x, y = x, and y = log_2 x; log_2 1 = 0 and log_2 ε → −∞ as ε → 0; at x = 1, y = 2^x gives y = 2; at x = 2, y = log_2 x gives y = 1.)
Log: Two Cases
• If a > 1: increasing function (e.g., y = log_2 x)
• If a < 1: decreasing function (e.g., y = log_{1/2} x)
Log: Properties
(For y = a^x ⟺ x = log_a y, writing f(x) = log_a x:)
• log_a 1 = 0, log_a a = 1         : f(1) = 0, f(a) = 1
• log_a x^n = n log_a x            : f(x^n) = n f(x)
• log_a x + log_a y = log_a (xy)   : f(x) + f(y) = f(xy)
• log_a x − log_a y = log_a (x/y)  : f(x) − f(y) = f(x/y)
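The four properties can also be sanity-checked numerically, e.g. with base a = 2 (a small check of my own, not from the slides; results match up to floating-point rounding):

import math

a, x, y, n = 2.0, 8.0, 4.0, 3

def log_a(t):
    return math.log(t, a)    # logarithm with base a

print(log_a(1), log_a(a))                    # 0.0  1.0
print(log_a(x ** n), n * log_a(x))           # both ≈ 9.0
print(log_a(x * y), log_a(x) + log_a(y))     # both ≈ 5.0
print(log_a(x / y), log_a(x) - log_a(y))     # both ≈ 1.0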
Log: Properties (proof)
• log_a 1 = 0, log_a a = 1:
  When x = 0, a^0 = 1, so log_a 1 = 0; when x = 1, a^1 = a, so log_a a = 1.
Log: Properties (proof)
• log_a x^n = n log_a x:
  Let log_a x = m; then a^m = x, so a^{nm} = x^n, and by the definition of the logarithm, log_a x^n = nm = n log_a x.
Log: Properties (proof)
• log_a x + log_a y = log_a (xy):
  Let log_a x = m and log_a y = n; then a^m = x and a^n = y, so a^m × a^n = a^{m+n} = xy. By the definition of the logarithm, log_a (xy) = m + n = log_a x + log_a y.
Log: Properties (proof)
• log_a x − log_a y = log_a (x/y):
  Let log_a x = m and log_a y = n; then a^m = x and a^n = y, so a^m ÷ a^n = a^{m−n} = x ÷ y. By the definition of the logarithm, log_a (x/y) = m − n = log_a x − log_a y.
Natural Logarithm
• The logarithm whose base is e (Euler's number) = 2.71…
• When a = e, the base is often not written out: log_e y = ln y (in these slides, also simply log y).
  y = a^x ⟺ x = log_a y
  y = e^x ⟺ x = log_e y
Exponential and Logarithm
• log e = 1
• log e^k = k
• e^{log k} = k
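These identities are easy to verify numerically; note that here, as in the previous slide, log denotes the natural logarithm log_e (a small check of my own, not from the slides):

import math

k = 5.0
print(math.log(math.e))           # 1.0   : log e = 1
print(math.log(math.e ** k))      # ≈ 5.0 : log e^k = k
print(math.exp(math.log(k)))      # ≈ 5.0 : e^(log k) = k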
Exponential and Logarithm (proof)
• log e = log_e e = 1
• log e^k = k log e = k × 1 = k
• Let e^{log k} = y; then by the definition of the logarithm, log k = log_e y = log y, so y = k.
Probability
Deterministic vs Stochastic
• Deterministic model
‣ The output of the model is fully determined by the system parameters.
• Stochastic (= probabilistic) model
‣ Some inherent randomness
‣ The same set of parameters will lead to different results (ensemble).
Sample Space
• The set of all possible outcomes (events) of a single trial
‣ Sample space for the number rolled on a die: S = {1, 2, 3, 4, 5, 6}
‣ Sample space for the face shown when tossing a coin: S = {heads, tails}
‣ Sample space for the score on a math exam with 10 questions: S = {0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
Probability
• The ratio at which a particular outcome occurs when the same experiment is repeated infinitely many times under identical conditions
• A probability takes a value between 0 and 1.
• The closer it is to 1, the higher the probability, meaning the greater the certainty that the particular outcome will occur over infinitely many trials.
• Probability of rolling a 3 with a die: 1/6
• Probability of getting heads when tossing a coin: 1/2
Random Variable
• A numerical value assigned to an experimental outcome (event) that occurs with a certain probability
• The difference between a random variable and an ordinary variable is whether or not it is observed from a random sample (stochastic vs deterministic).
• By convention, a random variable is written with an uppercase letter, and the values it takes with lowercase letters.
• P(X = x) = p : the probability that the event X = x in the sample space occurs is p.
Mean
• Mean = arithmetic mean = sample mean
• One of the easiest concepts in statistics to understand
• Given N observations (x_1, x_2, ⋯, x_N), the mean is defined as
  μ = (1/N) Σ_{i=1}^{N} x_i
Variance
• How far are the observations spread out from the mean?
• Only the amount of spread matters, so the sign is not important.
• To remove the sign one can take either the absolute value or the square; the absolute value is rarely used because it is awkward for algebraic manipulation.
• Therefore, the variance is defined as the mean of the squared deviations (deviation = observation − mean):
  σ² = (1/N) Σ_{i=1}^{N} (x_i − μ)²   (σ: standard deviation)
Example: Mean and Variance
• Math scores of class 2-1: 65, 72, 57, 92, 45
  μ = (65 + 72 + 57 + 92 + 45) / 5 = 66.2
  σ² = (1/5){(65 − 66.2)² + (72 − 66.2)² + (57 − 66.2)² + (92 − 66.2)² + (45 − 66.2)²} = 246.96
• Math scores of class 2-2: 76, 58, 72, 65, 63
  μ = (76 + 58 + 72 + 65 + 63) / 5 = 66.8
  σ² = (1/5){(76 − 66.8)² + (58 − 66.8)² + (72 − 66.8)² + (65 − 66.8)² + (63 − 66.8)²} = 41.36
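The numbers above can be reproduced directly (a quick check using NumPy; np.var uses the same 1/N definition as the slide, and printed values may show tiny floating-point rounding):

import numpy as np

class1 = np.array([65, 72, 57, 92, 45])
class2 = np.array([76, 58, 72, 65, 63])

print(class1.mean(), class1.var())    # ≈ 66.2  246.96
print(class2.mean(), class2.var())    # ≈ 66.8  41.36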
Expectation Value
• A weighted average
• The sum over all events of (the outcome when each event occurs) × (the probability that the event occurs)
• It has the meaning of an average for a probabilistic event.
  E(X) = Σ_{i=1}^{N} [x_i ⋅ P(x_i)] = μ
Expectation Value
                Return    Probability
  Boom            50%        10%
  Normal          30%        50%
  Recession      −10%        40%

  E(X) = (50 × 0.1) + (30 × 0.5) + (−10 × 0.4) = 5 + 15 − 4 = 16%
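The same expected return computed as a weighted average in code (a quick check of the arithmetic above):

import numpy as np

returns = np.array([50.0, 30.0, -10.0])    # boom, normal, recession (%)
probs   = np.array([0.1, 0.5, 0.4])        # probabilities (sum to 1)

print(np.sum(returns * probs))             # 16.0 (%)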
• E(aX + b) = aE(X) + b

Proof:
E(aX + b) = Σ_{i=1}^{N} [(a x_i + b) P(x_i)]
          = Σ_{i=1}^{N} [a x_i P(x_i) + b P(x_i)]
          = a Σ_{i=1}^{N} [x_i P(x_i)] + b Σ_{i=1}^{N} [P(x_i)]
          = aE(X) + b
(using P(a x_i + b) = P(x_i) and Σ_{i=1}^{N} [P(x_i)] = 1)
• Var(X) = E[(X − μ)²]
• Var(X) = E(X²) − E(X)²

Proof:
Var(X) = E[(X − μ)²] = Σ_{i=1}^{N} [(x_i − μ)² P(x_i)]
       = Σ_{i=1}^{N} [(x_i² − 2μ x_i + μ²) P(x_i)]
       = Σ_{i=1}^{N} [x_i² P(x_i)] − 2μ Σ_{i=1}^{N} [x_i P(x_i)] + μ² Σ_{i=1}^{N} [P(x_i)]
       = E(X²) − 2μ² + μ²
       = E(X²) − μ² = E(X²) − E(X)²
(where μ = Σ_{i=1}^{N} [x_i P(x_i)] = E(X))
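Both identities can be checked on the same discrete distribution used in the return example above (a small verification of my own, not from the slides):

import numpy as np

x = np.array([50.0, 30.0, -10.0])    # possible outcomes
p = np.array([0.1, 0.5, 0.4])        # their probabilities (sum to 1)

def E(v):
    return np.sum(v * p)             # expectation of the values v under p

a, b = 2.0, 3.0
print(E(a * x + b), a * E(x) + b)                # E(aX + b) = aE(X) + b -> both 35.0
mu = E(x)
print(E((x - mu) ** 2), E(x ** 2) - mu ** 2)     # Var(X) both ways      -> both 484.0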