(1)

Parametric Density Estimation (I)

Jin Young Choi

Seoul National University

(2)


Outline

• Density Estimation

• Maximum Likelihood vs Bayesian Learning

• Maximum Likelihood Estimation

• Maximum A-Posteriori (MAP) Estimation

• Example: The Gaussian Case

• Bayesian Learning

• Recursive Bayesian Incremental Learning


(3)

Estimation of PDF

• To get an optimal classifier we need $p(\omega_i)$ and $p(x \mid \omega_i)$, but usually we do not know them.

• Solution: use training data to estimate the unknown probabilities.

• Key task: estimation of the class-conditional density $p(x \mid \omega_i)$.

(4)

Learning of PDF

• Supervised learning: we get to see samples from each of the classes “separately” (called tagged or labeled samples).

• Tagged samples are “expensive”.

• Two methods: parametric (easier) and non-parametric (harder)

(5)

Learning From Observed Data

[Figure: hidden quantities (e.g., the priors $P(\omega_i)$) vs. observed samples, in the supervised and unsupervised settings.]

(6)

Parametric Learning

• Assume specific parametric distributions with parameters:

$p(x \mid \omega_i) \approx p(x \mid \theta_i), \quad \theta_i \in \Theta \subset \mathbb{R}^p$

• Estimate the parameters $\hat{\theta}(D)$ from the training data $D$.

• Replace the true class-conditional density with this approximation and apply the Bayesian framework for decision making.

(7)

Parametric Learning

• Suppose we can assume that the relevant (class-conditional) densities are of some parametric form. That is,

$p(x \mid \omega) \approx p(x \mid \theta), \quad \text{where } \theta \in \Theta \subset \mathbb{R}^p$

• Examples of parameterized densities:

• Binomial: $x$ has $m$ 1's and $(n-m)$ 0's: $p(x \mid \theta) = \binom{n}{m}\,\theta^{m}(1-\theta)^{n-m}, \quad \Theta = [0, 1]$

• Exponential: each data point $x$ is distributed according to $p(x \mid \theta) = \theta e^{-\theta x}, \quad \Theta = (0, \infty)$

• Normal: $p(x \mid \theta) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\dfrac{(x-\mu)^{2}}{2\sigma^{2}}\right), \quad \theta = (\mu, \sigma^{2})$
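
As a quick illustration of the parametric viewpoint (a sketch of mine, not part of the slides), the snippet below evaluates each of these densities for fixed data as a function of its parameter, i.e., as a likelihood; the sample values are made up.

import numpy as np
from math import comb

def binomial_lik(theta, m, n):
    # p(x|theta) for an observation with m ones out of n trials
    return comb(n, m) * theta**m * (1 - theta)**(n - m)

def exponential_lik(theta, x):
    return theta * np.exp(-theta * x)

def normal_density(x, mu, sigma2):
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(binomial_lik(0.3, m=4, n=10))          # binomial likelihood at theta = 0.3
print(exponential_lik(0.5, x=2.0))           # exponential likelihood at theta = 0.5
print(normal_density(1.0, mu=0.0, sigma2=1.0))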

(8)

Parametric Learning

• Bayesian Decision: $\arg\max_{i} \, p(\omega_i \mid x)$

• Maximum Likelihood Estimation: $\arg\max_{\theta_i} \, p(D \mid \theta_i)$

• Maximum A-Posteriori (MAP) Estimation: $\arg\max_{\theta_i} \, p(\theta_i \mid D)$, with $\theta_i$ treated as an (unknown) constant

• Bayesian Learning (Estimation): find $p(\theta_i \mid D)$, with $\theta_i$ treated as a random variable

(9)

Preliminary Examples

• Maximum Likelihood Estimation: $\arg\max_{\theta_i} \, p(D \mid \theta_i)$

(10)

Preliminary Examples

• Maximum Likelihood Estimation: $\arg\max_{\theta_i} \, p(D \mid \theta_i)$

(11)

Preliminary Examples

• Maximum Likelihood Estimation: $\arg\max_{\theta_i} \, p(D \mid \theta_i)$

(12)

Preliminary Examples

• Maximum Likelihood Estimation: $\arg\max_{\theta_i} \, p(D \mid \theta_i)$

(13)

Preliminary Examples

• Maximum Likelihood Estimation: $\arg\max_{\theta_i} \, p(D \mid \theta_i)$

(14)

Preliminary Examples

• Bayesian Learning (Estimation): find $p(\theta_i \mid D)$, with $\theta_i$ treated as a random variable

(15)

Preliminary Examples

• Bayesian Learning (Estimation): find $p(\theta_i \mid D)$, with $\theta_i$ treated as a random variable

(16)

Preliminary Examples

• Bayesian Learning (Estimation): find $p(\theta_i \mid D)$, with $\theta_i$ treated as a random variable

(17)

Preliminary Examples

• Bayesian Learning (Estimation): find $p(\theta_i \mid D)$, with $\theta_i$ treated as a random variable

(18)

Maximum Likelihood vs Bayesian Learning

• Maximum likelihood estimation: estimate the parameter value $\hat{\theta}(D)$ that makes the data most probable (i.e., maximizes the probability of obtaining the sample that has actually been observed):

$p(x \mid D) \approx p(x \mid \hat{\theta}(D)), \quad \hat{\theta}(D) = \arg\max_{\theta} \, p(D \mid \theta)$

• Bayesian learning: define a prior probability $p(\theta)$ on the model space and compute the posterior $p(\theta \mid D)$. Additional samples sharpen the posterior density, which peaks near the true values of the parameters.

(19)

Sampling Model

• It is assumed that a training set $S = \{(x_l, \omega_l) \mid l = 1, \dots, N\}$ with independently generated samples is available.

• The sample set is partitioned into separate sample sets for each class: $D_j = \{\, x_l \mid (x_l, \omega_l) \in S,\ \omega_l = \omega_j \,\}$.

• A generic sample set will simply be denoted by $D = \{x_l \mid l = 1, \dots, N\}$.

• Each class-conditional density $p(x \mid \omega_j)$ is assumed to have a known parametric form and is uniquely specified by a parameter (vector) $\theta_j$.

• Samples in each set $D_j$ are assumed to be independent and identically distributed (i.i.d.) according to some true probability law $p(x \mid \omega_j)$.

(20)


Log-Likelihood function and Score Function

• The sample sets are assumed to be functionally independent, so the training set 𝑆𝑗 contains no information about 𝜃𝑖 for 𝑖 ≠ 𝑗

• The i.i.d. assumption implies that

$p(D_j \mid \theta_j) = \prod_{x \in D_j} p(x \mid \theta_j)$

• Let $D$ be a generic sample set of size $n = |D|$.

• Log-likelihood function:

$l(\theta; D) \equiv \ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$

• The log-likelihood function is identical to the logarithm of the probability density (or mass) function, but is interpreted as a function of parameter 𝜃

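
As a concrete sketch (mine, not from the slides), the code below evaluates $l(\theta; D)$ for a univariate Gaussian with known variance, treating it as a function of the unknown mean; the data and the parameter grid are hypothetical.

import numpy as np

def log_likelihood(theta, data, sigma=1.0):
    # l(theta; D) = sum_k ln p(x_k | theta) for a Gaussian with known sigma
    return np.sum(-0.5 * ((data - theta) / sigma) ** 2
                  - 0.5 * np.log(2 * np.pi * sigma ** 2))

D = np.array([1.2, 0.8, 1.5, 0.9, 1.1])          # hypothetical sample set
thetas = np.linspace(-1.0, 3.0, 401)             # candidate parameter values
ll = np.array([log_likelihood(t, D) for t in thetas])
print("theta maximizing l(theta; D):", thetas[np.argmax(ll)])   # close to D.mean()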

(21)

Log-Likelihood Illustration

• Assume that all the points in $D$ are drawn from some (one-dimensional) normal distribution with some (known) variance and unknown mean.

$p(D \mid \theta) = \prod_{x \in D} p(x \mid \theta)$

$l(\theta; D) \equiv \ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$

(22)

Maximum Likelihood Estimation (MLE)

• The “most likely value” for MLE is given by

$\hat{\theta}(D) = \arg\max_{\theta \in \Theta} \, l(\theta; D)$

• Gradient function:

$\nabla_{\theta}\, l(\theta; D) = \left[ \frac{\partial l(\theta; D)}{\partial \theta_1}, \dots, \frac{\partial l(\theta; D)}{\partial \theta_p} \right]^{t}$

• Necessary condition for the MLE (if it is not on the border of the domain $\Theta$):

$\nabla_{\theta}\, l(\theta; D) = 0$
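
A small numerical check (my sketch, with hypothetical data): for the exponential model $p(x \mid \theta) = \theta e^{-\theta x}$ from an earlier slide, the log-likelihood is $l(\theta; D) = n\ln\theta - \theta\sum_k x_k$, and solving $\nabla_\theta l = 0$ gives $\hat{\theta} = n / \sum_k x_k$.

import numpy as np

D = np.array([0.7, 1.9, 0.4, 1.1, 2.3, 0.8])   # hypothetical i.i.d. sample
n = len(D)

def score(theta):
    # derivative of l(theta; D) = n ln(theta) - theta * sum(x_k)
    return n / theta - D.sum()

theta_hat = n / D.sum()                 # root of the score function (closed form)
print(theta_hat, score(theta_hat))      # score is ~0 at the MLE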

(23)

Maximum A-Posteriori (MAP) Estimation

• The “most likely value” for MAP is given by

$\hat{\theta}(D) = \arg\max_{\theta \in \Theta} \, p(\theta \mid D)$

$= \arg\max_{\theta \in \Theta} \, p(D \mid \theta)\, p_0(\theta) / p(D)$

$= \arg\max_{\theta \in \Theta} \, \ln \prod_{k} p(x_k \mid \theta)\, p_0(\theta)$

$= \arg\max_{\theta \in \Theta} \, \sum_{k} \ln p(x_k \mid \theta) + \ln p_0(\theta)$

• We can ignore the normalizing factor $p(D)$, since it does not depend on $\theta$.
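
A minimal sketch of this objective (not from the slides): the exponential likelihood from before, combined with an arbitrary, purely illustrative prior $p_0(\theta)$, maximized over a grid of $\theta$ values.

import numpy as np

D = np.array([0.7, 1.9, 0.4, 1.1, 2.3, 0.8])      # hypothetical sample
theta = np.linspace(0.01, 5.0, 5000)              # candidate parameter values

log_lik = len(D) * np.log(theta) - theta * D.sum()       # sum_k ln(theta * exp(-theta x_k))
log_prior = -0.5 * (theta - 1.0) ** 2 / 0.5 ** 2         # illustrative prior p0 ~ N(1, 0.5^2), up to a constant

theta_mle = theta[np.argmax(log_lik)]
theta_map = theta[np.argmax(log_lik + log_prior)]        # p(D) is ignored: it does not depend on theta
print(theta_mle, theta_map)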

(24)


Example of Maximum Likelihood

The Gaussian Case: Unknown Mean

• Suppose that the samples are drawn from a multivariate normal population with mean 𝜇, and covariance matrix Σ.

• By MLE, find unknown parameter 𝜃 = 𝜇 for sample data 𝐷 = {𝑥𝑘|𝑘 = 1, … , 𝑛}


(25)

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean

• The log-likelihood is

$l(\mu; D) = \ln p(D \mid \mu) = \ln \prod_{k=1}^{n} p(x_k \mid \mu) = \sum_{k=1}^{n} \ln p(x_k \mid \mu)$, where

$\ln p(x_k \mid \mu) = -\frac{1}{2}(x_k - \mu)^{t}\Sigma^{-1}(x_k - \mu) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma|$

• For a sample point $x_k$, we have

$\nabla_{\mu} \ln p(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$

(26)


Example of Maximum Likelihood

The Gaussian Case: Unknown Mean and Covariance

• In the general multivariate normal case, neither the mean nor the covariance matrix is known.

• Estimate the mean and covariance from data.


(27)

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean and Covariance

• In the general multivariate normal case, neither the mean nor the covariance matrix is known.

• Consider first the univariate case with $\theta_1 = \mu$, $\theta_2 = \sigma^2$.

• The log-likelihood of a single point is

$\ln p(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^{2}$

and its derivative becomes

$\nabla_{\theta}\, l(x_k; \theta) = \nabla_{\theta} \ln p(x_k \mid \theta) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k - \theta_1) \\[6pt] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^{2}}{2\theta_2^{2}} \end{bmatrix}$

(28)

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean and Covariance

• Setting the gradient to zero, and using all the sample points, we get the following necessary conditions:

$\sum_{k} \frac{1}{\theta_2}(x_k - \theta_1) = 0, \qquad \sum_{k} \left( -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^{2}}{2\theta_2^{2}} \right) = 0$

• The optimal solution becomes

$\hat{\mu} = \hat{\theta}_1 = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^{2} = \hat{\theta}_2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^{2}$
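
A minimal numerical check of these closed-form estimates (my own sketch, with hypothetical data):

import numpy as np

D = np.array([2.1, 1.7, 2.9, 2.4, 1.5, 2.2, 2.6])   # hypothetical 1-D sample

mu_hat = D.mean()                         # (1/n) * sum_k x_k
sigma2_hat = np.mean((D - mu_hat) ** 2)   # (1/n) * sum_k (x_k - mu_hat)^2, the biased MLE
print(mu_hat, sigma2_hat)
# Note: np.var(D) uses the same 1/n normalization by default (ddof=0).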

(29)

Example of Maximum Likelihood

The Gaussian multivariate case

• For the multivariate case, it is easy to show that the MLE estimates are given by

$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^{t}$

• The MLE for $\sigma^{2}$ is biased: the expected value, over all data sets of size $n$, of the sample variance is not equal to the true variance, that is,

$E\!\left[ \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^{2} \right] = \frac{n-1}{n}\,\sigma^{2} \neq \sigma^{2}$
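
A quick Monte Carlo sketch (not from the slides; sample size, seed and variance are arbitrary) illustrating that the $1/n$ variance estimate is biased by the factor $(n-1)/n$:

import numpy as np

rng = np.random.default_rng(1)
n, trials, sigma2 = 5, 200_000, 4.0

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)
sigma2_mle = np.mean((samples - mu_hat) ** 2, axis=1)   # 1/n estimator per data set

print(sigma2_mle.mean())        # close to (n-1)/n * sigma2 = 3.2, not 4.0
print((n - 1) / n * sigma2)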

(30)


Interim Summary and Next Outline

• Maximum Likelihood vs Bayesian Learning

• Maximum Likelihood Estimation

• Maximum A-Posteriori (MAP) Estimation

• Example of ML: The Gaussian Case

• Bayesian Learning

• Recursive Bayesian Incremental Learning


(31)

Parametric Density Estimation (II)

Jin Young Choi

Seoul National University

(32)


Outline

• Density Estimation

• Maximum Likelihood vs Bayesian Learning

• Maximum Likelihood Estimation

• Maximum A-Posteriori (MAP) Estimation

• Example: The Gaussian Case

• Bayesian Learning

• Recursive Bayesian Incremental Learning


(33)

Parametric Learning

• Bayesian Decision: $\arg\max_{i} \, p(\omega_i \mid x)$

• Maximum Likelihood Estimation: $\arg\max_{\theta_i} \, p(D \mid \theta_i)$

• Maximum A-Posteriori (MAP) Estimation: $\arg\max_{\theta_i} \, p(\theta_i \mid D)$, with $\theta_i$ treated as an (unknown) constant

• Bayesian Learning (Estimation): find $p(\theta_i \mid D)$, with $\theta_i$ treated as a random variable

(34)

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean and Covariance

• In the general multivariate normal case, neither the mean nor the covariance matrix is known.

• Consider first the univariate case with $\theta_1 = \mu$, $\theta_2 = \sigma^2$.

• The log-likelihood of a single point is

$\ln p(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^{2}$

and its derivative becomes

$\nabla_{\theta}\, l(x_k; \theta) = \nabla_{\theta} \ln p(x_k \mid \theta) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k - \theta_1) \\[6pt] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^{2}}{2\theta_2^{2}} \end{bmatrix}$

(35)

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean and Covariance

• Setting the gradient to zero, and using all the sample points, we get the following necessary conditions:

$\sum_{k} \frac{1}{\theta_2}(x_k - \theta_1) = 0, \qquad \sum_{k} \left( -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^{2}}{2\theta_2^{2}} \right) = 0$

• The optimal solution becomes

$\hat{\mu} = \hat{\theta}_1 = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^{2} = \hat{\theta}_2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^{2}$

(36)

Example of Maximum Likelihood

The Gaussian multivariate case

• For the multivariate case, it is easy to show that the MLE estimates are given by

$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^{t}$

• The MLE for $\sigma^{2}$ is biased: the expected value, over all data sets of size $n$, of the sample variance is not equal to the true variance, that is,

$E\!\left[ \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^{2} \right] = \frac{n-1}{n}\,\sigma^{2} \neq \sigma^{2}$

(37)

Example of Maximum Likelihood:

The Gaussian multivariate case

• Unbiased estimators for $\mu$ and $\Sigma$ are given by

$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad C = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^{t}$

• $C$ is called the unbiased sample covariance matrix:

$E\!\left[ \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \hat{\mu})^{2} \right] = \frac{n-1}{n-1}\,\sigma^{2} = \sigma^{2}$

• The biased sample variance $\hat{\sigma}^{2}$ (with the $1/n$ normalization) is asymptotically unbiased.
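
A short sketch (mine, with random data) comparing the biased MLE covariance and the unbiased sample covariance; note that numpy's np.cov uses the $1/(n-1)$ normalization by default.

import numpy as np

rng = np.random.default_rng(2)
D = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]], size=10)
n = len(D)
mu_hat = D.mean(axis=0)
diff = D - mu_hat

Sigma_mle = diff.T @ diff / n        # biased MLE estimate (1/n)
C = diff.T @ diff / (n - 1)          # unbiased sample covariance (1/(n-1))
print(np.allclose(C, np.cov(D, rowvar=False)))   # np.cov defaults to 1/(n-1)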

(38)

Bayesian Learning

• The aim of decision making is to find the posteriors $p(\omega_i \mid x)$ from $p(x \mid \omega_i)$ and $p(\omega_i)$, but these are unknown.

• How can we estimate them from the sample set $D$?

• Generally used assumptions:

  – Priors are generally known or obtainable from trivial calculations, so $p(\omega_i) \approx p(\omega_i \mid D)$. (Note that $p(\omega_i) \neq p(\omega_i \mid x)$; $D$ includes the labels.)

  – The training set can be separated into $c$ subsets: $D_1, \dots, D_c$.

  – $D_j$ has no influence on $p(\mathbf{x} \mid \omega_i, D_i)$ if $i \neq j$.

  – We therefore omit the class index and estimate $p(\mathbf{x} \mid \omega_i, D_i)$ for each class separately:

$p(\mathbf{x} \mid \omega_i) = p(\mathbf{x} \mid \omega_i, D_i) \cong p(\mathbf{x} \mid D_i)$

(39)

Bayesian Learning: General Theory

Bayesian learning considers the parameter vector to be estimated, $\theta$ (i.e., $\theta_i$ for class $\omega_i$), to be a random variable.

Before we observe the data, the parameters are described by a prior $p(\theta)$, which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior $p(\theta \mid D)$. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning (see figure).

$p(\mathbf{x} \mid \omega_i) = p(\mathbf{x} \mid \omega_i, D_i) \cong p(\mathbf{x} \mid D_i) \leftarrow p(\mathbf{x} \mid \theta),\ p(\theta \mid D_i)$, where $\theta$ is a latent random variable.

(40)

Bayesian Learning: General Theory

• Density function for $x$, given the training data set $D_i$:

$p(x \mid \omega_i) \cong p(x \mid D_i) = \int p(x, \theta \mid D_i)\, d\theta$

• By the chain rule, $p(x, \theta \mid D) = p(x \mid \theta, D)\, p(\theta \mid D)$.

• By the independence of $x$ and $D$ given $\theta$: $p(x \mid \theta, D) = p(x \mid \theta)$.

• Therefore $p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$ (which reduces to $p(x \mid \hat{\theta})$ if $p(\theta \mid D)$ is an impulse at $\hat{\theta}$).

• Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average of $p(x \mid \theta)$ over all values of $\theta$.

• The weighting factor $p(\theta \mid D)$, the posterior of $\theta$, is determined by starting from some assumed prior $p(\theta)$.
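
A numerical sketch of this weighted average (mine, with made-up numbers): $p(x \mid D)$ is approximated by summing $p(x \mid \theta)\,p(\theta \mid D)$ over a grid of $\theta$, here with a Gaussian likelihood of known variance and a Gaussian posterior.

import numpy as np

def gauss(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

sigma2 = 1.0                    # known variance of p(x|theta) (assumed)
mu_n, sigma2_n = 0.8, 0.05      # hypothetical posterior p(theta|D) = N(mu_n, sigma2_n)

theta = np.linspace(-4.0, 5.0, 4001)
dtheta = theta[1] - theta[0]
posterior = gauss(theta, mu_n, sigma2_n)

x = 1.5
p_x_given_D = np.sum(gauss(x, theta, sigma2) * posterior) * dtheta
print(p_x_given_D)                        # numerical weighted average
print(gauss(x, mu_n, sigma2 + sigma2_n))  # closed form N(mu_n, sigma^2 + sigma_n^2), derived on a later slide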

(41)

Bayesian Learning: General Theory

• Then update it using Bayes' formula to take account of the data set $D$. Since the samples $D = \{x_1, \dots, x_n\}$ are drawn independently,

$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$,

which is the likelihood function.

• The posterior for $\theta$ is

$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(\theta)}{p(D)} \prod_{k=1}^{n} p(x_k \mid \theta)$

(42)

Bayesian Learning

– Univariate Normal Distribution

• Let us use the Bayesian estimation technique to calculate the a posteriori density $p(\theta \mid D)$ and the desired probability density (likelihood) $p(x \mid D)$ for the case $p(x \mid \mu) \sim N(\mu, \Sigma)$, where $\theta := \mu$.

• Univariate case: $p(\mu \mid D)$

• Let $\mu$ be the only unknown parameter: $p(x \mid \mu) \sim N(\mu, \sigma^{2})$

$p(\mathbf{x} \mid \omega_i) = p(\mathbf{x} \mid \omega_i, D_i) \cong p(\mathbf{x} \mid D_i) \leftarrow p(\mathbf{x} \mid \theta),\ p(\theta \mid D_i)$, where $\theta$ is a latent random variable.

(43)

Bayesian Learning

– Univariate Normal Distribution

• Conjugate prior: a normal distribution over $\mu$, $p(\mu) \sim N(\mu_0, \sigma_0^{2})$; $\mu_0$ encodes some prior knowledge about the true mean, while $\sigma_0^{2}$ measures our prior uncertainty.

• If $\mu$ is drawn from $p(\mu)$, then the density for $x$ is completely determined. Letting $D = \{x_1, \dots, x_n\}$, we use

$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu}$

(44)

Bayesian Learning

– Univariate Normal Distribution

$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[ -\frac{1}{2}\left( \frac{\mu - \mu_n}{\sigma_n} \right)^{2} \right]$

[Figure: the posterior density $p(\mu \mid D)$, with prior spread $\sigma_0$.]

(45)

Bayesian Learning

– Univariate Normal Distribution

Computing the posterior distribution:

$p(\mu \mid D) = \alpha\, p(D \mid \mu)\, p(\mu) = \alpha \exp\!\left[ -\frac{1}{2}\left( \sum_{k=1}^{n} \left( \frac{x_k - \mu}{\sigma} \right)^{2} + \left( \frac{\mu - \mu_0}{\sigma_0} \right)^{2} \right) \right]$

$= \alpha'' \exp\!\left[ -\frac{1}{2}\left\{ \left( \frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}} \right)\mu^{2} - 2\left( \frac{1}{\sigma^{2}}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^{2}} \right)\mu \right\} \right]$

Since $p(\mu \mid D)$ should be Gaussian,

$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[ -\frac{1}{2}\left( \frac{\mu - \mu_n}{\sigma_n} \right)^{2} \right] = \alpha''' \exp\!\left[ -\frac{1}{2\sigma_n^{2}}\left( \mu^{2} - 2\mu_n \mu \right) \right]$

(46)

Bayesian Learning

– Univariate Normal Distribution

• Solution:

$\frac{1}{\sigma_n^{2}} = \frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}, \qquad \frac{\mu_n}{\sigma_n^{2}} = \frac{n}{\sigma^{2}}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^{2}}$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.

• Solving explicitly for $\mu_n$ and $\sigma_n^{2}$ we obtain

$\sigma_n^{2} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}, \qquad \mu_n = \frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\hat{\mu}_n + \frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\mu_0$

• $\mu_n$ represents our best guess for $\mu$ after observing $n$ samples.

• $\sigma_n^{2}$ measures our uncertainty about this guess.

• $\sigma_n^{2}$ decreases monotonically with $n$ (approaching $\sigma_n^{2} \approx \sigma^{2}/n \to 0$ as $n$ approaches infinity).

$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[ -\frac{1}{2}\left( \frac{\mu - \mu_n}{\sigma_n} \right)^{2} \right]$

(47)

Bayesian Learning

– Univariate Normal Distribution

• Each additional observation decreases our uncertainty about the true value of $\mu$.

• As $n$ increases, $p(\mu \mid D)$ becomes more and more sharply peaked, approaching a Dirac delta function as $n$ approaches infinity. This behavior is known as Bayesian learning.

(48)

Bayesian Learning

– Univariate Normal Distribution

$\mu_n = \frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\hat{\mu}_n + \frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\mu_0, \qquad \sigma_n^{2} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}$

• In general, $\mu_n$ is a linear combination of $\hat{\mu}_n$ and $\mu_0$, with coefficients that are non-negative and sum to 1.

• Thus $\mu_n$ lies somewhere between $\hat{\mu}_n$ and $\mu_0$.

• If $\sigma_0 \neq 0$, then $\mu_n \to \hat{\mu}_n$ as $n \to \infty$.

• If $\sigma_0 = 0$, our a priori certainty that $\mu = \mu_0 = \mu_n$ is so strong that no number of observations can change our opinion.

• If $\sigma_0 \gg \sigma$, the a priori guess is uncertain, and we take $\mu_n = \hat{\mu}_n$.

• The ratio $\sigma^{2}/\sigma_0^{2}$ is called the dogmatism.
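
A short numerical sketch of these updates (mine; the prior and data values are hypothetical), showing how $\mu_n$ moves from the prior mean toward the sample mean as $n$ grows:

import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma = 2.0, 1.0        # data-generating mean and known std (assumed)
mu0, sigma0 = 0.0, 1.0           # hypothetical prior N(mu0, sigma0^2)

for n in (1, 5, 20, 100):
    D = rng.normal(mu_true, sigma, size=n)
    mu_hat = D.mean()
    var_n = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    mu_n = (n * sigma0**2 * mu_hat + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
    print(n, round(mu_n, 3), round(var_n, 4))   # mu_n -> sample mean, sigma_n^2 -> 0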

(49)

Bayesian Learning

– Univariate Normal Distribution

• The univariate case:

$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu$

$= \int \frac{1}{2\pi\sigma\sigma_n} \exp\!\left[ -\frac{1}{2}\left( \frac{x - \mu}{\sigma} \right)^{2} - \frac{1}{2}\left( \frac{\mu - \mu_n}{\sigma_n} \right)^{2} \right] d\mu$

$= \frac{1}{2\pi\sigma\sigma_n} \exp\!\left[ -\frac{1}{2}\, \frac{(x - \mu_n)^{2}}{\sigma^{2} + \sigma_n^{2}} \right] f(\sigma, \sigma_n)$

$\propto \exp\!\left[ -\frac{1}{2}\, \frac{(x - \mu_n)^{2}}{\sigma^{2} + \sigma_n^{2}} \right] \;\xrightarrow{\,n \to \infty\,}\; \exp\!\left[ -\frac{1}{2}\, \frac{(x - \mu)^{2}}{\sigma^{2}} \right]$

$p(\mathbf{x} \mid \omega_i) = p(\mathbf{x} \mid \omega_i, D_i) \cong p(\mathbf{x} \mid D_i) \leftarrow p(\mathbf{x} \mid \theta),\ p(\theta \mid D_i)$, where $\theta$ is a latent random variable.

(50)

Bayesian Learning

– Univariate Normal Distribution

• Since $p(x \mid D) \propto \exp\!\left[ -\frac{1}{2}\, \frac{(x - \mu_n)^{2}}{\sigma^{2} + \sigma_n^{2}} \right]$, we can write $p(x \mid D) \sim N(\mu_n, \sigma^{2} + \sigma_n^{2})$.

• To obtain the class-conditional density $p(x \mid D)$, whose parametric form is known to be $p(x \mid \mu) \sim N(\mu, \sigma^{2})$, we replace $\mu$ by $\mu_n$ and $\sigma^{2}$ by $\sigma^{2} + \sigma_n^{2}$.

• The conditional mean $\mu_n$ is treated as if it were the true mean, and the known variance is increased by $\sigma_n^{2}$ to account for the additional uncertainty in $x$ resulting from our lack of exact knowledge of the mean $\mu$.

$p(\mathbf{x} \mid \omega_i) = p(\mathbf{x} \mid \omega_i, D_i) \cong p(\mathbf{x} \mid D_i) \leftarrow p(\mathbf{x} \mid \theta),\ p(\theta \mid D_i)$, where $\theta$ is a latent random variable.

(51)

Example (demo-MAP)

• We have $N$ points generated by a one-dimensional Gaussian, $p(x \mid \mu) \sim N(\mu, 1)$.

• The prior probability is given by $p(\mu) \sim N(0, \sigma_0^{2})$.

• The total objective function is

$E = \sum_{n=1}^{N} \log p(x_n \mid \mu) + \log p(\mu)$,

which is maximized to give

$\hat{\mu}_{MAP} = \frac{1}{N + 1/\sigma_0^{2}} \sum_{n=1}^{N} x_n$

• For $\sigma_0^{2} \gg 1$, the influence of the prior is negligible and the result is the ML estimate. But for a very strong belief in the prior ($\sigma_0^{2} \to 0$), the estimate tends to zero. Thus, if few samples are available, the prior dominates the estimate.
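
A sketch of this demo (mine; the data are simulated), showing how the MAP estimate shrinks toward zero as the prior variance $\sigma_0^{2}$ shrinks:

import numpy as np

rng = np.random.default_rng(4)
mu_true, N = 3.0, 5
x = rng.normal(mu_true, 1.0, size=N)        # p(x|mu) = N(mu, 1)

mu_ml = x.mean()
for sigma0_sq in (100.0, 1.0, 0.1, 0.001):
    mu_map = x.sum() / (N + 1.0 / sigma0_sq)   # closed-form MAP with prior N(0, sigma0^2)
    print(sigma0_sq, round(mu_map, 3))          # wide prior -> close to ML; narrow prior -> close to 0
print("ML estimate:", round(mu_ml, 3))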

(52)

Recursive Bayesian Incremental Learning

• We have seen that $p(D^n \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta) = p(x_n \mid \theta) \prod_{k=1}^{n-1} p(x_k \mid \theta)$. Let us define $D^n = \{x_1, \dots, x_n\}$. Then

$p(D^n \mid \theta) = p(x_n \mid \theta)\, p(D^{n-1} \mid \theta)$.

• Substituting into $p(\theta \mid D^n)$ and using Bayes' rule, we have

$p(\theta \mid D^n) \propto p(D^n \mid \theta)\, p(\theta) = p(x_n \mid \theta)\, p(D^{n-1} \mid \theta)\, p(\theta)$

$= p(x_n \mid \theta)\, \frac{p(\theta \mid D^{n-1})\, p(D^{n-1})}{p(\theta)}\, p(\theta) \propto p(x_n \mid \theta)\, p(\theta \mid D^{n-1})$
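
A small sketch of this recursion (mine, with hypothetical data): the posterior over $\theta$ is kept on a discrete grid and multiplied by the likelihood of each new sample, then renormalized.

import numpy as np

def gauss(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]
posterior = gauss(theta, 0.0, 4.0)            # p(theta|D^0) = prior, here N(0, 2^2)

rng = np.random.default_rng(5)
for x_n in rng.normal(1.5, 1.0, size=20):     # incoming samples, p(x|theta) = N(theta, 1)
    posterior *= gauss(x_n, theta, 1.0)       # p(theta|D^n) ~ p(x_n|theta) p(theta|D^{n-1})
    posterior /= posterior.sum() * dtheta     # renormalize

print(theta[np.argmax(posterior)])            # posterior concentrates near the true mean 1.5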

(53)

Recursive Bayesian Incremental Learning

• While $p(\theta \mid D^0) = p(\theta)$, repeated use of $p(\theta \mid D^n) \propto p(x_n \mid \theta)\, p(\theta \mid D^{n-1})$ produces the sequence

$p(\theta),\ p(\theta \mid x_1) = p(\theta \mid D^1),\ p(\theta \mid x_1, x_2) = p(\theta \mid D^2),\ \dots$

• This is called the recursive Bayes approach to parameter estimation (also called incremental or on-line learning).

• When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning.

(54)

Recursive Bayesian Incremental Learning

• Example of the recursive formula for the Gaussian case:

$p(\theta \mid D^n) \propto p(x_n \mid \theta)\, p(\theta \mid D^{n-1})$

$p(\theta \mid D^{n-1}) = \frac{1}{\sqrt{2\pi}\,\sigma_{n-1}} \exp\!\left[ -\frac{1}{2}\left( \frac{\theta - \mu_{n-1}}{\sigma_{n-1}} \right)^{2} \right]$

$p(\theta \mid D^n) \propto \exp\!\left[ -\frac{1}{2}\left( \frac{x_n - \theta}{\sigma} \right)^{2} \right] \exp\!\left[ -\frac{1}{2}\left( \frac{\theta - \mu_{n-1}}{\sigma_{n-1}} \right)^{2} \right] \propto \exp\!\left[ -\frac{1}{2}\left( \frac{\theta - \mu_n}{\sigma_n} \right)^{2} \right]$

• Batch form (prior $N(\mu_0, \sigma_0^{2})$ and $n$ samples):

$\mu_n = \frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\hat{\mu}_n + \frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\mu_0, \qquad \sigma_n^{2} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}$

• Recursive form (the posterior $N(\mu_{n-1}, \sigma_{n-1}^{2})$ acts as the prior for $x_n$):

$\mu_n = \frac{\sigma_{n-1}^{2}}{\sigma_{n-1}^{2} + \sigma^{2}}\,x_n + \frac{\sigma^{2}}{\sigma_{n-1}^{2} + \sigma^{2}}\,\mu_{n-1}, \qquad \sigma_n^{2} = \frac{\sigma_{n-1}^{2}\sigma^{2}}{\sigma_{n-1}^{2} + \sigma^{2}}$

(55)

Maximum Likelihood vs. Bayesian

• ML and Bayesian estimations are asymptotically equivalent and “consistent”: they yield the same class-conditional densities when the size of the training data grows to infinity.

• ML is typically computationally easier: ML requires (multidimensional) differentiation, whereas Bayesian estimation requires (multidimensional) integration.

• ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian learning gives a weighted average of models.

• But for finite training data (and given a reliable prior), Bayesian estimation is more accurate (it uses more of the information).

• Bayesian estimation with a “flat” prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.

(56)

Maximum Likelihood vs. Bayesian

• MLE and MAP (point estimates):

$\hat{\theta}^{n}_{MLE} = \arg\max_{\theta}\, p(D^n \mid \theta), \qquad \hat{\theta}^{n}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D^n)$

$p(x \mid D^n) \approx p(x \mid \hat{\theta}^{n}) \sim N(\hat{\theta}^{n}, \sigma^{2})$

• Bayesian Learning:

$p(\theta \mid D^n) = \alpha\, p(D^n \mid \theta)\, p(\theta) = f(\theta; \mu_n, \sigma_n^{2}) = f(\theta; n, \hat{\mu}_n, \mu_0, \sigma_0^{2})$

$p(x \mid D^n) = \int p(x \mid \theta)\, p(\theta \mid D^n)\, d\theta \sim N(\mu_n, \sigma^{2} + \sigma_n^{2})$

(57)

Interim Summary

• In MLE estimation: $p(x \mid D) \approx p(x \mid \hat{\theta}(D))$, with $\hat{\theta}(D) = \arg\max_{\theta}\, p(D \mid \theta)$.

• In Bayesian Learning:

  – Weighted average of candidate models: $p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$

  – Gaussian model: $\mu_n = \frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\hat{\mu}_n + \frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\mu_0, \quad \sigma_n^{2} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}, \quad p(x \mid D) \sim N(\mu_n, \sigma^{2} + \sigma_n^{2})$

  – Incremental Bayesian learning: $p(\theta \mid D^n) \propto p(x_n \mid \theta)\, p(\theta \mid D^{n-1})$
