Parametric Density Estimation (I)

(1)

J. Y. Choi. SNU

Parametric Density Estimation (I)

Jin Young Choi

Seoul National University

1

(2)

J. Y. Choi. SNU

Outline

• Density Estimation

• Maximum Likelihood vs Bayesian Learning

• Maximum Likelihood Estimation

• Maximum A-Posteriori (MAP) Estimation

• Example: The Gaussian Case

• Bayesian Learning

• Recursive Bayesian Incremental Learning

2

(3)

J. Y. Choi. SNU

Estimation of PDF

• To get an optimal classifier we need 𝑝(𝑤_𝑖) and 𝑝(𝑥|𝑤_𝑖) , but usually we do not know them.

• Solution: to use training data to estimate the unknown

probabilities.

• Key task: Estimation of class conditional density

3

(4)

J. Y. Choi. SNU

Learning of PDF

• Supervised learning: we get to see samples from each of the classes

“separately” (called tagged or labeled samples).

• Tagged samples are “expensive”.

• Two methods: parametric (easier) and non-parametric (harder)

4

(5)

J. Y. Choi. SNU

Learning From Observed Data

5

) ( _i P 

)

| (_x ₁

P _P(_x |₂) P ₍x _|₃₎

Hidden Observed

Unsupervised

Supervised

(6)

J. Y. Choi. SNU

Parametric Learning

• Assume specific parametric distributions with parameters:

𝑝 𝑥 𝜔_𝑖 ≈ 𝑝 𝑥|𝜃_𝑖 , 𝜃_𝑖 ∈ Θ ⊂ 𝑅^𝑝

• Estimate parameters 𝜃(𝐷)෠ from training data D

• Replace true value of class-conditional density with approximation and apply the Bayesian framework for decision making.

6

(7)

J. Y. Choi. SNU

Parametric Learning

• Suppose we can assume that the relevant (class-conditional) densities are of some parametric form. That is,

𝑝 𝑥 𝜔 ≈ 𝑝 𝑥 𝜃 , 𝑤ℎ𝑒𝑟𝑒 𝜃 ∈ Θ ∈ 𝑅^𝑝

• Examples of parameterized densities:

• Binomial: 𝑥 has 𝑚 1’s and (𝑛 − 𝑚) 0’s 𝑝 𝑥 𝜃 = 𝑛

𝑚 𝜃^𝑚(1 − 𝜃)^𝑛−𝑚, Θ = [0, 1]

• Exponential: Each data point 𝑥 is distributed according to 𝑝 𝑥 𝜃 = 𝜃𝑒^−𝜃𝑥, Θ = (0, ∞)

• Normal:

𝑝 𝑥 𝜃 ~𝑁(𝜇, 𝜎²)

7

(8)

J. Y. Choi. SNU

Parametric Learning

• Bayesian Decision 𝑎𝑟𝑔 max

𝑖 𝑝(𝜔_𝑖|𝑥)

• Maximum Likelihood Estimation 𝑎𝑟𝑔 max

𝜃_𝑖 𝑝(𝐷|𝜃_𝑖)

• Maximum A-Posteriori Estimation 𝑎𝑟𝑔 max

𝜃_𝑖 𝑝 𝜃_𝑖 𝐷 , 𝜃_𝑖 = 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡

• Bayesian Learning (Estimation)

Find 𝑝 𝜃_𝑖 𝐷 , 𝜃_𝑖 = 𝑟𝑎𝑛𝑑𝑜𝑚 𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒

8

(9)

J. Y. Choi. SNU

Preliminary Examples

9

(10)

J. Y. Choi. SNU

Preliminary Examples

10

(11)

J. Y. Choi. SNU

11

(12)

J. Y. Choi. SNU

12

(13)

J. Y. Choi. SNU

13

(14)

J. Y. Choi. SNU

Preliminary Examples

14

(15)

J. Y. Choi. SNU

15

(16)

J. Y. Choi. SNU

16

𝜇

(17)

J. Y. Choi. SNU

17

𝜇

𝜃 =

(18)

J. Y. Choi. SNU

Maximum Likelihood vs Bayesian Learning

• Maximum likelihood estimation: estimate parameter value 𝜃(𝐷)෠ that makes the data most probable (i.e., maximizes the probability of

obtaining the sample that has actually been observed), 𝑝 𝑥 𝐷 ≈ 𝑝(𝑥| ෠𝜃 𝐷 ), ෠𝜃 𝐷 = 𝑎𝑟𝑔 max

𝜃 𝑝(𝐷|𝜃)

• Bayesian learning: define a prior probability on the model space 𝑝(𝜃) and compute the posterior 𝑝(𝜃|𝐷). Additional samples sharp the

posterior density which peaks near the true values of the parameters .

18

(19)

J. Y. Choi. SNU

Sampling Model

• It is assumed that a training set 𝑆 = {(𝑥_𝑙, ഥ𝜔_𝑙)|𝑙 = 1, … , 𝑁} with independently generated samples is available.

• The sample set is partitioned into separate sample sets for each class, 𝐷_𝑗 = 𝑥_𝑙 𝑥_𝑙, ഥ𝜔_𝑙 ∈ 𝑆_𝑗

• A generic sample set will simply be denoted by 𝐷 = {𝑥_𝑙|𝑙 = 1, … , 𝑁}

• Each class-conditional 𝑝(𝑥|𝜔_𝑗) is assumed to have a known

parametric form and is uniquely specified by a parameter (vector) 𝜃_𝑗.

• Samples in each set 𝐷_𝑗 are assumed to be independent and identically distributed (i.i.d.) according to some true probability law 𝑝(𝑥|𝜔_𝑗).

19

(20)

J. Y. Choi. SNU

Log-Likelihood function and Score Function

• The sample sets are assumed to be functionally independent, so the training set 𝑆_𝑗 contains no information about 𝜃_𝑖 for 𝑖 ≠ 𝑗

• The i.i.d. assumption implies that

𝑝 𝐷_𝑗 𝜃_𝑗 = ς_𝑥∈𝐷_𝑗 𝑝(𝑥|𝜃_𝑗)

• Let 𝐷 be a generic sample set of size 𝑛 = 𝐷

• Log-likelihood function:

𝑙 𝜃; 𝐷 ≡ ln 𝑝(𝐷|𝜃) = ෍

𝑘=1 𝑛

ln 𝑝(𝑥_𝑘|𝜃)

• The log-likelihood function is identical to the logarithm of the probability density (or mass) function, but is interpreted as a function of parameter 𝜃

20

(21)

J. Y. Choi. SNU

Log-Likelihood Illustration

• Assume that all the points in 𝐷 are drawn from some (one-dimensional) normal distribution with some (known) variance and unknown mean.

21

𝑝 𝐷 𝜃 = ෑ

𝑥∈𝐷

𝑝(𝑥|𝜃)

𝑙 𝜃; 𝐷 ≡ ln 𝑝(𝐷|𝜃) = ෍

𝑘=1 𝑛

ln 𝑝(𝑥_𝑘|𝜃)

(22)

J. Y. Choi. SNU

Maximum Likelihood Estimation (MLE)

• The “most likely value” for MLE is given by 𝜃 𝐷 = 𝑎𝑟𝑔max෠

𝜃∈Θ 𝑙(𝜃; 𝐷)

• Gradient function

𝛻_𝜃𝑙 𝜃; 𝐷 = ^{𝜕𝑙 𝜃;𝐷}

𝜕𝜃₁ , … ,^{𝜕𝑙 𝜃;𝐷}

𝜕𝜃_𝑝

• Necessary condition for MLE (if not on border of domain Θ)

𝛻_𝜃𝑙 𝜃; 𝐷 = 0

22

(23)

J. Y. Choi. SNU

Maximum A-Posteriori (MAP) Estimation

• The “most likely value” for MAP is given by

𝜃 𝐷 = 𝑎𝑟𝑔max෠

𝜃∈Θ 𝑝(𝜃|𝐷)

= 𝑎𝑟𝑔max

𝜃∈Θ 𝑝(𝐷|𝜃) 𝑝₀( 𝜃)/𝑝(𝐷)

= 𝑎𝑟𝑔max

𝜃∈Θ ln ෑ

𝑘

𝑝 𝑥_𝑘 𝜃 𝑝₀(𝜃)

= 𝑎𝑟𝑔max

𝜃∈Θ ෍

𝑘

ln 𝑝 𝑥_𝑘 𝜃 + ln 𝑝₀(𝜃)

• We can ignore the normalizing factor 𝑝(𝐷)

23

(24)

J. Y. Choi. SNU

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean

• Suppose that the samples are drawn from a multivariate normal population with mean 𝜇, and covariance matrix Σ.

• By MLE, find unknown parameter 𝜃 = 𝜇 for sample data 𝐷 = {𝑥_𝑘|𝑘 = 1, … , 𝑛}

24

(25)

J. Y. Choi. SNU

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean

• Log-likelihood is

𝑙 𝜇; 𝐷 = ln 𝑝 𝐷 𝜇 = ln ς_𝑘=1^𝑛 𝑝 𝑥_𝑘 𝜇 = σ_𝑘=1^𝑛 ln 𝑝 𝑥_𝑘 𝜇 where

ln 𝑝 𝑥_𝑘 𝜇 = − ¹

2 𝑥_𝑘 − 𝜇 ^𝑡Σ⁻¹ 𝑥_𝑘 − 𝜇 − ^𝑑

2 ln 2𝜋 − ¹

2 ln Σ

• For a sample point 𝑥_𝑘, we have

𝛻_𝜇 ln 𝑝 𝑥_𝑘 𝜇 = Σ⁻¹ 𝑥_𝑘 − 𝜇

• The maximum likelihood estimate for 𝜇 must satisfy σ_𝑘=1^𝑛 Σ⁻¹ 𝑥_𝑘 − 𝜇 = 0 −→ 𝜇 = ¹

𝑛 σ_𝑘=1^𝑛 𝑥_𝑘 (sample mean)

25

(26)

J. Y. Choi. SNU

The Gaussian Case: Unknown Mean and Covariance

• In the general multivariate normal case, neither the mean nor the covariance matrix is known.

• Estimate the mean and covariance from data.

26

(27)

J. Y. Choi. SNU

Example of Maximum Likelihood

• In the general multivariate normal case, neither the mean nor the covariance matrix is known.

• Consider first the univariate case with 𝜃₁ = 𝜇, 𝜃₂ = 𝜎²

• The log-likelihood of a single point is ln 𝑝 𝑥_𝑘 𝜃 = − ¹

2 ln 2𝜋𝜃₂ − ¹

2𝜃₂ (𝑥_𝑘 − 𝜃₁)² and its derivative becomes

𝛻_𝜃𝑙 𝑥_𝑘; 𝜃 = 𝛻_𝜃 ln 𝑝 𝑥_𝑘 𝜃 =

1

𝜃₂ (𝑥_𝑘 − 𝜃₁)

− 1

2𝜃₂ + (𝑥_𝑘 − 𝜃₁ )² 2𝜃₂²

27

(28)

J. Y. Choi. SNU

Example of Maximum Likelihood

• Setting the gradient to zero, and using all the sample points, we get the following necessary conditions:

σ_𝑘 ¹

𝜃₂ 𝑥_𝑘 − 𝜃₁ = 0, σ_𝑘(− ¹

2𝜃₂ + ^𝑥^𝑘^−𝜃¹

2

2𝜃₂² ) = 0

• Optimal Solution becomes Ƹ𝜇 = 𝜃₁ = ¹

𝑛 σ_𝑘=1^𝑛 𝑥_𝑘

ො

𝜎² = 𝜃₂ = ¹

𝑛 σ_𝑘=1^𝑛 (𝑥_𝑘 − Ƹ𝜇)²

28

(29)

J. Y. Choi. SNU

The Gaussian multivariate case

• For the multivariate case, it is easy to show that the MLE estimates are given by

Ƹ𝜇 = 1

𝑛 ෍

𝑘=1 𝑛

𝑥_𝑘

Σ = ෠

¹

𝑛

σ

_𝑘=1^𝑛

(𝑥

_𝑘

− Ƹ𝜇)(𝑥

_𝑘

− Ƹ𝜇)

^𝑡

• The MLE for 𝜎² is biased (i.e., the expected value over all data sets of size n of the sample variance is not equal to the true variance, that is,

𝐸 1

𝑛 ෍

𝑘=1 𝑛

(𝑥_𝑘 − Ƹ𝜇)² = 𝑛 − 1

𝑛 𝜎² ≠ 𝜎²

29

(30)

J. Y. Choi. SNU

Interim Summary and Next Outline

• Example of ML: The Gaussian Case

30

(31)

J. Y. Choi. SNU

Parametric Density Estimation (II)

Jin Young Choi

Seoul National University

1

(32)

J. Y. Choi. SNU

Outline

• Density Estimation

• Example: The Gaussian Case

2

(33)

J. Y. Choi. SNU

Parametric Learning

• Bayesian Decision 𝑎𝑟𝑔 max

𝑖 𝑝(𝜔_𝑖|𝑥)

• Maximum A-Posteriori Estimation 𝑎𝑟𝑔 max

𝜃_𝑖 𝑝 𝜃_𝑖 𝐷 , 𝜃_𝑖 = 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡

3

(34)

J. Y. Choi. SNU

• For the multivariate case, ln 𝑝 𝑥

_𝑘

𝜇 = −

¹

2

𝑥

_𝑘

− 𝜇

^𝑡

Σ

⁻¹

𝑥

_𝑘

− 𝜇 −

^𝑑

2

ln 2𝜋 −

¹

2

ln Σ

• MLE estimates are given by

Ƹ𝜇 = ¹

Σ = ෠

¹

𝑛

σ

_𝑘=1^𝑛

(𝑥

_𝑘

− Ƹ𝜇)(𝑥

_𝑘

− Ƹ𝜇)

^𝑡

• The MLE for Σ is biased (i.e., the expected value over all data sets of size n of the sample variance is not equal to the true variance, that is,

𝐸 ¹

𝑛 σ_𝑘=1^𝑛 (𝑥_𝑘 − Ƹ𝜇)(𝑥_𝑘 − Ƹ𝜇)^𝑡 = ^𝑛−𝑝

𝑛 Σ ≠ Σ

4

(35)

J. Y. Choi. SNU

Example of Maximum Likelihood:

• Unbiased estimator for 𝜇 and Σ are given by

ො

𝜇 = ¹

Σ =෠ ¹

𝑛−𝑝 σ_𝑘=1^𝑛 (𝑥_𝑘 − Ƹ𝜇)(𝑥_𝑘 − Ƹ𝜇)^𝑡

• Σ෠ is called the unbiased sample covariance matrix

𝐸 ¹

𝑛−𝑝 σ_𝑘=1^𝑛 (𝑥_𝑘 − ො𝜇)(𝑥_𝑘 − ො𝜇)^𝑡 = ^𝑛−𝑝

𝑛−𝑝Σ = Σ

• Biased sample variance Σ෠ is asymptotically unbiased.

𝐸 ¹

𝑛 σ_𝑘=1^𝑛 (𝑥_𝑘 − ො𝜇)(𝑥_𝑘 − ො𝜇)^𝑡 = ^𝑛−𝑝

𝑛 Σ → Σ as 𝑛 → ∞

5

(36)

J. Y. Choi. SNU

Bayesian Learning

• The aim of decision is to find posteriors 𝑝(𝜔_𝑖|𝑥) knowing 𝑝(𝑥|𝜔_𝑖) and 𝑝(𝜔_𝑖), but they are unknown.

• How to estimate them from the training sample set D?

• Generally used assumptions:

✓ Priors generally are known or obtainable from a trivial calculations.

Thus 𝑝 𝜔_𝑖 ~𝑝(𝜔_𝑖|𝑆). cf. 𝑝 𝜔_𝑖|𝑆 ≠ 𝑝(𝜔_𝑖|𝑥). 𝑆 includes label.

✓ The training set can be separated into 𝑐 subsets: 𝐷₁, … , 𝐷_𝑐

✓ 𝐷_𝑗 have no influence on 𝑝(𝐱|𝜔_𝑖, 𝐷_𝑖) if 𝑖 ≠ 𝑗.

✓ Bayesian learning: find 𝑝 𝜃_𝑖|𝐷_𝑖 ^{𝑜𝑚𝑖𝑡 𝑖𝑛𝑑𝑒𝑥} 𝑝 𝜃|𝐷

6

(37)

J. Y. Choi. SNU

Bayesian Learning: General Theory

• Bayesian learning considers 𝜃 ← 𝜃_𝑖 ← 𝜔_𝑖 (the parameter vector to be estimated) to be a random variable.

• Before we observe the data, the parameters are described by a prior 𝑝(𝜃) which is typically very broad. Once we observed the data, we can make use of Bayes’ formula to find posterior 𝑝(𝜃|𝐷). If the values of parameters are more consistent with the data, the posterior is narrower than prior. This is Bayesian learning (see fig.)

7

Find 𝑝 𝜃 𝐷 by Bayes rule 𝑝 𝜃 𝐷 ∝ 𝑝 𝐷|𝜃 𝑝(𝜃)

(38)

J. Y. Choi. SNU

Bayesian Learning: General Theory

• Density function for 𝑥, given the training data set 𝐷,

𝑝(𝑥|𝜔) ≅ 𝑝(𝑥|𝐷) = න 𝑝 𝑥, 𝜃 𝐷 𝑑𝜃 = න 𝑝 𝑥 𝜃, 𝐷 𝑝(𝜃|𝐷) 𝑑𝜃

• By independence between 𝑥 and 𝐷 , 𝑝 𝑥 𝜃, 𝐷 = 𝑝 𝑥 𝜃

• Therefore 𝑝 𝑥 𝐷 = ׬ 𝑝 𝑥 𝜃 𝑝 𝜃 𝐷 𝑑𝜃(= 𝑝 𝑥 𝜃 if 𝑝 𝜃 𝐷 is impulse).

• Instead of choosing a specific value for 𝜃, the Bayesian approach performs a weighted average over all values of 𝑝(𝜃)

• The weighting factor 𝑝(𝜃|𝐷), which is a posterior of 𝜃, is determined by starting from some assumed prior 𝑝(𝜃)

8

(39)

J. Y. Choi. SNU

• Then update it using Bayes’ formula to take account of

data set 𝐷. Since 𝐷 = 𝑥₁, … , 𝑥_𝑛 are drawn independently, 𝑝 𝐷 𝜃 = ς_𝑘=1^𝑛 𝑝 𝑥_𝑘 𝜃 ,

which is likelihood function.

• Posterior for 𝜃 is

𝑝 𝜃 𝐷 = ^𝑝 𝐷 𝜃 ^𝑝(𝜃)

𝑝(𝐷) = ^𝑝(𝜃)

𝑝(𝐷) ς_𝑘=1^𝑛 𝑝 𝑥_𝑘 𝜃 , where normalization factor 𝑝 𝐷 = ׬ 𝑝(𝜃) ς_𝑘=1^𝑛 𝑝 𝑥_𝑘 𝜃 𝑑𝜃.

9

Bayesian Learning: General Theory

(40)

J. Y. Choi. SNU

Bayesian Learning

– Univariate Normal Distribution

• Let us use the Bayesian estimation technique to calculate a posteriori density 𝑝 𝜃 𝐷 and the desired probability density (likelihood) 𝑝(𝑥|𝐷) for the case 𝑝(𝑥|𝜇)~𝑁 𝜇, Σ , where rv 𝜃 for the parameter 𝜇.

• Univariate Case: 𝑝 𝜇 𝐷

• Let 𝜇 be the only unknown parameter 𝑝 𝐱|𝜔_𝑖 ~𝑝(𝑥|𝜇)~𝑁 𝜇, 𝜎²

10

(41)

J. Y. Choi. SNU

Bayesian Learning

• Conjugate prior probability: normal distribution over 𝜇, 𝑝 𝜇 ~𝑁(𝜇₀, 𝜎₀²)

encodes some prior knowledge about the true mean 𝜇₀ , while 𝜎₀² measures our prior uncertainty.

• If 𝜇 is drawn from 𝑝(𝜇) then density for 𝑥 is completely determined.

Letting 𝐷 = 𝑥₁, … , 𝑥_𝑛 , we use

𝑝 𝜇 𝐷 = 𝑝 𝐷 𝜇 𝑝(𝜇)

׬ 𝑝 𝐷 𝜇 𝑝 𝜇 𝑑𝜇

= α ෑ

𝑘=1 𝑛

𝑝 𝑥_𝑘 𝜇 𝑝(𝜇).

11

(42)

J. Y. Choi. SNU ¹²

Bayesian Learning

1 1 2

( | D) exp

2 2

n n n

p   

 

  −  

 

= −  

   

 

𝑝 𝜇 𝐷

𝜎₀

Bayesian Learning

posterior

(43)

J. Y. Choi. SNU

• Computing the posterior distribution 𝑝 𝜇|𝐷 = 𝛼 𝑝(𝐷 𝜇 𝑝 𝜇

= 𝛼^′ exp −¹

2 σ_𝑘=1^𝑛 ^(𝑥^𝑘^−𝜇)

𝜎

2 + ^(𝜇−𝜇⁰⁾

𝜎₀

2

= 𝛼^′′ exp − ¹

2

𝑛

𝜎² + ¹

𝜎₀² 𝜇² − 2 ¹

𝜎² σ_𝑘=1^𝑛 𝑥_𝑘 + ^𝜇⁰

𝜎₀² 𝜇

• Since 𝑝 𝜇|𝐷 should be Gaussian 𝑝 𝜇|𝐷 = ¹

2𝜋𝜎_𝑛 exp −¹

2

(𝜇−𝜇_𝑛) 𝜎_𝑛

2 = 𝛼^′′′ exp − ¹

2𝜎_𝑛² 𝜇² − ^𝜇^𝑛

𝜎_𝑛 𝜇

• 𝜇_𝑛 = 𝑓 𝑥_𝑘, 𝜇₀, 𝜎₀ =?, 𝜎_𝑛 = ℎ 𝑥_𝑘, 𝜇₀, 𝜎₀ =?

13

Bayesian Learning

(44)

J. Y. Choi. SNU

• Solution: ¹

𝜎_𝑛² = ^𝑛

𝜎² + ¹

𝜎₀² , ^𝜇^𝑛

𝜎_𝑛² = ^𝑛

𝜎² Ƹ𝜇_𝑛 + ^𝜇⁰

𝜎₀²

where Ƹ𝜇_𝑛 = ¹

𝑛 σ_𝑘=1^𝑛 𝑥_𝑘 is the sample mean.

• Solving explicitly for 𝜇_𝑛 and 𝜎_𝑛² we obtain 𝜎_𝑛² = ^𝜎⁰²^𝜎²

𝑛𝜎₀²+𝜎² , 𝜇_𝑛 = ^𝑛𝜎⁰²

𝑛𝜎₀²+𝜎² Ƹ𝜇_𝑛 + ^𝜎²

𝑛𝜎₀²+𝜎² 𝜇₀

• 𝜇_𝑛 represents our best guess for 𝜇 after observing 𝑛 samples.

• 𝜎_𝑛² measures our uncertainty about this guess.

• 𝜎_𝑛² decreases monotonically with 𝑛 (approaching 𝜎_𝑛² → ^𝜎²

𝑛 → 0 as 𝑛 approaches infinity)

14

Bayesian Learning

1 1 2

( | D) exp

2 2

n n n

p   

 

  −  

 

= −  

   

 

(45)

J. Y. Choi. SNU

• Each additional observation decreases our uncertainty about the true value of 𝜇.

• As 𝑛 increases, 𝜇 becomes more and more sharply peaked, approaching a Dirac delta function as 𝑛 approaches infinity. This behavior is known as

Bayesian Learning.

15

( | D) p 

Bayesian Learning

(46)

J. Y. Choi. SNU

𝜇_𝑛 = 𝑛𝜎₀²

𝑛𝜎₀² + 𝜎² Ƹ𝜇_𝑛 + 𝜎²

𝑛𝜎₀² + 𝜎² 𝜇₀, 𝜎_𝑛² = 𝜎₀²𝜎² 𝑛𝜎₀² + 𝜎²

• In general, 𝜇_𝑛 is a linear combination of Ƹ𝜇_𝑛 and 𝜇₀ , with coefficients that are non-negative and sum to 1.

• Thus 𝜇_𝑛 lies somewhere between Ƹ𝜇_𝑛 and 𝜇₀.

• If 𝜎₀ ≠ 0, 𝜇_𝑛 → Ƹ𝜇_𝑛 as 𝑛 → ∞.

• If 𝜎₀ = 0, our a priori certainty that 𝜇 = 𝜇₀ = 𝜇_𝑛 is so strong that no number of observations can change our opinion.

• If 𝜎₀ ≫ 𝜎, a priori guess is uncertain, and we take 𝜇_𝑛 = Ƹ𝜇_𝑛

• The ratio ^𝜎²

𝜎₀² is called dogmatism.

16

 = 0

Bayesian Learning

(47)

J. Y. Choi. SNU ¹⁷

• The Univariate Case:

𝑝 𝑥 𝐷 = න 𝑝 𝑥 𝜇 𝑝 𝜇 𝐷 𝑑𝜇

= ׬ ¹

2𝜋𝜎𝜎_𝑛 exp − ¹

2

(𝑥−𝜇) 𝜎

2 − ¹

2

(𝜇−𝜇_𝑛) 𝜎_𝑛

2 𝑑𝜇

= ¹

2𝜋𝜎𝜎_𝑛 exp − ¹

2

𝑥−𝜇_𝑛 ²

𝜎²+𝜎_𝑛² 𝑓 𝜎, 𝜎_𝑛

∝ exp − ¹

2

𝑥−𝜇_𝑛 ²

𝜎²+𝜎_𝑛² 𝑛→∞ exp − ¹

2

𝑥−𝜇 ² 𝜎²

Bayesian Learning

(48)

J. Y. Choi. SNU

• Since 𝑝 𝑥 𝐷 ∝ exp − ¹

2

𝑥−𝜇_𝑛 ²

𝜎²+𝜎_𝑛² , we can write 𝑝 𝑥 𝐷 ~𝑁(𝜇_𝑛, 𝜎² + 𝜎_𝑛²)

• To obtain the class conditional probability 𝑝 𝑥 𝐷 , whose parametric form is known to be 𝑝 𝑥 𝜇 ~𝑁 𝜇, 𝜎² .

• We replace 𝜇 by 𝜇_𝑛 and 𝜎² by 𝜎² + 𝜎_𝑛².

• The conditional mean 𝜇_𝑛 is treated as if it were the true mean, and the known variance is increased by 𝜎_𝑛² to account for the additional

uncertainty in 𝑥 resulting from our lack of exact knowledge of the mean 𝜇 .

18

Bayesian Learning

(49)

J. Y. Choi. SNU

Recursive Bayesian Incremental Learning

• We have seen that 𝑝 𝐷^𝑛 𝜃 = ς_𝑘=1^𝑛 𝑝 𝑥_𝑘 𝜃 = 𝑝 𝑥_𝑛 𝜃 ς_𝑘=1^𝑛−1 𝑝 𝑥_𝑘 𝜃 . Let us define 𝐷^𝑛 = 𝑥₁, … , 𝑥_𝑛 . Then

𝑝 𝐷^𝑛 𝜃 = 𝑝 𝑥_𝑛 𝜃 𝑝 𝐷^𝑛−1 𝜃 .

• Substituting into 𝑝 𝜃 𝐷^𝑛 and using Bayes we have:

𝑝 𝜃 𝐷^𝑛 ∝ 𝑝 𝐷^𝑛 𝜃 𝑝 𝜃 = 𝑝 𝑥_𝑛 𝜃 𝑝 𝐷^𝑛−1 𝜃 𝑝 𝜃

= 𝑝 𝑥_𝑛 𝜃 𝑝 𝜃 𝐷^𝑛−1 𝑝 𝐷^𝑛−1 𝑝(𝜃) 𝑝 𝜃

∝ 𝑝 𝑥_𝑛 𝜃 𝑝 𝜃 𝐷^𝑛−1

19

(50)

J. Y. Choi. SNU

• While 𝑝 𝜃 𝐷⁰ = 𝑝(𝜃), the repeated use of 𝑝 𝜃 𝐷^𝑛 ∝ 𝑝 𝑥_𝑛 𝜃 𝑝 𝜃 𝐷^𝑛−1 produces a sequence

𝑝(𝜃), 𝑝 𝜃 𝑥₁ = 𝑝 𝜃 𝐷¹ , 𝑝(𝜃|𝑥₁, 𝑥₂) = 𝑝 𝜃 𝐷² , …

• This is called the recursive Bayes approach to the parameter estimation. (Also incremental or on-line learning).

• When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning.

20

(51)

J. Y. Choi. SNU

• Example of Recursive Formula

21

Recursive Bayesian Incremental Learning

𝜃_𝑛 = 𝜎_𝑛−1²

𝜎_𝑛−1² + 𝜎² 𝑥_𝑛 + 𝜎²

𝜎_𝑛−1² + 𝜎² 𝜃_𝑛−1 𝜎_𝑛² = 𝜎_𝑛−1² 𝜎² 𝜎_𝑛−1² + 𝜎²

𝑝(𝜃|𝐷^𝑛) ∝ 𝑝(𝐷^𝑛|𝜃)𝑝(𝜃|𝐷⁰)

𝑝(𝜃|𝐷^𝑛) ∝ 𝑝(𝑥_𝑛|𝜃)𝑝(𝜃|𝐷^𝑛−1) 𝜃_𝑛 = 𝑛𝜎₀²

𝑛𝜎₀² + 𝜎² 𝑥ො_𝑛 + 𝜎²

𝑛𝜎₀² + 𝜎² 𝜃₀ 𝜎_𝑛² = 𝜎₀²𝜎² 𝑛𝜎₀² + 𝜎²

(52)

J. Y. Choi. SNU

Maximal Likelihood vs. Bayesian

• MLP, MAP

2 2

( | D )ⁿ ( | ) ( | D )ⁿ : ( ,_n _n) p ^x =



p ^{x θ} p ^θ d^θ N   +

… …

^: arg max (D | )

: arg max ( | D )

n n

MLP p

MAP p



 

=

n n

2 ^ 2 2

0 0

(θ|D )= (D |θ) (θ)

( ; _n, _n) ( ; ⁿ, , , )

p p p

f f



   =     

( | D )ⁿ ( | _n) : ( _n, 2) p x = p x θ N θ 

50

(53)

J. Y. Choi. SNU

Maximal Likelihood vs. Bayesian

• ML and Bayesian estimations are asymptotically equivalent and “consistent”. They yield the same class-conditional densities when the size of the training data grows to infinity.

• ML is typically computationally easier: in ML we need to do (multidimensional) differentia tion and in Bayesian (multidimensional) integration.

• ML is often easier to interpret: it returns the single best model (parameter) whereas Bay esian gives a weighted average of models.

• But for a finite training data (and given a reliable prior) Bayesian is more accurate (uses more of the information).

• Bayesian with “flat” prior is essentially ML.

23

(54)

J. Y. Choi. SNU

Interim Summary

• In MLE estimation,

• In Bayesian Learning

• Weighted average of Candidate Models

• Gaussian Model

• Incremental Bayesian Learning

24

𝑝(𝑥|𝐷) = න𝑝(𝑥|𝜃)𝑝(𝜃|𝐷)𝑑𝜃

𝜃_𝑛 = 𝜎_𝑛−1²

𝜎_𝑛−1² + 𝜎² 𝑥_𝑛 + 𝜎²

𝜎_𝑛−1² + 𝜎² 𝜃_𝑛−1 𝜎_𝑛² = 𝜎_𝑛−1² 𝜎² 𝜎_𝑛−1² + 𝜎² 𝑝(𝜃|𝐷^𝑛) ∝ 𝑝(𝐷^𝑛|𝜃)𝑝(𝜃|𝐷⁰)

𝑝(𝜃|𝐷^𝑛) ∝ 𝑝(𝑥_𝑛|𝜃)𝑝(𝜃|𝐷^𝑛−1) 𝜃_𝑛 = 𝑛𝜎₀²

𝑛𝜎₀² + 𝜎² 𝜇ො_𝑛 + 𝜎²

𝑛𝜎₀² + 𝜎²𝜃₀ 𝜎_𝑛² = 𝜎₀²𝜎² 𝑛𝜎₀² + 𝜎²

ො

𝜇 = ¹

𝑛 σ_𝑘=1^𝑛 𝑥_𝑘, Σ =෠ ¹

𝑛−𝑝σ_𝑘=1^𝑛 (𝑥_𝑘 − ො𝜇)(𝑥_𝑘 − ො𝜇)^𝑡