Parametric Density Estimation (I)
Jin Young Choi
Seoul National University
Outline
• Density Estimation
• Maximum Likelihood vs Bayesian Learning
• Maximum Likelihood Estimation
• Maximum A-Posteriori (MAP) Estimation
• Example: The Gaussian Case
• Bayesian Learning
• Recursive Bayesian Incremental Learning
Estimation of PDF
• To get an optimal classifier we need $p(\omega_i)$ and $p(x|\omega_i)$, but usually we do not know them.
• Solution: to use training data to estimate the unknown probabilities.
• Key task: Estimation of class conditional density
Learning of PDF
• Supervised learning: we get to see samples from each of the classes
“separately” (called tagged or labeled samples).
• Tagged samples are “expensive”.
• Two methods: parametric (easier) and non-parametric (harder)
Learning From Observed Data
[Figure: comparison of supervised vs. unsupervised learning in terms of which quantities, e.g. $P(\omega_i)$, are hidden and which are observed.]
Parametric Learning
• Assume specific parametric distributions with parameters:
$p(x|\omega_i) \approx p(x|\theta_i), \quad \theta_i \in \Theta \subset \mathbb{R}^p$
• Estimate parameters 𝜃(𝐷) from training data D
• Replace true value of class-conditional density with approximation and apply the Bayesian framework for decision making.
Parametric Learning
• Suppose we can assume that the relevant (class-conditional) densities are of some parametric form. That is,
$p(x|\omega) \approx p(x|\theta)$, where $\theta \in \Theta \subset \mathbb{R}^p$
• Examples of parameterized densities:
• Binomial: $x$ has $m$ 1's and $(n-m)$ 0's, $p(x|\theta) = \binom{n}{m}\theta^m(1-\theta)^{n-m}$, $\Theta = [0,1]$
• Exponential: each data point $x$ is distributed according to $p(x|\theta) = \theta e^{-\theta x}$, $\Theta = (0,\infty)$
• Normal: $p(x|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ with $\theta = (\mu, \sigma^2)$
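The short sketch below (not part of the slides) simply evaluates the three parametric families above for some candidate parameter values; it assumes `scipy` is available, and all numbers are made up for illustration.

```python
# Evaluating the parameterized densities p(x|theta) for candidate parameters.
import numpy as np
from scipy import stats

theta = 0.3                                    # candidate parameter value (illustrative)
# Binomial: probability of m = 4 ones in n = 10 Bernoulli(theta) trials
print(stats.binom.pmf(k=4, n=10, p=theta))
# Exponential: p(x|theta) = theta * exp(-theta * x); scipy uses scale = 1/theta
print(stats.expon.pdf(2.0, scale=1.0 / theta))
# Normal with theta = (mu, sigma^2) = (0, 1)
print(stats.norm.pdf(1.5, loc=0.0, scale=1.0))
```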
Parametric Learning
• Bayesian decision: $\arg\max_i p(\omega_i|x)$
• Maximum likelihood estimation: $\arg\max_{\theta_i} p(D|\theta_i)$
• Maximum a-posteriori estimation: $\arg\max_{\theta_i} p(\theta_i|D)$, with $\theta_i$ treated as an unknown constant
• Bayesian learning (estimation): find $p(\theta_i|D)$, with $\theta_i$ treated as a random variable
Preliminary Examples
• Maximum likelihood estimation: $\arg\max_{\theta_i} p(D|\theta_i)$
[Figures: a sequence of worked examples illustrating the likelihood $p(D|\theta_i)$ as a function of $\theta_i$ and the value at which it is maximized.]
• Bayesian learning (estimation): find $p(\theta_i|D)$, where $\theta_i$ is a random variable
[Figures: corresponding examples showing the full posterior over the parameter (e.g., over the mean $\mu$) rather than a single point estimate.]
Maximum Likelihood vs Bayesian Learning
• Maximum likelihood estimation: estimate the parameter value $\hat\theta(D)$ that makes the data most probable (i.e., maximizes the probability of obtaining the sample that has actually been observed):
$p(x|D) \approx p(x|\hat\theta(D)), \quad \hat\theta(D) = \arg\max_{\theta} p(D|\theta)$
• Bayesian learning: define a prior probability $p(\theta)$ on the model space and compute the posterior $p(\theta|D)$. Additional samples sharpen the posterior density, which peaks near the true values of the parameters.
Sampling Model
• It is assumed that a training set $S = \{(x_l, \bar\omega_l) \mid l = 1, \ldots, N\}$ with independently generated samples is available.
• The sample set is partitioned into separate sample sets for each class: $D_j = \{x_l \mid (x_l, \bar\omega_l) \in S_j\}$, where $S_j$ denotes the samples in $S$ labeled with class $\omega_j$.
• A generic sample set will simply be denoted by $D = \{x_l \mid l = 1, \ldots, N\}$.
• Each class-conditional 𝑝(𝑥|𝜔𝑗) is assumed to have a known
parametric form and is uniquely specified by a parameter (vector) 𝜃𝑗.
• Samples in each set 𝐷𝑗 are assumed to be independent and identically distributed (i.i.d.) according to some true probability law 𝑝(𝑥|𝜔𝑗).
Log-Likelihood function and Score Function
• The sample sets are assumed to be functionally independent, so the training set 𝑆𝑗 contains no information about 𝜃𝑖 for 𝑖 ≠ 𝑗
• The i.i.d. assumption implies that
$p(D_j|\theta_j) = \prod_{x \in D_j} p(x|\theta_j)$
• Let $D$ be a generic sample set of size $n = |D|$.
• Log-likelihood function:
$l(\theta; D) \equiv \ln p(D|\theta) = \sum_{k=1}^{n} \ln p(x_k|\theta)$
• The log-likelihood is numerically the logarithm of the joint probability density (or mass) of the data, but it is interpreted as a function of the parameter $\theta$.
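A minimal sketch (not from the slides) of the log-likelihood as a function of the parameter, assuming a univariate Gaussian with known variance and a made-up toy sample:

```python
# l(theta; D) = sum_k ln p(x_k | theta) for N(theta, sigma^2), viewed as a
# function of the unknown mean theta.
import numpy as np

def log_likelihood(theta, D, sigma=1.0):
    """Sum over x_k in D of ln N(x_k; theta, sigma^2)."""
    D = np.asarray(D)
    return np.sum(-0.5 * ((D - theta) / sigma) ** 2
                  - 0.5 * np.log(2 * np.pi * sigma ** 2))

D = np.array([1.2, 0.8, 1.5, 0.9])           # toy sample set
for theta in [0.0, 1.0, 1.1, 2.0]:
    print(theta, log_likelihood(theta, D))    # largest value near the sample mean
```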
Log-Likelihood Illustration
• Assume that all the points in 𝐷 are drawn from some (one-dimensional) normal distribution with some (known) variance and unknown mean.
$p(D|\theta) = \prod_{x \in D} p(x|\theta), \qquad l(\theta; D) \equiv \ln p(D|\theta) = \sum_{k=1}^{n} \ln p(x_k|\theta)$
Maximum Likelihood Estimation (MLE)
• The "most likely value" for MLE is given by $\hat\theta(D) = \arg\max_{\theta \in \Theta} l(\theta; D)$.
• Gradient (score) function:
$\nabla_\theta l(\theta; D) = \left(\frac{\partial l(\theta; D)}{\partial \theta_1}, \ldots, \frac{\partial l(\theta; D)}{\partial \theta_p}\right)^t$
• Necessary condition for the MLE (if it is not on the border of the domain $\Theta$):
$\nabla_\theta l(\theta; D) = 0$
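As a small sketch (assuming the exponential model from the earlier examples, with made-up data), the MLE can be obtained either from the zero-gradient condition in closed form or by numerically maximizing $l(\theta; D)$:

```python
# For p(x|theta) = theta * exp(-theta x): l(theta; D) = n ln(theta) - theta * sum(x_k),
# so the zero-gradient condition gives theta_hat = n / sum(x_k).
import numpy as np

rng = np.random.default_rng(0)
D = rng.exponential(scale=1.0 / 2.5, size=1000)   # samples with true theta = 2.5

theta_closed = len(D) / D.sum()                   # closed-form MLE

grid = np.linspace(0.1, 10.0, 2000)               # numerical check on a grid
loglik = len(D) * np.log(grid) - grid * D.sum()
theta_grid = grid[np.argmax(loglik)]

print(theta_closed, theta_grid)                   # both close to 2.5
```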
Maximum A-Posteriori (MAP) Estimation
• The "most likely value" for MAP is given by
$\hat\theta(D) = \arg\max_{\theta \in \Theta} p(\theta|D) = \arg\max_{\theta \in \Theta} p(D|\theta)\, p_0(\theta)/p(D)$
$\quad = \arg\max_{\theta \in \Theta} \ln\!\left[\prod_k p(x_k|\theta)\, p_0(\theta)\right] = \arg\max_{\theta \in \Theta} \left[\sum_k \ln p(x_k|\theta) + \ln p_0(\theta)\right]$
• We can ignore the normalizing factor $p(D)$, since it does not depend on $\theta$.
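A minimal sketch (not from the slides) of the MAP rule above, assuming a Gaussian likelihood with known $\sigma$ and a Gaussian prior $p_0(\theta) \sim N(\mu_0, \sigma_0^2)$ as in the later examples; the data values are made up:

```python
# MAP adds the log prior to the log-likelihood before maximizing over theta.
import numpy as np

sigma, mu0, sigma0 = 1.0, 0.0, 0.5
D = np.array([2.1, 1.8, 2.4])

grid = np.linspace(-1.0, 4.0, 4001)
log_lik   = np.array([np.sum(-0.5 * ((D - t) / sigma) ** 2) for t in grid])
log_prior = -0.5 * ((grid - mu0) / sigma0) ** 2

theta_mle = grid[np.argmax(log_lik)]               # ignores the prior
theta_map = grid[np.argmax(log_lik + log_prior)]   # pulled toward mu0
print(theta_mle, theta_map)
```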
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean
• Suppose that the samples are drawn from a multivariate normal population with mean 𝜇, and covariance matrix Σ.
• By MLE, find unknown parameter 𝜃 = 𝜇 for sample data 𝐷 = {𝑥𝑘|𝑘 = 1, … , 𝑛}
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean
• Log-likelihood is
$l(\mu; D) = \ln p(D|\mu) = \ln \prod_{k=1}^{n} p(x_k|\mu) = \sum_{k=1}^{n} \ln p(x_k|\mu)$, where
$\ln p(x_k|\mu) = -\frac{1}{2}(x_k - \mu)^t \Sigma^{-1} (x_k - \mu) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma|$
• For a sample point $x_k$, we have $\nabla_\mu \ln p(x_k|\mu) = \Sigma^{-1}(x_k - \mu)$, so setting $\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat\mu) = 0$ gives $\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$.
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean and Covariance
• In the general multivariate normal case, neither the mean nor the covariance matrix is known.
• Estimate the mean and covariance from data.
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean and Covariance
• In the general multivariate normal case, neither the mean nor the covariance matrix is known.
• Consider first the univariate case with 𝜃1 = 𝜇, 𝜃2 = 𝜎2
• The log-likelihood of a single point is $\ln p(x_k|\theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$ and its gradient becomes
$\nabla_\theta l(x_k; \theta) = \nabla_\theta \ln p(x_k|\theta) = \begin{pmatrix} \frac{1}{\theta_2}(x_k - \theta_1) \\[4pt] -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} \end{pmatrix}$
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean and Covariance
• Setting the gradient to zero, and using all the sample points, we get the following necessary conditions:
$\sum_k \frac{1}{\hat\theta_2}(x_k - \hat\theta_1) = 0, \qquad \sum_k \left(-\frac{1}{2\hat\theta_2} + \frac{(x_k - \hat\theta_1)^2}{2\hat\theta_2^2}\right) = 0$
• The optimal solution becomes
$\hat\mu = \hat\theta_1 = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat\sigma^2 = \hat\theta_2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)^2$
Example of Maximum Likelihood
The Gaussian multivariate case
• For the multivariate case, it is easy to show that the MLE estimates are given by
$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat\Sigma = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)(x_k - \hat\mu)^t$
• The MLE for $\sigma^2$ is biased; i.e., the expected value of the sample variance over all data sets of size $n$ is not equal to the true variance:
$E\!\left[\frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)^2\right] = \frac{n-1}{n}\sigma^2 \ne \sigma^2$
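A minimal sketch (not from the slides) of the closed-form ML estimates above, together with a Monte Carlo check of the $(n-1)/n$ bias factor; sample sizes and variances are made up:

```python
# ML estimates for a multivariate Gaussian and the bias of the 1/n variance.
import numpy as np

rng = np.random.default_rng(0)
n, true_var = 5, 4.0

def ml_estimates(X):
    mu_hat = X.mean(axis=0)                        # (1/n) sum x_k
    diff = X - mu_hat
    Sigma_hat = diff.T @ diff / len(X)             # (1/n) sum (x_k - mu)(x_k - mu)^t
    return mu_hat, Sigma_hat

# Average the 1-D ML variance estimate over many data sets of size n
est = [ml_estimates(rng.normal(0.0, np.sqrt(true_var), size=(n, 1)))[1][0, 0]
       for _ in range(20000)]
print(np.mean(est), (n - 1) / n * true_var)        # both approx 3.2, not 4.0
```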
Interim Summary and Next Outline
• Maximum Likelihood vs Bayesian Learning
• Maximum Likelihood Estimation
• Maximum A-Posteriori (MAP) Estimation
• Example of ML: The Gaussian Case
• Bayesian Learning
• Recursive Bayesian Incremental Learning
Parametric Density Estimation (II)
Jin Young Choi
Seoul National University
Outline
• Density Estimation
• Maximum Likelihood vs Bayesian Learning
• Maximum Likelihood Estimation
• Maximum A-Posteriori (MAP) Estimation
• Example: The Gaussian Case
• Bayesian Learning
• Recursive Bayesian Incremental Learning
Parametric Learning
• Bayesian decision: $\arg\max_i p(\omega_i|x)$
• Maximum likelihood estimation: $\arg\max_{\theta_i} p(D|\theta_i)$
• Maximum a-posteriori estimation: $\arg\max_{\theta_i} p(\theta_i|D)$, with $\theta_i$ treated as an unknown constant
• Bayesian learning (estimation): find $p(\theta_i|D)$, with $\theta_i$ treated as a random variable
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean and Covariance
• In the general multivariate normal case, neither the mean nor the covariance matrix is known.
• Consider first the univariate case with 𝜃1 = 𝜇, 𝜃2 = 𝜎2
• The log-likelihood of a single point is $\ln p(x_k|\theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$ and its gradient becomes
$\nabla_\theta l(x_k; \theta) = \nabla_\theta \ln p(x_k|\theta) = \begin{pmatrix} \frac{1}{\theta_2}(x_k - \theta_1) \\[4pt] -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} \end{pmatrix}$
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean and Covariance
• Setting the gradient to zero, and using all the sample points, we get the following necessary conditions:
$\sum_k \frac{1}{\hat\theta_2}(x_k - \hat\theta_1) = 0, \qquad \sum_k \left(-\frac{1}{2\hat\theta_2} + \frac{(x_k - \hat\theta_1)^2}{2\hat\theta_2^2}\right) = 0$
• The optimal solution becomes
$\hat\mu = \hat\theta_1 = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat\sigma^2 = \hat\theta_2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)^2$
Example of Maximum Likelihood
The Gaussian multivariate case
• For the multivariate case, it is easy to show that the MLE estimates are given by
$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat\Sigma = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)(x_k - \hat\mu)^t$
• The MLE for $\sigma^2$ is biased; i.e., the expected value of the sample variance over all data sets of size $n$ is not equal to the true variance:
$E\!\left[\frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)^2\right] = \frac{n-1}{n}\sigma^2 \ne \sigma^2$
Example of Maximum Likelihood:
The Gaussian multivariate case
• Unbiased estimators for $\mu$ and $\Sigma$ are given by
$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat\mu)(x_k - \hat\mu)^t$
• $C$ is called the unbiased sample covariance matrix; in the univariate case,
$E\!\left[\frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat\mu)^2\right] = \frac{n-1}{n-1}\sigma^2 = \sigma^2$
• The biased sample variance $\hat\sigma^2$ is asymptotically unbiased.
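A minimal sketch (not from the slides) showing how the biased and unbiased estimates above differ only by the divisor, using numpy's `ddof` argument; the data are random and purely illustrative:

```python
# ddof=0 divides by n (biased ML estimate), ddof=1 divides by n-1 (unbiased C).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                        # n = 8 samples in R^3

Sigma_ml = np.cov(X, rowvar=False, ddof=0)         # (1/n) version
C_unbias = np.cov(X, rowvar=False, ddof=1)         # (1/(n-1)) version
print(np.allclose(C_unbias, Sigma_ml * 8 / 7))     # True: C = n/(n-1) * Sigma_ml
```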
Bayesian Learning
• The aim of decision making is to find the posteriors $p(\omega_i|x)$ from $p(x|\omega_i)$ and $p(\omega_i)$, but these are unknown.
• How to estimate them from the sample D?
• Generally used assumptions:
Priors generally are known or obtainable from trivial calculations.
Thus $p(\omega_i) \approx p(\omega_i|D)$ (note $p(\omega_i) \ne p(\omega_i|x)$; $D$ includes the labels).
The training set can be separated into $c$ subsets: $D_1, \ldots, D_c$.
$D_j$ has no influence on $p(\mathbf{x}|\omega_i, D_i)$ if $i \ne j$.
We can therefore omit the class index and estimate $p(\mathbf{x}|\omega_i, D_i)$ for each class:
$p(\mathbf{x}|\omega_i) = p(\mathbf{x}|\omega_i, D_i) \cong p(\mathbf{x}|D_i)$
Bayesian Learning: General Theory
• Bayesian learning considers the parameter vector to be estimated, $\theta$ (i.e., $\theta_i$ for class $\omega_i$), to be a random variable.
• Before we observe the data, the parameters are described by a prior $p(\theta)$, which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior $p(\theta|D)$. Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning (see figure).
$p(\mathbf{x}|\omega_i) = p(\mathbf{x}|\omega_i, D_i) \cong p(\mathbf{x}|D_i) \leftarrow p(\mathbf{x}|\theta),\ p(\theta|D_i)$, where $\theta$ is a latent random variable.
Bayesian Learning: General Theory
• Density function for $x$, given the training data set $D_i$:
$p(x|\omega_i) \cong p(x|D_i) = \int p(x, \theta|D_i)\, d\theta$
• By the chain rule, $p(x, \theta|D) = p(x|\theta, D)\, p(\theta|D)$.
• By independence between $x$ and $D$ given $\theta$, $p(x|\theta, D) = p(x|\theta)$.
• Therefore $p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta$ (which reduces to $p(x|\theta)$ if $p(\theta|D)$ is an impulse).
• Instead of choosing a specific value for $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$.
• The weighting factor $p(\theta|D)$, the posterior of $\theta$, is determined by starting from some assumed prior $p(\theta)$.
• Then update it using Bayes' formula to take account of the data set $D$. Since $D = \{x_1, \ldots, x_n\}$ are drawn independently,
$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)$, which is the likelihood function.
• The posterior for $\theta$ is
$p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{p(D)} = \frac{p(\theta)}{p(D)} \prod_{k=1}^{n} p(x_k|\theta)$
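A minimal sketch (not from the slides) of this posterior update, computed numerically on a discretized parameter grid; it assumes a Gaussian likelihood with known $\sigma$, a broad Gaussian prior, and made-up observations:

```python
# p(theta|D) proportional to p(theta) * prod_k p(x_k|theta), normalized on a grid.
import numpy as np

sigma = 1.0
D = np.array([0.9, 1.4, 1.1, 0.7])                    # toy observations

theta = np.linspace(-3, 5, 2001)                      # grid over the parameter
prior = np.exp(-0.5 * (theta / 2.0) ** 2)             # broad prior p(theta)
loglik = np.sum(-0.5 * ((D[:, None] - theta[None, :]) / sigma) ** 2, axis=0)

post = prior * np.exp(loglik - loglik.max())          # p(D|theta) p(theta), stabilized
post /= post.sum() * (theta[1] - theta[0])            # normalize by p(D)
print(theta[np.argmax(post)])                         # posterior mode near the sample mean
```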
Bayesian Learning
– Univariate Normal Distribution
• Let us use the Bayesian estimation technique to calculate the a posteriori density $p(\theta|D)$ and the desired probability density $p(x|D)$ for the case $p(x|\mu) \sim N(\mu, \Sigma)$, where $\theta := \mu$.
• Univariate case: $p(\mu|D)$
• Let $\mu$ be the only unknown parameter, with $p(x|\mu) \sim N(\mu, \sigma^2)$.
$p(\mathbf{x}|\omega_i) = p(\mathbf{x}|\omega_i, D_i) \cong p(\mathbf{x}|D_i) \leftarrow p(\mathbf{x}|\theta),\ p(\theta|D_i)$, where $\theta$ is a latent random variable.
Bayesian Learning
– Univariate Normal Distribution
• Conjugate prior: a normal distribution over $\mu$, $p(\mu) \sim N(\mu_0, \sigma_0^2)$, where $\mu_0$ represents our best prior guess for the true mean and $\sigma_0^2$ measures our prior uncertainty about this guess.
• If $\mu$ is drawn from $p(\mu)$, then the density for $x$ is completely determined. Letting $D = \{x_1, \ldots, x_n\}$, we use
$p(\mu|D) = \frac{p(D|\mu)\, p(\mu)}{\int p(D|\mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\, p(\mu)$
Bayesian Learning
– Univariate Normal Distribution
$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$
[Figure: plot of the posterior $p(\mu|D)$ for the Gaussian case with prior width $\sigma_0$.]
Bayesian Learning
– Univariate Normal Distribution
• Computing the posterior distribution:
$p(\mu|D) = \alpha\, p(D|\mu)\, p(\mu) = \alpha' \exp\!\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{x_k - \mu}{\sigma}\right)^2 + \left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right)\right]$
$\quad = \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$
• Since $p(\mu|D)$ should be Gaussian,
$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right] = \alpha''' \exp\!\left[-\frac{1}{2\sigma_n^2}\left(\mu^2 - 2\mu_n\mu\right)\right]$
Bayesian Learning
– Univariate Normal Distribution
• Solution:
$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\hat\mu_n + \frac{\mu_0}{\sigma_0^2}$
where $\hat\mu_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.
• Solving explicitly for $\mu_n$ and $\sigma_n^2$ we obtain
$\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}, \qquad \mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0$
• $\mu_n$ represents our best guess for $\mu$ after observing $n$ samples.
• $\sigma_n^2$ measures our uncertainty about this guess.
• $\sigma_n^2$ decreases monotonically with $n$ (approaching $\sigma_n^2 \approx \sigma^2/n \to 0$ as $n$ approaches infinity).
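A minimal sketch (not from the slides) implementing the closed-form $\mu_n$ and $\sigma_n^2$ above; the true mean, prior values, and sample sizes are made up to show the behavior as $n$ grows:

```python
# Closed-form Gaussian posterior parameters for known sigma and N(mu0, sigma0^2) prior.
import numpy as np

def gaussian_posterior(D, sigma, mu0, sigma0):
    n = len(D)
    mu_hat = np.mean(D)                                        # sample mean
    sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    mu_n = (n * sigma0**2 * mu_hat + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
    return mu_n, sigma_n2

rng = np.random.default_rng(0)
for n in [1, 10, 100, 10000]:
    D = rng.normal(loc=2.0, scale=1.0, size=n)                 # true mu = 2
    print(n, gaussian_posterior(D, sigma=1.0, mu0=0.0, sigma0=1.0))
    # mu_n -> 2 and sigma_n^2 -> 0 as n grows
```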
Bayesian Learning
– Univariate Normal Distribution
$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$
• Each additional observation decreases our uncertainty about the true value of $\mu$.
• As $n$ increases, $p(\mu|D)$ becomes more and more sharply peaked, approaching a Dirac delta function as $n$ approaches infinity. This behavior is known as Bayesian learning.
Bayesian Learning
– Univariate Normal Distribution
$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}$
• In general, $\mu_n$ is a linear combination of $\hat\mu_n$ and $\mu_0$, with coefficients that are non-negative and sum to 1.
• Thus $\mu_n$ lies somewhere between $\hat\mu_n$ and $\mu_0$.
• If $\sigma_0 \ne 0$, $\mu_n \to \hat\mu_n$ as $n \to \infty$.
• If $\sigma_0 = 0$, our a priori certainty that $\mu = \mu_0 = \mu_n$ is so strong that no number of observations can change our opinion.
• If $\sigma_0 \gg \sigma$, the a priori guess is very uncertain, and we take $\mu_n \approx \hat\mu_n$.
• The ratio $\sigma^2/\sigma_0^2$ is called the dogmatism.
Bayesian Learning
– Univariate Normal Distribution
• The univariate case:
$p(x|D) = \int p(x|\mu)\, p(\mu|D)\, d\mu$
$\quad = \int \frac{1}{2\pi\sigma\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2 - \frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right] d\mu$
$\quad = \frac{1}{2\pi\sigma\sigma_n} \exp\!\left[-\frac{1}{2}\,\frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}\right] f(\sigma, \sigma_n)$
$\quad \propto \exp\!\left[-\frac{1}{2}\,\frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}\right] \xrightarrow{\ n \to \infty\ } \exp\!\left[-\frac{1}{2}\,\frac{(x - \mu)^2}{\sigma^2}\right]$
Bayesian Learning
– Univariate Normal Distribution
$p(\mathbf{x}|\omega_i) = p(\mathbf{x}|\omega_i, D_i) \cong p(\mathbf{x}|D_i) \leftarrow p(\mathbf{x}|\theta),\ p(\theta|D_i)$, where $\theta$ is a latent random variable.
• Since $p(x|D) \propto \exp\!\left[-\frac{1}{2}\,\frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}\right]$, we can write $p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$.
• To obtain the class-conditional probability $p(x|D)$, whose parametric form is known to be $p(x|\mu) \sim N(\mu, \sigma^2)$, we replace $\mu$ by $\mu_n$ and $\sigma^2$ by $\sigma^2 + \sigma_n^2$.
• The conditional mean $\mu_n$ is treated as if it were the true mean, and the known variance is increased by $\sigma_n^2$ to account for the additional uncertainty in $x$ resulting from our lack of exact knowledge of the mean $\mu$.
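A minimal sketch (not from the slides) checking by Monte Carlo that averaging $p(x|\mu)$ over the posterior $p(\mu|D)$ matches the $N(\mu_n, \sigma^2 + \sigma_n^2)$ form above; the posterior values used are made up:

```python
# Monte Carlo average of p(x|mu) over mu ~ p(mu|D) versus the closed-form predictive.
import numpy as np

sigma, mu_n, sigma_n2 = 1.0, 1.7, 0.25                  # assumed posterior values
rng = np.random.default_rng(0)

mus = rng.normal(mu_n, np.sqrt(sigma_n2), size=200000)  # draws from p(mu|D)
x = 2.5
mc = np.mean(np.exp(-0.5 * ((x - mus) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma))
closed = (np.exp(-0.5 * (x - mu_n) ** 2 / (sigma**2 + sigma_n2))
          / np.sqrt(2 * np.pi * (sigma**2 + sigma_n2)))
print(mc, closed)                                       # nearly identical
```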
Bayesian Learning
– Univariate Normal Distribution
$p(\mathbf{x}|\omega_i) = p(\mathbf{x}|\omega_i, D_i) \cong p(\mathbf{x}|D_i) \leftarrow p(\mathbf{x}|\theta),\ p(\theta|D_i)$, where $\theta$ is a latent random variable.
Example (demo-MAP)
• We have $N$ points generated by a one-dimensional Gaussian, $p(x|\theta) \sim G_x[\theta, \sigma^2]$.
• The prior probability is given by $p(\theta) \sim G_\theta[0, \sigma_0^2]$.
• The total objective function is
$E = \log\!\left[\prod_{n=1}^{N} p(x_n|\theta)\, p(\theta)\right] = \sum_{n=1}^{N} \log p(x_n|\theta) + \log p(\theta)$,
which is maximized to give (with $\mu_0 = 0$ and $\hat\mu_N = \frac{1}{N}\sum_{n=1}^{N} x_n$)
$\hat\theta_{MAP} = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\hat\mu_N$
• For $\sigma_0^2 \gg \sigma^2$, the influence of the prior is negligible and the result is the ML estimate. But for a very strong belief in the prior ($\sigma_0^2 \to 0$), the estimate tends to zero. Thus, if few samples are available, the prior dominates the estimate.
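A minimal sketch of the demo-MAP behavior just described; the specific sample size, true mean, and prior widths are made up purely to show the two regimes:

```python
# MAP estimate with a zero-mean Gaussian prior: small sigma0 pulls the estimate
# toward 0, a broad prior recovers the ML estimate (the sample mean).
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 5, 1.0
x = rng.normal(loc=3.0, scale=sigma, size=N)            # few samples, true mean 3

for sigma0 in [0.01, 0.1, 1.0, 100.0]:
    theta_map = (N * sigma0**2) / (N * sigma0**2 + sigma**2) * x.mean()
    print(sigma0, theta_map)                             # -> 0 for small sigma0, -> x.mean() for large
```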
Recursive Bayesian Incremental Learning
• We have seen that $p(D^n|\theta) = \prod_{k=1}^{n} p(x_k|\theta) = p(x_n|\theta)\prod_{k=1}^{n-1} p(x_k|\theta)$. Let us define $D^n = \{x_1, \ldots, x_n\}$. Then
$p(D^n|\theta) = p(x_n|\theta)\, p(D^{n-1}|\theta).$
• Substituting into $p(\theta|D^n)$ and using Bayes' formula we have:
$p(\theta|D^n) \propto p(D^n|\theta)\, p(\theta) = p(x_n|\theta)\, p(D^{n-1}|\theta)\, p(\theta)$
$\quad = p(x_n|\theta)\,\frac{p(\theta|D^{n-1})\, p(D^{n-1})}{p(\theta)}\, p(\theta) \propto p(x_n|\theta)\, p(\theta|D^{n-1})$
Recursive Bayesian Incremental Learning
• While $p(\theta|D^0) = p(\theta)$, repeated use of $p(\theta|D^n) \propto p(x_n|\theta)\, p(\theta|D^{n-1})$ produces a sequence
$p(\theta),\ p(\theta|x_1) = p(\theta|D^1),\ p(\theta|x_1, x_2) = p(\theta|D^2),\ \ldots$
• This is called the recursive Bayes approach to parameter estimation (also incremental or on-line learning).
• When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning.
Recursive Bayesian Incremental Learning
• Example of recursive formula (Gaussian case): applying $p(\theta|D^n) \propto p(x_n|\theta)\, p(\theta|D^{n-1})$ with a Gaussian likelihood and a Gaussian previous posterior,
$p(\theta|D^{n-1}) = \frac{1}{\sqrt{2\pi}\,\sigma_{n-1}} \exp\!\left[-\frac{1}{2}\left(\frac{\theta - \mu_{n-1}}{\sigma_{n-1}}\right)^2\right]$
$p(\theta|D^n) \propto \exp\!\left[-\frac{1}{2}\left(\frac{x_n - \theta}{\sigma}\right)^2\right] \exp\!\left[-\frac{1}{2}\left(\frac{\theta - \mu_{n-1}}{\sigma_{n-1}}\right)^2\right] = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\theta - \mu_n}{\sigma_n}\right)^2\right]$
with the recursions
$\mu_n = \frac{\sigma_{n-1}^2}{\sigma_{n-1}^2 + \sigma^2}\, x_n + \frac{\sigma^2}{\sigma_{n-1}^2 + \sigma^2}\, \mu_{n-1}, \qquad \sigma_n^2 = \frac{\sigma_{n-1}^2\, \sigma^2}{\sigma_{n-1}^2 + \sigma^2}$
which are consistent with the batch solution
$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}$
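A minimal sketch (not from the slides) of the one-sample-at-a-time recursion above for the Gaussian mean, checked against the batch solution; prior values and data are made up:

```python
# Recursive update of (mu_n, sigma_n^2) versus the closed-form batch result.
import numpy as np

rng = np.random.default_rng(0)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
D = rng.normal(loc=1.5, scale=sigma, size=50)

mu, var = mu0, sigma0**2                                 # start from the prior
for x in D:                                              # p(theta|D^n) from p(theta|D^{n-1})
    var_new = var * sigma**2 / (var + sigma**2)
    mu = var_new * (x / sigma**2 + mu / var)
    var = var_new

n = len(D)
mu_batch = (n * sigma0**2 * D.mean() + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
var_batch = sigma0**2 * sigma**2 / (n * sigma0**2 + sigma**2)
print(mu, mu_batch, var, var_batch)                      # recursive == batch
```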
Maximum Likelihood vs. Bayesian
• ML and Bayesian estimations are asymptotically equivalent and "consistent": they yield the same class-conditional densities when the size of the training data grows to infinity.
• ML is typically computationally easier: ML requires (multidimensional) differentiation, while Bayesian estimation requires (multidimensional) integration.
• ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian gives a weighted average of models.
• But for finite training data (and given a reliable prior) Bayesian is more accurate (it uses more of the information).
• Bayesian with a "flat" prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.
Maximum Likelihood vs. Bayesian
• MLE, MAP:
$\hat\theta^{\,n}_{MLE} = \arg\max_{\theta} p(D^n|\theta), \qquad \hat\theta^{\,n}_{MAP} = \arg\max_{\theta} p(\theta|D^n)$
$p(x|D^n) \approx p(x|\hat\theta^{\,n}) \sim N(\hat\theta^{\,n}, \sigma^2)$
• Bayesian learning:
$p(\theta|D^n) = \alpha\, p(D^n|\theta)\, p(\theta) = f(\theta;\, \mu_n, \sigma_n)$, where $\mu_n$ and $\sigma_n$ depend on $n$, $\hat\mu_n$, $\mu_0$, $\sigma_0$
$p(x|D^n) = \int p(x|\theta)\, p(\theta|D^n)\, d\theta \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$
Interim Summary
• In MLE estimation, $p(x|D) \approx p(x|\hat\theta(D))$, where $\hat\theta(D) = \arg\max_{\theta} p(D|\theta)$.
• In Bayesian learning:
  • Weighted average of candidate models: $p(x|D) = \int p(x|\theta)\, p(\theta|D)\, d\theta$
  • Gaussian model: $\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \quad \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}, \quad p(x|D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$
  • Incremental Bayesian learning: $p(\theta|D^n) \propto p(x_n|\theta)\, p(\theta|D^{n-1})$