J. Y. Choi. SNU
Parametric Density Estimation (I)
Jin Young Choi
Seoul National University
1
J. Y. Choi. SNU
Outline
โข Density Estimation
โข Maximum Likelihood vs Bayesian Learning
โข Maximum Likelihood Estimation
โข Maximum A-Posteriori (MAP) Estimation
โข Example: The Gaussian Case
โข Bayesian Learning
โข Recursive Bayesian Incremental Learning
2
J. Y. Choi. SNU
Estimation of PDF
โข To get an optimal classifier we need ๐(๐ค๐) and ๐(๐ฅ|๐ค๐) , but usually we do not know them.
โข Solution: to use training data to estimate the unknown
probabilities.
โข Key task: Estimation of class conditional density
3
J. Y. Choi. SNU
Learning of PDF
โข Supervised learning: we get to see samples from each of the classes
โseparatelyโ (called tagged or labeled samples).
โข Tagged samples are โexpensiveโ.
โข Two methods: parametric (easier) and non-parametric (harder)
4
J. Y. Choi. SNU
Learning From Observed Data
5
) ( i P ๏ท
)
| (x ๏ท1
P P(x |๏ท2) P (x |๏ท3)
Hidden Observed
Unsupervised
Supervised
J. Y. Choi. SNU
Parametric Learning
โข Assume specific parametric distributions with parameters:
๐ ๐ฅ ๐๐ โ ๐ ๐ฅ|๐๐ , ๐๐ โ ฮ โ ๐ ๐
โข Estimate parameters ๐(๐ท)เท from training data D
โข Replace true value of class-conditional density with approximation and apply the Bayesian framework for decision making.
6
J. Y. Choi. SNU
Parametric Learning
โข Suppose we can assume that the relevant (class-conditional) densities are of some parametric form. That is,
๐ ๐ฅ ๐ โ ๐ ๐ฅ ๐ , ๐คโ๐๐๐ ๐ โ ฮ โ ๐ ๐
โข Examples of parameterized densities:
โข Binomial: ๐ฅ has ๐ 1โs and (๐ โ ๐) 0โs ๐ ๐ฅ ๐ = ๐
๐ ๐๐(1 โ ๐)๐โ๐, ฮ = [0, 1]
โข Exponential: Each data point ๐ฅ is distributed according to ๐ ๐ฅ ๐ = ๐๐โ๐๐ฅ, ฮ = (0, โ)
โข Normal:
๐ ๐ฅ ๐ ~๐(๐, ๐2)
7
J. Y. Choi. SNU
Parametric Learning
โข Bayesian Decision ๐๐๐ max
๐ ๐(๐๐|๐ฅ)
โข Maximum Likelihood Estimation ๐๐๐ max
๐๐ ๐(๐ท|๐๐)
โข Maximum A-Posteriori Estimation ๐๐๐ max
๐๐ ๐ ๐๐ ๐ท , ๐๐ = ๐๐๐๐ ๐ก๐๐๐ก
โข Bayesian Learning (Estimation)
Find ๐ ๐๐ ๐ท , ๐๐ = ๐๐๐๐๐๐ ๐ฃ๐๐๐๐๐๐๐
8
J. Y. Choi. SNU
Preliminary Examples
9
โข Maximum Likelihood Estimation ๐๐๐ max
๐๐ ๐(๐ท|๐๐)
J. Y. Choi. SNU
Preliminary Examples
10
โข Maximum Likelihood Estimation ๐๐๐ max
๐๐ ๐(๐ท|๐๐)
J. Y. Choi. SNU
Preliminary Examples
11
โข Maximum Likelihood Estimation ๐๐๐ max
๐๐ ๐(๐ท|๐๐)
J. Y. Choi. SNU
Preliminary Examples
12
โข Maximum Likelihood Estimation ๐๐๐ max
๐๐ ๐(๐ท|๐๐)
J. Y. Choi. SNU
Preliminary Examples
13
โข Maximum Likelihood Estimation ๐๐๐ max
๐๐ ๐(๐ท|๐๐)
J. Y. Choi. SNU
Preliminary Examples
14
โข Bayesian Learning (Estimation)
Find ๐ ๐๐ ๐ท , ๐๐ = ๐๐๐๐๐๐ ๐ฃ๐๐๐๐๐๐๐
J. Y. Choi. SNU
Preliminary Examples
15
โข Bayesian Learning (Estimation)
Find ๐ ๐๐ ๐ท , ๐๐ = ๐๐๐๐๐๐ ๐ฃ๐๐๐๐๐๐๐
J. Y. Choi. SNU
Preliminary Examples
16
โข Bayesian Learning (Estimation)
Find ๐ ๐๐ ๐ท , ๐๐ = ๐๐๐๐๐๐ ๐ฃ๐๐๐๐๐๐๐
๐
J. Y. Choi. SNU
Preliminary Examples
17
โข Bayesian Learning (Estimation)
Find ๐ ๐๐ ๐ท , ๐๐ = ๐๐๐๐๐๐ ๐ฃ๐๐๐๐๐๐๐
๐
๐ =
J. Y. Choi. SNU
Maximum Likelihood vs Bayesian Learning
โข Maximum likelihood estimation: estimate parameter value ๐(๐ท)เท that makes the data most probable (i.e., maximizes the probability of
obtaining the sample that has actually been observed), ๐ ๐ฅ ๐ท โ ๐(๐ฅ| เท ๐ ๐ท ), เท ๐ ๐ท = ๐๐๐ max
๐ ๐(๐ท|๐)
โข Bayesian learning: define a prior probability on the model space ๐(๐) and compute the posterior ๐(๐|๐ท). Additional samples sharp the
posterior density which peaks near the true values of the parameters .
18
J. Y. Choi. SNU
Sampling Model
โข It is assumed that a training set ๐ = {(๐ฅ๐, เดฅ๐๐)|๐ = 1, โฆ , ๐} with independently generated samples is available.
โข The sample set is partitioned into separate sample sets for each class, ๐ท๐ = ๐ฅ๐ ๐ฅ๐, เดฅ๐๐ โ ๐๐
โข A generic sample set will simply be denoted by ๐ท = {๐ฅ๐|๐ = 1, โฆ , ๐}
โข Each class-conditional ๐(๐ฅ|๐๐) is assumed to have a known
parametric form and is uniquely specified by a parameter (vector) ๐๐.
โข Samples in each set ๐ท๐ are assumed to be independent and identically distributed (i.i.d.) according to some true probability law ๐(๐ฅ|๐๐).
19
J. Y. Choi. SNU
Log-Likelihood function and Score Function
โข The sample sets are assumed to be functionally independent, so the training set ๐๐ contains no information about ๐๐ for ๐ โ ๐
โข The i.i.d. assumption implies that
๐ ๐ท๐ ๐๐ = ฯ๐ฅโ๐ท๐ ๐(๐ฅ|๐๐)
โข Let ๐ท be a generic sample set of size ๐ = ๐ท
โข Log-likelihood function:
๐ ๐; ๐ท โก ln ๐(๐ท|๐) = เท
๐=1 ๐
ln ๐(๐ฅ๐|๐)
โข The log-likelihood function is identical to the logarithm of the probability density (or mass) function, but is interpreted as a function of parameter ๐
20
J. Y. Choi. SNU
Log-Likelihood Illustration
โข Assume that all the points in ๐ท are drawn from some (one-dimensional) normal distribution with some (known) variance and unknown mean.
21
๐ ๐ท ๐ = เท
๐ฅโ๐ท
๐(๐ฅ|๐)
๐ ๐; ๐ท โก ln ๐(๐ท|๐) = เท
๐=1 ๐
ln ๐(๐ฅ๐|๐)
J. Y. Choi. SNU
Maximum Likelihood Estimation (MLE)
โข The โmost likely valueโ for MLE is given by ๐ ๐ท = ๐๐๐maxเท
๐โฮ ๐(๐; ๐ท)
โข Gradient function
๐ป๐๐ ๐; ๐ท = ๐๐ ๐;๐ท
๐๐1 , โฆ ,๐๐ ๐;๐ท
๐๐๐
โข Necessary condition for MLE (if not on border of domain ฮ)
๐ป๐๐ ๐; ๐ท = 0
22
J. Y. Choi. SNU
Maximum A-Posteriori (MAP) Estimation
โข The โmost likely valueโ for MAP is given by
๐ ๐ท = ๐๐๐maxเท
๐โฮ ๐(๐|๐ท)
= ๐๐๐max
๐โฮ ๐(๐ท|๐) ๐0( ๐)/๐(๐ท)
= ๐๐๐max
๐โฮ ln เท
๐
๐ ๐ฅ๐ ๐ ๐0(๐)
= ๐๐๐max
๐โฮ เท
๐
ln ๐ ๐ฅ๐ ๐ + ln ๐0(๐)
โข We can ignore the normalizing factor ๐(๐ท)
23
J. Y. Choi. SNU
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean
โข Suppose that the samples are drawn from a multivariate normal population with mean ๐, and covariance matrix ฮฃ.
โข By MLE, find unknown parameter ๐ = ๐ for sample data ๐ท = {๐ฅ๐|๐ = 1, โฆ , ๐}
24
J. Y. Choi. SNU
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean
โข Log-likelihood is
๐ ๐; ๐ท = ln ๐ ๐ท ๐ = ln ฯ๐=1๐ ๐ ๐ฅ๐ ๐ = ฯ๐=1๐ ln ๐ ๐ฅ๐ ๐ where
ln ๐ ๐ฅ๐ ๐ = โ 1
2 ๐ฅ๐ โ ๐ ๐กฮฃโ1 ๐ฅ๐ โ ๐ โ ๐
2 ln 2๐ โ 1
2 ln ฮฃ
โข For a sample point ๐ฅ๐, we have
๐ป๐ ln ๐ ๐ฅ๐ ๐ = ฮฃโ1 ๐ฅ๐ โ ๐
โข The maximum likelihood estimate for ๐ must satisfy ฯ๐=1๐ ฮฃโ1 ๐ฅ๐ โ ๐ = 0 โโ ๐ = 1
๐ ฯ๐=1๐ ๐ฅ๐ (sample mean)
25
J. Y. Choi. SNU
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean and Covariance
โข In the general multivariate normal case, neither the mean nor the covariance matrix is known.
โข Estimate the mean and covariance from data.
26
J. Y. Choi. SNU
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean and Covariance
โข In the general multivariate normal case, neither the mean nor the covariance matrix is known.
โข Consider first the univariate case with ๐1 = ๐, ๐2 = ๐2
โข The log-likelihood of a single point is ln ๐ ๐ฅ๐ ๐ = โ 1
2 ln 2๐๐2 โ 1
2๐2 (๐ฅ๐ โ ๐1)2 and its derivative becomes
๐ป๐๐ ๐ฅ๐; ๐ = ๐ป๐ ln ๐ ๐ฅ๐ ๐ =
1
๐2 (๐ฅ๐ โ ๐1)
โ 1
2๐2 + (๐ฅ๐ โ ๐1 )2 2๐22
27
J. Y. Choi. SNU
Example of Maximum Likelihood
The Gaussian Case: Unknown Mean and Covariance
โข Setting the gradient to zero, and using all the sample points, we get the following necessary conditions:
ฯ๐ 1
๐2 ๐ฅ๐ โ ๐1 = 0, ฯ๐(โ 1
2๐2 + ๐ฅ๐โ๐1
2
2๐22 ) = 0
โข Optimal Solution becomes ฦธ๐ = ๐1 = 1
๐ ฯ๐=1๐ ๐ฅ๐
เท
๐2 = ๐2 = 1
๐ ฯ๐=1๐ (๐ฅ๐ โ ฦธ๐)2
28
J. Y. Choi. SNU
Example of Maximum Likelihood
The Gaussian multivariate case
โข For the multivariate case, it is easy to show that the MLE estimates are given by
ฦธ๐ = 1
๐ เท
๐=1 ๐
๐ฅ๐
ฮฃ = เท
1๐
ฯ
๐=1๐(๐ฅ
๐โ ฦธ๐)(๐ฅ
๐โ ฦธ๐)
๐กโข The MLE for ๐2 is biased (i.e., the expected value over all data sets of size n of the sample variance is not equal to the true variance, that is,
๐ธ 1
๐ เท
๐=1 ๐
(๐ฅ๐ โ ฦธ๐)2 = ๐ โ 1
๐ ๐2 โ ๐2
29
J. Y. Choi. SNU
Interim Summary and Next Outline
โข Maximum Likelihood vs Bayesian Learning
โข Maximum Likelihood Estimation
โข Maximum A-Posteriori (MAP) Estimation
โข Example of ML: The Gaussian Case
โข Bayesian Learning
โข Recursive Bayesian Incremental Learning
30
J. Y. Choi. SNU
Parametric Density Estimation (II)
Jin Young Choi
Seoul National University
1
J. Y. Choi. SNU
Outline
โข Density Estimation
โข Maximum Likelihood vs Bayesian Learning
โข Maximum Likelihood Estimation
โข Maximum A-Posteriori (MAP) Estimation
โข Example: The Gaussian Case
โข Bayesian Learning
โข Recursive Bayesian Incremental Learning
2
J. Y. Choi. SNU
Parametric Learning
โข Bayesian Decision ๐๐๐ max
๐ ๐(๐๐|๐ฅ)
โข Maximum Likelihood Estimation ๐๐๐ max
๐๐ ๐(๐ท|๐๐)
โข Maximum A-Posteriori Estimation ๐๐๐ max
๐๐ ๐ ๐๐ ๐ท , ๐๐ = ๐๐๐๐ ๐ก๐๐๐ก
โข Bayesian Learning (Estimation)
Find ๐ ๐๐ ๐ท , ๐๐ = ๐๐๐๐๐๐ ๐ฃ๐๐๐๐๐๐๐
3
J. Y. Choi. SNU
Example of Maximum Likelihood
The Gaussian multivariate case
โข For the multivariate case, ln ๐ ๐ฅ
๐๐ = โ
12
๐ฅ
๐โ ๐
๐กฮฃ
โ1๐ฅ
๐โ ๐ โ
๐2
ln 2๐ โ
12
ln ฮฃ
โข MLE estimates are given by
ฦธ๐ = 1
๐ ฯ๐=1๐ ๐ฅ๐
ฮฃ = เท
1๐
ฯ
๐=1๐(๐ฅ
๐โ ฦธ๐)(๐ฅ
๐โ ฦธ๐)
๐กโข The MLE for ฮฃ is biased (i.e., the expected value over all data sets of size n of the sample variance is not equal to the true variance, that is,
๐ธ 1
๐ ฯ๐=1๐ (๐ฅ๐ โ ฦธ๐)(๐ฅ๐ โ ฦธ๐)๐ก = ๐โ๐
๐ ฮฃ โ ฮฃ
4
J. Y. Choi. SNU
Example of Maximum Likelihood:
The Gaussian multivariate case
โข Unbiased estimator for ๐ and ฮฃ are given by
เท
๐ = 1
๐ ฯ๐=1๐ ๐ฅ๐
ฮฃ =เท 1
๐โ๐ ฯ๐=1๐ (๐ฅ๐ โ ฦธ๐)(๐ฅ๐ โ ฦธ๐)๐ก
โข ฮฃเท is called the unbiased sample covariance matrix
๐ธ 1
๐โ๐ ฯ๐=1๐ (๐ฅ๐ โ เท๐)(๐ฅ๐ โ เท๐)๐ก = ๐โ๐
๐โ๐ฮฃ = ฮฃ
โข Biased sample variance ฮฃเท is asymptotically unbiased.
๐ธ 1
๐ ฯ๐=1๐ (๐ฅ๐ โ เท๐)(๐ฅ๐ โ เท๐)๐ก = ๐โ๐
๐ ฮฃ โ ฮฃ as ๐ โ โ
5
J. Y. Choi. SNU
Bayesian Learning
โข The aim of decision is to find posteriors ๐(๐๐|๐ฅ) knowing ๐(๐ฅ|๐๐) and ๐(๐๐), but they are unknown.
โข How to estimate them from the training sample set D?
โข Generally used assumptions:
โ Priors generally are known or obtainable from a trivial calculations.
Thus ๐ ๐๐ ~๐(๐๐|๐). cf. ๐ ๐๐|๐ โ ๐(๐๐|๐ฅ). ๐ includes label.
โ The training set can be separated into ๐ subsets: ๐ท1, โฆ , ๐ท๐
โ ๐ท๐ have no influence on ๐(๐ฑ|๐๐, ๐ท๐) if ๐ โ ๐.
โ Let the estimate of ๐ ๐ฑ|๐๐ be ๐(๐ฑ|๐๐, ๐ท๐) for each class ๐ ๐ฑ|๐๐ = ๐ ๐ฑ|๐๐, ๐ท๐ โ ๐ ๐ฑ|๐ท๐ : = ๐ ๐ฑ|๐๐
โ Bayesian learning: find ๐ ๐๐|๐ท๐ ๐๐๐๐ก ๐๐๐๐๐ฅ ๐ ๐|๐ท
6
J. Y. Choi. SNU
Bayesian Learning: General Theory
โข Bayesian learning considers ๐ โ ๐๐ โ ๐๐ (the parameter vector to be estimated) to be a random variable.
โข Before we observe the data, the parameters are described by a prior ๐(๐) which is typically very broad. Once we observed the data, we can make use of Bayesโ formula to find posterior ๐(๐|๐ท). If the values of parameters are more consistent with the data, the posterior is narrower than prior. This is Bayesian learning (see fig.)
7
๐ ๐ฑ|๐๐ = ๐ ๐ฑ|๐๐, ๐ท๐ โ ๐ ๐ฑ|๐ท๐ โ ๐ ๐ฑ|๐ , ๐(๐|๐ท) where ๐ is latent random variable.
Find ๐ ๐ ๐ท by Bayes rule ๐ ๐ ๐ท โ ๐ ๐ท|๐ ๐(๐)
J. Y. Choi. SNU
Bayesian Learning: General Theory
โข Density function for ๐ฅ, given the training data set ๐ท,
๐(๐ฅ|๐) โ ๐(๐ฅ|๐ท) = เถฑ ๐ ๐ฅ, ๐ ๐ท ๐๐ = เถฑ ๐ ๐ฅ ๐, ๐ท ๐(๐|๐ท) ๐๐
โข By independence between ๐ฅ and ๐ท , ๐ ๐ฅ ๐, ๐ท = ๐ ๐ฅ ๐
โข Therefore ๐ ๐ฅ ๐ท = ืฌ ๐ ๐ฅ ๐ ๐ ๐ ๐ท ๐๐(= ๐ ๐ฅ ๐ if ๐ ๐ ๐ท is impulse).
โข Instead of choosing a specific value for ๐, the Bayesian approach performs a weighted average over all values of ๐(๐)
โข The weighting factor ๐(๐|๐ท), which is a posterior of ๐, is determined by starting from some assumed prior ๐(๐)
8
J. Y. Choi. SNU
โข Then update it using Bayesโ formula to take account of
data set ๐ท. Since ๐ท = ๐ฅ1, โฆ , ๐ฅ๐ are drawn independently, ๐ ๐ท ๐ = ฯ๐=1๐ ๐ ๐ฅ๐ ๐ ,
which is likelihood function.
โข Posterior for ๐ is
๐ ๐ ๐ท = ๐ ๐ท ๐ ๐(๐)
๐(๐ท) = ๐(๐)
๐(๐ท) ฯ๐=1๐ ๐ ๐ฅ๐ ๐ , where normalization factor ๐ ๐ท = ืฌ ๐(๐) ฯ๐=1๐ ๐ ๐ฅ๐ ๐ ๐๐.
9
Bayesian Learning: General Theory
J. Y. Choi. SNU
Bayesian Learning
โ Univariate Normal Distribution
โข Let us use the Bayesian estimation technique to calculate a posteriori density ๐ ๐ ๐ท and the desired probability density (likelihood) ๐(๐ฅ|๐ท) for the case ๐(๐ฅ|๐)~๐ ๐, ฮฃ , where rv ๐ for the parameter ๐.
โข Univariate Case: ๐ ๐ ๐ท
โข Let ๐ be the only unknown parameter ๐ ๐ฑ|๐๐ ~๐(๐ฅ|๐)~๐ ๐, ๐2
10
J. Y. Choi. SNU
Bayesian Learning
โ Univariate Normal Distribution
โข Conjugate prior probability: normal distribution over ๐, ๐ ๐ ~๐(๐0, ๐02)
encodes some prior knowledge about the true mean ๐0 , while ๐02 measures our prior uncertainty.
โข If ๐ is drawn from ๐(๐) then density for ๐ฅ is completely determined.
Letting ๐ท = ๐ฅ1, โฆ , ๐ฅ๐ , we use
๐ ๐ ๐ท = ๐ ๐ท ๐ ๐(๐)
ืฌ ๐ ๐ท ๐ ๐ ๐ ๐๐
= ฮฑ เท
๐=1 ๐
๐ ๐ฅ๐ ๐ ๐(๐).
11
J. Y. Choi. SNU 12
Bayesian Learning
โ Univariate Normal Distribution
1 1 2
( | D) exp
2 2
n n n
p ๏ญ ๏ญ ๏ญ
๏ฐ๏ณ ๏ณ
๏ฉ ๏ฆ โ ๏ถ ๏น
๏ช ๏บ
= โ ๏ง ๏ท
๏ช ๏จ ๏ธ ๏บ
๏ซ ๏ป
๐ ๐ ๐ท
๐0
Bayesian Learning
โ Univariate Normal Distribution
posterior
J. Y. Choi. SNU
โข Computing the posterior distribution ๐ ๐|๐ท = ๐ผ ๐(๐ท ๐ ๐ ๐
= ๐ผโฒ exp โ1
2 ฯ๐=1๐ (๐ฅ๐โ๐)
๐
2 + (๐โ๐0)
๐0
2
= ๐ผโฒโฒ exp โ 1
2
๐
๐2 + 1
๐02 ๐2 โ 2 1
๐2 ฯ๐=1๐ ๐ฅ๐ + ๐0
๐02 ๐
โข Since ๐ ๐|๐ท should be Gaussian ๐ ๐|๐ท = 1
2๐๐๐ exp โ1
2
(๐โ๐๐) ๐๐
2 = ๐ผโฒโฒโฒ exp โ 1
2๐๐2 ๐2 โ ๐๐
๐๐ ๐
โข ๐๐ = ๐ ๐ฅ๐, ๐0, ๐0 =?, ๐๐ = โ ๐ฅ๐, ๐0, ๐0 =?
13
Bayesian Learning
โ Univariate Normal Distribution
J. Y. Choi. SNU
โข Solution: 1
๐๐2 = ๐
๐2 + 1
๐02 , ๐๐
๐๐2 = ๐
๐2 ฦธ๐๐ + ๐0
๐02
where ฦธ๐๐ = 1
๐ ฯ๐=1๐ ๐ฅ๐ is the sample mean.
โข Solving explicitly for ๐๐ and ๐๐2 we obtain ๐๐2 = ๐02๐2
๐๐02+๐2 , ๐๐ = ๐๐02
๐๐02+๐2 ฦธ๐๐ + ๐2
๐๐02+๐2 ๐0
โข ๐๐ represents our best guess for ๐ after observing ๐ samples.
โข ๐๐2 measures our uncertainty about this guess.
โข ๐๐2 decreases monotonically with ๐ (approaching ๐๐2 โ ๐2
๐ โ 0 as ๐ approaches infinity)
14
Bayesian Learning
โ Univariate Normal Distribution
1 1 2
( | D) exp
2 2
n n n
p ๏ญ ๏ญ ๏ญ
๏ฐ๏ณ ๏ณ
๏ฉ ๏ฆ โ ๏ถ ๏น
๏ช ๏บ
= โ ๏ง ๏ท
๏ช ๏จ ๏ธ ๏บ
๏ซ ๏ป
J. Y. Choi. SNU
โข Each additional observation decreases our uncertainty about the true value of ๐.
โข As ๐ increases, ๐ becomes more and more sharply peaked, approaching a Dirac delta function as ๐ approaches infinity. This behavior is known as
Bayesian Learning.
15
( | D) p ๏ญ
Bayesian Learning
โ Univariate Normal Distribution
J. Y. Choi. SNU
๐๐ = ๐๐02
๐๐02 + ๐2 ฦธ๐๐ + ๐2
๐๐02 + ๐2 ๐0, ๐๐2 = ๐02๐2 ๐๐02 + ๐2
โข In general, ๐๐ is a linear combination of ฦธ๐๐ and ๐0 , with coefficients that are non-negative and sum to 1.
โข Thus ๐๐ lies somewhere between ฦธ๐๐ and ๐0.
โข If ๐0 โ 0, ๐๐ โ ฦธ๐๐ as ๐ โ โ.
โข If ๐0 = 0, our a priori certainty that ๐ = ๐0 = ๐๐ is so strong that no number of observations can change our opinion.
โข If ๐0 โซ ๐, a priori guess is uncertain, and we take ๐๐ = ฦธ๐๐
โข The ratio ๐2
๐02 is called dogmatism.
16
๏ญ ๏ญ= 0
Bayesian Learning
โ Univariate Normal Distribution
J. Y. Choi. SNU 17
โข The Univariate Case:
๐ ๐ฅ ๐ท = เถฑ ๐ ๐ฅ ๐ ๐ ๐ ๐ท ๐๐
= ืฌ 1
2๐๐๐๐ exp โ 1
2
(๐ฅโ๐) ๐
2 โ 1
2
(๐โ๐๐) ๐๐
2 ๐๐
= 1
2๐๐๐๐ exp โ 1
2
๐ฅโ๐๐ 2
๐2+๐๐2 ๐ ๐, ๐๐
โ exp โ 1
2
๐ฅโ๐๐ 2
๐2+๐๐2 ๐โโ exp โ 1
2
๐ฅโ๐ 2 ๐2
Bayesian Learning
โ Univariate Normal Distribution
J. Y. Choi. SNU
โข Since ๐ ๐ฅ ๐ท โ exp โ 1
2
๐ฅโ๐๐ 2
๐2+๐๐2 , we can write ๐ ๐ฅ ๐ท ~๐(๐๐, ๐2 + ๐๐2)
โข To obtain the class conditional probability ๐ ๐ฅ ๐ท , whose parametric form is known to be ๐ ๐ฅ ๐ ~๐ ๐, ๐2 .
โข We replace ๐ by ๐๐ and ๐2 by ๐2 + ๐๐2.
โข The conditional mean ๐๐ is treated as if it were the true mean, and the known variance is increased by ๐๐2 to account for the additional
uncertainty in ๐ฅ resulting from our lack of exact knowledge of the mean ๐ .
18
Bayesian Learning
โ Univariate Normal Distribution
J. Y. Choi. SNU
Recursive Bayesian Incremental Learning
โข We have seen that ๐ ๐ท๐ ๐ = ฯ๐=1๐ ๐ ๐ฅ๐ ๐ = ๐ ๐ฅ๐ ๐ ฯ๐=1๐โ1 ๐ ๐ฅ๐ ๐ . Let us define ๐ท๐ = ๐ฅ1, โฆ , ๐ฅ๐ . Then
๐ ๐ท๐ ๐ = ๐ ๐ฅ๐ ๐ ๐ ๐ท๐โ1 ๐ .
โข Substituting into ๐ ๐ ๐ท๐ and using Bayes we have:
๐ ๐ ๐ท๐ โ ๐ ๐ท๐ ๐ ๐ ๐ = ๐ ๐ฅ๐ ๐ ๐ ๐ท๐โ1 ๐ ๐ ๐
= ๐ ๐ฅ๐ ๐ ๐ ๐ ๐ท๐โ1 ๐ ๐ท๐โ1 ๐(๐) ๐ ๐
โ ๐ ๐ฅ๐ ๐ ๐ ๐ ๐ท๐โ1
19
J. Y. Choi. SNU
Recursive Bayesian Incremental Learning
โข While ๐ ๐ ๐ท0 = ๐(๐), the repeated use of ๐ ๐ ๐ท๐ โ ๐ ๐ฅ๐ ๐ ๐ ๐ ๐ท๐โ1 produces a sequence
๐(๐), ๐ ๐ ๐ฅ1 = ๐ ๐ ๐ท1 , ๐(๐|๐ฅ1, ๐ฅ2) = ๐ ๐ ๐ท2 , โฆ
โข This is called the recursive Bayes approach to the parameter estimation. (Also incremental or on-line learning).
โข When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning.
20
J. Y. Choi. SNU
โข Example of Recursive Formula
21
Recursive Bayesian Incremental Learning
๐๐ = ๐๐โ12
๐๐โ12 + ๐2 ๐ฅ๐ + ๐2
๐๐โ12 + ๐2 ๐๐โ1 ๐๐2 = ๐๐โ12 ๐2 ๐๐โ12 + ๐2
Recursive Bayesian Incremental Learning
๐(๐|๐ท๐) โ ๐(๐ท๐|๐)๐(๐|๐ท0)
๐(๐|๐ท๐) โ ๐(๐ฅ๐|๐)๐(๐|๐ท๐โ1) ๐๐ = ๐๐02
๐๐02 + ๐2 ๐ฅเท๐ + ๐2
๐๐02 + ๐2 ๐0 ๐๐2 = ๐02๐2 ๐๐02 + ๐2
J. Y. Choi. SNU
Maximal Likelihood vs. Bayesian
โข MLP, MAP
โข Bayesian Learning
2 2
( | D )n ( | ) ( | D )n : ( ,n n) p x =
๏ฒ
p x ฮธ p ฮธ dฮธ N ๏ฑ ๏ณ +๏ณโฆ โฆ
: arg max (D | ): arg max ( | D )
n n
n n
MLP p
MAP p
๏ฑ
๏ฑ
๏ฑ ๏ฑ
๏ฑ ๏ฑ
=
=
n n
2 ^ 2 2
0 0
(ฮธ|D )= (D |ฮธ) (ฮธ)
( ; n, n) ( ; n, , , )
p p p
f f
๏ก
๏ฑ ๏ฑ ๏ณ = ๏ฑ ๏ฑ ๏ณ ๏ณ ๏ฑ
( | D )n ( | n) : ( n, 2) p x = p x ฮธ N ฮธ ๏ณ
50
J. Y. Choi. SNU
Maximal Likelihood vs. Bayesian
โข ML and Bayesian estimations are asymptotically equivalent and โconsistentโ. They yield the same class-conditional densities when the size of the training data grows to infinity.
โข ML is typically computationally easier: in ML we need to do (multidimensional) differentia tion and in Bayesian (multidimensional) integration.
โข ML is often easier to interpret: it returns the single best model (parameter) whereas Bay esian gives a weighted average of models.
โข But for a finite training data (and given a reliable prior) Bayesian is more accurate (uses more of the information).
โข Bayesian with โflatโ prior is essentially ML.
23
J. Y. Choi. SNU
Interim Summary
โข In MLE estimation,
โข In Bayesian Learning
โข Weighted average of Candidate Models
โข Gaussian Model
โข Incremental Bayesian Learning
24
๐(๐ฅ|๐ท) = เถฑ๐(๐ฅ|๐)๐(๐|๐ท)๐๐
๐๐ = ๐๐โ12
๐๐โ12 + ๐2 ๐ฅ๐ + ๐2
๐๐โ12 + ๐2 ๐๐โ1 ๐๐2 = ๐๐โ12 ๐2 ๐๐โ12 + ๐2 ๐(๐|๐ท๐) โ ๐(๐ท๐|๐)๐(๐|๐ท0)
๐(๐|๐ท๐) โ ๐(๐ฅ๐|๐)๐(๐|๐ท๐โ1) ๐๐ = ๐๐02
๐๐02 + ๐2 ๐เท๐ + ๐2
๐๐02 + ๐2๐0 ๐๐2 = ๐02๐2 ๐๐02 + ๐2
เท
๐ = 1
๐ ฯ๐=1๐ ๐ฅ๐, ฮฃ =เท 1
๐โ๐ฯ๐=1๐ (๐ฅ๐ โ เท๐)(๐ฅ๐ โ เท๐)๐ก