• ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค.

Parametric Density Estimation (I)

N/A
N/A
Protected

Academic year: 2024

Share "Parametric Density Estimation (I)"

Copied!
54
0
0

์ „์ฒด ๊ธ€

(1)

J. Y. Choi. SNU

Parametric Density Estimation (I)

Jin Young Choi

Seoul National University

1

(2)

J. Y. Choi. SNU

Outline

โ€ข Density Estimation

โ€ข Maximum Likelihood vs Bayesian Learning

โ€ข Maximum Likelihood Estimation

โ€ข Maximum A-Posteriori (MAP) Estimation

โ€ข Example: The Gaussian Case

โ€ข Bayesian Learning

โ€ข Recursive Bayesian Incremental Learning

2

(3)

J. Y. Choi. SNU

Estimation of PDF

โ€ข To get an optimal classifier we need ๐‘(๐‘ค๐‘–) and ๐‘(๐‘ฅ|๐‘ค๐‘–) , but usually we do not know them.

โ€ข Solution: to use training data to estimate the unknown

probabilities.

โ€ข Key task: Estimation of class conditional density

3

(4)

J. Y. Choi. SNU

Learning of PDF

โ€ข Supervised learning: we get to see samples from each of the classes

โ€œseparatelyโ€ (called tagged or labeled samples).

โ€ข Tagged samples are โ€œexpensiveโ€.

โ€ข Two methods: parametric (easier) and non-parametric (harder)

4

(5)

J. Y. Choi. SNU

Learning From Observed Data

5

) ( i P ๏ท

)

| (x ๏ท1

P P(x |๏ท2) P (x |๏ท3)

Hidden Observed

Unsupervised

Supervised

(6)

J. Y. Choi. SNU

Parametric Learning

โ€ข Assume specific parametric distributions with parameters:

๐‘ ๐‘ฅ ๐œ”๐‘– โ‰ˆ ๐‘ ๐‘ฅ|๐œƒ๐‘– , ๐œƒ๐‘– โˆˆ ฮ˜ โŠ‚ ๐‘…๐‘

โ€ข Estimate parameters ๐œƒ(๐ท)เท  from training data D

โ€ข Replace true value of class-conditional density with approximation and apply the Bayesian framework for decision making.

6

(7)

J. Y. Choi. SNU

Parametric Learning

โ€ข Suppose we can assume that the relevant (class-conditional) densities are of some parametric form. That is,

๐‘ ๐‘ฅ ๐œ” โ‰ˆ ๐‘ ๐‘ฅ ๐œƒ , ๐‘คโ„Ž๐‘’๐‘Ÿ๐‘’ ๐œƒ โˆˆ ฮ˜ โˆˆ ๐‘…๐‘

โ€ข Examples of parameterized densities:

โ€ข Binomial: ๐‘ฅ has ๐‘š 1โ€™s and (๐‘› โˆ’ ๐‘š) 0โ€™s ๐‘ ๐‘ฅ ๐œƒ = ๐‘›

๐‘š ๐œƒ๐‘š(1 โˆ’ ๐œƒ)๐‘›โˆ’๐‘š, ฮ˜ = [0, 1]

โ€ข Exponential: Each data point ๐‘ฅ is distributed according to ๐‘ ๐‘ฅ ๐œƒ = ๐œƒ๐‘’โˆ’๐œƒ๐‘ฅ, ฮ˜ = (0, โˆž)

โ€ข Normal:

๐‘ ๐‘ฅ ๐œƒ ~๐‘(๐œ‡, ๐œŽ2)

7

(8)

J. Y. Choi. SNU

Parametric Learning

โ€ข Bayesian Decision ๐‘Ž๐‘Ÿ๐‘” max

๐‘– ๐‘(๐œ”๐‘–|๐‘ฅ)

โ€ข Maximum Likelihood Estimation ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ๐‘– ๐‘(๐ท|๐œƒ๐‘–)

โ€ข Maximum A-Posteriori Estimation ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ๐‘– ๐‘ ๐œƒ๐‘– ๐ท , ๐œƒ๐‘– = ๐‘๐‘œ๐‘›๐‘ ๐‘ก๐‘Ž๐‘›๐‘ก

โ€ข Bayesian Learning (Estimation)

Find ๐‘ ๐œƒ๐‘– ๐ท , ๐œƒ๐‘– = ๐‘Ÿ๐‘Ž๐‘›๐‘‘๐‘œ๐‘š ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’

8

(9)

J. Y. Choi. SNU

Preliminary Examples

9

โ€ข Maximum Likelihood Estimation ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ๐‘– ๐‘(๐ท|๐œƒ๐‘–)

(10)

J. Y. Choi. SNU

Preliminary Examples

10

โ€ข Maximum Likelihood Estimation ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ๐‘– ๐‘(๐ท|๐œƒ๐‘–)

(11)

J. Y. Choi. SNU

Preliminary Examples

11

โ€ข Maximum Likelihood Estimation ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ๐‘– ๐‘(๐ท|๐œƒ๐‘–)

(12)

J. Y. Choi. SNU

Preliminary Examples

12

โ€ข Maximum Likelihood Estimation ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ๐‘– ๐‘(๐ท|๐œƒ๐‘–)

(13)

J. Y. Choi. SNU

Preliminary Examples

13

โ€ข Maximum Likelihood Estimation ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ๐‘– ๐‘(๐ท|๐œƒ๐‘–)

(14)

J. Y. Choi. SNU

Preliminary Examples

14

โ€ข Bayesian Learning (Estimation)

Find ๐‘ ๐œƒ๐‘– ๐ท , ๐œƒ๐‘– = ๐‘Ÿ๐‘Ž๐‘›๐‘‘๐‘œ๐‘š ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’

(15)

J. Y. Choi. SNU

Preliminary Examples

15

โ€ข Bayesian Learning (Estimation)

Find ๐‘ ๐œƒ๐‘– ๐ท , ๐œƒ๐‘– = ๐‘Ÿ๐‘Ž๐‘›๐‘‘๐‘œ๐‘š ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’

(16)

J. Y. Choi. SNU

Preliminary Examples

16

โ€ข Bayesian Learning (Estimation)

Find ๐‘ ๐œƒ๐‘– ๐ท , ๐œƒ๐‘– = ๐‘Ÿ๐‘Ž๐‘›๐‘‘๐‘œ๐‘š ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’

๐œ‡

(17)

J. Y. Choi. SNU

Preliminary Examples

17

โ€ข Bayesian Learning (Estimation)

Find ๐‘ ๐œƒ๐‘– ๐ท , ๐œƒ๐‘– = ๐‘Ÿ๐‘Ž๐‘›๐‘‘๐‘œ๐‘š ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’

๐œ‡

๐œƒ =

(18)

J. Y. Choi. SNU

Maximum Likelihood vs Bayesian Learning

โ€ข Maximum likelihood estimation: estimate parameter value ๐œƒ(๐ท)เท  that makes the data most probable (i.e., maximizes the probability of

obtaining the sample that has actually been observed), ๐‘ ๐‘ฅ ๐ท โ‰ˆ ๐‘(๐‘ฅ| เท ๐œƒ ๐ท ), เท ๐œƒ ๐ท = ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ ๐‘(๐ท|๐œƒ)

โ€ข Bayesian learning: define a prior probability on the model space ๐‘(๐œƒ) and compute the posterior ๐‘(๐œƒ|๐ท). Additional samples sharp the

posterior density which peaks near the true values of the parameters .

18

(19)

J. Y. Choi. SNU

Sampling Model

โ€ข It is assumed that a training set ๐‘† = {(๐‘ฅ๐‘™, เดฅ๐œ”๐‘™)|๐‘™ = 1, โ€ฆ , ๐‘} with independently generated samples is available.

โ€ข The sample set is partitioned into separate sample sets for each class, ๐ท๐‘— = ๐‘ฅ๐‘™ ๐‘ฅ๐‘™, เดฅ๐œ”๐‘™ โˆˆ ๐‘†๐‘—

โ€ข A generic sample set will simply be denoted by ๐ท = {๐‘ฅ๐‘™|๐‘™ = 1, โ€ฆ , ๐‘}

โ€ข Each class-conditional ๐‘(๐‘ฅ|๐œ”๐‘—) is assumed to have a known

parametric form and is uniquely specified by a parameter (vector) ๐œƒ๐‘—.

โ€ข Samples in each set ๐ท๐‘— are assumed to be independent and identically distributed (i.i.d.) according to some true probability law ๐‘(๐‘ฅ|๐œ”๐‘—).

19

(20)

J. Y. Choi. SNU

Log-Likelihood function and Score Function

โ€ข The sample sets are assumed to be functionally independent, so the training set ๐‘†๐‘— contains no information about ๐œƒ๐‘– for ๐‘– โ‰  ๐‘—

โ€ข The i.i.d. assumption implies that

๐‘ ๐ท๐‘— ๐œƒ๐‘— = ฯ‚๐‘ฅโˆˆ๐ท๐‘— ๐‘(๐‘ฅ|๐œƒ๐‘—)

โ€ข Let ๐ท be a generic sample set of size ๐‘› = ๐ท

โ€ข Log-likelihood function:

๐‘™ ๐œƒ; ๐ท โ‰ก ln ๐‘(๐ท|๐œƒ) = เท

๐‘˜=1 ๐‘›

ln ๐‘(๐‘ฅ๐‘˜|๐œƒ)

โ€ข The log-likelihood function is identical to the logarithm of the probability density (or mass) function, but is interpreted as a function of parameter ๐œƒ

20

(21)

J. Y. Choi. SNU

Log-Likelihood Illustration

โ€ข Assume that all the points in ๐ท are drawn from some (one-dimensional) normal distribution with some (known) variance and unknown mean.

21

๐‘ ๐ท ๐œƒ = เท‘

๐‘ฅโˆˆ๐ท

๐‘(๐‘ฅ|๐œƒ)

๐‘™ ๐œƒ; ๐ท โ‰ก ln ๐‘(๐ท|๐œƒ) = เท

๐‘˜=1 ๐‘›

ln ๐‘(๐‘ฅ๐‘˜|๐œƒ)

(22)

J. Y. Choi. SNU

Maximum Likelihood Estimation (MLE)

โ€ข The โ€œmost likely valueโ€ for MLE is given by ๐œƒ ๐ท = ๐‘Ž๐‘Ÿ๐‘”maxเท 

๐œƒโˆˆฮ˜ ๐‘™(๐œƒ; ๐ท)

โ€ข Gradient function

๐›ป๐œƒ๐‘™ ๐œƒ; ๐ท = ๐œ•๐‘™ ๐œƒ;๐ท

๐œ•๐œƒ1 , โ€ฆ ,๐œ•๐‘™ ๐œƒ;๐ท

๐œ•๐œƒ๐‘

โ€ข Necessary condition for MLE (if not on border of domain ฮ˜)

๐›ป๐œƒ๐‘™ ๐œƒ; ๐ท = 0

22

(23)

J. Y. Choi. SNU

Maximum A-Posteriori (MAP) Estimation

โ€ข The โ€œmost likely valueโ€ for MAP is given by

๐œƒ ๐ท = ๐‘Ž๐‘Ÿ๐‘”maxเท 

๐œƒโˆˆฮ˜ ๐‘(๐œƒ|๐ท)

= ๐‘Ž๐‘Ÿ๐‘”max

๐œƒโˆˆฮ˜ ๐‘(๐ท|๐œƒ) ๐‘0( ๐œƒ)/๐‘(๐ท)

= ๐‘Ž๐‘Ÿ๐‘”max

๐œƒโˆˆฮ˜ ln เท‘

๐‘˜

๐‘ ๐‘ฅ๐‘˜ ๐œƒ ๐‘0(๐œƒ)

= ๐‘Ž๐‘Ÿ๐‘”max

๐œƒโˆˆฮ˜ เท

๐‘˜

ln ๐‘ ๐‘ฅ๐‘˜ ๐œƒ + ln ๐‘0(๐œƒ)

โ€ข We can ignore the normalizing factor ๐‘(๐ท)

23

(24)

J. Y. Choi. SNU

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean

โ€ข Suppose that the samples are drawn from a multivariate normal population with mean ๐œ‡, and covariance matrix ฮฃ.

โ€ข By MLE, find unknown parameter ๐œƒ = ๐œ‡ for sample data ๐ท = {๐‘ฅ๐‘˜|๐‘˜ = 1, โ€ฆ , ๐‘›}

24

(25)

J. Y. Choi. SNU

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean

โ€ข Log-likelihood is

๐‘™ ๐œ‡; ๐ท = ln ๐‘ ๐ท ๐œ‡ = ln ฯ‚๐‘˜=1๐‘› ๐‘ ๐‘ฅ๐‘˜ ๐œ‡ = ฯƒ๐‘˜=1๐‘› ln ๐‘ ๐‘ฅ๐‘˜ ๐œ‡ where

ln ๐‘ ๐‘ฅ๐‘˜ ๐œ‡ = โˆ’ 1

2 ๐‘ฅ๐‘˜ โˆ’ ๐œ‡ ๐‘กฮฃโˆ’1 ๐‘ฅ๐‘˜ โˆ’ ๐œ‡ โˆ’ ๐‘‘

2 ln 2๐œ‹ โˆ’ 1

2 ln ฮฃ

โ€ข For a sample point ๐‘ฅ๐‘˜, we have

๐›ป๐œ‡ ln ๐‘ ๐‘ฅ๐‘˜ ๐œ‡ = ฮฃโˆ’1 ๐‘ฅ๐‘˜ โˆ’ ๐œ‡

โ€ข The maximum likelihood estimate for ๐œ‡ must satisfy ฯƒ๐‘˜=1๐‘› ฮฃโˆ’1 ๐‘ฅ๐‘˜ โˆ’ ๐œ‡ = 0 โˆ’โ†’ ๐œ‡ = 1

๐‘› ฯƒ๐‘˜=1๐‘› ๐‘ฅ๐‘˜ (sample mean)

25

(26)

J. Y. Choi. SNU

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean and Covariance

โ€ข In the general multivariate normal case, neither the mean nor the covariance matrix is known.

โ€ข Estimate the mean and covariance from data.

26

(27)

J. Y. Choi. SNU

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean and Covariance

โ€ข In the general multivariate normal case, neither the mean nor the covariance matrix is known.

โ€ข Consider first the univariate case with ๐œƒ1 = ๐œ‡, ๐œƒ2 = ๐œŽ2

โ€ข The log-likelihood of a single point is ln ๐‘ ๐‘ฅ๐‘˜ ๐œƒ = โˆ’ 1

2 ln 2๐œ‹๐œƒ2 โˆ’ 1

2๐œƒ2 (๐‘ฅ๐‘˜ โˆ’ ๐œƒ1)2 and its derivative becomes

๐›ป๐œƒ๐‘™ ๐‘ฅ๐‘˜; ๐œƒ = ๐›ป๐œƒ ln ๐‘ ๐‘ฅ๐‘˜ ๐œƒ =

1

๐œƒ2 (๐‘ฅ๐‘˜ โˆ’ ๐œƒ1)

โˆ’ 1

2๐œƒ2 + (๐‘ฅ๐‘˜ โˆ’ ๐œƒ1 )2 2๐œƒ22

27

(28)

J. Y. Choi. SNU

Example of Maximum Likelihood

The Gaussian Case: Unknown Mean and Covariance

โ€ข Setting the gradient to zero, and using all the sample points, we get the following necessary conditions:

ฯƒ๐‘˜ 1

๐œƒ2 ๐‘ฅ๐‘˜ โˆ’ ๐œƒ1 = 0, ฯƒ๐‘˜(โˆ’ 1

2๐œƒ2 + ๐‘ฅ๐‘˜โˆ’๐œƒ1

2

2๐œƒ22 ) = 0

โ€ข Optimal Solution becomes ฦธ๐œ‡ = ๐œƒ1 = 1

๐‘› ฯƒ๐‘˜=1๐‘› ๐‘ฅ๐‘˜

เทœ

๐œŽ2 = ๐œƒ2 = 1

๐‘› ฯƒ๐‘˜=1๐‘› (๐‘ฅ๐‘˜ โˆ’ ฦธ๐œ‡)2

28

(29)

J. Y. Choi. SNU

Example of Maximum Likelihood

The Gaussian multivariate case

โ€ข For the multivariate case, it is easy to show that the MLE estimates are given by

ฦธ๐œ‡ = 1

๐‘› เท

๐‘˜=1 ๐‘›

๐‘ฅ๐‘˜

ฮฃ = เท 

1

๐‘›

ฯƒ

๐‘˜=1๐‘›

(๐‘ฅ

๐‘˜

โˆ’ ฦธ๐œ‡)(๐‘ฅ

๐‘˜

โˆ’ ฦธ๐œ‡)

๐‘ก

โ€ข The MLE for ๐œŽ2 is biased (i.e., the expected value over all data sets of size n of the sample variance is not equal to the true variance, that is,

๐ธ 1

๐‘› เท

๐‘˜=1 ๐‘›

(๐‘ฅ๐‘˜ โˆ’ ฦธ๐œ‡)2 = ๐‘› โˆ’ 1

๐‘› ๐œŽ2 โ‰  ๐œŽ2

29

(30)

J. Y. Choi. SNU

Interim Summary and Next Outline

โ€ข Maximum Likelihood vs Bayesian Learning

โ€ข Maximum Likelihood Estimation

โ€ข Maximum A-Posteriori (MAP) Estimation

โ€ข Example of ML: The Gaussian Case

โ€ข Bayesian Learning

โ€ข Recursive Bayesian Incremental Learning

30

(31)

J. Y. Choi. SNU

Parametric Density Estimation (II)

Jin Young Choi

Seoul National University

1

(32)

J. Y. Choi. SNU

Outline

โ€ข Density Estimation

โ€ข Maximum Likelihood vs Bayesian Learning

โ€ข Maximum Likelihood Estimation

โ€ข Maximum A-Posteriori (MAP) Estimation

โ€ข Example: The Gaussian Case

โ€ข Bayesian Learning

โ€ข Recursive Bayesian Incremental Learning

2

(33)

J. Y. Choi. SNU

Parametric Learning

โ€ข Bayesian Decision ๐‘Ž๐‘Ÿ๐‘” max

๐‘– ๐‘(๐œ”๐‘–|๐‘ฅ)

โ€ข Maximum Likelihood Estimation ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ๐‘– ๐‘(๐ท|๐œƒ๐‘–)

โ€ข Maximum A-Posteriori Estimation ๐‘Ž๐‘Ÿ๐‘” max

๐œƒ๐‘– ๐‘ ๐œƒ๐‘– ๐ท , ๐œƒ๐‘– = ๐‘๐‘œ๐‘›๐‘ ๐‘ก๐‘Ž๐‘›๐‘ก

โ€ข Bayesian Learning (Estimation)

Find ๐‘ ๐œƒ๐‘– ๐ท , ๐œƒ๐‘– = ๐‘Ÿ๐‘Ž๐‘›๐‘‘๐‘œ๐‘š ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’

3

(34)

J. Y. Choi. SNU

Example of Maximum Likelihood

The Gaussian multivariate case

โ€ข For the multivariate case, ln ๐‘ ๐‘ฅ

๐‘˜

๐œ‡ = โˆ’

1

2

๐‘ฅ

๐‘˜

โˆ’ ๐œ‡

๐‘ก

ฮฃ

โˆ’1

๐‘ฅ

๐‘˜

โˆ’ ๐œ‡ โˆ’

๐‘‘

2

ln 2๐œ‹ โˆ’

1

2

ln ฮฃ

โ€ข MLE estimates are given by

ฦธ๐œ‡ = 1

๐‘› ฯƒ๐‘˜=1๐‘› ๐‘ฅ๐‘˜

ฮฃ = เท 

1

๐‘›

ฯƒ

๐‘˜=1๐‘›

(๐‘ฅ

๐‘˜

โˆ’ ฦธ๐œ‡)(๐‘ฅ

๐‘˜

โˆ’ ฦธ๐œ‡)

๐‘ก

โ€ข The MLE for ฮฃ is biased (i.e., the expected value over all data sets of size n of the sample variance is not equal to the true variance, that is,

๐ธ 1

๐‘› ฯƒ๐‘˜=1๐‘› (๐‘ฅ๐‘˜ โˆ’ ฦธ๐œ‡)(๐‘ฅ๐‘˜ โˆ’ ฦธ๐œ‡)๐‘ก = ๐‘›โˆ’๐‘

๐‘› ฮฃ โ‰  ฮฃ

4

(35)

J. Y. Choi. SNU

Example of Maximum Likelihood:

The Gaussian multivariate case

โ€ข Unbiased estimator for ๐œ‡ and ฮฃ are given by

เทœ

๐œ‡ = 1

๐‘› ฯƒ๐‘˜=1๐‘› ๐‘ฅ๐‘˜

ฮฃ =เท  1

๐‘›โˆ’๐‘ ฯƒ๐‘˜=1๐‘› (๐‘ฅ๐‘˜ โˆ’ ฦธ๐œ‡)(๐‘ฅ๐‘˜ โˆ’ ฦธ๐œ‡)๐‘ก

โ€ข ฮฃเท  is called the unbiased sample covariance matrix

๐ธ 1

๐‘›โˆ’๐‘ ฯƒ๐‘˜=1๐‘› (๐‘ฅ๐‘˜ โˆ’ เทœ๐œ‡)(๐‘ฅ๐‘˜ โˆ’ เทœ๐œ‡)๐‘ก = ๐‘›โˆ’๐‘

๐‘›โˆ’๐‘ฮฃ = ฮฃ

โ€ข Biased sample variance ฮฃเท  is asymptotically unbiased.

๐ธ 1

๐‘› ฯƒ๐‘˜=1๐‘› (๐‘ฅ๐‘˜ โˆ’ เทœ๐œ‡)(๐‘ฅ๐‘˜ โˆ’ เทœ๐œ‡)๐‘ก = ๐‘›โˆ’๐‘

๐‘› ฮฃ โ†’ ฮฃ as ๐‘› โ†’ โˆž

5

(36)

J. Y. Choi. SNU

Bayesian Learning

โ€ข The aim of decision is to find posteriors ๐‘(๐œ”๐‘–|๐‘ฅ) knowing ๐‘(๐‘ฅ|๐œ”๐‘–) and ๐‘(๐œ”๐‘–), but they are unknown.

โ€ข How to estimate them from the training sample set D?

โ€ข Generally used assumptions:

โœ“ Priors generally are known or obtainable from a trivial calculations.

Thus ๐‘ ๐œ”๐‘– ~๐‘(๐œ”๐‘–|๐‘†). cf. ๐‘ ๐œ”๐‘–|๐‘† โ‰  ๐‘(๐œ”๐‘–|๐‘ฅ). ๐‘† includes label.

โœ“ The training set can be separated into ๐‘ subsets: ๐ท1, โ€ฆ , ๐ท๐‘

โœ“ ๐ท๐‘— have no influence on ๐‘(๐ฑ|๐œ”๐‘–, ๐ท๐‘–) if ๐‘– โ‰  ๐‘—.

โœ“ Let the estimate of ๐‘ ๐ฑ|๐œ”๐‘– be ๐‘(๐ฑ|๐œ”๐‘–, ๐ท๐‘–) for each class ๐‘ ๐ฑ|๐œ”๐‘– = ๐‘ ๐ฑ|๐œ”๐‘–, ๐ท๐‘– โ‰… ๐‘ ๐ฑ|๐ท๐‘– : = ๐‘ ๐ฑ|๐œƒ๐‘–

โœ“ Bayesian learning: find ๐‘ ๐œƒ๐‘–|๐ท๐‘– ๐‘œ๐‘š๐‘–๐‘ก ๐‘–๐‘›๐‘‘๐‘’๐‘ฅ ๐‘ ๐œƒ|๐ท

6

(37)

J. Y. Choi. SNU

Bayesian Learning: General Theory

โ€ข Bayesian learning considers ๐œƒ โ† ๐œƒ๐‘– โ† ๐œ”๐‘– (the parameter vector to be estimated) to be a random variable.

โ€ข Before we observe the data, the parameters are described by a prior ๐‘(๐œƒ) which is typically very broad. Once we observed the data, we can make use of Bayesโ€™ formula to find posterior ๐‘(๐œƒ|๐ท). If the values of parameters are more consistent with the data, the posterior is narrower than prior. This is Bayesian learning (see fig.)

7

๐‘ ๐ฑ|๐œ”๐‘– = ๐‘ ๐ฑ|๐œ”๐‘–, ๐ท๐‘– โ‰… ๐‘ ๐ฑ|๐ท๐‘– โ† ๐‘ ๐ฑ|๐œƒ , ๐‘(๐œƒ|๐ท) where ๐œƒ is latent random variable.

Find ๐‘ ๐œƒ ๐ท by Bayes rule ๐‘ ๐œƒ ๐ท โˆ ๐‘ ๐ท|๐œƒ ๐‘(๐œƒ)

(38)

J. Y. Choi. SNU

Bayesian Learning: General Theory

โ€ข Density function for ๐‘ฅ, given the training data set ๐ท,

๐‘(๐‘ฅ|๐œ”) โ‰… ๐‘(๐‘ฅ|๐ท) = เถฑ ๐‘ ๐‘ฅ, ๐œƒ ๐ท ๐‘‘๐œƒ = เถฑ ๐‘ ๐‘ฅ ๐œƒ, ๐ท ๐‘(๐œƒ|๐ท) ๐‘‘๐œƒ

โ€ข By independence between ๐‘ฅ and ๐ท , ๐‘ ๐‘ฅ ๐œƒ, ๐ท = ๐‘ ๐‘ฅ ๐œƒ

โ€ข Therefore ๐‘ ๐‘ฅ ๐ท = ืฌ ๐‘ ๐‘ฅ ๐œƒ ๐‘ ๐œƒ ๐ท ๐‘‘๐œƒ(= ๐‘ ๐‘ฅ ๐œƒ if ๐‘ ๐œƒ ๐ท is impulse).

โ€ข Instead of choosing a specific value for ๐œƒ, the Bayesian approach performs a weighted average over all values of ๐‘(๐œƒ)

โ€ข The weighting factor ๐‘(๐œƒ|๐ท), which is a posterior of ๐œƒ, is determined by starting from some assumed prior ๐‘(๐œƒ)

8

(39)

J. Y. Choi. SNU

โ€ข Then update it using Bayesโ€™ formula to take account of

data set ๐ท. Since ๐ท = ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘› are drawn independently, ๐‘ ๐ท ๐œƒ = ฯ‚๐‘˜=1๐‘› ๐‘ ๐‘ฅ๐‘˜ ๐œƒ ,

which is likelihood function.

โ€ข Posterior for ๐œƒ is

๐‘ ๐œƒ ๐ท = ๐‘ ๐ท ๐œƒ ๐‘(๐œƒ)

๐‘(๐ท) = ๐‘(๐œƒ)

๐‘(๐ท) ฯ‚๐‘˜=1๐‘› ๐‘ ๐‘ฅ๐‘˜ ๐œƒ , where normalization factor ๐‘ ๐ท = ืฌ ๐‘(๐œƒ) ฯ‚๐‘˜=1๐‘› ๐‘ ๐‘ฅ๐‘˜ ๐œƒ ๐‘‘๐œƒ.

9

Bayesian Learning: General Theory

(40)

J. Y. Choi. SNU

Bayesian Learning

โ€“ Univariate Normal Distribution

โ€ข Let us use the Bayesian estimation technique to calculate a posteriori density ๐‘ ๐œƒ ๐ท and the desired probability density (likelihood) ๐‘(๐‘ฅ|๐ท) for the case ๐‘(๐‘ฅ|๐œ‡)~๐‘ ๐œ‡, ฮฃ , where rv ๐œƒ for the parameter ๐œ‡.

โ€ข Univariate Case: ๐‘ ๐œ‡ ๐ท

โ€ข Let ๐œ‡ be the only unknown parameter ๐‘ ๐ฑ|๐œ”๐‘– ~๐‘(๐‘ฅ|๐œ‡)~๐‘ ๐œ‡, ๐œŽ2

10

(41)

J. Y. Choi. SNU

Bayesian Learning

โ€“ Univariate Normal Distribution

โ€ข Conjugate prior probability: normal distribution over ๐œ‡, ๐‘ ๐œ‡ ~๐‘(๐œ‡0, ๐œŽ02)

encodes some prior knowledge about the true mean ๐œ‡0 , while ๐œŽ02 measures our prior uncertainty.

โ€ข If ๐œ‡ is drawn from ๐‘(๐œ‡) then density for ๐‘ฅ is completely determined.

Letting ๐ท = ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘› , we use

๐‘ ๐œ‡ ๐ท = ๐‘ ๐ท ๐œ‡ ๐‘(๐œ‡)

ืฌ ๐‘ ๐ท ๐œ‡ ๐‘ ๐œ‡ ๐‘‘๐œ‡

= ฮฑ เท‘

๐‘˜=1 ๐‘›

๐‘ ๐‘ฅ๐‘˜ ๐œ‡ ๐‘(๐œ‡).

11

(42)

J. Y. Choi. SNU 12

Bayesian Learning

โ€“ Univariate Normal Distribution

1 1 2

( | D) exp

2 2

n n n

p ๏ญ ๏ญ ๏ญ

๏ฐ๏ณ ๏ณ

๏ƒฉ ๏ƒฆ โˆ’ ๏ƒถ ๏ƒน

๏ƒช ๏ƒบ

= โˆ’ ๏ƒง ๏ƒท

๏ƒช ๏ƒจ ๏ƒธ ๏ƒบ

๏ƒซ ๏ƒป

๐‘ ๐œ‡ ๐ท

๐œŽ0

Bayesian Learning

โ€“ Univariate Normal Distribution

posterior

(43)

J. Y. Choi. SNU

โ€ข Computing the posterior distribution ๐‘ ๐œ‡|๐ท = ๐›ผ ๐‘(๐ท ๐œ‡ ๐‘ ๐œ‡

= ๐›ผโ€ฒ exp โˆ’1

2 ฯƒ๐‘˜=1๐‘› (๐‘ฅ๐‘˜โˆ’๐œ‡)

๐œŽ

2 + (๐œ‡โˆ’๐œ‡0)

๐œŽ0

2

= ๐›ผโ€ฒโ€ฒ exp โˆ’ 1

2

๐‘›

๐œŽ2 + 1

๐œŽ02 ๐œ‡2 โˆ’ 2 1

๐œŽ2 ฯƒ๐‘˜=1๐‘› ๐‘ฅ๐‘˜ + ๐œ‡0

๐œŽ02 ๐œ‡

โ€ข Since ๐‘ ๐œ‡|๐ท should be Gaussian ๐‘ ๐œ‡|๐ท = 1

2๐œ‹๐œŽ๐‘› exp โˆ’1

2

(๐œ‡โˆ’๐œ‡๐‘›) ๐œŽ๐‘›

2 = ๐›ผโ€ฒโ€ฒโ€ฒ exp โˆ’ 1

2๐œŽ๐‘›2 ๐œ‡2 โˆ’ ๐œ‡๐‘›

๐œŽ๐‘› ๐œ‡

โ€ข ๐œ‡๐‘› = ๐‘“ ๐‘ฅ๐‘˜, ๐œ‡0, ๐œŽ0 =?, ๐œŽ๐‘› = โ„Ž ๐‘ฅ๐‘˜, ๐œ‡0, ๐œŽ0 =?

13

Bayesian Learning

โ€“ Univariate Normal Distribution

(44)

J. Y. Choi. SNU

โ€ข Solution: 1

๐œŽ๐‘›2 = ๐‘›

๐œŽ2 + 1

๐œŽ02 , ๐œ‡๐‘›

๐œŽ๐‘›2 = ๐‘›

๐œŽ2 ฦธ๐œ‡๐‘› + ๐œ‡0

๐œŽ02

where ฦธ๐œ‡๐‘› = 1

๐‘› ฯƒ๐‘˜=1๐‘› ๐‘ฅ๐‘˜ is the sample mean.

โ€ข Solving explicitly for ๐œ‡๐‘› and ๐œŽ๐‘›2 we obtain ๐œŽ๐‘›2 = ๐œŽ02๐œŽ2

๐‘›๐œŽ02+๐œŽ2 , ๐œ‡๐‘› = ๐‘›๐œŽ02

๐‘›๐œŽ02+๐œŽ2 ฦธ๐œ‡๐‘› + ๐œŽ2

๐‘›๐œŽ02+๐œŽ2 ๐œ‡0

โ€ข ๐œ‡๐‘› represents our best guess for ๐œ‡ after observing ๐‘› samples.

โ€ข ๐œŽ๐‘›2 measures our uncertainty about this guess.

โ€ข ๐œŽ๐‘›2 decreases monotonically with ๐‘› (approaching ๐œŽ๐‘›2 โ†’ ๐œŽ2

๐‘› โ†’ 0 as ๐‘› approaches infinity)

14

Bayesian Learning

โ€“ Univariate Normal Distribution

1 1 2

( | D) exp

2 2

n n n

p ๏ญ ๏ญ ๏ญ

๏ฐ๏ณ ๏ณ

๏ƒฉ ๏ƒฆ โˆ’ ๏ƒถ ๏ƒน

๏ƒช ๏ƒบ

= โˆ’ ๏ƒง ๏ƒท

๏ƒช ๏ƒจ ๏ƒธ ๏ƒบ

๏ƒซ ๏ƒป

(45)

J. Y. Choi. SNU

โ€ข Each additional observation decreases our uncertainty about the true value of ๐œ‡.

โ€ข As ๐‘› increases, ๐œ‡ becomes more and more sharply peaked, approaching a Dirac delta function as ๐‘› approaches infinity. This behavior is known as

Bayesian Learning.

15

( | D) p ๏ญ

Bayesian Learning

โ€“ Univariate Normal Distribution

(46)

J. Y. Choi. SNU

๐œ‡๐‘› = ๐‘›๐œŽ02

๐‘›๐œŽ02 + ๐œŽ2 ฦธ๐œ‡๐‘› + ๐œŽ2

๐‘›๐œŽ02 + ๐œŽ2 ๐œ‡0, ๐œŽ๐‘›2 = ๐œŽ02๐œŽ2 ๐‘›๐œŽ02 + ๐œŽ2

โ€ข In general, ๐œ‡๐‘› is a linear combination of ฦธ๐œ‡๐‘› and ๐œ‡0 , with coefficients that are non-negative and sum to 1.

โ€ข Thus ๐œ‡๐‘› lies somewhere between ฦธ๐œ‡๐‘› and ๐œ‡0.

โ€ข If ๐œŽ0 โ‰  0, ๐œ‡๐‘› โ†’ ฦธ๐œ‡๐‘› as ๐‘› โ†’ โˆž.

โ€ข If ๐œŽ0 = 0, our a priori certainty that ๐œ‡ = ๐œ‡0 = ๐œ‡๐‘› is so strong that no number of observations can change our opinion.

โ€ข If ๐œŽ0 โ‰ซ ๐œŽ, a priori guess is uncertain, and we take ๐œ‡๐‘› = ฦธ๐œ‡๐‘›

โ€ข The ratio ๐œŽ2

๐œŽ02 is called dogmatism.

16

๏ญ ๏ญ= 0

Bayesian Learning

โ€“ Univariate Normal Distribution

(47)

J. Y. Choi. SNU 17

โ€ข The Univariate Case:

๐‘ ๐‘ฅ ๐ท = เถฑ ๐‘ ๐‘ฅ ๐œ‡ ๐‘ ๐œ‡ ๐ท ๐‘‘๐œ‡

= ืฌ 1

2๐œ‹๐œŽ๐œŽ๐‘› exp โˆ’ 1

2

(๐‘ฅโˆ’๐œ‡) ๐œŽ

2 โˆ’ 1

2

(๐œ‡โˆ’๐œ‡๐‘›) ๐œŽ๐‘›

2 ๐‘‘๐œ‡

= 1

2๐œ‹๐œŽ๐œŽ๐‘› exp โˆ’ 1

2

๐‘ฅโˆ’๐œ‡๐‘› 2

๐œŽ2+๐œŽ๐‘›2 ๐‘“ ๐œŽ, ๐œŽ๐‘›

โˆ exp โˆ’ 1

2

๐‘ฅโˆ’๐œ‡๐‘› 2

๐œŽ2+๐œŽ๐‘›2 ๐‘›โ†’โˆž exp โˆ’ 1

2

๐‘ฅโˆ’๐œ‡ 2 ๐œŽ2

Bayesian Learning

โ€“ Univariate Normal Distribution

(48)

J. Y. Choi. SNU

โ€ข Since ๐‘ ๐‘ฅ ๐ท โˆ exp โˆ’ 1

2

๐‘ฅโˆ’๐œ‡๐‘› 2

๐œŽ2+๐œŽ๐‘›2 , we can write ๐‘ ๐‘ฅ ๐ท ~๐‘(๐œ‡๐‘›, ๐œŽ2 + ๐œŽ๐‘›2)

โ€ข To obtain the class conditional probability ๐‘ ๐‘ฅ ๐ท , whose parametric form is known to be ๐‘ ๐‘ฅ ๐œ‡ ~๐‘ ๐œ‡, ๐œŽ2 .

โ€ข We replace ๐œ‡ by ๐œ‡๐‘› and ๐œŽ2 by ๐œŽ2 + ๐œŽ๐‘›2.

โ€ข The conditional mean ๐œ‡๐‘› is treated as if it were the true mean, and the known variance is increased by ๐œŽ๐‘›2 to account for the additional

uncertainty in ๐‘ฅ resulting from our lack of exact knowledge of the mean ๐œ‡ .

18

Bayesian Learning

โ€“ Univariate Normal Distribution

(49)

J. Y. Choi. SNU

Recursive Bayesian Incremental Learning

โ€ข We have seen that ๐‘ ๐ท๐‘› ๐œƒ = ฯ‚๐‘˜=1๐‘› ๐‘ ๐‘ฅ๐‘˜ ๐œƒ = ๐‘ ๐‘ฅ๐‘› ๐œƒ ฯ‚๐‘˜=1๐‘›โˆ’1 ๐‘ ๐‘ฅ๐‘˜ ๐œƒ . Let us define ๐ท๐‘› = ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘› . Then

๐‘ ๐ท๐‘› ๐œƒ = ๐‘ ๐‘ฅ๐‘› ๐œƒ ๐‘ ๐ท๐‘›โˆ’1 ๐œƒ .

โ€ข Substituting into ๐‘ ๐œƒ ๐ท๐‘› and using Bayes we have:

๐‘ ๐œƒ ๐ท๐‘› โˆ ๐‘ ๐ท๐‘› ๐œƒ ๐‘ ๐œƒ = ๐‘ ๐‘ฅ๐‘› ๐œƒ ๐‘ ๐ท๐‘›โˆ’1 ๐œƒ ๐‘ ๐œƒ

= ๐‘ ๐‘ฅ๐‘› ๐œƒ ๐‘ ๐œƒ ๐ท๐‘›โˆ’1 ๐‘ ๐ท๐‘›โˆ’1 ๐‘(๐œƒ) ๐‘ ๐œƒ

โˆ ๐‘ ๐‘ฅ๐‘› ๐œƒ ๐‘ ๐œƒ ๐ท๐‘›โˆ’1

19

(50)

J. Y. Choi. SNU

Recursive Bayesian Incremental Learning

โ€ข While ๐‘ ๐œƒ ๐ท0 = ๐‘(๐œƒ), the repeated use of ๐‘ ๐œƒ ๐ท๐‘› โˆ ๐‘ ๐‘ฅ๐‘› ๐œƒ ๐‘ ๐œƒ ๐ท๐‘›โˆ’1 produces a sequence

๐‘(๐œƒ), ๐‘ ๐œƒ ๐‘ฅ1 = ๐‘ ๐œƒ ๐ท1 , ๐‘(๐œƒ|๐‘ฅ1, ๐‘ฅ2) = ๐‘ ๐œƒ ๐ท2 , โ€ฆ

โ€ข This is called the recursive Bayes approach to the parameter estimation. (Also incremental or on-line learning).

โ€ข When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning.

20

(51)

J. Y. Choi. SNU

โ€ข Example of Recursive Formula

21

Recursive Bayesian Incremental Learning

๐œƒ๐‘› = ๐œŽ๐‘›โˆ’12

๐œŽ๐‘›โˆ’12 + ๐œŽ2 ๐‘ฅ๐‘› + ๐œŽ2

๐œŽ๐‘›โˆ’12 + ๐œŽ2 ๐œƒ๐‘›โˆ’1 ๐œŽ๐‘›2 = ๐œŽ๐‘›โˆ’12 ๐œŽ2 ๐œŽ๐‘›โˆ’12 + ๐œŽ2

Recursive Bayesian Incremental Learning

๐‘(๐œƒ|๐ท๐‘›) โˆ ๐‘(๐ท๐‘›|๐œƒ)๐‘(๐œƒ|๐ท0)

๐‘(๐œƒ|๐ท๐‘›) โˆ ๐‘(๐‘ฅ๐‘›|๐œƒ)๐‘(๐œƒ|๐ท๐‘›โˆ’1) ๐œƒ๐‘› = ๐‘›๐œŽ02

๐‘›๐œŽ02 + ๐œŽ2 ๐‘ฅเทœ๐‘› + ๐œŽ2

๐‘›๐œŽ02 + ๐œŽ2 ๐œƒ0 ๐œŽ๐‘›2 = ๐œŽ02๐œŽ2 ๐‘›๐œŽ02 + ๐œŽ2

(52)

J. Y. Choi. SNU

Maximal Likelihood vs. Bayesian

โ€ข MLP, MAP

โ€ข Bayesian Learning

2 2

( | D )n ( | ) ( | D )n : ( ,n n) p x =

๏ƒฒ

p x ฮธ p ฮธ dฮธ N ๏ฑ ๏ณ +๏ณ

โ€ฆ โ€ฆ

: arg max (D | )

: arg max ( | D )

n n

n n

MLP p

MAP p

๏ฑ

๏ฑ

๏ฑ ๏ฑ

๏ฑ ๏ฑ

=

=

n n

2 ^ 2 2

0 0

(ฮธ|D )= (D |ฮธ) (ฮธ)

( ; n, n) ( ; n, , , )

p p p

f f

๏ก

๏ฑ ๏ฑ ๏ณ = ๏ฑ ๏ฑ ๏ณ ๏ณ ๏ฑ

( | D )n ( | n) : ( n, 2) p x = p x ฮธ N ฮธ ๏ณ

50

(53)

J. Y. Choi. SNU

Maximal Likelihood vs. Bayesian

โ€ข ML and Bayesian estimations are asymptotically equivalent and โ€œconsistentโ€. They yield the same class-conditional densities when the size of the training data grows to infinity.

โ€ข ML is typically computationally easier: in ML we need to do (multidimensional) differentia tion and in Bayesian (multidimensional) integration.

โ€ข ML is often easier to interpret: it returns the single best model (parameter) whereas Bay esian gives a weighted average of models.

โ€ข But for a finite training data (and given a reliable prior) Bayesian is more accurate (uses more of the information).

โ€ข Bayesian with โ€œflatโ€ prior is essentially ML.

23

(54)

J. Y. Choi. SNU

Interim Summary

โ€ข In MLE estimation,

โ€ข In Bayesian Learning

โ€ข Weighted average of Candidate Models

โ€ข Gaussian Model

โ€ข Incremental Bayesian Learning

24

๐‘(๐‘ฅ|๐ท) = เถฑ๐‘(๐‘ฅ|๐œƒ)๐‘(๐œƒ|๐ท)๐‘‘๐œƒ

๐œƒ๐‘› = ๐œŽ๐‘›โˆ’12

๐œŽ๐‘›โˆ’12 + ๐œŽ2 ๐‘ฅ๐‘› + ๐œŽ2

๐œŽ๐‘›โˆ’12 + ๐œŽ2 ๐œƒ๐‘›โˆ’1 ๐œŽ๐‘›2 = ๐œŽ๐‘›โˆ’12 ๐œŽ2 ๐œŽ๐‘›โˆ’12 + ๐œŽ2 ๐‘(๐œƒ|๐ท๐‘›) โˆ ๐‘(๐ท๐‘›|๐œƒ)๐‘(๐œƒ|๐ท0)

๐‘(๐œƒ|๐ท๐‘›) โˆ ๐‘(๐‘ฅ๐‘›|๐œƒ)๐‘(๐œƒ|๐ท๐‘›โˆ’1) ๐œƒ๐‘› = ๐‘›๐œŽ02

๐‘›๐œŽ02 + ๐œŽ2 ๐œ‡เทœ๐‘› + ๐œŽ2

๐‘›๐œŽ02 + ๐œŽ2๐œƒ0 ๐œŽ๐‘›2 = ๐œŽ02๐œŽ2 ๐‘›๐œŽ02 + ๐œŽ2

เทœ

๐œ‡ = 1

๐‘› ฯƒ๐‘˜=1๐‘› ๐‘ฅ๐‘˜, ฮฃ =เท  1

๐‘›โˆ’๐‘ฯƒ๐‘˜=1๐‘› (๐‘ฅ๐‘˜ โˆ’ เทœ๐œ‡)(๐‘ฅ๐‘˜ โˆ’ เทœ๐œ‡)๐‘ก

์ฐธ์กฐ

๊ด€๋ จ ๋ฌธ์„œ

In the second manuscript, the Bayesian MCMC method using a data-based prior distribution and MLE(Maximum Likelihood Estimation) using a quadratic approximation are

Keywords: Approximate maximum likelihood estimation, generalized exponential dis- tribution, mid-point approximation method, progressive type I interval censoring, pro- gressive type

In this paper, we shall find maximum likelihood estimator(MLE), generalized maximum likelihood estimator(GMLE) and Bayesian estimator for reliability of a two-unit

We derive some approximate maximum likelihood estimators (AMLEs) and maximum likelihood estimator (MLE) of the scale parameter in the half-triangle distribution based

The maximum likelihood(ML) estimation has excellent accuracy for frequency estimation of multiple sinusoids, but the maximum likelihood function requires much loss owing to

Keywords: EM algorithm, finite mixture model, initial value, maximum Likelihood Estimation, model-based clustering. โ€  This research was supported by Basic Science Research

Using some Taylor series expansions, we obtain maximum likelihood estimator (MLE) and some approximate maximum likelihood estimators (AMLEs) of the scale parameter in half

The parameter estimation method is proposed Bayesian method to navigate through the model from the results of the decision tree based on the tying state according to