
Bayesian Decision Theory (II)


Jin Young Choi

Seoul National University


Bayesian Decision Theory

• Bayes Formula

• Prior, Posterior, and Likelihood Probabilities

• Bayes Decision

• Risk Formulation

• Expected Loss, Conditional Risk, Total Risk

• Likelihood Ratio Test

• Decision Region

• Classifier for Bayes Decision


Next Outline

• Bayes Classifier

• Normal Density

• ND: Univariate Case

• MND: Multivariate Case 1: Independent Features, Same Variance

• MND: Multivariate Case 2: Same Covariance Matrix

• MND: Multivariate Case 3: Different Covariance Matrices

• Error Probability


Decision Regions

• The likelihood ratio 𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2) vs. 𝑥

• The threshold 𝑇𝑎 for a loss function

Decide 𝜔1 if

𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2) > [(𝜆12 − 𝜆22) 𝑝(𝜔2)] / [(𝜆21 − 𝜆11) 𝑝(𝜔1)] ≡ 𝑇𝑎,

i.e., if 𝑝(𝑥|𝜔1) > 𝑇𝑎 𝑝(𝑥|𝜔2), which is 𝑔1(𝑥) > 𝑔2(𝑥); otherwise decide 𝜔2.


Classifiers; Discriminant Functions

• A classifier can be represented by a set of discriminant functions 𝑔𝑖(𝑥), 𝑖 = 1, … , 𝐶.

• The classifier assigns observation 𝑥 to class 𝜔𝑖 if 𝑔𝑖(𝑥) > 𝑔𝑗(𝑥) for all 𝑗 ≠ 𝑖.


• Example forms: a Gaussian-type discriminant 𝑔𝑖(𝑥) = exp(−½ ‖𝑥 − 𝜇𝑖‖²) and a linear discriminant 𝑔𝑖(𝑥) = 𝑤ᵗ𝑥 + 𝑏.

The Bayes Classifier

• A Bayes classifier can be represented in this way

For the minimum error-rate case

𝑔𝑖(𝑥) = 𝑝(𝜔𝑖|𝑥)

For the general case with risks

𝑔𝑖(𝑥) = −𝑅(𝛼𝑖|𝑥) = −∑𝑗 𝜆𝑖𝑗 𝑝(𝜔𝑗|𝑥)

• If we replace 𝑔𝑖(𝑥) by 𝑓(𝑔𝑖(𝑥)), where 𝑓(·) is a monotonically increasing function (e.g., log), the resulting classification is unchanged.

𝑔𝑖(𝑥) = 𝑝(𝑥|𝜔𝑖) 𝑝(𝜔𝑖)

𝑔𝑖(𝑥) = ln 𝑝(𝑥|𝜔𝑖) + ln 𝑝(𝜔𝑖)
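A minimal numerical sketch of this invariance, assuming NumPy/SciPy and two made-up univariate Gaussian class-conditional densities: the class that maximizes 𝑝(𝑥|𝜔𝑖)𝑝(𝜔𝑖) is the same class that maximizes ln 𝑝(𝑥|𝜔𝑖) + ln 𝑝(𝜔𝑖).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D example: two Gaussian class-conditional densities and priors.
means, sigmas, priors = [0.0, 2.0], [1.0, 1.5], [0.4, 0.6]

def g_product(x, i):   # g_i(x) = p(x|w_i) p(w_i)
    return norm.pdf(x, means[i], sigmas[i]) * priors[i]

def g_log(x, i):       # g_i(x) = ln p(x|w_i) + ln p(w_i)
    return norm.logpdf(x, means[i], sigmas[i]) + np.log(priors[i])

for x in [-1.0, 0.5, 1.3, 3.0]:
    assert np.argmax([g_product(x, i) for i in range(2)]) == \
           np.argmax([g_log(x, i) for i in range(2)])   # same decision either way
```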


The Bayes Classifier

• The effect of any decision rule is to divide the feature space into C decision regions, 𝑅1, … , 𝑅𝐶.

• If 𝑔𝑖(𝑥) > 𝑔𝑗(𝑥) for all 𝑗 ≠ 𝑖, then 𝑥 is in 𝑅𝑖 and 𝑥 is assigned to 𝜔𝑖.

• Decision regions are separated by decision boundaries.

• Decision boundaries are surfaces in the feature space.


The Decision Regions

• Two-dimensional, two-category classifier

[Figure: decision regions in a two-dimensional feature space; Feature 1 = Weight, Feature 2 = Height]


Two-Category Case

Use two discriminant functions 𝑔1(𝑥) and 𝑔2(𝑥), and assign 𝑥 to 𝜔1 if 𝑔1(𝑥) > 𝑔2(𝑥).

Alternative: define a single discriminant function

𝑔(𝑥) = 𝑔1(𝑥) − 𝑔2(𝑥),

and decide 𝜔1 if 𝑔(𝑥) > 0, otherwise decide 𝜔2.

In the two-category case, two forms are frequently used:

𝑔(𝑥) = 𝑝(𝜔1|𝑥) − 𝑝(𝜔2|𝑥)

𝑔(𝑥) = ln[𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2)] + ln[𝑝(𝜔1)/𝑝(𝜔2)]
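A small sketch of the second (log-ratio) form for two hypothetical univariate Gaussian classes; decide 𝜔1 when 𝑔(𝑥) > 0. The means, variance, and priors below are illustration values.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|w1), p(x|w2) and priors.
mu1, mu2, sigma = 0.0, 2.0, 1.0
p1, p2 = 0.5, 0.5

def g(x):
    # g(x) = ln[p(x|w1)/p(x|w2)] + ln[p(w1)/p(w2)]
    return (norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu2, sigma)
            + np.log(p1) - np.log(p2))

x = 0.7
print("decide w1" if g(x) > 0 else "decide w2")
```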


Normal Density - Univariate Case

• Gaussian density with mean 𝜇 and standard deviation 𝜎 (𝜎² is called the variance)

𝑝(𝑥) = [1/((2𝜋)^{1/2} 𝜎)] exp(−½ ((𝑥 − 𝜇)/𝜎)²),  i.e., 𝑝(𝑥) ~ 𝑁(𝜇, 𝜎²)

• It can be shown that:

𝜇 = 𝐸[𝑥] = ∫_{−∞}^{∞} 𝑥 𝑝(𝑥) 𝑑𝑥

𝜎² = 𝐸[(𝑥 − 𝜇)²] = ∫_{−∞}^{∞} (𝑥 − 𝜇)² 𝑝(𝑥) 𝑑𝑥
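A quick numerical check of these definitions, as a sketch using trapezoid-rule integration; the values of 𝜇 and 𝜎 are arbitrary illustration choices.

```python
import numpy as np

mu, sigma = 1.0, 2.0                       # illustration values
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20001)
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

mean_est = np.trapz(x * p, x)              # ∫ x p(x) dx      ≈ μ
var_est  = np.trapz((x - mu) ** 2 * p, x)  # ∫ (x−μ)² p(x) dx ≈ σ²
print(mean_est, var_est)                   # ≈ 1.0, ≈ 4.0
```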


Normal Density - Multivariate Case

• The general multivariate normal density (MND) in 𝑑 dimensions is written as

𝑝(𝑥) = [1/((2𝜋)^{𝑑/2} |Σ|^{1/2})] exp(−½ (𝑥 − 𝜇)ᵗ Σ⁻¹ (𝑥 − 𝜇))

𝜇 = 𝐸[𝑥] = ∫ 𝑥 𝑝(𝑥) 𝑑𝑥

Σ = 𝐸[(𝑥 − 𝜇)(𝑥 − 𝜇)ᵗ],  Σ𝑖𝑗 = 𝐸[(𝑥𝑖 − 𝜇𝑖)(𝑥𝑗 − 𝜇𝑗)]

• The covariance matrix Σ is always symmetric and positive semidefinite.
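A direct sketch of this density; the 2-D 𝜇 and Σ below are made-up illustration values.

```python
import numpy as np

def mnd_pdf(x, mu, Sigma):
    """Multivariate normal density p(x) for a d-dimensional x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff           # (x−μ)ᵗ Σ⁻¹ (x−μ)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mnd_pdf(np.array([0.5, 0.5]), mu, Sigma))
```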


Normal Density - Multivariate Case

• The multivariate normal density (MND) is completely specified by 𝑑 + 𝑑(𝑑 + 1)/2 parameters. Samples drawn from an MND fall in a cluster whose center is determined by 𝜇 and whose shape is determined by Σ. The loci of points of constant density are hyperellipsoids

𝑟² = (𝑥 − 𝜇)ᵗ Σ⁻¹ (𝑥 − 𝜇) = constant

• 𝑟 is called the Mahalanobis distance from 𝑥 to 𝜇.

• The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
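A sketch computing the Mahalanobis distance and the principal axes (eigenvectors of Σ); 𝜇, Σ, and the query point are illustration values.

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

x = np.array([1.0, 2.0])
r = np.sqrt((x - mu) @ np.linalg.inv(Sigma) @ (x - mu))  # Mahalanobis distance

# Principal axes of the constant-density hyperellipsoids:
eigvals, eigvecs = np.linalg.eigh(Sigma)   # columns of eigvecs are the axes
print(r, eigvals, eigvecs)
```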


Normal Density - Multivariate Case

• The minimum-error-rate classification can be achieved using the discriminant functions:

𝑔𝑖(𝑥) = 𝑝(𝑥|𝜔𝑖) 𝑝(𝜔𝑖)  or  𝑔𝑖(𝑥) = ln 𝑝(𝑥|𝜔𝑖) + ln 𝑝(𝜔𝑖)

• If 𝑝(𝑥|𝜔𝑖) ~ 𝑁(𝜇𝑖, Σ𝑖), then

𝑔𝑖(𝑥) = −½ (𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − ½ ln|Σ𝑖| + ln 𝑝(𝜔𝑖)

since

𝑝(𝑥|𝜔𝑖) = [1/((2𝜋)^{𝑑/2} |Σ𝑖|^{1/2})] exp(−½ (𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖))
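A sketch that evaluates this discriminant directly from (𝜇𝑖, Σ𝑖, 𝑝(𝜔𝑖)); the two-class 2-D setting below is hypothetical.

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-μ)ᵗ Σ⁻¹ (x-μ) - d/2 ln 2π - 1/2 ln|Σ| + ln p(ω_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Hypothetical two-class problem in 2-D:
x = np.array([0.5, 1.0])
scores = [g(x, np.array([0.0, 0.0]), np.eye(2), 0.5),
          g(x, np.array([2.0, 2.0]), np.eye(2), 0.5)]
print("assign to class", int(np.argmax(scores)) + 1)
```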


Discriminant Function for Normal Density

• Assume the features are statistically independent and each feature has the same variance, that is, Σ𝑖 = 𝜎²𝐈.

• What is the discriminant function?


Discriminant Function for Normal Density

• Assume the features are statistically independent and each feature has the same variance, that is, Σ𝑖 = 𝜎²𝐈.

• What is the discriminant function?

𝑔𝑖(𝑥) = −½ (𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − ½ ln|Σ𝑖| + ln 𝑝(𝜔𝑖)

= −‖𝑥 − 𝜇𝑖‖²/(2𝜎²) − (𝑑/2) ln 2𝜋 − ½ ln 𝜎^{2𝑑} + ln 𝑝(𝜔𝑖)

Since −(𝑑/2) ln 2𝜋 − ½ ln 𝜎^{2𝑑} is independent of the class,

𝑔𝑖(𝑥) = −‖𝑥 − 𝜇𝑖‖²/(2𝜎²) + ln 𝑝(𝜔𝑖)


Discriminant Function for Normal Density

• What is the discriminant function?

𝑔𝑖(𝑥) = −‖𝑥 − 𝜇𝑖‖²/(2𝜎²) + ln 𝑝(𝜔𝑖)

= −(1/(2𝜎²)) (𝑥ᵗ𝑥 − 2𝜇𝑖ᵗ𝑥 + 𝜇𝑖ᵗ𝜇𝑖) + ln 𝑝(𝜔𝑖)

Since 𝑥ᵗ𝑥 is also independent of the class,

𝑔𝑖(𝑥) = 𝑤𝑖ᵗ𝑥 + 𝑤𝑖0, where

𝑤𝑖 = 𝜇𝑖/𝜎²

𝑤𝑖0 = −𝜇𝑖ᵗ𝜇𝑖/(2𝜎²) + ln 𝑝(𝜔𝑖)
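A sketch of Case 1 in code: build 𝑤𝑖 and 𝑤𝑖0 from (𝜇𝑖, 𝜎², 𝑝(𝜔𝑖)) and check that the linear form gives the same decision as the distance form; all numbers are illustration values.

```python
import numpy as np

sigma2 = 1.5                                   # shared variance σ²
mus    = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.3, 0.7]

def g_linear(x, i):
    w_i  = mus[i] / sigma2                     # w_i = μ_i / σ²
    w_i0 = -(mus[i] @ mus[i]) / (2 * sigma2) + np.log(priors[i])
    return w_i @ x + w_i0

def g_distance(x, i):
    return -np.sum((x - mus[i]) ** 2) / (2 * sigma2) + np.log(priors[i])

x = np.array([0.8, 0.6])
# The two forms differ only by the class-independent term -xᵗx/(2σ²),
# so they pick the same class.
assert np.argmax([g_linear(x, i) for i in range(2)]) == \
       np.argmax([g_distance(x, i) for i in range(2)])
```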


Discriminant Function for Normal Density

• The decision surface between classes 𝑖 and 𝑗 is obtained by letting 𝑔𝑖(𝑥) = 𝑔𝑗(𝑥),

which yields

𝑔(𝑥) = 𝑔𝑖(𝑥) − 𝑔𝑗(𝑥) = 𝑤ᵗ𝑥 + 𝑤0 = 0, where

𝑤 = 𝑤𝑖 − 𝑤𝑗 = (𝜇𝑖 − 𝜇𝑗)/𝜎²

𝑤0 = 𝑤𝑖0 − 𝑤𝑗0 = −(𝜇𝑖ᵗ𝜇𝑖 − 𝜇𝑗ᵗ𝜇𝑗)/(2𝜎²) + ln 𝑝(𝜔𝑖) − ln 𝑝(𝜔𝑗)

• This linear classifier corresponds to a neuron model as

• 𝑖 −class region if 𝑔 𝑥 > 0

• 𝑗 −class region if 𝑔 𝑥 < 0

• decision boundary if 𝑔 𝑥 = 0


• If 𝑝(𝜔𝑖) is equal to 𝑝(𝜔𝑗), the boundary passes through the midpoint of 𝜇𝑖 and 𝜇𝑗, perpendicular to the line joining them.

Discriminant Function for Normal Density


Discriminant Function for Normal Density

• If 𝑝(𝜔𝑖) is not equal to 𝑝(𝜔𝑗), the boundary shifts away from the mean of the more probable class.


Discriminant Function for Normal Density

• Assume the features are not statistically independent and each feature has a different variance, but every class has the same covariance matrix, that is, Σ𝑖 = Σ.

• What is the discriminant function?


Discriminant Function for Normal Density

• Assume the features are not statistically independent and each feature has a different variance, but every class has the same covariance matrix, that is, Σ𝑖 = Σ.

• What is the discriminant function?

𝑔𝑖(𝑥) = −½ (𝑥 − 𝜇𝑖)ᵗ Σ⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − ½ ln|Σ| + ln 𝑝(𝜔𝑖)

Since −(𝑑/2) ln 2𝜋 − ½ ln|Σ| is independent of the class,

𝑔𝑖(𝑥) = −½ (𝑥 − 𝜇𝑖)ᵗ Σ⁻¹ (𝑥 − 𝜇𝑖) + ln 𝑝(𝜔𝑖)


Discriminant Function for Normal Density

• What is the discriminant function?

𝑔𝑖(𝑥) = −½ (𝑥 − 𝜇𝑖)ᵗ Σ⁻¹ (𝑥 − 𝜇𝑖) + ln 𝑝(𝜔𝑖)

= −½ (𝑥ᵗΣ⁻¹𝑥 − 2𝜇𝑖ᵗΣ⁻¹𝑥 + 𝜇𝑖ᵗΣ⁻¹𝜇𝑖) + ln 𝑝(𝜔𝑖)

Since 𝑥ᵗΣ⁻¹𝑥 is also independent of the class,

𝑔𝑖(𝑥) = 𝑤𝑖ᵗ𝑥 + 𝑤𝑖0, where

𝑤𝑖 = Σ⁻¹𝜇𝑖

𝑤𝑖0 = −½ 𝜇𝑖ᵗΣ⁻¹𝜇𝑖 + ln 𝑝(𝜔𝑖)


Discriminant Function for Normal Density

The decision surface between classes 𝑖 and 𝑗 is obtained by letting 𝑔𝑖(𝑥) = 𝑔𝑗(𝑥),

which yields

𝑔(𝑥) = 𝑔𝑖(𝑥) − 𝑔𝑗(𝑥) = 𝑤ᵗ𝑥 + 𝑤0 = 0, where

𝑤 = 𝑤𝑖 − 𝑤𝑗 = Σ⁻¹(𝜇𝑖 − 𝜇𝑗)

𝑤0 = 𝑤𝑖0 − 𝑤𝑗0 = −½ (𝜇𝑖ᵗΣ⁻¹𝜇𝑖 − 𝜇𝑗ᵗΣ⁻¹𝜇𝑗) + ln 𝑝(𝜔𝑖) − ln 𝑝(𝜔𝑗)

(A code sketch of this construction follows the list below.)

• This linear classifier corresponds to a neuron model as

• 𝑖 −class region if 𝑔 𝑥 > 0

• 𝑗 −class region if 𝑔 𝑥 < 0

• decision boundary if 𝑔 𝑥 = 0
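A sketch constructing this shared-covariance boundary: compute 𝑤 = Σ⁻¹(𝜇𝑖 − 𝜇𝑗) and 𝑤0, then classify by the sign of 𝑔(𝑥) = 𝑤ᵗ𝑥 + 𝑤0; Σ, the means, and the priors are illustration values.

```python
import numpy as np

Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                # shared covariance matrix
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 1.0])
p_i, p_j   = 0.5, 0.5

Sinv = np.linalg.inv(Sigma)
w  = Sinv @ (mu_i - mu_j)                     # w = Σ⁻¹(μ_i − μ_j)
w0 = (-0.5 * (mu_i @ Sinv @ mu_i - mu_j @ Sinv @ mu_j)
      + np.log(p_i) - np.log(p_j))

def decide(x):
    g = w @ x + w0                            # g(x) = wᵗx + w0
    return "class i" if g > 0 else ("class j" if g < 0 else "on boundary")

print(decide(np.array([0.5, 0.5])))
```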


Discriminant Function for Normal Density

• Assume the features are not statistically independent and each feature has a different variance, and each class has a different covariance matrix Σ𝑖.

• What is the discriminant function?


Discriminant Function for Normal Density

• Assume the features are not statistically independent and each feature has a different variance, and each class has a different covariance matrix Σ𝑖.

• What is the discriminant function?

𝑔𝑖(𝑥) = −½ (𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − ½ ln|Σ𝑖| + ln 𝑝(𝜔𝑖)

• Since −(𝑑/2) ln 2𝜋 is independent of the class,

𝑔𝑖(𝑥) = −½ (𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − ½ ln|Σ𝑖| + ln 𝑝(𝜔𝑖)


Discriminant Function for Normal Density

• What is the discriminant function?

𝑔𝑖(𝑥) = −½ (𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − ½ ln|Σ𝑖| + ln 𝑝(𝜔𝑖)

= −½ (𝑥ᵗΣ𝑖⁻¹𝑥 − 2𝜇𝑖ᵗΣ𝑖⁻¹𝑥 + 𝜇𝑖ᵗΣ𝑖⁻¹𝜇𝑖) − ½ ln|Σ𝑖| + ln 𝑝(𝜔𝑖)

Since 𝑥ᵗΣ𝑖⁻¹𝑥 is not independent of the class,

𝑔𝑖(𝑥) = 𝑥ᵗ𝑊𝑖𝑥 + 𝑤𝑖ᵗ𝑥 + 𝑤𝑖0

where

𝑊𝑖 = −½ Σ𝑖⁻¹

𝑤𝑖 = Σ𝑖⁻¹𝜇𝑖

𝑤𝑖0 = −½ 𝜇𝑖ᵗΣ𝑖⁻¹𝜇𝑖 − ½ ln|Σ𝑖| + ln 𝑝(𝜔𝑖)
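A sketch of the Case 3 quadratic discriminant assembled from (𝜇𝑖, Σ𝑖, 𝑝(𝜔𝑖)); the two classes below are hypothetical.

```python
import numpy as np

def quadratic_g(x, mu, Sigma, prior):
    """g_i(x) = xᵗ W_i x + w_iᵗ x + w_i0 with a class-dependent Σ_i."""
    Sinv = np.linalg.inv(Sigma)
    W_i  = -0.5 * Sinv
    w_i  = Sinv @ mu
    w_i0 = (-0.5 * mu @ Sinv @ mu
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))
    return x @ W_i @ x + w_i @ x + w_i0

x = np.array([1.0, 0.5])
g1 = quadratic_g(x, np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5)
g2 = quadratic_g(x, np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.5]]), 0.5)
print("class 1" if g1 > g2 else "class 2")
```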


Decision Boundary for General Gaussian

• Decision boundaries are hyperquadrics in the general case.


Linear Mapping

𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥



Affine Mapping

𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥 + 𝑏


[Figure: affine mapping from 𝑋 to 𝑌, with a constant input 1 weighted by the bias 𝑏]

Nonlinear Mapping

𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎(𝑊𝑥 + 𝑏), 𝑓(𝑥) = 𝑎ᵗ𝑦

[Figure: nonlinear mapping of 𝑥 ∈ 𝑋 to 𝑦 = (𝑦1, …, 𝑦𝑛) ∈ 𝑌 and scalar output 𝑓(𝑥) = 𝑎ᵗ𝑦, with a constant input 1 weighted by the bias 𝑏]
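A sketch of this composition in code; the shapes, the random weights, and the choice of the logistic sigmoid for 𝜎 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))     # maps x ∈ R³ to a 5-dimensional pre-activation
b = rng.normal(size=5)
a = rng.normal(size=5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    y = sigmoid(W @ x + b)      # nonlinear mapping y = σ(Wx + b)
    return a @ y                # scalar output f(x) = aᵗy

print(f(np.array([0.2, -1.0, 0.5])))
```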


Error Probabilities and Integrals

• Consider the 2-class problem and suppose that the feature space is divided into 2 regions 𝑅1 and 𝑅2. There are 2 ways in which a classification error can occur.

• An observation 𝑥 falls in 𝑅2, and the true state is 𝜔1.

• An observation 𝑥 falls in 𝑅1, and the true state is 𝜔2.

• The error probability

𝑃(𝑒𝑟𝑟𝑜𝑟) = 𝑃(𝑥 ∈ 𝑅2|𝜔1) 𝑝(𝜔1) + 𝑃(𝑥 ∈ 𝑅1|𝜔2) 𝑝(𝜔2)

= ∫_{𝑅2} 𝑝(𝑥|𝜔1) 𝑝(𝜔1) 𝑑𝑥 + ∫_{𝑅1} 𝑝(𝑥|𝜔2) 𝑝(𝜔2) 𝑑𝑥
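A sketch that evaluates this error probability numerically for two hypothetical 1-D Gaussian classes, using the Bayes rule itself to define 𝑅1 and 𝑅2.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D two-class problem.
mu1, mu2, sigma = 0.0, 2.0, 1.0
p1, p2 = 0.6, 0.4

x = np.linspace(-10, 12, 200001)
post1 = norm.pdf(x, mu1, sigma) * p1      # p(x|ω1) p(ω1)
post2 = norm.pdf(x, mu2, sigma) * p2      # p(x|ω2) p(ω2)

in_R1 = post1 > post2                     # Bayes decision regions R1, R2
# P(error) = ∫_{R2} p(x|ω1)p(ω1) dx + ∫_{R1} p(x|ω2)p(ω2) dx
p_error = np.trapz(np.where(in_R1, post2, post1), x)
print(p_error)
```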


Error Probabilities and Integrals

• If the decision boundary is chosen arbitrarily, the probability of error is not as small as it might be.

• 𝑥𝐵 is the Bayes optimal decision boundary and gives the lowest probability of error.

• The Bayes classifier maximizes the probability of correct classification.

𝑃(𝑐𝑜𝑟𝑟𝑒𝑐𝑡) = ∑_{𝑖=1}^{𝐶} 𝑃(𝑥 ∈ 𝑅𝑖, 𝜔𝑖) = ∑_{𝑖=1}^{𝐶} ∫_{𝑅𝑖} 𝑝(𝑥|𝜔𝑖) 𝑝(𝜔𝑖) 𝑑𝑥
