Jin Young Choi
Seoul National University
Bayesian Decision Theory
• Bayes Formula
• Prior, Posterior, and Likelihood Probabilities
• Bayes Decision
• Risk Formulation
• Expected Loss, Conditional Risk, Total Risk
• Likelihood Ratio Test
• Decision Region
• Classifier for Bayes Decision
Next Outline
• Bayes Classifier
• Normal Density
• ND: Univariate Case
• MND: Multivariate Case 1: Independent Features, Same Variance
• MND: Multivariate Case 2: Same Covariance Matrix
• MND: Multivariate Case 3: Different Covariance Matrices
• Error Probability
Decision Regions
• The likelihood ratio 𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2) vs. 𝑥
• The threshold 𝑇𝑎 for a loss function
Decide 𝜔1, i.e., 𝑔1(𝑥) > 𝑔2(𝑥), if
𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2) > (𝜆12 − 𝜆22)𝑝(𝜔2) / [(𝜆21 − 𝜆11)𝑝(𝜔1)] = 𝑇𝑎,
and decide 𝜔2 otherwise.
[Figure: the likelihood ratio 𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2) plotted against 𝑥; the regions where it exceeds the threshold 𝑇𝑎 are decided as 𝜔1]
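A minimal numeric sketch of this test, assuming illustrative univariate Gaussian class-conditional densities; the loss matrix, priors, and Gaussian parameters below are not from the slides:

```python
import numpy as np
from scipy.stats import norm

# Loss matrix entries lambda_ij = loss for deciding omega_i when omega_j is true.
lam = np.array([[0.0, 2.0],   # lambda_11, lambda_12
                [1.0, 0.0]])  # lambda_21, lambda_22
p_w1, p_w2 = 0.6, 0.4         # priors p(omega_1), p(omega_2)

# Threshold T_a = (lambda_12 - lambda_22) p(omega_2) / ((lambda_21 - lambda_11) p(omega_1))
T_a = (lam[0, 1] - lam[1, 1]) * p_w2 / ((lam[1, 0] - lam[0, 0]) * p_w1)

def decide(x):
    # Likelihood ratio p(x|omega_1)/p(x|omega_2) with illustrative class densities.
    ratio = norm.pdf(x, loc=-1.0, scale=1.0) / norm.pdf(x, loc=1.0, scale=1.0)
    return 1 if ratio > T_a else 2

print(T_a, decide(0.0))
```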
Classifiers; Discriminant Functions
• A classifier can be represented by a set of discriminant functions 𝑔𝑖(𝑥), 𝑖 = 1, … , 𝐶.
• The classifier assigns observation 𝑥 to class 𝜔𝑖 if 𝑔𝑖(𝑥) > 𝑔𝑗(𝑥) for all 𝑗 ≠ 𝑖.
For example, a Gaussian-shaped discriminant 𝑔𝑖(𝑥) = exp(−(1/2)‖𝑥 − 𝜇𝑖‖²) or a linear discriminant 𝑔𝑖(𝑥) = 𝑤ᵗ𝑥 + 𝑏.
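As a sketch of the idea, the classifier below evaluates a small set of hand-picked discriminant functions (one Gaussian-shaped, one linear, both with illustrative parameters) and returns the argmax class:

```python
import numpy as np

# A classifier as a set of discriminant functions g_i(x): pick the class
# whose g_i(x) is largest.
def g1(x):
    mu1 = np.array([0.0, 0.0])                 # illustrative mean
    return np.exp(-0.5 * np.sum((x - mu1) ** 2))

def g2(x):
    w, b = np.array([0.3, -0.2]), 0.1          # illustrative linear weights
    return w @ x + b

def classify(x, discriminants=(g1, g2)):
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores)) + 1          # class index of omega_i

print(classify(np.array([0.5, 0.5])))
```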
The Bayes Classifier
• A Bayes classifier can be represented in this way
• For the minimum error-rate case
𝑔𝑖(𝑥) = 𝑝(𝜔𝑖|𝑥)
• For the general case with risks
𝑔𝑖(𝑥) = −𝑅(𝛼𝑖|𝑥) = −∑𝑗 𝜆𝑖𝑗 𝑝(𝜔𝑗|𝑥)
• If we replace 𝑔𝑖(𝑥) by 𝑓(𝑔𝑖(𝑥)), where 𝑓(⋅) is a monotonically increasing function (e.g., log), the resulting classification is unchanged. Two equivalent forms are:
𝑔𝑖(𝑥) = 𝑝(𝑥|𝜔𝑖)𝑝(𝜔𝑖)
𝑔𝑖(𝑥) = ln 𝑝(𝑥|𝜔𝑖) + ln 𝑝(𝜔𝑖)
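A short sketch of the risk-based form, using an illustrative loss matrix and an illustrative posterior vector; the action minimizing conditional risk is the one maximizing 𝑔𝑖(𝑥):

```python
import numpy as np

# g_i(x) = -R(alpha_i|x) = -sum_j lambda_ij p(omega_j|x): choose the action
# with the smallest conditional risk, i.e., the largest g_i(x).
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])      # illustrative lambda_ij
post = np.array([0.7, 0.3])       # illustrative p(omega_j|x) at some observed x

g = -(lam @ post)                 # g_i(x) for each action alpha_i
decision = np.argmax(g) + 1
print(g, decision)                # picks omega_1 here
```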
The Bayes Classifier
• The effect of any decision rule is to divide the feature space into C decision regions, 𝑅1, … , 𝑅𝐶.
• If 𝑔𝑖(𝑥) > 𝑔𝑗(𝑥) for all 𝑗 ≠ 𝑖, then 𝑥 is in 𝑅𝑖 and is assigned to 𝜔𝑖.
• Decision regions are separated by decision boundaries.
• Decision boundaries are surfaces in the feature space.
The Decision Regions
• Two-dimensional two-category classifier
[Figure: decision regions in a two-dimensional feature space (Feature 1: Weight, Feature 2: Height) with an observation 𝑥]
Two-Category Case
• Use two discriminant functions 𝑔1(𝑥) and 𝑔2(𝑥), and assign 𝑥 to 𝜔1 if 𝑔1(𝑥) > 𝑔2(𝑥).
• Alternative: define a single discriminant function
• 𝑔(𝑥) = 𝑔1(𝑥) − 𝑔2(𝑥),
• decide 𝜔1 if 𝑔(𝑥) > 0, otherwise decide 𝜔2.
• In the two-category case, two forms are frequently used:
• 𝑔(𝑥) = 𝑝(𝜔1|𝑥) − 𝑝(𝜔2|𝑥)
• 𝑔(𝑥) = ln[𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2)] + ln[𝑝(𝜔1)/𝑝(𝜔2)]
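A minimal sketch of the second dichotomizer form, assuming illustrative univariate Gaussian class-conditionals and equal priors:

```python
import numpy as np
from scipy.stats import norm

# g(x) = ln[p(x|w1)/p(x|w2)] + ln[p(w1)/p(w2)]; decide omega_1 when g(x) > 0.
p1, p2 = 0.5, 0.5                             # illustrative priors

def g(x):
    return (norm.logpdf(x, loc=-1.0, scale=1.0)    # illustrative p(x|omega_1)
            - norm.logpdf(x, loc=1.0, scale=1.5)   # illustrative p(x|omega_2)
            + np.log(p1 / p2))

x = 0.2
print("omega_1" if g(x) > 0 else "omega_2")
```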
Normal Density - Univariate Case
• Gaussian density with mean 𝜇 and standard deviation 𝜎 (𝜎² is called the variance)
𝑝(𝑥) = (1/((2𝜋)^{1/2}𝜎)) exp(−(1/2)((𝑥 − 𝜇)/𝜎)²),  i.e., 𝑝(𝑥) ∼ 𝑁(𝜇, 𝜎²)
• It can be shown that:
𝜇 = 𝐸[𝑥] = ∫_{−∞}^{∞} 𝑥 𝑝(𝑥) 𝑑𝑥
𝜎² = 𝐸[(𝑥 − 𝜇)²] = ∫_{−∞}^{∞} (𝑥 − 𝜇)² 𝑝(𝑥) 𝑑𝑥
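A quick numerical check of these identities for illustrative values of 𝜇 and 𝜎, approximating the integrals with a Riemann sum:

```python
import numpy as np

# Evaluate the univariate Gaussian density and verify E[x] and E[(x-mu)^2]
# numerically (mu and sigma are illustrative).
mu, sigma = 1.0, 2.0

def p(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
mean = np.sum(x * p(x)) * dx                  # ~ 1.0
var = np.sum((x - mean) ** 2 * p(x)) * dx     # ~ 4.0
print(mean, var)
```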
Normal Density - Multivariate Case
• The general multivariate normal density (MND) in 𝑑 dimensions is written as
𝑝(𝑥) = (1/((2𝜋)^{𝑑/2} |Σ|^{1/2})) exp(−(1/2)(𝑥 − 𝜇)ᵗ Σ⁻¹ (𝑥 − 𝜇))
𝜇 = 𝐸[𝑥] = ∫ 𝑥 𝑝(𝑥) 𝑑𝑥
Σ = 𝐸[(𝑥 − 𝜇)(𝑥 − 𝜇)ᵗ],  Σ_{𝑖𝑗} = 𝐸[(𝑥𝑖 − 𝜇𝑖)(𝑥𝑗 − 𝜇𝑗)]
• The covariance matrix Σ is always symmetric and positive semidefinite.
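A direct evaluation of this density for an illustrative 𝜇 and Σ, computed straight from the formula rather than a library density function:

```python
import numpy as np

# d-dimensional normal density evaluated from the formula above.
mu = np.array([0.0, 1.0])                     # illustrative mean
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                # illustrative covariance

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```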
Normal Density - Multivariate Case
• The multivariate normal density (MND) is completely specified by 𝑑 + 𝑑(𝑑 + 1)/2 parameters. Samples drawn from an MND fall in a cluster whose center is determined by 𝜇 and whose shape is determined by Σ. The loci of points of constant density are hyperellipsoids
𝑟² = (𝑥 − 𝜇)ᵗ Σ⁻¹ (𝑥 − 𝜇) = constant
• The quantity 𝑟 is called the Mahalanobis distance from 𝑥 to 𝜇.
• The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
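A short sketch computing 𝑟 and the principal axes for illustrative values; the eigendecomposition of Σ gives the axis directions:

```python
import numpy as np

# Mahalanobis distance r and the principal axes of the constant-density
# hyperellipsoid (all values illustrative).
mu = np.array([0.0, 0.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
x = np.array([1.0, -1.0])

diff = x - mu
r = np.sqrt(diff @ np.linalg.solve(Sigma, diff))   # Mahalanobis distance

eigvals, eigvecs = np.linalg.eigh(Sigma)           # axis lengths scale with sqrt(eigvals)
print(r)
print(eigvecs)   # columns are the principal axes of the ellipsoid
```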
Normal Density - Multivariate Case
• The minimum-error-rate classification can be achieved using the discriminant functions:
𝑔𝑖(𝑥) = 𝑝(𝑥|𝜔𝑖)𝑝(𝜔𝑖), or
𝑔𝑖(𝑥) = ln 𝑝(𝑥|𝜔𝑖) + ln 𝑝(𝜔𝑖)
• If 𝑝(𝑥|𝜔𝑖) ∼ 𝑁(𝜇𝑖, Σ𝑖), then
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
since
𝑝(𝑥|𝜔𝑖) = (1/((2𝜋)^{𝑑/2} |Σ𝑖|^{1/2})) exp(−(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖))
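A straightforward implementation sketch of this 𝑔𝑖(𝑥), with illustrative Gaussian parameters and priors:

```python
import numpy as np

# g_i(x) = -1/2 (x-mu_i)^t Sigma_i^{-1} (x-mu_i) - d/2 ln(2 pi)
#          - 1/2 ln|Sigma_i| + ln p(omega_i)
def g(x, mu, Sigma, prior):
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

x = np.array([0.2, 0.8])
g1 = g(x, np.array([0.0, 0.0]), np.eye(2), 0.5)   # illustrative class 1
g2 = g(x, np.array([1.0, 1.0]), np.eye(2), 0.5)   # illustrative class 2
print(1 if g1 > g2 else 2)
```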
Discriminant Function for Normal Density
• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎²𝐈
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎²𝐈
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
= −‖𝑥 − 𝜇𝑖‖²/(2𝜎²) − (𝑑/2) ln 2𝜋 − (1/2) ln 𝜎^{2𝑑} + ln 𝑝(𝜔𝑖)
Since −(𝑑/2) ln 2𝜋 − (1/2) ln 𝜎^{2𝑑} is independent of the class,
𝑔𝑖(𝑥) = −‖𝑥 − 𝜇𝑖‖²/(2𝜎²) + ln 𝑝(𝜔𝑖)
Discriminant Function for Normal Density
• What is the discriminant function?
𝑔𝑖(𝑥) = −‖𝑥 − 𝜇𝑖‖²/(2𝜎²) + ln 𝑝(𝜔𝑖)
= −(1/(2𝜎²))(𝑥ᵗ𝑥 − 2𝜇𝑖ᵗ𝑥 + 𝜇𝑖ᵗ𝜇𝑖) + ln 𝑝(𝜔𝑖)
Since 𝑥ᵗ𝑥 is also independent of the class,
𝑔𝑖(𝑥) = 𝑤𝑖ᵗ𝑥 + 𝑤𝑖0, where
𝑤𝑖 = 𝜇𝑖/𝜎²,  𝑤𝑖0 = −𝜇𝑖ᵗ𝜇𝑖/(2𝜎²) + ln 𝑝(𝜔𝑖)
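A sketch of this Case 1 linear discriminant, with illustrative means, shared variance, and priors:

```python
import numpy as np

# Case 1 (Sigma_i = sigma^2 I): g_i(x) = w_i^t x + w_i0 with
# w_i = mu_i / sigma^2 and w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln p(omega_i).
sigma2 = 1.5                                            # illustrative shared variance
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]      # illustrative means
priors = [0.5, 0.5]

ws = [mu / sigma2 for mu in mus]
w0s = [-(mu @ mu) / (2 * sigma2) + np.log(p) for mu, p in zip(mus, priors)]

x = np.array([1.2, 0.4])
scores = [w @ x + w0 for w, w0 in zip(ws, w0s)]
print(np.argmax(scores) + 1)
```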
Discriminant Function for Normal Density
• The decision surface between classes 𝑖 and 𝑗 is obtained by letting 𝑔𝑖(𝑥) = 𝑔𝑗(𝑥), which yields
𝑔(𝑥) = 𝑔𝑖(𝑥) − 𝑔𝑗(𝑥) = 𝑤ᵗ𝑥 + 𝑤0 = 0, where
𝑤 = 𝑤𝑖 − 𝑤𝑗 = (𝜇𝑖 − 𝜇𝑗)/𝜎²
𝑤0 = 𝑤𝑖0 − 𝑤𝑗0 = −(𝜇𝑖ᵗ𝜇𝑖 − 𝜇𝑗ᵗ𝜇𝑗)/(2𝜎²) + ln 𝑝(𝜔𝑖) − ln 𝑝(𝜔𝑗)
• This linear classifier corresponds to a neuron model:
• 𝑖-class region if 𝑔(𝑥) > 0
• 𝑗-class region if 𝑔(𝑥) < 0
• decision boundary if 𝑔(𝑥) = 0
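As a sketch, the boundary weights above plug directly into a sign-threshold neuron (all parameters illustrative):

```python
import numpy as np

# Two-class boundary g(x) = w^t x + w0 = 0 viewed as a neuron with a
# sign activation: positive side -> omega_i, negative side -> omega_j.
sigma2 = 1.0
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 2.0])
p_i, p_j = 0.6, 0.4

w = (mu_i - mu_j) / sigma2
w0 = -(mu_i @ mu_i - mu_j @ mu_j) / (2 * sigma2) + np.log(p_i) - np.log(p_j)

def neuron(x):
    return "omega_i" if w @ x + w0 > 0 else "omega_j"

print(neuron(np.array([0.9, 1.1])))
```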
Discriminant Function for Normal Density
• If 𝑝(𝜔𝑖) is equal to 𝑝(𝜔𝑗), the decision boundary passes through the midpoint of 𝜇𝑖 and 𝜇𝑗.
Discriminant Function for Normal Density
• If 𝑝(𝜔𝑖) is not equal to 𝑝(𝜔𝑗), the decision boundary shifts away from the more probable class.
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, but every class has the same covariance matrix, that is, Σ𝑖 = Σ
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, but every class has the same covariance matrix, that is, Σ𝑖 = Σ
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − (1/2) ln |Σ| + ln 𝑝(𝜔𝑖)
Since −(𝑑/2) ln 2𝜋 − (1/2) ln |Σ| is independent of the class,
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ⁻¹ (𝑥 − 𝜇𝑖) + ln 𝑝(𝜔𝑖)
Discriminant Function for Normal Density
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ⁻¹ (𝑥 − 𝜇𝑖) + ln 𝑝(𝜔𝑖)
= −(1/2)(𝑥ᵗΣ⁻¹𝑥 − 2𝜇𝑖ᵗΣ⁻¹𝑥 + 𝜇𝑖ᵗΣ⁻¹𝜇𝑖) + ln 𝑝(𝜔𝑖)
Since 𝑥ᵗΣ⁻¹𝑥 is also independent of the class,
𝑔𝑖(𝑥) = 𝑤𝑖ᵗ𝑥 + 𝑤𝑖0, where
𝑤𝑖 = Σ⁻¹𝜇𝑖,  𝑤𝑖0 = −(1/2)𝜇𝑖ᵗΣ⁻¹𝜇𝑖 + ln 𝑝(𝜔𝑖)
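A sketch of this Case 2 linear discriminant with an illustrative shared covariance matrix:

```python
import numpy as np

# Case 2 (shared Sigma): g_i(x) = w_i^t x + w_i0 with w_i = Sigma^{-1} mu_i
# and w_i0 = -1/2 mu_i^t Sigma^{-1} mu_i + ln p(omega_i).
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                          # illustrative shared covariance
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]      # illustrative means
priors = [0.5, 0.5]

Sigma_inv = np.linalg.inv(Sigma)
ws = [Sigma_inv @ mu for mu in mus]
w0s = [-0.5 * mu @ Sigma_inv @ mu + np.log(p) for mu, p in zip(mus, priors)]

x = np.array([1.0, 0.5])
scores = [w @ x + w0 for w, w0 in zip(ws, w0s)]
print(np.argmax(scores) + 1)
```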
Discriminant Function for Normal Density
• The decision surface between classes 𝑖 and 𝑗 is obtained by letting 𝑔𝑖(𝑥) = 𝑔𝑗(𝑥), which yields
𝑔(𝑥) = 𝑔𝑖(𝑥) − 𝑔𝑗(𝑥) = 𝑤ᵗ𝑥 + 𝑤0 = 0, where
𝑤 = 𝑤𝑖 − 𝑤𝑗 = Σ⁻¹(𝜇𝑖 − 𝜇𝑗)
𝑤0 = 𝑤𝑖0 − 𝑤𝑗0 = −(1/2)(𝜇𝑖ᵗΣ⁻¹𝜇𝑖 − 𝜇𝑗ᵗΣ⁻¹𝜇𝑗) + ln 𝑝(𝜔𝑖) − ln 𝑝(𝜔𝑗)
• This linear classifier corresponds to a neuron model:
• 𝑖-class region if 𝑔(𝑥) > 0
• 𝑗-class region if 𝑔(𝑥) < 0
• decision boundary if 𝑔(𝑥) = 0
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, and every class has a different covariance matrix Σ𝑖
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, and every class has a different covariance matrix Σ𝑖
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
• Since −(𝑑/2) ln 2𝜋 is independent of the class,
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
Discriminant Function for Normal Density
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
= −(1/2)(𝑥ᵗΣ𝑖⁻¹𝑥 − 2𝜇𝑖ᵗΣ𝑖⁻¹𝑥 + 𝜇𝑖ᵗΣ𝑖⁻¹𝜇𝑖) − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
Since 𝑥ᵗΣ𝑖⁻¹𝑥 is not independent of the class,
𝑔𝑖(𝑥) = 𝑥ᵗ𝑊𝑖𝑥 + 𝑤𝑖ᵗ𝑥 + 𝑤𝑖0, where
𝑊𝑖 = −(1/2)Σ𝑖⁻¹,  𝑤𝑖 = Σ𝑖⁻¹𝜇𝑖,  𝑤𝑖0 = −(1/2)𝜇𝑖ᵗΣ𝑖⁻¹𝜇𝑖 − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
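A sketch of this Case 3 quadratic discriminant; the two classes below use illustrative, unequal covariance matrices:

```python
import numpy as np

# Case 3 (class-specific Sigma_i): g_i(x) = x^t W_i x + w_i^t x + w_i0 with
# W_i = -1/2 Sigma_i^{-1}, w_i = Sigma_i^{-1} mu_i,
# w_i0 = -1/2 mu_i^t Sigma_i^{-1} mu_i - 1/2 ln|Sigma_i| + ln p(omega_i).
def quadratic_g(x, mu, Sigma, prior):
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

x = np.array([0.5, 0.5])
g1 = quadratic_g(x, np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5)
g2 = quadratic_g(x, np.array([1.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.5]]), 0.5)
print(1 if g1 > g2 else 2)
```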
Decision Boundary for General Gaussian
• Decision boundaries are hyperquadrics in the general case
Linear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥
Affine Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥 + 𝑏
[Figure: affine mapping from 𝑋 to 𝑌, with a constant input 1 weighted by the bias 𝑏]
Nonlinear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓 𝑥 = 𝑎𝑇𝑦
[Figure: nonlinear mapping of 𝑥 ∈ 𝑋 to 𝑦 = (𝑦1, …, 𝑦𝑛) ∈ 𝑌, followed by the linear readout 𝑓(𝑥) = 𝑎ᵗ𝑦, with a constant input 1 weighted by the bias 𝑏]
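A minimal sketch of such a nonlinear mapping followed by a linear readout, with randomly drawn illustrative parameters:

```python
import numpy as np

# y = sigma(Wx + b) followed by f(x) = a^t y: a one-hidden-layer neuron model.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))   # maps x in R^3 to pre-activations in R^5
b = rng.normal(size=5)
a = rng.normal(size=5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    y = sigmoid(W @ x + b)    # nonlinear mapping T: X -> Y
    return a @ y              # linear functional on the mapped features

print(f(np.array([0.1, -0.2, 0.3])))
```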
Error Probabilities and Integrals
• Consider the two-class problem and suppose that the feature space is divided into two regions 𝑅1 and 𝑅2. There are two ways in which a classification error can occur.
• An observation 𝑥 falls in 𝑅2, and the true state is 𝜔1.
• An observation 𝑥 falls in 𝑅1, and the true state is 𝜔2.
• The error probability
𝑃(𝑒𝑟𝑟𝑜𝑟) = 𝑃(𝑥 ∈ 𝑅2|𝜔1)𝑝(𝜔1) + 𝑃(𝑥 ∈ 𝑅1|𝜔2)𝑝(𝜔2)
= ∫_{𝑅2} 𝑝(𝑥|𝜔1)𝑝(𝜔1) 𝑑𝑥 + ∫_{𝑅1} 𝑝(𝑥|𝜔2)𝑝(𝜔2) 𝑑𝑥
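A sketch that evaluates this error probability for a one-dimensional split 𝑅1 = (−∞, 𝑡), 𝑅2 = [𝑡, ∞), assuming illustrative Gaussian class-conditionals and priors:

```python
from scipy.stats import norm

# P(error) = P(x in R2 | w1) p(w1) + P(x in R1 | w2) p(w2) for threshold t.
p1, p2 = 0.5, 0.5
c1 = norm(loc=-1.0, scale=1.0)   # illustrative p(x|omega_1)
c2 = norm(loc=1.0, scale=1.0)    # illustrative p(x|omega_2)

def p_error(t):
    # tail of class 1 above t, plus mass of class 2 below t
    return c1.sf(t) * p1 + c2.cdf(t) * p2

print(p_error(0.3))
```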
Error Probabilities and Integrals
• If the decision boundary 𝑥∗ is chosen arbitrarily, the probability of error is not as small as it might be.
• 𝑥𝐵 is the Bayes optimal decision boundary and gives the lowest probability of error.
• The Bayes classifier maximizes the probability of correct classification:
𝑃(𝑐𝑜𝑟𝑟𝑒𝑐𝑡) = ∑_{𝑖=1}^{𝐶} 𝑃(𝑥 ∈ 𝑅𝑖|𝜔𝑖)𝑝(𝜔𝑖) = ∑_{𝑖=1}^{𝐶} ∫_{𝑅𝑖} 𝑝(𝑥|𝜔𝑖)𝑝(𝜔𝑖) 𝑑𝑥
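Continuing the illustrative setup above, sweeping the threshold confirms that the minimum of 𝑃(𝑒𝑟𝑟𝑜𝑟) occurs at the Bayes boundary 𝑥𝐵 (here 0 by symmetry):

```python
import numpy as np
from scipy.stats import norm

# Sweep the threshold x* and locate the minimum of P(error): it sits at the
# Bayes boundary x_B, where p(x|w1)p(w1) = p(x|w2)p(w2).
p1, p2 = 0.5, 0.5
c1 = norm(loc=-1.0, scale=1.0)   # illustrative p(x|omega_1)
c2 = norm(loc=1.0, scale=1.0)    # illustrative p(x|omega_2)

ts = np.linspace(-3, 3, 601)
errors = c1.sf(ts) * p1 + c2.cdf(ts) * p2
x_B = ts[np.argmin(errors)]
print(x_B, errors.min())         # ~ 0.0 for this symmetric setup
```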