Bayesian Decision Theory (I)
Jin Young Choi
Seoul National University
J. Y. Choi. SNU
Bayesian Decision Theory
• Bayes Formula
• Prior, Posterior, and Likelihood Probabilities
• Bayes Decision
• Risk Formulation
• Expected Loss, Conditional Risk, Total Risk
• Likelihood Ratio Test
• Decision Region
• Classifier for Bayes Decision
Bayesian Decision
• What do we learn from experience or observation?
• How can we decide the category of an object from observation?
• Is our decision always correct?
• We make a decision probabilistically
Bayesian Decision
• Question:
• Two kinds of fish live in a lake: tuna and salmon.
• If you catch a fish, is it more likely to be tuna or salmon?
Bayesian Decision
• From past experience, salmon has been caught 70% of the time and tuna 30%.
• What is the next fish likely to be?
Bayesian Decision
• If other types of fish are irrelevant:
p(ω = ω1) + p(ω = ω2) = 1,
where ω is a random variable, and ω1 and ω2 denote salmon and tuna.
• Probabilities reflect our prior knowledge obtained from past experience.
• Simple Decision Rule:
• Make a decision without seeing the fish.
• Decide ω1 if p(ω = ω1) > p(ω = ω2); decide ω2 otherwise.
Bayesian Decision
• In general, we will have some features and more information.
• Feature: lightness measurement = 𝑥
• Different fish yield different lightness readings (𝑥 is a random variable)
Bayesian Decision
• Define
• 𝑝(𝑥|𝜔𝑖)= Class Conditional Probability Density
• The difference between 𝑝(𝑥|𝜔1) and 𝑝(𝑥|𝜔2) describes the difference in lightness between tuna and salmon.
Bayesian Decision
• Hypothetical class-conditional probability
• Density functions are normalized (area under each curve is 1.0)
Bayesian Decision
• Suppose that we know
• The prior probabilities p(ω1) and p(ω2)
• The conditional densities p(x|ω1) and p(x|ω2)
• Measure lightness of a fish = 𝑥
• What is the category of the fish with lightness of 𝑥 ?
• The probability that the fish has category of 𝜔𝑖 is 𝑝(𝜔𝑖|𝑥).
Bayes formula
• p(ωi|x) = p(x|ωi) p(ωi) / p(x),
where p(x) = Σ_j p(x|ωj) p(ωj) = Σ_j p(x, ωj).
• Posterior = (Likelihood × Prior) / Evidence
• 𝑝(𝑥|𝜔𝑖) is called the likelihood of 𝜔𝑖 with respect to 𝑥.
• The 𝜔𝑖 category for which 𝑝(𝑥|𝜔𝑖) is large is more "likely" to be the true category
• 𝑝(𝑥) is the evidence
• How frequently a pattern with feature value x is observed.
• A scale factor ensuring the posterior probabilities sum to 1.
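The formula above can be checked with a quick numerical sketch; the prior and likelihood values below are made-up numbers for the fish example, not values from the slides.

```python
# Hypothetical priors and likelihoods p(x|omega_i) at one observed x;
# these numbers are illustrative assumptions.
priors = {"salmon": 0.7, "tuna": 0.3}
likelihoods = {"salmon": 0.6, "tuna": 0.2}

# Evidence p(x) = sum_j p(x|omega_j) p(omega_j)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior p(omega_i|x) = likelihood * prior / evidence
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
```

Note that the evidence only rescales the products likelihood × prior so that the posteriors sum to 1.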
Bayes formula
• Posterior probabilities for the particular priors 𝑝(𝜔1) = 2/3 and 𝑝(𝜔2) = 1/3. At every 𝑥 the posteriors sum to 1.
Bayes Decision Rule (Minimum probability of error)
• Likelihood Decision:
• 𝜔1 ∶ 𝑖𝑓 𝑝(𝑥|𝜔1) > 𝑝(𝑥|𝜔2)
• 𝜔2 ∶ otherwise
• Posterior Decision:
• ω1 : if p(x|ω1) p(ω1) > p(x|ω2) p(ω2)
• ω2 : otherwise
• Decision Error Probability
• p(error|x) = min(p(ω1|x), p(ω2|x)), since the decision error is given by
p(error|x) = p(ω2|x) if we decide ω1, and p(ω1|x) if we decide ω2.
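The minimum-error rule can be sketched as a small function; the class names and posterior values used below are illustrative assumptions.

```python
def bayes_decide(posteriors):
    """Pick the class with the highest posterior.

    The conditional error probability p(error|x) is the posterior mass
    of the classes we did not choose (= 1 - max posterior).
    """
    decided = max(posteriors, key=posteriors.get)
    p_error = 1.0 - posteriors[decided]
    return decided, p_error
```

With two classes, `p_error` equals min(p(ω1|x), p(ω2|x)), exactly as in the slide.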
Features: General Case
• Formalize the ideas just considered in 4 ways:
• Allow more than one feature:
• Replace the scalar 𝑥 by the feature vector 𝑥
• d-dimensional space ℛ𝑑, 𝑥 ∈ ℛ𝑑, is called the feature space
• Allow more than 2 states of nature:
• Generalize to several classes
• Allow actions other than merely deciding the state of nature:
• Possibility of rejection, refusing to make a decision in close cases
• Introduce a general loss function: risk (loss) minimization.
Loss Function
• A loss (or cost) function states exactly how costly each action is, and is used to convert a probability into a decision. Loss functions let us treat situations in which some kinds of classification mistakes are more costly than others.
Formulation
• Let 𝜔1, … , 𝜔𝑐 be the finite set of c states of nature ("categories").
• Let 𝛼1, … , 𝛼𝑎 be the finite set of a possible actions.
(Ex) Action αi = deciding that the true state is ωi, or some other action such as rejection.
• The loss function λ(αi|ωj) = loss incurred for taking action αi when the state of nature is ωj.
• 𝑥 = 𝑑 −dimensional feature vector (random variable)
• 𝑝(𝑥|𝜔𝑖) = likelihood probability density function for 𝑥 for given 𝜔𝑖
• 𝑝(𝜔𝑖) = prior probability that nature is in state 𝜔𝑖.
Expected Loss (Risk)
• Suppose that we observe a particular 𝑥 and that we take action 𝛼𝑖
• If the true state of nature is ωj, then the loss is λ(αi|ωj), and the expected loss before observation is
R(αi) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj)
Conditional Risk (Loss)
• After the observation, the expected loss, now called the "conditional risk", is given by
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x)
Total Risk
• Objective: Select the action that minimizes the conditional risk
• A general decision rule is a function 𝛼 𝑥
• For every x, the decision function α(x) assumes one of the values α1, …, αa, that is,
α(x) = argmin_{αi} R(αi|x) = argmin_{αi} Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x)
• The "total risk" is
∫ R(α(x)|x) p(x) dx
• Learning:
min_θ ∫ R(αθ(x)|x) p(x) dx, where fθ(x) ≈ (p(ω1|x), p(ω2|x), …, p(ωc|x))
Bayes Decision Rule:
• Compute the conditional risk
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x) for i = 1, …, a.
• Select the action 𝛼𝑖 for which 𝑅 𝛼𝑖 𝑥 is minimum.
• The resulting minimum total risk is called the Bayes Risk, denoted 𝑅∗, and
is the best performance that can be achieved.
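A minimal sketch of this rule, using a hypothetical 2×2 loss matrix and posterior (both illustrative assumptions): rows of the loss matrix index actions, columns index true states.

```python
# lam[i][j] = loss for taking action alpha_i when the true state is omega_j;
# the numbers are illustrative assumptions.
lam = [[0.0, 2.0],   # alpha_1: correct for omega_1, costly if omega_2
       [1.0, 0.0]]   # alpha_2: cheaper mistake, correct for omega_2
posterior = [0.6, 0.4]  # p(omega_1|x), p(omega_2|x)

# Conditional risk R(alpha_i|x) = sum_j lam[i][j] * p(omega_j|x)
risks = [sum(l * p for l, p in zip(row, posterior)) for row in lam]

# Bayes rule: take the action with minimum conditional risk
best_action = min(range(len(risks)), key=risks.__getitem__)
```

Note that although the posterior favors ω1, the asymmetric loss makes α2 the lower-risk action, illustrating how losses can overturn the posterior-only decision.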
Two-Category Classification
• Action 𝛼1 = deciding that the true state is 𝜔1
• Action 𝛼2 = deciding that the true state is 𝜔2
• Let 𝜆𝑖𝑗 = 𝜆 𝛼𝑖 𝜔𝑗 be the risk incurred for deciding 𝜔𝑖 when true state is 𝜔𝑗.
• The conditional risks:
R(α1|x) = λ11 p(ω1|x) + λ12 p(ω2|x)
R(α2|x) = λ21 p(ω1|x) + λ22 p(ω2|x)
• Decide ω1 if R(α1|x) < R(α2|x),
or if (λ21 − λ11) p(ω1|x) > (λ12 − λ22) p(ω2|x),
or if (λ21 − λ11) p(x|ω1) p(ω1) > (λ12 − λ22) p(x|ω2) p(ω2); decide ω2 otherwise.
Two-Category Likelihood Ratio Test
• Under the reasonable assumption that λ12 > λ22 and λ21 > λ11 (why?), decide ω1 if
p(x|ω1) / p(x|ω2) > (λ12 − λ22) p(ω2) / ((λ21 − λ11) p(ω1)) = T, and ω2 otherwise.
• The ratio p(x|ω1) / p(x|ω2) is called the likelihood ratio.
• We decide ω1 if the likelihood ratio exceeds a threshold value T that is independent of the observation x.
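The test can be sketched as follows; the default zero-one losses and the input values in the comments are illustrative assumptions.

```python
def lrt_decide(px_w1, px_w2, p_w1, p_w2, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Likelihood ratio test: decide omega_1 iff p(x|w1)/p(x|w2) > T.

    The threshold T depends only on losses and priors, not on x.
    Default losses are zero-one (illustrative assumption).
    """
    T = (l12 - l22) * p_w2 / ((l21 - l11) * p_w1)
    return "omega1" if px_w1 / px_w2 > T else "omega2"
```

With zero-one losses the threshold reduces to the prior ratio p(ω2)/p(ω1).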
Minimum-Error-Rate Classification
• Recall the conditional risk R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x)
• If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j, and in error otherwise.
• To give an equal cost to all errors, we define the zero-one loss function as
λ(αi|ωj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, C
Minimum-Error-Rate Classification cont.
• The conditional risk, representing the error rate, is
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x) = Σ_{j≠i} p(ωj|x) = 1 − p(ωi|x)
• To minimize the average probability of error, we should select the i that maximizes the posterior probability p(ωi|x):
Decide ωi if p(ωi|x) > p(ωj|x) for all j ≠ i (same as Bayes' decision rule)
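The equivalence above (minimizing zero-one risk = maximizing the posterior) can be checked in a few lines; the posterior vector is an illustrative assumption.

```python
# Under the zero-one loss, R(alpha_i|x) = 1 - p(omega_i|x), so the
# minimum-risk action and the maximum-posterior class coincide.
posterior = [0.2, 0.5, 0.3]               # hypothetical p(omega_i|x)
risks = [1.0 - p for p in posterior]      # zero-one conditional risks

best_by_risk = min(range(3), key=risks.__getitem__)
best_by_posterior = max(range(3), key=posterior.__getitem__)
```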
Decision Regions
• The likelihood ratio 𝑝 𝑥 𝜔1 /𝑝(𝑥|𝜔2) vs. 𝑥
• The threshold 𝑇𝑎 for a loss function
p(x|ω1) / p(x|ω2) > (λ12 − λ22) p(ω2) / ((λ21 − λ11) p(ω1)) = T_a, equivalently g1(x) > g2(x)
(Figure: the likelihood ratio p(x|ω1)/p(x|ω2) plotted against x, with the threshold T_a separating the decision regions.)
Classifiers; Discriminant Functions
• A classifier can be represented by a set of discriminant functions 𝑔𝑖 𝑥 ; 𝑖 = 1, … , 𝐶.
• The classifier assigns observation x to class ωi if gi(x) > gj(x) for all j ≠ i.
• Example forms (from the slide figure): a Gaussian-shaped discriminant gi(x) = exp(−½‖x − μi‖²) and a linear discriminant gi(x) = wᵗx + b.
The Bayes Classifier
• A Bayes classifier can be represented in this way
• For the minimum error-rate case
𝑔𝑖 𝑥 = 𝑝(𝜔𝑖|𝑥)
• For the general case with risks
gi(x) = −R(αi|x) = −Σ_j λij p(ωj|x)
• If we replace gi(x) by f(gi(x)), where f(·) is a monotonically increasing function (e.g., log), the resulting classification is unchanged.
𝑔𝑖 𝑥 = 𝑝 𝑥 𝜔𝑖 𝑝 𝜔𝑖
𝑔𝑖 𝑥 = ln 𝑝 𝑥 𝜔𝑖 + ln 𝑝(𝜔𝑖)
The Bayes Classifier
• The effect of any decision rule is to divide the feature space into C decision regions, 𝑅1, … , 𝑅𝐶.
• If gi(x) > gj(x) for all j ≠ i, then x is in Ri, and x is assigned to ωi.
• Decision regions are separated by decision boundaries.
• Decision boundaries are surfaces in the feature space.
(Figure: decision regions in a two-dimensional feature space; Feature 1 = weight, Feature 2 = height.)
Interim Summary
• Bayes Formula
• Prior Probability
• Likelihood
• Evidence
• Posterior Probability
• Bayes Decision
• Risk Formulation
• Loss Function
• Expected Loss, Conditional Risk
• Total Risk
• Likelihood Ratio Test
• Zero-one Loss Function (Bayes Decision)
• Decision Region and Classifiers
Bayesian Decision Theory (II)
Jin Young Choi
Seoul National University
Bayesian Decision Theory
• Bayes Formula
• Prior, Posterior, and Likelihood Probabilities
• Bayes Decision
• Risk Formulation
• Expected Loss, Conditional Risk, Total Risk
• Likelihood Ratio Test
• Decision Region
• Classifier for Bayes Decision
Next Outline
• Bayes Classifier
• Normal Density
• ND: Univariate Case
• MND: Multivariate Case1: Indep., Same Variance
• MND: Multivariate Case2: Same Covariance Mtx.
• MND: Multivariate Case3: Different Covariance Mtx.
• Error Probability
The Decision Regions
• A two-dimensional, two-category classifier
Two-Category Case
• Use two discriminant functions g1(x) and g2(x), assigning x to ω1 if g1(x) > g2(x).
• Alternative: define a single discriminant function
g(x) = g1(x) − g2(x),
and decide ω1 if g(x) > 0; otherwise decide ω2.
• In the two-category case, two forms are frequently used:
g(x) = p(ω1|x) − p(ω2|x)
g(x) = ln[p(x|ω1) / p(x|ω2)] + ln[p(ω1) / p(ω2)]
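The second form can be sketched with hypothetical univariate Gaussian class densities; all parameter values, and the helper names `normal_pdf` and `g`, are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def g(x, mu1, mu2, sigma, p1, p2):
    """Single discriminant g(x) = ln[p(x|w1)/p(x|w2)] + ln[p(w1)/p(w2)]."""
    return math.log(normal_pdf(x, mu1, sigma) / normal_pdf(x, mu2, sigma)) \
         + math.log(p1 / p2)
```

Decide ω1 whenever g(x) > 0; with equal priors and equal variances, the boundary sits midway between the two means.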
Normal Density - Univariate Case
• Gaussian density with mean μ and standard deviation σ (σ² is called the variance):
p(x) = (1 / ((2π)^{1/2} σ)) exp(−½ ((x − μ)/σ)²),  p(x) ~ N(μ, σ²)
• It can be shown that:
μ = E[x] = ∫_{−∞}^{∞} x p(x) dx
σ² = E[(x − μ)²] = ∫_{−∞}^{∞} (x − μ)² p(x) dx
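The density formula translates directly into code; `normal_pdf` is a hypothetical helper name.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x) = exp(-((x-mu)/sigma)^2 / 2) / (sqrt(2 pi) sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)
```

At x = μ the density peaks at 1/(σ√(2π)), about 0.3989 for the standard normal.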
Normal Density - Multivariate Case
• The general multivariate normal density (MND) in d dimensions is written as
p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − μ)ᵗ Σ⁻¹ (x − μ))
μ = E[x] = ∫_{−∞}^{∞} x p(x) dx
Σ = E[(x − μ)(x − μ)ᵗ],  Σij = E[(xi − μi)(xj − μj)]
• The covariance matrix Σ is always symmetric and positive semidefinite.
Normal Density - Multivariate Case
• The multivariate normal density (MND) is completely specified by d + d(d+1)/2 parameters. Samples drawn from an MND fall in a cluster whose center is determined by μ and whose shape is determined by Σ. The loci of points of constant density are hyperellipsoids
r² = (x − μ)ᵗ Σ⁻¹ (x − μ) = constant
• r is called the Mahalanobis distance from x to μ.
• The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
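The distance can be sketched for a 2-D feature with the 2×2 inverse written out by hand; the helper name `mahalanobis2d` and the inputs are assumptions for illustration.

```python
def mahalanobis2d(x, mu, Sigma):
    """Mahalanobis distance r = sqrt((x-mu)^t Sigma^{-1} (x-mu)) for 2-D x."""
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    # Apply the closed-form 2x2 inverse of Sigma to the difference vector
    qx = ( d * dx - b * dy) / det
    qy = (-c * dx + a * dy) / det
    return (dx * qx + dy * qy) ** 0.5
```

With Σ = I the Mahalanobis distance reduces to the ordinary Euclidean distance; a larger variance along an axis shrinks distances along that axis.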
Normal Density - Multivariate Case
• The minimum-error-rate classification can be achieved using the discriminant functions:
𝑔𝑖 𝑥 = 𝑝 𝑥 𝜔𝑖 𝑝(𝜔𝑖) or
𝑔𝑖 𝑥 = ln 𝑝 𝑥 𝜔𝑖 + ln 𝑝(𝜔𝑖)
• If p(x|ωi) ~ N(μi, Σi), then
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln|Σi| + ln p(ωi),
since
p(x|ωi) = (1 / ((2π)^{d/2} |Σi|^{1/2})) exp(−½ (x − μi)ᵗ Σi⁻¹ (x − μi))
Discriminant Function for Normal Density
• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎2𝐈
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎2 𝐈
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln|Σi| + ln p(ωi)
= −‖x − μi‖² / (2σ²) − (d/2) ln 2π − ½ ln σ^{2d} + ln p(ωi)
Since −(d/2) ln 2π − ½ ln σ^{2d} is independent of the class,
gi(x) = −‖x − μi‖² / (2σ²) + ln p(ωi)
Discriminant Function for Normal Density
• What is the discriminant function?
gi(x) = −‖x − μi‖² / (2σ²) + ln p(ωi)
= −(1 / (2σ²)) (xᵗx − 2μiᵗx + μiᵗμi) + ln p(ωi)
Since xᵗx is also independent of the class,
gi(x) = wiᵗx + wi0, where
wi = μi / σ²
wi0 = −μiᵗμi / (2σ²) + ln p(ωi)
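The resulting linear discriminant can be sketched directly; the function name and the example means, variance, and priors used in the comments are assumptions.

```python
import math

def linear_discriminant(x, mu_i, sigma2, prior_i):
    """Case Sigma_i = sigma^2 I: g_i(x) = w_i^t x + w_i0 with
    w_i = mu_i / sigma^2 and w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln p(omega_i)."""
    w = [m / sigma2 for m in mu_i]
    w0 = -sum(m * m for m in mu_i) / (2 * sigma2) + math.log(prior_i)
    return sum(wk * xk for wk, xk in zip(w, x)) + w0
```

Classify x to the class whose discriminant is largest; e.g. with hypothetical means (0,0) and (4,0), equal priors, and σ² = 1, a point at (1,0) scores higher for the first class.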
Discriminant Function for Normal Density
• Decision surface between 𝑖 and 𝑗 classes is obtained by letting 𝑔𝑖 𝑥 = 𝑔𝑗 𝑥
which yields
g(x) = gi(x) − gj(x) = wᵗx + w0 = 0, where
w = wi − wj = (μi − μj) / σ²
w0 = wi0 − wj0 = −(μiᵗμi − μjᵗμj) / (2σ²) + ln p(ωi) − ln p(ωj)
• This linear classifier corresponds to a neuron model as
• 𝑖 −class region if 𝑔 𝑥 > 0
• 𝑗 −class region if 𝑔 𝑥 < 0
• decision boundary if 𝑔 𝑥 = 0
• If p(ωi) is equal to p(ωj),
Discriminant Function for Normal Density
Discriminant Function for Normal Density
• If p(ωi) is not equal to p(ωj),
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, but every class has the same covariance matrix, that is, Σi = Σ.
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, but every class has the same covariance matrix, that is, Σi = Σ.
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σ⁻¹ (x − μi) − (d/2) ln 2π − ½ ln|Σ| + ln p(ωi)
Since −(d/2) ln 2π − ½ ln|Σ| is independent of the class,
gi(x) = −½ (x − μi)ᵗ Σ⁻¹ (x − μi) + ln p(ωi)
Discriminant Function for Normal Density
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σ⁻¹ (x − μi) + ln p(ωi)
= −½ (xᵗΣ⁻¹x − 2μiᵗΣ⁻¹x + μiᵗΣ⁻¹μi) + ln p(ωi)
Since xᵗΣ⁻¹x is also independent of the class,
gi(x) = wiᵗx + wi0, where
wi = Σ⁻¹μi
wi0 = −μiᵗΣ⁻¹μi / 2 + ln p(ωi)
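The shared-covariance discriminant can be sketched with small hand-rolled 2×2 helpers; all helper names and example values are assumptions. With Σ = I it reduces to the previous (σ²I) case.

```python
import math

def inv2(S):
    """Closed-form inverse of a 2x2 matrix given as nested tuples."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def matvec(M, v):
    return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])

def g_shared(x, mu, Sigma, prior):
    """Case Sigma_i = Sigma: g_i(x) = w_i^t x + w_i0 with w_i = Sigma^{-1} mu_i
    and w_i0 = -mu_i^t Sigma^{-1} mu_i / 2 + ln p(omega_i)."""
    w = matvec(inv2(Sigma), mu)
    w0 = -(mu[0] * w[0] + mu[1] * w[1]) / 2 + math.log(prior)
    return w[0] * x[0] + w[1] * x[1] + w0
```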
Discriminant Function for Normal Density
• Decision surface between 𝑖 and 𝑗 classes is obtained by letting 𝑔𝑖 𝑥 = 𝑔𝑗 𝑥
which yields
g(x) = gi(x) − gj(x) = wᵗx + w0 = 0, where
w = wi − wj = Σ⁻¹(μi − μj)
w0 = wi0 − wj0 = −(μiᵗΣ⁻¹μi − μjᵗΣ⁻¹μj) / 2 + ln p(ωi) − ln p(ωj)
• This linear classifier corresponds to a neuron model as
• 𝑖 −class region if 𝑔 𝑥 > 0
• 𝑗 −class region if 𝑔 𝑥 < 0
• decision boundary if 𝑔 𝑥 = 0
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, and each class has a different covariance matrix Σi.
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, and each class has a different covariance matrix Σi.
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln|Σi| + ln p(ωi)
• Since −(d/2) ln 2π is independent of the class,
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − ½ ln|Σi| + ln p(ωi)
Discriminant Function for Normal Density
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − ½ ln|Σi| + ln p(ωi)
= −½ (xᵗΣi⁻¹x − 2μiᵗΣi⁻¹x + μiᵗΣi⁻¹μi) − ½ ln|Σi| + ln p(ωi)
Since xᵗΣi⁻¹x is not independent of the class,
gi(x) = xᵗWi x + wiᵗx + wi0
where
Wi = −½ Σi⁻¹
wi = Σi⁻¹μi
wi0 = −μiᵗΣi⁻¹μi / 2 − ½ ln|Σi| + ln p(ωi)
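The quadratic discriminant can be sketched for a 2-D feature, with the quadratic form expanded by hand via the 2×2 inverse; the helper name and inputs are assumptions.

```python
import math

def g_quadratic(x, mu, Sigma, prior):
    """Case of class-specific Sigma_i:
    g_i(x) = -1/2 (x-mu_i)^t Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln p(omega_i)."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # (x-mu)^t Sigma^{-1} (x-mu), using the closed-form 2x2 inverse
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * quad - 0.5 * math.log(det) + math.log(prior)
```

Because the quadratic term xᵗWi x differs per class, the decision boundaries between classes become hyperquadrics rather than hyperplanes.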
Decision Boundary for General Gaussian
• Decision boundaries are hyperquadratic for general case
Linear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥
Affine Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥 + 𝑏
Nonlinear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓 𝑥 = 𝑎𝑇𝑦
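The nonlinear mapping above reads as a one-hidden-layer network; a minimal sketch, assuming a logistic σ and illustrative weights (all names and values are assumptions).

```python
import math

def sigma(z):
    """Logistic nonlinearity (an assumed choice for the slide's sigma)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, b, a):
    """Compute f(x) = a^t sigma(W x + b) for list-based W, b, a, x."""
    y = [sigma(sum(wij * xj for wij, xj in zip(row, x)) + bi)
         for row, bi in zip(W, b)]
    return sum(ai * yi for ai, yi in zip(a, y))
```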
Error Probabilities and Integrals
• Consider the 2-class problem and suppose that the feature space is divided into 2 regions 𝑅1 and 𝑅2. There are 2 ways in which a classification error can occur.
• An observation 𝑥 falls in 𝑅2, and the true state is 𝜔1.
• An observation 𝑥 falls in 𝑅1, and the true state is 𝜔2.
• The error probability
P(error) = P(x ∈ R2|ω1) p(ω1) + P(x ∈ R1|ω2) p(ω2)
= ∫_{R2} p(x|ω1) p(ω1) dx + ∫_{R1} p(x|ω2) p(ω2) dx
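The error integral can be approximated numerically; the sketch below uses two hypothetical univariate Gaussian classes with equal priors, where the Bayes boundary x_B = 0 by symmetry (all parameter values are assumptions).

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Illustrative two-class setup: N(-1,1) vs N(1,1), equal priors;
# by symmetry the Bayes boundary is x_B = 0 (R1 = {x <= 0}, R2 = {x > 0}).
mu1, mu2, sig, prior = -1.0, 1.0, 1.0, 0.5

# P(error) = int_{R2} p(x|w1) p(w1) dx + int_{R1} p(x|w2) p(w2) dx,
# approximated by a Riemann sum over a wide grid.
dx = 0.001
xs = [-8.0 + i * dx for i in range(int(16 / dx))]
p_error = sum(prior * normal_pdf(x, mu1, sig) * dx for x in xs if x > 0) \
        + sum(prior * normal_pdf(x, mu2, sig) * dx for x in xs if x <= 0)
```

For this symmetric setup the exact Bayes error is Φ(−1) ≈ 0.159, and the Riemann sum lands close to it; moving the boundary away from 0 can only increase the error.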
Error Probabilities and Integrals
• If the boundary x* is chosen arbitrarily, the probability of error is not as small as it might be.
• xB is the Bayes optimal decision boundary and gives the lowest probability of error.
• Bayes classifier maximizes the correct probability.
P(correct) = Σ_{i=1}^{C} P(x ∈ Ri|ωi) p(ωi) = Σ_{i=1}^{C} ∫_{Ri} p(x|ωi) p(ωi) dx
Interim Summary
• Decision Region
• (Linear) Discriminant Function
• Decision Surface, Linear Machine
• Bayes Classifier
• Normal Density ND: Univariate Case
• MND: Multivariate Case1: Indep., Same Variance
• MND: Multivariate Case2: Same Covariance Matrix
• MND: Multivariate Case3: Different Covariance Matrix
• Error Probability
• Remaining issues: learning of PDFs, NN learning, Bayesian learning