**Bayesian Decision Theory (I)**

Jin Young Choi

Seoul National University

*J. Y. Choi. SNU*

### Bayesian Decision Theory

• Bayes Formula

• Prior, Posterior, and Likelihood Probabilities

• Bayes Decision

• Risk Formulation

• Expected Loss, Conditional Risk, Total Risk

• Likelihood Ratio Test

• Decision Region

• Classifier for Bayes Decision


### Bayesian Decision

• What do we learn from experience or observation?

• How can we decide the category of an object from observation?

• Is our decision always correct?

• We make a decision probabilistically


### Bayesian Decision

• Question:

• Two kinds of fish live in a lake: tuna and salmon.

• If you catch a fish, is it more likely to be tuna or salmon?

### Bayesian Decision

• From past experience, salmon has been caught 70% of the time and tuna 30% of the time.

• What is the next fish likely to be?

### Bayesian Decision

• If other types of fish are irrelevant:

𝑝(𝜔 = 𝜔_{1}) + 𝑝(𝜔 = 𝜔_{2}) = 1,

where 𝜔 is a random variable, and 𝜔_{1} and 𝜔_{2} denote salmon and tuna.

• Probabilities reflect our prior knowledge obtained from past experience.

• Simple Decision Rule:

• Make a decision without seeing the fish.

• Decide 𝜔_{1} if 𝑝(𝜔 = 𝜔_{1}) > 𝑝(𝜔 = 𝜔_{2}); decide 𝜔_{2} otherwise.

### Bayesian Decision

• In general, we will have some features and more information.

• Feature: lightness measurement = 𝑥

• Different fish yields different lightness readings (𝑥 is a random variable)


### Bayesian Decision

• Define

• 𝑝(𝑥|𝜔_{𝑖}) = class-conditional probability density

• The difference between 𝑝(𝑥|𝜔_{1}) and 𝑝(𝑥|𝜔_{2}) describes the difference in
lightness between tuna and salmon.

### Bayesian Decision

• Hypothetical class-conditional probability

• Density functions are normalized (area under each curve is 1.0)


### Bayesian Decision

• Suppose that we know

• The prior probabilities 𝑝(𝜔_{1}) and 𝑝(𝜔_{2})

• The conditional densities 𝑝(𝑥|𝜔_{1}) and 𝑝(𝑥|𝜔_{2})

• Measure lightness of a fish = 𝑥

• What is the category of the fish with lightness 𝑥?

• The probability that the fish has category of 𝜔_{𝑖} is 𝑝(𝜔_{𝑖}|𝑥).

### Bayes formula

• 𝑝(𝜔_{𝑖}|𝑥) = 𝑝(𝑥|𝜔_{𝑖})𝑝(𝜔_{𝑖}) / 𝑝(𝑥),

where 𝑝(𝑥) = Σ_{𝑗} 𝑝(𝑥|𝜔_{𝑗})𝑝(𝜔_{𝑗}) = Σ_{𝑗} 𝑝(𝑥, 𝜔_{𝑗}).

• Posterior = (Likelihood × Prior) / Evidence

• 𝑝(𝑥|𝜔_{𝑖}) is called the *likelihood* of 𝜔_{𝑖} with respect to *𝑥.*

• The category 𝜔_{𝑖} for which 𝑝(𝑥|𝜔_{𝑖}) is large is more "likely" to be the true
category.

• **𝑝(𝑥) is the evidence**

• How frequently a pattern with feature value 𝑥 is observed.

• A scale factor that ensures the posterior probabilities sum to 1.


### Bayes formula

• Posterior probabilities for the particular priors 𝑝(𝜔_{1}) = 2/3 and
𝑝(𝜔_{2}) = 1/3. At every 𝑥 the posteriors sum to 1.
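The posterior computation can be checked numerically. The sketch below uses the priors above (2/3 and 1/3) together with hypothetical likelihood values at one observed 𝑥; the posteriors always sum to 1.

```python
# A minimal numeric check of Bayes formula, using the priors above
# (2/3 and 1/3) and hypothetical likelihood values at one observed x.
def posterior(likelihoods, priors):
    """p(w_i|x) = p(x|w_i) p(w_i) / p(x), with p(x) = sum_j p(x|w_j) p(w_j)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)          # the evidence p(x)
    return [j / evidence for j in joint]

post = posterior(likelihoods=[0.6, 0.3], priors=[2 / 3, 1 / 3])
# joint = [0.4, 0.1], evidence = 0.5, so post = [0.8, 0.2]
```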

### Bayes Decision Rule (Minimum probability of error)

• Likelihood Decision:

• 𝜔_{1} ∶ 𝑖𝑓 𝑝(𝑥|𝜔_{1}) > 𝑝(𝑥|𝜔_{2})

• 𝜔_{2} *∶ otherwise*

• Posteriori Decision:

• 𝜔_{1} ∶ 𝑖𝑓 𝑝(𝑥|𝜔_{1})𝑝(𝜔_{1}) > 𝑝(𝑥|𝜔_{2})𝑝(𝜔_{2})

• 𝜔_{2} *∶ otherwise*

• Decision Error Probability

• 𝑝(𝑒𝑟𝑟𝑜𝑟|𝑥) = min(𝑝(𝜔_{1}|𝑥), 𝑝(𝜔_{2}|𝑥)),
*since the decision error is given by*

𝑝(𝑒𝑟𝑟𝑜𝑟|𝑥) = 𝑝(𝜔_{2}|𝑥) if we decide 𝜔_{1}, and 𝑝(𝜔_{1}|𝑥) if we decide 𝜔_{2}


### Features: General Case

• Formalize the ideas just considered in 4 ways:

• *Allow more than one feature:*

• Replace the scalar 𝑥 by the feature vector 𝑥

• *d-dimensional space* ℛ^{𝑑}, 𝑥 ∈ ℛ^{𝑑}, is called the feature space

• *Allow more than 2 states of nature:*

• Generalize to several classes

• *Allow actions* *other than merely deciding the state of nature:*

• Possibility of rejection, refusing to make a decision in close cases

• *Introducing general loss function: risk(loss) minimization.*

### Loss Function

• A loss (or cost) function states exactly how costly each action is, and is used to
convert a probability into a decision. Loss functions let us treat situations in
which some kinds of classification mistakes are more costly than others.


### Formulation

• Let 𝜔_{1}, … , 𝜔_{𝑐} *be the finite set of c states of nature ("categories").*

• Let 𝛼_{1}, … , 𝛼_{𝑎} be the finite set of a possible actions.

(Ex) Action 𝛼_{𝑖} = deciding that the true state is 𝜔_{𝑖}, or some other action (e.g., rejection).

• The loss function 𝜆(𝛼_{𝑖}|𝜔_{𝑗}) = loss incurred for taking action 𝛼_{𝑖}
when the state of nature is 𝜔_{𝑗}.

• 𝑥 = 𝑑 −dimensional feature vector (random variable)

• 𝑝(𝑥|𝜔_{𝑖}) = likelihood probability density function for 𝑥 for given 𝜔_{𝑖}

• 𝑝(𝜔_{𝑖}) = prior probability that nature is in state 𝜔_{𝑖}.

### Expected Loss (Risk)

• Suppose that we observe a particular 𝑥 and that we take action 𝛼_{𝑖}

• If the true state of nature is 𝜔_{𝑗}, the loss is 𝜆(𝛼_{𝑖}|𝜔_{𝑗}), so the expected loss
before observation is

𝑅(𝛼_{𝑖}) = Σ_{𝑗=1}^{𝑐} 𝜆(𝛼_{𝑖}|𝜔_{𝑗}) 𝑝(𝜔_{𝑗})


### Conditional Risk (Loss)

• After the observation, the expected loss, now called the
"conditional risk", is given by

𝑅(𝛼_{𝑖}|𝑥) = Σ_{𝑗=1}^{𝑐} 𝜆(𝛼_{𝑖}|𝜔_{𝑗}) 𝑝(𝜔_{𝑗}|𝑥)

### Total Risk

• Objective: Select the action that minimizes the conditional risk

• A general decision rule is a function 𝛼(𝑥)

• For every 𝑥, the decision function 𝛼(𝑥) assumes one of the values
𝛼_{1}, … , 𝛼_{𝑎}, that is,

𝛼(𝑥) = arg min_{𝛼_{𝑖}} 𝑅(𝛼_{𝑖}|𝑥) = arg min_{𝛼_{𝑖}} Σ_{𝑗=1}^{𝑐} 𝜆(𝛼_{𝑖}|𝜔_{𝑗}) 𝑝(𝜔_{𝑗}|𝑥)

• The "total risk" is

∫ 𝑅(𝛼(𝑥)|𝑥) 𝑝(𝑥) 𝑑𝑥

• Learning

min_{𝜃} ∫ 𝑅(𝛼_{𝜃}(𝑥)|𝑥) 𝑝(𝑥) 𝑑𝑥, where 𝑓_{𝜃}(𝑥) ≈ (𝑝(𝜔_{1}|𝑥), 𝑝(𝜔_{2}|𝑥), … , 𝑝(𝜔_{𝑐}|𝑥))


### Bayes Decision Rule:

• Compute the conditional risk

𝑅(𝛼_{𝑖}|𝑥) = Σ_{𝑗=1}^{𝑐} 𝜆(𝛼_{𝑖}|𝜔_{𝑗}) 𝑝(𝜔_{𝑗}|𝑥) for 𝑖 = 1, … , 𝑎.

• Select the action 𝛼_{𝑖} for which 𝑅(𝛼_{𝑖}|𝑥) is minimum.

• The resulting minimum total risk is called the Bayes Risk, denoted 𝑅^{∗}, and

is the best performance that can be achieved.
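The rule above can be sketched in a few lines: compute each conditional risk from a loss matrix and the posteriors, then pick the minimum. The loss matrix and posterior values below are hypothetical.

```python
# A small sketch of the Bayes decision rule: compute the conditional risk
# R(a_i|x) = sum_j lambda(a_i|w_j) p(w_j|x) for each action and pick the
# minimum. The loss matrix and posteriors are hypothetical.
lam = [[0.0, 2.0],   # lambda(a_1|w_1), lambda(a_1|w_2)
       [1.0, 0.0]]   # lambda(a_2|w_1), lambda(a_2|w_2)
post = [0.6, 0.4]    # p(w_1|x), p(w_2|x)

risks = [sum(l * p for l, p in zip(row, post)) for row in lam]
best_action = min(range(len(risks)), key=risks.__getitem__)
# risks = [0.8, 0.6], so action a_2 is the Bayes action at this x
```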

### Two-Category Classification

• Action 𝛼_{1} = deciding that the true state is 𝜔_{1}

• Action 𝛼_{2} = deciding that the true state is 𝜔_{2}

• Let 𝜆_{𝑖𝑗} = 𝜆(𝛼_{𝑖}|𝜔_{𝑗}) be the risk incurred for deciding 𝜔_{𝑖} when the true
state is 𝜔_{𝑗}.

• *The conditional risks:*

𝑅(𝛼_{1}|𝑥) = 𝜆_{11}𝑝(𝜔_{1}|𝑥) + 𝜆_{12}𝑝(𝜔_{2}|𝑥)
𝑅(𝛼_{2}|𝑥) = 𝜆_{21}𝑝(𝜔_{1}|𝑥) + 𝜆_{22}𝑝(𝜔_{2}|𝑥)

• Decide 𝜔_{1} if 𝑅(𝛼_{1}|𝑥) < 𝑅(𝛼_{2}|𝑥),

or if (𝜆_{21} − 𝜆_{11})𝑝(𝜔_{1}|𝑥) > (𝜆_{12} − 𝜆_{22})𝑝(𝜔_{2}|𝑥),

or if (𝜆_{21} − 𝜆_{11})𝑝(𝑥|𝜔_{1})𝑝(𝜔_{1}) > (𝜆_{12} − 𝜆_{22})𝑝(𝑥|𝜔_{2})𝑝(𝜔_{2});
decide 𝜔_{2} *otherwise*


### Two-Category Likelihood Ratio Test

• Under the reasonable assumption that 𝜆_{12} > 𝜆_{22} and 𝜆_{21} > 𝜆_{11} (why?),
decide 𝜔_{1} if

𝑝(𝑥|𝜔_{1}) / 𝑝(𝑥|𝜔_{2}) > (𝜆_{12} − 𝜆_{22})𝑝(𝜔_{2}) / [(𝜆_{21} − 𝜆_{11})𝑝(𝜔_{1})] = 𝑇,

and 𝜔_{2} *otherwise.*

• The ratio 𝑝(𝑥|𝜔_{1})/𝑝(𝑥|𝜔_{2}) is called the likelihood ratio.

• We decide 𝜔_{1} *if the likelihood ratio exceeds a threshold value 𝑇*
that is independent of the observation 𝑥.
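The test can be written directly from the threshold formula. The losses, priors, and likelihood values below are hypothetical, chosen so that 𝜆_{12} > 𝜆_{22} and 𝜆_{21} > 𝜆_{11} as assumed above.

```python
# A sketch of the two-category likelihood ratio test: decide w1 iff
# p(x|w1)/p(x|w2) > T, with T = (lam12 - lam22) p(w2) / ((lam21 - lam11) p(w1)).
# All numbers are hypothetical.
def lrt_decide(p_x_w1, p_x_w2, p_w1, p_w2, lam11, lam12, lam21, lam22):
    T = (lam12 - lam22) * p_w2 / ((lam21 - lam11) * p_w1)
    return 1 if p_x_w1 / p_x_w2 > T else 2

decision = lrt_decide(0.4, 0.1, 0.7, 0.3,
                      lam11=0.0, lam12=2.0, lam21=1.0, lam22=0.0)
# likelihood ratio 4.0 exceeds T = (2)(0.3)/((1)(0.7)) ≈ 0.857, so decide w1
```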

### Minimum-Error-Rate Classification

• Recall the conditional risk

𝑅(𝛼_{𝑖}|𝑥) = Σ_{𝑗=1}^{𝑐} 𝜆(𝛼_{𝑖}|𝜔_{𝑗}) 𝑝(𝜔_{𝑗}|𝑥)

• If action 𝛼_{𝑖} is taken for the true state 𝜔_{𝑗}*, then *

the decision is correct if *𝑖 = 𝑗, and in error otherwise.*

• To give an equal cost to all errors, we define the zero-one loss function as

𝜆(𝛼_{𝑖}|𝜔_{𝑗}) = 0 if 𝑖 = 𝑗, and 1 if 𝑖 ≠ 𝑗, for 𝑖, 𝑗 = 1, … , 𝐶


### Minimum-Error-Rate Classification cont.

• The conditional risk, now representing the error rate, is

𝑅(𝛼_{𝑖}|𝑥) = Σ_{𝑗=1}^{𝑐} 𝜆(𝛼_{𝑖}|𝜔_{𝑗}) 𝑝(𝜔_{𝑗}|𝑥) = Σ_{𝑗≠𝑖} 𝑝(𝜔_{𝑗}|𝑥) = 1 − 𝑝(𝜔_{𝑖}|𝑥)

• **To minimize the average probability of error, we should select the 𝑖 that**
**maximizes the posterior probability 𝑝(𝜔_{𝑖}|𝑥):**

Decide 𝜔_{𝑖} if 𝑝(𝜔_{𝑖}|𝑥) > 𝑝(𝜔_{𝑗}|𝑥) for all 𝑗 ≠ 𝑖
(same as Bayes' decision rule)
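The equivalence can be checked numerically: under the zero-one loss, the minimum-risk action coincides with the maximum-posterior class. The posterior values below are hypothetical.

```python
# With the zero-one loss, the conditional risk of deciding class i is
# R(a_i|x) = sum_{j != i} p(w_j|x) = 1 - p(w_i|x), so minimizing the risk
# is exactly choosing the maximum posterior. Posteriors are hypothetical.
def zero_one_risk(i, posteriors):
    return sum(p for j, p in enumerate(posteriors) if j != i)

posteriors = [0.2, 0.5, 0.3]                 # p(w_j|x), summing to 1
risks = [zero_one_risk(i, posteriors) for i in range(3)]
best = min(range(3), key=risks.__getitem__)  # same as argmax of posteriors
```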

### Decision Regions

• The likelihood ratio 𝑝(𝑥|𝜔_{1})/𝑝(𝑥|𝜔_{2}) *vs. 𝑥*

• The threshold 𝑇_{𝑎} for a given loss function:

decide 𝜔_{1} if 𝑝(𝑥|𝜔_{1})/𝑝(𝑥|𝜔_{2}) > (𝜆_{12} − 𝜆_{22})𝑝(𝜔_{2}) / [(𝜆_{21} − 𝜆_{11})𝑝(𝜔_{1})] = 𝑇_{𝑎}

(Figure: the likelihood ratio plotted against 𝑥; the threshold 𝑇_{𝑎} determines the decision regions where 𝑔_{1}(𝑥) > 𝑔_{2}(𝑥) or 𝑔_{2}(𝑥) > 𝑔_{1}(𝑥).)


### Classifiers; Discriminant Functions

• A classifier can be represented by a set of discriminant
functions 𝑔_{𝑖} 𝑥 ; 𝑖 = 1, … , 𝐶.

• The classifier assigns observation 𝑥 to class 𝜔_{𝑖} *if* 𝑔_{𝑖}(𝑥) > 𝑔_{𝑗}(𝑥) for all 𝑗 ≠ 𝑖

(Figure: example discriminant functions — a Gaussian (exponential) form 𝑔_{𝑖}(𝑥), and a linear form 𝑔_{𝑖}(𝑥) = 𝑤^{𝑇}𝑥 + 𝑏.)

### The Bayes Classifier

• A Bayes classifier can be represented in this way

• *For the minimum error-rate case *

𝑔_{𝑖}(𝑥) = 𝑝(𝜔_{𝑖}|𝑥)

• *For the general case with risks *

𝑔_{𝑖}(𝑥) = −𝑅(𝛼_{𝑖}|𝑥) = −Σ_{𝑗} 𝜆_{𝑖𝑗} 𝑝(𝜔_{𝑗}|𝑥)

• If we replace 𝑔_{𝑖}(𝑥) by 𝑓(𝑔_{𝑖}(𝑥)), where 𝑓(⋅) is a monotonically increasing
function (e.g., log), the resulting classification is unchanged:

𝑔_{𝑖}(𝑥) = 𝑝(𝑥|𝜔_{𝑖})𝑝(𝜔_{𝑖})

𝑔_{𝑖}(𝑥) = ln 𝑝(𝑥|𝜔_{𝑖}) + ln 𝑝(𝜔_{𝑖})


### The Bayes Classifier

• The effect of any decision rule is to divide the feature space into
*C* decision regions, 𝑅_{1}, … , 𝑅_{𝐶}.

• If 𝑔_{𝑖}(𝑥) > 𝑔_{𝑗}(𝑥) for all 𝑗 ≠ 𝑖, then 𝑥 is in 𝑅_{𝑖}, and 𝑥 is assigned to 𝜔_{𝑖}.

• Decision regions are separated by decision boundaries.

• Decision boundaries are surfaces in the feature space.

(Figure: decision regions in a two-dimensional feature space, with Feature 1 = weight and Feature 2 = height.)

### Interim Summary

• Bayes Formula

• Prior probability

• Likelihood

• Evidence

• Posterior Probability

• Bayes Decision

• Risk Formulation

• Loss Function

• Expected Loss, Conditional Risk

• Total Risk

• Likelihood Ratio Test

• Zero-one Loss Function (Bayes Decision)

• Decision Region and Classifiers


**Bayesian Decision Theory (II)**

Jin Young Choi

Seoul National University



### Next Outline

• Bayes Classifier

• Normal Density

• ND: Univariate Case

• MND: Multivariate Case1: Indep., Same Variance

• MND: Multivariate Case2: Same Covariance Mtx.

• MND: Multivariate Case3: Different Covariance Mtx.

• Error Probability



### The Decision Regions

• A two-dimensional, two-category classifier

### Two-Category Case

• Use two discriminant functions 𝑔_{1}(𝑥) and 𝑔_{2}(𝑥),

• assigning 𝑥 to 𝜔_{1} if 𝑔_{1}(𝑥) > 𝑔_{2}(𝑥).

• Alternative: define a single discriminant function

• 𝑔(𝑥) = 𝑔_{1}(𝑥) − 𝑔_{2}(𝑥),

• decide 𝜔_{1} if 𝑔(𝑥) > 0; otherwise decide 𝜔_{2}.

• In the two-category case, two forms are frequently used:

• 𝑔(𝑥) = 𝑝(𝜔_{1}|𝑥) − 𝑝(𝜔_{2}|𝑥)

• 𝑔(𝑥) = ln[𝑝(𝑥|𝜔_{1})/𝑝(𝑥|𝜔_{2})] + ln[𝑝(𝜔_{1})/𝑝(𝜔_{2})]
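The log-ratio form above can be sketched as a single function; decide 𝜔_{1} when it is positive. The likelihood and prior values plugged in are hypothetical.

```python
import math

# The single two-category log-ratio discriminant:
# g(x) = ln[p(x|w1)/p(x|w2)] + ln[p(w1)/p(w2)]; decide w1 iff g(x) > 0.
# The numbers plugged in are hypothetical.
def g(p_x_w1, p_x_w2, p_w1, p_w2):
    return math.log(p_x_w1 / p_x_w2) + math.log(p_w1 / p_w2)

val = g(0.4, 0.1, 0.7, 0.3)   # positive, so decide w1
```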


### Normal Density - Univariate Case

• Gaussian density with mean 𝜇 and standard deviation 𝜎 (𝜎^{2} is named the variance):

𝑝(𝑥) = (1/((2𝜋)^{1/2}𝜎)) exp[−(1/2)((𝑥 − 𝜇)/𝜎)^{2}], 𝑝(𝑥)~𝑁(𝜇, 𝜎^{2})

• It can be shown that:

𝜇 = 𝐸[𝑥] = ∫_{−∞}^{∞} 𝑥 𝑝(𝑥) 𝑑𝑥

𝜎^{2} = 𝐸[(𝑥 − 𝜇)^{2}] = ∫_{−∞}^{∞} (𝑥 − 𝜇)^{2} 𝑝(𝑥) 𝑑𝑥
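The density formula translates directly into code; a quick sanity check is that the density at the mean equals 1/√(2𝜋𝜎²).

```python
import math

# The univariate Gaussian density, written directly from the formula above.
def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

peak = normal_pdf(0.0, 0.0, 1.0)   # density at the mean: 1/sqrt(2*pi)
```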


### Normal Density - Multivariate Case

• The general multivariate normal density (MND) in 𝑑 dimensions is written as

𝑝(𝑥) = (1/((2𝜋)^{𝑑/2}|Σ|^{1/2})) exp[−(1/2)(𝑥 − 𝜇)^{𝑡}Σ^{−1}(𝑥 − 𝜇)]

𝜇 = 𝐸[𝑥] = ∫_{−∞}^{∞} 𝑥 𝑝(𝑥) 𝑑𝑥

Σ = 𝐸[(𝑥 − 𝜇)(𝑥 − 𝜇)^{𝑡}], Σ_{𝑖𝑗} = 𝐸[(𝑥_{𝑖} − 𝜇_{𝑖})(𝑥_{𝑗} − 𝜇_{𝑗})]

• The covariance matrix Σ is always symmetric and positive semidefinite.


### Normal Density - Multivariate Case

• The multivariate normal density (MND) is completely specified by 𝑑 + 𝑑(𝑑 + 1)/2 parameters. Samples drawn from an MND fall in a cluster whose center is determined by 𝜇 and whose shape is determined by Σ. The loci of points of constant density are hyperellipsoids

𝑟^{2} = (𝑥 − 𝜇)^{𝑡}Σ^{−1}(𝑥 − 𝜇) = constant

• 𝑟 is called the Mahalanobis distance from 𝑥 to 𝜇.

• The principal axes of the hyperellipsoids are given by the eigenvectors of Σ.
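The Mahalanobis distance is a direct quadratic form. The sketch below uses a hypothetical 2-D mean and a diagonal covariance matrix so the inverse can be written by hand.

```python
# Mahalanobis distance r in 2-D, with a hypothetical mean and a diagonal
# covariance Sigma = [[2, 0], [0, 0.5]] whose inverse is written by hand.
mu = (1.0, 2.0)
sigma_inv = ((0.5, 0.0),
             (0.0, 2.0))
x = (3.0, 2.0)

d = (x[0] - mu[0], x[1] - mu[1])
# r^2 = (x - mu)^t Sigma^{-1} (x - mu)
r2 = (d[0] * (sigma_inv[0][0] * d[0] + sigma_inv[0][1] * d[1])
      + d[1] * (sigma_inv[1][0] * d[0] + sigma_inv[1][1] * d[1]))
r = r2 ** 0.5
```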


### Normal Density - Multivariate Case

• The minimum-error-rate classification can be achieved using the discriminant functions:

𝑔_{𝑖}(𝑥) = 𝑝(𝑥|𝜔_{𝑖})𝑝(𝜔_{𝑖})
or

𝑔_{𝑖}(𝑥) = ln 𝑝(𝑥|𝜔_{𝑖}) + ln 𝑝(𝜔_{𝑖})

• If 𝑝(𝑥|𝜔_{𝑖})~𝑁(𝜇_{𝑖}, Σ_{𝑖}), then

𝑔_{𝑖}(𝑥) = −(1/2)(𝑥 − 𝜇_{𝑖})^{𝑡}Σ_{𝑖}^{−1}(𝑥 − 𝜇_{𝑖}) − (𝑑/2) ln 2𝜋 − (1/2) ln|Σ_{𝑖}| + ln 𝑝(𝜔_{𝑖})

since

𝑝(𝑥|𝜔_{𝑖}) = (1/((2𝜋)^{𝑑/2}|Σ_{𝑖}|^{1/2})) exp[−(1/2)(𝑥 − 𝜇_{𝑖})^{𝑡}Σ_{𝑖}^{−1}(𝑥 − 𝜇_{𝑖})]


### Discriminant Function for Normal Density

• Assume the features are statistically independent, and each feature
has the same variance, that is, Σ_{𝑖} = 𝜎^{2}𝐈

• What is the discriminant function?


### Discriminant Function for Normal Density

• Assume the features are statistically independent, and each feature
has the same variance, that is, Σ_{𝑖} = 𝜎^{2} 𝐈

• What is the discriminant function?

𝑔_{𝑖}(𝑥) = −(1/2)(𝑥 − 𝜇_{𝑖})^{𝑡}Σ_{𝑖}^{−1}(𝑥 − 𝜇_{𝑖}) − (𝑑/2) ln 2𝜋 − (1/2) ln|Σ_{𝑖}| + ln 𝑝(𝜔_{𝑖})

= −‖𝑥 − 𝜇_{𝑖}‖^{2}/(2𝜎^{2}) − (𝑑/2) ln 2𝜋 − (1/2) ln 𝜎^{2𝑑} + ln 𝑝(𝜔_{𝑖})

Since −(𝑑/2) ln 2𝜋 − (1/2) ln 𝜎^{2𝑑} is independent of the class,

𝑔_{𝑖}(𝑥) = −‖𝑥 − 𝜇_{𝑖}‖^{2}/(2𝜎^{2}) + ln 𝑝(𝜔_{𝑖})


### Discriminant Function for Normal Density

• What is the discriminant function?

𝑔_{𝑖}(𝑥) = −‖𝑥 − 𝜇_{𝑖}‖^{2}/(2𝜎^{2}) + ln 𝑝(𝜔_{𝑖})

= −(1/(2𝜎^{2}))(𝑥^{𝑡}𝑥 − 2𝜇_{𝑖}^{𝑡}𝑥 + 𝜇_{𝑖}^{𝑡}𝜇_{𝑖}) + ln 𝑝(𝜔_{𝑖})

Since 𝑥^{𝑡}𝑥 is also independent of the class,

𝑔_{𝑖}(𝑥) = 𝑤_{𝑖}^{𝑡}𝑥 + 𝑤_{𝑖0}
where

𝑤_{𝑖} = 𝜇_{𝑖}/𝜎^{2}, 𝑤_{𝑖0} = −𝜇_{𝑖}^{𝑡}𝜇_{𝑖}/(2𝜎^{2}) + ln 𝑝(𝜔_{𝑖})
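The linear form above can be sketched directly. The means, variance, and priors below are hypothetical; with equal priors the rule simply favors the nearer mean.

```python
import math

# Sketch of the linear discriminant for the case Sigma_i = sigma^2 I:
# g_i(x) = w_i^t x + w_i0 with w_i = mu_i / sigma^2 and
# w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln p(w_i). All numbers hypothetical.
def linear_g(x, mu, sigma2, prior):
    w = [m / sigma2 for m in mu]
    w0 = -sum(m * m for m in mu) / (2.0 * sigma2) + math.log(prior)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

mu1, mu2 = [0.0, 0.0], [2.0, 2.0]
x = [0.5, 0.5]
g1 = linear_g(x, mu1, 1.0, 0.5)
g2 = linear_g(x, mu2, 1.0, 0.5)
# x is closer to mu1, so g1 > g2 when the priors are equal
```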


### Discriminant Function for Normal Density

• The decision surface between classes 𝑖 and 𝑗 is obtained by letting
𝑔_{𝑖}(𝑥) = 𝑔_{𝑗}(𝑥),

which yields

𝑔(𝑥) = 𝑔_{𝑖}(𝑥) − 𝑔_{𝑗}(𝑥) = 𝑤^{𝑡}𝑥 + 𝑤_{0} = 0
where

𝑤 = 𝑤_{𝑖} − 𝑤_{𝑗} = (𝜇_{𝑖} − 𝜇_{𝑗})/𝜎^{2}

𝑤_{0} = 𝑤_{𝑖0} − 𝑤_{𝑗0} = −(𝜇_{𝑖}^{𝑡}𝜇_{𝑖} − 𝜇_{𝑗}^{𝑡}𝜇_{𝑗})/(2𝜎^{2}) + ln 𝑝(𝜔_{𝑖}) − ln 𝑝(𝜔_{𝑗})

• This linear classifier corresponds to a neuron model as

• 𝑖 −class region if 𝑔 𝑥 > 0

• 𝑗 −class region if 𝑔 𝑥 < 0

• decision boundary if 𝑔 𝑥 = 0


• If 𝑝(𝜔_{𝑖}) is equal to 𝑝(𝜔_{𝑗}), the decision boundary is the perpendicular bisector of the segment joining 𝜇_{𝑖} and 𝜇_{𝑗}.

### Discriminant Function for Normal Density


### Discriminant Function for Normal Density

• If 𝑝(𝜔_{𝑖}) is not equal to 𝑝(𝜔_{𝑗}), the boundary shifts away from the mean of the more probable class.


### Discriminant Function for Normal Density

• Assume the features are not statistically independent and each
feature may have a different variance, but every class has the same
covariance matrix, that is, Σ_{𝑖} = Σ

• What is the discriminant function?


### Discriminant Function for Normal Density

• Assume the features are not statistically independent and each
feature may have a different variance, but every class has the
same covariance matrix, that is, Σ_{𝑖} = Σ

• What is the discriminant function?

𝑔_{𝑖}(𝑥) = −(1/2)(𝑥 − 𝜇_{𝑖})^{𝑡}Σ^{−1}(𝑥 − 𝜇_{𝑖}) − (𝑑/2) ln 2𝜋 − (1/2) ln|Σ| + ln 𝑝(𝜔_{𝑖})

Since −(𝑑/2) ln 2𝜋 − (1/2) ln|Σ| is independent of the class,

𝑔_{𝑖}(𝑥) = −(1/2)(𝑥 − 𝜇_{𝑖})^{𝑡}Σ^{−1}(𝑥 − 𝜇_{𝑖}) + ln 𝑝(𝜔_{𝑖})


### Discriminant Function for Normal Density

• What is the discriminant function?

𝑔_{𝑖}(𝑥) = −(1/2)(𝑥 − 𝜇_{𝑖})^{𝑡}Σ^{−1}(𝑥 − 𝜇_{𝑖}) + ln 𝑝(𝜔_{𝑖})

= −(1/2)(𝑥^{𝑡}Σ^{−1}𝑥 − 2𝜇_{𝑖}^{𝑡}Σ^{−1}𝑥 + 𝜇_{𝑖}^{𝑡}Σ^{−1}𝜇_{𝑖}) + ln 𝑝(𝜔_{𝑖})

Since 𝑥^{𝑡}Σ^{−1}𝑥 is also independent of the class,

𝑔_{𝑖}(𝑥) = 𝑤_{𝑖}^{𝑡}𝑥 + 𝑤_{𝑖0}
where

𝑤_{𝑖} = Σ^{−1}𝜇_{𝑖}, 𝑤_{𝑖0} = −𝜇_{𝑖}^{𝑡}Σ^{−1}𝜇_{𝑖}/2 + ln 𝑝(𝜔_{𝑖})


### Discriminant Function for Normal Density

• The decision surface between classes 𝑖 and 𝑗 is obtained by letting
𝑔_{𝑖}(𝑥) = 𝑔_{𝑗}(𝑥),

which yields

𝑔(𝑥) = 𝑔_{𝑖}(𝑥) − 𝑔_{𝑗}(𝑥) = 𝑤^{𝑡}𝑥 + 𝑤_{0} = 0
where

𝑤 = 𝑤_{𝑖} − 𝑤_{𝑗} = Σ^{−1}(𝜇_{𝑖} − 𝜇_{𝑗})

𝑤_{0} = 𝑤_{𝑖0} − 𝑤_{𝑗0} = −(𝜇_{𝑖}^{𝑡}Σ^{−1}𝜇_{𝑖} − 𝜇_{𝑗}^{𝑡}Σ^{−1}𝜇_{𝑗})/2 + ln 𝑝(𝜔_{𝑖}) − ln 𝑝(𝜔_{𝑗})

• This linear classifier corresponds to a neuron model as

• 𝑖 −class region if 𝑔 𝑥 > 0

• 𝑗 −class region if 𝑔 𝑥 < 0

• decision boundary if 𝑔 𝑥 = 0


### Discriminant Function for Normal Density

• Assume the features are not statistically independent and each
feature may have a different variance, and each class has its own
covariance matrix Σ_{𝑖}

• What is the discriminant function?


### Discriminant Function for Normal Density

• Assume the features are not statistically independent and
each feature may have a different variance, and each class has
its own covariance matrix Σ_{𝑖}

• What is the discriminant function?

𝑔_{𝑖}(𝑥) = −(1/2)(𝑥 − 𝜇_{𝑖})^{𝑡}Σ_{𝑖}^{−1}(𝑥 − 𝜇_{𝑖}) − (𝑑/2) ln 2𝜋 − (1/2) ln|Σ_{𝑖}| + ln 𝑝(𝜔_{𝑖})

• Since −(𝑑/2) ln 2𝜋 is independent of the class,

𝑔_{𝑖}(𝑥) = −(1/2)(𝑥 − 𝜇_{𝑖})^{𝑡}Σ_{𝑖}^{−1}(𝑥 − 𝜇_{𝑖}) − (1/2) ln|Σ_{𝑖}| + ln 𝑝(𝜔_{𝑖})


### Discriminant Function for Normal Density

• What is the discriminant function?

𝑔_{𝑖}(𝑥) = −(1/2)(𝑥 − 𝜇_{𝑖})^{𝑡}Σ_{𝑖}^{−1}(𝑥 − 𝜇_{𝑖}) − (1/2) ln|Σ_{𝑖}| + ln 𝑝(𝜔_{𝑖})

= −(1/2)(𝑥^{𝑡}Σ_{𝑖}^{−1}𝑥 − 2𝜇_{𝑖}^{𝑡}Σ_{𝑖}^{−1}𝑥 + 𝜇_{𝑖}^{𝑡}Σ_{𝑖}^{−1}𝜇_{𝑖}) − (1/2) ln|Σ_{𝑖}| + ln 𝑝(𝜔_{𝑖})

Since 𝑥^{𝑡}Σ_{𝑖}^{−1}𝑥 is not independent of the class,

𝑔_{𝑖}(𝑥) = 𝑥^{𝑡}𝑊_{𝑖}𝑥 + 𝑤_{𝑖}^{𝑡}𝑥 + 𝑤_{𝑖0}

where

𝑊_{𝑖} = −(1/2)Σ_{𝑖}^{−1}, 𝑤_{𝑖} = Σ_{𝑖}^{−1}𝜇_{𝑖}, 𝑤_{𝑖0} = −𝜇_{𝑖}^{𝑡}Σ_{𝑖}^{−1}𝜇_{𝑖}/2 − (1/2) ln|Σ_{𝑖}| + ln 𝑝(𝜔_{𝑖})
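The quadratic discriminant can be sketched with hypothetical classes. Using diagonal covariance matrices keeps the inverse and determinant trivial; with the same mean but different spreads, the tighter class wins near the mean and the wider one wins far away.

```python
import math

# Sketch of the quadratic discriminant for class-specific covariances:
# g_i(x) = -1/2 (x-mu_i)^t Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln p(w_i).
# Hypothetical 2-D classes with diagonal covariances, so the inverse and
# determinant can be written directly from the diagonal variances.
def quad_g(x, mu, diag_var, prior):
    maha2 = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, diag_var))
    log_det = sum(math.log(v) for v in diag_var)
    return -0.5 * maha2 - 0.5 * log_det + math.log(prior)

mu, tight, wide = [0.0, 0.0], [1.0, 1.0], [4.0, 4.0]   # same mean, different spread
near, far = [0.5, 0.0], [3.0, 0.0]
g_tight_near, g_wide_near = quad_g(near, mu, tight, 0.5), quad_g(near, mu, wide, 0.5)
g_tight_far, g_wide_far = quad_g(far, mu, tight, 0.5), quad_g(far, mu, wide, 0.5)
```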


### Decision Boundary for General Gaussian

• Decision boundaries are hyperquadrics in the general case


### Linear Mapping

### 𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥

(Figure: the linear map 𝑇 from space 𝑋 to space 𝑌.)

### Affine Mapping

### 𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥 + 𝑏

(Figure: the affine map from 𝑋 to 𝑌, with a constant input 1 weighted by 𝑏.)

### Nonlinear Mapping

𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎(𝑊𝑥 + 𝑏), 𝑓(𝑥) = 𝑎^{𝑇}𝑦

(Figure: the nonlinear map from 𝑥 ∈ 𝑋 to 𝑦 = (𝑦_{1}, … , 𝑦_{𝑛}) ∈ 𝑌, with bias input 1 weighted by 𝑏, followed by the output 𝑓(𝑥) = 𝑎^{𝑇}𝑦.)

### Error Probabilities and Integrals

• Consider the 2-class problem and suppose that the feature space is divided into
2 regions 𝑅_{1} and 𝑅_{2}*. There are 2 ways in which a classification error can occur.*

• An observation 𝑥 falls in 𝑅_{2}, and the true state is 𝜔_{1}.

• An observation 𝑥 falls in 𝑅_{1}, and the true state is 𝜔_{2}.

• The error probability

𝑃(𝑒𝑟𝑟𝑜𝑟) = 𝑃(𝑥 ∈ 𝑅_{2}|𝜔_{1})𝑝(𝜔_{1}) + 𝑃(𝑥 ∈ 𝑅_{1}|𝜔_{2})𝑝(𝜔_{2})

= ∫_{𝑅_{2}} 𝑝(𝑥|𝜔_{1})𝑝(𝜔_{1}) 𝑑𝑥 + ∫_{𝑅_{1}} 𝑝(𝑥|𝜔_{2})𝑝(𝜔_{2}) 𝑑𝑥
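The two integrals can be checked numerically for a hypothetical 1-D example: two Gaussian classes 𝑁(−1, 1) and 𝑁(1, 1) with equal priors and the boundary at 𝑥* = 0, where the analytic answer is 1 − Φ(1) ≈ 0.1587.

```python
import math

# Numerically checking P(error) for two hypothetical 1-D Gaussian classes
# N(-1, 1) and N(1, 1) with equal priors and the boundary at x* = 0.
def npdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

p1 = p2 = 0.5
mu1, mu2, x_star, dx = -1.0, 1.0, 0.0, 1e-3

# P(error) = ∫_{R2} p(x|w1) p(w1) dx + ∫_{R1} p(x|w2) p(w2) dx,
# approximated by Riemann sums out to 8 units from the boundary
err = sum(npdf(x_star + k * dx, mu1, 1.0) * p1 * dx for k in range(1, 8001)) \
    + sum(npdf(x_star - k * dx, mu2, 1.0) * p2 * dx for k in range(1, 8001))
# analytically this equals 1 - Phi(1) ≈ 0.1587
```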


### Error Probabilities and Integrals

• Because 𝑥^{∗} is chosen arbitrarily, the probability of error is not as
small as it might be.

• 𝑥_{𝐵} = the Bayes optimal decision boundary, which gives the lowest
probability of error.

• The Bayes classifier maximizes the probability of a correct decision.

𝑃(𝑐𝑜𝑟𝑟𝑒𝑐𝑡) = Σ_{𝑖=1}^{𝐶} 𝑃(𝑥 ∈ 𝑅_{𝑖}, 𝜔_{𝑖}) = Σ_{𝑖=1}^{𝐶} ∫_{𝑅_{𝑖}} 𝑝(𝑥|𝜔_{𝑖})𝑝(𝜔_{𝑖}) 𝑑𝑥


### Interim Summary

• Decision Region

• (Linear) Discriminant Function

• Decision Surface, Linear Machine

• Bayes Classifier

• Normal Density ND: Univariate Case

• MND: Multivariate Case1: Indep., Same Variance

• MND: Multivariate Case2: Same Covariance Matrix

• MND: Multivariate Case3: Different Covariance Matrix

• Error Probability

• Remaining issues: learning of PDFs, NN learning, Bayesian learning
