Bayesian Decision Theory (I)
Jin Young Choi
Seoul National University
J. Y. Choi. SNU
Bayesian Decision Theory
• Bayes Formula
• Prior, Posterior, and Likelihood Probabilities
• Bayes Decision
• Risk Formulation
• Expected Loss, Conditional Risk, Total Risk
• Likelihood Ratio Test
• Decision Region
• Classifier for Bayes Decision
Bayesian Decision
• What do we learn from experience or observation?
• How can we decide the category of an object from observation?
• Is our decision always correct?
• We make a decision probabilistically
Bayesian Decision
• Question:
• Two kinds of fish live in a lake: tuna and salmon.
• If you catch a fish, is it more likely to be tuna or salmon?
Bayesian Decision
• From past experience, salmon has been caught 70% of the time and tuna 30%.
• What is the next fish likely to be?
Bayesian Decision
• If other types of fish are irrelevant:
p(ω = ω1) + p(ω = ω2) = 1,
where ω is a random variable, and ω1 and ω2 denote salmon and tuna.
• Probabilities reflect our prior knowledge obtained from past experience.
• Simple Decision Rule:
• Make a decision without seeing the fish.
• Decide ω1 if p(ω = ω1) > p(ω = ω2); decide ω2 otherwise.
Bayesian Decision
• In general, we will have some features and more information.
• Feature: lightness measurement = 𝑥
• Different fish yield different lightness readings (𝑥 is a random variable)
Bayesian Decision
• Define
• 𝑝(𝑥|𝜔𝑖)= Class Conditional Probability Density
• The difference between 𝑝(𝑥|𝜔1) and 𝑝(𝑥|𝜔2) describes the difference in lightness between tuna and salmon.
Bayesian Decision
• Hypothetical class-conditional probability
• Density functions are normalized (area under each curve is 1.0)
Bayesian Decision
• Suppose that we know
• The prior probabilities p(ω1) and p(ω2)
• The conditional densities p(x|ω1) and p(x|ω2)
• Measure lightness of a fish = 𝑥
• What is the category of the fish with lightness of 𝑥 ?
• The probability that the fish has category of 𝜔𝑖 is 𝑝(𝜔𝑖|𝑥).
Bayes formula
• p(ωi|x) = p(x|ωi) p(ωi) / p(x),
where p(x) = Σ_j p(x|ωj) p(ωj) = Σ_j p(x, ωj).
• Posterior = (Likelihood × Prior) / Evidence
• 𝑝(𝑥|𝜔𝑖) is called the likelihood of 𝜔𝑖 with respect to 𝑥.
• The 𝜔𝑖 category for which 𝑝(𝑥|𝜔𝑖) is large is more "likely" to be the true category
• 𝑝(𝑥) is the evidence
• How frequently a pattern with feature value x is observed.
• A scale factor ensuring the posterior probabilities sum to 1.
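The formula above can be checked with a quick numerical sketch; the prior and likelihood values below are made-up numbers for the fish example, not values from the slides.

```python
# Hypothetical priors and likelihoods p(x|omega_i) at one observed x;
# these numbers are illustrative assumptions.
priors = {"salmon": 0.7, "tuna": 0.3}
likelihoods = {"salmon": 0.6, "tuna": 0.2}

# Evidence p(x) = sum_j p(x|omega_j) p(omega_j)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior p(omega_i|x) = likelihood * prior / evidence
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
```

Note that the evidence only rescales the products likelihood × prior so that the posteriors sum to 1.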
Bayes formula
• Posterior probabilities for the particular priors 𝑝(𝜔1) = 2/3 and 𝑝(𝜔2) = 1/3. At every 𝑥 the posteriors sum to 1.
Bayes Decision Rule (Minimum probability of error)
• Likelihood Decision:
• 𝜔1 ∶ 𝑖𝑓 𝑝(𝑥|𝜔1) > 𝑝(𝑥|𝜔2)
• 𝜔2 ∶ otherwise
• Posterior Decision:
• ω1 : if p(x|ω1) p(ω1) > p(x|ω2) p(ω2)
• ω2 : otherwise
• Decision Error Probability
• p(error|x) = min(p(ω1|x), p(ω2|x)), since the decision error is given by
p(error|x) = p(ω2|x) if we decide ω1, and p(ω1|x) if we decide ω2.
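The minimum-error rule can be sketched as a small function; the class names and posterior values used below are illustrative assumptions.

```python
def bayes_decide(posteriors):
    """Pick the class with the highest posterior.

    The conditional error probability p(error|x) is the posterior mass
    of the classes we did not choose (= 1 - max posterior).
    """
    decided = max(posteriors, key=posteriors.get)
    p_error = 1.0 - posteriors[decided]
    return decided, p_error
```

With two classes, `p_error` equals min(p(ω1|x), p(ω2|x)), exactly as in the slide.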
Features: General Case
• Formalize the ideas just considered in 4 ways:
• Allow more than one feature:
• Replace the scalar 𝑥 by the feature vector 𝑥
• d-dimensional space ℛ𝑑, 𝑥 ∈ ℛ𝑑, is called the feature space
• Allow more than 2 states of nature:
• Generalize to several classes
• Allow actions other than merely deciding the state of nature:
• Possibility of rejection, refusing to make a decision in close cases
• Introduce a general loss function: risk (loss) minimization.
Loss Function
• A loss (or cost) function states exactly how costly each action is, and is used to convert a probability into a decision. Loss functions let us treat situations in which some kinds of classification mistakes are more costly than others.
Formulation
• Let 𝜔1, … , 𝜔𝑐 be the finite set of c states of nature ("categories").
• Let 𝛼1, … , 𝛼𝑎 be the finite set of a possible actions.
(Ex) Action αi = deciding that the true state is ωi, or some other action such as rejection.
• The loss function λ(αi|ωj) = loss incurred for taking action αi when the state of nature is ωj.
• 𝑥 = 𝑑 −dimensional feature vector (random variable)
• 𝑝(𝑥|𝜔𝑖) = likelihood probability density function for 𝑥 for given 𝜔𝑖
• 𝑝(𝜔𝑖) = prior probability that nature is in state 𝜔𝑖.
Expected Loss (Risk)
• Suppose that we observe a particular 𝑥 and that we take action 𝛼𝑖
• If the true state of nature is ωj, then the loss is λ(αi|ωj), and the expected loss before observation is
R(αi) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj)
Conditional Risk (Loss)
• After the observation, the expected loss, now called the "conditional risk", is given by
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x)
Total Risk
• Objective: Select the action that minimizes the conditional risk
• A general decision rule is a function 𝛼 𝑥
• For every x, the decision function α(x) assumes one of the values α1, …, αa, that is,
α(x) = argmin_{αi} R(αi|x) = argmin_{αi} Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x)
• The "total risk" is
∫ R(α(x)|x) p(x) dx
• Learning:
min_θ ∫ R(αθ(x)|x) p(x) dx, where fθ(x) ≈ (p(ω1|x), p(ω2|x), …, p(ωc|x))
Bayes Decision Rule:
• Compute the conditional risk
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x) for i = 1, …, a.
• Select the action 𝛼𝑖 for which 𝑅 𝛼𝑖 𝑥 is minimum.
• The resulting minimum total risk is called the Bayes Risk, denoted 𝑅∗, and
is the best performance that can be achieved.
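A minimal sketch of this rule, using a hypothetical 2×2 loss matrix and posterior (both illustrative assumptions): rows of the loss matrix index actions, columns index true states.

```python
# lam[i][j] = loss for taking action alpha_i when the true state is omega_j;
# the numbers are illustrative assumptions.
lam = [[0.0, 2.0],   # alpha_1: correct for omega_1, costly if omega_2
       [1.0, 0.0]]   # alpha_2: cheaper mistake, correct for omega_2
posterior = [0.6, 0.4]  # p(omega_1|x), p(omega_2|x)

# Conditional risk R(alpha_i|x) = sum_j lam[i][j] * p(omega_j|x)
risks = [sum(l * p for l, p in zip(row, posterior)) for row in lam]

# Bayes rule: take the action with minimum conditional risk
best_action = min(range(len(risks)), key=risks.__getitem__)
```

Note that although the posterior favors ω1, the asymmetric loss makes α2 the lower-risk action, illustrating how losses can overturn the posterior-only decision.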
Two-Category Classification
• Action 𝛼1 = deciding that the true state is 𝜔1
• Action 𝛼2 = deciding that the true state is 𝜔2
• Let 𝜆𝑖𝑗 = 𝜆 𝛼𝑖 𝜔𝑗 be the risk incurred for deciding 𝜔𝑖 when true state is 𝜔𝑗.
• The conditional risks:
R(α1|x) = λ11 p(ω1|x) + λ12 p(ω2|x)
R(α2|x) = λ21 p(ω1|x) + λ22 p(ω2|x)
• Decide ω1 if R(α1|x) < R(α2|x),
or if (λ21 − λ11) p(ω1|x) > (λ12 − λ22) p(ω2|x),
or if (λ21 − λ11) p(x|ω1) p(ω1) > (λ12 − λ22) p(x|ω2) p(ω2); decide ω2 otherwise.
Two-Category Likelihood Ratio Test
• Under the reasonable assumption that λ12 > λ22 and λ21 > λ11 (why?), decide ω1 if
p(x|ω1) / p(x|ω2) > (λ12 − λ22) p(ω2) / ((λ21 − λ11) p(ω1)) = T, and ω2 otherwise.
• The ratio p(x|ω1) / p(x|ω2) is called the likelihood ratio.
• We decide ω1 if the likelihood ratio exceeds a threshold value T that is independent of the observation x.
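The test can be sketched as follows; the default zero-one losses and the input values in the comments are illustrative assumptions.

```python
def lrt_decide(px_w1, px_w2, p_w1, p_w2, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Likelihood ratio test: decide omega_1 iff p(x|w1)/p(x|w2) > T.

    The threshold T depends only on losses and priors, not on x.
    Default losses are zero-one (illustrative assumption).
    """
    T = (l12 - l22) * p_w2 / ((l21 - l11) * p_w1)
    return "omega1" if px_w1 / px_w2 > T else "omega2"
```

With zero-one losses the threshold reduces to the prior ratio p(ω2)/p(ω1).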
Minimum-Error-Rate Classification
• Recall the conditional risk R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x)
• If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j, and in error otherwise.
• To give an equal cost to all errors, we define the zero-one loss function as
λ(αi|ωj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, C
Minimum-Error-Rate Classification cont.
• The conditional risk, representing the error rate, is
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) p(ωj|x) = Σ_{j≠i} p(ωj|x) = 1 − p(ωi|x)
• To minimize the average probability of error, we should select the i that maximizes the posterior probability p(ωi|x):
Decide ωi if p(ωi|x) > p(ωj|x) for all j ≠ i (same as Bayes' decision rule)
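The equivalence above (minimizing zero-one risk = maximizing the posterior) can be checked in a few lines; the posterior vector is an illustrative assumption.

```python
# Under the zero-one loss, R(alpha_i|x) = 1 - p(omega_i|x), so the
# minimum-risk action and the maximum-posterior class coincide.
posterior = [0.2, 0.5, 0.3]               # hypothetical p(omega_i|x)
risks = [1.0 - p for p in posterior]      # zero-one conditional risks

best_by_risk = min(range(3), key=risks.__getitem__)
best_by_posterior = max(range(3), key=posterior.__getitem__)
```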
Decision Regions
• The likelihood ratio 𝑝 𝑥 𝜔1 /𝑝(𝑥|𝜔2) vs. 𝑥
• The threshold 𝑇𝑎 for a loss function
p(x|ω1) / p(x|ω2) > (λ12 − λ22) p(ω2) / ((λ21 − λ11) p(ω1)) = T_a, equivalently g1(x) > g2(x)
(Figure: the likelihood ratio p(x|ω1)/p(x|ω2) plotted against x, with the threshold T_a separating the decision regions.)
Classifiers; Discriminant Functions
• A classifier can be represented by a set of discriminant functions 𝑔𝑖 𝑥 ; 𝑖 = 1, … , 𝐶.
• The classifier assigns observation x to class ωi if gi(x) > gj(x) for all j ≠ i.
• Example forms (from the slide figure): a Gaussian-shaped discriminant gi(x) = exp(−½‖x − μi‖²) and a linear discriminant gi(x) = wᵗx + b.
The Bayes Classifier
• A Bayes classifier can be represented in this way
• For the minimum error-rate case
𝑔𝑖 𝑥 = 𝑝(𝜔𝑖|𝑥)
• For the general case with risks
gi(x) = −R(αi|x) = −Σ_j λij p(ωj|x)
• If we replace gi(x) by f(gi(x)), where f(·) is a monotonically increasing function (e.g., log), the resulting classification is unchanged.
𝑔𝑖 𝑥 = 𝑝 𝑥 𝜔𝑖 𝑝 𝜔𝑖
𝑔𝑖 𝑥 = ln 𝑝 𝑥 𝜔𝑖 + ln 𝑝(𝜔𝑖)
The Bayes Classifier
• The effect of any decision rule is to divide the feature space into C decision regions, 𝑅1, … , 𝑅𝐶.
• If gi(x) > gj(x) for all j ≠ i, then x is in Ri, and x is assigned to ωi.
• Decision regions are separated by decision boundaries.
• Decision boundaries are surfaces in the feature space.
(Figure: decision regions in a two-dimensional feature space; Feature 1 = weight, Feature 2 = height.)
Interim Summary
• Bayes Formula
• Prior Probability
• Likelihood
• Evidence
• Posterior Probability
• Bayes Decision
• Risk Formulation
• Loss Function
• Expected Loss, Conditional Risk
• Total Risk
• Likelihood Ratio Test
• Zero-one Loss Function (Bayes Decision)
• Decision Region and Classifiers
Bayesian Decision Theory (II)
Jin Young Choi
Seoul National University
Bayesian Decision Theory
• Bayes Formula
• Prior, Posterior, and Likelihood Probabilities
• Bayes Decision
• Risk Formulation
• Expected Loss, Conditional Risk, Total Risk
• Likelihood Ratio Test
• Decision Region
• Classifier for Bayes Decision
Next Outline
• Bayes Classifier
• Normal Density
• ND: Univariate Case
• MND: Multivariate Case1: Indep., Same Variance
• MND: Multivariate Case2: Same Covariance Mtx.
• MND: Multivariate Case3: Different Covariance Mtx.
• Error Probability
The Decision Regions
• A two-dimensional, two-category classifier
Two-Category Case
• Use two discriminant functions g1(x) and g2(x), assigning x to ω1 if g1(x) > g2(x).
• Alternative: define a single discriminant function
g(x) = g1(x) − g2(x),
and decide ω1 if g(x) > 0; otherwise decide ω2.
• In the two-category case, two forms are frequently used:
g(x) = p(ω1|x) − p(ω2|x)
g(x) = ln[p(x|ω1) / p(x|ω2)] + ln[p(ω1) / p(ω2)]
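The second form can be sketched with hypothetical univariate Gaussian class densities; all parameter values, and the helper names `normal_pdf` and `g`, are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def g(x, mu1, mu2, sigma, p1, p2):
    """Single discriminant g(x) = ln[p(x|w1)/p(x|w2)] + ln[p(w1)/p(w2)]."""
    return math.log(normal_pdf(x, mu1, sigma) / normal_pdf(x, mu2, sigma)) \
         + math.log(p1 / p2)
```

Decide ω1 whenever g(x) > 0; with equal priors and equal variances, the boundary sits midway between the two means.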
Normal Density - Univariate Case
• Gaussian density with mean μ and standard deviation σ (σ² is called the variance):
p(x) = (1 / ((2π)^{1/2} σ)) exp(−½ ((x − μ)/σ)²),  p(x) ~ N(μ, σ²)
• It can be shown that:
μ = E[x] = ∫_{−∞}^{∞} x p(x) dx
σ² = E[(x − μ)²] = ∫_{−∞}^{∞} (x − μ)² p(x) dx
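The density formula translates directly into code; `normal_pdf` is a hypothetical helper name.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density p(x) = exp(-((x-mu)/sigma)^2 / 2) / (sqrt(2 pi) sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)
```

At x = μ the density peaks at 1/(σ√(2π)), about 0.3989 for the standard normal.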
Normal Density - Multivariate Case
• The general multivariate normal density (MND) in d dimensions is written as
p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − μ)ᵗ Σ⁻¹ (x − μ))
μ = E[x] = ∫_{−∞}^{∞} x p(x) dx
Σ = E[(x − μ)(x − μ)ᵗ],  Σij = E[(xi − μi)(xj − μj)]
• The covariance matrix Σ is always symmetric and positive semidefinite.
Normal Density - Multivariate Case
• The multivariate normal density (MND) is completely specified by d + d(d+1)/2 parameters. Samples drawn from an MND fall in a cluster whose center is determined by μ and whose shape is determined by Σ. The loci of points of constant density are hyperellipsoids
r² = (x − μ)ᵗ Σ⁻¹ (x − μ) = constant
• r is called the Mahalanobis distance from x to μ.
• The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
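The distance can be sketched for a 2-D feature with the 2×2 inverse written out by hand; the helper name `mahalanobis2d` and the inputs are assumptions for illustration.

```python
def mahalanobis2d(x, mu, Sigma):
    """Mahalanobis distance r = sqrt((x-mu)^t Sigma^{-1} (x-mu)) for 2-D x."""
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    # Apply the closed-form 2x2 inverse of Sigma to the difference vector
    qx = ( d * dx - b * dy) / det
    qy = (-c * dx + a * dy) / det
    return (dx * qx + dy * qy) ** 0.5
```

With Σ = I the Mahalanobis distance reduces to the ordinary Euclidean distance; a larger variance along an axis shrinks distances along that axis.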
Normal Density - Multivariate Case
• The minimum-error-rate classification can be achieved using the discriminant functions:
𝑔𝑖 𝑥 = 𝑝 𝑥 𝜔𝑖 𝑝(𝜔𝑖) or
𝑔𝑖 𝑥 = ln 𝑝 𝑥 𝜔𝑖 + ln 𝑝(𝜔𝑖)
• If p(x|ωi) ~ N(μi, Σi), then
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln|Σi| + ln p(ωi),
since
p(x|ωi) = (1 / ((2π)^{d/2} |Σi|^{1/2})) exp(−½ (x − μi)ᵗ Σi⁻¹ (x − μi))
Discriminant Function for Normal Density
• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎2𝐈
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎2 𝐈
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln|Σi| + ln p(ωi)
= −‖x − μi‖² / (2σ²) − (d/2) ln 2π − ½ ln σ^{2d} + ln p(ωi)
Since −(d/2) ln 2π − ½ ln σ^{2d} is independent of the class,
gi(x) = −‖x − μi‖² / (2σ²) + ln p(ωi)
Discriminant Function for Normal Density
• What is the discriminant function?
gi(x) = −‖x − μi‖² / (2σ²) + ln p(ωi)
= −(1 / (2σ²)) (xᵗx − 2μiᵗx + μiᵗμi) + ln p(ωi)
Since xᵗx is also independent of the class,
gi(x) = wiᵗx + wi0, where
wi = μi / σ²
wi0 = −μiᵗμi / (2σ²) + ln p(ωi)
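The resulting linear discriminant can be sketched directly; the function name and the example means, variance, and priors used in the comments are assumptions.

```python
import math

def linear_discriminant(x, mu_i, sigma2, prior_i):
    """Case Sigma_i = sigma^2 I: g_i(x) = w_i^t x + w_i0 with
    w_i = mu_i / sigma^2 and w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln p(omega_i)."""
    w = [m / sigma2 for m in mu_i]
    w0 = -sum(m * m for m in mu_i) / (2 * sigma2) + math.log(prior_i)
    return sum(wk * xk for wk, xk in zip(w, x)) + w0
```

Classify x to the class whose discriminant is largest; e.g. with hypothetical means (0,0) and (4,0), equal priors, and σ² = 1, a point at (1,0) scores higher for the first class.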
Discriminant Function for Normal Density
• Decision surface between 𝑖 and 𝑗 classes is obtained by letting 𝑔𝑖 𝑥 = 𝑔𝑗 𝑥
which yields
g(x) = gi(x) − gj(x) = wᵗx + w0 = 0, where
w = wi − wj = (μi − μj) / σ²
w0 = wi0 − wj0 = −(μiᵗμi − μjᵗμj) / (2σ²) + ln p(ωi) − ln p(ωj)
• This linear classifier corresponds to a neuron model as
• 𝑖 −class region if 𝑔 𝑥 > 0
• 𝑗 −class region if 𝑔 𝑥 < 0
• decision boundary if 𝑔 𝑥 = 0
• If p(ωi) is equal to p(ωj),
Discriminant Function for Normal Density
Discriminant Function for Normal Density
• If p(ωi) is not equal to p(ωj),
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, but every class has the same covariance matrix, that is, Σi = Σ.
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, but every class has the same covariance matrix, that is, Σi = Σ.
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σ⁻¹ (x − μi) − (d/2) ln 2π − ½ ln|Σ| + ln p(ωi)
Since −(d/2) ln 2π − ½ ln|Σ| is independent of the class,
gi(x) = −½ (x − μi)ᵗ Σ⁻¹ (x − μi) + ln p(ωi)
Discriminant Function for Normal Density
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σ⁻¹ (x − μi) + ln p(ωi)
= −½ (xᵗΣ⁻¹x − 2μiᵗΣ⁻¹x + μiᵗΣ⁻¹μi) + ln p(ωi)
Since xᵗΣ⁻¹x is also independent of the class,
gi(x) = wiᵗx + wi0, where
wi = Σ⁻¹μi
wi0 = −μiᵗΣ⁻¹μi / 2 + ln p(ωi)
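The shared-covariance discriminant can be sketched with small hand-rolled 2×2 helpers; all helper names and example values are assumptions. With Σ = I it reduces to the previous (σ²I) case.

```python
import math

def inv2(S):
    """Closed-form inverse of a 2x2 matrix given as nested tuples."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def matvec(M, v):
    return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])

def g_shared(x, mu, Sigma, prior):
    """Case Sigma_i = Sigma: g_i(x) = w_i^t x + w_i0 with w_i = Sigma^{-1} mu_i
    and w_i0 = -mu_i^t Sigma^{-1} mu_i / 2 + ln p(omega_i)."""
    w = matvec(inv2(Sigma), mu)
    w0 = -(mu[0] * w[0] + mu[1] * w[1]) / 2 + math.log(prior)
    return w[0] * x[0] + w[1] * x[1] + w0
```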
Discriminant Function for Normal Density
• Decision surface between 𝑖 and 𝑗 classes is obtained by letting 𝑔𝑖 𝑥 = 𝑔𝑗 𝑥
which yields
g(x) = gi(x) − gj(x) = wᵗx + w0 = 0, where
w = wi − wj = Σ⁻¹(μi − μj)
w0 = wi0 − wj0 = −(μiᵗΣ⁻¹μi − μjᵗΣ⁻¹μj) / 2 + ln p(ωi) − ln p(ωj)
• This linear classifier corresponds to a neuron model as
• 𝑖 −class region if 𝑔 𝑥 > 0
• 𝑗 −class region if 𝑔 𝑥 < 0
• decision boundary if 𝑔 𝑥 = 0
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, and each class has a different covariance matrix Σi.
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, and each class has a different covariance matrix Σi.
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln|Σi| + ln p(ωi)
• Since −(d/2) ln 2π is independent of the class,
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − ½ ln|Σi| + ln p(ωi)
Discriminant Function for Normal Density
• What is the discriminant function?
gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − ½ ln|Σi| + ln p(ωi)
= −½ (xᵗΣi⁻¹x − 2μiᵗΣi⁻¹x + μiᵗΣi⁻¹μi) − ½ ln|Σi| + ln p(ωi)
Since xᵗΣi⁻¹x is not independent of the class,
gi(x) = xᵗWi x + wiᵗx + wi0
where
Wi = −½ Σi⁻¹
wi = Σi⁻¹μi
wi0 = −μiᵗΣi⁻¹μi / 2 − ½ ln|Σi| + ln p(ωi)
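The quadratic discriminant can be sketched for a 2-D feature, with the quadratic form expanded by hand via the 2×2 inverse; the helper name and inputs are assumptions.

```python
import math

def g_quadratic(x, mu, Sigma, prior):
    """Case of class-specific Sigma_i:
    g_i(x) = -1/2 (x-mu_i)^t Sigma_i^{-1} (x-mu_i) - 1/2 ln|Sigma_i| + ln p(omega_i)."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # (x-mu)^t Sigma^{-1} (x-mu), using the closed-form 2x2 inverse
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * quad - 0.5 * math.log(det) + math.log(prior)
```

Because the quadratic term xᵗWi x differs per class, the decision boundaries between classes become hyperquadrics rather than hyperplanes.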
Decision Boundary for General Gaussian
• Decision boundaries are hyperquadratic for general case
Linear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥
Affine Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥 + 𝑏
Nonlinear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓 𝑥 = 𝑎𝑇𝑦
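The nonlinear mapping above reads as a one-hidden-layer network; a minimal sketch, assuming a logistic σ and illustrative weights (all names and values are assumptions).

```python
import math

def sigma(z):
    """Logistic nonlinearity (an assumed choice for the slide's sigma)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, b, a):
    """Compute f(x) = a^t sigma(W x + b) for list-based W, b, a, x."""
    y = [sigma(sum(wij * xj for wij, xj in zip(row, x)) + bi)
         for row, bi in zip(W, b)]
    return sum(ai * yi for ai, yi in zip(a, y))
```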
Error Probabilities and Integrals
• Consider the 2-class problem and suppose that the feature space is divided into 2 regions 𝑅1 and 𝑅2. There are 2 ways in which a classification error can occur.
• An observation 𝑥 falls in 𝑅2, and the true state is 𝜔1.
• An observation 𝑥 falls in 𝑅1, and the true state is 𝜔2.
• The error probability
P(error) = P(x ∈ R2|ω1) p(ω1) + P(x ∈ R1|ω2) p(ω2)
= ∫_{R2} p(x|ω1) p(ω1) dx + ∫_{R1} p(x|ω2) p(ω2) dx
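The error integral can be approximated numerically; the sketch below uses two hypothetical univariate Gaussian classes with equal priors, where the Bayes boundary x_B = 0 by symmetry (all parameter values are assumptions).

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Illustrative two-class setup: N(-1,1) vs N(1,1), equal priors;
# by symmetry the Bayes boundary is x_B = 0 (R1 = {x <= 0}, R2 = {x > 0}).
mu1, mu2, sig, prior = -1.0, 1.0, 1.0, 0.5

# P(error) = int_{R2} p(x|w1) p(w1) dx + int_{R1} p(x|w2) p(w2) dx,
# approximated by a Riemann sum over a wide grid.
dx = 0.001
xs = [-8.0 + i * dx for i in range(int(16 / dx))]
p_error = sum(prior * normal_pdf(x, mu1, sig) * dx for x in xs if x > 0) \
        + sum(prior * normal_pdf(x, mu2, sig) * dx for x in xs if x <= 0)
```

For this symmetric setup the exact Bayes error is Φ(−1) ≈ 0.159, and the Riemann sum lands close to it; moving the boundary away from 0 can only increase the error.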
Error Probabilities and Integrals
• If the boundary x* is chosen arbitrarily, the probability of error is not as small as it might be.
• xB is the Bayes optimal decision boundary and gives the lowest probability of error.
• Bayes classifier maximizes the correct probability.
P(correct) = Σ_{i=1}^{C} P(x ∈ Ri|ωi) p(ωi) = Σ_{i=1}^{C} ∫_{Ri} p(x|ωi) p(ωi) dx
Interim Summary
• Decision Region
• (Linear) Discriminant Function
• Decision Surface, Linear Machine
• Bayes Classifier
• Normal Density ND: Univariate Case
• MND: Multivariate Case1: Indep., Same Variance
• MND: Multivariate Case2: Same Covariance Matrix
• MND: Multivariate Case3: Different Covariance Matrix
• Error Probability
• Remaining issues: learning of PDFs, NN learning, Bayesian learning