Jin Young Choi
Seoul National University
Bayesian Decision Theory
• Bayes Formula
• Prior, Posterior, and Likelihood Probabilities
• Bayes Decision
• Risk Formulation
• Expected Loss, Conditional Risk, Total Risk
• Likelihood Ratio Test
• Decision Region
• Classifier for Bayes Decision
Next Outline
• Bayes Classifier
• Normal Density
• ND: Univariate Case
• MND: Multivariate Case 1: Independent Features, Same Variance
• MND: Multivariate Case 2: Same Covariance Matrix
• MND: Multivariate Case 3: Different Covariance Matrices
• Error Probability
Decision Regions
• The likelihood ratio 𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2) vs. 𝑥
• The threshold 𝑇𝑎 for a loss function
Decide 𝜔1, i.e., 𝑔1(𝑥) > 𝑔2(𝑥), if
𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2) > (𝜆12 − 𝜆22)𝑝(𝜔2) / [(𝜆21 − 𝜆11)𝑝(𝜔1)] = 𝑇𝑎,
and decide 𝜔2 otherwise.
[Figure: the likelihood ratio 𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2) plotted against 𝑥; the regions where it exceeds the threshold 𝑇𝑎 are decided as 𝜔1]
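A minimal numeric sketch of this test, assuming illustrative univariate Gaussian class-conditional densities; the loss matrix, priors, and Gaussian parameters below are not from the slides:

```python
import numpy as np
from scipy.stats import norm

# Loss matrix entries lambda_ij = loss for deciding omega_i when omega_j is true.
lam = np.array([[0.0, 2.0],   # lambda_11, lambda_12
                [1.0, 0.0]])  # lambda_21, lambda_22
p_w1, p_w2 = 0.6, 0.4         # priors p(omega_1), p(omega_2)

# Threshold T_a = (lambda_12 - lambda_22) p(omega_2) / ((lambda_21 - lambda_11) p(omega_1))
T_a = (lam[0, 1] - lam[1, 1]) * p_w2 / ((lam[1, 0] - lam[0, 0]) * p_w1)

def decide(x):
    # Likelihood ratio p(x|omega_1)/p(x|omega_2) with illustrative class densities.
    ratio = norm.pdf(x, loc=-1.0, scale=1.0) / norm.pdf(x, loc=1.0, scale=1.0)
    return 1 if ratio > T_a else 2

print(T_a, decide(0.0))
```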
Classifiers; Discriminant Functions
• A classifier can be represented by a set of discriminant functions 𝑔𝑖(𝑥), 𝑖 = 1, … , 𝐶.
• The classifier assigns observation 𝑥 to class 𝜔𝑖 if 𝑔𝑖(𝑥) > 𝑔𝑗(𝑥) for all 𝑗 ≠ 𝑖.
For example, a Gaussian-shaped discriminant 𝑔𝑖(𝑥) = exp(−(1/2)‖𝑥 − 𝜇𝑖‖²) or a linear discriminant 𝑔𝑖(𝑥) = 𝑤ᵗ𝑥 + 𝑏.
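As a sketch of the idea, the classifier below evaluates a small set of hand-picked discriminant functions (one Gaussian-shaped, one linear, both with illustrative parameters) and returns the argmax class:

```python
import numpy as np

# A classifier as a set of discriminant functions g_i(x): pick the class
# whose g_i(x) is largest.
def g1(x):
    mu1 = np.array([0.0, 0.0])                 # illustrative mean
    return np.exp(-0.5 * np.sum((x - mu1) ** 2))

def g2(x):
    w, b = np.array([0.3, -0.2]), 0.1          # illustrative linear weights
    return w @ x + b

def classify(x, discriminants=(g1, g2)):
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores)) + 1          # class index of omega_i

print(classify(np.array([0.5, 0.5])))
```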
The Bayes Classifier
• A Bayes classifier can be represented in this way
• For the minimum error-rate case
𝑔𝑖(𝑥) = 𝑝(𝜔𝑖|𝑥)
• For the general case with risks
𝑔𝑖(𝑥) = −𝑅(𝛼𝑖|𝑥) = −∑𝑗 𝜆𝑖𝑗 𝑝(𝜔𝑗|𝑥)
• If we replace 𝑔𝑖(𝑥) by 𝑓(𝑔𝑖(𝑥)), where 𝑓(⋅) is a monotonically increasing function (e.g., log), the resulting classification is unchanged. Two equivalent forms are:
𝑔𝑖(𝑥) = 𝑝(𝑥|𝜔𝑖)𝑝(𝜔𝑖)
𝑔𝑖(𝑥) = ln 𝑝(𝑥|𝜔𝑖) + ln 𝑝(𝜔𝑖)
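A short sketch of the risk-based form, using an illustrative loss matrix and an illustrative posterior vector; the action minimizing conditional risk is the one maximizing 𝑔𝑖(𝑥):

```python
import numpy as np

# g_i(x) = -R(alpha_i|x) = -sum_j lambda_ij p(omega_j|x): choose the action
# with the smallest conditional risk, i.e., the largest g_i(x).
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])      # illustrative lambda_ij
post = np.array([0.7, 0.3])       # illustrative p(omega_j|x) at some observed x

g = -(lam @ post)                 # g_i(x) for each action alpha_i
decision = np.argmax(g) + 1
print(g, decision)                # picks omega_1 here
```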
The Bayes Classifier
• The effect of any decision rule is to divide the feature space into C decision regions, 𝑅1, … , 𝑅𝐶.
• If 𝑔𝑖(𝑥) > 𝑔𝑗(𝑥) for all 𝑗 ≠ 𝑖, then 𝑥 is in 𝑅𝑖 and is assigned to 𝜔𝑖.
• Decision regions are separated by decision boundaries.
• Decision boundaries are surfaces in the feature space.
The Decision Regions
• Two-dimensional two-category classifier
[Figure: decision regions in a two-dimensional feature space (Feature 1: Weight, Feature 2: Height) with an observation 𝑥]
Two-Category Case
• Use two discriminant functions 𝑔1(𝑥) and 𝑔2(𝑥), and assign 𝑥 to 𝜔1 if 𝑔1(𝑥) > 𝑔2(𝑥).
• Alternative: define a single discriminant function
• 𝑔(𝑥) = 𝑔1(𝑥) − 𝑔2(𝑥),
• decide 𝜔1 if 𝑔(𝑥) > 0, otherwise decide 𝜔2.
• In the two-category case, two forms are frequently used:
• 𝑔(𝑥) = 𝑝(𝜔1|𝑥) − 𝑝(𝜔2|𝑥)
• 𝑔(𝑥) = ln[𝑝(𝑥|𝜔1)/𝑝(𝑥|𝜔2)] + ln[𝑝(𝜔1)/𝑝(𝜔2)]
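A minimal sketch of the second dichotomizer form, assuming illustrative univariate Gaussian class-conditionals and equal priors:

```python
import numpy as np
from scipy.stats import norm

# g(x) = ln[p(x|w1)/p(x|w2)] + ln[p(w1)/p(w2)]; decide omega_1 when g(x) > 0.
p1, p2 = 0.5, 0.5                             # illustrative priors

def g(x):
    return (norm.logpdf(x, loc=-1.0, scale=1.0)    # illustrative p(x|omega_1)
            - norm.logpdf(x, loc=1.0, scale=1.5)   # illustrative p(x|omega_2)
            + np.log(p1 / p2))

x = 0.2
print("omega_1" if g(x) > 0 else "omega_2")
```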
Normal Density - Univariate Case
• Gaussian density with mean 𝜇 and standard deviation 𝜎 (𝜎² is called the variance)
𝑝(𝑥) = (1/((2𝜋)^{1/2}𝜎)) exp(−(1/2)((𝑥 − 𝜇)/𝜎)²),  i.e., 𝑝(𝑥) ∼ 𝑁(𝜇, 𝜎²)
• It can be shown that:
𝜇 = 𝐸[𝑥] = ∫_{−∞}^{∞} 𝑥 𝑝(𝑥) 𝑑𝑥
𝜎² = 𝐸[(𝑥 − 𝜇)²] = ∫_{−∞}^{∞} (𝑥 − 𝜇)² 𝑝(𝑥) 𝑑𝑥
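A quick numerical check of these identities for illustrative values of 𝜇 and 𝜎, approximating the integrals with a Riemann sum:

```python
import numpy as np

# Evaluate the univariate Gaussian density and verify E[x] and E[(x-mu)^2]
# numerically (mu and sigma are illustrative).
mu, sigma = 1.0, 2.0

def p(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
mean = np.sum(x * p(x)) * dx                  # ~ 1.0
var = np.sum((x - mean) ** 2 * p(x)) * dx     # ~ 4.0
print(mean, var)
```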
Normal Density - Multivariate Case
• The general multivariate normal density (MND) in 𝑑 dimensions is written as
𝑝(𝑥) = (1/((2𝜋)^{𝑑/2} |Σ|^{1/2})) exp(−(1/2)(𝑥 − 𝜇)ᵗ Σ⁻¹ (𝑥 − 𝜇))
𝜇 = 𝐸[𝑥] = ∫ 𝑥 𝑝(𝑥) 𝑑𝑥
Σ = 𝐸[(𝑥 − 𝜇)(𝑥 − 𝜇)ᵗ],  Σ_{𝑖𝑗} = 𝐸[(𝑥𝑖 − 𝜇𝑖)(𝑥𝑗 − 𝜇𝑗)]
• The covariance matrix Σ is always symmetric and positive semidefinite.
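A direct evaluation of this density for an illustrative 𝜇 and Σ, computed straight from the formula rather than a library density function:

```python
import numpy as np

# d-dimensional normal density evaluated from the formula above.
mu = np.array([0.0, 1.0])                     # illustrative mean
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                # illustrative covariance

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```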
Normal Density - Multivariate Case
• The multivariate normal density (MND) is completely specified by 𝑑 + 𝑑(𝑑 + 1)/2 parameters. Samples drawn from an MND fall in a cluster whose center is determined by 𝜇 and whose shape is determined by Σ. The loci of points of constant density are hyperellipsoids
𝑟² = (𝑥 − 𝜇)ᵗ Σ⁻¹ (𝑥 − 𝜇) = constant
• The quantity 𝑟 is called the Mahalanobis distance from 𝑥 to 𝜇.
• The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.
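A short sketch computing 𝑟 and the principal axes for illustrative values; the eigendecomposition of Σ gives the axis directions:

```python
import numpy as np

# Mahalanobis distance r and the principal axes of the constant-density
# hyperellipsoid (all values illustrative).
mu = np.array([0.0, 0.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
x = np.array([1.0, -1.0])

diff = x - mu
r = np.sqrt(diff @ np.linalg.solve(Sigma, diff))   # Mahalanobis distance

eigvals, eigvecs = np.linalg.eigh(Sigma)           # axis lengths scale with sqrt(eigvals)
print(r)
print(eigvecs)   # columns are the principal axes of the ellipsoid
```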
Normal Density - Multivariate Case
• The minimum-error-rate classification can be achieved using the discriminant functions:
𝑔𝑖(𝑥) = 𝑝(𝑥|𝜔𝑖)𝑝(𝜔𝑖), or
𝑔𝑖(𝑥) = ln 𝑝(𝑥|𝜔𝑖) + ln 𝑝(𝜔𝑖)
• If 𝑝(𝑥|𝜔𝑖) ∼ 𝑁(𝜇𝑖, Σ𝑖), then
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
since
𝑝(𝑥|𝜔𝑖) = (1/((2𝜋)^{𝑑/2} |Σ𝑖|^{1/2})) exp(−(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖))
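A straightforward implementation sketch of this 𝑔𝑖(𝑥), with illustrative Gaussian parameters and priors:

```python
import numpy as np

# g_i(x) = -1/2 (x-mu_i)^t Sigma_i^{-1} (x-mu_i) - d/2 ln(2 pi)
#          - 1/2 ln|Sigma_i| + ln p(omega_i)
def g(x, mu, Sigma, prior):
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

x = np.array([0.2, 0.8])
g1 = g(x, np.array([0.0, 0.0]), np.eye(2), 0.5)   # illustrative class 1
g2 = g(x, np.array([1.0, 1.0]), np.eye(2), 0.5)   # illustrative class 2
print(1 if g1 > g2 else 2)
```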
Discriminant Function for Normal Density
• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎²𝐈
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎²𝐈
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
= −‖𝑥 − 𝜇𝑖‖²/(2𝜎²) − (𝑑/2) ln 2𝜋 − (1/2) ln 𝜎^{2𝑑} + ln 𝑝(𝜔𝑖)
Since −(𝑑/2) ln 2𝜋 − (1/2) ln 𝜎^{2𝑑} is independent of the class,
𝑔𝑖(𝑥) = −‖𝑥 − 𝜇𝑖‖²/(2𝜎²) + ln 𝑝(𝜔𝑖)
Discriminant Function for Normal Density
• What is the discriminant function?
𝑔𝑖(𝑥) = −‖𝑥 − 𝜇𝑖‖²/(2𝜎²) + ln 𝑝(𝜔𝑖)
= −(1/(2𝜎²))(𝑥ᵗ𝑥 − 2𝜇𝑖ᵗ𝑥 + 𝜇𝑖ᵗ𝜇𝑖) + ln 𝑝(𝜔𝑖)
Since 𝑥ᵗ𝑥 is also independent of the class,
𝑔𝑖(𝑥) = 𝑤𝑖ᵗ𝑥 + 𝑤𝑖0, where
𝑤𝑖 = 𝜇𝑖/𝜎²,  𝑤𝑖0 = −𝜇𝑖ᵗ𝜇𝑖/(2𝜎²) + ln 𝑝(𝜔𝑖)
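A sketch of this Case 1 linear discriminant, with illustrative means, shared variance, and priors:

```python
import numpy as np

# Case 1 (Sigma_i = sigma^2 I): g_i(x) = w_i^t x + w_i0 with
# w_i = mu_i / sigma^2 and w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln p(omega_i).
sigma2 = 1.5                                            # illustrative shared variance
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]      # illustrative means
priors = [0.5, 0.5]

ws = [mu / sigma2 for mu in mus]
w0s = [-(mu @ mu) / (2 * sigma2) + np.log(p) for mu, p in zip(mus, priors)]

x = np.array([1.2, 0.4])
scores = [w @ x + w0 for w, w0 in zip(ws, w0s)]
print(np.argmax(scores) + 1)
```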
Discriminant Function for Normal Density
• The decision surface between classes 𝑖 and 𝑗 is obtained by letting 𝑔𝑖(𝑥) = 𝑔𝑗(𝑥), which yields
𝑔(𝑥) = 𝑔𝑖(𝑥) − 𝑔𝑗(𝑥) = 𝑤ᵗ𝑥 + 𝑤0 = 0, where
𝑤 = 𝑤𝑖 − 𝑤𝑗 = (𝜇𝑖 − 𝜇𝑗)/𝜎²
𝑤0 = 𝑤𝑖0 − 𝑤𝑗0 = −(𝜇𝑖ᵗ𝜇𝑖 − 𝜇𝑗ᵗ𝜇𝑗)/(2𝜎²) + ln 𝑝(𝜔𝑖) − ln 𝑝(𝜔𝑗)
• This linear classifier corresponds to a neuron model:
• 𝑖-class region if 𝑔(𝑥) > 0
• 𝑗-class region if 𝑔(𝑥) < 0
• decision boundary if 𝑔(𝑥) = 0
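As a sketch, the boundary weights above plug directly into a sign-threshold neuron (all parameters illustrative):

```python
import numpy as np

# Two-class boundary g(x) = w^t x + w0 = 0 viewed as a neuron with a
# sign activation: positive side -> omega_i, negative side -> omega_j.
sigma2 = 1.0
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 2.0])
p_i, p_j = 0.6, 0.4

w = (mu_i - mu_j) / sigma2
w0 = -(mu_i @ mu_i - mu_j @ mu_j) / (2 * sigma2) + np.log(p_i) - np.log(p_j)

def neuron(x):
    return "omega_i" if w @ x + w0 > 0 else "omega_j"

print(neuron(np.array([0.9, 1.1])))
```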
Discriminant Function for Normal Density
• If 𝑝(𝜔𝑖) is equal to 𝑝(𝜔𝑗), the decision boundary passes through the midpoint of 𝜇𝑖 and 𝜇𝑗.
Discriminant Function for Normal Density
• If 𝑝(𝜔𝑖) is not equal to 𝑝(𝜔𝑗), the decision boundary shifts away from the more probable class.
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, but every class has the same covariance matrix, that is, Σ𝑖 = Σ
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, but every class has the same covariance matrix, that is, Σ𝑖 = Σ
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − (1/2) ln |Σ| + ln 𝑝(𝜔𝑖)
Since −(𝑑/2) ln 2𝜋 − (1/2) ln |Σ| is independent of the class,
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ⁻¹ (𝑥 − 𝜇𝑖) + ln 𝑝(𝜔𝑖)
Discriminant Function for Normal Density
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ⁻¹ (𝑥 − 𝜇𝑖) + ln 𝑝(𝜔𝑖)
= −(1/2)(𝑥ᵗΣ⁻¹𝑥 − 2𝜇𝑖ᵗΣ⁻¹𝑥 + 𝜇𝑖ᵗΣ⁻¹𝜇𝑖) + ln 𝑝(𝜔𝑖)
Since 𝑥ᵗΣ⁻¹𝑥 is also independent of the class,
𝑔𝑖(𝑥) = 𝑤𝑖ᵗ𝑥 + 𝑤𝑖0, where
𝑤𝑖 = Σ⁻¹𝜇𝑖,  𝑤𝑖0 = −(1/2)𝜇𝑖ᵗΣ⁻¹𝜇𝑖 + ln 𝑝(𝜔𝑖)
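A sketch of this Case 2 linear discriminant with an illustrative shared covariance matrix:

```python
import numpy as np

# Case 2 (shared Sigma): g_i(x) = w_i^t x + w_i0 with w_i = Sigma^{-1} mu_i
# and w_i0 = -1/2 mu_i^t Sigma^{-1} mu_i + ln p(omega_i).
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                          # illustrative shared covariance
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]      # illustrative means
priors = [0.5, 0.5]

Sigma_inv = np.linalg.inv(Sigma)
ws = [Sigma_inv @ mu for mu in mus]
w0s = [-0.5 * mu @ Sigma_inv @ mu + np.log(p) for mu, p in zip(mus, priors)]

x = np.array([1.0, 0.5])
scores = [w @ x + w0 for w, w0 in zip(ws, w0s)]
print(np.argmax(scores) + 1)
```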
Discriminant Function for Normal Density
• The decision surface between classes 𝑖 and 𝑗 is obtained by letting 𝑔𝑖(𝑥) = 𝑔𝑗(𝑥), which yields
𝑔(𝑥) = 𝑔𝑖(𝑥) − 𝑔𝑗(𝑥) = 𝑤ᵗ𝑥 + 𝑤0 = 0, where
𝑤 = 𝑤𝑖 − 𝑤𝑗 = Σ⁻¹(𝜇𝑖 − 𝜇𝑗)
𝑤0 = 𝑤𝑖0 − 𝑤𝑗0 = −(1/2)(𝜇𝑖ᵗΣ⁻¹𝜇𝑖 − 𝜇𝑗ᵗΣ⁻¹𝜇𝑗) + ln 𝑝(𝜔𝑖) − ln 𝑝(𝜔𝑗)
• This linear classifier corresponds to a neuron model:
• 𝑖-class region if 𝑔(𝑥) > 0
• 𝑗-class region if 𝑔(𝑥) < 0
• decision boundary if 𝑔(𝑥) = 0
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, and every class has a different covariance matrix Σ𝑖
• What is the discriminant function?
Discriminant Function for Normal Density
• Assume the features are not statistically independent and may have different variances, and every class has a different covariance matrix Σ𝑖
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (𝑑/2) ln 2𝜋 − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
• Since −(𝑑/2) ln 2𝜋 is independent of the class,
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
Discriminant Function for Normal Density
• What is the discriminant function?
𝑔𝑖(𝑥) = −(1/2)(𝑥 − 𝜇𝑖)ᵗ Σ𝑖⁻¹ (𝑥 − 𝜇𝑖) − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
= −(1/2)(𝑥ᵗΣ𝑖⁻¹𝑥 − 2𝜇𝑖ᵗΣ𝑖⁻¹𝑥 + 𝜇𝑖ᵗΣ𝑖⁻¹𝜇𝑖) − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
Since 𝑥ᵗΣ𝑖⁻¹𝑥 is not independent of the class,
𝑔𝑖(𝑥) = 𝑥ᵗ𝑊𝑖𝑥 + 𝑤𝑖ᵗ𝑥 + 𝑤𝑖0, where
𝑊𝑖 = −(1/2)Σ𝑖⁻¹,  𝑤𝑖 = Σ𝑖⁻¹𝜇𝑖,  𝑤𝑖0 = −(1/2)𝜇𝑖ᵗΣ𝑖⁻¹𝜇𝑖 − (1/2) ln |Σ𝑖| + ln 𝑝(𝜔𝑖)
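A sketch of this Case 3 quadratic discriminant; the two classes below use illustrative, unequal covariance matrices:

```python
import numpy as np

# Case 3 (class-specific Sigma_i): g_i(x) = x^t W_i x + w_i^t x + w_i0 with
# W_i = -1/2 Sigma_i^{-1}, w_i = Sigma_i^{-1} mu_i,
# w_i0 = -1/2 mu_i^t Sigma_i^{-1} mu_i - 1/2 ln|Sigma_i| + ln p(omega_i).
def quadratic_g(x, mu, Sigma, prior):
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

x = np.array([0.5, 0.5])
g1 = quadratic_g(x, np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5)
g2 = quadratic_g(x, np.array([1.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.5]]), 0.5)
print(1 if g1 > g2 else 2)
```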
Decision Boundary for General Gaussian
• Decision boundaries are hyperquadrics in the general case
Linear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥
Affine Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝑊𝑥 + 𝑏
[Figure: affine mapping from 𝑋 to 𝑌, with a constant input 1 weighted by the bias 𝑏]
Nonlinear Mapping
𝑇: 𝑋 → 𝑌, 𝑦 = 𝜎 𝑊𝑥 + 𝑏 , 𝑓 𝑥 = 𝑎𝑇𝑦
[Figure: nonlinear mapping of 𝑥 ∈ 𝑋 to 𝑦 = (𝑦1, …, 𝑦𝑛) ∈ 𝑌, followed by the linear readout 𝑓(𝑥) = 𝑎ᵗ𝑦, with a constant input 1 weighted by the bias 𝑏]
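A minimal sketch of such a nonlinear mapping followed by a linear readout, with randomly drawn illustrative parameters:

```python
import numpy as np

# y = sigma(Wx + b) followed by f(x) = a^t y: a one-hidden-layer neuron model.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))   # maps x in R^3 to pre-activations in R^5
b = rng.normal(size=5)
a = rng.normal(size=5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    y = sigmoid(W @ x + b)    # nonlinear mapping T: X -> Y
    return a @ y              # linear functional on the mapped features

print(f(np.array([0.1, -0.2, 0.3])))
```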
Error Probabilities and Integrals
• Consider the two-class problem and suppose that the feature space is divided into two regions 𝑅1 and 𝑅2. There are two ways in which a classification error can occur.
• An observation 𝑥 falls in 𝑅2, and the true state is 𝜔1.
• An observation 𝑥 falls in 𝑅1, and the true state is 𝜔2.
• The error probability
𝑃(𝑒𝑟𝑟𝑜𝑟) = 𝑃(𝑥 ∈ 𝑅2|𝜔1)𝑝(𝜔1) + 𝑃(𝑥 ∈ 𝑅1|𝜔2)𝑝(𝜔2)
= ∫_{𝑅2} 𝑝(𝑥|𝜔1)𝑝(𝜔1) 𝑑𝑥 + ∫_{𝑅1} 𝑝(𝑥|𝜔2)𝑝(𝜔2) 𝑑𝑥
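A sketch that evaluates this error probability for a one-dimensional split 𝑅1 = (−∞, 𝑡), 𝑅2 = [𝑡, ∞), assuming illustrative Gaussian class-conditionals and priors:

```python
from scipy.stats import norm

# P(error) = P(x in R2 | w1) p(w1) + P(x in R1 | w2) p(w2) for threshold t.
p1, p2 = 0.5, 0.5
c1 = norm(loc=-1.0, scale=1.0)   # illustrative p(x|omega_1)
c2 = norm(loc=1.0, scale=1.0)    # illustrative p(x|omega_2)

def p_error(t):
    # tail of class 1 above t, plus mass of class 2 below t
    return c1.sf(t) * p1 + c2.cdf(t) * p2

print(p_error(0.3))
```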
Error Probabilities and Integrals
• If the decision boundary 𝑥∗ is chosen arbitrarily, the probability of error is not as small as it might be.
• 𝑥𝐵 is the Bayes optimal decision boundary and gives the lowest probability of error.
• The Bayes classifier maximizes the probability of correct classification:
𝑃(𝑐𝑜𝑟𝑟𝑒𝑐𝑡) = ∑_{𝑖=1}^{𝐶} 𝑃(𝑥 ∈ 𝑅𝑖|𝜔𝑖)𝑝(𝜔𝑖) = ∑_{𝑖=1}^{𝐶} ∫_{𝑅𝑖} 𝑝(𝑥|𝜔𝑖)𝑝(𝜔𝑖) 𝑑𝑥
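Continuing the illustrative setup above, sweeping the threshold confirms that the minimum of 𝑃(𝑒𝑟𝑟𝑜𝑟) occurs at the Bayes boundary 𝑥𝐵 (here 0 by symmetry):

```python
import numpy as np
from scipy.stats import norm

# Sweep the threshold x* and locate the minimum of P(error): it sits at the
# Bayes boundary x_B, where p(x|w1)p(w1) = p(x|w2)p(w2).
p1, p2 = 0.5, 0.5
c1 = norm(loc=-1.0, scale=1.0)   # illustrative p(x|omega_1)
c2 = norm(loc=1.0, scale=1.0)    # illustrative p(x|omega_2)

ts = np.linspace(-3, 3, 601)
errors = c1.sf(ts) * p1 + c2.cdf(ts) * p2
x_B = ts[np.argmin(errors)]
print(x_B, errors.min())         # ~ 0.0 for this symmetric setup
```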