(1)

Bayesian Decision Theory (I)

Jin Young Choi

Seoul National University

(2)

J. Y. Choi. SNU

Bayesian Decision Theory

• Bayes Formula

• Priori, Posteriori, Likelihood Probability

• Bayes Decision

• Risk Formulation

• Expected Loss, Conditional Risk, Total Risk

• Likelihood Ratio Test

• Decision Region

• Classifier for Bayes Decision

2

(3)

Bayesian Decision

• What do we learn from experience or observation?

• How can we decide the category of an object from observation?

• Is our decision always correct?

• We make a decision probabilistically

(4)

J. Y. Choi. SNU

Bayesian Decision

• Question:

• Two kinds of fish live in a lake: tuna and salmon.

• If you catch a fish, is it more likely to be tuna or salmon?

(5)

Bayesian Decision

• From experience, salmon has been caught 70% of the time and tuna 30% of the time.

• What is the next fish likely to be?

(6)

J. Y. Choi. SNU

Bayesian Decision

• If other types of fish are irrelevant:

$$p(\omega = \omega_1) + p(\omega = \omega_2) = 1,$$

where $\omega$ is a random variable, and $\omega_1$ and $\omega_2$ denote salmon and tuna.

• Probabilities reflect our prior knowledge obtained from past experience.

• Simple Decision Rule:

• Make a decision without seeing the fish.

• Decide $\omega_1$ if $p(\omega = \omega_1) > p(\omega = \omega_2)$; otherwise decide $\omega_2$.

(7)

Bayesian Decision

• In general, we will have some features and more information.

• Feature: lightness measurement = 𝑥

• Different fish yield different lightness readings ($x$ is a random variable).

(8)

J. Y. Choi. SNU

Bayesian Decision

• Define

• 𝑝(𝑥|𝜔𝑖)= Class Conditional Probability Density

• The difference between 𝑝(𝑥|𝜔1) and 𝑝(𝑥|𝜔2) describes the difference in lightness between tuna and salmon.

(9)

Bayesian Decision

• Hypothetical class-conditional probability

• Density functions are normalized (area under each curve is 1.0)

(10)

J. Y. Choi. SNU

Bayesian Decision

• Suppose that we know

• The prior probabilities $p(\omega_1)$ and $p(\omega_2)$

• The conditional densities $p(x|\omega_1)$ and $p(x|\omega_2)$

• Measure lightness of a fish = 𝑥

• What is the category of the fish with lightness of 𝑥 ?

• The probability that the fish belongs to category $\omega_i$ is $p(\omega_i|x)$.

(11)

Bayes formula

• $$p(\omega_i|x) = \frac{p(x|\omega_i)\, p(\omega_i)}{p(x)}, \quad \text{where } p(x) = \sum_j p(x|\omega_j)\, p(\omega_j) = \sum_j p(x, \omega_j).$$

• $\text{Posterior} = \dfrac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$

• $p(x|\omega_i)$ is called the likelihood of $\omega_i$ with respect to $x$.

• The category $\omega_i$ for which $p(x|\omega_i)$ is large is more "likely" to be the true category.

• $p(x)$ is the evidence:

• How frequently a pattern with feature value $x$ is observed.

• A scale factor that ensures the posterior probabilities sum to 1.
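To make the formula concrete, here is a minimal numerical sketch in Python; the likelihood values at the observed lightness $x$ are made up for illustration, and the 0.7/0.3 priors come from the fishing example above:

```python
import numpy as np

# Priors from the fishing example: p(salmon)=0.7, p(tuna)=0.3
prior = np.array([0.7, 0.3])
# Hypothetical class-conditional densities p(x|w_i) at one lightness value x (assumed values)
likelihood = np.array([0.5, 2.0])

evidence = np.sum(likelihood * prior)          # p(x) = sum_j p(x|w_j) p(w_j)
posterior = likelihood * prior / evidence      # Bayes formula

print(posterior, posterior.sum())              # posteriors sum to 1
```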

(12)

J. Y. Choi. SNU

Bayes formula

• Posterior probabilities for the particular priors 𝑝(𝜔1) = 2/3 and 𝑝(𝜔2) = 1/3. At every 𝑥 the posteriors sum to 1.

(13)

Bayes Decision Rule (Minimal probability error)

• Likelihood Decision:

• $\omega_1$ if $p(x|\omega_1) > p(x|\omega_2)$

• $\omega_2$ otherwise

• Posterior Decision:

• $\omega_1$ if $p(x|\omega_1)\, p(\omega_1) > p(x|\omega_2)\, p(\omega_2)$

• $\omega_2$ otherwise

• Decision Error Probability

• $p(\text{error}|x) = \min\big(p(\omega_1|x),\, p(\omega_2|x)\big)$, where the decision error is given by

$$p(\text{error}|x) = \begin{cases} p(\omega_2|x) & \text{if we decide } \omega_1 \\ p(\omega_1|x) & \text{if we decide } \omega_2 \end{cases}$$
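A small sketch of the posterior decision and its conditional error probability; the priors and likelihood values are assumed for illustration, not taken from the slides:

```python
import numpy as np

prior = np.array([0.7, 0.3])            # p(w1), p(w2)
likelihood = np.array([0.5, 2.0])       # assumed p(x|w1), p(x|w2) at the observed x

posterior = likelihood * prior / np.sum(likelihood * prior)

decision = np.argmax(posterior)                 # decide the class with the larger posterior
p_error_given_x = np.min(posterior)             # p(error|x) = min(p(w1|x), p(w2|x))
print(f"decide w{decision + 1}, p(error|x) = {p_error_given_x:.3f}")
```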

(14)

J. Y. Choi. SNU

Features: General Case

• Formalize the ideas just considered in 4 ways:

Allow more than one feature:

Replace the scalar 𝑥 by the feature vector 𝑥

The $d$-dimensional space, $x \in \mathbb{R}^d$, is called the feature space.

Allow more than 2 states of nature:

Generalize to several classes

Allow actions other than merely deciding the state of nature:

Possibility of rejection, refusing to make a decision in close cases

Introduce a general loss function: risk (loss) minimization.

(15)

Loss Function

A loss (or cost) function states exactly how costly each action is and is used to convert a probability into a decision. Loss functions let us treat situations in which some kinds of classification mistakes are more costly than others.

(16)

J. Y. Choi. SNU

Formulation

• Let 𝜔1, … , 𝜔𝑐 be the finite set of c states of nature ("categories").

• Let 𝛼1, … , 𝛼𝑎 be the finite set of a possible actions.

(Ex) Action $\alpha_i$ = deciding that the true state is $\omega_i$, or some other action (e.g., rejection).

• The loss function $\lambda(\alpha_i|\omega_j)$ = the loss incurred for taking action $\alpha_i$ when the state of nature is $\omega_j$.

• $x$ = a $d$-dimensional feature vector (a random variable)

• 𝑝(𝑥|𝜔𝑖) = likelihood probability density function for 𝑥 for given 𝜔𝑖

• 𝑝(𝜔𝑖) = prior probability that nature is in state 𝜔𝑖.

(17)

Expected Loss (Risk)

• Suppose that we observe a particular 𝑥 and that we take action 𝛼𝑖

• If the true state of nature is $\omega_j$, the loss is $\lambda(\alpha_i|\omega_j)$; the expected loss before any observation is

$$R(\alpha_i) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\, p(\omega_j)$$

(18)

J. Y. Choi. SNU

Conditional Risk (Loss)

• After the observation, the expected loss, now called the "conditional risk", is given by

$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\, p(\omega_j|x)$$

(19)

Total Risk

• Objective: Select the action that minimizes the conditional risk

• A general decision rule is a function $\alpha(x)$.

• For every $x$, the decision function $\alpha(x)$ assumes one of the values $\alpha_1, \dots, \alpha_a$, that is,

$$\alpha(x) = \arg\min_{\alpha_i} R(\alpha_i|x) = \arg\min_{\alpha_i} \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\, p(\omega_j|x)$$

• The "total risk" is

$$\int R(\alpha(x)|x)\, p(x)\, dx$$

• Learning:

$$\min_\theta \int R(\alpha_\theta(x)|x)\, p(x)\, dx, \quad \text{where } f_\theta(x) \approx \big(p(\omega_1|x), p(\omega_2|x), \dots, p(\omega_c|x)\big)$$

(20)

J. Y. Choi. SNU

Bayes Decision Rule:

• Compute the conditional risk

$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\, p(\omega_j|x) \quad \text{for } i = 1, \dots, a.$$

• Select the action $\alpha_i$ for which $R(\alpha_i|x)$ is minimum.

• The resulting minimum total risk is called the Bayes risk, denoted $R$, and is the best performance that can be achieved.
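A minimal sketch of this rule for an assumed 2-by-2 loss matrix and assumed posteriors (all values are illustrative only):

```python
import numpy as np

# loss[i, j] = lambda(alpha_i | w_j): loss for action alpha_i when the true state is w_j (assumed)
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0]])
posterior = np.array([0.6, 0.4])        # assumed p(w1|x), p(w2|x)

cond_risk = loss @ posterior            # R(alpha_i|x) = sum_j lambda(alpha_i|w_j) p(w_j|x)
best_action = np.argmin(cond_risk)      # Bayes decision: minimize the conditional risk
print(cond_risk, "-> take action", best_action + 1)
```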

(21)

Two-Category Classification

• Action 𝛼1 = deciding that the true state is 𝜔1

• Action 𝛼2 = deciding that the true state is 𝜔2

• Let $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$ be the loss incurred for deciding $\omega_i$ when the true state is $\omega_j$.

The conditional risks:

$$R(\alpha_1|x) = \lambda_{11}\, p(\omega_1|x) + \lambda_{12}\, p(\omega_2|x), \qquad R(\alpha_2|x) = \lambda_{21}\, p(\omega_1|x) + \lambda_{22}\, p(\omega_2|x)$$

• Decide $\omega_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$,

or if $(\lambda_{21} - \lambda_{11})\, p(\omega_1|x) > (\lambda_{12} - \lambda_{22})\, p(\omega_2|x)$,

or if $(\lambda_{21} - \lambda_{11})\, p(x|\omega_1)\, p(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(x|\omega_2)\, p(\omega_2)$; and decide $\omega_2$ otherwise.

(22)

J. Y. Choi. SNU

Two-Category Likelihood Ratio Test

• Under the reasonable assumption that $\lambda_{12} > \lambda_{22}$ and $\lambda_{21} > \lambda_{11}$ (why?), decide $\omega_1$ if

$$\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{(\lambda_{12} - \lambda_{22})\, p(\omega_2)}{(\lambda_{21} - \lambda_{11})\, p(\omega_1)} = T,$$

and $\omega_2$ otherwise.

• The ratio $\dfrac{p(x|\omega_1)}{p(x|\omega_2)}$ is called the likelihood ratio.

• We can decide $\omega_1$ if the likelihood ratio exceeds a threshold value $T$ that is independent of the observation $x$.
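The same two-category decision written as a likelihood ratio test; the loss matrix, priors, and likelihood values below are assumptions for illustration:

```python
import numpy as np

lam = np.array([[0.0, 2.0],    # lambda_11, lambda_12
                [1.0, 0.0]])   # lambda_21, lambda_22
prior = np.array([0.7, 0.3])
likelihood = np.array([0.5, 2.0])       # assumed p(x|w1), p(x|w2)

# Threshold T = (lambda_12 - lambda_22) p(w2) / ((lambda_21 - lambda_11) p(w1))
T = (lam[0, 1] - lam[1, 1]) * prior[1] / ((lam[1, 0] - lam[0, 0]) * prior[0])
ratio = likelihood[0] / likelihood[1]   # likelihood ratio p(x|w1) / p(x|w2)
print("decide", "w1" if ratio > T else "w2", f"(ratio={ratio:.2f}, T={T:.2f})")
```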

(23)

Minimum-Error-Rate Classification

• Recall the conditional risk

$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\, p(\omega_j|x)$$

• If action 𝛼𝑖 is taken for the true state 𝜔𝑗, then

the decision is correct if 𝑖 = 𝑗, and in error otherwise.

• To give an equal cost to all errors, we define the zero-one loss function as

$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0, & i = j \\ 1, & i \ne j \end{cases} \quad \text{for } i, j = 1, \dots, C$$

(24)

J. Y. Choi. SNU

Minimum-Error-Rate Classification cont.

• The conditional risk, which here equals the error rate, is

$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\, p(\omega_j|x) = \sum_{j \ne i} p(\omega_j|x) = 1 - p(\omega_i|x)$$

To minimize the average probability of error, we should select the $i$ that maximizes the posterior probability $p(\omega_i|x)$.

Decide $\omega_i$ if $p(\omega_i|x) > p(\omega_j|x)$ for all $j \ne i$ (same as the Bayes decision rule).

(25)

Decision Regions

• The likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ plotted vs. $x$

• The threshold $T_a$ for a given loss function:

$$\frac{p(x|\omega_1)}{p(x|\omega_2)} \;\gtrless\; \frac{(\lambda_{12} - \lambda_{22})\, p(\omega_2)}{(\lambda_{21} - \lambda_{11})\, p(\omega_1)} = T_a$$

[Figure: the likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ as a function of $x$; the decision region for $\omega_1$ is where the ratio exceeds the threshold $T_a$.]

(26)

J. Y. Choi. SNU

Classifiers; Discriminant Functions

• A classifier can be represented by a set of discriminant functions 𝑔𝑖 𝑥 ; 𝑖 = 1, … , 𝐶.

• The classifier assigns observation $x$ to class $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \ne i$.

26

Example discriminant functions: a Gaussian-shaped form $g_i(x) = \exp\!\big(-\tfrac{1}{2}\|x - \mu_i\|^2\big)$ and a linear form $g_i(x) = w^t x + b$.

(27)

The Bayes Classifier

• A Bayes classifier can be represented in this way

For the minimum error-rate case

𝑔𝑖 𝑥 = 𝑝(𝜔𝑖|𝑥)

For the general case with risks

$$g_i(x) = -R(\alpha_i|x) = -\sum_j \lambda_{ij}\, p(\omega_j|x)$$

• If we replace $g_i(x)$ by $f(g_i(x))$, where $f(\cdot)$ is a monotonically increasing function (e.g., log), the resulting classification is unchanged:

$$g_i(x) = p(x|\omega_i)\, p(\omega_i)$$

$$g_i(x) = \ln p(x|\omega_i) + \ln p(\omega_i)$$

(28)

J. Y. Choi. SNU

The Bayes Classifier

• The effect of any decision rule is to divide the feature space into C decision regions, 𝑅1, … , 𝑅𝐶.

• If $g_i(x) > g_j(x)$ for all $j \ne i$, then $x$ is in $R_i$, and $x$ is assigned to $\omega_i$.

• Decision regions are separated by decision boundaries.

• Decision boundaries are surfaces in the feature space.

28

[Figure: decision regions in a two-dimensional feature space (Feature 1: weight, Feature 2: height), with a sample point $x$ and boundaries separating the regions.]

(29)

Interim Summary

• Bayes Formula

• Priori probability

• Likelihood

• Evidence

• Posterior Probability

• Bayes Decision

• Risk Formulation

• Loss Function

• Expected Loss, Conditional Risk

• Total Risk

• Likelihood Ratio Test

• Zero-one Loss Function (Bayes Decision)

• Decision Region and Classifiers

(30)

J. Y. Choi. SNU

Bayesian Decision Theory (II)

Jin Young Choi

Seoul National University

1

(31)

Bayesian Decision Theory

• Bayes Formula

• Priori, Posteriori, Likelihood Probability

• Bayes Decision

• Risk Formulation

• Expected Loss, Conditional Risk, Total Risk

• Likelihood Ratio Test

• Decision Region

• Classifier for Bayes Decision

(32)

J. Y. Choi. SNU

Next Outline

• Bayes Classifier

• Normal Density

• ND: Univariate Case

• MND: Multivariate Case1: Indep., Same Variance

• MND: Multivariate Case2: Same Covariance Mtx.

• MND: Multivariate Case3: Different Covariance Mtx.

• Error Probability

3

(33)

Decision Regions

• The likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ plotted vs. $x$

• The threshold $T_a$ for a given loss function:

$$\frac{p(x|\omega_1)}{p(x|\omega_2)} \;\gtrless\; \frac{(\lambda_{12} - \lambda_{22})\, p(\omega_2)}{(\lambda_{21} - \lambda_{11})\, p(\omega_1)} = T_a$$

[Figure: the likelihood ratio $p(x|\omega_1)/p(x|\omega_2)$ as a function of $x$; the decision region for $\omega_1$ is where the ratio exceeds the threshold $T_a$.]

(34)

J. Y. Choi. SNU

Classifiers; Discriminant Functions

• A classifier can be represented by a set of discriminant functions 𝑔𝑖 𝑥 ; 𝑖 = 1, … , 𝐶.

• The classifier assigns observation $x$ to class $\omega_i$ if $g_i(x) > g_j(x)$ for all $j \ne i$.

5

Example discriminant functions: a Gaussian-shaped form $g_i(x) = \exp\!\big(-\tfrac{1}{2}\|x - \mu_i\|^2\big)$ and a linear form $g_i(x) = w^t x + b$.

(35)

The Bayes Classifier

• A Bayes classifier can be represented in this way

For the minimum error-rate case

𝑔𝑖 𝑥 = 𝑝(𝜔𝑖|𝑥)

For the general case with risks

$$g_i(x) = -R(\alpha_i|x) = -\sum_j \lambda_{ij}\, p(\omega_j|x)$$

• If we replace $g_i(x)$ by $f(g_i(x))$, where $f(\cdot)$ is a monotonically increasing function (e.g., log), the resulting classification is unchanged:

$$g_i(x) = p(x|\omega_i)\, p(\omega_i)$$

$$g_i(x) = \ln p(x|\omega_i) + \ln p(\omega_i)$$

(36)

J. Y. Choi. SNU

The Bayes Classifier

• The effect of any decision rule is to divide the feature space into C decision regions, 𝑅1, … , 𝑅𝐶.

• If $g_i(x) > g_j(x)$ for all $j \ne i$, then $x$ is in $R_i$, and $x$ is assigned to $\omega_i$.

• Decision regions are separated by decision boundaries.

• Decision boundaries are surfaces in the feature space.

7

[Figure: decision regions in a two-dimensional feature space (Feature 1: weight, Feature 2: height), with a sample point $x$ and boundaries separating the regions.]

(37)

The Decision Regions

• A two-dimensional, two-category classifier

(38)

J. Y. Choi. SNU

Two-Category Case

Use two discriminant functions $g_1(x)$ and $g_2(x)$, assigning $x$ to $\omega_1$ if $g_1(x) > g_2(x)$.

Alternative: define a single discriminant function

$$g(x) = g_1(x) - g_2(x),$$

and decide $\omega_1$ if $g(x) > 0$; otherwise decide $\omega_2$.

In the two-category case, two forms are frequently used:

$$g(x) = p(\omega_1|x) - p(\omega_2|x)$$

$$g(x) = \ln\frac{p(x|\omega_1)}{p(x|\omega_2)} + \ln\frac{p(\omega_1)}{p(\omega_2)}$$

9

(39)

Normal Density - Univariate Case

• Gaussian density with mean $\mu$ and standard deviation $\sigma$ ($\sigma^2$ is called the variance):

$$p(x) = \frac{1}{(2\pi)^{1/2}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right], \qquad p(x) \sim N(\mu, \sigma^2)$$

• It can be shown that:

$$\mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx$$

$$\sigma^2 = E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\, p(x)\, dx$$

(40)

J. Y. Choi. SNU

Normal Density - Multivariate Case

• The general multivariate normal density (MND) in $d$ dimensions is written as

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\left[-\frac{1}{2}(x-\mu)^t\,\Sigma^{-1}(x-\mu)\right]$$

$$\mu = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx, \qquad \Sigma = E\big[(x-\mu)(x-\mu)^t\big], \qquad \Sigma_{ij} = E\big[(x_i-\mu_i)(x_j-\mu_j)\big]$$

• The covariance matrix Σ is always symmetric and positive semidefinite.

11

(41)

Normal Density - Multivariate Case

• The MND is completely specified by $d + d(d+1)/2$ parameters. Samples drawn from an MND fall in a cluster whose center is determined by $\mu$ and whose shape is determined by $\Sigma$. The loci of points of constant density are hyperellipsoids:

$$r^2 = (x-\mu)^t\,\Sigma^{-1}(x-\mu) = \text{constant}$$

• The quantity $r$ is called the Mahalanobis distance from $x$ to $\mu$.

• The principal axes of the hyperellipsoid are given by the eigenvectors of Σ.

(42)

J. Y. Choi. SNU

Normal Density - Multivariate Case

• The minimum-error-rate classification can be achieved using the discriminant functions:

$$g_i(x) = p(x|\omega_i)\, p(\omega_i) \quad \text{or} \quad g_i(x) = \ln p(x|\omega_i) + \ln p(\omega_i)$$

• If $p(x|\omega_i) \sim N(\mu_i, \Sigma_i)$, then

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$$

since

$$p(x|\omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\exp\!\left[-\frac{1}{2}(x-\mu_i)^t\,\Sigma_i^{-1}(x-\mu_i)\right]$$

13
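A sketch of this log-discriminant in Python; the means, covariance matrices, and priors below are assumed example values, not taken from the slides:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln(2 pi) - 1/2 ln|Sigma| + ln p(w_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Assumed two-class example in 2-D
x = np.array([1.0, 0.5])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]
priors = [0.7, 0.3]

scores = [gaussian_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print("decide w", int(np.argmax(scores)) + 1)   # pick the class with the largest g_i(x)
```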

(43)

Discriminant Function for Normal Density

• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎2𝐈

• What is the discriminant function?

(44)

J. Y. Choi. SNU

Discriminant Function for Normal Density

• Assume the features are statistically independent, and each feature has the same variance, that is, Σ𝑖 = 𝜎2 𝐈

• What is the discriminant function?

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$$

$$= -\frac{\|x-\mu_i\|^2}{2\sigma^2} - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln \sigma^{2d} + \ln p(\omega_i)$$

Since $-\frac{d}{2}\ln 2\pi - \frac{1}{2}\ln \sigma^{2d}$ is independent of the class,

$$g_i(x) = -\frac{\|x-\mu_i\|^2}{2\sigma^2} + \ln p(\omega_i)$$

15

(45)

Discriminant Function for Normal Density

• What is the discriminant function?

$$g_i(x) = -\frac{\|x-\mu_i\|^2}{2\sigma^2} + \ln p(\omega_i) = -\frac{1}{2\sigma^2}\left(x^t x - 2\mu_i^t x + \mu_i^t\mu_i\right) + \ln p(\omega_i)$$

Since $x^t x$ is also independent of the class,

$$g_i(x) = w_i^t x + w_{i0}, \quad \text{where} \quad w_i = \frac{\mu_i}{\sigma^2}, \quad w_{i0} = -\frac{\mu_i^t\mu_i}{2\sigma^2} + \ln p(\omega_i)$$
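A sketch of this linear discriminant for the shared-variance case; the means, variance, and priors are assumed values for illustration:

```python
import numpy as np

sigma2 = 1.0                                        # shared variance sigma^2 (assumed)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]  # class means (assumed)
priors = [0.7, 0.3]

# w_i = mu_i / sigma^2,  w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln p(w_i)
ws  = [mu / sigma2 for mu in mus]
w0s = [-(mu @ mu) / (2 * sigma2) + np.log(p) for mu, p in zip(mus, priors)]

x = np.array([1.0, 0.5])
scores = [w @ x + w0 for w, w0 in zip(ws, w0s)]     # g_i(x) = w_i^t x + w_i0
print("decide w", int(np.argmax(scores)) + 1)
```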

(46)

J. Y. Choi. SNU

Discriminant Function for Normal Density

• The decision surface between classes $i$ and $j$ is obtained by setting $g_i(x) = g_j(x)$, which yields

$$g(x) = g_i(x) - g_j(x) = w^t x + w_0 = 0$$

where

$$w = w_i - w_j = \frac{\mu_i - \mu_j}{\sigma^2}, \qquad w_0 = w_{i0} - w_{j0} = -\frac{\mu_i^t\mu_i - \mu_j^t\mu_j}{2\sigma^2} + \ln p(\omega_i) - \ln p(\omega_j)$$

• This linear classifier corresponds to a neuron model as

• 𝑖 −class region if 𝑔 𝑥 > 0

• 𝑗 −class region if 𝑔 𝑥 < 0

• decision boundary if 𝑔 𝑥 = 0

17

(47)

• If $p(\omega_i)$ is equal to $p(\omega_j)$:

Discriminant Function for Normal Density

(48)

J. Y. Choi. SNU

Discriminant Function for Normal Density

• If $p(\omega_i)$ is not equal to $p(\omega_j)$:

19

(49)

Discriminant Function for Normal Density

• Assume the features are not statistically independent and each feature may have a different variance, but every class has the same covariance matrix, that is, $\Sigma_i = \Sigma$.

• What is the discriminant function?

(50)

J. Y. Choi. SNU

Discriminant Function for Normal Density

• Assume the features are not statistically independent and each feature may have a different variance, but every class has the same covariance matrix, that is, $\Sigma_i = \Sigma$.

• What is the discriminant function?

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma| + \ln p(\omega_i)$$

Since $-\frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma|$ is independent of the class,

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma^{-1}(x-\mu_i) + \ln p(\omega_i)$$

21

(51)

Discriminant Function for Normal Density

• What is the discriminant function?

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma^{-1}(x-\mu_i) + \ln p(\omega_i) = -\frac{1}{2}\left(x^t\Sigma^{-1}x - 2\mu_i^t\Sigma^{-1}x + \mu_i^t\Sigma^{-1}\mu_i\right) + \ln p(\omega_i)$$

Since $x^t\Sigma^{-1}x$ is also independent of the class,

$$g_i(x) = w_i^t x + w_{i0}, \quad \text{where} \quad w_i = \Sigma^{-1}\mu_i, \quad w_{i0} = -\frac{\mu_i^t\Sigma^{-1}\mu_i}{2} + \ln p(\omega_i)$$

(52)

J. Y. Choi. SNU

Discriminant Function for Normal Density

The decision surface between classes $i$ and $j$ is obtained by setting $g_i(x) = g_j(x)$, which yields

$$g(x) = g_i(x) - g_j(x) = w^t x + w_0 = 0$$

where

$$w = w_i - w_j = \Sigma^{-1}(\mu_i - \mu_j), \qquad w_0 = w_{i0} - w_{j0} = -\frac{\mu_i^t\Sigma^{-1}\mu_i - \mu_j^t\Sigma^{-1}\mu_j}{2} + \ln p(\omega_i) - \ln p(\omega_j)$$

• This linear classifier corresponds to a neuron model as

• 𝑖 −class region if 𝑔 𝑥 > 0

• 𝑗 −class region if 𝑔 𝑥 < 0

• decision boundary if 𝑔 𝑥 = 0

23
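A sketch of this shared-covariance (linear) boundary between classes $i$ and $j$; the covariance matrix, means, and priors are assumed values:

```python
import numpy as np

Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])        # shared covariance matrix (assumed)
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 1.0])
p_i, p_j = 0.7, 0.3

Sigma_inv = np.linalg.inv(Sigma)
w  = Sigma_inv @ (mu_i - mu_j)                    # w = Sigma^{-1} (mu_i - mu_j)
w0 = -0.5 * (mu_i @ Sigma_inv @ mu_i - mu_j @ Sigma_inv @ mu_j) + np.log(p_i) - np.log(p_j)

x = np.array([1.0, 0.5])
g = w @ x + w0                                    # g(x) > 0 -> class i, g(x) < 0 -> class j
print("class i" if g > 0 else "class j", g)
```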

(53)

Discriminant Function for Normal Density

• Assume the features are not statistically independent and each feature may have a different variance; moreover, each class has its own covariance matrix $\Sigma_i$.

• What is the discriminant function?

(54)

J. Y. Choi. SNU

Discriminant Function for Normal Density

• Assume the features are not statistically independent and each feature may have a different variance; moreover, each class has its own covariance matrix $\Sigma_i$.

• What is the discriminant function?

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$$

• Since $-\frac{d}{2}\ln 2\pi$ is independent of the class,

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$$

25

(55)

Discriminant Function for Normal Density

• What is the discriminant function?

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^t\,\Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$$

$$= -\frac{1}{2}\left(x^t\Sigma_i^{-1}x - 2\mu_i^t\Sigma_i^{-1}x + \mu_i^t\Sigma_i^{-1}\mu_i\right) - \frac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$$

Since $x^t\Sigma_i^{-1}x$ is not independent of the class,

$$g_i(x) = x^t W_i x + w_i^t x + w_{i0}$$

where

$$W_i = -\frac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i, \qquad w_{i0} = -\frac{\mu_i^t\Sigma_i^{-1}\mu_i}{2} - \frac{1}{2}\ln|\Sigma_i| + \ln p(\omega_i)$$
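A sketch of this quadratic discriminant with per-class covariance matrices; all parameter values below are assumed for illustration:

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for a class-specific covariance Sigma_i."""
    Sigma_inv = np.linalg.inv(Sigma)
    W  = -0.5 * Sigma_inv
    w  = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

x = np.array([1.0, 0.5])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]
priors = [0.7, 0.3]

scores = [quadratic_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print("decide w", int(np.argmax(scores)) + 1)
```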

(56)

J. Y. Choi. SNU

Decision Boundary for General Gaussian

• Decision boundaries are hyperquadratic in the general case.

27

(57)

Linear Mapping

$$T: X \to Y, \quad y = Wx$$

[Figure: a linear map from the input space $X$ to the output space $Y$.]

(58)

J. Y. Choi. SNU

Affine Mapping

$$T: X \to Y, \quad y = Wx + b$$

29

[Figure: an affine map from $X$ to $Y$; the bias $b$ is drawn as a weight from a constant input 1.]

(59)


Nonlinear Mapping

$$T: X \to Y, \quad y = \sigma(Wx + b), \quad f(x) = a^T y$$

[Figure: a nonlinear (one-hidden-layer) mapping: an input $x \in X$ is mapped to $y = (y_1, \dots, y_n) \in Y$ by $\sigma(Wx + b)$, and the scalar output is $f(x) = a^T y$.]
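A minimal sketch of this mapping as a one-hidden-layer network; the weight shapes are arbitrary placeholders and $\sigma$ is taken to be the logistic sigmoid, which is an assumption since the slide does not fix the nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))   # hidden weights (assumed shapes: 2-D input, 5 hidden units)
b = rng.standard_normal(5)        # hidden bias
a = rng.standard_normal(5)        # output weights

def f(x):
    y = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # y = sigma(Wx + b)
    return a @ y                              # f(x) = a^T y

print(f(np.array([1.0, 0.5])))
```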

(60)

J. Y. Choi. SNU

Error Probabilities and Integrals

• Consider the 2-class problem and suppose that the feature space is divided into 2 regions 𝑅1 and 𝑅2. There are 2 ways in which a classification error can occur.

• An observation 𝑥 falls in 𝑅2, and the true state is 𝜔1.

• An observation 𝑥 falls in 𝑅1, and the true state is 𝜔2.

• The error probability

𝑃 𝑒𝑟𝑟𝑜𝑟 = 𝑃 𝑥 ∈ 𝑅2 𝜔1 𝑝 𝜔1 + 𝑃 𝑥 ∈ 𝑅1 𝜔2 𝑝(𝜔2)

= ׬𝑅

2 𝑝 𝑥 𝜔1 𝑝 𝜔1 𝑑𝑥 + ׬𝑅

1 𝑝 𝑥 𝜔2 𝑝 𝜔2 𝑑𝑥

31
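A sketch that evaluates this error integral numerically for two assumed univariate Gaussian class densities (using scipy.stats for the Gaussian CDF); the parameters and candidate boundaries are illustrative only:

```python
import numpy as np
from scipy.stats import norm

p1, p2 = 0.7, 0.3                             # priors (assumed)
pdf1, pdf2 = norm(0.0, 1.0), norm(2.0, 1.0)   # assumed p(x|w1), p(x|w2)

def p_error(x_boundary):
    # R1 = (-inf, x_boundary], R2 = (x_boundary, inf)
    err1 = (1.0 - pdf1.cdf(x_boundary)) * p1  # x falls in R2 while the true state is w1
    err2 = pdf2.cdf(x_boundary) * p2          # x falls in R1 while the true state is w2
    return err1 + err2

for xb in [0.5, 1.0, 1.5]:
    print(xb, round(p_error(xb), 4))
```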

(61)

Error Probabilities and Integrals

• If the decision boundary is chosen arbitrarily, the probability of error is not as small as it might be.

• $x_B$ = the Bayes optimal decision boundary, which gives the lowest probability of error.

• The Bayes classifier maximizes the probability of a correct decision:

$$P(\text{correct}) = \sum_{i=1}^{C} P(x \in R_i|\omega_i)\, p(\omega_i) = \sum_{i=1}^{C} \int_{R_i} p(x|\omega_i)\, p(\omega_i)\, dx$$

(62)

J. Y. Choi. SNU

Interim Summary

• Decision Region

• (Linear) Discriminant Function

• Decision Surface, Linear Machine

• Bayes Classifier

• Normal Density ND: Univariate Case

• MND: Multivariate Case1: Indep., Same Variance

• MND: Multivariate Case2: Same Covariance Matrix

• MND: Multivariate Case3: Different Covariance Matrix

• Error Probability

• Remaining issues: learning of PDFs, NN learning, Bayesian learning

33
