Hanyang University
Quest Lab.
Chapter 12
Discriminant Analysis
Fall 2014, Department of IME, Hanyang University
• Classical statistical method for classification and profiling
• Model-based approach to classification
• Classification is based on the “statistical” distance from each class average
• Classification functions are used to produce classification scores
• Common uses: classifying species of microorganisms, loans, credit cards, insurance, university admission decisions, fingerprint analysis, etc.
12.1 Introduction
Example 1: Riding mowers (from Chapter 7)
12 owners and 12 nonowners
Example 1: Riding mowers (from Chapter 9)
The line seems to do a good job, with 4 misclassifications. Can we do better?
12.2 Distance of an observation from a class
General idea:
• To classify a new record, measure its distance from the center of each class
• Then, classify the record to the closest class
Step 1: Measuring Distance
• Euclidean distance between an item and mean of a class
→ Distances change with the units of each predictor (dollars vs. thousands of dollars, etc.)
→ It ignores the variability of each predictor (e.g., even if a record is closer to B than to A, classifying it to A may be more reasonable when B's variability is smaller)
→ Euclidean distance also ignores the correlations between predictors
→ Remedy: "statistical distance" (Mahalanobis distance)
\[ D_{\text{Euclidean}}(\mathbf{x}, \bar{\mathbf{x}}) = \sqrt{(x_1-\bar{x}_1)^2 + (x_2-\bar{x}_2)^2 + \cdots + (x_p-\bar{x}_p)^2} \]
Statistical distance (Mahalanobis distance):

\[ D^2_{\text{Statistical}}(\mathbf{x}, \bar{\mathbf{x}}) = (\mathbf{x} - \bar{\mathbf{x}})^T S^{-1} (\mathbf{x} - \bar{\mathbf{x}}) = \left[ (x_1-\bar{x}_1), (x_2-\bar{x}_2), \ldots, (x_p-\bar{x}_p) \right] S^{-1} \begin{bmatrix} x_1-\bar{x}_1 \\ x_2-\bar{x}_2 \\ \vdots \\ x_p-\bar{x}_p \end{bmatrix} \]

where S is the covariance matrix.
If p = 1 (single predictor), this reduces to the z-score.
If S = I (all predictors are uncorrelated with each other), then
\[ D_{\text{Statistical}}(\mathbf{x}, \bar{\mathbf{x}}) = D_{\text{Euclidean}}(\mathbf{x}, \bar{\mathbf{x}}). \]
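As a minimal sketch of the statistical distance above, in Python with NumPy (the class mean `xbar` and covariance `S` below are illustrative, not from the book):

```python
import numpy as np

def mahalanobis_sq(x, xbar, S):
    """Squared statistical distance: (x - xbar)^T S^{-1} (x - xbar)."""
    d = np.asarray(x, float) - np.asarray(xbar, float)
    return float(d @ np.linalg.inv(S) @ d)

x = np.array([60.0, 18.4])               # income ($K), lot size, as in Record #1
xbar = np.array([70.0, 20.0])            # illustrative class mean
S = np.array([[30.0, 5.0], [5.0, 2.0]])  # illustrative covariance matrix

d2_stat = mahalanobis_sq(x, xbar, S)
# With S = I, the statistical distance reduces to the Euclidean distance:
d2_eucl = float(np.sum((x - xbar) ** 2))
print(d2_stat, np.isclose(mahalanobis_sq(x, xbar, np.eye(2)), d2_eucl))
```

The `S = I` check mirrors the special case stated above.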
12.3 Fisher’s linear classification functions
Step 2: Classification Functions
Idea: find a linear function of the measurements that maximizes the ratio of between-class separation to within-class variability:
• The larger the numerator, the more clearly the classes are separated; the smaller the denominator, the more homogeneous each class is
• For each observation, compute a score with each class's function and assign the observation to the class with the largest score
• One classification function is generated per class
12.3 Fisher’s linear classification functions
Record #1: income = $60K, lot size = 18.4K
Owner score = -73.16 + (0.43)(60) + (5.47)(18.4) = 53.20
Non-owner score = -51.42 + (0.33)(60) + (4.68)(18.4) = 54.48
“Non-owner” score is higher → classify as non-owner
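The score computation above can be reproduced directly (coefficients as quoted; the small differences from the slide's rounded 53.2/54.48 come from rounding in the printed coefficients):

```python
# Classification scores for Record #1 (income = $60K, lot size = 18.4K),
# using the classification-function coefficients quoted above.
income, lot_size = 60.0, 18.4
owner_score = -73.16 + 0.43 * income + 5.47 * lot_size
nonowner_score = -51.42 + 0.33 * income + 4.68 * lot_size
label = "owner" if owner_score > nonowner_score else "non-owner"
print(round(owner_score, 2), round(nonowner_score, 2), label)
# prints 53.29 54.49 non-owner
```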
How do we find these functions?
12.3 Fisher’s linear classification functions
Step 3 (alternative way to classify): Converting to Probabilities
• It is possible to convert classification scores to probabilities of belonging to a class:
\[ P(\text{record belongs to class } k) = \frac{e^{c_k}}{e^{c_1} + e^{c_2} + \cdots + e^{c_K}} \]
where \( c_j \) is the record's classification score for class j
• The probability is then compared to the cutoff value in order to classify a record
• Advantage: a lift chart can be produced without further comparison
12.3 Fisher’s linear classification functions
\[ \frac{e^{53.20}}{e^{53.20} + e^{54.48}} = 0.218 \]
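The conversion can be checked numerically, using the two scores from the worked example:

```python
import math

# Convert the two classification scores into a membership probability
# using the exponential-ratio formula above.
scores = {"owner": 53.20, "nonowner": 54.48}
denom = sum(math.exp(s) for s in scores.values())
p_owner = math.exp(scores["owner"]) / denom
print(p_owner)  # ~0.218, matching the slide
```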
12.3 Fisher’s linear classification functions
The reasoning behind DA
Linear/Quadratic Discriminant Analysis
11/2/2014 15
Bayesian Decision Theory

Posterior probability from Bayes formula:
\[ P(C_j \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_j)\, P(C_j)}{P(\mathbf{x})} \]

• \( P(C_j) \): class probability (prior)
• \( P(\mathbf{x} \mid C_j) \): conditional density of X
• \( P(\mathbf{x}) \): unconditional density of X,
\[ P(\mathbf{x}) = \sum_{j=1}^{n} P(\mathbf{x} \mid C_j)\, P(C_j) \]
Bayesian Classifier

Find the maximum posterior probability:
\[ P(C_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_i)\, P(C_i)}{P(\mathbf{x})} \]

For a given x, if \( P(C_1 \mid \mathbf{x}) > P(C_2 \mid \mathbf{x}) \), then x is classified to class 1; otherwise, class 2.

Since \( P(\mathbf{x}) \) is common to both posteriors, equivalently:
For a given x, if \( P(\mathbf{x} \mid C_1)\, P(C_1) > P(\mathbf{x} \mid C_2)\, P(C_2) \), then x is classified to class 1; otherwise, class 2.
Bayesian Classifier

For a given x, if \( P(\mathbf{x} \mid C_1)\, P(C_1) > P(\mathbf{x} \mid C_2)\, P(C_2) \), then x is classified to class 1; otherwise, class 2.

Computing the prior probabilities:
\[ P(C_1) = n_1 / N, \qquad P(C_2) = n_2 / N \]
These are estimates rather than exact values (they approach the true values as N grows).

Computing the likelihood:
Estimate \( P(\mathbf{x} \mid C_i) \) from the training samples belonging to \( C_i \).
Naïve Bayes: \( P(\mathbf{x} \mid C_i) = P(x_1 \mid C_i)\, P(x_2 \mid C_i) \cdots P(x_p \mid C_i) \)
In DA? \( P(\mathbf{x} \mid C_i) \) is assumed to follow a multivariate normal distribution.
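A minimal one-dimensional sketch of this Bayes rule, with priors estimated from class counts; the class means, standard deviations, and the query point are illustrative:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density, used here as the class-conditional likelihood P(x|Ci)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

n1, n2 = 12, 12                            # class counts in the training set
p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)    # estimated priors P(C1), P(C2)

x = 2.0
post1 = normal_pdf(x, mu=0.0, sigma=1.0) * p1   # proportional to P(x|C1)P(C1)
post2 = normal_pdf(x, mu=3.0, sigma=1.0) * p2   # proportional to P(x|C2)P(C2)
print("class 1" if post1 > post2 else "class 2")  # prints class 2
```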
Classification Based on Normal Populations

The normal (Gaussian) distribution often fits real-world data, and it is fully specified by only two kinds of parameters (the mean and the variance), which makes it mathematically attractive.
Under the assumption that the likelihood follows a normal distribution, we can analyze the properties of the Bayesian classifier.
Multivariate Normal Distribution

\[ p(\mathbf{x} \mid C_j) = \frac{1}{(2\pi)^{p/2} |\Sigma_j|^{1/2}} \exp\!\left[ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu}_j)^T \Sigma_j^{-1} (\mathbf{x}-\boldsymbol{\mu}_j) \right] \]

\[ \mathbf{x} = (x_1, \ldots, x_p)^T \sim \text{MVN}(\boldsymbol{\mu}, \Sigma), \quad \text{where } \boldsymbol{\mu} = (\mu_1, \ldots, \mu_p)^T, \quad \Sigma = \begin{bmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & \ddots & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{bmatrix} \]

In case of p = 1:
\[ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] \]
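The density formula can be evaluated directly from its definition; a small NumPy sketch (the mean and covariance are illustrative):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density p(x|Cj), computed from the formula above."""
    p = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)
    return float(norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff))

mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
# At the mean, with p = 2 and Sigma = I, the density is exactly 1/(2*pi)
print(mvn_pdf(mu, mu, Sigma))  # prints 1/(2*pi) ≈ 0.15915
```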
Discriminant Function

Discriminant function in terms of posterior probability:
\[ d_i(\mathbf{x}) \equiv P(C_i \mid \mathbf{x}) \propto p(\mathbf{x} \mid C_i)\, P(C_i), \quad \text{or} \]
\[ d_i(\mathbf{x}) = \ln p(\mathbf{x} \mid C_i) + \ln P(C_i), \qquad i = 1, 2, \ldots, K \ \text{(where } K \text{ is the number of classes)} \]

The classifier will assign an observation x to class \( C_i \) if \( d_i(\mathbf{x}) > d_j(\mathbf{x}) \) for all \( j \neq i \).
Discriminant Function for Normal Distribution

With
\[ p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{p/2} |\Sigma_i|^{1/2}} \exp\!\left[ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) \right], \]

\[ d_i(\mathbf{x}) = \ln p(\mathbf{x} \mid C_i) + \ln P(C_i) = -\frac{p}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(C_i) \]

Decision boundary between class k and class r:
\[ d_k(\mathbf{x}) = d_r(\mathbf{x}) \]
Derivation of Decision Boundary

Setting \( d_1(\mathbf{x}) - d_2(\mathbf{x}) = 0 \):
\[ \left[ \ln P(C_1) - \frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_1| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^T \Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) \right] - \left[ \ln P(C_2) - \frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma_2| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^T \Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) \right] = 0 \]

The \( -\frac{p}{2}\ln(2\pi) \) terms cancel, leaving
\[ (\mathbf{x}-\boldsymbol{\mu}_2)^T \Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) - (\mathbf{x}-\boldsymbol{\mu}_1)^T \Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) = \ln|\Sigma_1| - \ln|\Sigma_2| - 2\ln\frac{P(C_1)}{P(C_2)} \]
Decision Boundary When S 1 = S 2
) 2 (
) 1 ln ( 2 ln
) (
) (
) (
) (
2 1 2
1 2 2 1
1 1
1
P
x P x
x
x
T T-
S
= S - S -
- - S
- m
-m m
-m
) 0 2 (
) 1 ln ( ) (
) 2 (
) 1
(
1-
2S
-1-
1-
2S
-1 1-
2- = P
x
TP
T
m m m m
m m
( ) ( ) (
1) (
2)
22 1
2 1
1 2
1 1 1
1 1
1
When 1 úS
û ù êë
é
- + - + - úS û ù êë
é
- + -
= - S
= S
=
S n n
n n
n n
A Decision
Assign x into class 1 if A + B > 0, class 2 if A + B < 0.
x + B =0
A + B = 0: Linear Equation
• 결정 경계
• 예제
27
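A sketch of the linear boundary coefficients for the equal-covariance case, following the expansion above, where A = 2(μ1-μ2)^T Σ^{-1} and B collects the remaining constant terms (all parameter values below are illustrative):

```python
import numpy as np

mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])  # shared covariance
P1 = P2 = 0.5                               # equal priors

Sinv = np.linalg.inv(Sigma)
A = 2 * (mu1 - mu2) @ Sinv
B = mu2 @ Sinv @ mu2 - mu1 @ Sinv @ mu1 + 2 * np.log(P1 / P2)

def classify(x):
    """Assign class 1 if A x + B > 0, else class 2."""
    return 1 if A @ x + B > 0 else 2

print(classify(np.array([0.9, 1.1])), classify(np.array([-1.0, -0.8])))  # prints 1 2
```

With symmetric means and equal priors, B comes out to zero, so the boundary passes through the origin.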
Decision Boundary When S 1 ¹ S 2
) 2 (
) 1 ln ( 2 ln
) (
) (
) (
) (
2 1 2
1 2 2 1
1 1
1
P
x P x
x
x
T T-
S
= S - S -
- - S
- m
-m m
-m
) 0 2 (
) 1 ln ( )
( ) 2 (
1
12 2 1 1 1 2
2 1
1
- S + S - S - - =
S
-
- - - -P T P x x
x
Tm
Tm
T2
When S
1¹ S
A
Decision
Assign x into class 1 if Ax
2+ Bx + C > 0, class 2 if Ax
2+ Bx + C < 0.
x
2+ =0
Ax
2+ Bx + C = 0: Quadratic Equation
x + C
B
• 예제
11/2/2014 30
Linear Decision Boundary from LDA
(a) Three normal distributions, with the same covariance and different means.
(b) A sample of 30 drawn from each normal distribution, and the fitted LDA decision boundaries.
Quadratic Decision Boundary from QDA
QDA vs. LDA
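Both cases can be exercised with one function: a direct sketch of the normal-population discriminant d_i(x) from the earlier slide (the common -(p/2) ln 2π term is dropped, since it cancels in comparisons). Equal covariances give LDA's linear boundary; unequal covariances, as below, give QDA's quadratic one. All parameter values are illustrative:

```python
import numpy as np

def discriminant(x, mu, Sigma, prior):
    """d_i(x) = -0.5 ln|Sigma_i| - 0.5 (x-mu_i)^T Sigma_i^{-1} (x-mu_i) + ln P(C_i)."""
    diff = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * diff @ np.linalg.inv(Sigma) @ diff
            + np.log(prior))

# Illustrative parameters with unequal covariances (the QDA setting)
mu1, Sigma1 = np.array([0.0, 0.0]), np.eye(2)
mu2, Sigma2 = np.array([3.0, 0.0]), 4.0 * np.eye(2)

x = np.array([1.0, 0.0])
d1 = discriminant(x, mu1, Sigma1, 0.5)
d2 = discriminant(x, mu2, Sigma2, 0.5)
print(1 if d1 > d2 else 2)  # prints 1
```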
12.4 Classification performance of discriminant analysis
Main assumptions:
① Measuring distance within each class assumes a multivariate normal distribution
• When this assumption holds, DA is more effective than any other classification method (meaning the same performance can be achieved with less data)
• Robust even when predictors are non-normal or are dummy variables
• Sensitive to outliers, however, so exploratory analysis is needed
12.4 Classification performance of discriminant analysis
② Assumes the correlation structure within a class is identical across classes
• Check by computing the correlation matrix of each class and comparing them
• When the correlation matrices differ across classes, records tend to be classified to the class with greater variability
• An alternative is to perform quadratic discriminant analysis (if the dataset is very large)
12.5 Prior probabilities
③ The method so far assumes that each class is equally likely to be encountered a priori. What if the classes are not equally likely?
Let p_j = prior probability of membership in class j
Add log(p_j) (natural logarithm) to the classification function of each class
Ex. What if the population proportion of riding-mower owners is 15% (in the sample it was 12 vs. 12, i.e., 50%)?
→ Classification should be nudged away from "owner"
→ Add log(0.15) to the constant term of the owner classification function in Fig 12.3, and log(0.85) to the nonowner function
12.5 Prior probabilities
If a record has income = $60K and lot size = 18.4K:
Owner score = -73.16 + (0.43)(60) + (5.47)(18.4) + log(0.15) = 51.30
Non-owner score = -51.42 + (0.33)(60) + (4.68)(18.4) + log(0.85) = 54.32
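This adjustment is mechanical: add the natural log of each prior to the corresponding score. A sketch with the riding-mower coefficients (small differences from the slide's printed values come from rounding):

```python
import math

# Riding-mower classification scores adjusted for unequal priors:
# add ln(prior) to each class's score, as described above.
income, lot_size = 60.0, 18.4
owner_score = -73.16 + 0.43 * income + 5.47 * lot_size + math.log(0.15)
nonowner_score = -51.42 + 0.33 * income + 4.68 * lot_size + math.log(0.85)
print(round(owner_score, 2), round(nonowner_score, 2))
```

The non-owner score remains the larger one, but the gap widens once ownership is assumed to be rarer in the population.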
12.6 Unequal misclassification costs
④ What if the misclassification costs are not symmetric?
Let q_j = the cost of misclassifying a class-j record
• Add log(q_j) to the classification function of each class (when combined with the prior probabilities, this amounts to adding log(p_j q_j))
• If the individual costs are hard to estimate, use the ratio q_2/q_1; in that case, add log(q_2/q_1) only to the classification function of class 2
12.7 Classifying more than two classes
Example 3: Medical dispatch to accident scene
When emergency (119) resources are temporarily short and a dispatch request comes in, which accident scene should they be sent to?
Outcome (3 classes): No injury, Non-fatal injury, Fatal injury
Predictors: Time of day, day of week, weather, type of road, road surface conditions, …
12.7 Classifying more than two classes
Example 3: Medical dispatch to accident scene
- Applied discriminant analysis to 1,000 records
- Three classification functions per record; the confusion matrix is 3×3
- Compute the classification scores
- Assign each record to the class with the highest classification score
12.7 Classifying more than two classes
Example 3: Medical dispatch to accident scene
\[ \frac{e^{30.93}}{e^{25.94} + e^{31.42} + e^{30.93}} = 0.38 \]
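The quoted probability can be verified from the three scores on the slide (the mapping of scores to outcome classes is not stated, so the list order below is illustrative):

```python
import math

# Three-class membership probability from the three classification scores.
scores = [25.94, 31.42, 30.93]
denom = sum(math.exp(s) for s in scores)
p = math.exp(30.93) / denom
print(round(p, 2))  # prints 0.38
```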
12.8 Advantages and weaknesses
• A statistical classification method rather than a data mining method
• Very popular in the social sciences
• Use and performance are similar to multiple linear regression:
-> uses the same least-squares method
-> normality assumption (predictors approximately multivariate normal)
-> works well in practice even when the predictors are not normal
-> for heavily skewed variables, a log transform helps
• It provides estimates of single-predictor contributions, is computationally simple, and is useful for small datasets.
• 12.1 (Personal Loan)
• 12.2 (System Administrator)
• 12.3 (detecting spam mail)