• 검색 결과가 없습니다.

Learning  a  Class  from  Examples

N/A
N/A
Protected

Academic year: 2021

Share "Learning  a  Class  from  Examples"

Copied!
16
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)
(2)

Learning  a  Class  from  Examples  

’  Class  C  of  a  “family  car”  

’   Predic8on:  Is  car  x  a  family  car?  

’   Knowledge  extrac8on:  What  do  people  expect  from  a   family  car?  

’  Output:    

   Posi8ve  (+)  and  nega8ve  (–)  examples  

’  Input  representa8on:    

   x :  price,  x  :  engine  power  

(3)

Training  set   X

N t t

t

,r }

1

{

=

X = x

⎩ ⎨

= ⎧

negative  

is     if  

positive  

is     if  

x x 0

r 1

⎥ ⎦

⎢ ⎤

⎣

= ⎡

2 1

x

x x

(4)

Class   C

( p

1

≤   price   ≤   p

2

)  AND   ( e

1

≤   engine     power   ≤   e

2

)

(5)

Hypothesis  class   H

⎩ ⎨

= ⎧

negative  

is     says  

  if  

positive  

is     says  

  if )  

( x

x x

h h h

0 1

( )

( )

=

=

N t

t

t

r

h h

E

1

) 1 x

|

( X

Error  of  h  on   H  

(6)

S,  G,  and  the  Version  Space  

most  specific  hypothesis,  S  

most  general  hypothesis,  G  

h  ∈ H,  between  S  and  G  is   consistent    

and  make  up  the    

version  space  

(Mitchell,  1997)  

(7)

Margin  

’  Choose  h  with  largest  margin  

(8)

VC  Dimension  

’  N  points  can  be  labeled  in  2 N  ways  as  +/–  

’  H  shaXers  N  if  there      exists  h  ∈   H  consistent      for  any  of  these:    

 VC(H )  =  N    

 

(9)

Probably  Approximately  Correct   (PAC)  Learning  

’  How  many  training  examples  N  should  we  have,  such  that  with  probability   at  least  1  ‒  δ,  h  has  error  at  most  ε  ?  

 (Blumer  et  al.,  1989)  

’  Each  strip  is  at  most  ε/4  

’  Pr  that  we  miss  a  strip  1‒  ε/4  

’  Pr  that  N  instances  miss  a  strip  (1  ‒  ε/4)

N  

’  Pr  that  N  instances  miss  4  strips  4(1  ‒  ε/4)

N  

’  4(1  ‒  ε/4)

N

 ≤  δ  and  (1  ‒  x)≤exp(  ‒  x)

 

’  4exp(‒  εN/4)  ≤  δ    and  N  ≥  (4/ε)log(4/δ)  

’  Probabilis8c  upper  bound  on  the  test  error  of  a  classifica8on  model  (h  =  VC  dim.)  

(10)

Noise  and  Model  Complexity  

Use  the  simpler  one  because  

’  Simpler  to  use    

 (lower  computa8onal      complexity)  

’  Easier  to  train  (lower      space  complexity)  

’  Easier  to  explain      (more  interpretable)  

’  Generalizes  beXer  (lower    

(11)

Mul8ple  Classes,   C i  i=1,...,K  

N t t

t ,r } 1

{ =

X = x

⎩ ⎨

⎧

= ∈

       ,   if  

 

  if  

 

i r j

t j t i

it

C

C x

x 0 1

( ) ⎩ ⎨ ⎧

= ∈

       ,  

if    

  if  

 

i h j

t j t i

i t

C

C x

x x

0 1

Train  hypotheses    

h i (x),  i  =1,...,K:  

(12)

Regression  

( ) x w

1

x w

0

g = +

( )

1 0

2

2

x w x w

w x

g = + +

( ) [ ( ) ]

=

=

N t

t

t

g x

N r g

E

1

1

2

X

|

1 N

{ } ( ) + ε

= ℜ

= =

t t

t

N t t t

x f r

r

r

x , 1

X

(13)

Model  Selec8on  &  Generaliza8on  

’  Learning  is  an  ill-­‐posed  problem;  data  is  not  sufficient  to  find  a   unique  solu8on  (d  features   à  2 d  example)  

’  The  need  for  induc8ve  bias,  assump8ons  about   H

’  E.x.  assuming  the  shape  of  rectangle  

’  Generaliza8on:  How  well  a  model  performs  on  new  data  

’  Overfinng:   H  more  complex  than  C  or  f    

’  Underfinng:   H  less  complex  than  C  or  f  

13

2.7 Model Selection and Generalization 37

Table 2.1 With two inputs, there are four possible cases and sixteen possible Boolean functions.

x1 x2 h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13 h14 h15 h16

0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1

1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1

1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

where x = !

t xt/N and r = !

t rt/N. The line found is shown in figure 1.2.

If the linear model is too simple, it is too constrained and incurs a large approximation error, and in such a case, the output may be taken as a higher-order function of the input—for example, quadratic

g(x)= w2x2+ w1x+ w0

(2.18)

where similarly we have an analytical solution for the parameters. When the order of the polynomial is increased, the error on the training data de- creases. But a high-order polynomial follows individual examples closely, instead of capturing the general trend; see the sixth-order polynomial in figure 2.10. This implies that Occam’s razor also applies in the case of re- gression and we should be careful when fine-tuning the model complexity to match it with the complexity of the function underlying the data.

2.7 Model Selection and Generalization

Let us start with the case of learning a Boolean function from examples.

In a Boolean function, all inputs and the output are binary. There are 2d possible ways to write d binary values and therefore, with d inputs, the training set has at most 2d examples. As shown in table 2.1, each of these can be labeled as 0 or 1, and therefore, there are 22d possible Boolean functions of d inputs.

Each distinct training example removes half the hypotheses, namely,

(14)

Triple  Trade-­‐Off  

’  There  is  a  trade-­‐off  between  three  factors  (DieXerich,   2003):  

1.  Complexity  of   H,  c   ( H ),  

2.  Training  set  size,  N,    

3.  Generaliza8on  error,  E,  on  new  data  

¨  As  N↑, E↓  

¨  As  c  ( H)↑, first  E↓ and  then  E↑  

(15)

Cross-­‐Valida8on  

’  To  es8mate  generaliza8on  error,  we  need  data  unseen   during  training.  We  split  the  data  as  

’  Training  set  (50%)  

’  Valida8on  set  (25%)  

’   Test  (publica8on)  set  (25%)  

’  Resampling  when  there  is  few  data  

(16)

Dimensions  of  a  Supervised   Learner  

1.   Model:    

     

2.   Loss  func8on:  

     

3.  Op8miza8on  procedure:  

 

( x | θ )

g

( ) = ( ( ) )

t

t

t g

r L

E θ | X , x | θ

( | X )

min  

arg

* θ

θ = θ E

수치

Table 2.1 With two inputs, there are four possible cases and sixteen possible Boolean functions

참조

관련 문서

Should a module be used with any of the absolute maximum ratings exceeded, the characteristics of the module may not be recovered, or in an extreme case, the module

웹 표준을 지원하는 플랫폼에서 큰 수정없이 실행 가능함 패키징을 통해 다양한 기기를 위한 앱을 작성할 수 있음 네이티브 앱과

The index is calculated with the latest 5-year auction data of 400 selected Classic, Modern, and Contemporary Chinese painting artists from major auction houses..

A frame size error in a frame that could alter the state of the entire connection MUST be treated as a connection error (Section 5.4.1); this includes any frame carrying a

When static keyword is used with the variables and the methods, it signifies that these members belongs to the class and these members are shared by all the objects of

The “Asset Allocation” portfolio assumes the following weights: 25% in the S&P 500, 10% in the Russell 2000, 15% in the MSCI EAFE, 5% in the MSCI EME, 25% in the

1 John Owen, Justification by Faith Alone, in The Works of John Owen, ed. John Bolt, trans. Scott Clark, "Do This and Live: Christ's Active Obedience as the

In order to achieve the purpose of this study, we established a theoretical assumption that the variance of achievement goal orientation will affect the