Learning a Class from Examples

(1)

(2)

Learning a Class from Examples

•  Class C of a “family car”
   •  Prediction: Is car x a family car?
   •  Knowledge extraction: What do people expect from a family car?
•  Output: Positive (+) and negative (–) examples
•  Input representation: $x_1$: price, $x_2$: engine power

(3)

Training set X

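$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is positive} \\ 0 & \text{if } \mathbf{x} \text{ is negative} \end{cases}$$

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$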

(4)

Class C

$$(p_1 \le \text{price} \le p_2) \ \text{AND} \ (e_1 \le \text{engine power} \le e_2)$$

(5)

Hypothesis class H

⎩ ⎨

= ⎧

negative

is says

if

positive

is says

if )

( x

x x

h h h

0 1

( )

∑

=

≠

=

N t

t

r

h h

E

1

) 1 x

|

( X

Error of h on H
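The error of a hypothesis is the number of training instances where h(x^t) differs from r^t. A minimal sketch of that count for rectangle hypotheses follows; the function names, the rectangle bounds, and the five training instances are all made up for illustration and do not come from the slides.

```python
def make_rectangle_hypothesis(p1, p2, e1, e2):
    """Return h(x): 1 if (p1 <= price <= p2) AND (e1 <= engine power <= e2), else 0."""
    def h(x):
        price, engine_power = x
        return 1 if (p1 <= price <= p2 and e1 <= engine_power <= e2) else 0
    return h


def empirical_error(h, training_set):
    """Count the training instances where h(x^t) != r^t."""
    return sum(1 for x, r in training_set if h(x) != r)


# Hypothetical training set: ((price, engine power), r) with r = 1 for a family car.
training_set = [
    ((15000, 100), 1),
    ((18000, 120), 1),
    ((9000, 60), 0),
    ((40000, 300), 0),
    ((17000, 250), 0),
]

h_tight = make_rectangle_hypothesis(12000, 20000, 80, 150)
h_wide = make_rectangle_hypothesis(5000, 50000, 50, 260)

print(empirical_error(h_tight, training_set))  # 0: every instance is labeled correctly
print(empirical_error(h_wide, training_set))   # 2: two negatives fall inside the rectangle
```

The wide rectangle covers both positives but also swallows two negative instances, which is why its error is 2.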

(6)

S, G, and the Version Space

most specific hypothesis, S

most general hypothesis, G

Any h ∈ H between S and G is consistent with the training set; together these hypotheses make up the version space (Mitchell, 1997)
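For axis-aligned rectangles, S can be computed as the tightest rectangle around the positive examples. The sketch below assumes that hypothesis class, uses an invented toy data set, and checks consistency for S and for two hand-picked wider rectangles. It does not compute G, which is generally not unique.

```python
def most_specific_rectangle(training_set):
    """Tightest rectangle (x1_min, x1_max, x2_min, x2_max) around the positive examples."""
    positives = [x for x, r in training_set if r == 1]
    x1_values = [x[0] for x in positives]
    x2_values = [x[1] for x in positives]
    return (min(x1_values), max(x1_values), min(x2_values), max(x2_values))


def predict(rectangle, x):
    """1 if x lies inside the rectangle (borders included), else 0."""
    lo1, hi1, lo2, hi2 = rectangle
    return 1 if (lo1 <= x[0] <= hi1 and lo2 <= x[1] <= hi2) else 0


def is_consistent(rectangle, training_set):
    """True if the rectangle labels every training instance correctly."""
    return all(predict(rectangle, x) == r for x, r in training_set)


# Hypothetical data: three positives surrounded by three negatives.
training_set = [
    ((2, 3), 1), ((4, 5), 1), ((3, 4), 1),
    ((0, 0), 0), ((6, 4), 0), ((3, 8), 0),
]

S = most_specific_rectangle(training_set)
wider = (1, 5, 2, 6)      # larger than S, still excludes every negative
too_wide = (1, 7, 2, 6)   # also covers the negative at (6, 4)

print(S)                                      # (2, 4, 3, 5)
print(is_consistent(S, training_set))         # True
print(is_consistent(wider, training_set))     # True: a member of the version space
print(is_consistent(too_wide, training_set))  # False
```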

(7)

Margin

•  Choose h with the largest margin, the distance between the boundary and the instances closest to it

(8)

VC Dimension

•  N points can be labeled in $2^N$ ways as +/–
•  H shatters N points if, for every one of these labelings, there exists an h ∈ H consistent with it; the largest such N is the VC dimension: VC(H) = N
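Shattering can be checked by brute force for small point sets. The sketch below assumes H is the class of axis-aligned rectangles and uses two invented point sets. It relies on one observation: if any rectangle realizes a labeling, then the bounding box of the positive points does too. Showing that one set of four points is shattered establishes VC(H) ≥ 4 for this class; failing on one particular set of five points illustrates the idea but is not by itself a proof that no five points can be shattered.

```python
from itertools import product


def rectangle_can_realize(points, labels):
    """True if some axis-aligned rectangle contains exactly the points labeled 1."""
    positives = [p for p, label in zip(points, labels) if label == 1]
    if not positives:
        return True  # a rectangle placed away from all points labels everything 0
    lo1 = min(p[0] for p in positives)
    hi1 = max(p[0] for p in positives)
    lo2 = min(p[1] for p in positives)
    hi2 = max(p[1] for p in positives)
    # The bounding box of the positives must not contain any negative point.
    for p, label in zip(points, labels):
        if label == 0 and lo1 <= p[0] <= hi1 and lo2 <= p[1] <= hi2:
            return False
    return True


def shatters(points):
    """True if all 2^N labelings of the N points can be realized."""
    return all(
        rectangle_can_realize(points, labels)
        for labels in product([0, 1], repeat=len(points))
    )


diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # hypothetical 4-point layout
diamond_plus_center = diamond + [(0, 0)]        # same layout with a fifth point inside

print(shatters(diamond))              # True
print(shatters(diamond_plus_center))  # False
```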

(9)

Probably Approximately Correct (PAC) Learning

•  How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
•  Each strip has probability at most ε/4
•  Pr that an instance misses a strip: 1 − ε/4
•  Pr that N instances miss a strip: $(1 - \varepsilon/4)^N$
•  Pr that N instances miss any of the 4 strips: at most $4(1 - \varepsilon/4)^N$
•  Require $4(1 - \varepsilon/4)^N \le \delta$; since $(1 - x) \le \exp(-x)$,
•  it suffices that $4\exp(-\varepsilon N/4) \le \delta$, that is, $N \ge (4/\varepsilon)\log(4/\delta)$
•  Probabilistic upper bound on the test error of a classification model (h = VC dim.)
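As a quick numeric check of the rectangle bound N ≥ (4/ε) log(4/δ), here is a small sketch. The function name and the values ε = 0.1, δ = 0.05 are illustrative choices, and the logarithm is taken as natural, matching the use of exp in the derivation.

```python
import math


def pac_sample_size(epsilon, delta):
    """Smallest integer N satisfying N >= (4 / epsilon) * ln(4 / delta)."""
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))


epsilon, delta = 0.1, 0.05
n = pac_sample_size(epsilon, delta)
print(n)  # 176

# The bound being enforced: 4 * (1 - epsilon/4)^N <= delta.
print(4 * (1 - epsilon / 4) ** n <= delta)  # True
```

Halving ε roughly doubles the required N, while tightening δ only adds a logarithmic term.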

(10)

Noise and Model Complexity

Use the simpler one because

•  Simpler to use (lower computational complexity)
•  Easier to train (lower space complexity)
•  Easier to explain (more interpretable)
•  Generalizes better (lower

(11)

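Multiple Classes, $C_i$, $i = 1, \ldots, K$

$$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N}$$

$$r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases}$$

Train hypotheses $h_i(\mathbf{x})$, $i = 1, \ldots, K$:

$$h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \ne i \end{cases}$$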
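One way to picture the K-class setup is to build the indicator labels r_i^t and then fit one two-class hypothesis per class. The sketch below assumes each h_i is the tightest rectangle around the instances of C_i, which is just one possible choice. The data, the helper names, and the 0-based class indices are invented for illustration.

```python
def indicator_labels(class_of, num_classes):
    """r_i^t = 1 if x^t belongs to C_i, else 0 (classes are 0-indexed here)."""
    return [[1 if i == c else 0 for i in range(num_classes)] for c in class_of]


def fit_rectangle(points):
    """Tightest rectangle (x1_min, x1_max, x2_min, x2_max) around the given points."""
    x1 = [p[0] for p in points]
    x2 = [p[1] for p in points]
    return (min(x1), max(x1), min(x2), max(x2))


def train_one_per_class(xs, r, num_classes):
    """Fit h_i on the two-class problem 'C_i versus all other classes'."""
    return [
        fit_rectangle([x for x, row in zip(xs, r) if row[i] == 1])
        for i in range(num_classes)
    ]


def predict_all(rectangles, x):
    """Return [h_1(x), ..., h_K(x)] for the fitted rectangles."""
    return [
        1 if (lo1 <= x[0] <= hi1 and lo2 <= x[1] <= hi2) else 0
        for lo1, hi1, lo2, hi2 in rectangles
    ]


# Hypothetical data: six points, three classes.
xs = [(1, 1), (2, 2), (5, 5), (6, 6), (1, 6), (2, 7)]
class_of = [0, 0, 1, 1, 2, 2]

r = indicator_labels(class_of, num_classes=3)
hypotheses = train_one_per_class(xs, r, num_classes=3)

print(r[0])                                  # [1, 0, 0]
print(predict_all(hypotheses, (1.5, 1.5)))   # [1, 0, 0]
print(predict_all(hypotheses, (9, 9)))       # [0, 0, 0]: no hypothesis claims this point
```

The last case shows that the K hypotheses need not agree on exactly one class; a point can be claimed by none of them.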

(12)

Regression

$$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}, \quad r^t \in \Re, \quad r^t = f(x^t) + \varepsilon$$

$$E(g \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \left[ r^t - g(x^t) \right]^2$$

$$g(x) = w_1 x + w_0$$

$$g(x) = w_2 x^2 + w_1 x + w_0$$
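For the linear model g(x) = w1·x + w0, minimizing the mean squared error has a standard closed-form solution in terms of the sample means of x and r. The following is a small sketch of that solution; the function names and the four data points are illustrative and are chosen to lie exactly on a line, so the fitted error is zero.

```python
def fit_line(xs, rs):
    """Least-squares estimates (w1, w0) for g(x) = w1 * x + w0."""
    n = len(xs)
    x_bar = sum(xs) / n
    r_bar = sum(rs) / n
    numerator = sum(x * r for x, r in zip(xs, rs)) - n * x_bar * r_bar
    denominator = sum(x * x for x in xs) - n * x_bar ** 2
    w1 = numerator / denominator
    w0 = r_bar - w1 * x_bar
    return w1, w0


def mean_squared_error(g, xs, rs):
    """E(g | X) = (1/N) * sum over t of [r^t - g(x^t)]^2."""
    return sum((r - g(x)) ** 2 for x, r in zip(xs, rs)) / len(xs)


# Hypothetical noise-free data generated from r = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
rs = [1.0, 3.0, 5.0, 7.0]

w1, w0 = fit_line(xs, rs)
error = mean_squared_error(lambda x: w1 * x + w0, xs, rs)
print(w1, w0, error)
```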

(13)

Model Selection & Generalization

•  Learning is an ill-posed problem; the data is not sufficient to find a unique solution (d binary features → $2^d$ possible examples)
•  The need for inductive bias: assumptions about H
   •  E.g., assuming the class has the shape of a rectangle
•  Generalization: how well a model performs on new data
•  Overfitting: H more complex than C or f
•  Underfitting: H less complex than C or f


Table 2.1 With two inputs, there are four possible cases and sixteen possible Boolean functions.

x₁  x₂  h₁  h₂  h₃  h₄  h₅  h₆  h₇  h₈  h₉  h₁₀  h₁₁  h₁₂  h₁₃  h₁₄  h₁₅  h₁₆
0   0   0   0   0   0   0   0   0   0   1   1    1    1    1    1    1    1
0   1   0   0   0   0   1   1   1   1   0   0    0    0    1    1    1    1
1   0   0   0   1   1   0   0   1   1   0   0    1    1    0    0    1    1
1   1   0   1   0   1   0   1   0   1   0   1    0    1    0    1    0    1

where $\bar{x} = \sum_t x^t / N$ and $\bar{r} = \sum_t r^t / N$. The line found is shown in figure 1.2.

If the linear model is too simple, it is too constrained and incurs a large approximation error, and in such a case, the output may be taken as a higher-order function of the input—for example, quadratic

$$g(x) = w_2 x^2 + w_1 x + w_0 \qquad (2.18)$$

where similarly we have an analytical solution for the parameters. When the order of the polynomial is increased, the error on the training data decreases. But a high-order polynomial follows individual examples closely, instead of capturing the general trend; see the sixth-order polynomial in figure 2.10. This implies that Occam’s razor also applies in the case of regression and we should be careful when fine-tuning the model complexity to match it with the complexity of the function underlying the data.
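The effect of polynomial order on training error can be seen on a toy data set. The sketch below fits polynomials of order 1, 2, and 3 by solving the normal equations with plain Gaussian elimination. The solver, the function names, and the seven data points are assumptions made for illustration and are not the book's code. Because each lower-order model is a special case of the next, the training error cannot increase with order, which says nothing about error on new data.

```python
def fit_polynomial(xs, rs, order):
    """Least-squares weights w (w[k] multiplies x**k) from the normal equations."""
    n = order + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    atr = [sum(r * x ** i for x, r in zip(xs, rs)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(ata[row][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        atr[col], atr[pivot] = atr[pivot], atr[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            atr[row] -= factor * atr[col]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(ata[i][k] * w[k] for k in range(i + 1, n))
        w[i] = (atr[i] - tail) / ata[i][i]
    return w


def training_error(w, xs, rs):
    """Mean squared error of the polynomial with weights w on (xs, rs)."""
    def g(x):
        return sum(wk * x ** k for k, wk in enumerate(w))
    return sum((r - g(x)) ** 2 for x, r in zip(xs, rs)) / len(xs)


# Hypothetical noisy observations of a roughly linear trend.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
rs = [1.1, 1.4, 2.3, 2.4, 3.5, 3.3, 4.2]

errors = [training_error(fit_polynomial(xs, rs, order), xs, rs) for order in (1, 2, 3)]
for order, err in zip((1, 2, 3), errors):
    print(order, err)
```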

2.7 Model Selection and Generalization

Let us start with the case of learning a Boolean function from examples.

In a Boolean function, all inputs and the output are binary. There are $2^d$ possible ways to write d binary values and therefore, with d inputs, the training set has at most $2^d$ examples. As shown in table 2.1, each of these can be labeled as 0 or 1, and therefore, there are $2^{2^d}$ possible Boolean functions of d inputs.
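The count can be verified by enumeration for small d. The sketch below lists every Boolean function of d = 2 inputs as a truth table, in the same column order as table 2.1, and then filters the functions that agree with a few labeled examples. The helper names and the example labels are illustrative.

```python
from itertools import product


def boolean_functions(d):
    """Return the 2^d input cases and all 2^(2^d) truth tables over them."""
    inputs = list(product([0, 1], repeat=d))
    tables = product([0, 1], repeat=len(inputs))
    return inputs, [dict(zip(inputs, outputs)) for outputs in tables]


def consistent(functions, examples):
    """Keep the functions that agree with every labeled example (x, r)."""
    return [h for h in functions if all(h[x] == r for x, r in examples)]


inputs, functions = boolean_functions(2)
print(len(inputs), len(functions))  # 4 16

after_one = consistent(functions, [((0, 1), 1)])
after_two = consistent(functions, [((0, 1), 1), ((1, 0), 1)])
print(len(after_one), len(after_two))  # 8 4
```

Each distinct labeled example cuts the set of consistent functions in half, so only after seeing all $2^d$ cases does a single function remain.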

Each distinct training example removes half the hypotheses, namely,

(14)

Triple Trade-Off

•  There is a trade-off between three factors (Dietterich, 2003):
   1.  Complexity of H, c(H),
   2.  Training set size, N,
   3.  Generalization error, E, on new data
•  As N↑, E↓
•  As c(H)↑, first E↓ and then E↑

(15)
