Learning a Class from Examples
’ Class C of a “family car”
’ Predic8on: Is car x a family car?
’ Knowledge extrac8on: What do people expect from a family car?
’ Output:
Posi8ve (+) and nega8ve (–) examples
’ Input representa8on:
x : price, x : engine power
Training set X
N t t
t
,r }
1{
=X = x
⎩ ⎨
= ⎧
negative
is if
positive
is if
x x 0
r 1
⎥ ⎦
⎢ ⎤
⎣
= ⎡
2 1
x
x x
Class C
( p
1≤ price ≤ p
2) AND ( e
1≤ engine power ≤ e
2)
Hypothesis class H
⎩ ⎨
= ⎧
negative
is says
if
positive
is says
if )
( x
x x
h h h
0 1
( )
( )
∑
=≠
=
N t
t
t
r
h h
E
1
) 1 x
|
( X
Error of h on H
S, G, and the Version Space
most specific hypothesis, S
most general hypothesis, G
h ∈ H, between S and G is consistent
and make up the
version space
(Mitchell, 1997)
Margin
’ Choose h with largest margin
VC Dimension
’ N points can be labeled in 2 N ways as +/–
’ H shaXers N if there exists h ∈ H consistent for any of these:
VC(H ) = N
Probably Approximately Correct (PAC) Learning
’ How many training examples N should we have, such that with probability at least 1 ‒ δ, h has error at most ε ?
(Blumer et al., 1989)
’ Each strip is at most ε/4
’ Pr that we miss a strip 1‒ ε/4
’ Pr that N instances miss a strip (1 ‒ ε/4)
N’ Pr that N instances miss 4 strips 4(1 ‒ ε/4)
N’ 4(1 ‒ ε/4)
N≤ δ and (1 ‒ x)≤exp( ‒ x)
’ 4exp(‒ εN/4) ≤ δ and N ≥ (4/ε)log(4/δ)
’ Probabilis8c upper bound on the test error of a classifica8on model (h = VC dim.)
Noise and Model Complexity
Use the simpler one because
’ Simpler to use
(lower computa8onal complexity)
’ Easier to train (lower space complexity)
’ Easier to explain (more interpretable)
’ Generalizes beXer (lower
Mul8ple Classes, C i i=1,...,K
N t t
t ,r } 1
{ =
X = x
⎩ ⎨
⎧
≠
∈
= ∈
, if
if
i r j
t j t i
it
C
C x
x 0 1
( ) ⎩ ⎨ ⎧
≠
∈
= ∈
,
if
if
i h j
t j t i
i t
C
C x
x x
0 1
Train hypotheses
h i (x), i =1,...,K:
Regression
( ) x w
1x w
0g = +
( )
1 02
2
x w x w
w x
g = + +
( ) ∑ [ ( ) ]
=
−
=
N t
t
t
g x
N r g
E
1
1
2X
|
1 N
{ } ( ) + ε
= ℜ
∈
= =
t t
t
N t t t
x f r
r
r
x , 1
X
Model Selec8on & Generaliza8on
’ Learning is an ill-‐posed problem; data is not sufficient to find a unique solu8on (d features à 2 d example)
’ The need for induc8ve bias, assump8ons about H
’ E.x. assuming the shape of rectangle
’ Generaliza8on: How well a model performs on new data
’ Overfinng: H more complex than C or f
’ Underfinng: H less complex than C or f
13
2.7 Model Selection and Generalization 37
Table 2.1 With two inputs, there are four possible cases and sixteen possible Boolean functions.
x1 x2 h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13 h14 h15 h16
0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
where x = !
t xt/N and r = !
t rt/N. The line found is shown in figure 1.2.
If the linear model is too simple, it is too constrained and incurs a large approximation error, and in such a case, the output may be taken as a higher-order function of the input—for example, quadratic
g(x)= w2x2+ w1x+ w0
(2.18)
where similarly we have an analytical solution for the parameters. When the order of the polynomial is increased, the error on the training data de- creases. But a high-order polynomial follows individual examples closely, instead of capturing the general trend; see the sixth-order polynomial in figure 2.10. This implies that Occam’s razor also applies in the case of re- gression and we should be careful when fine-tuning the model complexity to match it with the complexity of the function underlying the data.
2.7 Model Selection and Generalization
Let us start with the case of learning a Boolean function from examples.
In a Boolean function, all inputs and the output are binary. There are 2d possible ways to write d binary values and therefore, with d inputs, the training set has at most 2d examples. As shown in table 2.1, each of these can be labeled as 0 or 1, and therefore, there are 22d possible Boolean functions of d inputs.
Each distinct training example removes half the hypotheses, namely,