2 002 , V ol. 13, N o.2 p p . 65~75
On A ppro x im at e P re dic t ion In t e rv al s for S u pport V e c t or M ac h in e R e g re s s ion
K y u n g h a S e ok 1) , Ch an g h a H w an g 2 ) , D aeh y e on Ch o 3 ) A b s tra c t
T h e support v ect or m ach in e (S V M ), fir st dev elop ed by V apn ik an d h is gr oup at A T &T Bell Lab or at orie s , is b ein g u sed a s a n ew t echn iqu e for r egr es sion an d cla s sificat ion pr oblem s . In t his paper w e pre sen t an approach t o est im at in g appr ox im at e pr edict ion in t erv als for S V M r egr es sion b a sed on post erior pr edictiv e den sities . F u rt h erm or e, t h e m et h od is illu str at ed w it h a dat a ex am ple.
1 . Intro du c tion
T h e supp ort v ect or m achin e (S V M ), fir st dev eloped b y V apnik an d his gr ou p at A T &T Bell Lab or at ories , is b ein g u sed a s a n ew t echn iqu e for r eg re s sion an d cla s sificat ion pr oblem s . S VM is g ainin g popularit y du e t o m any at t r act iv e feat ur es , an d pr om isin g em pirical perform an ce. S V M h a s b een su cces sfu lly applied t o a n um b er of r eal w orld pr oblem s su ch a s h an dw rit t en ch ar act er an d digit r ecog nition , face det ection , t ex t cat eg orizat ion an d obj ect det ect ion in m ach in e v ision . T h e aforem ent ion ed applicat ion s r elat e t o cla s sification pr oblem s . bu t S V M is also w idely applicable in regr es sion problem s . S V M w a s initially dev eloped t o s olv e cla s sificat ion pr oblem s , bu t r ecen tly it h a s b een ex t en ded t o th e dom ain of r eg r es sion pr ob lem s . H ow ev er , S V M cla s sificat ion can b e v iew ed a s a special ca se of S VM r eg r es sion . F or t ut orial in tr odu ct ion s an d ov erv iew s of recent dev elopm ent s of S VM regr es sion , see Gun n (1998 ), S m ola & S ch ölk opf (1998 ), an d 1. As s ociat e Pr ofes s or , Dat a Science Dept ., Inj e Univ er sity , Kyungnam , 621- 749, Kor ea .
E - m ail : skh @s t at .inj e.ac.kr
2. Dept . of St atistical Infor m ation , Cat holic Univ er sit y of Daegu , Kyungbuk , 712- 702, Kor ea .
3. Dept . of Dat a Science, Inj e Univ er sity , Kyungnam , 621- 749, Kor ea
V apn ik (1995, 1998).
S V M is b a s ed on t h e stru ctu r al risk m in im ization (S RM ) prin ciple, w h ich h a s b een sh ow n t o b e su perior t o tr adit ion al em pirical risk m inim izat ion (ERM ) prin ciple. S RM m in im izes an u pper b oun d on th e ex p ect ed risk un lik e ERM m inim izin g t h e err or on t h e t r ain in g dat a . By m inim izin g t his b ou n d , hig h g en er alization p erform an ce can b e achiev ed . B a sed on th is ob serv at ion w e b eliev e t h at S V M r egr es sion w ill perform b et t er w h en calculat in g pr ediction in t erv als th an ot h er m et h ods su ch a s n eu r al n et w ork s an d M A RS . A s w e dem on st r at e , pr edict ion int erv als for S VM r eg r es sion can b e sligh tly difficult t o obt ain . De V eaux el.
al (1998 ) com pu t ed pr edict ion int erv als for n eur al n et w ork s an d com par ed t h em w it h pr ediction int erv als b a s ed on M A RS . S ee Bish op (1995) for n eu ral n et w ork s an d M A RS .
Despit e it s su cces s es in m any real w orld application s , S V M depen d s only on a on e w eigh t solu tion r epre sen t ed in dir ect ly by t h e set of Lag r an g ian m u lt iplier s in m ak in g pr ediction s . H ow ev er , accor din g t o a Bay esian p er spectiv e, t h e w eigh t s of S V M , ev en aft er learn in g , still t ak e a cert ain post erior dist ribu t ion . T hu s , ut ilizin g j u st a on e w eig ht s olut ion a s r epr es ent at iv e n eg lect s post erior u n cert aint y in t h e w eig ht s . T his oft en lead s t o m or e ex t r em e predict ed ou tput s du rin g t estin g , an d in t u rn in dicat es an ov erly high confiden ce. T h e appr opriat e w ay t o h an dle t h e se w eig ht par am et er s is t o u tilize pr edict iv e post erior den sities oft en u s ed in Bay esian S t at istics . K w ok (1999) u sed t his idea t o g et t h e m oderat ed out pu t s for S VM cla s sificat ion un der t h e n am e of m ar gin alization r epr esen t ed by M acK ay (1992). In t his paper w e apply t his idea t o com pu tin g predict ion int erv als for S VM r eg r es sion . S ollich (2000) gu ar an t ee s th at w e can apply B ay esian m et h ods t o S VM .
T h e purpose of th is pap er is t o pre sen t a B ay esian appr oach t o estim atin g pr edict ion int erv als for S VM r egr es sion an d t o illu st rat e t h e m et h od w ith an ex am ple u sin g th e sam e r eal dat a a s in De V eau x el. al (1998 ). T h e r est of th is p aper is or g anized a s follow s . S ect ion 2 g iv es an ov erv iew of S VM regr es sion . S ection 3 briefly rev iew s t h e r elat ion sh ip b et w een th e lik elih ood prin ciple an d S V M r eg re s sion . S ect ion 4 discu s s es h ow t o com pu t e pr ediction in t erv als for S V M r eg r es sion , an d in S ect ion 5 th e m eth od is illu str at ed w it h a dat a ex am ple .
2 . S u pport V e c t or M ac hin e Re g re s s io n
Let th e tr ainin g dat a set D b e den ot ed b y { ( x
i, y ) , i = 1, . . . , n }, w ith
each input x
iR
dan d t h e out pu t y
iR . Ou r g oal is t o fin d a fun ction
f ( x ) t h at h a s at m ost dev iat ion from t h e act u ally obt ain ed t ar g et s y
i' s for all
t h e t r ain in g dat a , an d at th e sam e tim e, is a s flat a s pos sible. F latn es s h er e
m ean s t h at w e s eek sm all w . W e fir st con sider t h e ca se of lin ear r eg r es sion .
T h en , w e t ak e t h e form
f ( x ) = w
tx + b w it h w R
d, b R
w h er e su per script t r epr es ent s t h e t r an spos e of a v ect or . On e w ay t o en sur e t his is t o m inim ize th e Eu clidean n orm | | w | |
2. F orm ally w e can w rit e th is pr oblem a s a con v ex opt im ization pr ob lem by r equirin g :
m inim ize 1
2 | | w | |
2,
subj ect t o y
i- w
tx
i- b an d w
tx
i+ b - y
iT h e u n derly in g a s sum pt ion h er e is t h at th e conv ex opt im izat ion pr oblem is fea sible. S om et im es , h ow ev er , th is m ay n ot b e t h e ca s e, or w e als o m ay w an t t o allow for s om e err or s . T o m ak e it fea sib le, w e in tr odu ce slack v ariables
ian d
*
i
. H en ce w e arriv e at t h e form ulat ion st at ed in V apn ik (1995, 1998).
m in im ize 1
2 | | w | |
2+ C
n
i = 1
(
i+
*i) , (1)
subj ect t o { y
i- w
t
x
i- b +
iw
tx
i+ b - y
i+
*ii
,
*i0
T h e con st an t C > 0 det erm in es t h e t r ade off b et w een t h e flat n e s s of f an d th e am oun t up t o w hich dev iation s lar g er t h an ar e t olerat ed . H er e,
ian d
*iare slack v ariables r epr es ent in g u pper an d low er con st r ain t s on t h e out pu t s . T h e form u lat ion ab ov e corr esp on ds t o dealin g w ith V apnik ' s - in sen sitiv e los s fu n ct ion des crib ed by
| | = { 0 | | - if | ot h erw is e |
T h e k ey idea is t o con st ru ct a Lag ran g e fu n ct ion . H en ce w e pr oceed a s follow s :
L = 1
2 | | w ||
2+ C
n
i = 1
(
i+
*i) -
n
i = 1 i
( +
i- y
i+ w
tx
i+ b )
-
n i = 1
*
i
( +
*i+ y
i- w
tx
i- b ) -
n
i = 1
(
i i+
*i *i) ( 2)
W e n otice t h at th e posit iv it y con st r ain t s
i,
*i,
i,
*i0 sh ould b e sat isfied .
A ft er t akin g p artial deriv ativ es of equ at ion (2) w it h r eg ar d t o th e prim al v ariables
( w , b,
i,
*i) an d plu g g in g th em in t o (2), w e g et th e opt im ization pr oblem b elow .
m ax
, *
- 1
2
n
i , j = 1
(
i-
*i)(
j-
j*) x
tix
j-
n
i = 1
(
i+
*i) +
n
i = 1
y
i(
i-
*i) w it h con st r ain t s
n
i = 1
(
i-
*i) = 0 an d
i,
*i[ 0 , C ] .
S olv in g th e ab ov e equ ation w it h t h e se con str aint s det erm in es t h e Lag r an g e m u lt iplier s ,
i,
*i, an d t h e opt im al r egr es sion fun ct ion is giv en b y
w =
n
i = 1
(
i- '
*i) x
i, b = - 1
2 w
t[ x
r+ x
s] ,
w h er e x
ran d x
sar e supp ort v ect or s . H er e, t h e su pport v ect or s ar e dat a p oint s w h er e ex a ctly on e of t h e Lag ran g e m u lt iplier s is g r eat er t h an zero. T h erefor e, t h e opt im al r egr es sion fun ct ion can b e r ew rit t en in t h e form of
f ( x ) =
n
i = 1
(
i-
*i) x
itx + b .
A ctu ally , b can b e com pu t ed b y u sin g t h e K aru sh - Ku hn - T u ck er (K KT ) con dit ion s . S ee for det ails S m ola & S ch lk opf (1998 ).
W e n ow con sider th e ca se of n onlin ear S V M r eg r es sion . A n on lin ear m odel is u su ally r equired t o adequ at ely m odel dat a . In t his ca s e S VM r egr es sion fir st m ap s x fr om t h e in put space R
dt o z = (x ) in a hig h dim en sion al feat u re sp ace w h er e lin ear r eg re s sion perform ed . T h e k ern el appr oach is em ploy ed t o addre s s t h e cur se of dim en sion alit y . T h e n on lin ear S V M r egr es sion s olut ion , u sin g an - in s en sit iv e los s fun ction , is g iv en by
m ax
, *
- 1
2
n
i , j = 1
(
i-
*i) (
j-
j*)K ( x
i, x
j) -
n
i = 1
(
i+
*i) +
n
i = 1
y
i(
i-
*i)
w it h con st r ain t s
n
i = 1
(
i-
*i) = 0 an d
i,
*i[ 0 , C ] .
S olv in g t h e ab ov e equ ation w ith t h e se con st r ain t s det erm in es t h e Lagr an g e m u lt iplier s ,
i,
*i, an d t h e opt im al r egr es sion fun ct ion is giv en b y
f ( x ) =
n
i = 1
(
i-
*i)K ( x
i, x ) + b (3)
w h er e
b = - 1
2
n
i = 1
(
i-
*i) [ K ( x
r, x
i) + K ( x
s, x
i) ] ,
T h e differ en ce t o t h e lin ear ca se is t h at w is n o lon g er ex plicit ly g iv en . H ow ev er , it is u niqu ely defin ed in t h e w eak sen se b y th e dot pr odu ct s . A ls o n ot e t h at in t h e n onlin ear set t in g , t h e opt im izat ion pr oblem corr espon d s t o fin din g t h e flatt est fu n ct ion in th e feat ur e space , n ot in th e in pu t space. In t his pap er w e only con sider t h e n on lin ear ca se. T hu s , for a t est v ect or x R
m, w e fir st n eed t o com pu t e
a ( x , w ) = w
tz + b =
n
i = 1
(
i-
*i)K ( x
i, x ) + b,
w h er e
w
tz =
n
i = 1
(
i-
*i)K ( x
i, x ) .
3 . Lik elih o o d P rin c iple an d S V M Re g re s s io n
In th is sect ion , w e describ e t h e r elat ion ship b et w een t h e lik elih ood prin ciple an d S V M r eg re s sion . In or der t o det erm in e w , w e m inim ize (1), w h ich , for a fix ed C, is t h e sam e a s m in im izin g
| | w | |
22 C +
n
i = 1
(
i+
*i) . (4)
W e pu t = 1/ C . T h en , a m odel H , w ith a k - dim en sion al par am et er v ect or w , con sist s of it s fu n ct ion al form f , t h e distribut ion p ( D w , H ) , t h at t h e m odel m ak es ab ou t t h e dat a D , an d a prior par am et er dist ribu tion p ( w , H ) w it h a regu larization par am et er of . T h er efor e, w e h av e t h e post erior dist ribu t ion of w for a g iv en v alu e of by u sin g Bay e s ' rule :
p ( w D , , H ) ∝ p ( w , H ) p ( D w , H ) . (5)
N ow , con sider t h e follow in g pr ob ab ilit y m odel:
T h e prior ov er w is th e n orm al prior
p ( w , H ) ∝ exp ( -
2 | | w | |
2) . T h e pr ob abilit y dist rib ut ion is giv en by
p ( y
ix
i ,w , H ) = 1
2 ( 1 + ) exp ( - |
i| ) .
N ot e th at p ( y
ix
i ,w , H ) is actu ally t h e den sit y m odel for an - in sen sitiv e los s fun ct ion . S ub stitu t in g t h es e pr ob abilities in t o (5 ) an d a s su m in g t h at t h e ob serv ation s ar e i.i.d ., w e ob t ain
- log p ( w D , , H ) =
2 | | w | |
2+
n
i = 1
(
i+
*i) -
n
i = 1
log p ( x
i) + con sta n t . (6 )
T h e la st t w o t erm s on t h e rig ht do n ot dep en d on w . H en ce, b y put tin g
= 1/ C , opt im izin g (1) can b e r eg ar ded a s fin din g th e m ax im um ap p os t eriori (M A P ) est im at e w
M Pof w . M or eov er , t r adit ion al S V M regr es sion can b e con sider ed a s u sin g w
M Pa s th e s ole r epr esen t ativ e of t h e w h ole post erior dist ribu tion upon predict ion .
4 . P o s t e rior P re di ctiv e D i s trib uti on an d P re dic ti on In t erv al s
T o com put e pr edict ion int erv al for each out pu t y , w e n eed t o deriv e t h e p ost erior pr edict iv e dist ribu tion p ( y | D , x ) of y giv en t h e t r ain in g dat a s et D an d an in pu t v ect or x . W e fir st a s sum e th at th e post erior dist ribu t ion of w can b e approx im at ed by a sin g le n orm al dist ribu t ion at w
M P. S in ce a ( x , w ) = w
tz + b , th e post erior dist rib ut ion of a ( x , w ) w ill als o b e n orm al N ( a
M P ,s
2) , w ith m ean a
M P( x ) = a ( x , w
M P) an d v arian ce
s
2( x ) = z
tA
- 1z , (7)
w h er e A =
2M =
2(
2 | | w | |
2+
n
i = 1
(
i+ '
*i) ) is t h e H es sian . T hu s , w e can
g et th e post erior pr edictiv e den sit y giv en b elow
p ( y D , x ) = p ( y a , D , x ) p ( a D , x ) da
= 1
2 ( 1 + ) 2 s e
- |y - a |e
- ( a - aM P)2/ 2 s2
d a
= 1
2 ( 1 + ) [ e
- y + + s3+ 2 aM P
2
( y - - a s
M P- s
2)
+ e
y + +s2- 2 aM P
2
( 1 - ( y - - a s
M P- s
2))
+ ( y + s - a
M P) - ( y - s - a
M P)] ,
w h er e is th e distribut ion fun ct ion of a st an dar d n orm al distribut ion . A plot of t h e post erior predict iv e dist rib ut ion is sh ow n in F igu re 1. T his plot can b e m ade aft er w e t r ain S VM r eg r es sion for fix ed v alu es of C, , an d a k ern el param et er . If an int erv al sum m ary is desir ed , a cent r al in t erv al of post erior pr edictiv e pr ob ab ilit y , w hich corr espon d s , in t h e ca s e of a 100 (1- a )% in t erv al, t o t h e ran g e of v alu es ab ov e an d b elow w hich lies ex actly 100 (a/ 2 )% of t h e post erior pr edict iv e pr ob ab ilit y can b e calcu lat ed . S u ch in t erv al estim at es ar e r eferred t o a s pr edict ion int erv als .
N ow , it is left t o illu st r at e h ow t o com put e s
2( x ) . T o com pu t e s
2( x ) in (7 ), w e h av e t o det erm in e t h e H es sian A . W e alr eady kn ow t h at com put in g t h e H es sian m at rix for n eur al n et w ork is v ery com plicat ed . But , w e w ill see th at it is m u ch sim pler t o com pu t e t h e H es sian m at rix for S V M regr es sion . W e k n ow t h at
i ,
'
*im ea sur es t h e differ en ce b et w een y
ian d w
tz
i+ b. H er e, w e u s e z
ifor b ot h lin ear an d n on lin ear ca ses . T hu s , w e h av e
i
= step ( y
i- a
i) ( y
i- a
i- )
*
i
= step ( a
i- y
i) ( a
i- y
i- )
F igu r e 1: P ost er pr edict iv e den sity an d S t an dar d n orm al den sit y
w h er e a
i= w
tz
i+ b an d step ( x ) is th e st ep fun ction . It sh ould b e n ot ed th at step ( x ) is n ot different iab le. T h u s , w e r eplace it by th e sig m oid fun ction ( x ) = 1/ (1 + e
- x) . S in ce a
i= z
ian d
2a
i= O , w e obt ain
2
i
= r ( y
i- a
i) z
iz
tian d
2 *i= r ( a
i- y
i) z
iz
it. H er e, O is th e zer o m at rix an d r ( x ) = ( x - ) ″( x ) + 2 ′( x ) . F in ally , w e h av e A = I + B w h er e
B =
n
i = 1
r
iz
iz
tian d r
ir ( | y
i- a
i|) .
W e n ow com put e t h e eig env alu es
kof B . F or t h e lin ear ca s e, w e put z
i= x
i.. T h en , com put in g eig en v alu es
kis st raig ht for w ar d. H ow ev er , w e n eed s om et hin g m or e for t h e n on lin ear ca s e. W e n ow w ill ex plain in det ail h ow t o com pu t e eig en v alu es
kfor t h e n onlin ear ca s e. W e kn ow th e t r ain in g alg orith m dep en d s on s om e k ern el fu n ct ion . W e pu t z
i= ( x
i) for som e m appin g . Let
v
kb e t h e eig env ect or s of B . T h en w e h av e v
k=
n
i = 1
u
k iz
ifor som e
con st ant s u
k ian d
kz '
jtv
k= z
jtB v
k .. T his lead s t o
kK u
k =K K u
k ,, w h er e u
k= ( u
k 1 ,. . . , u
k nt) , K is t h e n × n m at rix w it h elem en t s z
tiz
j= K ( x
i ,x
j) an d K is an ot h er n × n m at rix w it h elem en t s r
iz
tiz
j= r
iK ( x
i ,x
j) . In g en er al, w e can a s sum e t h at K is in v ert ible.
H en ce, w e h av e
ku
k= K u
k ,an d obt ain t h e eig en v alu es
kof A a s
k
= +
k .U sin g t h es e re sult s , A can b e appr ox im at ed a s
A P A P
t, (8)
w h er e
P = ( v
1 ,. . . , v
m) = (
n
n = 1
u
1iz
i ,. . . ,
n
i = 1
u
m iz
i) , (9)
an d
= d iag ( +
1 ,. . . , +
m) (10)
H er e, th e v
k' s ar e orth og on al an d m is t h e nu m b er of sign ificant eig env alu es in t h e n × n m at rix K .
Ut ilizin g (8 ), (9 ), (10) an d th e fa ct t h at P
- 1= P
t, s
2( x ) in (7) can t h en b e com pu t ed a s
s
2= z
tA
- 1z = z
tP
- 1P z = ( P
tz )
t - 1( P
tz )
=
n i = 1