2002, Vol. 13, No. 2, pp. 271~284
Robust inference for linear regression model based on weighted least squares 1)
Jin-Pyo Park 2)
Abstract
In this paper we consider robust inference for the parameters of the linear regression model based on weighted least squares. First we consider the sequential test of multiple outliers. Next we suggest a way to assign a weight to each observation $(x_i, y_i)$ and recommend robust inference for the linear model. Finally, to check the performance of the confidence interval for the slope using the proposed method, we conducted a Monte Carlo simulation and present some numerical results and examples.
Key Words and Phrases: Least median of squares, Outliers test, Weighted least squares
1. INTRODUCTION
We consider robust inference for the parameters of the linear regression model
$$ y_i = \beta_0 + x_{i1}\beta_1 + \cdots + x_{ip}\beta_p + e_i, \quad i = 1, 2, \ldots, n \tag{1} $$
where the error $e_i$ is assumed to be normally distributed with mean zero and variance $\sigma^2$. The least squares estimate $\hat\beta$ of $\beta$ and its standard error $SE(\hat\beta)$ are sensitive to outliers. Hence inferences for the parameters of the linear regression model using $\hat\beta$ and $SE(\hat\beta)$ are affected by outliers. To remedy this problem, many statistical methods have been developed.
In this paper, we consider a tool for identifying and testing outliers in the linear regression model. This tool is based on the ratio of a robust scale estimate and a nonrobust scale estimate. We then propose a forward sequential procedure for identifying the outliers. Next we consider a method to assign a weight to each observation $(x_i, y_i)$, using a nonincreasing function of the test statistic as the weight. Finally, we apply a weighted least squares analysis to introduce robust inference for the linear regression model. The weighted least squares has a high breakdown point and is efficient in a statistical sense. Hence, inferences for the parameters of the linear regression model using weighted least squares are not affected by outliers.

1) Research funded by Kyungnam University, 2002.
2) Professor, Division of Information & Communication Engineering, Kyungnam University.
The remainder of the paper is organized as follows. In Section 2 we introduce the tool for identifying and testing outliers in the linear regression model, suggest the method to assign a weight to each observation $(x_i, y_i)$, and propose the robust inference for the parameters of the linear regression model. In Section 3 we consider the coverage and median length of the confidence interval for the slope by means of Monte Carlo simulation. In Section 4 we apply the proposed method to several real data sets to check its performance. Section 5 contains some concluding remarks.
2. The Robust inference for the parameters of the linear regression model
We suggest the weighted least squares regression based on the sequential outliers test proposed by Jinpyo Park and Heechang Park (2001). First we recall the definition of the sequential outliers test. The test statistic is defined as follows. Least median of squares, proposed by Rousseeuw (1984), minimizes the median of the squared residuals. Least median of squares regression has a very high breakdown point of almost 50%. The least median of squares estimator $\hat\beta_{LMS}$ is given by
$$ \underset{\hat\beta_J}{\text{Minimize}}\ \operatorname{med}_i\, r_i^2 \tag{2} $$
where $r_i = y_i - x_i \hat\beta_J$, $\hat\beta_J = (X_J^T X_J)^{-1} X_J^T Y_J$, $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $J = \{i_1, i_2, \ldots, i_p\}$ is a subset of $\{1, 2, \ldots, n\}$ containing $p$ indices. The residual is given by
$$ r_i^{LMS} = y_i - x_i \hat\beta_{LMS}. \tag{3} $$
The initial scale estimate $s_0$ for the least median of squares regression is given by
$$ s_0 = 1.4826 \left( 1 + \frac{5}{n - p - 1} \right) \sqrt{\operatorname{med}_i\, (r_i^{LMS})^2}. \tag{4} $$
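The search in (2) and the initial scale in (4) can be sketched for the simplest case. The sketch below assumes one explanatory variable plus an intercept, so each elemental fit is the line through two observations; this pairing, the exhaustive search over all pairs, and the function names `lms_line` and `initial_scale` are illustrative assumptions, not the paper's implementation. The square root in `initial_scale` follows the reading of (4) used above.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def lms_line(x, y):
    """Approximate LMS fit of y = b0 + b1*x.

    Assumption: candidate fits are the exact lines through every pair of
    observations; the one with the smallest median squared residual wins.
    Returns (intercept, slope, median_squared_residual).
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line has no finite slope
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        crit = median((yk - intercept - slope * xk) ** 2 for xk, yk in zip(x, y))
        if best is None or crit < best[0]:
            best = (crit, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2], best[0]


def initial_scale(med_sq_resid, n, p):
    """Initial scale s0 as in (4), from the median squared LMS residual."""
    return 1.4826 * (1 + 5 / (n - p - 1)) * sqrt(med_sq_resid)
```

The exhaustive pair search costs on the order of n^3 operations, which is acceptable for the sample sizes in Table 1 (15 to 50); for larger samples a random subset of pairs is the usual shortcut.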
The initial scale estimate is then used to determine a weight $w_i$ for the $i$th observation, namely
$$ w_i = \begin{cases} 1 & \text{if } c \le r_i^{LMS}/s_0 \le d \\ 0 & \text{otherwise} \end{cases} \tag{5} $$
where $[c, d]$ is the inner fence of the box plot of $r_i^{LMS}/s_0$.
By means of these weights, the final scale estimate $s$ for the least median of squares regression is given by
$$ s = \sqrt{ \sum_{i=1}^{n} w_i (r_i^{LMS})^2 \Big/ \Big( \sum_{i=1}^{n} w_i - p - 1 \Big) }. \tag{6} $$
$s$ also has a breakdown point of 0.5, the highest possible value.
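A minimal sketch of (5) and (6) follows. The source does not spell out the inner fence, so the code assumes the common box-plot convention $[Q_1 - 1.5\,IQR,\ Q_3 + 1.5\,IQR]$ with quartiles from Python's `statistics.quantiles` (inclusive method); both the convention and the function names are assumptions made for illustration.

```python
from math import sqrt
from statistics import quantiles


def fence_weights(std_resid):
    """0/1 weights as in (5) for standardized LMS residuals r_i / s0.

    Assumption: the inner fence is [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    """
    q1, _, q3 = quantiles(std_resid, n=4, method="inclusive")
    iqr = q3 - q1
    c, d = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [1 if c <= u <= d else 0 for u in std_resid]


def final_scale(resid, weights, p):
    """Final scale s as in (6): weighted residual sum of squares over (sum w - p - 1)."""
    num = sum(w * r * r for w, r in zip(weights, resid))
    return sqrt(num / (sum(weights) - p - 1))
```

For example, with residuals `[-1.0, -0.5, 0.0, 0.5, 1.0, 10.0]` and $s_0 = 1$, only the last observation falls outside the fence and receives weight 0.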
By contrast, the least squares estimator $\hat\beta_{LS}$ minimizes
$$ \sum_{i=1}^{n} r_i^2. \tag{7} $$
The breakdown point of the least squares estimator is 0. The residual is given by
$$ r_i^{LS} = y_i - x_i \hat\beta_{LS}. \tag{8} $$
It is well known that outliers can have an extreme effect on the least squares estimator.
The scale estimate for the least squares regression is given by
$$ \hat\sigma = \sqrt{ \sum_{i=1}^{n} (r_i^{LS})^2 \Big/ (n - p - 1) }. \tag{9} $$
The test statistic for testing the outliers is defined as
$$ R = \hat\sigma / s. \tag{10} $$
It tests the following hypotheses:
$$ \begin{aligned} H_0 &: \text{no outlier in data } (x_{i1}, x_{i2}, \ldots, x_{ip}, y_i),\ i = 1, 2, \ldots, n \\ H_1 &: \text{some outliers in data } (x_{i1}, x_{i2}, \ldots, x_{ip}, y_i),\ i = 1, 2, \ldots, n. \end{aligned} \tag{11} $$
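The nonrobust scale in (9) and the ratio in (10) can be sketched as follows for a single explanatory variable with intercept. The closed-form least squares fit and the names `ls_fit`, `ls_scale` and `outlier_ratio` are illustrative assumptions; the robust scale $s$ is taken as an input computed elsewhere.

```python
from math import sqrt


def ls_fit(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def ls_scale(x, y, p=1):
    """Least squares scale estimate as in (9), with divisor n - p - 1."""
    b0, b1 = ls_fit(x, y)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return sqrt(rss / (len(x) - p - 1))


def outlier_ratio(sigma_hat, s):
    """Test statistic R = sigma_hat / s as in (10); large values suggest outliers."""
    return sigma_hat / s
```

Because outliers inflate the least squares scale but leave the robust scale nearly unchanged, the ratio grows when outliers are present, which is why the null hypothesis is rejected for large $R$.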
The null hypothesis is rejected for large $R$. However, if the null hypothesis is rejected, there is no indication of how many or which points are outliers. To solve this problem, we apply the test sequentially, in a forward sequential procedure, to identify the outliers. If the test rejects the null hypothesis, then the point with the largest $D = |\mathrm{sort}(r_i^{LMS}) - \mathrm{Med}(r_i^{LMS})|$ is defined as an outlier, where $\mathrm{sort}(r_i^{LMS})$ is the sorted $r_i^{LMS}$ and $\mathrm{Med}(r_i^{LMS})$ is the median of $r_i^{LMS}$. The observation detected as an outlier is removed and the test is applied again to the $n - 1$ remaining observations. The procedure is repeated and stops when the test is no longer significant.
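The forward sequential procedure described above can be sketched as a loop. To keep the sketch self-contained, the fitting step is passed in as a hypothetical callback `ratio_and_residuals(active)` that returns the statistic $R$ and the LMS residuals for the currently active observations, and `critical_value(n)` is a hypothetical lookup of the critical value for sample size $n$; neither name comes from the source.

```python
from statistics import median


def forward_outlier_search(indices, ratio_and_residuals, critical_value):
    """Remove one suspected outlier at a time until the test is not significant.

    ratio_and_residuals(active) -> (R, {index: LMS residual})  # hypothetical callback
    critical_value(n) -> threshold for a sample of size n       # hypothetical callback
    Returns (flagged, remaining) lists of indices.
    """
    active = list(indices)
    flagged = []
    while len(active) > 2:
        R, resid = ratio_and_residuals(active)
        if R <= critical_value(len(active)):
            break  # test no longer significant: stop
        med = median(resid[i] for i in active)
        # the point whose LMS residual lies farthest from the median residual
        worst = max(active, key=lambda i: abs(resid[i] - med))
        flagged.append(worst)
        active.remove(worst)
    return flagged, active
```

In a full implementation the callback would refit the least median of squares and least squares regressions on the remaining observations at every step, since both scale estimates change once a point is removed.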
The critical values for the test (approximated by Monte Carlo simulation using 1000 replicates) are presented in Table 1.
Table 1. Critical values for the proposed test

                      Number of explanatory variables
Sample            1                     2                     3                     4
size            level                 level                 level                 level
          0.1   0.05   0.01     0.1   0.05   0.01     0.1   0.05   0.01     0.1   0.05   0.01
  15     1.725  1.894  2.072   2.107  2.223  2.386   2.469  2.622  2.756   2.807  2.895  2.992
  20     1.484  1.637  1.849   1.850  1.978  2.084   2.121  2.246  2.334   2.323  2.407  2.580
  25     1.493  1.605  1.759   1.682  1.793  1.853   1.950  2.044  2.200   2.164  2.282  2.388
  30     1.461  1.570  1.717   1.552  1.638  1.752   1.824  1.921  2.065   1.982  2.150  2.333
  35     1.395  1.475  1.623   1.496  1.578  1.688   1.650  1.793  1.925   1.786  1.910  2.103
  40     1.326  1.403  1.493   1.417  1.487  1.580   1.573  1.666  1.774   1.654  1.769  1.882
  45     1.276  1.337  1.435   1.393  1.473  1.570   1.456  1.548  1.655   1.575  1.688  1.812
  50     1.266  1.338  1.403   1.351  1.425  1.515   1.471  1.492  1.575   1.466  1.540  1.631
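For use in code, the critical values can be held in a small lookup. The sketch below transcribes only the columns of Table 1 for one explanatory variable; the names `CRITICAL_P1`, `critical_value` and `rejects_null` are hypothetical, and no interpolation between tabulated sample sizes is attempted because the source does not describe one.

```python
# Critical values from Table 1, ONE explanatory variable only.
# Outer key: sample size n. Inner key: significance level.
CRITICAL_P1 = {
    15: {0.1: 1.725, 0.05: 1.894, 0.01: 2.072},
    20: {0.1: 1.484, 0.05: 1.637, 0.01: 1.849},
    25: {0.1: 1.493, 0.05: 1.605, 0.01: 1.759},
    30: {0.1: 1.461, 0.05: 1.570, 0.01: 1.717},
    35: {0.1: 1.395, 0.05: 1.475, 0.01: 1.623},
    40: {0.1: 1.326, 0.05: 1.403, 0.01: 1.493},
    45: {0.1: 1.276, 0.05: 1.337, 0.01: 1.435},
    50: {0.1: 1.266, 0.05: 1.338, 0.01: 1.403},
}


def critical_value(n, level):
    """Tabulated critical value; raises ValueError for untabulated (n, level)."""
    try:
        return CRITICAL_P1[n][level]
    except KeyError:
        raise ValueError(f"no tabulated value for n={n}, level={level}") from None


def rejects_null(R, n, level=0.05):
    """True when the ratio R exceeds the tabulated critical value."""
    return R > critical_value(n, level)
```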
Next we suggest a way to assign a weight $w_i$ to each observation $(x_i, y_i)$. For this purpose, we can use several types of functions of the test statistic $R$. The first kind of weight function that we consider here is of the form
$$ w(R_i) = \begin{cases} 1 & \text{if } R_i \le c_1 \\ 0 & \text{otherwise} \end{cases} \tag{12} $$
where $c_1$ is the critical value for the test statistic at significance level 0.1. This weight function, yielding only binary weights, produces a clear distinction between accepted and rejected points. But this function is drastic, so we introduce a weight function that is less extreme. It introduces a linear part that smooths the transition from weight 1 to weight 0.
In that way, extreme outliers disappear entirely and intermediate cases are gradually down-weighted. The general formula is
$$ w(R_i) = \begin{cases} 1 & \text{if } R_i \le c_1 \\ \dfrac{c_2 - R_i}{c_2 - c_1} & \text{if } c_1 < R_i \le c_2 \\ 0 & \text{otherwise} \end{cases} \tag{13} $$
where $c_2$ is the critical value for the test statistic at significance level 0.01.
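The two weight functions (12) and (13) translate directly into code. The function names are illustrative, and the handling of the boundary points (weight 1 at exactly $c_1$, weight 0 at exactly $c_2$) is an assumption, since the inequality signs did not survive in the source.

```python
def hard_weight(R_i, c1):
    """Binary weight as in (12): keep the observation only if R_i <= c1."""
    return 1.0 if R_i <= c1 else 0.0


def smooth_weight(R_i, c1, c2):
    """Piecewise-linear weight as in (13), decreasing from 1 at c1 to 0 at c2."""
    if R_i <= c1:
        return 1.0
    if R_i <= c2:
        return (c2 - R_i) / (c2 - c1)
    return 0.0
```

Both functions are nonincreasing in $R_i$. The smooth version agrees with the binary one outside the interval between $c_1$ and $c_2$ and interpolates linearly inside it.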
We then apply weighted least squares, defined by
$$ \text{Minimize} \sum_{i=1}^{n} w(R_i)\, r_i^2. \tag{14} $$
The weighted least squares estimator is given by
$$ \hat\beta^* = (X^T W^T W X)^{-1} X^T W^T W Y \tag{15} $$
where $W = \mathrm{diag}(w_1^{1/2}, w_2^{1/2}, \ldots, w_n^{1/2})$.
Let $W^T W = \mathrm{diag}(w_1, w_2, \ldots, w_n) = V$; then
$$ \hat\beta^* = (X^T V X)^{-1} X^T V Y, \qquad E(\hat\beta^*) = \beta \tag{16} $$
and $\mathrm{Var}(\hat\beta^*) = (X^T V X)^{-1} \sigma^2$.
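The mean and variance stated above can be checked in two lines. The derivation treats the weights in $V$ as fixed constants, which is an assumption made for this sketch: in the procedure itself the weights are computed from the data.

```latex
\begin{aligned}
E(\hat\beta^*) &= (X^T V X)^{-1} X^T V\, E(Y)
               = (X^T V X)^{-1} X^T V X \beta = \beta, \\
\mathrm{Var}(\hat\beta^*) &= (X^T V X)^{-1} X^T V\, \mathrm{Var}(Y)\, V X (X^T V X)^{-1}
               = \sigma^2 (X^T V X)^{-1} X^T V^2 X (X^T V X)^{-1}.
\end{aligned}
```

With 0/1 weights, as in (12), $V^2 = V$ and the last expression reduces to $(X^T V X)^{-1} \sigma^2$, the form stated in the text.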
The standard error of the $i$-th weighted least squares estimator is given by
$$ \sqrt{\sigma^2 (X^T V X)^{-1}_{ii}}, \tag{17} $$
where the unknown $\sigma^2$ is estimated by
$$ (s^*)^2 = \sum_{i=1}^{n} w(R_i)\, r_i^2 \Big/ \Big( \sum_{i=1}^{n} w(R_i) - p \Big). $$
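A minimal sketch of the weighted fit, the variance estimate $(s^*)^2$ and the standard errors in (17) follows, again for one explanatory variable with intercept so that $X^T V X$ is a 2 by 2 matrix that can be inverted by hand. The function name `wls_simple` is hypothetical, and the divisor constant `p` is passed in by the caller because the source writes the divisor as $\sum w(R_i) - p$; using `p=2` for an intercept-plus-slope fit in the usage note is an assumption.

```python
from math import sqrt


def wls_simple(x, y, w, p):
    """Weighted least squares for y = b0 + b1*x with weights w.

    Returns (b0, b1, s2_star, se_b0, se_b1), where s2_star uses the
    divisor sum(w) - p and the standard errors use the diagonal of
    (X^T V X)^{-1}.
    """
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx ** 2  # determinant of X^T V X
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    rss = sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    s2_star = rss / (sw - p)
    inv00, inv11 = swxx / det, sw / det  # diagonal of (X^T V X)^{-1}
    return b0, b1, s2_star, sqrt(s2_star * inv00), sqrt(s2_star * inv11)
```

For instance, `wls_simple([0, 1, 2, 3, 4], [1, 3, 5, 7, 100], [1, 1, 1, 1, 0], p=2)` ignores the last observation entirely and recovers the line through the first four points.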
To discuss robust inference for the linear regression model, we assume that the errors are independently and normally distributed with mean zero and variance $\sigma^2$. Under these conditions, it is well known that
$$ \frac{\hat\beta_i^* - \beta_i}{\sqrt{(s^*)^2 (X^T V X)^{-1}_{ii}}}, \quad i = 1, 2, \ldots, p \tag{18} $$
has a Student $t$-distribution with $\sum_{i=1}^{n} w(R_i) - p$ degrees of freedom. Let us denote the $1 - \alpha/2$ quantile of this distribution by $t_{\sum_{i=1}^{n} w(R_i) - p,\ 1 - \alpha/2}$. Then a $(1 - \alpha)100\%$ confidence interval for $\beta_i$ is given by
[ * i - t
ni = 1
w (R
i) - p , 1 - 2
( s * ) 2 ( X T VX ) - 1 ii , * i + t
ni = 1