2002, Vol. 13, No. 2, pp. 167~174
Analysis of Incomplete Data with Nonignorable Missing Values
Hyun-Jeong Kim 1)
Abstract
In the case of "nonignorable missing data", it is necessary to assume a model for the missing-data mechanism appropriate to each situation. For example, in a survey of individuals' income amounts we may assume a model in which the larger the value is, the higher the probability that it is missing. In this article the likelihood is maximized with the EM (Expectation and Maximization) algorithm, based on a missing-data mechanism for data from an exponential distribution. The method was started from arbitrary initial values and converged in a few iterations. We varied the missing-data probability and the size of the artificial data set to show the accuracy of the estimates. We then discuss the properties of the estimates.
Key words : EM-algorithm, Maximum likelihood, Mechanism, Missing data
1. Introduction
When the conditional probability of missingness given the observed values depends on the missing values, the missing mechanism is said to be nonignorable. In such cases we have to take the missing mechanism into account in the statistical analysis of data with missing values. The aim of the present paper is to show some examples of statistical analysis based on the maximum likelihood (ML) method for data with nonignorable missing values, and to study the validity of such kinds of analysis.
In section 2 the likelihood function is derived for data with nonignorable missing values. In section 3 we consider a specific model for univariate data taken from an exponential distribution with missing values. In this case the missing probability is assumed to increase as the value increases. Then, in section 4 we explain the method for obtaining the ML estimates, i.e., the EM algorithm in the general case. In section 5 we run simulations in which the missing-data probability and the artificial data size are each changed in three steps to show the accuracy of the estimates, and then discuss the properties of the estimates.
The results indicate that, when the data are subject to a nonignorable missing-data mechanism, the estimates appear to be unbiased provided the assumed model includes the missing-data mechanism.
1) Full-time Lecturer, Department of General Education, Silla University, Busan 617-736, Korea.
E-mail : semikim@silla.ac.kr
2. Likelihood for a sample with missing values
We write $Y = (Y_{obs}, Y_{mis})$ without any loss of generality, where $Y_{obs}$ and $Y_{mis}$ indicate the observed and missing parts of $Y$, respectively.
Let $R$ be the pattern of the missing values, that is, $R$ is the observed vector of random variables $R_i$, which is defined as
$$R_i = \begin{cases} 1, & \text{if } y_i \text{ is observed}, \\ 0, & \text{if } y_i \text{ is missing}. \end{cases}$$
Then, we can formulate a model with missing data in terms of a probability distribution for $Y$ with density $f(Y \mid \theta)$, indexed by an unknown parameter $\theta$, and a probability distribution $g(R \mid Y, \lambda)$ for $R$, where $\lambda$ is the parameter appearing in the conditional density of $R$ given $Y$. The likelihood function with missing data is defined to be any function of $\theta$ and $\lambda$ proportional to the density of $(Y_{obs}, R)$, obtained by integrating the joint density $h$ of $R$ and $Y$ over $Y_{mis}$:
$$\begin{aligned}
L(\theta, \lambda \mid R, Y_{obs}) &\propto h(Y_{obs}, R \mid \theta, \lambda) \\
&= \int_{-\infty}^{\infty} \prod_{i=1}^{n} h(y_i, r_i \mid \theta, \lambda)\, dY_{mis} \\
&= \prod_{i=1}^{m} f(y_i \mid \theta)\, g(r_i \mid y_i, \lambda) \prod_{j=m+1}^{n} g_1(r_j \mid \theta, \lambda),
\end{aligned}$$
where $m$ is the number of observed units, $n - m$ is the number of missing units, and $g_1(r \mid \theta, \lambda)$ is the marginal density of $R$, i.e.,
$$g_1(r \mid \theta, \lambda) = \int_{-\infty}^{\infty} f(y \mid \theta)\, g(r \mid y, \lambda)\, dy .$$
Note that the joint density of $Y$ and $R$ can be decomposed into
$$h(y, r \mid \theta, \lambda) = f(y \mid \theta)\, g(r \mid y, \lambda)$$
(see Little and Rubin (1987), and Hogg and Tanis (1993)). This decomposition will be used for computing the expectation of the above likelihood.
3. The Case of Exponential Distribution
In this section we consider the case where $y$ follows an exponential distribution and the missing-data mechanism is defined by
$$\Pr(R = r \mid y, \lambda) = g(r \mid y, \lambda) = \begin{cases} 1 - e^{-y/\lambda}, & r = 0, \\ e^{-y/\lambda}, & r = 1. \end{cases}$$
The missing probability is displayed in Figure 3.1. As Figure 3.1 shows, the missing probability increases and tends to 1 as $y$ becomes larger.
Figure 3.1 Assumed missing data mechanism $g(r = 0 \mid y, \lambda)$
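As a small stand-in for the curve in Figure 3.1, the following sketch tabulates the missing probability for a few values of $y$. It assumes the mechanism has the form $g(r = 0 \mid y, \lambda) = 1 - e^{-y/\lambda}$; the function name `missing_prob`, the grid of $y$ values, and the choice $\lambda = 3.0$ are illustrative assumptions rather than part of the original analysis.

```python
import math

def missing_prob(y, lam):
    """Probability that a value y is missing, assuming g(r=0 | y, lam) = 1 - exp(-y / lam)."""
    return 1.0 - math.exp(-y / lam)

lam = 3.0  # illustrative value of the mechanism parameter
for y in (0.0, 0.5, 1.0, 3.0, 10.0, 30.0):
    # The printed probabilities rise with y and approach 1.
    print(f"y = {y:5.1f}   P(missing) = {missing_prob(y, lam):.4f}")
```

Under this assumed form a value of zero is never missing, and very large values are almost always missing.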
Also suppose that the observations $Y_1, \ldots, Y_n$ are obtained as a random sample from an exponential distribution, i.e.,
$$f(y \mid \theta) = \frac{1}{\theta}\, e^{-y/\theta} .$$
Then, the joint distribution $h(y, r \mid \theta, \lambda)$ of $y$ and $r$ is obtained as
$$h(y, r \mid \theta, \lambda) = f(y \mid \theta)\, g(r \mid y, \lambda) = \begin{cases} \dfrac{1}{\theta}\, e^{-y/\theta} \left(1 - e^{-y/\lambda}\right), & r = 0, \\[2mm] \dfrac{1}{\theta}\, e^{-y/\theta}\, e^{-y/\lambda}, & r = 1, \end{cases}$$
and, therefore, the likelihood function with nonignorable missing data is given by
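$$\begin{aligned}
L(\theta, \lambda \mid y_1, \ldots, y_m, r_1, \ldots, r_m, r_{m+1}, \ldots, r_n)
&= \prod_{i=1}^{m} h(y_i, r_i \mid \theta, \lambda) \prod_{j=m+1}^{n} g_1(r_j \mid \theta, \lambda) \\
&= \prod_{i=1}^{m} \frac{1}{\theta}\, e^{-y_i/\theta}\, e^{-y_i/\lambda} \prod_{j=m+1}^{n} \frac{\theta}{\theta + \lambda} \\
&= \left(\frac{1}{\theta}\right)^{m} \left(\frac{\theta}{\theta + \lambda}\right)^{n-m} \exp\left(-\frac{\theta + \lambda}{\theta \lambda} \sum_{i=1}^{m} y_i\right),
\end{aligned}$$
where
$$\begin{aligned}
g_1(r_j \mid \theta, \lambda) &= \int f(y_j \mid \theta)\, g(r_j = 0 \mid y_j, \lambda)\, dy_j \\
&= \int_{0}^{\infty} \frac{1}{\theta}\, e^{-y_j/\theta} \left(1 - e^{-y_j/\lambda}\right) dy_j \\
&= \frac{\theta}{\theta + \lambda} .
\end{aligned}$$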
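The closed form of the marginal missing probability can be checked numerically. The sketch below compares $\theta/(\theta + \lambda)$ with a midpoint-rule approximation of $\int_0^\infty \frac{1}{\theta} e^{-y/\theta}(1 - e^{-y/\lambda})\,dy$, writing $\theta$ for the exponential mean and $\lambda$ for the mechanism parameter. The function names, the truncation of the integral at $60\theta$, and the number of grid steps are assumptions made only for this illustration.

```python
import math

def marginal_missing_prob(theta, lam):
    """Closed form theta / (theta + lam) for the marginal probability of a missing value."""
    return theta / (theta + lam)

def marginal_missing_prob_numeric(theta, lam, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of f(y|theta) * g(r=0|y,lam) on [0, upper*theta]."""
    h = upper * theta / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h  # midpoint of the k-th subinterval
        total += math.exp(-y / theta) / theta * (1.0 - math.exp(-y / lam))
    return total * h

theta, lam = 1.0, 3.0
print("closed form:", marginal_missing_prob(theta, lam))
print("numeric    :", marginal_missing_prob_numeric(theta, lam))
```

The two values should agree to several decimal places, since the neglected tail beyond $60\theta$ is of order $e^{-60}$.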
4. Estimation of Parameter Vectors $\theta$ and $\lambda$
To obtain the ML estimates $\hat{\theta}$ and $\hat{\lambda}$ we apply the so-called EM (Expectation and Maximization) algorithm proposed by Dempster, Laird and Rubin (1977). The EM algorithm is a very general iterative algorithm for ML estimation in incomplete data problems. In the EM algorithm for such models, the missing sufficient statistics rather than the individual observations need to be estimated at each iteration of the algorithm (see McLachlan and Krishnan (1997), and Dodge (1985)). Even in the case of the exponential distribution the formulation becomes complex when a missing-data mechanism exists.
Each iteration of the EM algorithm consists of an E step and an M step, and its construction is as follows.
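Step 1. Set initial estimates $\theta^{(0)}$ and $\lambda^{(0)}$ of $\theta$ and $\lambda$, respectively.
Step 2. (E step) We compute the expected value of the (joint) sufficient statistic $\sum_{i=1}^{n} y_i$:
$$\begin{aligned}
E\left(\sum_{i=1}^{n} y_i \,\Big|\, \theta^{(t)}, \lambda^{(t)}, R, Y_{obs}\right)
&= E\left(\sum_{i=1}^{m} y_i \,\Big|\, \theta^{(t)}, \lambda^{(t)}, R, Y_{obs}\right) + E\left(\sum_{j=m+1}^{n} y_j \,\Big|\, \theta^{(t)}, \lambda^{(t)}, R, Y_{obs}\right) \\
&= \sum_{i=1}^{m} y_i + \sum_{j=m+1}^{n} \bar{y}_j^{(t)},
\end{aligned}$$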
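where
$$\begin{aligned}
\bar{y}_j^{(t)} &= \int_{0}^{\infty} y_j\, f_1(y_j \mid r_j = 0, \theta^{(t)}, \lambda^{(t)})\, dy_j \\
&= \int_{0}^{\infty} y_j\, \frac{\dfrac{1}{\theta^{(t)}}\, e^{-y_j/\theta^{(t)}} \left(1 - e^{-y_j/\lambda^{(t)}}\right)}{\theta^{(t)} / \left(\theta^{(t)} + \lambda^{(t)}\right)}\, dy_j \\
&= \theta^{(t)} + \frac{\theta^{(t)} \lambda^{(t)}}{\theta^{(t)} + \lambda^{(t)}} \qquad (j = m+1, \ldots, n) .
\end{aligned}$$
Step 3. (M step) Calculate the estimate
$$\theta^{(t+1)} = \left(\sum_{i=1}^{m} y_i + (n - m)\, \bar{y}_j^{(t)}\right) \Big/ n$$
and solve for the estimate $\lambda^{(t+1)}$, using the complete-data sufficient statistic $\sum_{i=1}^{n} y_i$ found in the E step.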
Step 4. If converged, $\theta^{(t)}$ and $\lambda^{(t)}$ are regarded as the ML estimates, i.e., $\hat{\theta} = \theta^{(t)}$ and $\hat{\lambda} = \lambda^{(t)}$. Otherwise go back to Step 2 after putting $t := t + 1$.
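Part of the steps above can be sketched in code. The sketch below covers only the E step and the update of the exponential mean $\theta$: the update equation for the mechanism parameter $\lambda$ is not written out in the text, so $\lambda$ is held fixed here, which is a simplifying assumption and not the full procedure. The function names and the inputs (79 observed values out of 100 with an observed sum of 50.0) are hypothetical. The E step uses the conditional mean $\theta + \theta\lambda/(\theta + \lambda)$ of a missing value.

```python
import math

def conditional_mean_missing(theta, lam):
    """E[y | r = 0]: expected value of a missing observation under the current parameters."""
    return theta + theta * lam / (theta + lam)

def observed_loglik(theta, lam, sum_obs, m, n):
    """Log of (1/theta)^m * (theta/(theta+lam))^(n-m) * exp(-(1/theta + 1/lam) * sum_obs)."""
    return (-m * math.log(theta)
            + (n - m) * math.log(theta / (theta + lam))
            - (1.0 / theta + 1.0 / lam) * sum_obs)

def em_theta(sum_obs, m, n, lam, theta0=0.01, eps=1e-5, max_iter=1000):
    """Iterate the E step and the theta update (lam fixed) until the change in theta is below eps."""
    theta = theta0
    history = [theta]
    for _ in range(max_iter):
        y_bar = conditional_mean_missing(theta, lam)   # E step
        theta_new = (sum_obs + (n - m) * y_bar) / n    # M step for theta
        history.append(theta_new)
        if abs(theta_new - theta) < eps:
            break
        theta = theta_new
    return history

# Hypothetical inputs: m = 79 observed values out of n = 100, observed sum 50.0, lam fixed at 3.0.
history = em_theta(sum_obs=50.0, m=79, n=100, lam=3.0)
print(len(history) - 1, "iterations, theta =", round(history[-1], 5))
```

With $\lambda$ fixed, each update cannot decrease the observed-data log-likelihood, which gives a simple way to check an implementation.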
Numerical example. To illustrate our procedure we generated a set of $n = 100$ artificial data based on the exponential distribution model with $\theta = 1$ and on the missing-data mechanism with $\lambda = 3.0$ (about 30% missing). As shown in Table 4.1, 79 observations out of 100 are actually obtained. Values with an asterisk are regarded as missing.
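A data set of this kind can be generated along the following lines. This is a minimal sketch, not the generator used for Table 4.1: it assumes each value is observed with probability $e^{-y/\lambda}$, and the function name `simulate`, the seed, and the use of Python's `random` module are illustrative choices. Here $\theta$ is the exponential mean and $\lambda$ the mechanism parameter.

```python
import math
import random

def simulate(n, theta, lam, seed=0):
    """Draw n exponential values with mean theta; each is observed with probability exp(-y / lam)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.expovariate(1.0 / theta)              # rate 1/theta gives mean theta
        observed = rng.random() < math.exp(-y / lam)  # larger y is more likely to be missing
        data.append((y, observed))
    return data

sample = simulate(n=100, theta=1.0, lam=3.0)
m = sum(1 for _, observed in sample if observed)
print("observed:", m, "missing:", len(sample) - m)
```

The observed count varies from seed to seed, so a single run need not reproduce the 79 observed values reported above.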
Table 4.2 shows the convergence of EM to this solution starting from the initial values $\theta^{(0)} = 0.01$ and $\lambda^{(0)} = 0.01$. The EM algorithm took 14 iterations. The iterative procedure is considered to have converged when the Euclidean norm of the differences between two successive values of $\theta^{(t)}$ and $\lambda^{(t)}$ is smaller than $\varepsilon = 0.00001$. The obtained results are $\hat{\theta} = 0.96715$ and $\hat{\lambda} = 2.99999$. To show the difference between the converged values and the true values, we computed the standard errors of the two parameters: $se(\hat{\theta}) = 0.03285$ and $se(\hat{\lambda}) = 0.00001$ (these coincide with $|\hat{\theta} - 1|$ and $|\hat{\lambda} - 3.0|$). The correlation coefficient of the two estimated parameters was $corr(\hat{\theta}, \hat{\lambda}) = 0.96703$. It is noted that in this case the estimated parameters are very close to the population parameters and that our procedure seems to work well.
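The stopping rule described above can be written as a short function. The sketch below applies it to successive values taken from the last rows of Table 4.2; the function name `has_converged` and its argument order are assumptions made for illustration.

```python
import math

def has_converged(theta_prev, lam_prev, theta_new, lam_new, eps=1e-5):
    """True when the Euclidean norm of the change in (theta, lam) is smaller than eps."""
    return math.hypot(theta_new - theta_prev, lam_new - lam_prev) < eps

# Iteration 13 to 14 of Table 4.2: lam still changes by about 2e-5.
print(has_converged(0.96715, 2.99997, 0.96715, 2.99999))
# Iteration 14 to 15 of Table 4.2: no change at the reported precision.
print(has_converged(0.96715, 2.99999, 0.96715, 2.99999))
```

Because the tabulated values are rounded to five decimals, this check reproduces the rule only up to that rounding.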
Table 4.1: A set of 100 artificial data (exponential distribution with $\theta = 1$ and the missing value mechanism with $\lambda = 3.0$)
Note. * : missing data. The first column gives the index of the first observation in each row.

 No.
  1    0.633907    0.062182    0.702323    0.134175    2.125615*
  6    3.801593    0.263736    0.670557    1.096038    0.495486
 11    0.208814    0.013400    0.702941    1.241645    0.253232
 16    3.409731*   4.170736    1.234908    0.349487    0.062779
 21    3.382418*   1.872162    0.769794    0.745341    0.040246
 26    0.145197    0.582066*   0.083993    0.192054    0.858098
 31    2.129653    3.625779*   2.632828*   0.623568    3.334371*
 36    0.703712    1.680004    0.123132    3.007779    1.273178*
 41    1.200620    1.676158*   1.258658*   0.192693    0.408614
 46    0.005993    0.693403    3.443318*   0.478826    0.083737
 51    2.524394    0.576381    0.741285    0.257411    0.661249
 56    0.247183    0.850727    0.244805    0.096543    0.821901
 61    0.954158    0.136173    0.295892    0.096641    0.922874*
 66    0.372242*   0.022994    0.603037    4.252598*   0.178885
 71    1.806685    0.606821    0.273151    0.657411    1.724596
 76    1.227496    0.510398    0.605348    0.263653    2.014738*
 81    3.636831*   0.897983    0.026007    2.898733    1.704346*
 86    0.153316    1.783356    0.406625    1.981601*   1.466127
 91    1.460142    0.656768    1.483078    0.857258*   1.926649
 96    0.478107    2.636776    1.327901    0.010663    0.354965
5. Simulation Study
To discuss this in more detail, we carried out simulations in which two factors, the missing probability and the sample size, were varied to show the accuracy of the estimates. The simulation procedure is the same as in the numerical example of the previous section, except for the changed missing probability and sample size. For the first factor, we generated sets of $n = 100$, 400, and 1600 artificial data based on the exponential distribution with $\theta = 1$. For the other, we changed the missing-data probability in three steps, 10% ($\lambda = 12.0$), 30% ($\lambda = 3.0$), and 50% ($\lambda = 1.0$), for the generated data sets. The lower the value of $\lambda$ is, the higher the missing-data probability is.
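The relation between the mechanism parameter and the missing rate in this design can be tabulated directly. The sketch below assumes the marginal missing probability $\theta/(\theta + \lambda)$, with $\theta$ the exponential mean and $\lambda$ the mechanism parameter, and lists it next to the nominal rates quoted in the text; the function name `implied_missing_rate` is an illustrative assumption.

```python
def implied_missing_rate(theta, lam):
    """Marginal probability of a missing value, assuming the closed form theta / (theta + lam)."""
    return theta / (theta + lam)

theta = 1.0
design = {12.0: 0.10, 3.0: 0.30, 1.0: 0.50}  # lam -> nominal missing rate quoted in the text

for lam, nominal in design.items():
    implied = implied_missing_rate(theta, lam)
    print(f"lam = {lam:4.1f}   nominal = {nominal:.2f}   implied = {implied:.4f}")
```

The implied rates decrease as $\lambda$ grows, in line with the statement that a lower $\lambda$ gives a higher missing-data probability; the nominal percentages are approximate.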
We estimated the two parameters $\theta$ and $\lambda$ by the EM algorithm with the starting points $\theta^{(0)} = 0.01$ and $\lambda^{(0)} = 0.01$. The procedure was the same as in the numerical example. The obtained results are given in Table 5.1; for $n = 100$ and $\lambda = 3.0$ they are the same as the results of Table 4.2.
Table 4.2: Estimated parameters $\theta$ and $\lambda$ with the EM algorithm

 iteration   $\theta^{(t)}$   $|\theta^{(t+1)} - \theta^{(t)}|$   $\lambda^{(t)}$   $|\lambda^{(t+1)} - \lambda^{(t)}|$
     0        0.01       -          0.01       -
     1        0.61465    0.60465   -0.30225    0.28775
     2        0.84667    0.23202    0.41418    0.71643
     3        0.92693    0.08026    1.55003    1.13585
     4        0.95382    0.02689    2.38373    0.83370
     5        0.96275    0.00893    2.77549    0.39176
     6        0.96570    0.00295    2.92327    0.14778
     7        0.96667    0.00097    2.97438    0.05111
     8        0.96699    0.00032    2.99151    0.01713
     9        0.96710    0.00011    2.99719    0.00568
    10        0.96713    0.00002    2.99907    0.00188
    11        0.96715    0.00000    2.99969    0.00062
    12        0.96715    0.00000    2.99990    0.00021
    13        0.96715    0.00000    2.99997    0.00007
    14        0.96715    0.00000    2.99999    0.00002
    15        0.96715    0.00000    2.99999    0.00000
The number of iterations needed for convergence with about 10% missing probability is smaller than with a 50% missing rate; that is, the lower the missing rate is, the fewer iterations are needed for convergence. For an identical missing probability, it is notable that the larger the number of data is, the closer the estimates tend to be to the population parameters. The estimate of the parameter $\lambda$ was very accurate regardless of the missing rate and the data size.
In the present paper, we have shown some examples of statistical analysis based on the ML method for data with nonignorable missing values and have studied the validity of such kinds of analysis. We now discuss the properties of the estimates.
As the simulation shows, when the EM algorithm finds the ML estimates under a model that includes the missing-data mechanism, it can estimate accurately even when the missing-data probability is high. The estimates are very close to the true values, and the lower the missing probability is, the higher the accuracy is.
Table 5.1: The estimates of the parameters with changing missing probability

 $\lambda$ (missing rate)    n      $\hat{\theta}$    $se(\hat{\theta})$    $\hat{\lambda}$    $se(\hat{\lambda})$    iterations
 1.0 (50%)                  100    0.91198    0.08802     0.99998    0.00002    30
                            400    1.05876    0.05876     0.99994    0.00006    26
                           1600    1.00712    0.00712     0.99997    0.00003    26
 3.0 (30%)                  100    0.96715    0.03285     2.9999     0.00001    15
                            400    0.95617    0.04383     2.99984    0.00016    14
                           1600    1.00214    0.00214     2.99982    0.00018    13
 12.0 (10%)                 100    1.00287    0.00287    11.99997    0.00003    12
                            400    0.95453    0.04547    11.99985    0.00015     7
                           1600    0.98088    0.01912    11.99973    0.00027     7
References
1. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, B39, 1-38.
2. Dodge, Y. (1985). Analysis of Experiments with Missing Data, John Wiley & Sons, New York.
3. Hogg, R. V. and Tanis, E. A. (1993). Probability and Statistical Inference 4th Edition, Macmillan.
4. Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data, John Wiley & Sons, New York.
5. McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions, John Wiley & Sons, New York.
[ Received September 2002, Accepted September 2002 ]