
Analysis of Incomplete Data with Nonignorable Missing Values


2002, Vol. 13, No. 2, pp. 167~174


Hyun-Jeong Kim 1)

Abstract

In the case of nonignorable missing data, it is necessary to assume a model for the missing-data mechanism that suits each situation. For example, when the data are income amounts from a survey of individuals, we may assume a model in which the larger the value is, the higher the probability that it is missing. Our method maximizes the likelihood with the EM (Expectation and Maximization) algorithm, based on a missing-data mechanism applied to data from an exponential distribution. The method can be started from any initial values and converged in a few iterations. We varied the missing-data probability and the size of the artificial data set to show the accuracy of the estimates, and we then discuss the properties of the estimates.

Keywords: EM algorithm, Maximum likelihood, Mechanism, Missing data

1. Introduction

When the conditional probability of missingness given the observed values depends on the missing values, the missing-data mechanism is said to be nonignorable. In such cases we have to take the missing-data mechanism into account in the statistical analysis of data with missing values. The aim of the present paper is to show some examples of statistical analysis based on the maximum likelihood (ML) method for data with nonignorable missing values, and to study the validity of such analyses.

In Section 2 the likelihood function is derived for data with nonignorable missing values. In Section 3 we consider a specific model for univariate data taken from an exponential distribution with missing values. In this case the missing probability is assumed to increase as the value increases. Then, in Section 4 we explain the method for obtaining the ML estimates, i.e., the EM algorithm in the general case. In Section 5 we run simulations in which the missing-data probability and the artificial data size are each varied over three levels to show the accuracy of the estimates, and we then discuss the properties of the estimates.

1) Full-time Lecturer, Department of General Education, Silla University, Busan 617-736, Korea.

E-mail: semikim@silla.ac.kr

From the results, we find that when the missing data are nonignorable, estimation is unbiased provided that the assumed model includes the missing-data mechanism.

2. Likelihood for a sample with missing values

We write $Y = (Y_{obs}, Y_{mis})$ without any loss of generality, where $Y_{obs}$ and $Y_{mis}$ indicate the observed and missing parts of $Y$, respectively.

Let $R$ be the pattern of the missing values, that is, $R$ is the observed vector of random variables $R_i$, which is defined as

$$R_i = \begin{cases} 1, & \text{if } y_i \text{ is observed}, \\ 0, & \text{if } y_i \text{ is missing}. \end{cases}$$

Then, we can formulate a model with missing data in terms of a probability distribution for $Y$ with density $f(Y \mid \theta)$, indexed by an unknown parameter $\theta$, and a probability distribution $g(R \mid Y, \psi)$, where $\psi$ is a parameter appearing in the conditional density of $R$ given $Y$. The likelihood function with missing data is defined to be any function of $\theta$ and $\psi$ proportional to the joint density $h$ of $R$ and $Y$:

$$L(\theta, \psi \mid R, Y_{obs}) \propto h(Y_{obs}, R \mid \theta, \psi)$$

$$= \int_{-\infty}^{\infty} \prod_{i=1}^{n} h(y_i, r_i \mid \theta, \psi) \, dY_{mis}$$

$$= \prod_{i=1}^{m} f(y_i \mid \theta)\, g(r_i \mid y_i, \psi) \prod_{j=m+1}^{n} g_1(r_j \mid \theta, \psi),$$

where $m$ is the number of observed units, $n - m$ is the number of missing units, and $g_1(r \mid \theta, \psi)$ is the marginal density of $R$, i.e.,

$$g_1(r \mid \theta, \psi) = \int_{-\infty}^{\infty} f(y \mid \theta)\, g(r \mid y, \psi) \, dy.$$

Note that the joint density of $Y$ and $R$ can be decomposed into

$$h(y, r \mid \theta, \psi) = f(y \mid \theta)\, g(r \mid y, \psi)$$

(see Little and Rubin (1987) and Hogg and Tanis (1993)). This decomposition will be used for computing the expectation of the above likelihood.

3. The Case of Exponential Distribution

In this section we consider the case where $y$ follows an exponential distribution and the missing-data mechanism is defined by

$$\Pr(R = r \mid y, \psi) = g(r \mid y, \psi) = \begin{cases} 1 - e^{-y/\psi}, & r = 0, \\ e^{-y/\psi}, & r = 1. \end{cases}$$

The missing probability is displayed in Figure 3.1. As Figure 3.1 shows, the missing probability increases and tends to 1 as $y$ becomes larger.

Figure 3.1 Assumed missing-data mechanism $g(r = 0 \mid y, \psi)$
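As a minimal numerical sketch of this mechanism, the function below evaluates the conditional missing probability $1 - e^{-y/\psi}$ at a few values of $y$. The name `missing_prob` and the sample values are illustrative assumptions, not part of the paper. The exponent is read here as $y$ divided by $\psi$, the reading under which a smaller $\psi$ gives a higher missing probability.

```python
import math

def missing_prob(y, psi):
    """Probability that a value y is missing: g(r = 0 | y, psi) = 1 - exp(-y / psi)."""
    if y < 0 or psi <= 0:
        raise ValueError("y must be non-negative and psi positive")
    return 1.0 - math.exp(-y / psi)

# Illustrative values only: the probability rises toward 1 as y grows.
probs = [missing_prob(y, psi=3.0) for y in (0.0, 1.0, 3.0, 10.0)]
```

For a fixed $y$, lowering `psi` raises the returned probability, which matches the monotone curve described for Figure 3.1.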

Also suppose that the observations $y_1, \ldots, y_n$ are obtained as a random sample from an exponential distribution, i.e.,

$$f(y \mid \theta) = \frac{1}{\theta} e^{-y/\theta}.$$

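Then, the joint distribution $h(y, r \mid \theta, \psi)$ of $y$ and $r$ is obtained as

$$h(y, r \mid \theta, \psi) = f(y \mid \theta)\, g(r \mid y, \psi) = \begin{cases} \dfrac{1}{\theta} e^{-y/\theta} \left(1 - e^{-y/\psi}\right), & r = 0, \\[2mm] \dfrac{1}{\theta} e^{-y/\theta}\, e^{-y/\psi}, & r = 1, \end{cases}$$

and, therefore, the likelihood function with nonignorable missing data is given by

$$L(\theta, \psi \mid y_1, \ldots, y_m, r_1, \ldots, r_m, r_{m+1}, \ldots, r_n)$$

$$= \prod_{i=1}^{m} h(y_i, r_i \mid \theta, \psi) \prod_{j=m+1}^{n} g_1(r_j \mid \theta, \psi)$$

$$= \prod_{i=1}^{m} \frac{1}{\theta} e^{-y_i/\theta}\, e^{-y_i/\psi} \prod_{j=m+1}^{n} \frac{\theta}{\theta + \psi}$$

$$= \left(\frac{1}{\theta}\right)^{m} \left(\frac{\theta}{\theta + \psi}\right)^{n-m} \exp\left(-\frac{\theta + \psi}{\theta \psi} \sum_{i=1}^{m} y_i\right),$$

where

$$g_1(r_j \mid \theta, \psi) = \int f(y_j \mid \theta)\, g(r_j = 0 \mid y_j, \psi) \, dy_j$$

$$= \int_{0}^{\infty} \frac{1}{\theta} e^{-y_j/\theta} \left(1 - e^{-y_j/\psi}\right) dy_j$$

$$= \frac{\theta}{\theta + \psi}.$$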
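The model of this section can be sketched in code as follows. This is an illustration under stated assumptions, not the author's program: the function names `simulate` and `log_likelihood`, the seed, and the sample size are hypothetical, and the missing mechanism is taken to be $1 - e^{-y/\psi}$. The sketch draws exponential data, marks values as missing under that mechanism, and evaluates the logarithm of the likelihood in its closed form, with $m$ factors of $1/\theta$, $n - m$ factors of $\theta/(\theta+\psi)$, and the exponential term in the observed sum.

```python
import math
import random

def simulate(n, theta, psi, seed=0):
    """Draw y ~ Exponential(mean theta); each y is observed with probability exp(-y / psi)."""
    rng = random.Random(seed)
    ys, rs = [], []
    for _ in range(n):
        y = rng.expovariate(1.0 / theta)          # expovariate takes the rate 1 / theta
        observed = rng.random() < math.exp(-y / psi)
        ys.append(y)
        rs.append(1 if observed else 0)           # r = 1 observed, r = 0 missing
    return ys, rs

def log_likelihood(theta, psi, y_obs, n):
    """Log-likelihood for m = len(y_obs) observed values out of n units."""
    m = len(y_obs)
    rate = (theta + psi) / (theta * psi)
    return (-m * math.log(theta)
            + (n - m) * math.log(theta / (theta + psi))
            - rate * sum(y_obs))

ys, rs = simulate(20000, theta=1.0, psi=3.0, seed=1)
missing_fraction = 1.0 - sum(rs) / len(rs)        # compare with theta / (theta + psi)
```

With a large simulated sample, `missing_fraction` should lie close to the marginal value $\theta/(\theta+\psi)$ obtained from the integral for $g_1$.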

4. Estimation of Parameter Vectors $\theta$ and $\psi$

To obtain the ML estimates $\hat{\theta}$ and $\hat{\psi}$ we apply the so-called EM (Expectation and Maximization) algorithm proposed by Dempster, Laird and Rubin (1977). The EM algorithm is a very general iterative algorithm for ML estimation in incomplete-data problems. In the EM algorithm for such models, missing sufficient statistics rather than individual observations need to be estimated at each iteration of the algorithm (see McLachlan and Krishnan (1997) and Dodge (1985)). Even in the case of the exponential distribution, the formulation becomes complex when a missing-data mechanism exists.

Each iteration of the EM algorithm consists of an E step and an M step, and it is constructed as follows.

Step 1. Set initial estimates $\theta^{(0)}$ and $\psi^{(0)}$ of $\theta$ and $\psi$, respectively.

Step 2. (E step) We compute the expected value of the (joint) sufficient statistic $\sum_{i=1}^{n} y_i$:

$$E\left(\sum_{i=1}^{n} y_i \,\Big|\, \theta^{(t)}, \psi^{(t)}, R, Y_{obs}\right)$$

$$= E\left(\sum_{i=1}^{m} y_i \,\Big|\, \theta^{(t)}, \psi^{(t)}, R, Y_{obs}\right) + E\left(\sum_{j=m+1}^{n} y_j \,\Big|\, \theta^{(t)}, \psi^{(t)}, R, Y_{obs}\right)$$

$$= \sum_{i=1}^{m} y_i + \sum_{j=m+1}^{n} \hat{y}_j^{(t)},$$

where

$$\hat{y}_j^{(t)} = \int_{0}^{\infty} y_j \, f_1(y_j \mid r_j = 0, \theta^{(t)}, \psi^{(t)}) \, dy_j$$

$$= \int_{0}^{\infty} y_j \, \frac{1}{\theta^{(t)}} e^{-y_j/\theta^{(t)}} \left(1 - e^{-y_j/\psi^{(t)}}\right) \Big/ \frac{\theta^{(t)}}{\theta^{(t)} + \psi^{(t)}} \, dy_j$$

$$= \theta^{(t)} + \frac{\theta^{(t)} \psi^{(t)}}{\theta^{(t)} + \psi^{(t)}} \qquad (j = m+1, \ldots, n).$$

Step 3. (M step) Calculate the estimate

$$\theta^{(t+1)} = \left(\sum_{i=1}^{m} y_i + (n - m)\, \hat{y}_j^{(t)}\right) \Big/ n$$

and solve for the estimate $\psi^{(t+1)}$, using the complete-data sufficient statistic $\sum_{i=1}^{n} y_i$ found in the E step.

Step 4. If converged, $\theta^{(t)}$ and $\psi^{(t)}$ are regarded as the ML estimates, i.e., $\hat{\theta} = \theta^{(t)}$ and $\hat{\psi} = \psi^{(t)}$. Otherwise go back to Step 2 after putting $t := t + 1$.
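The E step and the $\theta$ update can be sketched in code as follows. Because the text does not spell out how $\psi^{(t+1)}$ is solved for, this sketch makes the simplifying assumption that $\psi$ is known and held fixed, so it illustrates only the $\theta$ part of the iteration. The function names, the stopping tolerance, and the toy data are hypothetical and are not taken from the paper.

```python
def expected_missing_value(theta, psi):
    """E step: E[y | r = 0] = theta + theta * psi / (theta + psi)."""
    return theta + theta * psi / (theta + psi)

def em_theta(y_obs, n, psi, theta0=0.01, tol=1e-5, max_iter=1000):
    """Iterate the theta update with psi held fixed (an assumption of this sketch).

    y_obs : the m observed values; n : total number of units, missing ones included.
    Returns the final theta and the number of iterations used.
    """
    m = len(y_obs)
    s_obs = sum(y_obs)
    theta = theta0
    for iteration in range(1, max_iter + 1):
        y_hat = expected_missing_value(theta, psi)      # E step
        theta_new = (s_obs + (n - m) * y_hat) / n       # M step for theta
        if abs(theta_new - theta) < tol:                # successive values agree
            return theta_new, iteration
        theta = theta_new
    return theta, max_iter

# Toy data (hypothetical): 4 observed values out of n = 5 units.
theta_hat, n_iter = em_theta([0.2, 0.5, 0.9, 1.4], n=5, psi=3.0)
```

At convergence the returned value satisfies the fixed-point relation $\sum_{i=1}^{m} y_i = m\theta - (n-m)\,\theta\psi/(\theta+\psi)$, which is the stationarity condition of the likelihood in $\theta$ for the given $\psi$.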

Numerical example. To illustrate our procedure we generated a set of $n = 100$ artificial data based on the exponential distribution with $\theta = 1$ and on the missing-data mechanism with $\psi = 3.0$ (about 30% missing). As shown in Table 4.1, 79 observations out of 100 are actually obtained. Values with an asterisk are regarded as missing.

Table 4.2 shows the convergence of EM to this solution starting from the initial values $\theta^{(0)} = 0.01$ and $\psi^{(0)} = 0.01$. The EM algorithm took 14 iterations. The iterative procedure is considered to have converged when the Euclidean norm of the difference between two successive values of $(\theta^{(t)}, \psi^{(t)})$ is smaller than $\varepsilon = 0.00001$. The obtained results are $\hat{\theta} = 0.96715$ and $\hat{\psi} = 2.99999$. To show the difference of the converged values, we found the standard errors of the two parameters: $se(\hat{\theta}) = 0.03285$ and $se(\hat{\psi}) = 0.00001$. The correlation coefficient of the two estimated parameters was $corr(\hat{\theta}, \hat{\psi}) = 0.96703$. It is noted that in this case the estimated parameters are very close to the population parameters and that our procedure seems to work well.

Table 4.1: A set of 100 artificial data (exponential distribution with $\theta = 1$ and the missing-value mechanism with $\psi = 3.0$)

Note. *: missing data

 1   0.633907    0.062182    0.702323    0.134175    2.125615*
 6   3.801593    0.263736    0.670557    1.096038    0.495486
11   0.208814    0.013400    0.702941    1.241645    0.253232
16   3.409731*   4.170736    1.234908    0.349487    0.062779
21   3.382418*   1.872162    0.769794    0.745341    0.040246
26   0.145197    0.582066*   0.083993    0.192054    0.858098
31   2.129653    3.625779*   2.632828*   0.623568    3.334371*
36   0.703712    1.680004    0.123132    3.007779    1.273178*
41   1.200620    1.676158*   1.258658*   0.192693    0.408614
46   0.005993    0.693403    3.443318*   0.478826    0.083737
51   2.524394    0.576381    0.741285    0.257411    0.661249
56   0.247183    0.850727    0.244805    0.096543    0.821901
61   0.954158    0.136173    0.295892    0.096641    0.922874*
66   0.372242*   0.022994    0.603037    4.252598*   0.178885
71   1.806685    0.606821    0.273151    0.657411    1.724596
76   1.227496    0.510398    0.605348    0.263653    2.014738*
81   3.636831*   0.897983    0.026007    2.898733    1.704346*
86   0.153316    1.783356    0.406625    1.981601*   1.466127
91   1.460142    0.656768    1.483078    0.857258*   1.926649
96   0.478107    2.636776    1.327901    0.010663    0.354965

5. Simulation Study

To discuss this in more detail, we ran simulations to show the accuracy of the estimates for different missing probabilities and sample sizes. The simulation procedure is the same as in the numerical example of the previous section, except that we changed the missing probability and the sample size. First, we generated sets of $n = 100$, $400$, and $1600$ artificial data based on the exponential distribution with $\theta = 1$. Second, we changed the missing-data probability over three levels, 10% ($\psi = 12.0$), 30% ($\psi = 3.0$), and 50% ($\psi = 1.0$), for the generated data sets. The lower the value of $\psi$ is, the higher the missing-data probability is.
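As a rough check on these settings, and assuming the marginal missing probability $\theta/(\theta+\psi)$ implied by the model of Section 3, the short computation below evaluates it at $\theta = 1$ for the three values of $\psi$. The variable names are illustrative. The resulting values fall as $\psi$ grows and lie in the neighbourhood of the nominal 10%, 30%, and 50% levels quoted above.

```python
theta = 1.0

# Marginal probability that a unit is missing, theta / (theta + psi), for each setting.
rates = {psi: theta / (theta + psi) for psi in (12.0, 3.0, 1.0)}

for psi, rate in rates.items():
    print(f"psi = {psi:4.1f}   marginal missing probability = {rate:.3f}")
```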

We estimated the two parameters $\theta$ and $\psi$ by using the EM algorithm with the starting points $\theta^{(0)} = 0.01$ and $\psi^{(0)} = 0.01$. The processing method was the same as in the numerical example. The obtained results are given in Table 5.1; for $n = 100$ and $\psi = 3.0$ they are the same as the results of Table 4.2.

Table 4.2: Estimated parameters $\theta$ and $\psi$ with the EM algorithm

iteration   $\theta^{(t)}$   $|\theta^{(t+1)} - \theta^{(t)}|$   $\psi^{(t)}$   $|\psi^{(t+1)} - \psi^{(t)}|$
 0          0.01             -                                   0.01           -
 1          0.61465          0.60465                             -0.30225       0.28775
 2          0.84667          0.23202                             0.41418        0.71643
 3          0.92693          0.08026                             1.55003        1.13585
 4          0.95382          0.02689                             2.38373        0.83370
 5          0.96275          0.00893                             2.77549        0.39176
 6          0.96570          0.00295                             2.92327        0.14778
 7          0.96667          0.00097                             2.97438        0.05111
 8          0.96699          0.00032                             2.99151        0.01713
 9          0.96710          0.00011                             2.99719        0.00568
10          0.96713          0.00002                             2.99907        0.00188
11          0.96715          0.00000                             2.99969        0.00062
12          0.96715          0.00000                             2.99990        0.00021
13          0.96715          0.00000                             2.99997        0.00007
14          0.96715          0.00000                             2.99999        0.00002
15          0.96715          0.00000                             2.99999        0.00000

The number of iterations needed for convergence at about 10% missing probability is smaller than at a 50% missing rate; that is, the lower the missing rate is, the fewer iterations are needed for convergence. For a fixed missing probability, it is notable that the larger the number of data is, the closer the estimates are to the population parameters. The estimate of the parameter $\psi$ was very accurate regardless of the missing rate and the data size.

In the present paper, we have shown some examples of statistical analysis based on the ML method for data with nonignorable missing values and have studied the validity of such analyses. We now discuss the properties of the estimates.

As the simulation shows, when the EM algorithm finds the ML estimates under a model that includes the missing-data mechanism, it can estimate accurately even when the missing-data probability is high. We also found that the estimates are very close to the true values, and that the lower the missing probability is, the higher the accuracy is.

Table 5.1: The estimates of the parameters with the missing probability changing

$\psi$ (missing rate)   $n$     $\hat{\theta}$   $se(\hat{\theta})$   $\hat{\psi}$   $se(\hat{\psi})$   iterations
1.0 (50%)               100     0.91198          0.08802              0.99998        0.00002            30
                        400     1.05876          0.05876              0.99994        0.00006            26
                        1600    1.00712          0.00712              0.99997        0.00003            26
3.0 (30%)               100     0.96715          0.03285              2.9999         0.00001            15
                        400     0.95617          0.04383              2.99984        0.00016            14
                        1600    1.00214          0.00214              2.99982        0.00018            13
12.0 (10%)              100     1.00287          0.00287              11.99997       0.00003            12
                        400     0.95453          0.04547              11.99985       0.00015            7
                        1600    0.98088          0.01912              11.99973       0.00027            7

References

1. Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, B39, 1-38.

2. Dodge, Y. (1985). Analysis of Experiments with Missing Data, John Wiley & Sons, New York.

3. Hogg, R. V. and Tanis, E. A. (1993). Probability and Statistical Inference, 4th Edition, Macmillan.

4. Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data, John Wiley & Sons, New York.

5. McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions, John Wiley & Sons, New York.

[ Received September 2002, Accepted September 2002 ]
