Missing Value Imputation based on Locally Linear Reconstruction for Improving Classification Performance


Journal of the Korean Institute of Industrial Engineers http://dx.doi.org/10.7232/JKIIE.2012.38.4.276

Vol. 38, No. 4, pp. 276-284, December 2012. © 2012 KIIE

ISSN 1225-0988 | EISSN 2234-6457


Pilsung Kang

Industrial and Information Systems Engineering, Seoul National University of Science and Technology (Seoultech)

Classification algorithms generally assume that the data are complete. However, missing values are common in real data sets for various reasons. In this paper, we propose using locally linear reconstruction (LLR) for missing value imputation to improve classification performance when missing values exist. We first investigate how much missing values degrade classification performance at various missing ratios. Then, we compare the proposed LLR imputation with three well-known single imputation methods over three different classifiers using eight data sets. The experimental results showed that (1) every imputation method, even a very simple one, helped to improve the classification accuracy; (2) among the imputation methods, the proposed LLR imputation was the most effective over all missing ratios; and (3) when the missing ratio was relatively high, LLR was outstanding and its classification accuracy was nearly as high as that obtained from the complete data set.

Keywords: Locally Linear Reconstruction (LLR), Missing Value Imputation, Classification.

1. Introduction

(Data Mining) (Machine Learning) (Classification)

(completeness) .

,

( ,

, ) (missing

value)

(Batista and Monard, 2003).

, ,

, (Barnard

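and Meng, 1999). According to Acuna and Rodriguez (2004), a missing rate below 1% is generally trivial, 1~5% is manageable, 5~15% requires sophisticated handling methods, and a rate above 15% can severely affect any kind of interpretation.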

(missing value imputation)

(Jerez et al., 2010; Jerzy and Hu, 2000; Li et al., 2009; Yu et al., 2011; Zhang and Liu, 2009).

(randomness)

(Little and Rubin, 1987).

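This study was supported by a research grant from Seoul National University of Science and Technology.

Corresponding author: Professor Pilsung Kang, Industrial and Information Systems Engineering, Seoul National University of Science and Technology, 138 Gongneung-gil, Nowon-gu, Seoul 139-743, Tel : 02-970-7286, Fax : 02-974-5388, E-mail : [email protected]

Received July 23, 2012; revised September 27, 2012; accepted October 5, 2012.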


(missing completely at random; MCAR) : ,

(missing value imputation) .

(missing at random; MAR) : ,

, .

, ‘ ’

‘ ’

(Su et al., 2008).

(not missing at random; NMAR) : ,

. ,

‘ ’ ‘ ’

.

(domain knowledge) .

(Batista and Monard, 2003; Farhangfar et al., 2008; Su et al., 2008).

. , .

,

. (classification and regression tree; CART) (splitting point)

(Breiman et al., 1984).

(Naive Bayesian Classifier) ,

(Sommerfield, 1997). k- (k-

Nearest Neighbor) ,

,

0 ,

(Witten and Frank, 2005). ,

. (single imputation)

(multiple imputation) .

, .

/ (mean/median imputation; MEI) : ,

( ) ( )

(Farhangfar et al., 2008; Jerez et al., 2010).

k- (k-Nearest Neighbor Imputation; k-NN) : k , k = 1 ‘Hot deck’

(Farhangfar et al., 2008; Garcia-Laencina et al., 2009).

(expectation conditional maximization;

ECM) : (multi-variate

Gaussian distribution) .

(Ghahramani and Jordan, 1994; Su et al., 2008).

: ,

. ,

.

(Hron et al., 2010), (Ennett et al., 2008), (Farhangfar et al., 2007) . (Zhang, 2003)

,

.

Multivariate imputation by chained equations(MICE)(van

Buuren, 2011) : ,

chained equation . .

Boosting(Farhangfar et al., 2007) : 3

. MEI

,

t- .

Boosting

.

. ,

,

, .


Figure 1. An illustrative example of LLR procedure: (a) A new instance; (b) Find k-NN (k = 5 in this example); (c) Find critical neighbors; (d) Assign weights

,

, ,

.

,

.

(locally linear reconstruction; LLR)

. LLR k-

, k

(Kang and Cho,

2008). LLR

,

1% 50%

. , 3

LLR 3

LLR .

. 2

LLR . 3

LLR

, , ,

. 4 , 5

.

2. Missing Value Imputation via Locally Linear Reconstruction

(Locally linear reconstruction; LLR)(Kang and Cho, 2008) k-

,

(1) . ( 

_{ }

) ,

(

^^{ }

) k

( 

^

) .

\[
\hat{y}_t = \sum_{j=1}^{k} w_j y_j \tag{1}
\]


Figure 2. The procedure of LLR-based missing value imputation

LLR <Figure 1> .



_{ }

(reference set) k

(<Figure 1>(a) ).

(distance) , 

_



_

(2) Minkowski distance

.

\[
\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{l=1}^{d} \left| x_{il} - x_{jl} \right|^{p} \right)^{1/p} \tag{2}
\]

p 1 p = 2

(Euclidean distance) . LLR

k ^

 

(weights) .
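The distance in Equation (2) and the search for the k nearest neighbors in a reference set can be sketched as follows. This is a minimal illustration assuming NumPy is available; the function names `minkowski_distance` and `k_nearest_neighbors` and the toy data are hypothetical and do not come from the paper.

```python
import numpy as np


def minkowski_distance(x_i, x_j, p=2):
    """Minkowski distance of order p between two vectors (p = 2 is Euclidean)."""
    diff = np.abs(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))


def k_nearest_neighbors(x_t, reference, k, p=2):
    """Return the indices of the k reference rows closest to x_t, nearest first."""
    reference = np.asarray(reference, dtype=float)
    dists = np.array([minkowski_distance(x_t, row, p) for row in reference])
    return np.argsort(dists, kind="stable")[:k]


# Toy reference set (illustrative values only).
reference = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [10.0, 10.0]])
x_t = np.array([0.0, 0.0])
print(minkowski_distance(x_t, reference[1], p=2))   # Euclidean distance
print(minkowski_distance(x_t, reference[1], p=1))   # Manhattan distance
print(k_nearest_neighbors(x_t, reference, k=2))     # indices of the 2 nearest rows
```

A stable sort is used so that ties between equally distant neighbors are broken by row order, which keeps the result reproducible.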

(3) 

_{ }



_{ }

(reconstruction error; E(w)) .

\[
\min_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{2} \left\| \mathbf{x}_t - \sum_{j=1}^{k} w_j \mathbf{x}_j \right\|^{2} \tag{3}
\]

 

_



_{ }

j ^



 

_

. (3)   

_{ }



_{ }

(hyperplane)

, ^

 



_{ }

.

LLR (4)

.

\[
\hat{y}_t = \sum_{j=1}^{k} w_j y_j \tag{4}
\]

LLR k (3)

0 . 0

k’ k k’ , (3)

k* LLR k

k* k . k*

,

k , LLR

(Kang and Cho, 2008). , LLR k k-

k .
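The weight computation and prediction steps of LLR can be sketched as follows. This is only an illustration under stated assumptions: the weights are constrained to be non-negative (which is one way some of the k neighbors can end up with a weight of exactly 0), and the problem is solved by projected gradient descent. The exact constraints and solver used in the paper may differ, and the names `llr_weights` and `llr_predict` are hypothetical.

```python
import numpy as np


def llr_weights(x_t, neighbors, n_iter=5000):
    """Non-negative weights w minimizing 0.5 * ||x_t - neighbors.T @ w||^2.

    Assumption: w_j >= 0 is enforced by projected gradient descent; the
    paper's exact constraints and solver may differ.
    """
    A = np.asarray(neighbors, dtype=float).T       # shape (d, k): one column per neighbor
    b = np.asarray(x_t, dtype=float)
    w = np.full(A.shape[1], 1.0 / A.shape[1])      # start from uniform weights
    lipschitz = np.linalg.norm(A, 2) ** 2          # largest eigenvalue of A.T @ A
    if lipschitz == 0.0:
        return w
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        grad = A.T @ (A @ w - b)                   # gradient of the reconstruction error
        w = np.maximum(w - step * grad, 0.0)       # project back onto w >= 0
    return w


def llr_predict(x_t, neighbors, targets):
    """Weighted sum of the neighbors' targets using the LLR weights."""
    w = llr_weights(x_t, neighbors)
    return float(w @ np.asarray(targets, dtype=float))


# Toy example: the third neighbor points away from x_t and receives weight 0.
neighbors = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, -1.0]])
x_t = np.array([1.0, 1.0, 1.0])
print(np.round(llr_weights(x_t, neighbors), 3))
print(round(llr_predict(x_t, neighbors, [10.0, 20.0, 30.0]), 3))
```

In this toy example only two of the three neighbors contribute to the prediction, which illustrates why the number of effective neighbors can be smaller than the initial k.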

LLR <Figure 2>

. n d

, i m , j m, p

, k p ( )

. LLR

(missing data set)

(complete data set) (<Figure 2>(b) ). ,

.

. <Figure 2>(c) i m

, (d-1) (1, 2, , m-1, m+1, d) ( 

_{ }

) 1 (m) ( ^ 

_{ }

)

. , . ,

<Figure 2>(c) (n-3)×(d-1) (

^

) (n-3)×

1 ( 

^

) .

LLR


(<Figure 2>(c) ).

.

LLR . ,

. 1

,

, LLR k

. ,

. , LLR

. .

, LLR .
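The LLR-based imputation procedure of this section can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the text: instances without missing values form the reference set, neighbors are searched with the Euclidean distance on the observed attributes only, weights are non-negative and found by projected gradient descent, and at least one complete instance and one observed attribute exist. The names `_nonneg_weights` and `llr_impute` are hypothetical.

```python
import numpy as np


def _nonneg_weights(x_t, neighbors, n_iter=2000):
    # Assumed solver: projected gradient for min 0.5*||x_t - neighbors.T @ w||^2, w >= 0.
    A = np.asarray(neighbors, dtype=float).T
    w = np.full(A.shape[1], 1.0 / A.shape[1])
    lipschitz = np.linalg.norm(A, 2) ** 2
    if lipschitz == 0.0:
        return w
    for _ in range(n_iter):
        w = np.maximum(w - (A.T @ (A @ w - x_t)) / lipschitz, 0.0)
    return w


def llr_impute(data, k=3):
    """Fill NaN entries using the complete rows as the reference set."""
    data = np.asarray(data, dtype=float)
    imputed = data.copy()
    complete = data[~np.isnan(data).any(axis=1)]        # reference rows without NaN
    k = min(k, len(complete))
    for i in np.where(np.isnan(data).any(axis=1))[0]:
        missing = np.isnan(data[i])
        observed = ~missing
        x_t = data[i, observed]
        # Neighbors are searched on the observed attributes only.
        dists = np.linalg.norm(complete[:, observed] - x_t, axis=1)
        nn = np.argsort(dists, kind="stable")[:k]
        w = _nonneg_weights(x_t, complete[nn][:, observed])
        # Each missing attribute is the weighted sum of the neighbors' values.
        imputed[i, missing] = w @ complete[nn][:, missing]
    return imputed


# Toy data: the last row has its third attribute missing.
data = np.array([
    [2.0, 0.0, 2.0],
    [0.0, 2.0, 4.0],
    [10.0, 10.0, 50.0],
    [1.0, 1.0, np.nan],
])
print(np.round(llr_impute(data, k=2), 3))
```

Treating the observed attributes as inputs and each missing attribute as a target mirrors the split into (d-1) input attributes and one target attribute described above.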

3. Experimental Design

LLR . ,

. , LLR

.

Table 1. The description of the data sets used in this paper

No.  Name            N. Instances(n)  N. Attributes(d)  N. Classes(c)
1    Iris            150              4                 3
2    Wine            178              13                3
3    New-thyroid     215              5                 3
4    Liver-disorder  345              6                 2
5    Ionosphere      351              34                2
6    Pima            768              8                 2
7    Wdbc            569              30                2
8    Vehicle         846              18                4

UCI Machine Learning Repository

8 .

, <Table 1> .

8 ,

. , 0%, 1%, 3%, 5%, 7%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50% 14

.

, 1 (d/2)

.
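Generating a data set with a given missing ratio can be sketched as follows. The sketch assumes that the ratio refers to the fraction of instances that contain missing values and that each selected instance loses between 1 and d/2 attributes chosen uniformly at random; both points are inferred from the surviving text and should be read as assumptions. The function name `inject_missing` and the synthetic data are hypothetical.

```python
import numpy as np


def inject_missing(data, ratio, rng):
    """Return a copy of data in which round(ratio * n) rows receive NaN entries.

    Assumption: each selected row loses between 1 and d // 2 attributes,
    chosen uniformly at random.
    """
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    out = data.copy()
    n_missing_rows = int(round(ratio * n))
    rows = rng.choice(n, size=n_missing_rows, replace=False)
    max_attrs = max(1, d // 2)
    for i in rows:
        n_attrs = rng.integers(1, max_attrs + 1)        # upper bound is exclusive
        cols = rng.choice(d, size=n_attrs, replace=False)
        out[i, cols] = np.nan
    return out


# The 14 missing ratios used in the experiments, applied to synthetic data.
ratios = [0.0, 0.01, 0.03, 0.05, 0.07, 0.10, 0.15, 0.20,
          0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 6))
for ratio in ratios:
    damaged = inject_missing(data, ratio, rng)
    n_rows = int(np.isnan(damaged).any(axis=1).sum())
    print(f"ratio={ratio:.2f} rows with missing values={n_rows}")
```

Repeating the call with different seeds for every ratio would give the repeated data sets per ratio that the experimental design calls for.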

30 1

420 (14 × 30 )

. LLR

. / (mean/median imputation; MEI) :

(

) ( )

Hot deck :

(expectation conditional maximization;

ECM) : (multi-variate

Gaussian distribution)
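Two of the benchmark methods above, mean imputation and Hot deck, can be sketched as follows. The sketch assumes numeric attributes only, uses the column mean for MEI, and implements Hot deck as 1-nearest-neighbor imputation with complete instances as donors and the Euclidean distance on observed attributes; these details are assumptions, and the function names are hypothetical. ECM is not sketched here.

```python
import numpy as np


def mean_impute(data):
    """Replace each NaN with the mean of the observed values in its column."""
    data = np.asarray(data, dtype=float)
    col_means = np.nanmean(data, axis=0)
    return np.where(np.isnan(data), col_means, data)


def hot_deck_impute(data):
    """Copy missing values from the single nearest complete row (k-NN with k = 1)."""
    data = np.asarray(data, dtype=float)
    imputed = data.copy()
    donors = data[~np.isnan(data).any(axis=1)]          # complete rows only
    for i in np.where(np.isnan(data).any(axis=1))[0]:
        missing = np.isnan(data[i])
        # Distance is measured on the attributes observed in row i.
        dists = np.linalg.norm(donors[:, ~missing] - data[i, ~missing], axis=1)
        nearest = donors[np.argmin(dists)]
        imputed[i, missing] = nearest[missing]
    return imputed


# Toy data: the last row has its first attribute missing.
data = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 4.1]])
print(mean_impute(data))
print(hot_deck_impute(data))
```

Mean imputation ignores the relationship between attributes, whereas Hot deck borrows the value from the most similar complete instance, so the two can fill the same cell with different values.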

(discard) (complete data)

(discard, complete, MEI, Hot-deck, ECM, LLR)

. , 2

4 (multi-class classification)

(Multinomial logistic regression; MNR, McCullagh and Nelder 1990), (classification and regression tree; CART, Breiman et al., 1984), (artificial neural network; ANN, Bishop 2006) .

5-fold (cross-validation) .

, .

(5)

. n ^



  

_

i

, I(·) ,

1 0

.
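The accuracy measure built on the indicator function I(·), which is 1 when the predicted class equals the true class and 0 otherwise, can be sketched as follows. The function name `classification_accuracy` and the toy labels are hypothetical.

```python
def classification_accuracy(y_true, y_pred):
    """Fraction of instances whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one instance is required")
    # The indicator contributes 1 for a correct prediction and 0 otherwise.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels: three of the four predictions are correct.
print(classification_accuracy([0, 1, 2, 1], [0, 1, 1, 1]))
```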

\[
\text{Classification Accuracy} = \frac{1}{n} \sum_{i=1}^{n} I\left( y_i = \hat{y}_i \right) \tag{5}
\]


Figure 3. The classification accuracies of each classification algorithm with various missing value ratios(x-axis : missing value ratio, y-axis : classification accuracy)

Table 2. Average classification accuracy decrease with regard to the missing ratios(%)

Missing ratio          1%     3%     5%     7%    10%    15%    20%    25%    30%    35%    40%    45%    50%
Accuracy decrease    0.60   1.75   2.73   3.90   5.41   8.03  11.04  13.60  15.93  18.91  21.53  24.26  26.80

Table 3. Average classification performance improvement (%) of each imputation method over discarding missing values for the eight data sets (An asterisk, in place of the boldface and underline of the original, marks the highest improvement among the imputation methods for each classification algorithm)

Algorithm  Imputation      1%      3%      5%      7%     10%     15%     20%     25%     30%     35%     40%     45%     50%
MNR        MEI           0.49    1.09    1.70    2.68    3.88    6.12    8.88   11.87   14.57   17.92   21.32   25.15   29.38
MNR        Hot Deck      0.36    1.36    1.96    3.11    4.31    6.64    9.51   12.60   15.30   18.74   22.36   26.44   30.92
MNR        ECM           0.44    1.19    1.59    2.63    3.89    6.02    8.96   11.89   14.27   17.94   21.24   24.87   29.25
MNR        LLR           0.50*   1.57*   2.52*   3.75*   4.93*   7.87*  11.22*  14.58*  17.65*  21.65*  25.56*  30.02*  34.72*
CART       MEI           0.70    1.17    2.29    3.00    4.19    6.45    9.50   12.07   15.01   18.59   21.85   25.67   30.71
CART       Hot Deck      0.59    1.23    2.40    2.94    4.31    6.54   10.05   12.17   15.42   18.81   22.20   26.10   30.43
CART       ECM           0.55    1.31    2.29    3.07    4.33    6.73    9.45   12.14   14.78   18.81   21.91   25.83   30.30
CART       LLR           0.81*   1.72*   2.44*   3.49*   5.12*   7.81*  11.33*  14.12*  17.19*  20.67*  24.62*  29.16*  33.61*
ANN        MEI           0.54*   0.88    1.79    2.57    3.99    6.27    9.32   12.09   14.84   18.27   21.70   26.85   30.14
ANN        Hot Deck      0.46    1.35    2.08    3.11    4.46    6.89   10.12   13.30   15.67   20.19   23.85   28.45   32.06
ANN        ECM           0.13    1.08    1.73    2.85    4.07    6.39    9.23   12.10   14.50   17.94   22.26   25.85   29.95
ANN        LLR           0.33    1.52*   2.32*   3.82*   5.27*   8.64*  12.08*  15.21*  18.71*  22.88*  27.39*  32.96*  36.95*

4. Experimental Results and Discussion

,

, 24 ( - )

.

, -

. 1 (5%

), ,

50%

25%

.

, 8

,

.

<Table 3> 8


Table 4. The number of (classification algorithm and data set) pairs for which each imputation method resulted in the highest classification accuracy for each missing instance ratio

Imputation    1%   3%   5%   7%  10%  15%  20%  25%  30%  35%  40%  45%  50%
MEI            6    2    4    2    2    4    1    1    2    0    0    2    1
Hot Deck       2    8    7    6    6    2    2    1    1    1    2    2    1
ECM            6    1    1    4    3    1    1    3    0    2    1    1    0
LLR           10   13   12   12   13   17   20   19   21   21   21   19   22

Table 5. Average performance decrease (%) of the LLR imputation against the complete data set for each missing instance ratio

Classification      1%     3%     5%     7%    10%    15%    20%    25%    30%    35%    40%    45%    50%
MNR               0.09   0.20   0.27   0.35   0.75   0.85   1.06   1.27   1.50   1.86   1.91   2.27   2.41
CART             -0.17  -0.15   0.28   0.20   0.22   0.32   0.49   0.73   0.91   1.50   1.58   1.49   2.03
ANN               0.25   0.50   0.59   0.55   0.83   0.76   1.12   1.52   1.35   1.94   2.23   2.12   2.61

(MEI, Hot deck, ECM, LLR) .

. , . ,

, 13 ,

(ANN, 1%) LLR

. , ,

.

, LLR

. LLR (MNR, CART, ANN)

.

<Table 4> 24 ( (3

), (8 ))

. LLR

,

. .

MEI ECM ,

.

, LLR .

<Table 5> LLR

(%) .

.

LLR 15%

(MNR, ANN) 30%(CART)

1%

. 50%

2~3% , LLR

,

.

<Figure 4> CART ,

. . ,

(5% ) ,

. , LLR

. , LLR (Liver-disorder, Pima)

,

.

. , . , ,

. ,

, . , LLR

.


Figure 4. Classification accuracy of each imputation method with CART: (a) Iris; (b) Wine; (c) New-thyroid; (d) Liver-disorder; (e) Ionosphere; (f) Pima; (g) Wdbc; (h) Vehicle

, (

)

2~3% . ,

LLR .

, LLR

.


5. Conclusion and Future Work

(locally linear reconstruction;

LLR) ,

8 , 3 , 3

. , LLR

. . ,

.

LLR ,

. , (regression),

(novelty detection) (supervised learning) LLR

. ,

LLR .

LLR

.

References

Acuna, E. and Rodriguez, C. (2004), The Treatment of Missing Values and Its Effect in the Classifier Accuracy, in Classification, Clustering and Data Mining Applications, 639-648.

Barnard, J. and Meng, X. L. (1999), Applications of Multiple Imputation in Medical Studies : From AIDS to NHANES, Statistical Methods in Medical Research, 8(1), 17-36.

Batista, G. E. A. P. A. and Monard, M. C. (2003), An Analysis of Four Missing Data Treatment Methods for Supervised Learning, Applied Artificial Intelligence, 17(5-6), 519-533.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer, Singapore.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Boca Raton, FL : CRC Press.

Ennett, C. M., Frize, M., and Walker, R. (2008), Imputation of Missing Values by Integrating Neural Networks and Case-Based Reasoning, In : Proceedings of the 30th Annual International IEEE Engineering in Medicine and Biology Society (EMBS ’08), Vancouver, BC, Canada, 4337-4341.

Farhangfar, A., Kurgan, L., and Dy, J. (2008), Impact of Imputation of Missing Values on Classification Error for Discrete Data, Pattern Recognition, 41(12), 3692-3705.

Farhangfar, A., Kurgan, L., and Pedrycz, W. (2007), A Novel Framework for Imputation of Missing Values in Database, IEEE Transactions on Systems, Man, and Cybernetics–Part A : Systems and Humans 37(5), 692-709.

Garcia-Laencina, P., Sancho-Gomez, J.-L., Figueiras-Vidal, A. R., and Verleysen, M. (2009), K-nearest Neighbours with Mutual Information for Simultaneous Classification and Missing Data Imputation, Neurocomputing, 72(7-9), 1483-1493.

Ghahramani, Z. and Jordan, M. I. (1994), Supervised Learning from Incomplete Data Via an EM Approach, In : Advances in NIPS 6, Morgan Kaufmann, Los Altos, CA, USA, 120-127.

Hron, K., Templ, M., and Filzmoser, P. (2010), Imputation of Missing Values for Compositional Data using Classical and Robust Methods, Computational Statistics and Data Analysis, 54(12), 3095-3107.

Jerez, J. M., Molina, I., Garcia-Laencina, G., Alba, E., Ribelles, N., Martin, M., and Franco, L. (2010), Missing Data Imputation using Statistical and Machine Learning Methods in a Real Breast Cancer Problem, Artificial Intelligence in Medicine, 50(2), 105-115.

Jerzy, W.G-B. and Hu, M. (2000), A Comparison of Several Approaches to Missing Attribute Values in Data Mining, In : Proceedings of the 2nd International Conference on Rough Sets and Current Trends in Computing (RSCTC’00), Banff, Canada, 378-385.

Kang, P. and Cho, S. (2008), Locally Linear Reconstruction for Instance-Based Learning, Pattern Recognition, 41(11), 3507-3518.

Li, H., Zhou, X., and Yao, Y. (2009), Missing Values Imputation Hypothesis : An Experimental Evaluation, In : Proceedings of the 8th IEEE International Conference on Cognitive Informatics (ICCI ’09), Hong Kong, China, 275-280.

Little, R. J. and Rubin, D. B. (1987), Statistical Analysis with Missing Data, John Wiley and Sons, New York.

McCullagh, P. and Nelder, J. A. (1990), Generalized Linear Models, New York : Chapman and Hall.

Kohavi, R., Becker, B., and Sommerfield, D. (1997), Improving Simple Bayes, In: Proceedings of the European Conference on Machine Learning (ECML’97), Prague, Czech Republic.

Su, X., Khoshgoftaar, T. M., and Greiner, R. (2008), Using Imputation Techniques to Help Learn Accurate Classifiers, In : Proceedings of the 20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’08), Dayton, OH, USA, 437-444.

UCI Machine Learning Repository : http://archive.ics.uci.edu/ml/.

van Buuren, S. and Groothuis-Oudshoorn, K. (2011), MICE : Multivariate Imputation by Chained Equations in R, Journal of Statistical Software, 45(3).

Witten, I. H. and Frank, E. (2005), Data Mining : Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann.

Yu, T., Peng, H., and Sun, W. (2011), Incorporating Nonlinear Relationships in Microarray Missing Value Imputation, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(3), 723-731.

Zhang, P. (2003), Multiple Imputation : Theory and Method, International Statistical Review, 71(3), 581-592.

Zhang, Y. and Liu, Y. (2009), Data Imputation using Least Squares Support Vector Machines in Urban Arterial Street, IEEE Signal Processing Letters, 15(5), 414-417.