Outlier detection and variable selection via difference based regression model and penalized regression †
InHae Choi 1 · Chun Gun Park 2 · Kyeong Eun Lee 3
1,3 Department of Statistics, Kyungpook National University
2 Department of Mathematics, Kyonggi University
Received 3 May 2018, revised 22 May 2018, accepted 22 May 2018
Abstract
This paper studies an efficient procedure for the outlier detection and variable selection problem in linear regression. The effect of outliers is added to the linear regression model as a mean shift parameter, a constant that is either zero or nonzero. To fit this mean shift model, most penalized regression methods have used adaptive penalties on the parameters to shrink most of them to zero. Such penalized models select the true variables well, but do not detect the outliers correctly. To overcome this problem, we first determine a group of suspected outliers using a difference-based regression model (DBRM) and add the group to the linear model as parameters representing the effect of each suspected outlier. Then, we perform outlier detection and variable selection simultaneously by applying Lasso regression or Elastic net regression to the linear regression with the effect term of each suspected outlier added. The proposed method is more efficient than previous penalized regression approaches. We compare the proposed procedure with other methods in a simulation study and apply it to real data.
Keywords: Difference-based regression model, Elastic net, Lasso, outlier detection, variable selection.
1. Introduction
Outliers are observations that differ significantly from the others, and they occur frequently in real data. They distort statistical inference; for instance, the ordinary least squares estimator is very susceptible to outliers (Joo and Cho, 2016). To address this, many outlier detection methods and robust techniques have been studied. We consider the mean shift linear regression, $y = \beta_0 \mathbf{1} + X_1 \beta_1 + X_2 \beta_2 + \gamma + \epsilon$, where $X_1$ is an $n \times p_1$ design matrix of relevant predictors, $X_2$ is an $n \times p_2$ design matrix of irrelevant predictors, $\beta_i$ is a parameter vector corresponding to $X_i$, $\gamma$ is the effect vector of outliers which consists of zero or nonzero
† This work was extracted from the Master’s thesis of InHae Choi at Kyungpook National University in December 2017.
1 Master, Department of Statistics, Kyungpook National University, Daegu 41566, Korea.
2 Associate Professor, Department of Mathematics, Kyonggi University, Suwon 16227, Korea.
3