
Outlier detection and variable selection via difference based regression model and penalized regression

InHae Choi 1 · Chun Gun Park 2 · Kyeong Eun Lee 3

1,3 Department of Statistics, Kyungpook National University

2 Department of Mathematics, Kyonggi University

Received 3 May 2018, revised 22 May 2018, accepted 22 May 2018

Abstract

This paper studies an efficient procedure for the outlier detection and variable selection problem in linear regression. The effect of outliers is added to the linear regression as a mean shift parameter, which is a nonzero or zero constant. To fit this mean shift model, most penalized regressions have used adaptive penalties on the parameters to shrink most of them to zero. Such penalized models select the true variables well, but do not detect the outliers correctly. To overcome this problem, we first determine a group of suspected outliers using a difference-based regression model (DBRM) and add the group to the linear model as parameters representing the effect of each suspected outlier. Then, we perform outlier detection and variable selection simultaneously using Lasso regression or Elastic net regression for the linear regression with the effect term of each suspected outlier added. The proposed method is more efficient than the previous penalized regression. We compare the proposed procedure with other methods in a simulation study and apply it to real data.

Keywords: Difference-based regression model, Elastic net, Lasso, outlier detection, variable selection.

1. Introduction

Outliers are observations that differ significantly from the others and occur frequently when real data are collected. They distort statistical inference; for instance, the ordinary least squares estimator is very susceptible to outliers (Joo and Cho, 2016). To address this, many studies have proposed outlier detection methods or robust techniques. We consider the mean shift linear regression,

$$y = \beta_0 1 + X_1 \beta_1 + X_2 \beta_2 + \gamma + \epsilon,$$

where $X_1$ is an $n \times p_1$ design matrix of relevant predictors, $X_2$ is an $n \times p_2$ design matrix of irrelevant predictors, $\beta_i$ is the parameter vector corresponding to $X_i$, $\gamma$ is the effect vector of outliers, whose entries are zero or nonzero constants, and $\epsilon$ is an error vector (McCann and Welsch, 2007; She and Owen, 2011; Park and Kim, 2017).

† This work was extracted from the Master's thesis of InHae Choi at Kyungpook National University in December 2017.

1 Master, Department of Statistics, Kyungpook National University, Daegu 41566, Korea.

2 Associate Professor, Department of Mathematics, Kyonggi University, Suwon 16227, Korea.

3 Corresponding author: Associate Professor, Department of Statistics, Kyungpook National University, Daegu 41566, Korea. E-mail: [email protected]

In this study, we focus on an efficient procedure for outlier detection and variable selection in the above mean shift linear regression. Recently, penalized regression methods such as the Lasso (Tibshirani, 1996) and Elastic net regression (Zou and Hastie, 2005) have become popular for variable selection. Penalized regressions of this type can also be used for outlier detection. For example, She and Owen (2011) proposed a nonconvex penalty function on the $\gamma_i$'s, the effects of the outliers.

Such penalties select the true variables well, but do not detect the outliers correctly.

To overcome this problem, we first determine a group of suspected outliers using a difference-based regression model (DBRM) (Park and Kim, 2017; Park, 2018) and add the group to the linear model as parameters representing the effect of each suspected outlier. Then, we perform outlier detection and variable selection simultaneously using Lasso regression or Elastic net regression for the linear regression with the effect term of each suspected outlier added. The proposed method is more efficient than the previous penalized regression.

In the literature, some methods such as the Forward Search Algorithm (Hadi and Simonoff, 1993) detect outliers directly without estimating the mean function, while others such as least trimmed squares (LTS) (Rousseeuw and Leroy, 1987) and the thresholding-based iterative procedure for outlier detection (Θ-IPOD) (She and Owen, 2011) identify outliers indirectly using residuals from a robust regression model. While the latter methods detect outliers by estimating the mean trend function, our outlier detection method uses the difference-based regression model without estimating the mean trend function.

The rest of this article is organized as follows. In Section 2, we introduce the difference-based regression model, which is used for picking out a group of suspected outliers, and Lasso regression and Elastic net regression for selecting important variables. We also describe an efficient procedure for outlier detection and variable selection via the two penalized regressions and DBRM. In Section 3, we conduct a simulation study to compare our approach with other existing methods. In Section 4, we apply the proposed method to the consumption of petrol. Finally, in Section 5, we provide some conclusions.

2. Methods

In this section, we will briefly review DBRM (Park and Kim, 2017), Lasso regression (Tibshirani, 1996) and Elastic net regression (Zou and Hastie, 2005) and then propose our method for outlier detection and variable selection using DBRM and Lasso or Elastic net regression.

2.1. Difference based regression model

Consider a multiple linear regression model without outliers,

$$y = 1_n \beta_0 + X_1 \beta_1 + X_2 \beta_2 + \epsilon,$$

where $\beta_0$ is an intercept, $1_n = [1, 1, \ldots, 1]^T$ is the $n$-dimensional vector of ones, $X_a$ is the $n \times p_a$ matrix of rank $p_a$, $\beta_a = [\beta_{a1}, \ldots, \beta_{ap_a}]^T$ is the $p_a$-vector of coefficients, $a = 1, 2$, $E(\epsilon) = 0$ and $Var(\epsilon) = \sigma^2 R$ with unknown correlation matrix $R$ and unknown variance $\sigma^2$.


For detecting outliers in linear regression, the mean-shift model (Cook and Weisberg, 1982) has long been used. We therefore consider the following mean shift model:

$$w = y + \gamma = 1_n \beta_0 + X\beta + \gamma + \epsilon,$$

where $X = [X_1, X_2]$, $\beta = [\beta_1^T, \beta_2^T]^T$ and $p = p_1 + p_2$.

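DBRM detects outliers using the following property: when the $j$th case is an outlier, the absolute value of the estimated intercept of the model fitted without the $j$th case is large. The DBRM can be written as

$$
\begin{aligned}
D^{(j)} w &= D^{(j)} y + D^{(j)} \gamma \\
&= 1_{n-1}(-\gamma_j) + D^{(j)} X \beta + A^{(j)} \gamma + D^{(j)} \epsilon \\
&= [1_{n-1} : D^{(j)} X] \begin{bmatrix} -\gamma_j \\ \beta \end{bmatrix} + A^{(j)} \gamma + D^{(j)} \epsilon, \qquad j = 1, \ldots, n,
\end{aligned}
\tag{2.1}
$$

where $(-\gamma_j)$ is the intercept used for detecting outliers, and $D^{(j)}$ and $A^{(j)}$ are the $(n-1) \times n$ matrices

$$
D^{(j)} = \begin{bmatrix} I_{j-1} & -1_{j-1} & 0_{(j-1),(n-j)} \\ 0_{(n-j),(j-1)} & -1_{n-j} & I_{n-j} \end{bmatrix}
\quad \text{and} \quad
A^{(j)} = \begin{bmatrix} I_{j-1} & 0_{j-1} & 0_{(j-1),(n-j)} \\ 0_{(n-j),(j-1)} & 0_{n-j} & I_{n-j} \end{bmatrix},
$$

where $I_a$ is the $a \times a$ identity matrix, $1_a$ and $0_a$ are the $a$-vectors of ones and zeros, and $0_{a,b}$ is the $a \times b$ null matrix. The $j$th column of $D^{(j)}$ is $-1_{n-1}$, so that $D^{(j)} 1_n = 0$ (the common intercept $\beta_0$ cancels) and $D^{(j)} \gamma = -\gamma_j 1_{n-1} + A^{(j)} \gamma$, which gives the second line of (2.1).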

Since the intercept $(-\gamma_j)$ is highly influenced by outliers, we propose to determine outlier candidates using the intercepts from the difference-based regression model. Note that we are not interested in the other parameters.
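The intercept estimates in (2.1) can be illustrated with a minimal sketch. It assumes that each intercept is estimated by ordinary least squares and that the term $A^{(j)}\gamma$ is absorbed into the error; the text does not state the estimator, so both choices are assumptions. The function names `difference_matrices` and `dbrm_intercepts` are hypothetical, and indices are 0-based.

```python
import numpy as np

def difference_matrices(n, j):
    """Return (D, A) for case j (0-based); both are (n-1) x n.

    A drops row j of the identity; D additionally puts -1 in column j,
    so D @ v gives v_i - v_j for every i != j.
    """
    A = np.delete(np.eye(n), j, axis=0)
    D = A.copy()
    D[:, j] = -1.0
    return D, A

def dbrm_intercepts(X, w):
    """Return delta_j = |estimated intercept| of the j-th differenced model.

    Assumption: plain least squares on [1, D X], ignoring the A gamma term.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    delta = np.empty(n)
    for j in range(n):
        D, _ = difference_matrices(n, j)
        design = np.column_stack([np.ones(n - 1), D @ X])
        coef, *_ = np.linalg.lstsq(design, D @ w, rcond=None)
        delta[j] = abs(coef[0])  # coef[0] estimates -gamma_j
    return delta
```

Under these assumptions, a case with a large mean shift should receive a large `delta[j]`, which is what the ranking step of the procedure relies on.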

2.2. Lasso regression

Tibshirani (1996) proposed the Lasso regression model for regression shrinkage and selection. For any fixed non-negative $\lambda$, the Lasso estimate $\hat{\beta}_L$ is defined by

$$
\hat{\beta}_L = \arg\min_{\beta} L(\lambda, \beta)
= \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.
\tag{2.2}
$$

Here, the larger $\lambda$ is, the closer $\beta_1, \ldots, \beta_p$ are to zero. That is, Lasso regression shrinks the coefficient estimates. However, it is known to have some limitations. First, in the high-dimensional case the Lasso selects at most $n$ variables before it saturates. Second, if there is a group of highly correlated variables, the Lasso tends to select one variable from the group and ignore the others (Zou and Hastie, 2005).

2.3. Elastic net regression

Zou and Hastie (2005) proposed the Elastic net for regularization and variable selection to overcome these limitations of the Lasso. For any fixed non-negative $\lambda_1$ and $\lambda_2$, the Elastic net estimate $\hat{\beta}_E$ is defined by

$$
\hat{\beta}_E = \arg\min_{\beta} L(\lambda_1, \lambda_2, \beta)
= \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda_1 \sum_{j=1}^{p} \beta_j^2 + \lambda_2 \sum_{j=1}^{p} |\beta_j| \right\}.
\tag{2.3}
$$

The Elastic net method allows sparsity and a grouping effect together (Zou and Hastie, 2005).
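As an illustration of how the penalized criteria (2.2) and (2.3) can be minimized, the following is a minimal coordinate descent sketch. Coordinate descent is one common solver and is swapped in here for illustration; the text does not say which algorithm or software was used. The function names are hypothetical, the intercept is left unpenalized, and setting `lam1 = 0` reduces the Elastic net criterion to the Lasso criterion.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam1=0.0, lam2=0.0, n_iter=500, tol=1e-10):
    """Minimize sum_i (y_i - b0 - x_i'b)^2 + lam1*sum b_j^2 + lam2*sum |b_j|.

    Cyclic coordinate descent on centered data; the intercept b0 is
    recovered afterwards and is not penalized.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_ss = (Xc ** 2).sum(axis=0)          # x_j' x_j for each column
    beta = np.zeros(X.shape[1])
    resid = yc.copy()                       # current residual yc - Xc @ beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(X.shape[1]):
            denom = col_ss[j] + lam1
            if denom == 0.0:                # constant column, nothing to fit
                continue
            # inner product of x_j with the partial residual (excluding j)
            rho = Xc[:, j] @ resid + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam2 / 2.0) / denom
            if new != beta[j]:
                resid -= Xc[:, j] * (new - beta[j])
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    beta0 = y_mean - x_mean @ beta
    return beta0, beta
```

The threshold is `lam2 / 2` because the squared-error term in (2.2) and (2.3) is not halved; the ridge part only enlarges the denominator, which is how it shrinks coefficients without setting them exactly to zero.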

2.4. Algorithm

We propose to use DBRM method to select outlier candidates and to perform outlier detection and variable selection simultaneously using Lasso regression or Elastic net method.

This approach consists of the following steps:

Step 1. Estimation of intercepts in DBRM.

$$D^{(j)} w = 1_{n-1}(-\gamma_j) + D^{(j)} X \beta + A^{(j)} \gamma + D^{(j)} \epsilon.$$

1.1. Estimate the intercepts $-\hat{\gamma}_j$, $j = 1, \ldots, n$, and set $\delta_j = |-\hat{\gamma}_j|$.

1.2. Sort them in ascending order, $\delta_{(1)} \le \cdots \le \delta_{(n)}$.

Step 2. Determination of outlier candidates.

2.1. Set the percentage of outlier candidates.

2.2. Take the cases with the largest $\delta_{(j)}$ values as outlier candidates.

2.3. Include indicator variables for the outlier candidates in the multiple linear model.

Step 3. Outlier detection and variable selection using Lasso or Elastic net.

$$w = 1_n \beta_0 + X\beta + \gamma_1 z_1 + \cdots + \gamma_k z_k + \epsilon,$$

where $z_1, \ldots, z_k$ are the indicator vectors of the $k$ outlier candidates.

3.1. Find the Lasso estimates, which minimize

$$\sum_{i=1}^{n} \Big( w_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} - \sum_{j=1}^{k} \gamma_j z_{ij} \Big)^2 + \lambda \Big( \sum_{j=1}^{p} |\beta_j| + \sum_{j=1}^{k} |\gamma_j| \Big).$$

3.2. Find the Elastic net estimates, which minimize

$$\sum_{i=1}^{n} \Big( w_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} - \sum_{j=1}^{k} \gamma_j z_{ij} \Big)^2 + \lambda_1 \Big( \sum_{j=1}^{p} \beta_j^2 + \sum_{j=1}^{k} \gamma_j^2 \Big) + \lambda_2 \Big( \sum_{j=1}^{p} |\beta_j| + \sum_{j=1}^{k} |\gamma_j| \Big).$$

In Step 3, the tuning parameter $\lambda$ of the Lasso and the Elastic net must be chosen. We use a method for obtaining the optimal $\lambda$ in a multiple regression model without outliers.
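Steps 1.2 to 2.3 and the reading of the Step 3 output can be sketched as follows. The sketch takes the values $\delta_j$ as given, uses hypothetical function names, and defaults to a 30% candidate share purely as an example value; the penalized fit itself is left to whatever Lasso or Elastic net solver is used.

```python
import numpy as np

def add_outlier_indicators(X, delta, candidate_frac=0.30):
    """Rank delta, keep the largest share as candidates, append indicators.

    Returns the augmented design [X, Z] and the sorted candidate indices;
    column l of Z is the indicator vector z_l of the l-th candidate.
    """
    X = np.asarray(X, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = X.shape[0]
    k = int(round(candidate_frac * n))      # number of outlier candidates
    order = np.argsort(delta)               # ascending: delta_(1) <= ... <= delta_(n)
    candidates = np.sort(order[n - k:])     # cases with the k largest delta
    Z = np.zeros((n, k))
    Z[candidates, np.arange(k)] = 1.0
    return np.hstack([X, Z]), candidates

def split_estimates(coef, p, candidates, tol=1e-8):
    """Split penalized estimates (without intercept) into the two answers.

    The first p entries are beta (variable selection); the rest are gamma,
    and a candidate counts as a detected outlier when its gamma is nonzero.
    """
    coef = np.asarray(coef, dtype=float)
    selected = np.flatnonzero(np.abs(coef[:p]) > tol)
    detected = candidates[np.abs(coef[p:]) > tol]
    return selected, detected
```

Because the same penalty acts on $\beta$ and $\gamma$, a single fit on the augmented design yields both the selected variables and the detected outliers.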


3. Simulation studies

We conduct simulations to compare the performance of our method with other existing methods. We compare outlier detection against the following four methods and variable selection against LAD-Lasso: least trimmed squares (LTS) (Rousseeuw and Leroy, 1987), LAD regression with the Lasso penalty (LAD-Lasso) (Wang et al., 2007), the hard thresholding (denoted by θ) based iterative procedure for outlier detection (HIPOD) (She and Owen, 2011), and the soft thresholding (denoted by θ) based iterative procedure for outlier detection (SIPOD) (She and Owen, 2011).

3.1. Simulation setting

We consider three sample sizes ($n = 30, 100, 300$), each with 10% outliers and $p = n/10$ covariates ($p_1 = p_2 = p/2$). Covariates are generated from a multivariate normal distribution with mean 0 and covariance matrix $\Sigma = \{\sigma_{ij}\}$, $\sigma_{ij} = \rho^{|i-j|}$. We consider two correlation levels for the covariates ($\rho = 0, 0.5$) and two error distributions ($N(0, 1)$ and $t_{df=2}$). The total number of simulation cases is therefore 12. Outlier locations and signs are randomly selected, and outlier sizes are 8 or 10. We generate 100 data sets for each case. We set the size of the outlier candidate group to 30% of the sample size.
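One data set under this design could be generated roughly as in the sketch below. Several details are not given in the text and are assumptions made only for illustration: the relevant coefficients are set to 1, the intercept to 0, the two outlier sizes are drawn with equal probability, and the caller passes `p` and `p1` explicitly. The function name `simulate` is hypothetical.

```python
import numpy as np

def simulate(n, p, p1, rho, error="normal", outlier_frac=0.10, seed=0):
    """Generate (X, w, gamma, beta) for one simulated data set."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    # covariance with entries rho^|i-j|
    Sigma = float(rho) ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:p1] = 1.0                          # assumption: relevant coefficients are 1
    if error == "normal":
        eps = rng.standard_normal(n)         # N(0, 1) errors
    else:
        eps = rng.standard_t(2, size=n)      # t errors with 2 degrees of freedom
    n_out = int(round(outlier_frac * n))
    out_idx = rng.choice(n, size=n_out, replace=False)   # random locations
    sizes = rng.choice([8.0, 10.0], size=n_out)          # outlier size 8 or 10
    signs = rng.choice([-1.0, 1.0], size=n_out)          # random signs
    gamma = np.zeros(n)
    gamma[out_idx] = sizes * signs
    w = X @ beta + gamma + eps               # assumption: intercept is 0
    return X, w, gamma, beta
```

For example, `simulate(n=100, p=10, p1=5, rho=0.5, error="t")` would correspond to one replicate of the heavy-tailed, correlated case with 10 outliers.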

3.2. Criteria

Let $n_O$ be the number of true outliers, $n_D$ be the number of detected outliers, $n_{CD}$ be the number of correctly detected outliers, $n_{ID}$ be the number of incorrectly detected outliers, $n_{IU}$ be the number of incorrectly undetected outliers and $n_{CU}$ be the number of correctly undetected non-outliers.

Table 3.1 Outlier detection

| Detection \ True | Outlier | Non-outlier | Sum |
|---|---|---|---|
| Outlier | $n_{CD}$ | $n_{ID}$ | $n_D$ |
| Non-outlier | $n_{IU}$ | $n_{CU}$ | $n_{UD}$ |
| Sum | $n_O$ | $n_{NO}$ | $n$ |

Two kinds of errors can occur in outlier detection: masking (a true outlier is not detected) and swamping (a non-outlier is detected as an outlier).

We define relative frequencies of perfection, only masking, only swamping, masking and swamping and complete failure in order to show the strength of our proposed method as follows:

- Relative frequency of perfect detection:

$$PD = \frac{1}{n_{sim}} \sum_{s=1}^{n_{sim}} I(n_{CD(s)} = n_{O(s)}).$$

- Relative frequency of only-swamping with detection (overdetection):

$$OS = \frac{1}{n_{sim}} \sum_{s=1}^{n_{sim}} I(n_{IU(s)} = 0,\ n_{ID(s)} > 0,\ n_{CD(s)} > 0).$$

- Relative frequency of only-masking with partial detection:

$$OM = \frac{1}{n_{sim}} \sum_{s=1}^{n_{sim}} I(n_{IU(s)} > 0,\ n_{ID(s)} = 0,\ n_{CD(s)} > 0).$$

- Relative frequency of masking and swamping with partial detection:

$$MS = \frac{1}{n_{sim}} \sum_{s=1}^{n_{sim}} I(n_{IU(s)} > 0,\ n_{ID(s)} > 0,\ n_{CD(s)} > 0).$$

- Relative frequency of complete failure:

$$CF = \frac{1}{n_{sim}} \sum_{s=1}^{n_{sim}} I(n_{CD(s)} = 0).$$
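The five criteria can be computed from the true and detected outlier index sets of each simulated data set, as in the sketch below. One point is an assumption: the sketch counts a run as perfect detection only when there is also no swamping ($n_{ID} = 0$), so that the five categories are mutually exclusive, whereas the formula for $PD$ as printed requires only $n_{CD} = n_O$. The function names are hypothetical, and every run is assumed to contain at least one true outlier.

```python
from collections import Counter

def classify_run(true_outliers, detected):
    """Assign one simulation run to PD, OS, OM, MS or CF."""
    true_outliers, detected = set(true_outliers), set(detected)
    n_cd = len(true_outliers & detected)    # correctly detected outliers
    n_id = len(detected - true_outliers)    # swamping: non-outliers flagged
    n_iu = len(true_outliers - detected)    # masking: outliers missed
    if n_cd == 0:
        return "CF"                         # complete failure
    if n_iu == 0 and n_id == 0:
        return "PD"                         # assumption: also requires n_id == 0
    if n_iu == 0:
        return "OS"                         # only swamping
    if n_id == 0:
        return "OM"                         # only masking
    return "MS"                             # masking and swamping

def relative_frequencies(runs):
    """runs: iterable of (true_outliers, detected) pairs, one per data set."""
    runs = list(runs)
    counts = Counter(classify_run(t, d) for t, d in runs)
    return {key: counts[key] / len(runs) for key in ("PD", "OS", "OM", "MS", "CF")}
```

With this convention the five relative frequencies of a method add up to one over the simulated data sets.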

Finally, we use three criteria for variable selection. Let CS be the relative frequency of correct selection, CR be relative frequency of correct variable reduction and AN be the average number of selected variables:

- $CS$: the relative frequency of correct selection:

$$CS = \frac{1}{n_{sim}} \sum_{s=1}^{n_{sim}} I\big( \{j : \hat{\beta}_{j,s} \neq 0\} = \{j : \beta_j \neq 0\} \big).$$

- $CR$: the relative frequency of correct variable reduction:

$$CR = \frac{1}{n_{sim}} \sum_{s=1}^{n_{sim}} I\big( \{j : \hat{\beta}_{j,s} \neq 0\} \supset \{j : \beta_j \neq 0\} \big).$$

- $AN$: the average number of selected variables:

$$AN = \frac{1}{n_{sim}} \sum_{s=1}^{n_{sim}} \#\{j : \hat{\beta}_{j,s} \neq 0\}.$$
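The three variable selection criteria can be computed from the estimated coefficient vectors of the simulated data sets, as in the following sketch. It reads $\supset$ in $CR$ as "contains or equals", an interpretation assumed here because some reported $CS$ and $CR$ values sum to more than one, which a strict superset would not allow. The function name and the tolerance argument are hypothetical.

```python
import numpy as np

def selection_criteria(beta_hats, beta_true, tol=0.0):
    """Return (CS, CR, AN) over a list of estimated coefficient vectors."""
    true_set = set(np.flatnonzero(np.abs(np.asarray(beta_true, dtype=float)) > tol))
    cs = cr = an = 0
    for b in beta_hats:
        selected = set(np.flatnonzero(np.abs(np.asarray(b, dtype=float)) > tol))
        cs += selected == true_set      # exactly the true variables
        cr += selected >= true_set      # all true variables kept (superset or equal)
        an += len(selected)             # number of selected variables
    m = len(beta_hats)
    return cs / m, cr / m, an / m
```

A method can therefore have a high $CR$ and a modest $CS$ when it keeps every relevant variable but also retains some irrelevant ones, which $AN$ makes visible.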

3.3. Simulation result

We show the results of the 12 simulation cases. Table 3.2 shows the outlier detection results. METHOD1 combines the difference-based regression model with Lasso regression, and METHOD2 combines the difference-based regression model with Elastic net regression.

Regardless of sample size or other conditions, our method has comparable performance, mainly in detection power. Moreover, the larger the sample size and the correlation between variables, the better the outlier detection performance. Although our performance is not superior to that of the other methods, the overall performance is comparable.

Next, Table 3.3 shows the variable selection results. Our method obtains results comparable with other existing methods in the simulation studies, and it is superior in most cases. Although the number of variables selected is larger than the number of important variables, the rate of correct variable reduction is much better than that of the other methods.

Table 3.2 Comparison of outlier detection

 ρ n Criteria LTS LAD

Lasso

HIPOD SIPOD METHOD1 METHOD2

N (0, 1) ρ = 0

n = 30

PD 0 0 0.02 0.46 0.29 0.17

OS 0 0 0.98 0.53 0.70 0.83

OM 0 0 0 0.01 0.01 0

MS 0 0 0 0 0 0

CF 1 1 0 0 0 0

n = 100

PD 0 0 0.02 0.15 0 0

OS 0 0 0.98 0.85 1 1

OM 0 0 0 0.01 0.01 0

MS 0 0 0 0 0 0

CF 1 1 0 0 0 0

n = 300

PD 0 0 0.03 0.02 0 0

OS 0 0 0.97 0.98 1 1

OM 0 0 0 0 0 0

MS 0 0 0 0 0 0

CF 1 1 0 0 0 0

ρ = 0.5 n = 30

PD 0 0 0.02 0.46 0.18 0.11

OS 0 0 0.98 0.53 0.81 0.88

OM 0 0 0 0.01 0.01 0.01

MS 0 0 0 0 0 0

CF 1 1 0 0 0 0

n = 100

PD 0 0 0.02 0.15 0.21 0.11

OS 0 0 0.98 0.85 0.79 0.89

OM 0 0 0 0 0 0

MS 0 0 0 0 0 0

CF 1 1 0 0 0 0

n = 300

PD 0 0 0.03 0.02 0.13 0.08

OS 0 0 0.97 0.98 0.87 0.92

OM 0 0 0 0 0 0

MS 0 0 0 0 0 0

CF 1 1 0 0 0 0

t

df =2

ρ = 0

n = 30

PD 0 0 0.01 0.09 0.06 0.05

OS 0 0 0.99 0.85 0.90 0.93

OM 0 0 0 0.01 0.01 0

MS 0 0 0 0 0.03 0.02

CF 1 1 0 0.05 0 0

n = 100

PD 0 0 0 0 0 0

OS 0 0 1 0.81 0.92 0.92

OM 0 0 0 0.02 0.03 0.01

MS 0 0 0 0 0.03 0.06

CF 1 1 0 0.16 0.02 0.01

n = 300

PD 0 0 0 0 0 0

OS 0 0 1 0.37 0.93 0.93

OM 0 0 0 0.05 0 0

MS 0 0 0 0 0.05 0.05

CF 1 1 0 0.58 0.02 0.02

(8)

Table 3.2 Continued

 ρ n Criteria LTS LAD

Lasso

HIPOD SIPOD METHOD1 METHOD2

t

df =2

ρ = 0.5 n = 30

PD 0 0 0.01 0.09 0.05 0.01

OS 0 0 0.99 0.85 0.92 0.94

OM 0 0 0 0.01 0.01 0.01

MS 0 0 0 0 0.02 0.04

CF 1 1 0 0.05 0 0

n = 100

PD 0 0 0 0 0 0

OS 0 0 1 0.81 0.96 0.97

OM 0 0 0 0.02 0.01 0.01

MS 0 0 0 0.01 0.02 0.01

CF 1 1 0 0.16 0.01 0.01

n = 300

PS 0 0 0 0 0 0

OS 0 0 1 0.37 0.92 0.93

OM 0 0 0 0.05 0 0

MS 0 0 0 0 0.06 0.05

CF 1 1 0 0.58 0.02 0.02

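Table 3.3 Comparison of variable selection

| Error | ρ | n | Criteria | LAD-Lasso | METHOD1 | METHOD2 |
|---|---|---|---|---|---|---|
| $N(0, 1)$ | 0 | 30 | CS | 0.46 | 0.40 | 0.48 |
| | | | CR | 0.59 | 0.68 | 0.80 |
| | | | AN | 2.76 | 1.92 | 2.1 |
| | | 100 | CS | 0.44 | 0.61 | 0.47 |
| | | | CR | 0.53 | 0.94 | 0.96 |
| | | | AN | 1.73 | 2.28 | 2.47 |
| | | 300 | CS | 0.44 | 0.61 | 0.47 |
| | | | CR | 0.53 | 0.94 | 0.96 |
| | | | AN | 1.73 | 2.28 | 2.47 |
| | 0.5 | 30 | CS | 0.44 | 0.61 | 0.47 |
| | | | CR | 0.53 | 0.94 | 0.96 |
| | | | AN | 1.73 | 2.28 | 2.47 |
| | | 100 | CS | 0.44 | 0.61 | 0.47 |
| | | | CR | 0.53 | 0.94 | 0.96 |
| | | | AN | 1.73 | 2.28 | 2.47 |
| | | 300 | CS | 0.44 | 0.61 | 0.47 |
| | | | CR | 0.53 | 0.94 | 0.96 |
| | | | AN | 1.73 | 2.28 | 2.47 |
| $t_{df=2}$ | 0 | 30 | CS | 0.34 | 0.31 | 0.3 |
| | | | CR | 0.44 | 0.59 | 0.63 |
| | | | AN | 1.62 | 1.76 | 1.89 |
| | | 100 | CS | 0.31 | 0.39 | 0.33 |
| | | | CR | 0.42 | 0.80 | 0.85 |
| | | | AN | 1.68 | 2.25 | 2.37 |
| | | 300 | CS | 0.31 | 0.39 | 0.33 |
| | | | CR | 0.42 | 0.80 | 0.85 |
| | | | AN | 1.68 | 2.25 | 2.37 |
| | 0.5 | 30 | CS | 0.31 | 0.39 | 0.33 |
| | | | CR | 0.42 | 0.80 | 0.85 |
| | | | AN | 1.68 | 2.25 | 2.37 |
| | | 100 | CS | 0.31 | 0.39 | 0.33 |
| | | | CR | 0.42 | 0.80 | 0.85 |
| | | | AN | 1.68 | 2.25 | 2.37 |
| | | 300 | CS | 0.31 | 0.39 | 0.33 |
| | | | CR | 0.42 | 0.80 | 0.85 |
| | | | AN | 1.68 | 2.25 | 2.37 |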

4. Real data analysis

We apply our procedure to fuel consumption data. Because fuel consumption was measured in 48 states, the data have 48 rows. The response variable ($Y$) and the four explanatory variables ($X_1$, $X_2$, $X_3$, and $X_4$) are as follows:

- $Y$ = Consumption of fuel (millions of gallons).

- $X_1$ = Fuel tax (cents per gallon).

- $X_2$ = Average income (dollars).

- $X_3$ = Paved highways (miles).

- $X_4$ = Proportion of population with driver's licenses.

Looking at Figure 4.1, we see that the 40th observation is an outlier. Table 4.1 shows the outlier detection results for the fuel consumption data. Among the methods, HIPOD, SIPOD, METHOD1 and METHOD2 identify the 40th observation as an outlier. However, HIPOD, METHOD1 and METHOD2 tend to detect too many outliers.

Table 4.1 The result of outlier detection in fuel consumption data

| Method | Number of outliers | Outlier index set |
|---|---|---|
| LTS | 0 | { } |
| LAD-Lasso | 0 | { } |
| HIPOD | 22 | {5, 8, 9, 11, 15, 16, 18, 19, 20, 22, 23, 24, 31, 33, 34, 38, 39, 40, 42, 43, 44, 45} |
| SIPOD | 2 | {18, 40} |
| METHOD1 | 10 | {5, 11, 18, 19, 33, 38, 40, 42, 44, 45} |
| METHOD2 | 14 | {5, 11, 15, 18, 19, 20, 33, 36, 38, 39, 40, 42, 44, 45} |

Table 4.2 shows the variable selection results for the fuel consumption data. LAD-Lasso and METHOD1 treat $X_3$ as an unimportant variable and select 3 variables, whereas METHOD2 does not identify any unimportant variable and selects all 4 variables.


Figure 4.1 Residual plot

Table 4.2 The result of variable selection in fuel consumption data

| Method | Number of selected variables | Unselected variable |
|---|---|---|
| LAD-Lasso | 3 | $X_3$ |
| METHOD1 | 3 | $X_3$ |
| METHOD2 | 4 | |

5. Conclusion

In this paper, using the properties of the intercept estimator in a difference-based regression model, we propose a simultaneous procedure for outlier detection and variable selection, which is more efficient and simpler than carrying out each procedure separately. Our procedure does not require us to estimate the mean function. Instead, we determine the candidate group of outliers using the intercept estimates in the difference-based regression model and add the candidate group to the model. Then, we perform outlier detection and variable selection using Lasso regression and Elastic net regression. Our procedure obtains results comparable with other existing methods in the simulation studies, and it is superior in most cases.

We also apply our procedure to real data. In future work, the procedure can be extended to outlier detection in the nonparametric regression model.

References

Cook, R. D. and Weisberg, S. (1982). Residuals and influence in regression, Chapman & Hall, London, UK.

Hadi, A. S. and Simonoff, J. S. (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88, 1264-1272.

Joo, Y. S. and Cho, G-Y. (2016). Outlier detection and treatment in industrial sampling survey. Journal of the Korean Data & Information Science Society, 27, 131-142.

McCann, L. and Welsch, R. E. (2007). Robust variable selection using least angle regression and elemental set sampling. Computational Statistics & Data Analysis, 52, 249-257.

Park, C. G. (2018). Distinction of an outlier(s) using difference based regression models. Journal of the Korean Data & Information Science Society, 29, 339-350.

Park, C. G. and Kim, I. (2017). Outlier detection using difference based regression model. Manuscript.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust regression and outlier detection, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons Inc, New York.

She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106, 626-639.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B, 58, 267-288.

Wang, H., Li, G. and Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business & Economic Statistics, 25, 347-355.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the Elastic net. Journal of the Royal Statistical Society. Series B, 67, 301-320.
