Adaptive ridge procedure for <em>L</em><sub>0</sub>-penalized weighted support vector machines

(1)

2017, 28(6), 1271–1278

Adaptive ridge procedure for L

₀

-penalized weighted support vector machines

^†

Kyoung Hee Kim

¹

· Seung Jun Shin

²

1Department of Statistics, Sungshin Women’s University

2Department of Statistics, Korea University

Received 29 September 2017, revised 30 October 2017, accepted 1 November 2017

Abstract

Although the L0-penalty is the most natural choice to identify the sparsity struc- ture of the model, it has not been widely used due to the computational bottleneck.

Recently, the adaptive ridge procedure is developed to efficiently approximate a Lq- penalized problem to an iterative L2-penalized one. In this article, we proposed to apply the adaptive ridge procedure to solve the L0-penalized weighted support vector machine (WSVM) to facilitate the corresponding optimization. Our numerical investigation shows the advantageous performance of the L0-penalized WSVM compared to the conventional WSVM with L2 penalty for both simulated and real data sets.

Keywords: L0-penalty, support vector machines, variable selection.

1. Introduction

The support vector machine (SVM; Vapnik, 2013) has been regarded as one of canoni- cal tools in binary classification due to its flexibility and promising performance. See, for example, Shim and Seok (2014) and Hwang and Shim (2017) among many others. The conventional SVMs treat all the data points equally. However, it is desired in some applications to assign different weights to observations from different classes. One such example is cancer classification where one type of misclassification induces a larger cost than the other type of misclassification. Motivated by this, Lin et al. (2002) proposed the weighted SVM (WSVM), a broader class of learning algorithm. With high-dimensional predictors, it is of- ten of primary interest to identify informative variables to improve both prediction accuracy and interpretability. Penalization is one popular way of performing variable selection. When SVM is penalized by ridge-type L2 penalty, it fails to have a sparse solution. One natural choice to pursue sparsity for the variable selection is LASSO-type L1-penalty (Tibshirani, 1996) and L1-norm SVM utilizing this LASSO-type penalty has been developed. Although

† This work is supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (NRF 2015 R1C1A1A01054913).

1 Assistant professor, Department of Statistics, Sungshin Women’s University, Seoul 100-741, Korea.

2 Corresponding author: Assistant professor, Department of Statistics, Korea University, Seoul 712-749, Korea. E-mail: [email protected]

(2)

LASSO penalty is very popular, the corresponding estimator is not only biased but also inconsistent unless very stringent assumptions are satisfied. Other alternatives that provide sparse solutions and enjoy desirable asymptotic properties include adaptive LASSO (Zou, 2006; Zhang and Lu, 2007), smoothly clipped absolute deviance penalty (SCAD, Fan and Li (2001)), and the minimax concave penalty (MCP; Zhang, 2010) among many others. The L0-norm which is defined as the cardinality of the set of nonzero elements is a directly re- lated to the concept of sparsity. The use of L0-norm was proposed in early 2000’s in machine learning society (Weston et al., 2003), but it fails to gain popularity because the associated optimization is notoriously difficult (Amaldi and Kann, 1998). To circumvent such computational bottleneck, several algorithms have been proposed based on the approximation of the L0-penalty (Fung et al., 2002; Weston et al., 2003). However, they all have drawbacks—

slow computing time and dependency on additional tuning parameters. Recently, Frommlet and Nuel (2016) developed an adaptive ridge (AR) procedure that converts L₀-penalized optimization problem to an iterative L₂- or ridge-penalized problem. This procedure was designed so that the resulting penalty converges towards the L₀one, and the computation is extremely fast. The idea of AR has been originally proposed by Huang et al. (2008) for the SVM by formulating it using hierarchical Bayesian model, and Frommlet and Nuel (2016) independently proposed a AR procedure in the context of the linear model. In this article, we have studied the L0-penalized WSVM which has not been explored much in the literature, due to the aforementioned difficulties. We provide an efficient algorithm based on the AR approach and illustrate its promising performance from numerical examples.

2. L

₀

-penalized weighted support vector machine

We are given a set of data (xi, yi) ∈ R^p× {−1, 1}, i = 1, · · · , n, and let I+= {i : yi = 1}

and I₋ = {i : yi= −1}. The WSVM solves minβ₀,β λkβk²₂+ (1 − π)X

i∈I+

ξi+ π X

i∈I−

ξi (2.1)

subject to yi(β0+ β^>xi) ≥ 1 − ξi and ξi≥ 0; i = 1, · · · , n,

where λ is a tuning parameter that controls balances between data fitting and the model complexity, and π ∈ (0, 1) is a constant that determines the relative importance of the two classes. Here f (x) = β0+β^>x is referred to as a decision function whose sign predicts output label associated with x. Notice that (2.1) reduces to the conventional WSVM when π = 0.5.

It can be easily shown that (2.1) is equivalent to the problem of minimizing (1 − π) X

i∈I₊

[1 − y_if (x_i)]₊+ πX

i∈I−

[1 − y_if (x_i)]₊+ λkβk²₂

over β₀, β. We remark that the WSVM can be viewed as an example of L₂-penalized learning algorithm. Now, the objective function of the L₀-penalized WSVM is naturally given by

(1 − π) X

i∈I₊

[1 − y_if (x_i)]₊+ π X

i∈I−

[1 − y_if (x_i)]₊+ λkβk₀. (2.2)

(3)

which is equivalent to solve

min_b,β λkβk₀+ (1 − π)X

i∈I₊

ξ_i+ π X

i∈I−

ξ_i

subject to y_i(b + β^>x_i) ≥ 1 − ξ_i and ξ_i≥ 0; i = 1, · · · , n.

Figure 2.1 compares L₀- and L₂-penalties, which explains why (2.2) is difficult to solve. That is, the L₀-penalty shows discontinuity and non-differentiability compared to the smooth L₂- penalty.

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

0.00.51.01.52.0

β

|β|

●

L2

L0

Figure 2.1 Comparison of L2and L0-penalties: L0-penalty is not only discontinuous but also not differentiable. This explains why the L0-penalized problem is computationally difficult to solve compared

to the L2-penalized one.

3. Adaptive ridge procedure

Frommlet and Nuel (2016) considered the following general Lq-penalized problem using a linear model f (x) = β₀+ β^Tx:

min

β₀,β n

X

i=1

L(y_i, f (x_i)) + λkβk_q (3.1)

where kβkq denotes the Lq-norm of β. As discussed in the introduction section, it is very difficult to optimize the above problem (3.1) when q = 0. The idea in Frommlet and Nuel (2016) is to consider approximating the objective function in (3.1) by a sequence of problems which can be dealt much easier. More precisely, suppose a sequence of β^(t) =

β^(t)₁ , · · · , βp^(t)

^T , which converges to β as t goes to infinity, and consider a sequence of weights

w^(t)_j,q=

|β_j^(t)|^γ+ δ^γ(q−2)/γ

(4)

and let W^(t)q = diag{w^(t)_j,q, j = 1, · · · , p}. This choice is motivated by the following:

t→∞lim n

β^(t)TW^(t−1)_q β^(t)o

=

p

X

j=1

β_j²

(|β_j|^γ+ δ^γ)^(2−q)/γ ≈ kβkq, (3.2) with sufficiently small δ > 0 (Huang et al., 2008; Frommlet and Nuel, 2016). The constant δ is introduced in order that we have stable solutions of the algorithm, and δ can be prespecified at a small value, for example δ = 1.0e-5. We remark that the above approximation holds for any values of γ > 0 and Frommlet and Nuel (2016) numerically compares the effect of γ under L_q-penalized linear regression model. In this article, we use γ = 1. Plugging the sequence in (3.2) into (2.2), we have the following iterative optimization problem: for the t^thiteration, β^(t)₀ and β^(t) can be updated by solving

β₀^(t), β^(t)

= argmin

β0,β

(1 − π)X

i∈I+

[1 − yif (xi)]++ π X

i∈I−

[1 − yif (xi)]++ λβ^TW^(t−1)₀ β.

(3.3) where W^(t−1)₀ = diag{(|β_j^(t−1)| + δ)⁻²}. In fact, the objective function in (3.3) can be understood differently. Let us consider a simple linear transformation ν = {W^(t−1)₀ }^1/2β and u = {W₀^(t−1)}^−1/2x. Then the optimization in (3.3) is equivalent to solve the conventional WSVM problem with respect to (ui, yi) to find the corresponding hyperplane f (u) = β0+ ν^Tu. After we solve this WSVM problem, we can let β = {W^(t−1)}^−1/2ν. That is, we can convert the L0-penalized WSVM (2.2) to the iterative WSVM, which is computationally more efficient.

4. Numerical illustration

4.1. Simulated example

In order to provide a concrete picture, we randomly generate (x_i, y_i), i = 1, · · · , n with n = 100 and p = 10 from the following simple model:

yi= sign{β0+ β^Txi+ i}, i = 1, · · · , n

where β = (1, 1, 0, · · · , 0), x_i ^iid∼ N (0p, I_p), and _i ^iid∼ N (0, 0.2²). Notice that the model is quite simple, and β is sparse. This implies that this model favors the sparsity-pursuing penalty such as L₀- and L₁-penalty. We consider β₀ ∈ {−1, 0, 1} and the following Figure 4.1 depicts scatter plots of x_i1 and x_i2 for simulated data with three different values of β₀. The (green) dashed lines represent the true hyperplane {(x₁, x₂) : f (x) = 0}. Assuming that the importance of the class is proportional to the reciprocal of the relative frequency of each class, we can regard the three cases (a) – (c) in Figure 4.1 as the positive, balanced, and negative scenarios, respectively.

Under the positive scenario with β0= −1, we apply the WSVM with π = 0.25. Similarly, we apply the WSVM with π = 0.5 (i.e., the unweighted SVM) and π = 0.75 under the

(5)

x1 x2

●

● ●

●

●● ●

●

● ●

●

−2 −1 0 1 2

−2−1012

(a) postive

x1 x2

● ●

●

● ●

●

−2 −1 0 1 2

−2−1012

(b) balanced

x1

x2 _●

●

−2 −1 0 1 2

−2−1012

(c) negative Figure 4.1 Illustrations of the simulated data sets with different values of β0.

balanced (β₀ = 0) and negative (β₀ = 1) scenarios, respectively. Finally, the proposed L₀- penalized WSVM is compared to the conventional L2-penalized in terms of misclassification error rates and total costs for independent test data sets, respectively defined as follows:

Error : 1 nts

nts

X

k=1

1n

sign{ ˆf (x_k)} 6= y_ko

Cost :

nts

X

k=1

(1 − π) · 1 {y^k = 1} · 1n

sign{ ˆf (xk) = −1o

+ π · 1 {y^k= −1} · 1n

sign{ ˆf (xk) = 1o

where (xk, yk), k = 1, · · · , ntsare independent test observations from the same model above.

We set nts= 1, 000. We numerically compared the performance based on 100 independent repetitions and Figure 4.2 depicts the box plots of difference between L2-penalized and L0- penalized WSVM results in terms of (a) misclassification errors and (b) total cost. Positive values imply the superior performance of the the proposed L₀-penalized WSVM to the L₂- penalized one. The result is not surprising since the true model is sparse. This simple toy simulation clearly motivates us to develop L₀-penalized WSVM when we have a sparse β.

4.2. Wisconsin breast cancer data

Table 4.1 Classification Tables for different methods for WDBC data: WSVM imposes different weight for different type of error. Since we assume that false negative is much more important in the cancer classification, WSVM provides much lower false negatives than SVM. L0-penalized methods shows

improved results compared to the L2-penalized ones.

L2-Penalized L0-Penalized

SVM True

Pred.

Negative Positive

True

Pred.

Negative Positive

π = 0.5 Negative 350 12 Negative 353 4

Positive 7 200 Positive 4 208

WSVM True

Pred.

Negative Positive

True

Pred.

Negative Positive

π = 0.1 Negative 330 27 Negative 336 21

Positive 2 210 Positive 2 210

(6)

●

●●

●

Test Error Difference

Positive Balanced Negative

−0.020.000.010.020.030.040.05

(a) Error

●

●●

●

Total Cost Difference

Positive Balanced Negative

−505101520

(b) Cost

Figure 4.2 Box plots of differences between L2-penalized and L0-penalized WSVM results in terms of (a) misclassification errors and (b) total cost. Positive values indicate superior performance of the L0-penalized

WSVM. The result is not surprising since the true model is sparse.

Table 4.2 Comparison of misclassification error rates and total costs for different methods for WDBC data: WSVM outperforms SVM in terms of the total cost. L0-penalized methods consistently better than

the L2-penalized ones.

Error Rate Total Cost

L2-penalized L0-penalized L2-penalized L0-penalized

SVM 3.34% 1.41% 11.5 4.0

WSVM 5.10% 4.04% 4.5 3.9

We now give an illustration using the Wisconsin Diagnostic Breast Cancer data available at the UCI machine learning repository website (http://archive.ics.uci.edu/ml/index.

html). The WDBC data record diagnosis results of breast cancer (1 for positive and −1 for negative for the breast cancer) for 569 subjects. For each subject, ten features of cell nuclei are measured from a digital image of a fine needle aspirate (FNA) of a breast mass. The mean, standard error, and the largest value are computed for each feature, leading to 30 predictors in total. Cancer is a rare disease and the false negative (i.e., fail to diagnose the cancer for the real patients) is much more serious than the false positive (i.e., misdiagnosis for cancer-free subjects). Suppose the false negative is 9 time more important than the false positive, meaning that the cost of false negatives is 9-times larger than that of the false positives. We therefore consider the WSVM with π = 0.1 to find a decision function for the breast cancer. The tuning parameter λ is selected by the ten-fold cross-validation and set it to 2⁻⁴. The L0-penalized WSVM selects 6 variables and the obtained decision function is given by

f (x) = − 10.368 − 0.601X1− 0.250X12+ 0.066X14+ 0.213X22+ 0.011X24+ 27.828X28

where X1 is the mean radius of the feature; X12 and X22 are the standard deviation and the largest value of texture which is defined by the standard deviation of gray-scale values, respectively; X14 and X24 are the standard deviation and the largest value of area of the feature; and X₂₈is the largest number of concave portions of the contour of the feature. We remark that the decision function from L₀-penalized WSVM is much easier to interpret (than

(7)

that resulted from other non-sparse penalization methods) the relation between the response and the predictors due to the attained sparsity. We also compare the performance of the L0-penalized WSVM (with π = 0.1) to L2-penalized WSVM, and both L2- and L0-penalized SVM (i.e. WSVM with π = 0.5). The following Table 4.1 contains misclassification table and Table 4.2 reports the corresponding error rates and total costs for different methods.

We assume that the cost of false negative is 0.9 and that of false positive is 0.1. The SVMs shows much lower error rate than WSVMs but larger cost which motivates the use of the WSVM. L0-penalized WSVM outperforms the conventional L2-penalized SVM in terms of both error rates and total costs.

5. Conclusion

In this paper, we have studied L0-penalized WSVM which has not been widely used due to the computational difficulty induced by the L0-penalty. We proposed to exploit the AR procedure to solve the L0-penalized WSVM, which enables us to convert the L0-penalized WSVM to the iterative L2-penalized WSVM which is extremely simple to implement. Nu- merical investigation supports the promising performance of the L0-penalized WSVM. The WSVM can also be used to estimate the class-probability P (Y = 1|X = x) by solving a sequence of WSVMs with different π’s ∈ (0, 1) (Wang et al., 2007). We remark that the class-probability estimator using L0-penalization can be naturally considered by applying the AR procedure. This will be one of interesting further research topic.

References

Amaldi, E. and Kann, V. (1998). On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209, 237-260.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.

Journal of the American statistical Association, 96, 1348-1360.

Frommlet, F. and Nuel, G. (2016). An adaptive ridge procedure for l0 regularization. PloS One, 11, e0148620.

Fung, G. M., Mangasarian, O. L. and Smola, A. J. (2002). Minimal kernel classifiers. Journal of Machine Learning Research, 3, 303-321.

Huang, K., King, I. and Lyu, M. R. (2008). Direct zero-norm optimization for feature selection. Eighth IEEE International Conference, 845-850.

Kim, K. H., Shin, S. J., Hwang, C. and Shim, J. (2017). Geographically weighted least squares-support vector machine. Journal of the Korean Data & Information Science Society, 28, 227-235.

Lin, Y., Lee, Y. and Wahba, G. (2002). Support vector machines for classification in nonstandard situations.

Machine Learning, 46, 191-202.

Shim, J. and Seok, K. (2014). A transductive least squares support vector machine with the difference convex algorithm. Journal of the Korean Data & Information Science Society, 25, 455-464.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1, 267-288.

Vapnik, V. (2013). The nature of statistical learning theory, Springer Science & Business Media.

Wang, J., X. Shen and Y. Liu (2007). Probability estimation for large-margin classifiers. Biometrika, 95, 149-167.

Weston, J., A. Elisseeff, B. Scholkopf and M. Tipping (2003). Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3, 1439-1461.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894-942.

Zhang, H. H. and W. Lu (2007). Adaptive lasso for coxs proportional hazards model. Biometrika, 94, 691-703.

(8)

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.