Geographically weighted least squares-support vector machine †
Changha Hwang 1 · Jooyong Shim 2
1 Department of Applied Statistics, Dankook University
2 Department of Statistics, Inje University
Received 26 December 2016, revised 7 January 2017, accepted 10 January 2017
Abstract
When the spatial information of each location is given specifically as coordinates it is popular to use the geographically weighted regression to incorporate the spatial information by assuming that the regression parameters vary spatially across locations.
In this paper, we relax the linearity assumption of geographically weighted regression and propose a geographically weighted least squares-support vector machine for esti- mating geographically weighted mean by using the basic concept of kernel machines.
Generalized cross validation function is induced for the model selection. Numerical stud- ies with real datasets have been conducted to compare the performance of proposed method with other methods for predicting geographically weighted mean.
Keywords: Generalized cross validation function, geographically weighted regression, kernel machine, least squares-support vector machine, model selection, spatial informa- tion.
1. Introduction
Geographical weighted regression (GWR) model was proposed first by Fotheringham et al. (1996) for the study with respect to the spatial heterogeneity, which is an alternative ap- proach to the spatial autoregressive regression (Anselin, 1992). GWR extends the classical regression model to allow local rather than global regression parameters to be estimated by incorporating the spatial information to satisfy the assumption that the regression param- eters vary spatially across locations of interest. Brunsdon and Fotheringham (1999) have studied the relationship between the house price and the area, and mentioned several ques- tions GWR faced: the selection of the input variables, bandwidth parameter, and the spatial autocorrelation of errors. Zhang (2004) utilized GWR model to study the height of crown
† This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF- 2015R1D1A1A01056582, NRF-2014R1A1A2054917). This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF- 2015S1A3A2046715).
1
Professor, Department of Applied Statistics, Dankook University, Yongin 16890, Korea.
2
Corresponding author: Adjunct professor, Institute of Statistical Information, Department of Statistics,
Inje University, Kimhae 50834, Korea. E-mail: [email protected]
and the result showed that GWR could bring better significance and residual error than the ordinary least squares regression estimation.
Support vector machine (SVM), first developed by Vapnik (1995) and his group at AT&T Bell Laboratories, is known to solve the weak point of the artificial neural network (Rosen- blatt, 1958) such as the local minima existence in the area of structural risk minimization and statistical learning theory. SVM has been successfully applied to lots of real world prob- lems related to regression and classification problems. One of prominent advantages of SVM is the use of kernels to utilize the nonlinear transforms without knowing the specific trans- forms. According to this idea, other authors proposed a class of kernel-based algorithms, such as kernel Fisher discriminant analysis (Mika et al., 1999), least square-SVM (LS-SVM, Suykens and Vandewalle, 1999), kernel ridge regression (Saunders et al., 1998) and the ker- nel minimum squared error model (Xu et al., 2001). LS-SVM is the modified version of SVM in least squares senses. The kernel minimum squared error model is a generalization of the conventional minimum squared error model to yield a new type of nonlinear model, which was devised by using the theory of reproducing kernels and adding different penalty terms. Xu et al. (2001) argued that LS-SVM can be viewed as a special case of the kernel minimum squared error model. Smola et al. (1998) developed a semiparametric SVM which is useful in the case where the domain knowledge exists about functions to be estimated or emphasis is put onto comprehensibility of the given model. Shim et al. (2011) proposed a semiparametric LS-SVM for accelerated failure time model. LS-SVM has been extended to recurrent models and use in optimal control problems. See for further details, Suykens and Vandewalle (1999), Suykens et al. (2001), Hwang and Shim (2016) and Hwang et al. (2016).
In this paper we present the geographically weighted LS-SVM (GWLS-SVM) that com- bines ideas from the architectures of GWR with those of LS-SVM. GWLS-SVM can be applied efficiently when the functional form of the relationship between the response and input variables is left unspecified.
The rest of this paper is organized as follows. In Section 2 we review GWR briefly. In Section 3 we propose the geographically weighted LS-SVM (GWLS-SVM) for geographically weighted mean estimation and induce the generalized cross validation function. In Section 4 and 5 we present the case studies and conclusion, respectively.
2. Geographically weighted regression
Given a training dataset {x i , u i , y i } n i=1 with each input vector x i ∈ R d , u i ∈ R 2 is the spatial coordinate vector (longitude, latitude) and corresponding response y i ∈ R, GWR can be represented as follows:
y i = β 0 (u i ) +
d
X
k=1
β k (u i )x ik + e i ,
where β 0 (u i ) and β k (u i )’s are the bias (intercept) and the slope parameter which depend on the ith spatial coordinate vector u i .
In GWR, the weight of the ith observation is affected by the proximity to the spatial
coordinates u i , which leads the weight of the ith observation varies along with the change of
u i , and y = (y 1 , · · · , y n ) 0 given x at location u j is assumed to follow a normal distribution
N (0, W j −2 ) so that the negative log-likelihood can be expressed as follows:
L j =
n
X
i=1
w ji (y i − β 0 (u i ) −
d
X
k=1
β k (u i )x ik ) 2 , (2.1)
where w ji is the weight function which is the decreasing in the distance from the jth location u j to the ith location u i , and W j is the diagonal matrix composed of w ji ’s.
Generally the exponential function of the distance from u j to other location u i is used for the weighted function, which is w ji = exp(−d ji /h), where h > 0 is the bandwidth parameter.
From the negative log-likelihood (2.1), the estimators of β(u j ) = (β 0 (u j ), · · · , β d (u j )) 0 as follows:
β(u b j ) = (X 0 W j X) −1 X 0 W j y,
where X = {1, x i } n i=1 is the n × (d + 1) input matrix and y = (y 1 , · · · , y n ) 0 . The estimated regression function given (x j , u j ) can be obtained as
f (x b j , u j ) = X j β(u b j ) = H j y, where H j = X j (X 0 W j X) −1 X 0 W j .
The predicted regression function given x t at new location u t is given as follows:
f (x b t , u t ) = H t y,
where H t = X t (X 0 W t X) −1 X 0 W t , W t is the diagonal matrix of (w t1 , · · · , w tn ), w ti = exp(−d ti /h), d ti is the distance between location u t and u i for i = 1, · · · , n.
The performance of GWR is affected by the bandwidth parameter in the weight function.
To select the optimal values of the bandwidth parameter, we consider the leave-one-out cross validation (LOO-CV) function as follows:
CV (θ) = 1 n
n
X
i=1
(y i − b f i (−i) (θ)) 2 ,
where θ is a candidate set of bandwidth parameters and b f i (−i) (θ) is the predicted value of f (x i , u i ) obtained from data without the ith observation, which can be obtained as follows:
f (x b i , u i ) (−i) = X i (X (−i)
0W i (−i) X (−i) ) −1 X (−i)
0W i (−i) y (−i) ,
where X (−i) = (X 0 1 , · · · , X 0 i−1 , X 0 i+1 , · · · , X 0 n ) 0 , W i (−i) is the diagonal matrix of (w i1 , · · · , w i,i−1 , w i,i+1 , · · · , w 1n ) and y (−i) = (y 1 , · · · , y i−1 , y i+1 , · · · , y n ) 0 .
Since for each candidate set of bandwidth parameters, b f i (−i) (θ) for i = 1, · · · , n, should be evaluated, selecting parameters using LOO-CV function is computationally burdensome.
By using leaving-out-one lemma (Wahba, 1990) and the first order Taylor expansion, the ordinary cross validation function is obtained as follows:
OCV (θ) = 1 n
n
X
i=1
1 − b f i (θ) 1 − ∂ b ∂y f
ii
2
= 1 n
n
X
i=1
1 − b f i (θ) 1 − h ii (θ)
! 2
, (2.2)
where h ii (θ) for i = 1, · · · , n, is the ith diagonal element of the hat matrix H such that f = Hy, which is composed of H b j ’s in (2.4) such that H = (H 1 0 , · · · , H n 0 ) 0 . By averaging the residuals in (2.2) by (1 − trace(H)/n), the generalized cross validation (GCV) function is obtained as follows:
GCV (θ) = n
n
P
i=1
(1 − b f i (θ)) 2 (n − trace(H)) 2 .
3. Geographically weighted LS-SVM
3.1. LS-SVM for regression
LS-SVM has been successfully applied to statistical problems like regression and classi- fication. The LS-SVM model for regression estimation has the following representation in feature space,
f (x) = ω 0 φ(x) + b, where ω ∈ R d
fis a weight vector corresponding to φ(x).
Given a training dataset {x i , y i } n i=1 with each input x i ∈ R d and the corresponding response y i ∈ R, the optimization problem in the primal weight space is considered as follows:
L(ω, b, e) = 1
2 ω 0 ω + C 2
n
X
i=1
e 2 i
subject to equality constraints
y i = ω 0 φ(x i ) + b + e i , i = 1, · · · , n.
The cost function with squared error and penalty corresponds to a form of ridge regression.
To find minimizers of the objective function, we can construct the Lagrangian function as follows:
L(ω, b, e; α) = 1
2 ω 0 ω + C 2
n
X
i=1
e 2 i −
n
X
i=1
α i (ω 0 φ(x i ) + b + e i − y i )
where α i ’s are the Lagrange multipliers. Then, the conditions for optimality are given by
∂L
∂ω = 0 → ω =
n
X
i=1
α i φ(x i )
∂L
∂b = 0 →
n
X
i=1
α i = 0
∂L
∂e i = 0 → e i = 1
C α i , i = 1, · · · , n
∂L
∂α i = 0 → y i − b − w 0 φ(x i ) − e i , i = 1, · · · , n
After eliminating e i and w, we could have the solution by the following linear equations
K+ C 1 I 1 1 0 0
α b
= y 0
, (3.1)
where is 1 is a n × 1 vector of 1’s, I is a n × n identity matrix and K = K(x,x) is the kernel matrix constructed with x.
From the linear equation (3.1) the bias estimate and optimal values of Lagrangian mul- tipliers, b b and α b i ’s can be obtained. The predicted regression function given x t ∈ R d is obtained as
f (x b t ) = K(x t , x) α + b b b = H t y,
where H t = (K(x t , x), 1)H 0 , x = (x 1 , · · · , x n ) 0 ∈ R n×d , y = (y 1 , · · · , y n ) 0 ∈ R n , K = K(x, x) and H 0 = (K +I/C)
−1−(K +I/C)
−11(1
0(K + I/C)
−11)1
0(K + I/C)
−1(1
0(K + I/C)
−11)
−11
0(K + I/C)
−1.
The performance of LS-SVM is affected by hyperparameters, which are the penalty param- eter and the kernel parameter. To select the optimal values of hyperparameters of LS-SVM, we consider LOO-CV function as follows:
CV (θ) = 1 n
n
X
i=1
(y i − b f i (−i) (θ)) 2 , (3.2)
where θ is a candidate set of hyper-parameters and b f i (−i) (θ) is the predicted value of f (x i ) obtained from data without the ith observation. From LOO-CV function in (3.2) GCV function is obtained as follows:
GCV (θ) = n
n
P
i=1
(1 − b f i (θ)) 2 (n − trace(H)) 2 ,
where b f i (θ) is the estimated regression function given x i and H = H(x, x) is the hat matrix such that b f = Hy.
3.2. Geographically weighted LS-SVM
Given a training dataset {x i , u i , y i } n i=1 with each input vector x i ∈ R d , u i ∈ R 2 (longi- tude, latitude) and corresponding response y i ∈ R. We want predict the regression function given x t at location u j as follows:
f (x t , u j ) = ω 0 j φ(x t ) + b j .
We assume that y given x at location u j follows a normal distribution N (0, W j −2 ) so that the penalized negative log-likelihood can be expressed as follows:
L j =
n
X
i=1
w ji (y i − ω 0 j φ(x i ) − b j ) 2 + λ
2 ||ω j || 2 , (3.3)
where λ > 0 is a penalty parameter, W j is the diagonal matrix of weight, w ji is the weight function which is an exponential function of the distance between u j and u i .
According to the representation theorem by Kimeldorf and Wahba (1971), the optimal values of ω j can be written as expansions over training observations, such that
ω 0 j φ(x i ) = K(x i , x)α .j
for some weight vector α .j . The penalized negative log-likelihood (3.3) can be reexpressed as follows:
L =
n
X
i=1
w ji (y i − K i α .j − b j ) 2 + λ
2 α 0 .j Kα .j (3.4)
where K = K(x.x) and K i is the ith row of K .
Taking derivative of (3.4) with respect to (α .j , b j ) , estimate of (α .j , b j ) can be obtained from the linear equations as follows:
α b .j
b b j
= W j K + λI W j 1 1 0 W j K 1 0 W j 1
−1
W j
1 0 W j
y = S j y.
Then the estimated regression function given x j at location u j is given as follows:
f (x b j , u j ) = K(x j , x)α .j + b j = (K(x j , x), 1)S j y, (3.5) From (3.5) the estimated regression function vector of training data can be expressed as follows:
f = b
f (x b 1 , u 1 ) f (x b 2 , u 2 )
.. . f (x b n , u n )
=
H(x 1 , u 1 ) H(x 2 , u 2 )
.. . H(x n , u n )
y = e Hy, (3.6)
where H(x j , u j ) = K(x j , x)S j for j = 1, · · · , n.
Using (3.6) the predicted regression function given x t at new location u t is given as follows:
f (x b t , u t ) = (K(x t , x), 1)S t y, where S t = W
t
K + λI W
t1 1
0W
tK 1
0W
t1
−1 W
t
1
0W
t, W t is the diagonal matrix of weight function w ti = exp(−d ti /h), d ti is the distance between location u t and u i for i = 1, · · · , n. The functional structures of the GWLS-SVM is characterized by the hyperparameters, which are the bandwidth parameter of the weight function, penalty parameter and kernel param- eter. To choose optimal values of hyperparameters we use LOO-CV function as follows:
CV (θ) = 1 n
n
X
i=1
(y i − b f i (−i) (θ)) 2 ,
where θ is a set of hyper-parameters, that is, kernel and penalty parameters, and b f i (−i) (θ) is the predicted value of f (x i ) obtained from data without the ith observation. For example f b 1 (−1) is obtained as follows:
f b 1 (−1) = (K(x 1 , x (−1) ), 1)S 1 (−1) y (−1) ,
where x (−1) = { x i } n i=2 , S 1 (−1) = W
(−1)1