Variance function estimation with LS-SVM for replicated data
Jooyong Shim 1 · Hyejung Park 2 · Kyung Ha Seok 3
1 Department of Applied Statistics, Catholic University of Daegu
2 Division of General Education and Teacher’s Certification, Daegu University
3 Department of Data Science, Institute of Statistical Information, Inje University
Received 12 July 2009, revised 6 September 2009, accepted 12 September 2009
Abstract
In this paper we propose a variance function estimation method for replicated data based on averages of squared residuals obtained from estimated mean function by the least squares support vector machine. Newton-Raphson method is used to obtain as- sociated parameter vector for the variance function estimation. Furthermore, the cross validation functions are introduced to select the hyper-parameters which affect the per- formance of the proposed estimation method. Experimental results are then presented which illustrate the performance of the proposed procedure.
Keywords: Cross validation function, heteroscedasticity, hyper-parameters, kernel func- tion, least squares support vector machine.
1. Introduction
It becomes an important issue in many fields to measure volatility or local variability, which are usually modelled in terms of variance functions. Researches on estimation of variance function can be found in Anderson and Lund (1997), and Liu et al. (2007) and most of them are focused on heteroscedastic error problems.
In this paper we propose a variance function estimation method based on kernel trick (Vapnik, 1998; Suykens and Vanderwalle, 1999) for the heteroscedastic regression problem based on the average of squared residuals which are assumed to follow a gamma distribution.
The kernel trick is a method for using a linear model to solve a nonlinear problem by mapping the input space into a higher-dimensional feature space. This is done by using Mercer’s theorem (1909). The kernel trick has been applied to the regression problems of various data types (Shim and Seok, 2008; Shim et al., 2009). The estimated variance functions are
1
Adjunct Professor, Department of Applied Statistics, Catholic University of Daegu, Gyungbuk 712-702, Korea.
2
Visiting Professor, Computer Course Division of General Education and Teacher’s Certification, Daegu University, Gyungbuk 712-714, Korea.
3
Corresponding author: Professor, Department of Data Science, Institute of Statistical Information,
Inje University, Kimhae 621-749, Korea. E-mail: [email protected]
obtained by minimizing the penalized log-likelihood function of averages of squared residuals.
Here averages of squared residuals are obtained from estimated mean function by applying LS-SVM (least squares support vector machine), (Suykens and Vanderwalle, 1999) with replicated data where errors assumed to follow a normal distribution. The basic idea and applications of LS-SVM can be found in Suykens et al . (2000, 2002). The proposed method enables to select appropriate hyper-parameters easily from the generalized cross validation (GCV) function and the generalized approximate cross validation (GACV) function for mean and variance function estimation, respectively, which are used to select hyper-parameters for the achievement of high generalization performance. The rest of this paper is organized as follows. In Section 2 we propose the mean and variance function estimations method using the principal idea of kernel machine. In Section 3 we present the model selection method using GCV function and GACV function. In Section 4 we perform the numerical studies through examples. In Section 5 we give the conclusions.
2. Mean and variance functions estimation
Consider the heteroscedastic regression model with n observations with m i replicates as follows,
z ij = µ i + ij , j = 1, · · · , m i , i = 1, · · · , n,
where ij follows a normal distribution (0, e f (x
i) ) with an unknown nonlinear function f (x i ), the x i ∈ R d is an input vector including a constant 1 and µ i = µ(x i ) is a mean function of z ij ’s for i = 1, · · · , n such that,
µ(x i ) = E(z ij |x i ).
Here 1 m i
P m
ij=1 2 ij follows independently gamma distribution ( m i 2 , 2
m i
e f (x
i) ) for i = 1, · · · , n, which provides the justification that we can assume that the average of squared residuals, y i = 1
m i P m
ij=1 (z ij − µ(x b i )) 2 also follows independently gamma distribution ( m i
2 , 2
m i
e f (x
i) ) for i = 1, · · · , n. Here we define the variance function as
V ar(z ij |x i ) = E(y i |x i ) = e f (x
i) .
To obtain estimator of mean function, µ(x b i ), we apply LS-SVM as follows,
min 1
2 w 0 µ w µ + C 2
X
i,j
e 2 ij
subject to z ij − w 0 µ φ µ (x i ) = e ij , j = 1, · · · , m i , i = 1, · · · , n, where C is a nonnegative
regularization parameter which controls the trade-off between the goodness-of-fit on the
data and k w µ k 2 , w µ is a weight vector which is used for µ(x b i ) = w 0 µ φ µ (x i ). φ µ (·) is
the feature mapping function such that φ µ (·) : R d − > R d
fmaps the input space to the
higher dimensional feature space where the dimension d f is defined in an implicit way.
It is known that φ µ (x i ) 0 φ µ (x j ) = K µ (x i , x j ) which are obtained from the application of Mercer’s conditions (1909). Then b µ(x i ) can be expressed with the optimal values of Lagrange multipliers, β ij ’s, which are the solution of the linear system,
β = (U K b µ U 0 + I/C) −1 z,
where K µ = K µ (x, x), x is a n × d matrix of x i ’s, U is a N × n block diagonal matrix consisted of 1 m
i×1 with N = P n
i=1 m i , z = (z 11 , z 12 , · · · , z n,m
n) 0 is a N × 1 vector of z ij ’s.
µ(x b i ) can be expressed with b β as follows,
µ(x b i ) = K µ (x i , x)U 0 β. b For y i = 1
m i P m
ij=1 (z ij − b µ(x i )) 2 , the negative log-likelihood of the given data can be expressed as (a constant term is omitted) under the assumption that y i follows independently gamma distribution ( m i
2 , 2
m i e f (x
i) ) for i = 1, · · · , n, L(f ) = 1
n
n
X
i=1
(m i y i e −f (x
i) + m i f (x i )).
The nonlinear function f (x i ) can be estimated by a linear model, f (x i ) = ω 0 φ(x i ), con- ducted in a high dimensional feature space. Then the estimate of parameter vector satis- fying f (x i ) = ω 0 φ(x i ) for i = 1, · · · , n is obtained by minimizing the penalized negative log-likelihood,
L(ω) =
n
X
i=1
(m i y i e −ω
0φ(x
i) + m i ω 0 φ(x i )) + λ
2 k ω k 2 , (2.1)
where λ is a nonnegative regularization parameter which controls the trade-off between the goodness-of-fit on the data and k ω k 2 and φ µ (·) is the feature mapping function. The representer theorem (Kimeldorf and Wahba, 1971) guarantees the minimizer of the penalized negative log-likelihood to be f (x i ) = K i α for some n × 1 vector α, where K i is the i -th row of the n × n kernel matrix K. Now the penalized negative log-likelihood (2.1) becomes
L(α) = (m 0 Y e −Kα + m 0 Kα) + λ
2 α 0 Kα, (2.2)
where m = (m 1 , · · · , m n ) 0 , Y is a diagonal matrix of y, and e is the componentwise ex- ponential function. By minimizing the penalized negative log-likelihood (2.2) we obtain the estimate of parameter vector α, but not in a explicit form, which leads to use the Newton- Raphson method. At each iteration the parameter vector α is updated as follows,
α b new = α − H b −1 G,
where G is the gradient vector and H is the Hessian matrix of (2.2).
With the estimate of parameter vector α, the estimated variance function for the input b vector x i is obtained as,
V ar(z b ij |x i ) = e f (x b
i) = e K
iα b ,
where K i is the 1 × n row vector with elements K(x i , x j ), j = 1, · · · , n.
3. Model selection
The functional structures of the estimation method of mean function and variance function are characterized by hyper-parameters, the regularization parameter C or λ and the kernel parameter.
Hyper-parameters θ µ in the mean function estimation can be chosen by minimizing the cross validation (CV) function as follows:
CV (θ) = 1 N
n
X
i=1 m
iX
j=1
(z ij − µ b (−ij) θ
µ
(x i )), where µ b (−ij) θ
µ
(x i ) is the estimate of µ(x i ) estimated data without (i, j) th observation. Since for each candidates of hyperparameters, µ b (−ij) θ
µ