Bootstrap confidence intervals for classification error rate in circular models when a block
of observations is missing
Hie-Choon Chung 1 · Chien-Pai Han 2
1 Department of e-business, Gwangju University
2 Department of Mathematics, University of Texas at Arlington
Received 9 May 2009, revised 17 July 2009, accepted 20 July 2009
Abstract
In discriminant analysis, we consider a special missing-data pattern in which a block of observations is missing. We assume that the two populations are equally likely and that the costs of misclassification are equal. In this setting, we study bootstrap confidence intervals for the error rate in circular models, both when the covariance matrices are equal and when they are not equal.
Keywords: Block of missing observations, bootstrap confidence interval, circular model, error rate, linear combination classification statistic, Monte Carlo study.
1. Introduction
In discriminant analysis the problem is to classify a $p \times 1$ observation $X$ of unknown origin into one of several distinct populations using an appropriate classification rule. In this paper we assume that there are two distinct populations, both multivariate normal, that the two populations are equally likely, and that the costs of misclassification are equal. The classification rule depends on whether or not the training samples include missing values. Assuming that the covariance matrices are circular, we apply a transformation that reduces the circular matrices to canonical form. The discriminant function is given for multivariate normal populations with different circular covariance matrices, and the linear combination statistic is used when a block of observations is missing. We consider bootstrap confidence intervals for the error rate in circular models when the covariance matrices are equal and when they are not equal.
1 Corresponding author: Associate professor, Department of e-business, Gwangju University, Gwangju 503-703, Korea. E-mail: [email protected]
2 Professor, Department of Mathematics, University of Texas at Arlington, TX 76019-0408, U.S.A.
2. Discriminant analysis in circular models
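Let $X$, a $p \times 1$ vector, be an observation which is known to have come from one of two multivariate normal populations. Denote the $i$th population by $\pi_i$, which is $N(\mu_i, \Sigma_i)$ for $i = 1, 2$. We assume that $\Sigma_i$ is positive definite and circular, i.e. $\Sigma_i$ is of the form
$$
\Sigma_i = \sigma_i^2
\begin{pmatrix}
1 & \rho_1 & \rho_2 & \rho_3 & \cdots & \rho_2 & \rho_1 \\
\rho_1 & 1 & \rho_1 & \rho_2 & \cdots & \rho_3 & \rho_2 \\
\rho_2 & \rho_1 & 1 & \rho_1 & \cdots & \rho_4 & \rho_3 \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
\rho_1 & \rho_2 & \rho_3 & \rho_4 & \cdots & \rho_1 & 1
\end{pmatrix}.
\tag{2.1}
$$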
Wise (1955) showed that the matrix in (2.1) can be transformed into canonical form. Thus there exists an orthogonal matrix $L$ with $(m, n)$th element
$$
l_{mn} = p^{-1/2}\left[\cos\frac{2\pi}{p}(m-1)(n-1) + \sin\frac{2\pi}{p}(m-1)(n-1)\right]
$$
such that $L'\Sigma_i L = \mathrm{diag}(\sigma_{i1}^2, \sigma_{i2}^2, \ldots, \sigma_{ip}^2)$. Since $L$ does not depend on the elements of $\Sigma_1$ and $\Sigma_2$, the discriminant function is equivalent to the one obtained when the covariance matrices are diagonal.
This holds because the discriminant function derived by the likelihood ratio procedure is invariant under any nonsingular linear transformation of the observations.
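The reduction to canonical form can be checked numerically. The following is a minimal sketch in Python with NumPy; the helper names `circular_matrix` and `wise_matrix`, and the example values $p = 6$, $\sigma^2 = 2$, $\rho = (0.5, 0.2, 0.1)$, are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def circular_matrix(first_row):
    """Build a symmetric circular matrix from its first row.

    The first row must satisfy r[j] == r[p - j] for j = 1, ..., p - 1,
    i.e. the pattern 1, rho_1, rho_2, ..., rho_2, rho_1 of (2.1).
    """
    r = np.asarray(first_row, dtype=float)
    p = r.size
    if not np.allclose(r[1:], r[1:][::-1]):
        raise ValueError("first row does not have the symmetric circular pattern")
    # Element (m, n) depends only on (n - m) mod p.
    idx = (np.arange(p)[None, :] - np.arange(p)[:, None]) % p
    return r[idx]

def wise_matrix(p):
    """Matrix L with l_mn = p^{-1/2} (cos t + sin t), t = 2*pi*(m-1)(n-1)/p."""
    k = np.arange(p)  # zero-based index, playing the role of m - 1 and n - 1
    theta = 2.0 * np.pi * np.outer(k, k) / p
    return (np.cos(theta) + np.sin(theta)) / np.sqrt(p)

# Hypothetical example values (assumptions of this sketch).
sigma = 2.0 * circular_matrix([1.0, 0.5, 0.2, 0.1, 0.2, 0.5])
L = wise_matrix(6)
canonical = L.T @ sigma @ L
off_diagonal = canonical - np.diag(np.diag(canonical))

print(np.round(np.diag(canonical), 6))  # candidate canonical variances
print(np.abs(off_diagonal).max())       # should be close to zero
```

The same `L` is used whatever the values in the first row, which is the property the argument above relies on.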
2.1. Discriminant function with complete data
Since the circular matrix can be transformed into canonical form, we may let $\Sigma_i = \mathrm{diag}(\sigma_{i1}^2, \sigma_{i2}^2, \ldots, \sigma_{ip}^2)$, $i = 1, 2$. Han (1970) derived the discriminant function by using the likelihood ratio procedure. It is proportional to
$$
(X - \mu_2)'\Sigma_2^{-1}(X - \mu_2) - (X - \mu_1)'\Sigma_1^{-1}(X - \mu_1).
$$
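Substituting $\mu_i$ and $\Sigma_i$ we obtain, apart from a constant,
$$
V = \sum_{j=1}^{p}\left(\frac{1}{\sigma_{2j}^2} - \frac{1}{\sigma_{1j}^2}\right)
\left(x_j - \frac{\mu_{2j}/\sigma_{2j}^2 - \mu_{1j}/\sigma_{1j}^2}{1/\sigma_{2j}^2 - 1/\sigma_{1j}^2}\right)^2,
\tag{2.2}
$$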
where $x_j$ and $\mu_{ij}$ are the $j$th components of $X$ and $\mu_i$, $i = 1, 2$, respectively. We classify $X$ into $\pi_1$ if $V > k$ and into $\pi_2$ if $V \le k$ for some suitable choice of the constant $k$. To find the distribution of $V$, we shall assume that $\sigma_{1j}^2 > \sigma_{2j}^2$ for all $j$, or equivalently that $\Sigma_1 - \Sigma_2$ is positive definite. Hence $1/\sigma_{2j}^2 - 1/\sigma_{1j}^2 > 0$. Let
$$
Z_j = \sqrt{\frac{1}{\sigma_{2j}^2} - \frac{1}{\sigma_{1j}^2}}
\left(x_j - \frac{\mu_{2j}/\sigma_{2j}^2 - \mu_{1j}/\sigma_{1j}^2}{1/\sigma_{2j}^2 - 1/\sigma_{1j}^2}\right).
$$
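Then $V = \sum_{j=1}^{p} Z_j^2$. When $X$ comes from $\pi_i$, $i = 1$ or $2$, the $Z_j$ are independently distributed as $N(\zeta_{ij}, \tau_{ij}^2)$, where
$$
\zeta_{ij} = \sqrt{\frac{1}{\sigma_{2j}^2} - \frac{1}{\sigma_{1j}^2}}
\left(\mu_{ij} - \frac{\mu_{2j}/\sigma_{2j}^2 - \mu_{1j}/\sigma_{1j}^2}{1/\sigma_{2j}^2 - 1/\sigma_{1j}^2}\right),
\qquad
\tau_{ij}^2 = \sigma_{ij}^2\left(\frac{1}{\sigma_{2j}^2} - \frac{1}{\sigma_{1j}^2}\right).
$$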
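As an illustration of the quantities just defined, the sketch below computes $Z_j$, $V$, $\zeta_{ij}$ and $\tau_{ij}^2$ for diagonal covariance matrices and checks that $V$ differs from the quadratic form difference only by a constant that does not depend on $x$. The function names and the parameter values are hypothetical and serve only as an example.

```python
import numpy as np

def quadratic_form_difference(x, mu1, var1, mu2, var2):
    """(x - mu2)' S2^{-1} (x - mu2) - (x - mu1)' S1^{-1} (x - mu1), diagonal S1, S2."""
    return np.sum((x - mu2) ** 2 / var2) - np.sum((x - mu1) ** 2 / var1)

def z_components(x, mu1, var1, mu2, var2):
    """Z_j of Section 2.1; requires var1 > var2 in every component."""
    a = 1.0 / var2 - 1.0 / var1
    if np.any(a <= 0):
        raise ValueError("requires var1 > var2 in every component")
    center = (mu2 / var2 - mu1 / var1) / a
    return np.sqrt(a) * (x - center)

def z_moments(i, mu1, var1, mu2, var2):
    """Return (zeta_ij, tau_ij^2) when X comes from population i (1 or 2)."""
    mu_i, var_i = (mu1, var1) if i == 1 else (mu2, var2)
    zeta = z_components(mu_i, mu1, var1, mu2, var2)  # Z_j evaluated at x = mu_i
    tau2 = var_i * (1.0 / var2 - 1.0 / var1)
    return zeta, tau2

# Hypothetical parameter values (assumptions of this sketch).
mu1, var1 = np.array([0.0, 0.0, 0.0]), np.array([4.0, 3.0, 2.0])
mu2, var2 = np.array([1.0, 0.5, -1.0]), np.array([1.0, 1.0, 1.0])

x_a = np.array([0.3, -1.2, 2.0])
x_b = np.array([-2.0, 0.7, 0.1])
v_a = np.sum(z_components(x_a, mu1, var1, mu2, var2) ** 2)
v_b = np.sum(z_components(x_b, mu1, var1, mu2, var2) ** 2)
const_a = quadratic_form_difference(x_a, mu1, var1, mu2, var2) - v_a
const_b = quadratic_form_difference(x_b, mu1, var1, mu2, var2) - v_b
print(const_a, const_b)  # the two constants should agree
```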
Therefore $V$ is distributed as the sum $\sum_{j} \tau_{ij}^2\,\chi'^{2}_{1}(\delta_{ij}^2)$, where $\chi'^{2}_{1}(\delta_{ij}^2)$ denotes a non-central $\chi^2$ variable with 1 degree of freedom and non-centrality parameter $\delta_{ij}^2 = \zeta_{ij}^2/\tau_{ij}^2$. It is not easy to obtain this distribution in closed form. Patnaik (1949) considered a $\chi^2$ approximation to the distribution of the sum, obtained by fitting the first two moments. Thus the distribution may be approximated by $c_\alpha \chi^2_{\nu_\alpha}$, where
$$
c_\alpha = \frac{\sum_j \tau_{ij}^4 + 2\sum_j \tau_{ij}^2\zeta_{ij}^2}{\sum_j \tau_{ij}^2 + \sum_j \zeta_{ij}^2},
\qquad
\nu_\alpha = \frac{1}{c_\alpha}\left(\sum_j \tau_{ij}^2 + \sum_j \zeta_{ij}^2\right).
$$
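The moment matching behind this approximation can be illustrated as follows. The sketch computes the scale and degrees of freedom from given $\tau_{ij}^2$ and $\zeta_{ij}$, and compares the implied mean $c_\alpha\nu_\alpha$ and variance $2c_\alpha^2\nu_\alpha$ with a simulation of $\sum_j Z_j^2$. The function name `patnaik_parameters`, the parameter values and the simulation size are assumptions made for the example.

```python
import numpy as np

def patnaik_parameters(tau2, zeta):
    """Scale c and degrees of freedom nu of the approximating c * chi^2_nu.

    tau2 : variances tau_j^2 of the Z_j
    zeta : means zeta_j of the Z_j
    """
    tau2 = np.asarray(tau2, dtype=float)
    zeta2 = np.asarray(zeta, dtype=float) ** 2
    mean = tau2.sum() + zeta2.sum()                        # E(sum of Z_j^2)
    c = (np.sum(tau2 ** 2) + 2.0 * np.sum(tau2 * zeta2)) / mean
    nu = mean / c
    return c, nu

# Hypothetical values (assumptions of this sketch).
tau2 = np.array([3.0, 2.0, 1.0])
zeta = np.array([0.5, -1.0, 2.0])
c, nu = patnaik_parameters(tau2, zeta)

# Simulate V = sum_j Z_j^2 with Z_j ~ N(zeta_j, tau_j^2), independent over j.
rng = np.random.default_rng(0)
z = rng.normal(loc=zeta, scale=np.sqrt(tau2), size=(200_000, 3))
v = np.sum(z ** 2, axis=1)

print(c * nu, v.mean())           # first moments
print(2.0 * c ** 2 * nu, v.var()) # second central moments
```

Only the first two moments are matched, so tail probabilities from the approximation can still differ from those of the exact distribution.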
2.2. Discriminant function with incomplete data
In this paper we consider a special pattern in which a block of observations is missing in circular models. Instead of estimating the parameters, we construct two different discriminant functions, one from the complete data and one from the incomplete data, and then use a linear combination of these two discriminant functions to obtain the classification rule.
When the populations are multivariate normal with a common covariance matrix, that is, $\pi_i : N(\mu_i, \Sigma)$, Chung and Han (2000) derived the linear combination statistic when a block of observations is missing.
Let us partition the $p \times 1$ observation $X$ as
$$
X = \begin{pmatrix} Y \\ Z \end{pmatrix},
$$
where $Y$ is a $k \times 1$ vector and $Z$ is a $(p - k) \times 1$ vector ($1 \le k < p$). Suppose random samples of sizes $m_i$, containing no missing values,
$$
X_{ij} = \begin{pmatrix} Y_{ij} \\ Z_{ij} \end{pmatrix}, \qquad i = 1, 2;\ j = 1, 2, \ldots, m_i,
$$
are available from
$$
N_p(\mu_i, \Sigma) = N_p\left(
\begin{pmatrix} \mu_{yi} \\ \mu_{zi} \end{pmatrix},
\begin{pmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{pmatrix}
\right),
$$
and random samples of sizes $n_i - m_i$, which contain only the first $k$ components, $Y_{ij}$ ($k \times 1$), $i = 1, 2$; $j = m_i + 1, \ldots, n_i$, are available from $N_k(\mu_{yi}, \Sigma_{yy})$. We denote by $X_{ij}$, $i = 1, 2$; $j = 1, \ldots, m_i$, the complete observations, and by $Y_{ij}$, $i = 1, 2$; $j = 1, \ldots, n_i$, the incomplete observations. Hence the data have the special pattern of missing values in which a block of variables is missing on $n_i - m_i$ observations and the remaining observations are all complete.
We can construct two linear discriminant functions. The first linear discriminant function is based on the complete observations $X_{ij}$, $i = 1, 2$; $j = 1, \ldots, m_i$. We have
$$
W_x = (\bar{X}_1 - \bar{X}_2)' S_{xx}^{-1}\left[X - \frac{1}{2}(\bar{X}_1 + \bar{X}_2)\right],
$$
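where
$$
\bar{X}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} X_{ij} = \begin{pmatrix} \bar{Y}_{i1} \\ \bar{Z}_i \end{pmatrix},
\qquad
\bar{Y}_{i1} = \frac{1}{m_i}\sum_{j=1}^{m_i} Y_{ij},
\qquad
\bar{Z}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} Z_{ij},
\qquad i = 1, 2,
$$
$$
S_{xx} = \sum_{i=1}^{2}\sum_{j=1}^{m_i} (X_{ij} - \bar{X}_i)(X_{ij} - \bar{X}_i)'/\nu_x,
\qquad
\nu_x = m_1 + m_2 - 2.
$$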
The second linear discriminant function is based on the incomplete observations $Y_{ij}$ ($k \times 1$), $i = 1, 2$; $j = 1, 2, \ldots, n_i$. We have
$$
W_y = (\bar{Y}_1 - \bar{Y}_2)' S_{yy}^{-1}\left[Y - \frac{1}{2}(\bar{Y}_1 + \bar{Y}_2)\right],
$$
where
$$
\bar{Y}_i = \frac{1}{n_i}\left[m_i\bar{Y}_{i1} + (n_i - m_i)\bar{Y}_{i2}\right],
\qquad
\bar{Y}_{i2} = \frac{1}{n_i - m_i}\sum_{j=m_i+1}^{n_i} Y_{ij},
$$
$$
S_{yy} = \sum_{i=1}^{2}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)(Y_{ij} - \bar{Y}_i)'/\nu_y,
\qquad
\nu_y = n_1 + n_2 - 2.
$$
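The two linear discriminant functions can be sketched in code as follows. This is an illustration under stated assumptions: the function names, the simulated data and the sample sizes are hypothetical, the midpoint $(\bar{X}_1 + \bar{X}_2)/2$ of the usual linear discriminant function is used, and the pooled covariance of the $Y$ observations is divided by $n_1 + n_2 - 2$.

```python
import numpy as np

def pooled_cov(samples, dof):
    """Sum of the within-group scatter matrices divided by dof."""
    scatter = sum((s - s.mean(axis=0)).T @ (s - s.mean(axis=0)) for s in samples)
    return scatter / dof

def linear_score(obs, mean1, mean2, cov):
    """(mean1 - mean2)' cov^{-1} [obs - (mean1 + mean2) / 2]."""
    return (mean1 - mean2) @ np.linalg.solve(cov, obs - 0.5 * (mean1 + mean2))

def linear_scores(x, complete, incomplete):
    """Return (W_x, W_y) for a new p-vector x.

    complete[i]   : (m_i, p) array of complete observations from population i + 1
    incomplete[i] : (n_i - m_i, k) array holding only the first k components
    """
    k = incomplete[0].shape[1]
    # W_x uses the complete observations only.
    xbar = [s.mean(axis=0) for s in complete]
    nu_x = len(complete[0]) + len(complete[1]) - 2
    w_x = linear_score(x, xbar[0], xbar[1], pooled_cov(complete, nu_x))
    # W_y uses the first k components of all n_i observations.
    y_all = [np.vstack([s[:, :k], t]) for s, t in zip(complete, incomplete)]
    ybar = [s.mean(axis=0) for s in y_all]
    nu_y = len(y_all[0]) + len(y_all[1]) - 2  # assumed divisor in this sketch
    w_y = linear_score(x[:k], ybar[0], ybar[1], pooled_cov(y_all, nu_y))
    return w_x, w_y

# Hypothetical simulated data: p = 4, k = 2, m_i = 30, n_i = 50.
rng = np.random.default_rng(1)
complete = [rng.normal(0.0, 1.0, (30, 4)), rng.normal(1.0, 1.0, (30, 4))]
incomplete = [rng.normal(0.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))]

x_new = complete[0].mean(axis=0)  # an observation placed at the first sample mean
w_x, w_y = linear_scores(x_new, complete, incomplete)
print(w_x, w_y)
```

Stacking the first $k$ components of the complete rows with the incomplete rows gives the same mean as the weighted average of $\bar{Y}_{i1}$ and $\bar{Y}_{i2}$.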
Now we combine the two linear discriminant functions and construct the classification rule, which is a linear combination of $W_x$ and $W_y$, namely
$$
W_c = cW_x + (1 - c)W_y, \qquad 0 \le c \le 1. \tag{2.3}
$$
Now we consider the linear combination statistic when the populations are multivariate normal with unequal covariance matrices. When a block of observations is missing in circular models, we have the discriminant function
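$$
W_c = cW_x + (1 - c)W_y, \qquad 0 \le c \le 1,
$$
$$
W_x = (X - \bar{X}_2)' S_2^{-1}(X - \bar{X}_2) - (X - \bar{X}_1)' S_1^{-1}(X - \bar{X}_1),
$$
$$
W_y = (Y - \bar{Y}_2)' S_{y2}^{-1}(Y - \bar{Y}_2) - (Y - \bar{Y}_1)' S_{y1}^{-1}(Y - \bar{Y}_1),
$$
where
$$
c = \frac{\left(\dfrac{1}{m_1} + \dfrac{1}{m_2}\right)^{-1} D^2}
{\left(\dfrac{1}{m_1} + \dfrac{1}{m_2}\right)^{-1} D^2 + \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)^{-1} D_y^2},
$$
$$
D^2 = (\bar{X}_1 - \bar{X}_2)' S^{-1}(\bar{X}_1 - \bar{X}_2),
\qquad
D_y^2 = (\bar{Y}_1 - \bar{Y}_2)' S_y^{-1}(\bar{Y}_1 - \bar{Y}_2),
$$
$$
S = \frac{S_1}{m_1} + \frac{S_2}{m_2},
\qquad
S_i = \frac{1}{m_i - 1}\sum_{j=1}^{m_i} (X_{ij} - \bar{X}_i)(X_{ij} - \bar{X}_i)',
$$
$$
S_y = \frac{S_{y1}}{n_1} + \frac{S_{y2}}{n_2},
\qquad
S_{yi} = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)(Y_{ij} - \bar{Y}_i)'.
$$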
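A minimal sketch of the weight $c$ and of the combined quadratic statistic is given below. It uses full sample covariance matrices and does not impose the circular structure or the canonical transformation, which is a simplification of this example. The function names and the simulated data are hypothetical.

```python
import numpy as np

def sample_cov(s):
    """Sample covariance matrix with divisor (sample size - 1)."""
    d = s - s.mean(axis=0)
    return d.T @ d / (len(s) - 1)

def quadratic_score(obs, samples):
    """(obs - mean2)' S2^{-1} (obs - mean2) - (obs - mean1)' S1^{-1} (obs - mean1)."""
    q = []
    for s in samples:
        d = obs - s.mean(axis=0)
        q.append(d @ np.linalg.solve(sample_cov(s), d))
    return q[1] - q[0]

def distance_term(samples):
    """(1/size_1 + 1/size_2)^{-1} times the squared distance between sample means,
    measured in the metric S_1/size_1 + S_2/size_2."""
    n1, n2 = len(samples[0]), len(samples[1])
    d = samples[0].mean(axis=0) - samples[1].mean(axis=0)
    s = sample_cov(samples[0]) / n1 + sample_cov(samples[1]) / n2
    return (d @ np.linalg.solve(s, d)) / (1.0 / n1 + 1.0 / n2)

def combined_score(x, complete, y_all):
    """Return (W_c, c), where y_all[i] holds the first k components of all n_i rows."""
    k = y_all[0].shape[1]
    a, b = distance_term(complete), distance_term(y_all)
    c = a / (a + b)
    w_c = c * quadratic_score(x, complete) + (1.0 - c) * quadratic_score(x[:k], y_all)
    return w_c, c

# Hypothetical simulated data: p = 4, k = 2, m_i = 40, n_i = 70.
rng = np.random.default_rng(2)
complete = [rng.normal(0.0, 2.0, (40, 4)), rng.normal(1.0, 1.0, (40, 4))]
extra = [rng.normal(0.0, 2.0, (30, 2)), rng.normal(1.0, 1.0, (30, 2))]
y_all = [np.vstack([s[:, :2], t]) for s, t in zip(complete, extra)]

w_c, c = combined_score(np.zeros(4), complete, y_all)
print(w_c, c)
```

Because both distance terms are positive whenever the sample means differ, the weight produced this way lies strictly between 0 and 1.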
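Substituting $\bar{X}_i$, $S_i$, $\bar{Y}_i$ and $S_{yi}$ we obtain, apart from a constant,
$$
V_x = \sum_{j=1}^{p}\left(\frac{1}{s_{2j}^2} - \frac{1}{s_{1j}^2}\right)
\left(x_j - \frac{\bar{x}_{2j}/s_{2j}^2 - \bar{x}_{1j}/s_{1j}^2}{1/s_{2j}^2 - 1/s_{1j}^2}\right)^2
$$
for $W_x$, and
$$
V_y = \sum_{j=1}^{k}\left(\frac{1}{s_{y2j}^2} - \frac{1}{s_{y1j}^2}\right)
\left(x_j - \frac{\bar{y}_{2j}/s_{y2j}^2 - \bar{y}_{1j}/s_{y1j}^2}{1/s_{y2j}^2 - 1/s_{y1j}^2}\right)^2
$$
for $W_y$. Then
$$
W_c = cV_x + (1 - c)V_y, \qquad 0 \le c \le 1.
$$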
The unconditional cdfs of $V_x$ and $V_y$ are given in Han (1970). The distribution of $W_c$, however, is very complicated.
Now we consider the conditional distribution given the sample statistics. Given the sample statistics, $V_x$ is distributed as the sum of $\tau_{ij}^{*2}\,\chi'^{2}_{1}(\delta_{ij}^{*2})$, where $\chi'^{2}_{1}(\delta_{ij}^{*2})$ is a non-central $\chi^2$ variable with 1 degree of freedom and non-centrality parameter
$$
\delta_{ij}^{*2} = \frac{\zeta_{ij}^{*2}}{\tau_{ij}^{*2}},
$$
where
$$
\zeta_{ij}^{*} = \sqrt{\frac{1}{s_{2j}^2} - \frac{1}{s_{1j}^2}}
\left(\mu_{ij} - \frac{\bar{x}_{2j}/s_{2j}^2 - \bar{x}_{1j}/s_{1j}^2}{1/s_{2j}^2 - 1/s_{1j}^2}\right),
\qquad
\tau_{ij}^{*2} = \sigma_{ij}^2\left(\frac{1}{s_{2j}^2} - \frac{1}{s_{1j}^2}\right).
$$
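Also, given the sample statistics, $V_y$ is distributed as the sum of $\tau_{yij}^{*2}\,\chi'^{2}_{1}(\delta_{yij}^{*2})$, where $\chi'^{2}_{1}(\delta_{yij}^{*2})$ has non-centrality parameter $\delta_{yij}^{*2} = \zeta_{yij}^{*2}/\tau_{yij}^{*2}$, with
$$
\zeta_{yij}^{*} = \sqrt{\frac{1}{s_{y2j}^2} - \frac{1}{s_{y1j}^2}}
\left(\mu_{yij} - \frac{\bar{y}_{2j}/s_{y2j}^2 - \bar{y}_{1j}/s_{y1j}^2}{1/s_{y2j}^2 - 1/s_{y1j}^2}\right),
\qquad
\tau_{yij}^{*2} = \sigma_{yij}^2\left(\frac{1}{s_{y2j}^2} - \frac{1}{s_{y1j}^2}\right).
$$
The conditional probabilities of misclassifying an observation $X$ from $\pi_1$ and from $\pi_2$ by $W_c$ are given as follows. For an observation from $\pi_1$,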
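$$
\beta_{1n}^{*} = \Pr\left(W_c < k \mid \bar{x}_{ij}, \bar{y}_{ij}, s_{ij}^2, s_{yij}^2;\ x, y \in \pi_1\right)
= \Pr\left(c\sum_{j=1}^{p} \tau_{ij}^{*2}\,\chi'^{2}_{1}(\delta_{ij}^{*2}) + (1 - c)\sum_{j=1}^{k} \tau_{yij}^{*2}\,\chi'^{2}_{1}(\delta_{yij}^{*2}) < k\right).
$$
Similarly,
$$
\beta_{2n}^{*} = \Pr\left(W_c \ge k \mid \bar{x}_{ij}, \bar{y}_{ij}, s_{ij}^2, s_{yij}^2;\ x, y \in \pi_2\right)
= \Pr\left(c\sum_{j=1}^{p} \tau_{ij}^{*2}\,\chi'^{2}_{1}(\delta_{ij}^{*2}) + (1 - c)\sum_{j=1}^{k} \tau_{yij}^{*2}\,\chi'^{2}_{1}(\delta_{yij}^{*2}) \ge k\right).
$$
Here the parameters $\tau_{ij}^{*2}$, $\delta_{ij}^{*2}$, $\tau_{yij}^{*2}$ and $\delta_{yij}^{*2}$ are evaluated at $i = 1$ in $\beta_{1n}^{*}$ and at $i = 2$ in $\beta_{2n}^{*}$.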
Hence the conditional error rate, with equal prior probabilities, is defined as
$$
\beta_n^{*} = \frac{1}{2}\left(\beta_{1n}^{*} + \beta_{2n}^{*}\right). \tag{2.4}
$$
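The conditional error rate in (2.4) can be estimated by simulation, as in the sketch below. Each term $\tau^2\chi'^{2}_{1}(\zeta^2/\tau^2)$ is drawn as the square of an $N(\zeta, \tau^2)$ variable. The sketch draws the $V_x$ and $V_y$ parts independently, which is an assumption of this example and not a statement from the paper; the function names and all parameter values are hypothetical as well.

```python
import numpy as np

def weighted_ncx2_sum(rng, tau2, zeta, size):
    """Draws of sum_j tau_j^2 * chi'^2_1(zeta_j^2 / tau_j^2), via squared normals."""
    tau2 = np.asarray(tau2, dtype=float)
    zeta = np.asarray(zeta, dtype=float)
    z = rng.normal(loc=zeta, scale=np.sqrt(tau2), size=(size, tau2.size))
    return np.sum(z ** 2, axis=1)

def conditional_error_rate(rng, c, cutoff, x_params, y_params, size=100_000):
    """Monte Carlo estimate of (beta, beta_1, beta_2).

    x_params[i] = (tau2, zeta) for the V_x terms when X comes from population i + 1;
    y_params[i] is the same for the V_y terms.
    The two parts are simulated independently (assumption of this sketch).
    """
    w = []
    for (tx, zx), (ty, zy) in zip(x_params, y_params):
        v_x = weighted_ncx2_sum(rng, tx, zx, size)
        v_y = weighted_ncx2_sum(rng, ty, zy, size)
        w.append(c * v_x + (1.0 - c) * v_y)
    beta1 = np.mean(w[0] < cutoff)   # observation from pi_1 assigned to pi_2
    beta2 = np.mean(w[1] >= cutoff)  # observation from pi_2 assigned to pi_1
    return 0.5 * (beta1 + beta2), beta1, beta2

# Hypothetical conditional parameters for p = 3 and k = 2.
x_params = [(np.array([3.0, 2.0, 1.0]), np.array([1.5, -1.0, 2.0])),
            (np.array([0.8, 0.7, 0.5]), np.array([0.3, -0.2, 0.4]))]
y_params = [(np.array([3.0, 2.0]), np.array([1.5, -1.0])),
            (np.array([0.8, 0.7]), np.array([0.3, -0.2]))]

rng = np.random.default_rng(3)
beta, beta1, beta2 = conditional_error_rate(rng, 0.6, 4.0, x_params, y_params)
print(beta, beta1, beta2)
```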
We use Patnaik's method to approximate the distributions of $V_x$ and $V_y$ by constant multiples of central chi-square distributions. The distribution of $V_x$ may be approximated by $c_x \chi^2_{\nu_x}$, where
$$
c_x = \frac{\sum_j \tau_{ij}^{*4} + 2\sum_j \tau_{ij}^{*2}\zeta_{ij}^{*2}}{\sum_j \tau_{ij}^{*2} + \sum_j \zeta_{ij}^{*2}},
\qquad
\nu_x = \frac{1}{c_x}\left(\sum_j \tau_{ij}^{*2} + \sum_j \zeta_{ij}^{*2}\right).
$$
Also, the distribution of $V_y$ may be approximated by $c_y \chi^2_{\nu_y}$