
Nonparametric Bayesian test of homogeneity using a discretization approach




MinSup Kim 1 · Balgobin Nandram 2 · Dal Ho Kim 3

1,3 Department of Statistics, Kyungpook National University

2 Department of Mathematical Sciences, Worcester Polytechnic Institute

Received 24 October 2017, revised 9 November 2017, accepted 13 November 2017

Abstract

In this paper, we consider a nonparametric Bayesian test of homogeneity using a hierarchical multinomial model with Dirichlet process priors in a small-area setup. If a continuous variable is discretized properly, the discretization approach can reveal an association between the groups and the variable even when k-sample tests, including one-way ANOVA, indicate that the groups are homogeneous. It can also be used to examine heterogeneity among groups at specific levels of the variable of interest. We discretize the continuous variable by clustering with k-means and with the Dirichlet process.

Once the continuous variable is discretized, the problem becomes an analysis of a contingency table, for which the chi-squared test is the most common choice. As more slices are added, however, the chi-squared test becomes less accurate. We therefore compute the Bayes factor under the nonparametric Bayesian model and apply it to the test of homogeneity.

Keywords: Bayesian nonparametrics, contingency table, Dirichlet process prior, discretization, homogeneity test, small areas.

1. Introduction

In a parametric Bayesian model, conjugate priors make the results easy to interpret and simplify the calculation, but they limit the flexibility of the model by fixing the distributional form assumed for the data. A nonparametric Bayesian model is a natural alternative when we want to avoid parametric assumptions and obtain a more robust and flexible model. In this case, we use Dirichlet processes as priors in the hierarchical Bayesian setup. The Dirichlet process, introduced by Ferguson (1973), defines a distribution over distributions. When a random probability distribution $G$ follows the Dirichlet process, we write $G \sim \mathrm{DP}(\alpha, G_0)$, where $\alpha > 0$ is a scaling parameter and $G_0$ is the base distribution. $G \sim \mathrm{DP}(\alpha, G_0)$ means that, for every finite measurable partition $(A_1, \ldots, A_k)$ of a measurable space $\Theta$, $(G(A_1), \ldots, G(A_k)) \sim \mathrm{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_k))$. Blackwell and MacQueen (1973) described the Dirichlet process through the Pólya urn scheme, which gives the following prediction rule. For a sequence $\{X_n, n \ge 1\}$ and a measurable set $A \subset \Theta$,

$$P(X_{n+1} \in A \mid X_1, \ldots, X_n) = \frac{\alpha G_0(A) + \sum_{i=1}^{n} I(X_i \in A)}{\alpha + n}.$$

1 Ph.D. candidate, Department of Statistics, Kyungpook National University, Daegu 41566, Korea.

2 Professor, Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609, USA.

3 Corresponding author: Professor, Department of Statistics, Kyungpook National University, Daegu 41566, Korea. E-mail: [email protected]

Sethuraman (1994) described the Dirichlet process through a stick-breaking representation, which is now the most common way of representing it. In this representation, $G$ can be written as

$$G = \sum_{j=1}^{\infty} w_j \delta_{\phi_j}, \tag{1.1}$$

where the $\phi_j$ are independent and identically distributed random elements from $G_0$, the $w_j$ are random weights chosen independently of the $\phi_j$, $0 < w_j < 1$, and $\sum_{j=1}^{\infty} w_j = 1$ almost surely. The weights $w_j$ are given by

$$w_1 = \nu_1, \qquad w_j = \nu_j \prod_{l<j} (1 - \nu_l), \quad j = 2, 3, \ldots \tag{1.2}$$

with $\nu_j \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$. The Dirichlet process is widely used as a prior in nonparametric Bayesian models.
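The stick-breaking weights in (1.2) can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the infinite sum in (1.1) is truncated at a finite number of sticks, and the function name `stick_breaking_weights`, the truncation level `num_sticks`, and the seed are illustrative choices that do not come from the paper.

```python
import random


def stick_breaking_weights(alpha, num_sticks, rng):
    """Draw the first `num_sticks` weights w_j of (1.2).

    Returns the weights and the stick length left over after truncation.
    The truncation itself is an assumption of this sketch.
    """
    weights = []
    remaining = 1.0  # product of (1 - nu_l) over the sticks broken so far
    for _ in range(num_sticks):
        nu = rng.betavariate(1.0, alpha)  # nu_j ~ Beta(1, alpha)
        weights.append(nu * remaining)    # w_j = nu_j * prod_{l<j} (1 - nu_l)
        remaining *= 1.0 - nu
    return weights, remaining


rng = random.Random(0)  # fixed seed, for reproducibility only
weights, leftover = stick_breaking_weights(alpha=1.0, num_sticks=20, rng=rng)
```

The weights and the leftover stick length always add up to one, so `leftover` shows how much probability mass the truncation ignores. Larger values of `alpha` spread the mass over more sticks.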

Recent methods for sampling the Dirichlet mixture model include the blocked Gibbs sampler (Ishwaran and James, 2001), the retrospective sampler (Papaspiliopoulos and Roberts, 2008), and the slice sampler (Walker, 2007; Kalli et al., 2011). These are called conditional methods: they avoid the Pólya urn scheme and sample a sufficient but finite number of variables at each iteration of a Markov chain. In this study, we use the slice sampling method. It was introduced by Walker (2007), who used a latent variable that allows a finite number of variables to be sampled at each iteration of a Gibbs sampler. Kalli et al. (2011) proposed a more efficient and general version of the slice sampler of Walker (2007). One of our main concerns is therefore to perform the test of homogeneity between groups with the Dirichlet mixture model using the slice sampling method.

Another concern is the discretization of continuous variables. Miller and Siegmund (1982) studied a discretization approach based on the maximally selected chi-square statistic, which selects the cut point that maximizes the standard chi-square statistic and then forms a 2×2 contingency table of the numbers of observations. Discretization approaches can also be considered in the k-sample test. Jiang et al. (2015) studied the dynamic slicing method based on regularized likelihood-ratio testing. The regularized likelihood-ratio test statistic is the mutual information with penalty terms that prevent over-slicing when a continuous variable is cut. It can be seen as a generalized version of the approach of Miller and Siegmund (1982).

This kind of discretization approach can find associations between the groups and the variable. Even when the groups appear homogeneous, the proportions of observations at a particular level of the variable may vary from group to group. Appropriate discretization of continuous variables can therefore support the alternative hypothesis in a homogeneity test even if the null hypothesis was not rejected by k-sample tests, including ANOVA.

In our study, we use the clustering by the k-means and Dirichlet process to discretize the continuous variable. This clustering method minimizes the difference within the cluster and maximizes the difference between the clusters. When clusters are present in the data, discretization using clustering represents heterogeneity between groups at a particular level.

The discretization then leads to a homogeneity test on $l \times c$ tables, for which the usual choice is the $\chi^2$ test. However, the finer the discretization, the fewer observations fall in each cell, which makes an accurate test difficult. We therefore use the Bayes factor for the homogeneity test and compare it with the p-values of the classical $\chi^2$ test and the Cressie-Read test (Cressie and Read, 1984). The Cressie-Read test uses the following power divergence statistic with $r = 2/3$:

$$P^2 = \frac{2}{r(r+1)} \sum_{i,j} n_{ij} \left\{ \left( \frac{n_{ij}}{\hat{\lambda}_{ij}} \right)^{r} - 1 \right\}, \qquad -\infty < r < \infty,$$

where $n_{ij}$ are the cell counts and $\hat{\lambda}_{ij} = \left(\sum_{i} n_{ij}\right)\left(\sum_{j} n_{ij}\right) \big/ \sum_{i}\sum_{j} n_{ij}$. In general, the Cressie-Read statistic ($P^2$ with $r = 2/3$) is known to be less sensitive than $\chi^2$ to small cell counts.
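The power divergence statistic above can be checked numerically with a small Python sketch. The function names and the 2 × 3 table below are hypothetical and only serve as an illustration. The sketch assumes $r > 0$, so that empty cells contribute zero, and it assumes that no row or column total is zero. It computes the statistic only, not a p-value.

```python
def expected_counts(table):
    """Estimated expected counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def power_divergence(table, r):
    """Power divergence statistic P^2 for r > 0 (r = 2/3 gives Cressie-Read)."""
    lam = expected_counts(table)
    total = 0.0
    for obs_row, exp_row in zip(table, lam):
        for n_ij, lam_ij in zip(obs_row, exp_row):
            if n_ij > 0:  # a zero count contributes nothing when r > 0
                total += n_ij * ((n_ij / lam_ij) ** r - 1.0)
    return 2.0 / (r * (r + 1.0)) * total


def pearson_chi2(table):
    """Classical Pearson chi-squared statistic, for comparison."""
    lam = expected_counts(table)
    return sum((n_ij - lam_ij) ** 2 / lam_ij
               for obs_row, exp_row in zip(table, lam)
               for n_ij, lam_ij in zip(obs_row, exp_row))


toy_table = [[10, 5, 3], [4, 6, 8]]  # hypothetical counts, not from the paper
cressie_read = power_divergence(toy_table, 2.0 / 3.0)
pearson = pearson_chi2(toy_table)
```

Setting `r = 1` in `power_divergence` reproduces the Pearson statistic, which is a convenient sanity check on the formula.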

There is a well-known literature on Bayesian methods for analyzing contingency table data. Agresti and Hitchcock (2005) surveyed Bayesian methods for categorical data analysis, with a focus on contingency tables. Recently, hierarchical Bayesian models for contingency tables from small areas have been studied in Woo and Kim (2015, 2016).

In this paper, we construct a Bayesian nonparametric test of homogeneity using a hierarchical multinomial model with Dirichlet process priors in a small-area setup. In Section 2, we establish a nonparametric Bayesian hierarchical model with Dirichlet process priors in small areas and its computational procedure based on the slice sampler. The Bayes factor is then calculated to perform the test of homogeneity. In Section 3, we show the results of numerical studies together with comparable frequentist methods. Finally, we provide concluding remarks in Section 4.

2. Nonparametric Bayesian test of homogeneity

We consider an $l \times c$ contingency table with cell counts $n_{ij}$ and cell probabilities $\pi_{ij}$ ($i = 1, 2, \ldots, l$; $j = 1, 2, \ldots, c$). We denote $n_i = (n_{i1}, n_{i2}, \ldots, n_{ic})'$ and $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ic})'$. Here $n_{i+} = \sum_{j=1}^{c} n_{ij}$ is the sum of the counts in the $i$th row, $n_{+j} = \sum_{i=1}^{l} n_{ij}$ is the sum of the counts in the $j$th column, and $n = \sum_{i=1}^{l} n_{i+}$ is the total sum of the cell counts.

Table 2.1 $l \times c$ contingency table

Areas   1                     2                     ...   c
1       $n_{11}, \pi_{11}$    $n_{12}, \pi_{12}$    ...   $n_{1c}, \pi_{1c}$
2       $n_{21}, \pi_{21}$    $n_{22}, \pi_{22}$    ...   $n_{2c}, \pi_{2c}$
...     ...                   ...                   ...   ...
l       $n_{l1}, \pi_{l1}$    $n_{l2}, \pi_{l2}$    ...   $n_{lc}, \pi_{lc}$

$n_{ij}$: cell count, $\pi_{ij}$: cell probability

We consider the test of homogeneity

$$H_0: \pi_1 = \pi_2 = \cdots = \pi_l = \pi \quad \text{vs} \quad H_1: \text{not } H_0, \tag{2.1}$$

and we place a Dirichlet process prior on the probability parameters of the multinomial distribution for the contingency table.

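Under $H_1$, our hierarchical Bayesian model is

$$\begin{aligned}
n_i \mid \pi_i &\overset{ind}{\sim} \mathrm{Multinomial}(n_{i+}, \pi_i), \quad i = 1, 2, \ldots, l, \\
\pi_i \mid G &\overset{iid}{\sim} G, \quad i = 1, 2, \ldots, l, \\
G &\sim \mathrm{DP}(\alpha, G_0),
\end{aligned} \tag{2.2}$$

where $\alpha > 0$ is a concentration parameter and $G_0$ is the base distribution. We assume the hyperprior $\pi(\alpha) = 1/(1 + \alpha)^2$, $\alpha > 0$, and take $G_0 \equiv \mathrm{Dirichlet}(\mu)$. Note that $\pi_i \mid \mu \sim \mathrm{Dirichlet}(\mu)$ has the density $f(\pi_i \mid \mu) = \prod_{j=1}^{c} \pi_{ij}^{\mu_j - 1} / D(\mu)$, $0 < \pi_{ij} < 1$, $\sum_{j=1}^{c} \pi_{ij} = 1$, where $D(\mu) = \prod_{j=1}^{c} \Gamma(\mu_j) / \Gamma(\sum_{j=1}^{c} \mu_j)$. The multinomial likelihood of the $i$th row is

$$f_1(n_i \mid \pi_i) = \frac{n_{i+}!}{\prod_{j=1}^{c} n_{ij}!} \prod_{j=1}^{c} \pi_{ij}^{n_{ij}}, \tag{2.3}$$

where $\pi_i = (\pi_{i1}, \ldots, \pi_{ic})'$.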

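To apply the Dirichlet process prior, we use the efficient version of the slice sampler proposed by Kalli et al. (2011). Under $H_1$, the joint density of the data and the latent variables given $(\nu, \pi)$ is

$$\begin{aligned}
P_1(n, d, u \mid \nu, \pi) &= \prod_{i=1}^{l} I(u_i < \xi_{d_i}) \frac{w_{d_i}}{\xi_{d_i}} f_1(n_i \mid \pi_{d_i}) \\
&= \prod_{i=1}^{l} I(u_i < \xi_{d_i}) \frac{w_{d_i}}{\xi_{d_i}} \, n_{i+}! \prod_{j=1}^{c} \frac{\pi_{d_i j}^{n_{ij}}}{n_{ij}!}.
\end{aligned} \tag{2.4}$$

Here $d = (d_1, \ldots, d_l)$ is the vector of cluster indices, $u = (u_1, \ldots, u_l)$ is the vector of latent slice variables, $\nu = (\nu_1, \ldots, \nu_R)$ is the vector of stick-breaking variables ($d_i \le R$), $w_r$ are the stick-breaking weights defined in (1.2), and $\xi_r = (1 - k) k^{r-1}$ for a fixed constant $0 < k < 1$. Here $\nu_r \mid \alpha \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$, $r = 1, 2, \ldots, R$. Then the joint density function for all variables under $H_1$ is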

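$$\begin{aligned}
P_1(n, d, u, \nu, \pi, \alpha) &= P_1(n, d, u \mid \nu, \pi) \, \pi_0(\pi) \, \pi(\nu \mid \alpha) \, \pi(\alpha) \\
&= \prod_{i=1}^{l} I(u_i < \xi_{d_i}) \frac{w_{d_i}}{\xi_{d_i}} \, n_{i+}! \prod_{j=1}^{c} \frac{\pi_{d_i j}^{n_{ij}}}{n_{ij}!}
\prod_{r=1}^{R} \left\{ \frac{1}{D(\mu)} \prod_{j=1}^{c} \pi_{rj}^{\mu_j - 1} \, \alpha (1 - \nu_r)^{\alpha - 1} \right\} \frac{1}{(1 + \alpha)^2}.
\end{aligned} \tag{2.5}$$

Then the marginal likelihood of $n$ under $H_1$ is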

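$$\begin{aligned}
P_1(n) = {} & \prod_{i=1}^{l} \frac{n_{i+}!}{\prod_{j=1}^{c} n_{ij}!} \prod_{r=1}^{R} \frac{D\left(\mu + \sum_{i=1}^{l} I(d_i = r) \, n_i\right)}{D(\mu)} \\
& \times \int \prod_{r=1}^{R} \alpha \, B\left(1 + \sum_{i=1}^{l} I(d_i = r), \; \alpha + \sum_{i=1}^{l} I(d_i > r)\right) \frac{1}{(1 + \alpha)^2} \, d\alpha.
\end{aligned} \tag{2.6}$$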

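At each iteration of the Gibbs sampler with slices, the sampled variables are

$$(\pi_r, \nu_r), \; r = 1, 2, \ldots, R; \qquad (d_i, u_i), \; i = 1, 2, \ldots, l. \tag{2.7}$$

Our Gibbs sampler under $H_1$ is

Step 1: $\pi(\pi_r \mid \cdots) \propto \mathrm{Dirichlet}\left(\sum_{i: d_i = r} n_i + \mu\right)$;

Step 2: $\pi(\nu_r \mid \cdots) \propto \mathrm{Beta}\left(1 + \sum_{i=1}^{l} I(d_i = r), \; \alpha + \sum_{i=1}^{l} I(d_i > r)\right)$;

Step 3: $\pi(u_i \mid \cdots) \propto I(0 < u_i < \xi_{d_i})$;

Step 4: $\pi(d_i = r \mid \cdots) \propto I(u_i < \xi_r) \dfrac{w_r}{\xi_r} \, n_{i+}! \prod_{j=1}^{c} \dfrac{\pi_{rj}^{n_{ij}}}{n_{ij}!}$;

Step 5: $\pi(\alpha \mid \cdots) \propto \prod_{r=1}^{R} \alpha (1 - \nu_r)^{\alpha - 1} \dfrac{1}{(1 + \alpha)^2}$. (2.8)

Next, under $H_0$, our hierarchical Bayesian model is

$$\begin{aligned}
n_i \mid \pi &\overset{ind}{\sim} \mathrm{Multinomial}(n_{i+}, \pi), \quad i = 1, 2, \ldots, l, \\
\pi \mid G &\sim G, \\
G &\sim \mathrm{DP}(\alpha, G_0).
\end{aligned} \tag{2.9}$$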
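Steps 1 to 4 of the Gibbs sampler under $H_1$ in (2.8) can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not taken from the paper: the truncation level `R` is fixed in advance, $\alpha$ is held fixed (Step 5 is omitted), cluster labels are 0-based, so that `xi[r]` stores $\xi_{r+1}$, and the values `k = 0.5`, `R = 6`, the seed, and the table of counts are purely illustrative. The function names are hypothetical.

```python
import math
import random


def sample_dirichlet(shape, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in shape]
    total = sum(draws)
    return [g / total for g in draws]


def gibbs_sweep(counts, d, alpha, mu, R, k, rng):
    """One pass over Steps 1-4 of (2.8); alpha is treated as fixed here."""
    l = len(counts)
    xi = [(1.0 - k) * k ** r for r in range(R)]  # xi[r] stores xi_{r+1}

    # Step 1: pi_r | ... ~ Dirichlet(sum of rows assigned to cluster r + mu)
    pi = []
    for r in range(R):
        shape = list(mu)
        for i in range(l):
            if d[i] == r:
                shape = [s + n for s, n in zip(shape, counts[i])]
        pi.append(sample_dirichlet(shape, rng))

    # Step 2: nu_r | ... ~ Beta(1 + #{d_i = r}, alpha + #{d_i > r})
    nu = [rng.betavariate(1 + sum(1 for di in d if di == r),
                          alpha + sum(1 for di in d if di > r))
          for r in range(R)]
    w, remaining = [], 1.0
    for v in nu:  # stick-breaking weights as in (1.2)
        w.append(v * remaining)
        remaining *= 1.0 - v

    # Step 3: u_i | ... ~ Uniform(0, xi_{d_i})
    u = [rng.random() * xi[d[i]] for i in range(l)]

    # Step 4: P(d_i = r | ...) is proportional to
    # I(u_i < xi_r) * (w_r / xi_r) * prod_j pi_rj^{n_ij};
    # the multinomial coefficient does not depend on r and cancels.
    new_d = []
    for i in range(l):
        logp = []
        for r in range(R):
            if u[i] < xi[r]:
                loglik = sum(n * math.log(p)
                             for n, p in zip(counts[i], pi[r]) if n > 0)
                logp.append(math.log(w[r]) - math.log(xi[r]) + loglik)
            else:
                logp.append(-math.inf)
        top = max(logp)
        probs = [math.exp(v - top) for v in logp]
        new_d.append(rng.choices(range(R), weights=probs)[0])
    return new_d, pi, nu, u


rng = random.Random(0)  # fixed seed, for reproducibility only
counts = [[0, 7], [4, 4], [11, 8], [35, 38], [0, 5]]  # illustrative counts
d = [0] * len(counts)  # start with every area in the first cluster
for _ in range(50):
    d, pi, nu, u = gibbs_sweep(counts, d, alpha=1.0, mu=[1.0, 1.0],
                               R=6, k=0.5, rng=rng)
```

Working on the log scale in Step 4 avoids underflow when the row totals are large. A full sampler would also update $\alpha$ from its conditional in Step 5, for example on a grid, which this sketch leaves out.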

Then, under $H_0$, the joint density of the data and the latent variables given $(\nu, \pi)$ is

$$\begin{aligned}
P_0(n, d, u \mid \nu, \pi) &= \prod_{i=1}^{l} I(u_i < \xi_{d_i}) \frac{w_{d_i}}{\xi_{d_i}} f(n_i \mid \pi) \\
&= \prod_{i=1}^{l} I(u_i < \xi_{d_i}) \frac{w_{d_i}}{\xi_{d_i}} \, n_{i+}! \prod_{j=1}^{c} \frac{\pi_{j}^{n_{ij}}}{n_{ij}!}.
\end{aligned} \tag{2.10}$$

Under $H_0$, $\pi_1 = \pi_2 = \cdots = \pi_l = \pi$, which means that there is a single stick-breaking random variable, that is, one cluster in the Dirichlet process. So $d_1 = d_2 = \cdots = d_l = 1$ and there is only $\nu_1$, which has the density $\pi(\nu_1 \mid \alpha) = \alpha (1 - \nu_1)^{\alpha - 1}$. So the joint density function for all variables under $H_0$ is

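$$P_0(n, d, u, \nu, \pi, \alpha) = \prod_{i=1}^{l} w_1 \, n_{i+}! \prod_{j=1}^{c} \frac{\pi_{j}^{n_{ij}}}{n_{ij}!} \left\{ \frac{1}{D(\mu)} \prod_{j=1}^{c} \pi_{j}^{\mu_j - 1} \, \alpha (1 - \nu_1)^{\alpha - 1} \right\} \frac{1}{(1 + \alpha)^2}. \tag{2.11}$$

Then the marginal likelihood of $n$ under $H_0$ is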

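$$P_0(n) = \prod_{i=1}^{l} \frac{n_{i+}!}{\prod_{j=1}^{c} n_{ij}!} \, \frac{D\left(\mu + \sum_{i=1}^{l} n_i\right)}{D(\mu)} \int \alpha \, B(1 + l, \alpha) \frac{1}{(1 + \alpha)^2} \, d\alpha. \tag{2.12}$$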

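Our Gibbs sampler under $H_0$ is

Step 1: $\pi(\pi \mid \cdots) \propto \mathrm{Dirichlet}\left(\sum_{i=1}^{l} n_i + \mu\right)$;

Step 2: $\pi(\nu_1 \mid \cdots) \propto \mathrm{Beta}(1 + l, \alpha)$;

Step 3: $\pi(u_i \mid \cdots) \propto 1$;

Step 4: $\pi(d_i = 1 \mid \cdots) \propto w_1 \, n_{i+}! \prod_{j=1}^{c} \dfrac{\pi_{j}^{n_{ij}}}{n_{ij}!}$;

Step 5: $\pi(\alpha \mid \cdots) \propto \alpha (1 - \nu_1)^{\alpha - 1} \dfrac{1}{(1 + \alpha)^2}$. (2.13)

Finally, the Bayes factor (BF) for the test of homogeneity is

$$\mathrm{BF}_{10} = \frac{P_1(n)}{P_0(n)}
= \frac{\displaystyle \prod_{r=1}^{R} \frac{D\left(\mu + \sum_{i=1}^{l} I(d_i = r) \, n_i\right)}{D(\mu)} \int \prod_{r=1}^{R} \alpha \, B\left(1 + \sum_{i=1}^{l} I(d_i = r), \; \alpha + \sum_{i=1}^{l} I(d_i > r)\right) \frac{d\alpha}{(1 + \alpha)^2}}
{\displaystyle \frac{D\left(\mu + \sum_{i=1}^{l} n_i\right)}{D(\mu)} \int \alpha \, B(1 + l, \alpha) \frac{d\alpha}{(1 + \alpha)^2}}. \tag{2.14}$$

In the data analysis, we consider the noninformative uniform prior with $\mu = 1$.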

The Bayes factor can be used to quantify the evidence for one hypothesis over the other. Here $\mathrm{BF}_{10}$ offers a way of evaluating the evidence against $H_0$ and in favor of $H_1$. Kass and Raftery (1995) interpret the Bayes factor as follows. If $\log(\mathrm{BF}_{10})$ is between 0 and 1, we get borderline evidence against the null hypothesis. If $\log(\mathrm{BF}_{10})$ is between 1 and 3, we get positive evidence against the null hypothesis. If $\log(\mathrm{BF}_{10})$ is between 3 and 5, we get strong evidence against the null hypothesis. If $\log(\mathrm{BF}_{10})$ is greater than 5, we get very strong evidence against the null hypothesis.
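The ratio in (2.14) and the interpretation scale can be illustrated with a short Python sketch. Several points are assumptions of the sketch, not statements from the paper: the cluster assignment `d` is supplied as a fixed vector with labels starting at 1, the integral over $\alpha$ is approximated by a midpoint rule after the substitution $t = \alpha/(1+\alpha)$ (which turns $d\alpha/(1+\alpha)^2$ into $dt$ on $(0, 1)$), the logarithm is the natural one, and values exactly on a boundary of the scale are assigned to the upper category. The function names, the assignment `[1, 2, 2, 2, 1]`, and the counts are illustrative.

```python
import math


def log_D(a):
    """log D(a), with D(a) = prod_j Gamma(a_j) / Gamma(sum_j a_j)."""
    return sum(math.lgamma(x) for x in a) - math.lgamma(sum(a))


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_alpha_integral(sizes_and_tails, grid=4000):
    """log of int prod_r alpha * B(1 + m_r, alpha + s_r) / (1 + alpha)^2 d(alpha).

    Each pair (m_r, s_r) holds #{d_i = r} and #{d_i > r}.
    """
    logs = []
    for g in range(grid):
        t = (g + 0.5) / grid       # midpoint in (0, 1)
        alpha = t / (1.0 - t)      # back-transform to alpha > 0
        logs.append(sum(math.log(alpha) + log_beta(1 + m, alpha + s)
                        for m, s in sizes_and_tails))
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs) / grid)


def log_bf10(counts, d, mu):
    """log BF_10 of (2.14) for a given cluster assignment d (labels 1..R)."""
    l, c = len(counts), len(counts[0])
    numerator, pairs = 0.0, []
    for r in range(1, max(d) + 1):
        pooled = [mu[j] + sum(counts[i][j] for i in range(l) if d[i] == r)
                  for j in range(c)]
        numerator += log_D(pooled) - log_D(mu)
        pairs.append((sum(1 for di in d if di == r),
                      sum(1 for di in d if di > r)))
    numerator += log_alpha_integral(pairs)
    total = [mu[j] + sum(counts[i][j] for i in range(l)) for j in range(c)]
    denominator = log_D(total) - log_D(mu) + log_alpha_integral([(l, 0)])
    return numerator - denominator


def evidence_against_h0(log_bf):
    """Map log(BF_10) to the verbal scale quoted in the text."""
    if log_bf <= 0:
        return "no evidence against H0"
    if log_bf < 1:
        return "borderline"
    if log_bf < 3:
        return "positive"
    if log_bf < 5:
        return "strong"
    return "very strong"


counts = [[0, 7], [4, 4], [11, 8], [35, 38], [0, 5]]  # illustrative counts
value = log_bf10(counts, d=[1, 2, 2, 2, 1], mu=[1.0, 1.0])
label = evidence_against_h0(value)
```

When every area is placed in a single cluster, the numerator and denominator of (2.14) coincide and the log Bayes factor is zero, which gives a quick check of the implementation.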

3. Numerical study

We apply the Bayesian hierarchical model above to the officetel sales price transaction data of Ulsan Metropolitan City for the first quarter of 2017. The data consist of a categorical variable indicating five small areas in Ulsan Metropolitan City and a continuous variable, the sales price. Area 1 is Nam-Gu with 73 observations, area 2 is Dong-Gu with 8, area 3 is Buk-Gu with 7, area 4 is Ulju-Gun with 5, and area 5 is Jung-Gu with 19, for a total of 112 observations.

The one-way ANOVA showed no significant difference in the officetel sales price between the groups (F-test p-value = 0.103). The nonparametric tests agree: the Kruskal-Wallis test gives a p-value of 0.118 and the Anderson-Darling k-sample test gives a p-value of 0.110, showing no difference between the groups.

We first apply the k-means method to discretize the sales price and form an $l \times c$ table of the numbers of observations. We use the elbow method, based on the total within-cluster sum of squared errors, to choose $k$. As shown in Figure 3.1, the appropriate number of clusters is $k = 3$. Table 3.1 shows the 5 × 3 contingency table obtained by the k-means method.
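The elbow method described above can be sketched in Python for a one-dimensional variable. The sketch makes assumptions that are not from the paper: it uses Lloyd's algorithm with initial centers placed at evenly spaced order statistics, the function name `kmeans_1d` is hypothetical, and the prices are synthetic values, because the transaction data themselves are not reproduced here.

```python
def kmeans_1d(x, k, iters=100):
    """Lloyd's algorithm in one dimension.

    Returns the cluster labels, the centers, and the total within-cluster
    sum of squares that the elbow method plots against k.
    """
    xs = sorted(x)
    # Initial centers at evenly spaced order statistics (an assumption).
    centers = [xs[(2 * m + 1) * len(xs) // (2 * k)] for m in range(k)]

    def assign(cs):
        return [min(range(k), key=lambda m: (v - cs[m]) ** 2) for v in x]

    for _ in range(iters):
        labels = assign(centers)
        new_centers = []
        for m in range(k):
            members = [v for v, lab in zip(x, labels) if lab == m]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members
                               else centers[m])
        if new_centers == centers:
            break
        centers = new_centers

    labels = assign(centers)
    wss = sum((v - centers[lab]) ** 2 for v, lab in zip(x, labels))
    return labels, centers, wss


# Synthetic prices with three visible groups (not the paper's data).
prices = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 10.0, 10.1, 9.9]
wss_by_k = {k: kmeans_1d(prices, k)[2] for k in range(1, 6)}
labels3, centers3, wss3 = kmeans_1d(prices, 3)
```

Plotting `wss_by_k` against `k` and looking for the point where the decrease flattens gives the elbow. Cross-tabulating the resulting labels against the area variable produces the contingency table.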


Figure 3.1 The elbow method for finding k

Table 3.1 5 × 3 table of discretization by k-means clustering

Areas   P1   P2   P3
1        3    0    4
2        6    0    2
3       11    2    6
4       26   20   27
5        0    0    5

Next, to perform clustering by the Dirichlet process, the initial number of clusters is set to 7 by the rule of thumb $\sqrt{n/2} = \sqrt{112/2} \approx 7.48$. We apply the nonparametric Bayesian method to the discretized sales price variable. We run 10,000 MCMC iterations, discard the first 5,000 as burn-in, and keep the 5,000 samples from iterations 5,001 to 10,000. The clustering is performed using the values of $d_i$, which are sampled from

$$\pi(d_i = r \mid \cdots) \propto I(u_i < \xi_r) \frac{w_r}{\xi_r} \, n_{i+}! \prod_{j=1}^{c} \frac{\pi_{rj}^{n_{ij}}}{n_{ij}!}. \tag{3.1}$$

Figure 3.2 shows the bar plot of the sampled $d_i$, which suggests two clusters. We therefore consider Table 3.2, which shows the resulting discretization based on the values of $d_i$.

Figure 3.2 Bar plot of the number of clusters by the Dirichlet process

Table 3.2 5 × 2 table of discretization by the Dirichlet process

Areas   P1   P2
1        0    7
2        4    4
3       11    8
4       35   38
5        0    5

Now we compare the $\chi^2$ test and the Cressie-Read test with the Bayes factor for the test of homogeneity. The p-values of the $\chi^2$ test and the Cressie-Read test show that the null hypothesis is rejected at the 5% level, and the Bayes factor gives positive evidence against the null hypothesis. This is because the clustering makes the proportions of observations at each level heterogeneous across the areas.

Table 3.3 Comparisons of the $\chi^2$ test, Cressie-Read test and Bayes factor

Table                               $\chi^2$ test (p-value)   Cressie-Read test (p-value)   log(BF)
5 × 3 table by k-means              0.014                     0.012                         1.503
5 × 2 table by Dirichlet process    0.022                     0.016                         1.419

We monitored the convergence of the Gibbs sampler using the Geweke test. We ran one chain of 10,000 iterations and, to diminish the influence of the starting values, discarded the first 5,000 samples and kept the remaining 5,000. The Geweke test compares the means of the first 10 percent and the last 50 percent of the Markov chain using a z-score statistic, where the null hypothesis is that the chain is stationary. The p-values of the Geweke tests for all parameters are greater than 0.10.
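The comparison of means behind the Geweke test can be sketched as follows. This is a simplified version: the standard Geweke diagnostic estimates the two variances from spectral densities to account for autocorrelation, whereas this sketch uses plain sample variances, so it is an approximation that is reasonable only for weakly correlated draws. The function names and the two simulated chains are hypothetical.

```python
import math
import random


def geweke_z(chain, first=0.1, last=0.5):
    """z-score comparing the mean of the first 10% with the last 50%.

    Simplification: sample variances replace the spectral density
    estimates used by the standard Geweke diagnostic.
    """
    n = len(chain)
    head = chain[: int(first * n)]
    tail = chain[n - int(last * n):]

    def mean_and_var(x):
        m = sum(x) / len(x)
        v = sum((xi - m) ** 2 for xi in x) / (len(x) - 1)
        return m, v

    m_head, v_head = mean_and_var(head)
    m_tail, v_tail = mean_and_var(tail)
    return (m_head - m_tail) / math.sqrt(v_head / len(head) + v_tail / len(tail))


def two_sided_p_value(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


rng = random.Random(1)  # fixed seed, for reproducibility only
stationary = [rng.gauss(0.0, 1.0) for _ in range(1000)]
drifting = [rng.gauss(0.0, 1.0) + 5.0 * i / 1000 for i in range(1000)]
z_stationary = geweke_z(stationary)
z_drifting = geweke_z(drifting)
p_drifting = two_sided_p_value(z_drifting)
```

A chain whose mean drifts over the iterations yields a large absolute z-score and a small p-value, while a stationary chain should not.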

4. Concluding remarks

In this study, our interest is to discretize continuous variables appropriately, create a contingency table, and then perform a nonparametric Bayesian test of homogeneity between groups using the Dirichlet process prior. We used the efficient version of the slice sampler by Kalli et al. (2011) for the nonparametric Bayesian computation. In particular, we used two clustering methods for the discretization, k-means and the Dirichlet process. We then calculated Bayes factors for the test of homogeneity for the contingency tables. In this setting, the chi-squared test is the most commonly employed method. However, further discretization gives rise to more cells in the table. Since the size of the data is fixed, increasing the number of cells reduces the count in each cell, and the accuracy of the chi-squared test decreases. To avoid this, we consider the Bayesian approach.

This discretization approach can find relationships between groups and variables even when standard k-sample tests indicate that the groups are homogeneous. Appropriate discretization of continuous variables can therefore support the alternative hypothesis in a homogeneity test. It can also be used to examine heterogeneity among groups at specific levels of the variables of interest.

References

Agresti, A. and Hitchcock, D. B. (2005). Bayesian inference for categorical data analysis. Statistical Methods and Applications, 14, 297-330.

Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1, 353-355.

Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B, 46, 440-464.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209-230.

Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96, 161-173.

Jiang, B., Ye, C. and Liu, J. S. (2015). Nonparametric K-sample tests via dynamic slicing. Journal of the American Statistical Association, 110, 642-653.

Kalli, M., Griffin, J. E. and Walker, S. G. (2011). Slice sampling mixture models. Statistics and Computing, 21, 93-105.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Miller, R. and Siegmund, D. (1982). Maximally selected Chi square statistics. Biometrics, 38, 1011-1016.

Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95, 169-186.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650.

Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation, 36, 45-54.

Woo, N. and Kim, D. H. (2015). A Bayesian uncertainty analysis for nonignorable nonresponse in two-way contingency table. Journal of the Korean Data & Information Science Society, 26, 1547-1555.

Woo, N. and Kim, D. H. (2016). A Bayesian model for two-way contingency tables with nonignorable nonresponse from small areas. Journal of the Korean Data & Information Science Society, 27, 245-254.
