
Bayesian test of homogenity in small areas: A discretization approach


MinSup Kim 1 · Balgobin Nandram 2 · Dal Ho Kim 3

1,3 Department of Statistics, Kyungpook National University

2 Department of Mathematical Sciences, Worcester Polytechnic Institute

Received 29 September 2017, revised 30 October 2017, accepted 1 November 2017

Abstract

This paper studies a Bayesian test of homogeneity in contingency tables constructed by discretizing a continuous variable. When events of interest are considered in a small area setup, the continuous variable can be discretized. If the continuous variable is discretized properly, relationships between areas (groups) and the continuous variable of interest that are otherwise invisible can be found. A proper discretization can support the alternative hypothesis of the homogeneity test in the contingency table even if the null hypothesis was not rejected by k-sample tests, including one-way ANOVA. In other words, after discretization the proportion of observations at a particular level can vary from group to group. Once the continuous variable is discretized, the problem can be treated as an analysis of a contingency table.

In this case, the chi-squared test is the most commonly employed method. However, further discretization gives rise to more cells in the table. As a result, the count in the cells becomes smaller and the accuracy of the test becomes lower. To prevent this, we can consider the Bayesian approach and apply it to the setup of the homogeneity test.

Keywords: Contingency table, Dirichlet prior, discretization, hierarchical Bayesian model, test of homogeneity.

1. Introduction

In many studies, one of the main concerns is to test the relationship among more than two groups, especially their homogeneity. To test the homogeneity of k samples, we can use a parametric method, the one-way ANOVA test, or nonparametric rank-based methods such as the Kruskal-Wallis test and the Anderson-Darling test (Scholz and Stephens, 1987).

1 Ph.D. candidate, Department of Statistics, Kyungpook National University, Daegu 41566, Korea.

2 Professor, Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609, USA.

3 Corresponding author: Professor, Department of Statistics, Kyungpook National University, Daegu 41566, Korea. E-mail: [email protected]

Sometimes when we are considering events of interest, we can think of discretization approaches such as the maximally selected chi-square statistics studied by Miller and Siegmund (1982). This approach compares two groups by selecting a cut point that maximizes the standard chi-square statistic and then forming a 2 × 2 table of the numbers of observations. In the k-sample test with more than two groups, discretization approaches can be considered as well. Jiang et al. (2015) used a method of dynamic slicing based on a likelihood-ratio testing framework with regularization. The regularized likelihood-ratio test statistic is mutual information with penalty terms that prevent over-slicing when splitting a continuous variable.

The test statistic is obtained by maximizing over all possible discretization (slicing) schemes. It can also be seen as a generalized version of what Miller and Siegmund (1982) studied. Both Jiang et al. (2015) and Miller and Siegmund (1982) concern the relationship between a continuous variable and a categorical variable.

Now we consider one continuous variable of interest in several small areas. There may be some relationship between the areas and this continuous variable, but it may be difficult to detect. If we discretize the continuous variable and form an l × c contingency table, where l is the number of areas and c is the number of discretized cells, the problem can be treated as an analysis of the contingency table. Although the variable may seem homogeneous across areas, it may not be homogeneous in practice. In that case, a specific discretization can produce different proportions for some categories among areas and lead to rejection of the null hypothesis of homogeneity.

The purpose of this study is to discretize the continuous variable appropriately, create a contingency table, and test the homogeneity between areas (groups). A proper discretization of the continuous variable can support the alternative hypothesis in the homogeneity test even if the null hypothesis was not rejected by k-sample tests, including one-way ANOVA. In this case, the chi-squared test is the most commonly employed method for the homogeneity test in the contingency table. However, further discretization gives rise to more cells in the table. As a result, the counts in the cells become smaller and the accuracy of the test becomes lower.

To address this, we test homogeneity using the Bayes factor. This is a Bayesian analysis of categorical data in contingency tables, a topic studied by Agresti and Hitchcock (2005). Recently, hierarchical Bayesian models for contingency tables have been studied by Woo and Kim (2015, 2016) and Jo and Kim (2017).

In this paper, we consider a Bayesian test of homogeneity in an l × c contingency table constructed by discretizing a continuous variable in small areas. In Section 2, we consider a Bayesian hierarchical multinomial model with Dirichlet priors and obtain the Bayes factor to perform the Bayesian test of homogeneity. In Section 3, we illustrate our results using a real dataset. We construct the l × c contingency table by discretizing the continuous variable using the K-means method, and we specifically use uniform Dirichlet priors. Using this model, the Bayes factor is calculated to perform the homogeneity test, and we compare the results with those of frequentist methods. Finally, we provide concluding remarks in Section 4.

2. Bayesian hierarchical model for test of homogeneity

We consider an l × c contingency table, obtained by discretizing the continuous variable, with cell counts $n_{ij}$ and cell probabilities $\pi_{ij}$ ($i = 1, 2, \ldots, l$; $j = 1, 2, \ldots, c$). Here, l is the number of areas (groups) and c is the number of discrete cells. We write $\mathbf{n}_i = (n_{i1}, n_{i2}, \ldots, n_{ic})'$ and $\boldsymbol{\pi}_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ic})'$; $n_{i\cdot} = \sum_{j=1}^{c} n_{ij}$ is the sum of the counts in the ith row, $n_{\cdot j} = \sum_{i=1}^{l} n_{ij}$ is the sum of the counts in the jth column, and $n = \sum_{i=1}^{l} n_{i\cdot} = \sum_{j=1}^{c} n_{\cdot j}$ is the total cell count. We consider the test of homogeneity

$$H_0: \boldsymbol{\pi}_1 = \boldsymbol{\pi}_2 = \cdots = \boldsymbol{\pi}_l = \boldsymbol{\pi} \quad \text{vs} \quad H_1: \text{not } H_0, \tag{2.1}$$

where $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_c)'$.

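Now we consider the Bayesian hierarchical multinomial model with Dirichlet priors for the contingency table. Under $H_0$, our hierarchical Bayesian model is

$$\begin{aligned}
\mathbf{n}_i \mid \boldsymbol{\pi} &\overset{ind}{\sim} \text{Multinomial}(n_{i\cdot}, \boldsymbol{\pi}), \quad i = 1, 2, \ldots, l; \\
\boldsymbol{\pi} &\sim \text{Dirichlet}(\boldsymbol{\alpha}),
\end{aligned} \tag{2.2}$$

where $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_c)'$. Note that $\boldsymbol{\pi} \mid \boldsymbol{\alpha} \sim \text{Dirichlet}(\boldsymbol{\alpha})$ has the density

$$f(\boldsymbol{\pi} \mid \boldsymbol{\alpha}) = \frac{1}{D(\boldsymbol{\alpha})} \prod_{j=1}^{c} \pi_j^{\alpha_j - 1}, \quad 0 < \pi_j < 1, \quad \sum_{j=1}^{c} \pi_j = 1,$$

where $D(\boldsymbol{\alpha}) = \prod_{j=1}^{c} \Gamma(\alpha_j) / \Gamma\big(\sum_{j=1}^{c} \alpha_j\big)$ is the normalizing constant.

So we have

$$f_0(\mathbf{n}_i \mid \boldsymbol{\pi}) = \frac{n_{i\cdot}!}{\prod_{j=1}^{c} n_{ij}!} \prod_{j=1}^{c} \pi_j^{n_{ij}}, \qquad g_0(\boldsymbol{\pi}) = \frac{1}{D(\boldsymbol{\alpha})} \prod_{j=1}^{c} \pi_j^{\alpha_j - 1}.$$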

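The joint density function for all variables is

$$\begin{aligned}
h_0(\mathbf{n}, \boldsymbol{\pi}) &= \frac{1}{D(\boldsymbol{\alpha})} \prod_{j=1}^{c} \pi_j^{\alpha_j - 1} \times \prod_{i=1}^{l} \left\{ \frac{n_{i\cdot}!}{\prod_{j=1}^{c} n_{ij}!} \prod_{j=1}^{c} \pi_j^{n_{ij}} \right\} \\
&= \frac{1}{D(\boldsymbol{\alpha})} \times \left\{ \prod_{i=1}^{l} \frac{n_{i\cdot}!}{\prod_{j=1}^{c} n_{ij}!} \right\} \prod_{j=1}^{c} \pi_j^{n_{\cdot j} + \alpha_j - 1},
\end{aligned}$$

where $\mathbf{n} = (\mathbf{n}_1', \ldots, \mathbf{n}_l')'$ and $\prod_{i=1}^{l} \prod_{j=1}^{c} \pi_j^{n_{ij}} = \prod_{j=1}^{c} \pi_j^{n_{\cdot j}}$. Then the marginal likelihood under $H_0$ is

$$m_0(\mathbf{n}) = \frac{1}{D(\boldsymbol{\alpha})} \times \left\{ \prod_{i=1}^{l} \frac{n_{i\cdot}!}{\prod_{j=1}^{c} n_{ij}!} \right\} \left\{ \frac{\prod_{j=1}^{c} \Gamma(n_{\cdot j} + \alpha_j)}{\Gamma\big(n + \sum_{j=1}^{c} \alpha_j\big)} \right\}. \tag{2.3}$$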
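As a sketch of the step that leads to (2.3): integrating the joint density over π leaves a Dirichlet kernel with parameters $n_{\cdot j} + \alpha_j$, whose integral is the corresponding normalizing constant. Here $n_{\cdot j}$ denotes the jth column total and $n_{i\cdot}$ the ith row total; the intermediate lines are a reconstruction under these assumptions and are not written out in the original.

```latex
\begin{aligned}
m_0(\mathbf{n})
  &= \int h_0(\mathbf{n}, \boldsymbol{\pi}) \, d\boldsymbol{\pi} \\
  &= \frac{1}{D(\boldsymbol{\alpha})}
     \left\{ \prod_{i=1}^{l} \frac{n_{i\cdot}!}{\prod_{j=1}^{c} n_{ij}!} \right\}
     \int \prod_{j=1}^{c} \pi_j^{\,n_{\cdot j} + \alpha_j - 1} \, d\boldsymbol{\pi} \\
  &= \frac{1}{D(\boldsymbol{\alpha})}
     \left\{ \prod_{i=1}^{l} \frac{n_{i\cdot}!}{\prod_{j=1}^{c} n_{ij}!} \right\}
     \frac{\prod_{j=1}^{c} \Gamma(n_{\cdot j} + \alpha_j)}
          {\Gamma\big(n + \sum_{j=1}^{c} \alpha_j\big)} .
\end{aligned}
```

Applying the same integral separately to each row, with its own $\boldsymbol{\pi}_i$, gives the marginal likelihood under $H_1$.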

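Under $H_1$, our hierarchical Bayesian model is

$$\begin{aligned}
\mathbf{n}_i \mid \boldsymbol{\pi}_i &\overset{ind}{\sim} \text{Multinomial}(n_{i\cdot}, \boldsymbol{\pi}_i), \quad i = 1, 2, \ldots, l; \\
\boldsymbol{\pi}_i &\overset{iid}{\sim} \text{Dirichlet}(\boldsymbol{\alpha}).
\end{aligned} \tag{2.4}$$

So we have

$$f_1(\mathbf{n}_i \mid \boldsymbol{\pi}_i) = \frac{n_{i\cdot}!}{\prod_{j=1}^{c} n_{ij}!} \prod_{j=1}^{c} \pi_{ij}^{n_{ij}}, \qquad g_1(\boldsymbol{\pi}_i) = \frac{1}{D(\boldsymbol{\alpha})} \prod_{j=1}^{c} \pi_{ij}^{\alpha_j - 1}.$$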

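The joint density function for all variables is

$$h_1(\mathbf{n}_i, \boldsymbol{\pi}_i) = \frac{1}{D(\boldsymbol{\alpha})} \prod_{j=1}^{c} \pi_{ij}^{\alpha_j - 1} \times \frac{n_{i\cdot}!}{\prod_{j=1}^{c} n_{ij}!} \prod_{j=1}^{c} \pi_{ij}^{n_{ij}}.$$

Then the marginal likelihood of $\mathbf{n}$ under $H_1$ is

$$m_1(\mathbf{n}) = \left\{ \prod_{i=1}^{l} \frac{1}{D(\boldsymbol{\alpha})} \right\} \times \left\{ \prod_{i=1}^{l} \frac{n_{i\cdot}!}{\prod_{j=1}^{c} n_{ij}!} \right\} \left\{ \prod_{i=1}^{l} \frac{\prod_{j=1}^{c} \Gamma(n_{ij} + \alpha_j)}{\Gamma\big(n_{i\cdot} + \sum_{j=1}^{c} \alpha_j\big)} \right\}. \tag{2.5}$$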

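Finally, the Bayes factor (BF) for the test of homogeneity is

$$BF_{10} = \frac{m_1(\mathbf{n})}{m_0(\mathbf{n})} = \frac{\left\{ \prod_{i=1}^{l} \dfrac{1}{D(\boldsymbol{\alpha})} \right\} \times \left\{ \prod_{i=1}^{l} \dfrac{\prod_{j=1}^{c} \Gamma(n_{ij} + \alpha_j)}{\Gamma\big(n_{i\cdot} + \sum_{j=1}^{c} \alpha_j\big)} \right\}}{\dfrac{1}{D(\boldsymbol{\alpha})} \times \left\{ \dfrac{\prod_{j=1}^{c} \Gamma(n_{\cdot j} + \alpha_j)}{\Gamma\big(n + \sum_{j=1}^{c} \alpha_j\big)} \right\}}. \tag{2.6}$$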

It is difficult to determine the prior distribution when there is little prior information about the parameter. In this case, a noninformative prior can be used. In this study, we apply two commonly used noninformative choices of the parameter $\boldsymbol{\alpha}$. One is the uniform prior with $\boldsymbol{\alpha} = \mathbf{1}$, where $\mathbf{1}$ is the c × 1 vector whose elements are all equal to 1. The other is the Jeffreys prior with $\boldsymbol{\alpha} = 0.5\,\mathbf{1}$. Specifically, when $\boldsymbol{\alpha} = \mathbf{1}$,

$$BF_{10} = \frac{m_1(\mathbf{n})}{m_0(\mathbf{n})} = \{(c - 1)!\}^{l-1} \times \frac{\prod_{i=1}^{l} \dfrac{\prod_{j=1}^{c} \Gamma(n_{ij} + 1)}{\Gamma(n_{i\cdot} + c)}}{\dfrac{\prod_{j=1}^{c} \Gamma(n_{\cdot j} + 1)}{\Gamma(n + c)}}. \tag{2.7}$$

The Bayes factor can be used to quantify the evidence that the data provide for one hypothesis against the other. $BF_{10}$ also offers a way of evaluating evidence in favor of $H_0$, which is indicated by values of $BF_{10}$ below 1. Kass and Raftery (1995) interpreted the Bayes factor as follows, where log denotes the natural logarithm. If $\log(BF_{10})$ is in (0, 1), we get borderline evidence against $H_0$. If $\log(BF_{10})$ is in (1, 3), we get positive evidence against $H_0$. If $\log(BF_{10})$ is in (3, 5), we get strong evidence against $H_0$. If $\log(BF_{10})$ is greater than 5, we get very strong evidence against $H_0$.
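The Bayes factor in (2.6) is easiest to evaluate on the log scale with log-gamma functions. The following is a minimal sketch, not the authors' code: the function names `log_bf10` and `evidence_against_h0` are hypothetical, the same α is assumed under both hypotheses as in (2.2) and (2.4), and the handling of the interval boundaries in the evidence scale is an assumption.

```python
from math import lgamma


def log_bf10(table, alpha):
    """Natural log of BF10 in (2.6) for an l x c table of counts.

    table: list of l rows, each a list of c nonnegative counts.
    alpha: Dirichlet parameters (length c), shared by H0 and H1.
    """
    l = len(table)
    a_sum = sum(alpha)
    # log D(alpha) = sum_j log Gamma(alpha_j) - log Gamma(sum_j alpha_j)
    log_d = sum(lgamma(a) for a in alpha) - lgamma(a_sum)

    # H1 part: one Dirichlet-multinomial term per row (area).
    log_num = -l * log_d
    for row in table:
        log_num += sum(lgamma(n + a) for n, a in zip(row, alpha))
        log_num -= lgamma(sum(row) + a_sum)

    # H0 part: a single term built from the column totals.
    col_totals = [sum(col) for col in zip(*table)]
    log_den = -log_d
    log_den += sum(lgamma(n + a) for n, a in zip(col_totals, alpha))
    log_den -= lgamma(sum(col_totals) + a_sum)

    # The multinomial coefficients are common to m1 and m0 and cancel.
    return log_num - log_den


def evidence_against_h0(log_bf):
    """Label log(BF10) on the scale quoted above (boundary handling assumed)."""
    if log_bf <= 0:
        return "none (BF10 <= 1)"
    if log_bf < 1:
        return "borderline"
    if log_bf < 3:
        return "positive"
    if log_bf < 5:
        return "strong"
    return "very strong"


# Counts of Table 3.4 (rows are areas, columns are K-means levels).
table_3_4 = [[3, 0, 4], [6, 0, 2], [11, 2, 6], [26, 20, 27], [0, 0, 5]]
value = log_bf10(table_3_4, [1.0, 1.0, 1.0])
print(round(value, 3), evidence_against_h0(value))
```

With α = 1 and the counts of Table 3.4, this should give a log Bayes factor of about 1.503, in line with the value reported in Table 3.5. Working with `lgamma` rather than factorials avoids overflow for larger counts.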

3. Numerical study

We use the officetel sales price transaction data of Ulsan Metropolitan City for the first quarter of 2017 to apply the Bayesian hierarchical model described above. The data consist of a categorical variable indicating five small areas in Ulsan Metropolitan City and a continuous variable, the sales price. Area 1 is Nam-Gu with 73 observations, area 2 is Dong-Gu with 8, area 3 is Buk-Gu with 7, area 4 is Ulju-Gun with 5, and area 5 is Jung-Gu with 19. The total number of observations is 112.

Table 3.1 Data layout of sales price transaction data of Ulsan, $l = 5$, $n = 112$, $n_1 = 73$, $n_2 = 8$, $n_3 = 7$, $n_4 = 5$, $n_5 = 19$

Areas   Sales price
1       $n_{11}, \ldots, n_{1 n_1}$
2       $n_{21}, \ldots, n_{2 n_2}$
...     ...
l       $n_{l1}, \ldots, n_{l n_l}$

One-way ANOVA showed no significant difference between the groups (F-test p-value 0.103). Among the nonparametric tests, the p-value of the Kruskal-Wallis test is 0.118 and the p-value of the Anderson-Darling k-sample test is 0.110, again showing no significant difference between the groups.

Figure 3.1 Box-plots of five areas in Ulsan

To make a contingency table by discretization, we first divide the sorted price values by quartiles and then assign each observation to its corresponding area to create a table. Table 3.2 shows the resulting counts, which form a 5 × 4 contingency table.
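The quartile discretization described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `quartile_table` is hypothetical, the toy prices and area labels are made up, and the four levels are formed by rank so that they have equal size when the sample size is divisible by four (the text does not say how ties or cut points were handled).

```python
def quartile_table(prices, areas, n_areas):
    """Cross-tabulate areas (labelled 1..n_areas) against four rank-based levels."""
    n = len(prices)
    # Indices of the observations, ordered by price over the whole data set.
    order = sorted(range(n), key=lambda k: prices[k])
    table = [[0] * 4 for _ in range(n_areas)]
    for rank, k in enumerate(order):
        level = rank * 4 // n  # 0, 1, 2 or 3: quarter of the sorted data
        table[areas[k] - 1][level] += 1
    return table


# Hypothetical toy data: eight prices observed in two areas.
toy_prices = [10, 12, 30, 31, 11, 29, 50, 52]
toy_areas = [1, 1, 1, 1, 2, 2, 2, 2]
toy_table = quartile_table(toy_prices, toy_areas, n_areas=2)
print(toy_table)  # [[1, 1, 2, 0], [1, 1, 0, 2]]
```

Because the levels are defined on the pooled data, every column has the same total, which is also the pattern of Table 3.2, where each column sums to 28.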

Table 3.2 5 × 4 table based on discretization by quartiles

Areas   C1   C2   C3   C4
1        4    1    2    0
2        1    1    4    2
3        4    6    5    4
4       17   17   17   22
5        2    3    0    0

Then we compare the p-values of the classical χ² test and the Cressie-Read test with the Bayes factor for the test of homogeneity in this table. Cressie and Read (1984) introduced the power divergence statistics; the Cressie-Read test is the test based on the following power divergence statistic with r = 2/3.

$$P^2 = \frac{2}{r(r + 1)} \sum_{i,j} n_{ij} \left\{ \left( \frac{n_{ij}}{\hat{\lambda}_{ij}} \right)^{r} - 1 \right\}, \quad -\infty < r < \infty, \tag{3.1}$$

where $n_{ij}$ are the cell counts and $\hat{\lambda}_{ij} = \sum_{i} n_{ij} \sum_{j} n_{ij} / \sum_{i} \sum_{j} n_{ij}$. The Cressie-Read statistic (r = 2/3) is known to be less sensitive than χ² to small cell counts.
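The statistic in (3.1) can be computed directly from the table. The sketch below is an illustration, not the authors' implementation: the function names are hypothetical, only r > 0 is handled (the cases r = 0 and r = -1 are defined as limits and are not covered), and the p-value, which would come from a chi-squared distribution with (l - 1)(c - 1) degrees of freedom, is not computed. For r = 1 the statistic reduces to the Pearson χ² statistic, which the example uses as a check.

```python
def power_divergence(table, r=2 / 3):
    """Power divergence statistic (3.1) for an l x c table, assuming r > 0."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij == 0:
                continue  # an empty cell contributes zero when r > 0
            expected = row_totals[i] * col_totals[j] / n
            total += n_ij * ((n_ij / expected) ** r - 1.0)
    return 2.0 * total / (r * (r + 1.0))


def pearson_chi2(table):
    """Pearson chi-squared statistic; assumes every expected count is positive."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (n_ij - expected) ** 2 / expected
    return stat


# Counts of Table 3.4, used here only to compare the two functions.
counts = [[3, 0, 4], [6, 0, 2], [11, 2, 6], [26, 20, 27], [0, 0, 5]]
cressie_read = power_divergence(counts)       # r = 2/3
same_as_pearson = power_divergence(counts, r=1.0)
print(cressie_read, same_as_pearson, pearson_chi2(counts))
```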

Table 3.3 Comparisons of the χ² test, Cressie-Read test and Bayes factors based on discretization by quartiles

χ² test     Cressie-Read test   α = 1               α = 0.5·1
(p-value)   (p-value)           BF      log(BF)     BF      log(BF)
0.241       0.232               0.157   -1.849      0.026   -3.667

The p-values of the χ² test and the Cressie-Read test show that the null hypothesis cannot be rejected, and the Bayes factors support the null hypothesis.

The χ² test of homogeneity compares the c levels across groups and tests whether each level has the same proportion of observations in every group. Testing with the discretization by quartiles can be viewed as a kind of nonparametric rank test, which tests whether the proportion of observations at each quartile level is the same among the groups. Its result is therefore similar to those of the nonparametric k-sample tests.

Next, we use the K-means method, which is routinely used for clustering. Here, K is chosen by the elbow method, which uses the total within-cluster sum of squared errors. According to the elbow method, K = 3 appears appropriate for clustering the sales price, as shown in Figure 3.2.

We then apply the K-means method to the data of the whole area to discretize the sales price, and assign each observation to its corresponding area to create a table. Table 3.4 shows the resulting counts, which form a 5 × 3 contingency table.
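The K-means discretization and the elbow criterion can be sketched for a single variable as follows. This is a minimal one-dimensional version of Lloyd's algorithm, not the software used in the study: the function names are hypothetical, the toy values are made up, and the deterministic start from evenly spaced order statistics is an assumption chosen to keep the example reproducible.

```python
def assign(values, centers):
    """Index of the nearest center for each value."""
    return [min(range(len(centers)), key=lambda m: (x - centers[m]) ** 2)
            for x in values]


def kmeans_1d(values, k, n_iter=100):
    """One-dimensional K-means; returns (labels, centers, within-cluster SS)."""
    xs = sorted(values)
    # Assumed deterministic start: evenly spaced order statistics.
    centers = [xs[(2 * m + 1) * len(xs) // (2 * k)] for m in range(k)]
    for _ in range(n_iter):
        labels = assign(values, centers)
        new_centers = []
        for m in range(k):
            members = [x for x, lab in zip(values, labels) if lab == m]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(sum(members) / len(members) if members else centers[m])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(values, centers)
    wss = sum((x - centers[lab]) ** 2 for x, lab in zip(values, labels))
    return labels, centers, wss


# Hypothetical toy prices with three visible groups.
toy_values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
toy_labels, toy_centers, toy_wss = kmeans_1d(toy_values, k=3)

# Elbow method: total within-cluster sum of squares for several K.
wss_by_k = {k: kmeans_1d(toy_values, k)[2] for k in range(1, 5)}
print(toy_labels, wss_by_k)
```

The cluster labels play the role of the discretized levels: cross-tabulating them against the area of each observation gives the l × c table, and the value of K at which the within-cluster sum of squares stops dropping sharply is the elbow.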

Table 3.4 5 × 3 table based on discretization by K-means clustering

Areas   P1   P2   P3
1        3    0    4
2        6    0    2
3       11    2    6
4       26   20   27
5        0    0    5

Figure 3.2 The elbow method for finding K

Table 3.5 Comparisons of the χ² test, Cressie-Read test and Bayes factors for 5 × 3 table based on discretization by K-means clustering

χ² test     Cressie-Read test   α = 1               α = 0.5·1
(p-value)   (p-value)           BF      log(BF)     BF      log(BF)
0.014       0.012               4.496   1.503       5.392   1.685

The p-values of the χ² test and the Cressie-Read test show that the null hypothesis can be rejected. For both α = 1 and α = 0.5·1, the Bayes factor gives positive evidence against the null hypothesis. This is because, under this discretization, the proportion of observations at certain levels differs across the groups.


4. Concluding remarks

In this article, our interest was to discretize the continuous variable appropriately, create a contingency table, and test the homogeneity between groups. We constructed hierarchical Bayesian models with Dirichlet priors for the contingency tables obtained by discretization, and we applied the noninformative Dirichlet priors with α = 1 and α = 0.5·1.

We then used Bayes factors for the test of homogeneity in the contingency tables. The reason for using the Bayes factor is that, as the number of cells grows, the cell counts become smaller and the chi-squared test becomes less accurate. A Bayesian method is also useful in the small area setup because the chi-squared test occasionally cannot be performed, namely when all entries in a row or column of the table are zero.

Another concern of this study was how to discretize the continuous variable appropriately. We first applied the discretization by quartiles, which orders the data and divides the data set into four equal-sized groups. A test based on quartiles can be viewed as a kind of nonparametric rank test, and with this discretization the null hypothesis could not be rejected and no evidence against it was obtained. The result is therefore similar to those of the nonparametric k-sample tests.

However, the discretization by the clustering method gives the opposite result: with this discretization the null hypothesis could be rejected and evidence against it was obtained.

The clustering method minimizes the differences within clusters and maximizes the differences between clusters. If actual clusters are present in the data, they appear in the contingency table constructed by discretization using a clustering method such as K-means. Therefore, discretizing a continuous variable by a clustering method can be effective in finding invisible relationships between groups and a continuous variable of interest; that is, the proportions of observations at a particular level vary from group to group.

In this study, we used the elbow method to choose K. The silhouette method (Rousseeuw, 1987) and the gap statistic (Tibshirani et al., 2001) could also be used to choose K, and natural clustering using a Dirichlet process is another possibility.

References

Agresti, A. and Hitchcock, D. B. (2005). Bayesian inference for categorical data analysis. Statistical Methods and Applications, 14, 297-330.

Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B, 46, 440-464.

Jiang, B., Ye, C., and Liu, J. S. (2015). Nonparametric K-sample tests via dynamic slicing. Journal of the American Statistical Association, 110, 642-653.

Jo, A., and Kim, D. H. (2017). Bayes tests of independence for contingency tables from small areas. Journal of the Korean Data & Information Science Society, 28, 207-215.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

Miller, R. and Siegmund, D. (1982). Maximally selected chi square statistics. Biometrics, 38, 1011-1016.

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.

Scholz, F. W. and Stephens, M. A. (1987). K-sample Anderson-Darling tests. Journal of the American Statistical Association, 82, 918-924.


Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B, 63, 411-423.

Woo, N. and Kim, D. H. (2015). A Bayesian uncertainty analysis for nonignorable nonresponse in two-way contingency table. Journal of the Korean Data & Information Science Society, 26, 1547-1555.

Woo, N. and Kim, D. H. (2016). A Bayesian model for two-way contingency tables with nonignorable nonresponse from small areas. Journal of the Korean Data & Information Science Society, 27, 245-254.
