A Bayesian approach to statistical matching using the national health screening data

Sejin Bae¹· Dal Ho Kim²

¹,²Department of Statistics, Kyungpook National University

Received 23 May 2021, revised 3 June 2021, accepted 11 June 2021

Abstract

Missing data, which complicate the analysis process, are inevitable in many studies. Statistical matching, which uses existing data or information, can address this problem. Many statistical matching methods have been studied, including linear regression models and nonparametric methods. However, these methods may not perform well in small-sample problems.

This study attempts to address this issue from a Bayesian perspective. In particular, we verify the performance of our Bayesian statistical matching method in small-sample problems. We use real observed data from the National Health Screening Data to compare the proposed model with existing methods.

Keywords: Bayesian approach, data fusion, distance hot deck, linear regression model, statistical matching.

1. Introduction

Surveys are planned and conducted to obtain suitable information. However, such research is time-consuming and costly to design and carry out. Moreover, missing data are inevitable in many studies, including sociological investigations, longitudinal studies, and clinical trials, and they make the analysis challenging. The inferences and results of these studies can be biased and inefficient if the missing data are ignored or improperly processed (Mason, 2010). In particular, inference in small-sample problems becomes complex because of the lack of information. Various approaches, including statistical matching, have been proposed and developed to solve these problems. In studies of the statistical matching problem, the underlying individual data structure and the merged form of the data are summarized as illustrated in Figure 1.1 (D’Orazio et al., 2006).

In the sample files A and B obtained from the population, the shaded portion represents the unobserved information in each dataset. Common variables have played a significant role in several methodologies proposed for conventional statistical matching. Here, a common variable denotes a variable that is present in both datasets; in the structure shown in Figure 1.1, it is X. The remaining variables are called unique variables; these correspond to Y in sample file A and Z in sample file B. The goal of statistical matching is to merge the files statistically, using predicted values from a statistical model to approximate the missing values.

[Figure 1.1 Sample dataset structure in statistical matching problems. File A contains X and Y, with Z missing; file B contains X and Z, with Y missing; the merged file contains X, Y, and Z.]

Statistical matching approaches can be divided into two main types: macro and micro approaches. A macro approach obtains direct parameter estimates for each component of the dataset. It is used to obtain information regarding the main characteristics of the data, such as the joint distribution functions and correlations. Conversely, a micro approach estimates the unique variables not contained in each sample dataset to create one complete merged file. Macro and micro approaches are not mutually exclusive but complement each other. Here, we develop a model using a Bayesian approach in which both are applied simultaneously. We specifically analyze small-sample problems using real-world data.

¹ Ph.D. candidate, Department of Statistics, Kyungpook National University, Daegu 41566, Korea.

² Corresponding author: Professor, Department of Statistics, Kyungpook National University, Daegu 41566, Korea. E-mail: [email protected]

Bayesian approaches, together with the existing methods, fall within the general framework of multiple imputation methods that have been applied to statistical matching problems. The Bayesian approach provides a natural way to account for the uncertainty of missing data when reasoning or drawing conclusions from incomplete data (Daniels and Hogan, 2008; Ibrahim et al., 2005). From a Bayesian perspective, the missing data are treated as random variables. Information on their posterior distribution can be obtained from specific prior information on the distribution of data parameters and missing covariances. Values for the missing data can be sampled from a conditional distribution via the Markov chain Monte Carlo (MCMC) method, and inference is then based on the posterior distribution (Ahmed, 2011). Because the unknown parameters and the missing data are estimated simultaneously, the inference is unified (Mason, 2010). Furthermore, Bayesian inference can yield improved and reliable results in small samples; these can be obtained using only informative prior distributions and additional information (Cai et al., 2010).

Using the National Health Insurance Service’s National Health Screening Data, we compare the data fusion results of the proposed Bayesian approach with those of existing methods. Section 2 describes the Bayesian model for statistical matching. It briefly introduces the prior distributions and the computation by Gibbs sampling within the MCMC method; for conditional distributions of unknown form, we use the grid method. In Section 3, we briefly introduce the existing methods, namely the linear regression model and the distance hot deck, and then compare their statistical matching results with those of our approach on real data using appropriate measures. Finally, we conclude and make suggestions in Section 4.

2. Bayesian model for statistical matching

As described in Figure 1.1, the overall sample A ∪ B consists of n_A + n_B units drawn from f(x, y, z), with Z missing in A and Y missing in B. Hence, the statistical matching problem can be regarded as the analysis of a partially observed dataset. In the Bayesian approach, we do not separate the roles of donor and recipient files, as is commonly done in statistical matching. The overall structure is as follows:

Let (X, Y, Z) follow the trivariate normal distribution in expression (2.1) with the following parameters.

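$$
f(x, y, z \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^3 |\Sigma|}} \exp\left\{ -\frac{1}{2} \left( (x, y, z) - \mu' \right) \Sigma^{-1} \left( (x, y, z)' - \mu \right) \right\}, \tag{2.1}
$$

$$
\theta = (\mu, \Sigma) = \left( \begin{pmatrix} \mu_X \\ \mu_Y \\ \mu_Z \end{pmatrix}, \begin{pmatrix} \sigma_X^2 & \sigma_{XY} & \sigma_{XZ} \\ \sigma_{XY} & \sigma_Y^2 & \sigma_{YZ} \\ \sigma_{XZ} & \sigma_{YZ} & \sigma_Z^2 \end{pmatrix} \right), \quad \theta \in \Theta. \tag{2.2}
$$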

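The covariance matrix Σ can be re-expressed through the standard decomposition, which we denote DSD, as follows, where σ_XX = σ_X², σ_YY = σ_Y², and σ_ZZ = σ_Z².

$$
\Sigma = DSD =
\begin{pmatrix} \sqrt{\sigma_{XX}} & 0 & 0 \\ 0 & \sqrt{\sigma_{YY}} & 0 \\ 0 & 0 & \sqrt{\sigma_{ZZ}} \end{pmatrix}
\begin{pmatrix} 1 & \rho_{XY} & \rho_{XZ} \\ \rho_{XY} & 1 & \rho_{YZ} \\ \rho_{XZ} & \rho_{YZ} & 1 \end{pmatrix}
\begin{pmatrix} \sqrt{\sigma_{XX}} & 0 & 0 \\ 0 & \sqrt{\sigma_{YY}} & 0 \\ 0 & 0 & \sqrt{\sigma_{ZZ}} \end{pmatrix}.
$$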

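Therefore, expression (2.2) can be rewritten in terms of the equivalent parametrization of Θ, with the matrix of correlation coefficients ρ:

$$
\theta = (\mu, \sigma, \rho) = \left( \begin{pmatrix} \mu_X \\ \mu_Y \\ \mu_Z \end{pmatrix}, \begin{pmatrix} \sigma_X^2 \\ \sigma_Y^2 \\ \sigma_Z^2 \end{pmatrix}, \begin{pmatrix} 1 & \rho_{XY} & \rho_{XZ} \\ \rho_{XY} & 1 & \rho_{YZ} \\ \rho_{XZ} & \rho_{YZ} & 1 \end{pmatrix} \right).
$$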

After this transformation, additional constraints arise from the following condition on the correlation matrix, which reflects the model’s assumption of uncertainty in statistical matching.

$$
\begin{vmatrix} 1 & \rho_{XY} & \rho_{XZ} \\ \rho_{XY} & 1 & \rho_{YZ} \\ \rho_{XZ} & \rho_{YZ} & 1 \end{vmatrix} > 0.
$$

Solving this inequality gives the admissible interval for each correlation coefficient. ρ_XY must lie strictly within the range $\rho_{XZ}\rho_{YZ} \pm \sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)}$, excluding both endpoints. In the same way, ρ_XZ and ρ_YZ must lie within the ranges $\rho_{XY}\rho_{YZ} \pm \sqrt{(1-\rho_{XY}^2)(1-\rho_{YZ}^2)}$ and $\rho_{XY}\rho_{XZ} \pm \sqrt{(1-\rho_{XY}^2)(1-\rho_{XZ}^2)}$, respectively.
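The interval constraint can be checked numerically. The sketch below, in Python with hypothetical function names and made-up correlation values, computes the determinant of the correlation matrix and the admissible interval for one correlation given the other two.

```python
import math


def corr_determinant(r_xy, r_xz, r_yz):
    """Determinant of the 3 x 3 correlation matrix."""
    return 1.0 + 2.0 * r_xy * r_xz * r_yz - r_xy**2 - r_xz**2 - r_yz**2


def admissible_interval(r_a, r_b):
    """Open interval for the third correlation, given the other two."""
    centre = r_a * r_b
    half_width = math.sqrt((1.0 - r_a**2) * (1.0 - r_b**2))
    return centre - half_width, centre + half_width


# Illustrative values: bounds for rho_XY given rho_XZ = 0.6 and rho_YZ = 0.5.
lo, hi = admissible_interval(0.6, 0.5)
mid = 0.5 * (lo + hi)
print(lo, hi, corr_determinant(mid, 0.6, 0.5))
```

Inside the interval the determinant is positive, at the endpoints it is zero, and outside it is negative, which is why the endpoints are excluded.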

For the parameters (µ, σ, ρ), the following prior information is considered for performing the MCMC algorithm with a Gibbs sampler. In particular, for σ we assume a shrinkage prior to avoid the difficulties associated with improper priors of the form π(σ) ∝ 1/σ (Gelman, 2006; Nandram et al., 2013).

$$
\pi(\mu) \propto 1, \quad
\pi(\sigma_X^2) \propto \frac{1}{(1+\sigma_X^2)^2}, \quad
\pi(\sigma_Y^2) \propto \frac{1}{(1+\sigma_Y^2)^2}, \quad
\pi(\sigma_Z^2) \propto \frac{1}{(1+\sigma_Z^2)^2},
$$

$$
\rho_{XY} \sim U(-1, 1), \quad \rho_{XZ} \sim U(-1, 1), \quad \rho_{YZ} \sim U(-1, 1).
$$

To adapt statistical matching to the Bayesian framework, it is necessary to consider the random variable R, which represents the missing data mechanism, with probability distribution h(r|ξ) and prior probability distribution π(ξ) (D’Orazio et al., 2006; Rässler, 2002). Assuming that the missing data mechanism is at least missing at random (MAR) and that θ and ξ are independent, the posterior distribution of θ is:

$$
\begin{aligned}
\pi(\theta \mid (x, y, z)_{obs}, r) &= \int \pi(\theta, \xi \mid (x, y, z)_{obs}, r)\, d\xi \\
&= c^{-1} f((x, y, z)_{obs} \mid \theta)\, \pi(\theta) \int h(r \mid (x, y, z)_{obs}, \xi)\, \pi(\xi)\, d\xi \\
&\propto L(\theta \mid (x, y, z)_{obs})\, \pi(\theta),
\end{aligned}
$$

where c is the normalizing constant

$$
c = \int\!\!\int f((x, y, z)_{obs}, r \mid \theta, \xi)\, \pi(\theta)\, \pi(\xi)\, d\theta\, d\xi.
$$

Therefore, all the information on θ is contained in the likelihood above, which does not involve the missing data mechanism. Following D’Orazio et al. (2006), the posterior distribution of θ to be computed is:

$$
\pi(\theta \mid (x, y, z)_{obs}) \propto \pi(\theta) \prod_{a=1}^{n_A} f_{XY}(x_a^A, y_a^A \mid \theta) \prod_{b=1}^{n_B} f_{XZ}(x_b^B, z_b^B \mid \theta).
$$
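To make the observed-data posterior concrete, the following Python sketch evaluates the logarithm of the two likelihood products. It assumes, as follows from the trivariate normal model, that the margins f_XY and f_XZ are bivariate normal. The function names and the dictionary layout for the means and variances are hypothetical choices made for this illustration.

```python
import math


def log_bvn(u, v, mu_u, mu_v, var_u, var_v, rho):
    """Log density of a bivariate normal distribution at the point (u, v)."""
    a = (u - mu_u) / math.sqrt(var_u)
    b = (v - mu_v) / math.sqrt(var_v)
    one_minus = 1.0 - rho * rho
    quad = (a * a - 2.0 * rho * a * b + b * b) / one_minus
    return (-math.log(2.0 * math.pi)
            - 0.5 * math.log(var_u * var_v * one_minus)
            - 0.5 * quad)


def log_observed_likelihood(file_a, file_b, mu, var, rho_xy, rho_xz):
    """file_a holds (x, y) pairs, file_b holds (x, z) pairs.

    mu and var are dicts keyed by "X", "Y", "Z" (a hypothetical layout).
    """
    ll = sum(log_bvn(x, y, mu["X"], mu["Y"], var["X"], var["Y"], rho_xy)
             for x, y in file_a)
    ll += sum(log_bvn(x, z, mu["X"], mu["Z"], var["X"], var["Z"], rho_xz)
              for x, z in file_b)
    return ll


mu = {"X": 0.0, "Y": 0.0, "Z": 0.0}
var = {"X": 1.0, "Y": 1.0, "Z": 1.0}
value = log_observed_likelihood([(0.0, 0.0)], [(0.0, 0.0)], mu, var, 0.3, 0.2)
print(value)
```

In this sketch ρ_YZ does not appear among the arguments, because Y and Z are never observed together in either file; information on it enters only through the prior and the positive-determinant constraint.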

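For the two datasets A and B, each of which has a missing variable, expression (2.1) can be written as follows.

$$
f(x_i, y_i, \tilde{z}_i \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^3 |\Sigma|}} \exp\left\{ -\frac{1}{2} \left( (x_i, y_i, \tilde{z}_i) - \mu' \right) \Sigma^{-1} \left( (x_i, y_i, \tilde{z}_i)' - \mu \right) \right\},
$$

$$
f(x_j, \tilde{y}_j, z_j \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^3 |\Sigma|}} \exp\left\{ -\frac{1}{2} \left( (x_j, \tilde{y}_j, z_j) - \mu' \right) \Sigma^{-1} \left( (x_j, \tilde{y}_j, z_j)' - \mu \right) \right\},
$$

where $\tilde{z}_i$ and $\tilde{y}_j$ are the missing values in each dataset, with i = 1, ..., n_A and j = 1, ..., n_B.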

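Based on Bayes’ theorem, the joint posterior density of $\tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}$ is given as follows.

$$
\begin{aligned}
& p(\tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ} \mid x, y, z) \\
&\quad \propto \prod_{i=1}^{n_A} \left[ f(x_i, y_i, \tilde{z}_i \mid \mu, \sigma, \rho) \right] \times \prod_{j=1}^{n_B} \left[ f(x_j, \tilde{y}_j, z_j \mid \mu, \sigma, \rho) \right] \times \pi(\mu, \sigma, \rho) \\
&\quad \propto \left\{ \prod_{i=1}^{n_A} \frac{1}{|DSD|^{1/2}} \right\} \exp\left[ -\frac{1}{2} \sum_{i=1}^{n_A} ((x_i, y_i, \tilde{z}_i) - \mu)^T (DSD)^{-1} ((x_i, y_i, \tilde{z}_i) - \mu) \right] \\
&\qquad \times \left\{ \prod_{j=1}^{n_B} \frac{1}{|DSD|^{1/2}} \right\} \exp\left[ -\frac{1}{2} \sum_{j=1}^{n_B} ((x_j, \tilde{y}_j, z_j) - \mu)^T (DSD)^{-1} ((x_j, \tilde{y}_j, z_j) - \mu) \right] \\
&\qquad \times \frac{1}{(1+\sigma_X^2)^2} \times \frac{1}{(1+\sigma_Y^2)^2} \times \frac{1}{(1+\sigma_Z^2)^2} \\
&\qquad \times I(\rho_{XY} \in (-1, 1)) \times I(\rho_{XZ} \in (-1, 1)) \times I(\rho_{YZ} \in (-1, 1)).
\end{aligned}
$$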

Estimating the individual parameters directly from the posterior density is difficult owing to the complexity of the posterior distribution. We therefore use the MCMC method with the Gibbs sampler.

The conditional posterior distributions of the parameters are as follows:

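(i)
$$
p(\mu \mid x, y, z, \tilde{y}, \tilde{z}, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \sim N_3\left( \frac{S_{AB}}{n_A + n_B}, \frac{DSD}{n_A + n_B} \right),
$$
where $S_{AB} = \sum_{i=1}^{n_A} (x_i, y_i, \tilde{z}_i) + \sum_{j=1}^{n_B} (x_j, \tilde{y}_j, z_j)$.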

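(ii)
$$
p(\sigma_X^2 \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \propto \frac{(\sigma_X^2)^{-(n_A+n_B)/2}}{(1+\sigma_X^2)^2} \exp\left\{ -\frac{1}{2R} \left( \frac{\alpha_1}{\sigma_X^2} + 2 \frac{\beta_1}{\sqrt{\sigma_X^2}} \right) \right\},
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_1 = \left\{ \sum_{i=1}^{n_A} (x_i - \mu_X)^2 + \sum_{j=1}^{n_B} (x_j - \mu_X)^2 \right\} \left( 1 - \rho_{YZ}^2 \right),
$$
$$
\begin{aligned}
\beta_1 = \Bigg[ & \left\{ \sum_{i=1}^{n_A} (x_i - \mu_X)(y_i - \mu_Y) + \sum_{j=1}^{n_B} (x_j - \mu_X)(\tilde{y}_j - \mu_Y) \right\} \times \frac{\rho_{XZ}\rho_{YZ} - \rho_{XY}}{\sqrt{\sigma_Y^2}} \\
& + \left\{ \sum_{i=1}^{n_A} (x_i - \mu_X)(\tilde{z}_i - \mu_Z) + \sum_{j=1}^{n_B} (x_j - \mu_X)(z_j - \mu_Z) \right\} \times \frac{\rho_{XY}\rho_{YZ} - \rho_{XZ}}{\sqrt{\sigma_Z^2}} \Bigg].
\end{aligned}
$$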

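(iii)
$$
p(\sigma_Y^2 \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \propto \frac{(\sigma_Y^2)^{-(n_A+n_B)/2}}{(1+\sigma_Y^2)^2} \exp\left\{ -\frac{1}{2R} \left( \frac{\alpha_2}{\sigma_Y^2} + 2 \frac{\beta_2}{\sqrt{\sigma_Y^2}} \right) \right\},
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_2 = \left\{ \sum_{i=1}^{n_A} (y_i - \mu_Y)^2 + \sum_{j=1}^{n_B} (\tilde{y}_j - \mu_Y)^2 \right\} \left( 1 - \rho_{XZ}^2 \right),
$$
$$
\begin{aligned}
\beta_2 = \Bigg[ & \left\{ \sum_{i=1}^{n_A} (x_i - \mu_X)(y_i - \mu_Y) + \sum_{j=1}^{n_B} (x_j - \mu_X)(\tilde{y}_j - \mu_Y) \right\} \times \frac{\rho_{XZ}\rho_{YZ} - \rho_{XY}}{\sqrt{\sigma_X^2}} \\
& + \left\{ \sum_{i=1}^{n_A} (y_i - \mu_Y)(\tilde{z}_i - \mu_Z) + \sum_{j=1}^{n_B} (\tilde{y}_j - \mu_Y)(z_j - \mu_Z) \right\} \times \frac{\rho_{XY}\rho_{XZ} - \rho_{YZ}}{\sqrt{\sigma_Z^2}} \Bigg].
\end{aligned}
$$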

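(iv)
$$
p(\sigma_Z^2 \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \propto \frac{(\sigma_Z^2)^{-(n_A+n_B)/2}}{(1+\sigma_Z^2)^2} \exp\left\{ -\frac{1}{2R} \left( \frac{\alpha_3}{\sigma_Z^2} + 2 \frac{\beta_3}{\sqrt{\sigma_Z^2}} \right) \right\},
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_3 = \left\{ \sum_{i=1}^{n_A} (\tilde{z}_i - \mu_Z)^2 + \sum_{j=1}^{n_B} (z_j - \mu_Z)^2 \right\} \left( 1 - \rho_{XY}^2 \right),
$$
$$
\begin{aligned}
\beta_3 = \Bigg[ & \left\{ \sum_{i=1}^{n_A} (x_i - \mu_X)(\tilde{z}_i - \mu_Z) + \sum_{j=1}^{n_B} (x_j - \mu_X)(z_j - \mu_Z) \right\} \times \frac{\rho_{XY}\rho_{YZ} - \rho_{XZ}}{\sqrt{\sigma_X^2}} \\
& + \left\{ \sum_{i=1}^{n_A} (y_i - \mu_Y)(\tilde{z}_i - \mu_Z) + \sum_{j=1}^{n_B} (\tilde{y}_j - \mu_Y)(z_j - \mu_Z) \right\} \times \frac{\rho_{XY}\rho_{XZ} - \rho_{YZ}}{\sqrt{\sigma_Y^2}} \Bigg].
\end{aligned}
$$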

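(v)
$$
\begin{aligned}
& p(\rho_{XY} \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XZ}, \rho_{YZ}) \\
&\quad \propto \exp\Bigg[ -\frac{1}{2} \Bigg\{ \sum_{i=1}^{n_A} ((x_i, y_i, \tilde{z}_i) - \mu)^T (DSD)^{-1} ((x_i, y_i, \tilde{z}_i) - \mu) \\
&\qquad\qquad + \sum_{j=1}^{n_B} ((x_j, \tilde{y}_j, z_j) - \mu)^T (DSD)^{-1} ((x_j, \tilde{y}_j, z_j) - \mu) \Bigg\} \Bigg] \\
&\qquad \times |DSD|^{-(n_A+n_B)/2} \times I(\rho_{XY} \in (-1, 1)).
\end{aligned}
$$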

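(vi)
$$
\begin{aligned}
& p(\rho_{XZ} \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{YZ}) \\
&\quad \propto \exp\Bigg[ -\frac{1}{2} \Bigg\{ \sum_{i=1}^{n_A} ((x_i, y_i, \tilde{z}_i) - \mu)^T (DSD)^{-1} ((x_i, y_i, \tilde{z}_i) - \mu) \\
&\qquad\qquad + \sum_{j=1}^{n_B} ((x_j, \tilde{y}_j, z_j) - \mu)^T (DSD)^{-1} ((x_j, \tilde{y}_j, z_j) - \mu) \Bigg\} \Bigg] \\
&\qquad \times |DSD|^{-(n_A+n_B)/2} \times I(\rho_{XZ} \in (-1, 1)).
\end{aligned}
$$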

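(vii)
$$
\begin{aligned}
& p(\rho_{YZ} \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}) \\
&\quad \propto \exp\Bigg[ -\frac{1}{2} \Bigg\{ \sum_{i=1}^{n_A} ((x_i, y_i, \tilde{z}_i) - \mu)^T (DSD)^{-1} ((x_i, y_i, \tilde{z}_i) - \mu) \\
&\qquad\qquad + \sum_{j=1}^{n_B} ((x_j, \tilde{y}_j, z_j) - \mu)^T (DSD)^{-1} ((x_j, \tilde{y}_j, z_j) - \mu) \Bigg\} \Bigg] \\
&\qquad \times |DSD|^{-(n_A+n_B)/2} \times I(\rho_{YZ} \in (-1, 1)).
\end{aligned}
$$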

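(viii)
$$
p(\tilde{y} \mid x, y, z, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \overset{ind}{\sim} N_j\left( \mu_Y - \frac{\beta_4}{\alpha_4}, \frac{R}{\alpha_4} \right),
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_4 = \frac{1 - \rho_{XZ}^2}{\sigma_Y^2}, \quad
\beta_4 = \left\{ \left( \frac{\rho_{XZ}\rho_{YZ} - \rho_{XY}}{\sqrt{\sigma_X^2}\sqrt{\sigma_Y^2}} \right) (x_j - \mu_X) + \left( \frac{\rho_{XY}\rho_{XZ} - \rho_{YZ}}{\sqrt{\sigma_Y^2}\sqrt{\sigma_Z^2}} \right) (z_j - \mu_Z) \right\}.
$$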

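(ix)
$$
p(\tilde{z} \mid x, y, z, \tilde{y}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \overset{ind}{\sim} N_i\left( \mu_Z - \frac{\beta_5}{\alpha_5}, \frac{R}{\alpha_5} \right),
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_5 = \frac{1 - \rho_{XY}^2}{\sigma_Z^2}, \quad
\beta_5 = \left\{ \left( \frac{\rho_{XY}\rho_{YZ} - \rho_{XZ}}{\sqrt{\sigma_X^2}\sqrt{\sigma_Z^2}} \right) (x_i - \mu_X) + \left( \frac{\rho_{XY}\rho_{XZ} - \rho_{YZ}}{\sqrt{\sigma_Y^2}\sqrt{\sigma_Z^2}} \right) (y_i - \mu_Y) \right\}.
$$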
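The two imputation steps (viii) and (ix) are ordinary normal draws once the other parameters are fixed. The Python sketch below computes their means and variances from α₄, β₄, α₅, β₅, and R and draws one value. The function names and the dictionary layout are hypothetical, and the numbers are made up for illustration.

```python
import math
import random


def r_det(r_xy, r_xz, r_yz):
    """R, the determinant of the correlation matrix."""
    return 1.0 + 2.0 * r_xy * r_xz * r_yz - r_xy**2 - r_xz**2 - r_yz**2


def y_conditional(x, z, mu, var, r_xy, r_xz, r_yz):
    """Mean and variance of y-tilde given (x, z), following item (viii)."""
    sx, sy, sz = (math.sqrt(var[k]) for k in ("X", "Y", "Z"))
    alpha4 = (1.0 - r_xz**2) / var["Y"]
    beta4 = ((r_xz * r_yz - r_xy) / (sx * sy) * (x - mu["X"])
             + (r_xy * r_xz - r_yz) / (sy * sz) * (z - mu["Z"]))
    return mu["Y"] - beta4 / alpha4, r_det(r_xy, r_xz, r_yz) / alpha4


def z_conditional(x, y, mu, var, r_xy, r_xz, r_yz):
    """Mean and variance of z-tilde given (x, y), following item (ix)."""
    sx, sy, sz = (math.sqrt(var[k]) for k in ("X", "Y", "Z"))
    alpha5 = (1.0 - r_xy**2) / var["Z"]
    beta5 = ((r_xy * r_yz - r_xz) / (sx * sz) * (x - mu["X"])
             + (r_xy * r_xz - r_yz) / (sy * sz) * (y - mu["Y"]))
    return mu["Z"] - beta5 / alpha5, r_det(r_xy, r_xz, r_yz) / alpha5


# Made-up standardized example: only X and Z are correlated (rho_XZ = 0.5).
mu = {"X": 0.0, "Y": 0.0, "Z": 0.0}
var = {"X": 1.0, "Y": 1.0, "Z": 1.0}
z_mean, z_var = z_conditional(2.0, 5.0, mu, var, 0.0, 0.5, 0.0)
rng = random.Random(1)
z_draw = rng.gauss(z_mean, math.sqrt(z_var))  # one imputed value for z-tilde
print(z_mean, z_var, z_draw)
```

In this standardized example the conditional mean reduces to ρ_XZ · x and the variance to 1 − ρ_XZ², the familiar regression result, which is a quick sanity check on the signs in β₅.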

From the expressions above, the conditional posterior distributions of some parameters do not take the form of a standard probability distribution. We therefore sample from these distributions using the grid method (Jang and Kim, 2019; Jang et al., 2020; Nandram et al., 2011; Nandram and Yin, 2016). The main idea of the grid method is as follows. First, we divide the interval from 0 to 1 (or another closed interval) into 100 equal-width subintervals and evaluate the conditional posterior density at the midpoint of each subinterval. We then randomly select a subinterval according to the resulting approximate distribution function and draw a random number from a uniform distribution on that subinterval. This generates random numbers that approximately follow the conditional posterior distribution. Using the samples obtained, we estimate the parameters of interest. Convergence of the MCMC algorithm was assessed using trace plots and autocorrelation plots. The Geweke test, which compares the mean of the first 10% of the iterations with that of the last 50%, was also used to assess convergence (Geweke, 1992). In addition, we checked the effective sample size to confirm the diagnosis.
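The grid method described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `grid_sample`, the use of log densities for numerical stability, and the toy target density are all choices made here.

```python
import math
import random


def grid_sample(log_density, lower, upper, n_grid=100, rng=random):
    """Draw one value from an unnormalized density on [lower, upper] by the grid method."""
    width = (upper - lower) / n_grid
    # Evaluate the (log) density at the midpoint of each subinterval.
    mids = [lower + (k + 0.5) * width for k in range(n_grid)]
    logs = [log_density(m) for m in mids]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]  # rescaled to avoid underflow
    # Pick a subinterval with probability proportional to its midpoint density.
    k = rng.choices(range(n_grid), weights=weights, k=1)[0]
    # Draw uniformly within the chosen subinterval.
    return lower + (k + rng.random()) * width


# Toy target on (-1, 1): a density peaked near 0.3, standing in for a
# conditional posterior of a correlation coefficient.
def toy_log_density(r):
    return -0.5 * ((r - 0.3) / 0.05) ** 2


rng = random.Random(0)
draws = [grid_sample(toy_log_density, -1.0, 1.0, rng=rng) for _ in range(2000)]
print(sum(draws) / len(draws))
```

For a correlation coefficient, `lower` and `upper` could be set to the admissible interval implied by the other two correlations instead of (−1, 1), so that every draw keeps the correlation matrix positive definite.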

3. Data analysis

To evaluate the performance of the proposed Bayesian approach, we introduce two existing methods: conditional mean matching, based on the linear regression model, and the distance hot deck, a nonparametric micro approach.

3.1. Linear regression model - conditional mean matching

Various statistical matching methods have been developed using regression analysis. In the regression approach, a regression model is fitted on the common variables in each of the datasets used for statistical matching. The method was introduced to statistical matching by Kadane (1978) and Rubin (1986). Singh et al. (1993) and Moriarity and Scheuren (2001, 2003) developed generalizations and extensions of the method, and the various methods under this approach were reviewed by D’Orazio et al. (2006). For comparison with the Bayesian method used in this study, we analyze the conditional mean matching method introduced in Little and Rubin (2002) and D’Orazio et al. (2006). We estimate conditional regression models using the common and unique variables in each dataset, and then estimate the unobserved unique variable by entering the common variables of the other dataset into the estimated regression model. We then compare the statistical matching results based on these estimates with the results obtained by our Bayesian approach.
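A minimal Python sketch of conditional mean matching with a single common variable is given below. It assumes simple linear regressions of Y on X (fitted in file A) and of Z on X (fitted in file B); the function names and the toy data are hypothetical and are not the paper's data.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def conditional_mean_match(x_a, y_a, x_b, z_b):
    """Impute Z in file A and Y in file B by their fitted conditional means."""
    a0, a1 = fit_simple_regression(x_a, y_a)  # Y on X, fitted in file A
    b0, b1 = fit_simple_regression(x_b, z_b)  # Z on X, fitted in file B
    z_for_a = [b0 + b1 * x for x in x_a]      # enter A's X into B's model
    y_for_b = [a0 + a1 * x for x in x_b]      # enter B's X into A's model
    return z_for_a, y_for_b


# Toy data with exact linear relations y = 1 + 2x and z = 10 - x.
x_a, y_a = [1.0, 2.0, 3.0], [3.0, 5.0, 7.0]
x_b, z_b = [0.0, 2.0, 4.0], [10.0, 8.0, 6.0]
z_for_a, y_for_b = conditional_mean_match(x_a, y_a, x_b, z_b)
print(z_for_a, y_for_b)
```

Because every imputed value lies exactly on the fitted line, this method tends to understate the variability of the imputed variable, which is one motivation for drawing imputations from a conditional distribution instead.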

3.2. Nonparametric micro approach - distance hot deck

The nonparametric methods considered for statistical matching are micro approaches; they include (i) the random hot deck, (ii) the rank hot deck, and (iii) the distance hot deck. These methods are introduced in Singh et al. (1993). They form matches at random, by the rank order of the variables, or by the distance between variables, respectively. They have also been extended from Okner (1972), Ruggles and Ruggles (1974), and Rodgers (1984), and Nielsen (2001) confirms these extensions. For a continuous common variable X, the distance hot deck finds the distance between the records of the two datasets based on X and then matches each record with the value from the closest record in the other dataset.

$$
d_{ab^*} = \left| x_a^A - x_{b^*}^B \right| = \min_{1 \le b \le n_B} \left| x_a^A - x_b^B \right|.
$$

If several records are at the same distance, one of them can be selected at random, or the average of the tied records can be matched. We use the unconstrained distance hot deck, in which each donor record may be selected more than once. For data fusion, records tied at the same distance are matched using their average value.
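The unconstrained distance hot deck with tie averaging can be sketched in Python as follows. The function name and the toy values are hypothetical; ties are detected by exact equality of distances, which is adequate for this illustration but may need a tolerance for real measurements.

```python
def distance_hot_deck(x_recipient, x_donor, value_donor):
    """For each recipient, impute the donor value at the smallest |x_a - x_b|.

    Donors may be reused (unconstrained); tied donors are averaged.
    """
    imputed = []
    for xa in x_recipient:
        distances = [abs(xa - xb) for xb in x_donor]
        d_min = min(distances)
        tied = [v for d, v in zip(distances, value_donor) if d == d_min]
        imputed.append(sum(tied) / len(tied))
    return imputed


# Toy example: the first recipient is equidistant from two donors.
x_recipient = [1.0, 5.0]
x_donor = [0.0, 2.0, 6.0]
value_donor = [10.0, 20.0, 30.0]
imputed = distance_hot_deck(x_recipient, x_donor, value_donor)
print(imputed)
```

To fill both files, the function would be called twice: once with file B as donor of Z for the records of file A, and once with file A as donor of Y for the records of file B.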

3.3. Results of data analysis

The data used in the analysis are the NHIS National Health Screening Data (NHIS-2021-1-214). Every two years, basic medical check-ups, such as blood tests and physical measurements, are available to the entire nation. This study uses some of the body measurement information among the various test items. The data cover about 30 million people who underwent the National Health Screening in 2017 and 2018. Given the characteristics of administrative data, records identified as input errors or outliers were deleted. In this study, we consider the development of the Bayesian approach in the small-sample situation. Therefore, to represent the characteristics of a small sample, 100 males aged 30 to 39 years with chronic diseases were drawn from the entire population group by random sampling based on age-specific stratification.

Among the continuous variables available in the National Health Screening Data, we consider those with a significant relationship to obesity and hyperlipidemia. We analyze the statistical matching problem using three variables: body mass index (BMI), high density lipoprotein (HDL) cholesterol, and triglyceride level. To create the missing-data situation for statistical matching, we randomly divide the data in a ratio of 6:4, as proposed in Yoshizoe and Araki (1999). The model’s efficiency is assessed by deliberately removing values of each unique variable, fitting the model, and comparing the estimates with the actual values. The common variable X is the body mass index. The unique variables are the HDL cholesterol, Y, and the triglyceride level, Z. The first dataset A is missing the variable Z, and the second dataset B is missing the variable Y.

In the proposed Bayesian approach, we use Markov chain Monte Carlo (MCMC) to obtain samples of the parameters from the posterior distribution, and we use the grid method for all parameters whose conditional posterior distributions have no closed form. We run 20,000 iterations for each parameter and treat the first 5,000 as the burn-in period. We choose the thinning interval for each model so that the retained samples are approximately independent and at least 1,000 samples remain. For example, we draw 20,000 samples, discard the first 5,000 as burn-in, and keep every 15th sample. The distribution of each variable used in the analysis is as follows:
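The burn-in and thinning arithmetic in the example can be checked with a short Python sketch. The function name is hypothetical, and the indexing convention (zero-based, keeping the first post-burn-in draw) is an assumption made here.

```python
def retained_indices(n_iter, burn_in, thin):
    """Zero-based indices of the draws kept after burn-in and thinning."""
    return list(range(burn_in, n_iter, thin))


kept = retained_indices(n_iter=20_000, burn_in=5_000, thin=15)
print(len(kept), kept[0], kept[-1])
```

With these settings, (20,000 − 5,000) / 15 = 1,000 draws are retained, which meets the stated minimum of 1,000 samples.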

[Figure 3.1 Histogram and cumulative distribution of body mass index. Left panel: histogram (density against body mass index). Right panel: cumulative distribution (CDF against body mass index).]

[Figure 3.2 Histogram and cumulative distribution of HDL cholesterol. Left panel: histogram (density against HDL cholesterol). Right panel: cumulative distribution (CDF against HDL cholesterol).]

[Figure 3.3 Histogram and cumulative distribution of triglyceride. Left panel: histogram (density against triglyceride). Right panel: cumulative distribution (CDF against triglyceride).]

For each dataset, we compare the statistical matching results with those obtained by applying the linear regression model and the nonparametric method. Each estimate is evaluated using the following two measures:

$$
\text{Average absolute bias (AAB)} = \frac{1}{m} \sum_{i}^{n} |c_i - e_i|, \qquad
\text{Average squared deviation (ASD)} = \frac{1}{m} \sum_{i}^{n} (c_i - e_i)^2,
$$

where c_i is the i-th actual measurement and e_i is the corresponding value estimated by the model. Clearly, lower values of these measures imply a better model-based estimate. Table 3.1 reports both measures for the estimates obtained by the three methods.
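The two measures are straightforward to compute. The Python sketch below assumes that the divisor m equals the number of compared values, which the text does not state explicitly; the function names and the toy numbers are hypothetical.

```python
def average_absolute_bias(actual, estimated):
    """AAB: mean of |c_i - e_i| over the compared values."""
    return sum(abs(c - e) for c, e in zip(actual, estimated)) / len(actual)


def average_squared_deviation(actual, estimated):
    """ASD: mean of (c_i - e_i)^2 over the compared values."""
    return sum((c - e) ** 2 for c, e in zip(actual, estimated)) / len(actual)


# Toy values, not taken from the paper's data.
actual = [1.0, 2.0, 4.0]
estimated = [1.5, 2.0, 3.0]
aab = average_absolute_bias(actual, estimated)
asd = average_squared_deviation(actual, estimated)
print(aab, asd)
```

ASD penalizes large individual errors more heavily than AAB, so reporting both gives a view of typical error size as well as sensitivity to occasional large misses.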

The statistical matching results for the two variables confirm that the Bayesian approach performs better on both measures than the linear regression model and the nonparametric method. In situations where data must be fused, it may be necessary to combine data with small-sample characteristics. In such cases, the proposed Bayesian approach appears sufficiently effective in comparison with the existing methods.

Table 3.1 Results of fitting estimates by estimation method

Method                    Estimate          AAB      ASD
Bayesian approach         HDL cholesterol   0.7079   0.9523
Bayesian approach         Triglyceride      0.7288   0.8976
Linear regression model   HDL cholesterol   0.7457   0.9977
Linear regression model   Triglyceride      0.7446   0.9706
Distance hot deck         HDL cholesterol   1.0717   1.8859
Distance hot deck         Triglyceride      0.9348   1.4550

4. Concluding remarks

In this study, we propose a Bayesian approach to improve statistical matching in small samples. Two existing methods are considered for comparison: conditional mean matching, based on conditional regression models, and the distance hot deck, a nonparametric method. The estimation efficiency of the proposed Bayesian model and of the two existing methods was compared using the National Health Insurance Service’s National Health Screening Data. Each method has its own advantages; however, the Bayesian approach showed advantages over the others in a small sample. The three variables considered in the analysis can be regarded as having at least a certain level of association, given their relationship with obesity and hyperlipidemia. However, the results of the linear regression model may be biased because of the wide variety of distributions among the data. In addition, adding common variables should increase the explanatory power of the estimates produced by the individual models. Nevertheless, the same conclusion holds when the sample size is small.

With the Bayesian approach, even limited sample data yield better performance than the linear regression model. Thus, the Bayesian approach is significant as a methodology appropriate for real situations with multiple constraints. Furthermore, extended statistical matching models can be derived through the addition and development of common variables.

References

Ahmed, M. R. (2011). An investigation of methods for missing data in hierarchical models for discrete data, Ph.D. Thesis, Canada: University of Waterloo.

Cai, J. H., Song, X. Y. and Hser, Y. I. (2010). A Bayesian analysis of mixture structural equation models with non-ignorable missing responses and covariates. Statistics in Medicine, 29, 1861-1874.

D’Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical matching: Theory and practice, Wiley, Chichester.

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments, Bayesian Statistics, 4, 169-193.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1, 515-533.

Jang, E. J. and Kim, D. H. (2019). Bayesian hierarchical model for the estimation of proper receiver operating characteristic curves using stochastic ordering. Communications for Statistical Applications and Methods, 39, 1514-1528.

Jang, E. J., Nandram, B., Ko, Y. and Kim, D. H. (2020). Small area estimation of receiver operating characteristic curves for ordinal data under stochastic ordering. Statistics in Medicine, 25, 1858-1871.

Kadane, J. B. (1978). Some statistical problems in merging data files. In Department of Treasury, Compendium of Tax Research, 159-179. Washington, DC: US Government Printing Office.

Little, R. J. A. and Rubin, D. B. (2002). Statistical analysis with missing data, 2nd Ed., Wiley, New York.

Mason, A. J. (2010). Bayesian methods for modelling non-random missing data mechanisms in longitudinal studies, In Technical report, London: Imperial College.

Moriarity, C. and Scheuren, F. (2001). Statistical matching: A paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, 17, 407-422.

Moriarity, C. and Scheuren, F. (2003). A note on Rubin’s statistical matching using file concatenation with adjusted weights and multiple imputation. Journal of Business and Economic Statistics, 21, 65-73.

Nandram, B., Bhatta, D., Bhadra, D. and Shen, G. (2013). Bayesian predictive inference of a finite population proportion under selection bias. Statistical Methodology, 11, 1-21.

Nandram, B., Toto, M. C. S. and Choi, J. W. (2011). A Bayesian benchmarking of the Scott-Smith model for small areas. Journal of Statistical Computation and Simulation, 81, 1593-1608.

Nandram, B. and Yin, J. (2016). A nonparametric Bayesian prediction interval for a finite population mean. Journal of Statistical Computation and Simulation, 86, 1-17.

Nielsen, S. F. (2001). Nonparametric conditional mean imputation. Journal of Statistical Planning and Inference, 99, 129-150.

Okner, B. A. (1972). Constructing a new data base from existing microdata sets: The 1966 merge file. Annals of Economic and Social Measurement, 1, 325-342.

Rässler, S. (2002). Statistical matching: A frequentist theory, practical applications and alternative Bayesian approaches, Springer-Verlag, New York.

Rodgers, W. L. (1984). An evaluation of statistical matching. Journal of Business and Economic Statistics, 2, 91-102.

Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics, 4, 87-94.

Ruggles, N. and Ruggles, R. (1974). A strategy for merging and matching microdata sets. Annals of Economic and Social Measurement, 1, 353-371.

Singh, A. C., Mantel, H., Kinack, M. and Rowe, G. (1993). Statistical matching: Use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology, 19, 59-79.

Yoshizoe, Y. and Araki, M. (1999). Use of statistical matching for household surveys in Japan. In 52nd Session of the International Statistical Institute, Helsinki, Finland.