A Bayesian approach to statistical matching using the national health screening data
Sejin Bae1· Dal Ho Kim2
12Department of Statistics, Kyungpook National University
Received 23 May 2021, revised 3 June 2021, accepted 11 June 2021
Abstract
The problem of missing data, which complicates the data analysis process, is an inevitable phenomenon in various studies. Statistical matching using existing data or information can address this problem. Many studies have been conducted on statistical matching, including linear regression models and nonparametric methods. However, these methods may not perform well in small sample problems. This study attempts to address this issue from a Bayesian perspective. In particular, we verify the performance of our Bayesian statistical matching method in small sample problems. We use real observed data from the National Health Screening Data to compare the proposed model with existing methods.
Keywords: Bayesian approach, data fusion, distance hot deck, linear regression model, statistical matching.
1. Introduction
Surveys are planned and conducted to obtain suitable information. However, such research is time-consuming and economically burdensome in its design and progress. Moreover, missing data are an inevitable phenomenon in many studies, making the analysis process challenging. These studies include sociological investigations, longitudinal studies, and clinical trials. The inferences and results of these studies can be biased and inefficient if the missing data are ignored or improperly processed (Mason et al., 2010). In particular, making inferences in small sample problems becomes complex owing to a lack of information. Various approaches, including statistical matching, have been proposed and developed to solve these problems. In the study of the statistical matching problem, the underlying individual data structure and the merged form of the data are summarized as illustrated in Figure 1.1 (D’Orazio et al., 2006).
[Figure 1.1 Sample dataset structure in statistical matching problems (two sample files, each with a missing block, combined into a merged file)]
In the sample files A and B obtained from the population, the shaded portion represents unobserved information in each dataset. Common variables have played a significant role in several proposed methodologies for conventional statistical matching. Here, a common variable denotes a variable that is observed in both datasets; in the structure shown in Figure 1.1, it is X. The remaining variables are called unique variables; these correspond to Y in sample file A and Z in sample file B. The goal of statistical matching is to approximate the missing values by statistically merging in predicted values from a designed statistical model. Statistical matching approaches can be divided into two main types: macro and micro approaches. A macro approach obtains direct parameter estimates for each component of the dataset. It is used to obtain information regarding the main characteristics of the data, such as the joint distribution functions and correlations. Conversely, a micro approach is used to estimate the unique variables not contained in each sample dataset to create one complete merged file. Macro and micro approaches are not mutually exclusive but complement each other. Here, we develop a model using a Bayesian approach in which both of these approaches are applied simultaneously. We specifically analyze problems in small sample datasets using real-world data.
1 Ph.D. candidate, Department of Statistics, Kyungpook National University, Daegu 41566, Korea.
2 Corresponding author: Professor, Department of Statistics, Kyungpook National University, Daegu 41566, Korea. E-mail: [email protected]
Bayesian approaches, together with the existing methods, fit within the general framework of multiple imputation methods that have been used to solve statistical matching problems. The Bayesian approach provides a natural way to account for the uncertainty of missing data when reasoning or arriving at conclusions using incomplete data (Daniels and Hogan, 2008; Ibrahim et al., 2005). From a Bayesian perspective, the missing data are considered random variables. Information on their posterior distribution can be obtained from specific prior information on the distribution of the data parameters and the missing covariates. Values for the missing data can be sampled from a conditional distribution via the Markov chain Monte Carlo (MCMC) method, and inference is then based on the posterior distribution (Ahmed, 2011). By simultaneously estimating the unknown parameters and the missing data, the inference is unified (Mason, 2010). Furthermore, Bayesian inference can achieve improved and reliable results in small-sample settings. This can be achieved using only informative prior distributions and additional information (Cai et al., 2010).
The estimation results of data fusion with the proposed Bayesian approach and the existing methods are compared using the National Health Insurance Service’s National Health Screening Data. Section 2 describes the Bayesian model for statistical matching. It briefly introduces the prior distributions and the computations performed by Gibbs sampling using the MCMC method. Furthermore, we use the grid method for conditional distributions of unknown form. In Section 3, we briefly introduce the existing methods, namely the linear regression model and the distance hot deck. We then compare the statistical matching results of our approach and the existing methods on real data using appropriate measures. Finally, we conclude and make suggestions in Section 4.
2. Bayesian model for statistical matching
As already described in Figure 1.1, the overall sample A ∪ B consists of $n_A + n_B$ units from $f(x, y, z)$, with Z missing in A and Y missing in B. Hence, the statistical matching problem can be regarded as a problem of analyzing a partially observed dataset. In the Bayesian approach, we do not separate the roles of the donor file and the recipient file, as is commonly done in the statistical matching mechanism. The overall structure is as follows:
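Let $(X, Y, Z)$ follow the trivariate normal distribution in (2.1), with the parameters given in (2.2).
$$
f(x, y, z \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^3 |\Sigma|}} \exp\left\{ -\frac{1}{2} \left((x, y, z) - \mu'\right) \Sigma^{-1} \left((x, y, z)' - \mu\right) \right\}, \tag{2.1}
$$
$$
\theta = (\mu, \Sigma) = \left( \begin{pmatrix} \mu_X \\ \mu_Y \\ \mu_Z \end{pmatrix}, \begin{pmatrix} \sigma_X^2 & \sigma_{XY} & \sigma_{XZ} \\ \sigma_{XY} & \sigma_Y^2 & \sigma_{YZ} \\ \sigma_{XZ} & \sigma_{YZ} & \sigma_Z^2 \end{pmatrix} \right), \quad \theta \in \Theta. \tag{2.2}
$$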
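The covariance matrix $\Sigma$ can be re-expressed through the standard decomposition, which we denote by $DSD$:
$$
\Sigma = DSD =
\begin{pmatrix} \sqrt{\sigma_{XX}} & 0 & 0 \\ 0 & \sqrt{\sigma_{YY}} & 0 \\ 0 & 0 & \sqrt{\sigma_{ZZ}} \end{pmatrix}
\begin{pmatrix} 1 & \rho_{XY} & \rho_{XZ} \\ \rho_{XY} & 1 & \rho_{YZ} \\ \rho_{XZ} & \rho_{YZ} & 1 \end{pmatrix}
\begin{pmatrix} \sqrt{\sigma_{XX}} & 0 & 0 \\ 0 & \sqrt{\sigma_{YY}} & 0 \\ 0 & 0 & \sqrt{\sigma_{ZZ}} \end{pmatrix}.
$$
Here $D$ is the diagonal matrix of standard deviations (with $\sigma_{XX} = \sigma_X^2$, and similarly for $Y$ and $Z$) and $S$ is the correlation matrix.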
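As a small illustration of the decomposition, the sketch below assembles $\Sigma = DSD$ from three variances and three correlations. The function name and the numerical values are hypothetical and chosen only for the example; they are not taken from the data analyzed in this paper.

```python
import math


def covariance_from_dsd(variances, rho_xy, rho_xz, rho_yz):
    """Return Sigma = D S D as a 3x3 nested list.

    variances: (sigma_X^2, sigma_Y^2, sigma_Z^2); D holds their square roots.
    """
    sd = [math.sqrt(v) for v in variances]
    corr = [
        [1.0, rho_xy, rho_xz],
        [rho_xy, 1.0, rho_yz],
        [rho_xz, rho_yz, 1.0],
    ]
    # Entry (r, c) of D S D is sd[r] * corr[r][c] * sd[c] because D is diagonal.
    return [[sd[r] * corr[r][c] * sd[c] for c in range(3)] for r in range(3)]


# Hypothetical inputs for illustration only.
sigma = covariance_from_dsd([4.0, 9.0, 16.0], 0.5, -0.25, 0.1)
for row in sigma:
    print(row)
```

The diagonal of the result reproduces the variances, and each off-diagonal entry is the correlation scaled by the two standard deviations.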
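Therefore, expression (2.2) can be rewritten in terms of the equivalent parametrization of $\Theta$, with the matrix of correlation coefficients $\rho$:
$$
\theta = (\mu, \sigma, \rho) = \left( \begin{pmatrix} \mu_X \\ \mu_Y \\ \mu_Z \end{pmatrix}, \begin{pmatrix} \sigma_X^2 \\ \sigma_Y^2 \\ \sigma_Z^2 \end{pmatrix}, \begin{pmatrix} 1 & \rho_{XY} & \rho_{XZ} \\ \rho_{XY} & 1 & \rho_{YZ} \\ \rho_{XZ} & \rho_{YZ} & 1 \end{pmatrix} \right).
$$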
After the aforementioned transformation, an additional constraint arises from the model’s assumption of uncertainty in statistical matching: the determinant of the correlation matrix must be positive,
$$
\begin{vmatrix} 1 & \rho_{XY} & \rho_{XZ} \\ \rho_{XY} & 1 & \rho_{YZ} \\ \rho_{XZ} & \rho_{YZ} & 1 \end{vmatrix} > 0.
$$
Solving this inequality gives the admissible values of each correlation coefficient in interval form. $\rho_{XY}$ must lie in the open interval $\rho_{XZ}\rho_{YZ} \pm \sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}$, with both endpoints excluded. In the same way, $\rho_{XZ}$ and $\rho_{YZ}$ must lie in the open intervals $\rho_{XY}\rho_{YZ} \pm \sqrt{(1 - \rho_{XY}^2)(1 - \rho_{YZ}^2)}$ and $\rho_{XY}\rho_{XZ} \pm \sqrt{(1 - \rho_{XY}^2)(1 - \rho_{XZ}^2)}$, respectively.
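The interval for $\rho_{XY}$ can be checked numerically. The following sketch computes the two endpoints for given $\rho_{XZ}$ and $\rho_{YZ}$ and evaluates the determinant of the correlation matrix, which should vanish at the endpoints and be positive strictly inside. The function names and the input values 0.6 and 0.5 are assumptions made for illustration.

```python
import math


def rho_xy_bounds(rho_xz, rho_yz):
    """Endpoints of the open interval of admissible rho_XY values."""
    center = rho_xz * rho_yz
    half_width = math.sqrt((1.0 - rho_xz ** 2) * (1.0 - rho_yz ** 2))
    return center - half_width, center + half_width


def corr_determinant(rho_xy, rho_xz, rho_yz):
    """Determinant of the 3x3 correlation matrix."""
    return (1.0 + 2.0 * rho_xy * rho_xz * rho_yz
            - rho_xy ** 2 - rho_xz ** 2 - rho_yz ** 2)


# Hypothetical correlations for illustration only.
low, high = rho_xy_bounds(0.6, 0.5)
print("admissible rho_XY interval:", (low, high))
print("determinant at midpoint:", corr_determinant(0.5 * (low + high), 0.6, 0.5))
print("determinant at endpoints:",
      corr_determinant(low, 0.6, 0.5), corr_determinant(high, 0.6, 0.5))
```

A sampler that proposes $\rho_{XY}$ outside this interval would produce a covariance matrix that is not positive definite, so the bounds can serve as the support when drawing each correlation.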
For each parameter $(\mu, \sigma, \rho)$, the following prior information is considered for performing the MCMC algorithm with a Gibbs sampler. In particular, for $\sigma$ we assume a shrinkage prior to avoid the difficulties associated with improper priors of the form $\pi(\sigma) \propto 1/\sigma$ (Gelman, 2006; Nandram et al., 2013).
$$
\pi(\mu) \propto 1, \quad
\pi(\sigma_X^2) \propto \frac{1}{(1 + \sigma_X^2)^2}, \quad
\pi(\sigma_Y^2) \propto \frac{1}{(1 + \sigma_Y^2)^2}, \quad
\pi(\sigma_Z^2) \propto \frac{1}{(1 + \sigma_Z^2)^2},
$$
$$
\rho_{XY} \sim U(-1, 1), \quad \rho_{XZ} \sim U(-1, 1), \quad \rho_{YZ} \sim U(-1, 1).
$$
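A short worked integral, added here only as an illustration and not part of the original derivation, shows why the shrinkage prior on each variance is proper while the $1/\sigma$ form is not:

```latex
\int_0^\infty \frac{d\sigma^2}{(1+\sigma^2)^2}
  = \left[ -\frac{1}{1+\sigma^2} \right]_{\sigma^2 = 0}^{\sigma^2 = \infty}
  = 1,
\qquad
\int_0^\infty \frac{d\sigma}{\sigma} = \infty .
```

The first density therefore integrates to one without any further normalization, whereas the second has infinite mass.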
To adapt statistical matching to the Bayesian framework, it is necessary to consider the random variable $R$, representing the missing data mechanism, with probability distribution $h(r \mid \xi)$ and prior probability distribution $\pi(\xi)$ (D’Orazio et al., 2006; Rässler, 2002). Assuming that the missing data mechanism is at least MAR and that $\theta$ and $\xi$ are independent, the posterior distribution of $\theta$ is:
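$$
\begin{aligned}
\pi(\theta \mid (x, y, z)_{obs}, r) &= \int \pi(\theta, \xi \mid (x, y, z)_{obs}, r)\, d\xi \\
&= c^{-1} f((x, y, z)_{obs} \mid \theta)\, \pi(\theta) \int h(r \mid (x, y, z)_{obs}, \xi)\, \pi(\xi)\, d\xi \\
&\propto L(\theta \mid (x, y, z)_{obs})\, \pi(\theta),
\end{aligned}
$$
where $c$ is the normalizing constant
$$
c = \iint f((x, y, z)_{obs}, r \mid \theta, \xi)\, \pi(\theta)\, \pi(\xi)\, d\theta\, d\xi.
$$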
Therefore, all the information on $\theta$ is contained in the above likelihood expression, which does not involve the missing data mechanism. Following D’Orazio et al. (2006), the posterior distribution of $\theta$ to be computed is:
$$
\pi(\theta \mid (x, y, z)_{obs}) \propto \pi(\theta) \prod_{a=1}^{n_A} f_{XY}(x_a^A, y_a^A \mid \theta) \prod_{b=1}^{n_B} f_{XZ}(x_b^B, z_b^B \mid \theta).
$$
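To make the factorization concrete, the sketch below evaluates the observed-data log-likelihood as a sum of bivariate normal log-densities, one factor $f_{XY}$ per unit of file A and one factor $f_{XZ}$ per unit of file B. It assumes the trivariate normal model of (2.1), under which these marginals are bivariate normal; the function names and argument layout are hypothetical.

```python
import math


def log_bivariate_normal(u, v, mu_u, mu_v, var_u, var_v, rho):
    """Log-density of a bivariate normal at (u, v)."""
    zu = (u - mu_u) / math.sqrt(var_u)
    zv = (v - mu_v) / math.sqrt(var_v)
    quad = (zu * zu - 2.0 * rho * zu * zv + zv * zv) / (1.0 - rho * rho)
    log_norm = (-math.log(2.0 * math.pi)
                - 0.5 * math.log(var_u * var_v * (1.0 - rho * rho)))
    return log_norm - 0.5 * quad


def observed_log_likelihood(file_a, file_b, mu, var, rho_xy, rho_xz):
    """Sum of log f_XY over file A pairs (x, y) and log f_XZ over file B pairs (x, z)."""
    mu_x, mu_y, mu_z = mu
    var_x, var_y, var_z = var
    total = sum(log_bivariate_normal(x, y, mu_x, mu_y, var_x, var_y, rho_xy)
                for x, y in file_a)
    total += sum(log_bivariate_normal(x, z, mu_x, mu_z, var_x, var_z, rho_xz)
                 for x, z in file_b)
    return total


# Hypothetical toy files for illustration only.
toy_a = [(0.1, -0.2), (1.0, 0.7)]
toy_b = [(-0.5, 0.3)]
print(observed_log_likelihood(toy_a, toy_b, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 0.4, -0.2))
```

Note that $\rho_{YZ}$ does not appear among the arguments: Y and Z are never observed together, so the observed data carry no direct information about it.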
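The expression in (2.1) can be written for the two datasets A and B, each of which has a missing component, as follows.
$$
f(x_i, y_i, \tilde{z}_i \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^3 |\Sigma|}} \exp\left\{ -\frac{1}{2} \left((x_i, y_i, \tilde{z}_i) - \mu'\right) \Sigma^{-1} \left((x_i, y_i, \tilde{z}_i)' - \mu\right) \right\},
$$
$$
f(x_j, \tilde{y}_j, z_j \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^3 |\Sigma|}} \exp\left\{ -\frac{1}{2} \left((x_j, \tilde{y}_j, z_j) - \mu'\right) \Sigma^{-1} \left((x_j, \tilde{y}_j, z_j)' - \mu\right) \right\},
$$
where $\tilde{z}_i$ and $\tilde{y}_j$ are the missing values in each dataset, with $i = 1, \cdots, n_A$ and $j = 1, \cdots, n_B$.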
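Based on Bayes’ theorem, the joint posterior density of $\tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}$ is given as follows.
$$
\begin{aligned}
& p(\tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ} \mid x, y, z) \\
&\quad \propto \prod_{i=1}^{n_A} f(x_i, y_i, \tilde{z}_i \mid \mu, \sigma, \rho) \times \prod_{j=1}^{n_B} f(x_j, \tilde{y}_j, z_j \mid \mu, \sigma, \rho) \times \pi(\mu, \sigma, \rho) \\
&\quad \propto |DSD|^{-n_A/2} \exp\left[ -\frac{1}{2} \sum_{i=1}^{n_A} \left((x_i, y_i, \tilde{z}_i) - \mu\right)^T (DSD)^{-1} \left((x_i, y_i, \tilde{z}_i) - \mu\right) \right] \\
&\qquad \times |DSD|^{-n_B/2} \exp\left[ -\frac{1}{2} \sum_{j=1}^{n_B} \left((x_j, \tilde{y}_j, z_j) - \mu\right)^T (DSD)^{-1} \left((x_j, \tilde{y}_j, z_j) - \mu\right) \right] \\
&\qquad \times \frac{1}{(1 + \sigma_X^2)^2} \times \frac{1}{(1 + \sigma_Y^2)^2} \times \frac{1}{(1 + \sigma_Z^2)^2} \\
&\qquad \times I(\rho_{XY} \in (-1, 1)) \times I(\rho_{XZ} \in (-1, 1)) \times I(\rho_{YZ} \in (-1, 1)).
\end{aligned}
$$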
Estimating the individual parameters directly from the posterior density is difficult owing to the complexity of the posterior distribution. We therefore use the MCMC method with a Gibbs sampler. The conditional posterior distributions of the parameters are as follows:
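(i)
$$
p(\mu \mid x, y, z, \tilde{y}, \tilde{z}, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \sim N_3\left( \frac{S_{AB}}{n_A + n_B}, \frac{DSD}{n_A + n_B} \right),
$$
where $S_{AB} = \sum_{i=1}^{n_A} (x_i, y_i, \tilde{z}_i) + \sum_{j=1}^{n_B} (x_j, \tilde{y}_j, z_j)$.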
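(ii)
$$
p(\sigma_X^2 \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \propto \frac{(\sigma_X^2)^{-(n_A + n_B)/2}}{(1 + \sigma_X^2)^2} \exp\left\{ -\frac{1}{2R} \left( \frac{\alpha_1}{\sigma_X^2} + \frac{2\beta_1}{\sqrt{\sigma_X^2}} \right) \right\},
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_1 = \left[ \sum_{i=1}^{n_A} (x_i - \mu_X)^2 + \sum_{j=1}^{n_B} (x_j - \mu_X)^2 \right] (1 - \rho_{YZ}^2),
$$
$$
\begin{aligned}
\beta_1 = {} & \left\{ \sum_{i=1}^{n_A} (x_i - \mu_X)(y_i - \mu_Y) + \sum_{j=1}^{n_B} (x_j - \mu_X)(\tilde{y}_j - \mu_Y) \right\} \times \frac{\rho_{XZ}\rho_{YZ} - \rho_{XY}}{\sqrt{\sigma_Y^2}} \\
& + \left\{ \sum_{i=1}^{n_A} (x_i - \mu_X)(\tilde{z}_i - \mu_Z) + \sum_{j=1}^{n_B} (x_j - \mu_X)(z_j - \mu_Z) \right\} \times \frac{\rho_{XY}\rho_{YZ} - \rho_{XZ}}{\sqrt{\sigma_Z^2}}.
\end{aligned}
$$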
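(iii)
$$
p(\sigma_Y^2 \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \propto \frac{(\sigma_Y^2)^{-(n_A + n_B)/2}}{(1 + \sigma_Y^2)^2} \exp\left\{ -\frac{1}{2R} \left( \frac{\alpha_2}{\sigma_Y^2} + \frac{2\beta_2}{\sqrt{\sigma_Y^2}} \right) \right\},
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_2 = \left[ \sum_{i=1}^{n_A} (y_i - \mu_Y)^2 + \sum_{j=1}^{n_B} (\tilde{y}_j - \mu_Y)^2 \right] (1 - \rho_{XZ}^2),
$$
$$
\begin{aligned}
\beta_2 = {} & \left\{ \sum_{i=1}^{n_A} (x_i - \mu_X)(y_i - \mu_Y) + \sum_{j=1}^{n_B} (x_j - \mu_X)(\tilde{y}_j - \mu_Y) \right\} \times \frac{\rho_{XZ}\rho_{YZ} - \rho_{XY}}{\sqrt{\sigma_X^2}} \\
& + \left\{ \sum_{i=1}^{n_A} (y_i - \mu_Y)(\tilde{z}_i - \mu_Z) + \sum_{j=1}^{n_B} (\tilde{y}_j - \mu_Y)(z_j - \mu_Z) \right\} \times \frac{\rho_{XY}\rho_{XZ} - \rho_{YZ}}{\sqrt{\sigma_Z^2}}.
\end{aligned}
$$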
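(iv)
$$
p(\sigma_Z^2 \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ}) \propto \frac{(\sigma_Z^2)^{-(n_A + n_B)/2}}{(1 + \sigma_Z^2)^2} \exp\left\{ -\frac{1}{2R} \left( \frac{\alpha_3}{\sigma_Z^2} + \frac{2\beta_3}{\sqrt{\sigma_Z^2}} \right) \right\},
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_3 = \left[ \sum_{i=1}^{n_A} (\tilde{z}_i - \mu_Z)^2 + \sum_{j=1}^{n_B} (z_j - \mu_Z)^2 \right] (1 - \rho_{XY}^2),
$$
$$
\begin{aligned}
\beta_3 = {} & \left\{ \sum_{i=1}^{n_A} (x_i - \mu_X)(\tilde{z}_i - \mu_Z) + \sum_{j=1}^{n_B} (x_j - \mu_X)(z_j - \mu_Z) \right\} \times \frac{\rho_{XY}\rho_{YZ} - \rho_{XZ}}{\sqrt{\sigma_X^2}} \\
& + \left\{ \sum_{i=1}^{n_A} (y_i - \mu_Y)(\tilde{z}_i - \mu_Z) + \sum_{j=1}^{n_B} (\tilde{y}_j - \mu_Y)(z_j - \mu_Z) \right\} \times \frac{\rho_{XY}\rho_{XZ} - \rho_{YZ}}{\sqrt{\sigma_Y^2}}.
\end{aligned}
$$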
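(v)
$$
\begin{aligned}
& p(\rho_{XY} \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XZ}, \rho_{YZ}) \\
&\quad \propto \exp\left[ -\frac{1}{2} \left\{ \sum_{i=1}^{n_A} \left((x_i, y_i, \tilde{z}_i) - \mu\right)^T (DSD)^{-1} \left((x_i, y_i, \tilde{z}_i) - \mu\right) + \sum_{j=1}^{n_B} \left((x_j, \tilde{y}_j, z_j) - \mu\right)^T (DSD)^{-1} \left((x_j, \tilde{y}_j, z_j) - \mu\right) \right\} \right] \\
&\qquad \times |DSD|^{-(n_A + n_B)/2} \times I(\rho_{XY} \in (-1, 1)).
\end{aligned}
$$
(vi)
$$
\begin{aligned}
& p(\rho_{XZ} \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{YZ}) \\
&\quad \propto \exp\left[ -\frac{1}{2} \left\{ \sum_{i=1}^{n_A} \left((x_i, y_i, \tilde{z}_i) - \mu\right)^T (DSD)^{-1} \left((x_i, y_i, \tilde{z}_i) - \mu\right) + \sum_{j=1}^{n_B} \left((x_j, \tilde{y}_j, z_j) - \mu\right)^T (DSD)^{-1} \left((x_j, \tilde{y}_j, z_j) - \mu\right) \right\} \right] \\
&\qquad \times |DSD|^{-(n_A + n_B)/2} \times I(\rho_{XZ} \in (-1, 1)).
\end{aligned}
$$
(vii)
$$
\begin{aligned}
& p(\rho_{YZ} \mid x, y, z, \tilde{y}, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}) \\
&\quad \propto \exp\left[ -\frac{1}{2} \left\{ \sum_{i=1}^{n_A} \left((x_i, y_i, \tilde{z}_i) - \mu\right)^T (DSD)^{-1} \left((x_i, y_i, \tilde{z}_i) - \mu\right) + \sum_{j=1}^{n_B} \left((x_j, \tilde{y}_j, z_j) - \mu\right)^T (DSD)^{-1} \left((x_j, \tilde{y}_j, z_j) - \mu\right) \right\} \right] \\
&\qquad \times |DSD|^{-(n_A + n_B)/2} \times I(\rho_{YZ} \in (-1, 1)).
\end{aligned}
$$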
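(viii)
$$
\tilde{y}_j \mid x, y, z, \tilde{z}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ} \overset{ind}{\sim} N\left( \mu_Y - \frac{\beta_4}{\alpha_4}, \frac{R}{\alpha_4} \right), \quad j = 1, \cdots, n_B,
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_4 = \frac{1 - \rho_{XZ}^2}{\sigma_Y^2}, \quad
\beta_4 = \left\{ \left( \frac{\rho_{XZ}\rho_{YZ} - \rho_{XY}}{\sqrt{\sigma_X^2}\sqrt{\sigma_Y^2}} \right) (x_j - \mu_X) + \left( \frac{\rho_{XY}\rho_{XZ} - \rho_{YZ}}{\sqrt{\sigma_Y^2}\sqrt{\sigma_Z^2}} \right) (z_j - \mu_Z) \right\}.
$$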
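(ix)
$$
\tilde{z}_i \mid x, y, z, \tilde{y}, \mu, \sigma_X^2, \sigma_Y^2, \sigma_Z^2, \rho_{XY}, \rho_{XZ}, \rho_{YZ} \overset{ind}{\sim} N\left( \mu_Z - \frac{\beta_5}{\alpha_5}, \frac{R}{\alpha_5} \right), \quad i = 1, \cdots, n_A,
$$
where $R = 1 + 2\rho_{XY}\rho_{XZ}\rho_{YZ} - \rho_{XY}^2 - \rho_{XZ}^2 - \rho_{YZ}^2$,
$$
\alpha_5 = \frac{1 - \rho_{XY}^2}{\sigma_Z^2}, \quad
\beta_5 = \left\{ \left( \frac{\rho_{XY}\rho_{YZ} - \rho_{XZ}}{\sqrt{\sigma_X^2}\sqrt{\sigma_Z^2}} \right) (x_i - \mu_X) + \left( \frac{\rho_{XY}\rho_{XZ} - \rho_{YZ}}{\sqrt{\sigma_Y^2}\sqrt{\sigma_Z^2}} \right) (y_i - \mu_Y) \right\}.
$$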
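Steps (viii) and (ix) are univariate normal draws, so they are simple to code once the current parameter values are available. The sketch below implements the draw for a missing $\tilde{y}_j$ using the expressions for $\alpha_4$, $\beta_4$, and $R$; the draw for $\tilde{z}_i$ is analogous with $\alpha_5$ and $\beta_5$. The function names and the parameter values are hypothetical and serve only as an illustration.

```python
import math
import random


def conditional_y(x, z, mu, var, rho):
    """Mean and variance of Y given (X, Z) = (x, z), following step (viii).

    mu = (mu_X, mu_Y, mu_Z), var = (sigma_X^2, sigma_Y^2, sigma_Z^2),
    rho = (rho_XY, rho_XZ, rho_YZ).
    """
    mu_x, mu_y, mu_z = mu
    var_x, var_y, var_z = var
    r_xy, r_xz, r_yz = rho
    big_r = 1.0 + 2.0 * r_xy * r_xz * r_yz - r_xy ** 2 - r_xz ** 2 - r_yz ** 2
    alpha4 = (1.0 - r_xz ** 2) / var_y
    beta4 = ((r_xz * r_yz - r_xy) / math.sqrt(var_x * var_y) * (x - mu_x)
             + (r_xy * r_xz - r_yz) / math.sqrt(var_y * var_z) * (z - mu_z))
    return mu_y - beta4 / alpha4, big_r / alpha4


def draw_missing_y(x, z, mu, var, rho, rng=random):
    """One Gibbs draw of a missing Y value for a unit of file B."""
    mean, variance = conditional_y(x, z, mu, var, rho)
    return rng.gauss(mean, math.sqrt(variance))


# Hypothetical current parameter values for illustration only.
mu = (25.0, 55.0, 120.0)
var = (9.0, 100.0, 1600.0)
rho = (-0.3, 0.0, -0.4)
print(conditional_y(28.0, 160.0, mu, var, rho))
print(draw_missing_y(28.0, 160.0, mu, var, rho, rng=random.Random(1)))
```

With $\rho_{XZ} = 0$ the conditional mean reduces to the familiar form $\mu_Y + \rho_{XY}\frac{\sigma_Y}{\sigma_X}(x - \mu_X) + \rho_{YZ}\frac{\sigma_Y}{\sigma_Z}(z - \mu_Z)$, which gives a quick sanity check on the signs of $\beta_4$.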
From the aforementioned expressions, the conditional posterior distributions of some parameters do not take the form of a standard probability distribution. We therefore sample from these conditional posterior distributions using the grid method (Jang and Kim, 2019; Jang et al., 2020; Nandram et al., 2011; Nandram and Yin, 2016). The main concept of the grid method is as follows. First, we divide the interval from 0 to 1 (or another closed interval) into 100 equal subintervals and calculate the value of the conditional posterior density at the midpoint of each subinterval. We then randomly select a subinterval using the resulting approximate distribution function and draw a random number from a uniform distribution on that subinterval. This generates random numbers that approximately follow the conditional posterior distribution. Using the samples obtained, we estimate the parameters of interest. The convergence of the MCMC algorithm was assessed using trace plots and autocorrelation plots. The Geweke test, which compares the average of the initial 10% of the iterations with that of the last 50%, was also used to assess convergence (Geweke, 1992). In addition, we checked the effective sample size to confirm the diagnosis.
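The grid method described above can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: the target is supplied as an unnormalized log-density on a bounded interval, the function name `grid_sample` is hypothetical, and the example target (proportional to $t(1-t)^4$ on the unit interval) is chosen only because its mean, 2/7, is known. How the variance parameters are mapped to a bounded interval is not specified here and is left out.

```python
import math
import random


def grid_sample(log_density, low, high, n_grid=100, rng=random):
    """Draw one value from an unnormalized density on (low, high) by the grid method."""
    width = (high - low) / n_grid
    midpoints = [low + (k + 0.5) * width for k in range(n_grid)]
    log_vals = [log_density(m) for m in midpoints]
    peak = max(log_vals)
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = [math.exp(v - peak) for v in log_vals]
    # Pick a subinterval with probability proportional to the density at its midpoint.
    k = rng.choices(range(n_grid), weights=weights, k=1)[0]
    left = low + k * width
    # Draw uniformly within the chosen subinterval.
    return rng.uniform(left, left + width)


def example_log_density(t):
    """Hypothetical target: density proportional to t * (1 - t)^4 on (0, 1)."""
    return math.log(t) + 4.0 * math.log(1.0 - t)


rng = random.Random(0)
draws = [grid_sample(example_log_density, 0.0, 1.0, rng=rng) for _ in range(2000)]
print("sample mean:", sum(draws) / len(draws))
```

Working with log-densities keeps the evaluation stable when the conditional posterior involves products over many observations, as in steps (ii) to (vii).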
3. Data analysis
To evaluate the performance of the proposed Bayesian approach, we introduce two existing methods: a linear regression model and a nonparametric micro approach.
3.1. Linear regression model - conditional mean matching
Various statistical matching methods have been developed using regression analysis. The regression-based method fits a regression model on the common variables of the individual datasets used for statistical matching. The method was introduced and used in statistical matching by Kadane (1978) and Rubin (1986). Singh et al. (1993) and Moriarity and Scheuren (2001, 2003) developed generalizations and extensions of the method. The various methods under this approach were reviewed by D’Orazio et al. (2006). For comparison with the Bayesian method used in this study, we analyze the conditional mean matching method introduced in Little and Rubin (2002) and D’Orazio et al. (2006). We estimate conditional regression models using the common and unique variables from each dataset, and then estimate the unobserved unique variable by entering the common variables from the other dataset into the estimated regression model. We then compare the statistical matching results obtained from this model with those obtained by our Bayesian approach.
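Conditional mean matching with a single common variable can be sketched as follows: fit a least-squares line for the unique variable on the common variable in the file where both are observed, then predict the unique variable for the other file. This is a minimal sketch assuming one continuous common variable and an ordinary least-squares fit; the function names and the toy numbers are hypothetical.

```python
def fit_simple_regression(x, u):
    """Ordinary least-squares intercept and slope for u regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_u = sum(u) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxu = sum((xi - mean_x) * (ui - mean_u) for xi, ui in zip(x, u))
    slope = sxu / sxx
    return mean_u - slope * mean_x, slope


def conditional_mean_match(x_donor, u_donor, x_recipient):
    """Impute the unique variable in the recipient file by its fitted conditional mean."""
    intercept, slope = fit_simple_regression(x_donor, u_donor)
    return [intercept + slope * xi for xi in x_recipient]


# Toy data for illustration only: Y is observed in file A and missing in file B.
x_a = [22.0, 24.0, 26.0, 28.0]
y_a = [60.0, 56.0, 52.0, 48.0]
x_b = [23.0, 27.0]
print(conditional_mean_match(x_a, y_a, x_b))
```

The same function imputes Z in file A by swapping the roles of the two files, that is, fitting on file B and predicting at the common-variable values of file A.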
3.2. Nonparametric micro approach - distance hot deck
The nonparametric micro approaches considered for statistical matching include: (i) random hot deck, (ii) rank hot deck, and (iii) distance hot deck. These methods are introduced in Singh et al. (1993). They form matches at random, by rank order, or by the distance between variables, respectively. They build on earlier work by Okner (1972), Ruggles and Ruggles (1974), and Rodgers (1984). Nielsen (2001) confirms these extensions. The distance hot deck essentially entails computing the distance between the units of the two datasets based on a continuous common variable X. We then match the corresponding values by selecting the unit with the closest distance.
$$
d_{ab^*} = \left| x_a^A - x_{b^*}^B \right| = \min_{1 \le b \le n_B} \left| x_a^A - x_b^B \right|.
$$
If there are multiple objects at the same distance, one of them can be selected at random, or the average of the tied objects can be matched. We use the unconstrained distance hot deck, in which each donor object may be selected more than once. For data fusion, matching is performed using the average value of the selected objects at the same distance.
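The unconstrained distance hot deck with averaging of ties can be sketched as below. Each recipient unit is matched to the donor unit or units closest in the common variable, donors may be reused, and tied donors are averaged. The function name, the tolerance used to detect ties, and the toy values are assumptions for illustration.

```python
def distance_hot_deck(x_recipient, x_donor, u_donor, tol=1e-12):
    """Impute a unique variable by the nearest donor(s) in the common variable.

    Donors may be reused (unconstrained matching); donors tied at the
    minimum distance are averaged.
    """
    imputed = []
    for xa in x_recipient:
        distances = [abs(xa - xb) for xb in x_donor]
        d_min = min(distances)
        matched = [u for d, u in zip(distances, u_donor) if d - d_min <= tol]
        imputed.append(sum(matched) / len(matched))
    return imputed


# Toy data for illustration only.
x_b = [20.0, 25.0, 30.0]      # common variable in the donor file
z_b = [50.0, 60.0, 40.0]      # unique variable observed in the donor file
x_a = [21.0, 27.5, 31.0]      # common variable in the recipient file
print(distance_hot_deck(x_a, x_b, z_b))   # [50.0, 50.0, 40.0]
```

In the toy example the second recipient is equidistant from two donors, so its imputed value is the average of their values.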
3.3. Results of data analysis
The data used in the analysis are the NHIS National Health Screening Data (NHIS-2021-1-214). Every two years, basic medical check-ups, such as blood tests and physical measurements, are available to the entire nation. Some of the body measurement information among the various test items is used in this study. The data were collected from about 30 million people who underwent the National Health Screening in 2017 and 2018. In keeping with the characteristics of administrative data, some records identified as input errors or outliers were deleted. In this study, we consider the development of Bayesian approaches in the small sample situation. Therefore, to represent the characteristics of a small sample, 100 males aged 30 to 39 years with chronic diseases were drawn from the entire population group by random sampling based on age-specific stratification.
Among the continuous variables available in the National Health Screening Data, we consider those reflecting the significant relationship between obesity and hyperlipidemia. We analyze the statistical matching problem using three variables: body mass index (BMI), high density lipoprotein (HDL) cholesterol, and triglyceride levels. To create the missing data situation for statistical matching, we randomly divide the data in a ratio of 6:4, as proposed in Yoshizoe and Araki (1999). Each model’s performance is assessed by deliberately masking values of each variable, estimating them with the model under comparison, and then comparing the estimates with the actual values. The common variable X is the body mass index. The unique variables in the individual datasets are the HDL cholesterol as Y and the triglyceride level as Z. The first dataset A is missing the variable Z, and the second dataset B is missing the variable Y.
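The evaluation design can be sketched as follows: complete records are shuffled, split 6:4 into files A and B, and the variable to be treated as missing is set aside in each file so that it can later be compared with the estimates. The record layout with keys `x`, `y`, `z`, the function name, and the seed are hypothetical choices for this illustration.

```python
import random


def split_and_mask(records, frac_a=0.6, seed=2021):
    """Randomly split complete (x, y, z) records into file A and file B.

    File A keeps (x, y) and hides z; file B keeps (x, z) and hides y.
    The hidden values are returned separately for later comparison.
    """
    rng = random.Random(seed)
    order = list(range(len(records)))
    rng.shuffle(order)
    n_a = round(frac_a * len(records))
    idx_a, idx_b = order[:n_a], order[n_a:]
    file_a = [{"x": records[i]["x"], "y": records[i]["y"]} for i in idx_a]
    file_b = [{"x": records[i]["x"], "z": records[i]["z"]} for i in idx_b]
    hidden_z_a = [records[i]["z"] for i in idx_a]
    hidden_y_b = [records[i]["y"] for i in idx_b]
    return file_a, file_b, hidden_z_a, hidden_y_b


# Synthetic records for illustration only (not the screening data).
records = [{"x": float(i), "y": 2.0 * i, "z": 3.0 * i} for i in range(100)]
file_a, file_b, hidden_z_a, hidden_y_b = split_and_mask(records)
print(len(file_a), len(file_b))   # 60 40
```

With 100 records this yields 60 units in file A and 40 in file B, matching the 6:4 ratio.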
In the proposed Bayesian approach, we use Markov chain Monte Carlo (MCMC) to obtain samples of the parameters from the posterior distribution. We use the grid method to obtain samples of all parameters whose conditional distributions have no closed form. We run 20,000 iterations for each parameter and treat the first 5,000 samples as the burn-in period. We determine the thinning interval for each model so as to obtain approximately independent samples, retaining at least 1,000 samples. For example, we draw 20,000 samples, set the first 5,000 samples as the burn-in period, and take every 15th sample. The distribution of each variable used in the analysis is as follows:
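The burn-in and thinning arithmetic in the example can be verified directly: discarding 5,000 of 20,000 iterations leaves 15,000, and keeping every 15th of those gives 1,000 retained samples. The helper name below is hypothetical.

```python
def retained_iterations(n_iter, burn_in, thin):
    """Zero-based indices of the iterations kept after burn-in and thinning."""
    return list(range(burn_in, n_iter, thin))


kept = retained_iterations(20000, 5000, 15)
print(len(kept))            # 1000
print(kept[0], kept[-1])    # 5000 19985
```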
[Figure 3.1 Histogram and cumulative distribution of body mass index]
[Figure 3.2 Histogram and cumulative distribution of HDL cholesterol]
[Figure 3.3 Histogram and cumulative distribution of triglyceride]
For each dataset, we compare the statistical matching results of the proposed approach with those of the linear regression model and the nonparametric method. The estimates are compared using the following two measures:
$$
\text{Average absolute bias (AAB)} = \frac{1}{m} \sum_{i=1}^{m} |c_i - e_i|, \qquad
\text{Average squared deviation (ASD)} = \frac{1}{m} \sum_{i=1}^{m} (c_i - e_i)^2,
$$
where $c_i$ is the $i$-th actual measurement, $e_i$ is the corresponding value estimated by the model, and $m$ is the number of estimated values. Clearly, lower values of these measures imply a better model-based estimate. The accuracy of the estimates obtained by the three methods, as assessed by these two measures, is reported in Table 3.1.
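The two accuracy measures are straightforward to compute. The sketch below averages the absolute and squared differences over the compared values; the function names and the toy numbers are hypothetical and unrelated to the results reported in Table 3.1.

```python
def average_absolute_bias(actual, estimated):
    """AAB: mean of |c_i - e_i| over the compared values."""
    return sum(abs(c - e) for c, e in zip(actual, estimated)) / len(actual)


def average_squared_deviation(actual, estimated):
    """ASD: mean of (c_i - e_i)^2 over the compared values."""
    return sum((c - e) ** 2 for c, e in zip(actual, estimated)) / len(actual)


# Toy values for illustration only.
actual = [1.0, 2.0, 4.0]
estimated = [1.5, 2.0, 3.0]
print(average_absolute_bias(actual, estimated))       # 0.5
print(average_squared_deviation(actual, estimated))
```

Because ASD squares each difference, it penalizes a few large errors more heavily than AAB does, which is why the two measures can rank methods differently.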
The statistical matching results for the two estimated variables confirm that the Bayesian approach performs better on both measures than the linear regression and nonparametric methods. In situations where data must be fused, it may be necessary to combine data with small sample characteristics. In such settings, the proposed Bayesian approach appears sufficiently effective in comparison with the existing methods.
Table 3.1 Results of fitting estimates by estimation method
Method                     Estimate           AAB      ASD
Bayesian approach          HDL cholesterol    0.7079   0.9523
                           Triglyceride       0.7288   0.8976
Linear regression model    HDL cholesterol    0.7457   0.9977
                           Triglyceride       0.7446   0.9706
Distance hot deck          HDL cholesterol    1.0717   1.8859
                           Triglyceride       0.9348   1.4550
4. Concluding remarks
In this study, we propose a Bayesian approach to improve statistical matching in small samples. Two existing methods are considered for comparison with the proposed Bayesian method: conditional mean matching based on conditional regression models, and the distance hot deck, a nonparametric method. The estimation performance of the proposed Bayesian model and the two existing methods was compared using the National Health Insurance Service’s National Health Screening Data. Each method has its own advantages; however, the Bayesian approach showed advantages over the others in a small sample. The three variables considered in the analysis can be regarded as having at least a certain level of association, given their relationship with obesity and hyperlipidemia. However, the results from the linear regression model may be biased owing to the wide variety of distributions among the data. In addition, adding common variables should increase the explanatory power of the statistical matching estimates obtained from the individual models. Nevertheless, the same pattern holds when the number of samples is small. With the Bayesian approach, even with limited sample data, we observe better performance than that of the linear regression model. Thus, the Bayesian approach is meaningful as a methodology appropriate for real situations with multiple constraints. Furthermore, extended statistical matching models can be derived through the addition and development of common variables.
References
Ahmed, M. R. (2011). An investigation of methods for missing data in hierarchical models for discrete data, Ph.D. Thesis, Canada: University of Waterloo.
Cai, J. H., Song, X. Y. and Hser, Y. I. (2010). A Bayesian analysis of mixture structural equation models with non-ignorable missing responses and covariates. Statistics in Medicine, 29, 1861-1874.
D’Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical matching: Theory and practice, Wiley, Chichester.
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments, Bayesian Statistics, 4, 169-193.
Gelman A. (2006). Prior distribution for variance parameters in hierarchical models. Bayesian Anal, 1, 515-533.
Jang, E. J. and Kim, D. H. (2019). Bayesian hierarchical model for the estimation of proper receiver operating characteristic curves using stochastic ordering. Communications for Statistical Applications and Methods, 39, 1514-1528.
Jang, E. J., Nandram, B., Ko, Y. and Kim, D. H. (2020). Small area estimation of receiver operating characteristic curves for ordinal data under stochastic ordering. Statistics in Medicine, 25, 1858-1871.
Kadane, J. B. (1978). Some statistical problems in merging data files. In Department of Treasury, Compendium of Tax Research, 159-179. Washington, DC: US Government Printing Office.
Little, R. J. A. and Rubin, D. B. (2002). Statistical analysis with missing data, 2nd Ed., Wiley, New York.
Mason, A. J. (2010). Bayesian methods for modelling non-random missing data mechanisms in longitudinal studies, In Technical report, London: Imperial College.
Moriarity, C. and Scheuren, F. (2001). Statistical matching: A paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, 17, 407-422.
Moriarity, C. and Scheuren, F. (2003). A note on Rubin’s statistical matching using file concatenation with adjusted weights and multiple imputation. Journal of Business and Economic Statistics, 21, 65-73.
Nandram, B., Bhatta, D., Bhadra, D. and Shen, G. (2013). Bayesian predictive inference of a finite population proportion under selection bias. Stat Methodol, 11, 1-21.
Nandram, B., Toto, M. C. S. and Choi, J. W. (2011). A Bayesian benchmarking of the Scott-Smith model for small areas. Journal of Statistical Computation and Simulation, 81, 1593-1608.
Nandram, B. and Yin, J. (2016). A nonparametric Bayesian prediction interval for a finite population mean. Journal of Statistical Computation and Simulation, 86, 1-17.
Nielsen, S. F. (2001). Nonparametric conditional mean imputation. Journal of Statistical Planning and Inference, 99, 129-150.
Okner, B. A. (1972). Constructing a new data base from existing microdata sets: The 1966 merge file. Annals of Economic and Social Measurement, 1, 325-342.
Rässler, S. (2002). Statistical matching: A frequentist theory, practical applications and alternative Bayesian approaches, Springer-Verlag, New York.
Rodgers, W. L. (1984). An evaluation of statistical matching. Journal of Business and Economic Statistics, 2, 91-102.
Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics, 4, 87-94.
Ruggles, N. and Ruggles, R. (1974). A strategy for merging and matching microdata sets. Annals of Economic and Social Measurement, 1, 353-371.
Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993). Statistical matching: Use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology, 19, 59-79.
Yoshizoe, Y. and Araki, M. (1999). Use of statistical matching for household surveys in Japan. In 52nd Session of the International Statistical Institute, Helsinki, Finland.