An alternative method in estimating propensity scores with conditional inference tree in multilevel data: A case study

Hyunsuk Han ¹ · Minho Kwak ²

1 Institute of Educational Research, Korea University

2 Quantitative Methodology, University of Georgia

Received 16 May 2019, revised 19 June 2019, accepted 8 July 2019

Abstract

Multilevel data structures are common across social science settings. To investigate the effects of interventions, researchers often conduct observational studies that use large-scale secondary data and incorporate propensity score methods, which support causal inference in non-randomized observational studies. The standard approach estimates the propensity score with logistic regression; however, this approach can be outperformed by alternative methods based on statistical learning and data mining algorithms. To date, little research has addressed the use of data mining methods within propensity score designs, especially with multilevel observational data. The purpose of this study is to examine the performance of propensity score stratification when the propensity scores are estimated by multilevel logistic regression versus a conditional inference tree (CIT), using large-scale secondary data from the Programme for International Student Assessment. The results showed that the conditional inference tree estimates the treatment effect more conservatively. In addition, the covariate balance results showed that the CIT came closer to a randomized treatment/control design than did multilevel logistic regression.

Keywords: Conditional inference tree, multilevel, multilevel logistic regression, non-randomized, propensity score.

1. Introduction

Social science researchers often conduct studies to investigate the effects of interventions using large-scale secondary data (e.g., Bryer and Pruzek, 2011; Hwang et al., 2017; Park and Hwang, 2018). Two factors are critical to the validity of such studies. The first is the non-randomized nature of the data. Because the data were not collected under an experimental design with random assignment, regression methods that directly compare the treatment and control groups may be inadequate for revealing the treatment effect. The second is the multilevel structure of the data. Previous literature has pointed out that analyzing hierarchical data without a multilevel approach leads to biased standard error estimates (Raudenbush and Bryk, 2002).

1 Research professor, Institute of Educational Research, Korea University, Seoul, Republic of Korea.

2 Corresponding author: Ph.D. candidate, Quantitative Methodology, University of Georgia, Athens, Georgia, United States. E-mail: [email protected]

To deal with these problems, propensity score methods combined with a multilevel approach are widely used for causal inference in non-randomized studies. Propensity score methods assign each observation a treatment or control group profile based on a propensity score, which is typically estimated by logistic regression. The score represents the similarity of the observations, and observations with similar propensity scores are treated as comparable (D'Agostino, 1998). With multilevel data, the propensity score is usually estimated by multilevel logistic regression, since the most widely used propensity score methods (e.g., matching, weighting, and stratification) applied to non-nested data are unable to reflect hierarchical data structures.

However, using a multilevel logistic regression model to estimate the propensity score imposes strict requirements on sample size. The minimum sample size for a multilevel regression model is 30 groups with 30 individuals in each group (Maas and Hox, 2005). Although sample size plays a critical role in the estimation process, this requirement is difficult to satisfy for data gathered in practical settings.

With advances in data mining, several studies have reported that alternative methods based on statistical learning and data mining algorithms can be used to estimate propensity scores (Westreich et al., 2010). These methods are also more adequate than the traditional approach when the treatment is assumed to affect the outcome non-linearly or non-additively (Lee et al., 2010; Setoguchi et al., 2008). The most valuable trait of these data mining methods is that they impose less strict sample size requirements. However, few previous studies have compared the two approaches using empirical data.

Therefore, the purpose of this study is to compare a propensity score method based on statistical learning and data mining with the traditional propensity score method, multilevel logistic regression. The Conditional Inference Tree (CIT) is used as the data mining method in this study (Hothorn et al., 2006), and the propensity scores are used for stratification. Propensity scores are estimated by multilevel logistic regression and by conditional inference trees using data from the Programme for International Student Assessment (PISA) (Schleicher et al., 2009). Under stratification, the two methods can be compared on two major results that reflect the efficiency and accuracy of the estimation method. The first is the area of common support extracted as strata. The second is the change in covariate balance. The specific research questions are: (a) how different are the areas of common support produced by the traditional method compared to those produced by the CIT? and (b) how different are the average treatment effects on the treated (ATT) as estimated by the traditional method compared to those estimated by the CIT?

2. Theoretical framework

2.1. Propensity scores

The propensity score (PS) is defined as the conditional probability of assignment to a particular treatment given the observed covariates (Rosenbaum and Rubin, 1983):

$$p(\mathbf{x}) = \Pr(T = 1 \mid \mathbf{X} = \mathbf{x}), \tag{2.1}$$

where $p(\mathbf{x})$ is the propensity score, $T$ is a binary treatment indicator, and $\mathbf{X}$ is the covariate vector. If the dependent variable is independent of the treatment condition given the covariates, the treatment/control assignment is said to be unconfounded. Letting $Y(0)$ and $Y(1)$ denote the dependent variable under the control and treatment conditions, respectively, this can be written as

$$(Y(0), Y(1)) \perp T \mid \mathbf{X}. \tag{2.2}$$

PS is a strategy for reducing the bias caused by a non-random sampling design when the researcher tests a treatment effect. Suppose, for example, that a study is designed to test the effect of a treatment by comparing treatment and control groups. Under an experimental design, covariates that can potentially affect the dependent variable are accounted for in the sampling process. If the covariates are not accounted for, the estimated treatment effect is questionable, since the dependent variable may be affected not only by the treatment but also by the covariates.

PS isolates the true treatment effect, which is otherwise confounded with the covariates. In this respect it is similar to the control variables commonly used in multiple regression models. However, PS has certain aspects that distinguish it from the traditional practice of entering control variables in regression models (Morgan and Winship, 2014). First, the propensity score represents a single value in that the sampling process is random. Second, researchers can evaluate the improvement achieved by the adjustment from the covariate balance. Lastly, while regression with control covariates assumes linearity among variables, PS is a non-parametric method that is free from the linearity assumption.

To evaluate the balance of each covariate, the standardized mean difference (SMD) is used; a value below 0.1 indicates adequate covariate balance (Austin, 2011).

$$\mathrm{SMD} = \frac{\bar{X}_t - \bar{X}_c}{\sigma_t}, \tag{2.3}$$

where $\bar{X}_t$ and $\bar{X}_c$ denote the covariate means of the treatment and control groups, and $\sigma_t$ is the standard deviation of the treatment group.
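The SMD in Equation (2.3) can be computed in a few lines. The following is a minimal sketch rather than the code used in the study: the function name and the covariate values are made up for illustration, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which estimator is meant.

```python
from statistics import mean, stdev


def standardized_mean_difference(treated, control):
    """SMD as in Equation (2.3): (mean_t - mean_c) / sd_t."""
    # Assumption: sample (n - 1) standard deviation of the treatment group.
    sd_t = stdev(treated)
    if sd_t == 0:
        raise ValueError("treatment-group standard deviation is zero")
    return (mean(treated) - mean(control)) / sd_t


# Hypothetical covariate values, invented for illustration only.
treated = [0.2, 0.5, 0.1, 0.4, 0.3]
control = [0.1, 0.4, 0.2, 0.3, 0.2]

smd = standardized_mean_difference(treated, control)
balanced = abs(smd) < 0.1  # threshold suggested by Austin (2011)
print(f"SMD = {smd:.3f}, balanced = {balanced}")
```

The absolute value is taken before comparing with 0.1 because the sign of the SMD only reflects which group has the larger mean.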

2.2. Multilevel logistic model

The multilevel logistic regression model used to estimate the propensity score is as follows:

$$\operatorname{logit}(Z_{ij} = 1 \mid X, W) = \beta_0 + \sum_{m=1}^{M} \beta_m X_{mij} + \sum_{n=1}^{N} \pi_n W_{nj} + s_{0j} + \sum_{m=1}^{M} s_{mj} X_{mij}, \tag{2.4}$$

$$s_{0j}, s_{mj} \sim N(0, \Sigma), \tag{2.5}$$

where $Z_{ij}$ is the treatment indicator for person $i$ in group $j$, $\beta_m$ denotes the fixed effect of the $m$th individual-level independent variable $X_{mij}$, $\pi_n$ denotes the fixed effect of the $n$th cluster-level independent variable $W_{nj}$, and $s_{0j}$ and $s_{mj}$ are random intercepts and slopes with mean zero and covariance matrix $\Sigma$.
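To make the structure of Equation (2.4) concrete, the sketch below evaluates the linear predictor for one unit and converts it to a propensity score with the inverse logit. It is an illustration only: the function name, argument names, and all numeric values are hypothetical, and in practice the coefficients and random effects would come from a fitted multilevel model.

```python
import math


def propensity(beta0, betas, x, pis, w, s0j=0.0, sj=None):
    """Inverse logit of the linear predictor in Equation (2.4)."""
    sj = sj if sj is not None else [0.0] * len(x)
    eta = beta0 + s0j                               # fixed plus random intercept
    eta += sum(b * xi for b, xi in zip(betas, x))   # individual-level fixed effects
    eta += sum(p * wn for p, wn in zip(pis, w))     # cluster-level fixed effects
    eta += sum(s * xi for s, xi in zip(sj, x))      # random slopes
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical values: two individual-level and one cluster-level covariate.
p_average_cluster = propensity(-1.0, [0.8, -0.3], [0.5, 1.0], [0.4], [1.0])
p_high_cluster = propensity(-1.0, [0.8, -0.3], [0.5, 1.0], [0.4], [1.0], s0j=0.7)
print(round(p_average_cluster, 3), round(p_high_cluster, 3))
```

Two units with identical covariates receive different propensity scores when their clusters have different random intercepts, which is the sense in which the model reflects the hierarchical structure.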

2.3. Conditional inference trees model

Data mining methods such as random forests (Breiman, 2011), generalized boosted modeling (McCaffrey et al., 2004), and neural networks (Setoguchi et al., 2008) can be used for propensity score estimation with multilevel data. Since these methods automatically detect interactions, adding the cluster indicator as a covariate allows them to capture covariate-by-cluster interactions that improve prediction of the treatment assignment. Research on data mining methods for propensity score estimation with multilevel data is still at an early stage. However, Gurel and Leite (2014) found evidence through a Monte Carlo simulation that generalized boosted modeling, as implemented in the twang package, does not outperform multilevel logistic regression for propensity score estimation with multilevel data.

Of the various data mining methods, the Conditional Inference Tree (CIT) is a machine learning algorithm that requires fewer assumptions than logistic regression (Westreich et al., 2010). Applied to propensity score estimation, the recursive partitioning algorithm implemented by CIT repeatedly splits the data into groups with similar treatment status, based on the categories of a categorical covariate or a cutoff applied to a continuous covariate. With CIT, propensity scores can be obtained by computing the proportion of trees that classified each observation as treated.

Specifically, CIT can be considered a special case of recursive binary partitioning. The purpose of recursive binary partitioning is to split the whole sample into the two most different sub-samples using the covariates. Hothorn et al. (2015) explained the general algorithm of recursive binary partitioning, and this section summarizes that explanation. First, to introduce the general notation used in the algorithm, the distribution of the response variable given the covariates is represented as Equation (2.6):

$$D(Y \mid X) = D(Y \mid X_1, \ldots, X_M) = D(Y \mid f(X_1, \ldots, X_M)), \tag{2.6}$$

where $D(Y \mid X)$ denotes the distribution of $Y$ given the covariates $X$, and $X$ is an $n \times M$ matrix because the sample size is $n$ with $M$ different covariates. The covariates are taken from a sample space $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_M$. $f(\cdot)$ denotes a regression function. The algorithm is formulated with a vector of non-negative integer values called weights, $w = (w_1, \ldots, w_n)$.

For simplicity, the algorithm is described for the $j$th covariate only; the same process applies to the other covariates. The algorithm consists of the following three steps.

Step 1. Given $w$, the algorithm stops when the hypothesis $H_0^j : D(Y \mid X_j) = D(Y)$ cannot be rejected. Otherwise, select the covariate $X_{j^*}$ that has the strongest relationship with $Y$.

Step 2. Select a subset $A^*$ of the sample space $\mathcal{X}_{j^*}$ in order to split $\mathcal{X}_{j^*}$ into two disjoint subsets, $A^*$ and $\mathcal{X}_{j^*} \setminus A^*$. The corresponding weights $w_{\mathrm{left}}$ and $w_{\mathrm{right}}$ are applied to the two sets. The $i$th element of the weight vector $w_{\mathrm{left}}$ is defined as $w_{\mathrm{left},i} = w_i I(X_{j^* i} \in A^*)$ for all $i = 1, \ldots, n$, where $I(\cdot)$ denotes the indicator function.

Step 3. Repeat steps 1 and 2 recursively, updating the weights $w_{\mathrm{left}}$ and $w_{\mathrm{right}}$ at each iteration.

As noted above, CIT is a special case of recursive binary partitioning. Specifically, CIT has its own stopping rule and variable selection rule for steps 1 and 2. Hothorn et al. (2006) summarized the stopping and variable selection rules of CIT as follows. For the stopping rule in step 1, the test statistic under the $j$th null hypothesis, $T_j$, which depends on the learning sample $\mathcal{L}_n$ of size $n$ and on $w$, can be defined as

$$T_j(\mathcal{L}_n, w) = \operatorname{vec}\left(\sum_{i=1}^{n} w_i \, g_j(X_{ji}) \, h(Y_i, (Y_1, \ldots, Y_n))^{\top}\right) \in \mathbb{R}^{p_j q}, \tag{2.7}$$

where the $\operatorname{vec}(\cdot)$ operator transforms a matrix into a column vector; specifically, it transforms a $p_j \times q$ matrix into a vector of length $p_j q$. $g_j : \mathcal{X}_j \to \mathbb{R}^{p_j}$ denotes a non-random transformation of the $j$th covariate $X_j$, and $h : \mathcal{Y} \times \mathcal{Y}^n \to \mathbb{R}^q$ denotes the influence function. The maximum absolute value of the standardized form of the statistic is a natural choice of test statistic. Denoting the univariate version of $T_j$ as $t$, it can be written as Equation (2.8):

$$c_{\max}(t, \mu, \Sigma) = \max_{k = 1, \ldots, pq} \left| \frac{(t - \mu)_k}{\sqrt{\Sigma_{kk}}} \right|, \tag{2.8}$$

where $\mu$ denotes the expectation of the conditional distribution of $Y$ given $X$, and $\Sigma$ denotes the variance-covariance matrix of that distribution. When the test statistic fails to reject the null hypothesis, the weight updating process is stopped.
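As a small illustration of Equation (2.8), the sketch below standardizes each element of a statistic vector by its mean and the corresponding diagonal entry of the covariance matrix, then takes the largest absolute value. The function name and the numbers are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def c_max(t, mu, sigma):
    """Maximum absolute standardized statistic, as in Equation (2.8)."""
    return max(
        abs((t[k] - mu[k]) / math.sqrt(sigma[k][k]))
        for k in range(len(t))
    )


# Hypothetical two-element statistic with a diagonal covariance matrix.
t = [5.0, 1.0]
mu = [3.0, 2.5]
sigma = [[4.0, 0.0],
         [0.0, 0.25]]

# Element 1: |5.0 - 3.0| / 2.0 = 1.0; element 2: |1.0 - 2.5| / 0.5 = 3.0.
statistic = c_max(t, mu, sigma)
print(statistic)
```

Only the diagonal of the covariance matrix enters the statistic, so the off-diagonal entries affect its reference distribution but not its value.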

For the variable selection rule, once the covariate $X_{j^*}$ has been selected, the two-sample linear statistic used to evaluate the goodness of a split can be formulated as Equation (2.9):

$$T^{A}_{j^*}(\mathcal{L}_n, w) = \operatorname{vec}\left(\sum_{i=1}^{n} w_i \, I(X_{j^* i} \in A) \, h(Y_i, (Y_1, \ldots, Y_n))^{\top}\right) \in \mathbb{R}^{q}. \tag{2.9}$$

The statistic reflects the difference between the two mutually exclusive subsamples. The subset $A^*$ that maximizes the split statistic is the optimal choice, and it can be expressed as Equation (2.10):

$$A^* = \operatorname*{argmax}_{A} \; c(t^{A}_{j^*}, \mu^{A}_{j^*}, \Sigma^{A}_{j^*}), \tag{2.10}$$

where $\mu^{A}_{j^*}$ denotes the expectation of the conditional distribution of $Y$ given $X$ for the subsample, $\Sigma^{A}_{j^*}$ denotes the variance-covariance matrix of the distribution for the subsample, and $t^{A}_{j^*}$ denotes the univariate version of $T^{A}_{j^*}$.
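The split search in Equations (2.9) and (2.10) can be sketched for the simplest case. The code below is an illustration under stated assumptions, not the implementation used in the study: it assumes unit case weights, a single numeric covariate, a univariate response with the identity influence function (here a 0/1 treatment indicator), and candidate subsets of the form "x <= cut". The permutation mean and variance used for standardization are the standard ones for the sum of a randomly chosen subset of the responses; the function name and data are hypothetical.

```python
import math


def best_split(x, y):
    """Return (cut, statistic) maximizing the standardized split statistic."""
    n = len(y)
    mean_y = sum(y) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    best_cut, best_stat = None, 0.0
    # The largest value is skipped because it would place every case in A.
    for cut in sorted(set(x))[:-1]:
        in_a = [xi <= cut for xi in x]
        n_a = sum(in_a)
        t = sum(yi for yi, a in zip(y, in_a) if a)    # sum of responses in A
        mu = n_a * mean_y                             # permutation expectation
        sigma = var_y * n_a * (n - n_a) / (n - 1)     # permutation variance
        if sigma == 0:
            continue
        stat = abs(t - mu) / math.sqrt(sigma)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat


# Hypothetical data: units with x > 4 are treated, the others are not.
x = [3, 7, 1, 5, 8, 2, 6, 4]
y = [0, 1, 0, 1, 1, 0, 1, 0]

cut, stat = best_split(x, y)
print(cut, round(stat, 3))
```

In a full tree, this search would be repeated within each resulting subset until the stopping rule of step 1 applies.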

3. Methods

3.1. Measure

In this study, school-level variables are level-1 variables and country-level variables are level-2 variables, since the treatment variable is a school-level variable indicating whether the school is private or not. The response variable of the model is the school-level math score, defined as the average score of the students in the school. Each school has around 20 students in the PISA data. The school-level (level-1) covariates are shown in Table 3.1.

Table 3.1 The list of covariates and comment with labels

| Variable label | Covariate label | Comment |
|---|---|---|
| $X^{1}$ | COMPWEB | Index of computers connected to the internet |
| $X^{2}$ | IRATCOMP | Index of computer availability |
| $X^{3}$ | SCHSIZE | Index of school size |
| $X^{4}$ | EXCURACT | Index of extra-curricular activities |
| $X^{5}$ | LDRSHP | Index of school principal's leadership |
| $X^{6}$ | RESPCURR | Index of the school responsibility for curriculum and assessment |
| $X^{7}$ | RESPRES | Index of school responsibility for resource allocation |
| $X^{8}$ | SCMATEDU | Index on the school's educational resources |
| $X^{9}$ | STUDBEHA | Index on student-related factors affecting school climate |
| $X^{10}$ | TCHPARTI | Index of teacher participation |
| $X^{11}$ | TCSHORT | Index of teacher shortage |
| $X^{12}$ | TEACBEHA | Index on teacher-related factors affecting school climate |

For notational convenience, the variable labels (e.g., $X^{1}, X^{2}, \ldots, X^{12}$) are used in the equations of this study instead of the covariate names. Variables considered potentially related to the school-level math score are used as the level-1 covariates.

3.2. Sample

PISA 2009 data are used in this study. The data were gathered from 6,957 schools across 24 countries (Argentina, Australia, Austria, Belgium, Brazil, Canada, Chile, China, Colombia, Denmark, Indonesia, Ireland, Israel, Italy, Japan, Korea, Kyrgyzstan, Mexico, Panama, Peru, Qatar, Sweden, Trinidad, Uruguay). Of these 6,957 schools, 5,153 are public and the remainder are private ($N_{\mathrm{private}} = 1{,}804$).

3.3. Data analysis

For multilevel logistic regression, the random intercept model is used in this study. The equations for level 1, level 2, and the composite model are as follows.

Level-1 (schools)

$$\operatorname{logit}(Y_{ij} = 1) = \beta_{0j} + \sum_{m=1}^{12} \beta_{mj} X_{ij}^{m} + r_{ij}, \tag{3.1}$$

Level-2 (countries)

$$\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}, \tag{3.2}$$

$$\beta_{mj} = \gamma_{m0} + \gamma_{m1} W_j + u_{mj}, \tag{3.3}$$

Composite model

$$\operatorname{logit}(Y_{ij} = 1) = \left(\gamma_{00} + \sum_{m=1}^{12} \gamma_{m0} X_{ij}^{m}\right) \tag{3.4}$$

$$\quad + \left(\gamma_{01} W_j + \sum_{m=1}^{12} \gamma_{m1} W_j X_{ij}^{m}\right) \tag{3.5}$$

$$\quad + \left(u_{0j} + \sum_{m=1}^{12} u_{mj} X_{ij}^{m} + r_{ij}\right), \tag{3.6}$$

where $Y_{ij}$ indicates whether the school is private or not: 1 indicates a private school and 0 indicates a public school. $X_{ij}^{m}$ is the $m$th school-level (level-1) covariate; there are 12 school-level covariates in the model. $\beta_{mj}$ denotes the slope of the $m$th school-level (level-1) covariate for the $j$th country, and $r_{ij}$ denotes the school-level unique effect. In the level-2 equations, $\gamma_{00}$ denotes the average intercept of the model, and $\gamma_{01}$ indicates the difference in intercepts across countries. $W_j$ indicates the country covariate (level-2 variable). $\gamma_{m0}$ indicates the average slope of the $m$th level-1 covariate, and $\gamma_{m1}$ denotes the difference in slopes across countries. $u_{0j}$ and $u_{mj}$ are the country-level unique effects on the intercept and slopes for the $j$th country. The covariance matrix in this model is assumed to be compound symmetric. Although the number of strata is fixed at 5, the number of strata retained for each country can vary, since strata with fewer than 5 observations within a country are omitted.

The data analysis was conducted in the R 3.2.5 environment (R Development Core Team, 2016), under conditions in which the probability of treatment assignment depended on ten individual- and cluster-level covariates as well as on the random effects of clusters. The R package multilevelPSA (Bryer, 2011) was used to analyze the data.
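The stratification step described in this section can be sketched as follows. This is a rough illustration in Python and not the multilevelPSA procedure itself: the function names are hypothetical, the strata are formed as five equal-sized groups of the sorted propensity scores within one country, and the rule "at least 5 treated and 5 control units per stratum" is an assumed reading of the omission rule stated above. The data are synthetic.

```python
from statistics import mean


def stratify(records, n_strata=5, min_per_group=5):
    """Split (propensity, treated, outcome) records into strata of sorted scores.

    Strata with too few treated or control units are dropped (assumed rule).
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    kept = []
    for k in range(n_strata):
        chunk = ordered[k * n // n_strata:(k + 1) * n // n_strata]
        treated = [y for _, t, y in chunk if t == 1]
        control = [y for _, t, y in chunk if t == 0]
        if len(treated) >= min_per_group and len(control) >= min_per_group:
            kept.append((treated, control))
    return kept


def stratified_difference(strata):
    """Size-weighted mean of within-stratum (treated - control) differences."""
    total = sum(len(t) + len(c) for t, c in strata)
    return sum(
        (mean(t) - mean(c)) * (len(t) + len(c)) / total
        for t, c in strata
    )


# Synthetic records: treated units score exactly 10 points higher.
records = [(i / 60, i % 2, 500.0 + 10.0 * (i % 2)) for i in range(60)]
strata = stratify(records)
effect = stratified_difference(strata)
print(len(strata), effect)
```

With fewer observations, some or all strata fall below the minimum and are dropped, which is how the number of strata can differ between countries even though it is nominally fixed at 5.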

4. Result

4.1. Descriptive statistics

The number of schools per country varied from 144 (Ireland) to 1,532 (Mexico). Since the minimum number of clusters and the sample size per cluster was 30, the minimum size assumption of multilevel logistic regression was satisfied. The numbers of private and public schools were not balanced. Except for Kyrgyzstan, the sample size per school type also satisfied the minimum sample size of 30. The average score for private schools was higher than that of public schools. Korea showed the highest score among the sampled countries, and Kyrgyzstan showed the lowest. In general, the scores of private schools tended to be higher than those of public schools, except in the case of China, Korea, Japan, Italy, and Trinidad (see Table A.1 in Appendix).

The covariates used in the model were standardized as z-scores, with the exception of the school size variable. The covariate reflecting the index of computers connected to the internet yielded the highest score among the covariates. On the other hand, the lowest covariate was that reflecting the index of school responsibility for curriculum and assessment. These covariates are assumed to be positively related to the school average math score (see Table A.2 in Appendix).

Propensity score stratification produced two major results. The first is the stratification result, which indicates the number of strata and the treatment effect estimated for each stratum (see Table 4.1).

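Table 4.1 Result of stratification and treatment effect test based on multilevel logistic regression

| Level 2 | Country | Strata | Math score (private) | N private | Math score (public) | N public | CI.min | CI.max |
|---|---|---|---|---|---|---|---|---|
| 1 | Argentina | Overall | 426.30 | 50 | 354.37 | 30 | -113.76 | -30.11 |
| | | 1 | 427.52 | 32 | 364.84 | 8 | NA | NA |
| | | 2 | 425.08 | 18 | 343.89 | 22 | NA | NA |
| 2 | Australia | Overall | 515.31 | 63 | 502.59 | 78 | -29.14 | 3.69 |
| | | 1 | 526.76 | 52 | 509.78 | 18 | NA | NA |
| | | 2 | 504.03 | 11 | 495.50 | 60 | NA | NA |
| 3 | Austria | Overall | 502.14 | 31 | 448.23 | 78 | -90.68 | -17.15 |
| | | 1 | 488.85 | 24 | 434.47 | 31 | NA | NA |
| | | 2 | 515.67 | 7 | 462.24 | 47 | NA | NA |
| 4 | Belgium | Overall | 514.47 | 189 | 485.60 | 89 | -50.69 | -7.05 |
| | | 1 | 559.28 | 49 | 532.06 | 7 | NA | NA |
| | | 2 | 538.10 | 45 | 473.71 | 10 | NA | NA |
| | | 3 | 523.28 | 40 | 513.83 | 16 | NA | NA |
| | | 4 | 489.51 | 33 | 461.06 | 22 | NA | NA |
| | | 5 | 462.18 | 22 | 446.71 | 34 | NA | NA |
| 5 | Brazil | Overall | 416.61 | 98 | 363.16 | 266 | -78.69 | -28.22 |
| | | 1 | 462.85 | 93 | 361.96 | 89 | NA | NA |
| | | 2 | 370.38 | 5 | 364.35 | 177 | NA | NA |
| 6 | Canada | Overall | 566.74 | 71 | 507.37 | 124 | -72.95 | -45.78 |
| | | 1 | 566.74 | 71 | 507.37 | 124 | NA | NA |
| 7 | Chile | Overall | 389.55 | 30 | 339.77 | 7 | -106.62 | 7.06 |
| | | 1 | 389.55 | 30 | 339.77 | 7 | NA | NA |
| 8 | Colombia | Overall | 443.59 | 47 | 393.53 | 8 | -73.31 | -26.82 |
| | | 1 | 443.59 | 47 | 393.53 | 8 | NA | NA |
| 9 | Denmark | Overall | 515.47 | 45 | 487.27 | 68 | -48.99 | -7.41 |
| | | 1 | 503.44 | 38 | 492.15 | 19 | NA | NA |
| | | 2 | 527.72 | 7 | 482.31 | 49 | NA | NA |
| 10 | Indonesia | Overall | 354.90 | 24 | 352.94 | 13 | -26.82 | 22.90 |
| | | 1 | 354.90 | 24 | 352.94 | 13 | NA | NA |
| 11 | Ireland | Overall | 495.10 | 87 | 470.82 | 57 | -39.59 | -8.97 |
| | | 1 | 502.90 | 16 | 488.31 | 13 | NA | NA |
| | | 2 | 505.99 | 20 | 478.81 | 9 | NA | NA |
| | | 3 | 498.48 | 18 | 486.10 | 10 | NA | NA |
| | | 4 | 482.28 | 20 | 455.35 | 9 | NA | NA |
| | | 5 | 485.97 | 13 | 446.06 | 16 | NA | NA |
| 12 | Israel | Overall | 461.91 | 24 | 451.68 | 45 | -57.86 | 37.40 |
| | | 1 | 464.35 | 19 | 451.89 | 16 | NA | NA |
| | | 2 | 459.40 | 5 | 451.47 | 29 | NA | NA |
| 13 | Italy | Overall | 457.87 | 76 | 467.53 | 139 | -9.54 | 28.86 |
| | | 1 | 457.87 | 76 | 467.53 | 139 | NA | NA |
| 14 | Japan | Overall | 512.86 | 13 | 508.68 | 24 | -73.58 | 65.22 |
| | | 1 | 512.86 | 13 | 508.68 | 24 | NA | NA |
| 15 | Korea | Overall | 541.94 | 21 | 554.04 | 10 | -22.93 | 47.15 |
| | | 1 | 541.94 | 21 | 554.04 | 10 | NA | NA |
| 16 | Mexico | Overall | 460.13 | 197 | 420.67 | 416 | -53.01 | -25.91 |
| | | 1 | 452.73 | 184 | 426.26 | 123 | NA | NA |
| | | 2 | 467.56 | 13 | 415.06 | 293 | NA | NA |
| 17 | Netherlands | Overall | 528.92 | 113 | 533.90 | 69 | -16.16 | 26.13 |
| | | 1 | 527.95 | 30 | 516.92 | 7 | NA | NA |
| | | 2 | 539.70 | 22 | 561.75 | 14 | NA | NA |
| | | 3 | 534.27 | 25 | 542.41 | 11 | NA | NA |
| | | 4 | 533.17 | 18 | 515.75 | 18 | NA | NA |
| | | 5 | 510.04 | 18 | 533.16 | 19 | NA | NA |
| 18 | Panama | Overall | 399.96 | 6 | 324.52 | 29 | -139.19 | -11.69 |
| | | 1 | 399.96 | 6 | 324.52 | 29 | NA | NA |
| 19 | Peru | Overall | 408.53 | 49 | 361.91 | 47 | -87.13 | -6.12 |
| | | 1 | 422.36 | 43 | 409.26 | 5 | NA | NA |
| | | 2 | 394.71 | 6 | 314.57 | 42 | NA | NA |
| 20 | Qatar | Overall | 406.97 | 27 | 330.57 | 60 | -107.29 | -45.52 |
| | | 1 | 430.41 | 12 | 330.07 | 17 | NA | NA |
| | | 2 | 402.22 | 8 | 332.17 | 21 | NA | NA |
| | | 3 | 388.29 | 7 | 329.46 | 22 | NA | NA |
| 21 | Spain | Overall | 499.50 | 182 | 495.51 | 166 | -21.18 | 13.19 |
| | | 1 | 504.31 | 161 | 506.72 | 13 | NA | NA |
| | | 2 | 494.70 | 21 | 484.29 | 153 | NA | NA |
| 22 | Sweden | Overall | 523.91 | 29 | 512.73 | 47 | -37.07 | 14.71 |
| | | 1 | 525.53 | 18 | 504.78 | 20 | NA | NA |
| | | 2 | 522.30 | 11 | 520.69 | 27 | NA | NA |
| 23 | Trinidad and Tobago | Overall | 385.44 | 32 | 411.85 | 120 | -7.78 | 60.61 |
| | | 1 | 385.44 | 32 | 411.85 | 120 | NA | NA |
| 24 | Uruguay | Overall | 490.85 | 39 | 398.18 | 193 | -109.63 | -75.72 |
| | | 1 | 490.85 | 39 | 398.18 | 193 | NA | NA |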

The estimates of the treatment effect are reported as confidence intervals based on Wald statistics. If stratification is impossible because the sample size in each stratum is too small, the estimation process is aborted. The second result is the covariate balance, which is a comparison of the covariate values before and after adjustment by the propensity score (see Table 4.3).

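Table 4.2 Result of stratification and treatment effect test based on CIT

| Level 2 | Country | Strata | Math score (private) | N private | Math score (public) | N public | CI.min | CI.max |
|---|---|---|---|---|---|---|---|---|
| 1 | Argentina | Overall | 408.75 | 58 | 357.71 | 141 | -78.47 | -23.62 |
| | | 1 | 397.28 | 13 | 354.74 | 126 | | |
| | | 2 | 435.31 | 45 | 364.58 | 15 | | |
| 2 | Australia | Overall | 521.76 | 66 | 493.14 | 90 | -43.18 | -14.07 |
| | | 1 | 539.67 | 46 | 504.25 | 8 | | |
| | | 2 | 513.25 | 15 | 492.33 | 68 | | |
| | | 3 | 508.09 | 5 | 465.09 | 14 | | |
| 3 | Austria | Overall | 487.01 | 25 | 480.01 | 102 | -38.94 | 24.93 |
| | | 1 | 476.99 | 14 | 485.29 | 83 | | |
| | | 2 | 519.41 | 11 | 462.95 | 19 | | |
| 4 | Belgium | Overall | 515.23 | 189 | 477.39 | 89 | -59.61 | -16.06 |
| | | 1 | 483.66 | 34 | 457.60 | 47 | | |
| | | 2 | 484.09 | 30 | 479.94 | 19 | | |
| | | 3 | 543.71 | 117 | 484.37 | 13 | | |
| | | 4 | 536.39 | 8 | 509.14 | 10 | | |
| 5 | Brazil | Overall | 386.34 | 14 | 362.52 | 808 | -55.97 | 8.35 |
| | | 1 | 384.43 | 7 | 362.30 | 791 | | |
| | | 2 | 449.71 | 7 | 369.85 | 17 | | |
| 6 | Canada | Overall | 556.86 | 24 | 508.79 | 23 | -75.46 | -20.68 |
| | | 1 | 578.12 | 14 | 522.71 | 8 | | |
| | | 2 | 538.15 | 10 | 496.55 | 15 | | |
| 7 | Chile | Overall | 427.88 | 101 | 361.36 | 6 | -118.55 | -14.49 |
| | | 1 | 427.88 | 101 | 361.36 | 6 | | |
| 8 | Denmark | Overall | 514.69 | 49 | 503.77 | 88 | -30.52 | 8.68 |
| | | 1 | 497.34 | 31 | 502.35 | 9 | | |
| | | 2 | 520.10 | 10 | 501.41 | 72 | | |
| | | 3 | 531.40 | 8 | 520.49 | 7 | | |
| 9 | Indonesia | Overall | 362.53 | 5 | 387.89 | 19 | -34.80 | 85.52 |
| | | 1 | 362.53 | 5 | 387.89 | 19 | | |
| 10 | Ireland | Overall | 495.11 | 84 | 472.02 | 23 | -43.66 | -2.50 |
| | | 1 | 495.11 | 84 | 472.02 | 23 | | |
| 11 | Israel | Overall | 463.88 | 31 | 445.28 | 53 | -54.34 | 17.15 |
| | | 1 | 455.73 | 5 | 410.00 | 6 | | |
| | | 2 | 465.58 | 7 | 427.86 | 38 | | |
| | | 3 | 464.35 | 19 | 487.16 | 9 | | |
| 12 | Italy | Overall | 466.42 | 75 | 451.56 | 81 | -51.17 | 21.45 |
| | | 1 | 483.54 | 7 | 404.86 | 9 | | |
| | | 2 | 459.09 | 60 | 454.60 | 5 | | |
| | | 3 | 469.13 | 8 | 458.89 | 67 | | |
| 13 | Japan | Overall | 522.03 | 46 | 563.65 | 6 | -34.64 | 117.89 |
| | | 1 | 522.03 | 46 | 563.65 | 6 | | |
| 14 | Korea | Overall | 540.98 | 7 | 543.40 | 90 | -38.14 | 42.99 |
| | | 1 | 540.98 | 7 | 543.40 | 90 | | |
| 15 | Mexico | Overall | 455.63 | 90 | 438.42 | 293 | -32.41 | -2.00 |
| | | 1 | 420.08 | 23 | 407.44 | 9 | | |
| | | 2 | 456.64 | 38 | 441.70 | 6 | | |
| | | 3 | 475.18 | 24 | 461.06 | 49 | | |
| | | 4 | 454.20 | 5 | 434.99 | 229 | | |
| 16 | Netherlands | Overall | 529.61 | 113 | 534.24 | 69 | -16.91 | 26.17 |
| | | 1 | 529.61 | 113 | 534.24 | 69 | | |
| 17 | Peru | Overall | 387.89 | 51 | 351.51 | 189 | -70.38 | -2.37 |
| | | 1 | 377.98 | 6 | 336.51 | 183 | | |
| | | 2 | 424.58 | 45 | 407.10 | 6 | | |
| 18 | Qatar | Overall | 387.70 | 17 | 329.21 | 85 | -95.40 | -21.58 |
| | | 1 | 387.70 | 17 | 329.21 | 85 | | |
| 19 | Spain | Overall | 498.78 | 357 | 486.58 | 507 | -35.38 | 10.97 |
| | | 1 | 488.49 | 7 | 478.54 | 439 | | |
| | | 2 | 521.40 | 5 | 468.34 | 23 | | |
| | | 3 | 523.78 | 6 | 494.27 | 32 | | |
| | | 4 | 507.32 | 339 | 497.39 | 13 | | |
| 20 | Sweden | Overall | 522.43 | 27 | 517.86 | 18 | -35.60 | 26.46 |
| | | 1 | 520.86 | 10 | 515.14 | 11 | | |
| | | 2 | 523.81 | 17 | 520.24 | 7 | | |
| 21 | Trinidad and Tobago | Overall | 448.21 | 6 | 409.94 | 108 | -85.62 | 9.08 |
| | | 1 | 448.21 | 6 | 409.94 | 108 | | |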

First, this section discusses the stratification results of both methods. The result of the multilevel logistic regression is shown in Table 4.1. The detailed results related to the multilevel logistic regression (e.g., individual regression equations, model fit indices, and parameter estimates) are included in the Appendix of this study. Since 24 countries successfully produced at least one stratum, the comparison of math scores between private and public schools was performed for these countries; valid strata could not be extracted for Kyrgyzstan. The stratification results showed that while most countries had 1 or 2 strata, three countries (Belgium, Ireland, and the Netherlands) produced the greatest number of strata (5). In total, the stratification yielded 48 strata across 24 countries.

The treatment effect (whether schools were private or not) in this study was observed across 16 countries (Qatar, Austria, Belgium, Brazil, Canada, Colombia, Peru, Mexico, Ireland, Chile, Argentina, Denmark, Panama, Uruguay). In these countries, private schools produced higher math scores than public schools. Countries from the Americas, such as Argentina, Brazil, Canada, Chile, Mexico, and Colombia, in particular showed relatively large differences in math scores in comparison to European countries such as Denmark and Austria.

The result of the conditional inference tree is shown in Table 4.2. Since 21 countries successfully produced at least one stratum, the comparison of math scores between private and public schools was performed for all but five countries (China, Kyrgyzstan, Panama, Colombia, and Uruguay), from which valid strata could not be extracted. The stratification results showed that while most countries had 1 or 2 strata, three countries (Belgium, Mexico, and Spain) produced the greatest number of strata (4). In total, the stratification yielded 44 strata across 21 countries.

The treatment effect (whether schools were private or not) was observed across nine countries in this study (Qatar, Peru, Mexico, Ireland, Chile, Canada, Belgium, Australia, and Argentina). In these countries, private schools produced higher math scores than public schools. In the previous analysis based on multilevel logistic regression, all these countries were reported as exhibiting a score difference between public and private schools. Also, the North and South American and Middle Eastern countries did not show the strong tendency reported in the previous analysis.

The second result is the covariate balance, which showed that the propensity score successfully adjusted for the selection bias. Table 4.3 shows the standardized mean difference of each covariate for both methods. Austin (2011) suggested that an SMD below 0.1 indicates adequate covariate balance.

Table 4.3 Comparison of covariate balance result between multilevel logistic regression and CIT

| Variable | Covariate | Multilevel logistic es.adj | Multilevel logistic es.unadj | CIT es.adj | CIT es.unadj |
|---|---|---|---|---|---|
| 1 | COMPWEB | 0.20 | 0.15 | 0.01 | 0.15 |
| 2 | IRATCOMP | 0.19 | 0.38 | 0.19 | 0.38 |
| 3 | SCHSIZE | 0.12 | 0.03 | 0.25 | 0.03 |
| 4 | EXCURACT | 0.01 | 0.10 | 0.03 | 0.10 |
| 5 | LDRSHP | 0.22 | 0.15 | 0.20 | 0.15 |
| 6 | RESPCURR | 0.10 | 0.70 | 0.17 | 0.70 |
| 7 | RESPRES | 0.56 | 1.43 | 0.45 | 1.43 |
| 8 | SCMATEDU | 0.21 | 0.56 | 0.27 | 0.56 |
| 9 | STUDBEHA | 0.34 | 0.71 | 0.37 | 0.71 |
| 10 | TCHPARTI | 0.07 | 0.27 | 0.06 | 0.27 |
| 11 | TCSHORT | 0.18 | 0.37 | 0.15 | 0.37 |
| 12 | TEACBEHA | 0.27 | 0.65 | 0.38 | 0.65 |
| 13 | Treatment | 4.20 | 2.33 | 4.07 | 2.33 |

For multilevel logistic regression, two covariates, the index of extra-curricular activities and the index of teacher participation, produced an SMD below 0.1, which indicates adequate balance. Before the adjustment, only one covariate, SCHSIZE, showed an SMD below 0.1. Thus, the stratification process based on propensity scores from multilevel logistic regression produced a sample that was more balanced, in the sense of resembling random assignment, than the original sample.

For CIT, three covariates, the index of computers connected to the internet, the index of extra-curricular activities, and the index of teacher participation, produced an SMD below 0.1, indicating adequate balance. As mentioned above, only one covariate, SCHSIZE, showed an SMD below 0.1 before adjustment. Thus, the stratification process based on the CIT produced a sample that was more balanced, in the sense of resembling random assignment, than both the original sample and the adjusted sample based on multilevel logistic regression.
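The counts reported in the two paragraphs above can be checked directly against Table 4.3. The short sketch below copies the SMD values from that table and lists the covariates that fall strictly below the 0.1 threshold; the dictionary layout and helper name are illustrative choices, and the Treatment row is left out because it is not a covariate.

```python
# SMD values copied from Table 4.3: unadjusted, adjusted by multilevel
# logistic regression, and adjusted by CIT.
balance = {
    "COMPWEB":  {"unadj": 0.15, "logistic": 0.20, "cit": 0.01},
    "IRATCOMP": {"unadj": 0.38, "logistic": 0.19, "cit": 0.19},
    "SCHSIZE":  {"unadj": 0.03, "logistic": 0.12, "cit": 0.25},
    "EXCURACT": {"unadj": 0.10, "logistic": 0.01, "cit": 0.03},
    "LDRSHP":   {"unadj": 0.15, "logistic": 0.22, "cit": 0.20},
    "RESPCURR": {"unadj": 0.70, "logistic": 0.10, "cit": 0.17},
    "RESPRES":  {"unadj": 1.43, "logistic": 0.56, "cit": 0.45},
    "SCMATEDU": {"unadj": 0.56, "logistic": 0.21, "cit": 0.27},
    "STUDBEHA": {"unadj": 0.71, "logistic": 0.34, "cit": 0.37},
    "TCHPARTI": {"unadj": 0.27, "logistic": 0.07, "cit": 0.06},
    "TCSHORT":  {"unadj": 0.37, "logistic": 0.18, "cit": 0.15},
    "TEACBEHA": {"unadj": 0.65, "logistic": 0.27, "cit": 0.38},
}

THRESHOLD = 0.1  # Austin (2011)


def below_threshold(column):
    """Names of covariates whose SMD in the given column is below 0.1."""
    return [name for name, row in balance.items() if row[column] < THRESHOLD]


for column in ("unadj", "logistic", "cit"):
    print(column, below_threshold(column))
```

Values reported as exactly 0.10 (EXCURACT before adjustment, RESPCURR after the logistic adjustment) are not counted, since the criterion is an SMD below 0.1.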

5. Conclusion

The purpose of this study is to compare the traditional propensity score method, multilevel logistic regression, with CIT, a data mining method. PISA 2009 data, which consist of 6,957 schools from 24 countries, were used to conduct this analysis. The treatment variable indicates whether the school is private or not. Within the multilevel modeling, the country variable was employed as a level-2 variable, and the other variables related to the schools were used as level-1 variables. For CIT, the same modeling was applied to perform the stratification. The major findings are as follows.

First, multilevel logistic regression produced more strata than did the CIT. This means that multilevel logistic regression more liberally assigns the same stratum profile to different observations. This can also be explained in relation to the propensity score estimation itself: multilevel logistic regression tends to estimate similar propensity scores more liberally than does the CIT. Thus, the CIT can be characterized as the more conservative method of strata extraction, which can cause sample loss.

Second, the treatment effect (i.e., the private school effect) is more frequently reported across countries through the multilevel logistic regression approach than through CIT. These results suggest that CIT estimates the treatment effect more conservatively. Specifically, according to the descriptive statistics, private school scores were clearly higher than public school scores across most countries. However, after the adjustment, the number of countries showing higher math scores for private schools decreased in both approaches, with the tendency being more pronounced in the CIT.

Third, the covariate balance result showed that the CIT came closer to producing a randomized treatment/control design than did the multilevel logistic regression. The number of covariates that showed an SMD of less than 0.1 after the adjustment through CIT was greater than the number of well-adjusted covariates obtained through multilevel logistic regression.

The previous literature suggested that data mining methods are alternatives to logistic regression when the data are inadequate for estimating the propensity score. However, there is little research on data mining methods for propensity score estimation, especially with multilevel data.

Although this study provides a comparison of both methods with empirical data, statistically rigorous investigation requires further replicated simulation studies. The major conditions regarding simulation studies are sample and cluster sizes. In particular, as mentioned above, the data mining method is relatively unrestricted by the assumptions of the linear model, such as minimum sample size. Thus, the data mining method might have advantages over the multilevel logistic regression model when using small sample sizes.


Appendix

Table A.1 Descriptive statistics of school-level math score by school type

            Private                 Public                  Total
Country         N  Mean (SD)            N  Mean (SD)            N  Mean (SD)
Argentina      58  426.30 (64.978)    141  355.29 (63.826)    199  375.99 (71.710)
Australia     136  532.43 (37.827)    217  489.48 (46.724)    353  506.03 (48.235)
Austria        39  489.22 (64.079)    234  474.28 (74.697)    273  476.41 (73.353)
Belgium       189  522.64 (83.916)     89  471.56 (76.962)    278  506.29 (85.036)
Brazil         98  457.62 (62.095)    812  361.98 (42.423)    910  372.28 (53.825)
Canada         76  568.45 (48.222)    896  508.58 (45.839)    972  513.26 (48.734)
Chile         103  425.40 (62.561)     80  384.19 (55.792)    183  407.38 (62.966)
China          61  505.00 (61.631)     97  567.00 (63.483)    158  543.06 (69.518)
Colombia       51  438.66 (52.882)    222  373.55 (38.173)    273  385.72 (48.418)
Denmark        50  506.72 (43.765)    231  483.94 (42.976)    281  488.00 (43.914)
Indonesia      98  356.83 (46.538)     85  380.05 (50.415)    183  367.62 (49.621)
Ireland        87  494.93 (35.046)     57  468.87 (49.810)    144  484.62 (43.287)
Israel         31  462.71 (76.217)    140  439.50 (70.209)    171  443.71 (71.666)
Italy          84  459.32 (60.037)    987  482.42 (67.962)   1071  480.61 (67.635)
Japan          51  524.09 (81.442)    135  527.32 (69.289)    186  526.43 (72.610)
Korea          58  547.44 (56.796)     99  541.19 (56.667)    157  543.50 (56.614)
Kyrgyzstan      6  434.30 (70.588)    167  329.37 (51.358)    173  333.01 (55.328)
Netherlands    77  447.45 (89.116)     51  311.11 (75.125)    138  404.18 (84.514)
Mexico        200  452.93 (50.105)   1332  411.88 (49.445)   1532  417.24 (51.411)
Panama         40  433.17 (65.903)    136  332.94 (40.200)    176  355.72 (63.157)
Peru           51  418.62 (68.133)    189  338.26 (53.611)    240  355.33 (65.705)
Qatar          59  425.93 (76.151)     88  330.51 (42.477)    147  368.81 (74.711)
Sweden         30  526.41 (61.950)    159  494.86 (43.799)    189  499.87 (48.351)
Trinidad       32  384.94 (89.016)    120  411.34 (78.974)    152  405.78 (81.600)
Uruguay        39  490.34 (48.373)    193  397.66 (51.933)    232  413.24 (61.910)


Table A.2 Descriptive statistics of school-level covariates

Country              X¹      X²      X³         X⁴      X⁵      X⁶      X⁷      X⁸      X⁹      X¹⁰     X¹¹     X¹²
Argentina            0.62    0.23    556.5      -0.71   0.54    -0.56   -0.59   -0.75   0.36    -0.01   -0.09   -0.27
                     (0.45)  (0.27)  (496.81)   (1.00)  (0.97)  (0.64)  (0.37)  (1.27)  (1.09)  (0.87)  (0.98)  (1.11)
Australia            0.99    0.98    923.61     0.59    0.37    0.15    -0.08   0.37    -0.06   0.54    0.27    -0.25
                     (0.06)  (0.45)  (425.26)   (0.91)  (1.04)  (0.90)  (0.90)  (1.07)  (0.97)  (1.00)  (1.00)  (0.89)
Austria              0.96    0.80    447.81     -0.12   -0.20   -0.12   -0.52   0.28    -0.18   0.13    -0.19   0.13
                     (0.16)  (0.51)  (414.58)   (0.94)  (0.81)  (0.91)  (0.58)  (0.96)  (0.96)  (0.99)  (0.82)  (0.85)
Belgium              0.93    0.64    674.03     -0.37   -0.38   -0.16   -0.35   0.16    0.2     0.13    0.49    0.09
                     (0.17)  (0.50)  (346.02)   (0.86)  (0.77)  (0.84)  (0.31)  (0.99)  (1.03)  (0.88)  (0.93)  (0.89)
Brazil               0.82    0.13    1006.76    -0.5    1.04    -0.53   -0.55   -0.85   -0.38   -0.45   0.14    -0.47
                     (0.34)  (0.19)  (669.03)   (1.06)  (1.13)  (0.74)  (0.76)  (1.07)  (0.97)  (0.8)   (1.00)  (0.97)
Canada               0.98    0.83    765.28     0.51    0.41    -0.72   -0.39   0.35    -0.31   -0.03   -0.14   -0.02
                     (0.12)  (0.50)  (517.96)   (0.79)  (0.96)  (0.58)  (0.57)  (0.95)  (0.86)  (1.00)  (0.86)  (0.85)
Chile                0.93    0.37    949.95     -0.30   0.66    -0.16   0.33    -0.55   -0.13   -0.19   0.40    -0.41
                     (0.20)  (0.35)  (759.97)   (1.05)  (1.08)  (0.93)  (1.22)  (1.21)  (1.14)  (1.03)  (1.09)  (1.05)
Colombia             0.77    0.40    1456.22    0.11    0.70    -0.28   -0.34   -0.94   -0.13   -0.69   0.08    -0.22
                     (0.32)  (0.32)  (1179.88)  (0.94)  (1.12)  (0.79)  (0.98)  (1.12)  (1.01)  (0.93)  (1.01)  (1.01)
Denmark              1.00    0.84    480.11     -1.11   -0.44   -0.06   0.14    0.14    0.11    0.04    -0.06   0.35
                     (0.04)  (0.52)  (215.43)   (0.95)  (0.66)  (0.91)  (0.85)  (0.78)  (0.84)  (0.99)  (0.64)  (0.83)
Indonesia            0.41    0.16    484.68     -0.18   0.35    0.23    0.22    -1.24   0.65    0.50    0.35    0.51
                     (0.42)  (0.22)  (343.26)   (1.00)  (0.94)  (0.95)  (1.03)  (1.07)  (0.79)  (1.27)  (0.94)  (0.88)
Ireland              0.98    0.57    569.96     -0.08   -0.21   -0.01   -0.43   -0.35   -0.28   0.18    -0.28   0.11
                     (0.11)  (0.36)  (236.02)   (0.85)  (0.88)  (0.74)  (0.24)  (1.08)  (0.85)  (0.75)  (0.79)  (0.89)
Israel               0.84    0.37    791.31     0.11    0.42    -0.01   -0.26   -0.05   0.07    0.15    0.19    -0.2
                     (0.33)  (0.29)  (479.24)   (0.91)  (0.93)  (0.96)  (0.74)  (1.09)  (0.88)  (0.95)  (0.99)  (0.88)
Italy                0.94    0.55    619.55     0.09    0.31    0.14    -0.63   -0.04   -0.03   0.23    0.15    -0.29
                     (0.18)  (0.41)  (384.74)   (0.89)  (0.87)  (0.89)  (0.54)  (0.93)  (0.87)  (0.7)   (0.85)  (0.86)
Japan                0.93    0.47    731.31     -0.01   -1.3    1.07    -0.23   0.5     0.59    -1.07   -0.53   -0.20
                     (0.2)   (0.42)  (374.07)   (0.68)  (0.85)  (0.66)  (0.96)  (0.99)  (0.92)  (1.32)  (0.67)  (0.87)
Korea                0.98    0.42    1148.69    1.00    -0.63   0.8     -0.45   0.01    0.38    0.09    0.04    -0.15
                     (0.08)  (0.43)  (442.29)   (0.86)  (1.16)  (0.77)  (0.73)  (0.81)  (0.91)  (1.15)  (0.94)  (0.78)
China                0.98    1.01    1318.91    0.54    0.2     0.92    1.53    0.2     -0.02   0.26    0.35    -0.46
                     (0.09)  (0.48)  (874.14)   (1.14)  (0.85)  (0.76)  (1.01)  (1.08)  (1.7)   (1.17)  (1.29)  (1.4)
Mexico               0.67    0.34    729.97     -0.11   0.35    -0.9    -0.32   -0.83   0.31    -0.98   0.45    -0.34
                     (0.41)  (0.37)  (928.12)   (0.99)  (1.00)  (0.56)  (0.85)  (1.15)  (0.92)  (0.77)  (1.00)  (1.00)
Netherlands          0.99    0.58    998.23     -0.29   -0.44   1.05    1.31    0.29    -0.13   -0.1    0.52    -0.68
                     (0.04)  (0.37)  (566.06)   (0.69)  (0.71)  (0.62)  (1.05)  (0.83)  (0.76)  (0.88)  (0.85)  (0.74)
Panama               0.58    0.29    1095.35    0.02    0.57    -0.5    -0.35   -0.79   0.19    -0.32   -0.13   -0.49
                     (0.47)  (0.36)  (849.89)   (1.03)  (1.21)  (0.86)  (0.84)  (1.3)   (0.97)  (1.08)  (0.89)  (1.11)
Peru                 0.60    0.23    635.12     0.03    0.44    -0.16   -0.03   -1.25   0.26    -0.03   0.41    -0.28
                     (0.45)  (0.30)  (725.11)   (0.95)  (1.14)  (1.03)  (1.22)  (1.3)   (1.03)  (1.19)  (0.97)  (1.00)
Qatar                0.72    0.61    840.97     0.86    1.24    -0.42   0.27    0.36    0.37    -1.14   -0.25   0.22
                     (0.39)  (0.46)  (1243.91)  (0.92)  (1.17)  (1.09)  (1.20)  (1.16)  (1.29)  (1.09)  (1.07)  (1.15)
Spain                0.98    0.62    659.73     -0.33   -0.11   -0.41   -0.40   0.02    0.22    -0.23   -0.72   0.08
                     (0.1)   (0.34)  (390.67)   (0.91)  (0.93)  (0.81)  (0.64)  (0.85)  (0.96)  (0.8)   (0.57)  (0.92)
Sweden               0.98    0.47    437.37     -0.35   -0.24   0.21    0.90    0.06    -0.07   0.17    -0.35   0.03
                     (0.08)  (0.37)  (268.71)   (0.73)  (0.84)  (0.95)  (1.14)  (0.8)   (0.77)  (0.7)   (0.68)  (0.86)
Trinidad and Tobago  0.64    0.37    588.95     0.13    0.52    -0.58   -0.35   -0.65   -0.52   0.26    0.45    -0.68
                     (0.40)  (0.27)  (282.46)   (1.01)  (1.09)  (0.68)  (0.78)  (0.93)  (0.94)  (1.03)  (1.04)  (1.03)
Uruguay              0.75    0.27    783.92     -0.58   0.30    -1.00   -0.55   0.05    0.46    -0.84   0.14    -0.48
                     (0.38)  (0.27)  (658.17)   (0.92)  (1.06)  (0.45)  (0.58)  (1.08)  (1.08)  (0.91)  (0.96)  (1.09)
Overall              0.85    0.51    771.10     -0.05   0.30    -0.30   -0.27   -0.25   0.06    -0.24   0.06    -0.18
                     (0.31)  (0.46)  (696.46)   (1.02)  (1.09)  (0.92)  (0.89)  (1.16)  (0.99)  (1.03)  (0.97)  (0.98)


References

Austin, P. C. (2011). A tutorial and case study in propensity score analysis: an application to estimating the effect of in-hospital smoking cessation counseling on mortality. Multivariate Behavioral Research, 46, 119-151.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.

Bryer, J. M. and Pruzek, R. M. (2011). An international comparison of private and public schools using multilevel propensity score methods and graphics. Multivariate Behavioral Research, 46, 1010-1011.

D'Agostino, R. B. (1998). Tutorial in biostatistics: Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine, 17, 2265-2281.

Gurel, S. and Leite, W. L. (2014). Evaluation propensity score strategies with multilevel data when treatment assignment mechanism varies between clusters, American Educational Research Association.

Hothorn, T., Hornik, K. and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15, 651-674.

Hothorn, T., Hornik, K. and Zeileis, A. (2015). ctree: Conditional inference trees. The Comprehensive R Archive Network , 1-34.

Hwang, J. S., Pi, S. M., Choi, W. C. and Kim, J. T. (2017). The effect for exercise intensity on hypertension using propensity score. Journal of the Korean Data & Information Science Society, 28, 109-117.

Lee, B. K., Lessler, J. and Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Statistics in Medicine, 29, 337-346.

Maas, C. J. and Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 1, 86-92.

McCaffrey, D. F., Ridgeway, G. and Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9, 403-425.

Morgan, S. L. and Winship, C. (2014). Counterfactuals and causal inference, Cambridge University Press, Cambridge.

Park, S. and Hwang, J. (2018). The effects of private education on academic achievement by school grade: Using propensity score matching. Journal of the Korean Data & Information Science Society, 29, 961-973.

R Core Team (2016). R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna.

Raudenbush, S. W. and Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods, Sage, Thousand Oaks, CA.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.

Schleicher, A., Zimmer, K., Evans, J. and Clements, N. (2009). PISA 2009 assessment framework: Key competencies in reading, mathematics and science, OECD Publishing, Paris.

Setoguchi, S., Schneeweiss, S., Brookhart, M. A., Glynn, R. J. and Cook, E. F. (2008). Evaluating uses of data mining techniques in propensity score estimation: A simulation study. Pharmacoepidemiology and Drug Safety, 17, 546-555.

Westreich, D., Lessler, J. and Funk, M. J. (2010). Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. Journal of Clinical Epidemiology, 63, 826-833.


1. Introduction

Research professor, Institute of Educational Research, Korea University, Seoul, Republic of Korea.

Corresponding author: Ph.D. candidate, Quantitative Methodology, University of Georgia, Athens, Georgia, United States. E-mail: [email protected]

between the treatment and control groups. The other significant factor is the multilevel property. Previous literature has pointed out that analyzing hierarchical data without a multilevel approach leads to biased standard error estimates (Raudenbush and Bryk, 2002).

However, using the multilevel logistic regression model to estimate the propensity score requires strict assumptions regarding sample size. The minimum sample size for the multilevel regression model is 30 groups, with 30 individuals in each group (Maas and Hox, 2005).

Although the sample size assumption plays a critical role in the estimation process, for data gathered in practical settings, the sufficient sample size assumption is difficult to satisfy.

2. Theoretical framework

2.1. Propensity scores

The propensity score (PS) is defined as the conditional probability of assignment to a particular treatment given observed covariates (Rosenbaum and Rubin, 1983):
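In the usual notation of Rosenbaum and Rubin (1983), and assuming that T denotes the binary treatment indicator and X the vector of observed covariates (the symbols used in Equation (2.2)), this definition can be sketched as follows. The symbol e(X) is an assumption introduced here for illustration.

```latex
e(X) = P(T = 1 \mid X)
```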

\[
(Y(0), Y(1)) \perp T \mid X. \tag{2.2}
\]

To evaluate the common support for each covariate, the standardized mean difference (SMD) is used; a value below 0.1 indicates adequate covariate balance (Austin, 2011).

\[
\mathrm{SMD} = \frac{\bar{X}_t - \bar{X}_c}{\sigma_t}, \tag{2.3}
\]

where $\bar{X}_t$ and $\bar{X}_c$ denote the covariate means of the treatment and control groups, and $\sigma_t$ is the standard deviation of the treatment group.
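As a minimal sketch of Equation (2.3), the following Python snippet computes the SMD for one covariate and checks it against the 0.1 threshold. The function name, the sample values, and the use of the sample (n − 1) standard deviation are assumptions made for illustration; the text does not specify them.

```python
from statistics import mean, stdev


def standardized_mean_difference(treated, control):
    """SMD as in Equation (2.3): mean difference scaled by the treated-group SD.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return (mean(treated) - mean(control)) / stdev(treated)


# Hypothetical covariate values for treated (private) and control (public) schools.
treated = [0.9, 1.1, 1.0, 1.2, 0.8]
control = [0.7, 0.9, 0.8, 1.0, 0.6]

smd = standardized_mean_difference(treated, control)
balanced = abs(smd) < 0.1  # threshold cited from Austin (2011)
print(f"SMD = {smd:.3f}, balanced = {balanced}")
```

Taking the absolute value before comparing with 0.1 treats imbalance in either direction the same way.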

2.2. Multilevel logistic model

The multilevel logistic regression model used to estimate the propensity score is as follows:

\[
\operatorname{logit}(Z_{ij} = 1 \mid X, W) = \beta_0 + \sum_{m=1}^{M} \beta_m X_{mij} + \sum_{n=1}^{N} \pi_n W_{nj} + s_{0j} + \sum_{m=1}^{M} s_{mj} X_{mij}, \tag{2.4}
\]

\[
s_{0j}, s_{mj} \sim N(0, \Sigma), \tag{2.5}
\]

where $Z_{ij}$ is the treatment indicator for person i in group j, $\beta_m$ is the fixed effect of the mth individual-level independent variable $X_{mij}$, $\pi_n$ is the fixed effect of the nth cluster-level independent variable $W_{nj}$, and $s_{0j}$ and $s_{mj}$ are the random intercepts and slopes, with mean zero and covariance matrix $\Sigma$.
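To make the structure of Equation (2.4) concrete, the sketch below evaluates the linear predictor for one unit and converts it to a propensity score with the inverse logit. It does not fit the model; the function name and all coefficient values are hypothetical and serve only to show how the fixed and random terms combine.

```python
import math


def propensity(x, w, beta0, beta, pi, s0_j, s_j):
    """Inverse logit of the linear predictor in Equation (2.4) for unit i in cluster j."""
    eta = beta0
    eta += sum(b * xm for b, xm in zip(beta, x))  # fixed individual-level effects
    eta += sum(p * wn for p, wn in zip(pi, w))    # fixed cluster-level effects
    eta += s0_j                                   # random intercept of cluster j
    eta += sum(s * xm for s, xm in zip(s_j, x))   # random slopes of cluster j
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical values: two individual-level covariates, one cluster-level covariate.
p = propensity(
    x=[1.0, -0.5], w=[0.2],
    beta0=-1.0, beta=[0.8, 0.4], pi=[0.5],
    s0_j=0.3, s_j=[0.1, 0.0],
)
print(f"estimated propensity score: {p:.4f}")
```

In practice the coefficients and the cluster-specific effects would come from a fitted multilevel logistic regression rather than being set by hand.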

2.3. Conditional inference trees model

Of the various data mining methods, the conditional inference tree (CIT) is a machine learning algorithm that requires fewer assumptions than logistic regression (Westreich et al., 2010).

For simplicity, the algorithm is described for the jth covariate only; the same process applies to the other covariates. The algorithm consists of the following three steps.

Step 1. Given the weights w, stop the algorithm if the null hypothesis $H_0^j: D(Y \mid X_j) = D(Y)$ cannot be rejected. Otherwise, select the covariate $X_{j^*}$ that has the strongest relationship with Y.

Step 2. Select a subset $A^*$ of the sample space $\chi_{j^*}$ in order to split $\chi_{j^*}$ into the disjoint subsets $A^*$ and $\chi_{j^*} \setminus A^*$. The corresponding weights $w_{left}$ and $w_{right}$ are applied to the two subsets. The ith element of the weight vector $w_{left}$ is defined as $w_{left,i} = w_i I(X_{j^* i} \in A^*)$ for all i = 1, . . . , n, where I(·) denotes the indicator function.

Step 3. Repeat steps 1 and 2 recursively, updating the weights $w_{left}$ and $w_{right}$ at each iteration.
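The weight update in step 2 can be sketched in a few lines of Python. The function name and the example covariate are hypothetical, and defining the right-hand weights through the complement of the chosen subset is an assumption that mirrors the definition given for the left-hand weights.

```python
def split_weights(weights, values, subset):
    """Return (w_left, w_right) for a binary split on one covariate.

    w_left[i]  = w[i] * I(values[i] in subset)
    w_right[i] = w[i] * I(values[i] not in subset)   (assumed complement)
    """
    w_left = [w * (v in subset) for w, v in zip(weights, values)]
    w_right = [w * (v not in subset) for w, v in zip(weights, values)]
    return w_left, w_right


# Hypothetical school-location covariate with unit case weights.
weights = [1, 1, 1, 1]
location = ["urban", "rural", "urban", "town"]

w_left, w_right = split_weights(weights, location, subset={"urban"})
print(w_left, w_right)
```

Every observation keeps its weight in exactly one of the two child nodes, so the two vectors sum to the parent weights.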

As noted above, CIT is a special case of recursive binary partitioning. Specifically, CIT has its own stopping rule and variable selection rule in steps 1 and 2. Hothorn et al. (2006) summarized the stopping and variable selection rules of CIT as follows. For the stopping rule in step 1, the test statistic under the jth null hypothesis, $T_j$, which depends on the sample $L_n$ of size n and the weights w, can be defined as

\[
T_j(L_n, w) = \operatorname{vec}\left( \sum_{i=1}^{n} w_i \, g_j(X_{ji}) \, h(Y_i, (Y_1, \ldots, Y_n))^{T} \right) \in \mathbb{R}^{p_j q}, \tag{2.7}
\]

where the vec(·) operator transforms a matrix into a column vector; specifically, it transforms a $p_j$ by q matrix into a vector of length $p_j q$. $g_j: \mathcal{X}_j \to \mathbb{R}^{p_j}$ denotes a non-random transformation of the jth covariate $X_j$, and $h: \mathcal{Y} \times \mathcal{Y}^n \to \mathbb{R}^{q}$ denotes the influence function. The maximum absolute value of the standardized statistic is therefore a natural choice of test statistic. With t denoting the univariate version of $T_j$, it can be written as Equation (2.8).

\[
C_{max}(t, \mu, \Sigma) = \max_{k = 1, \ldots, p_j q} \left| \frac{(t - \mu)_k}{\sqrt{\Sigma_{kk}}} \right|, \tag{2.8}
\]

where $\mu$ denotes the expectation of the conditional distribution of Y given X and $\Sigma$ denotes the variance-covariance matrix of that distribution. When the test fails to reject the null hypothesis, the weight updating process is stopped, in line with step 1.
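A short Python sketch of the maximum-type statistic in Equation (2.8) is given below. The function name and the numeric inputs are hypothetical; in an actual analysis the expectation and covariance would be derived from the data rather than supplied by hand.

```python
import math


def c_max(t, mu, sigma):
    """Maximum absolute standardized component of t, as in Equation (2.8).

    t and mu are sequences of equal length; sigma is the matching square
    covariance matrix given as a list of lists (only its diagonal is used).
    """
    return max(
        abs(t[k] - mu[k]) / math.sqrt(sigma[k][k])
        for k in range(len(t))
    )


# Hypothetical two-component statistic with its expectation and covariance.
t = [3.0, 1.0]
mu = [1.0, 2.0]
sigma = [[4.0, 0.5], [0.5, 0.25]]

print(c_max(t, mu, sigma))
```

Only the diagonal of the covariance matrix enters the statistic, because each component is standardized separately before the maximum is taken.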

For the variable selection rule, once the covariate has been selected, the two-sample linear statistic used to evaluate the goodness of a split can be formulated as Equation (2.9).

\[
T^{A}_{j^*}(L_n, w) = \operatorname{vec}\left( \sum_{i=1}^{n} w_i \, I(X_{j^* i} \in A) \, h(Y_i, (Y_1, \ldots, Y_n))^{T} \right) \in \mathbb{R}^{q}. \tag{2.9}
\]

The statistic reflects the difference between the two mutually exclusive subsamples. The subset $A^*$ that maximizes the split statistic is the optimal choice, and it can be expressed as Equation (2.10).

\[
A^* = \operatorname*{argmax}_{A} \; c(t^{A}_{j^*}, \mu^{A}_{j^*}, \Sigma^{A}_{j^*}), \tag{2.10}
\]

where $\mu^{A}_{j^*}$ denotes the expectation of the conditional distribution of Y given X for the subsample A, $\Sigma^{A}_{j^*}$ denotes the variance-covariance matrix of that distribution for the subsample, and $t^{A}_{j^*}$ denotes the univariate version of $T^{A}_{j^*}$.
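The split search in Equation (2.10) can be illustrated under strong simplifying assumptions: a single ordered covariate, candidate subsets of the form x ≤ cut, unit weights, a univariate response with the identity influence function, and permutation moments for the sum of the responses falling in the subset. The function name and the data are hypothetical, and this is a sketch of the idea rather than the implementation used in the study.

```python
import math


def best_split(x, y):
    """Return (cut, statistic) maximizing the standardized two-sample statistic.

    Assumes unit weights, h(Y) = Y, subsets A = {x <= cut}, a non-constant y,
    and at least two distinct x values.
    """
    n = len(y)
    y_bar = sum(y) / n
    var_y = sum((yi - y_bar) ** 2 for yi in y) / n  # population variance of y
    best = None
    for cut in sorted(set(x))[:-1]:            # the largest value would leave one side empty
        in_a = [xi <= cut for xi in x]
        n_a = sum(in_a)
        t = sum(yi for yi, a in zip(y, in_a) if a)  # sum of y inside the subset
        mu = n_a * y_bar                            # permutation expectation of t
        sigma = n_a * (n - n_a) / (n - 1) * var_y   # permutation variance of t
        c = abs(t - mu) / math.sqrt(sigma)
        if best is None or c > best[1]:
            best = (cut, c)
    return best


# Hypothetical data: the response changes level between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
best = best_split(x, y)
print(best)
```

With categorical covariates the candidate subsets would be subsets of levels rather than thresholds, which this sketch does not cover.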

3. Methods

3.1. Measure

Table 3.1 The list of covariates and comment with labels

Variable label Covariate label Comment


For notational convenience, the variable labels (e.g., X¹, X², . . . , X¹²) are used in the equations of this study instead of the covariate names. The variables considered potentially related to the school-level math score are used as the level-1 covariates.

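\[
\cdots \; \beta_{mj} (X_{ij}^{m}) + r_{ij}, \tag{3.1}
\]

\begin{align}
\cdots \; & \gamma_{m0} X_{ij}^{m}) \tag{3.4} \\
& + (\gamma_{01} W_j + \cdots \; \gamma_{m1} W_j X_{ij}^{m}) \tag{3.5} \\
& + (u_{0j} + \cdots \; u_{mj} W_j X_{ij}^{m} + \gamma_{ij}), \tag{3.6}
\end{align}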