`vignettes/intro_medshift.Rmd`

`intro_medshift.Rmd`

We are interested in assessing the direct effect (DE) and the indirect effect (IE), based on the decomposition of the population mediated intervention mean given in Díaz and Hejazi (2019).

To proceed, we’ll use as our running example a simple data set from an observational study of the relationship between BMI and kids behavior, distributed as part of the `mma`

R package on CRAN. First, let’s load the packages we’ll be using and set a seed; then, load this data set and take a quick look at it

```
# preliminaries
library(data.table)
library(dplyr)
library(sl3)
library(medshift)
library(mma)
set.seed(429153)
# load and examine data
data(weight_behavior)
dim(weight_behavior)
```

`## [1] 691 15`

`head(weight_behavior)`

```
## bmi age sex race numpeople car gotosch snack tvhours cmpthours
## 1 18.20665 12.2 F OTHER 5 3 2 1 4 0
## 2 22.78401 12.8 M OTHER 4 3 2 1 4 2
## 3 19.60725 12.6 F OTHER 4 2 4 2 NA NA
## 4 25.56754 12.1 M OTHER 2 3 2 1 0 2
## 5 15.07408 12.3 M OTHER 4 1 2 1 2 1
## 6 22.98338 11.8 M OTHER 4 1 1 1 4 3
## cellhours sports exercises sweat overweigh
## 1 0 2 2 1 0
## 2 0 1 8 2 0
## 3 NA <NA> 4 2 0
## 4 0 2 9 1 1
## 5 3 1 12 1 0
## 6 2 1 1 1 0
```

The documentation for the data set describes it as a “database obtained from the Louisiana State University Health Sciences Center, New Orleans, by Dr. Richard Scribner. He explored the relationship between BMI and kids behavior through a survey at children, teachers and parents in Grenada in 2014. This data set includes 691 observations and 15 variables.”

Unfortunately, the data set contains a few observations with missing values. As these are unrelated to the object of our analysis, we’ll simply remove these for the time being. Note that in a real data analysis, we might consider strategies to fully make of the observed data, perhaps by imputing missing values. For now, we simply remove the incomplete observations, resulting in a data set with fewer observations but much the same structure as the original:

`## [1] 567 15`

```
## bmi age sex race numpeople car gotosch snack tvhours cmpthours
## 1 18.20665 12.2 F OTHER 5 3 2 1 4 0
## 2 22.78401 12.8 M OTHER 4 3 2 1 4 2
## 4 25.56754 12.1 M OTHER 2 3 2 1 0 2
## 5 15.07408 12.3 M OTHER 4 1 2 1 2 1
## 6 22.98338 11.8 M OTHER 4 1 1 1 4 3
## 8 19.15658 12.1 F OTHER 3 3 2 1 0 0
## cellhours sports exercises sweat overweigh
## 1 0 1 2 1 0
## 2 0 0 8 2 0
## 4 0 1 9 1 1
## 5 3 0 12 1 0
## 6 2 0 1 1 0
## 8 1 0 1 3 0
```

For the analysis of this observational data set, we focus on the effect of participating in a sports team (`sports`

) on the BMI of children (`bmi`

), taking several related covariates as mediators (`snack`

, `exercises`

, `overweigh`

) and all other collected covariates as potential confounders. Considering an NPSEM, we separate the observed variables from the data set into their corresponding nodes as follows

```
Y <- weight_behavior_complete$bmi
A <- weight_behavior_complete$sports
Z <- weight_behavior_complete %>%
select(snack, exercises, overweigh)
W <- weight_behavior_complete %>%
select(age, sex, race, numpeople, car, gotosch, tvhours, cmpthours,
cellhours, sweat)
```

Finally, in our analysis, we consider an incremental propensity score intervention (IPSI), as first proposed by Kennedy (2018), wherein the *odds of participating in a sports team* is modulated by some fixed amount (\(0 \leq \delta \leq \infty\)) for each individual. Such an intervention may be interpreted as the effect of a school program that motivates children to participate in sports teams. To exemplify our approach, we postulate a motivational intervention that *triples the odds* of participating in a sports team for each individual:

To easily incorporate ensemble machine learning into the estimation procedure, we rely on the facilities provided in the `sl3`

R package (Coyle et al. 2018). For a complete guide on using the `sl3`

R package, consider consulting https://tlverse.org/sl3, or https://tlverse.org (and https://github.com/tlverse) for the `tlverse`

ecosystem, of which `sl3`

is a major part. We construct an ensemble learner using a handful of popular machine learning algorithms below

```
# SL learners used for continuous data (the nuisance parameter M)
xgb_contin_lrnr <- Lrnr_xgboost$new(nrounds = 50, objective = "reg:linear")
hal_contin_lrnr <- Lrnr_hal9001$new(n_folds = 5, degrees = 3)
enet_contin_lrnr <- Lrnr_glmnet$new(alpha = 0.5, family = "gaussian",
nfolds = 3)
lasso_contin_lrnr <- Lrnr_glmnet$new(alpha = 1, family = "gaussian",
nfolds = 3)
fglm_contin_lrnr <- Lrnr_glm_fast$new(family = gaussian())
contin_lrnr_lib <- Stack$new(enet_contin_lrnr, lasso_contin_lrnr,
fglm_contin_lrnr, xgb_contin_lrnr,
hal_contin_lrnr)
sl_contin_lrnr <- Lrnr_sl$new(learners = contin_lrnr_lib,
metalearner = Lrnr_nnls$new())
# SL learners used for binary data (nuisance parameters G and E in this case)
xgb_binary_lrnr <- Lrnr_xgboost$new(nrounds = 50, objective = "reg:logistic")
hal_binary_lrnr <- Lrnr_hal9001$new(n_folds = 5, degrees = 3,
fit_type = "glmnet", family = "binomial")
enet_binary_lrnr <- Lrnr_glmnet$new(alpha = 0.5, family = "binomial",
nfolds = 3)
lasso_binary_lrnr <- Lrnr_glmnet$new(alpha = 1, family = "binomial",
nfolds = 3)
fglm_binary_lrnr <- Lrnr_glm_fast$new(family = binomial())
binary_lrnr_lib <- Stack$new(enet_binary_lrnr, lasso_binary_lrnr,
fglm_binary_lrnr, xgb_binary_lrnr,
hal_binary_lrnr)
sl_binary_lrnr <- Lrnr_sl$new(learners = binary_lrnr_lib,
metalearner = Lrnr_nnls$new())
```

We may decompose the population intervention effect (PIE) in terms of a and an : \[\begin{equation*} \psi(\delta) = \overbrace{\mathbb{E}\{Y(g, q) - Y(g_\delta, q)\}}^{\text{PIDE}} + \overbrace{\mathbb{E}\{Y(g_\delta, q) - Y(g_\delta, q_\delta)\}}^{\text{PIIE}} \end{equation*}\] This decomposition of the PIE as the sum of direct and indirect effects has an interpretation analogous to the corresponding standard decomposition of the average treatment effect. In the sequel, we will compute each of the components of the direct and indirect effects above using appropriate estimators as follows

- For \(\mathbb{E}\{Y(g, q)\}\), the sample mean \(\frac{1}{n}\sum_{i=1}^n Y_i\) is sufficient;
- for \(\mathbb{E}\{Y(g_{\delta}, q)\}\), a one-step efficient estimator for the pure effect (of altering the exposure mechanism but not the mediation mechanism), as proposed in Díaz and Hejazi (2019); and,
- for \(\mathbb{E}\{Y(g_{\delta}, q_{\delta})\}\), a one-step efficient estimator for the joint effect (of altering both the exposure and mediation mechanisms), as proposed in Kennedy (2018) and implemented in the
`npcausal`

R package.

As given in Díaz and Hejazi (2019), the statistical functional identifying the pure effect \(\mathbb{E}\{Y(g_{\delta}, q)\}\), which corresponds to altering the exposure mechanism while keeping the mediation mechanism (and its reliance on the exposure) fixed, is \[\begin{equation*}
\theta_0(\delta) = \int m_0(a, z, w) g_{0,\delta}(a \mid w) p_0(z, w)
d\nu(a, z, w),
\end{equation*}\] for which a one-step nonparametric-efficient estimator is available. The corresponding *efficient influence function* (EIF) with respect to the nonparametric model \(\mathcal{M}\) is \(D_{\eta,\delta}(o) = D^Y_{\eta,\delta}(o) + D^A_{\eta,\delta}(o) + D^{Z,W}_{\eta,\delta}(o) - \theta(\delta)\). The one-step estimator may be computed using the EIF estimating equation, making use of cross-fitting to circumvent any need for entropy conditions. The resultant estimator is \[\begin{equation*}
\hat{\theta}(\delta) = \frac{1}{n} \sum_{i = 1}^n D_{\hat{\eta}_{j(i)},
\delta}(O_i) = \frac{1}{n} \sum_{i = 1}^n \left\{ D^Y_{\hat{\eta}_{j(i)},
\delta}(O_i) + D^A_{\hat{\eta}_{j(i)}, \delta}(O_i) +
D^{Z,W}_{\hat{\eta}_{j(i)}, \delta}(O_i) \right\},
\end{equation*}\] which is implemented in the `medshift`

R package (Hejazi and Díaz 2019). We make use of that implementation to estimate \(\mathbb{E}\{Y(g_{\delta}, q)\}\) via its one-step estimator \(\hat{\theta}(\delta)\) below

```
# let's compute the parameter where A (but not Z) are shifted
theta_eff <- medshift(W = W, A = A, Z = Z, Y = Y,
delta = delta_shift_ipsi,
g = sl_binary_lrnr,
e = sl_binary_lrnr,
m = sl_contin_lrnr,
phi = fglm_contin_lrnr,
estimator = "onestep",
estimator_args = list(cv_folds = 2))
summary(theta_eff)
```

```
## lwr_ci param_est upr_ci param_var eif_mean
## 18.751039 19.105534 19.460029 0.032713 9.771786e-16
## estimator
## one-step
```

Recall that, based on the decomposition outlined previously, the direct effect_ (DE) may be denoted \(\beta_{\text{DE}}(\delta) = \mathbb{E}Y - \theta_0(\delta)\). Thus, an estimator of the DE, \(\hat{\beta}_{\text{DE}}(\delta)\) may be expressed as a composition of estimators of its constituent parameters: \[\begin{equation*} \hat{\beta}_{\text{DE}}({\delta}) = \frac{1}{n} \sum_{i = 1}^n Y_i - \hat{\theta}(\delta). \end{equation*}\]

Based on the above, we may construct an estimator of the DE using the quantities already computed. The convenience function below applies the simple delta method required in the case of a linear contrast between the two constituent parameters:

```
# convenience function to compute inference via delta method: EY1 - EY0
linear_contrast <- function(params, eifs, ci_level = 0.95) {
# bounds for confidence interval
ci_norm_bounds <- c(-1, 1) * abs(stats::qnorm(p = (1 - ci_level) / 2))
param_est <- params[[1]] - params[[2]]
eif <- eifs[[1]] - eifs[[2]]
se_eif <- sqrt(var(eif) / length(eif))
param_ci <- param_est + ci_norm_bounds * se_eif
# parameter and inference
out <- c(param_ci[1], param_est, param_ci[2])
names(out) <- c("lwr_ci", "param_est", "upr_ci")
return(out)
}
```

With the above convenience function in hand, we’ll construct or extract the necessary components from existing objects and simply apply the function:

```
# parameter estimates and EIFs for components of direct effect
EY <- mean(Y)
eif_EY <- Y - EY
params_de <- list(EY, theta_eff$theta)
eifs_de <- list(eif_EY, theta_eff$eif)
# direct effect = EY - estimated quantity
de_est <- linear_contrast(params_de, eifs_de)
de_est
```

```
## lwr_ci param_est upr_ci
## -0.4672572 0.0215740 0.5104052
```

As given above, we have for our estimate of the direct effect \(\hat{\beta}_{\text{DE}}({\delta}) =\) 0.022.

Coyle, Jeremy R, Nima S Hejazi, Ivana Malenica, and Oleg Sofrygin. 2018. *sl3: Pipelines for Machine Learning and Super Learning in R*. https://github.com/tlverse/sl3.

Díaz, Iván, and Nima S Hejazi. 2019. “Causal Mediation Analysis for Stochastic Interventions.” *Submitted*. https://arxiv.org/abs/1901.02776.

Hejazi, Nima S, and Iván Díaz. 2019. *medshift: Causal Mediation Analysis for Stochastic Interventions in R*. https://github.com/nhejazi/medshift.

Kennedy, Edward H. 2018. “Nonparametric Causal Effects Based on Incremental Propensity Score Interventions.” *Journal of the American Statistical Association*. Taylor & Francis.