# Targeted Learning with Moderated Statistics for Biomarker Discovery

Authors: Nima Hejazi and Alan Hubbard

## What’s biotmle?

The biotmle R package facilitates biomarker discovery through a generalization of the moderated t-statistic (Smyth 2004) that extends the procedure to locally efficient estimators of asymptotically linear target parameters (Tsiatis 2007). The implemented methods modify targeted maximum likelihood (TML) estimators of statistical (or causal) target parameters (e.g., the average treatment effect) by applying variance moderation to the efficient influence function (EIF) representation of the target parameter (van der Laan and Rose 2011, 2018). The EIF-based representation of the data is then subjected to a moderated hypothesis test of the statistical estimate of the target parameter. This stabilizes the standard error estimates (derived directly from the relevant EIF) and allows such estimators to be used with the small sample sizes common in computational biology and bioinformatics applications. The resulting procedure, supervised variance moderation, yields a conservative hypothesis test of a statistically estimable target parameter, with standard error estimates stabilized in a manner that reduces the false discovery rate or the family-wise error rate (Hejazi et al., n.d.). Utilities are also provided for clustering through supervised distance matrices, which use the EIF-based estimates to draw out the contributions of individual biomarkers to the target parameter of interest (Pollard and van der Laan 2008).
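
The variance moderation described above can be written out for a concrete case. The following is a minimal sketch, assuming the target parameter is the average treatment effect $\psi_j$ of a binary exposure $A$ on biomarker $Y_j$ ($j = 1, \ldots, J$) given covariates $W$, with $n$ observations $O_i = (W_i, A_i, Y_{j,i})$. The notation ($\bar{Q}_j$ for the outcome regression, $g$ for the exposure mechanism, $d_0$ and $s_0^2$ for the prior degrees of freedom and prior variance) is chosen here for illustration and is not taken from the package. The residual degrees of freedom $d_j = n - 1$ assume an intercept-only fit, and the package's implementation may differ in its details.

```latex
% Estimated EIF of the ATE for biomarker j, evaluated at observation O_i
\hat{D}_j(O_i) = \frac{2A_i - 1}{\hat{g}(A_i \mid W_i)}
    \bigl( Y_{j,i} - \hat{\bar{Q}}_j(A_i, W_i) \bigr)
  + \hat{\bar{Q}}_j(1, W_i) - \hat{\bar{Q}}_j(0, W_i) - \hat{\psi}_j

% Biomarker-specific sample variance of the EIF estimates
s_j^2 = \frac{1}{n - 1} \sum_{i=1}^{n}
    \bigl( \hat{D}_j(O_i) - \bar{D}_j \bigr)^2,
\qquad \bar{D}_j = \frac{1}{n} \sum_{i=1}^{n} \hat{D}_j(O_i)

% Moderated variance (Smyth 2004): shrink s_j^2 toward the prior value s_0^2
\tilde{s}_j^2 = \frac{d_0 s_0^2 + d_j s_j^2}{d_0 + d_j},
\qquad d_j = n - 1

% Moderated t-statistic for H_0: \psi_j = 0, compared to a t distribution
% with d_0 + d_j degrees of freedom
\tilde{t}_j = \frac{\hat{\psi}_j}{\sqrt{\tilde{s}_j^2 / n}}
```

In this sketch, $d_0$ and $s_0^2$ are estimated by empirical Bayes from the variances of all $J$ biomarkers, following Smyth (2004). A biomarker with an unusually small $s_j^2$ therefore has its variance pulled up toward $s_0^2$, which guards against spuriously large test statistics in small samples.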

## Installation

For standard use, install from Bioconductor using BiocManager:

    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install("biotmle")

To contribute, install the bleeding-edge development version from GitHub via devtools:

    devtools::install_github("nhejazi/biotmle")

Current and prior Bioconductor releases are available under branches with numbers prefixed by “RELEASE_”. For example, to install the version of this package available via Bioconductor 3.6, use

    devtools::install_github("nhejazi/biotmle", ref = "RELEASE_3_6")

## Example

For details on how to best use the biotmle R package, please consult the most recent package vignette available through the Bioconductor project.

## Issues

If you encounter any bugs or have any specific feature requests, please file an issue.

## Contributions

Contributions are very welcome. Interested contributors should consult our contribution guidelines prior to submitting a pull request.

## Citation

After using the biotmle R package, please cite both of the following:

    @article{hejazi2017biotmle,
      author = {Hejazi, Nima S and Cai, Weixin and Hubbard, Alan E},
      title = {biotmle: Targeted Learning for Biomarker Discovery},
      journal = {The Journal of Open Source Software},
      volume = {2},
      number = {15},
      month = {July},
      year = {2017},
      publisher = {The Open Journal},
      doi = {10.21105/joss.00295},
      url = {https://doi.org/10.21105/joss.00295}
    }

    @article{hejazi2018+supervised,
      url = {https://arxiv.org/abs/1710.05451},
      year = {2018+},
      author = {Hejazi, Nima S and {Kherad-Pajouh}, Sara and {van der Laan}, Mark J and Hubbard, Alan E},
      title = {Supervised variance moderation of locally efficient estimators in high-dimensional biology}
    }

## Funding

The development of this software was supported in part through grants from the National Institutes of Health: P42 ES004705-29 and R01 ES021369-05.

## License

© 2016-2018 Nima S. Hejazi

The contents of this repository are distributed under the MIT license. See file LICENSE for details.

## References

Hejazi, Nima S, Sara Kherad-Pajouh, Mark J van der Laan, and Alan E Hubbard. n.d. “Supervised Variance Moderation of Locally Efficient Estimators in High-Dimensional Biology.” https://arxiv.org/abs/1710.05451.

Pollard, Katherine S, and Mark J van der Laan. 2008. “Supervised Distance Matrices.” Statistical Applications in Genetics and Molecular Biology 7 (1). De Gruyter.

Smyth, Gordon K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Statistical Applications in Genetics and Molecular Biology 3 (1). Walter de Gruyter: 1–25. https://doi.org/10.2202/1544-6115.1027.

Tsiatis, Anastasios. 2007. Semiparametric Theory and Missing Data. Springer Science & Business Media.

van der Laan, Mark J, and Sherri Rose. 2011. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media.

van der Laan, Mark J, and Sherri Rose. 2018. Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Springer Science & Business Media.