April 12, 2016

## Background

"Essentially, all models are wrong, but some are useful"
- George Box, 1979

Mantra of statisticians regarding the development of statistical models for many years

In the 1990s an awareness developed among statisticians (Breiman, Harrell) that this approach was wrong

• Parametric model assumptions rarely met
• Large number of variables makes it difficult to correctly specify a model

Simultaneously, computer scientists and some statisticians developed the machine learning field to address the limitations of parametric models

## Targeted learning

Combines advanced machine learning with efficient semiparametric estimation to provide a framework for answering causal questions from data

• Developed by Mark van der Laan and his research group at UC Berkeley
• Started with the seminal 2006 article on targeted maximum likelihood estimation

Central motivation is the belief that statisticians treat estimation as Art not Science

• This results in misspecified models that are data-adaptively selected, but this part of the estimation procedure is not accounted for in the variance

## Estimation is a Science, Not an Art

### Specific definitions required

1. Data: realizations of random variables with a probability distribution
2. Model: actual knowledge about the data generating probability distribution
3. Target Parameter: a feature of the data generating probability distribution
4. Estimator: an a priori-specified algorithm, benchmarked by a dissimilarity-measure (e.g., MSE) w.r.t. target parameter

## Data

Random variable $$O$$, observed $$n$$ times, defined in a simple case as $O=\left(A,W,Y\right)\sim {P}_{0}$ if we are without common issues such as missingness and censoring

• $$A$$: exposure or treatment
• $$W$$: vector of covariates
• $$Y$$: outcome
• ${P}_{0}$: the true probability distribution

This data structure makes for an effective example, but data structures found in practice are much more complicated

## Model

General case: Observe $$n$$ i.i.d. copies of random variable $$O$$ with probability distribution ${P}_{0}$

The data-generating distribution ${P}_{0}$ is also known to be an element of a statistical model $M:{P}_{0}\in M$

A statistical model $$M$$ is the set of possible probability distributions for ${P}_{0}$; it is a collection of probability distributions

If all we know is that we have $$n$$ i.i.d. copies of $$O$$, this can be our statistical model, which we call a non-parametric statistical model

## Model

A statistical model can be augmented with additional non-testable assumptions, allowing one to enrich the interpretation of $\Psi \left({P}_{0}\right)$; This does not change the statistical model

We refer to the statistical model augmented with a possibly additional assumptions as a causal model

In the Neyman-Rubin causal inference framework, assumptions include

• $\left(A\perp {Y}_{a}|W\right)$; randomization
• Stable unit treatment value assumption (SUTVA); no interference between subjects and consistency assumption
• Positivity; each possible exposure level of $$A$$ occurs with some positive probability within each stratum of $$W$$

## A (very) brief review of the Neyman-Rubin causal inference framework

Potential outcomes: every individual $$i$$ has a different potential outcome depending on their treatment "assignment"

• ${Y}_{i}\left(A=1\right)$ and ${Y}_{i}\left(A=0\right)$

• The "fundamental problem with causal inference" is that we can only observe one of these potential outcomes

• If we randomly assign $$i$$ to receive $$A$$, then the groups will be equivalent and causal inference can be inferred:

$E\left({Y}_{i1}|{A}_{i}=1\right)-E\left({Y}_{i0}|{A}_{i}=0\right)$

• This framework has been extended to observational data through propensity score matching

## Target Parameters

Define the parameter of the probability distribution $$P$$ as function of $P:\Psi \left(P\right)$

In a causal inference framework, a target parameter for the effect of $$A$$ could be
$\Psi {\left({P}_{0}\right)}_{RD}={E}_{W,0}\left[{E}_{0}\left(Y|A=1,W\right)-{E}_{0}\left(Y|A=0,W\right)\right]$

Or, if we wish to use a ratio instead of a difference: $\Psi {\left({P}_{0}\right)}_{OR}={E}_{W,0}\left[O\left[Y|A=1,W\right]/O\left[Y|A=0,W\right]\right]$ Where $O\left[.\right]=E\left[.\right]/1-E\left[.\right]$

## Estimators

The target parameter $\Psi \left({P}_{0}\right)$ depends on ${P}_{0}$ through the conditional mean ${\overline{Q}}_{0}\left(A,W\right)={E}_{0}\left(Y|A,W\right)$ and the marginal distribution ${Q}_{W,0}$ of $$W$$; or

$\overline{Q}\left(A,W\right)=E\left(Y|A,W\right)/\overline{Q}\left(W\right)=E\left(Y|W\right)$

Where $\overline{Q}$ is an estimator of ${\overline{Q}}_{0}\left(A,W\right)$, shortened to ${\overline{Q}}_{0}$

An estimator is an algorithm that can be applied to any empirical distribution to provide a mapping from the empirical distribution to the parameter space

• But which algorithm?

## Effect Estimation vs. Prediction

Both effect and prediction research questions are inherently estimation questions, but they are distinct in their goals

• Prediction: Interested in generating a function to input covariates and predict a value for the outcome: ${E}_{0}\left(Y|W\right)$
• Effect: Interested in estimating the true effect of exposure on outcome adjusted for covariates, $\Psi \left({P}_{0}\right)$, the targeted estimand
• Targeted maximum likelihood estimation (TMLE), is an iterative procedure that updates an initial (super learner) estimate of the relevant part ${\overline{Q}}_{0}$ of the data generating distribution ${P}_{0}$
• See second presentation given on April 14 to the Ann Arbor R User Group

## Some effect estimators

Maximum-likelihood-based substitution estimators will be of the type $\Psi \left({Q}_{n}\right)=\frac{1}{n}\sum _{i=1}^{n}\left\{{\overline{Q}}_{n}\left(A=1,{W}_{i}\right)-{\overline{Q}}_{n}\left(A=0,{W}_{i}\right)\right\}$ where this estimate is obtained by plugging in ${Q}_{n}=\left({\overline{Q}}_{n},{Q}_{W,n}\right)$ into the mapping $$\Psi$$

Estimating-equation-based function is a function of the data $$O$$ and the parameter of interest. If $$D(\psi)(O)$$ is an estimating function, then $\Psi \left({Q}_{n}\right)$ is a solution that satisfies: $0=\sum _{i=1}^{n}D\left(\psi \right)\left({O}_{i}\right)$

## Targeted Maximum Likelihood Estimation

It is an iterative procedure that:

1. Generates an initial (super learner) estimate of the relevant part ${\overline{Q}}_{0}$ of the data generating distribution ${P}_{0}$, noted as ${\overline{Q}}_{n}^{0}$

2. Updates an initial estimate, possibly using an estimate of a nuisance parameter, ${g}_{0}$

Produces a well-defined, unbiased, efficient substitution estimator of target a parameter $$\Psi$$
- Is semi-parametric, no need to make assumptions about ${P}_{0}$
- Uses machine learning techniques to get initial estimates

## TMLE steps

Step 1: Use the super learner procedure to generate an initial estimate ${\overline{Q}}_{n}^{0}$

Step 2: Estimate ${g}_{0}$, the conditional distribution of $$A$$ given $$W$$ (a propensity score, called a nuisance parameter if $$A$$ is randomized), denoted ${g}_{n}$

Step 3: Construct a "clever covariate" that will be used to fluctuate the initial estimate
${H}_{n}^{\ast }\left(A,W\right)\equiv \left(\frac{I\left(A=1\right)}{{g}_{n}\left(1|W\right)}\right)-\left(\frac{I\left(A=0\right)}{{g}_{n}\left(0|W\right)}\right)$

## TMLE steps

Step 4: Use maximum likelihood to obtain ${\epsilon }_{n}$, the estimated coefficient of ${H}_{n}^{\ast }\left(A,W\right)$ in:

Step 5: plug-in the substitution estimator using updated estimates ${\overline{Q}}_{n}^{1}\left(A=1,{W}_{i}\right)$ and ${\overline{Q}}_{n}^{0}\left(A=1,{W}_{i}\right)$ and the empirical distribution of $$W$$ into formula:

$\begin{array}{l}{\psi }_{TMLE,n}=\\ \Psi \left({Q}_{n}\right)=\frac{1}{n}\sum _{i=1}^{n}\left\{{\overline{Q}}_{n}^{1}\left(A=1,{W}_{i}\right)-{\overline{Q}}_{n}^{1}\left(A=0,{W}_{i}\right)\right\}\end{array}$

Step 6: Inference using an infuence curve (IC)

## The Infuence Curve (IC)

IC is a function that describes estimator behavior under slight perturbations of the empirical distribution.

IC has mean 0 at the true parameter value, so it can be used as an estimating equation: $\begin{array}{c}I{C}_{n}\left({O}_{i}\right)={H}_{n}^{\ast }\left(A,W\right)\left(Y-{\overline{Q}}_{n}^{1}\left({A}_{i},{W}_{i}\right)\right)\\ +{\overline{Q}}_{n}^{1}\left(A=1,{W}_{i}\right)-{\overline{Q}}_{n}^{1}\left(A=0,{W}_{i}\right)-{\psi }_{TMLE,n}\end{array}$

The empirical mean of IC for regular asymptotically linear (RAL) estimator provides a linear approximation of estimator. Thus, VAR(IC) provides asymptotic variance of estimator

## The Infuence Curve (IC)

We then calculate the sample variance of the estimated influence curve values: ${S}^{2}\left(I{C}_{n}\right)=\frac{1}{n}{\sum _{i=1}^{n}\left(I{C}_{n}\left({o}_{i}\right)-\overline{I}{\overline{C}}_{n}\right)}^{2}$

After which standard errors, confidence intervals and p-values can be calculated in the standard fashion

Also possible to utilize bootstrapping to calculate standard errors, but computationally expensive

## TMLE package in R

Created by Susan Gruber in collaboration with Mark van der Laan

library(tmle)

effA1 <- tmle(Y=Y,
A=A,
W=W,
Q.SL.library = c(),
g.SL.library = c(),
family = "binomial",
cvQinit = TRUE,
verbose = TRUE)

## TMLE Arguments

• Y - The outcome
• A - Binary treatment indicator, 1 treatment, 0 control
• W - A matrix of covariates
• Q.SL.library - a character vector of prediction algorithms for initial $$Q$$
• g.SL.library - a character vector of prediction algorithms for $$g$$
• family - 'gaussian' or 'binomial' to describe the error distribution
• cvQinit - estimates cross-validated predicted values for initial $$Q$$, if TRUE

• id - Subject or group identifier if observations are related. Causes corrected standard errors to be calculated
• verbose - helpful to set this to TRUE to see the progress of the estimation
• Delta - Indicator of missing outcome or treatment assignment
• Z - Binary mediating variable

## Using super learner with TMLE

Permits the use of multiple machine learning algorithms to generate the initial estimate of $$Q$$

• Should use cross validation as SL will easily overfit
• The better the initial estimate of $$Q$$, the easier it is to calculate the updated estimates

Currently, SL should not be used to estimate $$g$$

• Often creates violations of the positivity assumption
• Best to use standard GLM or LASSO

## TMLE example

Does placing a right heart catheter change 30 day mortality?

The ARF dataset has 2490 patients admitted to an ICU and 47 variables including:

• Demographic characteristics, including age, gender and race
• Patient medical history, 12 variables for medicial conditions: MI, COPD, stroke, cancer, etc.
• Current condition variables, that provide information about the patient's current health status: diagnostic scales, vital statistics
• RHC status, The placement of a right heart catheter (RHC) is controversial as there is no empirical evidence that benefits patients

Only works with numeric matrices; can be specified in-line, i.e. Y= dataset$Y Data must be pre-processed: • Can only handle missingness in the outcome Y, X must be removed/imputed • Continuous variables must be appropriately re-scaled • Categorical variables must be appropriately dummy coded ## Preparing data for TMLE # Impute missing X values # library("VIM") # Scale cont vars # library(arm) cont <- c("age","edu","das2d3pc","aps1","scoma1","meanbp1","wblc1","hrt1", "resp1","temp1","pafi1","alb1","hema1","bili1","crea1","sod1", "pot1","paco21","ph1","wtkilo1") arf[,cont] <- data.frame(apply(arf[cont], 2, function(x) {x <- rescale(x, "full")})); rm(cont) # standardizes by centering and # dividing by 2 sd # Create dummy vars # arf$rhc <- ifelse(arf$swang1=="RHC",1,0) arf$white <- ifelse(arf$race=="white",1,0) arf$swang1 <- arf$race <- NULL ## Run TMLE system.time({ eff <- tmle(Y=arf$death,
A=arf\$rhc,
W=arf[1:44],
Q.SL.library = c("SL.gam","SL.knn","SL.step"),
g.SL.library = c("SL.glmnet"),
family = "binomial",
cvQinit = TRUE, verbose = TRUE)
})[] # Obtain computation time

## TMLE results

Run time on laptop: 15.43 min.

print(eff)

Odds Ratio
Parameter Estimate: 1.207
p-value: 0.063956
95% Conf Interval: (0.98914, 1.4728)

Interpretation: Right heart catheterization does not appear to change 30 day mortality

• Note that causal assumptions require non-testable assumptions previously outlined

## Advantages of the TMLE approach

Incorporates machine learning so the limitations of parametric methods are avoided

Is “double robust” meaning that estimates are asymptotically unbiased if either the initial SL estimate or the propensity score is correctly specified

• As a result, TMLE works very well with rare outcomes

Can be extended to a variety of situations

• Missing outcomes: can account for missing outcomes in a MAR way
• Controlled direct effect estimation: can account for mediators in the relationship between A and Y
• Marginal structural models: flexible framework for handling issues of time-dependent confounding

## Extensions of TMLE being developed in new R packages

• ltmle: Longitudinal TMLE permits the evaluation of interventions over time using a marginal structural model
• multiPIM: variable importance analysis that estimates an attributable-risk-type parameter
• tmle.npvi: permits modeling an intervention variable that is a continuous variable
• CTMLE: collaborative TMLE accounts for the relationship between Q and g