Meta-Learner Algorithms

A meta-algorithm (or meta-learner) is a framework to estimate the Conditional Average Treatment Effect (CATE) using any machine learning estimators (called base learners) [11].

A meta-algorithm uses either a single base learner with the treatment indicator as a feature (e.g. S-learner), or multiple base learners fitted separately for the treatment and control groups (e.g. T-learner, X-learner and R-learner).

Confidence intervals of average treatment effect estimates are calculated based on the lower bound formula (7) from [10].


S-learner estimates the treatment effect using a single machine learning model as follows:

Stage 1

Estimate the average outcomes \(\mu(x, w)\) with covariates \(X\) and a treatment indicator \(W\):

\[\mu(x, w) = E[Y \mid X=x, W=w]\]

using a machine learning model.

Stage 2

Define the CATE estimate as:

\[\hat\tau(x) = \hat\mu(x, W=1) - \hat\mu(x, W=0)\]

Including the propensity score in the model can reduce bias from regularization-induced confounding [20].

When the control and treatment groups are very different in covariates, a single linear model is not sufficient to encode the different relevant dimensions and smoothness of features for the control and treatment groups [1].
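The two S-learner stages above can be sketched in a few lines. This is only an illustration under stated assumptions: `CellMeanRegressor` is a hypothetical toy base learner (it predicts the mean outcome of each distinct feature combination) standing in for any machine learning model, `s_learner_cate` is a hypothetical helper name, and the data are made up. It is not the package's API.

```python
from collections import defaultdict
from statistics import mean


class CellMeanRegressor:
    """Toy base learner: predicts the mean outcome of each distinct feature tuple."""

    def fit(self, features, y):
        cells = defaultdict(list)
        for row, target in zip(features, y):
            cells[tuple(row)].append(target)
        self.means_ = {cell: mean(values) for cell, values in cells.items()}
        return self

    def predict(self, features):
        return [self.means_[tuple(row)] for row in features]


def s_learner_cate(model, X, w, y, X_new):
    # Stage 1: a single model for mu(x, w), with the treatment indicator as a feature.
    model.fit([list(x) + [wi] for x, wi in zip(X, w)], y)
    # Stage 2: tau(x) = mu(x, W=1) - mu(x, W=0).
    mu1 = model.predict([list(x) + [1] for x in X_new])
    mu0 = model.predict([list(x) + [0] for x in X_new])
    return [a - b for a, b in zip(mu1, mu0)]


# Hypothetical data: one binary covariate; the effect is 2.0 when x = 0 and 5.0 when x = 1.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]
w = [0, 0, 1, 1, 0, 0, 1, 1]
y = [1.0, 3.0, 3.0, 5.0, 2.0, 4.0, 7.0, 9.0]
cate = s_learner_cate(CellMeanRegressor(), X, w, y, [[0], [1]])
```

Any regressor with `fit` and `predict` methods of this shape could be swapped in for the toy learner.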


T-learner [11] consists of two stages as follows:

Stage 1

Estimate the average outcomes \(\mu_0(x)\) and \(\mu_1(x)\):

\[\mu_0(x) = E[Y(0)|X=x], \qquad \mu_1(x) = E[Y(1)|X=x]\]

using machine learning models.

Stage 2

Define the CATE estimate as:

\[\hat\tau(x) = \hat\mu_1(x) - \hat\mu_0(x)\]
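As a rough sketch of the two T-learner stages, the example below fits one least-squares line per group and takes the difference. The helper names `fit_line` and `t_learner_cate`, the single-covariate linear base learner, and the noise-free data are all assumptions made for illustration.

```python
def fit_line(x, y):
    """Least-squares line for one covariate; returns a prediction function."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return lambda v: y_bar + slope * (v - x_bar)


def t_learner_cate(x, w, y):
    # Stage 1: separate outcome models for the control and treatment groups.
    mu0 = fit_line([a for a, t in zip(x, w) if t == 0],
                   [b for b, t in zip(y, w) if t == 0])
    mu1 = fit_line([a for a, t in zip(x, w) if t == 1],
                   [b for b, t in zip(y, w) if t == 1])
    # Stage 2: tau(x) = mu1(x) - mu0(x).
    return lambda v: mu1(v) - mu0(v)


# Hypothetical noise-free data: control y = 1 + x, treated y = 2 + 3x, so tau(x) = 1 + 2x.
x = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
w = [0, 0, 0, 1, 1, 1]
y = [1.0, 2.0, 3.0, 2.0, 5.0, 8.0]
tau = t_learner_cate(x, w, y)
```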


X-learner [11] is an extension of T-learner, and consists of three stages as follows:

Stage 1

Estimate the average outcomes \(\mu_0(x)\) and \(\mu_1(x)\):

\[\mu_0(x) = E[Y(0)|X=x], \qquad \mu_1(x) = E[Y(1)|X=x]\]

using machine learning models.

Stage 2

Impute the user-level treatment effects, \(D^1_i\) for user \(i\) in the treatment group based on \(\mu_0(x)\), and \(D^0_j\) for user \(j\) in the control group based on \(\mu_1(x)\):

\[D^1_i = Y^1_i - \hat\mu_0(X^1_i), \qquad D^0_j = \hat\mu_1(X^0_j) - Y^0_j\]

then estimate \(\tau_1(x) = E[D^1|X=x]\), and \(\tau_0(x) = E[D^0|X=x]\) using machine learning models.

Stage 3

Define the CATE estimate by a weighted average of \(\tau_1(x)\) and \(\tau_0(x)\):

\[\tau(x) = g(x)\tau_0(x) + (1 - g(x))\tau_1(x)\]

where \(g(x) \in [0, 1]\) is a weighting function. We can use propensity scores for \(g(x)\).
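The three X-learner stages can be traced with a small sketch. It assumes a single covariate, a least-squares line as every base learner, and a constant weight \(g(x)\) equal to the share of treated units (a crude stand-in for an estimated propensity score). The names `fit_line` and `x_learner_cate` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares line for one covariate; returns a prediction function."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return lambda v: y_bar + slope * (v - x_bar)


def x_learner_cate(x, w, y):
    x0 = [a for a, t in zip(x, w) if t == 0]
    y0 = [b for b, t in zip(y, w) if t == 0]
    x1 = [a for a, t in zip(x, w) if t == 1]
    y1 = [b for b, t in zip(y, w) if t == 1]
    # Stage 1: outcome models for the control and treatment groups.
    mu0, mu1 = fit_line(x0, y0), fit_line(x1, y1)
    # Stage 2: impute unit-level effects, then model them.
    d1 = [yi - mu0(xi) for xi, yi in zip(x1, y1)]  # treated units
    d0 = [mu1(xj) - yj for xj, yj in zip(x0, y0)]  # control units
    tau1, tau0 = fit_line(x1, d1), fit_line(x0, d0)
    # Stage 3: weighted average; g is assumed constant (share of treated units).
    g = len(x1) / len(x)
    return lambda v: g * tau0(v) + (1 - g) * tau1(v)


# Hypothetical noise-free data: control y = 1 + x, treated y = 2 + 3x, so tau(x) = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 0.0, 2.0]
w = [0, 0, 0, 0, 1, 1]
y = [1.0, 2.0, 3.0, 4.0, 2.0, 8.0]
tau = x_learner_cate(x, w, y)
```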


R-learner [13] uses the cross-validation out-of-fold estimates of outcomes \(\hat{m}^{(-i)}(x_i)\) and propensity scores \(\hat{e}^{(-i)}(x_i)\). It consists of two stages as follows:

Stage 1

Fit \(\hat{m}(x)\) and \(\hat{e}(x)\) with machine learning models using cross-validation.

Stage 2

Estimate treatment effects by minimising the R-loss, \(\hat{L}_n(\tau(x))\):

\[\hat{L}_n(\tau(x)) = \frac{1}{n} \sum^n_{i=1}\big(\big(Y_i - \hat{m}^{(-i)}(X_i)\big) - \big(W_i - \hat{e}^{(-i)}(X_i)\big)\tau(X_i)\big)^2\]

where \(\hat{m}^{(-i)}(X_i)\) and \(\hat{e}^{(-i)}(X_i)\) denote the out-of-fold predictions made without using the \(i\)-th training sample.
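To make the R-loss concrete, the sketch below minimises it in closed form over the hypothetical linear family \(\tau(x) = a + bx\). Several simplifications are assumed: one covariate, a least-squares line for \(\hat{m}\), a constant propensity for \(\hat{e}\), and folds assigned by index modulo `n_folds`. All function names and the data are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares line for one covariate; returns a prediction function."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return lambda v: y_bar + slope * (v - x_bar)


def r_learner_linear_cate(x, w, y, n_folds=3):
    n = len(x)
    # Stage 1: out-of-fold predictions of the outcome and the propensity score.
    m_hat, e_hat = [0.0] * n, [0.0] * n
    for fold in range(n_folds):
        train = [j for j in range(n) if j % n_folds != fold]
        m_model = fit_line([x[j] for j in train], [y[j] for j in train])
        e_const = sum(w[j] for j in train) / len(train)  # assumed constant propensity
        for i in range(fold, n, n_folds):
            m_hat[i], e_hat[i] = m_model(x[i]), e_const
    # Stage 2: minimise the R-loss over tau(x) = a + b * x (2x2 normal equations).
    ry = [yi - mi for yi, mi in zip(y, m_hat)]
    rw = [wi - ei for wi, ei in zip(w, e_hat)]
    z2 = [r * xi for r, xi in zip(rw, x)]
    s11 = sum(p * p for p in rw)
    s12 = sum(p * q for p, q in zip(rw, z2))
    s22 = sum(q * q for q in z2)
    t1 = sum(p * r for p, r in zip(rw, ry))
    t2 = sum(q * r for q, r in zip(z2, ry))
    det = s11 * s22 - s12 * s12
    a = (t1 * s22 - s12 * t2) / det
    b = (s11 * t2 - s12 * t1) / det

    def r_loss(a_, b_):
        return sum((r - v * (a_ + b_ * xi)) ** 2 for r, v, xi in zip(ry, rw, x)) / n

    return a, b, r_loss


# Hypothetical data: y = 1 + 0.5x + w * (1 + 0.2x).
x = [float(i) for i in range(12)]
w = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0]
y = [1 + 0.5 * xi + wi * (1 + 0.2 * xi) for xi, wi in zip(x, w)]
a, b, r_loss = r_learner_linear_cate(x, w, y)
```

In practice \(\tau(x)\) would be a flexible machine learning model rather than a line; the linear family is used here only because its minimiser can be written down directly.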

Tree-Based Algorithms

Uplift Tree

The Uplift Tree approach consists of a set of methods that use a tree-based algorithm where the splitting criterion is based on differences in uplift. [16] proposed three different ways to quantify the gain in divergence as the result of splitting [7]:

\[D_{gain} = D_{after_{split}} (P^T, P^C) - D_{before_{split}}(P^T, P^C)\]

where \(D\) measures the divergence and \(P^T\) and \(P^C\) refer to the probability distribution of the outcome of interest in the treatment and control groups, respectively. Three different ways to quantify the divergence, KL, ED and Chi, are implemented in the package.


The Kullback-Leibler (KL) divergence is given by:

\[KL(P : Q) = \sum_{k=left, right}p_k \log\frac{p_k}{q_k}\]

where \(p\) is the sample mean in the treatment group, \(q\) is the sample mean in the control group and \(k\) indicates the leaf in which \(p\) and \(q\) are computed [7].


The Euclidean Distance is given by:

\[ED(P : Q) = \sum_{k=left, right}(p_k - q_k)^2\]

where the notation is the same as above.


Finally, the \(\chi^2\)-divergence is given by:

\[\chi^2(P : Q) = \sum_{k=left, right}\frac{(p_k - q_k)^2}{q_k}\]

where the notation is again the same as above.
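The three divergence measures translate directly into code. The sketch below evaluates the formulas exactly as written above for hypothetical per-leaf sample means; the function names are illustrative and are not the package's API. It assumes every \(p_k\) and \(q_k\) is strictly positive wherever a logarithm or a division is taken.

```python
import math


def kl_divergence(p, q):
    """Sum over leaves of p_k * log(p_k / q_k); requires p_k > 0 and q_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))


def euclidean_distance(p, q):
    """Sum over leaves of (p_k - q_k)^2."""
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))


def chi_squared_divergence(p, q):
    """Sum over leaves of (p_k - q_k)^2 / q_k; requires q_k > 0."""
    return sum((pk - qk) ** 2 / qk for pk, qk in zip(p, q))


# Hypothetical sample means in the (left, right) leaves.
p = [0.5, 0.25]   # treatment group
q = [0.25, 0.25]  # control group

kl = kl_divergence(p, q)
ed = euclidean_distance(p, q)
chi = chi_squared_divergence(p, q)
```

Only the left leaf contributes here, because treatment and control have the same mean in the right leaf.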


The final Uplift Tree algorithm that is implemented is the Contextual Treatment Selection (CTS) approach by [18], where the sample splitting criterion is defined as follows:

\[\hat{\Delta}_{\mu}(s) = \hat{p}(\phi_l \mid \phi) \times \max_{t=0, ..., K}\hat{y}_t(\phi_l) + \hat{p}(\phi_r \mid \phi) \times \max_{t=0, ..., K}\hat{y}_t(\phi_r) - \max_{t=0, ..., K}\hat{y}_t(\phi)\]

where \(\phi_l\) and \(\phi_r\) refer to the feature subspaces in the left and right leaves respectively, \(\hat{p}(\phi_j \mid \phi)\) denotes the estimated conditional probability of a subject being in \(\phi_j\) given \(\phi\), and \(\hat{y}_t(\phi_j)\) is the conditional expected response under treatment \(t\).
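The CTS splitting criterion can be illustrated as follows. The sketch assumes that \(\hat{y}_t\) is estimated by the plain sample mean of the outcome under treatment \(t\) and that \(\hat{p}(\phi_j \mid \phi)\) is the share of parent samples falling in the child node; the estimators in [18] may differ. Function names and data are hypothetical.

```python
from collections import defaultdict


def best_mean_response(samples):
    """max over treatments t of the sample-mean outcome in a node."""
    by_treatment = defaultdict(list)
    for t, y in samples:
        by_treatment[t].append(y)
    return max(sum(v) / len(v) for v in by_treatment.values())


def cts_split_gain(left, right):
    """Gain of a candidate split; left and right are lists of (treatment, outcome)."""
    parent = left + right
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return (p_left * best_mean_response(left)
            + p_right * best_mean_response(right)
            - best_mean_response(parent))


# Hypothetical node: treatment 1 works only in the left child, treatment 0 only in the right.
left = [(0, 0.0), (0, 0.0), (1, 1.0), (1, 1.0)]
right = [(0, 1.0), (0, 1.0), (1, 0.0), (1, 0.0)]
gain = cts_split_gain(left, right)
```

Before the split neither treatment is better than the other (both average 0.5), while after the split each child has a clear best treatment, which is what the positive gain reflects.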

Value optimization methods

The package supports methods for assigning treatment groups when treatments are costly. To understand the problem, it is helpful to divide populations into the following four categories:

  • Compliers. Those who will have a favourable outcome if and only if they are treated.

  • Always-takers. Those who will have a favourable outcome whether or not they are treated.

  • Never-takers. Those who will never have a favourable outcome whether or not they are treated.

  • Defiers. Those who will have a favourable outcome if and only if they are not treated.

For a more detailed discussion see e.g. [2].

Counterfactual Unit Selection

[12] propose a method for selecting units for treatments using counterfactual logic. Suppose the following benefits for selecting units belonging to the different categories above:

  • Compliers: \(\beta\)

  • Always-takers: \(\gamma\)

  • Never-takers: \(\theta\)

  • Defiers: \(\delta\)

If \(X\) denotes the set of an individual’s features, the unit selection problem can be formulated as follows:

\[argmax_X \beta P(\text{complier} \mid X) + \gamma P(\text{always-taker} \mid X) + \theta P(\text{never-taker} \mid X) + \delta P(\text{defier} \mid X)\]

The problem can be reformulated using counterfactual logic. Suppose \(W = w\) indicates that an individual is treated and \(W = w'\) indicates he or she is untreated. Similarly, let \(F = f\) denote a favourable outcome for the individual and \(F = f'\) an unfavourable outcome. Then the optimization problem becomes:

\[argmax_X \beta P(f_w, f'_{w'} \mid X) + \gamma P(f_w, f_{w'} \mid X) + \theta P(f'_w, f'_{w'} \mid X) + \delta P(f_{w'}, f'_{w} \mid X)\]

Note that the above simply follows from the definitions of the relevant user segments. [12] then use counterfactual logic ([15]) to solve the above optimization problem under certain conditions.

N.B. The current implementation in the package is highly experimental.
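As a plain illustration of the unit selection objective (not the package's implementation), the sketch below scores hypothetical feature profiles whose segment probabilities are simply assumed to be known. In practice these probabilities are generally not directly observable, which is the problem [12] addresses. All names and numbers are made up.

```python
def selection_benefit(probs, beta, gamma, theta, delta):
    """Expected benefit of selecting units with the given segment probabilities."""
    p_complier, p_always_taker, p_never_taker, p_defier = probs
    return (beta * p_complier + gamma * p_always_taker
            + theta * p_never_taker + delta * p_defier)


# Hypothetical benefits per segment and hypothetical segment probabilities per profile,
# ordered as (complier, always-taker, never-taker, defier).
benefits = dict(beta=1.0, gamma=-0.5, theta=0.0, delta=-1.0)
profiles = {
    "A": (0.5, 0.25, 0.25, 0.0),
    "B": (0.25, 0.5, 0.25, 0.0),
    "C": (0.25, 0.0, 0.25, 0.5),
}

scores = {name: selection_benefit(p, **benefits) for name, p in profiles.items()}
best = max(scores, key=scores.get)
```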

Counterfactual Value Estimator

The counterfactual value estimation method implemented in the package predicts the outcome for a unit under different treatment conditions using a standard machine learning model. The expected value of assigning a unit to a particular treatment is then given by

\[\mathbb{E}[(v - cc_w)Y_w - ic_w]\]

where \(Y_w\) is the probability of a favourable event (such as conversion) under a given treatment \(w\), \(v\) is the value of the favourable event, \(cc_w\) is the cost of the treatment triggered in case of a favourable event, and \(ic_w\) is the cost associated with the treatment whether or not the outcome is favourable. This method builds upon the ideas discussed in [19].
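The expected-value formula can be evaluated per treatment once the outcome probabilities have been predicted. The sketch below assumes those predictions are already available for a single unit; the function name, treatment names, and all numbers are hypothetical.

```python
def expected_value(p_favourable, value, conversion_cost, impression_cost):
    """(v - cc_w) * Y_w - ic_w for a single unit and treatment."""
    return (value - conversion_cost) * p_favourable - impression_cost


value = 10.0  # v: value of a favourable event
predicted = {"control": 0.25, "coupon": 0.5}  # Y_w, assumed model predictions
costs = {"control": (0.0, 0.0), "coupon": (2.0, 0.5)}  # (cc_w, ic_w)

values = {t: expected_value(predicted[t], value, *costs[t]) for t in predicted}
best = max(values, key=values.get)
```

Here the coupon doubles the predicted conversion probability, and the gain outweighs both of its costs, so it is the preferred assignment for this unit.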

Selected traditional methods

The package supports selected traditional causal inference methods. These are usually used to conduct causal inference with observational (non-experimental) data. In these types of studies, the observed difference between the treatment and the control is in general not equal to the difference between “potential outcomes” \(\mathbb{E}[Y(1) - Y(0)]\). Thus, the methods below try to deal with this problem in different ways.


Matching

The general idea in matching is to find treated and non-treated units that are as similar as possible in terms of their relevant characteristics. As such, matching methods can be seen as part of the family of causal inference approaches that try to mimic randomized controlled trials.

While there are a number of different ways to match treated and non-treated units, the most common method is to use the propensity score:

\[e_i(X_i) = P(W_i = 1 \mid X_i)\]

Treated and non-treated units are then matched in terms of \(e(X)\) using some criterion of distance, such as \(k:1\) nearest neighbours. Because matching is usually between the treated population and the control, this method estimates the average treatment effect on the treated (ATT):

\[\mathbb{E}[Y(1) \mid W = 1] - \mathbb{E}[Y(0) \mid W = 1]\]

See [17] for a discussion of the strengths and weaknesses of the different matching methods.
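A minimal sketch of 1:1 nearest-neighbour matching on the propensity score is shown below. It assumes the propensity scores have already been estimated, matches with replacement, and applies no caliper; the function name and data are hypothetical.

```python
def att_by_matching(e, w, y):
    """ATT from 1:1 nearest-neighbour propensity matching with replacement."""
    controls = [(ei, yi) for ei, wi, yi in zip(e, w, y) if wi == 0]
    diffs = []
    for ei, wi, yi in zip(e, w, y):
        if wi == 1:
            # Closest control unit in terms of propensity score.
            _, y_match = min(controls, key=lambda c: abs(c[0] - ei))
            diffs.append(yi - y_match)
    return sum(diffs) / len(diffs)


# Hypothetical propensity scores, treatment indicators and outcomes.
e = [0.2, 0.3, 0.7, 0.8, 0.22, 0.78]
w = [0, 0, 0, 0, 1, 1]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
att = att_by_matching(e, w, y)
```

The two treated units are matched to the controls with scores 0.2 and 0.8, giving differences of 4.0 and 2.0.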

Inverse probability of treatment weighting

The inverse probability of treatment weighting (IPTW) approach uses the propensity score \(e\) to weight the treated and non-treated populations by the inverse of the probability of the treatment \(W\) they actually received. For a binary treatment \(W \in \{1, 0\}\), the weight is:

\[\frac{W}{e} + \frac{1 - W}{1 - e}\]

In this way, the IPTW approach can be seen as creating an artificial population in which the treated and non-treated units are similar in terms of their observed features \(X\).

One of the possible benefits of IPTW compared to matching is that less data may be discarded due to lack of overlap between treated and non-treated units. A known problem with the approach is that extreme propensity scores can generate highly variable estimators. Different methods have been proposed for trimming and normalizing the IPT weights ([9]). An overview of the IPTW approach can be found in [@article{].
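The weights above, and one way of using them, can be sketched as follows. The example assumes known propensity scores strictly between 0 and 1 and estimates the average treatment effect as a difference of normalised weighted means, which is one common choice rather than the only one. Names and data are hypothetical.

```python
def iptw_weights(w, e):
    """W / e + (1 - W) / (1 - e) for each unit; requires 0 < e < 1."""
    return [wi / ei + (1 - wi) / (1 - ei) for wi, ei in zip(w, e)]


def iptw_ate(w, e, y):
    """Difference of normalised weighted outcome means (treated minus control)."""
    weights = iptw_weights(w, e)

    def weighted_mean(group):
        num = sum(k * yi for k, wi, yi in zip(weights, w, y) if wi == group)
        den = sum(k for k, wi in zip(weights, w) if wi == group)
        return num / den

    return weighted_mean(1) - weighted_mean(0)


# Hypothetical treatment indicators, propensity scores and outcomes.
w = [1, 0, 1, 0]
e = [0.5, 0.5, 0.25, 0.75]
y = [3.0, 1.0, 6.0, 2.5]
weights = iptw_weights(w, e)
ate = iptw_ate(w, e, y)
```

The treated unit with a propensity of 0.25 and the control unit with a propensity of 0.75 both receive a weight of 4, which shows how unlikely treatment assignments are up-weighted.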

Instrumental variables

The instrumental variables approach attempts to estimate the effect of \(W\) on \(Y\) with the help of a third variable \(Z\) that is correlated with \(W\) but is uncorrelated with the error term for \(Y\). In other words, the instrument \(Z\) is only related to \(Y\) through the directed path that goes through \(W\). If these conditions are satisfied, the effect of \(W\) on \(Y\) can be estimated using the sample analog of:

\[\frac{Cov(Y_i, Z_i)}{Cov(W_i, Z_i)}\]

The most common method for instrumental variables estimation is two-stage least squares (2SLS). In this approach, the cause variable \(W\) is first regressed on the instrument \(Z\). Then, in the second stage, the outcome of interest \(Y\) is regressed on the predicted value from the first-stage model. Intuitively, the effect of \(W\) on \(Y\) is estimated by using only the portion of variation in \(W\) that is due to variation in \(Z\). See [3] for a detailed discussion of the method.
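With a single instrument and no additional covariates, the covariance ratio and 2SLS give the same estimate, which the sketch below illustrates. The data are hypothetical and constructed so that the true effect of \(W\) on \(Y\) is 2, an unobserved confounder `u` biases the naive regression, and `z` is uncorrelated with `u` in the sample. All function names are illustrative.

```python
def cov(a, b):
    """Sample covariance (population normalisation; it cancels in the ratios below)."""
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)


def iv_ratio(y, w, z):
    """Sample analog of Cov(Y, Z) / Cov(W, Z)."""
    return cov(y, z) / cov(w, z)


def two_stage_least_squares(y, w, z):
    # Stage 1: regress W on Z and compute fitted values.
    slope = cov(w, z) / cov(z, z)
    intercept = sum(w) / len(w) - slope * sum(z) / len(z)
    w_hat = [intercept + slope * zi for zi in z]
    # Stage 2: regress Y on the fitted values; the slope is the effect estimate.
    return cov(y, w_hat) / cov(w_hat, w_hat)


# Hypothetical data: w = z + u and y = 2 * w + 3 * u, with u unobserved.
u = [1, -1, 1, -1, 0, 0]
z = [0, 0, 1, 1, 0, 1]
w = [zi + ui for zi, ui in zip(z, u)]
y = [2 * wi + 3 * ui for wi, ui in zip(w, u)]

iv = iv_ratio(y, w, z)
tsls = two_stage_least_squares(y, w, z)
ols = cov(y, w) / cov(w, w)  # naive regression of Y on W, biased by u
```

The naive slope is pulled well above 2 by the confounder, while both instrumental variable estimates use only the variation in `w` that comes from `z`.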