Methodology
In this section we dive more deeply into the algorithms implemented in CausalML. To provide a basis for the discussion, we review some of the frameworks and definitions used in the literature.
We use the Neyman-Rubin potential outcomes framework and assume \(Y\) represents the outcome, \(W\) represents the treatment assignment, and \(X\) the observed covariates.
Supported Algorithms
CausalML currently supports the following methods:
- Tree-based algorithms
  - Uplift Random Forests on KL divergence, Euclidean Distance, and Chi-Square
  - Uplift Random Forests on Contextual Treatment Selection
  - Uplift Random Forests on the delta-delta-p (\(\Delta\Delta P\)) criterion (only for binary trees and two-class problems)
  - Uplift Random Forests on IDDP (only for binary trees and two-class problems)
  - Interaction Tree (only for binary trees and two-class problems)
  - Causal Inference Tree (only for binary trees and two-class problems)
- Meta-learner algorithms
- Instrumental variables algorithms
- Neural network based algorithms
  - CEVAE
  - DragonNet
- Treatment optimization algorithms
Decision Guide
A flowchart to help decide among the supported methods is available at: https://github.com/uber/causalml/issues/677#issuecomment-1712088558
Meta-Learner Algorithms
A meta-algorithm (or meta-learner) is a framework to estimate the Conditional Average Treatment Effect (CATE) using any machine learning estimators (called base learners) [16].
A meta-algorithm uses either a single base learner with the treatment indicator as a feature (e.g. S-learner), or multiple base learners fit separately on the treatment and control groups (e.g. T-learner, X-learner and R-learner).
Confidence intervals of average treatment effect estimates are calculated based on the lower bound formula (7) from [14].
S-Learner
S-learner estimates the treatment effect using a single machine learning model as follows:
Stage 1
Estimate the average outcomes \(\mu(x, w)\) with covariates \(X\) and an indicator variable for treatment \(W\):

\[\mu(x, w) = E[Y \mid X = x, W = w]\]

using a machine learning model.
Stage 2
Define the CATE estimate as:

\[\hat{\tau}(x) = \hat{\mu}(x, W = 1) - \hat{\mu}(x, W = 0)\]
Including the propensity score in the model can reduce bias from regularization induced confounding [30].
When the control and treatment groups are very different in covariates, a single linear model is not sufficient to encode the different relevant dimensions and smoothness of features for the control and treatment groups [1].
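As a minimal sketch of the two S-learner stages with CausalML's meta-learner API, assuming synthetic numpy inputs and using a scikit-learn regressor as the base learner:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from causalml.inference.meta import BaseSRegressor

# Illustrative synthetic data: 5 covariates, random treatment, heterogeneous effect
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 5))
treatment = rng.binomial(1, 0.5, size=5000)
y = X[:, 0] + treatment * (0.5 + X[:, 1]) + rng.normal(size=5000)

# Stage 1: one model over (X, W); Stage 2: CATE = mu-hat(x, W=1) - mu-hat(x, W=0)
learner_s = BaseSRegressor(learner=GradientBoostingRegressor())
cate_s = learner_s.fit_predict(X=X, treatment=treatment, y=y)  # per-unit CATE estimates
```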
T-Learner
T-learner [16] consists of two stages as follows:
Stage 1
Estimate the average outcomes \(\mu_0(x)\) and \(\mu_1(x)\):

\[\mu_0(x) = E[Y(0) \mid X = x], \quad \mu_1(x) = E[Y(1) \mid X = x]\]

using machine learning models.
Stage 2
Define the CATE estimate as:

\[\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)\]
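A from-scratch sketch of the two stages with scikit-learn base learners (CausalML's BaseTRegressor packages the same logic); \(X\), \(w\), \(y\) are assumed numpy arrays with \(w\) coded 0/1:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(X, w, y):
    """Stage 1: fit mu_0 and mu_1 on the control/treatment subsets;
    Stage 2: the CATE estimate is mu_1(x) - mu_0(x)."""
    mu0 = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])  # control model
    mu1 = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])  # treatment model
    return mu1.predict(X) - mu0.predict(X)
```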
X-Learner
X-learner [16] is an extension of T-learner, and consists of three stages as follows:
Stage 1
Estimate the average outcomes \(\mu_0(x)\) and \(\mu_1(x)\):

\[\mu_0(x) = E[Y(0) \mid X = x], \quad \mu_1(x) = E[Y(1) \mid X = x]\]

using machine learning models.
Stage 2
Impute the user-level treatment effects, \(D^1_i\) and \(D^0_j\), for user \(i\) in the treatment group based on \(\mu_0(x)\), and user \(j\) in the control group based on \(\mu_1(x)\):

\[D^1_i = Y^1_i - \hat{\mu}_0(X^1_i), \quad D^0_j = \hat{\mu}_1(X^0_j) - Y^0_j\]
then estimate \(\tau_1(x) = E[D^1|X=x]\), and \(\tau_0(x) = E[D^0|X=x]\) using machine learning models.
Stage 3
Define the CATE estimate by a weighted average of \(\tau_1(x)\) and \(\tau_0(x)\):

\[\hat{\tau}(x) = g(x) \hat{\tau}_0(x) + (1 - g(x)) \hat{\tau}_1(x)\]
where \(g \in [0, 1]\). We can use propensity scores for \(g(x)\).
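A sketch of all three stages with scikit-learn base learners and a logistic-regression propensity model as \(g(x)\) (CausalML's BaseXRegressor wraps the same logic):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

def x_learner_cate(X, w, y):
    # Stage 1: outcome models for each arm
    mu0 = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])
    mu1 = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])
    # Stage 2: imputed treatment effects and the models tau_1, tau_0
    d1 = y[w == 1] - mu0.predict(X[w == 1])   # D^1_i for treated users
    d0 = mu1.predict(X[w == 0]) - y[w == 0]   # D^0_j for control users
    tau1 = GradientBoostingRegressor().fit(X[w == 1], d1)
    tau0 = GradientBoostingRegressor().fit(X[w == 0], d0)
    # Stage 3: weighted average with the propensity score as g(x)
    g = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]
    return g * tau0.predict(X) + (1 - g) * tau1.predict(X)
```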
R-Learner
R-learner [19] uses the cross-validation out-of-fold estimates of outcomes \(\hat{m}^{(-i)}(x_i)\) and propensity scores \(\hat{e}^{(-i)}(x_i)\). It consists of two stages as follows:
Stage 1
Fit \(\hat{m}(x)\) and \(\hat{e}(x)\) with machine learning models using cross-validation.
Stage 2
Estimate treatment effects by minimising the R-loss, \(\hat{L}_n(\tau(x))\):

\[\hat{L}_n(\tau(x)) = \frac{1}{n} \sum_{i=1}^{n} \left( \left( Y_i - \hat{m}^{(-i)}(X_i) \right) - \left( W_i - \hat{e}^{(-i)}(X_i) \right) \tau(X_i) \right)^2\]
where \(\hat{e}^{(-i)}(X_i)\), etc. denote the out-of-fold held-out predictions made without using the \(i\)-th training sample.
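Stage 2 can be carried out with any standard regressor via the weighted-regression identity: minimizing the R-loss is equivalent to regressing the pseudo-outcome \((Y - \hat{m})/(W - \hat{e})\) on \(X\) with sample weights \((W - \hat{e})^2\). A sketch, assuming `m_oof` and `e_oof` hold the out-of-fold Stage 1 estimates:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def r_learner_stage2(X, w, y, m_oof, e_oof):
    """Minimize the R-loss via the equivalent weighted regression."""
    resid_w = w - e_oof                 # W_i - e_hat(X_i)
    pseudo = (y - m_oof) / resid_w      # pseudo-outcome
    weights = resid_w ** 2              # R-loss sample weights
    return GradientBoostingRegressor().fit(X, pseudo, sample_weight=weights)
```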
Doubly Robust (DR) learner
DR-learner [15] estimates the CATE via cross-fitting a doubly robust score function in two stages, as follows. We start by randomly splitting the data \(\{Y, X, W\}\) into 3 partitions \(\{Y^i, X^i, W^i\},\ i=\{1,2,3\}\).
Stage 1
Fit a propensity score model \(\hat{e}(x)\) with machine learning using \(\{X^1, W^1\}\), and fit outcome regression models \(\hat{m}_0(x)\) and \(\hat{m}_1(x)\) for treated and untreated users with machine learning using \(\{Y^2, X^2, W^2\}\).
Stage 2
Use machine learning to fit the CATE model, \(\hat{\tau}(X)\), from the pseudo-outcome

\[\phi = \hat{m}_1(X) - \hat{m}_0(X) + \frac{W \left( Y - \hat{m}_1(X) \right)}{\hat{e}(X)} - \frac{(1 - W) \left( Y - \hat{m}_0(X) \right)}{1 - \hat{e}(X)}\]

with \(\{Y^3, X^3, W^3\}\).
Stage 3
Repeat Stages 1 and 2 twice more. First use \(\{Y^2, X^2, W^2\}\), \(\{Y^3, X^3, W^3\}\), and \(\{Y^1, X^1, W^1\}\) for the propensity score model, the outcome models, and the CATE model, respectively. Then use \(\{Y^3, X^3, W^3\}\), \(\{Y^1, X^1, W^1\}\), and \(\{Y^2, X^2, W^2\}\) for the propensity score model, the outcome models, and the CATE model. The final CATE model is the average of the 3 CATE models.
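A sketch of Stage 2: compute the doubly robust pseudo-outcome above from the Stage 1 nuisance estimates (assumed given, fit on the other partitions) and regress it on the covariates:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def dr_stage2(X, y, w, e_hat, m0_hat, m1_hat):
    """Compute the DR pseudo-outcome and fit the CATE model to it."""
    pseudo = (m1_hat - m0_hat
              + w * (y - m1_hat) / e_hat
              - (1 - w) * (y - m0_hat) / (1 - e_hat))
    return GradientBoostingRegressor().fit(X, pseudo)
```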
Doubly Robust Instrumental Variable (DRIV) learner
We combine the idea from DR-learner [15] with the doubly robust score function for LATE described in [10] to estimate the conditional LATE. Towards that end, we start by randomly splitting the data \(\{Y, X, W, Z\}\) into 3 partitions \(\{Y^i, X^i, W^i, Z^i\},\ i=\{1,2,3\}\).
Stage 1
Fit propensity score models \(\hat{e}_0(x)\) and \(\hat{e}_1(x)\) for assigned and unassigned users using \(\{X^1, W^1, Z^1\}\), and fit outcome regression models \(\hat{m}_0(x)\) and \(\hat{m}_1(x)\) for assigned and unassigned users with machine learning using \(\{Y^2, X^2, Z^2\}\). The assignment probability, \(p_Z\), can either be user-provided or come from a simple model, since in most use cases assignment is random by design.
Stage 2
Use machine learning to fit the conditional LATE model, \(\hat{\tau}(X)\) by minimizing the following loss function
with \(\{Y^3, X^3, W^3, Z^3\}\).
Stage 3
Similar to the DR-learner, repeat Stages 1 and 2 twice with different permutations of the partitions for estimation. The final conditional LATE model is the average of the 3 conditional LATE models.
Tree-Based Algorithms
Uplift Tree
The Uplift Tree approach consists of a set of methods that use a tree-based algorithm where the splitting criterion is based on differences in uplift. [22] proposed three different ways to quantify the gain in divergence as the result of splitting [11]:

\[D_{gain} = D_{after\_split}(P^T, P^C) - D_{before\_split}(P^T, P^C)\]
where \(D\) measures the divergence and \(P^T\) and \(P^C\) refer to the probability distribution of the outcome of interest in the treatment and control groups, respectively. Three different ways to quantify the divergence, KL, ED and Chi, are implemented in the package.
KL
The Kullback-Leibler (KL) divergence is given by:

\[KL(P : Q) = \sum_{k=left, right} p_k \log \frac{p_k}{q_k}\]

where \(p\) is the sample mean in the treatment group, \(q\) is the sample mean in the control group and \(k\) indicates the leaf in which \(p\) and \(q\) are computed [11].
ED
The Euclidean Distance is given by:

\[ED(P : Q) = \sum_{k=left, right} (p_k - q_k)^2\]
where the notation is the same as above.
Chi
Finally, the \(\chi^2\)-divergence is given by:

\[\chi^2(P : Q) = \sum_{k=left, right} \frac{(p_k - q_k)^2}{q_k}\]
where the notation is again the same as above.
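A small numpy sketch of the three divergence measures above, where `p` and `q` are arrays of the treatment- and control-group sample means \(p_k\), \(q_k\) in the candidate leaves:

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """Sum of p_k * log(p_k / q_k) over the leaves, clipped to avoid log(0)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return np.sum(p * np.log(p / q))

def ed(p, q):
    """Sum of squared differences (p_k - q_k)^2 over the leaves."""
    return np.sum((np.asarray(p) - np.asarray(q)) ** 2)

def chi(p, q, eps=1e-8):
    """Sum of (p_k - q_k)^2 / q_k over the leaves."""
    p, q = np.asarray(p), np.clip(q, eps, None)
    return np.sum((p - q) ** 2 / q)

# e.g. the gain from a candidate split is D(after_split) - D(before_split):
# gain = kl([p_left, p_right], [q_left, q_right]) - kl([p_parent], [q_parent])
```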
DDP
Another Uplift Tree algorithm that is implemented is the delta-delta-p (\(\Delta\Delta P\)) approach by [9], where the sample splitting criterion is defined as follows:

\[\Delta\Delta P = \left| \left( P^T(y \mid a_0) - P^C(y \mid a_0) \right) - \left( P^T(y \mid a_1) - P^C(y \mid a_1) \right) \right|\]
where \(a_0\) and \(a_1\) are the outcomes of a Split A, \(y\) is the selected class, and \(P^T\) and \(P^C\) are the response rates of treatment and control group, respectively. In other words, we first calculate the difference in the response rate in each branch (\(\Delta P_{left}\) and \(\Delta P_{right}\)), and subsequently, calculate their differences (\(\Delta\Delta P = |\Delta P_{left} - \Delta P_{right}|\)).
IDDP
Building upon the \(\Delta\Delta P\) approach, the IDDP approach by [23] is implemented, where the sample splitting criterion is defined as follows:
where \(\Delta\Delta P^*\) is defined as \(\Delta\Delta P - \left| E[Y(1) - Y(0) \mid X \in \phi] \right|\) and \(I(\phi, \phi_l, \phi_r)\) is defined as:
where the entropy \(H\) is defined as \(H(p, q) = -p \log_2 p - q \log_2 q\), and where \(\phi\) is a subset of the feature space associated with the current decision node, and \(\phi_l\) and \(\phi_r\) are the left and right child nodes, respectively. \(n_t(\phi)\) is the number of treatment samples, \(n_c(\phi)\) the number of control samples, and \(n(\phi)\) the number of all samples in the current (parent) node.
IT
Further, the package implements the Interaction Tree (IT) proposed by [26], where the sample splitting criterion maximizes the G statistic among all permissible splits:
where \(G(s)=t^2(s)\) and \(t(s)\) is defined as:
where \(\sigma^2 = \sum_{i=1}^{4} w_i s_i^2\) is a pooled estimator of the constant variance, and \(w_i = (n_i - 1) / \sum_{j=1}^{4} (n_j - 1)\). Further, \(y^L_1\), \(s^2_1\), and \(n_1\) are the sample mean, the sample variance, and the sample size for the treatment group in the left child node, respectively. Similar notation applies to the other quantities.
Note that this implementation deviates from the original implementation in that (1) the pruning techniques and (2) the validation method for determining the best tree size are different.
CIT
Also, the package implements the Causal Inference Tree (CIT) by [25], where the sample splitting criterion calculates the likelihood ratio test statistic:
where \(n_{\tau}\), \(n_{\tau 0}\), and \(n_{\tau 1}\) are the total number of observations in node \(\tau\), the number of observations in node \(\tau\) that are assigned to the control group, and the number of observations in node \(\tau\) that are assigned to the treatment group, respectively. \(SSE_{\tau}\) is defined as:
and \(\hat{y}_{\tau 0}\) and \(\hat{y}_{\tau 1}\) are the sample average responses of the control and treatment groups in node \(\tau\), respectively.
Note that this implementation deviates from the original implementation in that (1) the pruning techniques and (2) the validation method for determining the best tree size are different.
CTS
The final Uplift Tree algorithm that is implemented is the Contextual Treatment Selection (CTS) approach by [28], where the sample splitting criterion is defined as follows:
where \(\phi_l\) and \(\phi_r\) refer to the feature subspaces of the left and right child nodes respectively, \(\hat{p}(\phi_j \mid \phi)\) denotes the estimated conditional probability of a subject being in \(\phi_j\) given \(\phi\), and \(\hat{y}_t(\phi_j)\) is the conditional expected response under treatment \(t\).
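A sketch of fitting an uplift forest with one of the splitting criteria discussed in this section; per the package's options, `evaluationFunction` accepts names such as 'KL', 'ED', 'Chi', 'CTS', 'DDP', 'IT', 'CIT', and 'IDDP'. The inputs `X`, `treatment` (an array of group labels including 'control'), and `y` are assumed:

```python
from causalml.inference.tree import UpliftRandomForestClassifier

uplift_model = UpliftRandomForestClassifier(
    n_estimators=100,
    control_name='control',     # label of the control group in `treatment`
    evaluationFunction='KL',    # splitting criterion
)
uplift_model.fit(X, treatment=treatment, y=y)
uplift_scores = uplift_model.predict(X)  # estimated uplift per treatment group
```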
Value optimization methods
The package supports methods for assigning treatment groups when treatments are costly. To understand the problem, it is helpful to divide populations into the following four categories:
Compliers. Those who will have a favourable outcome if and only if they are treated.
Always-takers. Those who will have a favourable outcome whether or not they are treated.
Never-takers. Those who will never have a favourable outcome whether or not they are treated.
Defiers. Those who will have a favourable outcome if and only if they are not treated.
For a more detailed discussion see e.g. [2].
Counterfactual Unit Selection
[18] propose a method for selecting units for treatment using counterfactual logic. Suppose the benefits of selecting units belonging to the different categories above are as follows:
Compliers: \(\beta\)
Always-takers: \(\gamma\)
Never-takers: \(\theta\)
Defiers: \(\delta\)
If \(X\) denotes an individual's features, the unit selection problem can be formulated as follows:
The problem can be reformulated using counterfactual logic. Suppose \(W = w\) indicates that an individual is treated and \(W = w'\) indicates he or she is untreated. Similarly, let \(F = f\) denote a favourable outcome for the individual and \(F = f'\) an unfavourable outcome. Then the optimization problem becomes:
Note that the above simply follows from the definitions of the relevant user segments. [18] then use counterfactual logic ([21]) to solve the above optimization problem under certain conditions.
N.B. The current implementation in the package is highly experimental.
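To make the objective concrete, here is a hedged sketch that scores a unit by the payoff-weighted sum of its four segment probabilities; the probabilities are assumed to come from an upstream model, and the packaged (experimental) solver is CounterfactualUnitSelector in causalml.optimize:

```python
def selection_score(p_complier, p_alwaystaker, p_nevertaker, p_defier,
                    beta, gamma, theta, delta):
    """Payoff-weighted sum over the four segments; treat units scoring > 0."""
    return (beta * p_complier + gamma * p_alwaystaker
            + theta * p_nevertaker + delta * p_defier)
```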
Counterfactual Value Estimator
The counterfactual value estimation method implemented in the package predicts the outcome for a unit under different treatment conditions using a standard machine learning model. The expected value of assigning a unit into a particular treatment \(w\) is then given by

\[\mathbb{E}[(v - cc_w) Y_w - ic_w]\]
where \(Y_w\) is the probability of a favourable event (such as conversion) under a given treatment \(w\), \(v\) is the value of the favourable event, \(cc_w\) is the cost of the treatment triggered in case of a favourable event, and \(ic_w\) is the cost associated with the treatment whether or not the outcome is favourable. This method builds upon the ideas discussed in [29].
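A numpy sketch of the computation above, where `y_w` holds the predicted probability of the favourable event per unit (rows) and treatment (columns), and `v`, `cc`, `ic` are the value and cost terms; all inputs are illustrative:

```python
import numpy as np

def expected_value(y_w, v, cc, ic):
    """(v - cc_w) * Y_w - ic_w, computed per unit and treatment."""
    return (v - cc) * y_w - ic

# assign each unit to the treatment with the highest expected value, e.g.:
# best_treatment = np.argmax(expected_value(y_w, v, cc, ic), axis=1)
```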
Probabilities of causation
A cause is said to be necessary for an outcome if the outcome would not have occurred in the absence of the cause. A cause is said to be sufficient for an outcome if the outcome would have occurred in the presence of the cause. A cause is said to be necessary and sufficient if both of the above two conditions hold. [27] show that we can calculate bounds for the probability that a cause is of each of the above three types.
To understand how the bounds for the probabilities of causation are calculated, we need special notation to represent counterfactual quantities. Let \(y_t\) represent the proposition “\(y\) would occur if the treatment group was set to ‘treatment’”, \(y^{\prime}_c\) represent the proposition “\(y\) would not occur if the treatment group was set to ‘control’”, and similarly for the remaining two combinations of the (by assumption) binary outcome and treatment variables.
Then the probability that the treatment is sufficient for \(y\) to occur can be defined as

\[PS = P(y_t \mid c, y')\]
This is the probability that \(y\) would occur if the treatment was set to \(t\), when in fact the treatment was set to control and the outcome did not occur.
The probability that the treatment is necessary for \(y\) to occur can be defined as

\[PN = P(y'_c \mid t, y)\]
This is the probability that \(y\) would not occur if the treatment was set to control, while in actuality both \(y\) occurs and the treatment takes place.
Finally, the probability that the treatment is both necessary and sufficient is defined as

\[PNS = P(y_t, y'_c)\]

and states that \(y\) would occur if the treatment took place; and \(y\) would not occur if the treatment did not take place. PNS is related to PN and PS as follows:

\[PNS = P(t, y) PN + P(c, y') PS\]
In bounding the above three quantities, we utilize observational data in addition to experimental data. The observational data are characterized in terms of the joint probabilities:

\[P(y, t), \quad P(y', t), \quad P(y, c), \quad P(y', c)\]
Given this, [27] use the program developed in [8] to obtain sharp bounds of the above three quantities. The main idea in this program is to turn the bounding task into a linear programming problem.
Using the linear programming approach and given certain constraints together with observational data, [27] find that the sharp lower bound for PNS is given by

\[\max \{ 0, \; P(y_t) - P(y_c), \; P(y) - P(y_c), \; P(y_t) - P(y) \}\]

and the sharp upper bound is given by

\[\min \{ P(y_t), \; P(y'_c), \; P(t, y) + P(c, y'), \; P(y_t) - P(y_c) + P(t, y') + P(c, y) \}\]
They use a similar routine to find the bounds for PS and PN. The get_pns_bounds() function calculates the bounds for each of the three probabilities of causation using the results in [27].
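A numpy-free sketch of the PNS bounds written out above, assuming the experimental quantities \(P(y_t)\), \(P(y_c)\) and the observational joint probabilities are given (note \(P(y'_c) = 1 - P(y_c)\)); get_pns_bounds() is the packaged version:

```python
def pns_bounds(p_y_t, p_y_c, p_y, p_t_y, p_t_y_, p_c_y, p_c_y_):
    """Sharp PNS bounds from [27].
    p_y_t = P(y_t), p_y_c = P(y_c): experimental quantities;
    p_y = P(y), p_t_y = P(t, y), p_t_y_ = P(t, y'),
    p_c_y = P(c, y), p_c_y_ = P(c, y'): observational quantities."""
    lower = max(0.0, p_y_t - p_y_c, p_y - p_y_c, p_y_t - p_y)
    upper = min(p_y_t,
                1.0 - p_y_c,                       # P(y'_c) = 1 - P(y_c)
                p_t_y + p_c_y_,
                p_y_t - p_y_c + p_t_y_ + p_c_y)
    return lower, upper
```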
Selected traditional methods
The package supports selected traditional causal inference methods. These are usually used to conduct causal inference with observational (non-experimental) data. In these types of studies, the observed difference between the treatment and the control is in general not equal to the difference between “potential outcomes” \(\mathbb{E}[Y(1) - Y(0)]\). Thus, the methods below try to deal with this problem in different ways.
Matching
The general idea in matching is to find treated and non-treated units that are as similar as possible in terms of their relevant characteristics. As such, matching methods can be seen as part of the family of causal inference approaches that try to mimic randomized controlled trials.
While there are a number of different ways to match treated and non-treated units, the most common method is to use the propensity score:

\[e(x) = P(W = 1 \mid X = x)\]
Treated and non-treated units are then matched in terms of \(e(X)\) using some criterion of distance, such as \(k:1\) nearest neighbours. Because matching is usually between the treated population and the control, this method estimates the average treatment effect on the treated (ATT):

\[ATT = E[Y(1) - Y(0) \mid W = 1]\]
See [24] for a discussion of the strengths and weaknesses of the different matching methods.
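A sketch of \(k:1\) nearest-neighbour matching on the propensity score with CausalML's matching utilities; `df` is an assumed pandas DataFrame with a binary 'treatment' column, and `feature_cols` is an assumed list of covariate columns:

```python
from causalml.match import NearestNeighborMatch
from causalml.propensity import ElasticNetPropensityModel

def match_on_propensity(df, feature_cols):
    df = df.copy()
    # estimate e(x) = P(W = 1 | X = x)
    pm = ElasticNetPropensityModel()
    df['propensity_score'] = pm.fit_predict(df[feature_cols].values,
                                            df['treatment'].values)
    # 1:1 nearest-neighbour matching without replacement on the score
    matcher = NearestNeighborMatch(replace=False, ratio=1, random_state=42)
    return matcher.match(data=df, treatment_col='treatment',
                         score_cols=['propensity_score'])
```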
Inverse probability of treatment weighting
The inverse probability of treatment weighting (IPTW) approach uses the propensity score \(e\) to weight the treated and non-treated populations by the inverse of the probability of the actual treatment \(W\). For a binary treatment \(W \in \{1, 0\}\), the weight is:

\[\frac{W}{e} + \frac{1 - W}{1 - e}\]
In this way, the IPTW approach can be seen as creating an artificial population in which the treated and non-treated units are similar in terms of their observed features \(X\).
One of the possible benefits of IPTW compared to matching is that less data may be discarded due to lack of overlap between treated and non-treated units. A known problem with the approach is that extreme propensity scores can generate highly variable estimators. Different methods have been proposed for trimming and normalizing the IPT weights ([13]). An overview of the IPTW approach can be found in [7].
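A minimal numpy sketch of an IPTW ATE estimator using the weights above, with simple clipping of extreme propensity scores (one of several possible stabilization choices); `y`, `w`, `e_hat` are assumed arrays:

```python
import numpy as np

def iptw_ate(y, w, e_hat, clip=0.01):
    """ATE via inverse probability of treatment weights, with clipping."""
    e = np.clip(e_hat, clip, 1 - clip)  # guard against extreme weights
    return np.mean(w * y / e) - np.mean((1 - w) * y / (1 - e))
```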
2-Stage Least Squares (2SLS)
One of the basic requirements for identifying the treatment effect of \(W\) on \(Y\) is that \(W\) is orthogonal to the potential outcome of \(Y\), conditional on the covariates \(X\). This may be violated if both \(W\) and \(Y\) are affected by an unobserved variable, the error term after removing the true effect of \(W\) from \(Y\), that is not in \(X\). In this case, the instrumental variables approach attempts to estimate the effect of \(W\) on \(Y\) with the help of a third variable \(Z\) that is correlated with \(W\) but is uncorrelated with the error term. In other words, the instrument \(Z\) is related to \(Y\) only through the directed path that goes through \(W\). If these conditions are satisfied, in the case without covariates, the effect of \(W\) on \(Y\) can be estimated using the sample analog of:

\[\frac{Cov(Y, Z)}{Cov(W, Z)}\]

The most common method for instrumental variables estimation is two-stage least squares (2SLS). In this approach, the cause variable \(W\) is first regressed on the instrument \(Z\). Then, in the second stage, the outcome of interest \(Y\) is regressed on the predicted value from the first-stage model. Intuitively, the effect of \(W\) on \(Y\) is estimated by using only the proportion of variation in \(W\) due to variation in \(Z\). Specifically, assume that we have the linear model

\[Y = W \alpha + X \beta + u = \Xi \gamma + u\]

Here for convenience we let \(\Xi = [W, X]\) and \(\gamma = [\alpha', \beta']'\). Assume that we have instrumental variables \(Z\) whose number of columns is at least the number of columns of \(W\), and let \(\Omega = [Z, X]\). The 2SLS estimator is then

\[\hat{\gamma}_{2SLS} = \left( \Xi' \Omega (\Omega' \Omega)^{-1} \Omega' \Xi \right)^{-1} \Xi' \Omega (\Omega' \Omega)^{-1} \Omega' Y\]
See [3] for a detailed discussion of the method.
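A compact numpy sketch of the 2SLS estimator in the projection form written above, assuming 2-D arrays `W`, `X`, `Z` and a 1-D outcome `y`; for production work, a vetted econometrics package is preferable:

```python
import numpy as np

def two_sls(y, W, X, Z):
    """2SLS estimate of gamma = [alpha', beta']' in Y = W*alpha + X*beta + u."""
    Xi = np.column_stack([W, X])      # Xi = [W, X]
    Omega = np.column_stack([Z, X])   # Omega = [Z, X]
    # projection onto the span of Omega: P = Omega (Omega'Omega)^-1 Omega'
    P = Omega @ np.linalg.solve(Omega.T @ Omega, Omega.T)
    return np.linalg.solve(Xi.T @ P @ Xi, Xi.T @ P @ y)
```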
LATE
In many situations the treatment \(W\) may depend on a subject's own choice and cannot be administered directly in an experimental setting. However, one can randomly assign users into treatment/control groups so that users in the treatment group can be nudged to take the treatment. This is the case of noncompliance, where users may fail to comply with their assignment status, \(Z\), as to whether or not to take the treatment. Similar to the Value optimization methods section above, in general there are 3 types of users in this situation:
Compliers. Those who will take the treatment if and only if they are assigned to the treatment group.
Always-takers. Those who will take the treatment regardless of which group they are assigned to.
Never-takers. Those who will not take the treatment regardless of which group they are assigned to.
However, one assumes that there are no Defiers for identification purposes, i.e. those who would take the treatment only if they were assigned to the control group.
In this case one can measure the treatment effect of Compliers:

\[\tau_{LATE} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[W \mid Z = 1] - E[W \mid Z = 0]}\]

This is the Local Average Treatment Effect (LATE). The estimator is also equivalent to 2SLS if we take the assignment status, \(Z\), as an instrument.
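A numpy sketch of this Wald-style LATE estimator, with assignment `z` as the instrument for treatment take-up `w`; all inputs are assumed 0/1 arrays (`y` may be continuous):

```python
import numpy as np

def late_wald(y, w, z):
    """Wald estimator: ITT effect on Y divided by effect of Z on take-up W."""
    itt = y[z == 1].mean() - y[z == 0].mean()      # intent-to-treat effect on Y
    takeup = w[z == 1].mean() - w[z == 0].mean()   # effect of assignment on W
    return itt / takeup
```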
Targeted maximum likelihood estimation (TMLE) for ATE
Targeted maximum likelihood estimation (TMLE) [17] provides a doubly robust semiparametric method that "targets" the average treatment effect directly, with the aid of machine learning algorithms. Compared to other methods, including outcome regression and inverse probability of treatment weighting, TMLE usually gives better performance, especially when dealing with skewed treatment assignment and outliers.
Given binary treatment \(W\), covariates \(X\), and outcome \(Y\), the TMLE for ATE is performed in the following steps:
Step 1
Use cross-fitting to estimate the propensity score \(\hat{e}(x)\), the predicted outcome for the treated \(\hat{m}_1(x)\), and the predicted outcome for the control \(\hat{m}_0(x)\) with machine learning.
Step 2
Scale \(Y\) into \(\tilde{Y}=\frac{Y-\min Y}{\max Y - \min Y}\) so that \(\tilde{Y} \in [0,1]\). Use the same scale function to transform \(\hat{m}_i(x)\) into \(\tilde{m}_i(x)\), \(i=0,1\). Clip the scaled functions so that their values stay in the unit interval.
Step 3
Let \(Q=\log(\tilde{m}_W(X)/(1-\tilde{m}_W(X)))\). Maximize the following pseudo log-likelihood function over the fluctuation parameter \(\epsilon\):

\[\ell(\epsilon) = \sum_i \left[ \tilde{Y}_i \log \frac{1}{1 + e^{-(Q_i + \epsilon H_i)}} + (1 - \tilde{Y}_i) \log \left( 1 - \frac{1}{1 + e^{-(Q_i + \epsilon H_i)}} \right) \right], \quad H_i = \frac{W_i}{\hat{e}(X_i)} - \frac{1 - W_i}{1 - \hat{e}(X_i)}\]
Step 4
Let

\[\tilde{Q}_1 = \frac{1}{1 + e^{-\left( \log\left(\tilde{m}_1(X)/(1-\tilde{m}_1(X))\right) + \hat{\epsilon}/\hat{e}(X) \right)}}, \quad \tilde{Q}_0 = \frac{1}{1 + e^{-\left( \log\left(\tilde{m}_0(X)/(1-\tilde{m}_0(X))\right) - \hat{\epsilon}/(1-\hat{e}(X)) \right)}}\]
The ATE estimate is the sample average of the differences between \(\tilde{Q}_1\) and \(\tilde{Q}_0\), after rescaling to the original range.
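A hedged sketch of the targeting step (Steps 2 to 4) using a standard single-fluctuation TMLE update with the "clever covariate" \(H = W/\hat{e}(X) - (1-W)/(1-\hat{e}(X))\); the package's implementation may differ in detail. Inputs are the cross-fitted Step 1 nuisance estimates:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit, logit

def tmle_ate(y, w, e_hat, m0_hat, m1_hat, clip=1e-3):
    # Step 2: scale y and the outcome predictions to [0, 1], then clip
    lo, hi = y.min(), y.max()
    y_s = (y - lo) / (hi - lo)
    m0_s = np.clip((m0_hat - lo) / (hi - lo), clip, 1 - clip)
    m1_s = np.clip((m1_hat - lo) / (hi - lo), clip, 1 - clip)

    # Step 3: fit the fluctuation parameter eps by maximizing the pseudo
    # log-likelihood with offset Q and clever covariate H
    q = logit(np.where(w == 1, m1_s, m0_s))
    h = w / e_hat - (1 - w) / (1 - e_hat)

    def negloglik(eps):
        p = np.clip(expit(q + eps * h), clip, 1 - clip)
        return -np.sum(y_s * np.log(p) + (1 - y_s) * np.log(1 - p))

    eps = minimize_scalar(negloglik).x

    # Step 4: targeted predictions under both arms, rescaled to original units
    q1 = expit(logit(m1_s) + eps / e_hat)
    q0 = expit(logit(m0_s) - eps / (1 - e_hat))
    return float(np.mean(q1 - q0) * (hi - lo))
```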