X_train – (np.ndarray), training subsample of the feature matrix, (n_train_sample, n_features)
X_test – (np.ndarray), test subsample of the feature matrix, (n_test_sample, n_features)
inbag – (ndarray, optional),
The inbag matrix recording how the data fit each tree. If set to None
(default) it will be inferred from the forest. However, this only works
for trees for which bootstrapping was set to True, that is, if sampling
was done with replacement. Otherwise, users need to provide their own
inbag matrix.
calibrate – (boolean, optional)
Whether to apply calibration to mitigate Monte Carlo noise.
Some variance estimates may be negative due to Monte Carlo effects if
the number of trees in the forest is too small.
Default: True
memory_constrained – (boolean, optional)
Whether or not there is a restriction on memory. If False, it is
assumed that an ndarray of shape (n_train_sample, n_test_sample) fits
in main memory. Setting to True can actually provide a speedup if
memory_limit is tuned to the optimal range.
memory_limit – (int, optional)
An upper bound for how much memory the intermediate matrices will take
up in Megabytes. This must be provided if memory_constrained=True.
Returns:
(np.ndarray), An array with the unbiased sampling variance for a RandomForest object.
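The computation described above is in the spirit of the infinitesimal jackknife of Wager, Hastie and Efron; here is a minimal numpy sketch of that idea. The `ij_variance` helper is hypothetical, and the Monte Carlo bias correction and calibration mentioned above are omitted:

```python
import numpy as np

def ij_variance(inbag, tree_preds):
    """Infinitesimal-jackknife variance sketch (hypothetical helper).

    inbag      : (n_train, n_trees) bootstrap counts per tree
    tree_preds : (n_test, n_trees) per-tree predictions for the test points
    """
    n_trees = inbag.shape[1]
    pred_centered = tree_preds - tree_preds.mean(axis=1, keepdims=True)
    inbag_centered = inbag - inbag.mean(axis=1, keepdims=True)
    # Covariance between inbag counts and tree predictions, per (train, test) pair
    cov = inbag_centered @ pred_centered.T / n_trees        # (n_train, n_test)
    return (cov ** 2).sum(axis=0)                           # raw IJ variance per test point

rng = np.random.default_rng(0)
inbag = rng.poisson(1.0, size=(50, 200)).astype(float)      # stand-in inbag matrix
tree_preds = rng.normal(size=(10, 200))                     # stand-in per-tree predictions
var = ij_variance(inbag, tree_preds)                        # one variance per test sample
```

This raw estimate is why `calibrate` exists: without the bias correction, small forests produce noisy (even negative, after correction) variance estimates.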
A Causal Tree regressor class.
The Causal Tree is a decision tree regressor with a split criterion based on treatment effects.
Details are available in Athey and Imbens (2015) (https://arxiv.org/abs/1504.01132).
Tree node class to contain all the statistics of the tree node.
Parameters:
classes (list of str) – A list of the control and treatment group names.
col (int, optional (default = -1)) – The column index for splitting the tree node to children nodes.
value (float, optional (default = None)) – The value of the feature column to split the tree node to children nodes.
trueBranch (object of DecisionTree) – The true branch tree node (feature > value).
falseBranch (object of DecisionTree) – The false branch tree node (feature > value).
results (list of float) – The classification probability P(Y=1|T) for each of the control and treatment groups
in the tree node.
summary (list of list) – Summary statistics of the tree nodes, including impurity, sample size, uplift score, etc.
maxDiffTreatment (int) – The treatment index generating the maximum difference between the treatment and control groups.
maxDiffSign (float) – The sign of the maximum difference (1. or -1.).
nodeSummary (list of list) – Summary statistics of the tree node [P(Y=1|T), N(T)], where P(Y=1|T) is the positive outcome probability
and N(T) is the sample size for each treatment group.
backupResults (list of float) – The positive probabilities in each of the control and treatment groups in the parent node. The parent
node information serves as a backup for the children nodes: if no valid statistics can be calculated from a
child node, the parent node information is used instead.
bestTreatment (int) – The treatment index providing the best uplift (treatment effect).
upliftScore (list) – The uplift score of this node: [max_Diff, p_value], where max_Diff stands for the maximum treatment effect, and
p_value stands for the p_value of the treatment effect.
matchScore (float) – The uplift score by filling a trained tree with validation dataset or testing dataset.
n_estimators (integer, optional (default=10)) – The number of trees in the uplift random forest.
evaluationFunction (string) – Choose from one of the models: ‘KL’, ‘ED’, ‘Chi’, ‘CTS’, ‘DDP’, ‘IT’, ‘CIT’, ‘IDDP’.
max_features (int, optional (default=10)) – The number of features to consider when looking for the best split.
random_state (int, RandomState instance or None (default=None)) – A random seed or np.random.RandomState to control randomness in building the trees and forest.
max_depth (int, optional (default=5)) – The maximum depth of the tree.
min_samples_leaf (int, optional (default=100)) – The minimum number of samples required to be split at a leaf node.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012, the
weight (in terms of sample size) of the parent node influence on the
child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
early_stopping_eval_diff_scale (float, optional (default=1)) – Stop growing the tree if the difference between the train and validation uplift scores exceeds
min(train_uplift_score, valid_uplift_score) / early_stopping_eval_diff_scale.
control_name (string) – The name of the control group (other experiment groups will be regarded as treatment groups)
normalization (boolean, optional (default=True)) – The normalization factor defined in Rzepakowski et al. 2012,
correcting for tests with large number of splits and imbalanced
treatment and control splits
honesty (bool (default=False)) – True if the honest approach based on “Athey, S., & Imbens, G. (2016). Recursive partitioning for
heterogeneous causal effects.” shall be used.
estimation_sample_size (float (default=0.5)) – Sample size for estimating the CATE score in the leaves if honesty == True.
n_jobs (int, optional (default=-1)) – The parallelization parameter to define how many parallel jobs need to be created.
This is passed on to joblib library for parallelizing uplift-tree creation and prediction.
joblib_prefer (str, optional (default="threads")) – The preferred backend for joblib (passed as prefer to joblib.Parallel). See the joblib
documentation for valid values.
Returns:
df_res (pandas DataFrame) – A user-level results dataframe containing the estimated individual treatment effect.
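At prediction time the forest aggregates its trees by averaging their per-tree uplift estimates; a toy numpy sketch of that aggregation step, where the per-tree arrays stand in for fitted tree predictions:

```python
import numpy as np

# Hypothetical stand-in: each "tree" here is just a per-sample array of
# predicted treatment effects, shape (num_samples, num_treatments).
rng = np.random.default_rng(42)
per_tree_effects = [rng.normal(loc=0.1, scale=0.05, size=(5, 2)) for _ in range(10)]

# Forest-level prediction: average the individual tree estimates.
forest_effect = np.mean(per_tree_effects, axis=0)   # (num_samples, num_treatments)
```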
static bootstrap(X, treatment, y, X_val, treatment_val, y_val, tree)
fit(X, treatment, y, X_val=None, treatment_val=None, y_val=None)
Fit the UpliftRandomForestClassifier.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment (array-like, shape = [num_samples]) – An array containing the treatment group for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
X_val (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to validate the uplift model.
treatment_val (array-like, shape = [num_samples]) – An array containing the validation treatment group for each unit.
y_val (array-like, shape = [num_samples]) – An array containing the validation outcome of interest for each unit.
Returns the recommended treatment group and predicted optimal
probability conditional on using the recommended treatment group.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
full_output (bool, optional (default=False)) – Whether the UpliftTree algorithm returns upliftScores, pred_nodes
alongside the recommended treatment group and p_hat in the treatment group.
Returns:
y_pred_list (ndarray, shape = (num_samples, num_treatments])) – An ndarray containing the predicted treatment effect of each treatment group for each sample
df_res (DataFrame, shape = [num_samples, (num_treatments * 2 + 3)]) – If full_output is True, a DataFrame containing the predicted outcome of each treatment and
control group, the treatment effect of each treatment group, the treatment group with the
highest treatment effect, and the maximum treatment effect for each sample.
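The recommended treatment group can be derived from the returned effect matrix by taking the per-row maximum; an illustrative sketch (the numbers and the control-fallback convention are hypothetical):

```python
import numpy as np

# Hypothetical predicted treatment effects: rows = samples, cols = treatment groups.
y_pred = np.array([[0.02, 0.10],
                   [0.07, 0.01],
                   [-0.03, -0.01]])

best_idx = y_pred.argmax(axis=1)          # index of the best treatment per sample
best_effect = y_pred.max(axis=1)          # the corresponding maximum effect
recommend_control = best_effect <= 0      # one convention: no treatment if no positive effect
```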
An uplift tree classifier estimates the individual treatment effect by modifying the loss function of a
classification tree.
The uplift tree classifier is used in uplift random forest to construct the trees in the forest.
Parameters:
evaluationFunction (string) – Choose from one of the models: ‘KL’, ‘ED’, ‘Chi’, ‘CTS’, ‘DDP’, ‘IT’, ‘CIT’, ‘IDDP’.
max_features (int, optional (default=None)) – The number of features to consider when looking for the best split.
max_depth (int, optional (default=3)) – The maximum depth of the tree.
min_samples_leaf (int, optional (default=100)) – The minimum number of samples required to be split at a leaf node.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=100)) – The regularization parameter defined in Rzepakowski et al. 2012, the weight (in terms of sample size) of the
parent node influence on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
early_stopping_eval_diff_scale (float, optional (default=1)) – Stop growing the tree if the difference between the train and validation uplift scores exceeds
min(train_uplift_score, valid_uplift_score) / early_stopping_eval_diff_scale.
control_name (string) – The name of the control group (other experiment groups will be regarded as treatment groups).
normalization (boolean, optional (default=True)) – The normalization factor defined in Rzepakowski et al. 2012, correcting for tests with large number of splits
and imbalanced treatment and control splits.
honesty (bool (default=False)) – True if the honest approach based on “Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects.”
shall be used. If ‘IDDP’ is used as evaluation function, this parameter is automatically set to true.
estimation_sample_size (float (default=0.5)) – Sample size for estimating the CATE score in the leaves if honesty == True.
random_state (int, RandomState instance or None (default=None)) – A random seed or np.random.RandomState to control randomness in building a tree.
Calculate the likelihood ratio test statistic as the split evaluation criterion for a given node.
NOTE: n_class should be 2.
Parameters:
cur_node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
cur_node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the current node, i.e. [N(T=i)…]
left_node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the left node, i.e. [P(Y=1|T=i)…]
left_node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the left node, i.e. [N(T=i)…]
right_node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the right node, i.e. [P(Y=1|T=i)…]
right_node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the right node, i.e. [N(T=i)…]
Calculate CTS (conditional treatment selection) as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the current node, i.e. [N(T=i)…]
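As a rough illustration of the CTS idea — conditional treatment selection rewards splits whose children do well when each child is assigned its best-performing arm — here is a hedged numpy sketch. It is not the library's exact `evaluate_CTS` implementation:

```python
import numpy as np

# Hypothetical node summaries: P(Y=1|T=i) and N(T=i) for each group
# (index 0 = control, 1 = treatment).
left_p,  left_n  = np.array([0.10, 0.30]), np.array([100, 100])
right_p, right_n = np.array([0.20, 0.15]), np.array([100, 100])

def cts_value(p):
    # CTS-style value of a node: the outcome under its best-performing arm.
    return p.max()

n_left, n_right = left_n.sum(), right_n.sum()
n_total = n_left + n_right
# Sample-size-weighted value of the split.
split_value = (n_left / n_total) * cts_value(left_p) + (n_right / n_total) * cts_value(right_p)
```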
Calculate Chi-Square statistic as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the current node, i.e. [N(T=i)…]
Calculate Delta P as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the current node, i.e. [N(T=i)…]
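Delta P is simply the difference between the treatment and control positive probabilities; the DDP ("delta-delta-p") criterion then scores a split by how far apart the children's Delta P values are. A hedged sketch with example numbers:

```python
import numpy as np

# Hypothetical node summary: index 0 = control, 1 = treatment.
node_p = np.array([0.12, 0.20])     # P(Y=1|T=i)

delta_p = node_p[1] - node_p[0]     # treatment effect estimate at this node

# DDP idea: score a split by the gap between the children's delta-p values.
left_p, right_p = np.array([0.10, 0.25]), np.array([0.14, 0.15])
ddp_gain = abs((left_p[1] - left_p[0]) - (right_p[1] - right_p[0]))
```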
Calculate Euclidean Distance as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the current node, i.e. [N(T=i)…]
Calculate Delta P as split evaluation criterion for a given node.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the current node, i.e. [N(T=i)…]
Calculate the squared t-statistic as the split evaluation criterion for a given node.
NOTE: n_class should be 2.
Parameters:
left_node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the left node, i.e. [P(Y=1|T=i)…]
left_node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the left node, i.e. [N(T=i)…]
right_node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the right node, i.e. [P(Y=1|T=i)…]
right_node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the right node, i.e. [N(T=i)…]
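The squared t-statistic compares the treatment effects of the two children, standardized by the pooled binomial variance; a sketch of that idea under the binary-outcome assumption (not the library's exact code):

```python
import numpy as np

def squared_t(left_p, left_n, right_p, right_n):
    """Squared t-statistic for the difference in treatment effects between
    the left and right child (binary outcome; index 0 = control, 1 = treatment)."""
    diff = (left_p[1] - left_p[0]) - (right_p[1] - right_p[0])
    # Pooled variance of the effect difference, from the four Bernoulli cells.
    var = sum(p * (1 - p) / n
              for p, n in zip(np.r_[left_p, right_p], np.r_[left_n, right_n]))
    return diff ** 2 / var

left_p,  left_n  = np.array([0.10, 0.30]), np.array([200, 200])
right_p, right_n = np.array([0.20, 0.18]), np.array([200, 200])
t2 = squared_t(left_p, left_n, right_p, right_n)
```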
Calculate KL Divergence as split evaluation criterion for a given node.
Modified to accept new node summary format.
Parameters:
node_summary_p (array of shape [n_class]) – Has type numpy.double.
The positive probabilities of each of the control
and treatment groups of the current node, i.e. [P(Y=1|T=i)…]
node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the current node, i.e. [N(T=i)…]
cur_node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the current node, i.e. [N(T=i)…]
left_node_summary_n (array of shape [n_class]) – Has type numpy.int32.
The counts of each of the control
and treatment groups of the left node, i.e. [N(T=i)…]
alpha (float) – The weight used to balance different normalization parts.
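The KL, Euclidean-distance and chi-squared criteria all measure how far each treatment group's outcome distribution is from the control's; a numpy sketch of the three divergences for Bernoulli outcomes (a simplification of the library's code, which also handles normalization):

```python
import numpy as np

def kl_bernoulli(pk, qk, eps=1e-6):
    """KL divergence between two Bernoulli distributions (clipped for stability)."""
    pk = np.clip(pk, eps, 1 - eps)
    qk = np.clip(qk, eps, 1 - eps)
    return pk * np.log(pk / qk) + (1 - pk) * np.log((1 - pk) / (1 - qk))

# Hypothetical node summary: index 0 = control, 1.. = treatment groups.
p = np.array([0.10, 0.18, 0.25])     # P(Y=1|T=i)

# Each criterion sums a divergence between treatment and control distributions.
kl = sum(kl_bernoulli(p[i], p[0]) for i in range(1, len(p)))
ed = sum(2 * (p[i] - p[0]) ** 2 for i in range(1, len(p)))      # Euclidean distance
chi = sum((p[i] - p[0]) ** 2 / p[0] + (p[i] - p[0]) ** 2 / (1 - p[0])
          for i in range(1, len(p)))                            # chi-squared divergence
```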
Classify (predict) the observations according to the tree.
Parameters:
observations (list of list) – The internal data format for the training data (combining X, Y, treatment).
dataMissing (boolean, optional (default = False)) – An indicator for whether the data are missing or not.
Returns:
The results in the leaf node.
Return type:
tree.results, tree.upliftScore
static divideSet(X, treatment_idx, y, column, value)
Tree node split.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group index for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
column (int) – The column used to split the data.
value (float or int) – The value in the column for splitting the data.
Returns:
(X_l, X_r, treatment_l, treatment_r, y_l, y_r) – The covariates, treatments and outcomes of left node and the right node.
Return type:
list of ndarray
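A split like divideSet can be sketched with a boolean mask: numeric columns split on a value comparison, categorical ones on equality. The helper below is a simplified stand-in, not the library's implementation:

```python
import numpy as np

def divide_set(X, treatment_idx, y, column, value):
    """Sketch of a tree-node split: numeric columns split on >= value,
    other columns on equality."""
    if isinstance(value, (int, float, np.integer, np.floating)):
        mask = X[:, column] >= value
    else:
        mask = X[:, column] == value
    return (X[mask], X[~mask], treatment_idx[mask], treatment_idx[~mask],
            y[mask], y[~mask])

X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0]])
treatment_idx = np.array([0, 1, 1])
y = np.array([0, 1, 1])
X_l, X_r, w_l, w_r, y_l, y_r = divide_set(X, treatment_idx, y, column=1, value=5.0)
```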
static divideSet_len(X, treatment_idx, y, column, value)
Tree node split.
Modified from divideSet(), but returns len(X_l) and
len(X_r) instead of the split X_l and X_r, to avoid some
overhead. Intended to be used when searching for the best split;
once the best split is found, divideSet() can be used to obtain X_l and X_r.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group index for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
column (int) – The column used to split the data.
value (float or int) – The value in the column for splitting the data.
Returns:
(len_X_l, len_X_r, treatment_l, treatment_r, y_l, y_r) – The number of covariate rows, and the treatments and outcomes, of the left node and the right node.
Return type:
list of ndarray
static evaluate_CIT(currentNodeSummary, leftNodeSummary, rightNodeSummary, y_l, y_r, w_l, w_r, y, w)
Calculate the likelihood ratio test statistic as the split evaluation criterion for a given node.
:param currentNodeSummary: The parent node summary statistics
:type currentNodeSummary: list of lists
:param leftNodeSummary: The left node summary statistics.
:type leftNodeSummary: list of lists
:param rightNodeSummary: The right node summary statistics.
:type rightNodeSummary: list of lists
:param y_l: An array containing the outcome of interest for each unit in the left node
:type y_l: array-like, shape = [num_samples]
:param y_r: An array containing the outcome of interest for each unit in the right node
:type y_r: array-like, shape = [num_samples]
:param w_l: An array containing the treatment for each unit in the left node
:type w_l: array-like, shape = [num_samples]
:param w_r: An array containing the treatment for each unit in the right node
:type w_r: array-like, shape = [num_samples]
:param y: An array containing the outcome of interest for each unit
:type y: array-like, shape = [num_samples]
:param w: An array containing the treatment for each unit
:type w: array-like, shape = [num_samples]
Fill the data into an existing tree.
This is a higher-level function to transform the original data inputs
into lower level data inputs (list of list and tree).
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment (array-like, shape = [num_samples]) – An array containing the treatment group for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group idx for each unit.
The dtype should be numpy.int8.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
X_val (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to validate the uplift model.
treatment_val_idx (array-like, shape = [num_samples]) – An array containing the validation treatment group idx for each unit.
y_val (array-like, shape = [num_samples]) – An array containing the validation outcome of interest for each unit.
max_depth (int, optional (default=10)) – The maximum depth of the tree.
min_samples_leaf (int, optional (default=100)) – The minimum number of samples required to be split at a leaf node.
depth (int, optional (default = 1)) – The current depth.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012,
the weight (in terms of sample size) of the parent node influence
on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
parentNodeSummary_p (array-like, shape [n_class]) – Node summary probability statistics of the parent tree node.
Apply the honest approach based on “Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects.”
:param X_est: An ndarray of the covariates used to calculate the unbiased estimates in the leaves of the decision tree.
:type X_est: ndarray, shape = [num_samples, num_features]
:param T_est: An array containing the treatment group for each unit.
:type T_est: array-like, shape = [num_samples]
:param Y_est: An array containing the outcome of interest for each unit.
:type Y_est: array-like, shape = [num_samples]
Modifies the leaves of the current decision tree to only contain unbiased estimates.
Applies the honest approach based on “Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects.”
:param X_est: An ndarray of the covariates used to calculate the unbiased estimates in the leaves of the decision tree.
:type X_est: ndarray, shape = [num_samples, num_features]
:param T_est: An array containing the treatment group for each unit.
:type T_est: array-like, shape = [num_samples]
:param Y_est: An array containing the outcome of interest for each unit.
:type Y_est: array-like, shape = [num_samples]
:param tree: object of DecisionTree class - the current decision tree that shall be modified
:type tree: object
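The honest approach can be sketched as: learn the tree structure on one half of the data, then recompute each leaf's effect on the held-out half. A toy numpy illustration in which the "tree" is a single hard-coded split at x >= 0:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=400)
w = rng.integers(0, 2, size=400)                       # treatment indicator
# True effect is 0.5 only for x >= 0 (purely illustrative data).
y = 0.5 * w * (x >= 0) + rng.normal(scale=0.1, size=400)

leaf = (x >= 0).astype(int)      # leaf membership from the "fitted" tree
est = np.zeros(x.size, dtype=bool)
est[200:] = True                 # held-out estimation half; first half "grew" the tree

# Honest step: re-estimate each leaf's effect only on the estimation half.
effects = []
for leaf_id in (0, 1):
    m = (leaf == leaf_id) & est
    effects.append(y[m & (w == 1)].mean() - y[m & (w == 0)].mean())
```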
Returns the recommended treatment group and predicted optimal
probability conditional on using the recommended treatment group.
Parameters:
X (ndarray, shape = [num_samples, num_features]) – An ndarray of the covariates used to train the uplift model.
Returns:
pred – An ndarray of predicted treatment effects across treatments.
Return type:
ndarray, shape = [num_samples, num_treatments]
prune(X, treatment, y, minGain=0.0001, rule='maxAbsDiff')
Prune the uplift model.
:param X: An ndarray of the covariates used to train the uplift model.
:type X: ndarray, shape = [num_samples, num_features]
:param treatment: An array containing the treatment group for each unit.
:type treatment: array-like, shape = [num_samples]
:param y: An array containing the outcome of interest for each unit.
:type y: array-like, shape = [num_samples]
:param minGain: The minimum gain required to make a tree node split. The children
tree branches are trimmed if the actual split gain is less than
the minimum gain.
Parameters:
rule (string, optional (default = 'maxAbsDiff')) – The prune rules. Supported values are ‘maxAbsDiff’ for optimizing
the maximum absolute difference, and ‘bestUplift’ for optimizing
the node-size weighted treatment effect.
Returns:
self
Return type:
object
pruneTree(X, treatment_idx, y, tree, rule='maxAbsDiff', minGain=0.0, n_reg=0, parentNodeSummary=None)
Prune one single tree node in the uplift model.
:param X: An ndarray of the covariates used to train the uplift model.
:type X: ndarray, shape = [num_samples, num_features]
:param treatment_idx: An array containing the treatment group index for each unit.
:type treatment_idx: array-like, shape = [num_samples]
:param y: An array containing the outcome of interest for each unit.
:type y: array-like, shape = [num_samples]
:param rule: The prune rules. Supported values are ‘maxAbsDiff’ for optimizing the maximum absolute difference, and
‘bestUplift’ for optimizing the node-size weighted treatment effect.
Parameters:
minGain (float, optional (default = 0.)) – The minimum gain required to make a tree node split. The children tree branches are trimmed if the actual
split gain is less than the minimum gain.
n_reg (int, optional (default=0)) – The regularization parameter defined in Rzepakowski et al. 2012, the weight (in terms of sample size) of the
parent node influence on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
parentNodeSummary (list of list, optional (default = None)) – Node summary statistics, [P(Y=1|T), N(T)] of the parent tree node.
Returns:
self
Return type:
object
tree_node_summary(treatment_idx, y, min_samples_treatment=10, n_reg=100, parentNodeSummary=None)
Tree node summary statistics.
Parameters:
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group index for each unit.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012,
the weight (in terms of sample size) of the parent node influence
on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
parentNodeSummary (list of list) – The positive probabilities and sample sizes of each of the control and treatment groups
in the parent node.
Returns:
nodeSummary – The positive probabilities and sample sizes of each of the control and treatment groups
in the current node.
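The n_reg regularization shrinks each group's probability toward the parent node's estimate, with n_reg acting as a pseudo-sample size; a hedged sketch of that shrinkage (check the source for the exact gating by min_samples_treatment):

```python
import numpy as np

def node_summary(y, treatment_idx, parent_p, n_reg=100):
    """Sketch of the regularized node summary from Rzepakowski et al. 2012:
    each group's positive probability is shrunk toward the parent node's
    estimate, with n_reg acting as a pseudo-sample size."""
    summary = []
    for t, p_parent in enumerate(parent_p):
        mask = treatment_idx == t
        n = mask.sum()
        n_pos = y[mask].sum()
        p = (n_pos + p_parent * n_reg) / (n + n_reg)   # shrinkage toward parent
        summary.append([p, int(n)])
    return summary

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
treatment_idx = np.array([0, 0, 0, 0, 1, 1, 1, 1])
summary = node_summary(y, treatment_idx, parent_p=[0.5, 0.5], n_reg=4)
```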
Modified from tree_node_summary_to_arr, to use different
format for the summary and to calculate based on already
calculated group counts. Instead of [[P(Y=1|T=0), N(T=0)],
[P(Y=1|T=1), N(T=1)], …], use two arrays [N(T=i)…] and
[P(Y=1|T=i)…].
Parameters:
group_count_arr (array of shape [2*n_class]) – Has type numpy.int32.
The group counts, where entry 2*i is N(Y=0, T=i),
and entry 2*i+1 is N(Y=1, T=i).
out_summary_p (array of shape [n_class]) – Has type numpy.double.
To be filled with the positive probabilities of each of the control
and treatment groups of the current node.
out_summary_n (array of shape [n_class]) – Has type numpy.int32.
To be filled with the counts of each of the control
and treatment groups of the current node.
parentNodeSummary_p (array of shape [n_class]) – The positive probabilities of each of the control and treatment groups
in the parent node.
has_parent_summary (bool as int) – If True (non-zero), parentNodeSummary_p contains valid parent node summary probabilities.
If False (0), no parent node summary is assumed and parentNodeSummary_p is not touched.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012,
the weight (in terms of sample size) of the parent node influence
on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
Return type:
No return values, but will modify out_summary_p and out_summary_n.
static tree_node_summary_to_arr(treatment_idx, y, out_summary_p, out_summary_n, buf_count_arr, parentNodeSummary_p, has_parent_summary, min_samples_treatment=10, n_reg=100)
Tree node summary statistics.
Modified from tree_node_summary, to use different format for the summary.
Instead of [[P(Y=1|T=0), N(T=0)], [P(Y=1|T=1), N(T=1)], …],
use two arrays [N(T=i)…] and [P(Y=1|T=i)…].
Parameters:
treatment_idx (array-like, shape = [num_samples]) – An array containing the treatment group index for each unit.
Has type numpy.int8.
y (array-like, shape = [num_samples]) – An array containing the outcome of interest for each unit.
Has type numpy.int8.
out_summary_p (array of shape [n_class]) – Has type numpy.double.
To be filled with the positive probabilities of each of the control
and treatment groups of the current node.
out_summary_n (array of shape [n_class]) – Has type numpy.int32.
To be filled with the counts of each of the control
and treatment groups of the current node.
buf_count_arr (array of shape [2*n_class]) – Has type numpy.int32.
To be used as a temporary buffer for group_uniqueCounts_to_arr.
parentNodeSummary_p (array of shape [n_class]) – The positive probabilities of each of the control and treatment groups
in the parent node.
has_parent_summary (bool as int) – If True (non-zero), parentNodeSummary_p contains valid parent node summary probabilities.
If False (0), no parent node summary is assumed and parentNodeSummary_p is not touched.
min_samples_treatment (int, optional (default=10)) – The minimum number of samples required of the experiment group to be split at a leaf node.
n_reg (int, optional (default=10)) – The regularization parameter defined in Rzepakowski et al. 2012,
the weight (in terms of sample size) of the parent node influence
on the child node, only effective for ‘KL’, ‘ED’, ‘Chi’, ‘CTS’ methods.
Return type:
No return values, but will modify out_summary_p and out_summary_n.
Create a distribution plot (distplot) of the tree leaf values.
:param tree: (CausalTreeRegressor), Tree object
:param title: (str), plot title
:param figsize: (tuple), figure size
:param fontsize: (int), title font size
estimate_ate(X, treatment, y, p=None, bootstrap_ci=False, n_bootstraps=1000, bootstrap_size=10000, seed=None, pretrain=False)
Estimate the Average Treatment Effect (ATE).
Parameters:
X (np.matrix or np.array or pd.DataFrame) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
seed (int) – random seed for cross-fitting
pretrain (bool) – whether a model has been fit, default False.
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
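The bootstrap confidence interval can be sketched as a percentile bootstrap over resampled individual effect estimates. The helper below is illustrative, not the library's bootstrap machinery:

```python
import numpy as np

def bootstrap_ate_ci(te, n_bootstraps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the ATE from individual treatment
    effect estimates te (a sketch, not the library's implementation)."""
    rng = np.random.default_rng(seed)
    n = te.shape[0]
    # Resample units with replacement and recompute the ATE each time.
    ates = np.array([te[rng.integers(0, n, size=n)].mean()
                     for _ in range(n_bootstraps)])
    return te.mean(), np.quantile(ates, alpha / 2), np.quantile(ates, 1 - alpha / 2)

rng = np.random.default_rng(1)
te = rng.normal(loc=0.3, scale=1.0, size=2000)   # stand-in individual effects
ate, lb, ub = bootstrap_ate_ci(te)
```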
X (np.matrix or np.array or pd.DataFrame) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
seed (int) – random seed for cross-fitting
fit_predict(X, treatment, y, p=None, return_ci=False, n_bootstraps=1000, bootstrap_size=10000, return_components=False, verbose=True, seed=None)
Fit the treatment effect and outcome models of the R learner and predict treatment effects.
Parameters:
X (np.matrix or np.array or pd.DataFrame) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
return_ci (bool) – whether to return confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
return_components (bool, optional) – whether to return outcome for treatment and control separately
verbose (bool) – whether to output progress logs
seed (int) – random seed for cross-fitting
Returns:
Predictions of treatment effects. Output dim: [n_samples, n_treatment]
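The R-learner behind this class regresses the residualized outcome on the residualized treatment; a minimal constant-effect sketch with crude sample-mean nuisance models (the real class cross-fits ML models for m(x) and e(x)):

```python
import numpy as np

# Minimal R-learner sketch (after Nie & Wager): regress the residualized
# outcome on the residualized treatment.  Nuisance models here are plain
# sample means; the real implementation cross-fits ML models instead.
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 1))
w = rng.integers(0, 2, size=n)             # randomized treatment, e(x) = 0.5
tau = 0.7                                  # true constant treatment effect
y = X[:, 0] + tau * w + rng.normal(size=n)

m_hat = y.mean()                           # crude outcome model E[Y|X]
e_hat = w.mean()                           # crude propensity model E[W|X]
pseudo = (y - m_hat) / (w - e_hat)         # R-learner pseudo-outcome
weights = (w - e_hat) ** 2                 # R-learner weights

# Weighted mean of the pseudo-outcome = constant-effect R-learner solution.
tau_hat = np.average(pseudo, weights=weights)
```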
fit(X, treatment, y, p=None, sample_weight=None, verbose=True)
Fit the treatment effect and outcome models of the R learner.
Parameters:
X (np.matrix or np.array or pd.DataFrame) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
verbose (bool, optional) – whether to output progress logs
X (np.matrix or np.array or pd.DataFrame) – a feature matrix
treatment (np.array or pd.Series) – only needed when pretrain=False, a treatment vector
y (np.array or pd.Series) – only needed when pretrain=False, an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
pretrain (bool) – whether a model has been fit, default False.
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
fit(X, treatment, y, p=None, sample_weight=None, verbose=True)[source]¶
Fit the treatment effect and outcome models of the R learner.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
verbose (bool, optional) – whether to output progress logs
fit_predict(X, treatment, y, p=None, sample_weight=None, return_ci=False, n_bootstraps=1000, bootstrap_size=10000, verbose=True)[source]¶
Fit the treatment effect and outcome models of the R learner and predict treatment effects.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
return_ci (bool) – whether to return confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
verbose (bool) – whether to output progress logs
Returns:
Predictions of treatment effects. Output dim: [n_samples, n_treatment].
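The R-learner's residual-on-residual idea behind fit_predict can be illustrated with a minimal numpy sketch. This toy example assumes a randomized assignment and known nuisance functions, whereas the library estimates them with cross-fitted ML models; it is an illustration of the R-loss, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau_true = 5000, 2.0
X = rng.normal(size=(n, 3))
e = 0.5                                   # randomized assignment: known propensity
w = rng.binomial(1, e, size=n)
m = X @ np.array([1.0, -1.0, 0.5])        # baseline chosen so that E[y | x] = m(x)
y = m + tau_true * (w - e) + rng.normal(scale=0.1, size=n)

# Residualize outcome and treatment. In the library these nuisances come
# from cross-fitted outcome and propensity models; here they are known.
y_res = y - m
w_res = w - e

# R-loss: minimize sum((y_res - tau * w_res)^2). With a constant tau the
# minimizer is a simple weighted least-squares coefficient.
tau_hat = np.sum(w_res * y_res) / np.sum(w_res**2)
```

With heterogeneous effects, the same loss is minimized over a flexible tau(x) model instead of a single coefficient.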
A parent class for S-learner classes.
An S-learner estimates treatment effects with one machine learning model.
Details of S-learner are available at Kunzel et al. (2018) (https://arxiv.org/abs/1706.03461).
estimate_ate(X, treatment, y, p=None, return_ci=False, bootstrap_ci=False, n_bootstraps=1000, bootstrap_size=10000, pretrain=False)[source]¶
Estimate the Average Treatment Effect (ATE).
Parameters:
X (np.matrix, np.array, or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
return_ci (bool, optional) – whether to return confidence intervals
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
pretrain (bool) – whether a model has been fit, default False.
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
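The S-learner recipe (one model over features plus the treatment indicator) can be sketched with a plain linear model. This is a hypothetical minimal example in numpy; the library accepts any sklearn-style learner, not just a linear fit.

```python
import numpy as np

rng = np.random.default_rng(42)
n, tau_true = 4000, 1.5
X = rng.normal(size=(n, 2))
w = rng.binomial(1, 0.5, size=n)
y = X @ np.array([0.3, -0.2]) + tau_true * w + rng.normal(scale=0.1, size=n)

# S-learner: a single model fit on [X, w] (design matrix with intercept).
D = np.column_stack([np.ones(n), X, w])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)

# ATE = mean over units of f(X, w=1) - f(X, w=0); for a linear model
# this reduces to the coefficient on w.
D1 = np.column_stack([np.ones(n), X, np.ones(n)])
D0 = np.column_stack([np.ones(n), X, np.zeros(n)])
ate_hat = np.mean(D1 @ beta - D0 @ beta)
```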
Fit the inference model
:param X: a feature matrix
:type X: np.matrix, np.array, or pd.Dataframe
:param treatment: a treatment vector
:type treatment: np.array or pd.Series
:param y: an outcome vector
:type y: np.array or pd.Series
fit_predict(X, treatment, y, p=None, return_ci=False, n_bootstraps=1000, bootstrap_size=10000, return_components=False, verbose=True)[source]¶
Fit the inference model of the S learner and predict treatment effects.
:param X: a feature matrix
:type X: np.matrix, np.array, or pd.Dataframe
:param treatment: a treatment vector
:type treatment: np.array or pd.Series
:param y: an outcome vector
:type y: np.array or pd.Series
:param return_ci: whether to return confidence intervals
:type return_ci: bool, optional
:param n_bootstraps: number of bootstrap iterations
:type n_bootstraps: int, optional
:param bootstrap_size: number of samples per bootstrap
:type bootstrap_size: int, optional
:param return_components: whether to return outcome for treatment and control separately
:type return_components: bool, optional
:param verbose: whether to output progress logs
:type verbose: bool, optional
Returns:
Predictions of treatment effects. Output dim: [n_samples, n_treatment].
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series, optional) – a treatment vector
y (np.array or pd.Series, optional) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
return_components (bool, optional) – whether to return outcome for treatment and control separately
return_p_score (bool, optional) – whether to return propensity score
verbose (bool, optional) – whether to output progress logs
estimate_ate(X, treatment, y, p=None, bootstrap_ci=False, n_bootstraps=1000, bootstrap_size=10000, pretrain=False)[source]¶
Estimate the Average Treatment Effect (ATE).
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
pretrain (bool) – whether a model has been fit, default False.
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
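The percentile bootstrap behind bootstrap_ci can be sketched in a few lines. This toy example uses a simple difference-in-means ATE estimator rather than the library's learner; the resampling logic is the same idea.

```python
import numpy as np

rng = np.random.default_rng(7)
n, tau_true = 3000, 1.0
w = rng.binomial(1, 0.5, size=n)
y = tau_true * w + rng.normal(scale=1.0, size=n)

def ate(y, w):
    # Difference-in-means ATE estimate (valid under randomization).
    return y[w == 1].mean() - y[w == 0].mean()

# Percentile bootstrap: resample units with replacement, re-estimate the
# ATE on each resample, and take the 2.5th/97.5th percentiles as the CI.
boot = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    boot.append(ate(y[idx], w[idx]))
lb, ub = np.percentile(boot, [2.5, 97.5])
```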
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
fit_predict(X, treatment, y, p=None, return_ci=False, n_bootstraps=1000, bootstrap_size=10000, return_components=False, verbose=True)[source]¶
Fit the treatment effect and outcome models of the R learner and predict treatment effects.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
return_ci (bool) – whether to return confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
return_components (bool, optional) – whether to return outcome for treatment and control separately
verbose (bool) – whether to output progress logs
Returns:
Predictions of treatment effects. Output dim: [n_samples, n_treatment]
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series, optional) – a treatment vector
y (np.array or pd.Series, optional) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
return_components (bool, optional) – whether to return outcome for treatment and control seperately
verbose (bool, optional) – whether to output progress logs
estimate_ate(X, treatment, y, p=None, pretrain=False)[source]¶
Estimate the Average Treatment Effect (ATE).
:param X: a feature matrix
:type X: np.matrix, np.array, or pd.Dataframe
:param treatment: a treatment vector
:type treatment: np.array or pd.Series
:param y: an outcome vector
:type y: np.array or pd.Series
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
Ref: Gruber, S., & Van Der Laan, M. J. (2009). Targeted maximum likelihood estimation: A gentle introduction.
estimate_ate(X, treatment, y, p, segment=None, return_ci=False)[source]¶
Estimate the Average Treatment Effect (ATE).
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict) – an array of propensity scores of float (0,1) in the single-treatment
case; or, a dictionary of treatment groups that map to propensity vectors of float (0,1)
segment (np.array, optional) – An optional segment vector of int. If given, the ATE and its CI will be
estimated for each segment.
return_ci (bool, optional) – Whether to return confidence intervals
Returns:
The ATE and its confidence interval (LB, UB) for each treatment t and segment s
fit(X, treatment, y, p=None, sample_weight=None, verbose=True)[source]¶
Fit the treatment effect and outcome models of the R learner.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (np.ndarray or pd.Series or dict, optional) – an array of propensity scores of float (0,1) in the
single-treatment case; or, a dictionary of treatment groups that map to propensity vectors of
float (0,1); if None will run ElasticNetPropensityModel() to generate the propensity scores.
sample_weight (np.array or pd.Series, optional) – an array of sample weights indicating the
weight of each observation for effect_learner. If None, it assumes equal weight.
verbose (bool, optional) – whether to output progress logs
Estimate the Average Treatment Effect (ATE) for compliers.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
assignment (np.array or pd.Series) – an assignment vector. The assignment is the
instrumental variable that does not depend on unknown confounders. The assignment status
influences treatment in a monotonic way, i.e. one can only be more likely to take the
treatment if assigned.
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (2-tuple of np.ndarray or pd.Series or dict, optional) – The first (second) element corresponds to
unassigned (assigned) units. Each is an array of propensity scores of float (0,1) in the single-treatment
case; or, a dictionary of treatment groups that map to propensity vectors of float (0,1). If None will run
ElasticNetPropensityModel() to generate the propensity scores.
pZ (np.array or pd.Series, optional) – an array of assignment probabilities of float (0,1); if None
will run ElasticNetPropensityModel() to generate the assignment probability score.
bootstrap_ci (bool) – whether to run bootstrap for confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
seed (int) – random seed for cross-fitting
Returns:
The mean and confidence interval (LB, UB) of the ATE estimate.
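The complier ATE under a monotonic instrument can be sketched with the classic Wald/IV estimator: the intent-to-treat effect scaled by the compliance rate. This numpy toy example is a simplification of the doubly robust machinery the library uses.

```python
import numpy as np

rng = np.random.default_rng(3)
n, late_true = 20000, 2.0
z = rng.binomial(1, 0.5, size=n)          # random assignment (the instrument)
complier = rng.binomial(1, 0.6, size=n)   # 60% compliers, the rest never-takers
t = z * complier                          # monotonic: assignment can only raise uptake
y = late_true * t + rng.normal(scale=1.0, size=n)

# Wald estimator: ITT effect divided by the first-stage compliance rate.
itt = y[z == 1].mean() - y[z == 0].mean()
compliance = t[z == 1].mean() - t[z == 0].mean()
late_hat = itt / compliance
```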
fit(X, assignment, treatment, y, p=None, pZ=None, seed=None, calibrate=True)[source]¶
Fit the inference model.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
assignment (np.array or pd.Series) – a (0,1)-valued assignment vector. The assignment is the
instrumental variable that does not depend on unknown confounders. The assignment status
influences treatment in a monotonic way, i.e. one can only be more likely to take the
treatment if assigned.
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (2-tuple of np.ndarray or pd.Series or dict, optional) – The first (second) element corresponds to
unassigned (assigned) units. Each is an array of propensity scores of float (0,1) in the single-treatment
case; or, a dictionary of treatment groups that map to propensity vectors of float (0,1). If None will run
ElasticNetPropensityModel() to generate the propensity scores.
pZ (np.array or pd.Series, optional) – an array of assignment probability of float (0,1); if None
will run ElasticNetPropensityModel() to generate the assignment probability score.
Fit the treatment effect and outcome models of the R learner and predict treatment effects.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
assignment (np.array or pd.Series) – a (0,1)-valued assignment vector. The assignment is the
instrumental variable that does not depend on unknown confounders. The assignment status
influences treatment in a monotonic way, i.e. one can only be more likely to take the
treatment if assigned.
treatment (np.array or pd.Series) – a treatment vector
y (np.array or pd.Series) – an outcome vector
p (2-tuple of np.ndarray or pd.Series or dict, optional) – The first (second) element corresponds to
unassigned (assigned) units. Each is an array of propensity scores of float (0,1) in the single-treatment
case; or, a dictionary of treatment groups that map to propensity vectors of float (0,1). If None will run
ElasticNetPropensityModel() to generate the propensity scores.
pZ (np.array or pd.Series, optional) – an array of assignment probability of float (0,1); if None
will run ElasticNetPropensityModel() to generate the assignment probability score.
return_ci (bool) – whether to return confidence intervals
n_bootstraps (int) – number of bootstrap iterations
bootstrap_size (int) – number of samples per bootstrap
return_components (bool, optional) – whether to return outcome for treatment and control separately
verbose (bool) – whether to output progress logs
seed (int) – random seed for cross-fitting
Returns:
Predictions of treatment effects for compliers, i.e. those individuals
who take the treatment only if they are assigned. Output dim: [n_samples, n_treatment]
If return_ci, returns CATE [n_samples, n_treatment], LB [n_samples, n_treatment],
UB [n_samples, n_treatment]
Builds a model (using X to predict estimated/actual tau), and then calculates feature importances
based on a specified method.
Currently supported methods are:
auto (calculates importance based on estimator’s default implementation of feature importance;
estimator must be tree-based)
Note: if no estimator is provided, it uses lightgbm’s LGBMRegressor as the estimator, and “gain” as the
importance type
permutation (calculates importance based on mean decrease in accuracy when a feature column is permuted;
estimator can be any form)
Hint: for permutation, downsample data for better performance especially if X.shape[1] is large
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
tau (np.array) – a treatment effect vector (estimated/actual)
model_tau_feature (sklearn/lightgbm/xgboost model object) – an unfitted model object
features (np.array) – list/array of feature names. If None, an enumerated list will be used
method (str) – auto, permutation
normalize (bool) – normalize by sum of importances if method=auto (defaults to True)
test_size (float/int) – if float, represents the proportion of the dataset to include in the test split.
If int, represents the absolute number of test samples (used for estimating
permutation importance)
random_state (int/RandomState instance/None) – random state used in permutation importance estimation
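The permutation method described above can be sketched without the library. This hypothetical numpy example uses a plain linear model in place of model_tau_feature: importance is the increase in error when one feature column is shuffled.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 3))
tau = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)   # only feature 0 matters

# Fit a simple linear model of tau on X (stand-in for model_tau_feature).
D = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(D, tau, rcond=None)

def mse(Xm):
    return np.mean((np.column_stack([np.ones(n), Xm]) @ beta - tau) ** 2)

base = mse(X)
# Permutation importance: increase in error when a column is shuffled.
importance = []
for j in range(3):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(mse(Xp) - base)
```

In practice the error is measured on a held-out test split (the test_size parameter), not on the training data as in this sketch.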
Builds a model (using X to predict estimated/actual tau), and then calculates shapley values.
:param X: a feature matrix
:type X: np.matrix or np.array or pd.Dataframe
:param tau: a treatment effect vector (estimated/actual)
:type tau: np.array
:param model_tau_feature: an unfitted model object
:type model_tau_feature: sklearn/lightgbm/xgboost model object
:param features: list/array of feature names. If None, an enumerated list will be used.
:type features: optional, np.array
Builds a model (using X to predict estimated/actual tau), and then plots feature importances
based on a specified method.
Currently supported methods are:
auto (calculates importance based on estimator’s default implementation of feature importance;
estimator must be tree-based)
Note: if no estimator is provided, it uses lightgbm’s LGBMRegressor as the estimator, and “gain” as the
importance type
permutation (calculates importance based on mean decrease in accuracy when a feature column is permuted;
estimator can be any form)
Hint: for permutation, downsample data for better performance especially if X.shape[1] is large
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
tau (np.array) – a treatment effect vector (estimated/actual)
model_tau_feature (sklearn/lightgbm/xgboost model object) – an unfitted model object
features (optional, np.array) – list/array of feature names. If None, an enumerated list will be used
method (str) – auto, permutation
normalize (bool) – normalize by sum of importances if method=auto (defaults to True)
test_size (float/int) – if float, represents the proportion of the dataset to include in the test split.
If int, represents the absolute number of test samples (used for estimating
permutation importance)
random_state (int/RandomState instance/None) – random state used in permutation importance estimation
Plots dependency of shapley values for a specified feature, colored by an interaction feature.
If shapley values have been pre-computed, pass it through the shap_dict parameter.
If shap_dict is not provided, this builds a new model (using X to predict estimated/actual tau),
and then calculates shapley values.
This plots the value of the feature on the x-axis and the SHAP value of the same feature
on the y-axis. This shows how the model depends on the given feature, and is like a
richer extension of the classical partial dependence plots. Vertical dispersion of the
data points represents interaction effects.
Parameters:
treatment_group (str or int) – name of treatment group to create dependency plot on
feature_idx (str or int) – feature index / name to create dependency plot on
X (np.matrix or np.array or pd.Dataframe) – a feature matrix
tau (np.array) – a treatment effect vector (estimated/actual)
model_tau_feature (sklearn/lightgbm/xgboost model object) – an unfitted model object
features (optional, np.array) – list/array of feature names. If None, an enumerated list will be used.
shap_dict (optional, dict) – a dict of shapley value matrices. If None, shap_dict will be computed.
interaction_idx (optional, str or int) – feature index / name used in coloring scheme as interaction feature.
If “auto” then shap.common.approximate_interactions is used to pick what seems to be the
strongest interaction (note that to find the true strongest interaction you need to compute
the SHAP interaction values).
If shapley values have been pre-computed, pass it through the shap_dict parameter.
If shap_dict is not provided, this builds a new model (using X to predict estimated/actual tau),
and then calculates shapley values.
Parameters:
X (np.matrix or np.array or pd.Dataframe) – a feature matrix. Required if shap_dict is None.
tau (np.array) – a treatment effect vector (estimated/actual)
model_tau_feature (sklearn/lightgbm/xgboost model object) – an unfitted model object
features (optional, np.array) – list/array of feature names. If None, an enumerated list will be used.
shap_dict (optional, dict) – a dict of shapley value matrices. If None, shap_dict will be computed.
The organic conversion rate in the population without an intervention.
If None, the organic conversion rate is obtained from the control group.
NB: The organic conversion in the control group is not always the same
as the organic conversion rate without treatment.
data (DataFrame) – A pandas DataFrame containing the features, treatment assignment
indicator and the outcome of interest.
treatment (string) – A string corresponding to the name of the treatment column. The
assumed coding in the column is 1 for treatment and 0 for control.
outcome (string) – A string corresponding to the name of the outcome column. The assumed
coding in the column is 1 for conversion and 0 for no conversion.
treatment (array, shape = (num_samples, )) – An array of treatment group indicator values.
control_name (string) – The name of the control condition as a string. Must be contained in the treatment array.
treatment_names (list, length = cate.shape[1]) – A list of treatment group names. NB: The order of the items in the
list must correspond to the order in which the conditional average
treatment effect estimates are in cate_array.
y_proba (array, shape = (num_samples, )) – The predicted probability of conversion using the Y ~ X model across
the total sample.
cate (array, shape = (num_samples, len(set(treatment)))) – Conditional average treatment effect estimations from any model.
value (array, shape = (num_samples, )) – Value of converting each unit.
conversion_cost (shape = (num_samples, len(set(treatment)))) – The cost of a treatment that is triggered if a unit converts after having been in the treatment, such as a promotion code.
impression_cost (shape = (num_samples, len(set(treatment)))) – The cost of a treatment that is the same for each unit whether or not they convert, such as a cost associated with a promotion channel.
Notes
Because we get the conditional average treatment effects from
cate-learners relative to the control condition, we subtract the
cate for the unit in their actual treatment group from y_proba for that
unit, in order to recover the control outcome. We then add the cates
to the control outcome to obtain y_proba under each condition. These
outcomes are counterfactual because just one of them is actually
observed.
Array of impression costs for each unit in each treatment.
Returns:
actual_value (array, shape = (num_samples, )) – Array of actual values of having a user in their actual treatment group.
conversion_value (array, shape = (num_samples, )) – Array of payoffs from converting a user.
causalml.optimize.get_pns_bounds(data_exp, data_obs, T, Y, type='PNS')[source]¶
Parameters:
data_exp (DataFrame) – Data from an experiment.
data_obs (DataFrame) – Data from an observational study
T (str) – Name of the binary treatment indicator
Y (str) – Name of the binary outcome indicator
type (str) – Type of probability of causation desired. Acceptable args are:
* ‘PNS’: Probability of necessary and sufficient causation
* ‘PS’: Probability of sufficient causation
* ‘PN’: Probability of necessary causation
To capture the counterfactual notation, we use ‘1’ and ‘0’ to indicate the actual and
counterfactual values of a variable, respectively, and we use ‘do’ to indicate the effect
of an intervention.
The experimental and observational data are either assumed to come from the same population,
or from random samples of the population. If the data are from a sample, the bounds may
be incorrectly calculated, because the relevant quantities in the Tian-Pearl equations are
defined, e.g., as P(Y_t), not P(Y_t | S), where S corresponds to sample selection.
Bareinboim and Pearl (https://www.pnas.org/doi/10.1073/pnas.1510507113) discuss conditions
under which P(Y_t) can be recovered from P(Y_t | S).
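For the experimental-data-only case, the Tian-Pearl PNS bounds have a simple closed form, sketched below. This is an illustration of the bound formulas, not the library's get_pns_bounds, which also uses the observational data to tighten the bounds.

```python
# Tian-Pearl bounds on the probability of necessary and sufficient
# causation (PNS) from experimental data alone, for binary T and Y.
# p_y1 = P(Y=1 | do(T=1)), p_y0 = P(Y=1 | do(T=0)).
def pns_bounds_experimental(p_y1, p_y0):
    lower = max(0.0, p_y1 - p_y0)     # PNS is at least the ATE (if positive)
    upper = min(p_y1, 1.0 - p_y0)     # and at most min(P(y_t), P(y'_t'))
    return lower, upper

lb, ub = pns_bounds_experimental(0.7, 0.2)
```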
Get auuc values for cumulative gains of model estimates in quantiles.
For details, reference get_cumgain() and plot_gain()
:param synthetic_preds: dictionary of predictions generated by get_synthetic_preds() or get_synthetic_preds_holdout()
:type synthetic_preds: dict
:param outcome_col: the column name for the actual outcome
:type outcome_col: str, optional
:param treatment_col: the column name for the treatment indicator (0 or 1)
:type treatment_col: str, optional
:param treatment_effect_col: the column name for the true treatment effect
:type treatment_effect_col: str, optional
:param plot: plot the cumulative gain chart or not
:type plot: boolean, optional
Returns:
auuc values by learner for cumulative gains of model estimates
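The cumulative gain underlying these AUUC values can be sketched directly. This is a hypothetical, self-contained numpy version of the idea behind get_cumgain(), not the library code: rank units by predicted uplift, then plot the cumulative treated-minus-control outcome difference scaled by population size.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000
score = rng.uniform(size=n)                 # model's uplift score per unit
w = rng.binomial(1, 0.5, size=n)            # random treatment assignment
# True uplift grows with the score, so a good ordering puts
# high-uplift units first and the gain curve sits above the diagonal.
y = (rng.uniform(size=n) < 0.1 + 0.3 * score * w).astype(float)

order = np.argsort(-score)                  # rank by predicted uplift, descending
y_s, w_s = y[order], w[order]
n_t, n_c = np.cumsum(w_s), np.cumsum(1 - w_s)
cum_t, cum_c = np.cumsum(y_s * w_s), np.cumsum(y_s * (1 - w_s))
k = np.arange(1, n + 1)
with np.errstate(divide="ignore", invalid="ignore"):
    gain = (cum_t / n_t - cum_c / n_c) * k  # cumulative gain at each cutoff
gain = np.nan_to_num(gain)
auuc = gain.sum() / n                       # area under the cumulative gain curve
```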
Generate a synthetic dataset for a classification uplift modeling problem.
Parameters:
n_samples (int, optional (default=1000)) – The number of samples to be generated for each treatment group.
treatment_name (list, optional (default = ['control','treatment1','treatment2','treatment3'])) – The list of treatment names.
y_name (string, optional (default = 'conversion')) – The name of the outcome variable to be used as a column in the output dataframe.
n_classification_features (int, optional (default = 10)) – Total number of features for base classification
n_classification_informative (int, optional (default = 5)) – Total number of informative features for base classification
n_classification_redundant (int, optional (default = 0)) – Total number of redundant features for base classification
n_classification_repeated (int, optional (default = 0)) – Total number of repeated features for base classification
n_uplift_increase_dict (dictionary, optional (default: {'treatment1': 2, 'treatment2': 2, 'treatment3': 2})) – Number of features for generating positive treatment effects for corresponding treatment group.
Dictionary of {treatment_key: number_of_features_for_increase_uplift}.
n_uplift_decrease_dict (dictionary, optional (default: {'treatment1': 0, 'treatment2': 0, 'treatment3': 0})) – Number of features for generating negative treatment effects for corresponding treatment group.
Dictionary of {treatment_key: number_of_features_for_decrease_uplift}.
delta_uplift_increase_dict (dictionary, optional (default: {'treatment1': .02, 'treatment2': .05, 'treatment3': .1})) – Positive treatment effect created by the positive uplift features on the base classification label.
Dictionary of {treatment_key: increase_delta}.
delta_uplift_decrease_dict (dictionary, optional (default: {'treatment1': 0., 'treatment2': 0., 'treatment3': 0.})) – Negative treatment effect created by the negative uplift features on the base classification label.
Dictionary of {treatment_key: decrease_delta}.
n_uplift_increase_mix_informative_dict (dictionary, optional (default: {'treatment1': 1, 'treatment2': 1, 'treatment3': 1})) – Number of positive mix features for each treatment. The positive mix feature is defined as a linear combination
of a randomly selected informative classification feature and a randomly selected positive uplift feature.
The linear combination is made by two coefficients sampled from a uniform distribution between -1 and 1.
n_uplift_decrease_mix_informative_dict (dictionary, optional (default: {'treatment1': 0, 'treatment2': 0, 'treatment3': 0})) – Number of negative mix features for each treatment. The negative mix feature is defined as a linear combination
of a randomly selected informative classification feature and a randomly selected negative uplift feature. The
linear combination is made by two coefficients sampled from a uniform distribution between -1 and 1.
positive_class_proportion (float, optional (default = 0.5)) – The proportion of positive label (1) in the control group.
random_seed (int, optional (default = 20190101)) – The random seed to be used in the data generation process.
Returns:
df_res (DataFrame) – A data frame containing the treatment label, features, and outcome variable.
x_name (list) – The list of feature names generated.
Notes
The algorithm for generating the base classification dataset is adapted from the make_classification method in the
sklearn package, which uses the algorithm in Guyon [1] designed to generate the “Madelon” dataset.
Generate a synthetic dataset for a classification uplift modeling problem.
Parameters:
n_samples (int, optional (default=1000)) – The number of samples to be generated for each treatment group.
treatment_name (list, optional (default = ['control','treatment1','treatment2','treatment3'])) – The list of treatment names. The first element must be ‘control’ as control group, and the rest are treated as treatment groups.
y_name (string, optional (default = 'conversion')) – The name of the outcome variable to be used as a column in the output dataframe.
n_classification_features (int, optional (default = 10)) – Total number of features for base classification
n_classification_informative (int, optional (default = 5)) – Total number of informative features for base classification
n_classification_redundant (int, optional (default = 0)) – Total number of redundant features for base classification
n_classification_repeated (int, optional (default = 0)) – Total number of repeated features for base classification
n_uplift_dict (dictionary, optional (default: {'treatment1': 2, 'treatment2': 2, 'treatment3': 3})) – Number of features for generating heterogeneous treatment effects for corresponding treatment group.
Dictionary of {treatment_key: number_of_features_for_uplift}.
n_mix_informative_uplift_dict (dictionary, optional (default: {'treatment1': 1, 'treatment2': 1, 'treatment3': 1})) – Number of mix features for each treatment. The mix feature is defined as a linear combination
of a randomly selected informative classification feature and a randomly selected uplift feature.
The mixture is made by a weighted sum (p*feature1 + (1-p)*feature2), where the weight p is drawn from a uniform distribution between 0 and 1.
delta_uplift_dict (dictionary, optional (default: {'treatment1': .02, 'treatment2': .05, 'treatment3': -.05})) – Treatment effect (delta), can be positive or negative.
Dictionary of {treatment_key: delta}.
positive_class_proportion (float, optional (default = 0.1)) – The proportion of positive label (1) in the control group, or the mean of outcome variable for control group.
random_seed (int, optional (default = 20200101)) – The random seed to be used in the data generation process.
feature_association_list (list, optional (default = ['linear','quadratic','cubic','relu','sin','cos'])) – List of uplift feature association patterns to the treatment effect. For example, if the feature pattern is ‘quadratic’, then the treatment effect will increase or decrease quadratically with the feature.
The values in the list must be one of (‘linear’,’quadratic’,’cubic’,’relu’,’sin’,’cos’). However, the same value can appear multiple times in the list.
random_select_association (boolean, optional (default = True)) – How the feature patterns are selected from the feature_association_list to be applied in the data generation process.
If random_select_association = True, then for every uplift feature, a random feature association pattern is selected from the list.
If random_select_association = False, then the feature association pattern is selected from the list in turns to be applied to each feature one by one.
error_std (float, optional (default = 0.05)) – Standard deviation to be used in the error term of the logistic regression. The error is drawn from a normal distribution with mean 0 and standard deviation specified in this argument.
Returns:
df1 (DataFrame) – A data frame containing the treatment label, features, and outcome variable.
x_name (list) – The list of feature names generated.
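The generation process above can be illustrated with a simplified, self-contained sketch. This is not the library implementation; the function name, the single linear uplift feature, and the single treatment arm are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def make_simple_uplift_data(n=1000, p_base=0.1, delta=0.05, seed=20200101):
    """Simplified sketch of an uplift data generator: one informative
    uplift feature with a linear association to the treatment effect
    (hypothetical helper, not part of the library API)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                  # uplift feature
    w = rng.integers(0, 2, size=n)          # random treatment assignment
    tau = delta * (1 + x)                   # effect grows linearly with x
    prob = np.clip(p_base + w * tau, 0, 1)  # outcome probability
    y = rng.binomial(1, prob)
    return pd.DataFrame({"x": x, "treatment": w, "y": y})

df = make_simple_uplift_data()
```

The control-group outcome mean stays near p_base (0.1 here), mirroring the positive_class_proportion argument, while the treated group's outcome rate varies with the feature.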
Synthetic data in Nie X. and Wager S. (2018) ‘Quasi-Oracle Estimation of Heterogeneous Treatment Effects’
:param mode: mode of the simulation: 1 for difficult nuisance components and an easy treatment effect. 2 for a randomized trial. 3 for an easy propensity and a difficult baseline. 4 for unrelated treatment and control groups. 5 for a hidden confounder biasing treatment.
:type mode: int, optional
:param n: number of observations
:type n: int, optional
:param p: number of covariates (>=5)
:type p: int, optional
:param sigma: standard deviation of the error term
:type sigma: float
:param adj: adjustment term for the distribution of the propensity score e; higher values shift the distribution toward 0. Does not apply when mode == 2 or 3.
Returns:
Synthetically generated samples with the following outputs:
y ((n,)-array): outcome variable.
X ((n,p)-ndarray): independent variables.
w ((n,)-array): treatment flag with value 0 or 1.
tau ((n,)-array): individual treatment effect.
b ((n,)-array): expected outcome.
e ((n,)-array): propensity of receiving treatment.
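A minimal numpy sketch of the randomized-trial case (mode 2) follows. The functional forms for b and tau are taken from Setup B of Nie and Wager (2018) as we recall them; treat them as an assumption rather than a verbatim copy of the library's generator:

```python
import numpy as np

def simulate_randomized_trial(n=1000, p=5, sigma=1.0, seed=42):
    """Sketch of the randomized-trial simulation (mode 2).
    Returns (y, X, w, tau, b, e) in the same order as the docstring."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    # baseline outcome: max(x1 + x2, x3, 0) + max(x4 + x5, 0)
    b = np.maximum.reduce([X[:, 0] + X[:, 1], X[:, 2], np.zeros(n)]) \
        + np.maximum(X[:, 3] + X[:, 4], 0)
    e = np.full(n, 0.5)                          # randomized assignment
    tau = X[:, 0] + np.log1p(np.exp(X[:, 1]))    # heterogeneous effect
    w = rng.binomial(1, e)
    y = b + (w - 0.5) * tau + sigma * rng.normal(size=n)
    return y, X, w, tau, b, e
```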
A Sensitivity Check class to support Placebo Treatment, Irrelevant Additional Confounder
and Subset validation refutation methods to verify causal inference.
method (list of str) – a list of sensitivity analysis methods
sample_size (float, optional) – ratio used to subsample the original data
confound (string, optional) – the name of the confounding function
alpha_range (np.array, optional) – a parameter range to pass to the confounding function
Returns:
X (np.array): a feature matrix
p (np.array): a propensity score vector between 0 and 1
treatment (np.array): a treatment vector (1 if treated, otherwise 0)
y (np.array): an outcome vector
Check the partial R-squared values of a feature against the corresponding confounding amount of the ATE.
:param sens_df: a data frame output from causalsens
:type sens_df: pandas.DataFrame
:param feature_name: feature name to check
:type feature_name: str
:param partial_rsqs_value: partial rsquare value of feature
:type partial_rsqs_value: float
:param range: range to search from sens_df
:type range: float
Plot the results of a sensitivity analysis against unmeasured confounding.
:param sens_df: a data frame output from causalsens
:type sens_df: pandas.DataFrame
:param partial_rsqs_d: a data frame output from causalsens including partial R-squared values
:type partial_rsqs_d: pandas.DataFrame
:param type: the type of plot to draw, ‘raw’ or ‘r.squared’ are supported
:type type: str, optional
:param ci: whether to plot confidence intervals
:type ci: bool, optional
:param partial_rsqs: whether to plot partial R-squared results
:type partial_rsqs: bool, optional
Calculate the AUUC (Area Under the Uplift Curve) score.
Args:
df (pandas.DataFrame): a data frame with model estimates and actual data as columns
outcome_col (str, optional): the column name for the actual outcome
treatment_col (str, optional): the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional): the column name for the true treatment effect
normalize (bool, optional): whether to normalize the y-axis to 1 or not
w (numpy.array, optional) – a treatment vector (1 or True: treatment, 0 or False: control). If given, log
metrics for the treatment and control group separately
metrics (dict, optional) – a dictionary of the metric names and functions
Get cumulative gains of model estimates in population.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the cumulative gain of the true treatment effect in each population.
Otherwise, it’s calculated as the cumulative difference between the mean outcomes
of the treatment and control groups in each population.
For details, see Section 4.1 of Gutierrez and Gérardy (2016), Causal Inference
and Uplift Modeling: A review of the literature.
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
normalize (bool, optional) – whether to normalize the y-axis to 1 or not
random_seed (int, optional) – random seed for numpy.random.rand()
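The observed-outcome branch of the calculation above can be sketched as follows. This is a simplified stand-in, not the library's get_cumgain; the function name and default column names are illustrative:

```python
import numpy as np
import pandas as pd

def cumgain(df, pred_col, outcome_col="y", treatment_col="w"):
    """Sketch of cumulative gain when only the observed outcome and
    treatment flag are available (no true-effect column)."""
    d = df.sort_values(pred_col, ascending=False)
    y = d[outcome_col].to_numpy(float)
    w = d[treatment_col].to_numpy(float)
    n = np.arange(1, len(d) + 1, dtype=float)
    cum_t = np.cumsum(w)            # treated units seen so far
    cum_c = n - cum_t               # control units seen so far
    mean_t = np.divide(np.cumsum(y * w), cum_t,
                       out=np.zeros_like(y), where=cum_t > 0)
    mean_c = np.divide(np.cumsum(y * (1 - w)), cum_c,
                       out=np.zeros_like(y), where=cum_c > 0)
    # cumulative difference in mean outcomes, scaled by population size
    return (mean_t - mean_c) * n
```

Sorting by the model estimate and walking down the ranking gives one gain value per cut-off, which is what the curve plots.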
Get average uplifts of model estimates in cumulative population.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the mean of the true treatment effect in each cumulative population.
Otherwise, it’s calculated as the difference between the mean outcomes of the
treatment and control groups in each cumulative population.
For details, see Section 4.1 of Gutierrez and Gérardy (2016), Causal Inference
and Uplift Modeling: A review of the literature.
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
random_seed (int, optional) – random seed for numpy.random.rand()
Returns:
average uplifts of model estimates in cumulative population
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the cumulative gain of the true treatment effect in each population.
Otherwise, it’s calculated as the cumulative difference between the mean outcomes
of the treatment and control groups in each population.
For details, see Radcliffe (2007), Using Control Group to Target on Predicted Lift:
Building and Assessing Uplift Models
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
normalize (bool, optional) – whether to normalize the y-axis to 1 or not
random_seed (int, optional) – random seed for numpy.random.rand()
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average'} or array-like of shape (n_outputs,), default='uniform_average') –
Defines aggregating of multiple output values.
Array-like value defines weights used to average errors.
'raw_values':
Returns a full set of errors in case of multioutput input.
'uniform_average':
Errors of all outputs are averaged with uniform weight.
Returns:
loss – If multioutput is ‘raw_values’, then mean absolute error is returned
for each output separately.
If multioutput is ‘uniform_average’ or an ndarray of weights, then the
weighted average of all output errors is returned.
MAE output is non-negative floating point. The best value is 0.0.
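The multioutput aggregation described above reduces to a few lines of numpy. A minimal sketch (not sklearn's implementation, which also handles sample weights and validation):

```python
import numpy as np

def mae(y_true, y_pred, multioutput="uniform_average"):
    """Minimal sketch of multioutput mean absolute error."""
    # per-output mean absolute error (scalar for 1-D targets)
    err = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)), axis=0)
    if multioutput == "raw_values":
        return err
    if multioutput == "uniform_average":
        return err.mean()
    # otherwise treat multioutput as an array of per-output weights
    return np.average(err, weights=multioutput)
```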
Plot the cumulative gain chart (or uplift curve) of model estimates.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the cumulative gain of the true treatment effect in each population.
Otherwise, it’s calculated as the cumulative difference between the mean outcomes
of the treatment and control groups in each population.
For details, see Section 4.1 of Gutierrez and Gérardy (2016), Causal Inference
and Uplift Modeling: A review of the literature.
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
normalize (bool, optional) – whether to normalize the y-axis to 1 or not
random_seed (int, optional) – random seed for numpy.random.rand()
n (int, optional) – the number of samples to be used for plotting
Plot the lift chart of model estimates in cumulative population.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the mean of the true treatment effect in each cumulative population.
Otherwise, it’s calculated as the difference between the mean outcomes of the
treatment and control groups in each cumulative population.
For details, see Section 4.1 of Gutierrez and Gérardy (2016), Causal Inference
and Uplift Modeling: A review of the literature.
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
random_seed (int, optional) – random seed for numpy.random.rand()
n (int, optional) – the number of samples to be used for plotting
Plot the Qini chart (or uplift curve) of model estimates.
If the true treatment effect is provided (e.g. in synthetic data), it’s calculated
as the cumulative gain of the true treatment effect in each population.
Otherwise, it’s calculated as the cumulative difference between the mean outcomes
of the treatment and control groups in each population.
For details, see Radcliffe (2007), Using Control Group to Target on Predicted Lift:
Building and Assessing Uplift Models
For the former, treatment_effect_col should be provided. For the latter, both
outcome_col and treatment_col should be provided.
Parameters:
df (pandas.DataFrame) – a data frame with model estimates and actual data as columns
outcome_col (str, optional) – the column name for the actual outcome
treatment_col (str, optional) – the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional) – the column name for the true treatment effect
normalize (bool, optional) – whether to normalize the y-axis to 1 or not
random_seed (int, optional) – random seed for numpy.random.rand()
n (int, optional) – the number of samples to be used for plotting
ci (bool, optional) – whether return confidence intervals for ATE or not
Calculate the Qini score: the area between the Qini curves of a model and random.
For details, see Radcliffe (2007), Using Control Group to Target on Predicted Lift:
Building and Assessing Uplift Models
Args:
df (pandas.DataFrame): a data frame with model estimates and actual data as columns
outcome_col (str, optional): the column name for the actual outcome
treatment_col (str, optional): the column name for the treatment indicator (0 or 1)
treatment_effect_col (str, optional): the column name for the true treatment effect
normalize (bool, optional): whether to normalize the y-axis to 1 or not
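The score can be sketched in a few lines: build Radcliffe's Qini curve, then average its vertical distance from the random-targeting diagonal. This is a simplified stand-in for illustration, not the library's qini_score; the column defaults are assumptions:

```python
import numpy as np
import pandas as pd

def qini_score(df, pred_col, outcome_col="y", treatment_col="w"):
    """Sketch of the Qini score: area between the model's Qini curve
    and the random-targeting diagonal (after Radcliffe, 2007)."""
    d = df.sort_values(pred_col, ascending=False)
    y = d[outcome_col].to_numpy(float)
    w = d[treatment_col].to_numpy(float)
    n = len(d)
    cum_t, cum_c = np.cumsum(w), np.cumsum(1 - w)
    ratio = np.divide(cum_t, cum_c, out=np.zeros_like(cum_t), where=cum_c > 0)
    # Qini curve: treated responders minus scaled control responders
    curve = np.cumsum(y * w) - np.cumsum(y * (1 - w)) * ratio
    diag = curve[-1] * np.arange(1, n + 1) / n  # random-targeting baseline
    return float((curve - diag).mean())
```

A model that ranks high-uplift units first produces a curve above the diagonal and hence a positive score; random ranking scores near zero.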
\(R^2\) (coefficient of determination) regression score function.
Best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse). A constant model that always
predicts the expected value of y, disregarding the input features,
would get a \(R^2\) score of 0.0.
Read more in the User Guide.
Parameters:
y_true (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Ground truth (correct) target values.
y_pred (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Estimated target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
multioutput ({'raw_values', 'uniform_average', 'variance_weighted'}, array-like of shape (n_outputs,) or None, default='uniform_average') –
Defines aggregating of multiple output scores.
Array-like value defines weights used to average scores.
Default is “uniform_average”.
'raw_values':
Returns a full set of scores in case of multioutput input.
'uniform_average':
Scores of all outputs are averaged with uniform weight.
'variance_weighted':
Scores of all outputs are averaged, weighted by the variances
of each individual output.
Changed in version 0.19: Default value of multioutput is ‘uniform_average’.
Returns:
z – The \(R^2\) score or ndarray of scores if ‘multioutput’ is
‘raw_values’.
Return type:
float or ndarray of floats
Notes
This is not a symmetric function.
Unlike most other scores, \(R^2\) score may be negative (it need not
actually be the square of a quantity R).
This metric is not well-defined for single samples and will return a NaN
value if n_samples is less than two.
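For the single-output case, the definition above reduces to one ratio of sums of squares. A minimal sketch (sklearn's version additionally handles sample weights, multioutput aggregation, and the degenerate cases noted above):

```python
import numpy as np

def r2(y_true, y_pred):
    """Minimal sketch of the coefficient of determination for 1-D targets."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot
```

A constant predictor equal to the mean of y_true gives ss_res == ss_tot and hence a score of exactly 0.0, matching the note above.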
w (numpy.array, optional) – a treatment vector (1 or True: treatment, 0 or False: control). If given, log
metrics for the treatment and control group separately
metrics (dict, optional) – a dictionary of the metric names and functions
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
from prediction scores.
Note: this implementation can be used with binary, multiclass and
multilabel classification, but some restrictions apply (see Parameters).
Read more in the User Guide.
Parameters:
y_true (array-like of shape (n_samples,) or (n_samples, n_classes)) – True labels or binary label indicators. The binary and multiclass cases
expect labels with shape (n_samples,) while the multilabel case expects
binary label indicators with shape (n_samples, n_classes).
y_score (array-like of shape (n_samples,) or (n_samples, n_classes)) –
Target scores.
In the binary case, it corresponds to an array of shape
(n_samples,). Both probability estimates and non-thresholded
decision values can be provided. The probability estimates correspond
to the probability of the class with the greater label,
i.e. estimator.classes_[1] and thus
estimator.predict_proba(X, y)[:, 1]. The decision values
corresponds to the output of estimator.decision_function(X, y).
See more information in the User guide;
In the multiclass case, it corresponds to an array of shape
(n_samples, n_classes) of probability estimates provided by the
predict_proba method. The probability estimates must
sum to 1 across the possible classes. In addition, the order of the
class scores must correspond to the order of labels,
if provided, or else to the numerical or lexicographical order of
the labels in y_true. See more information in the
User guide;
In the multilabel case, it corresponds to an array of shape
(n_samples, n_classes). Probability estimates are provided by the
predict_proba method and the non-thresholded decision values by
the decision_function method. The probability estimates correspond
to the probability of the class with the greater label for each
output of the classifier. See more information in the
User guide.
average ({'micro', 'macro', 'samples', 'weighted'} or None, default='macro') –
If None, the scores for each class are returned. Otherwise,
this determines the type of averaging performed on the data:
Note: multiclass ROC AUC currently only handles the ‘macro’ and
‘weighted’ averages.
'micro':
Calculate metrics globally by considering each element of the label
indicator matrix as a label.
'macro':
Calculate metrics for each label, and find their unweighted
mean. This does not take label imbalance into account.
'weighted':
Calculate metrics for each label, and find their average, weighted
by support (the number of true instances for each label).
'samples':
Calculate metrics for each instance, and find their average.
Will be ignored when y_true is binary.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
max_fpr (float > 0 and <= 1, default=None) – If not None, the standardized partial AUC [2] over the range
[0, max_fpr] is returned. For the multiclass case, max_fpr should
be either None or 1.0, as partial AUC ROC computation is not
currently supported for multiclass.
multi_class ({'raise', 'ovr', 'ovo'}, default='raise') –
Only used for multiclass targets. Determines the type of configuration
to use. The default value raises an error, so either
'ovr' or 'ovo' must be passed explicitly.
'ovr':
Stands for One-vs-rest. Computes the AUC of each class
against the rest [3][4]. This
treats the multiclass case in the same way as the multilabel case.
Sensitive to class imbalance even when average=='macro',
because class imbalance affects the composition of each of the
‘rest’ groupings.
'ovo':
Stands for One-vs-one. Computes the average AUC of all
possible pairwise combinations of classes [5].
Insensitive to class imbalance when
average=='macro'.
labels (array-like of shape (n_classes,), default=None) – Only used for multiclass targets. List of labels that index the
classes in y_score. If None, the numerical or lexicographical
order of the labels in y_true is used.
>>> import numpy as np
>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> X, y = make_multilabel_classification(random_state=0)
>>> clf = MultiOutputClassifier(clf).fit(X, y)
>>> # get a list of n_output containing probability arrays of shape
>>> # (n_samples, n_classes)
>>> y_pred = clf.predict_proba(X)
>>> # extract the positive columns for each output
>>> y_pred = np.transpose([pred[:, 1] for pred in y_pred])
>>> roc_auc_score(y, y_pred, average=None)
array([0.82..., 0.86..., 0.94..., 0.85..., 0.94...])
>>> from sklearn.linear_model import RidgeClassifierCV
>>> clf = RidgeClassifierCV().fit(X, y)
>>> roc_auc_score(y, clf.decision_function(X), average=None)
array([0.81..., 0.84..., 0.93..., 0.87..., 0.94...])
Rank features based on the chosen divergence measure.
Parameters:
data (pd.Dataframe) – DataFrame containing outcome, features, and experiment group
treatment_indicator (string) – the column name for binary indicator of treatment (value 1) or control (value 0)
features (list of string) – list of feature names, that are columns in the data DataFrame
y_name (string) – name of the outcome variable
method (string, optional, default = 'KL') – taking one of the following values {‘F’, ‘LR’, ‘KL’, ‘ED’, ‘Chi’}
The feature selection method to be used to rank the features.
‘F’ for F-test
‘LR’ for likelihood ratio test
‘KL’, ‘ED’, ‘Chi’ for bin-based uplift filter methods, KL divergence, Euclidean distance, Chi-Square respectively
experiment_group_column (string, optional, default = 'treatment_group_key') – the experiment column name in the DataFrame, which contains the treatment and control assignment label
control_group (string, optional, default = 'control') – name for control group, value in the experiment group column
n_bins (int, optional, default = 10) – number of bins to be used for bin-based uplift filter methods
null_impute (str, optional, default=None) – impute np.nan present in the data taking one of the following strategy values {‘mean’, ‘median’, ‘most_frequent’, None}. If the value is None and nulls are present, an exception will be raised
Returns:
pd.DataFrame
a data frame containing the feature importance statistics
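The bin-based filters ('KL', 'ED', 'Chi') all follow the same pattern: bin the feature, compare treatment and control outcome rates within each bin, and aggregate a divergence weighted by bin size. A sketch of the KL variant follows; this is an illustration of the technique, not the library's implementation, and the function name and column defaults are assumptions:

```python
import numpy as np
import pandas as pd

def kl_filter_score(df, feature, y_col="y", w_col="w", n_bins=10, eps=1e-6):
    """Sketch of a bin-based KL-divergence uplift filter score for one
    feature: larger scores mean the treatment/control outcome rates
    diverge more across the feature's bins."""
    bins = pd.qcut(df[feature], n_bins, duplicates="drop")
    score = 0.0
    for _, g in df.groupby(bins, observed=True):
        # conversion rates in treatment and control, clipped away from 0/1
        p = np.clip(g.loc[g[w_col] == 1, y_col].mean(), eps, 1 - eps)
        q = np.clip(g.loc[g[w_col] == 0, y_col].mean(), eps, 1 - eps)
        if np.isnan(p) or np.isnan(q):
            continue  # bin missing one of the groups
        kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
        score += kl * len(g) / len(df)  # weight by bin population share
    return score
```

Ranking features by this score surfaces those whose bins separate treated and control outcomes most strongly, which is the intuition behind the KL filter.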
Rank features based on the F-statistics of the interaction.
Parameters:
data (pd.Dataframe) – DataFrame containing outcome, features, and experiment group
treatment_indicator (string) – the column name for binary indicator of treatment (value 1) or control (value 0)
features (list of string) – list of feature names, that are columns in the data DataFrame
y_name (string) – name of the outcome variable
order (int) – the order of the feature to be evaluated with the treatment effect. order takes one of three values: 1, 2, 3. order = 1 corresponds to linear importance of the feature, order = 2 to quadratic and linear importance, and order = 3 calculates feature importance up to cubic forms.
Returns:
pd.DataFrame
a data frame containing the feature importance statistics
Rank features based on the LRT-statistics of the interaction.
Parameters:
data (pd.Dataframe) – DataFrame containing outcome, features, and experiment group
treatment_indicator (string) – the column name for binary indicator of treatment (value 1) or control (value 0)
feature_name (string) – feature name, as one column in the data DataFrame
y_name (string) – name of the outcome variable
order (int) – the order of the feature to be evaluated with the treatment effect. order takes one of three values: 1, 2, 3. order = 1 corresponds to linear importance of the feature, order = 2 to quadratic and linear importance, and order = 3 calculates feature importance up to cubic forms.
Returns:
pd.DataFrame
a data frame containing the feature importance statistics
Rank features based on the chosen statistic of the interaction.
Parameters:
data (pd.Dataframe) – DataFrame containing outcome, features, and experiment group
features (list of string) – list of feature names, that are columns in the data DataFrame
y_name (string) – name of the outcome variable
method (string, optional, default = 'KL') – taking one of the following values {‘F’, ‘LR’, ‘KL’, ‘ED’, ‘Chi’}
The feature selection method to be used to rank the features.
‘F’ for F-test
‘LR’ for likelihood ratio test
‘KL’, ‘ED’, ‘Chi’ for bin-based uplift filter methods, KL divergence, Euclidean distance, Chi-Square respectively
experiment_group_column (string) – the experiment column name in the DataFrame, which contains the treatment and control assignment label
control_group (string) – name for control group, value in the experiment group column
treatment_group (string) – name for treatment group, value in the experiment group column
n_bins (int, optional) – number of bins to be used for bin-based uplift filter methods
null_impute (str, optional, default=None) – impute np.nan present in the data taking one of the following strategy values {‘mean’, ‘median’, ‘most_frequent’, None}. If the value is None and nulls are present, an exception will be raised
order (int) – the order of the feature to be evaluated with the treatment effect for the F filter and LR filter. order takes one of three values: 1, 2, 3. order = 1 corresponds to linear importance of the feature, order = 2 to quadratic and linear importance, and order = 3 calculates feature importance up to cubic forms.
disp (bool) – Set to True to print convergence messages for Logistic regression convergence in LR method.
Returns:
pd.DataFrame
a data frame with following columns: [‘method’, ‘feature’, ‘rank’, ‘score’, ‘p_value’, ‘misc’]