Uplift Trees Example with Synthetic Data

This notebook uses synthetic data to demonstrate CausalML's tree-based uplift algorithms.

[3]:

import numpy as np
import pandas as pd

from causalml.dataset import make_uplift_classification
from causalml.inference.tree import UpliftRandomForestClassifier
from causalml.metrics import plot_gain

from sklearn.model_selection import train_test_split

[4]:

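# Import the submodule explicitly; a bare `import importlib` does not guarantee
# that `importlib.metadata` is loaded.
import importlib.metadata
print(importlib.metadata.version('causalml'))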

0.14.0

Generate synthetic dataset

The CausalML package contains various functions to generate synthetic datasets for uplift modeling. Here we generate a classification dataset using the make_uplift_classification() function.

[3]:

df, x_names = make_uplift_classification()

[4]:

df.head()

[4]:

	treatment_group_key	x1_informative	x2_informative	x3_informative	x4_informative	x5_informative	x6_irrelevant	x7_irrelevant	x8_irrelevant	x9_irrelevant	...	x12_uplift_increase	x13_increase_mix	x14_uplift_increase	x15_uplift_increase	x16_increase_mix	x17_uplift_increase	x18_uplift_increase	x19_increase_mix	conversion	treatment_effect
0	control	-0.542888	1.976361	-0.531359	-2.354211	-0.380629	-2.614321	-0.128893	0.448689	-2.275192	...	-1.315304	0.742654	1.891699	-2.428395	1.541875	-0.817705	-0.610194	-0.591581	0	0
1	treatment3	0.258654	0.552412	1.434239	-1.422311	0.089131	0.790293	1.159513	1.578868	0.166540	...	-1.391878	-0.623243	2.443972	-2.889253	2.018585	-1.109296	-0.380362	-1.667606	0	0
2	treatment1	1.697012	-2.762600	-0.662874	-1.682340	1.217443	0.837982	1.042981	0.177398	-0.112409	...	-1.132497	1.050179	1.573054	-1.788427	1.341609	-0.749227	-2.091521	-0.471386	0	0
3	treatment2	-1.441644	1.823648	0.789423	-0.295398	0.718509	-0.492993	0.947824	-1.307887	0.123340	...	-2.084619	0.058481	1.369439	0.422538	1.087176	-0.966666	-1.785592	-1.268379	1	1
4	control	-0.625074	3.002388	-0.096288	1.938235	3.392424	-0.465860	-0.919897	-1.072592	-1.331181	...	-1.403984	0.760430	1.917635	-2.347675	1.560946	-0.833067	-1.407884	-0.781343	0	0

5 rows × 22 columns
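To make the structure of such a dataset concrete, the following is a minimal sketch of a hand-rolled uplift dataset built with NumPy and pandas only. It is not how make_uplift_classification() generates its data; the column names `x1_informative` and `x2_uplift`, the seed, and the effect sizes are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, chosen for reproducibility
n = 4000
groups = np.array(['control', 'treatment1'])

toy = pd.DataFrame({
    'treatment_group_key': rng.choice(groups, size=n),  # randomized assignment
    'x1_informative': rng.normal(size=n),  # shifts the baseline conversion rate
    'x2_uplift': rng.normal(size=n),       # only matters for treated units
})

treated = (toy['treatment_group_key'] == 'treatment1').to_numpy()
# Baseline conversion probability stays between 0.4 and 0.6
base = 0.5 + 0.1 * np.tanh(toy['x1_informative'].to_numpy())
# Hypothetical treatment effect: +0.1 when x2_uplift is positive, else 0
effect = 0.1 * (toy['x2_uplift'].to_numpy() > 0)
p = base + treated * effect

toy['conversion'] = rng.binomial(1, p)
print(toy.groupby('treatment_group_key')['conversion'].mean())
```

Because assignment is random, the gap between the two group means printed at the end estimates the average effect of the hypothetical treatment.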

[5]:

# Look at the conversion rate and sample size in each group
df.pivot_table(values='conversion',
               index='treatment_group_key',
               aggfunc=[np.mean, np.size],
               margins=True)

[5]:

	mean	size
	conversion	conversion
treatment_group_key
control	0.511	1000
treatment1	0.514	1000
treatment2	0.559	1000
treatment3	0.600	1000
All	0.546	4000
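
Since the groups are randomized, subtracting the control conversion rate from each treatment's conversion rate gives a simple, model-free estimate of the average uplift per treatment. The sketch below uses the rates copied from the table above; the variable names are illustrative.

```python
import pandas as pd

# Conversion rates copied from the pivot table above
conversion_rate = pd.Series({
    'control': 0.511,
    'treatment1': 0.514,
    'treatment2': 0.559,
    'treatment3': 0.600,
})

# Average uplift of each treatment relative to control
naive_uplift = (conversion_rate.drop('control') - conversion_rate['control']).round(3)
print(naive_uplift)
```

This yields roughly 0.003, 0.048 and 0.089 for treatment1, treatment2 and treatment3. These are population-level averages; the uplift models below try to estimate how the effect varies from unit to unit.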

Run the uplift random forest classifier

In this section, we first fit a single uplift tree and then the uplift random forest classifier on the training data. We then use the fitted models to make predictions on the testing data. The random forest's prediction returns an ndarray in which each column contains the predicted uplift if the unit were in the corresponding treatment group.

[6]:

# Split data into training and testing samples for model validation (next section)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)

[7]:

from causalml.inference.tree import UpliftTreeClassifier

[8]:

clf = UpliftTreeClassifier(control_name='control')
clf.fit(df_train[x_names].values,
         treatment=df_train['treatment_group_key'].values,
         y=df_train['conversion'].values)
p = clf.predict(df_test[x_names].values)

[9]:

df_res = pd.DataFrame(p, columns=clf.classes_)
df_res.head()

[9]:

	control	treatment1	treatment2	treatment3
0	0.506394	0.511811	0.573935	0.503778
1	0.506394	0.511811	0.573935	0.503778
2	0.580838	0.458824	0.508982	0.452381
3	0.482558	0.572327	0.556757	0.961538
4	0.482558	0.572327	0.556757	0.961538

[10]:

uplift_model = UpliftRandomForestClassifier(control_name='control')

[11]:

uplift_model.fit(df_train[x_names].values,
                 treatment=df_train['treatment_group_key'].values,
                 y=df_train['conversion'].values)

[12]:

df_res = uplift_model.predict(df_test[x_names].values, full_output=True)
print(df_res.shape)
df_res.head()

(800, 9)

[12]:

	control	treatment1	treatment2	treatment3	recommended_treatment	delta_treatment1	delta_treatment2	delta_treatment3	max_delta
0	0.415263	0.401823	0.465554	0.391658	2	-0.013440	0.050291	-0.023605	0.050291
1	0.412962	0.389346	0.476169	0.363343	2	-0.023616	0.063206	-0.049619	0.063206
2	0.533442	0.548670	0.589756	0.588654	2	0.015228	0.056313	0.055212	0.056313
3	0.344854	0.314433	0.370315	0.760676	3	-0.030420	0.025461	0.415822	0.415822
4	0.649657	0.602642	0.641364	0.851301	3	-0.047015	-0.008293	0.201644	0.201644
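
The extra columns in the full output can be reproduced from the per-group predictions. The sketch below copies the five rows shown above and recomputes the deltas, the maximum delta, and the recommended treatment. Reading `recommended_treatment` as an index into the class list with control at position 0 is an assumption inferred from the displayed output, not something taken from the library's documentation.

```python
import numpy as np

# Assumed class order, matching the column order of the output above
classes = ['control', 'treatment1', 'treatment2', 'treatment3']

# Per-group predictions copied from the first five rows above
probs = np.array([
    [0.415263, 0.401823, 0.465554, 0.391658],
    [0.412962, 0.389346, 0.476169, 0.363343],
    [0.533442, 0.548670, 0.589756, 0.588654],
    [0.344854, 0.314433, 0.370315, 0.760676],
    [0.649657, 0.602642, 0.641364, 0.851301],
])

# delta_<treatment> = prediction under that treatment minus prediction under control
deltas = probs[:, 1:] - probs[:, [0]]
max_delta = deltas.max(axis=1)

# +1 because column 0 of `probs` is the control group (assumed indexing)
recommended = deltas.argmax(axis=1) + 1
print(recommended)  # [2 2 2 3 3]
print([classes[i] for i in recommended])
```

Small differences in the last decimal place relative to the table are expected, because the copied values are already rounded.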

[13]:

y_pred = uplift_model.predict(df_test[x_names].values)

[14]:

y_pred.shape

[14]:

(800, 3)

[15]:

# Put the predictions into a DataFrame for a neater presentation.
# The output of `predict()` is a numpy array of shape [n_sample, n_treatment];
# it excludes the predictions for the control group.
result = pd.DataFrame(y_pred,
                      columns=uplift_model.classes_[1:])
result.head()

[15]:

	treatment1	treatment2	treatment3
0	-0.013440	0.050291	-0.023605
1	-0.023616	0.063206	-0.049619
2	0.015228	0.056313	0.055212
3	-0.030420	0.025461	0.415822
4	-0.047015	-0.008293	0.201644

Create the uplift curve

The performance of the model can be evaluated with the help of the uplift curve.

Create a synthetic population

The uplift curve is calculated on a synthetic population consisting of units that were in the control group and units that happened to be in the treatment group recommended by the model. We use this synthetic population to calculate the actual treatment effect within predicted treatment effect quantiles. Because the data is randomized, the predicted quantiles contain roughly equal numbers of treatment and control observations, and there is no self-selection into treatment groups.

[16]:

# If all deltas are negative, assign to control; otherwise assign to the treatment
# with the highest delta
best_treatment = np.where((result < 0).all(axis=1),
                           'control',
                           result.idxmax(axis=1))

# Create indicator variables for whether a unit happened to have the
# recommended treatment or was in the control group
actual_is_best = np.where(df_test['treatment_group_key'] == best_treatment, 1, 0)
actual_is_control = np.where(df_test['treatment_group_key'] == 'control', 1, 0)

[17]:

synthetic = (actual_is_best == 1) | (actual_is_control == 1)
synth = result[synthetic]
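
To see which units survive this filter, here is a toy sketch with four units and two treatments. All values and the names `toy_pred`, `toy_assigned`, `best` and `keep` are made up for illustration; the selection logic mirrors the two cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted uplifts for four units
toy_pred = pd.DataFrame({
    'treatment1': [0.02, -0.01, 0.05, -0.03],
    'treatment2': [0.04, -0.02, 0.01, 0.06],
})
# Hypothetical groups the units were actually randomized into
toy_assigned = pd.Series(['treatment2', 'treatment1', 'control', 'treatment1'])

# Recommend control when every predicted uplift is negative,
# otherwise the treatment with the highest predicted uplift
best = np.where((toy_pred < 0).all(axis=1),
                'control',
                toy_pred.idxmax(axis=1))

assigned = toy_assigned.to_numpy()
# Keep units that received the recommended treatment or were in control
keep = (assigned == best) | (assigned == 'control')
print(list(best))     # ['treatment2', 'control', 'treatment1', 'treatment2']
print(keep.tolist())  # [True, False, True, False]
```

Unit 0 is kept because it received its recommended treatment, and unit 2 is kept because it was in the control group. Units 1 and 3 received a treatment other than the recommended one, so they are dropped.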

Calculate the observed treatment effect per predicted treatment effect quantile

We use the observed treatment effect to calculate the uplift curve, which answers the question: how much of the total cumulative uplift could we have captured by targeting a subset of the population sorted according to the predicted uplift, from highest to lowest?

CausalML provides the plot_gain() function, which calculates the uplift curve from a DataFrame containing the treatment assignment, the observed outcome, and the predicted treatment effect.

[18]:

auuc_metrics = (synth.assign(is_treated = 1 - actual_is_control[synthetic],
                             conversion = df_test.loc[synthetic, 'conversion'].values,
                             uplift_tree = synth.max(axis=1))
                     .drop(columns=list(uplift_model.classes_[1:])))

[19]:

plot_gain(auuc_metrics, outcome_col='conversion', treatment_col='is_treated')

[Figure: gain curve produced by plot_gain() for the uplift_tree predictions]
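
For intuition about what such a curve measures, the following is a simplified sketch of a cumulative gain calculation. It is not necessarily the exact computation inside plot_gain(). The function name `cumulative_gain` and the toy data are hypothetical; the column names follow the ones used above.

```python
import numpy as np
import pandas as pd

def cumulative_gain(df, score_col, outcome_col='conversion', treatment_col='is_treated'):
    """Cumulative gain after targeting the top-k units, for k = 1..n."""
    # Rank units from highest to lowest predicted uplift
    ordered = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    treated = ordered[treatment_col]
    control = 1 - treated

    # Running counts and running outcome sums in each arm
    n_t, n_c = treated.cumsum(), control.cumsum()
    y_t = (ordered[outcome_col] * treated).cumsum()
    y_c = (ordered[outcome_col] * control).cumsum()

    # Difference in running mean outcomes; NaN while either arm is still empty
    lift = y_t / n_t - y_c / n_c
    # Scale by the number of units targeted so far
    return lift * (n_t + n_c)

# Hypothetical toy data: four units, already distinct in predicted uplift
toy = pd.DataFrame({
    'uplift_tree': [0.9, 0.8, 0.7, 0.6],
    'is_treated':  [1, 0, 1, 0],
    'conversion':  [1, 0, 0, 0],
})
gain = cumulative_gain(toy, score_col='uplift_tree')
print(gain.tolist())  # [nan, 2.0, 1.5, 2.0]
```

The first value is NaN because no control unit has been observed yet among the top-ranked units. A model that ranks units well should produce a curve that rises faster than a random ordering would.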
