A meaningful one-dimensional numerical prediction ("score") will always unfairly discriminate between groups (Kleinberg, Mullainathan, & Raghavan, 2016). This is one of those nice results that are both counterintuitive and easy to derive in their basic form. Meaningful here means calibrated, i.e. among individuals who receive a given score, the fraction with a positive outcome matches that score, so that multiplying the score by a factor corresponds to changing the observed frequency by the same factor. Data scientists who build production systems should be aware of this finding. Luckily, it is easy to check how grave the problem is if you know what to look for.
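In symbols, calibration within groups can be stated as follows (a standard formulation, my phrasing rather than a quote from the paper): $$ \Pr(y = 1 \mid \hat{y} = p,\ \text{group} = g) = p \quad \text{for every score } p \text{ and every group } g. $$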
The three basic conditions for fairness to hold are:

- Calibration within groups: among individuals who receive score $p$, a fraction $p$ has the positive label, in every group.
- Balance for the positive class: the average score of individuals with a positive label is the same across groups.
- Balance for the negative class: the average score of individuals with a negative label is the same across groups.
Why these three? Without calibration, the comparison between individuals with different scores would be meaningless. Balance for the negative class corresponds to (generalized) False Positive Rates (FPR) being the same across groups, and balance for the positive class to True Positive Rates (TPR). To see why fairness requires balance, imagine the contrary in an example where the score is the probability of default: if the FPR in one group is higher than in the other, creditworthy members of the first group will be systematically classified as more likely to default than creditworthy members of the second.
Let $\hat{y}_0$ be the average prediction for an observation with label 0 and $\hat{y}_1$ the average prediction for an observation with label 1. By balance, $\hat{y}_0$ and $\hat{y}_1$ are the same across groups. Let $n_0^g$ be the number of individuals with label 0 and $n_1^g$ the number of individuals with label 1 in group $g$. For simplicity, take only two groups, A and B. The average prediction in group A is $$ \hat{y}^A = \frac{n_0^A\times\hat{y}_0 + n_1^A\times\hat{y}_1}{n_0^A+n_1^A}. $$ By calibration, we must have $$ \hat{y}^A = \frac{n_1^A}{n_0^A+n_1^A}. $$ Combining the two conditions and rearranging, we get $$ \hat{y}_1 = 1 - \hat{y}_0 \times \tfrac{n_0^A}{n_1^A}. $$ Proceeding in the same manner for group B yields $$ \hat{y}_1 = 1 - \hat{y}_0 \times \tfrac{n_0^B}{n_1^B}. $$ There are only two conditions under which the two previous equations hold:

- Perfect prediction: $\hat{y}_0 = 0$ and $\hat{y}_1 = 1$, i.e. every individual is scored exactly according to their label.
- Equal base rates: $\tfrac{n_0^A}{n_1^A} = \tfrac{n_0^B}{n_1^B}$, i.e. the proportion of positive labels is the same in both groups.

Outside these two knife-edge cases, calibration and balance cannot hold simultaneously.
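To make the algebra concrete, here is a minimal numerical check with made-up group sizes (all numbers purely illustrative): pick any value for $\hat{y}_0$, compute the $\hat{y}_1$ that calibration forces in each group, and observe that the two values disagree whenever the base rates differ.

# Made-up group compositions with different base rates
n0_A, n1_A = 80, 20  # group A: 20% positive labels
n0_B, n1_B = 50, 50  # group B: 50% positive labels

y0 = 0.1  # assumed common average score of label-0 individuals (balance)

# y1 implied by calibration in each group: y1 = 1 - y0 * n0 / n1
y1_A = 1 - y0 * n0_A / n1_A
y1_B = 1 - y0 * n0_B / n1_B
print(y1_A, y1_B)  # 0.6 vs. 0.9: balance for the positive class is violated

Only perfect prediction ($\hat{y}_0 = 0$, $\hat{y}_1 = 1$) or equal base rates would make the two values agree.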
Let us now check how this result transfers to real data, using the Titanic dataset. The outcome will be survival and the groups will be passenger class. We will compare predictions between First and Third Class.
%load_ext nb_black
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [12.0, 8.0]
import lightgbm as lgb
import sklearn.metrics
from catboost.datasets import titanic
The results hold irrespective of whether the model is validated out of sample, so we will just focus on the training data set. Survival rates between passenger classes are clearly different.
df_titanic, _ = titanic()
df_titanic.groupby("Pclass")["Survived"].mean()
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
We now define the features used to predict survival. One might think that excluding passenger class would safeguard against discrimination via this feature ("fairness via ignorance"), but this is not the case. To illustrate, we will build a model without passenger class.
categorical_features = ["Embarked", "Sex"]
# cast to "category" so LightGBM treats these columns as categorical
for f in categorical_features:
    df_titanic[f] = df_titanic[f].astype("category")
features = ["Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]  # Pclass deliberately excluded
x = df_titanic[features]
y = df_titanic["Survived"]
dtrain = lgb.Dataset(x, label=y)
We now train a reasonably flexible model. Protecting against overfitting will not resolve the unfairness issue.
param = {
    "objective": "binary",
    "metric": "binary_logloss",
    "verbosity": -1,
    "boosting_type": "gbdt",
    "num_leaves": 164,
}
gbm = lgb.train(param, dtrain, num_boost_round=100)
df_titanic["p_pred"] = gbm.predict(x)
The model's probability scores are clustered near 0 and 1. Passengers from First Class are, on average, assigned a much higher survival probability than those from Third Class.
sns.histplot(data=df_titanic, x="p_pred", hue="Pclass", multiple="stack")
How is this possible, given that we excluded class as a predictor? The reason is that the model simply loads on features that are correlated with passenger class. Class is negatively correlated with both ticket fare and age (see the left plot below), and fare and age in turn are important features in terms of gain (their contribution to improving the objective). A model that does not have passenger class as a feature will load more heavily on age and fare and will still discriminate by class. If you care about discrimination by passenger class, you might even be better off keeping it as a feature, because at least that is more transparent!
fig, axs = plt.subplots(1, 2, figsize=(18, 8))
# .corr() only considers numeric columns, so the categorical features drop out
df_titanic[features + ["Pclass"]].corr()["Pclass"][:-1].plot.bar(
    ax=axs[0], title="Correlation with Pclass"
)
lgb.plot_importance(gbm, importance_type="gain", ax=axs[1])
Next, let us check calibration. We bin the predictions into quintiles and compare the average score with the average outcome within each bin:
df_titanic["p_pred_bin"] = pd.qcut(df_titanic["p_pred"], 5)
df_calibration = (
    df_titanic.groupby("p_pred_bin")
    .agg(
        avg_score=("p_pred", "mean"),
        avg_out=("Survived", "mean"),
        count=("p_pred", "count"),
    )
    .reset_index()
)
fig, ax = plt.subplots()
sns.scatterplot(data=df_calibration, x="avg_score", y="avg_out", marker="o", ax=ax, color="red")
ax.axline((0, 0), (1, 1))  # 45-degree line: perfect calibration
In a perfectly calibrated model, all dots would lie on the 45-degree line. The dots here look reasonably close to it. At the low end, there is slight overprediction; at the upper end, slight underprediction, which means that one could entertain a more flexible model. The model also looks well calibrated across passenger classes:
df_titanic.groupby("Pclass")[["Survived", "p_pred"]].mean()
| Pclass | Survived | p_pred |
| --- | --- | --- |
| 1 | 0.629630 | 0.623544 |
| 2 | 0.472826 | 0.459584 |
| 3 | 0.242363 | 0.250129 |
To check balance, we need to turn the probability scores into classifications. We choose the threshold that maximizes TPR - FPR (Youden's J statistic):
fpr, tpr, thresholds = sklearn.metrics.roc_curve(y, df_titanic["p_pred"])
threshold = thresholds[np.argmax(tpr - fpr)]
threshold
0.3433539564746816
We can now determine the True and False Positives and Negatives at the threshold:
df_titanic["pred"] = df_titanic["p_pred"] > threshold
df_titanic["tp"] = df_titanic["pred"] & df_titanic["Survived"]
df_titanic["fp"] = df_titanic["pred"] & ~df_titanic["Survived"]
df_titanic["tn"] = ~df_titanic["pred"] & ~df_titanic["Survived"]
df_titanic["fn"] = ~df_titanic["pred"] & df_titanic["Survived"]
Finally, the error counts, TPR, and TNR (one minus the FPR) per class are:
df_titanic.groupby("Pclass").apply(
lambda x: pd.Series(
{
# "perc_predict_survived": 100 * x["pred"].mean(),
"fp": x["fp"].sum(),
"fn": x["fn"].sum(),
"tpr": x["tp"].sum() / (x["tp"].sum() + x["fn"].sum()),
"tnr": x["tn"].sum() / (x["tn"].sum() + x["fp"].sum()),
}
)
)
| Pclass | fp | fn | tpr | tnr |
| --- | --- | --- | --- | --- |
| 1 | 8.0 | 0.0 | 1.000000 | 0.900000 |
| 2 | 6.0 | 3.0 | 0.965517 | 0.938144 |
| 3 | 21.0 | 7.0 | 0.941176 | 0.943548 |
The violation of balance is striking: as we saw above, passengers in First Class have a higher probability of survival than those in other classes. But even passengers in Third Class who survived are far less likely to be classified as survivors than passengers in First Class who survived. No survivor from First Class is wrongly classified as not surviving, but there are 7 survivors from Third Class who are wrongly classified as not surviving, as we can see in the fn column.
To summarize: you cannot have a model that is both well calibrated and balanced across groups. If you ensure that your model is well calibrated in training (as is standard procedure), you should at least check the TPR and FPR among sensitive groups (age, gender, ethnicity, ...). If the discrepancies are too large, you might instead want to use a simpler model that is not as well calibrated but spreads its predictions less between groups.
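As a template for such a check, here is a minimal sketch of a helper (the function name and signature are my own, not from any library) that reports TPR and FPR per group:

def rates_by_group(y_true, y_pred, group):
    # Hypothetical helper: TPR and FPR of binary predictions, per group
    d = pd.DataFrame(
        {"y": y_true.astype(bool), "pred": y_pred.astype(bool), "g": group}
    )
    return d.groupby("g").apply(
        lambda s: pd.Series(
            {
                "tpr": (s["pred"] & s["y"]).sum() / s["y"].sum(),
                "fpr": (s["pred"] & ~s["y"]).sum() / (~s["y"]).sum(),
            }
        )
    )

rates_by_group(df_titanic["Survived"], df_titanic["pred"], df_titanic["Pclass"])

Large gaps between groups in either column are the warning sign to look for.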
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.