A meaningful one-dimensional numerical prediction ("score") will always unfairly discriminate between groups (Kleinberg, Mullainathan, & Raghavan, 2016). This is one of those nice results that are both counterintuitive and easy to derive in their basic form. Meaningful here means calibrated, i.e. among individuals who receive a given score, the fraction with a positive outcome matches that score, so that multiplying the score by a factor corresponds to changing the observed frequency by the same factor. Data scientists who build production systems should be aware of this finding. Luckily, it is easy to check how grave the problem is if you know what to look for.
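In symbols, calibration within groups can be stated as follows (a standard formulation, my phrasing rather than a quote from the paper): $$ \Pr(y = 1 \mid \hat{y} = p,\ \text{group} = g) = p \quad \text{for every score } p \text{ and every group } g. $$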
The three basic conditions for fairness to hold are:

- Calibration within groups: among individuals who receive score $p$, a fraction $p$ has the positive label, in every group.
- Balance for the positive class: the average score of individuals with a positive label is the same across groups.
- Balance for the negative class: the average score of individuals with a negative label is the same across groups.
Why these three? Without calibration, the comparison between individuals with different scores would be meaningless. Balance for the negative class corresponds to (generalized) False Positive Rates (FPR) being the same across groups, and balance for the positive class to True Positive Rates (TPR). To see why fairness requires balance, imagine the contrary in an example where the score is the probability of default: if the FPR in one group is higher than in the other, creditworthy members of the first group will be systematically classified as more likely to default than creditworthy members of the second.
Let $\hat{y}_0$ be the average prediction for an observation with label 0 and $\hat{y}_1$ the average prediction for an observation with label 1. By balance, $\hat{y}_0$ and $\hat{y}_1$ are the same across groups. Let $n_0^g$ be the number of individuals with label 0 and $n_1^g$ the number of individuals with label 1 in group $g$. For simplicity, take only two groups, A and B. The average prediction in group A is $$ \hat{y}^A = \frac{n_0^A\times\hat{y}_0 + n_1^A\times\hat{y}_1}{n_0^A+n_1^A}. $$ By calibration, we must have $$ \hat{y}^A = \frac{n_1^A}{n_0^A+n_1^A}. $$ Combining the two conditions and rearranging, we get $$ \hat{y}_1 = 1 - \hat{y}_0 \times \tfrac{n_0^A}{n_1^A}. $$ Proceeding in the same manner for group B yields $$ \hat{y}_1 = 1 - \hat{y}_0 \times \tfrac{n_0^B}{n_1^B}. $$ There are only two conditions under which the two previous equations hold:

- Perfect prediction: $\hat{y}_0 = 0$ and $\hat{y}_1 = 1$, i.e. every individual is scored exactly according to their label.
- Equal base rates: $\tfrac{n_0^A}{n_1^A} = \tfrac{n_0^B}{n_1^B}$, i.e. the proportion of positive labels is the same in both groups.

Outside these two knife-edge cases, calibration and balance cannot hold simultaneously.
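To make the algebra concrete, here is a minimal numerical check with made-up group sizes (all numbers purely illustrative): pick any value for $\hat{y}_0$, compute the $\hat{y}_1$ that calibration forces in each group, and observe that the two values disagree whenever the base rates differ.

# Made-up group compositions with different base rates
n0_A, n1_A = 80, 20  # group A: 20% positive labels
n0_B, n1_B = 50, 50  # group B: 50% positive labels

y0 = 0.1  # assumed common average score of label-0 individuals (balance)

# y1 implied by calibration in each group: y1 = 1 - y0 * n0 / n1
y1_A = 1 - y0 * n0_A / n1_A
y1_B = 1 - y0 * n0_B / n1_B
print(y1_A, y1_B)  # 0.6 vs. 0.9: balance for the positive class is violated

Only perfect prediction ($\hat{y}_0 = 0$, $\hat{y}_1 = 1$) or equal base rates would make the two values agree.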
Let us now check how this result transfers to real data, using the Titanic dataset. The outcome will be survival and the groups will be passenger class. We will compare predictions between First and Third Class.
%load_ext nb_black
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [12.0, 8.0]
import lightgbm as lgb
import sklearn.metrics
from catboost.datasets import titanic
The results hold irrespective of whether the model is validated out of sample, so we will just focus on the training data set. Survival rates between passenger classes are clearly different.
df_titanic, _ = titanic()
df_titanic.groupby("Pclass")["Survived"].mean()
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
We now define the features used to predict survival. One might think that excluding passenger class would safeguard against discrimination via this feature ("fairness via ignorance"), but this is not the case. To illustrate, we will build a model without passenger class.
categorical_features = ["Embarked", "Sex"]
# cast to "category" so LightGBM treats these columns as categorical
for f in categorical_features:
    df_titanic[f] = df_titanic[f].astype("category")
features = ["Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]  # Pclass deliberately excluded
x = df_titanic[features]
y = df_titanic["Survived"]
dtrain = lgb.Dataset(x, label=y)
We now train a reasonably flexible model. Protecting against overfitting will not resolve the unfairness issue.
param = {
    "objective": "binary",
    "metric": "binary_logloss",
    "verbosity": -1,
    "boosting_type": "gbdt",
    "num_leaves": 164,
}
gbm = lgb.train(param, dtrain, num_boost_round=100)
df_titanic["p_pred"] = gbm.predict(x)
The model's probability scores are clustered near 0 and 1. Passengers from First Class are, on average, assigned a much higher survival probability than those from Third Class.
sns.histplot(data=df_titanic, x="p_pred", hue="Pclass", multiple="stack")
How is this possible, given that we excluded class as a predictor? The reason is that the model simply loads on features that are correlated with passenger class. Class is negatively correlated with both ticket fare and age (see the left plot below), and fare and age in turn are important features in terms of gain (their contribution to improving the objective). A model that does not have passenger class as a feature will load more heavily on age and fare and will still discriminate by class. If you care about discrimination by passenger class, you might even be better off keeping it as a feature, because at least that is more transparent!
fig, axs = plt.subplots(1, 2, figsize=(18, 8))
# .corr() only considers numeric columns, so the categorical features drop out
df_titanic[features + ["Pclass"]].corr()["Pclass"][:-1].plot.bar(
    ax=axs[0], title="Correlation with Pclass"
)
lgb.plot_importance(gbm, importance_type="gain", ax=axs[1])
Next, let us check calibration. We bin the predictions into quintiles and compare the average score with the average outcome within each bin:
df_titanic["p_pred_bin"] = pd.qcut(df_titanic["p_pred"], 5)
df_calibration = (
    df_titanic.groupby("p_pred_bin")
    .agg(
        avg_score=("p_pred", "mean"),
        avg_out=("Survived", "mean"),
        count=("p_pred", "count"),
    )
    .reset_index()
)
fig, ax = plt.subplots()
sns.scatterplot(data=df_calibration, x="avg_score", y="avg_out", marker="o", ax=ax, color="red")
ax.axline((0, 0), (1, 1))  # 45-degree line: perfect calibration
In a perfectly calibrated model, all dots would lie on the 45-degree line. The dots here look reasonably close to it. At the low end, there is slight overprediction; at the upper end, slight underprediction, which means that one could entertain a more flexible model. The model also looks well calibrated across passenger classes:
df_titanic.groupby("Pclass")[["Survived", "p_pred"]].mean()
| Pclass | Survived | p_pred |
| --- | --- | --- |
| 1 | 0.629630 | 0.623544 |
| 2 | 0.472826 | 0.459584 |
| 3 | 0.242363 | 0.250129 |
To check balance, we need to turn the probability scores into classifications. We choose the threshold that maximizes TPR - FPR (Youden's J statistic):
fpr, tpr, thresholds = sklearn.metrics.roc_curve(y, df_titanic["p_pred"])
threshold = thresholds[np.argmax(tpr - fpr)]
threshold
0.3433539564746816
We can now determine the True and False Positives and Negatives at the threshold:
df_titanic["pred"] = df_titanic["p_pred"] > threshold
df_titanic["tp"] = df_titanic["pred"] & df_titanic["Survived"]
df_titanic["fp"] = df_titanic["pred"] & ~df_titanic["Survived"]
df_titanic["tn"] = ~df_titanic["pred"] & ~df_titanic["Survived"]
df_titanic["fn"] = ~df_titanic["pred"] & df_titanic["Survived"]
Finally, the error counts, TPR, and TNR (one minus the FPR) per class are:
df_titanic.groupby("Pclass").apply(
lambda x: pd.Series(
{
# "perc_predict_survived": 100 * x["pred"].mean(),
"fp": x["fp"].sum(),
"fn": x["fn"].sum(),
"tpr": x["tp"].sum() / (x["tp"].sum() + x["fn"].sum()),
"tnr": x["tn"].sum() / (x["tn"].sum() + x["fp"].sum()),
}
)
)
| Pclass | fp | fn | tpr | tnr |
| --- | --- | --- | --- | --- |
| 1 | 8.0 | 0.0 | 1.000000 | 0.900000 |
| 2 | 6.0 | 3.0 | 0.965517 | 0.938144 |
| 3 | 21.0 | 7.0 | 0.941176 | 0.943548 |
The violation of balance is striking: as we saw above, passengers in First Class have a higher probability of survival than those in other classes. But even passengers in Third Class who survived are far less likely to be classified as survivors than passengers in First Class who survived. No survivor from First Class is wrongly classified as not surviving, but there are 7 survivors from Third Class who are wrongly classified as not surviving, as we can see in the fn column.
To summarize: you cannot have a model that is both well calibrated and balanced across groups. If you ensure that your model is well calibrated in training (as is standard procedure), you should at least check the TPR and FPR among sensitive groups (age, gender, ethnicity, ...). If the discrepancies are too large, you might instead want to use a simpler model that is not as well calibrated but spreads its predictions less between groups.
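As a template for such a check, here is a minimal sketch of a helper (the function name and signature are my own, not from any library) that reports TPR and FPR per group:

def rates_by_group(y_true, y_pred, group):
    # Hypothetical helper: TPR and FPR of binary predictions, per group
    d = pd.DataFrame(
        {"y": y_true.astype(bool), "pred": y_pred.astype(bool), "g": group}
    )
    return d.groupby("g").apply(
        lambda s: pd.Series(
            {
                "tpr": (s["pred"] & s["y"]).sum() / s["y"].sum(),
                "fpr": (s["pred"] & ~s["y"]).sum() / (~s["y"]).sum(),
            }
        )
    )

rates_by_group(df_titanic["Survived"], df_titanic["pred"], df_titanic["Pclass"])

Large gaps between groups in either column are the warning sign to look for.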
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.