
Affinity Workshop: Women in Machine Learning

Fair Active learning by exploiting causal data structure

Sindhu C M Gowda · Haoran Zhang · Marzyeh Ghassemi


Machine learning (ML) is increasingly used in high-stakes applications that impact society. Prior evidence suggests that these models may learn to rely on “shortcut” biases or spurious correlations. It is therefore important to ensure that ML models do not propagate biases found in training data. Further, collecting accurately labeled data can be challenging and costly. In this work, we design algorithms for fair active learning that carefully select data points to be labeled by exploiting their underlying causal structure, so as to balance model accuracy and fairness. We consider a pool-based setup, where the learner has access to a small pool of labeled data and a large pool of unlabeled data, both drawn from the same biased distribution. We look at two cases of confounding bias: a) the bias is available, and b) the bias is unknown or unavailable. For each class, we try to sample from the interventional distribution to eliminate the effect of bias on the acquired data points. Exploiting the causal structure of the underlying data, the approach first expresses the interventional distribution as a simple weighted kernel density estimate (KDE) to generate sampling weights. In each iteration, we generate weights for all labeled samples and then batch-sample unlabeled points from kernels centered on the labeled samples, selecting the kernel of labeled sample n with probability w_n, which ensures diversity of the collected samples. We compare our method against popular active learning baselines based on a) uncertainty, b) density, and c) diversity. We also compare it against models that implicitly regularize for fairness while acquiring samples either randomly or based on the entropy of the sample. We show that, on synthetically generated biased datasets, our method outperforms the baselines by a large margin on unbiased test sets, implying that the model learned by actively acquiring data based on its causal structure is unbiased. We wish to further extend the results to large datasets and deep learning models.
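The sampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the known-bias case with a discrete bias variable, a backdoor-adjustment form of the weights (w_n proportional to 1 / p(y_n | b_n) for labeled samples of the target class), and a Gaussian kernel with a fixed bandwidth. The function names `backdoor_weights` and `sample_batch`, and the `bandwidth` parameter, are hypothetical and chosen only for illustration.

```python
import numpy as np


def backdoor_weights(y, b, target_class):
    """Assumed weighting: w_n ∝ 1[y_n == target_class] / p_hat(y_n | b_n).

    y and b are the labels and discrete bias values of the labeled pool.
    The returned weights are normalized to sum to 1.
    """
    y = np.asarray(y)
    b = np.asarray(b)
    w = np.zeros(len(y), dtype=float)
    for n in range(len(y)):
        if y[n] != target_class:
            continue  # only samples of the target class act as kernel centers
        same_bias = b == b[n]
        # Empirical p(y = target_class | b = b_n); positive because sample n counts.
        p_y_given_b = np.mean(y[same_bias] == target_class)
        w[n] = 1.0 / p_y_given_b
    total = w.sum()
    if total == 0:
        raise ValueError("no labeled sample belongs to the target class")
    return w / total


def sample_batch(x_labeled, weights, x_pool, batch_size, bandwidth=1.0, rng=None):
    """Draw unlabeled pool indices from kernels centered on labeled samples.

    Each draw picks a labeled center n with probability weights[n], then picks
    a not-yet-chosen pool point with probability proportional to a Gaussian
    kernel around that center (an assumed kernel choice).
    """
    rng = np.random.default_rng(rng)
    x_labeled = np.asarray(x_labeled, dtype=float)
    x_pool = np.asarray(x_pool, dtype=float)
    available = np.ones(len(x_pool), dtype=bool)
    chosen = []
    for _ in range(min(batch_size, len(x_pool))):
        n = rng.choice(len(x_labeled), p=weights)
        sq_dist = np.sum((x_pool - x_labeled[n]) ** 2, axis=1)
        logits = -sq_dist / (2.0 * bandwidth**2)
        logits = np.where(available, logits, -np.inf)  # sample without replacement
        logits = logits - logits.max()  # stabilize the exponentials
        probs = np.exp(logits)
        probs = probs / probs.sum()
        idx = int(rng.choice(len(x_pool), p=probs))
        available[idx] = False
        chosen.append(idx)
    return chosen
```

Under these assumptions, a labeled sample whose class is rare within its bias group receives a larger weight, so its neighborhood in the unlabeled pool is queried more often. For the unknown-bias case, the weights would have to come from a different estimator, which this sketch does not cover.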
