As machine learning is deployed in all aspects of society, it has become increasingly important to ensure stakeholders understand and trust these models. Decision makers must have a clear understanding of the model behavior so they can diagnose errors and potential biases in these models, and decide when and how to employ them. However, most accurate models that are deployed in practice are not interpretable, making it difficult for users to understand where the predictions are coming from, and thus, difficult to trust.
Recent work on explanation techniques in machine learning offers an attractive solution: they provide intuitive explanations for “any” machine learning model by approximating complex machine learning models with simpler ones.
In this tutorial, we will discuss several post hoc explanation methods, and focus on their advantages and shortcomings. We will cover three families of techniques: (a) single instance gradient-based attribution methods (saliency maps), (b) model agnostic explanations via perturbations, such as LIME/SHAP and counterfactual explanations, and (c) surrogate modeling for global interpretability, such as MUSE. For each of these approaches, we will provide their problem setup, prominent methods, example applications, and finally, discuss their vulnerabilities and shortcomings. We will conclude the tutorial with an overview of future directions and a discussion on open research problems. We hope to provide a practical and insightful introduction to explainability in machine learning.