Workshop
Causal Machine Learning for Real-World Impact
Nick Pawlowski · Jeroen Berrevoets · Caroline Uhler · Kun Zhang · Mihaela van der Schaar · Cheng Zhang
Room 295–296
Causality has a long history, and provides many principled approaches to identifying a causal effect (or even distilling cause from effect). However, these approaches are often restricted to very specific situations and require very strong assumptions. This contrasts heavily with recent advances in machine learning. Real-world problems are not granted the luxury of strict assumptions, yet still require causal thinking to solve. Armed with the rigor of causality and the can-do attitude of machine learning, we believe the time is ripe to start working towards solving real-world problems.
Schedule
Fri 6:30 a.m. – 6:45 a.m.
Opening Remarks
Cheng Zhang · Mihaela van der Schaar

Fri 6:45 a.m. – 7:15 a.m.
Learning Causal Structures and Causal Representations from Data (Talk)
Peter Spirtes

Fri 7:15 a.m. – 8:00 a.m.
Panel Discussion
Cheng Zhang · Mihaela van der Schaar · Ilya Shpitser · Aapo Hyvarinen · Yoshua Bengio · Bernhard Schölkopf

Fri 8:00 a.m. – 8:45 a.m.
Poster Session

Fri 8:00 a.m. – 8:30 a.m.
Coffee Break

Fri 8:45 a.m. – 9:05 a.m.
Causal Discovery for Real-World Applications: A Case Study (Talk)
Stefan Bauer

Fri 9:05 a.m. – 9:25 a.m.
Learning Neural Causal Models (Talk)
Nan Rosemary Ke

Fri 9:30 a.m. – 9:45 a.m.
Discrete Learning of DAGs via Backpropagation (Talk)
Andrew Wren · Pasquale Minervini · Luca Franceschi · Valentina Zantedeschi

Fri 9:45 a.m. – 10:00 a.m.
Local Causal Discovery for Estimating Causal Effects (Talk)
Shantanu Gupta · David Childers · Zachary Lipton

Fri 10:00 a.m. – 10:15 a.m.
Exploiting Neighborhood Interference with Low-Order Interactions under Unit Randomized Design (Talk)
Mayleen Cortez · Matthew Eichhorn · Christina Yu

Fri 10:15 a.m. – 10:30 a.m.
Hydranet: A Neural Network for the Estimation of Multi-Valued Treatment Effects (Talk)
Borja Velasco · Jesus Cerquides · Josep Arcos

Fri 10:30 a.m. – 11:45 a.m.
Lunch Break

Fri 10:30 a.m. – 11:45 a.m.
Poster Session

Fri 11:45 a.m. – 12:15 p.m.
Causal ML for Medicines R&D (Talk)
Jim Weatherall

Fri 12:15 p.m. – 12:45 p.m.
Planning and Learning from Interventions in the Context of Cancer Immunotherapy (Talk)
Caroline Uhler

Fri 12:45 p.m. – 1:30 p.m.
Coffee Break

Fri 12:45 p.m. – 1:30 p.m.
Poster Session

Fri 1:30 p.m. – 2:00 p.m.
Stable Discovery of Interpretable Subgroups via Calibration in Causal Studies (Talk)
Bin Yu

Fri 2:00 p.m. – 2:15 p.m.
A Design-Based Riesz Representation Framework for Randomized Experiments (Talk)
Christopher Harshaw · Yitan Wang · Fredrik Sävje

Fri 2:15 p.m. – 2:30 p.m.
A Causal AI Suite for Decision-Making (Talk)
Emre Kiciman

Fri 2:30 p.m. – 2:45 p.m.
Causal Analysis of the TOPCAT Trial: Spironolactone for Preserved Cardiac Function Heart Failure (Talk)
Francesca Raimondi · Tadhg O'Keeffe · Andrew Lawrence · Tamara Stemberga · Andre Franca · Maksim Sipos · Javed Butler · Shlomo Ben-Haim

Fri 2:45 p.m. – 3:00 p.m.
Closing Remarks
Cheng Zhang · Mihaela van der Schaar


Evaluating the Impact of Geometric and Statistical Skews on Out-of-Distribution Generalization Performance (Poster)
Out-of-distribution (OOD) or domain generalization is the problem of generalizing to unseen distributions. Recent work suggests that the marginal difficulty of generalizing to OOD data over in-distribution data (the OOD-ID generalization gap) is due to spurious correlations, which arise from statistical and geometric skews and can be addressed by careful data augmentation and class balancing. We observe that after constructing a dataset in which we remove all conceivable sources of spurious correlation between interpretable factors, classifiers still fail to close the OOD-ID generalization gap.
Aengus Lynch · Jean Kaddour · Ricardo Silva


Targeted Causal Elicitation (Poster)
We study the problem of learning causal structure for a fixed downstream causal-effect optimization task. In contrast to previous work, which often focuses on running interventional experiments, we consider an often-overlooked source of information: a domain expert. In the Bayesian setting, this amounts to augmenting the likelihood with a user model whose parameters account for possible biases of the expert. Such a model allows for active elicitation in a manner that is most informative for the optimization task at hand.
Nazaal Ibrahim · ST John · Zhigao Guo · Samuel Kaski


Using Interventions to Improve Out-of-Distribution Generalization of Text-Matching Systems (Poster)
Given a user's input text, text-matching recommender systems output relevant items by comparing the input text to available items' descriptions, as in product-to-product recommendation on e-commerce platforms. As users' interests and item inventory are expected to change, it is important for a text-matching system to generalize to data shifts, a task known as out-of-distribution (OOD) generalization. However, we find that the popular approach of fine-tuning a large base language model on paired item-relevance data (e.g., user clicks) can be counterproductive for OOD generalization. For a product recommendation task, fine-tuning obtains worse accuracy than the base model when recommending items in a new category or for a future time period. To explain this generalization failure, we consider an intervention-based importance metric, which shows that a fine-tuned model captures spurious correlations and fails to learn the causal features that determine the relevance between any two text inputs. Moreover, standard methods for causal regularization do not apply in this setting, because, unlike in images, there exist no universally spurious features in a text-matching task (the same token may be spurious or causal depending on the text it is being matched to). For OOD generalization on text inputs, therefore, we highlight a different goal: avoiding high importance scores for certain features. We do so using an intervention-based regularizer that constrains the causal effect of any token on the model's relevance score to be similar to that of the base model. Results on Amazon product and three question-recommendation datasets show that our proposed regularizer improves generalization for both in-distribution and OOD evaluation, especially in difficult scenarios when the base model is not accurate.
Parikshit Bansal · Yashoteja Prabhu · Emre Kiciman · Amit Sharma


Exploiting Selection Bias on Underspecified Tasks in Large Language Models (Poster)
In this paper, we motivate the causal mechanisms behind sample-selection-induced collider bias (selection collider bias) that can cause large language models (LLMs) to learn unconditional dependence between entities that are unconditionally independent in the real world. We show that selection collider bias can become amplified in underspecified learning tasks and, although it is difficult to overcome, we describe a method to exploit the resulting spurious correlations to determine when a model may be uncertain about its prediction. We demonstrate an uncertainty metric that matches human uncertainty in tasks with gender-pronoun underspecification on an extended version of the Winogender Schemas evaluation set, and we provide online demos where users can evaluate spurious correlations and apply our uncertainty metric to their own texts and models. Finally, we generalize our approach to address a wider range of prediction tasks.
Emily McMilin


Making the World More Equal, One Ride at a Time: Studying Public Transportation Initiatives Using Interpretable Causal Inference (Poster)
The goal of low-income fare-subsidy programs is to increase equitable access to public transit, and in doing so, increase access to jobs, housing, education, and other essential resources. King County Metro, one of the largest transit providers focused on equitable public transit, has been innovative in launching new programs for low-income riders. However, due to the observational nature of data on ridership behavior in King County, evaluating the effectiveness of such innovative policies is difficult. In this work, we combined seven datasets from a variety of sources and applied a recent interpretable machine-learning-based causal inference matching method, FLAME, to evaluate one of King County Metro's largest programs implemented in 2020: the Subsidized Annual Pass (SAP). Using FLAME, we construct high-quality matched groups and identify features that are important for predicting ridership and re-enrollment. Our analysis provides clear and insightful feedback for policymakers. In particular, we found that SAP is effective in increasing long-term ridership and re-enrollment. Notably, there are pronounced positive treatment effects in populations that have higher access to public transit and jobs. Treatment effects are also more pronounced in the Asian population and in individuals aged 65+. Insights from this work can help inform public transportation policy decisions and generalize to other cities and other forms of transportation.
Gaurav Rajesh Parikh · Albert Sun · Jenny Huang · Lesia Semenova · Cynthia Rudin


Non-Stationary Causal Bandits (Poster)
The causal bandit problem is an extension of the conventional multi-armed bandit problem in which the available arms are not independent of each other but are instead related through a Bayesian graph. This extension is more natural, since day-to-day instances of bandits often have causal relations between their actions and hence are better represented as causal bandit problems. Moreover, the class of conventional multi-armed bandits lies within that of causal bandits, since any instance of the former can be modeled in the latter setting by using a Bayesian graph with all-independent variables. However, it is generally assumed that the probabilistic distributions in the Bayesian graph are stationary. In this paper, we design non-stationary causal bandit algorithms by equipping the current state of the art (mainly causal UCB, causal Thompson Sampling, causal KL-UCB, and Online Causal TS) with the restarted Bayesian online change-point detector (R-BOCPD). Experimental results show the minimization of regret when using optimal change-point detection.
Reda Alami
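The restart mechanism the abstract describes can be sketched in a few lines. Note this is only a sketch under stated assumptions: the paper couples *causal* bandit algorithms with the R-BOCPD change-point detector, whereas below a plain UCB1 rule and a crude windowed mean-shift test stand in for both; the window size and threshold are illustrative choices.

```python
import math

class RestartUCB:
    """UCB1 bandit that restarts its statistics when a change is detected.

    Sketch only: a windowed mean-shift test plays the role of the
    restarted Bayesian online change-point detector (R-BOCPD).
    """

    def __init__(self, n_arms, window=50, threshold=0.5):
        self.n_arms, self.window, self.threshold = n_arms, window, threshold
        self.reset()

    def reset(self):
        self.counts = [0] * self.n_arms
        self.means = [0.0] * self.n_arms
        self.recent = [[] for _ in range(self.n_arms)]  # sliding windows
        self.t = 0

    def select(self):
        for a in range(self.n_arms):          # play every arm once first
            if self.counts[a] == 0:
                return a
        return max(range(self.n_arms),
                   key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
        self.recent[arm].append(reward)
        if len(self.recent[arm]) > self.window:
            self.recent[arm].pop(0)
        # Restart if the recent window disagrees with the running mean.
        if (self.counts[arm] >= self.window
                and abs(sum(self.recent[arm]) / len(self.recent[arm])
                        - self.means[arm]) > self.threshold):
            self.reset()
```

On a two-armed problem whose best arm flips halfway through a run, the restart lets the learner abandon stale statistics and re-converge to the new best arm instead of being anchored by pre-change rewards.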


Can Active Sampling Reduce Causal Confusion in Offline Reinforcement Learning? (Poster)
Causal confusion is a phenomenon where an agent learns a policy that reflects imperfect spurious correlations in the data. The resulting causally confused behaviors may appear desirable during training but may fail at deployment. This problem gets exacerbated in domains such as robotics, with potentially large gaps between open- and closed-loop performance of an agent. In such cases, a causally confused model may appear to perform well according to open-loop metrics but fail catastrophically when deployed in the real world. In this paper, we conduct the first study of causal confusion in offline reinforcement learning and hypothesize that selectively sampling data points that may help disambiguate the underlying causal mechanism of the environment may alleviate causal confusion. To investigate this hypothesis, we consider a set of simulated setups to study causal confusion and the ability of active sampling schemes to reduce its effects. We provide empirical evidence that random and active sampling schemes are able to consistently reduce causal confusion as training progresses, and that active sampling is able to do so more efficiently than random sampling.
Gunshi Gupta · Tim G. J. Rudner · Rowan McAllister · Adrien Gaidon · Yarin Gal


A Causal AI Suite for Decision-Making (Poster)
Critical data science and decision-making questions across a wide variety of domains are fundamentally causal questions. The causal AI research area is still early in its development, however, and, as with any technology area, will require many more advances and iterative practical deployments to reach its full impact. We present a suite of open-source causal tools and libraries that aims to simultaneously provide core causal AI functionality to practitioners and create a platform for research advances to be rapidly deployed. In this paper, we describe our contributions towards such a comprehensive causal AI suite of tools and libraries, its design, and lessons we are learning from its growing adoption. We hope that our work accelerates use-inspired basic research for the improvement of causal AI.
Emre Kiciman · Eleanor Dillon · Darren Edge · Adam Foster · Joel Jennings · Chao Ma · Robert Ness · Nick Pawlowski · Amit Sharma · Cheng Zhang


Unit Selection: Learning Benefit Function from Finite Population Data (Poster)
The unit selection problem is to identify a group of individuals who are most likely to exhibit a desired mode of behavior, for example, selecting individuals who would respond one way if incentivized and a different way if not. The unit selection problem consists of evaluation and search subproblems. Li and Pearl defined the "benefit function" to evaluate the average payoff of selecting a certain individual with given characteristics. The search subproblem is then to design an algorithm to identify the characteristics that maximize the above benefit function. The hardness of the search subproblem arises from the large number of characteristics available for each individual and the sparsity of the data available in each cell of characteristics. In this paper, we present a machine learning framework that uses the bounds of the benefit function that are estimable from the finite population data to learn the bounds of the benefit function for each cell of characteristics. This allows us to easily obtain the characteristics that maximize the benefit function.
Ang Li · Song Jiang · Yizhou Sun · Judea Pearl


Neural Bayesian Network Understudy (Poster)
Bayesian networks may be appealing for clinical decision-making due to their inclusion of causal knowledge, but their practical adoption remains limited as a result of their inability to deal with unstructured data. While neural networks do not have this limitation, they are not interpretable and are inherently unable to deal with causal structure in the input space. Our goal is to build neural networks that combine the advantages of both approaches. Motivated by the perspective of injecting causal knowledge while training such neural networks, this work presents initial steps in that direction. We demonstrate how a neural network can be trained to output conditional probabilities, providing approximately the same functionality as a Bayesian network. Additionally, we propose two training strategies that allow encoding the independence relations inferred from a given causal structure into the neural network. We present initial results in a proof-of-concept setting, showing that the neural model acts as an understudy to its Bayesian network counterpart, approximating its probabilistic and causal properties.
Paloma Rabaey · Cedric De Boom · Thomas Demeester


Hydranet: A Neural Network for the Estimation of Multi-Valued Treatment Effects (Poster)
The clinical effectiveness aspect of the Health Technology Assessment (HTA) process often faces causal questions where the treatment variable can take multiple values. Nevertheless, most developments in causal inference algorithms that employ machine learning happen in binary treatment settings. In addition, there is a big gap between the algorithmic state of the art and the applied state of the art in this field. In this paper, we select a state-of-the-art, neural-network-based algorithm for binary treatment effect estimation and generalize it to a multi-valued treatment setting, testing it with semi-synthetic data that could mimic an HTA process. We obtain an estimator with desirable asymptotic properties and good results in experiments. To the best of our knowledge, this work opens the ground for benchmarking neural-network-based algorithms for multi-valued treatment effect estimation.
Borja Velasco · Jesus Cerquides · Josep Arcos


Deep End-to-end Causal Inference (Poster)
Causal inference is essential for data-driven decision making across domains such as business engagement, medical treatment, and policy making. However, research on causal discovery has evolved separately from causal inference, preventing straightforward combination of methods from both fields. In this work, we develop Deep End-to-end Causal Inference (DECI), a nonlinear additive-noise model with neural-network functional relationships that takes in observational data and can perform both causal discovery and inference, including conditional average treatment effect (CATE) estimation. We provide a theoretical guarantee that DECI can asymptotically recover the ground-truth causal graph and treatment effects when correctly specified. Our results show the competitive performance of DECI compared to relevant baselines for both causal discovery and (C)ATE estimation in over a thousand experiments on both synthetic datasets and causal machine learning benchmarks.
Tomas Geffner · Javier Antorán · Adam Foster · Wenbo Gong · Chao Ma · Emre Kiciman · Amit Sharma · Angus Lamb · Martin Kukla · Nick Pawlowski · Miltiadis Allamanis · Cheng Zhang
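The model class the abstract names, a nonlinear additive-noise SCM, and the way a recovered graph is reused for treatment-effect estimation can be illustrated with a hand-built example. The DAG X → T → Y (with X → Y), the functional forms, and the coefficients below are assumptions for illustration; DECI learns such functions with neural networks rather than fixing them by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, do_t=None):
    """Ancestral sampling over the assumed DAG X -> T -> Y, X -> Y.
    Passing `do_t` cuts T's structural equation, i.e. the intervention
    do(T = do_t); contrasting two such interventions gives the ATE."""
    x = rng.normal(size=n)
    if do_t is None:
        t = np.tanh(x) + 0.1 * rng.normal(size=n)       # T := f_T(X) + noise
    else:
        t = np.full(n, float(do_t))                     # intervened: T := do_t
    y = 2.0 * t + 0.5 * x + 0.1 * rng.normal(size=n)    # Y := f_Y(T, X) + noise
    return x, t, y

# Estimate the ATE of do(T=1) vs do(T=0); the ground truth here is 2.0.
_, _, y1 = sample(100_000, do_t=1.0)
_, _, y0 = sample(100_000, do_t=0.0)
ate = y1.mean() - y0.mean()
```

The key point the sketch makes concrete: once the graph and the structural functions are known (or learned), treatment effects follow by simulating the mutilated model, which is why discovery and inference can live in one pipeline.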



Contrastive Unsupervised Learning of World Model with Invariant Causal Features (Poster)
In this paper, we present a world model which learns causal features using the invariance principle. In particular, we use contrastive unsupervised learning to learn the invariant causal features, which enforces invariance across augmentations of irrelevant parts or styles of the observation. World-model-based reinforcement learning methods optimize representation learning and the policy independently, so a naive contrastive-loss implementation collapses due to a lack of supervisory signals to the representation learning module. We propose an intervention-invariant auxiliary task to mitigate this issue. Specifically, we utilize depth prediction to explicitly enforce the invariance and use data augmentation as style intervention on the RGB observation space. Our design leverages unsupervised representation learning to learn the world model with invariant causal features. Our proposed method significantly outperforms current state-of-the-art model-based and model-free reinforcement learning methods on out-of-distribution point-navigation tasks on the iGibson dataset. Moreover, our proposed model excels at sim-to-real transfer of our perception learning module.
Rudra PK Poudel · Harit Pandya · Roberto Cipolla


Toward Fair and Robust Optimal Treatment Regimes (Poster)
We propose a new framework for robust nonparametric estimation of optimal treatment regimes under flexible fairness constraints. Under standard regularity conditions, we show that the resulting estimators possess the double-robustness property. We use this framework to characterize the trade-off between fairness and the maximum welfare that is achievable by the optimal treatment policy.
Kwangho Kim · Jose Zubizarreta


Counterfactual Generation Under Confounding (Poster)
A machine learning model, under the influence of observed or unobserved confounders in the training data, can learn spurious correlations and fail to generalize when deployed. For image classifiers, augmenting a training dataset using counterfactual examples has been empirically shown to break spurious correlations. However, the counterfactual generation task itself becomes more difficult as the level of confounding increases. Existing methods for counterfactual generation under confounding consider a fixed set of interventions (e.g., texture, rotation) and are not flexible enough to capture diverse data-generating processes. We formally characterize the adverse effects of confounding on any downstream tasks and show that the correlation between generative factors can be used to quantitatively measure confounding. To minimize such correlation, we propose a counterfactual generation method that learns to modify the value of any attribute in an image and generate new images. Our method is computationally efficient, simple to implement, and works well for any number of generative factors and confounding variables. Our experimental results on both synthetic (MNIST variants) and real-world (CelebA) datasets show the usefulness of our approach.
Abbavaram Gowtham Reddy · Saloni Dash · Amit Sharma · Vineeth N Balasubramanian


A Causal Inference Framework for Network Interference with Panel Data (Poster)
We propose a framework for causal inference with panel data in the presence of network interference and unobserved confounding. Key to our approach is a novel latent factor model that takes into account network interference and generalizes the factor models typically used in panel data settings. We propose an estimator, the Network Synthetic Interventions estimator, and show that it consistently estimates the counterfactual outcomes for a unit under an arbitrary set of treatments, provided certain observation patterns hold in the data. We corroborate our theoretical findings with simulations. In doing so, our framework extends the Synthetic Control and Synthetic Interventions methods to incorporate network interference.
Sarah Cen · Anish Agarwal · Christina Yu · Devavrat Shah


Improving the Efficiency of the PC Algorithm by Using Model-Based Conditional Independence Tests (Poster)
Learning causal structure is useful in many areas of artificial intelligence, such as planning, robotics, and explanation. Constraint-based and hybrid structure learning algorithms such as PC use conditional independence (CI) tests to learn a causal structure. Traditionally, constraint-based algorithms perform the CI tests with a preference for smaller conditioning sets, partially because the statistical power of conventional CI tests declines substantially as the size of the conditioning set increases. However, many modern conditional independence tests are model-based, and these tests use well-regularized models that can perform well even with very large conditioning sets. This suggests an intriguing new strategy for constraint-based algorithms, which may reduce the total number of CI tests performed: test variable pairs with large conditioning sets first, as a preprocessing step that finds some conditional independencies quickly, before moving on to the more conventional strategy of testing with incrementally larger conditioning sets, beginning with marginal independence tests. We propose such a preprocessing step for the PC algorithm, which relies on performing CI tests on a few randomly selected large conditioning sets. We perform an empirical analysis on directed acyclic graphs (DAGs) that correspond to real-world systems, and both an empirical and theoretical analysis for Erdős-Rényi DAGs. Our results show that the PC algorithm with our preprocessing step performs far fewer CI tests than the original PC algorithm: between 0.5% and 20% of the CI tests that the PC algorithm alone performs. The efficiency gains are particularly significant for the DAGs corresponding to real-world systems.
Erica Cai · Andrew McGregor · David Jensen
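The preprocessing idea, test large conditioning sets first so that edges disappear before the usual small-to-large sweep, can be sketched directly. This is a sketch under stated assumptions: a plain linear-regression residual test with a Fisher z statistic stands in for the well-regularized model-based tests the abstract discusses, and the set sizes and thresholds are illustrative.

```python
import numpy as np

def ci_test(data, i, j, cond, z_crit=3.0):
    """Model-based CI test sketch: regress columns i and j on the
    conditioning set and apply a Fisher z test to the residual
    correlation. Returns True when independence is not rejected."""
    n = data.shape[0]
    if cond:
        Z = np.column_stack([np.ones(n), data[:, cond]])
        ri = data[:, i] - Z @ np.linalg.lstsq(Z, data[:, i], rcond=None)[0]
        rj = data[:, j] - Z @ np.linalg.lstsq(Z, data[:, j], rcond=None)[0]
    else:
        ri, rj = data[:, i], data[:, j]
    r = np.corrcoef(ri, rj)[0, 1]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    return abs(z) < z_crit

def large_set_prescreen(data, n_sets=3, seed=0):
    """Preprocessing step: for each pair, try a few randomly chosen
    *large* conditioning sets first; any independence found deletes the
    edge before an ordinary PC sweep would begin."""
    rng = np.random.default_rng(seed)
    p = data.shape[1]
    edges = {frozenset((i, j)) for i in range(p) for j in range(i + 1, p)}
    for i in range(p):
        for j in range(i + 1, p):
            others = [k for k in range(p) if k not in (i, j)]
            if not others:
                continue
            for _ in range(n_sets):
                size = max(1, len(others) // 2)  # a "large" subset
                cond = list(rng.choice(others, size=size, replace=False))
                if ci_test(data, i, j, cond):
                    edges.discard(frozenset((i, j)))
                    break
    return edges
```

On a linear chain X → Y → Z, for instance, the prescreen typically removes the X–Z pair (independent given Y) while the adjacent pairs survive for the standard PC phase.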


Identifying Causal Effects of Exercise on Irregular Heart Rhythm Events Using Wearable Device Data (Poster)
Wearable devices can passively monitor user health by tracking a set of metrics, including activity and heart rate. The Apple Watch introduced Irregular Rhythm Notifications (IRNs), which alert a user when the watch detects an arrhythmia over a sustained period that is highly suggestive of atrial fibrillation (AFib). Arrhythmias like AFib are often episodic, and episodes are suspected to have triggers like sleep changes, alcohol intake, or exercise. We study the proximal connection between Apple Exercise Minutes, a measure of moderate-to-strenuous exercise, and IRN events, using a causal observational study with data from the Apple Heart and Movement Study. We find that while increased exercise levels have a broadly protective effect, a large daily increase in exercise relative to a user's baseline increases the risk of an IRN on that day.
Lauren Hannah · Adam Bouyamourn


On Causal Rationalization (Poster)
With recent advances in natural language processing, rationalization has become an essential self-explaining paradigm that disentangles the black box by selecting a subset of input texts to account for the major variation in prediction. Yet, existing association-based approaches to rationalization cannot identify true rationales when two or more rationales are highly inter-correlated and thus provide a similar contribution to prediction accuracy, so-called spuriousness. To address this limitation, we bring two causal desiderata, non-spuriousness and efficiency, into rationalization from a causal inference perspective. We formally define the probability of causation in the rationale model, with its identification established as the main component of learning necessary and sufficient rationales. The superior performance of our causal rationalization is demonstrated on real-world review and medical datasets with extensive experiments, compared to state-of-the-art methods.
Wenbo Zhang · Tong Wu · Yunlong Wang · Yong Cai · Hengrui Cai


The Counterfactual-Shapley Value: Attributing Change in System Metrics (Poster)
Given an unexpected change in the output metric of a large-scale system, it is important to answer why the change occurred: which inputs caused the change in metric? A key component of such an attribution question is estimating the counterfactual: the (hypothetical) change in the system metric due to a specified change in a single input. However, due to inherent stochasticity and complex interactions between parts of the system, it is difficult to model an output metric directly. We utilize the computational structure of a system to break up the modelling task into sub-parts, such that each sub-part corresponds to a more stable mechanism that can be modelled accurately over time. Using the system's structure also helps to view the metric as a computation over a structural causal model (SCM), thus providing a principled way to estimate counterfactuals. Specifically, we propose a method to estimate counterfactuals using time-series predictive models and construct an attribution score, CF-Shapley, that is consistent with desirable axioms for attributing an observed change in the output metric. Unlike past work on causal Shapley values, our proposed method can attribute a single observed change in output (rather than a population-level effect) and thus provides more accurate attribution scores when evaluated on simulated datasets. As a real-world application, we analyze a query-ad matching system with the goal of attributing observed change in a metric for ad matching density. Attribution scores explain how query volume and ad demand from different query categories affect the ad matching density, uncovering the role of external events (e.g., "Cheetah Day") in driving the matching density.
Amit Sharma · Hua Li · Jian Jiao
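The Shapley-style attribution of a single observed metric change can be made concrete with a small exact computation. A sketch under stated assumptions: the paper's CF-Shapley replaces the plain input substitution below with counterfactual estimates from time-series models of each subsystem; here direct substitution of observed input values stands in, and the metric and input names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_attribution(metric, baseline, observed):
    """Exact Shapley attribution of metric(observed) - metric(baseline)
    across named inputs. For each input k, average its marginal
    contribution over all subsets S of the other inputs, weighted by
    |S|! * (n - |S| - 1)! / n! as in the standard Shapley formula."""
    keys = list(baseline)
    n = len(keys)
    scores = {}
    for k in keys:
        others = [o for o in keys if o != k]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                inputs = dict(baseline)
                for s in S:                 # move subset S to observed values
                    inputs[s] = observed[s]
                without_k = metric(inputs)
                inputs[k] = observed[k]     # then also move input k
                total += w * (metric(inputs) - without_k)
        scores[k] = total
    return scores
```

By the efficiency axiom, the scores sum exactly to the observed metric change, so an input that did not move (e.g., constant ad demand while query volume doubled) receives zero attribution.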


Beyond Central Limit Theorem for Higher-Order Inference in Batched Bandits (Poster)
Adaptive experiments have been gaining traction in a variety of domains, which stimulates a growing literature focusing on post-experimental statistical inference on data collected from such designs. Prior work constructs confidence intervals mainly based on two types of methods: (i) martingale concentration inequalities and (ii) asymptotic approximation of the distribution of test statistics; this work contributes to the second kind. Current asymptotic approximation methods, however, mostly rely on first-order limit theorems, which can have a slow convergence rate in a data-poor regime. Besides, established results often rely on conditions that the noise is well-behaved, which can be problematic when real-world instances are heavy-tailed or asymmetric. In this paper, we propose the first higher-order asymptotic expansion formula for inference on adaptively collected data, which generalizes normal approximation of the distribution of standard test statistics. Our theorem relaxes assumptions on the noise distribution and enjoys a fast convergence rate that accommodates small sample sizes. We complement our results with promising empirical performance in simulations.
Yechan Park · Ruohan Zhan · Nakahiro Yoshida


Valid Inference after Causal Discovery (Poster)
Causal graph discovery and causal effect estimation are two fundamental tasks in causal inference. While many methods have been developed for each task individually, statistical challenges arise when applying these methods jointly: estimating causal effects after running causal discovery algorithms on the same data leads to "double dipping," invalidating the coverage guarantees of classical confidence intervals. To this end, we develop tools for valid post-causal-discovery inference. One key contribution is a randomized version of the greedy equivalence search (GES) algorithm, which permits a valid, distribution-free correction of classical confidence intervals. We show that a naive combination of causal discovery and subsequent inference algorithms typically leads to highly inflated miscoverage rates; at the same time, our noisy GES method provides reliable coverage control while achieving more accurate causal graph recovery than data splitting.
Paula Gradu · Tijana Zrnic · Yixin Wang · Michael Jordan


Can Large Language Models Build Causal Graphs? (Poster)
Building causal graphs can be a laborious process. To ensure all relevant variables have been captured, researchers often have to discuss with clinicians and experts while also reviewing extensive relevant medical literature. By encoding common and medical knowledge, large language models (LLMs) represent an opportunity to ease this process by automatically scoring edges (i.e., connections between two variables) in potential graphs. LLMs, however, have been shown to be brittle to the choice of probing words, context, and prompt that the user employs. In this work, we evaluate whether LLMs can be a useful tool for speeding up causal graph development.
Stephanie Long · Tibor Schuster · Alexandre Piche


Counterfactual Decision Support Under Treatment-Conditional Outcome Measurement Error (Poster)
Growing work in algorithmic decision support proposes methods for combining predictive models with human judgment to improve decision quality. A challenge that arises in this setting is predicting the risk of a decision-relevant target outcome under multiple candidate actions. While counterfactual prediction techniques have been developed for these tasks, current approaches do not account for measurement error in observed labels. This is a key limitation because in many domains, observed labels (e.g., medical diagnoses, defendant re-arrest) serve as a proxy for the target outcome of interest (e.g., biological medical outcomes, recidivism). We develop a method for counterfactual prediction of target outcomes observed under treatment-conditional outcome measurement error (TC-OME). Our method minimizes risk with respect to target potential outcomes given access to observational data and estimates of measurement-error parameters. We also develop a method for estimating error parameters in cases where these are unknown in advance. Through a synthetic evaluation, we show that our approach achieves performance parity with an oracle model when measurement-error parameters are known and retains performance given moderate bias in error-parameter estimates.
Luke Guerdan · Amanda Coston · Kenneth Holstein · Steven Wu


Causal Estimation for Text Data with (Apparent) Overlap Violations
(
Poster
)
>
link
Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome, e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness left over so that every unit could have (not) received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and satisfies overlap. Adapting results on nonparametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline. 
Lin Gui · Victor Veitch 🔗 


Initial Results for Pairwise Causal Discovery Using Quantitative Information Flow
(
Poster
)
>
link
Pairwise causal discovery is the task of determining causal, anti-causal, confounded, or independence relationships from real-world datasets (i.e., pairs of variables). Over the last few years, this challenging task has motivated not only the development of novel machine learning models aimed at solving it, but also discussions on how learning the causal direction of variables may benefit machine learning overall. In this paper, we show that Quantitative Information Flow (QIF), a measure usually employed for measuring leakages of information from a system to an attacker, shows promising results as a feature for the causal discovery task. In particular, experiments with real-world datasets indicate that QIF is statistically tied with the state of the art. Our initial results motivate further inquiries into how QIF relates to causality and what its limitations are. 
Felipe Giori · Flavio Figueiredo 🔗 


Do-Operation Guided Causal Representation Learning with Reduced Supervision Strength
(
Poster
)
>
link
Causal representation learning has been proposed to encode causal relationships between factors present in high-dimensional data. Existing methods are limited to being trained with full supervision from ground-truth generative factors. In this paper, we seek to reduce supervision strength by leveraging interventions on either the cause factor or the effect factor. Applying interventions on cause factors and effect factors leads to different results: intervening on an effect factor changes the causal graph, whereas intervening on a cause factor leaves the relationships unchanged. Such an intervention is also known as a \emph{do-operation}. Based on this property of the \emph{do-operation}, we propose a framework called DoVAE, which implements the \emph{do-operation} by swapping latent cause factors and effect factors encoded from a pair of inputs, and utilizes the supervision signal from the pair by comparing the original inputs and their reconstructions. Moreover, we also identify the inadequacy of existing causal representation metrics and introduce new metrics for better evaluation. 
Jiageng Zhu · Hanchen Xie · Wael AbdAlmageed 🔗 


Mitigating input-causing confounding in multimodal learning via the backdoor adjustment
(
Poster
)
>
link
We adopt a causal perspective to address why multimodal learning often performs worse than unimodal learning. We put forth a structural causal model (SCM) for which multimodal learning is preferable over unimodal learning. In this SCM, which we call the multimodal SCM, a latent variable causes the inputs, and the inputs cause the target. We refer to this latent variable as the input-causing confounder. By conditioning on all inputs, multimodal learning $d$-separates the input-causing confounder and the target, resulting in a causal model that is more robust than the statistical model learned by unimodal learning. We argue that multimodal learning fails in practice because our finite datasets appear to come from an alternative SCM, which we call the spurious SCM. In the spurious SCM, the input-causing confounder and target are conditionally dependent given the inputs. This means that multimodal learning no longer $d$-separates the input-causing confounder and the target, and fails to estimate a causal model. We use a latent variable model to model the input-causing confounder, and test whether the undesirable dependence with the target is present in the data. We then use the same model to remove this dependence and estimate a causal model, which corresponds to the backdoor adjustment. We use synthetic data experiments to validate our claims.

Taro Makino · Krzysztof Geras · Kyunghyun Cho 🔗 
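The backdoor adjustment invoked above has a simple closed form in the discrete case. As a minimal sketch (a toy three-variable SCM of our own, not the authors' latent-variable model), adjusting for a confounder $Z$ amounts to reweighting $P(Y \mid X, Z)$ by the marginal $P(Z)$ rather than the conditional $P(Z \mid X)$:

```python
import numpy as np

# Toy discrete SCM: confounder Z -> X, Z -> Y, and X -> Y (all binary).
p_z = np.array([0.3, 0.7])                    # P(Z)
p_x_given_z = np.array([[0.9, 0.1],           # P(X | Z=0)
                        [0.2, 0.8]])          # P(X | Z=1)
p_y_given_xz = np.array([[[0.8, 0.2], [0.3, 0.7]],   # Z=0: rows X=0, X=1
                         [[0.6, 0.4], [0.1, 0.9]]])  # Z=1: rows X=0, X=1

def p_y1_do_x(x):
    """Backdoor adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) P(z)."""
    return sum(p_y_given_xz[z][x][1] * p_z[z] for z in range(2))

def p_y1_given_x(x):
    """Observational conditional, using P(z | x) by Bayes' rule."""
    p_zx = p_x_given_z[:, x] * p_z
    p_z_given_x = p_zx / p_zx.sum()
    return sum(p_y_given_xz[z][x][1] * p_z_given_x[z] for z in range(2))
```

With these numbers, the interventional quantity `p_y1_do_x(1)` differs from the observational `p_y1_given_x(1)`, which is exactly the gap the confounder opens up.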


Generalized Synthetic Control Method with State-Space Model
(
Poster
)
>
link
The synthetic control method (SCM) is a widely used approach to assess the treatment effect of a pointwise intervention for cross-sectional time-series data. The goal of SCM is to approximate the counterfactual outcomes of the treated unit as a combination of the control units' observed outcomes. Many studies propose a linear factor model as a parametric justification for the SCM that assumes the synthetic control weights are invariant across time. However, such an assumption does not always hold in practice. We propose a generalized SCM with time-varying weights based on a state-space model (GSCSSM), allowing for a more flexible and accurate construction of counterfactual series. GSCSSM recovers the classic SCM when the hidden weights are specified as constant. It applies Bayesian shrinkage for a two-way sparsity of the estimated weights across both the donor pool and time. On the basis of our method, we shed light on the role of auxiliary covariates, on nonlinear and non-Gaussian state-space models, and on prediction intervals based on time-series forecasting. We apply GSCSSM to investigate the impact of German reunification and of a mandatory certificate on COVID-19 vaccine compliance. 
Junzhe Shao · Mingzhang Yin · Xiaoxuan Cai · Linda Valeri 🔗 
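For reference, the classic time-invariant SCM step that this work generalizes, fitting convex weights over the donor pool, can be sketched as projected-gradient least squares over the probability simplex. The toy data and solver below are an illustrative sketch of the classic method only, not the paper's GSCSSM estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
T0, J = 40, 5   # pre-treatment periods, donor units

# Toy data: the treated unit is (nearly) a convex combination of donors.
Y_donors = rng.normal(size=(T0, J))
w_true = np.array([0.6, 0.3, 0.1, 0.0, 0.0])
y_treated = Y_donors @ w_true + 0.05 * rng.normal(size=T0)

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(v)) + 1.0))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

# Classic (time-invariant) SCM weights: projected gradient descent on
# 0.5 * ||y_treated - Y_donors w||^2 subject to w in the simplex.
w = np.full(J, 1.0 / J)
lr = 1.0 / np.linalg.norm(Y_donors, ord=2) ** 2
for _ in range(5000):
    grad = Y_donors.T @ (Y_donors @ w - y_treated)
    w = project_simplex(w - lr * grad)
```

The fitted `w` stays non-negative, sums to one, and (on this easy toy problem) recovers the generating weights closely.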


On counterfactual inference with unobserved confounding
(
Poster
)
>
link
Given an observational study with $n$ independent but heterogeneous units and one $p$-dimensional sample per unit containing covariates, interventions, and outcomes, our goal is to learn the counterfactual distribution for each unit. We consider studies with unobserved confounding, which introduces statistical biases between interventions and outcomes and exacerbates the heterogeneity across units. Modeling the underlying joint distribution as an exponential family and under suitable conditions, we reduce learning the $n$ unit-level counterfactual distributions to learning $n$ exponential family distributions with heterogeneous parameters and only one sample per distribution. We introduce a convex objective that pools all $n$ samples to jointly learn all $n$ parameters and provide a unit-wise mean squared error bound that scales linearly with the metric entropy of the parameter space. For example, when the parameters are $s$-sparse linear combinations of $k$ known vectors, the error is $O(s\log k/p)$. En route, we derive sufficient conditions for compactly supported distributions to satisfy the logarithmic Sobolev inequality.

Abhin Shah · Raaz Dwivedi · Devavrat Shah · Gregory Wornell 🔗 


Identifying causes of Pyrocumulonimbus (PyroCb)
(
Poster
)
>
link
A first causal discovery analysis from observational data of pyroCb (storm clouds generated from extreme wildfires) is presented. Invariant Causal Prediction was used to develop tools to understand the causal drivers of pyroCb formation. This includes a conditional independence test for testing $Y \perp\!\!\!\perp E \mid X$ for a binary variable $Y$ and multivariate, continuous variables $X$ and $E$, and a greedy-ICP search algorithm that relies on fewer conditional independence tests to obtain a smaller, more manageable set of causal predictors. With these tools we identified a subset of seven causal predictors which are plausible when contrasted with domain knowledge: surface sensible heat flux, relative humidity at 850 hPa, a component of wind at 250 hPa, 13.3 \textmu m thermal emissions, convective available potential energy, and altitude.

Emiliano Diaz · Kenza Tazi · Ashwin Braude · Daniel Okoh · Kara Lamb · Duncan WatsonParris · Paula Harder · Nis Meinert 🔗 


Rhino: Deep Causal Temporal Relationship Learning with History-dependent Noise
(
Poster
)
>
link
Discovering causal relationships between different variables from time series data has been a long-standing challenge for many domains. For example, in stock markets, the announcement of acquisitions by leading companies may have immediate effects on stock prices, as well as increasing the uncertainty of the future market due to this past action. This requires the model to take nonlinear relationships, instantaneous effects, and past-action-dependent uncertainty into account. We call the latter history-dependent noise. However, previous works do not offer a solution addressing all of these problems together. In this paper, we propose a structural equation model, called Rhino, which combines vector autoregression, deep learning, and variational inference to model nonlinear relationships with instantaneous effects and flexible history-dependent noise. Theoretically, we prove the structural identifiability for a generalization of Rhino. Our empirical results from extensive synthetic experiments and a real-world benchmark demonstrate better discovery performance compared to relevant baselines, with ablation studies revealing its robustness when Rhino is misspecified. 
Wenbo Gong · Joel Jennings · Cheng Zhang · Nick Pawlowski 🔗 


Causal Analysis of the TOPCAT Trial: Spironolactone for Preserved Cardiac Function Heart Failure
(
Poster
)
>
link
We describe the results of applying causal discovery methods to the data from a multi-site clinical trial on the Treatment of Preserved Cardiac Function Heart Failure with an Aldosterone Antagonist (TOPCAT). The trial was inconclusive, with no clear benefits consistently shown for the whole cohort. However, there were questions regarding the reliability of the diagnosis and treatment protocol for a geographic subgroup of the cohort. With the inclusion of medical context in the form of domain knowledge, causal discovery is used to demonstrate regional discrepancies and to frame the regional transportability of the results. Furthermore, we show that, globally and especially for some subgroups, the treatment has significant causal effects, thus offering a more refined view of the trial results. 
Francesca Raimondi · Tadhg O'Keeffe · Hana Chockler · Andrew Lawrence · Tamara Stemberga · Andre Franca · Maksim Sipos · Javed Butler · Shlomo BenHaim 🔗 


Conditional differential measurement error: partial identifiability and estimation
(
Poster
)
>
link
Differential measurement error, which occurs when the level of error in the measured outcome is correlated with the treatment, renders the causal effect unidentifiable from observational data. We study conditional differential measurement error, where a subgroup of the population is known to be prone to differential measurement error. Under an assumption about the direction (but not the magnitude) of the measurement error, we derive sharp bounds on the conditional average treatment effect and present an approach to estimate them. We empirically validate our approach on semi-synthetic and real data, showing that it gives more credible and informative bounds than other approaches. 
Pengrun Huang · Maggie Makar 🔗 


Active Bayesian Causal Inference
(
Poster
)
>
link
Causal discovery and causal reasoning are classically treated as separate and consecutive tasks: one first infers the causal graph, and then uses it to estimate causal effects of interventions. However, such a two-stage approach is uneconomical, especially in terms of actively collected interventional data, since the causal query of interest may not require a fully specified causal model. From a Bayesian perspective, it is also unnatural, since a causal query (e.g., the causal graph or some causal effect) can be viewed as a latent quantity subject to posterior inference; quantities that are not of direct interest ought to be marginalized out in this process, thus contributing to our overall uncertainty. In this work, we propose Active Bayesian Causal Inference (ABCI), a fully Bayesian active learning framework for integrated causal discovery and reasoning, i.e., for jointly inferring a posterior over causal models and queries of interest. In our approach to ABCI, we focus on the class of causally sufficient nonlinear additive Gaussian noise models, which we model using Gaussian processes. To capture the space of causal graphs, we use a continuous latent graph representation, allowing our approach to scale to practically relevant problem sizes. We sequentially design experiments that are maximally informative about our target causal query, collect the corresponding interventional data, update our beliefs, and repeat. Through simulations, we demonstrate that our approach is more data-efficient than existing methods that only focus on learning the full causal graph. This allows us to accurately learn downstream causal queries from fewer samples, while providing well-calibrated uncertainty estimates of the quantities of interest. 
Christian Toth · Lars Lorch · Christian Knoll · Andreas Krause · Franz Pernkopf · Robert Peharz · Julius von Kügelgen 🔗 


Bounding the Effects of Continuous Treatments for Hidden Confounders
(
Poster
)
>
link
Observational studies often seek to infer the causal effect of a treatment even though both the assigned treatment and the outcome depend on other confounding variables. An effective strategy for dealing with confounders is to estimate a propensity model that corrects for the relationship between covariates and assigned treatment. Unfortunately, the confounding variables themselves are not always observed, in which case we can only bound the propensity, and therefore bound the magnitude of causal effects. In many important cases, like administering a dose of some medicine, the possible treatments belong to a continuum. Sensitivity models, which are required to tie the true propensity to something that can be estimated, have been explored for binary treatments. We propose one for continuous treatments. We develop a framework to compute ignorance intervals on the partially identified dose-response curves, enabling us to quantify the susceptibility of an inference to hidden confounders. We show with real-world observational studies that our approach can give nontrivial bounds on causal effects from continuous treatments in the presence of hidden confounders. 
Myrl Marmarelis · Greg Ver Steeg · Neda Jahanshad · Aram Galstyan 🔗 


Local Causal Discovery for Estimating Causal Effects
(
Poster
)
>
link
Even when the causal graph underlying our data is unknown, we can nevertheless narrow down the possible values that an average treatment effect (ATE) can take by (1) identifying the graph up to a Markov equivalence class; and (2) estimating that ATE for each graph in the class. While the PC algorithm can identify this class under strong faithfulness assumptions, it can be computationally prohibitive. Fortunately, only the local graph structure around the treatment is required to identify an ATE, a fact exploited by local discovery algorithms to identify the possible values for an ATE more efficiently. In this paper, we introduce Local Discovery using Eager Collider Checks (LDECC), a new local discovery algorithm that finds colliders and orients the treatment's parents differently from existing methods. We show that there exist graphs where our algorithm exponentially outperforms existing local discovery algorithms and vice versa. Moreover, we show that LDECC and existing algorithms rely on different sets of faithfulness assumptions. We leverage this insight to show that it is possible to test and recover from certain faithfulness violations. 
Shantanu Gupta · David Childers · Zachary Lipton 🔗 
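The pipeline sketched in the abstract, orienting the treatment's parents and then adjusting for them, can be illustrated on a toy linear SCM (our own example, not the LDECC algorithm itself): once local discovery identifies $Z$ as a parent of the treatment, adjusting for it recovers the ATE, while the unadjusted regression stays confounded.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Linear SCM: Z -> T, Z -> Y, T -> Y, with true ATE of T on Y equal to 2.0.
z = rng.normal(size=n)
t = 0.8 * z + rng.normal(size=n)
y = 2.0 * t + 1.5 * z + rng.normal(size=n)

def ols_coef(y, cols):
    """Least-squares coefficients of y on the given columns (plus intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]  # drop the intercept

# With Z oriented as a parent of T, {Z} is a valid adjustment set; had the
# Z - T edge been left unoriented, we would report the set of ATEs over
# all candidate adjustment sets instead of a single value.
ate_adjusted = ols_coef(y, [t, z])[0]    # adjusts for Z -> close to 2.0
ate_unadjusted = ols_coef(y, [t])[0]     # confounded -> biased upward
```

The spread between `ate_unadjusted` and `ate_adjusted` is exactly the ambiguity that narrowing the Markov equivalence class removes.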


Partial identification without distributional assumptions
(
Poster
)
>
link
Causal effect estimation is important for numerous tasks in the natural and social sciences. However, identifying effects is impossible from observational data without making strong, often untestable assumptions which might not be applicable to real-world data. We consider algorithms for the partial identification problem, bounding the effects of multivariate, continuous treatments over multiple possible causal models when unmeasured confounding makes identification impossible. Even in the partial identification setting, most current work is only applicable in the discrete setting. We propose a framework which is applicable to continuous high-dimensional data. The observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. In particular, for the IV setting, we present ways by which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task. 
Kirtan Padh · Jakob Zeitler · David Watson · Matt Kusner · Ricardo Silva · Niki Kilbertus 🔗 


Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery
(
Poster
)
>
link
Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system’s causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel gradient-based intervention targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments on simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime. 
Mateusz Olko · Michał Zając · Aleksandra Nowak · Nino Scherrer · Yashas Annadani · Stefan Bauer · Łukasz Kuciński · Piotr Miłoś 🔗 


A Novel Two-level Causal Inference Framework for On-road Vehicle Quality Issues Diagnosis
(
Poster
)
>
link
In the automotive industry, the full cycle of managing in-use vehicle quality issues can take weeks to investigate. The process involves isolating root causes, defining and implementing appropriate treatments, and refining treatments if needed. The main pain point is the lack of a systematic method to identify causal relationships, evaluate treatment effectiveness, and direct the next actionable treatment if the current treatment is deemed ineffective. This paper shows how we leverage causal machine learning (ML) to speed up such processes. A real-world data set collected from on-road vehicles is used to demonstrate the proposed framework. Open challenges for vehicle quality applications are also discussed. 
Qian Wang · Huanyi Shui · Thi Tu Trinh Tran · Milad nezhad · devesh upadhyay · Kamran Paynabar · Anqi He 🔗 


A kernel balancing approach that scales to big data
(
Poster
)
>
link
In causal inference, weighting is commonly used for covariate adjustment. Procedurally, weighting can be accomplished either through methods that model the propensity score, or methods that use convex optimization to find the weights that balance the covariates directly. However, the computational demand of the balancing approach has to date precluded it from including broad classes of functions of the covariates in large datasets. To address this problem, we outline a scalable approach to balancing that incorporates a kernel representation of a broad class of basis functions. First, we use the Nystr\"{o}m method to rapidly generate a kernel basis in a reproducing kernel Hilbert space containing a broad class of basis functions of the covariates. Then, we integrate these basis functions as constraints in a state-of-the-art implementation of the alternating direction method of multipliers, which rapidly finds the optimal weights that balance the general basis functions in the kernel. Using this kernel balancing approach, we conduct a national observational study of the relationship between hospital profit status and treatment and outcomes of heart attack care in a large dataset containing 1.27 million patients and over 3,500 hospitals. After weighting, we observe that for-profit hospitals perform percutaneous coronary intervention at similar rates as other hospitals; however, their patients have slightly worse mortality and higher readmission rates. 
Kwangho Kim · Bijan Niknam · Jose Zubizarreta 🔗 
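The first step described above, generating a kernel basis with the Nyström method, can be sketched in a few lines. Toy data and parameter choices are ours; the paper's actual basis construction and balancing constraints may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 2, 50   # units, covariate dimension, Nystrom landmarks

X = rng.normal(size=(n, d))

def rbf(A, B, gamma=0.1):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Nystrom basis: phi = K_nm K_mm^{-1/2}, so phi @ phi.T approximates K.
idx = rng.choice(n, size=m, replace=False)
K_mm = rbf(X[idx], X[idx])
K_nm = rbf(X, X[idx])
w, V = np.linalg.eigh(K_mm)
keep = w > 1e-8 * w.max()                     # thresholded pseudo-inverse
phi = K_nm @ V[:, keep] @ np.diag(w[keep] ** -0.5)

# Quality check: the low-rank feature map reproduces the full kernel matrix.
K_full = rbf(X, X)
rel_err = np.linalg.norm(K_full - phi @ phi.T) / np.linalg.norm(K_full)
```

The columns of `phi` are the basis functions one would then feed into the balancing optimization as constraints, at cost linear in `n` rather than quadratic.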


Causal Bandits: Online DecisionMaking in Endogenous Settings
(
Poster
)
>
link
The deployment of Multi-Armed Bandits (MAB) has become commonplace in many economic applications. However, regret guarantees for even state-of-the-art linear bandit algorithms (such as Optimism in the Face of Uncertainty Linear bandit (OFUL)) make strong exogeneity assumptions w.r.t. arm covariates. This assumption is very often violated in many economic contexts, and using such algorithms can lead to suboptimal decisions. In this paper, we consider the problem of online learning in linear stochastic multi-armed bandit problems with endogenous covariates. We propose an algorithm we term BanditIV that uses instrumental variables to correct for this bias, and prove an $\tilde{\mathcal{O}}(k\sqrt{T})$ upper bound for the expected regret of the algorithm. Further, in economic contexts, it is also important to understand how the model parameters behave asymptotically. To this end, we additionally propose the $\epsilon$-\textit{BanditIV} algorithm and demonstrate its asymptotic consistency and normality while ensuring the same regret bound. Finally, we carry out extensive Monte Carlo simulations to demonstrate the performance of our algorithms compared to other methods. We show that BanditIV and $\epsilon$-BanditIV significantly outperform other existing methods.

Jingwen Zhang · Yifang Chen · Amandeep Singh 🔗 
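The correction at the heart of BanditIV is the classical instrumental-variable idea. A static two-stage least squares sketch (toy data of our own, not the bandit algorithm itself) shows how an instrument removes the endogeneity bias that plain least squares suffers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy endogenous setting: unobserved u confounds covariate x and reward y;
# instrument z shifts x but affects y only through x.
u = rng.normal(size=n)
z = rng.normal(size=n)
x = z + u + rng.normal(size=n)
y = 3.0 * x + 2.0 * u + rng.normal(size=n)   # true causal effect is 3.0

# Naive OLS is biased upward because cov(x, u) != 0.
beta_ols = (x @ y) / (x @ x)

# Two-stage least squares: project x onto z, then regress y on the fit.
x_hat = z * ((z @ x) / (z @ z))
beta_iv = (x_hat @ y) / (x_hat @ x_hat)
```

`beta_iv` recovers the true coefficient (here, equivalent to the ratio estimator cov(z, y) / cov(z, x)), while `beta_ols` absorbs the confounding term.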


Rethinking Neural Relational Inference for Granger Causal Discovery
(
Poster
)
>
link
Granger causal discovery aims to infer the underlying Granger causal relationships between pairs of variables in a multivariate time series system. Recent work has proposed using Neural Relational Inference (NRI), a latent graph inference model, for Granger causal discovery. However, the conditions under which NRI succeeds in recovering the true Granger causal graph remain unknown. In this work, we show how the mean field approximation inherent in NRI has significant implications for its ability to recover the Granger causal structure in multivariate time series. We illustrate this point theoretically and experimentally using a linear vector autoregressive model, an important benchmark in economic and financial studies. 
Stefanos Bennett · Rose Yu 🔗 
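As a point of reference for the VAR benchmark mentioned above: in a linear VAR(1), the Granger causal graph can be read off the lagged regression coefficients directly. This toy recovery (our own, unrelated to NRI's latent-graph machinery) is the baseline any latent-graph method is implicitly competing with:

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 5000, 3

# Ground-truth VAR(1): x1 -> x2 and x2 -> x3, plus self-lags.
A = np.array([[0.5, 0.0, 0.0],
              [0.4, 0.5, 0.0],
              [0.0, 0.4, 0.5]])
X = np.zeros((T, k))
for t in range(1, T):
    X[t] = X[t - 1] @ A.T + 0.1 * rng.normal(size=k)

# Estimate the lag-coefficient matrix by least squares: X_t ~ X_{t-1}.
B, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
A_hat = B.T

# Granger graph: an edge j -> i iff the lagged coefficient is non-negligible.
G_hat = (np.abs(A_hat) > 0.1).astype(int)
```

With enough data, `G_hat` matches the support of `A` exactly; the interesting question the paper raises is when a mean-field latent-graph model recovers the same support.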


Machine learning reveals how personalized climate communication can both succeed and backfire
(
Poster
)
>
link
Different advertising messages work for different people, and machine learning can be an effective way to personalise climate communications. In this paper, we use machine learning to reanalyse findings from a recent study, showing that online advertisements increased climate change belief in some people while decreasing it in others. In particular, we show that the effect of the advertisements could change depending on a person's age and ethnicity. Our findings have broad methodological and practical applications. 
Totte Harinen · Alexandre Filipowicz · Shabnam Hakimi · Rumen Iliev · Matt Klenk · Emily Sumner 🔗 
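The sign-flipping pattern described here is the textbook case of heterogeneous treatment effects. A toy subgroup CATE estimate (entirely synthetic data, not the study's) shows how an overall null effect can hide opposite subgroup effects:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic RCT: the ad raises belief for the 'young' subgroup and lowers
# it for the 'old' subgroup, so the population-average effect is ~0.
young = rng.random(n) < 0.5
treated = rng.random(n) < 0.5
tau = np.where(young, 1.0, -1.0)             # true subgroup effects
belief = 5.0 + tau * treated + rng.normal(size=n)

def diff_in_means(mask):
    """Treated-minus-control mean outcome within the masked subgroup."""
    return belief[mask & treated].mean() - belief[mask & ~treated].mean()

ate = diff_in_means(np.ones(n, dtype=bool))  # ~0: effects cancel out
cate_young = diff_in_means(young)            # ~ +1: ad succeeds
cate_old = diff_in_means(~young)             # ~ -1: ad backfires
```

An analysis that only reports `ate` would conclude the campaign did nothing; splitting by the moderating covariate reveals both the success and the backfire.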


Causal Reasoning in the Presence of Latent Confounders via Neural ADMG Learning
(
Poster
)
>
link
Latent confounding has been a long-standing obstacle for causal reasoning from observational data. One popular approach is to model the data using acyclic directed mixed graphs (ADMGs), which describe ancestral relations between variables using directed and bidirected edges. However, existing methods using ADMGs are based on either linear functional assumptions or a discrete search that is complicated to use and lacks computational tractability for large datasets. In this work, we further extend the existing body of work and develop a novel gradient-based approach to learning an ADMG with nonlinear functional relations from observational data. We first show that the presence of latent confounding is identifiable under the assumptions of bow-free ADMGs with nonlinear additive noise models. With this insight, we propose a novel neural causal model based on autoregressive flows. This not only enables us to model complex causal relationships behind the data, but also estimate their functional relationships (hence treatment effects) simultaneously. We further validate our approach via experiments on both synthetic and real-world datasets, and demonstrate the competitive performance against relevant baselines. 
Matthew Ashman · Chao Ma · Agrin Hilmkil · Joel Jennings · Cheng Zhang 🔗 


Causal Discovery using Marginal Likelihood
(
Poster
)
>
link
Causal discovery is an important problem in many fields, such as medicine, epidemiology, and economics, where causal structure is necessary to relay information about the effectiveness of treatments. Recently, causal structure has also been linked with generalisation and out-of-distribution generalisation in prediction tasks. This problem, however, is only solvable up to a Markov equivalence class without strong assumptions. Previous work has made assumptions on the data generation process to render the causal graph identifiable. These methods fail when the data generation assumptions no longer hold. In this work, we directly algorithmise the independence of causal mechanisms (ICM) assumption to achieve a flexible causal discovery algorithm. In the bivariate case, this is done by showing that independent parametrisation with independent priors encodes an ICM assumption. We show that this implies different marginal likelihoods for models of different causal directions. Using a Bayesian model selection procedure to take advantage of this, we show that our method outperforms competing methods. 
Anish Dhir · Mark van der Wilk 🔗 
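To give a flavour of direction identification by model comparison: the much simpler additive-noise heuristic below fits a nonlinear regression in both directions and compares the residual fit. This is a stand-in illustration on toy data only, not the paper's Bayesian marginal-likelihood procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Additive-noise ground truth: x causes y through a nonlinear map.
x = rng.normal(size=n)
y = x ** 3 + 0.1 * rng.normal(size=n)

def poly_mse(inp, out, deg=3):
    """Mean squared residual of a degree-`deg` polynomial fit of out on inp."""
    coefs = np.polyfit(inp, out, deg)
    return np.mean((out - np.polyval(coefs, inp)) ** 2)

mse_x_to_y = poly_mse(x, y)   # forward model fits cleanly
mse_y_to_x = poly_mse(y, x)   # backward model cannot fit the inverse map
```

The asymmetry in fit quality between the two directions is the kind of signal that a principled Bayesian model comparison, as in this paper, scores via marginal likelihoods rather than raw residuals.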


Deep Structural Causal Modelling of the Clinical and Radiological Phenotype of Alzheimer’s Disease
(
Poster
)
>
link
Alzheimer's disease (AD) has a poorly understood aetiology. Patients often have different rates and patterns of brain atrophy, and present at different stages along the natural history of their condition. This means that establishing the relationships between disease-related variables, and subsequently linking the clinical and radiological phenotypes of AD, is difficult. Investigating this link is important because it could ultimately allow for a better understanding of the disease process, and this could enable tasks such as treatment effect estimates, disease progression modelling, and better precision medicine for AD patients. We extend a class of deep structural causal models (DSCMs) to the clinical and radiological phenotype of AD, and propose an aetiological model of relevant patient demographics, imaging and clinical biomarkers, and cognitive assessment/educational scores based on specific current hypotheses in the medical literature. The trained DSCM produces biologically plausible counterfactuals relating to the specified disease covariates, and reproduces ground-truth longitudinal changes in magnetic resonance images of AD. Such a model could enable the assessment of the effects of intervening on variables outside a randomized controlled trial setting. In addition, by being explicit about how causal relationships are encoded, the framework provides a principled approach to define and assess hypotheses of the aetiology of AD. Code to replicate the experiment can be found at: $\href{https://github.com/aay993/counterfactual_AD}{Counterfactual AD.}$

Ahmed Abdulaal · Daniel C. Castro · Daniel Alexander 🔗 


Learning Causal Representations of Single Cells via Sparse Mechanism Shift Modeling
(
Poster
)
>
link
Latent variable models have become a go-to tool for analyzing biological data, especially in the field of single-cell genomics. One remaining challenge is the identification of individual latent variables related to biological pathways, more generally conceptualized as disentanglement. Although versions of variational autoencoders that explicitly promote disentanglement have been introduced and applied to single-cell genomics data, the theoretical feasibility of disentanglement from independent and identically distributed measurements has been challenged. Recent methods propose instead to leverage non-stationary data, as well as the sparse mechanism assumption, in order to learn disentangled representations with a causal semantic. Here, we explore the application of these methodological advances in the analysis of single-cell genomics data with genetic or chemical perturbations. We benchmark these methods on simulated single-cell expression data to evaluate their performance regarding disentanglement, causal target identification, and out-of-domain generalisation. Finally, by applying the approaches to a large-scale gene perturbation dataset, we find that the model relying on the sparse mechanism shift hypothesis surpasses contemporary methods on a transfer learning task. 
Romain Lopez · Nataša Tagasovska · Stephen Ra · Kyunghyun Cho · Jonathan Pritchard · Aviv Regev 🔗 


Amortized Inference for Causal Structure Learning
(
Poster
)
>
link
Learning causal structure poses a combinatorial search problem that typically involves evaluating structures with a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize causal structure learning. Rather than searching over structures, we train a variational inference model to predict the causal structure from observational or interventional data. This allows us to bypass both the search over graphs and the hand-engineering of suitable score functions. Instead, our inference model acquires domain-specific inductive biases for causal discovery solely from data generated by a simulator. The architecture of our inference model emulates permutation invariances that are crucial for statistical efficiency in structure learning, which facilitates generalization to significantly larger problem instances than seen during training. On synthetic data and semi-synthetic gene expression data, our models exhibit robust generalization capabilities when subject to substantial distribution shifts and significantly outperform existing algorithms, especially in the challenging genomics domain. 
Lars Lorch · Scott Sussex · Jonas Rothfuss · Andreas Krause · Bernhard Schölkopf 🔗 
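The amortization idea above can be illustrated in miniature: instead of searching over graphs, train a classifier on datasets drawn from a simulator, then predict structure on new data with a single forward pass. The sketch below is hypothetical and much simpler than the paper's model: it uses a two-variable cause/effect simulator and an invented hand-crafted skewness feature, and only captures the train-on-simulated-data, predict-with-one-pass pattern.

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_pair(label, m=200):
    """One tiny dataset from a two-variable SCM: cause ~ U(-1, 1),
    effect = cause**2 + noise. label=1 means (x, y) = (cause, effect)."""
    cause = rng.uniform(-1, 1, m)
    effect = cause ** 2 + rng.normal(0.0, 0.1, m)
    return (cause, effect) if label == 1 else (effect, cause)

def skewness(v):
    c = v - v.mean()
    return np.mean(c ** 3) / (np.std(v) ** 3 + 1e-12)

def featurize(x, y):
    # Invented asymmetry feature: under this simulator the effect
    # variable is markedly more skewed than the cause
    return np.array([1.0, abs(skewness(y)) - abs(skewness(x))])

# "Amortize": fit a logistic classifier on many simulator draws ...
train_labels = rng.integers(0, 2, 500)
train_feats = np.stack([featurize(*simulate_pair(l)) for l in train_labels])
w = np.zeros(2)
for _ in range(300):  # plain gradient descent on the logistic loss
    probs = 1.0 / (1.0 + np.exp(-train_feats @ w))
    w -= 0.5 * train_feats.T @ (probs - train_labels) / len(train_labels)

# ... then "discover" structure on fresh datasets with one forward pass
test_labels = rng.integers(0, 2, 200)
test_feats = np.stack([featurize(*simulate_pair(l)) for l in test_labels])
accuracy = np.mean((test_feats @ w > 0).astype(int) == test_labels)
```

The paper replaces the hand-crafted feature with a learned, permutation-invariant architecture, but the division of labor is the same: all search effort is paid once at training time.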


Discrete Learning Of DAGs Via Backpropagation
(
Poster
)
>
link
Recently, continuous relaxations have been proposed in order to learn directed acyclic graphs (DAGs) by backpropagation, instead of combinatorial optimization. However, a number of techniques for fully discrete backpropagation could instead be applied. In this paper, we explore this direction and propose DAG-DB, a framework for learning DAGs by Discrete Backpropagation, based on the architecture of Implicit Maximum Likelihood Estimation (I-MLE). DAG-DB performs competitively using either of two fully discrete backpropagation techniques, namely I-MLE itself or straight-through estimation. 
Andrew Wren · Pasquale Minervini · Luca Franceschi · Valentina Zantedeschi 🔗 
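Straight-through estimation, one of the two discrete backpropagation techniques mentioned, can be sketched in a few lines: sample hard 0/1 edge indicators on the forward pass, and on the backward pass pretend the sampling step was the identity. The toy below is a hypothetical two-node example that ignores the acyclicity constraint DAG-DB actually enforces; it just recovers a target adjacency matrix under a squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def straight_through_sample(logits):
    """Forward: sample hard 0/1 edges. Backward (below): pretend the
    sample equals its probability, so gradients pass straight through."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    hard = (rng.random(logits.shape) < probs).astype(float)
    return hard, probs

target = np.array([[0.0, 1.0],
                   [0.0, 0.0]])  # desired adjacency: one edge 0 -> 1
logits = np.zeros((2, 2))
lr = 1.0
for _ in range(200):
    hard, probs = straight_through_sample(logits)
    grad_hard = 2.0 * (hard - target)  # d/d(hard) of squared error
    # Straight-through surrogate: treat d(hard)/d(probs) as 1,
    # then chain through the sigmoid to reach the logits
    logits -= lr * grad_hard * probs * (1.0 - probs)

learned = (1.0 / (1.0 + np.exp(-logits)) > 0.5).astype(float)
```

In DAG-DB the same trick is applied to a structured distribution over DAGs rather than independent edge flips, and I-MLE offers an alternative surrogate gradient.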


Interventional Causal Representation Learning
(
Poster
)
>
link
The theory of identifiable representation learning aims to build general-purpose methods that extract high-level latent (causal) factors from low-level sensory data. Most existing works focus on identifiable representation learning with observational data, relying on distributional assumptions on latent (causal) factors. However, in practice, we often also have access to interventional data for representation learning. How can we leverage interventional data to help identify high-level latents? In this work, we explore the role of interventional data for identifiable representation learning. We study the identifiability of latent causal factors with and without interventional data, under minimal distributional assumptions on the latents. We prove that, if the true latent variables map to the observed high-dimensional data via a polynomial function, then representation learning via minimizing the standard reconstruction loss of autoencoders identifies the true latents up to an affine transformation. If we further have access to interventional data generated by hard do-interventions on some of the latents, then we can identify these intervened latents up to permutation, shift and scaling. 
Kartik Ahuja · Yixin Wang · Divyat Mahajan · Yoshua Bengio 🔗 
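The identifiability claims above can be stated compactly; the following is a paraphrase of the abstract, not the paper's exact theorem statement.

```latex
% Paraphrase of the identifiability claims (not the paper's exact statement)
Suppose $x = g(z)$ for a polynomial $g$, and let an encoder--decoder pair
$(\hat f, \hat g)$ drive the reconstruction loss
$\mathbb{E}\,\|x - \hat g(\hat f(x))\|^2$ to zero. Then the learned
representation satisfies
\[
  \hat f(x) = A z + b
\]
for some invertible matrix $A$ and vector $b$: the latents are identified
up to an affine transformation. Given additional data from hard
$\mathrm{do}$-interventions on a latent $z_i$, the corresponding learned
coordinate is identified up to permutation, shift and scaling.
```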


Exploiting Neighborhood Interference with Low Order Interactions under Unit Randomized Design
(
Poster
)
>
link
Network interference, where the outcome of an individual is affected by the treatment of others in their social network, is pervasive in real-world settings. However, it poses a challenge to estimating causal effects. We consider the task of estimating the total treatment effect (TTE), or the difference between the average outcomes of the population when everyone is treated versus when no one is, under network interference. Under a non-uniform Bernoulli randomized design, we utilize knowledge of the network structure to provide an unbiased estimator for the TTE when network interference effects are constrained to low-order interactions among neighbors of an individual. We make no assumptions on the graph other than bounded degree, allowing for well-connected networks that may not be easily clustered. We derive a bound on the variance of our estimator and show in simulated experiments that it performs well compared with standard TTE estimators. 
Mayleen Cortez · Matthew Eichhorn · Christina Yu 🔗 
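A minimal simulation makes the setting concrete. The estimator below is the classical Horvitz-Thompson estimator based on full-neighborhood exposure, i.e. one of the standard baselines the abstract compares against, not the paper's low-order estimator; the graph, linear outcome model and parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 0.5

# Bounded-degree graph: a ring plus a few chords (max degree 3 here)
nbrs = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
for i in range(0, n, 5):
    j = (i + n // 2) % n
    nbrs[i].add(j)
    nbrs[j].add(i)

alpha = rng.normal(0.0, 0.2, n)   # baseline outcomes
tau, beta = 1.0, 0.5              # direct effect, neighbor spillover

def outcomes(Z):
    """Linear (first-order) interference: own treatment plus spillovers."""
    Y = alpha + tau * Z
    for i in range(n):
        Y[:, i] += beta * Z[:, list(nbrs[i])].sum(axis=1)
    return Y

true_tte = tau + beta * np.mean([len(nbrs[i]) for i in range(n)])

# Bernoulli(p) unit-randomized design, R independent replications
R = 20000
Z = (rng.random((R, n)) < p).astype(float)
Y = outcomes(Z)

est = np.zeros(R)
for i in range(n):
    closed = [i] + list(nbrs[i])
    w1 = np.prod(Z[:, closed] / p, axis=1)                  # all-treated weight
    w0 = np.prod((1.0 - Z[:, closed]) / (1.0 - p), axis=1)  # all-control weight
    est += Y[:, i] * (w1 - w0) / n

# Unbiased: the mean over replications approaches the true TTE,
# but each weight scales like p**(-degree), which is what drives the
# high variance the paper's low-order estimator is designed to avoid.
```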


Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls
(
Poster
)
>
link
In this paper, we aim to develop a globally convergent and practically tractable optimization algorithm for the optimal experimental design problem with synthetic controls. Specifically, we consider a setting where pre-treatment outcome data is available: the average treatment effect is estimated via the difference between the weighted average outcomes of the treated and control units, where the weights are learned from the data observed during the pre-treatment periods. We find that, if the experimenter has the ability to select an optimal set of non-negative weights, the optimal experimental design problem is identical to a so-called phase synchronization problem. We solve this problem via a normalized variant of the generalized power method with spectral initialization. On the theoretical side, we establish the first global optimality guarantee for experimental design under a realizability assumption with linear fixed-effect models (also referred to as "interactive fixed-effect models"). These results are surprising, given that optimal experimental design, especially involving covariate matching, typically requires solving an NP-hard combinatorial optimization problem. Empirically, we apply our algorithm to US Bureau of Labor Statistics data and the Abadie-Diamond-Hainmueller California smoking data. The experiments demonstrate that our algorithm surpasses the random design by a large margin in terms of root mean square error. 
Yiping Lu · Jiajin Li · Lexing Ying · Jose Blanchet 🔗 
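The generalized power method with spectral initialization can be sketched on the simplest synchronization instance: recovering a sign vector z in {-1, +1}^n from a noisy rank-one matrix. This toy is real-valued and not the paper's experimental-design formulation; it only shows the init-then-iterate structure of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 50, 0.5
z = rng.choice([-1.0, 1.0], n)               # ground-truth signs
W = rng.normal(0.0, sigma, (n, n))
M = np.outer(z, z) + (W + W.T) / np.sqrt(2)  # noisy rank-one observation

# Spectral initialization: sign pattern of the leading eigenvector
x = np.sign(np.linalg.eigh(M)[1][:, -1])
x[x == 0] = 1.0

# Generalized power method: multiply by M, project back onto {-1, +1}^n
for _ in range(50):
    x_new = np.sign(M @ x)
    x_new[x_new == 0] = 1.0
    if np.array_equal(x_new, x):  # fixed point reached
        break
    x = x_new

accuracy = abs(x @ z) / n  # 1.0 means exact recovery up to a global sign
```

Each iteration costs one matrix-vector product plus a projection, which is what makes this family of methods "practically tractable" compared with combinatorial search.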


Investigating causal understanding in LLMs
(
Poster
)
>
link
We investigate the quality of causal world models of LLMs in very simple settings. We test whether LLMs can identify cause and effect in natural-language settings (taken from BIG-bench) such as “My car got dirty. I washed the car. Question: Which sentence is the cause of the other?”, as well as in multiple other toy settings. We probe the LLM's world model by changing the presentation of the prompt while keeping the meaning constant, e.g. by changing the order of the sentences or asking the opposite question. Additionally, we test whether the model can be “tricked” into giving wrong answers when we present the few-shot examples in a different pattern than the prompt. We have three findings. Firstly, larger models yield better results. Secondly, k-shot outperforms one-shot, and one-shot outperforms zero-shot, in standard conditions. Thirdly, LLMs perform worse in conditions where form and content differ. We conclude that the form of the presentation matters for LLM predictions or, in other words, that LLMs don't base their predictions solely on content. Finally, we detail some of the implications this research has for AI safety. 
Marius Hobbhahn · Tom Lieberum · David Seiler 🔗 
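A hypothetical helper illustrates the kind of meaning-preserving prompt rewrites described above (sentence reordering, opposite question); the exact wording is invented for illustration, not the paper's prompts.

```python
def prompt_variants(cause, effect):
    """Build meaning-preserving rewrites of a two-sentence causal item
    (hypothetical wording, illustrating the probes described above)."""
    base_q = "Question: Which sentence is the cause of the other?"
    flipped_q = "Question: Which sentence is the effect of the other?"
    return {
        "standard": f"{cause} {effect} {base_q}",
        "reordered": f"{effect} {cause} {base_q}",
        "opposite_question": f"{cause} {effect} {flipped_q}",
    }

variants = prompt_variants("My car got dirty.", "I washed the car.")
```

A model with a content-based world model should answer all three variants consistently; divergence across them is the signal the paper measures.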


A LargeScale Observational Study of the Causal Effects of a Behavioral Health Nudge
(
Poster
)
>
link
The Apple Watch encourages users to stand throughout the day by delivering a notification to the user's wrist if they have been sitting for the first 50 minutes of an hour. This simple behavioral intervention exemplifies the classical definition of a nudge as a choice architecture that alters behavior without forbidding options or significantly changing economic incentives. In order to estimate from observational data the causal effect of the notification on the user's standing probability throughout the day, we introduce a novel regression discontinuity design for time series data with time-varying treatment. Using over 76 billion minutes of private and anonymous observational standing data from more than 160,000 subjects enrolled in the public Apple Heart and Movement Study from 2019 to 2022, we show that the nudge increases the probability of standing by up to 49.5% across the studied population. The nudge is similarly effective for participants self-identified as male or female, and it is more effective in older people, increasing the standing probability of people over 75 years old by more than 60%. We also demonstrate that closing Apple Watch Activity Rings, another simple choice architecture that visualizes the participant's daily progress in Move, Exercise, and Stand, correlates with users' response to the intervention; for users who close their Activity Rings regularly, the standing nudge almost triples their probability of standing. This observational study, one of the largest of its kind exploring the causal effects of nudges in the general population, demonstrates the effectiveness of simple behavioral health interventions and introduces a novel application of regression discontinuity design, extended here to time-varying treatments. 
Achille Nazaret · Guillermo Sapiro 🔗 
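The core of a sharp regression discontinuity estimate, a jump in the outcome at the cutoff estimated by separate local linear fits on each side, can be sketched on simulated data. The numbers and the static design below are invented for illustration and deliberately simpler than the paper's extension to time-varying treatments.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: minutes sat within the hour is the running variable; standing
# probability jumps at a hypothetical 50-minute notification cutoff
cutoff = 50.0
minutes = rng.uniform(30, 70, 20000)
p_stand = 0.4 - 0.004 * (minutes - cutoff) + 0.2 * (minutes >= cutoff)
stood = (rng.random(20000) < p_stand).astype(float)

def rdd_estimate(x, y, cutoff, bandwidth):
    """Sharp RDD: fit a line on each side of the cutoff within the
    bandwidth; the effect is the gap between the two fitted intercepts."""
    intercepts = []
    for mask in ((x >= cutoff - bandwidth) & (x < cutoff),
                 (x >= cutoff) & (x < cutoff + bandwidth)):
        X = np.column_stack([np.ones(mask.sum()), x[mask] - cutoff])
        coef, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        intercepts.append(coef[0])  # fitted value at the cutoff
    return intercepts[1] - intercepts[0]

effect = rdd_estimate(minutes, stood, cutoff, bandwidth=10.0)  # true jump 0.2
```

The validity argument is the usual RDD one: units just below and just above the cutoff are comparable, so the discontinuity in the fitted outcome isolates the causal effect of crossing it.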


Variational Causal Inference
(
Poster
)
>
link
Estimating an individual's potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, impulse responses, human faces) and covariates are relatively limited. In this case, to construct one's outcome under a counterfactual treatment, it is crucial to leverage the individual information contained in the observed factual outcome on top of the covariates. We propose a deep variational Bayesian framework that rigorously integrates two main sources of information for outcome construction under a counterfactual treatment: one source is the individual features embedded in the high-dimensional factual outcome; the other source is the response distribution of similar subjects (subjects with the same covariates) that factually received this treatment of interest. 
Yulun Wu · Layne Price · Zichen Wang · Vassilis Ioannidis · Rob Barton · George Karypis 🔗 