Poster
Root Cause Analysis of Failures in Microservices through Causal Discovery
Azam Ikram · Sarthak Chakraborty · Subrata Mitra · Shiv Saini · Saurabh Bagchi · Murat Kocaoglu

Tue Nov 29 09:00 AM -- 11:00 AM (PST) @ Hall J #506
Most cloud applications use a large number of smaller sub-components (called microservices) that interact with each other in the form of a complex graph to provide the overall functionality to the user. While the modularity of the microservice architecture is beneficial for rapid software development, maintaining and debugging such a system quickly in cases of failure is challenging. We propose a scalable algorithm for rapidly detecting the root cause of failures in complex microservice architectures. The key ideas behind our novel hierarchical and localized learning approach are: (1) to treat the failure as an intervention on the root cause in order to detect it quickly, (2) to learn only the portion of the causal graph related to the root cause, thus avoiding a large number of costly conditional independence tests, and (3) to explore the graph hierarchically. The proposed technique is highly scalable and produces useful insights about the root cause, whereas traditional techniques become infeasible due to high computation time. Our solution is application agnostic and relies only on the data collected for diagnosis. For the evaluation, we compare the proposed solution with a modified version of the PC algorithm and with state-of-the-art root cause analysis methods. The results show a considerable improvement in top-$k$ recall while significantly reducing the execution time.
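
Ideas (1) and (2) in the abstract can be illustrated with a small sketch. This is not the authors' algorithm or code. It is a toy example under several assumptions: a hypothetical three-service chain A -> B -> C with linear Gaussian metrics, a failure injected at B, a binary indicator `f_node` marking samples collected during the failure, and a fixed threshold on the absolute (partial) correlation used as a crude stand-in for a proper conditional independence test. All function names (`simulate`, `find_root_cause`, and so on) are hypothetical.

```python
import math
import random


def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly controlling for z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def find_root_cause(metrics, f_node, threshold=0.1):
    """Return metrics whose dependence on the failure indicator persists.

    Assumption: |correlation| below `threshold` is treated as independence,
    and conditioning sets contain at most one other candidate metric.
    """
    # Only metrics that change with the failure are examined further,
    # so the rest of the graph is never learned.
    candidates = [m for m in metrics
                  if abs(corr(f_node, metrics[m])) >= threshold]
    roots = []
    for m in candidates:
        # A metric is explained away if conditioning on another candidate
        # removes its dependence on the failure indicator.
        explained = any(
            abs(partial_corr(f_node, metrics[m], metrics[o])) < threshold
            for o in candidates if o != m
        )
        if not explained:
            roots.append(m)
    return roots


def simulate(n=2000, seed=0):
    """Toy chain A -> B -> C; the failure shifts B's mechanism."""
    rng = random.Random(seed)
    data = {"A": [], "B": [], "C": []}
    f_node = []
    for failed in (0, 1):  # n normal samples, then n failure samples
        for _ in range(n):
            a = rng.gauss(0, 1)
            b = 0.8 * a + rng.gauss(0, 1) + 3.0 * failed
            c = 0.8 * b + rng.gauss(0, 1)
            data["A"].append(a)
            data["B"].append(b)
            data["C"].append(c)
            f_node.append(failed)
    return data, f_node


data, f_node = simulate()
print(find_root_cause(data, f_node))
```

In this toy setup, C also shifts during the failure, but its dependence on `f_node` vanishes once B is conditioned on, whereas B stays dependent on `f_node` when conditioning on C. With the effect sizes chosen here, the sketch is expected to report B alone. A real system would need a statistical test with a significance level and larger conditioning sets.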

Author Information

Azam Ikram (Purdue University)
Sarthak Chakraborty (Indian Institute of Technology Kharagpur)
Subrata Mitra (Adobe)
Shiv Saini (Adobe Systems)
Saurabh Bagchi (Purdue University)

Saurabh Bagchi is a Professor in the School of Electrical and Computer Engineering and the Department of Computer Science at Purdue University in West Lafayette, Indiana. He is the founding Director of a university-wide resilience center at Purdue called CRISP (2017-present). He is the recipient of the Alexander von Humboldt Research Award (2018), the Adobe Faculty Award (2017), the AT&T Labs VURI Award (2016), the Google Faculty Award (2015), and the IBM Faculty Award (2014). He was elected to the IEEE Computer Society Board of Governors for the 2017-19 term and re-elected in 2019. He is an ACM Distinguished Scientist (2013), a Senior Member of IEEE (2007) and of ACM (2009), and a Distinguished Speaker for ACM (2012). He is a co-lead on the $39M WHIN-SMART center at Purdue. Saurabh's research interest is in dependable computing and distributed systems. He is proudest of the 21 PhD students and 50 Masters thesis students who have graduated from his research group and who are in various stages of building wonderful careers in industry or academia. In his group, he and his students have way too much fun building and breaking real systems. Along the way this has led to 10 best paper awards or nominations at IEEE/ACM conferences. Saurabh received his MS and PhD degrees from the University of Illinois at Urbana-Champaign and his BS degree from the Indian Institute of Technology Kharagpur, all in Computer Science. He was selected as the inaugural International Visiting Professor at IIT Kharagpur in 2018.

Murat Kocaoglu (Purdue University)
