Title: Failure Diagnosis and Detection for Distributed Systems
Speaker: Yongle Zhang (Purdue University)
Time: 11:00 am, March 25 (Friday), 2022
Location: N/A
Online link: provided upon request or see the seminar email.


Distributed software systems have become the backbone of Internet services, and failures in production distributed systems have severe consequences. A 30-minute outage of Amazon in 2013 caused a two-million-dollar loss in revenue. Moreover, the frequency of failures rises with the increasing complexity of software systems. The year 2019 saw noticeably more Internet outages and is sometimes called the “year of outages”.

Failure diagnosis and detection at data center scale are especially difficult because distributed systems are complex, with numerous threads, processes, and nodes running concurrently. In this talk, I will introduce our efforts to automate the failure diagnosis procedure for distributed systems and our recent work on detecting upgrade failures in distributed systems.


Yongle Zhang is an Assistant Professor in the Computer Science Department at Purdue University. His research interests lie in building reliable software systems and in failure diagnosis for cloud systems.