Title: TrainVerify - Equivalence-Based Verification for Distributed LLM Training
Speaker: Yunchi Lu (University of Michigan) [webpage: https://luyunchi.github.io/]
Time: 10:00 am, October 31 (Friday), 2025
Location: Ryder Hall 156
Online link: provided upon request or see the seminar email.

Abstract:

Training large language models (LLMs) at scale requires parallel execution across thousands of devices, incurring enormous computational costs. Yet these costly distributed training runs are rarely verified, leaving them prone to silent errors and potentially wasting millions of GPU hours. We introduce TrainVerify, a system for verifiable distributed training of LLMs. Given a deep learning model's logical specification as the ground truth, TrainVerify formally verifies that a distributed parallel execution plan is mathematically equivalent to it. Direct verification is notoriously difficult due to the sheer scale of LLMs, which often involve billions of variables and highly intricate computation graphs. TrainVerify therefore introduces a stage-wise parallel verification algorithm and shape-reduction techniques that significantly reduce complexity while preserving formal correctness. TrainVerify scales to frontier LLMs, including the successful verification of the Llama3 405B and DeepSeek-V3 671B training plans.
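To give a flavor of the core idea, here is a minimal, hypothetical sketch (not TrainVerify's actual code or algorithm): it checks that a 2-way tensor-parallel matmul is equivalent to the logical, unsharded matmul, using tiny "shape-reduced" matrices in place of billion-parameter tensors.

```python
# Hypothetical mini-example: verify a 2-way tensor-parallel plan against
# the logical specification on deliberately small (shape-reduced) inputs.

def matmul(A, B):
    """Plain dense matrix multiply on lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1, 2, 3], [4, 5, 6]]                       # activations (reduced shape 2x3)
W = [[1, 0, 2, 1], [0, 1, 1, 0], [2, 1, 0, 1]]   # weights (reduced shape 3x4)

# Logical specification: a single matmul.
logical = matmul(X, W)

# Parallel plan: split W column-wise across two "devices",
# compute each shard locally, then concatenate the partial outputs.
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]
parallel = [r0 + r1 for r0, r1 in zip(matmul(X, W0), matmul(X, W1))]

# The plan is equivalent iff both paths produce identical results.
assert logical == parallel
print("parallel plan matches logical spec")
```

TrainVerify performs this kind of equivalence check symbolically and at the scale of full training graphs; the snippet above only illustrates the logical-vs-parallel comparison on concrete numbers.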

Bio:

Yunchi Lu is a Ph.D. student in Computer Science at the University of Michigan, advised by Prof. Ryan Huang. His research interests lie broadly in developing principled techniques to improve the correctness, observability, and reliability of large-scale systems. His current work focuses on addressing silent bugs in various components of machine learning systems.