Title: UCCL - An Extensible Software Transport Layer for GPU Networking
Speaker: Yang Zhou (UC Davis) [webpage: https://yangzhou1997.github.io/]
Time: 10:00 am, Friday, November 7, 2025
Location: Ryder Hall 156
Online link: available upon request, or see the seminar email.

Abstract:

Fast-evolving machine learning workloads place increasing demands on networking, ranging from simple AllReduce to more challenging AlltoAll. However, host networking hardware such as RDMA NICs is evolving slowly: for example, DCQCN congestion control is broken and challenging to tune, and multipathing is not yet supported. This mismatch has severely hindered performance scaling and incurred additional costs.

We present UCCL, an extensible software transport layer for GPU networking. UCCL separates the data path and control path of existing RDMA NICs, and runs the transport's control logic efficiently in software on host CPUs. This extensibility allows us to implement a multipath transport with packet spraying to resolve ECMP collisions, a receiver-driven transport to handle network incast, and a selective retransmission scheme to handle packet loss. UCCL provides a drop-in replacement for NCCL (and RCCL), and outperforms NCCL by 2.3-3.3x for GPU collectives over RoCE and AWS EFA RDMA. UCCL is fully open source at https://github.com/uccl-project/uccl.

Bio:

Yang Zhou is an Assistant Professor of Computer Science at UC Davis. He was previously a postdoc at UC Berkeley SkyLab, working with Ion Stoica, and received his Ph.D. from Harvard University, advised by Minlan Yu and James Mickens. His research interests span both core systems and ML systems. He is a recipient of a Google Ph.D. Fellowship.