Publications of Torsten Hoefler
Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler:

 Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24), presented in Atlanta, GA, USA, pages 103:1-103:17, IEEE Press, ISBN: 9798350352917, Nov. 2024)

Abstract

In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
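As a rough illustration of where a traffic reduction of about 2x could come from, the sketch below compares per-node traffic for a textbook ring Allgather against an idealized multicast-based Allgather. This is a back-of-envelope model, not the paper's protocol or measurement method. It assumes that each node injects its shard once and the switch replicates it, that traffic is counted as bytes sent plus bytes received per NIC, and that reliability traffic (acknowledgements, retransmissions) is ignored. All function names are hypothetical.

```python
def ring_allgather_traffic(num_nodes: int, shard_bytes: int) -> dict:
    """Per-node bytes for a textbook ring Allgather (assumed baseline)."""
    steps = num_nodes - 1
    sent = steps * shard_bytes       # forward one shard per step
    received = steps * shard_bytes   # receive one shard per step
    return {"sent": sent, "received": received, "total": sent + received}


def multicast_allgather_traffic(num_nodes: int, shard_bytes: int) -> dict:
    """Per-node bytes for an idealized multicast Allgather.

    Assumption: each node injects its own shard exactly once and the
    network replicates it to all peers; reliability overhead is ignored.
    """
    sent = shard_bytes
    received = (num_nodes - 1) * shard_bytes
    return {"sent": sent, "received": received, "total": sent + received}


def traffic_ratio(num_nodes: int, shard_bytes: int) -> float:
    """Ring total divided by multicast total, i.e. 2 * (P - 1) / P."""
    ring = ring_allgather_traffic(num_nodes, shard_bytes)
    mcast = multicast_allgather_traffic(num_nodes, shard_bytes)
    return ring["total"] / mcast["total"]


if __name__ == "__main__":
    # 188 nodes matches the testbed size in the abstract; 1 MiB is arbitrary.
    print(traffic_ratio(188, 1 << 20))
```

Under these assumptions the ratio is 2(P - 1)/P, which approaches 2 as the node count P grows. Whether this is the accounting behind the 2x figure in the abstract is not stated there.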

Documents

Publisher URL: https://dl.acm.org/doi/10.1109/SC41406.2024.00109

BibTeX

@inproceedings{khalilov2024allgather,
  author={Mikhail Khalilov and Salvatore Di Girolamo and Marcin Chrapek and Rami Nudelman and Gil Bloch and Torsten Hoefler},
  title={{Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI}},
  year={2024},
  month={Nov.},
  pages={103:1--103:17},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24)},
  location={Atlanta, GA, USA},
  publisher={IEEE Press},
  isbn={9798350352917},
  source={http://www.unixer.de/~htor/publications/},
}


© Torsten Hoefler