Publications of Torsten Hoefler
Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler:
Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI
(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24), presented in Atlanta, GA, USA, pages 103:1-103:17, IEEE Press, ISBN: 9798350352917, Nov. 2024)
Publisher Reference
Abstract: In the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
Documents
Publisher URL: https://dl.acm.org/doi/10.1109/SC41406.2024.00109

BibTeX
@inproceedings{khalilov2024allgather,
  author={Mikhail Khalilov and Salvatore Di Girolamo and Marcin Chrapek and Rami Nudelman and Gil Bloch and Torsten Hoefler},
  title={{Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI}},
  year={2024},
  month={Nov.},
  pages={103:1-103:17},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24)},
  location={Atlanta, GA, USA},
  publisher={IEEE Press},
  isbn={9798350352917},
  source={http://www.unixer.de/~htor/publications/},
}