Publications of Torsten Hoefler
Yuyang Jin, Haojie Wang, Xiongchao Tang, Zhenhua Guo, Yaqian Zhao, Torsten Hoefler:
Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications
(IEEE Transactions on Parallel and Distributed Systems. Vol 36, Nr. 2, pages 308-325, Feb. 2025)
Abstract
It is challenging to scale parallel applications to modern supercomputers because of load imbalance, resource contention, and communications between processes. Profiling and tracing are two main performance analysis approaches for detecting these scalability bottlenecks. Profiling is low-cost but lacks detailed dependence for identifying root causes. Tracing records plentiful information but incurs significant overheads. To address these issues, we present ScalAna, which employs static analysis techniques to combine the benefits of profiling and tracing - it enables tracing's analyzability with overhead similar to profiling. ScalAna uses static analysis to capture program structures and data dependence of parallel applications, and leverages lightweight profiling approaches to record performance data during runtime. Then a parallel performance graph is generated with both static and dynamic data. Based on this graph, we design a backtracking detection approach to automatically pinpoint the root causes of scaling issues. We evaluate the efficacy and efficiency of ScalAna using several real applications with up to 704K lines of code and demonstrate that our approach can effectively pinpoint the root causes of scaling loss with an average overhead of 5.65% for up to 16,384 processes. By fixing the root causes detected by our tool, it achieves up to 33.01% performance improvement.
Documents
Publisher URL: https://ieeexplore.ieee.org/abstract/document/10734146
BibTeX
@article{,
  author={Yuyang Jin and Haojie Wang and Xiongchao Tang and Zhenhua Guo and Yaqian Zhao and Torsten Hoefler},
  title={{Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications}},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  year={2025},
  month={Feb.},
  pages={308-325},
  volume={36},
  number={2},
  source={http://www.unixer.de/~htor/publications/},
}