Publications of Torsten Hoefler  
K. Kharbas, D. Kim, Torsten Hoefler and F. Mueller:
 
Assessing HPC Failure Detectors for MPI Jobs
   (In Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, presented in Munich, Germany, pages 81--88, IEEE Computer Society, ISBN: 978-0-7695-4633-9, Feb. 2012) 
 
 Abstract
    Reliability is one of the challenges faced by exascale
    computing. Components are poised to fail during large-scale executions given current mean time between failure
    (MTBF) projections. To cope with failures, resilience methods have been proposed as explicit or transparent
    techniques. For the latter techniques, this paper studies the
    challenge of fault detection.
    This work contributes a study on generic fault detection
    capabilities at the MPI level and beyond. The objective is
    to assess different detectors, which ultimately may or may
    not be implemented within the application’s runtime layer.
    A first approach utilizes a periodic liveness check while a
    second method promotes sporadic checks upon communication activities. The contributions of this paper are two-fold:
    (a) We provide generic interposing of MPI applications for
    fault detection. (b) We experimentally compare periodic
    and sporadic methods for liveness checking. We show that
    the sporadic approach, even though it imposes lower bandwidth requirements and utilizes lower frequency checking,
    results in equal or worse application performance than a
    periodic liveness test for larger numbers of nodes. We further
    show that performing liveness checks in separation from
    MPI applications results in lower overhead than interpositioning, as demonstrated by our prototypes. Hence, we
    promote separate periodic fault detection as the superior
    approach for fault detection.
 
 Documents
 BibTeX:
 @inproceedings{ftdetectors,
   author={K. Kharbas and D. Kim and Torsten Hoefler and F. Mueller},
   title={{Assessing HPC Failure Detectors for MPI Jobs}},
   year={2012},
   month={Feb.},
   pages={81--88},
   booktitle={Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing},
   location={Munich, Germany},
   publisher={IEEE Computer Society},
   isbn={978-0-7695-4633-9},
   source={http://www.unixer.de/~htor/publications/},
 }