News Release

Software approaches for resilience of high performance computing systems

Peer-Reviewed Publication

Higher Education Press

Image: Classification of typical resilience approaches

Credit: Jie Jia, Yi Liu, Guozhen Zhang, Yulin Gao & Depei Qian

High-performance computing (HPC) systems are critical to advancing scientific discovery and innovation in a variety of fields. However, as HPC systems become larger and more complex, they are also exposed to more frequent and more diverse faults that can impair both performance and correctness. How can we ensure that HPC systems run parallel programs correctly and efficiently? On 15 August 2024, a research team from Beihang University published a new study on this question in Frontiers of Computer Science, a journal co-published by Higher Education Press and Springer Nature.

The team conducts a comprehensive and systematic survey of existing software resilience approaches for HPC systems. They classify these approaches into five categories: checkpointing, replication, soft error resilience, algorithm-based fault tolerance (ABFT), and fault detection and prediction. They present and summarize the main techniques and systems in each category and discuss their advantages and limitations.

In addition, they identify challenges facing recently developed software resilience approaches for HPC systems, mainly in terms of scalability and support for heterogeneous architectures. They also highlight the emerging problem of fail-slow faults, which they say requires more attention in the future.

The paper aims to help researchers understand the progress and overall picture of HPC software resilience. It also provides some insights and directions for future research in this area.

