Simulation of failures in high-performance computing systems under MPI-ULFM
Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Volume 4 (2015) no. 3, pp. 5-12
This article was harvested from the Math-Net.Ru source.

This paper considers one of the main problems in high-performance computing: continuing computations despite failures. For programs running on such systems, it is very important to handle failures and to continue computing on the nodes that remain operational. One aim of the MPI 3.1 standardization effort is to add new techniques, approaches, and concepts that support fault tolerance in MPI applications. The paper briefly describes a library for simulating failures and testing fault-tolerant algorithms using the functionality of the MPI 3.1 standard under development. In a test problem, we describe one fault-tolerance technique and compare checkpointing in main memory (RAM) with checkpointing in a distributed file system.
Keywords: parallel computing, fault tolerance, checkpoint, simulation of failures, MPI, ULFM.
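
The comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' library and uses no MPI or ULFM calls: a failure is imitated by a Python exception, and the computation rolls back to the last checkpoint, kept either in memory or in a file. All names here (`run`, `memory_store`, `file_store`, `SimulatedFailure`) are hypothetical, and a local temporary file only stands in for a distributed file system.

```python
import os
import pickle
import tempfile


class SimulatedFailure(Exception):
    """Raised to imitate the loss of a worker at a chosen step (assumption)."""


def memory_store():
    """Checkpoint kept in memory: save/load copy a small dict."""
    slot = {}

    def save(state):
        slot["ckpt"] = dict(state)

    def load():
        return dict(slot["ckpt"])

    return save, load


def file_store(path):
    """Checkpoint kept in a file; stands in for a distributed file system."""

    def save(state):
        with open(path, "wb") as f:
            pickle.dump(state, f)

    def load():
        with open(path, "rb") as f:
            return pickle.load(f)

    return save, load


def run(total_steps, fail_at, save, load, checkpoint_every=10):
    """Sum 0..total_steps-1 step by step, surviving one injected failure."""
    state = {"step": 0, "acc": 0}
    save(state)  # initial checkpoint so a rollback is always possible
    failed = False
    restarts = 0
    while state["step"] < total_steps:
        try:
            if state["step"] == fail_at and not failed:
                failed = True
                raise SimulatedFailure(state["step"])
            state["acc"] += state["step"]
            state["step"] += 1
            if state["step"] % checkpoint_every == 0:
                save(state)
        except SimulatedFailure:
            state = load()  # roll back to the last saved checkpoint
            restarts += 1
    return state["acc"], restarts


if __name__ == "__main__":
    save, load = memory_store()
    print("memory:", run(100, 25, save, load))
    with tempfile.TemporaryDirectory() as tmp:
        save, load = file_store(os.path.join(tmp, "ckpt.bin"))
        print("file:  ", run(100, 25, save, load))
```

Both variants produce the same result after one rollback; in a real system they would differ in the cost of `save` and `load`, which is what such a comparison would measure.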
@article{VYURV_2015_4_3_a0,
     author = {A. A. Bondarenko and M. V. Iakobovski},
     title = {Simulation of failures in high-performance computing systems under {MPI-ULFM}},
     journal = {Vestnik \^U\v{z}no-Uralʹskogo gosudarstvennogo universiteta. Seri\^a Vy\v{c}islitelʹna\^a matematika i informatika},
     pages = {5--12},
     year = {2015},
     volume = {4},
     number = {3},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VYURV_2015_4_3_a0/}
}
TY  - JOUR
AU  - A. A. Bondarenko
AU  - M. V. Iakobovski
TI  - Simulation of failures in high-performance computing systems under MPI-ULFM
JO  - Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
PY  - 2015
SP  - 5
EP  - 12
VL  - 4
IS  - 3
UR  - http://geodesic.mathdoc.fr/item/VYURV_2015_4_3_a0/
LA  - ru
ID  - VYURV_2015_4_3_a0
ER  - 
%0 Journal Article
%A A. A. Bondarenko
%A M. V. Iakobovski
%T Simulation of failures in high-performance computing systems under MPI-ULFM
%J Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
%D 2015
%P 5-12
%V 4
%N 3
%U http://geodesic.mathdoc.fr/item/VYURV_2015_4_3_a0/
%G ru
%F VYURV_2015_4_3_a0
A. A. Bondarenko; M. V. Iakobovski. Simulation of failures in high-performance computing systems under MPI-ULFM. Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Volume 4 (2015) no. 3, pp. 5-12. http://geodesic.mathdoc.fr/item/VYURV_2015_4_3_a0/

[1] F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, M. Snir, “Toward Exascale Resilience: 2014 update”, Supercomputing frontiers and innovations, 1:1 (2014), 1–28

[2] W. Bland, A. Bouteiller, T. Hérault, G. Bosilca, J. Dongarra, “Post-failure recovery of MPI communication capability: Design and rationale”, International Journal of High Performance Computing Applications, 27:3 (2013), 244–254

[3] ICL Fault Tolerance, (accessed: 01.03.2015) http://fault-tolerance.org/ulfm/ulfm-specification

[4] A. A. Bondarenko, M. V. Yakobovskiy, “Fault Tolerance for HPC by Using Local Checkpoints”, Bulletin of South Ural State University. Series: Computational Mathematics and Software Engineering, 3:3 (2014), 20–36

[5] Scientific Cluster of Keldysh Institute of Applied Mathematics RAS, (accessed: 01.03.2015) http://imm6.keldysh.ru/~informer/