Fault tolerance for HPC by using local checkpoints
    
    
  
  
  
      
      
      
        
Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Vol. 3 (2014) no. 3, pp. 20-36
    
  
  
  
  
  
    
      
      
        
      
      
      
    See the article record from the source Math-Net.Ru
            
One of the main problems in high-performance computing is continuing computations despite failures. In this paper, we consider the main definitions relating to dependability, briefly review failure rates for distributed systems, and survey rollback-recovery approaches. The classic fault-tolerance technique used in parallel applications is the coordinated checkpointing protocol. This protocol takes a consistent global checkpoint by capturing the local state of each process simultaneously and saving it to a parallel file system via I/O nodes. However, as the number of compute nodes increases and the size of applications grows, the performance overhead of this protocol can reach an unacceptable level. A solution to this problem is to use local storage for checkpointing. To provide protection, it is necessary to duplicate checkpoints to other local storage devices. In this work, we develop a user-level approach and present a scheme for checkpointing to local storage. We prove that if the number of failures is less than the maximum allowable value for the scheme, then it is possible to recover from a consistent global checkpoint.
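The idea of duplicating local checkpoints onto other nodes can be illustrated with a small sketch. The abstract does not describe the authors' actual scheme, so everything below is an assumption made for illustration: the ring placement rule, the function names, and the parameters `n_nodes` and `n_copies` are all hypothetical. In this toy placement, each node keeps its own checkpoint and sends copies to the next `n_copies` nodes on a ring, and a global checkpoint is treated as recoverable when every process's checkpoint survives on at least one node that has not failed.

```python
from itertools import combinations


def holders(rank, n_nodes, n_copies):
    """Nodes storing the checkpoint of `rank` under the assumed ring placement:
    the node itself plus the next `n_copies` nodes (requires n_copies < n_nodes)."""
    return {(rank + offset) % n_nodes for offset in range(n_copies + 1)}


def recoverable(failed, n_nodes, n_copies):
    """True if every process's checkpoint survives on at least one live node."""
    failed = set(failed)
    return all(holders(rank, n_nodes, n_copies) - failed
               for rank in range(n_nodes))


def tolerates_all(n_nodes, n_copies, n_failures):
    """Exhaustively check every failure set of size `n_failures` (small n only)."""
    return all(recoverable(failed, n_nodes, n_copies)
               for failed in combinations(range(n_nodes), n_failures))
```

Under this assumed placement each checkpoint lives on `n_copies + 1` distinct nodes, so any set of at most `n_copies` failures leaves a surviving copy, while `n_copies + 1` adjacent failures can destroy one checkpoint entirely. That mirrors, in miniature, the kind of "maximum allowable number of failures" bound the abstract refers to.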
			
            
            
            
          
        
      
                  
                    
                    
                    
                    
                    
                      
Keywords: 
parallel computing, fault tolerance, checkpoint, MPI.
                    
                  
                
                
                @article{VYURV_2014_3_3_a1,
     author = {A. A. Bondarenko and M. V. Iakobovski},
     title = {Fault tolerance for {HPC} by using local checkpoints},
     journal = {Vestnik \^U\v{z}no-Uralʹskogo gosudarstvennogo universiteta. Seri\^a Vy\v{c}islitelʹna\^a matematika i informatika},
     pages = {20--36},
     publisher = {mathdoc},
     volume = {3},
     number = {3},
     year = {2014},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VYURV_2014_3_3_a1/}
}
                      
                      
TY  - JOUR
AU  - A. A. Bondarenko
AU  - M. V. Iakobovski
TI  - Fault tolerance for HPC by using local checkpoints
JO  - Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
PY  - 2014
SP  - 20
EP  - 36
VL  - 3
IS  - 3
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/VYURV_2014_3_3_a1/
LA  - ru
ID  - VYURV_2014_3_3_a1
ER  -
%0 Journal Article
%A A. A. Bondarenko
%A M. V. Iakobovski
%T Fault tolerance for HPC by using local checkpoints
%J Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
%D 2014
%P 20-36
%V 3
%N 3
%I mathdoc
%U http://geodesic.mathdoc.fr/item/VYURV_2014_3_3_a1/
%G ru
%F VYURV_2014_3_3_a1
A. A. Bondarenko; M. V. Iakobovski. Fault tolerance for HPC by using local checkpoints. Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Vol. 3 (2014) no. 3, pp. 20-36. http://geodesic.mathdoc.fr/item/VYURV_2014_3_3_a1/
