The parallel tiled WZ factorization algorithm for multicore architectures
International Journal of Applied Mathematics and Computer Science, Vol. 29 (2019) no. 2, pp. 407-419.

See the article record from the source Library of Science

This paper investigates dense linear algebra algorithms on shared-memory multicore architectures. It presents the design and implementation of a parallel tiled WZ factorization algorithm that can fully exploit such architectures. Three parallel implementations of the algorithm are studied. The first relies only on multithreaded BLAS (Basic Linear Algebra Subprograms) operations. The second, in addition to BLAS operations, uses the OpenMP standard to exploit loop-level parallelism. The third, in addition to BLAS operations, uses the OpenMP task directive with the depend clause. We report the computational performance and the speedup of the parallel tiled WZ factorization algorithm on shared-memory multicore architectures for dense, square, diagonally dominant matrices. We then compare our parallel implementations with the corresponding LU factorization from a vendor-implemented LAPACK library, and we analyze the numerical accuracy. Two of our implementations achieve a speedup close to the theoretical maximum implied by Amdahl’s law.
Keywords: tiled algorithm, WZ factorization, solution of linear system, Amdahl’s law, high performance computing, multicore architecture
Keywords (translated from Polish): WZ factorization, linear system, Amdahl’s law, multicore architecture
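For readers unfamiliar with the factorization named in the abstract, the following is a minimal C sketch of a pointwise (non-tiled) WZ factorization, A = WZ, with one OpenMP loop-level directive. It is an illustration under stated assumptions, not the authors' tiled algorithm: the function names `wz_factorize` and `wz_residual` are hypothetical, the matrix is stored row-major, no pivoting is performed, and every 2×2 pivot determinant is assumed to be nonzero (the routine returns -1 otherwise).

```c
#include <assert.h>

/* Pointwise WZ factorization sketch (hypothetical helper, not from the paper).
 * On entry a holds the n x n matrix A (row-major); on exit a holds Z and
 * w holds W, with A = W * Z. W has a unit diagonal. Returns 0 on success,
 * -1 if a 2x2 pivot determinant is exactly zero (no pivoting is done). */
int wz_factorize(int n, double *a, double *w)
{
    assert(n >= 1);

    /* Start from W = I. */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            w[i * n + j] = (i == j) ? 1.0 : 0.0;

    /* Step k eliminates columns k and k2 = n-1-k from the rows between them. */
    for (int k = 0; k < (n - 1) / 2; ++k) {
        const int k2 = n - 1 - k;
        const double akk = a[k * n + k],   akk2 = a[k * n + k2];
        const double ak2k = a[k2 * n + k], ak2k2 = a[k2 * n + k2];
        const double det = akk * ak2k2 - ak2k * akk2;
        if (det == 0.0)
            return -1;

        /* Each inner row i is updated independently: loop-level parallelism. */
        #pragma omp parallel for
        for (int i = k + 1; i < k2; ++i) {
            const double aik = a[i * n + k], aik2 = a[i * n + k2];
            /* Solve the 2x2 system for the two multipliers (Cramer's rule). */
            const double wik  = (aik * ak2k2 - aik2 * ak2k) / det;
            const double wik2 = (akk * aik2 - akk2 * aik) / det;
            w[i * n + k]  = wik;
            w[i * n + k2] = wik2;
            a[i * n + k]  = 0.0;   /* eliminated entries of Z */
            a[i * n + k2] = 0.0;
            for (int j = k + 1; j < k2; ++j)
                a[i * n + j] -= wik * a[k * n + j] + wik2 * a[k2 * n + j];
        }
    }
    return 0;
}

/* Largest absolute entry of W*Z - A0, used to check the factorization. */
double wz_residual(int n, const double *a0, const double *w, const double *z)
{
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += w[i * n + k] * z[k * n + j];
            double d = s - a0[i * n + j];
            if (d < 0.0)
                d = -d;
            if (d > worst)
                worst = d;
        }
    }
    return worst;
}
```

A caller would copy A into a working array, pass it to `wz_factorize` together with storage for W, and then check `wz_residual` against a small tolerance. Built with an OpenMP-enabled compiler (for example GCC with `-fopenmp`), the row loop runs in parallel; without OpenMP support the pragma is ignored and the code runs serially.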
@article{IJAMCS_2019_29_2_a14,
     author = {Bylina, Beata and Bylina, Jaros{\l}aw},
     title = {The parallel tiled {WZ} factorization algorithm for multicore architectures},
     journal = {International Journal of Applied Mathematics and Computer Science},
     pages = {407--419},
     publisher = {mathdoc},
     volume = {29},
     number = {2},
     year = {2019},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/IJAMCS_2019_29_2_a14/}
}
Bylina, Beata; Bylina, Jarosław. The parallel tiled WZ factorization algorithm for multicore architectures. International Journal of Applied Mathematics and Computer Science, Vol. 29 (2019) no. 2, pp. 407-419. http://geodesic.mathdoc.fr/item/IJAMCS_2019_29_2_a14/
