Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning

D. A. Bylinkin; K. D. Degtyarev; A. N. Beznosikov

D. A. Bylinkin ; K. D. Degtyarev ; A. N. Beznosikov

Trudy Matematicheskogo Instituta imeni V.A. Steklova, Tome 79 (2024) no. 6, pp. 939-973 Cet article a éte moissonné depuis la source Math-Net.Ru

Voir la notice de l'article

Résumé

Modern realities and trends in learning require more and more generalization ability of models, which leads to an increase in both models and training sample size. It is already difficult to solve such tasks in a single device mode. This is the reason why distributed and federated learning approaches are becoming more popular every day. Distributed computing involves communication between devices, which requires solving two key problems: efficiency and privacy. One of the most well-known approaches to combat communication costs is to exploit the similarity of local data. Both Hessian similarity and homogeneous gradients have been studied in the literature, but separately. In this paper we combine both of these assumptions in analyzing a new method that incorporates the ideas of using data similarity and clients sampling. Moreover, to address privacy concerns, we apply the technique of additional noise and analyze its impact on the convergence of the proposed method. The theory is confirmed by training on real datasets. Bibliography: 45 titles.

Keywords: ASEG, distributed learning, federated learning, communication costs, Hessian similarity, homogeneous gradients
Mots-clés : technique of additional noise.

@article{RM_2024_79_6_a1,
     author = {D. A. Bylinkin and K. D. Degtyarev and A. N. Beznosikov},
     title = {Accelerated {Stochastic} {ExtraGradient:} {Mixing} {Hessian} and gradient similarity to reduce communication in distributed and federated learning},
     journal = {Trudy Matematicheskogo Instituta imeni V.A. Steklova},
     pages = {939--973},
     year = {2024},
     volume = {79},
     number = {6},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/RM_2024_79_6_a1/}
}

TY  - JOUR
AU  - D. A. Bylinkin
AU  - K. D. Degtyarev
AU  - A. N. Beznosikov
TI  - Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning
JO  - Trudy Matematicheskogo Instituta imeni V.A. Steklova
PY  - 2024
SP  - 939
EP  - 973
VL  - 79
IS  - 6
UR  - http://geodesic.mathdoc.fr/item/RM_2024_79_6_a1/
LA  - en
ID  - RM_2024_79_6_a1
ER  -

%0 Journal Article
%A D. A. Bylinkin
%A K. D. Degtyarev
%A A. N. Beznosikov
%T Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning
%J Trudy Matematicheskogo Instituta imeni V.A. Steklova
%D 2024
%P 939-973
%V 79
%N 6
%U http://geodesic.mathdoc.fr/item/RM_2024_79_6_a1/
%G en
%F RM_2024_79_6_a1

D. A. Bylinkin; K. D. Degtyarev; A. N. Beznosikov. Accelerated Stochastic ExtraGradient: Mixing Hessian and gradient similarity to reduce communication in distributed and federated learning. Trudy Matematicheskogo Instituta imeni V.A. Steklova, Tome 79 (2024) no. 6, pp. 939-973. http://geodesic.mathdoc.fr/item/RM_2024_79_6_a1/

Bibliographie
Cité par

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and Li Zhang, “Deep learning with differential privacy”, Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (Vienna 2016), ACM, New York, 2016, 308–318 | DOI

[2] M. Aitsam, “Differential privacy made easy”, 2022 International conference on emerging trends in electrical, control, and telecommunication engineering (ETECTE) (Lahore 2022), IEEE, 2022, 1–7 | DOI

[3] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding”, NIPS'17: Proceedings of the 31st international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 30, Curran Associates, Inc., Red Hook, NY, 2017, 1707–1718, https://proceedings.neurips.cc/paper/2017/hash/6c340f25839e6acdc73414517203f5f0-Abstract.html

[4] Z. Allen-Zhu, “Katyusha: The first direct acceleration of stochastic gradient methods”, J. Mach. Learn. Res., 18 (2018), 221, 51 pp. | DOI | MR | Zbl

[5] Y. Arjevani and O. Shamir, “Communication complexity of distributed convex learning and optimization”, NIPS'15: Proceedings of the 28th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 28, MIT Press, Cambridge, MA, 2015, 1756–1764 https://papers.nips.cc/paper_files/paper/2015/hash/7fec306d1e665bc9c748b5d2b99a6e97-Abstract.html

[6] B. Barazandeh, Tianjian Huang, and G. Michailidis, “A decentralized adaptive momentum method for solving a class of min-max optimization problems”, Signal Process., 189 (2021), 108245, 23 pp. | DOI

[7] A. Beznosikov, P. Dvurechenskii, A. Koloskova, V. Samokhin, S. U. Stich, and A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2762, 38116–38133 https://proceedings.neurips.cc/paper_files/paper/2022/hash/f9379afacdbabfdc6b060972b60f9ab8-Abstract-Conference.html

[8] A. Beznosikov and A. Gasnikov, “Compression and data similarity: combination of two techniques for communication-efficient solving of distributed variational inequalities”, Optimization and applications, Lecture Notes in Comput. Sci., 13781, Springer, Cham, 2022, 151–162 | DOI | MR | Zbl

[9] A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan, “On biased compression for distributed learning”, J. Mach. Learn. Res., 24 (2023), 276, 50 pp. | MR

[10] A. Beznosikov, G. Scutari, A. Rogozin, and A. Gasnikov, “Distributed saddle-point problems under similarity”, NIPS'21: Proceedings of the 35th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 34, Curran Associates, Inc., Red Hook, NY, 2021, 625, 8172–8184 https://proceedings.neurips.cc/paper/2021/hash/44e65d3e9bc2f88b2b3d566de51a5381-Abstract.html

[11] A. Beznosikov, M. Takác, and A. Gasnikov, “Similarity, compression and local steps: three pillars of efficient communications for distributed variational inequalities”, NIPS'23: Proceedings of the 37th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 36, Curran Associates, Inc., Red Hook, NY, 2023, 1246, 28663–28677 https://proceedings.neurips.cc/paper_files/paper/2023/hash/5b4a459db23e6db9be2a128380953d96-Abstract-Conference.html

[12] Chih-Chung Chang and Chih-Jen Lin, “LIBSVM: A library for support vector machines”, ACM Trans. Intell. Syst. Technol. (TIST), 2:3 (2011), 27, 27 pp. | DOI

[13] A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”, NIPS'14: Proceedings of the 27th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 27, MIT Press, Cambridge, MA, 2014, 1646–1654, https://proceedings.neurips.cc/paper_files/paper/2014/hash/ede7e2b6d13a41ddf9f4bdef84fdc737-Abstract.html

[14] Jiashi Feng, Huan Xu, and Shie Mannor, Distributed robust learning, 2015 (v1 – 2014), 18 pp., arXiv: 1409.5937

[15] S. Gade and N. H. Vaidya, “Private optimization on networks”, 2018 Annual american control conference (ACC) (Milwaukee, WI 2018), IEEE, 2018, 1402–1409 | DOI

[16] E. Gorbunov, F. Hanzely, and P. Richtárik, “A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent”, Proceedings of the 23rd international conference on artificial intelligence and statistics (AISTATS 2020), Proc. Mach. Learn. Res. (PMLR), 108, 2020, 680–690 ; 2019, 38 pp., arXiv: https://proceedings.mlr.press/v108/gorbunov20a.html 1905.11261

[17] E. Gorbunov, F. Hanzely, and P. Richtárik, “Local SGD: unified theory and new efficient methods”, Proceedings of the 24th international conference on artificial intelligence and statistics (AISTATS 2021), Proc. Mach. Learn. Res. (PMLR), 130, 2021, 3556–3564 https://proceedings.mlr.press/v130/gorbunov21a.html

[18] T. M. Hehn, J. F. Kooij, and F. A. Hamprecht, “End-to-end learning of decision trees and forests”, Int. J. Comput. Vis., 128:4 (2020), 997–1011 | DOI | MR | Zbl

[19] H. Hendrikx, Lin Xiao, S. Bubeck, F. Bach, and L. Massoulie, “Statistically preconditioned accelerated gradient method for distributed optimization”, Proceedings of the 37th international conference on machine learning (ICML 2020), Proc. Mach. Learn. Res. (PMLR), 119, 2020, 4203–4227, https://proceedings.mlr.press/v119/hendrikx20a.html

[20] R. Johnson and Tong Zhang, “Accelerating stochastic gradient descent using predictive variance reduction”, NIPS'13: Proceedings of the 26th international conference on neural information processing systems, v. 1, Adv. Neural Inf. Process. Syst., 26, Curran Associates Inc., Red Hook, NY, 2013, 315–323, https://proceedings.neurips.cc/paper_files/paper/2013/hash/ac1dd209cbcc5e5d1c6e28598e8cbbe8-Abstract.html

[21] A. Khaled and Chi Jin, Faster federated optimization under second-order similarity, 2023 (v1 – 2022), 44 pp., arXiv: 2209.02257

[22] A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data”, Proceedings of the 23rd international conference on artificial intelligence and statistics (AISTATS 2020), Proc. Mach. Learn. Res. (PMLR), 108, 2020, 4519–4529 https://proceedings.mlr.press/v108/bayoumi20a.html

[23] A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates”, Proceedings of the 37th international conference on machine learning (ICML 2020), Proc. Mach. Learn. Res. (PMLR), 119, 2020, 5381–5393, https://proceedings.mlr.press/v119/koloskova20a.html

[24] D. Kovalev, A. Beznosikov, E. Borodich, A. Gasnikov, and G. Scutari, “Optimal gradient sliding and its application to optimal distributed optimization under similarity”, NIPS'22: Proceedings of the 36th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 35, Curran Associates, Inc., Red Hook, NY, 2022, 2427, 33494–33507, https://proceedings.neurips.cc/paper_files/paper/2022/hash/d88f6f81e1aaf606776ffdd06fdf24ef-Abstract-Conference.html

[25] D. Kovalev, S. Horváth, and P. Richtárik, “Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop”, Proceedings of the 31st international conference on algorithmic learning theory (ALT 2020), Proc. Mach. Learn. Res. (PMLR), 117, 2020, 451–467 https://proceedings.mlr.press/v117/kovalev20a.html | MR | Zbl

[26] Zhize Li, Hongyan Bao, Xiangliang Zhang, and P. Richtárik, “PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization”, Proceedings of the 38th international conference on machine learning (ICML 2021), Proc. Mach. Learn. Res. (PMLR), 139, 2021, 6286–6295 https://proceedings.mlr.press/v139/li21a.html

[27] Dachao Lin, Yuze Han, Haishan Ye, and Zhihua Zhang, “Stochastic distributed optimization under average second-order similarity: algorithms and analysis”, NIPS'23: Proceedings of the 37th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 36, Curran Associates, Inc., Red Hook, NY, 2023, 1849–1862 https://proceedings.neurips.cc/paper_files/paper/2023/hash/05e552739c2629f3324c1063a382b4bd-Abstract-Conference.html

[28] Hongzhou Lin, J. Mairal, and Z. Harchaoui, “A universal catalyst for first-order optimization”, NIPS'15: Proceedings of the 28th international conference on neural information processing systems, v. 2, Adv. Neural Inf. Process. Syst., 28, MIT Press, Cambridge, MA, 2015, 3384–3392, https://proceedings.neurips.cc/paper_files/paper/2015/hash/c164bbc9d6c72a52c599bbb43d8db8e1-Abstract.html

[29] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data”, Proceedings of the 20th international conference on artificial intelligence and statistics (AISTATS 2017), Proc. Mach. Learn. Res. (PMLR), 54, 2017, 1273–1282 https://proceedings.mlr.press/v54/mcmahan17a

[30] J. M. Moguerza and A. Muñoz, “Support vector machines with applications”, Statist. Sci., 21:3 (2006), 322–336 | DOI | MR | Zbl

[31] L. M. Nguyen, Jie Liu, K. Scheinberg, and M. Takáč, “SARAH: A novel method for machine learning problems using stochastic recursive gradient”, Proceedings of the 34th international conference on machine learning (ICML 2017), Proc. Mach. Learn. Res. (PMLR), 70, 2017, 2613–2621 https://proceedings.mlr.press/v70/nguyen17b.html

[32] E. Nozari, P. Tallapragada, and J. Cortés, “Differentially private distributed convex optimization via functional perturbation”, IEEE Trans. Control Netw. Syst., 5:1 (2018), 395–408 | DOI | MR | Zbl

[33] S. S. Sannakki and A. A. Kulkarni, “A review paper on artificial neural network in cognitive science”, Int. J. Engrg. Techn., 2:2 (2016), 118–123 https://oaji.net/articles/2017/1992-1514450980.pdf

[34] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence”, J. Mach. Learn. Res., 11 (2010), 2635–2670 | MR | Zbl

[35] O. Shamir, N. Srebro, and Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method”, Proceedings of the 31st international conference on machine learning (ICML 2014), Proc. Mach. Learn. Res. (PMLR), 32, 2014, 1000–1008 https://proceedings.mlr.press/v32/shamir14.html

[36] S. Sreenivasan, R. Cohen, E. López, Z. Toroczkai, and H. E. Stanley, Communication bottlenecks in scale-free networks, 2006, 5 pp., arXiv: cs/0604023

[37] S. U. Stich, Unified optimal analysis of the (stochastic) gradient method, 2019, 11 pp., arXiv: 1907.04232

[38] Ying Sun, G. Scutari, and A. Daneshmand, “Distributed optimization based on gradient tracking revisited: enhancing convergence rate via surrogation”, SIAM J. Optim., 32:2 (2022), 354–385 | DOI | MR | Zbl

[39] YРі Tian, G. Scutari, Tianyu Cao, and A. Gasnikov, “Acceleration in distributed optimization under similarity”, Proceedings of the 25th international conference on artificial intelligence and statistics (AISTATS 2022), Proc. Mach. Learn. Res. (PMLR), 151, 2022, 5721–5756 https://proceedings.mlr.press/v151/tian22b.html

[40] G. Tripepi, K. J. Jager, F. W. Dekker, and C. Zoccali, “Linear and logistic regression analysis”, Kidney Int., 73:7 (2008), 806–810 | DOI

[41] S. Vaswani, I. Laradji, F. Kunstner, Si Yi Meng, M. Schmidt, and S. Lacoste-Julien, Adaptive gradient methods converge faster with over-parameterization (but you should do a line-search), 2021 (v1 – 2020), 41 pp., arXiv: 2006.06835

[42] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning”, ACM Comput. Surveys, 53:2 (2020), 30, 33 pp. | DOI

[43] P. C. Weeraddana and C. Fischione, “On the privacy of optimization”, IFAC-PapersOnLine, 50:1 (2017), 9502–9508 | DOI

[44] B. E. Woodworth, K. K. Patel, and N. Srebro, “Minibatch vs local SGD for heterogeneous distributed learning”, NIPS'20: Proceedings of the 34th international conference on neural information processing systems, Adv. Neural Inf. Process. Syst., 33, Curran Associates, Inc., Red Hook, NY, 2020, 527, 6281–6292, https://proceedings.neurips.cc/paper_files/paper/2020/hash/45713f6ff2041d3fdfae927b82488db8-Abstract.html

[45] Xiao-Tong Yuan and Ping Li, “On convergence of distributed approximate Newton methods: globalization, sharper bounds and beyond”, J. Mach. Learn. Res., 21 (2020), 206, 51 pp. | MR | Zbl

Parcourir par

Geodesic

Parcourir par