Refining joint text and source code embeddings for retrieval task with parameter-efficient fine-tuning
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part IV, Volume 540 (2024), pp. 27-45. This article was harvested from the Math-Net.Ru source.

Recent developments in natural language processing have brought remarkable progress on the code-text retrieval problem. As the Transformer-based models used for this task continue to grow in size, the computational costs and time required for end-to-end fine-tuning become substantial. This poses a significant challenge to adapting and utilizing these models when computational resources are limited. Motivated by these concerns, we propose a fine-tuning framework that leverages parameter-efficient fine-tuning (PEFT) techniques. Moreover, we adopt contrastive learning objectives to improve the quality of the bimodal representations learned by Transformer-based models. In addition, we provide extensive benchmarking of PEFT methods, the lack of which has been highlighted as a crucial gap in the literature. Through extensive experiments with the CodeT5+ model on two datasets, we demonstrate that the proposed fine-tuning framework can improve code-text retrieval performance while tuning at most 0.4% of the model's parameters.
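A minimal sketch of the approach the abstract describes, not the authors' released code: a CodeT5+ encoder is wrapped with LoRA adapters (one common PEFT technique) via the Hugging Face peft library, and matched text/code pairs are trained with a symmetric InfoNCE-style contrastive objective over in-batch negatives. The checkpoint name, target modules, and hyperparameters (rank, alpha, temperature, learning rate) are illustrative assumptions rather than values reported in the paper.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Public CodeT5+ embedding checkpoint (an assumption; the paper's exact
# checkpoint may differ).
name = "Salesforce/codet5p-110m-embedding"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

# LoRA adapters on the attention query/value projections; the pretrained
# weights stay frozen, so only the low-rank updates are trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, target_modules=["q", "v"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # reports well under 1% of parameters as trainable

def info_nce(text_emb, code_emb, temperature=0.05):
    # Symmetric InfoNCE: the matched text/code pair in each row is the
    # positive; every other in-batch pair serves as a negative.
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = text_emb @ code_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# One training step over a toy batch of paired descriptions and snippets.
texts = ["return the factorial of n", "reverse a string"]
codes = ["def f(n): return 1 if n < 2 else n * f(n - 1)", "def g(s): return s[::-1]"]
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
t = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
c = tokenizer(codes, return_tensors="pt", padding=True, truncation=True)
loss = info_nce(model(**t), model(**c))  # this checkpoint's forward returns pooled embeddings
loss.backward()
opt.step()

Using in-batch negatives keeps the cost at one forward pass per modality per batch, and because the backbone stays frozen, only the adapter weights and their optimizer state need to be stored for each fine-tuned task.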
@article{ZNSL_2024_540_a1,
     author = {K. Galliamov and L. Khaertdinova and K. Denisova},
     title = {Refining joint text and source code embeddings for retrieval task with parameter-efficient fine-tuning},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {27--45},
     year = {2024},
     volume = {540},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2024_540_a1/}
}
TY  - JOUR
AU  - K. Galliamov
AU  - L. Khaertdinova
AU  - K. Denisova
TI  - Refining joint text and source code embeddings for retrieval task with parameter-efficient fine-tuning
JO  - Zapiski Nauchnykh Seminarov POMI
PY  - 2024
SP  - 27
EP  - 45
VL  - 540
UR  - http://geodesic.mathdoc.fr/item/ZNSL_2024_540_a1/
LA  - en
ID  - ZNSL_2024_540_a1
ER  - 
%0 Journal Article
%A K. Galliamov
%A L. Khaertdinova
%A K. Denisova
%T Refining joint text and source code embeddings for retrieval task with parameter-efficient fine-tuning
%J Zapiski Nauchnykh Seminarov POMI
%D 2024
%P 27-45
%V 540
%U http://geodesic.mathdoc.fr/item/ZNSL_2024_540_a1/
%G en
%F ZNSL_2024_540_a1
K. Galliamov; L. Khaertdinova; K. Denisova. Refining joint text and source code embeddings for retrieval task with parameter-efficient fine-tuning. Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part IV, Volume 540 (2024), pp. 27-45. http://geodesic.mathdoc.fr/item/ZNSL_2024_540_a1/

[1] V. Lialin, V. Deshpande, and A. Rumshisky, Scaling down to scale up: A guide to parameter-efficient fine-tuning, 2023, arXiv: 2303.15647 | Zbl

[2] Y. Wang, H. Le, A. Gotmare, N. Bui, J. Li, and S. Hoi, “CodeT5+: Open Code Large Language Models for Code Understanding and Generation”, Proc. 2023 Conf. Empir. Methods Nat. Lang. Process., 2023, 1069–1088 | DOI

[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need”, Adv. Neural Inf. Process. Syst., 30 (2017)

[4] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning”, Adv. Neural Inf. Process. Syst., 33 (2020), 18661–18673

[5] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2021, arXiv: 2005.11401

[6] M. Hasan, T. Muttaqueen, A.A. Ishtiaq, K.S. Mehrab, M.M.A. Haque, T. Hasan, W.U. Ahmad, A. Iqbal, and R. Shahriyar, “CoDesc: A Large Code-Description Parallel Dataset”, Findings Assoc. Comput. Linguist.: EMNLP, 2021

[7] M. Zhu, A. Jain, K. Suresh, R. Ravindran, S. Tipirneni, and C.K. Reddy, XLCoST: A benchmark dataset for cross-lingual code intelligence, 2022, arXiv: 2206.08474

[8] N. Rao, C. Bansal, and J. Guan, “Search4Code: Code search intent classification using weak supervision”, Proc. 2021 IEEE/ACM 18th Int. Conf. Min. Softw. Repositories, 2021, 575–579 | DOI

[9] H. Yao et al., “StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow”, Proc. World Wide Web Conf., 2018, 135–144

[10] M. Bahrami, N.C. Shrikanth, S. Ruangwan, L. Liu, Y. Mizobuchi, M. Fukuyori, W.-P. Chen, K. Munakata, and T. Menzies, PyTorrent: A Python library corpus for large-scale language models, 2021, arXiv: 2110.01710

[11] C. Shi, Y. Xiang, J. Yu, and L. Gao, Semantic code search for smart contracts, 2021, arXiv: 2111.14139

[12] S. Kairajärvi, A. Costin, and T. Hämäläinen, “ISAdetect: Usable Automated Detection of CPU Architecture and Endianness for Executable Binary Files and Object Code”, Proc. Tenth ACM Conf. Data Appl. Secur. Priv., 2020

[13] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al., CodeBERT: A pre-trained model for programming and natural languages, 2020, arXiv: 2002.08155

[14] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, CodeSearchNet challenge: Evaluating the state of semantic code search, 2019, arXiv: 1909.09436

[15] Y. Xie, J. Lin, H. Dong, L. Zhang, and Z. Wu, “Survey of code search based on deep learning”, ACM Trans. Softw. Eng. Methodol., 33:2 (2023), 1–42 | DOI | MR

[16] S. Chatterjee, S. Juvekar, and K. Sen, “Sniff: A search engine for Java using free-form queries”, Fundam. Approaches Softw. Eng. Int. Conf., 2009, 385–400

[17] E. Hill, M. Roldan-Vega, J.A. Fails, and G. Mallet, “NL-based query refinement and contextualized code search results: A user study”, Proc. 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), 2014, 34–43 | DOI

[18] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive”, IEEE Trans. Knowl. Data Eng., 35:1 (2021), 857–876

[19] R. Brinzea, B. Khaertdinov, and S. Asteriadis, “Contrastive learning with cross-modal knowledge mining for multimodal human activity recognition”, Proc. 2022 International Joint Conference on Neural Networks (IJCNN), 2022, 1–8

[20] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations”, Advances in Neural Information Processing Systems, 33 (2020), 5812–5823

[21] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations”, Proc. Int. Conf. Mach. Learn., 2020, 1597–1607

[22] A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision”, Proc. International Conference on Machine Learning, 2021, 8748–8763

[23] M.M. Abdollah Pour, P. Farinneya, A. Toroghi, A. Korikov, A. Pesaranghader, T. Sajed, M. Bharadwaj, B. Mavrin, and S. Sanner, “Self-supervised Contrastive BERT Fine-tuning for Fusion-Based Reviewed-Item Retrieval”, Proc. European Conference on Information Retrieval, 2023, 3–17 | Zbl

[24] H. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, and T. Li, “CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning”, Neurocomputing, 508 (2022), 293–304 | DOI

[25] V.A. Romanov and V.V. Ivanov, “Comparison of graph embeddings for source code with text models based on CNN and CodeBERT architectures”, Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS), 35:1 (2023), 237–264 | DOI

[26] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y.K. Li, et al., DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence, 2024, arXiv: 2401.14196

[27] K. Ganesan, ROUGE 2.0: Updated and improved measures for evaluation of summarization tasks, 2018, arXiv: 1803.01937

[28] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding”, Computer Vision-ECCV 2020: 16th European Conference (Glasgow, UK), v. XI, 2020, 776–794 | DOI

[29] L. Xu, H. Xie, S.-Z.J. Qin, X. Tao, and F.L. Wang, Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment, 2023, arXiv: 2312.12148

[30] M. Weyssow, X. Zhou, K. Kim, D. Lo, and H. Sahraoui, Exploring parameter-efficient fine-tuning techniques for code generation with large language models, 2023, arXiv: 2308.10462

[31] Y. Yu, C.-H.H. Yang, J. Kolehmainen, P.G. Shivakumar, Y. Gu, S.R.R. Ren, Q. Luo, A. Gourav, I.-F. Chen, Y.-C. Liu, et al., “Low-rank adaptation of large language model rescoring for parameter-efficient speech recognition”, Proc. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, 1–8 | MR

[32] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao, Adaptive budget allocation for parameter-efficient fine-tuning, 2023, arXiv: 2303.10512

[33] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C.A. Raffel, “Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning”, Advances in Neural Information Processing Systems, 35 (2022), 1950–1965

[34] B. Lester, R. Al-Rfou, and N. Constant, The power of scale for parameter-efficient prompt tuning, 2021, arXiv: 2104.08691

[35] C. Wang, Y. Yang, C. Gao, Y. Peng, H. Zhang, and M.R. Lyu, “No more fine-tuning? An experimental evaluation of prompt tuning in code intelligence”, Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, 382–394 | DOI

[36] H. Jiang, L. Nie, Z. Sun, Z. Ren, W. Kong, T. Zhang, and X. Luo, “ROSF: Leveraging information retrieval and supervised learning for recommending code snippets”, IEEE Transactions on Services Computing, 12:1 (2016), 34–46 | DOI

[37] M. Allamanis, D. Tarlow, A. Gordon, and Y. Wei, “Bimodal modelling of source code and natural language”, Proc. International Conference on Machine Learning, 2015, 2123–2132

[38] Q. Chen and M. Zhou, “A neural framework for retrieval and summarization of source code”, Proc. 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, 826–831 | DOI

[39] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, et al., GraphCodeBERT: Pre-training code representations with data flow, 2020, arXiv: 2009.08366

[40] A. Neelakantan, T. Xu, R. Puri, A. Radford, J.M. Han, J. Tworek, Q. Yuan, N. Tezak, J.W. Kim, C. Hallacy, et al., Text and code embeddings by contrastive pre-training, 2022, arXiv: 2201.10005

[41] Y. Wang, W. Wang, S. Joty, and S.C.H. Hoi, CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation, 2021, arXiv: 2109.00859

[42] J. Gu, Z. Chen, and M. Monperrus, “Multimodal representation for neural code search”, Proc. 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2021, 483–494

[43] S. Liu, Y. Chen, X. Xie, J. Siow, and Y. Liu, Retrieval-augmented generation for code summarization via hybrid GNN, 2020, arXiv: 2006.05405

[44] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu, “A novel neural source code representation based on abstract syntax tree”, Proc. 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019, 783–794 | DOI | Zbl