Adversarial attacks on language models: WordPiece filtration and ChatGPT synonyms
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–2, Vol. 530 (2023), pp. 80-95
This article was harvested from the Math-Net.Ru source


Adversarial attacks on text have gained significant attention in recent years because of their potential to undermine the reliability of NLP models. We present novel black-box character- and word-level approaches for generating adversarial examples against BERT-based models. The character-level approach adds natural typos to a word guided by its WordPiece tokenization. At the word level, we present three techniques that use synonymous substitute words generated by ChatGPT and post-corrected into the grammatical form appropriate for the given context. Additionally, we minimize the perturbation rate by accounting for the damage that each perturbation does to the model. By combining the character-level approaches, the word-level approaches, and the perturbation-rate minimization technique, we achieve a state-of-the-art attack rate. Our best approach runs 30–65% faster than the previous best method, Tampers, with a comparable perturbation rate. At the same time, the proposed perturbations preserve the semantic similarity between the original and adversarial examples and keep the Levenshtein distance relatively low.
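The character-level idea in the abstract — inserting a natural typo into a word guided by its WordPiece segmentation — can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: the toy vocabulary, the greedy longest-match tokenizer, and the choice of an adjacent-character swap at the first subword boundary are all assumptions for the example; a real attack would use a trained BERT tokenizer's vocabulary.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece tokenization over a toy vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:               # non-initial pieces carry the '##' prefix
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:               # no piece matches: out-of-vocabulary word
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def boundary_swap(word, vocab):
    """Swap the two characters straddling the first subword boundary,
    producing a 'natural'-looking typo located where the tokenizer splits."""
    pieces = wordpiece(word, vocab)
    if len(pieces) < 2 or pieces == ["[UNK]"]:
        return word                     # single piece or OOV: nothing to perturb
    i = len(pieces[0])                  # index of the first character after piece 1
    return word[:i - 1] + word[i] + word[i - 1] + word[i + 1:]

vocab = {"play", "##ing", "##ed", "un", "##play"}
print(wordpiece("playing", vocab))      # ['play', '##ing']
print(boundary_swap("playing", vocab))  # 'plaiyng'
```

Placing the typo at a subword boundary tends to change the tokenization drastically (here, "plaiyng" no longer matches any toy vocabulary piece), which is what makes such perturbations damaging to subword-based models while remaining readable to humans.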
@article{ZNSL_2023_530_a6,
     author = {T. Ter-Hovhannisyan and H. Aleksanyan and K. Avetisyan},
     title = {Adversarial attacks on language models: {WordPiece} filtration and {ChatGPT} synonyms},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {80--95},
     year = {2023},
     volume = {530},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2023_530_a6/}
}
TY  - JOUR
AU  - T. Ter-Hovhannisyan
AU  - H. Aleksanyan
AU  - K. Avetisyan
TI  - Adversarial attacks on language models: WordPiece filtration and ChatGPT synonyms
JO  - Zapiski Nauchnykh Seminarov POMI
PY  - 2023
SP  - 80
EP  - 95
VL  - 530
UR  - http://geodesic.mathdoc.fr/item/ZNSL_2023_530_a6/
LA  - en
ID  - ZNSL_2023_530_a6
ER  - 
%0 Journal Article
%A T. Ter-Hovhannisyan
%A H. Aleksanyan
%A K. Avetisyan
%T Adversarial attacks on language models: WordPiece filtration and ChatGPT synonyms
%J Zapiski Nauchnykh Seminarov POMI
%D 2023
%P 80-95
%V 530
%U http://geodesic.mathdoc.fr/item/ZNSL_2023_530_a6/
%G en
%F ZNSL_2023_530_a6
T. Ter-Hovhannisyan; H. Aleksanyan; K. Avetisyan. Adversarial attacks on language models: WordPiece filtration and ChatGPT synonyms. Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–2, Vol. 530 (2023), pp. 80-95. http://geodesic.mathdoc.fr/item/ZNSL_2023_530_a6/

[1] M. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. Srivastava, and K.-W. Chang, Generating Natural Language Adversarial Examples, 2018, arXiv: 1804.07998

[2] Y. Belinkov and Y. Bisk, Synthetic and Natural Noise Both Break Neural Machine Translation, 2018, arXiv: 1711.02173

[3] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, Adversarial Attacks and Defences: A Survey, 2018, arXiv: 1810.00069

[4] J. Ebrahimi, D. Lowd, and D. Dou, On Adversarial Examples for Character-Level Neural Machine Translation, 2018, arXiv: 1806.09030

[5] J. Ebrahimi, A. Rao, D. Lowd, and D. Dou, HotFlip: White-Box Adversarial Examples for Text Classification, 2018, arXiv: 1712.06751

[6] J. Gao, J. Lanchantin, M. L. Soffa, and Y. Qi, Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers, 2018, arXiv: 1801.04354

[7] S. Garg and G. Ramakrishnan, “BAE: BERT-based Adversarial Examples for Text Classification”, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2020, 6174–6181 | DOI

[8] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, 2015, arXiv: 1412.6572

[9] J. Grainger and C. Whitney, “Does the huamn mnid raed wrods as a wlohe?”, Trends in Cognitive Sciences, 8:2 (2004), 58–59 | DOI | MR

[10] R. Jia, A. Raghunathan, K. Göksel, and P. Liang, Certified Robustness to Adversarial Word Substitutions, 2019, arXiv: 1909.00986

[11] D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits, Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment, 2020, arXiv: 1907.11932

[12] A. Kurakin, I. Goodfellow, and S. Bengio, Adversarial examples in the physical world, 2017, arXiv: 1607.02533 | Zbl

[13] J. Li, S. Ji, T. Du, B. Li, and T. Wang, “TextBugger: Generating Adversarial Text Against Real-world Applications”, Proceedings 2019 Network and Distributed System Security Symposium, 2019

[14] L. Li, R. Ma, Q. Guo, X. Xue, and X. Qiu, BERT-ATTACK: Adversarial Attack Against BERT Using BERT, 2020, arXiv: 2004.09984

[15] B. Pang and L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales”, ACL'05, Association for Computational Linguistics, 2005, 115–124

[16] S. Ren, Y. Deng, K. He, and W. Che, “Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency”, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, 1085–1097 | DOI

[17] X. Song, A. Salcianu, Y. Song, D. Dopson, and D. Zhou, Fast WordPiece Tokenization, 2021, arXiv: 2012.15524

[18] L. Sun, K. Hashimoto, W. Yin, A. Asai, J. Li, P. Yu, and C. Xiong, Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT, 2020, arXiv: 2003.04985

[19] Y. Zang, F. Qi, C. Yang, Z. Liu, M. Zhang, Q. Liu, and M. Sun, “Word-level Textual Adversarial Attacking as Combinatorial Optimization”, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020, 6066–6080 | DOI

[20] X. Zhang, J. Zhao, and Y. LeCun, Character-level Convolutional Networks for Text Classification, 2016, arXiv: 1509.01626

[21] X. Zhao, L. Zhang, D. Xu, and S. Yuan, Generating Textual Adversaries with Minimal Perturbation, 2022, arXiv: 2211.06571

[22] V. Zubarev and V. Sochenkov, Cross-language text alignment for plagiarism detection based on contextual and context-free models, 2019