Pre-training LongT5 for Vietnamese mass-media multi-document summarization
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–1, Vol. 529 (2023), pp. 123–139

View the article record from the source Math-Net.Ru

Multi-document summarization aims to extract the most salient information from a set of input documents. One of the main challenges in this task is the long-term dependency problem. For texts written in Vietnamese, it is compounded by the language's syllable-based text representation and the lack of labeled datasets. Recent advances in machine translation have driven significant growth in the use of a related architecture, the Transformer. Pretrained on large amounts of raw text, Transformers can capture deep knowledge of the texts they model. In this paper, we survey applications of language models to text summarization, including notable Vietnamese text summarization models. Based on this survey, we select LongT5, pretrain it from scratch, and then fine-tune it for Vietnamese multi-document summarization. We analyze the resulting model and run experiments on multi-document Vietnamese datasets, including ViMs, VMDS, and VLSP2022. We conclude that a Transformer-based model pretrained on a large amount of unlabeled Vietnamese text achieves promising results, with further gains from fine-tuning on a small amount of manually summarized texts. The pretrained model used in the experiments has been made available online at https://github.com/nicolay-r/ViLongT5.
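
As a rough illustration of how such a released checkpoint could be applied, the sketch below loads a LongT5 model through the Hugging Face transformers library and summarizes a concatenated cluster of Vietnamese documents. This is a minimal sketch only: the model identifier and the input texts are hypothetical placeholders, and the actual checkpoint format is documented at https://github.com/nicolay-r/ViLongT5.

# Minimal sketch: multi-document summarization with a LongT5 checkpoint
# via Hugging Face transformers. The model id below is a hypothetical
# placeholder; see https://github.com/nicolay-r/ViLongT5 for the
# actual released model.
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

model_id = "nicolay-r/ViLongT5"  # hypothetical placeholder, not a verified model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LongT5ForConditionalGeneration.from_pretrained(model_id)

# Multi-document input: concatenate all documents of one cluster so the
# long-input attention of LongT5 can attend across document boundaries.
documents = ["Văn bản thứ nhất ...", "Văn bản thứ hai ..."]
inputs = tokenizer(" ".join(documents), return_tensors="pt",
                   truncation=True, max_length=4096)

summary_ids = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))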
@article{ZNSL_2023_529_a8,
     author = {N. Rusnachenko and The Anh Le and Ngoc Diep Nguyen},
     title = {Pre-training {LongT5} for {Vietnamese} mass-media multi-document summarization},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {123--139},
     publisher = {mathdoc},
     volume = {529},
     year = {2023},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a8/}
}
N. Rusnachenko; The Anh Le; Ngoc Diep Nguyen. Pre-training LongT5 for Vietnamese mass-media multi-document summarization. Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–1, Vol. 529 (2023), pp. 123–139. http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a8/