Extracting named entities from russian-language documents with different expressiveness of structure
Modelirovanie i analiz informacionnyh sistem, Volume 30 (2023) no. 4, pp. 382-393.

See the article record from the source Math-Net.Ru

This work addresses named entity recognition for Russian-language texts using a CRF (conditional random field) model. Two data sets were considered: refinancing documents with a well-defined structure and semi-structured texts of court records. The model was tested with various sets of text features and CRF parameters (optimization algorithms). Averaged over all entities, the best F-measure was 0.99 for structured documents and 0.86 for semi-structured ones.
Keywords: named entity extraction, CRF.
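The abstract mentions text features for a CRF and an F-measure averaged over entities, but this record does not list the features or the exact scoring procedure. The following is only a minimal sketch under stated assumptions: `token_features` shows the kind of per-token feature dictionary commonly fed to CRF toolkits (the feature names are hypothetical, not the authors' set), and `f_measure_per_label` with `macro_f` computes a token-level F-measure per entity label and its unweighted mean (the paper may score at the entity level instead).

```python
from collections import Counter


def token_features(tokens, i):
    """Build a feature dict for token i (hypothetical feature set, for illustration)."""
    word = tokens[i]
    feats = {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        "suffix3": word[-3:].lower(),
    }
    # Context features: neighbouring tokens or sentence-boundary markers.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats


def f_measure_per_label(gold, pred, ignore="O"):
    """Token-level F-measure for every label except the 'outside' label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp[g] += 1
        else:
            if p != ignore:
                fp[p] += 1
            if g != ignore:
                fn[g] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f(scores):
    """Unweighted mean of the per-label F-measures."""
    return sum(scores.values()) / len(scores) if scores else 0.0
```

In a setup like this, the feature dictionaries for each sentence would be passed to a CRF trainer, and the choice of optimization algorithm would be a trainer parameter; both are left out here because the record does not specify them.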
@article{MAIS_2023_30_4_a5,
     author = {M. D. Averina and O. A. Levanova},
     title = {Extracting named entities from russian-language documents with different expressiveness of structure},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {382--393},
     publisher = {mathdoc},
     volume = {30},
     number = {4},
     year = {2023},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2023_30_4_a5/}
}
M. D. Averina; O. A. Levanova. Extracting named entities from russian-language documents with different expressiveness of structure. Modelirovanie i analiz informacionnyh sistem, Volume 30 (2023) no. 4, pp. 382-393. http://geodesic.mathdoc.fr/item/MAIS_2023_30_4_a5/

[1] E. Leitner, G. Rehm, J. Moreno-Schneider, “Fine-grained Named Entity Recognition in legal documents”, International Conference on Semantic Systems, Springer, 2019, 272–287

[2] J. Strakova, M. Straka, J. Hajic, “Neural architectures for nested NER through linearization”, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 5326–5331

[3] R. Yeshpanov, Y. Khassanov, H. A. Varol, KazNERD: Kazakh Named Entity Recognition dataset, 2022, arXiv: 2111.13419 [cs.CL]

[4] S. Zheng et al., “Conditional Random Fields as Recurrent Neural Networks”, Proceedings of the IEEE International Conference on Computer Vision, 2015, 1529–1537

[5] K. W. Church, “Word2vec”, Natural Language Engineering, 23:1 (2017), 155–162

[6] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, “Enriching word vectors with subword information”, Transactions of the Association for Computational Linguistics, 5 (2017), 135–146

[7] C. Sutton, A. McCallum et al., “An introduction to Conditional Random Fields”, Foundations and Trends® in Machine Learning, 4:4 (2012), 267–373

[8] J. Lafferty, A. McCallum, F. Pereira, “Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data”, Proceedings of the Eighteenth International Conference on Machine Learning, 2001, 282–289

[9] M. Collins, “Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms”, Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, EMNLP 2002, Association for Computational Linguistics, 2002, 1–8

[10] S. Bird, “NLTK: The natural language toolkit”, Proceedings of the COLING/ACL on Interactive Presentation Sessions, COLING-ACL '06, Association for Computational Linguistics, 2006, 69–72

[11] R. Řehůřek, P. Sojka, “Software framework for topic modelling with large corpora”, Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks, 2010, 46–50

[12] M. Korobov, “Morphological analyzer and generator for Russian and Ukrainian languages”, Analysis of Images, Social Networks and Texts, Springer, 2015, 320–332

[13] J. Li, A. Sun, J. Han, C. Li, “A survey on deep learning for Named Entity Recognition”, IEEE Transactions on Knowledge and Data Engineering, 34:1 (2020), 50–70