Penalty for unknown words in topic model
Informacionnye tehnologii i vyčislitelnye sistemy, no. 4 (2020), pp. 111-124.

See the article record from the source Math-Net.Ru

The paper reviews approaches to handling unknown words in language models used in natural language processing. It proposes a method for handling unknown words in probabilistic topic modeling that makes it possible to estimate the probability that a document is novel with respect to existing topics. A topic model estimates the probability that a word belongs to a given topic, and the word-topic probability matrix of such a model is filled with posterior word probabilities. To estimate the probability of a document's novelty, the paper introduces into the model a penalty for unknown words, that is, an a priori probability estimate for such words. A software prototype was developed that calculates the probability of a document's novelty for various penalty values. Experiments on the SCTM-ru text corpus demonstrate the method's ability to classify collections and streams of text documents containing unknown words, taking into account the influence of those words on document topics. The experiments also compare the classification results of the topic model with those of a classifier based on logistic regression.
Keywords: topic modeling, natural language processing, penalty for unknown words.
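
The abstract does not give the formula behind the penalty, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes that a word is "unknown" when it is absent from the word-topic matrix, that each unknown token independently signals novelty with prior probability `penalty`, and that these signals are combined with a noisy-OR rule. The names `novelty_probability`, `phi`, and `penalty`, as well as the toy matrix, are hypothetical.

```python
def novelty_probability(tokens, phi, penalty):
    """Estimate the probability that a document is novel (illustrative only).

    tokens  -- list of word tokens in the document
    phi     -- dict mapping each known word to its list of p(word | topic)
               values; words missing from this dict are treated as unknown
    penalty -- assumed prior probability (0..1) that a single unknown
               token indicates a topic not covered by the model
    """
    if not 0.0 <= penalty <= 1.0:
        raise ValueError("penalty must lie in [0, 1]")
    # Count tokens that have no row in the word-topic matrix.
    unknown_count = sum(1 for word in tokens if word not in phi)
    # Noisy-OR: the document is "not novel" only if every unknown
    # token independently fails to signal novelty.
    return 1.0 - (1.0 - penalty) ** unknown_count


# Toy word-topic matrix with two topics (made-up numbers).
phi = {
    "bank": [0.30, 0.10],
    "river": [0.05, 0.40],
}

document = ["bank", "river", "blockchain", "token"]
for penalty in (0.0, 0.1, 0.5):
    print(penalty, novelty_probability(document, phi, penalty))
```

Under these assumptions a document made up entirely of known words gets a novelty probability of zero, and the estimate grows with both the penalty value and the number of unknown tokens, which is the kind of sensitivity to the penalty that the prototype is described as exploring.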
@article{ITVS_2020_4_a9,
     author = {S. N. Karpovich and A. V. Smirnov and N. N. Teslya},
     title = {Penalty for unknown words in topic model},
     journal = {Informacionnye tehnologii i vy\v{c}islitelnye sistemy},
     pages = {111--124},
     publisher = {mathdoc},
     number = {4},
     year = {2020},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/ITVS_2020_4_a9/}
}
TY  - JOUR
AU  - S. N. Karpovich
AU  - A. V. Smirnov
AU  - N. N. Teslya
TI  - Penalty for unknown words in topic model
JO  - Informacionnye tehnologii i vyčislitelnye sistemy
PY  - 2020
SP  - 111
EP  - 124
IS  - 4
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/ITVS_2020_4_a9/
LA  - ru
ID  - ITVS_2020_4_a9
ER  - 
%0 Journal Article
%A S. N. Karpovich
%A A. V. Smirnov
%A N. N. Teslya
%T Penalty for unknown words in topic model
%J Informacionnye tehnologii i vyčislitelnye sistemy
%D 2020
%P 111-124
%N 4
%I mathdoc
%U http://geodesic.mathdoc.fr/item/ITVS_2020_4_a9/
%G ru
%F ITVS_2020_4_a9
S. N. Karpovich; A. V. Smirnov; N. N. Teslya. Penalty for unknown words in topic model. Informacionnye tehnologii i vyčislitelnye sistemy, no. 4 (2020), pp. 111-124. http://geodesic.mathdoc.fr/item/ITVS_2020_4_a9/