Topic models: adding bigrams and taking account of the similarity between unigrams and bigrams
Numerical methods and programming, Vol. 16 (2015) no. 2, pp. 215-234.

See the article record from the source Math-Net.Ru

This paper reports an experimental study of adding bigrams to topic models and taking account of the similarity between bigrams and unigrams. A novel algorithm, PLSA-SIM, is proposed as a modification of the original PLSA (Probabilistic Latent Semantic Analysis) algorithm; it incorporates bigrams and takes into account the similarity between them and unigram components. Various word association measures are analyzed as a means of selecting top-ranked bigrams for integration into topic models. The target text collections are articles from various Russian electronic banking magazines, the English parts of the parallel corpora Europarl and JRC-Acquis, and the English digital archive of research papers in computational linguistics (ACL Anthology). The computational experiments show that a subgroup of the tested measures produces top-ranked bigrams whose inclusion in the PLSA-SIM algorithm significantly improves the quality of topic models for all collections. A novel unsupervised iterative algorithm, PLSA-ITER, is also proposed for adding the most relevant bigrams. The computational experiments show that it yields a further improvement in the quality of topic models compared to the PLSA algorithm.
Keywords: PLSA (Probabilistic Latent Semantic Analysis), topic models, word association measures, bigrams, topic coherence, perplexity.
@article{VMP_2015_16_2_a4,
     author = {M. A. Nokel and N. V. Lukashevich},
     title = {Topic models: adding bigrams and taking account of the similarity between unigrams and bigrams},
     journal = {Numerical methods and programming},
     pages = {215--234},
     publisher = {mathdoc},
     volume = {16},
     number = {2},
     year = {2015},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VMP_2015_16_2_a4/}
}
TY  - JOUR
AU  - M. A. Nokel
AU  - N. V. Lukashevich
TI  - Topic models: adding bigrams and taking account of the similarity between unigrams and bigrams
JO  - Numerical methods and programming
PY  - 2015
SP  - 215
EP  - 234
VL  - 16
IS  - 2
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/VMP_2015_16_2_a4/
LA  - ru
ID  - VMP_2015_16_2_a4
ER  - 
%0 Journal Article
%A M. A. Nokel
%A N. V. Lukashevich
%T Topic models: adding bigrams and taking account of the similarity between unigrams and bigrams
%J Numerical methods and programming
%D 2015
%P 215-234
%V 16
%N 2
%I mathdoc
%U http://geodesic.mathdoc.fr/item/VMP_2015_16_2_a4/
%G ru
%F VMP_2015_16_2_a4
M. A. Nokel; N. V. Lukashevich. Topic models: adding bigrams and taking account of the similarity between unigrams and bigrams. Numerical methods and programming, Vol. 16 (2015) no. 2, pp. 215-234. http://geodesic.mathdoc.fr/item/VMP_2015_16_2_a4/