Matrix text models. Text corpora models
Matematičeskoe modelirovanie, Vol. 32 (2020) no. 2, pp. 37-57.

See the article record from the source Math-Net.Ru

We present models of text corpora built on the matrix model of natural-language texts. The collection models are formed by computationally identifying the thematic structure of the collections. We propose using these models to search for thematically similar text collections and to categorize texts thematically on the basis of the text models and the text collection models. We analyze how the proposed models of text collections differ from common approaches to the analysis and modeling of collections.
Keywords: natural language texts, text corpora, text corpora models, topic models, text models, text information retrieval.
@article{MM_2020_32_2_a2,
     author = {M. G. Kreines and E. M. Kreines},
     title = {Matrix text models. {Text} corpora models},
     journal = {Matemati\v{c}eskoe modelirovanie},
     pages = {37--57},
     publisher = {mathdoc},
     volume = {32},
     number = {2},
     year = {2020},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MM_2020_32_2_a2/}
}
TY  - JOUR
AU  - M. G. Kreines
AU  - E. M. Kreines
TI  - Matrix text models. Text corpora models
JO  - Matematičeskoe modelirovanie
PY  - 2020
SP  - 37
EP  - 57
VL  - 32
IS  - 2
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/MM_2020_32_2_a2/
LA  - ru
ID  - MM_2020_32_2_a2
ER  - 
%0 Journal Article
%A M. G. Kreines
%A E. M. Kreines
%T Matrix text models. Text corpora models
%J Matematičeskoe modelirovanie
%D 2020
%P 37-57
%V 32
%N 2
%I mathdoc
%U http://geodesic.mathdoc.fr/item/MM_2020_32_2_a2/
%G ru
%F MM_2020_32_2_a2
M. G. Kreines; E. M. Kreines. Matrix text models. Text corpora models. Matematičeskoe modelirovanie, Vol. 32 (2020) no. 2, pp. 37-57. http://geodesic.mathdoc.fr/item/MM_2020_32_2_a2/
