Analysis of influence of different relations types on the quality of thesaurus application to text classification problems

N. S. Lagutina; K. V. Lagutina; I. A. Shchitov; I. V. Paramonov

Modelirovanie i analiz informacionnyh sistem, Volume 24 (2017) no. 6, pp. 772-787

This article was harvested from the source Math-Net.Ru

Abstract

The main purpose of the article is to analyze how effectively different types of thesaurus relations can be used to solve text classification tasks. The study is based on an automatically generated thesaurus of a subject area that contains three types of relations: synonymous, hierarchical, and associative. To generate the thesaurus, the authors use a hybrid method based on several linguistic and statistical algorithms for extracting semantic relations. The method makes it possible to create a thesaurus with a sufficiently large number of terms and relations among them. The authors consider two problems: topical text classification and sentiment classification of large newspaper articles. To solve them, the authors developed two approaches that complement standard algorithms with a procedure that takes thesaurus relations into account to determine the semantic features of texts. The approach to topical classification combines the standard unsupervised BM25 algorithm with a procedure that takes into account the synonymous and hierarchical relations of the subject-area thesaurus. The approach to sentiment classification consists of two steps. In the first step, a thesaurus is created in which the polarity weight of each term is calculated either from the term's occurrences in the training set or from the weights of related thesaurus terms. In the second step, the thesaurus is used to compute features of the words in the texts, and the texts are classified with SVM or Naive Bayes. In experiments with the text corpora BBCSport, Reuters, PubMed, and the corpus of articles about American immigrants, the authors varied the types of thesaurus relations involved in the classification and the degree of their use. The results make it possible to evaluate how effectively thesaurus relations can be applied to the classification of raw texts and to determine the conditions under which particular relations have a greater or lesser effect. In particular, the most useful thesaurus relations are synonymous and hierarchical, as they provide better classification quality.
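
The topical approach described in the abstract (BM25 combined with synonymous and hierarchical thesaurus relations) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption and not the authors' implementation: the toy `THESAURUS`, the `expand` helper, the 0.5 weight for related terms, and the common Okapi BM25 form with a smoothed idf. Each topic is described by a few terms, the terms are expanded through the chosen relation types, and a document is assigned to the topic with the highest score.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus (not from the article): term -> related terms by relation type.
THESAURUS = {
    "football": {"synonym": ["soccer"], "hypernym": ["sport"]},
    "tennis": {"synonym": [], "hypernym": ["sport"]},
}


def expand(terms, relations=("synonym", "hypernym"), weight=0.5):
    """Map each term to a weight: 1.0 for original terms, `weight` (assumed) for related ones."""
    weighted = {t: 1.0 for t in terms}
    for t in terms:
        for rel in relations:
            for related in THESAURUS.get(t, {}).get(rel, []):
                weighted.setdefault(related, weight)
    return weighted


def bm25_score(term_weights, doc, docs, k1=1.2, b=0.75):
    """Weighted Okapi BM25 score of a tokenized `doc` against the collection `docs`."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for term, w in term_weights.items():
        df = sum(1 for d in docs if term in d)
        if df == 0 or tf[term] == 0:
            continue  # term absent from the collection or from this document
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += w * idf * norm
    return score


def classify(doc, topics, docs, relations=("synonym", "hypernym")):
    """Return the topic whose expanded term set scores highest for `doc`."""
    scores = {
        name: bm25_score(expand(terms, relations), doc, docs)
        for name, terms in topics.items()
    }
    return max(scores, key=scores.get)


docs = [
    ["soccer", "match", "goal"],
    ["tennis", "serve", "ace"],
    ["market", "shares", "bank"],
]
topics = {"football": ["football"], "business": ["market"]}
print(classify(docs[0], topics, docs))
```

Passing a different `relations` tuple, for example `("synonym",)` or `()`, mimics varying which relation types take part in the classification. In this toy data the first document matches the football topic only through the synonym "soccer", so without expansion its score for that topic is zero.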

Keywords: thesaurus, semantic relations, thesaurus relations, topical classification, sentiment classification.

@article{MAIS_2017_24_6_a9,
     author = {N. S. Lagutina and K. V. Lagutina and I. A. Shchitov and I. V. Paramonov},
     title = {Analysis of influence of different relations types on the quality of thesaurus application to text classification problems},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {772--787},
     year = {2017},
     volume = {24},
     number = {6},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2017_24_6_a9/}
}

TY  - JOUR
AU  - N. S. Lagutina
AU  - K. V. Lagutina
AU  - I. A. Shchitov
AU  - I. V. Paramonov
TI  - Analysis of influence of different relations types on the quality of thesaurus application to text classification problems
JO  - Modelirovanie i analiz informacionnyh sistem
PY  - 2017
SP  - 772
EP  - 787
VL  - 24
IS  - 6
UR  - http://geodesic.mathdoc.fr/item/MAIS_2017_24_6_a9/
LA  - ru
ID  - MAIS_2017_24_6_a9
ER  -

%0 Journal Article
%A N. S. Lagutina
%A K. V. Lagutina
%A I. A. Shchitov
%A I. V. Paramonov
%T Analysis of influence of different relations types on the quality of thesaurus application to text classification problems
%J Modelirovanie i analiz informacionnyh sistem
%D 2017
%P 772-787
%V 24
%N 6
%U http://geodesic.mathdoc.fr/item/MAIS_2017_24_6_a9/
%G ru
%F MAIS_2017_24_6_a9

N. S. Lagutina; K. V. Lagutina; I. A. Shchitov; I. V. Paramonov. Analysis of influence of different relations types on the quality of thesaurus application to text classification problems. Modelirovanie i analiz informacionnyh sistem, Volume 24 (2017) no. 6, pp. 772-787. http://geodesic.mathdoc.fr/item/MAIS_2017_24_6_a9/
