A Distributed Dictionary-Based Morphological Analysis Framework for Russian Language Processing
Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ, Matematičeskoe modelirovanie i programmirovanie, no. 13 (2012), pp. 119-127
This article was harvested from the source Math-Net.Ru


This article describes an approach to scaling a morphological analysis service for natural language processing of document collections in Russian. It gives an overview and critical analysis of existing solutions and establishes the requirements for a dictionary-based morphological analysis toolkit. The distributed architecture of a morphological analysis web service, designed to handle large collections of documents in Russian, is presented as a structural model. This architecture is implemented as a prototype system in the Ruby programming language. The morphological dictionary is stored using a relational schema. Tests of this method in a distributed computing environment showed linear scalability of the proposed solution. The experimental configuration comprises a load generator that issues HTTP requests, a load balancer that distributes them across the worker nodes of the distributed system, application servers running the analyzer together with its morphological dictionary database, and a caching node that reduces the cost of queries to the dictionary. Applying this approach yields a linear increase in performance for distributed systems that automatically process large volumes of text.
Keywords: distributed computing, natural language processing, corpus linguistics, data-intensive computing, morphological analysis.
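The role of the caching node in front of the relational dictionary can be illustrated with a minimal sketch. It is not the authors' Ruby prototype: the class `CachedAnalyzer`, the table `word_forms`, its columns, and the tag strings are all hypothetical, an in-memory SQLite database stands in for the dictionary store, and a plain Python dict stands in for a separate cache server.

```python
import sqlite3


class CachedAnalyzer:
    """Dictionary lookup with a cache in front of the relational store.

    Hypothetical illustration only; names and schema are assumptions.
    """

    def __init__(self, connection):
        self.connection = connection
        self.cache = {}      # stand-in for a separate caching node
        self.db_queries = 0  # counts how often the dictionary is queried

    def analyze(self, word_form):
        key = word_form.lower()
        # Serve repeated word forms from the cache without touching the DB.
        if key in self.cache:
            return self.cache[key]
        self.db_queries += 1
        rows = self.connection.execute(
            "SELECT lemma, tags FROM word_forms "
            "WHERE form = ? ORDER BY lemma, tags",
            (key,),
        ).fetchall()
        # Cache empty results too, so unknown words are not re-queried.
        self.cache[key] = rows
        return rows


def build_toy_dictionary():
    """Create a tiny in-memory dictionary with an assumed schema."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE word_forms (form TEXT, lemma TEXT, tags TEXT)"
    )
    connection.executemany(
        "INSERT INTO word_forms VALUES (?, ?, ?)",
        [
            ("стали", "сталь", "NOUN,gen,sg"),   # illustrative tags only
            ("стали", "стать", "VERB,past,pl"),
            ("слова", "слово", "NOUN,gen,sg"),
        ],
    )
    return connection


analyzer = CachedAnalyzer(build_toy_dictionary())
first = analyzer.analyze("Стали")   # queries the dictionary
second = analyzer.analyze("стали")  # answered from the cache
print(first, analyzer.db_queries)
```

In this sketch the second lookup never reaches the database, which is the cost the abstract says the caching node is meant to reduce. Because natural-language text repeats a small set of frequent word forms, even a simple cache of this kind would be expected to absorb a large share of dictionary queries.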
@article{VYURU_2012_13_a11,
     author = {D. A. Ustalov and M. L. Goldstein},
     title = {A {Distributed} {Dictionary} {\textemdash} {Based} {Morphological} {Analysis} {Framework} for {Russian} {Language} {Processing}},
     journal = {Vestnik \^U\v{z}no-Uralʹskogo gosudarstvennogo universiteta. Seri\^a, Matemati\v{c}eskoe modelirovanie i programmirovanie},
     pages = {119--127},
     year = {2012},
     number = {13},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VYURU_2012_13_a11/}
}
TY  - JOUR
AU  - D. A. Ustalov
AU  - M. L. Goldstein
TI  - A Distributed Dictionary — Based Morphological Analysis Framework for Russian Language Processing
JO  - Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ, Matematičeskoe modelirovanie i programmirovanie
PY  - 2012
SP  - 119
EP  - 127
IS  - 13
UR  - http://geodesic.mathdoc.fr/item/VYURU_2012_13_a11/
LA  - ru
ID  - VYURU_2012_13_a11
ER  - 
%0 Journal Article
%A D. A. Ustalov
%A M. L. Goldstein
%T A Distributed Dictionary — Based Morphological Analysis Framework for Russian Language Processing
%J Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ, Matematičeskoe modelirovanie i programmirovanie
%D 2012
%P 119-127
%N 13
%U http://geodesic.mathdoc.fr/item/VYURU_2012_13_a11/
%G ru
%F VYURU_2012_13_a11
D. A. Ustalov; M. L. Goldstein. A Distributed Dictionary-Based Morphological Analysis Framework for Russian Language Processing. Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ, Matematičeskoe modelirovanie i programmirovanie, no. 13 (2012), pp. 119-127. http://geodesic.mathdoc.fr/item/VYURU_2012_13_a11/

[1] Corpus Linguistics (accessed: 20 May 2012)

[2] GATE Cloud - a New Way to Mine the Web (accessed: 20 May 2012) http://gatecloud.net

[3] Shestakov A. L., Sidorov A. I., Shaefer L. A., Gichkina E. V., “The System of the Management Quality, Operational Control and Analysis of the Educational Process”, Herald of the Leningrad State University, 2009, no. 1, 177–194 (in Russian)

[4] mystem (accessed: 20 May 2012) http://company.yandex.ru/technologies/mystem

[5] Snowball (accessed: 20 May 2012) http://snowball.tartarus.org

[6] Stemka (accessed: 20 May 2012) http://www.keva.ru/stemka/stemka.html

[7] Gearman (accessed: 20 May 2012) http://gearman.org

[8] T. Erjavec, “MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora”, Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC'10 (Malta, 2010), 2544–2547

[9] Myaso (accessed: 20 May 2012) http://myaso.eveel.ru

[10] AOT :: Technologies (accessed: 20 May 2012)

[11] I. Segalovich, “A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine”, Proceedings of the International Conference on Machine Learning; Models, Technologies and Applications, MLMTA'03 (Las Vegas, 2003), 273–280

[12] HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer (accessed: 20 May 2012) haproxy.1wt.eu

[13] Blog Posts Collection (accessed: 20 May 2012)

[14] Tokyo Cabinet: a modern implementation of DBM (accessed: 20 May 2012) http://fallabs.com/tokyocabinet

[15] Memcached - a distributed memory object caching system (accessed: 20 May 2012) http://memcached.org

[16] Software. Russian National Corpus Language (accessed: 20 May 2012)