Automatic Text Categorization: Methods and Problems

M. S. Ageev; B. V. Dobrov; N. V. Loukachevitch


Učënye zapiski Kazanskogo universiteta. Seriâ Fiziko-matematičeskie nauki, Kazanskii Gosudarstvennyi Universitet. Uchenye Zapiski. Seriya Fiziko-Matematichaskie Nauki, Volume 150 (2008) no. 4, pp. 25-40

This article was harvested from the Math-Net.Ru source.


Abstract

This paper analyzes three approaches to text categorization: manual categorization, knowledge-based categorization, and machine learning. Their advantages and problems are described. Two methods intended to overcome the problems of automatic text categorization are then considered, and their evaluation on public collections is presented. The first method relies on a large linguistic resource, the RuThes thesaurus, together with the ALOT document processing technique. The second is a machine learning method of text categorization that generates descriptions of categories in the form of Boolean formulas.

Keywords: document processing, automatic text categorization, thesaurus, machine learning.

@article{UZKU_2008_150_4_a1,
     author = {M. S. Ageev and B. V. Dobrov and N. V. Loukachevitch},
     title = {Automatic {Text} {Categorization:} {Methods} and {Problems}},
     journal = {U\v{c}\"enye zapiski Kazanskogo universiteta. Seri\^a Fiziko-matemati\v{c}eskie nauki},
     pages = {25--40},
     year = {2008},
     volume = {150},
     number = {4},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/UZKU_2008_150_4_a1/}
}

TY  - JOUR
AU  - M. S. Ageev
AU  - B. V. Dobrov
AU  - N. V. Loukachevitch
TI  - Automatic Text Categorization: Methods and Problems
JO  - Učënye zapiski Kazanskogo universiteta. Seriâ Fiziko-matematičeskie nauki
PY  - 2008
SP  - 25
EP  - 40
VL  - 150
IS  - 4
UR  - http://geodesic.mathdoc.fr/item/UZKU_2008_150_4_a1/
LA  - ru
ID  - UZKU_2008_150_4_a1
ER  -

%0 Journal Article
%A M. S. Ageev
%A B. V. Dobrov
%A N. V. Loukachevitch
%T Automatic Text Categorization: Methods and Problems
%J Učënye zapiski Kazanskogo universiteta. Seriâ Fiziko-matematičeskie nauki
%D 2008
%P 25-40
%V 150
%N 4
%U http://geodesic.mathdoc.fr/item/UZKU_2008_150_4_a1/
%G ru
%F UZKU_2008_150_4_a1

M. S. Ageev; B. V. Dobrov; N. V. Loukachevitch. Automatic Text Categorization: Methods and Problems. Učënye zapiski Kazanskogo universiteta. Seriâ Fiziko-matematičeskie nauki, Kazanskii Gosudarstvennyi Universitet. Uchenye Zapiski. Seriya Fiziko-Matematichaskie Nauki, Volume 150 (2008) no. 4, pp. 25-40. http://geodesic.mathdoc.fr/item/UZKU_2008_150_4_a1/

References

[1] Dumais S., Platt J., Heckerman D., Sahami M., “Inductive learning algorithms and representations for text categorization”, Proc. Int. Conf. on Inform. and Knowledge Manage, 1998, 148–155

[2] Joachims T., “Text Categorization with Support Vector Machines: Learning with Many Relevant Features”, Proc. ECML–98, 10th Europ. Conf. on Machine Learning, 1998; http://www.cs.cornell.edu/people/tj/publications/joachims_98a.ps.gz

[3] Lewis D., “Applying Support Vector Machines to the TREC-2001 Batch Filtering and Routing Tasks”, Proc. TREC–2001 Conf., NIST Special Publication, 2001, 286–294

[4] Yang Y., Liu X., “A re-examination of text categorization methods”, Proc. of Int. ACM Conf. on Research and Development in Information Retrieval (SIGIR–99), 1999, 42–49

[5] Ageev M., Dobrov B., Loukachevitch N., “Text Categorization Tasks for Large Hierarchical Systems of Categories”, SIGIR–2002 Workshop on Operational Text Classification Systems, eds. F. Sebastiani, S. Dumais, D.D. Lewis, T. Montgomery, I. Moulinier, Univ. of Tampere, Tampere, Finland, 2002, 49–52

[6] Dumais S., Lewis D., Sebastiani F., “Report on the Workshop on Operational Text Classification Systems (OTC–02)”, SIGIR–2002 Workshop on Operational Text Classification Systems, Univ. of Tampere, Tampere, Finland, 2002; http://www.sigir.org/forum/F2002/sebastiani.pdf

[7] Lewis D., Sebastiani F., “Report on the Workshop on Operational Text Classification Systems (OTC–01)”, ACM SIGIR Forum (New Orleans), 35:2 (2001), 8–11

[8] Rose T., Stevenson M., Whitehead M., “The Reuters Corpus Volume 1 – from Yesterday's News to Tomorrow's Language Resources”, Proc. of the Third Int. Conf. on Language Resources and Evaluation, Las Palmas de Gran Canaria, 2002; http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.63.956

[9] Wasson M., “Classification Technology at LexisNexis”, SIGIR–2001 Workshop on Operational Text Classification, 2001; http://www.daviddlewis.com/events/otc2001/presentations/otc01-wasson-paper.txt

[10] Hayes P. J., Weinstein S. P., “Construe: A System for Content-Based Indexing of a Database of News Stories”, Proc. of the Second Annual Conf. on Innovative Applications of Artificial Intelligence, 1990; http://portal.acm.org/citation.cfm?id=653070

[11] Dobrov B. V., Lukashevich N. V., “Avtomaticheskaya rubrikatsiya polnotekstovykh dokumentov po klassifikatoram slozhnoi struktury”, Vosmaya nats. konf. po iskusstvennomu intellektu, Trudy konf., T. 1, Fizmatlit, M., 2002, 178–186

[12] Dobrov B. V., Lukashevich N. V., “Tezaurus i avtomaticheskoe kontseptualnoe indeksirovanie v universitetskoi informatsionnoi sisteme ROSSIYa”, Tretya Vseros. konf. po elektronnym bibliotekam “Elektronnye biblioteki: perspektivnye metody i tekhnologii, elektronnye kollektsii”, Trudy konf., Petrozavodsk, 2001, 78–82

[13] Ageev M. S., Dobrov B. V., Makarov-Zemlyanskii N. V., “Metod mashinnogo obucheniya, osnovannyi na modelirovanii logiki rubrikatora”, Elektronnye biblioteki: perspektivnye metody i tekhnologii, elektronnye kollektsii, Trudy 5-i Vseros. nauch. konf. RCDL' 2003, NII Khimii SPbGU, SPb., 2003, 150–158

[14] Ageev M. S., Kuralenok I. E., “Ofitsialnye metriki ROMIP' 2004”, Rossiiskii seminar po otsenke metodov informatsionnogo poiska, Sb. trudov, ed. I. S. Nekrestyanov, NII Khimii SPbGU, SPb., 2004, 142–150

[15] Lewis D., Reuters–21578 text categorization test collection. Distribution 1.0, http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt

[16] Debole F., Sebastiani F., “An Analysis of the Relative Hardness of Reuters–21578 Subsets”, Proc. of LREC–04, 4th Int. Conf. on Language Resources and Evaluation, Lisbon, PT, 2004, 971–974; http://citeseer.ist.psu.edu/691424.html

[17] Ageev M., Dobrov B., “Support Vector Machine Parameter Optimization for Text Categorization Problems”, Information Systems Technology and its Applications (ISTA' 2003), Proc. Int. Conf., V. 30, 2003, 165–176

[18] Ageev M. S., Metody avtomaticheskoi rubrikatsii tekstov, osnovannye na mashinnom obuchenii i znaniyakh ekspertov, Dis. ... kand. fiz.-matem. nauk, M., 2005; http://www.cir.ru/docs/ips/publications/2005_diss_ageev.pdf

[19] Ageev M. S., Dobrov B. V., Lukashevich N. V., Sidorov A. V., “Eksperimentalnye algoritmy poiska/klassifikatsii i sravnenie s ‘basic line’”, Rossiiskii seminar po otsenke metodov informatsionnogo poiska, Sb. trudov, ed. I. S. Nekrestyanov, NII Khimii SPbGU, SPb., 2004, 62–89

[20] I. S. Nekrestyanov (ed.), Trudy ROMIP' 2006, NU TsSI, SPb., 2006, 274 pp.
