Document Representations for Classification of Short Web-Page Descriptions
Yugoslav journal of operations research, Volume 18 (2008) no. 1, p. 123
This article was harvested from the eLibrary of the Mathematical Institute of the Serbian Academy of Sciences and Arts.
Motivated by applying Text Categorization to classification of Web search
results, this paper describes an extensive experimental study of the impact of bag-of-words
document representations on the performance of five major classifiers – Naïve
Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, representing short Web-page
descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open
Directory Web-page ontology, and classifiers are trained to automatically determine the
topics which may be relevant to a previously unseen Web-page. Different
transformations of input data: stemming, normalization, logtf and idf, together with
dimensionality reduction, are found to have a statistically significant improving or
degrading effect on classification performance measured by classical metrics – accuracy,
precision, recall, $F_1$ and $F_2$. The emphasis of the study is not on determining the best
document representation which corresponds to each classifier, but rather on describing
the effects of every individual transformation on classification, together with their mutual
relationships.
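The transformations named above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the exact definitions used in the paper are not given here, so the following are assumptions: logtf is taken as log(1 + tf), idf as log(N / df), and normalization as scaling each document vector to unit Euclidean length. Stemming is left out; the input tokens are assumed to be already tokenized (and stemmed, if desired). The function names `bag_of_words` and `f_beta` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs, logtf=False, idf=False, normalize=False):
    """Map tokenized documents to sparse {term: weight} dictionaries.

    Assumed definitions (not taken from the paper):
      logtf     -> log(1 + tf)
      idf       -> multiply by log(N / df)
      normalize -> scale each vector to unit Euclidean length
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        weights = {}
        for term, tf in Counter(doc).items():
            w = math.log(1 + tf) if logtf else float(tf)
            if idf:
                w *= math.log(n_docs / df[term])
            weights[term] = w
        if normalize:
            norm = math.sqrt(sum(w * w for w in weights.values()))
            if norm > 0:
                weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


def f_beta(precision, recall, beta=1.0):
    """F-measure; beta=1 gives F_1, beta=2 gives F_2 (recall weighted more)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Two toy "Web-page descriptions", already tokenized.
    docs = [["web", "page", "web"], ["page", "directory"]]
    for vec in bag_of_words(docs, logtf=True, idf=True, normalize=True):
        print(vec)
    print(f_beta(0.5, 1.0, beta=2))
```

Under these assumed definitions, a term that occurs in every document receives an idf weight of zero, which is one way such a transformation can change what a classifier sees on very short texts.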
Classification: 68T05 68U35 68T10
Keywords: Text categorization, document representation, machine learning.
@article{YJOR_2008_18_1_a10,
author = {Milo\v{s} Radovanovi\'c and Mirjana Ivanovi\'c},
title = {Document {Representations} for {Classification} of {Short} {Web-Page} {Descriptions}},
journal = {Yugoslav journal of operations research},
pages = {123},
year = {2008},
volume = {18},
number = {1},
zbl = {1199.68230},
language = {en},
url = {http://geodesic.mathdoc.fr/item/YJOR_2008_18_1_a10/}
}
TY - JOUR
AU - Miloš Radovanović
AU - Mirjana Ivanović
TI - Document Representations for Classification of Short Web-Page Descriptions
JO - Yugoslav journal of operations research
PY - 2008
SP - 123
VL - 18
IS - 1
UR - http://geodesic.mathdoc.fr/item/YJOR_2008_18_1_a10/
LA - en
ID - YJOR_2008_18_1_a10
ER -
Miloš Radovanović; Mirjana Ivanović. Document Representations for Classification of Short Web-Page Descriptions. Yugoslav journal of operations research, Volume 18 (2008) no. 1, p. 123. http://geodesic.mathdoc.fr/item/YJOR_2008_18_1_a10/