Text document classification based on mixture models
Kybernetika, Volume 40 (2004) no. 3, pp. 293-304. This article was harvested from the Czech Digital Mathematics Library.


Finite mixture modelling of class-conditional distributions is a standard method in statistical pattern recognition. Using a bag-of-words vector representation of documents, this paper explores the mixture of multinomial distributions as a model of the class-conditional distribution in multiclass text document classification. The proposed model was compared experimentally with the standard Bernoulli and multinomial models, and with a model based on a mixture of multivariate Bernoulli distributions, on the Reuters-21578 and Newsgroups data sets. Preliminary experimental results indicate that the proposed model is effective for text classification.
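To make the idea in the abstract concrete, the following is a minimal sketch of a classifier whose class-conditional distributions are mixtures of multinomials fitted by EM on bag-of-words count vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fit_mixture`, `train`, `classify`), the smoothing constant `alpha`, the number of components, and the toy data are all hypothetical, and details such as feature selection or the choice of mixture size are not covered.

```python
import math
import random

def log_multinomial(counts, log_theta):
    # Log-likelihood of word counts without the multinomial coefficient,
    # which is identical for every component and class and so cancels.
    return sum(c * lt for c, lt in zip(counts, log_theta) if c)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def fit_mixture(docs, n_components, n_iter=50, alpha=1.0, seed=0):
    """EM for a mixture of multinomials (alpha is an assumed smoothing term)."""
    rng = random.Random(seed)
    vocab, n_docs, eps = len(docs[0]), len(docs), 1e-6
    # Start from random soft assignments of documents to components.
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(n_components)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(n_iter):
        # M-step: component weights and smoothed word probabilities.
        weights, log_thetas = [], []
        for m in range(n_components):
            n_m = sum(r[m] for r in resp)
            weights.append((n_m + eps) / (n_docs + n_components * eps))
            mass = [alpha] * vocab
            for r, doc in zip(resp, docs):
                for v, c in enumerate(doc):
                    if c:
                        mass[v] += r[m] * c
            norm = sum(mass)
            log_thetas.append([math.log(x / norm) for x in mass])
        # E-step: posterior component responsibilities for each document.
        for i, doc in enumerate(docs):
            logp = [math.log(w) + log_multinomial(doc, lt)
                    for w, lt in zip(weights, log_thetas)]
            z = logsumexp(logp)
            resp[i] = [math.exp(lp - z) for lp in logp]
    return weights, log_thetas

def mixture_loglik(doc, model):
    weights, log_thetas = model
    return logsumexp([math.log(w) + log_multinomial(doc, lt)
                      for w, lt in zip(weights, log_thetas)])

def train(docs_by_class, n_components=2):
    # One mixture per class, plus the log class prior from class frequencies.
    total = sum(len(docs) for docs in docs_by_class.values())
    return {label: (math.log(len(docs) / total), fit_mixture(docs, n_components))
            for label, docs in docs_by_class.items()}

def classify(doc, classifier):
    # Bayes rule: pick the class maximising log prior + log mixture likelihood.
    return max(classifier,
               key=lambda c: classifier[c][0] + mixture_loglik(doc, classifier[c][1]))

# Hypothetical toy corpus: 4-word vocabulary, two classes with two sub-topics each.
toy = {
    "sport":   [[5, 0, 0, 0], [4, 1, 0, 0], [0, 5, 0, 0], [1, 4, 0, 0]],
    "finance": [[0, 0, 5, 0], [0, 0, 4, 1], [0, 0, 0, 5], [0, 0, 1, 4]],
}
clf = train(toy, n_components=2)
print(classify([3, 2, 0, 0], clf), classify([0, 0, 2, 3], clf))
```

With a single component per class (`n_components=1`) this sketch reduces to the standard multinomial model mentioned in the abstract, which is one way to see what the mixture adds: several word distributions per class rather than one.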
Classification: 62G05, 62H30, 68T10
Keywords: text classification; multinomial mixture model
@article{KYB_2004_40_3_a2,
     author = {Novovi\v{c}ov\'a, Jana and Mal{\'\i}k, Anton{\'\i}n},
     title = {Text document classification based on mixture models},
     journal = {Kybernetika},
     pages = {293--304},
     year = {2004},
     volume = {40},
     number = {3},
     mrnumber = {2103933},
     zbl = {1248.62107},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/KYB_2004_40_3_a2/}
}
TY  - JOUR
AU  - Novovičová, Jana
AU  - Malík, Antonín
TI  - Text document classification based on mixture models
JO  - Kybernetika
PY  - 2004
SP  - 293
EP  - 304
VL  - 40
IS  - 3
UR  - http://geodesic.mathdoc.fr/item/KYB_2004_40_3_a2/
LA  - en
ID  - KYB_2004_40_3_a2
ER  - 
%0 Journal Article
%A Novovičová, Jana
%A Malík, Antonín
%T Text document classification based on mixture models
%J Kybernetika
%D 2004
%P 293-304
%V 40
%N 3
%U http://geodesic.mathdoc.fr/item/KYB_2004_40_3_a2/
%G en
%F KYB_2004_40_3_a2
Novovičová, Jana; Malík, Antonín. Text document classification based on mixture models. Kybernetika, Volume 40 (2004) no. 3, pp. 293-304. http://geodesic.mathdoc.fr/item/KYB_2004_40_3_a2/
