On the accuracy of cross-validation in the classification problem
The Bulletin of Irkutsk State University. Series Mathematics, Volume 38 (2021), pp. 84-95. This article was harvested from the source Math-Net.Ru.

In this work we study the accuracy of cross-validation estimates for decision functions. The main idea of the research is a statistical modeling scheme that makes it possible to obtain, from real data, statistical estimates that are usually obtained only with model (synthetic) distributions. The experiments confirm the well-known empirical recommendation to choose the number of folds equal to 5 or more. Choosing more than 10 folds does not yield a significant increase in accuracy. Repeated cross-validation also provides no fundamental gain in precision. The experimental results allow us to formulate an empirical fact: the accuracy of estimates obtained by cross-validation is approximately the same as the accuracy of estimates obtained from a test sample of half the size. This result is easily explained: all objects of a test sample are independent, whereas the estimates built by cross-validation on different subsamples (folds) are not independent.
Keywords: K-fold cross-validation, accuracy, statistical estimates, machine learning.
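The comparison described in the abstract can be illustrated with a small simulation. The sketch below is not the author's scheme, which works with real data. It assumes a synthetic two-class Gaussian distribution and a nearest-centroid classifier, both chosen here only for brevity. All names (`make_data`, `fit_centroids`, `predict`, `kfold_error`, `compare`) are hypothetical. For each trial it measures how far the K-fold estimate and an estimate from an independent test sample of half the size fall from the (approximated) true error of the classifier trained on the full sample.

```python
import numpy as np


def make_data(n, rng):
    """Draw n objects from two 2-D Gaussian classes (assumed toy distribution)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, 2)) + 1.0 * y[:, None]  # class 1 is shifted by (1, 1)
    return x, y


def fit_centroids(x, y):
    """Nearest-centroid 'training': one mean vector per class."""
    return np.stack([x[y == c].mean(axis=0) for c in (0, 1)])


def predict(centroids, x):
    """Assign each object to the class with the closest centroid."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kfold_error(x, y, k, rng):
    """K-fold cross-validation estimate of the misclassification rate."""
    idx = rng.permutation(len(y))
    mistakes = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)  # all objects outside the current fold
        c = fit_centroids(x[train], y[train])
        mistakes += int((predict(c, x[fold]) != y[fold]).sum())
    return mistakes / len(y)


def compare(n=100, k=10, trials=200, seed=0):
    """RMS deviation from the true error: K-fold CV vs. a test sample of size n // 2."""
    rng = np.random.default_rng(seed)
    cv_dev, test_dev = [], []
    for _ in range(trials):
        x, y = make_data(n, rng)
        c = fit_centroids(x, y)  # decision function trained on the full sample
        xb, yb = make_data(5000, rng)  # large sample approximates the true error
        true_err = (predict(c, xb) != yb).mean()
        cv_dev.append(kfold_error(x, y, k, rng) - true_err)
        xt, yt = make_data(n // 2, rng)  # independent test sample of half the size
        test_dev.append((predict(c, xt) != yt).mean() - true_err)

    def rms(d):
        return float(np.sqrt(np.mean(np.square(d))))

    return rms(cv_dev), rms(test_dev)


if __name__ == "__main__":
    cv_rms, test_rms = compare()
    print(f"RMS deviation, 10-fold CV:       {cv_rms:.4f}")
    print(f"RMS deviation, test sample n/2:  {test_rms:.4f}")
```

Varying `k` in `compare` gives a rough way to probe the recommendation on the number of folds under these toy assumptions. The numbers it produces depend on the assumed distribution and classifier and should not be read as a reproduction of the paper's results.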
@article{IIGUM_2021_38_a5,
     author = {V. M. Nedel'ko},
     title = {On the accuracy of cross-validation in the classification problem},
     journal = {The Bulletin of Irkutsk State University. Series Mathematics},
     pages = {84--95},
     year = {2021},
     volume = {38},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/IIGUM_2021_38_a5/}
}
V. M. Nedel'ko. On the accuracy of cross-validation in the classification problem. The Bulletin of Irkutsk State University. Series Mathematics, Volume 38 (2021), pp. 84-95. http://geodesic.mathdoc.fr/item/IIGUM_2021_38_a5/
