On sub-gaussian concentration of missing mass
Teoriâ veroâtnostej i ee primeneniâ, Volume 68 (2023) no. 2, pp. 393-400. This article was harvested from the source Math-Net.Ru.


Statistical inference on the missing mass aims to estimate the total weight of the elements not observed during sampling. Since the pioneering work of Good and Turing, the problem has been studied in many areas, including statistical linguistics, ecology, and machine learning. Proving the sub-Gaussian behavior of the missing mass has been notoriously hard, and a number of complicated arguments have been proposed: logarithmic Sobolev inequalities, thermodynamic approaches, and information-theoretic transportation methods. Prior works have argued that the difficulty is inherent and that classical tools are inadequate. We show that this common belief is false: the classical inequality of Bernstein is all that is needed to establish the sub-Gaussian concentration. The educational value of our work lies in demonstrating this inequality in its full generality, an aspect not well recognized by researchers.
Keywords: missing mass, measure concentration, heterogenic Bernstein's inequality, sub-Gamma concentration.
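To make the quantity in the abstract concrete, the following is a minimal Python sketch, not taken from the article. It assumes a known discrete distribution given as a dictionary `probs`, and all function names (`missing_mass`, `good_turing_estimate`, `simulate_missing_mass`) are hypothetical. It computes the missing mass of a sample, that is, the total probability of the symbols that were never drawn, together with the Good-Turing estimate, which is the fraction of the sample made up of symbols seen exactly once.

```python
import random
from collections import Counter


def missing_mass(probs, sample):
    """Total probability of the symbols that never appear in the sample."""
    seen = set(sample)
    return sum(p for symbol, p in probs.items() if symbol not in seen)


def good_turing_estimate(sample):
    """Good-Turing estimate of the missing mass: (symbols seen exactly once) / n."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


def simulate_missing_mass(probs, n, trials, seed=0):
    """Draw `trials` i.i.d. samples of size n and return the missing mass of each."""
    rng = random.Random(seed)
    symbols = list(probs)
    weights = [probs[s] for s in symbols]
    return [
        missing_mass(probs, rng.choices(symbols, weights=weights, k=n))
        for _ in range(trials)
    ]


if __name__ == "__main__":
    # Illustrative distribution (an assumption): uniform over 1000 symbols.
    probs = {i: 1 / 1000 for i in range(1000)}
    values = simulate_missing_mass(probs, n=500, trials=200)
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    print(f"mean missing mass: {mean:.4f}, observed spread: {spread:.4f}")
```

Running the simulation for growing `n` gives an informal view of how tightly the missing mass clusters around its mean, which is the behavior that the concentration results quantify. The simulation is only an illustration and does not verify any bound.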
@article{TVP_2023_68_2_a10,
     author = {M. Skorski},
     title = {On sub-gaussian concentration of missing mass},
     journal = {Teori\^a vero\^atnostej i ee primeneni\^a},
     pages = {393--400},
     year = {2023},
     volume = {68},
     number = {2},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/TVP_2023_68_2_a10/}
}
M. Skorski. On sub-gaussian concentration of missing mass. Teoriâ veroâtnostej i ee primeneniâ, Volume 68 (2023) no. 2, pp. 393-400. http://geodesic.mathdoc.fr/item/TVP_2023_68_2_a10/
