A statistical test for correspondence of texts to the Zipf–Mandelbrot law
Sibirskie èlektronnye matematičeskie izvestiâ, Vol. 17 (2020), pp. 1959-1974.

See the article record from the source Math-Net.Ru

We test how well texts fit a simple probabilistic model. The model assumes that words are drawn independently from an infinite dictionary and that the word probabilities follow the Zipf–Mandelbrot law. Reading the text sequentially, we count the number of different words seen so far, which gives the process of the numbers of different words. We then estimate the parameters of the Zipf–Mandelbrot law from the same sequence and construct an estimate of the expected number of different words in the text. Subtracting the corresponding values of this estimate from the sequence and normalizing along both coordinate axes yields a random process on the segment from $0$ to $1$. We prove that this process (the empirical text bridge) converges weakly in the uniform metric on $C(0, 1)$ to a centered Gaussian process with a.s. continuous paths. We develop and implement an algorithm for calculating the probability distribution of the integral of the square of this process. We present several examples in which the algorithm is applied to analyse the homogeneity of texts in English, French, Russian, and Chinese.
Keywords: Zipf's law, weak convergence, Gaussian process.
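
The construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that under the model the expected number of different words among the first $k$ words grows roughly like $C k^{\theta}$ (a Heaps-type approximation), estimates $\theta$ by a simple log-ratio of the counts at $n$ and $n/2$, and approximates the integral of the squared bridge by a Riemann sum. All function names, the parameter values, and the truncated dictionary in the demo are hypothetical choices made for illustration; the limiting distribution needed for an actual test is not computed here.

```python
import math
import random


def distinct_word_counts(words):
    """Return [R_1, ..., R_n]; R_k is the number of different words among the first k."""
    seen = set()
    counts = []
    for w in words:
        seen.add(w)
        counts.append(len(seen))
    return counts


def estimate_exponent(counts):
    """Illustrative log-ratio estimate of theta, assuming E R_k ~ C * k**theta.

    Uses R_n and R_{floor(n/2)}; for odd n this is only approximate.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two words")
    return math.log2(counts[n - 1] / counts[n // 2 - 1])


def empirical_bridge(counts, theta):
    """Centre R_k by (k/n)**theta * R_n and scale by sqrt(R_n) (assumed normalization)."""
    n = len(counts)
    r_n = counts[-1]
    return [
        (counts[k - 1] - (k / n) ** theta * r_n) / math.sqrt(r_n)
        for k in range(1, n + 1)
    ]


def omega_square(bridge):
    """Riemann-sum approximation of the integral of the squared bridge over [0, 1]."""
    return sum(z * z for z in bridge) / len(bridge)


if __name__ == "__main__":
    random.seed(0)
    theta, q = 0.5, 2.0          # hypothetical Zipf-Mandelbrot parameters
    ranks = range(1, 20001)      # truncated stand-in for the infinite dictionary
    weights = [(i + q) ** (-1.0 / theta) for i in ranks]
    text = random.choices(ranks, weights=weights, k=5000)

    counts = distinct_word_counts(text)
    theta_hat = estimate_exponent(counts)
    statistic = omega_square(empirical_bridge(counts, theta_hat))
    print(theta_hat, statistic)
```

By construction the bridge vanishes at the right endpoint, since the centring term equals $R_n$ at $k = n$. Large values of the statistic would indicate departure from the model, but critical values have to come from the limiting Gaussian process described in the abstract.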
@article{SEMR_2020_17_a52,
     author = {A. Chakrabarty and M. G. Chebunin and A. P. Kovalevskii and I. M. Pupyshev and N. S. Zakrevskaya and Q. Zhou},
     title = {A statistical test for correspondence of texts to the {Zipf---Mandelbrot} law},
     journal = {Sibirskie \`elektronnye matemati\v{c}eskie izvesti\^a},
     pages = {1959--1974},
     publisher = {mathdoc},
     volume = {17},
     year = {2020},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/SEMR_2020_17_a52/}
}
TY  - JOUR
AU  - A. Chakrabarty
AU  - M. G. Chebunin
AU  - A. P. Kovalevskii
AU  - I. M. Pupyshev
AU  - N. S. Zakrevskaya
AU  - Q. Zhou
TI  - A statistical test for correspondence of texts to the Zipf–Mandelbrot law
JO  - Sibirskie èlektronnye matematičeskie izvestiâ
PY  - 2020
SP  - 1959
EP  - 1974
VL  - 17
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/SEMR_2020_17_a52/
LA  - en
ID  - SEMR_2020_17_a52
ER  - 
%0 Journal Article
%A A. Chakrabarty
%A M. G. Chebunin
%A A. P. Kovalevskii
%A I. M. Pupyshev
%A N. S. Zakrevskaya
%A Q. Zhou
%T A statistical test for correspondence of texts to the Zipf–Mandelbrot law
%J Sibirskie èlektronnye matematičeskie izvestiâ
%D 2020
%P 1959-1974
%V 17
%I mathdoc
%U http://geodesic.mathdoc.fr/item/SEMR_2020_17_a52/
%G en
%F SEMR_2020_17_a52
A. Chakrabarty; M. G. Chebunin; A. P. Kovalevskii; I. M. Pupyshev; N. S. Zakrevskaya; Q. Zhou. A statistical test for correspondence of texts to the Zipf–Mandelbrot law. Sibirskie èlektronnye matematičeskie izvestiâ, Vol. 17 (2020), pp. 1959-1974. http://geodesic.mathdoc.fr/item/SEMR_2020_17_a52/
