A statistical test for correspondence of texts to the Zipf–Mandelbrot law
Sibirskie èlektronnye matematičeskie izvestiâ, Volume 17 (2020), pp. 1959-1974

See the article record from the source Math-Net.Ru

We analyse how well texts correspond to a simple probabilistic model. The model assumes that words are selected independently from an infinite dictionary and that the probability distribution of words follows the Zipf–Mandelbrot law. We count the number of different words in the text sequentially, obtaining the process of numbers of different words. We then estimate the parameters of the Zipf–Mandelbrot law from the same sequence and construct an estimate of the expected number of different words in the text. Next, we subtract the corresponding values of this estimate from the sequence and normalize along the coordinate axes, obtaining a random process on the segment from $0$ to $1$. We prove that this process (the empirical text bridge) converges weakly in the uniform metric on $C(0, 1)$ to a centered Gaussian process with a.s. continuous paths. We develop and implement an algorithm for calculating the probability distribution of the integral of the square of this process. We present several examples in which the algorithm is applied to analyse the homogeneity of texts in English, French, Russian, and Chinese.
Keywords: Zipf's law, weak convergence, Gaussian process.
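The construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the parameter estimator or the normalization, so the sketch assumes a Heaps-type growth $E R_k \approx C k^{\theta}$ for the number of different words $R_k$, estimates $\theta$ as $\log_2(R_n / R_{\lfloor n/2 \rfloor})$, and divides by $\sqrt{R_n}$. The function names are hypothetical, and the integral of the square is approximated by a plain Riemann sum.

```python
import math


def distinct_word_counts(words):
    """Return [R_1, ..., R_n]: R_j is the number of different words among the first j."""
    seen = set()
    counts = []
    for w in words:
        seen.add(w)
        counts.append(len(seen))
    return counts


def empirical_bridge(counts):
    """Centred, normalized process on the grid k/n, k = 0..n (needs n >= 2).

    Assumptions, not taken from the abstract:
      * E R_k grows like C * k**theta;
      * theta is estimated as log2(R_n / R_{n // 2});
      * deviations are scaled by sqrt(R_n).
    """
    n = len(counts)
    r_n = counts[-1]
    r_half = counts[n // 2 - 1]          # R_{n // 2}
    theta = math.log2(r_n / r_half)
    bridge = [0.0]                        # value at t = 0
    for k in range(1, n + 1):
        t = k / n
        expected = t ** theta * r_n       # estimated expectation of R_k
        bridge.append((counts[k - 1] - expected) / math.sqrt(r_n))
    return theta, bridge


def integral_of_square(bridge):
    """Riemann-sum approximation of the integral of the squared process over [0, 1]."""
    n = len(bridge) - 1
    return sum(z * z for z in bridge[1:]) / n


words = "a b a c b d a e".split()
counts = distinct_word_counts(words)
theta, bridge = empirical_bridge(counts)
statistic = integral_of_square(bridge)
```

By construction the process vanishes at both ends of the segment, which is why it is called a bridge. Turning the statistic into a p-value requires the limiting distribution computed by the paper's algorithm, which this sketch does not attempt.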
@article{SEMR_2020_17_a52,
     author = {A. Chakrabarty and M. G. Chebunin and A. P. Kovalevskii and I. M. Pupyshev and N. S. Zakrevskaya and Q. Zhou},
     title = {A statistical test for correspondence of texts to the {Zipf---Mandelbrot} law},
     journal = {Sibirskie \`elektronnye matemati\v{c}eskie izvesti\^a},
     pages = {1959--1974},
     publisher = {mathdoc},
     volume = {17},
     year = {2020},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/SEMR_2020_17_a52/}
}
TY  - JOUR
AU  - A. Chakrabarty
AU  - M. G. Chebunin
AU  - A. P. Kovalevskii
AU  - I. M. Pupyshev
AU  - N. S. Zakrevskaya
AU  - Q. Zhou
TI  - A statistical test for correspondence of texts to the Zipf–Mandelbrot law
JO  - Sibirskie èlektronnye matematičeskie izvestiâ
PY  - 2020
SP  - 1959
EP  - 1974
VL  - 17
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/SEMR_2020_17_a52/
LA  - en
ID  - SEMR_2020_17_a52
ER  - 
%0 Journal Article
%A A. Chakrabarty
%A M. G. Chebunin
%A A. P. Kovalevskii
%A I. M. Pupyshev
%A N. S. Zakrevskaya
%A Q. Zhou
%T A statistical test for correspondence of texts to the Zipf–Mandelbrot law
%J Sibirskie èlektronnye matematičeskie izvestiâ
%D 2020
%P 1959-1974
%V 17
%I mathdoc
%U http://geodesic.mathdoc.fr/item/SEMR_2020_17_a52/
%G en
%F SEMR_2020_17_a52
A. Chakrabarty; M. G. Chebunin; A. P. Kovalevskii; I. M. Pupyshev; N. S. Zakrevskaya; Q. Zhou. A statistical test for correspondence of texts to the Zipf–Mandelbrot law. Sibirskie èlektronnye matematičeskie izvestiâ, Volume 17 (2020), pp. 1959-1974. http://geodesic.mathdoc.fr/item/SEMR_2020_17_a52/