Modifications of Karlin and Simon text models
Sibirskie èlektronnye matematičeskie izvestiâ, Volume 19 (2022) no. 2, pp. 708-723.

See the article record from the source Math-Net.Ru

We discuss probabilistic text models and their modifications. We construct the processes of the numbers of different and unique words in a text. The models should match the statistics of real texts. The infinite urn model (Karlin model) and the Simon model are the best-known text models, but they do not simulate the number of unique words correctly. The infinite urn model sometimes gives an incorrect limit for the ratio of the numbers of unique and different words. The Simon model implies linear growth of the numbers of different and unique words. We propose three modifications of the Karlin and Simon models. The first is an offline variant: the Simon model starts after the infinite urn scheme is complete. We prove limit theorems for this modification at embedded times only. The second modification introduces repeated words into the Karlin model; we prove limit theorems for it. The third is an online variant: the Simon redistribution acts at every toss of the Karlin model. In contrast to the compound Poisson model, we have no analytical results for this modification. We test all the modifications by simulation and obtain good agreement with real texts.
Keywords: probability text models, Simon model, infinite urn model, weak convergence.
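The two base models named in the abstract can be illustrated with a minimal simulation sketch. It is not the authors' code and does not implement any of the three modifications. It rests on several assumptions that are not in the source: "unique" is read as "occurring exactly once", the Karlin model uses a power-law word distribution truncated at a finite `vocab_cap`, and the Simon model copies a uniformly chosen earlier token with probability `1 - p_new`. All function and parameter names are hypothetical.

```python
import random
from collections import Counter


def count_different_and_unique(tokens):
    """Return (number of different words, number of words occurring exactly once)."""
    freq = Counter(tokens)
    different = len(freq)
    unique = sum(1 for c in freq.values() if c == 1)
    return different, unique


def karlin_text(n, exponent=1.5, vocab_cap=100_000, rng=None):
    """Infinite urn sketch: n i.i.d. draws of word i with P(i) proportional to i**(-exponent).

    Assumption: the infinite vocabulary is truncated at vocab_cap for simulation.
    """
    rng = rng or random.Random()
    words = range(1, vocab_cap + 1)
    weights = [i ** (-exponent) for i in words]
    return rng.choices(words, weights=weights, k=n)


def simon_text(n, p_new=0.1, rng=None):
    """Simon model sketch: each step adds a new word with probability p_new,
    otherwise repeats a token chosen uniformly from the text written so far."""
    if n <= 0:
        return []
    rng = rng or random.Random()
    tokens = [0]          # the first token is always a new word
    next_word = 1
    while len(tokens) < n:
        if rng.random() < p_new:
            tokens.append(next_word)
            next_word += 1
        else:
            tokens.append(rng.choice(tokens))
    return tokens


if __name__ == "__main__":
    for name, text in [
        ("Karlin", karlin_text(5000, rng=random.Random(0))),
        ("Simon", simon_text(5000, rng=random.Random(1))),
    ]:
        different, unique = count_different_and_unique(text)
        print(name, "different:", different, "unique:", unique)
```

In this sketch the Simon text gains a new word at a constant rate `p_new`, which is consistent with the linear growth of the number of different words that the abstract attributes to the Simon model.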
@article{SEMR_2022_19_2_a18,
     author = {M. G. Chebunin and A. P. Kovalevskii},
     title = {Modifications of {Karlin} and {Simon} text models},
     journal = {Sibirskie \`elektronnye matemati\v{c}eskie izvesti\^a},
     pages = {708--723},
     publisher = {mathdoc},
     volume = {19},
     number = {2},
     year = {2022},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/SEMR_2022_19_2_a18/}
}
M. G. Chebunin; A. P. Kovalevskii. Modifications of Karlin and Simon text models. Sibirskie èlektronnye matematičeskie izvestiâ, Volume 19 (2022) no. 2, pp. 708-723. http://geodesic.mathdoc.fr/item/SEMR_2022_19_2_a18/
