Verification of the Heaps law using the Google Books Ngram database
Učënye zapiski Kazanskogo universiteta. Seriâ Fiziko-matematičeskie nauki, Uchenye Zapiski Kazanskogo Universiteta. Seriya Fiziko-Matematicheskie Nauki, Volume 155 (2013) no. 4, pp. 16-23. This article was harvested from the source Math-Net.Ru.


This article verifies the Heaps empirical law for European languages using data from the Google Books Ngram corpus. It is shown that the Heaps law holds only for short texts and for texts covering short historical periods. The Heaps exponent decreases over time and varies significantly within characteristic intervals of 60–100 years. The relationship between the word frequency distribution and the expected dependence of the number of distinct words on the text size is analyzed in terms of a simple probability model of text generation. This model explains the observed decreasing trend of the Heaps exponent.
Keywords: Heaps law, Zipf law, text probability models, Google Books Ngram corpus.
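For orientation, the Heaps law is usually written as V(N) ≈ K·N^β, where N is the text size in words, V(N) is the number of distinct words, and β is the Heaps exponent. The sketch below is only an illustration and is not taken from the article: it assumes a toy text model in which words are drawn independently from a Zipf-like distribution over a finite lexicon, and it estimates β by a least-squares fit in log-log coordinates. All function names, the lexicon size, the Zipf exponent, and the checkpoint sizes are hypothetical choices made for this example.

```python
import math
import random


def vocabulary_growth(tokens, checkpoints):
    """Return the number of distinct words seen at each checkpoint text size."""
    seen = set()
    sizes = []
    targets = sorted(checkpoints)
    idx = 0
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        # Record the vocabulary size whenever a checkpoint length is reached.
        while idx < len(targets) and targets[idx] == position:
            sizes.append(len(seen))
            idx += 1
    return sizes


def fit_heaps_exponent(text_sizes, vocab_sizes):
    """Least-squares slope of log V against log N, i.e. beta in V ~ K * N**beta."""
    xs = [math.log(n) for n in text_sizes]
    ys = [math.log(v) for v in vocab_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def zipf_text(length, lexicon_size, exponent, rng):
    """Assumed toy model: independent draws with P(rank) ~ rank**(-exponent)."""
    weights = [rank ** (-exponent) for rank in range(1, lexicon_size + 1)]
    return rng.choices(range(lexicon_size), weights=weights, k=length)


# Hypothetical parameters chosen only for illustration.
rng = random.Random(0)
tokens = zipf_text(length=50_000, lexicon_size=20_000, exponent=1.0, rng=rng)
checkpoints = [1_000, 2_000, 5_000, 10_000, 20_000, 50_000]
vocab = vocabulary_growth(tokens, checkpoints)
beta = fit_heaps_exponent(checkpoints, vocab)
print("vocabulary sizes:", vocab)
print("estimated Heaps exponent:", beta)
```

Fitting the exponent separately on short and long prefixes of the same simulated text is one simple way to see whether a single power law describes the whole growth curve under this toy model.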
@article{UZKU_2013_155_4_a1,
     author = {V. V. Bochkarev and E. Yu. Lerner and A. V. Shevlyakova},
     title = {Verification of the {Heaps} law using the {Google} {Books} {Ngram} database},
     journal = {U\v{c}\"enye zapiski Kazanskogo universiteta. Seri\^a Fiziko-matemati\v{c}eskie nauki},
     pages = {16--23},
     year = {2013},
     volume = {155},
     number = {4},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/UZKU_2013_155_4_a1/}
}
TY  - JOUR
AU  - V. V. Bochkarev
AU  - E. Yu. Lerner
AU  - A. V. Shevlyakova
TI  - Verification of the Heaps law using the Google Books Ngram database
JO  - Učënye zapiski Kazanskogo universiteta. Seriâ Fiziko-matematičeskie nauki
PY  - 2013
SP  - 16
EP  - 23
VL  - 155
IS  - 4
UR  - http://geodesic.mathdoc.fr/item/UZKU_2013_155_4_a1/
LA  - ru
ID  - UZKU_2013_155_4_a1
ER  - 
%0 Journal Article
%A V. V. Bochkarev
%A E. Yu. Lerner
%A A. V. Shevlyakova
%T Verification of the Heaps law using the Google Books Ngram database
%J Učënye zapiski Kazanskogo universiteta. Seriâ Fiziko-matematičeskie nauki
%D 2013
%P 16-23
%V 155
%N 4
%U http://geodesic.mathdoc.fr/item/UZKU_2013_155_4_a1/
%G ru
%F UZKU_2013_155_4_a1
V. V. Bochkarev; E. Yu. Lerner; A. V. Shevlyakova. Verification of the Heaps law using the Google Books Ngram database. Učënye zapiski Kazanskogo universiteta. Seriâ Fiziko-matematičeskie nauki, Uchenye Zapiski Kazanskogo Universiteta. Seriya Fiziko-Matematicheskie Nauki, Volume 155 (2013) no. 4, pp. 16-23. http://geodesic.mathdoc.fr/item/UZKU_2013_155_4_a1/

[1] Baayen R. H., Word Frequency Distributions, Kluwer Acad. Pub., Dordrecht, 2001, 359 pp.

[2] Michel J. B., Shen Y. K., Aiden A. P., Veres A., Gray M. K., The Google Books Team, Pickett J. P., Hoiberg D., Clancy D., Norvig P., Orwant J., Pinker S., Nowak M. A., Aiden E. L., “Quantitative analysis of culture using millions of digitized books”, Science, 331 (2011), 176–182

[3] Petersen A. M., Tenenbaum J. N., Havlin S., Stanley H. E., Perc M., “Languages cool as they expand: Allometric scaling and the decreasing need for new words”, Sci. Rep., 2 (2012), Art. 943, 10 pp.

[4] Gerlach M., Altmann E. G., “Stochastic model for the vocabulary growth in natural languages”, Phys. Rev. X, 3:2 (2013), 021006-1–021006-10

[5] Naranan S., Balasubrahmanyan V. K., “Models for power law relations in linguistics and information science”, J. Quant. Linguist., 5:1–2 (1998), 35–61

[6] Ferrer i Cancho R., Solé R. V., “Two regimes in the frequency of words and the origins of complex lexicons: Zipf's law revisited”, J. Quant. Linguist., 8:3 (2001), 165–173

[7] Bochkarev V. V., Lerner E. Yu., Zipf and non-Zipf laws for homogeneous Markov chain, 2012, arXiv: 1207.1872v2

[8] Bochkarev V. V., Lerner E. Yu., “The Zipf law for random texts with unequal letter probabilities and the Pascal pyramid”, Russ. Math. (Iz. VUZ), 56:12 (2012), 25–27

[9] Hughes J. M., Foti N. J., Krakauer D. C., Rockmore D. N., “Quantitative patterns of stylistic influence in the evolution of literature”, Proc. Natl. Acad. Sci. USA, 109:20 (2012), 7682–7686