Rank-frequency distribution of natural languages: A difference of probabilities approach

In this paper we investigate the time variation of the rank k of words for six Indo-European languages using the Google Books N-gram Dataset. Based on numerical evidence, we regard k as a random variable whose dynamics may be described by a Fokker–Planck equation, which we solve analytically. For low ranks the languages behave differently, possibly due to their syntax rules, whereas for k>50 the law of large numbers predominates. We analyze the frequency distribution of words in the data and fit it with time-dependent probability density distributions. We find small differences between the data and the fits due to conflicting dynamic mechanisms, but the data behave consistently with our general approach. For the lower ranks the behavior of the data varies among languages, presumably, again, because of distinct dynamic mechanisms. We discuss a possible origin of these differences and assess the novel features and limitations of our work.

Fokker–Planck equation
Languages
Master equation
Rank dynamics

Cocho, Germinal
Rodríguez, Rosalío F.
Sánchez, Sergio
Flores, Jorge
Pineda, Carlos
Gershenson, Carlos

https://doi.org/10.1016/j.physa.2019.121795