01973nas a2200229 4500008004100000022001400041245005100055210005100106300000700157490000600164520131800170100002301488700001701511700002101528700002801549700001901577700002201596700002001618700001801638700002301656856006401679 2018 eng d a2296-424X00aRank Dynamics of Word Usage at Multiple Scales0 aRank Dynamics of Word Usage at Multiple Scales a450 v63 aThe recent dramatic increase in online data availability has allowed researchers to explore human culture with unprecedented detail, such as the growth and diversification of language. In particular, it provides statistical tools to explore whether word use is similar across languages, and if so, whether these generic features appear at different scales of language structure. Here we use the Google Books $N$-grams dataset to analyze the temporal evolution of word usage in several languages. We apply measures proposed recently to study rank dynamics, such as the diversity of $N$-grams in a given rank, the probability that an $N$-gram changes rank between successive time intervals, the rank entropy, and the rank complexity. Using different methods, results show that there are generic properties for different languages at different scales, such as a core of words necessary to minimally understand a language. We also propose a null model to explore the relevance of linguistic structure across multiple scales, concluding that $N$-gram statistics cannot be reduced to word statistics. We expect our results to be useful in improving text prediction algorithms, as well as in shedding light on the large-scale features of language use, beyond linguistic and cultural differences across human populations.1 aMorales, José, A.1 aColman, Ewan1 aSánchez, Sergio1 aSánchez-Puig, Fernanda1 aPineda, Carlos1 aIñiguez, Gerardo1 aCocho, Germinal1 aFlores, Jorge1 aGershenson, Carlos uhttps://www.frontiersin.org/article/10.3389/fphy.2018.0004501729nas a2200181 4500008004100000245007900041210006900120260003400189300001300223490000700236520115000243100002001393700001801413700002301431700001901454700002101473856005301494 2015 eng d00aRank Diversity of Languages: Generic Behavior in Computational Linguistics0 aRank Diversity of Languages Generic Behavior in Computational Li bPublic Library of Sciencec04 ae01218980 v103 a
Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: “heads” consist of words which almost do not change their rank in time, “bodies” are words of general use, while “tails” are comprised by context-specific words and vary their rank considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.
1 aCocho, Germinal1 aFlores, Jorge1 aGershenson, Carlos1 aPineda, Carlos1 aSánchez, Sergio uhttp://dx.doi.org/10.1371%2Fjournal.pone.0121898