APPROACHES TO THE CLASSIFICATION OF COMPLEX SYSTEMS: WORDS, TEXTS, AND MORE

Andrij Rovenchak

Ivan Franko National University of Lviv, Faculty of Physics, Ukraine
We will start with some introductory information about notions of quantitative linguistics, such as rank--frequency dependence, Zipf's law [1], frequency spectra [2], etc. Similarities between the distribution of words in texts and level occupation in quantum ensembles hint at a superficial analogy with statistical physics [3]. We will thus be able to define various parameters for texts based on this physical analogy, including "temperature", "chemical potential", entropy, and some others.
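As a minimal illustration of the two basic notions above, the following Python sketch builds a rank--frequency list and a frequency spectrum from a tokenized text. The function names, the whitespace tokenization, and the toy sentence are assumptions made for this example only; they are not taken from the cited works.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return word frequencies in descending order; index i holds rank i + 1."""
    return sorted(Counter(tokens).values(), reverse=True)


def frequency_spectrum(tokens):
    """Map each frequency j to the number of distinct words occurring exactly j times."""
    word_counts = Counter(tokens)
    spectrum = Counter(word_counts.values())
    return dict(sorted(spectrum.items()))


# Toy input (hypothetical): a real study would use a full text and a proper tokenizer.
tokens = "the cat saw the dog and the dog saw a bird".split()

print(rank_frequency(tokens))      # [3, 2, 2, 1, 1, 1, 1]
print(frequency_spectrum(tokens))  # {1: 4, 2: 2, 3: 1}
```

In the physical analogy, the frequency spectrum is the quantity that would be compared with level occupations; the fitting procedure itself is not sketched here.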
The calculated parameters will make it possible to classify texts, which serve as an example of complex systems and are perhaps the easiest such systems to collect and analyze. In particular, a correlation is observed between the level of language analyticity and the analog of temperature. Such relations even allow certain observations to be made regarding the evolution of languages [4].
Similar approaches can be developed to study, for instance, genomes, owing to the well-known linguistic analogy [5]. We will consider certain nucleotide sequences in mitochondrial DNA [6] and demonstrate their possible application as an auxiliary tool for the comparative analysis of families and genera [7].
Finally, we will discuss entropy, one of the parameters that can be easily computed from rank--frequency dependences [8]. Although it serves as a discriminating parameter in some problems of classification of complex systems, entropy can be given a proper interpretation only in a limited class of problems. Its overall role and significance so far remain an open issue.
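To show how little is needed to obtain entropy from a rank--frequency dependence, here is a short Python sketch. It assumes the standard Shannon form with relative frequencies p_r = f_r / N and the natural logarithm; the exact definition and normalization used in [8] may differ, and the function name is hypothetical.

```python
import math


def entropy_from_frequencies(freqs):
    """Shannon entropy (in nats) of a list of absolute frequencies.

    Assumption: p_r = f_r / N, where N is the sum of all frequencies.
    """
    total = sum(freqs)
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)


# Four equally frequent items give the maximum entropy for four outcomes, ln 4.
uniform = entropy_from_frequencies([1, 1, 1, 1])
# A more skewed distribution over the same number of items gives a lower value.
skewed = entropy_from_frequencies([7, 1, 1, 1])
print(uniform, skewed)
```

The order of the frequencies does not affect the result, so the ranked list can be passed as it is.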

References:

[1] I.-I. Popescu, G. Altmann, P. Grzybek, B. D. Jayaram, R. Köhler, V. Krupa, J. Macutek, R. Pustet, L. Uhlirova, M. N. Vidya, Word frequency studies (Berlin--New York: Mouton de Gruyter, 2009).
[2] J. Tuldava, The frequency spectrum of text and vocabulary, J. Quant. Ling. 3(1), 38--50 (1996).
[3] A. Rovenchak & S. Buk, Application of a quantum ensemble model to linguistic analysis, Physica A 390(7), 1326--1331 (2011).
[4] A. Rovenchak, Trends in language evolution found from the frequency structure of texts mapped against the Bose-distribution, J. Quant. Ling. 21(3), 281--294 (2014).
[5] S. Ji, The linguistics of DNA: Words, sentences, grammar, phonetics, and semantics, Ann. New York Acad. Sci. 870, 411--417 (1999).
[6] J.-W. Taanman, The mitochondrial genome: structure, transcription, translation and replication, Biochim. Biophys. Acta 1410(2), 103--123 (1999).
[7] A. Rovenchak, Telling apart Felidae and Ursidae from the distribution of nucleotides in mitochondrial DNA, Mod. Phys. Lett. B 32(5), 1850057 (2018).
[8] A. Rovenchak & O. Rovenchak, Quantifying comprehensibility of Christmas and Easter addresses from the Ukrainian Greek Catholic Church hierarchs, Glottometrics 41, 57--66 (2018).