# Applications of algorithmic information theory

Curator: Paul M.B. Vitanyi

Algorithmic Information Theory, more frequently called Kolmogorov complexity, has a wide range of applications, many of them described in detail by Li and Vitanyi (2008). The range of applications is so vast that no selection can be comprehensive or even cover all important items. The applications can use the property that:

• most strings are effectively incompressible, or
• some strings can be effectively compressed, or
• some strings can be effectively compressed but take a lot of effort to do so, or
• different strings can be effectively compressed in various degrees, singly, jointly, or conditionally on one another; moreover, approximating the Kolmogorov complexity with real-world compressors still gives acceptable, or even very good, results.
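
The last of these properties, that real-world compressors still reflect differing degrees of compressibility, is easy to observe directly. A minimal sketch in Python using the standard zlib compressor (the sample strings are arbitrary illustrations):

```python
import os
import zlib

def compressed_len(data: bytes) -> int:
    """Length in bytes of `data` compressed at zlib's highest level."""
    return len(zlib.compress(data, 9))

n = 100_000
samples = {
    "random bits": os.urandom(n),          # effectively incompressible
    "repetitive": b"ab" * (n // 2),        # highly compressible
    "natural text": b"the quick brown fox jumps over the lazy dog. " * (n // 45),
}
for name, data in samples.items():
    print(f"{name:12s}: {len(data):6d} -> {compressed_len(data):6d} bytes")
```

The random sample stays essentially the same size (zlib even adds a few bytes of overhead), while the structured samples shrink dramatically, and by different amounts.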

## Application of Incompressibility

A new mathematical proof technique was developed, now known as the incompressibility method. It is based on the fact that most strings (the so-called Martin-Löf random strings, also called Kolmogorov random strings) cannot be effectively compressed. The method and applications are surveyed in detail in Li and Vitanyi (2008), Chapter 6.

The incompressibility of random objects yields a simple but powerful proof technique. The incompressibility method is a general-purpose tool and should be compared with the pigeon-hole principle or the probabilistic method. Whereas the older methods generally show the existence of an object with the required properties, the incompressibility argument shows that almost all objects have the required property. This follows immediately from the fact that the argument is typically used on a Kolmogorov random object. Since such objects are effectively indistinguishable, the proof holds for all such objects. Each class of objects has an abundance of objects that are Kolmogorov random relative to the class.

The incompressibility method has been successfully applied to solve open problems and simplify existing proofs. The method rests on a simple fact: a Kolmogorov random string cannot be compressed. Generally, a proof proceeds by showing that a certain property has to hold for some "typical" instance of a problem. Since "typical" instances are difficult to define and often impossible to construct, a classical proof usually involves all instances of a certain class.

By intention and definition, an individual Kolmogorov random object is a "typical" instance. These are the incompressible objects. Although individual objects cannot be proved to be incompressible in any given finite axiom system, a simple counting argument shows that almost all objects are incompressible. In a typical proof using the incompressibility method, one first chooses a Kolmogorov random object from the class under discussion. This object is incompressible. Then one proves that the desired property holds for this object. The argument invariably says that if the property does not hold, then the object can be compressed. This yields the required contradiction.

Because we are dealing with only one fixed object, the resulting proofs tend to be simple and natural. They are natural in that they supply rigorous analogues for our intuitive reasoning. In many cases a proof using the incompressibility method implies an average-case result since almost all strings are incompressible.

### List of selected application areas

The incompressibility method has been applied in a wide range of areas. It is always a matter of using regularity in an object, or algorithm, imposed by a property under investigation and quantified in an assumption to be contradicted, to compress the object's or algorithm's description to below its minimal value. The incompressibility method is the oldest and the most used application of algorithmic complexity.

### A simple example from number theory

In the nineteenth century, Chebyshev showed that the number of primes less than $$n$$ grows asymptotically like $$n/\log n\ .$$ Using the incompressibility method we cannot (yet) prove this statement precisely, but we can come remarkably close with a minimal amount of effort. We first prove, following G.J. Chaitin, that for infinitely many $$n\ ,$$ the number of primes less than or equal to $$n$$ is at least $$\log n/ \log \log n\ .$$ The proof method is as follows. For each $$n\ ,$$ we construct a description from which $$n$$ can be effectively retrieved. This description will involve the primes less than $$n\ .$$ For some $$n$$ this description must be long, which yields the desired result.

Assume that $$p_1 , p_2 , \ldots , p_m$$ is the list of all the primes less than $$n\ .$$ Then, $n = p_1^{{e}_1} p_2^{{e}_2} \cdots p_m^{{e}_m}$ can be reconstructed from the vector of the exponents. Each exponent is at most $$\log n$$ and can be represented by $$\log \log n$$ bits. The description of $$n$$ (given $$\log n$$) can be given in $$m \log \log n$$ bits.

It can be shown that each $$n$$ that is random (given $$\log n$$) cannot be described in fewer than $$\log n$$ bits, whence the result follows. Can we do better? This is slightly more complicated. The original idea is due to P. Berman, and improved by J.T. Tromp. Let $$l(x)$$ denote the length of the binary representation of $$x\ .$$ We shall show that for infinitely many $$n\ ,$$ the number of distinct primes less than $$n$$ is at least $$n/\log^2 n\ .$$

Firstly, we can describe any given integer $$n$$ by $$E(m)$$ concatenated with $$n/p_m\ ,$$ where $$E(m)$$ is a prefix-free encoding of $$m\ ,$$ and $$n/p_m$$ denotes the literal binary string representing the integer $$n/p_m\ .$$ Here $$p_m$$ is the largest prime dividing $$n\ .$$ For random $$n\ ,$$ the length of this description, $$l(E(m)) + \log n - \log p_m\ ,$$ must exceed $$\log n\ .$$ Therefore, $$\log p_m < l(E(m))\ .$$ It is known that the length of the prefix-free code $$l(E(m))$$ for the integer $$m$$ is $$< \log m + 2 \log \log m\ .$$ Hence, $$p_m < m \log^2 m\ .$$ Setting $$n_m := m \log^2 m\ ,$$ and since we know from our previous result that there are infinitely many primes, it follows that for the special sequence of values of $$n_1, n_2, \ldots$$ the number of primes less than $$n_m$$ exceeds $$n_m/ \log^2 n_m\ .$$
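
Both bounds are asymptotic claims about infinitely many $$n\ ,$$ but a quick numerical spot-check (not a proof) is easy with a sieve. This sketch uses base-2 logarithms, as in the text; the sampling step is arbitrary:

```python
import math
from bisect import bisect_right

def primes_upto(limit: int) -> list[int]:
    """Sieve of Eratosthenes: all primes <= limit."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0] = sieve[1] = 0
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, flag in enumerate(sieve) if flag]

N = 100_000
primes = primes_upto(N)

def pi(n: int) -> int:
    """pi(n): the number of primes <= n (n must be at most N)."""
    return bisect_right(primes, n)

# Spot-check both lower bounds from the text on a sample of n.
for n in range(4, N + 1, 997):
    lg = math.log2(n)
    assert pi(n) >= lg / math.log2(lg)   # at least log n / log log n primes
    assert pi(n) >= n / lg ** 2          # at least n / log^2 n primes
print("both bounds hold for all sampled n up to", N)
```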

### Example: Gödel's incompleteness result

Gödel proved that in any consistent, sufficiently powerful theory, there are true but unprovable statements. He constructed such a statement. Here we use the incompressibility argument to show in a very simple manner that there are, in fact, infinitely many such undecidable statements.

A formal system (consisting of definitions, axioms, rules of inference) is consistent if no statement that can be expressed in the system can be proved to be both true and false in the system. A formal system is sound if only true statements can be proved to be true in the system. (Hence, a sound formal system is consistent.) The idea below goes back to Ya. Barzdins and was popularized by G.J. Chaitin.

Let $$x$$ be a finite binary string. We write "$$x$$ is random" if the shortest binary description of $$x$$ has length at least that of the literal description of $$x$$ (in other words, the Kolmogorov complexity of $$x$$ is not less than its length). A simple counting argument shows that there are random $$x$$'s of each length, and that most strings are random in the sense that their Kolmogorov complexity is at least their length minus 2.

Fix any sound formal system $$F$$ in which we can express statements like "$$x$$ is random." We claim that for all but finitely many random strings $$x\ ,$$ the sentence "$$x$$ is random" is not provable in $$F\ .$$ Assume the contrary. Suppose $$F$$ can be described in $$f$$ bits (assume, for example, that this is the number of bits used in the exhaustive description of $$F$$ in the first chapter of the textbook "Foundations of $$F$$"). Then, given $$F\ ,$$ we can exhaustively search for a proof that some string of length $$n \gg f$$ is random, and print the first such string found. This is an $$x$$ satisfying the sentence "$$x$$ is random." This procedure to print $$x$$ of length $$n$$ uses only $$\log n + f$$ bits of data, which is much less than $$n\ .$$ But $$x$$ is random by the proof, which is true since $$F$$ is sound, and hence the shortest way to effectively describe $$x$$ requires at least $$n$$ bits. Hence, we have a contradiction.

This shows that although most strings are random, it is impossible to effectively prove them random. In a way, this explains why the incompressibility method is so successful. We can argue about a "typical" individual element, which is difficult or impossible by other methods.

### Average-case complexity of algorithms

The incompressibility method has been very successful in analyzing the difficult average-case complexity of algorithms. For example, the average-case time complexity of an algorithm is the average running time over all inputs of binary length $$n\ ,$$ for each $$n\ .$$ This involves analyzing the running time on all binary strings of length $$n\ ,$$ for each $$n\ .$$ With the incompressibility method, we only need to analyze the running time on a single incompressible (that is, ML-random) string of length $$n\ ,$$ because such a string is so typical that its running time is about equal to that of most other strings, and hence to the average time.

The method simply analyzes the algorithm with respect to this single string, by showing that the string is compressible if the algorithm does not satisfy a certain running time. This method has been used to analyze the average-case running time of many well-known sorting algorithms, including Heapsort and Shellsort. A successful example is an $$\Omega (p n^{1+1/p})$$ lower bound on the average-case running time (uniform distribution) of $$p$$-pass Shellsort sorting $$n$$ items. This is the first nontrivial general lower bound for average-case Shellsort in 40 years.
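
The bound itself requires the incompressibility argument, but its shape can be explored empirically by counting the element moves of a $$p$$-pass Shellsort on random permutations. A sketch, with an arbitrary geometric gap sequence (not necessarily the one analyzed in the proof):

```python
import random

def shellsort_moves(a: list, gaps: list[int]) -> int:
    """In-place Shellsort with the given gap sequence; returns the number
    of inner-loop element moves, the quantity the lower bound concerns."""
    moves = 0
    for gap in gaps:
        for i in range(gap, len(a)):
            x, j = a[i], i
            while j >= gap and a[j - gap] > x:
                a[j] = a[j - gap]
                j -= gap
                moves += 1
            a[j] = x
    return moves

random.seed(42)
n, p = 4096, 3
# p roughly geometric gaps: about n^(2/3), n^(1/3), 1 for p = 3
gaps = [max(1, round(n ** ((p - i) / p))) for i in range(1, p + 1)]
trials = 10
avg = sum(shellsort_moves(random.sample(range(n), n), gaps)
          for _ in range(trials)) / trials
print(f"gaps {gaps}: average moves over {trials} random permutations "
      f"= {avg:.0f}; p*n^(1+1/p) = {p * n ** (1 + 1 / p):.0f}")
```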

## Applications of Compressibility

Traditional wisdom has it that the better a theory compresses the learning data concerning some phenomenon under investigation, the better we learn and generalize, and the better the theory predicts unknown data. This belief is vindicated in practice, but before the advent of Kolmogorov complexity it had not been rigorously proved in a general setting. The material on applications of compressibility is covered in Li and Vitanyi (2008), Chapter 5. Ray Solomonoff invented the notion of universal prediction using the Kolmogorov complexity based universal distribution; see the section on Algorithmic Probability in Algorithmic Information Theory. The discrete version of the universal probability, denoted $$m\ ,$$ has miraculous applications:

• For every algorithm whatsoever, the average-case computation complexity resource (e.g. running time, storage) is the same order of magnitude as the worst-case computation complexity resource (running time, storage, respectively), the average taken with inputs distributed according to the universal distribution $$m\ .$$
• In PAC-learning, using $$m$$ as the distribution from which to draw labeled examples in the learning phase, considering families of discrete concept classes with computable distributions, the learning power increases: Some concept classes that were NP-hard to learn now become polynomially learnable. Similarly, PAC learning of continuous concept classes from families over computable measures is improved by using $$M\ ,$$ the continuous version of $$m\ ,$$ in the learning phase. Some continuous concept classes that are non-computable in the general setting become computable now. That is, in both cases we draw in the learning phase labeled examples according to the universal distribution, and we PAC-learn according to the real computable distribution. This model is akin to using a teacher in the learning phase who gives the simple examples first.
• A formally related quantity is the probability that $$U$$ halts when provided with fair coin flips on the input tape (i.e., that a random computer program will eventually halt). This halting probability, $$\Omega = \sum_x m(x)\ ,$$ also known as Chaitin's constant or "the number of wisdom", has numerous remarkable mathematical properties, can be used, for instance, to quantify Gödel's Incompleteness Theorem, and is a natural example of a Martin-Löf random sequence.
• The continuous version of $$m\ ,$$ denoted above by $$M\ ,$$ leads to excellent predictions and decisions in general stochastic environments. A theory of learning by experiments in a reactive environment has been developed by combining universal distribution predictions with sequential decision theory; this results in an optimal reinforcement learning agent embedded in an arbitrary unknown environment (Hutter 2005), and in a formal definition and test of intelligence.

Universal prediction is related to optimal effective compression. The latter is almost always a best strategy in hypothesis identification (the minimum description length (MDL) principle). While most strings are incompressible, they represent data in which there is no meaningful law or regularity to learn; it is precisely the compressible strings that represent data from which we can learn meaningful laws.

As perhaps the last mathematical innovation of an extraordinary scientific career, Kolmogorov in 1973 proposed to found statistical theory on finite combinatorial principles independent of probabilistic assumptions. Technically, the new statistics is expressed in terms of Kolmogorov complexity. The relation between the individual data and its explanation (model) is expressed by Kolmogorov's Structure function. This entails a non-probabilistic approach to statistics and model selection. Let data be finite binary strings and models be finite sets of binary strings. Consider model classes consisting of models of given maximal (Kolmogorov) complexity. The Structure function of the given data expresses the relation between the complexity-level constraint on a model class and the least log-cardinality of a model in the class containing the data.

Recently it has been shown by Vereshchagin and Vitanyi that the structure function determines all stochastic properties of the data: for every constrained model class it determines the individual best-fitting model in the class, irrespective of whether the "true" model is in the model class considered or not. This approach led to a new foundation for the celebrated Minimum Description Length (MDL) principle for statistical modeling or statistical inference, Rissanen (2007). Essentially, for given data, one analysis by Vitanyi and Li tells which models obtained by the MDL principle are the right ones, the best-fitting ones, while the analysis using the structure function tells how to obtain them. 
In this setting, this happens with certainty, rather than with high probability as in the classical case. Related ideas have been applied by Vereshchagin and Vitanyi to rate-distortion theory and lossy compression of individual data. Cognitive psychology has a long tradition of applying formal models of simplicity and complexity. The work of E.L.J. Leeuwenberg even predates the advent of Kolmogorov complexity. Not surprisingly, this field has a large and significant literature on applications of Kolmogorov complexity, for example in the circles around N. Chater and P.A. van der Helm.
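
The two-part flavor of MDL, choosing the model that minimizes model cost plus data cost given the model, can be illustrated in a few lines. A toy sketch with Bernoulli models on a finite-precision grid (the model class, precision, and data are arbitrary illustrative choices):

```python
import math

def two_part_cost(data: str, p: float, model_bits: int) -> float:
    """Two-part code length in bits: model cost plus the Shannon code
    length -log2 P(data | p) under a Bernoulli(p) model."""
    ones = data.count("1")
    zeros = len(data) - ones
    return model_bits - ones * math.log2(p) - zeros * math.log2(1 - p)

def mdl_select(data: str, precision_bits: int = 8) -> float:
    """Choose the Bernoulli parameter on a 2^precision_bits grid that
    minimizes the total two-part description length of the data."""
    grid = (k / 2 ** precision_bits for k in range(1, 2 ** precision_bits))
    return min(grid, key=lambda p: two_part_cost(data, p, precision_bits))

data = "1" * 700 + "0" * 300        # toy data: 70% ones
best = mdl_select(data)
print(f"selected p = {best:.4f}; two-part length = "
      f"{two_part_cost(data, best, 8):.1f} bits vs {len(data)} bits literal")
```

The selected parameter lands on the grid point nearest the empirical frequency, and the total description length falls well below the literal length of the data, which is the compression that MDL exploits.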

## Applications of Compressibility Requiring a Great Effort

Some strings can be compressed but take a great amount of effort, in time or space, to do so. Applications of compressibility requiring a great effort are covered in Li and Vitanyi (2008), Chapter 7. Such applications concern

• Structural Complexity Theory about relations between resource-bounded computational classes,
• Language compression and coding,
• Probabilistic computational classes and oracle classes,
• Universal optimal search,
• Logical Depth.

There are further potential applications in this setting.

## Applications of Differences in Compressibility by Real Compressors

Using the compressibility of nonrandom strings, a new theory of information distance and normalized information distance has been introduced, initially using Kolmogorov complexity and applied in practice through approximation by real-life compressors like "gzip", "bzip2", and "PPMZ". Such applications are covered in Li and Vitanyi (2008), Chapter 8. In classical physics, we know how to measure the physical distance between a pair of objects. In the information age, it is important to measure the "information distance" between two objects: two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, or two genomes. Such a measurement should not be application dependent or arbitrary. The universal similarity metric is probably the greatest practical success of Algorithmic Information Theory. A reasonable definition of the similarity between two objects is how difficult it is to transform them into each other. Formally, one can define the similarity between strings $$x$$ and $$y$$ as the length of the shortest program that computes $$x$$ from $$y$$ and vice versa. This was proven to be equal to $\max \{K(x|y),K(y|x)\}$ up to logarithmic additive terms, which can be ignored. These distances are absolute, but if we want to express similarity, then we are more interested in relative ones. For example, if two strings of length 1,000,000 differ by 1000 bits, then we are inclined to think that those strings are relatively more similar than two strings of 1000 bits at that distance. Hence we need to normalize to obtain a universal similarity metric, and to apply it in practice, e.g. for evolutionary trees of species, we can approximate the Kolmogorov complexity by real-world compressors. This bold step was taken first for the slightly different sum distance $$K(x|y)+K(y|x)\ ,$$ originally derived with a thermodynamic interpretation in mind. 
Many applications including chain letter phylogeny, plagiarism detection, protein sequence/structure classification, and phylogenetic reconstruction have followed. In particular, researchers from the data-mining community noticed that this methodology is in fact a parameter-free, feature-free data-mining tool. They have experimentally tested a closely related metric on a large variety of sequence benchmarks. Comparing the compression method with 51 major methods found in 7 major data-mining conferences over the past decade, they established clear superiority of the compression method for clustering heterogeneous data and for anomaly detection, and competitiveness in clustering domain data. It has been shown that the normalized information distance (NID), $NID(x,y) = \frac{ \max\{K{(x|y)},K{(y|x)}\} }{ \max \{K(x),K(y)\}},$ where $$K(x|y)$$ is the algorithmic information of $$x$$ conditioned on $$y\ ,$$ is a similarity metric: the function $$NID(x,y)$$ has been shown to satisfy the basic requirements of a metric distance measure. The universality conjecture was proved recently: the NID, as well as the normalized sum distance proposed earlier, is universal in that it minimizes (is at least as small as) every computable distance measure satisfying a natural density requirement. While this metric is not computable, it has an abundance of applications. Simply approximating $$K$$ by real-world compressors, where $$C(x)$$ is the binary length of the file $$x$$ compressed with compressor $$C$$ (for example "gzip", "bzip2", or "PPMZ"), we can rewrite the NID, in order to make it easy to apply, and obtain the normalized compression distance (NCD) $NCD(x,y) = \frac{C(xy) - \min \{C(x),C(y)\}}{\max \{C(x),C(y)\}}.$ NCD is actually a family of distances parametrized by the compressor $$C\ .$$ The better $$C$$ is, the closer the NCD approaches the NID, and the better the results are. 
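
Because any real compressor yields an instance of the NCD, the formula above is a few lines of code. A sketch using Python's zlib in place of the stronger compressors mentioned in the text (the sample strings are arbitrary):

```python
import zlib

def C(x: bytes) -> int:
    """Approximate Kolmogorov complexity by zlib's compressed length."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for very similar objects,
    around 1 for unrelated ones."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog. " * 20
b = b"the quick brown fox leaps over the lazy cat. " * 20
c = bytes(range(256)) * 4            # an unrelated byte pattern
print(f"NCD(a, b) = {ncd(a, b):.3f}  (similar texts)")
print(f"NCD(a, c) = {ncd(a, c):.3f}  (unrelated data)")
```

A smaller NCD indicates greater similarity; with a weak compressor the values are rough, but the ordering of related versus unrelated pairs is typically preserved.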
The normalized compression distance has been used to fully automatically reconstruct language and phylogenetic trees as above. It can also be used for new applications of general clustering and classification of natural data in arbitrary domains, for clustering of heterogeneous data, and for anomaly detection across domains. The NID and NCD have been further applied to authorship attribution, stemmatology, music classification, internet knowledge discovery, to analyze network traffic and cluster computer worms and viruses, software metrics and obfuscation, web page authorship, topic and domain identification, hurricane risk assessment, ortholog detection, and clustering fetal heart rate tracings. Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like "the four-letter genome of a mouse," or "the text of `War and Peace' by Tolstoy." There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in background common knowledge in humankind, like "home" or "red." Using code-word lengths obtained from the page-hit counts returned by Google from the web, we obtain a semantic distance using the NCD formula and viewing Google as a compressor useful for data mining, text comprehension, classification, and translation. The associated NCD, now called the normalized Google distance (NGD) can be rewritten as: $NGD(x,y)= \frac{ \max \{\log f(x), \log f(y)\} - \log f(x,y) }{ \log N - \min\{\log f(x), \log f(y) \}},$ where $$f(x)$$ denotes the number of pages containing $$x\ ,$$ and $$f(x,y)$$ denotes the number of pages containing both $$x$$ and $$y\ ,$$ as reported by Google. The number $$N$$ can be set to the number of pages indexed by Google. 
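
Live search-engine counts change constantly, so a reproducible sketch of the NGD has to use placeholder counts; the formula itself is direct to implement, and the base of the logarithm cancels in the ratio. All frequencies below are invented for illustration, not real query results:

```python
import math

def ngd(f_x: float, f_y: float, f_xy: float, N: float) -> float:
    """Normalized Google distance computed from page counts f(x), f(y),
    f(x, y) and index size N (all hypothetical placeholders here)."""
    lx, ly, lxy = math.log2(f_x), math.log2(f_y), math.log2(f_xy)
    return (max(lx, ly) - lxy) / (math.log2(N) - min(lx, ly))

N = 8e9                                     # hypothetical index size
related = ngd(4.7e7, 1.2e7, 2.6e6, N)       # terms that often co-occur
unrelated = ngd(4.7e7, 3.1e8, 1.1e6, N)     # terms that rarely co-occur
print(f"related terms:   NGD = {related:.3f}")
print(f"unrelated terms: NGD = {unrelated:.3f}")
```

Terms that co-occur far more often than independence would predict come out with a small NGD; terms that rarely co-occur come out near (or above) 1.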
CompLearn, a publicly available open-source downloadable software tool for both NCD and NGD, written by Rudi Cilibrasi, can be found at http://www.complearn.org. An on-line demo of the NGD is at http://clo.complearn.org/: given a list of 4-25 words (or compound words containing spaces), it uses the Google search engine page counts to compute the NGD for each pair of words, and then applies a new quartet method to find the best tree.

Stimulated by this work, a competitive approach based on compression has been developed for Neyman-Pearson hypothesis testing (null hypothesis versus alternative hypothesis), for tests of randomness of strings generated by random number generators, and for lossy compression and denoising via compression.