# Algorithmic Information Dynamics

Post-publication activity

Curator: Hector Zenil

Algorithmic Information Dynamics (AID) is an algorithmic probabilistic framework for causal discovery and causal analysis. It enables a numerical solution to inverse problems based or motivated on principles of algorithmic probability. AID studies dynamical systems in software space where all possible computable models can be found or approximated under the assumption that discrete longitudinal data such as particle orbits in state and phase space can approximate continuous systems by Turing-computable means. AID combines perturbation analysis and algorithmic information theory to guide a search for sets of models compatible with observations and to precompute and exploit those models as testable generative mechanisms and causal first principles underlying data and systems. AID is an alternative or a complement to other approaches and methods of experimental inference, such as statistical machine learning and classical information theory.

AID connects with and across other parallel fields of active research such as logical inference, causal reasoning, and neuro-symbolic computation. AID studies how candidate discrete computable equations as generating mechanisms are affected by changes in observed phenomena over time as a result of a system evolving (e.g. under the influence of noise) or being externally perturbed.

AID is related to other areas such as computational mechanics and program synthesis. However, unlike graphical methods such as Bayesian networks, AID does not rely on graphical representations or (often inaccessible) empirical estimations of mass probability distributions. AID encompasses the foundations and methods that make the area of algorithmic information and algorithmic complexity more relevant to scientific discovery and causal analysis.

Figure 1: The basic principles of Algorithmic Information Dynamics consist in finding computable models or sequences of (short) computable models able to explain a piece of data or an observation and study the effect that interventions have on such models in the software space. Fully deterministic systems subject to no noise or no external influence will find an invariant set of candidate models with description lengths varying by only a logarithmic or sublogarithmic term; any deviation will suggest noise or external influence.

## Foundations and Motivation

Algorithmic Probability

Algorithmic probability in its modern formulation was introduced by Solomonoff and Levin, and by Kolmogorov and Chaitin through their work on algorithmic complexity (Chaitin, 1987; Calude, 2002; Downey & Hirschfeldt, 2010; Li & Vitányi, 2019). Algorithmic Probability considers the probability of a (discrete) object being produced by a computable mechanistic set of rules capable of Turing universality. It imposes a mapping distribution called the universal distribution between input and output. Formally, a computable process that produces a string $$s$$ is a program $$p$$ that when executed on a universal Turing machine $$U$$ produces the string $$s$$ as output and halts.

As $$p$$ is itself a binary string, we can define the discrete universal a priori probability $$m(s)$$ as the probability that the output of an arbitrary binary input of a universal prefix-free Turing machine $$U$$ is $$s$$ when this input is provided with fair coin flips on the input tape. Formally,

$m(s) \;:=\; \sum_{p\;:\;U(p)=s} 2^{-\ell(p)},$

where the sum is over all halting programs $$p$$ for which $$U$$ outputs the string $$s$$. As $$U$$ is a prefix-free universal Turing machine, the set of valid programs forms a prefix-free set (or self-delimited programming language), and thus the sum is bounded, given Kraft's inequality (i.e., it defines a probability semi-measure).

The methods underpinning AID can be described as a combination of Bayes' theorem and computability theory, where the default agnostic prior distribution is the universal distribution instead of some other agnostic distribution such as the uniform distribution. In this way, any updates to the specific distribution of the problem proceed according to algorithmic probability.

Introduced by Ray Solomonoff, algorithmic probability, or the theory of universal inductive inference as he also called it, is a theory of prediction based on observations that assumes that each individual observed object is generated by arbitrary computable processes. An example is the prediction of the next digit in the sequence $$s=1234567...$$. According to algorithmic probability, the next digit would be 8; if the simplest model (i.e., the shortest self-delimiting program) able to generate that sequence is the successor function $$x_0=1;x_i:=x_{i-1}+1$$, such a model would generate 8 as a predictor of the next digit given the data. The only assumption is that the prior probability follows a computable probability distribution even if such a distribution is unknown because as proven by Levin, all computable distributions converge in what is known as the universal distribution. As one of the pillars of algorithmic information theory (AIT), the universal distribution that is an offshoot of the algorithmic coding theorem conjoins algorithmic complexity, universal a priori probability, and a universal/maximal computably enumerable discrete semi-measure into a single precise and ubiquitous mathematical formalization of a necessary "bias toward simplicity" for spaces of computably generated objects. This is a result that has been called 'miraculous' in the scientific literature and been lauded by the late Marvin Minsky as the most important scientific theory of relevance to AI (Zenil, 2020; Zenil et al., 2019).

Algorithmic probability captures (albeit without actually setting out to do so) longstanding principles on which science has been founded: the principle (also known as Occam's razor) that the simplest explanation (in this case the most algorithmically probable which turns out to be the shortest algorithmic description) is most likely the correct one; and the principle of multiple explanations (Epicurus), which mandates the retention of all explanations consistent with the data; and Bayes's Rule, requiring the transformation of the a priori distribution into a posterior distribution according to the evidence, to keep all hypotheses consistent with the data (Zenil, 2020).

AID was conceived and introduced in the early 2010s and is currently an active area of research, but the methods enabling AID were introduced in the mid-2000s with the first publication of a calculation and systematic study of output probability distributions of different types of models of computation appeared in (Delahaye & Zenil, 2007). While it was not known--no experimental verification being then available--whether the concept of algorithmic probability would empirically capture the intuition of Occam's razor other than as established in the theory, Zenil and colleagues studied the behavior of these probability distributions in an exhaustive fashion (Delahaye & Zenil, 2012; Soler-Toscano et al., 2014). Their results suggested that the output distributions were more compatible than theoretically expected, and furthermore, met both the theoretical and empirical expectations entailed in Occam's razor, as the elements assigned the highest probability were also found to be the most algorithmically simple according to various complementary order parameters, including computable ones, which converge in value and thus provide further confirmation, while the least frequent elements were also more random by all measures (Zenil, 2020; Zenil, 2020a).

The numerical application of AID beyond its theoretical formulation relies strongly upon numerical methods designed to encompass and expand classical information theory to characterize randomness, using measures other than popular compression algorithms, such as Lempel-Ziv, widely used to produce purported approximations to algorithmic complexity (Zenil, 2020). These methods on which AID relies upon (but is also independent of) are the so-called coding theorem (CTM) and block decomposition (BDM) methods that have as their main features (1) that can go beyond entropic and statistical compression approaches by providing the means to explore and find computable models allowing causal discovery (and as opposed to e.g. obfuscated compressed files with no state correspondence to a model or evolving dynamical system), and (2) are sensitive enough to allow (small) perturbation causal analysis (Zenil et al., 2019; Zenil et al., 2018; Zenil et al., 2019b).

## The Coding Theorem Method (CTM) and Causal Discovery

Most attempts to approximate algorithmic complexity have proceeded by way of popular lossless compression algorithms such as LZ or LZW. However, if we take the example of the sequence $$s=1234567...$$, an algorithm such as LZ or LZW will fail at compressing $$s$$ despite its obvious compressed form ($$x_0=1;x_i:=x_{i-1}+1$$). This is because algorithms such as LZ and LZW are entropy estimators and they build a dictionary of most frequent words to assign them shorter codes. For example, if $$s$$ is a computable Borel normal number (Calude, 2002), no contiguous subsequence is approximately represented more frequently than any other of the same length, and the resulting compressed file tends to be of about the same size as $$s$$ itself (modulo change of data type transformation, which is only a transliteration from, e.g., ASCII to binary) (Zenil et al., 2017a). Because of these limitations of popular lossless compression algorithms, and for other reasons, Zenil et al. introduced a method based on the so-called algorithmic coding theorem. Formulated by L.A. Levin (Levin, 1974), the algorithmic coding theorem in the context of algorithmic probability establishes equality between universal a priori probability $$m(s)$$ and algorithmic complexity $$K(s)$$. Formally:

$m(s) = 2^{-K(s)} + c$

or equivalently,

$- \log \; m(s) \;=\; K(s)+ c ,$ where $$c$$ is a constant.

Based on this fundamental theorem, and under the assumption of optimality of the reference Turing machine, the coding theorem Method (CTM) is an alternative to statistical compression algorithms such as Lempel-Ziv, widely used to approximate algorithmic complexity. CTM does not rely upon statistical methods such as those based on the dictionaries on which popular lossless compression algorithms such as LZW are based (being designed to find statistical regularities and thus more closely related to classical information theory than to algorithmic complexity) (Delahaye & Zenil, 2012; Soler-Toscano et al., 2014a; Soler-Toscano et al., 2013; Zenil et al., 2015). The aim of CTM is to embrace Turing universality and explore the space of computer programs able to capture properties beyond statistical patterns (as a consequence of which has been overlooked in statistical approaches). Then, the method pumps it into the generative models that support a natural or artificial phenomenon, following the usual scientific modus operandi--whereas in statistical and probabilistic approaches the part of the probabilistic content of an object or state that belongs to the model itself and the part that belongs to the explanation of the possible explanatory models have traditionally been conflated.

For example, when modeling the outcome of throwing a dice, what a statistical model ends up quantifying is an uncertainty extrinsic to the process itself, the degree of uncertainty of an observer. Thus, it quantifies a property of the observer and not the process undergone by the dice itself. This is because the process of throwing the dice is deterministic according to the laws of classical mechanics, which are known to govern the dice trajectory (though quantum fluctuations may be believed to be probabilistic, at short distances they wouldn't have any effect). The probabilistic content of the model describing the dice outcome is thus extrinsic to the actual process. What an algorithmic-probability-like approach like CTM would do in principle--and this is its major difference--is to produce a set of deterministic models describing the dice trajectory and outcome without recourse to probability. Consequentially, it is in the distribution of possible computable models explaining the dice that a probability emerges; it is not assumed by the individual models themselves. In turn, not only can these models be tested beyond their outcome predictions as mechanistic (step-by-step) descriptions of the process, but they also offer a non-black-box approach where each model can be followed step-by-step, with model states corresponding to constructive (e.g. physical) states, as opposed to random variables with no state-to-state correspondence between the model and phenomenological data.

Unlike popular lossless compression algorithms that are guaranteed not to be able to characterize objects such as $$s$$, CTM can in principle do so, because when running all possible computer programs up to the size of $$s$$ in bits there is a non-zero probability that CTM will find such a program if it exists. Indeed, this is guaranteed, as the worst case is when the program in question has the form of $$print[s]$$ and $$s$$ is algorithmically random (i.e., $$s$$ is incompressible or with an algorithmic complexity value equal to the length of $$s$$, up to a constant). This holds because of the ergodicity of the algorithmic complexity approximation by the CTM over the software space, if enough computational resources are expended, which in turn is enabled by the lower semi-computability of the universal a priori probability (Zenil et al., 2018; Delahaye & Zenil, 2012).

## The Block Decomposition Method (BDM)

One way to see BDM is as a weighted version of Shannon's entropy that introduces algorithmic randomness into the original formulation of classical information, a version that is able to help tell apart statistical randomness and algorithmic randomness (Zenil et al., 2019; Zenil et al., 2018; Zenil, 2020). To illustrate this let's again consider the sequence $$s = 1234567...$$. While the digits of $$s$$ are distributed with maximal entropy because it can be seen as the digital expansion of the so-called Champernowne constant, proven to be Borel normal (because it does not show any statistical patterns in the sense of overrepresentation of any digit subsequence, no subsequence of any fixed length is repeated more times than any other subsequence of the same length), when considering the set of all possible sequences, some of which do show statistical patterns, the appearance of each digit in $$s$$ would be characterized as unexpected, where we may otherwise have been inclined to say that the most salient feature of $$s$$ was, in reality, its non-random nature, as it can be generated by a simple deterministic/computable generative process (Zenil et al., 2017a).

This is a key methodological distinction because (1) the sequence would be characterized as maximally 'disordered' by statistical approaches such as Shannon Entropy, unless upon its application one already had the information that $$s$$ had been generated by a deterministic process, which renders the use of the measure redundant (in science this is the rule rather than the exception when observing data: one investigates the nature of an object when the nature of that object is as yet unknown); and (2) it is of high empirical value in application to science.

For example, if we set out to quantify human memory, an experimental subject would not need to learn $$s$$ digit by digit in order to generate it, illustrating the algorithmic nature of human cognitive processes, that go beyond statistical patterns. The sequence $$s$$ may look very special, but in fact, most sequences are of this type. In addition to not having any statistical regularity, most sequences--if we consider the set of all possible sequences--do not even have any short description (low algorithmic randomness). That is, most sequences are algorithmically random. Moreover, the difference between those that do not have a short algorithmic description versus those that appear statistically random but have a short description is divergent. We only chose $$s$$ because it was the most obvious for purposes of illustration (another example would be the digits of a mathematical constant such as $$\pi$$). Indeed, CTM and BDM have found many applications in psychometrics and cognition due to this advantage (see Applications of AID).

What BDM does is to extend the power of CTM to quantify algorithmic randomness by implementing a divide-and-conquer approach whereby the data is decomposed into pieces small enough that an exhaustive computational search finds the set of all computable models able to generate each piece of data (Zenil et al., 2019; Zenil et al., 2018; Zenil, 2020). The sequence of small generators then supports the larger piece of data from which insight is gained as to their algorithmic properties. Small pieces of data supported by even smaller programs than those pieces are called causal patches in the context of AID, as they are the result of a causal relationship between the set of these computable models and the observed data. On the other hand, models of about the same length as the data are not causally explained by a shorter generating mechanism and are therefore considered (algorithmically) random and not causally supported (Zenil, 2020a).

A sequence of computer programs smaller than their matched patches constitutes a sufficient test for non-randomness and is therefore informative of its causal content, as its components can be explained by underlying small computable models shorter than the original whole data itself. This way, the stronger the coarse-graining of the original data (i.e., the weaker the decomposition) the closer the value resulting from the BDM is to the theoretical optimal value that corresponds to the whole data's algorithmic complexity. Hence, the more fine-grained (i.e., the stronger the decomposition), the more the BDM value can diverge from the whole data's algorithmic complexity. This is because the BDM value and the algorithmic complexity are proven to converge (up to a constant that only depends on the error of the CTM relative to the algorithmic complexity) as the data partition size tends to the whole data size (Zenil et al., 2018).

## Algorithmic Intervention Analysis

Unlike other approaches, such as Pearl's do-calculus, algorithmic information dynamics can help in the initial process of causal discovery and is able to perform fully unsupervised hypothesis generation from causal computable models. It also provides the tools to explore the algorithmic effects that perturbations to systems and data may have on their underlying computable models, effectively also providing a framework for causal analysis similar to the do-calculus, but without recourse to traditional probability distributions. AID can substitute for or complement other approaches to causal analysis. For example, the methods for causal analysis developed by Judea Pearl et al. assume the existence of an educated causal model but cannot provide the means for primary causal discovery.

Figure 2: Causal intervention matching with AID: Central to AID is the study of the effect of interventions at the data level to the set of computable candidate models, their changes in program space being an indication both of the nature of the original data and the nature of the perturbation.

One can formulate a cause-effect question in the language of the do-calculus as a probability between a random variable and an intervention $$P(L|do(D))$$, and one can also do so in the context of AID. In one of the typical examples used by Pearl himself, if $$L$$ represents the human lifespan and $$do(D)$$ the use of some drug $$D$$, the probability $$P$$ of $$D$$ having an effect on $$L$$ is quantified by classical probability to calculate $$P$$. What AID does is to substitute $$AP$$ for $$P$$, the algorithmic probability that $$D$$ exerts an effect on $$L$$, with the chief advantage that not only does one obtain a probability for the dependency (and direction) between $$D$$ and $$L$$, but also a set of candidate models explaining such a connection, with no classical probability involved in the inference or description of each individual model. AP would then impose a natural non-uniform algorithmic probability distribution over the space of candidate models connecting $$D$$ and $$L$$ based on estimations of the universal distribution, which in effect introduces a simplicity bias that favors shorter models as opposed to algorithmically random ones, all of which have already been found to explain the causal connection between $$D$$ and $$L$$. AID thus removes the need for classical probability distributions, and more importantly produces a set of generative models no longer derived from traditional statistics (e.g. regression, correlation), which even the do-calculus uses to determine the probability of a dependency, thereby falling back on the methods the do-calculus set out to circumvent.

While perturbation analysis is a major improvement in the area of causal discovery, AID complements it, offering a path to leave the use of classical probability descriptions out of the models taking a step further towards independence from other causal confounding statistical methods that introduce probability in the model description obfuscating first principles and preventing generative models. Another key difference between the original do-calculus and the algorithmic probability calculus is that the do-calculus makes a distinction between $$P(L|do(D))$$ and $$P(L|D)$$, but in AID both hypotheses $$AP(L|D)$$ and $$AP(D|L)$$ can be tested independently, as they have different meanings under algorithmic probability. $$AP(L|D)$$ means that $$L$$ is the generative mechanism of $$D$$ and $$AP(D|L)$$ means that $$D$$ is the generative mechanism of $$L$$. Each of them would trigger a different explorative process under AID. The former would look for the set of smaller to larger computable models denoted by $${L}$$ that can explain $$D$$, while the latter would look for all the generative models $$\lbrace D \rbrace$$ that generate $$L$$. Intuitively, this means that, for example, an attempt to explain how a barometer falling ($$X$$) could be the cause for a storm would likely not yield many shorter generative mechanisms to causally explain the occurrence of the storm ($$Y$$), whereas an attempt to do the reverse would, indicating that barometric fall is effect rather than cause. Clearly, regardless of the result, $$AP(X|Y)$$ and $$AP(Y|X)$$ are not equivalent.

In the language of AID, conforming more to a typical notation in graph and set theory, the first example would be written as $$AP(L \backslash D$$) or $$C(L \backslash D)$$, where $$C$$ is the algorithmic complexity of $$L$$ with intervention $$D$$ (in the context of AID this is usually a deletion to a model that includes D), and the second example as $$AP(X \backslash Y)$$ or $$C(X \backslash Y)$$. Alternatively, $$AP(L \backslash \lbrace D \rbrace )$$ would be $$L$$ with a set of interventions $$\lbrace D$$$$\rbrace$$. Multivariate causal analysis can be achieved by conditional versions of algorithmic complexity and is currently an area of active research.

In Pearl's account of causal analysis, the so-called causal ladder leading up to human-grade reasoning consists of 3 levels, with the first being that of pattern observation covered by traditional statistics and long based on regression and correlation, the second being interventions such as the one his do-calculus suggests and that AID allows, and the 3rd level is that of counterfactuals or the power to imagine what would happen if conditions were different. AID also provides the most fundamental step in the causal analysis, which is causal discovery, within a single framework and without the need of resorting to making recourse to different approaches for different purposes (discovery vs analysis). While the do-calculus comes with no initial guidelines for how to come up with a first testable model, AID can generate a set of computable models ranked by likelihood and offers a path towards the full replacement of regression and correlation in the description of a model, taking the statistical nature out of the causal description. In addition, AID can also potentially cover all three levels of causality in Pearl's ladder, and provides the methodological framework to address each without the need of classical probability, regression, or correlation.

Both interventions and counterfactuals in AID are covered by the underlying simulation of all possible computable hypotheses (hence all 'if' questions) up to the length of the original observation, and methods such as CTM and BDM allow the estimation of processes at all these 3 levels by decomposing the original problem into smaller problems, with the additional advantage that AID ranks the counterfactuals naturally by likely algorithmic probability based on the universal distribution as a prior (Zenil et al., 2019).

## Information Difference and Algorithmic Information Calculus

With its ability to quantify the departure of models away from or toward algorithmic randomness by perturbing a piece of data or a system, AID enables the investigation of which elements of the data contribute the most to the information necessary for the underlying causal computable model (the positive information elements) and which elements would trigger an increase in the system's algorithmic complexity (the negative information elements) (Zenil et al., 2019; Zenil et al., 2019b). This way, the "algorithmic randomness control" performed by the algorithmic intervention analysis puts forward new general methods for studying perturbation effects on non-linear systems, beyond those derived from classical control theory, and without making strong a priori assumptions of linearity. For example, AID has shown not only how to reconstruct the space-time evolution of Elementary Cellular Automata by gathering disordered states, but also correlated, as expected, positive information genes in E-Coli with homeostatic processes, and more negative information genes with specialization processes in the cellular network (Zenil et al., 2019).

Let $$|S|$$ denote the whole size of the object $$S$$, and not only the number of constitutive elements. Let $$N$$ denote the total number of constitutive elements of $$S$$. (Therefore $$N$$ may be smaller than $$| S |$$ , but $$|S| = \mathbf{O}( f_c(N) )$$, where $$f_c$$ is a total computable function). For example, in the case of networks, $$|S|$$ grows as the number of all possible edges of a graph (in $$\mathbf{O}( N^2 )$$) and $$N$$ is the total number of vertices.

Let $$F \neq \empty$$ be a subset of the set of elements of the object $$S$$ to be deleted. In order to grasp such a formal measure of the algorithmic-informational contribution of each element, one may look into the difference in algorithmic complexity between the original data $$S$$ and the destructively perturbed data $$S \setminus F$$. We define the information difference (Zenil et al., 2019; Zenil et al., 2020b) between $$S$$ and $$S \setminus F$$ as

$I(S,F) = K(S) - K(S \setminus F).$

Note that the same holds analogously for constructive perturbation, i.e., the insertion of elements. When $$F = \left\{ e \right\}$$ with $$I(S,F)=I(S,e)$$ and this difference assumes an absolute value

$| I(S,e) | = | K(S) - K(S \setminus e) | \leq \log(N) + \mathbf{O}(1),$

we say $$e$$ is a neutral information element, since the algorithmic information contribution of the insertion of $$e$$ back into $$S$$ (or its deletion from $$S$$) would necessarily be of the same order as a computable reversible procedure. Furthermore, we say $$e$$ is a positive information element, if it is not neutral and $$I(S,e) > 0$$; and a negative information subset of elements, if it is not neutral and $$I(S,e) < 0$$. Thus, negative content is immediately associated with information gain from perturbations, and positive content with information loss.

In the context of causal analysis, when considering a stochastically random bit-flip being performed on a string pattern, such as $$0000000...$$ repeated $$n$$ times, such a perturbation will have a significant effect with respect to the length of the original shortest program $$p$$ consisting of a loop printing $$n$$ number of $$0$$s, because the new program $$p'$$ would need to take into account the $$1$$ inserted at a stochastically random position $$x \leq n$$. If, however, the string is algorithmically random, the effect of any stochastically random single-bit perturbation will have no effect (up to a logarithmic term of the string's algorithmic complexity, which is already maximal), because every bit is already necessary information for the string's own description. Not only for strings, but also for graphs and other multidimensional data structures, (stochastically) random single perturbations have similar worst-case effects in both simple and algorithmically random objects amounting to up to about $$\log( | S | )$$. Thus, in the case of objects for which $$|S| \leq N^C$$, where the constant $$C \geq 2$$ that does not depend on $$N$$ (e.g. if the object is a graph), the information difference impact of moving $$S$$ toward or away from algorithmic randomness may be:

1. of the same order (i.e., $$I(S,e) = \mathbf{\Theta}( \log(N) )$$) as the object's algorithmic complexity $$K(S)$$, if, e.g., $$S$$ is computable and therefore logarithmically compressible in the form $$K(S) \leq \log(N) + \mathbf{O}( \log( \log(N) ) ) + \mathbf{O}(1)$$;
2. or strongly asymptotically dominated by the object's algorithmic complexity $$K(S) \geq |S| - \mathbf{O}(1)$$, as $$I(S,e) = \mathbf{O}( \log( |S| ) ) = \mathbf{o}\left( |S| \right)$$ , if, e.g., $$S$$ is algorithmically random and therefore incompressible (up to a constant);

In other words, at the asymptotic limit when $$N \to \infty$$, the ratio of worst-case information difference per bit of algorithmic complexity

$\frac{I(S,e)}{K(S)}$

tends to $$0$$ for algorithmically random objects and to a constant $$\epsilon > 0$$ for computable objects. The more the object's algorithmic complexity diverges from the maximal value (i.e., the larger the algorithmic randomness deficiency), the larger the relative impact of the worst-case on the object's algorithmic complexity (i.e., on its optimal lossless compressibility). Thus, within the context of computably generated objects, this phenomenon suggests some kind of "robustness" in the face of stochastically random perturbations relative to objects' intrinsic algorithmic randomness. Indeed, as one of the main results of AID, these impacts were studied and demonstrated to be related to a thermodynamic-like behavior in stochastic-randomly moving an object toward or away from algorithmic randomness. In addition, this, in turn, led to the existence of optimum points of (re)programmability between both extremes of algorithmic complexity (Zenil et al., 2019b).

### Connection to Dynamical Systems

While some theoretical results have connected algorithmic complexity and dynamical systems, not many applications have done so. One of the main results in AID is that if a chain of causally connected processes is unaffected and generated from the same generative mechanism, then its algorithmic complexity remains constant up to a (double) logarithmic term accounting for the time step when the system is to be reproduced up to a certain observable time (Zenil et al., 2019). Anything departing from such a quantity indicates that the process has been subject to an external perturbation or interaction which can be pinpointed by AID. Numerical experiments quantifying the change in the number of attractors in a dynamical system (and therefore the average depth or shallowness of these attractors) demonstrates that, on the one hand, removing elements that reduce the algorithmic complexity of a dynamical system (with respect to the average length of the models explaining its original description) systematically reduces the number of attractors (Zenil et al., 2019). On the other hand, when removing elements that increase the dynamical system's algorithmic complexity (thus making it more algorithmically random), the number of attractors increases. Current open topics of AID research include multivariable modeling, e.g., describing multiple particles individually rather than as a system.

## Applications to Data Science

### Dimensionality Reduction and Minimal Loss of Information

Besides causation, control, and interventional analysis, AID methods and algorithmic information difference can also be applied to study lossy compression, so that the information loss is minimized or quantified. One can study the rate at which information is lost, recovered or reconstructed in reduced complex artificial and real networks, while retaining the typical features of biological, social, and engineering networks, such as scale-free edge distribution and the small-world property (Zenil et al., 2016; Zenil et al., 2020b).

For example, the ability to preserve the information content of a network using previous dimensionality reduction methods based on motif analysis, graph sparsification, and graph spectra has been evaluated (Zenil et al., 2016). Graph spectra have been found to be the most irregular, losing most of the information content of the original network, and network motif analysis has been shown to be the method that better preserves the original information content, which is consistent with the notion that local regularities can preserve global algorithmic information.

On the other hand, AID methods can be applied directly in order to contribute to areas of feature selection and dimensionality reduction. In the case of graphs, for example, new methods for network summarization based on successive local perturbations in the form of single edge deletions that minimize loss of algorithmic information have been introduced (Zenil et al., 2020b). These methods constitute an interesting computation-time-feasible approach to designing theoretically optimal lossy compression techniques based on principles and estimations of theoretical optimal lossless compression. It has been shown that graph-theoretic indices (ranging from degree distribution and clustering coefficient to edge betweenness and degree and eigenvector centralities) measured on synthetic and real-world networks were preserved, achieving equal or significantly better results than other data reduction methods and some of the leading network sparsification methods.

The same methods have been proposed to equip machine intelligence with an inference engine based on AID, in order to help AI algorithms build computable hypotheses from data that can then be tested against observation (Hernández-Orozco et al., 2019). This would complement other approaches such as statistical machine learning that fail at tasks such as inference and abstraction.

## Relation to other concepts

There is an independent and somewhat parallel line of research by G. Chaitin concentrating on incompleteness and epistemology (Chaitin, 1975; Chaitin, 1987; Wuppuluri & Doria (ed), 2020). Chaitin's metabiology (Chaitin, 2012; Chaitin et al., 2014; Chaitin & Chaitin, 2018) also proposed to study open-ended biological evolution in software space, a concept that AID has taken inspiration from and has contributed to (Hernández-Orozco et al., 2018; Hernández-Orozco et al., 2018a).

Resource-bounded complexity

AID, as based on CTM, is a resource-bounded algorithmic complexity measure, as it takes informed runtime cutoffs (usually very short, to deal with short strings) to estimate candidate upper bounds of algorithmic complexity under assumptions of optimality and for the reference universal enumeration chosen.

Minimum Message and Minimal Description Length methods

While AID, MML, and MDL are ultimately related and are based on the principles of algorithmic probability, these minimum length approaches depart from AID (and AIT) in fundamental ways. MDL is a model selection principle rather than a model generating method. As such, AID incorporates the principle of MDL and represents a generalization, with MDL being a particular case based on learning algorithms using the statistical notion of information rather than the more general--and powerful--notion of algorithmic information used by AID that requires Turing-completeness. In practice, however, MDL and AID can complement each other because AID is more difficult to estimate while MDL provides some statistical shortcuts, and this is similar to and mostly already exemplified in the Block Decomposition Method (BDM), which combines both classical and algorithmic information theories.

In fact, AID is agnostic as regards the underlying method, even if it currently relies completely on the use of CTM and BDM, as described above. So, for example, when AID confines itself to statistical models not produced by CTM, it would collapse into methods such as MDL or MML, but any methods that introduce attempts to go beyond the statistical would approximate the spirit (if not the measure, under the assumption of optimality) of AID as exemplified in CTM and BDM.

Computational Mechanics

AID can be considered a special case or generalization of the area of computational mechanics, depending on one's perspective. They differ in that (1) Crutchfield's formulation involves the introduction of stochastic processes into the models themselves (rather than on the probability distribution of the models), while with AID there is no probabilistic content at the core of the candidate generative model, only at the level of the universal distribution (the distribution governing all computer programs), and (2) AID by way of CTM and BDM provides the means to implement other computational-mechanical approaches, including potentially Crutchfield's, as the means to mine the space of computable finite (deterministic or stochastic) automata.

Measures of sophistication

Because AID has the potential property (under assumptions of optimality) to distinguish simple from random and assign them similar values, AID can be considered a measure of sophistication similar to Bennett's logical depth. Moreover, a measure of (re)programmability is a measure of sophistication by design, and is based on AID. Some thermodynamic-like properties associated with reprogramming systems have been found and reported in the literature.

Granger causality and transfer entropy

This family of methods is statistical in nature and used mostly for causal analysis, not causal discovery, especially because they are unable to deal with models in phase-space corresponding to possible physical state models. Some relationships between variables, causal or temporal, can be derived as they can be from intervention analyses similar to Pearl's do-calculus, but they don't produce candidate mechanistic models in the first place, and the causal analysis falls back into the purely classical probabilistic framework which, though partially circumvented, must be resorted to again when it comes to testing the associative nature of 2 or more variables.

## Challenges and limitations

The computational resources needed to compute CTM, at the core of AID, limit the size of its application, but BDM extends and combines CTM with other computable measures, including Shannon entropy itself. One of the most active areas of AID research is to find ways to make the algorithmic measures more relevant and to find possible shortcuts towards the computation of the computable region in the uncomputable space that is relevant for applications in science.

## References

• Gàcs, P. (1974). On the symmetry of algorithmic information. Soviet Mathematics Doklady, 15:1477--1480.
• Zenil, H. (2014). Programmability: A Turing Test Approach to Computation. In L. De Mol and G. Primiero (eds.), Turing in Context, Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten (Belgian Academy of Sciences and Arts), Contactforum.
• Zenil, H. (2017). Approximations to Algorithmic Probability. In Robert A. Meyers (ed), 2nd. Edition of the Springer Encyclopedia of Complexity and Systems Science.

Internal references

• Olaf Sporns (2007) Complexity. Scholarpedia, 2(10):1623.
• Tomasz Downarowicz (2007) Entropy. Scholarpedia, 2(11):3901.