# Bayesian Ying Yang learning


Curator: Lei Xu

Ying-Yang learning considers a learning system featured with two pathways between the external observation domain $$X$$ and its inner representation domain $$R$$. The domain $$R$$ and the pathway $$R \to X$$ are modeled by one subsystem, while the domain $$X$$ and the pathway $$X \to R$$ are modeled by another subsystem. From the view of the ancient Ying-Yang philosophy, the former is called Ying and the latter is called Yang, and the two coordinately form a Ying-Yang system, with the structure of Ying subject to a principle of compactness (least complexity) and the structure of Yang subject to a principle of proper vitality (matched dynamic range) with respect to the Ying. Moreover, all the remaining unknowns in the Ying-Yang system are learned under the guidance of a Ying-Yang best harmony principle (Notice: "Ying" is spelled "Yin" in Chinese Pinyin. To keep its original harmony with Yang, the term "Ying-Yang" has been deliberately adopted since 1995).

Under the framework of modern probability theory, the Ying consists of a probabilistic structure $$q(R)$$ for modeling the domain $$R$$ and $$q( X |R)$$ for modeling the pathway $$R \to X$$, while the Yang consists of $$p(X)$$ for the domain $$X$$ and $$p(R|X)$$ for the pathway $$X \to R$$. In other words, we have the following two complementary Bayesian representations of the joint distribution on $$X, R$$: $\tag{1} p({ X}, { R})=p({ R}| { X})p({ X}), \ q({ X}, { R})=q({ X} | { R})q({ R}),$ which is thus called a Bayesian Ying-Yang (BYY) system. Moreover, the Ying-Yang best harmony principle is mathematically implemented by maximizing the following harmony measure $\tag{2} H(p \Vert q) = \int p({R}|{X})p({X}) \ln{[ q({X} | {R})q({R})]} d{X} d{R}$

First proposed in 1995 and systematically developed over many years, Bayesian Ying-Yang learning provides not only a general framework that accommodates typical learning approaches from a unified perspective, but also a new road that leads to improved model selection criteria, Ying-Yang alternative learning with automatic model selection, and coordinated implementation of Ying-based model selection and Yang-based learning regularization.

## Three learning ingredients, three inverse problems, and one two-pathway system

Learning is a process of inductively inferring or statistically summarizing the regularity underlying a set of samples under the guidance of a learning theory, and of adaptively loading that regularity into a learner by iteratively modifying its structure and parameters. Different learning approaches are featured by differences in one or more of the following three ingredients:

• Learner $$\quad$$ It is usually a statistical model or system featured by both a set of unknown real parameters that specifies a mathematical function and a set of unknown integers that indicates the complexity and scale of this learner.
• Theory $$\quad$$ It provides a principle that measures how well the learner describes reliable regularities underlying samples, typically via a mathematical measure or objective function that can be optimized.
• Implementation $$\quad$$ It is a dynamic process that implements the learning theory, usually involving a mixed continuous and discrete optimization.

The two-pathway approach has been adopted in the literature on modeling perception systems for decades. One early example is the adaptive resonance theory developed in the 1970s (Grossberg, 1976; Grossberg & Carpenter, 2002), featured by a resonance between bottom-up input and top-down expectation with the help of a mechanism motivated from a cognitive science view. Efforts have further been made on multilayer-net featured two-pathway approaches under the guidance of a statistical principle, e.g., either under the least mean square error criterion by auto-association (Bourlard & Kamp, 1988) and the LMSER self-organization (Xu, 1991 & 93), or under the Helmholtz energy by the Helmholtz machine and wake-sleep learning (Hinton, Dayan, Frey, & Neal, 1995; Dayan, 2002). The basic spirit of LMSER self-organization was further developed into the Bayesian Ying-Yang (BYY) learning in the mid-1990s, not only with the multi-layer two pathways replaced by a general system via two complementary Bayesian representations, but also with the least mean square error criterion replaced by a best Ying-Yang harmony criterion (Xu, 1995, 1997a&b&c, 1998a&b, 2000, 2001a&b, 2002, 2003, 2004a&b&c, 2005, 2007, 2008a&b&c).

In the literature of statistics, the term representative or recognition model is used to refer to a bottom-up modeling, covering studies of not only various supervised learning models for regression or discriminative analyses but also those unsupervised learning approaches such as best information transfer (Bell & Sejnowski, 1995) and minimum mutual information (Amari, Cichocki, Yang, 1996). On the other hand, the term generative model refers to a top-down modeling for a best fitting of observed samples, including minimum fitting errors, maximum likelihood, maximum marginal likelihood, etc. Typically, a generative model is associated with latent variables, and its learning actually involves estimating these latent variables by a bottom-up pathway, which is actually a Bayes inverse of the generative model in maximum likelihood or maximum marginal likelihood learning. The BYY learning and the Helmholtz machine as well as variational Bayes (Hinton & Zemel, 1994; Jordan, Ghahramani, Jaakkola, & Saul, 1999; Jaakkola, 2001) are further developments of the Bayes inverse approaches. While sharing some common learner structures, these approaches are different from each other in their uses of different learning theories, as to be further addressed in later subsections.

Figure 1: Three levels of inverse problems, where $$G(x |\mu,\Sigma)\,$$ denotes a Gaussian density with the mean vector $$\mu$$ and the covariance matrix $$\Sigma$$.

Systematically, the tasks of two-pathway learning actually consist of three levels of inverse problems. Taking the Gaussian mixture model illustrated in Figure 1(d) as an example, the inner level is featured with a mapping $$X \to Y$$ that classifies each sample $$x_t$$ into a label $$y_t=j_t$$, based on $$k$$ given Gaussian components. Generally, as illustrated in Figure 1(a), this level probabilistically maps $$X$$ into its inner representation $$Y$$, for making a decision, inferring something, yielding a controlling signal, etc., which may all be summarized under the term problem solving. The middle level aims at a mapping $$X \to \theta$$ that estimates a set $$\theta$$ of all the unknown parameters in this Gaussian mixture, which is usually called parameter learning. Moreover, the top level targets $$X \to k$$ to determine $$k$$, i.e., the number of Gaussian components or clusters in a given data set, which is usually called model selection.

Summarizing into $$R=\{Y, \theta, k \}$$, the three levels of two-pathway mappings in Figure 1 can be unified into the BYY system by eq.(1) with $\tag{3} q({R})=q(Y, \theta, k)= q(Y| \theta_k^{ y}) q(\theta_k) q(k), \ q({X} | {R})= q({X} | Y, \theta_k^{x|y}), \\ p({ X} )= \prod_t G(x_t | x^*_t, h^2I), \ p({ R}| { X})= p(Y, \theta, k | { X})= p(Y| { X}, \theta_k) p(\theta_k| { X}) p(k | { X}),$ where $$x^*$$ denotes a specified value of the variable $$x$$ that is an element of the set $$X$$. Then, all the unknowns in the BYY system are subsequently determined via maximizing the harmony measure by eq.(2).

For many practical tasks, e.g., the Gaussian mixture illustrated in Figure 1(d), $$X$$ is a set of i.i.d. variables $$\{x_t\}_{t=1}^N \,$$, and also $$Y$$ is a set of i.i.d. variables with each $$y_t$$ as the inner representation of $$x_t$$. Accordingly, we have $\tag{4} \begin{array}{l} q(Y|\theta_k)= \prod_t q(y_t | \theta_k ^y), \ q({X} | Y, \theta_k)=\prod_t q(x_t | y_t, \theta_k^{x|y}), \\ p(Y| { X}, \theta_k) =\prod_t p(y_t| x_t, \theta_k ). \end{array}$ The structures of $$q(y_t | \theta_k ^y), q(x_t | y_t, \theta_k^{x|y})$$ are specified task-dependently, with manual help called design. For a Gaussian mixture, we have simply $$y_t =j, j=1, \cdots, k$$ and $\tag{5} q(y_t =j | \theta_k ^y)= \alpha_{j}, \ q(x_t | y_t=j, \theta_k^{x|y})= G(x_t|\mu_{j},\Sigma_{j}).$ Moreover, the structure of $$p(y_t| x_t, \theta_k )$$ is designed and/or derived from the structures of $$q(y_t | \theta_k ^y), q(x_t | y_t, \theta_k^{x|y})$$, under the principle of matched dynamic range. As to be addressed in a later subsection, maximizing the harmony measure by eq.(2) with respect to $$p(y_t| x_t, \theta_k )$$ in the designed structure by eq.(5) leads to $$p(j | x, \theta_k ) =q(j | x, \theta_k )$$. Additionally, a design task also includes task-dependently choosing a prior $$q(\theta )$$, which may also be parametric with a set $$\Xi$$ of hyper-parameters to be determined together with $$k$$ as illustrated in Figure 1(c).
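For the Gaussian mixture design of eq.(5), the matched Yang posterior $$p(j | x, \theta_k ) = q(j | x, \theta_k )$$ can be sketched as follows. This is a minimal Python sketch; the function names are illustrative, and the log-domain normalization is a standard numerical-stability choice, not part of the text.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """ln G(x | mu, Sigma) for a single sample x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(sigma, diff))

def yang_posterior(x, alphas, mus, sigmas):
    """p(j|x) = alpha_j G(x|mu_j,Sigma_j) / sum_l alpha_l G(x|mu_l,Sigma_l)."""
    logs = np.array([np.log(a) + log_gauss(x, m, S)
                     for a, m, S in zip(alphas, mus, sigmas)])
    logs -= logs.max()          # subtract max for numerical stability
    w = np.exp(logs)
    return w / w.sum()
```

For a sample lying near $$\mu_j$$, this posterior concentrates nearly all its mass on component $$j$$, which is exactly the inner-level mapping $$X \to Y$$ of Figure 1(d).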

For simplicity and clarity, we may subsequently drop the subscript $$k$$ in $$\theta_k, \theta_k ^y, \theta_k^{x|y}$$ wherever it will not bring a confusion.

## BYY algorithm versus EM algorithm for learning Gaussian mixture

The EM learning algorithm (Redner & Walker, 1984; Xu & Jordan, 1996) is widely used for implementing the ML learning and has also been extended for implementing Bayes learning and variational Bayes. Thus it is a good starting point for understanding how the BYY learning differs from these approaches by observing their learning implementations.

In the case of eq.(4) and eq.(5), maximizing the harmony measure by eq.(2) becomes $\tag{6} \begin{array}{l} \max_{\theta, k} \ H(\theta, k), \ H(\theta, k) =\sum_{t=1}^{N}\sum_{\ell=1}^k p(\ell | x_t, \theta_k ) L(x_t,\theta_{\ell}), \ \ L(x_t,\theta_{\ell})=-\varepsilon_t(\theta_{\ell})-r_t(\theta_{\ell}), \\ \varepsilon_t(\theta_{\ell})=-\ln{[\alpha_{\ell}G(x_t|\mu_{\ell},\Sigma_{\ell})]}, \ r_t(\theta_{\ell})=r_{\pi,t}^x(\theta_{\ell})-\ln{q(\theta_{\ell})}, \\ r_{\pi,t}^x(\theta_{\ell})=-{h^2 \over 2}Tr[\frac{\partial^2 L(x,\theta_{\ell})}{\partial x\partial x^T}]=\frac{1}{2}h^2Tr[\Sigma_{\ell}^{-1}], \end{array}$ where the term $$\varepsilon_t(\theta_{\ell})$$ describes the error with which the $$\ell$$th component fits the sample $$x_t$$, and the term $$r_t(\theta_{\ell})$$ aims at reducing the complexity of this fit, consisting of two parts. One part comes from the prior $$q(\theta_{\ell} )$$, while the other comes from $$p({ X} )$$ with $$h\ne 0$$ in eq.(3), for effectively seeking smoothness of $$L(x,\theta_{\ell})$$ with respect to $$x$$.

For simplicity, the prior $$q(\theta_{\ell} )$$ is ignored. Then, we maximize the above $$H(\theta, k)$$ by the following Ying-Yang alternation:

• Yang step (shortly A-step): with $$p(j | x_t, \theta ) = q(j|x_t, \theta)$$ given in Figure 1(d) for each sample in $$X=\{x_t\}$$, we get

$\tag{7} \begin{array}{l} p_{j, t}= p(j|x_t, \theta^{old})[1+ \Delta_{j,t}(\theta^{old})], \,\,\, \Delta_{j,t}(\theta)= L(x_t,\theta_{j})- \sum_{\ell } p(\ell|x_t, \theta) L(x_t,\theta_{\ell}). \\ \end{array}$

• Ying step (shortly I-step): get $$\theta_{\ell}^{new}, \ \ell=1, \cdots, k$$ by either

$\tag{8} \begin{array}{l} \theta_{\ell}^{new}-\theta_{\ell}^{old} = \gamma (\theta_{\ell}^{*}-\theta_{\ell}^{old} ) \end{array}$ or $\tag{9} \begin{array}{l} \theta_{\ell}^{new}-\theta_{\ell}^{old} \propto \sum_{t\in T_t} G_t(\theta_{\ell}^{old}), \, \, \, G_t(\theta_{j})=\nabla_{\theta_{j}}\sum_{\ell=1}^k p(\ell | x_t, \theta_k ) L(x_t,\theta_{\ell}) \end{array}$ with an appropriate step-size $$\gamma \ge 0$$, and discard the $$\ell$$th Gaussian component if its corresponding $$\alpha_\ell\to 0$$ or $$\alpha_\ell Tr[\Sigma_\ell]\to 0$$, where $$T_t$$ is the set of samples considered at time $$t$$, with one extreme that $$T_t$$ consists of only the current sample $$x_t$$ and the other extreme that $$T_t$$ consists of all $$N$$ samples.

In the above, $$\theta_{\ell}^{*}$$ is the root of the equations $$\sum_{t\in T_t} G_t(\theta_{j})=0, \forall j,$$ with $\tag{10} \begin{array}{l} G_t(\theta_{j})= - \sum_{\ell=1}^k [p_{\ell, t}\nabla_{\theta_{j}}\varepsilon_t(\theta_{\ell})+ p(j|x_t, \theta^{old})\nabla_{\theta_{j}}r_t(\theta_{\ell}) ], \end{array}$ subject to $$\theta_j^{new}\in D(\theta_j)$$, which usually imposes certain constraints on $$\theta_{j}, \forall j$$. In the simple case of ignoring the prior $$q(\theta_{\ell} )$$, the root of $$\sum_{t\in T_t}G_t(\theta_{j})=0, \forall j$$ is given as follows: $\tag{11} \begin{array}{l} \alpha_{j}^*={1 \over \#T_t} \sum_{t\in T_t} p_{j, t}, \, \, \, \mu_{j}^*={1 \over \#T_t\alpha_{j}^*} \sum_{t\in T_t} p_{j, t}x_t, \\ \Sigma_{j}^*=h^2I+{1 \over \#T_t\alpha_{j}^*} \sum_{t\in T_t} p_{j, t}( x_t- \mu^{old}_{j})( x_t- \mu^{old}_{j})^T, \end{array}$ where $$\#T_t$$ denotes the cardinality of the set $$T_t$$. We observe that the Ying-Yang alternation actually becomes the well-known EM algorithm for the ML learning if we consider samples directly (i.e., $$p({ X} )$$ with $$h=0$$ in eq.(3)) and then force $$\Delta_{\ell,t}=0$$ and $$\gamma =1$$.
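One full Ying-Yang alternation of eqs.(7) and (11) can be sketched as follows, assuming $$h=0$$ and the prior $$q(\theta_{\ell})$$ ignored. The function names, the discard threshold, the small ridge on the covariances, and the clipping of negative weights (which mutes the de-learning effect for numerical safety) are all illustrative simplifications, not part of the text; with $$\Delta_{j,t}$$ forced to zero, the step reduces to one EM step.

```python
import numpy as np

def log_gauss_all(X, mu, sigma):
    """ln G(x_t | mu, Sigma) for every row x_t of X."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(sigma)
    sol = np.linalg.solve(sigma, diff.T).T
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + np.einsum('ij,ij->i', diff, sol))

def byy_step(X, alphas, mus, sigmas, eps=1e-3):
    """One Ying-Yang alternation per eqs.(7)&(11), with h=0 and prior ignored."""
    N, k = X.shape[0], len(alphas)
    # Yang (A-)step, eq.(7): L(x_t, theta_j) = ln[alpha_j G(x_t|mu_j, Sigma_j)]
    L = np.stack([np.log(alphas[j]) + log_gauss_all(X, mus[j], sigmas[j])
                  for j in range(k)], axis=1)               # shape (N, k)
    post = np.exp(L - L.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                 # p(j | x_t, theta_old)
    delta = L - (post * L).sum(axis=1, keepdims=True)       # Delta_{j,t}
    # p_{j,t} = p(j|x_t)(1 + Delta_{j,t}); clipped at zero so eq.(11)
    # stays well defined -- a simplification that mutes de-learning
    p = np.clip(post * (1.0 + delta), 0.0, None)
    # automatic model selection: discard components whose alpha -> 0
    s = p.sum(axis=0)
    keep = s / N > eps
    p, s = p[:, keep], s[keep]
    alphas = s / s.sum()
    # Ying (I-)step: the roots of eq.(11) with h=0, using mu_old
    mus_new = (p.T @ X) / s[:, None]
    sigmas_new = np.stack([
        (p[:, j][:, None] * (X - mu_old)).T @ (X - mu_old) / s[j]
        + 1e-6 * np.eye(X.shape[1])                         # tiny ridge for stability
        for j, mu_old in enumerate(np.asarray(mus)[keep])])
    return alphas, mus_new, sigmas_new
```

Iterating `byy_step` from an over-large initial $$k$$ lets dying components be pruned on the fly, in contrast to the EM algorithm, which keeps $$k$$ fixed.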

The relative fitness $$\Delta_{\ell,t}$$ makes the BYY harmony learning differ from the ML learning. If $$\Delta_{\ell,t}>0,$$ updating goes along the same direction as the ML learning but with an increased strength. If $$0>\Delta_{\ell,t}> - 1$$, updating still goes along the ML learning direction but with a reduced strength. When $$\Delta_{\ell,t}<-1$$, updating reverses to the opposite direction, i.e., it becomes de-learning. Interestingly, we are led to a scenario similar to rival penalized competitive learning (RPCL) and thus also obtain a favorable nature of automatic model selection: model selection is obtained automatically during parameter learning via discarding the $$\ell$$th Gaussian component as $$\alpha_\ell\to 0$$ or $$\alpha_\ell Tr[\Sigma_\ell]\to 0$$, as illustrated in Figure 2, while RPCL can be explained as a bottom-level simplified approximation of the BYY harmony learning. Generally, $$h\ne 0$$ may also be updated to seek a best smoothness of $$L(x,\theta_{\ell})$$ with respect to $$x$$, with further details referred to Fig.7(b) in a recent overview (Xu, 2010).

Figure 2: An illustration of the BYY harmony learning on a Gaussian mixture with automatic model selection (double click to see animation).

Referring to the page about RPCL: typically, model selection is made in a two-stage implementation. First, the EM algorithm is used on a Gaussian mixture with a fixed $$k$$ to learn parameters $$\theta^*_k$$. Second, the best $$k^*$$ is selected by using a model selection criterion $$J(\theta^*_k)$$, e.g., AIC (Akaike, 1974), BIC (Schwarz, 1978)/MDL (Rissanen, 1986, 2007), and their extensions. This two-stage approach usually incurs a huge computing cost. Moreover, the parameter learning performance deteriorates rapidly as $$k$$ increases, which makes the evaluated value of $$J(\theta^*_k)$$ unreliable. Efforts have been made on stepwise model selection, either incrementally or decrementally, which may save computing costs but usually leads to suboptimal performance.
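The conventional two-stage procedure just described can be sketched as follows, using scikit-learn's EM-based `GaussianMixture` and its built-in BIC score; the synthetic data and the candidate range are illustrative choices, and this sketch is the baseline the BYY approach contrasts with, not part of the BYY formulation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stage 1: for each candidate k, learn theta*_k by a full EM run.
# Stage 2: select k* minimizing the model selection criterion J (here BIC).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-4, 1, (200, 2)), rng.normal(4, 1, (200, 2))])

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 6)}
k_star = min(bic, key=bic.get)   # best k by BIC
```

Note that every candidate $$k$$ requires a separate full EM run, which is precisely the computing cost the text points out.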

Since the early effort on RPCL learning (Xu, Krzyzak, & Oja, 1992, 1993), studies on automatic model selection have sought a learning algorithm or a learning principle featured with a mechanism that automatically drives an indicator $$\rho(\theta_e)\to 0$$ for a particular structural component associated with a subset $$\theta_e \subset \theta_{\mathbf k}$$; e.g., a Gaussian component is discarded if its corresponding $$\alpha_\ell=0$$ and/or $$Tr[\Sigma_\ell]=0$$.

## Ying-Yang best harmony: Ying-Yang best matching in a least complexity

We proceed to an understanding of the theory of Ying-Yang best harmony. We start from the modern perspective on the ancient Chinese Yin-Yang theory introduced in the Appendix of one recent overview (Xu, 2010) and then elaborated in Sect. 3.1 of another overview (Xu, 2012). As illustrated in Figure 3, a body or a system that survives and interacts with its world can be regarded as a Ying-Yang system that is functionally composed of two complementary parts. One, called Yang, gathers from the external world (called the Yang domain $$X$$) via one pathway (called the Yang pathway $$X \to R$$) that transforms what is gathered in the Yang domain into inner representations in its inside domain (called the Ying domain $$R$$); the other, called Ying, consists of both the Ying domain $$R$$, which receives, accumulates, integrates, and digests the inner representations provided by Yang, and a Ying pathway $$R \to X$$, which selects among the Ying domain to generate outputs to the external world.

Referring to Figure 3(b), (c) & (d), Ying is primary, featured with a compact capacity (i.e., a least complexity) for accommodating and accumulating, as well as a good ability of integrating and digesting whatever is supplied by Yang; Yang is secondary, fetching from the outside and serving the demands of Ying, and thus is vigorous in adapting to the variety of not only the external world but also the dynamic range of the Ying. A Ying-Yang system inputs by Yang and outputs by Ying, operating in a Wu-Xing circling flow during which Ying and Yang adapt to each other to cooperate under a best harmony principle; see Figure 3(e) & (f). In the probabilistic framework, we get the Bayesian Ying-Yang (BYY) system by eq.(1), in which the Ying uses $$q({X} )= \int q({X} | {R})q({R}) d{R}$$ to fit observed samples $$\{x_t, t=1, \cdots, N\}$$ with the help of $$p({R}|{X})$$, not in a maximum likelihood sense but by maximizing the harmony measure by eq.(2), such that all the unknowns in this BYY system are determined. This maximization leads not only to a Ying-Yang best match but also to a BYY system in a least complexity, which makes both parameter learning and model selection use the same measure $$H(p \Vert q)$$.

This nature can be further examined from the following three information theoretic perspectives:

• Considering the following relation

$\tag{12} \begin{array}{l} H(p \Vert q) = H(p \Vert p) - KL(p \Vert q), \\ H(p \Vert q) =\int p(u)\ln{q(u)}du, \ KL(p \Vert q)= \int p(u)\ln{ p(u) \over q(u)}du, \end{array}$ where $$KL(p \Vert q)$$ is the Kullback-Leibler divergence while $$- H(p \Vert p)$$ is the entropy, we observe that maximizing $$H(p \Vert q)$$ by eq.(2) consists of not only minimizing $$KL(p({R}|{X})p({ X} )\Vert q({X} | {R})q({R}))$$ for a Ying-Yang best match but also minimizing the information $$- H(p({R}|{X})p({ X})\Vert p({R}|{X})p({ X}))$$ of Yang, and thus in turn the information of Ying, due to the Ying-Yang best match nature. That is, Ying and Yang seek a best agreement in a most tacit manner, featured with a least amount of information communication in the operation flow.

• For a fixed $$p$$, maximizing $$H(p \Vert q)$$ with respect to a free $$q$$ will force $$q=p$$, which forces the Ying $$q({X} | {R})q({R})$$ to best match the Yang $$p({R}|{X})p({X})$$.

Due to a finite sample size $$N$$ and other existing constraints (if any), the limit $$q({X} | {R})q({R})=$$ $$p({R}|{X})p({ X} )$$ may not actually be reached. Still, there is a trend towards this equality, at which $$H(p \Vert q)$$ becomes the negative entropy, and thus its further maximization will minimize the complexity of the Ying-Yang system.

• From the following relation

$\tag{13} \begin{array}{l} H(p \Vert q) = \int p({R}|{X})p({ X} ) \ln{ q({X} | {R})} dR dX + H(p({R}) \Vert q({R})), \\ \mbox{with} \ \ p({R}) = \int p({R}|{X})p({ X} ) dX, \end{array}$ we may observe that maximizing $$H(p \Vert q)$$ by eq.(2) consists of not only maximizing the fit of observed samples $$\{x_t, t=1, \cdots, N\}$$ by $$q({X} | {R})$$ with the help of $$p({R}|{X})$$, but also maximizing $$H(p({R}) \Vert q({R}))$$, which tends to $$q({R}) =p({R})$$ and thus approximately $$H(q({R}) \Vert q({R}))$$. Therefore, its further maximization will minimize the complexity of the inner representation by $$q({R})$$. Even without considering priors, i.e., ignoring $$q(\theta_k) q(k)$$ in eq.(3), we still have $$H(q({R}) \Vert q({R}))=$$ $$H(q(Y|\theta_k^{ y}) \Vert q(Y|\theta_k^{ y}))$$ to be maximized for a least complexity of $$q(Y|\theta_k^{ y})$$, which actually enforces model selection. Moreover, with the help of an appropriate prior $$q(\theta_k)$$, the model selection ability will be further improved via the maximization of $$H(q(Y|\theta_k^{ y}) q(\theta_k) \Vert q(Y|\theta_k^{ y}) q(\theta_k))$$.
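The decomposition in eq.(12), which underlies the three perspectives above, can be checked numerically in the discrete case, with sums replacing integrals; the distributions below are illustrative.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_pq = np.sum(p * np.log(q))        # harmony measure H(p||q)
H_pp = np.sum(p * np.log(p))        # H(p||p), the negative entropy of p
KL   = np.sum(p * np.log(p / q))    # Kullback-Leibler divergence KL(p||q)

assert np.isclose(H_pq, H_pp - KL)  # eq.(12): H(p||q) = H(p||p) - KL(p||q)
```

Since $$KL(p \Vert q)\ge 0$$, the check also makes concrete that $$H(p \Vert q)\le H(p \Vert p)$$, with equality exactly at the best match $$p=q$$.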

Also, we may get further insights into the nature of automatic model selection by observing the gradient flow $$\nabla_{ \theta} H(p \Vert q )$$ for the maximization of $$H(p \Vert q )$$, in comparison with its counterpart for maximizing the likelihood. For simplicity but without loss of generality, we ignore the prior $$q(\theta )$$ and maximize $$H(p \Vert q )$$ by eq.(2), resulting in $\tag{14} \begin{array}{l} \max_{\theta, k} \ H(\theta, k), \ H(\theta, k) = \int p( {Y} | {X}, \theta) p({X}) \ln{[ q({X} | {Y}, \theta^{x|y})q({Y}|\theta^{y})]} d{X} d{Y}. \end{array}$ As to be addressed in the next subsection, we further design $\tag{15} \begin{array}{l} p( {Y} | {X}, \theta)= q({X} | {Y}, \theta^{x|y})q({Y}|\theta^{y})/ q({X} | \theta), \ q({X} | \theta)=\int q({X} | {Y}, \theta^{x|y})q({Y}|\theta^{y})d{Y}. \end{array}$ Then, we obtain the following gradient

$\tag{16} \begin{array}{l} \nabla_{\theta} H(\theta, k)= \int p( {Y} | {X}, \theta) [1+\Delta_L(X,Y)]\nabla_{\theta}L(X,Y) p({X}) d{X} d{Y}, \\ \Delta_L(X,Y)= L(X,Y,\theta)- \int p( {Y} | {X}, \theta^{y|x}) L(X,Y,\theta) d{Y}, \\ L(X,Y,\theta)=\ln{[ q({X} | {Y}, \theta^{x|y})q({Y}|\theta^{y})]}. \end{array}$

Noticing that $$L(X,Y,\theta)$$ describes the fitness of an inner representation $$Y$$ on the observation $$X$$, $$\Delta_L(X,Y)$$ indicates whether the Ying with the currently considered $$Y$$ fits $$X$$ better than the average over all possible choices of $$Y$$.

Thus, we get observations similar to those discussed after eq.(11). Letting $$\Delta_L(X,Y)=0$$, $$\nabla_{\theta} H(\theta, k)$$ is simplified into $$\int p( {Y} | {X}, \theta^{y|x})\nabla_{\theta}L(X,Y) p({X}) d{X} d{Y},$$ which can also be derived from $$\nabla_{\theta} \int p({X}) \ln{ q({X} | \theta) } d{X}$$ with $$q({X} | \theta)$$ by eq.(15). In other words, with $$p({ X} )$$ at $$h=0$$, updating $$\theta$$ along this gradient flow actually implements the classic ML learning. With $$\Delta_L(X,Y)\ne 0$$, the gradient flow $$\nabla_{\theta}L(X,Y)$$ over all possible choices of $$Y$$ is integrated via weighting not just by $$p({Y}|{X}, \theta^{y|x})$$ but also by the relative fitness modification $$1+\Delta_L(X,Y)$$. If $$\Delta_L(X,Y)> 0$$, updating goes along the same direction as the ML learning but with an increased strength. If $$0>\Delta_L(X,Y)> -1$$, i.e., the fitness is worse than the average and thus this $$Y$$ is doubtful, updating still goes along the same direction as the ML learning, but with a reduced strength. When $$\Delta_L(X,Y)< -1$$, updating reverses to the opposite direction, i.e., it becomes de-learning. That is, we again observe that the BYY harmony learning has an RPCL-like nature and thus a similar nature of automatic model selection.
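The three updating regimes governed by $$1+\Delta_L(X,Y)$$ can be summarized in a short sketch; the function name and return strings are illustrative bookkeeping for the case analysis above, not a learning algorithm.

```python
def update_regime(delta):
    """Classify the update by the relative fitness Delta_L of eq.(16)."""
    weight = 1.0 + delta     # factor multiplying the ML gradient direction
    if delta > 0:
        return weight, "ML direction, increased strength"
    if delta == 0:
        return weight, "plain ML step"
    if delta > -1:
        return weight, "ML direction, reduced strength"
    return weight, "reversed direction (de-learning)"
```

For instance, `update_regime(-1.2)` yields a negative weight, i.e., the de-learning case that drives redundant components toward being discarded.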

## Tri-relation theory versus bi-relation ones: harmony, matching, and the Ying-Yang viewpoint

Learning is made under the guidance of a learning principle or theory that measures how close the behavior of a learner (a parametric model) is to what is observed in its world, for which we need to measure a proximity between two entities, namely, the regularity described by the model and the regularity underlying samples.

For learning by probabilistic models, we consider entities in a probability space or, generally, a measure space. Let $$P, Q, \mu$$ be $$\sigma$$-finite measures on the same measurable space $$(X, \Sigma)$$. We observe the proximity between the changes of $$P, Q$$ on every differential piece of this space with respect to $$\mu$$, which describes a volume or capacity (e.g., a Lebesgue measure) of this space $$(X, \Sigma)$$, with the help of the following Radon-Nikodym derivatives: $\tag{17} \begin{array}{l} H_\mu ( P\Vert Q)= \int \frac{ dP}{d\mu}f ( \frac{d Q}{d \mu})d \mu =\int f ( \frac{d Q}{d \mu})d P, \end{array}$ where the product $$\frac{ dP}{d\mu}f ( \frac{d Q}{d \mu})$$ describes a local harmoniousness with a scale adjustment by $$f(r)$$, usually a monotonically increasing and concave function with $$f(1) = 0$$, e.g., $$f(r)=\ln r$$, such that $$Q$$ takes a primary role with a large behaving range to be evaluated in detail.

The above $$H_\mu ( P\Vert Q)$$ is a triple relation among $$dP, dQ,$$ and $$d\mu$$, which degenerates into one bi-relation measure $$H_ P ( P\Vert Q)$$ at $$\mu = P$$ and another bi-relation measure $$H_ \mu ( P\Vert P)$$ at $$Q = P$$. Particularly, if $$f(r)=\ln r$$ and $$\mu$$ is the Lebesgue measure, we echo eq.(12) with $\tag{18} \begin{array}{l} H_\mu ( P\Vert Q)= H (p\Vert q) = H(p \Vert p) - KL(p \Vert q), \\ H_ P ( P\Vert Q)= - KL(p \Vert q), \ H_ \mu ( P\Vert P)= H(p \Vert p), \ p=\frac{ dP}{d\mu}, \ q= \frac{d Q}{d \mu}, \end{array}$ where $$H(p \Vert p)$$ is a negative entropy and $$KL(p \Vert q)$$ is the Kullback-Leibler divergence, both of which are bi-relation measures degenerated from the triple-relation measure $$H(p \Vert q)$$. Maximizing this triple relation consists of both seeking a least information $$-H(p \Vert p)$$ and a best matching between $$p$$ and $$q$$.

In conventional information theoretic studies, both of the above bi-relation measures are basic concepts that have been widely used, while the triple-relation measure $$H(p \Vert q)$$ has been relatively underexplored. One well known scenario considers $$p$$ given from data samples, by which maximizing $$H(p \Vert q)$$ and minimizing $$KL(p \Vert q)$$ become equivalent. Particularly, when $$p=p({ X} )$$ is the empirical density given directly from data samples, e.g., $$p({ X} )$$ with $$h=0$$ in eq.(3), both become equivalent to maximizing the likelihood function $$\ln{q(X|\theta)}$$. Another scenario, explored but regarded as fruitless, is that this $$H(p \Vert q)$$ is also sometimes called cross-entropy. However, considering $$q$$ given as a fixed prior reference and maximizing $$H(p \Vert q)$$ to optimize $$p$$ subject to some constraint does not approach $$p=q$$. Instead, $$p$$ tends to be pushed towards a useless $$\delta$$ distribution. This scenario was regarded as unfavorable, and thus the term cross-entropy has actually been re-defined to refer to $$KL(p \Vert q)$$ rather than $$H(p \Vert q)$$. In fact, minimum cross-entropy (MCE) or Minxent in the engineering literature means minimizing $$KL(p \Vert q)$$, rather than maximizing $$H(p \Vert q)$$. To avoid ambiguity and inconsistency, we abandon the term cross-entropy and call $$H(p \Vert q)$$ the harmony measure, which becomes even preferable when applied to the BYY system by eq.(1). Readers are referred to Sect.4.1 and particularly Fig.5 of one recent paper (Xu, 2012), and also Sect. 4.1 of another paper (Xu, 2010), for further details about the measures in eq.(18).

When these measures are applied to learning systems with inner representations, e.g., the ones illustrated in Figure 1, there are more than two components in each learning system, and thus there are different choices among the components to collectively act as $$P, Q$$. In addition to choosing among the three measures in eq.(18), studies have been made on other learning principles. Also, we may combine the measures of different pairs of $$P, Q$$ to form a measure to be optimized eventually.

Efforts on forming such a pair $$P, Q$$ can be roughly summarized along two lines. One covers most of the existing efforts, featured by measuring a unidirectional proximity. From the top-down direction, the proximity between a system and a set of samples is considered via the marginal $\tag{19} \begin{array}{l} q(X)=\int q(X|\Phi)q(\Phi)d\Phi, \ \Phi \subset R, \end{array}$ e.g., the likelihood in Figure 1(b) and the marginal likelihood in Figure 1(c). From the bottom-up direction, the proximity is considered either by using $\tag{20} \begin{array}{l} p(Y)=\int p(Y|X)p(X) dX, \end{array}$ to match $$q(Y|\theta^y)$$, or by directly using $$p(Y|X)$$ to match paired samples of $$Y, X$$, including minimum mutual information, maximum information transfer, and discriminative MMI training in speech recognition. Further details are given in Sect.4.2 of a recent paper (Xu, 2012).
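In the discrete case, the bottom-up marginal in eq.(20) is just a matrix-vector product; the values below are illustrative.

```python
import numpy as np

p_x = np.array([0.6, 0.4])              # p(x) over two observed states
p_y_given_x = np.array([[0.9, 0.1],     # rows indexed by x, columns by y
                        [0.2, 0.8]])

p_y = p_x @ p_y_given_x                 # eq.(20): p(y) = sum_x p(y|x) p(x)
# p_y is then the bottom-up marginal to be matched against q(y|theta^y)
```

Here `p_y` comes out as a proper distribution (its entries sum to one), ready to be compared against the Ying-side $$q(Y|\theta^y)$$.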

The other line is featured by measuring a bidirectional proximity via the BYY system by eq.(1), with Yang as $$P$$ and Ying as $$Q$$. On one hand, minimizing $$KL(p \Vert q)$$ performs Ying-Yang best matching, which interestingly acts as a general framework that unifies existing learning principles. As overviewed in Sect.4.2.2 of the recent paper (Xu, 2012), this Ying-Yang best matching covers not only the above top-down and bottom-up unidirectional proximities but also their various variants, especially the Helmholtz machine and variational Bayes. As stated after eq.(18), maximizing $$H(p \Vert q)$$ actually further covers minimizing $$KL(p \Vert q)$$ as a degenerated case. Generally, Ying-Yang best harmony via maximizing $$H(p \Vert q)$$ by eq.(2) performs Ying-Yang best matching plus minimizing the complexity of the BYY system, as addressed in the previous subsection from three information theoretic perspectives. In the sequel, we provide more insights from two additional information theoretic perspectives.

As stated after eq.(18), conventional information theoretic studies do not regard maximizing $$H(p \Vert q)$$ as playing a role as fundamental as those of its two degenerated bi-relation measures in eq.(18), because one of its cases is equivalent to minimizing $$KL(p \Vert q)$$, while the other was regarded as unfavorable and thus replaced by $$KL(p \Vert q)$$. Therefore, it was misunderstood that maximizing $$H(p \Vert q)$$ is less useful than minimizing $$KL(p \Vert q)$$. Interestingly, such an unfavorable scenario disappears when $$p, q$$ are given by the Ying-Yang pair in eq.(1). Because $$p = p(R|X)p(X)$$ includes $$p({ X} )$$ in eq.(3), $$\max_p H(p \Vert q)$$ for a fixed $$q$$ cannot push $$p(R|X)p(X)$$ to entirely become a useless $$\delta$$ distribution, but pushes $$p(R|X)$$ into a most compact form under the constraints of $$p({ X} )$$ and also of any structure imposed on $$p(R|X)$$. Moreover, $$\max_q H(p \Vert q)$$ for a fixed $$p$$ forces the Ying machine $$q(X|R)q(R)$$ to best match $$p(R|X)p(X)$$ and accordingly to become more compact too, which provides further evidence that maximizing $$H(p \Vert q)$$ is actually more favorable.

Another information-theoretic perspective on maximizing $$H(p \Vert q)$$ by eq.(2) can be found in Sect. II(C), Sect. II(E), and Fig. 3 of (Xu, 2004a), which introduces a three-level encoding scheme for optimal communication that differs from both the conventional MDL and the bits-back MDL (Hinton & Zemel, 1994).

## BYY system design: Ying-Yang coordination and all-in-one formulation

Taking a key role in the operating flow within the BYY system, $$Y$$ is supported by a parametric substructure $$q(Y|\theta^y)\,$$. On one hand, $$Y$$ is the source of the operating flow that fits the observations $$X$$ via the top-down pathway $$q({ X} | Y, \theta^{x|y} ).$$ On the other hand, this $$Y$$ comes via the bottom-up pathway $$p(Y| { X}, \theta^{y|x})\,$$ from a set of samples $$X=\{x_t\}_{t=1}^N \,$$, typically as $$p({ X} )$$ in eq.(3).

Except for $$p({ X} )$$, a design is needed for the other components of the BYY system by eq.(1). Referring to Figure 3(b), (c) & (d), Ying is primary, featured with a compact capacity in a least complexity. Subject to the nature of the learning tasks, $$q(Y|\theta^y)$$ should be in a structure such that the inner representation of $$X=\{x_t\}_{t=1}^N$$ is encoded with as little redundancy as possible, while $$q(X | Y, \theta_k^{x|y})$$ is usually a mixture of simple structures. Also, the design is featured by a trade-off between $$q(Y|\theta^y)$$ and $$q({X} | Y, \theta_k^{x|y})$$. Moreover, each component of a prior $$q(\theta)$$ is chosen according to the roles of the corresponding parameters. Readers are referred to Sect.3.2.1 in the paper (Xu, 2012) for a number of topics about designing the Ying structure.

In addition to the design issues considered for a conventional latent or generative model, one particular point for the BYY system is that the complexity or scale of $$q(Y|\theta^y)$$ can be simply trimmed down via a subset of parameters; e.g., a Gaussian mixture in eq.(5) is trimmed down to a scale $$k-1$$ as some $$\alpha_{j} \to 0$$. These parameters should be estimated via learning. Simply fixing them by design, as encountered in some existing studies, makes the model selection performance unfavorable.

Another example is the classic factor analysis that considers $\tag{21} q(x_t | y_t, \theta ^{x|y})= G(x_t| Ay_t, D _{\sigma }), \ q(y_t | \theta ^y)= G(y_t|0,I), \ \ D _{\sigma }=diag[\sigma_1^2, \cdots, \sigma_d^2],$ to model samples of $$x_t$$ by a Gaussian $$G(x|0, \Sigma)$$ via $$\Sigma= D _{\sigma }+AA^{ T}$$. Alternatively, we may also have the following parameterization (Xu, 1997a): $\tag{22} q(x_t | y_t, \theta ^{x|y})= G(x_t| Ay_t, D _{\sigma }), \ A^TA=I, \ q(y_t | \theta ^y)= G(y_t|0, \Lambda).$ For short, we use FA-a to denote the classic factor analysis and FA-b to denote this alternative. The two FA types are equivalent in terms of maximum likelihood learning as long as $$A_aA_a^{ T}=A_b\Lambda A_b^{ T}$$, or $$A_a=A_b\Lambda^{0.5}$$, where $$A_a$$ and $$A_b$$ correspond to $$A$$ in FA-a and FA-b, respectively. However, the BYY harmony learning produces different model selection performances on the two FA types; see Item 9.4 in Ref. (Xu, 1997a). Extensive empirical experiments in one recent paper (Tu & Xu, 2011) have further shown that both the BYY harmony learning and VB perform reliably and robustly better on FA-b than on FA-a, while BYY further outperforms VB considerably, especially on FA-b.
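This equivalence of the two parameterizations under maximum likelihood is easy to check numerically: with $$A_b^TA_b=I$$ and $$A_a=A_b\Lambda^{0.5}$$, both yield the same marginal covariance. A minimal sketch (the dimensions and numeric values are arbitrary, not from the cited papers):

```python
import numpy as np

# Numerical check (hypothetical sizes): FA-a with A_a = A_b Lambda^{1/2}
# and FA-b with orthonormal A_b give the same marginal covariance Sigma.
rng = np.random.default_rng(1)
d, m = 5, 2
A_b, _ = np.linalg.qr(rng.standard_normal((d, m)))   # columns: A_b^T A_b = I
Lam = np.diag([3.0, 1.5])                            # factor variances (FA-b)
D_sigma = np.diag(rng.uniform(0.2, 0.5, d))          # noise variances
A_a = A_b @ np.sqrt(Lam)                             # FA-a loading matrix
Sigma_a = D_sigma + A_a @ A_a.T                      # FA-a: q(x) = G(x|0, Sigma_a)
Sigma_b = D_sigma + A_b @ Lam @ A_b.T                # FA-b: q(x) = G(x|0, Sigma_b)
```

The two covariances agree exactly, so the likelihoods coincide; the difference between FA-a and FA-b only shows up in how model selection acts on $$\Lambda$$.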

Still referring to Figure 3(b), (c) & (d), Yang is secondary and vigorous in adapting to the variety of not only the external world but also the dynamic range of the Ying. Therefore, the Yang $$p({ R}| { X})p({ X})$$ is designed based on the Ying $$q({X} | {R})q({R})$$, under a principle of matched dynamic range or variety conservation between Ying and Yang. One rationale is that the representation of $$R$$ by Yang should be more versatile than the one by Ying, from which we may examine each subset $$\Phi\subseteq R$$ as follows $\tag{23} \begin{array}{l} p({ \Phi} |{ X})\le q({ \Phi} |{ X})={ q({X} | {\Phi})q({\Phi})\over \int_{ \Psi \in \mathcal{D}_q} q({X} | {\Psi})q({\Psi})d\Psi}, \ \mbox{for each} \ \Phi \in \mathcal{D}_p\subseteq \mathcal{D}_q, \\ \mbox{subject to} \ q(\Phi^* |{ X})\le \int_{ \Psi \in \mathcal{D}_p} p({ \Psi} |{ X})d\Psi =s(X)\le 1, \end{array}$ where $$\Phi^* =arg\max_\Phi q(\Phi |{ X})$$ and $$\mathcal{D}_q$$ is the domain of $${ \Phi}$$ by the Ying. The strongest preservation between Ying and Yang is $$p({ \Phi} |{ X})= q({ \Phi} |{ X}), \forall \Phi \in \mathcal{D}_q$$, which is achieved at $$s(X) =1$$ or equivalently at $$\mathcal{D}_p=\mathcal{D}_q$$. With $$s(X) < 1$$, maximizing $$H(p \Vert q)$$ by eq.(2) with respect to a free $$p({ \Phi} |{ X})$$ leads to a relaxed preservation that $$p({ \Phi} |{ X})= q({ \Phi} |{ X}), \ \forall \Phi \in \mathcal{D}_p \subset\mathcal{D}_q.$$

To be more specific, $$\mathcal{D}_p$$ consists of an apex realm that varies with the input $$X$$ and focuses on those possible representations of $$X$$ with high probabilities. For a given $$X\ ,$$ we have $$\mathcal{D}_p=\mathcal{D}^{\rho}_{\Phi^*}$$ given as follows: $\tag{24} \begin{array}{l} \mathcal{D}^{\rho}_{\Phi^*}=\{\Phi: \Phi \in \mathcal{D}_q \ \mbox{with}\ q({ \Phi} |{ X}) +\rho \ge q({ \Phi^*} |{ X}) \ \mbox{for} \ \rho>0\}, \end{array}$ which may be called the $$\rho$$-apex zone corresponding to $$X\ .$$ One degenerated case is $$\mathcal{D}_p=\mathcal{D}^{\rho=0}_{\Phi^*}$$, which consists of only the apex point $$p({ \Phi} |{ X})= q({ \Phi^*} |{ X}),$$ e.g., occurring when $$s(X)=q(\Phi^* |{ X}).$$

Taking the Gaussian mixture in eq.(5) as an example, we have $\tag{25} \begin{array}{l} p(\ell |x_t) \le q(\ell |x_t, \theta) = {\alpha_{\ell}G(x_t|\mu_{\ell},\Sigma_{\ell}) \over \sum_{j \in \mathcal{C}_{\kappa}(x_t)} \alpha_{j}G(x_t|\mu_{j},\Sigma_{j})}, \forall \ell\in \mathcal{C}_{\kappa}(x_t) \ \mbox{with} \ \kappa \le k, \end{array}$ where $$\mathcal{C}_{\kappa}(x_t) \subset \{1,2,\dots, k\}$$ is the $$\kappa$$-apex zone that consists of the $$\kappa$$ indices corresponding to the first $$\kappa$$ largest values of $$\alpha_{\ell}G(x_t|\mu_{\ell},\Sigma_{\ell}).$$ When $$\kappa=k$$, $$q(\ell|x_t, \theta)$$ is directly the Bayesian posterior. Generally, the pre-specified $$\mathcal{C}_{\kappa}(x_t)$$ makes $$q(\ell|x_t, \theta)$$ focus on the $$\kappa$$-apex zone when $$\kappa<k$$.
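Eq.(25) can be sketched directly (the function name is ours): with $$\kappa=k$$ it reduces to the ordinary Bayesian posterior, while with $$\kappa<k$$ the mass is renormalized over the apex zone and components outside it get probability zero.

```python
import numpy as np

def apex_posterior(x, alphas, mus, Sigmas, kappa):
    """Posterior over mixture components restricted to the kappa-apex zone,
    as in eq.(25): renormalize alpha_l G(x|mu_l, Sigma_l) over the kappa
    components with the largest joint values (a sketch)."""
    k, d = len(alphas), len(x)
    joint = np.empty(k)
    for l in range(k):
        diff = x - mus[l]
        quad = diff @ np.linalg.solve(Sigmas[l], diff)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[l]))
        joint[l] = alphas[l] * np.exp(-0.5 * quad) / norm
    apex = np.argsort(joint)[::-1][:kappa]        # indices of C_kappa(x)
    post = np.zeros(k)
    post[apex] = joint[apex] / joint[apex].sum()  # zero outside the apex zone
    return post
```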

For a subset $$\Phi\subseteq R$$ that consists of real numbers, one usually encounters an implementation difficulty in getting $$q({ X})\ ,$$ except in special cases where the integral over $$\Phi$$ is analytically solvable. In this case, the inequality in eq.(23) implies that $\tag{26} \begin{array}{l} Var_{p({ \Phi} |{ X})}(\Phi) \ge \Pi^{-1}, \ \Pi=-\nabla^2_{\Phi\Phi^T} \ln{[ q({X} | {\Phi})q({\Phi})]}, \end{array}$ which is consistent with, as previously discussed in (Xu, 2004b, p889), the celebrated Cramer-Rao inequality, which states that the inverse of the Fisher information provides a lower bound on the covariance matrix of any unbiased estimator of $${ \Phi} \ ,$$ where $$\nabla^2_{uu^T} f(u)={\partial ^2 f(u)/\partial u \partial u^T}\$$ and $$Var_{p(u)}(u)$$ denotes the covariance matrix of $$u\ .$$

For two positive definite matrices $$A, B\ ,$$ $$A \ge B$$ means $$u^TAu \ge u^TBu$$ for any $$u\ .$$ A simple example is $\tag{27} \begin{array}{l} A=B+\rho, \ \mbox{for} \ \rho=diag[\rho_1, \cdots, \rho_m] \ \mbox{with} \ \rho_j \ge 0, \forall j. \end{array}$

Taking the factor analysis in eq.(22) as an example, we have $\tag{28} \begin{array}{l} p(y_t | x_t, \theta)= G(y_t| Wx_t, \Pi^{-1}+\rho), \ \Pi=A^T D _{\sigma }^{-1}A + \Lambda^{-1}. \end{array}$

Both the Gaussian mixture in eq.(5) and the factor analysis by eq.(21) or eq.(22) are simple examples of eq.(4), i.e., the BYY system for learning tasks with a set $$X$$ of i.i.d. samples. For tasks with dependence across samples, we consider the BYY system by eq.(3) with a set of samples expressed as a $$d \times N$$ data matrix $$X=\{x_t, t=1, \cdots, N\}$$ and accordingly its inner representation $$Y=\{y_t, t=1, \cdots, N\}$$ as an $$m \times N$$ data matrix. One simple but widely studied example is the generalization of the sample-wise linear system $$x_t=Ay_t +e_t$$ into its matrix counterpart, as illustrated in Figure 4. The columns of $$A$$ form a coordinate system for a manifold of the latent factors $$Y=\{y_t, t=1, \cdots, N\},$$ and thus we call it manifold factor analysis. Readers are referred to Fig. 1 of a recent paper (Xu, 2011) for a family of learning tasks summarized under this model.

The column-wise dependence among $$Y$$ is assumed to preserve the local topological structure across the columns of $$X$$, with the help of what is called the Laplacian eigenmap $$L$$ (Belkin & Niyogi, 2003; He & Niyogi, 2003). In the manifold learning literature, the aim is minimizing $$Tr[YLY^{T}]$$ to get a locality preserving projection $$Y = WX.$$ Using the Laplacian $$L$$ for modeling the distribution of $$Y$$, both the previous FA-a and FA-b can be generalized into their manifold FA counterparts, distinguished by whether the diagonal matrix $$\Lambda$$ is added to take an automatic model selection role via deleting row dimensions of $$Y$$; see Sect.3.2 of a recent paper (Xu, 2011) and Sect.5.3.2 of another paper (Xu, 2012).
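The Laplacian $$L$$ itself is built from a neighbourhood graph over the columns of $$X$$. A minimal sketch using a plain k-nearest-neighbour graph with 0/1 weights, which is one of several constructions in the Laplacian-eigenmap literature (the function name is ours):

```python
import numpy as np

def graph_laplacian(X, k=3):
    """k-nearest-neighbour graph Laplacian L = D - W over the columns of a
    d x N data matrix X, with symmetrized 0/1 weights (a simple choice
    among those used for Laplacian eigenmaps)."""
    N = X.shape[1]
    # squared Euclidean distances between all pairs of columns
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    W = np.zeros((N, N))
    for t in range(N):
        nn = np.argsort(d2[t])[1:k + 1]   # k nearest neighbours, skipping t itself
        W[t, nn] = 1.0
    W = np.maximum(W, W.T)                # symmetrize the adjacency
    return np.diag(W.sum(axis=1)) - W
```

By construction $$L$$ is symmetric positive semi-definite with zero row sums, so $$Tr[YLY^T]$$ penalizes differences between representations of neighbouring samples.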

Considering that the elements of $$Y$$ are Gaussian and that the i.i.d. noise $$q(E) =\prod_t G(e_t|0, D _{\sigma })$$ is also Gaussian, we have the following Ying system $\tag{29} \begin{array}{l} q(Y| \theta_k^{ y}) = G(Y|\Lambda)=\frac{|L |^{0.5m} }{(2\pi)^{0.5mN}|\Lambda |^{0.5N}}\exp\{-\frac{1}{2}Tr[ Y^{T}\Lambda^{-1}YL]\}, \\ q({X} | {R})=G(X|AY, D _{\sigma })=\frac{ 1}{(2\pi)^{0.5dN}| D _{\sigma } |^{0.5N}}\exp\{-\frac{1}{2}Tr[ (X-AY)^{T} D _{\sigma } ^{-1}(X-AY)]\}, \\ \Lambda =diag[\lambda_1, \cdots, \lambda_m], \, \, \, D _{\sigma }=diag[\sigma_1^2, \cdots, \sigma_d^2]. \end{array}$ Again, from eq.(2) we may extend the Yang system by eq.(28) into $\tag{30} \begin{array}{l} G(vec[Y]\ |\ vec[WX], \Pi^{-1}+\rho), \, \, \, \Pi= I \otimes (A^TD _{\sigma }^{ -1}A) + L\otimes \Lambda^{ -1}, \end{array}$ where $$\otimes$$ denotes the Kronecker product and $$vec[A]$$ denotes the vectorization of a matrix $$A$$. Also, we may obtain $$p(Y| { X}, \theta_k)$$ with $$Y$$ in matrix format by a vector-matrix mapping, though this is usually not explicitly needed during the learning implementation introduced in the next subsection.
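The structure of the precision matrix $$\Pi$$ in eq.(30) can be assembled with two Kronecker products. A small sketch (function name and dimensions are ours): when $$L=I$$, $$\Pi$$ reduces to $$N$$ independent copies of the per-sample precision $$A^TD_{\sigma}^{-1}A + \Lambda^{-1}$$ of eq.(28).

```python
import numpy as np

def yang_precision(A, D_sigma, Lam, L):
    """Assemble Pi = I_N kron (A^T D^-1 A) + L kron Lambda^-1 from eq.(30)."""
    N = L.shape[0]
    Dinv = np.diag(1.0 / np.diag(D_sigma))
    Lam_inv = np.diag(1.0 / np.diag(Lam))
    return np.kron(np.eye(N), A.T @ Dinv @ A) + np.kron(L, Lam_inv)
```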

Last but not least, readers are referred to Sect.3.2.2 of the paper (Xu, 2012) for a number of topics on Yang structure design, and further to its Sect.3.3.2 for insights on how the above introduced BYY system is directly applicable to supervised and semi-supervised learning. Actually, the BYY system provides a unified framework that accommodates unsupervised, supervised, and semi-supervised learning all in one formulation, with four typical scenarios summarized in Table 2 of the paper (Xu, 2012), each scenario differing from the others simply in special settings of one or more ingredients.

## Learning implementation, apex approximation, and manifold FA algorithm

To implement the maximization of $$H(p \Vert q)$$ by eq.(2), one key task is handling the integrals of the types $$\int[\cdot]{ d}\theta$$ and $$\int[\cdot]{ d}{Y}$$. One direction is solving the integrals analytically in those cases where the structures of the components in eq.(3) make the integrals tractable, especially when the components of Yang and the components of Ying take structures in conjugate pairs. After removing the integrals analytically, we get $$H(p \Vert q)$$ expressed as an objective function $$H(k, \Xi)$$ with respect to discrete variables in $$k$$ and continuous variables in $$\Xi$$, where $$\Xi$$ consists of hyper-parameters $$\Xi_q$$ in $$q(\theta|\Xi_q)$$ and $$\Xi_p$$ in $$p(\theta|X, \Xi_p)$$. Its maximization is made by the following two-stage iteration:

• Stage I: enumerate $$k$$ and get $\tag{31}\Xi^*_k=\arg\max_{\Xi} H(k, \Xi)$
• Stage II: get $$k^*=\arg\max_k H(k, \Xi^*_k).$$

Usually, $$\Xi^*_k$$ is estimated via a gradient ascent updating based on $$\nabla_{\Xi}H(k, \Xi)$$.
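The two stages can be sketched as follows. The function names, the numerical gradient, and the toy objective in the usage are our own assumptions; a real implementation would use the analytic gradient $$\nabla_{\Xi}H(k, \Xi)$$.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient, standing in for an analytic gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def two_stage_max(H, k_range, xi0, lr=0.05, steps=300):
    """Two-stage maximization of H(k, Xi) (a sketch).
    Stage I (eq.31): for each enumerated k, gradient-ascend on the
    continuous variables Xi.  Stage II: pick k* maximizing H(k, Xi*_k)."""
    results = {}
    for k in k_range:
        xi = np.array(xi0, dtype=float)
        for _ in range(steps):                           # Stage I
            xi += lr * numerical_grad(lambda z: H(k, z), xi)
        results[k] = (H(k, xi), xi)
    k_star = max(results, key=lambda k: results[k][0])   # Stage II
    return k_star, results[k_star][1]
```

For instance, on the toy objective $$H(k, \Xi) = -(\Xi_1-k)^2-(k-3)^2$$ the procedure recovers $$k^*=3$$ with $$\Xi^*_3\approx 3$$.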

In many tasks, solving the integrals analytically is not tractable, and thus some approximation is needed. One technique is called apex approximation. Given a distribution $$p(u)$$ and a concave function $$Q(u|\theta)$$, we consider the following approximation $\tag{32} \begin{array}{l} \int p(u)Q(u|\theta)du \approx Q(u^*|\theta)- R_Q^u(\theta), \, \, \, u^*=arg\max_u \ Q(u|\theta). \end{array}$ When $$u$$ is discrete, the integral is a summation, and $$R_Q^u(\theta)$$ could simply be $\tag{33} \begin{array}{l} R_Q^u(\theta)=-\sum_{u \in {D}^{\rho}_{u^*}}Q(u|\theta). \end{array}$ One simple example of $${D}^{\rho}_{u^*}$$ is $$\mathcal{C}_{\kappa}(x_t)$$ in eq.(25); another example is $$C_t^\kappa ( x_t )$$ by Eq.(20) of the paper (Xu, 2012), where each element $$u$$ is a binary vector and thus $$C_t^\kappa$$ consists of those vectors that differ from $$u^*$$ by merely one bit.

When $$u$$ is continuous, with the help of the Taylor expansion of $$Q(u|\theta)$$ around $$u^{*}$$ up to the second order, we get $\tag{34} \begin{array}{l} R_Q^u(\theta)=\frac{1}{2}{\rm Tr}[(\varepsilon _u\varepsilon_u^{\rm T}+\Gamma^u)\Pi(u^{*})], ~\varepsilon _u =\eta_u -u^*, \, \, \, \Pi(u)=-\frac{\partial^2Q(u|\theta)}{\partial u\partial u^{\rm T}}, \end{array}$ where $$\eta_u,\Gamma ^{u}$$ are the mean and the covariance of $$p(u)$$, respectively.
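Since the remainder in eq.(34) comes from a second-order Taylor expansion, the approximation in eq.(32) is exact whenever $$Q$$ is quadratic, which gives a quick sanity check. The sketch below verifies this numerically for a 1-D concave quadratic (all numbers are arbitrary):

```python
import numpy as np

# Sanity check of eq.(32) with the remainder of eq.(34):
# for a concave quadratic Q the apex approximation is exact.
a, b, c = 2.0, 1.5, 0.7        # Q(u) = b - 0.5*a*(u - c)^2; u* = c, Pi(u*) = a
eta, gamma2 = 0.3, 0.4         # mean and variance of p(u) = G(u|eta, gamma2)

u = np.linspace(-10.0, 10.0, 200001)
du = u[1] - u[0]
p = np.exp(-0.5 * (u - eta) ** 2 / gamma2) / np.sqrt(2.0 * np.pi * gamma2)
Q = b - 0.5 * a * (u - c) ** 2
lhs = np.sum(p * Q) * du                    # int p(u) Q(u) du, numerically
R = 0.5 * ((eta - c) ** 2 + gamma2) * a     # R_Q^u(theta), eq.(34) in 1-D
rhs = b - R                                 # Q(u*) - R_Q^u(theta)
```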

With the notation $$H(\theta, k)$$ in eq.(14) and $$L(X,Y,\theta)$$ in eq.(16), we rewrite eq.(2) as $$H(p \Vert q) = \int p(\theta | {X}) H(p \Vert q, \theta, \Xi )d\theta$$ and apply eq.(34) to it, resulting in $\tag{35} \begin{array}{l} H(p \Vert q) = H(\theta^*, k) -R_H^{\theta}(\Xi, k)-R_q^{\theta}(\Xi, k), \, \, \, \theta^*=arg\max_{\theta}\ H(p \Vert q, \theta, \Xi ), \end{array}$ where $$R_H^{\theta}(\Xi, k)$$ is the counterpart of $$R_Q^u(\theta)$$, and we have $\tag{36} R_q^{\theta}(\Xi, k)= \begin{cases} -\ln{q( \theta|\Xi)}, & \mbox{ Choice (a),} \\ -\int p(\theta | {X}) \ln{q( \theta^*|\Xi)}d\theta, & \mbox{Choice (b).}\end{cases}$ When $$Y$$ consists only of discrete variables, $$H(\theta, k)$$ contains a summation over $$Y$$; we can either use the apex approximation by eq.(32) and eq.(33) or simply forgo it when the summation involves only a few terms. One example is $$H(\theta, k)$$ in eq.(6). When $$Y$$ is continuous, $$H(\theta, k)$$ in eq.(14) still contains an integral over $$Y$$, for which we may again use the apex approximation by eq.(32) and thus get $\tag{37} \begin{array}{l} H(p \Vert q) = L(X,Y^*,\theta^*) -R(\Xi, k), \, \, \, \ Y^*=arg\max_{Y} \ L(X,Y,\theta), \\ R(\Xi, k)=R_H^{\theta}+R_q^{\theta}(\Xi, k)+R_L^Y(\theta^*,\Xi, k)+R_L^X(\theta^*,\Xi, k), \\ R_L^X(\theta^*,\Xi, k)=-\frac{1}{2}Tr[h^2 \frac{\partial^2 L(X,Y,\theta)}{\partial vec[X] \partial vec^T[X]}]= Nh^2Tr[D _{\sigma }^{-1}], \end{array}$ where $$R_L^Y(\theta^*,\Xi, k)$$ is the counterpart of $$R_Q^u(\theta)$$ in eq.(34). Similar to the term $$\varepsilon_t(\theta_{\ell})$$ in eq.(6), the term $$L(X,Y^*,\theta^*)$$ describes the fit of the data $$X$$ aided by the inner representation $$Y^*$$. The term $$R(\Xi, k)$$ takes the same role as the term $$r_t(\theta_{\ell})$$ in eq.(6).
In addition to the contributions from the prior and the smoothness of $$L(X,Y,\theta)$$ with respect to $$X$$, there are additionally a term $$R_H^{\theta}$$ for the smoothness of $$H(p \Vert q, \theta, \Xi )$$ with respect to $$\theta$$ and a term $$R_L^Y(\theta^*,\Xi, k)$$ for the smoothness of $$L(X,Y,\theta)$$ with respect to $$Y$$.

In implementation, we maximize $$H( \Xi, k)=H( \theta^*, \Xi, k)$$ for parameter learning, which consists of alternatively updating the real parameters $$\theta^*, \Xi^*$$. When $$Y$$ is continuous, parameter learning also includes updating the real variables $$Y^*$$. Moreover, in many cases there may be no updating of $$\Xi^*$$ at all. Model selection occurs not only by the discrete optimization of Stage II following eq.(31) but also via automatic model selection during parameter learning.

The Yang-Ying alternation in eq.(7) for learning a Gaussian mixture is one of the simplest examples of implementing eq.(35), with $$H(\theta, k)$$ given by eq.(6). As an example of handling $$H(\theta, k)$$ by eq.(37), in the sequel we consider manifold factor analysis. It follows from eq.(29) that eq.(37) becomes $\tag{38} \begin{array}{l} L(X,Y,\theta)=\ln[G(X|AY, D _{\sigma }) G(Y|\Lambda)], \, \, \, R_L^Y(\theta^*,\Xi, k)=\frac{1}{2}Tr[\Gamma \Pi], \\ \Gamma = \Pi^{-1}(\theta)+vec[\delta_Y] vec^T[\delta_Y], \, \, \, \delta_Y =Y^*-WX. \end{array}$ Particularly, with $$\Pi$$ given by eq.(30) at $$\rho=0$$, we have $\tag{39} \begin{array}{l} Tr[\Gamma \Pi(\theta)] = mN+ Tr[\delta_Y^T A^TD _{\sigma }^{ -1}A \delta_Y]+ Tr[\delta_Y ^{T}\Lambda^{ -1}\delta_Y L]. \end{array}$

The maximization of $$\theta^*=arg\max_{\theta}\ H(p \Vert q, \theta, \Xi)$$ can be implemented by the following Ying-Yang alternation:

• Yang step: get $$Y^*=arg\max_Y L(X,Y,\theta^{old})$$ by solving for the root $$Y^*$$ of the following equation

$\tag{40} \begin{array}{l} \Lambda^{old }A^{old \ T } D _{\sigma }^{old \ -1}X = Y L+ (\Lambda^{old }A^{old \ T } D _{\sigma }^{old \ -1}A^{old}) Y, \end{array}$ which is a Sylvester equation that can be solved by one of the existing techniques (Bartels & Stewart, 1972).

• Ying step: update each $$\theta \in \{ A, D _{\sigma }, \Lambda, W, h\}$$ by either

$\tag{41} \begin{array}{l} \theta^{new}-\theta^{old} = \gamma (\theta^{*}-\theta^{old} ), \end{array}$ or $\tag{42} \begin{array}{l} \theta^{new}-\theta^{old} \propto G(\theta^{old})= \nabla_{\theta} H(p \Vert q, \theta, \Xi ), \end{array}$ during which we delete the $$\ell$$th row of $$Y$$ and the $$\ell$$th column $$a_\ell$$ of $$A$$ if the corresponding $\tag{43} \begin{array}{l} \lambda_\ell\Vert a_\ell\Vert \to 0. \end{array}$ Ignoring the prior $$q( \theta|\Xi)$$, we may even obtain $$\theta^{*}$$ analytically as the root of the equation $$G(\theta) =0$$, resulting in $\tag{44} \begin{array}{l} A^{new}=XY^{*\ T}( Y^{* }Y^{*\ T}+\delta_Y \delta_Y ^T )^{-1}, \ \delta_Y =Y^*-W^{old}X, \ W^{new}= Y^{*}X^T(XX^T)^{-1}, \\ D _{\sigma }^{new}=h^2I+\frac{1}{N}diag[EE^{T} +A^{old}\delta_Y \delta_Y^T A^{old \ T}], \ E=X-A^{old}Y, \\ \Lambda^{new}= \frac{1}{N}diag[ Y^{*}L Y^{*\ T}+\delta_YL \delta_Y ^T], \end{array}$ where $$diag[B]$$ denotes the diagonal part of the matrix $$B$$. We may turn the above Ying-Yang alternation into an EM algorithm for ML learning by simply replacing $$\Gamma = \Pi^{-1}(\theta)+vec[\delta_Y] vec^T[\delta_Y]$$ in eq.(38) with $$\Gamma = \Pi^{-1}(\theta^{old})$$.
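The Sylvester equation of the Yang step, eq.(40), can be solved by vectorization with Kronecker products. A minimal numpy sketch (the function name and the small dimensions in the test are our own; a production implementation would use a dedicated Bartels-Stewart solver instead of the dense Kronecker system):

```python
import numpy as np

def yang_step(A, D_sigma, Lam, L, X):
    """Solve the Sylvester equation of eq.(40),
        B Y + Y L = C,  with  B = Lam A^T D^-1 A  and  C = Lam A^T D^-1 X,
    via vectorization: (I_N kron B + L^T kron I_m) vec(Y) = vec(C)."""
    m, N = A.shape[1], X.shape[1]
    Dinv = np.diag(1.0 / np.diag(D_sigma))
    B = Lam @ A.T @ Dinv @ A
    C = Lam @ A.T @ Dinv @ X
    M = np.kron(np.eye(N), B) + np.kron(L.T, np.eye(m))
    vecY = np.linalg.solve(M, C.flatten(order="F"))   # column-stacked vec(Y)
    return vecY.reshape((m, N), order="F")
```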

In general, the implementation is made by eq.(42), because solving the root of $$G(\theta) =0$$ may not be as easy as in eq.(44). Moreover, the term $$\delta_YL \delta_Y ^T$$ in obtaining $$\Lambda^{new}$$ will also be replaced by a full-rank term that keeps $$\Lambda^{new}$$ full rank. In contrast, both the term $$Y^{*}L Y^{*\ T}$$ and the term $$\delta_YL \delta_Y ^T$$ in eq.(44) can become rank-deficient during the Ying-Yang alternation, which makes automatic model selection possible via deleting the $$\ell$$th row of $$Y$$ and the $$\ell$$th column $$a_\ell$$ of $$A$$ whenever eq.(43) holds.

The Ying-Yang alternation can be extended to the cases where the elements of $$Y$$ are discrete, for which the Yang step is modified to tackle a discrete optimization $$Y^*=arg\max_Y L(X,Y,\theta^{old})$$ instead of solving eq.(40). Moreover, learning can be implemented similarly when we have either a non-informative prior (e.g., the Jeffreys prior) or a parametric prior $$q( \theta|\Xi)$$. All that is needed is to replace $$G(\theta)= \nabla_{\theta} H(\theta, k)$$ with $$G(\theta)= \nabla_{\theta} H(p \Vert q, \theta, \Xi )$$ in deriving the updating equations, e.g., eq.(44).

The scenario of considering $$p({ X} )$$ in eq.(3) can be handled similarly too, with one additional term added to $$\ln{q( \theta|\Xi)}$$; details are referred to the last part of Sect. 4.3.1 of one recent overview (Xu, 2012), where we further observe how $$H( \theta^*, \Xi, k)$$ in eq.(35) can be used to replace $$H(p \Vert q, \theta, \Xi )$$ and how $$H( \Xi, k)=H( \theta^*, \Xi, k)$$ is put into eq.(31) for learning the hyper-parameters $$\Xi$$ (if any).

Last but not least, maximizing $$H(\theta, k)$$ with or without a prior and maximizing $$H( \theta^*, \Xi, k)$$ for optimizing hyper-parameters both contribute to model selection; additionally, the discrete optimization of Stage II in eq.(31) will further improve model selection, because $$J(k)=-H( \theta^*, \Xi, k)$$ usually contains some integer terms that are not differentiable and thus cannot be used in gradient-based updating, but are still useful for model selection, e.g., the term $$mN$$ in eq.(39). For a Gaussian mixture, such a $$J(k)$$ is given by Eq.(19) of the paper (Xu, 2012). For the above manifold factor analysis, we have $\tag{45} \begin{array}{l} \frac{2}{N}J(m)=\ln{| D _{\sigma }|} +h^2Tr[ D _{\sigma }^{-1}]-\frac{2}{N}\ln{[G(Y^*|\Lambda)q(A)]}+m +\frac{1}{N}(md+m), \end{array}$ which is actually equivalent to $$J(m)$$ by Eq.(29) in the paper (Xu, 2011), where $$m$$ was missing due to a typo and $$q(A)$$ was given by a matrix normal distribution.

## Learning regularization, sparse learning, and automatic model selection

There are several concepts closely related to efforts at reducing the complexity of an oversized model used for learning a finite set of samples $$X=\{x_t\}_{t=1}^N$$. Recently, a brief overview of the differences and relations among these concepts has been given in Sect.2.1 (particularly see its second half) of the paper (Xu, 2011). The key points about these concepts are summarized below:

• The concept of model complexity can be traced back to efforts started in the 1960s on what was later called Kolmogorov complexity (Solomonoff, 1963; Kolmogorov, 1964). These efforts aim at counting complexity in units of a universal gauge, e.g., the VC dimension (Vapnik, 1995) or the yardstick (Rissanen, 1986, 2007), and then a mathematical relation is built between this complexity and either, indirectly, the code length (Rissanen, 1986, 2007) or, directly, the model generalization performance, e.g., structural risk minimization (Vapnik, 1995).
• The concept of model selection dates back decades, to linear regression studies on selecting the number of variables (Gorman, 1966; Mallows, 1973), clustering analysis on the number of clusters (Wallace & Boulton, 1968), and time series modeling on the order of an autoregressive model (Akaike, 1974). These studies also aim at a mathematical relation between model complexity and model performance. Differently, they count model complexity by a task-dependent intrinsic number of components that compose the model, e.g., the number $$k$$ of Gaussian components in a Gaussian mixture or the dimension $$m$$ in the factor analysis model by eq.(22). The resulting complexity is actually a function of $$k$$ or $$m$$, which avoids the difficulty of computing the complexity of each component by a universal gauge and thus makes model selection more practical.
• The concept of automatic model selection refers to a learning procedure during which model selection occurs automatically. One early effort is RPCL learning (Xu, Krzyzak, & Oja, 1992, 1993). The nature of automatic model selection comes not only from RPCL-like mechanisms but also from BYY learning, as well as from Bayes learning with an appropriate prior. Generally, it is associated with two features. First, each model component is associated with a subset $$\theta_e$$ of parameters and an indicator $$\rho(\theta_e)$$ such that the component is effectively discarded if $$\rho(\theta_e)=0$$; e.g., a Gaussian component is discarded if either or both of its corresponding $$\alpha_l=0$$ and $$Tr[\Sigma_l]=0$$. Second, the learning approach has a mechanism that automatically drives $$\rho(\theta_e)\to 0$$ if the corresponding model component is redundant and thus can be effectively discarded. Further details are referred to Page 287 of one recent overview (Xu, 2010). Automatic model selection is a sub-category of model selection. Usually, a learning measure may contain some integer terms that are not differentiable and hence unusable for automatic model selection, but still useful for model selection via a discrete optimization. E.g., typical criteria such as AIC and BIC can be used for model selection but are unable to make automatic model selection.
• The concept of learning regularization dates back decades, to the studies of ill-posed problem solving (Tikhonov & Arsenin, 1977). The key point is making the modeling function smoother such that model complexity is effectively reduced, but without necessarily discarding extra parameters. Such regularization may also be implemented via Bayes learning with an appropriate prior $$q(\theta)$$. Learning regularization behaves differently from model selection. E.g., if we use a polynomial to fit a set of samples from $$y=x^{2}+3x+2$$, model selection desires to force all the terms $$a_{i}x^{i},~i>2$$, to be zero, while minimizing $$\sum_i a_{i}^{2}$$ makes the polynomial smoother but fails to treat the parameters $$a_{i},~i>2$$ differently from $$a_{i},~i\le 2$$.
• The concept of sparse learning or Lasso shrinkage is featured by pruning away individual extra parameters in a model (Williams, 1995; Tibshirani, 1996; Hansen & Goutte, 1997), which is also implemented via Bayes learning with an appropriate prior $$q(\theta)$$, e.g., a Laplace prior. Actually, sparse learning and automatic model selection share some common points. For a mixture of Gaussian components with known means and covariances, Bayes learning on the coefficients $$\alpha_l$$ with a Dirichlet prior $$q(\alpha_l)$$ can be regarded as both sparse learning and automatic model selection, featured by each redundant $$\alpha_l \to 0$$. Generally, if the means and covariances are unknown and to be learned, this learning implements automatic model selection rather than sparse learning. For the factor analysis by eq.(22), model selection means selecting the dimension $$m$$ of $$y_t$$ via pushing $$\lambda_l\to 0$$, or equivalently selecting the column dimension $$m$$ of $$A$$ via pruning away extra individual columns of $$A$$; sparse learning instead focuses on pruning away individual elements of $$A$$, though it can be regarded as implementing a type of automatic model selection if the parameters of one entire column of $$A$$ have all been pruned away.
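The polynomial example above can be made concrete: a ridge penalty $$\sum_i a_i^2$$ shrinks all coefficients of an oversized polynomial without zeroing the extra ones, which is regularization rather than model selection. A sketch (data and penalty weight are arbitrary):

```python
import numpy as np

# Ridge-penalized fit of an oversized degree-5 polynomial to noisy samples
# of y = x^2 + 3x + 2: minimizing squared error plus sum_i a_i^2 shrinks
# every coefficient but zeroes none, unlike model selection.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = x ** 2 + 3.0 * x + 2.0 + 0.01 * rng.standard_normal(50)  # true degree is 2
V = np.vander(x, 6, increasing=True)     # design matrix: columns 1, x, ..., x^5
lam = 0.1                                # ridge penalty weight
a_ridge = np.linalg.solve(V.T @ V + lam * np.eye(6), V.T @ y)
```

None of the six coefficients comes out exactly zero; the extra terms $$i>2$$ are merely shrunk, whereas a model-selection or sparse-learning mechanism would delete them.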

Favorably, the BYY harmony learning accommodates all the above concepts consistently in one system. Similar to Bayes learning approaches, both the nature of learning regularization and the nature of model selection / automatic model selection / sparse learning (shortly MS-AMS-SL) may come from an appropriate prior $$q(\theta_k)$$ in eq.(3). Additionally, the BYY harmony learning has a least-complexity nature via $$q(Y|\theta_k)$$ that leads to the MS-AMS-SL nature even without considering priors. Also, the BYY harmony learning gets learning regularization via $$p(X)$$ with $$h \ne 0$$ in eq.(3), under the name of data smoothing. In the sequel, some further details are obtained from observing the local factor analysis (LFA) shown in Figure 5(a).

For each given value of $$\ell$$, we encounter a variant of FA-b, i.e., the degenerated case $$L=I$$ of the manifold FA in Figure 4, while $$y_{t,\ell}$$ is a simplified version of $$Y^*$$ by eq.(37) and $$p(y|x, \ell)$$ becomes a simplified version of eq.(30) with $$\Sigma_{ \ell}=D _{\sigma }$$. Differently from FA-b by eq.(22), the constraint $$A_{\ell}^TA_{\ell}=I$$ is abandoned to simplify computation, while the scale indeterminacy is tackled by checking eq.(43), with details further referred to Sect.2.2 of a recent paper (Xu, 2011). For ease of computation, we consider the priors in Figure 5(b). Particularly, we extend the use of Gaussian-Jeffreys priors on principal component analysis (PCA) by (Guan & Dy, 2009) to the ones in Figure 5(b)(3) for the LFA.

Specifically, $$A_{\ell}$$ is element-wise independent, and each element comes from a Gaussian with each variance as a hyper-parameter under a Jeffreys prior. The Gaussian nature of $$a_{\ell}^{(ij)}$$ provides learning regularization, while the Jeffreys prior on $$\eta^A_{\ell,ij}$$ prunes an extra $$a_{\ell}^{(ij)}$$ via pushing $$\eta^A_{\ell,ij}\to 0$$, which leads to sparse learning. Moreover, a Jeffreys prior on $$\lambda_{\ell,i}$$ discards an extra dimension of $$y$$ via pushing $$\lambda_{\ell,i}\to 0$$, which leads to automatic model selection. Even without these priors, the BYY harmony learning includes maximizing $$G(y|0,\Lambda_{\ell})$$, which drives an extra $$\lambda_{\ell,i}\to 0$$ and thus leads to automatic model selection too. Furthermore, the BYY harmony learning also gets learning regularization via $$p(X)$$ with $$h \ne 0$$ in eq.(3), like adding $$h^2I$$ to $$D _{\sigma }^{new}$$ in eq.(44).

Putting all such local FA models together, we get a structural Gaussian mixture with each component $$\alpha_{\ell}G(x_t|\mu_{\ell},A_{\ell}\Lambda_{\ell}A_{\ell}^T+\Sigma_{\ell})$$ $$=\alpha_{\ell}\int G(x_t|A_{\ell}y+\mu_{\ell},\Sigma_{\ell})G(y|0,\Lambda_{\ell})dy$$ or alternatively expressed as $$\exp[-\varepsilon_t(y_{t,\ell},x_t,\theta_{\ell})-0.5\ln\{|\Pi_{\ell}^y|/(2\pi)^{m_{\ell}}\}]$$, by which we encounter a situation similar to eq.(6). Actually, the special case $$A=0$$ and $$y=0$$ makes $$\varepsilon(y, x_t, \theta_{\ell})$$ reduce to $$\varepsilon_t(\theta_{\ell})$$ and $$q(\ell|x, \theta)$$ to the one in Figure 1(d), that is, learning a Gaussian mixture by eq.(7), eq.(8) or eq.(9). On top of the RPCL-like nature introduced after eq.(11), we may also consider $$\alpha_{\ell}$$ in a Gaussian mixture with a Dirichlet prior as shown in Figure 5(b)(1), which helps to push each redundant $$\alpha_l \to 0$$ for automatic model selection. Moreover, as shown in Figure 5(b)(3), using a joint Gaussian-Jeffreys prior on $$\mu_{\ell}$$ helps to push Gaussian components to be located more compactly.

Conceptually, those priors studied in the literature on Bayes approaches may be directly adopted. In addition to Jeffreys priors, we may also consider another type of improper priors called induced bias cancellation (IBC), which intends to cancel a biased prior implicitly imposed by a finite size of samples. Readers are referred to Sect. 3.4.3 in Ref. (Xu, 2007a) for a recent overview.

Learning the LFA shown in Figure 6 can also be implemented with the help of the Ying-Yang algorithm by eq.(7), eq.(9), and eq.(10), via the following modifications:

• Yang-step: instead of $$p(j | x_t, \theta ) = q(j|x_t, \theta)$$ given in Figure 1(d), $$p(j | x_t, \theta )$$ is given in Figure 6, where $$p(j | x_t, \theta )$$ equals $$q(j | x_t, \theta )$$ only on an apex zone $$C_{\kappa}(x_t)$$. Also, we need to get $$y_{t,\ell}$$ in addition to getting $$p_{j, t}$$ by eq.(7).
• Ying-step: as shown at the bottom-right of Figure 6, we still update $$\theta_{\ell}^{new}, \ \ell=1, \cdots, k$$ by eq.(9), but with $$G_t(\theta_{\ell})=\{\delta c_{j},\delta \mu_{j},\delta \Sigma_{j},\delta \Lambda_{j},\delta A_{j}, \delta W_{j}\}$$ given as follows:

$\tag{46} \begin{array}{l} \delta c_{j}=p_{j, t}-\alpha_{j}^{old}s_{\cdot t}+\eta_{j}^{\alpha \ old}(p(j|x_t, \theta^{old})-\alpha_{j}^{old}), \\ s_{\cdot t}=\sum_{j} p_{j, t}, \,\, s_{j \cdot}=\sum_{t\in T_t} p_{j, t}, \,\, s^o_{j \cdot}= \sum_{t\in T_t} p(j|x_t, \theta^{old}), \\ \delta \mu_{j}=\sum_{t\in T_t} p_{j, t} e_{t,j}-\Sigma_{j}^{old}(\mu_{j}^{old}\oslash \eta_{j}^{\mu old}), \, \, \, e_{t,j}=x_t-A^{old}_{j}y_{t,j}- \mu^{old}_{j}, \\ \delta \Sigma_{j}=diag[\sum_{t\in T_t} p_{j, t}e_{t,j}e_{t,j}^T+ A^{old}_{j} R_{j}^{\delta} A^{old \ T}_j]+s^o_{j \cdot} h^2I-(s_{j \cdot}+s^o_{j \cdot})\Sigma_{j}^{old}, \\ \delta_{t,j}=y_{t,j}-W^{old}_{j}(x_t- \mu^{old}_{j}), \, \, \, R_{j}^{\delta}=\sum_{t\in T_t} p(j|x_t, \theta^{old})\delta_{t,j}\delta_{t,j}^T,\\ \delta \Lambda_{j}= diag[R_{j}^{\delta}+R_{j}^{y}]-(s_{j \cdot}+s^o_{j \cdot})\Lambda_{j}^{old}, \\ \delta A_{j}=R_{j}^{xy}-A(R_{j}^{\delta}+R_{j}^{y})-\Sigma_{j}^{old}(A_{j}^{old}\oslash \eta_{j}^{A old}), \\ R_{j}^{y}=\sum_{t\in T_t} p_{j, t}y_{t,j}y_{t,j}^T, \, \, \, R_{j}^{xy}=\sum_{t\in T_t} p_{j, t}(x_t- \mu^{old}_{j})y_{t,j}^T,\\ \delta W_{j}=R_{j}^{xy\ T}-R_{j}^{x}-\Sigma_{j}^{old}(W_{j}^{old}\oslash \eta_{j}^{W old}), \, \, \, R_{j}^{x}=\sum_{t\in T_t} p_{j, t}(x_t- \mu^{old}_{j})(x_t- \mu^{old}_{j})^T,\\ \delta a_{ij}=a_{ij}^{old\ 2}-2\eta_{ij}^{A\ old}, \, \, \delta w_{ij}=w_{ij}^{old\ 2}-2\eta_{ij}^{W\ old}, \, \, \delta \mu_{j}=\mu_{j}^{old\ 2}-2\eta_{j}^{\mu\ old}, \end{array}$ where $$\delta \phi =P \nabla_{\phi} f(\phi)$$ with either $$P=I$$ or $$P$$ a positive definite matrix. Given two matrices $$A=[a_{ij}]$$ and $$B=[b_{ij}]$$, we have the element-wise (Hadamard) product $$A\odot B=[a_{ij}b_{ij}]$$ and the element-wise division $$A\oslash B=[a_{ij}/b_{ij}]$$. Moreover, $$h\ne 0$$ may also be updated to seek a best value via a data-sensitive prior $$q(h|{\mathcal X}_N)$$, with further details referred to Fig.7(b) in one recent overview (Xu, 2010).

Furthermore, extending $$J(k)$$ given by Eq.(19) in the paper (Xu, 2012) for Gaussian mixture and $$J(m)$$ by Eq.(29) in another paper (Xu, 2011) for factor analysis, we have the counterpart of eq.(45) modified as follows: $\tag{47} \begin{array}{l} J(k, \{m_{\ell}\})={ 1 \over 2}\sum_{\ell=1}^k \alpha_{\ell}\{\ln{|\Sigma_{\ell}|}+h^2Tr[\Sigma_{\ell}^{-1}] - 2\ln{\alpha_{\ell}} +2 m_{\ell}+ \ln{|\Lambda_{\ell}|}+ m_{\ell} \ln{(2\pi)}\}+ { n_f(\Theta_{\mathbf k}) \over 2N}, \end{array}$ where $$n_f(\Theta_{\mathbf k})=dk+k-1+ { 1 \over 2}d(d+1)k+ \sum_{\ell=1}^{k}(m_{\ell}+d_{U_{\ell}})$$ and $$d_{U_{\ell}}= m_{\ell}d-{ 1 \over 2}m_{\ell}(m_{\ell}+1)\ .$$
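The free-parameter count $$n_f(\Theta_{\mathbf k})$$ following eq.(47) can be computed directly from $$d$$ and the local dimensionalities $$\{m_{\ell}\}$$; a small sketch (function name ours):

```python
def n_free_params(d, ms):
    """n_f(Theta_k) as given after eq.(47):
    dk + (k - 1) + d(d+1)k/2 + sum_l (m_l + d_{U_l}),
    with d_{U_l} = m_l d - m_l (m_l + 1) / 2, for k = len(ms) components."""
    k = len(ms)
    return (d * k + k - 1 + d * (d + 1) * k // 2
            + sum(m + m * d - m * (m + 1) // 2 for m in ms))
```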

In a recent paper (Shi, Tu, & Xu, 2012), comparative investigations have been conducted on a mixture of FA-b models by eq.(22) (i.e., LFA) and a mixture of FA-a models by eq.(21) (called MFA), under both Variational Bayes (VB) and Bayesian Ying-Yang (BYY) harmony learning with conjugate Dirichlet-Normal-Gamma priors, for the problem of automatically determining the component number and the local hidden dimensionalities (i.e., the number of factors of FA in each component). A wide range of synthetic experiments shows that LFA is superior to MFA in model selection under either VB or BYY, while BYY outperforms VB reliably on both MFA and LFA. These empirical findings are consistently observed in real applications, not only on face image clustering and handwritten digit image clustering but also on unsupervised image segmentation.

Shown in Figure 7 is a road map that indicates how the BYY harmony learning relates to existing typical learning approaches. On this map there are four regions in different background colors.

Colored in grey, the first region outlines typical learning approaches that model a finite set of samples from a top-down perspective, as previously introduced by eq.(19) and further addressed thereafter. Learning is featured by optimizing an objective function that combines a fitting error measure $$F$$ and a modeling complexity measure $$C$$. Though such a combination could be nonadditive or nonlinear, e.g., the VC dimension based empirical risk (Vapnik, 1995), most approaches consider an additive combination, i.e., simply adding up the two measures. As introduced in the previous subsection, Tikhonov regularization (Tikhonov & Arsenin, 1977) is featured by a mean square error $$F$$, while $$C$$ comes from the Tikhonov smoothing operator. In a Bayes formulation for learning regularization / model selection / automatic model selection / sparse learning, $$F$$ is a likelihood function while $$C$$ simply comes from a prior. Such a combination also includes typical model selection criteria such as AIC, CAIC, and BIC/MDL (Akaike, 1974; Rissanen, 1986, 2007; Schwarz, 1978). Moreover, the Helmholtz machine (Hinton, Dayan, Frey, & Neal, 1995; Hinton & Zemel, 1994) is also featured by such a combination. However, minimizing $$KL(p(Y|X_N)\Vert q(Y|X_N))$$ as $$C$$ does little to reduce modeling complexity. Instead, driving $$KL(p(Y|X_N)\Vert q(Y|X_N))\to 0$$ makes Helmholtz machine learning tend towards maximum likelihood learning.
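The additive combination of a fitting term $$F$$ and a complexity term $$C$$ takes a particularly simple form for AIC and BIC/MDL, sketched below in generic terms (not tied to any particular model in the text):

```python
import math

def aic(log_likelihood, n_params):
    """AIC: fitting term F = -2 ln L plus complexity term C = 2 d_f."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_samples):
    """BIC/MDL: same F, but complexity C = d_f ln N grows with sample size."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# Model selection: among candidate models, pick the one with the
# smallest criterion value, trading fit against parameter count.
```

Because the BIC penalty grows with $$\ln N$$ while the AIC penalty does not, BIC favors smaller models than AIC as the sample size increases.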

There are also some cases slightly beyond additive combination, e.g., those listed in Figure 7 under the name of quasi-additive combination. A typical example is the marginal likelihood, also called marginal Bayes, which involves an integral over $$\theta$$. Two types of approximation are usually used to handle this integral. One is the BIC criterion, and the other is variational Bayes, which tends to drive $$KL(p(\theta|X_N)\Vert q(\theta|X_N))$$ to 0 and thus approaches the marginal likelihood.

From the bottom up on Figure 7, the BYY harmony learning can be classified into two streams, featured by $$p({ X} )$$ in eq.(3) with $$h=0$$ and $$h \neq 0$$, respectively. The stream with $$h=0$$ uses samples directly in BYY harmony learning studies, and all these studies can be extended to consider $$p({ X} )$$ in eq.(3) with $$h\neq 0$$, which uses samples indirectly via a smoothing regularization. Generally, the BYY harmony learning differs from the approaches in the first region above, though in some special cases we may still observe certain relations. Along Line 1, $$H(p\Vert q)$$ includes $$H(p\Vert q, X_N)$$, and some special cases of $$H(p\Vert q, X_N)$$ degenerate to either the Bayes learning (see the dotted red line) or the marginal Bayes (see the dotted blue line). Generally, $$H(p\Vert q, X_N)$$ differs from Bayes learning, marginal Bayes, and variational Bayes. Along Line 2, $$H(p\Vert q)$$ includes $$H(p\Vert q, \theta)$$, from which we observe its difference from the likelihood and the Helmholtz energy (Hinton, Dayan, Frey, & Neal, 1995; Hinton & Zemel, 1994), i.e., $$H_{Y|X}(X_N)$$ is maximized for reducing modeling complexity.

It is straightforward to consider a linear combination of $$H(p\Vert q, \theta)$$ and the Helmholtz energy $$KL(p(Y|X_N)\Vert q(X_N|Y)q(Y))$$ or equivalently $$KL(p(Y|X_N)p(X|X_N)\Vert q(X_N|Y)q(Y)),$$ that is, considering $\tag{48} \begin{array}{l} KL(p(Y|X_N)p(X|X_N)\Vert q(X_N|Y)q(Y))-\gamma H_{Y|X}(X_N), \, \, \gamma >0, \\ H_{Y|X}(X_N)=\int p(Y|X_N)\ln{p(Y|X_N)}dY, \end{array}$ which covers the following special case $\tag{49} \begin{array}{l} -\ln{ q(X_N|\theta)}-\gamma H_{Y|X}(X_N), \\ \mbox{at} \,\, p(Y|X_N)=\frac{ q(X_N|Y)q(Y) }{ q(X_N|\theta)}, \,\, q(X_N|\theta)=\int q(X_N|Y)q(Y)dY. \end{array}$
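For a finite mixture, where $$Y$$ ranges over component labels, the special case of eq.(49) can be evaluated in closed form at a single sample. The sketch below (hypothetical helper; a 1-D Gaussian mixture with illustrative parameter names) shows the objective $$-\ln q(x|\theta) - \gamma H_{Y|X}$$ computed at the Bayes posterior:

```python
import numpy as np

def entropy_reg_objective(x, means, variances, weights, gamma):
    """Sketch of eq.(49) for a 1-D Gaussian mixture at one sample x.

    Returns -ln q(x|theta) - gamma * H_{Y|X}, with the Bayes posterior
    p(y|x) = q(x|y) q(y) / q(x|theta) and H_{Y|X} = sum_y p(y|x) ln p(y|x).
    Note H_{Y|X} is the NEGATIVE entropy, so -gamma * H_{Y|X} >= 0.
    """
    # joint densities q(x|y) q(y) for each component y
    joint = weights * np.exp(-0.5 * (x - means) ** 2 / variances) \
            / np.sqrt(2.0 * np.pi * variances)
    qx = joint.sum()                  # q(x|theta) = sum_y q(x|y) q(y)
    post = joint / qx                 # Bayes posterior p(y|x)
    H = np.sum(post * np.log(post))   # H_{Y|X} <= 0
    return -np.log(qx) - gamma * H
```

At $$\gamma=0$$ this is the negative log-likelihood; increasing $$\gamma$$ adds a nonnegative penalty on posterior uncertainty, pushing $$p(y|x)$$ towards a near-deterministic assignment.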

For further details, see Eq. (22) in Ref. (Xu, 1998a) and Eq. (49) in Ref. (Xu, 1998b), as well as the overview in Sect. 23.4.2 of Ref. (Xu, 2004c). Following these previous studies, either or both of eq.(48) and eq.(49) have been revisited in recent years under the name "entropy regularization", which comes from the fact that $$-H_{Y|X}(X_N)$$ is an entropy and takes the position of the regularization term $$C$$. Actually, this different name brings some confusion: the objective is nothing more than either $$H(p\Vert q, \theta)$$ when $$\gamma=1$$ or a straightforward extension of $$H(p\Vert q, \theta)$$ when $$\gamma\ne 1$$, which incurs an unknown parameter $$\gamma$$ to be determined. Though cross-validation (Stone, 1978) may be considered for estimating $$\gamma$$, it is merely a heuristic technique that deprives the theory of its clarity.

From the bottom up on Figure 7 along Line 2 toward the left, we enter the third region, which outlines the relations of the BYY harmony learning to learning approaches from a bottom-up perspective, as previously introduced by eq.(20) and addressed thereafter. Interestingly, $$H(p\Vert q, \theta)$$ at $$q(X_N |Y)=q(X_N)$$ degenerates separately into a top-down type maximum likelihood learning and a bottom-up type minimum entropy learning, including minor component analysis (MCA) and minor ICA; see Fig.1 & Fig.2 in (Xu, 2008c). Also, letting $$q(X_N |Y)=p(Y|X_N)p(X|X_N)/p(Y)$$ makes $$H(p\Vert q, \theta)$$ become a sum of three terms. Ignoring the constant term $$H(X_N)$$, maximizing $$H(p\Vert q, \theta)$$ involves minimizing $$-H_{Y|X}(X_N)$$, which leads to $$p(Y|X_N)=\delta(Y-Y^*)$$, and minimizing $$KL(p(Y)\Vert q(Y))$$, which leads to minimum mutual information and maximum information transfer (Amari, Cichocki, & Yang, 1996; Bell & Sejnowski, 1995). Beyond these existing studies, extensions may also be obtained with $$p(Y|X_N)$$ of specific structures and with $$p({ X} )$$ in eq.(3) with $$h\neq 0$$.
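The emergence of a delta posterior $$p(Y|X_N)=\delta(Y-Y^*)$$ can be seen numerically: for a fixed sample, the inner harmony term $$\sum_y p(y)\ln q(x,y)$$ is linear in the posterior $$p$$, so over the probability simplex it is maximized at a vertex, i.e., a winner-take-all assignment. A minimal sketch with hypothetical numbers:

```python
import numpy as np

def harmony_posterior(joint):
    """Compare sum_y p(y) ln q(x,y) under a soft Bayes posterior and a
    hard (delta) posterior concentrated on y* = argmax_y q(x,y).

    Because the functional is linear in p, the hard posterior always
    attains a value at least as large as the soft one. (Illustrative sketch.)
    """
    log_joint = np.log(joint)
    soft = joint / joint.sum()            # Bayes posterior p(y|x)
    hard = np.zeros_like(joint)
    hard[np.argmax(joint)] = 1.0          # delta posterior at y*
    return soft @ log_joint, hard @ log_joint

joint = np.array([0.05, 0.30, 0.10])      # hypothetical q(x, y) values
h_soft, h_hard = harmony_posterior(joint)
# h_hard >= h_soft: the delta posterior yields the larger harmony value
```

This is the mechanism by which maximizing the harmony measure, unlike minimizing a KL divergence, favors least-complexity (near-deterministic) inner representations.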

Furthermore, readers are referred to Appendix A and the road map in Fig. A2 of a recent overview (Xu, 2010) for a more detailed account of how the BYY harmony learning relates to other learning approaches. Also, readers are referred to Sect. II(C) for its relation to the minimum message length (MML) (Wallace & Boulton, 1968; Wallace & Dowe, 1999) and the minimum description length (MDL) (Rissanen, 1986, 2007) from the viewpoint of optimal communication.

Last but not least, learning algorithms have been developed for a number of learning tasks; readers are referred to the road map in Fig.1 and Sect.1 of the paper (Xu, 2010) for several typical unsupervised learning tasks, to the road map in Fig.11 and Sect.4.4 of the same paper for their supervised counterparts, and further to the road map in Fig.1 and Sect.1 of another paper (Xu, 2011) for a family of bilinear matrix system based tasks. Additionally, readers are referred to Sect.7 of the paper (Xu, 2012) for a systematic outline of important topics in BYY harmony learning.