# Volterra and Wiener series

Post-publication activity

Curator: Matthias O. Franz Figure 1: Vito Volterra, Italian mathematician (1860-1940). The series of polynomial operators named after him is the oldest systematic nonlinear system representation. Source: http://it.wikipedia.org/wiki/Vito_Volterra

Volterra and Wiener series are two classes of polynomial representations of nonlinear systems. They are perhaps the best understood and most widely used nonlinear system representations in signal processing and system identification. A Volterra or Wiener representation can be thought of as a natural extension of the classical linear system representation. In addition to the convolution of the input signal with the system's impulse response, the system representation includes a series of nonlinear terms that contain products of increasing order of the input signal with itself. It can be shown that these polynomial extension terms allow for representing a large class of nonlinear systems which basically encompasses all systems with scalar outputs that are time-invariant and have noninfinite memory.

## Representation of nonlinear systems with Volterra and Wiener series

If a system is linear, it can be always characterized uniquely by its impulse response. For nonlinear systems, however, there exists no canonical representation that encompasses all conceivable systems. The earliest approach to a systematic characterization of nonlinear systems dates back to V. Volterra (Volterra, 1887) who extended the standard convolution description of linear systems by a series of polynomial integral operators with increasing degree of nonlinearity, very similar in spirit to the Taylor series for analytic functions. The last 120 years have seen the accumulation of huge amount of research done both on the class of systems that can be represented by Volterra operators, and on their application in such diverse fields as nonlinear differential equations, neuroscience, fluid dynamics or electrical engineering. Of special importance in this context is the work by N. Wiener (Wiener, 1958) who re-arranged the Volterra series such that it could be applied much more easily to practical signal processing problems.

Both the Volterra series and the Wiener series are made of the same building blocks, namely the Volterra operators which consist of a weighted sum (or integral) of products of the input signal at different time instances. The number of factors in the product determines the degree of the Volterra operator. For instance, a second-order Volterra is a weighted sum of all pairwise products of the input signal at different times. The first-order Volterra operator is just a weighted sum of the input signal at different times and thus corresponds to the standard convolution used in describing linear systems.

The difference between the Volterra and the Wiener representation of a nonlinear system is mainly one of a different arrangement of the single Volterra operators in the series: in a Volterra series, the operators contain all terms of a given degree which has the disadvantage that they cannot be estimated independently of each other; in a Wiener series, the operators contain terms of mixed degree such that they can be independently estimated. In addition, the way in which the series representation approximates the true nonlinear system with increasing degree of the operators is different in both representations: Volterra series have to approximate the true system uniformly, whereas the Wiener representation is required to approximate it only in the mean square sense. As a consequence, the classes of systems that can be represented by both series are different.

## Volterra series

A system can be defined mathematically as a rule that assigns an output signal y(t) to an input signal x(t) (we assume for the moment that x(t) and y(t) are functions of time t). This rule can be expressed in the form $y(t) = Tx(t)$ using the system operator T. The system is typically assumed to be time-invariant and continuous, i.e., the system response should remain unchanged for repeated presentation of the same input and small changes in the input function x(t) should lead to small changes in the corresponding system output function y(t). In traditional systems theory, we further restrict T to be a linear operator H1 such that the system response can be described by a convolution $y(t) = H_1 x(t) = \int_\mathbb{R} h^{(1)}(\tau) x(t-\tau) \, d\tau$ of x(t) with a linear kernel (or impulse response) h(1). Volterra extended this well-known linear system representation to nonlinear systems by adding a series of nonlinear integral operators (Volterra, 1887, 1959) $y(t) = h^{(0)} + \int_\mathbb{R} h^{(1)}(\tau_1)x(t-\tau_1) \, d\tau_1 \ :$

$+ \int_{\mathbb{R}^2} h^{(2)}(\tau_1, \tau_2)x(t-\tau_1)x(t-\tau_2) \, d\tau_1 d\tau_2 \ :$
$+ \int_{\mathbb{R}^3} h^{(3)}(\tau_1, \tau_2, \tau_3)x(t-\tau_1)x(t-\tau_2)x(t-\tau_3) \, d\tau_1 d\tau_2 d\tau_2 \ :$
$+ ...$

or, more generally, the system described by a Volterra series $y(t) = H_0 x(t) + H_1 x(t) + H_2 x(t) + ... + H_nx(t) + ...$ in which H0 x(t) = h(0) is a constant and for n = 1, 2,.. $H_n x(t) = \int_{\mathbb{R}^n} h^{(n)}(\tau_1, ... \tau_n) \, x(t-\tau_1) \, ... \, x(t-\tau_n) \, d\tau_1 \, ... \, d\tau_n$ is the nth-order Volterra operator. The integral kernels h(n), referred to as the Volterra kernels of the system, have to be causal $h^{(n)}(\tau_1,...,\tau_n) = 0 \quad \text{for any} \quad \tau_j < 0, \quad j=1, 2, 3, ..., n,$ i.e., they do not depend on future values of x(t).

Depending on the system to be represented, the integrals can be computed over finite or infinite time intervals. The support of the Volterra kernel defines the memory of the system, i.e., it delimits the time interval in which past inputs can influence the current system output. The Volterra series can be regarded accordingly as a Taylor series with memory: whereas the usual Taylor series only represents systems that instantaneously map the input to the output, the Volterra series characterizes systems in which the output also depends on past inputs. Note that the Volterra kernels for a given output are not unique. There are, in fact, many asymmetric (with respect to permutations of the arguments) Volterra kernels which give rise to the same operator, but every system operator corresponds only to one symmetric kernel. Fortunately, any asymmetric kernel can be symmetrized by a simple procedure (Volterra, 1959, Mathews & Sicuranza, 2000) .

Due to its power series character, the convergence of an infinite Volterra series cannot be guaranteed for arbitrary input signals. In addition, not all nonlinear systems can be approximated. As a consequence, both the input and the output signals must be restricted to some suitable signal class. For instance, one can show that any continuous, nonlinear system can be uniformly approximated to arbitrary accuracy by a Volterra series of sufficient but finite order if the input signals are restricted to square-integrable functions on a finite interval (Fréchet, 1910, Brilliant, 1958). Although this approximation result appears to be rather general on first sight, the restriction to this type of input is quite severe. Many natural choices of input signals are precluded by this requirement such as, e.g., the standard infinite periodic forcing signals used in linear system theory.

The application of the Volterra series in its original continuous time form is mostly limited to the analysis of nonlinear differential equations. In practical signal processing, one has to use a discretized form for a finite sample of data. Here, we assume that the input data is given as a vector $$\mathbf{x} = (x_1, ... ,x_m)^\top \in \mathbb{R}^m$$ of finite dimension. The vectorial data can be generated from any multi-dimensional input or, for instance, by a sliding window over a discretized image or time series. A discrete system is simply described by a function that maps these vectors to some real number, not by an operator as in the continuous time case. The discretized Volterra operator (Alper, 1965) is defined as the function $\tag{1} H_n \mathbf{x} = \sum_{i_1=1}^{m} ... \sum_{i_n=1}^{m} h^{(n)}_{i_1 ... i_n} x_{i_1} \dots x_{i_n}.$

where the Volterra kernel is given as a finite number of mn coefficients h(n)i1 ... in. It is, accordingly, a linear combination of all ordered nth-order monomials of the components of the input vector x. The discretized Volterra functionals provide a practical approximation which shares the completeness and convergence properties of Volterra theory. The Weierstraß theorem states that all discrete continuous nonlinear systems with finite-dimensional vectorial input can be uniformly approximated by a finite, discrete Volterra series. However, - similar to the familiar Fourier series - many practically occurring systems with non-continuous output such as, e.g., rectangular waveforms cannot be approximated in the uniform sense which results in Gibbs-like ringing phenomena at the discontinuities for finite Volterra approximations.

## Wiener series

The convergence of the Volterra series is comparable to the convergence of the Taylor series expansion of a function which often allows only for small deviations from the starting point. The type of convergence required (uniform convergence) is very stringent since not only the error has to approach zero with increasing number of terms, but also the derivatives of the error. A less stringent notion of convergence applies if one represents a function by a series of orthogonal functions, namely only convergence in the mean square sense which allows convergence over a much larger range than the Taylor series. The same idea can be applied to functionals instead of functions by using an orthogonal series of base functionals and stipulating convergence in the mean square sense.

The output of two different functionals in a Volterra series is usually not orthogonal, i.e., their respective output time series are correlated. They can be orthogonalized by a procedure which is very similar to a Gram-Schmidt orthogonalization which is, however, only valid in the average case over all inputs. As a consequence, one has to assign a probability to all input signals, i.e., one has to define a stochastic process over which the Volterra functionals are orthogonalised in terms of the correlation between their output over all possible inputs. The resulting functionals are sums of Volterra functionals of different order, or nonhomogeneous Volterra functionals.

The stochastic input process chosen by Wiener for the series of nonhomogeneous Volterra functionals named after him is nowadays known as the Wiener process (Papoulis, 1991). One can show that the Wiener process assigns a non-zero probability to the neighbourhood of every continuous input function (Palm & Poggio, 1977). Thus the realizations of the Wiener process play a similar role in Wiener theory as the sinusoidal test inputs in linear system theory since they are capable of completely characterizing the system.

If the orthogonalization is done with respect to the Wiener process, one obtains the nonhomogeneous Volterra functionals called Wiener functionals, denoted by Gn (Wiener, 1958). The corresponding input-specific decomposition of the system operator T $y(t) = G_0 x(t) + G_1 x(t) + G_2 x(t) + ... + G_nx(t) + ...$ into a series of mutually uncorrelated operators is called a Wiener series expansion. The Wiener operators Gn of order n are linear combinations of Volterra operators up to order n. For instance, the second-degree Wiener operator $G_2 x(t) = \int h_2(\tau_1,\tau_2)x(t - \tau_1)x(t - \tau_2) \, d\tau_1 d\tau_2 - \int h_2(\tau_1,\tau_1) \, d\tau_1$ consists of a zero-order and a second-order Volterra operator. The third-degree Wiener operator $G_3[x(t)] = \int_{\mathbb{R}^3} k_3(\tau_1, \tau_2, \tau_3)x(t-\tau_1)x(t-\tau_2)x(t-\tau_3) \, d\tau_1 d\tau_2 d\tau_2 \ :$

$-3A \int_{\mathbb{R}^2} k_3(\tau_1, \tau_2, \tau_2) x(t-\tau_1) \,d\tau_1 d\tau_2$

of a first- and third order Volterra operator (A is the variance of the Wiener process used as input). In general, an odd-degree Wiener operator contains all lower odd-order Volterra operators, an even-degree Wiener operator all lower even-order Volterra operators. The highest order integral kernel of a Wiener operator is called the leading Wiener kernel, the lower order kernels are the derived Wiener kernels. These can be computed from the leading Wiener kernel by the orthogonalization procedure which results in a recursive formula (Schetzen, 1980): $k_{p-2m}^{(p)}(\sigma_1, \dots, \sigma_{p-2m}) = \frac{(-1)^m p! A^m}{(p-2m)! m! 2^m} \, \cdot \ :$

$\tag{2} \int_{\mathbb{R}^m} k_p(\tau_1, \tau_1, \dots, \tau_m, \tau_m, \sigma_1, \dots, \sigma_{p-2m}) \, d\tau_1 \dots d\tau_m.$

The zeroth- and first-degree Wiener functionals do not contain derived Wiener kernels and are given by $G_0[x(t)] = k^{(0)} \quad \text{and} \quad G_1[x(t)] = \int_\mathbb{R} k^{(1)}(\tau_1) x(t-\tau_1) \, d\tau_1.$

Wiener and Volterra series can be viewed as two alternative ways of characterizing a system. Both use monomials as base functions, but they group them into either nonhomogeneous or homogeneous Volterra operators. All systems that produce square integrable output for the Wiener input process can be approximated in the mean square sense by finite order Wiener series operators (Ahmed, 1970). In practice, this means that the systems must be non-divergent and cannot have infinite memory. Clearly, the relaxed conditions on convergence make the Wiener class of systems much larger than the Volterra class, however there are some systems that can be represented by an infinite Volterra series, but not by a Wiener series (Palm & Poggio, 1977, Korenberg & Hunter, 1990). In contrast, a truncated Wiener or Volterra series can always be transformed into its truncated counterpart: the Volterra expansion of a finite Wiener series can be computed easily by adding up all operators of equal order. The Wiener expansion can be obtained by applying the Gram-Schmidt type orthogonalization to a finite Volterra series, but this procedure becomes rather tedious for higher-order operators.

The discrete analogue to the Wiener series is typically orthogonalized with respect to Gaussian input $$\mathbf{x} \sim \mathcal{N}(0, A)$$ since this is the only practical setting where the popular crosscorrelation method can be applied (see next paragraph). All properties of continuous Wiener series operators described above carry over to the discrete case. In particular, any square-summable function with Gaussian input can be approximated in the mean square sense by a finite, discrete Wiener series (Palm & Poggio, 1978).

## Estimation of Volterra and Wiener series

In system identification, we are only given pairs of input and output signals whereas the system itself is treated as a black box. The appropriate Volterra or Wiener representation has to be found by minimizing some error measure between the true output and the model output such as, e.g., the integral over the squared error. In general, the minimization of the mean squared error for the estimation of the Volterra kernels requires the solution of a simultaneous set of integral equations. Since in most practical cases this is hard or even impossible, the use of the continuous Volterra series is rather limited. In contrast, Wiener operators are mutually uncorrelated and can be estimated independently of each other if the test signals are realizations of the Wiener process.

For a linear system, the estimation of the system representation is a straightforward procedure since it suffices to test the system on a set of basic signals such as delta functions or sinusoids. In a nonlinear system, however, we ideally have to measure the system response for all possible input signals. In other words, if a nonlinear system is tested with one class of input signals, one might obtain a different representation for another class of input signals. Since testing all possible input signals is infeasible, one resorts to inputs that are realizations of a random process. For instance, the above-mentioned Wiener process approximates any continuous input signal in the least squares sense. Thus, if the input time series is sufficiently long, the obtained system representation should generalize to previously unseen continuous input.

A further reason for the popularity of the Wiener series is that the leading Wiener kernels can be directly measured via the crosscorrelation method of Lee and Schetzen (1965). If one uses Gaussian white noise with standard deviation A instead of the Wiener process as input, the leading Wiener kernel can be estimated as the cross-correlations $k^{(0)} = \overline{y(t)}$ $k^{(1)}(\sigma_1) = \frac{1}{A} \overline{y(t)x(t-\sigma_1)}$ $k{(2)}(\sigma_1, \sigma_2) = \frac{1}{2A^2} \overline{y(t)x(t-\sigma_1) x(t-\sigma_2)}$ $k^{(3)}(\sigma_1, \sigma_2, \sigma_3) = \frac{1}{3!A^3} \overline{y(t)x(t-\sigma_1)x(t-\sigma_2)x(t-\sigma_3)}$ $:$ $\tag{3} k^{(n)}(\sigma_1, \dots, \sigma_n) = \frac{1}{n!A^n} \overline{y(t) x(t-\sigma_1) \dots x(t-\sigma_n)}$

where the bar indicates the average over time. Since the Wiener operators are orthogonal by definition, the single Wiener kernels can be estimated independently. The other lower-order Wiener kernels of Gn can be derived from the leading kernel by applying again formula Eq.(#eq:derived_wiener).

The crosscorrelation method suffers from severe problems that limit its applicability to relatively small Wiener kernels in a low-noise scenario:

• In practice, the cross-correlations have to be estimated at a finite resolution (cf. the discretized version of the Volterra operator in Eq.(#eq:volt_discrete)). The number of expansion coefficients h(n)i1 ... in in Eq.(#eq:volt_discrete) increases with mn for an m-dimensional input signal and an nth-order Volterra kernel. However, the number of coefficients that actually have to be estimated by cross-correlation is smaller. Since the products in Eq.(#eq:volt_discrete) remain the same when two different indices are permuted, the associated coefficients should be equal. As a consequence, the required number of measurements is $$(n + m -1)! / (n! (m - 1)!)$$ (Mathews & Sicuranza, 2000). Nonetheless, the resulting numbers are huge for higher-order Wiener kernels. For instance, a 5th-order Wiener kernel operating on 16 x 16 sized image patches contains roughly 1012 coefficients, 1010 of which would have to be measured individually by cross-correlation. As a consequence, this procedure is not feasible for higher-dimensional input signals or highly nonlinear kernels with larger memory.
• The estimation of cross-correlations requires large sample sizes. Typically, one needs several tens of thousands input-output pairs before a sufficient convergence is reached. Moreover, the variance of the estimator $$\overline{y(t) x(t-\sigma_1) \dots x(t-\sigma_n)}$$ in Eq.(#eq:crosscorr) increases with increasing values of the $$\sigma_i$$ (Papoulis, 1991) such that only operators with relatively small memory can be reliably estimated.
• The estimation via cross-correlation works only if the input is Gaussian noise with zero mean, not for general types of input. In physical experiments, however, deviations from ideal white noise and the resulting estimation errors cannot be avoided. Specific inputs, on the other hand, may have a very low probability of being generated by white noise. Since the approximation is only computed in the mean square sense, the system response to these inputs may be drastically different from the model predictions.

With the advent of readily available computing power the crosscorrelation method has been replaced in most cases by linear regression which does not suffer from the cumbersome estimation of crosscorrelations and the restriction to Gaussian noise input. Here, one directly estimates the discrete Volterra series by treating the monomials as basis functions $$\varphi_{i_1 \dots i_n} = x_{i_1} \dots x_{i_n}$$ and the components of Volterra kernel $$h^{(n)}_{i_1 ... i_n}$$ as expansion coefficients $f(\mathbf{x}) = \sum_{n=0}^p H_n \mathbf{x} = \sum_{n=0}^p \sum_{i_1=1}^{m} ... \sum_{i_n=1}^{m} h^{(n)}_{i_1 ... i_n} x_{i_1} \dots x_{i_n} = \sum_{n=0}^p \sum_{i_1=1}^{m} ... \sum_{i_n=1}^{m} h^{(n)}_{i_1 ... i_n} \varphi_{i_1 \dots i_n} .$ Instead of assuming an infinite amount of data, the expansion coefficients are found by minimizing the mean squared error over a finite dataset consisting of $$N$$ input-output pairs $$(\mathbf{x}_j, y_j)$$ (Korenberg et al., 1988; Mathews & Sicuranza, 2000) $\frac{1}{N} \sum_{j=1}^N(f(\mathbf{x}_j) - y_j)^2.$ This minimization problem is solved by applying the standard numerical methods for least squares problems such as pseudoinverses or singular value decomposition, without the need of estimating crosscorrelations. Moreover, the input signal class is no more restricted to Gaussian noise, but can be chosen freely, e.g., from the 'natural' input ensemble of the system. As long as the input is known to the experimenter, there is no need for controlling the input as in the classical system identification setting. Note, however, that the obtained Volterra models will approximate the Wiener series only for sufficiently large datasets of Gaussian white noise. Korenberg et al. (1988) have shown that the linear regression framework leads to Wiener models that are orders of magnitude more accurate than those obtained from the crosscorrelation method. Unfortunately, the solution of this regression problem requires the inversion of an $$M \times M$$ matrix where $$M$$ is the number of monomials in the expansion (Mathews & Sicuranza, 2000). This is again prohibitive for high-dimensional data and higher orders of nonlinearity since $$M$$ scales like $$m^n\ .$$

This dimensionality problem can be overcome by using kernel regression instead of standard linear regression (Franz & Schölkopf, 2006). It can be shown that all finite discrete Volterra series of degree $$p$$ that are solutions of a linear regression problem can be generated by an expansion in polynomial kernel functions $$(1 + \mathbf{x}_j^\top \mathbf{x})^p$$ with suitable weights $$\alpha_j\ ,$$ the implicit Volterra series $f(\mathbf{x}) = \sum_{n=0}^p H_n \mathbf{x} = \sum_{j=1}^N \alpha_j (1 + \mathbf{x}_j^\top \mathbf{x})^p.$ This means that, instead of $$m^n$$ coefficients $$h^{(n)}_{i_1 ... i_n}\ ,$$ only $$N$$ coefficients $$\alpha_j$$ of the implicit Volterra series have to be estimated. Clearly, this is advantageous if the number of the monomials exceeds the number of available datapoints which is typically the case for high-dimensional inputs and higher orders of nonlinearity. Any implicit Volterra series can be converted into a standard discrete Volterra series expansion (Franz & Schölkopf, 2006). If the $$\alpha_j$$ are computed from Gaussian white noise input by minimizing the mean squared error, the resulting Volterra series is called an implicit Wiener series from which also an expansion into a standard discrete Wiener series can be computed (Franz & Schölkopf, 2006). The kernel function $$(1 + \mathbf{x}_j^\top \mathbf{x})^p$$ is not the only one that is capable of representing Volterra series. There are several alternatives, some of which generate also infinite Volterra series (Franz & Schölkopf, 2006).