Matching pursuit

 Piotr J. Durka (2007), Scholarpedia, 2(11):2288. doi:10.4249/scholarpedia.2288 revision #140500 [link to/cite this article]
Post-publication activity

Curator: Piotr J. Durka

Matching pursuit (MP) algorithm finds a sub-optimal solution to the problem of an adaptive approximation of a signal in a redundant set (dictionary) of functions. Commonly used with dictionaries of Gabor functions, it offers several advantages in time-frequency analysis of signals, in particular EEG/MEG.

Dictionaries and linear expansions

Many signal analysis methods look for a linear expansion of the unknown signal x in terms of functions $$g_n$$

$\tag{1} x = \sum_{n=1}^{N} a_n g_n$

We may say that in such a way the unknown signal x is explained using words (functions $$g_n$$) from the dictionary D, used for decomposition. If the dictionary D is an orthonormal basis (like orthogonal wavelets or Fourier bases), then the coefficients $$a_n$$ are given simply by the inner products of the dictionary's functions $$g_n$$ with the signal $$\langle x, g_n \rangle\ .$$

We would like to use a dictionary $$D=\lbrace g_n\rbrace_{n=1..N}$$ that would properly reveal the intrinsic properties of an unknown signal, or, almost equivalently, would give low entropy of the $$\lbrace a_n\rbrace$$ and possibilities of good lossy compression. Unfortunately, there is no universal recipe for such a prior choice.

We may relax the requirement of exact signal representation (1), and try to automatically choose the functions $$g_{\gamma_n}\ ,$$ optimal for the representation of a given signal x, from a redundant dictionary D. The expansion becomes an approximation, and uses only the functions $$g_{\gamma_n}$$ chosen from the redundant dictionary D. In practice, the dictionary contains orders of magnitude more candidate functions $$g_\gamma$$ than the number M of functions chosen for the representation.

$\tag{2} x\approx \sum_{n=0}^{M-1} a_n g_{\gamma_n}$

A criterion of optimality of a given solution (for a fixed dictionary D, signal x, and number of used functions M) can be formulated as minimization of the error of representation

$\tag{3} \epsilon=\left\| x-\sum_{n=0}^{M-1} a_n g_{\gamma_n} \right\| = \min$

Finding the minimum requires checking all the possible combinations (subsets) of M functions from the dictionary, which leads to a combinatorial explosion. Therefore, the problem is intractable even for moderate dictionary sizes. Matching pursuit algorithm, proposed in (Mallat and Zhang 1993), finds a sub-optimal solution by means of an iterative procedure.

Matching pursuit algorithm

In the first step, the waveform $$g_{\gamma_0}$$ which gives the largest product with the signal x is chosen from the dictionary D, composed of normalized functions ($$||g_{\gamma_n}||=1$$). In each of the consecutive steps, the waveform $$g_{\gamma_n}$$ is matched to the signal $$R^n x$$ which is the residual left after subtracting results of previous iterations: $\tag{4} \begin{cases} R^0 x = x \\ R^n x = \langle R^n x,g_{\gamma_n}\rangle g_{\gamma_n}+R^{n+1}x\\ g_{\gamma_n} = \arg \max_{g_{\gamma_i} \in D } |\langle R^n x, g_{\gamma_i}\rangle | \end{cases}$

Orthogonality of $$R^{n+1} x$$ and $$g_{\gamma_n}$$ in each step implies energy conservation

$||x||^2 =\sum_{n=0}^{M-1} {|\langle R^n x, \;g_{\gamma_n}\rangle |^2} + ||R^M x||^2$

For a complete dictionary the procedure converges to x with $$M\rightarrow\infty$$ (Mallat and Zhang 1993). In practice we use finite expansions

$\tag{5} x\approx\sum_{n=0}^{M-1} {\langle R^n x,\; g_{\gamma_n}\rangle g_{\gamma_n} }$

Multivariate matching pursuit (MMP)

Multivariate extensions of the matching pursuit algorithm can be achieved in many significantly different ways, possibly corresponding to the aims of analysis and/or to the model of generation of the signal. For example, multivariate time series may stem from simultaneous measurements at different sensors (channels), as in the case of multichannel EEG/MEG, or from a joint analysis of subsequent repetitions of evoked fields or potentials. Variety of multivariate MP algorithms (MMPs) can be mostly attributed to:

(A) the structure of the multichannel dictionary (i.e., which parameters of the time-frequency atoms are allowed to vary across the jointly analyzed signals),

(B) to the criterion used for choosing (in each iteration) the atom, best fitting the residua in all the signals simultaneously.

The most straightforward multichannel extension of the MP can be defined by the following conditions (Gribonval 2003): (A) only the amplitude varies across channels, (B) we maximize the sum of squared products (energies) in all the channels: $\tag{6} \max_{g_{\gamma} \in D } \sum_{i=1}^{N_s} |\langle R^n\mathbf{x}^i, g_{\gamma}\rangle |^2 \ ,$

where $$N_s$$ stands for the number of simultaneously analyzed signals $$x^i\ .$$ A sub-optimal approach, proposed in (Durka et al. 2005) for preprocessing for EEG Inverse Solutions, relies on maximizing the sum of products $\tag{7} \max_{g_{\gamma} \in D } \left| \sum_{i=1}^{N_s} \langle R^n\mathbf{x}^i, g_{\gamma}\rangle \right| \ .$

Owing to the linearity of the residuum $$R\ ,$$ this assumption allows for implementation operating in each step on the average of signals. Similarly, we can allow for different phases in functions fitted to each channel/repetition $$x^i\ .$$ In the analysis of multichannel EEG/MEG, that would relate to a different model of propagation. If we take $$x^i$$ as repetitions of evoked potentials or fields, variable phase can partly account e.g. for the jitter of latencies. As exemplified above, other modifications of the above mentioned criteria (A) and (B) may lead to other variants of MMP.

Dictionary of time-frequency atoms

Dictionaries (D), commonly used for time-frequency analysis, are composed of Gabor functions, that is Gaussian envelopes modulated by sinusoidal oscillations. Real-valued Gabor function can be expressed as

$$\tag{8} g_\gamma (t) = K(\gamma)e^{-\pi{ \left( \frac{t-u}{s} \right) }^2} \cos\left(\omega (t-u)+\phi\right)$$

where $$\gamma=\{u, s, \omega, \phi\}\ ,$$ and $$K(\gamma)$$ is a normalization factor ensuring $$||g_{\gamma}||=1\ .$$ These functions provide a general and compact model for transient oscillations; the above equation can describe parametrically a wide variety of shapes ( Figure 1). Apart from that, Gabor functions exhibit compact time-frequency localization. Gabor dictionaries used for matching pursuit decomposition usually contain also complete Dirac and Fourier bases. Other functions can be also used with the matching pursuit algorithm.

In practical implementations of matching pursuit with Gabor dictionaries we must restrict the search to a discrete and finite subset of the continuous 3-dimensional space of parameters $$\{u, s, \omega\}\ ;$$ phase ($$\phi$$) is optimized separately in numerical implementations.

Time-frequency energy density

Wigner-Ville transform Figure 2: Cross-terms in time-frequency representations of signal's energy$(a + b)^2 = a^2 + b^2 + 2ab\ .$ Wigner-Ville transform of a signal composed of two sine waves with compact and separated support (below). The middle structure (labelled "2ab"), appearing in the time coordinates where the signal is flat, is a cross-term.

It can be simply demonstrated that the squared modulus of the Fourier transform of the signal x, that is the power spectrum of x, can be computed as the Fourier transform of the autocorrelation function of x (Wiener-Khinchine theorem). That is, for a real-valued signal x $\tag{9} |\mathcal{F}(x)|^2 = \mathcal{F} \left(\int_{-\infty}^{\infty} x(t+\frac{\tau}{2}) x(t-\frac{\tau}{2}) d\tau \right)$

where the Fourier transform of the signal x is given by $\tag{10} \mathcal F(x)=\int_{-\infty}^{\infty} e^{-i\omega t} x(t) dt$

Substituting explicitly (10) into (9) we get the following formula for the power spectral density of x: $\tag{11} |\mathcal{F}(x)|^2 = \int_{-\infty}^{\infty} e^{-i\omega \tau} \left(\int_{-\infty}^{\infty} x(t+\frac{\tau}{2}) x(t-\frac{\tau}{2}) dt \right) d \tau$

If we remove the middle integral, corresponding to the integration over time, we get a time-dependent spectral density as a 2-dimensional function $\tag{12} \mathcal{W}(x) = \int_{-\infty}^{\infty} e^{-i\omega \tau} x(t+\frac{\tau}{2}) x(t-\frac{\tau}{2}) d \tau$

which is the Wigner-Ville transform of x. This transform exhibits several elegant and desirable mathematical properties, hence it is sometimes considered a 'fundamental' time-frequency representation. Unfortunately, it has also one major drawback which is the presence of cross-terms (aka cross-components) in the time-frequency plane, as illustrated in Figure 2. Minimization of the presence of cross-terms in time-frequency estimates can be achieved by applying a smoothing kernel, designed for a particular signal as in the upper panel of Figure 6. Unfortunately, there is no general recipe for such smoothing which would give satisfactory results for an unknown signal.

Similarly to the Wigner-Ville transform of x, we can define a cross-Wigner transform of signals x and y: $\tag{Wigner1:label exists!} \mathcal{W}(x, y) = \int_{-\infty}^{\infty} e^{-i\omega \tau} x(t+\frac{\tau}{2}) y(t-\frac{\tau}{2}) d \tau$

MP-based estimate of energy density Figure 3: Middle plot: estimate of the time-frequency energy density (5), computed from the MP decomposition (14) of a 500-points simulated signal (plotted in red) composed of four Gabor functions, sine wave, one-point discontinuity, and a chirp (sine with linearly increasing frequency). Changing frequency of the chirp is represented as a series of Gabor functions. Upper plot--the same as middle, presented in 3-D. Square root of energy proportional to the height of the surface or "temperature" on 2-D plots.

Wigner distribution computed directly from (2) gives $\tag{13} \mathcal{W}(x) \approx \mathcal{W}\left( \sum_{n=0}^{M-1} a_n g_{\gamma_n} \right) =$ $$\sum_{n=0}^{M-1} a_n^2\, \mathcal{W}(g_{\gamma_n}) + \, \sum_{n=0}^{M-1} \sum_{k=1, k\not=n}^{M-1} a_n a_k \mathcal{W}\left( g_{\gamma_n}, g_{\gamma_k} \right)$$ where $$a_n=\langle R^n x,\; g_{\gamma_n}\rangle\ .$$ The double sum in (13) gathers cross-terms. Using linear expansion (5), we can omit them explicitly and construct the time-frequency representation of signal's energy density from the first sum, containing only auto-terms: $\tag{14} \mathcal{E} x = \sum_{n=0}^{M-1} |\langle R^n x,\; g_{\gamma_n}\rangle|^2 \mathcal{W}g_{\gamma_n}$

Technical issues

Matching pursuit provides a simple and intuitive decomposition of a signal, yet "under the hood" the procedure is highly nonlinear and complicated. First problem relates to the already mentioned need to select a discrete subset of the possible dictionary's functions for practical implementations. In general, it is expected, the larger (denser) the dictionary, the better the resulting matching pursuit decomposition at a higher computational cost. However, fixed schemes of subsampling the potentially continuous space of dictionary's parameters introduce a statistical bias, which becomes relevant when pooling together results of many matching pursuit decompositions (Durka et al. 2001). As for the number of waveforms in expansion (5), equal to the number of iterations of (4), there are several possible stopping criteria, but if we stop the iterations "too late" we may simply neglect the waveforms fitted in the latter iterations in further analysis. Reasonable settings for the dictionary size (density) and the number of iterations are important, because the computational cost of the matching pursuit, approximately proportional to these two factors, is still relatively high (Durka 2007). Figure 4: A failure in feature extraction: matching pursuit decomposition of a signal consisting of two Gabor functions, both present in the dictionary used for decomposition. R0 – analyzed signal, g0 – function fitted in the first iteration by the matching pursuit algorithm (4).

Not all the signal's structures can be efficiently modeled by Gabor functions. Some waveforms, as for example the chirp in Figure 3, may be expressed by several functions from the dictionary D, in which case the advantage of explicit parametrization is impaired. Nevertheless, we still get a robust and universal estimate of the time-frequency energy density, which is relatively independent of arbitrary factors. Influence of the prior choices, biasing other estimates of signals time-frequency energy density, is partly illustrated in Figure 6. Matching pursuit algorithm can be also used with different types of dictionaries, but up to now these attempts were mostly theoretical.

Finally, there is a price to pay for using the sub-optimal solution of (3). Greedy matching pursuit algorithm can fail also in cases when the signal is constructed entirely from the structures present in the dictionary. An example is given in Figure 4: a perfect solution to (3) would contain two Gabor functions, but, according to the greedy strategy of explaining maximum energy in a single step, in the first iteration matching pursuit chooses an intermediate waveform embracing both the structures. However, such an error can occur only in case when both the structures are perfectly in phase; in some cases one could argue that explaining them in term of one oscillation may be not so erratic.

Advantages in EEG/MEG analysis

Major advantages of the matching pursuit in analysis of biomedical signals stem from the explicit parametrization of transients, robust estimate of the time-frequency energy density, and combinations of these two. Extensive summary of the matching pursuit-based frameworks successfully applied to the EEG analysis is given in (Durka 2007).

Explicit parametrization of transients Figure 5: Time-frequency energy density (14) of matching pursuit decomposition (5) of 20 seconds of sleep EEG. Marked blobs correspond to sleep spindles, structures C-D and E-F were marked by an encephalographer in the raw signal as single spindles (during a visual analysis or the raw EEG trace).

Gabor functions (8) fitted to the signal in the matching pursuit procedure (4) are defined by their amplitude, occurrence in time u, frequency $$\omega\ ,$$ time span s and phase $$\phi\ .$$ If we assume that Gabor functions model appropriately the relevant components of the signal, then we can treat the fitted Gabors as representing the investigated phenomena. For example, EEG sleep spindles known from the visual analysis are usually defined as waveforms of frequencies in the 11-15 Hz range, with more or less generally assumed restrictions also on the amplitude and time span ( Figure 5). These restrictions can be implemented directly in the space of parameters of functions from the signal's expansion (5).

A robust and universal time-frequency estimate Figure 6: Different estimates of time-frequency energy density of a simulated 500-points signal plotted at the bottom (the same as in Figure 3). Two lower panels present spectrograms (windowed or short-time Fourier transforms) calculated for long (125 points in b) and short (21 points in c) windows. (d)--scalogram (continuous wavelet transform), (e)--smoothed pseudo Wigner-Ville distribution, (f)--the same as (e) in 3 dimensions.

The major problem in estimating the time-frequency energy density of signals is the trade-off between the time and frequency resolutions, known as the uncertainty principle in signal analysis. Different estimates offer different solutions to this issue, based upon different prior settings regulating this trade-off ( Figure 6). Spectrogram (windowed Fourier transform) gives uniform time-frequency resolution based upon the prior choice of the length of the analyzing window, scalogram (wavelet transform) gives higher time resolution for higher frequencies, while a variety of Wigner-derived distributions from the Cohen's class allow for almost any setting of these constrains. Nevertheless, these settings have to be decided a priori, before the analysis. Their improper choice may severely bias the results.

On the contrary, time-frequency estimate (14) derived from the matching pursuit decomposition (5) is based upon a representation, which adopts the time-frequency trade-off to the local signal structures in each iteration of the algorithm, yielding an estimate which is basically free from arbitrary settings--apart from choosing the Gabor functions.