Systems for Sparse Representations: Fourier Analysis, Wavelets & Shearlets

Author: Felix Voigtlaender
Description:

This is a rough transcript of the blackboard part of the tutorial on Wavelets and Shearlets which I presented at the DEDALE Tutorial Day, November 16, 2016.

The second part of these notes can be best (or maybe only) understood in combination with the associated slides.

Warning

This document is intended to serve as an informal tutorial. My aim is not that every statement is 100% mathematically correct or precise. Rather, I want to present the underlying ideas and techniques.

Contents

The general setting

This whole tutorial is concerened with different systems for sparsely representing certain signals, e.g. sound signals, images, videos, etc. Mathematically, these signals will be elements of a certain Hilbert space , e.g. ℋ = L2(ℝ) for sound signals or L2(ℝ2) for (continuous) images. In fact, for (quadratic, continuous) images, one might also use ℋ = L2([0, 1]2).

Of course, in practice, an image is not a function f:[0, 1]2 → ℝ, since this would require us to store uncountably many values. Instead, one usually models an image (consisting of discrete pixels) using the finite dimensional Hilbert space ℋ = ℂN × N (or ℋ = ℝN × N).

Formally, what we are looking for is a linear transformation

Φ:ℋ → ℋC

of into a certain coefficient Hilbert space C. To justify the name coefficient Hilbert space, this space should consist of (coefficient) sequences. Hence, we will assume C = ℓ2(I) for a certain index set I. Here,

2(I) = {c = (ci)i ∈ I : ci ∈ ℂ and i ∈ I|ci|2 < ∞}.

Typical examples are 2(ℤ) or 2(ℤ2), or 2({1, …, N}).

Hence, for each f ∈ ℋ, the transformation Φf is of the form

Φf = ((Φf)i)i ∈ I = (Φif)i ∈ I.

In fact, using the theory of Hilbert spaces (precisely, the Riesz representation theorem), one can show that there is a family (φi)i ∈ I in such that

Φf = (⟨f, φi⟩)i ∈ I for all f ∈ ℋ.

Of course, we want the transformation Φ to satisfy a number of nice properties:

  1. It should be possible to reconstruct f ∈ ℋ from its coefficients Φf in a stable way. In other words, there should be a (linear, continuous) reconstruction map R:ℋC → ℋ satisfying f = RΦf for all f ∈ ℋ.

    As we will explain later on, this is equivalent to requiring that the family (φi)i ∈ I forms a frame.

  2. For a certain class of important/interesting signals f ∈ G ⊂ ℋ (the good signals), we want the coefficients Φf to be sparse.

    In a finite dimensional setting, this means that most of the coefficients Φif should be zero, or at least negligible in comparison to the significant coefficients. In the infinite-dimensional setting, this is encoded by requiring Φf ∈ ℓp(I) for some p ≤ 1. Here, p(I) is defined similar to 2(I), i.e.,

    p(I) = {c = (ci)i ∈ I : ci ∈ ℂ and i ∈ I|ci|p < ∞}.

Motivation

The first question is of course why such a system for sparse signal representation is interesting at all. In fact, there are many areas where such systems are useful. In the following, we only mention three prominent examples.

  1. Such a system can be used for data compression: Since for a good signal f ∈ G, most of the coefficients Φif vanish, or are at least very small, one can save only the significant (large) coefficients, and set the remaining ones to zero, which yields the compressed coefficients Cf).

    This saves memory and the resulting (compressed) coefficients yield a good approximation to f, i.e., f − RCf)∥ is small.

    This general idea is the basis for the JPEG 2000 standard for image compression, which uses Wavelet systems as the system for sparse signal representations.

  2. In many applications, one wants to recover a signal f from incomplete measurements. Formally, this means that we only know (e.g. from taking certain measurements like MRI measurements) the quantitiy

    y = Ax for a measurement matrix A ∈ ℝn × N with nN.

    Since nN, this linear system is highly underdetermined, i.e., there is no unique solution if one has no additional a priori information about x.

    The main statement of the theory of compressive sensing, however, is that x is uniquely determined (and can be computed efficiently), if it is known that x is sparse. Of course, this only holds under additional suitable assumptions on the measurement matrix A. These assumptions, however, are outside of the scope of this tutorial.

    Typically, x will not be sparse in the standard basis. For example, for most natural images, it is not true that most pixels of the image are (almost) zero (black). Hence, to apply the theory of compressive sensing, one needs to know a system Φ which sparsifies x, i.e., such that Φx is sparse. Then we have

    y = Ax = ARΦx = ARz, 

    where the vector z = Φx is sparse. Now, one can apply compressive sensing methods to recover z = Φx and then also x = RΦx = Rz.

    Note

    For the theory of compressive sensing to be applicable, one needs to have a certain compatibility between the measurement matrix A and the sparsifying system Φ. This, however, is outside the scope of this tutorial.

  1. In the preceding point, we considered an underdetermined linear system in which the solution (if it exists) is not unique. But even if - in principle - there is a unique solution, i.e., if we want to reconstruct x from

    y = Ax where A ∈ ℝN × N is invertible, 

    simply computing this solution might be a bad idea.

    To see this, we consider a numerical example: In this example, we take a certain image (a racoon) and blur this image, by convolving it with a gaussian of a certain size. One can mathematically show (see The inverse of the convolution operation) that this blurring operation is invertible.

    However, once the blurring is strong enough, the numerical implementation of the inverse operation yields catastropical results:

    # First, we import some useful packages
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.misc
    from numpy.fft import fft2, ifft2, fftshift
    
    # We load the racoon image and display it for visual inspection
    im = scipy.misc.face(gray=True)
    im = im[:, :768]
    plt.imshow(im, cmap='gray')
    plt.colorbar()
    plt.show()
    
    figures/racoon.png
    # We compute and display the Gaussian with which we will convolve the racoon image.
    # In the plot below, we show a zoomed version of the actual plot
    grid = np.mgrid[-384:384, -384:384]
    gaussian = np.exp(- (grid[1]**2 + grid[0]**2) / 5)
    gaussian /= np.sum(gaussian)
    plt.imshow(gaussian, cmap='gray')
    plt.show()
    
    figures/small_gaussian_zoomed.png
    # We concolve the racoon image with the Gaussian and display the result
    blurred = ifft2(fft2(im) * fft2(fftshift(gaussian)))
    plt.imshow(np.real(blurred), cmap='gray')
    plt.show()
    
    figures/racoon_blurred.png
    # We apply the exact inverse operation of the blurring operation and display the result
    recon = ifft2(fft2(blurred) / fft2(fftshift(gaussian)))
    plt.imshow(np.real(recon), cmap='gray')
    plt.show()
    
    figures/racoon_small_gauss_recon.png
    # We plot the (very small) reconstruction error
    plt.imshow(np.abs(recon - im), cmap='gray')
    plt.colorbar()
    plt.show()
    
    figures/racoon_small_gauss_error.png
    # We convolve the racoon image with a wider Gaussian and display the Gaussian and the result
    # Again, the plot of the Gaussian which is shown below is a zoomed version
    gaussian = np.exp(- (grid[1]**2 + grid[0]**2) / 10)
    gaussian /= np.sum(gaussian)
    plt.imshow(gaussian, cmap='gray')
    plt.show()
    blurred = ifft2(fft2(im) * fft2(fftshift(gaussian)))
    plt.imshow(np.real(blurred), cmap='gray')
    plt.show()
    
    figures/large_gaussian_zoomed.png figures/racoon_large_blurred.png
    # Again, we apply the exact inverse operation and display the result, which is horrible
    recon = ifft2(fft2(blurred) / fft2(fftshift(gaussian)))
    plt.imshow(np.real(recon), cmap='gray')
    plt.colorbar()
    plt.show()
    
    figures/racoon_bad_recon.png

    The reason for this phenomenon is that while the blurring operation considered above is indeed invertible, it is very close to being noninvertible. One says that the deblurring problem is badly conditioned [1] or unstable.

    One can circumvent this unstability by regularization. As we saw above, the horrible reconstruction in the second case has very large numerical values. The idea of Tychonoff regularization is to penalize such a blow-up by solving the minimization problem

    argminxy − Ax2 + λx2

    for a certain parameter λ > 0. Here, the first term (the data fidelity term) forces the measurements Ax belonging to the reconstruction x to be close to the measurements y = Ax belonging to the ground truth x. The second term prevents the blow-up discussed above.

    Another possibility is to use the prior knowledge (or assumption) that the ground-truth x is sparse with respect to Φ. In this case, one solves the minimization problem

    argminxy − Ax2 + λ∥Φx1.

    For simplicity, we only present the numerical results using the Tychonoff regularization:

    # First, we import some useful packages and load the racoon image
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.misc
    from numpy.fft import fft2, ifft2, fftshift
    
    im = scipy.misc.face(gray=True)
    im = im[:, :768]
    
    # We compute the wide Gaussian from above
    grid = np.mgrid[-384:384, -384:384]
    gaussian = np.exp(- (grid[1]**2 + grid[0]**2) / 10)
    gaussian /= np.sum(gaussian)
    
    # We compute and display the blurred image (plot omitted here)
    blurred = ifft2(fft2(im) * fft2(fftshift(gaussian)))
    plt.imshow(np.real(blurred), cmap='gray')
    plt.show()
    
    # We compute the solution to the Tychonoff regularized problem
    # and display the reconstruction
    better_recon = ifft2(fft2(blurred) / (0.01 + fft2(fftshift(gaussian))))
    plt.imshow(np.real(better_recon), cmap='gray')
    plt.colorbar()
    plt.show()
    
    # We plot the reconstruction error
    plt.imshow(np.abs(better_recon - im), cmap='gray')
    plt.colorbar()
    plt.show()
    
    figures/racoon_tychonoff_recon.png figures/racoon_tychonoff_recon_error.png

    As we see, the reconstruction is much better, although not perfect, as the plot of the reconstruction error shows.

The Fourier transform

In a nutshell, the main theorem of Fourier analysis on the interval (0, 1) is that the family of pure frequencies (ek)k ∈ ℤ = (e2πik)k ∈ ℤ forms an orthonormal basis (ONB) of the Hilbert space L2((0, 1)).

In particular, this implies that the Fourier transform

ℱ:L2((0, 1)) → ℓ2(ℤ), f^f = (^f(k))k ∈ ℤ = (⟨f, e2πik⟩)k ∈ ℤ = (10f(x)⋅e − 2πikxdx)k ∈ ℤ

is a well-defined linear map and in fact an isometry, i.e., ^f2 = ∥fL2((0, 1)).

Furthermore, the properties of an orthonormal basis also show that we have

f = k ∈ ℤ^f(k)⋅e2πik, 

from which it follows that f can be reconstructed from its Fourier transform (coefficients) in a stable way.

Note

  1. The notation e2πik is a short-hand for the 1-periodic function ℝ → ℂ, xe2πikx.
  2. Similarly, the family (e2πik, •⟩)k ∈ ℤd forms an ONB for the Hilbert space L2((0, 1)d). For simplicity, however, we will concentrate on the one-dimensional case in what follows.
  3. There is also a variant of the Fourier transform for the Hilbert space L2(ℝd). But this version of the Fourier transform is more difficult to define, so we omit it here.

By what we just saw, the Fourier transform Φ = ℱ satisfies the first of our two desired properties from The general setting.

Our second desired property is that the Fourier transform of a function f should be sparse if f belongs to a certain class of "nice" functions. To see what this class of nice functions could be for the Fourier transform, let us first assume that f:ℝ → ℂ is 1-periodic (i.e., f(x + 1) = f(x) for all x ∈ ℝ) and continuously differentiable. In this case, a partial integration yields, for k ≠ 0,

^f’(k)  = 10f’(x)⋅e − 2πikxdx  = f(1)⋅e − 2πik⋅1 − f(0)⋅e − 2πik⋅0 − 10f(x)⋅( − 2πik)⋅e − 2πikxdx  = 2πik^f(k). (∗)

Here, we used that

f(1)⋅e − 2πik⋅1 = f(1) = f(0) = f(0)⋅e − 2πk⋅0, 

so that the boundary terms cancel out. By rearranging equation (∗), we see

|^f(k)| = (1)/(2π|k|)⋅|^f’(k)| ≤ (f’∥L1((0, 1)))/(2π|k|).

More generally, if f is 1-periodic and N times continuously differentiable, we get a very fast decay of the Fourier coefficients of f, namely |^f(k)|≲|k| − N for k ≠ 0.

Conversely, if the Fourier coefficients ^f(k) of f have a fast decay, namely, if k ∈ ℤ|^f(k)| < ∞, then one can show (using the so-called Weierstrass M-test) that the series

f(x) = k ∈ ℤ^f(k)⋅e2πikx

converges uniformly. Since each summand of the series is continuous, this implies that f itself must already be continuous. More generally, one can show that if the Fourier coefficients have very fast decay (we omit the details), then f has to be differentiable.

Important

We have thus seen that there is a very tight connetion between smoothness of f and the decay (sparsity) of the Fourier coefficients of f.

Conversely, this means that if f is not differentiable or not even continuous, then the Fourier coefficients of f cannot be sparse, since otherwise, the argument from above would imply that f is continuous (or even differentiable). Hence, even if f is badly behaved (discontinuous, non-differentiable) only at one point, the Fourier coefficients of f will behave badly.

Let us now see these theoretic arguments in practice:

# First, import packages for plotting and computing the Fourier transform
import numpy as np
import matplotlib.pyplot as plt
from numpy.fft import fft, ifft, fftshift

# Generate and plot a very smooth function f
N = 100
grid = np.linspace(0, 1, N)
f = 20 * grid**2 * (1 - grid)**2
plt.plot(grid, f)
plt.show()
figures/smooth_f_large.png
# Compute and display the Fourier coefficients of f
# -> they decay very quickly
plt.plot(np.arange(-N//2, N//2), np.abs(fftshift(fft(f))))
plt.show()
figures/smooth_f_fourier.png
# Create and plot a modified version g of f which is still very smooth,
# except at TWO POINTS.
g = np.copy(f)
g[N//4:3 * N //4] -= 1
plt.plot(grid, g)
plt.show()
figures/singular_g_large.png
# Compute and display the Fourier coefficients of g
# -> they have very bad decay!
plt.plot(np.arange(-N//2, N//2), np.abs(fftshift(fft(g))))
plt.show()
figures/singular_g_fourier.png

Frames for Hilbert spaces

As we saw in the preceding section, the pure frequencies are very well suited for sparsely representing functions that are smooth. However, even if the function only has one point at which it is badly behaved, the pure frequencies are not a good choice anymore, mainly because they are badly localized.

In the following, we will thus consider Wavelets and Shearlets as alternative systems. But at least for Shearlets (and also in many cases for Wavelets), one does not get an orthonormal basis for the Hilbert space under consideration. Hence, we will now briefly discuss the concept of frames. In a nutshell, these are a generalization of the concept of an orthonormal basis.

As explained in The general setting, we always consider a coefficient map of the form

Φ:ℋ → ℓ2(I), f↦(⟨f, φi⟩)i ∈ I.

One can then show that the following are equivalent:

  1. The map Φ is well-defined (i.e., i ∈ I|⟨f, φi⟩|2 < ∞ for all f ∈ ℋ) and one can reconstruct each f in a stable way from its coefficients Φf, i.e., there is a linear, continuous reconstruction map R:ℓ2(I) → ℋ satisfying RΦf = f for all f ∈ ℋ.

  2. There are positive constants A, B > 0 such that

    A⋅∥f2 ≤ i ∈ I|⟨f, φi⟩|2 ≤ B⋅∥f2

    holds for all f ∈ ℋ.

If either of these properties is satisfied, the family (φi)i ∈ I is called a frame for .

In fact, one can describe the reconstruction map more precisely: For each frame (φi)i ∈ I, there exists a dual frame (φi)i ∈ I which satisfies

f  = i ∈ If, φiφi  = i ∈ If, φiφi

for all f ∈ ℋ. This should be compared to the formula f = i ∈ If, φiφi which holds for orthonormal bases. Hence, for orthonormal bases, one possible dual frame (and in fact the only one) is the orthonormal basis itself.

Note

In a finite dimensional Hilbert space of dimension N, each orthonormal basis also has length N.

For a frame, this need not be the case: Since each element of the Hilbert space can be written as a linear combination of the frame elements, it follows that each frame (φi)i ∈ I has at least N elements, but in general, it can have much more. The quotient #I ⁄ N of the number of frame elements to the dimension of the space is called the redundancy of the frame.

Now, even if a highly redundant frame yields a very sparse representation in the sense that only a small fraction of the coefficients |⟨f, φi⟩| is not negligible, it can still be the case that a less redundant frame which yields a less sparse representation still leads to a smaller absolute number of non-negligible coefficients.

Thus, for data compression, one often still uses Wavelets instead of Shearlet or Curvelets, since although Wavelets usually provide less sparse expansions, they are much less redundant. For regularization purposes (cf. Motivation), however, sparsity is often more important than having a small redundancy.

Wavelets

For the definition of the different types of Wavelet systems, we refer to slide 3 of the associated slides to this document.

As indicated by the example on slide 4, wavelets yield fairly sparse representations of natural images. Furthermore, there is a mathematically rigorous theorem stating that Wavelets are suited optimally for approximating functions which are smooth apart from finitely many point singularities.

Hence, Wavelets yield a great improvement compared to the highly nonlocal Fourier basis, where one single singularity completely ruins the decay of the coefficients.

Let us now see the improvement of Wavelets in comparison to the Fourier basis in practice:

# First import packages for plotting and for computing Wavelet transforms
import numpy as np
import matplotlib.pyplot as plt
import pywt

# Compute and display the same smooth function f which we used above
# in the example with the Fourier transform
N = 128
grid = np.linspace(0, 1, N)
f = 20 * grid**2 * (1 - grid)**2
plt.plot(grid, f)
plt.show()
figures/smooth_f_large.png
# Compute and display the (absolute value) of the Wavelet coefficients
# -> These have a nice decay
coeff = pywt.wavedec(f, 'haar', level=4)
coeff_j = np.concatenate(coeff)
plt.plot(np.abs(coeff_j))
plt.show()
figures/smooth_f_wavelet_large.png
# Compute and display a modified version g of f which is smooth
# except AT TWO POINTS. This is the same g as for the Fourier transform example
g = np.copy(f)
g[N//4:3 * N //4] -= 1
plt.plot(grid, g)
plt.show()

# Compute and display the (absolute value) of the Wavelet coefficients of g
coeff2 = pywt.wavedec(g, 'haar', level=4)
coeff2_j = np.concatenate(coeff2)
plt.plot(np.abs(coeff2_j))
plt.show()
figures/singular_g_large.png figures/singular_g_wavelet_large.png

As we see, the wavelet coefficients of the more singular function do have a slightly worse decay than those of the smooth function, but the decay is significantly better than that of the Fourier coefficients from the preceding section.

We remark that we here used one of the simplest Wavelets, the so-called Haar-Wavelet. One could probably achieve even better sparsity by using a different (mother) Wavelet. A list of available Wavelets can be obtained via

pywt.wavelist()

Shearlets

As we saw above, Wavelets provide sparse representations for signals that only have finitely many point singularities. But as indicated on slide 5, natural images typically exhibit singularities (discontinuities) along curves instead of point-like singularities.

To capture these types of singularities in a precise mathematical definition, Donoho introduced the (highly simplified) model of cartoon-like functions, see slide 5. Donoho also proved a "no-go theorem" which provides a certain upper limit on how good any system can approximate cartoon-like functions, cf. slide 6. The same slide also shows that neither the Fourier basis, nor Wavelets do achieve this optimal approximation rate.

For Wavelets, the intuitive reasoning (shown on slide 6) is that since they use scalar dilations, all wavelets are approximately square shaped, which is not well-suited for approximating a curve. Instead, one needs building blocks which are elongated and which can adapt to the direction of the curve.

As seen on slide 8, the idea of Shearlets is

Thus, a first definition of Shearlets (see slide 9) is to take a single mother shearlet ψ and build the shearlet system only of translated, sheared and parabolically dilated versions of ψ.

However, as seen on slides 9 and 10, if one considers the associated Fourier tiling (i.e., the (essential) Fourier supports of the different elements of the Shearlet system), then this tiling behaves very differently with respect to the ξ1 direction and the ξ2 direction. This is undesirable in practice: For any "nice" image f, if we flip the x- and y-axis, we should still have a "nice" image, which should thus have a sparse representation using the shearlet system.

The solution to this problem is to divide the frequency plane into a horizontal and a vertical cone and a low-frequency part, cf. slide 10. For each of these regions, one then uses a different generator, as explained on slide 11.

Slide 13 shows that one can even construct shearlet frames which are compactly supported, which is important for theoretical, as well as for practical problems.

Finally, slide 14 shows that shearlets provide the optimal approximation rate (and also sparsity of the Shearlet coefficients) for the class of cartoon-like functions. This cannot be improved in view of Donoho's "no-go theorem" (slide 6).

Finally, let us compare the performance of Wavelets and Shearlets in a concrete example:

# Import packages for plotting and computing Wavelet and Shearlet transforms
# Here, we use the (alpha)-shearlet transform that we implemented as part of the DEDALE project
from AlphaTransform import AlphaShearletTransform as AST
import numpy as np
import matplotlib.pyplot as plt
import scipy.misc
import pywt

# Load and display a sample image (the usual racoon image)
im = scipy.misc.face(gray=True)
im = im[:, :768]
plt.imshow(im, cmap='gray')
plt.show()
figures/racoon.png
# Compute and plot (the absolute value of) the shearlet coefficients
trafo = AST(768,768, [0.5]*3, subsampled=True, verbose=False)
coeff = trafo.transform(im, do_norm=False)
coeff_j = np.concatenate([c.ravel() for c in coeff])
plt.plot(np.abs(coeff_j))
plt.show()
figures/shearlet_coeffs_large.png
# Compute and plot (the absolute value of) the wavelet coefficients using the Haar mother wavelet
coeff2 = pywt.wavedec2(im, 'haar', level=3)
coeff2_joined = np.concatenate([coeff2[0].ravel()] + [c.ravel() for cs in coeff2[1:] for c in cs])
plt.plot(np.abs(coeff2_joined))
plt.show()
figures/haar_wavelet_coeffs_large.png
# Compute and plot (the absolute value of) the wavelet coefficients using a Daubechies mother wavelet
coeff3 = pywt.wavedec2(im, 'db3', level=3)
coeff3_joined = np.concatenate([coeff3[0].ravel()] + [c.ravel() for cs in coeff3[1:] for c in cs])
plt.plot(np.abs(coeff3_joined))
plt.show()
figures/daub_wavelet_coeffs_large.png

Here, we clearly see one of the phenomena that was already discussed at the end of the Section Frames for Hilbert spaces: The shearlet coefficients are very sparse in the sense that a very large fraction of the coefficients is negligible, while the wavelet coefficients are slightly less sparse (for both choices of the mother wavelet). Note though that the total number of wavelet coefficients is identical to the dimension of the image (7682 = 589824 ≈ 600000 pixels), while the number of shearlet coefficients is 5670929, which is about 10 times the number of pixels in the original image.

Summary

In this tutorial, we discussed different systems which yield sparse representations for different classes of signals.

  1. The Fourier basis is very well adapted to smooth functions. However, even if the function has only one single singularity, the Fourier basis is usually a bad choice.
  2. In contrast, Wavelets are well adapted to functions which only have point singularities. In dimension 1 (e.g., for sound signals), this is essentially the only type of singularity that can occur. But in higher dimensions (already for images), signals tend to have more complicated singularities, e.g., along curves.
  3. For those signals which exhibit curved singularities, Shearlets provide optimally sparse representations.

We saw that these representation systems can be used for data compression, for compressive sensing and for the regularization of inverse problems.

Finally, we saw that Wavelets as well as Shearlets and Curvelets have a certain associated tiling of the Fourier domain which determines many of the properties of the respective system.

Footnotes

[1]As seen above, the inverse operation uses a division by the Fourier transform of the Gaussian. While it is nonzero, the Fourier transform of the Gaussian is almost zero at most frequencies. Hence, we are dividing by a value which is almost zero. Together with very small, unavoidable rounding errors, this explains the catastrophic result from above.