Optimization

Introduction

We apply large-scale optimization techniques to a wide variety of problems in seismology. For problems whose models and data live in high dimensions, we must exploit some form of latent structure in order to solve them efficiently.

Matrix completion

(Software overview available here)

When completing a matrix with missing entries using an SPGL1-type approach (see Aravkin et al., 2013), we solve the following problem \[ \begin{equation} \label{bpdn} \begin{aligned} \min \; & \|X\|_* \\ \text{s.t.} \; & \|\mathcal{A}(X) - b\|_2 \le \sigma. \end{aligned} \end{equation} \] Minimizing the nuclear norm \(\|X\|_*\) promotes low-rank structure in the final solution, but (\(\ref{bpdn}\)) is difficult to solve directly because it is a constrained problem with a nonsmooth objective. Instead, we solve \[ \begin{equation} \label{lasso} \begin{aligned} \min \; & \|\mathcal{A}(X) - b\|_2^2 \\ \text{s.t.} \; & \|X\|_* \le \tau \end{aligned} \end{equation} \] for a sequence of algorithmically chosen parameters \(\tau\), eventually arriving at a solution of (\(\ref{bpdn}\)). Each of these subproblems can be solved in a straightforward manner with a projected gradient method, \[ \begin{equation*} \begin{aligned} X_{k+1} &= P_{\tau}( X_{k} - \alpha_{k} g_k ), \\ P_{\tau}(Y) &= \arg\min_{Z} \{ \|Z - Y\|_F : \|Z\|_* \le \tau \}, \end{aligned} \end{equation*} \] where \(g_k\) is the gradient of the objective and \(\alpha_k\) is a step length. Unfortunately, when \(X\) is a reshaped version of the 3D data volume that we are trying to interpolate, \(X\) can easily have hundreds of thousands or even millions of rows and columns. Computing the projection \(P_{\tau}(Y)\) requires the SVD of \(Y\), which becomes cost-prohibitive for large-scale problems.
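
As a rough illustration, the following NumPy sketch implements this projected gradient loop with the SVD-based nuclear-norm projection. It is not the SPG\(\ell1\) implementation: the function names, the fixed step size, and the iteration count are our own assumptions, and the measurement operator \(\mathcal{A}\) and its adjoint are passed in as function handles (for completion from missing entries, \(\mathcal{A}\) simply restricts \(X\) to the observed positions).

```python
import numpy as np

def project_l1_ball(s, tau):
    """Euclidean projection of a nonnegative vector s onto the l1 ball of radius tau."""
    if s.sum() <= tau:
        return s
    u = np.sort(s)[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u - (cssv - tau) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (cssv[rho] - tau) / (rho + 1)
    return np.maximum(s - theta, 0.0)

def project_nuclear_ball(Y, tau):
    """Projection onto {Z : ||Z||_* <= tau}: an SVD plus an l1 projection of the singular values."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * project_l1_ball(s, tau)) @ Vt

def lasso_projected_gradient(A, At, b, shape, tau, step=1e-3, iters=200):
    """Minimize ||A(X) - b||_2^2 subject to ||X||_* <= tau by projected gradient.

    A and At are function handles for the measurement operator and its adjoint."""
    X = np.zeros(shape)
    for _ in range(iters):
        r = A(X) - b                              # residual
        g = 2.0 * At(r)                           # gradient of ||A(X) - b||_2^2
        X = project_nuclear_ball(X - step * g, tau)  # the SVD here is the expensive step
    return X
```

The SVD inside `project_nuclear_ball` is exactly the bottleneck discussed above once \(X\) has more than a few thousand rows and columns.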

Instead, we use an alternative characterization of the nuclear norm of \(X\), \[ \begin{equation*} \begin{aligned} \|X\|_* = \min_{L,R} \; & \dfrac{1}{2} (\|L\|_F^2 + \|R\|_F^2), \\ \text{s.t.} \; & X = LR^T \end{aligned} \end{equation*} \] where \(L\) and \(R\) are matrices with \(k\) columns, so that \(LR^T\) has rank at most \(k\). If we replace \(X\) in (\(\ref{lasso}\)) by \(LR^T\), we can instead compute approximate projections onto the ball \[ \begin{equation} \label{lrball} \{ (L,R) : \tfrac{1}{2} (\|L\|_F^2 + \|R\|_F^2) \le \tau \}. \end{equation} \] Projecting onto (\(\ref{lrball}\)) only involves computing the Frobenius norms of \(L\) and \(R\) and scaling both variables by a common constant, which is considerably cheaper than computing the SVD of the original matrix \(X\). Using this \(L,R\)-factorization approach, we solve large-scale matrix completion problems to recover seismic data with missing sources/receivers.
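
For illustration, the projection onto (\(\ref{lrball}\)) amounts to a single rescaling; a minimal NumPy sketch (function name ours) is:

```python
import numpy as np

def project_lr_ball(L, R, tau):
    """Scale (L, R) so that 0.5*(||L||_F^2 + ||R||_F^2) <= tau.

    No SVD is needed: the constraint set is a Euclidean ball in the stacked
    variable (L, R), so the projection is a rescaling by a common constant
    whenever the constraint is violated."""
    nrm = 0.5 * (np.linalg.norm(L, 'fro')**2 + np.linalg.norm(R, 'fro')**2)
    if nrm <= tau:
        return L, R
    scale = np.sqrt(tau / nrm)
    return scale * L, scale * R
```

This costs only a pass over the entries of \(L\) and \(R\), independent of the rank of the factors.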

Numerical Results

| \(\sigma\) | 0.1 | 0.01 | 0.001 | 0.0001 |
| --- | --- | --- | --- | --- |
| TFOCS SNR (dB) | 15.5 | 24.7 | 25.7 | 25.8 |
| TFOCS time (s) | 42 | 298 | 686 | 726 |
| \(SPG\ell1\) SNR (dB) | 15.7 | 24.7 | 25.7 | 24.1 |
| \(SPG\ell1\) time (s) | 1.4 | 3.9 | 9.5 | 6.9 |
| LR SNR (dB) | 15.8 | 31.9 | 49.0 | 62.8 |
| LR time (s) | 0.6 | 0.7 | 5.3 | 3.9 |

TFOCS versus classic \(SPG\ell 1\) (using SVDs to solve (\(\ref{lasso}\))) versus the LR factorization on a synthetic low-rank example: completing a rank-20, \(100 \times 100\) matrix with 50% missing entries. Computational time and SNR are shown for \(\sigma = 0.1, 0.01, 0.001, 0.0001\); the rank of the factors is taken to be 20.

| \(\sigma\) | 0.2 | 0.1 | 0.05 | 0.04 |
| --- | --- | --- | --- | --- |
| TFOCS SNR (dB) | 9.5 | 11 | 11.3 | 11.4 |
| TFOCS time (s) | 39 | 140 | 460 | 377 |
| \(SPG\ell1\) SNR (dB) | 12.8 | 17.0 | 19.1 | 19.3 |
| \(SPG\ell1\) time (s) | 24.6 | 34.6 | 56.2 | 59.1 |
| LR SNR (dB) | 15.8 | 31.9 | 49.0 | 62.8 |
| LR time (s) | 0.85 | 1.3 | 8.5 | 10.7 |

TFOCS versus classic \(SPG\ell 1\) (using SVDs to solve (\(\ref{lasso}\))) versus the LR factorization on a seismic example: matrix completion of a low-frequency slice at 10 Hz, extracted from the Gulf of Suez data set, with 50% missing entries. Computational time and SNR are shown for \(\sigma = 0.2, 0.1, 0.05, 0.04\); the rank of the factors is taken to be 35.

Figure 1: (a) True data, (b) 50% missing sources, (c) \(L,R\)-interpolated data (18.5 dB), (d) difference.

Tensor completion

(Software overview available here)

We can also exploit the tensor structure of seismic data for interpolating frequency slices with missing sources or receivers.

An efficient format for representing high-dimensional tensors is the so-called Hierarchical Tucker (HT) format. As depicted in Figure 2, we first reshape the 4D tensor \(X\) into a matrix by placing its first and second dimensions along the rows and its third and fourth dimensions along the columns, denoted \(X^{(1,2)}\). We then perform an ‘SVD-like’ decomposition on \(X^{(1,2)}\), resulting in the matrices \(U_{12}\), \(U_{34}\), and \(B_{1234}\). The insight of the Hierarchical Tucker format is that the matrix \(U_{12}\) still indexes the first and second dimensions along its rows, so when we reshape it back into a \(3\)-dimensional tensor, we can further ‘split apart’ dimension \(1\) from dimension \(2\) in the same ‘SVD-like’ fashion.
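
The reshaping and the first ‘SVD-like’ split can be sketched in a few lines of NumPy. This is only an illustration of the matricization step (index ordering and truncation strategy are our assumptions), not the full recursive HT construction:

```python
import numpy as np

def matricize_12(X):
    """Reshape a 4D tensor X (n1 x n2 x n3 x n4) into the matrix X^{(1,2)}:
    dimensions 1,2 are grouped along the rows, dimensions 3,4 along the columns."""
    n1, n2, n3, n4 = X.shape
    return X.reshape(n1 * n2, n3 * n4)

def svd_like_split(X, k):
    """One level of the HT construction: a rank-k 'SVD-like' decomposition
    X^{(1,2)} ~= U12 @ B1234 @ U34.T, with U12 reshaped back to a 3D tensor
    so that dimension 1 can later be split apart from dimension 2."""
    n1, n2, n3, n4 = X.shape
    U, s, Vt = np.linalg.svd(matricize_12(X), full_matrices=False)
    U12 = U[:, :k]                        # column basis for the (1,2) group
    U34 = Vt[:k, :].T                     # column basis for the (3,4) group
    B1234 = np.diag(s[:k])                # root transfer matrix
    U12_tensor = U12.reshape(n1, n2, k)   # ready for the next split along dims 1 and 2
    return U12_tensor, U34, B1234
```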

Figure 2: A visualization of the HT format for a \(4\)-tensor \(\mathbf{X}\) of size \(n_1 \times n_2 \times n_3 \times n_4\).

In the example above, the intermediate matrices \(U_{12}, U_{34}\) do not need to be stored: the small parameter matrices \(U_i\) and the small \(3\)-dimensional transfer tensors \(B_t\) specify the tensor \(\mathbf{X}\) completely. The number of parameters needed to specify this class of tensors is bounded by \[ \begin{equation*} d N K + (d - 2) K^3 + K^2 \end{equation*} \] where \(d\) is the number of dimensions, \(N = \max_{i=1,\dots,d} n_i\) is the largest individual dimension, and \(K\) is the maximum internal rank parameter. When \(K \ll N\) and \(d \ge 4\), this quantity is much smaller than the \(N^d\) entries needed to store the full array, thereby sidestepping the so-called curse of dimensionality.
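
To make the savings concrete, here is the bound evaluated for an illustrative (hypothetical) 4D volume with \(N = 200\) points per dimension and rank \(K = 20\):

```python
def ht_storage(d, N, K):
    """Upper bound on the number of parameters in the HT format: d*N*K + (d-2)*K^3 + K^2."""
    return d * N * K + (d - 2) * K**3 + K**2

# 4*200*20 + 2*20**3 + 20**2 = 32,400 parameters for the HT representation,
# versus 200**4 = 1,600,000,000 entries for the full array.
print(ht_storage(4, 200, 20))   # 32400
print(200**4)                   # 1600000000
```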

If we collect the HT parameters in a single vector \(x = (U_t, B_t)\) and write \(\phi(x)\) for the fully expanded tensor, then the tensor completion problem we are interested in solving is \[ \begin{equation*} \min_x \; \dfrac{1}{2}\|\mathcal{A} \phi(x) - b\|_2^2. \end{equation*} \] The HT format exhibits the structure of a Riemannian manifold, which is a multidimensional generalization of smooth surfaces in \(\mathbb{R}^3\).
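
In code, the misfit being minimized looks like the following minimal sketch, where `expand` stands in for the (not shown) routine \(\phi\) that maps the HT parameters to the full tensor, and the sampling operator \(\mathcal{A}\) is taken to be a restriction to observed entries; both are assumptions for illustration only:

```python
import numpy as np

def misfit(x, expand, mask, b):
    """Data misfit 0.5*||A(phi(x)) - b||_2^2 for tensor completion.

    expand : callable mapping HT parameters x = (U_t, B_t) to the full tensor (phi)
    mask   : boolean array selecting the observed entries, so A(phi(x)) = phi(x)[mask]
    b      : vector of observed data"""
    residual = expand(x)[mask] - b
    return 0.5 * np.dot(residual, residual)
```

The Riemannian structure is what lets us minimize this misfit while moving only along directions that keep the iterates in HT form.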

Figure 3: The HT manifold is a lower-dimensional object sitting in a higher-dimensional tensor space and, as such, admits a subspace of “admissible directions” (the tangent space) at a particular point \(x\) along which a point “stays” on the manifold. By exploiting this structure, we develop solvers that are fast and SVD-free, as in the \(L,R\)-factorization case. See Silva and Herrmann (2014) for more details.

Numerical results

We apply these techniques to a synthetic frequency slice provided to us by BG Group, with \(68 \times 68\) sources and \(201 \times 201\) receivers. We randomly remove 75% of the receiver pairs from the volume and interpolate using our HT techniques.


Figure 4: (a) True data, (b) 75% missing receivers, (c) HT-interpolated data (17.4 dB), (d) difference.

Applications to other seismic problems

There are still large open questions as to how to apply these low-rank ideas to other seismic problems such as migration or full waveform inversion. Although low-rank parametrizations can offer significant computational benefits, it remains challenging to find an appropriate, physically meaningful “transform” under which the resulting problem becomes low-rank. Ongoing work uses these ideas to work implicitly with high-dimensional data volumes, such as full-offset extended images, in a computationally feasible way.

References

Aravkin, A. Y., Kumar, R., Mansour, H., Recht, B., and Herrmann, F. J., 2013, An SVD-free Pareto curve approach to rank minimization: Retrieved from https://www.slim.eos.ubc.ca/Publications/Public/TechReport/2013/kumar2013ICMLlr/kumar2013ICMLlr.pdf

Silva, C. D., and Herrmann, F. J., 2014, Optimization on the Hierarchical Tucker manifold - applications to tensor completion: UBC. Retrieved from https://www.slim.eos.ubc.ca/Publications/Public/TechReport/2014/dasilva2014htuck/dasilva2014htuck.pdf