Full wavefield inversion (FWI) is a nonlinear inversion technique that recovers the earth model by minimizing the mismatch between the simulated and the observed seismic wavefields. Due to its huge computational cost, current implementation of FWI often utilizes local optimization techniques to optimize the model parameters. A widely used local optimization technique is the gradient-based first-order approach, such as steepest descent and nonlinear conjugate gradient (Tarantola, 1984). The gradient-only first-order approach is relatively efficient, because it requires computing only the gradient of the objective function, a vector containing the first-order partial derivatives of the objective function with respect to the model parameters, but its convergence is usually slow.
The convergence can be significantly improved by using the second-order optimization technique, which uses not only the gradient information, but also the curvature information of the objective function. The main difference between the first- and the second-order approach is that the second-order approach preconditions the gradient with the inverse of the Hessian, such as Gauss-Newton/Newton method (Pratt, 1998), or the inverse of a projected Hessian, such as the subspace approach (Kennett, 1988). The Hessian is a matrix containing second-order partial derivatives of the objective function with respect to the model parameters. The second-order approach is attractive not only because of its fast convergence rate, but also because of its capability to properly scale the gradient for different parameters and provide meaningful updates for parameters with different units in the context of multi-parameter inversion. The parameter scaling using the Hessian can be crucial in multi-parameter inversion, especially when one wants to simultaneously invert multiple parameters. Computing the inverse of the Hessian or the Hessian itself or even the product of the Hessian and a vector, however, is very expensive, and it is the main obstacle that prevents the second-order approach from being widely used in practice.
In the present invention, the full Hessian is replaced with a banded matrix, assuming that the Hessian is sparse and the most significant entries are around its diagonals and subdiagonals. By doing so, the action of the Hessian on a vector, i.e., Hessian-vector product, becomes a sparse matrix multiplying a vector, and it can be very efficiently calculated. Computing the action of the Hessian-vector product is the main building block in both the Gauss-Newton/Newton approach and the subspace approach. Therefore, reducing the computational cost of the action of the Hessian-vector product is essential to reducing the cost of the second-order approach.
Review of the Second-Order Approach
The Gauss-Newton/Newton approach requires solving the following linear system at every nonlinear iteration:Hgnew=g,  (1)where H is the Hessian matrix, g is the gradient, and gnew is the preconditioned new gradient. The above equation is usually solved iteratively using the linear conjugate gradient algorithm, where the Hessian-vector product needs to be computed at each linear iteration. Equation (1) may be inverted to get the preconditioned gradient. This may be done iteratively, and a typical algorithm for doing so may be found on page 111 of Numerical Optimization, by Nocedal and Wright (2000), which may be summarized as follows:
 Given gnew0Set r0 ← Hgnew0 − g, p0 ← r0, k ← 0while rkTrk is bigger than tolerance              γ      k        ←                            r          k          T                ⁢                  r          k                                      p          k          T                ⁢                  Hp          k                      ;  gnewk+1 ← gnewk + γkpk; rk+1 ← rk + γkHpk;              λ              k        +        1              ←                            r                      k            +            1                    T                ⁢                  r                      k            +            1                                                r          k          T                ⁢                  r          k                      ;  pk+1 ← −rk+1 + λk+1pk; k ← k + 1end while.
Instead of solving equation (1), which is huge (for example, if the model contains N parameters to be inverted for, and each parameter has M samples, the Hessian then contains N2×M2 samples), the subspace approach projects the Hessian into a lower-dimensional space, hence a much smaller linear system to solve. For the case of inverting two parameters, it results in a 2×2 system as shown in equation (2). Because of the projection, the subspace approach uses less second order information. In the subspace approach, a projected Hessian needs to be inverted at every nonlinear iteration. For simplicity, taking inverting two parameters as an example, the following two-by-two system may be solved at every nonlinear iteration (generalization to inversion of more than two parameters is straightforward).
                                                        (                                                                                                                  s                        1                        T                                            ⁢                                              Hs                        1                                                                                                                                                s                        1                        T                                            ⁢                                              Hs                        2                                                                                                                                                                                s                        2                        T                                            ⁢                                              Hs                        1                                                                                                                                                s                        2                        T                                            ⁢                                              Hs                        2                                                                                                        )                        ⁢                          (                                                                    α                                                                                        β                                                              )                                =                      -                          (                                                                                                                  g                        T                                            ⁢                                              s                        1                                                                                                                                                                                g                        T                                            ⁢                                              s                        2                                                                                                        )                                      ,                            (        2        )            where α and β are constants used to scale different gradient components as discussed later; and g is the gradient containing components of both parameters
      g    =          (                                                  g              1                                                                          g              2                                          )        ,where g1 and g2 are the gradients for the first and the second parameter. Vectors s1 and s2 are the basis vectors defined as follows:
                                          s            1                    =                      (                                                                                -                                          g                      1                                                                                                                    0                                                      )                          ,                              s            2                    =                      (                                                            0                                                                                                  -                                          g                      2                                                                                            )                          ,                            (        3        )            where 0 denotes a vector containing zeros. Once the two-by-two system (equation 3) is solved, we get the preconditioned new gradient as follows:gnew=−αs1−αs2.  (4)
The construction of the two-by-two system requires computing two Hessian-vector products, i.e., Hs1 and Hs2. In general, if the subspace approach is used to invert N parameters, a Hessian-vector product needs to be evaluated N times at every nonlinear iteration.
Therefore, the cost of either Gauss-Newton/Newton or the subspace approach is directly related to the cost of computing the Hessian-vector product. The Hessian-vector product is usually computed using linearized modeling (Born modeling) followed by an adjoint modeling or using the finite difference approximation, both of which requires calling the simulator to do wavefield forward/adjoint modelings. The computational cost is typically two FWI gradient evaluations. An example of such a method is PCT patent application publication WO 2013/081752, by Lee and Baumstein, which approximates the exact Hessian-vector product using finite-difference approximations at a cost roughly equivalent to two FWI gradient calculations, which involves wavefield propagation as well. The present invention instead replaces the exact Hessian using a PSF-approximated Hessian as described below. Since the PSF-approximated Hessian of the present invention is very sparse and moreover does not need to be recomputed every time, the cost of computing its product with a vector is significantly smaller than computing the product of the exact Hessian and the vector.