WO2013033651A1 - Full elastic wave equation for 3d data processing on gpgpu - Google Patents


Info

Publication number
WO2013033651A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
wave
stress
velocity
lag
Prior art date
Application number
PCT/US2012/053546
Other languages
French (fr)
Inventor
Igor PODLADTCHIKOV
Brad Artman
Original Assignee
Spectraseis Ag
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spectraseis Ag filed Critical Spectraseis Ag
Publication of WO2013033651A1 publication Critical patent/WO2013033651A1/en

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01VGEOPHYSICS; GRAVITATIONAL MEASUREMENTS; DETECTING MASSES OR OBJECTS; TAGS
    • G01V1/00Seismology; Seismic or acoustic prospecting or detecting
    • G01V1/28Processing seismic data, e.g. for interpretation or for event detection
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01VGEOPHYSICS; GRAVITATIONAL MEASUREMENTS; DETECTING MASSES OR OBJECTS; TAGS
    • G01V2210/00Details of seismic processing or analysis
    • G01V2210/60Analysis
    • G01V2210/67Wave propagation modeling
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01VGEOPHYSICS; GRAVITATIONAL MEASUREMENTS; DETECTING MASSES OR OBJECTS; TAGS
    • G01V2210/00Details of seismic processing or analysis
    • G01V2210/60Analysis
    • G01V2210/67Wave propagation modeling
    • G01V2210/679Reverse-time modeling or coalescence modelling, i.e. starting from receivers

Definitions

  • Fig. 8 illustrates a global memory address configuration.
  • Aligning data in this way poses three major problems: 1) the operands (the variables Vx, Vz, σxx, σzz and σxz) are potentially scattered all over global memory, so to update Vx from σxx and σxz, potentially large distances in memory space may need to be traversed, and this applies to all 5 equations (equations 2 to 6); 2) operands for a dz operation are at least nx-1 addresses apart; 3) the element sizes are different: Vx is nx+1 by nz, Vz is nx by nz+1, σxx and σzz are nx by nz, and σxz is nx-1 by nz-1.
  • Fig. 9 illustrates a global memory address configuration with the element sizes made equal.
  • Fig. 10 illustrates a global memory address configuration for thread-block grids. For simplicity, the thread-block grid has nine sectors, but it will be appreciated that thread-block grids usually have a larger number of sectors.
  • The data are accessed by thread-blocks, each of which in this example is a team of 16 by 16 threads.
  • The data are arranged such that everything a thread-block will ever need (except halos) is in one place, globally arranged block after block.
  • A block contains portions of Vx, Vz, σxx, σzz and σxz, exactly the corresponding data that thread-blocks require.
  • A thread-block "knows" its coordinates inside the thread-block grid (gx, gy) and can use those coordinates to access its relevant data.
  • TRI processing includes computing imaging conditions. Imaging conditions collapse the propagation time axis into physical space to allow visualization of wave-field focusing. The absolute particle velocity and the P- and S-wave potentials may be computed after elastic propagation from spatial derivatives of the wave field.
  • Example imaging conditions include i) the zero-lag of the P-wave autocorrelation, ii) the zero-lag of the S-wave autocorrelation, iii) the zero-lag of the cross-correlation of the P- and S-wave energy densities, iv) the autocorrelation of the absolute value of particle motion, v) the maximum over all time, and vi) the cross-correlation of the energy density functions EpEs. Additionally, these imaging conditions may be generally applied to any of: i) particle velocity measurements, ii) particle acceleration measurements, iii) particle pressure measurements and iv) particle displacement measurements.
  • The imaging conditions are updated on every grid point at every time step by reading the previous time-step value from global memory, calculating the incremental value, and then writing the updated value back to global memory.
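  • The kernel below is a hedged illustration of such an update for three of the conditions, not the patent's kernel: the P and S wave potentials are approximated by the divergence and curl of particle velocity, the staggered-grid offsets are ignored, and vx and vz are treated as collocated nx-by-nz arrays. The kernel name and all parameter names are assumptions.
```cuda
#include <cmath>

// Hedged sketch: accumulate three zero-lag imaging conditions at one time step.
__global__ void imagingConditionSketch(const float* vx, const float* vz,
                                       float* imgAbsV, float* imgPP, float* imgSS,
                                       int nx, int nz, float inv2dx, float inv2dz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iz = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < 1 || iz < 1 || ix >= nx - 1 || iz >= nz - 1) return;  // skip grid edges

    int i = iz * nx + ix;
    // Centered spatial derivatives of the velocity field.
    float dvx_dx = (vx[i + 1]  - vx[i - 1])  * inv2dx;
    float dvz_dz = (vz[i + nx] - vz[i - nx]) * inv2dz;
    float dvx_dz = (vx[i + nx] - vx[i - nx]) * inv2dz;
    float dvz_dx = (vz[i + 1]  - vz[i - 1])  * inv2dx;

    float p  = dvx_dx + dvz_dz;                        // divergence ~ P-wave potential
    float s  = dvz_dx - dvx_dz;                        // curl (2D)  ~ S-wave potential
    float av = sqrtf(vx[i] * vx[i] + vz[i] * vz[i]);   // absolute particle velocity

    // Zero-lag autocorrelations: accumulate squared values over the time loop.
    imgAbsV[i] += av * av;
    imgPP[i]   += p * p;
    imgSS[i]   += s * s;
}
```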
  • The propagator consists of two kernels, one to update the stress elements (σxx, σzz, σxz) and the other to update the velocities (Vx, Vz). For each time step, stress is updated first, then velocity. Each thread updates either the stresses or the velocities at a given array element location.
  • The free-surface condition is achieved by setting the stress value to zero at the surface boundary during the velocity pass.
  • The absorbing boundaries may be applied as a cosine taper function to the other three sides during the stress pass.
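  • One common way to realize such a cosine-tapered absorbing layer is sketched below. This is a hedged illustration only; the layer width TAPER_W, the function names and the exact taper shape are assumptions, not taken from the patent.
```cuda
#include <math.h>

#define TAPER_W 40   // assumed absorbing-layer width in grid points

// Smooth damping factor: ~0 at the outer edge, 1 once TAPER_W points inside.
__host__ __device__ inline float cosineTaper(int distToEdge)
{
    if (distToEdge >= TAPER_W) return 1.0f;              // outside the absorbing layer
    float x = (float)distToEdge / (float)TAPER_W;        // 0 at the edge, 1 inside
    return 0.5f * (1.0f - cosf(x * 3.14159265f));
}

// Damp a stress value near the left, right and bottom edges; the top (iz = 0)
// is left untouched because it carries the free-surface condition.
__host__ __device__ inline float absorb(float stress, int ix, int iz, int nx, int nz)
{
    int d = ix;                                           // distance to the left edge
    if (nx - 1 - ix < d) d = nx - 1 - ix;                 // ... or the right edge
    if (nz - 1 - iz < d) d = nz - 1 - iz;                 // ... or the bottom edge
    return stress * cosineTaper(d);
}
```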
  • Wave field decomposition and imaging conditions are computed in a separate kernel from the propagation.
  • The computational complexity is on par with the elastic element updates, so roughly half of the computation time is spent on elastic element updates and the other half on wave field decomposition and updating imaging conditions.
  • The propagator may also be used for forward time modeling by recording particle velocities at desired locations.
  • Fig. 11 illustrates a GPU data arrangement, which is a visualization of 3D elements (e.g. Velocities, Stresses and Cross Stresses).
  • The most critical optimization parameter to realize the significant performance capabilities of GPU cards is to minimize the stride-length of memory skips across the 1D memory array streaming through each processing thread.
  • The size of each spatial volume in the model domain is also limited. Both issues are addressed by an appropriate domain-decomposition strategy focused on light-weight 2D planes of wave field and parameter values. Therefore, small planes (P0 to Pn-1, see Fig. 12) from the propagation domain are sequentially stored in memory for every wave field and parameter field, followed by the exterior loop over the third spatial dimension.
  • This set of planes defines a Thread-Block domain, and a set of blocks contains the entire model domain, consisting of N Elements, each n1 by n2 by n3. Note that the wave and parameter fields are consecutive planes before considering the third spatial dimension.
  • Fig. 13 illustrates the distribution of values in GPU memory, though it adds some of the concepts of 3D spatial structure for an alternative view of the organization utilized.
  • The "speed" of an axis is then determined by the distance between neighboring elements:
  • The fast axis has its elements neighboring directly, the middle axis has n1 elements between neighbors, and the slow axis has n1*n2 elements between neighbors.
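  • The linear address implied by this fast / middle / slow ordering can be written as below (a sketch; the function name linearIndex and the index names follow the n1, n2, n3 convention of the text and are otherwise assumptions).
```cuda
#include <cstddef>

// Linear address of element (i1, i2, i3) of an n1 x n2 x n3 Element.
__host__ __device__ inline std::size_t linearIndex(int i1, int i2, int i3, int n1, int n2)
{
    return (std::size_t)i1                              // fast axis: neighbors are adjacent
         + (std::size_t)i2 * n1                         // middle axis: neighbors are n1 apart
         + (std::size_t)i3 * n1 * (std::size_t)n2;      // slow axis: neighbors are n1*n2 apart
}
```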
  • GPU threads are organized into a Thread-Block. Functions on the GPU are executed by Thread-Blocks, meaning that the instructions defined in the functions are executed by a group of threads (almost) at once.
  • The size of the Thread-Block is configurable in one, two or three dimensions.
  • The number of Thread-Blocks scheduled is also configurable, in one-, two- or three-dimensional grids.
  • The grid will be configured as N1 / BLOCK_DIM by N2 / BLOCK_DIM, where N1 and N2 are n1 and n2 padded to multiples of BLOCK_DIM.
  • n1: length of the fast axis, or axis 1.
  • n2: length of the middle axis, or axis 2.
  • n3: length of the slow axis, or axis 3.
  • When executing a function on the GPU, a Thread-Block has access to its grid and block coordinates. The block coordinates, threadIdx.x and threadIdx.y, inform a thread where in the block it resides. This is used to access the proper addresses in the n1-n2 plane.
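  • A hedged sketch of this launch configuration and threadIdx-based addressing is shown below; BLOCK_DIM, the padding helper padToBlock and the kernel touchPlane are illustrative names, not the patent's code.
```cuda
#include <cuda_runtime.h>

#define BLOCK_DIM 16   // assumed thread-block edge length

// Pad a length up to the next multiple of BLOCK_DIM (the N1, N2 of the text).
static int padToBlock(int n) { return ((n + BLOCK_DIM - 1) / BLOCK_DIM) * BLOCK_DIM; }

// Minimal kernel showing how block and thread coordinates map into the n1-n2 plane.
__global__ void touchPlane(float* plane, int n1, int n2)
{
    int i1 = blockIdx.x * blockDim.x + threadIdx.x;   // position along the fast axis
    int i2 = blockIdx.y * blockDim.y + threadIdx.y;   // position along the middle axis
    if (i1 < n1 && i2 < n2)
        plane[i2 * n1 + i1] += 1.0f;                  // coalesced along the fast axis
}

// Host-side launch configuration (a sketch under the above assumptions).
void launchSketch(float* d_plane, int n1, int n2)
{
    int N1 = padToBlock(n1), N2 = padToBlock(n2);
    dim3 block(BLOCK_DIM, BLOCK_DIM);                 // 16 x 16 = 256 threads per block
    dim3 grid(N1 / BLOCK_DIM, N2 / BLOCK_DIM);        // N1/BLOCK_DIM by N2/BLOCK_DIM blocks
    touchPlane<<<grid, block>>>(d_plane, n1, n2);
}
```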
  • Fig. 15 shows an example of pseudocode for GPU-based 2D TRI. Threads are synchronized between each kernel to ensure all the values are updated for the entire simulation space before moving on to the next calculation.
  • Memory is allocated on the host (CPU) and GPU device for Vx, Vz, σxx, σzz and σxz, sources and imaging conditions. Sources, as used herein, are the seismic data input to the TRI.
  • The operands Vx, Vz, σxx, σzz and σxz are initialized and source data are loaded. Data are transferred from the host CPU to the GPU device.
  • An example of a physical model is ~15,000 m long by ~4,000 m deep, discretized into a 10 m x 10 m grid, with 11 calculations per time step, resulting in ~6,600,000 elements. Threads are organized into a 2D grid of blocks, each block consisting of 16 x 16 threads. With this block/grid arrangement, the thread and block IDs are used to index into the simulation space following the CUDA paradigm.
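  • A hedged host-side skeleton of this flow is sketched below. The kernel names (stressKernel, velocityKernel, imagingKernel), argument lists and the function runTRISketch are placeholders; the update bodies, boundary handling and source injection are omitted.
```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernels; the real bodies would implement equations (2)-(6),
// the boundary conditions and the imaging-condition updates.
__global__ void stressKernel(float* sxx, float* szz, float* sxz,
                             const float* vx, const float* vz, int nx, int nz) { }
__global__ void velocityKernel(float* vx, float* vz, const float* sxx,
                               const float* szz, const float* sxz, int nx, int nz) { }
__global__ void imagingKernel(const float* vx, const float* vz, float* img, int nx, int nz) { }

std::vector<float> runTRISketch(int nx, int nz, int nt)
{
    size_t n = (size_t)nx * nz, bytes = n * sizeof(float);
    float *vx, *vz, *sxx, *szz, *sxz, *img;
    cudaMalloc(&vx, bytes);  cudaMemset(vx, 0, bytes);    // allocate and zero-initialize
    cudaMalloc(&vz, bytes);  cudaMemset(vz, 0, bytes);
    cudaMalloc(&sxx, bytes); cudaMemset(sxx, 0, bytes);
    cudaMalloc(&szz, bytes); cudaMemset(szz, 0, bytes);
    cudaMalloc(&sxz, bytes); cudaMemset(sxz, 0, bytes);
    cudaMalloc(&img, bytes); cudaMemset(img, 0, bytes);
    // (The time-reversed receiver data, the "sources", would also be copied to the device here.)

    dim3 block(16, 16), grid((nx + 15) / 16, (nz + 15) / 16);
    for (int t = 0; t < nt; ++t) {
        stressKernel  <<<grid, block>>>(sxx, szz, sxz, vx, vz, nx, nz);   // stress pass
        velocityKernel<<<grid, block>>>(vx, vz, sxx, szz, sxz, nx, nz);   // velocity pass
        imagingKernel <<<grid, block>>>(vx, vz, img, nx, nz);             // imaging conditions
        // Launches on the default stream execute in order, which gives the per-kernel
        // synchronization mentioned above.
    }
    std::vector<float> image(n);
    cudaMemcpy(image.data(), img, bytes, cudaMemcpyDeviceToHost);          // write back result
    cudaFree(vx); cudaFree(vz); cudaFree(sxx); cudaFree(szz); cudaFree(sxz); cudaFree(img);
    return image;
}
```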
  • Fig. 16a illustrates horizontal velocity data (Ricker wavelet, fundamental frequency 3 Hz) recorded at every grid point at the surface from a forward model of a single vertical point source.
  • Fig. 16b illustrates a forward model of a single vertical point source. Receiver data are reversed in time and propagated.
  • Fig. 16c shows the TRI result, which highlights the source location using the P-S-wave potential cross-correlation imaging condition. All simulations were performed on the GPU.
  • Nx and Nz are the numbers of horizontal and vertical grid points and Nt is the number of time steps.
  • TJ and TC are the Java and CUDA execution times, measured in seconds.
  • EJ and EC are the Java and CUDA computational throughputs, in million elements per second. They are calculated as Nx x Nz x Nt x (5 + 6) / runtime. The factor of (5 + 6) arises because there are 5 elastic properties (or Elements) and 6 imaging conditions to update at every time step in the 2D case.
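  • For a sense of scale (assuming, purely for illustration, Nt = 10,000 time steps on the roughly 1,500 x 400 grid of the example model above): Nx x Nz x Nt x (5 + 6) = 1,500 x 400 x 10,000 x 11 = 6.6 x 10^10 element updates, so a throughput of 5,000 million elements per second would complete such a run in roughly 13 seconds.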
  • The CUDA implementation of TRI contains six imaging conditions and a finite difference full elastic wave propagation solver, all of which run fully on a GPU. Data are divided into small, manageable pieces to leverage CUDA's built-in texture functions and to ensure coalesced memory addressing from threads for optimal memory reads and writes.
  • The embodiment disclosed herein illustrates a CUDA implementation that achieves a computational throughput of more than 5,000 million elements per second.
  • The map view of Elements E(n) in Fig. 11 schematically illustrates the data Block arrangement, B0 to B8, so that the same blocks, meaning like-indexed blocks, from other Elements (e.g. volumes of physical parameters) are closer together in memory to enable more rapid data transfer and higher computational speed.
  • Fig. 12 illustrates the arrangement in memory of adjacent Planes (P0 to Pn-1), where the like-indexed Blocks of Elements are placed closer together in memory for data transfer efficiency. As illustrated, the adjacent Planes of each Block for all Elements are arranged in memory so that reads and writes are most efficient.
  • Fig. 13 then illustrates this result, where threads are arranged in memory for maximum efficiency.
  • Fig. 14 illustrates this data arrangement hierarchy. Starting from the bottom, the linear memory array of BLOCKS (like-indexed Blocks extracted from the Elements) is arranged in adjacent PLANES of the ELEMENTS. When the ELEMENTS are arranged in this manner, the thread blocks are laid out in memory so that the GPU read and write sequences facilitate maximally efficient throughput for GPU operations.
  • Fig. 17 illustrates a schematic example of the hardware and operating environment in which embodiments as described herein and their equivalents may be practiced.
  • The description of Fig. 17 includes a general description of computer hardware, a computing environment or an information handling system with which the embodiments may be implemented. Although specific hardware may not be required, embodiments may be implemented in the general context of computer-executable instructions, such as program modules, being executed by a computer. Various embodiments may be practiced with a personal computer, a mainframe computer or combinations that include workstations with servers.
  • Program modules include routines, programs, objects, components and data structures for performing tasks, processing data, and recording and displaying information.
  • An information handling system is any instrumentality or aggregate of instrumentalities primarily designed to compute, classify, process, transmit, receive, retrieve, originate, switch, store, display, manifest, measure, detect, record, reproduce, handle or utilize any form of information, intelligence or data for business, scientific, control or other purposes. Examples include personal computers and larger processors such as servers, mainframes, etc, and may contain elements illustrated in Fig. 17.
  • Embodiments may be practiced with various computer or information handling system configurations that separately or in combination may include handheld devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network computers, minicomputers, mainframe computers, and the like. Embodiments may be practiced with tasks performed in and over distributed computing environments that include remote processing devices linked through a communications network. Program modules operating in distributed computing environments may be located in various memory locations, both local and remote.
  • Fig. 17 is illustrative of hardware and an operating environment for implementing a general purpose computing device or information handling system in the form of a computer 10.
  • Computer 10 includes a processor or processing unit 11 that may include 'onboard' instructions 12.
  • Computer 10 has a system memory 20 attached to a system bus 40 that operatively couples various system components including system memory 20 to processing unit 11.
  • the system bus 40 may be any of several types of bus structures using any of a variety of bus architectures as are known in the art.
  • While one processing unit 11 is illustrated in Fig. 17, there may be a single central processing unit (CPU) or a graphics processing unit (GPU), or both, or a plurality of processing units.
  • Computer 10 may be a standalone computer, a distributed computer, or any other type of computer.
  • System memory 20 includes read only memory (ROM) 21 with a basic input/output system (BIOS) 22 containing the basic routines that help to transfer information between elements within the computer 10, such as during start-up.
  • System memory 20 of computer 10 further includes random access memory (RAM) 23 that may include an operating system (OS) 24, an application program 25 and data 26.
  • Computer 10 may include a disk drive 30 to enable reading from and writing to an associated computer or machine readable medium 31.
  • Computer readable media 31 includes application programs 32 and program data 33.
  • computer readable medium 31 may include programs to process seismic data, which may be stored as program data 33, according to the methods disclosed herein.
  • the application program 32 associated with the computer readable medium 31 includes at least one application interface for receiving and/or processing program data 33.
  • the program data 33 may include seismic data acquired according to embodiments disclosed herein.
  • At least one application interface may be associated with calculating imaging conditions for locating subsurface hydrocarbon reservoirs.
  • The disk drive may be a hard disk drive (e.g., for a magnetic disk), a magnetic disk drive for reading from or writing to removable magnetic media, or an optical disk drive for reading from or writing to a removable optical disk such as a CD-ROM, DVD or other optical media.
  • The drive 30 and associated computer-readable media 31 enable nonvolatile storage and retrieval for one or more application programs 32 and data 33 that include computer-readable instructions, data structures, program modules and other data for the computer 10.
  • Any type of computer-readable media that can store data accessible by a computer including but not limited to cassettes, flash memory, digital video disks in all formats, random access memories (RAMs), read only memories (ROMs), may be used in a computer 10 operating environment.
  • the application programs 32 may be associated with one or more application program interfaces.
  • An application programming interface (API) 35 may be an interface that a computer system, library or application provides in order to allow requests for services to be made of it by other computer programs, and/or to allow data to be exchanged between them.
  • An API 35 may also be a formalized set of software calls and routines that can be referenced by an application program 32 in order to access supporting application programs or services, which programs may be accessed over a network 90.
  • APIs 35 are provided that allow for higher level programming for displaying and mapping subsurface reservoirs.
  • APIs are provided for receiving seismic data, and decomposing, merging, smoothing and averaging the data.
  • the APIs allow for determining the output of imaging condition processing and storing it for display.
  • Serial interface 50 may be a universal serial bus (USB).
  • a user may enter commands or data into computer 10 through input devices connected to serial interface 50 such as a keyboard 53 and pointing device (mouse) 52.
  • Other peripheral input/output devices 54 may include without limitation a microphone, joystick, game pad, satellite dish, scanner or fax, speakers, wireless transducer, etc.
  • Other interfaces (not shown) that may be connected to bus 40 to enable input/output to computer 10 include a parallel port or a game port.
  • Computers often include other peripheral input/output devices 54 that may be connected with serial interface 50 such as a machine readable media 55 (e.g., a memory stick), a printer 56 and a data sensor 57.
  • A seismic sensor or seismometer for practicing embodiments disclosed herein is a non-limiting example of data sensor 57.
  • A video display 72 may be, e.g., a liquid crystal display (LCD), a flat panel, a solid state display, or a cathode ray tube (CRT).
  • a map or subsurface display created from imaging condition values as disclosed herein may be displayed with video display 72.
  • a computer 10 may operate in a networked environment using logical connections to one or more remote computers. These logical connections are achieved by a communication device associated with computer 10.
  • a remote computer may be another computer, a server, a router, a network computer, a workstation, a client, a peer device or other common network node, and typically includes many or all of the elements described relative to computer 10.
  • the logical connections depicted in Fig. 17 include a local-area network (LAN) or a wide-area network (WAN) 90.
  • LAN local-area network
  • WAN wide-area network
  • When used in a networking environment, the computer 10 may be connected to a network 90 through a network interface or adapter 60.
  • Computer 10 may include a modem 51 or any other type of communications device for establishing communications over the network 90, such as the Internet.
  • Modem 51, which may be internal or external, may be connected to the system bus 40 via the serial interface 50.
  • In a networked deployment, computer 10 may operate in the capacity of a server or a client user machine in a server-client user network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
  • program modules associated with computer 10, or portions thereof may be stored in a remote memory storage device.
  • the network connections schematically illustrated are for example only and other communications devices for establishing a communications link between computers may be used.

Landscapes

  • Engineering & Computer Science (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Environmental & Geological Engineering (AREA)
  • Geology (AREA)
  • General Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Geophysics (AREA)
  • Image Processing (AREA)

Abstract

A method and system for imaging subsurface energy sources, comprising: acquiring multicomponent seismic data; arranging 3D volumes of velocity and stress parameters in a global computer memory by ordering adjacent planes from like-indexed blocks extracted from the 3D parameter volumes so that like-indexed blocks are adjacent in the computer memory; reverse propagating the multicomponent seismic data to obtain a reverse time wave field; and solving a full elastic wave equation for the reverse time wave field using a graphics processing unit to obtain wave field decomposition data. The method may also include applying at least one imaging condition to the wave field decomposition data to obtain imaging data and/or storing the imaging data for display.

Description

TITLE: FULL ELASTIC WAVE EQUATION FOR 3D DATA PROCESSING ON GPGPU
CROSS-REFERENCE TO RELATED APPLICATION
This application claims the benefit of U.S. Provisional Application No. 61/529,758 filed 31-August-2011, which is fully incorporated by reference.
BACKGROUND OF THE DISCLOSURE
Technical Field
[0001] The disclosure is related to 3D data processing of physical parameters. Applications include seismic exploration for oil and gas, and more particularly to processing and displaying seismic data attributes of the earth's subsurface.
Description of the Related Art
[0002] Computer processing of three dimensional data volumes may be challenging because of the volume of data necessary to adequately provide realistic parameterization of real world situations. This is particularly true for the case of the full elastic wave equation for seismic data processing.
[0003] Seismic exploration for hydrocarbons generally is conducted using a source of seismic energy and receiving and recording the energy generated by the source using seismic detectors. On land, the seismic energy source may be an explosive charge or another energy source having the capacity to impart impacts or mechanical vibrations at or near the earth's surface. Seismic waves generated by these sources travel into the earth subsurface and are reflected back from boundaries and reach the surface of the earth at varying intervals of time depending on the distance traveled and the characteristics of the subsurface material traversed. The return waves are detected by the sensors and representations of the seismic waves as representative electrical signals are recorded for processing.
[0004] Normally, signals from sensors located at varying distances from the source are combined together during processing to produce "stacked" seismic traces. In marine seismic surveys, the seismic energy typically comes from air guns. Marine seismic surveys typically employ a plurality of sources and/or a plurality of streamer cables, in which seismic sensors are mounted, to gather three dimensional data.
[0005] The process of exploring for and exploiting subsurface hydrocarbon reservoirs is often costly and inefficient because operators have imperfect information from geophysical and geological characteristics about reservoir locations. Furthermore, a reservoir's characteristics may change as it is produced.
[0006] Data acquisition for oil exploration may have a negative impact on the environment. The impact of oil exploration methods on the environment may be reduced by using low-impact methods and/or by narrowing the scope of methods requiring an active source, including reflection seismic and electromagnetic surveying methods.
[0007] Geophysical and geological methods are used to determine well locations. Expensive exploration investment is often focused in the most promising areas using relatively slow methods, such as reflection seismic data acquisition and processing. The acquired data are used for mapping potential hydrocarbon-bearing areas within a survey area to optimize exploratory well locations and to minimize costly nonproductive wells.
[0008] The time from mineral discovery to production may be shortened if the total time required to evaluate and explore a survey area can be reduced by applying selected methods alone or in combination with other geophysical methods. Some methods may be used as a standalone decision tool for oil and gas development decisions when no other data is available. Preferable methods will be economical, have a low environmental impact, and relatively efficient with rapid data acquisition and processing.
SUMMARY
[0009] Disclosed herein is a computerized method for imaging subsurface energy sources comprising acquiring multicomponent seismic data, arranging 3D volumes of velocity and stress parameters in a global computer memory by ordering adjacent planes from like-indexed blocks extracted from the 3D parameter volumes so that like-indexed blocks are adjacent in the computer memory, reverse propagating the multicomponent seismic data to obtain a reverse time wave field and solving a full elastic wave equation for the reverse time wave field using a graphics processing unit to obtain wave field decomposition data. The method may also include applying at least one imaging condition to the wave field decomposition data to obtain imaging data and/or storing the imaging data for display.
[0010] BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 illustrates an elementary 2D cell or grid point;
Fig. 2 illustrates 16 grid points, a thread-block grid;
Fig. 3 illustrates relationships of stresses and velocities for computations;
Fig. 4 illustrates the comparison of a 2D cell or grid point with a 3D cell or grid point representing stresses and velocities for computations;
Fig. 5 illustrates a 3D cell or grid point representing stresses and velocities for computations indexed using x, y and z as well as 1, 2 and 3;
Fig. 6 illustrates a thread-block grid with a halo for the stress update calculations;
Fig. 7 illustrates a thread-block grid with a halo for the velocity update calculations;
Fig. 8 illustrates an example global memory address configuration relative to a 2D calculation;
Fig. 9 illustrates a global memory address configuration with the element sizes equal;
Fig. 10 illustrates a global memory address configuration for thread-block grids;
Fig. 11 illustrates a visualization of the 3D Elements arranged into blocks to facilitate rapid calculations;
Fig. 12 illustrates a block memory organization schematic relative to organizing data planes to facilitate rapid calculations;
Fig. 13 illustrates the relationship between GPU memory threads and the planes of data for elements for each block;
Fig. 14 illustrates a GPU memory structure showing the relationship of data blocks to data planes to elements to threads;
Fig. 15 illustrates pseudo code for a non-limiting embodiment of the present invention;
Figs. 16a, 16b and 16c illustrate various subsurface related images according to non-limiting embodiments of the present invention; and
Fig. 17 illustrates a computer system for implementing various non-limiting embodiments of the present invention.
DETAILED DESCRIPTION
[0011] General purpose graphics processing units (GPGPU) enable rapid computations compared to conventional CPU systems. However, taking advantage of the increase in computing power can be difficult when the datasets to be processed are large three-dimensional (3D) datasets. This disclosure includes descriptions of embodiments to arrange data to be efficiently processed with GPGPU systems.
[0012] An example of a 3D problem is used to illustrate how 3D data may be arranged to efficiently take advantage of the computing power available on a GPGPU system. The 3D problem used as an example is a time reverse imaging example.
[0013] Information to enable the determination of the location of underground fluids, underground fluid reservoirs or seismic source locations may be extracted from seismic waves and vibrations measured at the earth's surface using various imaging conditions in data processing. These waves may be measured using passive or active seismic data acquisition methods.
[0014] One or more sensors are used to measure vertical and horizontal components of motion due to background seismic waves at multiple locations within a survey area. These components may be measured separately or in combination and may be recorded as signals representing displacement, velocity, and/or acceleration.
[0015] The sensor equipment for measuring seismic waves may be any type of seismometer. Seismometer equipment having a large dynamic range and enhanced sensitivity compared with other transducers may provide the best results (e.g., multicomponent earthquake seismometers). A number of commercially available sensors utilizing different technologies may be used, e.g. a balanced force feed-back instrument or an electrochemical sensor. An instrument with high sensitivity at very low frequencies and good coupling with the earth enhances the efficacy of the method.
[0016] Ambient noise conditions representative of seismic wave energy can negatively affect the recorded data. Techniques for removing unwanted artifacts and artificial signals from the data, such as cultural and industrial noise, may be important for applying this method successfully in areas where there is relatively high ambient noise.
[0017] With the advancement of programmable graphics processing unit (GPU) hardware and supporting infrastructure to make GPU hardware more easily addressable and accessible, general-purpose computing on graphics processing units (GPGPU) is becoming more common. GPUs are highly parallel, multithreaded, many-core processors that are capable of very high computation and data throughput and are effective on problems that require stream processing, where data points have few dependencies and can be processed in the same way. In geophysics, many applications fall into this category. GPU-based implementations, from trace processing to time migration, show significant performance gains over traditional CPU-based implementations.
[0018] CUDA (Compute Unified Device Architecture), a parallel computing architecture developed by NVIDIA, defines both a parallel programming model as well as a memory model for the user to develop GPU based programs in a high-level C-like programming language without worrying about low level hardware architecture, thus making the GPU more accessible and programmable. At the heart of CUDA is the ability for a programmer to define kernels, or C-like functions, that are executed many times in parallel by a number of CUDA threads.
[0019] Unlike CPU threads, CUDA threads are extremely lightweight: they have very little creation overhead and are fast-switching. CUDA uses thousands of threads to achieve high efficiency and high throughput, whereas a current multi-core CPU may utilize only a few threads. A kernel is executed by an array of threads in parallel, and threads may be arranged into one-, two- or three-dimensional blocks. Thread blocks must be able to execute independently, in any order, in parallel or in series. The thread and block IDs are used as a convenient index to compute memory addresses and make control decisions. Threads can access multiple memory spaces: per-thread local memory, per-block shared memory and device global memory.
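A minimal, generic example (not taken from the patent) of how thread and block IDs index into data is shown below; the kernel name scale and the wrapper scaleOnGpu are illustrative only.
```cuda
#include <cuda_runtime.h>

// Each CUDA thread scales one array element.
__global__ void scale(float* a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index from block and thread IDs
    if (i < n) a[i] *= s;                            // guard the padded tail
}

void scaleOnGpu(float* d_a, float s, int n)
{
    int threads = 256;                               // threads per block
    int blocks  = (n + threads - 1) / threads;       // enough blocks to cover n elements
    scale<<<blocks, threads>>>(d_a, s, n);
}
```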
[0020] Local and shared memories are small, low-latency memories near each processor core; global memory, on the other hand, is a larger, off-chip memory that is high-latency and not cached, and is therefore costly to access. Following optimized access patterns is important when accessing global memory.
[0021] The number of blocks a multiprocessor can process at once depends on how many registers per thread and how much shared memory per block are required for a given kernel, because the registers and shared memory are scarce and split among all the threads. In this disclosure, we present a CUDA-based time reverse imaging (TRI) implementation that initially achieved a 60 to 100 times speedup compared to an existing CPU-based code. The CUDA implementation uses a finite difference full elastic wave propagator, wave field decomposition, and six imaging conditions that compute absolute particle velocities and P-wave and S-wave autocorrelations. Memory reads and writes may be optimized by using texture bindings. Initial and/or boundary condition calculations may be built into the GPU update kernels, resulting in substantial increases in elements-per-second throughput.
[0022] Time reverse imaging (TRI) is a migration-like algorithm to locate subsurface sources or diffractions by propagating data, which are reversed in time, through a velocity model. With no down-going source wave field, the algorithm may be roughly considered as approximately the data half of reverse-time migration that may be used in conventional seismic data processing.
[0023] As disclosed with the following non-limiting embodiments, one chain of operations is: elastic finite difference propagation, wave-field decomposition via spatial derivatives, and evaluation of up to six correlation based imaging conditions. Propagation time is the exterior loop for all operations, and images are in physical space. An illustration that may be used to evaluate the accuracy and performance of the GPU implementation is forward modeled data due to a subsurface vertical single force propagating through a homogeneous medium with an image showing focusing of the P and S wave modes at the appropriate location in space.
[0024] TRI is an ideal application of a real world 3D problem for the GPGPU, since elastic wave propagation, wave field decomposition and updating imaging conditions are all computationally intensive. The computations may be performed entirely on the GPU, with communication with the host (CPU) minimally required, only for initialization and for writing back the final solution.
[0025] Finite difference elastic wave propagation: At the core of TRI is elastic wave propagation,
[0026] $\rho\,\ddot{\mathbf{u}} = \mathbf{f} + (\lambda + 2\mu)\,\nabla(\nabla \cdot \mathbf{u}) - \mu\,\nabla \times (\nabla \times \mathbf{u})$.     (1)
[0027] Following a finite difference method (e.g., described in Virieux, 1984), the elastic wave equation, Equation 1, may be discretized using the centered finite difference technique. In the case of 2D, the elastic wave equation can be numerically propagated in time according to these following equations, which are first order in both time and space:
$$V_x^{t+1} = V_x^{t} + \frac{\Delta t}{\rho}\left(\frac{\partial \sigma_{xx}}{\partial x} + \frac{\partial \sigma_{xz}}{\partial z}\right) \qquad (2)$$
$$V_z^{t+1} = V_z^{t} + \frac{\Delta t}{\rho}\left(\frac{\partial \sigma_{xz}}{\partial x} + \frac{\partial \sigma_{zz}}{\partial z}\right) \qquad (3)$$
$$\sigma_{xx}^{t+1} = \sigma_{xx}^{t} + \Delta t\left((\lambda + 2\mu)\frac{\partial V_x}{\partial x} + \lambda\frac{\partial V_z}{\partial z}\right) \qquad (4)$$
$$\sigma_{zz}^{t+1} = \sigma_{zz}^{t} + \Delta t\left(\lambda\frac{\partial V_x}{\partial x} + (\lambda + 2\mu)\frac{\partial V_z}{\partial z}\right) \qquad (5)$$
$$\sigma_{xz}^{t+1} = \sigma_{xz}^{t} + \Delta t\,\mu\left(\frac{\partial V_x}{\partial z} + \frac{\partial V_z}{\partial x}\right) \qquad (6)$$
[0028] Density is represented by ρ, λ and μ are the Lamé coefficients, Vx and Vz are the horizontal and vertical particle velocities, σxx and σzz are the horizontal and vertical normal stresses, and σxz is the shear stress, all at time step t. These velocities and stresses can be conveniently arranged onto a staggered grid. To update the state of any element on the staggered grid, only four neighboring elements (top, bottom, left, and right) are needed in the 2D case. A free-surface (stress free) boundary condition is imposed for one side (the presumed surface) and tapered absorbing layers for the other three sides.
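A minimal sketch of what equations (2) through (6) look like at one interior grid point is given below. It is illustrative only, not the patent's optimized GPU kernels: the struct Elastic2D, the function names and the exact one-sided difference stencils (which fold in the staggered half-cell offsets) are assumptions, and shared memory, boundaries and sources are omitted.
```cuda
struct Elastic2D {
    float *vx, *vz, *sxx, *szz, *sxz;   // fields, each stored as index i = iz * nx + ix
    float rho, lam, mu, dt, dx, dz;     // density, Lame coefficients, time and space steps
    int nx, nz;
};

// Equations (4)-(6): stresses updated from spatial derivatives of velocity.
inline void updateStressPoint(Elastic2D& m, int ix, int iz)
{
    int i = iz * m.nx + ix;
    float dvxdx = (m.vx[i + 1]    - m.vx[i]) / m.dx;
    float dvzdz = (m.vz[i + m.nx] - m.vz[i]) / m.dz;
    float dvxdz = (m.vx[i + m.nx] - m.vx[i]) / m.dz;
    float dvzdx = (m.vz[i + 1]    - m.vz[i]) / m.dx;
    m.sxx[i] += m.dt * ((m.lam + 2.0f * m.mu) * dvxdx + m.lam * dvzdz);
    m.szz[i] += m.dt * (m.lam * dvxdx + (m.lam + 2.0f * m.mu) * dvzdz);
    m.sxz[i] += m.dt * m.mu * (dvxdz + dvzdx);
}

// Equations (2)-(3): velocities updated from spatial derivatives of stress.
inline void updateVelocityPoint(Elastic2D& m, int ix, int iz)
{
    int i = iz * m.nx + ix;
    m.vx[i] += m.dt / m.rho * ((m.sxx[i] - m.sxx[i - 1])    / m.dx
                             + (m.sxz[i] - m.sxz[i - m.nx]) / m.dz);
    m.vz[i] += m.dt / m.rho * ((m.sxz[i] - m.sxz[i - 1])    / m.dx
                             + (m.szz[i] - m.szz[i - m.nx]) / m.dz);
}
```
A full time step sweeps updateStressPoint over every interior point and then updateVelocityPoint, matching the stress-then-velocity ordering described later for the GPU kernels.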
[0029] Fig. 1 illustrates a 2D cell, or grid node, of the staggered finite difference grid for full elastic wave propagation. The horizontal velocity is Vx. The vertical velocity is Vz. The horizontal stress is Sxx (the same as σxx in the equations above). The vertical stress is Szz and the cross stress is Sxz. Fig. 2 illustrates 16 grid nodes of 2D cells. Fig. 3 illustrates the relationships of each of the stresses and velocities for the calculations of equations 2 through 6. Note that the symbols 'S' and 'σ' are used interchangeably in the text, equations and figures, but represent exactly the same thing.
[0030] Fig. 4 illustrates taking these concepts to the 3D case. A third velocity component, Vy, and normal stress, Syy, have been added, along with the cross stresses Sxy and Syz. Fig. 5 illustrates the 3D case of Fig. 4 using numerical indices.
[0031] GPUs are designed for floating point operations. The time spent on calculation is small compared to the time spent accessing data and updating results in memory. Having an IO access pattern where data are aligned for appropriately coalesced and optimized reads and writes is essential for a fast, efficient implementation of 3D data parameterizing real world situations.
[0032] Instead of interleaving the stresses and velocities into one large data structure, the stresses and velocities are each separated into a linear array in memory. For an N- by M-element model, there are (N+1) x M Vx elements, N x (M+1) Vz elements, N x M σxx and σzz elements, and (N-1) x (M-1) σxz elements. This arrangement keeps memory blocks relatively small to minimize memory paging. Smaller memory blocks are also easier to align with GPU memory warps to ensure coalesced reads and writes from a given thread. If the size of an array is not a multiple of the half warp, it is padded to the next multiple of the half warp. A warp is the number of parallel threads that are created, managed, scheduled and executed in a group by the GPU's SIMT (single instruction, multiple thread) unit. The graphics architecture for the example implementation disclosed herein has a warp size of 32, but this is not a limitation of the embodiments disclosed here, since the embodiments may be implemented in different architectures and with different warp sizes.
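The padding rule can be written as a small helper, sketched below under the assumption that the half warp is 16 (consistent with the warp size of 32 mentioned above); the names kHalfWarp and padToHalfWarp are illustrative.
```cuda
#include <cstddef>

// Pad an array dimension up to the next multiple of the half warp, so that each row
// starts on a nicely aligned address.
constexpr std::size_t kHalfWarp = 16;

constexpr std::size_t padToHalfWarp(std::size_t n)
{
    return ((n + kHalfWarp - 1) / kHalfWarp) * kHalfWarp;
}

// Example: an (N+1)-by-M Vx array with N = 301 has rows of 302 floats, which would be
// padded to padToHalfWarp(302) == 304 floats.
```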
[0033] One possibility for implementation of 2D seismic data processing is to take advantage of CUDA built-in texture memory support by binding each array to texture and by using texture fetches to read data. Since there is a localized access pattern (top, bottom, left, right neighbors), threads of the same warp always read texture addresses that are close together. Texture memory space is cached on hardware level and the latency of addressing calculation is hidden better through thread parallelization, so there are significant speedups for data loaded with texture fetches compared to direct global memory reads. Having each thread load the data as needed through texture fetch is faster than first putting all needed data for a block into shared memory. Even though this implementation, on average, has twice the read redundancy, it keeps the memory usage for each thread to only a few registers, and thus allows more threads to run in parallel, which results in more rapid processing.
[0034] However, for a 3D implementation of full wave field seismic processing, or other 3D data structures, the texture binding method is prohibitively inefficient. Therefore, implementing the method for 3D seismic processing requires arranging thread blocks to process data efficiently with the least number of time-consuming data handling steps.
[0035] CUDA's capabilities depend on the GPU type. This disclosure uses as an example the specifications of compute capability 1.3 GPUs (e.g. GTX 285, Tesla C1060), but the methodology is straightforward to generalize to all other GPU types. CUDA launches its threads in thread-blocks, configurable to be one to three dimensions, in this example a maximum of 512 x 512 x 64 for each dimension and a maximum of 512 total threads. These blocks can be arranged in grids, configurable to be one to two dimensions, maximum 65535 x 65535.
[0036] The thread-blocks are arranged in a configuration of 16 by 16, resulting in 256 total threads, in a grid of nx / 16 by nz / 16, resulting in nx * nz / 256 total blocks. Given 256 threads per block, a maximum of 4 blocks can run on one multiprocessor at the same time. This is due to the way instructions are executed: a multiprocessor has 8 cores with which it executes 4 x 8 = 32 instructions (again, called a warp) at a time. It is capable of holding 32 such warps, and therefore 32 x 32 = 1024 thread instructions.
[0037] The blocks in this example implementation have 256 threads each, which is why a multiprocessor can manage 4 at a time. Given 30 multiprocessors, this results in 120 active thread-blocks, or 30720 threads. Again, these numbers are for example only; it is the data arrangement that leads to the increased efficiencies.
[0038] The number of active threads (30720) cannot be mapped directly to the number of elementary cells processed in parallel. On the one hand, although a multiprocessor has 32 warps of 32 instructions each scheduled, it executes one warp at a time. On the other hand, global memory IO operations take hundreds of clock cycles, and the multiprocessor uses this time to process other instructions. This way, the memory fetch latency can be "hidden", or accommodated, to a certain extent.
[0039] Even though reading from graphics card RAM is relatively slow, the IO bus is wide: 512 bits, which allows reading 16 consecutive 32-bit words with one read instruction. As 32 bits is exactly the size of a float, this allows 16 values to be read at once if they are properly positioned in memory (i.e. consecutively). Arranging the data in a convenient way for the threads to read is key to optimizing the process on CUDA. A multiprocessor also has on-chip shared memory, which (in this example) is 16 KB in size and can be used by all threads inside a thread-block. The advantage of this memory is that it is very fast; the drawback is that the amount is limited, so it must be carefully allocated. The shared memory is also partitioned among thread-blocks, so the 16 KB may be divided among 4 simultaneously running thread-blocks, allowing 4 KB for each.
[0040] Given the thread-block size of 16 by 16 = 256 threads and the size of a float (4 bytes), a "tile" of data is 256 x 4 = 1 KB. Because halos are needed (data from the border with neighboring thread-block grids) to completely update any given element or grid block, this budget allows 3 layers of data in shared memory.
[0041] By comparison with Fig. 2, Fig. 6 illustrates how the halo outside and adjacent to the thread-block is juxtaposed in relation to each thread-block grid for the stress update calculation. The velocities, including the velocity data from the halo zones that are used to update the stresses, are loaded from global memory into shared (local) memory. The stresses are loaded from global memory into local registers for the update calculations. The updated stresses are then written from the local registers back to global memory.
[0042] Fig. 7 illustrates how the halo is juxtaposed in relation to each thread-block grid for the velocity update calculation. The stresses, including the stress data from the halo zones that are used to update the velocities, are loaded from global memory into shared (local) memory. The velocities are loaded from global memory into local registers for the update calculations. The updated velocities are then written from the local registers back to global memory.
[0043] Three layers are all that are needed: 1) update stress: load Vx and Vz into shared tiles and compute all differences; 2) update velocity: load εxx, εzz and εxz into shared tiles and compute all differences. In this implementation a maximum of 3 operands can all be put in shared memory and the differences computed efficiently, thanks to the short fetch times from shared memory.
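A minimal CUDA sketch of the tile-plus-halo pattern of Figs. 6 and 7 follows, assuming a 16 x 16 thread-block, grid dimensions that are multiples of the tile size, and placeholder update coefficients c1 and c2 in place of the full equations 2 through 6; the kernel and variable names are illustrative and not part of the original disclosure.

#define TILE 16

__global__ void stressPassSketch(const float* vx, const float* vz,
                                 float* sxx, float* szz,
                                 int nx, int nz, float c1, float c2)
{
    // Shared tiles with a one-cell halo ring around the 16 x 16 block (Fig. 6).
    __shared__ float sVx[TILE + 2][TILE + 2];
    __shared__ float sVz[TILE + 2][TILE + 2];

    // Assumes nx and nz are multiples of TILE (the arrays are padded), so no bounds checks.
    int ix = blockIdx.x * TILE + threadIdx.x;
    int iz = blockIdx.y * TILE + threadIdx.y;
    int tx = threadIdx.x + 1, tz = threadIdx.y + 1;
    int i  = iz * nx + ix;

    // Load the interior of the tile; edge threads also load their halo neighbor.
    sVx[tz][tx] = vx[i];
    sVz[tz][tx] = vz[i];
    if (threadIdx.x == 0        && ix > 0)      { sVx[tz][0]        = vx[i - 1];  sVz[tz][0]        = vz[i - 1]; }
    if (threadIdx.x == TILE - 1 && ix < nx - 1) { sVx[tz][TILE + 1] = vx[i + 1];  sVz[tz][TILE + 1] = vz[i + 1]; }
    if (threadIdx.y == 0        && iz > 0)      { sVx[0][tx]        = vx[i - nx]; sVz[0][tx]        = vz[i - nx]; }
    if (threadIdx.y == TILE - 1 && iz < nz - 1) { sVx[TILE + 1][tx] = vx[i + nx]; sVz[TILE + 1][tx] = vz[i + nx]; }
    __syncthreads();

    // Stresses stay in registers: read, increment with velocity differences, write back.
    // Boundary cells (free surface, absorbing taper) are handled separately and omitted here.
    float dvxdx = sVx[tz][tx + 1] - sVx[tz][tx];
    float dvzdz = sVz[tz + 1][tx] - sVz[tz][tx];
    sxx[i] += c1 * dvxdx + c2 * dvzdz;
    szz[i] += c1 * dvzdz + c2 * dvxdx;
}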
[0044] Even though the data are read only twice from global memory (once for the stress update, once for the velocity update) and written only once (after the update), memory input-output is still the bottleneck of the process, so it needs to be optimized.
[0045] Elements, as used here in the physical example, are the velocities (Vx, Vy and Vz), the stresses (Sxx, Syy and Szz) and the cross stresses (Sxy, Sxz and Syz). A straightforward way to arrange the data (for the 2D case) is to index element after element, and x-line after x-line inside each element:
Vx(0,0)  Vx(0,1)  ... Vx(0,nx+1)  Vx(1,0)  Vx(1,1)  ... Vx(1,nx+1)  ... Vx(nz,nx+1)
Vz(0,0)  Vz(0,1)  ... Vz(0,nx)    Vz(1,0)  Vz(1,1)  ... Vz(1,nx)    ... Vz(nz+1,nx)
εxx(0,0) εxx(0,1) ... εxx(0,nx)   εxx(1,0) εxx(1,1) ... εxx(1,nx)   ... εxx(nz,nx)
εzz(0,0) εzz(0,1) ... εzz(0,nx)   εzz(1,0) εzz(1,1) ... εzz(1,nx)   ... εzz(nz,nx)
εxz(0,0) εxz(0,1) ... εxz(0,nx-1) εxz(1,0) εxz(1,1) ... εxz(1,nx-1) ... εxz(nz-1,nx-1)
Fig. 8 illustrates this global memory address configuration. However, aligning data in this way poses three major problems: 1) the operands (the variables Vx, Vz, εxx, εzz and εxz) are potentially scattered all over global memory, so to update Vx from εxx and εxz, potentially large distances in memory space may need to be traversed; this applies to all 5 equations (equations 2 to 6); 2) the operands for a dz operation are at least nx-1 addresses apart; 3) the element sizes are different: Vx is nx+1 by nz, Vz is nx by nz+1, εxx and εzz are nx by nz, and εxz is nx-1 by nz-1. This variation prevents synchronized, parallel reads and writes. The element-size problem may be accommodated by making all element sizes equal to nx by nz. The extra elements in εxz can be ignored and padded with zeros, while the missing elements in Vx and Vz are always zero due to the rigid boundary condition. In this method, the data are written to memory so that later reading is extremely fast. Fig. 9 illustrates a global memory address configuration with the element sizes equal. Fig. 10 illustrates a global memory address configuration for thread-block grids. For simplicity, the thread-block grid has nine sectors, but it will be appreciated that thread-block grids usually have a larger number of sectors.
[0046] Therefore, the data are accessed by thread-blocks, which in this example are teams of 16 by 16 threads. The data are arranged such that everything a thread-block will ever need (except halos) is in one place, globally arranged block after block. A block contains portions of Vx, Vz, εxx, εzz and εxz, exactly the data that the thread-block requires.
[0047] A thread-block "knows" its coordinates inside the thread-block grid (gx, gz) and can use the coordinates to access its relevant data. The method includes: 1) shift the global data address by (gx + gz * gnx) * n, where n is the number of elements per thread-block, which is simply 16 * 16 * 5 (Vx, Vz, εxx, εzz and εxz); 2) read operands into shared memory and the elements to update into registers; this can be done in 5 * 16 = 80 reads for all 256 * 5 = 1280 elements, since they are arranged consecutively and 16 elements are read at once; 3) read the halos; this is more difficult and takes more time than with data arranged in a more classic way, but overall the implementation is still faster because the halos are only a small fraction of the total data to read and write; the procedure for accessing halo data extracts it based on an index mapping such as: +1 halo: + n; +nx halo: + n * gnx; -1 halo: - n; -nx halo: - n * gnx; 4) update the elements (V or ε) using data from shared memory, which can be accessed very fast; 5) write the updated elements to global memory, in 2 * 16 = 32 writes (velocities) or 3 * 16 = 48 writes (stresses).
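The address arithmetic of steps 1 and 3 may be sketched as follows; the device function names are illustrative, and the mapping omits the per-field offset within a block's slab of memory.

// Sketch of the block-relative addressing of paragraph [0047]; n = 16*16*5 values per
// thread-block (five 2D fields), gnx = number of thread-blocks along x.
__device__ int blockBaseAddress(int gx, int gz, int gnx, int n)
{
    return (gx + gz * gnx) * n;       // step 1: shift into this block's slab of memory
}

// Step 3: halo data live in the neighboring blocks' slabs, reached by fixed offsets.
__device__ int haloAddress(int base, int n, int gnx, int side)
{
    switch (side) {
    case 0:  return base + n;         // +1  halo (right neighbor block)
    case 1:  return base - n;         // -1  halo (left neighbor block)
    case 2:  return base + n * gnx;   // +nx halo (block below)
    default: return base - n * gnx;   // -nx halo (block above)
    }
}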
[0048] TRI processing includes computing imaging conditions. Imaging conditions collapse the propagation time axis into physical space to allow visualization of wave-field focusing. The absolute particle velocity and the P- and S-wave potentials may be computed after elastic propagation. The equations used are:
Absolute particle velocity: $V^2 = V_x^2 + V_z^2$

P-wave potential: $P^2 = \left[(\lambda + 2\mu)\left(\frac{\partial V_x}{\partial x} + \frac{\partial V_z}{\partial z}\right)\right]^2$

S-wave potential: $S^2 = \left[\mu\left(\frac{\partial V_x}{\partial z} - \frac{\partial V_z}{\partial x}\right)\right]^2$
[0049] From these, the following non-exclusive six imaging conditions may be computed: maximum absolute particle velocity (Ivmax), absolute particle velocity autocorrelation (Ivv), P- and S-wave potential autocorrelations (Ipp and Iss), and P- and S-wave potential and energy cross-correlations (Ips and IEpEs). For the 2D case:
$I_{v\max}(x,z) = \max_t |V(x,z)|$

$I_{vv}(x,z) = \sum_{t=0}^{T} V(x,z)\,V(x,z)$

$I_{pp}(x,z) = \sum_{t=0}^{T} P(x,z)\,P(x,z)$

$I_{ss}(x,z) = \sum_{t=0}^{T} S(x,z)\,S(x,z)$

$I_{ps}(x,z) = \sum_{t=0}^{T} P(x,z)\,S(x,z)$

$I_{E_pE_s}(x,z) = \sum_{t=0}^{T} P(x,z)\,P(x,z)\,S(x,z)\,S(x,z)$
[0050] For the 3D case, the analogous imaging conditions include i) the zero-lag of the P-wave autocorrelation, ii) the zero-lag of the S-wave autocorrelation, iii) the zero-lag of the cross-correlation of the P- and S-wave energy densities, iv) the autocorrelation of the absolute value of particle motion, v) the maximum over all time, and vi) the cross-correlation of the energy density functions EpEs. Additionally, these imaging conditions may be applied generally to any of: i) particle velocity measurements, ii) particle acceleration measurements, iii) particle pressure measurements and iv) particle displacement measurements.
[0051] Similar to the physical attributes at each grid point location, the imaging conditions are updated on every grid point at every time step by reading the previous time-step value from global memory, calculating the incremental value, and then writing the updated value back to global memory.
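A minimal sketch of this read-increment-write update follows, using the zero-lag P-wave autocorrelation and the running maximum of the absolute particle velocity as examples; the kernel and array names are illustrative assumptions, not part of the original disclosure.

__global__ void imagingUpdateSketch(float* ipp, float* ivmax,
                                    const float* p, const float* vabs, int nPoints)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPoints) return;
    ipp[i]  += p[i] * p[i];                       // zero-lag autocorrelation accumulates a product
    ivmax[i] = fmaxf(ivmax[i], fabsf(vabs[i]));   // Ivmax keeps the running maximum over time
}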
[0052] The propagator consists of two kernels, one to update the stress elements (εxx, εzz, εxz) and the other to update the velocities (Vx, Vz). For each time step, the stress is updated first, then the velocity. Each thread updates either the stresses or the velocities at a given array element location. The free surface condition is achieved by setting the stress values to zero at the surface boundary during the velocity pass. The absorbing boundaries may be applied as a cosine taper to the other three sides during the stress pass.
[0053] Wave field decomposition and imaging conditions are computed in a separate kernel from the propagation. Their computational complexity is on par with the elastic element updates, so roughly half of the computation time is spent on elastic element updates and the other half on wave field decomposition and updating imaging conditions. The propagator may also be used for forward time modeling by recording particle velocities at desired locations.
[0054] Fig. 11 illustrates a GPU data arrangement, which is a visualization of 3D Elements (e.g. velocities, stresses and cross stresses). The most critical optimization parameter for realizing the significant performance capabilities of GPU cards is minimizing the stride length of memory skips across the 1D memory array streaming through each processing thread. As the total memory of GPU cards is relatively small, the size of each spatial volume in the model domain is also limited. Both issues are addressed by an appropriate domain-decomposition strategy focused on light-weight 2D planes of wave field and parameter values. Therefore, small planes (P0 to Pn-1, see Fig. 12) from the propagation domain are stored sequentially in memory for every wave field and parameter field, followed by the exterior loop over the third spatial dimension. This set of planes defines a Thread Block domain, and a set of blocks contains the entire model domain, consisting of N Elements, each n1 by n2 by n3. Note that the wave and parameter fields are consecutive planes before considering the third spatial dimension. Fig. 13 illustrates the distribution of values in GPU memory, though it adds some of the concepts of 3D spatial structure for an alternative view of the organization utilized.
[0055] The following table lists example address values for a model of nl=32, n2=16, n3=2, holding two Elements, processed using two Thread Blocks, 16 by 16 each. The GPU scheduling algorithm handles the order, priority, and distribution of Thread Blocks across the available cores on the card. Address Element Plane Block
[0 - 255] 0 0 0
[255 - 511] 1 0 0
[512 - 767] 0 1 0
[768 - 1023] 1 1 0
[1024 - 1279] 0 0 1
[1280 - 1535] 1 0 1
[1536 - 1791] 0 1 1
[1792 - 2047] 1 1 1
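The address ranges in the table follow from treating the block index as the slowest index, then the plane, then the Element, with 16 x 16 = 256 values per tile. The following host-side helper is an illustrative inference from the table rather than code from the disclosure; it reproduces the listed ranges for two Elements, two planes and two Thread Blocks.

#include <cstdio>

// Base address of one 16 x 16 tile: block is the slowest index, then plane, then element.
int tileBaseAddress(int block, int plane, int element, int nPlanes, int nElements)
{
    const int tile = 16 * 16;  // values per plane, per element, per block
    return ((block * nPlanes + plane) * nElements + element) * tile;
}

int main()
{
    for (int block = 0; block < 2; ++block)
        for (int plane = 0; plane < 2; ++plane)
            for (int element = 0; element < 2; ++element) {
                int base = tileBaseAddress(block, plane, element, 2, 2);
                printf("[%4d - %4d]  element %d  plane %d  block %d\n",
                       base, base + 255, element, plane, block);
            }
    return 0;
}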
[0056] The axis attributes "fast", "middle" and "slow" refer to their arrangement in computer memory. Because memory addressing is one-dimensional, representing a three-dimensional object is only possible by storing all of its values consecutively, as in:

for (int i3 = 0; i3 < n3; i3++)
    for (int i2 = 0; i2 < n2; i2++)
        for (int i1 = 0; i1 < n1; i1++)
            memory[i1 + i2 * n1 + i3 * n1 * n2] = object[i3][i2][i1];
[0057] The "speed" of the axis is then determined by the distance between neighboring elements: The fast axis has its elements neighboring directly, the middle axis has nl elements between neighbors and the slow axis has nl*n2 elements between neighbors.
[0058] Thread-Block: GPU threads are organized into Thread-Blocks. Functions on the GPU are executed by Thread-Blocks, meaning that the instructions defined in the functions are executed by a group of threads (nearly) at once. The size of the Thread-Block is configurable in one, two or three dimensions.
[0059] The number of Thread-Blocks scheduled is also configurable, in one-, two- or three-dimensional grids. To use a two-dimensional grid, if BLOCK_DIM is the size of both axes of the two-dimensional Thread-Block, the grid is configured as N1 / BLOCK_DIM by N2 / BLOCK_DIM, where N1 and N2 are n1 and n2 padded to multiples of BLOCK_DIM.
n1: Length of the fast axis, or axis 1.
n2: Length of the middle axis, or axis 2.
n3: Length of the slow axis, or axis 3.
[0060] threadIdx.x: When executing a function on the GPU, a Thread-Block has access to its grid and block coordinates. The block coordinates, threadIdx.x and threadIdx.y, inform a thread where in the block it resides. This is used to access the proper addresses in the n1-n2 plane.
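For illustration, the grid configuration of paragraphs [0059] and [0060] may be sketched as follows; the kernel name, the placeholder body, and the exterior loop over the slow axis are assumptions consistent with the description, not code from the disclosure.

#define BLOCK_DIM 16

// Kernel sketch: blockIdx/threadIdx locate a thread within the n1-n2 plane ([0060]).
__global__ void planeKernelSketch(float* field, int n1, int n2, int i3)
{
    int i1 = blockIdx.x * BLOCK_DIM + threadIdx.x;   // fast axis
    int i2 = blockIdx.y * BLOCK_DIM + threadIdx.y;   // middle axis
    if (i1 < n1 && i2 < n2)
        field[i1 + (size_t)i2 * n1 + (size_t)i3 * n1 * n2] += 0.0f;  // placeholder update
}

// Host-side configuration of [0059]: pad n1 and n2 up to multiples of BLOCK_DIM,
// launch a 2D grid of 2D thread-blocks, and loop over the slow axis outside the kernel.
void launchOverPlanes(float* dField, int n1, int n2, int n3)
{
    int N1 = ((n1 + BLOCK_DIM - 1) / BLOCK_DIM) * BLOCK_DIM;
    int N2 = ((n2 + BLOCK_DIM - 1) / BLOCK_DIM) * BLOCK_DIM;
    dim3 block(BLOCK_DIM, BLOCK_DIM);
    dim3 grid(N1 / BLOCK_DIM, N2 / BLOCK_DIM);
    for (int i3 = 0; i3 < n3; ++i3)
        planeKernelSketch<<<grid, block>>>(dField, n1, n2, i3);
}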
[0061] For illustration, Fig. 15 shows an example of pseudocode for GPU-based 2D TRI. Threads are synchronized between each kernel to ensure all the values are updated for the entire simulation space before moving on to the next calculation. Memory is allocated on the host (CPU) and on the GPU device for Vx, Vz, εxx, εzz and εxz, the sources and the imaging conditions. Sources, as used herein, are the seismic data input to the TRI. The operands Vx, Vz, εxx, εzz and εxz are initialized and the source data are loaded. Data are transferred from the host CPU to the GPU device. On the GPU, for every grid point the stresses εxx, εzz and εxz are updated and the threads are synched; then the velocities Vx and Vz are updated and the threads are synched again. This is followed by wavefield decomposition into V, P and S; the imaging conditions are computed and the threads are synched.
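A host-side sketch of the control flow that Fig. 15 outlines is given below. The kernel names and reduced bodies are illustrative placeholders for the propagation and imaging kernels of paragraphs [0052] and [0053]; on a single stream the kernels already execute in order, and the explicit synchronization simply mirrors the thread synchronization called out in the pseudocode.

// Illustrative kernel stubs standing in for the stress, velocity and imaging kernels.
__global__ void updateStressSketch(float* sxx, float* szz, float* sxz,
                                   const float* vx, const float* vz) { /* equations 2-4 */ }
__global__ void updateVelocitySketch(float* vx, float* vz, const float* sxx,
                                     const float* szz, const float* sxz) { /* equations 5-6 */ }
__global__ void imagingSketch(float* images, const float* vx, const float* vz) { /* V, P, S + conditions */ }

// Time loop corresponding to the Fig. 15 pseudocode; device memory is assumed allocated,
// initialized, and loaded with the source data before this loop runs.
void timeLoopSketch(int nt, dim3 grid, dim3 block,
                    float* vx, float* vz, float* sxx, float* szz, float* sxz, float* images)
{
    for (int it = 0; it < nt; ++it) {
        updateStressSketch<<<grid, block>>>(sxx, szz, sxz, vx, vz);
        cudaDeviceSynchronize();             // all stresses updated before the velocity pass
        updateVelocitySketch<<<grid, block>>>(vx, vz, sxx, szz, sxz);
        cudaDeviceSynchronize();             // all velocities updated before decomposition
        imagingSketch<<<grid, block>>>(images, vx, vz);
        cudaDeviceSynchronize();
    }
    // The final imaging volumes are then copied back to the host for logging and analysis.
}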
[0062] An example of a physical model is ~15,000 m long by ~4,000 m deep, discretized into a 10 m x 10 m grid, with 11 calculations per time step, resulting in ~6,600,000 elements. Threads are organized into a 2D grid of blocks, each block consisting of 16 x 16 threads. With this block/grid arrangement, the thread and block IDs are used to index into the simulation space following the CUDA paradigm.
[0063] The necessary memory is allocated and initialized first on the host (CPU), then the device (GPU), and then the propagation and imaging kernels are carried out on the GPU, with the final results transferred back to CPU for logging and analysis.
[0064] Fig. 16a illustrates horizontal velocity data (Ricker wavelet, fundamental frequency 3 Hz) recorded at every grid point at the surface from a forward model of a single vertical point source. Fig. 16b illustrates the forward model of the single vertical point source. The receiver data are reversed in time and propagated. Fig. 16c shows the TRI result, which highlights the source location using the P-S-wave potential cross-correlation imaging condition Ips. All simulations were performed on the GPU.
[0065] Examples of TRI with actual physical settings were run using a single Nvidia GTX 285 graphics processor, which has 30 multiprocessors (240 total compute cores), 1 gigabyte of device memory, and CUDA 2.3 with compute capability 1.3. Nx and Nz are the numbers of horizontal and vertical grid points and Nt is the number of time steps. Tj and Tc are the Java and CUDA execution times measured in seconds. Ej and Ec are the Java and CUDA computational throughputs, in millions of elements per second. They are calculated as Nx x Nz x Nt x (5+6) / runtime. The factor of (5 + 6) arises because there are 5 elastic properties (or Elements) and 6 imaging conditions to update at every time step in the 2D case.
[0066] Notwithstanding any inefficiency in the particular implementation, the throughput of more than 5,000 million elements per second achieved on a single GPU is non-trivial. Thirty seconds of physical simulation of elastic propagation takes around thirty seconds of runtime, enabling nearly real-time processing.
[0067] The non-limiting embodiment disclosed herein is a CUDA implementation of TRI, containing six imaging conditions and a finite difference full elastic wave propagation solver, that runs fully on a GPU. Data are divided into small, manageable pieces to leverage CUDA's built-in texture functions and to ensure coalesced memory addressing from threads for optimal memory reads and writes. The embodiment disclosed herein illustrates a CUDA implementation that achieves a computational throughput of more than 5,000 million elements per second.
[0068] To summarize the data arrangement, the necessary physical elements (velocities and stresses in the seismic example) are arranged in Blocks as illustrated in Fig. 11. The mapview of Elements E(n) in Fig. 11 schematically illustrates the data Block arrangement, B0 to B8, so that the same blocks, meaning like-indexed blocks, from other Elements (e.g. volumes of physical parameters) are closer together in memory to enable more rapid data transfer and higher computational speed. Fig. 12 illustrates the data arrangement in memory of adjacent Planes (P0 to Pn-1), where the like-indexed Blocks of the Elements are closer together in memory for data-transfer efficiency. As illustrated, the adjacent Planes of each Block for all Elements are arranged in memory so that reads and writes are most efficient. Fig. 13 then illustrates this result, where the data for the threads are arranged in memory for maximum efficiency. Fig. 14 illustrates this data arrangement hierarchy. Starting from the bottom, the linear memory array of BLOCKS (like-indexed Blocks extracted from the Elements) is arranged into adjacent PLANES of the ELEMENTS. When the ELEMENTS are arranged in this manner, the thread blocks are laid out in memory so that the GPU read and write sequences facilitate the maximally efficient throughput for GPU operations.
[0069] Fig. 17 illustrates a schematic example of the hardware and operating environment in which embodiments as described herein and their equivalents may be practiced. The description of Fig. 17 includes a general description of the computer hardware, computing environment or information handling system in which the embodiments may be implemented. Although specific hardware may not be required, embodiments may be implemented in the general context of computer-executable instructions, such as program modules, being executed by a computer. Various embodiments may be practiced with a personal computer, a mainframe computer or combinations that include workstations with servers. Program modules include routines, programs, objects, components and data structures for performing tasks, processing data, and recording and displaying information.
[0070] The products as defined herein may be particularly adapted for use in what are termed "information handling systems." An information handling system is any instrumentality or aggregate of instrumentalities primarily designed to compute, classify, process, transmit, receive, retrieve, originate, switch, store, display, manifest, measure, detect, record, reproduce, handle or utilize any form of information, intelligence or data for business, scientific, control or other purposes. Examples include personal computers and larger processors such as servers and mainframes, and may contain elements illustrated in Fig. 17.
[0071] Embodiments may be practiced with various computer or information handling system configurations that separately or in combination may include handheld devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network computers, minicomputers, mainframe computers, and the like. Embodiments may be practiced with tasks performed in and over distributed computing environments that include remote processing devices linked through a communications network. Program modules operating in distributed computing environments may be located in various memory locations, both local and remote.
[0072] Fig. 17 is illustrative of hardware and an operating environment for implementing a general purpose computing device or information handling system in the form of a computer 10. Computer 10 includes a processor or processing unit 11 that may include 'onboard' instructions 12. Computer 10 has a system memory 20 attached to a system bus 40 that operatively couples various system components, including system memory 20, to processing unit 11. The system bus 40 may be any of several types of bus structures using any of a variety of bus architectures as are known in the art.
[0073] While one processing unit 11 is illustrated in Fig. 17, there may be a single central-processing unit (CPU) or a graphics processing unit (GPU), or both or a plurality of processing units. Computer 10 may be a standalone computer, a distributed computer, or any other type of computer.
[0074] System memory 20 includes read only memory (ROM) 21 with a basic input/output system (BIOS) 22 containing the basic routines that help to transfer information between elements within the computer 10, such as during start-up. System memory 20 of computer 10 further includes random access memory (RAM) 23 that may include an operating system (OS) 24, an application program 25 and data 26.
[0075] Computer 10 may include a disk drive 30 to enable reading from and writing to an associated computer or machine readable medium 31. Computer readable media 31 includes application programs 32 and program data 33.
[0076] For example, computer readable medium 31 may include programs to process seismic data, which may be stored as program data 33, according to the methods disclosed herein. The application program 32 associated with the computer readable medium 31 includes at least one application interface for receiving and/or processing program data 33. The program data 33 may include seismic data acquired according to embodiments disclosed herein. At least one application interface may be associated with calculating imaging conditions for locating subsurface hydrocarbon reservoirs.

[0077] The disk drive may be a hard disk drive (e.g., for a magnetic disk), a magnetic disk drive for reading from or writing to removable magnetic media, or an optical disk drive for reading from or writing to a removable optical disk such as a CD ROM, DVD or other optical media.
[0078] Disk drive 30, whether a hard disk drive, magnetic disk drive or optical disk drive is connected to the system bus 40 by a disk drive interface (not shown). The drive 30 and associated computer-readable media 31 enable nonvolatile storage and retrieval for one or more application programs 32 and data 33 that include computer-readable instructions, data structures, program modules and other data for the computer 10. Any type of computer-readable media that can store data accessible by a computer, including but not limited to cassettes, flash memory, digital video disks in all formats, random access memories (RAMs), read only memories (ROMs), may be used in a computer 10 operating environment.
[0079] The application programs 32 may be associated with one or more application program interfaces. An application programming interface (API) 35 may be an interface that a computer system, library or application provides in order to allow requests for services to be made of it by other computer programs, and/or to allow data to be exchanged between them. An API 35 may also be a formalized set of software calls and routines that can be referenced by an application program 32 in order to access supporting application programs or services, which programs may be accessed over a network 90.
[0080] APIs 35 are provided that allow for higher level programming for displaying and mapping subsurface reservoirs. For example, APIs are provided for receiving seismic data, and decomposing, merging, smoothing and averaging the data. Moreover, the APIs allow for determining the output of imaging condition processing and storing it for display.
[0081] Data input and output devices may be connected to the processing unit 11 through a serial interface 50 that is coupled to the system bus. Serial interface 50 may be a universal serial bus (USB). A user may enter commands or data into computer 10 through input devices connected to serial interface 50, such as a keyboard 53 and pointing device (mouse) 52. Other peripheral input/output devices 54 may include without limitation a microphone, joystick, game pad, satellite dish, scanner or fax, speakers, wireless transducer, etc. Other interfaces (not shown) that may be connected to bus 40 to enable input/output to computer 10 include a parallel port or a game port. Computers often include other peripheral input/output devices 54 that may be connected with serial interface 50, such as a machine readable media 55 (e.g., a memory stick), a printer 56 and a data sensor 57. A seismic sensor or seismometer for practicing embodiments disclosed herein is a nonlimiting example of a data sensor 57. A video display 72 (e.g., a liquid crystal display (LCD), a flat panel, a solid state display, or a cathode ray tube (CRT)) or other type of output display device may also be connected to the system bus 40 via an interface, such as a video adapter 70. A map or subsurface display created from imaging condition values as disclosed herein may be displayed with video display 72.
[0082] A computer 10 may operate in a networked environment using logical connections to one or more remote computers. These logical connections are achieved by a communication device associated with computer 10. A remote computer may be another computer, a server, a router, a network computer, a workstation, a client, a peer device or other common network node, and typically includes many or all of the elements described relative to computer 10. The logical connections depicted in Fig. 17 include a local-area network (LAN) or a wide-area network (WAN) 90. However, the designation of such networking environments, whether LAN or WAN, is often arbitrary as the functionalities may be substantially similar. These networks are common in offices, enterprise-wide computer networks, intranets and the Internet.
[0083] When used in a networking environment, the computer 10 may be connected to a network 90 through a network interface or adapter 60. Alternatively computer 10 may include a modem 51 or any other type of communications device for establishing communications over the network 90, such as the Internet. Modem 51, which may be internal or external, may be connected to the system bus 40 via the serial interface 50.
[0084] In a networked deployment computer 10 may operate in the capacity of a server or a client user machine in server-client user network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. In a networked environment, program modules associated with computer 10, or portions thereof, may be stored in a remote memory storage device. The network connections schematically illustrated are for example only and other communications devices for establishing a communications link between computers may be used.
[0085] While various embodiments have been shown and described, various modifications and substitutions may be made thereto without departing from the spirit and scope of the disclosure herein. Accordingly, it is to be understood that the present embodiments have been described by way of illustration and not limitation.

Claims:
1) A computerized method for imaging subsurface energy sources comprising:
acquiring multicomponent seismic data;
arranging 3D volumes of velocity and stress parameters in computer memory by ordering adjacent planes from like-indexed blocks extracted from the 3D volumes so that like-indexed blocks are adjacent in the computer memory;

reverse propagating the multicomponent seismic data to obtain a reverse time wave field; and
solving a full elastic wave equation for the reverse time wave field using a graphics processing unit to obtain wave field decomposition data.
2) The method of Claim 1 further comprising applying at least one imaging condition to the wave field decomposition data to obtain imaging data.
3) The method of Claim 2 further comprising storing the imaging data for display.
4) The method of Claim 1 further comprising updating the stress parameters, loading velocity parameters, synching all threadblocks and writing synched data to global memory.
5) The method of Claim 1 further comprising updating velocity parameters, loading stress parameters, synching all threadblocks and writing synched data to global memory.
6) The method of Claim 1 wherein the 3D volumes of velocity and stress parameters are at least one selected from the group consisting of i) a horizontal velocity, ii) vertical velocity, iii) a horizontal stress, iv) a vertical stress, and v) a cross stress.
7) The method of claim 2 wherein the at least one imaging condition is at least one selected from the group consisting of: i) the zero-lag of the P-wave autocorrelation, ii) the zero-lag of the S-wave autocorrelation, iii) the zero-lag of the cross-correlation of the P- and S-wave energy densities, iv) autocorrelation of the absolute value of particle motion, v) maximum over all time, and vi) the crosscorrelation of the energy density functions EpEs.

8) The method of claim 2 further comprising applying the group of imaging conditions consisting of: i) the zero-lag of the P-wave autocorrelation, ii) the zero-lag of the S-wave autocorrelation, iii) the zero-lag of the cross-correlation of the P- and S-wave energy densities, iv) autocorrelation of the absolute value of particle motion, v) maximum over all time, and vi) the crosscorrelation of the energy density functions EpEs.
9) A computerized method for imaging subsurface energy sources by solving a full elastic wave equation for the reverse time wave field using a graphics processing unit to obtain wave field decomposition data comprising:
acquiring multicomponent seismic data;
arranging volumes of velocity and stress parameters in computer memory by ordering adjacent planes from like-indexed blocks extracted from the volumes so that like-indexed blocks are adjacent in memory;
updating the stress parameters, loading velocity parameters, synching all threadblocks and writing synched data to global memory;
updating velocity parameters, loading stress parameters, synching all threadblocks and writing synched data to global memory; and
decomposing a seismic wavefield, computing an imaging condition, writing the computed imaging condition to memory.
10) The method of claim 9 further comprising applying at least one imaging condition to the wave field decomposition data to obtain imaging data.
11) The method of claim 10 further comprising storing the imaging data for display.
12) The method of Claim 9 wherein the 3D volumes of velocity and stress parameters are at least one selected from the group consisting of i) a horizontal velocity, ii) vertical velocity, iii) a horizontal stress, iv) a vertical stress, and v) a cross stress.
13) The method of claim 10 wherein the at least one imaging condition is at least one selected from the group consisting of: i) the zero-lag of the P-wave autocorrelation, ii) the zero-lag of the S-wave autocorrelation, iii) the zero-lag of the cross-correlation of the P- and S-wave energy densities, iv) autocorrelation of the absolute value of particle motion, v) maximum over all time, and vi) the crosscorrelation of the energy density functions EpEs.

14) The method of claim 10 further comprising applying the group of imaging conditions consisting of: i) the zero-lag of the P-wave autocorrelation, ii) the zero-lag of the S-wave autocorrelation, iii) the zero-lag of the cross-correlation of the P- and S-wave energy densities, iv) autocorrelation of the absolute value of particle motion, v) maximum over all time, and vi) the crosscorrelation of the energy density functions EpEs.
15) A computerized method for processing 3D seismic data comprising:
acquiring multicomponent seismic data;
acquiring a 3D volume of parameters associated with the multicomponent seismic data;
arranging the volume of parameters in computer memory by ordering adjacent planes from like-indexed blocks extracted from the volume so that like- indexed blocks are adjacent in the computer memory;
reverse propagating the multicomponent seismic data to obtain a reverse time wave field; and
solving a full elastic wave equation for the reverse time wave field using a graphics processing unit to obtain wave field decomposition data.
16) The method of claim 15 further comprising applying at least one imaging condition to the wave field decomposition data to obtain imaging data.
17) The method of claim 16 further comprising storing the imaging data for display.
18) The method of claim 16 wherein the at least one imaging condition is at least one selected from the group consisting of: i) the zero-lag of the P-wave autocorrelation, ii) the zero-lag of the S-wave autocorrelation, iii) the zero-lag of the cross-correlation of the P- and S-wave energy densities, iv) autocorrelation of the absolute value of particle motion, v) maximum over all time, and vi) the crosscorrelation of the energy density functions EpEs.
19) The method of Claim 15 wherein the 3D volumes of parameters associated with the multicomponent seismic data are at least one selected from the group consisting of i) a horizontal velocity, ii) vertical velocity, iii) a horizontal stress, iv) a vertical stress, and v) a cross stress.

20) The method of claim 15 further comprising updating the stress parameters, loading velocity parameters, synching all threadblocks and writing synched data to global memory, and updating velocity parameters, loading stress parameters, synching all threadblocks and writing synched data to global memory.
Kind code of ref document: A1