This invention relates to the sorting of large volumes of seismic data economically both as to time and as to hardware.
In the processing of seismic data, particularly in the oil field industry, the sheer quantity of data to be processed has increased very rapidly. For example, for 3-dimensional (3D) imaging, the channel count has already reached 3840 channels per Vibroseis position, yielding data sets today on the order of 1 to 2 terabytes (10^12 bytes) or larger. Data sets of this size can easily comprise 500 million traces, and the desired data set size is only expected to grow.
A necessary initial step for many seismic processing algorithms is sorting the seismic data into a special order, such as common mid-point or common receiver gathers. In general, when using a heap-sort algorithm for trace data, one may calculate the complexity (the number of operations) associated with the task to be N*ln(N), where N is the number of traces. For the example of 500 million traces, this requires:
500*10^6 * ln(500*10^6) = about 10 billion operations.
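The estimate above can be checked directly. The following is a minimal sketch that evaluates N*ln(N) for the 500 million trace example; the function name heap_sort_ops is hypothetical and is used here only for illustration. The base-2 variant is printed alongside because comparison sorts are often quoted in that form.

```python
import math


def heap_sort_ops(n_traces):
    """Return N * ln(N), the operation-count estimate used in the text."""
    return n_traces * math.log(n_traces)


if __name__ == "__main__":
    n = 500_000_000  # 500 million traces, as in the example
    print(f"N ln N   = {heap_sort_ops(n):.3e} operations")
    print(f"N log2 N = {n * math.log2(n):.3e} operations")
```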
In practice this estimate is misleading because it assumes that the entire data set resides in memory in a stand-alone computer system, and most systems cannot hold this amount of data in memory. Indeed, even though it is well known that computer hardware is advancing rapidly in speed, capacity and reliability, the improvements in hardware are as yet insufficient to meet the needs of current seismic data processing. The increase in channel count and multiplicity of 3D data and the need to apply advanced processing methods, such as pre-stack time/depth migration, require innovative data handling methods. Quick sorting is one of the key factors for the efficient execution of these advanced algorithms.
The seismic industry has long been aware of the need for a sorting method efficient enough to handle the increasingly large quantities of data, but to date the methods in use have been inadequate.
The two basic sorting methods in use today are memory-based sorting and disk-based sorting. Memory-based sorting reads and accumulates traces in memory. After all the traces have been read in, or a certain criterion is satisfied, output traces are produced in the desired order. A typical criterion is a pre-defined window size, which could be the minimum number of traces needed to be held in memory and/or scratch disks. This method will not work when the data volume or the window size is larger than the memory capacity. Moreover, this method is not robust because traces can be lost when the actual window size in the input data set is larger than the user-pre-defined window size.
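The robustness problem of the memory-based method can be illustrated with a short sketch. It assumes a bounded in-memory heap as the window and integer sort keys. The function name window_sort and the policy of discarding a late trace are assumptions made for illustration, not a description of any particular prior-art system.

```python
import heapq


def window_sort(traces, window):
    """Sort (key, payload) pairs while holding at most `window` traces.

    Returns (output, lost). A trace is counted as lost when it arrives
    after a trace with a larger key has already been output.
    """
    heap, output, lost = [], [], []
    last_key = None  # key of the most recently output trace
    for key, payload in traces:
        if last_key is not None and key < last_key:
            lost.append((key, payload))  # arrived too late for the window
            continue
        heapq.heappush(heap, (key, payload))
        if len(heap) > window:
            last_key, out_payload = heapq.heappop(heap)
            output.append((last_key, out_payload))
    while heap:  # input exhausted: flush what remains, in order
        output.append(heapq.heappop(heap))
    return output, lost


if __name__ == "__main__":
    data = [(3, "c"), (1, "a"), (2, "b"), (0, "z"), (4, "d")]
    print(window_sort(data, window=2))  # window too small for this input
    print(window_sort(data, window=4))  # window large enough
```

With the smaller window, the trace with key 0 arrives after key 1 has already been output and cannot be placed, which is the failure mode described above.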
Disk-based sorting, the second conventional method, keeps the data in a limited memory buffer before writing it to temporary scratch disk files. The traces in the buffer can be partially sorted before being written to disk. After all the data has been written to the hard disk(s), it can be read back in the desired sorted order and output to the final disk file(s). In this method, random access to the traces is usually needed, which is relatively slow. Moreover, this method may require as much scratch disk space as the entire input data set. This is disadvantageous or impossible for sorting the large 3D seismic data sets in the terabyte range discussed above.
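The scratch-space cost of disk-based sorting can be seen in a small sketch. Note that the sketch uses a standard run-and-merge external sort with sequential reads, which is a substitute chosen for brevity and is not necessarily how the conventional systems described above read the data back. The function name disk_sort, the use of pickle, and the file naming are all assumptions.

```python
import heapq
import os
import pickle
import tempfile


def disk_sort(traces, buffer_size, scratch_dir):
    """Sort (key, payload) pairs via sorted runs on scratch disk.

    Returns (sorted_traces, scratch_bytes), where scratch_bytes is the
    total size of the temporary run files.
    """
    run_paths, buf = [], []

    def flush():
        buf.sort()  # partial sort of the in-memory buffer
        path = os.path.join(scratch_dir, f"run_{len(run_paths):04d}.pkl")
        with open(path, "wb") as f:
            for item in buf:
                pickle.dump(item, f)
        run_paths.append(path)
        buf.clear()

    for trace in traces:
        buf.append(trace)
        if len(buf) >= buffer_size:
            flush()
    if buf:
        flush()

    def read_run(path):
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

    scratch_bytes = sum(os.path.getsize(p) for p in run_paths)
    merged = list(heapq.merge(*(read_run(p) for p in run_paths)))
    return merged, scratch_bytes


if __name__ == "__main__":
    data = [(5, "e"), (1, "a"), (4, "d"), (2, "b"), (3, "c")]
    with tempfile.TemporaryDirectory() as scratch:
        print(disk_sort(data, buffer_size=2, scratch_dir=scratch))
```

Every input trace is written to scratch once before any output is produced, so the scratch space grows with the full size of the input.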
Besides the two basic methods described above, hybrid methods combining aspects of the two basic methods have been proposed that reduce utilization of scratch space and CPU time. However, a simple combination of the two basic methods does not produce an acceptable method that works well for sorting large input data sets because the fundamental problems for each of the two basic methods remain unresolved.
Thus, for large volumes of seismic data, sorting is a serious task in terms of human and computer resources. Most systems cannot hold the entire input data in memory, and the conventionally available methods take an inordinate amount of time to process a data set of the desired size. For example, even using a hybrid of the two basic methods, it has taken the applicants 6-8 weeks to sort a 1.2 terabyte data set. Accordingly, an improved sorting scheme is needed.
It is therefore an object of the present invention to provide a method and apparatus for sorting large quantities of seismic data that avoid the above-described difficulties of the prior art.
It is a further object of the present invention to provide a method and apparatus that sort large quantities of seismic data in times on the order of no more than days, rather than weeks.
The above and other objects are achieved by the present invention which, in one embodiment, is directed to a method for sorting large volumes of seismic data into a defined order. The method includes a receiving step of receiving one of a plurality of data portions, where the plurality of data portions constitute an input data set, each data portion containing a plurality of seismic data and having associated therewith an index distinguishing the respective data portion from all other data portions in the input data set. The method further includes an allocating step of allocating the received data portion to one of a plurality of leaf files of a B-Tree structure, the B-Tree structure defining a leaf order of the leaf files, a first storing step of storing the allocated data portion in a scratch memory space corresponding to the allocated leaf file, and a first repeating step of repeating the receiving step, the allocating step and the first storing step until the next one of the leaf files in the leaf order is full.
The method still further comprises a reading step of reading a full one of the leaf files from the scratch memory space, a second storing step of storing data portions of the read leaf file into a sorting memory space, a sorting step of sorting the data portions in the sorting memory space into a respective sub-order based upon the indices of the data portions therein, and a step of selectively repeating the reading step, the second storing step and the sorting step until all the data portions of the read leaf file have been sorted.
Finally, the method comprises an outputting step of outputting the sorted data portions of the read leaf file in their sub-order to a final output data stream, and a second repeating step of repeating at least the reading step, the second storing step, the sorting step and the outputting step until all the data portions of the input data set have been outputted in the respective sub-orders for all the full leaf files in leaf order to the final output data stream, where the data portions in their respective sub-orders for all the full leaf files in leaf order are in the defined overall order.
In a preferred embodiment, each data portion is a seismic data trace, and each allocated data portion is stored in a scratch disk memory.
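The flow of the method summarized above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the claimed implementation: the trace indices are assumed to be the unique integers 0 to total-1, each leaf file is assumed to cover a fixed, contiguous index range (so that a leaf is "full" once every index in its range has arrived), each full leaf is assumed to fit in the sorting memory in one pass, and a flat list of leaves stands in for the B-Tree structure. The names leaf_sort, leaf_capacity and the text file layout are hypothetical.

```python
import os
import tempfile


def leaf_sort(traces, total, leaf_capacity, scratch_dir):
    """Yield (index, payload) pairs in ascending index order.

    Each received trace is appended to the scratch file of its leaf.
    Whenever the next leaf in leaf order is full, it is read back,
    sorted in memory and output.
    """
    n_leaves = (total + leaf_capacity - 1) // leaf_capacity
    counts = [0] * n_leaves
    next_leaf = 0  # next leaf to output, in leaf order

    def leaf_path(k):
        return os.path.join(scratch_dir, f"leaf_{k:06d}.txt")

    def expected(k):
        # The last leaf may cover fewer indices than the others.
        return min(leaf_capacity, total - k * leaf_capacity)

    def read_and_sort(k):
        with open(leaf_path(k)) as f:
            rows = [line.rstrip("\n").split("\t", 1) for line in f]
        os.remove(leaf_path(k))  # release scratch space early
        return sorted((int(i), p) for i, p in rows)

    for index, payload in traces:
        k = index // leaf_capacity  # allocating step
        with open(leaf_path(k), "a") as f:  # first storing step
            f.write(f"{index}\t{payload}\n")
        counts[k] += 1
        # Output every leaf that is now full, strictly in leaf order.
        while next_leaf < n_leaves and counts[next_leaf] == expected(next_leaf):
            yield from read_and_sort(next_leaf)
            next_leaf += 1


if __name__ == "__main__":
    arrival = [7, 2, 9, 0, 3, 1, 8, 5, 4, 6]
    with tempfile.TemporaryDirectory() as scratch:
        stream = ((i, f"trace{i}") for i in arrival)
        for item in leaf_sort(stream, total=10, leaf_capacity=4,
                              scratch_dir=scratch):
            print(item)
```

Because a leaf file is deleted as soon as it has been output, the scratch space in use at any moment is limited to the leaves that are still waiting to be filled, and the final output is written sequentially.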
These and other objects, features and advantages of the present invention will be apparent from the following detailed description of the preferred embodiments taken in conjunction with the following drawings, wherein like reference numerals denote like elements.