1. Field of the Invention
The present invention is directed to a method and apparatus for determining whether a sampled data set, distributed over processors in a ring configuration, can be combined using a data combining operation, such as interpolation, without requiring data redistribution over the processors and the communication overhead required for such redistribution and, more particularly, to a system which checks whether four conditions exist before the combining operation can proceed. The term "combination" or "combining," when used in the present application in reference to data sets, refers to any operation performed on data sets in which the elements of one data set are associated with elements of another data set, for example any mathematical operation such as interpolation or correlation between the elements of two data sets, or any arithmetical operation such as addition or subtraction between two data sets.
2. Description of the Related Art
Sampled data sets can be distributed over processors in a ring-connected distributed-memory parallel computer. A sampled data set is one whose elements are sampled at a constant frequency over a contiguous subset of an original data set. Each original data set is distributed in a linear fashion over the processors.
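The two definitions above can be illustrated with a minimal sketch. It assumes that "sampled at a constant frequency" means a fixed stride between successive indices of the original data set, and that "distributed in a linear fashion" means a contiguous block distribution; the function names `block_owner` and `sample_indices` and the sizes used are hypothetical and chosen only for illustration.

```python
def block_owner(index, n_elements, n_procs):
    """Processor holding element `index` under an assumed contiguous block distribution."""
    block = -(-n_elements // n_procs)  # ceiling division: elements per processor
    return index // block


def sample_indices(start, stride, count):
    """Indices of a sampled data set: `count` elements taken every `stride` positions."""
    return [start + k * stride for k in range(count)]


# Hypothetical example: 16 original elements spread over 4 processors.
n_elements, n_procs = 16, 4
sampled = sample_indices(start=2, stride=3, count=4)
owners = [block_owner(i, n_elements, n_procs) for i in sampled]

print(sampled)  # original indices covered by the sampled set
print(owners)   # processor holding each sampled element
```

Under these assumptions, the sampled elements are generally spread unevenly over the processors, which is why the location of each sampled element has to be considered before two sampled sets are combined.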
Distributed-memory parallel computers are an efficient apparatus for many computing problems having sampled data sets. These computers utilize multiple processors, each of which has a separate, local memory that is not accessible to any of the other processors. The processors can communicate with each other through interprocessor communication links. The number of processors in such a computer can often be extended without limit. However, the performance of such a computer is limited by the cost of communicating information between the processors so that, at some point, the performance of a large computer with many processors will be constrained by the interprocessor communication bandwidth.
Such computers are most efficient when computing on data sets which are distributed over the various processors and which do not require much communication. In such a case, each processor in the computer will process a local subset of the larger data set. Since each local subset is processed at the same time by each of the many processors within the computer, the overall processing time is greatly reduced. The actual improvement in processing time over a conventional, single-processor computer will depend on the number of processors (and the concomitant local data subset size) and on any communication overhead the computer may experience.
Tasks which require the combination of corresponding data elements from different data sets which are distributed in the same way do not require any interprocessor communication. All the data elements needed by each processor are stored within its local memory. Unfortunately, many tasks are not this simple. For tasks in which the data elements to be combined are not stored in the same processor, each processor must communicate with other processors to get the information needed for the local computation. This communication limits the performance a computer may achieve.
Ring-connected processor arrays are particularly well suited to distributing linear data sets. When distributed over a ring of processors, each data set is divided into small one-dimensional local subsets, each of which is stored contiguously in the processors of the ring. Such distributions can also be extended into two dimensions using a mesh or toroidally connected processor array and a two-dimensional data set. An example of a one-dimensional data set might be a sequence of samples taken of a time-varying signal. An example of a two-dimensional data set is a digital image.
The total communication overhead that a computer may encounter when combining data stored in separate processors may be reduced by storing data redundantly within the various processors. If some piece of data is likely to be needed several times, it can be stored locally within every processor that needs it. This reduces the total amount of communication at the expense of additional storage. Linear data sets are often locally correlated so that operations on the data elements tend to involve other nearby data elements, that is, the operations are done within a local neighborhood of each element. When the data set is distributed over a ring of processors, this implies that the interprocessor communication will tend to be local as well. Thus, the computer's efficiency will be high, since local communication over a distributed network consumes less interprocessor communication bandwidth than global or long-distance communication.
When data elements are locally correlated within a one-dimensional set, redundant data storage within the processors is particularly effective. For example, consider a five-element convolution. Each data element must be accessed five times to compute the convolution. For an element on the edge within one processor, two elements from a neighboring processor must be accessed to compute the convolved value. Thus, it may be much more efficient for two neighboring processors to exchange a copy of their border elements with each other and store the input values locally. Once this has been done, each processor then has all the necessary data to perform its computation without any additional communication. Indeed, operations with these data elements can be done repeatedly, so long as the computing task does not require any new data from farther away and the data themselves are not changed.
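The border exchange described above can be sketched as follows. This is a single-process simulation, not an implementation for any particular machine: the processors are modeled as a list of Python lists, the data set is assumed to wrap around the ring at its ends, the block size is assumed to divide the data length evenly and to be at least as large as the border width, and the kernel is applied as a sliding weighted sum without flipping (identical to convolution for the symmetric kernel used here). All function names are hypothetical.

```python
def split_blocks(data, n_procs):
    """Contiguous block distribution; assumes len(data) is divisible by n_procs."""
    block = len(data) // n_procs
    return [data[p * block:(p + 1) * block] for p in range(n_procs)]


def exchange_halo(blocks, halo):
    """Give each processor copies of `halo` border elements from both ring neighbors."""
    n = len(blocks)
    expanded = []
    for p in range(n):
        left = blocks[(p - 1) % n][-halo:] if halo else []
        right = blocks[(p + 1) % n][:halo]
        expanded.append(left + blocks[p] + right)
    return expanded


def local_convolve(expanded_block, kernel):
    """Compute outputs for the owned elements using only locally stored data."""
    half = len(kernel) // 2
    return [
        sum(kernel[j] * expanded_block[i - half + j] for j in range(len(kernel)))
        for i in range(half, len(expanded_block) - half)
    ]


def ring_convolve(data, kernel, n_procs):
    """One border exchange, then purely local computation on every processor."""
    expanded = exchange_halo(split_blocks(data, n_procs), len(kernel) // 2)
    result = []
    for block in expanded:
        result.extend(local_convolve(block, kernel))
    return result


def circular_convolve(data, kernel):
    """Single-processor reference with the same wrap-around assumption."""
    n, half = len(data), len(kernel) // 2
    return [
        sum(kernel[j] * data[(i - half + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


data = list(range(12))
kernel = [1, 2, 3, 2, 1]  # five-element kernel, so two border elements per side
print(ring_convolve(data, kernel, n_procs=3) == circular_convolve(data, kernel))
```

After the single call to `exchange_halo`, each simulated processor computes its outputs without touching any other processor's block, which mirrors the point that the exchange need not be repeated while the data remain unchanged.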
Because the redundant storage of data in these circumstances is so useful, data sets are often stored in this "expanded" form, that is, with some data overlap between processors. Any local operation can be performed without requiring additional communication, so long as the original data is not changed (in which case the data exchange would have to be repeated) or the operation does not require data which is not locally stored.
To determine whether an operation can be performed on a data set, the computation algorithm requirements must be compared to the size of the data overlap. In many cases, this is very straightforward. For example, a convolution can be performed if one half the kernel size is less than or equal to the overlap.
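The convolution test above reduces to a one-line comparison. The sketch below assumes that "one half the kernel size" is the integer half (rounded down), which matches the five-element example in which two elements are needed from a neighboring processor; the function name `convolution_fits_overlap` is hypothetical.

```python
def convolution_fits_overlap(kernel_size, overlap):
    """True if a kernel of `kernel_size` elements needs no data beyond the stored overlap.

    Assumes the required border width is kernel_size // 2 (integer half).
    """
    return kernel_size // 2 <= overlap


print(convolution_fits_overlap(5, 2))  # five-element kernel, overlap of two
print(convolution_fits_overlap(7, 2))  # seven-element kernel, same overlap
```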
It can also happen, however, that operations are not performed on a single data set but might combine two sets or portions of two data sets. For example, two arrays might be summed element by element. In this case if the data sets are distributed identically, the operation is straightforward and no interprocessor communication is necessary since every processor contains corresponding portions of each data subset. If the positions of the elements of the data sets to be combined do not correspond, however, the situation is much more complex.
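The identically distributed case can be sketched as a simulation in which each processor sums only the portions it holds. The block distribution, the even divisibility of the data length by the number of processors, and the names `split_blocks` and `local_sum` are assumptions made for illustration.

```python
def split_blocks(data, n_procs):
    """Contiguous block distribution; assumes len(data) is divisible by n_procs."""
    block = len(data) // n_procs
    return [data[p * block:(p + 1) * block] for p in range(n_procs)]


def local_sum(block_a, block_b):
    """Work done by one processor: sum its local portions element by element."""
    return [x + y for x, y in zip(block_a, block_b)]


a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]

# Both arrays use the same distribution, so processor p holds matching portions.
partial = [local_sum(pa, pb) for pa, pb in zip(split_blocks(a, 3), split_blocks(b, 3))]
total = [value for block in partial for value in block]

print(total == [x + y for x, y in zip(a, b)])
```

No simulated processor reads another processor's block here, which is the property that is lost when the element positions of the two sets do not correspond.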
There are two ways in which this might occur. First, the sampled data sets might be taken from different portions of the original data sets. For example, the first sampled data set might be taken from the first half of a set, and the second sampled data set might be taken from the second half. Second, the corresponding elements of the two sampled sets might be relatively offset with respect to each other, that is, the first element of one sampled set might correspond with the second element of the second set, or vice versa. It is also possible that both of these circumstances might apply.
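Both circumstances can be illustrated by listing which corresponding pairs of elements end up on different processors. The sketch below is only an illustration of the problem, not of the checking method of the invention: it assumes a contiguous block distribution with a fixed block size, a common stride for both sampled sets, and no overlap storage, and the name `mismatched_pairs` is hypothetical.

```python
def mismatched_pairs(start_a, start_b, stride, count, block_size):
    """Pairs of original indices (ia, ib) whose elements sit on different processors.

    The k-th element of each sampled set is assumed to lie at start + k * stride,
    and element i is assumed to be held by processor i // block_size.
    """
    pairs = []
    for k in range(count):
        ia = start_a + k * stride
        ib = start_b + k * stride
        if ia // block_size != ib // block_size:
            pairs.append((ia, ib))
    return pairs


# Hypothetical layout: 16 elements per original set, 4 elements per processor.
# Case 1: first half of one set combined with the second half of the other.
print(mismatched_pairs(start_a=0, start_b=8, stride=1, count=8, block_size=4))
# Case 2: sampled sets offset by one element relative to each other.
print(mismatched_pairs(start_a=0, start_b=1, stride=1, count=8, block_size=4))
```

In the first case every pair is split across processors, while in the second only the pairs that straddle a block boundary are split, which is the situation that stored overlap data can absorb.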