The present invention relates generally to parallel processing and more particularly to determining an extremum (e.g., a maximum or minimum) from a set of values distributed across an array of processing elements in a parallel processing system.
Conventional central processing units (“CPUs”), such as those found in most personal computers, execute a single program (or instruction stream) and operate on a single stream of data. For example, the CPU fetches its program and data from a random access memory (“RAM”), manipulates the data in accordance with the program instructions, and writes the results back sequentially. There is a single stream of instructions and a single stream of data (note: a single operation may operate on more than one data item, as in X=Y+Z; however, only a single stream of results is produced). Although the CPU may determine the sequence of instructions executed in the program itself, only one operation can be completed at a time. Because conventional CPUs execute a single program (or instruction stream) and operate on a single stream of data, conventional CPUs may be referred to as single-instruction, single-data CPUs or SISD CPUs.
The speed of conventional CPUs has dramatically increased in recent years. Additionally, the use of cache memories gives conventional CPUs faster access to the desired instruction and data streams. However, because conventional CPUs can complete only one operation at a time, conventional CPUs are not suitable for extremely demanding applications having large data sets (such as moving image processing, high quality speech recognition, and analytical modeling applications, among others).
Improved performance over conventional SISD CPUs may be achieved by building systems which exhibit parallel processing capability. Typically, parallel processing systems use multiple processing units or processing elements to simultaneously perform one or more tasks on one or more data streams. For example, in one class of parallel processing system, the results of an operation from a first CPU are passed to a second CPU for additional processing, and from the second CPU to another CPU, and so on. Such a system, commonly known as a “pipeline”, is referred to as a multiple-instruction, single-data or MISD system because each CPU receives a different instruction stream while operating on a single data stream. Improved performance may also be obtained by using a system which contains many autonomous processors, each running its own program (even if the program running on the processors is the same code) and producing multiple data streams. A system in this class is referred to as a multiple-instruction, multiple-data or MIMD system.
Additionally, improved performance may be obtained using a system which has multiple identical processing units each performing the same operations at once on different data streams. The processing units may be under the control of a single sequencer running a single program. A system in this class is referred to as a single-instruction, multiple-data or SIMD system. When the number of processing units in this type of system is very large (e.g., hundreds or thousands), the system may be referred to as a massively parallel SIMD system.
Nearly all computer systems now exhibit some aspect of one or more of these types of parallelism. For example, MMX extensions are SIMD; multiple processors (graphics processors, etc.) are MIMD; pipelining (especially in graphics accelerators) is MISD. Furthermore, techniques such as out-of-order execution and multiple execution units have been used to introduce parallelism within conventional CPUs as well.
Parallel processing is also used in active memory applications. An active memory refers to a memory device having a processing resource distributed throughout the memory structure. The processing resource is most often partitioned into many similar processing elements (PEs) and is typically a highly parallel computer system. By distributing the processing resource throughout the memory system, an active memory is able to exploit the very high data bandwidths available inside a memory system. Another advantage of active memory is that data can be processed “on-chip” without the need to transmit the data across a system bus to the CPU or other system resource. Thus, the work load of the CPU may be reduced to operating system tasks, such as scheduling processes and allocating system resources.
A typical active memory includes a number of interconnected PEs which are capable of simultaneously executing instructions sent from a central sequencer or control unit. The PEs may be connected in a variety of different arrangements depending on the design requirements for the active memory. For example, PEs may be arranged in hypercubes, butterfly networks, one-dimensional strings/loops, and two-dimensional meshes, among others.
A typical PE may contain data, for example a set of values, stored in one or more registers. In some instances, it may be desirable to determine the extremum (e.g., the highest or lowest value) of the set of values on an individual PE. Furthermore, it may be desirable to find the extremum for an entire array of PEs. Conventional methods for finding the extremum, however, often result in a number of processing cycles being “lost.” A lost cycle may refer to, for example, a cycle in which the PE must wait to complete a calculation because the necessary data has yet to be transferred into or out of the PE.
One approach for finding the global extremum of a set of shorts (a “short” refers to a 16-bit value) for an array of 8-bit processors transmits the bytes in the order in which they are needed for comparison in the PE. The 8-bit PE processes each short as two separate bytes, a “most significant” (MS) byte and a “least significant” (LS) byte. Once started, for continuous operation, this approach requires a further four (4) cycles per short. First, the local LS-byte of the needed short is loaded onto the network during the first clock pulse and transferred to the PE during the second clock pulse. Next, the local MS-byte of the needed short is loaded onto the network during the third clock pulse and transferred to the PE during the fourth clock pulse. As can be seen, four (4) cycles are required to transfer the needed short to the PE. Thus, to transfer sixteen (16) shorts, sixty-four (64) cycles are required.
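The transfer order described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not a description of any actual hardware: the function names `split_short` and `interleaved_schedule` are hypothetical, values are assumed to be unsigned, and each load or transfer is assumed to occupy exactly one cycle.

```python
def split_short(value):
    """Split an unsigned 16-bit value into (ls_byte, ms_byte)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("expected an unsigned 16-bit value")
    return value & 0xFF, (value >> 8) & 0xFF


def interleaved_schedule(shorts):
    """Return one (cycle, action, byte) entry per clock cycle.

    Assumed model: each short costs four cycles, in the order
    load LS, transfer LS, load MS, transfer MS.
    """
    schedule = []
    cycle = 1
    for value in shorts:
        ls, ms = split_short(value)
        for name, byte in (("LS", ls), ("MS", ms)):
            schedule.append((cycle, f"load {name}-byte onto network", byte))
            schedule.append((cycle + 1, f"transfer {name}-byte to PE", byte))
            cycle += 2
    return schedule


if __name__ == "__main__":
    for entry in interleaved_schedule([0x1234]):
        print(entry)
```

Under these assumptions, a list of sixteen shorts produces a schedule of sixty-four entries, which agrees with the cycle count given above.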
Also, two (2) cycles are required for the PE to compare one short to another short. For example, the LS-byte of short-1 is compared to the LS-byte of short-2 in a first cycle and the MS-byte of short-1 is compared to the MS-byte of short-2 in a second cycle. For sixteen (16) values, fifteen (15) comparisons are required. Thus of the total sixty-four (64) cycles, the PE is “working” a minimum of thirty (30) cycles and is idle for thirty-four (34) cycles. Accordingly, this approach is considered to have a “transfer bottleneck” because the idle cycles are caused by the way the bytes are transferred.
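As a rough illustration of the two-cycle comparison and the resulting idle time, the following Python sketch compares two shorts one byte at a time and tallies the cycle budget. It assumes unsigned values and a borrow-propagating comparison (LS byte first, then MS byte with borrow); the borrow mechanism and the names `less_than_bytewise` and `interleaved_cycle_budget` are assumptions made for illustration and are not taken from the description above.

```python
def less_than_bytewise(a, b):
    """Return True if a < b for unsigned 16-bit values, in two 8-bit steps."""
    a_ls, a_ms = a & 0xFF, (a >> 8) & 0xFF
    b_ls, b_ms = b & 0xFF, (b >> 8) & 0xFF
    # Cycle 1: compare the LS bytes; only the borrow is carried forward.
    borrow = 1 if a_ls < b_ls else 0
    # Cycle 2: compare the MS bytes with the borrow; a final borrow means a < b.
    return a_ms - b_ms - borrow < 0


def interleaved_cycle_budget(n_shorts):
    """Cycle accounting for the interleaved (LS, MS, LS, MS, ...) transfer order."""
    total = 4 * n_shorts            # four transfer cycles per short
    working = 2 * (n_shorts - 1)    # two compare cycles per comparison
    return {"total": total, "working": working, "idle": total - working}


if __name__ == "__main__":
    print(less_than_bytewise(0x00FF, 0x0100))
    print(interleaved_cycle_budget(16))
```

For sixteen shorts, this accounting yields sixty-four total cycles, thirty working cycles, and thirty-four idle cycles, matching the figures stated above.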
A second approach attempts to minimize the time required to transfer the shorts by first transferring all of the LS-bytes to the PE and then transferring all of the MS-bytes to the PE. Once started, for continuous operation, this approach requires approximately three (3) cycles per short. For example, for sixteen (16) PEs each having one local short, sixteen (16) cycles are needed to transfer each short's LS-byte to each other PE and to collect the sixteen (16) LS-bytes in the PE's register files. An additional sixteen (16) cycles are then needed to transfer each short's MS-byte to each other PE and to start comparing the shorts to each other. Another fifteen (15) cycles are needed to finish comparing the shorts. It should be noted that the PE cannot start comparing the shorts until the first MS-byte is transferred. After the first MS-byte is transferred, the PE requires thirty (30) cycles to finish comparing all sixteen (16) shorts. Thus, of the forty-six (46) total cycles, the transfer network is working for thirty-two (32) cycles and is idle for fourteen (14) cycles. Accordingly, this approach is considered to have a “processing bottleneck” because the idle cycles are caused by the way the bytes are processed. It should be noted that each of the approaches discussed above may also require additional cycles for initialization and termination of the process.
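The cycle accounting for this second approach can be sketched as follows. The model is an assumption chosen to reproduce the forty-six cycle total stated above: the LS-byte phase takes one cycle per short, and the two-cycle comparisons are taken to begin once the first MS-byte has arrived. The function name `split_phase_cycle_budget` is hypothetical.

```python
def split_phase_cycle_budget(n_shorts):
    """Cycle accounting when all LS bytes are sent first, then all MS bytes."""
    ls_phase = n_shorts                # one cycle per LS byte
    ms_phase = n_shorts                # one cycle per MS byte
    compare = 2 * (n_shorts - 1)       # two compare cycles per comparison
    # Assumption: comparing overlaps the MS phase and runs past its end.
    total = ls_phase + compare
    network_busy = ls_phase + ms_phase
    return {
        "total": total,
        "network_busy": network_busy,
        "network_idle": total - network_busy,
        "cycles_per_short": total / n_shorts,
    }


if __name__ == "__main__":
    print(split_phase_cycle_budget(16))
```

For sixteen shorts, this model gives forty-six total cycles, with the network busy for thirty-two and idle for fourteen, or just under three cycles per short.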
Each of the approaches discussed above has idle or “lost” cycles. Thus, there exists a need for a method for determining the extremum of a set of values on an array of parallel processors such that utilization of the resources of the parallel processing system is maximized. More specifically, there exists a need for a method for determining the extremum of a set of values on an array of parallel processing elements of an active memory such that utilization of the resources of the active memory is maximized.