A common computer processing task involves sequentially processing large numbers of data items, such as data corresponding to each of a large number of pixels in an array. Processing data in this manner normally requires fetching each item of data from a memory device, performing a mathematical or logical calculation on that data, and then returning the processed data to the memory device. Performing such processing tasks at high speed is greatly facilitated by a high data bandwidth between the processor and the memory device. The data bandwidth between a processor and a memory device is proportional to the width of a data path between the processor and the memory device and the frequency at which the data are clocked between the processor and the memory device. Therefore, increasing either of these parameters will increase the data bandwidth between the processor and memory device, and hence the rate at which data can be processed.
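The proportionality described above can be illustrated with a minimal sketch. It assumes one transfer per clock cycle; the function name and the example figures are hypothetical and are not taken from any particular device.

```python
def data_bandwidth_bits_per_s(path_width_bits: int, clock_hz: int) -> int:
    """Bandwidth as path width times clock rate, assuming one transfer per clock."""
    return path_width_bits * clock_hz


# Hypothetical figures: a narrow data path versus a wide one at the same clock.
narrow = data_bandwidth_bits_per_s(64, 100_000_000)
wide = data_bandwidth_bits_per_s(1024, 100_000_000)

# Widening the path by a factor of 16 raises the bandwidth by the same factor.
print(wide // narrow)  # 16
```

Doubling either parameter doubles the result, which is the sense in which increasing the path width or the clock frequency increases the data bandwidth.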
A memory device having its own processing resource is known as an active memory. Conventional active memory devices have been provided for mainframe computers in the form of discrete memory devices having dedicated processing resources. However, it is now possible to fabricate a memory device, particularly a dynamic random access memory (“DRAM”) device, and one or more processors on a single integrated circuit chip. Single chip active memories have several advantageous properties. First, the data path between the DRAM device and the processor can be made very wide to provide a high data bandwidth between the DRAM device and the processor. In contrast, the data path between a discrete DRAM device and a processor is normally limited by constraints on the size of external data buses. Further, because the DRAM device and the processor are on the same chip, the speed at which data can be clocked between the DRAM device and the processor can be relatively high, which also increases data bandwidth. The cost of an active memory fabricated on a single chip can also be less than the cost of a discrete memory device coupled to an external processor.
An active memory device can be designed to operate at a very high speed by parallel processing data using a large number of processing elements (“PEs”), each of which processes a respective group of the data bits. One type of parallel processor is known as a single instruction, multiple data (“SIMD”) processor. In a SIMD processor, each of a large number of PEs simultaneously receives the same instructions, but each PE processes separate data. The instructions are generally provided to the PEs by a suitable device, such as a microprocessor. The advantages of SIMD processing are simple control, efficient use of available data bandwidth, and minimal logic hardware overhead. The number of PEs included on a single chip active memory can be very large, thereby resulting in a massively parallel processor capable of processing large amounts of data.
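The SIMD principle can be sketched in software as follows. This is only a sequential model of what the PEs do simultaneously in hardware; the function name `simd_step` and the sample data are assumptions made for illustration.

```python
from typing import Callable, List


def simd_step(instruction: Callable[[int], int], pe_data: List[int]) -> List[int]:
    """Apply one broadcast instruction to the separate data item held by each PE."""
    # Every PE receives the same instruction; only the data differ.
    return [instruction(item) for item in pe_data]


# Hypothetical example: four PEs, each holding one pixel value.
pixels = [10, 20, 30, 40]
brightened = simd_step(lambda value: value + 5, pixels)
print(brightened)  # [15, 25, 35, 45]
```

Because a single instruction stream serves every PE, the control logic does not grow with the number of PEs, which reflects the simple control noted above.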
A common operation in active memory SIMD processing element arrays is the shifting of an array of data. To facilitate this operation, processing elements in the array are preferably connected to each other by data paths that permit neighboring processing elements to transfer data between each other. For example, in a two-dimensional rectangular array, each processing element may be connected to its four nearest neighbors. To maximize the operating speed of the processing element array, it is desirable to minimize the length of the data path between processing elements. Doing so does not present a problem in the interior of the array because neighboring processing elements can be placed very close to each other. However, at the edges of the array, the path from a processing element at one edge of the array to a processing element at the opposite edge of the array (which are neighbors to each other) can be very long.
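The edge problem can be made concrete with a short sketch. It assumes that the physical layout is identical to the logical layout and that path length is measured in processing element pitches; the function names are hypothetical.

```python
def wraparound_neighbors(row: int, col: int, rows: int, cols: int) -> dict:
    """Four nearest neighbors, with opposite edges treated as neighbors."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


def longest_link_unfolded(rows: int, cols: int) -> int:
    """Longest neighbor path when each PE sits at its logical position."""
    longest = 0
    for r in range(rows):
        for c in range(cols):
            for nr, nc in wraparound_neighbors(r, c, rows, cols).values():
                longest = max(longest, abs(nr - r) + abs(nc - c))
    return longest


# Interior links span one pitch, but an edge-to-edge link spans the whole array.
print(longest_link_unfolded(8, 8))  # 7
```

In this model the longest path grows with the size of the array, which is the behavior that folding is intended to avoid.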
One conventional technique that has been used to minimize the length of the longest path between processing elements can be explained with reference to FIGS. 1A–C, which show a technique for folding paper that can be analogized to “folding” an array of processing elements. As shown in FIG. 1A, a rectangular piece of paper 10 representing a logical array of processing elements has edges 12, 14, 16, 18, a horizontal fold line 20, and two vertical fold lines 26, 28. The paper 10 is initially folded about the vertical fold lines 26, 28 as shown by the arrows 30, 32, 34, 36 in FIG. 1A. After being folded in this manner, the paper has the configuration shown in FIG. 1B. In this configuration, the vertical edges 12, 16 are positioned close to each other at the center of the paper 10. As a result, the distance between the vertical edges 12, 16 is relatively small, and the distances between the vertical edges 12, 16 and other portions of the paper 10 are also reduced.
The paper 10 is next folded along the horizontal line 20 as shown by the arrows 40, 42, 44, 46 in FIG. 1B, which results in the configuration shown in FIG. 1C. In this configuration, the horizontal edges 18, 14 are positioned adjacent each other, as are the upper and lower portions of the vertical edges 12, 16.
It is, of course, not possible to fold a semiconductor substrate on which processing elements are fabricated in the manner in which the paper 10 can be folded as shown in FIGS. 1A–C. However, a similar effect can be achieved by spacing processing elements apart from each other so that processing elements on “overlapping” portions of the substrate can be interleaved with each other. For example, with reference to FIG. 1B, the processing elements between the vertical fold lines 26, 28 are spread out by one processing element. Processing elements in the substrate between the vertical fold line 26 and the vertical edge 16 and between the vertical fold line 28 and the vertical edge 12 are then positioned between the processing elements in the substrate between the fold lines 26, 28.
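The interleaving produced by folding about the two vertical fold lines can be sketched in one dimension. The model below is a simplification rather than the exact layout of the figures: it assumes a single row of `n` logical columns, with `n` a multiple of four and the fold lines at the quarter points. The function names are hypothetical.

```python
def folded_position(i: int, n: int) -> int:
    """Physical position of logical column i after both outer quarters fold inward."""
    if n % 4 != 0:
        raise ValueError("n must be a multiple of 4")
    q = n // 4
    if q <= i < 3 * q:
        # Middle half: stays in place and is spread out by one element.
        slot, layer = i - q, 0
    elif i < q:
        # Left quarter: mirrored about the left fold line.
        slot, layer = q - 1 - i, 1
    else:
        # Right quarter: mirrored about the right fold line.
        slot, layer = 5 * q - 1 - i, 1
    return 2 * slot + layer


def longest_link_folded(n: int) -> int:
    """Longest physical distance between logical neighbors, including wraparound."""
    return max(
        abs(folded_position(i, n) - folded_position((i + 1) % n, n))
        for i in range(n)
    )


print([folded_position(i, 8) for i in range(8)])  # [3, 1, 0, 2, 4, 6, 7, 5]
print(longest_link_folded(16))  # 2
```

In this sketch the two logical edge columns land near the physical center, and no pair of logical neighbors is more than two positions apart, regardless of `n`.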
When the substrate is “folded” as shown in FIG. 1C, the processing elements are again interleaved. However, since there are now four layers of “substrate,” the processing elements extending from one side of the folded substrate shown in FIG. 1C must be interleaved by three processing elements. Since processing elements that are logically adjacent to each other before “folding” are now physically separated from each other by three processing elements, processing element interconnections are coupled to every fourth processing element.
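The relationship between the number of folded layers and the interconnection spacing can be sketched as follows. The sketch assumes that one element from each layer occupies each group of consecutive physical positions; the names `physical_index` and `interleave_geometry` are hypothetical.

```python
def physical_index(slot: int, layer: int, layers: int) -> int:
    """Position of the element in a given layer at a given slot along the fold."""
    return layers * slot + layer


def interleave_geometry(layers: int) -> tuple:
    """Return (stride, gap) between logically adjacent elements of one layer."""
    stride = physical_index(1, 0, layers) - physical_index(0, 0, layers)
    gap = stride - 1  # number of elements from other layers lying in between
    return stride, gap


print(interleave_geometry(2))  # (2, 1): spread out by one element
print(interleave_geometry(4))  # (4, 3): coupled to every fourth element
```

Under this assumption each additional fold doubles the number of layers and therefore doubles the stride of the interconnections.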
When an array of processing elements is conceptually “folded” as shown in FIG. 1C, it can still be logically accessed as if it were in its unfolded configuration shown in FIG. 1A. When folded as shown in FIG. 1C, the array has the physical topography shown in FIG. 2, which, in the interest of clarity, shows only some of the processing elements in the array. FIG. 2 also shows the physical location of registers 50, 52, 56, 58 that are positioned adjacent the edges 12, 14, 16, 18, respectively, of the array. When the array is folded about the horizontal line 20 as shown in FIG. 1B, processing elements 60a–n logically positioned below the line 20 are physically to the left of processing elements 62a–n logically positioned above the line 20 as shown in FIG. 2. The processing elements 62 logically positioned above the line and the processing elements 60 logically positioned below the line are physically connected to each other at the top, as shown in FIG. 2. The processing element 60a at the logical bottom of the array and the processing element 62a at the logical top of the array are positioned adjacent a common edge register 68. However, a separate edge register may be provided for the processing elements 60 logically positioned below the line 20 and a separate edge register may be provided for the processing elements 62 logically positioned above the line 20. In either case, at least one edge register 68 is provided for each column of the array.
As explained in greater detail below, FIG. 2 also shows a first row of processing elements 70a–c,n logically below the line 20, a first row of processing elements 72n,n−1,a logically above the line 20, a second row of processing elements 74a–c,n logically above the line 20, and a second row of processing elements 76a–c logically above the line 20 (the processing element 62d is also labeled 74b, and the processing element 60d is also labeled 76n−1). However, the processing elements 70, 74 are actually in the same logical row above the line 20, and the processing elements 72, 76 are in the same logical row below the line 20. The processing elements 70 logically extend rightwardly from the left edge of the logical array, and the processing elements 72 logically extend leftwardly from the right edge of the logical array.
A left edge register 80 physically positioned at the center of the physical array is logically positioned at the left edge of the logical array, i.e., adjacent the line 16, three processing elements above the line 20. A right edge register 82 is also physically positioned at the center of the substrate but is logically positioned at the right edge of the logical array, i.e., adjacent the line 12. The left edge register 80 is coupled to a processing element 70a, which, in turn, is coupled to a processing element 70b. The logical position of the processing element 70a is at the left edge of the logical array, i.e., adjacent the line 16, three processing elements above the line 20, and the logical position of the processing element 70b is one processing element to the right of the left edge, three processing elements above the line 20. The processing elements to which the registers 80, 82 are coupled are in different logical rows.
Similarly, the right edge register 82 is shown coupled to the processing elements 72a,n,n−1. The processing element 72a is one processing element to the left of the right edge of the logical array, three processing elements below the line 20. The processing element 72n is at the center of the logical array, and the processing element 72n−1 is to the right adjacent the processing element 72n, and both are logically positioned three processing elements below the horizontal line 20.
The concept illustrated in FIG. 1 and the topography shown in FIG. 2 have the advantage of minimizing the length of the longest path between processing elements. However, the topography shown in FIG. 2 has the disadvantage of making it difficult to perform operations using only a portion of an array of processing elements. For example, performing operations using only the processing elements logically positioned in the upper left quadrant shown in FIG. 1 involves processing elements that are physically spread throughout the substrate. More specifically, the processing elements logically in the upper left quadrant are interleaved with the processing elements in the lower left quadrant, and they are in rows that are interleaved with processing elements in the upper and lower right quadrants.
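The scattering of a logical sub-array can be illustrated with a simplified one-dimensional sketch of the fold about the horizontal line 20. It models only the row dimension, assumes an even number of rows `n`, and does not reproduce the full topography of FIG. 2; the function name `folded_row` is hypothetical.

```python
def folded_row(r: int, n: int) -> int:
    """Physical row of logical row r after folding about the horizontal center line."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    if r < n // 2:
        # Upper half: stays in place and is spread out by one row.
        return 2 * r
    # Lower half: mirrored about the center line and interleaved with the upper half.
    return 2 * (n - 1 - r) + 1


n = 8
upper_half = sorted(folded_row(r, n) for r in range(n // 2))
lower_half = sorted(folded_row(r, n) for r in range(n // 2, n))
print(upper_half)  # [0, 2, 4, 6]
print(lower_half)  # [1, 3, 5, 7]
```

In this model the rows of the logical upper half occupy every other physical row across nearly the full physical extent, so an operation confined to the logical upper half still involves elements spread over the whole folded array.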
Therefore, a need exists for a processing array topography that minimizes the length of the longest path between processing elements in an array, but does so in a manner that facilitates either partial or full use of the array.