The present invention generally relates to computer systems and, more specifically, to a dynamic address mapping technique for a symmetric multiprocessor system.
A systolic array provides a common approach for increasing processing capacity of a computer system when a problem can be partitioned into discrete units of work. In the case of a one dimensional (1-D) systolic array comprising a single “row” of processing elements or processors, each processor in the array is responsible for executing a distinct set of instructions on input data before passing it to a next element of the array. To maximize throughput, the problem is divided such that each processor requires approximately the same amount of time to complete its portion of the work. In this way, new input data can be “pipelined” into the array at a rate equivalent to the processing time of each processor, with as many units of input data being processed in parallel as there are processors in the array. Performance can be improved by adding more elements to the array as long as the problem can continue to be divided into smaller units of work. Once this dividing limit has been reached, processing capacity may be further increased by configuring multiple 1-D rows or pipelines in parallel, with new input data allocated to the first processor of a next pipeline of the array in sequence.
In a symmetric multiprocessor system configured as a multidimensional systolic array, processors in the same position (“column”) of each pipeline execute the same instructions on their input data. For a large class of applications including data networking, the processors in the same column access the same data structures. For example, a common table indicating a data communication queue must be accessible by all processors in the same column since it is not possible to know in advance which pipeline has the correct table for this input data. Therefore, access to a common memory is required among the processors of the same column.
To avoid contention and thus stalling by the processors, accesses to the common memory are scheduled. Since each processor of a column executes the same instruction code and therefore accesses the same tables in memory, the pipelines of the array are “skewed”. In this context, skewing denotes configuring the array such that a first processor of a first pipeline finishes accessing a particular memory just as a second processor of a second pipeline starts to access the same memory. Skewing may be realized by loading new input data into each pipeline of the array in sequence. In this way, the minimum time between input data, and therefore maximum system throughput, is bounded by the time a particular processor, such as the first processor, in a column consumes (“ties up”) a particular memory resource that the second processor in the same column requires.
For an application utilizing low cost, high-density synchronous dynamic random access memory (SDRAM) resources, the granularity of memory resource contention is typically a bank. A typical SDRAM module has four (4) banks, each containing a fixed one-quarter of the total memory. When a bank is accessed, it cannot be accessed again for a certain period of time (e.g., 7 cycles at 100 MHz). The SDRAM memory resource can support overlapping accesses to each of its banks, where new accesses can be issued every 2 cycles, but only one access at a time per bank is possible. More banks can be added by providing more SDRAM modules to allow more simultaneous accesses to different locations in the memory, but the time needed to access a table within a bank still dictates the maximum throughput of the multiprocessor array. At lower speeds such as 100 MHz, the minimum time for accessing a typical entry in a table is approximately seven (7) cycles (for a maximum system throughput of 14.3 million data units per second). However, to access relatively long table entries (e.g., entries containing multiple words), the time that a bank is tied up increases, thereby directly decreasing system throughput.
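The throughput figure quoted above follows from dividing the clock rate by the number of cycles a bank is tied up per access. The following minimal sketch reproduces that arithmetic; the function name `max_throughput` is hypothetical and chosen only for illustration.

```python
def max_throughput(clock_hz, bank_busy_cycles):
    """Upper bound on data units per second when each data unit
    ties up a bank for bank_busy_cycles clock cycles."""
    return clock_hz / bank_busy_cycles


# Figures from the text: a 100 MHz clock and a bank tied up for 7 cycles.
rate = max_throughput(100_000_000, 7)
print(f"{rate / 1e6:.1f} million data units per second")  # 14.3 million data units per second
```

Under the same model, any increase in the number of cycles a bank is tied up (as with longer table entries) lowers the bound proportionally.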
Therefore, an object of the present invention is to provide a technique that increases throughput in a multidimensional systolic array having SDRAM memory module resources.
Another object of the present invention is to provide a technique that enables fast and efficient accesses by processors of a symmetric multiprocessor system to contiguous storage locations of a memory resource.
The present invention comprises a dynamic address mapping technique that eliminates contention to memory resources of a symmetric multiprocessor system having a plurality of processors arrayed as a processing engine. The inventive technique defines two logical-to-physical address mapping modes that may be simultaneously provided to the processors of the arrayed processing engine to thereby present a single contiguous address space for accessing individual memory locations, as well as a plurality of memory locations organized as a “memory string”, within the memory resources. As described herein, these addressing modes include a bank select mode and a stream mode.
According to an aspect of the invention, the bank select mode uses high-order address bits to select a bank of a memory resource for access. A data structure, such as a table having relatively short entries, is placed within a single bank of memory and addressed using the bank select mode. Assume that the bank is “tied up” for 7 cycles during an access to a single location in the table memory. A first processor in a first pipeline of the arrayed processing engine can access a random location within this table at absolute time N. As long as the skew between pipelines is as large as the time that the bank is tied up for a single access (i.e., 7 cycles), a second processor in the same column of a second pipeline can execute the same instructions (skewed by the 7 cycles). In this case, the second processor may access the same or a different location within the table (and bank) at time N+7 without contending with the first processor. That is, the first processor in the first pipeline may be busy accessing a different table in another bank at time N+7.
On the other hand, the stream mode uses low-order address bits to select a bank within a memory resource. Here, the data structure is preferably a table having relatively long entries, each containing words that are accessed over a plurality of cycles. According to this aspect of the present invention, the long entries are spread across successive banks and stream mode addressing functions to map each successive word to a different bank. By defining the table entry width as a multiple of the access width times the number of banks, contentions can be eliminated.
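The difference between the two addressing modes can be illustrated with a minimal sketch of the two logical-to-physical mappings. The sketch assumes word addressing, four banks, and a bank size of 2^19 words; these values and the function names `bank_select_map` and `stream_map` are illustrative assumptions, not details taken from the description.

```python
NUM_BANKS = 4                  # four banks, as in the typical SDRAM module described
BANK_BITS = 2                  # log2(NUM_BANKS)
OFFSET_BITS = 19               # assumed bank size of 2**19 words
BANK_WORDS = 1 << OFFSET_BITS


def bank_select_map(addr):
    """Bank select mode: high-order address bits choose the bank."""
    bank = (addr >> OFFSET_BITS) & (NUM_BANKS - 1)
    offset = addr & (BANK_WORDS - 1)
    return bank, offset


def stream_map(addr):
    """Stream mode: low-order address bits choose the bank."""
    bank = addr & (NUM_BANKS - 1)
    offset = addr >> BANK_BITS
    return bank, offset


# Four consecutive words of one table entry, starting at logical address 8.
entry = range(8, 12)
print([bank_select_map(a)[0] for a in entry])  # [0, 0, 0, 0]
print([stream_map(a)[0] for a in entry])       # [0, 1, 2, 3]
```

In this sketch, consecutive words stay in one bank under bank select mode, whereas stream mode places each successive word in the next bank, which is what allows a long entry to be read without revisiting a bank that is still tied up.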
For example, a processor of a first pipeline can access a first word of a random entry from a table resident in Bank 0 at absolute time N; that processor may then access a second word of the same entry from Bank 1 at time N+7. This process may continue with the processor “seeing” the entire entry as a contiguous address space. A corresponding processor of a next pipeline is skewed by 7 cycles and can execute the same instructions for accessing the same or different entry from the same table. Here, a first word is accessed from Bank 0 at time N+7, a second word is accessed from Bank 1 at time N+14, etc., without contention. It should be noted that the time between accesses to different banks can be as low as 2 cycles and is unrelated to the time that a bank is tied up (e.g., 7 cycles). In this staggered configuration, the processor of the first pipeline can access Bank 1 as early as N+2 rather than N+7; likewise, the processor in the next pipeline may access Bank 1 as early as N+9 rather than N+14.
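The timing in this example can be checked with a small simulation. The sketch below is a simplified model under stated assumptions: four pipelines, entries of four words that begin in Bank 0, absolute time N taken as cycle 0, and a bank that is unavailable for 7 cycles after each access. The function names `schedule` and `find_conflicts` are hypothetical.

```python
def schedule(num_pipelines, words_per_entry, skew, issue_interval, num_banks=4):
    """Return sorted (cycle, bank, pipeline) accesses for stream mode reads.

    Pipeline p starts at cycle p * skew; word w of the entry is issued
    w * issue_interval cycles later and lands in bank w % num_banks.
    """
    accesses = []
    for p in range(num_pipelines):
        for w in range(words_per_entry):
            accesses.append((p * skew + w * issue_interval, w % num_banks, p))
    return sorted(accesses)


def find_conflicts(accesses, busy_cycles=7):
    """List accesses that hit a bank fewer than busy_cycles after its previous access."""
    last_access = {}
    conflicts = []
    for cycle, bank, pipeline in accesses:
        if bank in last_access and cycle - last_access[bank] < busy_cycles:
            conflicts.append((cycle, bank, pipeline))
        last_access[bank] = cycle
    return conflicts


# Words issued 7 cycles apart (Bank 0 at N, Bank 1 at N+7, ...), pipelines skewed by 7.
print(find_conflicts(schedule(4, 4, skew=7, issue_interval=7)))  # []
# Staggered case: words issued 2 cycles apart (Bank 1 at N+2), same skew.
print(find_conflicts(schedule(4, 4, skew=7, issue_interval=2)))  # []
```

In both cases each bank is revisited exactly 7 cycles after its previous access, so the model reports no contention. Reducing the skew below the 7-cycle bank busy time in this model produces conflicts, which is consistent with the skew requirement stated for bank select mode.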
In the illustrative embodiment, the two types of tables (i.e., one with short entries and the other with long entries) may reside within the same physical memory. For example, if the bank size is 2 MB, the lower portion of each bank can be used for holding tables to be accessed in bank select mode, whereas the upper portion of each bank can be used for holding tables to be accessed in stream mode. If a bank select mode table is larger than a reserved size within a bank, additional address mapping can be performed to make it appear contiguous, if required. The processor may dynamically, on a per instruction basis, indicate the address mapping mode by way of a special field in an opcode or by assertion of a bit within an address generated by the processor.
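One way to picture the per-access mode selection and the split of each bank into two regions is the following minimal sketch. It assumes word addressing, four banks, an even split of each bank into a lower bank select region and an upper stream region of 2^18 words each, and a mode bit placed directly above the logical address bits. All of these layout choices, and the name `translate`, are assumptions made for illustration rather than details of the described embodiment.

```python
NUM_BANKS = 4
BANK_BITS = 2                          # log2(NUM_BANKS)
REGION_BITS = 18                       # assumed: each bank holds two regions of 2**18 words
REGION_WORDS = 1 << REGION_BITS
MODE_BIT = REGION_BITS + BANK_BITS     # assumed position of the mode bit in the address


def translate(addr):
    """Map a processor address to (bank, offset within bank).

    Mode bit clear: bank select mode, lower region of each bank.
    Mode bit set:   stream mode, upper region of each bank.
    """
    stream_mode = (addr >> MODE_BIT) & 1
    logical = addr & ((1 << MODE_BIT) - 1)
    if stream_mode:
        bank = logical & (NUM_BANKS - 1)                 # low-order bits select the bank
        offset = REGION_WORDS + (logical >> BANK_BITS)   # upper portion of the bank
    else:
        bank = logical >> REGION_BITS                    # high-order bits select the bank
        offset = logical & (REGION_WORDS - 1)            # lower portion of the bank
    return bank, offset


STREAM = 1 << MODE_BIT
print(translate(REGION_WORDS + 5))  # (1, 5)
print(translate(STREAM | 9))        # (1, 262146)
```

Because the mode is carried in the address itself in this sketch, the same load instruction can reach either kind of table; the alternative mentioned in the text, a special field in the opcode, would move that selection out of the address.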
Advantageously, the inventive technique provides a means for eliminating contention among synchronized parallel processors for various table organizations in a manner that achieves maximum system throughput. Bank select addressing mode, by itself, has severe contention problems for direct long accesses and incurs additional processor overhead if a processor must apportion long accesses into multiple shorter accesses. Moreover, an address mapping technique that exclusively uses low-order address bits to select a bank is non-deterministic, results in contention, and cannot be tuned.
While software may perform mapping to the banks via processor instructions, the novel address mapping technique has the advantage that it is transparent to real-time software and eliminates processor instructions that may be wasted by dynamically computing non-contiguous memory addresses. The technique also enables use of a single processor instruction that specifies the amount of data to be read or written, which is not possible if software must perform the address mapping for each access. In addition, the technique supports tables of various widths that can be accessed directly by software without reducing system throughput and without requiring system software to dynamically compute memory addresses for what should be viewed as contiguous memory. The inventive technique also allows the use of SDRAM memory resources where significantly more expensive, and lower density, SSRAM might otherwise be required.