1. Field of the Invention
This invention relates to the arts of signal processing, multi-processor architectures, and programmable logic.
2. Description of the Related Art
There are many applications of image and signal processing which require more microprocessing bandwidth than is available in a single processor at any given time. As microprocessors are improved and their operating speeds increase, application demands continue to meet or exceed the capability of a single processor. For example, there are certain size, weight and power requirements to be met by processor modules or cards which are deployed in military, medical and commercial end-use applications, such as a line replaceable unit (“LRU”) for use in a signal processing system onboard a military aircraft. These requirements typically limit a module or card to a maximum number of microprocessors and support circuits which may be incorporated onto the module due to the power consumption and physical packaging dimensions of the available microprocessors and their support circuits (memories, power regulators, bus interfaces, etc.).
As such, a given module design or configuration with a given number of processors operating at a certain execution speed will determine the total bandwidth and processing capability of the module for parallel and distributed processing applications such as image or signal processing. Thus, as a matter of practicality, it is determined whether a particular application can be ported to a specific module based upon these parameters. Any applications which cannot be successfully ported to the module, usually because they require a higher processing bandwidth than is available on the module, are implemented elsewhere, such as on mini-supercomputers.
As processor execution rates are increased, microprocessing system component integration is improved, and memory densities are improved, each successive multi-processor module is redesigned to incorporate a similar number of improved processors and support circuits. So, for example, a doubling of a processor speed may lead to the doubling of the processing bandwidth available on a particular module. This typically allows twice as many “copies” or instances of applications to be run on the new module than were previously executable by the older, lower bandwidth module. Further, the increase in processing bandwidth may allow a single module to run applications which were previously too demanding to be handled by a single, lower bandwidth module.
The architectural challenges of maximizing processor utilization, communication and organization on a multi-processor module remain constant, even though processors and their associated circuits and devices tend to increase in capability dramatically from year to year.
For many years, this led the military to design specialized multi-processor modules which were optimized for a particular application or class of applications, such as radar signal processing, infrared sensor image processing, or communications signal decoding. A module designed for one class of applications, such as a radar signal processing module, may not be suitable for use in another application, such as signal decoding, due to architecture optimizations for the one application which are detrimental to other applications.
In recent years, the military has adopted an approach of specifying and purchasing computing modules and platforms which are more general purpose in nature and useful for a wider array of applications in order to reduce the number of unique units being purchased. Under this approach, known as “Commercial-Off-The-Shelf” (“COTS”), the military may specify certain software applications to be developed or ported to these common module designs, thereby reducing the lifecycle cost of ownership of the modules.
This has given rise to a new market within the military hardware supply industry, with suppliers competing to develop and offer improved generalized multi-processor architectures which are capable of hosting a wide range of software applications. In order to develop an effective general hardware architecture for a multi-processor board for multiple applications, one first examines the common needs or nature of the array of applications. Most of these types of applications work on two-dimensional data. For example, in one application, the source data may represent a 2-D radar image, and in another application, it may represent a 2-D magnetic resonance image. Thus, it is common to break the data set into portions for processing by each microprocessor. Take an image which is represented by an array of data consisting of 128 rows and 128 columns of samples. When a feature recognition application is ported to a quad processor module, each processor may first be assigned to process 32 rows of data, and then to process 32 columns of data. In signal processing parlance, this change from row-wise to column-wise processing is known as “corner turning”. Corner turning is a characteristic of many algorithms and applications, and therefore is a common issue to be addressed in the interprocessor communications and memory arrangements for multi-processor boards and modules.
One microprocessor which has found widespread acceptance in the COTS market is the Motorola PowerPC™. Available modules may contain one, two, or even four PowerPC processors and support circuits. The four-processor modules, or “quad PowerPC” modules, are of particular interest to many military clients as they represent a maximum processing bandwidth capability in a single module.
Quad PowerPC board or module architectures on the market generally fall into three types: “shared memory”, “distributed memory” and “dual memory” architectures. These architectures, though, could be employed equally well with other types and models of processors, with the strengths and weaknesses of each architecture being largely independent of the processor chosen for the module.
One advantage of distributed memory architecture modules is that input data received at a central crossbar can be “farmed out” via local crossbars to multiple processor nodes that process the data in parallel. Quad PowerPC cards such as this are offered by companies such as CSP Inc., Mercury Computer Systems Inc., and Sky Computers Inc.
For example, during the first phase of processing a hypothetical two-dimensional (2-D) data set of 128 rows by 128 columns shown in TABLE 1 on a distributed memory quad processor card, a first set of 32 rows (rows 0–31) of data may be sent to a first processor node, a second set of 32 rows (rows 32–63) of data would be sent to a second processor node, a third set of 32 rows (rows 64 to 95) of data to the third processor node, and the fourth set of 32 rows (rows 96 to 127) of data to the fourth processor node. Then, in preparation for a second phase of processing data by columns, a corner turning operation is performed in which the first processor node would receive data for the first 32 columns, the second processor node would receive the data for the second 32 columns, and so forth.
TABLE 1
Example 128 × 128 Data Array

                              Column
Row     0      1      2      3      4      . . .  126    127
0       0xFE   0x19   0x46   0x72   0x7A   . . .  0x9C   0x4B
1       0x91   0x22   0x4A   0xA4   0xF2   . . .  0xBE   0xB3
2       0x9A   0x9C   0x9A   0x98   0x97   . . .  0x43   0x44
4       0x00   0x00   0x81   0x8F   0x8F   . . .  0x23   0x44
:       :      :      :      :      :      . . .  :      :
:       :      :      :      :      :      . . .  :      :
126     0x34   0x3A   0x36   0x35   0x45   . . .  0xFB   0xFA
127     0x75   0x87   0x99   0xF0   0xFE   . . .  0xFF   0xFA
Regardless of the type of bus used to interconnect the processor nodes, whether high speed parallel or serial, this architecture requires significant data movement during a corner turning operation, in which data that was initially needed for row processing by one processor node is transferred to another processor node for column processing. As such, the distributed memory architecture has a disadvantage with respect to the efficiency of performing corner turning. Corner turning on multi-processor modules of this architecture type consumes processing bandwidth to move the data from one processor node to another, bandwidth which cannot be used for other computations such as processing the data to extract features or performing filtering algorithms.
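By way of illustration only, the amount of data that a corner turn moves in the 128 × 128, four-node example may be estimated with the following minimal sketch. The function and variable names are hypothetical and are not taken from any vendor's software; the sketch assumes each node keeps the samples it already holds and receives only those it lacks.

```python
ROWS, COLS, NODES = 128, 128, 4      # sizes taken from the example in the text
ROWS_PER_NODE = ROWS // NODES        # 32 rows per node in the row phase
COLS_PER_NODE = COLS // NODES        # 32 columns per node in the column phase


def row_owner(row):
    """Node that holds this row during the first (row) phase."""
    return row // ROWS_PER_NODE


def col_owner(col):
    """Node that needs this column during the second (column) phase."""
    return col // COLS_PER_NODE


def count_corner_turn_transfers():
    """Count samples that must cross from one node to another."""
    moved = 0
    for row in range(ROWS):
        for col in range(COLS):
            # A sample moves only if its row-phase holder differs from
            # the node that will process its column.
            if row_owner(row) != col_owner(col):
                moved += 1
    return moved


moved = count_corner_turn_transfers()
total = ROWS * COLS
print(f"{moved} of {total} samples move between nodes ({moved / total:.0%})")
```

Under these assumptions, each node retains only the 32 × 32 block it already holds, so three quarters of the array crosses the interconnect during the corner turn.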
Turning to the second architecture type commonly available in the COTS market, the advantage of shared memory architectures is that all data resides in one central memory. COTS modules having architectures such as this are commonly available from Thales Computers Corp., DNA Computing Solutions Inc., and Synergy Microsystems. In these types of systems, several processor nodes may operate on data stored in a global memory, such as via bridges between processor-specific buses to a standard bus (PowerPC bus to Peripheral Component Interconnect “PCI” bus in this example).
The bridges are responsible for arbitrating simultaneous attempts to access the global memory from the processor nodes. Additionally, common modules available today may provide expansion slots or daughterboard connectors such as PCI Mezzanine Card (PMC) sites, which may also provide data access to the global memory. This architecture allows for “equal access” to the global data store, including by any processors present on the expansion sites, and thus eases the decisions made when assigning portions of a large application to specific processor nodes, because each “job” to be ported runs equally well on any of the processor nodes.
Due to the centralized memory in this architecture, corner turning can be performed by addressing the shared memory with a pointer that increments by one when processing row data, and increments by the number of data samples in a row when processing column data. This avoids the need to ship or move data from one processor node to another following initial row-data processing, and thereby eliminates wasted processor cycles moving that data.
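The pointer arithmetic described above may be sketched as follows. This is a minimal illustration which assumes that the array is stored in row-major order in a single linear memory; the names `read_row` and `read_column` are hypothetical.

```python
ROWS, COLS = 128, 128                  # array size from the example in the text
memory = list(range(ROWS * COLS))      # stand-in for the shared memory (assumed row-major)


def read_row(mem, row):
    """Walk one row: start at the row's first sample, step the pointer by 1."""
    ptr = row * COLS
    out = []
    for _ in range(COLS):
        out.append(mem[ptr])
        ptr += 1
    return out


def read_column(mem, col):
    """Walk one column: start at the column's first sample, step by COLS."""
    ptr = col
    out = []
    for _ in range(ROWS):
        out.append(mem[ptr])
        ptr += COLS                    # one full row of samples per step
    return out


row0 = read_row(memory, 0)
col5 = read_column(memory, 5)
print(row0[:4])    # [0, 1, 2, 3]
print(col5[:4])    # [5, 133, 261, 389]
```

No sample is copied or relocated between the two phases; only the pointer increment changes.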
However, the disadvantage of this arrangement is that all processors must access data from the same shared memory, which often leads to a “memory bottleneck” that slows execution times due to some processor node requests being arbitrated, i.e. forced to wait, while another processor accesses the global memory. Thus, what was gained in eliminating the wasted processor cycles for moving data from node to node may be lost to wait states or polling loops caused by arbitration logic for accesses to shared memory.
Another multiprocessor architecture commonly found in modules available on the COTS market is the dual memory architecture, which is designed to utilize the best features of distributed and shared memory architectures, to facilitate fast processing and reduce corner turning overhead. Both memory schemes are adopted, providing the module with a global memory accessible by all processor nodes, and local memory for each processor or subset of processor nodes. This addresses the arbitration losses in accessing a single shared global memory by allowing a processor node to move or copy data which is needed for intense accesses from global memory to local memory. Data which is not so intensely needed by a processor is left in the global memory, which reduces the overhead costs associated with corner turning. D 4 Systems offers a module having an architecture such as this.
Most modern processors have increased their internal clock rate and computational capabilities per clock (or per cycle) faster than their ability to accept the data they need to process. In other words, most modern processors can now process data faster than they can read or write the data to be processed due to I/O speed limitations on busses and memory devices.
As a result, “operations/second” is no longer the chief concern when determining whether a particular processor or processor node is capable of executing a particular application. This concern has been replaced by data movement bandwidth as the driving consideration in measuring the performance of single processors, processor nodes and arrays of processors. TABLE 2 summarizes data movement capabilities of several currently available distributed architecture boards, including the Race++™ from Mercury Computer Systems Inc., the SkyBolt II™ from Sky Computers Inc., and the Myranet 2841™ from CSP Inc.
TABLE 2
Summary of Data Movement Capabilities
for Available Multi-processor Modules

Movement Endpoints         Race++      SkyBolt II   Myranet
Processor to Local Mem     1064 * 4    666 * 4      480 * 4
Node to Node               267 * 2     320          480 * 4
Module I/O                 267 * 2     320          480 * 4
As can be seen from this comparison, each architecture has strong points and weak points. For example, the Race++™ and SkyBolt II™ architectures have at least about twice the performance for processor-to-local-memory data movement as for node-to-node or module I/O data movement. For applications which utilize local memory heavily and do not need intense node-to-node movement or board I/O data flow, these may be adequate. However, this imbalance among data movement paths can eliminate these two boards from candidacy for many applications. In contrast, the Myranet™ board has a good balance between the data movement paths, but at the cost of efficient local memory accesses. For example, the Myranet™ board appears to be approximately 50% faster than the SkyBolt II™ at transferring data in and out of the module and between nodes (480 versus 320), but 28% slower at accessing local memory (480 versus 666).
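The percentages quoted above may be checked against the base figures of TABLE 2 with the following minimal sketch. It is offered only as an illustration: the dictionary layout and the helper `percent_change` are hypothetical, the source does not state units for the figures, and the multipliers shown in the table (such as “* 4”) are deliberately left out so that only the base figures are compared.

```python
# Base per-path figures as listed in TABLE 2 (units not stated in the source;
# the "* 4" and "* 2" multipliers are omitted in this comparison).
table2 = {
    "Race++":     {"local_mem": 1064, "node_to_node": 267, "module_io": 267},
    "SkyBolt II": {"local_mem": 666,  "node_to_node": 320, "module_io": 320},
    "Myranet":    {"local_mem": 480,  "node_to_node": 480, "module_io": 480},
}


def percent_change(new, old):
    """Signed change of `new` relative to `old`, in percent."""
    return (new - old) / old * 100.0


sky = table2["SkyBolt II"]
myr = table2["Myranet"]

io_gain = percent_change(myr["module_io"], sky["module_io"])
mem_loss = percent_change(myr["local_mem"], sky["local_mem"])
print(f"Myranet vs SkyBolt II, module I/O:   {io_gain:+.0f}%")
print(f"Myranet vs SkyBolt II, local memory: {mem_loss:+.0f}%")

# Ratio of local-memory figure to node-to-node figure for each board,
# as a simple measure of how balanced the data movement paths are.
for name, paths in table2.items():
    ratio = paths["local_mem"] / paths["node_to_node"]
    print(f"{name}: local memory / node-to-node = {ratio:.1f}")
```

A ratio near 1.0 indicates balanced data movement paths; larger ratios indicate a board that favors local memory accesses over inter-node and module I/O transfers.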
The related patent application established that there is a need in the art for a multiprocessor architecture for distributed and parallel processing of data which provides optimal data transfer performance between processors and their local memories, from processor to processor, and from processors to module inputs and outputs. In particular, there is a need in the art for this new arrangement to provide maximum performance when accessing local memory as well as nominal performance across other data transfer paths. Further, the related application established that there is a need in the art for this new architecture to be useful and advantageous for realization with any high speed microprocessor family or combination of microprocessor models, and especially those which are commonly used for control or signal processing applications and which exhibit I/O data transfer constraints relative to processing bandwidth. The invention described in the related patent application addressed these needs, and is summarized in the following paragraphs.
The invention of the related patent application utilized a programmable logic array in a key position between each microprocessor node and its memory, and provided functionality to allow each microprocessor in the multiprocessor array to access memory associated with another microprocessor in the array.
In order to maximize the capabilities of the related invention, it was desirable to extend the functionality of the multiprocessor array to utilize the programmable logic arrays to actually perform some level of processing, and especially signal processing, on the data stored in the processor memories and the data which flows through the logic array.
Programmable logic device suppliers such as Xilinx have promoted use of their devices to perform signal processing functions in hardware rather than using the traditional software or microprocessor-based firmware solutions. Thus, the combination of the location of the programmable logic in the topology of the invention disclosed in the related patent application and the availability of signal processing “macros” and designs for programmable logic produces an opportunity to embed signal processing in the new multiprocessor topology, thereby increasing the density of functionality and capability of the new architecture.