1. Field of the Invention
The present invention relates generally to computer architecture, and more particularly to systems and methods for reconfigurable computing. Still more particularly, the present invention is a system and method for scalable, parallel, dynamically reconfigurable computing.
2. Description of the Background Art
The evolution of computer architecture is driven by the need for ever-greater computational performance. Rapid, accurate solution of different types of computational problems typically requires different types of computational resources. For a given range of problem types, computational performance can be enhanced through the use of computational resources that have been specifically architected for the problem types under consideration. For example, the use of Digital Signal Processing (DSP) hardware in conjunction with a general-purpose computer can significantly enhance certain types of signal processing performance. In the event that a computer itself has been specifically architected for the problem types under consideration, computational performance will be further enhanced, or possibly even optimized relative to the available computational resources, for these particular problem types. Current parallel and massively-parallel computers, offering high performance for specific types of problems of O(n^2) or greater complexity, provide examples in this case.
The need for greater computational performance must be balanced against the need to minimize system cost and the need to maximize system productivity in the widest possible range of both current-day and possible future applications. In general, the incorporation of computational resources dedicated to a limited number of problem types into a computer system adversely affects system cost because specialized hardware is typically more expensive than general-purpose hardware. The design and production of an entire special-purpose computer can be prohibitively expensive in terms of both engineering time and hardware costs. The use of dedicated hardware to increase computational performance may offer few performance benefits as computational needs change. In the prior art, as computational needs have changed, new types of specialized hardware or new special-purpose systems have been designed and manufactured, resulting in an ongoing cycle of undesirably large nonrecurring engineering costs. The use of computational resources dedicated to particular problem types therefore results in an inefficient use of available system Silicon when considering changing computational needs. Thus, for the reasons described above, attempting to increase computational performance using dedicated hardware is undesirable.
In the prior art, various attempts have been made to both increase computational performance and maximize problem type applicability using reprogrammable or reconfigurable hardware. A first such prior art approach is that of downloadable microcode computer architectures. In a downloadable microcode architecture, the behavior of fixed, nonreconfigurable hardware resources can be selectively altered by using a particular version of microcode. An example of such an architecture is that of the IBM System/360. Because the fundamental computational hardware in such prior art systems is not itself reconfigurable, such systems do not provide optimized computational performance when considering a wide range of problem types.
A second prior art approach toward both increasing computational performance and maximizing problem type applicability is the use of reconfigurable hardware coupled to a nonreconfigurable host processor or host system. This prior art approach most commonly involves the use of one or more reconfigurable co-processors coupled to a nonreconfigurable host. This approach can be categorized as an "Attached Reconfigurable Processor" (ARP) architecture, where some portion of hardware within a processor set attached to a host is reconfigurable. Examples of present-day ARP systems that utilize a set of reconfigurable processors coupled to a host system include: the SPLASH-1 and SPLASH-2 systems, designed at the Supercomputing Research Center (Bowie, Md.); the WILDFIRE Custom Configurable Computer produced by Annapolis Micro Systems (Annapolis, Md.), which is a commercial version of the SPLASH-2; and the EVC-1, produced by the Virtual Computer Corporation (Reseda, Calif.). In most computation-intensive problems, significant amounts of time are spent executing relatively small portions of program code. In general, ARP architectures are used to provide a reconfigurable computational accelerator for such portions of program code. Unfortunately, a computational model based upon one or more reconfigurable computational accelerators suffers from significant drawbacks, as will be described in detail below.
A first drawback of ARP architectures arises because ARP systems attempt to provide an optimized implementation of a particular algorithm in reconfigurable hardware at a particular time. The philosophy behind Virtual Computer Corporation's EVC-1, for example, is the conversion of a specific algorithm into a specific configuration of reconfigurable hardware resources to provide optimized computational performance for that particular algorithm. Reconfigurable hardware resources are used for the sole purpose of providing optimum performance for a specific algorithm. The use of reconfigurable hardware resources for more general purposes, such as managing instruction execution, is avoided. Thus, for a given algorithm, reconfigurable hardware resources are considered from the perspective of individual gates coupled to ensure optimum performance.
Certain ARP systems rely upon a programming model in which a "program" includes both conventional program instructions as well as special-purpose instructions that specify how various reconfigurable hardware resources are interconnected. Because ARP systems consider reconfigurable hardware resources in a gate-level algorithm-specific manner, these special-purpose instructions must provide explicit detail as to the nature of each reconfigurable hardware resource used and the manner in which it is coupled to other reconfigurable hardware resources. This adversely affects program complexity. To reduce program complexity, attempts have been made to utilize a programming model in which a program includes both conventional high-level programming language instructions as well as high-level special-purpose instructions. Current ARP systems therefore attempt to utilize a compiling system capable of compiling both high-level programming language instructions and the aforementioned high-level special-purpose instructions. The target output of such a compiling system is assembly-language code for the conventional high-level programming language instructions, and Hardware Description Language (HDL) code for the special-purpose instructions. Unfortunately, the automatic determination of a set of reconfigurable hardware resources and an interconnection scheme to provide optimal computational performance for any particular algorithm under consideration is an NP-hard problem. A long-term goal of some ARP systems is the development of a compiling system that can compile an algorithm directly into an optimized interconnection scheme for a set of gates. The development of such a compiling system, however, is an exceedingly difficult task, particularly when considering multiple types of algorithms.
A second shortcoming of ARP architectures arises because an ARP apparatus distributes the computational work associated with the algorithm for which it is configured across multiple reconfigurable logic devices. For example, for an ARP apparatus implemented using a set of Field Programmable Gate Arrays (FPGAs) and configured to implement a parallel multiplication accelerator, the computational work associated with parallel multiplication is distributed across the entire set of FPGAs. Therefore, the size of the algorithm for which the ARP apparatus can be configured is limited by the number of reconfigurable logic devices present. The maximum data-set size that the ARP apparatus can handle is similarly limited. An examination of source code does not necessarily provide a clear indication of the limitations of the ARP apparatus because some algorithms may have data dependencies. In general, data-dependent algorithms are avoided.
Furthermore, because ARP architectures teach the distribution of computational work across multiple reconfigurable logic devices, accommodation of a new (or even slightly modified) algorithm requires that reconfiguration be done en masse, that is, multiple reconfigurable logic devices must be reconfigured. This limits the maximum rate at which reconfiguration can occur for alternative problems or cascaded subproblems.
A third drawback of ARP architectures arises from the fact that one or more portions of program code are executed on the host. That is, an ARP apparatus is not an independent computing system in itself, the ARP apparatus does not execute entire programs, and therefore interaction with the host is required. Because some program code is executed upon the nonreconfigurable host, the set of available Silicon resources is not maximally utilized over the time-frame of the program's execution. In particular, during host-based instruction execution, Silicon resources upon the ARP apparatus will be idle or inefficiently utilized. Similarly, when the ARP apparatus operates upon data, Silicon resources upon the host will, in general, be inefficiently utilized. In order to readily execute multiple entire programs, Silicon resources within a system must be grouped into readily reusable resources. As previously described, ARP systems treat reconfigurable hardware resources as a set of gates optimally interconnected for the implementation of a particular algorithm at a particular time. Thus, ARP systems do not provide a means for treating a particular set of reconfigurable hardware resources as a readily reusable resource from one algorithm to another because reusability requires a certain level of algorithmic independence.
An ARP apparatus cannot treat its currently-executing host program as data, and in general cannot contextualize itself. An ARP apparatus could not readily be made to simulate itself through the execution of its own host programs. Furthermore, an ARP apparatus could not be made to compile its own HDL or application programs upon itself, directly using the reconfigurable hardware resources from which it is constructed. An ARP apparatus is thus architecturally limited in relation to self-contained computing models that teach independence from a host processor.
Because an ARP apparatus functions as a computational accelerator, it in general is not capable of independent Input/Output (I/O) processing. Typically, an ARP apparatus requires host interaction for I/O processing. The performance of an ARP apparatus may therefore be I/O limited. Those skilled in the art will recognize that an ARP apparatus can, however, be configured for accelerating a specific I/O problem. However, because the entire ARP apparatus is configured for a single, specific problem, an ARP apparatus cannot balance I/O processing with data processing without compromising one or the other. Moreover, an ARP apparatus provides no means for interrupt processing. ARP teachings offer no such mechanism because they are directed toward maximizing computational acceleration, and interruption negatively impacts computational acceleration.
A fourth drawback of ARP architectures exists because there are software applications that possess inherent data parallelism that is difficult to exploit using an ARP apparatus. HDL compilation applications provide one such example when net-name symbol resolution in a very large netlist is required.
A fifth drawback associated with ARP architectures is that they are essentially a SIMD computer architecture model. ARP architectures are therefore less effective architecturally than one or more innovative prior art nonreconfigurable systems. ARP systems mirror only a portion of the process of executing a program, chiefly, the arithmetic logic for arithmetic computation, for each specific configuration instance, for as much computational power as the available reconfigurable hardware can provide. In contradistinction, in the system design of the SYMBOL machine at Fairchild in 1971, the entire computer used a unique hardware context for every aspect of program execution. As a result, SYMBOL encompassed every element for the system application of a computer, including the host portion taught by ARP systems.
ARP architectures exhibit other shortcomings as well. For example, an ARP apparatus lacks an effective means for providing independent timing to multiple reconfigurable logic devices. Similarly, cascaded ARP apparatus lack an effective clock distribution means for providing independently-timed units. As another example, it is difficult to accurately correlate execution time with the source code statements for which acceleration is attempted. For an accurate estimate of net system clock rate, the ARP device must be modeled with a Computer-Aided Design (CAD) tool after HDL compilation, a time-consuming process for arriving at such a basic parameter.
An equally significant problem with conventional architectures is their use of virtual or shared memory. This teaching of using a unified address space results in slower, less efficient memory access due to the more complicated addressing operations required. For example, in order to access individual bits in the memory device of a system using virtual memory, the physical address space of the memory must be first segmented into logical addresses, and then virtual addresses must be mapped onto the logical addresses. Only then may the bits in the memory be accessed. Additionally, in shared memory systems the processor typically performs address validation operations prior to allowing access to the memory, further complicating the memory operation. Finally, the processor must arbitrate between multiple processes attempting to access the same area of memory at the same time by providing some type of prioritization system.
To address the myriad problems caused by the use of shared and virtual memory, many conventional systems use memory management units (MMUs) to perform the majority of the memory management functions, such as converting logical addresses to virtual addresses. However, the MMU/software interaction adds yet another degree of complexity to the memory accessing operation. Additionally, MMUs are quite limited in the types of operations which they can perform. They cannot handle interrupts, queue messages, or perform sophisticated addressing operations, all of which must be performed by the processor. When shared or virtual memory systems are employed in a computer architecture which has multiple parallel processors, the above-described defects are magnified. Not only must the hardware/software interactions be managed as described above, but the coherence and consistency of the data in the memory must also be maintained by both software and hardware in response to multiple processors attempting to access the shared memory. The addition of more processors increases the difficulty of the virtual address to logical address conversion. These complications in the memory accessing operation necessarily degrade system performance; this degradation only increases as the system grows larger and more processors are added.
One example of a conventional system is the cache-coherent, Non-Uniform Memory Access (ccNUMA) computer architecture. The ccNUMA machines use complex and costly hardware, such as cache controllers and crossbar switches, to maintain for each independent CPU the illusion of a single address space even though the memory is actually shared by multiple processors. The ccNUMA is moderately scalable, but achieves this scalability by the use of the additional hardware to achieve tight coupling of the processors in its system. This type of system is more advantageously used in a computing environment in which a single program image is being shared, where shared memory I/O operations have very large bandwidth requirements, such as for finite element grids in scientific computing. Further, the ccNUMA is not useful for systems in which processors are not similar in nature. The ccNUMA architecture requires that each processor added be of the same type as the existing processors. In a system in which processors are optimized to serve different functions, and therefore operate differently from each other, the ccNUMA architecture does not provide an effective solution. Finally, in conventional systems, only the standard memory addressing schemes are used to address memory in the system.
What is needed is a means for addressing memory in a parallel computing environment which provides for scalability, transparent addressing, and which has a minimal impact on the processing power of the system.
The present invention is a system and method for scalable, parallel, dynamically reconfigurable computing. The system comprises at least one S-machine, a T-machine corresponding to each S-machine, a General-Purpose Interconnect Matrix (GPIM), a set of I/O T-machines, one or more I/O devices, and a master time-base unit. In the preferred embodiment, the system includes multiple S-machines. Each S-machine has an input and an output coupled to an output and an input of a corresponding T-machine, respectively. Each T-machine includes a routing input and a routing output coupled to the GPIM, as does each I/O T-machine. An I/O T-machine further includes an input and an output coupled to an I/O device. Finally, each S-machine, T-machine, and I/O T-machine has a master timing input coupled to a timing output of the master time-base unit.
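The couplings described above can be pictured with a small software model. The following is only an illustrative sketch, not the disclosed hardware: the class names, the `build_system` helper, and the representation of the GPIM as a simple list of attached ports are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class TimeBase:
    """Stand-in for the master time-base unit (hypothetical model)."""
    name: str = "master"


@dataclass(eq=False)
class SMachine:
    ident: int
    timing: TimeBase                      # master timing input
    t_machine: Optional["TMachine"] = None


@dataclass(eq=False)
class TMachine:
    ident: int
    timing: TimeBase                      # master timing input
    s_machine: Optional[SMachine] = None


@dataclass(eq=False)
class IOTMachine:
    ident: int
    timing: TimeBase                      # master timing input
    device: str                           # the coupled I/O device


@dataclass(eq=False)
class GPIM:
    """Interconnect modeled only as the list of machines routed through it."""
    ports: List[object] = field(default_factory=list)


def build_system(num_s_machines: int, io_devices: List[str]):
    """Wire one T-machine per S-machine and one I/O T-machine per device."""
    if num_s_machines < 1:
        raise ValueError("the system comprises at least one S-machine")
    clock = TimeBase()
    gpim = GPIM()
    s_machines = []
    for i in range(num_s_machines):
        s = SMachine(i, clock)
        t = TMachine(i, clock, s_machine=s)
        s.t_machine = t                   # S-machine <-> T-machine coupling
        gpim.ports.append(t)              # routing input/output to the GPIM
        s_machines.append(s)
    io_t_machines = []
    for j, device in enumerate(io_devices):
        io_t = IOTMachine(j, clock, device)
        gpim.ports.append(io_t)
        io_t_machines.append(io_t)
    return s_machines, io_t_machines, gpim, clock
```

In this model, only T-machines and I/O T-machines appear as GPIM ports; an S-machine reaches the interconnect solely through its corresponding T-machine, mirroring the coupling described above.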
The meta-addressing system of the present invention provides for bit-addressable capabilities for the processors in the network without requiring the processors themselves to perform the processing-intensive address manipulation functions. Separate processing and addressing machines are disclosed which are optimized to perform their assigned functions. The processing machines execute instructions, store and retrieve data from a local memory, and determine when remote operations are required. The addressing machines assemble packets of data for transmission, determine a geographic or network address of the packet, and perform addressing checking on incoming packets. Additionally, the addressing machines can provide interrupt handling and other addressing operations.
In one embodiment, the T-machines also provide the meta-addressing mechanism of the present invention. The meta-addresses designate the geographic location of the T-machines in the system and specify the location of data within the local memory devices. The local address of the meta-address is used to address each bit in the memory of a device, regardless of the actual memory size of the device (as long as the addressable space of the device is less than or equal to the bit count of the local address). Thus, devices having different memory sizes and structures may be addressed using the single meta-address. Further, by use of the meta-address, hardware within the multi-processor parallel architecture is not required to guarantee coherency and consistency across the system.
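As a rough illustration of the two-part meta-address, the sketch below packs a geographic address and a bit-granular local address into a single integer. The field widths (`GEO_BITS`, `LOCAL_BITS`), the function names, and the reading of the size condition as "device bits must not exceed 2^LOCAL_BITS" are assumptions made for this example; the passage above does not fix any of them.

```python
# Hypothetical field widths, chosen only for illustration.
GEO_BITS = 16      # width of the geographic address (assumed)
LOCAL_BITS = 64    # width of the bit-granular local address (assumed)

LOCAL_MASK = (1 << LOCAL_BITS) - 1


def pack_meta_address(geographic: int, local_bit: int) -> int:
    """Combine a geographic address and a local bit address."""
    if not 0 <= geographic < (1 << GEO_BITS):
        raise ValueError("geographic address out of range")
    if not 0 <= local_bit <= LOCAL_MASK:
        raise ValueError("local address out of range")
    return (geographic << LOCAL_BITS) | local_bit


def unpack_meta_address(meta: int) -> tuple:
    """Split a meta-address into (geographic, local_bit)."""
    return meta >> LOCAL_BITS, meta & LOCAL_MASK


def device_fits(device_size_bits: int) -> bool:
    """True if every bit of a device can be named by the local address."""
    return device_size_bits <= (1 << LOCAL_BITS)
```

Because the local field names individual bits rather than words, the same `pack_meta_address` call works for a small device and a large one alike, so long as `device_fits` holds.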
The meta-address allows for complete scalability; as a new S-machine or I/O device is added, a new geographic address is designated for the new device. The present invention allows for irregular scalability, in that there is no requirement of a power-of-two number of processors. Scalability is also enhanced by the ability to couple any number of addressing machines to each processing machine, up to the available local memory bandwidth. This allows the system designer to arbitrarily designate the number of pathways to each processing machine. This flexibility can be used to allow more communication bandwidth to be provided to higher levels of the system, creating in effect a pyramid processing architecture which is optimized to devote the most communication bandwidth to the most important functions of the system.
As described above, in accordance with a preferred embodiment, the T-machines are addressing machines which generate meta-addresses, handle interrupts, and queue messages. The S-machines are thus freed to devote their processing capacity solely to the execution of program instructions, greatly optimizing the overall efficacy of the multi-processor parallel architecture of the present invention. The S-machines need only access the local memory component of the meta-address to locate the desired data; the geographic address is transparent to the S-machine. This addressing architecture interoperates extremely well with a distributed memory/distributed processor parallel computing system. The architectural design choice of isolating the local memories allows independent and parallel operation of hardware. In accordance with the present invention, each S-machine can have completely divergent reconfiguration directives at runtime, even though all are directed in parallel on one computing problem. Also, not only can the Instruction Set Architectures realized by dynamically reconfigurable S-machines be different, the actual hardware used to realize the S-machines can be optimized to perform certain tasks. Thus, the S-machines in a single system may all be operating at different rates, allowing each S-machine to optimally perform its function while maximizing the use of system resources.
Additionally, the only memory validation which occurs is to verify the correct geographic address has been transmitted; no validation of the local memory address is provided. Further, this validation is performed by the addressing machine, not by the processing machine. As no virtual addressing is used, no hardware/software interoperations for converting virtual addresses to logical addresses are required. The address in the meta-address is the physical address. The elimination of all of these preventative and maintenance functions greatly increases the processing speed of the entire system. Thus, by separating the "space" management of computer systems into separate addressing machines from the "time" management of the computer system (provided by the separate processing machines), in combination with the meta-addressing scheme, a unique memory management and addressing system for highly parallel computing systems is provided. The architecture of the present invention allows great flexibility in the operations of the S-machines, allowing each S-machine to operate at its own optimal rate, while maintaining a uniform T-machine rate. This balance of local instruction processing in fastest time, with system-wide data communication provided for across the farthest space, provides an improved approach to complex problem solving by highly parallel computer systems.
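The validation rule stated above, in which the addressing machine checks only the geographic address and passes the local address through unexamined, might be modeled as follows. This is a minimal sketch under assumed names: `Packet`, `AddressingMachine`, and the `delivered` list standing in for hand-off to the processing machine's local memory are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Packet:
    geographic: int   # destination geographic address
    local: int        # physical local address, used as-is
    payload: bytes


class AddressingMachine:
    """Hypothetical model of a T-machine's check on incoming packets."""

    def __init__(self, geographic: int) -> None:
        self.geographic = geographic
        # Stand-in for delivery to the paired processing machine's memory.
        self.delivered: List[Tuple[int, bytes]] = []

    def receive(self, packet: Packet) -> bool:
        # The only check performed: was the packet routed to this machine?
        if packet.geographic != self.geographic:
            return False
        # No bounds check or translation of the local address is done here;
        # it is treated as the physical address.
        self.delivered.append((packet.local, packet.payload))
        return True
```

Note that the processing machine never appears in `receive`: the check is done entirely on the addressing side, which is the division of labor the passage describes.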