1. Field of the Present Invention
The present invention relates generally to scalable, high performance hybrid FPGA networks, and, in particular, to a methodology for scheduling, partitioning and mapping variable computational tasks onto scalable, high performance networks of FPGAs, memory arrays and processors.
2. Background
High performance computing (“HPC”) finds extensive use in diverse areas including materials simulation, weather modeling, drug development, automotive design, oil exploration and financial forecasting. According to market intelligence and advisory firm IDC, the worldwide market for high performance computing machines grew by 30 percent in 2004. IDC's data shows that the HPC market reached $7.25 billion in 2004, up from $6.5 billion in 2003.
Traditional supercomputers cost millions of dollars and are complex to maintain and operate. Recent years have witnessed the emergence of clusters and grids of low cost workstations capable of delivering gigaflops of computing power. Commercial microprocessors such as the Intel Xeon and AMD Opteron serve as the core computing engines of cluster computing architectures. However, microprocessors are general purpose computing architectures, and are not necessarily well suited to deliver the high performance computing capability required for a given computationally intensive application. With CMOS technology entering the sub-100 nm regime, technological limitations such as reliability and leakage issues have posed a significant barrier to continued increases in the clock speed of processor architectures. For example, Intel Corporation has shifted its microprocessor performance emphasis from raw clock speeds to architectural innovations such as the use of dual core processors. However, the effective use of these new microprocessor paradigms requires extensive modification of current software.
Other HPC solutions have also been offered. Many general purpose HPC machines (i.e., not tailored to any specific application domain) are available from suppliers such as IBM, Sun, SGI, Cray and Fujitsu. Cluster machines are available from Dell and a variety of other vendors (such as Aptiva and LinuxX). Unfortunately, such solutions tend to be proprietary and thus not of a design that may be easily controlled or customized for a particular application, other than as provided by the supplier. Of equal importance, the designs tend not to be scalable.
While microprocessors provide software flexibility, Application Specific Integrated Circuits (“ASICs”), where a given computational algorithm is directly implemented on silicon, provide the highest performance for a given CMOS technology. Clearspeed is one provider of such solutions. However, the high cost and long design cycle make ASIC solutions viable only for extremely high volume applications. Moreover, the lack of programmability of ASICs severely limits their flexibility in implementing even minor modifications of a given computational algorithm.
Field Programmable Gate Arrays (“FPGAs”), available from such companies as Xilinx, Inc. and Altera, allow hardware programming of logic primitives to realize a certain computational algorithm. Thus, they enjoy the programmability of microprocessors and offer the ability to directly realize computational tasks on hardware (at a lower performance compared to ASICs). However, until recently, FPGAs have been low performance devices, with low gate counts and limited computer aided design tools, which restricted their use to logic prototyping. Recent years have witnessed a dramatic improvement in the computational capability of FPGAs, with platform FPGAs containing more than 10 million system gates and incorporating complex heterogeneous structures, such as PowerPC processors. For example, the Virtex-II Pro family from Xilinx Inc. integrates on a single chip two PowerPC processor blocks, 444 multiplier blocks, 444 block RAMs of 18K each, multi-gigabit transceivers, 99216 programmable logic cells, and many other components. The availability of such high performance FPGAs opens the possibility of implementing computationally intensive algorithms on FPGAs instead of merely using them as prototyping devices. In a computational cluster, such FPGAs, in conjunction with microprocessors, could serve as hardware accelerators for computationally intensive tasks, delivering a significant increase in performance. In the investigators' research group, for example, an FPGA implementation of the Smith-Waterman gene sequence alignment algorithm used in bioinformatics demonstrated an increase in performance, as compared to a typical conventional workstation (the SunFire 280R), of two to three orders of magnitude.
While successive generations of FPGAs have higher transistor counts, the fixed hardware resources of a given FPGA often imply that multiple FPGAs are required to implement a complex computational architecture. Recognizing this need, FPGA vendors, such as Xilinx Inc., have introduced 10 Gb/s on-chip transceivers for inter-FPGA communication. Taking advantage of this, FPGA board vendors, such as Nallatech Inc., have introduced products where each board consists of 4 Virtex-II Pro FPGAs with 3 gigabytes of external RAM. Each FPGA can have embedded processor cores. Several such boards can be plugged into the PCI-X slots of host workstations that can be in a cluster or grid. Such architectures allow the construction of scalable systems, where the system designer can readily increase the number of FPGAs and memory arrays, based on the computational requirements and the budgetary constraints. Also, the FPGA network architecture makes it easier to follow the technology curve by enabling individual FPGA nodes to be upgraded independently.
SGI and Cray are among the HPC suppliers using FPGAs for hardware acceleration. Starbridge implements an entire computer using a fixed network of FPGAs. Unfortunately, such approaches are not easily programmable and thus provide little flexibility.
In another approach, offered by Mitrionics AB of Lund, Sweden, a task-specific programmable architecture may be implemented on a single FPGA. Unfortunately, each architecture must be limited to the hardware resources available in that individual FPGA. Scalability may only be achieved by straightforward replication of the same architecture on multiple FPGAs to create a processor cluster. This solution provides no means for scheduling, partitioning and mapping a functional task onto a hybrid network of FPGAs, memory banks and processors.
Finally, FPGAs have also been used to implement specific algorithms. For example, TimeLogic Corporation, of Carlsbad, Calif., offers an FPGA implementation of standard bioinformatics algorithms. Unfortunately, such implementations are typically not dynamically scalable, and are not flexible.
The flexibility provided by FPGAs offers great promise for high performance computing applications, but as outlined hereinabove, previous solutions have failed to take full advantage of this opportunity. It is believed that the problem lies not in the ability to combine FPGAs with memory devices, processors and other components into hybrid networks, but in the ability to provide general purpose hardware accelerators that may be re-partitioned by the application users as desired. Thus, a need exists for a flexible methodology for scheduling, mapping and partitioning computational tasks onto scalable high performance networks of FPGAs, memory arrays and processors.