Maturity of Processor Architecture Research:
The general-purpose processor architecture research field has matured, with attempts to further increase the performance of general-purpose processors presently encountering (i) frequency, (ii) power, (iii) design complexity, and (iv) memory wall barriers. However, the need for increased performance and reduced power continues to exist.
Difficulty of Parallel Programming:
Abandoning the extremely convenient, easy-to-use sequential programming model and programming explicitly for parallel processors is one way to increase performance. Recent multi-core processor architectures [5], enabled by increasing VLSI densities, indeed encourage this approach. However, programming a parallel multi-core processor system is neither natural nor easy, owing to, e.g., race conditions, deadlocks, and non-deterministic bugs that are hard to track down. Increased parallelism in general-purpose processors has in fact increased the difficulty of programming and using them [2].
Inefficiencies of the Hypervisor and the Operating System:
Sharing of computing resources among different independent applications and virtual machines has been emphasized at least since the days of early mainframes [1]. This emphasis on resource sharing continues to this day. Recently, Cloud Computing [3] and Virtualization [4] have emerged as preferred methods of offering computing and application services with resource sharing. By breaking the barriers of the traditional in-house IT shop approach, cloud computing offers centralized high performance computing resources, economies of scale, and radically higher degrees of efficiency. For example, a large cloud computing data center, along with a fast and reliable encrypted network, can greatly amplify the performance of an inexpensive client device, while preserving the security properties of an in-house IT shop.
However, cloud computing today relies on operating systems or hypervisors that are designed in software and that lack scalability. For example, an interrupt may incur substantial overhead (e.g., ten thousand instructions) in today's operating systems. Moreover, a transition between privilege levels (as in an interrupt or system call) requires a global serialization/pipeline flush in general-purpose processors. The schedulers within operating systems and hypervisors alike are not designed in an algorithmically parallel, scalable way that can handle massively parallel systems. At the extreme performance levels that will be needed in the future, such serialization overheads will become important. To alleviate the severe slowdown implied by Amdahl's law, the slowdown effects due to both the OS and the hypervisor must be reduced.
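The appeal to Amdahl's law can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the serial fractions are assumptions chosen for the example, not measurements of any operating system or hypervisor. It uses the standard form of the law, in which a fraction s of the work is serialized and the remainder is spread over n workers, giving a speedup of 1 / (s + (1 - s) / n).

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Speedup bound from Amdahl's law: 1 / (s + (1 - s) / n)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    # Hypothetical serial fractions standing in for OS/hypervisor overhead.
    for s in (0.10, 0.01, 0.001):
        for n in (100, 10_000):
            print(f"serial={s:<6} workers={n:<6} speedup={amdahl_speedup(s, n):.1f}")
```

Whatever the worker count, the speedup cannot exceed 1 / s, so even a small serialized share attributable to the OS or hypervisor caps what a massively parallel system can deliver.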
Prevailing Solutions:
The computer industry currently focuses on two prevailing approaches: energy-efficient multi-core processors [5] and hybrid computing architectures [6]. While these do not directly address the significant problems mentioned above (namely, the difficulty of parallel programming and the inefficiency of the OS and hypervisor), they do promise to increase performance and to reduce power. We will review the hybrid computing architectures, since they are most relevant to application-specific supercomputers, the subject of the present document.
In general-purpose hybrid computing architectures, the acceleration unit consists of graphics processing units (GPUs) with their own specialized Instruction Set Architecture [6]. These acceleration units are capable of accelerating graphics applications, as well as a range of additional high performance computing applications, provided that suitable parts of the applications are re-coded to expose explicit parallelism and to take advantage of the massively parallel architecture of specialized processors.
By contrast, reconfigurable hybrid computing architectures (reconfigurable computers) deploy field programmable gate arrays (FPGAs) as the acceleration unit, and offer more flexibility. Typically, a collection of one or more FPGAs acts as a co-processor to each general-purpose host processor [7] [8]. While arbitrary code in general cannot take advantage of the FPGAs using today's tools, suitable code fragments can again be recoded to expose explicit parallelism and then compiled with a high-level tool to run on the FPGAs.
Even though the commercial systems with FPGAs are very promising in boosting the application performance with less power than traditional servers, they suffer from a few shortcomings:
Lack of scalable pre-emptive scheduling: Many of today's reconfigurable computer systems do not implement pre-emptive scheduling of accelerators: they instead allow a hardware accelerator to keep its share of hardware resources as long as it runs. As a consequence, even when an accelerator is idle, e.g., waiting for an input, it occupies hardware resources until it finishes. This potentially leads to the underutilization of the system. Where pre-emptive hardware task scheduling is indeed done [15][16][17], it is done in a non-scalable way, with software involvement. Existing pre-emptive schedulers may also impose restrictions on inter-task communication, for example, task dependences may be in the form of a DAG (Directed Acyclic Graph) [U.S. Pat. No. 6,034,538].
Lack of scalability of hardware-accelerated applications: Today's software and hardware design tools do not virtualize application-specific, custom hardware accelerators at the supercomputer scale.
Low programmer productivity: Using a reconfigurable hardware platform is complex at present, because of the general difficulty of parallel programming, mentioned earlier, and the general difficulty of hardware design with today's tools.
Missing semi-reconfigurable ASICs: An FPGA is an interpreter of arbitrary circuits specified at the Register Transfer Level, and is therefore very flexible and general, while an ASIC implementation of a given RTL circuit is in fact a compiled version of that circuit, which has performance and power advantages over the corresponding FPGA implementation. To benefit from the lower power and higher performance advantages of an ASIC within a reconfigurable system, a systematic solution to utilize ASICs for application acceleration (beyond the solution of implementing only one function on a given kind of ASIC) is desirable. The existing reconfigurable computer systems do not systematically support application-specific integrated circuits (ASICs) in addition to FPGAs, as a source of hardware acceleration.
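The underutilization attributed above to the lack of pre-emptive scheduling can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions: the `AcceleratorRun` record, the two function names, and the time figures are hypothetical, and the model ignores the cost of swapping an accelerator in and out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AcceleratorRun:
    """One accelerator's residency on the hardware (hypothetical time units)."""
    busy: int  # time spent doing useful computation
    idle: int  # time spent loaded but waiting, e.g. for input


def utilization_without_preemption(runs: Iterable[AcceleratorRun]) -> float:
    """Fraction of held hardware time that is useful when idle accelerators keep their resources."""
    runs = list(runs)
    held = sum(r.busy + r.idle for r in runs)
    busy = sum(r.busy for r in runs)
    return busy / held if held else 0.0


def reclaimable_time(runs: Iterable[AcceleratorRun]) -> int:
    """Hardware time an idealized, zero-cost pre-emptive scheduler could hand to other tasks."""
    return sum(r.idle for r in runs)


if __name__ == "__main__":
    runs = [AcceleratorRun(busy=30, idle=70), AcceleratorRun(busy=50, idle=50)]
    print(utilization_without_preemption(runs))
    print(reclaimable_time(runs))
```

In this model every idle time unit is hardware that a non-pre-emptive system cannot give to another accelerator; a real pre-emptive scheduler would recover only part of it, since saving and restoring accelerator state has a cost the sketch leaves out.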
Our Approach:
The present document's system does address the two significant problems (difficulty of parallel programming, inefficiency of the OS and hypervisor) mentioned above. It also distinguishes itself from the cited art in at least the following ways:
Scalable pre-emptive scheduling: The present document's system introduces a scalable parallel hardware hypervisor system, where the hypervisor functions related to the allocation, de-allocation, and relocation of hardware supercomputing tasks with unrestricted inter-task communication, are achieved with parallel algorithms implemented in hardware. The resources allocated to a virtual application-specific supercomputer can increase or decrease on demand, at the virtual tile granularity. The parallel implementation of such hypervisor functions is a difficult problem, giving rise to several race conditions, which have been addressed in the present document.
Scalability of hardware-accelerated applications: The present document's system virtualizes application-specific, custom hardware accelerators at the supercomputer scale.
Programmer productivity: The present document's system establishes a hardware acceleration programming model and automated compilation method, which achieves 100% compatibility between the original single-threaded software application and the virtual supercomputer created from it. This is a model that allows better programmer productivity.
Semi-reconfigurable ASICs: The present document's system establishes a systematic method to create semi-reconfigurable ASIC modules, allowing the same ASIC module to realize multiple functions, while retaining the ASIC advantages of lower power and/or higher performance.
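To make the hypervisor functions named under scalable pre-emptive scheduling (allocation, de-allocation, and relocation at virtual tile granularity) easier to picture, the following is a purely sequential software model of a virtual-to-physical tile map. It is a sketch under loud assumptions: the `TileMap` class, its method names, and the lowest-numbered-free-tile policy are all hypothetical, and it does not reproduce the parallel hardware algorithms or the race-condition handling that the present document describes.

```python
from typing import Dict, Hashable, Set, Tuple

VirtualTile = Tuple[Hashable, int]  # (application id, virtual tile number)


class TileMap:
    """Hypothetical sequential model of a virtual-tile to physical-tile mapping."""

    def __init__(self, physical_tiles: int) -> None:
        self.free: Set[int] = set(range(physical_tiles))
        self.mapping: Dict[VirtualTile, int] = {}

    def allocate(self, app_id: Hashable, virtual_tile: int) -> int:
        """Map a virtual tile onto a free physical tile (on-demand growth)."""
        key = (app_id, virtual_tile)
        if key in self.mapping:
            return self.mapping[key]
        if not self.free:
            raise RuntimeError("no free physical tile")
        phys = min(self.free)  # assumed policy: lowest-numbered free tile
        self.free.remove(phys)
        self.mapping[key] = phys
        return phys

    def deallocate(self, app_id: Hashable, virtual_tile: int) -> None:
        """Release the physical tile held by a virtual tile (on-demand shrink)."""
        phys = self.mapping.pop((app_id, virtual_tile))
        self.free.add(phys)

    def relocate(self, app_id: Hashable, virtual_tile: int) -> int:
        """Move a virtual tile to a different free physical tile."""
        key = (app_id, virtual_tile)
        old = self.mapping[key]
        if not self.free:
            raise RuntimeError("no free physical tile to relocate to")
        new = min(self.free)
        self.free.remove(new)
        self.free.add(old)
        self.mapping[key] = new
        return new


if __name__ == "__main__":
    tiles = TileMap(physical_tiles=3)
    tiles.allocate("app", 0)
    tiles.allocate("app", 1)
    print(tiles.relocate("app", 0), sorted(tiles.free))
```

In this software model each operation runs to completion on a single shared table, so no two operations ever interleave. Performing the same updates concurrently, as a parallel hardware hypervisor must, is where the race conditions mentioned in the text would arise.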