Traditionally, a microprocessor controls its computations through a centralized instruction-issue unit and a branch unit. The instruction-issue unit issues instructions that control the cycle-by-cycle operations of the microprocessor's resources, while the branch unit steers execution over time by directing the flow of control and determining the sequence of instructions to be issued.
As chip density increases, emerging devices have the capacity to accommodate very large numbers of functional units, which can potentially deliver much higher performance than current devices. As the number of functional units increases, especially on programmable devices, controlling them efficiently and flexibly becomes difficult. In many situations, the single centralized point of control found in traditional microprocessors with branch units is inadequate for managing this vastly increased number of functional units. For example, to exploit thread-level parallelism, a computing platform has to track multiple flows of control. A traditional centralized architecture, with its single flow of control, is unable to do this.
Conventional MIMD (multiple instructions multiple data) machines also have limitations in supporting thread-level parallelism. These machines usually confine each thread of execution to a single processor, because control between the different processors of a MIMD machine is generally so decoupled that it is difficult to statically orchestrate their execution. A highly parallel thread is usually unable to make full use of its parallelism because the hardware resources in each MIMD processor, which are normally fixed in hardware, are insufficient. Work is usually spawned dynamically to other processors of a MIMD machine only at very coarse granularity, because of the high overhead of the dynamic coordination needed when a single logical thread is split into multiple actual threads, each running on a different processor. As a result, opportunities to exploit parallelism and to use computing resources efficiently are missed.
Multi-threaded control architectures, such as SMT (simultaneous multi-threading), support multiple flows of control that share a common pool of functional units. However, they usually adopt a centralized point of control and dynamic coordination of instruction issue, both of which are difficult to implement and to scale. As a result, they are generally unable to accommodate either a large number of simultaneously executing threads or a large number of functional units.
Distributing control information from a centralized point of control becomes more difficult on larger, faster chips. With a faster clock, there is less time in each cycle for signals to propagate. As feature sizes shrink, wires become narrower and taller, and signal propagation speed along these wires deteriorates. Under a centralized control architecture, all of these signals need to be brought to the central point, which causes a scaling bottleneck.
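The effect of clock speed and wire resistance on signal reach can be illustrated with a rough calculation. The following Python sketch is not part of the described architecture. It assumes the common first-order estimate that an unrepeated wire behaves as a distributed RC line whose delay is about 0.38 times its total resistance times its total capacitance, so delay grows with the square of wire length. The function names and the per-length resistance and capacitance values are hypothetical, chosen only for illustration.

```python
import math

# First-order assumption: 50% delay of a distributed RC line ~= 0.38 * R_total * C_total.
RC_DELAY_FACTOR = 0.38


def unrepeated_wire_delay_s(length_m, r_per_m, c_per_m):
    """Estimated delay (seconds) of an unrepeated wire of the given length."""
    total_r = r_per_m * length_m  # ohms
    total_c = c_per_m * length_m  # farads
    return RC_DELAY_FACTOR * total_r * total_c


def max_reach_m(clock_hz, r_per_m, c_per_m):
    """Longest unrepeated wire (meters) whose estimated delay fits in one clock period."""
    period_s = 1.0 / clock_hz
    return math.sqrt(period_s / (RC_DELAY_FACTOR * r_per_m * c_per_m))


if __name__ == "__main__":
    # Hypothetical wire parameters, for illustration only.
    r_per_m = 1.0e5    # ohms per meter
    c_per_m = 2.0e-10  # farads per meter
    for clock_hz in (1.0e9, 2.0e9, 4.0e9):
        reach_mm = max_reach_m(clock_hz, r_per_m, c_per_m) * 1e3
        print(f"{clock_hz / 1e9:.0f} GHz: reach per cycle ~ {reach_mm:.2f} mm")
    # A narrower wire has higher resistance per unit length, which shortens the reach.
    narrow_mm = max_reach_m(1.0e9, 2.0 * r_per_m, c_per_m) * 1e3
    print(f"1 GHz, doubled resistance: reach per cycle ~ {narrow_mm:.2f} mm")
```

Under this simplified model, doubling the clock frequency or doubling the wire resistance per unit length each shortens the single-cycle reach by a factor of the square root of two, while a central control point must still reach every functional unit on the chip.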
Based on the foregoing, it is desirable that mechanisms be provided to solve the above deficiencies and related problems.