Because of diminishing returns from larger caches and perceived limits on exploitable instruction-level parallelism (ILP), the focus of CPU architecture has shifted in recent years from raising instructions per cycle (IPC) to providing parallelism through multiple cores and threads per chip (chip multiprocessing, CMP). Server applications generally make good use of CMP, but even they may struggle to keep doing so as core counts climb well into the double digits. In the personal and mobile spaces the situation is much worse: most applications make minimal use, if any, of more than one thread or core.
What is proposed is essentially a refactoring of the resources of a multi-core CPU into a form that real software can use more easily. Instead of a set of cores with hard boundaries between them, the boundaries are removed so that execution units can be shared between threads. Instead of one or a few front-ends feeding a complex back-end in each core, a set of front-ends per chip feeds a set of simple back-ends (microcores) in a flexible manner. The benefits of this arrangement are:

1. Increased exploitation of ILP, since many microcores can potentially execute in the same cycle
2. Elimination of some write-after-write (WAW) dependency stalls
3. Execution of multiple serially dependent low-latency operations per cycle
4. Reduced memory stalls, owing to the large number of bypass paths the arrangement inherently provides
5. Provision for speculation along multiple branch paths
6. A very high degree of parallelism from single-threaded code in certain special, but not uncommon, cases, by recognizing and exploiting loop-level parallelism
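As a rough illustration of benefits 1 and 2, the following Python sketch counts the cycles a short instruction sequence needs under a deliberately simplified model. It is not the proposed hardware. The names `schedule_cycles`, `width`, and `rename` are hypothetical, and the model assumes an unbounded issue window, fixed per-instruction latencies, `width` interchangeable execution slots per cycle, and no write-after-read hazards.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction format: (destination register, source registers, latency in cycles).
Instr = Tuple[str, List[str], int]


def schedule_cycles(program: List[Instr], width: int, rename: bool) -> int:
    """Return the cycle at which the last result is available.

    Toy model (an assumption, not the proposed design): instructions are
    visited in program order, and each issues in the earliest cycle that
    satisfies its read-after-write dependencies and has a free slot.
    Without renaming, a later write to a register must also complete
    after the earlier write to that register (the WAW constraint).
    """
    value_ready: Dict[str, int] = {}  # register -> cycle its newest value is available
    issued: Dict[int, int] = {}       # cycle -> number of instructions issued in it
    finish = 0
    for dest, srcs, latency in program:
        # RAW: wait until every source operand has been produced.
        earliest = max((value_ready.get(s, 0) for s in srcs), default=0)
        if not rename and dest in value_ready:
            # WAW: this write must complete strictly after the previous one.
            earliest = max(earliest, value_ready[dest] + 1 - latency)
        cycle = earliest
        while issued.get(cycle, 0) >= width:  # find a cycle with a free slot
            cycle += 1
        issued[cycle] = issued.get(cycle, 0) + 1
        value_ready[dest] = cycle + latency
        finish = max(finish, cycle + latency)
    return finish


# Four independent single-cycle operations: more slots means fewer cycles.
independent: List[Instr] = [("a", [], 1), ("b", [], 1), ("c", [], 1), ("d", [], 1)]

# A slow write to r1 followed by a fast write to r1 (a WAW pair), each with a consumer.
waw_pair: List[Instr] = [
    ("r1", ["r0"], 3),  # long-latency producer, e.g. a load
    ("r2", ["r1"], 1),  # consumer of the first r1 value
    ("r1", ["r0"], 1),  # second write to r1
    ("r3", ["r1"], 1),  # consumer of the second r1 value
]
```

In this model, `schedule_cycles(independent, width=1, rename=False)` takes four cycles and `width=4` takes one. For `waw_pair` at `width=4`, the run takes five cycles with the WAW constraint enforced and four once it is lifted. The numbers describe only this toy scheduler and say nothing about the performance of the proposed design.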