Many computer systems now run multithreaded applications, and languages in which multithreading is often used, such as Java, have become widespread. Further, computer processing devices such as the UltraSPARC T1 microprocessor available from Sun Microsystems, Inc. of Santa Clara, Calif. have demonstrated that multithreaded applications can run well when implemented by way of temporal multithreading (“TMT”), which is also known as switch-on-event multithreading (“SoEMT”). Other computer processing devices, such as the Itanium or Itanium 2 processors (e.g., the Montecito processor) available from Intel Corp., also of Santa Clara, Calif., likewise are capable of employing SoEMT, albeit typically with fewer threads. Usage of such multithreading techniques can improve the operation of computer systems in various manners. For example, TMT attempts to improve performance by allowing a hardware (HW) thread that is waiting on memory to release the hardware core so that another virtual CPU (another HW thread) can run instead, which allows for better utilization of the CPU core's resources.
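By way of illustration only, the switch-on-event behavior described above can be sketched as a small cycle-level simulation. The sketch is a simplified model and not a description of any particular processor: the function name core_utilization, its parameters, the fixed compute burst between memory accesses, the fixed memory latency, and the zero-cost thread switch are all assumptions made for the example.

```python
def core_utilization(num_threads, compute_cycles, memory_latency, total_cycles):
    """Return the fraction of cycles in which the core executes work.

    Hypothetical model: every HW thread computes for `compute_cycles`,
    then waits `memory_latency` cycles on memory. When the running thread
    is waiting, the core switches (at no cost) to any thread that is ready.
    """
    # ready_at[i]: first cycle at which HW thread i may run again
    ready_at = [0] * num_threads
    # remaining[i]: compute cycles left before thread i's next memory wait
    remaining = [compute_cycles] * num_threads
    current = 0
    busy = 0
    for cycle in range(total_cycles):
        if ready_at[current] > cycle:
            # Switch event: the current thread is waiting on memory
            ready = [i for i in range(num_threads) if ready_at[i] <= cycle]
            if not ready:
                continue  # every thread is waiting; the core idles this cycle
            current = ready[0]
        busy += 1
        remaining[current] -= 1
        if remaining[current] == 0:
            # The thread issues a memory access and gives up the core
            ready_at[current] = cycle + 1 + memory_latency
            remaining[current] = compute_cycles
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, core_utilization(n, 10, 30, 400))
```

Under these assumed numbers (10 compute cycles per 30-cycle memory wait), one HW thread keeps the core busy 25% of the time, two threads 50%, and four threads 100%, which is the utilization benefit the paragraph attributes to TMT.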
Although conventional computer processing devices can achieve enhanced performance due to their implementation of multithreaded applications or other multithreading techniques, such computer processing devices nevertheless are limited in their performance. For example, with respect to the aforementioned UltraSPARC T1 microprocessor in particular, while that microprocessor commonly executes relatively large numbers of threads simultaneously, the performance on any single thread is relatively slow because of the large number of threads running on a single core of the CPU (e.g., if 4 threads are being executed on a 1 GHz machine, each thread effectively runs at 0.25 GHz). In essence, the designs of these machines are skewed so far toward multithreaded operation that the machines cannot quickly or efficiently execute a lone thread. That is, the architecture of these machines is highly slanted toward execution of high numbers of software (SW) threads operating on many HW threads, but does not result in efficient performance of smaller numbers of SW threads or a single SW thread.
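The per-thread figure in the example above follows from dividing the core clock evenly among the threads sharing the core. A minimal sketch of that arithmetic is given below; the function name effective_thread_rate_ghz is hypothetical, and the even division assumes that every thread receives an equal share of the core's cycles.

```python
def effective_thread_rate_ghz(core_clock_ghz, threads_per_core):
    """Effective per-thread rate when a core's cycles are shared equally.

    Hypothetical helper: assumes strictly even sharing and no switch overhead.
    """
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    return core_clock_ghz / threads_per_core


if __name__ == "__main__":
    # 4 threads on a 1 GHz core, as in the example above
    print(effective_thread_rate_ghz(1.0, 4))  # prints 0.25
```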
In comparison, the Itanium processors deploy fewer HW threads to achieve better throughput while allowing single threads to run at full speed. That is, higher processing speeds can be achieved by such processors due to the reduced number of threads being executed by the processors. Although the high processing speeds that can be achieved by such processors are desirable, it is not uncommon for the processors to stall on memory due to the relatively long memory latency experienced by the processors when accessing memory. Additionally, even though the Itanium architecture includes prefetch instructions that allow a compiler to fetch data ahead of when it will be needed without blocking the HW thread execution, it is often the case that prefetching cannot be done far enough in advance to cover the latency of the memory subsystem and thereby avoid stalling. Consequently, such stalling can result in an increase in the experienced Cycles Per Instruction (CPI) metric.
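The effect of uncovered memory latency on CPI can be illustrated with a common first-order model, in which the experienced CPI equals a base CPI plus the stall cycles per instruction that prefetching failed to hide. The sketch below is illustrative only: the function name effective_cpi, its parameters, and the sample numbers are assumptions and are not measurements of any Itanium processor.

```python
def effective_cpi(base_cpi, misses_per_instruction, memory_latency, prefetch_lead):
    """First-order estimate of experienced CPI with partially hidden latency.

    Hypothetical model: `prefetch_lead` is how many cycles ahead of use a
    prefetch is issued; only the latency beyond that lead stalls the thread.
    """
    # Latency left exposed after prefetching; never negative
    exposed_stall = max(0, memory_latency - prefetch_lead)
    return base_cpi + misses_per_instruction * exposed_stall


if __name__ == "__main__":
    # Assumed values: 1 miss per 100 instructions, 300-cycle memory latency
    print(effective_cpi(1.0, 0.01, 300, 100))  # prefetch covers only 100 cycles
    print(effective_cpi(1.0, 0.01, 300, 300))  # prefetch covers the full latency
```

In this model the CPI returns to its base value only when the prefetch lead is at least as long as the memory latency, which corresponds to the case in which prefetching is done far enough in advance to avoid stalling.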
For at least these reasons, therefore, it would be advantageous if an improved method and system for computer processing could be developed that achieved enhanced speeds of operation and/or throughput. More particularly, it would be advantageous if such improvements could be achieved in relation to microprocessors that implement multithreading.