The following references illustrate the state of the art:
[1] R. Gabor, S. Weiss, and A. Mendelson, “Fairness Enforcement in Switch On Event Multithreading,” ACM Transactions on Architecture and Code Optimization, Vol. 4, No. 3, Article 15, pp. 1-34, September 2007.
[2] J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel, “A Multithreaded PowerPC Processor for Commercial Servers,” IBM Journal of Research and Development, Vol. 44, No. 6, pp. 885-898, November 2000.
[3] C. McNairy and R. Bhatia, “Montecito—The Next Product in the Itanium Processor Family,” Hot Chips 16, August 2004.
[4] B. J. Smith, “Architecture and Applications of the HEP Multiprocessor Computer System,” Proceedings of SPIE Real Time Signal Processing IV, pp. 241-248, 1981.
[5] L. Gwennap, “Sandy Bridge Spans Generations,” Microprocessor Report (www.MPRonline.com), September 2010.
[6] R. Waser and M. Aono, “Nanoionics-based Resistive Switching Memories,” Nature Materials, Vol. 6, pp. 833-840, November 2007.
[7] Y. Huai, “Spin-Transfer Torque MRAM (STT-MRAM) Challenges and Prospects,” AAPPS Bulletin, Vol. 18, No. 6, pp. 33-40, December 2008.
[8] L. O. Chua, “Memristor - The Missing Circuit Element,” IEEE Transactions on Circuit Theory, Vol. 18, No. 5, pp. 507-519, September 1971.
[9] R. Waser, R. Dittmann, G. Staikov, and K. Szot, “Redox-Based Resistive Switching Memories - Nanoionic Mechanisms, Prospects, and Challenges,” Advanced Materials, Vol. 21, No. 25-26, pp. 2632-2663, July 2009.
[10] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” Proceedings of the Annual International Symposium on Computer Architecture, pp. 2-13, June 2009.
[11] M. N. Kozicki and W. C. West, “Programmable Metallization Cell Structure and Method of Making Same,” U.S. Pat. No. 5,761,115, June 1998.
[12] J. F. Scott and C. A. Paz de Araujo, “Ferroelectric Memories,” Science, Vol. 246, No. 4936, pp. 1400-1405, December 1989.
[13] Z. Diao et al., “Spin-Transfer Torque Switching in Magnetic Tunnel Junctions and Spin-Transfer Torque Random Access Memory,” Journal of Physics: Condensed Matter, Vol. 19, No. 16, pp. 1-13, 165209, April 2007.
[14] International Technology Roadmap for Semiconductors (ITRS), 2009.
[15] A. C. Torrezan, J. P. Strachan, G. Medeiros-Ribeiro, and R. S. Williams, “Sub-Nanosecond Switching of a Tantalum Oxide Memristor,” Nanotechnology, Vol. 22, No. 48, pp. 1-7, December 2011.
[16] J. Nickel, “Memristor Materials Engineering: From Flash Replacement Towards a Universal Memory,” Proceedings of the IEEE International Electron Devices Meeting, December 2011.
[17] Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser, “Many-Core vs. Many-Thread Machines: Stay Away From the Valley,” Computer Architecture Letters, Vol. 8, No. 1, pp. 25-28, May 2009.
[18] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” Proceedings of the Annual International Symposium on Computer Architecture, pp. 392-403, June 1995.
[19] J. W. Haskins, K. R. Hirst, and K. Skadron, “Inexpensive Throughput Enhancement in Small-Scale Embedded Microprocessors with Block Multithreading: Extensions, Characterization, and Tradeoffs,” Proceedings of the IEEE International Conference on Performance, Computing, and Communications, pp. 319-328, April 2001.
[20] M. K. Farrens and A. R. Pleszkun, “Strategies for Achieving Improved Processor Throughput,” Proceedings of the Annual International Symposium on Computer Architecture, pp. 362-369, May 1991.
[21] The gem5 Simulator. A modular platform for computer-system architecture research.
[22] SPEC CPU2006 benchmark suite.
Multithreaded processors have been used to improve single-core performance for the past two decades. One low-power, low-complexity multithreading technique is Switch on Event multithreading (SoE MT, also known as coarse-grain multithreading and block multithreading) [1], [2], [3], [20], where a thread runs inside the pipeline until an event occurs (e.g., a long-latency event such as a cache miss) and triggers a thread switch. The state of the replaced thread is maintained by the processor while the long-latency event is handled in the background. When a thread is switched, the in-flight instructions are flushed. The time required to refill the pipeline after a thread switch is referred to as the switch penalty. The switch penalty is usually relatively high, which makes SoE MT less popular than simultaneous multithreading (SMT) [18] and fine-grain multithreading (interleaved multithreading) [4]. Fine-grain MT, however, is worthwhile only for a large number of threads, and the performance of SMT is limited in practice by the small number of supported threads (e.g., two for Intel Sandy Bridge [5]).
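For illustration only, the effect of the switch penalty on SoE MT can be sketched with a toy cycle-count model. The function name simulate_soe, the trace encoding ('c' for a one-cycle instruction, 'm' for an instruction that triggers a long-latency event), the round-robin choice of the next thread, and the rule that the penalty is charged on every change of thread except the first dispatch are all assumptions made for this sketch; they are not taken from the cited references and do not model any particular processor.

```python
def simulate_soe(traces, miss_latency, switch_penalty):
    """Toy switch-on-event model; returns (total_cycles, busy_cycles).

    traces: one string per thread; 'c' = one-cycle instruction,
            'm' = instruction that triggers a long-latency event.
    Assumptions (hypothetical, for illustration): one instruction per
    cycle, round-robin thread selection, and a fixed refill penalty
    charged whenever the running thread changes.
    """
    n = len(traces)
    pc = [0] * n          # next instruction index of each thread
    ready_at = [0] * n    # cycle at which each thread's pending event completes
    cycle = 0
    busy = 0
    current = None        # thread currently occupying the pipeline

    def unfinished(t):
        return pc[t] < len(traces[t])

    while any(unfinished(t) for t in range(n)):
        candidates = [t for t in range(n) if unfinished(t)]
        runnable = [t for t in candidates if ready_at[t] <= cycle]
        if not runnable:
            # Every remaining thread is waiting on an event: the core idles.
            cycle = min(ready_at[t] for t in candidates)
            continue

        if current in runnable:
            nxt = current  # no switch needed
        else:
            start = 0 if current is None else current + 1
            nxt = min(runnable, key=lambda t: (t - start) % n)

        if nxt != current:
            if current is not None:
                cycle += switch_penalty  # flush and refill the pipeline
            current = nxt

        # Run the selected thread until it finishes or triggers an event.
        while unfinished(current):
            op = traces[current][pc[current]]
            pc[current] += 1
            cycle += 1
            busy += 1
            if op == 'm':
                ready_at[current] = cycle + miss_latency
                break

    return cycle, busy
```

Under these assumptions, two threads with the trace "cmc", a miss latency of 10 cycles, and a switch penalty of 2 cycles complete in 19 cycles, compared with 15 cycles when the penalty is zero and 13 cycles for a single such thread running alone. The gap between 19 and 15 is the cost attributed to the switch penalty in this simplified model.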