1. Field of the Invention
The present invention relates to a data processing apparatus and method for switching a workload between first and second processing circuitry, and in particular to a technique for improving the processing performance of the workload following the switch.
2. Description of the Prior Art
In modern data processing systems, the difference in performance demand between high intensity tasks such as gaming and low intensity tasks such as MP3 playback can exceed a ratio of 100:1. For a single processor to be used for all tasks, that processor would have to be high performance, but an axiom of processor micro-architecture is that high performance processors are less energy efficient than low performance processors. It is known to improve energy efficiency at the processor level using techniques such as Dynamic Voltage and Frequency Scaling (DVFS) or power gating to provide the processor with a range of performance levels and corresponding energy consumption characteristics. However, such techniques are generally becoming insufficient to allow a single processor to take on tasks with such divergent performance requirements.
Accordingly, consideration has been given to using multi-core architectures to provide an energy efficient system for the performance of such diverse tasks. Whilst systems with multiple processor cores have been used for some time to increase performance by allowing the different cores to operate in parallel on different tasks in order to increase throughput, analysis as to how such systems could be used to improve energy efficiency has been a relatively recent development.
The article “Towards Better Performance Per Watt in Virtual Environments on Asymmetric Single-ISA Multi-Core Systems” by V Kumar et al, ACM SIGOPS Operating Systems Review, Volume 43, Issue 3 (July 2009), discusses Asymmetric Single Instruction Set Architecture (ASISA) multi-core systems, consisting of several cores using the same instruction set architecture (ISA) but differing in features, complexity, power consumption, and performance. In the paper, properties of virtualised workloads are studied to gain insight into how these workloads should be scheduled on ASISA systems in order to improve performance and energy consumption. The paper identifies that certain tasks are better suited to high frequency/performance micro-architectures (typically computation intensive tasks), while others are better suited to lower frequency/performance micro-architectures and as a side effect will consume less energy (typically input/output intensive tasks). Whilst such studies show how ASISA systems might be used to run diverse tasks in an energy efficient manner, it is still necessary to provide a mechanism for scheduling individual tasks to the most appropriate processors. Such scheduling management will typically place a significant burden on the operating system.
The article “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction” by R Kumar et al, Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36'03), discusses a multi-core architecture where all cores execute the same instruction set, but have different capabilities and performance levels. At run time, system software evaluates the resource requirements of an application and chooses the core that can best meet these requirements while minimising energy consumption. As discussed in section 2 of that paper, during an application's execution the operating system software tries to match the application to the different cores, attempting to meet a defined objective function, for example a particular performance requirement. In section 2.3, it is noted that there is a cost to switching cores, which necessitates restriction of the granularity of switching. A particular example is then discussed where, if the operating system decides a switch is in order, it powers up the new core, triggers a cache flush to save all dirty cache data to a shared memory structure, and then signals the new core to start at a predefined operating system entry point. The old core can then be powered down, whilst the new core retrieves required data from memory. Such an approach is described in section 2.3 as allowing an application to be switched between cores by the operating system. The remainder of the paper then discusses how such switching may be performed dynamically within a multi-core setting with the aim of reducing energy consumption.
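By way of illustration only, the switching sequence summarised above (power up the new core, flush dirty cache data to shared memory, signal the new core, power down the old core) may be sketched as a simple software model. The sketch below is not taken from the cited paper; the names `Core`, `switch_workload`, `big`, `little` and `memory` are hypothetical, and the cache is modelled merely as a mapping from address to a (value, dirty) pair.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical model of a core with a private cache."""
    name: str
    powered: bool = False
    # address -> (value, dirty flag); an assumption made for illustration
    cache: dict = field(default_factory=dict)


def switch_workload(old, new, shared_memory):
    """Model the four-step switch; returns the ordered list of steps taken."""
    steps = []

    # 1. Power up the new core.
    new.powered = True
    steps.append(f"power up {new.name}")

    # 2. Flush the old core's cache: write dirty lines back to shared memory.
    for addr, (value, dirty) in old.cache.items():
        if dirty:
            shared_memory[addr] = value
    old.cache.clear()
    steps.append(f"flush cache of {old.name}")

    # 3. Signal the new core to start at a predefined entry point.
    steps.append(f"signal {new.name}")

    # 4. Power down the old core; the new core starts with an empty cache
    #    and must retrieve required data from memory.
    old.powered = False
    steps.append(f"power down {old.name}")
    return steps


big = Core("big", powered=True, cache={0x10: (1, True), 0x20: (2, False)})
little = Core("little")
memory = {0x10: 0, 0x20: 2}
steps = switch_workload(big, little, memory)
```

In this model the dirty value at address 0x10 reaches shared memory before the old core is powered down, and the new core begins with an empty cache, which is the starting condition for the cold-start behaviour discussed later in this section.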
Whilst the above paper discusses the potential for single-ISA heterogeneous multi-core architectures to provide energy consumption reductions, it still requires the operating system to be provided with sufficient functionality to enable scheduling decisions for individual applications to be made. The role of the operating system in this respect is made more complex when switching between processor instances with different architectural features. In this regard, it should be noted that the Alpha cores EV4 to EV8 considered in the paper are not fully ISA compatible, as discussed for example in the fifth paragraph of section 2.2.
Further, the paper does not address the problem that there is a significant overhead involved in switching applications between cores, which can significantly reduce the benefits to be achieved from such switching. The overhead includes not just the time taken to perform the switch, during which no processor is performing the transferred workload, but also the penalty incurred by cache misses following the switch. When the destination core starts performing the transferred processing, any cache provided in the destination core starts off containing no valid data, and so the destination core experiences cold start cache misses. This means that data has to be fetched from memory, which slows processing performance and uses a significant amount of energy. The performance and energy efficiency recover only once the destination cache has been “warmed” by caching some of the data values stored in memory. Whilst the above paper by R Kumar et al recognises the problem of cold-start cache misses at section 4.4, that paper does not provide any solution to this problem. The present technique seeks to improve processing performance following the switch to the destination processor.
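The cold start penalty described above may be illustrated, purely by way of example, with a small cache model. The sketch below assumes a fully associative cache with least-recently-used replacement, a capacity of 16 lines and a working set of 8 addresses accessed repeatedly; these parameters and the names `LRUCache` and `count_misses` are hypothetical and are chosen only to make the effect visible.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fully associative cache with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        return False


def count_misses(cache, trace):
    """Count the misses incurred by replaying an address trace on a cache."""
    return sum(0 if cache.access(addr) else 1 for addr in trace)


working_set = list(range(8))
trace = working_set * 4  # 32 accesses over 8 distinct addresses

# Source core: its cache has already been warmed by earlier execution.
source_cache = LRUCache(16)
count_misses(source_cache, trace)
warm_misses = count_misses(source_cache, trace)

# Destination core: its cache holds no valid data at the time of the switch.
destination_cache = LRUCache(16)
cold_misses = count_misses(destination_cache, trace)
```

Under these assumptions the warmed cache incurs no misses on the trace, whereas the cold cache misses once for each of the 8 distinct addresses before it begins to hit. In a real system each such miss requires a fetch from memory, with the associated delay and energy cost.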