Field of the Invention
This invention relates to a method and apparatus for resource management in a multicore architecture.
Description of the Related Art
Today, semiconductor devices incorporating complex heterogeneous multicore architectures are put to use in a wide variety of systems and devices, from the ubiquitous desktop computer, to the latest in modern electronic devices, such as mobile telephones, Personal Digital Assistants and high speed telecoms or network switching equipment.
Whatever the intended use of any computer processor, the processor manufacturers continue to strive to increase the performance of current processors, whilst maintaining or reducing their unit “cost”.
The “cost” of a processor can be measured using a variety of parameters. Although in many cases, the cost will be a purely financial one, in many applications, especially in the embedded processor market, the cost calculation also includes ancillary considerations such as power consumption, cooling requirements, efficiency and time to bring to market.
The absolute capacity for any processor to perform useful functions may be characterised in terms of the MIPS (millions of instructions per second) rating achievable, and thus the “price-performance” ratio of any processor may be characterised in terms of the MIPS/mm², MIPS/$, or MIPS/mW, for example.
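By way of illustration only, the three example ratios named above can be computed as in the following minimal sketch. The `Processor` record, the `price_performance` function and all numerical figures are hypothetical and are not taken from any real device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    """Hypothetical record of the quantities needed for the ratios."""
    name: str
    mips: float       # millions of instructions per second
    area_mm2: float   # silicon area in square millimetres
    price_usd: float  # unit price in dollars
    power_mw: float   # power consumption in milliwatts


def price_performance(p: Processor) -> dict:
    """Return MIPS/mm^2, MIPS/$ and MIPS/mW for one processor."""
    return {
        "mips_per_mm2": p.mips / p.area_mm2,
        "mips_per_usd": p.mips / p.price_usd,
        "mips_per_mw": p.mips / p.power_mw,
    }


# Illustrative figures only (assumed, not measured).
dsp = Processor("example-dsp", mips=400.0, area_mm2=20.0,
                price_usd=8.0, power_mw=200.0)
ratios = price_performance(dsp)
```

Such ratios compare raw instruction throughput only and say nothing about how much useful work each instruction performs.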
In practice however, not all instructions achieve the same amount of useful work, therefore “pure” MIPS ratings are not easily comparable. Thus, while a Digital Signal Processor (DSP) is well suited to solving the mathematically intensive processing near the wireless interface of a mobile phone, it is very inefficient at running the web browser that runs on the phone's screen. Effectively this means that processors can be more usefully classified in terms of “application available” price-performance.
Furthermore, an additional reduction in the effective performance can be caused by the inefficiency of the programming, i.e. software, tools that must be used to control and customise the processor to implement a particular application. The final level of performance that can be extracted from a processor for a particular application can thus be viewed as the level of usable or “achievable application available” price-performance.
In the semiconductor companies' drive to improve processor application available price-performance, a new class of processor, the multicore device, has been developed. Multicore devices are highly integrated processors that are built from a variety of elements (cores), each of which may be highly specialised, in order to provide the maximum level of useful price-performance for a particular aspect of an application that can be executed by the processor. Such devices may be “heterogeneous”, i.e. incorporating multiple, dissimilar cores, or “homogeneous”, i.e. incorporating multiple similar cores.
Most multicore devices may also be classified as System on Chip (SoC) devices, since the integration includes not only the multiple processing cores, but also the memory, IO and other system “cores” that are required to handle most (if not all) of the hardware requirements for any particular product. Although not all SoC devices have multiple processing cores, the terms multiple core and SoC are often interchanged. A good example of a multicore SoC can be found in many mobile phones, where one will find a single processor containing one or more DSPs to run the wireless interface, and a general purpose processor to run the user applications on the phone.
The emergence of multicore devices has been enabled by Moore's Law, which states that the number of transistors that can be fitted into any given area of silicon will double every 18 months due to improvements in the manufacturing process. Moore's Law therefore allows for more individual transistors to be fitted into any given area on the silicon die, making it technically and economically viable to manufacture ever more complex devices on a single piece of silicon. Equally, by reducing the size of the transistors, they are capable of being switched at ever higher speeds.
Historically, Moore's Law was used to manufacture a new generation of processors at smaller sizes which were faster or more cost effective in terms of silicon used, without any major changes to the underlying architecture (i.e. the improvements were improvements in the manufacturing process and the device's physical micro-architecture rather than in the device's logical macro-architecture).
Effectively, the trend towards multicore/SoC processors can be seen as a macro-architectural shift to higher levels of integration which first started with the introduction of IO (communications) functionality onto the silicon die itself; now the IO, the memory, and the functionality of multiple processing units, DSPs and co-processors can be integrated onto the same silicon die. These processors should reduce the manufacturing costs of end products by providing the lowest cost, highest performing processor for a particular class of application. Also, by integrating most of the system components onto a single processor, the part count can be reduced, therefore increasing reliability and lowering power consumption.
A key problem is how the use of the underlying hardware in such multicore devices can be optimised, in order to achieve the highest possible “application available” price-performance.
There are many ways in which processor and system designers may leverage parallelism within the application software (application level parallelism), and within the instruction stream (instruction level parallelism). The various manifestations differ in where the parallelism is managed and whether it is managed when the system is executing/at “run-time” (dynamic systems), or when the application software is being compiled/at compile time (static systems). In practice, the partition between dynamic and static systems and hardware intensive and software intensive solutions is not distinct and techniques from one discipline are often borrowed by the other.
At the level of the individual processing core, the concept of multiple issue processors, or machines which operate on many instructions from a single stream in parallel, is well established in the art. They come in two basic types: superscalar and Very Long Instruction Word (VLIW) processors. Superscalar processors issue varying numbers of instructions per clock cycle, identified either at run-time (dynamically scheduled) or at compile time (statically scheduled). VLIW processors issue a fixed number of instructions, forming a very long instruction word, as defined by the compiler. Typically, the programmer is completely unaware of this process as the programming model of the system is a standard, single processor abstraction.
Super-threading and Hyper-threading are both technologies which emulate multiple processors by multiplexing multiple threads of execution amongst multiple virtual processors. Typically, these virtual processors share certain resources which, statistically, would not be used by a single thread all of the time. Super and Hyper-threading architectures appear as multiple independent processors and therefore require a level of application parallelism to be present in order to work efficiently. Typically hardware limitations in the processor core limit the number of threads which may be supported to substantially less than 100.
Furthermore, several system-architectural options exist for the exploitation of the inherent parallelism in many applications. Multiple Instruction Multiple Data (MIMD) machines, where each processor executes its own instructions and operates on its own set of data whilst cooperating with its peers through some shared resource (for example memory and/or interconnect), have become popular due to their ability to address a wide variety of applications.
As performance demands increase, embedded systems are increasingly making use of multicore MIMD architectures, using multiple dissimilar or similar processing resources, to deliver the required level of silicon efficiency. Typically, these are a class of MIMD machine called centralised shared memory architectures, i.e. a single address space (or a proportion thereof) is shared amongst the multiple processing resources, although more application specific hybrid architectures are also commonly found.
Although each processing resource of a MIMD array may exploit Instruction Level Parallelism (ILP), MIMD machines may also take advantage of Thread Level Parallelism (TLP) to realise the potential performance of the underlying hardware. In contrast to ILP, which is identified at run-time (by specific hardware) or compile-time (by optimising compile tools), TLP is defined within high-level programming software at application design time.
Threading is a concept that has been used within the software community for many years, as a high level expression of parallelism. A thread defines an autonomous package of work containing an execution state, instruction stream and dataset, which, by definition, may execute concurrently with other threads. The complexity of the instruction stream is unimportant. A thread may describe anything from a simple transfer of data to a complex mathematical transform.
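Purely as an illustrative sketch, a thread in the above sense may be modelled as a package holding an instruction stream, a dataset and an execution state, with several such packages run concurrently. The `WorkPackage` name and its fields are assumptions made for this example; Python's standard `ThreadPoolExecutor` merely stands in for whatever mechanism actually executes the threads.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class WorkPackage:
    """Hypothetical model of a thread as an autonomous package of work."""
    instructions: Callable[[Any], Any]         # the instruction stream
    dataset: Any                               # the data it operates on
    state: dict = field(default_factory=dict)  # the execution state

    def run(self) -> Any:
        self.state["status"] = "running"
        result = self.instructions(self.dataset)
        self.state["status"] = "done"
        return result


packages = [
    # A simple transfer of data (here, a copy into a new list).
    WorkPackage(instructions=list, dataset=(1, 2, 3)),
    # A slightly more involved computation (sum of squares).
    WorkPackage(instructions=lambda xs: sum(x * x for x in xs),
                dataset=[1, 2, 3]),
]

# The packages are independent, so they may execute concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(WorkPackage.run, packages))
```

Both packages are handled identically regardless of the complexity of their instruction streams.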
Traditionally, operating systems have assisted in the provision of system management, including thread allocation functions, which enable an application to be run on a certain configuration of a multicore architecture without the software engineer requiring detailed understanding of the underlying device architecture. However, existing software techniques for thread management within a uni-core device cannot be readily adapted to multicore architectures in a consistent way. Solutions to date have been proprietary, requiring bespoke solutions on a design by design basis and have typically compromised performance and scalability.
Historically, in the case of heterogeneous multi-core systems (that is, systems having broadly dissimilar processing resources), many varying approaches have been employed to enable the disparate processing resources to work together. However, broadly these may be split into two categories, “proxy host” and “co-operative” (also known as “peer to peer”). In the former case, a designated general purpose host processor (which in a bus-based system is often referred to as a CPU) governs the system overall, brokering tasks across the system and synchronising access to resources such as memory and devices. Such system supervision is typically operated in an operating system kernel and competes for slices of time with the system application and the processing of asynchronous events on the host processor. In other words, this general purpose processor must act as a centralised proxy thread manager for all the processing resources on the multicore device, as well as act as a key application processor.
When used in this configuration, the general purpose processor must maintain, for each processing resource, queues of threads that are ready for execution, ordered according to a predefined scheduling policy, i.e. their priority (the dispatch or ready queues), as well as queues of threads awaiting some event, or the return of another thread's results, before they can themselves start to be executed (the pending and timing queues). These are in addition to other system overheads, such as processor configuration prior to thread execution.
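The queue structures described above may be sketched, by way of example only, as follows. The class and method names are hypothetical, a lower number is assumed to denote a higher priority, and timing queues are omitted for brevity; this is a toy model rather than any particular operating system's implementation.

```python
import heapq
from collections import defaultdict
from itertools import count


class ProxyThreadManager:
    """Toy model of the host-side ready and pending queues."""

    def __init__(self):
        # One ready queue per processing resource, ordered by priority.
        self._ready = defaultdict(list)
        # Threads blocked on an event, keyed by that event.
        self._pending = defaultdict(list)
        # Sequence number keeps first-in, first-out order within a priority.
        self._seq = count()

    def make_ready(self, resource, thread, priority):
        entry = (priority, next(self._seq), thread)
        heapq.heappush(self._ready[resource], entry)

    def wait_for(self, event, resource, thread, priority):
        self._pending[event].append((resource, thread, priority))

    def signal(self, event):
        # Move every thread waiting on the event to its ready queue.
        for resource, thread, priority in self._pending.pop(event, []):
            self.make_ready(resource, thread, priority)

    def dispatch(self, resource):
        # Return the highest priority ready thread, or None if idle.
        heap = self._ready[resource]
        return heapq.heappop(heap)[2] if heap else None


mgr = ProxyThreadManager()
mgr.make_ready("dsp0", "fft", priority=2)
mgr.make_ready("dsp0", "filter", priority=1)
mgr.wait_for("fft_done", "cpu", "display", priority=0)

first = mgr.dispatch("dsp0")  # "filter" is dispatched before "fft"
mgr.signal("fft_done")        # "display" becomes ready on "cpu"
woken = mgr.dispatch("cpu")
```

Every one of these operations consumes time on the host processor, which is the overhead discussed in the following paragraphs.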
Whenever the general purpose processor diverts its processing time from a thread it is currently executing to the administration of the system (including thread management), for example as a result of an interrupt issued due to the completion of a thread (and therefore the freeing up of the processing resource that has just completed that thread), the general purpose processor must make a context change.
A context change involves storing the current progress of the thread being halted into memory, fetching instructions relevant to the administration routines for the servicing of the other threads/processing resources, then carrying out those instructions, including any configuration requirements. A further context change must be carried out to return to the original, halted thread. These context changes are typically executed on receipt of an interrupt, and in embedded systems, these interrupts are often both frequent and asynchronous to the application code executing on the general purpose processor. Therefore, the system as a whole exhibits significant degradation of performance. Context switches also have a negative impact upon the effectiveness of host processor caches (the so-called “cold-cache” effect).
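The cost of these context changes may be estimated with a simple first-order model, sketched below under stated assumptions: each interrupt incurs two context changes (one away from the application thread and one back) plus the administration routine itself, and cache effects are ignored. The function name, its parameters and the figures used are hypothetical.

```python
def useful_fraction(interrupts_per_s: float,
                    switch_cost_s: float,
                    service_cost_s: float) -> float:
    """Estimate the fraction of host time left for application code.

    Each interrupt is charged two context changes plus the time spent
    in the administration routine. Cold-cache effects are not modelled.
    """
    overhead = interrupts_per_s * (2 * switch_cost_s + service_cost_s)
    return max(0.0, 1.0 - overhead)


# Illustrative figures only: 10,000 interrupts per second, 5 microseconds
# per context change and 20 microseconds per administration routine.
frac = useful_fraction(interrupts_per_s=10_000,
                       switch_cost_s=5e-6,
                       service_cost_s=20e-6)
```

With these assumed figures roughly 70% of the host processor's time remains available to the application, and the remainder shrinks further as the interrupt rate rises.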
In the case of a co-operative system, each processing resource runs a separate instance of an operating system, part of which enables inter-resource communications. Such an arrangement accordingly has a relatively rigid architectural partitioning, as a result of a specific routing of interrupts between peers. Although this type of system offers the primitives required to produce an application, the performance of the implementation still suffers from frequent context switches associated with operating system kernel activity.
In summary, current designs and methodologies for the realisation of system management in traditional architectures (general purpose processors, software executives etc.) are inappropriate for the system and thread management of complex heterogeneous multi-core architectures. Indeed, the general purpose processor is poorly optimised for this role at both the micro (instruction set) and the macro (caches, register file management) architectural level. Although the interconnect of a multicore processor provides a physical medium for interoperation between the separate processing resources, there is no system wide task management and communication layer shared amongst all the processing resources enabling a coherent approach to system management. In the worst case this may lead to a distinct problem associated with every possible communication channel between every processing resource, each of which has traditionally had to be solved separately in software on an ad-hoc basis.
Thus, there is a need for an efficient method of system management of these very complex multicore architectures. Software abstraction alone cannot provide the requisite level of performance of complex multicore architectures.