Recent trends in processor designs have favored chip multiprocessors, also referred to as multi-core processors. These include multiple cores all housed in the same package or silicon chip. These cores share some common resources, such as a second level (L2) cache or other caches included in the same package or chip.
The cores in the multi-core processors are typically complex cores. Complex cores are cores optimized around the requirements of application codes, typically to make the applications, such as scientific applications, games, business applications, etc., run as fast as possible. For example, the complex cores may have large pipelines and other features designed to improve application performance.
Typically, for applications to run, an operating system (OS) must first be loaded and initialized. Then, the applications may be loaded and run on the OS. Compared to applications, OS code typically does not run proportionately faster on a complex core. In fact, prior research has shown that OS code has not sped up nearly as much as many application programs have as processor designs have evolved over the past 15-20 years. This is due in part to the fact that OS code typically does not utilize all of the features of complex cores.
Many modern processors use techniques such as branch prediction, caching, out-of-order processing, multithreading, pipelining, and prefetching to improve the performance of applications. Unfortunately, OS's tend not to achieve the improved performance often achieved by conventional applications due to the inherent limitations of these techniques. Below is a brief description of these techniques followed by an explanation as to why OS code does not benefit as much from these techniques as more conventional applications do.
When considering the flow of instructions in a program, one can consider both the data flow and the control flow. Data flow refers to how data moves from the output of one operation to the inputs of the next. For example: (1) add A and B and put the result in C; (2) multiply C by 5 and put the result in D; (3) subtract A from D and put the result in E. In this example, the data comes in as A and B, on which an add is performed. The multiply must wait for the add to produce its result data, and the subtract must wait for the result of the multiply.
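By way of illustration only, the three-step data flow described above may be sketched as follows. The function name and the sample input values are hypothetical and are chosen solely for this example.

```python
def data_flow_example(a, b):
    """Evaluate the three-step example; each step consumes the prior result."""
    c = a + b   # (1) add A and B, put the result in C
    d = c * 5   # (2) cannot start until C is available
    e = d - a   # (3) cannot start until D is available
    return c, d, e


# Hypothetical inputs: A = 2, B = 3
c, d, e = data_flow_example(2, 3)
```

Because each step reads the value written by the step before it, the three operations form a chain that cannot be reordered or overlapped.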
Control flow refers to the order of the individual instructions themselves. The previous example's control flow was from instruction (1) to instruction (2) to instruction (3). An example of a more interesting control flow is: (1) if A is true, then do B; (2) do C. The control flow in this case checks the condition A and may or may not flow through B before flowing to C.
Control flow in applications is represented by the various conditional branch instructions, hereafter referred to as branches, which tell the processor to jump (or not) to a different place in the program. In the previous example, a conditional branch would have been used to skip over B if A were false. Branches occur frequently in most programs and can, on average, be found every fifth or sixth instruction of non-scientific code. The performance of any program depends on a processor's ability to resolve data dependencies, by calculating values that are needed to feed subsequent instructions, and to resolve control dependencies, by calculating conditions for branches.
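By way of illustration only, the "if A is true, then do B; do C" example may be sketched as follows. The function name is hypothetical, and the operations B and C are represented simply by recording their names.

```python
def control_flow_example(a):
    """Return the operations executed, in order, for a given condition A."""
    executed = []
    if a:                      # conditional branch: B is skipped when A is false
        executed.append("B")
    executed.append("C")       # C is reached on either path
    return executed
```

Until the condition A has been calculated, the processor does not know whether B belongs to the instruction stream, which is the control dependence discussed above.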
Branch prediction is the process of guessing the outcome of a conditional branch before the result of the condition is known. This allows a processor to skip over a control dependence and execute instructions that follow the branch while in parallel waiting to verify the prediction. For example, branch predictors use the past pattern of branches to predict the outcomes of future branches. Branch predictors look up the pattern in a table, referred to as the pattern history table. The pattern history table holds past outcomes that indicate what was done the last time a particular pattern was seen. Because the combination of the number, order, and outcome of branches in any given program is very large, branch predictors are limited by the amount of history information that can be maintained in the pattern history table. The larger the program and the more varied the control flow, the more the capacity of the pattern history table is stressed.
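By way of illustration only, a pattern history table may be sketched as follows. The sketch assumes a single global history register, a table indexed directly by that history, and one remembered outcome per entry. These are simplifying assumptions made for this example, not a description of any particular processor, and the class and function names are hypothetical.

```python
class PatternHistoryPredictor:
    """Predict a branch from what happened the last time this pattern occurred."""

    def __init__(self, history_bits):
        self.mask = (1 << history_bits) - 1
        self.history = 0                             # recent outcomes, one bit each
        self.table = [False] * (1 << history_bits)   # pattern history table

    def predict(self):
        return self.table[self.history]

    def update(self, taken):
        self.table[self.history] = taken             # remember the outcome for this pattern
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, outcomes):
    """Fraction of branches in `outcomes` that were predicted correctly."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# A branch that repeats taken, taken, not-taken
pattern = [True, True, False] * 100
with_two_bits = accuracy(PatternHistoryPredictor(2), pattern)
with_one_bit = accuracy(PatternHistoryPredictor(1), pattern)
```

In this sketch, two bits of history are enough to distinguish every position in the repeating pattern, whereas one bit is not, so the smaller table mispredicts far more often. This is a small-scale instance of the capacity limit described above.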
Some OS kernels are comparable in size to some of the largest applications. Also, OS code tends to have poor branch prediction behavior because it often stresses the capacity of the pattern history table. The OS itself provides services to the many processes running on top of it, and often either jumps between different tasks or has only short-lived tasks. This leads to a chaotic, difficult-to-predict control flow. Hence, use of branch predictors and a large pattern history table may result in minimal improvement in performance for an OS unless the pattern history table is made unrealistically large.
Caching uses smaller, faster storage to hold copies of memory locations that have been recently used. A standard analogy for caching describes the stack of file folders on a person's desk as a cache of what is in a set of filing cabinets in the next room. Quite often, the file that is needed is on the person's desk, and if it is not, the person goes to the filing cabinets to get the file, periodically returning some of the files to the filing cabinets so as to limit the number of files on the desk. Going further with this analogy, one can see that a larger cache of files takes longer to search.
Processor caches are fixed in size, and that size is chosen to maximize the likelihood of finding the desired file, while minimizing the search time of the cache. Overall, the goal is to minimize the average time to access a file.
Processors use caches to reduce the amount of time that a load takes, thus more quickly resolving control and data dependencies. OS's, because they must share the cache with regular programs and may go hundreds of thousands of cycles between occasions when they touch a piece of data, often have poor cache behavior. Thus, a large cache may have minimal impact on the performance of an OS unless the cache is made unrealistically large.
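By way of illustration only, the effect of sharing a cache with a regular program may be sketched as follows. The sketch assumes a least-recently-used replacement policy, a small OS working set of eight locations, and an application that streams through new addresses between OS invocations. All of these, along with the class and function names, are assumptions made for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used line."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        if address in self.lines:
            self.lines.move_to_end(address)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line
        self.lines[address] = None
        return False


def os_hit_rate(capacity, app_accesses_between_os_calls, rounds=50):
    """Hit rate seen by OS accesses when interleaved with application accesses."""
    cache = LRUCache(capacity)
    os_addresses = [("os", i) for i in range(8)]
    hits = total = 0
    next_app_address = 0
    for _ in range(rounds):
        for address in os_addresses:
            hits += cache.access(address)
            total += 1
        for _ in range(app_accesses_between_os_calls):
            cache.access(("app", next_app_address))
            next_app_address += 1
    return hits / total


frequent = os_hit_rate(capacity=64, app_accesses_between_os_calls=16)
occasional = os_hit_rate(capacity=64, app_accesses_between_os_calls=1000)
```

Under these assumptions, the OS data survives in the cache when the OS runs frequently, but is entirely evicted when many application accesses intervene, so every OS access in the second case misses.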
Prefetching is the process of issuing a request to the memory system before a particular address is read/written in order to bring that address into the cache. If done far enough in advance of the read or write, the effective latency of that read or write can be reduced. Prefetching comes in several forms, which can be classified as hardware prefetches or software prefetches. Hardware prefetches can be thought of as memory address predictors that use past requests to predict future requests. Software prefetches are generally inserted by the programmer or the compiler and use pre-calculated addresses in advance of the actual load or store request. In both cases, prefetching is used to prepare the cache for a future request.
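By way of illustration only, a hardware prefetch acting as a memory address predictor may be sketched as follows. The sketch assumes a simple constant-stride scheme, which is only one of several possible prediction schemes, and the function name is hypothetical.

```python
def predict_next_address(past_addresses):
    """Predict the next address from the last three requests, or return None.

    A prediction is made only when the two most recent strides agree.
    """
    if len(past_addresses) < 3:
        return None
    latest_stride = past_addresses[-1] - past_addresses[-2]
    earlier_stride = past_addresses[-2] - past_addresses[-3]
    if latest_stride != 0 and latest_stride == earlier_stride:
        return past_addresses[-1] + latest_stride
    return None
```

For a regular sequence of requests such as 100, 108, 116, this sketch predicts 124, which could be requested early. For an irregular sequence such as 40, 7, 93, no prediction is made.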
Using the file folder analogy, prefetching is analogous to the person grabbing a file from the filing cabinet because the person knows that he or she is going to need it later, even though he or she does not need it immediately. Prefetching in general does poorly with what is called pointer chasing code, which is code where the address of a memory request is dependent on the value brought in by a prior memory request. To continue with the analogy, this is analogous to looking in one file for the name of another file (i.e., a pointer) that has the actual information that is needed. Prefetching does poorly in this situation, because the next file cannot be looked up until the current file is opened. Pointer chasing code is found frequently in OS's, and thus the OS often cannot take advantage of prefetching.
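By way of illustration only, the difference between an access pattern whose addresses are known in advance and pointer chasing code may be sketched as follows. Memory is modeled as a dictionary mapping an address to a node, and the function names and addresses are hypothetical.

```python
def array_addresses(base, count, stride):
    """Every address is computable up front, so each could be prefetched early."""
    return [base + i * stride for i in range(count)]


def linked_list_addresses(memory, head):
    """Each address becomes known only after the previous node has been loaded."""
    addresses = []
    address = head
    while address is not None:
        addresses.append(address)
        address = memory[address]["next"]   # this load must complete first
    return addresses


# Hypothetical three-node list scattered through memory
memory = {40: {"next": 7}, 7: {"next": 93}, 93: {"next": None}}
visited = linked_list_addresses(memory, 40)
```

In the first function the full list of addresses is available before any load is issued. In the second, the loop cannot compute the next address until the current node's contents have arrived, which leaves nothing to request ahead of time.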
In short, several of the techniques that architects often use to improve the performance of applications either do not apply very well to typical OS code or, because OS code executes only occasionally, often cannot be taken advantage of. Furthermore, complex cores that include the features described above and are optimized to run applications faster generate more heat, consume more power, and use more space on the chip than less complex cores. The increased energy, thermal, and spatial costs may be justified by the increased speed of running applications. However, when OS code runs on a complex core, the increased energy and thermal costs of the complex core may not be offset, because the OS code may not run faster.