Process Fundamentals
FIG. 1 shows a silicon chip 100 containing a plurality of Central Processing Units (CPUs) 101 which access a plurality of memories to fetch instructions, which when executed may perform read and write data accesses to a plurality of memories 109, 112. A few examples of such chips include, but are not limited to, the STMicroelectronics STM32 Flash Microcontrollers based on the ARM® Cortex™ processor, the Nvidia® Tegra 4 device incorporating ARM® Cortex™, Nvidia® GPU graphics processors and Nvidia® i500 modem processors for use in cellular phones and tablet computers, the Intel® 4950HQ processor incorporating Intel® Core™ i7 processor for use in workstations and servers, and the Tilera® GX-8009 processor for use in networking and multimedia equipment. A few examples of memories 109, 112 include, but are not limited to, non-volatile storage such as NOR flash memory, NAND flash memory, Ferro-electric memory (FRAM) and Magneto-resistive memory (MRAM), and/or volatile storage such as Static RAM (SRAM) and Dynamic RAM (DRAM). Reference herein to “memory” and “memories” is intended as a general reference to anything capable of permanently or temporarily storing instructions and/or data and is intended to encompass accelerators, caches and buffers of the computer system.
Each CPU 101 contains an I-Port 102 to fetch instructions from memory, and a D-Port 103 to perform read and write data accesses to memory when such instructions are executed. The CPU 101 is directly connected to the I-Port 102, and to the D-Port 103, which are respectively connected to the I-Buffer 104 and D-Buffer 105, and the CPU along with the ports 101, 102, 103, 104, 105 operate at a much higher frequency than other on-chip components 106, 107, 108, 109, 110, and off-chip devices 112.
On-chip memory 109 is typically 10 to 50 times slower to access than the I-Buffer 104 and D-Buffer 105 as it is connected via the on-chip bus 108 and typically operates at a lower frequency. Due to the number of electrical circuits involved in accessing the on-chip memory 109, accesses to it consume more energy than to the I-Buffer 104 and D-Buffer 105.
External memory 112 runs at a lower frequency than all the on-chip components other than the memory controller 110. Accesses to it are typically 50 to 500 times slower than to the I-Buffer 104 and D-Buffer 105, as it is connected via the on-chip bus 108, memory controller 110 and external memory interconnect 111. Due to the number of electrical circuits involved in accessing the off-chip memory 112, and due to voltages and valency of the external memory interconnect 111, external memory accesses also consume more energy than accesses to the I-Buffer 104, D-Buffer 105 and on-chip memory 109.
The I-Buffer 104 is smaller than both the on-chip memory 109 and the off-chip memory 112 and stores a copy of instructions recently fetched by the CPU 101 via its I-Port 102 in case they are needed for execution in the future. This is known as temporal locality. When the I-Port 102 requests an instruction from a particular address, the I-Buffer 104 checks whether the required instruction exists within the I-Buffer and therefore can be provided immediately. This is known as an I-Hit. If the instruction does not exist within the I-Buffer 104, it has to be fetched from memory 109, 112 into the I-Buffer. This is known as an I-Miss.
The time taken for an I-Hit is less than an I-Miss and requires less circuitry to be activated, therefore a lower I-Miss rate will reduce the time and energy used when fetching instructions. Furthermore, reducing the I-Miss rate will reduce traffic on the on-chip bus 108, and to the underlying memories 109,112, providing more opportunity for operations to be enacted by the D-Buffer 105 and associated mechanisms, thus reducing time spent on D-Miss operations.
In an attempt to increase the I-Hit rate, accesses made by the I-Port 102 are typically a super-set of the actual accesses required. Depending on mechanisms within the CPU (such as instruction pre-fetching and branch prediction), the I-Port may pre-fetch additional instructions not directly related to the current program flow in an attempt to place instructions in the I-Buffer before they are needed.
The structure and functionality of the I-Buffer 104 varies between different silicon chip designs, according to trade-offs between design complexity, silicon area required for the circuitry, the relative speed of the CPU 101 and the speeds and sizes of the on-chip memories 109, and the speed, size and nature of the off-chip memories 112. The I-Buffer's 104 characteristics include numOfStrides (the number of distinct sequentially addressed buffers), and the sizeOfStride (the number of sequentially addressed instructions in each stride). Critical Word First mechanisms may be implemented, such that on an I-Miss the address causing the I-Miss is accessed before other addresses in the stride.
Typical structures for the I-Buffer 104 range from simple buffers with numOfStrides=1 to hold a plurality of adjacent instructions, through to sophisticated multi-way and multi-level Instruction Caching mechanisms.
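The I-Hit and I-Miss behaviour described above can be illustrated with a minimal software model, offered only as a sketch. The type IBuffer, the functions ibuffer_init and ibuffer_fetch, the parameter values and the round-robin refill policy are all assumptions made for illustration; they are not taken from any particular chip design, and instructions are identified by a simple index rather than a byte address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameters; real I-Buffers differ between chip designs. */
#define NUM_OF_STRIDES 2u /* numOfStrides: distinct sequentially addressed buffers */
#define SIZE_OF_STRIDE 8u /* sizeOfStride: instructions held in each stride */

typedef struct {
    uint32_t base[NUM_OF_STRIDES];  /* first instruction index held by each stride */
    bool     valid[NUM_OF_STRIDES]; /* false until the stride has been filled */
    unsigned nextVictim;            /* round-robin refill pointer (an assumption) */
    unsigned hits;                  /* number of I-Hits observed */
    unsigned misses;                /* number of I-Misses observed */
} IBuffer;

/* Empty the model so that every stride is invalid and the counters are zero. */
void ibuffer_init(IBuffer *b)
{
    assert(b != NULL);
    for (unsigned i = 0; i < NUM_OF_STRIDES; ++i) {
        b->base[i] = 0;
        b->valid[i] = false;
    }
    b->nextVictim = 0;
    b->hits = 0;
    b->misses = 0;
}

/* Fetch one instruction by index; returns true on an I-Hit, false on an I-Miss. */
bool ibuffer_fetch(IBuffer *b, uint32_t index)
{
    assert(b != NULL);
    /* Index of the first instruction in the stride that contains 'index'. */
    uint32_t strideBase = index - (index % SIZE_OF_STRIDE);

    for (unsigned i = 0; i < NUM_OF_STRIDES; ++i) {
        if (b->valid[i] && b->base[i] == strideBase) {
            b->hits++;
            return true; /* I-Hit: the instruction is already buffered */
        }
    }

    /* I-Miss: refill the next stride in round-robin order from memory. */
    b->base[b->nextVictim] = strideBase;
    b->valid[b->nextVictim] = true;
    b->nextVictim = (b->nextVictim + 1u) % NUM_OF_STRIDES;
    b->misses++;
    return false;
}
```

With these assumed parameters, fetching instruction indices 0 to 63 in ascending order gives 8 I-Misses (one per stride) and 56 I-Hits, whereas repeatedly cycling through three widely separated indices such as 0, 100 and 200 misses on every fetch, because two strides cannot hold three distinct regions.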
The D-Port 103 and D-Buffer 105 provide features analogous to the I-Port 102 and I-Buffer 104 for data read/write operations rather than instruction fetch. The D-Buffer 105 is smaller than both the on-chip memory 109 and the off-chip memory 112 and stores a copy of data recently read and written by the CPU 101 via its D-Port 103. This is known as temporal locality. When the D-Port 103 requests a read/write at a particular address from the D-Buffer 105, a check is made whether the required data exists within the D-Buffer and thus can be accessed immediately. This is known as a D-Hit. If the data does not exist within the D-Buffer 105, it has to be accessed in the underlying memory 109, 112. This is known as a D-Miss.
The time taken for a D-Hit is less than a D-Miss and requires less circuitry to be activated, therefore the higher the D-Miss rate the more time and energy will be expended accessing memory. Furthermore, reducing the D-Miss rate will reduce traffic on the on-chip bus 108, and to the underlying memories 109, 112 providing more opportunity for operations to be enacted by the I-Buffer 104 and associated mechanisms and thus for I-Miss operations to complete more quickly.
Spatial locality refers to the likelihood of referencing an address increasing if an address nearby it was previously accessed. Exploiting spatial locality for instruction fetch can improve the I-Hit/I-Miss ratio. Exploiting spatial locality for data accesses can improve the D-Hit/D-Miss ratio.
Mechanisms within the I-Buffer 104 can exploit spatial locality and thus provide high-speed access to instructions buffered within it. Furthermore, properties of the on-chip memory 109 and off-chip memory 112 provide further opportunities to exploit spatial locality when refilling the I-Buffer 104, based upon properties of the memories and the interface, and doing so will improve performance.
As an example, if off-chip memory comprises a Micron M25P10-A serial NOR flash device, it provides 1 Mbit (128 KBytes) of storage, arranged as 4 sectors of 32768 bytes, each sector comprising 128 pages of 256 bytes (512 pages in total). Accessing an arbitrary 24-bit address requires 32 bits of serial command data to be transferred to the device. Accessing the next ascending address requires no additional commands and thus is 32 times faster than accessing a random address.
As a further example, off-chip memory may comprise a 64 MByte DDR DRAM formed from 4 independent 16 MByte memory banks, each arranged as 1024 columns of 8192 byte rows. Accessing a random address involves a transaction to “open” a row (requiring a number of cycles known as row-to-column delay), followed by a read operation using a column address and requiring multiple cycles to read the data. Accessing other addresses within an already open row avoids the row-to-column cycles, whereas accessing addresses in a different row requires another command and multiple cycles to close the row, followed by additional cycles to open the new row. Multiple memory banks can be “opened” at once, so arranging memory accesses to avoid conflicts in the same bank and row can improve performance considerably by benefiting from spatial locality within rows and within separate memory banks. Burst mode accesses, whereby successive columns in an open row are accessed sequentially, can also improve performance, especially when used for I-Buffer/D-Buffer fill operations.
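The effect of open rows on DRAM access cost can be sketched with a small model. Everything here other than the 4 banks and the 8192 byte row size is an assumption made for illustration: the cycle counts are placeholders rather than figures from any datasheet, the mapping of addresses to banks (consecutive rows interleaved across banks) is only one of several possible schemes, and the names Dram, dram_init and dram_access are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BANKS 4u    /* independent banks, as in the example above */
#define ROW_BYTES 8192u /* bytes held by one open row */

/* Placeholder cycle costs; real values depend on the device and its clock. */
#define CYCLES_COLUMN_READ 4u /* read using a column address in an open row */
#define CYCLES_ROW_OPEN   10u /* row-to-column delay when opening a row */
#define CYCLES_ROW_CLOSE  10u /* closing a previously open row */

typedef struct {
    bool     open[NUM_BANKS];    /* whether each bank currently has an open row */
    uint32_t openRow[NUM_BANKS]; /* which row is open in each bank */
} Dram;

/* Start with every bank closed. */
void dram_init(Dram *d)
{
    assert(d != NULL);
    for (unsigned i = 0; i < NUM_BANKS; ++i) {
        d->open[i] = false;
        d->openRow[i] = 0;
    }
}

/* Return the modelled cycle cost of reading 'addr' and update the open rows. */
unsigned dram_access(Dram *d, uint32_t addr)
{
    assert(d != NULL);
    /* Assumed mapping: consecutive rows are interleaved across the banks. */
    uint32_t rowIndex = addr / ROW_BYTES;
    uint32_t bank = rowIndex % NUM_BANKS;
    uint32_t row = rowIndex / NUM_BANKS;
    unsigned cycles = CYCLES_COLUMN_READ;

    if (d->open[bank] && d->openRow[bank] == row) {
        return cycles; /* the row is already open: column access only */
    }
    if (d->open[bank]) {
        cycles += CYCLES_ROW_CLOSE; /* a different row is open and must be closed */
    }
    cycles += CYCLES_ROW_OPEN;
    d->open[bank] = true;
    d->openRow[bank] = row;
    return cycles;
}
```

Under these assumptions a second access to the same row costs only the column read, an access to a different bank pays to open a row but leaves the first bank's row open, and an access to a different row in the same bank pays both the close and the open cost.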
Software Development Tools Fundamentals
The target computer system 911 shown in FIG. 2 contains a plurality of processors 100 (equivalent to the processors in FIG. 1), off-chip memory 112 (equivalent to the memory in FIG. 1), and hardware interfaces 920 for connection to off-chip peripherals such as switches, motors and sensors. A Software Development Kit (SDK) consists of one or more software development tools which operate on a host computer 900 containing a plurality of processors 904 providing computer resources for running the SDK; RAM 901 to hold the programs and data processed by the SDK; file storage 902 such as, but not limited to, disc drives including magnetic discs, optical discs, tape drives, solid state drives and remote storage using network protocols (e.g. NFS, SMB or 9P clients); a display 903; and one or more input peripherals such as, but not limited to, a keyboard 905 for data/instruction entry. The host computer 900 is connected to a debug adaptor 909 via an interface 908 such as USB or Ethernet. The debug adaptor 909 is connected to the embedded system 911 via a debug link 910, typically JTAG or a NEXUS (IEEE-ISTO 5001) port.
With reference to FIG. 18, which illustrates code generation tool flow 600, computer programs are written as human-readable source code files 601, 604, using portable high-level languages (e.g. C or C++), with specialized low-level or performance critical functions written directly in assembly language specifically for the target processor on which they will execute. Each source code file contains a plurality of function definitions (each function consisting of a sequence of instructions intended to perform a specific effect), and a plurality of data definitions (each data item consisting of an area of memory containing specific values, which are read and written by the functions as they are executed). Code generation tools build executable files 610 from the source code files 601, 604, well-known examples being the GNU Compiler Collection (GCC) used with GNU Binutils, and the CLANG compiler front-end used with the LLVM Compiler Infrastructure. A compiler tool 602 translates a portable high-level language source code file 601 into an intermediate representation and applies some general high-level optimizations, then outputs an assembly language source file 603 specific to a particular processor (or abstract processor). An assembler tool 605 translates human-readable assembly source code 603, 604 into a non-human readable form known as an object file 606. A librarian tool 607 allows multiple object files 606 to be combined into a single library file 608. A linker tool 609 combines multiple object files 606 and library files 608 into a single executable file 610. The linker stage 609 is the only point which has visibility of the whole program. Using library files makes the programmer's job simpler, as there are fewer individual files to specify, and is also a popular method of distributing software components (e.g. operating systems, networking protocol stacks, device drivers etc.).
In existing systems, compiler 602 optimizations are applied in the compiler front-end (which translates high level languages into an intermediate representation), middle-end (which operates entirely on the intermediate representation), and back-end (which generates target-specific instructions from the intermediate representation and makes limited target-specific instruction set optimizations such as peepholing and scheduling). The optimizations are made within an individual source file, within the scope of individual functions and their associated data, and potentially across the scope of many functions and data items within a given compilation unit.
The method by which a computer program is structured into a number of separate source code files varies for a number of different reasons including the programmer's design choices, the organisational structure of multi-person development teams, and re-use of existing software components. Traditional compiler technologies which process source files one at a time (e.g. the GNU GCC compiler) are unable to perform many of these optimizations between functions whose body is not available at the point of compiling the caller function. This is best illustrated in FIG. 3: functionA (in file x.c) cannot optimize its relationship with any other functions (as they are all defined in separate files), and functionB and functionD (in file y.c) cannot optimize their relationship with functionA or functionC, as these are to be found in file x.c and file z.c respectively and are thus not visible when file y.c is compiled. Furthermore, optimizations which take account of interactions with hardware structures such as the underlying memory system (102, 103, 104, 105, 109, 110, 111, 112) cannot be applied without whole-program scope.
Real-Time Behaviour
Real-time behaviour is an important characteristic in embedded system designs and is not well addressed by existing program generation techniques. BCET is a metric for Best Case Execution Time. WCET is a metric for Worst Case Execution Time. ACET is a metric for Average Case Execution Time.
Consider the instruction sequences shown in FIG. 4. The flows between instructions 300 to 309, 311 to 320, 322 to 329, 331 to 340, 342 to 343 and 344 to 345 are sequential and therefore have strong spatial locality. The time and energy taken to fetch these instructions from the I-Buffer should be highly deterministic, so BCET and WCET for these sequences should be similar (other than some minor variation based on time to fetch the first instruction in each sequence into the I-Buffer). However, the flows between instructions 310 to 322, 330 to 342, 341 to 344 and 346 to 311 are non-sequential and therefore have weak spatial locality. The time and energy taken to fetch these instructions from the I-Buffer will be unpredictable. Furthermore, the ‘function call’ instruction 310, ‘then return from function call’ instructions 321 and 346, the ‘conditional branch’ instruction 330 and the ‘unconditional branch’ instruction 341 will cause the I-Port 102 to fetch from addresses which do not necessarily follow the required instruction address sequence. This can result in unnecessary instruction fetches from other addresses into the I-Buffer 104, wasting time and energy and polluting its contents with instructions which are not required.
Thus BCET and WCET for the above instruction sequences are highly variable and increase ACET such that the overall system performance (such as CPU frequency, or bandwidth available through the memory system) must be increased over what is required for BCET to provide sufficient headroom for WCET. This further increases design complexity, and energy consumption of the system over what might actually be required.
In U.S. Pat. No. 5,212,794 a code generation technique is described. The code generation technique is implemented within the GCC compiler and the LLVM compiler and requires profiling feedback generated by execution of an instrumented program. U.S. Pat. No. 6,839,895 describes another code generation technique similarly requiring profiling information.
An approach for improving virtual memory mapping of a program is described in U.S. Pat. No. 6,292,934. The approach involves disassembling a program to identify its basic blocks (BB) and generating an instrumented version of the program which, when processed, reveals the frequency by which each block is executed.
Existing code generation techniques, such as those described above, require an instrumented executable to be created, executed and dynamic feedback produced (as shown in FIG. 19) and require re-compilation from source code. They use statistical information in order to optimize BCET or WCET for a specific execution path. Traditionally, code generation tools (such as the compiler, and even the GCC Whole Program Optimization (WHOPR) and Link Time Optimization (LTO) whole program generation technologies) are only aware of the CPU and so are unable to make the correct trade-offs to generate an optimal program for a given embedded processing system. Existing techniques also ignore the effects that code generation has on the target machine: attempting to generate optimal BCET for a given function may disturb the I-Buffer/D-Buffer state sufficiently that the whole program's BCET is inferior.
Compiler Feedback
As shown in FIG. 19, existing software tools (such as those described in EP 0459192 and U.S. Pat. No. 6,317,874) may be used such that executing 810 an instrumented version of the executable file 809 generates profile data 811 describing its dynamic behaviour, which is processed 812 into a compiler feedback file 814 for use in building a revised version of the executable whose optimizations are tuned to the dynamic behaviours observed.
This approach has many problems. Firstly, profiler tools 812 are often unavailable, or are unsuitable for use on embedded processor systems, as they consume large amounts of memory in the target system, and alter the real-time behaviour such that the program may fail to execute correctly (these problems are known as temporal bugs). Secondly, this method requires a representative set of inputs to stimulate “typical” behaviour during the “training run”. Thirdly, the information produced by the profiler 814, 815 explains which components consumed resources (such as time, or memory bandwidth), but not why this was so. The relationship between which and why is complex, and is frequently beyond the capabilities of even the most skilled programmers, and of existing software tools. This conflicts with the well known Pareto Principle which states that 80% of effects arise from 20% of the causes. For example, a function's execution may result in a large number of cache misses, but this could be because a previous function has disrupted the cache rather than the function itself not being cache-optimal. Correspondingly, programmers and existing optimization techniques move bottlenecks from one area to another in a complex and uncontrollable manner, which is analogous to “balloon squeezing” (attempting to make a balloon smaller by squeezing it does not work: it just forces air from one part to another, and squeezing too much may result in it bursting). This wastes programmers' time, and can result in late, inefficient and even bug-ridden designs. Fourthly, the information generated is specific to a particular set of dynamic behaviours observed which may not exhibit typical behaviour (either due to the presence of the profiler altering its behaviour or due to the stimulus applied to the program).
Conventional optimizations (such as those described in Pettis' and Hansen's article “Profile Guided Code Positioning”, Proceedings of ACM SIGPLAN 1990 conference) are used in the compiler to lay out the Basic Blocks (BBs) and control flow according to estimated call frequencies, so as to arrange the predicted normal flow in a straight-line sequence. “Procedure Placement using Temporal Ordering” by N. Gloy, Proceedings of Micro-30 conference 1997, describes a mechanism to order BBs to make use of a target-specific I-Buffer and to avoid conflict between BB sequences in order to reduce I-Miss. Research such as “Cache-conscious data placement” (Calder et al., published in Proceedings of ASPLOS-VIII, 1998) attempts to reorder global data for better cache efficiency. However, all these require dynamic feedback, and still only consider BCET behaviour.
Existing static feedback techniques (such as those covered by U.S. Pat. No. 5,655,122) operate on the basis of branch probabilities and frequencies rather than cost. As they operate prior to target code generation and ignore target-specific effects, they yield poor results, and they are also unable to cope with Real-Time Operating System (RTOS) related branches and context switches because they only consider traditional branch instructions. Other static techniques (U.S. Pat. No. 7,275,242, GB 2463942) examine a linked executable and generate information to feed back to the compiler and linker such that recompilation and re-linking should produce a more optimal program, whereas it would be more desirable to apply significant whole-program scope and target-specific optimizations when actually creating the original executable.
Unlike instruction fetch operations, the data read/write accesses performed by the program are typically directly related to the data structures which the program accesses. FIG. 10 shows a typical data structure definition, written in the C programming language for a machine on which “char” occupies 1 byte, “unsigned int” occupies 4 bytes, and each field within the structure must be aligned to a multiple of its size. exampleStruct defines a data structure comprising a 1-byte value (fooMask), an array of 65536 bytes (arrayOfValues), and a 1-byte unsigned value (barMask). It can be seen that the spatial locality between individual ascending subscripts of arrayOfValues is strong, the spatial locality between fooMask and the lower subscripts of arrayOfValues is moderate (but not high due to the alignment padding at bytes 1 . . . 3), but the spatial locality between fooMask and barMask is low. If the program contains instruction sequences which more frequently access both fooMask and barMask than access adjacent fields of arrayOfValues, the D-Hit rate could be significantly improved if the spatial locality between fooMask and barMask was increased, i.e. by altering the structure definition to be as shown in FIG. 11.
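The layout effect described above can be checked with a short C sketch. FIG. 10 and FIG. 11 are not reproduced here, so the definitions below are a reconstruction from the description and rest on assumptions: arrayOfValues is taken to be an array of 16384 four-byte elements (65536 bytes in total), fixed-width types stand in for “char” and “unsigned int”, and the names exampleStructReordered, mask_distance_original and mask_distance_reordered are hypothetical.

```c
#include <assert.h> /* static_assert (C11) */
#include <stddef.h> /* offsetof, size_t */
#include <stdint.h>

#define NUM_VALUES 16384u /* assumption: 16384 x 4 bytes = 65536 bytes */

/* Reconstruction of the original layout: the two masks are far apart. */
struct exampleStruct {
    uint8_t  fooMask;                   /* byte 0, then padding at bytes 1..3 */
    uint32_t arrayOfValues[NUM_VALUES]; /* 4-byte aligned array */
    uint8_t  barMask;                   /* placed after the whole array */
};

/* Reconstruction of the altered layout: the two masks are adjacent. */
struct exampleStructReordered {
    uint8_t  fooMask;
    uint8_t  barMask;
    uint32_t arrayOfValues[NUM_VALUES];
};

/* The array starts at byte 4 because of the alignment padding after fooMask. */
static_assert(offsetof(struct exampleStruct, arrayOfValues) == 4,
              "expected 3 bytes of padding after fooMask");

/* Distance in bytes between fooMask and barMask in the original layout. */
size_t mask_distance_original(void)
{
    return offsetof(struct exampleStruct, barMask)
         - offsetof(struct exampleStruct, fooMask);
}

/* Distance in bytes between fooMask and barMask in the reordered layout. */
size_t mask_distance_reordered(void)
{
    return offsetof(struct exampleStructReordered, barMask)
         - offsetof(struct exampleStructReordered, fooMask);
}
```

On a machine matching the stated assumptions, the two masks are 65540 bytes apart in the first layout and 1 byte apart in the second, so in the second layout a single D-Buffer fill is far more likely to bring both into the buffer together.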
Research such as “Cache-conscious structure definition” (Chilimbi et al., SIGPLAN Conference on Programming Language Design and Implementation, 1999) requires dynamic feedback. It also ignores overall cache capability and considers only cache-line length when reordering fields. As alignment and cache sets are not considered, rearranging fields for simple spatial locality may actually increase the D-Miss rate.
Tools such as PAHOLE take an executable file 610 as input, and generate textual information which a programmer can review in order to potentially modify the program's source code files 601 to improve the memory layout of data structures. However, this approach is often impractical: developer productivity is already low and adding more manual modifications can only make it worse. Furthermore, data structures used in source code files 601, 604 and previously built library files 608 could become incompatible unless everything is re-compiled. Thus the application of such manually applied techniques is inherently time-consuming and risky.
Relative placement of global data can impact D-Miss ratio, and the cost of D-Miss operations (for example, performing a cache writeback of a D-Buffer stride where only part of the stride has changed is a worthless but necessary operation). Languages such as C support the notion of “const” data to denote the data is only read but not written, and such data is placed in a separate memory area to writable data. However, many global data items are written infrequently compared to others, and a placement where DATA (frequentlyWritten) and DATA (infrequentlyWritten) have strong spatial locality can perform unnecessary writebacks, damaging D-Hit ratio, and increasing the latency of D-Miss operations.
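The placement issue described above can be sketched in C by comparing two hypothetical layouts. The 32 byte stride size, the structure names mixedPlacement and separatedPlacement and the helper share_stride are assumptions made for illustration only; the member names follow the DATA (frequentlyWritten) and DATA (infrequentlyWritten) items mentioned above. Alignment is forced with the C11 alignas specifier, which is one possible mechanism among several.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignas (C11) */
#include <stdbool.h>
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>

#define STRIDE_BYTES 32u /* assumed D-Buffer stride size */

/* Both items fall in the same stride, so a writeback caused by
 * frequentlyWritten also writes back the unchanged neighbour. */
struct mixedPlacement {
    alignas(STRIDE_BYTES) uint32_t frequentlyWritten;
    uint32_t infrequentlyWritten;
};

/* Each item starts its own stride, so the two never share a writeback. */
struct separatedPlacement {
    alignas(STRIDE_BYTES) uint32_t frequentlyWritten;
    alignas(STRIDE_BYTES) uint32_t infrequentlyWritten;
};

/* The second member of the separated layout begins exactly one stride in. */
static_assert(offsetof(struct separatedPlacement, infrequentlyWritten)
                  == STRIDE_BYTES,
              "members should be one stride apart");

/* True when two byte offsets within a stride-aligned object share a stride. */
bool share_stride(size_t offsetA, size_t offsetB)
{
    return (offsetA / STRIDE_BYTES) == (offsetB / STRIDE_BYTES);
}
```

The separated layout trades memory for isolation: under these assumptions it occupies two full strides instead of one, so whether the saving in writeback traffic justifies the extra space depends on how often each item is actually written.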
Existing Whole-Program Compilers
Seminal whole program compilation technologies (such as those described in EP 0464525, EP 0428084) have significant compatibility and scalability issues and have failed to become mainstream. Newer compiler technologies, such as GNU GCC's LTO and WHOPR, and the CLANG/LLVM compiler system, intend to be more practical and can support inlining and inter-procedural optimizations irrespective of the source code's file structure. Rather than each compiler invocation outputting target-machine specific assembly code, the compiler's intermediate representation is output and the linker performs optimization across the whole program's intermediate representation before generating target specific instructions. This approach has a number of issues. Tools are immature, operate slowly and consume large amounts of processor cycles, memory and disc space when building the program, limiting the opportunities for practical usage. The optimizations are still limited to source files compiled to intermediate file format and do not apply to normal objects/libraries holding target instructions (such as the libraries typically used to integrate with an operating system or middleware such as device drivers etc.). Exposing the whole program to the compiler optimizer can also produce catastrophic program size and performance regressions due to overzealous optimizations and the compiler's inaccurate target cost models (resulting in programs which are actually larger and slower than programs which are compiled a source file at a time in the conventional manner).
Existing Post-Executable Optimisers
Research into post-executable optimizers has attempted to find a practical alternative to whole-program compilation, but has failed to provide acceptable solutions.
Tools such as AiPop (EP 1497722, US 2005/0235268) input and output assembly language source files, and do not fit into a conventional code generation flow of compiler, assembler, linker, especially when externally supplied library files are required. Research tools such as Alto (R. Muth et al., “alto: A Link-Time Optimizer for the Compaq Alpha”, Software Practice and Experience, 2001), Spike (R. S. Cohn et al., “Optimizing Alpha Executables on Windows NT with Spike”, Digital Technical Journal, 1997) and Diabolo, and U.S. Pat. No. 5,966,539, require a first executable file to be generated by a linker tool 609, try to regenerate information discarded in the compilation process 602, and then generate a new optimized executable. Generating two executables (the first, and the optimized second) slows the build process and negatively impacts developer productivity. Run-time feedback from profiling tools is also required; often this is unavailable, or may be incorrect. Furthermore, post-executable optimizers often fail to recover sufficient information from the source code (such as the C language's volatile keyword) and thus can generate incorrectly optimized executables yielding bogus semantics compared to the first executable. Analysis of the compiler's assembly language output is used with pattern matching to identify call sequences (e.g. switch statements) and so optimizations are very sensitive to current compiler output. Features such as dynamic linking (whereby parts of the program's library files are bound to the executable at the point at which the operating system loads and starts to execute it) are often unsupported, even though such facilities are mandatory for many environments.