Field of the Invention
The present invention relates in general to the process of executing load instructions to load information from memory in a microprocessor, and more particularly to a system and method of speculative parallel execution of cache line unaligned load instructions to load data that crosses a cache line boundary.
Description of the Related Art
Computer programs include instructions that perform the functions of the program, including load instructions that read data from memory. A typical computer system includes a microprocessor for executing the instructions and an external system memory, coupled to the microprocessor, for storing portions of the computer program and applicable data and information. Loading data from the system memory consumes valuable processing time, so the microprocessor typically includes a smaller and significantly faster cache memory from which data is loaded for processing. The cache memory is typically incorporated within the microprocessor for faster access. The cache memory may be externally located, but if so is usually connected via a separate and/or dedicated cache bus to achieve higher performance. Data may be copied into the cache memory one block at a time, and the microprocessor operates faster and more efficiently when operating from the cache memory rather than from the larger and slower external system memory. The cache memory is organized as a sequential series of cache lines, in which each cache line typically has a predetermined length. A common cache line size, for example, is 64 bytes, although alternative cache line sizes are contemplated.
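By way of illustration only, the organization of the cache memory into fixed-length cache lines may be sketched as follows. The sketch assumes that the cache line size is a power of two, so that an address can be separated into a line base and an offset within the line by masking. The function names is_power_of_two, line_base, and line_offset are hypothetical and are not taken from any particular microprocessor.

```c
#include <assert.h>
#include <stdint.h>

/* True when line_size is a nonzero power of two (assumed so that
 * masking can be used to separate the line base from the offset). */
static int is_power_of_two(uint64_t line_size) {
    return line_size != 0 && (line_size & (line_size - 1)) == 0;
}

/* Starting address of the cache line that contains addr. */
static uint64_t line_base(uint64_t addr, uint64_t line_size) {
    assert(is_power_of_two(line_size));
    return addr & ~(line_size - 1);
}

/* Byte position of addr inside its cache line. */
static uint64_t line_offset(uint64_t addr, uint64_t line_size) {
    assert(is_power_of_two(line_size));
    return addr & (line_size - 1);
}
```

With the 64-byte example size, address 0x1234 falls in the cache line beginning at 0x1200, at offset 0x34 within that line.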
The computer program may repetitively execute one or more load instructions, such as in a loop or the like, to load a specified amount of data from a particular memory location in the cache memory. Each load instruction may include a load address and a data length. The load address specified in the software program, however, is not necessarily the same as the physical address used by the microprocessor to access the cache memory. Modern microprocessors, such as those based on the x86 instruction set architecture, perform address translation, including segmentation and paging and the like, in which the load address is transformed into an entirely different physical address for accessing the cache memory. Furthermore, a series of load operations may be sequentially executed to retrieve a larger block of data, in which one or more of the load instructions do not align with the cache line boundaries. As a result, the memory read operation may attempt to load data that crosses a cache line boundary, meaning that the specified data starts on one cache line and ends on the next cache line. Since the target data occupies more than one cache line, this type of memory read operation is known as a cache line unaligned load. A special method is usually required to handle cache line unaligned load operations because the data is not retrievable using a single normal load request. Modern microprocessors typically use a cache structure in which only one cache line is accessible per load request, so a cache line unaligned load operation must be handled in a different manner, which negatively impacts performance.
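The boundary-crossing condition described above may be illustrated with a minimal sketch. It assumes the 64-byte example cache line size, a data length no larger than one cache line, and an address that is already the one used to access the cache memory. The structure load_split and the function split_load are hypothetical names introduced only for this illustration, and the sketch does not represent how any particular load pipeline performs the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u /* example size; other sizes are possible */

/* Hypothetical record describing how one load maps onto cache lines. */
struct load_split {
    bool     crosses;     /* true if the data spans two cache lines  */
    uint64_t first_addr;  /* start of the bytes in the first line    */
    uint32_t first_len;   /* number of bytes taken from the first    */
    uint64_t second_addr; /* start of the next line (0 if unused)    */
    uint32_t second_len;  /* number of bytes taken from the next     */
};

/* Decide whether a load of len bytes at addr crosses a cache line
 * boundary and, if so, how many bytes fall on each side of it. */
static struct load_split split_load(uint64_t addr, uint32_t len) {
    assert(len > 0 && len <= CACHE_LINE_SIZE);

    uint32_t offset = (uint32_t)(addr & (CACHE_LINE_SIZE - 1u));
    uint32_t room   = CACHE_LINE_SIZE - offset; /* bytes left in line */

    struct load_split s;
    s.crosses     = len > room;
    s.first_addr  = addr;
    s.first_len   = s.crosses ? room : len;
    s.second_addr = s.crosses ? addr + room : 0;
    s.second_len  = s.crosses ? len - room : 0;
    return s;
}
```

For example, an 8-byte load at address 0x103C begins at offset 60 of its cache line, so 4 bytes come from that line and the remaining 4 bytes come from the next line beginning at 0x1040. An 8-byte load at address 0x1038 ends exactly at the boundary and does not cross it.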
A common solution performed by some microprocessors is to sequentially issue the same load instruction twice at different times within the load pipeline. When the load instruction is initially received by the load pipeline, the address for locating the data is first transformed to a virtual address (and ultimately transformed to a physical address for accessing the cache memory), and it is only then that it is determined that the data load operation crosses a cache line boundary. Such an unaligned load operation invokes a load miss. In the event of a load miss, the load is executed again in the load pipeline, which further causes a replay of the instructions that are dependent upon the load operation. Furthermore, the second issue of the unaligned load instruction must arbitrate with other normally issued load instructions, which adds considerable latency.
In this manner, a cache line unaligned load operation is inefficient and consumes valuable processing time to eventually retrieve the correct data, including initial detection, duplicate execution, arbitration of resources, and replay of dependent instructions. A software program that causes a significant number of cache line unaligned load operations results in inefficient operation and reduced performance.