The present invention is generally directed to data processors and, more specifically, to an apparatus for supporting misaligned accesses in a data processor in the presence of speculative load instructions.
The demand for high performance computers requires that state-of-the-art microprocessors execute instructions in the minimum amount of time. A number of different approaches have been taken to decrease instruction execution time, thereby increasing processor throughput. One way to increase processor throughput is to use a pipeline architecture in which the processor is divided into separate processing stages that form the pipeline. Instructions are broken down into elemental steps that are executed in different stages in an assembly line fashion.
A pipelined processor is capable of executing several different machine instructions concurrently. This is accomplished by breaking down the processing steps for each instruction into several discrete processing phases, each of which is executed by a separate pipeline stage. Hence, each instruction must pass sequentially through each pipeline stage in order to complete its execution. In general, a given instruction is processed by only one pipeline stage at a time, with one clock cycle being required for each stage. Since instructions use the pipeline stages in the same order and typically only stay in each stage for a single clock cycle, an N stage pipeline is capable of simultaneously processing N instructions. When filled with instructions, a processor with N pipeline stages completes one instruction each clock cycle.
The execution rate of an N-stage pipeline processor is theoretically N times faster than an equivalent non-pipelined processor. A non-pipelined processor is a processor that completes execution of one instruction before proceeding to the next instruction. Typically, pipeline overheads and other factors decrease somewhat the execution rate advantage that a pipelined processor has over a non-pipelined processor.
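The theoretical N-fold advantage can be checked with a short calculation. The following is a minimal sketch, assuming an ideal pipeline with no stalls and assuming that the equivalent non-pipelined processor spends N clock cycles on every instruction. The function names are illustrative and do not appear in this document.

```python
def pipelined_cycles(n_stages: int, n_instructions: int) -> int:
    """Cycles for an ideal pipeline: n_stages cycles for the first
    instruction, then one further cycle per additional instruction."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def non_pipelined_cycles(n_stages: int, n_instructions: int) -> int:
    """Assumption: each instruction takes n_stages cycles and the next
    instruction starts only after the previous one has completed."""
    return n_stages * n_instructions


def speedup(n_stages: int, n_instructions: int) -> float:
    """Ratio of non-pipelined to pipelined cycle counts."""
    return (non_pipelined_cycles(n_stages, n_instructions)
            / pipelined_cycles(n_stages, n_instructions))
```

Under these assumptions the ratio equals 1 for a single instruction and approaches N only as the number of instructions grows, which is one reason the advantage realized in practice falls short of the theoretical figure.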
An exemplary seven stage processor pipeline may consist of an address generation stage, an instruction fetch stage, a decode stage, a read stage, a pair of execution (E1 and E2) stages, and a write (or write-back) stage. In addition, the processor may have an instruction cache that stores program instructions for execution, a data cache that temporarily stores data operands that otherwise are stored in processor memory, and a register file that also temporarily stores data operands.
The address generation stage generates the address of the next instruction to be fetched from the instruction cache. The instruction fetch stage fetches an instruction for execution from the instruction cache and stores the fetched instruction in an instruction buffer. The decode stage takes the instruction from the instruction buffer and decodes the instruction into a set of signals that can be used directly by subsequent pipeline stages to execute the instruction. The read stage fetches required operands from the data cache or registers in the register file. The E1 and E2 stages perform the actual program operation (e.g., add, multiply, divide, and the like) on the operands fetched by the read stage and generate the result. The write stage then writes the result generated by the E1 and E2 stages back into the data cache or the register file.
Assuming that each pipeline stage completes its operation in one clock cycle, the exemplary seven stage processor pipeline takes seven clock cycles to process one instruction. As previously described, once the pipeline is full, an instruction can theoretically be completed every clock cycle.
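The progress of instructions through the exemplary seven stage pipeline can be illustrated as follows. This is a minimal sketch, assuming one clock cycle per stage, no stalls, and one new instruction entering the pipeline every cycle. The helper names are hypothetical.

```python
# Stage names taken from the exemplary pipeline described above.
STAGES = [
    "address generation",
    "instruction fetch",
    "decode",
    "read",
    "E1",
    "E2",
    "write",
]


def stage_at(instruction_index: int, cycle: int):
    """Stage occupied in a 1-based clock cycle by the instruction with the
    given 0-based issue order, or None if it is not in the pipeline."""
    position = cycle - 1 - instruction_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def completion_cycle(instruction_index: int) -> int:
    """1-based cycle in which the instruction occupies the write stage."""
    return instruction_index + len(STAGES)
```

In this model the first instruction reaches the write stage in cycle seven, and each later instruction reaches it exactly one cycle after its predecessor.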
The throughput of a processor also is affected by the size of the instruction set executed by the processor and the resulting complexity of the instruction decoder. Large instruction sets require large, complex decoders in order to maintain a high processor throughput. However, large complex decoders tend to increase power dissipation, die size and the cost of the processor. The throughput of a processor also may be affected by other factors, such as exception handling, data and instruction cache sizes, multiple parallel instruction pipelines, and the like. All of these factors increase or at least maintain processor throughput by means of complex and/or redundant circuitry that simultaneously increases power dissipation, die size and cost.
In many processor applications, the increased cost, increased power dissipation, and increased die size are tolerable, such as in personal computers and network servers that use x86-based processors. These types of processors include, for example, Intel Pentium™ processors and AMD Athlon™ processors.
However, in many applications it is essential to minimize the size, cost, and power requirements of a data processor. This has led to the development of processors that are optimized to meet particular size, cost and/or power limits. For example, the recently developed Transmeta Crusoe™ processor greatly reduces the amount of power consumed by the processor when executing most x86-based programs. This is particularly useful in laptop computer applications. Other types of data processors may be optimized for use in consumer appliances (e.g., televisions, video players, radios, digital music players, and the like) and office equipment (e.g., printers, copiers, fax machines, telephone systems, and other peripheral devices). The general design objectives for data processors used in consumer appliances and office equipment are the minimization of cost and complexity of the data processor.
Explicit speculative load instructions are an important tool in achieving high instruction level parallelism for wide instruction word processors. Speculative load instructions differ from conventional load instructions only in situations where a conventional load instruction would cause an exception. In most cases, the speculative load instruction would not cause an exception and a separate software test must be performed to determine whether the loaded data is valid. This characteristic allows speculative load instructions to be performed sooner than strict program order permits, thereby enabling higher parallelism. A special case arises in processors that do not provide hardware support for misaligned data accesses, because some code depends upon the ability to perform misaligned accesses. In such cases, misaligned loads are supported through software exception handlers.
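The difference between the two kinds of load can be modeled in software. The following is a minimal sketch, assuming a toy memory in which any unmapped address would cause a conventional load to raise an exception, and assuming that a speculative load returns zero in that situation. The memory contents, the names, and the returned placeholder value are all hypothetical.

```python
class LoadFault(Exception):
    """Stands in for the exception raised by a conventional load."""


NULL = 0
MEMORY = {0x1000: 42, 0x1004: 7}  # hypothetical mapped addresses


def load(address: int) -> int:
    """Conventional load: faults on an unmapped address."""
    if address not in MEMORY:
        raise LoadFault(hex(address))
    return MEMORY[address]


def speculative_load(address: int) -> int:
    """Speculative load: never raises. Assumption: it yields 0 where a
    conventional load would have faulted."""
    return MEMORY.get(address, 0)


def guarded_read(pointer: int, default: int) -> int:
    """The load is issued before the test that guards it in program order."""
    value = speculative_load(pointer)  # hoisted above the guard
    if pointer == NULL:                # separate software validity test
        return default                 # the loaded value is discarded
    return value
```

Because the speculative load cannot fault, it may be scheduled ahead of the pointer test, and the software test alone decides whether the loaded value is used.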
Pure software solutions are the simplest solutions for supporting code requiring misaligned accesses in processors that do not have hardware to implement misaligned accesses. At one extreme, the compiler can be instructed not to use speculative loads in critical sections of code. This solution has the disadvantage of discarding any parallelism that might be gained from the use of speculative load instructions. At the other extreme, the compiler can introduce additional tests and recovery code to ensure that where speculative loads are used on misaligned data, the (incorrect) loaded data is not used in subsequent phases of the program. This second solution requires significant additional code that might severely limit the performance advantages of using speculative load instructions in the first place.
Existing hardware solutions are aimed at providing efficient support for test and recovery. For example, the IA64 provides an additional bit associated with each register, which may be set by a speculative load instruction when the accessed data is incorrect. A hardware check instruction is provided to test this bit and invoke appropriate recovery code. The principal disadvantage to this approach is the requirement to implement additional program state in the form of these bits as well as instructions for saving and restoring this state.
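The per-register bit approach can be illustrated with a small model. This is a sketch of the general idea only, under simplifying assumptions, and is not a description of the actual IA64 instruction semantics. The class, function, and field names are hypothetical.

```python
class Registers:
    """Register file carrying one extra flag bit per register."""

    def __init__(self, count: int):
        self.value = [0] * count
        self.deferred = [False] * count  # additional program state


def speculative_load(regs: Registers, dest: int, memory: dict, address: int):
    """Load into dest, or set the flag bit instead of raising an exception."""
    if address in memory:
        regs.value[dest] = memory[address]
        regs.deferred[dest] = False
    else:
        regs.deferred[dest] = True


def check(regs: Registers, reg: int, recovery) -> int:
    """Test the flag bit and run recovery code if the load was deferred."""
    if regs.deferred[reg]:
        regs.value[reg] = recovery()
        regs.deferred[reg] = False
    return regs.value[reg]
```

The `deferred` list makes the cost visible: it is state beyond the register values themselves, and it would have to be saved and restored along with them.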
Perhaps the most significant drawback to any of these prior art solutions is the fact that these solutions are too rigid. Each solution either allows hardware recovery of misaligned accesses or it does not allow such hardware recovery. Designers of system-on-chip products do not have the choice of the level of support of misaligned accesses.
Therefore, there is a need in the art for improved exception handling techniques in the presence of speculative load instructions. In particular, there is a need for data processors that provide improved handling of misaligned accesses in the case of speculative (or dismissible) load instructions. More particularly, there is a need for data processors that provide embedded system designers with some flexibility in handling misaligned accesses in the case of dismissible load instructions.
The present invention provides a compromise between hardware and software solutions. In particular, the present invention provides the option, under software control, to enable exceptions for misaligned accesses caused by speculative load instructions. A bit is provided in the program status word (PSW) that selectively enables or disables misaligned access exceptions for speculative loads. Other sources of excepting behavior for speculative loads are ignored. Thus, software requiring misaligned accesses can be supported in the presence of speculative loads by implementing an appropriate exception handler.
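Software control of such a bit might look like the following. This is a minimal sketch, assuming the PSW is modeled as a non-negative integer. The bit position and the names are hypothetical and are not specified in this document.

```python
# Hypothetical bit position; the document does not specify one.
PSW_SPEC_MISALIGN_ENABLE = 1 << 3


def enable_misaligned_exceptions(psw: int) -> int:
    """Return the PSW with the enable bit set."""
    return psw | PSW_SPEC_MISALIGN_ENABLE


def disable_misaligned_exceptions(psw: int) -> int:
    """Return the PSW with the enable bit cleared."""
    return psw & ~PSW_SPEC_MISALIGN_ENABLE


def misaligned_exceptions_enabled(psw: int) -> bool:
    """True if misaligned speculative loads are to raise an exception."""
    return bool(psw & PSW_SPEC_MISALIGN_ENABLE)
```

Software that installs a misaligned access handler would set the bit, and software that does not would leave it clear, so that misaligned speculative loads cause no exception.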
To address the above-discussed deficiencies of the prior art, it is a primary object of the present invention to provide, according to an advantageous embodiment of the present invention, a data processor comprising: 1) an instruction execution pipeline comprising N processing stages capable of executing a load instruction; 2) a status register capable of storing a modifiable configuration value, the modifiable configuration value having a first value indicating the data processor is capable of executing a misaligned access handling routine and a second value indicating the data processor is not capable of executing a misaligned access handling routine; 3) a misalignment detection circuit capable of determining if the load instruction performs a misaligned access to a target address of the load instruction and, in response to a determination that the load instruction does perform a misaligned access, generating a misalignment flag; and 4) exception control circuitry capable of detecting the misalignment flag and in response thereto determining if the modifiable configuration value is equal to the first value.
According to one embodiment of the present invention, the exception control circuitry, in response to a determination that the modifiable configuration value is equal to the first value, causes the data processor to execute the misaligned access handling routine.
According to another embodiment of the present invention, the exception control circuitry, in response to a determination that the modifiable configuration value is equal to the second value, determines if the load instruction is speculative.
According to still another embodiment of the present invention, the exception control circuitry, in response to a determination that the load instruction is speculative, causes the data processor to dismiss the load instruction.
According to yet another embodiment of the present invention, the data processor further comprises a data protection unit capable of determining if the load instruction accesses a restricted area of memory.
According to a further embodiment of the present invention, the data protection unit, in response to a determination that the load instruction does access a restricted area of memory, causes the data processor to execute an exception handling routine.
According to a yet further embodiment of the present invention, the data protection unit, in response to a determination that the load instruction does access a restricted area of memory, is further capable of determining if the load instruction is speculative.
According to a still further embodiment of the present invention, the exception control circuitry, in response to a determination that the load instruction is speculative, causes the data processor to dismiss the load instruction.
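The decisions described in the foregoing embodiments can be summarized in software form. The following is a minimal sketch of one possible reading, not a definitive implementation. It assumes natural alignment (an access of a given size is aligned when its address is a multiple of that size) and treats the misalignment and protection decisions separately, since their relative order is not stated above. The names are hypothetical. The outcome for a misaligned, non-speculative load when the configuration value equals the second value is not described above and is modeled as a fault purely as an assumption.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETE = "complete"
    RUN_MISALIGNED_HANDLER = "run misaligned access handling routine"
    RUN_EXCEPTION_HANDLER = "run exception handling routine"
    DISMISS = "dismiss load"
    FAULT = "fault"  # assumption: case not covered by the text above


def is_misaligned(address: int, size: int) -> bool:
    """Assumption: natural alignment, i.e. address is a multiple of size."""
    return address % size != 0


def resolve_misalignment(address: int, size: int,
                         speculative: bool, handler_enabled: bool) -> Outcome:
    """handler_enabled models the configuration value: True for the first
    value, False for the second value."""
    if not is_misaligned(address, size):
        return Outcome.COMPLETE
    if handler_enabled:
        return Outcome.RUN_MISALIGNED_HANDLER
    if speculative:
        return Outcome.DISMISS
    return Outcome.FAULT  # assumption, see lead-in


def resolve_protection(restricted: bool, speculative: bool) -> Outcome:
    """Decision of the data protection unit for a single load."""
    if not restricted:
        return Outcome.COMPLETE
    if speculative:
        return Outcome.DISMISS
    return Outcome.RUN_EXCEPTION_HANDLER
```

In this reading, a misaligned speculative load invokes the handling routine only when the configuration value permits it, and a speculative load to a restricted area is dismissed.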
The foregoing has outlined rather broadly the features and technical advantages of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they may readily use the conception and the specific embodiment disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.
Before undertaking the DETAILED DESCRIPTION OF THE INVENTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like; and the term “controller” means any device, system or part thereof that controls at least one operation; such a device may be implemented in hardware, firmware or software, or some combination of at least two of the same. It should be noted that the functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. Definitions for certain words and phrases are provided throughout this patent document; those of ordinary skill in the art should understand that in many, if not most, instances, such definitions apply to prior, as well as future, uses of such defined words and phrases.