1. Field of the Invention
The invention relates to the field of computer systems. More specifically, the invention relates to the execution of floating point and packed data instructions by a processor.
2. Background Information
In a typical computer system, one or more processors operate on data values represented by a large number of bits (e.g., 16, 32, 64, etc.) to produce a result in response to a programmed instruction. For example, the execution of an add instruction will add a first data value and a second data value and store the result as a third data value. However, multimedia applications (e.g., applications targeted at computer supported cooperation (CSC, the integration of teleconferencing with mixed media data manipulation), 2D/3D graphics, image processing, video compression/decompression, recognition algorithms and audio manipulation) require the manipulation of large amounts of data which is often represented by a smaller number of bits. For example, multimedia data is typically represented as 64-bit numbers, but only a handful of bits may carry the significant information.
To improve efficiency of multimedia applications (as well as other applications that have the same characteristics), prior art processors provide packed data formats. A packed data format is one in which the bits used to represent a single value are broken into a number of fixed sized data elements, each of which represents a separate value. For example, data in a 64-bit register may be broken into two 32-bit elements, each of which represents a separate 32-bit value.
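The packed data format described above can be illustrated with a minimal sketch. It assumes a 64-bit register image split into two 32-bit elements and assumes that each element wraps around on overflow; the function names (`pack2x32`, `unpack2x32`, `packed_add32`) are hypothetical and do not correspond to any particular prior art instruction set.

```python
MASK32 = 0xFFFFFFFF  # one 32-bit data element


def pack2x32(lo, hi):
    """Pack two 32-bit values into one 64-bit register image."""
    return ((hi & MASK32) << 32) | (lo & MASK32)


def unpack2x32(reg):
    """Split a 64-bit register image into its (low, high) 32-bit elements."""
    return reg & MASK32, (reg >> 32) & MASK32


def packed_add32(a, b):
    """Add corresponding elements; a carry never crosses an element boundary.

    Wraparound on overflow is an assumption of this sketch.
    """
    a_lo, a_hi = unpack2x32(a)
    b_lo, b_hi = unpack2x32(b)
    return pack2x32(a_lo + b_lo, a_hi + b_hi)


x = pack2x32(0xFFFFFFFF, 1)
y = pack2x32(1, 2)
result = packed_add32(x, y)
```

In this example the low elements overflow and wrap to zero without disturbing the high elements, which is the property that distinguishes a packed add from an ordinary 64-bit add.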
Hewlett-Packard's basic 32-bit architecture machine took this approach to implementing multi-media data types. That is, the processor utilized its 32-bit general purpose integer registers in parallel to implement 64-bit data types. The main drawback of this simple approach is that it severely restricts the available register space. Additionally, the performance advantage of operating on multimedia data in this manner in view of the effort required to extend the existing architecture is considered minimal.
A somewhat similar approach adopted in the Motorola® 88110™ processor is to combine integer register pairs. The idea of pairing two 32-bit registers involves concatenating random combinations of specified registers for a single operation or instruction. Once again, however, the chief disadvantage of implementing 64-bit multi-media data types using paired registers is that there are only a limited number of register pairs that are available. Short of adding additional register space to the architecture, another technique of implementing multimedia data types is needed.
One line of processors which has a large software and hardware base is the Intel Architecture family of processors, including the Pentium® processor, manufactured by Intel Corporation of Santa Clara, Calif. FIG. 1 shows a block diagram illustrating an exemplary computer system 100 in which the Pentium processor is used. For a more detailed description of the Pentium processor than provided here, see Pentium Processor's Users Manual, Volume 3: Architecture and Programming Manual, 1994, available from Intel Corporation of Santa Clara, Calif. The exemplary computer system 100 includes a processor 105, a storage device 110, and a bus 115. The processor 105 is coupled to the storage device 110 by the bus 115. In addition, a number of user input/output devices, such as a keyboard 120 and a display 125, are also coupled to the bus 115. A network 130 may also be coupled to bus 115. The processor 105 represents the Pentium processor. The storage device 110 represents one or more mechanisms for storing data. For example, the storage device 110 may include read only memory (ROM), random access memory (RAM), magnetic disk storage mediums, optical storage mediums, flash memory devices, and/or other machine-readable mediums. The bus 115 represents one or more busses (e.g., PCI, ISA, X-Bus, EISA, VESA, etc.) and bridges (also termed as bus controllers).
FIG. 1 also illustrates that the storage device 110 has stored therein an operating system 132 for execution on the processor 105. Of course, the storage device 110 preferably contains additional software (not shown). FIG. 1 additionally illustrates that the processor 105 includes a floating point unit 135 and a status register 140 (the notation "FP" is used herein to refer to the term "floating point"). Of course, the processor 105 contains additional circuitry which is not necessary to understanding the invention.
The floating point unit 135 is used for storing floating point data and includes a set of floating point registers (also termed as the floating point register file) 145, a set of tags 150, and a floating point status register 155. The set of floating point registers 145 includes eight registers labeled R0 to R7 (the notation Rn is used herein to refer to the physical location of the floating point registers). Each of these eight registers is 80 bits wide and contains a sign field (bit 79), an exponent field (bits [78:64]), and a mantissa field (bits [63:0]). The floating point unit 135 operates the set of floating point registers 145 as a stack. In other words, the floating point unit 135 includes a stack referenced register file. When a set of registers is operated as a stack, operations are performed with reference to the top of the stack, rather than the physical locations of the registers in the set of floating point registers 145 (the notation STn is used herein to refer to the relative location of the logical floating point register n to the top of the stack). The floating point status register 155 includes a top of stack field 160 that identifies which register in the set of floating point registers 145 is currently at the top of the floating point stack. In FIG. 1, the top of stack indication identifies a register 165 at physical location R4 as the top of the stack.
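The register layout and the stack referenced addressing described above can be sketched as follows. The field positions follow the text (sign in bit 79, exponent in bits [78:64], mantissa in the low 64 bits). The modulo-8 wraparound used to map STn onto a physical register is an assumption of this sketch, as are the names `split_fields` and `physical_index`.

```python
NUM_REGS = 8  # physical registers R0 to R7


def split_fields(reg80):
    """Split an 80-bit register image into (sign, exponent, mantissa)."""
    sign = (reg80 >> 79) & 0x1          # bit 79
    exponent = (reg80 >> 64) & 0x7FFF   # bits [78:64], 15 bits
    mantissa = reg80 & ((1 << 64) - 1)  # low 64 bits
    return sign, exponent, mantissa


def physical_index(top, n):
    """Map the stack-relative register STn to a physical register number.

    Assumes STn is the register n places from the top of stack,
    wrapping around modulo the number of registers.
    """
    return (top + n) % NUM_REGS


# With the top of stack at R4, as in FIG. 1:
st0 = physical_index(4, 0)  # ST0 is R4
st5 = physical_index(4, 5)  # ST5 wraps around to R1
```

Because software names registers relative to the top of stack, the same instruction encoding can refer to different physical registers as the top of stack field changes.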
The set of tags 150 includes 8 tags and is stored in a single register. Each tag corresponds to a different floating point register and comprises two bits. As shown in FIG. 1, tag 170 corresponds to register 165. A tag identifies information concerning the current contents of the floating point register to which the tag corresponds: 00=valid; 01=zero; 10=special; and 11=empty. These tags are used by the floating point unit 135 to distinguish between empty and non-empty register locations. Thus, the tags can be thought of as identifying two states: empty which is indicated by 11, and non-empty which is indicated by any one of 00, 01, or 10.
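The tag encoding above can be sketched as a decoder over a 16-bit tag register. The two-bit values come from the text; the placement of the tag for register Rn in bits [2n+1:2n] of the tag register is an assumption of this sketch, and the helper names are hypothetical.

```python
TAG_NAMES = {0b00: "valid", 0b01: "zero", 0b10: "special", 0b11: "empty"}


def tag_of(tag_word, n):
    """Return the two-bit tag for physical register Rn.

    Assumes the tag for Rn occupies bits [2n+1:2n] of the tag register.
    """
    return (tag_word >> (2 * n)) & 0b11


def is_empty(tag_word, n):
    """Collapse the four tag values into the two states: empty or non-empty."""
    return tag_of(tag_word, n) == 0b11


def non_empty_registers(tag_word):
    """List the physical registers whose tags indicate a non-empty state."""
    return [n for n in range(8) if not is_empty(tag_word, n)]


# All registers empty except R4, which is tagged valid (00).
tag_word = 0xFCFF
```

Here `non_empty_registers(tag_word)` yields only register 4, and `TAG_NAMES[tag_of(tag_word, 4)]` is "valid".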
These tags may also be used for servicing events. An "event" is any action or occurrence to which a computer system might respond, including hardware interrupts, software interrupts, exceptions, faults, traps, aborts, machine checks, assists, and debug events. Upon receiving an event, the processor's event handling mechanism causes the processor to interrupt execution of the current process, store the interrupted process's execution environment (i.e., the information necessary to resume execution of the interrupted process), and invoke the appropriate event handler to service the event. After servicing the event, the event handler causes the processor to resume the interrupted process using the process's previously stored execution environment. Programmers of event handlers may use these tags to check the contents of the different floating point registers in order to better service an event.
While each of the tags has been described as containing two bits, alternative embodiments could store only one bit for each tag, with each one bit tag identifying either empty or non-empty. In such embodiments, these one bit tags may be made to appear to the user as comprising two bits by determining the appropriate two bit tag value when the tag values are needed.
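One way such a two bit value might be determined on demand is sketched below. The text does not specify how the value is derived; this sketch assumes it is computed from the one bit tag together with the current register contents, and the classification rules for "zero" and "special" (all-ones exponent, zero exponent, or a clear most significant mantissa bit) are assumptions chosen for illustration. The name `two_bit_tag` is hypothetical.

```python
def two_bit_tag(non_empty, reg80):
    """Derive a two-bit tag from a one-bit tag plus the register contents.

    non_empty: the stored one-bit tag (True means non-empty).
    reg80:     the 80-bit register image (sign, exponent, mantissa).
    The classification below is an assumption of this sketch.
    """
    if not non_empty:
        return 0b11  # empty
    exponent = (reg80 >> 64) & 0x7FFF
    mantissa = reg80 & ((1 << 64) - 1)
    if exponent == 0 and mantissa == 0:
        return 0b01  # zero
    if exponent == 0x7FFF or exponent == 0 or not (mantissa >> 63):
        return 0b10  # special
    return 0b00      # valid
```

Storing one bit per register and expanding it only when software reads the tags keeps the hardware state small while preserving the two bit view.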
The status register 140 includes an EM field 175 and a TS field 180 for respectively storing an EM indication and a TS indication. If the EM indication is 1 and/or the TS indication is 1, the processor hardware causes a trap to the operating system upon execution of a floating point instruction by generating a "device not available" exception. According to a software convention, the EM and TS indications are respectively used for emulating floating point instructions and implementing multi-tasking. However, the use of these indications is purely a software convention. Thus, either or both indications may be used for any purpose. For example, the EM indication may be used for implementing multitasking.
According to the software convention described above, the EM field 175 is used for storing a floating point emulate indication ("EM indication") that identifies whether the floating point unit should be emulated using software. A series of instructions or a single instruction (e.g., CPUID) is typically executed when a system is booted to determine if a floating point unit is present and to alter the EM indication if necessary. Thus, the EM indication is typically altered to indicate the floating point unit should be emulated when the processor does not contain a floating point unit. While in one implementation the EM indication equals 1 when the floating point unit should be emulated, alternative implementations could use other values.
Through the use of the operating system, many processors are capable of multitasking several processes (referred to herein as tasks) using techniques such as cooperative multitasking, time-slice multitasking, etc. Since a processor can execute only one task at a time, a processor must divide its processing time between the various tasks by switching between the various tasks. When a processor switches from one task to another, a task switch (also termed as a "context switch" or a "process switch") is said to have occurred. To perform a task switch, the processor must stop execution of one task and either resume or start execution of another task. There are a number of registers (the floating point registers included) whose contents must be preserved to resume execution of a task after a task switch. The contents of these registers at any given time during the execution of a task is referred to as the "register state" of that task. While multitasking several processes, a task's "register state" is preserved during the execution of other processes by storing it in a data structure (referred to as the task's "context structure") that is contained in a memory external to the processor. When execution of a task is to be resumed, the task's register state is restored (e.g., loaded back into the processor) using the task's context structure.
The preservation and restoration of a task's register state can be accomplished using a number of different techniques. For example, one operating system stores the previous task's entire register state and restores the next task's entire register state upon each task switch. However, since it is time consuming to store and restore entire register states, it is desirable to avoid storing and/or restoring any unnecessary portions during task switches. If a task does not use the floating point unit, it is unnecessary to store and restore the contents of the floating point registers as part of that task's register state. To this end, the TS indication has been historically used by operating systems, according to the previously described software convention, to avoid storing and restoring the contents of the floating point registers during task switches (commonly referred to as "partial context switching" or "on demand context switching").
The use of the TS indication to implement partial context switching is well known. However, for purposes of the invention, it is relevant that the attempted execution of a floating point instruction while the TS indication indicates a partial context switch was performed (i.e., that the floating point unit is "unavailable" or "disabled") results in a "device not available" exception. In response to this exception, the event handler, executing on the processor, determines if the current task is the owner of the floating point unit (i.e., whether the data stored in the floating point unit belongs to the current task or to a previously executed task). If the current task is not the owner, the event handler causes the processor to store the contents of the floating point registers in the previous task's context structure, restore the current task's floating point state (if available), and identify the current task as the owner. However, if the current task is the owner of the floating point unit, the current task was the last task to use the floating point unit (the floating point portion of the current task's register state is already stored in the floating point unit) and no action with respect to the floating point unit need be taken; in this case, TS would not be set and no exception will occur. The execution of the handler also causes the processor to alter the TS indication to indicate the floating point unit is owned by the current task (also termed as "available" or "enabled").
Upon completion of the event handler, execution of the current task is resumed by restarting the floating point instruction that caused the device not available exception. Since the TS indication was altered to indicate the floating point unit is available, the execution of subsequent floating point instructions will not result in additional device not available exceptions. However, during the next partial context switch, the TS indication is altered to indicate a partial context switch was performed. Thus, when and if execution of another floating point instruction is attempted, another device not available exception will be generated and the event handler will again be executed. In this manner, the TS indication permits the operating system to delay, and possibly avoid, the saving and loading of the floating point register file. By doing so, task switch overhead is reduced by reducing the number of registers which must be saved and loaded.
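The partial context switching convention described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any actual operating system: the `Machine` class, its attribute names, and the task names are hypothetical, the TS indication is assumed to be set on every task switch, and a task with no saved floating point state is assumed to start with zeroed registers.

```python
class Machine:
    """Toy model of a processor with a TS indication and lazy FP state saving."""

    def __init__(self):
        self.ts = 0                 # TS indication: 1 means FP unit "unavailable"
        self.fp_regs = [0.0] * 8    # floating point register file
        self.fp_owner = None        # task whose data is in the FP registers
        self.current = None         # currently executing task
        self.saved = {}             # per-task context structures (FP portion only)
        self.dna_count = 0          # number of "device not available" exceptions

    def task_switch(self, task):
        """Partial context switch: FP registers are neither saved nor restored."""
        self.current = task
        self.ts = 1

    def device_not_available(self):
        """Event handler for the "device not available" exception."""
        self.dna_count += 1
        if self.fp_owner != self.current:
            if self.fp_owner is not None:
                # Store the FP registers in the previous owner's context structure.
                self.saved[self.fp_owner] = list(self.fp_regs)
            # Restore the current task's FP state, if any was saved.
            self.fp_regs = list(self.saved.get(self.current, [0.0] * 8))
            self.fp_owner = self.current
        self.ts = 0                 # FP unit is now "available"

    def fp_store(self, index, value):
        """Stand-in for a floating point instruction that writes a register."""
        if self.ts:
            self.device_not_available()  # trap, then the instruction is restarted
        self.fp_regs[index] = value


m = Machine()
m.task_switch("A")
m.fp_store(0, 1.5)   # first FP use by A: exception, A becomes owner
m.fp_store(1, 2.5)   # TS is clear: no exception
m.task_switch("B")   # B runs without touching the FP unit
m.task_switch("A")
m.fp_store(2, 3.5)   # exception, but A is still the owner: nothing saved
m.task_switch("B")
m.fp_store(0, 9.0)   # exception: A's registers are saved, B becomes owner
```

In this run, task B's first time slice costs no floating point save or restore at all, which is the overhead reduction the convention is intended to provide.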
While one operating system is described in which the floating point state is not stored or restored during task switches, alternative implementations can use any number of other techniques. For example, as previously mentioned, an operating system could be implemented to always store and restore the entire register state on each task switch.
In addition to the different times at which the floating point state of a process can be stored (e.g., during context switches, in response to a device not available event, etc.), there are also different techniques for storing the floating point state. For example, an operating system can be implemented to store the entire floating point state (referred to herein as a "simple task switch"). Alternatively, an operating system can be implemented to store the contents of only those floating point registers whose corresponding tags indicate a non-empty state (referred to herein as a "minimal task switch"). In doing so, the operating system stores the contents of only those floating point registers which contain useful data. In this manner, the overhead for storing the floating point state may be reduced by reducing the number of registers which must be saved.
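The difference between a simple task switch and a minimal task switch can be sketched as follows. The save image layout, the function names, and the placement of the tag for register Rn in bits [2n+1:2n] of the tag register are all assumptions made for illustration.

```python
EMPTY = 0b11  # two-bit tag value indicating an empty register


def simple_save(regs, tag_word):
    """Simple task switch: store the entire floating point state."""
    return {"tags": tag_word, "regs": list(regs)}


def minimal_save(regs, tag_word):
    """Minimal task switch: store only registers whose tags are non-empty.

    Assumes the tag for Rn occupies bits [2n+1:2n] of tag_word.
    """
    kept = {n: regs[n] for n in range(8)
            if (tag_word >> (2 * n)) & 0b11 != EMPTY}
    return {"tags": tag_word, "regs": kept}


def minimal_restore(image):
    """Rebuild the register file; registers that were empty are left as zero."""
    regs = [0] * 8
    for n, value in image["regs"].items():
        regs[n] = value
    return regs, image["tags"]


regs = [10, 11, 12, 13, 14, 15, 16, 17]
tag_word = 0xFCFC  # R0 and R4 tagged valid, all other registers empty
image = minimal_save(regs, tag_word)
```

With only two non-empty registers, the minimal image holds two register values instead of eight, while the saved tags still record which registers were empty.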
FIG. 2 is a flow diagram illustrating the execution of an instruction by the Pentium processor. The flow diagram starts at step 200, from which flow passes to step 205.
As shown in step 205, a set of bits is accessed as an instruction and flow passes to step 210. This set of bits includes an opcode that identifies the operation(s) to be performed by the instruction.
At step 210, it is determined whether the opcode is valid. If the opcode is not valid, flow passes to step 215. Otherwise, flow passes to step 220.
As shown in step 215, an invalid opcode exception is generated and the appropriate event handler is executed. This event handler may be implemented to cause the processor to display a message, abort execution of the current task, and go on to execute other tasks. Of course, alternative embodiments may implement this event handler in any number of ways.
At step 220, it is determined whether the instruction is a floating point instruction. If the instruction is not a floating point instruction, flow passes to step 225. Otherwise, flow passes to step 230.
As shown in step 225, the processor executes the instruction. Since this step is not necessary to describe the invention, it is not further described here.
As shown in step 230, it is determined whether the EM indication is equal to 1 (according to the described software convention, if the floating point unit should be emulated) and whether the TS indication is equal to 1 (according to the described software convention, if a partial context switch was performed). If the EM indication and/or the TS indication are equal to 1, flow passes to step 235. Otherwise, flow passes to step 240.
At step 235, the "device not available" exception is generated and the corresponding event handler is executed. In response to this event, the corresponding event handler can be implemented to poll the EM and TS indications. If the EM indication is equal to 1, then the event handler can be implemented to cause the processor to execute the instruction by emulating the floating point unit and to resume execution at the next instruction (the instruction which logically follows the instruction received in step 205). If the TS indication is equal to 1, then the event handler can be implemented to function as previously described with reference to partial context switches (to store the contents of the floating point unit and restore the correct floating point state if required) and to cause the processor to resume execution by restarting execution of the instruction received in step 205. Of course, alternative embodiments may implement this event handler in any number of ways.
If certain numeric errors are generated during the execution of a floating point instruction, those errors are held pending until the attempted execution of the next floating point instruction whose execution can be interrupted to service the pending floating point numeric errors. As shown in step 240, it is determined whether there are any such pending errors. If there are any such pending errors, flow passes to step 245. Otherwise, flow passes to step 250.
At step 245, a pending floating point error event is generated. In response to this event, the processor determines if the floating point error is masked. If so, the processor attempts to handle the event internally using microcode and the floating point instruction is "micro restarted." The term micro restart refers to the technique of servicing an event without executing any non-microcode handlers (also termed as operating system event handlers). Such an event is referred to as an internal event (also termed as a software invisible event) because the event is handled internally by the processor, and thus, does not require the execution of any external operating system handlers. In contrast, if the floating point error is not masked, the event is an external event (also termed as a "software visible event") and the event's corresponding event handler is executed. This event handler may be implemented to service the error and cause the processor to resume execution by restarting execution of the instruction received in step 205. This technique of restarting an instruction is referred to as a "macro restart" or an "instruction level restart." Of course, alternative embodiments may implement this non-microcode event handler in any number of ways.
As shown in step 250, the floating point instruction is executed. During such execution, the tags are altered as necessary, any numeric errors that can be serviced now are reported, and any other numeric errors are held pending.
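The decision sequence of FIG. 2, as described in steps 200 through 250, can be summarized in a short sketch. The step numbers come from the text; the function name `dispatch`, its parameters, and the outcome strings are hypothetical, and the event handlers themselves are not modeled.

```python
def dispatch(opcode_valid, is_fp, em, ts, pending_fp_error):
    """Trace the steps of FIG. 2 for one instruction.

    Returns (list of step numbers visited, description of the outcome).
    """
    path = [200, 205, 210]
    if not opcode_valid:
        return path + [215], "invalid opcode exception"
    path.append(220)
    if not is_fp:
        return path + [225], "execute non-floating-point instruction"
    path.append(230)
    if em == 1 or ts == 1:
        return path + [235], "device not available exception"
    path.append(240)
    if pending_fp_error:
        return path + [245], "pending floating point error event"
    return path + [250], "execute floating point instruction"
```

For example, `dispatch(True, True, 0, 1, False)` ends at step 235, reflecting that a set TS indication traps a floating point instruction before any pending numeric error is examined.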
One limitation of the Intel Architecture processor family (including the Pentium processor), as well as certain other general purpose processors, is that they do not include a set of instructions for operating on packed data. Thus, it is desirable to incorporate a set of instructions for operating on packed data into such processors in a manner which is compatible with existing software and hardware. Furthermore, it is desirable to produce new processors that support a set of packed data instructions and that are compatible with existing software, including operating systems.
The invention provides a method for executing different sets of instructions that cause a processor to perform different data type operations in a manner that is invisible to various operating system techniques, that promotes good programming practices, and that is invisible to existing software conventions. According to one aspect of the invention, a data processing apparatus executes a first set of instructions of a first instruction type on what at least logically appears to software as a single logical register file. While the data processing apparatus is executing the first set of instructions, the single logical register file appears to be operated as a flat register file. In addition, the data processing apparatus executes a first instruction of a second instruction type using the logical register file. However, while the data processing apparatus is executing the first instruction, the logical register file appears to be operated as a stack referenced register file. Furthermore, the data processing apparatus alters all tags in a set of tags corresponding to the single logical register file to a non-empty state sometime between starting the execution of the first set of instructions and completing the execution of the first instruction. The tags identify whether registers in the single logical register file are empty or non-empty.
According to another aspect of the invention, a method for implementing partial context switching when executing scalar and packed data instructions is described. According to this method, a data processing apparatus receives an instruction belonging to a first routine. The execution of the instruction requires either a scalar operation or packed data operation. The data processing apparatus then determines if what at least logically appears to software as a single logical register file for executing both the scalar and packed data operations is unavailable due to a partial context switch. If the logical register file is unavailable, then execution of the first routine is interrupted for the execution of a second routine that causes the contents of the logical register file to be copied into a memory. However, if the logical register file is available, then the instruction is executed on the logical register file.
According to another aspect of the invention, a method for executing packed data instructions is described. According to this method, a packed data instruction is received whose execution causes a packed data item to be written to what at least logically appears to software as a register in a logical register file that is also used for saving floating point data. As a result of executing this instruction, the packed data item is written in the mantissa field of the logical register and a value representing not a number or infinity is written in the sign and exponent fields of the logical register.
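The write described in this aspect can be sketched as follows, assuming the 80-bit register layout given in the background (sign in bit 79, exponent in bits [78:64], mantissa in the low 64 bits). The text requires only that the sign and exponent fields hold a value representing not a number or infinity; writing all ones to both fields is an assumption of this sketch, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def write_packed(packed64, sign_exp=0xFFFF):
    """Build the 80-bit register image produced by writing a packed data item.

    The packed item goes into the mantissa field; sign_exp fills the sign
    and exponent fields (all ones here, an assumption of this sketch).
    """
    return ((sign_exp & 0xFFFF) << 64) | (packed64 & MASK64)


def looks_like_nan_or_infinity(reg80):
    """True if the exponent field is all ones, as for not a number or infinity."""
    return (reg80 >> 64) & 0x7FFF == 0x7FFF


reg = write_packed(0x0123456789ABCDEF)
```

With such an encoding, floating point code that mistakenly reads a register last written by a packed data instruction sees a value that is recognizably not an ordinary number, rather than a plausible but meaningless floating point result.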