1. Field of the Invention
This invention relates to the field of microprocessor architectures. More particularly, the invention relates to reducing overhead associated with register set saving and restoring as is required when invoking functions, exception handlers, and interrupt service routines. The invention further relates to register shadowing and windowing strategies to reduce function calling and task switching times in multi-issue processors, especially superscalar RISC processors and very long instruction word (VLIW) digital signal processors (DSPs).
2. Description of the Prior Art
Studies show that register saving and restoring in response to function calls and returns accounts for between 5% and 40% of the data memory traffic in executing programs written in a high level programming language. Also, the registers must be saved whenever a program switches tasks. In a UNIX operating system, for example, this accounts for approximately 20% of the task switching overhead. In more streamlined real-time operating systems as are common with embedded processors and DSPs, register saving and restoring accounts for a much higher percentage of the task switching time. Even interrupt service routines that do not require a full task switch still require at least some of the registers to be saved and restored. This adds significant overhead in many cases.
Register shadowing and windowing techniques have been introduced in an effort to reduce delays associated with register set storing and loading. Prior art processor architectures that incorporate shadow registers or register windows are discussed in detail in John L. Hennessy and David A. Patterson, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann Publishers, Inc., San Francisco, California, 1991. These concepts are by now employed on some high performance RISC processors and DSPs. For example, the Analog Devices ADSP21xx series of DSPs use a shadow register bank. Sun Microsystems' SPARC processors use windowed register banks. These register systems allow the processors to switch register sets in a single cycle.
Register shadowing is a technique whereby a primary register set is shadowed by a mirror image register set. When a register set switch command is issued, the machine context can be switched from the primary register set to the shadow register set. Shadow register sets are useful for fast switching between tasks or between a primary program and an interrupting program. For example, in a DSP, a supervisory task may run in the background while the main signal processing algorithm runs in the foreground. This technique is supported, for example, in the Analog Devices ADSP21xx series of DSPs. In the ADSP21xx processors, there is only one shadow register set. Hence, single-cycle context switching can only occur between one primary task and one secondary task. Also, in the case of the ADSP21xx, the address registers are not saved upon a shadow register switch. Hence, in applications, a long sequence of commands is required to save and restore the address registers, incurring a significant time penalty.
As can be seen from the foregoing discussion, a problem with shadow register systems is their inability to provide single-cycle context switching to more than one task. In theory, if N shadow register sets are added, then single-cycle task switching between N+1 tasks is possible. The problem, however, is that if more than N+1 tasks need to be supported, the single-cycle task switching will only be possible between a subset of the total number of tasks. Also, a significant amount of silicon area is needed for each added shadow register set. Finally, the software that manages the tasks becomes difficult and less efficient because it has to manage a first type of task that has its context stored in a shadow register set and a second type of task that has its context stored in memory. Whenever a task of the second type is invoked, context switch oriented register save and restore operations are required. For these reasons, shadow register sets have not gained widespread popularity.
Shadow register sets can optionally be used for register save and restore operations related to function calling and returning. For example, if the processor has a single shadow register set, a base level function can make a call to a first level function, and can perform a register bank switch so that the first level routine can save the registers and then restore them in single cycles. The problem is that this capability only exists for a single level of function calling. If the first level function were to call a second level function, both the primary and the shadow register sets would be occupied, requiring multiple cycles for all register save and restore operations related to subsequent function calls. Again, adding more shadow register sets extends the number of levels of function calls that can be supported with single-cycle register store and restore operations, but only at the price of a significant amount of silicon area. Moreover, the software becomes complicated due to the need to keep track of the current level of function nesting. For these reasons, shadow registers find limited use in function calling.
Register window systems are an extension of the shadow register concept and are designed to accelerate function calling. In a register window system, a group of shadow register sets is typically arranged as a circular buffer. When a function call is made, the active register set advances from one set to the next in the circular buffer. When the buffer wraps, an overflow is said to occur. Upon an overflow, a sequence of memory transfers is needed to save the first register file in the circular buffer arrangement to ensure it does not get overwritten. As verified by the analysis of execution patterns of large numbers of benchmark programs, by making the circular buffer deep enough, usually on the order of twelve to sixteen register sets, the overhead associated with register save and restore operations can be made negligible.
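The circular-buffer behavior described above can be illustrated with a small behavioral model. The following is a minimal Python sketch, not a description of any particular processor: the class name RegisterWindowFile, the window count, and the list standing in for the memory save area are all assumptions made for illustration.

```python
class RegisterWindowFile:
    """Behavioral model of register sets arranged as a circular buffer.

    All names and sizes here are hypothetical; they are not taken from
    any real processor.
    """

    def __init__(self, num_windows=4, regs_per_window=8):
        self.windows = [[0] * regs_per_window for _ in range(num_windows)]
        self.current = 0      # index of the active window
        self.resident = 1     # number of windows holding live frames
        self.spilled = []     # stands in for the memory save area
        self.overflows = 0

    def call(self):
        """Advance to the next window, spilling the oldest one on overflow."""
        n = len(self.windows)
        nxt = (self.current + 1) % n
        if self.resident == n:
            # The buffer has wrapped: the next slot holds the oldest live
            # frame, so it is copied to memory before being reused.
            self.spilled.append(list(self.windows[nxt]))
            self.overflows += 1
        else:
            self.resident += 1
        self.current = nxt
        self.windows[nxt] = [0] * len(self.windows[nxt])

    def ret(self):
        """Step back to the previous window, refilling from memory if needed."""
        n = len(self.windows)
        prev = (self.current - 1) % n
        if self.resident == 1:
            # The caller's frame was spilled earlier; restore it from memory.
            self.windows[prev] = self.spilled.pop()
        else:
            self.resident -= 1
        self.current = prev
```

With four windows, the first three nested calls cost nothing extra in this model, while every deeper call triggers an overflow and a memory save, which is why the prior art relies on deep buffers.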
Prior art register window systems have many drawbacks and are thus not used in most modern high performance processor designs. First of all, a twelve-level to sixteen-level deep register window system requires an excessive amount of silicon area. Secondly, as the total number of registers in the circular buffer of register files increases, the number of register address lines, and hence the amount of time needed for register address decoding increases. Longer register address access times lead to slower system clocks and thus slower overall processors. Thirdly, as the number of register sets in the windowing system increases, the number of registers that must be saved when a task switch occurs increases proportionally. This adds a significant overhead to task switching and adds interrupt latency. Adding multiple copies of shadowed register window systems to provide single-cycle task switching would require an enormous amount of silicon area and would have the same limitations relating to shadow registers as discussed above.
The problems become more severe in DSPs. For example, in machines such as the SPARC, the floating point registers are not included in a register window switch. Rather, floating point registers must be loaded and saved under program control. This would not be acceptable on a floating point DSP. The reason the floating point registers are not added to the register window on the SPARC is because, unlike on floating point DSPs, the floating point registers are not used as widely. DSPs also often contain the ALU core registers as well as address registers and possibly other types of auxiliary registers that would need to be added to the register windowing system. Modern load-store VLIW DSPs have multiple register sets that would need to be windowed multiple times to create an effective register window system. Hence, it can be seen that register windows become prohibitively expensive to implement with most DSP architectures.
In U.S. Pat. No. 3,781,810, a system is disclosed to speed up the storing and the retrieving of registers when the machine context must be switched in a nested fashion. Upon the occurrence of an interrupt, when a "store" command is issued, a selected subset of the register set is transferred to an auxiliary register set simultaneously via parallel data paths. The data in the auxiliary registers are then transferred to the memory in the background by using otherwise unused memory transfer cycles. If another store command is issued prior to the background transfer, the transfer is allowed to complete in the foreground. Register restore operations are processed similarly. The auxiliary registers are restored in the background, and are then transferred simultaneously into the primary register set. This technique has drawbacks that limit performance. For example, in the register store operation, the auxiliary register set must be overwritten with the contents of the currently active register set. This means that the auxiliary register set cannot supply useful information to be used in the task switch. It would be more effective to provide a system whereby the current register set could be transferred out to memory, and context could be switched to a shadow register set in a single cycle. Instead of the auxiliary register set being filled with the data to be transferred out to memory, it would be desirable to allow it to be preloaded with useful information to enable a truly nested single cycle task switching capability. The disclosure in U.S. Pat. No. 3,781,810 only allows for single directional transfers at a given time and needs to be extended to support store-and-load operations, delayed interrupts, and various methods for accelerated task switching disclosed herein.
A shadow register system is disclosed in U.S. Pat. No. 5,327,566. In this system, a SAVE command is issued to cause the processor to latch the register contents into a shadow register set. A RESTORE command is used to cause the processor to latch the previously saved register contents from the shadow register set back to the primary register set used by the processor. Also, in one aspect of the disclosure of U.S. Pat. No. 5,327,566, when an interrupt is detected, the processor automatically latches the register contents into the shadow register set, and, when a return from interrupt instruction is issued, the processor automatically restores the register contents. No interrupt nesting is supported by the system of U.S. Pat. No. 5,327,566. That is, if a program is interrupted by a first interrupt service routine which is then interrupted by a second interrupt service routine, the register context of the original program will be destroyed and will be unrecoverable. The concept of automatic register saving in response to interrupts needs to be expanded to support nested interrupts.
Therefore, it is a primary object of this invention to provide improved systems for register shadowing and register windowing. It is desired to implement a minimal number of register sets in a circularly buffered configuration to provide higher performance register shadowing and windowing systems at a fraction of the cost of prior art systems.
Another objective is to provide an architecture to allow data to transfer between the register set and the memory so that register set store and load operations can proceed concurrently with normal processing in advance of being needed.
Another objective of the invention is to provide improved methods for task switching in processors employing the inventive register shadowing and windowing systems. Another objective is to provide new interrupt modes that perform register set store and load operations automatically without incurring program cycle overhead.
Another objective of the invention is to provide a register shadowing system for VLIW and superscalar processors that include multiple register sets.
Another objective is to provide a register windowing system with a much lower silicon area requirement and to provide a method to accelerate task switching with this system.
One aspect of the present invention is a processor that can perform single-cycle task switches for an arbitrary number of tasks using the register shadowing technique of the current invention. In this technique, a direct memory access/direct register access (DMA/DRA) controller of the present invention is employed to perform shadow register set store and load operations in the background. The DMA/DRA controller is operative to monitor bus activity within the processor core, and to use otherwise unused cycles to transfer the shadow register set to a designated memory area and to load the shadow register set from another designated memory area. When the task switch occurs, then, the register set context for the next task is made available by issuing a single-cycle register set switch command (referred to henceforth as a "single-cycle task switch command").
Another aspect of the present invention is a DMA/DRA controller that performs register set store and load operations in cycle-steal and high priority burst modes. The DMA/DRA controller can be initialized under software control and may include a list manager that manages sequences of shadow register set store and load operations.
Another aspect of the present invention is a method to perform look-ahead for register set load and store operations to reduce task switching overhead in a multitasking executive. A related aspect of the present invention is a delayed interrupt processing technique that stores and/or loads the shadow register set in response to designated interrupts in the background before conventional interrupt processing is allowed to begin. Interrupt descriptors of the present invention are used to define the register set store and load addresses.
Another aspect of the present invention is a shadow register system for use in VLIW DSPs. In this case, multiple register sets are shadowed. Either a single DMA/DRA controller or multiple DMA/DRA controllers are associated with the shadow register sets. Techniques are provided to speed task switching in VLIW DSPs implemented with the improvements of the current invention. A low cost register windowing system implemented with this simple shadow register arrangement is also presented.
Another aspect of the invention is a processor that implements a virtual register window system. This system appears as an arbitrary depth circular buffer, but is mostly implemented in main memory. Look-ahead and cycle-steal techniques are used to minimize the overhead associated with subroutine related register set save and restore operations using a minimal on-chip register window buffer. A method to accelerate task switching in processors with this type of register window system is also presented.
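The idea of a small on-chip window buffer backed by main memory can be sketched as follows. This is a simplified Python model under stated assumptions: the class name VirtualWindowBuffer, the three-set on-chip capacity, and the treatment of a whole register set as moving in one idle cycle are illustrative choices, not details given in the text. Frames are represented by opaque values.

```python
from collections import deque


class VirtualWindowBuffer:
    """Toy model of a register window buffer extended into main memory.

    Hypothetical names and sizes; a real design would move a register set
    over many stolen cycles rather than in a single idle cycle.
    """

    def __init__(self, base_frame, on_chip_sets=3):
        self.capacity = on_chip_sets
        self.on_chip = deque([base_frame])  # resident frames, oldest first
        self.backing = []                   # frames held in main memory
        self.stalls = 0                     # foreground transfers incurred

    def idle_cycle(self):
        """Look-ahead: use an unused cycle to free a slot or refill one."""
        if len(self.on_chip) == self.capacity:
            self.backing.append(self.on_chip.popleft())
        elif len(self.on_chip) == 1 and self.backing:
            self.on_chip.appendleft(self.backing.pop())

    def call(self, frame):
        """Enter a subroutine; stall only if no on-chip slot is free."""
        if len(self.on_chip) == self.capacity:
            self.stalls += 1
            self.backing.append(self.on_chip.popleft())
        self.on_chip.append(frame)

    def ret(self):
        """Leave a subroutine; stall only if the caller is not resident."""
        self.on_chip.pop()
        if not self.on_chip:
            self.stalls += 1
            self.on_chip.append(self.backing.pop())
        return self.on_chip[-1]
```

In this model, a program that leaves at least one idle cycle between calls never stalls regardless of nesting depth, whereas back-to-back calls beyond the on-chip capacity fall back to a foreground spill.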
Another aspect of the present invention is a processor coupled to an internal or external memory. The processor comprises a processor core which comprises one or more functional units. A set of instructions are executed by the processor. The instructions include a register direct addressing mode wherein registers serve as operands to the instructions. A first register set is coupled to the functional units via a first data path. At least one shadow register set duplicates at least a subset of the first register set. The at least one shadow register set is coupled to the processor core via a second data path. The first and second data paths may overlap. At least one instruction in the instruction set is used to switch the active register set between the first register set and the at least one shadow register set. A direct memory access/direct register access (DMA/DRA) controller is coupled to the register sets and to the internal or external memory. The DMA/DRA controller transfers data directly between the register sets and the internal or external memory. The DMA/DRA controller responds to commands and control signals to transfer at least a portion of the contents of either the first register set or the shadow register set to or from a buffer area in the internal or external memory to free the processor core to concurrently process other instructions. Advantageously, the instruction used to switch between register sets is a toggle instruction which activates the inactive register set and deactivates the currently active register set. The first register set and the at least one shadow register set may include a third data path to couple the register sets to an internal or external memory so that transfers between the inactive register and the memory can occur simultaneously with transfers between the active register set and the processor core. 
In certain embodiments, the DMA/DRA controller receives information indicating the cycle-by-cycle utilization of data bussing resources required by the processor during program execution, and the DMA/DRA controller further transfers data between the register sets and the internal or external memory in a cycle steal mode, making use of the otherwise unused bandwidth available between the register sets and the internal or external memory. The DMA/DRA controller may also receive a priority signal, where, upon assertion, the DMA/DRA controller completes the data register to or from memory transfer in a burst mode. An on-chip memory buffer area may be included to provide high-speed transfer of data out of the shadow register set during a burst transfer. Advantageously, a separate data port may be included to allow data to transfer from the shadow register set to a second off-chip memory buffer area at the same time as data transfers from the external or internal memory into the shadow register set during a burst transfer. In certain embodiments, the DMA/DRA controller generates a done signal to indicate to the processor core when the register set store or load operation is complete. A bus switch may be used to couple the external or internal memory to the first and second data paths between the processor core and the register sets and to couple the external or internal memory to a second data path that routes to a second port of the register sets so that the inactive register set can be loaded and unloaded while the active register set performs data transactions with the processor core.
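The interplay of the cycle steal mode, the priority signal, and the done signal can be modeled in a few lines. The following Python sketch is illustrative only: the class name DmaDraModel, the one-word-per-cycle transfer rate, and the flat list used as memory are assumptions, and only the store direction is modeled.

```python
class DmaDraModel:
    """Toy model of a shadow-register-to-memory store (hypothetical names)."""

    def __init__(self, shadow_regs, memory, base_addr):
        self.shadow_regs = shadow_regs
        self.memory = memory
        self.mem_ptr = base_addr   # memory address pointer, autoincremented
        self.reg_ptr = 0           # register address pointer, autoincremented
        self.priority = False      # models the priority signal

    @property
    def done(self):
        """Models the done signal: true once every register has been moved."""
        return self.reg_ptr == len(self.shadow_regs)

    def tick(self, core_uses_bus):
        """Advance one cycle and report whether a word was transferred."""
        if self.done:
            return False
        if core_uses_bus and not self.priority:
            return False           # cycle steal mode: yield to the core
        # Either the bus is idle or burst mode is forcing the transfer.
        self.memory[self.mem_ptr] = self.shadow_regs[self.reg_ptr]
        self.mem_ptr += 1
        self.reg_ptr += 1
        return True
```

While the priority signal is deasserted the model moves data only on cycles the core leaves unused; asserting it makes every cycle a transfer cycle until the done signal is raised.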
Another aspect of the present invention is a computer system which comprises a memory system containing program instructions and data, a processor which includes a processor core having one or more functional units, a first register set coupled to the functional units via a first data path, and at least one shadow register set which duplicates at least a subset of the first register set. The shadow register set is coupled to the processor core via a second data path. The first and second data paths may overlap. A set of instructions are executed by the processor. The instructions include a register direct addressing mode wherein registers serve as operands to the instructions. At least one instruction in the instruction set is used to switch the active register set between the first register set and the shadow register set. A direct memory access/direct register access (DMA/DRA) controller is coupled to the register sets and to the internal or external memory. The DMA/DRA controller transfers data directly between the register sets and the internal or external memory. The DMA/DRA controller responds to commands and control signals to transfer at least a portion of the contents of either the first register set or the shadow register set to and from a buffer area in the internal or external memory, thereby freeing the processor core to concurrently process other instructions. Preferably, the computer system executes a multitasking operating system or a multitasking executive which uses the register sets and the DMA/DRA controller to accelerate register set save and restore operations during task switching. The DMA/DRA controller advantageously accelerates register set save and restore operations for subroutine procedure calls and returns.
Another aspect of the present invention is a direct memory access/direct register access (DMA/DRA) controller operative to control information transfer between memory and at least one register set. The DMA/DRA controller comprises a core interface coupled to a processor core. The core interface is operative to receive control signals and commands and to send out status information. A control unit is coupled to the core interface. The control unit responds to the control signals and commands to generate control sequences needed to manage data transfers between the memory and the register set. The control unit also generates status information indicative of events related to the data transfer. A memory address pointer register is coupled in a feedback arrangement to an arithmetic unit which manipulates an address within the memory address pointer register. A register address pointer register is coupled in a feedback arrangement to an arithmetic unit which manipulates an address within the register address pointer register. A transfer control signal generator operates to generate timing and control signals to the register and memory interfaces involved in the data transfer. In certain embodiments, the arithmetic unit associated with the memory pointer register provides a simple autoincrement function and a simple autodecrement function. In alternative embodiments, the arithmetic unit associated with the memory pointer register provides an autoincrement by specified constants function and an autodecrement by specified constants function. In particular embodiments, the DMA/DRA controller is coupled to an active register set and to at least one inactive shadow register set, wherein the DMA/DRA controller controls transfer operations between memory and the shadow register set while the active register set performs transactions with the processor core. The DMA/DRA controller preferably includes a list manager.
The list manager comprises a pointer to an entry in a descriptor table and comprises a list control unit responsive to descriptors stored in the descriptor table. Each descriptor contains at least a reference to a source or destination memory address involved in a DMA/DRA controlled register set transfer. The list control unit is operative to load the memory address register in response to information stored in the descriptor and to load the register address register in response to a bit field which indicates the target inactive shadow register set. The list control unit is further operative to obtain the address of the next entry in the descriptor table for future processing. A priority field indicates the priority of the DMA/DRA transfer associated with the descriptor. The descriptor preferably further comprises a next entry field to allow the descriptor table to take the form of a linked list. The descriptor may also include a field which provides a reference to indicate that the source or destination memory address register is to be loaded with a stack pointer, and may also include a field which contains the source or destination memory address to be loaded into the memory address pointer.
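A descriptor table organized as a linked list, and a list manager that walks it, might look like the following Python sketch. The field names, the use of a dictionary keyed by table index, and the tuple returned to the controller are assumptions chosen for illustration; the text does not specify field widths or encodings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """One descriptor table entry (hypothetical field names)."""
    mem_addr: int                      # source or destination memory address
    shadow_set: int                    # bit field selecting the shadow set
    priority: int                      # priority of the associated transfer
    use_stack_pointer: bool = False    # load the address from a stack pointer
    next_entry: Optional[int] = None   # index of the next entry, if any


class ListManager:
    """Walks the descriptor table and yields transfer parameters."""

    def __init__(self, table, head, stack_pointer=0):
        self.table = table
        self.entry = head              # pointer to the current table entry
        self.stack_pointer = stack_pointer

    def next_transfer(self):
        """Return (memory address, shadow set, priority), then advance."""
        if self.entry is None:
            return None
        d = self.table[self.entry]
        addr = self.stack_pointer if d.use_stack_pointer else d.mem_addr
        self.entry = d.next_entry      # follow the linked list
        return addr, d.shadow_set, d.priority
```

A controller model would load its memory address pointer and register address pointer from the returned values before starting each transfer.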
Another aspect of the present invention is a method for operating an instruction set processor coupled to an internal or external memory. The processor comprises one or more functional units and first and second register sets, wherein one of the register sets is an active register set presently responsive to processor instructions involving register operands and the other register set is a shadow register set not presently responsive to processor instructions involving register operands. The processor further comprises at least one instruction to switch the active register set to a shadow state and the shadow register set to an active state. The processor includes a DMA/DRA controller capable of controlling sequences of data transfers between the shadow register set and memory. The method is a method of register set storing and loading which comprises the steps of issuing one or more commands which include either operands or references to one or more descriptors to set up the DMA/DRA controller to transfer data to or from the shadow register set and from or to a buffer area in memory. The method includes the further steps of monitoring the bus activity in the processor to determine when unused memory bandwidth is available, and moving the shadow register contents under the control of the DMA/DRA controller to or from the memory buffer during those cycles deemed to possess unused memory bandwidth. Preferably, the method includes the step of providing a done signal to the main control unit of the instruction set processor when the shadow register to or from memory transfer is complete. Also preferably, the method includes the step of responding to a priority move signal, such that when the priority move signal is asserted, the transfer will switch from using only unused cycles to a high priority burst mode which uses all cycles necessary to transfer data at a high data rate.
In particular embodiments, the method implements steps in the main controller of the instruction set processor. In particular, the method issues the instruction to switch the active register set and checks the done signal returned by the DMA/DRA controller. If the done signal is asserted, the method proceeds to execute the active register set switch command. If the done signal is not asserted, the method asserts the priority move signal and proceeds to execute the active register set switch command only after the done signal is recognized.
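The switch sequence performed by the main controller can be sketched as below. This Python model is a sketch under stated assumptions: StubController is a hypothetical stand-in that only counts the words still to be moved, one word is assumed to move per cycle in burst mode, and the function name switch_register_set is invented for illustration.

```python
class StubController:
    """Stand-in for a DMA/DRA controller: counts words left to move."""

    def __init__(self, words_left):
        self.words_left = words_left
        self.priority = False          # models the priority move signal

    @property
    def done(self):
        return self.words_left == 0    # models the done signal

    def tick(self):
        if self.words_left:
            self.words_left -= 1


def switch_register_set(active, dma):
    """Return (new active set index, stall cycles) for a two-set machine."""
    stalls = 0
    if not dma.done:
        dma.priority = True            # transfer unfinished: force burst mode
        while not dma.done:
            dma.tick()                 # the core waits for the done signal
            stalls += 1
        dma.priority = False
    return 1 - active, stalls          # toggle between set 0 and set 1
```

When the background transfer has already completed, the switch costs no stall cycles in this model; otherwise the stall equals the number of words still outstanding.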
Another aspect of the present invention is a method in a computer system which incorporates a memory system, a processor and input/output devices. The processor is responsive to interrupts. The processor comprises first and second register sets with the first register set in an active state and the second register set in a shadow state. The processor includes an active register set switch command. A DMA/DRA controller is operative to transfer data between the shadow register set and the system memory. The method is a method of accelerated task switching which comprises the step of maintaining a set of interrupt service routines. Each routine is associated with and is activated in response to a specified interrupt. The interrupts are categorized as general interrupts and as a real-time clock interrupt for a multitasking scheduler. An interrupt service routine is associated with a given one of the general interrupts and sets a ready flag in a task control block associated with the given one of the general interrupts. The interrupt service routine for the real-time interrupt activates a scheduler responsive to information in each of the task control blocks. The scheduler performs the steps of decrementing a time-to-run variable for the currently running task and checking the ready flags in the task control blocks together with priority indicators contained therein to determine the next task to run. If a next task to run has a higher priority than the current task, the scheduler checks to see if the shadow register empty flag is set, and if the shadow register empty flag is set, the scheduler issues a burst shadow register set load command to the DMA/DRA controller to fill the shadow register set with the context of the next task to run. The scheduler then passes control to the next task to run. 
If the shadow register empty flag is not set, the scheduler issues a burst shadow register set store-and-load command to the DMA/DRA controller, and, upon completion, passes control to the next task to run. If the next task to run has a priority equal to or less than the currently running task, the scheduler performs a return from interrupt if the time-to-run variable is more than a specified number of ticks away from completion; issues a shadow register load command from the task control block of the next task to run if the time-to-run variable is at a specified value; and issues a single-cycle active register set switch command and switches tasks if the time-to-run has decremented to its terminal value.
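The scheduler decisions described above can be summarized in a short Python sketch. Several details are assumptions made for illustration: the terminal value of the time-to-run variable is taken to be zero, the preload point is a parameter named preload_at, the scheduler is assumed to return from the interrupt after issuing a background load, and actions are reported as strings rather than issued to hardware.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Minimal task control block (hypothetical field names)."""
    name: str
    priority: int
    ready: bool = False
    time_to_run: int = 0


def scheduler_tick(current, tasks, shadow_empty, preload_at=1):
    """Return the actions taken on one real-time clock interrupt."""
    current.time_to_run -= 1
    candidates = [t for t in tasks if t.ready and t is not current]
    if not candidates:
        return ["return_from_interrupt"]
    nxt = max(candidates, key=lambda t: t.priority)
    if nxt.priority > current.priority:
        # Preemption: the shadow set must be filled immediately.
        cmd = "burst_load" if shadow_empty else "burst_store_and_load"
        return [cmd, "switch_to:" + nxt.name]
    if current.time_to_run == 0:
        # Time slice expired and the shadow set was preloaded earlier.
        return ["single_cycle_switch", "switch_to:" + nxt.name]
    if current.time_to_run == preload_at:
        # Look-ahead: start loading the next context in the background.
        return ["background_load:" + nxt.name, "return_from_interrupt"]
    return ["return_from_interrupt"]
```

The look-ahead branch is what allows the later switch to complete in a single cycle: by the time the slice expires, the next task's context is already in the shadow register set.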
Another aspect of the present invention is a method in a computer system which incorporates a memory system, a processor, and input/output devices. The processor is responsive to interrupts and comprises first and second register sets. The first register set is in an active state and the second register set is in a shadow state. The processor includes an active register set switch command. A DMA/DRA controller is operative to transfer data between the shadow register set and memory. The method is a method of reducing context switching in response to interrupts. The method comprises the step of issuing an interrupt request by a device in the computer system. Interrupts are categorized into at least two classes. A first class is a conventional interrupt, and a second class is a delayed interrupt. When the interrupt request is categorized as a delayed interrupt, shadow register to and from memory transfer instructions are automatically issued from the processor to the DMA/DRA controller. When the DMA/DRA controller indicates the transfer is complete, the current instruction is finished and return information is stacked in a conventional manner. Control is then passed to the associated interrupt service routine. In preferred embodiments of the method, the interrupt service routine issues a single-cycle active register set switch command to switch the register context from the currently active register set to the shadow register set. Also preferably, a single-cycle active register set switch command is automatically issued to switch the active register context as a part of the interrupt processing sequence just prior to activating the interrupt service routine. The register set to and from transfer instructions are preferably either background commands or priority store-to-stack commands. Alternatively, the register set to and from transfer instructions are either background commands or priority store-and-load commands.
In the further alternative, the register set to and from transfer instructions are either background commands or priority load commands. In preferred embodiments of the method, prior to issuing a return from interrupt command, the interrupt service routine issues a background save shadow register set command, where the target address is either specified by a specific address pointer or a stack pointer. The interrupt category and the DMA/DRA related information are advantageously contained in an interrupt descriptor which comprises fields to automatically program the DMA/DRA controller. In certain embodiments, the interrupt descriptor further comprises an interrupt branch address. The fields may comprise at least one of the following:
a limit field which specifies all or a subset of the shadow register set that needs to be transferred;
a store field which indicates that a store operation is required if a shadow register empty flag is not set;
a load field which indicates whether the shadow register set needs to be loaded;
a stack or pointer field which indicates whether the load and store operations use pointers contained in the descriptor, or use a stack pointer;
a list manager field which indicates whether the interrupt is associated with a list manager that maintains a descriptor table;
a register set store address; and
a register set load address.
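The descriptor fields listed above can be pictured as a record from which the DMA/DRA commands are derived. The following Python sketch is illustrative: the field names, the `delayed` flag used to encode the interrupt category, and the (operation, address, count) tuples are assumptions, and stack pointer adjustment between the store and the load is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class InterruptDescriptor:
    """Interrupt descriptor with hypothetical field names and encodings."""
    delayed: bool        # interrupt category: delayed or conventional
    limit: int           # number of shadow registers to transfer
    store: bool          # store required if the shadow set is not empty
    load: bool           # shadow register set needs to be loaded
    use_stack: bool      # use a stack pointer instead of descriptor pointers
    list_managed: bool   # interrupt is associated with a list manager
    store_addr: int = 0  # register set store address
    load_addr: int = 0   # register set load address
    branch_addr: int = 0 # interrupt branch address


def dma_commands(desc, shadow_empty, stack_pointer):
    """Return the (operation, address, count) tuples a descriptor implies."""
    if not desc.delayed:
        return []        # conventional interrupt: no automatic transfer
    cmds = []
    if desc.store and not shadow_empty:
        addr = stack_pointer if desc.use_stack else desc.store_addr
        cmds.append(("store", addr, desc.limit))
    if desc.load:
        addr = stack_pointer if desc.use_stack else desc.load_addr
        cmds.append(("load", addr, desc.limit))
    return cmds
```

In this sketch the list manager field is carried in the record but not interpreted; a fuller model would hand list-managed interrupts to a descriptor table walker instead of using the addresses directly.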
Another aspect of the present invention is a method for use with a processor which responds to interrupts and which comprises first and second register sets. The first register set is in an active state and the second register set is in a shadow state. The processor includes an active register set switch command. A DMA/DRA controller operates to transfer data between the shadow register set and memory. The method is a method of reducing context switching in response to interrupts. The method comprises the step of issuing an interrupt request by a device in the computer system. Interrupts are categorized into at least a first class and a second class. The first class is a conventional interrupt, and the second class is a delayed interrupt. When a received interrupt request is categorized as a delayed interrupt, a shadow register to/from memory transfer instruction is automatically issued from the processor to the DMA/DRA controller. When the DMA/DRA controller indicates the transfer is complete, the current instruction is finished, return information is stacked in a conventional manner, and control is passed to the associated interrupt service routine.
Another aspect of the present invention is an apparatus which comprises a very long instruction word processor having multiple functional units which receive different dispatched portions of a very long instruction word. At least one of the functional units is coupled to at least one active register set and to at least one inactive (shadow) register set. The coupling occurs via at least one data path. At least one instruction in the instruction set is used to switch the active register set between the active register set and the shadow register set. A direct memory access/direct register access (DMA/DRA) controller is coupled to the register sets and is coupled to at least one of an internal memory or an external memory. The DMA/DRA controller transfers data directly between the register sets and the at least one of the internal memory and the external memory. The DMA/DRA controller responds to commands and control signals to transfer at least a portion of the contents of one or more register sets to and from at least one buffer area in the at least one of the internal memory and the external memory to free the processor core to concurrently process other instructions. In certain embodiments, the DMA/DRA controller includes multiple channels to move multiple register sets to or from memory. Preferably, the DMA/DRA controller includes parallel hardware to move the multiple channels of register set data to or from memory concurrently along parallel data paths.
Another aspect of the present invention is an apparatus which comprises a processor core having one or more functional units which receive dispatched instructions. One or more of the functional units are coupled to a register window buffer containing at least two register sets. The register window buffer is responsive to instructions which change the active register window. A direct memory access/direct register access (DMA/DRA) controller is coupled to the register window buffer sets and to either an internal memory or an external memory. The DMA/DRA controller is used to transfer data directly between the register sets and the internal or external memory. The DMA/DRA controller is responsive to commands and control signals to transfer at least a portion of the contents of one or more register sets to or from one or more buffer areas in the internal or external memory, thereby freeing the processor core to concurrently process other instructions and extending the effective length of the register window system. In preferred embodiments, the apparatus includes a cache memory and a bus interface unit. The bus interface unit couples data from the external memory to the cache memory. In one such preferred embodiment, the cache memory is connected directly to the bus interface unit. Alternatively, the apparatus includes a memory request queue between the bus interface unit and the cache. In certain preferred embodiments, the memory request queue is coupled to the DMA/DRA controller, and the DMA/DRA controller transfers data between the register sets and the memory request queue.
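As an illustrative sketch of how a memory buffer area can extend the effective length of a register window system, the following C model spills the oldest resident window when the physical windows are exhausted and refills it on return. All names and sizes are hypothetical, a single word stands in for an entire register set, and the copies are performed synchronously here, whereas the text contemplates the DMA/DRA controller performing them concurrently with instruction execution.

```c
#include <stdint.h>

#define PHYS_WINDOWS 2 /* physical register windows (assumed) */
#define SPILL_SLOTS  8 /* capacity of the memory buffer area (assumed) */

typedef struct {
    uint32_t phys[PHYS_WINDOWS]; /* one word models a whole register set */
    uint32_t spill[SPILL_SLOTS]; /* buffer area in memory */
    int depth;                   /* number of live windows */
    int spilled;                 /* how many of the oldest windows are in memory */
} window_buffer;

/* Allocate a new window. Returns 0 on success, -1 if the buffer area is full. */
int window_push(window_buffer *wb, uint32_t value)
{
    if (wb->depth - wb->spilled == PHYS_WINDOWS) {
        if (wb->spilled == SPILL_SLOTS) {
            return -1;
        }
        /* Spill the oldest resident window to the buffer area. */
        wb->spill[wb->spilled] = wb->phys[wb->spilled % PHYS_WINDOWS];
        wb->spilled++;
    }
    wb->phys[wb->depth % PHYS_WINDOWS] = value;
    wb->depth++;
    return 0;
}

/* Release the newest window. Returns 0 on success, -1 if none is live. */
int window_pop(window_buffer *wb, uint32_t *out)
{
    if (wb->depth == 0) {
        return -1;
    }
    wb->depth--;
    *out = wb->phys[wb->depth % PHYS_WINDOWS];
    if (wb->spilled > 0) {
        /* Refill the freed slot with the most recently spilled window. */
        wb->spilled--;
        wb->phys[wb->spilled % PHYS_WINDOWS] = wb->spill[wb->spilled];
    }
    return 0;
}
```

With two physical windows, four nested pushes succeed in this model because the two oldest windows are held in the buffer area until they are needed again.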
Another aspect of the present invention is a method in a computer system which employs a multitasking operating system executive. The method is a method of reducing the time required to switch tasks within processors containing virtual register window systems. The method comprises the steps of maintaining the register context of each task in a memory area contained within a task control block or referenced by a pointer within a task control block and maintaining the extended virtual register set extensions in the same memory area. Upon task switching, only the portion of the virtual register window system that is not already stored in the memory area is stored. In preferred embodiments of this method, a parameter may be set that causes the processor to attempt to mirror the contents of all inactive register windows in the memory area so that, at the task switch time, a minimal number of registers will need to be saved. Preferably, the parameter is set when a time-to-run variable indicates that the task switch time is imminent.
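The task switching method described above can be illustrated with a simple model in which each register window carries a flag indicating whether its contents differ from the copy in the task control block. The sketch below is written under stated assumptions: the structure names, the per-window dirty bit, and the window and register counts are hypothetical and do not come from the text.

```c
#include <stdint.h>
#include <string.h>

#define NUM_WINDOWS     4 /* assumed */
#define REGS_PER_WINDOW 8 /* assumed */

typedef struct {
    uint32_t regs[NUM_WINDOWS][REGS_PER_WINDOW];
    uint8_t  dirty; /* bit w set: window w is not yet stored in the memory area */
} window_file;

typedef struct {
    uint32_t saved[NUM_WINDOWS][REGS_PER_WINDOW]; /* register context area */
} task_control_block;

/* Writing a register marks its window as not mirrored in memory. */
void write_reg(window_file *wf, int w, int r, uint32_t v)
{
    wf->regs[w][r] = v;
    wf->dirty |= (uint8_t)(1u << w);
}

static void save_window(window_file *wf, task_control_block *tcb, int w)
{
    memcpy(tcb->saved[w], wf->regs[w], sizeof wf->regs[w]);
    wf->dirty &= (uint8_t)~(1u << w);
}

/* Mirror all inactive windows ahead of the task switch.
   Returns the number of windows copied. */
int mirror_inactive(window_file *wf, task_control_block *tcb, int active)
{
    int n = 0;
    for (int w = 0; w < NUM_WINDOWS; w++) {
        if (w != active && (wf->dirty & (1u << w))) {
            save_window(wf, tcb, w);
            n++;
        }
    }
    return n;
}

/* At task switch time, store only what is not already in the memory area.
   Returns the number of windows copied. */
int task_switch_save(window_file *wf, task_control_block *tcb)
{
    int n = 0;
    for (int w = 0; w < NUM_WINDOWS; w++) {
        if (wf->dirty & (1u << w)) {
            save_window(wf, tcb, w);
            n++;
        }
    }
    return n;
}
```

If `mirror_inactive` has run shortly before the switch, `task_switch_save` in this model typically has only the active window left to copy.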