Automated systems commonly utilize Central Processing Units (CPUs) connected to caches, memory storage devices, and numerous peripherals over various buses and other interconnections. Generally, designers of automated systems have strived to improve system performance by increasing CPU processing speeds, bus speeds, memory utilization rates, and various other parameters. Additionally, significant efforts have been undertaken to simultaneously reduce the size and power requirements of such systems. While significant reductions in size and power requirements have occurred, the software programs used by many of today's systems have increased tremendously in size and complexity. As a result, today's designers are often faced with the daunting challenge of having to squeeze ever more data, including video data and audio data, through CPUs at ever increasing rates while decreasing the size and power requirements of such systems.
For many applications, the ability of CPUs to process large quantities of data is often dictated by how quickly, and in what quantity, the CPU can obtain information from and/or write information to memory or other data storage devices. As is well known in the art, today's systems often include multiple data storage devices, such as Random Access Memory (RAM), Read Only Memory (ROM), and various other peripheral storage devices such as hard disc drives and writable/rewritable magnetic and optical storage devices. Additionally, CPUs often obtain data from various non-localized data storage devices via communications networks such as the Internet. Since each storage device often contains data which is specified in variable word lengths, and since today's CPUs generally utilize registers of fixed widths, the CPU commonly has to repeatedly request segments of the data until an entire data word is processed.
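The segment-by-segment retrieval described above can be illustrated with a minimal sketch. The function names, the low-bits-first ordering, and the example widths are assumptions made for illustration only; they do not describe any particular CPU.

```python
def fetch_word_in_segments(word, word_bits, register_bits):
    """Split a word_bits-wide value into register-sized segments.

    Hypothetical model: each loop iteration stands for one request the
    CPU must issue because its register is narrower than the data word.
    Segments are returned low-order bits first (an assumed ordering).
    """
    if word_bits <= 0 or register_bits <= 0:
        raise ValueError("widths must be positive")
    mask = (1 << register_bits) - 1
    segments = []
    for shift in range(0, word_bits, register_bits):
        segments.append((word >> shift) & mask)
    return segments


def reassemble(segments, register_bits):
    """Rebuild the original word from its low-first segments."""
    word = 0
    for index, segment in enumerate(segments):
        word |= segment << (index * register_bits)
    return word
```

Under these assumptions, a 128-bit data word read through a 32-bit register requires four separate requests, while the same word read through a 64-bit register requires two.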
In most computer applications, retrieving data from a memory location often takes longer than actually processing the given quantity of data, because the ability of the CPU to process information is significantly greater than its ability to retrieve information from memory storage devices. In order to speed up the processing capabilities of CPUs, many system designers utilize cache memory, which may be built onto the same chip as the processor itself. While caching certain segments of code is helpful in processing routine instructions, for many applications, such as data mining, speech recognition, and video image processing, caching such information is generally not practical. As a result, for many applications, CPUs generally have to recall vast quantities of information from memory storage devices in segments whose size is set by the width of the registers.
Additionally, since registers are commonly provided in pre-set widths (e.g., 64 bits or 32 bits), multiple registers are often needed to download/retrieve large quantities of data from a storage device within a reasonable time period. These registers are often directed to download data and then hold it until the CPU is ready to perform a specific task. When configured in this manner, many systems end up with CPUs having large numbers of registers, each of which increases power requirements and inhibits system miniaturization. For example, the popular Pentium III® processor utilizes over 100 registers to support its various features and functions.
As is commonly known in the art, CPUs often begin the processing of large quantities of data by first determining a location for the data (i.e., the address), then fetching the data provided at the address, processing the fetched data, determining a location (i.e., a second address) where the result of the data processing is to be sent, sending the result to the second location, and then determining an instruction pointer, which preferably contains the address for the next instruction. Generally, the first address, the data, the second address, the result location, and the instruction pointer are provided in a memory array in sequential order. The memory is generally configured in sequential order during compiling so that the number of JUMPs is limited and the processing needed to determine which instruction is to be processed next is reduced. While compiling a program to reduce the number of JUMPs is often desirable from a CPU processing viewpoint, compiling often results in memory arrays which are not utilized to their maximum capacity. Instead, many memories have significant blocks in which data could be stored but which are never used.
Additionally, while compilers often attempt to create software instructions that flow from one sequence line to the next, in reality, much of today's software code contains JUMPs, conditional branches, loops, and other data flow techniques. Since these software programs often do not naturally flow from one line to the next, system designers generally must also keep track of code locations via address pointers and various other devices, each of which requires additional registers and additional power.
Additionally, currently available CPUs commonly require multiple instructions and processing steps to accomplish some of the simplest tasks, such as adding two operands. For example, currently available CPUs often execute an instruction requiring Operand 1 to be added to Operand 2 by performing the following steps:
    1. Fetch ADD instruction from location pointed to by Instruction Pointer (“IP”), and load the instruction into an instruction register;
    2. Decode the instruction and store in instruction register;
    3. Access a location in memory where a first operand is located, obtain the value for the first operand and store it in a temporary register;
    4. Access a second location in memory where a second operand is located, obtain the value for the second operand and store it in a temporary register;
    5. Perform the operation specified in the instruction register on the first and second operands by transferring the instruction and the first and second operands from their respective registers to the ALU;
    6. Determine where the result of the ALU process is to be stored;
    7. Store the results data to the determined location; and
    8. Determine the next address for the next instruction, which may require a JUMP to another memory location.
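The eight steps listed above can be traced with a minimal sketch. The memory layout (a dictionary keyed by address), the tuple encoding of the ADD instruction, and the assumption that the next instruction sits at the following address are all hypothetical and chosen only to make the steps visible.

```python
def execute_add(memory, ip):
    """Trace one ADD through the eight steps; return (next_ip, accesses).

    Hypothetical encoding: memory[ip] holds the tuple
    ("ADD", first_operand_address, second_operand_address, result_address).
    """
    accesses = 0

    # Step 1: fetch the instruction pointed to by IP.
    instruction_register = memory[ip]
    accesses += 1

    # Step 2: decode the instruction.
    opcode, addr_a, addr_b, addr_result = instruction_register
    if opcode != "ADD":
        raise ValueError("this sketch only models ADD")

    # Step 3: read the first operand into a temporary register.
    temp_a = memory[addr_a]
    accesses += 1

    # Step 4: read the second operand into a temporary register.
    temp_b = memory[addr_b]
    accesses += 1

    # Step 5: hand both operands to the ALU.
    result = temp_a + temp_b

    # Step 6: determine where the result is to be stored.
    destination = addr_result

    # Step 7: store the result.
    memory[destination] = result
    accesses += 1

    # Step 8: determine the next instruction address
    # (assumed sequential here; a JUMP would change this).
    next_ip = ip + 1
    return next_ip, accesses
```

In this model a single addition touches memory four times (one instruction fetch, two operand reads, one result write), which is the kind of overhead the passage describes.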
While the above operation may be accomplished extremely quickly for a single mathematical calculation, today's CPUs often are required to process millions of transactions a second. At this scale, the constant reading, storing, addressing, and writing to and from memory via registers may significantly degrade a system's performance.
Today's CPUs therefore often spend inordinate amounts of time determining from where data and instructions are to be obtained and/or stored, storing the data, processing the data, determining where the result of the data processing is to be stored, and then actually storing the result. Accordingly, a system is needed that reduces the amount of time a CPU spends determining where to obtain data and actually fetching the data needed for processing.
Additionally, many of today's systems control numerous Input/Output (I/O) devices, all of which are constantly requesting processor time. Each time a processor determines that a different I/O device needs to be serviced or a different processing routine needs to be executed, the processor commonly performs a state change. In a Windows® multi-tasking environment, state changes occur often because the various devices connected to the I/O bus are continuously jostling for the attention of the processors.
As shown in FIG. 3A, the process by which many currently available processors perform a state change often requires numerous steps. The state change operation begins at 302 when a processor receives a request to stop processing a first task and to begin, as soon as possible, processing a second task. When a state change request is received, the CPU sets a register pointer equal to zero at step 304 and begins transferring the contents of each register utilized by the CPU into memory at a location specified by a stack pointer. The data transfer continues through steps 306–310 until the contents of each register utilized by the CPU are copied to a block of memory, often in sequential order. As each register is transferred, the CPU also increments the stack pointer and the register pointer until the value of the register pointer equals the total number of registers whose contents need to be saved. At this point, the CPU is ready to implement the desired state change (i.e., the registers may now be loaded with new instructions, addresses, and operands). For advanced CPUs, such as Pentium IIIs, which utilize hundreds of registers, implementing a state change can often take many microseconds.
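The register-saving loop described above can be sketched as follows. This is a simplified model under stated assumptions: registers and memory are plain lists, the stack grows toward higher addresses, and the function name is hypothetical; the step numbers in the comments follow the description of FIG. 3A.

```python
def save_state(registers, memory, stack_pointer):
    """Copy every register to memory, one per loop iteration.

    Returns the stack pointer after the last register is saved.
    Assumes an upward-growing stack (an illustrative choice only).
    """
    register_pointer = 0  # step 304: register pointer set to zero
    # Steps 306-310: transfer, then increment both pointers, until
    # the register pointer equals the number of registers to save.
    while register_pointer < len(registers):
        memory[stack_pointer] = registers[register_pointer]
        stack_pointer += 1
        register_pointer += 1
    return stack_pointer
```

The loop performs one memory write per register, so in this model the cost of a state change grows linearly with the number of registers.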
FIG. 4A shows a process 400 by which many current systems recover from a state change (i.e., resume the processing interrupted by the state change). Generally, the process 400 of recovering to the first state requires as many processing steps as does the changing of states to process the second task. As shown, the recovery operation begins at 402 when the CPU receives a direction that indicates the second task has been completed and that the first task may be restored. Next, at step 404, the processor sets a register pointer equal to or less than the number of registers available to the CPU. Then, in steps 406–410, the processor transfers the contents of memory from the location specified by the stack pointer into the appropriate registers until the contents have been restored for all of the registers which changed states. After all of the registers are restored, the CPU resumes processing the steps needed for the first task.
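The recovery loop described above can be sketched in the same spirit. The sketch assumes that the register contents were earlier copied to consecutive memory cells in ascending order and that the stack pointer was left one position past the last saved value; the list-based model and the function name are hypothetical, and the step numbers in the comments follow the description of FIG. 4A.

```python
def restore_state(registers, memory, stack_pointer):
    """Reload every register from memory, last-saved value first.

    Returns the stack pointer after the final register is restored.
    """
    # Step 404: start the register pointer at the number of registers.
    register_pointer = len(registers)
    # Steps 406-410: step both pointers back and copy one value per
    # iteration until every register has been restored.
    while register_pointer > 0:
        register_pointer -= 1
        stack_pointer -= 1
        registers[register_pointer] = memory[stack_pointer]
    return stack_pointer
```

As with saving, the model performs one memory read per register, which is consistent with the observation that recovery requires as many steps as the original state change.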
In many environments, such as the Microsoft® Windows® operating system, state changes occur frequently. These state changes often interrupt the performance of user interface devices, such as keyboards and audio and video display devices. Therefore, a system is needed which enables a CPU to more efficiently perform state change operations.