1. Field of the Invention
The invention relates to the field of embedded processor architecture.
2. Description of Background Art
Conventional embedded processors, e.g., microcontrollers, support only a single hard real-time asynchronous process since they can only respond to a single interrupt at a time. Most software implementations of hardware functions, called virtual peripherals (VPs), respond asynchronously and thus their interrupts are asynchronous. Some examples of VPs include an Ethernet peripheral (e.g., 100 Mbit and 10 Mbit transmit and receive rates); high-speed serial standards peripherals, e.g., 12 Mbps USB and IEEE-1394 FireWire; voice processing and compression peripherals, e.g., ADPCM, G.729, and acoustical echo cancellation (AEC); an image processing peripheral; a modem; and a wireless peripheral, e.g., an IrDA (1.5 and 4 Mbps) or Bluetooth compatible system. These VPs can be used as part of a Home Phoneline Networking Alliance (HomePNA) system, a voice over internet protocol (VoIP) system, and various digital subscriber line systems, e.g., asymmetric digital subscriber line (ADSL), as well as traditional embedded system applications such as machine control.
An interrupt is a signal to the central processing unit (CPU) indicating that an event has occurred. Conventional embedded processors support various types of interrupts including external hardware interrupts, timer interrupts and software interrupts. In conventional systems, when an interrupt occurs the central processing unit (CPU) completes the current instruction, saves the CPU context (at a minimum the CPU saves the program counter), and jumps to the address of the interrupt service routine (ISR) that responds to the interrupt. When the ISR is complete and the interrupt has been responded to, the CPU executes a return-from-interrupt instruction and restores the CPU context and continues executing the main code from where the code was interrupted.
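The interrupt entry and return sequence described above can be illustrated with a minimal sketch. The `Cpu` class, its method names, and the addresses are hypothetical and model only the minimum context (the program counter); a real processor would typically save additional registers.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Toy CPU model; the fields and addresses are illustrative only."""
    pc: int = 0
    stack: list = field(default_factory=list)

    def enter_interrupt(self, isr_address: int) -> None:
        # Save the minimum context (the program counter), then jump to the ISR.
        self.stack.append(self.pc)
        self.pc = isr_address

    def return_from_interrupt(self) -> None:
        # Restore the saved program counter so the main code resumes
        # from where it was interrupted.
        self.pc = self.stack.pop()


cpu = Cpu(pc=0x0100)         # main code is executing at 0x0100 (assumed)
cpu.enter_interrupt(0x0800)  # hypothetical ISR address
in_isr_pc = cpu.pc
cpu.return_from_interrupt()
resumed_pc = cpu.pc
```

After the return-from-interrupt step, `resumed_pc` holds the address at which the main code was interrupted.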
If multiple interrupts are received the CPU must be capable of servicing them. In one conventional system, if a second interrupt occurs during the processing of a first interrupt the second interrupt is ignored until the first interrupt is serviced. FIG. 1 is an illustration of a conventional interrupt response. The “main” code is interrupted by a first interrupt “A INT” and the CPU then processes the ISR for this interrupt (ISR A). The second interrupt (B INT) is received while the first interrupt is being processed. In this conventional system the ISR for the second interrupt does not begin until the ISR for the first interrupt is completed.
In another conventional system, two levels of interrupt priority are utilized with the rule that a higher priority interrupt can interrupt a lower priority interrupt but not an equal priority interrupt.
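The two-level priority rule can be stated as a single comparison. The sketch below assumes that a larger number means a higher priority; the function name and the encoding are illustrative and are not taken from any particular processor.

```python
LOW, HIGH = 0, 1  # assumed encoding: larger value means higher priority


def can_preempt(new_priority: int, running_priority: int) -> bool:
    """Return True if a newly arrived interrupt may interrupt the running ISR.

    Only a strictly higher priority preempts; an equal priority must wait.
    """
    return new_priority > running_priority
```

Under this rule, `can_preempt(HIGH, LOW)` is true, while `can_preempt(HIGH, HIGH)` and `can_preempt(LOW, HIGH)` are both false.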
Embedded processors have a number of interrupt sources and so there must be some way of selecting which sources can interrupt the processor. In conventional systems this is done by using control (mask) registers to select the desired interrupt sources.
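As a minimal sketch of source selection with a mask register, the following example treats each interrupt source as one bit. The bit assignments and names are hypothetical; actual register layouts vary by processor.

```python
# Hypothetical bit assignments, one bit per interrupt source.
TIMER_IRQ = 1 << 0
UART_IRQ = 1 << 1
GPIO_IRQ = 1 << 2


def enabled_pending(pending: int, mask: int) -> int:
    """Return only those pending interrupts whose mask bit is set."""
    return pending & mask


mask = TIMER_IRQ | GPIO_IRQ    # timer and GPIO sources are enabled
pending = UART_IRQ | GPIO_IRQ  # UART and GPIO have raised requests
active = enabled_pending(pending, mask)
```

Here only the GPIO request reaches the processor, because the UART source is masked off.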
As described above, when an interrupt occurs the CPU loads the appropriate interrupt service routine (ISR) address into the program counter. One implementation of this is to use the interrupt number as an index into random access memory (RAM), e.g., using an interrupt vector table, to find the dynamic ISR address (such as is used in Intel's 80x86 processors). The size of the interrupt vector table is normally limited by having a limited number of interrupts or by grouping interrupts together to use the same address. Grouped interrupts are then further analyzed to determine the source of the interrupt.
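A vector table lookup with grouped interrupts might look like the following sketch. Python callables stand in for ISR addresses, and the handler names, the interrupt numbering, and the status register bits are all assumptions made for illustration.

```python
def timer_isr(status: int) -> str:
    return "timer"


def shared_serial_isr(status: int) -> str:
    # Grouped interrupts share one table entry, so the handler inspects a
    # (hypothetical) status register to determine which source fired.
    if status & 0x1:
        return "serial rx"
    if status & 0x2:
        return "serial tx"
    return "spurious"


# Index = interrupt number. Interrupts 1 and 2 are grouped on one handler,
# which keeps the number of distinct ISR addresses small.
VECTOR_TABLE = [timer_isr, shared_serial_isr, shared_serial_isr]


def dispatch(interrupt_number: int, status: int = 0) -> str:
    handler = VECTOR_TABLE[interrupt_number]  # table lookup by index
    return handler(status)
```

For example, `dispatch(1, 0x1)` and `dispatch(2, 0x2)` reach the same handler, which then separates the receive and transmit sources.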
A problem with processing one or more interrupts is that, as described above, the context of the CPU must be stored before the interrupt is processed. This is necessary in order for the CPU to be able to continue processing after the ISR from the same position it was in before the interrupt was received. The storing of the context information such as the program counter and other various registers usually takes at least one clock cycle, and often many more. This delay reduces the effective processing speed of the CPU.
Context storing is used in many conventional processors, e.g., RISC-based processors, which include a single register set, e.g., 32 registers (R0 to R31). These registers are often insufficient for a desired processing task. Accordingly, the processor must save and restore the register values frequently in order to switch contexts. Switching contexts may occur when servicing an interrupt or when switching to another program thread in a multithreading environment. The old context values are saved onto a stack using instructions, the context is switched, and then the previous context for the new thread is restored by pulling its values off the stack using instructions. This causes a variety of problems including (1) significantly reducing the performance of the processor because of the save and restore operations needed for each context switch, and (2) preventing some time critical tasks from executing properly because of the overhead required to switch contexts.
For example, suppose a program needs to read a port location to capture its value every 100 clock cycles, and the read operation itself takes only 5 clock cycles. If the context switch requires 32 registers to be saved and restored, and the save operation and the restore operation each require two instructions per register, then the context switch and restoration require 128 instructions (32 registers × 2 instructions × 2 operations). Even at one clock cycle per instruction, this overhead alone exceeds the 100-cycle period, which prevents the successful completion of the task since the read operation must occur every 100 clock cycles.
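The arithmetic in this example can be checked with a short calculation. The register count, instructions per register, read time, and sampling period come from the example above; the figure of one clock cycle per instruction is an assumption chosen as a best case.

```python
REGISTERS = 32
INSTRUCTIONS_PER_REGISTER = 2  # for the save, and again for the restore
READ_CYCLES = 5
PERIOD_CYCLES = 100
CYCLES_PER_INSTRUCTION = 1     # assumption: best case, one cycle each

save_instructions = REGISTERS * INSTRUCTIONS_PER_REGISTER
restore_instructions = REGISTERS * INSTRUCTIONS_PER_REGISTER
context_switch_instructions = save_instructions + restore_instructions

total_cycles = (context_switch_instructions * CYCLES_PER_INSTRUCTION
                + READ_CYCLES)
meets_deadline = total_cycles <= PERIOD_CYCLES
```

Under these assumptions the save and restore take 128 instructions, the total is 133 cycles per sample, and `meets_deadline` is false.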
Conventional systems have attempted to resolve the problem by using dedicated hardware for time critical tasks or by using front-end dedicated logic to capture the data and put it in a first in first out (FIFO) buffer to be processed by software. Problems with these techniques are that (1) they require dedicated front-end logic, and (2) they require more memory, e.g., a FIFO, which increases die space and cannot be used for any other function.
Another problem with conventional embedded processing systems for processing interrupts is that interrupts that have critical timing requirements may fail. With reference to FIG. 1, if interrupt A and interrupt B are both time-critical, they may be scheduled such that they both have a high priority (if priorities are available) and although interrupt A is processed in a timely manner, interrupt B is not processed until after interrupt A has been processed. This delay may cause interrupt B to fail since it is not processed in a predefined time. That is, conventional systems do not provide reasonable certainty regarding when an interrupt will be processed.
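The timing failure described with reference to FIG. 1 can be reproduced with a small model in which ISRs run one at a time in arrival order, without preemption. The arrival times, ISR durations, and the deadline for interrupt B are hypothetical numbers chosen only to show the effect.

```python
def completion_times(arrivals, durations):
    """Serve ISRs one at a time in arrival order (no preemption).

    Returns the cycle at which each ISR finishes.
    """
    finish = []
    cpu_free_at = 0
    for arrival, duration in zip(arrivals, durations):
        start = max(arrival, cpu_free_at)  # wait until the CPU is free
        cpu_free_at = start + duration
        finish.append(cpu_free_at)
    return finish


# Hypothetical numbers: interrupt A arrives at cycle 0 and its ISR takes
# 80 cycles; interrupt B arrives at cycle 10, its ISR takes 20 cycles, and
# B must be fully serviced within 50 cycles of arriving.
B_ARRIVAL, B_DEADLINE = 10, 50
finish_a, finish_b = completion_times([0, B_ARRIVAL], [80, 20])
b_latency = finish_b - B_ARRIVAL
b_meets_deadline = b_latency <= B_DEADLINE
```

With these numbers, ISR B cannot start until cycle 80 and finishes at cycle 100, so its latency is 90 cycles and its deadline is missed even though its own ISR needs only 20 cycles.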
An embedded processor is a processor that is used for specific functions. Embedded processors generally have some memory and peripheral functions integrated on-chip. Conventional embedded processors have not been capable of operating using multiple hardware threads.
A pipelined processor is a processor that begins executing a second instruction before the first instruction has completed execution. That is, several instructions are in a “pipeline” simultaneously, each at a different stage. FIG. 3 is an illustration of a conventional pipeline.
The fetch stage (F) fetches instructions from memory; usually one instruction is fetched per cycle. The decode stage (D) reveals the instruction function to be performed and identifies the resources needed. Resources include general-purpose registers, buses, and functional units. The issue stage (I) reserves resources. For example, pipeline control interlocks are maintained at this stage. The operands are also read from registers during the issue stage. The instructions are executed in one of potentially several execute stages (E). The last stage, writeback (W), is used to write results into registers.
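The overlap of instructions in such a pipeline can be visualized with a short sketch. It assumes an ideal pipeline with a single execute stage, one instruction fetched per cycle, and no stalls; the function name is illustrative.

```python
STAGES = ["F", "D", "I", "E", "W"]  # assumes a single execute stage


def pipeline_diagram(num_instructions: int) -> list:
    """One row per instruction; column c shows the stage occupied in cycle c.

    Assumes an ideal pipeline: one instruction enters per cycle, no stalls.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = ["."] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters stage s in cycle i + s
        rows.append("".join(row))
    return rows


for row in pipeline_diagram(3):
    print(row)
# FDIEW..
# .FDIEW.
# ..FDIEW
```

Each column is one clock cycle, so in the third cycle the first instruction is being issued while the second is decoded and the third is fetched.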
A problem with conventional pipelined processors is that because the speed of CPUs is increasing, it is increasingly difficult to fetch instruction opcodes from flash memory without having wait-states or without stalling the instruction pipeline. A faster memory, e.g., static RAM (SRAM), could be used to reduce instruction fetch times but requires significantly more space and power on the embedded processor. Some conventional systems have attempted to overcome this problem using a variety of techniques. One such technique is to fetch and execute from flash memory. This technique would limit the execution speed of conventional processors, e.g., to 40 million instructions per second (MIPS), which is unacceptable in many applications.
Another technique is to load the program code into fast SRAM from flash memory or other non-volatile memory and then to execute all program code directly from SRAM. As described above, the problem with this solution is that the SRAM requires significantly more space on the die (approximately five times the space necessary for comparable flash memory) and requires significantly more power to operate.
A third technique is to use flash memory and an SRAM cache. When the program reference is within the SRAM, full speed execution is possible, but otherwise a cache miss occurs that leads to a long wait during the next cache load. Such a system results in unpredictable and nondeterministic execution time that is generally unacceptable for processors that are real-time constrained. The real-time constraints are imposed by the requirement to meet the timing required by standards such as IEEE 802.3 (Ethernet), USB, HomePNA 1.1 or SPI (Serial Peripheral Interface). These standards require that a response be generated within a fixed amount of time from an event occurring.
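The variation in execution time caused by cache misses can be illustrated with a simple model. The hit time, miss penalty, and cache line size below are hypothetical values, and the model ignores eviction; it is a sketch of the effect, not a description of any particular cache.

```python
HIT_CYCLES = 1            # assumed cost of fetching from SRAM cache
MISS_PENALTY_CYCLES = 20  # hypothetical cost of loading a line from flash
LINE_SIZE = 4             # hypothetical number of instructions per line


def fetch_cycles(addresses, cached_lines):
    """Total fetch cycles for a sequence of instruction addresses."""
    cached = set(cached_lines)
    total = 0
    for address in addresses:
        line = address // LINE_SIZE
        if line in cached:
            total += HIT_CYCLES
        else:
            total += MISS_PENALTY_CYCLES  # wait for the cache load
            cached.add(line)              # line is now resident
    return total


routine = list(range(8))  # the same eight-instruction routine both times
best_case = fetch_cycles(routine, cached_lines=[0, 1])
worst_case = fetch_cycles(routine, cached_lines=[])
```

The same routine takes 8 fetch cycles when its lines are already cached and 46 when they are not, so its response time depends on earlier program behavior rather than on the routine itself.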
What is needed is a system and method that (1) enables multithreading in an embedded processor, (2) invokes zero-time context switching in a multithreading environment, (3) schedules multiple threads to permit numerous hard real-time and non-real-time priority levels, (4) fetches data and instructions from multiple memory blocks in a multithreading environment, and (5) enables a particular thread to store multiple states of the multiple threads in the instruction pipeline.
This invention can also be used with digital signal processors (DSP) where the invention has the advantages of allowing smaller memory buffers, a faster response time and a reduced input to output time delay.