1. Field of the Invention
This invention relates generally to a computer processor, and more specifically, to a network processor having an instruction memory hierarchy that distributes instructions to a plurality of processing units organized in clusters within the network processor.
2. Background Art
Until recently, limited network bandwidth constrained network performance. But emerging high-bandwidth network technologies now operate at rates that expose limitations within conventional computer processors. Even high-end network devices using state-of-the-art general-purpose processors are unable to meet the demands of networks with data rates of 2.4 Gb/s, 10 Gb/s, 40 Gb/s, and higher.
Network processors are a recent attempt to address the computational needs of network processing. Although limited to specialized functionality, network processors are flexible enough to keep up with frequently changing network protocols and architectures. Compared to general-purpose processors, which perform a variety of tasks, network processors primarily perform packet processing tasks using a relatively small amount of software code. Examples of specialized packet processing include packet routing, switching, forwarding, and bridging. Some network processors have arrays of processing units with multithreading capability to process more packets at the same time. However, current network processors have failed to address certain characteristics of network processing because they rely too heavily on general-purpose processing architectures and techniques.
Access to instructions is one problem associated with a processing array having a conventional memory architecture. During an instruction fetch stage in a processing unit pipeline, each processing unit retrieves, from a memory element, software code to be executed during an execution stage. In some typical processing arrays, each processing unit has dedicated instruction memory. But dedicated memory consumes valuable die area on the processor and inefficiently replicates the same code. In other typical processing arrays, processing units share a common instruction memory. But competition for memory access between processing units increases latencies during the instruction fetch pipeline stage. Furthermore, the memory has limited bandwidth and cannot deliver instructions at the rate required by a processing array performing packet processing.
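The contention described above can be illustrated with a minimal sketch. It assumes a hypothetical shared instruction memory that serves a fixed number of fetch requests per cycle in round-robin order, and processing units that always have a fetch outstanding. The function name and its parameters are illustrative only and do not describe any particular processor.

```python
from collections import deque


def simulate_shared_fetch(num_units, ports_per_cycle, cycles):
    """Return the average number of instructions fetched per unit per cycle.

    Assumption: every processing unit requests a new instruction as soon
    as its previous fetch is served (worst-case fetch demand).
    """
    fetched = [0] * num_units
    queue = deque(range(num_units))  # pending fetch requests, oldest first
    for _ in range(cycles):
        # The shared memory can serve only ports_per_cycle requests per cycle,
        # and at most one request per unit.
        for _ in range(min(ports_per_cycle, num_units)):
            unit = queue.popleft()
            fetched[unit] += 1
            queue.append(unit)  # the unit requests its next instruction
    return sum(fetched) / (num_units * cycles)


if __name__ == "__main__":
    for units in (1, 4, 16):
        rate = simulate_shared_fetch(units, ports_per_cycle=1, cycles=1000)
        print(units, "units:", rate, "instructions per unit per cycle")
```

Under these assumptions, the per-unit fetch rate falls in proportion to the number of units sharing the memory, which is the latency and bandwidth problem noted above.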
Limited instruction memory bandwidth causes more severe problems in processor arrays with multithreaded processing units. In hardware-level multithreading, a logically-partitioned processor streams instructions through its pipeline for more than one hardware thread at the same time to improve effective CPIs (Cycles Per Instruction). Each hardware thread can be associated with a different application program or network packet. When one thread experiences a stall caused by, for example, a memory access latency during the execution stage, the processing unit switches execution to a different thread rather than wasting execution cycles.
By contrast, in software-level multithreading, a single application program streams instructions to the processor using several software threads or processes. “Multithreading” and “threads” as used herein, however, refer to hardware multithreading and hardware instruction threads, respectively. Because multithreaded processing units have lower effective CPIs (i.e., more instructions are processed each cycle), even more instruction memory bandwidth is needed for instruction fetches. Moreover, since each thread has an independent instruction stream, there is even more contention for available memory bandwidth.
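The effect of hardware multithreading on CPI, and therefore on instruction fetch demand, can be sketched with a simplified model. It assumes, purely for illustration, that every instruction stalls its thread for a fixed number of cycles and that the processing unit issues from the next ready thread in round-robin order. The function and parameter names are hypothetical.

```python
def effective_cpi(num_threads, stall_cycles, total_cycles):
    """Return cycles per instruction for one multithreaded processing unit.

    Assumption: after issuing an instruction, a thread is stalled for
    stall_cycles cycles (for example, by a memory access latency).
    """
    ready_at = [0] * num_threads  # earliest cycle each thread may issue
    issued = 0
    next_thread = 0
    for cycle in range(total_cycles):
        # Switch to the first ready thread instead of wasting the cycle.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued += 1
                ready_at[t] = cycle + 1 + stall_cycles
                next_thread = (t + 1) % num_threads
                break
    return total_cycles / issued


if __name__ == "__main__":
    for threads in (1, 2, 4):
        print(threads, "threads: CPI =", effective_cpi(threads, 3, 400))
```

In this model, adding threads lowers the effective CPI until the stalls are fully hidden. Each issued instruction also requires a fetch, so the lower the CPI, the more instruction memory bandwidth the processing unit consumes.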
Therefore, what is needed is a processor including an instruction memory hierarchy and method of distributing instructions to an array of multithreaded processing units. Furthermore, there is a need for an instruction request arbiter and method for controlling instruction requests from the array of multithreaded processing units to the instruction memory hierarchy.