Computer architecture generally defines the functional operation, including the flow of information and control, among individual hardware units of a computer. One such hardware unit is a processing engine that contains arithmetic and logic processing circuits organized as a set of data paths. In some implementations, the data path circuits may be configured as a processor having operations that are defined by a set of instructions. The instructions are typically stored in an instruction memory and specify a set of hardware functions that are available on the processor. When implementing these functions, the processor generally processes “transient” data residing in a data memory in accordance with the instructions.
A high-performance processing engine may be realized by using a number of identical processors to perform certain tasks in parallel. For a purely parallel multiprocessor architecture, each processor may have shared or private access to non-transient data, such as program instructions (e.g., algorithms) stored in a memory coupled to the processor. Access to an external memory is generally inefficient because the execution capability of each processor is substantially faster than its external interface capability; as a result, the processor often idles while waiting for the accessed data. Moreover, scheduling of external accesses to a shared memory is cumbersome because the processors may be executing different portions of the program.
In an alternative implementation, the data paths may be configured as a pipeline having a plurality of processor stages. This configuration conserves internal memory space since each processor executes only a small portion of the program algorithm. A drawback, however, is the difficulty in apportioning the algorithm into many different stages of equivalent duration. Another drawback of the typical pipeline is the overhead incurred in transferring transient “context” data from one processor to the next in a high-bandwidth application.
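The cost of unevenly apportioned stages can be illustrated with a rough model. The sketch below assumes a simple steady-state pipeline in which one item completes per interval of the slowest stage; the function names and the stage times are hypothetical and are not taken from any particular engine.

```python
from typing import List


def pipeline_throughput(stage_times_ns: List[float]) -> float:
    """Items per second in steady state, assuming the slowest stage sets the pace."""
    bottleneck = max(stage_times_ns)
    return 1e9 / bottleneck


def stage_utilization(stage_times_ns: List[float]) -> List[float]:
    """Fraction of each bottleneck interval that every stage spends doing work."""
    bottleneck = max(stage_times_ns)
    return [t / bottleneck for t in stage_times_ns]


# Same total work (400 ns per item), split evenly and unevenly (made-up numbers).
balanced = [100.0, 100.0, 100.0, 100.0]
unbalanced = [50.0, 50.0, 50.0, 250.0]

balanced_rate = pipeline_throughput(balanced)
unbalanced_rate = pipeline_throughput(unbalanced)
unbalanced_util = stage_utilization(unbalanced)
```

Under this model the unevenly split pipeline is paced by its 250 ns stage, and the three shorter stages sit idle for most of each interval, which is why stages of equivalent duration are desirable.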
One example of such a high-bandwidth application involves the area of data communications and, in particular, the use of a parallel, multiprocessor architecture as the processing engine for an intermediate network station. The intermediate station interconnects communication links and subnetworks of a computer network to enable the exchange of data between two or more software entities executing on hardware platforms, such as end stations. The stations typically communicate by exchanging discrete packets or frames of data according to predefined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP), the Internet Packet Exchange (IPX) protocol, the AppleTalk protocol or the DECnet protocol. In this context, a protocol consists of a set of rules defining how the stations interact with each other.
A router is an intermediate station that implements network services such as route processing, path determination and path switching functions. The route processing function determines the type of routing needed for a packet, whereas the path switching function allows a router to accept a frame on one interface and forward it on a second interface. The path determination, or forwarding decision, function selects the most appropriate interface for forwarding the frame. A switch is also an intermediate station that provides the basic functions of a bridge including filtering of data traffic by medium access control (MAC) address, “learning” of a MAC address based upon a source MAC address of a frame and forwarding of the frame based upon a destination MAC address. Modern switches further provide the path switching and forwarding decision capabilities of a router. Each station includes high-speed media interfaces for a wide range of communication links and subnetworks.
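The bridge functions named above (filtering, learning and forwarding by MAC address) can be sketched in a few lines. This is a minimal illustration only, not the behavior of any particular switch: the class and method names are hypothetical, and flooding a frame whose destination is unknown is an assumption based on conventional bridge behavior rather than something stated in the text.

```python
from typing import Dict, List


class LearningSwitch:
    """Hypothetical model of bridge-style MAC learning, filtering and forwarding."""

    def __init__(self, ports: List[int]) -> None:
        self.ports = ports
        # MAC address -> port on which that address was last seen as a source.
        self.mac_table: Dict[str, int] = {}

    def handle_frame(self, in_port: int, src_mac: str, dst_mac: str) -> List[int]:
        """Return the list of ports on which the frame should be sent."""
        # Learning: record the port on which the source address arrived.
        self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Assumption: an unknown destination is flooded to every other port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Filtering: the destination is reachable via the ingress port itself.
            return []
        # Forwarding: send only toward the learned destination port.
        return [out_port]


switch = LearningSwitch(ports=[1, 2, 3])
first = switch.handle_frame(1, "aa:aa", "bb:bb")   # bb:bb not yet learned
reply = switch.handle_frame(2, "bb:bb", "aa:aa")   # aa:aa was learned on port 1
again = switch.handle_frame(1, "aa:aa", "bb:bb")   # bb:bb was learned on port 2
local = switch.handle_frame(1, "cc:cc", "aa:aa")   # both stations on port 1
```

In this sketch the table is populated purely from source addresses, so the switch needs no configuration before it can begin forwarding.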
Increases in the frame/packet transfer speed of an intermediate station are typically achieved through hardware enhancements for implementing well-defined algorithms, such as bridging, switching and routing algorithms associated with the predefined protocols. Hardware implementation of such an algorithm is typically faster than software because operations can execute in parallel more efficiently. In contrast, software implementation of the algorithm on a general-purpose processor generally performs the tasks sequentially because there is only one execution path. Parallel processing of conventional data communications algorithms is not easily implemented with such a processor, so hardware processing engines are typically developed and implemented in application-specific integrated circuits (ASICs) to perform various tasks of an operation at the same time. These ASIC solutions distinguish themselves by speed and the incorporation of additional requirements beyond those of the basic algorithm functions. However, the development process for such an engine is time-consuming and expensive and, if the requirements change, inefficient, since a typical solution to a changing requirement is to develop a new ASIC.
Such an ASIC solution may comprise an arrayed processing engine having a plurality of processor pipelines. Each element of the processor pipeline comprises a processor complex that includes, among other things, an instruction random access memory (IRAM) for storing executable program code routines and a central processing unit (CPU) that is programmable with respect to execution of the code. Each processor complex of a pipeline performs different processing on (packet) data propagating through various “stages” of the pipeline in accordance with a programmed code segment or routine. A code entry point for a particular routine is provided by an upstream CPU of each processor complex for each downstream CPU in the pipeline, thereby rendering the program code executed by each processor dependent on other processors in the engine.
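The dependency created by passing code entry points downstream can be sketched as follows. This is a simplified software model under stated assumptions, not the hardware design: the names `ProcessorComplex`, `run_pipeline`, `classify` and `forward_ip` are hypothetical, the IRAM is modeled as a dictionary of routines, and how the first stage obtains its entry point is left as a parameter.

```python
from typing import Callable, Dict, List, Tuple

# A routine takes the packet context and returns the updated context together
# with the entry point that the next (downstream) CPU should execute.
Routine = Callable[[dict], Tuple[dict, str]]


class ProcessorComplex:
    """Hypothetical stage: an IRAM of named routines plus a CPU that runs one."""

    def __init__(self, iram: Dict[str, Routine]) -> None:
        self.iram = iram

    def run(self, entry_point: str, context: dict) -> Tuple[dict, str]:
        # The CPU begins execution at the entry point supplied from upstream.
        return self.iram[entry_point](context)


def run_pipeline(stages: List[ProcessorComplex], first_entry: str, context: dict) -> dict:
    entry = first_entry
    for stage in stages:
        # Each stage's output selects the routine the downstream stage will run.
        context, entry = stage.run(entry, context)
    return context


def classify(ctx: dict) -> Tuple[dict, str]:
    # Illustrative routine: tag the packet and choose the downstream routine.
    return dict(ctx, kind="ip"), "forward_ip"


def forward_ip(ctx: dict) -> Tuple[dict, str]:
    # Illustrative routine: relies on the upstream stage having set "kind".
    assert ctx["kind"] == "ip"
    return dict(ctx, out_port=3), "done"


stages = [
    ProcessorComplex({"classify": classify}),
    ProcessorComplex({"forward_ip": forward_ip}),
]
result = run_pipeline(stages, "classify", {"dst": "10.0.0.1"})
```

In this model a downstream stage cannot start until the upstream stage has produced an entry point, so a stage that never runs leaves every later stage without one.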
Because of the size and complexity of such a highly integrated ASIC, it is rather difficult to build entirely functioning processor complexes, especially in the early yield learning of advanced semiconductor processes. As a result, a processor complex of a pipeline may fail during production of the ASIC, causing failure of the entire pipeline because data cannot be passed among the processor complexes of the pipeline. Since the code executed by a downstream processor complex is dependent upon the “work” previously performed by an upstream processor complex, a software developer who is developing code for the downstream processor of a pipeline depends upon and expects certain operations to have been performed in order to provide the correct scenario for the code. Failure of an upstream processor complex may impact such program code development.
Data bypassing capabilities are generally not required for processor stages of a conventional pipeline processor because each processor stage is typically “hardware assisted” in that there are specific circuits associated with the function performed by the stage on data passing through the pipeline. Therefore, a subsequent processor stage generally cannot be programmed to perform the function of a previous stage, completion of which is typically required prior to performance of the subsequent stage function.
Therefore, an object of the present invention is to provide a mechanism for isolating a processor complex of an arrayed processing engine.
Another object of the invention is to provide a mechanism for supplying an independent code entry point for a programmable processor of an isolated processor complex.
Yet another object of the present invention is to provide a mechanism for advancing code execution of a processor complex within a pipeline of the arrayed processing engine having an isolated processor complex without running code on the isolated processor.