Specialized integrated circuit designs are becoming more and more popular for a variety of applications, such as multimedia, networking and wireless communications. Such designs typically require one or more processors in the implementation of high performance embedded applications. Such designs also have aggressive design goals for performance, power and cost. Meanwhile, the high development cost and time to market considerations require a flexible, programmable platform for their implementation. These types of applications have inherent parallelism at many levels of granularity that can be exploited by identifying a set of tasks that can run concurrently on multiple processors. Thus, these applications are typically mapped to system-on-a-chip (SOC) architectures consisting of a heterogeneous mix of programmable processors and application specific subsystems.
Although such new architectures can achieve very high performance in terms of speed, the overall performance of applications still suffers from the limited speed of data input/output (I/O) operations. This is especially important in a multiprocessor SOC system where the application is partitioned to run on several processors, which need to move large amounts of data from one to another, in addition to in and out of the individual processors. The standard methods of moving data in and out of standard processors require multiple steps such as:
1. DMA controllers move data from a device into processor memory.
2. The processor consumes multiple instructions to set up a memory access that brings the data from processor memory into a register.
3. Some computation is performed on register operands.
4. The resulting products are flushed from a register back out to processor memory.
5. DMA controllers move data from processor memory out to a device.
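The cost of the five steps above can be made concrete with a small software model. The following Python sketch is illustrative only: the function name `conventional_io`, the dictionary standing in for processor memory, and the step log are all assumptions introduced here, not part of any described processor.

```python
def conventional_io(device_in, compute):
    """Model the five-step path for a list of input words.

    Returns (device_out, log), where log records every step taken.
    All names here are hypothetical and exist only for illustration.
    """
    memory = {}   # stands in for processor memory
    log = []
    n = len(device_in)

    # Step 1: DMA moves data from the device into processor memory.
    for addr, word in enumerate(device_in):
        memory[addr] = word
    log.append("dma_in")

    for addr in range(n):
        reg = memory[addr]        # Step 2: load from memory into a register.
        log.append("load")
        reg = compute(reg)        # Step 3: compute on the register operand.
        log.append("compute")
        memory[n + addr] = reg    # Step 4: store the result back to memory.
        log.append("store")

    # Step 5: DMA moves results from processor memory out to a device.
    device_out = [memory[n + addr] for addr in range(n)]
    log.append("dma_out")
    return device_out, log
```

In this model every input word costs one load and one store in addition to the useful computation, plus the two DMA transfers shared by the whole block.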
To avoid the memory bottleneck imposed by standard I/O methods, some designers may choose to implement most of their SOC functionality in custom RTL logic instead of on a programmable processor. This approach has its own drawbacks in that it inflexibly binds the custom design to a single application, requires a lengthy and arduous design cycle, and is quickly becoming prohibitively expensive. Various other strategies such as data pre-fetching, faster on-chip memories, multi-level cache hierarchies, and wider interfaces to memory have been applied to bridge the gap between processor and memory performance. But these approaches fail to satisfy the prodigious appetites of emerging embedded applications. Moreover, they may provide an ineffective tradeoff of performance for area and power in embedded applications, which often lack the pattern and locality of data references for which these methods were developed.
Worse yet, the advent of application specific processors in recent years has further widened the gap between processor speed and bandwidth, since processors customized to particular applications with special extensions are able to achieve huge performance improvements over general-purpose processors. Configurable and extensible processors such as the Xtensa processor from Tensilica, Inc. of Santa Clara, Calif., for example, lie between general-purpose processors and custom circuits. They allow the designer to enhance computational performance by adding custom data-path execution units, often very wide. Moreover, extension capabilities allow designers to add new instructions to the base processor that read and write their operands from a register file, either a customized register file added by the extension, or the existing register file in the base processor. However, data transfer to and from the execution units is still dependent upon the memory interface of the base processor. This can offset the performance gain in many applications, especially those involving high data bandwidth requirements such as networking or video.
Other processor architectures have also been developed that propose various alternative approaches to the conventional memory interface. One example is the ARC architecture from ARC International of San Jose, Calif.
The ARC architecture is a configurable processor architecture like the Xtensa. It provides two methods of interfacing to the processor besides the usual method of doing loads from and stores to memory on a dedicated processor memory interface. These are extension core registers and auxiliary registers, which do not contend with data moving over the main memory bus. The extension registers can be directly accessed by peripheral logic, enabling such devices to communicate with the processor. The auxiliary registers allow 32-bit memory-mapped access to registers and memory in an independent address space.
There are several shortcomings of the ARC architecture. The number and width of the extension and auxiliary registers are fixed. There is no handling of speculative reads or writes, or any option of reading/writing in a variable pipeline stage.
Another example architecture is the iWarp architecture, as described in, for example, S. Borkar et al., “iWarp: An Integrated Solution to High-Speed Parallel Computing”, Supercomputing '88. iWarp is a product of a joint effort between Carnegie Mellon University and Intel Corporation. The goal of the effort was to develop a powerful building block for various distributed memory parallel computing systems.
iWarp supports both systolic and message passing models of communication. In the systolic model of communication, the source cell program sends data items to the destination cell as it generates them, and the destination cell program can start processing the data as soon as the first word of input has arrived. In the message passing mode of communication, the communication agent puts a message in the local memory for the computation agent to read it, and takes data from the local memory to send it as a message to another processor. An iWarp system may also use FIFO queuing along the communication path between two iWarp cells.
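The difference between the two communication models lies in when the destination can begin work. The following Python sketch contrasts the two orderings under simplifying assumptions: a generator stands in for the link between cells, a list stands in for local memory, and the function names `systolic` and `message_passing` are hypothetical. It does not model iWarp hardware.

```python
def systolic(n, log):
    """Destination consumes each word as soon as the source produces it."""
    def source():
        for i in range(n):
            log.append(f"produce {i}")
            yield i               # word is forwarded immediately

    total = 0
    for word in source():
        log.append(f"consume {word}")
        total += word
    return total


def message_passing(n, log):
    """The whole message is deposited in local memory before it is read."""
    local_memory = []
    for i in range(n):
        log.append(f"produce {i}")
        local_memory.append(i)    # buffered; nothing is consumed yet

    total = 0
    for word in local_memory:
        log.append(f"consume {word}")
        total += word
    return total
```

Both functions return the same sum; only the interleaving recorded in `log` differs, with production and consumption alternating in the systolic case and fully separated in the message passing case.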
There are several shortcomings in the iWarp architecture. The communication link is only between two iWarp processors and is fixed (i.e. not “user defined” or configurable). Further, the number and width of the channels are fixed. There is no handling of speculative reads or writes, or any option of reading/writing in a variable pipeline stage.
Another example processor architecture is the Transputer. Transputers are high performance microprocessors developed by Inmos Ltd. (now ST Microelectronics) that support parallel processing through on-chip hardware.
Transputer microprocessors can be connected together by their serial links in application-specific ways and can be used as the building blocks for complex parallel processing systems. Four high speed links allow transputers to be connected to each other in arrays, trees and many other configurations. The communication links between processors operate concurrently with the processing unit and can transfer data simultaneously on all links without the intervention of the CPU.
The Transputer architecture has several shortcomings with respect to data interfaces. The transputer interfaces provide a fixed link between two transputers, not a link from a processor (i.e. one transputer) to any external logic. Further, the link consists of a fixed set of 4 serial interfaces, and the interfaces are not configurable in either width or number. There is no handling of speculative reads or writes, or any option of reading/writing in a variable pipeline stage.
Another alternative architecture is the IXP Network Processor architecture from Intel Corp. of Santa Clara, Calif. This architecture includes a number of microengines (which are basically RISC processors) with a dedicated dataflow link. Each microengine can write to the next one's register set, and these registers can be configured as a ring, where each microengine pushes data into it and the next pops data from it. These next neighbor registers can be read or written as operands to the regular ISA of the microengine (ME).
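The push/pop behavior of the next neighbor registers can be sketched as a bounded ring shared by two adjacent stages. The following Python model is a simplified illustration: the class `NextNeighborRing`, its `depth` parameter, and the helper `run_pipeline` are hypothetical names, and the model does not reflect the actual IXP register count, width, or instruction set.

```python
from collections import deque


class NextNeighborRing:
    """Bounded FIFO written by one stage and read by the next (illustrative)."""

    def __init__(self, depth):
        self.depth = depth        # fixed number of ring entries
        self.slots = deque()

    def push(self, word):
        """Upstream stage writes a word; returns False if the ring is full."""
        if len(self.slots) >= self.depth:
            return False
        self.slots.append(word)
        return True

    def pop(self):
        """Downstream stage reads a word; returns None if the ring is empty."""
        if not self.slots:
            return None
        return self.slots.popleft()


def run_pipeline(words, stage_a, stage_b, depth=4):
    """Feed words through two stages connected by one ring."""
    ring = NextNeighborRing(depth)
    pending = list(words)
    results = []
    while pending or ring.slots:
        # Upstream stage pushes its next result if there is room.
        if pending and ring.push(stage_a(pending[0])):
            pending.pop(0)
        # Downstream stage pops one word, if any, and processes it.
        word = ring.pop()
        if word is not None:
            results.append(stage_b(word))
    return results
```

The data never passes through a shared memory in this model; the upstream stage simply stalls when `push` reports that the ring is full.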
Although this architecture has certain advantages, the overall interface methodology is narrowly limited. For example, the next neighbor registers can connect only to another similar processor having the IXP architecture. Further, the number and width of these registers are predetermined and fixed. There is no handling of speculative reads or writes, or any option of reading/writing in a variable pipeline stage.
A final alternative example is the Queue Processor architecture. Specifically, the University of Electro-Communications, Japan, has proposed a produced order parallel queue processor architecture. To store intermediate results, the proposed system uses FIFO queue registers instead of random access registers. Data are inserted into the queue in produced order and can be reused.
Queue processors have nothing to do with interfacing a processor to external logic or to external queues. They use a queue (or a FIFO) instead of a register file inside the processor to store the input and output operands of instructions.
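The idea of replacing a register file with an internal operand queue can be illustrated with a generic queue machine. The following Python sketch is an assumption-laden simplification: the instruction names, the tuple encoding, and the function `run_queue_program` are hypothetical, and the sketch omits the operand reuse and parallel issue features of the proposed architecture.

```python
from collections import deque
import operator

# Hypothetical two-operand instructions for the illustration.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_queue_program(program):
    """Execute instructions whose operands come from the front of a FIFO
    and whose results go to the rear, with no register names involved."""
    queue = deque()
    for instr in program:
        if instr[0] == "load":
            queue.append(instr[1])            # enqueue a constant operand
        else:
            lhs = queue.popleft()             # first operand: front of queue
            rhs = queue.popleft()             # second operand: next entry
            queue.append(OPS[instr[0]](lhs, rhs))
    return list(queue)


# (1 + 2) * (7 - 3), with instructions ordered level by level so that each
# result is produced in the order in which it will later be consumed.
program = [
    ("load", 1), ("load", 2), ("load", 7), ("load", 3),
    ("add",), ("sub",), ("mul",),
]
result = run_queue_program(program)
```

Note that the queue in this model is entirely internal to the instruction stream, which is why such designs do not address communication with external logic.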
In summary, while other conventional proposals have sent data into and out of a processor without using the load/store unit, there are key shortcomings in all of them. They either have a fixed, dedicated link to another processor of the same architecture, or an address mapped interface to external logic. It would be preferable if the data could be calculated and written to the interface, or read from the interface and used in a calculation, all in one cycle. It would be further preferable if the number of interfaces could be configurable and the width of each interface could be independently configurable. It would be still further preferable if the pipeline stage when the interfaces are read or written could be configurable. Another important shortcoming is that in many of the prior art processors the new interfaces are connected to a similar processor, which is not a desirable restriction in many cases. And finally, it would be tremendously valuable if hardware and software models containing such novel interfaces could be created automatically.