This invention relates to array based computing devices. More particularly, this invention relates to a processing architecture that configures arbitrary sized data paths comprising chained processing elements.
Advances in semiconductor technology have greatly increased the processing power of a single chip general purpose computing device. The relatively slow increase in the inter-chip communication bandwidth requires modern high performance devices to use as much of the potential on chip processing power as possible. This results in large, dense integrated circuit devices and a large design space of processing architectures. This design space is generally viewed in terms of granularity, wherein granularity dictates that designers have the option of building very large processing units, or many smaller ones, in the same silicon area. Traditional architectures are either very coarse grain, like microprocessors, or very fine grain, like field programmable gate arrays (FPGAs).
Microprocessors, as coarse grain architecture devices, incorporate a few large processing units that operate on wide data words, each unit being hardwired to perform a defined set of instructions on these data words. Generally, each unit is optimized for a different set of instructions, such as integer and floating point, and the units are generally hardwired to operate in parallel. The hardwired nature of these units allows for very rapid instruction execution. In fact, a great deal of area on modem microprocessor chips is dedicated to cache memories in order to support a very high rate of instruction issue. Thus, the devices efficiently handle very dynamic instruction streams.
Most of the silicon area of modern microprocessors is dedicated to storing data and instructions and to control circuitry. Therefore, most of the silicon area is dedicated to allowing computational tasks to heavily reuse the small active portion of the silicon, the arithmetic logic units (ALUs). Consequently very little of the capacity inherent in a processor gets applied to the problem; most of the capacity goes into supporting a high diversity of operations.
Field programmable gate arrays, as very fine grain devices, incorporate a large number of very small processing elements. These elements are arranged in a configurable interconnected network. The configuration data used to define the functionality of the processing units and the network can be thought of as a very large semantically powerful instruction word allowing nearly any operation to be described and mapped to hardware.
Conventional FPGAs allow finer granularity control over processor operations, and dedicate a minimal area to instruction distribution. Consequently, they can deliver more computations per unit of silicon than processors, on a wide range of operations. However, the lack of resources for instruction distribution in a network of prior art conventional FPGAs make them efficient only when the functional diversity is low, that is when the same operation is required repeatedly and that entire operation can be fit spatially onto the FPGAs in the system.
Dynamically programmable gate arrays (DPGAs) dedicate a modest amount of on-chip area to store additional instructions allowing them to support higher operational diversity than traditional FPGAs. However, the silicon area necessary to support this diversity must be dedicated at fabrication time and consumes area whether or not the additional diversity is required. The amount of diversity supported, that is, the number of instructions supported, is also fixed at fabrication time. Furthermore, when regular data path operations are required all instruction stores are required to be programmed with the same data using a global signal broadcasted to all DPGAs.
The limitations present in the prior art FPGA and DPGA networks in the form of limited control over configuration of the individual FPGAs and DPGAs of the network severely limits the functional diversity of the networks. For example, in one prior art FPGA network, all FPGAs must be configured at the same time to contain the same configurations. Consequently, rather than separate the resources for instruction storage and distribution from the resources for data storage and computation, and dedicate silicon resources to each of these resources at fabrication time, there is a need for an architecture that unifies these resources. Once unified, traditional instruction and control resources can be decomposed along with computing resources and can be deployed in an application specific manner. Chip capacity can be selectively deployed to dynamically support active computation or control reuse of computational resources depending on the needs of the application and the available hardware resources.
A method and an apparatus for configuring arbitrary sized data paths comprising multiple context processing elements (MCPEs) are provided. According to one aspect of the invention, multiple MCPEs may be chained to form wider-word data paths of arbitrary widths. A first ALU of a first MCPE serves as the most significant byte (MSB) of the data path while a second ALU of a second MCPE serves as the least significant byte (LSB) of the data path. Carry chains are used to couple the MCPEs of the data path in order to chain forward a carry bit and back-propagate configuration signals through the data path. The ALUs of the data path are coupled using a left-going, or forward, carry chain for transmitting at least one carry bit from the LSB ALU to the MSB ALU. The MSB ALU comprises configurable logic for generating at least one signal in response to a carry bit received over the left-going carry chain, the at least one signal comprising a saturation signal and a saturation value. The saturation signal is generated using logic that tests for saturation in the data path.
The ALUs of the data path are coupled using a right-going carry chain for transmitting the saturation signal back down the data path. The right-going carry chain may comprise two lines coupled among the ALUs of the data path. The right-going carry chain comprises at least one back propagation channel. The saturation signal is transmitted from the MSB ALU through the ALUs of the data path to the LSB ALU using a first back propagation channel. Furthermore, a signal that selects a saturation value is transmitted from the MSB ALU to the LSB ALU using a second back propagation channel. The MCPEs of the data path use configurable logic to manipulate a resident bit sequence in response to the saturation signal transmitted thereby reconfiguring, or changing the operation of, the data path in response to the saturation signal. The carry chains support carry operations for non-local functions comprising minimum and maximum arithmetic functions.
These and other features, aspects, and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description and appended claims which follow.