This invention is directed to an expanded capacity virtual bit map processor and specifically to a processor using a unique memory configuration and address generation capability allowing a set of process elements to manipulate a data array that is larger than the number of processing elements.
The tremendous increase in circuit complexity created by advances in integrated circuit technology poses a serious challenge for computer aided design systems. Some designs may require days of computation time for synthesis or verification tasks. The increase in circuit size, complexity, and related growth in computational demand is creating an increasingly difficult design barrier. This invention is directed to a system and technique for penetrating the size/complexity barrier by exploiting the inherent parallelism of many circuit design problems which should significantly reduce the solution time. The system incorporates an array processor and cooperating registers for each processor to contain and operate on a problem having a number of data points which is much larger than the number of available processing elements.
Increasing system complexity is driving the solution time of design automation problems to unacceptable levels. The governing factors of solution time, or computer run-time, are: the algorithm's efficiency, the problem size, and the host machine's speed. The objective is to reduce run-times for current design automation problems within feasible economic constraints.
Since technological advances provide the capability for larger circuits, some combination of better algorithms or better machines is required if design times are to remain reasonable. Considerable improvements have been made on design automation algorithms for conventional machines; however, some problems like maze-routing have resisted significant run-time improvements. For some of these problems, the development of new hardware is a potential solution. Hardware solutions range from a faster general purpose computer to a single algorithm built into special purpose hardware.
An example of a faster machine is the CRAY-1. Its pipelined real-number processing capability is well suited for the matrix manipulations used in design automation tasks like analog circuit simulation or process simulation. However, extensive real number capabilities do not necessarily correspond to improved performance for other required design automation tasks that mainly use simple bit operations.
Representative of special purpose hardware is a graphics machine that converts a list of polygons into a raster-scan format for display on a conventional television screen. Examples of special purpose hardware for design automation are a system which implements a one layer maze-router and a system which implements design rule checks for an integrated circuit mask specification.
As a third alternative, special hardware can be designed to efficiently implement a range of tasks. For example, pipelined array processors are often used to enhance floating point arithmetic operations. Image processing machines can also be considered special purpose hardware capable of implementing a range of tasks.
It is an objective of this invention to provide an architecture for manipulating simple bit data structures: one and two dimensional bit arrays. Bit data structures are used in many design automation applications: design rule checking, routing, and boolean vector manipulation. Since bit operations are usually slow on conventional computers, an efficient bit processing machine can greatly reduce the run-time of many design automation programs.
The viability of a particular solution is determined by economic factors: development risk/cost, hardware cost, useful lifetime, and flexibility. For example, a large general purpose machine can provide enhanced performance over a potentially wide range of problems, but its hardware cost is large. Special purpose hardware has a fixed application and potentially limited flexibility but can have enhanced performance at reduced cost.
General purpose bit processing machines have not been built for design automation applications but have been used for image processing. Since very large bit processing rates are required for image processing applications, highly parallel machines have been used. Parallelism is achieved using two architectures: array and pipelined.
The difficulties with the previous approaches lie in the mismatch between image processing and design automation requirements. For example, none of the array processors can easily be configured to process problems of a size other than that of the array, or to access data from a specific location within the array. A serious limitation of the pipelined architecture is its inflexible data width. Since design automation bit processing requirements vary, a flexible architecture is required.
It is an objective of this invention to provide a parallel array architecture to implement a range of bit operations. An N.times.N array machine is disclosed which is capable of processing a virtual data array of dimensions L.times.M, the problem array being of much greater size than the processor array. For a large system, the size of an individual processor is crucial. A cell architecture and instruction set have been proposed in a paper entitled "A Parallel Bit Map Processor Architecture for DA Algorithms," by T. Blank, M. Stefik and Willem van Cleemput, published in the 18th Design Automation Conference Proceedings, pages 837-845, IEEE Computer Society and ACM, June 1981, and incorporated herein by reference.
General purpose bit processing machines have not been built for design automation purposes but have been used in the areas of cellular automata and image processing. Two different architectures are used: array and pipelined. The first machines proposed and built were configured as arrays. The Cytocomputer, Massively Parallel Processor, and LSI Adaptive Array Processor are known examples.
The computational requirements of image processing provide a large motivation for the development of bit processing architectures due to large image sizes. This is similar to certain DA problems where the bit map sizes are large and computationally expensive on SISD (Single Instruction and Single Data stream) machines. Typically, a picture is divided into a two dimensional lattice where each point on the plane represents the picture information at that point. Each picture element (pixel) represents the smallest resolution and is coded into m binary bits. Using this technique, an image can be represented in an N.times.N.times.m binary array.
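The N.times.N.times.m representation above can be sketched as m separate one-bit planes, one per pixel bit. The following is an illustrative sketch only (names and sizes are hypothetical, not from the specification):

```python
N, m = 4, 3  # a tiny 4 x 4 image with 3-bit pixels

def to_bit_planes(image, n, bits):
    """Return `bits` one-bit planes; plane k holds bit k of every pixel."""
    return [[[(image[y][x] >> k) & 1 for x in range(n)] for y in range(n)]
            for k in range(bits)]

image = [[(x + y) % (1 << m) for x in range(N)] for y in range(N)]
planes = to_bit_planes(image, N, m)

# Reassembling the m planes recovers every pixel value, confirming that the
# N x N x m binary array carries the full picture information.
rebuilt = [[sum(planes[k][y][x] << k for k in range(m)) for x in range(N)]
           for y in range(N)]
assert rebuilt == image
```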
Considering array architecture, the first work on a two dimensional image processing architecture was done by Unger in the late 1950's. His idea was to store an N.times.N pixel image in an N.times.N array of processors. Ideally, there would be an N.sup.2 speed improvement over a conventional SISD computer of the same cycle time.
His system was a classical SIMD (Single Instruction Multiple Data stream) machine wherein all processors operate synchronously on broadcast instructions from a master controller. Each processing element (PE) was a simple, one-bit machine with an accumulator, six one-bit registers, and direct connections to its eight nearest neighbors. The 14 instructions provided for loading/storing the accumulator, boolean operations with the registers, operations into registers, boolean operations with the values of the four orthogonal neighboring accumulators, and finally, the capability to ripple values between many processor cells in the same instruction. An additional feature was the logical OR connection of all cells to the master controller, which permitted data dependent master control. Using Unger's estimates, 170 logic gates and 11 memory elements would be required for each processing element. The basic instruction scheme proposed by Unger is utilized for the host computer in the proposed system; the scheme is disclosed in "A Computer Oriented Toward Spatial Problems", Proceedings of the IRE, pp. 1744-1750, IRE, October, 1958, incorporated herein by reference.
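The SIMD broadcast idea and the wired-OR feedback path described above can be illustrated with a minimal sketch (this models only the control style, not Unger's actual cell design; all names are hypothetical):

```python
class PE:
    """A one-bit processing element: accumulator plus six one-bit registers."""
    def __init__(self):
        self.acc = 0
        self.regs = [0] * 6

def broadcast(array, op):
    """Every cell executes the same broadcast instruction synchronously."""
    for row in array:
        for pe in row:
            op(pe)

N = 4
array = [[PE() for _ in range(N)] for _ in range(N)]

# Master controller broadcasts "load accumulator with 1" to all cells.
broadcast(array, lambda pe: setattr(pe, "acc", 1))
array[2][3].acc = 0  # one cell later cleared by some local operation

# Logical OR of all cells back to the master controller permits
# data dependent master control.
any_set = any(pe.acc for row in array for pe in row)
assert any_set
```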
With a pipelined architecture, an image is processed by serially passing through individual processor stages. For an algorithm requiring N processing steps, one pass processing is possible using N stages. Otherwise, multiple passes must be made through the pipe.
The structure disclosed by Lougheed and McCubbrey in "The Cytocomputer, A Practical Pipelined Image Processor", IEEE, ACM, May, 1980 is an example. Each stage can perform two transforms: one based on the eight nearest neighbor values (including itself) and one based on all eight bits of its own pixel value, where the function is preset by a master controller. The neighbor transform is capable of generating any function of the nine neighbor values, which permits shifting, expanding, shrinking, etc. of objects represented in the map. For the eight bit transformation of its own value, all 256 mappings are available. This permits ANDing, ORing, plane shuffling, etc. For a problem that requires 100 neighbor or boolean operations, 100 pipeline stages are needed to complete the processing in one pass. If one neighbor and then one boolean operation are required 100 times, only 100 stages are needed for one-pass processing, since each stage performs both transforms.
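A single stage of the kind described, a neighborhood transform followed by a 256-entry lookup on the pixel's own value, can be sketched as follows. This is a hedged approximation (wraparound edges and binary pixels are simplifying assumptions, not features of the Cytocomputer):

```python
def stage(image, neighbor_fn, lut):
    """One pipeline stage: any function of the 3 x 3 neighborhood,
    then a 256-entry lookup table applied to the result."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nbrs = [image[(y + dy) % h][(x + dx) % w]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = lut[neighbor_fn(nbrs) & 0xFF]
    return out

# Example: "expand" an object by ORing the neighborhood (a dilation),
# then pass the result through an identity lookup table.
identity = list(range(256))
img = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
dilated = stage(img, lambda n: 1 if any(n) else 0, identity)
assert all(v == 1 for row in dilated for v in row)
```

Chaining calls to `stage` models the one-pass pipeline: 100 chained stages complete 100 such transform pairs in a single pass over the image.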
The system uses raster scan order, which accesses the image pixels sorted first by increasing x location and then by increasing y location. Raster scan order is also used for generating television displays. The match between the serial data from a TV camera and the Cytocomputer's raster scan input format makes it a natural candidate for real time image processing applications.
An advantage of a pipelined architecture is its extensibility, since additional stages are easily added by breaking only one pipe connection. However, the advantage is offset by the serial nature of a pipelined architecture. The classical pipeline problem is handling a data dependent branch instruction, where the pipeline processors must be flushed and the data restored to the proper state at the time of the branch. Another potential problem is the fixed data width of the pipe (i.e. eight bits for the Cytocomputer). Processing is significantly more difficult for problems that require more bits than the pipe width. The serial pipeline nature also requires that the bulk storage be located elsewhere, so there is no potential parallelism in data access.
It is an object of the present invention to provide an array processor wherein the only restraint on the size of a problem to be handled is the size of the memories attached to each processor, and the ability to generate addresses to uniquely address each problem data point.
In an array architecture, N.times.N one-bit processors are connected in a rectangular array where all processors synchronously perform the same instruction broadcast by a master controller. Each processor can exchange data directly with its nearest neighbors and is capable of bit-serial arithmetic operations. Global data communications are possible by taking the boolean OR or AND operations over the N.times.N processor region. The Massively Parallel Processor and the array processor built by NTT further described below are examples.
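The bit-serial arithmetic attributed to these one-bit processors can be sketched briefly: operands arrive one bit per cycle, least significant bit first, and a single full adder with a carry flip-flop suffices per processing element. A minimal sketch, with hypothetical helper names:

```python
def bit_serial_add(a_bits, b_bits):
    """a_bits, b_bits: LSB-first lists of 0/1. Returns LSB-first sum bits.
    One full-adder step per cycle; `carry` is the carry flip-flop."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    out.append(carry)
    return out

def to_bits(n, width):
    """Integer to LSB-first bit list."""
    return [(n >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

assert from_bits(bit_serial_add(to_bits(13, 8), to_bits(29, 8))) == 42
```

An n-bit add thus costs n cycles per processor, but all N.times.N processors perform it simultaneously under the broadcast instruction stream.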
As of June 1982, the largest commercial array processor is the Massively Parallel Processor (MPP) described by K. E. Batcher, Architecture of a Massively Parallel Processor, IEEE, ACM, May, 1980, pp. 168-173.
The Array Unit (ARU) contains the 128.times.132 processing array. It is controlled by broadcast instructions from the Array Control Unit (ACU) which contains its own program store and can overlap its instruction execution with array control instructions. Higher level control and I/O interfacing is provided by the Program and Data Management Unit (PDMU). It is also capable of overlapping instruction execution with data I/O. Finally, a VAX 11/780 acts as the host computer.
Since the machine's primary application is image processing, each cell is tailored for that function. Each pixel value, containing a variable number of bits, is mapped onto a processor where both floating point and scalar operations are possible. Each processor is equipped with bit-serial arithmetic capability and local memory. Also for image algorithms, each processor cell connects to its nearest four orthogonal neighbors; however, only shifting of the one-bit processor value is possible during each cycle. The basic machine operation is SIMD; however, data dependent operations are possible through the mask register, since some instructions require a specific mask register state. Since image data is typically in a serial format and since computer mass storage devices are also serial, each processor is connected to a shift register which operates independently from the rest of the cell, permitting efficient data movement into the array. At the edge of the processing array, external switching networks provide the capability to connect the processors in a serpentine fashion, wrapped around, or simply providing a constant data input.
Since performance was the primary objective, hardware parallelism and the power of each processor are maximized. This makes the MPP system cost large. Additionally, the MPP does not have the ability to uniquely address the information in each processor or to reconfigure its processing size to dimensions larger than the number of processors.
An objective of the present invention is to provide both of these capabilities.
An array processor prototype is also disclosed which requires 1024 custom array processing chips and is controlled by a bit-slice processor. All communication with the processing array is through a 32-bit data bus connected to one edge and a 150-bit broadcast control word. The array controller is connected through an eight-bit I/O channel to a host computer.
As disclosed in the Digest of the IEEE International Solid-State Circuits Conference by Sudo et al. in "An LSI Adaptive Array Processor", each processor is composed of three units: two data transfer units and a register/accumulator unit. Each unit is capable of performing simultaneous independent operations. Neighbor unit one is directly connected to its nearest orthogonal and diagonal neighbors, permitting both signal propagation and reception from eight sources. Neighbor unit two only provides two-direction transfers, up and down. The register/accumulator unit is composed of two register banks containing 32 and 64 one-bit words, and an arithmetic unit capable of performing bit-serial data operations. One of the most interesting features of each processor is control generation. The fundamental mode uses the instruction stream broadcast globally throughout the processing array; however, the global instructions are modified by the register contents located in each cell. This permits data-dependent operations so that subregions of the processing array can be specially configured. For example, the array could be conceptually divided into groups of eight-bit words, permitting a ripple carry to propagate within each word group.
However, the LSI Adaptive Array Processor cannot be reconfigured to handle problems larger than the number of physical processors, nor can individual processors be uniquely addressed. The processing potential in each PE is significantly larger than the requirements of design automation problems.
In summary, the two basic bit processing architectures in the prior art are: pipeline and array. For use in design automation tasks, flexibility to adapt to a wide range of algorithms and data formats is important. Easy hardware expansion of the pipeline architecture is an advantage but is outweighed by its inflexibility: the data storage is located outside of the machine, and data dependent branches are difficult to efficiently control. The array architectures discussed possess greater flexibility but still fall short of the requirements for use in design automation algorithms. For example, no architecture can be configured to process problems larger than the number of physical processors or to read the information from a single processor.
The article by Blank, Stefik, and van Cleemput, incorporated herein, describes an N.times.N array processor that overcomes some of the limitations and omissions of the previous architectures. Some details of the system components of this article are discussed below.
An objective of the present invention is to expand on the work disclosed in this article by an improved memory configuration and processing element addressing scheme. The article reviews the design of a very small processing cell that can be used to implement a very large system.
In the Bit Map Processor (BMP) of the above article, the major components of the proposed system architecture are:
Host Computer System
Broadcasts all instructions and data to the processing array. The instruction format used is similar to that of the Unger article, incorporated above.
BMP Control
Regulates I/O between the host and the PE array.
Edge Control
Buffers the data exchange between the host computer system and the PE array. In the preferred embodiment of the present invention, it serves to provide the data which would otherwise lie outside the boundary of the data array. As will be seen, this data, while not itself operated on, must be provided to permit execution of the Neighbor instruction.
PE Array
Contains an N.times.N array of bit processing elements. Instructions are broadcast to all processing nodes simultaneously, and all operations are performed synchronously. This machine is similar to a classical SIMD architecture except that both row and column select lines must be enabled before a processor may change state. Area selection permits the array to adapt to different data formats on an instruction by instruction basis and to address small regions.
Using a simple accumulator/register design, FIG. 7 shows a simple processor cell design. The function of each module shown in FIG. 7 is:
Cell Enable
The cell enable unit generates the only cell unique control signal, cell enable, which is generated from the logical AND of the row and column enable lines. This signal is used by the accumulator, register bank, and wire-OR circuits since they contain or transmit the only state information. Only cells enabled by both row and column selects are allowed to change state. The scheme for addressing each processing element is a key feature of the present invention, and as such will be discussed in greater detail below.
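The row/column enable scheme described above can be sketched as follows; this is an illustrative model only (sizes and names are hypothetical), showing how the AND of the two select lines carves out a rectangular region, down to a single cell:

```python
N = 8  # an 8 x 8 processing array for illustration

# Row and column enable registers select rows 2-4 and columns 3-5.
row_en = [1 if 2 <= r <= 4 else 0 for r in range(N)]
col_en = [1 if 3 <= c <= 5 else 0 for c in range(N)]

def cell_enabled(r, c):
    """A cell may change state only if BOTH select lines are asserted."""
    return row_en[r] & col_en[c]

enabled = [(r, c) for r in range(N) for c in range(N) if cell_enabled(r, c)]
assert len(enabled) == 9                      # the 3 x 3 enabled region
assert (3, 4) in enabled and (0, 0) not in enabled
```

Enabling exactly one row and one column addresses a single processing element, which is the basis of the unique-addressing capability discussed below.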
Reg Bank
The register bank is comprised of dynamic register cells. The cell refresh, read, and write circuits are included; however, the row and column selection circuits are not located within the cell boundary.
Accumulator
The accumulator is a one-bit register used as the default operand for all instructions. Since it is accessed on nearly every instruction, it can be a simple clocked storage register.
Wire OR I/O
The wire-OR unit OR's the accumulator value onto the global row and column lines if the cell is enabled. The unit also generates the logical OR of the row and column lines when they are used for cell input.
MUX
The accumulator input multiplexor simply selects between the three possible input sources: external data input, neighbor unit, and the logical unit (LU).
LU
The logical unit is used to calculate all the boolean instructions of two operands. A four to one multiplexor can implement the functions.
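The observation that a four-to-one multiplexor realizes any boolean function of two one-bit operands follows because the four function-select inputs are simply the function's truth table, with the operands driving the select lines. A minimal sketch:

```python
def lu(truth_table, a, b):
    """truth_table: 4 bits indexed by (a, b); implements any f(a, b)
    exactly as a four-to-one multiplexor would."""
    return truth_table[(a << 1) | b]

AND = [0, 0, 0, 1]   # truth table of a AND b
XOR = [0, 1, 1, 0]   # truth table of a XOR b
assert [lu(AND, a, b) for a in (0, 1) for b in (0, 1)] == [0, 0, 0, 1]
assert [lu(XOR, a, b) for a in (0, 1) for b in (0, 1)] == [0, 1, 1, 0]
```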
Neigh Unit
The neighbor unit performs the masked logical OR of five possible values: the accumulator and four orthogonal neighbors. The functionality of the present system is based on each processing element being able to access these five values. The entire function can be generated in one AND/OR gate.
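The masked logical OR of the five values can be sketched in a few lines; the ordering of the five inputs here is a hypothetical choice for illustration:

```python
def neighbor_or(mask, values):
    """Masked OR of five one-bit inputs ordered (self, N, E, S, W):
    each mask bit gates its input (AND), and the gated terms are ORed,
    i.e. a single AND/OR gate level."""
    return int(any(m & v for m, v in zip(mask, values)))

values = [0, 1, 0, 0, 1]                           # north and west hold 1
assert neighbor_or([1, 0, 0, 0, 0], values) == 0   # self only
assert neighbor_or([0, 1, 1, 1, 1], values) == 1   # the four neighbors
```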
Local Control Unit
The local control unit generates the primary control signals that are used throughout the entire cell. The inputs are the row and column select lines, clocks, and the opcode lines; the generated signals are: cell select, write memory, write accumulator, and MUX control.
The cell instruction set is divided into five categories: Boolean, load/store, read/write, enable and neighbor instructions. A complete instruction set is given in Blank et al article, incorporated herein by reference.
The system level instructions provide the capability to: read/write all system registers, set/clear all system registers, write the array instruction, and enable a region for cell operation. The enabled array region can be set in two distinct ways: either by setting the lower and upper corners of a rectangular region or by setting the row and column enable registers directly. App. B shows a proposed system instruction set.
In using the system described above, an N.times.N problem is mapped into N.sup.2 processors. However, in this system as in all other known systems, problems that require more processor storage or are larger than the number of processors are highly penalized for moving data across the processing array boundary. Moreover, the practical economic fact is that an attempt to build an N.times.N machine for an N.times.N problem will likely fail, since design automation problems are constantly changing and growing. The optimal machine architecture must be reconfigurable and able to contain problems larger than the number of processors.
An objective of the present invention is to describe such a processor.
The prior art has not produced systems in which the problem can be larger than the physical number of processors. This application is directed to a reconfigurable architecture capable of manipulating problems larger than the number of physical processors. An implementation for a K.times.K processor array capable of containing an L.times.M problem is disclosed.
The implementation depends in part on a mapping technique that allows an array processor to manipulate problems larger than the number of physical processors. The technique folds, or cuts and stacks, a problem onto the physical array so that each processor contains and operates on many problem points. Proper mapping allows the neighbor instruction, which requires two dimensional processor interconnections over the problem area, to be efficiently implemented. Finally, by requiring an even number of vertical segments, only one vertical edge register is required; similarly, only one horizontal edge register is required if the folding technique is used.
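The cut-and-stack idea can be illustrated with a small sketch. This is one plausible reading of the mapping for illustration only, not the patented addressing hardware: the virtual L.times.M bit map is cut into K.times.K tiles, and the tiles are stacked into each processor's local memory so that every virtual point maps to a unique (processor, local address) pair.

```python
K = 4          # physical array is K x K
L, M = 8, 12   # virtual problem is L x M (multiples of K for simplicity)

def virtual_to_physical(y, x):
    """Map virtual point (y, x) to a physical processor and a local
    memory address (the index of the stacked tile holding it)."""
    r, c = y % K, x % K                    # which physical processor
    tile = (y // K) * (M // K) + (x // K)  # which stacked tile
    return (r, c), tile

# The mapping is one-to-one over the whole virtual array.
seen = set()
for y in range(L):
    for x in range(M):
        seen.add(virtual_to_physical(y, x))
assert len(seen) == L * M

# Each processor holds (L*M)/(K*K) problem points in its local memory.
assert L * M // (K * K) == 6
```

Under such a mapping, most neighbor accesses stay inside one processor or cross to an adjacent physical processor; only accesses across tile boundaries need the edge-register support described above.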
Six basic concepts allow a small number of physical processors to efficiently manipulate a large problem:
1. Each physical processor contains a large amount of accumulator and register storage, in contrast to the processors described in the prior art.
2. The problem is completely contained within the physical processor array memory, so the data is not moved across the array boundary.
3. Each processor contains and manipulates data from many problem points.
4. The mapping between physical processors and the virtual bit map is folded or cut and stacked for simplification of storage.
5. A highly efficient addressing scheme for addressing any data point within the virtual array is developed.
6. A simplified memory-processor element arrangement is described for each processing point. Three main types are necessary: a first for processor elements in the center of the array such as 5 (FIG. 1); a second for processor elements 6 on the edge of the array; and a third 7 for processor elements on the corners of the array. All three types are designed so a processor element may easily access its own value and the four values on either side of the element.
In an illustrative embodiment of the present invention, each bit processing element is a programmable logic array type 82S100, having 16 inputs, 8 outputs, and 48 available AND terms. It is programmed according to the algorithm of Appendix A.
In describing this invention, reference can be made to the following figures, to be explained in greater detail below.