A number of techniques have been proposed for improving the speed and cost of moderately complex computer program applications. Here, moderately complex computer programming means programming of about the same general level of complexity as multimedia processing.
Multimedia processing is becoming increasingly important, with a wide variety of applications ranging from multimedia cell phones to high definition interactive television. Media processing involves the capture, storage, manipulation and transmission of multimedia objects such as text, handwritten data, audio objects, still images, 2D/3D graphics, animation and full-motion video. A number of implementation strategies have been proposed for processing multimedia data. These approaches can be broadly classified based on the evolution of processing architectures and the functionality of the processors. In order to provide media processing solutions to different consumer markets, designers have combined some of the classical features from both the functional and evolution-based classifications, resulting in many hybrid solutions.
Multimedia and graphics applications are computationally intensive and have traditionally been implemented in three different ways. The first is through the use of a high speed general purpose processor with accelerator support, which is essentially a sequential machine with an enhanced instruction set architecture. Here the overlying software bears the burden of interpreting the application in terms of the limited tasks that the processor can execute (instructions) and of scheduling these instructions to avoid resource and data dependencies. The second is through the use of an Application Specific Integrated Circuit (ASIC), which is a completely hardware oriented approach that spatially exploits parallelism to the maximum extent possible. The first approach, although slower, offers the benefit of hardware reuse for executing other applications. The second, albeit faster and more power, area and time efficient for a specific application, offers poor hardware reutilization for other applications. The third is through specialized programmable processors such as DSPs and media processors. These attempt to incorporate the programmability of general purpose processors and provide some amount of spatial parallelism in their hardware architectures.
The complexity, the variety of techniques and tools, and the high computation, storage and I/O bandwidths associated with multimedia processing present opportunities for reconfigurable processing to enable features such as scalability, maximal resource utilization and real-time implementation. The relatively new domain of reconfigurable solutions lies in the region of computing space that offers the advantages of these approaches while minimizing their drawbacks. Field Programmable Gate Arrays (FPGAs) were the first attempts in this direction, but poor on-chip network architectures led to high reconfiguration times and power consumption. Improvements over this design using hierarchical network architectures with RAM-style configuration loading have led to a two- to four-fold reduction in individual configuration loading times. But the amount of redundant and repetitive configurations still remains high. This is one of the important factors that lead to large overall configuration times and high power consumption compared to ASIC or embedded processor solutions.
A variety of media processing techniques are typically used in multimedia processing environments to capture, store, manipulate and transmit multimedia objects such as text, handwritten data, audio objects, still images, 2D/3D graphics, animation and full-motion video. Example techniques include speech analysis and synthesis, character recognition, audio compression, graphics animation, 3D rendering, image enhancement and restoration, image/video analysis and editing, and video transmission. Multimedia computing presents challenges from the perspectives of both hardware and software. For example, multimedia standards such as MPEG-1, MPEG-2, MPEG-4, MPEG-7, H.263 and JPEG 2000 involve execution of complex media processing tasks in real time. The need for real-time processing of complex algorithms is further accentuated by the increasing interest in 3-D image and stereoscopic video processing. Each medium in a multimedia environment requires different processes, techniques, algorithms and hardware. The complexity, the variety of techniques and tools, and the high computation, storage and I/O bandwidths associated with processing at this level of complexity present opportunities for reconfigurable processing to enable features such as scalability, maximal resource utilization and real-time implementation.
To demonstrate the potential for reconfiguration in multimedia computations, the inventors have performed a detailed complexity analysis of the recent multimedia standard MPEG-4. The results show that there are significant variations in the computational complexity among the various modes/operations of MPEG-4. This points to the potential for extensive opportunities for exploiting reconfigurable implementations of multimedia/graphics algorithms.
The availability of large, fast FPGAs (field programmable gate arrays) is making reconfigurable implementations possible for a variety of applications. FPGAs consist of arrays of Configurable Logic Blocks (CLBs) that implement various logical functions. The latest FPGAs from vendors like Xilinx and Altera can be partially configured and run at several megahertz. Ultimately, computing devices may be able to adapt the underlying hardware dynamically in response to changes in the input data or processing environment and to process real-time applications. Thus FPGAs have established a point in the computing space that lies between the dominant extremes of computing: ASICs and software programmable (instruction set based) architectures. Three dominant features differentiate reconfigurable architectures from instruction set based programmable computing architectures and ASICs: (i) spatial implementation of instructions through a network of processing elements, without an explicit instruction fetch-decode model; (ii) flexible interconnects that support task dependent data flow between operations; and (iii) the ability to change the arithmetic and logic functionality of the processing elements. The reprogrammable space is characterized by the allocation and structure of these resources. Computational tasks can be implemented on a reconfigurable device with intermediate data flowing from the generating function to the receiving function. The salient features of reconfigurable machines are:
Instructions are implemented through locally configured processing elements, thus allowing the reconfigurable device to effectively pack more instructions into active silicon in each cycle.
Intermediate values are routed in parallel from producing functions to consuming functions (as space permits) rather than forcing all communication to take place through a central resource bottleneck.
Memory and interconnect resources are distributed and are deployed based on need rather than being centralized, hence presenting opportunities to extract parallelism at various levels.
The networks connecting the Configurable Logic Blocks (CLBs) or processing elements can range from full-connectivity crossbars to neighbor-only mesh networks. The best characterization to date that empirically measures the growth in interconnection requirements with respect to the number of Look-Up Tables (LUTs) is Rent's rule, which is given as follows:
\[ N_{io} = C \, N_{gates}^{p} \]
where $N_{io}$ is the number of interconnections (in/out lines) in a region containing $N_{gates}$ gates, and $C$ and $p$ are empirical constants. For logical functions, $p$ typically lies in the range $0.5 < p < 0.7$.
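As a rough illustration of how Rent's rule scales, the following sketch evaluates the relation for a few region sizes. The function name `rent_io` and the constants C = 2.5 and p = 0.6 are assumptions chosen only for illustration; they are not values taken from this document.

```python
def rent_io(n_gates: int, c: float, p: float) -> float:
    """Estimate the number of in/out lines of a region with n_gates gates
    using Rent's rule: N_io = C * N_gates ** p."""
    if n_gates <= 0:
        raise ValueError("n_gates must be positive")
    return c * n_gates ** p


if __name__ == "__main__":
    # Assumed, illustrative constants (not from the source text).
    C, P = 2.5, 0.6
    for n in (64, 256, 1024):
        print(f"{n:5d} gates -> about {rent_io(n, C, P):.1f} I/O lines")
```

Because p is below 1, the interconnection requirement grows more slowly than the gate count, which is the property a hierarchical network design relies on.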
It has been shown [1] (by building the FPGA based on Rent's model and using a hierarchical approach) that the configuration instruction sizes in traditional FPGAs are higher than necessary by at least a factor of two to four. Consequently, when rapid configuration is required, off-chip context loading becomes slow because of the large amount of configuration data that must be transferred across a limited-bandwidth I/O path. It has also been shown that greater word widths increase wiring requirements while decreasing switching requirements. In addition, larger granularity data paths can be used to reduce instruction overheads. The utility of this optimization largely depends on the granularity of the data to be processed. However, if the architectural granularity is larger than the task granularity, the device's computational power will be underutilized. Another promising development in efforts to reduce configuration time is shown in [2].
Most of the current approaches towards building a reconfigurable processor are targeted towards performance in terms of speed and are not tuned for power awareness or configuration time optimization. Therefore certain problems have surfaced that need to be addressed at the pre-processing phase.
First, the granularity or the processing ability of the Configurable Logic Units (CLUs) must be driven by the set of applications that are intended to be ported onto the processing platform. Some research groups have taken the approach of visual inspection [3], while others have adopted algorithms of exponential complexity [4,5] to identify regions in the application's Data Flow Graphs (DFGs) that qualify for CLUs. None of the current approaches attempts to identify the regions through an automated, low complexity approach that deals with Control Data Flow Graphs (CDFGs).
Secondly, the number of levels in a hierarchical network architecture must be influenced by the number of processing elements or CLUs needed to complete the task or application. This in turn depends on the amount of parallelism that can be extracted from the algorithm and on the percentage of resource utilization. To the best of our knowledge, no research group in the area of reconfigurable computing has dealt with this problem.
Thirdly, the complex network on the chip makes dynamic scheduling expensive, as it adds to the primary burden of power dissipation through routing resource utilization. Therefore there is a need for a reconfiguration aware scheduling strategy. Most research groups have adopted dynamic scheduling for a reconfigurable accelerator unit through a scheduler that resides on a host processor [6,7].
The increasing demand for fast processing, high flexibility and reduced power consumption naturally calls for the design and development of a dynamically reconfigurable processor that is aware of, and keeps low, its configuration time.
It is an object, therefore, to provide a low area, low power consuming and fast reconfigurable processor.
Task scheduling [1] is an essential part of the design cycle of a hardware implementation for a given application. By definition, scheduling refers to the ordering of sub-tasks belonging to an application and the allocation of resources to these tasks. The two types of scheduling techniques are static and dynamic scheduling. Any application can be modeled as a Control-Data Flow Graph. Most current applications offer users a large number of variations and hence are control-dominated. Arriving at an optimal static schedule for such an application would involve a highly complex scheduling algorithm. Branch and Bound is an example of such an algorithm, with exponential complexity. Several researchers have addressed task scheduling, and one group has also addressed scheduling for conditional tasks.
Any given application can be modeled as a CDFG G(V,E). V is the set of all nodes of the graph. These nodes represent the various tasks of the CDFG. E is the set of all communication edges. These edges can be either conditional or unconditional. There are two possible methods of scheduling this CDFG, static and dynamic, which are described below.
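The CDFG model G(V,E) described above can be sketched as a small data structure. This is only a minimal illustration: the class names `Edge` and `CDFG`, the `condition` field and the example task names are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Edge:
    src: str
    dst: str
    # None marks an unconditional edge; a string holds the branch condition.
    condition: Optional[str] = None

    @property
    def is_conditional(self) -> bool:
        return self.condition is not None


@dataclass
class CDFG:
    nodes: Set[str] = field(default_factory=set)    # V: the tasks
    edges: List[Edge] = field(default_factory=list)  # E: communication edges

    def add_edge(self, src: str, dst: str, condition: Optional[str] = None) -> None:
        self.nodes.update((src, dst))
        self.edges.append(Edge(src, dst, condition))

    def successors(self, node: str) -> List[Edge]:
        return [e for e in self.edges if e.src == node]


# Hypothetical example: one branch node with two conditional outgoing edges.
g = CDFG()
g.add_edge("load", "cmp")
g.add_edge("cmp", "then_task", condition="x > 0")
g.add_edge("cmp", "else_task", condition="x <= 0")
g.add_edge("then_task", "store")
g.add_edge("else_task", "store")
```

Keeping the condition on the edge rather than on the node makes it straightforward to tell a control-dominated graph (many conditional edges) from a pure data flow graph (none).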
Static scheduling of tasks is done at compile time. It is assumed that the lifetimes of all the nodes are known at compile time. The final schedule is stored on-chip. During run-time, if the assumed lifetime of any node turns out to be wrong, the schedule information needs to be updated. The advantage of this method is that the worst-case execution time is guaranteed. But a static schedule is always worse than a dynamic schedule in terms of optimality. Some of the existing solutions for static scheduling are stated here.
Chekuri [2] discusses the earliest branch node retirement scheme. This is applicable to trees and s-graphs. An s-graph is a graph where only one path has weighted nodes. In this case, it is a collection of Directed Acyclic Graphs (DAGs) representing basic blocks which all end in branch nodes, and the options at the branch nodes are to exit from the whole graph or to exit to another branch node. The problem with this approach is that it is applicable only to small graphs and is restricted to s-graphs and trees. It also does not consider nodes mapped to specific processing elements.
Pop [3] tackles control task scheduling in two ways. The first is partial critical path based scheduling. However, the authors do not assume that the value of the conditional controller is known prior to the evaluation of the branch operation. They also propose the use of a branch and bound technique for finding a schedule for every possible branch outcome. This is quite exhaustive, but it provides an optimal schedule. Once all possible schedules have been obtained, the schedules are merged. The advantage is that the result is optimal, but the method has the drawback of being quite complex. It also does not consider loop structures.
Dynamic scheduling of tasks is done during run-time. The main advantage of such an approach is that there is no need for a schedule to be stored on-chip. Moreover, the schedule obtained is optimal. But a major limiting factor is that the schedule information needs to be communicated to all the processing elements on the chip at all times. This is a degrading factor in an architecture where interconnects occupy 70% of the total area.
Jha [4] addresses scheduling of loops with conditional paths inside them. This is a good approach, as it exploits parallelism to a large extent and uses loop unrolling. The drawback is that the control mechanism needed to keep track of each iteration, and of the resource handling that iteration, is very complicated. The approach is useful for one or two levels of loop unrolling, and where the processing units can afford to communicate quite often with each other and with the scheduler. But in our case, the network occupies about 70% of the chip area [6], and hence the processing elements cannot afford to communicate with each other too often. Moreover, the granularity level of operation between processing elements is beyond the basic block level, and hence this method is not practical.
Mooney [5] discusses a path based edge activation scheme. This means that if for a group of nodes (which must be scheduled onto the same processing unit and whose schedules are affected by branch paths occurring at a later stage) one knows ahead of time the branch controlling values, then one can at run time prepare all possible optimized list schedules for every possible set of branch controller values. This method is very similar to the partial critical path based method proposed by Pop discussed above. It involves the use of a hardware scheduler which is an overhead.
Existing research work on scheduling applications for reconfigurable devices has been focused on context-scheduling. A context is the bit-level information that is used to configure any particular circuit to do a given task. A brief survey of research done in this area is given here.
Noguera [7] proposes a dynamic scheduler and four possible scheduling algorithms to schedule contexts. These contexts are used to configure the Dynamic Reconfiguration Logic (DRL) blocks. This is well-suited for applications which have non-deterministic execution times.
Schmidt [8] aims to dynamically schedule tasks for FPGAs. Initially, all the tasks are allocated as they come till the entire real estate is used up. Schmidt proposes methods to reduce the waiting time of the tasks arriving next. A proper rearrangement of tasks currently executing on the FPGA is done in order to place the new task. A major limitation of this method is that it requires knowing the target architecture while designing the rearrangement techniques.
Fernandez [9] discusses a scheduling strategy that aims to allocate tasks belonging to a DFG to the proposed MorphoSys architecture. All the tasks are initially scheduled using a heuristic-based method which minimizes the total execution time of the DFG. Context loading and data transfers are scheduled on top of the initial schedule. Fernandez tries to hide context loading and data transfers behind the computation time of kernels. A main drawback is that this method does not apply for CDFG scheduling.
Bhatia [10] proposes a methodology to do temporal partitioning of a DFG and then scheduling the various partitions. The scheduler makes sure that the data dependence between the various partitions is maintained. This method is not suited for our purpose which needs real-time performance.
Memik [11] describes a super-scheduler to schedule DFGs for reconfigurable architectures. He initially allocates the resources to the most critical path of the DFG. Then the second most critical path is scheduled, and so on. Scheduling of paths is done using non-crossing bipartite matching. Though the complexity of this algorithm is low, the schedule is nowhere near optimal.
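The notion of a "most critical path" in a DFG can be illustrated by a longest-path computation over a weighted DAG. The sketch below shows only that identification step under assumed node execution times; it does not reproduce the super-scheduler of [11] or its non-crossing bipartite matching, and the function name `critical_path` and the example graph are hypothetical.

```python
def critical_path(weights, edges):
    """Return (length, path) of the longest weighted path in a DAG.

    weights: {node: execution time}, edges: iterable of (src, dst) pairs.
    """
    succ = {n: [] for n in weights}
    indeg = {n: 0 for n in weights}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # Kahn's algorithm: build a topological order of the nodes.
    order = [n for n in weights if indeg[n] == 0]
    i = 0
    while i < len(order):
        for v in succ[order[i]]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)
        i += 1
    if len(order) != len(weights):
        raise ValueError("graph contains a cycle")

    # best[n] = longest path length ending at n (including n's own weight).
    best = dict(weights)
    prev = {n: None for n in weights}
    for u in order:
        for v in succ[u]:
            if best[u] + weights[v] > best[v]:
                best[v] = best[u] + weights[v]
                prev[v] = u

    end = max(best, key=best.get)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]


# Hypothetical DFG with assumed execution times.
weights = {"a": 2, "b": 3, "c": 1, "d": 2, "e": 5}
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("b", "e")]
length, path = critical_path(weights, edges)
```

A path-by-path scheduler would allocate resources to this path first, remove it, and repeat on what remains; that greedy ordering is cheap but gives no optimality guarantee.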
Jack Liu [12] proposes the Variable Instruction Set Computer (VISC) architecture. Scheduling is done at the basic block level. An optimal schedule for ordering the instructions within a basic block has been proposed. This order of instructions is used to determine the hardware clusters.
An analysis of the existing work on scheduling techniques for reconfigurable architectures has shown that there is not enough work done on static scheduling techniques for CDFGs. This shows the need for a novel method to do the same.
The VLSI chip design cycle includes the steps of system specification, functional design, logic design, circuit design, physical design, fabrication and packaging. The physical design automation of FPGAs involves three steps: partitioning, placement and routing.
Despite advances in VLSI design automation, the time to market for a chip is unacceptable for many applications. The key problem is the time taken to fabricate chips, and therefore there is a need to find new technologies that minimize the fabrication time. Gate arrays take less time to fabricate than full custom chips, since only routing layers are fabricated on top of a pre-fabricated wafer. However, the fabrication time for gate arrays is still unacceptable for several applications. In order to reduce the time to fabricate interconnects, programmable devices have been introduced which allow users to program the devices as well as the interconnect.
The FPGA is a new approach to ASIC design that can dramatically reduce manufacturing turnaround time and cost. In its simplest form, an FPGA consists of a regular array of programmable logic blocks interconnected by a programmable routing network. A programmable logic block is a RAM and can be programmed by the user to act as a small logic module. The key advantage of the FPGA is re-programmability.
Physical design includes partitioning, floor planning, placement, routing and compaction.
The physical design automation of FPGAs involves three steps: partitioning, placement and routing. Partitioning in FPGAs is significantly different from partitioning in other design styles, and the problem depends on the architecture in which the circuit has to be implemented. Placement in FPGAs is very similar to gate array placement. Routing in FPGAs consists of finding a connection path and programming the appropriate interconnection points. In physical design, the circuit representation of each component is converted into a geometric representation. This representation is a set of geometric patterns which perform the intended logic function of the corresponding component. Connections between different components are also expressed as geometric patterns. Physical design is a very complex process and therefore it is usually broken into various sub-steps.
The input to the physical design cycle is the circuit diagram and the output is the layout of the circuit. This is accomplished in several stages such as partitioning, floor planning, placement, routing and compaction.
A chip may contain a very large number of transistors. The layout of the entire circuit cannot be handled at once because of limits on the available memory space and computation power. Therefore the circuit is normally partitioned by grouping the components into blocks. The actual partitioning process considers many factors, such as the size of the blocks, the number of blocks, and the number of interconnections between the blocks. The set of interconnections required is referred to as a net list. In large circuits the partitioning process is hierarchical, and at the topmost level a chip may have 5 to 25 blocks. Each block is then partitioned recursively into smaller blocks.
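One of the factors mentioned above, the number of interconnections between blocks, can be measured as the count of nets that span more than one block. The sketch below is a minimal illustration under assumed data; the function name `cut_size`, the net list and the two candidate partitions are hypothetical.

```python
def cut_size(nets, block_of):
    """Count the nets whose terminals fall into more than one block.

    nets: {net name: list of component names}
    block_of: {component name: block index}
    """
    return sum(
        1
        for terminals in nets.values()
        if len({block_of[t] for t in terminals}) > 1
    )


# Hypothetical net list over four components.
nets = {
    "n1": ["g1", "g2"],
    "n2": ["g2", "g3", "g4"],
    "n3": ["g3", "g4"],
    "n4": ["g1", "g4"],
}

# Two candidate two-way partitions of the same components.
partition_a = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
partition_b = {"g1": 0, "g2": 1, "g3": 0, "g4": 1}

print(cut_size(nets, partition_a), cut_size(nets, partition_b))
```

A partitioner would prefer the first grouping here, since fewer nets cross the block boundary; applying the same measure recursively inside each block gives the hierarchical scheme described in the text.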
Floor planning is concerned with selecting good layout alternatives for each block as well as for the entire chip. The area of each block can be estimated after partitioning and is based approximately on the number and type of components in that block. In addition, the interconnect area required within the block must also be considered. Very often the task of floor plan layout is done by a design engineer rather than a CAD tool, because a human is better at visualizing the entire floor plan and taking into account the information flow. In addition, certain components are often required to be located at specific positions on the chip.
During placement the blocks are exactly positioned on the chip. The goal of placement is to find a minimum area arrangement for the blocks that allows completion of the interconnections between the blocks while meeting the performance constraints. Placement is usually done in two phases. In the first phase an initial placement is done. In the second phase the initial placement is evaluated and iterative improvements are made until the layout has minimum area or best performance.
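The two-phase placement idea (an initial placement followed by iterative improvement) can be sketched with a simple pairwise-swap loop. This is an assumption-laden illustration, not a production placer: it uses half-perimeter wire length as a stand-in cost rather than area, and the names `hpwl` and `improve_by_swaps` and the example data are hypothetical.

```python
from itertools import combinations


def hpwl(nets, pos):
    """Half-perimeter wire length: sum of bounding-box half-perimeters of all nets."""
    total = 0
    for terminals in nets:
        xs = [pos[t][0] for t in terminals]
        ys = [pos[t][1] for t in terminals]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def improve_by_swaps(nets, pos):
    """Greedy iterative improvement: keep any pairwise swap that lowers the cost.

    Terminates because the integer cost strictly decreases on every accepted swap.
    """
    pos = dict(pos)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(sorted(pos), 2):
            before = hpwl(nets, pos)
            pos[a], pos[b] = pos[b], pos[a]
            if hpwl(nets, pos) < before:
                improved = True
            else:
                pos[a], pos[b] = pos[b], pos[a]  # undo the swap
    return pos


# Phase 1: an assumed initial placement of four blocks on a row of slots.
initial = {"A": (0, 0), "B": (3, 0), "C": (1, 0), "D": (2, 0)}
nets = [("A", "B"), ("C", "D")]

# Phase 2: iterative improvement.
final = improve_by_swaps(nets, initial)
```

Greedy swapping can stall in a local minimum; practical placers typically use techniques such as simulated annealing for this phase, which is outside the scope of this sketch.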
The quality of placement will not be clear until the routing phase has been completed. Placement may lead to an unroutable design, in which case another iteration of placement is necessary. To limit the number of iterations of the placement algorithm, an estimate of the required routing space is used during the placement process. Good routing and circuit performance depend heavily on a good placement algorithm, because once the positions of the blocks are fixed, there is not much that can be done to improve the routing and the circuit performance.
The objective of routing is to complete the interconnections between the blocks according to the specified net list. First the space that is not occupied by the blocks (the routing space), including the space between the blocks, is partitioned into rectangular regions called channels and switchboxes. The goal of the router is to complete all circuit connections using the shortest possible wire length and using only the channels and switchboxes. This is usually done in two phases, referred to as the global routing and detailed routing phases. In global routing, connections are completed between the proper blocks, disregarding the exact geometric details of each wire. For each wire, the global router finds a list of channels and switchboxes to be used as a passageway for that wire. Detailed routing, which completes point-to-point connections, follows global routing. The global routing is converted into an exact routing by specifying geometric information such as the location and spacing of wires. Routing is a very well studied problem. Since almost all routing problems are computationally hard, researchers have focused on heuristic algorithms.
Compaction is the task of compressing the layout in all directions such that the total area is reduced. By making the chip smaller wire lengths are reduced which in turn reduces the signal delay.
Generally approaches to global routing are classified as sequential and concurrent approaches.
In the sequential approach, nets are routed one by one. Once a net is routed, it may block other nets which are yet to be routed. As a result, this approach is very sensitive to the order in which the nets are considered for routing. Usually the nets are ordered with respect to their criticality, which is determined by the importance of the net. For example, a clock net may determine the performance of the circuit, so it is considered highly critical. However, sequencing techniques do not solve the net ordering problem satisfactorily. An improvement phase is used to remove blockages when further routing is not feasible. This may also not solve the net ordering problem, so in addition the ‘rip-up and reroute’ technique [Bol79, DK82] and the ‘shove-aside’ technique are used. In rip-up and reroute, the interfering wires are ripped up and rerouted to allow routing of the affected nets. In the shove-aside technique, wires are moved aside to allow completion of failed connections without breaking the existing connections. Another approach [De86] is to first route simple nets consisting of only two or three terminals, since there are few choices for routing such nets. After the simple nets are routed, a Steiner tree algorithm is used to route intermediate nets. Finally, a maze routing algorithm is used to route the remaining multi-terminal nets, which are not too numerous.
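The order sensitivity of net-by-net routing can be demonstrated with a small breadth-first maze router on a single-layer grid. This is a simplified sketch under assumed conditions (two-terminal nets, one routing layer, no rip-up or shove-aside); the names `maze_route` and `route_sequentially` and the 3x3 example are hypothetical.

```python
from collections import deque


def maze_route(blocked, src, dst):
    """Breadth-first search for a shortest 4-connected path from src to dst.

    blocked[r][c] is True for cells already used by earlier wires.
    Returns the list of cells on the path, or None if no path exists.
    """
    rows, cols = len(blocked), len(blocked[0])
    prev = {src: None}
    queue = deque([src])
    while queue:
        cell = queue.popleft()
        if cell == dst:
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and not blocked[nr][nc] and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


def route_sequentially(rows, cols, nets):
    """Route nets in the given order; every routed wire blocks later nets."""
    blocked = [[False] * cols for _ in range(rows)]
    routes = {}
    for name, src, dst in nets:
        path = maze_route(blocked, src, dst)
        routes[name] = path
        if path is not None:
            for r, c in path:
                blocked[r][c] = True
    return routes


# Two nets that must cross in the middle of a 3x3 single-layer grid.
net_a = ("A", (1, 0), (1, 2))
net_b = ("B", (0, 1), (2, 1))
a_first = route_sequentially(3, 3, [net_a, net_b])
b_first = route_sequentially(3, 3, [net_b, net_a])
```

Whichever net is routed first succeeds and blocks the other, which is the situation that ordering by criticality, rip-up and reroute, and shove-aside techniques are meant to address.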
To match the needs of the future moderately complex applications, provided is the first of a series of tools intended to help in the design and development of a dynamically reconfigurable multimedia processor.