The present invention pertains generally to clock trees in integrated circuits, and more particularly, to a metal programmable clock distribution scheme for allowing in-place programming of clock buffer delays.
Clock networks on CMOS integrated circuits have long been a source of difficulty to integrated circuit designers due to the importance of minimizing skew between clock inputs. A typical integrated circuit includes a clock tree which distributes one or more clock signals throughout the chip to clocked elements. A primary goal of a clock tree is to minimize clock skew between clocked elements. Since all clocked elements on a given tree are driven from one net with a clock spine, skew is caused by differing interconnect lengths and loads.
Clock skew represents a fraction of the clock period that cannot be used for computation. For example, a clock skew of 500 ps with a 200 MHz clock means that 500 ps of every 5 ns clock cycle, or 10 percent of the performance, is wasted. Accordingly, clock skew may reduce the effective time allowed for certain logic paths within the design, and thus may reduce the effective maximum frequency, and therefore the performance, of the design.
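The arithmetic in the example above can be checked with a short calculation. The following is a minimal sketch only; the function name `skew_fraction` and its parameters are illustrative assumptions and do not appear in the original text.

```python
def skew_fraction(skew_ps: float, clock_mhz: float) -> float:
    """Return the fraction of each clock period that is lost to skew.

    Hypothetical helper for illustration; not part of the described scheme.
    """
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    # A 1 MHz clock has a period of 1,000,000 ps, so period = 1e6 / f(MHz).
    period_ps = 1_000_000.0 / clock_mhz
    return skew_ps / period_ps


if __name__ == "__main__":
    # 500 ps of skew on a 200 MHz clock (5 ns period).
    print(f"{skew_fraction(500, 200):.0%}")  # prints 10%
```

The same function shows how the loss grows with frequency: the same 500 ps of skew consumes a larger share of a shorter clock period.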
In addition, clock skew may have a number of other detrimental effects on the performance of a circuit design. For example, clock skew may cause hold time violations when only a small amount of logic is provided between registers, causing malfunction of the circuit. Further, clock skew may cause communication problems between integrated circuits. It should be recognized that these are only illustrative examples of effects that clock skew may have on a system.
For the above reasons, a primary goal of a clock tree is to minimize clock skew between clocked elements.
Clock trees may be balanced or unbalanced. Balanced clock trees distribute a number of clock drivers symmetrically and evenly placed on the integrated circuit die. In a balanced tree, the distance between each clock driver and its receiving element is preferably identical, and the load on each driver is matched. Balanced clock trees find suitable application in integrated circuits that are formed with functional blocks characterized by substantially similar loads, for example, memory chips formed with symmetrically balanced memory arrays.
By contrast, unbalanced clock trees distribute clock drivers in a non-symmetrical manner throughout the integrated circuit, generally with higher concentrations of clock drivers where the load is larger and lower concentrations of clock drivers where the load is smaller. Unbalanced clock trees are often utilized in complex circuits that are designed in a functionally hierarchical manner using a plurality of different functional blocks of differing loads. Unbalanced clock trees are typically used in integrated circuits that are partitioned into different functional blocks (which are often designed by different groups of designers).
In an integrated circuit design having a balanced clock tree, the clock network is typically pre-placed on the integrated circuit die prior to placement of functional logic blocks. This scheme has a number of limitations. First, the clock buffering circuit may interfere with ideal block placement on the chip, which means that area or timing may need to be sacrificed. Second, any block smaller than average will have a larger clock driver than it needs, possibly increasing the amount of power required. This scheme may therefore waste chip resources.
In an integrated circuit design having an unbalanced clock tree, the clock network is normally added after determining where the appropriate buffers need to be located. This scheme also has a number of limitations. First, it prevents simulation of the clock network until all layers of the hierarchy are complete. This means that a parent block made up of one or more children blocks cannot be simulated until all of its children blocks are complete. As a result, the entire design must be complete before simulation can occur. If, as a result of simulation, it is discovered that one or more clock routes must be adjusted to meet the clock skew requirements, the final artwork is delayed until the layer(s) requiring adjustment are reworked, and the entire adjusted artwork is resimulated. This scheme is clearly time-consuming and costly.
Accordingly, a need exists for a clock distribution scheme that combines the advantages, and reduces the disadvantages, of both the balanced and unbalanced clock trees. In particular, a need exists for allowing clock drivers to be pre-placed where likely to be needed. The pre-placement allows simulation of various hierarchical levels without requiring all levels of artwork to be completed. A need also exists for allowing in-place programming of clock driver delays on a driver-by-driver basis without disturbing the clock tree.
The present invention is a metal programmable clock distribution scheme for allowing in-place programming of clock driver delays on a driver-by-driver basis. In accordance with a preferred embodiment of the invention, the clock network is pre-placed at the top level of the functional hierarchy. The clock network comprises a plurality of instances of clock drivers, each comprising a clock buffer chain, in locations where it is expected that a clock driver may be needed. Each clock buffer chain comprises a plurality of unit delay clock buffers connected in series between the respective clock driver input line and, potentially, the respective clock driver output line. Designers reserve clock driver input metal on each of the plurality of metal layers between the clock driver input line and the input of the first clock buffer in the clock buffer chain. For each clock buffer in a given clock buffer chain, designers also reserve clock driver output metal on each of the metal layers between the output of the respective clock buffer in the clock buffer chain and the clock driver output line.
The metal layers between the clock driver output line and the outputs of the individual clock buffers in the clock buffer chain of a given clock driver are partitioned into one or more programming layers and one or more non-programming layers. For the purposes of this invention, the term "non-programming layer" refers to a metal layer in which the routing of metal, once initially determined, remains fixed within a given clock driver block regardless of changes to the delay of the clock driver. The term "programming layer" refers to a metal layer in which the routing of metal within a given clock driver block may change due to changes to the delay of the clock driver.
Then, for each clock buffer in the clock buffer chain, an output connection route is mapped between the output of the respective clock buffer and the clock driver output line, through the reserved clock driver output metal on each of the one or more metal layers that lie between that output and the clock driver output line. Metal is then implemented along a route corresponding to the output connection route for each of the non-programming layers.
During design, a desired clock driver delay for each clock tree branch is determined. Metal is then implemented along a route corresponding to the output connection route on each of the one or more programming layers in order to complete the output connection route between the output of the clock buffer corresponding to the desired delay and the clock driver output line.
Once the location of a given clock driver is decided, the location of the clock driver remains static throughout the remaining design process. The connections of the individual clock buffers in the clock buffer chain of the clock driver are preferably programmable using a single programmable layer (e.g., metal M4). Accordingly, metal within the clock driver block is reserved and implemented in each of the layers M2 and M3 for potential output line connection from the output of each unit delay buffer in the clock buffer chain of the clock driver. Once the desired clock driver delay is determined, metal layer M4 is implemented to program and complete the connection between the output of the clock buffer in the clock buffer chain that will result in the desired clock driver delay and the clock driver output line. The input of the subsequent clock buffer in the chain, if it exists, is preferably also programmably connected to a static line of the integrated circuit such as ground using the programmable layer M4.
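The tap selection described above can be modeled in a few lines. The following is a minimal sketch under stated assumptions, not the implementation of the invention: it assumes an idealized chain in which the driver delay equals the number of buffers traversed times one unit delay (ignoring interconnect and load), and the names `TapProgram`, `program_driver`, and `layers_to_change` are hypothetical. The layer names M2, M3, and M4 follow the example in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Layer partition taken from the example in the text: M2 and M3 are
# implemented once and stay fixed; M4 alone carries the delay programming.
NON_PROGRAMMING_LAYERS: Tuple[str, ...] = ("M2", "M3")
PROGRAMMING_LAYERS: Tuple[str, ...] = ("M4",)


@dataclass(frozen=True)
class TapProgram:
    tap: int                       # 1-based buffer whose output feeds the output line
    delay_ps: float                # resulting delay under the idealized model
    grounded_input: Optional[int]  # next buffer whose input is tied to ground, if any


def program_driver(desired_delay_ps: float, unit_delay_ps: float,
                   chain_length: int) -> TapProgram:
    """Pick the buffer output to connect on the programming layer.

    Assumption: delay = tap * unit_delay_ps (no wire or load effects).
    """
    if unit_delay_ps <= 0 or chain_length < 1:
        raise ValueError("unit delay must be positive and chain non-empty")
    # Nearest achievable step, clamped to the buffers that actually exist.
    tap = max(1, min(chain_length, round(desired_delay_ps / unit_delay_ps)))
    # The subsequent buffer, if it exists, has its input tied to a static line.
    grounded = tap + 1 if tap < chain_length else None
    return TapProgram(tap, tap * unit_delay_ps, grounded)


def layers_to_change(old: TapProgram, new: TapProgram) -> Tuple[str, ...]:
    """Return the metal layers that differ between two delay settings."""
    return PROGRAMMING_LAYERS if old.tap != new.tap else ()


if __name__ == "__main__":
    before = program_driver(100, 50, 4)
    after = program_driver(130, 50, 4)
    print(before, after, layers_to_change(before, after))
```

In this model, retuning a driver from the second tap to the third changes only the M4 geometry, which mirrors the statement that the clock driver location and the non-programming layers remain static once placed.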
The invention affords several advantages over the prior art. First, the delays of the clock drivers in the top level functional block are programmed by changing preferably only a single metal mask or a small number of metal masks. Thus, clock skew modifications can be achieved orthogonal to the top level routing by simply executing a metal mask turn in the one or more programming layers. Accordingly, top level simulation does not depend on top level clock routing, since the top level clock routing remains unchanged. Likewise, top level artwork builds do not depend on top level clock distribution, as was normally the case in the prior art, and therefore no rerouting is required. This approach allows designers to change the clock driver delays late in the design process, or to fix any skew problems discovered after the artwork is manufactured, by changing only the metal mask of the programming layers.
Second, the present invention allows "in place" tuning on a clock-driver-by-clock-driver basis. This approach allows the clock drivers to be implemented ahead of time and to be changed without affecting the design/artwork of the top level.