This invention relates in general to computer algorithms and an apparatus for rapidly performing same. More particularly, this invention relates to a method and an apparatus for creating an output vector from a first vector and a second vector, the second vector being a complement of the first vector. This method is particularly useful for rapidly finding the first 1 in a long vector.
As is well known in the art, binary numbers are the native tongue of computers. The transistor, the basic building block of most modern microprocessors, can create binary information: a 1 if current passes through, or a 0 if current doesn't. From these 1s and 0s, which are called bits, a computer can create any number, provided it has enough transistors grouped together to hold the 1s and 0s required. A group of 1s and 0s is commonly known as a vector. For example, a vector X(n-1:0) has n storage locations numbered from 0 to n-1. The first bit in the vector X(n-1:0) is X(0) while the last bit is X(n-1). When referring to an entire vector X(n-1:0), a short-hand notation X is often used by those skilled in the art.
Consider a vector X' having a length n of 8 and equal to 01101110. The rightmost bit X'(0), in this case 0, is the least significant bit. The leftmost bit X'(7), in this case 0, is the most significant bit. If it is desired to find the first 1 in X', one could first look at the least significant bit, X'(0). Because X'(0) is 0, one could then look at the next higher-order bit, X'(1). In this case, X'(1) is 1. Thus, X'(1) is "the first 1 found in X'." While in this simple example the first 1 in X' was found in a relatively small number of steps, a longer vector whose first 1 lies in a high-order position would require many additional time-consuming steps.
It may be desired to represent the first 1 found in X' in another vector Z' of the same length. In this case, Z' could be set equal to 11111101, where the 0 in Z'(1) indicates the position of the first 1 found in X'.
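The bit-by-bit search and the Z' encoding described above can be sketched in software as follows. This is only an illustrative sketch, not the claimed apparatus: the function names are hypothetical, vectors are written as strings with the most significant bit first, and returning an all-1s vector when no 1 is present is an assumption that the text above does not address.

```python
def find_first_one(x: str) -> int:
    """Return k such that X(k) is the first 1, scanning up from X(0).

    x is written most significant bit first, so x[-1] is X(0).
    Returns -1 if x contains no 1 (an assumed convention).
    """
    n = len(x)
    for k in range(n):
        # X(k) sits k positions from the right end of the string.
        if x[n - 1 - k] == "1":
            return k
    return -1


def encode_first_one(x: str) -> str:
    """Build Z: all 1s except a 0 at the position of the first 1 in x."""
    n = len(x)
    k = find_first_one(x)
    z = ["1"] * n
    if k >= 0:
        z[n - 1 - k] = "0"
    return "".join(z)


if __name__ == "__main__":
    print(find_first_one("01101110"))
    print(encode_first_one("01101110"))
```

The loop visits one bit per step, which mirrors why a sequential search becomes slow as the vector grows.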
In modern microprocessors, it may be necessary to find the first 1 in a long vector. For example, a resource scheduler or a resource allocator may need to find the first 1 in a vector X(n-1:0) that represents the availability of n resources. A 1 in such a vector, which will be referred to as a resource vector, may represent that a particular resource is available, while a 0 may represent that the resource is busy. These resources may include floating point registers and general registers for out-of-order instruction execution. Because of the large number of resources in modern microprocessors, the length n of a resource vector may be as great as 64. It is likely that future microprocessors will have even more resources and will utilize even larger resource vectors.
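For illustration only, the use of a resource vector by an allocator can be modeled in software. The sketch below is an assumption-laden analogy rather than the hardware discussed here: it stores the resource vector in an integer, uses the well-known two's-complement identity x & -x to isolate the lowest set bit, and uses hypothetical function names.

```python
def first_free_resource(resource_vector: int) -> int:
    """Return the index of the lowest 1 bit (first available resource).

    Returns -1 when no resource is available (vector is all 0s).
    """
    lowest = resource_vector & -resource_vector  # isolates the lowest set bit
    return lowest.bit_length() - 1


def allocate(resource_vector: int) -> tuple[int, int]:
    """Claim the first available resource.

    Returns (index, updated_vector); the claimed bit is cleared to 0 (busy).
    If nothing is available, returns (-1, resource_vector) unchanged.
    """
    index = first_free_resource(resource_vector)
    if index < 0:
        return -1, resource_vector
    return index, resource_vector & ~(1 << index)


if __name__ == "__main__":
    index, remaining = allocate(0b01101110)
    print(index, format(remaining, "08b"))
```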
It may also be necessary to find the first 1 in a long vector when aligning complex instruction set computer (CISC) instructions. An instruction alignment unit may search a long vector that marks the end bytes of CISC instructions. Thus, the location of the end byte of a particular CISC instruction may be found by finding the first 1 in that long vector.
Because of the ever-increasing clock speeds of modern microprocessors, certain microprocessor functions, such as scheduling and allocating resources, need to be performed more rapidly than ever before. Such functions are preferably performed in less than a single microprocessor clock cycle.
Conventional methods for finding the first 1 in a long vector utilize AND gates. As is known in the art, an AND gate is a multiple-input, single-output device which realizes the logical function AND. These conventional methods require multiple levels of AND gates, i.e., the output of a first AND gate is coupled to an input of a second AND gate and so on. Each level of AND gates induces a delay equal to the switching time of the AND gate. Modern AND gates have extremely fast switching speeds. However, the multiple levels of AND gates required for finding the first 1 in a long vector using conventional methods do not allow scheduling or allocating resources in a single clock cycle of a modern microprocessor. This deficiency is also due in part to the fact that AND gates are relatively slow when implemented in domino logic.
Domino logic is known by those skilled in the art as a modification of conventional clocked CMOS logic. Domino logic allows a single clock to pre-charge and evaluate a cascaded set of dynamic logic blocks. In a cascaded set of logic blocks, each stage evaluates and causes the next stage to evaluate, in the same way that a line of dominoes falls. Any number of logic stages may be cascaded, provided that the sequence can evaluate within the evaluate clock phase.
Another conventional method for finding the first 1 in a vector does not involve the use of AND gates. FIG. 1 shows a diagram of a prior art Find First One Block. This block contains a plurality of domino OR gates. As shown in FIG. 1, vector X(n-1:0) is input into the block. Likewise, vector Y(n-1:0) is input into the block. Y is the complement of X. For binary numbers, the complement of 1 is 0. Similarly, the complement of 0 is 1. For example, if a vector X' has a length n of 8 and is set equal to 01101110, then the complement of X' is 10010001. The output of the block is vector Z.
In this prior art method, Y(0) is directly coupled to Z(0). For each k from 1 to n-1, X(k-1:0) and Y(k) are input into a (k+1)-input OR gate. The output of each (k+1)-input OR gate drives Z(k). Thus, Z(k) is 0 only when X(k) is 1 and every lower-order bit of X is 0.
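The OR-gate arrangement just described can be modeled in software to check its behavior. The following is a behavioral sketch only; it says nothing about domino timing, and the function names and list representation (element k of a list holds bit k of the vector) are assumptions made for illustration.

```python
def bits_from_string(s: str) -> list[int]:
    """Convert an MSB-first string into a list where element k is bit k."""
    return [int(c) for c in reversed(s)]


def string_from_bits(bits: list[int]) -> str:
    """Convert a list where element k is bit k back to an MSB-first string."""
    return "".join(str(b) for b in reversed(bits))


def complement(x: list[int]) -> list[int]:
    """Return Y, the bitwise complement of X."""
    return [1 - b for b in x]


def find_first_one_block(x: list[int], y: list[int]) -> list[int]:
    """Behavioral model of the prior art block built from OR gates."""
    n = len(x)
    z = [0] * n
    z[0] = y[0]  # Y(0) is coupled directly to Z(0)
    for k in range(1, n):
        # (k+1)-input OR gate over X(k-1:0) and Y(k)
        z[k] = int(any(x[:k]) or y[k])
    return z


if __name__ == "__main__":
    x = bits_from_string("01101110")
    y = complement(x)
    print(string_from_bits(y))
    print(string_from_bits(find_first_one_block(x, y)))
```

In this model, an all-0s input yields an all-1s output, since every Y(k) is then 1.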
While this prior art method can find the first 1 more rapidly than the above-described method using AND gates, it cannot efficiently find the first 1 in long vectors. As can be seen in FIG. 1, X(0) has a large fanout because it is connected to n-1 OR gates. Large fanouts typically increase the load capacitance and slow down the previous driving gate in the domino chain. In addition, as the number of inputs into an OR gate increases, the diffusion capacitance increases and the output speed decreases. Finally, the surface area required to implement this prior art block is large due to the significant transistor count. The large surface area increases device costs. Further, the large surface area requires longer wires, which increase delays and slow down the block. As is known by those skilled in the art, delays due to wire lengths are becoming comparable to the traditionally dominant gate delays as process geometries shrink.
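The scaling problem can be made concrete by counting connections in the structure described above, in which gate k (for k from 1 to n-1) receives X(k-1:0) and Y(k). The sketch below counts gate inputs only; it is not a transistor count or a timing estimate, and the function name and returned keys are hypothetical.

```python
def or_block_cost(n: int) -> dict[str, int]:
    """Count fanout and gate inputs for the n-bit prior art OR block."""
    # X(j) feeds every OR gate k with k > j, i.e. gates j+1 .. n-1.
    fanout = [n - 1 - j for j in range(n)]
    # Gate k has k inputs from X(k-1:0) plus one input from Y(k).
    gate_inputs = [k + 1 for k in range(1, n)]
    return {
        "fanout_x0": fanout[0],
        "widest_gate": max(gate_inputs, default=0),
        "total_inputs": sum(gate_inputs),
    }


if __name__ == "__main__":
    print(or_block_cost(8))
    print(or_block_cost(64))
```

Under these counting assumptions, the total number of gate inputs is (n-1)(n+2)/2, so it grows roughly with the square of the vector length.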
Thus, a need exists for a method and apparatus for efficiently finding the first 1 in a long vector within a single microprocessor clock cycle.