This invention relates in general to accessing memory of a computer system, and in particular to a method and system that provide a translation look-aside buffer (TLB) implementing a self-timed evaluation of whether a virtual address for a memory access request is found within the TLB, which reduces the latency involved in accessing the memory to satisfy the memory access request.
Computer systems may employ a multi-level hierarchy of memory, with relatively fast, expensive but limited-capacity memory at the highest level of the hierarchy and proceeding to relatively slower, lower cost but higher-capacity memory at the lowest level of the hierarchy. The hierarchy may include a relatively small, fast memory called a cache, either physically integrated within a processor or mounted physically close to the processor for speed. The computer system may employ separate instruction caches ("I-caches") and data caches ("D-caches"). In addition, the computer system may use multiple levels of caches. The use of a cache is generally transparent to a computer program at the instruction level and can thus be added to a computer architecture without changing the instruction set or requiring modification to existing programs.
Cache structures implemented for processors typically include a translation look-aside buffer (TLB), which is generally a large content addressable memory (CAM) structure. Generally, when an instruction being executed by the processor 19 (as shown in FIG. 2A) requests access to memory (e.g., to read from or write to an address), the cache's TLB receives a virtual address for such memory access request and translates the virtual address to a physical address. That is, the TLB translates a received virtual address to a physical address of the cache memory (e.g., random access memory (RAM)) to be accessed to satisfy the memory access request. More specifically, a TLB typically comprises multiple entries of addresses, and when the TLB receives a virtual address it compares the virtual address with its entries to determine if a match is made for the cache. If the TLB determines that a match is made for one of its entries, indicating that the requested address is contained in the cache, a WORD line is activated or fired (e.g., transitions from a low voltage value to a high voltage value) causing the appropriate physical address to be accessed in the cache memory (e.g., in the RAM memory). That is, if a match is made for a received virtual address within the TLB, then the TLB outputs the appropriate physical address and the WORD line is fired causing such physical address to be accessed in the cache data arrays (e.g., RAM memory) to satisfy the memory access request.
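The lookup behavior described above can be sketched in software. This is a minimal, hypothetical model (not the claimed hardware): the CAM compares a received virtual address against every entry in parallel, and on a hit the matching entry supplies the physical address used to access the cache data arrays. The entry values and page numbers below are illustrative assumptions.

```python
# Toy model of a TLB CAM lookup: on a hit, return the translated physical
# page; on a miss, return None. In hardware all rows compare at once; the
# loop here is only a software stand-in for that parallel comparison.

def tlb_lookup(entries, virtual_page):
    """entries: list of (virtual_page, physical_page) pairs (the CAM rows)."""
    for vpage, ppage in entries:      # hardware compares all rows in parallel
        if vpage == virtual_page:     # MATCH line stays high for this row
            return ppage              # WORD line fires; access this address
    return None                       # no row matched: TLB miss

tlb = [(0x4A, 0x100), (0x7F, 0x2C0)]
print(tlb_lookup(tlb, 0x7F))  # hit: 0x2C0 (704)
print(tlb_lookup(tlb, 0x01))  # miss: None
```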
Therefore, the cache TLB is necessarily in the critical path (the path required for completing an instruction) for a memory access request. The TLB is a fundamental part of all microprocessors in that an access of the cache cannot begin for a memory access request until the physical address is obtained for such memory access request from the TLB. Therefore, it is critical that the TLB execute as fast as possible. That is, because the TLB necessarily affects the speed at which an instruction can be satisfied, it is desirable to implement the TLB in a manner such that instructions that require access to the cache can be satisfied in a timely manner (i.e., quickly). However, prior art TLB implementations result in an undesirably long time in evaluating whether a match exists for a received virtual address within the TLB. As a result, prior art TLB implementations consume an undesirably long time before firing the WORD line to access the appropriate physical address in the cache memory when a match occurs within the TLB. Therefore, prior art implementations require an undesirably long time to satisfy a memory access request.
Turning to FIG. 1, an example of a TLB CAM 10 of the prior art is shown. As shown, TLB CAM 10 comprises circuitry 12, which is the circuitry for a single bit of the TLB CAM 10. Such circuitry of a TLB CAM 10 is well-known in the art, and therefore will not be described in great detail herein. In the exemplary TLB CAM 10 of FIG. 1, such a TLB CAM comprises 128 entries with each entry having 52 bits. Thus, circuitry 12 would be duplicated 51 times to provide a 52-bit entry, and such a 52-bit entry would be duplicated 127 times to provide a 128-entry TLB. Because such a TLB CAM is commonly implemented as an array having 128 rows of entries and 52 columns, the TLB entries may be referred to herein as rows. It should be understood that in various implementations TLB CAM 10 may have any number of entries (rows) with each entry having any number of bits (columns). TLB CAM 10 may receive a 52-bit virtual address for a memory access request, and compare the virtual address with its entries to determine whether a match is achieved in the TLB CAM for the received virtual address. As further shown in FIG. 1, TLB CAM 10 comprises a MATCH line through each bit circuitry 12 of an entry. TLB CAM 10 comprises a separate MATCH line (not shown) for each of the 128 entries, and each MATCH line indicates whether a match is made for its corresponding entry and a received virtual address.
Generally, each bit of an entry has a field effect transistor (FET) that is used to indicate whether it matches a corresponding bit of a received virtual address. For example, bit circuitry 12 of FIG. 1 includes an N-channel field effect transistor (NFET) 26 that is coupled to the MATCH line for that entry. NFET 26 is implemented such that if the bit circuitry 12 of this entry fails to match the corresponding bit of a received virtual address, then the NFET 26 pulls the MATCH line for this entry low. That is, the MATCH line is initially at a high voltage level, and if all the bits of the entry match a received virtual address, then the MATCH line remains at a high voltage level to indicate that the corresponding entry matches the virtual address (i.e., that a "hit" is achieved for the virtual address in the corresponding entry). However, if one or more bits fail to match the received virtual address, then such mismatching bit(s) pull the MATCH line low, thereby indicating that a hit was not achieved for the virtual address in the corresponding entry. Because every bit of the TLB CAM 10 includes such an NFET 26, a very small NFET 26 is typically utilized to reduce the amount of surface area and cost required for implementing the TLB CAM 10. Therefore, each NFET 26 of the TLB's bits is typically capable of discharging the MATCH line at only a relatively slow rate. That is, each NFET 26 is typically a small NFET that requires a relatively long time to discharge a MATCH line because of the parasitic capacitance on the MATCH line presented by the other cells coupled to such MATCH line. However, if many bits of an entry fail to match the virtual address, many of such NFETs 26 join in pulling down the MATCH line, and such an entry is capable of discharging the MATCH line more quickly than an entry in which only a few bits fail to match the virtual address.
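The wired match line described above behaves like a wide NOR of the per-bit mismatch signals: the line starts high, and any single mismatching bit turns on its NFET and discharges it. A minimal sketch of that logical behavior (bit widths here are illustrative, not the 52-bit entry of FIG. 1):

```python
# Logical model of one MATCH line: it stays high (True) only if every stored
# bit equals the corresponding virtual-address bit; any one mismatch pulls
# the line low (False), regardless of how many other bits match.

def match_line(entry_bits, address_bits):
    """Both arguments are equal-length bit strings, e.g. '1011'."""
    assert len(entry_bits) == len(address_bits)
    return all(e == a for e, a in zip(entry_bits, address_bits))

print(match_line("1011", "1011"))  # all bits match: line stays high -> True
print(match_line("1011", "1001"))  # one NFET pulls the line low -> False
```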
It should be recognized that it is desirable to determine the MATCH line value for each entry as soon as possible so as to allow the cache memory to be accessed to satisfy a memory access request in a timely manner. Therefore, it is desirable to evaluate the value of the MATCH lines for a TLB as soon as possible to decrease the time required to satisfy a memory access request. However, care must be taken to prevent the MATCH lines from being accessed prematurely (i.e., before an entry has completed pulling the MATCH line low for a mismatch) to avoid an erroneous access of a physical address in the cache memory. For example, suppose a match is achieved for a virtual address in a first entry of the TLB, and suppose a second entry of the TLB does not match the virtual address. If the MATCH lines are evaluated before the MATCH line for the second entry has had sufficient opportunity to discharge, the WORD line will fire causing an erroneous access of the physical address output by the second entry.
One implementation of the prior art dedicates one phase of a clock cycle to determining whether each entry in the TLB matches a received virtual address (to allow sufficient time for the MATCH line of each entry failing to match the virtual address to be discharged), and dedicates a later phase of a clock cycle to firing the WORD line to access the appropriate physical address in the cache memory (i.e., the physical address of a matching entry of the TLB). By dedicating a sufficiently long block of time for determining whether each entry in the TLB matches a received virtual address, this implementation avoids an erroneous memory access caused by evaluating the MATCH lines too early.
However, such prior art implementation requires an undesirably long time before the cache memory is accessed to satisfy a memory access request. For instance, if the TLB matching completes very quickly for a memory access request, this implementation does not begin the access of the cache memory (e.g., by firing the WORD line) any earlier. Thus, this implementation may result in wasted time in that a portion of time reserved for matching the entries of the TLB with a received virtual address may be unused. That is, because the circuitry is required to wait for a particular clock edge to occur before evaluating the TLB MATCH lines and accessing the appropriate physical address of the cache memory, the speed at which a memory access request can be satisfied is hindered. Thus, this prior art TLB implementation requires an undesirably long amount of time in the critical path for a memory access request. This implementation does not make efficient use of time to determine a match in the TLB entries, and therefore this implementation does not satisfy memory access requests in an efficient and timely manner. More specifically, this prior art design does not utilize a self-timing implementation for the TLB, but rather utilizes a predetermined timing sequence for the TLB. Accordingly, because this prior art design does not utilize a self-timing implementation for the TLB, it does not enable a fast TLB (i.e., a TLB that quickly determines whether a match is achieved for a received virtual address) to satisfy memory access requests efficiently and quickly.
A second implementation of the prior art utilizes a "dummy" row and "dummy" column in an attempt to satisfy memory access requests more efficiently. Turning to FIG. 2A, an example of this second implementation is illustrated. As shown, the TLB CAM 10 comprises 128 entries (or rows) each having 52 bits (or columns). In addition, TLB CAM 10 comprises a dummy row (shown as row 13) and a dummy column (shown as column 11). One bit of the dummy row 13 is tied to a MATCH line for the dummy row 13, and the remaining bits of the dummy row 13 are coupled to the MATCH line but are not enabled (e.g., are tied to ground). For example, the common bit of the dummy row 13 and dummy column 11, shown as bit 17, may be tied to a MATCH line for the dummy row 13, and the remaining bits of the dummy row 13 and dummy column 11 are coupled to the MATCH line but not enabled. As a result, a single NFET 26 pulls down the MATCH line for the dummy row 13, thereby providing a reference for when the evaluation of the MATCH lines for the actual entries of the TLB CAM can be performed. That is, the NFET 26 of bit 17 is implemented to pull down the MATCH line for the dummy row 13. Because a single bit mismatching for an entry will provide the slowest time for pulling the MATCH line to a low voltage level, the dummy row 13 having a single NFET 26 pulling the dummy row's MATCH line low provides a time reference that can be used for triggering the evaluation of the TLB's MATCH lines. Thus, by the time the dummy row's MATCH line is pulled low, the MATCH lines for every entry of the TLB should be set to their appropriate values. More specifically, by the time the dummy row's MATCH line is pulled low, every mismatching entry of the TLB should have completed discharging.
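The worst-case reasoning behind the dummy row can be illustrated with a toy timing model. Purely for illustration, discharge time is assumed proportional to the line capacitance divided by the number of NFETs pulling the line down; the constants are arbitrary and not from the source. Under that assumption, the dummy row's single enabled NFET discharges no faster than any real mismatching entry, so its fall is a safe trigger reference:

```python
# Toy timing model: a MATCH line with more pulling NFETs discharges faster.
# The dummy row, with exactly one enabled NFET, models the slowest case
# (a single mismatching bit), so every real mismatching entry finishes
# discharging no later than the dummy row does.

LINE_CAPACITANCE = 52.0  # arbitrary units: one cell of load per column

def discharge_time(num_pulling_nfets):
    if num_pulling_nfets == 0:
        return float("inf")  # a fully matching entry never discharges
    return LINE_CAPACITANCE / num_pulling_nfets

dummy_time = discharge_time(1)  # dummy row: one enabled NFET
# Every mismatching entry (1..52 pulling NFETs) completes by the reference:
assert all(discharge_time(n) <= dummy_time for n in range(1, 53))
print(dummy_time)  # 52.0 (arbitrary units): the evaluate trigger reference
```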
Accordingly, rather than executing according to a predetermined time sequence, this prior art implementation utilizes a dummy row 13 to provide a reference time for triggering the evaluation of the TLB's MATCH lines. Thus, the evaluation of the TLB's MATCH lines will always be triggered based on the worst case for matching a virtual address within the TLB, i.e., the case in which only one bit of an entry fails to match the virtual address.
To further illustrate the operation of the prior art implementation of FIG. 2A, exemplary wave traces are shown in FIG. 2B. As clock 102 (e.g., the processor clock) goes high, the virtual address 104 is fired (e.g., received into the TLB CAM 10). When the virtual address 104 fires, the dummy MATCH line 110 for dummy row 13 is pulled low through a single NFET of bit 17. At some point after the dummy MATCH line 110 goes low, the self-timed path for evaluating the TLB's MATCH lines, shown as line 106 in FIG. 2B, is triggered (goes high). More specifically, detection circuitry 20 in FIG. 2A is used to detect the dummy MATCH line 110 falling. When detection circuitry 20 detects the dummy MATCH line 110 at a low voltage level, the detection circuitry 20 generates the evaluate signal 106 (e.g., causes the evaluate signal 106 to transition to a high voltage level). When the evaluate signal 106 goes high, the WORD line 108 is triggered. Once the WORD line 108 transitions high, the appropriate physical address of TLB RAM 15 is accessed to satisfy the memory access request.
Because the dummy row 13 provides a reference time for evaluating the TLB MATCH lines based on the worst possible case (i.e., only one bit failing to match a received virtual address), this prior art implementation requires an undesirably long time to trigger the self-timed path for evaluation 106. That is, this prior art implementation requires an undesirably long time before the MATCH lines of the TLB are evaluated to detect a match for a virtual address. Additionally, it is difficult to implement a dummy row 13 that closely models the operation of the actual TLB entries. The process, voltage and temperature (PVT) effects may vary within the TLB circuitry, which makes modeling the actual TLB entries with a dummy row 13 difficult. More specifically, the dummy row implementation typically attempts to model an entry that is a relatively far distance away from the dummy row, thereby making such modeling difficult due to a skew presented over such distance by PVT effects. For example, in some implementations a dummy row is relied upon for modeling an entry that is approximately 1000 microns away. Therefore, it may be very difficult to implement a dummy column 11 and dummy row 13 that closely model a normal row and column in TLB CAM 10 due to different processing problems encountered within the TLB circuitry.
Because of the PVT effects and the difficulty in closely modeling a dummy row 13 and column 11 to a normal row and column of the TLB CAM 10, margin must be added to the critical path to ensure that an erroneous memory access does not occur because of evaluating the TLB's MATCH lines prematurely. Thus, an additional amount of delay is typically added into the self-timed path to account for any other effects that might slow down the MATCH lines for the actual entries of the TLB CAM 10. For example, a delay is typically implemented after the dummy MATCH line 110 is detected at a low voltage before the self-timed path 106 is triggered. Accordingly, an undesirably long amount of time is required for the critical path of a memory access request because of the undesirably long time required in determining a match in the TLB CAM 10. Furthermore, the dummy row and dummy column implementation of FIG. 2A consumes an additional amount of surface area and incurs an additional cost because of the additional column and row that are implemented for the TLB CAM structure 10.
To further improve the speed of a TLB, various attempts have been made to implement a "buddy" self-timed TLB in which the neighboring MATCH lines within a TLB are utilized to trigger the evaluate. Such a "buddy" self-timed TLB is desirable in that the evaluate for a MATCH line is triggered based on a neighboring MATCH line that is relatively close in proximity. However, prior art attempts at implementing a buddy self-timed TLB have been unsuccessful. A prevalent problem in prior art self-timed TLB circuitry is avoiding an erroneous memory access when a disproportionate number of bits fail to match between a virtual address and the TLB's entries. To illustrate this problem of a TLB self-timed circuit, attention is directed to FIG. 3, which illustrates a typical buddy self-timed implementation of the prior art. Shown in FIG. 3 is a first MATCH line, MATCH line A, which is used to indicate whether a match is achieved for a first 52-bit entry of a TLB. Accordingly, MATCH line A has 52 NFETs coupled to it; one for each bit of its corresponding entry. Also shown is a second MATCH line, MATCH line B, which is used to indicate whether a match is achieved for a second 52-bit entry of a TLB. Thus, MATCH line B also has 52 NFETs coupled to it. MATCH line A and MATCH line B are input to a NAND gate 302, which outputs a signal for triggering the evaluation of the TLB's MATCH lines. Of course, a TLB may have many entries (e.g., 128 entries) which each have a separate MATCH line, and all of such MATCH lines would be input with a neighboring MATCH line to NAND gate circuitry, such as NAND gate 302, to determine the evaluate signal for such MATCH lines. However, for simplicity in explaining this prior art implementation, only two entries of a TLB are illustrated in FIG. 3.
In operation, MATCH line A and MATCH line B will each initially have a high voltage value when a virtual address is received by the TLB. While both MATCH line A and MATCH line B have a high voltage value, the evaluate signal output by NAND gate 302 is a low voltage value. However, when either one of MATCH line A or MATCH line B goes to a low voltage value, the NAND gate 302 outputs a high voltage value for the evaluate signal, indicating that the MATCH lines of the TLB are ready to be evaluated and the WORD line fired to access the appropriate physical address of the cache. Suppose, for example, that when the TLB receives a virtual address, all 52 bits of the first entry fail to match the virtual address, resulting in 52 NFETs pulling down on MATCH line A. Further suppose that only one bit of the second entry fails to match the virtual address, resulting in one NFET pulling down on MATCH line B. Because 52 NFETs are discharging MATCH line A, while only one NFET is discharging MATCH line B, MATCH line A is discharged much faster than MATCH line B due to the parasitic capacitance present on MATCH line B. When MATCH line A discharges, it causes NAND gate 302 to trigger the evaluate signal prematurely. That is, MATCH line B has likely not discharged to a low voltage value when MATCH line A causes the NAND gate 302 to trigger the evaluate signal. Accordingly, when the evaluate signal goes high, the WORD line is erroneously fired to access the memory address for the second entry of the TLB because MATCH line B has not completed its discharge yet. Therefore, a prevalent problem with prior art self-timed TLB circuits is the inherent skew in the time required for discharging TLB MATCH lines for entries having disproportionate numbers of bits failing to match a received virtual address. As a result, an effective buddy self-timed TLB has not been developed in the prior art.
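The premature-evaluate hazard described above can be demonstrated with the same toy timing assumption used earlier: discharge time is taken (illustratively, with arbitrary constants) as inversely proportional to the number of pulling NFETs. The NAND fires as soon as the faster line falls, which can be long before the slower, single-NFET line has finished:

```python
# Toy demonstration of the buddy self-timed hazard: 52 NFETs discharge
# MATCH line A quickly, one NFET discharges MATCH line B slowly, and the
# NAND gate triggers the evaluate as soon as the FIRST line falls --
# before B has completed its discharge, so the WORD line fires erroneously.

LINE_CAPACITANCE = 52.0  # arbitrary units

def discharge_time(num_pulling_nfets):
    if num_pulling_nfets == 0:
        return float("inf")  # a matching entry never discharges
    return LINE_CAPACITANCE / num_pulling_nfets

t_a = discharge_time(52)       # all 52 bits of entry A mismatch
t_b = discharge_time(1)        # a single bit of entry B mismatches
evaluate_time = min(t_a, t_b)  # NAND fires on the first line to fall
premature = evaluate_time < t_b
print(premature)  # True: line B is still high when the evaluate triggers
```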
In view of the above, prior art TLB implementations are problematic for several reasons. First, prior art TLB implementations require an undesirably long time before accessing the cache memory to satisfy a memory access request, thereby resulting in an undesirably high latency in the cache. Furthermore, as prior art implementations attempt to decrease the time required for the TLB by utilizing buddy self-timed circuitry, erroneous memory addresses are accessed when a disproportionate number of bits between various TLB entries fail to match a received virtual address. To avoid such erroneous memory address accesses, a dummy column and dummy row implementation is typically utilized for prior art TLBs. However, such a dummy column and dummy row implementation results in an undesirably long time before accessing the cache memory to satisfy a memory access request because the evaluate signal (which triggers the WORD line) is based on a reference time for the worst case matching scenario within the TLB (i.e., only a single bit failing to match the received virtual address). Furthermore, the dummy column and dummy row implementation consumes an undesirably large amount of surface area because it requires an additional column and row to be implemented, and it therefore results in increased cost in implementing the TLB. Additionally, the dummy row and dummy column implementation is problematic in that the dummy row is used to model an entry that is physically located relatively far away therefrom, which increases the skew in the model because of PVT effects.
In view of the above, a desire exists for a TLB implementation that does not require an undesirably long time in evaluating whether a match is achieved in the TLB for a virtual address before accessing the cache memory to satisfy the memory access request. A further desire exists for a TLB implementation that not only evaluates whether a match is achieved quickly, but also avoids access of erroneous memory addresses. A further desire exists for a TLB implementation that does not consume an undesirably large amount of surface area. Accordingly, a desire exists for a TLB that does not utilize a dummy column and dummy row for triggering the evaluation of whether a match is achieved in the TLB. Thus, a desire exists for a self-timed TLB that is capable of triggering an evaluation of whether a match is achieved in the TLB soon after the TLB entries have been compared with the virtual address to minimize the latency in accessing memory to satisfy a memory access request, while also avoiding accesses of erroneous memory addresses.
These and other objects, features and technical advantages are achieved by a system and method which provide a self-timed TLB that utilizes a two-level match scheme to trigger the evaluation of whether a match is achieved for a received virtual address within the TLB. The first level of the match scheme is referred to herein as the local match, and the second level of the match scheme is referred to herein as the global match. In a preferred embodiment, an entry of a TLB comprises groups of bits, with each group coupled to a separate local match line. The local match lines are initially set to a high voltage level, in a preferred embodiment, and if any bit within a group fails to match the corresponding bit of the virtual address, it pulls the local match line to a low voltage level. Additionally, in a preferred embodiment, each of the local match lines of an entry is coupled to a global match line. The global match line is initially set to a high voltage level, and if any of the local match lines indicate that their respective group fails to match a received virtual address (e.g., by the local match line having a low voltage level), then such a local match line causes the global match line to be pulled to a low voltage level. Accordingly, if the global match line has a high voltage level when the global match lines are evaluated, it indicates that the TLB entry associated with the global match line matches the received virtual address. However, if the global match line has a low voltage level when the global match lines are evaluated, it indicates that the TLB entry associated with the global match line fails to match the received virtual address.
For example, in a most preferred embodiment, a 52-bit TLB entry comprises four groups of bits, with each group having 13 bits. For instance, bits [1:13] form a first group, bits [14:26] form a second group, bits [27:39] form a third group, and bits [40:52] form a fourth group. Each of the groups is coupled to a local match line. For instance, each bit of the first group is coupled to a local match line A, each bit of the second group is coupled to a local match line B, each bit of the third group is coupled to a local match line C, and each bit of the fourth group is coupled to a local match line D. In a most preferred embodiment, the local match lines are initially set to a high voltage level. When a virtual address is received in the TLB, each bit of each group is compared with the corresponding bit of the received virtual address. If one or more bits of a group fail to match the corresponding bits of the virtual address, then such mismatching bit(s) pull their respective local match line(s) to a low voltage level. Otherwise, the local match line for the group of bits remains at a high voltage level.
Furthermore, in a most preferred embodiment, each of the local match lines controls a FET that is coupled to the global match line for an entry of the TLB. Therefore, in a most preferred embodiment, four FETs are coupled to the global match line for an entry (i.e., one FET for each local match line). If one or more of the local match lines are at a low voltage level, indicating that their respective group(s) of bits failed to match the corresponding bits of the received virtual address, then such local match line(s) turn their FET(s) on to pull the global match line to a low voltage level. Otherwise the global match line for the TLB entry remains at a high voltage level.
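The two-level scheme described above can be sketched as a small software model. This is an illustrative sketch of the logic only, not the claimed circuit: a 52-bit entry is split into four 13-bit groups, each group drives a local match line, and any low local match line pulls the entry's global match line low.

```python
# Two-level match model: local match lines per 13-bit group, and a global
# match line per entry that goes low if any local line is low.

def split_groups(bits, group_size=13):
    """Split a bit string into consecutive groups (four groups of 13 for
    the 52-bit entry described in the text)."""
    return [bits[i:i + group_size] for i in range(0, len(bits), group_size)]

def entry_matches(entry_bits, address_bits):
    """Returns (local_match_lines, global_match_line) for one TLB entry."""
    locals_ = [eg == ag for eg, ag in zip(split_groups(entry_bits),
                                          split_groups(address_bits))]
    # Any low local line turns on its FET and discharges the global line:
    global_ = all(locals_)
    return locals_, global_

entry = "1" * 52
addr_hit = "1" * 52
addr_miss = "1" * 51 + "0"  # a single mismatching bit, in the fourth group
print(entry_matches(entry, addr_hit))   # ([True, True, True, True], True)
print(entry_matches(entry, addr_miss))  # ([True, True, True, False], False)
```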
In a preferred embodiment, one global match line triggers the evaluation of a neighboring global match line to determine whether a match is achieved for any entry. More specifically, in a preferred embodiment, two global match lines are input to a NAND gate, and the output of the NAND gate is utilized to trigger the evaluation of the two global match lines. In a preferred embodiment, only one entry in a TLB may match a received virtual address. Accordingly, at most, only one of the two global match lines will remain at a high voltage level, indicating a match with the received virtual address. Therefore, when at least one of the two global match lines transitions low, the output of the NAND gate transitions high to trigger an evaluation of the pair of global match lines.
A slight timing skew may be present between the two global match lines transitioning to a low voltage level; however, such a timing skew is very small in a preferred embodiment. One way in which a preferred embodiment minimizes such timing skew is by implementing local match lines that are associated with a group of bits of an entry. By implementing an entry of the TLB as multiple groups of bits, a largely disproportionate number of bits mismatching in any one group is avoided. For example, in a most preferred embodiment, each group of bits comprises 13 bits of the entry. Accordingly, the most disproportionate result that can occur in a most preferred embodiment is having 13 bits mismatching in one group and only one bit mismatching in another group. It will be recalled that in prior art implementations, a much more disproportionate result could occur, such as 52 bits mismatching for one entry and only one bit mismatching for another. Therefore, the local match of a preferred embodiment reduces the timing skew within the TLB circuitry because a large, disproportionate number of bits mismatching is curtailed.
Furthermore, the timing skew is reduced in a preferred embodiment because a relatively small number of FETs are implemented to pull down on the global match line for each entry. For example, in a most preferred embodiment, four NFETs are coupled to a global match line to pull down the global match line if the associated entry fails to match a received virtual address. Accordingly, the number of NFETs pulling down the global match lines for a pair of entries cannot be very disproportionate. For instance, in a most preferred embodiment, the most disproportionate result is having four NFETs pulling down the global match line for one entry and only one NFET pulling down the global match line for another entry. Also, in a preferred embodiment, the global match line FETs are much larger than the FETs utilized in prior art implementations to pull down a match line for an entry (e.g., NFET 26 of FIG. 1). Because the global match line FETs are larger in a preferred embodiment, the global match line can be pulled down much faster with only a single FET, thereby reducing the gain (or skew) recognized by having more than one FET pulling down on the global match line. In view of the above, the time skew in determining a match for two entries of a TLB is minimized. Therefore, a preferred embodiment provides a TLB that utilizes a "buddy" self-timed evaluation of the TLB's global match lines to minimize the latency in accessing the cache memory.
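The reduction in worst-case disproportion can be summarized with the counts stated above. As back-of-the-envelope arithmetic (treating the pull-down count ratio as a rough proxy for the skew, which is an illustrative simplification):

```python
# Worst-case pull-down disproportion between a pair of neighboring lines:
# prior-art single-level MATCH lines can see 52 NFETs versus 1, while the
# grouped scheme bounds the global line at 4 versus 1 (and 13 versus 1 on
# any local line).

prior_art_ratio = 52 / 1  # all 52 bits mismatch vs. a single bit mismatch
global_ratio = 4 / 1      # all four local lines low vs. one local line low
local_ratio = 13 / 1      # all bits of one group low vs. one bit low

print(prior_art_ratio / global_ratio)  # 13.0: reduction in worst-case ratio
```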
It should be appreciated that a technical advantage of one aspect of the present invention is that a TLB implementation is provided that reduces the latency in accessing memory to satisfy a memory access request. A further technical advantage of one aspect of the present invention is that a TLB implementation is provided that utilizes a "buddy" self-timed evaluation of whether a match is achieved for a received virtual address in the entries of the TLB. As a result, a technical advantage of one aspect of the present invention is that a TLB implementation is provided that minimizes the time required for evaluating whether a match is achieved in the TLB for a virtual address, thereby minimizing the latency in accessing the cache memory to satisfy a memory access request. For instance, a preferred embodiment provides a "buddy" self-timed evaluation implementation for a TLB, wherein the evaluation of a MATCH line is triggered by another MATCH line in relatively close proximity thereto. A further technical advantage of one aspect of the present invention is that a TLB implementation is provided that evaluates whether a match is achieved in the TLB as soon as possible, but also in a manner that avoids access of erroneous memory addresses. It should also be appreciated that a technical advantage of one aspect of the present invention is that a TLB implementation is provided that does not consume an undesirably large amount of surface area. For instance, a TLB implementation is provided that does not require a dummy column and dummy row for triggering the evaluation of whether a match is achieved in the TLB.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.