Not applicable.
Not applicable.
1. Field of the Invention
The present invention relates generally to the debugging of digital logic devices. More specifically, the present invention relates to the retrieval of state data and program counter data from a digital logic device. Still more particularly, the invention relates to a digital logic device that includes a port for off-loading test data generated at or near the normal clock speed of the digital logic device to an external device operating at a slower speed.
2. Background of the Invention
The design and development of digital logic circuits has become increasingly complex, due in large measure to the ever-increasing functionality offered in such circuits. Integrated circuits are constantly surpassing milestones in performance, as more and more functionality is packaged into smaller sizes. This enhanced functionality requires that a greater number of transistors be included in an integrated circuit, which in turn requires more rigorous testing to ensure reliability once the device is released. Thus, integrated circuit designs are repeatedly tested and debugged during the development phase to minimize the number and severity of errors that may subsequently arise. In addition, chips may be tested to determine the performance characteristics of the device, including the speed or throughput of the chip, software running on the chip, or the aggregate performance of the system.
As integrated circuits become more complex, the length of the debug phase increases, requiring a greater advanced lead-time before product release. In addition, as the complexity of integrated circuits increases, it becomes necessary to fabricate more prototype iterations of the silicon (or "spins" of silicon) in order to remove successive layers of bugs from the design, thereby increasing the engineering and materials cost of the released product. It would be desirable to reduce these engineering and material costs and speed up the product cycle. Moreover, if more data, or more accurate data, were available for analysis, the designers and debuggers might be able to expedite the design and debug process for the product, thereby minimizing the number of spins and the time to release the product.
One of the chief difficulties encountered during the debug phase of a product is identifying the source of an error. This can be extremely difficult because the error may make it impossible to obtain state information from the integrated circuit. For example, in a processor, an error may cause the processor to quit executing, thus making it impossible to obtain the state data necessary to identify the source of the error. As a result, the debug process requires that the debug team infer the source of the error by looking at memory accesses by the processor or patterns of activity on other external busses. The normal technique for probing external busses is to solder a wire onto a terminal or trace. Unfortunately, merely adding a soldered wire to a terminal or trace can create signal reflections, which may distort the data being monitored. Thus, the manual probing of bus terminals and traces is impractical and inaccurate, especially for those attached to high speed, highly complex chips. More sophisticated techniques are also used, but are expensive and suffer, albeit to a lesser degree, from the same effects. Further, because the state information available on these busses is typically a small subset of the processor's state, the debug team must make guesses regarding the state of data internal to the processor. If the internal state of the processor could be acquired and stored, these inferences would be replaced by solid data. By reducing the designer's uncertainty and increasing the available data, this would be beneficial in solving problems with the processor hardware or software.
In certain products under development, such as new microprocessors under development by the assignee of the present invention, the number of transistors is exceedingly large and their dimensions are exceedingly small. Both of these factors make it practically impossible to probe internal terminals of the chip or internal wire traces. Moreover, to the extent that certain internal terminals and traces could be probed, the conventional methods for conducting such a probing operation are extremely expensive, and some might potentially corrupt the state of the terminals and traces being probed. Consequently, the only common technique currently available to test or probe the state of terminals and traces in highly complex chips is to route signals through the chip's external output terminals, to some external interface. This approach, as presently implemented, suffers in certain respects.
Oftentimes the internal clock rate of the chip operates at a much higher rate than the external logic analyzers that receive and process the data. As an example, processor designs currently under development operate at clock speeds up to and exceeding 2.0 GHz. The fastest commercial logic analyzers, despite their expense, are incapable of operating at GHz frequencies. Thus, either certain data must be ignored, or some other mechanism must be employed to capture the high-speed data being generated on the chip. The typical approach is to run the chip at a slower clock speed so the data can be captured by external test equipment. This solution, however, makes it more difficult to detect the bugs and errors that occur when the chip is running at full clock speeds. Some errors that occur at full clock speed will not be detected when the clock speed is reduced to accommodate the off-chip logic analyzers. Also, increasingly the processor connects to external components that have a minimum speed, below which they will not operate. These speeds require the processor to operate faster than the external logic analyzer can accommodate.
The assignee of the present invention has developed a specially dedicated port with pads for accessing test data from several on-chip data sources. This port permits a large quantity of internal state data to be sent off the chip at a relatively high bandwidth. Despite the advances offered by this dedicated port, the amount of transmission bandwidth available from this port still reflects but a fraction of the bandwidth required to sample all of the data being generated internally in the chip. Thus, even with the increased bandwidth offered by the dedicated test port, there is a mismatch between the chip's ability to create data and the port's ability to off-load the data. Given this mismatch, it is impossible to get all of the internal state data off the chip.
One way to handle the mismatch between the amount of state data being created and the output data rate has been through selection of the data to be output. The prior art systems that have attempted to address this problem have been configured to operate in a worst case scenario, so that if the output port is saturated, some fraction of the data, or the most desirable data, will be selected and sent off chip. Thus, as an example, the operator may configure the device to output every nth packet of data, and discard all others. Alternatively, the device may be configured to ignore all non-operational packets, and send only operational packets off chip. The problem with these approaches is that there are periods when the output port is not saturated, and therefore other internal state data could be sent off-chip during these periods. Unfortunately, no one has developed a system that permits internal state data to be off-loaded in dynamic fashion, so that the most important data is off-loaded during periods when the port is saturated, and all state data is off-loaded if the port is not saturated.
It would be desirable if a system or technique were developed that would permit the downloading of data from a device under test to external logic analyzers, while operating the device at normal or close to normal clock speeds. It also would be advantageous if a mechanism were developed that would dynamically optimize the data that is downloaded based on how busy the port is. Despite the apparent advantages that such a system would offer, to date no such system has been developed.
The problems noted above are solved in large part by a bandwidth manager that is capable of receiving incoming state data, and outputting the most important portions of that data. The bandwidth manager receives state data from one or more sources within the chip. The data from the sources is delivered to the output port in packets, each of which comprises one or more ticks. In these packets, as is conventional in most communication protocols, the most critical information for debug is located in the early ticks in the packet. Examples of critical information include packet type, addressing, flow control, and resource identifiers. The remaining information in longer packets is data, also referred to as payload. In many debugging exercises, the payload represents less important information than the header information.
The source of the incoming data is selected in a front end multiplexer. The front end multiplexer may include packet prediction logic which decodes the packet type opcode in the first tick of the packet, and which predicts whether a particular tick is the start of a packet or not. In addition, the packet prediction logic predicts the number of ticks in a packet. The packet prediction logic adds one or more bits to each tick of the packet, which indicate the relative importance of that tick. Another bit may be added to each tick to indicate if the tick is full of valid data. On each clock cycle, the front end multiplexer presents a tick of data to a smart buffer. The smart buffer decides whether to accept the data tick from the front end multiplexer based on the importance bit(s) and the full bit. If the newly presented tick has a higher importance level than the tick stored at the rear of the smart buffer, then the newly presented data tick overwrites the less important tick previously stored. This tail-eating mechanism repeats on each clock cycle, so that the most important data is saved instead of less important data. If the smart buffer does not accept a packet tick because its importance is less than that of the last stored tick, then the newly presented tick is dropped. Every subsequent tick in that packet will also be dropped automatically. Data in the smart buffer is queued for transmission off the chip at the output data rate.
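The tail-eating behavior described above can be illustrated with a minimal behavioral sketch. It is not the hardware implementation; the names `Tick`, `SmartBuffer`, `present`, and `output_cycle` are hypothetical, and the sketch assumes that a larger importance value means a more important tick and that a predicted start-of-packet marker accompanies each tick.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tick:
    data: str
    importance: int  # assumed encoding: larger value = more important
    full: bool       # "full" bit: the tick carries valid data
    start: bool      # predicted start-of-packet marker (hypothetical field)


class SmartBuffer:
    """Behavioral model only; slots[0] leaves next, slots[-1] is the rear."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.slots: List[Tick] = []
        self.dropping = False  # True while the rest of a truncated packet is discarded

    def present(self, tick: Tick) -> bool:
        """Offer one tick per clock cycle; return True if it was stored."""
        if not tick.full:
            return False
        if tick.start:
            self.dropping = False      # a new packet ends the drop state
        if self.dropping:
            return False               # remaining ticks of a truncated packet
        if len(self.slots) < self.depth:
            self.slots.append(tick)    # room available: store normally
            return True
        if tick.importance > self.slots[-1].importance:
            self.slots[-1] = tick      # tail-eating: overwrite the rear tick
            return True
        self.dropping = True           # rejected: drop this tick and its followers
        return False

    def output_cycle(self) -> Optional[Tick]:
        """Remove the oldest tick for transmission, if any."""
        return self.slots.pop(0) if self.slots else None


buf = SmartBuffer(depth=2)
ticks = [
    Tick("H1", 3, True, True), Tick("P1", 1, True, False),
    Tick("H2", 3, True, True), Tick("P2", 1, True, False),
]
print([buf.present(t) for t in ticks])  # [True, True, True, False]
print([t.data for t in buf.slots])      # ['H1', 'H2']
```

In this example the header of the second packet overwrites the payload tick of the first, and the second packet's payload is then dropped because the buffer is full of more important ticks.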
According to one aspect of the invention, the smart buffer may be organized in a modified FIFO arrangement. Each storage location in the smart buffer knows whether it is empty or full and the relative importance of the tick that it contains. In addition, each storage location receives signals that indicate what the corresponding information is for adjacent storage locations, and also knows if the current cycle is an output cycle for the output port. When data is presented, each element decides what to do. If the current cycle is an output cycle of the port, data can be accepted because the next storage location can also accept data. Also, if the adjacent element is empty, then incoming data can be accepted. Similarly, if any downstream location is empty, then data can be accepted. If none of the downstream storage locations are empty, then the new incoming data can only be accepted if it is more important than the data stored in the last storage location. Of course, the scheduling function could be handled centrally, but the organization described is preferred because it will naturally produce regular data and control structures.
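The acceptance conditions listed above can be condensed into a short sketch. The function name `accepts` and its parameters are hypothetical; the lists are assumed to be ordered from the location nearest the output port (index 0) to the rear location nearest the input (last index), and a larger importance value is assumed to mean a more important tick. The sketch evaluates the decision in one place for brevity, whereas the text prefers that each storage location decide locally from its neighbors' signals.

```python
from typing import Sequence


def accepts(full: Sequence[bool], importance: Sequence[int],
            output_cycle: bool, new_importance: int) -> bool:
    """Decide whether an incoming tick can be stored this cycle."""
    if output_cycle:
        return True  # a tick leaves through the port, so every location can shift
    if not all(full):
        return True  # an adjacent or further-downstream location is empty
    # No free location: accept only by overwriting a less important rear tick.
    return new_importance > importance[-1]


print(accepts([True, True], [3, 1], False, 2))   # True
print(accepts([True, True], [3, 3], False, 3))   # False
```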
According to one aspect of the present invention, data may be handled in the bandwidth manager in multiple parallel streams. Thus, the front end multiplexer selects n data sources for presentation to n smart buffers. Each of the smart buffers is configured in identical fashion. Where multiple streams are used, a back end multiplexer selects which of the streams to connect to the output drivers to be driven off the chip. The back end multiplexer includes arbitration logic that determines which of the data streams is selected for each available output time-slot. According to the preferred embodiment, if only one stream has valid data, it is selected. If one stream has already truncated transmission of its current packet, another stream is selected until the start of a new packet is encountered. If one stream has not truncated, and the other streams of equal or higher importance have at least one free entry in the smart buffer, the first stream is selected. (This rule prevents fragmentation when, for example, two streams of equal importance might otherwise have long coincident packets that exceed the available bandwidth. Absent this rule, the two streams would, from time to time, interfere with each other, thus causing truncation on both streams. The result would be that both streams truncate, and then neither stream could use the available output cycles until the start of a packet. To avoid this waste of bandwidth, the effect of this rule is to cause a stream to defer to equal importance data until the end of the packet, so long as the other streams can accept higher importance data. The effect of the fairness term is that, generally, streams will alternate under conditions of high load.) Otherwise, if the data from one stream is more important than data from the other(s), the more important data is selected. If the streams are of equal importance and none of the other conditions exist, then the arbitration logic selects the least recently selected stream.
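The arbitration rules above can be read as a priority-ordered decision, sketched below under stated assumptions. The `Stream` fields and the `arbitrate` function are hypothetical names. In particular, the sketch interprets "the first stream" in the third rule as the stream that is currently sending an untruncated packet, and it assumes a larger importance value means more important data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stream:
    valid: bool           # stream has a tick ready for output
    importance: int       # importance of that tick (larger = more important)
    truncated: bool       # current packet was already truncated
    sending: bool         # stream is mid-packet on the output (assumed flag)
    has_free_entry: bool  # its smart buffer has at least one free entry
    last_selected: int    # cycle number when it was last selected


def arbitrate(streams: List[Stream]) -> Optional[int]:
    """Return the index of the stream chosen for this output time-slot."""
    ready = [i for i, s in enumerate(streams) if s.valid]
    if not ready:
        return None
    if len(ready) == 1:
        return ready[0]                      # only one stream has valid data
    live = [i for i in ready if not streams[i].truncated]
    if live:
        ready = live                         # pass over truncated streams
    if len(ready) == 1:
        return ready[0]
    for i in ready:                          # let an in-progress packet finish
        if streams[i].sending:
            rivals = [streams[j] for j in ready
                      if j != i and streams[j].importance >= streams[i].importance]
            if all(r.has_free_entry for r in rivals):
                return i
    top = max(streams[i].importance for i in ready)
    best = [i for i in ready if streams[i].importance == top]
    # More important data wins; ties go to the least recently selected stream.
    return min(best, key=lambda i: streams[i].last_selected)


a = Stream(True, 2, False, True, True, last_selected=10)
b = Stream(True, 2, False, False, True, last_selected=5)
print(arbitrate([a, b]))  # 0: stream a keeps the port while b still has room
```

If stream `b` had no free entry, the third rule would no longer apply and the tie would fall to the least recently selected stream, which is `b` here.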
According to another aspect of the invention, the clock rate of the bandwidth manager logic may be determined automatically. In some cases, data from the source is accompanied by a forwarded clock. A clock in the bandwidth manager uses the internal CPU clock to measure the period of the forwarded clock to determine the base frequency. If the input source has a fractional divisor, the nominal frequency is rounded to the next slower integral divisor. This enables the output rate to be determined independently of the CPU clock and the output driver frequency, by turning it into a ratio that can be determined by auto-sensing.
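The rounding step can be illustrated with a small arithmetic sketch. It assumes the measurement yields a count of CPU clock cycles observed over a known number of forwarded-clock periods, and that "the next slower integral divisor" means rounding the ratio up to the next integer; the function name `integral_divisor` and its parameters are hypothetical.

```python
def integral_divisor(cpu_cycles: int, forwarded_periods: int) -> int:
    """Round the CPU-to-forwarded clock ratio up to an integer divisor.

    A larger divisor corresponds to a slower nominal frequency, so a
    fractional ratio such as 2.5 becomes 3.
    """
    if cpu_cycles <= 0 or forwarded_periods <= 0:
        raise ValueError("counts must be positive")
    return -(-cpu_cycles // forwarded_periods)  # integer ceiling division


print(integral_divisor(25, 10))  # 3 (ratio 2.5 rounded to the slower divisor)
print(integral_divisor(40, 10))  # 4 (ratio is already integral)
```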
According to another aspect of the present invention, the bandwidth manager predicts the start of a data packet from the packet type opcode. A signal is received by the bandwidth manager from other logic in the chip that indicates that data received several ticks before was the start of a packet. To avoid the overhead which would be required to buffer the incoming ticks until the start of packet signal was received, logic is provided which predicts the start of packets, and which resynchronizes using the start of packet signal in conjunction with a data pattern of a known length, such as a series of 1-tick packets. The logic checks for the possible data pattern each clock cycle, and indicates whether the data is valid and whether it matches the desired data pattern. When the delayed start of packet signal arrives, if the history indicates a match of the pattern, then it is known that the current cycle is the start of a packet, and the current tick can be used to predict packet length.
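The resynchronization idea can be sketched as a short history register. The sketch assumes the start of packet signal arrives a fixed number of ticks (`delay`) after the tick it refers to, and that the known pattern is a run of 1-tick packets; the class name `Resynchronizer`, the `clock` method, and the opcode test are hypothetical. If every tick since the flagged one was a valid 1-tick packet, each of them ended on its own tick, so the current tick must begin a packet.

```python
from collections import deque
from typing import Callable


class Resynchronizer:
    def __init__(self, delay: int, is_one_tick: Callable[[int], bool]) -> None:
        if delay < 1:
            raise ValueError("delay must be at least one tick")
        # One match flag per tick still awaiting its delayed signal.
        self.history = deque([False] * delay, maxlen=delay)
        self.is_one_tick = is_one_tick

    def clock(self, valid: bool, opcode: int, delayed_sop: bool) -> bool:
        """Return True if the current tick is known to start a packet."""
        in_sync = delayed_sop and all(self.history)
        # Record whether this tick could be a 1-tick packet, for later cycles.
        self.history.append(valid and self.is_one_tick(opcode))
        return in_sync


ONE_TICK_OPCODE = 0x1  # hypothetical opcode value for a 1-tick packet
r = Resynchronizer(delay=2, is_one_tick=lambda op: op == ONE_TICK_OPCODE)
cycles = [(True, 0x1, False), (True, 0x1, False), (True, 0x5, True), (True, 0x1, True)]
print([r.clock(*c) for c in cycles])  # [False, False, True, False]
```

In the third cycle the delayed signal arrives and the two preceding ticks both matched the pattern, so that cycle is identified as a start of packet. In the fourth cycle the history contains a non-matching tick, so no conclusion is drawn.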
These and other aspects of the present invention will become apparent upon reading the detailed description of the preferred embodiment and the appended claims.