Diagnosing and debugging failures in large, complex modern digital designs is a difficult task that spans the entirety of the design process. Recently, the role of post-silicon validation has increased, particularly in the microprocessor design industry, in light of the scaling problems of pre-silicon methodologies and tight time-to-market development schedules.
Pre-silicon verification operates on an abstract design model and has the advantage of being fully deterministic and fully observable, but is limited by its slow speed and low fault coverage. Failing testcases can be reliably reproduced to diagnose functional bugs, which in turn manifest consistently. By contrast, real silicon lacks observability, controllability, and deterministic repeatability. As a result, some tests may not produce the same outcome over multiple executions, due to the interaction of asynchronous clock domains and varying environmental and electrical conditions. Bugs that manifest inconsistently over repeated executions of a same test are particularly difficult to diagnose. Furthermore, the number of observable signals in post-silicon is extremely limited, and transferring observed signal values off-chip is time-consuming. During post-silicon validation, tests are executed directly on silicon prototypes. A test failure can be due to complex functional errors that escaped pre-silicon verification, electrical failures at the circuit level, and even manufacturing faults that escaped testing. The failed test must be re-run by validation engineers on a post-silicon validation hardware platform with minimal debug support. Post-silicon failure diagnosis is notoriously difficult, especially when tests do not fail consistently over multiple runs. The limited observability and controllability characteristics of this environment further exacerbate this challenge, making post-silicon diagnosis one of the most challenging tasks of the entire validation effort.
In industry practice, the post-silicon validation process begins when the first silicon prototypes become available. These chips are then connected to specialized validation platforms that facilitate running post-silicon tests, a mix of directed and constrained-random workloads. Upon completion of each test, the output of the silicon prototype is checked against an architectural simulator, or in some cases, self-checked.
When a check fails (i.e., the semiconductor device has failed the test), indicating that an error has occurred, the debugging process begins, seeking to determine the root cause of the failure. On-chip instrumentation can be used to observe intermediate signals. Techniques such as scan chains, on-chip logic analyzers, and flexible logging infrastructures are configured to trace design signals (only a small number can usually be observed) and periodically transfer data off-chip. Traces are then examined by validation engineers to determine the root cause of the problem. This process may be time-consuming and engineering intensive, and may be further exacerbated by bugs with inconsistent outcomes. Additionally, off-chip data transfers are very slow, which further hinders observability due to limited transfer time.
The debugging process of non-deterministic failures can be aided by deterministic replay mechanisms. However, these solutions perturb system execution which can prevent the bug from manifesting, and often incur significant hardware and performance overheads. In addition, in an effort to automate the failure diagnosis process, methods based on formal verification techniques have been proposed. These solutions require deterministic execution and a complete golden (known-correct) model of the design for comparison. However, the fundamental scaling limitations of formal methods preclude these techniques from handling industrial size designs. Accordingly, there is a need for a scalable post-silicon validation platform to localize inconsistent bugs, minimize off-chip transfers, without a priori knowledge of the design or the failure.