1. Field of the Invention
The present invention relates to processors such as a CPU (Central Processing Unit), MPU (Micro Processing Unit), DSP (Digital Signal Processor) or GPU (Graphics Processing Unit: graphics processor, or image processing LSI or geometry engine), comprising a plurality of execution units (hereinafter referred to simply as “cores”), and to testing methods for such processors.
2. Description of the Related Art
Conventionally, in computer systems, such as servers, where especially high processing capacity is required, such as for the primary business processing of an enterprise, improvements in processing capacity have been achieved by connecting a plurality of processors via loose coupling using a cluster configuration or close coupling using an SMP (Symmetrical Multi-Processor) configuration.
However, with loose coupling using a cluster configuration, communication overhead between the server nodes becomes a problem, while with close coupling using SMP, the complexity of the server hardware becomes a problem. In both cases, with conventional architecture, there is a limit to the performance improvement that can be achieved on a single computer system.
In this regard, multicore processors such as CMPs (Chip Multi-Processors), which enable performance improvements by employing a multicore architecture wherein a plurality of cores are installed in a single processor, are currently becoming mainstream in the field of high-end processors.
However, in the case of multicore configurations such as CMP, the improvements in processing performance achieved by increasing the number of cores are offset by problems such as increasing complexity of control due to the installation of a plurality of cores and lower yields during semiconductor manufacturing due to increased die size. The reduced yield due to increased die size is a particularly important problem for multicore processors such as CMPs.
FIG. 1 illustrates the basic hardware configuration of a conventional single-core processor.
Processor 101 comprises a unified level-2 cache block 102 comprising a local interconnect interface 110 and a unified level-2 cache 111, and a core block 103 comprising a level-1 instruction cache 112, level-1 data cache 113, instruction branch unit 114, instruction issue unit 115, load store unit 116, general purpose register file 117, integer execution unit 118, integer completion unit 119, floating point register file 120, floating point execution unit 121, and floating point completion unit 122. The processor 101 is connected to other processors and to main memory via the local interconnect interface 110 and receives instructions and data from the main memory.
An instruction received through the local interconnect interface 110 is fed via the unified level-2 cache 111, level-1 instruction cache 112 and instruction branch unit 114 into either the general purpose register file 117 or the floating point register file 120, and is given to either the integer execution unit 118 or the floating point execution unit 121.
Data received through the local interconnect interface 110 is fed via the unified level-2 cache 111, level-1 data cache 113 and load store unit 116 into either the general purpose register file 117 or floating point register file 120, thereby providing data to either the integer execution unit 118 or the floating point execution unit 121.
The data to be operated on in said integer execution unit 118 and the result of the operation in the integer execution unit 118 are written back and stored in the general-purpose register file 117 by the integer completion unit 119. The data to be operated on in the floating point execution unit 121 and the result of the operation in the floating point execution unit 121 are written back and stored in the floating point register file 120 by the floating point completion unit 122.
Thus, one method to improve processing performance of a computer system, such as a server, is to increase the number of execution units contained in the computer system.
Furthermore, the configuration of a server using a conventional symmetrical multiprocessor is shown in FIG. 2. Processor 201 comprises a single core block 211 and level-2 cache block 210. Furthermore, the server system comprises a plurality of said processors 201 connected via a processor local interconnect, a service processor 203 connected via a JTAG interface as specified in IEEE 1149.1, and a processor local interconnect arbiter 202, as well as a system backplane crossbar controller 206 connected via a system backplane crossbar. The processor local interconnect arbiter 202 performs arbitration between the processors connected to the processor local interconnect. Moreover, the system backplane crossbar controller 206 performs interface control among system boards connected to the system backplane crossbar.
The registers and scan FFs in each CPU are set by performing scans on the core blocks 211 in said plurality of processors 201 via the JTAG interface by controlling the service processor 203 by means of a service processor program 204 and a service processor terminal 205.
Next, as an example of the application of multicore processors, FIG. 3 illustrates the configuration of a server system using 2-CMP multicore processors with two cores. Processor 301 comprises a core-0 block 311, core-1 block 312 and CMP common block 310. Furthermore, the server system comprises a plurality of said processors 301 connected via a processor local interconnect, a service processor 203 connected via a JTAG interface and a processor local interconnect arbiter 202, as well as a system backplane crossbar controller 206 connected via a system backplane crossbar. The registers and scan FFs in each CPU are set by performing scans on the core-0 block 311 and core-1 block 312 in said plurality of processors 301 via the JTAG interface by controlling the service processor 203 by means of a service processor program 204 and a service processor terminal 205.
System configurations containing a JTAG interface were described above for the case where processors are installed in a computer system such as a server. However, another important function of the JTAG interface is LSI component testing during semiconductor manufacture. Conventionally, in LSI component testing, identification of defect-free LSIs was performed by inputting a test pattern from an LSI tester into the LSI to be tested, testing the LSI internal circuitry, and then returning the output to the LSI tester and comparing it to expected value data that had been prepared in advance. However, with the increasing scale of LSI logic, such as processors manufactured to a high scale of integration by recent ultramicro processes, the size of the test pattern can no longer be disregarded. Increased test pattern size not only affects production efficiency by requiring a longer time for LSI component testing, but also requires more advanced, higher-performance LSI testers, leading to increased costs of LSI testing.
In this regard, recent processors and other large scale integrated circuits have adopted the method of incorporating in advance a self-diagnosis circuit called a BIST (Built-In Self Test) circuit, comprising a test pattern generating circuit and a test result analyzer circuit, thereby greatly reducing the signal interface between the LSI circuit being tested and the LSI tester and keeping the costs of the LSI test from increasing. BIST circuits are broadly categorized as RAM-BIST, which is used for testing memory, especially built-in caches, in a large-scale integrated circuit such as a processor, and logic BIST, which is used for testing logic, such as built-in execution units. Since logic BISTs are designed for testing logic circuits such as execution units, as described above, a logic BIST circuit which treats the plurality of built-in cores as a test unit could be installed in the multicore processors which are currently becoming mainstream.
Here, FIG. 4 illustrates an example of the conventional configuration of a logic BIST circuit in a 2-CMP multicore processor comprising two core blocks. Processor 401 is a 2-CMP multicore processor comprising a logic BIST circuit block 402, core-0 block 403, core-1 block 404, and CMP common block 405. Furthermore, logic BIST circuit block 402 contains a TAP controller 411, scan chain selection control circuit 412, LFSR (Linear Feedback Shift Register) test pattern generating circuit 413, scan chain switching MUX circuit 414, and MISR (Multiple Input Signature Register) test pattern compression circuit 415.
The TAP (Test Access Port) controller 411 controls scan shifting for circuits such as built-in RAM and built-in execution units at the wafer manufacturing stage and package manufacturing stage in the LSI manufacturing process of the processor 401. Furthermore, when a processor 401 equipped with said TAP controller 411 is installed in a computer system, system control is performed by means of JTAG commands and the like.
First, the scan chain selection control circuit 412 is controlled by the TAP controller 411 and the scan chain is switched by means of the scan chain switching MUX circuit 414 from system mode to logic BIST mode (scan chain select).
Then, an initial test pattern is transferred from the LSI tester (not illustrated) to the TAP controller 411 (test data-in). Next, the TAP controller 411 causes the initial test pattern to be scanned into the test pattern storage shift register included inside the LFSR test pattern generating circuit 413 (test pattern scan-in), and applies a shift clock (not illustrated) to said shift register, causing a pseudo-random number based test pattern to be generated as the output of the LFSR test pattern generating circuit 413. The generated test pattern passes through the scan chain switching MUX circuit 414, which has been switched to logic BIST mode, and is applied to core-0 block internal scan F/F chain 421, core-1 block internal scan F/F chain 422, and CMP common block internal scan F/F chain 423. A pulse of the system clock is then applied to operate each logic block circuit, so that the test pattern as logically operated on by each logic block circuit is set in the scan chain of that logic block circuit.
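The pattern generation step described above can be illustrated with a minimal software sketch of an LFSR. The register width (4 bits), the tap positions (0 and 1) and the function names are assumptions chosen for illustration only; they are not taken from the LFSR test pattern generating circuit 413.

```python
def lfsr_step(state, taps, width):
    """Advance a Fibonacci-style LFSR by one shift clock.

    Returns (new_state, output_bit). `taps` lists the bit positions that
    are XORed to form the feedback bit (an illustrative choice).
    """
    out = state & 1                      # bit shifted out toward the scan chains
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1           # XOR of the tapped bits
    new_state = (state >> 1) | (fb << (width - 1))
    return new_state, out


def lfsr_pattern(seed, taps, width, n_bits):
    """Scan in `seed` (the initial test pattern) and shift out n_bits."""
    state = seed & ((1 << width) - 1)
    if state == 0:
        raise ValueError("an all-zero seed locks the LFSR at zero")
    bits = []
    for _ in range(n_bits):
        state, out = lfsr_step(state, taps, width)
        bits.append(out)
    return bits


def lfsr_period(seed, taps, width):
    """Count shift clocks until the register returns to `seed`.

    Assumes bit 0 is among the taps, so every state lies on a cycle.
    """
    state, steps = seed, 0
    while True:
        state, _ = lfsr_step(state, taps, width)
        steps += 1
        if state == seed:
            return steps


if __name__ == "__main__":
    print(lfsr_pattern(0b0001, (0, 1), 4, 15))
    print(lfsr_period(0b0001, (0, 1), 4))
```

With the assumed 4-bit register and taps, the register cycles through all 15 non-zero states before repeating, which is why a short seed from the LSI tester suffices to produce a long pseudo-random test pattern on the chip.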
Furthermore, each test pattern that has passed through the core-0 block internal scan F/F chain 421, core-1 block internal scan F/F chain 422, and CMP common block internal scan F/F chain 423 is inputted into the MISR test pattern compression circuit 415.
Furthermore, seed data is scanned into the signature storage shift register included within the MISR test pattern compression circuit 415 (seed scan-in), and a shift clock (not illustrated) from said TAP controller 411 is applied, causing said test patterns to be compressed into a signature (nth order bit sequence data), which is outputted to the TAP controller 411 (signature scan-out).
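The compression step can be sketched in software as follows. This is a simplified model under stated assumptions: a 4-bit signature register, feedback taps at bits 0 and 1, and one response word per shift clock with one bit per scan chain. The function name and all parameters are hypothetical and do not describe the actual MISR test pattern compression circuit 415.

```python
def misr_signature(responses, seed=0, taps=(0, 1), width=4):
    """Compress a sequence of parallel response words into one signature.

    Each element of `responses` holds one bit per scan chain for a single
    shift clock. The register shifts with LFSR feedback and then XORs in
    the incoming word, as in a generic MISR model.
    """
    mask = (1 << width) - 1
    state = seed & mask                   # seed scan-in
    for word in responses:
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1        # feedback from the tapped bits
        state = (state >> 1) | (fb << (width - 1))
        state = (state ^ word) & mask     # fold in the scan chain outputs
    return state                          # signature scan-out


if __name__ == "__main__":
    good = [0b101, 0b011, 0b110, 0b001]   # made-up fault-free responses
    bad = [0b101, 0b001, 0b110, 0b001]    # same, with one bit flipped
    print(bin(misr_signature(good)), bin(misr_signature(bad)))
```

In this model a single flipped response bit always changes the signature, because the assumed state update is invertible. Multiple errors can in general cancel (aliasing), which is an inherent property of signature compression.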
The signature of the core-0 block 403, core-1 block 404 and CMP common block 405 inputted into the TAP controller 411 from said MISR test pattern compression circuit 415 is transferred from the TAP controller 411 to the LSI tester (not illustrated) (test data-out) and is compared in the LSI tester to the respective expected value data to analyze the LSI test results. Namely, if the inputted signature of the logic block matches the corresponding expected value data, the test analysis result for that logic block will be ‘pass,’ and if it does not match, the test analysis result for that logic block will be ‘fail.’
The test pattern generation operation in the LFSR test pattern generating circuit 413 and the test pattern compression operation in the MISR test pattern compression circuit 415 are described below with the aid of FIG. 11 and FIG. 12 respectively.
In the present conventional configuration, since the test patterns that have passed through the core-0 block 403, core-1 block 404 and CMP common block 405 respectively are inputted into the same MISR test pattern compression circuit 415, there is a single compressed test pattern for the entire LSI circuit, and a single expected value datum that is compared to that compressed test pattern in the LSI tester for the entire LSI circuit.
Therefore, if the LSI circuit in question is an LSI circuit that contains a plurality of logic blocks, such as a multicore processor, as shown in the conventional configuration disclosed in FIG. 4, then all the test patterns that have passed through all the logic blocks, i.e. through core-0 block 403, core-1 block 404 and CMP common block 405, will be joined together and compressed into a single signature, thus making it difficult to analyze the test results individually for each logic block (the core-0 block 403, core-1 block 404 and CMP common block 405) based on comparison of the single signature against the corresponding expected value data.
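The difficulty can be illustrated with a small sketch in which the outputs of three scan chains feed one shared signature register. The chain names, bit values, register width and taps below are made-up assumptions for illustration; only the structure (all chains compressed into one signature, one expected value) follows the conventional configuration described above.

```python
WIDTH = 4          # assumed signature register width
TAPS = (0, 1)      # assumed feedback taps


def shared_signature(chain_outputs, seed=0):
    """Compress the outputs of ALL scan chains into ONE signature.

    `chain_outputs` maps a chain name to its list of output bits;
    all lists must have the same length.
    """
    names = sorted(chain_outputs)
    n_clocks = len(chain_outputs[names[0]])
    mask = (1 << WIDTH) - 1
    state = seed & mask
    for i in range(n_clocks):
        word = 0
        for pos, name in enumerate(names):
            word |= chain_outputs[name][i] << pos   # one bit per chain
        fb = 0
        for t in TAPS:
            fb ^= (state >> t) & 1
        state = (((state >> 1) | (fb << (WIDTH - 1))) ^ word) & mask
    return state


def verdict(chain_outputs, expected):
    """Tester-side comparison: one pass/fail result for the whole LSI."""
    return "pass" if shared_signature(chain_outputs) == expected else "fail"


if __name__ == "__main__":
    good = {"core0": [1, 0, 1, 1], "core1": [0, 1, 1, 0], "common": [1, 1, 0, 0]}
    bad_core1 = dict(good, core1=[0, 1, 0, 0])      # fault only in core-1
    bad_common = dict(good, common=[0, 1, 0, 0])    # fault only in common block
    expected = shared_signature(good)
    print(verdict(good, expected), verdict(bad_core1, expected),
          verdict(bad_common, expected))
```

A fault confined to one core and a fault in the common block both yield the same whole-LSI 'fail', and the verdict alone does not say which logic block is responsible.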
Furthermore, even assuming it were possible to analyze the test results for each individual logic block based on said single signature, there was still the problem that a defect-free LSI could not be identified until the comparison of all the logic blocks against the expected value data had been completed. Namely, for multicore processors comprising a plurality of cores, there was the problem that the testing costs for partial core defect-free test result analysis were the same as for completely defect-free LSI test result analysis.
Japanese Unexamined Patent Application Publication 2001-74811 discloses the art of building a BIST circuit comprising an LFSR pattern generating circuit and an MISR pattern compression circuit for each circuit module into a semiconductor integrated circuit comprising a plurality of circuit modules, and thereby executing self-tests at the circuit module level. In the BIST circuit comprising an LFSR pattern generating circuit and MISR pattern compression circuit, installed for each circuit module, as illustrated in FIG. 1 and FIG. 10 of Japanese Unexamined Patent Application Publication 2001-74811, the scale of the BIST circuit is reduced at the point of the circuit configuration that connects and isolates the test path between circuit modules.
However, in the configuration disclosed in FIG. 1 of said Japanese Unexamined Patent Application Publication 2001-74811, the BIST circuits of each of the circuit modules are connected in series, while in the configuration disclosed in FIG. 10 of the Japanese Unexamined Patent Application Publication 2001-74811, the scan paths of the circuit modules are not isolated. Thus, for example, when analyzing the results of the MISR pattern compression circuits connected only to the circuit modules required for identification of a partially defect-free LSI, if all the circuit modules other than the circuit modules for which results are to be analyzed are defective, then the scan path of the whole LSI will not function normally, and the MISR pattern compression circuit will therefore also not function normally, making it altogether impossible to implement partially defect-free LSI identification.
In the prior art, as described above, in a processor with a multicore configuration based on CMP or the like, reduced yield due to increased die size was a problem. In this connection, noting the fact that a processor with a multicore configuration comprises a plurality of core blocks and a single CMP common block, a method could be considered whereby a processor could be salvaged as a partially defect-free LSI if one or more core blocks and the CMP common block were functioning normally. This method would mean that an LSI that was not fully defect-free could be salvaged as a partially defect-free LSI with a configuration capable of functioning as a processor and thus could be marketed for use as a single processor for entry-level models or the like. Namely, productizing partially defect-free LSIs that conventionally would have been disposed of would make it possible to provide differentiation in terms of performance and cost within a lineup having the same processor architecture. However, in the prior art, identifying a partially defect-free LSI at the time of manufacturing required collection and analysis of scan data for all scan points, etc., just as for completely defect-free LSIs, so the analysis was complicated and the LSI test was costly and time-consuming.
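The salvage criterion considered above (one or more core blocks and the CMP common block functioning normally) can be written as a short decision sketch. It assumes that a per-block pass/fail result is somehow available for each die, which is precisely what the prior art makes costly to obtain; the function name and category strings are hypothetical.

```python
def classify_lsi(core_ok, common_ok):
    """Classify a die from per-block test results (hypothetical inputs).

    core_ok:   list of booleans, one per core block (True = pass)
    common_ok: boolean result for the CMP common block
    """
    if not common_ok or not any(core_ok):
        # Without the common block or at least one working core,
        # the die cannot function as a processor.
        return "reject"
    if all(core_ok):
        return "completely defect-free"
    return "partially defect-free"


if __name__ == "__main__":
    print(classify_lsi([True, True], True))    # both cores usable
    print(classify_lsi([True, False], True))   # salvageable with one core
    print(classify_lsi([True, True], False))   # common block defective
```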