1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to an efficient method of loading values into a computer system as part of an initialization (bootstrap) routine.
2. Description of the Related Art
The basic structure of a conventional symmetric multi-processor computer system 10 is shown in FIG. 1. Computer system 10 has one or more processing units arranged in one or more processor groups; in the depicted system, there are four processing units 12a, 12b, 12c and 12d in processor group 14. The processing units communicate with other components of system 10 via a system or fabric bus 16. Fabric bus 16 is connected to one or more service processors 18a, 18b, a system memory device 20, and various peripheral devices 22. A processor bridge 24 can optionally be used to interconnect additional processor groups. System 10 may also include firmware (not shown) which stores the system's basic input/output logic, and seeks out and loads an operating system from one of the peripherals whenever the computer system is first turned on (booted).
System memory device 20 (random access memory or RAM) stores program instructions and operand data used by the processing units, in a volatile (temporary) state. Peripherals 22 may be connected to fabric bus 16 via, e.g., a peripheral component interconnect (PCI) local bus using a PCI host bridge. A PCI bridge provides a low-latency path through which processing units 12a, 12b, 12c and 12d may access PCI devices mapped anywhere within bus memory or I/O address spaces. The PCI host bridge interconnecting peripherals 22 also provides a high-bandwidth path to allow the PCI devices to access RAM 20. Such PCI devices may include a network adapter, a small computer system interface (SCSI) adapter providing interconnection to a permanent storage device (i.e., a hard disk), and an expansion bus bridge such as an industry standard architecture (ISA) expansion bus for connection to input/output (I/O) devices including a keyboard, a graphics adapter connected to a display device, and a graphical pointing device (mouse) for use with the display device. The service processors can alternatively reside in a modified PCI slot which includes a direct memory access (DMA) path.
In a symmetric multi-processor (SMP) computer, all of the processing units 12a, 12b, 12c and 12d are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. As shown with processing unit 12a, each processing unit may include one or more processor cores 26a, 26b which carry out program instructions in order to operate the computer. An exemplary processor core includes the Power5™ processor marketed by International Business Machines Corp. which comprises a single integrated circuit superscalar microprocessor having various execution units, registers, buffers, memories, and other functional units, which are all formed by integrated circuitry. The processor cores may operate according to reduced instruction set computing (RISC) techniques, and may employ both pipelining and out-of-order execution of instructions to further improve the performance of the superscalar architecture.
Each processor core 26a, 26b includes an on-board (L1) cache (typically, separate instruction and data caches) implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory 20. A processing unit can include another cache, such as a second level (L2) cache 28 which, along with a memory controller 30, supports both of the L1 caches that are respectively part of cores 26a and 26b. Additional cache levels may be provided, such as an L3 cache 32 which is accessible via fabric bus 16. Each cache level, from highest (L1) to lowest (L3), can successively store more information, but at a longer access penalty. For example, the on-board L1 caches in the processor cores might have a storage capacity of 128 kilobytes of memory, L2 cache 28 might have a storage capacity of 4 megabytes, and L3 cache 32 might have a storage capacity of 32 megabytes. To facilitate repair/replacement of defective processing unit components, each processing unit 12a, 12b, 12c, 12d may be constructed in the form of a replaceable circuit board, pluggable module, or similar field replaceable unit (FRU), which can be easily installed in or swapped out of system 10 in a modular fashion.
FIG. 2 illustrates further features of the cache architecture of a processor 12 in system 10, including an L1 cache 34, an L2 cache store queue 36, an L2 cache array 38, and an L2 directory array 40. As processor core 26 completes store instructions, the output data are stored in L1 cache 34 and are also sent to L2 store queue 36. L2 store queue 36 may, for example, have 32 entries (128-byte cache lines). L2 store queue 36 includes read/claim (RC) dispatch logic 42 which issues queue entries to one of several RC machines 44. RC machines 44 determine whether a requested cache line is already allocated in L2 cache array 38 (a cache hit) by reading directory array 40, and obtain ownership of any memory block as necessary to gain write permission for the cache line.
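The hit determination performed by the RC machines can be illustrated with a minimal sketch. It assumes a set-associative directory keyed on 128-byte cache lines; the class name `Directory`, the number of sets, the associativity, and the first-in-first-out replacement are all hypothetical choices made for illustration and are not taken from the system described above. Ownership and write-permission handling are not modeled.

```python
LINE_BYTES = 128  # cache line size given for the L2 store queue entries


class Directory:
    """Toy model of a directory array: one list of tags per set."""

    def __init__(self, num_sets=1024, ways=8):
        # num_sets and ways are assumed values, not from the source.
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]

    def _split(self, address):
        """Split a byte address into (set index, tag)."""
        line = address // LINE_BYTES
        return line % self.num_sets, line // self.num_sets

    def is_hit(self, address):
        """Return True if the line holding this address is already allocated."""
        index, tag = self._split(address)
        return tag in self.sets[index]

    def allocate(self, address):
        """Record the line as allocated, evicting the oldest tag if the set is full."""
        index, tag = self._split(address)
        tags = self.sets[index]
        if tag not in tags:
            if len(tags) >= self.ways:
                tags.pop(0)  # FIFO eviction, an assumption of this sketch
            tags.append(tag)


directory = Directory(num_sets=4, ways=2)
directory.allocate(0x1000)
hit_same_line = directory.is_hit(0x1000 + 127)   # same 128-byte line
hit_next_line = directory.is_hit(0x1000 + 128)   # next line, never allocated
```

In this sketch any address inside an allocated 128-byte line reports a hit, while the first byte of the following line reports a miss.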
FIG. 2 also depicts pervasive logic 46 which is embedded in the integrated circuit (IC) chip comprising processing unit 12. Pervasive logic 46 is used to load (scan) information into various components of system 10, including L2 cache 28. Pervasive logic 46 communicates with a service processor 18 via an interface port 48. Interface port 48 may be a JTAG port that complies with the Institute of Electrical and Electronics Engineers (IEEE) standard 1149.1 pertaining to a test access port and boundary-scan architecture, having two data lines TDI and TDO (the test data input and output ports), a test mode select line TMS, a test clock line TCK, and a reset line TRST. JTAG interface 48 is connected to a scan communications (SCOM) controller 50 and a scan ring controller 52.
SCOM controller 50 allows the service processor to access “satellites” located in the clock-controlled (functional) components of system 10 during normal operation of the system. These SCOM satellites have internal control and error registers which can be used to enable various functions in the components. SCOM is an extension to the JTAG protocol, allowed by IEEE standard 1149.1, that provides read and write access to the internal registers while leaving system clocks running. SCOM controller 50 is connected to a scan clock 54 and a functional clock 56. SCOM controller 50 may include appropriate logic to arbitrate between JTAG interface 48 and any assembly code commands from the processor cores. SCOM controller 50 may alternatively be connected to an external SCOM interface which provides chip-to-chip communications without requiring the involvement of the service processor.
Scan ring controller 52 provides the normal JTAG scan function in a level sensitive scan design (LSSD) to set the internal latch state of the system with functional clocks stopped. The internal latches are used for testing or initializing components. There may be tens of thousands of latches in a scan ring. In particular, the scan ring can include an L2 cache data input latch 58, an L2 cache control input latch 60, a directory data input latch 62, and a directory control input latch 64. These four latches are used to load appropriate program instructions and operand data into the L2 cache in order to bootstrap the system. The scan input port of cache array 38 is typically 64 bytes wide. A single scan input port is utilized, as it is not feasible to provide latches for every entry in the cache array.
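The serial nature of a scan ring can be illustrated with a short sketch that models the ring as a shift register: each scan-clock pulse moves every latch value one position toward the output and admits one new bit at the input. The class `ScanRing` and its method names are hypothetical, and the four-latch ring is far shorter than a real ring; the sketch is meant only to show why a value must pass through every intervening latch before it reaches its destination.

```python
from collections import deque


class ScanRing:
    """Toy scan ring: a fixed-length chain of one-bit latches."""

    def __init__(self, length):
        # Index 0 is the latch nearest the scan input (TDI side);
        # index -1 is the latch nearest the scan output (TDO side).
        self.latches = deque([0] * length, maxlen=length)

    def pulse_scan_clock(self, tdi_bit):
        """Shift the ring by one latch and return the bit that falls out."""
        tdo_bit = self.latches[-1]
        # With maxlen set, appendleft discards the rightmost element.
        self.latches.appendleft(tdi_bit)
        return tdo_bit

    def scan_in(self, bits):
        """Shift in a sequence of bits, returning the bits shifted out."""
        return [self.pulse_scan_clock(b) for b in bits]


ring = ScanRing(4)
first_out = ring.scan_in([1, 0, 1, 1])   # original zeros come out
state = list(ring.latches)               # first bit in sits nearest the output
second_out = ring.scan_in([0, 0, 0, 0])  # previously loaded bits come out in order
```

In this model, filling a ring of N latches takes N scan-clock pulses regardless of how few of those latches actually feed the cache array.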
Test software 66 running outside system 10 under the control of a user 68 can be connected to service processor 18 via, e.g., an Ethernet connection, and used to control a test procedure. In a typical test procedure involving the L2 cache, all functional clocks of processing unit 12 are stopped by SCOM controller 50 while the values for a partial cache line (64 bytes) are scanned into latches 58, 60, 62 and 64 by pulsing scan clock 54 under the control of scan ring controller 52. Once the scan-in is complete, SCOM controller 50 pulses functional clock 56 to move the values from the latches into the functional components such as L2 cache array 38 and L2 directory array 40. Testing begins after these steps have been repeated such that all of the cache lines have been so loaded into L2 cache array 38 and L2 directory array 40.
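The load sequence just described (stop the functional clocks, scan in 64 bytes, pulse the functional clock, repeat) can be sketched as a simple loop. The class `L2ScanLoader`, the function `load_image`, and the representation of the cache array as a dictionary and the directory as a set of line addresses are hypothetical simplifications; the four separate latches are collapsed into one data value and one control value, and the bit-serial scan itself is abstracted into a single call.

```python
SCAN_PORT_BYTES = 64   # partial cache line scanned in per cycle
LINE_BYTES = 128       # full cache line size


class L2ScanLoader:
    """Toy model of the conventional scan-based L2 load procedure."""

    def __init__(self):
        self.functional_clock_running = True
        self.data_latch = None      # stands in for the data input latches
        self.control_latch = None   # stands in for the control input latches
        self.cache_array = {}       # byte offset -> 64-byte chunk
        self.directory = set()      # base addresses of lines present
        self.scan_cycles = 0

    def stop_functional_clocks(self):
        self.functional_clock_running = False

    def scan_in(self, offset, chunk):
        """Load the input latches; only legal with functional clocks stopped."""
        if self.functional_clock_running:
            raise RuntimeError("functional clocks must be stopped before scanning")
        self.data_latch = chunk
        self.control_latch = offset
        self.scan_cycles += 1

    def pulse_functional_clock(self):
        """Move the latched values into the cache array and directory."""
        self.cache_array[self.control_latch] = self.data_latch
        self.directory.add(self.control_latch // LINE_BYTES * LINE_BYTES)


def load_image(loader, image):
    """Scan an image into the cache 64 bytes at a time; return cycles used."""
    loader.stop_functional_clocks()
    for offset in range(0, len(image), SCAN_PORT_BYTES):
        chunk = image[offset:offset + SCAN_PORT_BYTES]
        chunk = chunk.ljust(SCAN_PORT_BYTES, b"\x00")  # pad the final chunk
        loader.scan_in(offset, chunk)
        loader.pulse_functional_clock()
    return loader.scan_cycles


loader = L2ScanLoader()
cycles_used = load_image(loader, bytes(range(200)))  # 200-byte example image
```

Because each iteration handles only one 64-byte chunk, the number of iterations grows linearly with the size of the image being loaded.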
Setting up the L2 cache for testing in this manner can be very time consuming. As computer systems become more complex, it is necessary to perform more exhaustive testing to ensure that all of the system components are properly functioning. Test software 66 may accordingly take the form of an operating system such as Linux which is quite large, e.g., two megabytes or more of program code. Since only 64 bytes of useful data can be scanned in per cycle, it takes about 50,000 scan cycles to fully scan the operating system into L2 cache 28, and each of these cycles requires waiting for the particular 64-byte word to propagate through thousands of latches before the next 64 bytes can be sent, resulting in a significant delay before testing or initialization can commence. It would, therefore, be desirable to devise a more efficient method of loading bootstrap or test values into a cache. It would be further advantageous if the method could take advantage of existing structures in the cache architecture to reduce any additional overhead.
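The scale of the delay can be checked with a short calculation. The sketch below assumes binary megabytes (1,048,576 bytes) and assumes that each scan cycle requires one scan-clock pulse per latch in the ring; the ring length of 20,000 latches is a hypothetical value chosen only to match the "tens of thousands" order of magnitude mentioned earlier.

```python
MIB = 1024 * 1024  # binary megabyte, an assumption of this sketch


def scan_cycles(image_bytes, bytes_per_cycle=64):
    """Scan cycles needed to load an image, rounding up to a whole cycle."""
    return -(-image_bytes // bytes_per_cycle)  # ceiling division


def total_scan_clock_pulses(image_bytes, ring_length, bytes_per_cycle=64):
    """Scan-clock pulses if every cycle shifts the full ring once (assumed)."""
    return scan_cycles(image_bytes, bytes_per_cycle) * ring_length


cycles_2mib = scan_cycles(2 * MIB)   # 32768
cycles_3mib = scan_cycles(3 * MIB)   # 49152
pulses = total_scan_clock_pulses(3 * MIB, ring_length=20_000)  # 983040000
```

Under these assumptions, a two-megabyte image needs 32,768 scan cycles and an image of roughly three megabytes needs about 50,000, which is consistent with the figure quoted above for an image of "two megabytes or more"; with a 20,000-latch ring, the latter corresponds to nearly a billion scan-clock pulses.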