1. Field of the Invention
The present invention generally relates to data processing system architecture technology. More specifically, the present invention relates to a data switching device with a bandwidth management unit to reduce system data traffic between the processor and the system controller in a data processing system while performing vector-calculation operations, such as vector product operations, and the processing method employed by the data switching device.
2. Description of the Related Art
The primary value of data processing systems resides in their computing power. This computing power is useful in engineering, statistics, scientific research, and many other fields. For example, engineers use computing power to solve high-order polynomial equations, or to simulate the stress (force) distribution of an aircraft or a sailing vessel. Because most applications require a large number of computing steps, data processing systems need to quickly retrieve data to be processed and output the result of the operation. Therefore, the efficiency of data transfer is a critical factor in computing performance.
FIG. 1 (Prior Art) is a block diagram of a part of a typical data processing system, such as a computer system. FIG. 1 shows only the components of the data processing system that are required to perform a mathematical operation. As shown in FIG. 1, the data processing system comprises processor 10, system controller 20, main memory 30, peripheral device(s) 40 and cache memory 50. Co-processor 10a is an optional component, which is used to help processor 10 perform special mathematical operations, such as floating-point operations. The functions of these components are described as follows.
Processor 10 is the processing center of the data processing system, which receives instructions and sequentially executes them. In addition, processor 10 usually includes several embedded registers (not shown) that store the data to be processed and the operation result, and which serve to reduce the number of times it is necessary to communicate with external data sources. System bus 60 is connected between processor 10 and system controller 20.
System controller 20 is a bridge device for interfacing between processor 10 and other components in the data processing system, such as main memory 30 and peripheral devices 40. The main functions of system controller 20 are to manage the main memory (typically implemented by Dynamic Random Access Memories, or DRAM) and to interface between the system bus and a peripheral bus (such as a Peripheral Component Interconnect bus, or PCI bus). Briefly, the memory management function of system controller 20 comprises transferring information, such as program code and data code, between processor 10 and main memory 30. In addition, system controller 20 controls peripheral devices 40, such as the input/output devices. For example, a multimedia system among peripheral devices 40 displays the result of the desired operation. The interface function of system controller 20 is not relevant to the subject of the present invention and will not be further discussed.
Cache memory 50 and optional co-processor 10a, both of which are located in proximity to processor 10, provide processor 10 with additional assistance. Cache memory 50, typically implemented by Static Random Access Memories (SRAM), serves as a buffer space for temporarily storing the input/output data of processor 10. As described above, processor 10 includes only a limited number of embedded registers and therefore cannot pre-load all the program code that is ready to be executed. If processor 10 were required to load the program/data code instruction-by-instruction at the time of execution, it is clear that the computing speed of processor 10 would decrease. Using cache memory 50 as a buffer allows processor 10 to execute instructions without the interruptions resulting from accessing external program/data code.
Co-processor 10a, as described above, provides additional calculation functions that are not implemented by hardware in processor 10. For example, some co-processors provide processors with floating-point calculation functions, which otherwise would be fulfilled by software. Basically, co-processor 10a operates under the control of processor 10 (i.e., co-processor 10a receives operation code and data code related to the floating-point operation from processor 10) and cannot work independently. Today, many of the additional functions previously provided by co-processors have been merged into processors. Nevertheless, the architecture of a modern multi-processor system is similar to that of a processor/co-processor system, although more complicated.
According to the above description, the process for performing a mathematical operation in the data processing system as shown in FIG. 1 is briefly described as follows. In the following example, the operands (data ready to be processed) are stored in main memory 30. After receiving an instruction for adding operand X to operand Y, processor 10 issues a read request for reading the data X and Y to system controller 20 through system bus 60. System controller 20 reads out the data X and Y stored in main memory 30 in response to the read request received from processor 10 and sends the data back to processor 10 through system bus 60. After finishing the addition operation, processor 10 then issues a write request for writing the addition result to main memory 30. This write request is also transferred by system bus 60. Finally, system controller 20 receives the write request and writes the addition result to a destination location in main memory 30. The addition operation is then complete.
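By way of illustration only, the addition sequence described above can be sketched in Python as follows. The class and function names (MainMemory, SystemController, add_operands), the addresses and the operand values are hypothetical and are not taken from FIG. 1. The sketch merely counts the messages that cross system bus 60 under the assumption that both operands are fetched with a single read request.

```python
class MainMemory:
    """Main memory 30, modeled as an address -> value mapping (assumption)."""

    def __init__(self, contents):
        self.cells = dict(contents)


class SystemController:
    """System controller 20: services requests arriving over system bus 60."""

    def __init__(self, memory):
        self.memory = memory
        self.bus_transfers = 0  # messages that crossed system bus 60

    def read(self, addresses):
        self.bus_transfers += 1  # read request: processor -> controller
        data = [self.memory.cells[a] for a in addresses]
        self.bus_transfers += 1  # operand data: controller -> processor
        return data

    def write(self, address, value):
        self.bus_transfers += 1  # write request carrying the result
        self.memory.cells[address] = value


def add_operands(controller, addr_x, addr_y, addr_result):
    """Processor 10: fetch X and Y, add them, write the sum back."""
    x, y = controller.read([addr_x, addr_y])
    result = x + y
    controller.write(addr_result, result)
    return result


# Hypothetical addresses and operand values.
memory = MainMemory({0x10: 3, 0x14: 4})
controller = SystemController(memory)
total = add_operands(controller, 0x10, 0x14, 0x18)
```

Even in this minimal model, a single scalar addition causes three transfers over system bus 60: the read request, the returned operands, and the write request.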
It is evident that system bus 60 is quite busy. In the above calculation, processor 10 issues, through system bus 60, the read request containing the addressing information of operands X and Y, and the write request containing the result data and the addressing information of the result data. In fact, the data traffic of system bus 60 is heavier than that of other buses. As described above, system controller 20 is electrically coupled to, and transfers data between, processor 10, main memory 30, the PCI bus and peripheral devices 40 (such as a graphics subsystem). Therefore, data from various sources that is ready to be processed is transferred to processor 10 through system bus 60, thereby increasing the data traffic on system bus 60. System bus 60 can thus be described as a bottleneck in system performance. Many methods have been proposed to solve this problem. For example, the data processing system can use the Direct Memory Access (DMA) technique so that the graphic data required by the display system bypasses processor 10, and can add a controller to directly control the operation of the peripheral devices. However, information associated with mathematical operations must pass through system bus 60 (in order to be processed by processor 10) and cannot be rerouted to other components. Mathematical operations requiring a large amount of data, such as vector or matrix operations, have an especially great impact on the traffic load of system bus 60.
FIG. 2 (Prior Art) is a data flow diagram showing the flow of data between processor 10, system controller 20 and main memory 30 during a vector multiplication operation. In FIG. 2, the data (request or control signals) sequence is denoted by symbols 1a through 1k. FIG. 2 only depicts the components relevant to this calculation process, i.e. processor 10, system controller 20 and main memory 30.
The operation illustrated in FIG. 2 is a calculation of the inner product of vector X and vector Y (that is, $X \cdot Y$), wherein $X = (x_1, x_2, \ldots, x_n)$, $Y = (y_1, y_2, \ldots, y_n)$ and n represents the dimension of vectors X and Y. As shown in FIG. 2, a vector-calculation instruction 1a, which indicates the operation of $X \cdot Y$, is first sent to processor 10. After accepting vector-calculation instruction 1a, processor 10 begins to retrieve the data of vectors X and Y and execute the vector multiplication operation.
First, processor 10 must retrieve the data of vector X. Processor 10 sends a read request 1b containing addressing information for the data of vector X to system controller 20. Then system controller 20 produces control signal 1c to access main memory 30 according to the addressing information contained in read request 1b. Data 1d, corresponding to the elements of vector X, are transmitted from main memory 30 to system controller 20. Then system controller 20 returns data 1e to processor 10. Note that data 1d and 1e each contain at least n numbers, which correspond to vector elements $x_1$ through $x_n$.
Processor 10 employs the same method to retrieve the data of vector Y. Request 1f contains addressing information for the data of vector Y. System controller 20 produces control signal 1g in response to the addressing information in request 1f, thereby accessing main memory 30. Data 1h and 1i contain the elements of vector Y, that is, elements $y_1$ through $y_n$. Processor 10 has thereby acquired the data of both vectors X and Y.
After performing the inner product operation and obtaining the result $R = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$, processor 10 stores the result R in the destination location to which the vector-calculation instruction 1a refers. In this case, the destination location is in main memory 30. Therefore, processor 10 generates a write request 1j containing the inner product result R and transmits it to system controller 20. System controller 20, by means of the control signal 1k, writes the inner product result R to the destination location in main memory 30. The inner product operation is then complete.
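For illustration, the sequence 1a through 1k of FIG. 2 can be traced with the following Python sketch. The function name inner_product_flow, the dictionary model of main memory 30, the word addresses and the vector values are all hypothetical; the sketch assumes that each vector occupies n consecutive addresses and is not intended to represent any particular hardware implementation.

```python
def inner_product_flow(memory, addr_x, addr_y, addr_r, n):
    """Trace signals 1a-1k for R = x_1*y_1 + ... + x_n*y_n.

    `memory` is a dict standing in for main memory 30 (assumption);
    each vector is assumed to occupy n consecutive addresses.
    """
    trace = [("1a", "vector-calculation instruction sent to processor 10")]

    # Retrieve vector X (1b-1e).
    trace.append(("1b", "processor 10 -> controller 20: read request for X"))
    trace.append(("1c", "controller 20 -> memory 30: control signal"))
    x = [memory[addr_x + i] for i in range(n)]
    trace.append(("1d", f"memory 30 -> controller 20: {n} elements of X"))
    trace.append(("1e", f"controller 20 -> processor 10: {n} elements of X"))

    # Retrieve vector Y by the same method (1f-1i).
    trace.append(("1f", "processor 10 -> controller 20: read request for Y"))
    trace.append(("1g", "controller 20 -> memory 30: control signal"))
    y = [memory[addr_y + i] for i in range(n)]
    trace.append(("1h", f"memory 30 -> controller 20: {n} elements of Y"))
    trace.append(("1i", f"controller 20 -> processor 10: {n} elements of Y"))

    # Processor 10 computes the inner product.
    r = sum(xi * yi for xi, yi in zip(x, y))

    # Write the result back (1j-1k).
    trace.append(("1j", "processor 10 -> controller 20: write request with R"))
    trace.append(("1k", "controller 20 -> memory 30: control signal, write R"))
    memory[addr_r] = r
    return r, trace


# Hypothetical memory image: X = (1, 2, 3) at address 0, Y = (4, 5, 6) at 10.
memory = {0: 1, 1: 2, 2: 3, 10: 4, 11: 5, 12: 6}
result, trace = inner_product_flow(memory, addr_x=0, addr_y=10, addr_r=20, n=3)
for label, description in trace:
    print(label, description)
```

In this trace, steps 1e and 1i are the ones that carry the full vector data across system bus 60 to processor 10.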
In the vector inner product calculation process described above, the operation speed is determined by the computing power of processor 10 and by the transmission speed of vectors X and Y. In this case, regardless of the computing power provided by processor 10, a large amount of data (at least 2n numbers corresponding to vectors X and Y) must be transmitted through system bus 60. Because this vector data requires a certain amount of time to flow through system bus 60, system bus 60 becomes bottlenecked, as described above. As a result, the time spent transmitting the data of vectors X and Y has a significant impact on the performance of the data processing system.
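The dependence of system bus traffic on the vector dimension n can be made concrete with a rough sketch. The function names, the word size and the bus rate below are hypothetical parameters introduced only for illustration. The sketch counts only the minimum data words named above (the 2n operand words plus the single result word) and ignores request and addressing overhead.

```python
def system_bus_words(n):
    """Minimum data words crossing system bus 60 for one inner product."""
    operand_words = 2 * n  # data 1e (vector X) plus data 1i (vector Y)
    result_words = 1       # result R carried by write request 1j
    return operand_words + result_words


def transfer_time_seconds(n, word_bytes, bus_bytes_per_second):
    """Idealized transfer time; both rate parameters are assumed inputs."""
    return system_bus_words(n) * word_bytes / bus_bytes_per_second


# Traffic grows linearly with n, independent of processor speed.
for n in (10, 1000, 100000):
    print(n, system_bus_words(n))
```

Under these assumptions the transfer time scales linearly with n, so a faster processor 10 alone does not shorten the time spent moving vectors X and Y over system bus 60.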