Computer architecture refers to the physical structure and interconnections of the registers, arithmetic and logic units, control units, and other hardware within a computer. All computers have at least one processor, and more complex computers, such as servers, have many processors working together. At least two kinds of memory devices are also associated with the computer: an internal volatile memory, called random access memory, which is erased when the computer is turned off; and an external memory, called a hard drive, which permanently stores the programs, also called applications, to be executed by a processor when called. There are also a number of peripheral devices such as monitors, Internet connections, keypads, mice, other pointing devices, other optical and magnetic drives, connections to other computers, etc.
A processing element of a computer retrieves information in the form of applications, programs, or data from the external memory into an internal memory. When data and/or instructions are needed for the application, the processing element may retrieve the data/instructions from internal memory to its registers for arithmetic and logical processing. As processing speeds have increased, computer architects have directed research and development toward keeping the processor occupied and its registers filled for the next operation. One of many approaches taken by computer architects has been to minimize the time required to retrieve data/instructions from external and internal memory into the processor's registers. Incorporating smaller high speed memory units called caches nearer the processor is an implementation of this approach. These caches, moreover, may be hierarchical, meaning that a level one (L1) cache is nearest to the processing element and is very fast, so that it may be accessed in only one or very few processing cycles. There may be an L1 cache for instructions and a different L1 cache for data. There also may be level two (L2) and/or level three (L3) caches, with the higher number denoting a larger, more distant, and perhaps slower cache that is still closer and faster than either internal or external memory. Thus, when a processing element needs data/instructions that are not readily available in its registers, it accesses its nearest cache by generating a control signal to access the cache directory and the data array in which the data is actually stored.
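By way of illustration only, the nearest-first lookup through a cache hierarchy may be sketched as follows. The class `CacheLevel`, the function `load`, and all latency figures are hypothetical and chosen for the example; cache capacity, eviction, and the directory/data-array split are deliberately omitted.

```python
from dataclasses import dataclass, field


@dataclass
class CacheLevel:
    name: str
    latency_cycles: int                        # hypothetical access cost
    lines: dict = field(default_factory=dict)  # address -> data


def load(address, levels, memory, memory_latency=200):
    """Search the nearest cache first; return (data, total_cycles)."""
    cycles = 0
    for hit_index, level in enumerate(levels):
        cycles += level.latency_cycles
        if address in level.lines:
            data = level.lines[address]
            break
    else:
        # Missed every cache level: fall back to internal memory.
        cycles += memory_latency
        data = memory[address]
        hit_index = len(levels)
    # Fill every level nearer than the one that supplied the data.
    for nearer in levels[:hit_index]:
        nearer.lines[address] = data
    return data, cycles


levels = [CacheLevel("L1", 1), CacheLevel("L2", 10), CacheLevel("L3", 40)]
memory = {0x100: "payload"}
print(load(0x100, levels, memory))  # ('payload', 251): misses L1, L2, L3
print(load(0x100, levels, memory))  # ('payload', 1): now an L1 hit
```

The second access costs a single cycle because the first access left a copy in each cache level, which is the effect the hierarchy is intended to produce.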
Computer architectures come in a myriad of arrangements today wherein multiple processors may share caches and/or memory. A processor's memory may be distributed in that each processing element may be connected on an internal bus to a local memory subsystem with unique addresses. The local memory of another processing element has different addresses, so that the processing elements may access each other's local memory, at the addresses held in that particular local memory, over some interconnect fabric.
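A minimal sketch of such a distributed arrangement is given below, assuming each local memory subsystem owns one contiguous, non-overlapping address range. The names `LocalMemory` and `route` and the address ranges are hypothetical and serve only to show how an address determines whether an access stays on the internal bus or crosses the interconnect fabric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalMemory:
    owner: int  # id of the processing element attached to this memory
    base: int   # first address owned by this memory subsystem
    size: int   # number of addresses owned

    def holds(self, address: int) -> bool:
        return self.base <= address < self.base + self.size


def route(address, requester, memories):
    """Return (owning processing element, path used to reach the memory)."""
    for mem in memories:
        if mem.holds(address):
            path = "internal bus" if mem.owner == requester else "interconnect fabric"
            return mem.owner, path
    raise LookupError(f"address {address:#x} is not mapped")


memories = [LocalMemory(0, 0x0000, 0x1000), LocalMemory(1, 0x1000, 0x1000)]
print(route(0x0800, 0, memories))  # (0, 'internal bus')
print(route(0x1800, 0, memories))  # (1, 'interconnect fabric')
```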
Managing data in caches has become a science in and of itself. There is always a cache management scheme, an example of which is that the most recently used (MRU) data and/or instructions are stored in the nearest cache. When the nearest cache gets full, the oldest data/instructions may spill over to fill the next cache, and so on. There are other cache management schemes. Caches, moreover, may be accessed by different processing elements, so that the same data/instructions, whether accessed by different processing elements or held within different caches, must be checked before use to determine if the data is valid. For instance, if processing element 1 has data in its cache and processing element 2 is executing an operation to change that data, then processing element 1 should wait until processing element 2 has completed its manipulation to guarantee that processing element 1 will not access stale data. Maintaining valid data/instructions in the various caches is accomplished by a cache coherency scheme, an example of which is MESI. Each entry in a cache is tagged to indicate its state, i.e., whether the data/instruction is Modified, Exclusive, Shared, or Invalid, hence MESI. Modified data has been changed by the processing element holding it and no longer matches memory, so another processing element should wait until the modification is complete and written back before using the data. Exclusive data is held only in the cache of one processing element, which has exclusive control of the data. Shared data may be held in the caches of several processing elements. Invalid data should not be used by any processing element. There are many cache coherency schemes; the MESI protocol above is only one example.
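The state tagging described above may be illustrated with a simplified sketch that tracks a single cache line across the caches of several processing elements. It follows the commonly taught MESI transitions rather than any particular implementation; the class names `State` and `Line` are hypothetical, and bus transactions and the actual writeback of Modified data are reduced to comments.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Line:
    """One cache line as seen by the caches of n processing elements."""

    def __init__(self, n_caches):
        self.states = [State.INVALID] * n_caches

    def read(self, pe):
        if self.states[pe] is not State.INVALID:
            return  # read hit: no state change
        holders = [i for i, s in enumerate(self.states)
                   if i != pe and s is not State.INVALID]
        for i in holders:
            # A Modified holder would write the line back here before sharing.
            self.states[i] = State.SHARED
        self.states[pe] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, pe):
        # Every other copy becomes stale and is tagged Invalid.
        for i in range(len(self.states)):
            if i != pe:
                self.states[i] = State.INVALID
        self.states[pe] = State.MODIFIED


line = Line(2)
line.read(0)   # element 0: Exclusive, element 1: Invalid
line.read(1)   # both elements: Shared
line.write(1)  # element 0: Invalid, element 1: Modified
print([s.value for s in line.states])  # ['I', 'M']
```

After the write, processing element 0 holds an Invalid copy and must fetch the line again before use, which is how the scheme prevents access to stale data.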
A key problem in processing any Shared data is the number of times the data must be copied during processing. The greater the number of copies that need to be made for multiple processing elements, the more memory bandwidth is consumed and the greater the latency of processing. Memory bandwidth and latency of processing are critical performance variables in many applications.
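The relationship between copies and bandwidth can be made concrete with a simple estimate, assuming every copy moves the full data once across the memory system. The function name and the figures (a 64-byte header, one million packets per second) are hypothetical and are not taken from any measured system.

```python
def copy_traffic_bytes_per_second(data_bytes, copies, packets_per_second):
    """Bytes moved per second when each packet's data is copied `copies` times."""
    return data_bytes * copies * packets_per_second


# Hypothetical workload: 64-byte headers at one million packets per second.
three_copies = copy_traffic_bytes_per_second(64, 3, 1_000_000)
one_copy = copy_traffic_bytes_per_second(64, 1, 1_000_000)
print(three_copies)             # 192000000
print(one_copy)                 # 64000000
print(three_copies - one_copy)  # 128000000 bytes per second saved
```

Under these assumptions the traffic scales linearly with the number of copies, so each copy eliminated frees a proportional share of memory bandwidth.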
A typical system of processing elements and accessible memory units is shown in FIG. 1. Typically, data is transferred in packets, also called cells. Each packet or cell may have a header and a body as determined by the protocol of the data transfer method and mechanism. For instance, in asynchronous transfer mode (ATM), a cell consists of 53 octets or bytes, in which the first five bytes contain header information and the remaining forty-eight bytes contain the body, also called the payload or data. The header may contain such information as an identifier or address of the next destination and/or the sender of the packet; the type of payload associated with the header, e.g., whether the payload is user data or control data, or is of string or integer type; an error control check; a priority check; the nature of the request associated with the data, e.g., whether the request is a “ping”, a “query”, or a “reply”; etc. If the data is too large to be transmitted in a single packet, it is split into packets of convenient size, each with a unique packet header that enables the packets to be reassembled at the receiving end.
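The cell layout and the splitting of large data may be sketched as follows. The 53/5/48 byte sizes are those stated above; the functions `split_cell`, `segment`, and `reassemble` are hypothetical, and the per-packet header is reduced to a bare sequence number for illustration, which is not the actual ATM header format.

```python
ATM_CELL_SIZE = 53     # octets per cell
ATM_HEADER_SIZE = 5    # first five octets
ATM_PAYLOAD_SIZE = 48  # remaining forty-eight octets


def split_cell(cell: bytes):
    """Separate one 53-octet cell into (header, payload)."""
    if len(cell) != ATM_CELL_SIZE:
        raise ValueError(f"expected {ATM_CELL_SIZE} octets, got {len(cell)}")
    return cell[:ATM_HEADER_SIZE], cell[ATM_HEADER_SIZE:]


def segment(data: bytes, payload_size: int = ATM_PAYLOAD_SIZE):
    """Split data into (sequence_number, chunk) pairs of at most payload_size."""
    offsets = range(0, len(data), payload_size)
    return [(seq, data[off:off + payload_size]) for seq, off in enumerate(offsets)]


def reassemble(packets):
    """Rebuild the original data, tolerating out-of-order arrival."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(chunk for _, chunk in ordered)


data = bytes(range(100))
packets = segment(data)
print([len(chunk) for _, chunk in packets])  # [48, 48, 4]
print(reassemble(reversed(packets)) == data)  # True
```

The sequence number plays the role of the unique packet header: it is what allows the receiving end to restore the original order.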
With reference to FIG. 1, a packet comprising a packet header 122 and body 124 is received along path 1 into a packet memory 120 through a network interface and receive logic 110 of a typical packet receiver. A packet is synonymous with a frame, both of which may consist of multiple data cells. The frame header 122 is pulled from memory 120 into a bridge services processor 130, such as an input/output processor, for memory translations along path 2. The local processing element 150 stores the header data 122 in its cache 140 from local memory 120 for examination and/or modification along path 4. If and when all parts have been received, the header 122 and body 124 are concatenated and decoded as a normal packet in the processing element 150 along path 5. The body 124 may be sent to the next processing element's memory subsystem 180 along path 3, depending on the application.
The packet header modifications are completed and a writeback of the header is triggered, first to the current processor's cache 140 along path 6 and then to the current memory subsystem 120 along paths 7 and 8. The local processing element 150 takes care of the routine header manipulations but sends the new or different headers to another processing element 190 with a different memory subsystem 180. When the local processing element 150 forwards the modified header 122 to another processing element 190, it must cast out the modified header from its cache 140. The header must then be read from memory 120 for the next processing element. From packet memory 120, the recombined packet enters a memory engine along path 9, such as the bus interface Direct Memory Access (DMA) engine 160, which manages memory access. The bus interface DMA engine 160 notifies the next processing element 190 when either or both of the header 122 and body 124 are complete. If the main body of data 124 needs to go to the next processing element 190 and has not yet been transferred, its transfer is now triggered. This transfer may be with the header or roughly in parallel with the transfer of the modified header 122. The next processing element 190 is notified that the header, the body, or both are available in its memory 180 and proceeds.
There is thus a need in the industry to increase the memory bandwidth by decreasing the amount of traffic on an interconnect and/or internal bus system in a data communications system by eliminating redundant or unnecessary memory accesses.