Known computer designs usually have a direct connection between the processor and its memory components. In conventional designs, data values are exchanged between the processor and the memory components, with load/store addresses and load/store data objects going in and out of the processor. In more sophisticated designs, instruction addresses and instruction data objects leave the output side of the processor in addition to the data values. With the improvement of processor performance and the enlargement of the memory components, the speed of data transfer between the processor and the memory components has become a bottleneck for system performance. A so-called cache memory was therefore introduced into the design in addition to the main memory. A cache is a small, fast memory component that holds data recently accessed by the processor and is designed to speed up subsequent accesses to the same data. Caches are most often applied to processor-memory access, but they are also used to hold a local copy of data accessible over a network.
The cache may be located on the same integrated circuit as the processor in order to shorten the transmission distance and thereby further reduce the access time. The cache is built from faster memory chips than the main memory, so that a cache hit takes much less time to complete than a normal memory access. Processor microarchitecture in this area has developed gradually and has led to so-called System on Chip designs, in which the cache is on the same silicon die as the processor. In this case it is often known as the primary cache, since there may be a larger, slower secondary or tertiary cache outside the CPU chip. As processor performance has increased, multiple levels of caching have been introduced, with Level 1 being the closest to the processor and with Level 2 and sometimes Level 3 caches all on the same die. These caches are usually of different sizes, e.g. 16 kBytes for Level 1, 256 kBytes for Level 2 and 1 MByte for Level 3, so as to allow the smaller caches to run faster.
In computer systems it is conventional to define in each instruction to be executed a set of register addresses which are used to access a register file in the computer system. The register addresses usually include first and second register addresses defining registers from which operands are extracted and at least one destination register address defining a register into which the results of an operation are loaded. Data processing instructions generally use the contents of the first and second registers in some defined mathematical or logical manipulation and load the results of that manipulation into the defined destination register. Memory access instructions use the register addresses to define memory locations for loading and storing data to and from a data memory. In a load instruction, source registers define a memory location from which data is to be loaded into the destination register. In a store instruction, the source registers define a memory location into which data is to be stored from the destination register.
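The register-addressing scheme described above can be illustrated with a minimal Python sketch. The names `regs`, `memory`, `alu_add`, `load` and `store` are hypothetical, and forming the memory address as the sum of the two source registers is an assumption made purely for illustration; the text does not state how the source registers define the memory location.

```python
# Hypothetical model: a small register file and a sparse data memory.
regs = [0] * 8
memory = {}

def alu_add(dest, src1, src2):
    """Data processing instruction: combine two source registers into dest."""
    regs[dest] = regs[src1] + regs[src2]

def effective_address(src1, src2):
    # Assumption for illustration only: address = sum of the source registers.
    return regs[src1] + regs[src2]

def load(dest, src1, src2):
    """Load: the source registers define the location read into dest."""
    regs[dest] = memory.get(effective_address(src1, src2), 0)

def store(dest, src1, src2):
    """Store: the source registers define the location written from dest."""
    memory[effective_address(src1, src2)] = regs[dest]

regs[1], regs[2], regs[3] = 0x100, 0x8, 42
store(3, 1, 2)    # writes regs[3] to address 0x108
load(4, 1, 2)     # reads address 0x108 back into regs[4]
alu_add(5, 3, 4)  # regs[5] = regs[3] + regs[4]
```

In this sketch every instruction names three register addresses, mirroring the two-source, one-destination layout described in the text.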
Existing computer systems generally operate by generating memory addresses for accessing memory sequentially. The architecture of existing computer systems is arranged such that each memory access instruction defines a single memory address. Memory access units exist which allow two addresses to be generated from a single instruction, by automatically incrementing the address defined in the instruction by a certain predetermined amount. However, these systems are clearly restricted in that, if two addresses are generated, the second address necessarily bears a certain predetermined relationship to the first address. Vector stride units also exist which allow more than one memory address to be computed, but these are also limited in the relationship between the addresses. Moreover, it is necessary to generate the first address prior to calculating the second address, and therefore it is not possible to generate two memory access addresses simultaneously in a single memory access unit.
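The restriction described above can be sketched as follows. This is only an illustrative model, and the function names `incremented_pair` and `strided_addresses` are hypothetical; it shows that in both schemes every later address is fixed by the first address and a predetermined constant.

```python
def incremented_pair(base, increment):
    """Two addresses from one instruction via automatic increment."""
    first = base
    # The second address can only be formed once the first is known,
    # and it always differs from it by the predetermined increment.
    second = first + increment
    return first, second

def strided_addresses(base, stride, count):
    """Vector-stride style generation: base, base+stride, base+2*stride, ..."""
    return [base + i * stride for i in range(count)]

pair = incremented_pair(0x1000, 8)
vector = strided_addresses(0x1000, 4, 4)
```

Neither function can produce two unrelated addresses, which is the limitation the paragraph points out.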
In some known computer systems a permuter is used for picking up data that is in columnar organisation and transforming it into a row organisation. A permuter is a device for reordering data in a large data structure. In most conventional computer systems, permuter operations account for an appreciable share of the cycle count. However, a permuter operation has the disadvantage of slow performance, since it requires a long processing time. Known supercomputers usually perform scatter/gather operations by means of expensive memory systems. Scatter/gather operations have not yet been implemented in microprocessors, since they require extensive processing time, especially if they are to be performed efficiently. Furthermore, implementing scatter/gather operations in microprocessor designs has not been an option for cost reasons.
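The reordering that a permuter performs can be pictured with a short software sketch. The function `columns_to_rows` is a hypothetical name, and modelling the data structure as a flat Python list is an assumption; a hardware permuter would of course not be implemented this way.

```python
def columns_to_rows(data, rows, cols):
    """Reorder a flat list from columnar (column-major) to row organisation."""
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    # Element (r, c) sits at index c * rows + r in the columnar layout.
    return [data[c * rows + r] for r in range(rows) for c in range(cols)]

# A 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored column by column:
columnar = [1, 4, 2, 5, 3, 6]
row_order = columns_to_rows(columnar, rows=2, cols=3)
```

Every output element requires its own indexed read, which gives a sense of why such reordering is costly when done element by element.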
Some computer systems have more than one execution channel, e.g. dual ported computer systems with two execution channels. In such dual ported computer systems, each execution channel has a number of functional units which can operate independently, and both execution channels can be in use simultaneously. In some cases the execution channels share a common register file. It is useful in such architectures to provide instructions which simultaneously instruct both execution channels to implement a function so as to speed up operation of the processor. In such a scenario, a so-called long instruction may have two instruction portions, each intended for a particular execution channel. Each instruction portion needs to define the register addresses for use in the function to be performed by the execution channel for which it is intended. In some cases both instruction portions may need to define associated or identical register addresses. In these situations a long instruction needs to define two sets of register addresses, one for each execution channel.
In known (vector) computer architectures, the process of sending a set of values, e.g. v1, v2, v3, v4 . . . , to a related set of memory addresses, e.g. a, a+n, a+2n, a+3n . . . , is called “scatter”, whereas the process of fetching a set of values is called “gather”. In a packed single instruction multiple data (SIMD) format, the instructions provide scatter operations, e.g. store vector byte (STVB), store vector half word (STVH) and store vector word (STVW), and gather operations, e.g. load vector byte (LDVB), load vector half word (LDVH) and load vector word (LDVW), which transfer data between the register packed format and a set of scalar values at memory addresses. STVW stores word 0 of the source at the base address and word 1 of the source at the base address plus offset. LDVH fetches half word 0 of the destination from the base address, half word 1 from the base address plus offset, half word 2 from the base address plus 2*offset, and half word 3 from the base address plus 3*offset. However, the LDV and STV operations are limited to a single 64-bit packed SIMD value.
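The STVW and LDVH behaviour described above can be modelled with a minimal sketch. Several details here are assumptions rather than facts from the text: the function names `stvw` and `ldvh`, the treatment of element 0 as the least significant lane of the 64-bit value, and the simplified memory model (a dictionary holding one whole element per address instead of individual bytes).

```python
def stvw(memory, source, base, offset):
    """Scatter the two 32-bit words of a 64-bit packed value."""
    for i in range(2):
        word = (source >> (32 * i)) & 0xFFFFFFFF  # assumed lane order
        memory[base + i * offset] = word

def ldvh(memory, base, offset):
    """Gather four 16-bit half words into one 64-bit packed value."""
    dest = 0
    for i in range(4):
        half = memory.get(base + i * offset, 0) & 0xFFFF
        dest |= half << (16 * i)  # assumed lane order
    return dest

mem = {}
stvw(mem, 0x1111111122222222, base=0x100, offset=0x40)

halves = {0x200: 0xAAAA, 0x210: 0xBBBB, 0x220: 0xCCCC, 0x230: 0xDDDD}
packed = ldvh(halves, base=0x200, offset=0x10)
```

Both functions handle exactly one 64-bit packed value per call, which reflects the limitation stated at the end of the paragraph.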
To achieve high performance in a (vector) computer system, it is desirable to perform scatter and gather operations efficiently, even though such operations imply multiple data transfers between the processor and memory. It is an object of the present invention to provide more efficient methods for picking up data that is in columnar organisation and transforming it into a row organisation. Supercomputers in particular have difficulty handling complex data structures, especially when the data is in the wrong organisation. More specifically, it is an object of the present invention to provide a method to gather a data structure in an organisation of A0, A1, A2, A3, A4 into an organisation of A0, A8, A16, A24 in one operation.
Recently, dual ported processors have been developed with specific designs comprising two execution channels or pipelines and two load/store units (LSU) capable of two load/store data transactions per cycle (e.g. the Broadcom “Firepath” processor), which will be described in more detail further below. In existing systems such dual ported processors have been connected directly to a pseudo dual ported on-chip memory of a small size, e.g. 192–256 kBytes, so that the memory reacts fast enough. Since processors comprising two execution pipelines capable of two load/store data transactions per cycle run faster than conventional processors, and since the amount of required memory has increased, problems occur in implementing data caches for such processor designs.
Another object of the present invention is to overcome the above-mentioned problems and disadvantages by providing a processor architecture for dual ported processor implementations that have two execution pipelines capable of two load/store data transactions per cycle, and by managing the data transactions between the processor and its cache memory. Still another object of the present invention is to provide increased flexibility for memory accesses in such dual ported processor implementations comprising two execution pipelines capable of two load/store data transactions per cycle.