The present invention relates to a memory device for constituting a memory subsystem of a data processing apparatus. More specifically, the present invention is directed to a memory device suitably used in a memory subsystem of a data processing apparatus of the type in which a large amount of data is directly supplied from a storage apparatus having a large storage capacity.
Very recently, as the operating speeds of microprocessors have increased, remarkable advances have appeared in the high-speed operation and high performance of the peripheral components that support microprocessors. For instance, as to memory devices, various future types of "synchronous DRAMs" have been proposed, e.g., "MoSys DRAM (MDRA)", "Media DRAM", and "SyncLink DRAM", which are described in the Japanese magazine "NIKKEI MICRODEVICE", in the article entitled "Strong Competition on Post-SDRAM: Protocol Control Method is Acceptable?" issued in April 1996, pages 74 to 83 (will be referred to as "publication No. 1" hereinafter). Thus, there is a trend toward standardizing these synchronous DRAMs as main memories of information processing appliances.
On the other hand, the performance of microprocessors has drastically improved in connection with the great progress of semiconductor technology and developments in RISC techniques. In particular, since semiconductor technology is considerably advanced, the operating frequencies of the semiconductor chips constituting high-speed microprocessors may exceed 500 MHz. While such high-speed microprocessors are commercially available, the performance of electronic systems employing this sort of high-speed processor is similarly improved.
However, the following problems are revealed when the above-explained electronic systems are practically realized.
That is, in general, the above-described high performance microprocessors can exhibit sufficiently high capabilities while processing data held in the cache memories employed inside the processors and in peripheral circuits thereof, since such data are accessible at high speed. However, when huge problems such as scientific and technical calculations are to be solved by these high performance microprocessors, the data to be handled cannot be held in these cache memories. Therefore, there is a problem that the actual performance of these microprocessors is considerably lowered. In other words, since so-called "cache misses" occur, processor waiting states arise while data are transferred from either the main memory or memory subsystems of a lower hierarchy to the cache memories. As a result, the processors are brought into idle states and the system performance is greatly lowered. The degree of this lowering of system performance is described in, for example, "Pseudo Vector Processor based on Register Window and Superscalar Pipeline", Parallel Processing Symposium JSPP, published in 1992, pages 367 to 374 (will be referred to as "publication No. 2" hereinafter).
In this publication No. 2, the pseudo vector processor is proposed so as to solve such a cache miss problem. In this pseudo vector processor, a large number of registers are provided within the processor, and the memory access operations to either the main memory or the memory subsystem of the lower hierarchy are carried out in a pipeline manner, so that the lowering of performance caused by the data waiting time can be minimized.
However, in this pseudo vector processor, the throughput required of either the main memory or the memory subsystem of the lower hierarchy is extremely high, as compared with the throughput required of the main memory or lower-hierarchy memory subsystem of a normal microprocessor system equipped with a general-purpose cache memory. This is because the approach of the pseudo vector processor is intended to hide the increase of latency in accessing either the main memory or the memory subsystem of the lower hierarchy by employing the pipeline structure, not to reduce the amount of data to be handled.
As a consequence, either the main memory or the memory subsystem of the lower hierarchy used for the above-explained pseudo vector processor must be constituted by employing a multi-bank structure in order to realize a large memory capacity as well as a high throughput. In this multi-bank structure, a plurality of memory devices equipped with high-speed interfaces, such as synchronous DRAMs, are arranged in parallel.
The need for either a main memory or a lower-hierarchy memory subsystem with a large memory capacity and a high throughput arises in cases other than the pseudo vector processor as well. Another approach to solve the cache miss problem, different from the above-explained architecture, is described in "Micro-vector Processor Architectures", a research report of the Information Processing Society of Japan published on Jun. 12, 1992, pages 17 to 24 (will be referred to as "publication No. 3" hereinafter).
In the above-described publication, one approach has been proposed in order to avoid lowering of the effective memory access performance. That is, in the case where the functions of a vector processor are manufactured in a single semiconductor chip by utilizing high integration techniques, multithread processing at the vector instruction level is carried out to address the problem that the total number of memory access pipelines is restricted by the input/output pin bottleneck. Also in this case, a high throughput is required of either the main memory or the memory subsystem of the lower hierarchy. As a result, similarly to the pseudo vector processor, it is required to prepare either a main memory or a lower-hierarchy memory subsystem having the multi-bank structure.
A requirement common to systems employing the above-explained two different architectures is as follows. That is, either a main memory or a lower-hierarchy memory subsystem having a high memory capacity and throughput must be realized by using a small number of electronic components and at low cost. In other words, such a memory system must be provided that matches the trend toward low-cost, compact processors. If such a memory system cannot be realized, then the system balance is destroyed and the system value is lost.
Meanwhile, completely different systems have been proposed. That is, the "unified memory architecture (UMA)" system has been proposed as a measure for constructing relatively low-cost personal computers, in which a cache memory mounted outside a processor is reduced, and/or memories other than a main memory (a frame buffer and the like) may function as this main memory. This new trend is disclosed in the Japanese magazine "NIKKEI MICRODEVICE", in the article entitled "US PC industries starting - - - reduction in total memory quantities" issued in February 1996, pages 42 to 62 (will be referred to as "publication No. 4" hereinafter). The system described in publication No. 4 is arranged such that there are two large flows of memory accesses.
As one memory access flow, there is an access operation from the processor to the memory functioning as the main memory, whereas as another memory access flow, there is a sequential access operation from the graphics controller to the memory functioning as the frame buffer. The above-explained memory access system is thus featured by a mode in which a plurality of access streams may access one memory subsystem. It should be understood that the performance of the memory subsystem must be maintained to some extent for this mode to be practically meaningful. To this end, some means of supplying data at low cost (namely, suppressing the increase in the total component quantity) without largely lowering the resultant throughput with respect to a plurality of access streams is necessarily required.
A key point is how to provide or realize a high performance main memory, or a high performance lower-hierarchy memory subsystem, when systems having any one of the above-described architectures, namely the "pseudo vector processor", the "micro vector processor", and the "unified memory architecture", are practically established.
To realize a high throughput main memory, or a high throughput lower-hierarchy memory subsystem, by utilizing the conventional techniques, a system having the multi-bank structure and using "synchronous DRAMs" may constitute the most effective way.
FIG. 8 is a schematic block diagram showing a system arrangement of a data processing apparatus with employment of a conventional synchronous DRAM. FIG. 9 is a schematic block diagram indicating a structure of the conventional synchronous DRAM. Referring now to FIG. 8 and FIG. 9, the conventional synchronous DRAM will be explained.
In FIG. 8 and FIG. 9, reference numeral 200 shows an instruction processor, reference numerals 201 and 202 indicate data streams, reference numeral 203 represents a multiplexer, reference numeral 210 denotes a memory control apparatus, and reference numeral 220 denotes a memory subsystem. Also, numerals 221 to 228 show synchronous DRAMs, numeral 300 is a memory cell, reference numeral 301 shows a control circuit, numerals 310 to 312 and 314 represent registers, and numerals 320 and 321 indicate decoders.
In FIG. 9, which represents the structure of the conventional synchronous DRAM, the registers 310, 311, 312, and 314 provided in this DRAM hold the row-address signal, the column-address signal, the data-in signal, and the data-out signal, respectively, in response to a clock supplied from outside this memory chip. The decoder 320 is a decoder for the row-address signal, and the decoder 321 is a decoder for the column-address signal. The memory cell 300 is accessed by the outputs of the decoders 320 and 321. Based upon the respective control signals CS, RAS, CAS, and WE, the control circuit 301 produces set signals 301a and 301b supplied to the address registers 310 and 311, and also produces a set signal supplied to the write data register 312. Also, the control circuit 301 produces a set signal 301d supplied to the read data register 314, and a set signal 301c supplied to the memory cell 300.
The synchronous DRAM shown in FIG. 9 is featured in that the external interface of this synchronous DRAM is constituted as a pipeline system. In other words, the interface between the control logic (memory control device) of the DRAM and the DRAM itself is realized as an interface capable of performing synchronous transfer operations in response to the sync clock. As a result, synchronous DRAMs corresponding to a plurality of banks may be connected to one set of memory interfaces.
The conventional data processing apparatus indicated in FIG. 8 is arranged by the instruction processor (command processor) 200, the memory control apparatus 210, and the memory subsystem 220. The memory subsystem 220 is constructed of the synchronous DRAMs 221 to 228 having the memory structure shown in FIG. 9. As a result, this memory subsystem 220 can be constituted as a multi-bank type memory subsystem with a small number of structural components, as compared with a memory subsystem using asynchronous DRAMs.
The memory control apparatus 210 is provided with a control circuit 211 for allocating memory access requests to two sets of RAM interfaces. Four synchronous DRAMs selected from the synchronous DRAMs 221 to 228 are connected to each of these interfaces of the memory subsystem 220. In this case, the addressing method is as depicted in the internal portion of the memory subsystem 220. That is, the addresses are allocated in such a manner that the DRAM accessed is shifted for every word address. However, this approach does not constitute the optimum solution; namely, this approach corresponds to a general allocation method used to process an 8-byte single access. The reason why this approach does not constitute the optimum solution is given as follows:
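The word-interleaved allocation described above can be sketched as follows. This is a minimal illustration, not the figure's exact wiring: the eight-bank count and the 8-byte word size are taken from the example in the text, while the modulo mapping itself is an assumption about how "the DRAM accessed is shifted for every word address".

```python
# Hypothetical word-interleaved mapping of byte addresses onto the
# eight synchronous DRAMs 221 to 228 (assumption: one bank per DRAM).
NUM_BANKS = 8          # synchronous DRAMs 221-228
WORD_BYTES = 8         # 8-byte word, as in the single-access example

def bank_of(address):
    """Return the bank index (0-7) that holds the given byte address."""
    word = address // WORD_BYTES   # word address of this access
    return word % NUM_BANKS        # shift banks for every word address

# Consecutive word addresses rotate through all eight banks, so a
# strictly sequential stream never reuses a busy bank.
sequence = [bank_of(a) for a in range(0, 64, WORD_BYTES)]
```

With this mapping a purely sequential access stream is spread evenly over the banks; the discontinuous streams discussed next are exactly the case where this even spreading breaks down.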
In general, as described in the above-explained publication No. 2, the pseudo vector processor sequentially executes the iterations which constitute a DO loop. As a consequence, since one vector operand is not continuously accessed, unlike in a general vector processor, either the main memory or the memory subsystem functioning as the lower hierarchy is accessed in a discontinuous manner. In other words, the access operation in this case constitutes an access pattern such as [a(i+2)→b(i+2)→a(i+3)→b(i+3)], as indicated in FIG. 4 of publication No. 2. Even when the vector "a" and the vector "b" are held in continuous regions, the access addresses for the memory system are not continuous.
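The discontinuity of this access pattern can be made concrete with a small sketch. The base addresses and the 8-byte element size below are illustrative assumptions; only the interleaving of the two vectors comes from the text.

```python
# Sketch of the pseudo vector processor's access pattern: each DO-loop
# iteration touches a(i) and b(i) alternately, so even when both
# vectors occupy contiguous regions the issued address stream jumps
# back and forth between the two regions.
ELEM = 8            # 8-byte elements (assumption)
BASE_A = 0x1000     # hypothetical base address of vector a
BASE_B = 0x2000     # hypothetical base address of vector b

def address_stream(n):
    """Addresses issued by interleaving a(i) and b(i) for n iterations."""
    out = []
    for i in range(n):
        out.append(BASE_A + i * ELEM)   # access a(i)
        out.append(BASE_B + i * ELEM)   # access b(i)
    return out

stream = address_stream(2)   # a(0), b(0), a(1), b(1)
```

Successive entries of `stream` differ by the distance between the two regions rather than by one element, which is why the memory system sees the stream as discontinuous.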
On the other hand, the micro vector processor described in the above-explained publication No. 3 executes multi-thread processing at the vector instruction level. In this vector processor, access operations corresponding to the vector operands of plural streams are present in a mixed manner. As a result, also in this case, even when the operands of the respective streams are allocated to continuous regions, the access addresses with respect to either the main memory or the memory subsystem of the lower hierarchy are not continuous. This constitutes the major reason why the approach shown in FIG. 8 is not the optimum approach.
Furthermore, the above-explained UMA (unified memory architecture) system of publication No. 4 has an architecture similar to the above-described ones in that a plurality of memory access streams are produced.
As previously explained, a memory system of which a high throughput is required, although the addresses of the memory access operations are not continuous, must necessarily employ a memory structure with a large number of memory banks, unless the memory system employs a large number of high-speed RAMs of the sort used for cache memories. The reason is as follows: even if the fine processing techniques of the semiconductor process are advanced, the performance of the memory cells used in various types of DRAMs cannot be greatly improved. If continuous access operations cannot be realized in the RAMs, then the DRAMs cannot be operated at high speed. In other words, since the synchronous DRAM is employed, the RAM interface portion of the memory system can be operated at high speed. However, when access operations are required for non-continuous addresses, there is no solution capable of meeting the requirements of the processor side other than increasing the number of banks.
As a result, in a data processing apparatus requiring such high-speed data processing operations, either a main memory or a lower-hierarchy memory subsystem arranged as a multi-bank structure must be prepared. This causes a serious and essential problem, namely that the total component quantity of the system cannot be reduced in proportion to the compactness of the processor. This aspect will now be explained with reference to FIG. 8.
In the conventional data processing apparatus shown in FIG. 8, it is now assumed that a stream 201 of continuous addresses (a0, a1, a2, a3, - - - ) is mixed with a stream 202 of other continuous addresses (b0, b1, b2, b3, - - - ) issued from the instruction processor 200. A further assumption is that the arrangement of these continuous addresses on the memory subsystem is as depicted inside the memory subsystem 220.
In the above case, the stream 201 is mixed with the stream 202 by the multiplexer 203, and the mixed address stream is directly supplied to the memory control apparatus 210 so as to be processed therein. As explained above, when accesses are mixed with each other, the resulting access mode approximates a random mode with respect to the memory system. As a result, the feature of the DRAM that it can accept continuous accesses cannot be exploited. For example, when the cycle time of the DRAM is equal to 8 machine cycles, at least 8 banks must be prepared in order to respond to an access request issued from the processor every cycle.
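The "8 banks" figure follows from simple pipelining arithmetic, which can be stated as a one-line formula. The numeric values are those of the example in the text; the formula itself is the standard bank-count bound, not a construction specific to this apparatus.

```python
# If one bank is busy for dram_cycle_time machine cycles per access
# and the processor issues one request every request_interval cycles,
# requests must rotate over at least ceil(cycle_time / interval)
# banks to avoid hitting a busy bank in the worst case.
import math

def min_banks(dram_cycle_time, request_interval):
    """Fewest banks needed so a new request can start every interval."""
    return math.ceil(dram_cycle_time / request_interval)

# The example in the text: 8-cycle DRAM, one request per machine cycle.
banks = min_banks(dram_cycle_time=8, request_interval=1)
```

This is a lower bound for random-looking traffic; with guaranteed continuous accesses a single bank operated in page mode could do better, which is exactly the property the mixed streams destroy.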
As previously explained, there is a certain possibility that the addresses of the memory access operations within the processor can be made continuous. However, the reason why the addresses with respect to either the main memory or the memory subsystem of the lower hierarchy become discontinuous is as follows. That is, the access requests are issued by mixing the elements of plural vector operand streams with each other. This element mixture itself is required by the processing method for performing high-speed data processing within the processor. Therefore, it is meaningless to consider a method for avoiding this element mixture. As a result, only a memory subsystem capable of extracting the continuity of access requests from access requests issued in a discontinuous manner can satisfy the requirements of a high-speed processor.
A conventional idea for such high-speed operation is described in JP-A-7-262083. That is, this patent application is related to a DRAM in which a plurality of data register arrays are provided in correspondence with the rows, and which is equipped with a mechanism for holding the access data corresponding to different row addresses at the same time.
In addition, the system called "Virtual Channel Memory" was proposed in 1997. This virtual channel memory system may largely improve the effective bandwidth in such a manner that a plurality of cache regions corresponding to the row data, called "channels", are provided between the memory cell array and the external interface circuit, and these plural channels are allocated to a plurality of controllers which access the memory. This virtual channel memory system is described in more detail in the Japanese magazine "NIKKEI MICRODEVICE", in the article entitled "Virtual Channel Memory - - - effective in plural memory masters" issued in February 1998, pages 142 to 129 (will be referred to as "publication No. 5" hereinafter).
As previously explained with regard to JP-A-7-262083, when plural sets of data registers are provided in correspondence with the rows as a cache memory, there is a problem that the data transfer capability within the DRAM chip is deteriorated. In the case where a sense amplifier, which corresponds to a cell in a general-purpose DRAM chip, is regarded as a simple buffer and a mechanism capable of reading out the data appearing on this sense amplifier is realized, not all of the data appearing on the sense amplifier need be moved within the DRAM chip.
However, in the case where the data corresponding to a row are held in plural planes of buffers, the data appearing on the sense amplifiers must be transferred. At this time, there is a problem that the data transfer capability within the DRAM chip is lowered. In general, the data lines from the sense amplifiers to the I/O buffers are commonly used among a plurality of cells (separate data bits designated by the same row address). The reason is as follows. That is, if the I/O data lines are not commonly used, then the power consumption of the DRAM is increased, and furthermore, the area occupied by the circuits operable at relatively high speed is increased. For instance, in the case where the number of bits per row is equal to 1024 bits, when this data is transferred from the sense amplifiers to a data register array within one access operation (during 10 ns), the required data transfer capability becomes 100 Gb/s. If the DRAM is arranged with an n-bit width structure, then the overall DRAM chip requires a transfer capability of n×100 Gb/s (for example, if n=16, then 200 Gbyte/sec.). Such a high data transfer capability of a DRAM can hardly be realized in practice. Such a circuit method for reading from the memory cells is described in the Japanese book "ULTRA LSI MEMORY" written by K. ITO, published by BAIFUKAN, on pages 161 to 173 (will be referred to as "publication No. 6" hereinafter).
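The arithmetic behind the 100 Gb/s and 200 Gbyte/sec figures can be checked directly; the numbers below are exactly those of the example in the text (1024 bits per row, a 10 ns access, a 16-bit-wide chip).

```python
# Bandwidth needed to move a full 1024-bit row from the sense
# amplifiers to a data register array within one 10 ns access,
# per bit-plane, and chip-wide for an n-bit-wide DRAM.
ROW_BITS = 1024    # bits per row in the example
ACCESS_NS = 10     # one access operation, in nanoseconds
N_WIDTH = 16       # n-bit width of the chip (n = 16 in the example)

per_plane_gbps = ROW_BITS / ACCESS_NS        # bits/ns = Gb/s, ~100 Gb/s
chip_gbytes = per_plane_gbps * N_WIDTH / 8   # n planes, /8 for bytes
```

The per-plane figure comes to about 100 Gb/s and the chip-wide figure to about 200 Gbyte/sec, matching the values cited above; this is why the text concludes that such an internal transfer rate is impractical.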
On the other hand, when the above-described data for one access operation are subdivided and the subdivided data portions are transferred, there are large demerits in performance. That is, while these subdivided data portions are being transferred, no access operation can be carried out on this memory cell.
Furthermore, since a large number of memory devices are used in a system, the structures of these memory devices should not be specific to a specialized system. Therefore, these memory devices are preferably required to be commonly usable among various sorts of systems. If this cannot be realized, then such a memory device is not commercially acceptable, namely it becomes very expensive, even when a high performance memory device can be realized in a certain system of a specific field. As a result, the system competitiveness (cost-to-performance ratio) would be greatly deteriorated.
Furthermore, the VCM (Virtual Channel Memory) system has the following restrictions. That is, in this VCM system, the data width per channel is fixed, although the data transfer amount from the memory cell to the channel can be designated. Moreover, the total number of channels prepared by a chip is limited. Even when a certain area is secured for the channels, if both the total channel number and the data width are fixed, then there is a serious limitation when this VCM memory is applied to various sorts of systems.
In the case where the VCM system is arranged with a small number of channels having a large data width, for a use in which a smaller data width per channel and a larger number of channels would be better, the memory area previously prepared for the channels cannot be effectively utilized. Also, since the total channel number becomes short, the performance of this VCM memory cannot be sufficiently achieved. Conversely, in the case where the VCM system is arranged with a large number of channels having a small data width, for a use in which a larger data width per channel and a smaller total number of channels would be better, the following problems occur. That is, the overhead of the managing circuit for managing the memory is increased, and also the data transfer operations to the channels occur frequently, resulting in a deterioration of the data transfer efficiency.
An object of the present invention, which has been made to solve these various problems of the related art, is therefore to provide a memory device capable of flexibly accepting requirements for a necessary data width and a necessary channel number. Another object of the present invention is to provide a system capable of optimizing its performance and management cost by arranging a memory subsystem with such a memory device, even in processing operations in which the access addresses issued from a request source to this memory subsystem are a mixture of a plurality of essentially continuous streams. Also, another object of the present invention is to provide a memory device capable of covering various systems, ranging from a personally-used system up to a large-scale technical calculation system.
The above-explained objects of the present invention can be achieved by employing the below-mentioned memory device. That is, a register array is provided which has a structure in which a position for holding data is specified by an absolute register number and an absolute word number within the memory device, and a virtual register array is constituted on this register array, the virtual register array being made of an "S×N-structured register", the size of which is "S" words and which is arranged as N sets of registers. To this end, the memory device comprises: a mode register for defining the register size "S" and the register number "N"; and a converting circuit for converting both a virtual register number and a virtual word number, which are applied from an external circuit provided outside this memory device, into both an absolute register number and an absolute word number by using the values held in the mode register.
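A minimal sketch of such a converting circuit follows. Only the mode register (holding S and N) and the virtual-to-absolute conversion come from the text; the linear mapping of the virtual array onto the physical array, and the physical geometry `ABS_WORDS` (words per absolute register), are assumptions made for illustration.

```python
# Sketch of the converting circuit: the mode register fixes the
# geometry S (words per register) and N (number of registers) of the
# virtual S x N register array; a (virtual register number, virtual
# word number) pair is then mapped linearly onto the fixed physical
# register array addressed by (absolute register, absolute word).
ABS_WORDS = 32   # hypothetical words per absolute register

class ModeRegister:
    def __init__(self, size_s, number_n):
        self.s = size_s      # register size "S" in words
        self.n = number_n    # register number "N"

def convert(mode, vreg, vword):
    """Map a virtual (register, word) pair to an absolute pair."""
    assert vreg < mode.n and vword < mode.s   # within the virtual array
    linear = vreg * mode.s + vword            # position in virtual array
    return linear // ABS_WORDS, linear % ABS_WORDS

# The mode register is set from outside the device (here: S=64, N=8).
mode = ModeRegister(size_s=64, number_n=8)
abs_reg, abs_word = convert(mode, vreg=1, vword=10)
```

Because S and N live in the mode register rather than in the silicon geometry, the same physical array can be presented as many narrow registers or a few wide ones, which is the flexibility the objects above call for.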
It should be understood that the information in the above-explained mode register may be arbitrarily set from the external circuit provided outside the memory device.