The present invention relates generally to memory management, and, more particularly, to a method for storing and retrieving data that conserves memory bandwidth.
Recent dramatic technological advancements in the fields of computers, semiconductors, and communications have led to a proliferation of products that are capable of real-time processing of digitized streams of multiple data types, such as audio, video, graphics, and communications data streams. Such products are commonly referred to as "multimedia products". These multimedia products include multimedia personal computers (PCs), television set-top boxes, videoconferencing systems, High Definition Television (HDTV) sets, video telephony systems, Internet (Web) browsers, video arcade game systems, consumer video game consoles, and many others.
High-quality multimedia applications require enormous amounts of processing power, memory resources, and communications bandwidth, and these requirements are continuously increasing due to the increasing variety and complexity of the multimedia data being processed. For example, many multimedia products must be capable of simultaneous, real-time processing of photorealistic 3-D graphics, CD-quality digital audio, full-motion digital video (e.g., MPEG-encoded video), and communications data streams. Until recently, each of these multimedia processing functions was handled by a separate, dedicated processor element. Typically, a separate, programmable Digital Signal Processor (DSP) was used to handle each multimedia data type, with each DSP functioning as a co-processor in conjunction with a host CPU. However, Philips Semiconductors' TriMedia Product Group has developed a new Very Long Instruction Word (VLIW) processor architecture for consumer multimedia applications that converges these different functions into a single multi-function processor, called the TriMedia processor. The TM-2000 processor, which is the latest version of the TriMedia processor, is a programmable DSP/CPU that combines a next-generation, programmable microprocessor core with a full set of innovative development tools to simultaneously process full-motion video (i.e., MPEG-2 digital video and DVD video), 3-D graphics, CD-quality audio, and high-speed communications data streams. By combining these various functions on a single chip, which reduces cost, size and power demands, the TM-2000 processor makes possible the implementation of an advanced multimedia system at an affordable cost and with a smaller footprint. This implementation of multiple processing functions on a single chip is sometimes referred to as a "system-on-a-chip".
With reference now to FIG. 1, there can be seen a high-level block diagram of the TM-2000 processor 20. As can be readily seen, the TM-2000 processor 20 includes a VLIW CPU 22 supported by a dedicated on-chip data cache 23 and a separate, dedicated on-chip instruction cache 24. The TM-2000 processor 20 also includes a plurality of on-chip, independent, DMA-driven multimedia I/O and coprocessing units 50a-50j that will hereinafter be referred to as "function units". These on-chip function units 50a-50j manage input, output, and formatting of video, audio, graphics, and communications datastreams and perform operations specific to key multimedia algorithms, thereby streamlining and accelerating the processing of these video, audio, graphics, and communications datastreams.
With continuing reference to FIG. 1, the TM-2000 processor 20 utilizes an external Synchronous Dynamic Random Access Memory (SDRAM) 30 (or a Synchronous Graphics Random Access Memory (SGRAM)) that is shared by the function units 50a-50j via a high-speed internal 32-bit bus 40a and a 64-bit bus 40b. The 32-bit bus 40a connects to a main memory interface 41 through a bridge 43. The 32-bit bus 40a and the 64-bit bus 40b will hereinafter be collectively referred to as the "data highway 40". Bus transactions use a block transfer protocol. The on-chip function units 50a-50j can be masters or slaves on the data highway 40. Programmable bandwidth allocation enables the data highway 40 to maintain real-time responsiveness in a variety of different applications.
Because the SDRAM 30 is a shared memory resource that is frequently accessed by the multiple function units 50a-50j of the processor 20 via the data highway 40, the two-way data traffic on the data highway 40 requires a large amount of memory bandwidth. Memory bandwidth is defined as the maximum rate (e.g., Mbytes/second) at which the data can be transferred between the SDRAM 30 and the function units 50a-50j and the CPU 22 of the processor 20. It is highly advantageous to minimize the amount or proportion of the overall memory bandwidth for the processor 20 that is consumed by any given one of the function units 50a-50j and the CPU 22 within the processor 20, in order to thereby improve the efficiency, speed, and overall performance of the processor 20. In a worst case scenario, if the memory bandwidth is insufficient, bottlenecks can occur due to data traffic congestion on the data highway 40, thereby resulting in improper operation of the system and/or system failure.
The processing of digital video datastreams is a function that consumes a large amount of the available memory bandwidth, due to the fact that this function requires extensive use of memory in order to execute the complex algorithms that are required to decode and process the digital video datastreams. For example, the decoding and processing of MPEG-2 encoded digital video datastreams requires many memory-intensive operations to be performed. In the context of the TM-2000 processor 20 depicted in FIG. 1, the function unit 50a, called "MPEG2 Coprocessor", is responsible for decoding the MPEG-2 encoded digital video datastream received by the function unit 50b, called "Vin/TS-In2", hereinafter referred to simply as "Video In". The decoded digital video data is stored in the SDRAM 30, and then the function unit 50c, called "HD-VO" (High Definition-Video Out), hereinafter referred to simply as "Video Out", fetches the decoded digital video data, performs any required post-processing operations, and then outputs the decoded digital video data to a display device. One particularly memory-intensive operation that is required by the MPEG-2 decoding function is Motion Compensation (MC), due to the fact that it entails block-based processing on randomly distributed reference blocks of the digital video data stored in the SDRAM 30, which demands frequent and random memory accesses.
Based on the above and other factors, and as will be appreciated by those skilled in the pertinent art, the function unit 50a (hereinafter referred to simply as the "MPEG-2 decoder") consumes a considerable amount of the available memory bandwidth in the TM-2000 processor 20. Thus, in designing future generations of this TriMedia processor family, the amount of the memory bandwidth required by this function unit should be minimized. The present invention meets this design objective by providing a novel methodology for storing data in and fetching data from a memory. Moreover, as will become readily apparent to a person skilled in the pertinent art, the methodology of the present invention has utility in any device or system that could benefit therefrom, the TriMedia processor being discussed herein by way of example only. In general, the present invention has utility in any system that includes a memory that is accessed in a manner that requires a first memory bandwidth if the data is stored and retrieved in the conventional way, but only requires a second memory bandwidth that is less than the first memory bandwidth if the data is stored and retrieved in accordance with the methodology of the present invention.
The present invention encompasses, in one of its aspects, a method for storing a block of data consisting of N rows and M columns, which includes the step of transposing the block of data by 90° to thereby produce a transposed block of data consisting of M rows and N columns, and the step of storing the transposed block of data. The transposed block of data is preferably retrieved by using one or more fetch commands, with the number of fetch commands required to retrieve the transposed block of data being less than the number of fetch commands required to retrieve the same data if stored in its original form. In a presently contemplated implementation, the block of data is a reference macroblock of decoded MPEG video data that is used in motion compensation operations, and each of the fetch commands is an A×B fetch command, where A represents the number of columns of data and B represents the number of rows of data to be fetched in response thereto, and wherein further, A > B.
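By way of illustration only, the effect of the transposition on the number of fetch commands can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from the foregoing description: the block dimensions (N = 16, M = 4) and the fetch dimensions (A = 16, B = 4) are hypothetical values chosen for the example, the transposition is modeled as a plain row-for-column matrix transpose, the block is assumed to be aligned with the fetch boundaries, and the helper names `transpose_block` and `fetches_needed` are hypothetical. The fetch count is simply the number of A×B tiles needed to cover the block.

```python
from math import ceil


def transpose_block(block):
    """Return the transpose of a rectangular block given as a list of rows."""
    return [list(column) for column in zip(*block)]


def fetches_needed(rows, cols, fetch_cols, fetch_rows):
    """Count the A x B fetches needed to cover a rows x cols block.

    Assumes the block starts on a fetch boundary (an illustrative
    simplification, not a property stated in the description).
    """
    return ceil(cols / fetch_cols) * ceil(rows / fetch_rows)


# Hypothetical dimensions chosen only for this example.
N, M = 16, 4   # original block: N rows, M columns
A, B = 16, 4   # each fetch returns A columns by B rows, with A > B

# Fill the block with distinct values so the transpose can be checked.
block = [[r * M + c for c in range(M)] for r in range(N)]
transposed = transpose_block(block)   # M rows, N columns

original_fetches = fetches_needed(N, M, A, B)     # block stored as-is
transposed_fetches = fetches_needed(M, N, A, B)   # block stored transposed

print(original_fetches, transposed_fetches)
```

Under these assumed dimensions, the tall, narrow block spans several fetch rows when stored in its original form, whereas the transposed block fits within a single wide fetch. Other dimensions, or a block that is not aligned with the fetch boundaries, would yield different counts.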
The present invention encompasses, in another of its aspects, a processor that implements the above-described method. In a presently contemplated implementation, the processor is a multimedia processor that includes a number of function units that are commonly coupled to a system bus that is coupled to a memory (e.g., an SDRAM) in which the transposed data is stored.