A multimedia system-on-chip (SoC) is an integrated circuit that combines all components of a computer or electronic multimedia system into a single chip. Conventional multimedia SoCs usually include hardware components for image capture, image processing, video compression/decompression, computer vision, graphics, and display processing. SoCs are very common in the mobile electronics market because of their low power consumption.
In conventional multimedia SoCs, the hardware components access and compete for the limited bandwidth available in the shared main memory, for example a double data rate (DDR) memory. Since each individual hardware component is computationally and memory-bandwidth intensive, it is difficult for the SoC to meet its various multimedia latency and throughput requirements. This problem will be described with reference to FIGS. 1-4.
FIG. 1 illustrates a conventional multimedia SoC 100.
As illustrated in the figure, multimedia SoC 100 includes a main system interconnect 102, a secondary interconnect 104, a plurality of processing components 106, a cache 116, a main memory 118, and a peripheral device 120. Plurality of processing components 106 further includes a main CPU 108, an audio component 110, a secondary CPU 112, and a video component 114.
Main system interconnect 102 is operable to act as the main communication bus of multimedia SoC 100. Main system interconnect 102 allows any one of secondary interconnect 104, main CPU 108, audio component 110, secondary CPU 112, video component 114, cache 116, or main memory 118 to communicate or transfer data to any other component connected to main system interconnect 102.
Secondary interconnect 104 is operable to communicate with peripheral device 120, via bi-directional line 122. Secondary interconnect 104 is additionally operable to communicate with main system interconnect 102, via bi-directional line 124.
Main CPU 108 is operable to communicate with all other components of multimedia SoC 100, via bi-directional line 126 and main system interconnect 102. Main CPU 108 is additionally operable to communicate with audio component 110, via bi-directional line 138. Main CPU 108 is yet further operable to process data from cache 116 or main memory 118.
Audio component 110 is operable to communicate with all other components of multimedia SoC 100, via bi-directional line 130 and main system interconnect 102. Audio component 110 is yet further operable to process audio data from cache 116 or main memory 118.
Secondary CPU 112 is operable to communicate with all other components of multimedia SoC 100, via bi-directional line 132 and main system interconnect 102.
Secondary CPU 112 is additionally operable to communicate with video component 114, via bi-directional line 134.
Video component 114 is operable to communicate with all other components of multimedia SoC 100, via bi-directional line 136 and main system interconnect 102. Video component 114 is yet further operable to process video data from cache 116 or main memory 118.
Cache 116 is operable to communicate with all other components of multimedia SoC 100, via bi-directional line 138 and main system interconnect 102. Cache 116 is operable to communicate with main memory 118, via bi-directional line 140. Cache 116 is a system level 3 (L3) memory that is additionally operable to store portions of the audio and video data that are stored on main memory 118.
Main memory 118 is operable to communicate with all other components of multimedia SoC 100, via bi-directional line 142 and main system interconnect 102. Main memory 118 is a random access memory (RAM) that is operable to store all audio and video data of multimedia SoC 100.
Peripheral device 120 is operable to communicate with all other components of multimedia SoC 100, via bi-directional line 122 and secondary interconnect 104. Peripheral device 120 is additionally operable to receive an input from a user to instruct conventional multimedia SoC 100 to process audio or video data. Peripheral device 120 is yet further operable to display audio or video data.
In conventional multimedia SoCs, each processing component fetches data from the main memory by injecting a unique traffic pattern. For example, an imaging component would predominantly fetch raster data of an image frame, while a codec engine would perform a mix of deterministic and random block-level fetches of a video frame. Fetching data from the main memory is a slow process for a processing component. As more processing components need to fetch data from the main memory, the traffic patterns become more complex, and it becomes increasingly difficult for the SoC to meet all of the latency and throughput requirements for a given use case.
To decrease latency and increase throughput, conventional multimedia SoCs use system level (L3) caching to decrease the number of times data needs to be accessed in the main memory. A cache is a small, specialized type of memory that is much smaller and faster than the RAM used in an SoC's main memory. A cache is used to store copies of data that is frequently accessed in the main memory. When a processing component of a conventional multimedia SoC needs to access data, it first checks the cache. If the cache contains the requested data (a cache hit), the data can quickly be read directly from its location in the cache, eliminating the need to fetch the data from the main memory. If the data is not in the cache (a cache miss), the data has to be fetched from the main memory.
When a cache miss occurs, the requested data is placed in the cache once it has been fetched from the main memory. Since caches are usually quite small, data needs to be evicted from the cache in order to make room for the new data. Generally, a least recently used (LRU) algorithm is used to determine which data should be evicted from the cache. Using an LRU algorithm, the data that has spent the longest time in the cache without being accessed is evicted, and the data fetched from the main memory on the cache miss is put in its place.
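The hit, miss, and LRU-eviction behavior described above can be sketched in Python. This is a toy model, not the design of any actual SoC cache; the class name, cache capacity, and addresses are all illustrative.

```python
from collections import OrderedDict

class LRUCache:
    """Toy system-level (L3) cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> data, oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, address, main_memory):
        if address in self.lines:            # cache hit: serve from the cache
            self.hits += 1
            self.lines.move_to_end(address)  # mark line most recently used
            return self.lines[address]
        self.misses += 1                     # cache miss: fetch from main memory
        data = main_memory[address]
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used line
        self.lines[address] = data           # cache the newly fetched data
        return data

# Illustrative main memory and a two-line cache.
memory = {addr: "data@%d" % addr for addr in range(8)}
cache = LRUCache(capacity=2)
cache.read(0, memory)   # miss: fetched from main memory, then cached
cache.read(1, memory)   # miss: cache now full
cache.read(0, memory)   # hit: address 0 becomes most recently used
cache.read(2, memory)   # miss: evicts address 1, the least recently used line
```

After the last read, address 1 has been evicted because it was the line that went longest without being accessed, while address 0 survives.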
Using an L3 cache does not necessarily decrease latency and increase throughput of a conventional multimedia SoC. Since an SoC has many processing components accessing a single cache, cross thrashing occurs. Cross thrashing occurs when too many processing components attempt to utilize an SoC's cache at the same time.
If a first processing component detects a cache miss, the cache is rewritten with the data that was retrieved from the main memory. When a second processing component checks the cache, another cache miss occurs, and the cache is again rewritten with new data from the main memory. When the first processing component checks the cache again, yet another cache miss occurs, because all of the data in the cache has been rewritten by the second processing component. In this manner, the cache is constantly being rewritten because each processing component detects a cache miss every time it checks the cache.
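This thrashing pattern can be reproduced with a small simulation. The working-set addresses and cache size below are hypothetical; the point is only that two components whose working sets each fit in the cache individually can still drive each other's hit rate to zero when interleaved.

```python
from collections import OrderedDict

def run_trace(trace, capacity):
    """Replay an address trace against a shared LRU cache; return the hit count."""
    cache, hits = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)          # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict the least recently used line
            cache[addr] = True               # cache the line fetched on the miss
    return hits

audio = [100, 101, 102, 103]  # hypothetical audio working set
video = [200, 201, 202, 203]  # hypothetical video working set

# One component alone fits in the cache: after a cold first pass, every access hits.
hits_alone = run_trace(audio * 3, capacity=4)

# Interleaved, each component's lines are evicted before it returns to them,
# so every access misses: this is cross thrashing.
mixed = [addr for pair in zip(audio, video) for addr in pair]
hits_interleaved = run_trace(mixed * 3, capacity=4)
```

The audio trace alone hits on 8 of 12 accesses, while the interleaved trace never hits at all.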
In operation, consider the situation where a user instructs peripheral device 120 to begin playing audio data. Peripheral device 120 will then instruct main CPU 108 that audio data needs to be played, via secondary interconnect 104 and main system interconnect 102. After receiving instructions from peripheral device 120 that audio data needs to be played, main CPU 108 instructs audio component 110 that it needs to begin processing audio data to be played, via bi-directional line 138. Audio component 110 informs main CPU 108 that it is ready to begin processing audio data, but that it can only process small portions of the audio data at a time.
At this point, main CPU 108 needs to locate the first small portion of audio data to be processed by audio component 110. Main CPU 108 first checks if the small portion of audio data to be processed is located in cache 116. Main CPU 108 finds that cache 116 does not contain the required audio data. Since cache 116 does not contain the required audio data, main CPU 108 then locates the audio data in main memory 118.
Main CPU 108 locates the first small portion of audio data to be processed as well as the rest of the audio data to be played. Main CPU 108 then writes all of the audio data to be processed to cache 116. After writing the audio data to cache 116, main CPU 108 transmits the first small portion of audio data to be processed to audio component 110.
Audio component 110 then processes the audio data, and transmits the processed data to peripheral device 120, via main system interconnect 102 and secondary interconnect 104. After transmitting the processed data to peripheral device 120, audio component 110 instructs main CPU 108 that it is ready to process the next portion of audio data. Main CPU 108 checks if the next small portion of audio data is located in cache 116. Main CPU 108 finds that cache 116 does contain the next small portion of audio data.
Main CPU 108 transmits the data from cache 116 to audio component 110 to be processed. Again, audio component 110 processes the audio data, transmits the processed data to peripheral device 120, and then instructs main CPU 108 that it is ready for the next small portion of audio data. Conventional multimedia SoC 100 will continue to operate in this manner until a later time.
At some time later, let the user instruct peripheral device 120 to begin playing video data in addition to the currently playing audio data. Peripheral device 120 will then instruct secondary CPU 112 that video data needs to be played, via secondary interconnect 104 and main system interconnect 102. After receiving instructions from peripheral device 120 that video data needs to be played, secondary CPU 112 instructs video component 114 that it needs to begin processing video data to be played, via bi-directional line 134. Video component 114 informs secondary CPU 112 that it is ready to begin processing video data, but that it can only process small portions of the video data at a time.
At this point, secondary CPU 112 needs to locate the first small portion of video data to be processed by video component 114. Secondary CPU 112 first checks if the small portion of video data to be processed is located in cache 116. Secondary CPU 112 finds that cache 116 contains audio data and not the required video data. Since cache 116 does not contain the required video data, secondary CPU 112 then locates the video data in main memory 118.
Secondary CPU 112 locates the first small portion of video data to be processed as well as the rest of the video data to be played. Secondary CPU 112 then writes all of the video data to be processed to cache 116. After writing the video data to cache 116, secondary CPU 112 transmits the first small portion of video data to be processed to video component 114.
Simultaneously, audio component 110 instructs main CPU 108 that it has finished processing a small portion of audio data and is ready to process the next portion of audio data. Main CPU 108 checks if the next small portion of audio data is located in cache 116. Main CPU 108 finds that cache 116 contains video data but not the required audio data.
Since main CPU 108 cannot find the required audio data in cache 116 it must find the audio data in main memory 118. Locating and fetching the required audio data from main memory 118 instead of cache 116 takes a long time and conventional multimedia SoC 100 is no longer able to meet the latency requirements for playing audio data. In order to meet the latency requirements for playing audio data, main CPU 108 rewrites cache 116 with the required audio data.
Next, video component 114 processes the video data from secondary CPU 112 and transmits the processed data to peripheral device 120, via main system interconnect 102 and secondary interconnect 104. After transmitting the processed video data to peripheral device 120, video component 114 instructs secondary CPU 112 that it is ready to process the next small portion of video data.
After being instructed that video component 114 is ready to process the next small portion of video data, secondary CPU 112 checks if the next small portion of video data is located in cache 116. Since cache 116 was just rewritten with audio data by main CPU 108, secondary CPU 112 does not find the required video data in cache 116.
Since secondary CPU 112 cannot find the required video data in cache 116 it must find the video data in main memory 118. Locating and fetching the required video data from main memory 118 instead of cache 116 takes a long time and conventional multimedia SoC 100 is no longer able to meet the latency requirements for playing video data. In order to meet the latency requirements for playing video data, secondary CPU 112 then rewrites cache 116 with the required video data.
At this point, cross thrashing continues to occur in conventional multimedia SoC 100. Main CPU 108 and secondary CPU 112 continually overwrite each other's data in cache 116. Since cache 116 is continually overwritten, main CPU 108 and secondary CPU 112 are forced to continually fetch data from main memory 118. Since main CPU 108 and secondary CPU 112 are continuously fetching data from main memory 118, conventional multimedia SoC 100 is unable to meet latency requirements.
Additionally, the amount of data that can be fetched from main memory 118 at any one time is limited. Due to cross thrashing, both main CPU 108 and secondary CPU 112 are forced to fetch data from main memory 118. Due to the limited bandwidth of main memory 118, one CPU may have to wait for the other to finish fetching data before it may begin fetching its own, further increasing latency as well as decreasing throughput.
One method of handling cross thrashing is to use a partitioned cache. In a partitioned cache, each processing component of a conventional multimedia SoC has a designated section of the cache to use. Cache partitioning reduces cross thrashing because each processing component is only able to rewrite its own designated section of the cache. However, cache partitioning requires a large cache, which is not feasible in conventional multimedia SoCs because fetching data from a cache becomes slower as the cache size increases.
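A partitioned cache can be sketched as follows. The component names and partition sizes are illustrative; the essential property is that each component may only evict lines from its own partition, so one component's traffic cannot displace another's.

```python
from collections import OrderedDict

class PartitionedCache:
    """Cache split into fixed per-component partitions, each with LRU eviction."""

    def __init__(self, partition_sizes):
        self.sizes = dict(partition_sizes)
        self.parts = {name: OrderedDict() for name in self.sizes}

    def read(self, component, addr):
        """Return True on a hit; on a miss, cache the line and return False."""
        part = self.parts[component]
        if addr in part:
            part.move_to_end(addr)           # refresh recency within the partition
            return True
        if len(part) >= self.sizes[component]:
            part.popitem(last=False)         # evict only from this partition
        part[addr] = True
        return False

cache = PartitionedCache({"audio": 2, "video": 2})
cache.read("audio", 100)        # miss: audio line cached in the audio partition
cache.read("video", 200)        # miss
cache.read("video", 201)        # miss: video partition now full
cache.read("video", 202)        # miss: video evicts its own LRU line (200) only
hit = cache.read("audio", 100)  # audio line survived the video traffic
```

Heavy video traffic evicts only video lines; the audio line cached earlier still hits. The trade-off noted above remains: supplying every component with a useful partition requires a total cache size that is impractically large and slow.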
Block coding is a technique used during video encoding to encode data in discrete chunks known as macroblocks (MB). An MB typically consists of an array of 4×4, 8×8, or 16×16 pixel samples that may be further subdivided into several different types of blocks to be used during the block decoding process. After each MB is encoded, it is stored in memory next to the previously encoded MB.
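Splitting a frame into macroblocks can be sketched as below. This is a simplified illustration using a 16×16 MB size and synthetic frame contents; it assumes the frame dimensions are exact multiples of the MB size, as codecs typically arrange by padding.

```python
def split_into_macroblocks(frame, mb_size=16):
    """Split a frame (a 2-D list of pixel samples) into mb_size x mb_size
    macroblocks in raster order. The frame dimensions are assumed to be
    exact multiples of mb_size."""
    height, width = len(frame), len(frame[0])
    blocks = []
    for y in range(0, height, mb_size):      # walk MB rows top to bottom
        for x in range(0, width, mb_size):   # walk MBs left to right
            blocks.append([row[x:x + mb_size] for row in frame[y:y + mb_size]])
    return blocks

# A 32x32 frame of synthetic samples yields four 16x16 macroblocks,
# stored consecutively in raster order like the encoded MBs described above.
frame = [[x + y for x in range(32)] for y in range(32)]
mbs = split_into_macroblocks(frame)
```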
A loop filter (LPF) is a filter that is applied to decoded compressed video to improve visual quality and prediction performance by smoothing the sharp edges which can form between MBs when block coding/decoding techniques are used. During block decoding, an LPF may access and decode each MB in order from the main memory and then use the MB to predict the next MB that will need to be fetched and decoded from the main memory.
Examples of processing components accessing data in the main memory of a conventional multimedia SoC will now be described with additional reference to FIGS. 2-3.
FIG. 2 illustrates a graph 200 of memory address distribution of an LPF for H264 video decoding at a granular level.
As illustrated in the figure, graph 200 includes a Y-axis 202, an X-axis 204, an MB block 206, an MB block 208, and an MB block 210. MB block 206 further includes MB 212, MB 214, and MB 216. MB block 208 further includes MB 218, MB 220, and MB 222. MB block 210 further includes MB 224, MB 226, and MB 228.
Y-axis 202 represents memory addresses. X-axis 204 represents time.
MB block 206 represents all of the individual MBs that are being accessed by an LPF from time t0 to time t1. MB block 208 represents all of the individual MBs that are being accessed by an LPF from time t1 to time t2. MB block 210 represents all of the individual MBs that are being accessed by an LPF beyond time t2.
MB 212, MB 214, MB 216, MB 218, MB 220, MB 222, MB 224, MB 226, and MB 228 each represent the range of addresses at which data is stored in the main memory for an MB being processed by an LPF.
Data for each of MB 212, MB 214, MB 216, MB 218, MB 220, MB 222, MB 224, MB 226, and MB 228 is stored in the main memory between address 0 and address 100. At time t0, an LPF is instructed to decode MB block 206. The LPF then begins decoding the first MB, which is MB 212.
While decoding, data needed by the LPF for MB 212 is fetched between address 20 and address 60, between time t0 and time t0+1. Next, the LPF uses MB 212 to predict that the next two MBs that need to be decoded are MB 214 and MB 216. The LPF then decodes MB 214 between time t0+1 and time t0+2, during which time data needed to decode MB 214 is accessed between address 20 and address 60. Finally, after decoding MB 214, the LPF decodes MB 216 in the same manner between time t0+2 and time t1, during which time data needed to decode MB 216 is accessed between address 20 and address 60.
At time t1, after finishing decoding each MB in MB block 206, the LPF is instructed to decode MB block 208 next. Again, the LPF begins by decoding the first MB of MB block 208, which is MB 218. The LPF decodes MB 218 between time t1 and time t1+1. While decoding, the LPF uses MB 218 to predict that the next two MBs that need to be decoded are MB 220 and MB 222. Next, the LPF decodes MB 220 between time t1+1 and time t1+2, and MB 222 between time t1+2 and time t2.
After MB 218, MB 220, and MB 222 are decoded, the LPF is instructed to decode the next MB, which is MB 224. As described above, the LPF uses MB 224 to predict which MBs need to be decoded next and then fetches all of the MBs at their addresses in the main memory. Once fetched, the LPF decodes the MBs and waits for further instructions.
Accessing each MB from the main memory is very bandwidth intensive. The bandwidth used increases when the LPF accesses multiple MBs from different locations in the main memory due to MB decoding predictions. Writing the entire MB block that contains the current MB to be decoded to a cache would reduce the main memory bandwidth being used. Using a cache, the LPF would only need to fetch an MB block once from the main memory, and could then fetch each individual MB from the cache.
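The saving can be illustrated with a toy fetch counter. The block contents and decode schedule are hypothetical, loosely mirroring the MBs of FIG. 2: caching a whole MB block costs one main-memory burst per block, instead of one main-memory fetch per individual MB.

```python
def count_memory_fetches(mb_blocks, schedule, use_block_cache):
    """Count main-memory fetches for a decode schedule of (block_id, mb_index)
    pairs. With the block cache, a whole MB block is fetched once and later
    MBs in that block are served from the cache."""
    cache = {}
    fetches = 0
    for block_id, mb_index in schedule:
        if not use_block_cache:
            fetches += 1                 # every MB fetched individually
            continue
        if block_id not in cache:
            cache.clear()                # small cache: holds one block at a time
            cache[block_id] = mb_blocks[block_id]
            fetches += 1                 # one burst fetch for the whole block
        _ = cache[block_id][mb_index]    # individual MB served from the cache
    return fetches

blocks = {0: ["MB212", "MB214", "MB216"], 1: ["MB218", "MB220", "MB222"]}
schedule = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
without_cache = count_memory_fetches(blocks, schedule, use_block_cache=False)
with_cache = count_memory_fetches(blocks, schedule, use_block_cache=True)
```

For this six-MB schedule, the block cache cuts main-memory fetches from six to two, one per MB block.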
Block motion compensation is an algorithmic technique used to predict a frame in a video from the previous and/or future frames by accounting for the motion of the camera and/or the objects in the video. Block motion compensation exploits the fact that, for many frames of a video, the only difference between two consecutive frames is camera movement or an object in the frame moving. Using motion compensation, a video stream contains some full frames to be used as references; the only information that needs to be stored between reference frames is the information needed to transform the previous frame into the next frame. Similar to block encoding/decoding, block motion compensation uses MBs that are 4×4, 8×8, or 16×16 pixels in size.
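A minimal sketch of the reconstruction step, assuming 4×4 blocks and a synthetic reference frame: rather than storing a block's pixels, the decoder copies the block that the motion vector points to in a reference frame and adds a residual correction (here all zeros, modeling a pure translation).

```python
def motion_compensate(reference, motion_vector, residual, x, y, size=4):
    """Reconstruct one size x size block at (x, y) by copying the block the
    motion vector (dx, dy) points to in the reference frame and adding the
    residual correction."""
    dx, dy = motion_vector
    return [
        [reference[y + dy + r][x + dx + c] + residual[r][c] for c in range(size)]
        for r in range(size)
    ]

# Synthetic 8x8 reference frame. The block at (0, 0) in the new frame is the
# reference block shifted two pixels right, so only the motion vector (2, 0)
# and an all-zero residual need to be stored, not the full block.
reference = [[row * 8 + col for col in range(8)] for row in range(8)]
zero_residual = [[0] * 4 for _ in range(4)]
block = motion_compensate(reference, (2, 0), zero_residual, 0, 0)
```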
FIG. 3 illustrates a graph 300 of memory address distribution for a motion compensator of an H264 decoder.
As illustrated in the figure, graph 300 includes a Y-axis 302, an X-axis 304, an MB 306, an MB 308, an MB 310, an MB 312, an MB 314, and an MB 316.
Y-axis 302 represents memory addresses. X-axis 304 represents time.
MB 306, MB 308, MB 310, MB 312, MB 314, and MB 316 each represent the range of addresses at which data is stored in memory for that particular MB.
In operation, at time t0, a motion compensator is instructed to compensate for movement between two frames. At this point, the motion compensator begins compensating for motion by processing MB 306. While processing MB 306, the motion compensator fetches data stored in memory between address 20 and address 70, between time t0 and time t1. After processing MB 306, the motion compensator needs to process MB 308 to continue compensating for motion between the two frames. The motion compensator processes MB 308 from time t1 to time t2, during which time data is fetched between memory address 30 and memory address 80.
The motion compensator continues by processing each of MB 310, MB 312, MB 314, and MB 316. As described above, the motion compensator processes each MB by fetching and processing data stored at the addresses within the boundaries of each MB.
Similar to the LPF of FIG. 2, a motion compensator accessing all of the MBs needed for motion compensation from the main memory is very bandwidth intensive. Writing each MB to a cache would reduce the main memory bandwidth used. Since individual MBs are used several times to transform one reference MB into another, a cache with an LRU policy could be used. Using an LRU policy would reduce cache misses by only evicting the MB that has gone the longest without being used.
Both the LPF of FIG. 2 and the motion compensator of FIG. 3 fetch different types of data from different sections of the main memory. The locality of data for different processing components will now be described with reference to FIG. 4.
FIG. 4 illustrates a graph 400 of a main memory address distribution for multiple processing components.
As illustrated in the figure, graph 400 includes a Y-axis 402, an X-axis 404, a memory address portion 406, a memory address portion 408, and a memory address portion 410.
Y-axis 402 represents memory addresses. X-axis 404 represents time.
Memory address portion 406 represents all of the addresses of data for use by an LPF. Memory address portion 408 represents all of the addresses of data for use by a motion compensator. Memory address portion 410 represents all of the addresses of data for use by an imaging component.
In operation, all data for use by an LPF is located at an address within memory address portion 406. When an LPF needs to fetch data from the main memory, it always fetches data from an address within memory address portion 406. An LPF will not fetch data from memory address portion 408 or memory address portion 410.
Similarly, all data for use by a motion compensator is located at an address within memory address portion 408 and all data for use by an imaging component is located at an address within memory address portion 410. A motion compensator will not fetch data from memory address portion 406 or memory address portion 410, and an imaging component will not fetch data from memory address portion 406 or memory address portion 408.
FIG. 4 may also be used to further illustrate cross thrashing of processing components in a conventional multimedia SoC. Since data is fetched from memory address portion 406, memory address portion 408, and memory address portion 410 by different processing components, cached data from one portion would continually be evicted to make room for newly fetched data from another. Multiple processing components fetching data from memory would continually overwrite the cache, creating cross thrashing.
A problem with the conventional system and method for fetching data on a multimedia SoC is that cross thrashing occurs when multiple processing components are using a single cache. Each time a cache miss occurs, the processing component that created the cache miss overwrites the cache with data fetched from the main memory. When another processing component attempts to access data in the cache, a cache miss occurs because the cache was just overwritten.
Multiple processing components using a single cache not only creates cross thrashing but also requires each component to fetch data from the main memory. Processing components continually fetching data from the main memory of a conventional multimedia SoC increases latency and decreases throughput.
Another problem with the conventional system and method for fetching data on a multimedia SoC is that partitioned caches are impractical. Conventional partitioned caches are very large, which is not beneficial to conventional multimedia SoCs since cache speed decreases as cache size increases. The size of conventional partitioned caches reduces cross thrashing, but due to their size and speed there is no benefit when compared to fetching data from the main memory.
Another problem with the conventional system and method for fetching data from the main memory of a multimedia SoC is the limited bandwidth of the main memory. As more components fetch data from the main memory, more bandwidth is taken up. Limited main memory bandwidth limits the amount of data that can be fetched from the main memory at one time which increases latency and decreases throughput.
What is needed is a system and method for using a cache with a multimedia SoC that eliminates cross thrashing without increasing latency or decreasing throughput.