Graphic and video rendering is quite challenging in the mobile phone environment. More and more mobile phones gain new capabilities, such as deeper colour schemes and the ability to display video objects. Besides the construction of the objects to display, another challenge is simply to combine and display them in the most efficient way.
The combination and display of graphical and video objects faces the same constraints as any other function in a mobile phone, namely:
Power consumption
Memory footprint
Memory bandwidth
The objects to combine are graphical or video objects, which can be viewed as frames or sub-frames. More generally, each can be treated as a contiguous 2-dimensional array containing pixel information. The physical arrangement of the pixel information is governed by several characteristics:
Colour Space:
YUV/YCbCr: Luminance and Chrominance information are separated, each with its own sampling rate
RGB: Three-primary colour system used for display
Colour Depth:
Each colour component is coded with a variable-length bit field, resulting in a specific data arrangement that must match the memory granularity: byte (8 bits), half-word (16 bits) or word (32 bits). Examples of such arrangements for the RGB colour space are RGB332, RGB444, RGB565, RGB666 and RGB888, where the three digits indicate the number of bits associated with the Red, Green and Blue channels respectively.
Chrominance Sampling:
This applies especially to the YUV/YCbCr colour space, since the Chrominance information represented by the U/Cb and V/Cr channels can have a different sampling rate from the Luminance information represented by the Y channel. Examples of sampling configurations are: 4:4:4 (each pixel has its own Luminance and Chrominance information), 4:2:2 (each horizontal pair of pixels shares the Chrominance information), 4:2:0 (each quadruple formed by a horizontal pair of pixels on two adjacent lines shares the Chrominance information) and 4:1:1 (each horizontal quadruple of pixels shares the Chrominance information).
Memory Alignment Constraint:
Although a bit field can have an arbitrary length, a computer, and more generally any processing unit, accesses memory with a fixed granularity. The stored information length therefore becomes a power-of-2 multiple of a byte. Examples: byte (1), half-word (2), word (4), quad-word (8) and so forth.
Finally, it is nearly impossible to describe exhaustively all possible representations of the colour information of graphical or video objects. Nevertheless, they all share the same framework: each can be represented by a 2D array of pixel colour information.
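As an illustration of the colour depth and memory alignment characteristics, the following minimal sketch computes the storage container for the RGB arrangements named above and packs one pixel into RGB565. The function names are hypothetical and chosen for this example only. The container rule follows the power-of-2 statement above; packed 3-byte layouts for RGB888 also exist in practice and are not modelled here.

```python
def container_bytes(bits_per_pixel):
    """Smallest power-of-2 number of bytes able to hold one pixel."""
    size = 1
    while size * 8 < bits_per_pixel:
        size *= 2
    return size


def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B components into one 16-bit RGB565 half-word."""
    # Keep the 5, 6 and 5 most significant bits of R, G and B respectively.
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def unpack_rgb565(value):
    """Expand a 16-bit RGB565 half-word to approximate 8-bit components."""
    r5 = (value >> 11) & 0x1F
    g6 = (value >> 5) & 0x3F
    b5 = value & 0x1F
    # Replicate the top bits into the low bits so full scale maps to 255.
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))


if __name__ == "__main__":
    formats = [("RGB332", 8), ("RGB444", 12), ("RGB565", 16),
               ("RGB666", 18), ("RGB888", 24)]
    for name, bits in formats:
        print(name, bits, "bits ->", container_bytes(bits), "byte(s)")
    print(hex(pack_rgb565(255, 0, 0)))
```

Under this rule RGB444 occupies a half-word and RGB666 a full word, so part of each container is padding.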
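The effect of the chrominance sampling schemes listed above on the size of such a 2D array can be sketched as follows. This is an illustrative calculation only, assuming 8 bits per sample and frame dimensions that are divisible by the subsampling factors; the table and function names are hypothetical.

```python
# (horizontal, vertical) chrominance subsampling factors for each scheme.
CHROMA_SUBSAMPLING = {
    "4:4:4": (1, 1),
    "4:2:2": (2, 1),
    "4:2:0": (2, 2),
    "4:1:1": (4, 1),
}


def yuv_frame_bytes(width, height, scheme):
    """Bytes needed for one YUV frame at 8 bits per sample."""
    h_factor, v_factor = CHROMA_SUBSAMPLING[scheme]
    luma = width * height
    # Two chrominance planes (U/Cb and V/Cr), each subsampled.
    chroma = 2 * (width // h_factor) * (height // v_factor)
    return luma + chroma


if __name__ == "__main__":
    for scheme in CHROMA_SUBSAMPLING:
        print(scheme, yuv_frame_bytes(320, 240, scheme), "bytes")
```

For the same frame size, 4:2:0 and 4:1:1 need half the storage of 4:4:4, which is one reason subsampled formats are attractive when memory is scarce.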
The combination of graphic or video objects can be described as the geometric operations and information conversions applied to a series of objects in order to merge them into a new graphic or video object. An example of such a process is the following:
A picture is captured from a camera sensor at a specific frame rate and a predefined resolution, using a 4:2:0 YUV colour representation. A window of interest inside this picture can be selected, decimated (shrunk) and colour converted into an RGB565 colour representation; the scaled size may actually match the final display screen size.
A processing unit builds a frame using an RGB565 colour representation. The constructed frame is scaled and colour converted to a size and colour depth that may match those of the display screen.
Another processing unit builds a frame through a compressed video decoding task. The decoded frame is scaled and colour converted to a size and colour depth that may match those of the display screen.
All the former objects can be stacked in an arbitrary order, and each layer can be associated with a specific transparency value ranging from 0% (opaque) to 100% (transparent).
The final construction is sent to a display unit, which makes it visible.
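The stacking step of the example above can be sketched for a single pixel as a linear blend. This is one common way to apply a per-layer transparency, shown here as an assumption and not as the method of any particular device. It uses the convention stated above, where 0% means opaque and 100% means transparent; the function names are hypothetical.

```python
def blend_pixel(below, above, transparency):
    """Blend one layer's pixel over the pixel below it.

    transparency is a fraction: 0.0 is opaque, 1.0 is fully transparent.
    """
    opacity = 1.0 - transparency
    return tuple(round(opacity * a + transparency * b)
                 for a, b in zip(above, below))


def composite(layers):
    """Combine a bottom-to-top list of (pixel, transparency) pairs."""
    # The bottom layer is taken as-is; its transparency is ignored.
    result = layers[0][0]
    for pixel, transparency in layers[1:]:
        result = blend_pixel(result, pixel, transparency)
    return result


if __name__ == "__main__":
    stack = [((0, 100, 200), 0.0),    # bottom layer
             ((200, 100, 0), 0.5)]    # top layer, 50% transparent
    print(composite(stack))
```

A real combination unit would apply the same operation to every pixel of the overlapping region, after scaling and colour conversion have brought all layers to a common format.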
Combination of objects can be quite complex, and not only because of heterogeneous colour space and resolution representations. The various objects can be produced on different time bases, so that their respective representations are not available at the same instant. These constraints force the use of temporary buffers to hold the representations of objects in order to combine them at the appropriate time.
Going further in the combination process, one can have intermediate steps of combination. Suppose, for example, that there are N objects to combine. The set of N objects can be partitioned into groups of objects (say of sizes I, J and K, whose sum equals N); each group can be combined separately, and the respective results further combined in a final combination process. Such a hierarchical combination creates intermediate object representations, which are written by the partial combinations and then read by the final combination process. On the one hand, the hierarchical combination process has the advantage of producing simpler tasks to execute. On the other hand, the intermediate object representations have a drawback: they consume memory to hold the information and require memory bandwidth to store and retrieve the data. This can create a strong penalty when designing products for the mobile market, where power, memory size and memory bandwidth are scarce resources.
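The cost of the intermediate representations can be estimated with a short calculation. The sketch below rests on assumptions that are not stated in the text: each group produces one full-size intermediate frame, each intermediate frame is written once and read once per output frame, and the 30 frames per second used in the usage lines is an illustrative value. The function name is hypothetical.

```python
def intermediate_cost(num_groups, width, height, bytes_per_pixel,
                      frames_per_second):
    """Return (extra memory in bytes, extra traffic in bytes per second)."""
    frame_bytes = width * height * bytes_per_pixel
    # One full-size intermediate buffer per group (assumption).
    memory = num_groups * frame_bytes
    # Each intermediate buffer is written once and read once per frame.
    traffic = 2 * memory * frames_per_second
    return memory, traffic


if __name__ == "__main__":
    # Three groups (I, J, K), QVGA frames in RGB565, 30 frames per second.
    memory, traffic = intermediate_cost(3, 320, 240, 2, 30)
    print(memory, "bytes of intermediate buffers")
    print(traffic, "bytes per second of extra memory traffic")
```

Both figures come on top of the memory and bandwidth already needed for the source objects and the final combined frame.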
While the hierarchical combination of graphic or video objects simplifies a complex combination process by dividing it into simpler operations, it nevertheless results in potential bottlenecks around the memory resources. FIG. 1 illustrates the rendering process of a mobile phone used for camera preview. A sensor 110 provides an image to an image processing block 120; the result is then processed by a graphic processor 140 to generate image frames, which are forwarded to a display engine 160 and displayed on a display 170. The architecture is based on a centralized memory 100 and a central processor 130, and all blocks communicate with each other via that same memory. The advantage of this known architecture is its great flexibility, since each processing unit within the display pipeline may have its own frame rate. The clear drawback is that such an architecture becomes prohibitive as the size of the memory increases to host all intermediate data structures. In addition, the bandwidth demand on the memory is significantly increased, since every process requests access to that memory. This is the case, for instance, for the exchanges between graphic processor 140 and display engine 160, which access the memory through the requests illustrated by reference 150 in the figure.
In order to solve the issue created by the intermediate products of a hierarchical combination, an immediate approach that improves the situation and reduces the accesses to memory 100 is to create a direct path between the different units in the combination chain, hereinafter referred to as producers and consumers of objects. This is the aim of the streaming technique shown in FIG. 2. Sensor 110, memory 100, processor 130 and display 170 remain unchanged and keep their reference numbers. The streaming architecture is based on a synchronous pipeline comprising an image processing unit 220, a graphic processor 240 and a display engine 260, which communicate via a direct communication link, as shown by arrow 250. This link does not use the central memory 100, since units 220, 240 and 260 each include their own minimum internal storage.
This streaming architecture has the advantage of reducing the size of the external memory, and it also achieves a faster and deterministic processing chain.
However, the clear drawback results from the synchronous pipeline, which prohibits the use of such an architecture in some situations and does not allow any access to intermediate data.
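The producer and consumer chain of the streaming technique can be loosely modelled in software with chained generators, where each stage hands one line at a time to the next and no full frame is ever stored. This is only an analogy under stated assumptions: the stage functions, the line-level granularity and the dummy pixel operations are hypothetical and do not describe the hardware units of FIG. 2.

```python
def sensor(width, height):
    """Producer: emit one line of dummy pixel values at a time."""
    for y in range(height):
        yield [y] * width


def image_processing(lines):
    """Consumer and producer: transform each line as it arrives."""
    for line in lines:
        yield [pixel + 1 for pixel in line]


def graphic_processor(lines):
    """Consumer and producer: second transformation stage."""
    for line in lines:
        yield [pixel * 2 for pixel in line]


def display_engine(lines):
    """Final consumer: count lines and accumulate a checksum."""
    line_count = 0
    checksum = 0
    for line in lines:
        line_count += 1
        checksum += sum(line)
    return line_count, checksum


if __name__ == "__main__":
    pipeline = graphic_processor(image_processing(sensor(4, 3)))
    print(display_engine(pipeline))
```

Because each stage pulls data only when the next stage asks for it, all stages advance in lockstep, which mirrors both the benefit (no intermediate frame buffer) and the limitation (no independent frame rates, no access to intermediate frames) of a synchronous pipeline.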
FIG. 3 illustrates a halfway solution showing the considered example of the camera preview process between a sensor 310, an image processing unit 320, a first local memory 351 (located within unit 320 for instance), a graphic processor 340, a second local memory 352 (located within unit 320 or 330 for instance), a display engine 360 and a display 370.
Unfortunately, this approach is not always possible because of the size of the objects that must be held for intermediate combination processing. As an example, a QVGA (320×240) frame of RGB565 colour depth requires about 150 Kbytes of local memory. This amount becomes 600 Kbytes for a VGA (640×480) resolution at the same colour depth. Such a buffer size can be viewed as quite modest when compared to a standard Personal Computer memory configuration; nevertheless, it translates into a large silicon area, which increases Integrated Circuit size and makes the circuits uncompetitive in a mass market such as mobile phone integrated devices.
Finally, an added constraint comes from the Software structure that controls the combination process. Although the first block diagram representation is likely the worst solution to implement, it is the one that Software developers prefer, since it offers maximum flexibility. This is the concept of unified memory, where any section of memory is viewed in a continuous address space. The software programmer creates full-size object placeholders in memory and allocates them to producer and consumer agents as desired, regardless of the memory congestion this can potentially create.
The technical problem to solve is to create a mechanism that offers maximum software flexibility while keeping the local and external memory size and bandwidth to the bare minimum at equivalent functionality.
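The buffer sizes quoted above follow directly from the frame dimensions and the 16 bits per pixel of RGB565. A minimal check is given below, taking VGA as 640×480, which is the resolution consistent with the 600 Kbytes figure, and counting 1 Kbyte as 1024 bytes; the function name is hypothetical.

```python
def frame_buffer_kbytes(width, height, bits_per_pixel):
    """Size of one full frame in Kbytes (1 Kbyte = 1024 bytes)."""
    return width * height * bits_per_pixel / 8 / 1024


if __name__ == "__main__":
    print("QVGA RGB565:", frame_buffer_kbytes(320, 240, 16), "Kbytes")
    print("VGA RGB565:", frame_buffer_kbytes(640, 480, 16), "Kbytes")
```

Doubling both dimensions quadruples the buffer, which is why local memories sized for QVGA do not scale to VGA without a large increase in area.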