Mobile video is a distribution means that is becoming more and more popular with emerging services, such as e.g. mobile TV and video streaming. However, in order to be able to send video over a wireless network, the video has to be encoded using lossy compression, often at a high compression rate.
Compared to the quality that is usually achieved when media content is distributed via a fixed distribution network, such as for fixed TV distribution, the visual quality tends to be lower for mobile video distribution. To a large extent this is due to the much lower transmission bit rates that are used for mobile video distribution.
Mobile video distribution involves transmission of media content to one or more mobile media clients. Before media content is encoded at a media server, necessary pre-processing, including steps such as colour format conversion, video format conversion and/or frame rate conversion, may be, and usually is, executed in order to improve the quality of the media content after it has been decoded by a video codec at the media client.
Most video codecs used today, such as MPEG-4, H.263 and H.264, use block-based coding where a transform is applied on a per block basis.
A macro block is a term commonly used in association with video compression and refers to a block unit having a size of 16×16 pixels. The block sizes used for a transform, such as the 2-dimensional Discrete Cosine Transform (DCT), are different for different codecs, and, thus, macro blocks are typically subdivided further into smaller blocks, such as into blocks consisting of 8×8 or 4×4 pixels. By way of example, MPEG-4 and H.263 use an 8×8 pixel block size, while H.264 uses a 4×4 pixel block size.
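The subdivision of a macro block can be illustrated with a minimal Python sketch. It assumes that a macro block is held as a list of 16 rows of 16 pixel values; the function name split_macroblock and the sample data are hypothetical and are not taken from any particular codec.

```python
def split_macroblock(macroblock, size):
    """Split a square macro block (list of rows) into size x size sub-blocks.

    Sub-blocks are returned in raster order (left to right, top to bottom).
    """
    n = len(macroblock)
    if n % size != 0 or any(len(row) != n for row in macroblock):
        raise ValueError("macro block must be square and divisible by size")
    blocks = []
    for top in range(0, n, size):
        for left in range(0, n, size):
            blocks.append([row[left:left + size]
                           for row in macroblock[top:top + size]])
    return blocks


# Hypothetical 16x16 macro block where each pixel value encodes its position.
macroblock = [[16 * r + c for c in range(16)] for r in range(16)]

blocks_8x8 = split_macroblock(macroblock, 8)  # block size of MPEG-4 and H.263
blocks_4x4 = split_macroblock(macroblock, 4)  # block size of H.264
```

In this sketch a macro block yields four 8×8 blocks or sixteen 4×4 blocks, each of which would then be transformed separately.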
Instead of sending pixel values over the network, coefficients obtained from a used transformation are sent for the respective blocks from the media server.
A 2-dimensional DCT transform is separable, which means that each 2-dimensional basis function is obtained by multiplying a 1-dimensional horizontal basis function with a 1-dimensional vertical basis function. For an 8×8 pixel block there are 64 basis functions, where the horizontal frequencies increase from left to right, while the vertical frequencies increase from top to bottom.
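As an illustration of the separability, the following sketch builds each 2-dimensional basis function as the product of two 1-dimensional cosine basis functions. It assumes the orthonormal DCT-II definition that is commonly used for 8×8 blocks; the names basis_1d, basis_2d and inner_product are chosen for this example only.

```python
import math

N = 8  # block size assumed for this sketch


def basis_1d(k, n, size=N):
    """Value at sample n of the 1-dimensional DCT-II basis function of frequency k."""
    scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
    return scale * math.cos((2 * n + 1) * k * math.pi / (2 * size))


def basis_2d(u, v, size=N):
    """2-D basis function for horizontal frequency u and vertical frequency v.

    The result is indexed as [y][x] and is simply the product of a vertical
    and a horizontal 1-D basis function.
    """
    return [[basis_1d(v, y, size) * basis_1d(u, x, size) for x in range(size)]
            for y in range(size)]


def inner_product(a, b):
    """Sum of element-wise products of two equally sized 2-D patterns."""
    return sum(pa * pb for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


# All 64 basis functions of an 8x8 block, keyed by (u, v).
all_basis = {(u, v): basis_2d(u, v) for u in range(N) for v in range(N)}
```

With this normalisation the basis functions are mutually orthogonal and of unit length, which is what allows a block to be described by one coefficient per basis function.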
Except for rounding errors, no information is destroyed in this type of transformation. For an 8×8 pixel block, 64 pixel values are transformed into 64 DCT coefficients. By way of example, H.264 uses a 4×4 DCT-like transform, where 16 pixels are transformed into 16 DCT coefficients.
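That the transformation itself is lossless, apart from rounding, can be checked with a small sketch that transforms a block and transforms it back. The sketch assumes a floating-point orthonormal DCT-II; it does not reproduce the integer transform that H.264 actually specifies, and the pixel values are made up for illustration.

```python
import math


def dct_matrix(size):
    """Orthonormal DCT-II matrix C, so that coefficients = C * block * C^T."""
    matrix = []
    for k in range(size):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
        matrix.append([scale * math.cos((2 * n + 1) * k * math.pi / (2 * size))
                       for n in range(size)])
    return matrix


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct_2d(block):
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct_2d(coeffs):
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


# Hypothetical 8x8 block of pixel values.
pixels = [[(7 * x + 13 * y) % 256 for x in range(8)] for y in range(8)]

coeffs = dct_2d(pixels)          # 64 pixel values -> 64 coefficients
restored = idct_2d(coeffs)       # back to pixel values
max_error = max(abs(p - r) for row_p, row_r in zip(pixels, restored)
                for p, r in zip(row_p, row_r))

# The same functions applied to a 4x4 block: 16 pixels -> 16 coefficients.
coeffs_4x4 = dct_2d([row[:4] for row in pixels[:4]])
```

The remaining difference max_error is due to floating-point rounding only; loss of information is introduced first when the coefficients are rounded or quantised.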
One of the first steps to be executed in the encoder of a video enabled media server is to execute a DCT transformation, where the result of the transform from pixel values is rounded to integers. After such a DCT transformation has been executed, the energy will be efficiently concentrated, but at this stage there are still many coefficients that have to be coded. An example of such a DCT transformation is illustrated in FIG. 1a, where 8×8 pixels, illustrated with the left matrix 100, are transformed into the corresponding 8×8 DCT coefficients, illustrated with the matrix 101 to the right.
One of the major bit savings in lossy video compression comes from quantization of the transform coefficients, which is typically executed next. However, as the quantization step size increases, the accuracy of the decoded transform coefficients decreases, which typically results in a quality degradation that is visible to end-users when the video is displayed on a video enabled user device/media client.
In FIG. 1b, an exemplary quantisation of the DCT coefficients obtained after the DCT transformation of FIG. 1a is presented, where the DCT coefficients are shown in the left matrix 101, while the resulting quantised coefficients are shown in the matrix 102 to the right of FIG. 1b. In this example the DCT coefficients have been divided by 10, and, thus, only 10 coefficients will have to be transmitted from the media server. However, while the amount of data that has to be sent has been reduced considerably, encoding artefacts have also been introduced to the media content as a result of the described process.
After the images forming the media content have been compressed by way of quantisation, a plurality of images, typically 10 to 30 images per second for video streaming, will have to be sent, in order to be able to provide the media content as video that can be rendered by a media client.
However, relatively often a large number of the images will have similar content, e.g. in situations where the background is exactly the same for two or more successive images.
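The effect of dividing the coefficients by a step size of 10 can be sketched as follows. The coefficient values below are hypothetical and are not those of FIG. 1b; a 4×4 matrix is used only to keep the example short, and rounding to the nearest integer is an assumption about how the division is finalised.

```python
def quantize(coeffs, step):
    """Divide each coefficient by the step size and round to an integer."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


def dequantize(levels, step):
    """Reconstruct approximate coefficients at the decoder."""
    return [[level * step for level in row] for row in levels]


# Hypothetical DCT coefficients with the energy concentrated in the top-left corner.
coeffs = [
    [312, -47, 12, 3],
    [-58, 21, -6, 2],
    [9, -4, 3, -1],
    [4, 2, -1, 0],
]
STEP = 10

levels = quantize(coeffs, STEP)
nonzero = sum(1 for row in levels for level in row if level != 0)

reconstructed = dequantize(levels, STEP)
max_error = max(abs(c - r) for row_c, row_r in zip(coeffs, reconstructed)
                for c, r in zip(row_c, row_r))
```

Only the non-zero levels need to be transmitted, which is where the bit saving comes from, while max_error, which is bounded by half the step size, is the source of the encoding artefacts.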
FIG. 2 illustrates an example of how the required bandwidth may be reduced even more, by making use of the fact that content that has already been encoded can be used also for encoding of subsequent blocks, by way of executing motion compensation.
In FIG. 2, a first series of images 200-203 represents an original video sequence, showing a figure that is moving to the right in front of a background that remains the same throughout the whole series. The series of frames to be encoded on the basis of the media content of images 200-203, before it is sent to a media client, is illustrated with frames 204-207.
A sequence of images normally starts with a first frame 204, where the complete image, i.e. the information of image 200, is being encoded, e.g. according to the encoding principles described above. This information is transmitted in a frame, which is typically referred to as an intra frame, or an I-frame.
In a second image 201, the figure is similar to the one of image 200, but it has moved to the right, towards the middle of the image. Therefore, instead of coding and sending all information about image 201, only the information about the movement between the images, i.e. the difference between the present image 201 and the previous image 200, will be encoded and sent in a next frame 205.
In a corresponding way the difference between images 201 and 202, as illustrated with frame 206, is identified, encoded and sent next, instead of sending the complete content of image 202. These types of frames are typically referred to as predicted frames, or P-frames.
In order to reduce the risk of losing information during distribution, e.g. due to packet loss, and to be able to smoothly switch channels, another I-frame will be sent every now and then, and, thus, after a number of P-frames, 205 and 206, have been sent in the given example, the information of image 203 is transmitted in a subsequent I-frame 207.
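The alternation between I-frames and P-frames can be sketched as follows. This is a simplified illustration that assumes plain pixel differences between successive images rather than a real motion-compensated prediction; the images are hypothetical one-dimensional rows of pixels, and the function names and the i_frame_interval parameter are chosen for this example only.

```python
def encode_sequence(images, i_frame_interval):
    """Encode images as ("I", pixels) or ("P", difference to previous image)."""
    frames = []
    previous = None
    for index, image in enumerate(images):
        if index % i_frame_interval == 0:
            frames.append(("I", list(image)))  # complete image
        else:
            frames.append(("P", [cur - prev for cur, prev in zip(image, previous)]))
        previous = image
    return frames


def decode_sequence(frames):
    """Rebuild the images; a P-frame is added onto the previously decoded image."""
    images = []
    for kind, payload in frames:
        if kind == "I":
            images.append(list(payload))
        else:
            images.append([p + d for p, d in zip(images[-1], payload)])
    return images


# A figure (value 9) moving to the right in front of a constant background (value 1).
images = [
    [9, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 9, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 9, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 9, 1],
]

frames = encode_sequence(images, i_frame_interval=3)
frame_types = [kind for kind, _ in frames]
decoded = decode_sequence(frames)
```

With an interval of three, the four images are sent as one I-frame, two P-frames and a further I-frame, mirroring frames 204-207. Since the background does not change, each P-frame in this sketch is zero everywhere except where the figure has moved.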
Media content comprising blocks with high frequencies, i.e. blocks with high contrast, e.g. where the luminance of different pixels varies a lot from high to low luminance, often needs to be encoded with many bits, i.e. with a low quantization, in order to achieve good visual quality for the reproduced video. One example where high frequencies are usually present is when a video comprises text, or any other similar type of graphical information that has been applied on the video, which usually tends to have sharp transitions between high and low luminance values when shown together with the images forming the video. This is a reason why video that includes graphical information as an overlay often does not look that good when presented to a user at relatively low qualities, as is usually the case for mobile video applications.
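Why sharp transitions are expensive to encode can be illustrated by counting the quantised coefficients that remain non-zero for two hypothetical 8×8 blocks, one uniform and one containing a sharp vertical edge of the kind found at the border of a text character. The sketch assumes a floating-point orthonormal DCT-II and a quantisation step size of 10; the count of non-zero coefficients is used only as a rough stand-in for the number of bits.

```python
import math


def dct_2d(block):
    """Direct evaluation of the orthonormal 2-D DCT-II of a square block."""
    size = len(block)

    def scale(k):
        return math.sqrt((1.0 if k == 0 else 2.0) / size)

    coeffs = []
    for v in range(size):          # vertical frequency
        row = []
        for u in range(size):      # horizontal frequency
            total = 0.0
            for y in range(size):
                for x in range(size):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * size))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * size)))
            row.append(scale(u) * scale(v) * total)
        coeffs.append(row)
    return coeffs


def nonzero_after_quantization(block, step):
    """Number of coefficients that are still non-zero after quantisation."""
    return sum(1 for row in dct_2d(block) for c in row if round(c / step) != 0)


flat_block = [[128] * 8 for _ in range(8)]             # uniform background
edge_block = [[0] * 4 + [255] * 4 for _ in range(8)]   # dark-to-bright transition

flat_cost = nonzero_after_quantization(flat_block, step=10)
edge_cost = nonzero_after_quantization(edge_block, step=10)
```

The uniform block is fully described by its single lowest-frequency coefficient, whereas the block with the edge retains several higher-frequency coefficients that must also be transmitted if the edge is to remain sharp.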
One way of trying to reduce this problem is to send the graphical information separated from the video content, and to later apply the separated graphical information as an overlay after the video has been decoded at a client. Such a process is commonly used for digital broadcasted TV applications.
Graphical information, as described in this document, typically includes, but is not limited to, sub-titles and other text information, logotypes, graphics presented in news programs, or score boards presented in sport events, which appear as an overlay on the video when presented to the end-users.
There are also other solutions known from fixed TV distribution where underlying text has been smudged in order to make a text overlay more visible.
There are a number of known methods that can be used for detecting and extracting text from media content, such as e.g. images and/or video.
U.S. Pat. No. 6,937,766 refers to a method for detecting, extracting and indexing text in video. The method can be applied e.g. to static text, scrolling text, overlay text, as well as in-scene text.
WO/2008/003095 relates to a method for extracting text from images for the purpose of searching in a text of a media content that comprises images, as well as in text in videos.
JP2005235220 suggests another method which is adapted to detect subtitles in a video, while EP0720114 refers to a method for detecting text caption in a video.
All of the documents cited above suggest different methods for detecting and/or extracting text and/or graphical information in media content comprising a series of images. The suggested methods do, however, fail to discuss or suggest any way of handling artefacts of a distributed video, which will most likely appear in the vicinity of graphical information, when a video comprising text and/or graphics is reproduced and displayed at a video client.
Sending graphical information separated from an encoded video is a commonly known and preferred way of transmitting video that includes graphical information over a narrow bandwidth channel. Separation of graphical information from the images normally requires that the graphical information is stored separately from the video content at the media source. Separating the graphical information from the video is, however, not always possible, since the provider of the video content does not always have full control of the graphical information.
As can be understood from the documents referred to above, there are a number of ways of extracting graphical information from media content, using various image processing techniques. However, even if the extracted graphical information is transmitted separately from a media source/media server to a media client and added to the decoded video as an overlay at the media client, as suggested above, coding artefacts may, and most likely will, still be visible around the graphical information when the video is rendered at the media client. This phenomenon is typical, not only for video that comprises letters of a subtitle in an overlay, as a result of encoding the underlying graphical information, but also for other types of media content that involve distribution of one or more images.
In order to be able to transmit media content comprising some kind of overlay graphical information over a communication network there are principally three different scenarios to choose from.
According to a first scenario, which will now be presented with reference to FIG. 3, graphical information is included in the media content already at the media source. The graphical information is encoded together with the media content at a media server 300 that is controlled by the operator, before it is transmitted to a media client 301, such as e.g. a cellular telephone, a laptop or a set top box, via a communication network 302.
In a first step 3:1, media content to be delivered to media client 301 is retrieved either from an external media source (not shown), e.g. if the media content refers to streamed video, or from an internal or external memory means (not shown), e.g. if the media content instead comprises stored content.
In a next step 3:2, the media content, including graphical content, is encoded, using any conventional codec. The encoded content is then transmitted, typically by way of broadcasting the content over a communication network 302, such as e.g. a mobile communication network, to one or more media clients that are tuned to the respective channel. This is indicated with a subsequent step 3:3. At the media client 301, the media content is received in a subsequent step 3:4, after which the content is decoded in a next step 3:5, and displayed via any conventional displaying means, in a final step 3:6.
Although the method described above is easy to implement, it is not recommended for distribution of media content to media clients at low bit rates, since graphics, such as text, tend to be hard to read under these circumstances.
According to a second, alternative scenario, which will now be described with reference to the flow chart of FIG. 4, graphical information of media content is instead separated from the media content at a media server 400 and can then be sent from the media server 400 to a media client 401 separated from the encoded media content. At the media client 401, the graphical information is then added as an overlay to the encoded and transmitted media content, after decoding.
According to FIG. 4, graphical content that is provided together with other media content has already been separated from the media content at the media source, and, thus, in a first step 4:1, the general media content is retrieved from a media source, while the graphical content is retrieved in another step 4:2, after which the graphical content is transmitted to media client 401 in another step 4:3, and received by the media client 401 in a next step 4:4.
Alternatively, the graphical content may also be encoded in step 4:3, or even prior to that step, and sent as compressed content over the network 202. In such a case, the graphical content is also decoded in step 4:4, or in a step subsequent to step 4:4. Scalable Vector Graphics (SVG) is the primary compression method to be used for encoding extracted graphical information, where video coding can be seen as one possible alternative amongst others. If it is known that the graphical information is text, and the font, size and position of the text are also known, the text may alternatively be interpreted and sent as ASCII symbols. The procedures used for these particular aspects may be based on any conventional technique, and will therefore not be discussed in any further detail in this document.
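Sending text as ASCII symbols together with its font, size and position can be sketched as follows. The message layout, the field names and the use of JSON as a container are assumptions made purely for illustration; no particular wire format is prescribed here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TextOverlay:
    """Hypothetical description of a text overlay to be rendered at the client."""
    text: str      # the characters themselves, restricted to ASCII
    font: str      # font name assumed to be available at the client
    size_px: int   # text height in pixels
    x: int         # horizontal position of the text within the image
    y: int         # vertical position of the text within the image


def encode_overlay(overlay):
    """Serialise the overlay to bytes; raises UnicodeEncodeError for non-ASCII text."""
    overlay.text.encode("ascii")
    return json.dumps(asdict(overlay), separators=(",", ":")).encode("ascii")


def decode_overlay(payload):
    """Rebuild the overlay description at the media client."""
    return TextOverlay(**json.loads(payload.decode("ascii")))


overlay = TextOverlay(text="HOME 2 - 1 AWAY", font="Arial", size_px=16, x=8, y=4)
payload = encode_overlay(overlay)
received = decode_overlay(payload)
```

In this sketch the payload amounts to a few tens of bytes, and the client renders the text itself, so no pixel data for the text needs to be transmitted.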
The media content, comprising graphical content, is encoded, as indicated in a subsequent step 4:5, and the encoded media content is transmitted to media client 401 in another step 4:6. At media client 401, the encoded content is received, as indicated in a next step 4:7, and decoded, as indicated in a subsequent step 4:8.
In another step 4:9, the graphical information received in step 4:4 is added as an overlay to the decoded media content, and the media content can then be displayed to a user, as indicated with a final step 4:10.
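The addition of the overlay in step 4:9 can be sketched as a simple replacement of pixels in the decoded image. The sketch assumes that the graphical information has been received as a small pixel matrix in which None marks transparent positions; this representation and the function name apply_overlay are assumptions for this example.

```python
def apply_overlay(image, overlay, top, left):
    """Return a copy of the decoded image with the overlay drawn at (top, left).

    Overlay entries that are None are transparent and leave the image pixel
    untouched; overlay pixels falling outside the image are ignored.
    """
    result = [list(row) for row in image]
    for dy, overlay_row in enumerate(overlay):
        for dx, value in enumerate(overlay_row):
            y, x = top + dy, left + dx
            if value is not None and 0 <= y < len(result) and 0 <= x < len(result[y]):
                result[y][x] = value
    return result


# Hypothetical decoded image (all black) and a small bright overlay shape.
decoded_image = [[0] * 6 for _ in range(4)]
overlay = [
    [None, 255, None],
    [255, 255, 255],
]

displayed = apply_overlay(decoded_image, overlay, top=1, left=2)
```

Since the overlay pixels are written after decoding, they are not themselves affected by the quantisation of the video.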
The scenario described above is often preferred when video comprising overlay graphical information is to be transmitted to a media client, since it provides a reliable way of maintaining a relatively good visual quality when displayed at the media client.
Also according to a third scenario, illustrated with reference to FIG. 5, it is assumed that graphical information has been added to media content at a media server 500. According to this scenario, however, the media content that is retrieved from a media source in step 5:1 already comprises graphical content.
In a next step 5:2, however, the graphical content is identified and extracted from the remaining content, before it is transmitted to a media client 401 in a next step 5:3, and received by the media client 401 in a subsequent step 5:4.
As indicated in the previous scenario, also the graphical information may have been encoded, e.g. as SVG, before transmission to the media client 401. In such a case this information will be decoded at the media client 401, before it is added as an overlay in step 5:9.
In a next step 5:5 the media content is encoded, before it is transmitted, as indicated with another step 5:6. Remaining steps 5:7-5:10 correspond to steps 4:7-4:10 of FIG. 4.
A deficiency with both scenarios described so far is that the displayed media content tends to comprise visible artefacts around the graphical information when it is displayed at the media client. This is due to the fact that the transform blocks containing graphical information tend to have a lot of high frequencies which, in terms of bits, makes these blocks expensive to encode, compared to encoding of blocks that comprise only lower frequencies.