1. Field of the Invention
The present invention relates to a technology for playing back hierarchically-encoded video image data, in which video images of a plurality of different resolutions or fields of view are hierarchically encoded in a single video stream, together with audio data associated with a certain encoded layer of the hierarchically-encoded video image data.
2. Description of the Related Art
In recent years, 720×480 pixel or 1440×1080 pixel resolution video images encoded using an MPEG2-Video encoding system have been broadcast by terrestrial digital broadcasting. In terrestrial digital broadcasting, 320×240 pixel video images encoded using an H.264/AVC (Advanced Video Coding) encoding system have also been broadcast for mobile phones and other portable devices through a separate stream called one-segment broadcasting.
On the other hand, an H.264/SVC (Scalable Video Coding) technology capable of encoding video images of a plurality of resolutions into a single video stream has been standardized as an extension of H.264/AVC. According to the H.264/SVC standard, a plurality of video images of different resolutions, for example, 320×240, 720×480, and 1440×1080 pixel resolutions, are hierarchically encoded into a single video stream as different encoded layers (also referred to simply as layers). Encoding video images of different resolutions into a single stream in this way can provide higher compression and transmission efficiency than transmitting a separate video stream for each resolution.
Moreover, according to the H.264/SVC standard, it is also possible to encode a plurality of video images of different fields of view into a single video stream. For example, a full-frame video image showing an entire soccer ground and a video image of a specific region showing only a part of that full-frame video image that includes a soccer player are hierarchically encoded into different layers. Then, during playback, the layers are selectively decoded, making it possible to change the field of view of the video image being viewed or to play back a video image suited to the display resolution and the display aspect ratio of the display apparatus.
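The selective decoding described above can be illustrated with a minimal sketch of how a playback apparatus might choose a layer suited to its display resolution. The sketch is not taken from the H.264/SVC standard; the names Layer and select_layer and the selection rule (the largest layer that fits the display, otherwise the smallest layer) are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    """Hypothetical description of one encoded layer."""
    layer_id: int
    width: int
    height: int


def select_layer(layers: List[Layer], display_width: int, display_height: int) -> Layer:
    """Return the largest layer that fits the display.

    If no layer fits, fall back to the smallest layer (an assumed policy).
    """
    fitting = [
        layer for layer in layers
        if layer.width <= display_width and layer.height <= display_height
    ]
    if fitting:
        return max(fitting, key=lambda layer: layer.width * layer.height)
    return min(layers, key=lambda layer: layer.width * layer.height)


# The three resolutions mentioned in the text, as three layers.
LAYERS = [
    Layer(layer_id=0, width=320, height=240),
    Layer(layer_id=1, width=720, height=480),
    Layer(layer_id=2, width=1440, height=1080),
]
```

Under these assumptions, a 1920×1080 display would be given the 1440×1080 layer, while a portable device with an 800×480 display would be given the 720×480 layer.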
In this manner, by using the H.264/SVC standard, a plurality of types of display apparatuses can be supported by transmitting only a single video stream, without the need to transmit video through different streams as in the case of terrestrial digital broadcasting and one-segment broadcasting. This means that transmission band efficiency can be increased and that services enabling a user to choose from among a plurality of video image sizes or fields of view can be provided. It is therefore envisaged that, in the future, hierarchically-encoded video image data compliant with the H.264/SVC standard will be used for television broadcasting.
It should be noted that even in the case of using hierarchically-encoded video image data for television broadcasting, a situation in which only a single audio data stream is provided, as in current broadcasting, can also be conceived. As described above, the use of hierarchically-encoded video image data enables the user to choose a layer and thereby change the field of view of the video image to be viewed. However, in the case where only a single audio data stream associated with one particular encoded layer is provided, the following problem arises. That is, because the audio data to be played back does not change even when the field of view of the video image is changed, the user's sense of presence may be impaired when the field of view is changed.
Japanese Patent Laid-Open No. 2004-336430 discloses a technology for giving the sense of presence to the user by changing auditory lateralization of the audio in accordance with the clipping size or position of the video image when the field of view has been changed as a result of enlarging a part of a video image.
However, Japanese Patent Laid-Open No. 2004-336430 does not take into account playback of video data, such as hierarchically-encoded video image data, in which video images of a plurality of fields of view are hierarchically encoded. In the case where only a single audio data stream associated with one layer (i.e., one field of view) is provided with respect to video stream data in which video images of a plurality of fields of view are hierarchically encoded, it is necessary to perform processing that differs from the processing performed where video and audio are provided in a one-to-one correspondence. For example, consider a case where, with respect to video content of a soccer broadcast, a video image of the entire soccer ground and a video image of a region of interest of the soccer ground are hierarchically encoded in video stream data, and a single audio data stream is associated with the layer of the video image of the region of interest. In such a case, if processing is performed in the same manner as in conventional technologies, that is, on the assumption that the audio stream is associated with the layer of the video image of the entire soccer ground, unnecessary audio correction processing will be applied when the video image of the region of interest is chosen.
Moreover, there may be a case where hierarchically-encoded video stream data contains a plurality of video images of the same resolution but different fields of view. However, such a case is not taken into account in Japanese Patent Laid-Open No. 2004-336430.
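The soccer example above can be made concrete with a minimal sketch that decides whether audio correction is needed by comparing the field of view of the chosen layer with the field of view of the layer with which the audio data stream is associated. This is only an illustration of the problem under stated assumptions, not the processing of the cited reference or of the present invention; the names FieldOfView and needs_audio_correction, and the representation of a field of view as a rectangle within the full frame, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FieldOfView:
    """Hypothetical rectangle, in full-frame pixel coordinates, shown by a layer."""
    x: int
    y: int
    width: int
    height: int


def needs_audio_correction(
    layer_fovs: Dict[int, FieldOfView],
    audio_layer_id: int,
    selected_layer_id: int,
) -> bool:
    """Return True only when the chosen layer shows a different field of view
    from the layer with which the single audio stream is associated."""
    return layer_fovs[selected_layer_id] != layer_fovs[audio_layer_id]


# Layer 0: the entire soccer ground. Layer 1: a region of interest.
# The coordinates are illustrative values, not taken from the text.
LAYER_FOVS = {
    0: FieldOfView(x=0, y=0, width=1440, height=1080),
    1: FieldOfView(x=480, y=270, width=480, height=360),
}

# The audio stream is associated with the region-of-interest layer.
AUDIO_LAYER_ID = 1

# Choosing the region of interest requires no correction ...
roi_needs_correction = needs_audio_correction(LAYER_FOVS, AUDIO_LAYER_ID, 1)
# ... whereas choosing the entire ground does.
full_needs_correction = needs_audio_correction(LAYER_FOVS, AUDIO_LAYER_ID, 0)
```

Because the comparison is made on the field of view rather than on the resolution, the same check also distinguishes layers of the same resolution but different fields of view. A conventional approach that always assumes the audio belongs to the full-frame layer corresponds to fixing audio_layer_id to 0, which would report that correction is needed when the region of interest is chosen.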