The JPEG standard, which compresses and encodes still images, and the MPEG1 and MPEG2 standards, which compress and encode moving images using inter-image motion prediction/motion compensation techniques, have been established as highly-efficient techniques for encoding image data. Various makers have developed and put into production image pickup apparatuses such as digital cameras and digital video cameras, DVD recorders, and the like that are capable of recording image data to a storage medium using such encoding techniques.
Among such products, there are some in which still image data that has been shot can be shared among multiple image pickup apparatuses in real time, by transmitting/receiving that image data between the apparatuses using a system such as wireless communication. A user can use such an apparatus to shoot and record an object of his/her preference.
Meanwhile, among such products, there are also apparatuses provided with functions for editing moving images captured by the image pickup apparatus, such as cutting out a desired section of the moving image, combining a moving image with another moving image, and so on. For example, by using moving images captured by multiple image pickup apparatuses as material to be edited, the moving images recorded by different image pickup apparatuses can be combined and a new moving image created.
Incidentally, digitized moving image data is very large in size. Accordingly, moving image data encoding standards designed to achieve an even higher rate of compression than the previously mentioned MPEG1, MPEG2, and the like continue to be researched. Recently, an encoding scheme called H.264/MPEG-4 Part 10 (called simply “H.264” hereinafter) has been standardized by the ITU-T (International Telecommunication Union-Telecommunication Standardization Sector) and the ISO (International Organization for Standardization).
The structure of data encoded using the H.264 standard shall be described hereinafter with reference to FIGS. 11A to 11C and 12A to 12B.
Note that FIGS. 11A to 11C and 12A to 12B illustrate picture groups indicating encoded moving image data (an image sequence) and picture types for each picture. In these diagrams, the upper level indicates the display order (displayed in order from the left), whereas the lower level indicates the encoding order (encoded in order from the left).
The picture types in image frames according to the H.264 standard include I pictures, which are encoded using only data within the same frame, and P pictures, which are encoded using the difference between that frame and the temporally preceding frame. Furthermore, there are also B pictures, which can use the difference between that frame and the temporally following frame, in addition to the difference between that frame and the temporally preceding frame.
For example, FIG. 11A illustrates that a picture P8 is a P picture frame that is displayed ninth. The arrow in FIG. 11A indicates a reference relationship, showing, in the example shown in FIG. 11A, that the picture P8 refers to a picture B0. Meanwhile, in the example shown in FIG. 11B, the picture B0 refers to pictures P2 and B7.
In the H.264 standard, it is possible, when performing inter-frame prediction, to use arbitrary frames and picture types within an image sequence as reference images. For example, as shown in FIG. 11A, the picture P8, which is a P picture, can refer not only to I pictures but can also skip I pictures and refer to other frames. Similarly, as shown in FIG. 11B, the picture B0, which is a B picture, can also refer not only to I pictures but can also skip I pictures and refer to other frames.
In this manner, the H.264 standard permits flexible selection of reference images. Therefore, the H.264 standard can improve the accuracy of inter-frame prediction and the encoding efficiency beyond that of standards such as MPEG2, in which a P picture can refer only to the I picture immediately previous thereto or to another P picture.
However, because the H.264 standard permits such flexible selection of reference images as mentioned earlier, there are cases where random access cannot be performed quickly in the H.264 standard. As an example, FIG. 11C illustrates a case in which an image sequence is played back from a frame partway through, namely the picture I5, using random access.
When starting playback from the picture I5 in the image sequence, the picture P8 is decoded thereafter, and because the picture P8 refers to the picture B0, it is necessary to decode the picture B0 in advance. Furthermore, because the picture B0 refers to the pictures P2 and B7, it is also necessary to decode the pictures P2 and B7 in advance in order to decode the picture B0. Similarly, although not shown in FIG. 11C, the pictures P2 and B7 each refer to other pictures, and thus it is also necessary to decode those other pictures in advance in order to decode the pictures P2 and B7.
Thus, even if playback is started from the picture I5, references that skip the picture I5 are allowed, and therefore it is necessary to go back and start the decoding process from data prior to the picture I5, making it difficult to quickly start playback from the picture I5. Furthermore, even if a user wishes to cut edit the encoded bitstream using the picture I5 as the cut frame, references that skip the picture I5 are permitted, and thus it is necessary to go back and start the decoding process from data prior to the picture I5. It is therefore difficult to perform cut edits where the bitstream is cut using the picture I5 as the cut frame.
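The chain of required decodes described for FIG. 11C can be viewed as a transitive closure over the reference relationships. The following Python sketch is offered only as an illustration: the function name `pictures_needed` is hypothetical, and the reference table contains only the references named in the text (the picture P8 referring to the picture B0, and the picture B0 referring to the pictures P2 and B7).

```python
def pictures_needed(start, refs):
    """Return every picture that must be decoded before `start` can be decoded.

    `refs` maps a picture name to the list of pictures it refers to.
    """
    needed = []
    seen = set()
    stack = list(refs.get(start, []))
    while stack:
        pic = stack.pop()
        if pic in seen:
            continue
        seen.add(pic)
        needed.append(pic)
        # The references of a reference must be decoded as well.
        stack.extend(refs.get(pic, []))
    return needed


# Only the references named in the text. P2 and B7 refer to further
# pictures not shown in FIG. 11C, so the real set would be larger.
refs = {"P8": ["B0"], "B0": ["P2", "B7"]}
print(sorted(pictures_needed("P8", refs)))  # ['B0', 'B7', 'P2']
```

Even in this reduced table, decoding the picture P8 pulls in three pictures, each of which is among the pictures the text treats as lying before the picture I5, which is why playback cannot simply begin at I5.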
Accordingly, Japanese Patent Laid-Open No. 2003-199112, for example, proposes a method that provides a periodical limitation on I pictures in order to eliminate this problem and enable quick random access. This limited I picture is called an “IDR picture” in the H.264 standard. The IDR picture shall now be described with reference to FIGS. 12A and 12B. Note that the image sequences illustrated in FIGS. 12A and 12B indicate the same image sequences as those shown in FIGS. 11A and 11B, but in which an IDR picture has been set for the picture I5.
When an IDR picture has been set for the picture I5, the frame memory into which the reference images of the moving image are being recorded is cleared of those reference images when the IDR picture is encoded. Therefore, pictures encoded after the IDR picture cannot refer to pictures encoded before that IDR picture. Likewise, pictures encoded before the IDR picture cannot refer to pictures encoded after that IDR picture.
In the example shown in FIG. 12A, the P pictures and B pictures encoded after the IDR picture, namely the picture IDR5, cannot refer to the P pictures and B pictures encoded before that IDR picture. To be more specific, pictures such as the pictures P8 and B7, which are encoded after the picture IDR5, cannot refer to pictures such as the pictures P2 and B0, which are encoded before the picture IDR5.
Conversely, in the example shown in FIG. 12B, the P pictures and B pictures encoded before the IDR picture, namely the picture IDR5, cannot refer to the P pictures and B pictures encoded after that IDR picture. To be more specific, pictures such as the pictures P2 and B0, which are encoded before the picture IDR5, cannot refer to pictures such as the pictures P8 and B7, which are encoded after the picture IDR5.
Accordingly, with the H.264 standard, when starting playback of encoded data from an IDR picture, it is not necessary to go back and decode image data from before the IDR picture, making it possible to implement playback with quick random access. Furthermore, because skipping the IDR picture and referring to other pictures is prohibited, editing that uses the IDR picture as the cut frame is also possible.
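The restriction imposed by an IDR picture can be expressed as a simple check: no reference may connect a picture encoded before the IDR picture with one encoded after it. The sketch below is a minimal illustration under stated assumptions. The function name `crossing_references` is hypothetical, and the exact encoding order within each side of the picture IDR5 is assumed, since the text states only that the pictures P2 and B0 are encoded before IDR5 and the pictures P8 and B7 after it.

```python
def crossing_references(encode_order, refs, idr):
    """List (picture, reference) pairs on opposite sides of `idr` in encoding order."""
    pos = {pic: i for i, pic in enumerate(encode_order)}
    cut = pos[idr]
    violations = []
    for pic, targets in refs.items():
        for ref in targets:
            # A negative product means one picture precedes the IDR picture
            # and the other follows it. A reference to the IDR picture
            # itself gives zero and is allowed.
            if (pos[pic] - cut) * (pos[ref] - cut) < 0:
                violations.append((pic, ref))
    return violations


# Assumed encoding order (only the before/after split comes from the text).
order = ["P2", "B0", "IDR5", "B7", "P8"]
# The references of FIGS. 11A and 11B, which skip the picture I5.
refs_fig11 = {"P8": ["B0"], "B0": ["P2", "B7"]}
print(crossing_references(order, refs_fig11, "IDR5"))
```

Under these assumptions, the references from the picture P8 to the picture B0 and from the picture B0 to the picture B7 are reported, which corresponds to the references prohibited in FIGS. 12A and 12B respectively.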
Next, control of the encoded data amount in the H.264 standard shall be described. The variable bitrate (VBR) scheme is one technique for controlling the encoded data amount. Hereinafter, encoded data amount control according to the VBR scheme shall be briefly described.
The VBR scheme is a scheme for controlling the encoded data amount that varies the target encoding bitrate based on the local properties of the video, while attempting to bring the encoding bitrate as close as possible to an average target encoding bitrate. Because this scheme encodes the video signal using a target encoding bitrate based on the properties of the video, it has a characteristic that there is little fluctuation in the image quality. In other words, frames that are difficult to encode and will thus suffer from low image quality are encoded at a higher target encoding bitrate, whereas frames that are easy to encode and will thus have sufficiently high image quality are encoded at a lower target encoding bitrate.
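As a rough illustration of this idea, the following Python sketch distributes a target bitrate across frames in proportion to a per-frame difficulty score while keeping the mean equal to the average target encoding bitrate. This is a sketch under stated assumptions, not the rate control of any actual encoder: the proportional rule, the function name `allocate_bitrates`, and the difficulty scores are all hypothetical, as the text does not specify how the target is varied.

```python
def allocate_bitrates(complexities, average_target):
    """Assign each frame a target bitrate proportional to its difficulty.

    The mean of the returned targets equals `average_target`.
    """
    mean_complexity = sum(complexities) / len(complexities)
    return [average_target * c / mean_complexity for c in complexities]


# Hypothetical difficulty scores: larger means harder to encode.
complexities = [1.0, 2.0, 3.0]
targets = allocate_bitrates(complexities, average_target=6.0)
print(targets)  # [3.0, 6.0, 9.0]
```

The hardest frame receives the highest target and the easiest the lowest, while the average over the three frames remains at the requested value.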
Recent digital video cameras are provided with multiple recording modes (encoding modes) that use encoded data amount control techniques to enable high image quality recording or extended time recording. For example, there are video cameras that have three recording modes, which encode data in accordance with an average target encoding bitrate: an LP (Long Play) mode, an SP (Standard Play) mode, and an XP (Excellent Play) mode. The VBR scheme is typically used in all recording modes. The average target encoding bitrate is lowest in the LP mode, whereas the average target encoding bitrate is highest in the XP mode. The average target encoding bitrate in the SP mode is between that of the LP mode and the XP mode.
In LP mode, the encoding bitrate is low, leading to a drop in image quality; however, the resulting file is small, and thus a larger amount of video can be recorded. On the other hand, in XP mode, the encoding bitrate is high, leading to an increase in image quality; however, the resulting file is large, and thus only a small amount of video can be recorded. A user can shoot video using the recording mode s/he prefers in light of the image quality of the recorded video, the space remaining in the storage medium, and so on.
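The trade-off between recording mode and recordable time follows from simple arithmetic: the recordable duration is the storage capacity in bits divided by the average bitrate. The Python sketch below illustrates this. The function name `recordable_minutes`, the per-mode bitrates, and the capacity are hypothetical values chosen for illustration; the text gives no figures for any mode.

```python
def recordable_minutes(capacity_bytes, average_bitrate_bps):
    """Approximate recordable time in minutes at a given average bitrate."""
    seconds = capacity_bytes * 8 / average_bitrate_bps
    return seconds / 60


# Hypothetical average target encoding bitrates in bits per second.
modes = {"LP": 3_000_000, "SP": 6_000_000, "XP": 9_000_000}
capacity = 4_000_000_000  # hypothetical 4 GB of remaining space

for name, bitrate in modes.items():
    print(name, round(recordable_minutes(capacity, bitrate), 1), "min")
```

Because the duration is inversely proportional to the bitrate, tripling the average bitrate from the assumed LP value to the assumed XP value cuts the recordable time to one third.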
Japanese Patent Laid-Open No. 2001-346201 describes an image encoding apparatus that uses the VBR scheme. This document describes a case where an input image is divided into low-resolution images and encoded using multiple image encoding apparatuses; in such a case, encoded data amounts are then allocated to each of the encoding apparatuses so that the image quality of the low-resolution images is the same for each of the image encoding apparatuses.
With the aforementioned H.264 standard, using the IDR picture, which limits the reference relationships in inter-frame prediction, enables quick random access, easy editing, and so on. For this reason, it is necessary to set an IDR picture at an appropriate location in order to enable quick playback from an arbitrary location in the encoded bitstream, easy editing using the encoded bitstream as material, and so on.
However, because the reference relationships are limited in the described manner by setting an IDR picture, setting many IDR pictures has the potential to reduce the encoding efficiency. In other words, if priority is to be placed on encoding efficiency, it is desirable to set as few IDR pictures as possible. A method that sets IDR pictures periodically, as in the background art, has a problem in that IDR pictures are also set for frames where they are not necessary for random access, editing, and so on, which leads to a drop in the encoding efficiency.
In addition, in the case where multiple users are to edit multiple moving images (encoded bitstreams) shot using their respective image pickup apparatuses, there are many cases where the intervals and times at which the IDR pictures are set in the respective encoded bitstreams differ from one another. For this reason, reducing the number of set IDR pictures in order to prevent a drop in the encoding efficiency makes it difficult to splice together desired sections of video when editing multiple encoded bitstreams from different users.
Furthermore, maintaining uniform image quality among the videos is an important issue in the abovementioned situation where multiple moving images (encoded bitstreams) recorded using different apparatuses are to be edited. If the individual encoded bitstreams are of differing image qualities, editing the video will result in a difference in image quality that is visibly apparent in the areas at which streams have been spliced together.
FIGS. 21A to 21C illustrate examples in which a user A has recorded a scene A at a target encoding bitrate based on the SP mode, whereas a user B has recorded a scene B at a target encoding bitrate based on the XP mode. FIG. 21A indicates the change in the average target encoding bitrate in the recording of the scene A by the user A, and shows that the user A starts shooting the scene A at time t2 and stops shooting the scene A at time t3. FIG. 21B, meanwhile, indicates the change in the average target encoding bitrate in the recording of the scene B by the user B, and shows that the user B starts shooting the scene B at time t1 and stops shooting the scene B at time t2. FIG. 21C then shows the change in the average target encoding bitrate of the encoded bitstream in the case where the scenes A and B have been spliced together through a cut edit.
In FIG. 21C, the encoding bitrate drops suddenly at the splice between the scenes B and A at time t2, due to the difference in the average target encoding bitrates between the XP and SP modes. In other words, in FIG. 21C, the image quality appears to drop suddenly in the section after time t2, as compared to the section before time t2. For this reason, a viewer who plays back the video of such an encoded bitstream will perceive the video as unnatural immediately following time t2.
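The situation of FIG. 21C can be sketched by listing the spliced scenes as segments and reporting each splice point at which the average target encoding bitrate changes. The following Python sketch is illustrative only: the function name `splice_steps`, the numeric times standing in for t1, t2, and t3, and the bitrates assigned to the XP and SP modes are all hypothetical.

```python
def splice_steps(segments):
    """Return (time, change) for each splice where the average bitrate changes.

    `segments` is a list of (start, end, average_bitrate) in playback order.
    """
    steps = []
    for prev, cur in zip(segments, segments[1:]):
        change = cur[2] - prev[2]
        if change != 0:
            steps.append((cur[0], change))
    return steps


# Hypothetical values: t1 = 0, t2 = 10, t3 = 20; XP = 9.0, SP = 6.0 (Mbps).
scene_b = (0, 10, 9.0)   # user B, XP mode, from t1 to t2
scene_a = (10, 20, 6.0)  # user A, SP mode, from t2 to t3
print(splice_steps([scene_b, scene_a]))  # [(10, -3.0)]
```

The single negative step at the assumed time t2 corresponds to the sudden drop in the encoding bitrate at the splice. Had both scenes been recorded at the same average target encoding bitrate, the function would report no step.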