1. Field of the Invention
The present invention relates to an editing apparatus, an editing method, an editing program, and an editing system that allow video data that have been compression-encoded using inter-frame compression to be edited more quickly than before.
2. Description of Related Art
As a record medium that is recordable and removable from a recording and reproducing apparatus, that has a relatively large recording capacity, and that is suitable for recording AV (Audio/Video) data composed of video data and audio data, a DVD (Digital Versatile Disc) having a recording capacity of 4.7 GB (gigabytes) or more has already become common. Japanese Patent Application Laid-Open No. 2004-350251 describes an image capturing apparatus that records DVD-Video format data to a recordable type DVD.
Since this recordable type DVD uses UDF (Universal Disk Format) as a file system, a computer apparatus that supports UDF can access this recordable type DVD. Since UDF includes a format based on ISO (International Organization for Standardization) 9660, various types of file systems used for computer apparatus can access the recordable type DVD. When video data captured, for example, by an image capturing apparatus and the audio data obtained together with them are recorded as a file to this recordable type DVD, the image capturing apparatus becomes more compatible with other apparatus such as computer apparatus, so the recorded data can be used more effectively.
Since the data amount of video data is huge, they are normally compression-encoded according to a predetermined system and then recorded to a record medium. As a standard compression-encoding system for video data, the MPEG2 (Moving Picture Experts Group 2) system is known. In recent years, as a more advanced and more efficient encoding system than the MPEG2 compression-encoding system, ITU-T (International Telecommunication Union-Telecommunication Standardization Sector) Recommendation H. 264 or ISO (International Organization for Standardization)/IEC (International Electrotechnical Commission) International Standard 14496-10 (MPEG-4 Part 10) Advanced Video Coding (hereinafter referred to as H. 264|AVC) is becoming common.
In these MPEG2 and H. 264|AVC, intra-frame encoding using orthogonal transform or the like is performed. In addition, inter-frame encoding according to prediction encoding using motion compensation is performed so as to improve the compression rate. Next, inter-frame compression according to prediction encoding for the MPEG2 system will be described.
First of all, the structure of a data stream according to MPEG2 will be outlined. MPEG2 is a combination of prediction encoding using motion compensation and compression encoding using DCT. Data of MPEG2 are hierarchically structured as the block layer, the macro block layer, the slice layer, the picture layer, the GOP layer, and the sequence layer in the ascending order. The block layer is composed of a DCT block that is the unit of DCT. The macro block layer is composed of a plurality of DCT blocks. The slice layer is composed of a header portion and at least one macro block. The picture layer is composed of a header portion and at least one slice. One picture corresponds to one screen. The boundaries of layers can be identified with identification codes.
The GOP layer is composed of a header portion, an I (Intra-coded) picture that is a picture based on intra-frame encoding, and a P (Predictive-coded) picture and a B (Bi-directionally predictive coded) picture that are pictures based on prediction encoding. An I picture can be decoded only with its own information. A P picture needs the immediately preceding I or P picture as a reference picture, and a B picture needs the immediately preceding and following I or P pictures as reference pictures. Thus, a P picture and a B picture cannot be decoded by themselves. For example, a P picture is decoded with the chronologically immediately preceding I picture or P picture as a reference picture. On the other hand, a B picture is decoded with two pictures, the chronologically immediately preceding and following I picture(s) or P picture(s), as reference pictures. A group that contains at least one I picture and that is complete in itself is referred to as a GOP (Group Of Pictures) and is the minimum independently accessible unit of an MPEG stream.
One GOP is composed of one or a plurality of pictures. In the following description, it is assumed that one GOP is composed of a plurality of pictures. There are two types of GOPs, a closed GOP that can be fully decoded by itself and an open GOP that can be decoded with information of the immediately preceding GOP. Since an open GOP can be decoded with more information than a closed GOP, the open GOP has a higher picture quality than the closed GOP and is generally used.
Next, with reference to FIG. 1A, FIG. 1B, and FIG. 1C, a decoding process for data that have been inter-frame compressed will be described. In this example, it is assumed that one GOP is composed of a total of 15 pictures of one I picture, four P pictures, and 10 B pictures and that the GOP type is the open GOP. As exemplified in FIG. 1A, I, P, and B pictures of the GOP are displayed in the order of “B0B1I2B3B4P5B6B7P8B9B10P11B12B13P14”. In this sequence, the suffixes represent the order in which pictures are displayed.
In this example, the first two pictures, B0 and B1, are predicted and decoded with the last P14 picture of the immediately preceding GOP and the I2 picture of the current GOP. The first P picture of the current GOP, P5, is predicted and decoded with the I2 picture. The other P pictures, P8, P11, and P14, are each predicted and decoded with the immediately preceding P picture. Each B picture following the I picture is predicted and decoded with the immediately preceding and following I and/or P pictures.
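The reference relationships described above can be sketched in a few lines of Python. This is only an illustration of the MPEG2-style rule stated in the text (a P picture refers to the nearest preceding I or P picture, a B picture to the nearest preceding and following I or P pictures in the display order). The function names and the label `prev-GOP P14` are hypothetical and are not part of any standard or of the described apparatus.

```python
import re

# Display order of the example GOP in FIG. 1A.
DISPLAY_ORDER = "B0B1I2B3B4P5B6B7P8B9B10P11B12B13P14"


def parse_pictures(sequence):
    """Split a string such as 'B0B1I2' into ['B0', 'B1', 'I2']."""
    return re.findall(r"[IPB]\d+", sequence)


def reference_pictures(display_order, previous_anchor="prev-GOP P14"):
    """Map each picture to the tuple of pictures it is predicted from.

    previous_anchor is a hypothetical label for the last P picture of the
    immediately preceding GOP, which an open GOP refers to.
    """
    pictures = parse_pictures(display_order)
    refs = {}
    for index, picture in enumerate(pictures):
        kind = picture[0]
        if kind == "I":
            refs[picture] = ()  # decodable with its own information only
            continue
        # Nearest I or P picture that precedes this picture in display order.
        earlier = [p for p in pictures[:index] if p[0] in "IP"]
        backward = earlier[-1] if earlier else previous_anchor
        if kind == "P":
            refs[picture] = (backward,)
            continue
        # B picture: also needs the nearest following I or P picture.
        later = [p for p in pictures[index + 1:] if p[0] in "IP"]
        if not later:
            raise ValueError(f"{picture} has no following I or P picture")
        refs[picture] = (backward, later[0])
    return refs


if __name__ == "__main__":
    for name, sources in reference_pictures(DISPLAY_ORDER).items():
        print(name, "<-", ", ".join(sources) or "(none)")
```

In this sketch, B0 and B1 are the only pictures whose references reach outside the current GOP, which is what makes the GOP an open GOP.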
On the other hand, since a B picture is predicted and decoded with the chronologically preceding and following I or P pictures, it is necessary to decide the order of I, P, and B pictures in a stream or on a record medium taking into account the order in which the decoder decodes the pictures. In other words, an I and/or P picture that is used to decode a B picture always needs to be decoded before the B picture is decoded.
In the foregoing, as exemplified in FIG. 1B, pictures of a stream or a record medium are arranged in the order of “I2B0B1P5B3B4P8B6B7P11B9B10P14B12B13” and they are input to the decoder in this order. In this sequence, the suffixes of the pictures shown in FIG. 1B correspond to those shown in FIG. 1A and represent the order in which the pictures are displayed.
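As a minimal sketch of this rearrangement, the following Python function converts the display order of FIG. 1A into the stream order of FIG. 1B by emitting each I or P picture ahead of the B pictures that precede it in the display order. The function names are hypothetical, and the sketch assumes the simple MPEG2-style GOP pattern used in this example.

```python
import re


def parse_pictures(sequence):
    """Split a string such as 'B0B1I2' into ['B0', 'B1', 'I2']."""
    return re.findall(r"[IPB]\d+", sequence)


def display_to_stream_order(display_order):
    """Reorder pictures so that every B picture follows both of its references."""
    stream = []
    pending_b = []  # B pictures waiting for their following I or P picture
    for picture in parse_pictures(display_order):
        if picture[0] == "B":
            pending_b.append(picture)
        else:
            # The I or P picture goes first, then the B pictures that need it.
            stream.append(picture)
            stream.extend(pending_b)
            pending_b.clear()
    # Trailing B pictures (none in this example) would need the next GOP's
    # first I picture; they are simply appended here.
    stream.extend(pending_b)
    return "".join(stream)


if __name__ == "__main__":
    print(display_to_stream_order("B0B1I2B3B4P5B6B7P8B9B10P11B12B13P14"))
```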
As shown in FIG. 1C, in the decoding process of the decoder, first of all, I2 picture is decoded and then B0 picture and B1 picture are predicted and decoded with the decoded I2 picture and the last P14 picture (in the display order) of the immediately preceding GOP. Thereafter, B0 picture and B1 picture are output from the decoder in the order in which they have been decoded and then I2 picture is output. After B1 picture is output, P5 picture is predicted and decoded with I2 picture. Thereafter, B3 picture and B4 picture are predicted and decoded with I2 picture and P5 picture. Thereafter, B3 picture and B4 picture that have been decoded are output from the decoder in the order in which they have been decoded and then P5 picture is output.
Thereafter, the same processes are repeated: a P or I picture that is used to predict a B picture is decoded before the B picture, the B picture is predicted and decoded with the decoded P or I picture, the decoded B picture is output, and then the P or I picture used to decode the B picture is output. The arrangement of pictures of a record medium or a stream as shown in FIG. 1B is generally used.
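The output order of the decoder described with reference to FIG. 1C can be modeled roughly as follows. The sketch tracks only the order in which pictures leave the decoder and does not perform any actual decoding. It assumes that a decoded I or P picture is held back until the next I or P picture arrives, while B pictures are output as soon as they are decoded. The function names are hypothetical.

```python
import re


def parse_pictures(sequence):
    """Split a string such as 'I2B0B1' into ['I2', 'B0', 'B1']."""
    return re.findall(r"[IPB]\d+", sequence)


def decoder_output_order(stream_order):
    """Return the order in which pictures are output for a given stream order."""
    output = []
    held_anchor = None  # decoded I or P picture that has not been output yet
    for picture in parse_pictures(stream_order):
        if picture[0] == "B":
            # A B picture is output right after it has been decoded.
            output.append(picture)
        else:
            # A new I or P picture arrives: release the one being held.
            if held_anchor is not None:
                output.append(held_anchor)
            held_anchor = picture
    if held_anchor is not None:
        output.append(held_anchor)  # flush the last I or P picture
    return "".join(output)


if __name__ == "__main__":
    print(decoder_output_order("I2B0B1P5B3B4P8B6B7P11B9B10P14B12B13"))
```

Applied to the stream order of FIG. 1B, the sketch reproduces the display order of FIG. 1A.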
In the H. 264|AVC system, the encoding process and the decoding process for video data are performed nearly in the same manner as those in the MPEG2 system. In the H. 264|AVC system, inter-frame prediction is more flexibly performed with pictures. In the H. 264|AVC system, a randomly accessible picture that is equivalent to an I picture in the MPEG2 system is referred to as an IDR (Instantaneous Decoding Refresh) picture. In the following, an encoding system will be described on the basis of the MPEG2 system.
Now, the case in which video data that have been compression-encoded according to an encoding system using inter-frame compression such as the MPEG2 system are edited is considered. As an exemplary editing process, the case in which scenes of a middle portion of a video program are deleted and the remaining portions are connected will be described. For example, as exemplified in FIG. 2A, region A-B as scenes to be deleted is designated in a video stream of a series of GOPs such as GOP #1, GOP #2, . . . , GOP #9, . . . . In this case, it is assumed that edit point A at the front end of region A-B and edit point B at the rear end of region A-B are a picture in the middle portion of GOP #3 and a picture in the middle portion of GOP #7, respectively. The video stream is edited in such a manner that pictures in region A-B are deleted, edit point A and edit point B are connected, and thereby one edited video stream is obtained (refer to FIG. 2B).
When the video stream is edited in such a manner, if pictures in region A-B are simply deleted, the structure of GOP #3+7 that contains the connected portions is destroyed. As a result, a problem occurs in which the video stream cannot be normally reproduced. Thus, the video stream cannot be edited with an accuracy of one frame, but only in the unit of one GOP.
As an exemplary editing process performed in the unit of one GOP, a method of deleting GOP #4 to GOP #6 contained in region A-B and connecting the rear end of GOP #3 and the front end of GOP #7 can be considered. However, when GOPs have the open GOP structure, this method causes a problem in which the group of B pictures at the front end of GOP #7, which follows the deleted GOPs, cannot be decoded (B0 picture and B1 picture in FIG. 1A, FIG. 1B, and FIG. 1C), because the last P picture of GOP #6 that they refer to has been deleted.
To solve the foregoing problem, a method of temporarily decoding a video stream to be edited, editing the decoded video stream with an accuracy of one frame, and then re-encoding the edited stream can be considered. However, if the whole video stream to be edited is decoded and the edited video stream is then re-encoded whenever the video stream is edited, a long processing time is required. In addition, if the whole encoded video stream is decoded and then re-encoded, the picture quality of the whole video stream deteriorates.
These problems also occur when a plurality of video streams are connected.
To solve these problems, a technique of decoding only the minimum necessary region near an edit point and then re-encoding the decoded region is known. In other words, only GOPs to be deleted and connected and those that are affected thereby are decoded and then re-encoded. Other GOPs are copied in the unit of one GOP. As a typical example of a technique in which only the minimum necessary region near an edit point is decoded and re-encoded, a technique called smart rendering is generally known.
Next, with reference to the foregoing FIG. 2A and FIG. 2B, an exemplary process of decoding only the minimum necessary region near an edit point and then re-encoding the decoded region will be outlined. When region A-B designated by edit point A and edit point B is deleted, the portions that need to be decoded on the edit point A side are GOP #3 containing edit point A and GOP #2 immediately preceding GOP #3. When these GOPs are open GOPs, GOP #2 immediately preceding GOP #3 containing edit point A is necessary to decode B0 picture and B1 picture at the front end in the display order of GOP #3. On the other hand, the portions that need to be decoded on the edit point B side are GOP #7 containing edit point B and GOP #8 immediately following GOP #7. When these GOPs are open GOPs, to decode the group of B pictures at the front end in the display order of GOP #8, it is necessary to use data of GOP #7.
In the state shown in FIG. 2A, first of all, GOP #4 to GOP #6 are deleted, GOP #2 and GOP #3 are decoded, and then GOP #7 and GOP #8 are decoded. In GOP #3, the pictures following edit point A are deleted from those that have been decoded. Likewise, in GOP #7, the pictures preceding edit point B are deleted from those that have been decoded. Thereafter, edit point A and edit point B are connected and newly created GOP #3+7 is re-encoded with reference to the code amounts of GOP #2 and GOP #8 immediately preceding and following GOP #3+7 (refer to FIG. 2B). B0 picture and B1 picture at the front end of GOP #3+7 are encoded with the last P14 picture of the decoded GOP #2 and I2 picture of the decoded GOP #3. The data of GOP #2 and GOP #8 as they were before being decoded can be stored in memory and used as they are.
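The selection of GOPs in this example can be summarized with a short Python sketch. It only classifies GOP numbers for the open GOP case described above and does not decode or encode anything. The function name, the dictionary keys, and the assumption that GOPs are numbered from 1 are introduced here for illustration only.

```python
def plan_partial_reencode(gop_count, gop_with_a, gop_with_b):
    """Classify GOPs when region A-B is deleted from a stream of open GOPs.

    gop_with_a and gop_with_b are the numbers of the GOPs that contain
    edit point A and edit point B, respectively (1-based numbering).
    """
    if not (1 < gop_with_a < gop_with_b < gop_count):
        raise ValueError("both edit GOPs need a neighbour GOP on the outer side")

    # GOPs that lie entirely inside region A-B are deleted as whole GOPs.
    deleted = list(range(gop_with_a + 1, gop_with_b))
    # The GOPs containing the edit points and their outer neighbours are decoded.
    decoded = [gop_with_a - 1, gop_with_a, gop_with_b, gop_with_b + 1]
    # All remaining GOPs are neither deleted nor decoded.
    untouched = [g for g in range(1, gop_count + 1)
                 if g not in deleted and g not in decoded]
    return {
        "deleted": deleted,
        "decoded": decoded,
        "merged_and_reencoded": (gop_with_a, gop_with_b),
        "untouched": untouched,
    }


if __name__ == "__main__":
    # FIG. 2A: edit point A in GOP #3, edit point B in GOP #7, nine GOPs shown.
    print(plan_partial_reencode(gop_count=9, gop_with_a=3, gop_with_b=7))
```

For the stream of FIG. 2A, the sketch reports GOP #4 to GOP #6 as deleted, GOP #2, #3, #7, and #8 as decoded, GOP #3 and GOP #7 as the pair merged into GOP #3+7, and GOP #1 and GOP #9 as untouched.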
These processes can also be applied to an editing process of connecting a plurality of video streams.