(1) Field of the Invention
The present invention relates to a multiplex scheme conversion apparatus that converts the multiplex scheme of coded media data such as video and sound and outputs the converted coded media data.
(2) Description of the Related Art
With the recent increase in the capacity of storage media and communication networks and the development of communication techniques, apparatuses and services for processing coded multimedia data such as video and sound have become popular. For example, in the broadcasting field, conventional analog broadcasting is being replaced by digitally coded media data broadcasting. At present, the digital broadcasting is received only by immobile apparatuses, but in the future, broadcasting for mobile apparatuses such as mobile phones is scheduled to be started. Also, in the communication field, an environment where multimedia data is handled by both immobile terminals and mobile terminals is being prepared. For example, a video distribution service for third generation mobile phones has already been started. In light of this situation, a usage pattern in which content data received via broadcasting or the Internet is recorded on memory cards such as secure digital (SD) cards or on optical discs such as digital versatile disc-rewritable (DVD-RAM) discs, so that such apparatuses can share the content data, is expected to become popular.
At the time of distributing media data via broadcasting or networks, or by using a storage medium, the header information and the media data necessary for reproducing the media data are multiplexed. A multiplex scheme is standardized respectively for broadcasting, for storage apparatuses such as DVDs, and for mobile apparatuses. First, for digital broadcasting or DVDs, the Moving Picture Experts Group (MPEG)-2 system standard, standardized by the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC JTC1/SC29/WG11), is used. Also, for mobile terminals, the MP4 file format standardized by the ISO/IEC JTC1/SC29/WG11 is employed in the Transparent end-to-end packet switched streaming service (TS26.234) prescribed, as a wireless video distribution standard, by the Third Generation Partnership Project (3GPP), the international standardization group established to standardize the third generation mobile communication system.
Also, the MPEG-4 advanced video coding (AVC) standard is standardized as the successor to the MPEG-2 Video and MPEG-4 Visual standards that are popular video coding methods at present. Therefore, it is expected that coded video data of the MPEG-4 AVC will be multiplexed, broadcast, stored or distributed using the MPEG-2 system standard or the MP4 file format (called MP4 from here).
The outline of the coded data multiplex schemes in the MPEG-2 system and the MP4 will be described below. In both the MPEG-2 system and the MP4, the basic unit used in handling coded data is an access unit (AU), so the structure of an AU will be described first. An AU includes the coded data for one picture, and AU data in the MPEG-4 AVC has the structure shown in FIGS. 1A to 1C. In the MPEG-4 AVC, it is possible to include, in AU data, not only the data necessary for decoding pictures but also supplementary information that is unnecessary for decoding, called supplemental enhancement information (SEI), AU boundary information and the like. All the data is stored in network abstraction layer (NAL) units. Note that, in the MPEG-2 system, a NAL unit called an access unit delimiter, which indicates the start of an AU, is always added to the top of the AU. A NAL unit is composed of a header and a payload, as shown in FIG. 1A. The header size is 1 byte, and the header includes a field indicating the type of data stored in the payload (called the NAL unit type from here). NAL unit type values are defined for the respective kinds of data such as slices and SEI, and the NAL unit type is referred to when obtaining the type of data stored in a NAL unit. NAL units carrying header information, SEI and the like are stored in an AU in addition to the slice data for one picture, as shown in FIGS. 1B and 1C. Since a NAL unit does not itself include information for identifying its boundaries, identification information can be added to the top of each NAL unit when an AU is stored. The identification information is added using either of the following two methods: adding a 3-byte start code prefix of 0x000001, as shown in FIG. 1B (called the byte stream format from here); or adding the size of the NAL unit, as shown in FIG. 1C (called the NAL size format from here).
Note that it is prescribed that at least one zero_byte, a 1-byte field whose value is 0x00, is added before the start code prefix of the leading NAL unit of an AU and of NAL units having specific NAL unit type values. The byte stream format is used in the MPEG-2 system, and the NAL size format is used in the MP4.
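The relationship between the two formats can be sketched as follows. This is a simplified illustration, not the algorithm prescribed by either standard: it assumes the input is a complete sequence of start-code-delimited NAL units, and it approximates the removal of zero_bytes preceding a start code by stripping trailing 0x00 bytes from each unit, which may be incorrect for payloads that legitimately end in zero bytes.

```python
def byte_stream_to_size_format(data: bytes, length_size: int = 4) -> bytes:
    """Convert NAL units from the byte stream format (start-code
    delimited) to the NAL size format (length-prefixed)."""
    out = bytearray()
    # Locate every 3-byte start code prefix 0x000001.
    positions = []
    i = 0
    while i < len(data) - 2:
        if data[i:i + 3] == b"\x00\x00\x01":
            positions.append(i + 3)  # NAL unit data starts after the prefix
            i += 3
        else:
            i += 1
    for n, start in enumerate(positions):
        # A NAL unit ends where the next start code (minus its prefix) begins.
        end = positions[n + 1] - 3 if n + 1 < len(positions) else len(data)
        # Approximation: strip zero_bytes that precede the next start code.
        nal = data[start:end].rstrip(b"\x00")
        out += len(nal).to_bytes(length_size, "big") + nal
    return bytes(out)
```

A 4-byte length field is assumed here; the actual length field size used in an MP4 file is signaled in its header information.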
Next, slices and header information will be described in detail. Slices are roughly divided into two types: instantaneous decoder refresh (IDR) slices and the other type of slices. An IDR slice is slice data that is intra-coded, and header information such as the later-described sequence parameter set (SPS) can be switched only at such an IDR slice. In the case where an IDR slice is included in a picture, the other slices in the picture are also IDR slices; therefore, an AU including IDR slices is called an IDR AU from here. Also, a unit composed of the AUs from an IDR AU to the AU immediately before the next IDR AU is called a sequence. Random access is performed on a sequence-by-sequence basis because only AUs within a sequence are referred to in decoding the slice data of an AU. Next, there are two types of header information: the SPS and the picture parameter set (PPS). An SPS is header information that is fixed on a sequence-by-sequence basis, and a PPS is header information that is switchable on a picture-by-picture basis. Header information can include several SPSs and PPSs, and these SPSs and PPSs are distinguished from each other based on index numbers. Also, one SPS or PPS is stored in one NAL unit. The index numbers of the SPS and the PPS referred to by each picture are obtained in the following way. First, the index number of the PPS referred to by a picture is shown in the header part of the slice data. Next, since the index number of the SPS referred to by the PPS is shown in the PPS, the index number of the SPS referred to by the picture can be obtained by analyzing the PPS referred to by the picture. The SPS and PPS referred to by a picture are necessary for decoding the slice data of the picture.
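The two-step lookup described above can be sketched as follows. The table layout and the field name `sps_index` are hypothetical stand-ins; actual SPS/PPS resolution requires parsing the full bitstream syntax of the slice header and parameter sets.

```python
def resolve_parameter_sets(slice_pps_index, pps_table, sps_table):
    """Resolve the SPS and PPS a picture refers to: the slice header
    carries a PPS index, and that PPS carries the index of the SPS
    it refers to. Both are needed to decode the slice data."""
    pps = pps_table[slice_pps_index]
    sps = sps_table[pps["sps_index"]]  # 'sps_index' is a hypothetical field name
    return sps, pps
```

For example, a decoder that has parsed the PPS index 2 out of a slice header would look up PPS 2, read the SPS index stored in it, and then look up that SPS before decoding the slice.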
Next, the boundary information addition method used in broadcasting at the time of separating AU data by the MPEG-2 system will be described. In the MPEG-2 system, the coded data is multiplexed into packetized elementary stream (PES) packets, and the PES packets are then multiplexed into transport stream (TS) packets. FIG. 2A shows the structure of a PES packet, and FIG. 2B shows the structure of a PES packet and a TS packet. In the payload of a PES packet, access unit (AU) data is stored. FIG. 2A (1) to (3) show storage examples indicating how AU data is stored in the payload of a PES packet. Several AUs may be stored together, as shown in FIGS. 2A(1) and 2A(2), and AU data may be divided and stored, as shown in FIG. 2A(3). Further, the payload can include stuffing data separately from the AU data. The header of the PES packet starts with a 4-byte start code composed of a start code prefix shown in 3 bytes of 0x000001 and a stream ID represented in 1 byte. The stream ID is an identification number indicating the type of the coded data included in the payload data of the PES packet, and the identification number may be an arbitrary value between 0xE0 and 0xEF inclusive in the MPEG-4 AVC. It is possible to store, in the header, the decoding time and display time of the leading AU in the payload, but such time information is not always stored in all PES packets; in other words, there are PES packets where no time information is stored. In the case where the decoding time or display time of an AU is needed but not shown in the header of a PES packet, the AU data is analyzed, and the differential value between the decoding time or display time of the current AU and that of the immediately preceding AU is obtained. Note that the starting position of a PES packet is detected by searching for the 4-byte start code in the payload data of TS packets. The data of a PES packet is divided and stored in the payloads of TS packets, as shown in FIG. 2B.
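The 4-byte start code search described above can be sketched as follows, assuming the bytes of the concatenated TS payload data are scanned linearly and using the stream ID range 0xE0 to 0xEF stated for the MPEG-4 AVC; a real demultiplexer would also have to handle start codes split across packet boundaries.

```python
def find_pes_start(payload: bytes):
    """Locate the start of a PES packet in concatenated TS payload data
    by searching for the 3-byte start code prefix 0x000001 followed by
    a stream ID between 0xE0 and 0xEF inclusive."""
    for i in range(len(payload) - 3):
        if (payload[i:i + 3] == b"\x00\x00\x01"
                and 0xE0 <= payload[i + 3] <= 0xEF):
            return i
    return None  # no PES start code found in this payload fragment
```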
A TS packet is a packet having a fixed length of 188 bytes, composed of a 4-byte header, an adaptation field and payload data. Note that the adaptation field is included only in the case where a specific flag is set in the header. The header includes an identification number called a PID, indicating the type of data transmitted by the TS packet, and a counter called continuity_counter. The continuity_counter is a 4-bit field; for TS packets having an identical PID, its value is incremented by one according to the sending order of the TS packets until it reaches the maximum value, and is then re-counted from the starting value. The association between the PID of a TS packet and the type of data transmitted by the TS packet is provided separately in program information transmitted by TS packets. Therefore, at the time of receiving TS packets, the PIDs of the TS packets are obtained first, and the packets are then sorted depending on their PID values. For example, in the case where the program information obtained at the start of reception shows that MPEG-4 AVC data is transmitted by TS packets whose PID is 32, obtaining the TS packets whose PID is 32 makes it possible to obtain the AU data of the MPEG-4 AVC. Here, a gap between the continuity_counter values of received TS packets indicates that a packet loss has occurred in the transmission path. In addition, in the case where AU data is separated from TS packets, a PES packet is separated from the payload data of the TS packets, and the AU data is separated from the separated PES packet.
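The packet-loss check based on the 4-bit continuity_counter can be sketched as follows. This is a simplified check; real TS streams include cases in which the counter is not incremented (for example, packets carrying no payload), which this sketch ignores.

```python
def detect_packet_loss(counters):
    """Given the continuity_counter values of successively received TS
    packets with the same PID, report gaps. The counter is a 4-bit
    field, so it wraps from the maximum value 15 back to 0."""
    gaps = []
    for prev, cur in zip(counters, counters[1:]):
        if cur != (prev + 1) % 16:
            gaps.append((prev, cur))  # a gap means packets were lost in transit
    return gaps
```

A wrap from 15 to 0 is treated as continuous, while any other jump is reported as a loss.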
Lastly, the multiplex scheme of AU data in the MP4 will be described. In the MP4, header information on a sample-by-sample basis and media data are managed on an object-by-object basis. The object is called a Box. Here, a sample is the basic unit for handling media data in the MP4, and one sample corresponds to one AU. A sample number is assigned to each sample in ascending order of decoding time, and these sample numbers are incremented by one for each sample. FIG. 3A shows the structure of a Box, which is composed of the following fields: (i) size, which is the size of the whole Box including the size field itself; (ii) type, which is a Box identifier, the identifier basically being four alphabetic characters, the field length being 4 bytes, and a Box in the MP4 file being searched for by judging whether sequential 4-byte data matches the identifier of the type field; (iii) version, which is the version number of the Box; (iv) flags, which is the flag information set for each Box; and (v) data, which is data such as header information or media data.
Note that some Boxes do not include the version and flags fields because they are unnecessary. The identifiers of type fields are used in referring to Boxes in the following description; for example, the Box whose type is ‘moov’ is called ‘moov’. The Box structure of the MP4 file is shown in FIG. 3B. The MP4 file is composed of ‘ftyp’, ‘moov’, and ‘mdat’ or ‘moof’, and ‘ftyp’ is placed at the top of the MP4 file. Information for identifying an MP4 file is included in ‘ftyp’, and media data is stored in ‘mdat’. Each piece of media data included in ‘mdat’ is called a ‘trak’, and each ‘trak’ is identified by a ‘trak’ ID. Next, the header information on the samples included in each ‘trak’ of ‘mdat’ is stored in ‘moov’. In ‘moov’, as shown in FIG. 4A, Boxes are hierarchically placed, and the header information of audio media tracks and video media tracks is stored in separate ‘trak’s. In a ‘trak’, Boxes are hierarchically placed, and the following information is stored in the Boxes in ‘stbl’: (i) the sizes, decoding times and display starting times of samples; and (ii) information on randomly-accessible samples (FIG. 4B). Such randomly-accessible samples are called Sync samples, and a list of the sample numbers of the Sync samples is shown by ‘stss’ in ‘stbl’. The header information of all the samples in a ‘trak’ is stored in ‘moov’ in the above description, but it is also possible to divide a ‘trak’ into fragments and store the header information on a fragment-by-fragment basis. The header information on each unit obtained by dividing the ‘trak’ is shown in ‘moof’. In the example of FIG. 5, the header information of the samples stored in ‘mdat’ #1 is stored in ‘moof’ #1.
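Walking a sequence of Boxes by using the size and type fields can be sketched as follows. This is a simplified illustration: it records only the type and size of each top-level Box, and it ignores extensions such as 64-bit Box sizes.

```python
def list_boxes(data: bytes):
    """Enumerate the Boxes in an MP4 byte sequence. Each Box starts with
    a 4-byte big-endian size field (covering the whole Box, including
    the size field itself) followed by a 4-character type identifier
    such as 'ftyp' or 'moov'."""
    boxes = []
    offset = 0
    while offset + 8 <= len(data):
        size = int.from_bytes(data[offset:offset + 4], "big")
        box_type = data[offset + 4:offset + 8].decode("ascii")
        boxes.append((box_type, size))
        if size < 8:
            break  # malformed size; stop instead of looping forever
        offset += size  # the size field lets us skip to the next Box
    return boxes
```

Because each size field covers the whole Box, a reader can skip Boxes it does not understand without parsing their contents.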
Consider a case where broadcasting data received by a mobile terminal such as a mobile phone is transmitted by e-mail in the form of an attachment. In 3GPP, in the case where video and sound are handled in e-mail, it is prescribed that such media data is multiplexed using the MP4. Therefore, the TS multiplex scheme needs to be converted to the MP4 at the time of transmitting the e-mail. The following is a description of how a conventional multiplex scheme conversion apparatus converts, into an MP4 file, a packet sequence of TS packets in which coded data of the MPEG-4 AVC is multiplexed (for example, refer to Japanese Laid-Open Patent Application No. 2003-114845). Note that fragments are not used in the converted MP4. FIG. 6 is a flow chart showing the conversion operation. In step 101, the PES packet data is separated from the payload of an inputted TS packet by detecting the starting position of the PES packet. Next, in step 102, the starting position of an AU is detected from the payload of the separated PES packet. Steps 101 and 102 are repeated until the starting position of an AU is detected (Loop A), and the data for one AU is separated at the time of the detection. The display time information shown in the PES header is obtained in step 103, and ‘moov’ and ‘mdat’ are made in step 104. The processing from step 101 to step 104 is repeated until the processing of the last TS packet is completed (Loop B); in other words, Loop B is completed at the time when all the AUs in the media data have been processed. After that, in step 105, the data of ‘moov’ and ‘mdat’ are connected to each other, and the MP4 file is completed. FIG. 7 is a flow chart showing the processing of step 104 in detail. In step 201, the header information necessary for making the Boxes in ‘moov’ is obtained by analyzing the AU data. In step 202, the storage format of the NAL units composing the AU is converted from the byte stream format to the NAL size format.
In step 203, the Box data in ‘moov’ is made based on the header information of the AUs obtained in step 201. In step 204, the ‘mdat’ data is made. Note that the Box storing the per-sample header information, which includes the size and decoding time of each sample, is completed at the time when the processing of all the AUs in a ‘trak’ is completed.
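The accumulation performed in Loop B and the final concatenation of step 105 can be sketched as follows. This is a toy illustration under strong assumptions: the inputs are AUs already separated from PES packets and already converted to the NAL size format, and the ‘moov’ stand-in is a plain dictionary of per-sample sizes and times rather than real hierarchically placed Box data.

```python
def build_mp4(aus_with_times):
    """Toy sketch of steps 103 to 105: for each (au_bytes, display_time)
    pair, accumulate per-sample header entries for 'moov' and payload
    bytes for 'mdat', then assemble the 'mdat' Box."""
    sample_sizes = []
    sample_times = []
    mdat_payload = bytearray()
    for au, t in aus_with_times:          # corresponds to Loop B over all AUs
        sample_sizes.append(len(au))      # per-sample header info (step 104)
        sample_times.append(t)
        mdat_payload += au
    # 'mdat' Box: 4-byte size (including the 8-byte size/type header),
    # the type identifier, then the concatenated sample data.
    mdat = (8 + len(mdat_payload)).to_bytes(4, "big") + b"mdat" + bytes(mdat_payload)
    moov = {"sizes": sample_sizes, "times": sample_times}  # stand-in for Box data
    return moov, mdat                     # step 105 would write 'moov' then 'mdat'
```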
Depending on weather or other receiving conditions, a packet loss may occur at the time of receiving broadcast TS packets. Especially in the case where they are received by a mobile terminal such as a mobile phone, packet losses frequently occur. In the case of TS packets, it is possible to detect that AU data has been lost because of a packet loss by checking whether the continuity_counter value is discontinuous, but in the case of the MP4, it is impossible to indicate that sample data has been lost. This is the cause of the first problem: because data is lost, the sample data cannot be correctly decoded at the time of reproducing the MP4 file, resulting in video with degraded picture quality being displayed.
Further, at the time of converting the multiplex scheme from the TS to the MP4, the storage format of the NAL units must be converted from the byte stream format to the NAL size format. In the byte stream format, the boundary of a NAL unit is detected by detecting the start code added to the top of the NAL unit. However, in the case where the boundary part is lost because of a packet loss, the boundary cannot be detected, which makes it impossible to separate the data of the NAL unit correctly. Conventionally, there is no prescribed method for storing NAL unit data as an MP4 sample that can be used in the case where the NAL unit data cannot be separated correctly. This is the cause of the second problem: the data to be stored in a converted NAL unit cannot be determined.