1. Field of the Invention
The present invention relates to a transmitting device with discard control of specific media data.
2. Description of the Related Art
A transmitting device capable of transmitting different types of media data separately encodes media data that are generated simultaneously, packs the encoded data into frames of appropriate size, and then transmits the frames to a network. A receiving device, in turn, uses a separate decoder for each media type to decode the received media data frame by frame. Choosing an appropriate frame size for each type of media data is therefore important in multimedia data multiplexing.
FIG. 1 illustrates a functional configuration of a transmitting device that transmits different types of media data and a receiving device.
A transmitting device 1 encodes and transmits video and audio data in real time. A camera 4 for producing video signals and a microphone 5 for producing audio signals are connected to this transmitting device 1. This transmitting device 1 communicates with a receiving device 2 through an IP (Internet Protocol) network 3.
The transmitting device 1 has, for processing video data, a format converter 101, a video encoder 102, a video data buffer 103, and an RTP (Real-time Transport Protocol) header adding section 104. For processing audio data, the transmitting device 1 has an amplifier 110, an A/D (Analog/Digital) converter 109, an audio encoder 108, an audio data buffer 107, and an RTP header adding section 106. In addition, the transmitting device 1 has a UDP (User Datagram Protocol)/IP protocol stack section 105, which multiplexes multiple media data each having a UDP port number assigned, and outputs the multiplexed data to the IP network 3.
A video signal provided from the camera 4 is converted by the format converter 101 into a video format appropriate for compression encoding. The converted video signal is compression-encoded by the video encoder 102. The encoding scheme generates coded data with each frame (screen image) as a unit. An example of such an encoding scheme is ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) MPEG-4 (Moving Picture Experts Group-4) Visual. In low-bit-rate MPEG-4 encoding, one frame is generated every 100 milliseconds (10 frames per second). The generated bit stream data is provided to the video data buffer 103. The RTP header adding section 104 adds an RTP header to each frame of video data taken out of the buffer 103. The UDP/IP protocol stack section 105 adds UDP and IP headers to each RTP packet, and the packets are sent out to the IP network 3.
On the other hand, an audio signal from the microphone 5 is amplified by the amplifier 110 and converted by the A/D converter 109 into PCM digital data. The audio data is then compression-encoded by the audio encoder 108. The encoding scheme generates coded data in frames, each of which has a predetermined time length. An example of such an encoding scheme is 3GPP (3rd Generation Partnership Project) AMR (Adaptive Multi-Rate). In AMR, each frame is 20 milliseconds long. The generated bit stream is provided to the audio data buffer 107. The RTP header adding section 106 adds an RTP header to each frame of audio data taken out of the buffer 107. The UDP/IP protocol stack section 105 adds UDP and IP headers to each RTP packet, and the packets are sent out to the IP network 3.
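The header-adding step described above can be illustrated with a minimal Python sketch. It assumes the 12-byte fixed RTP header of RFC 3550 with no CSRC entries and no header extension. The function name `build_rtp_packet`, the payload type numbers 96 and 97, the SSRC values, and the payload sizes are hypothetical placeholders chosen for illustration only; they are not taken from the transmitting device 1.

```python
import struct

RTP_VERSION = 2  # version field defined by RFC 3550


def build_rtp_packet(payload: bytes, payload_type: int, seq: int,
                     timestamp: int, ssrc: int, marker: bool = False) -> bytes:
    """Prepend a 12-byte fixed RTP header (no CSRC list, no extension)."""
    byte0 = RTP_VERSION << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack(
        "!BBHII",                                 # network byte order, 12 bytes
        byte0,
        byte1,
        seq & 0xFFFF,                             # 16-bit sequence number
        timestamp & 0xFFFFFFFF,                   # 32-bit media timestamp
        ssrc & 0xFFFFFFFF,                        # 32-bit source identifier
    )
    return header + payload


# Hypothetical payloads: one coded audio frame and one coded video frame.
audio_packet = build_rtp_packet(bytes(17), payload_type=97, seq=0,
                                timestamp=0, ssrc=0x1234)
video_packet = build_rtp_packet(bytes(400), payload_type=96, seq=0,
                                timestamp=0, ssrc=0x5678)
print(len(audio_packet), len(video_packet))
```

In this sketch, the UDP and IP headers are not built explicitly; in practice they would be added by the operating system's protocol stack when the RTP packet is written to a UDP socket.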
The receiving device 2 is connected to the IP network 3 for receiving data packets and decodes received data packets according to the media data type of the packets.
Table 1 shows data structures of packet-multiplexed frames according to the conventional art.
TABLE 1
| IP Header | UDP Header | RTP Header | Audio Data |
| IP Header | UDP Header | RTP Header | Video Data |
According to the conventional art, audio data and video data must be carried in separate frames, because the frames of each media type are decoded by a decoder specific to that media type.
According to the conventional art described above, the following RTP packets are generated. The assumption here is that encoding parameters have typical values at a low transmission rate (64 kilobits/second).
Video encoding bit rate; 32 kilobits/second, Frame rate; 10 frames/second,
Audio encoding bit rate; 6,800 bits/second, Frame length; 20 milliseconds,
RTP header size; 12 bytes,
UDP header size; 8 bytes,
IP header size; 20 bytes.
For video data:
Video encoding bit rate; (32 kilobits/second)/(8 bits)=4 kilobytes/second,
Number of frames transmitted per second; 10 frames/second,
Video data packet transmission interval; 1 second/10 frames=100 milliseconds/frame,
Number of bytes per frame; (4 kilobytes/second)/(10 frames/second)=400 bytes/frame,
Size of 1 video packet; 400 bytes+12 bytes+8 bytes+20 bytes=440 bytes,
Transmission time of 1 video packet; (440 bytes×8 bits)/(64 kilobits/second)=55 milliseconds.
For audio data:
Audio encoding bit rate; (6,800 bits/second)/(8 bits)=850 bytes/second,
Number of frames transmitted per second; 50 frames/second,
Audio data packet transmission interval; (1 second)/(50 frames)=20 milliseconds/frame,
Number of bytes per frame; 850 bytes/second×20 milliseconds=17 bytes/frame,
Size of 1 audio packet; 17 bytes+12 bytes+8 bytes+20 bytes=57 bytes,
Transmission time of 1 audio packet; (57 bytes×8 bits)/(64 kilobits/second)=7.125 milliseconds.
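The packet arithmetic listed above can be reproduced with a short Python sketch. The function name `packet_stats` and the constant names are hypothetical; the numeric values are the encoding parameters and header sizes stated above, with 1 kilobit taken as 1,000 bits.

```python
IP_HEADER_BYTES = 20
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12
CHANNEL_BPS = 64_000  # transmission rate assumed in the text


def packet_stats(encoding_bps: int, frames_per_second: int):
    """Return (payload bytes, packet bytes, transmission time in ms) per frame."""
    payload_bytes = encoding_bps / 8 / frames_per_second
    packet_bytes = (payload_bytes + RTP_HEADER_BYTES
                    + UDP_HEADER_BYTES + IP_HEADER_BYTES)
    # Multiply before dividing so the result is in milliseconds.
    tx_ms = packet_bytes * 8 * 1000 / CHANNEL_BPS
    return payload_bytes, packet_bytes, tx_ms


video = packet_stats(encoding_bps=32_000, frames_per_second=10)
audio = packet_stats(encoding_bps=6_800, frames_per_second=50)
print("video:", video)  # 400 bytes payload, 440 bytes packet, 55 ms
print("audio:", audio)  # 17 bytes payload, 57 bytes packet, 7.125 ms
```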
The audio data packets are transmitted at intervals of 20 milliseconds, whereas the video data packets are transmitted at intervals of 100 milliseconds. Accordingly, a video data packet is interposed once in every five audio data packet transmissions.
FIG. 2 illustrates a transmission sequence of frames according to the conventional art.
An audio data packet is transmitted every 20 milliseconds whereas a video data packet is transmitted every 100 milliseconds. An audio data packet consists of a 40-byte-long header (IP, UDP, and RTP) portion and a data portion. If the data portion is 17 bytes long, transmission of the audio data packet takes 7.125 milliseconds. A video data packet consists of a 40-byte-long header portion and a data portion. If the data portion is 400 bytes long, transmission of the video data packet takes 55 milliseconds.
Suppose, for example, that an audio data packet A3 in FIG. 2 is transmitted immediately after a video data packet V1. The video data packet V1 arrives at the receiving device 55 milliseconds after the transmitting device starts to transmit it. The audio data packet A3 arrives 7.125 milliseconds after that, that is, 62.125 milliseconds after the transmission of V1 started. Accordingly, although audio data packets are sent at intervals of 20 milliseconds, an audio data packet that immediately follows a video data packet reaches the receiving device only about 62 milliseconds after the video data packet started to be transmitted. Subsequently, audio data packets A4, A5, A6 and A7 arrive at the receiving device in succession. Because the time required to transmit an audio data packet, about 7 milliseconds, is short compared with the transmission interval of 20 milliseconds, the arrival interval of the audio data packets at the receiving device gradually returns to its original length. However, the arrival delay varies again when the next video data packet is inserted between audio data packets.
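The delay variation described above can be illustrated with a simplified simulation. The following Python sketch is a model under stated assumptions, not a reproduction of FIG. 2: packets are sent first-in, first-out over a single 64 kilobits/second link, propagation delay is ignored, and the first video frame is assumed to become ready at the same instant as audio packet A3 (a hypothetical offset of 40 milliseconds) and to be sent ahead of it. The function names are illustrative.

```python
CHANNEL_BPS = 64_000
AUDIO_PACKET_BYTES = 57    # 17-byte payload + 40-byte headers
VIDEO_PACKET_BYTES = 440   # 400-byte payload + 40-byte headers


def tx_ms(size_bytes: int) -> float:
    """Time in milliseconds to serialize one packet onto the link."""
    return size_bytes * 8 * 1000 / CHANNEL_BPS


def simulate(duration_ms: int = 200, video_offset_ms: int = 40) -> dict:
    """Return the delay (ready time to arrival) of each packet in ms."""
    packets = []
    for i, t in enumerate(range(0, duration_ms, 20), start=1):
        packets.append((t, 1, f"A{i}", AUDIO_PACKET_BYTES))
    for i, t in enumerate(range(video_offset_ms, duration_ms, 100), start=1):
        # Priority 0 sorts a video packet ahead of an audio packet
        # that becomes ready at the same instant (assumption).
        packets.append((t, 0, f"V{i}", VIDEO_PACKET_BYTES))
    packets.sort()

    link_free_at = 0.0
    delays = {}
    for ready, _, name, size in packets:
        start = max(ready, link_free_at)   # wait while the link is busy
        link_free_at = start + tx_ms(size)
        delays[name] = link_free_at - ready
    return delays


delays = simulate()
for name, delay in delays.items():
    if name.startswith("A"):
        print(f"{name}: {delay:.3f} ms")
```

Under these assumptions, A1 and A2 are delayed by 7.125 milliseconds each, A3 by 62.125 milliseconds, and the delay of the following audio packets shrinks step by step until the next video packet is queued.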
Audio data must be continuous when it is reconverted into an analog signal and played back. In this respect, audio data differs from frame-by-frame based discrete data such as video data. Therefore, the receiving device stores audio data packets in a buffer before playback in accordance with the maximum amount of delay of arrival of audio data packets. Delay variations of audio data packets require a large amount of buffer storage at the receiving end and thereby increase delay in the entire transmission.
Furthermore, according to the conventional art, RTP, UDP and IP headers must be added to each of video and audio data packets individually. Accordingly, the following header overhead is required:
(12 bytes+8 bytes+20 bytes)×8 bits×(50 audio frames+10 video frames)=19.2 kilobits/second.
One method for reducing delay variations of audio data packets divides each video data packet into sub-packets and transmits video data sub-packets and audio data packets alternately. In this case, jitter in the arrival time of audio data packets can be avoided because both audio and video packets are transmitted at equal intervals. However, this method further increases header overhead and significantly impairs transmission efficiency, as follows:
(12 bytes+8 bytes+20 bytes)×8 bits×(50 audio frames+50 video frames)=32 kilobits/second.
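The two overhead figures above follow from the same formula, which the following Python sketch evaluates. The function name `header_overhead_bps` is hypothetical; the header sizes and packet rates are those given in the text, and the 64 kilobits/second channel rate is the one assumed earlier for the packet size calculation.

```python
HEADER_BYTES = 12 + 8 + 20  # RTP + UDP + IP headers per packet
CHANNEL_BPS = 64_000        # channel rate assumed in the text


def header_overhead_bps(audio_packets_per_s: int, video_packets_per_s: int) -> int:
    """Bits per second spent on RTP/UDP/IP headers alone."""
    return HEADER_BYTES * 8 * (audio_packets_per_s + video_packets_per_s)


separate = header_overhead_bps(50, 10)      # one packet per video frame
alternating = header_overhead_bps(50, 50)   # video split into sub-packets

print(separate, separate / CHANNEL_BPS)         # 19200 bits/s, 0.3 of the channel
print(alternating, alternating / CHANNEL_BPS)   # 32000 bits/s, 0.5 of the channel
```

On a 64 kilobits/second channel, headers therefore consume 30% of the capacity in the first case and 50% when video sub-packets alternate with audio packets.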
Known prior art includes IETF RFC 3550 (Internet Engineering Task Force Request for Comments 3550), which defines RTP; IETF RFC 3016, which defines an RTP payload format for carrying MPEG-4 video data; and IETF RFC 3267, which defines an RTP payload format for carrying AMR data.
If the bit rate of a transmission channel is lower than that of multiplexed frames, the number of multiplexed frames stored in a multiplex buffer increases with time. On the Internet where no bandwidth guarantees are provided, or mobile communication networks where radio communication environments tend to change, the bit rate of a transmission channel significantly changes. For audio data in real-time communications, audible discontinuities occur when the audio data does not arrive at desired time intervals. For video data, users perceive jitter as variations of frame intervals and display delay. Thus, according to the conventional art, transmission of media data of one type affects delay variations in transmission of media data of another type.
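The growth of the multiplex buffer when the channel is slower than the multiplexed stream can be sketched as a simple per-second backlog model. The following Python sketch is illustrative only: the function name `backlog_history` and the sequence of channel rates are hypothetical, and the model ignores frame boundaries by counting bits rather than frames.

```python
def backlog_history(input_bps: int, channel_bps_per_second: list) -> list:
    """Return the buffered backlog in bits at the end of each second."""
    backlog = 0
    history = []
    for channel_bps in channel_bps_per_second:
        # Bits arrive at input_bps and drain at the current channel rate;
        # the backlog cannot go below zero.
        backlog = max(0, backlog + input_bps - channel_bps)
        history.append(backlog)
    return history


# Hypothetical channel whose rate fluctuates around a 64 kbit/s multiplexed stream.
channel_rates = [64_000, 48_000, 32_000, 64_000, 80_000]
print(backlog_history(64_000, channel_rates))
# [0, 16000, 48000, 48000, 32000]
```

In this model, the backlog accumulates whenever the channel rate drops below the input rate and drains only when the channel rate later exceeds it, so every frame queued behind the backlog is delayed in the meantime.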
Especially in real-time communications such as IP video telephony, at least audio data must be transmitted without delay so that conversations can be carried out. Therefore, in an environment where the bit rate of a transmission channel changes, control is required for reliably transmitting audio data.