The present invention relates to a data transmission method and a data transmission system for transmitting a plurality of streams over a network when communicating, streaming, etc. among a plurality of terminals.
Below, an explanation will be made of a conventional method for transmitting a plurality of streams over a network when communicating, streaming, etc. among a plurality of terminals, with reference to the drawings.
FIG. 1 is a view of an example of a television (TV) conference system.
In this TV conference system, a conference is carried out simultaneously by using five terminals, terminal 1 to terminal 5, each with a camera CMR mounted thereon.
The terminals 1 to 5 are connected via switches SW1 to SW4, routers RT1 to RT3, and an ISDN network NTW1.
The signals (video and audio) from the terminals 1 to 5 are assembled at a multipoint control unit (MCU) 6, where they are combined into the signals to be reproduced at each terminal.
The MCU 6 has mainly two functions. One is that of a block of a multipoint controller (MC) 6A for controlling which terminals are attending the conference, while the other is that of a multipoint processor (MP) 6B for combining signals assembled from multiple points for every terminal.
FIGS. 2A and 2B are views of the structure of data flowing over the network and an amount of transmission in the TV conference system of FIG. 1.
As shown in FIG. 2A, signals (A1, V1) transmitted from the terminal 1 pass through the switch SW1, router RT1, ISDN network NTW1, router RT2, and switch SW4 to be transmitted to the MCU 6.
Similarly, the signals transmitted from the terminals 2, 3, 4, and 5 are transmitted to the MCU 6. The signals assembled at the MCU 6 are combined as follows for every terminal:

Terminal 1: (A2-3-4-5, V2-3-4-5)
Terminal 2: (A1-3-4-5, V1-3-4-5)
Terminal 3: (A1-2-4-5, V1-2-4-5)
Terminal 4: (A1-2-3-5, V1-2-3-5)
Terminal 5: (A1-2-3-4, V1-2-3-4)
Here, A denotes audio, and V denotes video. Further, (,) of (A1,V1) indicates that each signal is separated, and (-) of (A1-2-3-4) indicates that the signals are combined.
“Combined” means that the signals are added in a baseband state (for example PCM) in the case of the audio.
In the case of the video, it means that the signals are combined to one having the same image size by reducing the sizes of the images in the baseband (pixel) state and joining the plurality of images with each other in one frame.
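The two combining operations described above can be sketched as follows. This is a minimal illustration only, assuming 16-bit PCM audio samples and video frames represented as 2D lists of pixels; the helper names `mix_audio` and `compose_video` are hypothetical and do not appear in the conventional system.

```python
# Sketch of baseband "combining": audio is added sample by sample in the
# PCM domain, and video frames already reduced to quarter size are joined
# into one frame of the original size.

def mix_audio(streams):
    """Add PCM samples from several audio streams, clipping to 16-bit range."""
    mixed = []
    for samples in zip(*streams):
        s = sum(samples)
        mixed.append(max(-32768, min(32767, s)))  # clip to int16 range
    return mixed

def compose_video(frames):
    """Tile four quarter-size frames (2D pixel lists) into one frame.

    Each input frame is assumed already reduced to half width and half
    height, so the four frames fill one output frame of the original size.
    """
    top = [l + r for l, r in zip(frames[0], frames[1])]     # upper half
    bottom = [l + r for l, r in zip(frames[2], frames[3])]  # lower half
    return top + bottom
```

For example, combining four reduced frames of one pixel each yields one 2 x 2 frame, matching the description of joining a plurality of images in one frame.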
The data structure of the signal flowing over the network shown in FIG. 2A becomes as shown in FIG. 2B.
Namely, the data has the same amount of information before and after the composition. The audio and video are formed into different packets and multiplexed (MUX) in packet units. Further, data is also multiplexed in addition to the audio and video.
When arranged in this way, it is understood that, in terms of the amount of information of the signals flowing over the network of the TV conference system, signals amounting to 20 times this data structure flow in total.
Next, a case where the TV conference system is applied to wireless telephones will be considered.
FIG. 3 is a view of the topology in the case where the TV conference system is applied to wireless telephones. In other words, FIG. 3 is a view of an example of the configuration of multipoint communication. In this example as well, the case where five terminals MT (Mobile Terminal) 1 to MT5 communicate is shown.
The terminals MT1 to MT5 are connected via mobile base stations (MBS) 11A to 11D arranged in the network, mobile switching centers (MSC) 13A to 13C with the MCUs 12A to 12C connected thereto, and further gateway mobile switching centers (GMSC) 14A to 14E having home location registers (HLR).
The center portion is a network wherein the GMSCs 14A to 14E are connected in a so-called mesh state (for example circuit switched network or a packet switching network).
A great difference from the TV conference system resides in that there are many MCUs in the network, and the MCU located nearest each terminal multiplexes the signals of the multiple points.
That is, the MCU has the function of an MC and the function of an MP, in the same way as in the TV conference system. However, one MC among the plurality of MCUs controls one communication, while a plurality of MPs are controlled by this one MC and perform the multiplexing.
FIGS. 4A and 4B are views of the structure of the data flowing in the network and the amount of transmission in the multipoint communication of FIG. 3.
As shown in FIG. 4A, unlike the TV conference system, there are a plurality of MCUs, so the signals of the multiple points must all be transferred to a plurality of MCUs 12A to 12C. Accordingly, the signals (A1,V1) transmitted from for example the terminal MT1 are transmitted to the MCU 12A, MCU 12B, and MCU 12C.
The data structure of the signals (A1, V1) becomes as shown in FIG. 4B. Since the channel is narrow, unlike in the TV conference system, the image sent from each terminal is transmitted already reduced to its size after composition.
Further, when looking at the MCU 12A, two patterns are combined in the following way from the five collected signals, for the terminals MT1 and MT2:

MT1: (A2-3-4-5, V2-3-4-5)
MT2: (A1-3-4-5, V1-3-4-5)
The data structure of this signal is indicated by numeral 15 in FIG. 4B. This becomes the same as the combined one in the TV conference system. Note that, due to differences in the bandwidths of the wireless or other channels, the size of the images, the quality of the audio, etc. differ from those of the TV conference utilizing an ISDN network.
In this way, since the composed signals do not flow over the network beyond the GMSCs, the structure of the data flowing at this layer takes the format indicated by reference numeral 16 in FIG. 4B. The amount of transmission also becomes 15 times this data structure.
In this way, it is understood that, by arranging a plurality of MCUs, the amount of the data flowing over the entire network is reduced somewhat in comparison with the TV conference system.
Further, by giving the terminal side the function of simultaneously decoding a plurality of streams, the MCU side can multiplex the data in packet units without composition at the baseband level. This situation will be shown in FIGS. 5A and 5B.
In this case, looking at the MCU 12A, the signals combined for the terminals MT1 and MT2 become as follows:

MT1: (A2, 3, 4, 5, V2, 3, 4, 5)
MT2: (A1, 3, 4, 5, V1, 3, 4, 5)
This data structure becomes as shown in FIG. 5B. The example of FIG. 5B shows the situation where the data is multiplexed in packet units.
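Multiplexing in packet units without baseband composition, as described above, amounts to interleaving the packets of the selected streams while tagging each with its source stream. The following is an illustrative sketch only; the helper name `mux_packets` and the round-robin order are assumptions, as the text does not specify the MCU's packet scheduler.

```python
# Sketch of packet-unit multiplexing: the MCU forwards packets of the
# selected streams as-is, instead of decoding and recombining them at
# the baseband level.

def mux_packets(streams):
    """Interleave packets from several streams in packet units.

    `streams` maps a stream ID (e.g. "A2", "V2") to its list of packets.
    Returns a list of (stream_id, packet) pairs in round-robin order.
    """
    out = []
    queues = {sid: list(pkts) for sid, pkts in streams.items()}
    while any(queues.values()):
        for sid in list(queues):
            if queues[sid]:
                out.append((sid, queues[sid].pop(0)))
    return out
```

On the terminal side, this presupposes the ability to decode the interleaved streams simultaneously, as stated above.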
Next, an explanation will be given of the operation of an MCU in multipoint communication.
FIG. 6 is a view of an example of the configuration of a conventional MCU used for multipoint communication.
Note that, in this example, the explanation will be made treating the three existing MCUs as one MCU 12.
There are time differences among the signals collected from the terminals MT1 to MT5 in reaching the MCU 12.
In order to equalize these, the MCU 12 inserts delay units DLY1 to DLY5 for the signals to match their phases, then demultiplexes the plurality of signals at the demultiplexers DMX1 to DMX5 provided in the MP, passes them through a switcher (buffer) BF, and combines them at the multiplexers MX1 to MX5 for every terminal.
This delay amount and the demultiplexing and multiplexing at the MP are performed according to instructions of the MC.
Next, how this delay time is controlled will be explained.
FIGS. 7A, 7B and 7C and FIGS. 8A, 8B and 8C are views for explaining a situation where the video and audio are encoded and decoded.
(Explanation of Video Encoding)
First, an explanation will be made of the video encoding in relation to FIG. 7A.
1) in FIG. 7A indicates a vertical synchronization signal V Sync. The bold lines represent frames. A frame is an access unit of the video and is generally used as the unit for compression of the amount of information. Further, depending on the method of compression, there may be I-pictures and P-pictures. An I-picture is a picture compressed utilizing the correlation within a frame, while a P-picture is a picture compressed utilizing the correlation among frames. The numerals after the picture type indicate the sequence of the input frames.
The picture input as in 2) in FIG. 7A is encoded at a time 4).
5) in FIG. 7A indicates an image of the buffer existing inside the encoder. What is depicted is the inverse of the virtual decoder buffer (VBV buffer) rather than the operation of the actual buffer; this corresponds to a virtual buffer existing inside the controller for controlling the rate.
Accordingly, data is instantaneously generated in this buffer when the encoding is completed. The bold line shows this situation.
3) in FIG. 7A indicates the value of an STC (system time clock) when each access unit of the video is input to the encoder. This STC illustrates an absolute clock in a telephone network. All systems and terminals are assumed as operating with the same clock and time.
6) in FIG. 7A indicates a DTS (decoding time stamp), which indicates the timing at which the access unit that finished being encoded at 5) starts being decoded on the reproduction side.
This value is transmitted together with the access units of the video when they are formed into packets and multiplexed. Accordingly, for the I0 picture, a value such as STC-V6 is transmitted. When the system clock reaches this time, the decoding is started.
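The decode-start rule just described (decoding of an access unit begins when the system time clock reaches its transmitted DTS) can be illustrated as follows. The helper name `ready_to_decode` is hypothetical and the clock units are arbitrary.

```python
def ready_to_decode(stc, buffered_units):
    """Return the access units whose DTS the system time clock (STC)
    has reached, i.e. those whose decoding should start now.

    `buffered_units` is a list of (access_unit_name, dts) pairs held
    in the decoder buffer.
    """
    return [au for au, dts in buffered_units if stc >= dts]
```

Since all terminals are assumed to operate with the same clock and time, comparing each DTS with the shared STC suffices to trigger decoding.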
(Explanation of Encoding of Audio Related Information)
Next, an explanation will be made of the audio encoding in relation to FIG. 7B and FIG. 8A.
In audio, unlike video, there is no concept of a discrete access unit such as a frame. However, the audio is fetched in the form of access units of every fixed number of samples.
8) in FIG. 7B and FIG. 8A show the situation where an AAU (audio access unit) is input into the encoder. 7) is the time when the AAU is input. 9) is the time when the encoding is actually carried out, while 10) indicates the situation where data is generated in the virtual buffer at the instant when the encoding is completed. 11) is the timing when each AAU is decoded. This value is multiplexed together with the AAU and transmitted to the decoder side.
(Explanation of Video Decoding)
Next, an explanation will be made of the video decoding in relation to FIG. 7C and FIG. 8B.
The bit stream (compressed signal) generated in the buffer in 5) of FIG. 7A starts to be transmitted while the state of the buffer on the decoder side is monitored. The data is accumulated in the decoder buffer.
This situation is shown in 12) of FIG. 7C and FIG. 8B. Here, the state of the virtual buffer (VBV buffer) is illustrated.
13) of FIG. 7C and FIG. 8B indicates the timing when the decoding is carried out matching the time of the STC of 15). Here, it is supposed that the decoding is ideally instantaneously completed and, simultaneously with the completion of the decoding, the data is output as shown in 14).
Here, the time from the instant when the signal is input to the encoder (terminal) to when the signal is output from the decoder (terminal) is defined as the end-to-end delay. Namely, that time is shown in 15) of FIG. 7C and FIG. 8B. This is the same for all access units, both video and audio.
The state where the video and audio become out of phase is defined as “lip-sync deviation”. Deviation between the same video or between the same audio is defined as “jitter”.
(Explanation of Decoding of Audio Related Information)
Next, an explanation will be made of the audio decoding in relation to FIG. 8C.
As shown in 16) in FIG. 8C, the audio is transmitted with a delay so as to match the end-to-end delay of the video. The data is accumulated in the decoder buffer.
The timing of decoding is determined for every AAU shown in 17), matching the value of the STC of 19) in FIG. 8C. The decoding is instantaneously completed at this timing, and the data is output from the decoder immediately thereafter.
As described above, the information concerning the video and audio is synchronized by transmitting a time stamp such as the DTS. Further, the system is controlled so that no underflow or overflow of the buffers occurs.
By utilizing the DTS shown in FIGS. 7A to 7C and FIGS. 8A to 8C, it is possible to achieve synchronization among multiple points. This situation is shown in FIG. 9.
In the example of FIG. 9, the signals of the terminals MT1 and MT2 reach the MCU 12A without passing through the GMSC 14.
Contrary to this, the signals of the terminals MT3, MT4, and MT5 reach the MCU 12A after passing through the GMSC 14.
Accordingly, as shown by the time differences of the packets transmitted from the terminals indicated by symbol TM1 in FIG. 9, the signals of the terminal MT3 (T3-AU1, AU2), terminal MT4 (T4-AU1, AU2), and terminal MT5 (T5-AU1, AU2) arrive delayed in comparison with those of the terminal MT1 (T1-AU1, AU2) and the terminal MT2 (T2-AU1, AU2).
The MCU 12A analyzes the DTS from each packet, controls the delay units in the MCU to match the phases of the signals from the terminals, and then multiplexes and combines the signals.
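Since all terminals share the same STC, the extra delay each stream needs can be derived from the lag between its DTS and its arrival time at the MCU. The sketch below, with the hypothetical helper `align_delays`, illustrates this phase matching; the actual delay-unit control inside the MCU is not detailed in the text.

```python
def align_delays(arrivals):
    """Compute the extra delay to insert for each stream so phases match.

    `arrivals` maps a terminal ID to (arrival_time, dts) of one packet,
    both in the same clock units (the common STC of the network).
    The stream that arrived with the largest transfer latency needs no
    extra delay; every other stream is delayed by the difference.
    """
    latency = {t: at - dts for t, (at, dts) in arrivals.items()}
    worst = max(latency.values())
    return {t: worst - lag for t, lag in latency.items()}
```

For example, a stream arriving via the GMSC with 30 units of latency forces a stream arriving directly with 10 units of latency to be delayed by a further 20 units before multiplexing.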
In this way, it becomes possible to make the phases of the signals from all of the terminals completely match at each of the terminals MT1 and MT2 as shown in the situation of reproduction and display at each terminal shown by the reference symbol TM2 in FIG. 9.
Further, in recent years, Internet telephone and other services using the Internet have been started.
In the Internet, the bandwidth is often not guaranteed. Therefore, it is an area where the quality of service (QoS) is low. When using such a network, it is necessary to monitor the state of congestion and control the signal transmitted to the network in accordance with that state.
FIG. 10 is a view of an example of the configuration of a multipoint communication system utilizing only a network having a low QoS.
As a network having a low QoS, here, the case of utilizing the Internet is shown.
In FIG. 10, the terminals are indicated by MT1 to MT4 in the same way as the above. Further, 21A to 21C denote MBSs, 22A and 22B denote MSCs, 23A and 23B denote MCUs, 24A and 24B denote packet switching networks, 25A and 25B denote Internet exchanges (IX), and 26 denotes the Internet.
The signals originating from the terminals are all transmitted to the packet switching networks 24A and 24B by the MSCs 22A and 22B. Here, the MCUs 23A and 23B for multiplexing the signals of the multiple points are arranged in these packet switching networks.
The MCU 23A, which prepares the signals to be transmitted to the terminals MT1 and MT2, receives the signals from the terminals MT1, MT2, MT3, and MT4, multiplexes them, and sends them to the terminals MT1 and MT2.
Here, the data of the terminals MT3 and MT4 is transmitted through the Internet 26, so the transmission delay is greatly affected by the state of congestion of the network.
At this time, in order to detect congestion, RTCP (RTP control protocol) is utilized to monitor the RTT (round trip time).
When the RTT fluctuates by more than the allowable amount of end-to-end jitter, the amount of data transmitted over the network is reduced so as to ease the state of congestion and avoid further congestion.
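A simple rate back-off policy driven by RTT fluctuation might look like the following. The 25% back-off step and the jitter threshold are illustrative assumptions, as is the helper name `adjust_rate`; the text only states that the transmitted amount is controlled when the RTT fluctuates beyond the allowable jitter.

```python
def adjust_rate(current_rate, rtt_samples, allowed_jitter, step=0.75):
    """Reduce the transmission rate when RTT fluctuation exceeds the
    allowable end-to-end jitter; otherwise keep the rate unchanged.

    `rtt_samples` are round-trip times reported via RTCP, in the same
    time units as `allowed_jitter`.
    """
    jitter = max(rtt_samples) - min(rtt_samples)  # observed fluctuation
    if jitter > allowed_jitter:
        return current_rate * step  # back off to ease congestion
    return current_rate
```

In practice the sender would apply this check on every RTCP report interval, gradually restoring the rate once the RTT stabilizes.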
Summarizing the problem to be solved by the invention, there are the following problems.
(Problem 1)
Conventionally, all of the signals of the multiple points have been gathered at the MCU (multipoint control unit) for combining the signals of the multiple points, which then composed the signals required for each terminal. For this reason, a large number of signals had to be transmitted over the network.
(Problem 2)
Conventionally, when combining signals of multiple points, in order to match the times of the signals of the multiple points, the times taken for the transfers were canceled out and the phases were matched by inserting delays. In order to realize this, delay units compensating for large delays were necessary.
(Problem 3)
When transferring a plurality of signals such as video and audio signals among two or more points, signals of greater importance and signals of lesser importance from the viewpoint of the continuity of the signals are frequently mixed together among the plurality of signals.
For example, when comparing the video and audio, the continuity is more important in the audio.
These signals are transmitted over bands having the same QoS, so the transmission cost becomes high.
Further, from the viewpoint of effective utilization of the bands, the utilization efficiency is low.
(Problem 4)
When utilizing different bands, a plurality of signals (for example audio and video) are transmitted through a plurality of transmission lines. At this time, since the delay values of the signals flowing over the transmission lines are different, if the signals are recombined as they are, a plurality of signals will end up out of phase. In the case of audio and video, this will result in lip-sync deviation and an extremely strange feeling. In some cases, the signals could become even more out of phase than with lip-sync deviation.
(Problem 5)
When communicating by utilizing only a network having a low QoS, there is a possibility of large jitters or large delay occurring in accordance with the state of congestion of the network.
In order to enlarge the permissible value of such jitter in the network, a large delay unit (buffer) becomes necessary somewhere in the system. In one-way streaming, delivery of a continuous signal is made possible by this method.
However, in two-way communication, if a large delay is inserted, the responses to each other become offset and conversation ends up becoming impossible.
Further, if a state of congestion occurs in the network, the audio will be interrupted. Not only is it then difficult to use this system as a communication tool, but there is also the problem that, once congestion occurs, the system cannot be restored for a long time.