The present invention relates to a data transmission method for transmitting a data packet, and to an apparatus therefor; and in particular to a data transmission method for transmitting voice data and image data for video conferences in the form of packets, and to an apparatus therefor.
Personal computers (PCs) have become widespread and are now necessities for office and home users. Accordingly, the use of applications for PCs has increased, and PCs are now used not only as office automation (OA) devices but also as media for information exchange. For example, the so-called video conference system (or television conference system) is gaining attention: in such a system, remote conference sites are connected by a communication line, and the voice data and image data exchanged are processed by the PCs.
For such a video conference system, the ISDN (Integrated Services Digital Network) is ordinarily used as the communication medium and desktop PCs are employed as processors. The ISDN is a digital data transmission network that can theoretically allocate, on a single communication line, two channels each for voice transmission and data transmission. That is, the ISDN is a transmission medium that can handle all so-called multimedia, such as text, data, still pictures and motion pictures, in addition to voice over a telephone. Desktop PCs are employed for video conferencing because, in addition to being widely used, they suit the assumption that individual conference members remain at specific locations in their offices.
As a result of recent major technical developments, light and compact PCs, so-called notebook PCs, have been produced. Almost all notebook PCs are battery operated and can be carried about and used outside, i.e., in a mobile environment. Thus, there has been an increased demand for holding video conferences in a mobile environment.
To realize a video conference in a mobile environment, one problem is the choice of data transmission medium. While the above-described ISDN provides high performance, it is expensive and is not yet widely available. If the ISDN is employed, the connecting points will be very limited, so that mobility is lost. On the other hand, the Plain Old Telephone System (POTS) is inexpensive and widely available. The participants in a conference can, at any point where a telephone jack is provided, connect their notebook PCs to the plain old telephone system by using a conventional device such as a modem. Therefore, there is an increasing demand for employing the plain old telephone system as a communication medium for video conferences.
FIG. 9 of the drawings for this disclosure is a diagram illustrating the arrangement of a video conference network in which the plain old telephone system and PCs are used. The PCs are connected to the plain old telephone network via their modems (M). Some PCs may be connected to an in-house telephone network via a PBX (Private Branch Exchange). It should be noted that, though not illustrated, each PC has the hardware components that are required for a video conference, such as a video camera for capturing the appearance of a user; a video capture board/controller for digitizing an input image and loading the digital data into the computer; a microphone and a loudspeaker for inputting and outputting voice; and an audio controller for processing voice data that are to be input or output.
To implement a video conference using the plain old telephone system, the quantity of data that must be transmitted is the biggest problem. Since data for a video conference include voice and images, the total quantity of such data that must be transmitted is much greater than the bandwidth available on one telephone line, i.e., its maximum transmission rate. For past video conference systems that used the plain old telephone system, only simple solutions were applied: (1) transmission of voice was abandoned, or (2) separate dedicated lines were provided for the transmission of voice data and of image data. Recently, however, as data compression techniques have improved and the ability of a CPU to process voice and images has increased, voice data and image data can be mixed together (or multiplexed) and the resultant data transferred across a single telephone line.
Communication on a network, such as a telephone line, is generally performed by using packets, i.e., by dividing a string of data into packets of a fixed bit length. A packet consists of a data portion, which contains the substance of the data to be transmitted, and a header portion, which contains attribute/control information for the data to be transmitted. Usually, voice data and image data are coded and compressed before they are mixed and divided into packets.
For the transmission of voice data and image data for a video conference by using a single telephone line, priority should be given to voice data. Interruptions in voice not only make the participants uncomfortable but also make conversation impossible, so the real-time requirement is stricter for voice than for images. Thus, when voice data and image data are transmitted at the same time, a band in a packet is reserved for voice data first, and the remaining area is given to image data. It should be noted that this necessarily delays image data, because the same communication path is shared.
FIG. 10 of the drawings of this disclosure is a diagram of an example packet structure for transmitting voice data and image data. One packet is 288 bits long. This corresponds to the quantity of data for 20 msec (1/50 sec) when a modem with a maximum transmission rate of 14.4 kbps is used. The assignment of data fields in the packet falls into two types, depending on whether voice data are included.
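The packet timing stated above can be checked with a short calculation. The following is only an illustrative sketch; the function names are hypothetical and are not part of the disclosure.

```python
PACKET_BITS = 288          # packet length given in the text
MODEM_RATE_BPS = 14_400    # 14.4 kbps modem given in the text


def packet_duration_ms(packet_bits, rate_bps):
    """Time needed to send one packet, in milliseconds."""
    return packet_bits * 1000 / rate_bps


def packets_per_second(packet_bits, rate_bps):
    """Number of packets the line carries each second."""
    return rate_bps / packet_bits


print(packet_duration_ms(PACKET_BITS, MODEM_RATE_BPS))   # 20.0
print(packets_per_second(PACKET_BITS, MODEM_RATE_BPS))   # 50.0
```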
FIG. 10(a) shows the structure of a packet that includes voice data (also called a "VOD (Voice Over Data) packet"). The first significant bit is "SYNC", which is used for synchronization. The second significant bit is a GSM bit indicating whether or not voice data are carried in the packet. GSM is an abbreviation of Global System for Mobile communication. The voice coding algorithm in GSM is well known as a Regular Pulse Excited-Linear Predictive Coder (RPE-LPC). When voice data are included, this voice flag (also called a "Voice Activity bit (voice input/output monitor bit)") is set (ON). The SYNC bit and the GSM bit constitute the header portion of the packet. Beginning at the third significant bit, the remaining bits are reserved for the data portion. Six bits, from the third through the eighth bit, are employed for a CRC (Cyclic Redundancy Check), i.e., for the detection of transmission errors. The 264 bits from the ninth through the 272nd bit are assigned to voice data (the leading four of these 264 bits are used as parity bits). The voice data that are to be transmitted are coded and compressed by, for example, the GSM algorithm. The remaining 16 bits, from the 273rd through the 288th bit, are assigned to image data. The image data are coded and compressed by, for example, MPEG (Moving Picture Experts Group) 1 or H.261. H.261 is a compression algorithm that conforms to an ITU (International Telecommunication Union) recommendation. By using this packet, voice data are transmitted at a maximum transmission rate of 13 kbps (= 260 bits ÷ 20 msec).
FIG. 10(b) shows the structure of a packet that does not contain voice data (also called a "NON VOICE packet"). The first significant bit, SYNC, is used for synchronization. The second significant bit is the GSM bit indicating whether or not voice data are included in the packet. When voice data are not carried in the packet, the Voice Activity flag is reset (OFF). The SYNC bit and the GSM bit serve as the header portion of the packet. Beginning at the third significant bit, the remaining bits are reserved for the data portion. In this case, the entire remaining band of 286 bits, from the third through the 288th bit, is given to image data. The image data are coded and compressed by MPEG1 or H.261, as described above.
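The two field layouts described for FIG. 10 can be sketched as bit-packing routines. This is a minimal illustration under stated assumptions: the function names are hypothetical, a packet is modeled as a 288-bit integer with the first significant bit on top, and the SYNC value and CRC value are simply supplied by the caller because the text does not specify how they are generated.

```python
PACKET_BITS = 288
CRC_BITS = 6
VOICE_BITS = 264             # includes the four parity bits
VOD_IMAGE_BITS = 16
NON_VOICE_IMAGE_BITS = 286


def _pack(fields):
    """Concatenate (value, width) fields, first field in the top bits."""
    word = 0
    total = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
        total += width
    if total != PACKET_BITS:
        raise ValueError(f"fields cover {total} bits, expected {PACKET_BITS}")
    return word


def build_vod_packet(sync, crc, voice, image):
    """Voice-carrying packet: SYNC, GSM bit = 1, CRC, voice, image."""
    return _pack([(sync, 1), (1, 1), (crc, CRC_BITS),
                  (voice, VOICE_BITS), (image, VOD_IMAGE_BITS)])


def build_non_voice_packet(sync, image):
    """Packet without voice: SYNC, GSM bit = 0, then 286 image bits."""
    return _pack([(sync, 1), (0, 1), (image, NON_VOICE_IMAGE_BITS)])


def _field(word, low_bit, width):
    """Extract `width` bits whose lowest bit sits at position `low_bit`."""
    return (word >> low_bit) & ((1 << width) - 1)


def parse_packet(word):
    """Split a 288-bit packet into its fields according to the GSM bit."""
    parsed = {"sync": _field(word, PACKET_BITS - 1, 1),
              "has_voice": bool(_field(word, PACKET_BITS - 2, 1))}
    if parsed["has_voice"]:
        parsed["crc"] = _field(word, VOICE_BITS + VOD_IMAGE_BITS, CRC_BITS)
        parsed["voice"] = _field(word, VOD_IMAGE_BITS, VOICE_BITS)
        parsed["image"] = _field(word, 0, VOD_IMAGE_BITS)
    else:
        parsed["image"] = _field(word, 0, NON_VOICE_IMAGE_BITS)
    return parsed


vod = build_vod_packet(sync=1, crc=0b101010, voice=0x1234, image=0xBEEF)
print(parse_packet(vod))
```

The receiver only needs the second bit to decide which layout applies, which is why the same 288-bit frame can carry either mix without any extra length field.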
For the joint transmission of multiplexed voice data and image data, priority is given to voice data, for which there is a greater real-time requirement, as was previously described. Thus, the bandwidth allocated for image data varies, depending on whether or not voice data are present in the packet. This can be intuitively understood by referring to FIG. 10. The fluctuation of the bandwidth for image data in the packet gives rise to the following problems.
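The size of this fluctuation can be estimated from the field widths of FIG. 10. The sketch below is illustrative only; `voice_fraction` (the share of packets that carry voice) is a hypothetical parameter introduced for the example and does not appear in the disclosure.

```python
PACKET_MS = 20               # duration of one 288-bit packet at 14.4 kbps
VOD_IMAGE_BITS = 16          # image bits in a packet that carries voice
NON_VOICE_IMAGE_BITS = 286   # image bits in a packet without voice


def image_rate_bps(image_bits_per_packet, packet_ms=PACKET_MS):
    """Image bandwidth when each packet carries the given number of image bits."""
    return image_bits_per_packet * 1000 / packet_ms


def average_image_rate_bps(voice_fraction):
    """Mean image bandwidth when `voice_fraction` of packets carry voice."""
    if not 0.0 <= voice_fraction <= 1.0:
        raise ValueError("voice_fraction must lie between 0 and 1")
    mean_bits = (voice_fraction * VOD_IMAGE_BITS
                 + (1.0 - voice_fraction) * NON_VOICE_IMAGE_BITS)
    return image_rate_bps(mean_bits)


print(image_rate_bps(VOD_IMAGE_BITS))        # 800.0 bps with voice
print(image_rate_bps(NON_VOICE_IMAGE_BITS))  # 14300.0 bps without voice
print(average_image_rate_bps(0.5))           # 7550.0 bps at a 50% mix
```

Under these figures the image bandwidth swings by a factor of almost 18 between the two packet types.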
(1) Problem related to a bit rate for coding and compressing image data
A coding and compression module (software) for image data, or a motion picture compressor (hardware), generally adjusts its data compression rate in accordance with a provided parameter, e.g., a bit rate. More specifically, in accordance with the bit rate, the software module or hardware component maintains a steady number of image frames to be coded and compressed per unit of time. Therefore, at an optimal bit rate, optimal data transmission can be performed in which the transmission rate (bandwidth) for image data and the image quality are well balanced. However, when the bandwidth assigned to image data changes dynamically as described above, the optimal bit rate varies accordingly. If a larger bit rate is given to image data in advance on the assumption that voice data are never present in a packet, i.e., that the bandwidth allocated for image data is wide, image quality is improved but the quantity of data for one image frame becomes enormous. A motion picture compression and decompression module, or a motion picture compression and decompression device, generally handles image data in units of one frame. If the data quantity for one frame is increased and the reception side therefore needs a longer time to receive one image frame, the decompression and display of the image data are also delayed. As a result, an image that was captured several seconds earlier is displayed on the receiving machine.
On the other hand, if a smaller bit rate is given to image data in advance on the assumption that voice data are always present in a packet, i.e., that the bandwidth allocated for image data is narrow, the data quantity for one image frame is reduced, so that the image delay is eliminated. However, as a tradeoff, image quality remains poor even when voice data are actually absent from a packet and a wide band is therefore available for image data.
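The tradeoff between the two fixed bit-rate choices can be made concrete with a rough calculation. This is a sketch under stated assumptions, not a description of any particular codec: the encoder is assumed to spread its configured bit rate evenly over the frames of each second, and the frame rate of 5 frames per second is a hypothetical value chosen only for illustration.

```python
FULL_IMAGE_BPS = 14_300    # 286 image bits per 20 msec packet (no voice)
NARROW_IMAGE_BPS = 800     # 16 image bits per 20 msec packet (voice present)
FPS = 5                    # hypothetical frame rate, not from the text


def frame_bits(encoder_bps, frames_per_second):
    """Bits per frame if the encoder spreads its bit rate evenly over frames."""
    return encoder_bps / frames_per_second


def frame_transfer_seconds(encoder_bps, frames_per_second, channel_bps):
    """Time to deliver one frame over the image bandwidth actually available."""
    return frame_bits(encoder_bps, frames_per_second) / channel_bps


for encoder_bps in (FULL_IMAGE_BPS, NARROW_IMAGE_BPS):
    for channel_bps in (FULL_IMAGE_BPS, NARROW_IMAGE_BPS):
        seconds = frame_transfer_seconds(encoder_bps, FPS, channel_bps)
        print(f"encoder {encoder_bps:>6} bps, channel {channel_bps:>6} bps: "
              f"{seconds:.3f} s per frame")
```

With these assumed numbers, an encoder tuned for the wide band needs several seconds per frame whenever voice occupies the packets, while an encoder tuned for the narrow band finishes each frame almost immediately on the wide band and leaves most of it unused.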
(2) Problem concerning a frame rate for a video capture
In order to employ PCs for a video conference, a device such as a video capture board or a video capture controller is ordinarily employed to digitize the image input by a video camera and convert the resultant data into a file format. Generally, the video capture controller captures image data in units of a single frame. The capture is performed in response, for example, to an image input request from upper-level hardware, i.e., a CPU that executes a video application program.
The frame rate, i.e., the number of image frames captured per second, is increased in order to make the motion of the picture as smooth as possible. However, the total quantity of image data increases accordingly. When a narrow bandwidth is allocated for image data (see FIG. 10(a)) and a high frame rate is used, a data delay (buffering) occurs, and an image that was captured several seconds earlier is displayed on the receiving machine.
On the other hand, if the frame rate is reduced too much, the image delay is prevented, but smooth motion of the picture cannot be provided. Further, empty packets (gaps) in which no video data are transmitted appear between the current frame and the succeeding frame, and data transmission is inefficient. Buffering one frame or more of data delays the displayed picture and thus serves no purpose. It is preferable that capture be performed at an interval such that transmission of the succeeding frame begins just as transmission of the current frame is completed. When the time required for transmitting one frame is calculated from the data quantity for one image frame and the bandwidth in the communication path allocated for image data, an optimal time interval for capturing the succeeding image frame can be obtained. However, this calculation cannot be applied when the bandwidth assigned to image data changes dynamically.
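The interval calculation mentioned above, and the reason it breaks down, can be sketched by counting packets. The example is illustrative only: the function names are hypothetical, and the frame size of 2860 bits is an assumed value, not a figure from the disclosure.

```python
from itertools import cycle, repeat

PACKET_MS = 20               # duration of one packet at 14.4 kbps
VOD_IMAGE_BITS = 16          # image bits in a packet that carries voice
NON_VOICE_IMAGE_BITS = 286   # image bits in a packet without voice


def packets_to_send_frame(frame_bits, voice_flags):
    """Count the packets needed to carry one frame.

    `voice_flags` yields True for each packet that also carries voice.
    """
    sent = 0
    for count, has_voice in enumerate(voice_flags, start=1):
        sent += VOD_IMAGE_BITS if has_voice else NON_VOICE_IMAGE_BITS
        if sent >= frame_bits:
            return count
    raise ValueError("voice_flags ended before the frame was fully sent")


def capture_interval_ms(frame_bits, voice_flags):
    """Interval after which the next frame can start transmission."""
    return packets_to_send_frame(frame_bits, voice_flags) * PACKET_MS


FRAME_BITS = 2860  # hypothetical compressed frame size
print(capture_interval_ms(FRAME_BITS, repeat(False)))         # 200 (no voice)
print(capture_interval_ms(FRAME_BITS, repeat(True)))          # 3580 (always voice)
print(capture_interval_ms(FRAME_BITS, cycle([True, False])))  # 400 (alternating)
```

Because the sequence of voice flags is not known when a frame is captured, no single precomputed interval fits all three cases.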
Although priority is given to voice data, it is desirable that the picture be reproduced as smoothly as possible and at a constant speed. It is therefore essential that the problems of coding and compressing image data and of video capture be resolved.
The above problems are not significant in a transmission system, such as the ISDN or a LAN, that can assign a wide-band communication path to voice data and to image data. They are very critical, however, in a transmission system, such as a single telephone line, in which a single narrow-band communication path is shared by the data channels.