1. Field of the Invention
The present invention relates to the processing by a computer workstation of multiple streams of audio data received over a network.
2. Description of the Prior Art
Conventionally, voice signals have been transmitted over standard analog telephone lines. However, with the increase in locations provided with local area networks (LANs) and the growing importance of multimedia communications, there has been considerable interest in the use of LANs to carry voice signals. This work is described, for example, in "Using Local Area Networks for Carrying Online Voice" by D. Cohen, pages 13-21 and "Voice Transmission over an Ethernet Backbone" by P. Ravasio, R. Marcogliese, and R. Novarese, pages 39-65, both in "Local Computer Networks" (edited by P. Ravasio, G. Hopkins, and N. Naffah; North Holland, 1982). The basic principle of such a scheme is that a first terminal or workstation digitally samples a voice input signal at a regular rate (e.g. 8 kHz). A number of samples are then assembled into a data packet for transmission over the network to a second terminal, which then feeds the samples to a loudspeaker or equivalent device for playout, again at a constant rate.
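By way of illustration only, the assembly of samples into packets can be sketched as follows. The 8 kHz rate is taken from the text above; the 20 ms packet duration, the `AudioPacket` type, and the `packetize` function are assumptions introduced for this sketch and do not appear in the cited works.

```python
from dataclasses import dataclass
from typing import List, Sequence

SAMPLE_RATE_HZ = 8000  # sampling rate quoted in the text
PACKET_MS = 20         # assumed packet duration (hypothetical, not from the source)


@dataclass
class AudioPacket:
    """Hypothetical packet: a sequence number plus a block of samples."""
    seq: int
    samples: List[int]


def packetize(samples: Sequence[int],
              sample_rate_hz: int = SAMPLE_RATE_HZ,
              packet_ms: int = PACKET_MS) -> List[AudioPacket]:
    """Split a run of samples into fixed-duration packets.

    The final packet may be shorter if the input does not divide evenly.
    """
    per_packet = sample_rate_hz * packet_ms // 1000
    return [
        AudioPacket(seq=start // per_packet,
                    samples=list(samples[start:start + per_packet]))
        for start in range(0, len(samples), per_packet)
    ]
```

With these assumed values each full packet carries 160 samples, so the receiving terminal must play out one packet every 20 ms to maintain a constant rate.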
One of the problems with using a LAN to carry voice data is that the transmission time across the network is variable. Thus the arrival of packets at a destination node is both delayed and irregular. If the packets were played out in irregular fashion, this would have an extremely adverse effect on intelligibility of the voice signal. Therefore, voice-over-LAN schemes utilize some degree of buffering at the reception end, to absorb such irregularities. Care must be taken to avoid introducing too large a delay between the original voice signal and the audio output at the destination end, which would render natural interactive two-way conversation difficult (in the same way that an excessive delay on a conventional transatlantic phone call can be highly intrusive). A system is described in "Adaptive Audio Playout Algorithm for Shared Packet Networks", by B. Aldred, R. Bowater, and S. Woodman, IBM Technical Disclosure Bulletin, pp. 255-257, Vol. 36, No. 4, April 1993, in which packets that arrive later than a maximum allowed value are discarded. The amount of buffering is adaptively controlled depending on the number of discarded packets (any other appropriate measure of lateness could be used). If the number of discarded packets is high, the degree of buffering is increased, while if the number of discarded packets is low, the degree of buffering is decreased. The size of the buffer is altered by temporarily changing the play-out rate (this affects the pitch; a less noticeable technique would be to detect periods of silence and artificially increase or decrease them as appropriate).
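The adaptive control just described can be sketched, under stated assumptions, as a simple threshold rule on the fraction of packets discarded during a measurement interval. The function name `adjust_buffer_depth`, the thresholds, the step size, and the limits below are all hypothetical values chosen for illustration; the cited bulletin is not the source of any of them.

```python
def adjust_buffer_depth(depth_ms: int,
                        discarded: int,
                        total: int,
                        high_loss: float = 0.05,   # assumed "high" discard ratio
                        low_loss: float = 0.01,    # assumed "low" discard ratio
                        step_ms: int = 10,         # assumed adjustment step
                        min_ms: int = 20,          # assumed lower limit
                        max_ms: int = 200) -> int: # assumed upper limit
    """Return a new buffer depth based on the recent discard ratio.

    Many late (discarded) packets -> more buffering.
    Few late packets -> less buffering, to reduce end-to-end delay.
    """
    if total <= 0:
        return depth_ms  # nothing measured in this interval; leave unchanged
    ratio = discarded / total
    if ratio > high_loss:
        return min(depth_ms + step_ms, max_ms)
    if ratio < low_loss:
        return max(depth_ms - step_ms, min_ms)
    return depth_ms
```

The upper limit reflects the concern stated above that too large a delay makes interactive conversation difficult; how the new depth is actually reached (a changed play-out rate or adjusted silences) is outside this sketch.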
Another important aspect of audio communications is conferencing involving multipoint communications, as opposed to two-way or point-to-point communications. When implemented over traditional analog telephone lines, audio conferencing requires each participant to send an audio signal to a central hub. The central hub mixes the incoming signals, possibly adjusting for the different levels, and sends each participant a summation of the signals from all the other participants (excluding the signal from that particular node). U.S. Pat. No. 4,650,929 discloses a centralized video/audio conferencing system in which individuals can adjust the relative volumes of the other participants. U.S. Pat. No. 4,389,720 discloses a telephone conferencing system with individual gain adjustment performed by system ports for multiple end user stations.
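The summation performed by such a central hub can be illustrated with a minimal sketch in which each participant receives the total of all signals less its own contribution. The function name `mix_minus` and the representation of each signal as a list of integer samples are assumptions made for this example; level adjustment and clipping are omitted.

```python
from typing import Dict, List


def mix_minus(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """For each participant, sum the frames of all OTHER participants.

    `frames` maps a participant identifier to one equal-length frame of samples.
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    if any(len(f) != length for f in frames.values()):
        raise ValueError("all frames must have the same length")

    # Sum every contribution once, then subtract each participant's own frame.
    total = [sum(f[i] for f in frames.values()) for i in range(length)]
    return {
        name: [total[i] - own[i] for i in range(length)]
        for name, own in frames.items()
    }
```

For example, with three participants the frame returned for participant "a" contains only the samples of "b" and "c" added together.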
The use of a centralized mixing node, often referred to as a multipoint control unit (MCU), has been carried over into some multimedia (audio plus video) workstation conferencing systems. For example, U.S. Pat. No. 4,710,917 describes a multimedia conferencing system, in which each participant transmits audio to and receives audio from a central mixing unit. Other multimedia conferencing systems are described in "Distributed Multiparty Desktop Conferencing System: MERMAID" by K. Watabe, S. Sakata, K. Maeno, H. Fukuoka, and T. Ohmori, pp. 27-38 in CSCW '90 (Proceedings of the Conference on Computer-Supported Cooperative Work, 1990, Los Angeles) and "Personal Multimedia Multipoint Communications Services for Broadband Networks" by E. Addeo, A. Gelman and A. Dayao, pp. 53-57 in Vol. 1, IEEE GLOBECOM, 1988.
The use of a centralized MCU or summation node, however, has several drawbacks. Firstly, the architecture of most LANs is based on a peer-to-peer arrangement, and so there is no obvious central node. Moreover, the system relies totally on the continued availability of the nominated central node to operate the conference. There can also be problems with echo suppression (the central node must be careful not to include the audio from a node in the summation signal played back to that node).
These problems can be avoided by the use of a distributed audio conferencing system, in which each node receives a separate audio signal from every other node in the conference. U.S. Pat. No. 5,127,001 describes such a distributed system, and discusses the synchronisation problems that arise because of the variable transit time of packets across the network. U.S. Pat. No. 5,127,001 overcomes this problem by maintaining separate queues of incoming audio packets from each source node. These effectively absorb the jitter in arrival time in the same way as described above for simple point-to-point communications. At regular intervals a set of audio packets is read out, one packet from each of the queues, and summed together for playout. In U.S. Pat. No. 5,127,001 the audio contributions from the different parties are combined using a weighted sum. A somewhat similar approach is found in GB 2207581, which describes a rather specialized local area network for the communication of digital audio in aircraft. This system includes means for adjusting independently the gain of each audio channel using a store of predetermined gain coefficients.
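The per-source queueing and weighted summation described above can be sketched as follows. This is an illustrative reading of the general technique only, not the implementation of either cited patent; the class `PerSourceMixer`, its method names, the default weight of 1.0, and the treatment of an empty queue as silence are assumptions.

```python
from collections import deque
from typing import Deque, Dict, List


class PerSourceMixer:
    """Hypothetical receiver-side mixer with one packet queue per source node."""

    def __init__(self, frame_len: int) -> None:
        self.frame_len = frame_len
        self.queues: Dict[str, Deque[List[int]]] = {}
        self.weights: Dict[str, float] = {}

    def set_weight(self, source: str, weight: float) -> None:
        """Set the gain applied to one source in the weighted sum."""
        self.weights[source] = weight

    def push(self, source: str, samples: List[int]) -> None:
        """Queue an incoming packet from `source` as it arrives off the network."""
        if len(samples) != self.frame_len:
            raise ValueError("unexpected packet length")
        self.queues.setdefault(source, deque()).append(samples)

    def mix_next(self) -> List[float]:
        """Read one packet from each queue and return their weighted sum.

        A source whose queue is empty contributes silence for this interval.
        """
        out = [0.0] * self.frame_len
        for source, queue in self.queues.items():
            if not queue:
                continue
            frame = queue.popleft()
            weight = self.weights.get(source, 1.0)
            for i, sample in enumerate(frame):
                out[i] += weight * sample
        return out
```

Calling `mix_next` at the regular playout interval drains one packet per source, so each queue absorbs arrival jitter for its own source independently of the others.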
One of the problems in audio conferencing systems, as discovered with the MERMAID system referred to above, is determining who is speaking at any given moment. U.S. Pat. No. 4,893,326 describes a multimedia conferencing system, in which each workstation automatically detects if its user is speaking. This information is then fed through to a central control node, which switches the video so that each participant sees the current speaker on their screen. Such a system requires both a video and audio capability to operate, and furthermore relies on the central video switching node, so that it cannot be used in a fully distributed system.
A distributed multimedia conferencing system is described in "Personal Multimedia-Multipoint Teleconference System" by H. Tanigawa, T. Arikawa, S. Masaki, and K. Shimamura, pp. 1127-1134 in IEEE INFOCOM 91, Proceedings Vol 3. This system provides sound localization for a stereo workstation, in that as a window containing the video signal from a conference participant is moved from right to left across the screen, the apparent source of the corresponding audio signal moves likewise. This approach provides limited assistance in identification of a speaker. A more comprehensive facility is described in Japanese abstract JP 02-123886 in which a bar graph is used to depict the output voice level associated with an adjacent window containing a video of the source of the sound.
The prior art therefore describes a variety of audio conferencing systems. While conventional centralized telephone audio conferencing is both widespread and well understood from a technological point of view, much work remains to be done to increase the performance of audio conferencing implementations in the desk-top environment.