Speech conference systems allow a number of speech terminals to be connected together into a telephone conference, so that the audio signals picked up via the respective microphones of the speech terminals of the other participants are fed to each participant as a mixed conference signal for audio output. The mixed conference signal intended for output to a participant, also referred to below as the mixed signal, is in such cases predominantly a superimposition of all audio signals present, but frequently without the audio signal of that participant: he does not need to hear his own contributions to the conference, and usually should not, since this would cause an undesired echo of what he is saying, which he could find disturbing. Thus a specific mixed signal is frequently formed for each of the N participants of a telephone conference, in which the (N−1) voice signals of the other participants are processed into the specific mixed signal. This can prove expensive in terms of computing power for the audio conferencing system. It can also make speech harder to understand for the participants involved in the telephone conference, since the respective mixed signal can, for example, also include audio signals with background noises; the background noises of a number of audio signals can be superimposed so that they are clearly perceptible and adversely affect the comprehensibility of the useful audio signals, i.e. the sentences spoken by one of the participants.
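The formation of a specific mixed signal for each of the N participants can be illustrated with a minimal sketch. It assumes that each participant contributes one frame of equally many samples, represented as a list of floats; the function name `mix_for_each_participant` and this frame representation are illustrative assumptions, not taken from any particular conference system.

```python
from typing import List


def mix_for_each_participant(frames: List[List[float]]) -> List[List[float]]:
    """For each of the N participants, superimpose the (N-1) other signals.

    frames[i] is the current audio frame of participant i (assumed layout).
    The result at index i is the mixed signal intended for participant i,
    which deliberately omits that participant's own signal.
    """
    n = len(frames)
    mixes = []
    for i in range(n):
        # Collect the frames of all participants except participant i.
        others = [frames[j] for j in range(n) if j != i]
        # Superimpose sample by sample; this costs (N-1) additions per sample
        # and is repeated N times, once per participant.
        mixes.append([sum(samples) for samples in zip(*others)])
    return mixes


frames = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
mixes = mix_for_each_participant(frames)
```

Written this way, the work per frame grows with N·(N−1), which is the computing outlay referred to above.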
To reduce the computing outlay and the background noises it can be useful, especially for telephone conferences with a comparatively large number of participants, not to superimpose all (N−1) speech signals of the other participants but only the speech signals of a subset of the N participants, in particular a subset of M actively-speaking participants, with M<N. The audio signals of the other, largely inactive, participants can be ignored in the creation of the mixed signal, so that only the M audio signals of the actively-speaking participants are superimposed. This method of operation is based on the assumption that in a well-organized teleconference led by a moderator only a few participants speak at the same time, and that participants usually speak one after another.
A method of this type for a packet-switched communication system, in which an audio energy is determined for each conference participant and, on the basis of this energy, a number M of conference participants are included in a mixed signal while the remaining conference participants are not, is known from the publication “Automatic Addition and Deletion of clients in VoIP Conferencing”, IEEE Symposium on Computers and Communications, Hammamet, Tunisia, July 2001, by Prasad, Kuri, Jamadagni, Dagale, Ravindranath. The particular characteristic of the method is that the mixed signal is formed individually for each conference participant at the terminal of the respective participant, and each conference participant can adapt the volumes of the M mixed signals himself via a user interface. However, this demands a high transmission bandwidth. Furthermore, the publication mentions an upper limit of M=4.
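An energy-based selection of M participants could be sketched as follows. The energy measure used in the cited publication is not described here, so the mean of the squared samples is used purely as an assumption; the names `frame_energy` and `select_active` and the participant identifiers are likewise hypothetical.

```python
from typing import Dict, List


def frame_energy(frame: List[float]) -> float:
    """Mean squared sample value of one frame (assumed energy measure)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def select_active(frames: Dict[str, List[float]], m: int) -> List[str]:
    """Return the identifiers of the M participants with the highest energy.

    All other participants are treated as inactive and would be left out
    of the mixed signal.
    """
    energies = {pid: frame_energy(frame) for pid, frame in frames.items()}
    # Rank participants by descending energy and keep the first M.
    ranked = sorted(energies, key=lambda pid: energies[pid], reverse=True)
    return ranked[:m]


frames = {
    "a": [0.1, -0.1],
    "b": [0.5, 0.5],
    "c": [0.0, 0.0],
    "d": [0.9, -0.9],
}
active = select_active(frames, 2)
```

In this example the participants "d" and "b" are selected, while "a" and "c" are regarded as inactive for this frame.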
If, as with the method mentioned in the last section, the set of active and inactive participants is formed dynamically and adapted over time to the current and changing activity circumstances in accordance with the audio signals present in the audio conference system, this results in disadvantages in the audio quality of the mixed signal when a previously active and now inactive audio signal is removed from the mixed signal, or when a previously inactive and now active audio signal is inserted into it. For example, background noises can appear and/or disappear abruptly where an audio signal of a participant features such background noises and this participant is determined to be active for one period and inactive for another. In addition, a crosstalk effect and a truncation of crosstalk audio signals can occur in the form of so-called speech clipping, which can be produced as a result of an incorrect composition of the set of audio signals viewed as active.