1. Field of the Invention
The present invention relates to a multipoint television conference system having conference terminals disposed at a plurality of locations and a multipoint controlling unit (hereinafter, referred to as MCU) that mixes audio signals of the individual locations that are received from the individual conference terminals, distributes the mixed audio signal to the individual conference terminals, combines a video signal from the individual video signals that are received from the individual conference terminals corresponding to a selection signal, and distributes a combined video signal to the individual conference terminals.
2. Description of the Prior Art
A conventional multipoint television conference system is composed of one MCU and a plurality of conference terminals. When a conference attendee speaks to an assigned conference terminal, the speech is transmitted as an audio signal to the MCU.
Each conference terminal has an audio/video signal transmitting/receiving/transferring function, a video displaying function, and an audio outputting function. The MCU mixes audio signals received from conference terminals with each other and distributes the mixed audio signal to conference terminals. When each conference terminal transmits a video signal to the MCU, the MCU selects a video signal of one conference terminal. Alternatively, the MCU may combine video signals received from two or more conference terminals, distribute the combined video signal to each conference terminal and cause each conference terminal to display the combined video signal.
Next, with reference to FIG. 1, a conventional television conference system will be described. FIG. 1 is a block diagram showing the structure of the conventional television conference system. Referring to FIG. 1, the television conference system has a plurality of conference terminals 16a to 16c and one MCU 17. In this example, it is assumed that three people have a conference using respective conference terminals 16a to 16c. 
The conference terminals 16a to 16c convert video signals and audio signals of respective locations (A) to (C) into transmission signals and transmit the transmission signals to the MCU 17 through communication lines 1671a to 1671c, respectively. The MCU 17 processes the video signals and audio signals in a particular manner that will be described later and distributes the resultant signals to the conference terminals 16a to 16c. 
The MCU 17 comprises a line interface portion 171, an audio processing portion 172, a controlling portion 173, a video processing portion 174, and a speaking attendee determination processing portion 175.
The line interface portion 171 is connected to a plurality of conference terminals 16a to 16c. The line interface portion 171 transmits and receives video signals and audio signals as transmission signals. The audio processing portion 172 and the video processing portion 174 are connected to the line interface portion 171 through connection lines 7172 and 7174, respectively.
The audio processing portion 172 decodes audio signals received from the conference terminals 16a to 16c and supplies the decoded signals to the speaking attendee determination processing portion 175 through a connection line 7275. The speaking attendee determination processing portion 175 determines a speaking attendee corresponding to the received audio signal and supplies the determined result as speaking attendee information to the controlling portion 173 through a connection line 7573.
The controlling portion 173 generates a video control signal for an image switching process and an image combining process with the input speaking attendee information and supplies the video control signal to the video processing portion 174 through a connection line 7374. In addition to the video control signal, video signals of individual conference terminals are supplied from the line interface portion 171 to the video processing portion 174 through the connection line 7174. The video processing portion 174 performs the image switching process, the image combining process, and so forth for the video signals corresponding to the video control signal. The video processing portion 174 encodes video signals and supplies the encoded video signals to the line interface portion 171 through the connection line 7174.
The audio processing portion 172 mixes the audio signals received from the conference terminals 16a to 16c, encodes the mixed signal, and supplies the encoded signal to the line interface portion 171. The line interface portion 171 multiplexes the processed audio signals and the processed video signals and distributes the multiplexed signal to all the conference terminals 16a to 16c through the communication lines 1671a to 1671c.
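The mixing step performed by the audio processing portion 172 may be illustrated by the following minimal sketch. The description does not specify how the signals are mixed; the sketch assumes decoded samples normalized to the range -1.0 to 1.0, sample-wise summation, and clipping of the sum. The function name mix_audio is hypothetical.

```python
from typing import List


def mix_audio(signals: List[List[float]]) -> List[float]:
    """Mix decoded audio signals from several terminals into one signal.

    Assumption: each signal is a list of samples in [-1.0, 1.0]; mixing is
    a sample-wise sum clipped back into that range.
    """
    if not signals:
        return []
    # Mix only as many samples as the shortest signal provides.
    length = min(len(s) for s in signals)
    mixed = []
    for i in range(length):
        total = sum(s[i] for s in signals)
        # Clip so that the mixed sample stays within the valid range.
        mixed.append(max(-1.0, min(1.0, total)))
    return mixed
```

For example, mixing the two-sample signals [0.25, 0.75] and [0.5, 0.75] yields [0.75, 1.0], the second sample being clipped.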
Next, with reference to FIG. 2, the internal structure of the speaking attendee determination processing portion 175 will be described. Referring to FIG. 2, the speaking attendee determination processing portion 175 has audio volume detecting portions 175a to 175c. The audio volume detecting portions 175a to 175c receive audio signals from the audio processing portion 172 through connection lines 110a to 110c, respectively. The audio volume detecting portions 175a to 175c compare the audio volumes of the audio signals with a predetermined threshold value and transmit the comparison results as speaking attendee determination information to a speaking attendee determining portion 14 through connection lines 7114a to 7114c, respectively.
When an audio volume is equal to or higher than the predetermined threshold value in any of the audio volume detecting portions 175a to 175c, the speaking attendee determining portion 14 determines that the conference terminal corresponding to that audio volume detecting portion has a speaking attendee. When the audio volume is lower than the predetermined threshold value in all of the audio volume detecting portions 175a to 175c, the speaking attendee determining portion 14 determines that the conference terminals have no speaking attendees.
When a plurality of conference terminals have speaking attendees, the speaking attendee determining portion 14 determines that the conference terminal which has the longest time period in which an audio volume is equal to or larger than the predetermined threshold has a speaking attendee. The determined result is output as speaking attendee information to the controlling portion 173 through a connection line 7573.
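The conventional determination described above may be illustrated by the following minimal sketch. It assumes that each terminal's audio volume is sampled frame by frame and that the "time period" is measured as the total number of frames at or above the threshold; the description does not say whether total or continuous time is meant. The function name and the terminal labels are hypothetical.

```python
from typing import Dict, List, Optional


def conventional_speaking_terminal(
    volumes: Dict[str, List[float]], threshold: float
) -> Optional[str]:
    """Return the terminal judged to have the speaking attendee, or None.

    Assumption: volumes maps a terminal label to per-frame audio volumes.
    """
    # Count the frames in which each terminal's volume reaches the threshold.
    active_frames = {
        terminal: sum(1 for v in frames if v >= threshold)
        for terminal, frames in volumes.items()
    }
    # Keep only terminals that reached the threshold at least once.
    candidates = {t: n for t, n in active_frames.items() if n > 0}
    if not candidates:
        return None  # no terminal has a speaking attendee
    # Several candidates: choose the one with the longest active time.
    return max(candidates, key=candidates.get)
```

Because only the audio volume is examined, a terminal that picks up a loud noise for many frames is selected in the same way as a terminal whose attendee is actually speaking.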
In the conventional television conference system, the speaking attendee determination information is detected from audio signals received from the conference terminals. Generally, in addition to the voice of a conference attendee, an audio signal contains noise such as a page turning noise, a desk knocking noise, and a breathing noise. Moreover, when a conference is held at a place where there are many people, their voices may be contained in an audio signal. In this case, the voices of people other than the conference attendees also become noise.
Thus, the speaking attendee determination signal detected from audio signals may include errors. Consequently, it was difficult to accurately determine a conference terminal that has a speaking attendee in the prior art. If a conference terminal that has a speaking attendee is incorrectly determined, it is difficult to smoothly manage a television conference system.
In order to overcome the aforementioned disadvantages, the present invention has been made and accordingly, has an object to provide a television conference system which accurately determines conference terminals each having a speaking attendee without being disturbed by noises.
According to an aspect of the present invention, there is provided a multipoint television conference system, comprising: terminals, and a multipoint control unit for controlling the terminals, wherein each of the terminals is provided for each attendee to the multipoint television conference; and wherein the multipoint control unit comprises: plural detection means, each of the plural detection means detecting whether or not each attendee is speaking on the basis of an audio signal and a video signal from each of the terminals in order to output a speaker detection signal representing a result of the detection; determination means for determining who are main speakers among the attendees on the basis of the speaker detection signals in order to generate a determination signal representing the main speakers; and combining means for combining the video signals on the basis of the determination signal in order to output the combined video signal to the terminals.
The system may further comprise mixing means for mixing the audio signals input from the terminals to output the mixed audio signal to the terminals.
In the system, the number of main speakers may be one, and the combining means may select the video signal from the terminal corresponding to the main speaker as a main in the combined video signal.
In the system, each of the plural detection means may comprise: volume detection means for detecting whether or not the audio signal from the corresponding terminal is louder than a first threshold to generate a voice detection signal; image recognition means for detecting whether or not a movement of a lip of the attendee to the corresponding terminal is larger than a second threshold to generate a movement detection signal; and means for generating the speaker detection signal on the basis of the voice detection signal and the movement detection signal.
In the system, the means for generating the speaker detection signal may activate the speaker detection signal when the audio signal is louder than the first threshold and the movement of the lip is larger than the second threshold simultaneously.
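The generation of the speaker detection signal from the voice detection signal and the movement detection signal may be illustrated by the following minimal sketch. It assumes that the audio volume and the lip movement have already been measured as numeric values for the same instant; how the image recognition means measures lip movement is not covered here. The function and parameter names are hypothetical.

```python
def speaker_detection_signal(
    volume: float,
    lip_movement: float,
    first_threshold: float,
    second_threshold: float,
) -> bool:
    """Return True when the speaker detection signal is active."""
    # Voice detection signal: the audio is louder than the first threshold.
    voice_detected = volume > first_threshold
    # Movement detection signal: the lip movement exceeds the second threshold.
    movement_detected = lip_movement > second_threshold
    # The speaker detection signal is active only when both hold at once.
    return voice_detected and movement_detected
```

Under these assumptions, a loud page turning noise with no lip movement leaves the speaker detection signal inactive, as does lip movement without sufficient audio volume.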
In the system, the determination means may comprise means for determining the main speakers on the basis of periods, in each of which each of the speaker detection signals is active.
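The determination of the main speakers on the basis of the active periods may be illustrated by the following minimal sketch. It assumes that each speaker detection signal is sampled as a sequence of boolean values, that the active period is measured as the number of active samples, and that terminals are ranked by that number; the exact criterion is not fixed by the description above. The function name and the terminal labels are hypothetical.

```python
from typing import Dict, List


def main_speakers(
    detection_signals: Dict[str, List[bool]], max_speakers: int = 1
) -> List[str]:
    """Return up to max_speakers terminal labels, longest active period first."""
    # Active period of each terminal, counted in samples (assumption).
    active = {t: sum(1 for s in sig if s) for t, sig in detection_signals.items()}
    # Terminals whose signal was never active cannot be main speakers.
    candidates = [t for t in active if active[t] > 0]
    # Rank by active period, longest first; ties keep their input order.
    ranked = sorted(candidates, key=lambda t: active[t], reverse=True)
    return ranked[:max_speakers]
```

With max_speakers set to one, the single returned terminal corresponds to the case in which the combining means selects one video signal as the main video signal.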
These and other objects, features and advantages of the present invention will become more apparent in light of the following detailed description of the best mode embodiment thereof, as illustrated in the accompanying drawings.