1. Field of the Invention
The present invention relates to an apparatus for integrally controlling audio and video signals for systems such as TV conferencing systems and visual telemetry systems in which audio and video signals transmitted from a spatially remote site are used to reproduce scenes rich in reality. More particularly, the invention relates to an apparatus for integrally controlling audio and video signals by analyzing received video signals and controlling audio signal processing parameters in accordance with the analyzed results.
2. Description of the Related Art
Movies and television, which have long been in practical use, are known as video systems for transmitting audio and video signals from a spatially remote site. The techniques of movies and television are well known and their details are omitted; only the effects of combining audio and video signals are described herein. Basic sound signals for a movie or a television program are recorded simultaneously when a scene is shot. After the scenes are shot, the basic sound signals are repeatedly edited and processed while viewing the scenes to generate audio signals matching the scenes. Editing and processing include the addition of sound effects and new sounds after recording and the adjustment of the quality and volume of recorded sounds. An object of editing is to improve reality. It is well known that reality improves if high quality audio signals matching the contents of scenes are used. For example, a movie with a surround stereophonic sound system, in which sound images move following the motion of scene images, provides greater reality than a movie with a monophonic sound system.
When audio and video signals of a movie or a television program are transmitted in real time from a spatially remote site, however, the audio signals cannot be repeatedly edited or processed, so that the excellent reality described above cannot be provided.
As full-duplex visual communication systems, TV conferencing systems have been in practical use. In a TV conferencing system, audio and video signals recorded by a microphone and a camera (hereinafter a video signal containing an audio signal is represented by an AV signal where applicable) are transmitted to a remote site via communication networks, and images and sounds of scenes are reproduced on a display unit and from a loudspeaker. Microphones, cameras, display units, and loudspeakers are prepared at respective communication sites which are interconnected by communication networks to realize full-duplex and multi-site communications. As simplex visual communication systems, there are a visual telemetry system in which scenes at a remote site are monitored by using AV signals and a telepresence system in which a user has a virtual experience as if present at a remote site by looking at images and listening to sounds at the remote site. Such TV conferencing systems, visual telemetry systems, and telepresence systems are real-time visual communication systems by which present events are recorded by a TV camera and a microphone and transmitted to a destination with high fidelity. Recently, easy-to-use systems for so-called computer supported cooperative work (CSCW) have become available, in which images transmitted in real time and computer graphics generated by a computer are displayed at the same time.
FIG. 37 is a schematic diagram showing an example of a conventional multi-site, individual-type TV conferencing system.
In this multi-site TV conferencing system S51, AV signals are transmitted among TV conferencing sites (A to E) 3751 to 3755 via a communication network 3756, each site being equipped with a TV conferencing apparatus for each of participants A to E.
FIG. 38 is a schematic diagram showing the configuration of, for example, the TV conferencing apparatus at site E 3755.
The TV conferencing apparatus at site E 3755 has a camera 3862, a microphone 3869, a display unit 3801, and loudspeakers 3860 and 3861.
The camera 3862 takes an image of the participant E at the TV conferencing site E and its video signal is transmitted to the other TV conferencing sites (A to D) 3751 to 3754. The microphone 3869 records voices of the participant E and its audio signal is transmitted to the other TV conferencing sites (A to D) 3751 to 3754.
In windows 2564 to 2567 of the display unit 3801, the images of the participants A to D at the other TV conferencing sites (A to D) 3751 to 3754 are displayed. Voices of the participants A to D at the other TV conferencing sites (A to D) 3751 to 3754 are synthesized and reproduced from the loudspeakers 3860 and 3861.
With conventional TV conferencing systems and visual telemetry systems, the correspondence between audio and video signals becomes poor in some cases because a conference room, or a space in which an object to be monitored is located, does not always satisfy the sound recording conditions matching the scene images. For example, consider zooming up the image of a speaker in a TV conferencing system. In order to realize a good correspondence between audio and video signals during an image zoom-up operation, it is necessary, for example, for a microphone to move and record speech near the speaker at the same time as the camera is moved for the zoom-up operation, and for the sound recording area to coincide with the image taking area. In practice, however, it is impossible for a conventional system to move a microphone near a speaker. Therefore, even if the image of a speaker is zoomed up, the sound volume does not change and an AV signal having a poor correspondence is transmitted to the communication partner. Such an AV signal reproduced at the destination provides low reality and hinders the smooth progress of a conference. For example, if a conference always proceeds with voices heard as if from a far field, it is easily conceivable that the conference becomes unattractive and its smooth progress becomes difficult.
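The size of the mismatch described above can be illustrated with a minimal sketch. It rests on two assumptions that are not stated in this description: that an image zoomed up by a factor z makes the speaker appear z times closer, and that the sound level of a point source in a free field follows the inverse-distance law. The function name gain_db_for_zoom is hypothetical and is used only for illustration.

```python
import math


def gain_db_for_zoom(zoom_factor):
    """Return the level change in dB that would match a zoom-up.

    Assumption (illustrative only): apparent distance scales as
    1 / zoom_factor, and sound pressure scales as 1 / distance,
    so the matching gain is 20 * log10(zoom_factor).
    """
    if zoom_factor <= 0:
        raise ValueError("zoom_factor must be positive")
    return 20.0 * math.log10(zoom_factor)


if __name__ == "__main__":
    # A conventional system applies 0 dB at every zoom setting;
    # the printed values show the gap under the stated assumptions.
    for z in (1.0, 2.0, 4.0):
        print(f"zoom x{z:g}: matching gain {gain_db_for_zoom(z):.1f} dB")
```

Under these assumptions a twofold zoom-up would call for a level increase of about 6 dB, whereas a conventional system leaves the volume unchanged.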
In addition to a poor correspondence between audio and video signals, there is a poor correspondence between video signals. This will be explained in the following.
FIGS. 39A and 39B are schematic diagrams explaining the states at the TV conferencing sites (E and A) 3755 and 3751 of the conventional TV conferencing system S51 wherein the participants E and A at the TV conferencing sites (E and A) 3755 and 3751 have a conversation.
As shown in FIG. 39A, at the TV conferencing site E 3755, the participant A is displayed in the left-side window 2564 of the display unit 3801 and the participant E looks at the window 2564. Therefore, an angle θ between the line of sight of the participant E and the optical axis of the camera 3862 becomes large.
As shown in FIG. 39B, at the TV conferencing site A 3751, the participant E is displayed in the right-side window 2567 of the display unit 3801 and the participant A looks at the window 2567. Therefore, an angle θ between the line of sight of the participant A and the optical axis of the camera 3862 becomes large.
Each of the participants E and A therefore feels that the partner is not looking at him or her, and the reality of a discussion in a conference room is lost.
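The angle θ discussed with reference to FIGS. 39A and 39B can be estimated with a minimal sketch. The geometry is an assumption made for illustration and is not taken from the figures: the optical axis of the camera is taken to pass through the eye of the participant, positions are given as (x, y, z) coordinates in meters, and the function name gaze_angle_deg and all numerical values are hypothetical.

```python
import math


def gaze_angle_deg(eye, camera, window):
    """Angle in degrees between the line of sight (eye to window)
    and the camera's optical axis (assumed to run eye to camera)."""
    to_window = [w - e for w, e in zip(window, eye)]
    to_camera = [c - e for c, e in zip(camera, eye)]
    dot = sum(a * b for a, b in zip(to_window, to_camera))
    norm = (math.sqrt(sum(a * a for a in to_window))
            * math.sqrt(sum(b * b for b in to_camera)))
    if norm == 0:
        raise ValueError("eye must differ from camera and window")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    eye = (0.0, 0.0, 0.5)        # participant 0.5 m in front of the display
    camera = (0.0, 0.0, 0.0)     # camera at the center of the display plane
    window = (-0.5, 0.0, 0.0)    # partner's window 0.5 m to the left
    print(gaze_angle_deg(eye, camera, window))
```

In this hypothetical layout the angle is 45 degrees, and it falls to zero only when the window of the conversation partner coincides with the camera position.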
As described above, with the conventional TV conferencing system S51, conversation partners (speakers and listeners) are not displayed clearly and distinguishably and reality cannot be produced.
JP-A-61-10381 discloses a technique of selectively transmitting only an image of a participant now speaking.
JP-A-60-203086 discloses a technique of displaying an enlarged image of a participant now speaking.
JP-A-63-77282 discloses a technique of changing the direction of a camera toward a participant now speaking.
These conventional techniques relate to the application of apparatuses on the speaker side. In a TV conference, reality can be obtained if conversation partners (speakers and listeners) are displayed clearly and distinguishably. None of these conventional techniques can display conversation partners clearly and distinguishably, and none can therefore provide sufficient reality.
If the correspondence between audio and video signals is poor in a monitor operation of a visual telemetry system (e.g., if audio signals unrelated to the video signals are reproduced), these unnecessary audio signals may cause an operator to overlook an instrument or to judge erroneously that an event has occurred.
As is apparent from the description of editing the sounds of a television program or a movie, sounds are edited and processed in order to improve the correspondence between audio and video signals and to improve reality. However, conventional real-time visual communication systems such as TV conferencing systems and visual telemetry systems do not record or process sounds and images again once they have been recorded and processed, and are therefore unable to provide a conference with good reality or a correct and speedy monitor operation.