1. Field of the Invention
The present invention relates to a stereo voice transmission apparatus used in a remote conference system or the like, an echo canceler especially for a stereo voice, and a voice input/output apparatus to which this echo canceler is applied.
2. Description of the Related Art
In recent years, with the development of communication techniques, strong demand has arisen for a remote conference system through which a conference can be held between remote locations.
A remote conference system generally comprises an input/output system, a control system, and a transmission system to exchange image information such as motion and still images and voice information between the remote locations through a transmission line. The input/output system includes a microphone, a loudspeaker, a TV camera, a TV set, an electronic blackboard, a FAX machine, and a telewriting unit. The control system includes a voice unit, a control unit, a control pad, and an imaging unit. The transmission system includes the transmission line and a transmission unit. In a remote conference system, a reduction in the cost of transmitting information such as image information and voice information has been demanded. In particular, if these pieces of information can be transmitted at a transmission rate of about 64 kbps, which allows transmission over an existing public subscriber line, a remote conference system can be realized at a lower cost than a high-quality remote conference system using optical fibers. In an ISDN (Integrated Services Digital Network), in which digitization has been completed down to the level of the end user, i.e., the public subscriber, this transmission rate will be a key factor in popularizing remote conference systems in applications ranging from small and medium-sized business use to home use.
In a remote conference system using a transmission line at a low transmission rate of, e.g., 64 kbps, a large volume of information such as images and voices must be compressed within a range which does not interfere with discussions in a conference. Since even a monaural voice must be compressed to a low transmission rate of about 16 kbps by voice data compression such as ADPCM, a stereo voice is generally not used.
In a remote conference system, however, it is preferable to employ stereo voices to enhance the effect of presence and to let listeners identify which speaker is currently talking.
A stereo voice transmission scheme capable of transmitting a high-quality stereo voice at low cost even over a transmission line having a low transmission rate is known (Jpn. Pat. Appln. KOKAI Application No. 62-51844).
In this stereo voice transmission scheme, main information representing a voice signal of at least one of a plurality of channels and additional information required to synthesize a voice signal of the remaining channel from the main information are coded, and the coded information is transmitted from a transmission side. On a reception side, the voice signal of each channel transmitted as the main information is decoded and reproduced, and the voice signal of the remaining channel is reproduced by synthesizing the main information and the additional information.
This scheme will be described in detail with reference to FIG. 1.
As shown in FIG. 1, a voice $X(\omega)$ (where $\omega$ is the angular frequency) of a speaker A_1 is input to right- and left-channel microphones 101_R and 101_L. In this case, echoes from a wall and the like are neglected. If the left- and right-channel transfer functions are defined as $G_L(\omega)$ and $G_R(\omega)$, the left- and right-channel input voices $Y_L(\omega)$ and $Y_R(\omega)$ are expressed as follows:

$$ Y_L(\omega) = G_L(\omega) \cdot X(\omega) \tag{1} $$

$$ Y_R(\omega) = G_R(\omega) \cdot X(\omega) \tag{2} $$
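From equations (1) and (2), the following equations can be derived:

$$ Y_L(\omega) = \frac{G_L(\omega)}{G_R(\omega)} \cdot Y_R(\omega) \tag{3} $$

$$ Y_L(\omega) = G(\omega) \cdot Y_R(\omega), \qquad G(\omega) = \frac{G_L(\omega)}{G_R(\omega)} \tag{4} $$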
From equation (4), if the transfer function $G(\omega)$ is known, the left-channel voice can be reproduced from the right-channel voice. According to this scheme, therefore, in stereo voice transmission, the right- and left-channel voices are not independently transmitted. A voice signal of one channel, e.g., the right-channel voice signal $Y_R(\omega)$, and an estimated transfer function $G(\omega)$ are transmitted from the transmission side. The right-channel voice signal $Y_R(\omega)$ and the transfer function $G(\omega)$ which are received by the reception side are synthesized to obtain the left-channel voice signal $Y_L(\omega)$. The right- and left-channel voices are then reproduced at right- and left-channel loudspeakers 501_R and 501_L, thereby transmitting the stereo voice.
According to the above scheme, if only a single speaker is talking (a single utterance), the transfer function $G(\omega)$ can be defined by a simple delay and a simple attenuation. The volume of this information can be much smaller than that of the voice signal $Y_L(\omega)$, and estimation can be simply performed. Therefore, a stereo voice can be transmitted with a smaller amount of transmitted data.
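The single-utterance case can be illustrated with a minimal time-domain sketch. It assumes that G reduces to one non-negative integer delay and one gain, so that the left channel is approximately `gain * right[n - delay]`. The function names `estimate_delay_gain` and `synthesize_left`, the cross-correlation search, and the toy signal are illustrative assumptions, not the estimation method of the cited scheme.

```python
import random


def estimate_delay_gain(right, left, max_delay):
    """Estimate (delay, gain) such that left[n] ~= gain * right[n - delay].

    Assumption: the left channel lags the right channel by 0..max_delay samples.
    """
    best_delay, best_corr = 0, float("-inf")
    for d in range(max_delay + 1):
        # Cross-correlation between the left channel and the delayed right channel.
        corr = sum(left[n] * right[n - d] for n in range(d, len(left)))
        if corr > best_corr:
            best_delay, best_corr = d, corr
    # Least-squares gain for the chosen delay.
    energy = sum(right[n - best_delay] ** 2 for n in range(best_delay, len(left)))
    gain = best_corr / energy if energy > 0 else 0.0
    return best_delay, gain


def synthesize_left(right, delay, gain):
    """Reception side: rebuild the left channel from the right channel and (delay, gain)."""
    return [gain * right[n - delay] if n >= delay else 0.0 for n in range(len(right))]


# Toy example: the left channel is the right channel delayed by 3 samples and halved.
rng = random.Random(0)
right = [rng.uniform(-1.0, 1.0) for _ in range(200)]
left = synthesize_left(right, delay=3, gain=0.5)

# Transmission side sends `right` plus two numbers instead of a second waveform.
delay, gain = estimate_delay_gain(right, left, max_delay=10)
restored_left = synthesize_left(right, delay, gain)
```

In this sketch the additional information is only the pair `(delay, gain)`, which is why it is so much smaller than a second voice signal.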
In the above system, since a single utterance is assumed, an accurate transfer function $G(\omega)$, i.e., the additional information, cannot be generated in a multiple simultaneous utterance mode, and the sound image localization fluctuates.
In a conversation such as a conference, the ratio of multiple simultaneous utterances to single utterances is generally very low. In the conventional scheme, as described above, each single utterance is transmitted as a monaural voice to realize a high band compression ratio. However, monaural voice transmission is applied without change even in the multiple simultaneous utterance mode, which rarely occurs. Therefore, the sound image localization undesirably fluctuates.
In addition, in a remote conference system, a speaker on the other end of the line is displayed on a screen during a conference. In this case, forming the sound image at a location corresponding to the position of the speaker's window on the screen makes the conversation feel more natural and makes it easier to distinguish a plurality of speakers. This sound image localization control is achieved such that delay and gain differences are given to the voices of the speakers on the other end of the line, and the voices of these speakers are output from upper, lower, right, and left loudspeakers.
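As a rough illustration of giving delay and gain differences to a voice according to window position, the sketch below handles only the left/right loudspeaker pair. The linear gain law, the `max_delay` value, and the function names `localization_params` and `apply_gain_delay` are all hypothetical choices made for the example; the same idea would apply to the upper/lower pair.

```python
def localization_params(x, max_delay=16):
    """Map a horizontal window position to per-loudspeaker (gain, delay in samples).

    x = 0.0 is the left edge of the screen, x = 1.0 the right edge.
    Assumption: linear gains, with extra delay on the loudspeaker farther from the window.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be between 0.0 and 1.0")
    offset = round(max_delay * (2.0 * x - 1.0))  # negative: window is left of center
    left_delay = max(offset, 0)
    right_delay = max(-offset, 0)
    return {"left": (1.0 - x, left_delay), "right": (x, right_delay)}


def apply_gain_delay(voice, gain, delay):
    """Scale a voice signal and delay it by a whole number of samples."""
    return [gain * voice[n - delay] if n >= delay else 0.0 for n in range(len(voice))]


# A window three quarters of the way to the right: the right loudspeaker is
# louder, and the left loudspeaker output is delayed.
params = localization_params(0.75)
voice = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
left_out = apply_gain_delay(voice, *params["left"])
right_out = apply_gain_delay(voice, *params["right"])
```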
When a conference is held as described above, voices output from the loudspeakers may be input again to a microphone to cause echoing and howling. An echo canceler is effective to cancel echoing and howling.
Assume that the window can be located at an arbitrary position on the screen. In this case, to cancel echoing and howling upon a change in window position, a sound image localization control unit for controlling the sound image localization must be located on the acoustic path side when viewed from the echo canceler. However, in this arrangement, when the window position changes, the sound image localization control unit and the echo canceler must relearn control and canceling, and the amount of cancellation undesirably decreases.
To solve the above problem, an echo canceler may be used for each loudspeaker. In this case, the echo cancelers must perform filtering of up to 4,000 stages (FIRAF), thereby greatly increasing the cost.
In a remote conference system, use of a stereo voice is desirable to improve the effect of presence. In this case, the output voices from the right and left loudspeakers are input to the right and left microphones through different echo paths. For this reason, four echo paths are present. A processing volume four times that of monaural voice processing is required for a stereo voice echo canceler.
FIG. 2 shows the arrangement of a conventional stereo voice echo canceler.
FIG. 2 shows only a right-channel microphone. If the same stereo voice echo canceler is used for the left-channel microphone, a stereo echo canceler for canceling echoes input from the right and left microphones can be realized.
Referring to FIG. 2, output voices from first and second loudspeakers 501_1 and 501_2 constituting the left and right loudspeakers are reflected by an obstacle 610 such as a wall or a person and input as an echo signal component to a right-channel microphone 101.
At this time, the echo signal component is assumed to be generated through two echo paths $H_{RR}$ and $H_{LR}$.
As echo cancelers for canceling these echo components, first and second echo cancelers 600_1 and 600_2 for respectively estimating two pseudo echo paths $H'_{RR}$ and $H'_{LR}$ corresponding to the two echo paths $H_{RR}$ and $H_{LR}$ are required.
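The arrangement of two echo cancelers feeding one microphone can be sketched with two adaptive FIR filters updated from a common residual. The sketch below uses a normalized LMS (NLMS) update, which is a common choice but is an assumption here, since the text does not name the adaptation algorithm. The function name `nlms_two_path`, the step size, and the 3-tap toy echo paths are also illustrative. The two loudspeaker signals are generated independently; with strongly correlated stereo signals the two path estimates are generally not unique.

```python
import random


def nlms_two_path(ref_r, ref_l, mic, taps, mu=0.5, eps=1e-8):
    """Adapt two FIR pseudo echo paths (h_rr, h_lr) for one microphone.

    ref_r, ref_l: signals sent to the right and left loudspeakers.
    mic: microphone signal containing both echoes.
    Returns (h_rr, h_lr, residual), where residual is the echo-canceled output.
    """
    h_rr = [0.0] * taps
    h_lr = [0.0] * taps
    residual = []
    for n in range(len(mic)):
        # Most recent `taps` samples of each loudspeaker signal, newest first.
        x_r = [ref_r[n - k] if n >= k else 0.0 for k in range(taps)]
        x_l = [ref_l[n - k] if n >= k else 0.0 for k in range(taps)]
        # Pseudo echo: sum of the outputs of both estimated paths.
        echo_est = sum(h * x for h, x in zip(h_rr, x_r)) + sum(h * x for h, x in zip(h_lr, x_l))
        e = mic[n] - echo_est
        # Normalize the step by the combined input power of both references.
        norm = sum(x * x for x in x_r) + sum(x * x for x in x_l) + eps
        step = mu * e / norm
        h_rr = [h + step * x for h, x in zip(h_rr, x_r)]
        h_lr = [h + step * x for h, x in zip(h_lr, x_l)]
        residual.append(e)
    return h_rr, h_lr, residual


# Toy setup: short, made-up echo paths and independent loudspeaker signals.
rng = random.Random(1)
ref_r = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ref_l = [rng.gauss(0.0, 1.0) for _ in range(2000)]
true_rr = [0.5, 0.2, -0.1]
true_lr = [0.3, -0.2, 0.1]
mic = [
    sum(true_rr[k] * ref_r[n - k] + true_lr[k] * ref_l[n - k] for k in range(3) if n >= k)
    for n in range(2000)
]
h_rr, h_lr, residual = nlms_two_path(ref_r, ref_l, mic, taps=3)
```

Real echo paths are far longer than three taps, which is where the cost discussed next comes from.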
However, such an echo canceler must be realized using a filter having an impulse response of several hundred msec for each echo path. When the number of echo paths is increased to two and then four, the circuit size increases, thereby increasing the cost.
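The growth in processing can be made concrete with a small back-of-the-envelope calculation. The 8 kHz sampling rate and the 500 msec response length are assumed values chosen only for illustration (the text says "several hundreds of msec" and gives no sampling rate), and the count covers FIR filtering only, not the coefficient adaptation.

```python
def fir_taps(response_ms, sample_rate_hz):
    """Number of FIR taps needed to span an impulse response of the given length."""
    return response_ms * sample_rate_hz // 1000


def filter_macs_per_second(taps, paths, sample_rate_hz):
    """Multiply-accumulate operations per second for filtering alone:
    one per tap, per output sample, per echo path."""
    return taps * paths * sample_rate_hz


SAMPLE_RATE_HZ = 8000  # assumed telephone-band sampling rate
RESPONSE_MS = 500      # assumed echo-path impulse response length

taps = fir_taps(RESPONSE_MS, SAMPLE_RATE_HZ)
for paths in (1, 2, 4):
    macs = filter_macs_per_second(taps, paths, SAMPLE_RATE_HZ)
    print(f"{paths} echo path(s): {taps} taps each, {macs:,} MAC/s")
```

Under these assumed values each path needs 4,000 taps, which is of the same order as the 4,000-stage figure mentioned earlier, and the filtering load scales linearly with the number of echo paths.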