The present invention relates to a method of detecting an acoustic signal and, more particularly, to a method of detecting a period of a desired acoustic signal in a signal including noise and the desired acoustic signal.
In recent years, speech recognition apparatuses have developed remarkably. The development of a speech recognition apparatus for recognizing speech in a noisy environment has nevertheless lagged, because it is difficult to correctly detect a speech period (i.e., a period during which speech is present on the time axis) in a signal contaminated by noise. When a noise period is recognized as a speech period, the noise is forcibly matched to some phoneme, and a correct speech recognition result cannot be obtained. It is therefore very important to develop a speech period detection technique which can be used in a noisy environment.
FIG. 1 is a timing chart for explaining the first conventional speech period detection method. This chart shows changes in short time power as a function of time. The short time power of a signal output from a microphone is plotted along the ordinate, and time is plotted along the abscissa. In the following description, the short time power will be referred to simply as "power". A signal generally contains stationary noise 11 (noise having an almost constant power, such as air-conditioning noise or fan noise of equipment), unstationary noise 12 (noise whose power changes greatly, such as a door closing sound or undesired speech), and desired speech 13. Although the power of the stationary noise can be known in advance, the power of the unstationary noise is unpredictable.
According to the first conventional method, the power of a signal is continuously monitored. When this power exceeds a threshold value Th 14 determined on the basis of the stationary noise power, the corresponding period is recognized as a speech period. Most existing speech recognition apparatuses perform speech period detection by using this method. According to this method, although a correct speech period 16 shown in FIG. 1 can be detected, an unstationary noise period 15 having a high power is also erroneously detected as a speech period.
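The first conventional method can be sketched as follows. This is only a minimal illustration, not the implementation of any particular apparatus: the function names, the non-overlapping framing, the linear power scale, and the sample power values are all assumptions made for the example.

```python
def short_time_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (assumed framing)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_periods(power, th):
    """Return (start, end) frame index pairs where power > th; end is exclusive."""
    periods, start = [], None
    for i, p in enumerate(power):
        if p > th and start is None:
            start = i                      # power rises above the threshold
        elif p <= th and start is not None:
            periods.append((start, i))     # power falls back below it
            start = None
    if start is not None:
        periods.append((start, len(power)))
    return periods


# Hypothetical power track: stationary noise (1), a door slam (9), speech (8).
power = [1, 1, 9, 9, 1, 1, 8, 8, 8, 1]
periods = detect_periods(power, th=3)
print(periods)  # [(2, 4), (6, 9)]
```

The threshold alone cannot tell the two detected periods apart: frames 2 to 3 are the high-power unstationary noise (period 15 in FIG. 1) and frames 6 to 8 are the speech (period 16).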
The second conventional method will be described below.
According to the second conventional method, two microphones are located so as to cause an S/N ratio difference between the outputs from the two microphones. Examples of microphone arrangements for this method are shown in FIGS. 2(a) and 2(b). That is, as shown in FIG. 2(a), a first microphone 1 is located near a speaker 3, and a second microphone 2 is located away from the speaker 3. Alternatively, as shown in FIG. 2(b), the first microphone 1 is located in front of the speaker 3, and the second microphone 2 is located near the side of the speaker 3. In these arrangements, the speech power level of the output from the first microphone is higher than that from the second microphone. On the other hand, assuming that noise is generated in a remote location, the noise power levels of the outputs from these microphones are almost equal to each other. As a result, an S/N ratio difference occurs between the outputs of the two microphones.
FIGS. 3(a), 3(b), and 3(c) are charts for explaining an ideal operation of the second conventional method. More specifically, FIG. 3(a) shows a time change in power P1 of the output from the first microphone, and FIG. 3(b) shows a time change in power P2 of the output from the second microphone. As in FIG. 1, reference numeral 11 in FIGS. 3(a) and 3(b) denotes stationary noise; 12, unstationary noise; and 13, speech. Since the two microphones are arranged as shown in FIG. 2(a) or FIG. 2(b), the power of the speech in FIG. 3(b) is lower than that in FIG. 3(a), while the noise power levels of these outputs are equal to each other. As shown in FIG. 3(c), according to the second conventional method, a difference PD (=P1-P2) between the short time powers P1 and P2 of the two signals is calculated. When the power difference PD is larger than a given threshold value Pth 17, a corresponding time period 18 is detected as a speech period. According to the second conventional method, as is apparent from FIG. 3(c), the unstationary noise period having a high power is not detected as a speech period, unlike in the first conventional method.
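The ideal operation of the second conventional method can be illustrated with a short sketch. The function name and the power values below are hypothetical; the two tracks are assumed to be perfectly aligned in time and to carry equal noise power, as in FIGS. 3(a) and 3(b).

```python
def detect_by_power_difference(p1, p2, pth):
    """Flag each frame whose power difference PD = P1 - P2 exceeds pth."""
    return [(a - b) > pth for a, b in zip(p1, p2)]


# Hypothetical, time-aligned power tracks.
# Noise (frames 2-3) is equally strong at both microphones;
# speech (frames 6-8) is stronger at the first microphone.
p1 = [1, 1, 9, 9, 1, 1, 8, 8, 8, 1]
p2 = [1, 1, 9, 9, 1, 1, 3, 3, 3, 1]

flags = detect_by_power_difference(p1, p2, pth=2)
speech_frames = [i for i, f in enumerate(flags) if f]
print(speech_frames)  # [6, 7, 8]
```

Because the noise contributes equally to P1 and P2, it cancels in PD, and only the speech frames exceed the threshold.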
The second conventional method, however, is rarely operated in an ideal state because the following three conditions must be satisfied to correctly detect a speech period by utilizing a power difference in the two signals:
Condition 1: An S/N ratio difference in two signals must be present.
Condition 2: Noise and speech periods of the two signals must be matched with each other as a function of time.
Condition 3: The variation in the S/N ratio difference caused by various factors must be small (stability of the S/N ratio difference).
According to the second conventional method, the first condition is satisfied, while the second and third conditions are not satisfied. Therefore, the following problems are posed.
The first problem will be described below. FIG. 4 shows an arrangement obtained by adding a noise source 4 to the arrangement of FIG. 2(a). In this arrangement, speech reaches the first microphone 1 and then the second microphone 2, whereas noise reaches the second microphone 2 and then the first microphone 1. Therefore, the speech and noise periods of the two microphone output signals are not matched as a function of time.
The above situation is shown in FIGS. 5(a), 5(b), and 5(c). FIG. 5(a) shows the power P1 of the output from the first microphone 1, FIG. 5(b) shows the power P2 of the output from the second microphone 2, and FIG. 5(c) shows the power difference PD. Reference numeral 11 denotes stationary noise; 12, unstationary noise; and 13, speech, as in FIGS. 3(a) to 3(c).
Relationships between the speech powers and the noise powers in FIGS. 5(a) and 5(b) are the same as those in FIGS. 3(a) and 3(b). However, in the relationships shown in FIGS. 5(a) and 5(b), the speech in the output from the second microphone 2 is delayed from that in the output from the first microphone 1 by a period τS 31, whereas the noise in the output from the second microphone 2 leads that in the output from the first microphone 1 by a period τN 32. The speech and noise periods are therefore not matched with each other as a function of time. As a result, the difference PD between the two signal powers is different from that of FIG. 3(c), as shown in FIG. 5(c). When a period during which the difference exceeds the threshold value Pth 17 is detected as a speech period, a period 33 in FIG. 5(c) is erroneously detected as a speech period, thus posing the first problem. Because the time difference τN 32 in this noise period changes greatly depending on the position of the noise source, it is impossible to establish matching by using a delay element.
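The effect of this time mismatch can be reproduced with a small numerical sketch. Everything in it is assumed for illustration: the helper names, the one-frame values chosen for the delays, the power levels, and the simplification that the powers of the individual sources add linearly.

```python
def burst(n, start, length, level):
    """Power track of n frames equal to `level` in [start, start + length), else 0."""
    return [level if start <= i < start + length else 0 for i in range(n)]


def add(*tracks):
    """Frame-by-frame sum of several power tracks."""
    return [sum(values) for values in zip(*tracks)]


N = 12
floor = [1] * N   # stationary noise
tau_s = 1         # assumed delay of speech at the second microphone, in frames
tau_n = 1         # assumed lead of noise at the second microphone, in frames

# First microphone: noise in frames 3-4, speech in frames 7-9.
p1 = add(floor, burst(N, 3, 2, 8), burst(N, 7, 3, 7))
# Second microphone: noise arrives earlier, speech arrives later and weaker.
p2 = add(floor, burst(N, 3 - tau_n, 2, 8), burst(N, 7 + tau_s, 3, 2))

pd = [a - b for a, b in zip(p1, p2)]
detected = [i for i, d in enumerate(pd) if d > 2]  # threshold Pth = 2
print(pd)        # [0, 0, -8, 0, 8, 0, 0, 7, 5, 5, -2, 0]
print(detected)  # [4, 7, 8, 9]
```

Frame 4 contains only noise, yet PD exceeds the threshold there because the noise has already ended at the second microphone. This plays the role of the erroneously detected period 33 in FIG. 5(c).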
The second problem is that, in a practical situation, various factors change the S/N ratio difference between the two microphone outputs. It is therefore difficult to assure stability of the S/N ratio difference between the two signals, as described below.
The first variation factor is the position of the noise source. As described above, the noise source is assumed to be located in a remote location. When, however, the noise source is located at a relatively close location, the position of the noise source becomes a large variation factor for the S/N ratio difference. FIGS. 6(a) and 6(b) explain this situation. Reference numerals 1 and 2 in FIGS. 6(a) and 6(b) denote first and second microphones, respectively; 3, speakers; and 4, noise sources, as in FIG. 4. When the noise source 4 is located at the position indicated in FIG. 6(a) or 6(b), the noise power of the output from the first microphone 1 is higher than that from the second microphone 2, as is the case with the speech power. As a result, the S/N ratio difference between the two microphone outputs becomes fairly small.
The second variation factor is movement of the speaker. For example, when the speaker 3 turns his head 45° to the right in FIG. 6(b), the speech signal is received by each microphone at almost the same level. As a result, a speech power difference does not occur in the outputs of the two microphones, and the S/N ratio difference therefore varies.
The third variation factor is the influence of room echoes. When two microphones are located so as to cause an S/N ratio difference in their outputs, room echoes having different time structures and magnitudes are added to the noise and speech components of each microphone output. As a result, the S/N ratio difference changes greatly as a function of time.
In addition to the above-mentioned major variation factors, there are other factors such as electrical noise and vibration noise. Therefore, it is very difficult to find a microphone arrangement which assures a stable S/N ratio difference in an environment where these various factors for changing the S/N ratios are present.
As described above, the second conventional method has a decisive drawback and cannot be effectively utilized in practical applications.
The third conventional method for overcoming this drawback of the second conventional method will be described with reference to FIG. 7. Referring to FIG. 7, reference numeral 1 denotes a first microphone; 2, a second microphone; 21, a short time power calculation unit; 22, a speech period candidate detection unit; 23 and 24, average power calculation units for speech period candidates; 25, a power difference detection unit; and 26, a speech period candidate testing unit.
According to this method, as in the second conventional method, the first microphone is located such that the ratio of speech to ambient noise is large, whereas the second microphone is located such that its S/N ratio is smaller than that of the first microphone. A short time power of an output signal from the first microphone 1 is calculated by the short time power calculation unit 21. The short time power of the signal is continuously monitored by the speech period candidate detection unit 22, which detects, as a speech period candidate, a period during which the power exceeds a threshold value Th. The above operations are the same as those in the first conventional method shown in FIG. 1, so that the noise period 15 shown in FIG. 1 is also detected as a speech period candidate. Then, average powers of the outputs from the first and second microphones during this candidate period are calculated by the average power calculation units 23 and 24. Next, the difference PDL between the two average powers is obtained by the power difference detection unit 25. Finally, when the power difference PDL exceeds a predetermined threshold value PDLt, this candidate period is recognized as a correct speech period by the speech period candidate testing unit 26. Otherwise, this candidate period is discarded.
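The processing chain of FIG. 7 can be sketched as below. This is an assumed, simplified rendering: the function names are hypothetical, the power tracks are invented, and the comments only indicate which unit of FIG. 7 each step is meant to mirror.

```python
def find_candidates(power, th):
    """Unit 22: periods (start, end) where the first microphone's power > th."""
    periods, start = [], None
    for i, p in enumerate(power):
        if p > th and start is None:
            start = i
        elif p <= th and start is not None:
            periods.append((start, i))
            start = None
    if start is not None:
        periods.append((start, len(power)))
    return periods


def verify_candidates(p1, p2, th, pdl_t):
    """Units 23-26: keep candidates whose average power difference PDL > pdl_t."""
    accepted = []
    for start, end in find_candidates(p1, th):
        n = end - start
        avg1 = sum(p1[start:end]) / n   # unit 23
        avg2 = sum(p2[start:end]) / n   # unit 24
        pdl = avg1 - avg2               # unit 25
        if pdl > pdl_t:                 # unit 26
            accepted.append((start, end))
    return accepted


# Hypothetical tracks with a one-frame mismatch, as in FIGS. 5(a) and 5(b):
# noise leads and speech lags at the second microphone.
p1 = [1, 1, 9, 9, 9, 9, 1, 1, 8, 8, 8, 8, 8, 1, 1, 1]
p2 = [1, 9, 9, 9, 9, 1, 1, 1, 1, 3, 3, 3, 3, 3, 1, 1]

speech_periods = verify_candidates(p1, p2, th=3, pdl_t=3)
print(speech_periods)  # [(8, 13)]
```

In this example the noise candidate (frames 2 to 5) has an average power difference of 2 and is discarded, while the speech candidate (frames 8 to 12) has a difference of 5.4 and is kept, despite the time mismatch between the two tracks.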
The characteristic feature of the third conventional method is that a difference between the average powers obtained within a relatively long candidate period is calculated in place of the short time power difference. Even if the speech and noise periods of one microphone output are not matched with those of the other microphone output, as shown in FIGS. 5(a) and 5(b), or even if time variations in S/N ratio are caused by room echoes, the influence on the average power difference is relatively small. Therefore, the third conventional method seems to solve the problems of the second conventional method.
In the third conventional method, however, since the speech period is determined based on the average power within the candidate period, an incorrect discrimination result occurs when the noise and speech periods appear continuously, as shown in FIG. 8. FIG. 8 shows an output from the first microphone. A correct speech period is a period 34 in FIG. 8. As shown in FIG. 8, since unstationary noise 12 is close to speech 13 along the time axis, a period 35 which contains both the noise and speech periods and the short time power of which exceeds a threshold value Th 14 is detected as a speech period candidate. When this candidate period 35 is discriminated as a correct speech period upon calculation of an average power difference, a period 36 shown in FIG. 8 becomes an erroneously detected period. When this candidate period is discarded instead, the correct speech period is recognized as a non-speech period. In either case, an erroneous discrimination result is obtained.
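The dilemma of FIG. 8 can be made concrete with a small sketch. The helper names, the power values, and the two trial thresholds are assumptions chosen only to show both outcomes; the two tracks are taken as time-aligned to isolate the effect of the merged candidate.

```python
def first_candidate(power, th):
    """Return (start, end) of the first run of frames with power > th; end exclusive."""
    start = next(i for i, p in enumerate(power) if p > th)
    end = start
    while end < len(power) and power[end] > th:
        end += 1
    return start, end


def average_power_difference(p1, p2, start, end):
    """PDL: difference of the average powers over the candidate period."""
    n = end - start
    return sum(p1[start:end]) / n - sum(p2[start:end]) / n


# Hypothetical tracks: noise (frames 2-4) immediately followed by speech (frames 5-9).
p1 = [1, 1, 9, 9, 9, 8, 8, 8, 8, 8, 1, 1]
p2 = [1, 1, 9, 9, 9, 3, 3, 3, 3, 3, 1, 1]

candidate = first_candidate(p1, th=3)
pdl = average_power_difference(p1, p2, *candidate)
print(candidate, pdl)  # (2, 10) 3.125

accepted_with_low_threshold = pdl > 3   # True: noise frames 2-4 are detected too
accepted_with_high_threshold = pdl > 4  # False: the speech in frames 5-9 is lost
```

Because the noise and the speech are merged into a single candidate, the test can only accept or reject the whole period: either the noise frames are detected along with the speech, or the speech is discarded along with the noise.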
The third conventional method, therefore, cannot serve as a means for solving the drawback of the second conventional method.
Various problems are present in the conventional speech period detection methods. It is therefore difficult to correctly detect a speech period when unstationary noise is present in an input signal.