The invention relates to speech analysis, and particularly to means for discriminating voiced and unvoiced sounds in speech, while using minimal computational resources.
Automatic speech processing is an important and growing field. Many applications require an especially rapid means for discriminating voiced and unvoiced sounds, so that they can respond to different commands without perceptible delay. Emerging applications include single-purpose devices that recognize just a few predetermined commands, based on the order of the voiced and unvoiced intervals spoken, and applications requiring a fast response, such as a voice-activated stopwatch or camera. Such applications are often highly constrained in computational resources, battery power, and cost. In addition, many applications detect voiced and unvoiced sounds separately and process them differently, including applications that interpret natural language—such as speech-to-text dictation, spoken commands for browsing and searches, and voice-activated controls—which often use a pre-processor to discriminate voiced and unvoiced sounds, thereby simplifying word identification and reducing latency. Other applications perform speech compression or coding to improve the efficiency of wireless transmission of speech, using different compression routines for voiced and unvoiced sounds due to the different properties of those sounds. All of these applications would benefit from a computationally efficient, fast routine that reliably discriminates voiced and unvoiced sounds as soon as they are spoken.
As used herein, “discriminating” voiced and unvoiced sounds means separately detecting voiced and unvoiced sounds, and identifying them as such. “Voiced” sounds are sounds generated by vocal cord vibration, and “unvoiced” sounds are generated by the turbulence of air passing through an obstruction in the air passage but without vocal cord involvement. A method is “computationally efficient” if it requires few processor operations and few memory locations to obtain the intended result. For brevity, voiced and unvoiced may be abbreviated as V and U respectively.
Prior art includes many strategies for discriminating V and U sound types. Some prior art involves a simple frequency threshold, exploiting the tendency of voiced sounds to have a lower primary frequency than unvoiced sounds. U.S. Pat. No. 7,523,038 teaches two analog bandpass filters to separate high and low frequencies, U.S. Pat. No. 4,357,488 teaches high-pass and low-pass analog filters, and U.S. Pat. No. 6,285,979 teaches a single bandpass filter with gated counters to separate voiced and unvoiced sounds. U.S. application Ser. No. 13/220,317 teaches a filter to select lower-frequency sounds and reject high-frequency sounds, while U.S. application Ser. Nos. 13/274,322 and 13/459,584 carry this further by detailing implementation methods and unique applications of such filtered signals. A simple frequency cut lacks reliability because real speech includes complex sound modulation due to vocal cord flutter as well as effects from the fluids that normally coat the vocal tract surfaces, complicating the frequency spectrum for both sound types. Strong modulation, overtones, and interference between frequency components further complicate the V-U discrimination. In some cases, both voiced and unvoiced sounds are produced simultaneously. Unless the speech is carefully spoken to avoid these problems, a simple frequency threshold is insufficient for applications that must detect, and respond differently to, voiced and unvoiced sounds.
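For illustration only, the simple frequency-threshold approach described above can be sketched digitally with a pair of complementary first-order filters whose band energies are compared. The function name, cutoff frequency, and decision ratio below are hypothetical assumptions, not values taken from any cited patent:

```python
import math

def classify_frame_by_band_energy(samples, fs=8000, cutoff_hz=1000.0, ratio=1.0):
    """Split the signal into low and high bands with a one-pole low-pass
    filter and its complement, then compare band energies.  Label the
    frame voiced (V) when low-band energy dominates, else unvoiced (U)."""
    dt = 1.0 / fs
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)          # one-pole smoothing coefficient
    low = 0.0
    low_energy = high_energy = 0.0
    for x in samples:
        low += alpha * (x - low)    # low-pass filter output
        high = x - low              # complementary high-pass output
        low_energy += low * low
        high_energy += high * high
    return "V" if low_energy > ratio * high_energy else "U"
```

A low-pitched vowel then yields mostly low-band energy and is labeled V, while a sibilant yields mostly high-band energy and is labeled U; as noted above, overmodulation, overtones, and mixed V-U sounds in real speech defeat such a simple threshold.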
Many prior methods calculate the frequency spectrum digitally, for example using the FFT (fast Fourier transform). Examples are U.S. Pat. No. 4,637,046, which analyzes digitally filtered data to detect increasing or decreasing trends that correlate with sound type, and U.S. Pat. No. 7,921,364, which defines multiple digital frequency bands for the same purpose but without the trending parameter. Spectral analysis, by FFT or otherwise, requires a fast processor and ample memory to store a large number of digitized values of the sound waveform. Transformation into the frequency domain takes substantial time, even with a powerful processor. These computational requirements are difficult to meet for many low-cost, resource-limited systems such as wearable devices and embedded controllers. In addition, extensive computation consumes power and depletes the battery, a critical issue for many portable and wearable devices. Moreover, as mentioned, a simple frequency criterion is a poor V-U discriminator because the waveform of real speech is often overmodulated. Speech also exhibits rapid, nearly discontinuous changes in spectrum, which further complicates standard FFT analyses and results in misidentification of sound type. For these reasons and others, digitally derived spectral information correlates well with sound type only in idealized cases. In real speech, reliance on spectral bands for V-U discrimination results in misidentified sounds, despite the extra computational resources spent transforming time-domain waveforms into frequency-domain spectra.
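As a sketch of the digital spectral approach, and of the computational burden it entails, a band-energy test over a discrete Fourier transform might look as follows. The band edge, frame length, and function name are illustrative assumptions; the transform is written as a naive DFT to make the arithmetic cost explicit:

```python
import cmath

def voiced_by_spectrum(samples, fs=8000, split_hz=1000.0):
    """Label a frame V or U by comparing spectral energy below and above
    split_hz.  The naive DFT below costs O(N^2) operations per frame;
    even an FFT costs O(N log N) plus the memory to buffer the frame."""
    n = len(samples)
    low_energy = high_energy = 0.0
    for k in range(1, n // 2):      # positive-frequency bins, skipping DC
        coeff = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        energy = abs(coeff) ** 2
        if k * fs / n < split_hz:
            low_energy += energy
        else:
            high_energy += energy
    return "V" if low_energy > high_energy else "U"
```

Even this minimal version must buffer and revisit every sample of the frame, illustrating why spectral methods strain low-cost embedded processors.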
Prior art includes many attempts to overcome these limitations of spectral analysis. Often prior methods divide the sound signal into “frames,” which are brief (typically 10-30 milliseconds) portions of sound, with separate analysis of the sound in each frame. Frames may be overlapped for enhanced resolution, but overlapping doubles the computational load. Many methods include autocorrelation or other frame-by-frame comparisons. U.S. Pat. No. 6,640,208 B1 employs autocorrelation with an adaptable threshold criterion. U.S. Pat. No. 6,915,256 B2 uses this technique to detect voiced sounds and to quantify the pitch of the fundamental. U.S. Pat. No. 6,915,257 B2 teaches autocorrelation with an adaptable frame size adjusted by feedback, resulting in different frame sizes for voiced and unvoiced sounds. U.S. Pat. No. 7,246,058 B2 extends the autocorrelation to include two separate sound signals, such as from two microphones. U.S. application Ser. No. 13/828,415 teaches autocorrelation to measure harmonicity, which tends to be higher for voiced sounds. However, the voiced intervals of a rough voice often contain strong modulation, which complicates the autocorrelation and reduces the measured harmonicity of voiced sounds. The same speaker speaking the same commands may exhibit very different autocorrelation parameters when tired or ill, or after smoking, to name a few examples. Another problem is that unvoiced sounds, particularly sibilants, often have strong transient autocorrelation and significant harmonicity, particularly if the speaker has a lisp; dental issues can cause this as well. Autocorrelation is said to discriminate U and V sounds with lower computational demands than spectral analysis, but in practice autocorrelation requires a large number of frames and lag parameters, which generally consumes at least as many computational resources as a spectral analysis of equivalent quality.
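The frame-by-frame autocorrelation test discussed above can be sketched as follows; the frame length, pitch-lag range, and harmonicity threshold are illustrative assumptions, not values from the cited references:

```python
def voiced_by_autocorrelation(frame, fs=8000, f_min=80, f_max=400, thresh=0.5):
    """Label a frame V when its normalized autocorrelation has a strong
    peak at some lag within the typical pitch-period range, i.e. when
    the frame is highly periodic (harmonic)."""
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return "U"                       # silent frame: treat as unvoiced
    lag_lo = int(fs / f_max)             # shortest pitch period, in samples
    lag_hi = min(int(fs / f_min), n - 1) # longest pitch period, in samples
    best = 0.0
    for lag in range(lag_lo, lag_hi + 1):
        r = sum(frame[t] * frame[t - lag] for t in range(lag, n))
        best = max(best, r / energy)
    return "V" if best > thresh else "U"
```

A cleanly periodic (voiced) frame scores near 1 at the pitch lag, while a noise-like (unvoiced) frame scores near 0; as noted above, rough voices and strongly autocorrelated sibilants blur this separation in practice, and the nested loops over lags and samples illustrate the computational cost.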
And, as mentioned, many prior art methods employ both spectral analysis and frame-by-frame autocorrelation analysis, further burdening resource-constrained systems.
Some prior methods combine multiple different analyses to improve V-U discrimination. Each analysis typically involves multiple complex calculations and thresholds, such as digital filtering, autocorrelation, harmonicity, bandwidth cuts, frame-by-frame trending, and other criteria. In some cases the various criteria are applied sequentially; in other cases the parameters are combined in a least-squares or other global fit. Examples are: U.S. Pat. No. 5,809,455, which detects peak-value changes and statistical changes in successive frames; U.S. Pat. No. 8,219,391 B2, with separate codebooks for V and U frames; U.S. Pat. No. 4,720,862, which compares the autocorrelation power with the residual power outside the autocorrelation; and U.S. Pat. No. 8,583,425 B2, which detects voiced sound as a narrowband signal but detects unvoiced sound separately using a high-frequency threshold. Reliability improvements are indeed obtained when multiple test criteria are analyzed and combined, if they have been carefully calibrated, but the computational load increases with each additional analysis technique, further stressing small systems. Each additional analysis also adds processing delay, which becomes objectionable when numerous criteria must be calculated by multiple software routines. And, as mentioned, each computation draws power for processor operations and memory writes, reducing battery life.
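In the spirit of such multi-criterion methods, a minimal combination of two inexpensive criteria might be sketched as follows; the particular features, weights, and threshold are hypothetical, chosen only to illustrate how separate criteria are merged into one decision:

```python
def voiced_by_combined_criteria(frame, w_tilt=0.5, w_zcr=0.5, thresh=0.5):
    """Combine two cheap time-domain criteria into one unvoiced score:
    spectral tilt, estimated as normalized first-difference energy
    (near 0 for low-frequency content, up to 2 at the Nyquist limit),
    and the zero-crossing rate.  Both tend to be high for unvoiced
    sounds, so a high weighted sum is read as U."""
    n = len(frame)
    energy = sum(x * x for x in frame) or 1e-12   # guard divide-by-zero
    diff_energy = sum((frame[t] - frame[t - 1]) ** 2 for t in range(1, n))
    tilt = diff_energy / (2.0 * energy)
    zcr = sum((frame[t - 1] >= 0.0) != (frame[t] >= 0.0)
              for t in range(1, n)) / (n - 1)
    unvoiced_score = w_tilt * tilt + w_zcr * zcr
    return "U" if unvoiced_score > thresh else "V"
```

Each added criterion improves reliability only after careful calibration of its weight and threshold, while adding its own pass over the data, which is the computational and latency burden described above.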
A potentially important advancement is disclosed in U.S. application Ser. No. 13/610,858 which detects voiced and unvoiced sounds by applying formulas to select characteristic waveform features. Although this reference is useful as a starting point, further detail is needed showing how those formulas can be adjusted to optimize the discrimination. Also, experimental demonstration that the method has high reliability in V-U discrimination is needed.
Many prior methods employ a probabilistic model (such as an HMM) or a signal-generation process (such as CELP), usually guided by an error-feedback algorithm that continuously adjusts a model of the sound signal. U.S. Pat. No. 6,249,758 B1 is an example of signal analysis by synthesis, in this case using two generators aligned with the voiced and unvoiced components separately. In practice, however, only the voiced component can be reproduced by synthesis, because the unvoiced component is too fast and too dynamic to be synthesized, at least in a practical system at reasonable cost. And the computational requirements of both the signal-generation software and the adaptive-model software greatly exceed the capabilities of most low-end embedded systems, while the computational delays slow even the most capable processors.
Some prior methods characterize the sound with zero-crossing detection, such as U.S. Pat. Nos. 6,023,671 and 4,589,131. Zero-crossing detection is a step in the right direction, since it works entirely in the time domain, is fast, and extracts sound information directly from specific waveform features. However, zero-crossing schemes produce insufficient waveform information, particularly when multiple frequency components interfere or when a high-frequency component rides on a lower-frequency component, as is often the case in real speech; the combined signal then crosses zero infrequently, and the high-frequency component goes largely undetected. The zero-crossing distribution discards all information occurring between the zero crossings of the signal, thereby losing information that is crucial for V-U discrimination in all but the most idealized sounds.
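The zero-crossing feature, and the failure mode described above, can be sketched as follows; the signal frequencies and amplitudes are illustrative:

```python
import math

def zero_crossing_rate(samples):
    """Fraction of sample-to-sample transitions where the signal changes
    sign; a high rate is conventionally read as unvoiced sound."""
    return sum((a >= 0.0) != (b >= 0.0)
               for a, b in zip(samples, samples[1:])) / (len(samples) - 1)

fs = 8000
low = [math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]
high = [math.sin(2 * math.pi * 3000 * n / fs) for n in range(800)]
# A small high-frequency component riding on a large low-frequency one:
mixed = [a + 0.3 * b for a, b in zip(low, high)]
# The mixed signal stays away from zero most of the time, so its
# zero-crossing rate remains far below that of the high tone alone:
# the strong 3000 Hz component is largely invisible to this feature.
```

This is the information loss noted above: everything the waveform does between zero crossings is discarded.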
All of the prior-art methods that reliably discriminate voiced and unvoiced sounds employ cumbersome analog electronic filters, extensive digital processing, large data arrays, or all three. Meeting these computational demands typically requires a fast processor with substantial memory, or offloading the analysis to a remote server, and even then a perceptible delay remains. Low-cost voice-activated systems such as wearable devices and embedded controllers are usually unable to implement any of the reliable prior-art methods for discriminating voiced and unvoiced sounds. This limitation retards innovation and product development in the important market for miniature, low-power devices. What is needed is a method that discriminates voiced and unvoiced speech rapidly and reliably while using extremely minimal processing and memory resources.