1. Field of the Invention
The invention described herein is related to machine recognition of human speech. More specifically, the invention is directed to probabilistic methodologies and associated system architectures for automatic speech recognition based on acoustic correlates of phonetic features in signals representative of speech sounds by locating landmarks associated with broad manner phonetic features and extracting fine phonetic features at and around the landmark locations. In a further aspect of the invention, the acoustic correlates are knowledge-based acoustic parameters.
2. Description of the Prior Art
Numerous automatic speech recognition (ASR) procedures are known in the prior art, some of which, including embodiments of the present invention, utilize knowledge of acoustic-phonetics. These procedures can be classified into three broad categories: (1) the acoustic-phonetic approach to recognition, (2) the use of acoustic correlates of phonetic features in the front-ends of dynamic statistical ASR methods like Hidden Markov Models (HMMs), and (3) the use of phonetic features in place of phones as recognition units in the dynamic statistical approaches to ASR that use standard front-ends like Mel-Frequency Cepstral Coefficients (MFCCs).
In ASR systems of the prior art that use phonetic features as recognition units in statistical methods, the usual statistical frameworks are implemented and use phonetic features as an intermediate unit of recognition. The output of the intermediate classifiers is then used to recognize phonemes, words, or sentences. These methods use no explicit knowledge of the acoustic correlates of phonetic features.
Some prior art ASR systems have utilized acoustic cues that are correlates of phonetic features to form the front-end in HMM based ASR methods. These methods traditionally use standard front-ends like MFCCs and Linear Predictive Coding (LPC) coefficients. The use of acoustic-phonetic knowledge in the front-ends of these systems has led to some improvement in performance as measured against certain performance criteria.
Research has shown that acoustic parameters based on acoustic-phonetic knowledge perform better than the standard MFCC based signal representation on the task of broad class segmentation, as tested using an HMM based back end. In particular, it has been shown that the decrease in performance was much less dramatic for the knowledge-based front-end than for MFCCs when cross-gender testing was carried out, that is, when training was done on males and testing was done on females, and vice versa. These experiments were extended to isolated word recognition, and a similar pattern was observed not only for cross-gender testing, but also for testing across adults and children, whose speech may be drawn from different databases.
The acoustic-phonetic approach is characterized by the use of spectral coefficients or knowledge-based acoustic correlates of phonetic features to first carry out a segmentation of speech and then analyze the individual segments or linguistically relevant landmarks for phonemes or phonetic features. This method may or may not involve the use of statistical pattern recognition methods to carry out the recognition task. That is, these methods include pure knowledge-based approaches with no statistical modeling. The acoustic-phonetic approach has been followed and implemented for recognition in varying degrees of completeness or capacity of application to real world recognition problems.
FIG. 1 illustrates via a block diagram a system configuration which implements the traditional acoustic-phonetic approach. First, speech is analyzed by signal processor 110 using any one of a number of spectral analysis techniques, e.g., Short-Time Fourier Transform (STFT), Linear Predictive Coding (LPC) or Perceptual Linear Prediction (PLP). Typically, speech is processed using overlapping frames of 10-25 ms with an overlap of 5 ms. Acoustic correlates of phonetic features are extracted from the spectral representation of the speech signal. For example, low frequency energy may be calculated as an acoustic correlate of sonorancy and zero crossing rate may be calculated as a correlate of frication. The acoustic correlates are then passed to landmark detection or speech segmentation module 120. Speech is then segmented either by finding transient locations using spectral change across adjacent frames, or by using the acoustic correlates of source or manner classes to determine segments possessing stable manner classes.
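By way of illustration only, the two example correlates named above (low frequency energy and zero crossing rate) can be computed per frame as in the following minimal sketch. The sketch is not taken from any prior art system. The function names, the 8 kHz sampling rate, the 20 ms frame length, the 500 Hz cutoff, and the use of a naive discrete Fourier transform are all assumptions chosen for brevity.

```python
import math

def frames(signal, frame_len, hop):
    """Split a sample sequence into overlapping frames (a trailing partial frame is dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def low_freq_energy(frame, sample_rate, cutoff_hz):
    """Energy in the DFT bins at or below cutoff_hz, using a naive DFT."""
    n = len(frame)
    max_bin = int(cutoff_hz * n / sample_rate)
    energy = 0.0
    for k in range(max_bin + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        energy += (re * re + im * im) / n
    return energy

# Hypothetical test signals: a 200 Hz tone and a 3000 Hz tone, 100 ms each.
SAMPLE_RATE = 8000
low_tone = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(800)]
high_tone = [math.sin(2 * math.pi * 3000 * t / SAMPLE_RATE) for t in range(800)]

frame_len = 160          # 20 ms at 8 kHz (assumed)
hop = frame_len - 40     # adjacent frames share 5 ms (40 samples)
low_frames = frames(low_tone, frame_len, hop)
high_frames = frames(high_tone, frame_len, hop)

zcr_low = zero_crossing_rate(low_frames[0])
zcr_high = zero_crossing_rate(high_frames[0])
lfe_low = low_freq_energy(low_frames[0], SAMPLE_RATE, 500)
lfe_high = low_freq_energy(high_frames[0], SAMPLE_RATE, 500)
```

In this sketch the low-pitched tone yields the larger low frequency energy and the high-pitched tone yields the larger zero crossing rate. This matches the direction in which the two measures are used as correlates of sonorancy and frication, although real speech would call for more carefully designed parameters.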
Further analysis of the individual segmentations is carried out in feature detection or phone classification module 130 by recognizing each segment as a phoneme directly or by determining the presence or absence of individual phonetic features and using the intermediate determinations to find the phonemes. When multiple segmentations are generated, as opposed to just a single segmentation, a number of different phoneme sequences may result. The phoneme sequences may be subsequently constrained by a sentence recognition module 140 to select a best representation of the speech based on acoustic and language scores corresponding to the vocabulary and grammar of the language spoken.
Most applications of acoustic phonetic methods in the prior art have been limited to the landmark detection module 120 and the phone classification module 130. Only the SUMMIT system discussed below has implemented an acoustic-phonetic system to carry out recognition on continuous speech with a substantial vocabulary. However, the SUMMIT system uses a traditional front end with little or no knowledge-based acoustic parameters (APs).
A number of problems have been associated with acoustic-phonetic systems of the prior art. First, it has been argued that the difficulty in proper decoding of phonetic units into words and sentences grows dramatically with an increase in the rate of phoneme insertion, deletion, and substitution. This argument makes the assumption that phoneme units are recognized in a first pass with no knowledge of language and vocabulary constraints. This has been true for many of the acoustic-phonetic methods, but need not be limiting since vocabulary and grammar constraints may be used to constrain the speech segmentation paths.
Another shortcoming of the prior art is that extensive knowledge of the acoustic manifestations of phonetic units is required in those systems, and the lack of completeness of this knowledge has been pointed out as a drawback of the knowledge-based approach. While it is true that the knowledge is incomplete, there is no reason to believe that the standard signal representation, for example, Mel-Frequency Cepstral Coefficients (MFCCs) used in previous ASR systems, is sufficient to capture all of the acoustic manifestations of the speech sound. Moreover, although a comprehensive knowledge base has not been implemented, there has been significant development in the research on acoustic correlates, such as for place of stop consonants and fricatives, nasal detection, and semivowel classification. Thus, a scalable system, such as that of the present invention, must be implemented to allow the addition of new knowledge as it is developed.
A third argument against the acoustic-phonetic approach is that the choice of phonetic features and their acoustic correlates is not optimal. It is true that linguists may not agree with each other on the optimal set of phonetic features, but standardization of an optimal set of features may commence without abandoning the acoustic-phonetic approach as a viable ASR option. For example, the exemplary phonetic feature set used in embodiments of the present invention is based on known distinctive feature theories and is at a minimum optimal in that sense. The scalability of the present invention allows reorganization of acoustic correlates at any time as feature sets converge to approach optimal.
A further drawback of the acoustic-phonetic approach, as argued by detractors thereof, is that the design of sound classifiers is not optimal. This argument probably assumes that binary decisions with hard knowledge-based thresholds are used to carry out the classification. However, statistical pattern recognition methods that are no less optimal than the HMMs have been applied to specific procedures of acoustic-phonetic methods, including some acoustic-phonetic knowledge-based methods, with some success. Nevertheless, the application of statistical pattern recognition methods in acoustic-phonetic based systems to accomplish complete recognition tasks, even at the word level, such as those accomplished through the present invention, has not heretofore been realized.
Another shortcoming of the acoustic-phonetic approach that has been argued is that no well-defined automatic procedure exists for tuning the method. Acoustic-phonetic methods can be tuned if they use standard data-driven pattern recognition procedures, which is a requirement that also applies to the present invention. However, the present invention implements an ASR system that does not require tuning except under extreme circumstances, for example, accents that are extremely different from standard American English (assuming the original system was trained on native-born American speakers).
The aforementioned SUMMIT system is a prior art ASR system that implements a traditional statistical model front-end using, for example, MFCCs or auditory-based models to obtain multi-level segmentations of the speech signal. The segments are found using either (1) an acoustic segmentation method which finds time instances when the change in the spectrum is beyond a certain threshold, or (2) boundary detection methods that use statistical context dependent broad class models. The segments and landmarks (defined by boundary locations) are then analyzed for phonemes using Gaussian Mixture Models (GMMs) or multi-layer perceptrons with exceptional results.
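For illustration, the first of these segmentation strategies, marking a boundary wherever the spectrum changes by more than a threshold between adjacent frames, can be sketched as follows. This is a generic sketch and not the SUMMIT implementation. The Euclidean distance measure, the function names, the toy two-band spectra, and the threshold value are assumptions.

```python
import math

def spectral_change(prev_spec, next_spec):
    """Euclidean distance between two per-frame spectral vectors (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(prev_spec, next_spec)))

def find_boundaries(spectra, threshold):
    """Return each frame index i where the change from frame i-1 to frame i exceeds threshold."""
    return [i for i in range(1, len(spectra))
            if spectral_change(spectra[i - 1], spectra[i]) > threshold]

# Hypothetical two-band spectra for six frames.
spectra = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 1]]
boundaries = find_boundaries(spectra, threshold=0.5)
```

With these toy values the adjacent-frame distances are 0, about 1.41, 0, 0, and 1, so boundaries are marked at frames 2 and 5. Varying the threshold would produce coarser or finer segmentations of the same signal.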
The combination of landmark and knowledge-based approaches to automatic speech recognition offers a number of advantages over systems of the prior art. First, by carrying out the analysis only at significant locations, the landmark-based approach to speech recognition utilizes strong correlation among the speech frames. Second, analysis at different landmarks may be done with different acoustic correlates that are computed at different resolutions. For example, analysis at stop bursts to determine the place of articulation requires a higher resolution than that required at syllabic peaks to determine the tongue tip and blade features. Third, the approach provides very straightforward analysis of errors. Given the physical significance of the acoustic correlates and a recognition framework that uses only the relevant acoustic correlates, error analysis can determine whether the acoustic correlates need to be refined or when the decision process did not take into account a certain type of variability that occurs in the speech signal. In fact, this combined landmark and knowledge-based approach to recognition is a tool in itself for understanding speech variability.
While there have been many attempts at an acoustic-phonetic approach to ASR, only the SUMMIT system has been able to match the performance of HMM based methods on practical recognition tasks. The other acoustic-phonetic methods were stopped at the level of finding distinctive acoustic correlates of phonetic features, detection of landmarks, or broad class recognition. Although the SUMMIT system carries out segment based speech recognition with some knowledge-based measurements, it is neither a landmark based system per se nor a phonetic feature based system. Like HMM based systems, it uses all available acoustic information (for example, all the MFCCs) for all decisions. While acoustic-phonetics knowledge and the concept of phonetic features have been used with HMM based systems with some success, the addition has only marginally enhanced the ability to recognize speech at the level of phonemes.
Certain embodiments of the present invention are similar to SUMMIT in that both systems generate multiple segmentations, as will be discussed below, and then use the information extracted from the segments or landmarks to carry out further analysis in a probabilistic manner. However, there are many significant factors that set the systems apart. First, SUMMIT is a phone-based recognition system while the present invention is a phonetic feature based system. Secondly, although the present invention uses a similar concept of obtaining multiple segmentations and then carrying out further analysis based on the information obtained from those segments, the present invention concentrates on linguistically motivated landmarks as opposed to analyzing all the front-end parameters extracted from segments and segment boundaries, as is a principle of SUMMIT. Furthermore, certain embodiments of the present invention do not require every observable acoustic correlate to be included in the phonetic analysis of the segments in that the sufficiency and invariance properties of predetermined acoustic parameters are utilized. Moreover, certain embodiments of the present invention implement binary phonetic feature classification, which provides a uniform framework for speech segmentation, phonetic classification, and lexical access. This differs greatly from the SUMMIT system where segmentation and analysis of segments are carried out using different procedures. Finally, as previously noted, the SUMMIT system uses standard model front-ends for recognition with a few augmented knowledge-based measurements, while certain embodiments of the present invention utilize only the relevant knowledge-based APs for each decision.