1. Field of the Invention
This invention relates to a method for speech recognition of both English and Chinese. The method includes: a fixed number of elastic frames of equal length, without filter and without overlap, which normalize the waveform of a Chinese syllable or English word to produce an equal-sized matrix of linear predictive coding cepstra (LPCC); a Bayesian pattern matching method to select a known English word or Chinese syllable for an input unknown English word or Chinese syllable; a segmentation method to partition an unknown sentence or name into a set of D unknown English words or Chinese syllables; and a screening method to select an English or Chinese sentence or name from an English and Chinese sentence and name database.
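The framing step named above can be illustrated with a minimal sketch. It only shows how a fixed number of equal-length, non-overlapping frames stretches or shrinks with the utterance, so that a short and a long waveform both yield the same number of frames. The function name, the choice of 12 frames, and the decision to drop leftover trailing samples are assumptions made for illustration and are not taken from this specification.

```python
def split_into_equal_frames(samples, num_frames):
    """Partition `samples` into `num_frames` contiguous frames of equal length.

    The frames do not overlap and no window or filter is applied.
    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    frame_len = len(samples) // num_frames  # the frame length adapts to the utterance
    if frame_len == 0:
        raise ValueError("waveform is shorter than the number of frames")
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(num_frames)]


# Two hypothetical utterances of different duration (placeholder sample values).
fast_utterance = list(range(24))
slow_utterance = list(range(50))

frames_fast = split_into_equal_frames(fast_utterance, 12)
frames_slow = split_into_equal_frames(slow_utterance, 12)
```

Both calls return 12 frames; only the frame length differs. Computing a fixed number of cepstral coefficients per frame would therefore give a matrix of the same size for either utterance.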
2. Description of the Prior Art
In recent years, many speech recognition devices with limited capabilities have become commercially available. These devices can usually deal only with a small number of acoustically distinct English words or Chinese syllables. The ability to converse freely with a machine remains the most challenging topic in speech recognition research. The difficulties involved in speech recognition are:
(1) to extract linguistic information from an acoustic signal and discard extralinguistic information such as the identity of the speaker, his or her physiological and psychological states, and the acoustic environment (noise),
(2) to normalize an utterance, which is characterized by a sequence of feature vectors and is considered to be a time-varying, nonlinear response system, especially for an English word, which consists of a variable number of syllables,
(3) to meet the real-time requirement, since prevailing recognition techniques require an extreme amount of computation, and
(4) to find a simple model to represent a speech waveform, since the duration of the waveform changes every time with nonlinear expansion and contraction, and since the duration of the whole sequence of feature vectors and the durations of its stable parts differ every time, even if the same speaker utters the same words or syllables.
These tasks are quite complex and generally take a considerable amount of computing time to accomplish. For an automatic speech recognition system to be practically useful, these tasks must be performed in real time. The extra computer processing time required often limits the development of a real-time computerized speech recognition system.
A speech recognition system basically comprises: extraction of a sequence of features for an English word or Chinese syllable; normalization of the sequence of features such that the same English words or Chinese syllables have the same features at the same time positions and different English words or Chinese syllables have their own distinct features at the same time positions; segmentation of an unknown English (Chinese) sentence or name into a set of D unknown English words (Chinese syllables); and selection of a known English (Chinese) sentence or name from a database to match the unknown one.
The measurements made on a speech waveform include energy, zero crossings, extrema count, formants, linear predictive coding cepstra (LPCC) and Mel frequency cepstrum coefficients (MFCC). The LPCC and the MFCC are the features most commonly used in speech recognition systems. A sample of the speech waveform can be linearly predicted from the past samples of the waveform. This is stated in the papers of Makhoul, John, Linear Prediction: A tutorial review, Proceedings of IEEE, 63(4) (1975), Li, Tze Fen, Speech recognition of mandarin monosyllables, Pattern Recognition 36(2003) 2713-2721, and in the book of Rabiner, Lawrence and Juang, Biing-Hwang, Fundamentals of Speech Recognition, Prentice Hall PTR, Englewood Cliffs, N.J., 1993. Using the LPCC to represent an English word (a Chinese syllable) provides a robust, reliable and accurate method for estimating the parameters that characterize the linear, time-varying system which has recently been used to approximate the nonlinear, time-varying response system of the speech waveform. The MFCC method uses a bank of filters scaled according to the Mel scale to smooth the spectrum, performing a processing similar to that executed by the human ear. For recognition, the performance of the MFCC is said to be better than that of the LPCC when the dynamic time warping (DTW) process is used, according to the paper of Davis, S. B. and Mermelstein, P., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoustic Speech Signal Process, ASSP-28(4), (1980), 357-366. In recent research, including the present invention, however, the LPCC gives better recognition than the MFCC when the Bayesian classifier is used, with much less computation time. There are several methods used to perform the task of utterance classification.
A few of these methods which have been practically used in automatic speech recognition systems are dynamic time warping (DTW) pattern matching, vector quantization (VQ) and the hidden Markov model (HMM) method. These recognition methods give good recognition ability, but they are very computationally intensive and require extraordinary computer processing time both in feature extraction and in classification. Recently, the Bayesian classification technique has been reported to reduce the processing time tremendously and to give better recognition than the HMM recognition system. This is given by the papers of Li, Tze Fen, Speech recognition of mandarin monosyllables, Pattern Recognition 36(2003) 2713-2721 and Chen, Y. K., Liu, C. Y., Chiang, G. H. and Lin, M. T., The recognition of mandarin monosyllables based on the discrete hidden Markov model, The 1990 Proceedings of Telecommunication Symposium, Taiwan, 1990, 133-137. However, the feature extraction and compression procedures, which rely on many experimental and adjusted parameters and thresholds to convert the time-varying, nonlinearly expanded and contracted feature vectors into an equal-sized pattern of feature values representing an English word or a Chinese syllable for classification, are still complicated and time-consuming. The main defect of the above and earlier speech recognition systems is that they use many arbitrary, artificial or experimental parameters or thresholds, especially when the MFCC feature is used. These parameters or thresholds must be adjusted before the systems are put into use. Furthermore, the existing recognition systems are not able to identify an English word or Chinese syllable in fast or slow speech, which limits their recognition applicability and reliability.
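The statement above that a speech sample can be linearly predicted from past samples, s(n) ≈ a_1 s(n-1) + ... + a_P s(n-P), can be made concrete with a short sketch. It estimates the prediction coefficients of one frame by the autocorrelation method with the Levinson-Durbin recursion and converts them to cepstral coefficients with the standard textbook recursion. This is a generic illustration, not the specific procedure of the present invention; the function names, the prediction order and the test signal are assumptions chosen for the example.

```python
def autocorrelation(frame, order):
    """Return r[0..order], where r[k] = sum over t of frame[t] * frame[t + k]."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for the predictor coefficients a_1..a_order from autocorrelations r."""
    a = [0.0] * (order + 1)  # a[0] is unused so that a[k] holds a_k
    err = r[0]               # prediction error energy of the order-0 predictor
    for i in range(1, order + 1):
        if err <= 0.0:       # silent or degenerate frame: keep remaining a_k at zero
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:]


def lpc_to_cepstrum(a):
    """Convert a_1..a_P to cepstra c_1..c_P:
    c_m = a_m + sum_{k=1}^{m-1} (k / m) * c_k * a_{m-k}.
    """
    p = len(a)
    c = [0.0] * (p + 1)      # c[0] is unused so that c[m] holds c_m
    for m in range(1, p + 1):
        c[m] = a[m - 1] + sum((k / m) * c[k] * a[m - k - 1] for k in range(1, m))
    return c[1:]


# Hypothetical test frame: a decaying exponential that obeys s(n) = 0.5 * s(n-1).
frame = [0.5 ** n for n in range(50)]
coeffs = levinson_durbin(autocorrelation(frame, 2), 2)
cepstra = lpc_to_cepstrum(coeffs)
```

For this test frame the first prediction coefficient should come out close to 0.5 and the second close to zero, since each sample is exactly half of the previous one. Applying the same three functions to every frame of an utterance would give one row of cepstra per frame.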
Therefore, there is a need for a simple speech recognition system which can naturally and theoretically produce an equal-sized sequence of feature vectors that represents well the nonlinear, time-varying waveform of an English word or a Chinese syllable, so that each feature vector in the time sequence is the same for the same English words or Chinese syllables and different for different English words or Chinese syllables; which provides a faster processing time; which does not have any arbitrary, artificial or experimental thresholds or parameters; and which is able to identify English words or Chinese syllables in both fast and slow utterances in order to extend its recognition applicability. Most importantly, the speech recognition system must identify a word, syllable or sentence very accurately.