1. Field of the Invention
This invention is a method for speech recognition in all languages without using samples of a word. A word may contain one or more syllables, and a sentence in any language consists of several words. The method uses 12 elastic frames of equal length, without filter and without overlap, to normalize the waveform of a word and produce a 12×12 matrix of linear predictive coding cepstra (LPCC). A word represented by its 12×12 LPCC matrix is treated as a vector in a 144-dimensional vector space. Several hundred different “unknown” words of unknown languages or unknown voices are represented by vectors spread throughout this 144-dimensional space. When a speaker utters a known word of any language, the feature of the known word is simulated or computed from the unknown vectors around it in the space, and the feature of the known word is then stored in the word database.
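The normalization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the LPC coefficients are obtained with the autocorrelation method and the Levinson-Durbin recursion, the cepstra with the standard LPC-to-cepstrum recursion, and any samples left over after dividing the word into 12 equal frames are simply dropped. The names `lpc_coefficients`, `lpcc`, `word_to_lpcc_matrix`, `flatten` and `toy_word` are hypothetical and are introduced only for this example.

```python
import math
import random


def lpc_coefficients(frame, order=12):
    """Autocorrelation method with the Levinson-Durbin recursion.

    Returns a[1..order] such that s[n] is approximated by
    sum_k a[k] * s[n - k].
    """
    n = len(frame)
    r = [sum(frame[t] * frame[t + k] for t in range(n - k))
         for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable frame
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def lpcc(frame, order=12):
    """Cepstra c[1..order] derived recursively from the LPC coefficients."""
    a = [0.0] + lpc_coefficients(frame, order)
    c = [0.0] * (order + 1)
    for m in range(1, order + 1):
        c[m] = a[m] + sum((k / m) * c[k] * a[m - k] for k in range(1, m))
    return c[1:]


def word_to_lpcc_matrix(signal, n_frames=12, order=12):
    """Split a word into n_frames equal, non-overlapping frames.

    The frame length stretches with the duration of the word ("elastic"),
    so every word yields an n_frames x order matrix.
    Assumption: leftover samples at the end of the word are dropped.
    """
    frame_len = len(signal) // n_frames
    if frame_len <= order:
        raise ValueError("word too short for the requested frames and order")
    return [lpcc(signal[i * frame_len:(i + 1) * frame_len], order)
            for i in range(n_frames)]


def flatten(matrix):
    """View the 12 x 12 matrix as a single 144-dimensional vector."""
    return [value for row in matrix for value in row]


def toy_word(n_samples, seed=0):
    """Synthetic stand-in for a recorded word (not real speech)."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * 5 * t / n_samples)
            + 0.5 * math.sin(2 * math.pi * 40 * t / n_samples)
            + 0.1 * rng.gauss(0.0, 1.0)
            for t in range(n_samples)]


if __name__ == "__main__":
    # A short (fast) and a long (slow) utterance give the same feature size.
    for n_samples in (2400, 6000):
        matrix = word_to_lpcc_matrix(toy_word(n_samples))
        print(n_samples, len(matrix), len(matrix[0]), len(flatten(matrix)))
```

Because the frame length is derived from the length of the word rather than fixed in advance, a fast and a slow utterance both map to a vector of the same 144 dimensions, which is the property the later matching steps rely on.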
The invention contains 12 elastic frames to normalize a word, a Bayesian pattern matching method to select a known word for the input unknown word, a segmentation method to partition an unknown sentence or name into a set of D unknown words, and a screening method to select a known sentence or name from the database. This invention does not use any known samples and is able to recognize a sentence of any language correctly.
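The Bayesian pattern matching step can be illustrated with a small sketch. This section does not spell out the decision rule, so the following assumes a simplified Bayes classifier: equal prior probabilities for all known words and independent, normally distributed features, each known word being stored as a mean and a standard deviation per feature. The names `bayes_score`, `classify` and `toy_database` are hypothetical, and the toy vectors have 3 dimensions instead of 144 to keep the example readable.

```python
import math


def bayes_score(x, mean, std):
    """Negative log-likelihood of x, up to an additive constant.

    Assumption: features are independent and normally distributed,
    so the score is sum_j [ ln(std_j) + 0.5 * ((x_j - mean_j) / std_j)^2 ].
    """
    return sum(math.log(s) + 0.5 * ((xi - m) / s) ** 2
               for xi, m, s in zip(x, mean, std))


def classify(x, database):
    """Return the known word whose stored feature gives the smallest score.

    database maps each known word to a (mean, std) pair of equal-length lists.
    Assumption: all known words have the same prior probability.
    """
    return min(database, key=lambda word: bayes_score(x, *database[word]))


# Hypothetical word database with 3-dimensional features for illustration.
toy_database = {
    "yes": ([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]),
    "no": ([4.0, 0.0, 1.0], [0.5, 0.5, 0.5]),
}

if __name__ == "__main__":
    print(classify([1.1, 1.9, 3.2], toy_database))
```

Under these assumptions, each comparison is a single pass over the feature vector, which is consistent with the low computation time attributed to the Bayesian classifier.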
2. Description of the Prior Art
In recent years, many speech recognition devices with limited capabilities have become commercially available. These devices are usually able to deal only with a small number of acoustically distinct words. The ability to converse freely with a machine still represents the most challenging topic in speech recognition research. The difficulties involved in speech recognition are:
(1) to extract linguistic information from an acoustic signal and discard extralinguistic information such as the identity of the speaker, his or her physiological and psychological states, and the acoustic environment (noise),
(2) to normalize an utterance, which is characterized by a sequence of feature vectors and is considered to be a time-varying, nonlinear response system, especially for English words, which consist of a variable number of syllables,
(3) to meet the real-time requirement, since prevailing recognition techniques need an extreme amount of computation, and
(4) to find a simple model to represent a speech waveform, since the duration of the waveform changes every time with nonlinear expansion and contraction, and since the duration of the whole sequence of feature vectors and the durations of the stable parts are different every time, even if the same speaker utters the same words or syllables.
These tasks are quite complex and generally take a considerable amount of computing time to accomplish. For an automatic speech recognition system to be practically useful, however, these tasks must be performed in real time. The requirement for extra computer processing time often limits the development of a real-time computerized speech recognition system.
A speech recognition system basically comprises extraction of a sequence of features for a word; normalization of the sequence of features, such that the same words have the same features at the same time positions and different words have different features at the same time positions; segmentation of an unknown sentence or name into a set of D unknown words; and selection of a known sentence or name from a database to match the unknown one.
The measurements made on a speech waveform include energy, zero crossings, extrema count, formants, linear predictive coding cepstra (LPCC) and Mel-frequency cepstral coefficients (MFCC). The LPCC and the MFCC are the most commonly used in speech recognition systems. The sampled speech waveform can be linearly predicted from the past samples of the speech waveform. This is stated in the papers of Makhoul, John, Linear Prediction: A tutorial review, Proceedings of IEEE, 63(4) (1975), and Li, Tze Fen, Speech recognition of mandarin monosyllables, Pattern Recognition 36(2003) 2713-2721, and in the book of Rabiner, Lawrence and Juang, Biing-Hwang, Fundamentals of Speech Recognition, Prentice Hall PTR, Englewood Cliffs, N.J., 1993. Using the LPCC to represent a word provides a robust, reliable and accurate method for estimating the parameters that characterize the linear, time-varying system which has recently been used to approximate the nonlinear, time-varying response system of the speech waveform. The MFCC method uses a bank of filters scaled according to the Mel scale to smooth the spectrum, performing a processing similar to that executed by the human ear. For recognition using the dynamic time warping (DTW) process, the MFCC is reported to perform better than the LPCC in the paper of Davis, S. B. and Mermelstein, P., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-28(4), (1980), 357-366. In recent research including the present invention, however, the LPCC gives better recognition than the MFCC when the Bayesian classifier is used, with much less computation time. There are several methods used to perform the task of utterance classification. A few of these methods which have been practically used in automatic speech recognition systems are dynamic time warping (DTW) pattern matching, vector quantization (VQ) and the hidden Markov model (HMM) method.
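As an illustration of the first of these classification methods, the following is a minimal sketch of a textbook dynamic time warping distance between two feature sequences of different lengths. It is not taken from any of the cited systems; the function name `dtw_distance`, the Euclidean frame distance and the unconstrained warping path are assumptions made for this example.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors of equal size."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dtw_distance(a, b):
    """Cost of the best monotonic alignment between sequences a and b.

    a and b are lists of feature vectors and may differ in length.
    The table has (len(a) + 1) * (len(b) + 1) cells, so each comparison
    costs time proportional to len(a) * len(b).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(a[i - 1], b[j - 1])
            table[i][j] = cost + min(table[i - 1][j],      # a advances
                                     table[i][j - 1],      # b advances
                                     table[i - 1][j - 1])  # both advance
    return table[n][m]


if __name__ == "__main__":
    fast = [[0.0], [1.0], [2.0]]
    slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
    print(dtw_distance(fast, slow))
```

The alignment table must be filled for every pair of unknown and reference utterances, which gives a sense of why this kind of matching requires considerable computation when the vocabulary is large.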
The above recognition methods give good recognition ability, but they are computationally intensive and require extraordinary computer processing time in both feature extraction and classification. Recently, the Bayesian classification technique has tremendously reduced the processing time and given better recognition than the HMM recognition system. This is shown in the papers of Li, Tze Fen, Speech recognition of mandarin monosyllables, Pattern Recognition 36(2003) 2713-2721 and Chen, Y. K., Liu, C. Y., Chiang, G. H. and Lin, M. T., The recognition of mandarin monosyllables based on the discrete hidden Markov model, The 1990 Proceedings of Telecommunication Symposium, Taiwan, 1990, 133-137. However, the feature extraction and compression procedures, which reduce the time-varying, nonlinearly expanded and contracted feature vectors to an equal-sized pattern of feature values representing a word for classification, are still complicated and time consuming, and they involve many experimental and adjusted parameters and thresholds. The main defect of the above or past speech recognition systems is that they use many arbitrary, artificial or experimental parameters or thresholds, especially when using the MFCC feature. These parameters or thresholds must be adjusted before the systems are put to use. Furthermore, the existing recognition systems are not able to identify an English word or a Chinese syllable in fast or slow speech, which limits their recognition applicability and reliability.
Therefore, there is a need for a speech recognition system which can naturally and theoretically produce an equal-sized sequence of feature vectors that well represents the nonlinear, time-varying waveform of a word, so that each feature vector in the time sequence is the same for the same words and different for different words; which provides a faster processing time; which does not have any arbitrary, artificial or experimental thresholds or parameters; and which is able to identify words in both fast and slow utterances in order to extend its recognition applicability. Most importantly, the speech recognition system must be very accurate in identifying a word or a sentence in all languages.