1. Field of the Invention
Words are usually input by typing, which requires typing skill and a good memory for exact spelling. Inputting Chinese words requires exact pronunciation as well as considerable typing skill. Because there are several thousand commonly used words, it is hard to use speech recognition methods to recognize a word or a sentence and thereby input words. The present invention classifies a large number of commonly used known words into a small number m (about 500) of categories represented by m unknown voices, using any language or dialect, even when the pronunciation is incorrect. Each unknown voice represents a category of known words whose pronunciation is similar to that unknown voice. When a user pronounces a word, the invention uses the Bayes classifier to find the F most similar unknown voices. All known words from the F categories represented by those unknown voices are arranged in order of decreasing similarity, according to their pronunciation similarity to the pronounced word and their alphabetic letters (or the number of strokes of a Chinese word), so that the user can quickly and easily find the pronounced word. The invention does not attempt to pick the pronounced word exactly from several thousand words, which is impossible. It only finds the F most similar unknown voices among a small, fixed number m of categories, and hence it is able to recognize and input a large number of words accurately and quickly. Furthermore, since the m unknown voices are fixed and are independent of language, speaker and sex, the speech recognition method is stable and can be easily used by all users.
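The two-stage lookup described above (find the F most similar unknown voices, then list the known words in those categories) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the text: each unknown voice is summarized by a per-feature mean and variance, the Bayes classifier is taken to be an equal-prior Gaussian with independent features, and words are ordered by the rank of their category and then alphabetically as a stand-in for the pronunciation-similarity ordering. The names `bayes_distance`, `top_f_categories` and `candidate_words` are hypothetical.

```python
import math

def bayes_distance(x, mean, var):
    """Negative log-likelihood (up to a constant) of feature vector x under
    independent Gaussians; a smaller value means a more similar voice.
    ASSUMPTION: equal priors and independent features."""
    return sum(0.5 * math.log(v) + 0.5 * (xi - m) ** 2 / v
               for xi, m, v in zip(x, mean, var))

def top_f_categories(x, categories, f):
    """Return the f (distance, name) pairs with the smallest distance."""
    scored = sorted((bayes_distance(x, c["mean"], c["var"]), name)
                    for name, c in categories.items())
    return scored[:f]

def candidate_words(x, categories, f):
    """List the known words of the f best categories, best category first,
    alphabetical inside a category, with duplicates removed."""
    seen, ranked = set(), []
    for _, name in top_f_categories(x, categories, f):
        for word in sorted(categories[name]["words"]):
            if word not in seen:
                seen.add(word)
                ranked.append(word)
    return ranked

# Toy data with m = 3 categories and 2 features (real features would be
# the flattened 12x12 LPCC matrix). All values here are made up.
categories = {
    "A": {"mean": [0.0, 0.0], "var": [1.0, 1.0], "words": ["see", "sea"]},
    "B": {"mean": [5.0, 5.0], "var": [1.0, 1.0], "words": ["say", "see"]},
    "C": {"mean": [10.0, 10.0], "var": [1.0, 1.0], "words": ["so"]},
}
words = candidate_words([1.0, 1.0], categories, f=2)
```

Note that "see" belongs to two categories in the toy data, which mirrors the idea that one word may be filed under several unknown voices; the sketch simply keeps its first occurrence.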
The method uses 12 elastic frames of equal length, without filter and without overlap, to normalize the waveform of a word or an unknown voice and produce a 12×12 matrix of linear predictive coding cepstra (LPCC). The Bayesian pattern matching method can therefore compare the equal-sized 12×12 LPCC matrices of two words or unknown voices.
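As an illustration of how a waveform of any duration can be reduced to a 12×12 matrix, the following Python sketch cuts the signal into 12 contiguous, non-overlapping frames whose length stretches with the signal, and computes 12 LPCC per frame. The text does not specify how the LPCC are computed; the autocorrelation method with the Levinson-Durbin recursion and the standard LPC-to-cepstrum recursion are assumed here, and all function names are hypothetical.

```python
import math

def elastic_frames(signal, n_frames=12):
    """Cut the signal into n_frames contiguous, non-overlapping frames of
    (nearly) equal length; frame length scales with the signal length."""
    n = len(signal)
    bounds = [round(i * n / n_frames) for i in range(n_frames + 1)]
    return [signal[bounds[i]:bounds[i + 1]] for i in range(n_frames)]

def autocorrelation(frame, order):
    """Autocorrelation r[0..order] of one frame (no window applied)."""
    return [sum(frame[t] * frame[t + k] for t in range(len(frame) - k))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """LPC coefficients a[1..order] of the predictor
    s[n] ~ sum_k a[k] * s[n-k], from autocorrelation r."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:          # silent or degenerate frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

def lpc_to_cepstrum(a):
    """Cepstra c[1..p] from LPC coefficients a[1..p] via
    c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k}."""
    p = len(a)
    c = [0.0] * p
    for n in range(1, p + 1):
        c[n - 1] = a[n - 1] + sum((k / n) * c[k - 1] * a[n - k - 1]
                                  for k in range(1, n))
    return c

def lpcc_matrix(signal, n_frames=12, order=12):
    """One row of `order` LPCC per elastic frame: a 12x12 matrix by default."""
    return [lpc_to_cepstrum(levinson_durbin(autocorrelation(f, order), order))
            for f in elastic_frames(signal, n_frames)]

# Usage on a synthetic 1200-sample signal (a tone plus deterministic noise).
demo = [math.sin(0.3 * t) + 0.1 * (((t * 7919) % 101) / 101 - 0.5)
        for t in range(1200)]
matrix = lpcc_matrix(demo)
```

Because the frame boundaries are proportional to the signal length, a slow and a fast utterance both yield matrices of the same size, which is what makes an element-by-element comparison possible.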
Since the same word can be pronounced in any language or in any accent, correctly or incorrectly, the same word is classified into several categories. Hence any person, using any language and without knowing how to spell or type, can easily use the invention to recognize a word or a sentence and to input a large number of words.
This invention does not use any samples for any known words and is still able to recognize a sentence of any language correctly.
2. Description of the Prior Art
In recent years, many speech recognition devices with limited capabilities have become commercially available. These devices are usually able to deal only with a small number of acoustically distinct words. The ability to converse freely with a machine still represents the most challenging topic in speech recognition research. The difficulties involved in speech recognition are:
(1) to extract linguistic information from an acoustic signal and discard extra linguistic information such as the identity of the speaker, his or her physiological and psychological states, and the acoustic environment (noise),
(2) to normalize an utterance which is characterized by a sequence of feature vectors that is considered to be a time-varying, nonlinear response system, especially for English words, which consist of a variable number of syllables,
(3) to meet real-time requirement since prevailing recognition techniques need an extreme amount of computation, and
(4) to find a simple model to represent a speech waveform since the duration of waveform changes every time with nonlinear expansion and contraction and since the durations of the whole sequence of feature vectors and durations of stable parts are different every time, even if the same speaker utters the same words or syllables.
These tasks are quite complex and generally take a considerable amount of computing time to accomplish. For an automatic speech recognition system to be practically useful, these tasks must be performed in real time. The requirement of extra computer processing time may often limit the development of a real-time computerized speech recognition system.
A speech recognition system basically consists of: extraction of a sequence of features for a word; normalization of the sequence of features such that the same words have the same features at the same time positions and different words have different features at the same time positions; segmentation of a sentence or name into a set of D words; and selection of a matching sentence or name from a database as the sentence or name pronounced by a user.
The measurements made on a speech waveform include energy, zero crossings, extreme count, formants, linear predictive coding cepstra (LPCC) and Mel frequency cepstrum coefficients (MFCC). The LPCC and the MFCC are the features most commonly used in speech recognition systems. Furthermore, existing recognition systems are not able to recognize speech of any language whether it is uttered quickly or slowly, which limits their applicability and reliability.
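For concreteness, the two simplest measurements in this list, energy and zero crossings, can be computed per frame as in the short Python sketch below. The definitions used (sum of squared samples, and the number of sign changes between adjacent samples) are the common textbook ones and are assumed here, since the text does not define them; the function names are hypothetical.

```python
def short_time_energy(frame):
    """Energy of one frame: the sum of squared sample values."""
    return sum(s * s for s in frame)

def zero_crossings(frame):
    """Number of sign changes between adjacent samples.
    ASSUMPTION: samples equal to zero do not count as a crossing."""
    return sum(1 for prev, cur in zip(frame, frame[1:]) if prev * cur < 0)

# A rapidly alternating frame has many crossings; a constant one has none.
alternating = [1.0, -1.0, 1.0, -1.0]
constant = [2.0, 2.0, 2.0, 2.0]
```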
Therefore, there is a need for a speech recognition system which can naturally and theoretically produce an equal-sized sequence of feature vectors that represents the nonlinear, time-varying waveform of a word well, so that each feature vector in the time sequence is the same for the same words and different for different words; which provides a faster processing time; which does not have any arbitrary, artificial or experimental thresholds or parameters; and which is able to identify words in both fast and slow utterances in order to extend its recognition applicability. Most importantly, the speech recognition system must identify a word or a sentence very accurately in all languages.
Up to now, no speech recognition system has been able to input a large number of words by speech recognition, because existing speech recognition systems are not good enough at identifying a word or a sentence.