1. Field of the Invention
The present invention relates to a speech recognition apparatus using syntactic and semantic analysis.
2. Description of the Related Art
Recently, there have been attempts in which a person gives instructions to machines directly by means of speech, and various techniques for speech recognition have been developed.
At present, however, a speech dialogue system between humans and machines has not yet been realized. This is due, in part, to the many variations found in speech. Natural spoken language (spontaneous speech) is ambiguous compared to written language. In spontaneous speech, not only grammar but also the units and boundaries of sentences vary widely. In addition, spontaneous speech includes unintentional utterances such as meaningless words, muttering, noise, etc.
Conventional speech recognition apparatuses cannot cope with such variations of input speech. In these apparatuses, a person must input predetermined words in a limited vocabulary according to a predetermined sequence or grammar, and input containing meaningless words such as "er" or "uh", or continuously spoken sentences, cannot be recognized.
In the conventional speech recognition apparatus, an input speech is first detected, and the detected speech period is regarded as a continuous series of words. Thus, the detected speech period is analyzed and evaluated as a complete sentence. Specifically, information such as a variation in speech energy is utilized for detecting a starting point, or starting and end points, of the input speech period, thereby detecting an input speech period as a complete sentence. Subsequently, an input vector extracted from speech feature parameters of the input speech is matched with reference vectors of words or phonemes. The extracted candidates of words and phonemes are subjected to syntactic and semantic analysis using syntactic and semantic information. In the above processing, candidates of input speech must be grammatically well-formed utterances, and words and phonemes included in the speech are regarded as a temporally continuous meaningful series and evaluated as a sentence.
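The energy-based detection step described above can be illustrated with a minimal sketch. The function names, frame length, and threshold below are hypothetical and chosen only for illustration; they do not represent any particular conventional apparatus.

```python
def frame_energies(samples, frame_len):
    """Return the mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_speech_period(samples, frame_len=4, threshold=0.01):
    """Return (first_frame, last_frame) of frames above the threshold, or None.

    Assumption: the speech period is simply the span from the first to the
    last high-energy frame, decided by acoustic energy alone.
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1]


# Toy signal: silence, a burst, a pause, another burst, silence.
silence = [0.0] * 8
burst = [0.5, -0.5] * 4
signal = silence + burst + silence + burst + silence

period = detect_speech_period(signal)
print(period)
```

In this toy signal the pause between the two bursts falls inside the detected period, so any later stage that treats the period as one complete sentence must also account for that pause.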
In the above method, however, when there are variations of the input speech such as noise, meaningless words (e.g. "er", "uh"), a silent period, muttering, non-verbal sounds, ellipsis, and out-of-vocabulary words, the syntactic and semantic analysis of the entire speech fails.
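The failure mode can be shown with a deliberately small sketch of whole-utterance analysis. The grammar, lexicon, and function name are hypothetical; the only point illustrated is that a single filler or out-of-vocabulary word causes the analysis of the entire utterance to fail.

```python
# Hypothetical word categories and the only sentence patterns accepted.
LEXICON = {
    "one": "NUMBER",
    "two": "NUMBER",
    "hamburger": "ITEM",
    "coffee": "ITEM",
    "please": "POLITE",
}
GRAMMAR = [
    ["NUMBER", "ITEM"],
    ["NUMBER", "ITEM", "POLITE"],
]


def parse_whole_utterance(words):
    """Accept the utterance only if every word, in order, fits one pattern."""
    categories = [LEXICON.get(word) for word in words]  # None if unknown
    return categories in GRAMMAR


print(parse_whole_utterance(["one", "hamburger", "please"]))
print(parse_whole_utterance(["er", "one", "hamburger", "please"]))
```

The second utterance differs from the first only by a leading "er", yet it is rejected as a whole because no pattern covers the unknown element.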
To solve this problem, a method has been proposed in which a meaning, such as a category, is given to noise and silent periods, and the noise and silent periods are analyzed under the same conditions as other meaningful elements. However, because the position (time) at which these elements appear is uncertain, the amount of calculation increases considerably and the scope of processing is limited.
There is another problem in the conventional method, in which the starting and end points of an input speech may be determined only by acoustic features, irrespective of syntactic and semantic processing. In this case, syntactic and semantic processing may fail because of noise, meaningless words (e.g. "er", "uh"), a silent period, muttering, non-verbal sounds, ellipsis, and out-of-vocabulary words that occur within the input speech period.
For the above processing, an analysis method has been proposed in which a range of allowance is provided in a speech period, i.e. the end point is given a movable range. In this case, too, the starting point of the input speech is treated as being fixed beforehand. Thus, the problem in this case is the same as that in the case where both the starting and end points are fixed.
Further, a spotting method is known as a method of providing a range of allowance to the starting and end points for matching with reference vectors. In this case, the starting and end points of the unit of matching, e.g. a word or a phoneme, are provided with the range of allowance, and the starting and end points of an input speech are determined on the basis of the likelihood between the input vector and the reference vector. It is nevertheless necessary to perform linguistic processing, i.e. to treat grammatically well-formed sentences, muttering, and ellipsis in the input speech as matching units (i.e. words, etc.), regard all of the word series and phoneme series in the speech as meaningful unit series, and analyze and evaluate the meaningful unit series as complete well-formed sentences. Thus, the scope of processing is limited.
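Matching with free starting and end points can be sketched as follows. This is one possible illustration using subsequence dynamic time warping over one-dimensional features with an absolute-difference distance; the function name, the feature representation, and the use of a distance in place of a likelihood are assumptions made for illustration and are not taken from any specific spotting method.

```python
def spot(reference, observed):
    """Find the segment of `observed` that best matches `reference`.

    Returns (cost, start, end), where start and end are inclusive indices
    into `observed`. Neither endpoint is fixed in advance.
    """
    n, m = len(reference), len(observed)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]

    # First reference frame: a match may begin at any observed frame.
    for j in range(m):
        cost[0][j] = abs(reference[0] - observed[j])
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            # Predecessors: vertical, diagonal, and horizontal steps.
            candidates = [(cost[i - 1][j], start[i - 1][j])]
            if j > 0:
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))
                candidates.append((cost[i][j - 1], start[i][j - 1]))
            best_cost, best_start = min(candidates)
            cost[i][j] = abs(reference[i] - observed[j]) + best_cost
            start[i][j] = best_start

    # Last reference frame: a match may end at any observed frame.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], start[n - 1][end], end


reference_word = [1.0, 3.0, 1.0]
input_features = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0]
print(spot(reference_word, input_features))
```

The sketch locates the reference pattern inside a longer input without a prior endpoint decision, but it still yields only matching units; evaluating the resulting series of units as a sentence remains a separate linguistic step.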
As stated above, at present, there is no robust method of understanding spontaneous speech that uses syntactic and semantic analysis of input speech. For this reason, in the conventional speech recognition apparatus, the speech interface must require speech to be input in units of syntactically fixed sentences. The conventional speech recognition apparatus cannot recognize spontaneous speech in natural human-machine dialogues, which includes noise, meaningless words (e.g. "er", "uh"), silent periods, muttering, ellipsis, and out-of-vocabulary words.
In the conventional speech recognition apparatus, spontaneous speech cannot be processed as speech input. Only predetermined words in a limited vocabulary, spoken according to a predetermined sequence or grammar, can be recognized, and the input of meaningless words such as "er" or "uh", or of continuously spoken sentences, causes recognition errors.