The present invention relates to speech recognition and, more particularly, to a speech recognition apparatus and method for automatically recognizing continuous input speech in a naturally spoken language.
An apparatus for automatically recognizing continuous input speech is used as a man-machine interface that allows an operator to input speech data directly into a machine, and such an apparatus is becoming increasingly important. Various methods have been proposed for analyzing the sound pattern of a continuous utterance, extracting its characteristics, and recognizing them. These methods have a common feature in that speech recognition is effected using a sequence of acoustically invariant units (i.e., phonemes) of input speech as minimum processing units.
A conventional method is known wherein input speech is divided into a sequence of phonemic segments (phonemic segmentation), and each segment is classified (labeled). In phonemic segmentation, the segment boundary between each pair of neighboring phonemes included in a continuous speech sound is detected by analyzing the acoustic power and/or the spectral decomposition of an input sound pattern. More specifically, a portion of speech in which the change in acoustic power or spectral decomposition over time is notable is determined to be a segment boundary. Labeling (i.e., comparing individual segments with reference phonemic labels to obtain a pattern matching result) is then performed. With this method, however, it is difficult to detect segment boundaries accurately and therefore difficult to perform phonemic segmentation effectively. This is because a change in acoustic power or spectral decomposition over time is easily influenced by the speech speed and intonation of individual operators.
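The power-based boundary detection described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art apparatus: the function names, the non-overlapping frame length, and the fixed decibel threshold are all assumptions chosen for illustration, and only acoustic power (not spectral decomposition) is considered.

```python
import math


def frame_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (hypothetical helper)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def power_boundaries(samples, frame_len=4, threshold_db=6.0):
    """Return indices of frames whose log power differs from the previous
    frame by more than threshold_db. The threshold is an assumed value."""
    eps = 1e-12  # avoids log10(0) on silent frames
    log_p = [10.0 * math.log10(p + eps) for p in frame_power(samples, frame_len)]
    return [
        i for i in range(1, len(log_p))
        if abs(log_p[i] - log_p[i - 1]) > threshold_db
    ]


# Synthetic signal: quiet, loud, quiet (8 samples each).
signal = [0.01] * 8 + [1.0] * 8 + [0.01] * 8
boundaries = power_boundaries(signal, frame_len=4, threshold_db=6.0)
```

Because the decision rests on a fixed threshold, the same utterance spoken more softly, more quickly, or with a different intonation may yield a different set of boundaries, which is the difficulty noted above.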
Another conventional method for automatic speech recognition has been proposed wherein the sound pattern of continuous input speech is divided into a plurality of frames at constant time intervals. The similarity to each phoneme is calculated for each individual frame, and labeling is performed based on the similarity data. In this method, it is very complicated to edit the phonemic labels, which are sequentially produced by calculating the similarity data between the divided pattern and a reference label pattern. In addition, it is difficult to develop an effective post-processing method for obtaining a recognition result based on the labeling of each frame under various conditions. Therefore, in automatic speech recognition according to this conventional method, various ad hoc processing rules are needed, depending on the situation, to overcome the above drawbacks. As a result, although the recognition processing procedure becomes complicated, improvement in recognition efficiency cannot be expected, and the reliability of the recognition result is thus degraded.
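The frame-by-frame labeling described above can be sketched as follows. This is an illustrative assumption rather than the method of any specific prior-art system: cosine similarity stands in for whatever similarity measure is used, the two-dimensional feature vectors and the reference labels "a" and "i" are invented for the example, and all function names are hypothetical.

```python
import math
from itertools import groupby


def cosine(a, b):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_frames(frames, references):
    """Assign to each frame the reference label with the highest similarity."""
    return [
        max(references, key=lambda name: cosine(frame, references[name]))
        for frame in frames
    ]


def collapse(labels):
    """Merge runs of identical labels into (label, run_length) pairs."""
    return [(label, len(list(run))) for label, run in groupby(labels)]


# Invented reference patterns and frame features, for illustration only.
references = {"a": [1.0, 0.0], "i": [0.0, 1.0]}
frames = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.7, 0.3], [0.1, 0.9]]

labels = label_frames(frames, references)
runs = collapse(labels)
```

In this example the third frame is labeled "i" between frames labeled "a", so the collapsed sequence contains single-frame runs. Deciding whether such a run is a genuine phoneme or a labeling error is the kind of editing that requires the ad hoc rules mentioned above.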