This invention relates to an apparatus and method for automatic speech recognition.
Automatic speech recognition systems provide means for human beings to interface with communication equipment, computers and other machines in a mode of communication which is most natural and convenient to humans. Where required, this will enable operators of telephones, computers, etc. to call others, enter data, request information and control systems when their hands and eyes are busy, when they are in the dark, or when they are unable to be stationary at a terminal. Also, machines using normal voice input require much less user training than do systems relying on complex keyboards, switches, push buttons and other similar devices.
One known approach to automatic speech recognition of isolated words involves the following: periodically sampling a bandpass filtered (BPF) audio speech input signal; monitoring the sampled signals for power level to determine the beginning and the termination (endpoints) of the isolated words; creating from the sampled signals frames of data and then processing the data to convert them to processed frames of parametric values which are more suitable for speech processing; storing a plurality of templates (each template being a plurality of previously created processed frames of parametric values representing a word, the templates together forming the reference vocabulary of the automatic speech recognizer); and comparing the processed frames of speech with the templates in accordance with a predetermined algorithm, such as the dynamic programming algorithm (DPA) described in an article by F. Itakura, entitled "Minimum prediction residual principle applied to speech recognition", IEEE Trans. Acoustics, Speech and Signal Processing, Vol. ASSP-23, pp. 67-72, February 1975, to find the best time alignment path or match between a given template and the spoken word.
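The template comparison step of such a recognizer can be illustrated with a minimal sketch. It assumes plain dynamic time warping with a Euclidean frame distance, which is a simplification and not the prediction-residual distance or the path constraints of the cited Itakura article. The names frame_distance, dtw_cost and recognize, the two-word vocabulary and the two-element parameter vectors are all hypothetical and chosen only for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two parameter vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(utterance, template):
    """Accumulated distance along the best time alignment of two frame sequences."""
    n, m = len(utterance), len(template)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i utterance
    # frames with the first j template frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(utterance[i - 1], template[j - 1])
            # Allow a frame to be stretched, compressed, or matched one-to-one.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the vocabulary word whose template aligns at the lowest cost."""
    return min(templates, key=lambda word: dtw_cost(utterance, templates[word]))


# Hypothetical reference vocabulary: each template is a list of processed frames.
templates = {
    "yes": [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]],
    "no": [[0.0, 3.0], [0.0, 2.0], [1.0, 0.0]],
}

# A slower rendition of "yes": the first and last frames are repeated.
spoken = [[1.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [3.0, 1.0]]
print(recognize(spoken, templates))  # prints "yes"
```

The time warping absorbs the difference in speaking rate, so the stretched utterance still aligns with its template at zero cost. The sketch presumes that the endpoints of the spoken word have already been located correctly.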
Isolated word recognizers such as those outlined above require the user to artificially pause between every input word or phrase. However, problems are still encountered in properly identifying the word boundaries, especially since extraneous non-speech sounds originating in the environment of the recognition apparatus, or even with the user of the latter (such as lip smacks, tongue clicks or the like), may have power levels sufficient to indicate to the processing equipment the occurrence of a sound to be processed. The processing may then begin with such extraneous sounds and bring about non-recognition of the subsequent speech sounds.
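The weakness of endpoint detection based on power level alone can be illustrated with a minimal sketch. It assumes non-overlapping frames of ten samples, a fixed power threshold and synthetic sample values; the names frame_power and power_endpoints are hypothetical and do not describe any particular recognizer.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each consecutive, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def power_endpoints(samples, frame_len, threshold):
    """Return (first, last) indices of frames above the threshold, or None."""
    powers = frame_power(samples, frame_len)
    loud = [i for i, p in enumerate(powers) if p > threshold]
    return (loud[0], loud[-1]) if loud else None


# Synthetic input (assumed values): silence, a short click, silence, then a word.
silence = [0.0] * 40                        # four silent frames
click = [0.9, -0.9, 0.8, -0.8] + [0.0] * 6  # one frame holding a brief burst
word = [0.5, -0.5] * 15                     # three frames of speech-like signal

clean = silence + word + silence
noisy = silence + click + silence + word + silence

print(power_endpoints(clean, frame_len=10, threshold=0.05))  # prints (4, 6)
print(power_endpoints(noisy, frame_len=10, threshold=0.05))  # prints (4, 11)
```

In the second case the word actually occupies frames 9 to 11, yet the detector places the beginning at the click in frame 4, so the frames handed to the comparison step start with a non-speech sound followed by silence.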
It is, therefore, desirable to combine the relative ease of implementation of an isolated word recognizer with the advantages of continuous speech recognition in a single, inexpensive and less complex automatic speech recognition machine.
Accordingly, it is an object of the present invention to provide a method and an arrangement for recognizing the endpoints of words and other utterances based on criteria other than merely the power level of the received sounds.