1. Field of the Invention
The present invention relates to speech recognition systems and, more particularly, to such systems employing a frequency domain filter.
2. Description of Related Art
The recognition of speech is a subset of the general problem of signal processing, in which a pervasive problem is the reduction of noise elements. Although noise cannot be eliminated entirely, it is usually considered sufficient to reduce noise levels to a point at which the embedded signal is discernable to an acceptable probability.
Prior to advances in computing power, speech recognition had been aided by physical filters comprising electrical/electronic circuit elements. Concomitant with developments in CPU power and memory size, software-based speech recognition models have been created. A continuing difficulty, however, has been the creation of such models that can operate in or close to real time and preserve recognition accuracy.
At present the accuracy of commercially available speech-to-text systems is not considered satisfactory by many, even after having been trained by a sole user and when used in substantially noise-free environments. Therefore, it is evident that those operating in high-noise environments in which speech recognition accuracy is of vital importance face a particularly onerous communications challenge. Such environments may include, for example, aircraft cockpits, naval vessels, high-noise manufacturing and construction sites, and military operations sites, to name but a few. Decisions made in these environments can literally be in the "life or death" category, and thus recognition accuracy is paramount.
As is discussed in a PhD thesis of M. K. Ravishankar (Carnegie Mellon University, 1996), the disclosure of which is incorporated herein by reference, one of the tools of speech recognition technology comprises the "hidden Markov model" (HMM). The HMM is used in Carnegie Mellon's Sphinx-II system, a statistical modeling package.
The commonly accepted unit of speech is the phoneme, of which there are approximately 50 in spoken English. However, as phonemes do not exist in isolation in actual speech, this characterization has been refined to take into account the influence of preceding and succeeding phonemes, which cubes the recognition problem to that of determining one in 50^3 triphones. Each of these is modeled by a 5-state HMM in the Sphinx-II system, yielding a total of approximately 375,000 states.
In addition to recognizing a sequence of phonemes, which can be approached as a statistical problem, an interpretation of that sequence must also be made. This interpretation comprises searching for the most likely sequence of words given the input speech. One of the methods known in the art (Ravishankar, 1996) is Viterbi decoding using a beam search, a dynamic programming algorithm that searches the state space for the most likely state sequence that accounts for the input speech. The state space is constructed by creating word HMM models from the constituent phoneme or triphone HMM models, and the beam search is applied to limit the resulting large state space by eliminating less likely states. The Viterbi method is a time-synchronous search that processes the input speech one frame at a time and at a particular rate, typically 100 frames/sec.
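The time-synchronous search described above can be illustrated with a minimal sketch. It is not the Sphinx-II implementation. The two-state model, the frame labels "a", "b" and "c", and the names `viterbi_beam`, `prune` and `logs` are all hypothetical, chosen only to show how a beam discards unlikely states after each frame.

```python
import math

def prune(active, beam):
    """Keep only states whose log score is within `beam` of the best one."""
    best = max(score for score, _ in active.values())
    return {s: v for s, v in active.items() if v[0] >= best - beam}

def viterbi_beam(frames, log_start, log_trans, log_emit, beam):
    """Return (best state path, its log probability) for a frame sequence."""
    # Initialise with the first frame: state -> (log score, path so far).
    active = {s: (log_start[s] + log_emit[s][frames[0]], [s]) for s in log_start}
    active = prune(active, beam)
    # Time-synchronous loop: one pass over the surviving states per frame.
    for frame in frames[1:]:
        candidates = {}
        for prev, (score, path) in active.items():
            for state, log_p in log_trans[prev].items():
                new_score = score + log_p + log_emit[state][frame]
                # Viterbi step: keep only the best-scoring way into each state.
                if state not in candidates or new_score > candidates[state][0]:
                    candidates[state] = (new_score, path + [state])
        # Beam step: drop states that fall too far behind the current best.
        active = prune(candidates, beam)
    best_state = max(active, key=lambda s: active[s][0])
    score, path = active[best_state]
    return path, score

def logs(table):
    """Convert a (possibly nested) dict of probabilities to log probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

# Hypothetical toy model (assumption): two states, three frame labels.
LOG_START = logs({"s1": 0.6, "s2": 0.4})
LOG_TRANS = logs({"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}})
LOG_EMIT = logs({"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
                 "s2": {"a": 0.1, "b": 0.3, "c": 0.6}})

path, score = viterbi_beam(["a", "b", "c"], LOG_START, LOG_TRANS, LOG_EMIT, beam=5.0)
print(path, math.exp(score))
```

For this toy model the most likely path is s1, s1, s2, with probability of about 0.01512. A narrower beam prunes more states per frame and so reduces work, at the risk of discarding a state that would later have turned out to lie on the best path.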
The models that have been presented thus far, however, still yield computationally unwieldy techniques that cannot operate accurately in or close to real time in noisy environments.
It is therefore an object of the present invention to provide an improved speech recognition system that adaptively filters out unwanted noise.
It is an additional object to provide such a system that outputs a textual interpretation of the filtered audio signal.
It is a further object to provide a method for recognizing speech in a noisy environment.
It is another object to provide such a method of building a set of software-based model filters for use in speech recognition.
An additional object is to provide a system and method for generating frequency-domain filters for use in signal processing applications.
A further object is to provide a text representation of a stream of sound containing speech and noise.
These objects and others are attained by the present invention, an improved speech recognition system and associated methods. One aspect of the invention is a method and system for converting a sound signal containing a speech component and a noise component into recognizable language. The method comprises, as a first step, transforming the sound signal from a time domain into a frequency domain. Next, the transformed signal is compared with a set of models of all possible sound signals to find a closest-matching known sound signal.
A filter is then applied to the transformed signal. Here the filter corresponds to the model of the closest-matching known sound signal. Next a determination is made of an identity of the speech by searching a set of control data models to match a data model with the filtered transformed signal. Finally, a text stream representative of the determination is output, which enables a user not only to hear what may be a noisy voice message, but also to read the filtered message in some form, such as printed text or on a display screen.
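The first three steps of the method (transform, model matching, and filtering) can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the naive discrete Fourier transform, the squared-distance match on magnitude spectra, the per-bin gain masks, and the model names `hum_bin2` and `hum_bin3` are all hypothetical. The later steps, which determine the identity of the speech and output the text stream, are omitted.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: time domain -> frequency domain."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def closest_model(spectrum, models):
    """Return the name of the model whose magnitude spectrum is nearest.

    Assumption: "closest" means smallest squared distance between magnitudes.
    """
    magnitudes = [abs(c) for c in spectrum]

    def distance(name):
        reference = models[name]["magnitude"]
        return sum((m - r) ** 2 for m, r in zip(magnitudes, reference))

    return min(models, key=distance)

def apply_filter(spectrum, gains):
    """Scale each frequency bin by the gain stored with the matched model."""
    return [c * g for c, g in zip(spectrum, gains)]

# Hypothetical models (assumption): each pairs an expected magnitude
# spectrum with a gain mask that zeroes the bins carrying the noise.
MODELS = {
    "hum_bin2": {"magnitude": [0, 4, 2, 0, 0, 0, 2, 4],
                 "gains": [1, 1, 0, 1, 1, 1, 0, 1]},
    "hum_bin3": {"magnitude": [0, 4, 0, 2, 0, 2, 0, 4],
                 "gains": [1, 1, 1, 0, 1, 0, 1, 1]},
}

# Toy input: a tone in bin 1 stands in for speech, a weaker tone in bin 3 for noise.
signal = [math.cos(2 * math.pi * t / 8) + 0.5 * math.cos(2 * math.pi * 3 * t / 8)
          for t in range(8)]

spectrum = dft(signal)
match = closest_model(spectrum, MODELS)
filtered = apply_filter(spectrum, MODELS[match]["gains"])
print(match, [round(abs(c), 6) for c in filtered])
```

In this toy case the input matches `hum_bin3`, so the filter removes bins 3 and 5 and leaves the bin 1 component unchanged. A practical system would use a fast Fourier transform and models derived from recorded data rather than the hand-written tables assumed here.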
The features that characterize the invention, both as to organization and method of operation, together with further objects and advantages thereof, will be better understood from the following description used in conjunction with the accompanying drawing. It is to be expressly understood that the drawing is for the purpose of illustration and description and is not intended as a definition of the limits of the invention.