A. Field of the Invention
The present invention relates generally to speech processing and, more particularly, to audio classification.
B. Description of Related Art
Speech has not traditionally been valued as an archival information source. As effective as the spoken word is for communicating, archiving spoken segments in a useful and easily retrievable manner has long been a difficult proposition. Although recording audio is straightforward, automatically transcribing and indexing speech in an intelligent and useful manner remains considerably harder.
Speech is typically received by a speech recognition system as a continuous stream of words without breaks. In order to effectively use the speech in information management systems (e.g., information retrieval, natural language processing and real-time alerting systems), the speech recognition system initially processes the speech to generate a formatted version of the speech. The speech may be transcribed and linguistic information, such as sentence structures, may be associated with the transcription. Additionally, information relating to the speakers may be associated with the transcription.
Speech recognition systems, when transcribing audio that contains speech, may classify the audio into a number of different audio classifications, such as speech/non-speech and vowel/consonant portions of the audio. The speech recognition system may use the audio classifications when processing the speech signals. Non-speech regions, for example, are not transcribed. Also, whether a vowel or consonant is being spoken may dictate which acoustic model to use in analyzing the audio.
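By way of illustration only, the use of audio classifications described above may be sketched as follows. The sketch assumes that the audio has already been divided into labeled segments; the `Segment` type, the label strings, and the acoustic model names are hypothetical and are not taken from any particular speech recognition system.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start time, in seconds
    end: float    # segment end time, in seconds
    label: str    # assumed labels: "non-speech", "vowel", or "consonant"


# Hypothetical acoustic model names, one per speech classification.
ACOUSTIC_MODELS = {"vowel": "vowel_model", "consonant": "consonant_model"}


def route_segments(segments):
    """Pair each speech segment with the acoustic model for its class.

    Non-speech segments are dropped, since they are not transcribed.
    """
    routed = []
    for seg in segments:
        if seg.label == "non-speech":
            continue  # skip regions that contain no speech
        routed.append((seg, ACOUSTIC_MODELS[seg.label]))
    return routed


segments = [
    Segment(0.0, 1.0, "non-speech"),
    Segment(1.0, 1.2, "consonant"),
    Segment(1.2, 1.5, "vowel"),
]
routed = route_segments(segments)
```

In this sketch the classification is consulted before any transcription work is done, so non-speech audio incurs no decoding cost.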
Conventional speech recognition systems may decode an incoming audio stream into a series of phonemes, where a phoneme is the smallest acoustic event that distinguishes one word from another. The phonemes may then be used to classify the audio signal. The number of phonemes used to represent a particular language may vary depending on the particular phoneme model that is employed. For English, a complete phoneme set may include approximately 50 different phonemes.
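As a minimal sketch of how a decoded phoneme sequence may be used to classify the audio, each phoneme may be mapped to a broad class. The sketch assumes ARPAbet-style phoneme symbols and a `SIL` marker for silence; the symbol inventory, the function names, and the class labels are illustrative assumptions rather than a description of any particular conventional system.

```python
# Assumed ARPAbet-style vowel symbols; any other speech symbol is
# treated as a consonant in this sketch.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}
# Assumed marker emitted by the decoder for silence.
NON_SPEECH = {"SIL"}


def classify_phoneme(phoneme):
    """Map a single decoded phoneme to a broad audio classification."""
    if phoneme in NON_SPEECH:
        return "non-speech"
    if phoneme in VOWELS:
        return "vowel"
    return "consonant"


def classify_sequence(phonemes):
    """Classify every phoneme in a decoded sequence."""
    return [classify_phoneme(p) for p in phonemes]


# Decoded phonemes for the word "cat" with surrounding silence.
labels = classify_sequence(["SIL", "K", "AE", "T", "SIL"])
```

The mapping itself is inexpensive; in an approach of this kind, the cost lies in first decoding the audio against the full phoneme inventory.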
One problem associated with decoding an incoming audio signal against a complete phoneme set is that the decoding may be computationally burdensome, particularly when attempting to process speech in real time.
Thus, there is a need in the art to more efficiently classify segments of an audio signal.