Speech recognition research generally aims to build systems that automatically acquire the structure and meaning of spoken language. To this day, however, all common automatic speech recognition (ASR) frameworks are designed to detect predefined words using a predefined grammar. Such systems cannot learn at all: the underlying models are trained in advance using annotated speech databases and remain fixed during recognition. Although it is clear that human-like speech processing involves learning even during recognition, little effort has been made to develop online-learning systems. So far, none of the conventional approaches has provided a computational model of speech acquisition.
To enable systems to recognize the meaning of speech in a completely unsupervised manner, it is first necessary to acquire the acoustic structure of the language. This is because a distinct acoustic event must be segmented from the speech before meaning can be assigned to it. Therefore, learning of speech segmentation must, at least to some extent, precede learning of speech semantics. The best segmentation results are obtained when the model used for segmentation captures as much as possible of the acoustic structure of the language to be segmented. This is the case when each basic unit of speech is modeled by a distinct model. These basic speech units (SU) may be defined in different ways based on linguistic knowledge. The choice of basic speech units is a compromise between keeping the number of speech unit (SU) models to be estimated low and capturing the acoustic structure of speech more completely.
Methods used in known speech recognition systems to generate an acoustic model (AM) will be described first.
Supervised Acoustic Model Acquisition
In speech processing, Acoustic Model Acquisition (AMA) refers to the process of using annotated speech utterances to estimate the parameters of models for basic speech units (SU), such as phonemes or syllables. Conventionally, there is no method for learning the speech unit (SU) models in an unsupervised manner. To distinguish the supervised approaches from the method for unsupervised acoustic model acquisition, the former are referred to herein as Supervised AMA and the latter as Unsupervised AMA.
Model training methods depend strongly on the mathematical structure used to model the distinct speech units (SU). Hidden Markov Models (HMM) are usually used as speech unit (SU) models, although there have been a few attempts to replace the HMM. The reason is that, given an annotated speech database, a number of HMM-centric methods exist that may be applied to train the different speech unit (SU) models. Because annotations are available only for pre-recorded speech utterances, model training must be carried out off-line before the models can be used for online speech recognition tasks. Additionally, common HMM training methods require a large amount of data to estimate the parameters of the speech unit (SU) models and hence are not suitable for online learning.
During recognition, the estimated speech unit (SU) models are concatenated into word models using a predefined word dictionary. The generated word models are subsequently embedded into a large recognition model according to a grammar that defines the possible word transitions. Using incoming speech data, the best path through the recognition model is determined, which directly yields a sequence of detected words, i.e., the most likely sequence of speech units (SU).
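A best-path search of this kind is commonly carried out with the Viterbi algorithm. The following is a minimal sketch under stated assumptions, not the decoder of any particular recognizer: the two states `su_a` and `su_b`, the discrete observation symbols `lo` and `hi`, all probabilities, and the function name `viterbi` are hypothetical and chosen only for illustration.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path of a discrete-observation HMM."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path.
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Start from the best final state and follow the back-pointers.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def to_log(table):
    """Convert a (possibly nested) table of probabilities to log-probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

# Hypothetical toy model: two speech-unit states, two observation symbols.
states = ["su_a", "su_b"]
log_start = to_log({"su_a": 0.6, "su_b": 0.4})
log_trans = to_log({"su_a": {"su_a": 0.7, "su_b": 0.3},
                    "su_b": {"su_a": 0.4, "su_b": 0.6}})
log_emit = to_log({"su_a": {"lo": 0.9, "hi": 0.1},
                   "su_b": {"lo": 0.2, "hi": 0.8}})

path = viterbi(["lo", "lo", "hi", "hi"], states, log_start, log_trans, log_emit)
print(path)  # ['su_a', 'su_a', 'su_b', 'su_b']
```

Log-probabilities are used so that products of many small probabilities do not underflow on long utterances. In a real recognizer the state space would be the full recognition model built from the word dictionary and grammar rather than two hand-written states.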
Apart from the restriction to off-line training, another problem with supervised AMA is that even an extremely large (but always finite) amount of annotated training data cannot cover every possible utterance. Therefore, given an annotated training database for supervised syllable-based AMA, it is always possible to imagine syllables that are not modeled because suitable training data is lacking.
Supervised Speech Segmentation
Apart from artificial neural networks, which are commonly trained to detect segment onsets and which rely on segment-annotated or at least onset-annotated speech databases, the major focus of supervised speech segmentation research is on HMM-related techniques for segment spotting. For that purpose, HMM-based keyword spotting was proposed for speech recognition. Single speech unit (SU) models (also referred to as keyword models) are commonly embedded into an HMM with a dedicated filler model inserted at each transition between models. The filler model, or garbage model, is designed to model all parts of the processed utterances that are not described by a speech unit (SU) model. Such systems give a high quality of segmentation. However, the single speech unit (SU) models must be trained in advance using annotated speech data.
To enable systems trained using such a method to cope with unrestricted spontaneous speech utterances, the supervised AMA described above was applied to larger speech databases for training. The basic idea of such attempts is to avoid the theoretically and practically difficult concept of the filler model or garbage model. But no matter how much data is used for training, not all speech units (SU) that occur later during unsupervised recognition or segmentation can be handled by such an approach.
In general, the choice of an appropriate filler model is a major drawback of HMM-based segment spotting. Recently, a very small number of works have been presented that do not rely on filler models for segment spotting. Such methods, however, still require annotated speech for speech unit (SU) model training.
Unsupervised Speech Segmentation
Although model-based speech segmentation is generally known to be more powerful, approaches that are not model-based have the benefit of working from scratch without any preparatory training. For example, simple energy-based level tracking for segment generation may be implemented using fewer than half a dozen predefined parameters. Most unsupervised speech segmentation methods have in common that the speech signal is mapped to a one-dimensional feature space, in which minima are used to generate segment boundaries depending on a given sensitivity threshold.
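The mapping to a one-dimensional feature and the placement of boundaries at minima can be sketched as follows. This is a minimal illustration under stated assumptions, not a specific published method: short-time energy is assumed as the one-dimensional feature, and the function names and the parameters `frame_len` and `threshold` are hypothetical.

```python
def short_time_energy(samples, frame_len):
    """Map a sample sequence to one value per non-overlapping frame: its mean energy."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def boundaries_at_minima(energy, threshold):
    """Return the frame indices that are local minima of the energy below threshold."""
    return [
        t for t in range(1, len(energy) - 1)
        if energy[t] < energy[t - 1]       # falls relative to the previous frame
        and energy[t] <= energy[t + 1]     # does not fall further in the next frame
        and energy[t] < threshold          # sensitivity threshold (assumed parameter)
    ]

# Hypothetical toy signal: a loud stretch, a short quiet gap, another loud stretch.
samples = [1.0, -1.0] * 4 + [0.1, -0.1] * 2 + [1.0, -1.0] * 4
energy = short_time_energy(samples, frame_len=4)
boundaries = boundaries_at_minima(energy, threshold=0.1)
print(boundaries)  # [2]
```

The whole procedure is controlled by two parameters here, which reflects the point made above: such approaches need no training, but they also encode very little of the structure of speech.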
The reason that approaches that are not model-based are less powerful than the model-based (and so far supervised) approaches is that the structure of speech is encoded using only a small set of parameters. More precisely, approaches that are not model-based do not include dedicated models for each basic unit of speech, as model-based speech processing does. Although this is sufficient for some applications such as single-word voice control interfaces, the segmentation performance falls far short of what is required for segmenting spontaneous speech in a human-like manner.
A few basic attempts were made using recursive neural network models. After training, these models were able to generalize from utterance boundaries to the word boundaries inside the utterances. Although hit:false-alarm ratios of 2.3:1 were achieved with such neural-network-based methods, it is clear that they are not suitable for realistic speech segmentation tasks because of their limited memory capacity. Additionally, all reported ANN-based approaches use manually annotated symbolic feature corpora. When restricted to a single speech model, such methods fail to generate segmentation results comparable to those of model-based segmentation.
In Hema A. Murthy, T. Nagarajan, and N. Hemalatha, “Automatic segmentation and labeling of continuous speech without bootstrapping,” EUSIPCO, Poster-presentation, 2004, and G. L. Sarada, N. Hemalatha, T. Nagarajan, and Hema A. Murthy, “Automatic transcription of continuous speech using unsupervised and incremental training,” Poster-Presentation, InterSpeech 2004, the speech signal to be transcribed is segmented into syllable-like units by an unsupervised method. Similar segments are then grouped together using an unsupervised clustering technique. For each cluster, a dedicated model is derived.
Although this method is appealing, it assumes that the set of segments contains all syllables of the language to be modeled. Moreover, this approach does not allow online training. Additionally, it relies on the unsupervised segmentation method to find all the syllables.
In Giampiero Salvi, “Ecological language acquisition via incremental model-based clustering,” 2005, unsupervised incremental clustering of feature frames is applied to child-directed speech to estimate a set of Gaussian distributions. These are assumed to model the underlying phonetic classes contained in the training utterances. Although the approach is based on phonetic models, it is perceptually motivated because no annotation was used. Although such an unsupervised approach seems appealing, the resulting phonetic models may delineate speech on a level that is too basic to be a realistic solution for speech processing. Indeed, no property of speech has been exploited, and the investigated problem may be reduced to online clustering of vectors.
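To make concrete what online clustering of vectors means in its simplest form, the following sketch assigns each incoming feature frame to the nearest cluster mean, or opens a new cluster when no mean is close enough, and keeps running estimates of a diagonal Gaussian per cluster. It is not the method of the cited work; the class name `OnlineGaussianClusters`, the parameter `new_cluster_distance`, and the toy frames are assumptions made only for illustration.

```python
class OnlineGaussianClusters:
    """Incrementally fit one diagonal Gaussian per cluster, one frame at a time."""

    def __init__(self, new_cluster_distance):
        # Assumed rule: a frame farther than this from every mean opens a new cluster.
        self.new_cluster_distance = new_cluster_distance
        self.counts = []  # number of frames per cluster
        self.means = []   # running mean vector per cluster
        self.m2 = []      # running sum of squared deviations per dimension

    def update(self, frame):
        """Assign the frame to a cluster, update that cluster, and return its index."""
        best, best_dist = None, None
        for k, mean in enumerate(self.means):
            dist = sum((x - m) ** 2 for x, m in zip(frame, mean)) ** 0.5
            if best_dist is None or dist < best_dist:
                best, best_dist = k, dist
        if best is None or best_dist > self.new_cluster_distance:
            self.counts.append(1)
            self.means.append(list(frame))
            self.m2.append([0.0] * len(frame))
            return len(self.means) - 1
        # Welford's update of mean and squared deviations; no past frames are stored.
        self.counts[best] += 1
        n = self.counts[best]
        for i, x in enumerate(frame):
            delta = x - self.means[best][i]
            self.means[best][i] += delta / n
            self.m2[best][i] += delta * (x - self.means[best][i])
        return best

    def variances(self, k):
        """Per-dimension maximum-likelihood variance of cluster k."""
        return [v / self.counts[k] for v in self.m2[k]]

# Hypothetical two-dimensional feature frames forming two well-separated groups.
frames = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0], [0.1, 0.0]]
clusters = OnlineGaussianClusters(new_cluster_distance=1.0)
labels = [clusters.update(f) for f in frames]
print(labels)  # [0, 0, 1, 1, 0]
```

Nothing in this sketch is specific to speech: the frames are treated as arbitrary vectors, which is exactly the limitation noted above.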