1. Technical Field
The present disclosure relates to speech recognition and more specifically to combining frame and segment level processing, via temporal pooling, for phonetic classification.
2. Introduction
Traditional approaches for automatic speech recognition aim to recognize a sequence of events in an input. By focusing on a sequence of words or phonemes, these approaches greatly limit the range of solutions for automatic speech recognition. For example, traditional approaches can use only the acoustic features, measured at specific time intervals called frames, as input. Moreover, traditional models are typically trained only according to the maximum likelihood criterion.
Current models are based on one of two distinct approaches for automatic speech recognition: frame-based classification and segment-based classification. Frame-based classification models perform a frame-level analysis of the input to determine the structure and characteristics of the input. The classification performance of these models, however, is marked by significant error rates.
On the other hand, segment-based classification models perform a segment-level analysis of the input to determine the structure and characteristics of the input. Segment-based classification models assume that the boundaries of the input are known at test time. The features in the input are extracted at the segment level and processed through a static architecture that has no concept of time. These models typically perform better than frame-based classification models. Nevertheless, segment-based classification models have significant drawbacks. First, segment-based classification models typically require hand tuning of the system to the task, which can be costly and inefficient. Second, such segmental approaches, when adapted to situations in which no segment information is provided beforehand, often result in very high computational costs. These and other problems exist in current speech classification models.
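By way of illustration only, the difference between frame-level and segment-level features can be sketched as follows. The sketch assumes that each frame is a short vector of acoustic features and that segment boundaries are given as frame index pairs. The function name `segment_features`, the use of mean pooling, and the sample values are hypothetical and are chosen only to show how a segment of any duration can be reduced to one fixed-length vector that carries no notion of time.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def segment_features(
    frames: Sequence[Frame], boundaries: Sequence[Tuple[int, int]]
) -> List[List[float]]:
    """Mean-pool frame vectors inside each (start, end) segment, end exclusive.

    Hypothetical helper: mean pooling is an assumption made for illustration,
    not a method stated in the text above.
    """
    pooled = []
    for start, end in boundaries:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"invalid segment boundaries: ({start}, {end})")
        segment = frames[start:end]
        dim = len(segment[0])
        # One fixed-length vector per segment, regardless of segment duration.
        pooled.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
    return pooled


# Frame-level view: five frames, each with two made-up feature values.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

# Segment-level view: boundaries are assumed to be known in advance.
boundaries = [(0, 2), (2, 5)]
segments = segment_features(frames, boundaries)
```

In this sketch a frame-based classifier would receive the five vectors in `frames` one at a time, whereas a segment-based classifier would receive the two pooled vectors in `segments`, which is possible only because `boundaries` is supplied beforehand.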