The present invention relates generally to processing a speech signal and, more particularly, to processing speech signals for speech-to-text (STT) systems.
Recently, a deep neural network (DNN) has come into use instead of a Gaussian mixture model (GMM) as an acoustic model. Along with the use of the DNN, logmel features of the speech signal have come into use as inputs to STT systems instead of Mel Frequency Cepstral Coefficient (MFCC) features.
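For illustration only, the relationship between the two feature types can be sketched as follows: logmel features are the logarithms of mel filterbank energies, and MFCC features are conventionally obtained by applying a discrete cosine transform (DCT) to them. The sketch below is not taken from this disclosure. The function names, the 16 kHz sample rate, the 400-sample frame, the 512-point FFT, the 40 mel bands, the 13 cepstral coefficients, and the random test frame are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # A commonly used Hz-to-mel conversion formula.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def logmel(frame, fb, n_fft):
    """Log of mel filterbank energies for one windowed frame."""
    windowed = frame * np.hanning(frame.size)
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    return np.log(fb @ power + 1e-10)  # small floor avoids log(0)

def mfcc_from_logmel(lm, n_ceps):
    """Unnormalized DCT-II of the logmel vector, keeping n_ceps coefficients."""
    n = lm.size
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n)
    return basis @ lm

# Hypothetical example values (assumptions, not from the source text).
sample_rate, n_fft, n_mels, n_ceps = 16000, 512, 40, 13
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)      # stand-in for 25 ms of speech
fb = mel_filterbank(n_mels, n_fft, sample_rate)
lm = logmel(frame, fb, n_fft)         # logmel features, shape (40,)
cc = mfcc_from_logmel(lm, n_ceps)     # MFCC features, shape (13,)
```

In this sketch the MFCC vector is a truncated linear transform of the logmel vector, so the logmel features retain spectral detail that the truncation discards.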