In computer science, speech recognition is the ability of a computing machine, device, or program to identify words and phrases spoken in a language and convert them to a machine-readable format. Speech recognition software enables users to dictate to their devices. A device running speech recognition software first receives an analog signal and converts it into a digital signal. The software then processes the digital signal to determine the spoken words and phrases. Elementary speech recognition software has a limited vocabulary and may identify words and phrases only if they are spoken very clearly. Moreover, it has been observed that factors such as a low signal-to-noise ratio, overlapping speech, intensive use of computing power, and homonyms reduce the accuracy of speech recognition software.

Available speaker recognition and speech recognition software typically employs Mel-frequency cepstral coefficients (MFCCs) as the feature representation of human speech. MFCCs are usually derived by digitizing the speech, applying a shifting window to obtain short-term frames, computing the Fast Fourier Transform (FFT) spectrum of each frame, calculating the energy output of a bank of filters whose center frequencies are distributed on the Mel-frequency scale, and finally applying a Discrete Cosine Transform (DCT) to produce the MFCCs. There is one vector of MFCCs for each frame. MFCCs can be augmented by their first-order and second-order derivatives (expanded feature vectors) to enhance recognition performance for speaker and speech recognition. Moreover, each MFCC can be mean-removed in order to mitigate, e.g., channel distortion. The above MFCCs and their expansion and/or normalization work best in a quiet environment where training and testing conditions match. In noisy environments, these methodologies give undesired results.
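By way of illustration only, the conventional MFCC derivation described above may be sketched in Python with NumPy as follows. The sketch is not taken from any particular product. All function names are hypothetical, and the parameter values are assumptions chosen for the example: a 16 kHz sample rate, 400-sample frames with a 160-sample shift, a Hamming window, a 512-point FFT, 26 filters, and 13 coefficients. The logarithm applied to the filter-bank energies before the DCT is a customary step that the description above does not spell out, and the derivatives are approximated here by a simple two-point central difference.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_signal(signal, frame_len, hop):
    # Shifting window: overlapping short-term frames, each Hamming-weighted.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose center frequencies are evenly spaced in mel.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_out):
    # Unnormalized DCT-II along the last axis, keeping the first n_out terms.
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    frames = frame_signal(signal, frame_len, hop)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)  # assumed log compression
    return dct_ii(log_energies, n_ceps)      # one MFCC vector per frame

def delta(features):
    # Two-point central difference over time, with edge frames repeated.
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def mean_normalize(features):
    # Mean removal: subtract each coefficient's average over the utterance.
    return features - features.mean(axis=0, keepdims=True)

def expanded_features(signal, sample_rate=16000):
    # Mean-removed MFCCs augmented by first- and second-order derivatives.
    c = mean_normalize(mfcc(signal, sample_rate))
    d1 = delta(c)
    d2 = delta(d1)
    return np.hstack([c, d1, d2])
```

Under these assumed settings, passing a one-second signal of 16,000 samples to `expanded_features` yields 98 frames, each with a 39-dimensional expanded feature vector (13 MFCCs plus their first-order and second-order derivatives).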
Further, it has also been observed that no available software provides the desired result in both quiet and noisy environments. For example, noise-robust speech recognition software generally yields degraded recognition accuracy when operating in quiet conditions, compared with its non-noise-robust counterpart. Conventional speech recognition software is subject to a variety of problems concerning, for example, its capability to mitigate noise interference and to separate inter-speaker variability from channel distortion.
There is a long-felt need for a computer-implemented system and method that enable conventional software to provide the desired speech recognition results in both quiet and noisy environments by identifying significant speech frames within speech signals.