The present disclosure relates to a speech signal processing technology and, more particularly, to a technology for segmenting a speech signal into a plurality of time frames.
Various technologies that use computing devices to process speech signals have been developed. Previous speech signal segmentation technologies for extracting speech signal features have not considered the quasi-regular structure of the speech signal. The segmentation technique most commonly used in state-of-the-art automatic speech recognition (ASR) systems is the fixed frame size and rate (FFSR) technique, which segments the speech into frames of a typical size of 30 ms while shifting the frame in steps of 10 ms. The FFSR technique extracts features uniformly without considering signal properties; that is, the frame size and shift are fixed to specific values irrespective of the type of speech signal. The method is effective in recognizing vowels, which have a long duration and a periodic property, but is not effective in recognizing consonants, which have a short duration and a non-periodic property.
The segmented speech signal is further analyzed by a feature extraction technique such as Mel-Frequency Cepstral Coefficients (MFCC). The MFCC technique extracts all frequency components of the speech signal through a Fast Fourier Transform (FFT) and further processes the frequency information non-linearly so that it is represented as a feature vector of 13 coefficients. With MFCC, when noise is added to the speech signal, the frequency components of the noise are also included in the feature vector, and features unique to speech signals are not well represented. As a result, conventional speech processing techniques (i.e., FFSR and MFCC) cause serious degradation of speech recognition accuracy.
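For illustration only, the conventional FFSR segmentation described above can be sketched as follows. This is a minimal sketch and not an implementation taken from any particular ASR system: the function name `ffsr_frames` and its parameter names are hypothetical, the 30 ms frame size and 10 ms shift are the typical values stated above, and trailing samples that do not fill a whole frame are simply dropped, which is an assumption.

```python
def ffsr_frames(signal, sample_rate, frame_ms=30, shift_ms=10):
    """Split `signal` into fixed-size frames taken at a fixed shift (FFSR).

    `signal` is any sequence of samples, `sample_rate` is in Hz.
    The name and signature are hypothetical, chosen for this sketch.
    """
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift_len = sample_rate * shift_ms // 1000   # samples between frame starts
    if frame_len <= 0 or shift_len <= 0:
        raise ValueError("frame and shift must span at least one sample")

    frames = []
    start = 0
    # Assumption: an incomplete trailing frame is discarded, not padded.
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift_len
    return frames


if __name__ == "__main__":
    one_second = list(range(16000))              # 1 s of dummy samples at 16 kHz
    frames = ffsr_frames(one_second, 16000)
    print(len(frames), len(frames[0]))
```

Every frame has the same length and the same spacing regardless of whether it covers a vowel or a consonant, which is the limitation noted above.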
On the one hand, if a neural signal measured from the auditory cortex is high-pass filtered, a spike signal is extracted. On the other hand, when the signal is low-pass filtered and a component in a band lower than or equal to 300 Hz is extracted, a signal called the local field potential (LFP) may be obtained. The LFP may be considered a signal that does not contribute to the generation of the spike signal.
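As a rough illustration of the band separation described above, the following sketch splits a sampled signal into a component at or below about 300 Hz and the remaining high-frequency component. Several points are assumptions made only for this example: the function name `split_low_and_high_band` is hypothetical, a simple first-order recursive low-pass filter stands in for whatever filter is actually used, and the high-frequency component is taken as the residual of the low-pass output because the text does not state a high-pass cutoff.

```python
import math


def split_low_and_high_band(signal, sample_rate, cutoff_hz=300.0):
    """Return (low, high) components of `signal` around `cutoff_hz`.

    Hypothetical helper: a first-order low-pass filter yields the
    LFP-like band; the residual is used as the high-frequency band.
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)   # time constant of the filter
    dt = 1.0 / sample_rate                   # sampling interval in seconds
    alpha = dt / (rc + dt)                   # smoothing factor in (0, 1)

    low, high = [], []
    state = signal[0] if signal else 0.0     # start from the first sample
    for x in signal:
        state += alpha * (x - state)         # low-pass update
        low.append(state)
        high.append(x - state)               # assumption: residual as high band
    return low, high


if __name__ == "__main__":
    fs = 30000                               # assumed sampling rate in Hz
    t = [n / fs for n in range(fs)]
    mixed = [math.sin(2 * math.pi * 10 * s) + math.sin(2 * math.pi * 3000 * s)
             for s in t]
    low, high = split_low_and_high_band(mixed, fs)
    print(len(low), len(high))
```

A first-order filter has a gentle roll-off, so the separation here is only approximate; a steeper filter would normally be chosen in practice.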
The phase components of the low-frequency components of the neural signal generated in the auditory cortex while speech signals are heard and recognized may serve 1) a parsing function that divides the speech signals into decodable units, and 2) an independent information unit function in which the phase components themselves provide one piece of information.