Analysis of non-sinusoidal waveforms is particularly applicable to speech recognition systems. Some speech processors begin the pitch extraction process by dividing the speech wave into separate frequency channels, using either Fourier Transform methods or a filter bank that mimics, to a greater or lesser degree, the filtering performed by the human auditory system. This is done to make the speech recognition system more resistant to noise.
In the Fourier Transform scheme, small segments of the wave are transformed successively from the time domain to the frequency domain, and the components in the resulting spectrum are analysed to determine whether they form a harmonic series. The fundamental of the series provides an estimate of the pitch of the speech at that moment. This approach is relatively economical, but it has the disadvantage that it destroys the temporal information in the speech wave before it has been completely analysed.
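The Fourier Transform scheme described above can be illustrated for a single segment by the following minimal sketch. It is not taken from any particular speech processor. The harmonic-sum scoring rule, the Hann window, the function names (magnitude_spectrum, estimate_pitch), the search range of 60 Hz to 400 Hz and the synthetic four-harmonic test tone are all assumptions chosen for illustration. A direct DFT is used only to keep the example free of external libraries.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return DFT magnitudes for bins 0..N/2 of a Hann-windowed frame.

    A direct O(N^2) DFT is used so that only the standard library is needed.
    """
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def estimate_pitch(frame, sample_rate, f_min=60.0, f_max=400.0, n_harmonics=4):
    """Estimate the pitch of one frame by scoring candidate harmonic series.

    Each candidate fundamental bin is scored by summing the spectral
    magnitude at its first n_harmonics multiples (an assumed, simple rule).
    """
    spectrum = magnitude_spectrum(frame)
    bin_hz = sample_rate / len(frame)
    lo = max(1, math.ceil(f_min / bin_hz))
    # Keep the highest harmonic of every candidate inside the spectrum.
    hi = min(int(f_max / bin_hz), (len(spectrum) - 1) // n_harmonics)
    best_bin = max(
        range(lo, hi + 1),
        key=lambda b: sum(spectrum[b * h] for h in range(1, n_harmonics + 1)),
    )
    return best_bin * bin_hz


# Hypothetical test segment: 50 ms of a 200 Hz tone with four harmonics.
SAMPLE_RATE = 8000
AMPLITUDES = (1.0, 0.5, 0.33, 0.25)
frame = [
    sum(a * math.sin(2.0 * math.pi * 200.0 * (h + 1) * i / SAMPLE_RATE)
        for h, a in enumerate(AMPLITUDES))
    for i in range(400)
]
pitch = estimate_pitch(frame, SAMPLE_RATE)
print(pitch)
```

In a complete system the same estimate would be repeated on successive segments of the wave. Each estimate is derived from the magnitude spectrum alone, which is why the timing information within the segment is no longer available to later stages.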
In the filter-bank method, the speech wave is divided into channels by filters operating in the time domain, and the result is a set of waveforms, each of which carries some portion of the original speech information. The temporal information in each channel is analysed separately, and a combined estimate of the pitch of the speech is then calculated. Such methods are very complex, however, and it is difficult to provide sufficient resolution for optimum pitch extraction.
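The filter-bank method described above can be sketched in a much reduced form as follows. This is an illustration under stated assumptions and not a model of the auditory system: the two-pole resonators, the four centre frequencies, the bandwidth of one quarter of the centre frequency, the use of a normalised autocorrelation as the per-channel temporal analysis, and the summing of channels to form the combined estimate are all choices made for the example. The function names are hypothetical.

```python
import math


def resonator(signal, centre_hz, bandwidth_hz, sample_rate):
    """Filter the signal in the time domain with a two-pole resonator."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * centre_hz / sample_rate)
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def normalised_autocorrelation(channel, lags):
    """Autocorrelation of one channel at the given lags, divided by its energy."""
    energy = sum(v * v for v in channel)
    if energy == 0.0:
        return [0.0 for _ in lags]
    n = len(channel)
    return [sum(channel[i] * channel[i + lag] for i in range(n - lag)) / energy
            for lag in lags]


def filter_bank_pitch(signal, sample_rate,
                      centres_hz=(200.0, 400.0, 800.0, 1600.0),
                      f_min=60.0, f_max=400.0):
    """Analyse each channel separately, then combine into one pitch estimate."""
    lags = range(int(sample_rate / f_max), int(sample_rate / f_min) + 1)
    summary = [0.0] * len(lags)
    for fc in centres_hz:
        channel = resonator(signal, fc, fc / 4.0, sample_rate)
        for j, value in enumerate(normalised_autocorrelation(channel, lags)):
            summary[j] += value  # every channel contributes with equal weight
    best_lag = lags[max(range(len(lags)), key=summary.__getitem__)]
    return sample_rate / best_lag


# Hypothetical test signal: 100 ms of a 200 Hz tone with four harmonics.
SAMPLE_RATE = 8000
AMPLITUDES = (1.0, 0.5, 0.33, 0.25)
signal = [
    sum(a * math.sin(2.0 * math.pi * 200.0 * (h + 1) * i / SAMPLE_RATE)
        for h, a in enumerate(AMPLITUDES))
    for i in range(800)
]
pitch = filter_bank_pitch(signal, SAMPLE_RATE)
print(pitch)
```

Even in this reduced form the computation grows with the number of channels and the number of lags examined. The pitch resolution is also limited to whole-sample lags, which gives some indication of the complexity and resolution difficulties noted above.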
Simple speech recognition systems, which employ pitch extractors that operate on the raw waveform in the time domain, are inefficient and susceptible to disruption by background noise.
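The susceptibility of a raw-waveform extractor to interference can be illustrated with a zero-crossing pitch estimator. This is one simple time-domain technique, chosen here as an assumption; the text does not state which extractor such systems employ. The test wave, the function name zero_crossing_pitch and the strong 1300 Hz tone standing in for background noise are likewise hypothetical.

```python
import math


def zero_crossing_pitch(signal, sample_rate):
    """Estimate pitch from the spacing of upward zero crossings.

    Returns None when fewer than two crossings are found.
    """
    crossings = [i for i in range(1, len(signal))
                 if signal[i - 1] < 0.0 <= signal[i]]
    if len(crossings) < 2:
        return None
    periods = len(crossings) - 1
    return sample_rate * periods / (crossings[-1] - crossings[0])


SAMPLE_RATE = 8000
N_SAMPLES = 800

# Clean wave: 200 Hz fundamental plus a weaker second harmonic.
clean = []
for i in range(N_SAMPLES):
    theta = 2.0 * math.pi * 200.0 * i / SAMPLE_RATE + 0.5
    clean.append(math.sin(theta) + 0.3 * math.sin(2.0 * theta))

# The same wave with a strong interfering tone added (stand-in for noise).
noisy = [x + 2.0 * math.sin(2.0 * math.pi * 1300.0 * i / SAMPLE_RATE)
         for i, x in enumerate(clean)]

clean_pitch = zero_crossing_pitch(clean, SAMPLE_RATE)
noisy_pitch = zero_crossing_pitch(noisy, SAMPLE_RATE)
print(clean_pitch, noisy_pitch)
```

On the clean wave the estimator recovers the fundamental. Once the interfering tone is added, the extra zero crossings pull the estimate far away from the true pitch, because the estimator has no means of separating the speech from the interference.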