1. Technical Field of the Invention
The invention relates to the field of automatic speech recognition and more particularly to a speech analyzing stage and a method for analyzing a speech signal sampled at one of at least two different system sampling rates utilized in an automatic speech recognition system.
2. Discussion of the Prior Art
Automatic speech recognition is increasingly used for controlling electronic devices of all types, such as mobile telephones, and for obtaining access to services over a telecommunication network.
Automatic speech recognition systems can differ with respect to the spectral range in which input speech signals are analyzed. Today, many telecommunication terminals with automatic speech recognition capability focus on the spectral range up to 4 kHz by sampling an analog input speech signal using an analog-to-digital converter operated at a sampling rate of 8 kHz. A standard approach for analyzing and recognizing such digitized speech signals in an automatic speech recognition system 100 is shown in FIG. 1.
The digitized input speech signal is analyzed by means of a spectral analyzer in the form of a MEL filterbank 110. In the MEL filterbank 110 the spectral band of the input speech signal is divided into a plurality of subbands which are equidistant in the MEL spectral domain. The MEL filterbank 110 then performs a short-term spectral analysis with respect to the short-term speech energy for each subband. The spectral analysis in the MEL spectral range takes into account properties of human speech perception, since the human auditory system has a higher spectral resolution at low frequencies.
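The division into subbands that are equidistant in the MEL spectral domain may be illustrated by the following minimal sketch. It is not taken from the system of FIG. 1: the conversion formula 2595·log10(1 + f/700) is a commonly used MEL approximation assumed here, and the function names, the lower limit of 0 Hz and the use of num_subbands + 2 edge points (as for overlapping triangular filters) are likewise assumptions made for illustration.

```python
import math

def hz_to_mel(f_hz):
    # Commonly used MEL approximation (an assumption, not stated in the text).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_low_hz, f_high_hz, num_subbands):
    """Return num_subbands + 2 edge frequencies in Hz, equally spaced in MEL."""
    mel_low = hz_to_mel(f_low_hz)
    mel_high = hz_to_mel(f_high_hz)
    step = (mel_high - mel_low) / (num_subbands + 1)
    return [mel_to_hz(mel_low + i * step) for i in range(num_subbands + 2)]

# Hypothetical example: 23 subbands over the 0 to 4 kHz range of an 8 kHz signal.
edges = mel_band_edges(0.0, 4000.0, 23)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print(f"narrowest spacing: {widths[0]:.1f} Hz, widest spacing: {widths[-1]:.1f} Hz")
```

The spacing between neighbouring edges grows with frequency, which reflects the higher spectral resolution at low frequencies mentioned above.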
The MEL-filtered speech signal is then input into a non-linear transformation block 120 which comprises an individual non-linear transformation unit for each subband analyzed by the MEL filterbank 110. Each non-linear transformation unit of the non-linear transformation block 120 converts the speech energy comprised within the respective subband from the linear spectral domain into the logarithmic spectral domain. The output of the non-linear transformation block 120 is input into a Discrete Cosine Transformation (DCT) block 130 which transforms the speech signal into the cepstral domain. The output of the DCT block 130 consists of L acoustic parameters in the cepstral domain (cepstral parameters). The cepstral parameters are taken as input for the recognition unit 140 where pattern matching takes place. By means of pattern matching the cepstral parameters of the speech signal are compared with corresponding parameters that are stored as pre-trained reference models in a reference model database 150. Hidden Markov Models (HMMs) are most often used as reference models. The reference models are trained in advance to represent the spectral characteristics of, e.g., words or phonemes. By means of pattern matching a recognition result can be obtained which is subsequently output by the recognition unit 140.
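The chain formed by the non-linear transformation block 120 and the DCT block 130 may be sketched as follows. The sketch is illustrative only: the natural logarithm, the energy floor, the unnormalized DCT-II and all function and parameter names are assumptions, since the text does not specify these details. The parameter num_coeffs plays the role of L.

```python
import math

def log_energies(subband_energies, floor=1e-10):
    # One logarithmic unit per subband; the floor (an assumption) avoids log(0).
    return [math.log(max(e, floor)) for e in subband_energies]

def dct_cepstrum(log_e, num_coeffs):
    # Unnormalized DCT-II of the log energies, yielding num_coeffs cepstral parameters.
    n = len(log_e)
    return [
        sum(log_e[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
        for k in range(num_coeffs)
    ]

# Hypothetical input: 23 subbands that all carry the same energy.
flat_energies = [math.e] * 23
cepstrum = dct_cepstrum(log_energies(flat_energies), num_coeffs=13)
```

For such a flat spectrum only the zeroth cepstral parameter is non-zero; the higher parameters describe how the log energies vary across the subbands.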
It has become apparent from the above that the conventional automatic speech recognition system 100 depicted in FIG. 1 analyzes the input speech signal in a spectral range up to 4 kHz by sampling the analog input speech signal at 8 kHz. Of course, higher sampling rates may be used as well. For example, personal computers often use a sampling rate of approximately 11 kHz (11.025 kHz), which is ¼ of the 44.1 kHz used for the sampling of CDs. A higher sampling rate captures a wider spectral band and thus more spectral information, so that the performance of automatic speech recognition systems generally increases if higher sampling rates are employed.
In the future it is expected that electronic devices which are operable at several sampling rates, and network systems which comprise terminals each operating at one of several different system sampling rates, will be developed. Consequently, the question arises how to construct an automatic speech recognition system capable of analyzing speech signals sampled at different sampling rates.
A proposal for a network system comprising an automatic speech recognition system supporting three different sampling rates of 8, 11 and 16 kHz is known from “Speech processing, transmission and quality aspects (STQ); Distributed Speech Recognition; Front-end feature extraction algorithm; Compression algorithms”, ETSI standard document ETSI ES 201 108 v1.1.2 (2000–04), April 2000.
The speech analysis in this network system is based on a MEL filterbank with 23 subbands. The number of 23 MEL subbands is kept constant for all three sampling rates. This means that the subbands are differently distributed over each of the three spectral ranges of 4, 5.5 and 8 kHz (corresponding to the sampling rates of 8, 11 and 16 kHz) to be analyzed.
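The effect of keeping the number of subbands constant may be illustrated by computing where 23 MEL-equidistant subbands would lie for each of the three spectral ranges. The sketch below is not the filterbank of the cited standard: the MEL formula, the lower limit of 0 Hz, the placement of centre frequencies and all names are assumptions made for illustration.

```python
import math

NUM_SUBBANDS = 23  # kept constant for all sampling rates, as stated above

def hz_to_mel(f_hz):
    # Commonly used MEL approximation (an assumption).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def centre_frequencies(upper_limit_hz, num_subbands=NUM_SUBBANDS):
    # Centres equally spaced in MEL between 0 Hz and the upper limit (assumed layout).
    mel_step = hz_to_mel(upper_limit_hz) / (num_subbands + 1)
    return [mel_to_hz((i + 1) * mel_step) for i in range(num_subbands)]

# Spectral ranges of 4, 5.5 and 8 kHz for sampling rates of 8, 11 and 16 kHz.
layouts = {limit: centre_frequencies(limit) for limit in (4000.0, 5500.0, 8000.0)}

for limit, centres in layouts.items():
    print(f"{limit:6.0f} Hz range: first centre {centres[0]:6.1f} Hz, "
          f"last centre {centres[-1]:7.1f} Hz")
```

Under these assumptions every subband index maps to a different centre frequency for each spectral range, so that the same subband covers different parts of the spectrum depending on the sampling rate.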
It is clear that by differently distributing the 23 subbands over the three spectral ranges the spectral analysis is different for each sampling rate. Consequently, one and the same reference model looks different depending on the sampling rate at which the respective reference model has been trained. This implies that the reference models have to be trained for each sampling rate individually to guarantee optimal recognition performance. Thus, the training effort and the memory requirements for an automatic speech recognition system operable at three different sampling rates are increased by at least a factor of three.
There exists, therefore, a need for a speech analyzing stage and a method for analyzing a speech signal sampled at one of at least two different system sampling rates of an automatic speech recognition system which are user-friendly and which allow the hardware requirements of the automatic speech recognition system to be simplified.