1. Field of the Invention
The present invention relates to a method for compressing data. More particularly, the present invention relates to a method for compressing feature vectors in a distributed speech recognition system.
2. Description of the Related Art
Many applications make use of speech recognition techniques. Examples include:
- Interactive Voice Response (IVR) services based on speech recognition of “sensitive” information, such as banking and brokerage transactions. Speech recognition features may be stored for future human verification purposes or to satisfy procedural requirements;
- human verification of utterances in the speech database collected from a deployed speech recognition system. This database can then be used to retrain and tune models in order to improve system performance;
- applications where machine and human recognition are mixed (e.g. human assisted dictation).
An application of speech recognition is also disclosed by U.S. Pat. No. 5,946,653, which describes a technique to control a target system by recognising a spoken command and then applying a stimulus to the target system based on the recognised spoken command. Target systems and software applications controlled by voice commands are desirable because a user can control them simply by speaking, which improves the ease of operation and the user friendliness perceived by the user.
In a typical speech recognition system, input speech is received through a microphone, sampled and converted to a digital representation of the original input speech. The digitised speech is then processed (in a step commonly called “feature extraction” or “front-end” processing) so as to create feature vectors, which provide a representation of the speech in a more compact format. The feature vectors are then transmitted or passed to a pattern recognition and/or reconstruction system, commonly called the “back-end”, which compares the incoming feature vectors to speech templates in order to reconstruct the input speech.
Speech recognition and/or reconstruction in the back-end typically requires search algorithms that use large amounts of memory and CPU cycles.
Three main approaches are known in the art for speech processing:
- server-side: the audio signal is sent by the device to the server through a transmission channel. The server performs all the audio signal processing and sends the results of the recognition process back to the device. This approach is limited by the absence of graphical displays and by the instability of the connection between the device and the server. Because of low-resolution analog-to-digital conversion, transcoding and transmission losses, and the errors inherent in every wireless technology, the quality of the digitised audio signal is sometimes insufficient for successful speech recognition;
- client-side: the speech processing is performed entirely in the user's device. While this approach solves the audio channel problems, the client device needs substantial processing and memory capabilities together with low power consumption; however, wireless hand-held devices such as Personal Digital Assistants (PDAs), cell phones, and other embedded devices are typically limited in computation, memory, and battery energy. Complex search algorithms are thus difficult to run on these conventional devices because of said resource limitations;
- distributed speech recognition (DSR): the speech recognition tasks are performed partly in the client device and partly on the server. The client device extracts specific features from the user's digitised speech and sends this digital representation to the server. The server completes the process by comparing the extracted information with the language models and vocabulary lists that are stored in the server, so that the wireless device is less memory-constrained. Other advantages of this approach are the possibility of adding a voice interface to a variety of mobile devices without significant hardware requirements, the possibility of easily updating services, content and code, and low sensitivity to errors (these systems can typically handle data packet losses of up to 10% without detrimental impact on the speech recognition accuracy).
In a distributed speech recognition (DSR) system, therefore, only the front-end processing is performed in the wireless hand-held device, while the computationally intensive and memory-intensive back-end processing is performed at a remote server (see, for example, EP 1 395 978).
Moreover, in order to save communication channel bandwidth, it has been proposed in the art to compress the feature vectors extracted in the front-end processing of a DSR system, before their transmission to the remote server for the back-end processing. This compression is commonly called in the art “vector quantization” or “vector compression”.
In this context, the European Telecommunication Standards Institute (ETSI) released a standard (“Aurora”) for DSR feature extraction and compression algorithms (ETSI ES 202 050, ES 202 212, ES 201 108 and ES 202 211).
According to the feature extraction algorithm of the Aurora ETSI standard, the digitised input speech is filtered, and each speech frame is windowed using a Hamming window and transformed into the frequency domain using a Fast Fourier Transform (FFT). Then a Mel-frequency domain transformation and subsequent processing steps are performed so as to obtain, for each time frame of the speech data, a vector comprising 14 features: twelve static Mel cepstral coefficients C(1) to C(12), plus the zero cepstral coefficient C(0) and a log energy term lnE (see also EP 1 395 978).
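The windowing and frequency-domain steps described above can be illustrated with a minimal sketch. It is not the ETSI front-end: it covers only the Hamming window and the transform, uses a naive discrete Fourier transform in place of an FFT to stay dependency-free, and omits filtering, the Mel transformation and the cepstral computation. The frame length of 64 samples and the test tone are assumptions chosen for illustration.

```python
import cmath
import math

def hamming(n_samples):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def magnitude_spectrum(frame):
    """Window one frame and return the magnitudes of its non-negative-frequency bins.

    A naive O(n^2) DFT stands in for the FFT named in the text.
    """
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

# Hypothetical input: a pure tone with exactly 8 cycles in a 64-sample frame.
frame = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

In this sketch the energy of the tone concentrates around bin 8, and the window spreads a smaller share into the neighbouring bins.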
According to the compression algorithm of the Aurora ETSI standard, the 14 features are then grouped into pairs, thereby providing seven two-feature vectors for each time frame of the speech data. These two-feature vectors are then compressed by using seven respective predetermined codebooks.
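The grouping step can be sketched as follows. The feature order and the choice of pairing consecutive features are assumptions made for illustration; the text above states only that the 14 features are grouped into seven pairs. The name `split_into_pairs` is hypothetical.

```python
# Assumed feature order for illustration: C(1)..C(12), then C(0), then lnE.
FEATURE_NAMES = [f"C({k})" for k in range(1, 13)] + ["C(0)", "lnE"]

def split_into_pairs(frame):
    """Split one 14-feature frame into seven two-feature sub-vectors."""
    if len(frame) != 14:
        raise ValueError("expected 14 features per frame")
    return [(frame[i], frame[i + 1]) for i in range(0, 14, 2)]

# Placeholder values, not real cepstral coefficients.
frame = [float(k) for k in range(14)]
pairs = split_into_pairs(frame)
pair_names = split_into_pairs(FEATURE_NAMES)
```

Each of the seven sub-vectors would then be quantized against its own codebook.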
A codebook is a set of predetermined indexed reference vectors which are chosen to be representative of the original information, represented by the feature vectors. The distribution of reference vectors in the codebook may be non-uniform, as provided for by the Aurora ETSI standard.
The compression or quantization is performed by replacing an input vector with the index of the reference vector that offers the lowest distortion.
Indeed, as the index is a non-negative integer value between 0 and N−1 (wherein N is the number of reference vectors in a codebook), it can be represented more compactly than an input feature vector comprising Q features, with Q ≥ 2.
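As a worked illustration of this saving, assume a hypothetical codebook of N = 64 reference vectors and input features stored as 32-bit floating-point numbers (both values are assumptions, not taken from the text). The index then needs

$$b = \lceil \log_2 N \rceil = \lceil \log_2 64 \rceil = 6 \text{ bits},$$

whereas the uncompressed two-feature vector (Q = 2) needs

$$Q \cdot 32 = 64 \text{ bits},$$

a reduction of roughly 10.7 to 1 under these assumptions.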
According to the ETSI algorithm, the lowest distortion is found by evaluating a weighted Euclidean distance between an input vector and each reference vector of the respective codebook. Once the closest reference vector is found, the index of that reference vector is used to represent that input vector.
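A minimal sketch of this quantization step is given below. It assumes per-feature (diagonal) weights, which is a simplification; the codebook, the weights and the function names are hypothetical and are not taken from the ETSI standard.

```python
def weighted_sq_distance(x, c, w):
    """Weighted squared Euclidean distance between vectors x and c."""
    return sum(wk * (xk - ck) ** 2 for xk, ck, wk in zip(x, c, w))

def quantize(x, codebook, weights):
    """Return the index (0..N-1) of the reference vector with the lowest distortion."""
    best_index = 0
    best_dist = weighted_sq_distance(x, codebook[0], weights)
    for i in range(1, len(codebook)):
        d = weighted_sq_distance(x, codebook[i], weights)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

# Hypothetical four-entry codebook and unit weights.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
weights = (1.0, 1.0)

index = quantize((0.9, 0.2), codebook, weights)  # value sent in place of the vector
reconstructed = codebook[index]                  # what the receiver looks up
```

Only `index` needs to be transmitted; the receiver recovers an approximation of the input by looking the index up in the same codebook.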
The value of the minimum distance Dmin, computed with a Euclidean distance for two-feature vectors, is expressed as

$$D_{\min} = \min_{1 \le i \le N} \left\{ \left(C_{i,A} - X_A\right)^2 + \left(C_{i,B} - X_B\right)^2 \right\}$$

wherein N is the number of vectors in the codebook; $(X_A, X_B)$ is the input vector and $C_i = (C_{i,A}, C_{i,B})$ is the i-th vector of the codebook.
According to the above expression, the conventional ETSI compression algorithm (exhaustive computation) requires N distance computations, equivalent to the evaluation of 2·N squares and 3·N additions (the computation of the square root can be omitted, because it does not affect the result of the search for the minimum value Dmin). The required processing capacity therefore grows in proportion to N.
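The operation count stated above can be checked with a small sketch of the exhaustive search that tallies squares and additions (subtractions are counted as additions). The 8 × 8 grid codebook with N = 64 entries is a hypothetical example, and the code uses 0-based indices.

```python
def exhaustive_search(x, codebook):
    """Scan every reference vector; return (index, squared distance, squares, additions)."""
    xa, xb = x
    squares = additions = 0
    best_index, d_min = -1, float("inf")
    for i, (ca, cb) in enumerate(codebook):
        da = ca - xa              # 1 addition
        db = cb - xb              # 1 addition
        d = da * da + db * db     # 2 squares, 1 addition
        squares += 2
        additions += 3
        if d < d_min:             # no square root needed to compare distances
            best_index, d_min = i, d
    return best_index, d_min, squares, additions

# Hypothetical codebook: an 8 x 8 grid of reference vectors, N = 64.
codebook = [(float(a), float(b)) for a in range(8) for b in range(8)]
best_index, d_min, squares, additions = exhaustive_search((2.2, 5.7), codebook)
```

For N = 64 the tallies come to 2·N = 128 squares and 3·N = 192 additions, whatever the input vector.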
Attempts have been made in the art to improve the compression algorithm as disclosed, for example, in EP 0 496 541, U.S. Pat. No. 5,946,653 and U.S. Pat. No. 6,389,389.
However, the techniques disclosed by these documents do not efficiently reduce the computational effort required to find the vector of a codebook which has the minimum distance from an input vector.
A different vector quantization technique, in this case for video data compression, is proposed in U.S. Pat. No. 4,958,225. In particular, this patent discloses a method for compressing video data employing vector quantization for identifying one of a set of codebook vectors which most closely matches an input vector. According to an embodiment, the method comprises the following steps: 1) computing the norm of an input vector I; 2) identifying a reference codebook vector A_j which has a norm closest to the norm of the input vector; 3) computing the distance h_{I,J} between the input vector I and the reference codebook vector A_j; 4) identifying a subset S of the codebook vectors made up of codebook vectors from a limited volume of the vector space around the input vector I, such as vectors having a norm in the range |I|−h_{I,J} to |I|+h_{I,J}; 5) searching the subset S for the codebook vector having the smallest distance to the input vector; 6) selecting the codebook vector having the smallest distance to the input vector. The identification of the subset S in step 4) reduces the number of vectors which must be evaluated in step 5) for the smallest distance computation.
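The six steps listed above can be sketched as follows. This is an illustrative reading of the description given here, not the implementation of U.S. Pat. No. 4,958,225; the function names and the small codebook are hypothetical, and the codebook norms are assumed to be precomputed.

```python
import math

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def norm_pruned_search(x, codebook, norms):
    """Return (index of nearest codebook vector, number of vectors examined in step 5)."""
    nx = norm(x)                                                     # step 1
    j = min(range(len(codebook)), key=lambda i: abs(norms[i] - nx))  # step 2
    h = dist(x, codebook[j])                                         # step 3
    # Step 4: by the triangle inequality, any vector closer than h must have
    # a norm within [nx - h, nx + h]; j is kept explicitly against rounding.
    subset = [i for i in range(len(codebook))
              if i == j or nx - h <= norms[i] <= nx + h]
    best = min(subset, key=lambda i: dist(x, codebook[i]))           # steps 5 and 6
    return best, len(subset)

# Hypothetical codebook; its norms are computed once, ahead of the search.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
norms = [norm(c) for c in codebook]

best, examined = norm_pruned_search((0.9, 0.2), codebook, norms)
```

In this example only two of the six codebook vectors fall inside the norm range, so step 5 evaluates two distances instead of six, although steps 1) to 4) add work of their own.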
However, the Applicant notes that, although this method reduces the number of distance computations between the input vector and the reference vectors to the number of reference vectors included in the subset S, it still requires a large number of computations and instructions to be executed in steps 1) to 6). Moreover, only the norms of the reference codebook vectors can be calculated off-line in advance and stored in memory for later use in step 2). Therefore, the on-line computational effort required by this method is still high.