Generally speaking, the means for effecting speech recognition comprise means for obtaining an audio signal, acoustic analysis means that extract modeling parameters and, finally, recognition means that compare these extracted modeling parameters with models and propose the form stored in the models that is most probably associated with the signal. Optionally, voice activity detection (VAD) means may be used. These detect the sequences of speech that are to be recognized: they extract the speech segments from the input audio signal, excluding the periods without voice activity, and these segments are subsequently processed by the modeling parameter extraction means.
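By way of illustration only, the voice activity detection step can be sketched as a simple frame-energy threshold. The text does not specify any particular detection method; the function names, the frame length of 160 samples and the threshold of 0.01 are assumptions chosen for the example, and practical detectors are considerably more elaborate.

```python
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each complete, non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def detect_speech_segments(
    samples: Sequence[float],
    frame_len: int = 160,      # assumed: 20 ms at an 8 kHz sampling rate
    threshold: float = 0.01,   # assumed energy threshold, illustrative only
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of each run of frames whose
    energy exceeds the threshold; 'end' is exclusive."""
    energies = frame_energies(samples, frame_len)
    segments: List[Tuple[int, int]] = []
    start = None
    for i, energy in enumerate(energies):
        active = energy > threshold
        if active and start is None:
            start = i * frame_len                     # speech begins
        elif not active and start is not None:
            segments.append((start, i * frame_len))   # speech ends
            start = None
    if start is not None:                             # signal ends during speech
        segments.append((start, len(energies) * frame_len))
    return segments
```

Only the samples inside the returned segments would then be passed to the modeling parameter extraction means; the periods without voice activity are discarded.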
More particularly, the invention relates to the interactions between the three speech recognition modes: on-board, centralized and distributed.
In an on-board speech recognition mode, all of the means for effecting the speech recognition are located within the user terminal. The limitations of this mode of recognition are therefore associated notably with the power of the on-board processors and with the memory available for storing the speech recognition models. On the other hand, this mode allows autonomous operation, without connection to a server, and in this respect is expected to develop substantially as the cost of processing capacity falls.
In a centralized speech recognition mode, the whole speech recognition procedure is executed, and the recognition models are stored, on a computer, generally called a voice server, that is accessible by the user terminal. The terminal simply transmits a speech signal to the server. This method is used notably in the applications offered by telecommunications operators. A basic terminal can thus have access to sophisticated voice-activated services. Many types of speech recognition (robust, flexible, very large vocabulary, dynamic vocabulary, continuous speech, mono- or multi-speaker, several languages, etc.) may be implemented within a speech recognition server. Indeed, centralized computer systems have large and increasing model storage capacities, working memory sizes and computational powers.
In a distributed speech recognition mode, the acoustic analysis means are on board the user terminal, while the recognition means are located at the server. In this distributed mode, a noise-filtering function associated with the modeling parameter extraction means can advantageously be performed at the source. Only the modeling parameters are transmitted, allowing a substantial gain in transmission rate, which is particularly advantageous for multi-modal applications. In addition, the signal to be recognized can be better protected from transmission errors. Optionally, the voice activity detector (VAD) can also be on board, so that the modeling parameters are transmitted only during the speech sequences, which has the advantage of significantly reducing the active transmission duration. Distributed speech recognition also allows signals for speech and data, notably text, images and videos, to be carried on the same transmission channel. The transmission network can, for example, be of the IP, GPRS, WLAN or Ethernet type. This mode also allows the user to benefit from protection and correction procedures against the loss of packets forming the signal transmitted to the server. However, it requires the availability of data transmission channels, with a strict transmission protocol.
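As a rough sketch of the terminal side in the distributed mode, the following shows modeling parameters being serialized and sent only for frames flagged as speech by an on-board VAD. The function names, the 13-coefficient vectors and the 32-bit float encoding are all assumptions made for the example; the text does not prescribe any parameter format or transmission protocol, and a real system would typically quantize the parameters far more compactly.

```python
import struct
from typing import List, Sequence


def pack_parameters(vector: Sequence[float]) -> bytes:
    """Serialize one modeling-parameter vector as little-endian 32-bit floats
    (hypothetical wire format, chosen only for illustration)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def terminal_payloads(
    vectors: Sequence[Sequence[float]],
    speech_flags: Sequence[bool],
) -> List[bytes]:
    """Build the payloads the terminal would transmit: one per frame,
    and only for frames the on-board VAD marked as speech."""
    return [
        pack_parameters(vector)
        for vector, is_speech in zip(vectors, speech_flags)
        if is_speech
    ]


if __name__ == "__main__":
    # Five frames of 13 parameters each; the VAD flags three of them as speech.
    vectors = [[0.0] * 13 for _ in range(5)]
    flags = [False, True, True, False, True]
    payloads = terminal_payloads(vectors, flags)
    sent = sum(len(p) for p in payloads)
    # For comparison: a frame of 160 samples of 16-bit audio is 320 bytes.
    raw = 5 * 160 * 2
    print(f"parameters sent: {sent} bytes; raw audio would be: {raw} bytes")
```

Under these assumed figures, each transmitted frame carries 52 bytes of parameters instead of 320 bytes of audio samples, and nothing at all is sent for the two frames without voice activity.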