1. Field of the Invention
The present invention relates to distributed speech recognition (DSR) systems, devices, methods, and signals where speech recognition feature parameters are extracted from speech and encoded at a near or front end, and electromagnetic signals carrying the feature parameters are transmitted to a far or back end where speech recognition is completed. In its particular aspects, the present invention relates to distributed speech recognition where the front end is provided in a wireless mobile communications terminal and the back end is provided via the communications network.
2. Description of the Related Art
Distributed speech recognition (DSR) is known from the Aurora project of the European Telecommunications Standards Institute (ETSI) for use in mobile communications systems (see http://www.etsi.org/technicalactiv/dsr.com).
It is expected that demand for telephony-based speech recognition services, voice web browsing, and other man-to-machine voice communications via portable wireless communication devices will proliferate rapidly, and in the near future much of the available network capacity could be consumed by users talking to (or chatting with) remotely located machines via such communication devices to retrieve information, make transactions, and entertain themselves.
DSR is under consideration by ETSI for mobile communications systems since the performance of speech recognition systems using speech signals obtained after transmission over mobile channels can be significantly degraded when compared to using a speech signal which has not passed through an intervening mobile channel. The degradations are a result of both the low bit rate speech coding by the vocoder and channel transmission errors. A DSR system overcomes these problems by eliminating the speech coding, and the transmission errors, that are normally acceptable for speech intended for human perception but not for speech to be recognized (STBR) by a machine; instead, it sends a parameterized representation of the speech, suitable for such automatic recognition, over an error-protected channel. In essence, a speech recognizer is split into two parts: a first or front end part at the terminal or mobile station which extracts recognition feature parameters, and a second or back end part at the network which completes the recognition from the extracted feature parameters.
As in traditional speech recognizers, the first part of the recognizer chops an utterance into time intervals called “frames”, and for each frame extracts feature parameters, to produce from an utterance a sequence or array of feature parameters. The second part of the recognizer feeds the sequence of feature parameters into a Hidden Markov Model (HMM) for each possible word of the vocabulary, the HMM for each word having been previously trained on a number of sample sequences of feature parameters from different utterances by the same speaker, or by different speakers if speaker-independence is applied. The HMM evaluation gives, for each evaluated word, a likelihood that a current utterance is the evaluated word. Finally, the second part of the recognizer chooses the most likely word as its recognition result.
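The two-part recognizer described above can be illustrated with a minimal sketch. It is not the Aurora front end or any particular recognizer: the log-energy feature, the thresholding into two symbols, the two-state discrete HMMs, and all names and probabilities (`split_into_frames`, `word_models`, the words "yes" and "no") are assumptions chosen only to keep the example small.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Front end, step 1: chop the utterance into fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """Front end, step 2: one toy feature parameter per frame (assumed feature)."""
    return math.log(sum(x * x for x in frame) + 1e-10)

def forward_likelihood(obs, start, trans, emit):
    """Back end: forward algorithm giving P(obs | HMM) for discrete symbols."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

def recognize(obs, word_models):
    """Evaluate every word's HMM and pick the most likely word."""
    scores = {w: forward_likelihood(obs, *m) for w, m in word_models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical pre-trained models: (start, transition, emission) per word.
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
word_models = {
    "yes": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),  # quiet then loud
    "no":  ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),  # loud then quiet
}

# Synthetic "utterance": a quiet stretch followed by a loud stretch.
samples = [0.01] * 8 + [1.0] * 8
frames = split_into_frames(samples, frame_len=4, hop=4)
features = [log_energy(f) for f in frames]
symbols = [1 if e > 0.0 else 0 for e in features]  # crude 2-level quantization
best_word, scores = recognize(symbols, word_models)
```

In a practical system the features would be multi-dimensional, and the likelihoods would normally be accumulated in the log domain to avoid underflow on long utterances.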
While DSR in accordance with the Aurora Project does not employ vector quantization (VQ), it is generally known to form vector data from feature parameter data and to compress such vector data using a codebook, e.g. when sending such data over a channel, wherein each vector is replaced by a corresponding codebook index representing the vector. Thus a temporal sequence of vectors is converted to a sequence or string of indices. At the receiving end, the same codebook is used to recover the sequence of vectors from the sequence or string of indices. The codebook has a size Sz necessary to include indices representing each possible vector in a suitably quantized vector space, and each index is described by a number of bits B = log2(Sz) necessary to distinguish between indices in the codebook.
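The codebook scheme described above can be sketched as follows. The four-entry codebook, the nearest-neighbour search by squared Euclidean distance, and the function names are illustrative assumptions; rounding B up to a whole number of bits when Sz is not a power of two is also an assumption, not something the text specifies.

```python
import math

def nearest_index(vector, codebook):
    """Return the index of the codebook entry closest to the vector."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(vector, codebook[i])))

def encode(vectors, codebook):
    """Sending end: replace each vector by its codebook index."""
    return [nearest_index(v, codebook) for v in vectors]

def decode(indices, codebook):
    """Receiving end: recover (quantized) vectors using the same codebook."""
    return [codebook[i] for i in indices]

def bits_per_index(codebook_size):
    """B = log2(Sz), rounded up to a whole number of bits (assumption)."""
    return math.ceil(math.log2(codebook_size))

# Hypothetical codebook of size Sz = 4 over a 2-dimensional feature space.
codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
vectors = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]  # temporal sequence of vectors

indices = encode(vectors, codebook)
recovered = decode(indices, codebook)
B = bits_per_index(len(codebook))
```

With this codebook each transmitted index costs B = 2 bits, and the recovered vectors are the codebook entries rather than the original vectors, which is the quantization loss inherent in the scheme.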