A typical speech recognition system operates by breaking down an input speech signal into smaller segments over time. Each segment is then individually analyzed, and some features, specifically those acoustic features that have been found relevant for the purpose of speech recognition, are extracted. These extracted features are then matched against reference models for the words in the vocabulary, and the best match is selected.
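The segment, extract, and match steps described above can be illustrated with a minimal sketch. It is not the method of any particular recognizer. The features (log energy and zero-crossing rate), the 8000 samples/sec rate, the 80-sample frames, the synthetic tones standing in for speech, and all function names are assumptions chosen only to keep the example self-contained.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Break the signal into fixed-length, overlapping segments over time."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame):
    """Two toy acoustic features (assumed): log energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (math.log(energy + 1e-10), crossings / (len(frame) - 1))

def extract_features(samples, frame_len=80, hop=40):
    """Analyze each segment individually and collect its features."""
    return [frame_features(f) for f in split_into_frames(samples, frame_len, hop)]

def distance(feats_a, feats_b):
    """Mean squared distance between two feature sequences, frame by frame."""
    n = min(len(feats_a), len(feats_b))
    return sum((a0 - b0) ** 2 + (a1 - b1) ** 2
               for (a0, a1), (b0, b1) in zip(feats_a[:n], feats_b[:n])) / n

def recognize(samples, reference_models):
    """Match the features against every reference model; return the best match."""
    feats = extract_features(samples)
    return min(reference_models, key=lambda word: distance(feats, reference_models[word]))

def tone(freq_hz, n=800, rate=8000):
    """Synthetic stand-in for a spoken word (assumption: a pure tone)."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]

# Hypothetical two-word vocabulary, each word represented by one reference model.
reference_models = {
    "low": extract_features(tone(200)),
    "high": extract_features(tone(1200)),
}
result = recognize(tone(1100), reference_models)
print(result)
```

Practical recognizers use richer features and statistical models, but the flow of data through the three stages is the same as in this sketch.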
Speech recognition applications in use today include voice activated dialing (through a telephone company) and dictation software.
In voice activated dialing through a telephone company, when a user lifts the handset of his telephone, he is connected to a speech recognition server which is located at the telephone company's exchange. The user then speaks the name of a person he wishes to be connected to, and the server interprets the voice command and performs the connection task.
The user is connected to the speech recognition server through a circuit switched network, and a part of the network bandwidth, usually of the order of 8 Kbytes/sec., is constantly devoted to the user for maintaining a connection. Here the server performs the feature extraction, after the decoder has decoded the incoming speech.
Packet networks are replacing the existing Time Division Multiplexing (TDM) based voice networks. In a packet network system 10, as shown in FIG. 1, the speech being sent to the speech recognition server 12 at the telephone exchange will typically be compressed to a low rate, such as 1 Kbyte/sec, using a speech coder at an access interface 14 at the sending end. Because the information is actually intended for the recognizer at the exchange end, the compressed speech must first be decoded at the exchange and then passed on to the speech recognition server 12, e.g., as PCM samples.
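The bandwidth difference between the two arrangements can be worked through as simple arithmetic. In the sketch below, the two rates are the approximate figures given in the text, while the interpretation of 1 Kbyte as 1000 bytes and the 2-second utterance length are assumptions made only for illustration.

```python
# Approximate rates from the text; 1 Kbyte is taken as 1000 bytes (assumption).
CIRCUIT_RATE_BYTES_PER_SEC = 8 * 1000   # circuit switched connection
PACKET_RATE_BYTES_PER_SEC = 1 * 1000    # compressed speech on the packet network

def bytes_for_utterance(rate_bytes_per_sec, seconds):
    """Bytes carried by the network for an utterance of the given duration."""
    return rate_bytes_per_sec * seconds

utterance_seconds = 2  # hypothetical length of a spoken name

circuit_bytes = bytes_for_utterance(CIRCUIT_RATE_BYTES_PER_SEC, utterance_seconds)
packet_bytes = bytes_for_utterance(PACKET_RATE_BYTES_PER_SEC, utterance_seconds)

compression_ratio = CIRCUIT_RATE_BYTES_PER_SEC // PACKET_RATE_BYTES_PER_SEC
bytes_saved = circuit_bytes - packet_bytes

print(compression_ratio, bytes_saved)
```

Under these assumptions the coder reduces the traffic by a factor of eight, which is the saving that has to be weighed against the extra decoding work at the exchange.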
This type of system has at least two disadvantages, namely:
The computational load on the telephone exchange server is increased, since it has to first decode the input to speech samples, and then perform all the steps of speech recognition.
The design of compression algorithms for telephony is based on perceptual criteria of voice quality. However, these criteria do not necessarily preserve the performance of the speech recognition system, and therefore the speech recognition system may not perform well.
In summary, speech recognition typically requires a number of steps or stages, and current speech recognition systems perform all of these steps at the same location. Such systems have a problem when the user is remote from the speech recognition system and is connected to it through a link, e.g., a packet network, that compresses the user's speech before transmitting it.
Remotely transmitted speech is typically compressed before being sent over a packet network, in order to reduce the bandwidth it consumes. However, speech compression algorithms are generally designed to trade off bit rate against intelligibility to a human listener, and are not designed to preserve acoustic features. That is, they compress speech data as much as possible while still allowing a user at the receiving end to understand the decompressed speech. What present systems fail to take into account is that sometimes there is speech recognition equipment and processing at the receiving end. In those cases, the losses caused by compressing and decompressing the speech data may degrade speech recognition. One way to overcome this problem is, of course, to transmit uncompressed speech, but this increases the load on the network.
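The effect of lossy coding on acoustic features can be illustrated with a small sketch. A uniform quantizer is used here purely as a stand-in for a speech coder, since real telephony coders are far more elaborate. The test signal, the frame length, the single log-energy feature, and all names are likewise assumptions.

```python
import math

def lossy_codec(samples, bits):
    """Stand-in for compress + decompress: uniform quantization of [-1, 1)."""
    levels = 2 ** bits
    half = levels // 2
    step = 2.0 / levels
    decoded = []
    for x in samples:
        index = max(-half, min(half - 1, round(x / step)))  # clamp to the code range
        decoded.append(index * step)
    return decoded

def log_energy_per_frame(samples, frame_len=80):
    """One toy acoustic feature per non-overlapping frame: log mean energy."""
    feats = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        feats.append(math.log(sum(x * x for x in frame) / frame_len + 1e-10))
    return feats

def feature_distortion(original, decoded):
    """Mean absolute change in the feature caused by the codec."""
    a = log_energy_per_frame(original)
    b = log_energy_per_frame(decoded)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Synthetic test signal (assumption): two tones sampled at 8000 samples/sec.
signal = [0.3 * math.sin(2 * math.pi * 300 * t / 8000)
          + 0.05 * math.sin(2 * math.pi * 1700 * t / 8000)
          for t in range(800)]

distortion_fine = feature_distortion(signal, lossy_codec(signal, bits=8))
distortion_coarse = feature_distortion(signal, lossy_codec(signal, bits=2))
print(distortion_fine, distortion_coarse)
```

In this toy setting the coarser coding perturbs the extracted feature more than the finer one, which is the kind of loss that a recognizer at the receiving end would have to tolerate.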
In addition to the problems described above, a further problem arises when a system combines speech recognition with normal speech transmission. In such cases, e.g., in a telephone system, the system would need separate compression algorithms, one for speech that is to be recognized, together with its associated acoustic features, and another for speech that is not to be recognized.