The present invention relates to speech recognition methods. In particular, the invention relates to speech recognition where speech data is transmitted and received over a lossy or corrupted communication link.
Speech recognition has traditionally been performed using systems in which the transmission of speech data within the system is free of errors. However, the emergence of the Internet and of digital wireless technology has given rise to situations where this is no longer the case. In applications where speech is sampled and partially processed on one device and then packetized and transmitted over a digital network for further analysis on another, packets of speech data may be delayed, lost or corrupted during transmission.
This is a serious problem for current speech recognition technologies, which require data to be present even if it has additive noise. Existing Internet protocols for error-free data transmission, such as TCP/IP, are not suitable for interactive ASR ("Automatic Speech Recognition") systems, as the retry mechanisms introduce variable and unpredictably long delays into the system under poor network conditions. In another approach, real-time delivery of data packets is attempted, ignoring missing data in order to avoid introducing delays in transmission. This is catastrophic for current recognition algorithms, as stated above.
It would be desirable to have a class of recognition algorithms and transmission protocols, intermediate between the conventional protocols, which are able to operate robustly with minimal delays or with incomplete speech data under poor network conditions. Ideally, the protocol would have a mechanism by which loss and delay may be traded off, either in a fixed manner or dynamically, in order to optimize speech recognition over lossy digital networks, for example in a client-server environment.
A system and method according to the present invention provide speech recognition on speech vectors received in a plurality of packets over a lossy network or communications link. Typically, the recognition occurs at a server on speech vectors received from a client computer over the network or link. The system and method are able to operate robustly despite packet loss or corruption during transmission. In addition, the system and method may dynamically adjust the manner in which packets are transmitted over the lossy communications link to compensate for varying or degraded network conditions.
The method includes constructing for a speech recognizer multidimensional speech vectors which have features derived from a plurality of packets received over a lossy communications link. Some of the packets associated with each speech vector are missing or corrupted, resulting in potentially corrupted features within the speech vector. These potentially corrupted features are indicated to the speech recognizer when present. Speech recognition is then attempted by the speech recognizer on the speech vectors. If speech recognition is unsuccessful, a request for retransmission of a missing or corrupted packet is made over the lossy communications link when potentially corrupted features are present in the speech vectors.
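By way of illustration only, the steps of the method might be sketched in Python as follows. The packet layout (a frame index, a part index and a corruption flag), the names Packet, build_vector, missing_parts and recognize_or_request, and the convention that the recognizer returns None when recognition is unsuccessful are all assumptions made for this sketch and are not taken from the invention as described.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Packet:
    frame: int                # index of the speech vector this packet contributes to
    part: int                 # which slice of that vector's features it carries
    features: List[float]     # assumed to hold exactly feats_per_part values
    corrupted: bool = False   # assumed to be set by some integrity check


def build_vector(frame: int, packets: Sequence[Packet],
                 parts_per_frame: int, feats_per_part: int
                 ) -> Tuple[List[float], List[bool]]:
    """Assemble one speech vector and a per-feature 'uncertain' flag list."""
    size = parts_per_frame * feats_per_part
    values = [0.0] * size
    uncertain = [True] * size          # every feature is uncertain until it arrives intact
    for p in packets:
        if p.frame != frame or p.corrupted:
            continue                   # missing or corrupted parts stay flagged
        start = p.part * feats_per_part
        values[start:start + feats_per_part] = p.features
        uncertain[start:start + feats_per_part] = [False] * feats_per_part
    return values, uncertain


def missing_parts(uncertain: Sequence[bool], feats_per_part: int) -> List[int]:
    """Return the part indices whose features are still flagged as uncertain."""
    n_parts = len(uncertain) // feats_per_part
    return [i for i in range(n_parts) if uncertain[i * feats_per_part]]


def recognize_or_request(values: List[float], uncertain: List[bool],
                         recognizer: Callable[[List[float], List[bool]], Optional[str]],
                         request_retransmission: Callable[[List[int]], None],
                         feats_per_part: int) -> Optional[str]:
    """Attempt recognition; on failure, request the parts behind uncertain features."""
    result = recognizer(values, uncertain)   # hypothetical recognizer, None on failure
    if result is None and any(uncertain):
        request_retransmission(missing_parts(uncertain, feats_per_part))
    return result
```

In this sketch, retransmission is requested only when recognition fails and uncertain features are present, so a vector that is recognized despite missing packets incurs no additional delay.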
The system for recognizing a stream of speech received as a plurality of speech vectors over a lossy communications link comprises a buffering and decoding unit coupled to the lossy communications link. The buffering and decoding unit receives a plurality of packets, identifies missing or corrupted packets, and constructs a series of speech vectors from the received packets. Each speech vector has a plurality of certain features and uncertain features. A speech recognizer is coupled to the buffering and decoding unit and classifies each speech vector as one of a plurality of stored recognition models based on only the certain features within the speech vector.
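One possible way to classify on only the certain features is sketched below in Python. The sketch assumes, purely for illustration, that each stored recognition model is a Gaussian with diagonal covariance, so that uncertain features can be left out of the likelihood simply by skipping their terms. The model form and the names log_likelihood and classify are hypothetical and are not specified by the system described above.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Each hypothetical model is a (mean, variance) pair with one entry per feature.
Model = Tuple[Sequence[float], Sequence[float]]


def log_likelihood(x: Sequence[float], uncertain: Sequence[bool],
                   mean: Sequence[float], var: Sequence[float]) -> float:
    """Diagonal-Gaussian log-likelihood computed over the certain features only."""
    total = 0.0
    for xi, u, m, v in zip(x, uncertain, mean, var):
        if u:
            continue  # an uncertain feature contributes nothing to the score
        total += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
    return total


def classify(x: Sequence[float], uncertain: Sequence[bool],
             models: Dict[str, Model]) -> str:
    """Return the name of the model that best explains the certain features."""
    return max(models, key=lambda name: log_likelihood(x, uncertain, *models[name]))


models = {"a": ([0.0, 0.0], [1.0, 1.0]), "b": ([3.0, 3.0], [1.0, 1.0])}
x = [0.1, 99.0]  # the second value stands in for a corrupted feature
with_mask = classify(x, [False, True], models)
without_mask = classify(x, [False, False], models)
```

In this example the corrupted second feature pulls the unmasked decision toward model "b", whereas masking it leaves the decision to the intact first feature, which favors model "a".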
The system and method may include a capability to request retransmission of lost or corrupted packets, or bandwidth renegotiation, from a source of the packets over the lossy communications link. The renegotiation may include, for example, a request to include error correction or detection bits in the packets, a request to compress the packets prior to transmission, or a request to discard less salient components of the signal to reduce bandwidth requirements, for example by performing principal components analysis on speech data prior to packetization.
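As a rough illustration of discarding less salient components, the following Python sketch projects each speech vector onto a single principal component before packetization, so that one coefficient is transmitted in place of the full vector. The use of power iteration, the choice of a single retained component, and the names fit_top_component, encode and decode are assumptions made for this sketch; a practical system would likely retain several components and use a numerical library.

```python
import math
from typing import List, Sequence, Tuple


def fit_top_component(rows: Sequence[Sequence[float]],
                      iters: int = 200) -> Tuple[List[float], List[float]]:
    """Estimate the mean and the leading principal direction by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    # Arbitrary start vector; power iteration can miss the leading direction
    # in the rare case where the start is exactly orthogonal to it.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance in the data, nothing to estimate
        v = [x / norm for x in w]
    return mean, v


def encode(row: Sequence[float], mean: Sequence[float], v: Sequence[float]) -> float:
    """Client side: reduce a vector to one coefficient along the component."""
    return sum((x - m) * c for x, m, c in zip(row, mean, v))


def decode(coef: float, mean: Sequence[float], v: Sequence[float]) -> List[float]:
    """Server side: reconstruct an approximation of the original vector."""
    return [m + coef * c for m, c in zip(mean, v)]


rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
mean, v = fit_top_component(rows)
coef = encode(rows[3], mean, v)
approx = decode(coef, mean, v)
```

In this sketch both ends of the link must share the same mean and component, which could be agreed as part of the renegotiation; how that agreement is reached is not addressed here.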